How Zomato Boosted Performance 25% and Cut Compute Cost 30% Migrating Trino and Druid Workloads to AWS Graviton

Zomato is an India-based restaurant aggregator, food delivery, and dining-out company with over 350,000 listed restaurants across more than 1,000 cities in India. The company relies heavily on data analytics to enrich the customer experience and improve business efficiency. Zomato’s engineering and product teams use data insights to refine the platform’s restaurant and cuisine recommendations, improve the accuracy of restaurant waiting times, speed up the matching of delivery partners, and improve the overall food delivery process.

At Zomato, different teams have different data discovery requirements based on their business functions. For example, a city lead team needs the number of orders placed in a specific area, the customer support team needs the number of queries resolved per minute, and marketing and other teams need the most searched dishes on special events or days. Zomato’s Data Platform team is responsible for building and maintaining a reliable platform that serves these data insights to all business units.

Zomato’s Data Platform is powered by AWS services including Amazon EMR, Amazon Aurora MySQL-Compatible Edition, and Amazon DynamoDB, along with the open source software Trino (formerly PrestoSQL) and Apache Druid, which serve the previously mentioned business metrics to different teams. Every week, Trino clusters process over 250,000 queries while scanning 2 PB of data, and Apache Druid ingests over 20 billion events and serves 8 million queries. To deliver performance at Zomato’s scale, these massively parallel systems rely on horizontal scaling of nodes running on Amazon Elastic Compute Cloud (Amazon EC2) instances in their clusters on AWS. The performance of both of these data platform components is critical to supporting all business functions reliably and efficiently at Zomato. To improve performance in a cost-effective manner, Zomato migrated these Trino and Druid workloads onto AWS Graviton-based Amazon EC2 instances.

Graviton-based EC2 instances are powered by Arm-based AWS Graviton processors and deliver up to 40% better price performance than comparable x86-based Amazon EC2 instances. CPU- and memory-intensive Java-based applications such as Trino and Druid are suitable candidates for AWS Graviton-based instances to optimize price-performance, as Java is well supported and generally performant out of the box on arm64.

In this blog, we walk you through an overview of Trino and Druid, how they fit into the overall Data Platform architecture, and the migration journey onto AWS Graviton-based instances for these workloads. We also cover the challenges faced during migration, how the Zomato team overcame them, the business gains in terms of cost savings and better performance, and Zomato’s plans for Graviton adoption across more workloads.

Trino overview

Trino is a fast, distributed SQL query engine for querying petabyte scale data, implementing massively parallel processing (MPP) architecture. It was designed as an alternative to tools that query Apache Hadoop Distributed File System (HDFS) using pipelines of MapReduce jobs, such as Apache Hive or Apache Pig, but Trino is not limited to querying HDFS only. It has been extended to operate over a multitude of data sources, including Amazon Simple Storage Service (Amazon S3), traditional relational databases and distributed data stores including Apache Cassandra, Apache Druid, MongoDB and more. When Trino executes a query, it does so by breaking up the execution into a hierarchy of stages, which are implemented as a series of tasks distributed over a network of Trino workers. This reduces end-to-end latency and makes Trino a fast tool for ad hoc data exploration over very large data sets.

Figure 1 – Trino architecture overview

The Trino coordinator is responsible for parsing statements, planning queries, and managing Trino worker nodes. Every Trino installation must have a coordinator alongside one or more Trino workers. Client applications, including Apache Superset and Redash, connect to the coordinator via Presto Gateway to submit statements for execution. The coordinator creates a logical model of a query involving a series of stages, which is then translated into a series of connected tasks running on a cluster of Trino workers. Presto Gateway acts as a proxy and load balancer for multiple Trino clusters.

Druid overview

Apache Druid is a real-time database that powers modern analytics applications for use cases where real-time ingestion, fast query performance, and high uptime are important. Druid processes are deployed on three types of server nodes: Master nodes govern data availability and ingestion; Query nodes accept queries, execute them across the system, and return the results; and Data nodes ingest and store queryable data. Broker processes receive queries from external clients and forward them to Data servers. Historicals are the workhorses that handle storage and querying of “historical” data. MiddleManager processes handle ingestion of new data into the cluster. Refer to the Druid documentation to learn more about the detailed Druid architecture design.

Figure 2 – Druid architecture overview

Zomato’s Data Platform Architecture on AWS

Figure 3 – Zomato’s Data Platform landscape on AWS

Zomato’s Data Platform covers data ingestion, storage, distributed processing (enrichment and enhancement), unification of batch and real-time data pipelines, and a robust consumption layer through which petabytes of data are queried daily for ad-hoc and near real-time analytics. In this section, we explain the data flow of the pipelines serving data to the Trino and Druid clusters in the overall Data Platform architecture.

Data Pipeline-1: An Amazon Aurora MySQL-Compatible database is used to store data by various microservices at Zomato. Apache Sqoop on Amazon EMR runs Extract, Transform, Load (ETL) jobs at scheduled intervals to fetch data from Aurora MySQL-Compatible and transfer it to Amazon S3 in the Optimized Row Columnar (ORC) format, which is then queried by Trino clusters.

Data Pipeline-2: A Debezium Kafka connector deployed on Amazon Elastic Container Service (Amazon ECS) acts as a producer and continuously polls data from the Aurora MySQL-Compatible database. On detecting changes in the data, it identifies the change type and publishes the change data event to Apache Kafka in Avro format. Apache Flink on Amazon EMR consumes data from the Kafka topic, performs data enrichment and transformation, and writes it in ORC format to Iceberg tables on Amazon S3. Trino clusters then query the data from Amazon S3.

Data Pipeline-3: Moving away from other databases, Zomato decided to go serverless with Amazon DynamoDB for business-critical apps including Food Cart, Product Catalog, and Customer Preferences because of its high performance (single-digit millisecond latency), request rate (millions of requests per second), extreme scale matching Zomato’s peak expectations, economics (pay as you go), and data volume (TB, PB, EB). DynamoDB streams publish data from these apps to Amazon S3 in JSON format to serve this data pipeline. Apache Spark on Amazon EMR reads the JSON data, performs transformations including conversion into ORC format, and writes the data back to Amazon S3, which is then queried by Trino clusters.

Data Pipeline-4: Zomato’s core business applications serving end users include microservices, web applications, and mobile applications. Getting near real-time insights from these core applications is critical to serving customers and continuously winning their trust. Services use a custom SDK developed by the data platform team to publish events to an Apache Kafka topic. Two downstream data pipelines then consume these application events from Kafka via Apache Flink on Amazon EMR. Flink converts the data into ORC format and publishes it to Amazon S3, and in a parallel data pipeline, Flink also publishes enriched data onto another Kafka topic, which in turn serves data to an Apache Druid cluster deployed on Amazon EC2 instances.

Performance requirements for querying at scale

All of the described data pipelines ingest data into an Amazon S3 based data lake, which is then leveraged by three types of Trino clusters: ad-hoc clusters for ad-hoc query use cases, with a maximum query runtime of 20 minutes; ETL clusters for creating materialized views to enhance the performance of dashboard queries; and reporting clusters to run queries for dashboards with various Key Performance Indicators (KPIs), with query runtimes of up to 3 minutes. ETL queries are run via Apache Airflow with a built-in query retry mechanism and a runtime of up to 3 hours.

Druid is used to serve two types of queries: computing aggregated metrics based on recent events and comparing aggregated metrics to historical data, for example, how a specific metric in the current hour compares to the same hour last week. Depending on the use case, the service level objective for Druid query response time ranges from a few milliseconds to a few seconds.

Graviton migration of Druid cluster

Zomato first moved Druid nodes to AWS Graviton-based instances in their test cluster environment to evaluate query performance. Nodes running Brokers and MiddleManagers were moved from R5 to R6g instances, and nodes running Historicals were migrated from I3 to R6gd instances. Zomato logged real-world queries from their production cluster and replayed them in the test cluster to validate the performance. Post validation, Zomato saw significant performance gains and reduced cost:

Performance gains

For queries in Druid, performance was measured using a typical business-hours (12:00 to 22:00) load of 14,000 queries, as shown in Figure 4, where p99 query runtime was reduced by 25%.

Figure 4 – Overall Druid query performance (Intel x86-64 vs. AWS Graviton)

The query performance improvement on the Historical nodes of the Druid cluster is shown in Figure 5, where p95 query runtime was reduced by 66%.

Figure 5 – Query performance on Druid Historicals (Intel x86-64 vs. AWS Graviton)

Under peak load during business hours (12:00 to 22:00, as shown in Figure 6), with increasingly loaded CPUs, Graviton-based instances demonstrated close-to-linear performance, resulting in better query runtimes than equivalent Intel x86-based instances. This gave Zomato headroom to reduce the overall node count in the Druid cluster while serving the same peak query traffic.

Figure 6 – CPU utilization (Intel x86-64 vs. AWS Graviton)

Cost savings

A cost comparison of Intel x86-based vs. AWS Graviton-based instances running Druid in the test environment, along with the instance counts, instance types, and hourly On-Demand prices in the Singapore Region, is shown in Figure 7. Running the same number of Graviton-based instances yields cost savings of ~24%. Further, the Druid cluster auto scales in the production environment based on performance metrics, so the average cost savings with Graviton-based instances are even higher, at ~30%, due to the better performance.

Figure 7 – Cost savings analysis (Intel x86-64 vs. AWS Graviton)

Graviton migration of Trino clusters

Zomato also moved the Trino cluster in their test environment to AWS Graviton-based instances and monitored query performance for different short- and long-running queries. As shown in Figure 8, the mean wall (elapsed) time for most Trino queries is lower on AWS Graviton instances than on equivalent Intel x86-based instances (lower is better).

Figure 8 – Mean Wall Time for Trino queries (Intel x86-64 vs. AWS Graviton)

Also, p99 query runtime was reduced by ~33% after migrating the Trino cluster to AWS Graviton instances for a typical business day’s (7am – 7pm) mixed query load of ~15,000 queries.

Figure 9 – Query performance for a typical day (7am – 7pm) load

Zomato’s team further optimized overall Trino query performance by improving Advanced Encryption Standard (AES) performance on Graviton for TLS negotiation with Amazon S3. This was achieved by enabling the -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics flags as extra JVM options. As shown in Figure 10, the mean CPU time for most queries is lower after enabling the extra JVM flags.

Figure 10 – Query performance after enabling extra JVM options with Graviton instances
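For reference, here is a minimal sketch of how these flags might be applied, assuming a standard Trino launcher that reads JVM options from etc/jvm.config on each node; the surrounding option and heap size are illustrative placeholders, not Zomato’s actual configuration:

-server
-Xmx<HEAP-SIZE>
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics

Note that -XX:+UnlockDiagnosticVMOptions must appear before -XX:+UseAESCTRIntrinsics, because the intrinsics flag is a diagnostic JVM option.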

Migration challenges and approach

The Zomato team was running Trino version 359, and a multi-arch, ARM64-compatible Docker image was not available for that version. Because the team wanted to migrate their Trino cluster to Graviton-based instances with minimal engineering effort and time, they backported the multi-arch, UBI8-based Trino Docker image to their Trino version 359. This approach allowed faster adoption of Graviton-based instances and eliminated the heavy lift of upgrading, testing, and benchmarking the workload on a newer Trino version.

Next Steps

Zomato has already migrated AWS managed services, including Amazon EMR and the Amazon Aurora MySQL-Compatible database, to AWS Graviton-based instances. With the successful migration of two main open source components of their data platform (Trino and Druid) to AWS Graviton, with visible and immediate price-performance gains, the Zomato team plans to replicate that success with other open source applications running on Amazon EC2, including Apache Kafka and Apache Pinot.

Conclusion

This post demonstrated the price-performance benefits of adopting AWS Graviton-based instances for high throughput, near real-time big data analytics workloads running on the Java-based, open source Apache Druid and Trino applications. Overall, Zomato reduced the cost of its Amazon EC2 usage by 30% while improving performance for both time-critical and ad-hoc querying by as much as 25%. Due to the better performance, Zomato was also able to right-size the compute footprint for these workloads on a smaller number of Amazon EC2 instances, reducing the peak capacity of the Apache Druid and Trino clusters by 25% and 20%, respectively.

Zomato migrated these open source software applications quickly by implementing the customizations needed for optimal performance and compatibility with Graviton-based instances. Zomato’s mission is “better food for more people,” and Graviton adoption is helping with this mission by providing a more sustainable, performant, and cost-effective compute platform on AWS. This is certainly “food for thought” for customers looking to improve price-performance and sustainability for their business-critical workloads running on open source software (OSS).


Create a CI/CD pipeline for .NET Lambda functions with AWS CDK Pipelines

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to define cloud infrastructure in familiar programming languages and provision it through AWS CloudFormation.

In this blog post, we will explore the process of creating a Continuous Integration/Continuous Deployment (CI/CD) pipeline for a .NET AWS Lambda function using CDK Pipelines. We will cover all the necessary steps to automate the deployment of the .NET Lambda function, including setting up the development environment, creating the pipeline with AWS CDK, configuring the pipeline stages, and publishing the test reports. Additionally, we will show how to promote the deployment from a lower environment to a higher environment with manual approval.

Background

AWS CDK makes it easy to deploy a stack that provisions your infrastructure to AWS from your workstation by simply running cdk deploy. This is useful when you are doing initial development and testing. However, in most real-world scenarios, there are multiple environments, such as development, testing, staging, and production. It may not be the best approach to deploy your CDK application in all these environments using cdk deploy. Deployment to these environments should happen through more reliable, automated pipelines. CDK Pipelines makes it easy to set up a continuous deployment pipeline for your CDK applications, powered by AWS CodePipeline.

The AWS CDK Developer Guide’s Continuous integration and delivery (CI/CD) using CDK Pipelines page shows you how you can use CDK Pipelines to deploy a Node.js based Lambda function. However, .NET based Lambda functions are different from Node.js or Python based Lambda functions in that .NET code first needs to be compiled to create a deployment package. As a result, we decided to write this blog as a step-by-step guide to assist our .NET customers with deploying their Lambda functions utilizing CDK Pipelines.

In this post, we dive deeper into creating a real-world pipeline that runs build and unit tests, and deploys a .NET Lambda function to one or multiple environments.

Architecture

CDK Pipelines is a construct library that allows you to provision a CodePipeline pipeline. The pipeline created by CDK Pipelines is self-mutating. This means you need to run cdk deploy one time to get the pipeline started. After that, the pipeline automatically updates itself if you add new application stages or stacks in the source code.

The following diagram captures the architecture of the CI/CD pipeline created with CDK Pipelines. Let’s explore this architecture at a high level before diving deeper into the details.

Figure 1: Reference architecture diagram

The solution creates a CodePipeline with an AWS CodeCommit repository as the source (CodePipeline Source Stage). When code is checked into CodeCommit, the pipeline is automatically triggered and retrieves the code from the CodeCommit repository branch to proceed to the Build stage.

Build stage compiles the CDK application code and generates the cloud assembly.

Update Pipeline stage updates the pipeline (if necessary).

Publish Assets stage uploads the CDK assets to Amazon S3.

After Publish Assets is complete, the pipeline deploys the Lambda function to both the development and production environments. For added control, the architecture includes a manual approval step for releases that target the production environment.

Prerequisites

For this tutorial, you should have:

An AWS account

Visual Studio 2022
AWS Toolkit for Visual Studio
Node.js 18.x or later
AWS CDK v2 (2.67.0 or later required)
Git

Bootstrapping

Before you use AWS CDK to deploy CDK Pipelines, you must bootstrap the AWS environments where you want to deploy the Lambda function. An environment is the target AWS account and Region into which the stack is intended to be deployed.

In this post, you deploy the Lambda function into a development environment and, optionally, a production environment. This requires bootstrapping both environments. However, deployment to a production environment is optional; you can skip bootstrapping that environment for the time being, as we will cover that later.

This is a one-time activity for each environment to which you want to deploy CDK applications. To bootstrap the development environment, run the command below, substituting the AWS account ID for your dev account, the Region you will use for your dev environment, and the locally configured AWS CLI profile you wish to use for that account. See the documentation for additional details.

cdk bootstrap aws://<DEV-ACCOUNT-ID>/<DEV-REGION>
--profile <DEV-PROFILE>
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess

--profile specifies the AWS CLI credential profile that will be used to bootstrap the environment. If not specified, the default profile will be used. The profile should have sufficient permissions to provision the resources for the AWS CDK during the bootstrap process.

--cloudformation-execution-policies specifies the ARNs of managed policies that should be attached to the deployment role assumed by AWS CloudFormation during deployment of your stacks.

Note: By default, stacks are deployed with full administrator permissions using the AdministratorAccess policy, but for real-world usage you should define a more restrictive IAM policy and use that. Refer to Customizing bootstrapping in the AWS CDK documentation and Secure CDK deployments with IAM permission boundaries to see how to do that.
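As an illustration only, bootstrapping with a customer managed policy instead of AdministratorAccess might look like the following; the policy name is a placeholder you would replace with your own restricted policy:

cdk bootstrap aws://<DEV-ACCOUNT-ID>/<DEV-REGION>
--profile <DEV-PROFILE>
--cloudformation-execution-policies arn:aws:iam::<DEV-ACCOUNT-ID>:policy/<YOUR-CDK-EXECUTION-POLICY>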

Create a Git repository in AWS CodeCommit

For this post, you will use CodeCommit to store your source code. First, create a git repository named dotnet-lambda-cdk-pipeline in CodeCommit by following these steps in the CodeCommit documentation.

After you have created the repository, generate git credentials to access the repository from your local machine if you don’t already have them. Follow the steps below to generate git credentials.

Sign in to the AWS Management Console and open the IAM console.
Create an IAM user (for example, git-user).
Once the user is created, attach the AWSCodeCommitPowerUser policy to the user.
Next, open the user details page, choose the Security Credentials tab, and in HTTPS Git credentials for AWS CodeCommit, choose Generate.

Choose Download credentials to save this information as a .CSV file.

Clone the recently created repository to your workstation, then cd into dotnet-lambda-cdk-pipeline directory.

git clone <CODECOMMIT-CLONE-URL>
cd dotnet-lambda-cdk-pipeline

Alternatively, you can use git-remote-codecommit to clone the repository with the git clone codecommit::<REGION>://<PROFILE>@<REPOSITORY-NAME> command, replacing the placeholders with the appropriate values. Using git-remote-codecommit does not require you to create additional IAM users to manage git credentials. To learn more, refer to the AWS CodeCommit with git-remote-codecommit documentation page.

Initialize the CDK project

From the command prompt, inside the dotnet-lambda-cdk-pipeline directory, initialize a AWS CDK project by running the following command.

cdk init app --language csharp

Open the generated C# solution in Visual Studio, right-click the DotnetLambdaCdkPipeline project and select Properties. Set the Target framework to .NET 6.

Create a CDK stack to provision the CodePipeline

Your CDK Pipelines application includes at least two stacks: one that represents the pipeline itself, and one or more stacks that represent the application(s) deployed via the pipeline. In this step, you create the first stack that deploys a CodePipeline pipeline in your AWS account.

From Visual Studio, open the solution by opening the .sln solution file (in the src/ folder). Once the solution has loaded, open the DotnetLambdaCdkPipelineStack.cs file, and replace its contents with the following code. Note that the filename, namespace and class name all assume you named your Git repository as shown earlier.

Note: be sure to replace “<CODECOMMIT-REPOSITORY-NAME>” in the code below with the name of your CodeCommit repository (in this blog post, we have used dotnet-lambda-cdk-pipeline).

using Amazon.CDK;
using Amazon.CDK.AWS.CodeBuild;
using Amazon.CDK.AWS.CodeCommit;
using Amazon.CDK.AWS.IAM;
using Amazon.CDK.Pipelines;
using Constructs;
using System.Collections.Generic;

namespace DotnetLambdaCdkPipeline
{
public class DotnetLambdaCdkPipelineStack : Stack
{
internal DotnetLambdaCdkPipelineStack(Construct scope, string id, IStackProps props = null) : base(scope, id, props)
{

var repository = Repository.FromRepositoryName(this, "repository", "<CODECOMMIT-REPOSITORY-NAME>");

// This construct creates a pipeline with 3 stages: Source, Build, and UpdatePipeline
var pipeline = new CodePipeline(this, "pipeline", new CodePipelineProps
{
PipelineName = "LambdaPipeline",
SelfMutation = true,

// Synth represents a build step that produces the CDK Cloud Assembly.
// The primary output of this step needs to be the cdk.out directory generated by the cdk synth command.
Synth = new CodeBuildStep("Synth", new CodeBuildStepProps
{
// The files downloaded from the repository will be placed in the working directory when the script is executed
Input = CodePipelineSource.CodeCommit(repository, "master"),

// Commands to run to generate CDK Cloud Assembly
Commands = new string[] { "npm install -g aws-cdk", "cdk synth" },

// Build environment configuration
BuildEnvironment = new BuildEnvironment
{
BuildImage = LinuxBuildImage.AMAZON_LINUX_2_4,
ComputeType = ComputeType.MEDIUM,

// Specify true to get a privileged container inside the build environment image
Privileged = true
}
})
});
}
}
}

In the preceding code, you use CodeBuildStep instead of ShellStep, since ShellStep doesn’t provide a property to specify BuildEnvironment. We need to specify the build environment in order to set privileged mode, which allows access to the Docker daemon so that container images can be built in the build environment. This is necessary to use the CDK’s bundling feature, which is explained later in this blog post.

Open the file src/DotnetLambdaCdkPipeline/Program.cs, and edit its contents to reflect the below. Be sure to replace the placeholders with your AWS account ID and region for your dev environment.

using Amazon.CDK;

namespace DotnetLambdaCdkPipeline
{
sealed class Program
{
public static void Main(string[] args)
{
var app = new App();
new DotnetLambdaCdkPipelineStack(app, "DotnetLambdaCdkPipelineStack", new StackProps
{
Env = new Amazon.CDK.Environment
{
Account = "<DEV-ACCOUNT-ID>",
Region = "<DEV-REGION>"
}
});
app.Synth();
}
}
}

Note: Instead of committing the account ID and region to source control, you can set environment variables on the CodeBuild agent and use them; see Environments in the AWS CDK documentation for more information. Because the CodeBuild agent is also configured in your CDK code, you can use the BuildEnvironmentVariableType property to store environment variables in AWS Systems Manager Parameter Store or AWS Secrets Manager.
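As an illustration only, here is a hedged sketch of how an environment variable sourced from Systems Manager Parameter Store could be wired into the Synth step’s build environment; the variable name and parameter path are hypothetical, not part of the sample project:

BuildEnvironment = new BuildEnvironment
{
    BuildImage = LinuxBuildImage.AMAZON_LINUX_2_4,
    ComputeType = ComputeType.MEDIUM,
    Privileged = true,
    EnvironmentVariables = new Dictionary<string, IBuildEnvironmentVariable>
    {
        // Hypothetical parameter path; the value is resolved from Parameter Store at build time
        ["DEV_ACCOUNT_ID"] = new BuildEnvironmentVariable
        {
            Type = BuildEnvironmentVariableType.PARAMETER_STORE,
            Value = "/dotnet-lambda-cdk-pipeline/dev-account-id"
        }
    }
};

Your CDK code can then read the variable during cdk synth (for example, via System.Environment.GetEnvironmentVariable) instead of hard-coding the account ID.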

After you make the code changes, build the solution to ensure there are no build issues. Next, commit and push all the changes you just made. Run the following commands (or alternatively use Visual Studio’s built-in Git functionality to commit and push your changes):

git add --all .
git commit -m 'Initial commit'
git push

Then navigate to the root directory of the repository where your cdk.json file is present, and run the cdk deploy command to deploy the initial version of the CodePipeline. Note that the deployment can take several minutes.

The pipeline created by CDK Pipelines is self-mutating. This means you only need to run cdk deploy one time to get the pipeline started. After that, the pipeline automatically updates itself if you add new CDK applications or stages in the source code.

After the deployment has finished, a CodePipeline is created and automatically runs. The pipeline includes three stages as shown below.

Source – It fetches the source of your AWS CDK app from your CodeCommit repository and triggers the pipeline every time you push new commits to it.

Build – This stage compiles your code (if necessary) and performs a cdk synth. The output of that step is a cloud assembly.

UpdatePipeline – This stage runs the cdk deploy command on the cloud assembly generated in the previous stage. It modifies the pipeline if necessary. For example, if you update your code to add a new deployment stage for your application, the pipeline is automatically updated to reflect the changes you made.

Figure 2: Initial CDK pipeline stages

Define a CodePipeline stage to deploy .NET Lambda function

In this step, you create a stack containing a simple Lambda function and place that stack in a stage. Then you add the stage to the pipeline so it can be deployed.

To create a Lambda project, do the following:

In Visual Studio, right-click on the solution, choose Add, then choose New Project.
In the New Project dialog box, choose the AWS Lambda Project (.NET Core – C#) template, and then choose OK or Next.
For Project Name, enter SampleLambda, and then choose Create.
From the Select Blueprint dialog, choose Empty Function, then choose Finish.

Next, create a new file in the CDK project at src/DotnetLambdaCdkPipeline/SampleLambdaStack.cs to define your application stack containing a Lambda function. Update the file with the following contents (adjust the namespace as necessary):

using Amazon.CDK;
using Amazon.CDK.AWS.Lambda;
using Constructs;
using AssetOptions = Amazon.CDK.AWS.S3.Assets.AssetOptions;

namespace DotnetLambdaCdkPipeline
{
class SampleLambdaStack: Stack
{
public SampleLambdaStack(Construct scope, string id, StackProps props = null) : base(scope, id, props)
{
// Commands executed in a AWS CDK pipeline to build, package, and extract a .NET function.
var buildCommands = new[]
{
"cd /asset-input",
"export DOTNET_CLI_HOME=\"/tmp/DOTNET_CLI_HOME\"",
"export PATH=\"$PATH:/tmp/DOTNET_CLI_HOME/.dotnet/tools\"",
"dotnet build",
"dotnet tool install -g Amazon.Lambda.Tools",
"dotnet lambda package -o output.zip",
"unzip -o -d /asset-output output.zip"
};

new Function(this, "LambdaFunction", new FunctionProps
{
Runtime = Runtime.DOTNET_6,
Handler = "SampleLambda::SampleLambda.Function::FunctionHandler",

// Asset path should point to the folder where .csproj file is present.
// Also, this path should be relative to cdk.json file.
Code = Code.FromAsset("./src/SampleLambda", new AssetOptions
{
Bundling = new BundlingOptions
{
Image = Runtime.DOTNET_6.BundlingImage,
Command = new[]
{
"bash", "-c", string.Join(" && ", buildCommands)
}
}
})
});
}
}
}

Building inside a Docker container

The preceding code uses the CDK’s bundling feature to build the Lambda function inside a Docker container. Bundling starts a new Docker container, copies the Lambda source code into the /asset-input directory of the container, and runs the specified commands, which write the package files under the /asset-output directory. The files in /asset-output are copied as assets to the stack’s cloud assembly directory. In a later stage, these files are zipped and uploaded to S3 as the CDK asset.

Building Lambda functions inside Docker containers is preferable to building them locally because it reduces the host machine’s dependencies, resulting in greater consistency and reliability in your build process.

Bundling requires the creation of a docker container on your build machine. For this purpose, the privileged: true setting on the build machine has already been configured.

Adding development stage

Create a new file in the CDK project at src/DotnetLambdaCdkPipeline/DotnetLambdaCdkPipelineStage.cs to hold your stage. This class will create the development stage for your pipeline.

using Amazon.CDK;
using Constructs;

namespace DotnetLambdaCdkPipeline
{
public class DotnetLambdaCdkPipelineStage : Stage
{
internal DotnetLambdaCdkPipelineStage(Construct scope, string id, IStageProps props = null) : base(scope, id, props)
{
Stack lambdaStack = new SampleLambdaStack(this, "LambdaStack");
}
}
}

Edit src/DotnetLambdaCdkPipeline/DotnetLambdaCdkPipelineStack.cs to add the stage to your pipeline. Add the devStage line shown at the end of the code below to your file.

using Amazon.CDK;
using Amazon.CDK.Pipelines;

namespace DotnetLambdaCdkPipeline
{
public class DotnetLambdaCdkPipelineStack : Stack
{
internal DotnetLambdaCdkPipelineStack(Construct scope, string id, IStackProps props = null) : base(scope, id, props)
{

var repository = Repository.FromRepositoryName(this, "repository", "dotnet-lambda-cdk-pipeline");

// This construct creates a pipeline with 3 stages: Source, Build, and UpdatePipeline
var pipeline = new CodePipeline(this, "pipeline", new CodePipelineProps
{
PipelineName = "LambdaPipeline",
.
.
.
});

var devStage = pipeline.AddStage(new DotnetLambdaCdkPipelineStage(this, "Development"));
}
}
}

Next, build the solution, then commit and push the changes to the CodeCommit repo. This will trigger the CodePipeline to start.

When the pipeline runs, the UpdatePipeline stage detects the changes and updates the pipeline based on the code it finds there. After the UpdatePipeline stage completes, the pipeline is updated with additional stages.

Let’s observe the changes:

An Assets stage has been added. This stage uploads all the assets you are using in your app to Amazon S3 (the S3 bucket created during bootstrapping) so that they can be used by other deployment stages later in the pipeline. For example, the CloudFormation template used by the development stage includes references to these assets, which is why the assets are first moved to S3 and then referenced in later stages.

A Development stage with two actions has been added. The first action is to create the change set, and the second is to execute it.

Figure 3: CDK pipeline with development stage to deploy .NET Lambda function

After the Deploy stage has completed, you can find the newly-deployed Lambda function by visiting the Lambda console, selecting “Functions” from the left menu, and filtering the functions list with “LambdaStack”. Note the runtime is .NET.

Running Unit Test cases in the CodePipeline

Next, you will add unit test cases to your Lambda function, and run them through the pipeline to generate a test report in CodeBuild.

To create a Unit Test project, do the following:

Right click on the solution, choose Add, then choose New Project.
In the New Project dialog box, choose the xUnit Test Project template, and then choose OK or Next.
For Project Name, enter SampleLambda.Tests, and then choose Create or Next.
Depending on your version of Visual Studio, you may be prompted to select the version of .NET to use. Choose .NET 6.0 (Long Term Support), then choose Create.
Right click on SampleLambda.Tests project, choose Add, then choose Project Reference. Select SampleLambda project, and then choose OK.

Next, edit the src/SampleLambda.Tests/UnitTest1.cs file to add a unit test. You can use the code below, which verifies that the Lambda function returns the input string as upper case.

using Xunit;

namespace SampleLambda.Tests
{
public class UnitTest1
{
[Fact]
public void TestSuccess()
{
var lambda = new SampleLambda.Function();

var result = lambda.FunctionHandler("test string", context: null);

Assert.Equal("TEST STRING", result);
}
}
}

You can add pre-deployment or post-deployment actions to the stage by calling its AddPre() or AddPost() method. To execute the above test cases, we will use a pre-deployment action.

To add a pre-deployment action, we will edit the src/DotnetLambdaCdkPipeline/DotnetLambdaCdkPipelineStack.cs file in the CDK project, after we add code to generate test reports.

To run the unit test(s) and publish the test report in CodeBuild, we will construct a BuildSpec for our CodeBuild project. We also provide IAM policy statements to be attached to the CodeBuild service role granting it permissions to run the tests and create reports. Update the file by adding the new code (starting with “// Add this code for test reports”) below the devStage declaration you added earlier:

using Amazon.CDK;
using Amazon.CDK.AWS.CodeBuild;
using Amazon.CDK.AWS.IAM;
using Amazon.CDK.Pipelines;
using System.Collections.Generic;

namespace DotnetLambdaCdkPipeline
{
public class DotnetLambdaCdkPipelineStack : Stack
{
internal DotnetLambdaCdkPipelineStack(Construct scope, string id, IStackProps props = null) : base(scope, id, props)
{
// …
// …
// …
var devStage = pipeline.AddStage(new DotnetLambdaCdkPipelineStage(this, "Development"));

// Add this code for test reports
var reportGroup = new ReportGroup(this, "TestReports", new ReportGroupProps
{
ReportGroupName = "TestReports"
});

// Policy statements for CodeBuild Project Role
var policyProps = new PolicyStatementProps()
{
Actions = new string[] {
"codebuild:CreateReportGroup",
"codebuild:CreateReport",
"codebuild:UpdateReport",
"codebuild:BatchPutTestCases"
},
Effect = Effect.ALLOW,
Resources = new string[] { reportGroup.ReportGroupArn }
};

// PartialBuildSpec in AWS CDK for C# can be created using Dictionary
var reports = new Dictionary<string, object>()
{
{
"reports", new Dictionary<string, object>()
{
{
reportGroup.ReportGroupArn, new Dictionary<string,object>()
{
{ "file-format", "VisualStudioTrx" },
{ "files", "**/*" },
{ "base-directory", "./testresults" }
}
}
}
}
};
// End of new code block
}
}
}

Finally, add the CodeBuildStep as a pre-deployment action to the development stage with necessary CodeBuildStepProps to set up reports. Add this after the new code you added above.

devStage.AddPre(new Step[]
{
new CodeBuildStep("Unit Test", new CodeBuildStepProps
{
Commands= new string[]
{
"dotnet test -c Release ./src/SampleLambda.Tests/SampleLambda.Tests.csproj --logger trx --results-directory ./testresults",
},
PrimaryOutputDirectory = "./testresults",
PartialBuildSpec= BuildSpec.FromObject(reports),
RolePolicyStatements = new PolicyStatement[] { new PolicyStatement(policyProps) },
BuildEnvironment = new BuildEnvironment
{
BuildImage = LinuxBuildImage.AMAZON_LINUX_2_4,
ComputeType = ComputeType.MEDIUM
}
})
});

Build the solution, then commit and push the changes to the repository. Pushing the changes triggers the pipeline, runs the test cases, and publishes the report to the CodeBuild console. To view the report, after the pipeline has completed, navigate to TestReports in CodeBuild’s Report Groups as shown below.

Figure 4: Test report in CodeBuild report group

Deploying to production environment with manual approval

CDK Pipelines makes it very easy to deploy additional stages with different accounts. You have to bootstrap the accounts and Regions you want to deploy to, and they must have a trust relationship added to the pipeline account.

To bootstrap an additional production environment into which AWS CDK applications will be deployed by the pipeline, run the below command, substituting in the AWS account ID for your production account, the region you will use for your production environment, the AWS CLI profile to use with the prod account, and the AWS account ID where the pipeline is already deployed (the account you bootstrapped at the start of this blog).

cdk bootstrap aws://<PROD-ACCOUNT-ID>/<PROD-REGION>
--profile <PROD-PROFILE>
--cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess
--trust <PIPELINE-ACCOUNT-ID>

The --trust option indicates which other account should have permissions to deploy AWS CDK applications into this environment. For this option, specify the pipeline’s AWS account ID.

Use the code below to add a new stage for production deployment with manual approval. Add this code below the devStage.AddPre(…) code block you added in the previous section, and remember to replace the placeholders with the AWS account ID and Region for your prod environment.

var prodStage = pipeline.AddStage(new DotnetLambdaCdkPipelineStage(this, "Production", new StageProps
{
Env = new Environment
{
Account = "<PROD-ACCOUNT-ID>",
Region = "<PROD-REGION>"
}
}), new AddStageOpts
{
Pre = new[] { new ManualApprovalStep("PromoteToProd") }
});

To support deploying CDK applications to another account, the artifact buckets must be encrypted, so add a CrossAccountKeys property to the CodePipeline near the top of the pipeline stack file and set its value to true (see the CrossAccountKeys line in the code snippet below). This creates a KMS key for the artifact bucket, allowing cross-account deployments.

var pipeline = new CodePipeline(this, "pipeline", new CodePipelineProps
{
PipelineName = "LambdaPipeline",
SelfMutation = true,
CrossAccountKeys = true,
EnableKeyRotation = true, //Enable KMS key rotation for the generated KMS keys

// …
}

After you commit and push the changes to the repository, a new manual approval step called PromoteToProd is added to the Production stage of the pipeline. The pipeline pauses at this step and awaits manual approval as shown in the screenshot below.

Figure 5: Pipeline waiting for manual review

When you click the Review button, you are presented with the following dialog. From here, you can choose to approve or reject and add comments if needed.

Figure 6: Manual review approval dialog

Once you approve, the pipeline resumes, executes the remaining steps, and completes the deployment to the production environment.

Figure 7: Successful deployment to production environment

Clean up

To avoid incurring future charges, log into the AWS console of the different accounts you used, go to the AWS CloudFormation console of the Region(s) where you chose to deploy, and delete the stacks created for this activity. Alternatively, you can delete the CloudFormation stack(s) using the cdk destroy command. This will not delete the CDKToolkit stack that the bootstrap command created; if you want to delete that as well, you can do it from the AWS Console.
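For example, a hedged sketch of the destroy command, run from the repository root where cdk.json is present (stacks that the pipeline deployed into other accounts may still need to be deleted from those accounts’ CloudFormation consoles):

cdk destroy --all --profile <DEV-PROFILE>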

Conclusion

In this post, you learned how to use CDK Pipelines for automating the deployment process of .NET Lambda functions. An intuitive and flexible architecture makes it easy to set up a CI/CD pipeline that covers the entire application lifecycle, from build and test to deployment. With CDK Pipelines, you can streamline your development workflow, reduce errors, and ensure consistent and reliable deployments.
For more information on CDK Pipelines and all the ways it can be used, see the CDK Pipelines reference documentation.

About the authors:

Ankush Jain

Ankush Jain is a Cloud Consultant at AWS Professional Services based out of Pune, India. He currently focuses on helping customers migrate their .NET applications to AWS. He is passionate about cloud, with a keen interest in serverless technologies.

Sanjay Chaudhari

Sanjay Chaudhari is a Cloud Consultant with AWS Professional Services. He works with customers to migrate and modernize their Microsoft workloads to the AWS Cloud.

Strategies to optimize the costs of your builds on AWS CodeBuild

AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces ready-to-deploy software packages. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. You just specify the location of your source code and choose your build settings, and CodeBuild will run your build scripts for compiling, testing, and packaging your code.

CodeBuild uses simple pay-as-you-go pricing. There are no upfront costs or minimum fees. You pay only for the resources you use. You are charged for compute resources based on the duration it takes for your build to execute.

There are three main factors that contribute to build costs with CodeBuild:

Build duration
Compute types
Additional services

Understanding how to balance these factors is key to optimizing costs on AWS and this blog post will take a look at each.

Compute Types

CodeBuild offers a range of compute instance types with different amounts of memory and CPU. For example, the Linux GPU Large compute type has 255 GB of memory and 32 vCPUs and enables you to execute CI/CD workflows for deep learning (ML/AI) with AWS CodePipeline. Incremental changes in your code, data, and ML models can now be tested for accuracy before the changes are released through your pipeline.

The Linux 2XLarge instance type is another instance type with 145GB of memory and 72 vCPUs and is suitable for building large and complex applications that require high memory and CPU resources. It can help reduce build time, speed up delivery, and support multiple build environments.

The GPU and 2XLarge compute types are powerful but are also the most expensive compute types per minute. For most build tasks the small, medium or large instance compute types are more than adequate. Using the pricing listed in US East (Ohio) we can see the price variance between the small, medium and large Linux instance types in Figure 1 below.

Figure 1. AWS CodeBuild small, medium and large compute types vs cost per minute

Analyzing the CodeBuild compute costs leads us to a number of cost optimization considerations.

Right Sizing AWS CodeBuild Compute Types to Match Workloads

Right sizing is the process of matching instance types and sizes to your workload performance and capacity requirements at the lowest possible cost. It’s also the process of looking at deployed instances and identifying opportunities to eliminate or downsize without compromising capacity or other requirements, which results in lower costs.

Right sizing is a key mechanism for optimizing AWS costs, but it is often ignored by organizations when they first move to the AWS Cloud. They lift and shift their environments and expect to right size later. Speed and performance are often prioritized over cost, which results in oversized instances and a lot of wasted spend on unused resources.

CodeBuild monitors build resource utilization on your behalf and reports metrics through Amazon CloudWatch. These include metrics such as

CPU
Memory
Disk I/O

These metrics can be seen within the CodeBuild console, for an example see Figure 2 below:

Figure 2. Resource utilization metrics

Leveraging observability to measure build resource usage is key to understanding how to right size, and CodeBuild makes this easy with CloudWatch metrics readily available through the CodeBuild console.

Consider ARM / Graviton

If we compare the costs of arm1.small and general1.small over a ten-minute period, we can see that the Arm-based compute type is 32% less expensive.

Figure 3. Comparison of small arm and general compute types

But cost per minute while building is not the only benefit here; Arm processors are known for their energy efficiency and high performance. Compiling code directly on an Arm processor can potentially lead to faster execution times and improved overall system performance.

The ideal workload to migrate to ARM is one that doesn’t use architecture-specific dependencies or binaries, already runs on Linux and uses open-source components, for example: Migrating AWS Lambda functions to Arm-based AWS Graviton2 processors.

AWS Graviton processors are custom built by Amazon Web Services using 64-bit Arm Neoverse cores to deliver the best price performance for your cloud workloads. The AWS Graviton Fast Start program helps you quickly and easily move your workloads to AWS Graviton in as little as four hours for applications such as serverless, containerized, database, and caching.

Consider migrating Windows workloads to Linux

If we compare the cost of a general1.medium Windows vs Linux compute type we can see that the Linux Compute type is 43% less expensive over ten minutes:

Figure 4. Build times on Windows compared to Linux

Migrating to Linux is one strategy to not only reduce the costs of building and testing code in CodeBuild but also the cost of running in production.

The effort required to re-platform from Windows to Linux varies depending on how the application was implemented. The key is to identify and target workloads with the right characteristics, balancing strategic importance and implementation effort.

For example, older .NET Framework applications may be able to be migrated to later versions of .NET (previously named .NET Core) first before deploying to Linux. AWS has a Porting Assistant for .NET, an analysis tool that scans .NET Framework applications and generates a cross-platform compatibility assessment, helping you port your applications to Linux faster.

See our guide on how to escape unfriendly licensing practices by migrating Windows workloads to Linux.

Build duration

One of the dimensions of the CodeBuild pricing is the duration of each build. This is calculated in minutes, from the time you submit your build until your build is terminated, rounded up to the nearest minute. For example: if your build takes a total of 35 seconds using one arm1.small Linux instance on US East (Ohio), each build will cost the price of the full minute, which is $0.0034 in that case. Similarly, if your build takes a total of 5 minutes and 20 seconds, you’ll be charged for 6 minutes.

When you define your CodeBuild project, you can specify some of the phases of your builds within a buildspec file. The phases you can specify are install, pre_build, build, and post_build. See the documentation to learn more about what each of those phases represents. Besides that, you can define how and where to upload the reports and artifacts that each build generates. This means that in each of those phases, you should do only what is necessary for the task you want to achieve. Installing dependencies that you won’t need, running commands that aren’t related to your task, or performing tests that aren’t necessary will affect your build time and unnecessarily increase your costs. Packaging and uploading target artifacts with unnecessarily large files will have a similar effect.
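As a minimal illustration (not taken from any specific project), a trimmed-down buildspec that keeps each phase to only what the build needs might look like the following; the commands and paths are placeholders for your own build steps:

version: 0.2
phases:
  install:
    commands:
      - echo "Install only the tools this build actually needs"
  pre_build:
    commands:
      - dotnet restore
  build:
    commands:
      - dotnet build --configuration Release
  post_build:
    commands:
      - dotnet test --configuration Release
artifacts:
  files:
    - '**/*'
  base-directory: ./artifacts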

On top of the CodeBuild phases and steps that you directly control, each time you start a build, it takes additional time to queue the task, provision the environment, download the source code (if applicable), and finalize. See Figure 5 below for a breakdown of a succeeded build:

Figure 5. AWS CodeBuild Phase details

In the above example, for each build, it takes approximately 42 seconds on top of what is specified in the buildspec file. Considering this overhead, having several smaller builds instead of fewer larger builds can potentially increase your costs. With this in mind, you have to keep your builds as short as possible, by doing only what is necessary, so that you can minimize the costs. Furthermore, you have to find a good balance between the duration and the frequency of your builds, so that the overhead doesn’t take a large proportion of your build time. Let’s explore some approaches you can factor in to find this balance.

Build caching

A common way to save time and cost on your CodeBuild builds is with build caching. With build caching, you are able to store reusable pieces of your build environment, so that you can save time next time you start a new build. There are two types of caching:

Amazon S3 — Stores the cache in an Amazon S3 bucket that is available across multiple build hosts. If you have build artifacts that are more expensive to build than to download, this is a good option for you. For large build artifacts, this may not be the best option, because it can take longer to transfer over your network.

Local caching — Stores a cache locally on a build host that is available to that build host only. When you choose this option, the cache is immediately available on the build host, making it a good option for large build artifacts that would take long network transfer time. If you choose local caching, there are multiple cache modes you can choose including source cache mode, docker layer cache mode and custom cache mode.
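To illustrate the Amazon S3 option above, here is a hedged sketch of the cache section you might add to your buildspec to persist restored NuGet packages between builds; the path is an assumption about where your packages are restored:

cache:
  paths:
    - '/root/.nuget/packages/**/*'

Local caching, by contrast, is configured on the CodeBuild project itself (for example, by enabling the Docker layer cache mode) rather than in the buildspec.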

Docker specific optimizations

Another strategy to optimize your build time and reduce your costs is using custom Docker images. When you specify your CodeBuild project, you can either use one of the default Docker images provided by CodeBuild, or use your own build environment packaged as a Docker image. When you create your own build environment as a Docker image, you can pre-package it with all the tools, test assets, and required dependencies. This can potentially save a significant amount of time, because on the install phase you won’t need to download packages from the internet, and on the build phase, when applicable, you won’t need to download e.g., large static test datasets.

To achieve that, you must specify the image value in the environment configuration when creating or updating your CodeBuild project. See Docker in custom image sample for CodeBuild to learn more about how to configure that. Keep in mind that larger Docker images can negatively affect your build time, therefore you should aim to keep your custom build environment as lean as possible, with only the mandatory contents. Another aspect to consider is using Amazon Elastic Container Registry (ECR) to store your Docker images. Downloading the image from within the AWS network will, in most cases, be faster than downloading it from the public internet and can avoid bottlenecks from public repositories.
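If you define your build project with the AWS CDK, a hedged C# sketch of pointing the build environment at a custom image stored in ECR could look like this; the repository name and tag are placeholders, and the fragment would sit inside a stack alongside your project or pipeline definition:

using Amazon.CDK.AWS.CodeBuild;
using EcrRepository = Amazon.CDK.AWS.ECR.Repository;

// Reference an existing ECR repository that holds the pre-baked build image
var imageRepo = EcrRepository.FromRepositoryName(this, "BuildImageRepo", "my-custom-build-image");

// Use the custom image as the CodeBuild environment instead of a default CodeBuild image
var buildEnvironment = new BuildEnvironment
{
    BuildImage = LinuxBuildImage.FromEcrRepository(imageRepo, "latest")
};

The resulting BuildEnvironment can then be passed to a CodeBuild project or to a CDK Pipelines CodeBuildStep.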

Consider which tests to run on the feature branch

If you are using a feature-branch approach, consider carefully which build steps and tests you are going to run on your branches. Running unit tests is a good example of what you should run on the feature branches, but unless you have very specific requirements, you probably don’t need penetration or integration tests at this point. Usually the feature branch changes often, hence running all types of tests all the time is a potential waste. Prefer to have your complex, long-running tests at a later stage of your CI/CD pipeline, as you build confidence on the version that you are to release.

Build once, deploy everywhere

It’s widely considered a best practice to avoid environment-specific code builds, so consider a build once, deploy everywhere strategy. There are many benefits to separating environment configuration from the build, including reduced build costs, improved maintainability and scalability, and a lower risk of errors.

Build once, deploy everywhere can be seen in the AWS Deployment Pipeline Reference Architecture where the Beta, Gamma and Prod stages are created from a single artifact created in the Build Stage:

Figure 6. Application Pipeline reference architecture

Additional Services

CloudWatch Logs and Metrics

Amazon CloudWatch can be used to monitor your builds, report when something goes wrong, take automatic actions when appropriate or simply keep logs of your builds.

CloudWatch metrics show the behavior of your builds over time. For example, you can monitor:

How many builds were attempted in a build project or an AWS account over time.
How many builds were successful in a build project or an AWS account over time.
How many builds failed in a build project or an AWS account over time.
How much time CodeBuild spent running builds in a build project or an AWS account over time.
Build resource utilization for a build or an entire build project. Build resource utilization metrics include metrics such as CPU, memory, and storage utilization.

However, you may incur charges from Amazon CloudWatch Logs for build log streams. For more information, see Monitoring AWS CodeBuild in the CodeBuild User Guide and the CloudWatch pricing page.

Storage Costs

You can create a CodeBuild build project with a set of output artifacts and publish them to S3 buckets. Using S3 as a repository for your artifacts, you only pay for what you use. Check the S3 pricing page.

Encryption

Cloud security at AWS is the highest priority, and encryption is an important part of CodeBuild security. Some encryption, such as for data in transit, is provided by default and does not require you to do anything. Other encryption, such as for data at rest, can be configured when you create your project or build. CodeBuild uses AWS KMS to encrypt data at rest.

Build artifacts, such as a cache, logs, exported raw test report data files, and build results, are encrypted by default using AWS managed keys and are free of charge. Consider using these keys if you don’t need to create your own key.

If you do not want to use these KMS keys, you can create and configure a customer managed key. For more information, see the documentation on creating KMS Keys and AWS Key Management Service concepts in the AWS Key Management Service User Guide.

Check the KMS pricing page.

Data transfer costs

You may incur additional charges if your builds transfer data, for example:

Avoid routing traffic over the internet when connecting to AWS services from within AWS by using VPC endpoints

Traffic that crosses an Availability Zone boundary typically incurs a data transfer charge. Use resources from the local Availability Zone whenever possible.
Traffic that crosses a Regional boundary will typically incur a data transfer charge. Avoid cross-Region data transfer unless your business case requires it
Use the AWS Pricing Calculator to help estimate the data transfer costs for your solution.
Use a dashboard to better visualize data transfer charges – this workshop will show how.

Here’s an Overview of Data Transfer Costs for Common Architectures on AWS.

Conclusion

In this blog post, we discussed how compute types, build duration, and the use of additional services contribute to build costs with AWS CodeBuild.

We highlighted how right sizing compute types is an important practice for teams that want to reduce their build costs while still achieving optimal performance. The key to optimizing is to measure and observe the workload and select the most appropriate compute instance based on requirements.

Further compute type cost optimizations can be found by targeting AWS Graviton processors and Linux environments. AWS Graviton Processors in particular offer several advantages over traditional x86-based instances and are designed by AWS to deliver the best price performance for your cloud workloads.

For further reading, see the summary of CI/CD best practices from the Practicing Continuous Integration and Continuous Delivery on AWS whitepaper,  my CI/CD pipeline is my release captain and also the cost optimization pillar from the AWS Well-Architected Framework which focuses on avoiding unnecessary costs.

About the authors:

Leandro Cavalcante Damascena

Leandro Damascena is a Solutions Architect at AWS, focused on Application Modernization and Serverless. Prior to that, he spent 20 years in various roles across the software delivery lifecycle, from development to architecture and operations. Outside of work, he loves spending time with his family and enjoying water sports like kitesurfing and wakeboarding.

Rafael Ramos

Rafael is a Solutions Architect at AWS, where he helps ISVs on their journey to the cloud. He spent over 13 years working as a software developer, and is passionate about DevOps and serverless. Outside of work, he enjoys playing tabletop RPG, cooking and running marathons.

Matt Laver

Matt Laver is a Solutions Architect at AWS working with SMB customers in the UK. He is passionate about DevOps and loves helping customers find simple solutions to difficult problems.

Unit Testing AWS Lambda with Python and Mock AWS Services

When building serverless event-driven applications using AWS Lambda, it is best practice to validate individual components. Unit testing can quickly identify and isolate issues in AWS Lambda function code. The techniques outlined in this blog demonstrate unit testing for Python-based AWS Lambda functions and their interactions with AWS services.

The full code for this blog is available in the GitHub project as a demonstrative example.

Example use case

Let's consider unit testing a serverless application which provides an API endpoint to generate a document. When the API endpoint is called with a customer identifier and document type, the Lambda function retrieves the customer's name from DynamoDB, then retrieves the document text from DynamoDB for the given document type, and finally generates and writes the resulting document to S3.

Figure 1. Example application architecture

Amazon API Gateway provides an endpoint to request the generation of a document for a given customer.  A document type and customer identifier are provided in this API call.
The endpoint invokes an AWS Lambda function that generates a document using the customer identifier and the document type provided.
An Amazon DynamoDB table stores the contents of the documents and the user's name, which are retrieved by the Lambda function.
The resulting text document is stored to Amazon S3.

Our testing goal is to determine if an isolated “unit” of code works as intended. In this blog, we will be writing tests to provide confidence that the logic written in the above AWS Lambda function behaves as we expect. We will mock the service integrations to Amazon DynamoDB and S3 to isolate and focus our tests on the Lambda function code, and not on the behavior of the AWS Services.

Define the AWS Service resources in the Lambda function

Before writing our first unit test, let’s look at the Lambda function that contains the behavior we wish to test.  The full code for the Lambda function is available in the GitHub repository as src/sample_lambda/app.py.

As part of our Best practices for working with AWS Lambda functions, we recommend initializing AWS service resource connections outside of the handler function and in the global scope.  Additionally, we can retrieve any relevant environment variables in the global scope so that subsequent invocations of the Lambda function do not repeatedly need to retrieve them.  For organization, we can put the resource and variables in a dictionary:

from os import environ
from boto3 import resource

_LAMBDA_DYNAMODB_RESOURCE = { "resource" : resource('dynamodb'),
                              "table_name" : environ.get("DYNAMODB_TABLE_NAME", "NONE") }

However, globally scoped code and global variables are challenging to test in Python, as global statements are executed on import, and outside of the controlled test flow.  To facilitate testing, we define classes for supporting AWS resource connections that we can override (patch) during testing.  These classes will accept a dictionary containing the boto3 resource and relevant environment variables.

For example, we create a DynamoDB resource class with a parameter "lambda_dynamodb_resource" that accepts the dictionary containing the boto3 resource connected to DynamoDB and the table name:

class LambdaDynamoDBClass:
    def __init__(self, lambda_dynamodb_resource):
        self.resource = lambda_dynamodb_resource["resource"]
        self.table_name = lambda_dynamodb_resource["table_name"]
        self.table = self.resource.Table(self.table_name)
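
The example project defines a similar class for Amazon S3. As a minimal sketch (mirroring the DynamoDB class above, with attribute names chosen to match how the class is used later in this post; the authoritative version is in the GitHub repository):

class LambdaS3Class:
    def __init__(self, lambda_s3_resource):
        self.resource = lambda_s3_resource["resource"]
        self.bucket_name = lambda_s3_resource["bucket_name"]
        self.bucket = self.resource.Bucket(self.bucket_name)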

Build the Lambda Handler

The Lambda function handler is the method in the AWS Lambda function code that processes events. When the function is invoked, Lambda runs the handler method. When the handler exits or returns a response, it becomes available to process another event.

To facilitate unit testing of the handler function, move as much of the logic as possible to other functions that are then called by the Lambda handler entry point.  Also, pass the AWS resource global variables to these subsequent function calls.  This approach enables us to mock and intercept all resources and calls during testing.

In our example, the handler references the global variables and instantiates the resource classes to set up the connections to specific AWS resources.  (We will be able to override and mock these connections during unit testing.)

Then the handler calls the create_letter_in_s3 function to perform the steps of creating the document, passing the resource classes.  This downstream function avoids referencing the global context or any AWS resource connections directly.

def lambda_handler(event: APIGatewayProxyEvent, context: LambdaContext) -> Dict[str, Any]:

    global _LAMBDA_DYNAMODB_RESOURCE
    global _LAMBDA_S3_RESOURCE

    dynamodb_resource_class = LambdaDynamoDBClass(_LAMBDA_DYNAMODB_RESOURCE)
    s3_resource_class = LambdaS3Class(_LAMBDA_S3_RESOURCE)

    return create_letter_in_s3(
            dynamo_db = dynamodb_resource_class,
            s3 = s3_resource_class,
            doc_type = event["pathParameters"]["docType"],
            cust_id = event["pathParameters"]["customerId"])
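
For reference, a rough, illustrative sketch of what create_letter_in_s3 could look like is shown below. This is not the repository's actual implementation (see src/sample_lambda/app.py for that); the key formats and return shape are inferred from the tests later in this post:

def create_letter_in_s3(dynamo_db: LambdaDynamoDBClass,
                        s3: LambdaS3Class,
                        doc_type: str,
                        cust_id: str) -> Dict[str, Any]:
    # Look up the document template and the customer name in DynamoDB.
    doc_item = dynamo_db.table.get_item(Key={"PK": f"D#{doc_type}"})
    cust_item = dynamo_db.table.get_item(Key={"PK": f"C#{cust_id}"})

    if "Item" not in doc_item or "Item" not in cust_item:
        return {"statusCode": 404, "body": "Not Found"}

    # Render the letter and write it to the S3 bucket.
    letter = f"Dear {cust_item['Item']['data']};\n{doc_item['Item']['data']}"
    key = f"{cust_id}/{doc_type}.txt"
    s3.bucket.put_object(Key=key, Body=letter.encode("ascii"))
    return {"statusCode": 200, "body": f"Created {key}"}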

Unit testing with mock AWS services

Our Lambda function code has now been written and is ready to be tested, so let's take a look at the unit test code! The full code for the unit tests is available in the GitHub repository as tests/unit/src/test_sample_lambda.py.

In production, our Lambda function code will directly access the AWS resources we defined in our function handler; however, in our unit tests we want to isolate our code and replace the AWS resources with simulations.  This isolation lets unit tests run in a self-contained environment and prevents accidental access to actual cloud resources.

Moto is a Python library for mocking AWS services that we will use to simulate AWS resources in our tests.  Moto supports many AWS services, and it allows you to test your code with little or no modification by emulating their functionality.

Moto uses decorators to intercept and simulate responses to and from AWS resources.  By adding a decorator for a given AWS service, subsequent calls from the module to that service will be re-directed to the mock.

@moto.mock_dynamodb
@moto.mock_s3
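
These decorators can be applied to the test class so that every test method runs against the mocks. A minimal illustration (assuming moto 4.x, where moto.mock_dynamodb and moto.mock_s3 are available; the class body is elided here and expanded in the next section):

import moto
from unittest import TestCase

@moto.mock_dynamodb
@moto.mock_s3
class TestSampleLambda(TestCase):
    # All boto3 calls made inside this class are redirected to moto's
    # in-memory DynamoDB and S3 implementations instead of real endpoints.
    ...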

Configure Test Setup and Tear-down

The mocked AWS resources will be used during the unit test suite.  Using the setUp() method allows you to define and configure the mocked global AWS Resources before the tests are run.

We define the test class and a setUp() method to initialize the mock AWS resources.  This includes configuring each resource to prepare it for testing, such as defining a mock DynamoDB table or creating a mock S3 bucket.

class TestSampleLambda(TestCase):
    def setUp(self) -> None:
        dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        dynamodb.create_table(
            TableName = self.test_ddb_table_name,
            KeySchema = [{"AttributeName": "PK", "KeyType": "HASH"}],
            AttributeDefinitions = [{"AttributeName": "PK",
                                     "AttributeType": "S"}],
            BillingMode = 'PAY_PER_REQUEST')

        s3_client = boto3.client('s3', region_name="us-east-1")
        s3_client.create_bucket(Bucket = self.test_s3_bucket_name)

After creating the mocked resources, the setup function creates resource class objects referencing those mocked resources, which will be used during testing.

mocked_dynamodb_resource = resource("dynamodb")
mocked_s3_resource = resource("s3")
mocked_dynamodb_resource = { "resource" : mocked_dynamodb_resource,
                             "table_name" : self.test_ddb_table_name }
mocked_s3_resource = { "resource" : mocked_s3_resource,
                       "bucket_name" : self.test_s3_bucket_name }
self.mocked_dynamodb_class = LambdaDynamoDBClass(mocked_dynamodb_resource)
self.mocked_s3_class = LambdaS3Class(mocked_s3_resource)

Test #1: Verify the code writes the document to S3

Our first test will validate that our Lambda function writes the customer letter to an S3 bucket in the correct manner.  We will follow the standard test format of arrange, act, assert when writing this unit test.

Arrange the data we need in the DynamoDB table:

def test_create_letter_in_s3(self) -> None:

    self.mocked_dynamodb_class.table.put_item(Item={"PK": "D#UnitTestDoc",
                                                    "data": "Unit Test Doc Corpi"})
    self.mocked_dynamodb_class.table.put_item(Item={"PK": "C#UnitTestCust",
                                                    "data": "Unit Test Customer"})

Act by calling the create_letter_in_s3 function.  During these act calls, the test passes the AWS resources as created in the setUp().

    test_return_value = create_letter_in_s3(
                            dynamo_db = self.mocked_dynamodb_class,
                            s3 = self.mocked_s3_class,
                            doc_type = "UnitTestDoc",
                            cust_id = "UnitTestCust")

Assert by reading the data written to the mock S3 bucket, and testing conformity to what we are expecting:

    bucket_key = "UnitTestCust/UnitTestDoc.txt"
    body = self.mocked_s3_class.bucket.Object(bucket_key).get()['Body'].read()

    self.assertEqual(test_return_value["statusCode"], 200)
    self.assertIn("UnitTestCust/UnitTestDoc.txt", test_return_value["body"])
    self.assertEqual(body.decode('ascii'), "Dear Unit Test Customer;\nUnit Test Doc Corpi")

Tests #2 and #3: Data not found error conditions

We can also test error conditions and handling, such as keys not found in the database.  For example, if a customer identifier is submitted, but does not exist in the database lookup, does the logic handle this and return a “Not Found” code of 404?

To test this in test #2, we add data to the mocked DynamoDB table, but then submit a customer identifier that is not in the database.

This test, and a similar test #3 for “Document Types not found”, are implemented in the example test code on GitHub.
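
As a rough sketch of how test #2 could be arranged (the names and assertions below are illustrative; the actual tests live in the GitHub repository), only the document row is written, and a customer identifier that was never stored is submitted:

def test_create_letter_in_s3_customer_not_found_404(self) -> None:
    # Arrange: only the document template exists; the customer id is missing.
    self.mocked_dynamodb_class.table.put_item(Item={"PK": "D#UnitTestDoc",
                                                    "data": "Unit Test Doc Corpi"})
    # Act: request a letter for a customer that is not in the table.
    test_return_value = create_letter_in_s3(
                            dynamo_db = self.mocked_dynamodb_class,
                            s3 = self.mocked_s3_class,
                            doc_type = "UnitTestDoc",
                            cust_id = "UnitTestCustNotFound")
    # Assert: the function reports "Not Found" instead of writing to S3.
    self.assertEqual(test_return_value["statusCode"], 404)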

Test #4: Validate the handler interface

As the application logic resides in independently tested functions, the Lambda handler function provides only interface validation and function call orchestration.  Therefore, the test for the handler validates that the event is parsed correctly, any functions are invoked as expected, and the return value is passed back.

To emulate the global resource variables and other functions, patch both the global resource classes and logic functions.

@patch("src.sample_lambda.app.LambdaDynamoDBClass")
@patch("src.sample_lambda.app.LambdaS3Class")
@patch("src.sample_lambda.app.create_letter_in_s3")
def test_lambda_handler_valid_event_returns_200(self,
                                                patch_create_letter_in_s3: MagicMock,
                                                patch_lambda_s3_class: MagicMock,
                                                patch_lambda_dynamodb_class: MagicMock):

Arrange for the test by setting return values for the patched objects.

patch_lambda_dynamodb_class.return_value = self.mocked_dynamodb_class
patch_lambda_s3_class.return_value = self.mocked_s3_class

return_value_200 = {"statusCode": 200, "body": "OK"}
patch_create_letter_in_s3.return_value = return_value_200

We need to provide event data when invoking the Lambda handler.  A good practice is to save test events as separate JSON files, rather than placing them inline as code. In the example project, test events are located in the folder “tests/events/”. During test execution, the event object is created from the JSON file using the utility function named load_sample_event_from_file.
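
A minimal sketch of such a helper might look like the following; the exact path handling and implementation in the repository may differ:

def load_sample_event_from_file(self, test_event_file_name: str) -> dict:
    # Load a sample API Gateway event payload from tests/events/
    # (assumes `import json` at the top of the test module).
    event_file_name = f"tests/events/{test_event_file_name}.json"
    with open(event_file_name, "r", encoding="UTF-8") as file_handle:
        return json.load(file_handle)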

test_event = self.load_sample_event_from_file("sampleEvent1")

Act by calling the lambda_handler function.

test_return_value = lambda_handler(event=test_event, context=None)

Assert by ensuring the create_letter_in_s3 function is called with the expected parameters based on the event, and that the create_letter_in_s3 return value is passed back to the caller.  In our example, this value is passed through with no alterations.

patch_create_letter_in_s3.assert_called_once_with(
        dynamo_db=self.mocked_dynamodb_class,
        s3=self.mocked_s3_class,
        doc_type=test_event["pathParameters"]["docType"],
        cust_id=test_event["pathParameters"]["customerId"])

self.assertEqual(test_return_value, return_value_200)

Tear Down

The tearDown() method is called immediately after the test method has been run and the result is recorded.  In our example tearDown() method, we clean up any data or state created so the next test won’t be impacted.
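
A minimal tearDown() sketch, assuming the attribute names used earlier in this post (the repository's version may differ):

def tearDown(self) -> None:
    # Delete the mocked table, then empty and delete the mocked bucket,
    # so each test starts from a clean state.
    self.mocked_dynamodb_class.table.delete()
    self.mocked_s3_class.bucket.objects.all().delete()
    self.mocked_s3_class.bucket.delete()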

Running the unit tests

The unittest unit testing framework can be run using the Python pytest utility.  To ensure network isolation and verify that the unit tests are not accidentally connecting to AWS resources, the pytest-socket project provides the ability to disable network communication during a test.

pytest -v --disable-socket -s tests/unit/src/

The pytest command results in a PASSED or FAILED status for each test.  A PASSED status verifies that your unit tests, as written, did not encounter errors or issues.

Conclusion

Unit testing is a software development process in which different parts of an application, called units, are individually and independently tested. Tests validate the quality of the code and confirm that it functions as expected. Other developers can gain familiarity with your code base by consulting the tests. Unit tests reduce future refactoring time, help engineers get up to speed on your code base more quickly, and provide confidence in the expected behaviour.

We’ve seen in this blog how to unit test AWS Lambda functions and mock AWS Services to isolate and test individual logic within our code.

AWS Lambda Powertools for Python has been used in the project to validate handler events.  Powertools provides a suite of utilities for AWS Lambda functions to ease adopting best practices such as tracing, structured logging, custom metrics, idempotency, batching, and more.

Learn more about AWS Lambda testing in our prescriptive test guidance, and find additional test examples on GitHub.  For more serverless learning resources, visit Serverless Land.

About the authors:

Tom Romano

Tom Romano is a Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud-native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Kevin Hakanson

Kevin Hakanson is a Sr. Solutions Architect for AWS World Wide Public Sector based in Minnesota. He works with EdTech and GovTech customers to ideate, design, validate, and launch products using cloud-native technologies and modern development practices. When not staring at a computer screen, he is probably staring at another screen, either watching TV or playing video games with his family.

Disaster Recovery When Using Crossplane for Infrastructure Provisioning on AWS

We would like to acknowledge the help and support Vikram Sethi, Isaac Mosquera, and Carlos Santana provided to make this blog post better. Thank you!

In our previous blog posts [1,2,3], we have discussed a growing trend towards our customers adopting GitOps and Kubernetes native tooling to provision infrastructure resources. AWS customers are choosing open source tools such as Argo CD and Flux CD for rollout and Crossplane or Amazon Controllers for Kubernetes (ACK) for creating infrastructure resources.

In our conversation with some of our customers and partners, including Autodesk, Deutsche Kreditbank, Nike, and Upbound, we identified that, while there is a significant body of work at AWS on how to utilize multi-region and multiple availability zone (multi-AZ) architectures to enable disaster recovery (DR), DR in the context of GitOps and Kubernetes-native infrastructure rollout has not been explored as extensively. To address this issue and to bring attention to some of the key considerations, we’ve worked with engineers and architects from these companies to come up with failure scenarios and related solutions when managing AWS resources with Crossplane.

In this blog post, we discuss different backup considerations when employing GitOps and the use of Kubernetes-native infrastructure provisioning tools on Amazon Elastic Kubernetes Service (Amazon EKS). To provide Kubernetes clusters with capabilities to back up and recover Kubernetes objects, we use Velero. Velero is a popular open source solution which allows you to back up and restore Kubernetes objects to external storage backends such as Amazon Simple Storage Service (Amazon S3). For brevity, in this writing we focus on using Velero for DR when using Crossplane as the infrastructure provisioning tool, but the approach should be equally applicable to ACK as well. In particular, we will answer questions related to the following scenarios:

What should you do if the Kubernetes control plane fails and a new cluster needs to be brought up to manage AWS resources?
How can you bring AWS resources into another region when using Crossplane?
What should you do if one of the resources managed by Crossplane fails?

In this post, we will not go into the details of how to configure individual AWS services to guard against failures. If you are interested in learning more about disaster recovery strategies for specific AWS services, please refer to strategies and documentation available at AWS Elastic Disaster Recovery.

Managed Entities and Failure Scenarios

Crossplane makes use of the Kubernetes Resource Model (KRM) to represent the resources it manages as instances of Custom Resource Definitions (CRDs). These custom resource instances are held within the same Kubernetes Cluster as the Crossplane Controllers.

For example, to create an Amazon S3 bucket, you should deploy a Crossplane CRD of type Bucket to the cluster and also a corresponding Kubernetes object with the kind Bucket that instantiates the CRD and represents the underlying Amazon S3 Bucket resource. Therefore, for any infrastructure resource managed by Crossplane, there are two separate but interrelated entities that you need to look into when doing backups:

The underlying AWS resource, e.g. for the bucket discussed above, you must have a plan in place to back up its content.

The corresponding Kubernetes object and its managing Amazon EKS cluster that keep the actual state of the AWS resource in sync with its expected state as deployed to the Kubernetes cluster. This object is called the Managed Resource (MR).

In addition, it is possible to group multiple MRs into one Kubernetes object. Such objects are called Composite Resources (CRs) and they are responsible for gluing sub resources together and are defined by Composition and Composite Resource Definition objects. CRs can also include other CRs as one of their sub resources. Furthermore, it is possible to create CRs that contain MRs from different providers. For example, it is possible to create CRs which contain MRs from AWS and Helm providers.

Therefore, it is critical for these objects to maintain their relationships with other objects when restoring them from backups.

In this blog post we particularly focus on failure recovery scenarios for the managing Kubernetes cluster, the Crossplane controller that runs in this cluster, and the individual Kubernetes resources representing infrastructure resources. For the AWS infrastructure resource managed by Crossplane, we discuss the traditional disaster recovery scenario where a region becomes completely unavailable.

Scenario 1: Kubernetes Cluster Failure

For the first scenario we are going to look into cases where the managing cluster running the Crossplane controller unexpectedly fails and the management cluster is disconnected from the actual infrastructure resources it manages. While this does not impact the data plane (e.g., your Amazon S3 bucket from the earlier example will continue to operate as it used to), it prevents any change to the infrastructure resources via the management cluster.

A common mechanism to recover from a failed cluster is to copy the state of the failed cluster into a new and healthy cluster. This can be done by regularly backing up the Kubernetes etcd datastore of an existing cluster, so that it can be restored into a new cluster if things fail. While there are several solutions for backing up resources in Kubernetes, Velero is one of the more popular ones. Velero allows you to safely backup and restore Kubernetes objects. This can be done for the entire set of resources, or it can be done selectively and by using Kubernetes resource and label selectors. The following figure depicts how an Amazon RDS resource provisioned by Crossplane in a now failed cluster (left) can be restored and managed by Velero into a new cluster with an existing Crossplane controller (right).

As part of the recovery process, it is important to ensure that Kubernetes resources restored by Velero get into a stable state when reconciled by the Crossplane controller on the healthy cluster. For example, consider a scenario where Crossplane resources are backed up from a Kubernetes cluster which has an older version of Crossplane’s AWS provider. When Velero restores these resources into a new Kubernetes cluster which has a newer version of Crossplane’s AWS providers, Velero replicates the same old copy of the resource in the new cluster. If for arbitrary reasons (e.g. regressions, bugs, etc.) the new Crossplane AWS provider acts on the restored resources differently than the old provider, then Velero and the new Crossplane AWS provider may get into a race overriding the state of the resource on the cluster, in which case the external resource could end up flapping between a desired state and a non-desired state, hence not becoming fully stable.

Similarly, if the manifests representing your Crossplane resources are applied to your Kubernetes cluster using GitOps or similar automation, you should make sure to disable it before restoring the backup. Otherwise, once your backup is restored, any Kubernetes resources it contains could be overwritten by their representations from git. This could cause your restore to become inconsistent or worse, you could introduce skew with the underlying cloud provider resources. By disabling your GitOps workflows or any automation delivering your Crossplane manifests, you can ensure that the only manifests applied are those restored from backup.

When backups are made with Velero, they should contain objects managed by Crossplane and its providers, including composite resources, their sub resources, and managed resources. For a successful recovery, it is important for backed up objects to include all the dependencies. This includes, e.g., ProviderConfig objects and corresponding secrets. ProviderConfig objects are responsible for providing configuration including credentials for providers to connect to service providers such as AWS. Without importing the full chain of object dependencies, migrated crossplane resources have no way of getting reconciled correctly. In addition, to ensure resources are in-sync with Git, they need to be imported into your GitOps tooling. This is specific to the GitOps tooling used, which can become complex to manage. We intend to cover the intricacies of synchronizing restored resources and the corresponding Git counterparts in another blog post.

In general, a regular backup process before a cluster failure involves the following steps:

Scale down the AWS provider deployment to zero to prevent this instance of the provider from attempting to reconcile AWS resources.
Back up the required resources.
Repeat at a reasonable interval.

When restoring into a new cluster:

Install Velero and configure it to use the backup data source.
Ensure your AWS Auth ConfigMap setup does not get overwritten by Velero after you initiate the restore process, or else you will lose access to the new cluster.
Restore the backed up resources to the new cluster.
Scale up the AWS provider deployment to one in the new cluster.

During the backup and restoration process Velero preserves owner references, resource states, and external names of managed resources. Therefore from here on, the Kubernetes reconciliation process should kick in and proceed as if there were no changes.
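
For illustration, the scale-down and scale-up steps above could be performed with the Kubernetes Python client. The deployment name and namespace below are placeholders and will differ in your cluster; many teams simply use kubectl for this instead:

from kubernetes import client, config

config.load_kube_config()          # credentials for the managing cluster
apps_v1 = client.AppsV1Api()

def scale_provider(replicas: int,
                   name: str = "provider-aws",            # placeholder name
                   namespace: str = "crossplane-system") -> None:
    # Patch the provider deployment's replica count.
    apps_v1.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale_provider(0)   # before taking a backup: stop reconciliation
# scale_provider(1) # after restoring into the new cluster: resume reconciliation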

Scenario 2: Failure in Crossplane Controller Upgrades

Upgrading the Crossplane controller for a given provider should not require treating failures as a full case of disaster recovery. When you install a new version of a provider, a ProviderRevision object is created that owns the new version of managed resources and transfers ownership from the old providers. In case the new revision of the provider fails, recovery should be as easy as rolling back to the old revision of the provider where ownership of the previous revision is reinstated.

However, it could be that due to other complications (see here for an example), a rollback from a failed provider upgrade would not help with recovering from a failure. In this case, a full cluster migration might be the fastest solution to bring things back on track.

In case of cluster migration, you can safely use Velero, similar to how it was discussed previously to back up resources and restore them to a new cluster. The new cluster does not need to be in the same region either, since the region in which managed AWS resources reside does not change. An example of such a migration can be found here.

Scenario 3: Region Failure

Region failures result in more complex failure scenarios to recover from, given how Kubernetes objects are connected to external AWS service entities.

For example, you may have an EKS cluster with Crossplane in the us-west-2 region. Let us assume that this cluster manages AWS resources in us-west-2 and us-east-1 regions. One day, for an unforeseeable reason, the entire us-west-2 region disappears from the earth. What should your recovery strategy look like if you need to restore infrastructure to another region?

You may have a Velero backup of objects from the managing Crossplane cluster in us-west-2 and you can restore them for Global AWS resources such as Route53 records and IAM roles, but you cannot simply restore regional resources to the new cluster in a different region. One reason for this is that all resources managed by the Crossplane AWS provider have "region" as a required configuration field (for example, an Amazon Virtual Private Cloud (Amazon VPC) instance manifest must specify its region).

This field determines in which region Crossplane-managed AWS resources should be deployed. When you take a backup of the managed resources in the original cluster, these objects have the region specified in the resources. In our example, these resources may have the us-west-2 or us-east-1 region specified. Because of this, when you attempt to restore them into another cluster in another region, Crossplane will attempt to provision resources into the region that no longer exists. You will need to update your Kubernetes manifests to reflect region changes. Also, your stateful resources such as databases likely need to be restored from a snapshot. You may also have a hot replica in another region that your application can use. In either case, during the restoration process you still need a way to tell Crossplane which backup you want to restore from and update applications to point to the new database, preferably in an automated way.

You will also need to consider AWS regional differences and ensure compositions work in different regions. For example, the us-west-2 region has 4 availability zones, while the us-east-1 region has 6 availability zones. To accommodate this, you may need to create different compositions for each region to ensure availability requirements are met.

Because of these reasons, it’s likely you need to change manifests in your source of truth or use a mutation webhook to dynamically update the region field, instead of relying on a Kubernetes backup solution. You need to perform the restoration process using the continuous delivery (CD) tool of your choice. A rough process is depicted in the figure below and would look like this:

Create a new Kubernetes cluster with Crossplane in another region if one doesn’t exist.
Determine which region you want to restore to.
Determine which snapshots you want to restore from.
Make changes to your manifests to reflect information determined in the previous steps (preferably with automation such as a mutating webhook).
Let your CD tooling deploy resources.

Key Considerations

Based on the discussions here, we highly recommend that you start designing to guard against failure from the ground up to ensure uninterrupted operation of your applications.

In order to have a less hectic recovery process, here are some additional key considerations:

Automate the rollout of your infrastructure and application resources. This ensures that when disaster hits, you can more quickly respond by redirecting your rollout practices to an alternative solution.
Parameterize your deployments. Make regions, availability zones, instance types, replica counts, etc. configurable so you can change them to alternatives where necessary. This, combined with proper automation, allows you to quickly choose alternative targets for deploying infrastructure and application resources.
Make sure you can tag and specify which snapshot you want to restore for your backed up data stores. In Crossplane compositions, this can and needs to be defined when you are restoring your data into a new managing cluster.
Consider full migrations to alternative regions as a last resort. Failures happen a lot less frequently to an entire region. When recovering from failures, consider rollback options available natively by Crossplane. This involves rolling back to an older revision of the provider, or restoring a failed AWS service instance to remain in sync with the Crossplane counterpart.
Do not ignore full migrations to alternative regions. While rare, you would occasionally need to do a full migration to an entirely different region. Have it be part of your recovery planning. Do not ignore or postpone it!
Practice failure recovery scenarios a priori. We have seen examples like Chaos Monkey and other tools imitating a case of failure for deployments in order to prepare and practice a recovery process. Crossplane is no different. Ensure you practice failure recovery frequently to be prepared for an actual case of recovery when needed.

Conclusions

Managing AWS resources using a Kubernetes control plane has its own challenges when you need strategies to backup and restore resources. If you are simply migrating from one managing cluster to another, backup solutions such as Velero work nicely because no modifications to the source of truth are required.

If you are preparing for disaster recovery scenarios, things are more nuanced and may require modifications to your source of truth. Many AWS managed services offer easy ways to back up and restore your data to another region. Your Crossplane compositions need to be able to take advantage of these features according to your recovery objectives. Using templating and overlaying mechanisms, you can easily embed these objectives into Crossplane compositions. This means all AWS resources managed by Crossplane adhere to your recovery point objective. In the end, it is important to understand where you draw the boundaries in your recovery process: what is possible via Kubernetes tooling and the control plane, and what needs to be handled out of band using traditional failure recovery practices.

If you want to learn more about building shared services platforms (SSP) on EKS with Crossplane, you can schedule time with our experts.

References:

https://aws.amazon.com/blogs/opensource/comparing-aws-cloud-development-kit-and-aws-controllers-for-kubernetes/
https://aws.amazon.com/blogs/opensource/declarative-provisioning-of-aws-resources-with-spinnaker-and-crossplane/
https://aws.amazon.com/blogs/opensource/introducing-aws-blueprints-for-crossplane/


Proactive Insights with Amazon DevOps Guru for RDS

Today, we are pleased to announce a new Amazon DevOps Guru for RDS capability: Proactive Insights. DevOps Guru for RDS is a fully managed service powered by machine learning (ML) that uses the data collected by RDS Performance Insights to detect anomalous behaviors within Amazon Aurora databases and alert customers to them. Since its release, DevOps Guru for RDS has empowered customers with information to quickly react to performance problems and to take corrective actions. Now, Proactive Insights adds recommendations related to operational issues, helping to prevent potential problems in the future.

Proactive Insights requires no additional set up for customers already using DevOps Guru for RDS, for both Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition.

The following are example use cases of operational issues available for Proactive Insights today, with more insights coming over time:

Long InnoDB History for Aurora MySQL-Compatible engines – Triggered when the InnoDB history list length becomes very large.

Temporary tables created on disk for Aurora MySQL-Compatible engines – Triggered when the ratio of temporary tables created versus all temporary tables breaches a threshold.

Idle In Transaction for Aurora PostgreSQL-Compatible engines – Triggered when sessions connected to the database are not performing active work, but can keep database resources blocked.

To get started, navigate to the Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights. In the following screen capture, the number three indicates that there are three ongoing proactive insights. Click on that number to see the listing of the corresponding Proactive Insights, which may include RDS or other Proactive Insights supported by Amazon DevOps Guru.

Figure 1. Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights.

Ongoing problems (including reactive and proactive insights) are also highlighted against your database instance on the Database list page in the Amazon RDS console.

Figure 2. Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console.
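
If you prefer to work programmatically, ongoing proactive insights can also be listed with the AWS SDK for Python (Boto3). A minimal sketch, with field handling kept deliberately simple:

import boto3

devops_guru = boto3.client("devops-guru")

# List ongoing proactive insights across the resources DevOps Guru monitors.
response = devops_guru.list_insights(
    StatusFilter={"Ongoing": {"Type": "PROACTIVE"}}
)

for insight in response.get("ProactiveInsights", []):
    print(insight["Name"], insight["Severity"], insight["Status"])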

In the following sections, we will dive deep on these use cases of DevOps Guru for RDS Proactive Insights.

Long InnoDB History for Aurora MySQL-Compatible engines

The InnoDB history list is a global list of the undo logs for committed transactions. MySQL uses the history list to purge records and log pages when transactions no longer require the history.  If the InnoDB history list length grows too large, indicating a large number of old row versions, queries and even the database shutdown process can become slower.

DevOps Guru for RDS now detects when the history list length exceeds 1 million records and alerts users to close (either by commit or by rollback) any unnecessary long-running transactions before triggering database changes that involve a shutdown (this includes reboots and database version upgrades).

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS InnoDB History List Length Anomalous” Proactive Insight with an ongoing status. You will notice that Proactive Insights provides an “Insight overview”, “Metrics” and “Recommendations”.

Insight overview provides you basic information on this insight. In our case, the history list for row changes increased significantly, which affects query and shutdown performance.

Figure 3. Long InnoDB History for Aurora MySQL-Compatible engines Insight overview.

The Metrics panel gives you a graphical representation of the history list length and the timeline, allowing you to correlate it with any anomalous application activity that may have occurred during this window.

Figure 4. Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem. You will also notice the rationale behind the recommendation under the “Why is DevOps Guru recommending this?” column.

Figure 5. The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem.

Temporary tables created on disk for Aurora MySQL-Compatible engines

Sometimes it is necessary for the MySQL database to create an internal temporary table while processing a query. An internal temporary table can be held in memory and processed by the TempTable or MEMORY storage engine, or stored on disk by the InnoDB storage engine. An increase of temporary tables created on disk instead of in memory can impact the database performance.

DevOps Guru for RDS now monitors the rate at which the database creates temporary tables and the percentage of those temporary tables that use disk. When these values cross recommended levels over a given period of time, DevOps Guru for RDS creates an insight exposing this situation before it becomes critical.

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Temporary Tables On Disk Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

Insight overview provides you basic information on this insight. In our case, more than 58% of the total temporary tables created per second were using disk, with a sustained rate of two temporary tables on disk created every second, which indicates that query performance is degrading.

Figure 6. Temporary tables created on disk insight overview.

The Metrics panel shows you a graphical representation of the information specific for this insight. You will be presented with the evolution of the amount of temporary tables created on disk per second, the percentage of temporary tables on disk (out of the total number of database-created temporary tables), and of the overall rate at which the temporary tables are created (per second).

Figure 7. Temporary tables created on disk – evolution of the amount of temporary tables created on disk per second.

Figure 8. Temporary tables created on disk – the percentage of temporary tables on disk (out of the total number of database-created temporary tables).

Figure 9. Temporary tables created on disk – overall rate at which the temporary tables are created (per second).

The Recommendations section suggests actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Figure 10. Temporary tables created on disk – actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Additional explanations on this use case can be found by clicking on the “View troubleshooting doc” link.

Idle In Transaction for Aurora PostgreSQL-Compatible engines

A connection that has been idle in transaction  for too long can impact performance by holding locks, blocking other queries, or by preventing VACUUM (including autovacuum) from cleaning up dead rows.
PostgreSQL database requires periodic maintenance, which is known as vacuuming. Autovacuum in PostgreSQL automates the execution of VACUUM and ANALYZE commands. This process gathers the table statistics and deletes the dead rows. When vacuuming does not occur, this negatively impacts the database performance. It leads to an increase in table and index bloat (the disk space that was used by a table or index and is available for reuse by the database but has not been reclaimed), leads to stale statistics and can even end in transaction wraparound (when the number of unique transaction ids reaches its maximum of about two billion).

DevOps Guru for RDS monitors the time spent by sessions in an Aurora PostgreSQL database in the idle in transaction state and initially raises a warning notification, followed by an alarm notification if the idle in transaction state continues (the current thresholds are 1800 seconds for the warning and 3600 seconds for the alarm).

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Idle In Transaction Max Time Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

In our case, a connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

Figure 11. A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

Figure 12. The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

As with the other insights, recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Figure 13. Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Conclusion

With Proactive Insights, DevOps Guru for RDS enhances its abilities to help you monitor your databases by notifying you about potential operational issues before they become bigger problems down the road. To get started, you need to ensure that you have enabled Performance Insights on the database instance(s) you want monitored, and confirm that DevOps Guru is enabled to monitor those instances (for example, by enabling it at the account level, by monitoring specific CloudFormation stacks, or by using AWS tags for specific Aurora resources). Proactive Insights is available in all Regions where DevOps Guru for RDS is supported. To learn more about Proactive Insights, join us for a free hands-on Immersion Day (available in three time zones) on March 15th or April 12th.
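
For example, if you choose to scope DevOps Guru coverage to specific CloudFormation stacks, this can be done programmatically with the AWS SDK for Python (Boto3); the stack name below is a placeholder:

import boto3

devops_guru = boto3.client("devops-guru")

# Add a CloudFormation stack to the DevOps Guru resource collection.
devops_guru.update_resource_collection(
    Action="ADD",
    ResourceCollection={"CloudFormation": {"StackNames": ["my-aurora-stack"]}},
)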

About the authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Raluca Constantin

Raluca Constantin is a Senior Database Engineer with the Relational Database Services (RDS) team at Amazon Web Services. She has 16 years of experience in the databases world. She enjoys travels, hikes, arts and is a proud mother of a 12y old daughter and a 7y old son.

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Driving Action and Communication in AWS Amplify Open Source Projects

AWS Amplify is a complete solution that lets frontend web and mobile developers easily build, ship, and host full-stack applications on AWS, with the flexibility to leverage additional AWS services as use cases evolve. To build Amplify applications, customers typically use one of the open source Amplify libraries. At Amplify, we manage and build these open source projects on GitHub.

In the last year, the number of contributions across the Amplify projects has increased and the teams have scaled to meet customer needs while building across programming languages and frameworks. This necessitates a constant balance of collaboration with contributors and also flexible processes to continually move the projects forward. The goal is to provide a delightful experience for Front End developers building on AWS.

Open source is as much about relationships as it is about code. These relationships are built anytime the team collaborates with external contributors, developers, and customers at different touch points. A core element of open source is building relationships through consistent and transparent communication. There isn't a clear, well-defined playbook for this. It requires iteration and continuous feedback from contributors to fine tune and improve. How we communicate shapes how the projects are planned, structured, and received by the developer community, and all of this may change over time!

In this post, we’ll cover some processes and tools that we’ve built and use at AWS Amplify to help build a vibrant and responsive open source community.

Organization

For context, an average of 35 external contributor pull requests (PRs) and 350 issues are opened each month on GitHub across the Amplify project repositories. In 2022, this equated to an 80% increase in issues and 66% increase in external contributor PRs from the previous year. An external contributor is any active contributor that is not a member of the Amplify GitHub organization. These projects range from the Amplify CLI to the client libraries like Amplify JS and Amplify UI. We also have a very active Discord server with over 19,000 members.

Along with the growth, it can be a cultural shift for both AWS engineers and customers to work in the open. Not every team member is used to working in public, and not all customers are used to working with AWS through GitHub or Discord.

Our ownership model is that each individual team manages their own repositories and is responsible for the project health and operations. This includes issue and Pull Request (PR) management, communication, and releases. This level of autonomy helps the projects move fast and remain flexible. As the service has grown, so has the need for standardization and operational tooling within each team.

Communication and transparency – reducing the time to response

The starting point for community is consistent communication and transparency. There are different elements to this and it fluctuates over time and as projects grow or slow in velocity. There is an expectation in open source software development that there is an ongoing dialog with the community. This may take the form of contributors helping on issue triage and workarounds, providing feedback on Request for Comments (RFCs), and submitting PRs. It could also be developers using the libraries and services within their applications.

In each of these scenarios, open communication is what helps to oil that machine. In open source, the goal is to determine an action for each issue, pull request, feature request, or question that is created. In most cases, proactive communication and quick responses in issues and PRs helps to determine this action faster. This also shapes the open source contributor experience. An external contributor is more likely to stay engaged in an issue thread or PR review if the maintainers are quick to respond.

Knowing that communication plays such a large role, we identified ways to reduce friction and make the experience better. These ways are:

Standardize the structure of the operational processes (GitHub issues labels, etc.)
Communicate as early and often as possible in response to open items (issues, PRs, etc.)
Proactively track updates on these items in order to follow up quickly

With these goals in mind, we had to operationalize processes to allow us to deliver on each. Thinking through this, two core themes surfaced:

Develop a consistent approach and cadence for communication
Reduce overall time-to-response (or action)

Tackling these first would improve the contributor experience and begin to strengthen our own internal culture. Once standardized, we could then proactively track open items and metrics.

Communications processes

How do we improve communication across our project? This is where that cultural shift comes in. Sometimes this requires changes to normal process and team communication to fully embrace working in the open. The project is always evolving and updating, including nights and weekends. Without a clear process to drive action, things can quickly begin to get missed.

The first task was to find the root cause of the communication friction points in a project repository. What are the communication touch points? At those points, where can automation help reduce friction and time to next action? The initial entry points to communication in a GitHub repository are:

A contributor is using the project and opens an item in the form of an issue, or
A contributor opens a PR to submit code for inclusion into the project

The team on-call engineers were initially shadowed to observe how they identified, and interacted with, new and updated issues and PRs. A common theme was that the context switching and frequent back and forth on these items was very time consuming and hard to track. After observing this same pattern across multiple projects, it was clear that the conversation should be streamlined.

We needed an efficient way to both remind the original posters of what is needed to help reproduce an issue and encourage them to also provide details. For each of these items, we had to determine the best way to expedite the path to action. Maintainers need to triage an issue to determine the next action to take. Maintainers also need to triage each pull request to determine if it aligns with the project and identify what the next steps are with respect to review and merge.

GitHub issue template forms to the rescue

We initially standardized the GitHub issue template forms across all of the Amplify projects after getting access to the Beta feature in early 2021. These forms are used each time a contributor opens an issue, pull request, or feature request in a repository. This allowed for more actionable conversations by collecting all of the required information up front. The goal of the form was to collect just the required information without introducing unnecessary friction for contributors. This is different from the original GitHub issue templates that allowed for more free-form data entry. With this approach, we were able to require standard information that had been found to be helpful in triaging issues while shadowing the maintainers. Here is a screenshot of a portion of the current GitHub issue form template in the Amplify CLI repository.

This is the form that is used to open new issues in the repository. It asks the user to check some boxes before opening an issue, such as whether they have installed the latest version of the Amplify CLI, searched for duplicate or closed issues, and read the guide for submitting bug reports. It also asks some open ended questions about the user’s environment.

As using the new form to file an issue became the standard, the starting point for communication became clearer, allowing future interactions on issues and PRs to be more direct. Through the structured nature and standardized collection of data fields, we were able to significantly improve the quality of our search results. This has helped to reduce duplicate issues (although they may still exist in some forms in older issues) and increase visibility and traction on current open issues and feature requests.

Standardization

As the Amplify project quickly grew, the operational processes needed to grow with it and remain consistent across teams. Repository standardization tools are typically templates used to structure a project when it is first created. The original templates are helpful but there is an ongoing challenge of maintaining, or changing, the structure over time as these dimensions change. To account for this, we created an open source repository audit tool that provides a declarative approach to keep certain items such as labels, project topic tags, and descriptions up to date.

The audit tool also includes a dashboard to provide insight into other items, such as GitHub actions, required links, and the number of good first issues. The required items are defined in a configuration file and each repository is checked against that structure. This is a simple but effective way to quickly check across all the projects without manually checking each project.

Repositories will change over time but we need to make sure that there is consistency in the core structure. For instance, new labels are added, some are removed. Or, the CODEOWNERs.md file needs to be updated. The audit tool queries the repository metadata and provides a self-service way for each team to check if the repo is in good standing as part of their standard operating processes or has fallen out of date and needs attention. This screenshot shows a portion of the audit tool for the Amplify JS repository. The green check mark next to the links indicates that these items are present in the repository.

We want customers to have a consistent entry point when landing in any AWS Amplify repository. This includes the same nomenclature and core labels to correctly communicate the lifecycle of issues and PRs. This screenshot shows the required labels section of the tool. The set of required labels is compared against the labels in the repository for any discrepancies.

Data and tooling – separating out external contributor events

To accurately gauge activities on issues and opened items, it was critical to determine whether events were initiated by someone on the Amplify team or an external contributor in order to prioritize communication and the triaging process. Our triaging process involves troubleshooting or reproducing an issue by a project maintainer. We needed to isolate GitHub issue (and pull request) event data created by external contributors in a separate tool to support this effort.

To track this event stream from GitHub, we built serverless data processes using AWS Lambda to capture issues and PRs that have been updated, and then capture any new events (comments, labels, etc.) on the issues. The data allows for ad-hoc querying against events to isolate if activity is increasing in a certain area. This also helps teams to quickly spot items when there are lags in timezones or as team on-call rotations change. All of these queries take into account the active members of the Amplify organization to identify those events only triggered by external contributors.
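
As a simplified illustration (not the actual Amplify pipeline), a Lambda function receiving GitHub webhook deliveries could separate external contributor events along these lines; the member list, event fields used, and storage layer are placeholders:

import json

AMPLIFY_ORG_MEMBERS = {"example-maintainer"}   # placeholder; loaded elsewhere in practice

def handler(event, context):
    payload = json.loads(event["body"])        # GitHub webhook delivery body
    sender = payload.get("sender", {}).get("login", "")

    if sender in AMPLIFY_ORG_MEMBERS:
        return {"statusCode": 200, "body": "ignored: internal event"}

    # Keep only external-contributor activity for trending and response tracking.
    record = {
        "repo": payload.get("repository", {}).get("full_name"),
        "action": payload.get("action"),
        "sender": sender,
    }
    print(json.dumps(record))                  # stand-in for persisting to a data store
    return {"statusCode": 200, "body": "stored"}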

Capturing this data has allowed us to build tools to proactively engage in open and closed issues. The following sections outline these tools.

Trending issues

With the external contributor data separated from Amplify team events, we are able to start tracking, in near real-time, issues that have an increased amount of activity within a given time period. One way has been to create a dashboard that highlights trending issues that are receiving increased activity from external contributors. This is used at the individual project level and also across all of the projects. This helps to identify cross-project themes that may be emerging, without constant, manual checking of issues to see what has changed.

The trending dashboard only ranks issues based on the activity that they are receiving from external contributors within a given time period. It takes into account recent comments and the aggregate reactions on comments that are not created by Amplify maintainers. This provides a holistic view to the themes that contributors and developers may be experiencing across the entirety of Amplify without the noise of Amplify team comments.

This includes issues, both open and closed post triage when there’s already been communication. It’s important to track closed issues (that are not locked) to not miss any new comments or activity. Since the issue is closed, those comments may not be seen unless a team member happens to view the issue. This screenshot shows the trending list of open issues in the Amplify CLI repository. The list of issues and aggregated metadata (number of comments and reactions) only includes data from external contributors.

Closed issues with increased activity

As previously mentioned, closed issue visibility is a challenge since it’s difficult to track which closed issues were updated and their specific changes – new comment, increase in number of reactions. The trending dashboard also tracks this activity.

The dashboard is especially useful for issues that are already closed that may continue to receive events. Without automation, it’s very time consuming (and manual) to identify an increased spike in reactions or external contributor comments on issues across one project, let alone over 20. Often times in projects, these comments go unnoticed unless an Amplify team member is tagged or subscribed to a specific item and happens to see the notification.

Having a UI that highlights these items also helps to reduce notification fatigue. The Amplify team (engineers, product, and developer experience) receives many notifications across many of the Amplify repositories. It is helpful to have a single dashboard to spot-check whether activity has increased on an older, closed issue.

Metrics to track

So how do we know that all of these processes and tools are making a positive impact? A few key metrics helped:

Mean Time to Respond (MTTR) – This measures how responsive we are to customers and encourages communicating as soon as possible once an issue or Pull Request is opened.

Mean Time to Close (MTTC) – This measures the overall timeframe that it takes to close an issue or Pull Request.

Combining these metric definitions with the constant stream of event data has allowed us to track MTTC and MTTR at both the repository and organization level.
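As a simplified illustration of how such metrics can be derived from issue event timestamps (the field names below are assumptions, not the Amplify team's actual schema):

from datetime import datetime
from statistics import mean

def hours_between(start: str, end: str) -> float:
    """Difference between two ISO-8601 UTC timestamps, in hours."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600

def mean_time_to_respond(issues) -> float:
    """Average hours from creation to the first maintainer response."""
    samples = [hours_between(i["created_at"], i["first_maintainer_response_at"])
               for i in issues if i.get("first_maintainer_response_at")]
    return mean(samples) if samples else 0.0

def mean_time_to_close(issues) -> float:
    """Average hours from creation to close."""
    samples = [hours_between(i["created_at"], i["closed_at"])
               for i in issues if i.get("closed_at")]
    return mean(samples) if samples else 0.0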

MTTR is primarily about response times: how quickly do we respond throughout the lifecycle of an issue (or PR)? A lower MTTC indicates that issues are closed faster, which may mean that maintainers have answered questions, reproduced issues, and actioned PRs. Many factors contribute to an item being closed, and their timelines vary; feature requests, for example, may remain open longer than a normal issue. There are caveats to each metric, but together they help isolate issues that need further investigation.

Even with the progress in tracking, a few consistent challenges presented themselves. The back and forth communication on issues can be very sporadic. The next step was to isolate which issues and PRs needed a response.

Pending response

It's difficult and time consuming to keep track of comment responses in issues and PRs using only the notifications view within GitHub. One mechanism to keep track of this is identifying items that were awaiting a response and where the original poster has since followed up (i.e., responded).

The ideal flow is something like this:

Issue is opened
Team responds with follow up or questions
Issue is labeled pending-response

A dashboard displays issues that have the pending-response labels AND an external contributor has responded

This helps reduce active response times and highlights when issues have received a reply. Additionally, this ensures that issues are actioned and don’t remain in an unknown state (i.e. not triaged) while awaiting a reply. Similar to the trending issues above, the team created a dashboard to surface any issue that fit this criteria. This screenshot shows the list of issues that need a response from the maintainers. These issues all have the pending-response label and an external contributor has been the most recent to comment.

Conclusion

Open source is not static. Tools, teams, and community are always evolving and what is working today will certainly change over the next year. Working backwards from consistency and communication has allowed Amplify to focus attention on prioritizing proactive action to make it a pleasant experience for community members to contribute to the projects.

We continue to identify more efficient ways to strengthen the developer relationships and facilitate more open communication across the Amplify repositories. As the community and project evolve, it’s important to remain flexible, communicate early and often, and continue to improve what the team can control.

Interested in learning more about open source at AWS Amplify? Follow @AWSAmplify on Twitter to get the latest updates about the Amplify Contributor Program, explore the open source Amplify GitHub repositories, or join the Amplify Discord server.


Develop a serverless application in Python using Amazon CodeWhisperer

While writing code to develop applications, developers must keep up with multiple programming languages, frameworks, software libraries, and popular cloud services from providers such as AWS. Even though developers can find code snippets on developer communities, to either learn from them or repurpose the code, manually searching for the snippets with an exact or even similar use case is a distracting and time-consuming process. They have to do all of this while making sure that they’re following the correct programming syntax and best coding practices.

Amazon CodeWhisperer, a machine learning (ML) powered coding aid for developers, lets you overcome those challenges. Developers can simply write a comment that outlines a specific task in plain English, such as "upload a file to S3." Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited for the specified task, generates the specific code on the fly, and then recommends the generated code snippets directly in the IDE. This isn't about copy-pasting code from the web, but about generating code based on the context of your file, such as which libraries and versions you have, as well as the existing code. Moreover, CodeWhisperer seamlessly integrates with your Visual Studio Code and JetBrains IDEs so that you can stay focused and never leave the development environment. At the time of this writing, CodeWhisperer supports Java, Python, JavaScript, C#, and TypeScript.

In this post, we’ll build a full-fledged, event-driven, serverless application for image recognition. With the aid of CodeWhisperer, you’ll write your own code that runs on top of AWS Lambda to interact with Amazon Rekognition, Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Simple Storage Service (Amazon S3), and third-party HTTP APIs to perform image recognition. The users of the application can interact with it by either sending the URL of an image for processing, or by listing the images and the objects present on each image.

Solution overview

To make our application easier to digest, we’ll split it into three segments:

Image download – The user provides an image URL to the first API. A Lambda function downloads the image from the URL and stores it on an S3 bucket. Amazon S3 automatically sends a notification to an Amazon SNS topic informing that a new image is ready for processing. Amazon SNS then delivers the message to an Amazon SQS queue.

Image recognition – A second Lambda function handles the orchestration and processing of the image. It receives the message from the Amazon SQS queue, sends the image for Amazon Rekognition to process, stores the recognition results on a DynamoDB table, and sends a message with those results as JSON to a second Amazon SNS topic used in section three. A user can list the images and the objects present on each image by calling a second API which queries the DynamoDB table.

3rd-party integration – The last Lambda function reads the message from the second Amazon SQS queue. At this point, the Lambda function must deliver that message to a fictitious external e-mail server HTTP API that supports only XML payloads. Because of that, the Lambda function converts the JSON message to XML. Lastly, the function sends the XML object via HTTP POST to the e-mail server.

The following diagram depicts the architecture of our application:

Figure 1. Architecture diagram depicting the application architecture. It contains the service icons with the component explained on the text above.

Prerequisites

Before getting started, you must have the following prerequisites:

An AWS account and an Administrator user

Install and authenticate the AWS CLI. You can authenticate with an AWS Identity and Access Management (IAM) user or an AWS Security Token Service (AWS STS) token.
Install Python 3.7 or later.
Install Node Package Manager (npm).

Install the AWS CDK Toolkit.
Install the AWS Toolkit for VS Code or for JetBrains.

Install Git.

Configure environment

We already created the scaffolding for the application that we'll build, which you can find in this Git repository. This application is represented by a CDK app that describes the infrastructure according to the architecture diagram above. However, the actual business logic of the application isn't provided; you'll implement it using CodeWhisperer. This means that we have already declared, using AWS CDK components, resources such as the API Gateway endpoints, the DynamoDB table, and the topics and queues. If you're new to AWS CDK, we encourage you to go through the CDK workshop later on.

Deploying AWS CDK apps into an AWS environment (a combination of an AWS account and region) requires that you provision resources that the AWS CDK needs to perform the deployment. These resources include an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments. The process of provisioning these initial resources is called bootstrapping. The required resources are defined in an AWS CloudFormation stack, called the bootstrap stack, which is usually named CDKToolkit. Like any CloudFormation stack, it appears in the CloudFormation console once it has been deployed.

After cloning the repository, let’s deploy the application (still without the business logic, which we’ll implement later on using CodeWhisperer). For this post, we’ll implement the application in Python. Therefore, make sure that you’re under the python directory. Then, use the cdk bootstrap command to bootstrap an AWS environment for AWS CDK. Replace {AWS_ACCOUNT_ID} and {AWS_REGION} with corresponding values first:

cdk bootstrap aws://{AWS_ACCOUNT_ID}/{AWS_REGION}

For more information about bootstrapping, refer to the documentation.

The last step to prepare your environment is to enable CodeWhisperer on your IDE. See Setting up CodeWhisperer for VS Code or Setting up Amazon CodeWhisperer for JetBrains to learn how to do that, depending on which IDE you’re using.

Image download

Let's get started by implementing the first Lambda function, which is responsible for downloading an image from the provided URL and storing that image in an S3 bucket. Open the get_save_image.py file from the python/api/runtime/ directory. This file contains an empty Lambda function handler and the input parameters needed by this Lambda function:

url is the URL of the input image provided by the user,

name is the name of the image provided by the user, and

S3_BUCKET is the S3 bucket name defined by our application infrastructure.

Write a comment in natural language that describes the required functionality, for example:

# Function to get a file from url

To trigger CodeWhisperer, hit the Enter key after entering the comment and wait for a code suggestion. If you want to manually trigger CodeWhisperer, you can press Option + C on macOS or Alt + C on Windows. You can browse through multiple suggestions (if available) with the arrow keys. Accept a code suggestion by pressing Tab, and discard a suggestion by pressing Esc or typing a character.

For more information on how to work with CodeWhisperer, see Working with CodeWhisperer in VS Code or Working with Amazon CodeWhisperer from JetBrains.

You should get a suggested implementation of a function that downloads a file using a specified URL. The following image shows an example of the code snippet that CodeWhisperer suggests:

Figure 2. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called get_file_from_url with the implementation suggestion to download a file using the requests lib.

Be aware that CodeWhisperer uses artificial intelligence (AI) to provide code recommendations, and that this is non-deterministic. The result you get in your IDE may be different from the one in the image above. If needed, fine-tune the code: CodeWhisperer generates the core logic, but you might want to customize the details depending on your requirements.
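For reference, here is a hand-written sketch (not actual CodeWhisperer output) of roughly what such a function could look like, assuming the requests library is available in the Lambda environment:

import requests

def get_file_from_url(url):
    """Download a file from the given URL and return its raw bytes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content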

Let’s try another action, this time to upload the image to an S3 bucket:

# Function to upload image to S3

As a result, CodeWhisperer generates a code snippet similar to the following one:

Figure 3. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called upload_image with the implementation suggestion to download a file using the requests lib and upload it to S3 using the S3 client.
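Your suggestion will likely differ; it may even combine the download and upload steps as the screenshot describes. As a reference only, a minimal upload-only sketch using boto3 (the function and parameter names here are illustrative) could look like this:

import boto3

s3_client = boto3.client("s3")

def upload_image(image_bytes, bucket, key):
    """Upload the image bytes to the given S3 bucket under the given key."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=image_bytes)
    return f"s3://{bucket}/{key}"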

Now that you have the functions with the functionalities to download an image from the web and upload it to an S3 bucket, you can wire up both functions in the Lambda handler function by calling each function with the correct inputs.
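As an illustration of that wiring (the event shape, the response format, and the environment variable name are assumptions; the scaffolding defines the actual integration), a minimal handler might look like:

import json
import os

# Assumes the bucket name is exposed to the function as an environment variable.
S3_BUCKET = os.environ["S3_BUCKET"]

def handler(event, context):
    # API Gateway proxy integrations pass query string parameters in the event.
    params = event.get("queryStringParameters") or {}
    url = params["url"]
    name = params["name"]

    # get_file_from_url and upload_image are the helpers generated above.
    image_bytes = get_file_from_url(url)
    location = upload_image(image_bytes, S3_BUCKET, name)

    return {"statusCode": 200, "body": json.dumps({"image": location})}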

Image recognition

Now let's implement the Lambda function responsible for sending the image to Amazon Rekognition for processing, storing the results in a DynamoDB table, and sending a message with those results as JSON to a second Amazon SNS topic. Open the image_recognition.py file from the python/recognition/runtime/ directory. This file contains an empty Lambda function handler and the input parameters needed by this Lambda function:

queue_url is the URL of the Amazon SQS queue to which this Lambda function is subscribed,

table_name is the name of the DynamoDB table, and

topic_arn is the ARN of the Amazon SNS topic to which this Lambda function publishes.

Using CodeWhisperer, implement the business logic of the next Lambda function as you did in the previous section. For example, to detect the labels from an image using Amazon Rekognition, write the following comment:

# Detect labels from image with Rekognition

And as a result, CodeWhisperer should give you a code snippet similar to the one in the following image:

Figure 4. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called detect_labels with the implementation suggestion to use the Rekognition SDK to detect labels on the given image.
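Again, your generated code will likely differ. A hand-written sketch of a Rekognition label-detection call (assuming the image is read from S3; the parameter names are illustrative) might look like:

import boto3

rekognition_client = boto3.client("rekognition")

def detect_labels(bucket, key, max_labels=10):
    """Run Rekognition label detection on an image stored in S3."""
    response = rekognition_client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
    )
    return [label["Name"] for label in response["Labels"]]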

You can continue generating the other functions that you need to fully implement the business logic of your Lambda function. Here are some examples that you can use:

# Save labels to DynamoDB

# Publish item to SNS

# Delete message from SQS

Following the same approach, open the list_images.py file from the python/recognition/runtime/ directory to implement the logic to list all of the labels from the DynamoDB table. As you did previously, type a comment in plain English:

# Function to list all items from a DynamoDB table
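Putting the pieces together, hand-written sketches of the helpers those comments might produce (not CodeWhisperer output; the table's partition key name and the exact function signatures are assumptions) could look like this:

import json
import boto3

dynamodb = boto3.resource("dynamodb")
sns_client = boto3.client("sns")
sqs_client = boto3.client("sqs")

def save_labels(table_name, image_name, labels):
    """Save the detected labels for an image to the DynamoDB table."""
    table = dynamodb.Table(table_name)
    table.put_item(Item={"image": image_name, "labels": labels})

def publish_item(topic_arn, image_name, labels):
    """Publish the recognition results as JSON to the SNS topic."""
    sns_client.publish(
        TopicArn=topic_arn,
        Message=json.dumps({"image": image_name, "labels": labels}),
    )

def delete_message(queue_url, receipt_handle):
    """Delete the processed message from the SQS queue."""
    sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)

def list_items(table_name):
    """List all items (images and their labels) from the DynamoDB table."""
    table = dynamodb.Table(table_name)
    response = table.scan()
    items = response["Items"]
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items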

Other frequently used code

Interacting with AWS isn’t the only way that you can leverage CodeWhisperer. You can use it to implement repetitive tasks, such as creating unit tests and converting message formats, or to implement algorithms like sorting and string matching and parsing. The last Lambda function that we’ll implement as part of this post is to convert a JSON payload received from Amazon SQS to XML. Then, we’ll POST this XML to an HTTP endpoint.

Open the send_email.py file from the python/integration/runtime/ directory. This file contains an empty Lambda function handler. An event is a JSON-formatted document that contains data for a Lambda function to process. Type a comment with your intent to get the code snippet:

# Transform json to xml

As CodeWhisperer uses the context of your files to generate code, depending on the imports that you have on your file, you’ll get an implementation such as the one in the following image:

Figure 5. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called json_to_xml with the implementation suggestion to transform JSON payload into XML payload.

Repeat the same process with a comment such as # Send XML string with HTTP POST to get the last function implementation. Note that the email server isn’t part of this implementation. You can mock it, or simply ignore this HTTP POST step. Lastly, wire up both functions in the Lambda handler function by calling each function with the correct inputs.
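If you'd like a concrete reference, here is a hand-written sketch (not CodeWhisperer output) of what those two functions could look like. It handles only flat JSON objects, and the endpoint URL is a placeholder for the fictitious e-mail server:

import json
import urllib.request
import xml.etree.ElementTree as ET

def json_to_xml(json_string, root_tag="message"):
    """Convert a flat JSON object into a simple XML string."""
    data = json.loads(json_string)
    root = ET.Element(root_tag)
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

def send_xml_post(url, xml_payload):
    """Send the XML payload to the e-mail server endpoint via HTTP POST."""
    request = urllib.request.Request(
        url,
        data=xml_payload.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status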

Deploy and test the application

To deploy the application, run the command cdk deploy --all. You should get a confirmation message, and after a few minutes your application will be up and running on your AWS account. As outputs, the APIStack and RekognitionStack will print the API Gateway endpoint URLs. It will look similar to this example:

Outputs:

APIStack.RESTAPIEndpoint01234567 = https://examp1eid0.execute-api.{your-region}.amazonaws.com/prod/

The first endpoint expects two string parameters: url (the image file URL to download) and name (the target file name that will be stored on the S3 bucket). Use any image URL you like, but remember that you must encode an image URL before passing it as a query string parameter to escape the special characters. Use an online URL encoder of your choice for that. Then, use the curl command to invoke the API Gateway endpoint:

curl -X GET 'https://examp1eid0.execute-api.eu-east-2.amazonaws.com/prod?url={encoded-image-URL}&name={file-name}'

Replace {encoded-image-URL} and {file-name} with the corresponding values. Also, make sure that you use the correct API endpoint that you’ve noted from the AWS CDK deploy command output as mentioned above.
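If you prefer not to use an online URL encoder, a quick Python snippet does the same job (the image URL below is just a placeholder):

from urllib.parse import quote

image_url = "https://example.com/images/cat.png"  # hypothetical image URL
encoded = quote(image_url, safe="")
print(encoded)  # https%3A%2F%2Fexample.com%2Fimages%2Fcat.png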

It will take a few seconds for the processing to happen in the background. Once it’s ready, see what has been stored in the DynamoDB table by invoking the List Images API (make sure that you use the correct URL from the output of your deployed AWS CDK stack):

curl -X GET 'https://examp1eid7.execute-api.eu-east-2.amazonaws.com/prod'

After you’re done, to avoid unexpected charges to your account, make sure that you clean up your AWS CDK stacks. Use the cdk destroy command to delete the stacks.

Conclusion

In this post, we’ve seen how to get a significant productivity boost with the help of ML. With that, as a developer, you can stay focused on your IDE and reduce the time that you spend searching online for code snippets that are relevant for your use case. Writing comments in natural language, you get context-based snippets to implement full-fledged applications. In addition, CodeWhisperer comes with a mechanism called reference tracker, which detects whether a code recommendation might be similar to particular CodeWhisperer training data. The reference tracker lets you easily find and review that reference code and see how it’s used in the context of another project. Lastly, CodeWhisperer provides the ability to run scans on your code (generated by CodeWhisperer as well as written by you) to detect security vulnerabilities.

During the preview period, CodeWhisperer is available to all developers across the world for free. Get started with the free preview on JetBrains, VS Code or AWS Cloud9.

About the author:

Rafael Ramos

Rafael is a Solutions Architect at AWS, where he helps ISVs on their journey to the cloud. He spent over 13 years working as a software developer, and is passionate about DevOps and serverless. Outside of work, he enjoys playing tabletop RPG, cooking and running marathons.

Caroline Gluck

Caroline is an AWS Cloud application architect based in New York City, where she helps customers design and build cloud native data science applications. Caroline is a builder at heart, with a passion for serverless architecture and machine learning. In her spare time, she enjoys traveling, cooking, and spending time with family and friends.

Jason Varghese

Jason is a Senior Solutions Architect at AWS guiding enterprise customers on their cloud migration and modernization journeys. He has served in multiple engineering leadership roles and has over 20 years of experience architecting, designing and building scalable software solutions. Jason holds a bachelor’s degree in computer engineering from the University of Oklahoma and an MBA from the University of Central Oklahoma.

Dmitry Balabanov

Dmitry is a Solutions Architect with AWS where he focuses on building reusable assets for customers across multiple industries. With over 15 years of experience in designing, building, and maintaining applications, he still loves learning new things. When not at work, he enjoys paragliding and mountain trekking.

re:Invent 2022 DevOps and Developer Productivity Playlist

Danielle Kucera, Karun Bakshi, and I were privileged to organize the DevOps and Developer Productivity (DOP) track for re:Invent 2022. For 2022, the DOP track included 58 sessions and nearly 100 speakers.  If you weren’t able to attend, I have compiled a list of the on-demand sessions for you below.

Leadership Sessions

Delighting developers: Builder experience at AWS Adam Seligman, Vice President of Developer Experience, and Emily Freeman, Head of Community Development, share the latest AWS tools and experiences for teams developing in the cloud. Adam recaps the latest launches and demos how key services can integrate to accelerate developer productivity.

Amazon CodeCatalyst

Amazon CodeCatalyst, announced during Dr. Werner Vogels' keynote, is a unified software development service that makes it faster to build and deliver on AWS.

Introducing Amazon CodeCatalyst – Harry Mower, Director of DevOps Services, and Doug Clauson, Product Manager, provide an overview of Amazon CodeCatalyst. CodeCatalyst provides one place where you can plan work, collaborate on code, and build, test, and deploy applications with nearly continuous integration/continuous delivery (CI/CD) tools.

Deep dive on CodeCatalyst Workspaces – Tmir Karia, Sr. Product Manager, and Rahul Gulati, Sr. Product Manager,  discuss how Amazon CodeCatalyst Workspaces decreases the time you spend creating and maintaining a local development environment and allows you to quickly set up a cloud development workspace, switch between projects, and replicate the development workspace configuration across team members.

DevOps

AWS Well-Architected best practices for DevOps on AWS Elamaran Shanmugam, Sr. Container Specialist, and Deval Perikh, Sr. Enterprise Solutions Architect, discuss the components required to align your DevOps practices to the pillars of the AWS Well-Architected Framework.

Best practices for securing your software delivery lifecycle James Bland, Principal Solutions Architect, and Curtis Rissi, Principal Solutions Architect, discuss ways you can secure your CI/CD pipeline on AWS. Review topics like security of the pipeline versus security in the pipeline, ways to incorporate security checkpoints across various pipeline stages, security event management, and aggregating vulnerability findings into a single pane of glass.

Build it & run it: Streamline your DevOps capabilities with machine learning Rafael Ramos, Shivansh Singh, and Jared Reimer discuss how to use machine learning–powered tools like Amazon CodeWhisperer, Amazon CodeGuru, and Amazon DevOps Guru to boost your applications’ availability and write software faster and more reliably.

Infrastructure as Code

AWS infrastructure as code: A year in review Tatiana Cooke, Principal Product Manager, and Ben Perak, Principal Product Manager, discuss the new features and improvements for AWS infrastructure as code with AWS CloudFormation and AWS CDK.

How to reuse patterns when developing infrastructure as code Ryan Bachman, Ethan Rucinski, and Ravi Palakodeti explore AWS Cloud Development Kit (AWS CDK) constructs and AWS CloudFormation modules and how they make it easier to build applications on AWS.

Governance and security with infrastructure as code David Hessler, Senior DevOps Consultant, and Eric Beard, Senior Solutions Architect, discuss how to use AWS CloudFormation and the AWS CDK to deploy cloud applications in regulated environments while enforcing security controls.

Developer Productivity

Building on AWS with AWS tools, services, and SDKs Kyle Thomson, Senior Software Development Engineer and Deval Parikh, Senior Solutions Architect, discuss the ways developers can set up secure development environments and use their favorite IDEs to interact with, and deploy to, the AWS Cloud.

The Amazon Builders’ Library: 25 years of operational excellence at Amazon Colm MacCarthaigh, Distinguished Engineer, and David Yanacek, Sr. Principal Engineer, discuss how Amazon practices have changed and improved over time and what we’ve learned as builders and as operators.

Sustainability in the cloud with Rust and AWS Graviton Emil Lerch, Principal DevOps Specialist, and Esteban Kuber, Principal Engineer, discuss the benefits of Rust and AWS Graviton that can reduce energy consumption and increase productivity.

About the author:

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

Caching NextJS Apps with Serverless Redis using Upstash

The modern applications we build today are sophisticated. Every time a user loads a webpage, their browser needs to download a large amount of data in order to display that page. A website may consist of millions of data points and serve hundreds of API calls. For the data to move smoothly between server and client with minimal delay, we can follow many strategies. As developers, we want our app to deliver the best user experience possible, and to achieve this we can employ a variety of techniques.

There are a number of ways we can address this situation. The best optimization would be to apply techniques that reduce the latency of read/write operations on the database. One of the most popular ways to optimize our API calls is by implementing a caching mechanism.

What is Caching?

Caching is the process of storing copies of files in a cache, or temporary storage location so that they can be accessed more quickly. Technically, a cache is any temporary storage location for copies of files or data, but the term is often used in reference to Internet technologies.

By Cloudflare.com

The most common example of caching is the browser cache, which stores frequently accessed website resources locally so that the browser does not have to retrieve them over the network each time they are needed. Caching can remove performance bottlenecks from our web applications. When dealing with heavy network traffic and large API calls, this technique can be one of the best options for performance optimization.

Redis: Caching in Server-side

When we talk about caching on the server side, one of the pioneers among caching-oriented databases is Redis. Redis (REmote DIctionary Server) is an open-source NoSQL in-memory key-value data store. One of the best things about Redis is that we can persist data so that it remains stored unless we delete or flush it manually. Because it is an in-memory database, its data access operations are faster than those of any disk-based database, which makes Redis an excellent choice for caching.

Redis can also be used as a primary database if needed. With the help of Redis, cached data can be accessed and re-accessed as many times as needed without running the database query again. Depending on the Redis cache setup, this data can stay in memory for a few minutes, a few hours, or longer. We can even set an expiration time for our cached entries, which we will implement in our demo application.

Redis is able to handle huge amounts of data in real time, making use of its in-memory data storage capabilities to help support highly responsive database constructs. Caching with Redis allows for fewer database accesses, which helps to reduce the amount of traffic and the number of instances required, even achieving sub-millisecond latency.

We will implement Redis in our Next application and see the performance gain we can achieve.

Let’s dive into it.

Initializing our Project

Before we begin I assume you have Node installed on your machine so that you can follow along with the steps involved. We will use Next for our project because it helps us write front-end and back-end logic with no configuration needed. We will create a starter project with the following command:

$ npx create-next-app@latest --typescript

When prompted, give the project your desired name. After the project has been created for us, we can add the dependencies we need for this demo application.

$ npm i ioredis @chakra-ui/core @emotion/core @emotion/styled emotion-theming
$ npm i --save-dev @types/node @types/ioredis

The commands above install all the dependencies we will use in this project. We will use ioredis to communicate with our Redis database and Chakra UI to style things up.

Since we are using TypeScript for our project, we also need to install the type definitions for Node and ioredis, which we did in the second command as local dev dependencies.

Setting up Redis with Upstash

We need to connect our application to Redis. You can run Redis locally and connect to it from your application, or use a Redis cloud instance. For this demo, we will be using Upstash Redis.

Upstash is a serverless database service for Redis. With traditional servers/instances, you pay per hour or a fixed price; with serverless, you pay per request. This means we are not charged when the database is not in use. Upstash configures and manages the database for you.

Head over to the official Upstash website and start with the free plan; for our demo purposes, we don't need to pay. After creating your new account, visit the Upstash console and create a new serverless Redis database.

You can find an example ioredis connection string in the Upstash dashboard. Copy the URL shown under the blue overlay. We will use this connection string to connect to the serverless Redis instance provided in the free tier by Upstash.

import Redis from "ioredis";
export const redisConnect = new Redis(process.env.REDIS_URL);

In the snippet above, we connected our app to the database using the REDIS_URL environment variable. We can now use the Redis server instance provided by Upstash inside our app.

Populating static data

The application we are building might not be an exact real-world use case, but it lets us see the caching performance improvement Redis can bring to an application and how it's done.

Here we are building a Pokemon application where users can select a Pokemon from a list and choose to see its details. We will cache the visited Pokemon; in other words, if users visit the same Pokemon twice, they will receive the cached result.

Let’s populate some data inside of our Pokemon options.

export const getStaticProps: GetStaticProps = async () => {
  const res = await fetch(
    'https://pokeapi.co/api/v2/pokemon?limit=200&offset=200'
  );
  const { results }: GetPokemonResults = await res.json();

  return {
    props: {
      pokemons: results,
    },
  };
};

We are making a call to the PokeAPI endpoint to fetch the names of all Pokemon. getStaticProps helps us fetch data at build time. The getStaticProps() function returns the props needed by the Home component to render pages that are generated at build time, not at runtime, and are static.

const Home: NextPage<{ pokemons: Pokemons[] }> = ({ pokemons }) => {
  const [selectedPokemon, setSelectedPokemon] = useState<string>('');
  const toast = useToast();
  const router = useRouter();

  const handelSelect = (e: any) => {
    setSelectedPokemon(e.target.value);
  };

  const searchPokemon = () => {
    if (selectedPokemon === '')
      return toast({
        title: 'No pokemon selected',
        description: 'You need to select a pokemon to search.',
        status: 'error',
        duration: 3000,
        isClosable: true,
      });
    router.push(`/details/${selectedPokemon}`);
  };

  return (
    <div className={styles.container}>
      <main className={styles.main}>
        <Box my="10">
          <FormControl>
            <Select
              id="country"
              placeholder={
                selectedPokemon ? selectedPokemon : 'Select a pokemon'
              }
              onChange={handelSelect}
            >
              {pokemons.map((pokemon, index) => {
                return <option key={index}>{pokemon.name}</option>;
              })}
            </Select>
            <Button
              colorScheme="teal"
              size="md"
              ml="3"
              onClick={searchPokemon}
            >
              Search
            </Button>
          </FormControl>
        </Box>
      </main>
    </div>
  );
};

We have successfully populated static data inside our dropdown to select a Pokemon. Let's implement a redirect to a dynamic route when we select a Pokemon name and click the search button.

Adding dynamic page

Creating a dynamic page in Next is simple, as it provides a file-based routing structure that we can leverage to add dynamic routes. Let's create a details page for our Pokemon.

const PokemonDetail: NextPage<{ info: PokemonDetailResults }> = ({ info }) => {
  return (
    <div>
      {/* map our data here */}
    </div>
  );
};

export const getServerSideProps: GetServerSideProps = async (context) => {
  const { id } = context.query;
  const name = id as string;
  const data = await fetch(`https://pokeapi.co/api/v2/pokemon/${name}`);
  const response: PokemonDetailResults = await data.json();

  return {
    props: {
      info: response,
    },
  };
};

We used getServerSideProps, which means we are using the server-side rendering provided by Next: the page is pre-rendered on each request using the data returned by getServerSideProps. This comes in handy when we want to fetch data that changes often and have the page updated to show the most current data. After receiving the data, we map over it to display it on the screen.

Up to this point we have not actually implemented a caching mechanism in our project. Each time the user visits the page, we hit the API endpoint and send back the data they requested. Let's move ahead and implement caching in our application.

Caching data

To implement caching, we first want to read from our Redis database. As discussed, Redis stores its data as key-value pairs. We will check whether the requested key is stored in Redis and feed the client the respective data. To achieve this, we will create a function that reads Redis for the key the client is requesting.

export const fetchCache = async <T>(key: string, fetchData: () => Promise<T>) => {
  const cachedData = await getKey(key);
  if (cachedData) return cachedData;
  return setValue(key, fetchData);
};

When the client requests data they have not visited yet, we serve them a copy of the data from the origin server and, behind the scenes, store a copy inside our Redis database, so that we can serve the data quickly from Redis on the next request.

We will write a function that takes in a key; if the key exists in the database, it returns the parsed value to the client.

const getKey = async <T>(key: string): Promise<T | null> => {
  const result = await redisConnect.get(key);
  if (result) return JSON.parse(result);
  return null;
};

We also need a function that takes in a key and stores a new value for that key in our database, but only if we don't already have that key stored in Redis.

const setValue = async <T>(key: string, fetchData: () => Promise<T>): Promise<T> => {
  const setValue = await fetchData();
  await redisConnect.set(key, JSON.stringify(setValue));
  return setValue;
};

So far we have written everything we need to implement caching. All that's left is to invoke the function in our dynamic page. Inside [id].tsx we will make a minor tweak so that the API call is made only if we don't have the requested key in Redis.

For this to happen, we need to pass a fetch function as an argument to our fetchCache function.

export const getServerSideProps: GetServerSideProps = async (context) => {
  const { id } = context.query;
  const name = id as string;

  const fetchData = async () => {
    const data = await fetch(`https://pokeapi.co/api/v2/pokemon/${name}`);
    const response: PokemonDetailResults = await data.json();
    return response;
  };

  const cachedData = await fetchCache(name, fetchData);

  return {
    props: {
      info: cachedData,
    },
  };
};

We made a few tweaks to the code we wrote before: we imported and used the fetchCache function inside the dynamic page. This function takes the fetch function as an argument and checks for the key accordingly.

Adding expiry

The expiration policy employed by a cache is another factor that helps determine how long a cached item is retained. The expiration policy is usually assigned to the object when it is added to the cache. This can also be customized according to the type of object that’s being cached. A common strategy involves assigning an absolute time of expiration to each object when it is added to the cache. Once that time passes, the item is removed from the cache accordingly.

Let’s also use the caching expiration feature of Redis in our Application. To implement this we just need to add a parameter to our fetchCache function.

const cachedData = await fetchCache(name, fetchData, 60 * 60 * 24);
return {
  props: {
    info: cachedData,
  },
};

export const fetchCache = async (key: string, fetchData: () => Promise<unknown>, expiresIn: number) => {
  const cachedData = await getKey(key);
  if (cachedData) return cachedData;
  return setValue(key, fetchData, expiresIn);
};

const setValue = async <T>(key: string, fetchData: () => Promise<T>, expiresIn: number): Promise<T> => {
  const setValue = await fetchData();
  await redisConnect.set(key, JSON.stringify(setValue), "EX", expiresIn);
  return setValue;
};

For each key that is stored in our Redis database, we have added an expiry time of one day. When that time elapses, Redis automatically removes the object from the cache so that it can be refreshed by calling the API again. This really helps when we want to serve the client fresh data on a regular basis.

Performance testing

All of this effort was for the sake of our app's performance and optimization. Let's take a look at our application's performance.

This might not be a meaningful performance test for a small application, but an app serving thousands of API calls over a large data set can see a big advantage.

I will use the perf_hooks module to measure the time it takes our Next lambda to complete an invocation. This module is not provided by Next; it is imported from Node. With these APIs, you can measure the time it takes individual dependencies to load, how long your app takes to initially start, and even how long individual web service API calls take. This allows you to make more informed decisions about the efficiency of specific code blocks or even algorithms.

import { performance } from "perf_hooks";

const startPerfTimer = (): number => {
  return performance.now();
};

const endPerfTimer = (): number => {
  return performance.now();
};

const calculatePerformance = (startTime: number, endTime: number): void => {
  console.log(`Response took ${endTime - startTime} milliseconds`);
};

Creating a function for a single line of code may be overkill, but it means we can reuse these functions in our application whenever needed. We will add these function calls to our application and look at the resulting latency in milliseconds (ms), which can impact our app's performance overall.

In the above screenshot, we can see an improvement of several milliseconds in fetching the response. This may be a small improvement in the small application we have built, but it can be a huge time and performance boost, especially when working with large datasets.

Conclusion

Data-heavy applications need caching to improve response times and even to reduce the cost of data volume and bandwidth. With the help of Redis, we can avoid expensive database operations, third-party API calls, and server-to-server requests by keeping a copy of previous responses in our Redis instance.

In some cases, we might need to delegate caching to other applications, microservices, or any form of key-value storage system that allows us to store data and use it when needed. We chose Redis since it is open source and very popular in the industry. Redis's other cool features include data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs, and many more.

I highly recommend visiting the Redis documentation here to gain an in-depth understanding of the other features provided out of the box. Now we can go forth and use Redis to cache frequently queried data in our applications and gain a considerable performance boost.

Please find the code repository here.

Happy coding!

The post Caching NextJS Apps with Serverless Redis using Upstash appeared first on the Flatlogic Blog.