Fully Automated Deployment of an Open Source Mail Server on AWS

Many AWS customers have the requirement to host their own email solution and prefer to operate their own mail servers rather than use fully managed solutions (e.g., Amazon WorkMail). While there are certainly merits to either approach, motivations for self-managing often include:

full ownership and control
the need for customized configuration
restricting access to internal networks or to clients connected via VPN

To achieve this, customers frequently rely on open source mail servers due to their flexibility and free licensing compared to proprietary solutions like Microsoft Exchange. However, running and operating open source mail servers can be challenging, as several components need to be configured and integrated to provide the desired end-to-end functionality. For example, a fully functional email system combines dedicated software packages to send and receive email, access and manage inboxes, filter spam, manage users, and so on. This makes the setup complex and error-prone to configure and maintain, and troubleshooting issues often calls for expert knowledge. As a result, several open source projects have emerged that aim to simplify the setup of open source mail servers, such as Mail-in-a-Box, Docker Mailserver, Mailu, Modoboa, iRedMail, and several others.

In this blog post, we take this one step further by adding infrastructure automation and integrations with AWS services to fully automate the deployment of an open source mail server on AWS. We present an automated setup of a single instance mail server, striving for minimal complexity and cost, while still providing high resiliency by leveraging incremental backups and automations. As such, the solution is best suited for small to medium organizations that are looking to run open source mail servers but do not want to deal with the associated operational complexity.

The solution in this blog uses AWS CloudFormation templates to automatically set up and configure an Amazon Elastic Compute Cloud (Amazon EC2) instance running Mail-in-a-Box, which integrates features such as email, webmail, calendar, contacts, and file sharing, thus providing functionality similar to popular SaaS tools or commercial solutions. All resources to reproduce the solution are provided in a public GitHub repository under an open source license (MIT-0).

Amazon Simple Storage Service (Amazon S3) is used both for offloading user data and for storing incremental application-level backups. Aside from high resiliency, this backup strategy enables an immutable infrastructure approach, where new deployments can be rolled out to implement updates and recover from failures, which drastically simplifies operation and enhances security.

We also provide an optional integration with Amazon Simple Email Service (Amazon SES) so customers can relay their emails through reputable AWS servers and have their outgoing email accepted by third-party servers. All of this enables customers to deploy a fully featured open source mail server within minutes from the AWS Management Console, or restore an existing server from an Amazon S3 backup for immutable upgrades, migration, or recovery purposes.

Overview of Solution

The following diagram shows an overview of the solution and interactions with users and other AWS services.

After preparing the AWS account and environment, an administrator deploys the solution using an AWS CloudFormation template (1.). Optionally, a backup from Amazon S3 can be referenced during deployment to restore a previous installation of the solution (1a.). The admin can then proceed with the setup by accessing the web UI (2.) to, for example, provision TLS certificates and create new users. After the admin has provisioned their accounts, users can access the web interface (3.) to send email, manage their inboxes, access calendar and contacts, and share files. Optionally, outgoing emails are relayed via Amazon SES (3a.) and user data is stored in a dedicated Amazon S3 bucket (3b.). Furthermore, the solution is configured to automatically and periodically create incremental backups and store them in an S3 bucket for backups (4.).

On top of popular open source mail server packages such as Postfix for SMTP and Dovecot for IMAP, Mail-in-a-Box integrates Nextcloud for calendar, contacts, and file sharing. However, note that Nextcloud capabilities in this context are limited. It’s primarily intended to be used alongside the core mail server functionality to maintain calendar and contacts and for lightweight file sharing (e.g., sharing files via links that are too large for email attachments). If you are looking for a fully featured, customizable, and scalable Nextcloud deployment on AWS, have a look at this AWS Sample instead.

Deploying the Solution

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account

An existing external email address to test your new mail server. In the context of this sample, we will use [email protected] as the address.
A domain that can be exclusively used by the mail server in this sample. We will use aws-opensource-mailserver.org as the domain. If you don’t have a domain available, you can register a new one with Amazon Route 53. If you do, you can go ahead and delete the associated hosted zone that gets created automatically, via the Amazon Route 53 console. We won’t need this hosted zone, because the mail server we deploy will also act as the Domain Name System (DNS) server for the domain.
An SSH key pair for command line access to the instance. Command line access to the mail server is optional in this tutorial, but a key pair is still required for the setup. If you don’t already have a key pair, go ahead and create one in the EC2 Management Console.

(Optional) In this blog, we verify end-to-end functionality by sending an email to a single email address ([email protected]) leveraging Amazon SES in sandbox mode. If you want to adapt this sample for your use case and send email beyond that, you need to request removal of the email sending limitations for EC2 or, alternatively, if you relay your mail via Amazon SES, request moving out of the Amazon SES sandbox.

Preliminary steps: Setting up DNS and creating S3 Buckets

Before deploying the solution, we need to set up DNS and create Amazon S3 buckets for backups and user data.

1.     Allocate an Elastic IP address: We use the address 52.6.x.y in this sample.
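
If you prefer working from the command line, a hedged equivalent using the AWS CLI is shown below (the returned address will differ from the example in this post):

aws ec2 allocate-address --domain vpc --query PublicIp --output text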

2.     Configure DNS: If you have your domain registered with Amazon Route 53, you can use the AWS Management Console to change the name server and glue records for your domain. Configure two DNS servers ns1.box.<your-domain> and ns2.box.<your-domain> by placing your Elastic IP (allocated in step 1) into the Glue records field for each name server:

If you use a third-party DNS service, check their corresponding documentation on how to set the glue records.

It may take a while until the updates to the glue records propagate through the global DNS system. Optionally, before proceeding with the deployment, you can verify your glue records are setup correctly with the dig command line utility:

# Get the list of authoritative name servers for your top-level domain
dig +short org. NS
# Query one of these name servers for an NS record of your domain
dig @c0.org.afilias-nst.info. aws-opensource-mailserver.org. NS

This should give you output as follows:

;; ADDITIONAL SECTION:
ns1.box.aws-opensource-mailserver.org. 3600 IN A 52.6.x.y
ns2.box.aws-opensource-mailserver.org. 3600 IN A 52.6.x.y

3.     Create S3 buckets for backups and user data: Finally, in the Amazon S3 console, create a bucket to store Nextcloud data and another bucket for backups, choosing globally unique names for both. In the context of this sample, we will be using the two buckets (aws-opensource-mailserver-backup and aws-opensource-mailserver-nextcloud) as shown here:
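
As an alternative to the console, the two buckets can be created with the AWS CLI, sketched here with this post’s example names (substitute your own globally unique bucket names):

aws s3 mb s3://aws-opensource-mailserver-backup
aws s3 mb s3://aws-opensource-mailserver-nextcloud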

Deploying and Configuring Mail-in-a-Box

Launch the solution’s CloudFormation template (the launch link is provided in the GitHub repository accompanying this post) and specify the parameters as shown in the below screenshot to match the resources created in the previous section. Leave other parameters at their default values, then click Next and Submit.

This will deploy your mail server into a public subnet of your default VPC, which takes about 10 minutes. You can monitor the progress in the AWS CloudFormation Console. Meanwhile, retrieve and note the admin password for the web UI from AWS Systems Manager Parameter Store via the MailInABoxAdminPassword parameter.
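
If you prefer the command line, a hedged equivalent for retrieving the password is (assuming the parameter is stored under the name MailInABoxAdminPassword, as above; the exact parameter path may differ in the template):

aws ssm get-parameter --name MailInABoxAdminPassword --with-decryption --query Parameter.Value --output text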

Roughly one minute after your mail server finishes deploying, you can log in at its admin web UI residing at https://52.6.x.y/admin with username admin@<your-domain>, as shown in the following picture (you need to confirm the certificate exception warning from your browser):

Finally, in the admin UI, navigate to System > TLS (SSL) Certificates and click Provision to obtain a valid SSL certificate and complete the setup (you might need to click Provision twice to have all domains included in your certificate, as shown here).

At this point, you could further customize your mail server setup (e.g., by creating inboxes for additional users). However, we will continue to use the admin user in this sample for testing the setup in the next section.

Note: If your AWS account is subject to email sending restrictions on EC2, you will see an error in your admin dashboard under System > System Status Checks that says ‘Incoming Email (SMTP/postfix) is running but not publicly accessible’. You are safe to ignore this and should be able to receive emails regardless.

Testing the Solution

Receiving Email

With your existing email account, compose and send an email to admin@<your-domain>. Then login as admin@<your-domain> to the webmail UI of your AWS mail server at https://box.<your-domain>/mail and verify you received the email:

Test file sharing, calendar and contacts with Nextcloud

Your Nextcloud installation can be accessed at https://box.<your-domain>/cloud, as shown in the next figure. Here you can manage your calendar, contacts, and shared files. Contacts created and managed here are also accessible in your webmail UI when you compose an email. Refer to the Nextcloud documentation for more details. In order to keep your Nextcloud installation consistent and automatically managed by the Mail-in-a-Box setup scripts, admin users are advised to refrain from changing and customizing the Nextcloud configuration.

Sending Email

For this sample, we use Amazon SES to forward your outgoing email, as this is a simple way to get the emails you send accepted by other mail servers on the web. Achieving this is not trivial otherwise, as several popular email services tend to block public IP ranges of cloud providers.

Alternatively, if your AWS account has had the email sending limitations for EC2 removed, you can send emails directly from your mail server. In this case, you can skip the next section and continue with Send test email, but make sure you’ve deployed your mail server stack with the SesRelay parameter set to false. In that case, you can also bring your own IP addresses to AWS to continue using your reputable addresses or to build reputation for addresses you own.

Verify your domain and existing Email address to Amazon SES

In order to use Amazon SES to accept and forward email for your domain, you first need to prove ownership of it. Navigate to Verified identities in the Amazon SES console, click Create identity, select Domain, and enter your domain. You will then be presented with a screen as shown here:

You now need to copy-paste the three CNAME DNS records from this screen over to your mail server admin dashboard. Open the admin web UI of your mail server again, select System > Custom DNS, and add the records as shown in the next screenshot.

Amazon SES will detect these records, thereby recognizing you as the owner and verifying the domain for sending emails. Similarly, while still in sandbox mode, you also need to verify ownership of the recipient email address. Navigate again to Verified Identities in the Amazon SES Console, click Create identity, choose Email Address, and enter your existing email address.

Amazon SES will then send a verification link to this address, and once you’ve confirmed via the link that you own this address, you can send emails to it.
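
If you would rather script these verification steps, the SES v2 CLI offers equivalent commands, sketched here with this post’s example domain (replace the address placeholder with your own; the domain identity still requires the CNAME records described above for verification):

aws sesv2 create-email-identity --email-identity aws-opensource-mailserver.org
aws sesv2 create-email-identity --email-identity <your-existing-email-address>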

Summing up, your verified identities section should look similar to the next screenshot before sending the test email:

Finally, if you intend to send email to arbitrary addresses with Amazon SES beyond testing in the next step, refer to the documentation on how to request production access.

Send test email

Now you are set to log back into your webmail UI and reply to the test mail you received before:

Checking the inbox of your existing email account, you should see the mail you just sent from your AWS server.

Congratulations! You have now verified full functionality of your open source mail server on AWS.

Restoring from backup

As a last step, we demonstrate how to roll out immutable deployments and restore from a backup for simple recovery, migration, and upgrades. In this context, we test recreating the entire mail server from a backup stored in Amazon S3.

For that, we use the restore feature of the CloudFormation template we deployed earlier to migrate from the initial t2.micro installation to an AWS Graviton arm64-based t4g.micro instance. This exemplifies the power of the immutable infrastructure approach made possible by the automated application level backups, allowing for simple migration between instance types with different CPU architectures.

Verify you have a backup

By default, your server is configured to create an initial backup upon installation and nightly incremental backups. Using your SSH key pair, you can connect to your instance and trigger a manual backup to make sure the emails you just sent and received during testing are included in the backup (the instance runs Ubuntu, so connect as the ubuntu user at your Elastic IP):

ssh -i aws-opensource-mailserver.pem ubuntu@52.6.x.y sudo /opt/mailinabox/management/backup.py

You can then go to your mail server’s admin dashboard at https://box.<your-domain>/admin and verify the backup status under System > Backup Status:

Recreate your mail server and restore from backup

First, double check that you have saved the admin password, as you will no longer be able to retrieve it from Parameter Store once you delete the original installation of your mail server. Then go ahead and delete the aws-opensource-mailserver stack from your CloudFormation Console and redeploy it via the same launch link as before. However, this time, adapt the parameters as shown below, changing the instance type and corresponding AMI as well as specifying the prefix in your backup S3 bucket to restore from.

Within a couple of minutes, your mail server will be up and running again, featuring the exact same state it was in before you deleted it, but now running on a completely new instance powered by AWS Graviton. You can verify this by going to your webmail UI at https://box.<your-domain>/mail and logging in with your old admin credentials.

Cleaning up

Delete the mail server stack from the CloudFormation Console

Empty and delete both the backup and Nextcloud data S3 Buckets

Release the Elastic IP
If you registered your domain with Amazon Route 53 and do not want to hold onto it, you need to disable automatic renewal. Further, if you haven’t already, delete the hosted zone that was created automatically when registering it.

Outlook

The solution discussed so far focuses on minimal operational complexity and cost and hence is based on a single Amazon EC2 instance comprising all functions of an open source mail server, including a management UI, user database, Nextcloud, and DNS. With a suitably sized instance, this setup can meet the demands of small to medium organizations. In particular, the continuous incremental backups to Amazon S3 provide high resiliency and can be leveraged in conjunction with the CloudFormation automations to quickly recover in case of instance or single Availability Zone (AZ) failures.

Depending on your requirements, extending the solution and distributing components across AZs allows for meeting more stringent requirements regarding high availability and scalability in the context of larger deployments. Being based on open source software, there is a straightforward migration path towards these more complex distributed architectures once you outgrow the setup discussed in this post.

Conclusion

In this blog post, we showed how to automate the deployment of an open source mail server on AWS and how to quickly and effortlessly restore from a backup for rolling out immutable updates and providing high resiliency. Using AWS CloudFormation infrastructure automations and integrations with managed services such as Amazon S3 and Amazon SES, the lifecycle management and operation of open source mail servers on AWS can be simplified significantly. Once deployed, the solution provides an end-user experience similar to popular SaaS and commercial offerings.

You can go ahead and use the automations provided in this blog and the corresponding GitHub repository to get started with running your own open source mail server on AWS!


S3 URI Parsing is now available in AWS SDK for Java 2.x

The AWS SDK for Java team is pleased to announce the general availability of Amazon Simple Storage Service (Amazon S3) URI parsing in the AWS SDK for Java 2.x. You can now parse path-style and virtual-hosted-style S3 URIs to easily retrieve the bucket, key, region, style, and query parameters. The new parseUri() API and S3Uri class provide the highly requested parsing features that many customers missed from the AWS SDK for Java 1.x. Please note that Amazon S3 Access Points and Amazon S3 on Outposts URI parsing are not supported.

Motivation

Users often need to extract important components like bucket and key from stored S3 URIs to use in S3Client operations. The new parsing APIs allow users to conveniently do so, bypassing the need for manual parsing or storing the components separately.

Getting Started

To begin, first add the dependency for S3 to your project.

<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>s3</artifactId>
<version>${s3.version}</version>
</dependency>

Next, instantiate S3Client and S3Utilities objects.

S3Client s3Client = S3Client.create();
S3Utilities s3Utilities = s3Client.utilities();

Parsing an S3 URI

To parse your S3 URI, call parseUri() from S3Utilities, passing in the URI. This returns a parsed S3Uri object. If you have a String of the URI, you’ll need to convert it into a URI object first.

String url = "https://s3.us-west-1.amazonaws.com/myBucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88";
URI uri = URI.create(url);
S3Uri s3Uri = s3Utilities.parseUri(uri);

With the S3Uri, you can call the appropriate getter methods to retrieve the bucket, key, region, style, and query parameters. If the bucket, key, or region is not specified in the URI, an empty Optional will be returned. If query parameters are not specified in the URI, an empty map will be returned. If the field is encoded in the URI, it will be returned decoded.

Region region = s3Uri.region().orElse(null); // Region.US_WEST_1
String bucket = s3Uri.bucket().orElse(null); // "myBucket"
String key = s3Uri.key().orElse(null); // "resources/doc.txt"
boolean isPathStyle = s3Uri.isPathStyle(); // true

Retrieving query parameters

There are several APIs for retrieving the query parameters. You can return a Map<String, List<String>> of the query parameters. Alternatively, you can specify a query parameter to return the first value for the given query, or return the list of values for the given query.

Map<String, List<String>> queryParams = s3Uri.rawQueryParameters(); // {versionId=["abc123"], partNumber=["77", "88"]}
String versionId = s3Uri.firstMatchingRawQueryParameter("versionId").orElse(null); // "abc123"
String partNumber = s3Uri.firstMatchingRawQueryParameter("partNumber").orElse(null); // "77"
List<String> partNumbers = s3Uri.firstMatchingRawQueryParameters("partNumber"); // ["77", "88"]
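
Tying this back to the motivation, the parsed components can be passed straight into an S3Client operation. The following is an illustrative sketch, assuming the URI contains both a bucket and a key, and that GetObjectRequest (from the S3 model package) and java.nio.file.Paths are imported; the destination file name is arbitrary:

// Build a request from the parsed URI components and download the object to a local file
GetObjectRequest request = GetObjectRequest.builder()
    .bucket(s3Uri.bucket().orElseThrow())
    .key(s3Uri.key().orElseThrow())
    .build();
s3Client.getObject(request, Paths.get("doc.txt"));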

Caveats

Special Characters

If you work with object keys or query parameters that contain reserved or unsafe characters, they must be URL-encoded, e.g., replace whitespace " " with "%20".

Valid:
"https://s3.us-west-1.amazonaws.com/myBucket/object%20key?query=%5Bbrackets%5D"

Invalid:
"https://s3.us-west-1.amazonaws.com/myBucket/object key?query=[brackets]"

Virtual-hosted-style URIs

If you work with virtual-hosted-style URIs with bucket names that contain a dot, i.e., ".", the dot must not be URL-encoded.

Valid:
"https://my.Bucket.s3.us-west-1.amazonaws.com/key"

Invalid:
"https://my%2EBucket.s3.us-west-1.amazonaws.com/key"

Conclusion

In this post, I discussed parsing S3 URIs in the AWS SDK for Java 2.x and provided code examples for retrieving the bucket, key, region, style, and query parameters. To learn more about how to set up and begin using the feature, visit our Developer Guide. If you are curious about how it is implemented, check out the source code on GitHub. As always, the AWS SDK for Java team welcomes bug reports, feature requests, and pull requests on the aws-sdk-java-v2 GitHub repository.

How Zomato Boosted Performance 25% and Cut Compute Cost 30% Migrating Trino and Druid Workloads to AWS Graviton

Zomato is an India-based restaurant aggregator, food delivery, and dining-out company with over 350,000 listed restaurants across more than 1,000 cities in India. The company relies heavily on data analytics to enrich the customer experience and improve business efficiency. Zomato’s engineering and product teams use data insights to refine their platform’s restaurant and cuisine recommendations, improve the accuracy of waiting times at restaurants, speed up the matching of delivery partners, and improve the overall food delivery process.

At Zomato, different teams have different data discovery requirements based upon their business functions: for example, the number of orders placed in a specific area is required by a city lead team, queries resolved per minute by the customer support team, and the most searched dishes on special events or days by marketing and other teams. Zomato’s Data Platform team is responsible for building and maintaining a reliable platform that serves these data insights to all business units.

Zomato’s Data Platform is powered by AWS services including Amazon EMR, Amazon Aurora MySQL-Compatible Edition and Amazon DynamoDB along with open source software Trino (formerly PrestoSQL) and Apache Druid for serving the previously mentioned business metrics to different teams. Trino clusters process over 250K queries by scanning 2PB of data and Apache Druid ingests over 20 billion events and serves 8 million queries every week. To deliver performance at Zomato scale, these massively parallel systems utilize horizontal scaling of nodes running on Amazon Elastic Compute Cloud (Amazon EC2) instances in their clusters on AWS. Performance of both these data platform components is critical to support all business functions reliably and efficiently in Zomato. To improve performance in a cost-effective manner, Zomato migrated these Trino and Druid workloads onto AWS Graviton-based Amazon EC2 instances.

Graviton-based EC2 instances are powered by Arm-based AWS Graviton processors. They deliver up to 40% better price performance than comparable x86-based Amazon EC2 instances. CPU- and memory-intensive Java-based applications, including Trino and Druid, are suitable candidates for AWS Graviton based instances to optimize price-performance, as Java is well supported and generally performant out-of-the-box on arm64.

In this blog, we will walk you through an overview of Trino and Druid, how they fit into the overall Data Platform architecture, and the migration journey onto AWS Graviton based instances for these workloads. We will also cover the challenges faced during migration, how the Zomato team overcame those challenges, the business gains in terms of cost savings and better performance, and Zomato’s future plans for Graviton adoption across more workloads.

Trino overview

Trino is a fast, distributed SQL query engine for querying petabyte scale data, implementing massively parallel processing (MPP) architecture. It was designed as an alternative to tools that query Apache Hadoop Distributed File System (HDFS) using pipelines of MapReduce jobs, such as Apache Hive or Apache Pig, but Trino is not limited to querying HDFS only. It has been extended to operate over a multitude of data sources, including Amazon Simple Storage Service (Amazon S3), traditional relational databases and distributed data stores including Apache Cassandra, Apache Druid, MongoDB and more. When Trino executes a query, it does so by breaking up the execution into a hierarchy of stages, which are implemented as a series of tasks distributed over a network of Trino workers. This reduces end-to-end latency and makes Trino a fast tool for ad hoc data exploration over very large data sets.

Figure 1 – Trino architecture overview

The Trino coordinator is responsible for parsing statements, planning queries, and managing Trino worker nodes. Every Trino installation must have a coordinator alongside one or more Trino workers. Client applications, including Apache Superset and Redash, connect to the coordinator via Presto Gateway to submit statements for execution. The coordinator creates a logical model of a query involving a series of stages, which is then translated into a series of connected tasks running on a cluster of Trino workers. Presto Gateway acts as a proxy/load-balancer for multiple Trino clusters.

Druid overview

Apache Druid is a real-time database that powers modern analytics applications for use cases where real-time ingest, fast query performance, and high uptime are important. Druid processes are deployed on three types of server nodes: Master nodes govern data availability and ingestion; Query nodes accept queries, execute them across the system, and return the results; and Data nodes ingest and store queryable data. Broker processes receive queries from external clients and forward them to Data servers. Historicals are the workhorses that handle storage and querying of “historical” data. MiddleManager processes handle ingestion of new data into the cluster. Please refer here to learn more about the detailed Druid architecture design.

Figure 2 – Druid architecture overview

Zomato’s Data Platform Architecture on AWS

Figure 3 – Zomato’s Data Platform landscape on AWS

Zomato’s Data Platform covers data ingestion, storage, distributed processing (enrichment and enhancement), batch and real-time data pipeline unification, and a robust consumption layer, through which petabytes of data are queried daily for ad-hoc and near real-time analytics. In this section, we will explain the data flow of the pipelines serving data to the Trino and Druid clusters in the overall Data Platform architecture.

Data Pipeline-1: An Amazon Aurora MySQL-Compatible database is used to store data by various microservices at Zomato. Apache Sqoop on Amazon EMR runs Extract, Transform, Load (ETL) jobs at scheduled intervals to fetch data from the Aurora MySQL-Compatible database and transfer it to Amazon S3 in the Optimized Row Columnar (ORC) format, which is then queried by Trino clusters.

Data Pipeline-2: A Debezium Kafka connector deployed on Amazon Elastic Container Service (Amazon ECS) acts as a producer and continuously polls data from the Aurora MySQL-Compatible database. On detecting changes in the data, it identifies the change type and publishes the change data event to Apache Kafka in Avro format. Apache Flink on Amazon EMR consumes data from the Kafka topic, performs data enrichment and transformation, and writes it in ORC format to Iceberg tables on Amazon S3. Trino clusters then query the data from Amazon S3.

Data Pipeline-3: Moving away from other databases, Zomato decided to go serverless with Amazon DynamoDB because of its high performance (single-digit millisecond latency), request rate (millions per second), extreme scale matching Zomato’s peak expectations, economics (pay as you go), and data volume (TB, PB, EB) for their business-critical apps including Food Cart, Product Catalog, and Customer preferences. DynamoDB Streams publishes data from these apps to Amazon S3 in JSON format to serve this data pipeline. Apache Spark on Amazon EMR reads the JSON data, performs transformations including conversion into ORC format, and writes the data back to Amazon S3, which is then used by Trino clusters for querying.

Data Pipeline-4: Zomato’s core business applications serving end users include microservices, web and mobile applications. Getting near real-time insights from these core applications is critical to serve customers and win their trust continuously. Services use a custom SDK developed by the data platform team to publish events to an Apache Kafka topic. Two downstream data pipelines then consume these application events from Kafka via Apache Flink on Amazon EMR. Flink performs data conversion into ORC format and publishes the data to Amazon S3; in a parallel data pipeline, Flink also publishes enriched data onto another Kafka topic, which in turn serves data to an Apache Druid cluster deployed on Amazon EC2 instances.

Performance requirements for querying at scale

All of the described data pipelines ingest data into an Amazon S3 based data lake, which is then leveraged by three types of Trino clusters: ad-hoc clusters for ad-hoc query use cases, with a maximum query runtime of 20 minutes; ETL clusters for creating materialized views to enhance the performance of dashboard queries; and reporting clusters to run queries for dashboards with various Key Performance Indicators (KPIs), with query runtimes up to 3 minutes. ETL queries are run via Apache Airflow with a built-in query retry mechanism and a runtime of up to 3 hours.

Druid is used to serve two types of queries: computing aggregated metrics based on recent events and comparing aggregated metrics to historical data, for example, how a specific metric in the current hour compares to the same hour last week. Depending on the use case, the service level objective for Druid query response time ranges from a few milliseconds to a few seconds.

Graviton migration of Druid cluster

Zomato first moved Druid nodes to AWS Graviton based instances in their test cluster environment to determine query performance. Nodes running brokers and middle-managers were moved from R5 to R6g instances, and nodes running historicals were migrated from i3 to R6gd instances. Zomato logged real-world queries from their production cluster and replayed them in the test cluster to validate the performance. Post validation, Zomato saw significant performance gains and reduced cost:

Performance gains

For queries in Druid, performance was measured using a typical business-hours (12:00 to 22:00 hours) load of 14K queries, as shown here, where p99 query runtime reduced by 25%.

Figure 4 – Overall Druid query performance (Intel x86-64 vs. AWS Graviton)

Also, the query performance improvement on the historical nodes of the Druid cluster is shown here, where p95 query runtime reduced by 66%.

Figure 5 –Query performance on Druid Historicals (Intel x86-64 vs. AWS Graviton)

Under peak load during business hours (12:00 to 22:00 hours, as shown in the provided graph), with increasingly loaded CPUs, Graviton based instances demonstrated close to linear performance, resulting in better query runtime than equivalent Intel x86 based instances. This gave Zomato headroom to reduce the overall node count in the Druid cluster while serving the same peak query traffic.

Figure 6 – CPU utilization (Intel x86-64 vs. AWS Graviton)

Cost savings

A cost comparison of Intel x86 vs. AWS Graviton based instances running Druid in a test environment, along with the number, instance types, and hourly On-Demand prices in the Singapore region, is shown here. There are cost savings of ~24% when running the same number of Graviton based instances. Further, the Druid cluster auto scales in the production environment based upon performance metrics, so average cost savings with Graviton based instances are even higher, at ~30%, due to better performance.

Figure 7 – Cost savings analysis (Intel x86-64 vs. AWS Graviton)

Graviton migration of Trino clusters

Zomato also moved their Trino cluster in their test environment to AWS Graviton based instances and monitored query performance for different short- and long-running queries. As shown here, the mean wall (elapsed) time for different Trino queries is lower on AWS Graviton based instances than on equivalent Intel x86 based instances for most queries (lower is better).

Figure 8 – Mean Wall Time for Trino queries (Intel x86-64 vs. AWS Graviton)

Also, p99 query runtime reduced by ~33% after migrating the Trino cluster to AWS Graviton instances for a typical business day’s (7am – 7pm) mixed query load with ~15K queries.

Figure 9 –Query performance for a typical day (7am -7pm) load

Zomato’s team further optimized overall Trino query performance by enhancing Advanced Encryption Standard (AES) performance on Graviton for TLS negotiation with Amazon S3. This was achieved by enabling -XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics in the extra JVM flags. As shown here, the mean CPU time for queries is lower after enabling the extra JVM flags, for most of the queries.
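
As an illustrative sketch (deployment specifics vary), these two lines would be appended to the jvm.config of the Trino nodes, with the unlock flag listed first so that the diagnostic option below it can take effect:

-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics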

Figure 10 –Query performance after enabling extra JVM options with Graviton instances

Migration challenges and approach

The Zomato team is using Trino version 359, and a multi-arch or ARM64-compatible Docker image was not available for this Trino version. As the team wanted to migrate their Trino cluster to Graviton based instances with minimal engineering effort and time, they backported the Trino multi-arch UBI8-based Docker image to their Trino version 359. This approach allowed faster adoption of Graviton based instances, eliminating the heavy lift of upgrading, testing, and benchmarking the workload on a newer Trino version.

Next Steps

Zomato has already migrated AWS managed services including Amazon EMR and Amazon Aurora MySQL-Compatible database to AWS Graviton based instances. With the successful migration of two main open source software components (Trino and Druid) of their data platform to AWS Graviton with visible and immediate price-performance gains, the Zomato team plans to replicate that success with other open source applications running on Amazon EC2 including Apache Kafka, Apache Pinot, etc.

Conclusion

This post demonstrated the price/performance benefits of adopting AWS Graviton based instances for high throughput, near real-time big data analytics workloads running on Java-based, open source Apache Druid and Trino applications. Overall, Zomato reduced the cost of its Amazon EC2 usage by 30%, while improving performance for both time-critical and ad-hoc querying by as much as 25%. Due to better performance, Zomato was also able to right size compute footprint for these workloads on a smaller number of Amazon EC2 instances, with peak capacity of Apache Druid and Trino clusters reduced by 25% and 20% respectively.

Zomato migrated these open source software applications faster by quickly implementing customizations needed for optimum performance and compatibility with Graviton based instances. Zomato’s mission is “better food for more people” and Graviton adoption is helping with this mission by providing a more sustainable, performant, and cost-effective compute platform on AWS. This is certainly a “food for thought” for customers looking forward to improve price-performance and sustainability for their business-critical workloads running on Open Source Software (OSS).


Proactive Insights with Amazon DevOps Guru for RDS

Today, we are pleased to announce a new Amazon DevOps Guru for RDS capability: Proactive Insights. DevOps Guru for RDS is a fully managed service, powered by machine learning (ML), that uses the data collected by RDS Performance Insights to detect and alert customers to anomalous behaviors within Amazon Aurora databases. Since its release, DevOps Guru for RDS has empowered customers with information to quickly react to performance problems and to take corrective actions. Now, Proactive Insights adds recommendations related to operational issues, helping prevent potential problems in the future.

Proactive Insights requires no additional set up for customers already using DevOps Guru for RDS, for both Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition.

The following are example use cases of operational issues available for Proactive Insights today, with more insights coming over time:

Long InnoDB History for Aurora MySQL-Compatible engines – Triggered when the InnoDB history list length becomes very large.

Temporary tables created on disk for Aurora MySQL-Compatible engines – Triggered when the ratio of temporary tables created versus all temporary tables breaches a threshold.

Idle In Transaction for Aurora PostgreSQL-Compatible engines – Triggered when sessions connected to the database are not performing active work, but can keep database resources blocked.

To get started, navigate to the Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights. In the following screen capture, the number three indicates that there are three ongoing proactive insights. Click on that number to see the listing of the corresponding Proactive Insights, which may include RDS or other Proactive Insights supported by Amazon DevOps Guru.

Figure 1. Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights.

Ongoing problems (including reactive and proactive insights) are also highlighted against your database instance on the Database list page in the Amazon RDS console.

Figure 2. Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console.

In the following sections, we will dive deep on these use cases of DevOps Guru for RDS Proactive Insights.

Long InnoDB History for Aurora MySQL-Compatible engines

The InnoDB history list is a global list of the undo logs for committed transactions. MySQL uses the history list to purge records and log pages when transactions no longer require the history.  If the InnoDB history list length grows too large, indicating a large number of old row versions, queries and even the database shutdown process can become slower.

DevOps Guru for RDS now detects when the history list length exceeds 1 million records and alerts users to close (either by commit or by rollback) any unnecessary long-running transactions before triggering database changes that involve a shutdown (this includes reboots and database version upgrades).
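
If you want to inspect this metric yourself on a MySQL-compatible engine, one hedged way is to query the InnoDB metrics table (the value is also visible in the output of SHOW ENGINE INNODB STATUS):

SELECT name, count FROM information_schema.INNODB_METRICS WHERE name = 'trx_rseg_history_len';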

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS InnoDB History List Length Anomalous” Proactive Insight with an ongoing status. You will notice that the Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

The Insight overview provides basic information on this insight. In our case, the history list for row changes increased significantly, which affects query and shutdown performance.

Figure 3. Long InnoDB History for Aurora MySQL-Compatible engines Insight overview.

The Metrics panel gives you a graphical representation of the history list length and the timeline, allowing you to correlate it with any anomalous application activity that may have occurred during this window.

Figure 4. Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem. You will also notice the rationale behind the recommendation under the “Why is DevOps Guru recommending this?” column.

Figure 5. The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem.

Temporary tables created on disk for Aurora MySQL-Compatible engines

Sometimes it is necessary for the MySQL database to create an internal temporary table while processing a query. An internal temporary table can be held in memory and processed by the TempTable or MEMORY storage engine, or stored on disk by the InnoDB storage engine. An increase of temporary tables created on disk instead of in memory can impact the database performance.

DevOps Guru for RDS now monitors the rate at which the database creates temporary tables and the percentage of those temporary tables that use disk. When these values cross recommended levels over a given period of time, DevOps Guru for RDS creates an insight exposing this situation before it becomes critical.
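
To observe the underlying counters yourself on a MySQL-compatible engine, a simple sketch is to compare the cumulative status variables (both accumulate since server start, so look at their rate of change):

SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';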

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Temporary Tables On Disk Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

The Insight overview provides basic information on this insight. In our case, more than 58% of the total temporary tables created per second were using disk, with a sustained rate of two temporary tables on disk created every second, which indicates that query performance is degrading.

Figure 6. Temporary tables created on disk insight overview.

The Metrics panel shows you a graphical representation of the information specific to this insight. You will be presented with the evolution of the number of temporary tables created on disk per second, the percentage of temporary tables on disk (out of the total number of database-created temporary tables), and the overall rate at which the temporary tables are created (per second).

Figure 7. Temporary tables created on disk – evolution of the amount of temporary tables created on disk per second.

Figure 8. Temporary tables created on disk – the percentage of temporary tables on disk (out of the total number of database-created temporary tables).

Figure 9. Temporary tables created on disk – overall rate at which the temporary tables are created (per second).

The Recommendations section suggests actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Figure 10. Temporary tables created on disk – actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Additional explanations on this use case can be found by clicking on the “View troubleshooting doc” link.

Idle In Transaction for Aurora PostgreSQL-Compatible engines

A connection that has been idle in transaction for too long can impact performance by holding locks, blocking other queries, or preventing VACUUM (including autovacuum) from cleaning up dead rows.
The PostgreSQL database requires periodic maintenance, which is known as vacuuming. Autovacuum in PostgreSQL automates the execution of the VACUUM and ANALYZE commands. This process gathers the table statistics and deletes the dead rows. When vacuuming does not occur, database performance suffers: table and index bloat increases (the disk space that was used by a table or index and is available for reuse by the database but has not been reclaimed), statistics become stale, and the database can even end up in transaction ID wraparound (when the number of unique transaction IDs reaches its maximum of about two billion).

DevOps Guru for RDS monitors the time spent by sessions in an Aurora PostgreSQL database in the idle in transaction state and raises first a warning notification, followed by an alarm notification if the idle in transaction state continues (the current thresholds are 1800 seconds for the warning and 3600 seconds for the alarm).
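
To inspect such sessions yourself, an illustrative query against pg_stat_activity lists connections by how long they have been idle in transaction:

SELECT pid, usename, now() - state_change AS idle_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_duration DESC;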

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Idle In Transaction Max Time Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

In our case, a connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

Figure 11. A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

Figure 12. The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

As with the other insights, recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Figure 13. Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Conclusion

With Proactive Insights, DevOps Guru for RDS enhances its ability to help you monitor your databases by notifying you about potential operational issues before they become bigger problems down the road. To get started, ensure that you have enabled Performance Insights on the database instance(s) you want monitored, and confirm that DevOps Guru is enabled to monitor those instances (for example by enabling it at account level, by monitoring specific CloudFormation stacks, or by using AWS tags for specific Aurora resources). Proactive Insights is available in all regions where DevOps Guru for RDS is supported. To learn more about Proactive Insights, join us for a free hands-on Immersion Day (available in three time zones) on March 15th or April 12th.

About the authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Raluca Constantin

Raluca Constantin is a Senior Database Engineer with the Relational Database Services (RDS) team at Amazon Web Services. She has 16 years of experience in the databases world. She enjoys travels, hikes, arts and is a proud mother of a 12y old daughter and a 7y old son.

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Securely validate business application resilience with AWS FIS and IAM

To avoid the high costs of downtime, mission-critical applications in the cloud need to achieve resilience against degradation of cloud provider APIs and services.

In 2021, AWS launched AWS Fault Injection Simulator (FIS), a fully managed service to perform fault injection experiments on workloads in AWS to improve their reliability and resilience. At the time of writing, FIS allows you to simulate degradation of Amazon Elastic Compute Cloud (EC2) APIs using API fault injection actions and thus explore the resilience of workflows where EC2 APIs act as a fault boundary.

In this post we show you how to explore additional fault boundaries in your applications by selectively denying access to any AWS API. This technique is particularly useful for fully managed, “black box” services like Amazon Simple Storage Service (S3) or Amazon Simple Queue Service (SQS), where a failure of read or write operations is sufficient to simulate problems in the service. It is also useful for injecting failures into serverless applications without needing to modify code. While similar results could be achieved with network disruption or by modifying code with feature flags, this approach provides fine-grained degradation of an AWS API without the need to re-deploy and re-validate code.

Overview

We will explore a common application pattern: a user uploads a file, S3 triggers an AWS Lambda function, and the Lambda function transforms the file to a new location and deletes the original:

Figure 1. S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

We will simulate the user upload with an Amazon EventBridge rate expression triggering an AWS Lambda function which creates a file in S3:

Figure 2. S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Using this architecture we can explore the effect of S3 API degradation during file creation and deletion. As shown, the API call to delete a file from S3 is an application fault boundary. The failure could occur, with identical effect, because of S3 degradation or because the AWS IAM role of the Lambda function denies access to the API.

To inject failures we use AWS Systems Manager (AWS SSM) automation documents to attach and detach IAM policies at the API fault boundary and FIS to orchestrate the workflow.

Each Lambda function has an IAM execution role that allows S3 write and delete access, respectively. If the processor Lambda fails, the S3 file will remain in the bucket, indicating a failure. Similarly, if the IAM execution role for the processor function is denied the ability to delete a file after processing, that file will remain in the S3 bucket.

Prerequisites

Following this blog post will incur some costs for AWS services. To explore this test application you will need an AWS account. We will also assume that you are using AWS CloudShell or have the AWS CLI installed, and that you have configured a profile with administrator permissions. With that in place you can create the demo application in your AWS account by downloading this template and deploying an AWS CloudFormation stack:

git clone https://github.com/aws-samples/fis-api-failure-injection-using-iam.git
cd fis-api-failure-injection-using-iam
aws cloudformation deploy --stack-name test-fis-api-faults --template-file template.yaml --capabilities CAPABILITY_NAMED_IAM

Fault injection using IAM

Once the stack has been created, navigate to the Amazon CloudWatch Logs console and filter for /aws/lambda/test-fis-api-faults. Under the EventBridgeTimerHandler log group you should find log events once a minute writing a timestamped file to an S3 bucket named fis-api-failure-ACCOUNT_ID. Under the S3TriggerHandler log group you should find matching deletion events for those files.

Once you have confirmed object creation and deletion, let’s take away the S3 trigger handler Lambda function’s permission to delete files. To do this you will attach the FISAPI-DenyS3DeleteObject policy that was created with the template:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam attach-role-policy \
  --role-name ${ROLE_NAME} \
  --policy-arn ${POLICY_ARN}

With the deny policy in place you should now see object deletion fail and objects should start showing up in the S3 bucket. Navigate to the S3 console and find the bucket starting with fis-api-failure. You should see a new object appearing in this bucket once a minute:

Figure 3. S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

If you would like to graph the results you can navigate to AWS CloudWatch, select “Logs Insights”, select the log group starting with /aws/lambda/test-fis-api-faults-S3CountObjectsHandler, and run this query:

fields @timestamp, @message
| filter NumObjects >= 0
| sort @timestamp desc
| stats max(NumObjects) by bin(1m)
| limit 20

This will show the number of files in the S3 bucket over time:

Figure 4. AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

You can now detach the policy:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam detach-role-policy \
  --role-name ${ROLE_NAME} \
  --policy-arn ${POLICY_ARN}

We see that newly written files will once again be deleted, but the unprocessed files will remain in the S3 bucket. From the fault injection we learned that our system does not tolerate request failures when deleting files from S3. To address this, we should add a dead letter queue or some other retry mechanism.

Note: if the Lambda function does not return a success state on invocation, EventBridge will retry. In our Lambda functions we are cost conscious and explicitly capture the failure states to avoid excessive retries.

Fault injection using SSM

To use this approach from FIS and to always remove the policy at the end of the experiment, we first create an SSM document to automate adding a policy to a role. To inspect this document, open the SSM console, navigate to the “Documents” section, find the FISAPI-IamAttachDetach document under “Owned by me”, and examine the “Content” tab (make sure to select the correct region). This document takes the name of the Role you want to impact and the Policy you want to attach as parameters. It also requires an IAM execution role that grants it the power to list, attach, and detach specific policies to specific roles.

Let’s run the SSM automation document from the console by selecting “Execute Automation”. Determine the ARN of the FISAPI-SSM-Automation-Role from CloudFormation or by running:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

Use FISAPI-SSM-Automation-Role, a duration of 2 minutes expressed in ISO 8601 format as PT2M, the ARN of the deny policy, and the name of the target role FISAPI-TARGET-S3TriggerHandlerRole:

Figure 5. Image of parameter input field reflecting the instructions in blog text.

Alternatively execute this from a shell:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws ssm start-automation-execution \
  --document-name FISAPI-IamAttachDetach \
  --parameters "{
    \"AutomationAssumeRole\": [ \"${ASSUME_ROLE_ARN}\" ],
    \"Duration\": [ \"PT2M\" ],
    \"TargetResourceDenyPolicyArn\": [ \"${POLICY_ARN}\" ],
    \"TargetApplicationRoleName\": [ \"${ROLE_NAME}\" ]
  }"

Wait two minutes and then examine the content of the S3 bucket starting with fis-api-failure again. You should now see two additional files in the bucket, showing that the policy was attached for 2 minutes during which files could not be deleted, and confirming that our application is not resilient to S3 API degradation.

Permissions for injecting failures with SSM

Fault injection with SSM is controlled by IAM, which is why you had to specify the FISAPI-SSM-Automation-Role:

Figure 6. Visual representation of IAM permission used for fault injections with SSM.

This role needs to contain an assume role policy statement for SSM to allow assuming the role:

AssumeRolePolicyDocument:
  Statement:
    - Action:
        - 'sts:AssumeRole'
      Effect: Allow
      Principal:
        Service:
          - 'ssm.amazonaws.com'

The role also needs to contain permissions to describe roles and their attached policies with an optional constraint on which roles and policies are visible:

- Sid: GetRoleAndPolicyDetails
  Effect: Allow
  Action:
    - 'iam:GetRole'
    - 'iam:GetPolicy'
    - 'iam:ListAttachedRolePolicies'
  Resource:
    # Roles
    - !GetAtt EventBridgeTimerHandlerRole.Arn
    - !GetAtt S3TriggerHandlerRole.Arn
    # Policies
    - !Ref AwsFisApiPolicyDenyS3DeleteObject

Finally, the SSM role needs to allow attaching and detaching a policy document. This requires:

an ALLOW statement
a constraint on the policies that can be attached
a constraint on the roles that can be attached to

In the role, we collapse the first two requirements into an ALLOW statement with a condition constraint for the policy ARN. We then express the third requirement in a DENY statement that limits the '*' resource to only the explicit role ARNs we want to modify:

- Sid: AllowOnlyTargetResourcePolicies
  Effect: Allow
  Action:
    - 'iam:DetachRolePolicy'
    - 'iam:AttachRolePolicy'
  Resource: '*'
  Condition:
    ArnEquals:
      'iam:PolicyARN':
        # Policies that can be attached
        - !Ref AwsFisApiPolicyDenyS3DeleteObject
- Sid: DenyAttachDetachAllRolesExceptApplicationRole
  Effect: Deny
  Action:
    - 'iam:DetachRolePolicy'
    - 'iam:AttachRolePolicy'
  NotResource:
    # Roles that can be attached to
    - !GetAtt EventBridgeTimerHandlerRole.Arn
    - !GetAtt S3TriggerHandlerRole.Arn

We will discuss security considerations in more detail at the end of this post.

Fault injection using FIS

With the SSM document in place you can now create an FIS template that calls the SSM document. Navigate to the FIS console and filter for FISAPI-DENY-S3PutObject. You should see that the experiment template passes the same parameters that you previously used with SSM:

Figure 7. Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

You can now run the FIS experiment and after a couple minutes once again see new files in the S3 bucket.
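If you would rather script the run than use the console, a hedged boto3 sketch might look like the following. It assumes the experiment template carries a Name tag of FISAPI-DENY-S3PutObject, matching the console filter above; verify the tag in your own account:

import uuid
import boto3

fis = boto3.client("fis")

# Find the template by its Name tag (assumed naming; verify in your account)
templates = fis.list_experiment_templates()["experimentTemplates"]
template_id = next(
    t["id"]
    for t in templates
    if t.get("tags", {}).get("Name") == "FISAPI-DENY-S3PutObject"
)

# Start the experiment; the client token makes the call idempotent
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template_id,
)
print("Started experiment:", experiment["experiment"]["id"])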

Permissions for injecting failures with FIS and SSM

Fault injection with FIS is controlled by IAM, which is why you had to specify the FISAPI-FIS-Injection-ExperimentRole:

Figure 8. Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

This role needs to contain an assume role policy statement for FIS to allow assuming the role:

AssumeRolePolicyDocument:
  Statement:
    - Action:
        - 'sts:AssumeRole'
      Effect: Allow
      Principal:
        Service:
          - 'fis.amazonaws.com'

The role also needs permissions to list and execute SSM documents:

- Sid: RequiredReadActionsforAWSFIS
  Effect: Allow
  Action:
    - 'cloudwatch:DescribeAlarms'
    - 'ssm:GetAutomationExecution'
    - 'ssm:ListCommands'
    - 'iam:ListRoles'
  Resource: '*'
- Sid: RequiredSSMStopActionforAWSFIS
  Effect: Allow
  Action:
    - 'ssm:CancelCommand'
  Resource: '*'
- Sid: RequiredSSMWriteActionsforAWSFIS
  Effect: Allow
  Action:
    - 'ssm:StartAutomationExecution'
    - 'ssm:StopAutomationExecution'
  Resource:
    - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

Finally, remember that the SSM document needs to use a Role of its own to execute the fault injection actions. Because that Role is different from the Role under which we started the FIS experiment, we need to explicitly allow SSM to assume that role with a PassRole statement which will expand to FISAPI-SSM-Automation-Role:

- Sid: RequiredIAMPassRoleforSSMADocuments
  Effect: Allow
  Action: 'iam:PassRole'
  Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/${SsmAutomationRole}'

Secure and flexible permissions

So far, we have used explicit ARNs for our guardrails. To expand flexibility, we can use wildcards in our resource matching. For example, we might change the Policy matching from:

Condition:
  ArnEquals:
    'iam:PolicyARN':
      # Explicitly listed policies - secure but inflexible
      - !Ref AwsFisApiPolicyDenyS3DeleteObject

or the equivalent:

Condition:
  ArnEquals:
    'iam:PolicyARN':
      # Explicitly listed policies - secure but inflexible
      - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${FullPolicyName}'

to a wildcard notation like this:

Condition:
  ArnEquals:
    'iam:PolicyARN':
      # Wildcard policies - secure and flexible
      - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${PolicyNamePrefix}*'

If we set PolicyNamePrefix to FISAPI-DenyS3, this would allow attaching FISAPI-DenyS3PutObject and FISAPI-DenyS3DeleteObject, but would not allow using a policy named FISAPI-DenyEc2DescribeInstances.

Similarly, we could change the Resource matching from:

NotResource:
  # Explicitly listed roles - secure but inflexible
  - !GetAtt EventBridgeTimerHandlerRole.Arn
  - !GetAtt S3TriggerHandlerRole.Arn

to a wildcard equivalent like this:

NotResource:
  # Wildcard roles - secure and flexible
  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixEventBridge}*'
  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixS3}*'

and setting RoleNamePrefixEventBridge to FISAPI-TARGET-EventBridge and RoleNamePrefixS3 to FISAPI-TARGET-S3.

Finally, we would also change the FIS experiment role to allow SSM documents based on a name prefix by changing the constraint on automation execution from:

- Sid: RequiredSSMWriteActionsforAWSFIS
  Effect: Allow
  Action:
    - 'ssm:StartAutomationExecution'
    - 'ssm:StopAutomationExecution'
  Resource:
    # Explicitly listed resource - secure but inflexible
    # Note: the $DEFAULT at the end could also be an explicit version number
    # Note: the 'automation-definition' is automatically created from 'document' on invocation
    - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

to

- Sid: RequiredSSMWriteActionsforAWSFIS
  Effect: Allow
  Action:
    - 'ssm:StartAutomationExecution'
    - 'ssm:StopAutomationExecution'
  Resource:
    # Wildcard resources - secure and flexible
    #
    # Note: the 'automation-definition' is automatically created from 'document' on invocation
    - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationDocumentPrefix}*'

and setting SsmAutomationDocumentPrefix to FISAPI-. Test this by updating the CloudFormation stack with a modified template:

aws cloudformation deploy --stack-name test-fis-api-faults --template-file template2.yaml --capabilities CAPABILITY_NAMED_IAM

Permissions governing users

In production you should not use administrator access to run FIS experiments. Instead, we create two roles, FISAPI-AssumableRoleWithCreation and FISAPI-AssumableRoleWithoutCreation, for you (see this template). These roles require all FIS and SSM resources to have a Name tag that starts with FISAPI-. Try assuming the role without creation privileges and running an experiment. You will notice that you can only start an experiment if you add a Name tag, e.g. FISAPI-secure-1, and you will only be able to get details of experiments and templates that have proper Name tags.

If you are working with AWS Organizations, you can add further guardrails by defining SCPs that control the use of the FISAPI-* tags, similar to this blog post.

Caveats

For this solution we are choosing to attach policies instead of permission boundaries. The benefit of this is that you can attach multiple independent policies and thus simulate multi-step service degradation. However, this means that it is possible to increase the permission level of a role. While there are situations where this might be of interest, e.g. to simulate security breaches, please implement a thorough security review of any fault injection IAM policies you create. Note that modifying IAM Roles may trigger events in your security monitoring tools.

The AttachRolePolicy and DetachRolePolicy calls from AWS IAM are eventually consistent, meaning that in some cases permission propagation when starting and stopping fault injection may take up to 5 minutes each.
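If your measurements depend on the injected fault actually being active, one mitigation is to poll IAM until the attachment is visible before starting the measurement window. The following sketch uses the role and policy names from this post; note that visibility in list calls still does not guarantee full enforcement propagation:

import time
import boto3

iam = boto3.client("iam")
ROLE_NAME = "FISAPI-TARGET-S3TriggerHandlerRole"
POLICY_NAME = "FISAPI-DenyS3DeleteObject"

# Poll for up to 5 minutes, checking every 10 seconds
for _ in range(30):
    attached = iam.list_attached_role_policies(RoleName=ROLE_NAME)
    if any(p["PolicyName"] == POLICY_NAME for p in attached["AttachedPolicies"]):
        print("Deny policy attached and visible")
        break
    time.sleep(10)
else:
    print("Policy never became visible")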

Cleanup

To avoid additional cost, delete the content of the S3 bucket and delete the CloudFormation stack:

# Clean up policy attachments just in case
CLEANUP_ROLES=$(aws iam list-roles --query "Roles[?starts_with(RoleName,'FISAPI-')].RoleName" --output text)
for role in $CLEANUP_ROLES; do
  CLEANUP_POLICIES=$(aws iam list-attached-role-policies --role-name $role --query "AttachedPolicies[?starts_with(PolicyName,'FISAPI-')].PolicyArn" --output text)
  for policy in $CLEANUP_POLICIES; do
    echo Detaching policy $policy from role $role
    aws iam detach-role-policy --role-name $role --policy-arn $policy
  done
done
# Delete S3 bucket content
ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
S3_BUCKET_NAME=fis-api-failure-${ACCOUNT_ID}
aws s3 rm --recursive s3://${S3_BUCKET_NAME}
aws s3 rb s3://${S3_BUCKET_NAME}
# Delete CloudFormation stack
aws cloudformation delete-stack --stack-name test-fis-api-faults
aws cloudformation wait stack-delete-complete --stack-name test-fis-api-faults

Conclusion 

AWS Fault Injection Simulator provides the ability to simulate various external impacts to your application to validate and improve resilience. We’ve shown how combining FIS with IAM to selectively deny access to AWS APIs provides a generic path to explore fault boundaries across all AWS services. We’ve shown how this can be used to identify and improve a resilience problem in a common S3 upload workflow. To learn about more ways to use FIS, see this workshop.

About the authors:

Dr. Rudolf Potucek

Dr. Rudolf Potucek is Startup Solutions Architect at Amazon Web Services. Over the past 30 years he gained a PhD and worked in different roles including leading teams in academia and industry, as well as consulting. He brings experience from working with academia, startups, and large enterprises to his current role of guiding startup customers to succeed in the cloud.

Rudolph Wagner

Rudolph Wagner is a Premium Support Engineer at Amazon Web Services who holds the CISSP and OSCP security certifications, in addition to being a certified AWS Solutions Architect Professional. He assists internal and external Customers with multiple AWS services by using his diverse background in SAP, IT, and construction.

Unlock the power of EC2 Graviton with GitLab CI/CD and EKS Runners

Many AWS customers are using GitLab for their DevOps needs, including source control, and continuous integration and continuous delivery (CI/CD). Many of our customers are using GitLab SaaS (the hosted edition), while others are using GitLab Self-managed to meet their security and compliance requirements.

Customers can easily add runners to their GitLab instance to perform various CI/CD jobs. These jobs include compiling source code, building software packages or container images, performing unit and integration testing, etc.—even all the way to production deployment. For the SaaS edition, GitLab offers hosted runners, and customers can provide their own runners as well. Customers who run GitLab Self-managed must provide their own runners.

In this post, we’ll discuss how customers can maximize their CI/CD capabilities by managing their GitLab runner and executor fleet with Amazon Elastic Kubernetes Service (Amazon EKS). We’ll leverage both x86 and Graviton runners, allowing customers for the first time to build and test their applications both on x86 and on AWS Graviton, our most powerful, cost-effective, and sustainable instance family. In keeping with AWS’s philosophy of “pay only for what you use,” we’ll keep our Amazon Elastic Compute Cloud (Amazon EC2) instances as small as possible, and launch ephemeral runners on Spot instances. We’ll demonstrate building and testing a simple demo application on both architectures. Finally, we’ll build and deliver a multi-architecture container image that can run on Amazon EC2 instances or AWS Fargate, both on x86 and Graviton.

Figure 1.  Managed GitLab runner architecture overview.

Let’s go through the components:

Runners

A runner is an application to which GitLab sends jobs that are defined in a CI/CD pipeline. The runner receives jobs from GitLab and executes them, either by itself or by passing them to an executor (we'll visit the executor in the next section).

In our design, we’ll be using a pair of self-hosted runners. One runner will accept jobs for the x86 CPU architecture, and the other will accept jobs for the arm64 (Graviton) CPU architecture. To help us route our jobs to the proper runner, we’ll apply some tags to each runner indicating the architecture for which it will be responsible. We’ll tag the x86 runner with x86, x86-64, and amd64, thereby reflecting the most common nicknames for the architecture, and we’ll tag the arm64 runner with arm64.

Currently, these runners must always be running so that they can receive jobs as they are created. Our runners only require a small amount of memory and CPU, so we can run them on small EC2 instances to minimize cost, such as t4g.micro for Graviton builds, or t3.micro or t3a.micro for x86 builds.

To save money on these runners, consider purchasing a Savings Plan or Reserved Instances for them. Savings Plans and Reserved Instances can save you up to 72% over on-demand pricing, and there’s no minimum spend required to use them.

Kubernetes executors

In GitLab CI/CD, the executor’s job is to perform the actual build. The runner can create hundreds or thousands of executors as needed to meet current demand, subject to the concurrency limits that you specify. Executors are created only when needed, and they are ephemeral: once a job has finished running on an executor, the runner will terminate it.

In our design, we’ll use the Kubernetes executor that’s built into the GitLab runner. The Kubernetes executor simply schedules a new pod to run each job. Once the job completes, the pod terminates, thereby freeing the node to run other jobs.

The Kubernetes executor is highly customizable. We’ll configure each runner with a nodeSelector that makes sure that the jobs are scheduled only onto nodes that are running the specified CPU architecture. Other possible customizations include CPU and memory reservations, node and pod tolerations, service accounts, volume mounts, and much more.

Scaling worker nodes

For most customers, CI/CD jobs aren’t likely to be running all of the time. To save cost, we only want to run worker nodes when there’s a job to run.

To make this happen, we’ll turn to Karpenter. Karpenter provisions EC2 instances as soon as needed to fit newly-scheduled pods. If a new executor pod is scheduled, and there isn’t a qualified instance with enough capacity remaining on it, then Karpenter will quickly and automatically launch a new instance to fit the pod. Karpenter will also periodically scan the cluster and terminate idle nodes, thereby saving on costs. Karpenter can terminate a vacant node in as little as 30 seconds.

Karpenter can launch either Amazon EC2 on-demand or Spot instances depending on your needs. With Spot instances, you can save up to 90% over on-demand instance prices. Since CI/CD jobs often aren’t time-sensitive, Spot instances can be an excellent choice for GitLab execution pods. Karpenter will even automatically find the best Spot instance type to speed up the time it takes to launch an instance and minimize the likelihood of job interruption.

Deploying our solution

To deploy our solution, we’ll write a small application using the AWS Cloud Development Kit (AWS CDK) and the EKS Blueprints library. AWS CDK is an open-source software development framework to define your cloud application resources using familiar programming languages. EKS Blueprints is a library designed to make it simple to deploy complex Kubernetes resources to an Amazon EKS cluster with minimum coding.

The high-level infrastructure code, which can be found in our GitLab repo, is very simple. I've included comments to explain how it works.

// All CDK applications start with a new cdk.App object.
const app = new cdk.App();

// Create a new EKS cluster at v1.23. Run all non-DaemonSet pods in the
// `kube-system` (coredns, etc.) and `karpenter` namespaces in Fargate
// so that we don't have to maintain EC2 instances for them.
const clusterProvider = new blueprints.GenericClusterProvider({
  version: KubernetesVersion.V1_23,
  fargateProfiles: {
    main: {
      selectors: [
        { namespace: 'kube-system' },
        { namespace: 'karpenter' },
      ]
    }
  },
  clusterLogging: [
    ClusterLoggingTypes.API,
    ClusterLoggingTypes.AUDIT,
    ClusterLoggingTypes.AUTHENTICATOR,
    ClusterLoggingTypes.CONTROLLER_MANAGER,
    ClusterLoggingTypes.SCHEDULER
  ]
});

// EKS Blueprints uses a Builder pattern.
blueprints.EksBlueprint.builder()
  .clusterProvider(clusterProvider) // start with the Cluster Provider
  .addOns(
    // Use the EKS add-ons that manage coredns and the VPC CNI plugin
    new blueprints.addons.CoreDnsAddOn('v1.8.7-eksbuild.3'),
    new blueprints.addons.VpcCniAddOn('v1.12.0-eksbuild.1'),
    // Install Karpenter
    new blueprints.addons.KarpenterAddOn({
      provisionerSpecs: {
        // Karpenter examines scheduled pods for the following labels
        // in their `nodeSelector` or `nodeAffinity` rules and routes
        // the pods to the node with the best fit, provisioning a new
        // node if necessary to meet the requirements.
        //
        // Allow either amd64 or arm64 nodes to be provisioned
        'kubernetes.io/arch': ['amd64', 'arm64'],
        // Allow either Spot or On-Demand nodes to be provisioned
        'karpenter.sh/capacity-type': ['spot', 'on-demand']
      },
      // Launch instances in the VPC private subnets
      subnetTags: {
        Name: 'gitlab-runner-eks-demo/gitlab-runner-eks-demo-vpc/PrivateSubnet*'
      },
      // Apply security groups that match the following tags to the launched instances
      securityGroupTags: {
        'kubernetes.io/cluster/gitlab-runner-eks-demo': 'owned'
      }
    }),
    // Create a pair of new GitLab runner deployments, one running on an
    // arm64 (Graviton) instance, the other on an x86_64 instance.
    // We'll show the definition of the GitLabRunner class below.
    new GitLabRunner({
      arch: CpuArch.ARM_64,
      // If you're using an on-premise GitLab installation, you'll want
      // to change the URL below.
      gitlabUrl: 'https://gitlab.com',
      // Kubernetes Secret containing the runner registration token
      // (discussed later)
      secretName: 'gitlab-runner-secret'
    }),
    new GitLabRunner({
      arch: CpuArch.X86_64,
      gitlabUrl: 'https://gitlab.com',
      secretName: 'gitlab-runner-secret'
    }),
  )
  .build(app,
    // Stack name
    'gitlab-runner-eks-demo');

The GitLabRunner class is a HelmAddOn subclass that takes a few parameters from the top-level application:

// The location and name of the GitLab Runner Helm chart
const CHART_REPO = 'https://charts.gitlab.io';
const HELM_CHART = 'gitlab-runner';

// The default namespace for the runner
const DEFAULT_NAMESPACE = 'gitlab';

// The default Helm chart version
const DEFAULT_VERSION = '0.40.1';

export enum CpuArch {
  ARM_64 = 'arm64',
  X86_64 = 'amd64'
}

// Configuration parameters
interface GitLabRunnerProps {
  // The CPU architecture of the node on which the runner pod will reside
  arch: CpuArch
  // The GitLab API URL
  gitlabUrl: string
  // Kubernetes Secret containing the runner registration token (discussed later)
  secretName: string
  // Optional tags for the runner. These will be added to the default list
  // corresponding to the runner's CPU architecture.
  tags?: string[]
  // Optional Kubernetes namespace in which the runner will be installed
  namespace?: string
  // Optional Helm chart version
  chartVersion?: string
}

export class GitLabRunner extends HelmAddOn {
  private arch: CpuArch;
  private gitlabUrl: string;
  private secretName: string;
  private tags: string[] = [];

  constructor(props: GitLabRunnerProps) {
    // Invoke the superclass (HelmAddOn) constructor
    super({
      name: `gitlab-runner-${props.arch}`,
      chart: HELM_CHART,
      repository: CHART_REPO,
      namespace: props.namespace || DEFAULT_NAMESPACE,
      version: props.chartVersion || DEFAULT_VERSION,
      release: `gitlab-runner-${props.arch}`,
    });

    this.arch = props.arch;
    this.gitlabUrl = props.gitlabUrl;
    this.secretName = props.secretName;

    // Set default runner tags
    switch (this.arch) {
      case CpuArch.X86_64:
        this.tags.push('amd64', 'x86', 'x86-64', 'x86_64');
        break;
      case CpuArch.ARM_64:
        this.tags.push('arm64');
        break;
    }
    this.tags.push(...(props.tags || [])); // Add any custom tags
  };

  // `deploy` method required by the abstract class definition. Our implementation
  // simply installs a Helm chart to the cluster with the proper values.
  deploy(clusterInfo: ClusterInfo): void | Promise<Construct> {
    const chart = this.addHelmChart(clusterInfo, this.getValues(), true);
    return Promise.resolve(chart);
  }

  // Returns the values for the GitLab Runner Helm chart
  private getValues(): Values {
    return {
      gitlabUrl: this.gitlabUrl,
      runners: {
        config: this.runnerConfig(), // runner config.toml file, from below
        name: `demo-runner-${this.arch}`, // name as seen in GitLab UI
        tags: uniq(this.tags).join(','),
        secret: this.secretName, // see below
      },
      // Labels to constrain the nodes where this runner can be placed
      nodeSelector: {
        'kubernetes.io/arch': this.arch,
        'karpenter.sh/capacity-type': 'on-demand'
      },
      // Default pod label
      podLabels: {
        'gitlab-role': 'manager'
      },
      // Create all the necessary RBAC resources including the ServiceAccount
      rbac: {
        create: true
      },
      // Required resources (memory/CPU) for the runner pod. The runner
      // is fairly lightweight as it's a self-contained Golang app.
      resources: {
        requests: {
          memory: '128Mi',
          cpu: '256m'
        }
      }
    };
  }

  // This string contains the runner's `config.toml` file including the
  // Kubernetes executor's configuration. Note the nodeSelector constraints
  // (including the use of Spot capacity and the CPU architecture).
  private runnerConfig(): string {
    return `
[[runners]]
  [runners.kubernetes]
    namespace = "{{.Release.Namespace}}"
    image = "ubuntu:16.04"
  [runners.kubernetes.node_selector]
    "kubernetes.io/arch" = "${this.arch}"
    "kubernetes.io/os" = "linux"
    "karpenter.sh/capacity-type" = "spot"
  [runners.kubernetes.pod_labels]
    gitlab-role = "runner"
`.trim();
  }
}

For security reasons, we store the GitLab registration token in a Kubernetes Secret – never in our source code. For additional security, we recommend encrypting Secrets using an AWS Key Management Service (AWS KMS) key that you supply by specifying the encryption configuration when you create your Amazon EKS cluster. It’s a good practice to restrict access to this Secret via Kubernetes RBAC rules.

To create the Secret, run the following command:

# These two values must match the parameters supplied to the GitLabRunner constructor
NAMESPACE=gitlab
SECRET_NAME=gitlab-runner-secret
# The value of the registration token.
TOKEN=GRxxxxxxxxxxxxxxxxxxxxxx

kubectl -n $NAMESPACE create secret generic $SECRET_NAME \
  --from-literal="runner-registration-token=$TOKEN" \
  --from-literal="runner-token="

Building a multi-architecture container image

Now that we've launched our GitLab runners and configured the executors, we can build and test a simple multi-architecture container image. If the tests pass, we can then upload it to our project's GitLab container registry. Our application will be pretty simple: a web server written in Go that prints out "Hello World" along with the current architecture.

Find the source code of our sample app in our GitLab repo.

In GitLab, the CI/CD configuration lives in the .gitlab-ci.yml file at the root of the source repository. In this file, we declare a list of ordered build stages, and then we declare the specific jobs associated with each stage.

Our stages are:

The build stage, in which we compile our code, produce our architecture-specific images, and upload these images to the GitLab container registry. These uploaded images are tagged with a suffix indicating the architecture on which they were built. This job uses a matrix variable to run it in parallel against two different runners – one for each supported architecture. Furthermore, rather than using docker build to produce our images, we use Kaniko to build them. This lets us build our images in an unprivileged container environment and improve the security posture considerably.
The test stage, in which we test the code. As with the build stage, we use a matrix variable to run the tests in parallel in separate pods on each supported architecture.

The assembly stage, in which we create a multi-architecture image manifest from the two architecture-specific images. Then, we push the manifest into the image registry so that we can refer to it in future deployments.

Figure 2. Example CI/CD pipeline for multi-architecture images.

Here’s what our top-level configuration looks like:

variables:
  # These are used by the runner to configure the Kubernetes executor, and define
  # the values of spec.containers[].resources.requests.{memory,cpu} for the Pod(s).
  KUBERNETES_MEMORY_REQUEST: 1Gi
  KUBERNETES_CPU_REQUEST: 1

# List of stages for jobs, and their order of execution
stages:
  - build
  - test
  - create-multiarch-manifest

Here's what our build stage job looks like. Note the matrix of variables, which set BUILD_ARCH as the two jobs run in parallel:
build-job:
  stage: build
  parallel:
    matrix: # This job is run twice, once on amd64 (x86), once on arm64
      - BUILD_ARCH: amd64
      - BUILD_ARCH: arm64
  tags: [$BUILD_ARCH] # Associate the job with the appropriate runner
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - mkdir -p /kaniko/.docker
    # Configure authentication data for Kaniko so it can push to the
    # GitLab container registry
    - echo "{\"auths\":{\"${CI_REGISTRY}\":{\"auth\":\"$(printf "%s:%s" "${CI_REGISTRY_USER}" "${CI_REGISTRY_PASSWORD}" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json
    # Build the image and push to the registry. In this stage, we append the build
    # architecture as a tag suffix.
    - >-
      /kaniko/executor
      --context "${CI_PROJECT_DIR}"
      --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
      --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}-${BUILD_ARCH}"

Here's what our test stage job looks like. This time we use the image that we just produced. Our source code is copied into the application container. Then, we can run make test-container to execute the server test suite.

test-job:
  stage: test
  parallel:
    matrix: # This job is run twice, once on amd64 (x86), once on arm64
      - BUILD_ARCH: amd64
      - BUILD_ARCH: arm64
  tags: [$BUILD_ARCH] # Associate the job with the appropriate runner
  image:
    # Use the image we just built
    name: "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}-${BUILD_ARCH}"
  script:
    - make test-container

Finally, here’s what our assembly stage looks like. We use Podman to build the multi-architecture manifest and push it into the image registry. Traditionally we might have used docker buildx to do this, but using Podman lets us do this work in an unprivileged container for additional security.

create-manifest-job:
  stage: create-multiarch-manifest
  tags: [arm64]
  image: public.ecr.aws/docker/library/fedora:36
  script:
    - yum -y install podman
    - echo "${CI_REGISTRY_PASSWORD}" | podman login -u "${CI_REGISTRY_USER}" --password-stdin "${CI_REGISTRY}"
    - COMPOSITE_IMAGE=${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}
    - podman manifest create ${COMPOSITE_IMAGE}
    - >-
      for arch in arm64 amd64; do
        podman manifest add ${COMPOSITE_IMAGE} docker://${COMPOSITE_IMAGE}-${arch};
      done
    - podman manifest inspect ${COMPOSITE_IMAGE}
    # The composite image manifest omits the architecture from the tag suffix.
    - podman manifest push ${COMPOSITE_IMAGE} docker://${COMPOSITE_IMAGE}

Trying it out

I’ve created a public test GitLab project containing the sample source code, and attached the runners to the project. We can see them at Settings > CI/CD > Runners:

Figure 3. GitLab runner configurations.

Here we can also see some pipeline executions, where some have succeeded, and others have failed.

Figure 4. GitLab sample pipeline executions.

We can also see the specific jobs associated with a pipeline execution:

Figure 5. GitLab sample job executions.

Finally, here are our container images:

Figure 6. GitLab sample container registry.

Conclusion

In this post, we've illustrated how you can quickly and easily construct multi-architecture container images with GitLab, Amazon EKS, Karpenter, and Amazon EC2, using both the x86 and Graviton instance families. We focused on using as many managed services as possible, maximizing security, and minimizing complexity and TCO. We dove deep on multiple facets of the process, and discussed how to save up to 90% of the solution's cost by using Spot instances for CI/CD executions.

Find the sample code, including everything shown here today, in our GitLab repository.

Building multi-architecture images will unlock the value and performance of running your applications on AWS Graviton and give you increased flexibility over compute choice. We encourage you to get started today.

About the author:

Michael Fischer

Michael Fischer is a Principal Specialist Solutions Architect at Amazon Web Services. He focuses on helping customers build more cost-effectively and sustainably with AWS Graviton. Michael has an extensive background in systems programming, monitoring, and observability. His hobbies include world travel, diving, and playing the drums.

Develop a serverless application in Python using Amazon CodeWhisperer

While writing code to develop applications, developers must keep up with multiple programming languages, frameworks, software libraries, and popular cloud services from providers such as AWS. Even though developers can find code snippets on developer communities, to either learn from them or repurpose the code, manually searching for the snippets with an exact or even similar use case is a distracting and time-consuming process. They have to do all of this while making sure that they’re following the correct programming syntax and best coding practices.

Amazon CodeWhisperer, a machine learning (ML)-powered coding assistant for developers, lets you overcome those challenges. Developers can simply write a comment that outlines a specific task in plain English, such as "upload a file to S3." Based on this, CodeWhisperer automatically determines which cloud services and public libraries are best suited for the specified task, generates the code on the fly, and then recommends the generated code snippets directly in the IDE. And this isn't about copy-pasting code from the web, but generating code based on the context of your file, such as which libraries and versions you have, as well as the existing code. Moreover, CodeWhisperer seamlessly integrates with your Visual Studio Code and JetBrains IDEs so that you can stay focused and never leave the development environment. At the time of this writing, CodeWhisperer supports Java, Python, JavaScript, C#, and TypeScript.

In this post, we’ll build a full-fledged, event-driven, serverless application for image recognition. With the aid of CodeWhisperer, you’ll write your own code that runs on top of AWS Lambda to interact with Amazon Rekognition, Amazon DynamoDB, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), Amazon Simple Storage Service (Amazon S3), and third-party HTTP APIs to perform image recognition. The users of the application can interact with it by either sending the URL of an image for processing, or by listing the images and the objects present on each image.

Solution overview

To make our application easier to digest, we’ll split it into three segments:

Image download – The user provides an image URL to the first API. A Lambda function downloads the image from the URL and stores it on an S3 bucket. Amazon S3 automatically sends a notification to an Amazon SNS topic informing that a new image is ready for processing. Amazon SNS then delivers the message to an Amazon SQS queue.

Image recognition – A second Lambda function handles the orchestration and processing of the image. It receives the message from the Amazon SQS queue, sends the image for Amazon Rekognition to process, stores the recognition results on a DynamoDB table, and sends a message with those results as JSON to a second Amazon SNS topic used in section three. A user can list the images and the objects present on each image by calling a second API which queries the DynamoDB table.

3rd-party integration – The last Lambda function reads the message from the second Amazon SQS queue. At this point, the Lambda function must deliver that message to a fictitious external e-mail server HTTP API that supports only XML payloads. Because of that, the Lambda function converts the JSON message to XML. Lastly, the function sends the XML object via HTTP POST to the e-mail server.

The following diagram depicts the architecture of our application:

Figure 1. Architecture diagram depicting the application architecture. It contains the service icons with the component explained on the text above.

Prerequisites

Before getting started, you must have the following prerequisites:

An AWS account and an Administrator user

Install and authenticate the AWS CLI. You can authenticate with an AWS Identity and Access Management (IAM) user or an AWS Security Token Service (AWS STS) token.
Install Python 3.7 or later.
Install Node Package Manager (npm).

Install the AWS CDK Toolkit.
Install the AWS Toolkit for VS Code or for JetBrains.

Install Git.

Configure environment

We already created the scaffolding for the application that we'll build, which you can find in this Git repository. This application is represented by a CDK app that describes the infrastructure according to the architecture diagram above. However, the actual business logic of the application isn't provided; you'll implement it using CodeWhisperer. This means that the AWS CDK components, such as the API Gateway endpoints, DynamoDB table, and topics and queues, are already declared. If you're new to AWS CDK, then we encourage you to go through the CDK workshop later on.

Deploying AWS CDK apps into an AWS environment (a combination of an AWS account and region) requires that you provision resources that the AWS CDK needs to perform the deployment. These resources include an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments. The process of provisioning these initial resources is called bootstrapping. The required resources are defined in an AWS CloudFormation stack, called the bootstrap stack, which is usually named CDKToolkit. Like any CloudFormation stack, it appears in the CloudFormation console once it has been deployed.

After cloning the repository, let’s deploy the application (still without the business logic, which we’ll implement later on using CodeWhisperer). For this post, we’ll implement the application in Python. Therefore, make sure that you’re under the python directory. Then, use the cdk bootstrap command to bootstrap an AWS environment for AWS CDK. Replace {AWS_ACCOUNT_ID} and {AWS_REGION} with corresponding values first:

cdk bootstrap aws://{AWS_ACCOUNT_ID}/{AWS_REGION}

For more information about bootstrapping, refer to the documentation.

The last step to prepare your environment is to enable CodeWhisperer on your IDE. See Setting up CodeWhisperer for VS Code or Setting up Amazon CodeWhisperer for JetBrains to learn how to do that, depending on which IDE you’re using.

Image download

Let's get started by implementing the first Lambda function, which is responsible for downloading an image from the provided URL and storing that image in an S3 bucket. Open the get_save_image.py file from the python/api/runtime/ directory. This file contains an empty Lambda function handler and the input parameters needed to integrate this Lambda function:

url is the URL of the input image provided by the user,

name is the name of the image provided by the user, and

S3_BUCKET is the S3 bucket name defined by our application infrastructure.

Write a comment in natural language that describes the required functionality, for example:

# Function to get a file from url

To trigger CodeWhisperer, hit the Enter key after entering the comment and wait for a code suggestion. If you want to manually trigger CodeWhisperer, then you can hit Option + C on macOS or Alt + C on Windows. You can browse through multiple suggestions (if available) with the arrow keys. Accept a code suggestion by pressing Tab. Discard a suggestion by pressing Esc or typing a character.

For more information on how to work with CodeWhisperer, see Working with CodeWhisperer in VS Code or Working with Amazon CodeWhisperer from JetBrains.

You should get a suggested implementation of a function that downloads a file using a specified URL. The following image shows an example of the code snippet that CodeWhisperer suggests:

Figure 2. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called get_file_from_url with the implementation suggestion to download a file using the requests lib.

Be aware that CodeWhisperer uses artificial intelligence (AI) to provide code recommendations, and that this is non-deterministic. The result you get in your IDE may differ from the one in the image above. If needed, fine-tune the result: CodeWhisperer generates the core logic, but you may want to customize the details to fit your requirements.
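For reference, one plausible shape for such a suggestion, using the popular requests library, is shown below; treat it as a sketch rather than the exact output you will see:

import requests

def get_file_from_url(url):
    """Download a file from a URL and return its raw bytes."""
    response = requests.get(url, timeout=10)
    # Raise an exception if the server returned an error status
    response.raise_for_status()
    return response.content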

Let’s try another action, this time to upload the image to an S3 bucket:

# Function to upload image to S3

As a result, CodeWhisperer generates a code snippet similar to the following one:

Figure 3. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called upload_image with the implementation suggestion to download a file using the requests lib and upload it to S3 using the S3 client.
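Again, one plausible shape for that suggestion (function and parameter names are illustrative):

import boto3

s3_client = boto3.client("s3")

def upload_image(image_bytes, bucket, key):
    """Upload image bytes to the given S3 bucket under the given key."""
    s3_client.put_object(Bucket=bucket, Key=key, Body=image_bytes)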

Now that you have the functions with the functionalities to download an image from the web and upload it to an S3 bucket, you can wire up both functions in the Lambda handler function by calling each function with the correct inputs.

Image recognition

Now let's implement the Lambda function responsible for sending the image to Amazon Rekognition for processing, storing the results in a DynamoDB table, and sending a message with those results as JSON to a second Amazon SNS topic. Open the image_recognition.py file from the python/recognition/runtime/ directory. This file contains an empty Lambda function handler and the input parameters needed to integrate this Lambda function:

queue_url is the URL of the Amazon SQS queue to which this Lambda function is subscribed,

table_name is the name of the DynamoDB table, and

topic_arn is the ARN of the Amazon SNS topic to which this Lambda function publishes.

Using CodeWhisperer, implement the business logic of the next Lambda function as you did in the previous section. For example, to detect the labels from an image using Amazon Rekognition, write the following comment:

# Detect labels from image with Rekognition

And as a result, CodeWhisperer should give you a code snippet similar to the one in the following image:

Figure 4. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called detect_labels with the implementation suggestion to use the Rekognition SDK to detect labels on the given image.
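One plausible shape for that suggestion (again, a sketch rather than the exact output):

import boto3

rekognition = boto3.client("rekognition")

def detect_labels(bucket, key):
    """Detect labels for an image stored in S3 using Amazon Rekognition."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=10,  # Illustrative limit; adjust to your needs
    )
    return response["Labels"]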

You can continue generating the other functions that you need to fully implement the business logic of your Lambda function. Here are some examples that you can use:

# Save labels to DynamoDB

# Publish item to SNS

# Delete message from SQS

Following the same approach, open the list_images.py file from the python/recognition/runtime/ directory to implement the logic to list all of the labels from the DynamoDB table. As you did previously, type a comment in plain English:

# Function to list all items from a DynamoDB table
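A plausible suggestion for this comment is a paginated table scan along these lines (sketch only; your suggestion may differ):

import boto3

dynamodb = boto3.resource("dynamodb")

def list_items(table_name):
    """Return all items stored in the given DynamoDB table."""
    table = dynamodb.Table(table_name)
    response = table.scan()
    items = response["Items"]
    # Keep scanning while DynamoDB reports that more pages remain
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response["Items"])
    return items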

Other frequently used code

Interacting with AWS isn't the only way that you can leverage CodeWhisperer. You can use it to implement repetitive tasks, such as creating unit tests and converting message formats, or to implement algorithms such as sorting, string matching, and parsing. The last Lambda function that we'll implement as part of this post converts a JSON payload received from Amazon SQS to XML. Then, we'll POST this XML to an HTTP endpoint.

Open the send_email.py file from the python/integration/runtime/ directory. This file contains an empty Lambda function handler. An event is a JSON-formatted document that contains data for a Lambda function to process. Type a comment with your intent to get the code snippet:

# Transform json to xml

As CodeWhisperer uses the context of your files to generate code, depending on the imports that you have on your file, you’ll get an implementation such as the one in the following image:

Figure 5. Screenshot of the code generated by CodeWhisperer on VS Code. It has a function called json_to_xml with the implementation suggestion to transform JSON payload into XML payload.
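One plausible shape for that suggestion, using only the Python standard library and assuming a flat JSON object:

import json
import xml.etree.ElementTree as ET

def json_to_xml(json_str, root_tag="message"):
    """Convert a flat JSON object into an XML string."""
    root = ET.Element(root_tag)
    # Create one child element per top-level JSON key
    for key, value in json.loads(json_str).items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")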

Repeat the same process with a comment such as # Send XML string with HTTP POST to get the last function implementation. Note that the email server isn’t part of this implementation. You can mock it, or simply ignore this HTTP POST step. Lastly, wire up both functions in the Lambda handler function by calling each function with the correct inputs.
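For the POST step, a sketch along the following lines would do; the endpoint is fictitious, and the requests library would need to be bundled with the Lambda deployment package:

import requests

def send_xml(xml_str, endpoint):
    """POST an XML payload to the given HTTP endpoint."""
    response = requests.post(
        endpoint,  # Hypothetical e-mail server API URL
        data=xml_str,
        headers={"Content-Type": "application/xml"},
        timeout=10,
    )
    response.raise_for_status()
    return response.status_code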

Deploy and test the application

To deploy the application, run the command cdk deploy --all. You should get a confirmation message, and after a few minutes your application will be up and running in your AWS account. As outputs, the APIStack and RekognitionStack will print the API Gateway endpoint URLs. They will look similar to this example:

Outputs:

APIStack.RESTAPIEndpoint01234567 = https://examp1eid0.execute-api.{your-region}.amazonaws.com/prod/

The first endpoint expects two string parameters: url (the image file URL to download) and name (the target file name that will be stored on the S3 bucket). Use any image URL you like, but remember that you must encode an image URL before passing it as a query string parameter to escape the special characters. Use an online URL encoder of your choice for that. Then, use the curl command to invoke the API Gateway endpoint:

curl -X GET 'https://examp1eid0.execute-api.eu-east-2.amazonaws.com/prod?url={encoded-image-URL}&name={file-name}'

Replace {encoded-image-URL} and {file-name} with the corresponding values. Also, make sure that you use the correct API endpoint that you’ve noted from the AWS CDK deploy command output as mentioned above.

It will take a few seconds for the processing to happen in the background. Once it’s ready, see what has been stored in the DynamoDB table by invoking the List Images API (make sure that you use the correct URL from the output of your deployed AWS CDK stack):

curl -X GET 'https://examp1eid7.execute-api.eu-east-2.amazonaws.com/prod'

After you’re done, to avoid unexpected charges to your account, make sure that you clean up your AWS CDK stacks. Use the cdk destroy command to delete the stacks.

Conclusion

In this post, we’ve seen how to get a significant productivity boost with the help of ML. With that, as a developer, you can stay focused on your IDE and reduce the time that you spend searching online for code snippets that are relevant for your use case. Writing comments in natural language, you get context-based snippets to implement full-fledged applications. In addition, CodeWhisperer comes with a mechanism called reference tracker, which detects whether a code recommendation might be similar to particular CodeWhisperer training data. The reference tracker lets you easily find and review that reference code and see how it’s used in the context of another project. Lastly, CodeWhisperer provides the ability to run scans on your code (generated by CodeWhisperer as well as written by you) to detect security vulnerabilities.

During the preview period, CodeWhisperer is available to all developers across the world for free. Get started with the free preview on JetBrains, VS Code or AWS Cloud9.

About the author:

Rafael Ramos

Rafael is a Solutions Architect at AWS, where he helps ISVs on their journey to the cloud. He spent over 13 years working as a software developer, and is passionate about DevOps and serverless. Outside of work, he enjoys playing tabletop RPG, cooking and running marathons.

Caroline Gluck

Caroline is an AWS Cloud application architect based in New York City, where she helps customers design and build cloud native data science applications. Caroline is a builder at heart, with a passion for serverless architecture and machine learning. In her spare time, she enjoys traveling, cooking, and spending time with family and friends.

Jason Varghese

Jason is a Senior Solutions Architect at AWS guiding enterprise customers on their cloud migration and modernization journeys. He has served in multiple engineering leadership roles and has over 20 years of experience architecting, designing and building scalable software solutions. Jason holds a bachelor’s degree in computer engineering from the University of Oklahoma and an MBA from the University of Central Oklahoma.

Dmitry Balabanov

Dmitry is a Solutions Architect with AWS where he focuses on building reusable assets for customers across multiple industries. With over 15 years of experience in designing, building, and maintaining applications, he still loves learning new things. When not at work, he enjoys paragliding and mountain trekking.

Building .NET 7 Applications with AWS CodeBuild

AWS CodeBuild is a fully managed DevOps service for building and testing your applications. As a fully managed service, there is no infrastructure to manage and you pay only for the resources that you use when you are building your applications. CodeBuild provides a default build image that contains the current Long Term Support (LTS) version of the .NET SDK.

Microsoft released the latest version of .NET in November. This release, .NET 7, includes performance improvements and new functionality, such as native ahead-of-time compilation (Native AOT). .NET 7 is a Standard Term Support release of the .NET SDK. At this point, CodeBuild's default image does not support .NET 7. For customers who want to start using .NET 7 right away in their applications, CodeBuild provides two means of customizing your build environment so that you can take advantage of .NET 7.

The first option for customizing your build environment is to provide CodeBuild with a container image you create and maintain. With this method, customers can define the build environment exactly as they need by including any SDKs, runtimes, and tools in the container image. However, this approach requires customers to maintain the build environment themselves, including patching and updating the tools. This approach will not be covered in this blog post.

A second means of customizing your build environment is by using the install phase of the buildspec file. This method uses the default CodeBuild image, and adds additional functionality at the point that a build starts. This has the advantage that customers do not have the overhead of patching and maintaining the build image.

Complete documentation on the syntax of the buildspec file can be found here:

https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html

Your application’s buildspec.yml file contains all of the commands necessary to build your application and prepare it for deployment. For a typical .NET application, the buildspec file will look like this:

```
version: 0.2
phases:
  build:
    commands:
      - dotnet restore Net7TestApp.sln
      - dotnet build Net7TestApp.sln
```

Note: This buildspec file contains only the commands to build the application; commands for packaging and storing build artifacts have been omitted for brevity.

In order to add the .NET 7 SDK to CodeBuild so that we can build your .NET 7 applications, we will leverage the install phase of the buildspec file. The install phase allows you to install any third-party libraries or SDKs prior to beginning your actual build.

```
install:
  commands:
    - curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --channel STS
```

The above command downloads the Microsoft install script for .NET and uses that script to download and install the latest version of the .NET SDK, from the Standard Term Support channel. This script will download files and set environment variables within the containerized build environment. You can use this same command to automatically pull the latest Long Term Support version of the .NET SDK by changing the command argument STS to LTS.

Your updated buildspec file will look like this:

```
version: 0.2
phases:
  install:
    commands:
      - curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --channel STS
  build:
    commands:
      - dotnet restore Net7TestApp/Net7TestApp.sln
      - dotnet build Net7TestApp/Net7TestApp.sln
```

Once you check in your buildspec file, you can start a build via the CodeBuild console, and your .NET application will be built using the .NET 7 SDK.

As your build runs you will see output similar to this:

```
Welcome to .NET 7.0!
---------------------
SDK Version: 7.0.100
Telemetry
---------
The .NET tools collect usage data in order to help us improve your experience. It is collected by Microsoft and shared with the community. You can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT environment variable to '1' or 'true' using your favorite shell.

Read more about .NET CLI Tools telemetry: https://aka.ms/dotnet-cli-telemetry
----------------
Installed an ASP.NET Core HTTPS development certificate.
To trust the certificate run 'dotnet dev-certs https --trust' (Windows and macOS only).
Learn about HTTPS: https://aka.ms/dotnet-https
----------------
Write your first app: https://aka.ms/dotnet-hello-world
Find out what's new: https://aka.ms/dotnet-whats-new
Explore documentation: https://aka.ms/dotnet-docs
Report issues and find source on GitHub: https://github.com/dotnet/core
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli
--------------------------------------------------------------------------------------
Determining projects to restore...
Restored /codebuild/output/src095190443/src/git-codecommit.us-east-2.amazonaws.com/v1/repos/net7test/Net7TestApp/Net7TestApp/Net7TestApp.csproj (in 586 ms).
[Container] 2022/11/18 14:55:08 Running command dotnet build Net7TestApp/Net7TestApp.sln
MSBuild version 17.4.0+18d5aef85 for .NET
Determining projects to restore...
All projects are up-to-date for restore.
Net7TestApp -> /codebuild/output/src095190443/src/git-codecommit.us-east-2.amazonaws.com/v1/repos/net7test/Net7TestApp/Net7TestApp/bin/Debug/net7.0/Net7TestApp.dll
Build succeeded.
0 Warning(s)
0 Error(s)
Time Elapsed 00:00:04.63
[Container] 2022/11/18 14:55:13 Phase complete: BUILD State: SUCCEEDED
[Container] 2022/11/18 14:55:13 Phase context status code: Message:
[Container] 2022/11/18 14:55:13 Entering phase POST_BUILD
[Container] 2022/11/18 14:55:13 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/11/18 14:55:13 Phase context status code: Message:
```

Conclusion

Adding .NET 7 support to AWS CodeBuild is easily accomplished by adding a single line to your application’s buildspec.yml file, stored alongside your application source code. This change allows you to keep up to date with the latest versions of .NET while still taking advantage of the managed runtime provided by the CodeBuild service.

About the author:

Tom Moore

Tom Moore is a Sr. Specialist Solutions Architect at AWS, and specializes in helping customers migrate and modernize Microsoft .NET and Windows workloads into their AWS environment.