Fully Automated Deployment of an Open Source Mail Server on AWS

Many AWS customers have the requirement to host their own email solution and prefer to operate mail servers over using fully managed solutions (e.g., Amazon WorkMail). While there are merits to either approach, motivations for self-managing often include:

full ownership and control
need for customized configuration
restricting access to internal networks or when connected via VPN.

To achieve this, customers frequently rely on open source mail servers because of their flexibility and free licensing compared to proprietary solutions like Microsoft Exchange. However, running and operating open source mail servers can be challenging, as several components need to be configured and integrated to provide the desired end-to-end functionality. For example, a fully functional email system combines dedicated software packages to send and receive email, access and manage inboxes, filter spam, manage users, and so on. This makes the setup complex and error-prone to configure and maintain, and troubleshooting issues often calls for expert knowledge. As a result, several open source projects have emerged that aim to simplify the setup of open source mail servers, such as Mail-in-a-Box, Docker Mailserver, Mailu, Modoboa, iRedMail, and several others.

In this blog post, we take this one step further by adding infrastructure automation and integrations with AWS services to fully automate the deployment of an open source mail server on AWS. We present an automated setup of a single instance mail server, striving for minimal complexity and cost, while still providing high resiliency by leveraging incremental backups and automations. As such, the solution is best suited for small to medium organizations that are looking to run open source mail servers but do not want to deal with the associated operational complexity.

The solution in this blog uses AWS CloudFormation templates to automatically set up and configure an Amazon Elastic Compute Cloud (Amazon EC2) instance running Mail-in-a-Box, which integrates features such as email, webmail, calendar, contacts, and file sharing, thus providing functionality similar to popular SaaS tools or commercial solutions. All resources to reproduce the solution are provided in a public GitHub repository under an open source license (MIT-0).

Amazon Simple Storage Service (Amazon S3) is used both for offloading user data and for storing incremental application-level backups. Aside from high resiliency, this backup strategy enables an immutable infrastructure approach, where new deployments can be rolled out to implement updates and recover from failures, which drastically simplifies operation and enhances security.

We also provide an optional integration with Amazon Simple Email Service (Amazon SES) so customers can relay their emails through reputable AWS servers and have their outgoing email accepted by third-party servers. All of this enables customers to deploy a fully featured open source mail server within minutes from the AWS Management Console, or restore an existing server from an Amazon S3 backup for immutable upgrades, migration, or recovery purposes.

Overview of Solution

The following diagram shows an overview of the solution and interactions with users and other AWS services.

After preparing the AWS account and environment, an administrator deploys the solution using an AWS CloudFormation template (1.). Optionally, a backup from Amazon S3 can be referenced during deployment to restore a previous installation of the solution (1a.). The admin can then proceed with setup by accessing the web UI (2.) to, for example, provision TLS certificates and create new users. After the admin has provisioned their accounts, users can access the web interface (3.) to send email, manage their inboxes, access calendar and contacts, and share files. Optionally, outgoing emails are relayed via Amazon SES (3a.) and user data is stored in a dedicated Amazon S3 bucket (3b.). Furthermore, the solution is configured to automatically and periodically create incremental backups and store them in an S3 bucket for backups (4.).

On top of popular open source mail server packages such as Postfix for SMTP and Dovecot for IMAP, Mail-in-a-Box integrates Nextcloud for calendar, contacts, and file sharing. However, note that Nextcloud capabilities in this context are limited. It’s primarily intended to be used alongside the core mail server functionality to maintain calendar and contacts and for lightweight file sharing (e.g., sharing files via links that are too large for email attachments). If you are looking for a fully featured, customizable, and scalable Nextcloud deployment on AWS, have a look at this AWS Sample instead.

Deploying the Solution

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account

An existing external email address to test your new mail server. In the context of this sample, we will use [email protected] as the address.
A domain that can be exclusively used by the mail server in this sample. In the context of this sample, we will use aws-opensource-mailserver.org as the domain. If you don’t have a domain available, you can register a new one with Amazon Route 53. In case you do so, you can go ahead and delete the associated hosted zone that gets automatically created via the Amazon Route 53 Console. We won’t need this hosted zone because the mail server we deploy will also act as the Domain Name System (DNS) server for the domain.
An SSH key pair for command line access to the instance. Command line access to the mail server is optional in this tutorial, but a key pair is still required for the setup. If you don’t already have a key pair, go ahead and create one in the EC2 Management Console (or via the AWS CLI, as sketched after this list):

(Optional) In this blog, we verify end-to-end functionality by sending an email to a single email address ([email protected]) leveraging Amazon SES in sandbox mode. In case you want to adapt this sample for your use case and send email beyond that, you either need to request removal of email sending limitations for EC2 or, if you relay your mail via Amazon SES, request moving out of the Amazon SES sandbox.
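If you prefer the command line for the SSH key pair prerequisite above, the following is a minimal sketch using the AWS CLI (the key pair name matches the .pem file used later in this post; any name works):

# Create the key pair and store the private key locally
aws ec2 create-key-pair --key-name aws-opensource-mailserver --query KeyMaterial --output text > aws-opensource-mailserver.pem
chmod 400 aws-opensource-mailserver.pem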

Preliminary steps: Setting up DNS and creating S3 Buckets

Before deploying the solution, we need to set up DNS and create Amazon S3 buckets for backups and user data.

1.     Allocate an Elastic IP address: We use the address 52.6.x.y in this sample.

2.     Configure DNS: If you have your domain registered with Amazon Route 53, you can use the AWS Management Console to change the name server and glue records for your domain. Configure two DNS servers ns1.box.<your-domain> and ns2.box.<your-domain> by placing your Elastic IP (allocated in step 1) into the Glue records field for each name server:

If you use a third-party DNS service, check their corresponding documentation on how to set the glue records.

It may take a while until the updates to the glue records propagate through the global DNS system. Optionally, before proceeding with the deployment, you can verify that your glue records are set up correctly with the dig command line utility:

# Get a list of name servers for your top-level domain
dig +short org. NS
# Query one of these name servers for the NS records of your domain
dig @c0.org.afilias-nst.info. aws-opensource-mailserver.org. NS

This should give you output as follows:

;; ADDITIONAL SECTION:
ns1.box.aws-opensource-mailserver.org. 3600 IN A 52.6.x.y
ns2.box.aws-opensource-mailserver.org. 3600 IN A 52.6.x.y

3.     Create S3 buckets for backups and user data: Finally, in the Amazon S3 Console, create a bucket to store Nextcloud data and another bucket for backups, choosing globally unique names for both. In the context of this sample, we will be using the two buckets (aws-opensource-mailserver-backup and aws-opensource-mailserver-nextcloud) as shown here:
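If you prefer the command line, both buckets can also be created with the AWS CLI (a sketch using the sample bucket names from above; your names must be globally unique):

# Create the backup and Nextcloud data buckets in your configured Region
aws s3 mb s3://aws-opensource-mailserver-backup
aws s3 mb s3://aws-opensource-mailserver-nextcloud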

Deploying and Configuring Mail-in-a-Box

Launch the CloudFormation stack and specify the parameters as shown in the screenshot below to match the resources created in the previous section. Leave the other parameters at their default values, then click Next and Submit.

This will deploy your mail server into a public subnet of your default VPC, which takes about 10 minutes. You can monitor the progress in the AWS CloudFormation Console. Meanwhile, retrieve and note the admin password for the web UI from AWS Systems Manager Parameter Store via the MailInABoxAdminPassword parameter.
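You can also read the password with the AWS CLI (a sketch assuming the CLI is configured for the deployment Region; the parameter name matches the one mentioned above, so adjust it if your stack exposes it under a different name or path):

aws ssm get-parameter --name MailInABoxAdminPassword --with-decryption --query Parameter.Value --output text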

Roughly one minute after your mail server finishes deploying, you can log in at its admin web UI residing at https://52.6.x.y/admin with username admin@<your-domain>, as shown in the following picture (you need to confirm the certificate exception warning from your browser):

Finally, in the admin UI navigate to System > TLS(SSL) Certificates and click Provision to obtain a valid SSL certificate and complete the setup (you might need to click on Provision twice to have all domains included in your certificate, as shown here).

At this point, you could further customize your mail server setup (e.g., by creating inboxes for additional users). However, we will continue to use the admin user in this sample for testing the setup in the next section.

Note: If your AWS account is subject to email sending restrictions on EC2, you will see an error in your admin dashboard under System > System Status Checks that says ‘Incoming Email (SMTP/postfix) is running but not publicly accessible’. You are safe to ignore this and should be able to receive emails regardless.

Testing the Solution

Receiving Email

With your existing email account, compose and send an email to admin@<your-domain>. Then log in as admin@<your-domain> to the webmail UI of your AWS mail server at https://box.<your-domain>/mail and verify you received the email:

Test file sharing, calendar and contacts with Nextcloud

Your Nextcloud installation can be accessed at https://box.<your-domain>/cloud, as shown in the next figure. Here you can manage your calendar, contacts, and shared files. Contacts created and managed here are also accessible in your webmail UI when you compose an email. Refer to the Nextcloud documentation for more details. To keep your Nextcloud installation consistent and automatically managed by the Mail-in-a-Box setup scripts, admin users are advised to refrain from changing or customizing the Nextcloud configuration.

Sending Email

For this sample, we use Amazon SES to forward your outgoing email, as this is a simple way to get the emails you send accepted by other mail servers on the web. Achieving this is not trivial otherwise, as several popular email services tend to block public IP ranges of cloud providers.

Alternatively, if your AWS account has the email sending limitations for EC2 removed, you can send emails directly from your mail server. In this case, you can skip the next section and continue with Send test email, but make sure you’ve deployed your mail server stack with the SesRelay parameter set to false. In that case, you can also bring your own IP addresses to AWS and continue using your reputable addresses or build reputation for addresses you own.

Verify your domain and existing email address with Amazon SES

In order to use Amazon SES to accept and forward email for your domain, you first need to prove ownership of it. Navigate to Verified Identities in the Amazon SES Console, click Create identity, select Domain, and enter your domain. You will then be presented with a screen as shown here:

You now need to copy-paste the three CNAME DNS records from this screen over to your mail server admin dashboard. Open the admin web UI of your mail server again, select System > Custom DNS, and add the records as shown in the next screenshot.

Amazon SES will detect these records, thereby recognizing you as the owner and verifying the domain for sending emails. Similarly, while still in sandbox mode, you also need to verify ownership of the recipient email address. Navigate again to Verified Identities in the Amazon SES Console, click Create identity, choose Email Address, and enter your existing email address.

Amazon SES will then send a verification link to this address, and once you’ve confirmed via the link that you own this address, you can send emails to it.
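These verification steps can also be scripted with the AWS CLI (a sketch using the SES v2 commands in the same Region; the domain is the sample domain from this post and the recipient address is a placeholder you should replace):

# Create the domain identity; the response contains the DKIM tokens that correspond to the CNAME records shown in the console
aws sesv2 create-email-identity --email-identity aws-opensource-mailserver.org
# Create the recipient address identity; Amazon SES sends the verification link to that inbox
aws sesv2 create-email-identity --email-identity <your-existing-email-address>
# Check the verification status of the domain
aws sesv2 get-email-identity --email-identity aws-opensource-mailserver.org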

Summing up, your verified identities section should look similar to the next screenshot before sending the test email:

Finally, if you intend to send email to arbitrary addresses with Amazon SES beyond testing in the next step, refer to the documentation on how to request production access.

Send test email

Now you are set to log back into your webmail UI and reply to the test mail you received before:

Checking the inbox of your existing mail, you should see the mail you just sent from your AWS server.

Congratulations! You have now verified full functionality of your open source mail server on AWS.

Restoring from backup

Finally, as a last step, we demonstrate how to roll out immutable deployments and restore from a backup for simple recovery, migration and upgrades. In this context, we test recreating the entire mail server from a backup stored in Amazon S3.

For that, we use the restore feature of the CloudFormation template we deployed earlier to migrate from the initial t2.micro installation to an AWS Graviton arm64-based t4g.micro instance. This exemplifies the power of the immutable infrastructure approach made possible by the automated application level backups, allowing for simple migration between instance types with different CPU architectures.

Verify you have a backup

By default, your server is configured to create an initial backup upon installation and nightly incremental backups. Using your ssh key pair, you can connect to your instance and trigger a manual backup to make sure the emails you just sent and received when testing will be included in the backup:

ssh -i aws-opensource-mailserver.pem [email protected] sudo /opt/mailinabox/management/backup.py

You can then go to your mail server’s admin dashboard at https://box.<your-domain>/admin and verify the backup status under System > Backup Status:
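You can also confirm that backup objects are arriving in Amazon S3 (a sketch using the sample backup bucket name from this post):

aws s3 ls s3://aws-opensource-mailserver-backup/ --recursive --human-readable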

Recreate your mail server and restore from backup

First, double-check that you have saved the admin password, as you will no longer be able to retrieve it from Parameter Store once you delete the original installation of your mail server. Then go ahead and delete the aws-opensource-mailserver stack from your CloudFormation Console and redeploy it. This time, however, adapt the parameters as shown below, changing the instance type and corresponding AMI as well as specifying the prefix in your backup S3 bucket to restore from.

Within a couple of minutes, your mail server will be up and running again, in exactly the same state it was in before you deleted it, but running on a completely new instance powered by AWS Graviton. You can verify this by going to your webmail UI at https://box.<your-domain>/mail and logging in with your old admin credentials.

Cleaning up

Delete the mail server stack from the CloudFormation Console

Empty and delete both the backup and Nextcloud data S3 buckets

Release the Elastic IP

In case you registered your domain with Amazon Route 53 and do not want to hold onto it, disable automatic renewal. Further, if you haven’t already, delete the hosted zone that was created automatically when registering it.
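If you prefer the command line, the same cleanup can be scripted with the AWS CLI (a sketch assuming the sample names used in this post; look up the Elastic IP allocation ID before releasing it):

# Delete the mail server stack
aws cloudformation delete-stack --stack-name aws-opensource-mailserver
# Empty and delete both buckets
aws s3 rb s3://aws-opensource-mailserver-backup --force
aws s3 rb s3://aws-opensource-mailserver-nextcloud --force
# Find and release the Elastic IP
aws ec2 describe-addresses --query 'Addresses[].[PublicIp,AllocationId]' --output text
aws ec2 release-address --allocation-id <allocation-id>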

Outlook

The solution discussed so far focuses on minimal operational complexity and cost and hence is based on a single Amazon EC2 instance comprising all functions of an open source mail server, including a management UI, user database, Nextcloud, and DNS. With a suitably sized instance, this setup can meet the demands of small to medium organizations. In particular, the continuous incremental backups to Amazon S3 provide high resiliency and can be leveraged in conjunction with the CloudFormation automations to quickly recover in case of instance or single Availability Zone (AZ) failures.

Depending on your requirements, extending the solution and distributing components across AZs allows for meeting more stringent requirements regarding high availability and scalability in the context of larger deployments. Being based on open source software, there is a straightforward migration path towards these more complex distributed architectures once you outgrow the setup discussed in this post.

Conclusion

In this blog post, we showed how to automate the deployment of an open source mail server on AWS and how to quickly and effortlessly restore from a backup for rolling out immutable updates and providing high resiliency. Using AWS CloudFormation infrastructure automations and integrations with managed services such as Amazon S3 and Amazon SES, the lifecycle management and operation of open source mail servers on AWS can be simplified significantly. Once deployed, the solution provides an end-user experience similar to popular SaaS and commercial offerings.

You can go ahead and use the automations provided in this blog and the corresponding GitHub repository to get started with running your own open source mail server on AWS!

Announcing Snapchange: An Open Source KVM-backed Snapshot Fuzzing Framework

Fuzz testing, or fuzzing, is a commonly used technique for discovering bugs in software, and is useful in many different domains. However, the task of configuring a target to be fuzzed can be a laborious one, often involving refactoring large code bases to enable fuzz testing.

Today we are happy to announce Snapchange, a new open source project to make snapshot-based fuzzing much easier. Snapchange enables a target binary to be fuzzed with minimal modifications, providing useful introspection that aids in fuzzing. Snapchange is a Rust framework for building fuzzers that replay physical memory snapshots in order to increase efficiency and reduce complexity in fuzzing many types of targets. Snapchange utilizes the features of the Linux kernel’s built-in virtual machine manager known as kernel virtual machine or KVM. While Snapchange is agnostic to the target operating system, the included snapshot mechanism focuses on Linux-based targets for gathering the necessary debug information.

Snapchange started as an experiment by the AWS Find and Fix (F2) open source security research team to explore the potential of using KVM in enabling snapshot fuzzing. Snapshot fuzzing is a growing area of interest and experimentation among security researchers. Snapchange has now grown into a project that aims to provide a friendly experience for researchers and developers to experiment with snapshot fuzzing. Snapchange is one of a number of tools and techniques used by the F2 team in its research efforts aimed at creating a secure and trustworthy open source supply chain for AWS and its customers. We have found it sufficiently useful that we are sharing it with the broader research community.

Snapchange is available today under the Apache License 2.0 via GitHub. AWS F2 team is actively supporting Snapchange and has plans for new features, but we hope to engage the security research community to produce a more richly-featured and robust tool over the longer term. We welcome pull requests on GitHub and look forward to discussions that help enable future research via the project. In this blog post we’ll walk through a set of tutorials you’ll find in the repository to help provide a deeper understanding of Snapchange.

Note: Snapchange operates within a Linux operating system but requires direct access to underlying KVM primitives. Thus, it is compatible with EC2 bare metal instance types, which run without a hypervisor, but not with EC2 virtualized instances. While we provide an EC2 AMI to make it easier to get started by launching on a bare metal instance (more information on that provided below), users are free to run Snapchange in other environments that meet the basic hardware requirements.

What is snapshot fuzzing?

Fuzzing uncovers software issues by monitoring how the system behaves while processing data, especially data provided as an input to the system. Fuzzing attempts to answer the question: What happens when the data is structured in a way that is outside the scope of what the software expects to receive? If the software is bug-free, there should be no input, no matter how inappropriate or corrupt, that causes it to crash. All input data should be properly validated and either pass validation and be used, or be rejected with a well-defined error result. Any input data that causes the software to crash shows a flaw and a potential weakness in the software. For example, fuzzing an application that renders JPEGs could involve mutating a sample JPEG file and opening the mutated JPEG in the application. If the application crashes or otherwise behaves in unexpected ways, this mutated file might have uncovered an issue.

The chosen mutations, however, are not truly random. We typically guide a fuzzer using "coverage" techniques. Coverage measures which code path an input from the fuzzer caused to be executed in the target, and is used to automatically guide the fuzzer to modify its subsequent inputs so that the execution path in the target changes to include previously untested portions of the target's code. Information about the sections of code in the target that were previously executed is cached in an input corpus, and that information is used to guide new inputs to explore additional code paths. In this way, variations on the same inputs will be applied in the expectation of discovering more previously untested code sections in the target, until all parts of the target code which can be reached by a possible execution path have been tested.

A snapshot is a pairing of a physical memory dump of a running VM and its accompanying register state. Fuzzing with a snapshot enables granular execution in order to reach code blocks that are traditionally difficult to fuzz without the complexities of managing state within the target. The only information needed by Snapchange in order to continue the execution of the target in a virtual machine is the snapshot itself. Prior work exploring this technique includes brownie, falkervisor, chocolate_milk, Nyx, and what the fuzz. Most of these other tools require booting into a custom hypervisor on bare metal or with a modified KVM and kernel module. Snapchange can be used in environments where booting into a custom hypervisor isn't straightforward. As noted, it can also be used on EC2 bare metal instances that boot without any hypervisor at all.

How Snapchange Works

Snapchange fuzzes a target by injecting mutated data in the virtual machine and provides a breakpoint-based hooking mechanism, real-time coverage reports in a variety of formats (such as Lighthouse and LCOV), and single-step traces useful for debugging. With Snapchange, you can fuzz a given physical memory snapshot across multiple CPU cores in parallel, while monitoring for crashing states such as a segmentation fault or a call to an Address Sanitizer report.

While Snapchange doesn’t care how a snapshot is obtained, it includes one method which uses a patched QEMU instance via the included qemu_snapshot utility. This snapshot is then used as the initial state of a KVM virtual machine to fuzz a target.

The fuzzing loop starts by initializing the memory of a whole KVM virtual machine with the physical memory and register state of the snapshot. Snapchange then gives the user the ability to write a mutated input in the loaded guest’s memory. The virtual machine is then executed until a crash, timeout, or reset event occurs. At this point, the virtual machine will revert back to a clean state. The guest memory is restored to the original snapshot’s memory in preparation for the next input case. In order to avoid writing the entire snapshot memory on every reset, only pages that were modified during execution are restored. This significantly reduces the amount of memory which needs to be restored, speeding up the fuzzing cycle, and allowing more time to be spent fuzzing the target.
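As a minimal toy sketch in Rust (not Snapchange's actual implementation; the type names and page size here are illustrative assumptions), the dirty-page reset described above boils down to tracking which pages a fuzz case touched and copying only those back from the pristine snapshot:

use std::collections::HashSet;

const PAGE_SIZE: usize = 4096;

struct GuestMemory {
    pages: Vec<[u8; PAGE_SIZE]>, // current guest "physical" memory
    dirty: HashSet<usize>,       // indices of pages written during the current fuzz case
}

impl GuestMemory {
    // Write bytes into guest memory, tracking which pages become dirty
    fn write(&mut self, addr: usize, data: &[u8]) {
        for (offset, byte) in data.iter().enumerate() {
            let page = (addr + offset) / PAGE_SIZE;
            self.dirty.insert(page);
            self.pages[page][(addr + offset) % PAGE_SIZE] = *byte;
        }
    }

    // Restore only the dirty pages from the pristine snapshot instead of
    // copying the entire memory image on every reset
    fn reset(&mut self, snapshot: &[[u8; PAGE_SIZE]]) {
        for &page in &self.dirty {
            self.pages[page] = snapshot[page];
        }
        self.dirty.clear();
    }
}

fn main() {
    // Pristine snapshot of 16 pages; the guest starts as an exact copy of it
    let snapshot = vec![[0u8; PAGE_SIZE]; 16];
    let mut guest = GuestMemory { pages: snapshot.clone(), dirty: HashSet::new() };

    for case in 0..3u8 {
        guest.write(0x1000, &[case; 16]); // inject a mutated input at a known address
        // ... run the VM until a crash, timeout, or reset event ...
        guest.reset(&snapshot);           // cheap reset: only dirty pages are restored
    }
    assert!(guest.dirty.is_empty());
}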

This ability to arbitrarily reset guest memory enables precise choices when harnessing a fuzz target. With snapshots, the harnessing effort involves discovering where in memory the relevant input resides. For example, instead of having to rewrite a networked application to take input packets from command line or stdin, we can use a debugger to break immediately after a recv call. Pausing execution at this point, we can document the buffer that was read into, for example address 0x6000_0000_0100, and take a snapshot of the system with this memory address in mind. Once the snapshot is loaded via Snapchange, we can write a mutated input packet to address 0x6000_0000_0100 and continue executing the target as if it were a real packet. This precisely mimics what would happen if a corrupt or malicious packet was read off the network in a real-world scenario.

Experimenting with Snapchange

Snapchange, along with several example targets, can be found on GitHub. Because Snapchange relies on KVM for executing a snapshot, Snapchange must be used on a machine that has KVM access. Currently, Snapchange only supports x64 hosts and snapshots. As previously noted, Snapchange can be used in Amazon EC2 on a wide variety of .metal instances based on Intel processors, for example, a c6i.metal instance. There is also a public AMI containing Snapchange, with the examples pre-built and pre-snapshotted. The pre-built AMI is ami-008dec48252956ad5 in the US-East-2 region. For more information about using an AMI, check out the Get started with Amazon EC2 Linux instances tutorial. You can also install Snapchange in your own environment if you have access to supported hardware.

This blog will go over the first example in the Snapchange repository to demonstrate some of the features provided. For a more step-by-step walk-through, check out the full tutorial in the README for the 01_getpid example here.

Example target

We’ll start with the first example in Snapchange to demonstrate some of its features.

There are two goals for this target:

The input data buffer must solve for the string fuzzmetosolveme!

The return value from getpid() must be modified to be 0xdeadbeef

// harness/example1.c

void fuzzme(char* data) {
    int pid    = getpid();

    // Correct solution: data == "fuzzmetosolveme!", pid == 0xdeadbeef
    if (data[0]  == 'f')
    if (data[1]  == 'u')
    if (data[2]  == 'z')
    if (data[3]  == 'z')
    if (data[4]  == 'm')
    if (data[5]  == 'e')
    if (data[6]  == 't')
    if (data[7]  == 'o')
    if (data[8]  == 's')
    if (data[9]  == 'o')
    if (data[10] == 'l')
    if (data[11] == 'v')
    if (data[12] == 'e')
    if (data[13] == 'm')
    if (data[14] == 'e')
    if (data[15] == '!') {
        pid = getpid();
        if (pid == 0xdeadbeef) {
            // BUG
            *(int*)0xcafecafe = 0x41414141;
        }
    }

    return;
}

When taking the snapshot, we logged that the input buffer being fuzzed is located at 0x555555556004.

SNAPSHOT Data buffer: 0x555555556004

It is the fuzzer’s job to write an input test case to address 0x5555_5555_6004 to begin fuzzing. Let’s look at how Snapchange handles coverage with breakpoints.

Coverage Breakpoints

Snapchange gathers its coverage of a target using breakpoints. In the snapshot directory, an optional .covbps file containing virtual addresses in the guest can be created. Because the snapshot is static, we can use hard coded memory addresses as part of the fuzzing process. During initialization, a breakpoint is inserted into the guest memory at every address found in the coverage breakpoint file. If any coverage breakpoint is hit, it means the current input executed a new piece of the target for the first time. The input is saved into the input corpus for future use and the breakpoint is removed. Removing coverage breakpoints as they are encountered means that the fuzzer only pays the cost of each coverage breakpoint once.

One approach using these coverage breakpoints is to trigger on new basic blocks from the control flow graph of a target. There are a few utility scripts included in Snapchange to gather these basic blocks using Binary Ninja, Ghidra, and radare2.

The example coverage breakpoint file of the basic blocks found in example1.bin is in snapshot/example1.bin.ghidra.covbps

$ head ./snapshot/example1.bin.ghidra.covbps

0x555555555000
0x555555555014
0x555555555016
0x555555555020
0x555555555070
0x555555555080
0x555555555090
0x5555555550a0
0x5555555550b0
0x5555555550c0

Writing a fuzzer

To begin fuzzing with Snapchange, we can write the fuzzer specific for this target in Rust.

// src/fuzzer.rs

#[derive(Default)]
pub struct Example1Fuzzer;

impl Fuzzer for Example1Fuzzer {
    type Input = Vec<u8>; // [0]
    const START_ADDRESS: u64 = 0x5555_5555_5344;
    const MAX_INPUT_LENGTH: usize = 16; // [1]
    const MAX_MUTATIONS: u64 = 2; // [3]

    fn set_input(&mut self, input: &Self::Input, fuzzvm: &mut FuzzVm<Self>) -> Result<()> {
        // Write the mutated input
        fuzzvm.write_bytes_dirty(VirtAddr(0x5555_5555_6004), CR3, input)?; // [2]

        Ok(())
    }
}

A few notes about this fuzzer:

The fuzzer uses input of type Vec<u8> ([0]). This tells Snapchange to provide the default mutation strategies for a vector of bytes.

Note: This is an abstract type, so the fuzzer can provide a custom mutator/generator if they choose.

The maximum length of a generated input will be 16 bytes ([1])
The fuzzer is passed a mutated Vec<u8> in set_input. This input is then written to the address of the buffer logged during the snapshot (0x5555_5555_6004) via the call to write_bytes_dirty ([2]).

Note: This address is from the printf("SNAPSHOT Data buffer: %p\n", data); line in the harness

The fuzzer will apply, at most, two mutations per input case ([3])

Snapchange provides an entry point to the main command-line utility that takes an abstract Fuzzer, like the one we have written. This will be the entry point for our fuzzer as well.

// src/main.rs

fn main() {
    snapchange_main::<fuzzer::Example1Fuzzer>().expect("Error in Example 1");
}

Building this, we can verify that the project and snapshot directories are set up properly by attempting to translate the starting instruction pointer address from the snapshot. Snapchange provides a project translate command for doing virtual to physical memory translations from the snapshot and attempting to disassemble the bytes found at the translated physical address. We can disassemble from the fuzzme function in the snapshot with the following:

$ cargo run -r -- project translate fuzzme

With the confirmation that the project’s directory structure is set up properly, we can begin fuzzing!

Starting fuzzing!

Snapchange has a fuzz command which can execute across a configurable number of cores in parallel. To begin fuzzing, Snapchange will start a number of virtual machines with the physical memory found in the snapshot directory. Snapchange will then choose an input from the current corpus (or generate one if one doesn’t exist), mutate it with a variety of techniques, and then write it into the guest via the set_input() function we wrote. If any new coverage has been seen, the mutated input will be saved in the corpus for future use. If a crash is found, the crashing input will be saved for further analysis.

The example is looking for the password fuzzmetosolveme! by checking each byte in the input one at a time. This pattern creates a new location for coverage to find for each byte. If the mutation randomly finds the next byte in the password, that input is saved in the corpus to be used later to discover the next byte, until the entire password is uncovered.

We began fuzzing with 8 cores for this example.

$ cargo run -r -- fuzz --cores 8

The fuzz terminal user interface (TUI) is brought up with several pieces of information used to monitor the fuzzing:

Execution time
Basic core statistics for number of executions per second overall and per core
Amount of coverage seen
Number of crashes seen
Average number of dirty pages needed to reset a guest
Number of cores currently alive
Basic coverage graph

The TUI also includes performance information about where time is being spent in the fuzzer as well as information about the reasons a virtual machine is exiting. This information is useful to have for understanding if the fuzzer is actually spending relevant time fuzzing or if the fuzzer is doing extraneous computation that is causing a performance degradation.

For example, this fuzz run is only spending 14%  of the total execution time in the guest virtual machine fuzzing the target. For some targets, this could present an opportunity to improve the performance of the fuzzer. Ideally, we want the fuzzer to be working in the guest virtual machine as much as possible. This test case is so small, though, that this number is to be expected, but it is still useful to keep in mind for more complex targets.

Lastly, there is a running list of recently-hit coverage to present a quick glance at what the fuzzer has recently uncovered in the target.

When fuzzing, the current coverage state seen by the fuzzer is written to disk, in real time, in a variety of formats: raw addresses, module+offset for usage in tools like Lighthouse, and (if debug information is available) LCOV format used for graphically annotating source code with this coverage information. This allows the developer or researcher to review the coverage to understand what the fuzzer is actually accomplishing to help them iterate on the fuzzer for potentially better results.

LCOV coverage displayed mid-fuzz session

Coverage displayed using Lighthouse in Binary Ninja

After some time, the fuzzer finds the correct input string to solve the first part of the target. We can look at the current corpus of the fuzzer in ./snapshot/current_corpus.

$ xxd snapshot/current_corpus/c2b9b72428f4059c
┌────────┬─────────────────────────┬─────────────────────────┬────────┬────────┐
│00000000│ 66 75 7a 7a 6d 65 74 6f ┊ 73 6f 6c 76 65 6d 65 21 │fuzzmeto┊solveme!│
└────────┴─────────────────────────┴─────────────────────────┴────────┴────────┘

Snapchange hooks

With the password discovered, the second half of the target revolves around getting getpid() to return an arbitrary value. This value isn’t expected to be returned from getpid(), but we can use Snapchange’s introspection features to force this result to happen. Snapchange includes breakpoint callbacks as a technique to introspect and modify the guest, such as by patching functions. Here is one example of forcing getpid() to always return the value 0xdeadbeef for our fuzzer.

fn breakpoints(&self) -> Option<&[Breakpoint<Self>]> {
    Some(&[
        Breakpoint {
            lookup: BreakpointLookup::SymbolOffset("libc.so.6!__GI___getpid", 0x0),
            bp_type: BreakpointType::Repeated,
            bp_hook: |fuzzvm: &mut FuzzVm<Self>, _input, _fuzzer| {
                // Set the return value to 0xdeadbeef
                fuzzvm.set_rax(0xdead_beef); // [0]

                // Fake an immediate return from the function by setting RIP to the
                // value popped from the stack (this assumes the function was entered
                // via a `call`)
                fuzzvm.fake_immediate_return()?; // [1]

                // Continue execution
                Ok(Execution::Continue) // [2]
            },
        }
    ])
}

The fuzzer sets a breakpoint on the address for the symbol libc.so.6!__GI___getpid. When the breakpoint is triggered, the bp_hook function is called with the guest virtual machine (fuzzvm) as an argument. The return value for the function is stored in register rax, so we can set the value of rax to 0xdeadbeef via fuzzvm.set_rax(0xdeadbeef) [0]. We want the function to immediately return and not continue executing getpid(), so we fake the returning of the function by calling fuzzvm.fake_immediate_return() [1] to set the instruction pointer to the value on the top of the stack and Continue execution of the guest at this point [2] (rather than forcing the guest to reset).

We aren’t restricted to user space breakpoints. We could also force getpid() to return 0xdeadbeef by patching the call in the kernel in __task_pid_nr_ns. At offset 0xb3 in __task_pid_nr_ns, we patch the moment the PID is read from memory and returned to the user from the kernel.

/*
// Single step trace from `cargo run -r -- trace ./snapshot/current_corpus/c2b9b72428f4059c`
INSTRUCTION 1162 0xffffffff810d0ed3 0x6aa48000 | __task_pid_nr_ns+0xb3
mov eax, dword ptr [rbp+0x50]
EAX:0x0
[RBP:0xffff88806c91c000+0x50=0xffff88806c91c050 size:UInt32->0xe6]]
[8b, 45, 50]
*/
Breakpoint {
    lookup: BreakpointLookup::SymbolOffset("__task_pid_nr_ns", 0xb3),
    bp_type: BreakpointType::Repeated,
    bp_hook: |fuzzvm: &mut FuzzVm<Self>, _input, _fuzzer| {
        // The instruction retrieving the PID is
        // mov eax, dword ptr [rbp+0x50]
        // Write the 0xdeadbeef value into the memory at `rbp + 0x50`

        // Get the current `rbp` value
        let rbp = fuzzvm.rbp();
        let val: u32 = 0xdeadbeef;

        // Write the wanted 0xdeadbeef in the memory location read in the
        // kernel
        fuzzvm.write_bytes_dirty(VirtAddr(rbp + 0x50), CR3, &val.to_le_bytes())?;

        // Continue execution
        Ok(Execution::Continue)
    },
},

With getpid patched, we can continue fuzzing the target and check the Crashes tab in the TUI.

This looks like we’ve detected a segmentation fault (SIGSEGV) for address 0xcafecafe from the bug found in the target:

// BUG
*(int*)0xcafecafe = 0x41414141;

Single Step Traces

With a crash in hand, Snapchange can give us a single step trace using the crash as an input.

$ cargo run -r -- trace ./snapshot/crashes/SIGSEGV_addr_0xcafecafe_code_AddressNotMappedToObject/c2b9b72428f4059c

This will give the state of the system at the time of the reset as well as the single step trace of the execution path. Notice that the guest reset on the force_sig_fault kernel function. This function is hooked by Snapchange to monitor for crashing states.

The single step trace is written to disk containing all instructions executed as well as the register state during each instruction. The trace includes:

Decoded instruction
State of the involved registers and memory for the given instruction
Assembly bytes for the instruction (useful for patching)
Source code where this assembly originated (if debug information is available)

What’s next for Snapchange?

The team is excited to hear from you and the community at large. We have ideas for more features and other analysis that can aid in fuzzing efforts and are interested in hearing what features the community is looking for in their fuzzing workflows. We’d also love feedback from you about your experience writing fuzzers using Snapchange on Snapchange’s GitHub. If this blog has sparked your curiosity, check out the other real-world examples included in the Snapchange repository.

Introducing AWS Libcrypto for Rust, an Open Source Cryptographic Library for Rust

Today we are excited to announce the availability of AWS Libcrypto for Rust (aws-lc-rs), an open source cryptographic library for Rust software developers with FIPS cryptographic requirements. At our 2022 AWS re:Inforce talk we introduced our customers to AWS Libcrypto (AWS-LC), and our investment in and improvements to open source cryptography. Today we continue that mission by releasing aws-lc-rs, a performant cryptographic library for Linux (x86, x86-64, aarch64) and macOS (x86-64) platforms.

Rust developers increasingly need to deploy applications that meet US and Canadian government cryptographic requirements. We evaluated how to deliver FIPS validated cryptography in idiomatic and performant Rust, built around our AWS-LC offering. We found that the popular ring (v0.16) library fulfilled much of the cryptographic needs in the Rust community, but it did not meet the needs of developers with FIPS requirements. Our intention is to contribute a drop-in replacement for ring that provides FIPS support and is compatible with the ring API. Rust developers with prescribed cryptographic requirements can seamlessly integrate aws-lc-rs into their applications and deploy them into AWS Regions.

AWS-LC is the foundation of aws-lc-rs. AWS-LC is the general-purpose cryptographic library for the C programming language at AWS. It is a fork from Google’s BoringSSL, with features and performance enhancements developed by AWS, such as FIPS support, formal verification for validating implementation correctness, performance improvements on Arm processors for ChaCha20-Poly1305 and NIST P-256 algorithms, and improvements to ECDSA signature verification for NIST P-256 curves on x86 based platforms. aws-lc-rs leverages these AWS-LC optimizations to improve performance in Rust applications. AWS-LC has been submitted to an accredited lab for FIPS validation testing, and upon completion will be submitted to NIST for certification. Once NIST grants a validation certificate to AWS-LC, we will make an announcement to Rust developers on how to leverage the FIPS mode in aws-lc-rs.

We used rustls, a Rust library that provides TLS 1.2 and 1.3 protocol implementations to benchmark aws-lc-rs performance. We ran a set of benchmark scenarios on c7g.metal (Graviton3) and c6i.metal (x86-64) Amazon Elastic Compute Cloud (Amazon EC2) instance types. The graph below shows the improvement of a TLS client negotiating new connections using aws-lc-rs. aws-lc-rs with rustls significantly improves throughput in each of the algorithms tested, and for every hardware platform. We are excited to share aws-lc-rs and these cryptographic improvements with the Rust community today. We are continually evaluating our benchmarks for opportunities to improve aws-lc-rs.

Getting Started

Incorporating aws-lc-rs in your project is straightforward. Let’s look at how you can use aws-lc-rs in your Rust application by creating a SHA256 digest of the message “Hello Blog Readers!” An enhanced version of this digest example is available in the aws-lc-rs repository.

Install the Rust toolchain with rustup, if you do not already have it.
Initialize a new cargo package for your Rust project, if you don’t already have one:
$ cargo new --bin aws-lc-rs-example

Record aws-lc-rs as a dependency in the project Cargo file:
$ cargo add aws-lc-rs

Applications already using the ring (v0.16.x) API: If your application already leverages the ring API, you can easily test and benchmark your application against aws-lc-rs without changing your application’s use declarations:
$ cargo remove ring
$ cargo add --rename ring aws-lc-rs

Edit src/main.rs:

use aws_lc_rs::digest::{digest, SHA256};

fn main() {
    const MESSAGE: &[u8] = b"Hello Blog Readers!";

    let output = digest(&SHA256, MESSAGE);

    for v in output.as_ref() {
        print!("{:02x}", *v);
    }
    println!();
}

Compile and run the program:
$ cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.04s
Running `target/debug/aws-lc-rs-example`
c10843d459bf8e2fa6000d59a95c0ae57966bd296d9e90531c4ec7261460c6fb

Conclusion

Rust developers increasingly need to deploy applications that meet US and Canadian government cryptographic requirements. In this post, you learned how we are building aws-lc-rs in order to bring FIPS compliant cryptography to Rust applications. Together AWS-LC and aws-lc-rs bring performance improvements for Arm and x86-64 processor families for commonly used cryptographic algorithms. If you are interested in using or contributing to aws-lc-rs source code or documentation, they are publicly available under the terms of either the Apache Software License 2.0 or ISC License from our GitHub repository. We use GitHub Issues for managing feature requests or bug reports. You can follow aws-lc-rs on crates.io for notifications about new releases. If you discover a potential security issue in aws-lc-rs or AWS-LC, we ask that you notify AWS Security using our vulnerability reporting page.

Behind the Scenes on AWS Contributions to Cloud Native Open Source Projects

Amazon Elastic Kubernetes Service (Amazon EKS) is well known in the Kubernetes community. But few realize that AWS engineers are closely involved and contributing upstream to Kubernetes and to many more cloud native open source projects.

In the past year alone, AWS contributed significantly to containerd, Cortex, etcd, Fluentd, nerdctl, Notary, OpenTelemetry, Thanos, and Tinkerbell. We employ maintainers and contributors on these projects and we will contribute more to these and other projects in the coming year. Here’s a behind-the-scenes look at our contributions and why we’re investing in the open source projects we support. You can also meet many of our contributors in the AWS booth at KubeCon Europe in Amsterdam, April 18-21, 2023 and hear from them in our virtual Container Day event 9 a.m. – 4 p.m. CEST on April 18.

“Amazon EKS is committed to open source and we are spending a lot of our cycles now focused on contributing back to the community. Kubernetes is part of a community that’s bigger than AWS and so we’re continuing to be committed to maintaining and helping that community to be successful because without it, we wouldn’t exist, either,” said Barry Cooks, Vice President, Kubernetes, at AWS and a Cloud Native Computing Foundation (CNCF) governing board member.

AWS contributes to Kubernetes and Etcd

Today, AWS is heavily involved in open source, cloud native projects. Consider, for example, some of our recent key contributions to Kubernetes and etcd, the underlying data store for Kubernetes.

“We’re building the AWS cloud provider, contributing to CAPI (cluster API), and serve as part of the security response committee. We helped implement gzip optimization which improves the performance of Kubernetes clients,” said Nathan Taber who leads the product team for Kubernetes at AWS, in a keynote at KubeCon North America 2022. “With etcd we’re bringing our operational learnings from running just so much etcd at scale, back into the community.”

The AWS cloud provider for Kubernetes is the open source interface between a Kubernetes cluster and AWS service APIs. This project allows a Kubernetes cluster to provision, monitor, and remove AWS resources necessary for operation of the cluster.

As of Kubernetes 1.27, AWS has just finished a multi-year effort to migrate our legacy cloud provider out of tree to an external cloud provider. The cloud provider migration reduces binary bloat in the main kubernetes/kubernetes (k/k) repository, as well as reduces dependency complexity and the surface area for security vulnerabilities.

AWS has also built a webhook framework that allows cloud providers to host webhooks in their cloud-controller-managers, which makes certain migration tasks easier. One use case for this is helping other cloud providers to migrate the persistent volume labeller admission controllers from the API server code, which is one of the last areas of cloud provider specific code that needs to be migrated out of core Kubernetes.

“We’ve included a lot of space in our planning for upstream open source work this year,” said Nick Turner, software developer on the AWS Kubernetes team and a chair in Kubernetes SIG-cloud-provider. “Expect us to keep up our contributions to the cloud provider and the load balancer controller as well as increase our investments in the AWS IAM authenticator for Kubernetes and KMS encryption provider.”

These and other Kubernetes contributions bring value to the entire Kubernetes community as well as to the EKS service and its customers.

Since KubeCon Detroit last fall, the EKS-etcd team has contributed numerous improvements to etcd. Chao Chen contributed to the effort to improve testing mechanisms for etcd by unifying the test frameworks used by etcd tests. Baoming Wang contributed an important metric to the Kubernetes API server code base which will help catch data corruption issues early. We’ve also worked on building a linearizability test suite, made various improvements to the core etcd database and the etcd backend database Bolt-DB, contributed to documentation, made helm more resilient to etcd side transient errors, and fixed an issue with the installation script for argo-cd-helmfile.

What’s driving AWS to contribute more to cloud native open source

Like most modern companies, AWS builds many of its services with open source components. There are several business and technical reasons we do this, which we’ve outlined in an article on The New Stack about why we invest in sustainable open source. We recognize that the success of our services depends on the success of those underlying open source projects.

Given that most of the open source projects that AWS supports underpin specific services, AWS tasks all engineers working in services, regardless of their assigned sub-service teams, to contribute in any way that they can to those upstream projects.

The result is a virtuous cycle that promotes mutually beneficial growth. As AWS services grow, so too do the open source projects upon which they are based because of AWS contributions and support. Conversely, as these open source projects grow from the contributions of other companies and developers, so do the benefits to the AWS services that depend upon them.

AWS contributions focus on performance and scale

AWS contributions to open source typically come as a practical matter in the form of bug fixes, code reviews, documentation, new features, or security enhancements. Like many developers working in the open source space, AWS engineers often work to address issues that arise in the course of their day jobs and then share the fixes with the rest of the open source community. Similarly, new features for an open source project are developed by AWS engineers to expand the project’s scale or performance which in turn increases the project’s usability, stability, and overall appeal.

Because AWS has a large number of Kubernetes clusters under management, it affords AWS a unique opportunity to test the limitations of open source software and build its edges stronger and further out from its initial core. So many of the contributions that our team members do for upstream Kubernetes, etcd, containerd, and other projects center on making sure that we provide insights to the upstream community on where things break down in scaling, production, and operational readiness.

The resulting insights provide value for the entire open source community as well as our own customers.

Take for example, the lag fix that curiously performed as a latency expander. AWS engineer Shyam Jeedigunta, was looking at the logs and metrics collected from thousands of production EKS clusters. He determined that Gzip compression is enabled inside the Kubernetes API server to reduce the demand on network bandwidth and to decrease latency.  However, the compression was actually increasing the latency for large list requests made by clients to the Kubernetes API server. Shyam, who is also co-chair of the Kubernetes scalability special interest group (SIG), took a deep dive into the issue to investigate whether a particular compression level created the problem and if so, could the compression level be reduced? Could Gzip compression be disabled entirely? What impact would that have on latency and network bandwidth?

Answers to questions like this one lead to contributions upstream in etcd and core Kubernetes from AWS service teams. Customers and others often report these kinds of issues to the project as well, but the nature of the problem isn’t clear until it’s viewed on 1,000 nodes and 200,000 objects of a certain kind. AWS engineers diagnose what’s going on, put together troubleshooting information, and collate information into proposals on how to fix the problem(s) to upstream to Kubernetes. AWS likes to spearhead fixing issues that arise from running the projects at scale.

Key AWS contributions

AWS contributes to many Kubernetes sub projects and SIGs. For example, Micah Hausler and Sri Saran Balaji Vellore Rajakumar serve on the Kubernetes Security Response Committee (SRC), Davanum Srinivas (Dims) chairs SIG-Architecture and SIG-k8s-infra, and Nick Turner is a chair in SIG-cloud-provider.  Key contributions have gone into projects including containerd, Cortex, cdk8s, CNI, nerdctl and Prometheus. Innovations have also been substantial and include TorchServe, improved ARM support through AWS Graviton, and the Virtual GPU plugin. However, this is not an exhaustive or complete list of AWS contributions and innovations in the cloud native community.

On containerd, for example, AWS employs two maintainers who contribute features and help ensure the project’s general health and security. Key contributions from AWS engineers to the containerd project include OpenTelemetry integration in the 1.7.0 release, improved tracing, and improved fuzzing integration.

“It’s been awesome to see the growth on the container runtime team here at AWS these past few years. I love to see the eagerness to learn not just *how* to contribute, but how to do it well and really benefit the broader community,” said Phil Estes, a principal engineer at AWS and a containerd maintainer.

Nerdctl, a Docker-compatible CLI for containerd and a containerd sub-project, is used by other open source projects Lima, Finch, and Rancher Desktop. AWS engineers significantly improved nerdctl’s compose support by adding 11 out of 13 missing compose commands. We enhanced nerdctl’s image signing/verification support by contributing cosign support for nerdctl compose, and notation support for nerdctl. And engineer Jin Dong recently became the first reviewer for the project from AWS.

AWS services are also standardizing on OpenTelemetry, a set of open source tools and standards for collecting metrics, logs, and traces to measure application performance. AWS Distro for OpenTelemetry (ADOT), OpenSearch, and CloudWatch are all building on OpenTelemetry and contribute back to the upstream project. All ADOT code is 100% open source and contributed upstream. Key contributions include: adding functionality to upstream observability components such as OpenTelemetry language SDKs, collectors, and agents.

“Amazon is the fourth largest contributor to OpenTelemetry with a dedicated maintainer and many contributors working on the project. A key contribution has been improving collector and metric stability, including improved Prometheus interoperability with OpenTelemetry,” said Taber.

A fourth example is Cortex where AWS is the top supporter of the project and employs three maintainers. As AWS runs this project at scale, engineers have the opportunity to identify and fix scaling cliffs before they become a problem for the rest of the community. Some of the key contributions are new features and performance improvements. Examples include partition compactor, Ring DynamoDB Multikey KV, out of order samples ingestion, snappy-block gRPC compression, ARM images, and Thanos PromQL engine integration.

We have also contributed bug fixes to Thanos, a tool for setting up highly available Prometheus instances with long term storage. Thanos is a CNCF incubating project which Cortex depends on. We participated in the development of the new Thanos PromQL engine and open sourced a tool that could use fuzzing for correctness testing which has already caught a few bugs.

AWS employs four maintainers on Tinkerbell, a cloud native open source bare metal provisioning engine for EKS Anywhere and a CNCF Sandbox project. Key contributions include organizing the project roadmap, VLAN support, a Kubernetes native backend, out-of-band management Kubernetes controller, Helm Chart deployment, and Cluster API provider updates.

“Our team has done a lot of work to update the Tinkerbell backend from Postgres to native Kubernetes,” said Taber.

AWS employs three maintainers in Notation, a sub project of Notary under the CNCF, and is the third largest code contributor to Notary. Notation enables the generation of cryptographic signatures for container images so users can verify that they come from a trusted source or process. AWS founded the sub project with other contributors to come up with specifications for signature format, generation, verification, and revocation. As part of this work we also defined a process for evaluating signature envelope formats like COSE ensuring that they met a high security bar before they were used in Notation.

AWS employees have either written or reviewed the majority of code contributions for the core Notation libraries and a CLI. AWS also employs a maintainer to Ratify so Kubernetes users can easily enable policies for signature verification with their existing admissions controllers. Similarly we also employ a maintainer to ORAS so signatures can easily be pushed to OCI registries. Notation enables users to define granular trust policies for defining which sources they want to trust, balance deployment safety and security needs, and flexibility on secure signing key storage options.

We have contributed to many other open source projects as well, including Crossplane, for which AWS added support for EKS IRSA in the China region and fixed Amazon Route 53 wildcard support, and Backstage, with AWS Proton and AWS Code Suite (AWS CodeBuild, AWS CodePipeline, and AWS CodeDeploy).

“We’re very excited about doing more development in the open, sharing that with our customers, and working directly in some cases with customers on their needs in open source projects and working together to make the community stronger in the Kubernetes space,” Cooks said.

AWS is open

We want to hear from you. AWS engineers are open to helping community members through collaboration and contribution opportunities. Tell us how we can help meet your needs.

AWS engineers, solutions architects, and product managers are hanging out on the Kubernetes community and CNCF community Slack channels. Channels where you can reach out to us include the provider AWS channel, the Karpenter channel, and the AWS controllers for Kubernetes channel on the Kubernetes Slack.

Find us and tell us what you’d like us to work on, or let us know if there is a particular issue you have found in one of these upstream projects that you think our engineers can help move the needle on. Come find us and talk to us in the CNCF’s AWS Slack channel and join us for our virtual Container Day on April 18, before KubeCon EU.


Right-size your Kubernetes Applications Using Open Source Goldilocks for Cost Optimization

In the last few years as companies have modernized their business applications, many have moved to microservices based architectures using containers on Kubernetes. A lot of the initial focus was on designing and building new cloud native architectures to support the applications. As environments have grown, we’ve seen a shift in focus to optimize resource allocation and right-size workloads to reduce costs.

In this blog post we will share guidance on how to optimize resource allocation and right-size applications in Kubernetes environments using Goldilocks. We’ll walk through how to install Goldilocks as well as a sample application to view the suggested resource recommendations. This applies to all Kubernetes applications, including those running on Amazon Elastic Kubernetes Service (Amazon EKS), that are deployed with managed node groups, self-managed node groups, and AWS Fargate.

Right-sizing applications on Kubernetes

In Kubernetes, resource right-sizing is done through setting resource specifications in the application manifest. These settings directly impact:

Performance — Kubernetes applications running on the same node will arbitrarily compete for resources without proper resource specifications. This can adversely impact application performance.
Cost Optimization — Applications deployed with oversized resource specifications will result in increased costs and underutilized infrastructure.
Autoscaling — The Kubernetes Cluster Autoscaler and the Horizontal Pod Autoscaler require resource specifications to function.

The most common resource specifications in Kubernetes are for CPU and memory requests and limits.

Requests and Limits

Containerized applications are deployed on Kubernetes as Pods. CPU and memory requests and limits are an optional part of the Pod definition. CPU is specified in units of Kubernetes CPUs while memory is specified in bytes, usually as mebibytes (Mi).
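As a concrete illustration, a minimal Pod definition with requests and limits might look like the following sketch (the Pod name and image are placeholders; the values happen to match the sample application deployed later in this post):

apiVersion: v1
kind: Pod
metadata:
  name: sample-app                 # placeholder name
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/nginx:latest   # placeholder image
      resources:
        requests:
          cpu: 100m       # 0.1 Kubernetes CPU
          memory: 180Mi   # 180 mebibytes
        limits:
          cpu: 300m
          memory: 300Mi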

Requests and limits each serve different functions in Kubernetes and affect scheduling and resource enforcement differently.

Scheduling

The Kubernetes scheduler only considers requests when determining where to place Pods in your cluster. Acceptable nodes are those that have enough available resources to satisfy the Pod’s resource requests.  Limits are not considered by the scheduler.

Resource Enforcement

The container runtime on the node where your Pods are running is responsible for resource enforcement.  Both requests and limits are factors in ensuring applications have access to their required compute resources. Their effect on CPU and memory is different:

CPU — If no limits are specified, then each Pod on a node can use all the available CPU on the host. As soon as available CPU is exhausted, Pods are throttled using a Linux primitive called cgroups. This is a resource sharing primitive that ensures each Pod gets its fair share of CPU time. CPU requests determine that fair share and are weighted to give more CPU time to Pods with larger CPU requests. If a limit is specified, then CPU time will not exceed the specified limit.
Memory — Just like CPU, if no memory limits are specified, then each Pod can use all the available memory on the host. Unlike CPU, when memory is exhausted, there is no sharing mechanism. The Pod will either be terminated by the Linux Out-of-memory (OOM) killer or the kubelet will evict the Pod. The same process will happen if a Pod’s memory usage exceeds its limit.

Vertical Pod Autoscaler

So how do application owners choose the “right” values for their CPU and memory resource requests? An ideal solution is to load test the application in a development environment and measure resource usage using observability tooling. While that might make sense for your organization’s most critical applications, it’s likely not feasible for every containerized application deployed in your cluster.

Fortunately, there is a Kubernetes project that has a feature specifically designed to help provide resource recommendations — the Vertical Pod Autoscaler (VPA). VPA is a Kubernetes sub-project owned by the Autoscaling special interest group (SIG). It’s designed to automatically set Pod requests based on observed application performance. VPA collects resource usage from the Kubernetes Metrics Server by default but can optionally be configured to use Prometheus as a data source.

VPA has a recommendation engine that measures application performance and makes sizing recommendations. The VPA recommendation engine can be deployed stand-alone so VPA will not perform any autoscaling actions. It’s configured by creating a VerticalPodAutoscaler custom resource for each application and VPA updates the object’s status field with resource sizing recommendations.
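For reference, a minimal VerticalPodAutoscaler object in recommendation-only mode might look like the following sketch (the target Deployment name matches the sample application used later in this post; as described in the next section, Goldilocks creates objects like this for you automatically, with its own naming):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: tomcat-example
  namespace: javajmx-sample
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tomcat-example
  updatePolicy:
    updateMode: "Off"   # recommendations only, no autoscaling actions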

Creating VerticalPodAutoscaler objects for every application in your cluster and trying to read and interpret the JSON results is challenging at scale. Goldilocks is an open source project that makes this easy.

Goldilocks

Goldilocks is an open source project from Fairwinds that is designed to help organizations get their Kubernetes application resource requests “just right”. It takes its name, very appropriately, from the well known fairy tale Goldilocks and the Three Bears. Goldilocks builds on top of the Kubernetes Vertical Pod Autoscaler and provides:

A controller that automates the creation of VerticalPodAutoscaler objects for workloads in your cluster.
A dashboard that displays resource recommendations for all the monitored workloads.

The default configuration of Goldilocks is an opt-in model. You choose which workloads are monitored by adding the goldilocks.fairwinds.com/enabled: true label to a namespace.

Solution Overview

Let’s walk through how to install Goldilocks, including its dependencies Metrics Server and Vertical Pod Autoscaler. Then we’ll install a sample application to view the suggested resource recommendations. The diagram shown here illustrates all of the components on an Amazon EKS cluster and their interactions.

The Metrics Server collects resource metrics from the Kubelet running on worker nodes and exposes them through Metrics API for use by the Vertical Pod Autoscaler. The Goldilocks controller watches for namespaces with the goldilocks.fairwinds.com/enabled: true label and creates VerticalPodAutoscaler objects for each workload in those namespaces.

In this blog post, we will create a namespace called javajmx-sample and a tomcat deployment. We will label this namespace so that Goldilocks provides recommendations for the workloads in it. As soon as we label the namespace, we will see a VPA object called goldilocks-tomcat-example created.

Prerequisites

You will need the following to complete the steps in this post:

AWS Command Line Interface (AWS CLI) version 2
kubectl
helm
If you don’t have an Amazon EKS cluster, you can create one using eksctl (see the example command after this list)
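For example, a basic cluster can be created with a command along these lines (the cluster name, region, and node count are placeholders, not values required by this walkthrough):

eksctl create cluster --name goldilocks-demo --region us-east-1 --nodes 3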

Step 1: Deploying the Metrics Server

In this step, we will deploy the Metrics Server, which provides the resource metrics used by the Vertical Pod Autoscaler.

helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server

helm upgrade --install metrics-server metrics-server/metrics-server

Let’s verify the status of the metrics-server. Once it is successfully deployed, you should be able to see the resource utilization of the Pods shortly afterwards:

kubectl top pods  -n kube-system

NAME                     CPU(cores)   MEMORY(bytes)  
aws-node-czlb8           2m           35Mi            
aws-node-fs22v           3m           35Mi            
aws-node-nl4js           2m           60Mi            
aws-node-vth4m           2m           59Mi            
coredns-d5b9bfc4-lbhb7   4m           13Mi            
coredns-d5b9bfc4-ngtf9   4m           14Mi            
kube-proxy-5gq76         1m           12Mi            
kube-proxy-mvp6g         1m           12Mi            
kube-proxy-vxpw9         1m           33Mi            
kube-proxy-zsfs4         1m           34Mi  

Step 2: Enable namespaces that need resource recommendations from Goldilocks

We will deploy sample workloads in the javajmx-sample namespace and get resource recommendations for the applications running in it. Let’s go ahead and create the namespace and label it.

kubectl create ns javajmx-sample
kubectl label ns javajmx-sample goldilocks.fairwinds.com/enabled=true

To ensure the label was applied successfully, run describe on the javajmx-sample namespace

kubectl describe ns javajmx-sample

Name:         javajmx-sample
Labels:       goldilocks.fairwinds.com/enabled=true
              kubernetes.io/metadata.name=javajmx-sample
Annotations:  <none>
Status:       Active

No resource quota.

No LimitRange resource.

Step 3: Deploy Goldilocks

We will use a Helm chart to deploy Goldilocks. The deployment creates three objects:

Goldilocks-controller: responsible for creating the VPA objects for the workloads whose namespace is enabled for a Goldilocks recommendation

Goldilocks-vpa-recommender:  responsible for providing the resource recommendations for the workloads

Goldilocks-dashboard: summarizes the resource recommendations for the workloads and also provides the YAML manifest for implementing the recommendations.

To deploy Goldilocks, run the following helm commands:

helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm upgrade --install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace --set vpa.enabled=true

Now, we will use kubectl to verify that the deployment was successful:

kubectl get pods -n goldilocks

NAME                                          READY   STATUS    RESTARTS   AGE
goldilocks-controller-7bc5788596-q752s        1/1     Running   0          18h
goldilocks-dashboard-7ffff8966b-dphmj         1/1     Running   0          18h
goldilocks-dashboard-7ffff8966b-s2dgf         1/1     Running   0          18h
goldilocks-vpa-recommender-5ddf6dcd66-njgt4   1/1     Running   0          18h

Step 4: Deploy the sample application

In this step, we will deploy a sample application in the javajmx-sample namespace to get recommendations from Goldilocks. The application tomcat-example is initially provisioned with CPU and memory requests of 100m and 180Mi respectively, and limits of 300m CPU and 300Mi memory.

kubectl apply -f https://raw.githubusercontent.com/aws-observability/aws-o11y-recipes/main/sandbox/javajmx/example/sample-javajmx-app.yaml

nht-admin:~/environment $ kubectl get pods -n javajmx-sample
NAME                              READY   STATUS    RESTARTS   AGE
tomcat-bad-traffic-generator      1/1     Running   0          127m
tomcat-example-5c874c8b8b-zt2tv   1/1     Running   0          127m
tomcat-traffic-generator          1/1     Running   0          127m

As mentioned earlier, Goldilocks creates a VPA for each deployment in a Goldilocks-enabled namespace. Using the kubectl command, we can verify that a VPA was created in the javajmx-sample namespace for the goldilocks-tomcat-example:

nht-admin:~/environment $ kubectl get vpa -n javajmx-sample
NAME                        MODE   CPU   MEM         PROVIDED   AGE
goldilocks-tomcat-example   Off    15m   109814751   True       127m

Step 5: Review the Goldilocks recommendation dashboard

The goldilocks-dashboard service exposes the dashboard on port 8080, and we can access it to get the resource recommendations. We now run this kubectl command to access the dashboard:

kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80

We can now open a browser to http://localhost:8080 to display the Goldilocks dashboard.

Let’s analyze the javajmx-sample namespace to see the recommendations provided by Goldilocks. We should be able to see the recommendations for the goldilocks-tomcat-example deployment.

Here the screen shows the request and limit recommendations for the javajmx-sample workload. The Current column under each Quality of Service (QoS) class indicates the currently configured CPU and memory requests and limits. The Guaranteed and Burstable columns indicate the recommended CPU and memory requests and limits for the respective QoS class.

We can clearly see that we have overprovisioned the resources, and Goldilocks has made recommendations to optimize the CPU and memory requests. For the Guaranteed QoS class, the recommended CPU request and limit are 15m and 15m, compared to the current settings of 100m and 300m. The recommended memory request and limit are 105M and 105M, compared to the current settings of 180Mi and 300Mi.

Notice that the recommendations are available for two different Quality of Service (QoS) classes: Guaranteed and Burstable. Kubernetes provides different levels of Quality of Service to Pods depending on what they request and what limits are set for them. Pods that need to stay up and perform consistently can request guaranteed resources, while Pods with less exacting requirements can use resources with less or no guarantee.

Guaranteed (QoS) pods are considered top priority and are guaranteed to not be killed until they exceed their limits. If limits, and optionally requests, (not equal to 0) are set for all resources across all containers and limits and requests  are equal, then the pod is classified as Guaranteed.

Burstable (QoS) pods have some form of minimal resource guarantee, but can use more resources when available. Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist. If requests, and optionally limits, are set (not equal to 0) for one or more resources across one or more containers, and they are not equal, then the pod is classified as Burstable.

To follow the recommended resource specification, customers can simply copy the respective manifest for the QoS class they are interested in and redeploy the workloads, which will then be right-sized and optimized.

For example, if we decide to apply the recommendations for the Guaranteed QoS class, we could copy the YAML from the dashboard as shown here and apply it to the deployment object:
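The copied YAML for the Guaranteed class corresponds to a resources section along these lines (a sketch based on the recommendations above; the exact values shown in your dashboard may differ):

resources:
  requests:
    cpu: 15m
    memory: 105Mi
  limits:
    cpu: 15m
    memory: 105Mi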

Let’s run the kubectl edit command to the deployment to apply the recommendations:

kubectl edit deployment tomcat-example -n javajmx-sample

The resources section in the containers spec  shows that we have successfully applied the recommendation of request and limits for CPU, and memory:

Once we apply the recommendations, the Pod restarts and comes online with the updated resource configuration. Let’s verify this by running the kubectl describe command on the tomcat-example deployment:

kubectl describe deployment tomcat-example -n javajmx-sample

The output should look like the following:

Name:                   tomcat-example
Namespace:              javajmx-sample
CreationTimestamp:      Mon, 06 Feb 2023 17:41:38 +0000
Labels:                 <none>
Annotations:            deployment.kubernetes.io/revision: 2
Selector:               app=tomcat-example-pods
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=tomcat-example-pods
  Containers:
   tomcat-example-pod:
    Image:       public.ecr.aws/u6p4l7a1/sample-java-jmx-app:latest
    Ports:       8080/TCP, 9404/TCP
    Host Ports:  0/TCP, 0/TCP
    Limits:
      cpu:     15m
      memory:  105Mi
    Requests:
      cpu:        15m
      memory:     105Mi
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>

Cleanup

To delete the deployments and sample workloads we created in the blog, execute the following commands:

helm delete metrics-server
helm delete goldilocks -n goldilocks
kubectl delete -f https://raw.githubusercontent.com/aws-observability/aws-o11y-recipes/main/sandbox/javajmx/example/sample-javajmx-app.yaml

Conclusion

This post demonstrated how Goldilocks can be used to efficiently rightsize the resource requests for Kubernetes applications. Customers in modernization efforts often have minimal time to decide the resource requirements for their applications, which usually involves a complex process of reviewing monitoring dashboards. By adopting the recommendations from Goldilocks, customers can shorten the time to market for their applications and optimize their Amazon EKS costs.

Further reading

EKS Best practices
Blog: Using Prometheus to Avoid Disasters with Kubernetes CPU Limits

Goldilocks project


Run Open Source FFMPEG at Lower Cost and Better Performance on a VT1 Instance for VOD Encoding Workloads 

FFmpeg is an open source tool commonly used by media technology companies to encode and transcode video and audio formats. FFmpeg users can leverage a cost efficient Amazon Web Services (AWS) instance for their video on demand (VOD) encoding workloads now that AWS offers VT1 support on Amazon Elastic Compute Cloud (Amazon EC2).

VT1 offers improved visual quality for 4K video, support for a newer version of FFmpeg (4.4), expanded OS/kernel support, and bug fixes. These instances are powered by the AMD-Xilinx Alveo U30 media accelerator. With the Xilinx Video SDK, changing a single line of an FFmpeg command is enough to have the Alveo U30 take over the transcoding work. The Xilinx Video SDK includes an enhanced version of FFmpeg that can communicate with the hardware accelerated transcode pipeline in Xilinx devices to deliver up to 30% lower cost per stream than Amazon EC2 GPU-based instances and up to 60% lower cost per stream than Amazon EC2 CPU-based instances.
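As a simplified illustration (not the exact commands used in the benchmark below), switching a software x264 encode to the Alveo U30 amounts to swapping the video codec on the FFmpeg command line; the input file and bitrate here are placeholders:

# Software encode on a CPU-based instance (x264)
ffmpeg -i input.mp4 -c:v libx264 -b:v 5M output.mp4

# Hardware-accelerated encode on VT1 with the Xilinx Video SDK build of FFmpeg
ffmpeg -i input.mp4 -c:v mpsoc_vcu_h264 -b:v 5M output.mp4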

Companies typically use EC2 CPU instances such as the C5 and C6 coupled with FFmpeg for their VOD encoding workloads. These workloads can be costly in cases where companies encode thousands of VOD assets. The cost of an EC2 workload is influenced by the number of concurrent encoding jobs that an instance can support and this subsequently affects the time it takes to encode targeted outputs. As VOD libraries expand, companies typically auto scale to increase the size or number of C5 and C6 instances or allow the instances to operate longer. In both cases, these workloads experience an increase in cost. Important note: There is no additional charge for AWS Auto Scaling. You pay only for the AWS resources needed to run your applications and Amazon CloudWatch monitoring fees.

Amazon EC2 VT1 instances are designed to accelerate real-time video transcoding and deliver low-cost transcoding for live video streams. VT1 is also a cost effective and performance-enhancing alternative for VOD encoding workloads. Using FFmpeg as the transcoding tool, AWS performed an evaluation of VT1, C5, and C6 instances to compare price performance and speed of encode for VOD assets. When compared to C5 and C6 instances, VT1 instances can achieve up to 75% cost savings. The results show that you could operate two VT1 instances for the price of one C5 or C6 instance. Additionally, the VT1 instance completed the targeted encodes faster than both of the CPU-based instances, as detailed in the benchmark results below.

Benchmarking Method

First let’s determine the best instance type to use for our VOD workload. C5 and C6 instances are commonly used for transcoding. We used C5.9xl and C6i.8xl instances and compared them against VT1.3xl instances to transcode 4K and 1080p VOD assets. The two assets were encoded into the output targets described in the Evaluation Adaptive Bit Rate (ABR) targets section below. For each instance type, we measured the amount of time it took to complete the targeted encodes.

As shown here in the screenshot from the AWS console for various instance types, VT1.3xl is the smallest instance type in the VT1 family. Even though VT1.6xl compares closely to C5 and C6 in terms of CPU/memory, we chose VT1.3xl for a closer price/performance comparison.

VT1 family instance type comparison to C5 in terms of CPU/memory.

Input data points

Sample input content

The following table summarizes the key parameters of the source video files used to measure encoding performance for the benchmarking.

Clip Name | Frame Count | Duration | Frame Rate | Codec | Resolution | Chroma Sampling
1080p | 43092 | 12 mins | 60 | H.264, High Profile | 1920×1080 | 4:2:0 YUV
4K | 776 | 13 secs | 60 | H.264, High Profile | 3840×2160 | 4:2:0 YUV

Evaluation Adaptive Bit Rate (ABR) targets

Adaptive bitrate streaming (ABR or ABS) is technology designed to stream files efficiently over HTTP networks. Multiple files of the same content, in different size files, are offered to a user’s video player, and the client chooses the most suitable file to play back on the device. This involves transcoding a single input stream to multiple output formats optimized for different viewing resolutions.

For the benchmarking tests, the input 4K and 1080p files were transcoded to various target resolutions that can be used to support different device and network capabilities at their native resolution: 1080p, 720p, 540p, and 360p. The bitrate (br) in the graphic shown here indicates the bitrate associated with each output rendition. For example, the 4K input file was transcoded to 360p resolution at a 640 kbps bitrate.

Figure 1: Adaptive bitrate ladder used to transcode output video files. Source: https://developer.att.com/video-optimizer/docs/best-practices/adaptive-bitrate-video-streaming
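For instance, the 360p rung of the ladder in Figure 1 could be produced from the 4K source with a command along these lines (a sketch, not the benchmark script itself; it reuses the x264 settings listed in the codec settings table below, and the file names are placeholders):

ffmpeg -i input_4k.mp4 -vf scale=-2:360 -c:v libx264 -preset faster -b:v 640k output_360p.mp4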

Output results

Target duration analysis

The VT1.3xl instance completed the targeted encodes 15.709 seconds faster than the C5.9xl instance and 12.58 seconds faster than the C6i.8xl instance. The results in the following charts detail that the VT1.3xl instance has better speed and price performance when compared to the C5.9xl and C6i.8xl instances.

% Price Performance = ((C5/C6 price performance - VT1 price performance) / C5/C6 price performance) * 100

% Speed = ((C5/C6 duration - VT1 duration) / C5/C6 duration) * 100
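As a worked example using the 4K clip results for the C5.9xl in the first table below: % Speed = (30.186221 - 14.47632) / 30.186221 * 100 ≈ 52.04%, and % Price Performance = ($0.01298 - $0.002605) / $0.01298 * 100 ≈ 79.93%, matching the figures reported for that instance.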

H.264 4K Clip (13 seconds duration)

Metric | VT1.3xl | C5.9xl | C6i.8xl
Codec | mpsoc_vcu_h264 | x264 | x264
Duration to complete ABR targets, seconds (see Evaluation Targets above) | 14.47632 | 30.186221 | 27.05856
Speed compared to VT1.3xl (%) | N/A | 52.043284% | 46.500035%
Instance cost ($/hour) | $0.65 | $1.53 | $1.22
Instance cost ($/second) | $0.00018 | $0.00043 | $0.000338
Price performance, $/clip transcoded | $0.002605 | $0.01298 | $0.009145
Price performance compared to VT1.3xl (%) | N/A | 79.930662% | 71.514488%

H.264 1080p Clip (12 minutes duration)

Metric | VT1.3xl | C5.9xl | C6i.8xl
Codec | mpsoc_vcu_h264 | x264 | x264
Duration to complete ABR targets, seconds (see Evaluation Targets above) | 490.82074 | 837.20001 | 762.63252
Speed compared to VT1.3xl (%) | N/A | 41.373538% | 35.641252%
Instance cost ($/hour) | $0.65 | $1.53 | $1.22
Instance cost ($/second) | $0.00018 | $0.00043 | $0.000338
Price performance, $/clip transcoded | $0.0883 | $0.3599 | $0.2577
Price performance compared to VT1.3xl (%) | N/A | 75.465407% | 65.735351%

The following section explains the encoding parameters used for testing. FFmpeg was installed on one C5.9xl, one C6i.8xl, and one VT1.3xl instance. The two input files described in the sample input content table were transcoded in parallel on each instance type, and the total duration to complete the transcoding to the various output target resolutions was calculated.

Technical specifications

EC2 Instances: 1 x C5.9xl, 1x C6i.8xl, 1 x VT1.3xl
Video framework: FFmpeg
Video codecs: x264 (CPU), XMA (Xilinx U30)
Quality objective: x264 faster
Operating system

For C5 and C6 – x264, Amazon Linux 2 (Linux kernel 4.14)
For VT1- Xilinx: Amazon Linux 2 (Linux kernel 5.4.0-1038-aws)

Video codec settings for encoding performance tests

Instance | C5, C6 | VT1
Codec | x264 | mpsoc_vcu_h264
Preset | Faster | n/a
Output Bitrate (CBR) | See Evaluation Targets above | See Evaluation Targets above
Chroma Subsampling | YUV 4:2:0 | YUV 4:2:0
Color Bit Depth | 8 bits | 8 bits
Profile | High | n/a

Conclusion

Amazon VT1 EC2 instances are typically used for live real-time encoding; however, this blog post demonstrates the VOD encoding speed and price performance advantages of VT1 when compared to C5 and C6 EC2 instances. VT1 instances can encode VOD assets up to 52% faster, and achieve up to a 75% reduction in cost, when compared to C5 and C6 instances. VT1 is best utilized for VOD encoding workloads where the time to complete outputs and the cost per encode matter most. Please visit the Amazon EC2 VT1 instances page for more details.


Dive Deeper into Data Lake for Nonprofits, a New Open Source Solution from AWS for Salesforce for Nonprofits

Nonprofits are using cloud-based solutions for fundraising, donor and member management, and communications. With this move online, they have access to more data than ever. This data has the potential to transform their missions and increase their impact. However, sharing, connecting, and interpreting data from many different sources can be a challenge. To address this challenge, Amazon Web Services (AWS) and AWS Partner Salesforce for Nonprofits announced the general availability of Data Lake for Nonprofits – Powered by AWS.

Data Lake for Nonprofits is an open source application that helps nonprofit organizations set up a data lake in their AWS account and populate it with the data that they have in the Salesforce Non Profit Success Pack (NPSP) schema. This data resides in Amazon Relational Database Service (Amazon RDS), where it can be accessed by other AWS services like Amazon Redshift and Amazon QuickSight, as well as Business Intelligence products such as Tableau.

This post is written for developers and solution integrator partners who want to get a closer look at the architecture and implementation of Data Lake for Nonprofits to better understand the solution.  We’ll walk you through the architecture and how to set up a data lake using AWS Amplify.

Prerequisites

To follow along with this walkthrough, you must have the following prerequisites:

An AWS Account

An AWS Identity and Access Management (IAM) user with Administrator permissions that enables you to interact with your AWS account
A Salesforce account that has the Non Profit Success Pack managed packages installed
A Salesforce user that has API permissions to interact with your Salesforce org

Git command line interface installed on your computer for cloning the repository

Node.js and Yarn Package Manager installed on your computer for cleanup

Solution Overview

The following diagram shows the solution’s high-level architecture.

Salesforce and AWS have released the source code in GitHub under a BSD 3-Clause license so nonprofits and their cloud partners can use, customize, and innovate on top of it at no cost.

Use your command-line shell to clone the GitHub repository for your own development.

git clone https://github.com/salesforce-misc/Data-Lake-for-Nonprofits

The solution consists of two tiers: a frontend application and a backend application. In the following sections, we’ll walk you through both architectures and show you how to install a data lake.

Frontend Walkthrough

We have developed the frontend application using React with TypeScript, and it has three layers.

The view layer comprises React components.
The model layer is where most of the application logic is maintained. The models are Mobx State Tree (MST) types with properties, actions, and computed values. Relationships between models are expressed as MST trees.
The API layer is used to communicate with AWS Services using AWS SDK for JavaScript v3.

The frontend application has been built and zipped for AWS Amplify using GitHub Workflows. This is done for every change in the repository. The latest build can be found in the GitHub repository.

Backend Walkthrough

We have developed the backend application using AWS CloudFormation templates that can be found under the infra/cf folder in the GitHub repository. These templates provision the resources for your data lake in your AWS account, as described below.

vpc.yaml template is used to provision the Amazon Virtual Private Cloud (Amazon VPC) that the application will be running in.

buckets.yaml template is used to provision several Amazon Simple Storage Service (Amazon S3) buckets.

datastore.yaml is used to provision an Amazon Relational Database Service (Amazon RDS) PostgreSQL database.

athena.yaml is used to provision Amazon Athena with a custom workgroup.

step_function.yaml is used to provision AWS Step Functions and related AWS Lambda functions. Step Functions is used to orchestrate the Lambda functions that are going to import your data from your Salesforce account into Amazon RDS for PostgreSQL using an Amazon AppFlow connection.

Lambda functions can be found in the infra/lambdas folder.

src/cleanupSQL.ts performs the movement of data from the data-loading database schema to the public database schema.

src/filterSchemaListing.ts filters out S3’s “schema/” key as well as queries Salesforce for deleted objects.

src/finalizeSQL.ts drops tables that will be no longer needed in the application to save cost.

src/listEntities.ts calls APIs since the output is too large for Step Functions.

src/processImport.ts is the import Lambda function, which writes the data sent to the Amazon SQS queue into RDS.

src/pullNewSchema.ts queries Amazon AppFlow and then Salesforce to gather any new or updated fields.

src/setupSQL.ts sets up the data-loading database schema and creates the tables based on the schema file.

src/statusReport.ts performs a status update to S3 based on where it is in the Step Functions State Machine.

src/updateFlowSchema.ts uses the updated schema file on S3 to create or update the Amazon AppFlow flow.

Amazon Simple Queue Service (Amazon SQS) is used to queue the import data so that the Lambda function can use it.

Amazon EventBridge is used to set up a job that syncs your data based on your choice of frequency.

Amazon CloudWatch is used to keep logs during installation as well as synchronization. The application creates a CloudWatch dashboard to track the usage of the data lake.

Deploy the Frontend Application using AWS Amplify

The latest release of the frontend application can be downloaded from the GitHub repository and deployed in AWS Amplify in your AWS account as explained in the User Guide. It typically takes a few minutes to deploy the frontend application, after which a URL to the frontend application is presented. The URL should look like the link here:

https://abc.xyz….amplifyapp.com

When you open the URL, you will see the frontend application, which provides a wizard-like guide to the steps. Each step will guide you through the instructions on how to move forward.

Step 1 will ask for an Access Key ID and Secret Access key for an IAM user of your AWS account. The application shows guidance on how to log in to AWS Management Console and use Identity and Access Management (IAM) to create the IAM user with admin permissions.

This step also requires you to select the AWS region where you would like to create the Amazon AppFlow connection and deploy the data lake.

Step 2 establishes the connection to your Salesforce account. The application guides the user to leverage Amazon AppFlow, which allows AWS to connect to your Salesforce account.

At the end of the page, use the drop down menu to select the connection name and click Next.

Step 3 will help you choose the data objects and set the frequency of data synchronization for your data lake. The data import option can be set to any date and time, and you can choose the frequency from the given options: daily/weekly/monthly.

This page further displays the complete set of objects from your Salesforce account. Choose the necessary objects that you want to import into your data lake.

In Step 4, you can review the data lake configuration and confirm. This step is where you are allowed to go back to the previous step to make any changes if needed.

Step 5 is where the data lake is provisioned, and then your data is imported. This can take half an hour to several hours, depending on the size of the data in your Salesforce account.

Once the data lake is ready, in Step 6, you can find the instructions and information needed to connect to your data lake using business intelligence applications such as Tableau Cloud and Tableau Desktop that will help you visualize your data and analyze it per your business needs.

Cleanup

Keeping the data lake in your AWS account will incur charges due to the resources provisioned. To avoid incurring future charges, run these commands on your terminal where you cloned the GitHub repository and follow the instructions:

cd Data-Lake-for-Nonprofits/app

yarn delete-datalake

Conclusion

This post showed you how AWS services are used to transform your Salesforce data into a data lake. It uses AWS Amplify to host the frontend application and provisions several AWS services for the data lake backend. The architecture is based on the successful collaboration between AWS and Salesforce to build an open source data lake solution using a simple and easy-to-use, wizard-like application.

We invite you to clone the GitHub repository and develop your own solution for your needs, provide feedback, and contribute to the project.


Driving Action and Communication in AWS Amplify Open Source Projects

AWS Amplify is a complete solution that lets frontend web and mobile developers easily build, ship, and host full-stack applications on AWS, with the flexibility to leverage additional AWS services as use cases evolve. To build Amplify applications, customers typically use one of the open source Amplify libraries. At Amplify, we manage and build these open source projects on GitHub.

In the last year, the number of contributions across the Amplify projects has increased and the teams have scaled to meet customer needs while building across programming languages and frameworks. This necessitates a constant balance of collaboration with contributors and also flexible processes to continually move the projects forward. The goal is to provide a delightful experience for Front End developers building on AWS.

Open source is as much about relationships as it is about code. These relationships are being built anytime the team is collaborating with external contributors, developers, and customers at different touch points. A core element to facilitate open source is to facilitate relationships through consistent and transparent communication. There isn’t a clear, well-defined playbook for this. It requires iteration and continuous feedback from contributors to fine tune and make better. How we communicate can mean how the projects are planned, structured, and received by the developer community! And all of this may change over time!

In this post, we’ll cover some processes and tools that we’ve built and use at AWS Amplify to help build a vibrant and responsive open source community.

Organization

For context, an average of 35 external contributor pull requests (PRs) and 350 issues are opened each month on GitHub across the Amplify project repositories. In 2022, this equated to an 80% increase in issues and 66% increase in external contributor PRs from the previous year. An external contributor is any active contributor that is not a member of the Amplify GitHub organization. These projects range from the Amplify CLI to the client libraries like Amplify JS and Amplify UI. We also have a very active Discord server with over 19,000 members.

Along with the growth, it can be a cultural shift for both AWS engineers and customers to work in the open. Not every team member is used to working in public, and not all customers are used to working with AWS through GitHub or Discord.

Our ownership model is that each individual team manages their own repositories and is responsible for the project health and operations. This includes issue and Pull Request (PR) management, communication, and releases. This level of autonomy helps the projects move fast and remain flexible. As the service has grown, so has the need for standardization and operational tooling within each team.

Communication and transparency – reducing the time to response

The starting point for community is consistent communication and transparency. There are different elements to this and it fluctuates over time and as projects grow or slow in velocity. There is an expectation in open source software development that there is an ongoing dialog with the community. This may take the form of contributors helping on issue triage and workarounds, providing feedback on Request for Comments (RFCs), and submitting PRs. It could also be developers using the libraries and services within their applications.

In each of these scenarios, open communication is what helps to oil that machine. In open source, the goal is to determine an action for each issue, pull request, feature request, or question that is created. In most cases, proactive communication and quick responses in issues and PRs helps to determine this action faster. This also shapes the open source contributor experience. An external contributor is more likely to stay engaged in an issue thread or PR review if the maintainers are quick to respond.

Knowing this, we identified ways to reduce friction and make the experience better knowing that communication plays such a large role. These ways are:

Standardize the structure of the operational processes (GitHub issues labels, etc.)
Communicate as early and often as possible in response to open items (issues, PRs, etc.)
Proactively track updates on these items in order to follow up quickly

With these goals in mind, we had to operationalize processes to allow us to deliver on each. Thinking through this, two core themes surfaced:

Develop a consistent approach and cadence for communication
Reduce overall time-to-response (or action)

Tackling these first would improve the contributor experience and begin to strengthen our own internal culture. Once standardized, we could then proactively track open items and metrics.

Communications processes

How do we improve communication across our project? This is where that cultural shift comes in. Sometimes this requires changes to normal process and team communication to fully embrace working in the open. The project is always evolving and updating, including nights and weekends. Without a clear process to drive action, things can quickly begin to get missed.

The first task was to find the root cause of the communication friction points in a project repository. What are the communication touch points? At those points, where can automation help reduce friction and time to next action? The initial entry points to communication in a GitHub repository are:

A contributor is using the project and opens an item in the form of an issue, or
A contributor opens a PR to submit code for inclusion into the project

The team on-call engineers were initially shadowed to observe how they identified, and interacted with, new and updated issues and PRs. A common theme was that the context switching and frequent back and forth on these items was very time consuming and hard to track. After observing this same pattern across multiple projects, it was clear that the conversation should be streamlined.

We needed an efficient way to both remind the original posters of what is needed to help reproduce an issue and encourage them to also provide details. For each of these items, we had to determine the best way to expedite the path to action. Maintainers need to triage an issue to determine the next action to take. Maintainers also need to triage each pull request to determine if it aligns with the project and identify what the next steps are with respect to review and merge.

GitHub issue template forms to the rescue

We initially standardized the GitHub issue template forms across all of the Amplify projects after getting access to the Beta feature in early 2021. These forms are used each time a contributor opens an issue, pull request, or feature request in a repository. This allowed for more actionable conversations by collecting all of the required information up front. The goal of the form was to collect just the required information without introducing unnecessary friction for contributors. This is different from the original GitHub issue templates that allowed for more free-form data entry. With this approach, we were able to require standard information that had been found to be helpful in triaging issues while shadowing the maintainers. Here is a screenshot of a portion of the current GitHub issue form template in the Amplify CLI repository.

This is the form that is used to open new issues in the repository. It asks the user to check some boxes before opening an issue, such as whether they have installed the latest version of the Amplify CLI, searched for duplicate or closed issues, and read the guide for submitting bug reports. It also asks some open ended questions about the user’s environment.
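For readers unfamiliar with the feature, GitHub issue forms are defined as YAML files under .github/ISSUE_TEMPLATE/ in a repository. A minimal sketch along the lines of the form described above might look like this (the field names and wording are illustrative, not the actual Amplify CLI template):

name: Bug report
description: Report a bug in the Amplify CLI
body:
  - type: checkboxes
    attributes:
      label: Before opening, please confirm
      options:
        - label: I have installed the latest version of the Amplify CLI
          required: true
        - label: I have searched for duplicate or closed issues
          required: true
        - label: I have read the guide for submitting bug reports
          required: true
  - type: textarea
    attributes:
      label: Environment information
      description: Your operating system, Node.js version, and Amplify CLI version
    validations:
      required: true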

As using the new form to file an issue became the standard, the starting point for communication became clearer, allowing future interactions on issues and PRs to be more direct. Through the structured nature and standardized collection of data fields, we were able to significantly improve the quality of our search results. This has helped to reduce duplicate issues (although they may still exist in some forms in older issues) and increase visibility and traction on current open issues and feature requests.

Standardization

As the Amplify project quickly grew, the operational processes needed to grow with it and remain consistent across teams. Repository standardization tools are typically templates used to structure a project when it is first created. The original templates are helpful, but there is an ongoing challenge of maintaining, or changing, that structure over time as the projects evolve. To account for this, we created an open source repository audit tool that provides a declarative approach to keep certain items such as labels, project topic tags, and descriptions up to date.

The audit tool also includes a dashboard to provide insight into other items, such as GitHub actions, required links, and the number of good first issues. The required items are defined in a configuration file and each repository is checked against that structure. This is a simple but effective way to quickly check across all the projects without manually checking each project.

Repositories will change over time but we need to make sure that there is consistency in the core structure. For instance, new labels are added, some are removed. Or the CODEOWNERs.md file needs to be updated. The audit tool queries the repository metadata and provides a self-service way for each team to check whether the repo is in good standing as part of their standard operating processes or has fallen out of date and needs attention. This screenshot shows a portion of the audit tool for the Amplify JS repository. The green check mark next to the links indicates that these items are present in the repository.

We want customers to have a consistent entry point when landing in any AWS Amplify repository. This includes the same nomenclature and core labels to correctly communicate the lifecycle of issues and PRs. This screenshot shows the required labels section of the tool. The set of required labels is compared against the labels in the repository for any discrepancies.

Data and tooling – separating out external contributor events

To accurately gauge activity on issues and other opened items, it was critical to determine whether events were initiated by someone on the Amplify team or by an external contributor, in order to prioritize communication and the triaging process. Our triaging process involves troubleshooting or reproducing an issue by a project maintainer. We needed to isolate GitHub issue (and pull request) event data created by external contributors in a separate tool to support this effort.

To track this event stream from GitHub, we built serverless data processes using AWS Lambda to capture issues and PRs that have been updated, and then capture any new events (comments, labels, etc.) on the issues. The data allows for ad-hoc querying against events to isolate if activity is increasing in a certain area. This also helps teams to quickly spot items when there are lags in timezones or as team on-call rotations change. All of these queries take into account the active members of the Amplify organization to identify those events only triggered by external contributors.

Capturing this data has allowed us to build tools to proactively engage in open and closed issues. The following sections outline these tools.

Trending issues

With the external contributor data separated from Amplify team events, we are able to start tracking, in near real-time, issues that have an increased amount of activity within a given time period. One way has been to create a dashboard that highlights trending issues that are receiving increased activity from external contributors. This is used at the individual project level and also across all of the projects. This helps to identify cross-project themes that may be starting without constant, manual, checking of issues to see what has changed.

The trending dashboard only ranks issues based on the activity that they are receiving from external contributors within a given time period. It takes into account recent comments and the aggregate reactions on comments that are not created by Amplify maintainers. This provides a holistic view to the themes that contributors and developers may be experiencing across the entirety of Amplify without the noise of Amplify team comments.

This includes issues, both open and closed post triage when there’s already been communication. It’s important to track closed issues (that are not locked) to not miss any new comments or activity. Since the issue is closed, those comments may not be seen unless a team member happens to view the issue. This screenshot shows the trending list of open issues in the Amplify CLI repository. The list of issues and aggregated metadata (number of comments and reactions) only includes data from external contributors.

Closed issues with increased activity

As previously mentioned, closed issue visibility is a challenge since it’s difficult to track which closed issues were updated and their specific changes – new comment, increase in number of reactions. The trending dashboard also tracks this activity.

The dashboard is especially useful for issues that are already closed that may continue to receive events. Without automation, it’s very time consuming (and manual) to identify an increased spike in reactions or external contributor comments on issues across one project, let alone over 20. Often times in projects, these comments go unnoticed unless an Amplify team member is tagged or subscribed to a specific item and happens to see the notification.

Having a UI that highlights these items also helps to reduce notification fatigue. The Amplify team (engineers, product, and developer experience) receive many notifications across items for many of the Amplify repositories. It is helpful to have a single dashboard to spot check if activity has increased on an older, closed issue.

Metrics to track

So how do we know that all of these processes and tools are making a positive impact? A few key metrics helped:

Mean Time to Respond (MTTR) – This measures how responsive we are to customers and encourages communicating as soon as possible once an issue or Pull Request is opened.

Mean Time to Close (MTTC) – This measures the overall timeframe that it takes to close an issue or Pull Request.

The definition of these metrics used with the constant stream of event data has allowed us to track the MTTC and MTTR of items at the repository and organization level.

MTTR is primarily about response times. How quickly do we respond back throughout the lifecycle of an issue (or PR)? A lower MTTC indicates that issues are closed faster. This may mean a few things: Maintainers have answered questions, reproduced issues, and actioned PRs. There are a lot of factors that contribute to an item being closed, which may have varying timelines. One example is feature requests that may remain open longer than a normal issue. There are caveats to each metric, but this helps to isolate issues that need further investigation.

Even with the progress in tracking, a few consistent challenges presented themselves. The back and forth communication on issues can be very sporadic. The next step was to isolate which issues and PRs needed a response.

Pending response

It’s difficult and time consuming to keep track of comment responses in issues and PRs using only the notifications view within GitHub. One mechanism is to identify items that were awaiting a response and where the original poster has now followed up (i.e., responded).

The ideal flow is something like this:

Issue is opened
Team responds with follow up or questions
Issue is labeled pending-response

A dashboard displays issues that have the pending-response label AND where an external contributor has responded

This helps reduce active response times and highlights when issues have received a reply. Additionally, this ensures that issues are actioned and don’t remain in an unknown state (i.e., not triaged) while awaiting a reply. Similar to the trending issues above, the team created a dashboard to surface any issue that fits these criteria. This screenshot shows the list of issues that need a response from the maintainers. These issues all have the pending-response label and an external contributor has been the most recent to comment.
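As a rough illustration of the kind of query that sits behind such a dashboard (the tooling itself is internal and not shown here), the GitHub CLI can list open issues carrying the pending-response label; identifying whether the most recent comment came from an external contributor still relies on the event data described earlier:

gh issue list --repo aws-amplify/amplify-cli --label "pending-response" --state open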

Conclusion

Open source is not static. Tools, teams, and community are always evolving and what is working today will certainly change over the next year. Working backwards from consistency and communication has allowed Amplify to focus attention on prioritizing proactive action to make it a pleasant experience for community members to contribute to the projects.

We continue to identify more efficient ways to strengthen the developer relationships and facilitate more open communication across the Amplify repositories. As the community and project evolve, it’s important to remain flexible, communicate early and often, and continue to improve what the team can control.

Interested in learning more about open source at AWS Amplify? Follow @AWSAmplify on Twitter to get the latest updates about the Amplify Contributor Program, explore the open source Amplify GitHub repositories, or join the Amplify Discord server.


AWS Teams with OSTIF on Open Source Security Audits

We are excited to announce that AWS is sponsoring open source software security audits by the Open Source Technology Improvement Fund (OSTIF), a non-profit dedicated to securing open source. This funding is part of a broader initiative at Amazon Web Services (AWS) to support open source software supply chain security.

Last year, AWS committed to investing $10 million over three years alongside the Open Source Security Foundation (OpenSSF) to fund supply chain security. AWS will be directly funding $500,000 to OSTIF as a portion of our ongoing initiative with OpenSSF. OSTIF has played a critical role in open source supply chain security by providing security audits and reviews to projects through their work as a pre-existing partner of the OpenSSF. Their broad experience with auditing open source projects has already provided significant benefits. This month the group completed a significant security audit of Git that uncovered 35 issues, including two critical and one high-severity finding. In July, the group helped find and fix a critical vulnerability in sigstore, a new open source technology for signing and verifying software.

Many of the tools and services provided by AWS are built on open source software. Through our OSTIF sponsorship, we can proactively mitigate software supply chain risk further up the supply chain by improving the health and security of the foundational open source libraries that AWS and our customers rely on. Our investment helps support upstream security and provides customers and the broader open source community with more secure open source software.

Supporting open source supply chain security is akin to supporting electrical grid maintenance. We all need the grid to continue working, and to be in good repair, because nothing gets powered without it. The same is true of open source software. Virtually everything of importance in the modern IT world is built atop open source. We need open source software to be well maintained and secure.

We look forward to working with OSTIF and continuing to make investments in open source supply chain security.


The AWS Modern Applications and Open Source Zone: Learn, Play, and Relax at AWS re:Invent 2022

AWS re:Invent is filled with fantastic opportunities, but I wanted to tell you about a space that lets you dive deep with some fantastic open source projects and contributors: the AWS Modern Applications and Open Source Zone! Located in the east alcove on the third floor of the Venetian Conference Center, this space exists so that re:Invent attendees can be introduced to some of the amazing projects that power and enhance the AWS solutions you know and use. We’ve divided the space up into three areas: Demos, Experts, and Fun.

Demos: Learn and be curious

We have two dedicated demo stations in the Zone and a deep list of projects that we are excited to show you from Amazonians, AWS Heroes, and AWS Community Builders. Please keep in mind this schedule may be subject to change, and we have some last minute surprises that we can’t share here, so be sure to drop by.

Monday, November 28, 2022

Kiosk 1
9 AM – 11 AM: Continuous Deployment and GitOps delivery with Amazon EKS Blueprints and ArgoCD (Tsahi Duek, Dima Breydo)
11 AM – 1 PM: StackGres: An Advanced PostgreSQL Platform on EKS (Alvaro Hernandez)
1 PM – 3 PM: TBD
3 PM – 5 PM: Step Functions templates and prebuilt Lambda Packages for deploying scalable serverless applications in seconds (Rustem Feyzkhanov)

Kiosk 2
9 AM – 11 AM: Data on EKS (Vara Bonthu, Brian Hammons)
11 AM – 1 PM: TBD
1 PM – 3 PM: How to use Amazon Keyspaces (for Apache Cassandra) and Apache Spark to build applications (Meet Bhagdev)
3 PM – 5 PM: Scale your applications beyond IPv4 limits (Sheetal Joshi)

Tuesday, November 29, 2022

Kiosk 1
9 AM – 11 AM: Let’s build a self service developer portal with AWS Proton (Adam Keller)
11 AM – 1 PM: Doing serverless on AWS with Terraform (serverless.tf + terraform-aws-modules) (Anton Babenko)
1 PM – 3 PM: Steampipe (Chris Farris, Bob Tordella)
3 PM – 5 PM: Using Lambda Powertools for better observability in IoT Applications (Alina Dima)

Kiosk 2
9 AM – 11 AM: Build and run containers on AWS with AWS Copilot (Sergey Generalov)
11 AM – 1 PM: Fargate Surprise
1 PM – 3 PM: Fargate Surprise
3 PM – 5 PM: Amplify Libraries Demo (Matt Auerbach)

Wednesday, November 30, 2022

Kiosk 1
9 AM – 11 AM: Building Embedded Devices with FreeRTOS SMP and the Raspberry Pi Pico (Dan Gross)
11 AM – 1 PM: Quantum computing in the cloud with Amazon Braket (Michael Brett, Katharine Hyatt)
1 PM – 3 PM: Leapp (Andrea Cavagna)
3 PM – 5 PM: EKS multicluster management and applications delivery (Nicholas Thomson, Sourav Paul)

Kiosk 2
9 AM – 11 AM: Using SAM CLI and Terraform for local testing (Praneeta Prakash, Suresh Poopandi)
11 AM – 1 PM: How to use Terraform AWS and AWSCC provider in your project (Tyler Lynch, Drew Mullen)
1 PM – 3 PM: How to use Terraform AWS and AWSCC provider in your project (Glenn Chia, Welly Siau)
3 PM – 5 PM: Smart City Monitoring Using AWS IoT and Digital Twin (Syed Rehan)

Thursday, December 1, 2022

Kiosk 1
9 AM – 11 AM: Modern data exchange using AWS data streaming (Ali Alemi)
11 AM – 1 PM: Learn how to leverage your Amazon EKS cluster as a substrate for execution of distributed Ray programs for Machine Learning (Apoorva Kulkarni)
1 PM – 3 PM: TBD

Kiosk 2
9 AM – 11 AM: Spreading apps, controlling traffic, and optimizing costs in Kubernetes (Lukonde Mwila)
11 AM – 1 PM: TBD
1 PM – 3 PM: Terraform IAM policy validator (Bohan Li)

Experts

Pull up a chair, grab a drink and a snack, charge your devices, and have a conversation with some of our experts. We’ll have people visiting the zone all throughout re:Invent, with expertise in a variety of open source technologies and AWS services including (but not limited to):

Amazon Athena
Amazon DocumentDB (with MongoDB compatibility)
Amazon DynamoDB
Amazon Elastic Container Service (Amazon ECS)
Amazon Elastic Kubernetes Service (Amazon EKS)
Amazon EventBridge
Amazon Keyspaces (for Apache Cassandra)
Amazon Kinesis
Amazon Linux
Amazon Managed Grafana
Amazon Managed Service for Prometheus
Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
Amazon MQ
Amazon Redshift
Amazon Simple Notification Service (Amazon SNS)
Amazon Simple Queue Service (Amazon SQS)
Amazon Simple Storage Service (Amazon S3)
Apache Flink, Hadoop, Hudi, Iceberg, Kafka, and Spark
Automotive Grade Linux
AWS Amplify
AWS App Mesh
AWS App Runner
AWS CDK
AWS Copilot
AWS Distro for OpenTelemetry
AWS Fargate
AWS Glue
AWS IoT Greengrass
AWS Lambda
AWS Proton
AWS SDKs
AWS Serverless Application Model (AWS SAM)
AWS Step Functions
Bottlerocket
Cloudscape Design System
Embedded Linux
Flutter
FreeRTOS
Javascript
Karpenter
Lambda Powertools
OpenSearch
Red Hat OpenShift Service on AWS (ROSA)
Rust
Terraform

Fun

Want swag? We’ve got it, but it is protected by THE CLAW. That’s right, we brought back the claw machine, and this year, we might have some extra special items in there for you to catch. No spoilers, but we’ve heard there have been some Rustaceans sighted. You might want to bring an extra (empty) suitcase.

But we’re not done. By popular request, we also brought back Dance Dance Revolution. Warm up your dancing shoes or just cheer on the crowd. You never know who will be showing off their best moves.

Conclusion

The AWS Modern Applications and Open Source Zone is a must-visit destination for your re:Invent journey. With demos, experts, food, drinks, swag, games, and mystery surprises, how can you not stop by?
