Backup and replication of your DynamoDB tables

We use Amazon’s DynamoDB (DDB) as part of our platform. As stated in the FAQ section, AWS itself replicates the data across three facilities (Availability Zones, AZs) within a given region, to automatically cope with a possible outage of any of them. This is a relief, and useful as part of an out-of-the-box solution, but you’d probably want to go beyond this setup, depending on your high availability and disaster recovery requirements.

I have recently done some research and POCs on how best to achieve a solution in line with our current setup. We needed it to:

  • be as cost effective as possible, while covering our needs
  • introduce the least possible complexity in terms of deployment and management
  • satisfy our current data backup needs and be in line with allowing us to handle high availability in the near future.

There’s definitely some good literature on the topic online (1), besides related AWS resources, but I have decided to write a series of posts which will hopefully provide a more practical view of the problem and the range of possible solutions.

In terms of high availability, your safest bet would probably be cross-region replication of your DDB tables. In a nutshell, this allows you to create replica tables of your master ones in a different AWS region. Luckily, AWS Labs provides an implementation of this, open-sourced and hosted on GitHub. If you take a close look at the project’s README, you’ll notice it is implemented using the Kinesis Client Library (KCL). It relies on DDB Streams, so streaming needs to be enabled on the DDB tables you want to replicate, at least on the master ones (replicas don’t need it).
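As a minimal sketch of that prerequisite, the snippet below builds the parameters for DynamoDB's UpdateTable call that turns on a stream for a master table. The helper names and the choice of the NEW_AND_OLD_IMAGES view type are assumptions made for illustration; check the replication project's README for the view type it actually expects. The client argument is expected to be a boto3 DynamoDB client, passed in so the example stays independent of credentials.

```python
def stream_spec_params(table_name, view_type="NEW_AND_OLD_IMAGES"):
    """Build UpdateTable parameters that enable a stream on one table.

    The view type default is an assumption, not something the post specifies.
    """
    return {
        "TableName": table_name,
        "StreamSpecification": {
            "StreamEnabled": True,
            "StreamViewType": view_type,
        },
    }


def enable_stream(client, table_name):
    """Enable the stream on a master table; replica tables are left as they are.

    `client` is assumed to be a boto3 DynamoDB client for the master's region.
    """
    return client.update_table(**stream_spec_params(table_name))
```

With boto3 installed, this could be called as enable_stream(boto3.client("dynamodb", region_name="us-east-1"), "my-master-table"), where both the region and the table name are placeholders.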

From what I’ve seen, there are several ways of accomplishing our data replication needs:

Using a CloudFormation template

One option is to use a CloudFormation (CF) template to set up all the infrastructure you need to run the cross-region replication implementation mentioned above. If you’re not very familiar with CF, AWS describes it as:

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

Creating a stack with it is quite straightforward. The wizard lets you configure the options in the following screenshot, plus some more advanced ones on the next screen, for which you can use the defaults in a basic setup.

[Screenshot: CloudFormation stack creation wizard options]

This template takes care of creating everything, from your IAM roles and Security Groups to launching the EC2 instances that perform the job. One of those instances coordinates replication and the other(s) take care of the actual replication process (i.e. running the KCL worker processes). The worker instances are implicitly defined as part of an autoscaling group, to guarantee that they are always running. Otherwise, events from the DDB stream could go unprocessed, which would lead to data loss.

I couldn’t fully test this method: after CF finished setting everything up, I couldn’t use the ReplicationConsoleURL to configure master/replica tables due to the AWS error below. In any case, I wanted more fine-grained control of the process, so I looked into the next option.

[Screenshot: AWS error shown when opening the ReplicationConsoleURL]

Manually creating your AWS resources and running the replication process

This would basically imply performing most of what CF does on your behalf. So it would mean quite a bit more work in terms of infrastructure configuration, be it through the AWS console or as part of your standard environment deployment process.

I believe this is a valid scenario if you want to use your existing AWS resources to run the worker processes. You’ll need to weigh your cost restrictions and computing resource needs before considering this a valid approach. In our case, it would help us with both, so I decided to explore it further.

Given that we already have EC2 resources set up as part of our deployment process, I decided to create a simple bash script that would kick off the replication process as part of our deployment. It basically takes care of installing the required OS dependencies, cloning the git repo and building it, and then executing the process. It requires four arguments (source region/table and target region/table). It doesn’t perform any setup on your behalf, so the tables passed as arguments need to exist in the specified regions, and the source table must have streaming enabled.
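The script itself is bash and is not reproduced here. As a rough illustration of the two preconditions just described, here is a Python sketch of a preflight check that could run before starting replication. The function names and argument names are hypothetical; the clients are assumed to be boto3 DynamoDB clients, one per region, and the only AWS call used is DescribeTable.

```python
import argparse


def parse_args(argv):
    """Parse the four positional arguments the replication needs."""
    parser = argparse.ArgumentParser(description="Preflight check for DDB replication")
    parser.add_argument("source_region")
    parser.add_argument("source_table")
    parser.add_argument("target_region")
    parser.add_argument("target_table")
    return parser.parse_args(argv)


def source_stream_enabled(description):
    """Inspect a DescribeTable response and report whether streaming is on."""
    spec = description["Table"].get("StreamSpecification", {})
    return bool(spec.get("StreamEnabled"))


def preflight(source_client, target_client, args):
    """Return a list of problems; an empty list means the tables look ready.

    describe_table raises an error if a table does not exist, which is
    deliberately left to propagate to the caller in this sketch.
    """
    problems = []
    source = source_client.describe_table(TableName=args.source_table)
    if not source_stream_enabled(source):
        problems.append("streaming is not enabled on " + args.source_table)
    # Called only to confirm the replica table exists in the target region.
    target_client.describe_table(TableName=args.target_table)
    return problems
```

Passing the clients in, rather than creating them inside the function, keeps the check easy to exercise with stand-in objects before pointing it at real tables.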

This proved to be a simple enough approach, and worked as expected. The only downside is that, even though it runs within our existing EC2 fleet, we still needed to figure out a mechanism for monitoring the worker process, in order to restart it if it dies for any reason and avoid data loss as mentioned above. Definitely an approach we might end up using in the near future.
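One possible shape for such a monitoring mechanism is a small supervisor loop that restarts the worker whenever it exits. The sketch below is an assumption about how this could be done, not what we actually deployed; in practice a process manager such as systemd or supervisord would be a more robust choice. The function name and its parameters are made up for the example.

```python
import subprocess
import time


def supervise(command, max_restarts=None, delay_seconds=5.0, sleep=time.sleep):
    """Run `command` and start it again every time it exits.

    With max_restarts=None the loop never returns, which is what a
    long-running replication worker would want. A finite value stops after
    that many restarts and returns the exit code of every run.
    """
    exit_codes = []
    while True:
        # Blocks until the worker process terminates, for whatever reason.
        exit_codes.append(subprocess.call(command))
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        # Short pause so a worker that crashes on startup does not spin the CPU.
        sleep(delay_seconds)
```

The command would be whatever starts the replication worker on the instance. Note that this only covers the worker dying; it does nothing if the instance itself goes away.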

Using Lambda to process the DDB stream events

This method uses the same approach as the one above, in that it relies on events from your DDB table streams, but it removes the need to manage the AWS compute resources yourself. You will still need to handle some infrastructure and write the Lambda function that performs the actual replication, but it will definitely help with the cost and simplicity requirements mentioned in the introduction.
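To give a first idea of what such a function involves, here is a deliberately naive sketch that replays each stream record against a replica table. It assumes the stream's view type includes the new image, that `client` is a boto3 DynamoDB client for the replica's region, and it ignores batching, retries and conflict handling. The function names are hypothetical.

```python
def apply_record(client, table_name, record):
    """Replay one DynamoDB stream record against the replica table."""
    change = record["dynamodb"]
    if record["eventName"] in ("INSERT", "MODIFY"):
        # NewImage is already in DynamoDB attribute-value format,
        # so it can be written back without any conversion.
        client.put_item(TableName=table_name, Item=change["NewImage"])
    elif record["eventName"] == "REMOVE":
        client.delete_item(TableName=table_name, Key=change["Keys"])


def replicate(client, table_name, event):
    """Apply every record of a stream event in order; return how many were seen."""
    for record in event["Records"]:
        apply_record(client, table_name, record)
    return len(event["Records"])
```

An actual Lambda entry point would be a thin wrapper that takes (event, context), creates the client for the replica region and calls replicate with the replica table name.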

I will leave the details of this approach for the last post of this series though, as it is quite a broad topic that I will cover there in detail.

In the upcoming posts I will discuss the final solution we end up going with. Before getting to that, in my next post, I will discuss how to back up your DDB tables to S3.

Stay tuned!