Going serverless with Lambdas

This is the first post of a series where we’ll be writing about the motivation and decision-making process behind our cloud strategy for Sandbox.

Setting the stage

Our first assumption was that, since Sandbox follows a client-server architecture with most of its core functionality on the client side, our cloud requirements wouldn’t be as high as those of a pure SaaS product.

When we started architecting our cloud components for Sandbox, we came up with the following list of requirements:

  1. It had to give us a low time-to-market. We would accept having to rewrite certain parts later as a side effect of this.
  2. It had to be cost efficient. We didn’t anticipate much traffic in the early days after our initial go-live, so it felt natural to look for a ‘Pay as you go’ model rather than have resources sitting idle until we ramped up our user base.
  3. It needed to allow for a simple go-live strategy, both in terms of process and of the tooling used. Ideally, it would also leverage our previous knowledge of cloud providers and their products/services.

So, what did we decide to go with, and most importantly, why?

First, choosing the cloud provider: we had a certain degree of experience with AWS, from our professional lives as consultants, but especially from our days working at StackFoundation, so this was the easy choice!

The AWS offering is huge and constantly evolving, so deciding which services to use wasn’t that straightforward. From a compute perspective, there seemed to be two clear choices: using ECS to run containers on EC2 instances, or going serverless with Lambdas. Our experience leaned more towards EC2, but when we revisited our cloud architecture requirements, we concluded Lambdas were a better fit, as they ticked all the boxes:

  1. In terms of actual application development, choosing EC2 or Lambdas didn’t seem to make much of a difference. Once you decide whether to go with plain Lambdas or to use a framework that makes Lambda development closer to standard API development, feature development was pretty straightforward, with a low learning curve (see the sketch after this list). We initially went with Jrestless, although we later crafted our own solution, the lambda-api module of the Datamill project; more on this in future posts. In the EC2 world, this would have meant running micro-services on containers, so pretty much the same picture.
  2. The term ‘Pay as you go’ feels like it was coined for Lambdas, at least in terms of compute resources: you get billed based on your memory allocation and actual execution time (rounded up to the nearest 100ms). With EC2 instances, on the other hand, you have to keep your instances live 24×7. So we would either have had to go with low-powered T2 family instances to keep costs low, and struggle if we got unexpected usage spikes, or go with M3 family instances and potentially waste money during periods when they were under-used, which would probably be the case during our early preview period.
  3. In terms of overall architecture, configuring ECS clusters and all the different components around them is more complicated than simply worrying about application code, which is one of the selling points of Lambdas (and the serverless movement in general). This may be an overstatement, as even with Lambdas there is additional configuration to handle, for example permissions (namely, Lambda access to other AWS resources like SES, S3, DynamoDB, etc.), but it is at least an order of magnitude simpler than doing the same on the ECS front. In addition, even though the AWS console is the natural place to start until you get a grasp of how the service works and the different configuration options you have, you will eventually want to move to some kind of tooling to take care of deployments, both for your lower environments and for production. We came across Serverless, a NodeJS-based framework that helps you manage your Lambda infrastructure (or the function-based offerings of other providers). The alternative would probably have been Terraform, an infrastructure-as-code framework. Our call here was motivated both by project needs and by the learning curve of the tooling: if we went with Lambdas, they would be pretty much the only AWS resource we would need to manage, so Serverless offered all we needed, whereas a general-purpose framework like Terraform would have meant dealing with concepts and features we wouldn’t need for our use case.
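
To give a flavour of the ‘plain Lambda’ route mentioned in point 1, here is a minimal sketch of a handler sitting behind API Gateway’s proxy integration. It assumes the aws-lambda-java-core and aws-lambda-java-events libraries; the class name and response shape are purely illustrative and not our actual code.

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
    import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;

    // A plain Lambda handler behind API Gateway (proxy integration).
    // A framework like Jrestless lets you keep writing standard JAX-RS resources
    // instead and takes care of this request/response mapping for you.
    public class HelloHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

        @Override
        public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent request, Context context) {
            String name = request.getQueryStringParameters() != null
                    ? request.getQueryStringParameters().getOrDefault("name", "world")
                    : "world";

            return new APIGatewayProxyResponseEvent()
                    .withStatusCode(200)
                    .withBody("{\"message\": \"Hello, " + name + "\"}");
        }
    }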

So, now that we’re live, how do we feel about our assumptions/decisions after the fact?

In terms of application development, the only caveat we found was around E2E testing. We obviously have a suite of unit and integration tests for our cloud components, but during development we realised that some bugs arose from the integration with our cloud frontend that couldn’t be covered by our integration tests. This was especially true when services like API Gateway were involved, as we haven’t found a way of simulating them in our local environment. We came across localstack late in the day and gave it a quick try, but it didn’t seem stable enough. We couldn’t spend longer on it at the time, so we decided to cover these corner scenarios with manual testing. We might revisit this decision going forward, though.

In terms of our experience working with Lambdas, the only downside we came across was the problem of cold starts. AWS keeps an idle Lambda’s underlying infrastructure around for an undocumented period, after which the next invocation has to wait for that infrastructure to be recreated before the request can be served. We knew about this before going down the Lambda path, so we can’t say it came as a surprise. We had agreed that, in the worst case, we would go with a solution proposed on several sites: a scheduled Lambda that keeps our API Lambdas “awake” permanently. To this end, we added a health check endpoint to every function, which our scheduled Lambda calls periodically. In future releases we will also use the health check response to take further action, for example notifying us of downtime.
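
As a rough illustration, a keep-warm function along these lines could look like the following minimal sketch. The endpoint URLs are placeholders (in practice they would come from configuration), and this is not our actual implementation.

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    // Triggered by a CloudWatch Events schedule (e.g. rate(5 minutes)).
    // Hitting each health check endpoint keeps the corresponding API Lambda warm
    // and tells us whether it is responding at all.
    public class KeepWarmHandler implements RequestHandler<Map<String, Object>, String> {

        // Placeholder URLs: in practice these would be the health endpoints of our API functions.
        private static final List<String> HEALTH_ENDPOINTS = Arrays.asList(
                "https://api.example.com/users/health",
                "https://api.example.com/projects/health");

        @Override
        public String handleRequest(Map<String, Object> scheduledEvent, Context context) {
            for (String endpoint : HEALTH_ENDPOINTS) {
                try {
                    HttpURLConnection connection = (HttpURLConnection) new URL(endpoint).openConnection();
                    connection.setRequestMethod("GET");
                    int status = connection.getResponseCode();
                    context.getLogger().log(endpoint + " -> " + status);
                    // In a future release, a non-2xx status here could trigger a notification.
                } catch (Exception e) {
                    context.getLogger().log(endpoint + " -> " + e.getMessage());
                }
            }
            return "done";
        }
    }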

Overall, we’re quite happy with the results. We may decide to change certain aspects of our general architecture, or even move to an ECS-based solution if we go SaaS, but for the time being this one seems like a perfect fit for our requirements. Further down the line, we’re planning to write a follow-up post on the concepts discussed here, with some stats and further insights into how this works both in production and as part of our SDLC.

Of course, there is no rule of thumb here, and what works for us may not work for other teams/projects, even if they share our list of requirements. In any case, if you’re going down the serverless path, we’d love to hear about your experiences!

Backup and replication of your DynamoDB tables

We are using Amazon’s DynamoDB (DDB) as part of our platform. As stated in the FAQ section, AWS itself replicates the data across three facilities (Availability Zones, AZs) within a given region, to automatically cope with an eventual outage of any of them. This is a relief, and useful as an out-of-the-box safeguard, but you’ll probably want to go beyond this setup, depending on what your high availability and disaster recovery requirements are.

I have recently done some research and POCs into how best to achieve a solution in line with our current setup. We needed it to:

  • be as cost effective as possible, while covering our needs
  • introduce the least possible complexity in terms of deployment and management
  • satisfy our current data backup needs and leave the door open to handling high availability in the near future.

There’s definitely some good literature on the topic online (1), besides the related AWS resources, but I have decided to write a series of posts which will hopefully provide a more practical view of the problem and the range of possible solutions.

In terms of high availability, probably your safest bet would be to go with cross-region replication of your DDB tables. In a nutshell, this allows you to create replica tables of your master ones in a different AWS region. Luckily, AWS Labs provides an open-sourced implementation of this, hosted on GitHub. If you take a close look at the project’s README, you’ll notice it is implemented using the Kinesis Client Library (KCL). It works by consuming DDB Streams, so streaming needs to be enabled on the tables you want to replicate, at least on the master ones (replicas don’t need it).
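
For reference, enabling a stream on an existing table can be done from the console or programmatically. Below is a minimal sketch using the AWS SDK for Java; the region and table name are placeholders.

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.dynamodbv2.model.StreamSpecification;
    import com.amazonaws.services.dynamodbv2.model.StreamViewType;
    import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

    // Enable a stream on the master table so the replication process can consume its changes.
    public class EnableStream {
        public static void main(String[] args) {
            AmazonDynamoDB dynamoDb = AmazonDynamoDBClientBuilder.standard()
                    .withRegion("eu-west-1") // master table's region (placeholder)
                    .build();

            dynamoDb.updateTable(new UpdateTableRequest()
                    .withTableName("my-master-table") // placeholder table name
                    .withStreamSpecification(new StreamSpecification()
                            .withStreamEnabled(true)
                            // NEW_AND_OLD_IMAGES provides full item images for the replication worker to apply.
                            .withStreamViewType(StreamViewType.NEW_AND_OLD_IMAGES)));
        }
    }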

From what I’ve seen, there would be several ways of accomplishing our data replication needs:

Using a CloudFormation template

Using a CloudFormation (CF) template to take care of setting up all the infrastructure you need to run the cross-region replication implementation mentioned above. If you’re not very familiar with CF, AWS describes it as:

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

Creating a stack with it is quite straightforward, and the wizard will let you configure the options shown in the following screenshot, besides some more advanced ones on the next screen, for which you can use the defaults in a basic setup.

[Screenshot: CloudFormation stack creation wizard showing the template’s configuration options]

Using this template takes care of creating everything, from your IAM roles and Security Groups to launching the EC2 instances that perform the job. One of those instances takes care of coordinating replication, and the other(s) take care of the actual replication process (i.e. running the KCL worker processes). The worker instances are implicitly defined as part of an auto-scaling group, to guarantee that they are always running and so prevent events from the DDB stream going unprocessed, which would lead to data loss.

I couldn’t fully test this method: after CF finished setting everything up, I couldn’t use the ReplicationConsoleURL to configure master/replica tables, due to the AWS error below. In any case, I wanted more fine-grained control of the process, so I looked into the next option.

[Screenshot: AWS error returned when opening the ReplicationConsoleURL]

Manually creating your AWS resources and running the replication process

This basically means performing yourself most of what CF does on your behalf, so it would mean quite a bit more work in terms of infrastructure configuration, be it through the AWS console or as part of your standard environment deployment process.

I believe this would be a valid scenario if you want to use your existing AWS resources to run the worker processes. You’ll need to weigh up your cost restrictions and computing resource needs before considering this a valid approach. In our case, it would help us on both fronts, so I decided to explore it further.

Given that we already have EC2 resources set up as part of our deployment process, I decided to create a simple bash script that kicks off the replication process as part of our deployment. It basically takes care of installing the required OS dependencies, cloning the git repo and building it, and then executing the process. It requires four arguments to be provided (source region/table and target region/table). Obviously, it doesn’t perform any setup on your behalf, so the tables passed as arguments need to exist in the specified regions, and the source table must have streaming enabled.

This proved to be a simple enough approach, and it worked as expected. The only downside is that, even though it runs within our existing EC2 fleet, we still needed to figure out a mechanism for monitoring the worker process, so we can restart it if it dies for any reason and avoid the data loss mentioned above. Definitely an approach we might end up using in the near future.

Using Lambda to process the DDB stream events

This method takes the same approach as the one above, in that it relies on events from your DDB table streams, but it removes the need to take care of the AWS compute resources required to process them. You will still need to handle some infrastructure and write the Lambda function that performs the actual replication, but it definitely helps with the cost and simplicity requirements mentioned in the introduction.
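
Just to give a flavour of it before that post, here is a minimal sketch of such a function in Java. It assumes the aws-lambda-java-events flavour in which stream records expose the SDK’s AttributeValue types; the target region and table name are placeholders, and this is not our actual implementation (it ignores ordering and error handling, for instance).

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

    // Subscribed to the master table's stream; applies each change to a replica table in another region.
    public class ReplicationHandler implements RequestHandler<DynamodbEvent, Void> {

        private static final String REPLICA_TABLE = "my-replica-table"; // placeholder
        private final AmazonDynamoDB replicaRegionClient = AmazonDynamoDBClientBuilder.standard()
                .withRegion("us-east-1") // replica region (placeholder)
                .build();

        @Override
        public Void handleRequest(DynamodbEvent event, Context context) {
            for (DynamodbStreamRecord record : event.getRecords()) {
                if ("REMOVE".equals(record.getEventName())) {
                    // Item deleted on the master table: delete it on the replica too.
                    replicaRegionClient.deleteItem(REPLICA_TABLE, record.getDynamodb().getKeys());
                } else {
                    // INSERT or MODIFY: write the new image of the item to the replica.
                    replicaRegionClient.putItem(REPLICA_TABLE, record.getDynamodb().getNewImage());
                }
            }
            return null;
        }
    }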

I will leave the details of this approach for the last post of this series, though, as it is quite a broad topic that deserves to be covered in detail.

In upcoming posts I will discuss the overall solution we end up going with, but before getting to that, in my next post I will discuss how to back up your DDB tables to S3.

Stay tuned!