This is the first post in a series about the motivation and decision-making process behind our cloud strategy for Sandbox.
Setting the stage
Our first assumption was that, because Sandbox follows a client-server architecture and most of its core functionality lives on the client side, our cloud requirements wouldn’t be as demanding as those of a pure SaaS product.
When we started architecting our cloud components for Sandbox, we came up with the following list of requirements:
- It had to give us a short time-to-market. Having to rewrite certain parts later was a side effect we were willing to accept.
- It had to be cost efficient. We didn’t anticipate much traffic in the early days after our initial go-live, so it felt natural to look for a ‘Pay as you go’ model rather than have resources sitting idle until we ramped up our user base.
- It needed to allow a simple go-live strategy, both in terms of process and of tooling. It also needed to leverage our existing knowledge of cloud providers and their products and services.
So, what did we decide to go with, and most importantly, why?
First, choosing the cloud provider: we had a certain degree of experience using AWS, from our professional lives as consultants but especially from our days working at StackFoundation, so this was the easy choice!
The AWS offering is huge and constantly evolving, so deciding which services to use wasn’t that straightforward. From a computing perspective, there seemed to be two clear choices: using ECS to manage our EC2 instances, or going serverless with Lambdas. Our experience leaned more towards EC2, but when we revisited our cloud architecture requirements, we concluded that Lambdas were a better fit, as they ticked all the boxes:
- In terms of the actual application development, choosing EC2 or Lambdas didn’t seem to make much of a difference. Once you decide whether to go with plain Lambdas or use a framework that brings Lambda development closer to standard API development, feature development is pretty straightforward, with a gentle learning curve. We initially went with Jrestless, although we have since crafted our own solution, the lambda-api module of the Datamill project (more on this in future posts). In the EC2 world, this would have meant running micro-services on containers, so pretty much the same picture.
- The term ‘Pay as you go’ feels like it was coined for Lambdas, at least in terms of compute resources. You get billed based on your memory requirements and actual usage period (measured to the nearest 100ms). With EC2, on the other hand, you have to keep your instances live 24×7. So we either had to go with low-powered T2 family instances to keep the cost low, and struggle if we got unpredicted usage spikes, or go with M3 family instances and potentially waste money on them whenever they were under-used, which would probably have been the case during our early preview period.
- In terms of overall architecture, configuring ECS clusters and all the components around them is more complicated than simply worrying about application code, which is one of the selling points of Lambdas (and of the serverless movement in general). This may be an overstatement, as Lambdas need additional configuration too. For example, you still have to take care of permissions (namely Lambda access to other AWS resources like SES, S3, DDB, etc). Even so, this is definitely at least an order of magnitude simpler than doing the same on the ECS front. In addition, even though the AWS console is the natural place to start until you get a grasp of how the service works and which configuration options you have, you will eventually want to move to some kind of tooling to take care of deployment, both for your lower environments and for production. We came across Serverless, a NodeJS-based framework that helps you manage your Lambda infrastructure (or the function-based offerings of other providers). The alternative would probably have been Terraform, which is an infrastructure-as-code framework. Our call here was driven both by project needs and by the learning curve of the tooling. If we went with Lambdas, they would be pretty much the only AWS resources we had to worry about, so Serverless offered all we needed in this sense. A general-purpose framework like Terraform, by contrast, would have meant dealing with concepts and features we didn’t need for our use case.
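To make the ‘Pay as you go’ comparison above more concrete, here is a minimal sketch in Python of how the two billing models could be estimated. It assumes that each invocation is rounded up to the next 100ms, and every price and traffic figure passed in is a hypothetical placeholder rather than an actual AWS rate, so check the current pricing pages before relying on any number.

```python
import math


def billed_ms(duration_ms: float, granularity_ms: int = 100) -> int:
    """Round one invocation's duration up to the billing granularity.

    Assumption: durations are rounded up to the next 100ms step.
    """
    return int(math.ceil(duration_ms / granularity_ms)) * granularity_ms


def lambda_compute_cost(invocations: int,
                        avg_duration_ms: float,
                        memory_mb: int,
                        price_per_gb_second: float) -> float:
    """Estimate compute cost for a function billed per GB-second.

    price_per_gb_second is a placeholder supplied by the caller,
    not a real AWS price.
    """
    seconds = billed_ms(avg_duration_ms) / 1000
    gb_seconds = invocations * seconds * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


def always_on_cost(hourly_price: float, hours: int = 24 * 30) -> float:
    """Cost of an instance that stays up whether or not it is used."""
    return hourly_price * hours


if __name__ == "__main__":
    # Hypothetical early-preview traffic and made-up prices.
    print(lambda_compute_cost(1000, 250, 512, price_per_gb_second=0.00002))
    print(always_on_cost(hourly_price=0.05))
```

The point of the sketch is the shape of the two formulas: the first scales with actual usage, while the second is a flat amount that is owed even when there is no traffic at all.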
So, now that we’re live, how do we feel about our assumptions/decisions after the fact?
In terms of application development, the only caveat we found was around E2E testing. We obviously have a suite of unit and integration tests on our cloud components, but during development we realised that some bugs arose from the integration with our cloud frontend and couldn’t be covered by our integration tests. This was especially true when services like API Gateway were involved in the problem, as we haven’t found a way of simulating them in our local environment. We came across localstack late in the day and gave it a quick try, but it didn’t seem to be stable enough. We couldn’t spend longer on it at the time, so we decided to cover these corner scenarios with manual testing. We might revisit this decision going forward, though.
In terms of our experience of working with Lambdas, the only downside we came across was the problem of cold Lambdas. AWS will keep your Lambdas running for an undocumented period, after which the Lambda infrastructure has to be recreated before the next request can be served, which adds latency to that request. We knew about this before going down the Lambda path, so we can’t say it came as a surprise. We had agreed that, in the worst case, we would go with a solution we found proposed on several sites: having a scheduled Lambda that keeps our API Lambdas “awake” permanently. To this end, we added a health check endpoint to every function, which is called by our scheduled Lambda. In future releases we will also use the health check response to take further actions, for example to notify us of downtime.
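The keep-warm approach can be sketched roughly as follows. This is an illustration in Python rather than our actual code: the URLs, the `ping_all` helper and the shape of the return value are all hypothetical, and it assumes each health check endpoint answers with HTTP 200 when the function is up.

```python
import urllib.request
from typing import Callable, Dict, Iterable

# Hypothetical health check endpoints, one per API function.
HEALTH_CHECK_URLS = [
    "https://api.example.com/users/health",
    "https://api.example.com/projects/health",
]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Issue a GET request and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def ping_all(urls: Iterable[str],
             fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Call every health check and record whether it answered with 200.

    Any exception (timeout, DNS failure, HTTP error status) is recorded
    as unhealthy so that one failing function does not stop the others
    from being pinged.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url) == 200
        except Exception:
            results[url] = False
    return results


def handler(event, context):
    """Entry point for the scheduled Lambda (assumed to run on a timer)."""
    results = ping_all(HEALTH_CHECK_URLS)
    # The per-function results could later drive downtime notifications.
    return {"healthy": all(results.values()), "results": results}
```

Passing the `fetch` function in as a parameter keeps the loop testable without network access, and the returned dictionary is the natural place to hang the downtime notifications mentioned above.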
Overall, we’re quite happy with the results. We may decide to change certain aspects of our general architecture, or even move to an ECS-based solution if we go SaaS, but for the time being this one seems like a perfect fit for our requirements. Further down the line, we’re planning to write a follow-up post on the concepts discussed in this post, with some stats and further insights into how this works both in production and as part of our SDLC.
Of course, there is no rule of thumb here, and what works for us may not work for other teams or projects, even if they share our list of requirements. In any case, if you’re going down the serverless path, we’d love to hear about your experiences!