Infra As Code - Ramblings

Ever since I started working in the DevOps/Site Reliability space, I have primarily worked with Cloud Infrastructure, and the paradigm of ‘Servers are cattle, not pets’ has permeated the industry to a good extent. While working with different teams, projects and scales, I have had the opportunity to work with a variety of tools that help with Infrastructure Management, as well as Configuration Management. A few weeks ago, I helped set up infrastructure with the AWS Cloud Development Kit (CDK), and I have some thoughts! I’ll end up comparing this with Terraform, the Infrastructure as Code tool that I am the most familiar with, and the one I used just before I picked up CDK.

Terraform - the status quo

Terraform manages cloud infrastructure through a declarative, human readable language (well, sort of). Right before using TF, I worked on a team that had a home-brewed Python codebase for imperatively managing AWS and Azure resources, which required us to think explicitly about not just CRUD operations, but also various edge cases. This gave us a ton of flexibility, but also meant that we needed to be very thorough with our implementation, or we would end up causing production downtime. With a declarative approach, a huge burden was lifted from my shoulders. The authors of the underlying providers still have to handle the same edge cases, but because providers are often open source and far more battle tested than an in-house codebase, the risk of damaging user facing services is reduced significantly.

The Happy Path

If you’re working on a project that uses popular providers (AWS, GCP, Azure, Linode, etc), then you’ll have a great time. Some of them even have well designed modules for the most commonly used resources (e.g. Kubernetes clusters, Autoscaling groups, Networks, and so on). If you find good modules created by third party folks (I particularly liked Google Cloud’s modules), you can get started very quickly, but you need to budget time for keeping up with module version upgrades. Minor version upgrades are usually fine, but major upgrades almost always break compatibility in some form. Applying them to your codebase can require significant reengineering or even building out a new set of infrastructure. In addition to the module upgrades, the Terraform binary itself needs to be upgraded periodically. However, since HCL (HashiCorp Configuration Language) has reached a point of stability and maturity, the binary upgrades themselves tend to be relatively stress free.

The steep climbs

A major downside of using third party modules is that we need to understand their abstraction models, and they may not always be a good fit for our mental models, or the projects/services we are building out. We also need our own tooling (or, if you’re using HashiCorp’s Terraform Cloud, its reporting dashboards) to keep track of what versions are configured across the codebase, and ways to upgrade the various statefiles and their configurations. If you don’t have this tooling in place, every upgrade becomes a large chore, modules get pinned to older versions to avoid breakages, and before long, you have a large tech debt project to get everything to recent versions.

If you stare at this situation from a distance, you quickly realise that this entire toolchain is not made for folks unfamiliar with infrastructure work. It is a powerful tool, but not a beginner friendly one. This requires the presence of a dedicated infrastructure engineer or team, in addition to the product folks.

Portability?

It is impossible to write ‘portable’ code. Every cloud provider has their own set of parameters and resources for the same object (see aws_instance vs. google_compute_instance). Unless your application uses the lowest common denominators for a given resource type, it is a very bad idea to write a wrapper module that lets a user create an instance in either of the cloud providers, with the flip of a variable.
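To make the lowest common denominator problem a bit more concrete, here is a rough sketch in TypeScript (not HCL). The field names follow a few real Terraform arguments, but the interfaces are heavily trimmed, and PortableInstance, toAws, toGcp and the size tables are all hypothetical names I made up for illustration.

```typescript
// Trimmed-down, illustrative shapes. The field names follow real Terraform
// arguments, but these interfaces are assumptions made for this sketch.
interface AwsInstanceArgs {
  ami: string;
  instance_type: string;
  subnet_id: string;
}

interface GoogleComputeInstanceArgs {
  machine_type: string;
  zone: string;
  boot_disk: { initialize_params: { image: string } };
  network_interface: { subnetwork: string };
}

// Hypothetical "portable" input: only what both providers share survives.
interface PortableInstance {
  size: "small" | "large";
  image: string; // still provider specific: an AMI ID means nothing on GCP
  subnet: string;
}

// Hypothetical size mappings, picked only for the example.
const AWS_SIZES = { small: "t3.micro", large: "m5.large" };
const GCP_SIZES = { small: "e2-micro", large: "e2-standard-2" };

function toAws(p: PortableInstance): AwsInstanceArgs {
  return { ami: p.image, instance_type: AWS_SIZES[p.size], subnet_id: p.subnet };
}

function toGcp(p: PortableInstance, zone: string): GoogleComputeInstanceArgs {
  // `zone` is a separate argument here, while on AWS the subnet already
  // implies the availability zone, so the abstraction leaks immediately.
  return {
    machine_type: GCP_SIZES[p.size],
    zone,
    boot_disk: { initialize_params: { image: p.image } },
    network_interface: { subnetwork: p.subnet },
  };
}
```

Even in this tiny example, the wrapper either drops provider specific options or grows provider specific parameters, at which point the ‘flip of a variable’ promise is gone.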

Not-Invented-Here may be the right approach

Considering all these factors, I’ve come to the conclusion that writing your own core modules from scratch might be a good idea, especially if your team has relatively senior infrastructure folks, and this is not their first rodeo with Terraform. You start slower, but get to very selectively choose what features to support (or not), and later on, your team can focus more on enabling the business, rather than spending time on toil.

AWS CDK - a new friend?

CDK is a framework for managing AWS infrastructure from multiple languages (Python, TypeScript, Java, Go and C#/.NET). The framework includes a set of modules (called Constructs) and a CLI, which we can string together to create stacks and applications. I find the approachability of this framework to be very good, as it lets most folks work with a language they’re already familiar with. After years of working in a declarative language that had very limited (maybe even unintuitive?) flow control capabilities, it felt very good to write regular classes and methods with the usual expressiveness of TypeScript.

Imperative? Maybe not!

If you’re just starting out with CDK, it feels like you’re writing imperative code in your favourite programming language to create (or update) infrastructure. In reality though, the framework generates JSON (or YAML) files that are then shipped to AWS CloudFormation, which does the actual heavy lifting of managing the infrastructure on our behalf. This deceptively simple behaviour can cause junior folks to make mistakes that end up recreating resources as unexpected side effects. As long as we temper our expectations, and go read up about CloudFormation’s approach to reconciling changes in code with infrastructure in the real world, the chances of shooting ourselves in the foot reduce significantly.
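Here is a toy model of that idea, as a minimal sketch. ToyStack, addBucket and buildStack are hypothetical names and are not the CDK API. The point is only that the loop runs on my machine, and what comes out is plain data describing resources.

```typescript
// Toy model of "code that synthesises a template". These classes are
// hypothetical and are NOT the CDK API.
type ToyResource = { Type: string; Properties: { [key: string]: string } };
type ToyTemplate = { Resources: { [logicalId: string]: ToyResource } };

class ToyStack {
  private readonly resources: { [logicalId: string]: ToyResource } = {};

  addBucket(logicalId: string, bucketName: string): void {
    if (this.resources[logicalId] !== undefined) {
      throw new Error(`Duplicate logical ID: ${logicalId}`);
    }
    this.resources[logicalId] = {
      Type: "AWS::S3::Bucket",
      Properties: { BucketName: bucketName },
    };
  }

  // Nothing is created here: the result is plain data for a deployment engine.
  synth(): ToyTemplate {
    return { Resources: this.resources };
  }
}

// Loops and conditionals run at synthesis time, not at deployment time.
function buildStack(environments: string[]): ToyTemplate {
  const stack = new ToyStack();
  for (const env of environments) {
    stack.addBucket(`Logs${env}`, `logs-${env.toLowerCase()}`);
  }
  return stack.synth();
}

const template = buildStack(["Dev", "Prod"]);
console.log(JSON.stringify(template, null, 2));
```

CloudFormation tracks each resource by its logical ID, so an innocent looking refactor that changes an ID (here, renaming the `Logs` prefix) generally shows up as a brand new resource plus the removal of the old one.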

CloudFormation - the elephant in the room

And oh, the prospect of working with CloudFormation is not something I look forward to. It’s been more than 7 years since I last worked with it, and I’m so glad I no longer need to touch the underlying YAML/JSON files. The configuration files, though, are the smallest concern. Working with the actual AWS service that manages all the infrastructure is the most painful part. Iterating on a change can be super slow, even if you work towards keeping your stacks small and de-coupled. Manually changing most resources will make CloudFormation very angry, and require a bunch of sacrifices to bring the stack back to a healthy state. Even a simple rollback does not guarantee that you’ll get your stack back to a healthy state. And, at times, the reason for a particular apply failing is not super obvious, even if you drill down to the offending resource.

The saving grace in all of this is the ability to foresee what will happen in the infrastructure with your proposed change (similar to Terraform’s plan sub-command/feature). Over time, I learnt to trust the output of the plan, which may be more conservative/stringent at times than what happens in the real world. However, I still have scars from the early days of CloudFormation, that refuse to go away.
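Conceptually, such a preview is a comparison between what is deployed and what the code now describes. The sketch below is a deliberately naive version of that comparison. previewChanges and the Template shape are hypothetical, and the real tooling (CloudFormation change sets, `cdk diff`) knows far more, such as which property changes force a replacement.

```typescript
// Hypothetical, simplified template shape: logical ID -> resource definition.
type Template = { [logicalId: string]: { Type: string; Properties: object } };

interface Preview {
  add: string[];
  remove: string[];
  update: string[];
}

function previewChanges(current: Template, proposed: Template): Preview {
  const preview: Preview = { add: [], remove: [], update: [] };
  for (const id of Object.keys(proposed)) {
    if (!(id in current)) {
      preview.add.push(id);
    } else if (JSON.stringify(current[id]) !== JSON.stringify(proposed[id])) {
      // Naive comparison: sensitive to key order, which is fine for a sketch.
      preview.update.push(id);
    }
  }
  for (const id of Object.keys(current)) {
    if (!(id in proposed)) preview.remove.push(id);
  }
  return preview;
}

const deployed: Template = {
  LogsBucket: { Type: "AWS::S3::Bucket", Properties: { BucketName: "logs" } },
  Queue: { Type: "AWS::SQS::Queue", Properties: { VisibilityTimeout: 30 } },
};
const proposed: Template = {
  AuditBucket: { Type: "AWS::S3::Bucket", Properties: { BucketName: "logs" } },
  Queue: { Type: "AWS::SQS::Queue", Properties: { VisibilityTimeout: 60 } },
};

const preview = previewChanges(deployed, proposed);
// preview.add    → ["AuditBucket"]
// preview.update → ["Queue"]
// preview.remove → ["LogsBucket"]
```

Note how the renamed bucket appears as one addition and one removal, even though its properties did not change. Reading the preview for exactly this kind of surprise is what saved me more than once.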

Nice things

Working with regular programming languages does have a few benefits though: in addition to the more sensible flow control structures (compared to Terraform’s HCL), we can validate user inputs a lot more effectively. The entire standard library of your preferred language is available, and this level of expressiveness just cannot be matched by Terraform (even when you factor in the usage of tools like Terragrunt from the ecosystem).
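As a small illustration of the validation point, here is the kind of check that is just a plain function in TypeScript. ServiceProps, its fields and the limits are all assumptions made up for this sketch; they do not come from CDK or from any real project.

```typescript
// Hypothetical input shape for a service; names and limits are assumptions
// made for this sketch.
interface ServiceProps {
  serviceName: string;
  desiredCount: number;
  environment: "dev" | "staging" | "prod";
}

// Collects every problem instead of stopping at the first one.
function validateServiceProps(props: ServiceProps): string[] {
  const problems: string[] = [];
  if (!/^[a-z][a-z0-9-]{2,30}$/.test(props.serviceName)) {
    problems.push(
      "serviceName must be 3-31 lowercase letters, digits or hyphens, starting with a letter"
    );
  }
  const isWholeNumber = Math.floor(props.desiredCount) === props.desiredCount;
  if (!isWholeNumber || props.desiredCount < 1 || props.desiredCount > 50) {
    problems.push("desiredCount must be a whole number between 1 and 50");
  }
  // A cross-field rule: awkward in many config languages, trivial here.
  if (props.environment === "prod" && props.desiredCount < 2) {
    problems.push("prod services need at least 2 instances");
  }
  return problems;
}

// Fails fast, before any template is generated or deployed.
function assertValid(props: ServiceProps): ServiceProps {
  const problems = validateServiceProps(props);
  if (problems.length > 0) {
    throw new Error(`Invalid service config:\n- ${problems.join("\n- ")}`);
  }
  return props;
}
```

Calling something like assertValid at the top of a construct’s constructor means a typo gets caught in seconds on a laptop, rather than minutes into a deployment.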

All this makes it easier to get started with Infrastructure as Code for backend or generalist engineers, and hence, your team can make good progress with good practices for a little longer, before you bring an infrastructure specialist onboard.

One of the nicest realisations/aha moments of working with CDK was the way IAM policies are managed/generated. Unlike Terraform, other infrastructure frameworks, or working directly with the AWS Web Console, where providing maximum privileges is the default behaviour, CDK makes it super easy to implement the principle of least privilege. The implementation of each AWS resource usually includes grantRead/grantWrite methods, with specific filters. The CDK makes it second nature to create very restrictive policies, and I wish more frameworks/tools worked on improving their story here.
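To show why this pattern nudges you towards least privilege, here is a toy model of a grant method. ToyBucket and ToyRole are hypothetical and are not the CDK classes, and the action list is simplified on purpose; the real library decides which actions its own grant methods emit.

```typescript
// Toy model of the "grant" pattern. These classes are hypothetical and are
// NOT the CDK API; the action list is deliberately simplified.
interface ToyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string[];
}

class ToyRole {
  readonly statements: ToyStatement[] = [];
  constructor(readonly roleName: string) {}
}

class ToyBucket {
  readonly arn: string;

  constructor(bucketName: string) {
    this.arn = `arn:aws:s3:::${bucketName}`;
  }

  // The resource knows its own ARN, so the statement is scoped to this one
  // bucket (and optionally a key pattern) without the caller writing any ARN.
  grantRead(role: ToyRole, keyPattern: string = "*"): void {
    role.statements.push({
      Effect: "Allow",
      Action: ["s3:GetObject", "s3:ListBucket"],
      Resource: [this.arn, `${this.arn}/${keyPattern}`],
    });
  }
}

const reports = new ToyBucket("reports");
const reader = new ToyRole("report-reader");
reports.grantRead(reader, "2024/*");
console.log(JSON.stringify(reader.statements, null, 2));
```

The narrow policy is the shortest thing to type here, while a wildcard policy would take extra effort. That inversion is what I found so pleasant.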

Community?

Although CloudFormation is the primary target of this framework, there are sister projects that can also be used to generate Kubernetes manifests and Terraform configurations. The overall community around all these tools is much smaller though, and will eventually be dependent on the largesse of AWS, given that they all depend on the CloudFormation and CDK teams shipping core features. This makes it a tool for a smaller niche than Terraform, especially if multi-cloud deployments are a consideration.

Final thoughts

After years of working with Terraform, HCL still does not feel like home. CDK was my first ever serious project in TypeScript, and I am sure I wrote code that was not very idiomatic. If I were to pick CDK again, I would really need to figure out a way to get local feedback very quickly, especially for non-trivial infrastructure.

Both these tools have one major limitation. Importing of resources is supported and encouraged, but they are most effective in building out a fresh set of infrastructure, and are very declarative, overwriting any manual changes that may have been made out of band. This works great for large organisations and their cloud infrastructure, where compliance with strict standards is way more valuable. However, in most smaller organisations, one off changes (either to deal with incidents/emergencies, or prototyping of features) are often needed, and should be quick to roll out. In such scenarios, the current IaC paradigm ends up being very limiting, and red-tape-y. I’d venture to say that these tools feel like they’re stuck in the waterfall era of software development. The entire team needs to sit together, design the foundational infrastructure upfront, wait for that to be built out, and then commence their services and frontend work. If prototyping shows that a particular approach is not the right path, tearing down infrastructure, or re-modelling parts of it can feel very cumbersome.

All this makes it frustrating for a practitioner like me, especially if I have to recommend it to a friend who wants to start their SaaS journey. We’ve made a ton of progress in user experience in our toolchains over the past few years, and it sometimes feels like we’re living in the future. But, as a smart person once said, that future seems to be unevenly distributed.

System Initiative, a new toolchain in the infrastructure provisioning space, seems promising and aims to tackle this major hassle of working with IaC. It constantly works with your cloud provider, and keeps track of changes made, even if they may be outside the control/scope of SI itself (in other words, changes made directly from the AWS Web console or CLI). It is still early days, and the demos they roll out every week are impressive. I am looking forward to a newer approach and a better quality of life for all cloud engineers in the near future!

I did not expect to ramble on for so long. I’m glad I had the opportunity to finally try AWS CDK out after hearing about it for years on end from Sathya, and also go write some TypeScript code larger/longer than a one-off CLI script. If you can point me to any other interesting tools in this space that are trying to tackle infrastructure management in a more intuitive way, feel free to ping me on Mastodon or X with your recommendations.

P.S.: Sathya shared an important note while reviewing a draft of this post: the IAM capabilities only exist for the L2 Constructs that CDK provides, and support for new resource types can take some time to work its way through the support chain via CloudFormation. And, although CDK can be used in other languages, TypeScript is the best supported language. The bindings for every other language are generated from the TypeScript code.