From Dumb and Boring YAML to Dumb and Boring YAML: My IaC Experience

For more than three years I've been doing DevOps, involving myself in multiple projects ranging from simple static web applications deployed to CDNs to multi-tenant systems with event-driven architectures. I've always found ways to make my boring job fun: sometimes I experiment with new DevOps tools, other times I write automation for repetitive tasks. One of the major things in every DevOps toolbox is an IaC (Infrastructure as Code) tool.

The experience I've shared in this article is personal and might not be the same for you. Hey, remember: as long as it works for you, don't complain.

Infrastructure as Code

Remember the last time you tried to spin up a virtual machine in some cloud? Now think of multiple virtual machines, each with a different configuration. How do you make this process repeatable, observable, managed, and testable? This is where IaC tools come in. IaC tools let you define your infrastructure declaratively in code and provide tooling to preview, apply, and destroy changes in your cloud. There are multiple tools and technologies for defining and managing IaC code; Terraform is a popular example.

Infrastructure as code (IaC) is the ability to provision and support your computing infrastructure using code instead of manual processes and settings. Any application environment requires many infrastructure components like operating systems, database connections, and storage. Developers have to regularly set up, update, and maintain the infrastructure to develop, test, and deploy applications.

- Amazon Web Services (AWS)

Square 1: AWS CloudFormation

In the beginning, I started with AWS CloudFormation. CloudFormation is an IaC tool that lets you define your infrastructure in YAML or JSON and deploy it to AWS.

AWS CloudFormation is a service that helps you model and set up your AWS resources so that you can spend less time managing those resources and more time focusing on your applications that run in AWS. You create a template that describes all the AWS resources that you want (like Amazon EC2 instances or Amazon RDS DB instances), and CloudFormation takes care of provisioning and configuring those resources for you. You don't need to individually create and configure AWS resources and figure out what's dependent on what; CloudFormation handles that.

- Amazon Web Services (AWS)

A CloudFormation template looks like this when written in YAML.

AWSTemplateFormatVersion: '2010-09-09'
Description: Create an S3 Bucket

Resources:
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-example-s3-bucket
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true

Outputs:
  BucketName:
    Description: Name of the S3 bucket
    Value: !Ref S3Bucket

The first project I worked on was a legacy AWS CloudFormation project. It was a relatively simple three-tier application with some ETL processing involved. The codebase was a single 3000+ LOC YAML document, and changing it felt tough and cumbersome. These are the major reasons I felt it was so lacking:

- I didn't use cfn-lint at the time, so the only way to make sure a YAML change was valid was to deploy it.
- There were a lot of repetitive patterns in the code that I thought should be composable.
- The configuration was too static, and it felt boring. Some variables and such were involved, but I didn't like the static nature of the YAML file: no way to calculate CIDR ranges, no string manipulation, and so on.

Square 2: Terraform

Then I heard of HashiCorp Terraform. Terraform configurations are written in HCL (HashiCorp Configuration Language), a DSL created by HashiCorp, the company behind Terraform. Terraform felt great; it felt like I was programming the infrastructure. And hearing all the great benefits, like being able to deploy AWS, Azure, or any other resources from the same codebase, made it feel like a clear winner to me.

This is how the above template looks in Terraform.

provider "aws" {
  region = "us-west-2"
}

resource "aws_s3_bucket" "example" {
  bucket = "my-example-s3-bucket"
  acl    = "private"

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

output "bucket_name" {
  description = "Name of the S3 bucket"
  value       = aws_s3_bucket.example.id
}

The language is great: you can manipulate strings with a wide variety of functions, there are multiple providers, and the plugin system is really extensible. You can write clean and neat IaC with it. This really overcame the challenges I'd felt previously.
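
For instance, here's a minimal sketch of the kind of computed values I had missed in plain YAML (the ranges and names are made up for illustration):

locals {
  vpc_cidr = "10.0.0.0/16"

  # Carve four /24 subnets out of the VPC range -- CIDR math that static YAML can't do.
  subnet_cidrs = [for i in range(4) : cidrsubnet(local.vpc_cidr, 8, i)]

  # String manipulation with built-in functions.
  bucket_name = lower(format("%s-%s-assets", "Acme", "Prod"))
}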

The tooling is extremely good. For example, there are language servers that help you write the code. Terraform itself provides many sane functions, and it feels like programming, in some sense. The module system is a great abstraction mechanism: I can create a module called static-website that provisions the CDN and storage buckets, then reuse that module in multiple places, as in the sketch below. The configuration is now dynamic, to some extent.
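
A hedged sketch of that reuse, with the module path and its variables assumed for illustration:

# One module encapsulates the CDN and storage bucket for a static site.
module "docs_site" {
  source      = "./modules/static-website"
  domain_name = "docs.example.com"
}

module "marketing_site" {
  source      = "./modules/static-website"
  domain_name = "www.example.com"
}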

Terraform kept providing me the correct amount of abstraction, until it didn't. I started abusing Terraform's programmable nature to make everything dynamic. I wanted to write the master Terraform code that I could plug into any project, change some variables, and have it work.

This is a snippet of the label module from my template repository for Terraform codebases.

locals {
  name               = lower(replace(coalesce(var.name, var.context.name, local.defaults.sentinel), local.regex_replace_chars, local.defaults.replacement))
  namespace          = lower(replace(coalesce(var.namespace, var.context.namespace, local.defaults.sentinel), local.regex_replace_chars, local.defaults.replacement))
  environment        = lower(replace(coalesce(var.environment, var.context.environment, local.defaults.sentinel), local.regex_replace_chars, local.defaults.replacement))
  stage              = lower(replace(coalesce(var.stage, var.context.stage, local.defaults.sentinel), local.regex_replace_chars, local.defaults.replacement))
  delimiter          = coalesce(var.delimiter, var.context.delimiter, local.defaults.delimiter)
  label_order        = length(var.label_order) > 0 ? var.label_order : (length(var.context.label_order) > 0 ? var.context.label_order : local.defaults.label_order)
  additional_tag_map = merge(var.context.additional_tag_map, var.additional_tag_map)
}

Github: regmicmahesh/infrastructure-backend (github.com)
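
For illustration, consuming a label module like that usually looks something like this (the module path and the id and tags outputs are assumed here, in the usual null-label style):

module "label" {
  source      = "./modules/label"
  namespace   = "acme"
  environment = "prod"
  stage       = "blue"
  name        = "api"
}

resource "aws_s3_bucket" "artifacts" {
  bucket = module.label.id # e.g. "acme-prod-blue-api", assembled from label_order
  tags   = module.label.tags
}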

Slowly, Terraform started feeling limited. Although the DSL provided some level of dynamism in the codebase, it's a DSL in the end, not a programming language, and there are limits to what it can do. Looking back, I don't think Terraform solved the problems I had faced with CloudFormation.

You can create multiple resources using count in Terraform, but it introduces its own sort of issues. count brings a list data structure into the scene to track the resources by index: if you remove an item at some index, every element after it shifts down and gets redeployed. There is also drift between the Terraform AWS provider and the official AWS APIs; some things are implemented in the aws provider in its own way. And there is no proper concept of conditional resources in Terraform, like the Conditions: block in CloudFormation; the usual workaround is a count of 0 or 1.
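
A quick sketch of the count pitfall (the variable name and VPC reference are made up for illustration):

variable "subnet_cidrs" {
  type    = list(string)
  default = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

resource "aws_subnet" "this" {
  count      = length(var.subnet_cidrs)
  vpc_id     = aws_vpc.main.id # assumes a VPC named "main" defined elsewhere
  cidr_block = var.subnet_cidrs[count.index]
}

# Removing "10.0.1.0/24" shifts the remaining CIDRs down one index, so
# Terraform plans to destroy and recreate both surviving subnets.
# Keying resources with for_each on a map avoids the reindexing.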

Square 3: Pulumi

Then I heard of this new IaC tool called Pulumi and decided to invest some time in it. Pulumi felt great: you can write your IaC in any of the supported programming languages and have it deployed to the cloud. I enjoyed writing Pulumi code in Go, and Pulumi itself is written in Go, so the support was great!

I wrote custom stacks, experimented with different architecture patterns, and built custom components and such.

There's this Automation API in Pulumi that lets you drive deployments programmatically, without the usual Pulumi CLI workflow, and ship your infrastructure code as a single binary (a small sketch follows the example below). Doesn't it sound cool to have a binary that you can store in some artifact store, run, and have your infrastructure converge to the desired state?

This is how the above code looks in Pulumi (Go).

package main

import (
    "github.com/pulumi/pulumi-aws/sdk/v5/go/aws/s3"
    "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
    pulumi.Run(func(ctx *pulumi.Context) error {
        // Create the bucket; "private" is also the default ACL.
        bucket, err := s3.NewBucket(ctx, "example-bucket", &s3.BucketArgs{
            Bucket: pulumi.String("my-example-s3-bucket"),
            Acl:    pulumi.String("private"),
        })
        if err != nil {
            return err
        }

        // Block all public access, mirroring the CloudFormation and Terraform examples.
        _, err = s3.NewBucketPublicAccessBlock(ctx, "example-bucket-pab", &s3.BucketPublicAccessBlockArgs{
            Bucket:                bucket.ID(),
            BlockPublicAcls:       pulumi.Bool(true),
            BlockPublicPolicy:     pulumi.Bool(true),
            IgnorePublicAcls:      pulumi.Bool(true),
            RestrictPublicBuckets: pulumi.Bool(true),
        })
        if err != nil {
            return err
        }

        ctx.Export("bucketName", bucket.ID())
        return nil
    })
}
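
And here is a minimal sketch of that Automation API idea, assuming pulumi-aws v5 and the auto package from the v3 Go SDK (the stack and project names are made up, and the Pulumi engine still runs under the hood):

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/pulumi/pulumi-aws/sdk/v5/go/aws/s3"
    "github.com/pulumi/pulumi/sdk/v3/go/auto"
    "github.com/pulumi/pulumi/sdk/v3/go/auto/optup"
    "github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
    ctx := context.Background()

    // The same bucket program as above, declared inline instead of as a CLI project.
    program := func(pctx *pulumi.Context) error {
        bucket, err := s3.NewBucket(pctx, "example-bucket", &s3.BucketArgs{
            Bucket: pulumi.String("my-example-s3-bucket"),
        })
        if err != nil {
            return err
        }
        pctx.Export("bucketName", bucket.ID())
        return nil
    }

    // Create or select the stack, then run the equivalent of `pulumi up`.
    stack, err := auto.UpsertStackInlineSource(ctx, "dev", "example-project", program)
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }

    if _, err := stack.Up(ctx, optup.ProgressStreams(os.Stdout)); err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
}

Build that, and you get one binary that converges the dev stack to the desired state every time it runs.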

You can find a blog about my Pulumi architecture here as well.

Structuring Infrastructure as Code with a Layered Approach (maheshchandraregmi.com.np)

I thought developers would also get involved in managing the IaC if it were written in the programming language they already use. But it didn't quite work like that: they had no interest in writing infrastructure code, and even when they did, deploying it was just too risky for them.

Then I didn't touch the Pulumi code for a month. When I opened the codebase again the next month, it all started breaking apart: the dependencies were outdated, and vulnerabilities had been reported in the packages I used.

The code felt too magical to me, and it took a few hours just to wrap my head around how I had done things. The dynamic nature of the code quickly became a downside: there were just too many moving pieces, and changing anything had unintended consequences.

Back to Square 1: AWS CloudFormation

And now I'm back on AWS CloudFormation. I think configuration and DevOps processes should be as dumb as possible: clear at a glance, with no magic involved. They should be lean, because every extra line of code or extra tool is tech debt.

I started using CloudFormation's nested stacks and custom resources as well. I don't need loops, because I don't need to make the infrastructure dynamic. Do I need to add a new subnet in the future? Fine, I'll modify the code to add a new subnet, as in the sketch below. For me, or for any DevOps engineer who joins in the future, that is easier to see than wrapping a head around looping logic.
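
As a rough sketch of what I mean (the VPC reference, CIDRs, and availability zones are made up), adding a subnet is just copying a block and changing two values:

Resources:
  SubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc # assumes a Vpc resource defined elsewhere in the template
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: us-west-2a

  SubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: us-west-2b

  # Need a third subnet? Copy a block, change the CIDR and the AZ, deploy.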

We can also leverage features like StackSets, drift detection, and automatic rollbacks with CloudFormation. The template can also be viewed as a diagram in Application Manager and managed entirely through AWS.

These are the templates I use in projects nowadays.

regmicmahesh/cloudformation-stack-templates (github.com)

Thank you for reading!