Approaching Terraform · 2020-02-24
Approaching Terraform: From an Organizational Point of View
February 2020
Index
01 / Who am I?
02 / Glovo 2.5y ago
03 / Terraform 101
04 / Glovo now
05 / Learnings
Who am I?
Dani Torramilans
Senior Backend Engineer
> Joined at the end of 2017
> Mostly platform, infra, performance & observability work
> Currently working on the microservices migration at Glovo
Tech @ 2017
Glovo Tech 2017
Team size: < 20
Instances: < 30
IaC: Mostly none, some CloudFormation
Infra tests: 0
Infra requests: Few per week
Uptime: No idea. We survived.
Tracked infra: < 10 %
IaC monitors: 0 %
Infra team: 1 engineer (Roberto)
Common problems for Infra 2017
Permission requests
Problem: “Hey, can you give X developer the key for Y machine?” - Once a month
Solution: Sure, here you go <manually gives ssh key>
Infra requests
Problem: “Hey, we need this new machine for a partners project” - Once a month
Solution: Sure, let me spin it up
Maintenance fires
Problem: X certificate expired because it was created manually.
Solution: Sure, let me recreate it (or put it in AWS Certs)
Deployment issues
Problem: “Cluster X failed to read from Y and we’re now down!”
Solution: Sure, let me fix that policy
Coarse monitoring
Problem: “This endpoint/thing has been down for three days and no one noticed”
Solution: 🤷
Accountability
Problem: Who added this machine? Who is responsible for this cost?
Solution: 🤷
Terraform 101
What is Terraform anyway?
A way of defining, instantiating and managing infrastructure & infra configuration. Features include:
● Infrastructure as Code: You write your infra definition as text files in HCL (HashiCorp Configuration Language). These definitions represent your resources.
● Execution plans: When adding or changing infrastructure, you update the corresponding HCL code and ask Terraform to plan the changes. Then you can choose to apply them.
● Multi-provider: You can declare resources from multiple providers, not only AWS: Terraform Cloud, DataDog, PagerDuty, Sentry, GCP, Azure, etc. can all be managed through Terraform configurations.
● State mgmt: Terraform keeps track of the state of all resources it knows about, makes sure different engineers don’t try to modify them in parallel (state locking), and notices any external changes (drift).
Features that made us decide on Terraform
SENTINEL POLICIES: Lets us enforce arbitrary, custom-written policies on all new infra.
MANY PROVIDERS: Not only AWS is supported; you can manage DataDog, PagerDuty & others!
SYNTAX: Declarative language, easier than CloudFormation. Loop constructs & conditionals built in. Very flexible.
COMMUNITY: Popular and maintained, with other active projects around it such as terraforming.
TOOLING: Terraform Cloud + GitHub integration makes the workflow easy, out of the box.
EASY ADOPTION: Quick to learn. Clear & up-to-date docs, very developer-centric.
Terraform 101
What’s the hello world code for Terraform? The basic blocks are providers, resources and data sources.
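The code from this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the three basic blocks, assuming the AWS provider; the region, the AMI filter and the instance name are illustrative choices, not taken from the talk.

```hcl
# Provider: tells Terraform which API to talk to.
# The region is an assumption made for this example.
provider "aws" {
  region = "eu-west-1"
}

# Data source: looks up something that already exists, here the most
# recent Ubuntu 18.04 image published by Canonical.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }
}

# Resource: something Terraform creates and manages.
resource "aws_instance" "hello" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    Name = "hello-world" # hypothetical name
  }
}
```

Running terraform init, then terraform plan, then terraform apply against a folder containing this file would show and then create the instance.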
Terraform 101
Additional features of the language include looping constructs, local variables and conditionals.
Creating many servers with a specific name each
Creating a network with many subnets, and map transformations
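The two examples captioned above were images and are missing from the transcript. A rough sketch of what they could look like follows, assuming Terraform 0.12.6 or later (for for_each on resources) and the AWS provider; every name, CIDR range and instance size is hypothetical.

```hcl
variable "ami_id" {
  type = string
}

variable "region" {
  type    = string
  default = "eu-west-1"
}

# Local values: named expressions reused across the configuration.
locals {
  server_names = ["api", "worker", "admin"]

  # Availability zone suffix => subnet CIDR
  subnets = {
    a = "10.0.1.0/24"
    b = "10.0.2.0/24"
  }
}

# Looping: one server per name, each tagged with its own name.
resource "aws_instance" "server" {
  for_each = toset(local.server_names)

  ami = var.ami_id
  # Conditional: only the worker gets a larger machine.
  instance_type = each.key == "worker" ? "t3.large" : "t3.micro"

  tags = {
    Name = each.key
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Looping over a map: one subnet per entry.
resource "aws_subnet" "this" {
  for_each = local.subnets

  vpc_id            = aws_vpc.main.id
  cidr_block        = each.value
  availability_zone = "${var.region}${each.key}"
}

# Map transformation: turn the subnet resources into zone => subnet ID.
output "subnet_ids_by_zone" {
  value = { for subnet in aws_subnet.this : subnet.availability_zone => subnet.id }
}
```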
Terraform 101
The biggest feature is modules, which allow you to encapsulate a set of resources, parametrize them and reuse them elsewhere. A module just needs to be a separate folder! Extra points if you publish it and other people reuse it.
Specify some parameters
Define your resources
Specify some outputs
Use the module!
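The four steps above can be sketched as follows. This is an illustrative layout rather than the module shown in the talk: the folder modules/service_bucket, the variable names and the bucket name are all assumptions, and the file-name comments mark where each piece would live.

```hcl
# --- modules/service_bucket/variables.tf: specify some parameters ---
variable "name" {
  type        = string
  description = "Bucket name"
}

variable "team" {
  type        = string
  description = "Owning team, used for tagging"
}

# --- modules/service_bucket/main.tf: define your resources ---
resource "aws_s3_bucket" "this" {
  bucket = var.name

  tags = {
    Team = var.team
  }
}

# --- modules/service_bucket/outputs.tf: specify some outputs ---
output "bucket_arn" {
  value = aws_s3_bucket.this.arn
}

# --- main.tf in the root folder: use the module! ---
module "orders_bucket" {
  source = "./modules/service_bucket"

  name = "example-orders-data"
  team = "orders"
}
```

Other resources in the root folder can then refer to module.orders_bucket.bucket_arn without knowing how the bucket is built.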
Terraform 101
An added plus of using Terraform Cloud is that it runs Sentinel policies for you. Sentinel policies let you enforce requirements on infra changes, which helps us implement a variety of policies as code.
How we work with Terraform
Use terraform plan with the TF CLI, with the proper AWS keys set up.
Just submit a PR to the relevant infra repo. We use Terraform Cloud (TFC). The Platform team manages Terraform Cloud and creates and deletes the projects. TFC listens to all GitHub infra repos, applying changes and commenting on PRs. You can still run terraform plan locally to get a speculative plan without having to commit.
[Workflow diagram. Roles: Devs; SRE & Sys Eng. Steps: Commit, Plan, Apply changes, Review, Approve applies & manage]
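For the local speculative plan mentioned above, the repo needs to point at its Terraform Cloud workspace. A minimal sketch using the remote backend is shown below; the organization and workspace names are placeholders, not Glovo's actual setup.

```hcl
terraform {
  backend "remote" {
    hostname     = "app.terraform.io"
    organization = "example-org" # placeholder organization

    workspaces {
      name = "example-service-infra" # placeholder workspace
    }
  }
}
```

With this in place and a Terraform Cloud API token configured locally, terraform plan runs remotely against the workspace state and streams the result back, without changing anything.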
Tech @ 2020
Metric | Glovo Tech 2017 | Glovo Tech 2020
Team size | < 20 | 150, aiming for 300
Instances | < 30 | > 300 avg
IaC | Mostly none, some CloudFormation | Mostly on Terraform, still some 100 CF stacks
Infra tests | 0 | > 0
Infra changes | Few per month | Several per week
Uptime | No idea. We survived. | 99.5% - 99.9% weekly
IaC infra | < 10 % | > 60%
IaC monitors | 0 % | > 70%
Infra team | 1 engineer (Roberto) | 7 engineers Infra, 3 engineers BTD
Common problems for Infra 2020
Permission requests
Problem: “Hey, can you give X developer the permission for Y machine?” - Multiple times a day
Solution: Role based auth. Roles represented in TF. Open a PR :)
Infra requests
Problem: “Hey, we need this new machine for a partners project” - Multiple times a month
Solution: This is your AWS account and this is your TF repo. Open a PR :)
Maintenance fires
Problem: “This machine is not big enough and I don’t have permissions to change it in AWS, now it fails!”
Solution: Here is the TF code. Open a PR :)
Deployment issues
Problem: “X service can’t read from Y S3 bucket!”
Solution: This is the TF repo for X service. Open a PR :)
Coarse monitoring
Problem: “This new service doesn’t have any monitoring yet”
Solution: You can clone the TF observability module in this repo. Open a PR :)
Accountability
Problem: Who added this machine? Who is responsible for this cost?
Solution: Let me just do tag-based grouping of the infra and estimate cost. Or just look at committers in the TF repo
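To make the "open a PR" answer concrete, here is a hedged sketch of what a PR fixing the "X service can’t read from Y S3 bucket" case might contain. It assumes the service already runs under an IAM role; the variable names and the policy name are hypothetical.

```hcl
variable "service_role_name" {
  type        = string
  description = "Name of the IAM role the service runs as (assumed to exist)"
}

variable "bucket_arn" {
  type        = string
  description = "ARN of the bucket the service needs to read"
}

# Read-only access: list the bucket and fetch its objects.
data "aws_iam_policy_document" "read_bucket" {
  statement {
    actions   = ["s3:ListBucket"]
    resources = [var.bucket_arn]
  }

  statement {
    actions   = ["s3:GetObject"]
    resources = ["${var.bucket_arn}/*"]
  }
}

# Attach the policy inline to the service's role.
resource "aws_iam_role_policy" "service_reads_bucket" {
  name   = "read-bucket"
  role   = var.service_role_name
  policy = data.aws_iam_policy_document.read_bucket.json
}
```

Because the permission lives in the repo, the PR review doubles as the approval step, and the commit history records who granted what.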
Common problems for Infra 2020 (remix)
Security concerns
Problem: How can we limit changes to IAM permissions?
Solution: Just build a Sentinel policy
More standards
Problem: Tagging conventions, naming conventions, networking best practices
Solution: Write Sentinel policies for compliance, blocking PRs if needed.
Preemptive cost mgmt
Problem: Data Eng wants the biggest machine in all of AWS
Solution: Sentinel policy requiring review & approval, and informing of cost/month
Disaster recovery
Problem: “We need a disaster recovery plan”
Solution: Start reinstantiating your TF repos and upload your backups.
Infra standards
Problem: “How should I build my microservice? What is an EC2?”
Solution: Just use the microservice TF module. This is the repo. Open a PR :)
Service config mgmt
Problem: How are the PagerDuty schedules set up?
Solution: Look at this repo. If you want to, open a PR :)
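As an illustration of service config living in a repo, a PagerDuty schedule could look roughly like the sketch below. It assumes the PagerDuty provider is configured with an API token elsewhere; the emails, schedule name, time zone and rotation length are made up for the example.

```hcl
# Look up existing PagerDuty users by email (hypothetical addresses).
data "pagerduty_user" "primary" {
  email = "primary.oncall@example.com"
}

data "pagerduty_user" "secondary" {
  email = "secondary.oncall@example.com"
}

resource "pagerduty_schedule" "platform_oncall" {
  name      = "Platform on-call"
  time_zone = "Europe/Madrid"

  layer {
    name                         = "Weekly rotation"
    start                        = "2020-03-02T09:00:00+01:00"
    rotation_virtual_start       = "2020-03-02T09:00:00+01:00"
    rotation_turn_length_seconds = 604800 # 7 days

    users = [
      data.pagerduty_user.primary.id,
      data.pagerduty_user.secondary.id,
    ]
  }
}
```

Anyone wondering how the rotation is set up can read this file, and changing it is a PR like any other.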
Learnings
● Standardize processes, using tooling as much as possible. Automated tools > docs.
● Make knowledge sharing and workshops an integral part of a tech rollout.
● Aim to help people build things rather than be the AWS police.
● Give people the freedom to PR anything rather than having to wait for you.
● Automate as many policies and checks as you can; prioritize by security concerns and time consumed.
● Make teams own their infrastructure end to end. You’re here to help them build, not to babysit.
● Being opinionated on infra is not bad.
● Have a clear boundary between dynamic deployments and static infrastructure.
● Abstract from real examples.
● Know which problems you are trying to solve, do not blindly adopt (cc @Joan Martínez)
Thanks!