Refactoring At Scale 🛠️
Source: Spotify 
Problem
- The number of applications I manage is growing much faster than my engineering team.
- A large portion of my engineers’ time goes into maintenance work such as updating language versions, updating dependency versions, and performing infrastructure migrations. This leaves less time for feature development.
- When critical vulnerabilities emerge, my entire team has to stop feature work and scramble to patch every repository, which is an inefficient use of resources.
- Application migration and refactoring work is frequent and hard to scale because application projects are becoming increasingly inconsistent with one another.
Examples
- Spring Boot upgrades and the associated release train (Actuator, metrics, circuit breaker, etc. introducing breaking changes)
- Migrating from self-managed Kubernetes to a managed Kubernetes service such as EKS.
- Deprecating a build tool, or requiring all projects to use a new minimum language version.
- Moving APM (Application Performance Monitoring) to a new SaaS provider.
- Migrating from hostname-based to path-based routing.
Goals
- Consistency
- Easier version management
- Reliable patches, backed by a variety of tests and good test coverage
- An end-to-end automated process, with no manual validation required for production deployments
- Infrastructure migrations turned into transparent or trivial tasks for engineering teams
Requirements
Project
- MUST let versions propagate top-down, without any application overriding what the shared library dictates.
Platform
- MUST be able to do a gradual rollout of changes to mitigate risks
- MUST provide a dashboard for scripts running against dozens, hundreds, or thousands of repositories, to keep visibility on failing repositories.
- MUST provide a versioned set of secured Docker images and infrastructure modules
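The gradual rollout and dashboard requirements fit together: a change is applied to repositories in waves of increasing size, the rollout halts when a wave's failure rate exceeds a threshold, and the per-repository status is what a dashboard would display. This is a minimal sketch; apply_change is a hypothetical callback (in practice it might open a PR and wait for CI), and the wave sizes and 10% threshold are arbitrary illustrative choices.

```python
from typing import Callable


def plan_waves(repos: list, sizes: list) -> list:
    """Split repos into waves of the given sizes; leftover repos form a final wave."""
    waves, start = [], 0
    for size in sizes:
        if size <= 0 or start >= len(repos):
            continue  # ignore empty waves and stop once every repo is planned
        waves.append(repos[start:start + size])
        start += size
    if start < len(repos):
        waves.append(repos[start:])
    return waves


def rollout(repos: list, apply_change: Callable[[str], bool], sizes: list,
            max_failure_rate: float = 0.1) -> dict:
    """Apply a change wave by wave and stop if a wave fails too often.

    Returns a status per repository: "ok", "failed" or "skipped".
    """
    status = {repo: "skipped" for repo in repos}
    for wave in plan_waves(repos, sizes):
        failures = 0
        for repo in wave:
            ok = apply_change(repo)
            status[repo] = "ok" if ok else "failed"
            if not ok:
                failures += 1
        if failures / len(wave) > max_failure_rate:
            break  # halt the rollout; remaining repos stay "skipped"
    return status


def summarize(status: dict) -> dict:
    """Count repositories per status, e.g. for a dashboard header."""
    counts = {}
    for state in status.values():
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    repos = [f"repo-{i}" for i in range(10)]
    # Hypothetical change that fails only on repo-3.
    result = rollout(repos, lambda repo: repo != "repo-3", sizes=[1, 3])
    print(summarize(result))
```

Here the first wave (repo-0) succeeds, the second wave (repo-1 to repo-3) has one failure out of three, which exceeds the threshold, so the last six repositories are never touched.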
Company
- MUST enforce coding practices requiring parseable configuration, for example:
  - Application configuration (e.g. YAML-based)
  - Build tool dependency references (e.g. TOML-based)
  - IaC (e.g. Terraform modules with .tfvars.json configuration)
- MUST enforce testing standards to unlock continuous deployment
- MUST provide a stack of wrapped application infrastructure components that prevents their implementation details from leaking into project codebases
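Parseable configuration is what makes fleet-wide edits scriptable. As an illustration, assume each project sets a hypothetical base_image_tag variable in a .tfvars.json file; a script can then update it with a JSON round-trip instead of fragile text substitution. The function name and variable are assumptions, not part of any existing tool.

```python
import json


def bump_tfvars(content: str, key: str, new_value: str) -> tuple:
    """Set `key` to `new_value` in a .tfvars.json document.

    Returns (new_text, changed) so a caller can skip opening a PR for
    repositories that are already up to date.
    """
    data = json.loads(content)
    if key not in data:
        # The repository does not follow the convention: report it rather than guess.
        raise KeyError(f"{key!r} not found in .tfvars.json")
    if data[key] == new_value:
        return content, False
    data[key] = new_value
    # Stable 2-space formatting keeps the resulting PR diff minimal and reviewable.
    return json.dumps(data, indent=2) + "\n", True


if __name__ == "__main__":
    original = '{\n  "service_name": "checkout",\n  "base_image_tag": "java21-1.3.0"\n}\n'
    updated, changed = bump_tfvars(original, "base_image_tag", "java21-1.4.0")
    print(changed)
    print(updated)
```

Raising on a missing key, rather than adding it, turns non-conforming repositories into visible failures that the rollout dashboard can surface.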
Existing Tools
- Application library or framework encapsulating repetitive application infrastructure (telemetry, IO, …)
- Descriptive files (JSON, YAML, TOML, …) that can be parsed, and scripts that automate PRs
- IaC tool describing resources as JSON
- A build tool like Gradle, with its versions managed in a TOML version catalog
- YAML-based application configuration
Refactoring tools
- OpenRewrite 
- Moderne.io (SaaS built on top of OpenRewrite)
Version Management
Going un-versioned
As described above, enabling an entire organization to manage versions is significant work, and most organizations only handle part of it. Smaller teams can pragmatically decide to apply a patch deliberately, without guardrails.
For example, a Kubernetes team may require a new apiVersion for a specific Kind. You can make the call knowing exactly the scope of the change: since it is a build-time change, you can apply it transparently to everyone by updating the Helm templates everyone sources from, after testing and review.
In general, this should remain the exception: the goal is to preserve reliability and deterministic outputs as much as possible.
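For the apiVersion case, the change can be scripted once against the shared templates. A minimal sketch, assuming a line-based rewrite is acceptable (Helm templates are often not valid YAML before rendering, so no YAML parser is used) and using the HorizontalPodAutoscaler move from autoscaling/v2beta2 to autoscaling/v2 purely as an illustration. The function name is hypothetical, only top-level apiVersion lines are handled, and whether the rest of a spec needs changes must still be checked per Kind.

```python
import re


def migrate_api_version(manifests: str, kind: str, old: str, new: str) -> str:
    """Rewrite `apiVersion: old` to `apiVersion: new` in every document of `kind`.

    Documents are split on `---` separator lines; template directives such as
    {{ .Release.Name }} elsewhere in the file are left untouched.
    """
    documents = re.split(r"(?m)^---[ \t]*$", manifests)
    # [ \t]* instead of \s* so a match never swallows the line's newline.
    kind_line = re.compile(rf"(?m)^kind:[ \t]*{re.escape(kind)}[ \t]*$")
    api_line = re.compile(rf"(?m)^apiVersion:[ \t]*{re.escape(old)}[ \t]*$")
    migrated = [
        api_line.sub(f"apiVersion: {new}", doc) if kind_line.search(doc) else doc
        for doc in documents
    ]
    return "---".join(migrated)


if __name__ == "__main__":
    template = (
        "apiVersion: autoscaling/v2beta2\n"
        "kind: HorizontalPodAutoscaler\n"
        "metadata:\n"
        "  name: {{ .Release.Name }}\n"
        "---\n"
        "apiVersion: apps/v1\n"
        "kind: Deployment\n"
    )
    print(migrate_api_version(template, "HorizontalPodAutoscaler",
                              "autoscaling/v2beta2", "autoscaling/v2"))
```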