Refactoring At Scale 🛠️

Figure: Spotify growth. Source: Spotify.

Problem

The number of applications I manage is growing much faster than my engineering team.
A large portion of my engineers’ time is spent on maintenance activities like updating language versions, dependency versions, and infrastructure migrations. This leaves less time for feature development.
When critical vulnerabilities emerge, my entire team has to stop feature work and scramble to patch every repository, which is an inefficient use of resources.
Application migration and refactoring work is frequent and difficult to scale because of a growing inconsistency across application projects.

Examples

Spring Boot and its release train (Actuator, metrics, circuit breaker, etc.) introducing breaking changes.
Migrating from self-managed Kubernetes to a managed Kubernetes service such as EKS.
Deprecating a build tool or requiring all projects to use a new minimum language version.
Moving to a new SaaS for APM (Application Performance Monitoring) functions
Migrating from hostname-based to path-based routing.

Goals

Consistency
Ease of version management
Reliability of the applied patches, backed by a variety of tests and their coverage
End-to-end automated process, with no manual validation required for a production deployment
Infrastructure migrations turned into transparent or trivial tasks for engineering teams

Requirements

Project

MUST let versions be propagated top-down, without any application overriding what the library dictates.

Platform

MUST be able to do a gradual rollout of changes to mitigate risks
MUST have a dashboard for scripts running against dozens, hundreds, or thousands of repositories, to preserve visibility into failing repositories.
MUST provide a versioned set of secured Docker images and infrastructure modules
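The first two platform requirements can be sketched as a rollout in growing waves that stops when a wave fails too often, plus a list of failing repositories to feed a dashboard. This is a minimal illustration, not a real platform: plan_waves, roll_out, failing_repos, the wave sizes, and the failure threshold are all assumptions, and apply_change stands in for whatever script opens the PR or applies the patch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    repo: str
    ok: bool


def plan_waves(repos: list[str], first_wave: int = 1, growth: int = 3) -> list[list[str]]:
    """Split repos into waves whose size grows geometrically (1, 3, 9, ...).

    Assumes first_wave >= 1 and growth >= 1.
    """
    waves: list[list[str]] = []
    start, size = 0, first_wave
    while start < len(repos):
        waves.append(repos[start:start + size])
        start += size
        size *= growth
    return waves


def roll_out(
    repos: list[str],
    apply_change: Callable[[str], bool],
    max_failure_rate: float = 0.2,
) -> list[Result]:
    """Apply the change wave by wave; stop after a wave that fails too often."""
    results: list[Result] = []
    for wave in plan_waves(repos):
        wave_results = [Result(repo, apply_change(repo)) for repo in wave]
        results.extend(wave_results)
        failures = sum(not r.ok for r in wave_results)
        if failures / len(wave_results) > max_failure_rate:
            break  # later waves are never touched
    return results


def failing_repos(results: list[Result]) -> list[str]:
    """Repositories a dashboard would flag for attention."""
    return [r.repo for r in results if not r.ok]


if __name__ == "__main__":
    repos = [f"repo-{i}" for i in range(20)]
    # Pretend the change breaks exactly one repository.
    results = roll_out(repos, lambda repo: repo != "repo-2")
    print(f"attempted {len(results)} of {len(repos)}; failing: {failing_repos(results)}")
```

Starting with a single repository keeps the blast radius small, and the geometric growth keeps the total number of waves low even for thousands of repositories.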

Company

MUST enforce coding practices that require parseable configuration, for example:
- Application configuration (e.g. YAML-based)
- Build tool referencing dependencies (e.g. TOML-based)
- IaC (e.g. Terraform modules with .tfvars.json configuration)
MUST enforce testing standards to unlock continuous deployment
MUST provide a stack of wrapped application infrastructure components that prevents their implementation from leaking into project codebases
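Parseable configuration is what makes scripted changes safe: a script can load the file, change one value, and write it back without guessing at the syntax. Here is a minimal sketch for a .tfvars.json file. The set_tfvar helper and the variable names (cluster_version, node_count) are hypothetical.

```python
import json
from typing import Any


def set_tfvar(tfvars_json: str, key: str, value: Any) -> str:
    """Return the .tfvars.json content with `key` set to `value`.

    Raises KeyError when the variable is absent, so that a fleet-wide script
    can report repositories that do not follow the expected layout instead of
    silently adding a variable the module may not declare.
    """
    variables = json.loads(tfvars_json)
    if key not in variables:
        raise KeyError(f"variable {key!r} not found")
    variables[key] = value
    return json.dumps(variables, indent=2) + "\n"


if __name__ == "__main__":
    before = '{"cluster_version": "1.28", "node_count": 3}'
    print(set_tfvar(before, "cluster_version", "1.29"))
```

The same load, modify, and dump pattern applies to YAML or TOML files, given a parser for the format.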

Existing Tools

Application library or framework encapsulating repetitive application infrastructure (telemetry, IO, …)
Descriptive files (JSON, YAML, TOML, …) to enable parsing capabilities, and scripts automating PRs
- IaC tool describing resources as JSON
- Build tool like Gradle having its versions managed as TOML
- YAML-based application configuration

Refactoring tools

OpenRewrite
Moderne.io (SaaS on top of OpenRewrite)

Version Management

Going un-versioned

As described above, enabling an entire organization to manage versions is significant work, and most organizations handle only part of it. Smaller teams can make the pragmatic, deliberate choice to apply a patch without guardrails.

For example, a Kubernetes team may require a new apiVersion for a specific Kind. You can make that call when you know the exact scope of the change: because it is a build-time change, you can apply it transparently to everyone, after testing and review, by updating the Helm templates everyone sources from.
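As an illustration of such a narrowly scoped patch, the sketch below rewrites the apiVersion of one Kind in a multi-document manifest. It works on text rather than parsed YAML because Helm templates contain template directives that a YAML parser would reject. The bump_api_version helper is hypothetical, and the CronJob move from batch/v1beta1 to batch/v1 is used only as an example; a real migration should also check whether the schema changed between versions.

```python
import re


def bump_api_version(manifest: str, kind: str, old: str, new: str) -> str:
    """Rewrite top-level `apiVersion: old` to `new`, but only in the
    documents of `manifest` whose top-level `kind` matches."""
    kind_re = re.compile(rf"^kind:[ \t]*{re.escape(kind)}[ \t]*$", re.MULTILINE)
    api_re = re.compile(rf"^apiVersion:[ \t]*{re.escape(old)}[ \t]*$", re.MULTILINE)
    # Documents in a YAML stream are separated by a line containing only `---`.
    documents = re.split(r"^---[ \t]*$", manifest, flags=re.MULTILINE)
    updated = [
        api_re.sub(lambda _: f"apiVersion: {new}", doc) if kind_re.search(doc) else doc
        for doc in documents
    ]
    return "---".join(updated)


if __name__ == "__main__":
    manifest = """\
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: settings
"""
    print(bump_api_version(manifest, "CronJob", "batch/v1beta1", "batch/v1"))
```

Anchoring both patterns at the start of a line limits the rewrite to top-level fields, so nested keys and other Kinds in the same file are left untouched.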

In general, this should remain the exception: the goal is to preserve reliability and deterministic outputs as much as possible.

Resources

Fleet Management at Spotify