Proposed Solution
Objective
Stand up platform foundations, migrate one production workload, upskill the owning team, and publish a measured plan to expand adoption.
North Star
Improve stability & performance, reduce cloud cost, and upgrade developer experience—while increasing feature delivery velocity safely.
Success Metrics
| Success ID | Goal | Metric | Target | How to Measure |
|---|---|---|---|---|
| 1 | Reduce compute cost for migrated workload | £/day for workers (like-for-like traffic) | −33% vs pre-migration | Average CPU utilization reduced on existing servers vs new cost |
| 2 | Elasticity under load | HPA scales pods to meet queue depth / RPS | <2 min scale up | Prometheus metric showing pod counts meet queue backpressure (queue depth) |
| 3 | Keep queues healthy | Queue message age P95 is reduced | TBD | Message age in queue |
| 4 | Observability | Golden signals dashboard coverage (Latency, traffic, errors and saturation) | 100% pilot workers | Appropriate dashboard for workers present |
| 5 | GitOps driven | All changes go through the gitops pipeline | 100% of changes flow through gitops pipeline | All changes are visible via gitops operator vs manual steps on servers/kubectl |
| 6 | New K8s cluster defined in code | A new production ready monitored cluster is defined in terraform and an iac repo | 100% of infra is defined in code | You can delete the cluster in a dev env and it will be recreated by re-running terraform, and continue functioning |
Assumptions
- Team Ownership: A team owns the worker in question and commits to supporting it end to end
- Pilot Commitment: Pilot team commits engineering time (1 dev) for migration, testing, and on-call during the trial
- Containerization: Worker will build into a container fairly easily
- Risk Profile: The worker is preferred not to be mission critical but uses a lot of resources
Trial Scope
Initial Focus: Async workers only
This allows us to:
- Validate the platform with lower-risk workloads
- Prove autoscaling with queue-based metrics
- Establish patterns before migrating critical services
- Measure cost reduction on resource-intensive workloads
Key Capabilities
1. GitOps-Driven Deployments
- Deploy by merging a PR, not by touching clusters
- All state defined in Git with immutable audit trail
- Automatic reconciliation and drift detection
2. Progressive Delivery
- Low-risk rollouts with automatic rollback
- Max unavailable = 0 during rollouts
- Health checks prevent bad deployments
3. Autoscaling
- Scale based on queue depth, RPS, or resource utilization
- Scale to zero for idle workers (optional)
- Fast reaction time (<2 min p95)
4. Observability
- Golden signals dashboards per service
- Standardized metrics (latency, throughput, errors, saturation)
- Queue metrics (lag, age, depth)
5. Self-Service
- Templates and paved-road patterns
- Clear migration paths
- Documentation and runbooks
Migration Strategy
- Foundation: Stand up GKE cluster, GitOps, observability stack
- Template: Create worker template with all best practices
- Pilot: Migrate one high-resource worker
- Measure: Collect DORA metrics and SLOs
- Iterate: Refine based on learnings
- Expand: Publish plan for broader adoption
Expected Outcomes
Short Term (Trial Phase)
- One worker running on Kubernetes
- GitOps pipeline operational
- Autoscaling proven with queue metrics
- Cost reduction measured (target: -33%)
- Team upskilled on Kubernetes
Long Term (Post-Trial)
- Platform ready for broader adoption
- Migration playbook published
- Templates available for all teams
- DORA metrics improved
- Cloud costs reduced