Proposed Solution

Objective

Stand up platform foundations, migrate one production workload, upskill the owning team, and publish a measured plan to expand adoption.

North Star

Improve stability & performance, reduce cloud cost, and upgrade developer experience—while increasing feature delivery velocity safely.

Success Metrics

Success ID	Goal	Metric	Target	How to Measure
1	Reduce compute cost for migrated workload	£/day for workers (like-for-like traffic)	−33% vs pre-migration	Average CPU utilization reduced on existing servers vs new cost
2	Elasticity under load	HPA scales pods to meet queue depth / RPS	<2 min scale up	Prometheus metric showing pod counts meet queue backpressure (queue depth)
3	Keep queues healthy	Queue message age P95 is reduced	TBD	Message age in queue
4	Observability	Golden signals dashboard coverage (Latency, traffic, errors and saturation)	100% pilot workers	Appropriate dashboard for workers present
5	GitOps driven	All changes go through the gitops pipeline	100% of changes flow through gitops pipeline	All changes are visible via gitops operator vs manual steps on servers/kubectl
6	New K8s cluster defined in code	A new production ready monitored cluster is defined in terraform and an iac repo	100% of infra is defined in code	You can delete the cluster in a dev env and it will be recreated by re-running terraform, and continue functioning

Assumptions

Team Ownership: A team owns the worker in question and commits to supporting it end to end
Pilot Commitment: Pilot team commits engineering time (1 dev) for migration, testing, and on-call during the trial
Containerization: Worker will build into a container fairly easily
Risk Profile: The worker is preferred not to be mission critical but uses a lot of resources

Trial Scope

Initial Focus: Async workers only

This allows us to:

Validate the platform with lower-risk workloads
Prove autoscaling with queue-based metrics
Establish patterns before migrating critical services
Measure cost reduction on resource-intensive workloads

Key Capabilities

1. GitOps-Driven Deployments

Deploy by merging a PR, not by touching clusters
All state defined in Git with immutable audit trail
Automatic reconciliation and drift detection

2. Progressive Delivery

Low-risk rollouts with automatic rollback
Max unavailable = 0 during rollouts
Health checks prevent bad deployments

3. Autoscaling

Scale based on queue depth, RPS, or resource utilization
Scale to zero for idle workers (optional)
Fast reaction time (<2 min p95)

4. Observability

Golden signals dashboards per service
Standardized metrics (latency, throughput, errors, saturation)
Queue metrics (lag, age, depth)

5. Self-Service

Templates and paved-road patterns
Clear migration paths
Documentation and runbooks

Migration Strategy

Foundation: Stand up GKE cluster, GitOps, observability stack
Template: Create worker template with all best practices
Pilot: Migrate one high-resource worker
Measure: Collect DORA metrics and SLOs
Iterate: Refine based on learnings
Expand: Publish plan for broader adoption

Expected Outcomes

Short Term (Trial Phase)

One worker running on Kubernetes
GitOps pipeline operational
Autoscaling proven with queue metrics
Cost reduction measured (target: -33%)
Team upskilled on Kubernetes

Long Term (Post-Trial)

Platform ready for broader adoption
Migration playbook published
Templates available for all teams
DORA metrics improved
Cloud costs reduced

Objective​

North Star​

Success Metrics​

Assumptions​

Trial Scope​

Key Capabilities​

1. GitOps-Driven Deployments​

2. Progressive Delivery​

3. Autoscaling​

4. Observability​

5. Self-Service​

Migration Strategy​

Expected Outcomes​

Short Term (Trial Phase)​

Long Term (Post-Trial)​