Skip to main content

Proposed Solution

Objective

Stand up platform foundations, migrate one production workload, upskill the owning team, and publish a measured plan to expand adoption.

North Star

Improve stability & performance, reduce cloud cost, and upgrade developer experience—while increasing feature delivery velocity safely.

Success Metrics

Success IDGoalMetricTargetHow to Measure
1Reduce compute cost for migrated workload£/day for workers (like-for-like traffic)−33% vs pre-migrationAverage CPU utilization reduced on existing servers vs new cost
2Elasticity under loadHPA scales pods to meet queue depth / RPS<2 min scale upPrometheus metric showing pod counts meet queue backpressure (queue depth)
3Keep queues healthyQueue message age P95 is reducedTBDMessage age in queue
4ObservabilityGolden signals dashboard coverage (Latency, traffic, errors and saturation)100% pilot workersAppropriate dashboard for workers present
5GitOps drivenAll changes go through the gitops pipeline100% of changes flow through gitops pipelineAll changes are visible via gitops operator vs manual steps on servers/kubectl
6New K8s cluster defined in codeA new production ready monitored cluster is defined in terraform and an iac repo100% of infra is defined in codeYou can delete the cluster in a dev env and it will be recreated by re-running terraform, and continue functioning

Assumptions

  1. Team Ownership: A team owns the worker in question and commits to supporting it end to end
  2. Pilot Commitment: Pilot team commits engineering time (1 dev) for migration, testing, and on-call during the trial
  3. Containerization: Worker will build into a container fairly easily
  4. Risk Profile: The worker is preferred not to be mission critical but uses a lot of resources

Trial Scope

Initial Focus: Async workers only

This allows us to:

  • Validate the platform with lower-risk workloads
  • Prove autoscaling with queue-based metrics
  • Establish patterns before migrating critical services
  • Measure cost reduction on resource-intensive workloads

Key Capabilities

1. GitOps-Driven Deployments

  • Deploy by merging a PR, not by touching clusters
  • All state defined in Git with immutable audit trail
  • Automatic reconciliation and drift detection

2. Progressive Delivery

  • Low-risk rollouts with automatic rollback
  • Max unavailable = 0 during rollouts
  • Health checks prevent bad deployments

3. Autoscaling

  • Scale based on queue depth, RPS, or resource utilization
  • Scale to zero for idle workers (optional)
  • Fast reaction time (<2 min p95)

4. Observability

  • Golden signals dashboards per service
  • Standardized metrics (latency, throughput, errors, saturation)
  • Queue metrics (lag, age, depth)

5. Self-Service

  • Templates and paved-road patterns
  • Clear migration paths
  • Documentation and runbooks

Migration Strategy

  1. Foundation: Stand up GKE cluster, GitOps, observability stack
  2. Template: Create worker template with all best practices
  3. Pilot: Migrate one high-resource worker
  4. Measure: Collect DORA metrics and SLOs
  5. Iterate: Refine based on learnings
  6. Expand: Publish plan for broader adoption

Expected Outcomes

Short Term (Trial Phase)

  • One worker running on Kubernetes
  • GitOps pipeline operational
  • Autoscaling proven with queue metrics
  • Cost reduction measured (target: -33%)
  • Team upskilled on Kubernetes

Long Term (Post-Trial)

  • Platform ready for broader adoption
  • Migration playbook published
  • Templates available for all teams
  • DORA metrics improved
  • Cloud costs reduced