
Problem Statement

Background

Over the past few years OnBuy has grown revenue rapidly. That growth brought:

  • More sellers
  • More customers
  • Tens of millions of additional products
  • Higher delivery cadence for new features
  • Increased traffic
  • Expanded engineering team

This growth made it harder to write, ship, run, and observe our systems. To cope, we scaled infrastructure without proportional observability or automation, which increased cloud spend.

We now need to reduce costs, improve stability and performance, and increase observability, all while improving developer experience.

Problem Definition

Current Deployment Model

Today we deploy to VM groups managed by the DevOps team:

  • CI/CD: GitLab CI/CD + php-deploy
  • Deployment Pattern: ~45-minute big-bang deploy across frontend, backend, and workers
  • Issues:
    • Undesired changes are hard to detect
    • Rollbacks are slow, often impacting availability
    • Limited observability
    • Reactive issue identification (queue growth, DB pressure, degraded SLIs)
    • No proactive SLOs

Infrastructure Limitations

  • Static Infrastructure: Hand-managed, no autoscaling
  • No Orchestrated Scheduling: Manual resource allocation
  • Cost Impact: Higher costs due to poor elasticity
  • Scale Mismatch: Current model isn't fit for our scale

Current State (Baseline)

Cloud Costs

  • Monthly cloud costs: £200k

DORA Metrics (Baseline)

| Metric | Current Value | Notes |
| --- | --- | --- |
| Deployment Frequency | ~1/day | Big-bang deployments |
| Lead Time for Changes | ~1 day | To confirm |
| Change Failure Rate (CFR) | Unknown | Needs instrumentation |
| MTTR (Mean Time to Recovery) | ~1 hour | To confirm |
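
To make the baseline reproducible, the four metrics above could be derived from deploy records roughly as follows. This is a minimal sketch, not our tooling: the `Deploy` record, its field names, and `dora_baseline` are hypothetical, and it assumes recovery time is measured from the failing deploy to service restoration (incident data may give a better start time).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deploy:
    """Hypothetical record joined from GitLab and deploy logs."""
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # caused an incident or rollback
    restored_at: Optional[datetime] = None  # when service was restored


def dora_baseline(deploys: List[Deploy], window_days: int) -> dict:
    """Compute the four DORA metrics over a fixed observation window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")

    # Lead time: commit to production, in hours.
    lead_hours = [
        (d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deploys
    ]
    failures = [d for d in deploys if d.failed]
    # Recovery time: failing deploy to restoration, in hours (assumption).
    recovery_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        # None when no recovered failure was observed in the window.
        "mttr_hours": (
            sum(recovery_hours) / len(recovery_hours) if recovery_hours else None
        ),
    }
```

Using the median for lead time keeps one slow change from dominating the figure. CFR only becomes meaningful once failed deploys are actually flagged, which is the instrumentation gap noted in the table.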

Top-Level SLO Candidates (Current SLIs)

| Metric | Current Value | Target |
| --- | --- | --- |
| P95 Response Time (core buyer flows) | Average 250 ms, P99 700 ms (P95 to confirm) | TBD |
| Availability (core buyer flows) | 98.??% | Confirm via synthetic + logs |
| Error Rate (HTTP 5xx) | 2% | TBD |

Action: Instrument week-1 baselines from GitLab, deploy logs, load balancer logs, and existing dashboards.
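
As an illustration of what the load balancer baseline might compute, the sketch below derives latency percentiles and the 5xx error rate from parsed request records. The `(status_code, latency_ms)` input shape and the function names are assumptions, not an existing log format, and the percentile uses the simple nearest-rank method.

```python
import math
from typing import List, Tuple


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def sli_summary(requests: List[Tuple[int, float]]) -> dict:
    """Summarise (status_code, latency_ms) records into candidate SLIs."""
    if not requests:
        raise ValueError("no requests")
    latencies = [latency_ms for _, latency_ms in requests]
    server_errors = sum(1 for status, _ in requests if 500 <= status <= 599)
    return {
        "mean_ms": sum(latencies) / len(latencies),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": server_errors / len(requests),
    }
```

Reporting the mean alongside P95 and P99 makes the difference between them visible: an average can look healthy while tail latency is several times higher, which is why the SLO candidate is expressed as a percentile.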

Why Change?

The current infrastructure model has reached its limits:

  1. Deployment Risk: Big-bang deployments increase blast radius
  2. Cost Inefficiency: Static infrastructure leads to over-provisioning
  3. Limited Visibility: Reactive monitoring means issues are discovered too late
  4. Developer Friction: Manual processes slow down feature delivery
  5. Scalability: Current model doesn't scale with our growth trajectory

Proposed Approach

Build an Internal Developer Platform on Kubernetes that enables teams to:

  • Build, ship, and run services with observability, alerting, autoscaling, and security built in
  • Self-serve deployments by default
  • Deploy via GitOps (Git-based workflows)
  • Use progressive delivery for safer rollouts
  • Scale automatically based on demand

The DevOps/Platform team will productise this experience, making it easy for teams to adopt.
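
To give a feel for what scaling automatically on demand means in practice, the Kubernetes Horizontal Pod Autoscaler documents its core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below reproduces only that formula, with clamping to a replica range; the function name, the default bounds, and the example numbers are hypothetical, and the real autoscaler also applies a tolerance band and stabilisation behaviour that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 2,    # hypothetical lower bound
    max_replicas: int = 20,   # hypothetical upper bound
) -> int:
    """Core HPA formula, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 replicas running at 90% average CPU against a 60% target would scale out to `desired_replicas(4, 90, 60)` replicas, and the same service at 30% CPU would scale back in, which is the elasticity the static VM groups lack.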