Problem Statement

Background

Over the past few years OnBuy has grown revenue rapidly. That growth brought:

More sellers
More customers
Tens of millions of additional products
Higher delivery cadence for new features
Increased traffic
Expanded engineering team

This growth made it harder to write, ship, run, and observe our systems. To cope, we scaled infrastructure without proportional observability or automation, which increased cloud spend.

We now need to reduce costs, improve stability and performance, and increase observability, all while improving developer experience.

Problem Definition

Current Deployment Model

Today we deploy to VM groups managed by the DevOps team:

CI/CD: GitLab CI/CD + php-deploy
Deployment Pattern: ~45-minute big-bang deploy across frontend, backend, and workers
Issues:
- Undesired changes are hard to detect
- Rollbacks are slow, often impacting availability
- Limited observability
- Issues are identified reactively (queue growth, DB pressure, degraded SLIs)
- No SLOs defined to catch issues proactively

Infrastructure Limitations

Static Infrastructure: Hand-managed, no autoscaling
No Orchestrated Scheduling: Manual resource allocation
Cost Impact: Higher costs due to poor elasticity
Scale Mismatch: Hand-managed VM groups no longer fit our seller, product, and traffic volumes

Current State (Baseline)

Cloud Costs

Monthly cloud costs: £200k

DORA Metrics (Baseline)

Metric	Current Value	Notes
Deployment Frequency	~1/day	Big-bang deployments
Lead Time for Changes	~1 day	To confirm
Change Failure Rate (CFR)	Unknown	Needs instrumentation
MTTR (Mean Time to Recovery)	~1 hour	To confirm
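
Since CFR is unknown and two of the other values are still to confirm, here is a minimal sketch of how the four metrics could be derived from deploy records. The `Deploy` record shape and the sample values are hypothetical; they do not reflect GitLab's API or OnBuy's real data, and the lead time uses a median, which is one possible choice.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape, not a GitLab API type)."""
    commit_at: datetime                      # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production incident?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict:
    """Compute the four DORA metrics over a window of deploy records."""
    if not deploys:
        raise ValueError("no deploys in window")
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.commit_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mttr": sum(recoveries, timedelta()) / len(recoveries) if recoveries else None,
    }


# Made-up sample: four deploys over four days, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deploy(t0, t0 + timedelta(hours=20)),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=2), failed=True,
           restored_at=t0 + timedelta(days=2, hours=1)),
    Deploy(t0 + timedelta(days=2), t0 + timedelta(days=3, hours=4)),
    Deploy(t0 + timedelta(days=3), t0 + timedelta(days=4)),
]
metrics = dora_metrics(sample, window_days=4)
```

The hard part in practice is the `failed` flag: CFR can only be measured once deploys are linked to incidents, which is the instrumentation gap noted in the table.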

Top-Level SLO Candidates (Current SLIs)

Metric	Current Value	Target
P95 Response Time (core buyer flows)	To confirm (average 250 ms, P99 700 ms)	TBD
Availability (core buyer flows)	98.??%	Confirm via synthetic checks + logs
Error Rate (HTTP 5xx)	2%	TBD

Action: In week 1, capture baselines from GitLab, deploy logs, load balancer logs, and existing dashboards.
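
As a starting point for the latency and error-rate baselines, the sketch below computes the average, P95, P99, and 5xx rate from one set of request records. The `Request` shape is a simplified, hypothetical stand-in for a parsed load balancer log line, and the sample data is synthetic. Availability is left out because its definition still needs to be agreed.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One load balancer log entry (hypothetical, simplified shape)."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slis(requests: list[Request]) -> dict:
    """Latency and error-rate SLIs for one window of requests."""
    if not requests:
        raise ValueError("no requests in window")
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "avg_ms": sum(latencies) / len(latencies),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
    }


# Synthetic sample (not OnBuy measurements): 100 requests, 2 of them 5xx.
sample = [Request(500 if i <= 2 else 200, 10.0 * i) for i in range(1, 101)]
result = slis(sample)
```

Averages and percentiles are different statistics and can diverge widely on skewed traffic, so it helps to compute all of them from the same window before setting a target.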

Why Change?

The current infrastructure model has reached its limits:

Deployment Risk: Big-bang deployments increase blast radius
Cost Inefficiency: Static infrastructure leads to over-provisioning
Limited Visibility: Reactive monitoring means issues are discovered too late
Developer Friction: Manual processes slow down feature delivery
Scalability: Current model doesn't scale with our growth trajectory

Proposed Approach

Build an Internal Developer Platform on Kubernetes that enables teams to:

Build, ship, and run services with observability, alerting, autoscaling, and security built in
Self-serve deployments by default
Deploy via GitOps (Git-based workflows)
Use progressive delivery for safer rollouts
Scale automatically based on demand

The DevOps/Platform team will productise this experience, making it easy for teams to adopt.
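
To make "scale automatically based on demand" concrete: this document does not name an autoscaler, but assuming Kubernetes' Horizontal Pod Autoscaler is used, its documented rule is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance (0.1 by default) and bounded by the configured minimum and maximum. The sketch below is a simplified illustration of that rule, not the controller's implementation; the function name and example numbers are hypothetical.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA-style calculation (illustrative only)."""
    ratio = metric / target
    # Close enough to target: leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current
    # Scale proportionally, then clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))


# Hypothetical numbers: 4 pods at 90% average CPU against a 60% target.
scaled_up = desired_replicas(4, metric=90, target=60, min_replicas=2, max_replicas=20)
```

With static, hand-managed VM groups the equivalent of `current` never changes, which is where the over-provisioning cost comes from.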
