Problem Statement
Background
Over the past few years OnBuy has grown revenue rapidly. That growth brought:
- More sellers
- More customers
- Tens of millions of additional products
- Higher delivery cadence for new features
- Increased traffic
- Expanded engineering team
This growth made it harder to write, ship, run, and observe our systems. To cope, we scaled infrastructure without a proportional investment in observability or automation, which increased cloud spend.
We now need to reduce costs, improve stability and performance, and increase observability, all while improving developer experience.
Problem Definition
Current Deployment Model
Today we deploy to VM groups managed by the DevOps team:
- CI/CD: GitLab CI/CD + php-deploy
- Deployment Pattern: ~45-minute big-bang deploy across frontend, backend, and workers
- Issues:
  - Undesired changes are hard to detect
  - Rollbacks are slow, often impacting availability
  - Limited observability
  - Reactive issue identification (queue growth, DB pressure, degraded SLIs)
  - No proactive SLOs
Infrastructure Limitations
- Static Infrastructure: Hand-managed, no autoscaling
- No Orchestrated Scheduling: Manual resource allocation
- Cost Impact: Higher costs due to poor elasticity
- Scale Mismatch: Current model isn't fit for our scale
Current State (Baseline)
Cloud Costs
- Monthly cloud costs: £200k
DORA Metrics (Baseline)
| Metric | Current Value | Notes |
|---|---|---|
| Deployment Frequency | ~1/day | Big-bang deployments |
| Lead Time for Changes | ~1 day | To confirm |
| Change Failure Rate (CFR) | Unknown | Needs instrumentation |
| MTTR (Mean Time to Recovery) | ~1 hour | To confirm |
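Several baseline values above are marked "Unknown" or "To confirm". As a minimal sketch of how they could be derived once deploy records are exported, the snippet below computes the four metrics from a list of deploys. The `Deploy` record and its field names are hypothetical, not an existing GitLab or php-deploy schema, and treating a deploy as "failed" when it triggers an incident or rollback is an assumption we would need to agree on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deploy. All field names are hypothetical."""
    commit_at: datetime                     # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # assumed: caused an incident or rollback
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deploys: list[Deploy], window_days: int) -> dict:
    """Summarise the four DORA metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deploy")
    lead_times = [d.deployed_at - d.commit_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    # Only failures with a recorded restore time contribute to MTTR.
    recoveries = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```

Median lead time is used here rather than the mean so that a single long-lived branch does not dominate the figure; that choice is also an assumption.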
Top-Level SLO Candidates (Current SLIs)
| Metric | Current Value | Target |
|---|---|---|
| P95 Response Time (core buyer flows) | P95 to confirm (average 250 ms, P99 700 ms) | TBD |
| Availability (core buyer flows) | 98.??% | TBD (confirm baseline via synthetic checks + logs) |
| Error Rate (HTTP 5xx) | 2% | TBD |
Action: Capture week-1 baselines for the metrics above from GitLab, deploy logs, load balancer logs, and existing dashboards.
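To make the SLI definitions concrete, here is a minimal sketch of how error rate, availability, and latency percentiles could be computed from parsed load balancer log entries. The `Request` record is hypothetical, and defining availability as the share of requests that did not return HTTP 5xx is an assumption, not an agreed definition. The percentile uses the nearest-rank method; dashboards may interpolate and report slightly different values.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One parsed load balancer log entry. Field names are hypothetical."""
    status: int        # HTTP status code
    latency_ms: float  # end-to-end response time in milliseconds


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slis(requests: list[Request]) -> dict:
    """Compute candidate SLIs for one window of requests."""
    total = len(requests)
    if total == 0:
        raise ValueError("need at least one request")
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    latencies = [r.latency_ms for r in requests]
    return {
        "error_rate": errors / total,
        "availability": (total - errors) / total,  # assumed: non-5xx share
        "mean_ms": sum(latencies) / total,
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }
```

Reporting the mean alongside P95 and P99 shows how far the tail sits from the typical request, which is the gap the baseline table currently leaves open.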
Why Change?
The current infrastructure model has reached its limits:
- Deployment Risk: Big-bang deployments increase blast radius
- Cost Inefficiency: Static infrastructure leads to over-provisioning
- Limited Visibility: Reactive monitoring means issues are discovered too late
- Developer Friction: Manual processes slow down feature delivery
- Scalability: Current model doesn't scale with our growth trajectory
Proposed Approach
Build an Internal Developer Platform on Kubernetes that enables teams to:
- Build, ship, and run services with observability, alerting, autoscaling, and security built in
- Self-serve deployments by default
- Deploy via GitOps (Git-based workflows)
- Use progressive delivery for safer rollouts
- Scale automatically based on demand
The DevOps/Platform team will productise this experience, making it easy for teams to adopt.
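As an illustration of what progressive delivery means in contrast to today's big-bang deploy, the sketch below walks a new version through increasing traffic percentages and halts as soon as the observed error rate exceeds a threshold. It is a simplified model under stated assumptions, not the behaviour of any particular rollout tool: `progressive_rollout`, the `error_rate_at` callback, and the step percentages are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def progressive_rollout(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[bool, Optional[int]]:
    """Shift traffic to the new version step by step.

    `steps` are percentages of traffic sent to the new version.
    `error_rate_at(pct)` is a hypothetical probe returning the new version's
    observed error rate while it serves `pct` percent of traffic.
    Returns (promoted, failed_at): `failed_at` is the step that breached the
    threshold, or None if every step was healthy.
    """
    for pct in steps:
        if error_rate_at(pct) > max_error_rate:
            # Stop here: only `pct` percent of traffic was ever exposed,
            # and the caller shifts it back to the previous version.
            return False, pct
    return True, None


# Usage with made-up numbers: the new version degrades once it takes half the traffic.
def probe(pct: int) -> float:
    return 0.05 if pct >= 50 else 0.001


result = progressive_rollout([5, 25, 50, 100], probe, max_error_rate=0.01)
print(result)
```

The point of the model is the blast radius: a bad change is caught while it serves a fraction of traffic, instead of after a full 45-minute deploy.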