Platform Architecture
Overview
The Internal Developer Platform (IDP) runs on Kubernetes (GKE) and lets teams build, ship, and run services on a self-service basis, with observability, alerting, autoscaling, and security built in by default.
Core Components
1. Kubernetes Cluster (GKE)
- Purpose: Container orchestration platform providing compute, networking, and storage primitives
- Configuration: Managed via Terraform/IaC for reproducibility
- Availability: Single availability zone for the initial deployment
- Features:
  - Autoscaling node pools
  - Managed control plane
  - Integrated networking and load balancing
2. GitOps (Flux)
- Purpose: Declarative deployment management ensuring cluster state matches Git
- Approach: Gitless GitOps using OCI artifacts as deployment vehicles
- Workflow:
  - Edit Kustomize manifests locally, then run Flux locally with Tilt (tilt.dev) to test them
  - PR is merged and CI publishes an OCI artifact
  - Flux pulls the artifact from the artifact registry
  - Cluster state reconciles to the desired state
- Benefits:
  - Changes can be tested locally before opening a PR
  - Immutable audit trail (all changes via Git)
  - Automatic drift detection and correction
  - Rollback via Git history
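The reconcile and drift-correction steps above can be pictured as a diff between the desired state unpacked from the OCI artifact and the live cluster state. The following is a toy model in Python, not Flux's implementation: the `diff_state` function, the `kind/name` keys, and the sample resources are all hypothetical, and deletion assumes pruning is enabled.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a resource is keyed by "kind/name" and described by a spec dict.
State = Dict[str, dict]


def diff_state(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return sorted (action, resource) pairs that make `live` match `desired`."""
    actions: List[Tuple[str, str]] = []
    for key, spec in desired.items():
        if key not in live:
            actions.append(("create", key))
        elif live[key] != spec:
            # Drift: the live object or the artifact changed since the last sync.
            actions.append(("update", key))
    for key in live:
        if key not in desired:
            # In the cluster but absent from the artifact (assumes pruning is on).
            actions.append(("delete", key))
    return sorted(actions)


desired = {
    "Deployment/worker": {"image": "worker:1.2.0", "replicas": 2},
    "Service/worker": {"port": 8080},
}
live = {
    "Deployment/worker": {"image": "worker:1.1.0", "replicas": 2},  # stale image
    "ConfigMap/old": {"data": {}},  # no longer part of the artifact
}

plan = diff_state(desired, live)
for action, resource in plan:
    print(action, resource)
```

Once the plan is applied, a second call with identical desired and live state returns an empty list, which is the steady state a reconciler aims for.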
3. Secrets Management
- Solution: SOPS (encrypted secrets in Git) or External Secrets Operator
- Storage: Encrypted secrets stored in Git, versioned and auditable
- Decryption: KMS-backed decryption at deployment time
- Security: Secrets never stored in plaintext in Git
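A pre-merge check is one way to enforce the "never in plaintext" rule. The sketch below assumes the SOPS option, where an encrypted file carries a top-level `sops` metadata key and each encrypted value is a string beginning with `ENC[`. It works on an already-parsed manifest (a Python dict) to stay free of third-party dependencies. The function name, the choice to inspect only `data` and `stringData`, and the sample documents are hypothetical.

```python
from typing import List


def plaintext_paths(doc: dict) -> List[str]:
    """Return the paths in a parsed Secret manifest that look unencrypted."""
    problems: List[str] = []
    if "sops" not in doc:
        # SOPS adds this metadata block when it encrypts a file.
        problems.append("<missing sops metadata>")
    for section in ("data", "stringData"):
        for key, value in doc.get(section, {}).items():
            if not (isinstance(value, str) and value.startswith("ENC[")):
                problems.append(f"{section}.{key}")
    return problems


# Placeholder ciphertext for illustration only; not real SOPS output.
encrypted = {
    "kind": "Secret",
    "data": {"password": "ENC[AES256_GCM,data:PLACEHOLDER,type:str]"},
    "sops": {"version": "PLACEHOLDER"},
}
leaky = {
    "kind": "Secret",
    "data": {"password": "aHVudGVyMg=="},  # base64 is encoding, not encryption
}

print(plaintext_paths(encrypted))
print(plaintext_paths(leaky))
```

A CI job could fail the PR whenever the returned list is non-empty.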
4. Observability Stack (TBD)
- Metrics:
  - Prometheus running in-cluster (GKE integration)
  - New Relic agents for application metrics & cluster metrics
- Dashboards: Standardized golden signals dashboards per service (TBD)
- Logs: Centralized logging via Google Cloud Logging
- Traces: Distributed tracing via New Relic
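Since the dashboards are still TBD, the following is only a sketch of what a per-service golden-signals summary (latency, traffic, errors, saturation) might compute. It is plain Python over in-memory samples, not a New Relic or Prometheus query. The `Request` type, the field names, the use of CPU as the saturation measure, and the rule that 5xx responses count as errors are all assumptions.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(
    requests: List[Request],
    window_seconds: float,
    cpu_used_cores: float,
    cpu_limit_cores: float,
) -> Dict[str, float]:
    """Summarize the four golden signals for one service over one time window."""
    count = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "traffic_rps": count / window_seconds,
        "error_rate": errors / count if count else 0.0,
        "latency_median_ms": median(r.duration_ms for r in requests) if count else 0.0,
        "saturation": cpu_used_cores / cpu_limit_cores,
    }


sample = [
    Request(100.0, 200),
    Request(200.0, 200),
    Request(300.0, 500),
    Request(400.0, 200),
]
signals = golden_signals(sample, window_seconds=2.0, cpu_used_cores=0.5, cpu_limit_cores=2.0)
print(signals)
```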
5. Autoscaling
- HPA (Horizontal Pod Autoscaler): Primary scaling mechanism based on CPU/memory metrics
- KEDA (Kubernetes Event-Driven Autoscaling): Advanced scaling for queue-based workloads
- Scaling Targets:
  - Queue depth/lag
  - Request rate (RPS)
  - CPU/memory utilization
- Reaction Time: p95 scale-up latency of 2 minutes or less
- Scale-to-Zero: Optional for idle workers
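The scaling behavior above comes down to a small amount of arithmetic. The sketch below reproduces the replica formula documented for the Horizontal Pod Autoscaler, `ceil(currentReplicas * currentValue / targetValue)`, with its default 10% tolerance. It also includes a queue-depth variant of the kind a KEDA-driven worker might use, and a nearest-rank p95 check against the 2-minute target. The function names, the replica limits, and the sample latencies are hypothetical, and the nearest-rank percentile method is an assumption.

```python
import math
from typing import List


def hpa_desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    tolerance: float = 0.1,
) -> int:
    """HPA-style calculation: scale in proportion to metric/target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target; do not scale
    return math.ceil(current_replicas * ratio)


def queue_desired_replicas(
    queue_depth: int,
    target_per_replica: int,
    min_replicas: int = 0,  # 0 allows scale-to-zero for idle workers
    max_replicas: int = 20,
) -> int:
    """Queue-based sizing: one replica per `target_per_replica` messages, clamped."""
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical scale-up latencies in seconds.
scale_up_seconds = [35, 40, 42, 48, 55, 60, 70, 85, 95, 110]

print(hpa_desired_replicas(4, current_value=90, target_value=60))
print(queue_desired_replicas(130, target_per_replica=50))
print(p95(scale_up_seconds) <= 120)
```

With a queue depth of 0 and `min_replicas=0`, `queue_desired_replicas` returns 0, which corresponds to the optional scale-to-zero case.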
6. Worker Templates & Paved Road
- Components:
  - Dockerfile templates
  - CI/CD pipeline configuration
  - Helm/Kustomize manifests
  - KEDA/HPA scaling definitions
  - Default SLO/alerts
  - Pre-configured dashboards
- Purpose: Reduce time-to-production for new workers
Architecture Flow
┌─────────────────┐
│    Git Repo     │
│    (Source)     │
└────────┬────────┘
         │
         │ PR Merge
         ▼
┌─────────────────┐
│     CI/CD       │
│   Build Image   │
└────────┬────────┘
         │
         │ Push OCI Artifact
         ▼
┌─────────────────┐
│     GitOps      │
│     (Flux)      │
└────────┬────────┘
         │
         │ Sync & Deploy
         ▼
┌─────────────────┐
│   GKE Cluster   │
│  ┌───────────┐  │
│  │  Workers  │  │
│  │  (Pods)   │  │
│  └─────┬─────┘  │
│        │        │
│        ▼        │
│  ┌───────────┐  │
│  │ HPA/KEDA  │  │
│  │  Scaling  │  │
│  └───────────┘  │
└────────┬────────┘
         │
         │ Metrics/Logs
         ▼
┌─────────────────┐
│  Observability  │
│  (New Relic +   │
│   Prometheus)   │
└─────────────────┘
Key Principles
- GitOps First: All changes flow through Git, with no manual kubectl operations
- Self-Service: Teams can deploy independently with guardrails
- Observable by Default: Every service gets dashboards and alerts
- Secure by Default: Secrets encrypted, RBAC enforced, policy-driven
- Cost-Effective: Autoscaling reduces idle cost and improves resource utilization
- Progressive Delivery: Safe rollouts with automatic rollback on failure