Structured workflow for onboarding new services to the internal platform including infrastructure provisioning, observability setup, and documentation.
Inherits all available tools
Additional assets for this skill
This skill inherits all available tools. When active, it can use any tool Claude has access to.
name: ops-platform-onboarding description: | Structured workflow for onboarding new services to the internal platform including infrastructure provisioning, observability setup, and documentation.
trigger: |
skip_when: |
This skill defines the structured process for onboarding services to the internal developer platform. Use it to ensure consistent, compliant service deployments.
| Phase | Focus | Output |
|---|---|---|
| 1. Requirements | Gather service requirements | Requirements doc |
| 2. Golden Path Selection | Choose deployment pattern | Selected template |
| 3. Infrastructure Provisioning | Create service resources | Infrastructure ready |
| 4. Observability Setup | Configure monitoring | Dashboards/alerts |
| 5. Security Configuration | Apply security controls | Security validated |
| 6. Documentation | Complete service docs | Runbook ready |
| 7. Handoff | Transfer to service team | Ownership confirmed |
## Service Onboarding Request
**Service Name:** [name]
**Team:** [owning team]
**Requested By:** [name]
**Target Date:** YYYY-MM-DD
### Service Information
| Attribute | Value |
|-----------|-------|
| Service type | [API / Worker / Batch / Frontend] |
| Language/runtime | [Go / Node.js / Python / etc.] |
| Criticality | [Tier 1/2/3/4] |
| External traffic | [Yes / No] |
| Data sensitivity | [PII / Financial / Public] |
### Resource Requirements
| Resource | Requirement | Notes |
|----------|-------------|-------|
| CPU | [cores] | [peak/average] |
| Memory | [GB] | [peak/average] |
| Storage | [GB] | [type: SSD/HDD] |
| Database | [type] | [shared/dedicated] |
| Cache | [type] | [shared/dedicated] |
### Dependencies
| Dependency | Type | SLA Required |
|------------|------|--------------|
| [service] | Internal | [Yes/No] |
| [external] | External | [Yes/No] |
### Compliance Requirements
- [ ] SOC2
- [ ] PCI-DSS
- [ ] GDPR
- [ ] HIPAA
- [ ] Other: ____________
| Golden Path | Use Case | Includes |
|---|---|---|
| api-service | REST/GraphQL APIs | ALB, EKS, RDS, ElastiCache |
| worker-service | Background processing | SQS, EKS, auto-scaling |
| batch-job | Scheduled jobs | EventBridge, Lambda/Fargate |
| frontend-app | Static sites, SPAs | CloudFront, S3, API Gateway |
| data-pipeline | ETL, streaming | Kinesis, Glue, S3 |
| Requirement | api-service | worker-service | batch-job |
|---|---|---|---|
| HTTP traffic | Yes | No | No |
| Queue processing | Optional | Yes | Optional |
| Scheduled runs | No | No | Yes |
| Real-time | Yes | Near-real-time | No |
| Auto-scaling | Yes | Yes | N/A |
## Golden Path Selection
**Service:** [name]
**Selected Path:** [api-service / worker-service / etc.]
### Rationale
1. Service type [X] matches [golden path] pattern
2. Traffic requirements of [X] supported by [features]
3. Compliance requirements met by built-in [controls]
### Customizations Required
| Standard Component | Customization | Reason |
|--------------------|---------------|--------|
| [component] | [change] | [why] |
### Approval
- [ ] Platform team reviewed
- [ ] Security team reviewed (if customizations)
- [ ] Architecture team reviewed (if non-standard)
# Example service provisioning
module "service" {
source = "platform/service-template"
service_name = var.service_name
team = var.team
environment = var.environment
golden_path = "api-service"
# Compute
cpu_request = "500m"
memory_request = "512Mi"
replicas_min = 2
replicas_max = 10
# Database
database_enabled = true
database_class = "db.t3.medium"
# Tags
tags = {
Team = var.team
Environment = var.environment
CostCenter = var.cost_center
}
}
# Verify namespace
kubectl get namespace [service-name]
# Verify compute
kubectl get deployment -n [service-name]
# Verify database
aws rds describe-db-instances --db-instance-identifier [service-db]
# Verify DNS
dig [service-name].internal.example.com
Standard service dashboard includes:
| Panel | Metrics |
|---|---|
| Request rate | requests/sec, by status code |
| Error rate | 5xx rate, 4xx rate |
| Latency | p50, p95, p99 |
| Saturation | CPU, memory utilization |
| Dependencies | Upstream/downstream health |
| Alert | Condition | Severity | Response |
|---|---|---|---|
| High error rate | 5xx > 1% for 5m | Critical | Page on-call |
| High latency | p99 > 1s for 5m | Warning | Alert team |
| Low availability | uptime < 99.9% | Critical | Page on-call |
| Resource saturation | CPU > 85% for 10m | Warning | Alert team |
## Service Level Objectives
**Service:** [name]
**SLO Version:** 1.0
| SLI | Target | Measurement |
|-----|--------|-------------|
| Availability | 99.9% | Successful requests / total requests |
| Latency | p99 < 500ms | Request duration percentile |
| Error rate | < 0.1% | 5xx responses / total responses |
### Error Budget
- Monthly budget: 43.2 minutes downtime
- Current consumption: [X]%
- Actions if budget exceeded: [escalation process]
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: service-policy
namespace: [service-name]
spec:
podSelector:
matchLabels:
app: [service-name]
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: istio-system
ports:
- port: 8080
egress:
- to:
- namespaceSelector:
matchLabels:
name: database
ports:
- port: 5432
## Security Configuration Review
**Service:** [name]
**Reviewer:** @security-team
| Control | Status | Notes |
|---------|--------|-------|
| mTLS enabled | PASS | Istio strict mode |
| Network policies | PASS | Ingress/egress restricted |
| Secrets management | PASS | Using Vault |
| Least privilege IAM | PASS | Scoped to required resources |
| Vulnerability scanning | PASS | Trivy in CI/CD |
| Document | Purpose | Template |
|---|---|---|
| Service Overview | What the service does | README.md |
| Runbook | Operational procedures | runbook.md |
| Architecture | Design decisions | architecture.md |
| API Docs | Interface documentation | OpenAPI spec |
| On-call Guide | Incident handling | oncall.md |
## [Service Name] Runbook
### Service Overview
[Brief description of what the service does]
### Quick Reference
| Item | Value |
|------|-------|
| Repository | [link] |
| Dashboard | [link] |
| Logs | [query link] |
| On-call | [PagerDuty service] |
### Common Operations
#### Restart Service
```bash
kubectl rollout restart deployment/[service] -n [namespace]
kubectl scale deployment/[service] -n [namespace] --replicas=X
kubectl logs -l app=[service] -n [namespace] --tail=100
| Symptom | Possible Cause | Resolution |
|---|---|---|
| High latency | DB connection pool | Scale DB or optimize queries |
| 5xx errors | Dependency down | Check upstream services |
| OOM kills | Memory leak | Investigate heap, restart |
| Level | Contact | When |
|---|---|---|
| L1 | [team Slack channel] | First response |
| L2 | [on-call engineer] | Cannot resolve in 15m |
| L3 | [service owner] | Critical/extended outage |
---
## Phase 7: Handoff
### Handoff Checklist
- [ ] Service owner identified and trained
- [ ] On-call rotation set up
- [ ] Access provisioned to team
- [ ] Documentation reviewed by team
- [ ] Shadowing session completed
- [ ] Ownership officially transferred
### Handoff Template
```markdown
## Service Handoff Confirmation
**Service:** [name]
**Date:** YYYY-MM-DD
**Platform Team:** @[name]
**Service Owner:** @[name]
### Completed Items
- [x] Infrastructure provisioned and documented
- [x] Observability configured
- [x] Security controls applied
- [x] Runbook created and reviewed
- [x] On-call rotation configured
- [x] Training session completed
### Outstanding Items
| Item | Owner | Due Date |
|------|-------|----------|
| [item] | [owner] | YYYY-MM-DD |
### Acknowledgment
By signing below, the service owner confirms:
1. Receipt of all documentation
2. Understanding of operational procedures
3. Acceptance of on-call responsibilities
**Service Owner:** _________________ Date: _______
**Platform Team:** _________________ Date: _______
| Rationalization | Why It's WRONG | Required Action |
|---|---|---|
| "Skip documentation, code is self-explanatory" | On-call != developers | Complete runbook |
| "We'll add observability later" | Blind deployments = incidents | Observability on day 1 |
| "Golden path doesn't fit exactly" | Customizations add complexity | Justify every deviation |
| "Security can come later" | Later = never for security | Security from start |
| "Team can figure it out" | Assumptions cause outages | Complete handoff process |
For platform onboarding tasks, dispatch:
Task tool:
subagent_type: "ring-ops-team:platform-engineer"
model: "opus"
prompt: |
SERVICE ONBOARDING REQUEST
Service: [name]
Team: [team]
Type: [API/Worker/Batch]
Requirements: [summary]
Golden Path: [if known]