Software Architecture
Design question → options with tradeoffs → documented decision.
<when_to_use>
- Designing new systems or major features
- Evaluating architectural approaches
- Making technology stack decisions
- Planning for scale and performance
- Analyzing design tradeoffs
NOT for: trivial tech choices, premature optimization, undocumented requirements
</when_to_use>
<phases>
Track with TodoWrite. Advance only, never regress.
| Phase | Trigger | activeForm |
|---|---|---|
| Discovery | Session start | "Gathering requirements" |
| Codebase Analysis | Requirements clear | "Analyzing codebase" |
| Constraint Evaluation | Codebase understood | "Evaluating constraints" |
| Solution Design | Constraints mapped | "Designing solutions" |
| Documentation | Design selected | "Documenting architecture" |
Situational (insert before Documentation when triggered):
- Review & Refinement → feedback cycles on complex designs
Edge cases:
- Small questions: skip directly to Solution Design
- Greenfield: skip Codebase Analysis
- No ADR needed: skip Documentation phase
- Iteration: Review & Refinement may repeat
TodoWrite format:
- Discovery { problem domain }
- Analyze { codebase area }
- Evaluate { constraint type }
- Design { solution approach }
- Document { decision type }
Workflow:
- Start: Create Discovery as in_progress
- Transition: Mark current completed, add next in_progress
- Fast start: skip directly to Solution Design when the problem is already clear
- Optional end: Documentation skippable if no ADR is needed
</phases>
<principles>
Proven over Novel — favor battle-tested technology over bleeding-edge unless there is strong justification.
Framework:
- 3+ years production at scale?
- Strong community + active maintenance?
- Available experienced practitioners?
- Total cost of ownership (learning, tooling, hiring)?
Red flags:
- "Early adopters" without time budget
- "Written in X" without benchmarks
- "Everyone's talking" without case studies
Complexity Budget — each abstraction must provide 10x value.
Questions:
- What specific problem does this solve?
- Can we solve with existing tools/patterns?
- Maintenance burden (docs, onboarding, debugging)?
- Impact on debugging and incident response?
Unix Philosophy — small, focused modules with clear contracts, single responsibilities.
Checklist:
- Single, well-defined purpose?
- Describable in one sentence without "and"?
- Dependencies explicit and minimal?
- Testable in isolation?
- Clean, stable interface?
Observability First — no system ships without metrics, tracing, alerting.
Required for every service:
- Metrics: RED (Rate, Errors, Duration) for all endpoints
- Tracing: distributed traces with correlation IDs
- Logging: structured logs with context
- Alerts: SLO-based with runbooks
- Dashboards: at-a-glance health
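A minimal sketch of RED tracking as a handler wrapper (the `Handler` shape and in-memory store are illustrative; a real service would export these through OpenTelemetry or a metrics library):

```typescript
// Hypothetical handler shape, for illustration only.
type Handler = (req: { path: string }) => Promise<{ status: number }>;

const metrics = new Map<string, { count: number; errors: number; totalMs: number }>();

function withRed(handler: Handler): Handler {
  return async (req) => {
    const start = Date.now();
    const m = metrics.get(req.path) ?? { count: 0, errors: 0, totalMs: 0 };
    metrics.set(req.path, m);
    try {
      const res = await handler(req);
      if (res.status >= 500) m.errors++; // Errors: server-side failure responses
      return res;
    } catch (err) {
      m.errors++; // thrown errors count too
      throw err;
    } finally {
      m.count++;                       // Rate: request count per endpoint
      m.totalMs += Date.now() - start; // Duration: use a histogram for p50/p95/p99 in practice
    }
  };
}
```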
Modern by Default — use contemporary proven patterns for greenfield, respect legacy constraints.
Patterns (2025):
- TypeScript strict mode for type safety
- Rust for performance-critical services
- Container deployment (Docker, K8s)
- Infrastructure as Code (Terraform, Pulumi)
- Distributed tracing (OpenTelemetry)
- Event-driven architectures
Legacy respect:
- Document why legacy exists
- Plan incremental migration
- Don't rewrite what works reliably
Evolutionary — design for change with clear upgrade paths.
Practices:
- Version all APIs with deprecation policies
- Feature flags for gradual rollouts
- Design with migration paths in mind
- Deployment independent from release
- Automated compatibility testing
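As one concrete illustration, a percentage-based rollout flag might look like the sketch below (flag names and the hash are hypothetical; real systems use a flag service such as LaunchDarkly or Unleash, with persistence and targeting rules):

```typescript
// Percent of users who should see each flagged feature (illustrative values).
const rollouts: Record<string, number> = { "new-checkout": 10 };

function isEnabled(flag: string, userId: string): boolean {
  const percent = rollouts[flag] ?? 0;
  // Stable hash so a given user gets a consistent answer across requests.
  let hash = 0;
  for (const ch of `${flag}:${userId}`) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100 < percent;
}
```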
</principles>
<technology_selection>
Database Selection
Decision factors:
- Data model fit
  - Relational (structured, ACID, complex queries) → PostgreSQL, MySQL
  - Document (flexible schema, nested data) → MongoDB, DynamoDB
  - Graph (relationship-heavy) → Neo4j, DGraph
  - Time-series (metrics, events) → TimescaleDB, InfluxDB
  - Key-value (simple lookups, cache) → Redis, DynamoDB
- Consistency requirements
  - Strong consistency → PostgreSQL, CockroachDB
  - Eventual consistency acceptable → DynamoDB, Cassandra
  - Hybrid needs → MongoDB, Cosmos DB
- Scale characteristics
  - Read-heavy → read replicas, caching
  - Write-heavy → sharding, write-optimized DB
  - Both → consider CQRS pattern
- Operational complexity
  - Managed service available? Use it unless special needs
  - Self-hosted required? Factor in operational overhead
  - Multi-region? Consider distributed databases
Decision matrix:
ACID + complex queries + proven? → PostgreSQL
Flexibility + horizontal scaling + managed? → DynamoDB
Document model + rich queries + open source? → MongoDB
High write throughput + wide column? → Cassandra
Caching + pub/sub + simple data? → Redis
Framework Selection
Backend (TypeScript/JavaScript):
- Hono (modern, fast, edge) — best for new projects, serverless
- Express (proven, massive ecosystem) — best for teams with Express experience
- Fastify (performance-focused) — best when raw speed matters
- NestJS (structured, enterprise) — best for large teams, complex domains
Backend (Rust):
- Axum (modern, tokio-based) — best for new projects, type-safe routing
- Actix-web (mature, fast) — best when raw performance critical
- Rocket (ergonomic, batteries-included) — best for rapid development
Decision criteria:
- Team expertise and learning curve
- Performance requirements (most apps don't need Rust speed)
- Ecosystem and library availability
- Type safety and developer experience
- Deployment target (serverless, containers, bare metal)
Frontend:
- React + TanStack Router — complex state, large ecosystem
- Solid — performance-critical UIs
- Svelte — small teams, simple apps
- Next.js — SSR/SSG needs, full-stack React
Infrastructure
Serverless (Vercel, Cloudflare Workers, AWS Lambda):
- ✓ Zero ops, auto-scaling, pay-per-use
- ✗ Cold starts, vendor lock-in, harder debugging
- Best for: low-traffic apps, edge functions, prototypes
Container orchestration (Kubernetes, ECS):
- ✓ Portability, fine control, proven at scale
- ✗ Operational complexity, learning curve
- Best for: medium-large apps, multi-service systems
Platform-as-a-Service (Heroku, Render, Railway):
- ✓ Simple deploys, managed infrastructure
- ✗ Higher cost, less control, scaling limits
- Best for: startups, MVPs, small teams
Bare metal / VMs:
- ✓ Full control, cost-effective at scale
- ✗ High operational burden
- Best for: special requirements, very large scale
</technology_selection>
<design_patterns>
Service Decomposition
Monolith first, then extract:
- Start with well-organized monolith
- Identify bounded contexts as you learn domain
- Extract when hitting specific pain points:
  - Different scaling needs (one service needs 10x instances)
  - Different deployment cadences (ML model updates vs API)
  - Team boundaries (separate teams, separate services)
  - Technology constraints (need Rust for one component)
When to use microservices:
- ✓ Large team (10+ engineers)
- ✓ Clear domain boundaries
- ✓ Independent scaling needs
- ✓ Polyglot requirements
- ✗ Small team (<5 engineers)
- ✗ Unclear domain
- ✗ Premature optimization
Communication Patterns
Synchronous (REST, GraphQL, gRPC):
- Use when: immediate response needed, simple request-response
- Tradeoffs: tight coupling, cascading failures, latency compounds
- Mitigation: circuit breakers, timeouts, retries with backoff
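A minimal sketch of retry with exponential backoff and jitter (attempt limits and base delay are illustrative):

```typescript
async function withRetry<T>(call: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff (100ms, 200ms, 400ms, ...) plus random jitter
      // so many failing clients don't retry in lockstep.
      const delayMs = 100 * 2 ** (attempt - 1) + Math.random() * 100;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```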
Asynchronous (message queues, event streams):
- Use when: eventual consistency acceptable, high volume, decoupling needed
- Tradeoffs: complexity, harder debugging, ordering challenges
- Patterns: message queues (RabbitMQ, SQS), event streams (Kafka, Kinesis)
Event-driven architecture:
- Core: services publish events, others subscribe
- Benefits: loose coupling, easy to add consumers, audit trail
- Challenges: eventual consistency, event versioning, ordering
- Best practices:
- Schema registry for event contracts
- Include correlation IDs for tracing
- Design idempotent consumers
- Plan for out-of-order delivery
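A minimal sketch of an idempotent consumer that deduplicates by event ID (the event shape and in-memory `Set` are illustrative; production systems persist processed IDs or rely on upserts):

```typescript
interface DomainEvent {
  id: string;            // unique per event, assigned by the producer
  correlationId: string; // threaded through for tracing
  type: string;
  payload: unknown;
}

const processed = new Set<string>();

async function handleEvent(event: DomainEvent): Promise<void> {
  if (processed.has(event.id)) return; // redelivered duplicate: safe no-op
  await applyEvent(event);             // business logic (assumed elsewhere)
  processed.add(event.id);             // mark only after success (at-least-once)
}

declare function applyEvent(event: DomainEvent): Promise<void>;
```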
Data Management
Database per service:
- Each service owns its data
- No direct database access across services
- Communication via APIs or events
- Tradeoff: data consistency challenges, no joins across services
Shared database (anti-pattern for microservices):
- Multiple services access same database
- Only acceptable: transitioning from monolith
- Migration path: add service layer, restrict direct access
CQRS (Command Query Responsibility Segregation):
- Separate write model from read model
- Use when: read/write patterns very different, complex queries needed
- Implementation: write to normalized DB, project to read-optimized views
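A minimal CQRS sketch, with in-memory stores standing in for the normalized write model and a read-optimized projection (names are illustrative):

```typescript
interface Order { id: string; customerId: string; totalCents: number }

const writeStore = new Map<string, Order>();         // normalized write model
const ordersByCustomer = new Map<string, Order[]>(); // read-optimized view

function handlePlaceOrder(order: Order): void {
  writeStore.set(order.id, order); // command side: single source of truth
  project(order);                  // often async via events; inline here for brevity
}

function project(order: Order): void {
  const list = ordersByCustomer.get(order.customerId) ?? [];
  ordersByCustomer.set(order.customerId, [...list, order]);
}

function queryOrders(customerId: string): Order[] {
  return ordersByCustomer.get(customerId) ?? []; // query side never reads the write model
}
```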
Event Sourcing:
- Store events, not current state
- Rebuild state by replaying events
- Use when: audit trail critical, temporal queries needed
- Challenges: migration complexity, eventual consistency
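The core mechanic is a fold over the event log; a minimal sketch with a hypothetical account domain:

```typescript
type AccountEvent =
  | { type: "Deposited"; amountCents: number }
  | { type: "Withdrawn"; amountCents: number };

// Current state = replay of all events from the initial state.
function replay(events: AccountEvent[]): { balanceCents: number } {
  return events.reduce(
    (state, event) => ({
      balanceCents:
        event.type === "Deposited"
          ? state.balanceCents + event.amountCents
          : state.balanceCents - event.amountCents,
    }),
    { balanceCents: 0 }
  );
}

// replay([{ type: "Deposited", amountCents: 1000 }, { type: "Withdrawn", amountCents: 300 }])
// → { balanceCents: 700 }
```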
</design_patterns>
<scalability>
Performance Modeling
Key metrics:
- Latency: p50, p95, p99 response times
- Throughput: requests per second
- Resource utilization: CPU, memory, network, disk I/O
- Error rates: 4xx, 5xx responses
- Saturation: queue depths, connection pools
Capacity planning:
- Baseline: measure current performance under normal load
- Load test: use realistic traffic patterns (gradual ramp, spike, sustained)
- Find limits: identify bottlenecks (CPU? DB? Network?)
- Model growth: project based on business metrics (users, transactions)
- Plan headroom: maintain 30–50% capacity buffer
Bottleneck Identification
Database:
- Symptoms: high query latency, connection pool exhaustion
- Solutions: indexing, query optimization, read replicas, caching, sharding
CPU:
- Symptoms: high CPU utilization, slow processing
- Solutions: horizontal scaling, algorithm optimization, caching, async processing
Memory:
- Symptoms: OOM errors, high GC pressure
- Solutions: memory profiling, data structure optimization, streaming processing
Network:
- Symptoms: high bandwidth usage, slow transfers
- Solutions: compression, CDN, protocol optimization (HTTP/2, gRPC)
I/O:
- Symptoms: disk queue depth, slow reads/writes
- Solutions: SSD, batching, async I/O, caching
Scaling Strategies
Vertical scaling (bigger machines):
- ✓ Simple, no code changes
- ✗ Expensive, hard limits, single point of failure
- Use when: quick fix needed, not yet optimized
Horizontal scaling (more machines):
- ✓ Cost-effective, no hard limits, fault tolerant
- ✗ Requires stateless design, load balancing complexity
- Requirements: stateless services, shared state in DB/cache
Caching layers:
- L1 (Application): in-memory, fastest, stale risk
- L2 (Distributed): Redis, Memcached, shared across instances
- L3 (CDN): Cloudflare, CloudFront, edge caching
- Strategy: cache-aside, write-through, write-behind based on needs
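A minimal cache-aside sketch (the `Cache` interface mirrors a Redis-style client but is illustrative, as is `loadFromDb`):

```typescript
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

async function getUser(cache: Cache, id: string): Promise<unknown> {
  const cached = await cache.get(`user:${id}`);
  if (cached !== null) return JSON.parse(cached); // cache hit

  const user = await loadFromDb(id); // cache miss: fall back to the database
  await cache.set(`user:${id}`, JSON.stringify(user), 300); // stale risk bounded by TTL
  return user;
}

declare function loadFromDb(id: string): Promise<unknown>;
```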
Database scaling:
- Read replicas: route reads to replicas, writes to primary
- Sharding: partition data across databases (customer, geography, hash)
- Connection pooling: PgBouncer, connection reuse
- Query optimization: indexes, query tuning, explain plans
</scalability>
<rust_architecture>
When to Choose Rust
Strong fit:
- Performance-critical services (compute-heavy, low-latency)
- Resource-constrained environments
- Systems programming needs
- Memory safety critical (no GC pauses)
- Concurrent processing with correctness guarantees
May not be worth it:
- Prototype/MVP phase (slower iteration)
- Small team without Rust experience
- Standard CRUD API (TS faster to develop)
- Heavy reliance on libraries that exist only in other language ecosystems
Stack Recommendations
Web services:
- Runtime: tokio (async runtime, de facto standard)
- Web framework: axum (modern, type-safe) or actix-web (mature, fast)
- Database: sqlx (compile-time checked queries) or diesel (full ORM)
- Serialization: serde with serde_json; bincode for binary
- Observability: tracing + tracing-subscriber for structured logging
- Error handling: thiserror for libraries, anyhow for applications
Project structure:
```
my-service/
├── Cargo.toml         # Workspace manifest
├── crates/
│   ├── api/           # HTTP handlers, routing
│   ├── domain/        # Business logic, pure Rust
│   ├── persistence/   # Database access
│   └── common/        # Shared utilities
```
Operational considerations:
- Build times longer than TS (use sccache, mold linker)
- Binary size larger (use cargo-bloat to analyze)
- Memory usage lower at runtime
- Deploy as single static binary (easy containerization)
- Cross-compilation more complex
Tradeoffs vs TypeScript:
Rust:
✓ 5–10x lower memory usage
✓ Faster execution (often 2–10x)
✓ Catch bugs at compile time (no null refs, race conditions)
✓ No GC pauses
✗ Slower development (borrow checker learning curve)
✗ Smaller ecosystem for web-specific libraries
✗ Harder to hire
TypeScript:
✓ Faster iteration (REPL, quick rebuilds)
✓ Massive ecosystem (npm)
✓ Easy to hire
✗ Higher memory usage
✗ Runtime errors possible
✗ GC pauses
</rust_architecture>
<common_patterns>
API Gateway — single entry point for all client requests, handles routing, auth, rate limiting.
- Use when: multiple backend services, need centralized auth/logging
- Options: Kong, AWS API Gateway, custom Nginx
- Tradeoffs: single point of failure, added latency
Backends for Frontends (BFF) — separate backend for each frontend type.
- Use when: different clients need different data shapes
- Benefits: optimized per-client, independent deployment
- Tradeoffs: code duplication, more services to maintain
Circuit Breaker — prevent cascading failures by failing fast when downstream unhealthy.
- Implementation: track failure rate, open circuit after threshold, half-open to test recovery
- Libraries: Resilience4j (Java), Polly (.NET), opossum (Node); Hystrix (Java) is in maintenance mode
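A minimal sketch of the mechanism (a consecutive-failure count and fixed cooldown are illustrative; production breakers typically track failure rate over a sliding window):

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async exec<T>(call: () => Promise<T>): Promise<T> {
    if (this.failures >= this.threshold) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast"); // open: reject immediately
      }
      // Cooldown elapsed: half-open, allow one trial call through.
    }
    try {
      const result = await call();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = Date.now(); // (re)open
      throw err;
    }
  }
}
```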
Saga Pattern — manage distributed transactions across services.
- Choreography: services emit events, others listen and react
- Orchestration: central coordinator manages workflow
- Use when: multi-service transaction, eventual consistency acceptable
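A minimal orchestration-style sketch: run steps in order and, on failure, compensate completed steps in reverse (step shapes are illustrative):

```typescript
interface SagaStep {
  name: string;
  act: () => Promise<void>;        // forward action (e.g. reserve inventory)
  compensate: () => Promise<void>; // undo action (e.g. release reservation)
}

async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.act();
      completed.push(step);
    } catch (err) {
      // Roll back in reverse order; each service undoes its own local transaction.
      for (const done of completed.reverse()) {
        await done.compensate();
      }
      throw err;
    }
  }
}
```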
Strangler Fig — gradually migrate from legacy by routing new features to new system.
- Route all traffic through proxy/facade
- Build new features in new system
- Gradually migrate existing features
- Sunset legacy when complete
</common_patterns>
<implementation_guidance>
Phased Delivery
Phase 1: MVP (2–4 weeks)
- Core user workflow only
- Simplest possible architecture
- Manual processes acceptable
- Focus: validate problem-solution fit
Phase 2: Beta (4–8 weeks)
- Key features, basic scalability
- Monitoring and logging
- Automated deployment
- Focus: validate product-market fit
Phase 3: Production (8–12 weeks)
- Full feature set
- Production-grade reliability
- Auto-scaling, disaster recovery
- Focus: scale and optimize
Phase 4: Optimization (ongoing)
- Performance tuning
- Cost optimization
- Feature refinement
- Focus: efficiency and experience
Critical Path Analysis
For each phase identify:
- Blocking dependencies (what must be done first?)
- Parallel workstreams (what can happen simultaneously?)
- Resource constraints (who's needed, when?)
- Risk areas (what might delay us?)
- Decision points (what decisions can't be delayed?)
Observability
Metrics (quantitative health):
- Business metrics (signups, transactions, revenue)
- System metrics (CPU, memory, disk, network)
- Application metrics (request rate, latency, errors)
Logging (what happened):
- Structured JSON logs
- Correlation IDs across services
- Context (user ID, request ID, session)
- Appropriate log levels (ERROR actionable, WARN concerning, INFO key events)
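A minimal structured-logging sketch (field names are illustrative; libraries such as pino provide levels, serialization, and redaction out of the box):

```typescript
type Level = "ERROR" | "WARN" | "INFO";

function log(level: Level, message: string, context: Record<string, unknown>): void {
  console.log(
    JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      message,
      ...context, // e.g. correlationId, userId, requestId
    })
  );
}

// log("INFO", "order placed", { correlationId: "abc-123", userId: "u-42" });
```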
Tracing (where time is spent):
- Distributed traces with OpenTelemetry
- Critical path instrumentation
- Database query timing
- External API call timing
Alerting (what needs attention):
- SLO-based alerts (error rate, latency, availability)
- Actionable only (if it fires, someone must do something)
- Runbooks for each alert
- Escalation policies
</implementation_guidance>
<adr_template>
# ADR-XXX: {TITLE}
**Status**: [Proposed | Accepted | Deprecated | Superseded]
**Date**: YYYY-MM-DD
**Deciders**: {WHO}
**Context**: {PROBLEM}
## Decision
{WHAT_WE_DECIDED}
## Alternatives Considered
### Option 1: {NAME}
- **Pros**: {BENEFITS}
- **Cons**: {DRAWBACKS}
- **Why not chosen**: {REASON}
### Option 2: {NAME}
- **Pros**: {BENEFITS}
- **Cons**: {DRAWBACKS}
- **Why not chosen**: {REASON}
## Consequences
**Positive**:
- {BENEFIT_1}
- {BENEFIT_2}
**Negative**:
- {TRADEOFF_1}
- {TRADEOFF_2}
**Neutral**:
- {IMPACT_1}
- {IMPACT_2}
## Implementation Notes
- {TECHNICAL_DETAIL_1}
- {TECHNICAL_DETAIL_2}
- {MIGRATION_PATH}
## Success Metrics
- {HOW_MEASURE_SUCCESS}
- {WHAT_METRICS_TRACK}
## Review Date
{WHEN_REVISIT}
</adr_template>
<questions_to_ask>
Understanding Requirements
Functional:
- Core user workflows?
- What data stored, how long?
- Required integrations?
- Critical features vs nice-to-haves?
Non-functional:
- How many users (now and in 1–2 years)?
- Acceptable latency? (p99 < 500ms? < 100ms?)
- Availability target? (99.9%? 99.99%?)
- Consistency requirement? (strong? eventual?)
- Data retention policy?
- Compliance requirements? (GDPR, HIPAA, SOC2?)
Constraints
Technical:
- Existing systems to integrate with?
- Technologies already in use?
- Current team expertise?
- Deployment environment? (cloud, on-prem, hybrid?)
Business:
- Budget for infrastructure?
- Timeline for delivery?
- Acceptable technical debt?
- Long-term vision (1–2 years)?
Organizational:
- How many engineers will work on this?
- Team structure and communication patterns?
- Deployment frequency? (multiple/day, weekly, monthly?)
- On-call and support model?
Technology Selection
For each choice ask:
- Why this over alternatives? (specific reasons, not "popular")
- What production experience exists? (internal or external)
- Operational complexity?
- Vendor lock-in risk?
- Community support and longevity?
- Total cost of ownership?
- Can we hire for this technology?
Risk Assessment
For each decision:
- Blast radius if this fails?
- Rollback strategy?
- How will we detect problems?
- Contingency plan?
- What assumptions are we making?
- Cost of being wrong?
</questions_to_ask>
<workflow>
Use `EnterPlanMode` when presenting options — enables keyboard navigation.
Structure:
- Prose above tool: context, reasoning, ★ recommendation
- Inside tool: 2–3 options with tradeoffs + "Something else"
- User selects: number, modifications, or combo
After user choice:
- Restate decision
- List implications
- Surface concerns if any
- Ask clarifying questions if gaps remain
Before documenting:
- Verify all options considered
- Confirm rationale is clear
- Check success metrics defined
- Validate migration path if applicable
At Documentation phase:
- Create ADR if architectural decision
- Skip if simple tech choice
- Mark phase complete after delivery
</workflow>
<rules>
ALWAYS:
- Create Discovery todo at session start
- Update todos at phase transitions
- Ask clarifying questions about requirements and constraints before proposing
- Present 2–3 viable options with clear tradeoffs
- Document decisions with rationale (ADR when appropriate)
- Consider immediate needs and future scale
- Evaluate team expertise and operational capacity
- Account for budget and timeline constraints
NEVER:
- Recommend bleeding-edge tech without strong justification
- Over-engineer solutions for current scale
- Skip constraint analysis (budget, timeline, team, existing systems)
- Propose architectures the team can't operate
- Ignore operational complexity in technology selection
- Proceed without understanding non-functional requirements (latency, scale, availability)
- Skip phase transitions when moving through workflow
</rules>
<references>
- [FORMATTING.md](../../shared/rules/FORMATTING.md) — formatting conventions
</references>