Architecture
cs-lewis-backend is a single-service Python application — Django + Django Ninja + Wagtail — running on AWS ECS Fargate, serving a Flutter mobile app and the editorial CMS from the same codebase. The backing data stores are Postgres (RDS), Redis (ElastiCache), and S3 with CloudFront edge cache for media.
This document describes what is built and how it is wired. The why behind each component choice lives in ADR-0001.
System Overview
              Internet
                  │
            ┌─────▼─────┐
            │ Cloudflare│  (WAF + DDoS + rate limit)
            └─────┬─────┘
                  │
       ┌──────────┴──────────┐
       │                     │
       ▼                     ▼
  ┌──────────┐         ┌────────────┐
  │CloudFront│         │    ALB     │
  │ (media)  │         │            │
  └────┬─────┘         └─────┬──────┘
       │                     │
       ▼                     ▼
  ┌──────────┐    ┌──────────────────────┐
  │    S3    │    │     ECS Fargate      │
  │ audio +  │    │  ┌──────────────┐    │
  │ images   │    │  │ Django (web) │    │  ← Django Ninja API
  └──────────┘    │  └──────────────┘    │    + Wagtail Admin
                  │  ┌──────────────┐    │    + Django Admin
                  │  │ Celery (jobs)│    │
                  │  └──────────────┘    │
                  └─────┬───────────┬────┘
                        │           │
                        ▼           ▼
                   ┌──────────┐ ┌────────────┐
                   │   RDS    │ │ElastiCache │
                   │ Postgres │ │   Redis    │
                   └──────────┘ └────────────┘
                               │
                       ┌───────▼────────┐
                       │ External       │
                       │ • Anthropic    │
                       │ • ElevenLabs   │
                       │ • Substack     │
                       │ • Sentry       │
                       └────────────────┘
A single web tier sits behind Cloudflare (WAF + DDoS + rate limit) for the API and CloudFront for media. CDN-fronting of the API is deferred to Phase 2 — Redis at origin is sufficient at MVP scale. Celery workers run as a separate Fargate service so background workloads cannot starve the API. Postgres holds the content graph and user data; Redis caches API responses and brokers Celery jobs; S3 holds binary media.
Services & Components
Django web service (ECS Fargate)
- Process: gunicorn + uvicorn workers serving the Django application.
- Surfaces inside this process:
  - Django Ninja API at /api/v1/*: Flutter mobile contract, async views, auto-generated OpenAPI.
  - Wagtail Admin at /cms/*: editor authoring, StreamField content, scheduled publishing, workflows.
  - Django Admin at /admin/*: user management, reflection moderation queue (Phase 2), operational tooling.
- Service layer: Wireup is the DI container that wires the API to the service layer. All swappable services (Ranker, AI clients, ElevenLabs, Substack sync) are registered once in the container; views and Celery tasks declare their dependencies by signature and receive them via @inject / @inject_app. This is what makes the Postgres GIN → Meilisearch Ranker swap mechanical: one registration change, zero call-site changes.
- Scaling: horizontal on Fargate; ALB fan-out to multiple tasks.
- Reads from: Postgres (Django ORM), Redis (cache).
- Writes to: Postgres, Redis (cache writes), Celery queue (job enqueue), Sentry (errors).
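The registration-based swap described in the service-layer bullet can be illustrated without Wireup itself. The following is a minimal stdlib sketch, not the project's container wiring: `Ranker`, `PostgresGinRanker`, `MeilisearchRanker`, `Container`, and `related_passages` are hypothetical names, and both rankers return canned data instead of querying a backend.

```python
from typing import Callable, Protocol


class Ranker(Protocol):
    """Contract that call sites depend on (hypothetical shape)."""

    def related_to(self, passage_id: int, filters: dict) -> list[int]: ...


class PostgresGinRanker:
    """Stand-in for the MVP ranker; a real one would run the GIN overlap query."""

    def related_to(self, passage_id: int, filters: dict) -> list[int]:
        return [passage_id + 1, passage_id + 2]


class MeilisearchRanker:
    """Stand-in for the Phase 2 ranker; a real one would query Meilisearch."""

    def related_to(self, passage_id: int, filters: dict) -> list[int]:
        return [passage_id + 10, passage_id + 20]


class Container:
    """Toy registry: maps an interface to a factory and resolves it on demand."""

    def __init__(self) -> None:
        self._factories: dict[type, Callable[[], object]] = {}

    def register(self, interface: type, factory: Callable[[], object]) -> None:
        self._factories[interface] = factory

    def resolve(self, interface: type):
        return self._factories[interface]()


def related_passages(container: Container, passage_id: int) -> list[int]:
    """Call site: knows only the Ranker interface, never the implementation."""
    ranker = container.resolve(Ranker)
    return ranker.related_to(passage_id, filters={})


container = Container()
container.register(Ranker, PostgresGinRanker)
mvp_result = related_passages(container, 1)

# The swap is a single registration change; related_passages is untouched.
container.register(Ranker, MeilisearchRanker)
phase2_result = related_passages(container, 1)
```

The point of the sketch is that `related_passages` never names a concrete ranker, so replacing the backend touches only the registration line.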
Celery worker service (ECS Fargate)
- Process: Celery workers and a Celery Beat scheduler, deployed as separate Fargate services.
- Jobs handled at MVP:
  - AI tag drafting on content ingest (Anthropic SDK).
  - ElevenLabs audio pre-generation per published passage.
  - Substack sync polling.
  - Scheduled publishing for Daily Drop via Celery Beat.
- Phase 2 additions: Meilisearch reindex on publish; receipt validation hooks; semantic-similarity index updates if pgvector is adopted.
- Scaling: horizontal; queue depth drives autoscaling.
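As a rough illustration of "queue depth drives autoscaling", the arithmetic behind a backlog-per-task policy might look like the sketch below. The function name, the target backlog of 50 jobs per task, and the ceiling of 10 tasks are assumptions for illustration; only the floor of 2 tasks comes from this document. The actual policy would live in the autoscaling configuration, not in application code.

```python
import math


def desired_worker_tasks(
    queue_depth: int,
    target_backlog_per_task: int = 50,  # assumed tuning value
    min_tasks: int = 2,                 # MVP floor stated in this document
    max_tasks: int = 10,                # assumed ceiling
) -> int:
    """Return how many worker tasks a given backlog calls for."""
    if queue_depth < 0:
        raise ValueError("queue_depth must be non-negative")
    # Enough tasks that each one owns at most target_backlog_per_task jobs.
    needed = math.ceil(queue_depth / target_backlog_per_task)
    # Clamp to the configured floor and ceiling.
    return max(min_tasks, min(max_tasks, needed))


idle = desired_worker_tasks(0)
busy = desired_worker_tasks(120)
flooded = desired_worker_tasks(10_000)
```

With these assumed values, an empty queue stays at the floor, 120 queued jobs call for 3 tasks, and a very deep queue is capped at the ceiling.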
Postgres (AWS RDS, managed)
- Holds: Wagtail content (pages, snippets, StreamField payloads), Django models (users, reflections, daily drops, audit), tag arrays with GIN indexes for overlap queries.
- Configuration: Postgres 16, multi-AZ in prod, single-AZ in staging, automated daily snapshots, PITR (35-day window in prod).
- Extensions: pg_trgm for fuzzy text matching where useful; pgvector planned for Phase 2 semantic similarity if it earns its place.
Redis (AWS ElastiCache, managed)
- Holds: API response cache, Celery broker, Celery result backend (short TTL), JWT denylist and refresh-token revocation, rate-limit counters.
- Configuration: Redis 7, single-shard at MVP, multi-AZ replica in prod.
S3 + CloudFront
- Holds: ElevenLabs-generated audio files, AI-generated featured images, editor uploads.
- Access: read-public at MVP (all content is guest-accessible). Signed URLs in Phase 2 for tier-gated audio.
- Cache: CloudFront edge in front of S3; long TTL on immutable content keyed by content hash.
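One way to get "immutable content keyed by content hash" is to derive the object key from the file bytes, so a changed file always gets a new key and the old one can be cached indefinitely. This is a sketch under assumptions: the `media/` prefix, the key layout, and the one-year `Cache-Control` value are illustrative choices, not settings taken from the project.

```python
import hashlib
from pathlib import PurePosixPath


def content_addressed_key(data: bytes, original_name: str, prefix: str = "media") -> str:
    """Build an object key that changes whenever the bytes change."""
    digest = hashlib.sha256(data).hexdigest()
    suffix = PurePosixPath(original_name).suffix.lower()  # e.g. ".mp3"
    # The two-character shard directory is a layout choice, not a requirement.
    return f"{prefix}/{digest[:2]}/{digest}{suffix}"


# Assumed header for immutable objects: cache for one year, never revalidate.
IMMUTABLE_CACHE_CONTROL = "public, max-age=31536000, immutable"

key_a = content_addressed_key(b"audio bytes v1", "Reading.MP3")
key_a_again = content_addressed_key(b"audio bytes v1", "reading.mp3")
key_b = content_addressed_key(b"audio bytes v2", "reading.mp3")
```

Identical bytes map to the same key regardless of the upload name's case, while any change to the bytes produces a different key, which is what makes a long edge TTL safe.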
Request Flow
Guest read — today's passage
- Mobile client → WAF → ALB → Django Ninja.
- Redis cache lookup on today:{date}:{register}. HIT returns immediately.
- MISS: Postgres query for today's DailyDrop joined to its Passage → Pydantic response → cache write → response.
- Mobile fetches audio + featured image directly via CloudFront URLs.
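The hit/miss path above is a cache-aside read. The sketch below shows the pattern with an in-memory stand-in for Redis; `TTLCache`, `todays_passage`, `fake_db_load`, the one-hour TTL, and the `"devotional"` register value are all hypothetical, and only the `today:{date}:{register}` key format comes from this document.

```python
import time
from datetime import date
from typing import Callable, Optional


class TTLCache:
    """In-memory stand-in for Redis GET and SET-with-expiry."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired entries count as a miss
            return None
        return value

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        self._store[key] = (time.monotonic() + ttl_seconds, value)


def todays_passage(
    cache: TTLCache,
    day: date,
    register: str,
    load_from_db: Callable[[date, str], dict],
    ttl_seconds: float = 3600.0,  # assumed TTL
) -> dict:
    """Cache-aside read: return on HIT, otherwise load, store, and return."""
    key = f"today:{day.isoformat()}:{register}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    payload = load_from_db(day, register)
    cache.set(key, payload, ttl_seconds)
    return payload


db_calls = []


def fake_db_load(day: date, register: str) -> dict:
    """Pretend Postgres query; records each call so misses can be counted."""
    db_calls.append((day, register))
    return {"passage_id": 7, "register": register}


cache = TTLCache()
launch_day = date(2026, 10, 1)
first = todays_passage(cache, launch_day, "devotional", fake_db_load)
second = todays_passage(cache, launch_day, "devotional", fake_db_load)
```

The second call is served from the cache, so the loader runs only once for the same date and register.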
Guest read — dive deeper from a passage
- Mobile → WAF → ALB → Django Ninja.
- Ranker.related_to(passage_id, filters) is invoked.
- At MVP, the ranker is backed by Postgres GIN array overlap:
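-- Overlap is counted with INTERSECT because core Postgres has no
-- array-intersection operator (& exists only for int[] via intarray).
SELECT id, label,
       cardinality(ARRAY(
         SELECT unnest(motif_tags)
         INTERSECT
         SELECT unnest(:seed_motifs)
       )) AS overlap
FROM passage
WHERE motif_tags && :seed_motifs            -- GIN-indexable overlap test
  AND id != :seed_id                        -- exclude the seed passage
  AND register = ANY(:register_filter)
ORDER BY overlap DESC
LIMIT 20;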
- Response cached under related:{passage_id}:{filters_hash}; TTL tuned to publish cadence.
- Phase 2 swap: the Ranker implementation moves to Meilisearch without changing call-sites.
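To make the ranking and cache-key steps concrete, here is a small Python mirror of the overlap ranking together with one possible way to build the `related:{passage_id}:{filters_hash}` key. It is a sketch only: the function names, the 16-character truncated SHA-256, and the sample passages and tag values are assumptions, and the real ranking runs in Postgres rather than in application code.

```python
import hashlib
import json


def filters_hash(filters: dict) -> str:
    """Stable short hash: dict key order must not change the cache key."""
    canonical = json.dumps(filters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def related_cache_key(passage_id: int, filters: dict) -> str:
    return f"related:{passage_id}:{filters_hash(filters)}"


def rank_by_overlap(seed_id, seed_motifs, passages, registers, limit=20):
    """Filter candidates, count shared motif tags, sort by count descending."""
    scored = []
    for p in passages:
        if p["id"] == seed_id or p["register"] not in registers:
            continue
        overlap = len(seed_motifs & set(p["motif_tags"]))
        if overlap > 0:  # same effect as the && predicate in the query
            scored.append((p["id"], overlap))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical sample data; register and tag values are illustrative only.
sample_passages = [
    {"id": 1, "register": "essay", "motif_tags": ["joy", "longing"]},
    {"id": 2, "register": "essay", "motif_tags": ["joy", "longing", "grief"]},
    {"id": 3, "register": "essay", "motif_tags": ["joy"]},
    {"id": 4, "register": "letter", "motif_tags": ["joy", "longing"]},
    {"id": 5, "register": "essay", "motif_tags": ["pain"]},
]

ranked = rank_by_overlap(1, {"joy", "longing"}, sample_passages, {"essay"})
key_one = related_cache_key(1, {"register": ["essay"], "limit": 20})
key_two = related_cache_key(1, {"limit": 20, "register": ["essay"]})
```

Canonicalising the filters before hashing means two requests with the same filters in a different order share one cache entry.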
Editor write: draft, review, publish
- Editor authors a passage in Wagtail admin and saves a draft.
- Post-save signal enqueues draft_tags_for_passage(passage_id) on Celery.
- Worker calls the Anthropic SDK with the passage text + tag taxonomy prompt, which returns drafted tags with confidence scores.
- Worker writes the draft tags to the passage (review_status=draft) and surfaces a "needs review" indicator in the admin.
- Editor reviews, edits, and flips the workflow state to reviewed.
- Wagtail's workflow gate then allows publish. On publish, a post-publish signal:
  - Invalidates the relevant Redis cache keys.
  - Enqueues generate_audio_reading(passage_id) → ElevenLabs → S3.
  - (Phase 2) Enqueues reindex_passage_in_search(passage_id).
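The review gate and the post-publish fan-out in the editor flow can be sketched as a plain function. This is not Wagtail's workflow API: `Passage`, `Effects`, and `publish` are hypothetical stand-ins, the cache key value is illustrative, and side effects are recorded in lists instead of being sent to Redis or Celery.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    id: int
    review_status: str = "draft"  # moves from "draft" to "reviewed"
    published: bool = False


@dataclass
class Effects:
    """Records side effects instead of calling Redis or Celery."""

    invalidated_keys: list = field(default_factory=list)
    enqueued_jobs: list = field(default_factory=list)


def publish(passage: Passage, effects: Effects, cache_keys: list) -> None:
    """Workflow gate followed by the post-publish fan-out."""
    if passage.review_status != "reviewed":
        raise PermissionError("tags must be reviewed before publish")
    passage.published = True
    effects.invalidated_keys.extend(cache_keys)
    effects.enqueued_jobs.append(("generate_audio_reading", passage.id))


effects = Effects()
passage = Passage(id=42)
stale_keys = ["today:2026-10-01:devotional"]  # illustrative key

# Publishing while still in draft is rejected by the gate.
try:
    publish(passage, effects, stale_keys)
    gate_blocked = False
except PermissionError:
    gate_blocked = True

# After editor review the same call succeeds and triggers the fan-out.
passage.review_status = "reviewed"
publish(passage, effects, stale_keys)
```

Keeping the gate check and the fan-out in one place makes it harder to publish a passage without also invalidating its cache entries and queuing its audio.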
Data Stores
| Store | Holds | Backup / DR |
| --- | --- | --- |
| Postgres (RDS) | Wagtail content, Django models, tag indexes | Automated daily snapshots + PITR (35-day window in prod); multi-AZ failover |
| Redis (ElastiCache) | Cache, Celery broker, sessions, rate-limit counters | Daily snapshot to S3; cache loss tolerated (rebuilt on miss); broker loss requires job replay |
| S3 | Media (audio, images, uploads) | Object versioning enabled; cross-region replication TBD for prod |
See docs/cms-architecture.md for the content data model and collection schemas.
Network Topology
- VPC: dedicated VPC per environment with public + private subnets across two AZs.
- Public subnets: ALB, NAT gateway.
- Private subnets: ECS tasks (web + Celery), RDS, ElastiCache.
- Security groups: ALB → ECS (443); ECS → RDS (5432); ECS → Redis (6379); ECS → S3 via VPC gateway endpoint (no NAT egress for S3).
- Egress: outbound to Anthropic, ElevenLabs, Substack, Sentry via NAT gateway.
- DNS: Route 53; domain to be confirmed with client.
- TLS: ACM-issued certs on ALB and CloudFront; TLS 1.2+ enforced.
Environments
| Environment | Purpose | Differences from prod |
| --- | --- | --- |
| Local (dev) | Developer workstation | Docker Compose: Django + Postgres + Redis containers; LocalStack or direct S3 for media |
| Staging | Pre-prod testing, editor preview, client review | Single-AZ RDS, smaller Fargate task sizes, no autoscaling, open or no WAF, separate Sentry environment |
| Production | Public launch (~Oct 2026) | Multi-AZ RDS with PITR, autoscaling ECS, WAF enabled, full Sentry, CloudFront enabled |
Infrastructure as Code
| | Choice | Notes |
| --- | --- | --- |
| Tool | Terraform (proposed) | AWS-aligned stack; widely supported; team familiarity |
| Layout | Per-stack modules: network/, compute/, data/, secrets/, edge/ | Each module independently planned and applied |
| Location | infra/ in this repo | Single repo keeps app code and infra moving together; PRs can span both |
Detailed IaC choices land in deployment.md once infrastructure is provisioned.
External Dependencies
| Service | Used for | Phase | Failure handling |
| --- | --- | --- | --- |
| Anthropic / OpenAI | AI tag drafting on ingest | MVP | Celery retry with exponential backoff; failures logged to Sentry; editor can re-trigger from admin |
| ElevenLabs | Audio pre-generation (cloned voice) | MVP | Celery retry; failed jobs visible in admin; audio is non-blocking (text-only fallback) |
| Substack | Article sync into the CMS | MVP | Celery polling job; per-article idempotent insert; sync failures non-fatal |
| Bookstore affiliate links | Purchase handoff (source ladder) | MVP | Stored as field on Work; no runtime dependency |
| HarperCollins | Content rights (no Narnia audio) | MVP | Enforced via Work.rights flag; chapter audio gated at the API |
| App Store / Play Store | Apple/Google Sign-In, account deletion (V1); IAP (Phase 2) | V1 + Phase 2 | Receipt validation server-side; account-deletion endpoint must be production-tested before App Store submission |
| Sentry | Error tracking | MVP | Soft dependency: application continues if Sentry is unreachable |
| AWS (RDS, ElastiCache, S3, CloudFront, ECS, Secrets Manager) | Managed services | MVP | AWS SLAs; multi-AZ for prod data tier |
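The "retry with exponential backoff" handling listed for the AI and audio providers follows a simple schedule, sketched here in plain Python. In Celery this is normally configured through task options such as `autoretry_for` and `retry_backoff` rather than hand-written; the helper names, the 2-second base, the 60-second cap, and the retry count below are assumptions, and jitter is omitted to keep the example deterministic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base_seconds: float = 2.0,
                   cap_seconds: float = 60.0) -> list:
    """Delay before each retry: base * 2^attempt, capped."""
    return [min(cap_seconds, base_seconds * (2 ** attempt))
            for attempt in range(max_retries)]


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception until retries are exhausted."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")


attempts = {"count": 0}


def flaky_provider_call() -> str:
    """Hypothetical upstream call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient upstream failure")
    return "ok"


slept = []  # capture delays instead of actually sleeping
result = call_with_retry(flaky_provider_call, max_retries=3, sleep=slept.append)
schedule = backoff_delays(5, base_seconds=2.0, cap_seconds=20.0)
```

Injecting `sleep` keeps the sketch testable without real waiting; a production version would usually add random jitter so that many failed jobs do not retry at the same instant.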
Security Architecture
| Concern | Mitigation |
| --- | --- |
| CSRF (admin) | Django CSRF middleware enabled on all admin POSTs |
| XSS | Django template auto-escaping; Wagtail/Draftail sanitisation; reflection input HTML-escaped before render |
| SQL injection | Django ORM parameterised queries; no raw SQL on user input |
| Secrets | AWS Secrets Manager / SSM Parameter Store; never in code or container images |
| Transport | TLS 1.2+ via ALB and CloudFront; HSTS headers set by Django |
| Rate limiting | Redis-backed (django-ratelimit) on write endpoints; WAF rate limit on the public API |
| Auth (V1) | django-allauth (headless) + Apple/Google Sign-In + JWT; password hashing via Django defaults (Argon2/PBKDF2) |
| Account deletion (V1) | Dedicated endpoint with hard-delete + anonymise strategy per FK; must be production-tested before App Store submission |
| PII handling | V1 PII limited to email + pseudonym (accounts ship in V1); encryption at rest via RDS |
| Audit | Wagtail page history on content edits; Django admin log on user actions; structured request logs |
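For readers unfamiliar with counter-based rate limiting, the idea behind the Redis-backed limit on write endpoints can be sketched as a fixed-window counter. This is an in-memory illustration, not django-ratelimit's implementation: `FixedWindowLimiter`, the limit of 2 requests, and the 60-second window are assumptions, and a Redis version would use an atomic increment with a key expiry instead of a dict.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # Keyed by (client, window index). Old windows are never purged here;
        # with Redis, a key expiry would clean them up.
        self._counters: dict[tuple[str, int], int] = {}

    def allow(self, client_id: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        key = (client_id, window)
        count = self._counters.get(key, 0) + 1  # Redis analogue: INCR
        self._counters[key] = count
        return count <= self.limit


limiter = FixedWindowLimiter(limit=2, window_seconds=60)
decisions = [limiter.allow("user-1", now=t) for t in (0.0, 1.0, 2.0, 61.0)]
other_client = limiter.allow("user-2", now=2.0)
```

The third request inside the first window is rejected, and the counter resets once the clock crosses into the next window; each client is tracked independently.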
Scaling Characteristics
| Tier | At MVP | Headroom |
| --- | --- | --- |
| ALB | Auto-managed | n/a |
| Django web (Fargate) | 2 tasks; autoscale on CPU + request count | Horizontal scale to dozens; Phase 2 may split read/write tiers |
| Celery workers | 2 tasks per queue; autoscale on queue depth | Add dedicated queues per job type as needed |
| Postgres | db.t4g.medium (single in staging, multi-AZ in prod) | Vertical scale first; read replicas in Phase 2 if needed |
| Redis | cache.t4g.micro | Vertical scale first; cluster mode in Phase 2 if needed |
| S3 / CloudFront | Auto-managed | n/a |

| Surface | p95 target | Notes |
| --- | --- | --- |
| Today's Passage API | < 100 ms (cache hit) / < 300 ms (cache miss) | Cache TTL aligned to Daily Drop cadence |
| Dive-deeper API | < 200 ms (cache hit) / < 500 ms (cache miss) | Postgres GIN at MVP; Meilisearch swap if p95 degrades |
| Editor publish → reader visibility | < 30 s | Cache invalidation + downstream Celery jobs |
| Audio playback start | < 1 s | CloudFront-cached MP3/AAC; pre-generated, not on-demand |
Open Items
- Domain / DNS: to be confirmed with client.
- Account-deletion strategy (V1): FK on-delete strategies (cascade vs anonymise) to be defined in the first sprint; must be production-tested before App Store submission.
- Passage vs Article modeling: to resolve in docs/cms-architecture.md before first sprint commits.