ADR-0001: Backend Architecture
Status
Proposed — supersedes the prior tech-stack working assumption. The original client-facing recommendation (Payload CMS + Node + Flutter + ElevenLabs) is amended here for the backend half; mobile and voice choices stand. Awaiting client acceptance per project constraints non-negotiable #4.
Updated 2026-06-29: Auth/accounts moved from Phase 2 into V1 (per the delivery lead — see the auth-scope Slack thread, 2026-06-29); content monetization (paywall/subscription/tier gating) stays Phase 2. Framework/component versions re-verified against current releases as of this date (see "Versions verified" note in the Decision section).
Date
2026-06-25 (updated 2026-06-29)
Context
cs-lewis-backend ships an MVP in ~Oct 2026 (build phase starts 2026-06-29, ~14 weeks). The app is guest-accessible by default; accounts/auth ship in V1 so users can persist their Soul Map (sign-up/sign-in, social sign-in, password reset, and — because accounts ship — App-Store-mandated account deletion). Content is not monetized in V1: paywall, subscription, tier gating, and content gates are Phase 2. Phase 2 also brings search, share sheet, and the reflection moderation queue.
Scope note (2026-06-29): auth was originally scoped as Phase 2. The delivery lead confirmed accounts are in the 14-week build, gates and monetization are not (auth-scope Slack thread, 2026-06-29; project constraints #1). The architecture below was already shaped to carry auth in the same boundary, so this is a scope pull-forward, not a re-architecture.
Three things drove a reassessment of the original Payload + Node recommendation once we understood the product more deeply:
- The hot path is graph traversal, not CRUD. "Dive deeper" and "Connects across" are deterministic multi-axis tag-overlap ranking queries over the Lewis corpus. The defining surface of the app is finding related content by computed tag overlap — not editing or storing it. A CMS is the wrong shape for that work; a ranking-aware data layer is the right shape.
- AI tagging is a first-class workflow, and the AI ecosystem is Python-native. AI-drafted tags + editor QA is the core editorial loop. The mature SDKs (Anthropic, OpenAI), embedding libraries, and ML tooling all live in Python. A Node primary stack would force a second Python service just for AI work — defeating the "one service" simplicity Payload was supposed to deliver.
- The team is small and the runway is short. One technical lead (Python-strong) and three engineers over ~10 build weeks (S&D wraps in week 4). Multi-service operational complexity, custom CMS UI builds, and language-context-switching are all luxury costs the schedule cannot absorb.
This ADR locks the backend architecture for MVP with explicit Phase 2 seams.
Decision Drivers
- Must
  - Serve guest reads at scale (read-heavy, cacheable, low-latency dive-deeper queries)
  - Editor workflows for AI-tag QA, Substack ingest, ElevenLabs audio pre-generation, Portal curation, Daily Drop scheduling
  - V1 auth: accounts, social sign-in (Apple/Google), email/password, password reset, and App-Store-mandated account deletion — in the same service/auth boundary as the CMS and API
  - Phase 2-ready schema seams for paywall, tier gating, and subscription without rebuild
  - Secure by default (CSRF, XSS, SQL injection, rate limiting, secrets management)
  - Auto-generated mobile API contract (Flutter team consumes OpenAPI)
- Should
  - One language across API + CMS + AI workflows (reduces ops surface, simplifies hiring, eliminates context-switching cost)
  - One deployable for MVP (minimise infrastructure moving parts inside the 14-week budget)
  - Production-hardened components with long track records — this is a personal project for the client; vendor risk needs to be low
  - Swap-points to scale up specific tiers (search engine, edge cache) when load demands, without rewriting call-sites
- Nice-to-have
  - Same engine for tag-overlap ranking and Phase 2 full-text search
  - pgvector-ready for Phase 2 semantic similarity if it earns its place
Why we moved off the original Payload + Node proposal
The original recommendation (Payload CMS, Node, open-source, headless) was sound on the things it optimised for: a flexible relational content schema, a usable editor UI, no SaaS per-record licensing, and an API-first shape. None of those goals are being abandoned. What changed is our understanding of what's actually hard about this product.
Why Node.js (and Payload as the Node primary stack) doesn't fit this app
- AI is in Python. The tag-drafting pipeline, future deeper-meaning generation, AI featured images, and any Phase 2 embedding/semantic work all use SDKs and libraries that are Python-native. In a Node primary stack, those workflows become a second service (Python + its own deploy + its own monitoring + a contract with the Node side). That is the opposite of the simplicity Payload was supposed to deliver.
- Security defaults vs assembly. Django ships CSRF, XSS, and SQL-injection protection, secure cookies, and a hardened auth/permissions framework as first-class concerns, with middleware hooks for rate limiting. Express applications assemble the equivalents from independent middleware (helmet, csurf, express-validator, passport, …), which means more configuration surface, more places to misconfigure, and more audit work. For a project that needs to be secure at scale, framework-level defaults are the safer floor.
- App Store-mandated account deletion is a V1 requirement now that accounts ship in the build: it needs hard-delete with FK strategies across reflections, soul map, and audit trails. Django's mature auth and ORM relationship modelling make this materially easier than rolling it on Node.
- Team strength. The technical lead is Python-strong (FastAPI, Django). With a Node primary stack, all work outside the AI pipeline carries a language tax inside the 14-week build. Python primary inverts that: the editorial workflows, API, and AI all share idioms, libraries, and ops.
- Vendor maturity. Payload's stable line is 3.x (built on Next.js); 4.0 is still in beta as of June 2026, and Payload was acquired by Figma in 2025, so its roadmap and ownership are in flux. It is well-designed, but its production track record is years behind the Python equivalents we are proposing, and key editorial features (multi-step publishing workflows, inline commenting) sit behind Payload's enterprise/commercial tier, whereas Wagtail ships them in the free core. (See the internal "Payload vs Wagtail — editorial experience" comparison.) For a client paying for software that needs to last, lower vendor risk beats newer.
Why dropping Payload specifically is not a downgrade
Payload's two strongest selling points — code-first content schemas and a polished block-based editor — are both available in the proposed Python stack via Wagtail (see below). Wagtail has been in production at NASA, the NHS, Mozilla, Google's company pages, Oxfam, the British Council, and Stanford for over a decade. Its StreamField + Draftail editor matches Payload's Blocks/Lexical editor in capability and exceeds it in production hardening. Wagtail also ships multi-step editorial workflows, inline commenting, and deferred draft validation in its free core — exactly the AI-tag-QA gating loop this product needs — whereas in Payload those are enterprise-tier or hand-built (per the internal Payload-vs-Wagtail editorial comparison). We get Payload's editor benefits without Payload's runtime, language, or licensing costs.
Considered Options
Option A: Payload CMS + separate Node API tier (the original proposal)
- Pros: Editor UI out of the box. Code-first schemas. TypeScript end-to-end. Open source, no per-record SaaS licensing.
- Cons:
  - Payload is a CMS, not a graph-traversal engine — the dive-deeper ranking query is awkward inside its query layer.
  - Editor traffic and reader traffic share the same Node process by default — operational risk under load.
  - AI tagging pipeline cannot live in Node; introduces a second (Python) service, eliminating the "one service" advantage.
  - Payload v3 is comparatively young versus the Python equivalents in the same role.
Option B: Payload-as-authoring + projected Node read API tier
- Pros: Each tier optimised — Payload edits, read API serves. A search-engine projection (Meilisearch) gives fast ranking + Phase 2 search.
- Cons:
  - Two services from day one (CMS + read API), plus a third Python service for AI. The 14-week build cannot absorb that ops surface.
  - Webhook reliability and eventual consistency on publish become first-class concerns from day one.
  - Cross-language type drift between Node (Payload) and Python (AI), or all-Node with no AI advantage.
Option C: Wagtail on Django + Django Ninja (single Python service) — chosen
- Pros:
  - One service in one language covers CMS, API, AI workflows, user management, and moderation queue.
  - Wagtail's StreamField + Draftail editor matches Payload's block-based editor at comparable polish, with 10+ years of production hardening at NASA, NHS, Mozilla, Google, Stanford, Oxfam, British Council.
  - Django Ninja gives FastAPI-grade ergonomics (Pydantic v2, async views, auto-generated OpenAPI) sitting on Django's ORM, auth, admin, and middleware — Flutter team gets a typed mobile contract from day one.
  - Django Admin covers user management, reflection moderation queue (Phase 2), and operational tooling without a second admin product.
  - Django's security defaults (CSRF, ORM-level SQL injection protection, auth, permissions) match the "secure at scale" driver.
  - Python primary stack natively supports the AI pipeline — no second service, no contract drift.
  - Mature LTS releases — predictable upgrade path over the app's commercial lifetime.
- Cons:
  - Wagtail introduces its own conventions (pages vs snippets, StreamField, image renditions); ~1 week onboarding cost in the first sprint.
  - Django ORM async story is partial; hot paths that benefit from async I/O use `sync_to_async`/`async_to_sync` deliberately.
Option D: Custom CMS UI + FastAPI
- Pros: Maximum flexibility on the editor experience. Pure async API.
- Cons: Building a Payload-equivalent CMS UI (Tiptap or Lexical editor + relation pickers + media UI + workflows + moderation queue + scheduled publishing) is a realistic 3–5 week build alongside the API, AI pipeline, mobile contract, and Phase 2 schema headroom. Highest-risk way to spend the 14 weeks. Auth, permissions, and admin (all V1 now) would also have to be rebuilt from scratch instead of coming free with Django + Wagtail.
Decision
We adopt Option C with the component stack below. Each decision lists what we picked, the alternatives considered, and the reasoning a non-engineer reviewer can act on.
Versions verified 2026-06-29 (pin the LTS/stable line, not the bleeding edge — this is a long-lived client build): Django 5.2 LTS (supported to Apr 2028; 6.0 is the current feature release but not LTS, and 4.2 reached EOL Apr 2026), Wagtail 7.4 LTS (supported to Nov 2027), Django Ninja 1.6.x, PostgreSQL 18 (on AWS RDS — latest minor 18.4), Redis 8, django-allauth 65.x (headless). These supersede the looser pins in the original draft (Django "5.x", Postgres 16, Redis 7).
Framework — Django 5.2 LTS + Django Ninja
- Why Django: Production-hardened at hyper-scale (Instagram through its growth, Disqus, Pinterest, Mozilla, NASA, Reddit early). 5.2 is the current LTS (security support to April 2028) — predictable upgrade path over the app's commercial lifetime, which matters more than the newest feature release for a personal client project. Security defaults (CSRF, XSS, SQL injection, secure session cookies, auth + permissions) are first-class, not assembled. Largest Python-backend talent pool of any framework — keeps hiring options open.
- Why Django Ninja over DRF: Django Ninja (1.6.x) uses Pydantic v2 (the same validation library FastAPI uses), supports async views, and auto-generates OpenAPI/Swagger documentation. The Flutter team consumes the generated OpenAPI spec to keep the mobile contract typed and in sync. DRF works but predates Pydantic and its OpenAPI story is less native. This choice also drives the auth library decision below — the auth layer must be Ninja-native, not DRF-coupled.
- Why not Express.js / Node: AI ecosystem is Python; choosing Node forces a second service. Express requires assembling security middleware where Django bundles it. Smaller backend-framework talent pool for production Node than for Django.
- Why not pure FastAPI: FastAPI is excellent for pure APIs but ships no admin UI, no user/auth framework, no permissions system. Choosing FastAPI here means rebuilding admin, user management, and moderation tooling — which is exactly what Wagtail + Django Admin gives us for free.
CMS — Wagtail
- Why Wagtail: Production at NASA, NHS UK, Mozilla, Google's company pages, Stanford, Oxfam GB, the British Council, RCA London — over a decade of mature use. StreamField gives block-based rich content (custom block types: paragraph, pull-quote, embed, audio reading, image, etc.) — the same shape Payload's block editor offers. Draftail is a modern, polished rich-text editor. Built-in: draft/preview, scheduled publishing (Daily Drop fits this natively), editorial workflows (perfect for AI-tag QA gating), image renditions, snippet management, tagging. Lives in the same Django process — no second service, no second admin product, no second auth boundary.
- Why not Payload: Covered in detail above. Briefly: same editor benefit, lower vendor risk, no second-language operational cost, no second service.
- Why not Strapi / Sanity / Contentful: Strapi has the same Node operational profile as Payload. Sanity and Contentful are SaaS with per-record + per-API-call pricing — at "lots of users" scale the bill grows with traffic, and the project constraint was explicitly "no SaaS licensing."
- Why not custom CMS UI: A custom CMS UI is 3–5 weeks of dedicated frontend work. The 14-week budget cannot afford that opportunity cost when Wagtail provides ~95% of what we would build.
Database — Postgres 18 (Django ORM, AWS RDS)
- Why Postgres: Industry standard relational DB; production at Apple, Instagram, Spotify, GitHub, Reddit at massive scale. Pin Postgres 18 — available on AWS RDS (latest RDS minor 18.4, May 2026), so it is deployable today, not just upstream-stable (19 is in beta — not for a production launch). AWS RDS gives managed point-in-time recovery, multi-AZ failover, and automated minor-version upgrades. JSONB columns where we want flexibility (AI tag-score payloads). GIN indexes on array columns make tag-overlap queries fast at MVP scale. pgvector extension is one install away for Phase 2 semantic similarity if it earns its place.
- Why not MongoDB: The content graph (Work → Chapter → Passage; Theme ↔ Passage; Reflection → Passage) is structurally relational. JSONB inside Postgres gives the flexibility benefit without giving up referential integrity and join performance.
- Why not MySQL: Postgres' JSONB, array types, GIN indexes, and pgvector ecosystem are all materially better for this product's read patterns.
Search / ranking — Postgres GIN at MVP, Meilisearch as Phase 2 swap-in
- Why Postgres GIN at MVP: The corpus stays under 10K passages at launch. Postgres' array operators (`motif_tags && ARRAY[...]`) with GIN indexes return tag-overlap matches in under 50 ms at this scale, with `array_length(motif_tags & ARRAY[...], 1)` giving overlap-score ranking. Note that the `&` intersection operator comes from the `intarray` extension and only covers integer arrays, so tags must be stored as integer ids or the overlap counted another way. One fewer service to operate, one fewer thing to deploy, one fewer thing to monitor — directly buys time inside the 14-week budget.
- Why Meilisearch is the planned Phase 2 swap-in: When (a) the corpus grows past ~50K, (b) Phase 2 full-text search lands, or (c) ranking complexity outgrows array operators, Meilisearch serves both ranking and full-text from one engine with low ops cost. Read-API code is written against a `Ranker` interface from day one so the swap is mechanical and never touches call-sites.
- Why not Elasticsearch / OpenSearch: JVM ops, cluster management, index sizing — overkill for any plausible scale this product hits.
- Why not Algolia: SaaS with per-record + per-search pricing. With graph traversal as the hot path, search volume scales with user traffic; the bill scales with the product's success in exactly the wrong way.
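The `Ranker` swap-point and the overlap-score ranking described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the `Passage` dataclass, the `related` method name, and `InMemoryOverlapRanker` are hypothetical, and the in-memory loop only mimics what the Postgres query would do (filter on any shared tag, then order by the number of shared tags).

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Passage:
    id: int
    motif_tags: frozenset[str]


class Ranker(Protocol):
    """Swap-point: call-sites depend on this, never on a concrete engine."""

    def related(self, seed_tags: set[str], limit: int) -> list[int]: ...


class InMemoryOverlapRanker:
    """Hypothetical stand-in for the Postgres GIN implementation."""

    def __init__(self, passages: list[Passage]) -> None:
        self._passages = passages

    def related(self, seed_tags: set[str], limit: int) -> list[int]:
        scored = []
        for passage in self._passages:
            overlap = len(passage.motif_tags & seed_tags)
            if overlap:  # same filter as `motif_tags && ARRAY[...]`
                scored.append((overlap, passage.id))
        # Highest overlap first; ties broken by id so results are deterministic.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [passage_id for _, passage_id in scored[:limit]]


corpus = [
    Passage(1, frozenset({"grief", "faith"})),
    Passage(2, frozenset({"joy"})),
    Passage(3, frozenset({"grief", "faith", "longing"})),
    Passage(4, frozenset({"faith"})),
]
ranker: Ranker = InMemoryOverlapRanker(corpus)
top = ranker.related({"grief", "faith", "longing"}, limit=3)
print(top)
```

Because the call-site is typed against `Ranker`, a Meilisearch-backed class with the same `related` signature could replace the in-memory one without touching it.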
Dependency injection — Wireup
- Why Wireup: Django has no native DI container. Without one, services (the `Ranker`, AI tag pipeline, ElevenLabs client, Substack sync) are either instantiated inline in views (tight coupling, hard to test) or wired via module-level globals (import-order fragility, awkward mocking). Wireup provides a lightweight DI container with a first-class Django integration: `@inject` on views and Django Ninja endpoints; `@inject_app` on Celery tasks, management commands, and signals. Services are plain Python classes — no framework coupling, directly unit-testable by constructing them in tests with stub dependencies.
- Why this matters for the `Ranker` swap: The `Ranker` interface (Postgres GIN now, Meilisearch in Phase 2) is the canonical swap-point in this ADR. Wireup is what makes that swap mechanical: changing one registration in the container is the entire change. Call-sites in views never import a concrete implementation directly.
- Why not manual service factories / module-level singletons: No dependency graph validation at startup (misconfigured services surface at request time, not boot time). No consistent pattern across views, tasks, and commands. Mock-patching module imports in tests is brittle.
- Why not Django's built-in app registry or class-based views with `as_view()`: These solve view organisation, not dependency provision. They don't give you a validated dependency graph or a single place to swap implementations.
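The "one registration is the entire change" claim can be illustrated with a toy container. The sketch below deliberately does not use Wireup's API: `Container`, `register`, `get`, and both placeholder ranker classes are hypothetical names, and the rankers return canned ids. It only shows the shape of the idea, namely that the call-site asks for an interface and the wiring decides which class answers.

```python
from typing import Callable, Protocol


class Ranker(Protocol):
    def related(self, seed_tags: set[str], limit: int) -> list[int]: ...


class PostgresGinRanker:
    """Placeholder for the MVP implementation; returns canned ids."""

    def related(self, seed_tags: set[str], limit: int) -> list[int]:
        return [1, 2, 3][:limit]


class MeilisearchRanker:
    """Placeholder for the Phase 2 implementation; returns canned ids."""

    def related(self, seed_tags: set[str], limit: int) -> list[int]:
        return [9, 8, 7][:limit]


class Container:
    """Toy container: maps an interface to the factory that provides it."""

    def __init__(self) -> None:
        self._factories: dict[type, Callable[[], object]] = {}

    def register(self, interface: type, factory: Callable[[], object]) -> None:
        self._factories[interface] = factory

    def get(self, interface: type):
        if interface not in self._factories:
            # This toy fails at lookup time; a real container can validate at boot.
            raise LookupError(f"no provider registered for {interface.__name__}")
        return self._factories[interface]()


def dive_deeper(container: Container, seed_tags: set[str]) -> list[int]:
    # The call-site names only the interface, never a concrete class.
    ranker = container.get(Ranker)
    return ranker.related(seed_tags, limit=2)


container = Container()
container.register(Ranker, PostgresGinRanker)  # MVP wiring
mvp_result = dive_deeper(container, {"faith"})

container.register(Ranker, MeilisearchRanker)  # Phase 2: the only line that changes
phase2_result = dive_deeper(container, {"faith"})
print(mvp_result, phase2_result)
```

`dive_deeper` is identical in both runs; only the registration differs, which is the property the ADR relies on.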
Cache + broker — Redis 8 (AWS ElastiCache)
- Why Redis: The de facto cache standard — production at Twitter, GitHub, Instagram, Stack Overflow, Snapchat. Redis 8 is the current stable line as of June 2026. Doubles as the Celery broker (one fewer service), and holds the JWT denylist / refresh-token revocation state for V1 auth. AWS ElastiCache gives managed multi-AZ failover and snapshot backups.
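Redis's cache role on the read API would typically follow a cache-aside pattern: look up the key, compute on a miss, store with a TTL. The sketch below is an assumption-laden illustration using an in-memory dictionary instead of a Redis client; `TTLCache`, `cached_related`, the `related:<id>` key format, and the 300-second TTL are all hypothetical choices, not values from this ADR.

```python
import time
from typing import Callable


class TTLCache:
    """In-memory stand-in for Redis GET / SET with an expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key: str, value: object, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def cached_related(cache: TTLCache, passage_id: int, compute, ttl_seconds: float = 300.0):
    """Cache-aside: return the cached ranking, or compute and store it."""
    key = f"related:{passage_id}"  # hypothetical key format
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = compute(passage_id)
    cache.set(key, value, ttl_seconds)
    return value


now = [0.0]   # fake clock so the example is deterministic
calls = []    # records every time the expensive path runs


def expensive_ranking(passage_id: int) -> list[int]:
    calls.append(passage_id)
    return [passage_id + 1, passage_id + 2]


cache = TTLCache(clock=lambda: now[0])
first = cached_related(cache, 10, expensive_ranking)
second = cached_related(cache, 10, expensive_ranking)  # served from cache
now[0] = 301.0                                         # move past the TTL
third = cached_related(cache, 10, expensive_ranking)   # recomputed
print(len(calls))
```

Published content changes rarely relative to reads, so a short TTL plus explicit invalidation on publish is one plausible policy; the right values depend on telemetry this ADR does not yet have.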
Background jobs — Celery + Redis
- Why Celery: Mature Python job queue; production at Instagram, Mozilla, Lyft. Retry and dead-letter semantics are critical for the external integrations we rely on (Anthropic/OpenAI rate limits, ElevenLabs job latency, Substack polling). Pairs naturally with the Redis broker already in the stack. Celery Beat handles scheduled publishing (Daily Drop).
- Why not AWS Lambda / serverless workers: Cold-start latency hurts AI calls; we want long-running workers, not function invocations. Celery is the right fit.
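The retry and dead-letter behaviour described above can be sketched without Celery. In Celery itself this would normally be configured through task options such as `autoretry_for`, `retry_backoff`, `retry_backoff_max`, `retry_jitter`, and `max_retries`; the code below is a plain-Python approximation of that idea, and `RateLimited`, `backoff_delays`, `run_with_retries`, and the list used as a dead-letter store are hypothetical names for illustration only.

```python
import random
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical error for an upstream HTTP 429 (not a real SDK class)."""


def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 600.0) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


def run_with_retries(
    task: Callable[[], str],
    *,
    max_retries: int,
    sleep: Callable[[float], None],
    dead_letter: list[str],
    rng: Callable[[], float] = random.random,
) -> Optional[str]:
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return task()
        except RateLimited as exc:
            if attempt == max_retries:
                dead_letter.append(str(exc))  # retries exhausted: park for replay
                return None
            # Full jitter: sleep a random fraction of the capped delay.
            sleep(delays[attempt] * rng())
    return None


attempts = {"n": 0}


def flaky_tagging_call() -> str:
    # Fails twice with a rate-limit error, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("429 from tagging API")
    return "tags-drafted"


slept: list[float] = []
failed: list[str] = []
result = run_with_retries(
    flaky_tagging_call,
    max_retries=5,
    sleep=slept.append,   # record delays instead of actually sleeping
    dead_letter=failed,
    rng=lambda: 1.0,      # disable jitter so the delays are predictable
)
print(result, slept, failed)
```

Jitter matters when many tagging jobs hit the same rate limit at once: without it they would all retry on the same schedule.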
Object storage — S3 + CloudFront for media only
- Why S3 + CloudFront: AWS-aligned with the hosting choice. CloudFront caches ElevenLabs audio files and AI-generated featured images at the edge, cutting latency and S3 egress cost for global users.
- CDN in front of the API is deferred to Phase 2. Redis at origin is sufficient at MVP scale; we will revisit once usage telemetry warrants edge caching for read endpoints.
- Why not Cloudflare R2: Possible alternative for media storage; defer the decision to a follow-up ADR if S3 egress costs turn out to matter. Sticking with S3 keeps the AWS data-tier surface uniform.
Auth (V1) — django-allauth (headless) + Apple Sign-In + Google Sign-In + JWT
Accounts ship in V1 (scope update 2026-06-29). The auth layer must be Django Ninja-native, since that is the API framework — not coupled to DRF.
- Why django-allauth headless: `django-allauth` (65.x, June 2026) is the mature Django auth package — email/password, social sign-in, email verification, password reset, and Apple Sign-In (App Store mandate when offering third-party sign-in on iOS). Its `allauth.headless` mode exposes every flow as a framework-agnostic REST API with a published OpenAPI spec, and the project's own examples now demonstrate Django Ninja integration plus a built-in JWT token strategy. One package covers the whole flow set behind our Ninja API.
- Why not `dj-rest-auth` (the original draft's choice): `dj-rest-auth` is built on Django REST Framework. This stack is Django Ninja, so dj-rest-auth would drag in DRF purely for auth — a second API framework in the process for no benefit. `allauth.headless` removes that coupling.
- JWT for the mobile client: the allauth headless JWT strategy issues access/refresh tokens to keep Flutter sessions stateless against the API; refresh-token revocation / denylist lives in Redis. (If a dedicated token lib is ever preferred over the headless strategy, `django-ninja-jwt` — a SimpleJWT fork with the DRF dependency removed — is the Ninja-native fallback. Default to the allauth headless strategy to avoid a second auth dependency.)
- Account deletion is a V1 deliverable (App Store mandate, now that accounts ship): a real "delete my account" endpoint that purges PII and retained data. With reflections and soul map referencing a user, the deletion strategy (cascade vs anonymise per FK) is a first-sprint schema decision — tracked in Open items, no longer deferred to Phase 2.
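The refresh-token denylist mentioned above can be sketched as a small TTL-keyed store. This is not allauth's API and not a Redis client: `RefreshDenylist`, `can_refresh`, and the claim names used here (`jti`, `exp`, `sub`, following common JWT conventions) are assumptions for illustration, with a dictionary standing in for Redis keys that expire when the token itself would.

```python
import time
import uuid
from typing import Callable


class RefreshDenylist:
    """In-memory stand-in for Redis entries keyed by token id with a TTL."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._revoked: dict[str, float] = {}  # jti -> token expiry (epoch seconds)

    def revoke(self, jti: str, expires_at: float) -> None:
        # Keep the entry only until the token would have expired anyway.
        self._revoked[jti] = expires_at

    def is_revoked(self, jti: str) -> bool:
        expires_at = self._revoked.get(jti)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._revoked[jti]  # expired tokens fail validation on their own
            return False
        return True


def can_refresh(claims: dict, denylist: RefreshDenylist, now: float) -> bool:
    """Allow a refresh only for unexpired, non-revoked refresh tokens."""
    return claims["exp"] > now and not denylist.is_revoked(claims["jti"])


now = [1_000.0]  # fake clock so the example is deterministic
denylist = RefreshDenylist(clock=lambda: now[0])
claims = {"jti": uuid.uuid4().hex, "sub": "user-42", "exp": now[0] + 3600}

before = can_refresh(claims, denylist, now[0])
denylist.revoke(claims["jti"], claims["exp"])  # e.g. on sign-out or account deletion
after = can_refresh(claims, denylist, now[0])
print(before, after)
```

Tying the denylist entry's lifetime to the token's own expiry keeps the store bounded, which is why a TTL-capable cache suits this job.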
Edge / WAF — Cloudflare
- Why Cloudflare: Best-in-class global PoP coverage, mature rate limiting + bot management + DDoS mitigation, simple DX (single dashboard, sensible defaults), free tier covers MVP-level traffic with predictable upgrade tiers. Decoupling the WAF from AWS also gives us a vendor diversification benefit on the security tier.
- Why not AWS WAF: Would keep the security tier inside the same vendor surface, but the management ergonomics are heavier (rule groups, web ACLs, separate per-region pricing) and the bot/edge-cache story is materially behind Cloudflare's.
Hosting — AWS ECS Fargate
- Why ECS Fargate: Serverless containers — no EC2 management, no Kubernetes control plane to operate. Autoscaling is native. Cheaper than EKS at our expected scale. Standard pattern for Python web apps on AWS. Web and Celery worker run as separate Fargate services so editor traffic, reader traffic, and background jobs scale independently.
- Why not EKS: Operational overhead not justified at our scale.
- Why not Render / Fly.io / Railway: All viable for a smaller team, but the client's "scale" requirement and our AWS-aligned services (RDS, ElastiCache, S3, CloudFront, SSM, Secrets Manager) make AWS-native ECS Fargate the lower-friction choice.
Observability (Phase 1) — Sentry + structured logs
- Why Sentry: Industry standard for error tracking; native Django and Celery integrations; sufficient at Phase 1 traffic. Metrics tooling (Grafana / Datadog) is deferred to Phase 2 when traffic and complexity justify it.
Architecture (MVP)
         ┌──────────────────────────────┐
         │   Cloudflare (edge / WAF)    │
         └──────────────┬───────────────┘
                        │
    ┌───────────────────┴───────────────────┐
    │              CloudFront               │
    │          (media only — MVP)           │
    └─────┬──────────────────────┬──────────┘
          │                      │
          │                      ▼
          │               ┌──────────────┐
          │               │      S3      │
          │               │ audio/images │
          │               └──────────────┘
          │
          ▼
    ┌────────────────────────┐
    │  Django (ECS Fargate)  │
    │  ┌──────────────────┐  │
    │  │   Django Ninja   │  │  ← Mobile API (auto-OpenAPI)
    │  └──────────────────┘  │
    │  ┌──────────────────┐  │
    │  │     Wagtail      │  │  ← Editor CMS
    │  └──────────────────┘  │
    │  ┌──────────────────┐  │
    │  │   Django Admin   │  │  ← Users (V1), mod queue (Ph 2)
    │  └──────────────────┘  │
    └─────┬───────────┬──────┘
          │           │
          ▼           ▼
    ┌────────────┐  ┌──────────┐
    │  Postgres  │  │  Redis   │
    │ + GIN tags │  │ cache +  │
    └────────────┘  │  broker  │
          ▲         └────┬─────┘
          │              │
          │         ┌────▼─────────┐
          └─────────│    Celery    │  ← AI tagging,
                    │   workers    │    ElevenLabs,
                    └──────────────┘    Substack sync
Mapping the CMS data model onto Wagtail
The collections in docs/cms-architecture.md map cleanly:
| Domain concept | Wagtail/Django primitive |
|---|---|
| Journey / Portal | Wagtail Page (tree structure, slugs, scheduled publishing) |
| Theme (with `has_page=true`) | Wagtail Page (topic hub) |
| Theme (taxonomy only) | Wagtail Snippet |
| Passage | Wagtail Snippet with StreamField body |
| Article | Wagtail Snippet with StreamField body (or fold into Passage with provenance=editorial — see open item below) |
| Work, Chapter, Life Stage, Podcast | Wagtail Snippet |
| Reflection | Django model (user-generated, moderated via Django Admin queue) |
| Daily Drop | Django model + Celery Beat job |
Production credibility (for client review)
Every component in this stack has a multi-year production track record at organisations the client can recognise:
| Component | Used in production at |
|---|---|
| Django | Instagram, Disqus, Mozilla, NASA, Pinterest, Spotify, Reddit |
| Wagtail | NASA, NHS UK, Google (company pages), Mozilla, Stanford, Oxfam, British Council |
| Postgres | Apple, Instagram, Spotify, GitHub, Reddit |
| Redis | Twitter, GitHub, Instagram, Stack Overflow, Snapchat |
| Celery | Instagram, Mozilla, Lyft |
| AWS (ECS, RDS, S3, CloudFront) | Industry standard |
| Sentry | Industry standard |
No SaaS lock-in on the editorial or content tier. No per-record licensing. Every choice has a managed-services path on AWS that the client's ops team (or our handover team) can operate.
Cost Estimate (ballpark, pre-finalization)
Plan for ~$350/month recurring at MVP launch, plus a one-time content-ingest spike of a few hundred dollars. This is a pre-finalization figure; the firm number lands when the architecture is locked (see project constraints #4 and the "Ongoing costs" note below). Numbers are AWS on-demand (no reserved/savings-plan commitment) in a single production environment.
Recurring monthly (AWS infra + SaaS)
| Component | Spec assumed | MVP launch | Modest growth |
|---|---|---|---|
| ECS Fargate | 2 web + 1 Celery worker (0.5 vCPU / 1 GB each), 24/7 | $40–80 | $120–250 |
| RDS Postgres 18 | db.t4g.medium, multi-AZ + PITR | $60–130 | $150–300 |
| ElastiCache Redis 8 | cache.t4g.small, multi-AZ | $30–70 | $70–150 |
| S3 | audio + images, single-digit to tens of GB | $5–15 | $15–40 |
| CloudFront | media-only edge (API CDN deferred to Phase 2) | $10–30 | $30–100 |
| Cloudflare WAF | free tier covers MVP; Pro tier later | $0–20 | $20–50 |
| Sentry | Team plan | $0–30 | $30–60 |
| Misc (Secrets Manager, CloudWatch logs, data egress) | — | $10–30 | $30–80 |
| Recurring total | | ~$150–400 | ~$450–1000 |
The committed planning figure of ~$350/month sits at the upper end of the MVP range because the ADR specifies multi-AZ for both RDS and ElastiCache (failover is a stated requirement; see Risks table). Dropping to single-AZ would roughly halve the RDS + Redis lines, which together come to ~$90–200/month, saving roughly $45–100/month at the cost of the failover guarantee.
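The MVP-launch totals can be checked by summing the per-line ranges from the recurring-cost table. The short script below copies those ranges as-is and applies the "roughly halves" assumption from the text to the two multi-AZ lines; the variable names are illustrative, and the halving is the document's own approximation rather than a quoted AWS price.

```python
# (low, high) USD per month at MVP launch, copied from the recurring-cost table.
mvp_lines = {
    "ECS Fargate": (40, 80),
    "RDS Postgres 18": (60, 130),
    "ElastiCache Redis 8": (30, 70),
    "S3": (5, 15),
    "CloudFront": (10, 30),
    "Cloudflare WAF": (0, 20),
    "Sentry": (0, 30),
    "Misc": (10, 30),
}

total_low = sum(low for low, _ in mvp_lines.values())
total_high = sum(high for _, high in mvp_lines.values())

# The two lines the ADR runs multi-AZ.
multi_az = ("RDS Postgres 18", "ElastiCache Redis 8")
az_low = sum(mvp_lines[name][0] for name in multi_az)
az_high = sum(mvp_lines[name][1] for name in multi_az)

# Assumption from the text: single-AZ roughly halves these two lines.
single_az_saving = (az_low / 2, az_high / 2)

print(f"MVP recurring total: ${total_low}-{total_high}/month")
print(f"Multi-AZ lines: ${az_low}-{az_high}/month")
print(f"Approximate single-AZ saving: ${single_az_saving[0]:.0f}-{single_az_saving[1]:.0f}/month")
```

The exact sums come to $155–405, which the table rounds to ~$150–400; the ~$350 planning figure falls inside that band, toward the top.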
One-time / episodic (AI + voice pipeline)
These are not steady monthly costs — they are bulk spikes that recur only when new content is ingested:
| Pipeline step | Driver | One-time cost |
|---|---|---|
| AI tag drafting | Anthropic/OpenAI over full Lewis corpus (<10K passages) | ~$50–300 |
| ElevenLabs pre-generation | Cloned-voice audio, all passages, pre-rendered (not on-demand) | ~$100–400 |
Steady-state ongoing AI/voice (incremental new content only) is small — tens of dollars per month, folded into the recurring estimate's headroom.
Assumptions & exclusions
- AWS on-demand pricing. A 1-year reserved/savings-plan commitment cuts compute ~30–40%.
- One production environment. A full always-on staging environment roughly doubles infra cost; a scaled-down staging adds less.
- Excludes domain registration, CI/CD minutes, and team/ops labour.
- Real cost driver at scale is traffic (Fargate + CloudFront + Postgres size), not the AI pipeline — which front-loads as a one-time build cost.
Consequences
Positive
- One service, one language, one auth boundary for MVP — the team's 14 weeks are spent on the product, not on glue.
- Editor UX is competitive with the original Payload proposal via Wagtail's StreamField + Draftail, with materially longer production hardening.
- AI pipeline is a first-class citizen (Anthropic / OpenAI / embedding SDKs in the same process as the API).
- Auto-OpenAPI from Django Ninja gives the Flutter team a typed contract from day one.
- V1 auth lives in the same boundary, no extra service: `django-allauth` headless runs inside the one Django process alongside the CMS, API, and AI workflows — accounts, social sign-in, and account deletion add no new deployable.
- Phase 2 is additive, not a rebuild: paywall/tier-gating sit on the existing `tier` field and user model; Wireup makes the `Ranker` swap (Postgres GIN → Meilisearch) a single container registration change; pgvector is one extension away.
- Security defaults at the framework level materially lower the App Store-compliance burden — relevant now that account deletion is a V1 deliverable.
Negative
- Wagtail conventions (pages vs snippets, StreamField blocks, image renditions) require ~1 week onboarding in the first sprint.
- Django ORM async story is partial on Django 5.2 LTS (psycopg3 supported; not every ORM path is natively async). Hot paths that benefit from async I/O use `sync_to_async`/`async_to_sync` deliberately. Async ORM coverage is expected to keep improving across the Django 6.x line; we trade that for the LTS support window and revisit at the next LTS.
- Postgres GIN ranking is a stopgap: it works at <10K passages but will need swapping to Meilisearch when corpus grows or full-text search lands.
- One process means editor activity and reader activity share the runtime. Mitigated by separate Celery workers for heavy tasks and aggressive Redis caching on the read API.
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Tag-overlap ranking on Postgres GIN degrades as corpus grows | Write the read API against a Ranker interface from day one. Track p95 on the dive-deeper endpoint. Have the Meilisearch projection plan documented before Phase 2 search work begins. |
| Wagtail learning curve eats build time | Spike 1 week of Wagtail in the first sprint: model Passage + Journey + StreamField blocks end-to-end before committing the rest of the schema. |
| AI tagging quality requires editor effort | The QA workflow (draft → reviewed) is gated in Wagtail's workflow feature. Add a "needs review" dashboard for editors as a Phase 1 polish item. |
| Account deletion (App Store mandate) is non-trivial with reflections, soul map, audit data — and is now V1 | First-sprint schema work must include a hard-delete plan with FK on-delete strategies (cascade vs anonymise) per collection, since accounts ship in V1. Cannot ship to the App Store without it. |
| One service = wider blast radius on outage | ECS Fargate auto-scales the web tier; Celery workers are separate Fargate services so background load cannot starve the API; RDS multi-AZ + PITR; Sentry catches regressions early. |
| Wagtail / Payload editor gap (block library polish) | Accept ~95% parity at MVP. Track specific block needs as we go; add custom Wagtail StreamField blocks where editors flag gaps. |
| Client originally bought into Payload + Node | Walk through this ADR's "Why we moved off" section with the client before locking. The editor benefit is preserved (Wagtail); the operational and AI-pipeline costs of Node are eliminated. |
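The "cascade vs anonymise per FK" decision in the account-deletion risk can be made concrete with a small SQLite sketch. Which tables cascade and which anonymise is still an open decision, so the split below is illustrative only; the table and column names (`app_user`, `soul_map_entry`, `reflection`) are hypothetical. In Django models the corresponding choices would be `on_delete=models.CASCADE` and `on_delete=models.SET_NULL`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE app_user (id INTEGER PRIMARY KEY, email TEXT NOT NULL);

    -- Personal data: removed together with the account (cascade).
    CREATE TABLE soul_map_entry (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES app_user(id) ON DELETE CASCADE,
        passage_id INTEGER NOT NULL
    );

    -- Content that may outlive the account: keep the row, drop the link (anonymise).
    CREATE TABLE reflection (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES app_user(id) ON DELETE SET NULL,
        body TEXT NOT NULL
    );
    """
)
conn.execute("INSERT INTO app_user (id, email) VALUES (1, 'reader@example.com')")
conn.execute("INSERT INTO soul_map_entry (user_id, passage_id) VALUES (1, 101)")
conn.execute("INSERT INTO reflection (user_id, body) VALUES (1, 'On grief')")

conn.execute("DELETE FROM app_user WHERE id = 1")  # the "delete my account" step

soul_map_rows = conn.execute("SELECT COUNT(*) FROM soul_map_entry").fetchone()[0]
reflections = conn.execute("SELECT user_id, body FROM reflection").fetchall()
remaining_users = conn.execute("SELECT COUNT(*) FROM app_user").fetchone()[0]
print(soul_map_rows, reflections, remaining_users)
```

Note that anonymising by nulling the foreign key leaves the reflection body in place; if that text can itself contain PII, the purge has to cover it too, which is part of why the strategy needs deciding per model.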
Open items (not blocking ADR acceptance)
- Passage vs Article modeling. Product spec collapses these into one text-piece type with `provenance` distinguishing them; the Figma CMS schema has them as separate collections. Resolve in `docs/cms-architecture.md` before the first sprint commits a Wagtail snippet definition.
- Account deletion strategy (now V1) — define FK on-delete (cascade vs anonymise) per model in the first sprint; account deletion must be production-tested before App Store submission.
- Guest → account data hand-off — how guest-created state (e.g. a first save) is represented before an account exists and migrated on sign-up. Build-time decision in the first auth sprint (anonymous/device id vs deferred persistence).
- Tech stack client acceptance. Per project constraints #4, the original proposal (Payload + Flutter + ElevenLabs) needs client sign-off. This ADR amends the backend half; the client should be walked through it before locking.
Related
- Project constraints — non-negotiables (accounts in V1 / no monetization in V1, tag model load-bearing, AI tagging build-time, stack proposed-not-confirmed)
- Auth-scope Slack thread (2026-06-29) — source for the auth-in-V1 scope decision
- Product domain model — content model and traversal mechanics this stack must serve
- Prior tech-stack working assumption — what this ADR supersedes
- MVP scope — Phase 1 vs Phase 2 split this architecture is shaped around
- `docs/cms-architecture.md` — collections and relationships the Wagtail/Django models will encode
- Source spec: "What we are building — end of week 4" Notion page (`3894dccd9d8b8055a0a9ed38ed43eb88`)
- Source spec: "Foundational Architecture" Figma board (`3ivmGpow1xs7q9k81ea6OL`, node `578-1345`)