AI Lead · Aloukik Aditya

AI/ML Roadmap

A prioritised, sprint-by-sprint plan for the AI & ML team. Adjust, cut, or add based on product decisions — this is the baseline.

Owner: Aloukik (AI Lead)
Sprint Cadence: 1 Week
Last Updated: May 14, 2026
Tooling: Claude Code

North Star

Move from reactive AI fixes to a proactive AI platform that drives onboarding, retention, and revenue.

🎯
Best-in-Class Onboarding
LLM-driven image escalation that converts every new user at peak desire.
Cheapest NSFW Stack
Grok via OpenRouter as primary, Fal.ai for infra — lowest cost, highest capability.
🖼️
Automated Quality Gates
Pre-generated content with AI classifiers so users never see a broken image.
💬
Smart Engagement
Dynamic suggestions and intimacy progression that keep users coming back.

Charts & Tables

Visual breakdown of sprint distribution, priority allocation, and task load.

Priority Distribution (14 sprints): P0 Urgent 3 · P1 High 5 · P2 Medium 5 · P3 Low 1
[Charts: Tasks per Sprint; Sprint Timeline (estimated week-by-week)]
[Table: Sprint Master Table. Columns: Sprint, Theme, Priority, Tasks, Split, Key Output]

Sprint Overview

14 sprints across 4 priority tiers. Splits are made only where a task is genuinely too large for one week.

Sprint 1
E2E Verification — Image Escalation
P0 Urgent
Sprint 2 · Part 1
Grok NSFW Validation + Endpoint Audit
P0 Urgent Split
Sprint 2 · Part 2
OpenRouter Stress Test + Monitoring
P0 Urgent
Sprint 3
RunPod → Fal.ai Migration
P1 High
Sprint 4 · Part 1
GCP Audit + Erin Transition
P1 High Split
Sprint 4 · Part 2
Asset Migration + GCP Cleanup
P1 High
Sprint 5
NSFW LoRA Taxonomy + Image Classifier
P1 High
Sprint 6 · Part 1
NSFW LoRA Training — Tier-1
P2 Medium Split
Sprint 6 · Part 2
Live Image Classifier + LoRA Integration
P2 Medium
Sprint 7
Smart Suggestions — Dynamic LLM Prompts
P2 Medium
Sprint 8
User Media Input — Grok Vision
P2 Medium
Sprint 9 · Part 1
Intimacy Meter Design + Memory System
P2 Medium Split
Sprint 9 · Part 2
Memory System v1 Build
P2 Medium
Sprint 10+
Intimacy Meter Full Build + Rollout
P3 Low

Sprint Breakdown

Every task is AI/ML team work only. Backend and frontend items are flagged as suggestions — not AI tasks.

S1
E2E Verification — Image Escalation
P0 Urgent
Confirm Toan's image escalation work is production-ready before it becomes the core onboarding experience at scale.
Image escalation is the most important piece of onboarding, and Toan pushed it recently. The AI team must verify the full pipeline E2E before the product ships to users at scale.
AI/ML Tasks
  • E2E pipeline walkthrough
    Run the full image escalation sequence (dressed → undressed, 6 stages) for Sophie and Lena. Manually inspect each output for: correct escalation order, no broken/artifact images, creator identity preserved across all stages.
    👁 Manual visual inspection required
  • Stress test: image generation concurrency
    Write a load test script (Claude Code), hit the endpoint at 10 / 25 / 50 concurrent requests. Record latency P50/P95, failure rate, broken image rate under load.
    📊 Output: pass/fail report + recommended concurrency cap
  • Broken image audit
    Sample 50 pre-generated images from DB across all escalation tiers. Manually classify: good / broken / wrong creator. Document failure rate per tier.
    🏷 Labelled dataset becomes the classifier benchmark in S5
  • Tattoo regression check
    New model has known tattoo degradation vs old model. Inspect escalation images for tattooed creators. Flag to product if quality is below bar with visual evidence.
  • Observability audit
    Check what logging exists: Are broken image events logged? Are escalation stage transitions traceable? Are generation errors surfaced? Document gaps and flag critical missing logs.
    ⚠ We can't debug production issues without logs
  • E2E test runbook
    Document the full E2E verification process as a repeatable runbook. Next time someone pushes to this pipeline, the runbook is the checklist.
suggest → BE: Error handling + retry on escalation endpoint
suggest → BE: Pre-gen images indexed in DB with tier tags
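The concurrency stress test in S1 could start from something like the sketch below. It is a minimal outline, not the team's actual script: `generate_image` is a hypothetical stand-in that simulates latency and failures, and would be swapped for a real HTTP call to the escalation endpoint. The 10 / 25 / 50 levels come from the task description; everything else is an assumption.

```python
import asyncio
import random
import statistics
import time


async def generate_image(request_id: int) -> bool:
    """Hypothetical stand-in for the escalation endpoint; replace with a real HTTP call."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated latency
    return random.random() > 0.05  # simulated 5% failure rate


async def run_level(concurrency: int) -> dict:
    """Fire `concurrency` requests at once and summarise latency and failures."""

    async def timed(i: int) -> tuple[float, bool]:
        start = time.perf_counter()
        try:
            ok = await generate_image(i)
        except Exception:
            ok = False  # any exception counts as a failed request
        return time.perf_counter() - start, ok

    results = await asyncio.gather(*(timed(i) for i in range(concurrency)))
    latencies = sorted(latency for latency, _ in results)
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "concurrency": concurrency,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "failure_rate": sum(1 for _, ok in results if not ok) / concurrency,
    }


async def main() -> list[dict]:
    # Levels are run one after another so they do not interfere with each other.
    return [await run_level(c) for c in (10, 25, 50)]


if __name__ == "__main__":
    for row in asyncio.run(main()):
        print(row)
```

Broken image rate is not covered here, since it needs a visual check or classifier on each returned image.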
S2 · Part 1
Grok NSFW Validation + Endpoint Audit
P0 Urgent Split
Validate Grok's quality and NSFW capability before committing it as primary LLM.
Why split: Manually reviewing 100 conversations for quality + persona + NSFW compliance is 2-3 days of work alone. Infra stress testing belongs in Part 2 so neither gets rushed.
AI/ML Tasks
  • OpenRouter endpoint audit
    Review current integration (D2C-4559 / D2C-4561). Verify all modes route correctly: chat, image intent detection, guardrails, suggestion mode. Document any gaps.
  • Grok NSFW capability test
    Run 100 representative onboarding conversations through Grok. Score each on: NSFW compliance rate, persona consistency, message quality, response tone vs current LLM baseline.
    👁 Manual review of outputs is the bulk of this sprint
  • Prompt compatibility audit
    Our prompts were tuned for Claude/current LLM. Test all active Langfuse prompt templates against Grok. Identify any that produce worse outputs. Document required prompt adjustments.
  • Multi-turn persona consistency test
    Run 20 extended conversations (10+ turns each) through Grok. Verify persona, tone, and NSFW escalation remain consistent across the full conversation — not just the first message.
    ⚠ Single-turn tests are not enough to validate primary LLM behaviour
  • Content filter false positive retest
    D2C-4080: current LLM blocks its own suggestions. Rerun these exact cases with Grok to determine if this is model-specific or pipeline-level.
  • Cost-per-conversation estimate
    Based on token counts from the 100-conversation test, compute average cost/conversation on Grok vs current LLM. Produce a concrete monthly cost projection at current DAU.
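The cost-per-conversation estimate is simple arithmetic once token counts are logged. A minimal sketch, assuming per-million-token pricing and hypothetical names (`Conversation`, `cost_per_conversation`, `monthly_projection`); real prices and conversations-per-user figures need to come from the provider's pricing page and our own analytics.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    prompt_tokens: int
    completion_tokens: int


def cost_per_conversation(convs, usd_per_m_input, usd_per_m_output):
    """Average USD cost per conversation from token counts and per-million-token prices."""
    total = sum(
        c.prompt_tokens / 1e6 * usd_per_m_input
        + c.completion_tokens / 1e6 * usd_per_m_output
        for c in convs
    )
    return total / len(convs)


def monthly_projection(avg_cost, dau, convs_per_user_per_day, days=30):
    """Projected monthly spend; assumes DAU and usage stay flat over the month."""
    return avg_cost * dau * convs_per_user_per_day * days
```

Running the same function with each model's prices over the same 100 conversations gives a like-for-like comparison, provided both models see comparable prompt lengths.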
S2 · Part 2
OpenRouter Stress Test + Fallback + Monitoring
P0 Urgent
Validate OpenRouter is infrastructure-ready at scale. Define rate limit strategy, wire monitoring, and confirm fallback chain works.
AI/ML Tasks
  • Stress test: OpenRouter concurrent requests
    Load test at 10 / 50 / 100 / 200 concurrent requests. Measure: latency, rate limit hits, error rate. Document max safe concurrency + recommended alerting thresholds for BE.
  • Rate limit strategy
    Define what happens when OpenRouter rate limits are hit: queue and retry, immediate fallback, or degraded experience. Document strategy with latency and cost tradeoffs.
  • Fallback chain test
    Simulate Grok failure (timeout, rate limit, API error). Verify fallback to backup LLM is seamless — no user-facing error, no broken conversation state, no duplicate charges.
  • Cost tracking instrumentation
    Log token usage and model name per request via OpenRouter. Track actual cost/conversation before and after the switch to validate the pricing advantage in production.
  • Error monitoring setup
    Define what AI-layer errors we alert on: Grok failures, fallback triggers, high broken-image rates, guardrail rejection spikes. Document alert thresholds and owner.
  • Failover runbook
    Document the manual failover procedure: how to switch primary LLM, rollback to previous model, and verify the switch is working. Required before Grok goes to prod.
suggest → BE: Parameterize LLM model in config (D2C-4581)
suggest → BE: Add model name to conversation logs
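For the fallback chain test, it may help to have a reference shape for what "seamless fallback" means in code. The sketch below is one possible shape under stated assumptions: providers are plain async callables, `LLMError` is a hypothetical exception type, and the timeout value is a placeholder. It does not model billing or conversation state.

```python
import asyncio


class LLMError(Exception):
    """Raised by a provider call on rate limit or API error (hypothetical type)."""


async def call_with_fallback(prompt, providers, timeout_s=5.0):
    """Try each (name, async_fn) in order; return (name, reply) from the first that succeeds."""
    errors = []
    for name, fn in providers:
        try:
            reply = await asyncio.wait_for(fn(prompt), timeout=timeout_s)
            return name, reply
        except (LLMError, asyncio.TimeoutError) as exc:
            # This is where a fallback-trigger event would be logged for alerting.
            errors.append((name, repr(exc)))
    raise LLMError(f"all providers failed: {errors}")
```

Returning the provider name alongside the reply makes it straightforward to log which model actually served each message.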
S3
RunPod → Fal.ai Migration
P1 High
Move backup LLM off degrading RunPod infrastructure onto Fal.ai. Validate fully before decommissioning.
RunPod quality has declined significantly over the last 3-4 months. Fal.ai is more reliable and cheaper. Tasks are sequential and well-scoped — one sprint is the right call.
AI/ML Tasks
  • Fal.ai endpoint setup
    Deploy the backup LLM Docker image (D2C-4561 Dockerfile) to Fal.ai. Resolve any Fal.ai-specific config differences. No GPU weights needed since OpenRouter handles the primary LLM.
  • Cold start characterisation
    Measure Fal.ai cold start time. Determine if cold starts are acceptable for a backup LLM or if we need warm instances. Document recommendation with cost implications.
  • Smoke test + quality benchmark
    Run 50 conversations through the new Fal.ai endpoint. Compare output quality vs RunPod baseline. Flag any regressions with specific examples.
    👁 Manual review of sample outputs required
  • Latency + cost benchmark
    Compare P50/P95 latency and cost-per-request: RunPod vs Fal.ai. Confirm Fal.ai is cheaper and within latency SLA for a backup system.
  • Stress test + failover integration test
    Load test Fal.ai at the concurrency cap from S2. Trigger failover from Grok → Fal.ai. Verify: seamless conversation continuity, correct model switch, no duplicate charges.
  • Decommission RunPod + rollback plan
    Once all tests pass, shut down RunPod instances and update all documentation. Document rollback procedure in case Fal.ai has issues in its first week of production.
suggest → BE: Update backend endpoint URL to Fal.ai
suggest → BE: Fal.ai webhook auth (D2C-3910)
S4 · Part 1
GCP Audit + Erin Transition Coordination
P1 High Split
Understand exactly what's in GCP and plan the migration carefully before touching anything.
Why split: Migrating large model weights and datasets between cloud providers is data-size dependent — can take multiple days of transfer time. Audit first so Part 2 is a clean execution sprint.
AI/ML Tasks
  • GCP storage audit
    Catalog everything in GCP: training datasets, model weights, generated images, logs, scheduled jobs. Classify each: keep (cold archive) / migrate to Fal.ai / delete.
  • Cost breakdown by service
    Pull GCP billing data. Break down monthly spend by compute (training jobs), storage (buckets), and networking (egress). Identify top cost drivers and project savings after migration.
  • Identify active write paths
    Check which GCP buckets are still being actively written to. Any active write path means something in prod still uses GCP — must be accounted for before deletion.
  • Coordinate Erin transition
    Erin still uses the old GCP training pipeline. Document the Fal.ai training workflow, walk her through it, and confirm she can run training on Fal.ai independently before we kill GCP access.
  • Identify GCP-dependent scheduled jobs
    Check for any cron jobs, Cloud Functions, or Vertex AI pipelines running on a schedule. These are easy to miss and will break silently after migration.
  • Document minimal GCP footprint plan
    Define exactly what stays in GCP (Docker + GPU for experiments). Estimate monthly cost. This becomes the target state after Part 2 and the reference for future spend reviews.
S4 · Part 2
Asset Migration + Pipeline Deprecation + GCP Cleanup
P1 High
Execute the migration plan from Part 1. Leave GCP in minimal footprint state only. Realise cost savings.
AI/ML Tasks
  • Migrate active assets to Fal.ai
    Transfer training datasets and model weights identified in Part 1. Verify checksums post-transfer to confirm no data corruption.
  • Deprecate old model training pipeline
    Archive/delete old GCP training scripts, jobs, and associated storage. Confirm old model is not referenced in any live system before deletion.
    Decision: drop the old model completely. A better model is coming, and the cost saving outweighs keeping it.
  • GCP storage cleanup
    Delete deprecated buckets and assets. Confirm active write paths from Part 1 have been rerouted or shut down.
  • Post-migration cost validation
    Pull GCP billing after cleanup. Confirm actual cost reduction matches the projection from Part 1. Flag if anything unexpected is still running.
  • Update all internal documentation
    Update any internal docs, READMEs, or runbooks that reference GCP storage paths, training procedures, or model endpoints. Stale docs cause incidents.
suggest → BE: Confirm no backend services reference old GCP model endpoints
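The post-transfer checksum step can be sketched as follows. This assumes a manifest of SHA-256 hashes was recorded on the source side before transfer and that the migrated files are reachable as local paths; `sha256_of` and `verify_transfer` are illustrative names, not existing tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model weights do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(manifest: dict[str, str], dest_root: Path) -> list[str]:
    """Return relative paths that are missing or whose hash differs from the manifest."""
    bad = []
    for rel_path, expected in manifest.items():
        target = dest_root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad
```

An empty list from `verify_transfer` is the signal that the source copies are safe to delete.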
S5
NSFW LoRA Taxonomy + Broken Image Classifier
P1 High
Lock the LoRA category taxonomy. Design, build, and benchmark a working automated broken image classifier.
AI/ML Tasks
  • Define NSFW LoRA category taxonomy
    Finalize all LoRA categories for the full escalation sequence. Existing: selfie z-image, cleavage, ass in underwear/lingerie, nipslip, removing clothes. Define remaining NSFW tiers, category-to-escalation-step mapping, and estimated training effort per category.
    📄 Output: spec doc — product sign-off required before training begins
  • Broken image classifier — design + build
    Multimodal AI audit pipeline. Input: batch of images. Model: Grok vision vs Claude multimodal (evaluate cost vs accuracy). Classifications: good / broken / wrong_creator / wrong_category. Parallel processing, auto-regen up to 3x before manual flag.
  • Prompt versioning setup for classifier
    Before iterating, set up prompt versioning (Langfuse or equivalent). Every classifier prompt version must be tracked so we can compare accuracy across iterations and roll back if needed.
  • Classifier benchmark + false positive analysis
    Run against the 50-image manually labelled sample from S1. Target: >90% precision on broken. Separately analyse false positives — which types of good images does the classifier incorrectly flag?
  • SFW pre-gen batch run + classifier audit
    Run full SFW pre-gen batch for all active creators through the classifier. Review flagged images manually. Iterate on prompt if precision drops at scale vs the benchmark.
  • Per-creator QA report
    After the batch run, produce a per-creator quality report: images generated, images flagged, auto-regenned count, final pass rate. This becomes the health metric for the pre-gen pipeline.
suggest → BE: DB schema — category, escalation_tier, quality_status, retry_count fields
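The benchmark target (>90% precision on broken) and the false positive analysis can both be computed from two parallel lists: the manual labels from S1 and the classifier's outputs. A minimal sketch with illustrative function names; the label strings follow the classifications listed above.

```python
from collections import Counter


def broken_precision_recall(labels, predictions, positive="broken"):
    """Precision and recall for one class, from human labels and classifier outputs."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if p == positive and y == positive)
    fp = sum(1 for y, p in pairs if p == positive and y != positive)
    fn = sum(1 for y, p in pairs if p != positive and y == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def false_positive_breakdown(labels, predictions, positive="broken"):
    """Count the true labels of images the classifier wrongly flagged as `positive`."""
    return Counter(y for y, p in zip(labels, predictions) if p == positive and y != positive)
```

With only 50 labelled images, a single misclassification moves precision by several points, so the result is best read as a rough signal rather than a tight estimate.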
S6 · Part 1
NSFW LoRA Training — Tier-1 (All Four Categories)
P2 Medium Split
Train and QA all four Tier-1 LoRA categories. Manual visual QA per creator for every category.
Why split: Each LoRA = dataset prep + LoRA strength experimentation + hours of training + manual visual QA per creator + iteration rounds. Four LoRAs fills a sprint. Combining with classifier build and integration is too heavy.
AI/ML Tasks
  • Pre-training dataset audit
    Before training any LoRA, audit training images for each category: correct labels, no duplicates, sufficient variety, no low-quality samples. Bad training data = wasted training runs.
  • Tier-1 LoRA training: selfie z-image
    Follow Lucky's 6-step workflow. Experiment with creator LoRA strength. Manual visual QA per creator: face consistency, pose correctness, escalation stage accuracy.
    👁 Manual visual inspection required per creator
  • Tier-1 LoRA training: cleavage
    Same workflow. Extra attention to anatomical artifacts in the chest area. Multiple iteration rounds expected.
    👁 Manual visual inspection required per creator
  • Tier-1 LoRA training: ass in underwear/lingerie
    Fabric rendering and body proportion are the main failure modes. QA must check both carefully.
    👁 Manual visual inspection required per creator
  • Tier-1 LoRA training: nipslip
    Highest artifact-risk category in the escalation sequence. Thorough visual QA required. Expect the most iteration rounds of any Tier-1 category.
    👁 Manual visual inspection required per creator
  • LoRA strength calibration doc
    Document optimal LoRA strength settings found for each category and each creator during experimentation. This is institutional knowledge — without it, the next person retrains from scratch.
S6 · Part 2
Live Broken Image Classifier + LoRA Integration
P2 Medium
Ship a real-time quality gate for live image generation. Wire Tier-1 LoRAs into the pre-gen pipeline and validate at scale.
AI/ML Tasks
  • Live broken image classifier
    Lightweight, fast classifier for real-time generation (separate from S5 batch classifier). Must complete in <500ms. Broken → silently regen, no charge to user, log event. Good → deliver, charge normally.
    suggest → BE: Retry endpoint + "no charge" flag on regen
    suggest → FE: Show loading state during regen
  • Classifier latency profiling
    Profile the live classifier under load. If P95 exceeds 500ms, identify bottleneck (model call, image download, post-processing) and optimise.
  • Regen loop monitoring
    Instrument how often images break and how many regen attempts are needed. High regen rates = LoRA quality issue, not a classifier issue. Log separately so we can act on the right thing.
  • LoRA integration into pre-gen pipeline
    Wire all four Tier-1 LoRAs into the batch pre-gen system from S5. Run full NSFW Tier-1 pre-gen batch for active creators. Audit outputs through the batch classifier.
  • Pre-gen pipeline run report
    After the NSFW batch: images generated per category per creator, failure rates, regen counts, final inventory size. Product needs this to know what content is available before surfacing it.
  • A/B test instrumentation plan
    Define the measurement plan for whether broken image elimination improves conversion. What events to track, control vs treatment, primary metric. Hand to product/BE to wire up.
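The regen behaviour and its monitoring can be sketched as a single loop. This is an illustration under assumptions: `generate` and `classify` are hypothetical callables, and the default of 3 regens is borrowed from the S5 batch rule, since no cap is stated for the live path. Charging logic is left to BE.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    image: object              # delivered image, or None if every attempt was broken
    attempts: int              # total generations, including the first
    needs_manual_review: bool
    events: list = field(default_factory=list)  # one entry per attempt, for monitoring


def generate_with_quality_gate(generate, classify, max_regens=3):
    """Generate, classify, and regenerate up to `max_regens` times until the verdict is 'good'."""
    events = []
    for attempt in range(1, max_regens + 2):  # 1 initial attempt + max_regens retries
        image = generate()
        verdict = classify(image)
        events.append({"attempt": attempt, "verdict": verdict})
        if verdict == "good":
            return GateResult(image, attempt, False, events)
    return GateResult(None, max_regens + 1, True, events)
```

The `events` list is what the regen loop monitoring task would aggregate: a high average `attempts` per category points at LoRA quality rather than the classifier.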
S7
Smart Suggestions — Dynamic LLM Conversation Prompts
P2 Medium
After each creator message, the LLM returns 2-3 contextual suggestions so users always know what to say next. Reduces drop-off from users who lose momentum.
Full sprint. New endpoint, prompt engineering, diversity logic, persona alignment, and load testing. Existing Icebox ticket: D2C-3539 (Implement Suggestion Mode).
AI/ML Tasks
  • Spike + design
    Test Grok at generating contextual suggestions from real conversation history samples. Define behaviour: trigger timing, input (last N messages + creator persona), output format (2-3 ranked strings), edge cases (first message, post-paywall, post-media-share).
    👁 Manual review of suggestion quality — defines the quality bar. Output: design doc.
  • Suggestion endpoint — build
    Build a /suggest endpoint (or extend chat response to return suggestions[] optionally). Must be async and non-blocking relative to main chat response. Claude Code for scaffolding. Reference D2C-3539.
  • Prompt engineering: contextual quality
    Iteratively tune the suggestion prompt using real conversation history samples. Target: suggestions feel like a natural nudge, not a chatbot menu. Multiple prompt iterations with manual review each round.
  • Suggestion diversity logic
    Ensure suggestions don't repeat across consecutive turns, cover different types (question / action / flirt / media prompt), and vary in intensity based on conversation stage. Implement deduplication and variety scoring.
  • Creator persona alignment
    Suggestions must match each creator's voice and style — a suggestion that sounds like the wrong creator breaks immersion. Test each suggestion type against each active creator's persona profile.
  • Stress test + latency validation
    Load test at realistic volume (fires after every message). Suggestions must not add latency to the chat response. Document P95 and confirm fully non-blocking behaviour.
suggest → FE: Tappable chips below last creator message
suggest → BE: Call suggestion endpoint async/non-blocking
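One simple way to approach the suggestion diversity logic is a filter over the LLM's ranked candidates. The sketch below assumes each candidate arrives tagged with a type (question / action / flirt / media prompt) and that recent suggestions are available as plain strings; it covers deduplication and type variety only, not intensity by conversation stage.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def pick_diverse(candidates, recent, k=3):
    """Pick up to k suggestions, skipping recent repeats and using each type at most once.

    `candidates` is a ranked list of (text, type) pairs, best first.
    """
    seen_text = {normalise(t) for t in recent}
    seen_types = set()
    picked = []
    for text, kind in candidates:
        key = normalise(text)
        if key in seen_text or kind in seen_types:
            continue
        picked.append(text)
        seen_text.add(key)
        seen_types.add(kind)
        if len(picked) == k:
            break
    return picked
```

Exact-match deduplication will miss paraphrases; embedding similarity would be a natural next step if manual review shows repeats slipping through.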
S8
User Media Input — Grok Vision + Guardrails
P2 Medium
Enable users to send images, audio, and video to the AI. Grok handles the heavy lifting — low effort, high impact.
AI/ML Tasks
  • Grok multimodal input — spike
    Test Grok's vision API with sample user-sent images across categories: selfie, clothing, body. Evaluate response quality, persona consistency, tone appropriateness. Identify any categories Grok handles poorly.
    👁 Manual review of AI responses required — tone and persona need careful checking
  • Media-type specific prompt engineering
    Creator responses to a selfie should feel different from an outfit photo or a body pic. Build and test prompt variations per media type so reactions feel natural and persona-consistent, not generic.
  • LLM API: multimodal input mode
    Extend LLM API endpoint to accept media_type + media_url alongside text. Route to Grok vision model. Text-only mode must be completely unaffected. Claude Code for scaffolding.
  • Media content guardrails
    Before user-uploaded media reaches Grok, classify it: reject illegal/underage content, log all media inputs, fail safe (reject) on classifier uncertainty.
    ⚠ Compliance requirement — must not be skipped or minimised
  • Static media suggestion prompt library
    Build 20-30 suggestion prompts that encourage users to share media naturally ("Show me what you're wearing", "Share a selfie — I'll rate it"). Static baseline that S7 smart suggestions will augment over time.
    suggest → FE: Surface as suggestion chips in chat input (Snapchat-style UX)
  • Stress test + payload size limits
    Load test multimodal endpoint with realistic image payload sizes. Measure latency delta vs text-only. Document max payload size, P95 latency, and recommended FE compression settings.
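The fail-safe rule in the media guardrails task (reject on classifier uncertainty, log everything) can be sketched as below. The classifier interface, the `"safe"` label, and the 0.9 confidence threshold are all assumptions for illustration; the real threshold should come from compliance review.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REJECT = "reject"


def gate_media(classify, media_url, min_confidence=0.9, audit_log=None):
    """Allow media only when the classifier confidently says it is safe; reject otherwise."""
    try:
        label, confidence = classify(media_url)
    except Exception:
        label, confidence = "error", 0.0  # a classifier failure counts as uncertainty
    allowed = label == "safe" and confidence >= min_confidence
    decision = Decision.ALLOW if allowed else Decision.REJECT
    if audit_log is not None:
        # Every input is logged, whether it was allowed or rejected.
        audit_log.append({
            "media_url": media_url,
            "label": label,
            "confidence": confidence,
            "decision": decision.value,
        })
    return decision
```

The key property is that there is exactly one path to `ALLOW`; errors, low confidence, and unknown labels all fall through to `REJECT`.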
S9 · Part 1
Intimacy Meter: Spec + Memory + Knowledge Graph Design
P2 Medium Split
Produce complete, product-signed-off design docs before any code is written. This design touches product, AI, BE, and FE.
Why split: Spec decisions have downstream consequences for DB schema, LLM prompting, content unlocking, and frontend. One week for design, one week for build. Never build on an unreviewed design.
AI/ML Tasks
  • Competitive analysis
    How do Replika, Candy.ai, and OnlyFans handle user progression and retention mechanics? What works, what feels manipulative, what converts? Informs the spec significantly.
  • Intimacy progression spec
    Tier 1 (0–50 chats): SFW / flirty → selfie content. Tier 2 (50–100): Suggestive → cleavage/lingerie. Tier 3 (100–500): Revealing → nipslip/removing clothes. Tier 4 (500+): Explicit → full NSFW. Decay mechanic if user goes silent. Per-creator tracking.
    📄 Product sign-off required before Part 2 begins
  • Memory management system — design
    What to remember: preferences, shared facts, emotional moments, media shared. Storage: structured summary per user-creator pair. Retrieval: injected at session start. Evaluate: full history vs compressed summary vs RAG.
  • Knowledge graph — design + tooling evaluation
    User-creator relationship graph. Nodes: user, creator, memory events, media, milestones. Edges: shared_with, reacted_to, milestone_reached, unlocked. Evaluate: Redis graph vs Postgres jsonb vs dedicated graph DB.
  • Memory privacy policy definition
    Define what the system can and cannot store (PII rules, retention period, right to delete), how extracted memories are scoped to user-creator pairs. Must be defined before build, not after.
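The tier thresholds in the intimacy progression spec map naturally to a lookup. Note that the ranges as written share their endpoints (50, 100, 500), so the spec will need to say which tier a boundary count belongs to. The sketch below assumes the higher tier wins; that choice is an assumption for illustration, not a product decision. Decay is not modelled.

```python
import bisect

# Lower bounds of Tiers 2, 3, and 4 from the spec.
# ASSUMPTION: a boundary count belongs to the higher tier, i.e. half-open
# ranges [0, 50), [50, 100), [100, 500), [500, inf).
TIER_LOWER_BOUNDS = [50, 100, 500]


def intimacy_tier(chat_count: int) -> int:
    """Return the tier (1-4) for a user-creator pair's chat count."""
    if chat_count < 0:
        raise ValueError("chat_count must be non-negative")
    return bisect.bisect_right(TIER_LOWER_BOUNDS, chat_count) + 1
```

Keeping the thresholds in one list means product can retune them without touching the lookup logic.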
S9 · Part 2
Memory System v1 Build
P2 Medium
Implement the memory summarization and injection pipeline. Only begins after product signs off on the Part 1 spec.
AI/ML Tasks
  • Memory summarization pipeline
    After each session: extract key facts + emotional signals from conversation history. Store as structured summary per user-creator pair. Claude Code for scaffolding.
    👁 Manual review of extracted memory samples required — check for accuracy, hallucination, privacy
  • Memory injection into system prompt
    Inject the stored summary at session start. Test that injected memory reads naturally in context and does not interfere with the creator persona or conversation flow.
  • Memory decay logic
    Implement staleness handling: memories older than X days or contradicted by newer conversation are deprioritised or pruned. Define decay parameters based on expected chat frequency.
  • Memory quality evaluation
    Run 20 real conversation samples through the full pipeline. Manually verify: are the right things remembered? Is anything hallucinated? Does memory injection improve conversation feel vs without?
  • Memory performance benchmark
    Measure latency impact of memory retrieval + injection on session start time. If injection adds >200ms, optimise the retrieval query or consider async pre-loading.
  • A/B test instrumentation plan for memory
    Define how we measure whether memory improves retention: events to track, control (no memory) vs treatment (with memory), primary metric (D7 retention or session length). Hand to product/BE to wire.
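The memory decay logic could start from a sketch like the one below. It makes two loud assumptions: memories are stored as keyed facts, so a newer memory with the same key is treated as contradicting the older one, and the "X days" staleness window defaults to a placeholder of 90 days. Both are illustrative and should be replaced by whatever the Part 1 spec decides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    key: str              # what the fact is about, e.g. "favourite_food" (hypothetical schema)
    value: str
    recorded_at: datetime


def prune_memories(memories, now, max_age_days=90):
    """Keep the newest memory per key and drop anything older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    newest = {}
    for m in memories:
        if m.recorded_at < cutoff:
            continue  # stale: older than the decay window
        current = newest.get(m.key)
        if current is None or m.recorded_at > current.recorded_at:
            newest[m.key] = m  # a newer value for the same key supersedes the older one
    # Most recent first, so injection can truncate from the end if the prompt is tight.
    return sorted(newest.values(), key=lambda m: m.recorded_at, reverse=True)
```

This version prunes outright; a softer variant would keep stale memories but rank them lower at injection time.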

Backlog & Voice Note

Avatar and video pre-generation — deprioritised until the image pipeline is fully stable. Generic only, no per-creator personalisation. SFW always before NSFW.

Wan2.2 Evaluation (D2C-4681)
Use up the $20 Mulerouter coupon. Test I2V, spicy image edit, and face swap. Write go/no-go memo. Unblocks everything below.
Generic Avatar Pre-gen
Generate a library of generic avatar animations (no creator likeness). Evaluate Wan2.2 for avatar-style generation.
Video Pre-gen Pipeline
Mirror image pre-gen pipeline for short clips. Same escalation taxonomy. SFW first — NSFW video is significantly more time-intensive.
Video Broken-clip Classifier
Frame-by-frame vs whole-clip classification. Reuse classifier design from S5 where possible.
🔇
Voice — NOT AI Team Work
Voice is tightly coupled with backend infrastructure, not the AI/ML stack. ElevenLabs SDK interacts directly with the frontend. Adding AI layers in between hurts latency — inference speed is the #1 constraint for voice. This has been communicated to Dave.

Voice is owned by backend. AI team has no sprint items for voice. If voice quality issues are LLM-related (response tone, persona in voice context), AI team can advise on prompting only.

Key Principles

For the product team — context on why the plan is structured this way.

1
Invisible work is not optional
Stress testing, infra migrations, load benchmarks, and observability don't ship visible features but gate everything downstream. Do not compress these sprints.
2
Manual QA is the bottleneck, not code
Every generated image, LoRA output, LLM response, classifier verdict, and extracted memory needs a human eye. The automated classifier (S5–S6) reduces this over time but doesn't eliminate it.
3
SFW always before NSFW
NSFW LoRA has higher artifact rates and more QA rounds. Always validate SFW first before moving to NSFW — for both images and video.
4
Grok is the LLM bet
NSFW-capable, cheaper, multimodal. Everything here assumes Grok via OpenRouter as primary. RunPod LLM is backup only → migrated to Fal.ai → eventually deprecated.
5
Design sign-off gates build
Intimacy meter and memory system touch multiple systems. Product must approve the S9 Part 1 spec before Part 2 build begins. Half-built is worse than not started.
6
Smart suggestions are a retention multiplier
Users who don't know what to say next drop off. Suggestions remove that friction. Faster to ship than the intimacy meter and delivers immediate retention impact.
7
Document everything non-obvious
LoRA strength settings, classifier prompt versions, Fal.ai cold start behaviour, GCP cost breakdowns — if it's not written down it will be rediscovered the hard way.
8
Claude Code accelerates but doesn't replace review
We use Claude Code for scaffolding and boilerplate. Every AI output — images, LLM responses, classifier verdicts, extracted memories — needs a human eye before shipping.