AI Lead · Aloukik Aditya

AI/ML Roadmap

A prioritised, sprint-by-sprint plan for the AI & ML team. Adjust, cut, or add based on product decisions — this is the baseline.

Owner: Aloukik (AI Lead)

Sprint Cadence: 1 Week

Last Updated: May 14, 2026

Tooling: Claude Code


Vision

North Star

Move from reactive AI fixes to a proactive AI platform that drives onboarding, retention, and revenue.

🎯

Best-in-Class Onboarding

LLM-driven image escalation that converts every new user at peak desire.

⚡

Cheapest NSFW Stack

Grok via OpenRouter as primary, Fal.ai for infra — lowest cost, highest capability.

🖼️

Automated Quality Gates

Pre-generated content with AI classifiers so users never see a broken image.

💬

Smart Engagement

Dynamic suggestions and intimacy progression that keep users coming back.

At a Glance

Charts & Tables

Visual breakdown of sprint distribution, priority allocation, and task load.

[Chart: Priority Distribution of sprints (P0 Urgent, P1 High, P2 Medium, P3 Low)]

[Chart: Tasks per Sprint]

[Chart: Sprint Timeline, Estimated Week-by-Week]

[Table: Sprint Master Table. Columns: Sprint, Theme, Priority, Tasks, Split, Key Output]

At a Glance

Sprint Overview

14 sprints across 4 priority tiers. Splits are made only where a task is genuinely too large for one week.

Sprint 1

E2E Verification — Image Escalation

P0 Urgent

Sprint 2 · Part 1

Grok NSFW Validation + Endpoint Audit

P0 Urgent · Split

Sprint 2 · Part 2

OpenRouter Stress Test + Monitoring

P0 Urgent

Sprint 3

RunPod → Fal.ai Migration

P1 High

Sprint 4 · Part 1

GCP Audit + Erin Transition

P1 High · Split

Sprint 4 · Part 2

Asset Migration + GCP Cleanup

P1 High

Sprint 5

NSFW LoRA Taxonomy + Image Classifier

P1 High

Sprint 6 · Part 1

NSFW LoRA Training — Tier-1

P2 Medium · Split

Sprint 6 · Part 2

Live Image Classifier + LoRA Integration

P2 Medium

Sprint 7

Smart Suggestions — Dynamic LLM Prompts

P2 Medium

Sprint 8

User Media Input — Grok Vision

P2 Medium

Sprint 9 · Part 1

Intimacy Meter Design + Memory System

P2 Medium · Split

Sprint 9 · Part 2

Memory System v1 Build

P2 Medium

Sprint 10+

Intimacy Meter Full Build + Rollout

P3 Low

Detailed Plan

Sprint Breakdown

Every task is AI/ML team work only. Backend and frontend items are flagged as suggestions — not AI tasks.

Confirm Toan's image escalation work is production-ready before it becomes the core onboarding experience at scale.

Image escalation is the most important piece of onboarding, and Toan pushed it recently. The AI team must verify the full pipeline E2E before product ships it to users at scale.

AI/ML Tasks

E2E pipeline walkthrough

Run the full image escalation sequence (dressed → undressed, 6 stages) for Sophie and Lena. Manually inspect each output for: correct escalation order, no broken/artifact images, creator identity preserved across all stages.

👁 Manual visual inspection required
Stress test: image generation concurrency

Write a load test script (Claude Code) that hits the endpoint at 10 / 25 / 50 concurrent requests. Record P50/P95 latency, failure rate, and broken image rate under load.

📊 Output: pass/fail report + recommended concurrency cap
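A minimal sketch of the load test harness, assuming an async client. `fake_generate` is a hypothetical stand-in for the real image generation call (its latency and roughly 10% failure rate are simulated), and the nearest-rank `percentile` helper is one of several reasonable ways to compute P50/P95.

```python
import asyncio
import math
import random
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


async def fake_generate(rng):
    """Hypothetical stand-in for the generation endpoint; True means success."""
    await asyncio.sleep(rng.uniform(0.001, 0.005))
    return rng.random() > 0.1  # simulated failure rate, not a measured one


async def run_level(concurrency, request_fn):
    """Fire `concurrency` requests at once and summarise latency and failures."""
    async def timed():
        start = time.perf_counter()
        try:
            ok = await request_fn()
        except Exception:
            ok = False  # any raised error counts as a failed request
        return time.perf_counter() - start, ok

    results = await asyncio.gather(*(timed() for _ in range(concurrency)))
    latencies = [latency for latency, _ in results]
    failures = sum(1 for _, ok in results if not ok)
    return {
        "concurrency": concurrency,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "failure_rate": failures / concurrency,
    }


def run_load_test(levels=(10, 25, 50), seed=0):
    rng = random.Random(seed)

    async def main():
        return [await run_level(n, lambda: fake_generate(rng)) for n in levels]

    return asyncio.run(main())


report = run_load_test()
for row in report:
    print(row)
```

Swapping `fake_generate` for the real request coroutine keeps the rest unchanged. Broken image rate would need a per-response check that is not modelled here.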
Broken image audit

Sample 50 pre-generated images from DB across all escalation tiers. Manually classify: good / broken / wrong creator. Document failure rate per tier.

🏷 Labelled dataset becomes the classifier benchmark in S5
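The per-tier failure rate can be tallied with a few lines once the sample is labelled. This is a sketch only: the `(tier, label)` rows below are made-up illustration data, and `wrong_creator` is an assumed spelling of the "wrong creator" label.

```python
from collections import Counter, defaultdict

# Hypothetical labelled sample: (escalation_tier, label). Not real audit data.
labels = [
    (1, "good"), (1, "good"), (1, "broken"),
    (2, "good"), (2, "wrong_creator"),
    (3, "broken"), (3, "broken"), (3, "good"), (3, "good"),
]


def failure_rate_per_tier(rows):
    """Count labels per tier; anything other than 'good' is a failure."""
    counts = defaultdict(Counter)
    for tier, label in rows:
        counts[tier][label] += 1
    report = {}
    for tier, tally in sorted(counts.items()):
        total = sum(tally.values())
        report[tier] = {
            "total": total,
            "failure_rate": (total - tally["good"]) / total,
        }
    return report


tier_report = failure_rate_per_tier(labels)
for tier, stats in tier_report.items():
    print(tier, stats)
```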
Tattoo regression check

New model has known tattoo degradation vs old model. Inspect escalation images for tattooed creators. Flag to product if quality is below bar with visual evidence.
Observability audit

Check what logging exists: Are broken image events logged? Are escalation stage transitions traceable? Are generation errors surfaced? Document gaps and flag critical missing logs.

⚠ We can't debug production issues without logs
E2E test runbook

Document the full E2E verification process as a repeatable runbook. Next time someone pushes to this pipeline, the runbook is the checklist.

suggest → BE: Error handling + retry on escalation endpoint

suggest → BE: Pre-gen images indexed in DB with tier tags

S2 · Part 1

Validate Grok's quality and NSFW capability before committing it as primary LLM.

Why split: Manually reviewing 100 conversations for quality + persona + NSFW compliance is 2-3 days of work alone. Infra stress testing belongs in Part 2 so neither gets rushed.

AI/ML Tasks

OpenRouter endpoint audit

Review current integration (D2C-4559 / D2C-4561). Verify all modes route correctly: chat, image intent detection, guardrails, suggestion mode. Document any gaps.
Grok NSFW capability test

Run 100 representative onboarding conversations through Grok. Score each on: NSFW compliance rate, persona consistency, message quality, response tone vs current LLM baseline.

👁 Manual review of outputs is the bulk of this sprint
Prompt compatibility audit

Our prompts were tuned for Claude/current LLM. Test all active Langfuse prompt templates against Grok. Identify any that produce worse outputs. Document required prompt adjustments.
Multi-turn persona consistency test

Run 20 extended conversations (10+ turns each) through Grok. Verify persona, tone, and NSFW escalation remain consistent across the full conversation — not just the first message.

⚠ Single-turn tests are not enough to validate primary LLM behaviour
Content filter false positive retest

D2C-4080: current LLM blocks its own suggestions. Rerun these exact cases with Grok to determine if this is model-specific or pipeline-level.
Cost-per-conversation estimate

Based on token counts from the 100-conversation test, compute average cost/conversation on Grok vs current LLM. Produce a concrete monthly cost projection at current DAU.
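The arithmetic behind the estimate is simple enough to sketch. All numbers below are placeholders: the per-million-token prices, the DAU, and the conversations per user per day are assumptions for illustration, not real Grok or OpenRouter pricing or our actual traffic.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    usd_per_million_input: float
    usd_per_million_output: float


def cost_per_conversation(conversations, pricing):
    """conversations: list of (input_tokens, output_tokens) per conversation."""
    total = sum(
        inp * pricing.usd_per_million_input / 1e6
        + out * pricing.usd_per_million_output / 1e6
        for inp, out in conversations
    )
    return total / len(conversations)


def monthly_projection(avg_cost, dau, conversations_per_user_per_day, days=30):
    return avg_cost * dau * conversations_per_user_per_day * days


# Placeholder inputs, for illustration only.
placeholder_pricing = Pricing(usd_per_million_input=1.0, usd_per_million_output=2.0)
sample = [(1000, 500), (3000, 1500)]

avg_cost = cost_per_conversation(sample, placeholder_pricing)
monthly = monthly_projection(avg_cost, dau=1000, conversations_per_user_per_day=2)
print(f"avg cost/conversation: ${avg_cost:.4f}, monthly: ${monthly:.2f}")
```

Running the same function with each model's pricing over the same 100-conversation token counts gives the side-by-side comparison.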

S2 · Part 2

Validate OpenRouter is infrastructure-ready at scale. Define rate limit strategy, wire monitoring, and confirm fallback chain works.

AI/ML Tasks

Stress test: OpenRouter concurrent requests

Load test at 10 / 50 / 100 / 200 concurrent requests. Measure: latency, rate limit hits, error rate. Document max safe concurrency + recommended alerting thresholds for BE.
Rate limit strategy

Define what happens when OpenRouter rate limits are hit: queue and retry, immediate fallback, or degraded experience. Document strategy with latency and cost tradeoffs.
Fallback chain test

Simulate Grok failure (timeout, rate limit, API error). Verify fallback to backup LLM is seamless — no user-facing error, no broken conversation state, no duplicate charges.
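One way to exercise the fallback chain in a test is to inject a failing primary. The sketch below assumes providers are plain callables and that every failure mode (timeout, rate limit, API error) surfaces as a single hypothetical `ProviderError`. The retry count and backoff are illustrative defaults, not a decided strategy.

```python
import time


class ProviderError(Exception):
    """Hypothetical error raised on timeout, rate limit, or API error."""


def call_with_fallback(prompt, providers, retries=1, backoff_s=0.0, sleep=time.sleep):
    """Try each (name, fn) in order; return (name, reply) from the first success."""
    errors = []
    for name, fn in providers:
        for attempt in range(retries + 1):
            try:
                return name, fn(prompt)
            except ProviderError as exc:
                errors.append((name, attempt, str(exc)))
                if attempt < retries:
                    sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise ProviderError(f"all providers failed: {errors}")


def failing_primary(prompt):
    raise ProviderError("simulated rate limit")


def backup(prompt):
    return f"backup reply to: {prompt}"


used, reply = call_with_fallback(
    "hi", [("primary", failing_primary), ("backup", backup)]
)
print(used, reply)
```

The returned provider name is what would be logged to confirm the correct model switch. Conversation state and billing checks sit outside this sketch.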
Cost tracking instrumentation

Log token usage and model name per request via OpenRouter. Track actual cost/conversation before and after the switch to validate the pricing advantage in production.
Error monitoring setup

Define what AI-layer errors we alert on: Grok failures, fallback triggers, high broken-image rates, guardrail rejection spikes. Document alert thresholds and owner.
Failover runbook

Document the manual failover procedure: how to switch primary LLM, rollback to previous model, and verify the switch is working. Required before Grok goes to prod.

suggest → BE: Parameterize LLM model in config (D2C-4581)

suggest → BE: Add model name to conversation logs

Move backup LLM off degrading RunPod infrastructure onto Fal.ai. Validate fully before decommissioning.

RunPod quality has declined significantly over the last 3-4 months. Fal.ai is more reliable and cheaper. Tasks are sequential and well-scoped — one sprint is the right call.

AI/ML Tasks

Fal.ai endpoint setup

Deploy the backup LLM Docker image (D2C-4561 Dockerfile) to Fal.ai. Resolve any Fal.ai-specific config differences. No GPU weights needed since OpenRouter handles primary.
Cold start characterisation

Measure Fal.ai cold start time. Determine if cold starts are acceptable for a backup LLM or if we need warm instances. Document recommendation with cost implications.
Smoke test + quality benchmark

Run 50 conversations through the new Fal.ai endpoint. Compare output quality vs RunPod baseline. Flag any regressions with specific examples.

👁 Manual review of sample outputs required
Latency + cost benchmark

Compare P50/P95 latency and cost-per-request: RunPod vs Fal.ai. Confirm Fal.ai is cheaper and within latency SLA for a backup system.
Stress test + failover integration test

Load test Fal.ai at the concurrency cap from S2. Trigger failover from Grok → Fal.ai. Verify: seamless conversation continuity, correct model switch, no duplicate charges.
Decommission RunPod + rollback plan

Once all tests pass, shut down RunPod instances and update all documentation. Document rollback procedure in case Fal.ai has issues in its first week of production.

suggest → BE: Update backend endpoint URL to Fal.ai

suggest → BE: Fal.ai webhook auth (D2C-3910)

S4 · Part 1

Understand exactly what's in GCP and plan the migration carefully before touching anything.

Why split: Migrating large model weights and datasets between cloud providers is data-size dependent — can take multiple days of transfer time. Audit first so Part 2 is a clean execution sprint.

AI/ML Tasks

GCP storage audit

Catalog everything in GCP: training datasets, model weights, generated images, logs, scheduled jobs. Classify each: keep (cold archive) / migrate to Fal.ai / delete.
Cost breakdown by service

Pull GCP billing data. Break down monthly spend by compute (training jobs), storage (buckets), and networking (egress). Identify top cost drivers and project savings after migration.
Identify active write paths

Check which GCP buckets are still being actively written to. Any active write path means something in prod still uses GCP — must be accounted for before deletion.
Coordinate Erin transition

Erin still uses the old GCP training pipeline. Document the Fal.ai training workflow, walk her through it, and confirm she can run training on Fal.ai independently before we kill GCP access.
Identify GCP-dependent scheduled jobs

Check for any cron jobs, Cloud Functions, or Vertex AI pipelines running on a schedule. These are easy to miss and will break silently after migration.
Document minimal GCP footprint plan

Define exactly what stays in GCP (Docker + GPU for experiments). Estimate monthly cost. This becomes the target state after Part 2 and the reference for future spend reviews.

S4 · Part 2

Execute the migration plan from Part 1. Leave GCP in minimal footprint state only. Realise cost savings.

AI/ML Tasks

Migrate active assets to Fal.ai

Transfer training datasets and model weights identified in Part 1. Verify checksums post-transfer to confirm no data corruption.
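Checksum verification can be sketched with the standard library, assuming a manifest of SHA-256 digests is recorded on the source side before transfer. The manifest format and the `verify_transfer` helper are hypothetical; the demo uses a temporary directory in place of the real destination.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so large weight files do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(manifest, dest_dir):
    """manifest: {relative_path: expected_sha256}. Returns missing or mismatched paths."""
    bad = []
    for rel, expected in manifest.items():
        target = Path(dest_dir) / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad


# Demo with a temporary directory standing in for the destination.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "weights.bin").write_bytes(b"example bytes")
    manifest = {"weights.bin": hashlib.sha256(b"example bytes").hexdigest()}
    ok_result = verify_transfer(manifest, tmp)

    (Path(tmp) / "weights.bin").write_bytes(b"corrupted")
    bad_result = verify_transfer(manifest, tmp)

print(ok_result, bad_result)
```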
Deprecate old model training pipeline

Archive/delete old GCP training scripts, jobs, and associated storage. Confirm old model is not referenced in any live system before deletion.

Decision: ditch old model completely — better model coming, cost saving outweighs keeping it
GCP storage cleanup

Delete deprecated buckets and assets. Confirm active write paths from Part 1 have been rerouted or shut down.
Post-migration cost validation

Pull GCP billing after cleanup. Confirm actual cost reduction matches the projection from Part 1. Flag if anything unexpected is still running.
Update all internal documentation

Update any internal docs, READMEs, or runbooks that reference GCP storage paths, training procedures, or model endpoints. Stale docs cause incidents.

suggest → BE: Confirm no backend services reference old GCP model endpoints

Lock the LoRA category taxonomy. Design, build, and benchmark a working automated broken image classifier.

AI/ML Tasks

Define NSFW LoRA category taxonomy

Finalize all LoRA categories for the full escalation sequence. Existing: selfie z-image, cleavage, ass in underwear/lingerie, nipslip, removing clothes. Define remaining NSFW tiers, category-to-escalation-step mapping, and estimated training effort per category.

📄 Output: spec doc — product sign-off required before training begins
Broken image classifier — design + build

Multimodal AI audit pipeline. Input: batch of images. Model: Grok vision vs Claude multimodal (evaluate cost vs accuracy). Classifications: good / broken / wrong_creator / wrong_category. Parallel processing, auto-regen up to 3x before manual flag.
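The regen loop can be sketched independently of which vision model ends up doing the classification. `generate` and `classify` are hypothetical callables. The sketch reads "auto-regen up to 3x" as one initial generation plus at most three regenerations, which is an assumption about the intended count.

```python
def generate_with_quality_gate(generate, classify, max_regens=3):
    """Generate, classify, and regenerate up to `max_regens` times.

    Returns (image, status, attempts) where status is "good" or
    "manual_review" and attempts is the number of classifications run.
    """
    attempts = 0
    image = generate()
    while True:
        attempts += 1
        verdict = classify(image)
        if verdict == "good":
            return image, "good", attempts
        if attempts > max_regens:
            # Initial image plus max_regens regenerations all failed.
            return image, "manual_review", attempts
        image = generate()


# Demo with scripted verdicts in place of a real classifier.
verdicts = iter(["broken", "wrong_category", "good"])
counter = iter(range(100))
image, status, attempts = generate_with_quality_gate(
    generate=lambda: f"img-{next(counter)}",
    classify=lambda img: next(verdicts),
)
print(image, status, attempts)
```

The `attempts` value maps directly onto the `retry_count` field suggested for the DB schema.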
Prompt versioning setup for classifier

Before iterating, set up prompt versioning (Langfuse or equivalent). Every classifier prompt version must be tracked so we can compare accuracy across iterations and roll back if needed.
Classifier benchmark + false positive analysis

Run against the 50-image manually labelled sample from S1. Target: >90% precision on broken. Separately analyse false positives — which types of good images does the classifier incorrectly flag?
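Precision on the `broken` class, and the list of false positives to review, can be computed as below. The labels in the demo are invented for illustration and the image IDs are hypothetical.

```python
def precision_recall(y_true, y_pred, positive="broken"):
    """Precision and recall for one class, treating all other labels as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def false_positives(ids, y_true, y_pred, positive="broken"):
    """IDs the classifier flagged as broken that humans labelled otherwise."""
    return [i for i, t, p in zip(ids, y_true, y_pred) if p == positive and t != positive]


# Invented demo data, not benchmark results.
ids = ["img-1", "img-2", "img-3", "img-4", "img-5"]
y_true = ["broken", "good", "good", "broken", "good"]
y_pred = ["broken", "broken", "good", "good", "good"]

precision, recall = precision_recall(y_true, y_pred)
flagged_in_error = false_positives(ids, y_true, y_pred)
print(precision, recall, flagged_in_error)
```

With a 50-image sample, the number of truly broken images is small, so the precision figure moves a lot with each individual error. Reporting the raw counts alongside the percentage keeps that visible.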
SFW pre-gen batch run + classifier audit

Run full SFW pre-gen batch for all active creators through the classifier. Review flagged images manually. Iterate on prompt if precision drops at scale vs the benchmark.
Per-creator QA report

After the batch run, produce a per-creator quality report: images generated, images flagged, auto-regenned count, final pass rate. This becomes the health metric for the pre-gen pipeline.

suggest → BE: DB schema — category, escalation_tier, quality_status, retry_count fields

S6 · Part 1

Train and QA all four Tier-1 LoRA categories. Manual visual QA per creator for every category.

Why split: Each LoRA = dataset prep + LoRA strength experimentation + hours of training + manual visual QA per creator + iteration rounds. Four LoRAs fills a sprint. Combining with classifier build and integration is too heavy.

AI/ML Tasks

Pre-training dataset audit

Before training any LoRA, audit training images for each category: correct labels, no duplicates, sufficient variety, no low-quality samples. Bad training data = wasted training runs.
Tier-1 LoRA training: selfie z-image

Follow Lucky's 6-step workflow. Experiment with creator LoRA strength. Manual visual QA per creator: face consistency, pose correctness, escalation stage accuracy.

👁 Manual visual inspection required per creator
Tier-1 LoRA training: cleavage

Same workflow. Extra attention to anatomical artifacts in the chest area. Multiple iteration rounds expected.

👁 Manual visual inspection required per creator
Tier-1 LoRA training: ass in underwear/lingerie

Fabric rendering and body proportion are the main failure modes. QA must check both carefully.

👁 Manual visual inspection required per creator
Tier-1 LoRA training: nipslip

Highest artifact-risk category in the escalation sequence. Thorough visual QA required. Expect the most iteration rounds of any Tier-1 category.

👁 Manual visual inspection required per creator
LoRA strength calibration doc

Document optimal LoRA strength settings found for each category and each creator during experimentation. This is institutional knowledge — without it, the next person retrains from scratch.

S6 · Part 2

Ship a real-time quality gate for live image generation. Wire Tier-1 LoRAs into the pre-gen pipeline and validate at scale.

AI/ML Tasks

Live broken image classifier

Lightweight, fast classifier for real-time generation (separate from S5 batch classifier). Must complete in <500ms. Broken → silently regen, no charge to user, log event. Good → deliver, charge normally.

suggest → BE: Retry endpoint + "no charge" flag on regen

suggest → FE: Show loading state during regen
Classifier latency profiling

Profile the live classifier under load. If P95 exceeds 500ms, identify bottleneck (model call, image download, post-processing) and optimise.
Regen loop monitoring

Instrument how often images break and how many regen attempts are needed. High regen rates = LoRA quality issue, not a classifier issue. Log separately so we can act on the right thing.
LoRA integration into pre-gen pipeline

Wire all four Tier-1 LoRAs into the batch pre-gen system from S5. Run full NSFW Tier-1 pre-gen batch for active creators. Audit outputs through the batch classifier.
Pre-gen pipeline run report

After the NSFW batch: images generated per category per creator, failure rates, regen counts, final inventory size. Product needs this to know what content is available before surfacing it.
A/B test instrumentation plan

Define the measurement plan for whether broken image elimination improves conversion. Specify the events to track, the control vs treatment groups, and the primary metric. Hand to product/BE to wire up.

After each creator message, the LLM returns 2-3 contextual suggestions so users always know what to say next. Reduces drop-off from users who lose momentum.

Full sprint. New endpoint, prompt engineering, diversity logic, persona alignment, and load testing. Existing Icebox ticket: D2C-3539 (Implement Suggestion Mode).

AI/ML Tasks

Spike + design

Test Grok at generating contextual suggestions from real conversation history samples. Define behaviour: trigger timing, input (last N messages + creator persona), output format (2-3 ranked strings), edge cases (first message, post-paywall, post-media-share).

👁 Manual review of suggestion quality — defines the quality bar. Output: design doc.
Suggestion endpoint — build

Build a /suggest endpoint (or extend chat response to return suggestions[] optionally). Must be async and non-blocking relative to main chat response. Claude Code for scaffolding. Reference D2C-3539.
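A sketch of the non-blocking behaviour using `asyncio`, assuming the chat reply and the suggestions are produced by separate coroutines. `chat_reply`, `suggestions_for`, and the `deliver` callback are hypothetical stand-ins, and the sleep durations only simulate a suggestion call that is slower than the chat call.

```python
import asyncio


async def chat_reply(message):
    await asyncio.sleep(0.01)  # stand-in for the main LLM call
    return f"reply to {message}"


async def suggestions_for(message):
    await asyncio.sleep(0.05)  # deliberately slower than the chat reply
    return ["suggestion a", "suggestion b"]


async def handle_message(message, deliver):
    """Deliver the chat reply as soon as it is ready; suggestions follow later."""
    # Start suggestions in the background so they never delay the reply.
    suggestion_task = asyncio.create_task(suggestions_for(message))
    reply = await chat_reply(message)
    deliver(("reply", reply))
    try:
        suggestions = await asyncio.wait_for(suggestion_task, timeout=1.0)
    except Exception:
        suggestions = []  # a slow or failed suggestion call degrades to none
    deliver(("suggestions", suggestions))


events = []
asyncio.run(handle_message("hi", events.append))
print(events)
```

The reply is delivered first even though the suggestion call takes longer, which is the property the latency validation needs to confirm.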
Prompt engineering: contextual quality

Iteratively tune the suggestion prompt using real conversation history samples. Target: suggestions feel like a natural nudge, not a chatbot menu. Multiple prompt iterations with manual review each round.
Suggestion diversity logic

Ensure suggestions don't repeat across consecutive turns, cover different types (question / action / flirt / media prompt), and vary in intensity based on conversation stage. Implement deduplication and variety scoring.
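A minimal sketch of deduplication plus type coverage, assuming the LLM returns ranked candidates tagged with one of the types above. The normalisation rule and the "one of each type first" selection are assumptions, and intensity by conversation stage is not modelled.

```python
def normalise(text):
    """Case- and whitespace-insensitive key for spotting repeats."""
    return " ".join(text.lower().split())


def pick_suggestions(candidates, recent, k=3):
    """candidates: (text, kind) pairs ranked best first.
    recent: suggestion texts shown on previous turns.
    Picks unseen texts, preferring one per kind, then fills remaining slots.
    """
    seen_text = {normalise(t) for t in recent}
    chosen, leftovers, kinds = [], [], set()
    for text, kind in candidates:
        key = normalise(text)
        if key in seen_text:
            continue  # repeated from an earlier turn or earlier candidate
        seen_text.add(key)
        if kind not in kinds:
            chosen.append((text, kind))
            kinds.add(kind)
        else:
            leftovers.append((text, kind))
        if len(chosen) == k:
            return chosen
    return (chosen + leftovers)[:k]


candidates = [
    ("How was your day?", "question"),
    ("how was  your day?", "question"),
    ("Tell me about your weekend", "question"),
    ("Send a selfie", "media prompt"),
    ("Let's play a game", "action"),
]
picked = pick_suggestions(candidates, recent=["Send a selfie"])
print(picked)
```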
Creator persona alignment

Suggestions must match each creator's voice and style — a suggestion that sounds like the wrong creator breaks immersion. Test each suggestion type against each active creator's persona profile.
Stress test + latency validation

Load test at realistic volume (fires after every message). Suggestions must not add latency to the chat response. Document P95 and confirm fully non-blocking behaviour.

suggest → FE: Tappable chips below last creator message

suggest → BE: Call suggestion endpoint async/non-blocking

Enable users to send images, audio, and video to the AI. Grok handles the heavy lifting — low effort, high impact.

AI/ML Tasks

Grok multimodal input — spike

Test Grok's vision API with sample user-sent images across categories: selfie, clothing, body. Evaluate response quality, persona consistency, tone appropriateness. Identify any categories Grok handles poorly.

👁 Manual review of AI responses required — tone and persona need careful checking
Media-type specific prompt engineering

Creator responses to a selfie should feel different from an outfit photo or a body pic. Build and test prompt variations per media type so reactions feel natural and persona-consistent, not generic.
LLM API: multimodal input mode

Extend LLM API endpoint to accept media_type + media_url alongside text. Route to Grok vision model. Text-only mode must be completely unaffected. Claude Code for scaffolding.
Media content guardrails

Before user-uploaded media reaches Grok, classify it: reject illegal/underage content, log all media inputs, fail safe (reject) on classifier uncertainty.

⚠ Compliance requirement — must not be skipped or minimised
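The fail-safe rule can be sketched as a gate that forwards media only on a confident "allowed" verdict. The `Verdict` shape, the label names, and the 0.9 threshold are assumptions for illustration. The real classifier and threshold are still to be chosen.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str         # "allowed" or "prohibited" (hypothetical label set)
    confidence: float  # 0.0 to 1.0


def gate_media(verdict, min_confidence=0.9, audit_log=None):
    """Return True only when the classifier is confident the media is allowed.

    A missing verdict, a prohibited label, or low confidence all reject.
    Every decision is appended to the audit log when one is supplied.
    """
    allowed = (
        verdict is not None
        and verdict.label == "allowed"
        and verdict.confidence >= min_confidence
    )
    if audit_log is not None:
        audit_log.append({"verdict": verdict, "forwarded": allowed})
    return allowed


log = []
decisions = [
    gate_media(Verdict("allowed", 0.95), audit_log=log),
    gate_media(Verdict("allowed", 0.60), audit_log=log),     # uncertain: reject
    gate_media(Verdict("prohibited", 0.99), audit_log=log),
    gate_media(None, audit_log=log),                          # classifier error: reject
]
print(decisions)
```

Rejecting on a missing verdict means a classifier outage blocks uploads rather than letting unchecked media through.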
Static media suggestion prompt library

Build 20-30 suggestion prompts that encourage users to share media naturally ("Show me what you're wearing", "Share a selfie — I'll rate it"). Static baseline that S7 smart suggestions will augment over time.

suggest → FE: Surface as suggestion chips in chat input (Snapchat-style UX)
Stress test + payload size limits

Load test multimodal endpoint with realistic image payload sizes. Measure latency delta vs text-only. Document max payload size, P95 latency, and recommended FE compression settings.

S9 · Part 1

Produce complete, product-signed-off design docs before any code is written. This design touches product, AI, BE, and FE.

Why split: Spec decisions have downstream consequences for DB schema, LLM prompting, content unlocking, and frontend. One week for design, one week for build. Never build on an unreviewed design.

AI/ML Tasks

Competitive analysis

How do Replika, Candy.ai, and OnlyFans handle user progression and retention mechanics? What works, what feels manipulative, what converts? Informs the spec significantly.
Intimacy progression spec

Tier 1 (0–50 chats): SFW / flirty → selfie content. Tier 2 (50–100): Suggestive → cleavage/lingerie. Tier 3 (100–500): Revealing → nipslip/removing clothes. Tier 4 (500+): Explicit → full NSFW. Decay mechanic if user goes silent. Per-creator tracking.

📄 Product sign-off required before Part 2 begins
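The tier lookup itself is small. The sketch below assumes that a chat count sitting exactly on a boundary (50, 100, 500) belongs to the higher tier, which the spec needs to confirm. The decay mechanic is not modelled.

```python
from bisect import bisect_right

# Chat counts at which tiers 1 to 4 begin (boundary ownership is an assumption).
TIER_STARTS = [0, 50, 100, 500]


def tier_for(chat_count):
    """Map a per-creator chat count to a tier number from 1 to 4."""
    if chat_count < 0:
        raise ValueError("chat_count must be non-negative")
    return bisect_right(TIER_STARTS, chat_count)


tiers = {n: tier_for(n) for n in (0, 49, 50, 99, 100, 499, 500, 2000)}
print(tiers)
```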
Memory management system — design

What to remember: preferences, shared facts, emotional moments, media shared. Storage: structured summary per user-creator pair. Retrieval: injected at session start. Evaluate: full history vs compressed summary vs RAG.
Knowledge graph — design + tooling evaluation

User-creator relationship graph. Nodes: user, creator, memory events, media, milestones. Edges: shared_with, reacted_to, milestone_reached, unlocked. Evaluate: Redis graph vs Postgres jsonb vs dedicated graph DB.
Memory privacy policy definition

Define what the system can and cannot store (PII rules, retention period, right to delete), and how extracted memories are scoped to user-creator pairs. Must be defined before build, not after.

S9 · Part 2

Implement the memory summarization and injection pipeline. Only begins after product signs off on the Part 1 spec.

AI/ML Tasks

Memory summarization pipeline

After each session: extract key facts + emotional signals from conversation history. Store as structured summary per user-creator pair. Claude Code for scaffolding.

👁 Manual review of extracted memory samples required — check for accuracy, hallucination, privacy
Memory injection into system prompt

Inject the stored summary at session start. Test that injected memory reads naturally in context and does not interfere with the creator persona or conversation flow.
Memory decay logic

Implement staleness handling: memories older than X days or contradicted by newer conversation are deprioritised or pruned. Define decay parameters based on expected chat frequency.
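A sketch of one possible staleness rule, assuming each memory is stored as a keyed fact with a timestamp. Here "contradicted by newer conversation" is approximated as a newer entry with the same key, and the 90-day window is a placeholder for the undecided X. The `Memory` shape and the sample facts are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    key: str            # e.g. "favourite_food" (hypothetical key scheme)
    value: str
    updated_at: datetime


def prune_memories(memories, now, max_age_days):
    """Keep the newest memory per key, then drop anything older than the window."""
    newest = {}
    for m in memories:
        if m.key not in newest or m.updated_at > newest[m.key].updated_at:
            newest[m.key] = m
    cutoff = now - timedelta(days=max_age_days)
    fresh = (m for m in newest.values() if m.updated_at >= cutoff)
    return sorted(fresh, key=lambda m: m.key)


now = datetime(2026, 5, 14)
memories = [
    Memory("favourite_food", "pizza", datetime(2026, 1, 1)),
    Memory("favourite_food", "sushi", datetime(2026, 5, 1)),  # supersedes pizza
    Memory("hometown", "Lisbon", datetime(2025, 1, 1)),       # older than the window
]
kept = prune_memories(memories, now, max_age_days=90)
print([(m.key, m.value) for m in kept])
```

Deprioritising instead of pruning would swap the filter for a sort by recency; which one applies is a spec decision.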
Memory quality evaluation

Run 20 real conversation samples through the full pipeline. Manually verify: are the right things remembered? Is anything hallucinated? Does memory injection improve conversation feel vs without?
Memory performance benchmark

Measure latency impact of memory retrieval + injection on session start time. If injection adds >200ms, optimise the retrieval query or consider async pre-loading.
A/B test instrumentation plan for memory

Define how we measure whether memory improves retention: events to track, control (no memory) vs treatment (with memory), primary metric (D7 retention or session length). Hand to product/BE to wire.

Future

Backlog & Voice Note

Avatar and video pre-generation — deprioritised until the image pipeline is fully stable. Generic only, no per-creator personalisation. SFW always before NSFW.

Wan2.2 Evaluation (D2C-4681)

Burn the $20 Mulerouter coupon. Test I2V + image edit spicy + face swap. Write go/no-go memo. Unblocks everything below.

Generic Avatar Pre-gen

Generate a library of generic avatar animations (no creator likeness). Evaluate Wan2.2 for avatar-style generation.

Video Pre-gen Pipeline

Mirror image pre-gen pipeline for short clips. Same escalation taxonomy. SFW first — NSFW video is significantly more time-intensive.

Video Broken-clip Classifier

Frame-by-frame vs whole-clip classification. Reuse classifier design from S5 where possible.

🔇

Voice — NOT AI Team Work

Voice is tightly coupled with backend infrastructure, not the AI/ML stack. ElevenLabs SDK interacts directly with the frontend. Adding AI layers in between hurts latency — inference speed is the #1 constraint for voice. This has been communicated to Dave.

Voice is owned by backend. AI team has no sprint items for voice. If voice quality issues are LLM-related (response tone, persona in voice context), AI team can advise on prompting only.

Guidelines

Key Principles

For the product team — context on why the plan is structured this way.

Invisible work is not optional

Stress testing, infra migrations, load benchmarks, and observability don't ship visible features but gate everything downstream. Do not compress these sprints.

Manual QA is the bottleneck, not code

Every generated image, LoRA output, LLM response, classifier verdict, and extracted memory needs a human eye. The automated classifier (S5–S6) reduces this over time but doesn't eliminate it.

SFW always before NSFW

NSFW LoRA has higher artifact rates and more QA rounds. Always validate SFW first before moving to NSFW — for both images and video.

Grok is the LLM bet

NSFW-capable, cheaper, multimodal. Everything here assumes Grok via OpenRouter as primary. RunPod LLM is backup only → migrated to Fal.ai → eventually deprecated.

Design sign-off gates build

Intimacy meter and memory system touch multiple systems. Product must approve the S9 Part 1 spec before Part 2 build begins. Half-built is worse than not started.

Smart suggestions are a retention multiplier

Users who don't know what to say next drop off. Suggestions remove that friction. Faster to ship than the intimacy meter and delivers immediate retention impact.

Document everything non-obvious

LoRA strength settings, classifier prompt versions, Fal.ai cold start behaviour, GCP cost breakdowns — if it's not written down it will be rediscovered the hard way.

Claude Code accelerates but doesn't replace review

We use Claude Code for scaffolding and boilerplate. Every AI output — images, LLM responses, classifier verdicts, extracted memories — needs a human eye before shipping.