§ 04
The four projects
LLM · Image · Voice · Video
The LLM routes every conversation, owns the intimacy meter, and decides when to pivot, when to tease, and when to ask for a photo. The OpenRouter integration is in review (300+ models accessible once it lands). Next: wire it to the backend and tier the calls (cheaper, faster models for guardrails and intent detection; smarter models for the actual response), then build the intimacy meter and psychological hooks on top.
▸ Now
In Review
- OpenRouter integration in review
- 300+ models accessible (once it lands)
- Statsig hookup planned for experiments
- Backend wiring not started yet
▸ Next
Wire & Tier
- Connect OpenRouter to backend
- List every internal LLM call
- Cheap model for guardrails + intent
- Smart model for the actual reply
- Set speed targets per call type
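A minimal sketch of the tiering step, assuming a plain lookup table is enough to start. The call-type names, model identifiers, and latency numbers below are placeholders, not values from the current stack; the real entries come from the call inventory.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"
    SMART = "smart"


@dataclass(frozen=True)
class CallSpec:
    tier: Tier
    latency_target_ms: int  # placeholder target, not a measured number


# Hypothetical call types; replace with the inventoried list of internal calls.
CALL_SPECS = {
    "guardrail": CallSpec(Tier.CHEAP, 300),
    "intent": CallSpec(Tier.CHEAP, 300),
    "reply": CallSpec(Tier.SMART, 2500),
}

# Placeholder model identifiers; substitute whatever the integration exposes.
MODEL_BY_TIER = {
    Tier.CHEAP: "placeholder/cheap-model",
    Tier.SMART: "placeholder/smart-model",
}


def pick_model(call_type: str) -> str:
    """Return the model for a call type; fail loudly on unlisted calls."""
    if call_type not in CALL_SPECS:
        raise KeyError(f"unlisted LLM call type: {call_type!r}")
    return MODEL_BY_TIER[CALL_SPECS[call_type].tier]
```

Raising on an unlisted call type is deliberate: a call that was never written down should not silently fall back to a default model.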
▸ Then
Intimacy Meter
- Build intimacy meter (3–4 stages)
- Paywall triggers at stage 2 → 3
- Detect boredom · auto-pivot topics
- Accept user image uploads
- Use cheap model to read user images
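One way to picture the meter is as a small state machine with the paywall sitting on the stage 2 → 3 transition. This is a sketch under assumptions: four stages (the plan says 3–4), a single `has_paid` flag, and hypothetical names throughout. What actually triggers an advance is not modeled here.

```python
from dataclasses import dataclass

MAX_STAGE = 4           # plan says 3-4 stages; 4 is assumed for this sketch
PAYWALL_AT_STAGE = 2    # paywall sits on the 2 -> 3 transition


@dataclass
class IntimacyMeter:
    stage: int = 1
    has_paid: bool = False

    def try_advance(self) -> str:
        """Try to move up one stage.

        Returns "advanced", "paywall" (blocked until payment), or "max".
        """
        if self.stage >= MAX_STAGE:
            return "max"
        if self.stage == PAYWALL_AT_STAGE and not self.has_paid:
            return "paywall"
        self.stage += 1
        return "advanced"
```

Usage: a new user advances from stage 1 to 2 freely, gets `"paywall"` on the next attempt, and continues to 3 and 4 once `has_paid` is set.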
▸ Later
Per-Creator
- Fine-tune LLM per creator
- Long-term memory of past chats
- Intimacy carries across sessions
- Smarter topic-matching engine
★ The Bet
Using cheaper models for routine calls (guardrails, intent, routing) cuts cost a lot without users noticing. The intimacy meter turns the conversation into a game users want to climb — and the paywall lands at the moment they want it most, not at message N.
⚠ The Risk
If we swap the global LLM without writing down every internal call first, we might silently break tone or safety. We'll only find out when conversion drops. Need to list every call and which tier it uses before swapping.
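The pre-swap check could be as small as a set comparison between the written inventory and the tier assignments. A sketch, assuming both are kept as simple Python structures; the function names and example call names are hypothetical.

```python
def audit_before_swap(inventory, tier_by_call):
    """Compare the call inventory against the tier assignments.

    Returns (missing, stale):
      missing = inventoried calls with no tier assigned
      stale   = tier entries for calls no longer in the inventory
    """
    listed = set(inventory)
    tiered = set(tier_by_call)
    return sorted(listed - tiered), sorted(tiered - listed)


def safe_to_swap(inventory, tier_by_call):
    """True only when every listed call has a tier and nothing is stale."""
    missing, stale = audit_before_swap(inventory, tier_by_call)
    return not missing and not stale
```

Usage: with an inventory of `["guardrail", "intent", "reply", "image_caption"]` and tiers assigned only to the first three, `safe_to_swap` returns `False` and `audit_before_swap` reports `image_caption` as missing. The check only covers calls that made it into the inventory, so it does not replace the work of finding them.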
Image runs on two models today. The old one is slow but very detailed: good for tattoos and fine features, weak at NSFW. The new one generates in under 5 seconds with NSFW LoRAs and good consistency, but struggles with tattoos and fine details. Neither is enough on its own. We need a per-creator photo library that runs from SFW curvy photos on the landing page, through suggestive tease photos in chat, to full NSFW after the paywall. End state: our own internal image-to-image (i2i) model with per-creator LoRAs (base identity, undress, cleavage, outfit) feeding a pre-generated photo library that the LLM pulls from during chat.
▸ Now
Two Models
- Old: slow, detailed, weak NSFW
- New: under 5s, good consistency, NSFW LoRAs
- New struggles with tattoos / fine details
- No unified pipeline yet
▸ Next
Internal i2i
- Build internal i2i prototype
- Base identity LoRA per creator
- Pilot on 3 creators
- Compare quality vs nano-banana
▸ Then
Photo Library
- SFW curvy photos for landing
- Suggestive LoRAs (cleavage, undress, outfit)
- 30–50 photos per creator pre-generated
- Tagged by tier (SFW / suggestive / NSFW)
- LLM pulls right photo at right time
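A sketch of how tier tags could gate what gets pulled from a pre-generated library, assuming the three tags map to an ordered ceiling (landing → SFW only, chat → up to suggestive, post-paywall → up to NSFW). All names are hypothetical, and "right photo at right time" is reduced here to "highest allowed tier not yet sent"; the real selection logic would sit with the LLM.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional, Set


class PhotoTier(IntEnum):
    SFW = 0
    SUGGESTIVE = 1
    NSFW = 2


@dataclass(frozen=True)
class Photo:
    photo_id: str
    tier: PhotoTier


def max_tier(in_chat: bool, has_paid: bool) -> PhotoTier:
    """Highest tier the user may see at this point in the funnel."""
    if not in_chat:
        return PhotoTier.SFW
    return PhotoTier.NSFW if has_paid else PhotoTier.SUGGESTIVE


def pick_photo(
    library: Iterable[Photo],
    in_chat: bool,
    has_paid: bool,
    already_sent: Set[str],
) -> Optional[Photo]:
    """Pick an unsent photo at the highest allowed tier, or None if exhausted."""
    ceiling = max_tier(in_chat, has_paid)
    candidates = [
        p for p in library
        if p.tier <= ceiling and p.photo_id not in already_sent
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.tier)
```

Returning `None` when the library is exhausted makes the depth requirement visible: a shallow library runs out of photos quickly at the tiers that matter.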
▸ Later
All Creators
- Roll out to full creator catalog
- Self-serve LoRA pipeline
- Generate new photos in chat
- Style/outfit experiments
★ The Bet
Pre-generated photo libraries beat live generation. The funnel — SFW on landing, suggestive in chat, NSFW after paywall — only works if every creator has a deep, consistent library. Speed and consistency here unlock everything else.
⚠ The Risk
LoRA training takes time and compute. Train too few creators and we won't know if it works for all. Train too many before validating and we waste runs. Nano-banana is tempting as a shortcut but weakens our long-term advantage. Need a clear policy for when it's used.
Voice is in good shape; do not over-optimize. The work here is instrumentation, expressiveness, and parity with the intimacy state the LLM owns. Photo drops over voice wait until Image is solid; never ship them on the old image stack.
▸ Now
Stable
- Production-ready, holding up
- Quality benchmark established
- No reliability fires
▸ Next
The Funnel
- Drop-off events instrumented
- Init → 1st reply → 60s → 3min → end
- Shared intimacy state w/ LLM
- Per-stage CVR baseline
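A sketch of the per-stage baseline, assuming each call logs the furthest checkpoint it reached and that "CVR" here means step-to-step conversion between checkpoints. If it means paid conversion per stage, the numerator changes. Stage keys and function names are hypothetical.

```python
# Checkpoints from the funnel above, in order.
FUNNEL_STAGES = ["init", "first_reply", "60s", "3min", "end"]


def stage_conversion(reached_counts):
    """Step-to-step conversion between consecutive funnel checkpoints.

    reached_counts maps a stage name to the number of calls that reached
    at least that stage. A step with an empty previous stage yields None.
    """
    rates = {}
    for prev, cur in zip(FUNNEL_STAGES, FUNNEL_STAGES[1:]):
        prev_n = reached_counts.get(prev, 0)
        cur_n = reached_counts.get(cur, 0)
        rates[f"{prev}->{cur}"] = cur_n / prev_n if prev_n else None
    return rates
```

Usage: with made-up counts of `{"init": 100, "first_reply": 80, "60s": 40, "3min": 20, "end": 10}`, the `init->first_reply` step comes out at 0.8 and every later step at 0.5. The step with the lowest rate is where users hang up and where expression tuning should start.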
▸ Then
Expression
- Tease/expression tuning
- Topic-pivot parity with LLM
- Boredom-detection in voice signals
- Whispers, laughs, breaths catalog
▸ Later
Cross-Modal
- Photo drops mid-call (gated on Pyg)
- Video send-link mid-call
- Voice → image affinity model
★ The Bet
Voice is high-perceived-value, low-marginal-improvement right now. Best ROI is instrumentation. Once we know where users hang up, surgical expression and pivot tuning move CVR more than any model swap.
⚠ The Risk
Treating voice as "done" is dangerous — owners lose urgency, regressions creep in, a competitor leapfrogs while we're heads-down on images. Set a maintenance bar, not zero attention.
Video is the most powerful modality, deliberately deferred. Video LoRAs are expensive in time and compute, and the added surface area only pays off once the rest is solid. The only video work earning roadmap space now is generic creator B-roll: short ambient loops the LLM can drop like high-value emoji at peak intimacy.
▸ Now
Hold
- Deprioritized intentionally
- No active engineering
▸ Next
B-Roll
- 3–5s generic creator loops
- Ambient, not narrative
- Drop mechanic in LLM (rare, high-tier)
- No LoRA training
▸ Then
Pilot
- Per-creator video LoRA spike
- Only after Image stable
- Pilot on 1–2 creators
- Quality + cost benchmark
▸ Later
Spectacle
- Full video escalation track
- Slot-machine pic mechanic → video
- Modal-driven reveals
- Voice + video calls
★ The Bet
Doing nothing on video is the right move now. Engineering attention is finite; image consistency is the bottleneck for the entire product. Cheap B-roll buys perceived motion without committing to the LoRA pipeline.
⚠ The Risk
Video is where the category is heading. Hold too long and a competitor ships consistent per-creator video first — we lose a positioning beat that's hard to recover. Re-evaluate the moment Image clears its first roster milestone.