A multi-agent system that turns a Slack message or Linear task into a shipped Metabase dashboard on BigQuery — with quality gates, post-build audit, and a chat layer for corrections.
Each box is its own Anthropic Managed Agent session. Discovery was an agent until we measured it taking 7–10 min per run; replacing it with a deterministic backend call cut that to ~2 seconds without sacrificing the output the architect needs.
Turns natural-language requests into structured brief.yaml. Reads Linear tasks directly when a URL is mentioned.
Reads Metabase metadata, ranks tables by keyword match against the brief, derives partition columns + field IDs. No LLM — pure deterministic lookup.
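A minimal sketch of what a deterministic keyword ranking could look like. The tokenizer, the scoring rule (count of overlapping tokens), and the sample table metadata are all assumptions for illustration, not the production ranking.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}

def rank_tables(brief_text: str, tables: list[dict]) -> list[tuple[str, int]]:
    """Score each table by how many brief keywords appear in its name or columns."""
    keywords = tokenize(brief_text)
    scored = []
    for table in tables:
        haystack = tokenize(table["name"])
        for column in table["columns"]:
            haystack |= tokenize(column)
        score = len(keywords & haystack)
        if score > 0:
            scored.append((table["name"], score))
    # Highest score first; the name breaks ties so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical metadata, shaped loosely like a cached table listing.
tables = [
    {"name": "orders", "columns": ["order_id", "net_revenue", "created_at"]},
    {"name": "signups", "columns": ["user_id", "signup_date", "channel"]},
    {"name": "page_views", "columns": ["session_id", "url", "viewed_at"]},
]
ranked = rank_tables("Weekly net revenue by signup channel", tables)
print(ranked)
```

Because there is no model call and no randomness, the same brief against the same metadata always yields the same ranking, which is what makes the step cheap to cache and easy to debug.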
Designs the dashboard layout — chart types, grid placement, filters. Pure design pass; no SQL, no API calls. Outputs design.md.
Writes BigQuery SQL for every card with mandatory partition pruning + {{date_range}} template-tags. Validates all 9 previews in parallel via /api/dataset.
Quality gate with three modes: SQL (reviews queries pre-ship), visual (post-build data audit catching axis-clamp, empty-card, wrong-chart-type bugs), audit (existing dashboard review).
The only mutating agent. Creates cards + the dashboard via Metabase REST, in parallel; wires the date_range filter via parameter_mappings; verifies after create.
Long-lived conversational layer after delivery. Three jobs in priority order:
Stateless functions for the Slack webhook, Anthropic webhook, orchestrator state machine, and admin/cron endpoints. waitUntil() keeps work alive past 200 OK.
Single source of truth for run state, artifacts (the files agents pass between each other), captured learnings, and cached Metabase metadata.
Each agent is its own Anthropic session with a baked system prompt (preamble + business context + metrics + learnings). They get bootstrapped with run-specific env vars at start.
Added per-agent and per-transition timing to RunState. /api/admin/run-timings now surfaces a table of where every run actually spends its wall-clock time.
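One way to capture per-agent and per-transition timing is a small span recorder. This is a sketch only: the RunTimings class, the span labels, and the injected clock are hypothetical and say nothing about how RunState actually stores timings.

```python
import time
from contextlib import contextmanager

class RunTimings:
    """Collects (label, seconds) spans for one run."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the recorder can be tested
        self.spans = []

    @contextmanager
    def span(self, label):
        start = self._clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raised.
            self.spans.append((label, self._clock() - start))

    def table(self):
        """Rows of (label, seconds, percent of total recorded time)."""
        total = sum(seconds for _, seconds in self.spans) or 1.0
        return [
            (label, round(seconds, 2), round(100 * seconds / total))
            for label, seconds in self.spans
        ]

# Demo with a fake clock: the agent takes 30s, the hand-off takes 10s.
ticks = iter([0.0, 30.0, 30.0, 40.0])
timings = RunTimings(clock=lambda: next(ticks))
with timings.span("mb-sql"):
    pass
with timings.span("mb-sql->mb-critic"):
    pass
rows = timings.table()
print(rows)
```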
Stopped guessing about latency.
Replaced the mb-discovery LLM agent with backend code that reads Metabase metadata (cached 6h), ranks tables by keyword match, and writes data-model.md directly.
7–10 min → ~2 sec
SQL critic's CHANGES_REQUESTED used to send the run back to mb-sql for another full pass (~7 min). Now we ship best-effort; the post-build audit catches anything real.
+7 min → 0 min
mb-builder used to post "Done!" to Slack before the visual critic even ran. Now the orchestrator composes the delivery message AFTER the audit, with the verdict inline. User sees issues before they open the dashboard.
UX correctness, not raw speed.
Both mb-sql (validating 9 SQL previews against BigQuery) and mb-builder (creating 9 cards) now fan out with curl & + wait instead of looping sequentially.
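The agents do this in shell with curl & + wait. The same fan-out shape, sketched in Python with a thread pool instead, looks roughly like the following. validate_preview is a stand-in for the real network call, and the worker count of 9 simply mirrors the card count mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(task, items, max_workers=9):
    """Run task(item) for every item concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, items))

def validate_preview(sql: str) -> dict:
    # Hypothetical stand-in: a real version would send the query to the
    # preview endpoint and inspect the response.
    return {"sql": sql, "ok": "{{date_range}}" in sql}

previews = ["SELECT " + str(i) + " WHERE {{date_range}}" for i in range(9)]
results = fan_out(validate_preview, previews)
print(sum(r["ok"] for r in results), "of", len(results), "previews passed")
```

With I/O-bound calls like these, total latency approaches that of the slowest single request rather than the sum of all nine.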
mb-sql ~90s → ~15s; mb-builder ~25s → ~5s
When Anthropic webhooks are flowing, we skip the polling fallback entirely; when they lag, we poll with exponential backoff starting at 1.5s instead of a flat 4s.
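A sketch of the polling fallback under stated assumptions: the 1.5s starting delay comes from the text, while the doubling factor, the 30s cap, and the attempt count are illustrative guesses. check, webhook_seen, and sleep are hypothetical callables passed in so the logic can be exercised without real waiting.

```python
def backoff_delays(base=1.5, factor=2.0, cap=30.0, attempts=5):
    """Exponential delay schedule: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]

def poll_until_done(check, webhook_seen, sleep, delays):
    """Return True once the session is done.

    If a webhook already reported completion, polling is skipped entirely.
    Otherwise wait through the delay schedule, checking after each wait.
    """
    if webhook_seen():
        return True
    for delay in delays:
        sleep(delay)
        if check():
            return True
    return False

# Demo: no webhook arrives, and the session finishes on the third check.
slept = []
answers = iter([False, False, True])
done = poll_until_done(
    check=lambda: next(answers),
    webhook_seen=lambda: False,
    sleep=slept.append,  # record the delay instead of actually sleeping
    delays=backoff_delays(),
)
print(done, slept)
```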
13s overhead → 3s overhead
Paste a Linear URL or just say GTM-123 — mb-brief fetches the title + description + comments via GraphQL and uses them as the authoritative spec. No need to re-paste requirements every time.
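Spotting the task reference can be done with one pattern, since a pasted Linear URL contains the same identifier as the bare form. The regex below is an assumption about what counts as an identifier (uppercase team key, hyphen, number), and the example URL is made up; the GraphQL fetch itself is not shown.

```python
import re

# Uppercase team key, a hyphen, then the issue number, e.g. GTM-123.
# Note this would also match unrelated tokens such as "UTF-8", so a real
# version may want to check the key against known team keys.
ISSUE_ID = re.compile(r"\b([A-Z][A-Z0-9]*-\d+)\b")

def find_issue_ids(message: str) -> list[str]:
    """Return issue identifiers in order of first appearance, without repeats."""
    found = []
    for match in ISSUE_ID.finditer(message):
        issue_id = match.group(1)
        if issue_id not in found:
            found.append(issue_id)
    return found

message = (
    "Build this: https://linear.app/acme/issue/GTM-123/revenue-dashboard "
    "and compare against GTM-456 (same as GTM-123)"
)
ids = find_issue_ids(message)
print(ids)
```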
When the user corrects mb-chat ("net_revenue isn't computed that way"), it writes the correction to KV. On the next agent deploy, every agent gets that knowledge baked into its system prompt. Mistakes are corrected once.
After the build, we fetch every card's live query data + viz settings, hand it to mb-critic in visual mode, and only THEN compose the delivery message — with the audit verdict inline. The user sees issues before they open the dashboard.
Every card's SQL uses a {{date_range}} template-tag bound to its partition column; the dashboard declares the matching parameter; each dashcard's parameter_mappings wires the two together. The filter actually filters.
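The three pieces can be sketched as plain payload dictionaries. The shapes below follow our understanding of Metabase's native-query and dashboard parameter format and should be checked against the Metabase version in use; the IDs, the parameter id string, and the helper names are hypothetical.

```python
import uuid

def native_card(name, database_id, sql, partition_field_id):
    """Card payload whose SQL carries a {{date_range}} field-filter tag."""
    return {
        "name": name,
        "display": "line",
        "visualization_settings": {},
        "dataset_query": {
            "type": "native",
            "database": database_id,
            "native": {
                "query": sql,
                "template-tags": {
                    "date_range": {
                        "id": str(uuid.uuid4()),
                        "name": "date_range",
                        "display-name": "Date range",
                        "type": "dimension",  # field filter
                        # Bind the tag to the table's partition column.
                        "dimension": ["field", partition_field_id, None],
                        "widget-type": "date/all-options",
                    }
                },
            },
        },
    }

# The dashboard declares one matching parameter.
DASHBOARD_PARAMETER = {
    "id": "date_range_param",  # hypothetical id
    "name": "Date range",
    "slug": "date_range",
    "type": "date/all-options",
}

def dashcard_mapping(card_id):
    """Wire the dashboard parameter to the card's template-tag."""
    return {
        "parameter_id": DASHBOARD_PARAMETER["id"],
        "card_id": card_id,
        "target": ["dimension", ["template-tag", "date_range"]],
    }

card = native_card(
    "Net revenue by week",
    database_id=1,
    sql="SELECT week, SUM(net_revenue) FROM orders WHERE {{date_range}} GROUP BY week",
    partition_field_id=42,
)
mapping = dashcard_mapping(card_id=101)
```

If any one of the three is missing, the filter widget still renders but changes nothing, which is why the builder verifies after create.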
mb-brief classifies each dashboard into one of 8 categories (acquisition, retention, onboarding, funnel, revenue, leadership, investors, other). mb-builder places it into the right Metabase folder automatically.
A cron sweeps every 15 minutes and expires runs that have been idle >30 min in chat or waiting-user state. Sessions are archived to free Anthropic resources without losing the transcript.
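The expiry rule itself is small. A sketch, assuming each run record carries a state string and a last-activity timestamp (the field names and sample runs are hypothetical):

```python
from datetime import datetime, timedelta, timezone

IDLE_LIMIT = timedelta(minutes=30)
SWEEPABLE_STATES = {"chat", "waiting-user"}

def find_expired(runs, now):
    """Run ids that are in a sweepable state and idle longer than the limit."""
    return [
        run["run_id"]
        for run in runs
        if run["state"] in SWEEPABLE_STATES
        and now - run["last_activity"] > IDLE_LIMIT
    ]

now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"run_id": "a", "state": "chat", "last_activity": now - timedelta(minutes=45)},
    {"run_id": "b", "state": "chat", "last_activity": now - timedelta(minutes=10)},
    {"run_id": "c", "state": "building", "last_activity": now - timedelta(hours=2)},
    {"run_id": "d", "state": "waiting-user", "last_activity": now - timedelta(minutes=31)},
]
expired = find_expired(runs, now)
print(expired)
```

Run "c" is left alone despite being the oldest: only conversational and waiting states are swept, so an in-flight build is never expired by the cron.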
Hourly health check pings /api/health; daily synthetic run kicks off a "test dashboard" request and validates the pipeline starts cleanly. Anomalies post to Slack only on state change — no spam.
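Alerting only on state change amounts to comparing each check with the previous one. A minimal sketch, where the "ok"/"fail" status strings and the message wording are assumptions:

```python
def alerts_for(statuses, initial="ok"):
    """Yield one message per transition; repeated states stay silent."""
    previous = initial
    for status in statuses:
        if status != previous:
            yield f"health changed: {previous} -> {status}"
        previous = status

# Five hourly checks: two failures in a row, then recovery.
messages = list(alerts_for(["ok", "ok", "fail", "fail", "ok"]))
print(messages)
```

The second consecutive failure produces no message, so a long outage generates two posts (down, then recovered) rather than one per hour.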
mb-chat gets an HMAC token narrowly scoped to append-learning, 24h TTL — not the master deploy secret. Compromise of a chat session log doesn't leak admin access.
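A sketch of a scoped, expiring HMAC token using only the standard library. The payload layout (base64url JSON plus a signature, joined by a dot) is an assumption rather than the actual token format, and the signing key shown is a placeholder that would be separate from the master deploy secret.

```python
import base64
import hashlib
import hmac
import json

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode().rstrip("=")

def _sign(secret: bytes, payload: str) -> str:
    return _b64(hmac.new(secret, payload.encode(), hashlib.sha256).digest())

def mint_token(secret: bytes, scope: str, now: int, ttl: int = 24 * 3600) -> str:
    """Token carrying one scope and an expiry, signed with HMAC-SHA256."""
    claims = {"scope": scope, "exp": now + ttl}
    payload = _b64(json.dumps(claims, sort_keys=True).encode())
    return f"{payload}.{_sign(secret, payload)}"

def verify_token(secret: bytes, token: str, required_scope: str, now: int) -> bool:
    """Accept only an untampered, unexpired token with exactly this scope."""
    try:
        payload, signature = token.split(".")
    except ValueError:
        return False
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(signature, _sign(secret, payload)):
        return False
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims["scope"] == required_scope and now < claims["exp"]

secret = b"placeholder-signing-key"  # hypothetical; not the deploy secret
token = mint_token(secret, "append-learning", now=1_000_000)
print(verify_token(secret, token, "append-learning", now=1_000_100))
```

The signature is checked before the payload is decoded, so a forged token never reaches the JSON parser.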
Get a real-world dashboard build to consistently under 5 min by trimming the SQL agent's output volume and the architect's prose tendencies. Measured via /api/admin/run-timings.
When the visual critic returns CHANGES_REQUESTED with concrete viz_settings fixes (axis clamp, wrong title), route back to mb-builder ONCE to apply them before delivery — no human in the loop.
Today two runs for the same Linear task get the same runId, so the later run overwrites the earlier one in the index. Appending a short hash to the runId would let re-runs accumulate for comparison.
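One possible shape for the suffix, assuming the runId is currently derived from the task identifier alone and that a start timestamp is available to hash. Both assumptions, the 6-character length, and the function name are illustrative.

```python
import hashlib

def make_run_id(task_id: str, started_at_iso: str) -> str:
    """Task id plus a 6-hex-character digest of the task id and start time."""
    digest = hashlib.sha256(f"{task_id}:{started_at_iso}".encode()).hexdigest()
    return f"{task_id}-{digest[:6]}"

first = make_run_id("GTM-123", "2025-01-01T12:00:00Z")
second = make_run_id("GTM-123", "2025-01-01T12:20:00Z")
print(first, second)
```

Both ids keep the task id as a prefix, so the index can still group re-runs of the same task by a simple prefix match.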