BAKEOFF evaluates model quality under governance. Every comparison is deterministic.
Double-blind. The user does not know which provider is serving. The provider does not know it is being evaluated.
A user opens a governed chat (a patient on TALK, a developer on CODE). The system randomly assigns a model (Claude, OpenAI). The user chats normally; model identity is hidden. The user MAY switch models at any time, and switching is frictionless. The chain records the model assignment, switch events, timestamps, and conversation. Machine telemetry (latency, cost, error rate) is captured per request. Behavioral metrics (retention, time-to-switch, switch destination, return rate) are derived from the chain.
No surveys. No scoring panels. The switch is the vote. The chain is the evidence.
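The behavioral metrics fall directly out of chain events. A minimal sketch of deriving time-to-switch from one session's events (the event shape and field names here are illustrative assumptions, not the chain's actual schema):

```python
# Illustrative chain events for one session (field names are assumptions,
# not the chain's actual wire format).
events = [
    {"type": "assignment", "model": "openai", "ts": 0.0},
    {"type": "message", "ts": 1.0},
    {"type": "message", "ts": 2.0},
    {"type": "message", "ts": 3.0},
    {"type": "switch", "to": "claude", "ts": 4.0},
]

def time_to_switch(events):
    """Messages sent before the first switch; None if the user never switched."""
    count = 0
    for e in events:
        if e["type"] == "switch":
            return count
        if e["type"] == "message":
            count += 1
    return None  # retained: never switching is the strongest vote

print(time_to_switch(events))  # → 3
```

Retention, switch destination, and return rate are the same kind of fold over the same event stream, which is why no survey layer is needed.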
The dashboard answers the questions. It is presentation-first: every section delivers a clear answer.
- SCOREBOARD: headline metrics (Best Model, Total Sessions, Total Evidence). The answer at a glance.
- MODEL PERFORMANCE: Claude vs OpenAI head-to-head on retention, latency, cost, time-to-switch. Bar chart + table.
- RETENTION TRAJECTORY: area chart of retention over session cohorts. Who keeps users longer?
- BAKER LEADERBOARD: who generates the most evidence? Donut chart + table. Every session is a contribution.
- COST-PERFORMANCE: price per token vs retention. The efficient frontier.
- COMMUNITY: track breakdown (TALK + CODE) and engagement gauge. The bakeoff is the community.
Data is simulated from the Spectral Analysis (EVO-ANALYSIS.md) until runtime chain recording goes live. Simulation basis:
When runtime launches, simulated data is replaced by live chain evidence. The dashboard structure stays the same.
| Track | Community | Product | What they judge |
|---|---|---|---|
| TALK | Patients + clinicians | MammoChat, OncoChat, MedChat | Clinical quality, compassion, clarity |
| CODE | Developers | Kilocode, Claude Code | Code correctness, codebase understanding, conventions |
Kilocode is an OpenAI-compatible AI coding extension for VSCode. It connects to the BAKEOFF through api.canonic.org.
Setup:
1. Install the Kilocode extension in VSCode.
2. Open Kilocode settings.
3. Set the API base URL to `https://api.canonic.org/v1`.
4. Set the model to `canonic-kilocode`.
5. Code normally. The system randomly assigns a provider (DeepSeek, Qwen, GLM, GPT-4, Claude); provider identity is hidden.
6. If unsatisfied with responses, switch models via Kilocode's model selector. The switch is the vote.
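Because the proxy is OpenAI-compatible, the call Kilocode makes under the hood is a standard OpenAI-style chat completion. A sketch of the request shape (the endpoint and model ID come from the setup above; the prompt content is made up):

```python
import json

# BAKEOFF proxy endpoint and blind model ID from the setup steps above.
BASE_URL = "https://api.canonic.org/v1"
payload = {
    "model": "canonic-kilocode",  # the proxy picks the real provider
    "messages": [
        {"role": "user", "content": "Refactor this function to be pure."}
    ],
}
request_body = json.dumps(payload)
print(f"POST {BASE_URL}/chat/completions")
```

The response comes back in the standard OpenAI chat-completion shape, so any OpenAI-compatible client works, not just Kilocode.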
Available model IDs are listed in the routing table below.
The developer does NOT need to know the underlying provider. The proxy handles assignment. The chain records everything.
Claude Code is Anthropic’s CLI agent. It participates in the CODE track through direct Anthropic API usage.
Setup:
Install Claude Code CLI (claude). Use normally against your codebase. Sessions are captured as CODE track evidence when operating under governed repositories.
Every CODE track session records: model assignment, switch events, timestamps, conversation, latency, token usage, cost. No surveys. The switch is the vote. The chain is the evidence.
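One way to picture what a single CODE track session contributes to the chain (a sketch only; the field names are assumptions, not the chain's wire format):

```python
from dataclasses import dataclass, field

@dataclass
class CodeSessionRecord:
    """Illustrative shape of one CODE track session's chain evidence."""
    model_assignment: str                          # hidden from the developer
    switch_events: list = field(default_factory=list)
    timestamps: list = field(default_factory=list)
    conversation: list = field(default_factory=list)
    latency_ms: float = 0.0                        # machine telemetry, per request
    tokens: int = 0
    cost_usd: float = 0.0

rec = CodeSessionRecord(model_assignment="deepseek")
rec.switch_events.append({"to": "claude", "at_message": 8})
```

Everything behavioral (the switch) and everything mechanical (latency, tokens, cost) lives in one record, which is what lets the scoreboard be computed without any survey input.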
| Model ID | What it routes to |
|---|---|
| `canonic-kilocode` | Default (RunPod DeepSeek Coder) |
| `canonic-kilocode-commercial` | Commercial (OpenAI GPT-4) |
| `canonic-kilocode-opensource-runpod` | Open-source on RunPod |
| `canonic-kilocode-opensource-vast` | Open-source on Vast.ai |
The scoreboard is the output. One row per model, per track:
A model that nobody switches away from is winning. A model that everyone abandons after 2 messages is losing. The routing policy writes itself from this data.
| Model | Track | Retention | Switch-away | Avg time-to-switch | Latency | Cost/1K tok |
|---|---|---|---|---|---|---|
| Claude | CODE | 78.4% | 21.6% | 14.3 msg | 1.8s | $0.015 |
| OpenAI | CODE | 61.2% | 38.8% | 7.8 msg | 1.2s | $0.010 |
| Claude | TALK | — | — | — | — | — |
| OpenAI | TALK | — | — | — | — | — |
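Retention and switch-away in the table are complements of each other, computed per model per track. A sketch of how they fall out of raw session outcomes (the sessions below are invented for illustration):

```python
# Hypothetical per-session outcomes for one model: True = user switched away.
sessions = [False, False, True, False, True]  # 5 sessions, 2 switch-aways

def retention(switched_flags):
    """Fraction of sessions where the user never switched away."""
    kept = sum(1 for s in switched_flags if not s)
    return kept / len(switched_flags)

r = retention(sessions)
print(f"retention {r:.1%}, switch-away {1 - r:.1%}")  # → retention 60.0%, switch-away 40.0%
```

Averaging time-to-switch over only the switched sessions gives the "Avg time-to-switch" column by the same pattern.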
The community runs the bakeoff. Every session is a contribution. Bakers who evaluate the most models generate the most evidence.
Evidence score = sessions + switches. More switches = more comparative evidence.
| Baker | Track | Sessions | Switches | Models evaluated | Evidence score |
|---|---|---|---|---|---|
| dexterhadley | CODE | 512 | 89 | 2 | 601 |
| ir4y | CODE | 198 | 47 | 2 | 245 |
| yanabeda | CODE | 137 | 31 | 2 | 168 |
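The evidence score column is exactly the sum defined above, applied to the leaderboard rows:

```python
# (baker, sessions, switches) rows from the leaderboard table.
bakers = [
    ("dexterhadley", 512, 89),
    ("ir4y", 198, 47),
    ("yanabeda", 137, 31),
]

# Evidence score = sessions + switches.
leaderboard = [(name, sessions + switches) for name, sessions, switches in bakers]
print(leaderboard)  # → [('dexterhadley', 601), ('ir4y', 245), ('yanabeda', 168)]
```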
- Upstream: MAGIC governance contracts and SERVICES meta-governance.
- Runtime: api.canonic.org performs random model assignment and chain recording; ~/.canonic stores local evidence artifacts.
- Frontend: the BAKEOFF dashboard surfaces on the governed catalog; the scoreboard is the presentation.
- Ledger plane: consumes chain evidence (assignments, switches, telemetry) and feeds the dashboard.
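Assignment can look random to the user yet stay deterministic and re-verifiable from the chain. One way to sketch that (the hashing scheme is an assumption for illustration, not api.canonic.org's actual implementation):

```python
import hashlib

PROVIDERS = ["claude", "openai"]

def assign_provider(session_id: str) -> str:
    """Map a session to a provider deterministically, so the assignment
    recorded on the chain can be independently re-verified later."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return PROVIDERS[digest[0] % len(PROVIDERS)]

# Same session always resolves to the same provider; different sessions spread out.
assert assign_provider("sess-001") == assign_provider("sess-001")
```

The user cannot predict the outcome, but an auditor replaying the chain gets the same answer every time, which is what makes every comparison deterministic.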
Models in scope:
| *BAKEOFF | SPEC | DESIGN | DASHBOARD | SCOREBOARD | BAKERS* |