BAKEOFF

BAKEOFF evaluates model quality under governance. Every comparison is deterministic.

DESIGN

Double-blind. The user does not know which provider is serving. The provider does not know it is being evaluated.

User opens a governed chat (patient on TALK, dev on CODE). System randomly assigns a model (Claude, OpenAI). User chats normally. Model identity is hidden. User MAY switch models at any time. Switch is frictionless. The chain records: model assignment, switch events, timestamps, conversation. Machine telemetry (latency, cost, error rate) is captured per-request. Behavioral metrics (retention, time-to-switch, switch destination, return rate) are derived from the chain.

No surveys. No scoring panels. The switch is the vote. The chain is the evidence.
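The chain fields above (assignment, switch events, timestamps, per-request telemetry) can be sketched as a minimal record type. This is an illustrative shape only; the field names are assumptions, not the actual chain schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SwitchEvent:
    timestamp: float     # when the user switched
    from_model: str      # model being abandoned
    to_model: str        # switch destination

@dataclass
class SessionRecord:
    session_id: str
    track: str                 # "TALK" or "CODE"
    assigned_model: str        # hidden from the user
    started_at: float
    switches: List[SwitchEvent] = field(default_factory=list)
    latency_ms: List[float] = field(default_factory=list)  # per-request telemetry
    cost_usd: float = 0.0

    def retained(self) -> bool:
        # A session the user never switched away from counts as retained.
        return not self.switches
```

Behavioral metrics (retention, time-to-switch, switch destination) derive from records like this, not from anything the user fills in.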

DASHBOARD

The dashboard answers the questions. Presentation-first. Every section has a clear answer.

Sections

  • SCOREBOARD — Headline metrics: Best Model, Total Sessions, Total Evidence. The answer at a glance.
  • MODEL PERFORMANCE — Claude vs OpenAI head-to-head. Retention, latency, cost, time-to-switch. Bar chart + table.
  • RETENTION TRAJECTORY — Area chart: retention over session cohorts. Who keeps users longer?
  • BAKER LEADERBOARD — Who generates the most evidence? Donut chart + table. Every session is a contribution.
  • COST-PERFORMANCE — Price/token vs retention. The efficient frontier.
  • COMMUNITY — Track breakdown (TALK + CODE). Engagement gauge. The bakeoff is the community.

Simulation

Data is simulated from the Spectral Analysis (EVO-ANALYSIS.md) until runtime chain recording goes live. Simulation basis:

  • Baker velocity ratios from spectral commit analysis (dexterhadley 35.2x, ir4y 3.9x, yana 3.1x)
  • Model retention derived from governed vs ungoverned loop patterns
  • Latency and cost from production API benchmarks

When runtime launches, simulated data is replaced by live chain evidence. The dashboard structure stays the same.

TRACKS

Track  Community              Product                       What they judge
TALK   Patients + clinicians  MammoChat, OncoChat, MedChat  Clinical quality, compassion, clarity
CODE   Developers             Kilocode, Claude Code         Code correctness, codebase understanding, conventions
CODE TRACK: HOW TO PARTICIPATE

Kilocode (VSCode)

Kilocode is an OpenAI-compatible AI coding extension for VSCode. It connects to the BAKEOFF through api.canonic.org.

Setup:

  • Install the Kilocode extension in VSCode.
  • Open Kilocode settings.
  • Set the API base URL to https://api.canonic.org/v1.
  • Set the model to canonic-kilocode.
  • Code normally. The system randomly assigns a provider (DeepSeek, Qwen, GLM, GPT-4, Claude). Provider identity is hidden.
  • If unsatisfied with responses, switch models via Kilocode’s model selector. The switch is the vote.
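Because the proxy is OpenAI-compatible, it can also be exercised outside the editor with a standard chat-completions request against the base URL and model ID above. A minimal sketch; the helper names and response handling here are assumptions, not an official client:

```python
import json
import urllib.request

BASE_URL = "https://api.canonic.org/v1"
MODEL_ID = "canonic-kilocode"  # proxy assigns the hidden provider

def build_request(prompt: str) -> dict:
    # Standard OpenAI-compatible chat-completions payload.
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }

def send(payload: dict) -> dict:
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# payload = build_request("Refactor this function")
# reply = send(payload)  # provider identity stays hidden behind the proxy
```

The caller never names a provider; only the proxy-facing model ID appears in the request.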

Available model IDs are listed in the table below. The developer does NOT need to know the underlying provider. The proxy handles assignment. The chain records everything.

Claude Code (CLI)

Claude Code is Anthropic’s CLI agent. It participates in the CODE track through direct Anthropic API usage.

Setup:

  • Install the Claude Code CLI (claude).
  • Use it normally against your codebase.
  • Sessions are captured as CODE track evidence when operating under governed repositories.

What the system records (both tools)

Every CODE track session records: model assignment, switch events, timestamps, conversation, latency, token usage, cost. No surveys. The switch is the vote. The chain is the evidence.

Model ID                              What it routes to
`canonic-kilocode`                    Default (RunPod DeepSeek Coder)
`canonic-kilocode-commercial`         Commercial (OpenAI GPT-4)
`canonic-kilocode-opensource-runpod`  Open-source on RunPod
`canonic-kilocode-opensource-vast`    Open-source on Vast.ai
SCOREBOARD

The scoreboard is the output. One row per model, per track:

Model   Track  Retention  Switch-away  Avg time-to-switch  Latency  Cost/1K tok
Claude  CODE   78.4%      21.6%        14.3 msg            1.8s     $0.015
OpenAI  CODE   61.2%      38.8%        7.8 msg             1.2s     $0.010
Claude  TALK   -          -            -                   -        -
OpenAI  TALK   -          -            -                   -        -

A model that nobody switches away from is winning. A model that everyone abandons after 2 messages is losing. The routing policy writes itself from this data.
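Retention and switch-away are complements, both derived from chain switch events. A sketch of the derivation under that assumption (hypothetical helper, not the actual ledger code):

```python
def scoreboard_row(sessions):
    """sessions: list of (switched: bool, msgs_before_switch: int) tuples
    drawn from chain records for one model on one track."""
    total = len(sessions)
    switched = [msgs for did_switch, msgs in sessions if did_switch]
    retention = 1 - len(switched) / total
    # Average messages before the user switched away, if anyone did.
    avg_time_to_switch = sum(switched) / len(switched) if switched else None
    return retention, 1 - retention, avg_time_to_switch

# Example: 3 of 4 sessions stay; one switches after 8 messages.
rate, away, ttl = scoreboard_row([(False, 0), (False, 0), (False, 0), (True, 8)])
# rate = 0.75, away = 0.25, ttl = 8.0
```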
BAKERS

The community runs the bakeoff. Every session is a contribution. Bakers who evaluate the most models generate the most evidence.

Evidence score = sessions + switches. More switches = more comparative evidence.

Baker         Track  Sessions  Switches  Models evaluated  Evidence score
dexterhadley  CODE   512       89        2                 601
ir4y          CODE   198       47        2                 245
yanabeda      CODE   137       31        2                 168
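The evidence score is a straight sum, which the table rows above confirm:

```python
def evidence_score(sessions: int, switches: int) -> int:
    # More switches = more comparative evidence.
    return sessions + switches

assert evidence_score(512, 89) == 601   # dexterhadley
assert evidence_score(198, 47) == 245   # ir4y
assert evidence_score(137, 31) == 168   # yanabeda
```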
ECOSYSTEM

  • Upstream: MAGIC governance contracts and SERVICES meta-governance.
  • Runtime: api.canonic.org performs random model assignment and chain recording. ~/.canonic stores local evidence artifacts.
  • Frontend: BAKEOFF dashboard surfaces on governed catalog. Scoreboard is the presentation.
  • Ledger plane: consumes chain evidence (assignments, switches, telemetry) and feeds the dashboard.

Models in scope:

  • claude (Anthropic — Claude Opus 4.6, Claude Sonnet 4.5)
  • openai (OpenAI — GPT-4o, GPT-4-turbo)
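The spec calls assignment random yet wants every comparison deterministic. One way to reconcile the two is to hash the session id, so the assignment looks random to the user but is reproducible from the chain alone. An illustrative sketch only, not the actual api.canonic.org logic:

```python
import hashlib

MODELS = ["claude", "openai"]  # models in scope

def assign_model(session_id: str) -> str:
    # Hash the session id: unpredictable to the user,
    # but replayable by anyone auditing the chain.
    digest = hashlib.sha256(session_id.encode()).digest()
    return MODELS[digest[0] % len(MODELS)]
```

The same session id always maps to the same model, so an auditor can re-derive every assignment from recorded evidence.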