Fono — Model Decision Page

Matrix last extended 2026-05-23 · 629 measured iterations · 210 cells · 5 hosts · higher RTF = faster (audio-s / wall-s).

Goal. Tune Fono's first-run wizard so that on every host it picks the model, quantisation, and backend that maximises perceived speed without crossing the accuracy ceiling — and so the algorithm doing the picking is small enough to fit on a page, audit by hand, and ship in the binary with no runtime probe.

Scope. We have collected ≈ 920 benchmark runs to date: 629 of them survive in the current matrix as 210 (host × backend × model × quant) cells; the rest were discarded as I did some performance optimizations and data was no longer relevant. The findings are computed from the cells still in the matrix; the per-host Pareto frontier (section 6) simulates the live wizard algorithm so the page can be read as both data and decision.

Host

Backend Family Language Quant

Preset

Key findings (data-derived)

…

1 · Decision Heatmap — what the wizard picks per (host, family, language)

Rows = hosts sorted by release year, annotated with detected GPU and the GPU class the wizard uses (legacy = 1× multiplier, integrated = 2×, discrete = 4×; see ADR 0028). Columns = model family × language. Each cell runs the wizard's full picker on the (host, family, language) niche: filter by gates (batch RTF ≥ 2.0 · peak RSS ≤ 1024 MiB · model not in AccuracyBucket::Inaccurate per the registry's wer_by_lang["en"]), group by quant, take Vulkan over CPU at ≥ 1.2× CPU, prefer q8 → q5 → fp16. Accuracy is decided from the registry (Open-ASR-Leaderboard means) — not from the worst single fixture in our equivalence set — to mirror what the Rust wizard does (crates/fono/src/wizard.rs:1442-1445). Our ~13-clip set is too small to reliably gate on the worst fixture. The displayed quant and backend are what the wizard would actually use; the ⚡ badge fires when Vulkan wins the tiebreak.

2 · Speed Sweep — batch RTF by model variant, faceted per host

One panel per host. Bars within a model variant show fp16 / q8 / q5. Empty slots = not measured. Y axis is log by default; toggle "Piecewise RTF" in the filter bar to give the decision zone (0–10×) more vertical room. The horizontal line marks batch RTF = 2.0 (the wizard's batch gate).

3 · Quant Speedup vs fp16 — CPU build, VNNI vs avx2-fallback kernels

Ratio = (quant batch RTF) ÷ (fp16 batch RTF), aggregated across all measured (host, model) pairs that ran on the CPU build. Bars show the median; thin whiskers span min→max; n is the sample count. Hosts with AVX-VNNI (Alder Lake+, Ryzen Zen3+) execute integer-quant kernels in vector units; pre-VNNI hosts fall back to scalar paths. Vulkan rows are excluded here — see chart 4 for the CPU-vs-Vulkan comparison.

4 · CPU vs Vulkan — batch RTF and speedup

Paired bars per (host, model, quant) where both backends have data. Speedup = Vulkan RTF ÷ CPU RTF (>1 means Vulkan faster).

5 · Speed vs Accuracy — picking a model per host

One point per measured (host, build, model, quant, language) cell. X = English WER (lower is better, leftmost is most accurate); Y = batch RTF (higher is faster, log scale). The top-left is the sweet spot. Colour = host; circle = CPU build, triangle = Vulkan build. Marker size grows with model family (tiny→turbo). A horizontal dashed green line marks the batch RTF ≥ 2.0 comfortable gate used by the wizard. Hover any point for full details including Δ accuracy vs that cell's fp16 baseline. Click legend entries to hide/show individual host/build pairs.

6 · Pareto frontier per host — what the wizard would pick

One scatter per host. X-axis is the bench's English WER mean across all measured fixtures (accuracy_en_mean); the worst-fixture value (accuracy_en_max) is shown only in the tooltip — our ~13-clip equivalence set is too small to gate on the worst row without one bad transcript flipping the verdict (same reasoning as section 1; see comment in wizardVerdict). Y-axis is batch RTF (higher=better). Cells with bench mean WER above the 0.30 display cap are dropped from the plot; the per-host header notes how many are hidden.

Gates are evaluated against the registry, not the bench. The accuracy gate is REGISTRY_WER[model] ≤ 0.15 (AccuracyBucket::Inaccurate in crates/fono/src/wizard.rs:1442-1445) — Open-ASR-Leaderboard means published per model, sourced from the much larger fixture pool than ours. Bench RTF / RSS still gate live perf.

The green diamonds are the Pareto frontier in the bench-mean / RTF plane — cells where no other cell on that host is both more accurate AND faster. The blue reticle (⊕) marks the recommendation the wizard would actually pick: it runs the same wizardPickFor simulator as the section-1 heatmap on every shipped (family · language) niche, keeps those that pass the gates, and sorts by Rust shortlist order (wizard.rs:1506-1517): registry accuracy bucket asc, then largest approx_mb first. Vulkan beats CPU only at ≥ 1.2×; quant inside a niche follows q8 → q5 → fp16. Dashed horizontal line shows the batch RTF gate.

7 · Capability Ladder — measured RTF vs host capability (with wizard predictions)

Three orthogonal capability axes. Solid dots are measured; the grey dashed line is what the wizard's affordability estimator would predict (hwcheck.rs:228-259 and ADR 0028). Where dots sit above the line the wizard is under-promising; below = over-promising. All three charts respect the filter bar above.

1. Measured Vulkan / CPU speedup vs GPU class — validates ADR 0028 multipliers

●q8_0▲q5◆fp16

2. CPU q8_0 RTF vs AVX-VNNI (normalized to wizard prediction) — exposes the unmodeled VNNI cliff

●q8_0

3. CPU RTF (normalized to registry anchor) vs physical cores — validates core_scale formula

●q8_0▲q5◆fp16

8 · Decision Quadrants — interactive accuracy/speed thresholds

Drag the sliders to set your accuracy ceiling (max WER) and speed floor (min batch RTF). Every measured cell falls into one of four quadrants: Ship it (low WER + high RTF, top-left), Fast but inaccurate (top-right), Accurate but slow (bottom-left), or Unusable (bottom-right) — shown as faint background tints. Dot shape and colour encode the model family: ● tiny · ▲ base · ◆ small · ★ turbo (also reflected in the legend). The lists below show exactly which (host, build, model, quant) cells land in each quadrant.

Max WER (accuracy ceiling) — Min batch RTF (speed floor) — Language: Quant:

9 · Coverage Matrix — what's measured, what's missing

Rows = hosts. Columns = (backend × family × language × quant). Green ≥2 iterations, yellow = 1, grey = not measured. Use this to plan the next bench session.

10 · Full Data Table

Click headers to sort. Filters above apply here too.

Host	Build	Fam	Lang	Model	Quant	Batch RTF	Stream RTF	TTFF s	RSS MiB	Acc (EN↓)	Δacc	Disk MiB	Iters	Verdict