2 Commits

Author SHA1 Message Date
corby
406b26d4a3 docs: add WebTailBench V1↔V2 diff page; note refresh in README
V2 refreshes 609 WebTailBench tasks and their precomputed rubrics
(V1 was calendar-bound through Nov 2025). New side-by-side diff page
under docs/ shows per-task changes in task_summary plus a collapsible
unified diff of the precomputed_rubric JSON for each task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:10:51 -07:00
corby
2ec67d7236 Add WebTailBench rubric comparison visualizer
Standalone HTML page that shows, for each WebTailBench task, the
rubric criteria produced by three different judge configurations
side by side:

  1. O4-Mini Rubric                    — historical baseline
  2. GPT-5 (v1)                        — original GPT-5 judge
  3. Universal Verifier Rubric (GPT-5.2) — current release

Tasks are grouped by WebTailBench benchmark category, with
incremental search across ids / summaries / criteria, and an
"all three rubrics only" toggle. The header links directly to the
microsoft/WebTailBench dataset on Hugging Face and to the
WebTailBench-v1-rubrics.tsv download so readers can grab the
underlying data.

Source layout:
  docs/webtailbench_rubric_comparison.html  (single self-contained file)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 21:12:47 -07:00