V2 refreshes 609 WebTailBench tasks and their precomputed rubrics
(V1 was calendar-bound through Nov 2025). Adds a side-by-side diff page
under docs/ that shows, for each task, the changes in task_summary plus
a collapsible unified diff of the precomputed_rubric JSON.
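
For reference, a per-task unified diff of two rubric objects can be
sketched as below. This is a minimal illustration, not the code that
generated the page: the function name rubric_diff, the v1/ and v2/ file
labels, and the example rubric shape are all assumptions.

```python
import difflib
import json


def rubric_diff(old_rubric, new_rubric, task_id="task"):
    """Return a unified diff (one string) between two rubric objects.

    Hypothetical helper: the real page may serialize rubrics differently.
    """
    # Serialize with sorted keys and fixed indentation so the diff reflects
    # content changes rather than key order or whitespace.
    old_lines = json.dumps(old_rubric, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new_rubric, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"v1/{task_id}",  # assumed label, not a real path
        tofile=f"v2/{task_id}",    # assumed label, not a real path
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    # Made-up rubric content, for illustration only.
    v1 = {"criteria": ["Finds a flight in Nov 2025"]}
    v2 = {"criteria": ["Finds a flight in the requested month"]}
    print(rubric_diff(v1, v2, task_id="example"))
```

Identical rubrics produce an empty string, which a page could use to
skip rendering the collapsible section for unchanged tasks.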
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a standalone HTML page that shows, for each WebTailBench task, the
rubric criteria produced by three different judge configurations,
side by side:
1. O4-Mini Rubric — historical baseline
2. GPT-5 (v1) — original GPT-5 judge
3. Universal Verifier Rubric (GPT-5.2) — current release
Tasks are grouped by WebTailBench benchmark category. The page supports
incremental search across task ids, summaries, and criteria, and has an
"all three rubrics only" toggle. The header links directly to the
microsoft/WebTailBench dataset on Hugging Face and to the
WebTailBench-v1-rubrics.tsv download so readers can get the
underlying data.
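
The grouping, search, and toggle behavior can be sketched roughly as
follows. This is an illustration under stated assumptions, not the
page's actual script: the record fields (id, category, summary,
rubrics), the judge keys, and the reading of the toggle as "keep only
tasks that have a rubric from all three judges" are hypothetical.

```python
from collections import defaultdict

# Hypothetical keys for the three judge configurations.
JUDGES = ("o4_mini", "gpt5_v1", "uv_gpt52")


def filter_tasks(tasks, query="", require_all_three=False):
    """Keep tasks matching a case-insensitive substring query.

    The query is matched against the task id, summary, and every
    criterion string. With require_all_three=True, tasks missing a
    rubric from any judge are dropped.
    """
    needle = query.strip().lower()
    kept = []
    for task in tasks:
        rubrics = task.get("rubrics", {})
        if require_all_three and not all(rubrics.get(j) for j in JUDGES):
            continue
        criteria = [c for j in JUDGES for c in (rubrics.get(j) or [])]
        haystack = " ".join(
            [str(task.get("id", "")), task.get("summary", ""), *criteria]
        ).lower()
        if needle and needle not in haystack:
            continue
        kept.append(task)
    return kept


def group_by_category(tasks):
    """Group tasks by their (assumed) category field, preserving order."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task.get("category", "uncategorized")].append(task)
    return dict(groups)
```

Re-running filter_tasks on each keystroke and then regrouping the result
is one simple way to get incremental search over a few hundred tasks.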
Source layout:
docs/webtailbench_rubric_comparison.html (single self-contained file)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>