mirror of
https://github.com/microsoft/fara.git
synced 2026-06-10 02:54:01 +08:00
Standalone HTML page that shows, for each WebTailBench task, the rubric criteria produced by three different judge configurations side by side:

  1. O4-Mini Rubric - historical baseline
  2. GPT-5 (v1) - original GPT-5 judge
  3. Universal Verifier Rubric (GPT-5.2) - current release

Tasks are grouped by WebTailBench benchmark category, with incremental search across ids / summaries / criteria, and an "all three rubrics only" toggle.

The header links directly to the microsoft/WebTailBench dataset on Hugging Face and to the WebTailBench-v1-rubrics.tsv download so readers can grab the underlying data.

Source layout:
  docs/webtailbench_rubric_comparison.html  (single self-contained file)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
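
The grouping, search, and toggle behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration in Python, not the page's actual code: the `Task` fields, the judge keys in `JUDGES`, and the helper names are all hypothetical, and the real page may match and group differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical judge keys; the real page may label the three rubrics differently.
JUDGES = ("o4-mini", "gpt-5-v1", "universal-verifier")


@dataclass
class Task:
    # All field names are assumptions made for this sketch.
    task_id: str
    category: str
    summary: str
    # Maps a judge key to the list of rubric criteria it produced.
    rubrics: Dict[str, List[str]] = field(default_factory=dict)


def matches(task: Task, query: str) -> bool:
    """Case-insensitive substring match over id, summary, and all criteria."""
    needle = query.strip().lower()
    if not needle:
        return True  # an empty query matches every task
    haystack = [task.task_id, task.summary]
    for criteria in task.rubrics.values():
        haystack.extend(criteria)
    return any(needle in text.lower() for text in haystack)


def has_all_rubrics(task: Task) -> bool:
    """True when every judge produced at least one criterion for the task."""
    return all(task.rubrics.get(judge) for judge in JUDGES)


def filter_and_group(tasks, query="", all_three_only=False):
    """Apply the toggle and the search, then bucket the tasks by category."""
    grouped: Dict[str, List[Task]] = {}
    for task in tasks:
        if all_three_only and not has_all_rubrics(task):
            continue
        if not matches(task, query):
            continue
        grouped.setdefault(task.category, []).append(task)
    return grouped
```

Re-running `filter_and_group` on each keystroke would give the incremental search effect; on a large task list, precomputing a lowercased search string per task would be a reasonable optimisation.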