Microsoft-fara

mirror of https://github.com/microsoft/fara.git synced 2026-05-16 10:22:39 +08:00

Commit Graph

Author

SHA1

Message

Date

corby

406b26d4a3

docs: add WebTailBench V1↔V2 diff page; note refresh in README

V2 refreshes 609 WebTailBench tasks and their precomputed rubrics
(V1 was calendar-bound through Nov 2025). New side-by-side diff page
under docs/ shows per-task changes in task_summary plus a collapsible
unified diff of the precomputed_rubric JSON for each task.
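
As a rough illustration of what a unified diff of a precomputed_rubric JSON value could look like, here is a minimal sketch using Python's standard difflib. It is not the code behind the diff page, which this log does not show; the function name rubric_diff, the task id, and the rubric contents are all hypothetical and exist only for the example.

```python
import difflib
import json


def rubric_diff(old_rubric, new_rubric, task_id):
    """Return a unified diff (as one string) between two rubric objects."""
    # Serialize with sorted keys and fixed indentation so the diff reflects
    # content changes rather than key order or whitespace.
    old_lines = json.dumps(old_rubric, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new_rubric, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{task_id}/v1",
        tofile=f"{task_id}/v2",
        lineterm="",  # lines carry no trailing newline, so suppress the default
    )
    return "\n".join(diff)


# Hypothetical rubric contents, for illustration only (not taken from the dataset).
v1 = {"criteria": [{"description": "Finds a flight departing in Nov 2025", "points": 1}]}
v2 = {"criteria": [{"description": "Finds a flight departing in the requested month", "points": 1}]}

print(rubric_diff(v1, v2, "task_001"))
```

Identical rubrics yield an empty string, so a page built this way could skip the collapsible section for unchanged tasks.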

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-12 15:10:51 -07:00

corby

2ec67d7236

Add WebTailBench rubric comparison visualizer

Standalone HTML page that shows, for each WebTailBench task, the
rubric criteria produced by three different judge configurations
side by side:

  1. O4-Mini Rubric                    — historical baseline
  2. GPT-5 (v1)                        — original GPT-5 judge
  3. Universal Verifier Rubric (GPT-5.2) — current release

Tasks are grouped by WebTailBench benchmark category, with
incremental search across ids / summaries / criteria, and an
"all three rubrics only" toggle. The header links directly to the
microsoft/WebTailBench dataset on Hugging Face and to the
WebTailBench-v1-rubrics.tsv download so readers can grab the
underlying data.

Source layout:
  docs/webtailbench_rubric_comparison.html  (single self-contained file)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-16 21:12:47 -07:00

2 Commits