CUAVerifierBench is a human-annotated benchmark for evaluating CUA
verifiers (judges that score agent trajectories). Released to
huggingface.co/datasets/microsoft/CUAVerifierBench as two configs
(trajectories + annotations) joinable on task_id, with two splits:
- fara7b_om2w_browserbase: 106 Fara-7B Online-Mind2Web/Browserbase
trajectories x ~2 reviewers (UV-blind + UV-informed labels)
- internal: 154 trajectories from a held-out aurora-v2 task suite,
single reviewer per task (UV-blind only)
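
A minimal sketch of the task_id join described above, using plain Python
dicts as toy stand-ins for rows of the two configs. Only task_id comes
from the dataset description; every other field name (num_steps,
reviewer, label) and all values are hypothetical placeholders.

```python
from collections import defaultdict


def join_on_task_id(trajectories, annotations):
    """Attach every annotation row to the trajectory that shares its task_id."""
    by_task = defaultdict(list)
    for ann in annotations:
        by_task[ann["task_id"]].append(ann)
    # Trajectories with no annotations keep an empty list instead of being dropped.
    return [
        {**traj, "annotations": by_task.get(traj["task_id"], [])}
        for traj in trajectories
    ]


# Toy rows: field names other than task_id are hypothetical.
trajectories = [
    {"task_id": "t1", "num_steps": 12},
    {"task_id": "t2", "num_steps": 7},
    {"task_id": "t3", "num_steps": 30},
]
annotations = [
    {"task_id": "t1", "reviewer": "r1", "label": 1},
    {"task_id": "t1", "reviewer": "r2", "label": 0},
    {"task_id": "t2", "reviewer": "r1", "label": 1},
]

joined = join_on_task_id(trajectories, annotations)
for row in joined:
    print(row["task_id"], len(row["annotations"]))
```

Grouping annotations first keeps the join linear in the number of rows
and handles both the ~2-reviewer split and the single-reviewer split
with the same code path.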
This commit adds:
- cuaverifierbench/build_dataset.py — builder script
- cuaverifierbench/README.md — dataset card mirrored to HF
- README.md — new badge, Updates entry (2026-04-19), and a
CUAVerifierBench section after the WebTailBench results table
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes in one PR:
1. Remove webeval's dependency on ``autogen-core`` / ``autogen-ext``.
All chat completion clients, message types, and the graceful-retry
layer now live under ``webeval/src/webeval/oai_clients/`` —
self-contained wrappers around openai / azure-identity. Installation
no longer needs the autogen submodule: run ``pip install -e .[vllm]``,
then ``cd webeval; pip install -e .``.
2. Incorporate the initial (now-stale) WebTailBench benchmark into the
codebase under ``webeval/src/webeval/benchmarks/webtailbench/``, plus
the script ``webeval/scripts/webtailbench.py``. The loader
auto-downloads ``WebTailBench-v1-rubrics.tsv`` from
``huggingface.co/datasets/microsoft/WebTailBench`` and threads each
task's published ``precomputed_rubric`` through to the verifier, so
rubrics are never regenerated.
3. Release the Universal Verifier (``MMRubricAgent``) as the official
judge for WebTailBench. Multimodal, rubric-grounded, two-model
ensemble (``gpt-5.2`` + ``o4-mini``) with per-criterion scoring,
outcome verification, ambiguity / invalid-task classification, and
first-point-of-failure analysis. ``webeval/scripts/verify_trajectories.py``
is a stand-alone parallel runner that re-scores any directory of
webeval-shaped trajectories without touching the solver.
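
A rough sketch of the two ideas above: reusing a stored rubric instead of
regenerating it, and re-scoring a directory of trajectories in parallel
without running the solver. This is not the code in
``verify_trajectories.py``. The one-JSON-file-per-trajectory layout, the
``final_answer`` field, the location of ``precomputed_rubric`` inside
the file, and ``stub_judge`` (a trivial substring matcher standing in
for the real verifier) are all assumptions made for illustration.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def stub_judge(trajectory, rubric):
    """Hypothetical judge: fraction of rubric criteria found in the final answer."""
    answer = trajectory.get("final_answer", "")
    hits = [criterion for criterion in rubric if criterion in answer]
    return len(hits) / len(rubric) if rubric else 0.0


def score_one(path, judge):
    """Score one stored trajectory, reusing the rubric saved alongside it."""
    trajectory = json.loads(path.read_text())
    rubric = trajectory.get("precomputed_rubric", [])  # assumed storage location
    return path.stem, judge(trajectory, rubric)


def rescore_directory(root, judge=stub_judge, workers=4):
    """Re-score every *.json trajectory under root in parallel; the solver never runs."""
    paths = sorted(Path(root).glob("*.json"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda p: score_one(p, judge), paths))


def demo():
    """Write two toy trajectories to a temp directory and re-score them."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "task_a.json").write_text(json.dumps({
            "final_answer": "blue shoes, size 9",
            "precomputed_rubric": ["blue", "size 9"],
        }))
        (Path(tmp) / "task_b.json").write_text(json.dumps({
            "final_answer": "red shoes",
            "precomputed_rubric": ["red", "size 9"],
        }))
        return rescore_directory(tmp)


print(demo())
```

A thread pool suits this shape of work because a real judge would spend
most of its time waiting on model API calls rather than on local
computation.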
Documentation: repo-root README ``Updates`` section + Reproducibility
CLI block; ``webeval/README.md`` documents the Trajectory / FinalAnswer
schema, the ``<no_answer>`` semantics, and per-benchmark score-file
shape.
Tests: 18 passing, 1 skipped (opt-in HF download).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>