llm_call_expect_json read `result.content.content`, assuming `content` was a nested object.
On the OpenAI/phyagi-gateway path CreateResult.content is already a plain
str, so this raised "'str' object has no attribute 'content'" on every
attempt — Step 9b (trajectory-informed task verification) failed all retries
and silently fell back to a default. The bug was masked because the test only
asserts the result keys exist, not that the step ran.
Mirror task_classification's helper: unwrap .content only when present. Step
9b now runs cleanly against the gateway (no retries, real verdict).
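A minimal sketch of the unwrap-only-when-present pattern described above. `FakeCreateResult` and `unwrap_content` are hypothetical names used for illustration; the real helper in task_classification may be shaped differently.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeCreateResult:
    """Hypothetical stand-in for the client's result object."""
    content: Any


def unwrap_content(result: Any) -> Any:
    """Return the payload whether `content` is a plain str or a wrapper."""
    content = result.content
    # A plain str has no `.content`, so only unwrap when the attribute exists.
    if not isinstance(content, str) and hasattr(content, "content"):
        content = content.content
    return content


# Gateway-style result: content is already a str.
flat = unwrap_content(FakeCreateResult('{"verdict": "ok"}'))
# Nested-style result: content wraps another object carrying the str.
nested = unwrap_content(FakeCreateResult(FakeCreateResult('{"verdict": "ok"}')))
```

Both calls yield the same string, so the caller can parse JSON without knowing which client path produced the result.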
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A live verifier test wedged for ~19h: a judge endpoint returning HTTP 401
(mapped by the openai SDK to AuthenticationError) hit the AuthenticationError
branch, which neither blocklisted the endpoint nor consumed the `tries`
budget — so next_client() kept handing back the same failing endpoint and
the loop spun forever, opening connections the whole time.
- AuthenticationError and the check-access-response-enc branch now decrement
`tries` (and AuthenticationError backs off 1s) so a persistent auth/access
failure on a small pool can't loop forever.
- Add a hard `max_total_attempts` cap (default max_retries + 2*n_endpoints)
as a backstop: create() always terminates regardless of which branch fires,
including blocklisting paths that intentionally don't spend the budget.
- Exhaustion error now reports total_attempts + both caps.
- Tests: 3 regression cases (persistent-auth termination across 1 and 3
endpoints, all-endpoints-blocklisted termination), each guarded by a 30s
asyncio.wait_for so any future regression fails loudly instead of hanging.
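The termination logic above can be sketched roughly as follows. This is an illustration, not the actual create() implementation: `AuthError`, `RetriesExhausted`, `create_with_caps`, and the endpoint-selection rule are hypothetical, and `ConnectionError` stands in for whichever failures trigger blocklisting.

```python
import asyncio


class AuthError(Exception):
    """Hypothetical stand-in for the SDK's authentication error."""


class RetriesExhausted(Exception):
    """Raised when either the retry budget or the hard cap is spent."""


async def create_with_caps(endpoints, call, max_retries=3, backoff_s=1.0):
    tries = max_retries
    # Hard backstop: bounds the loop even on paths that do not spend `tries`.
    max_total_attempts = max_retries + 2 * len(endpoints)
    total_attempts = 0
    blocklist = set()
    while tries > 0 and total_attempts < max_total_attempts:
        live = [e for e in endpoints if e not in blocklist]
        if not live:
            break  # every endpoint is blocklisted
        endpoint = live[total_attempts % len(live)]
        total_attempts += 1
        try:
            return await call(endpoint)
        except AuthError:
            tries -= 1  # persistent auth failures now consume the budget
            await asyncio.sleep(backoff_s)
        except ConnectionError:
            blocklist.add(endpoint)  # intentionally does not spend `tries`
    raise RetriesExhausted(
        f"total_attempts={total_attempts}, max_retries={max_retries}, "
        f"max_total_attempts={max_total_attempts}"
    )


async def always_401(endpoint):
    raise AuthError(endpoint)


async def guarded(endpoints, call):
    # Same idea as the regression tests: a timeout turns a hang into a failure.
    try:
        await asyncio.wait_for(
            create_with_caps(endpoints, call, max_retries=2, backoff_s=0),
            timeout=5,
        )
    except RetriesExhausted as exc:
        return str(exc)
```

Wrapping the call in asyncio.wait_for means a regression surfaces as a TimeoutError rather than a wedged test run.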
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings the rubric_agent package up to parity with agento_next/main
(post-#1071 architecture), adapted to webeval:
- Adds `CriticalPointAgent` (Step -1) that classifies the task against
a critical-point taxonomy (`critical_point_types.yaml`) and threads
the result through rubric generation, action-only scoring, outcome
verification, and a new CP-violation check.
- Adds `VerifierAgent` (Steps 9a/9b/10): first-point-of-failure
analysis with the error taxonomy, trajectory-informed task
verification, and unified task verification. Step 11 (synthetic
human-voice feedback) is intentionally dropped — not needed in webeval.
- Mirrors upstream #1071's DRY refactor: extracts shared helpers
(`format_action_history`, `call_llm`, `encode_image_b64`,
`get_init_url_context`, `build_scored_rubric_summary`,
`build_all_screenshot_evidence_text`) into `formatting.py` so
`MMRubricAgent` and `VerifierAgent` don't duplicate them. Also
removes the lazy `MMRubricAgent._format_action_history` import
from `critical_point_classifier.py`.
- Picks up upstream #889 `run_command` support: adds
`StepSummary.tool_output`, populates it from post-action
`ToolOutput` observations, renders a `Command Output:` line in the
action history, and teaches the prompts to treat that output as
ground truth (so an unchanged desktop after `run_command` is not
read as failure) while sanity-checking that the command isn't a
fake such as `echo "success"`.
- Adds missing runtime deps (`imagehash`, `jinja2`) to
`webeval/pyproject.toml` — both are imported by the new modules.
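As an illustration of the `run_command` plumbing above, the sketch below shows one way a `tool_output` field could surface as a `Command Output:` line in the action history. Only `StepSummary.tool_output`, `format_action_history`, and the `Command Output:` label come from this change; the other fields and the exact line layout are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepSummary:
    index: int                         # hypothetical field
    action: str                        # hypothetical field
    tool_output: Optional[str] = None  # set from the post-action observation


def format_action_history(steps: List[StepSummary]) -> str:
    """Render one line per step, plus command output when present."""
    lines = []
    for step in steps:
        lines.append(f"Step {step.index}: {step.action}")
        if step.tool_output is not None:
            lines.append(f"  Command Output: {step.tool_output}")
    return "\n".join(lines)


history = format_action_history([
    StepSummary(1, "click(submit)"),
    StepSummary(2, "run_command(ls /tmp)", tool_output="report.csv"),
])
```

Steps without tool output render exactly as before, so prompts only see the extra line when a command actually produced something.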
Architecture note: the branch keeps `MMRubricAgent` and `VerifierAgent`
as independent agents orchestrated by the caller, rather than upstream
#1071's compose pattern. This is the same direction (decoupling) and
goes a step further by also stripping Step 11. `verify_trajectories.py`
drives them in sequence.
All 28 webeval unit tests pass; the live-LLM end-to-end test
(`test_verify_trajectories_live_llm`, opt-in via `FARA_VERIFY_LIVE_TEST=1`)
asserts the new CP-aware fields (`cp_type_used`, `cp_violation`,
`error_taxonomy.first_point_of_failure.failure_points[].error_code`,
Steps 9b/10 `is_ambiguous`/`is_invalid`) are written to the score file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HF datasets metadata does not allow '-' in split names. Match the
corrected split name on microsoft/WebTailBench.
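A small pre-flight check along these lines would catch the problem before upload. The pattern below (word characters, optionally dot-separated) is an assumption modeled on the rule described above, not a copy of the library's validator, and the split names in the example are made up.

```python
import re

# Assumed rule: word characters, optionally separated by dots; no hyphens.
_SPLIT_NAME = re.compile(r"\w+(\.\w+)*")


def is_valid_split_name(name: str) -> bool:
    return _SPLIT_NAME.fullmatch(name) is not None


def sanitize_split_name(name: str) -> str:
    """Replace anything outside word characters and dots with '_'."""
    return re.sub(r"[^\w.]", "_", name)


bad = "my-split"  # hypothetical name
fixed = sanitize_split_name(bad)
```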
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
V2 refreshes 609 WebTailBench tasks and their precomputed rubrics
(V1 was calendar-bound through Nov 2025). New side-by-side diff page
under docs/ shows per-task changes in task_summary plus a collapsible
unified diff of the precomputed_rubric JSON for each task.
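For reference, a per-task unified diff of two rubric JSON objects can be produced with the standard library roughly as below. This is a sketch, not the generator behind the docs page; `rubric_diff` and the sample rubrics are invented for illustration.

```python
import difflib
import json


def rubric_diff(old: dict, new: dict) -> str:
    """Unified diff of two rubrics, serialized with stable key order."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="v1", tofile="v2", lineterm=""
        )
    )


# Made-up rubric content, only to exercise the function.
v1 = {"deadline": "Nov 2025"}
v2 = {"deadline": "next month"}
diff_text = rubric_diff(v1, v2)
```

Sorting keys before diffing keeps the output stable when the two versions differ only in key order.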
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Commit acc1404 removed the TrajectoryDiagnosticsResult class but left
it inside the VerificationResultEvent discriminated Union, which made
`from webeval.rubric_agent.data_point import *` raise NameError at
import time and broke every test that transitively imports the module
(21 failures / errors across test_rubric_agent_imports,
test_shared_data_adapter, test_webtailbench_dataset,
test_verify_trajectories). Removing the orphan reference restores the
test suite to 27 passed / 2 skipped.
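The failure mode is easy to reproduce in isolation: a module-level Union is evaluated when the module is imported, so any undefined member name raises immediately. The sketch below simulates that with exec on a source string; `RubricResult` and the helper are hypothetical, and the real data_point module is not reproduced here.

```python
# Simplified module source with an orphan reference in the Union.
MODULE_SOURCE = """
from typing import Union

class RubricResult:
    pass

VerificationResultEvent = Union[RubricResult, TrajectoryDiagnosticsResult]
"""


def import_error_for(source: str):
    """Execute module-like source; return the NameError message, if any."""
    namespace = {}
    try:
        exec(compile(source, "<data_point_sketch>", "exec"), namespace)
    except NameError as exc:
        return str(exc)
    return None


error = import_error_for(MODULE_SOURCE)
```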
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the cuaverifierbench/ build script (now hosted alongside the
dataset on HuggingFace) and checks in a full WebTailBench trajectory
under webeval/data/example_trajectory/ (web_surfer.log, final_answer,
screenshots, core.log, times.json, task_data.json, rubric score file).
webeval/README.md now walks through each file against that concrete
example and corrects several field-level inaccuracies (times.json keys,
emitted event types, webtailbench score payload, auto-0 vs excluded
semantics). Adds test_trajectory_loading and test_verify_trajectories
coverage; repairs HuggingFace dataset URLs and doubled /path/to paths
in the root README.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUAVerifierBench is a human-annotated benchmark for evaluating CUA
verifiers (judges that score agent trajectories). Released to
huggingface.co/datasets/microsoft/CUAVerifierBench as two configs
(trajectories + annotations) joinable on task_id, with two splits:
- fara7b_om2w_browserbase: 106 Fara-7B Online-Mind2Web/Browserbase
trajectories x ~2 reviewers (UV-blind + UV-informed labels)
- internal: 154 trajectories from a held-out aurora-v2 task suite,
single reviewer per task (UV-blind only)
This commit adds:
- cuaverifierbench/build_dataset.py — builder script
- cuaverifierbench/README.md — dataset card mirrored to HF
- README.md — new badge, Updates entry (2026-04-19), and a
CUAVerifierBench section after the WebTailBench results table
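Since the two configs are joinable on task_id, a consumer could pair them along these lines. This is a plain-Python sketch; every field other than `task_id` is hypothetical and the rows are placeholders, not real dataset records.

```python
from collections import defaultdict


def join_on_task_id(trajectories, annotations):
    """Attach all annotation rows to their trajectory by task_id."""
    by_task = defaultdict(list)
    for ann in annotations:
        by_task[ann["task_id"]].append(ann)
    return [
        {**traj, "annotations": by_task.get(traj["task_id"], [])}
        for traj in trajectories
    ]


# Placeholder rows; one trajectory may have more than one reviewer.
trajs = [{"task_id": "t1"}, {"task_id": "t2"}]
anns = [
    {"task_id": "t1", "reviewer": "r1", "label": 1},
    {"task_id": "t1", "reviewer": "r2", "label": 0},
]
joined = join_on_task_id(trajs, anns)
```

Grouping annotations first keeps the join one-to-many, which fits a split where some tasks have about two reviewers and others have one.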
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes in one PR:
1. Remove webeval's dependency on ``autogen-core`` / ``autogen-ext``.
All chat completion clients, message types, and the graceful-retry
layer now live under ``webeval/src/webeval/oai_clients/`` —
self-contained wrappers around openai / azure-identity. Installation no
longer needs the autogen submodule; run ``pip install -e .[vllm]``,
then ``cd webeval; pip install -e .``.
2. Incorporate the initial (now-stale) WebTailBench benchmark into the
codebase. ``webeval/src/webeval/benchmarks/webtailbench/`` +
``webeval/scripts/webtailbench.py``. Loader auto-downloads
``WebTailBench-v1-rubrics.tsv`` from
``huggingface.co/datasets/microsoft/WebTailBench`` and threads each
task's published ``precomputed_rubric`` through to the verifier so
rubrics never get regenerated.
3. Release the Universal Verifier (``MMRubricAgent``) as the official
judge for WebTailBench. Multimodal, rubric-grounded, two-model
ensemble (``gpt-5.2`` + ``o4-mini``) with per-criterion scoring,
outcome verification, ambiguity / invalid-task classification, and
first-point-of-failure analysis. ``webeval/scripts/verify_trajectories.py``
is a stand-alone parallel runner that re-scores any directory of
webeval-shaped trajectories without touching the solver.
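The "rubrics never get regenerated" behaviour in item 2 can be sketched as below. Only the ``precomputed_rubric`` column name comes from this change; the ``task_id`` column, the helper names, and the sample TSV are assumptions for illustration, not the loader's actual code.

```python
import csv
import io
import json


def load_rubrics(tsv_text: str) -> dict:
    """Map task_id -> parsed rubric for rows that carry one."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {
        row["task_id"]: json.loads(row["precomputed_rubric"])
        for row in reader
        if row.get("precomputed_rubric")
    }


def rubric_for(task_id: str, rubrics: dict, generate):
    """Prefer the published rubric; call `generate` only when none exists."""
    if task_id in rubrics:
        return rubrics[task_id]
    return generate(task_id)


# Made-up TSV content standing in for the downloaded file.
SAMPLE_TSV = 'task_id\tprecomputed_rubric\nt1\t{"criteria": ["a"]}\n'
rubrics = load_rubrics(SAMPLE_TSV)
```

Passing the generator as a callable keeps the fallback explicit, so a missing rubric is the only path that reaches the LLM.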
Documentation: repo-root README ``Updates`` section + Reproducibility
CLI block; ``webeval/README.md`` documents the Trajectory / FinalAnswer
schema, the ``<no_answer>`` semantics, and per-benchmark score-file
shape.
Tests: 18 passing, 1 skipped (opt-in HF download).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone HTML page that shows, for each WebTailBench task, the
rubric criteria produced by three different judge configurations
side by side:
1. O4-Mini Rubric — historical baseline
2. GPT-5 (v1) — original GPT-5 judge
3. Universal Verifier Rubric (GPT-5.2) — current release
Tasks are grouped by WebTailBench benchmark category, with
incremental search across ids / summaries / criteria, and an
"all three rubrics only" toggle. The header links directly to the
microsoft/WebTailBench dataset on Hugging Face and to the
WebTailBench-v1-rubrics.tsv download so readers can grab the
underlying data.
Source layout:
docs/webtailbench_rubric_comparison.html (single self-contained file)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>