Commit Graph

49 Commits

Author SHA1 Message Date
Hussein Mozannar
29bc9c33a0 Fix formatting in update entry for Fara 1.5 2026-05-21 17:27:30 -04:00
Hussein Mozannar
1a7ad8384d Update README
Update the README to note that Fara 1.5 is coming soon.
2026-05-21 17:26:57 -04:00
Corby Rosset
8d823d72c3 docs: WebTailBench V1↔V2 diff page + README refresh note (#74) 2026-05-12 18:22:24 -04:00
corby
80d96b1ffc README: rename split test-v2 -> test_v2
HF datasets metadata does not allow '-' in split names. Match the
corrected split name on microsoft/WebTailBench.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:16:43 -07:00
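The split-name restriction described in the commit above can be illustrated with a small check. This is only a sketch: the pattern is an assumption modeled on the kind of rule that rejects '-' in split names (word characters, optionally dot-separated), and `is_valid_split_name` is a hypothetical helper, not a function from the Hugging Face `datasets` library.

```python
import re

# Assumed pattern: word characters, optionally separated by dots.
# Illustrative only; not the library's authoritative definition.
SPLIT_NAME_RE = re.compile(r"^\w+(\.\w+)*$")


def is_valid_split_name(name: str) -> bool:
    """Return True if `name` matches the assumed split-name pattern."""
    return SPLIT_NAME_RE.match(name) is not None


print(is_valid_split_name("test-v2"))  # False: '-' is not a word character
print(is_valid_split_name("test_v2"))  # True: '_' is a word character
```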
corby
406b26d4a3 docs: add WebTailBench V1↔V2 diff page; note refresh in README
V2 refreshes 609 WebTailBench tasks and their precomputed rubrics
(V1 was calendar-bound through Nov 2025). New side-by-side diff page
under docs/ shows per-task changes in task_summary plus a collapsible
unified diff of the precomputed_rubric JSON for each task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 15:10:51 -07:00
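A unified diff of two rubric JSON payloads, as the commit above describes for the diff page, could be produced along these lines. This is a sketch under assumptions: the rubric shape (`criteria`, `name`, `points`) and the helper `rubric_diff` are hypothetical and do not reflect how the actual page was generated.

```python
import difflib
import json


def rubric_diff(old: dict, new: dict) -> str:
    """Return a unified diff between two rubrics serialized as stable JSON."""
    # sort_keys keeps key order stable so only real changes appear in the diff.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile="v1", tofile="v2", lineterm=""
    )
    return "\n".join(diff)


# Hypothetical rubric payloads, for illustration only.
v1 = {"criteria": [{"name": "finds a listing", "points": 2}]}
v2 = {"criteria": [{"name": "finds a current listing", "points": 2}]}
print(rubric_diff(v1, v2))
```

Serializing with sorted keys and fixed indentation before diffing avoids spurious differences caused by key ordering alone.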
Corby Rosset
44908264c8 Fix NameError from orphaned TrajectoryDiagnosticsResult reference (#68) 2026-04-23 01:42:24 -04:00
corby
ea3ce6eac4 Drop leftover TrajectoryDiagnosticsResult reference
Commit acc1404 removed the TrajectoryDiagnosticsResult class but left
it inside the VerificationResultEvent discriminated Union, which made
`from webeval.rubric_agent.data_point import *` raise NameError at
import time and broke every test that transitively imports the module
(21 failures / errors across test_rubric_agent_imports,
test_shared_data_adapter, test_webtailbench_dataset,
test_verify_trajectories). Removing the orphan reference restores the
test suite to 27 passed / 2 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:41:39 -07:00
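The failure mode in the commit above, a removed class still named inside a `Union`, can be reproduced in miniature. This is a sketch with hypothetical class names (`StepResult`, `OutcomeResult`, `RemovedResult`); it is not the code from `data_point.py`. The union is built inside a function here only so the error can be caught and shown. At module level the same expression would fail at import time.

```python
from typing import Union


class StepResult:
    """Hypothetical event type that still exists."""


class OutcomeResult:
    """Hypothetical event type that still exists."""


# RemovedResult is intentionally NOT defined, standing in for a deleted class.


def build_event_union():
    # Names inside Union[...] are evaluated eagerly, so the orphaned
    # reference raises NameError as soon as this expression runs.
    return Union[StepResult, OutcomeResult, RemovedResult]  # noqa: F821


try:
    build_event_union()
except NameError as exc:
    print(f"import would fail: {exc}")

# The fix: drop the orphaned member from the union.
FixedEvent = Union[StepResult, OutcomeResult]
```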
Corby Rosset
cde2ebcd6d Corby/universal verifier (#66) 2026-04-23 01:39:12 -04:00
Corby Rosset
acc14047f0 Remove TrajectoryDiagnosticsResult class
Removed the TrajectoryDiagnosticsResult class and its attributes from data_point.py.
2026-04-23 01:38:28 -04:00
Corby Rosset
b0030f7abe Update __init__.py 2026-04-23 01:33:57 -04:00
corby
7ea1e441c4 Add example_trajectory + expand webeval trajectory-format docs
Drops the cuaverifierbench/ build script (now hosted alongside the
dataset on HuggingFace) and checks in a full WebTailBench trajectory
under webeval/data/example_trajectory/ (web_surfer.log, final_answer,
screenshots, core.log, times.json, task_data.json, rubric score file).
webeval/README.md now walks through each file against that concrete
example and corrects several field-level inaccuracies (times.json keys,
emitted event types, webtailbench score payload, auto-0 vs excluded
semantics). Adds test_trajectory_loading and test_verify_trajectories
coverage; repairs HuggingFace dataset URLs and doubled /path/to paths
in the root README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 22:30:04 -07:00
corby
8cd48183e0 Add CUAVerifierBench dataset (build script + README + repo references)
CUAVerifierBench is a human-annotated benchmark for evaluating CUA
verifiers (judges that score agent trajectories). Released to
huggingface.co/datasets/microsoft/CUAVerifierBench as two configs
(trajectories + annotations) joinable on task_id, with two splits:

- fara7b_om2w_browserbase: 106 Fara-7B Online-Mind2Web/Browserbase
  trajectories x ~2 reviewers (UV-blind + UV-informed labels)
- internal: 154 trajectories from a held-out aurora-v2 task suite,
  single reviewer per task (UV-blind only)

This commit adds:
- cuaverifierbench/build_dataset.py — builder script
- cuaverifierbench/README.md — dataset card mirrored to HF
- README.md — new badge, Updates entry (2026-04-19), and a
  CUAVerifierBench section after the WebTailBench results table

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 20:01:57 -07:00
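Joining the two configs on `task_id`, as the commit above describes, might look like the following. This is a minimal sketch using in-memory rows: every field other than `task_id` is hypothetical, the sample values are made up for illustration, and `join_on_task_id` is not part of the dataset's tooling.

```python
from collections import defaultdict

# Hypothetical rows standing in for the two configs.
trajectories = [
    {"task_id": "t1", "final_answer": "placeholder answer 1"},
    {"task_id": "t2", "final_answer": "placeholder answer 2"},
]
annotations = [
    {"task_id": "t1", "reviewer": "r1", "label": 1},
    {"task_id": "t1", "reviewer": "r2", "label": 0},
    {"task_id": "t2", "reviewer": "r1", "label": 1},
]


def join_on_task_id(trajs, annots):
    """Attach all annotations for a task to its trajectory row."""
    by_task = defaultdict(list)
    for annotation in annots:
        by_task[annotation["task_id"]].append(annotation)
    # A trajectory may have several reviewers, so annotations stay a list.
    return [{**t, "annotations": by_task.get(t["task_id"], [])} for t in trajs]


joined = join_on_task_id(trajectories, annotations)
for row in joined:
    print(row["task_id"], len(row["annotations"]))
```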
corby
9f14b6e340 Universal Verifier (MMRubricAgent) + WebTailBench, autogen-free clients
Three changes in one PR:

1. Remove webeval's dependency on ``autogen-core`` / ``autogen-ext``.
   All chat completion clients, message types, and the graceful-retry
   layer now live under ``webeval/src/webeval/oai_clients/`` —
   self-contained wrappers around openai / azure-identity. Install no
   longer needs the autogen submodule; just ``pip install -e .[vllm]``
   then ``cd webeval; pip install -e .``.

2. Incorporate the initial (now-stale) WebTailBench benchmark into the
   codebase. ``webeval/src/webeval/benchmarks/webtailbench/`` +
   ``webeval/scripts/webtailbench.py``. Loader auto-downloads
   ``WebTailBench-v1-rubrics.tsv`` from
   ``huggingface.co/datasets/microsoft/WebTailBench`` and threads each
   task's published ``precomputed_rubric`` through to the verifier so
   rubrics never get regenerated.

3. Release the Universal Verifier (``MMRubricAgent``) as the official
   judge for WebTailBench. Multimodal, rubric-grounded, two-model
   ensemble (``gpt-5.2`` + ``o4-mini``) with per-criterion scoring,
   outcome verification, ambiguity / invalid-task classification, and
   first-point-of-failure analysis. ``webeval/scripts/verify_trajectories.py``
   is a stand-alone parallel runner that re-scores any directory of
   webeval-shaped trajectories without touching the solver.

Documentation: repo-root README ``Updates`` section + Reproducibility
CLI block; ``webeval/README.md`` documents the Trajectory / FinalAnswer
schema, the ``<no_answer>`` semantics, and per-benchmark score-file
shape.

Tests: 18 passing, 1 skipped (opt-in HF download).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:15:46 -07:00
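The idea in item 2 above, passing each task's published rubric through so it is never regenerated, can be sketched as follows. This is an illustration under assumptions: the column names `task_id` and `precomputed_rubric`, the unquoted TSV layout, and the helpers `load_rubrics` and `rubric_for` are hypothetical and are not the webeval loader's actual interface.

```python
import csv
import io
import json

# Hypothetical two-column TSV; the real file's columns and quoting may differ.
SAMPLE_TSV = 'task_id\tprecomputed_rubric\ndemo-1\t{"criteria": ["opens the site"]}\n'


def load_rubrics(tsv_text: str) -> dict:
    """Map task_id to its parsed rubric, assuming unquoted TSV fields."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return {row["task_id"]: json.loads(row["precomputed_rubric"]) for row in reader}


def rubric_for(task_id: str, rubrics: dict, generate):
    """Prefer the published rubric; call `generate` only when none exists."""
    if task_id in rubrics:
        return rubrics[task_id]
    return generate(task_id)


rubrics = load_rubrics(SAMPLE_TSV)
calls = []
result = rubric_for("demo-1", rubrics, lambda t: calls.append(t) or {})
print(result, "generator calls:", len(calls))
```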
Corby Rosset
2c27710499 Add WebTailBench rubric comparison visualizer (#65) 2026-04-17 00:34:52 -04:00
corby
2ec67d7236 Add WebTailBench rubric comparison visualizer
Standalone HTML page that shows, for each WebTailBench task, the
rubric criteria produced by three different judge configurations
side by side:

  1. O4-Mini Rubric                    — historical baseline
  2. GPT-5 (v1)                        — original GPT-5 judge
  3. Universal Verifier Rubric (GPT-5.2) — current release

Tasks are grouped by WebTailBench benchmark category, with
incremental search across ids / summaries / criteria, and an
"all three rubrics only" toggle. The header links directly to the
microsoft/WebTailBench dataset on Hugging Face and to the
WebTailBench-v1-rubrics.tsv download so readers can grab the
underlying data.

Source layout:
  docs/webtailbench_rubric_comparison.html  (single self-contained file)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-16 21:12:47 -07:00
Spencer Whitehead
0aa8a48430 update README (#52) 2026-01-14 15:09:53 -05:00
Andrew Zhao
e648c0e844 update README 2026-01-14 15:03:24 -05:00
Hussein Mozannar
6261cdb6fc Add citation for Fara-7B in README (#47) 2025-12-15 11:33:48 -05:00
Spencer Whitehead
76f38253e9 Add citation for Fara-7B in README
Updated citation information for Fara-7B and added a link to the associated paper.
2025-12-15 11:17:43 -05:00
Hussein Mozannar
3f3f1fd82c docs: fix typos and formatting in README (#44) 2025-12-10 12:08:47 -05:00
Mehul Jariwala
9d7efb0e11 Fix typos and enhance clarity in README.md
Corrected typos and improved clarity in README.
2025-12-10 10:52:24 +05:30
Hussein Mozannar
6cbb4f3694 Windows Support (#40) 2025-12-03 17:28:43 -05:00
Hussein Mozannar
82637b54df fixes 2025-12-02 14:01:05 -08:00
Corby Rosset
879dc9561a online mind2web works (#23) 2025-12-02 00:16:35 -05:00
corby
a3e2750c6a Merge branch 'main' into corby/om2w 2025-12-01 21:13:43 -08:00
Hussein Mozannar
5ebe2590f8 Update README with Magentic-UI instructions (#37) 2025-11-28 20:49:34 -05:00
Hussein Mozannar
5745ea715f Enhance README with video demos for Magentic-UI
Added a reference to video demos for the Magentic-UI integration.
2025-11-28 20:49:18 -05:00
Hussein Mozannar
2b81016788 Update README with Magentic-UI instructions
Added instructions for using Fara-7B with Magentic-UI and included a note about WSL2 for Windows users.
2025-11-28 20:48:25 -05:00
Hussein Mozannar
f10b9319b3 Refine Fara-7B description in README (#31) 2025-11-28 14:30:48 -05:00
Hussein Mozannar
ccdc3def6e Refine Fara-7B description in README
Removed redundancy in the description of Fara-7B's visual operation.
2025-11-28 14:30:31 -05:00
Hussein Mozannar
21469308d6 Update README with Windows and memory usage notes (#30) 2025-11-28 13:35:43 -05:00
Hussein Mozannar
b555f0874b Update README with Windows and memory usage notes
Added notes for Windows users and memory command.
2025-11-28 13:34:46 -05:00
corby
01f8a17ffd forgot om2w score file type 2025-11-27 11:37:43 -08:00
Wassim Chegham
ca1d3868f4 fix(fara-cli): refactor logging for action execution in FaraAgent (#25) 2025-11-26 10:39:45 -05:00
corby
1d6d1d64a6 online mind2web works 2025-11-25 22:47:53 -08:00
Corby Rosset
222c2a38e7 Multi round Fara fix (#16) 2025-11-25 23:49:56 -05:00
corby
f71083c9bd other fixes to webeval 2025-11-25 20:47:34 -08:00
corby
3fa15aa99a split to two devices 2025-11-25 20:04:24 -08:00
Hussein Mozannar
2816cb10a3 update cli 2025-11-25 15:40:56 -08:00
Hussein Mozannar
c2472e5a37 nicer experience 2025-11-25 15:22:42 -08:00
Hussein Mozannar
d330c93b22 use logging 2025-11-25 14:59:49 -08:00
Hussein Mozannar
9238221c15 small fix 2025-11-25 14:13:35 -08:00
Hussein Mozannar
d18cd7b675 Merge branch 'main' into multi-round 2025-11-25 16:45:27 -05:00
Alexey Taymanov
4b093dd243 self-hosted scenario fix (#13)
* import fix
* vllm requirements
* vllm config
* readme update
2025-11-25 13:44:02 -08:00
Hussein Mozannar
1d0e145150 revert back az vllm 2025-11-25 13:39:54 -08:00
Hussein Mozannar
e6a0174662 fix multi-turn 2025-11-25 13:36:24 -08:00
Hussein Mozannar
38c58ca681 cleanup 1 2025-11-25 13:03:48 -08:00
Alexey Taymanov
2606c8e1c4 self-hosted scenario fix (#13)
* import fix
* vllm requirements
* vllm config
* readme update
2025-11-24 20:12:38 -08:00
ataymano@microsoft.com
ff0dbac1d1 Initial commit 2025-11-24 10:32:01 -05:00