Microsoft-fara

mirror of https://github.com/microsoft/fara.git synced 2026-06-10 02:54:01 +08:00

Author	SHA1	Message	Date
Hussein Mozannar	29bc9c33a0	Fix formatting in update entry for Fara 1.5	2026-05-21 17:27:30 -04:00
Hussein Mozannar	1a7ad8384d	Update README: update readme for Fara 1.5 coming soon	2026-05-21 17:26:57 -04:00
Corby Rosset	8d823d72c3	docs: WebTailBench V1↔V2 diff page + README refresh note (#74)	2026-05-12 18:22:24 -04:00
corby	80d96b1ffc	README: rename split test-v2 -> test_v2. HF datasets metadata does not allow '-' in split names. Match the corrected split name on microsoft/WebTailBench. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:16:43 -07:00
corby	406b26d4a3	docs: add WebTailBench V1↔V2 diff page; note refresh in README V2 refreshes 609 WebTailBench tasks and their precomputed rubrics (V1 was calendar-bound through Nov 2025). New side-by-side diff page under docs/ shows per-task changes in task_summary plus a collapsible unified diff of the precomputed_rubric JSON for each task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 15:10:51 -07:00
Corby Rosset	44908264c8	Fix NameError from orphaned TrajectoryDiagnosticsResult reference (#68)	2026-04-23 01:42:24 -04:00
corby	ea3ce6eac4	Drop leftover TrajectoryDiagnosticsResult reference Commit `acc1404` removed the TrajectoryDiagnosticsResult class but left it inside the VerificationResultEvent discriminated Union, which made `from webeval.rubric_agent.data_point import *` raise NameError at import time and broke every test that transitively imports the module (21 failures / errors across test_rubric_agent_imports, test_shared_data_adapter, test_webtailbench_dataset, test_verify_trajectories). Removing the orphan reference restores the test suite to 27 passed / 2 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:41:39 -07:00
Corby Rosset	cde2ebcd6d	Corby/universal verifier (#66)	2026-04-23 01:39:12 -04:00
Corby Rosset	acc14047f0	Remove TrajectoryDiagnosticsResult class Removed the TrajectoryDiagnosticsResult class and its attributes from data_point.py.	2026-04-23 01:38:28 -04:00
Corby Rosset	b0030f7abe	Update __init__.py	2026-04-23 01:33:57 -04:00
corby	7ea1e441c4	Add example_trajectory + expand webeval trajectory-format docs Drops the cuaverifierbench/ build script (now hosted alongside the dataset on HuggingFace) and checks in a full WebTailBench trajectory under webeval/data/example_trajectory/ (web_surfer.log, final_answer, screenshots, core.log, times.json, task_data.json, rubric score file). webeval/README.md now walks through each file against that concrete example and corrects several field-level inaccuracies (times.json keys, emitted event types, webtailbench score payload, auto-0 vs excluded semantics). Adds test_trajectory_loading and test_verify_trajectories coverage; repairs HuggingFace dataset URLs and doubled /path/to paths in the root README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:30:04 -07:00
corby	8cd48183e0	Add CUAVerifierBench dataset (build script + README + repo references) CUAVerifierBench is a human-annotated benchmark for evaluating CUA verifiers (judges that score agent trajectories). Released to huggingface.co/datasets/microsoft/CUAVerifierBench as two configs (trajectories + annotations) joinable on task_id, with two splits: - fara7b_om2w_browserbase: 106 Fara-7B Online-Mind2Web/Browserbase trajectories x ~2 reviewers (UV-blind + UV-informed labels) - internal: 154 trajectories from a heldout aurora-v2 task suite, single reviewer per task (UV-blind only) This commit adds: - cuaverifierbench/build_dataset.py — builder script - cuaverifierbench/README.md — dataset card mirrored to HF - README.md — new badge, Updates entry (2026-04-19), and a CUAVerifierBench section after the WebTailBench results table Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 20:01:57 -07:00
corby	9f14b6e340	Universal Verifier (MMRubricAgent) + WebTailBench, autogen-free clients Three changes in one PR: 1. Remove webeval's dependency on ``autogen-core`` / ``autogen-ext``. All chat completion clients, message types, and the graceful-retry layer now live under ``webeval/src/webeval/oai_clients/`` — self-contained wrappers around openai / azure-identity. Install no longer needs the autogen submodule; just ``pip install -e .[vllm]`` then ``cd webeval; pip install -e .``. 2. Incorporate the initial (now-stale) WebTailBench benchmark into the codebase. ``webeval/src/webeval/benchmarks/webtailbench/`` + ``webeval/scripts/webtailbench.py``. Loader auto-downloads ``WebTailBench-v1-rubrics.tsv`` from ``huggingface.co/datasets/microsoft/WebTailBench`` and threads each task's published ``precomputed_rubric`` through to the verifier so rubrics never get regenerated. 3. Release the Universal Verifier (``MMRubricAgent``) as the official judge for WebTailBench. Multimodal, rubric-grounded, two-model ensemble (``gpt-5.2`` + ``o4-mini``) with per-criterion scoring, outcome verification, ambiguity / invalid-task classification, and first-point-of-failure analysis. ``webeval/scripts/verify_trajectories.py`` is a stand-alone parallel runner that re-scores any directory of webeval-shaped trajectories without touching the solver. Documentation: repo-root README ``Updates`` section + Reproducibility CLI block; ``webeval/README.md`` documents the Trajectory / FinalAnswer schema, the ``<no_answer>`` semantics, and per-benchmark score-file shape. Tests: 18 passing, 1 skipped (opt-in HF download). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 17:15:46 -07:00
Corby Rosset	2c27710499	Add WebTailBench rubric comparison visualizer (#65)	2026-04-17 00:34:52 -04:00
corby	2ec67d7236	Add WebTailBench rubric comparison visualizer Standalone HTML page that shows, for each WebTailBench task, the rubric criteria produced by three different judge configurations side by side: 1. O4-Mini Rubric — historical baseline 2. GPT-5 (v1) — original GPT-5 judge 3. Universal Verifier Rubric (GPT-5.2) — current release Tasks are grouped by WebTailBench benchmark category, with incremental search across ids / summaries / criteria, and an "all three rubrics only" toggle. The header links directly to the microsoft/WebTailBench dataset on Hugging Face and to the WebTailBench-v1-rubrics.tsv download so readers can grab the underlying data. Source layout: docs/webtailbench_rubric_comparison.html (single self-contained file) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-16 21:12:47 -07:00
Spencer Whitehead	0aa8a48430	update README (#52)	2026-01-14 15:09:53 -05:00
Andrew Zhao	e648c0e844	update README	2026-01-14 15:03:24 -05:00
Hussein Mozannar	6261cdb6fc	Add citation for Fara-7B in README (#47)	2025-12-15 11:33:48 -05:00
Spencer Whitehead	76f38253e9	Add citation for Fara-7B in README Updated citation information for Fara-7B and added a link to the associated paper.	2025-12-15 11:17:43 -05:00
Hussein Mozannar	3f3f1fd82c	docs: fix typos and formatting in README (#44)	2025-12-10 12:08:47 -05:00
Mehul Jariwala	9d7efb0e11	Fix typos and enhance clarity in README.md Corrected typos and improved clarity in README.	2025-12-10 10:52:24 +05:30
Hussein Mozannar	6cbb4f3694	Windows Support (#40)	2025-12-03 17:28:43 -05:00
Hussein Mozannar	82637b54df	fixes	2025-12-02 14:01:05 -08:00
Corby Rosset	879dc9561a	online mind2web works (#23)	2025-12-02 00:16:35 -05:00
corby	a3e2750c6a	Merge branch 'main' into corby/om2w	2025-12-01 21:13:43 -08:00
Hussein Mozannar	5ebe2590f8	Update README with Magentic-UI instructions (#37)	2025-11-28 20:49:34 -05:00
Hussein Mozannar	5745ea715f	Enhance README with video demos for Magentic-UI Added video demos reference for Magentic-UI integration.	2025-11-28 20:49:18 -05:00
Hussein Mozannar	2b81016788	Update README with Magentic-UI instructions Added instructions for using Fara-7B with Magentic-UI and included a note about WSL2 for Windows users.	2025-11-28 20:48:25 -05:00
Hussein Mozannar	f10b9319b3	Refine Fara-7B description in README (#31)	2025-11-28 14:30:48 -05:00
Hussein Mozannar	ccdc3def6e	Refine Fara-7B description in README Removed redundancy in the description of Fara-7B's visual operation.	2025-11-28 14:30:31 -05:00
Hussein Mozannar	21469308d6	Update README with Windows and memory usage notes (#30)	2025-11-28 13:35:43 -05:00
Hussein Mozannar	b555f0874b	Update README with Windows and memory usage notes Added notes for Windows users and memory command.	2025-11-28 13:34:46 -05:00
corby	01f8a17ffd	forgot om2w score file type	2025-11-27 11:37:43 -08:00
Wassim Chegham	ca1d3868f4	fix(fara-cli): refactor logging for action execution in FaraAgent (#25)	2025-11-26 10:39:45 -05:00
corby	1d6d1d64a6	online mind2web works	2025-11-25 22:47:53 -08:00
Corby Rosset	222c2a38e7	Multi round Fara fix (#16)	2025-11-25 23:49:56 -05:00
corby	f71083c9bd	other fixes to webeval	2025-11-25 20:47:34 -08:00
corby	3fa15aa99a	split to two devices	2025-11-25 20:04:24 -08:00
Hussein Mozannar	2816cb10a3	update cli	2025-11-25 15:40:56 -08:00
Hussein Mozannar	c2472e5a37	nicer experience	2025-11-25 15:22:42 -08:00
Hussein Mozannar	d330c93b22	use logging	2025-11-25 14:59:49 -08:00
Hussein Mozannar	9238221c15	small fix	2025-11-25 14:13:35 -08:00
Hussein Mozannar	d18cd7b675	Merge branch 'main' into multi-round	2025-11-25 16:45:27 -05:00
Alexey Taymanov	4b093dd243	self-hosted scenario fix (#13) * import fix * vllm requirements * vllm config * readme update	2025-11-25 13:44:02 -08:00
Hussein Mozannar	1d0e145150	revert back az vllm	2025-11-25 13:39:54 -08:00
Hussein Mozannar	e6a0174662	fix multi-turn	2025-11-25 13:36:24 -08:00
Hussein Mozannar	38c58ca681	cleanup 1	2025-11-25 13:03:48 -08:00
Alexey Taymanov	2606c8e1c4	self-hosted scenario fix (#13) * import fix * vllm requirements * vllm config * readme update	2025-11-24 20:12:38 -08:00
ataymano@microsoft.com	ff0dbac1d1	Initial commit	2025-11-24 10:32:01 -05:00

49 Commits