For installation + the full reproducibility CLI, see the repo-root README.md. Activate the fara_webeval conda env before running anything below.

This document focuses on a single internal contract that every benchmark, system, and verifier reads or writes: the trajectory — i.e. the on-disk record of one agent run on one task.

Trajectory format

Every executed task lives in its own directory under <out_url>/runs/<system.hash>/<model_prefix>/<user>/<benchmark.exec_hash>/<run_id>/traj/<task_id>/, e.g.:

/path/to/Fara/eval/runs/WebSurfer-fara-100-max_n_images-3/
  fara-7b/<user>/WebTailBench_hf/<run_id>/traj/<task_id>/
    web_surfer.log                 # newline-delimited JSON action+observation log
    <task_id>_final_answer.json    # FinalAnswer dump (final_answer, screenshots, token_usage)
    screenshot_1.png ... screenshot_N.png
    times.json                     # {start_time, end_time, duration} of the solver run
    core.log                       # webeval.core per-task log (Execution / Evaluation lifecycle)
    task_data.json                 # only present after the rubric verifier runs (see below)
    scores/
      <benchmark.eval_hash>.json   # {score: float, gpt_response_text: str|json}

A concrete example is checked in at webeval/data/example_trajectory/ — a single WebTailBench AllTrails task (alltrails_find_23) with its web_surfer.log, alltrails_find_23_final_answer.json, screenshot_{1..4}.png frames, times.json, core.log, task_data.json (written by the rubric verifier on re-eval), and the scores/mmrubric_0.8-5-3.json rubric verdict (score=1). Point Trajectory.from_folder(...) at that dir to see every field below populated end-to-end.

`web_surfer.log` (one JSON object per line)

Written by the LogHandler in webeval/utils.py, attached to the WebSurferLogger inside WebSurferSystem.get_answer(). The handler runs at DEBUG level so the FaraAgent.{generate_model_call, execute_action} methods (src/fara/fara_agent.py) — which emit at logger.debug(...) — actually flow through. Every line carries a top-level type field (LogHandler.emit in webeval/utils.py:124). In the current FaraAgent pipeline only two of them are actually produced:

`type`	Source	Key fields
`WebSurferEvent`	`FaraAgent.execute_action` (`src/fara/fara_agent.py:467`)	`timestamp`, `source`, `message`, `url`, `action`, `arguments`. The structured action event — one per tool call, with `action=<tool>` and `arguments=<dict>`.
`OtherEvent`	`FaraAgent`'s raw `logger.debug(...)` strings (`fara_agent.py:403,415`)	`timestamp`, `type`, `message`. The `Thought #N: ...\nAction #N: executing tool '<tool>' with arguments {...}` and `Observation#N: ...` prints land here, because they're plain strings, not dataclasses, and `LogHandler.emit` falls through to its catch-all branch.

Three more event dataclasses (OrchestrationEvent, AgentEvent, LLMCallEvent) are declared in webeval/systems/messages.py / webeval/utils.py and handled by LogHandler.emit, but have no construction callsites in the current repo — they're reserved for other solver systems / legacy trajectories and won't appear in a fresh FaraAgent run.

Trajectory.__init__ reads every line and either:

Uses each event's top-level action field directly (the WebSurferEvent shape), OR
If no event has an action field (the all-OtherEvent case), runs parse_text_based_event to extract action + arguments (with thoughts) from the Thought #N: ... Action #N: executing tool … text. This is how trajectories from older WebSurferSystem runs are recovered.

After parsing, Trajectory.actions is a list of JSON-encoded action arg dicts (with thoughts stripped) and Trajectory.thoughts is the parallel list of thought strings.

Gotcha: if web_surfer.log is empty, Trajectory.events == [] and Trajectory.actions == []. Downstream evaluators (e.g. WebTailBenchBenchmark.evaluator) treat that as verifier_reasoning="No actions found in the trajectory" and return score 0 without invoking MMRubricAgent. The fix is in WebSurferSystem.get_answer(), which now bumps the logger to DEBUG before the FaraAgent runs.

`<task_id>_final_answer.json` (`FinalAnswer` dataclass)

Written by WebSurferSystem.get_answer() once the agent terminates or hits max_rounds. Field reference (webeval/trajectory.py:FinalAnswer):

Field	Type	Notes
`final_answer`	`str`	The agent's final response text. Defaults to `"<no_answer>"` — see the next section for what that means.
`env_state_json`	`str` (JSON)	Parsed JSON dump of the final webpage `<pre>` element at `/finish` (synthetic envs only).
`env_state_raw`	`str`	Raw text from that `<pre>` element (debugging / fallback if `env_state_json` parse failed).
`screenshots`	`List[str]`	Screenshot filenames captured during the run, relative to the trajectory dir if `is_rel_paths` is true (the default), else absolute.
`is_aborted`	`bool`	True if the trajectory was killed mid-execution by an environment error. Aborted trajectories are typically retried up to `--max_error_task_retries` (see `webeval/core.py`).
`is_rel_paths`	`bool`	Whether `screenshots` paths are relative to `trajectory.path`. New runs use `True`.
`token_usage`	`Dict[str, RequestUsage]`	Per-component LLM token usage. The example uses `{"FaraAgent": {"prompt_tokens": 23383, "completion_tokens": 1920}}` — the key is whichever component name `WebSurferSystem.get_answer()` attributes the call to. Aggregated via `RequestUsage.__add__`.

Persisted via FinalAnswer.save(path); loaded via FinalAnswer.load(path). Trajectory.__init__ calls FinalAnswer.load(...) on the single *_final_answer.json it finds in the dir (raises ValueError if 0 or >1 are present), reattaches the loaded FinalAnswer as trajectory.answer, and exposes the screenshot list directly as trajectory.screenshots. trajectory.is_aborted is just a pass-through to trajectory.answer.is_aborted.

`final_answer == "<no_answer>"` is the load-bearing flag

FinalAnswer is constructed with final_answer = "<no_answer>" as the default, and WebSurferSystem.get_answer() only overwrites it with the agent's actual answer string when the agent calls terminate(...) within --max_rounds steps. Three trajectory outcomes therefore look different on disk:

Outcome	`is_aborted`	`final_answer`	Notes
Agent called `terminate(...)` within budget	`False`	the model's response	Only this case has a real string to score.
Agent never called `terminate`, hit `--max_rounds`	`False`	`"<no_answer>"`	Trajectory ran to completion-by-cutoff; downstream is free to score the final state but the model itself produced no answer.
Run errored mid-execution (browser crash, timeout, …)	`True`	`"<no_answer>"`	Will be retried up to `--max_error_task_retries` (`webeval/core.py:run_eval_single_example`); only the final attempt's `_final_answer.json` is kept.

Filtering "actually completed" runs: anywhere you want trajectories that finished within step budget AND produced a real model answer, the canonical check is

candidate.answer.final_answer != "<no_answer>" and not candidate.is_aborted

This is what webeval/post_eval_analysis.py uses when computing "average score across non-aborted, non-step-budget-exceeded trajectories", and it's also why WebTailBenchBenchmark.evaluator short-circuits with verifier_reasoning='The last action must be "terminate"' for any trajectory whose last logged action is not terminate — those are guaranteed to have final_answer == "<no_answer>" and therefore nothing the rubric judge can grade.

Auto-0 vs excluded: two superficially similar cases are scored very differently.

Over-budget (<no_answer>, is_aborted=False): WebTailBench short-circuits to score=0 without calling the rubric judge, writes it to scores/<eval_hash>.json, and counts it as 0 in the mean.

Aborted / errored mid-run (is_aborted=True, loader fails, or evaluator raises): returns score=None, no score file is written, and every aggregator filters on score is not None — so aborted trajectories are excluded from all final metrics rather than counted as 0.

screenshot_{i}.png files keep accumulating regardless of how the trajectory ended, so screenshot count alone is not a proxy for "in budget" — always combine it with the final_answer check above. Note: trajectory.latest_screenshot is a static convenience path (trajectory.path / "screenshot_scaled.png", see webeval/trajectory.py:131) that may not exist on disk for every run — the example directory, for instance, has only screenshot_1..4.png.

`times.json`

{"start_time": <unix-float>, "end_time": <unix-float>, "duration": <float-seconds>}, written around the solver call in webeval/core.py:run_single_task (webeval/core.py:86-93). Both stamps are time.time() values (not ISO-8601); e.g. the example is {"start_time": 1762809276.0520103, "end_time": 1762809454.79363, "duration": 178.74161958694458}. evaluate_single_example reads duration to populate EvalResult.duration.

`scores/<benchmark.eval_hash>.json`

Written by webeval/core.py:evaluate_single_example after a successful Benchmark.evaluate_example() call. Single writer, single shape across all benchmarks (top-level {score, gpt_response_text}); the benchmark-specific verdict lives inside gpt_response_text.

Field	Type	Notes
`score`	`float` or `int`	Top-line success metric (0/1 for outcome-style judges; fraction for rubric-style judges).
`gpt_response_text`	`str` (often a JSON-encoded dict)	Free-form judge output. See per-benchmark table below.

Where each benchmark's verdict lives

The score file is the single source of truth for an evaluation result. final_answer.json is never overwritten by the verifier. task_data.json is a separate dump the WebTailBench rubric verifier writes alongside the score file (benchmarks/webtailbench.py:346 → [Rubric] Saved rubric results to …/task_data.json), so a fresh solver-only run won't have one, but any trajectory that's been re-scored — including the checked-in example — will. Re-running the verifier (--redo_eval or scripts/verify_trajectories.py) rewrites the score file in place and, for webtailbench, refreshes task_data.json with the latest rubric payload.

Benchmark	Score file path under `traj/<task_id>/`	`gpt_response_text` payload
`webvoyager`	`scores/gpt_eval.json`	gpt-4o judge text — verdict is `"SUCCESS"` / `"NOT SUCCESS"` parsed back to `score ∈ {0.0, 1.0}` (`benchmarks/webvoyager/webvoyager.py`).
`om2w`	`scores/<eval_method>-<threshold>.json` (e.g. `WebJudge_Online_Mind2Web_eval-3.json`)	o4-mini judge text — verdict via `extract_predication(...)` from the OnlineMind2Web judge methods (`benchmarks/om2w/om2w.py`).
`webtailbench`	`scores/mmrubric_<rubric_threshold>-<max_imgs>-<keypt>.json`	JSON-encoded dict assembled in `benchmarks/webtailbench/webtailbench.py:355-442`: `is_success` (int 0/1), `success_criterion` (`outcome` / `process` / `both`), `final_answer`, `outcome_success` (bool), `outcome_reasoning`, `outcome_primary_intent`, `rubric_is_success` (int), the `rubric_*` family forwarded from MMRubricAgent (`rubric_items`, `rubric_total_earned_points`, `rubric_total_max_points`, `rubric_all_scores_list`, `rubric_all_rubric_dicts`, `rubric_intermediate_mm_rubric_steps`, `rubric_outcome_verification`, `rubric_majority_vote_metadata`), and the error-taxonomy fields nested inside the rubric payload (`first_point_of_failure`, `is_ambiguous` / `ambiguity_codes`, `is_invalid` / `invalid_task_codes` — `mm_rubric_agent.py:392-406`). Top-level `score` equals `is_success`, which is driven by `--success`: `outcome` → `int(outcome_success)`, `process` → `rubric_is_success`, `both` → both.
`verify_trajectories.py` (stand-alone)	same path + shape as `webtailbench`	same Universal Verifier output.

`core.log`

Per-task log lines from the webeval.core.<task_id> logger. Useful when a task hangs or errors and the rest of the pipeline keeps moving. A run's core.log typically mixes:

[Execution <task_id>] Start / Trajectory is not complete. Reexecuting... / Completed / Error
BrowserBase plumbing: Initializing BrowserBase session..., Connected to BrowserBase session: <url>, Closing browser...
Every WebSurferEvent event re-emitted as its dataclass repr(...) (so the same step shows up both here and in web_surfer.log, just unparsed)
[Evaluation <task_id>] Start / Loading from: <path> / Completed: score=<n>, duration=<seconds> / Error
[Rubric] Saved rubric results to <path>/task_data.json when the WebTailBench rubric verifier runs

The example's core.log shows the full sequence for one solve on 2025-11-10 plus three separate rubric re-eval passes on 2026-03-26, all against the same trajectory directory.

`task_data.json` (optional, rubric-verifier output)

Only written by the WebTailBench rubric verifier (benchmarks/webtailbench/webtailbench.py:346), so trajectories that have never been re-scored won't have one. When present, it's a large JSON dump of the task's rubric/outcome state — everything MMRubricAgent produced plus the task metadata it loaded. In the example, task_data.json has 76 keys, including task_summary, task_type, rubric_score, verifier_agent_response, verifier_rubric, start_timestamp, end_timestamp, token_usage, plan_list, facts_list, observations_list, actions, actions_summary, and the boolean error-taxonomy flags (did_web_surfing_fail, web_surfing_stalled, encountered_captcha, …). Treat this file as diagnostic detail; the authoritative score still lives in scores/<eval_hash>.json.

`Trajectory` consumers

from webeval.trajectory import Trajectory
traj = Trajectory.from_folder("/path/to/<task_id>/")
# Returns None if the dir is missing the answer file or the log can't be parsed.
traj.events                # list[dict] of parsed action/observation events
traj.actions               # list[str]  of JSON-encoded action-arg dicts
traj.thoughts              # list[str]  parallel to .actions
traj.screenshots           # list[Path] absolute paths to screenshot files
traj.answer                # FinalAnswer
traj.is_aborted            # bool (delegates to .answer.is_aborted)
traj.latest_screenshot     # convenience: trajectory.path / "screenshot_scaled.png"

Three places consume it today:

Benchmark verifiers (webeval/benchmarks/*/<bench>.py): each benchmark's evaluator(task_data, candidate) receives a Trajectory and returns (score, gpt_response_text). WebTailBench wraps MMRubricAgent (the Universal Verifier) for this; webvoyager / om2w call their own GPT judges.
webeval/scripts/verify_trajectories.py: stand-alone parallel runner that points MMRubricAgent at any directory of Trajectory-shaped folders and writes per-task scores/mmrubric_<…>.json. Useful for re-scoring an existing eval run with new judge parameters.
webeval/post_eval_analysis.py: aggregates _final_answer.json + scores/<…>.json + counts of WebSurferEvent lines in web_surfer.log for run-level reporting.

Where each file is written

File	Writer
`web_surfer.log`	`LogHandler` attached in `WebSurferSystem.get_answer()` (capturing `FaraAgent` log calls)
`<task_id>_final_answer.json`	`WebSurferSystem.get_answer()` → `FinalAnswer.save()`
`screenshot{i}.png`	`FaraAgent.execute_action` (browser screenshots; one per action by default)
`times.json`	`webeval/core.py:run_single_task`
`core.log`	`_logger = logger.getChild(task_id)` in `webeval/core.py:run_eval_single_example`
`scores/<benchmark.eval_hash>.json`	`webeval/core.py:evaluate_single_example`
`scores/mmrubric_<…>.json`	`webeval/scripts/verify_trajectories.py`

README.md

webeval

Trajectory format

web_surfer.log (one JSON object per line)

<task_id>_final_answer.json (FinalAnswer dataclass)

final_answer == "<no_answer>" is the load-bearing flag

times.json

scores/<benchmark.eval_hash>.json