Evaluation#

This page describes how to evaluate WideSeek-R1 in RLinf.

The provided scripts support two evaluation settings:

  • WideSearch benchmark evaluation.

  • Standard QA evaluation.

The reference configs use Qwen3-series dense models.

Prerequisites#

Before evaluation, make sure the following components are ready:

  • The RLinf environment is installed. See Installation.

  • The judge model server is running. See WideSeek-R1.

  • The appropriate tool backend is configured. See Tool Setup.

Download the Model#

The released checkpoint is available at:

You may also evaluate your own Qwen3-series dense model.

After downloading the model, set the local path in the evaluation config:

rollout:
  model:
    model_type: qwen3
    model_path: /PATH/TO/MODEL

Evaluation Datasets#

WideSeek-R1 currently supports two dataset types for evaluation.

WideSearch Benchmark#

Use the formatted WideSearch evaluation set from Hugging Face:

Compared with the original raw benchmark, this version is converted into the format expected by RLinf and includes several data fixes.

Update examples/agent/wideseek_r1/config/eval_qwen3_widesearch.yaml as follows:

data:
  is_markdown: True
  val_data_paths: /PATH/TO/EVAL/WIDESEARCH/DATASET
  data_size: -1

Key fields:

  • is_markdown should remain True for the WideSearch dataset.

  • val_data_paths points to the evaluation dataset.

  • data_size: -1 means to evaluate on the full dataset.

For a quick sanity check, start with a smaller data_size.

In the reference setup, full evaluation on 200 WideSearch examples took about 7 hours with 8 GPUs for generation and 8 GPUs for the judge model.

Standard QA Evaluation#

For standard QA evaluation, use the dataset released by ASearcher:

This dataset includes both single-hop tasks, such as Natural Questions, and multi-hop tasks, such as HotpotQA.

Update examples/agent/wideseek_r1/config/eval_qwen3_qa.yaml as follows:

data:
  is_markdown: False
  val_data_paths: /PATH/TO/EVAL/QA/DATASET
  data_size: -1

Here is_markdown must be False.

Compared to the WideSearch evaluation, the standard QA evaluation is much faster. It is recommended to first evaluate a subset of the standard QA data as a quick sanity check.

Run Evaluation#

Before launching evaluation, verify all of the following:

  • rollout.model.model_path points to the model you want to evaluate.

  • data.val_data_paths points to the correct dataset.

  • agentloop.llm_ip is set correctly.

  • The required tools are configured. See Tool Setup.

Then run one of the following commands:

bash examples/agent/wideseek_r1/run_eval.sh eval_qwen3_widesearch
bash examples/agent/wideseek_r1/run_eval.sh eval_qwen3_qa

Output Files#

Evaluation outputs are written to:

${runner.output_dir}/${runner.experiment_name}

Important files include:

  • metric.json: aggregate metrics such as output length and tool usage.

  • allresult.json: full multi-turn interaction logs.

  • responses/: final model answers for each example.

For standard QA evaluation, metric.json also includes the final LLM-judge results.

For WideSearch evaluation, RLinf stores the generated responses so they can be scored with the official WideSearch evaluation pipeline.

Additional WideSearch Scoring#

For final WideSearch benchmark scoring, use the dedicated evaluation repository:

Refer to the repository README for the complete procedure.

Two-Engine Evaluation#

WideSeek-R1 also supports evaluation with two separate model instances in the multi-agent setting, so the planner and worker roles can use different models.

Use examples/agent/wideseek_r1/config/eval_qwen3_qa_2eng.yaml. The relevant fields are:

agentloop:
  fixed_role: worker  # planner or worker

rollout:
  use_fixed_worker: True

use_fixed_worker enables the second model instance. fixed_role selects which role uses that second model.

You can then set different model paths under rollout.model.model_path and rollout_fixed_worker.model.model_path.

Notes#

As in training, agentloop.workflow controls whether evaluation uses single-agent or multi-agent execution:

  • mas: multi-agent evaluation.

  • sa: single-agent evaluation.

The single-agent mode is designed to be comparable to ASearcher.