Evaluation#
This page describes how to evaluate WideSeek-R1 in RLinf.
The provided scripts support two evaluation settings:
WideSearch benchmark evaluation.
Standard QA evaluation.
The reference configs use Qwen3-series dense models.
Prerequisites#
Before evaluation, make sure the following components are ready:
The RLinf environment is installed. See Installation.
The judge model server is running. See WideSeek-R1.
The appropriate tool backend is configured. See Tool Setup.
Download the Model#
The released checkpoint is available at:
You may also evaluate your own Qwen3-series dense model.
After downloading the model, set the local path in the evaluation config:
rollout:
  model:
    model_type: qwen3
    model_path: /PATH/TO/MODEL
Evaluation Datasets#
WideSeek-R1 currently supports two dataset types for evaluation.
WideSearch Benchmark#
Use the formatted WideSearch evaluation set from Hugging Face:
Compared with the original raw benchmark, this version is converted into the format expected by RLinf and includes several data fixes.
Update examples/agent/wideseek_r1/config/eval_qwen3_widesearch.yaml as follows:
data:
  is_markdown: True
  val_data_paths: /PATH/TO/EVAL/WIDESEARCH/DATASET
  data_size: -1
Key fields:
is_markdown should remain True for the WideSearch dataset.
val_data_paths points to the evaluation dataset.
data_size: -1 evaluates the full dataset.
For a quick sanity check, start with a smaller data_size.
In the reference setup, full evaluation on 200 WideSearch examples took about 7 hours with 8 GPUs for generation and 8 GPUs for the judge model.
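As a rough planning aid, the reference figure above works out to about 126 seconds per example (7 hours over 200 examples). The sketch below scales that figure linearly to a smaller data_size. The linear-scaling assumption and the helper name estimate_hours are illustrative only; actual throughput depends on hardware, tool latency, and batching.

```python
# Reference figures taken from the text above: 200 examples in about 7 hours.
REFERENCE_EXAMPLES = 200
REFERENCE_HOURS = 7.0


def estimate_hours(data_size: int) -> float:
    """Estimate wall-clock hours for evaluating `data_size` examples.

    Assumes runtime scales linearly with the number of examples, which is
    only a rough approximation. A data_size of -1 means the full dataset.
    """
    if data_size == -1:
        data_size = REFERENCE_EXAMPLES
    if data_size < 0:
        raise ValueError("data_size must be -1 or a non-negative integer")
    return REFERENCE_HOURS * data_size / REFERENCE_EXAMPLES


if __name__ == "__main__":
    for n in (10, 50, -1):
        print(f"data_size={n}: about {estimate_hours(n):.2f} hours")
```

Treat the result as an order-of-magnitude estimate when choosing a data_size for a sanity check.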
Standard QA Evaluation#
For standard QA evaluation, use the dataset released by ASearcher:
This dataset includes both single-hop tasks, such as Natural Questions, and multi-hop tasks, such as HotpotQA.
Update examples/agent/wideseek_r1/config/eval_qwen3_qa.yaml as follows:
data:
  is_markdown: False
  val_data_paths: /PATH/TO/EVAL/QA/DATASET
  data_size: -1
Here is_markdown must be False.
Standard QA evaluation is much faster than WideSearch evaluation. As a quick sanity check, first evaluate a subset of the standard QA data.
Run Evaluation#
Before launching evaluation, verify all of the following:
rollout.model.model_path points to the model you want to evaluate.
data.val_data_paths points to the correct dataset.
agentloop.llm_ip is set correctly.
The required tools are configured. See Tool Setup.
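The config part of this checklist can be automated. The following is a minimal sketch, not part of RLinf: it assumes the evaluation config has already been parsed into a nested dictionary (how you load it is up to you), and the helper names lookup and missing_or_placeholder are hypothetical. It flags keys that are unset or still carry a /PATH/TO/ placeholder from the templates above. It cannot tell whether agentloop.llm_ip is actually reachable or whether the tools are configured.

```python
from typing import Any, List, Mapping

# Dotted config keys named in the checklist above.
REQUIRED_KEYS = (
    "rollout.model.model_path",
    "data.val_data_paths",
    "agentloop.llm_ip",
)


def lookup(cfg: Mapping[str, Any], dotted_key: str) -> Any:
    """Return the value at a dotted key, or None if any level is missing."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node


def missing_or_placeholder(cfg: Mapping[str, Any]) -> List[str]:
    """List required keys that are unset, empty, or still a template placeholder."""
    problems = []
    for key in REQUIRED_KEYS:
        value = lookup(cfg, key)
        if value is None or value == "":
            problems.append(key)
        elif isinstance(value, str) and value.startswith("/PATH/TO/"):
            problems.append(key)
    return problems


if __name__ == "__main__":
    # Hypothetical config: the dataset path and IP address are made-up values.
    example_cfg = {
        "rollout": {"model": {"model_path": "/PATH/TO/MODEL"}},
        "data": {"val_data_paths": "/data/eval/qa"},
        "agentloop": {"llm_ip": "127.0.0.1"},
    }
    print("Keys needing attention:", missing_or_placeholder(example_cfg))
```

An empty list means the three keys are filled in; it does not guarantee that the paths exist.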
Then run one of the following commands:
bash examples/agent/wideseek_r1/run_eval.sh eval_qwen3_widesearch
bash examples/agent/wideseek_r1/run_eval.sh eval_qwen3_qa
Output Files#
Evaluation outputs are written to:
${runner.output_dir}/${runner.experiment_name}
Important files include:
metric.json: aggregate metrics such as output length and tool usage.
allresult.json: full multi-turn interaction logs.
responses/: final model answers for each example.
For standard QA evaluation, metric.json also includes the final LLM-judge
results.
For WideSearch evaluation, RLinf stores the generated responses so they can be scored with the official WideSearch evaluation pipeline.
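To inspect the aggregate metrics after a run, a small helper like the one below may be convenient. It is a sketch under two assumptions: that metric.json contains a single top-level JSON object, and that you pass the resolved values of runner.output_dir and runner.experiment_name yourself. The function names are hypothetical, and no specific metric keys are assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_metrics(output_dir: str, experiment_name: str) -> Dict[str, Any]:
    """Load metric.json from <output_dir>/<experiment_name>."""
    metric_path = Path(output_dir) / experiment_name / "metric.json"
    with metric_path.open("r", encoding="utf-8") as f:
        return json.load(f)


def format_metrics(metrics: Dict[str, Any]) -> List[str]:
    """Return one 'name: value' line per metric, sorted by name."""
    return [f"{name}: {value}" for name, value in sorted(metrics.items())]
```

For example, printing "\n".join(format_metrics(load_metrics(output_dir, experiment_name))) lists whatever metrics the run produced.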
Additional WideSearch Scoring#
For final WideSearch benchmark scoring, use the dedicated evaluation repository:
Refer to the repository README for the complete procedure.
Two-Engine Evaluation#
WideSeek-R1 also supports evaluation with two separate model instances in the multi-agent setting, so the planner and worker roles can use different models.
Use examples/agent/wideseek_r1/config/eval_qwen3_qa_2eng.yaml.
The relevant fields are:
agentloop:
  fixed_role: worker # planner or worker
rollout:
  use_fixed_worker: True
use_fixed_worker enables the second model instance. fixed_role selects
which role uses that second model.
You can then set different model paths under rollout.model.model_path and
rollout_fixed_worker.model.model_path.
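The role-to-model mapping described above can be illustrated with a short sketch. This is not RLinf's actual resolution logic. It restates the description in code, assuming the config is available as a nested dictionary and that use_fixed_worker sits under rollout; the function name model_path_per_role is hypothetical.

```python
from typing import Any, Dict, Mapping

ROLES = ("planner", "worker")


def model_path_per_role(cfg: Mapping[str, Any]) -> Dict[str, str]:
    """Map each role to the model path it would use, per the description above."""
    main_path = cfg["rollout"]["model"]["model_path"]
    # By default, both roles share the main rollout model.
    paths = {role: main_path for role in ROLES}

    # Assumption: use_fixed_worker is a key under `rollout`.
    if cfg["rollout"].get("use_fixed_worker", False):
        fixed_role = cfg["agentloop"]["fixed_role"]
        if fixed_role not in ROLES:
            raise ValueError(f"fixed_role must be one of {ROLES}, got {fixed_role!r}")
        # The selected role switches to the second model instance.
        paths[fixed_role] = cfg["rollout_fixed_worker"]["model"]["model_path"]
    return paths


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example_cfg = {
        "agentloop": {"fixed_role": "worker"},
        "rollout": {"use_fixed_worker": True, "model": {"model_path": "/models/planner"}},
        "rollout_fixed_worker": {"model": {"model_path": "/models/worker"}},
    }
    print(model_path_per_role(example_cfg))
```

With fixed_role set to worker, the planner keeps rollout.model.model_path and the worker uses rollout_fixed_worker.model.model_path; setting fixed_role to planner swaps the assignment.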
Notes#
As in training, agentloop.workflow controls whether evaluation uses
single-agent or multi-agent execution:
mas: multi-agent evaluation.
sa: single-agent evaluation.
The single-agent mode is designed to be comparable to ASearcher.