
Conversation

Contributor

@yuki-97 yuki-97 commented Dec 17, 2025

Related issue: #1050

  1. Split train and val in the built-in datasets, so that we can unblock multiple-dataset support.
  2. Unify the built-in datasets under nemo_rl/data/datasets/response_datasets/ into a similar format.
  3. Remove duplicated dataset names: clevr_cogent and openmathinstruct2.

New Param
Add a new param split_validation_size to handle the case where one dataset is used for both training and validation (e.g., OpenMathInstruct-2 in examples/configs/grpo_math_1B.yaml).

  1. If data.train.split_validation_size > 0 and data.validation is None, part of the training dataset will be used as the validation dataset.
  2. If data.train.split_validation_size > 0 and data.validation is not None, both the held-out part of the training dataset and the provided validation dataset will be used as the validation dataset.
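The two cases above can be sketched with a small, hypothetical helper (plain Python, not NeMo RL's actual implementation; the function name and signature are illustrative):

```python
import random

def split_train_val(train_data, split_validation_size=0.0, validation_data=None, seed=42):
    """Hypothetical sketch of the described semantics: carve a validation slice
    out of the training set when split_validation_size > 0, and append any
    explicitly provided validation dataset on top of it."""
    train = list(train_data)
    val = []
    if split_validation_size > 0:
        rng = random.Random(seed)          # fixed seed keeps the split reproducible
        rng.shuffle(train)
        n_val = int(len(train) * split_validation_size)
        train, val = train[n_val:], train[:n_val]
    if validation_data is not None:
        val = val + list(validation_data)  # case 2: combine both validation sources
    return train, val

# 100 training examples with 5% held out -> 95 train / 5 val
train, val = split_train_val(range(100), split_validation_size=0.05)
```

With a separate validation dataset also provided (case 2), the held-out slice and the provided data are concatenated, so 100 train examples plus 10 validation examples yield 95 train / 15 val.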

Usage

train:
  dataset_name: ResponseDataset
  data_path: <PathToTrainingDataset>  # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
  input_key: <QuestionKey>  # default: "input"
  output_key: <AnswerKey>  # default: "output"
  split: <TrainSplit>  # default: None; used for HuggingFace datasets
  split_validation_size: 0.05  # use 5% of the training data as validation data
validation:
  dataset_name: ResponseDataset
  data_path: <PathToValidationDataset>
  input_key: <QuestionKey>  # default: "input"
  output_key: <AnswerKey>  # default: "output"
  split: <ValidationSplit>  # default: None; used for HuggingFace datasets

Test Result

| algo | result |
| --- | --- |
| sft | (image) |
| sft-vlm | (image) |
| grpo | (image) |
| grpo-vlm | (image) |
| distillation | (image) |

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for separate training and validation dataset configuration with new train and validation blocks in data settings
    • Introduced new datasets: AIME2024, DAPOMath variants with automatic validation split capability
    • Enhanced dataset framework with improved flexibility for processor selection and environment configuration
  • Documentation

    • Updated guides with new data configuration structure and examples for train/validation dataset setup
    • Clarified supported dataset listings and configuration format for multi-dataset training scenarios
  • Bug Fixes & Improvements

    • Improved dataset loading workflow with better support for shared datasets and per-task processing
    • Streamlined configuration migration from flat to nested dataset structure across all example configs


@yuki-97 yuki-97 added the CI:L0 Run doctests and unit tests label Dec 17, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f8dcf7c to 2f78c84 Compare December 18, 2025 05:05
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2f78c84 to fd448be Compare December 18, 2025 05:23
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2aa7ce0 to 6a093d1 Compare December 18, 2025 07:08
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
Contributor

@terrykong terrykong left a comment


some initial thoughts

since it's a big PR, @ashors1 could you help with a second review?

output_key: generated_solution
split: train_1M
seed: 42
split_validation_size: 0.05
Contributor

i kind of feel we shouldn't split on the fly; it makes reproducing results potentially problematic. i think it's better for each dataset to be static at the time of running

Contributor Author

I think it's reproducible since it uses the seed in train_test_split. actually we also used this before, we just didn't expose the split_validation_size param.

split_ds = original_ds.train_test_split(test_size=test_size, seed=seed)
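As a sanity check on the reproducibility point, a toy stand-in for a seeded split shows that the same seed always yields the same partition (illustrative only; `seeded_split` is not part of the codebase):

```python
import random

def seeded_split(n_examples, test_size, seed):
    """Toy stand-in for datasets.train_test_split: a fixed seed fully
    determines which indices land in the train vs. test partition."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    n_test = int(n_examples * test_size)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

a = seeded_split(1000, 0.05, seed=42)
b = seeded_split(1000, 0.05, seed=42)
assert a == b  # identical seeds reproduce the exact same train/val split
```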

Contributor Author

btw for the seed, which one do you think is better?

  1. remove seed from the data config and pass it through load_response_dataset using config["grpo"]["seed"].
  2. keep seed in the data config and inherit it from ${grpo.seed}.

Comment on lines +47 to +49
prompt_file: NotRequired[str | None]
system_prompt_file: NotRequired[str | None]
Contributor

i see now that there are two, should we just remove this outer one to avoid surprising precedence issues if someone forgets to set one?

Contributor Author

let's discuss it here #1649 (comment).

and even if we have a default like I said in that conversation, we still need to keep it for now: this PR only refactors the response dataset, and the preference dataset will still need to use it.

assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)
# setup train dataset
update_single_dataset_config(data_config["train"], data_config)
Contributor

wdyt about just expecting users to populate the train config? then we don't have dup keys

Contributor Author

I think we should have a default value, especially when we support multiple datasets in the next PR; otherwise people need to write the same things for every dataset, and the data config gets a bit redundant.

and I'm wondering whether it's better to provide a default block alongside train and validation; it seems more direct than just putting those keys outside. wdyt?

# now
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
    env_name: math

# add `default`
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    default:
        # will use below vars as default values if dataset doesn't specify it
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
        prompt_file: null
        system_prompt_file: null
        env_name: math
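The proposed `default:` block could be resolved with a plain dict merge, where per-dataset keys override the shared defaults (a hypothetical sketch; `resolve_dataset_config` is not an existing function in the repo):

```python
def resolve_dataset_config(dataset_cfg: dict, defaults: dict) -> dict:
    """Merge a per-dataset config over the shared defaults: any key the
    dataset specifies wins, anything unspecified falls back to `defaults`."""
    return {**defaults, **dataset_cfg}

defaults = {
    "dataset_name": "BinaryPreferenceDataset",
    "prompt_key": "prompt",
    "chosen_key": "chosen",
    "rejected_key": "rejected",
    "env_name": "math",
}

# train_dataset_1 overrides prompt_key and inherits everything else
cfg = resolve_dataset_config(
    {"data_path": "/path/to/local/train_dataset_1.jsonl", "prompt_key": "question"},
    defaults,
)
```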

"tulu3_sft_mixture",
]:
base_dataset.set_processor()
base_dataset.set_processor()
Contributor

do you think we need to keep this? it kind of seems like we could do without it

Contributor Author

some datasets are associated with a processor (e.g., helpsteer3), so we need to keep it for now.
I don't think we need to keep it eventually; as designed, I'll make the processor associated with the algorithm instead of the dataset in a later PR.

Contributor Author

track it here #1658.

"""Loads response dataset."""
dataset_name = data_config["dataset_name"]

# TODO @yukih: remove duplicated dataset_name (openmathinstruct2, clevr_cogent)
Contributor

what was this comment referring to? not sure i follow from the changes you made

Contributor Author

for these two commits: 487d354, d9344a6.
previously both openmathinstruct2 and OpenMathInstruct-2 used OpenMathInstruct2Dataset, and both clevr_cogent and clevr-cogent used CLEVRCoGenTDataset.

self.task_name = "oasst"

# load from huggingface
filename = hf_hub_download(
Contributor

this looks a lot cleaner :)

@yuki-97 yuki-97 changed the title feat: split train val dataset and refactor for response dataset refactor: split train val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 changed the title refactor: split train val dataset in response dataset refactor: split train and val dataset in response dataset Dec 18, 2025
[
("clevr-cogent", format_clevr_cogent_dataset),
("geometry3k", format_geometry3k_dataset),
# ("refcoco", format_refcoco_dataset), # this needs to download 13.5 GB of images
Contributor Author

@terrykong shall we enable this?

@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 6b34af3 to fea258d Compare December 19, 2025 15:50
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Dec 19, 2025
yuki-97 and others added 29 commits January 8, 2026 07:45
Signed-off-by: Yuki Huang <[email protected]>
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2a4cedd to 20f3a62 Compare January 8, 2026 15:47

Labels

CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants