
Conversation

Contributor

@yuki-97 yuki-97 commented Dec 17, 2025

Related issue: #1050

  1. Split train and val in the built-in datasets, so that we can unblock multiple-dataset support.
  2. Unify the built-in datasets under nemo_rl/data/datasets/response_datasets/ into a similar format.
  3. Remove duplicated dataset names: clevr_cogent and openmathinstruct2.

New Param
Add a new param split_validation_size to handle the case where one dataset is used for both training and validation (e.g., OpenMathInstruct-2 in examples/configs/grpo_math_1B.yaml).

  1. If data.train.split_validation_size > 0 and data.validation is None, part of the training dataset will be used as the validation dataset.
  2. If data.train.split_validation_size > 0 and data.validation is not None, both the held-out part of the training dataset and the provided validation dataset will be used as the validation dataset.
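The two cases above can be sketched with a small, hypothetical helper (plain Python, not NeMo RL's actual implementation; the function name and signature are illustrative):

```python
import random

def split_train_val(train_data, split_validation_size=0.0, validation_data=None, seed=42):
    """Hypothetical sketch of the described semantics: carve a validation slice
    out of the training set when split_validation_size > 0, and append any
    explicitly provided validation dataset on top of it."""
    train = list(train_data)
    val = []
    if split_validation_size > 0:
        rng = random.Random(seed)          # fixed seed keeps the split reproducible
        rng.shuffle(train)
        n_val = int(len(train) * split_validation_size)
        train, val = train[n_val:], train[:n_val]
    if validation_data is not None:
        val = val + list(validation_data)  # case 2: combine both validation sources
    return train, val

# 100 training examples with 5% held out -> 95 train / 5 val
train, val = split_train_val(range(100), split_validation_size=0.05)
```

With a separate validation dataset also provided (case 2), the held-out slice and the provided data are concatenated, so 100 train examples plus 10 validation examples yield 95 train / 15 val.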

Usage

train:
  dataset_name: ResponseDataset
  data_path: <PathToTrainingDataset>  # e.g., /path/to/local/dataset.jsonl or hf_org/hf_dataset_name (HuggingFace)
  input_key: <QuestionKey>  # default: "input"
  output_key: <AnswerKey>  # default: "output"
  split: <TrainSplit>  # default: None; used for HuggingFace datasets
  split_validation_size: 0.05  # use 5% of the training data as validation data
validation:
  dataset_name: ResponseDataset
  data_path: <PathToValidationDataset>
  input_key: <QuestionKey>  # default: "input"
  output_key: <AnswerKey>  # default: "output"
  split: <ValidationSplit>  # default: None; used for HuggingFace datasets

Test Result

| algo | result |
| --- | --- |
| sft | (image) |
| sft-vlm | (image) |
| grpo | (image) |
| grpo-vlm | (image) |
| distillation | (image) |

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for separate training and validation dataset configuration with new train and validation blocks in data settings
    • Introduced new datasets: AIME2024, DAPOMath variants with automatic validation split capability
    • Enhanced dataset framework with improved flexibility for processor selection and environment configuration
  • Documentation

    • Updated guides with new data configuration structure and examples for train/validation dataset setup
    • Clarified supported dataset listings and configuration format for multi-dataset training scenarios
  • Bug Fixes & Improvements

    • Improved dataset loading workflow with better support for shared datasets and per-task processing
    • Streamlined configuration migration from flat to nested dataset structure across all example configs


@yuki-97 yuki-97 added the CI:L0 Run doctests and unit tests label Dec 17, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f8dcf7c to 2f78c84 Compare December 18, 2025 05:05
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2f78c84 to fd448be Compare December 18, 2025 05:23
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2aa7ce0 to 6a093d1 Compare December 18, 2025 07:08
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
Contributor

@terrykong terrykong left a comment


some initial thoughts

since it's a big PR, @ashors1 could you help with a second review?

output_key: generated_solution
split: train_1M
seed: 42
split_validation_size: 0.05
Contributor

i kind of feel we shouldn't split on the fly; it makes reproducing results potentially problematic. i think it's better for each dataset to be static at the time of running

Contributor Author

I think it's reproducible since it uses the seed in train_test_split. actually we also used this before, we just didn't expose the split_validation_size param.

split_ds = original_ds.train_test_split(test_size=test_size, seed=seed)
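As a sanity check on the reproducibility point, a toy stand-in for a seeded split shows that the same seed always yields the same partition (illustrative only; `seeded_split` is not part of the codebase):

```python
import random

def seeded_split(n_examples, test_size, seed):
    """Toy stand-in for datasets.train_test_split: a fixed seed fully
    determines which indices land in the train vs. test partition."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    n_test = int(n_examples * test_size)
    return sorted(idx[n_test:]), sorted(idx[:n_test])

a = seeded_split(1000, 0.05, seed=42)
b = seeded_split(1000, 0.05, seed=42)
assert a == b  # identical seeds reproduce the exact same train/val split
```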

Contributor Author

btw for the seed, which one do you think is better?

  1. remove seed from the data config and pass it through load_response_dataset using config["grpo"]["seed"].
  2. keep seed in the data config and inherit it from ${grpo.seed}.

Comment on lines +47 to +49
prompt_file: NotRequired[str | None]
system_prompt_file: NotRequired[str | None]
Contributor

i see now that there are two, should we just remove this outer one to avoid surprising precedence issues if someone forgets to set one?

Contributor Author

let's discuss it here #1649 (comment).

and even if we have a default like I said in that conversation, we still need to keep it for now: this PR only refactors the response dataset, and the preference dataset will still need to use it.

assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)
# setup train dataset
update_single_dataset_config(data_config["train"], data_config)
Contributor

wdyt about just expecting users to populate the train config? then we don't have dup keys

Contributor Author

I think we should have a default value, especially when we support multiple datasets in the next PR; otherwise people need to write the same things for every dataset, and the data config gets a bit redundant.

and I'm wondering whether it's better to provide a default block alongside train and validation; it seems more direct than just putting those keys outside. wdyt?

# now
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
    env_name: math

# add `default`
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    default:
        # will use below vars as default values if dataset doesn't specify it
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
        prompt_file: null
        system_prompt_file: null
        env_name: math
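The proposed `default:` block could be resolved with a plain dict merge, where per-dataset keys override the shared defaults (a hypothetical sketch; `resolve_dataset_config` is not an existing function in the repo):

```python
def resolve_dataset_config(dataset_cfg: dict, defaults: dict) -> dict:
    """Merge a per-dataset config over the shared defaults: any key the
    dataset specifies wins, anything unspecified falls back to `defaults`."""
    return {**defaults, **dataset_cfg}

defaults = {
    "dataset_name": "BinaryPreferenceDataset",
    "prompt_key": "prompt",
    "chosen_key": "chosen",
    "rejected_key": "rejected",
    "env_name": "math",
}

# train_dataset_1 overrides prompt_key and inherits everything else
cfg = resolve_dataset_config(
    {"data_path": "/path/to/local/train_dataset_1.jsonl", "prompt_key": "question"},
    defaults,
)
```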

"tulu3_sft_mixture",
]:
base_dataset.set_processor()
base_dataset.set_processor()
Contributor

do you think we need to keep this? it kind of seems like we could do without it

Contributor Author

some datasets are associated with a processor (e.g., helpsteer3), so we need to keep it for now.
I don't think we need to keep it eventually; as designed, I'll make the processor associated with the algorithm instead of the dataset in a later PR.

Contributor Author

track it here #1658.

"""Loads response dataset."""
dataset_name = data_config["dataset_name"]

# TODO @yukih: remove duplicated dataset_name (openmathinstruct2, clevr_cogent)
Contributor

what was this comment referring to? not sure i follow from the changes you made

Contributor Author

for these two commits: 487d354, d9344a6.
previously both openmathinstruct2 and OpenMathInstruct-2 used OpenMathInstruct2Dataset, and both clevr_cogent and clevr-cogent used CLEVRCoGenTDataset.

self.task_name = "oasst"

# load from huggingface
filename = hf_hub_download(
Contributor

this looks a lot cleaner :)

@yuki-97 yuki-97 changed the title feat: split train val dataset and refactor for response dataset refactor: split train val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 changed the title refactor: split train val dataset in response dataset refactor: split train and val dataset in response dataset Dec 18, 2025
[
("clevr-cogent", format_clevr_cogent_dataset),
("geometry3k", format_geometry3k_dataset),
# ("refcoco", format_refcoco_dataset), # this needs to download 13.5 GB of images
Contributor Author

@terrykong shall we enable this?

@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 6b34af3 to fea258d Compare December 19, 2025 15:50
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Dec 19, 2025
yuki-97 and others added 29 commits January 8, 2026 07:45
Signed-off-by: Yuki Huang <[email protected]>
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2a4cedd to 20f3a62 Compare January 8, 2026 15:47

Labels

CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants