
DataFlex

Documents | Ask DeepWiki

🎉 If you like our project, please give us a star ⭐ on GitHub to stay up to date with the latest updates.

简体中文 | English

📰 1. News

  • [2025-12-23] 🎉 We’re excited to announce that DataFlex, the first data-centric training system, is now released! Stay tuned for future updates.

🔍 2. Overview

DataFlex is an advanced dynamic training framework built on top of LLaMA-Factory.
It intelligently schedules data during training, supporting dynamic sample selection, domain ratio adjustment, and dynamic weighting, aiming to improve both training efficiency and final model performance.

DataFlex integrates seamlessly with LLaMA-Factory, offering researchers and developers more flexible and powerful training control. For its goals and design philosophy, please refer to Dataflex-Doc.

  • Dynamic Select Trainer: Dynamically selects training samples according to a given strategy (e.g., focusing on “hard” samples); a conceptual sketch of loss-based selection is shown after this list. The data selection algorithms are summarized as follows:

| Method | Category | Requires Model-in-the-Loop? |
| --- | --- | --- |
| LESS | Gradient-Based | ✅ Yes |
| NICE | Gradient-Based | ✅ Yes |
| Loss | Loss-Based | ✅ Yes |
| Delta Loss | Loss-Based | ✅ Yes |
| NEAR | Data Distribution-Based | ❌ No |
| TSDS | Data Distribution-Based | ❌ No |
| Static | No Selection | ❌ No |
| Random | Random Sampling | ❌ No |

  • Dynamic Mix Trainer: Dynamically adjusts the ratio of data drawn from different domains during training; a conceptual sketch of online mixture reweighting is shown after this list. The data mixture algorithms are summarized as follows:

| Method | Category | Requires Model-in-the-Loop? |
| --- | --- | --- |
| DOREMI | Offline Mixture | ✅ Yes |
| ODM | Online Mixture | ✅ Yes |

  • Dynamic Weight Trainer: Dynamically adjusts sample weights during backpropagation to emphasize data preferred by the model; a conceptual sketch of loss-based reweighting is shown after this list. The data reweighting algorithms are summarized as follows:

| Method | Category | Requires Model-in-the-Loop? |
| --- | --- | --- |
| Loss Reweighting | Loss-Based | ✅ Yes |

  • Full compatibility with LLaMA-Factory: DataFlex works as a drop-in replacement.
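
To make the selection idea concrete, the following is a minimal, illustrative sketch of loss-based selection, not DataFlex's actual implementation or API: score each candidate by its current training loss and keep the highest-loss (“hard”) fraction. The function name `select_hard_samples`, the `keep_ratio` parameter, and the Hugging Face-style `model(**batch).loss` interface are assumptions made for this example.

```python
import torch


def select_hard_samples(model, dataset, collate_fn, keep_ratio=0.5, device="cpu"):
    """Return indices of the `keep_ratio` fraction of samples with the highest current loss."""
    model.eval()
    losses = []
    with torch.no_grad():
        for example in dataset:
            # Assumes `collate_fn` returns a dict of tensors and the model follows the
            # Hugging Face convention of returning `.loss` when labels are provided.
            batch = {k: v.to(device) for k, v in collate_fn([example]).items()}
            losses.append(model(**batch).loss.item())
    k = max(1, int(len(losses) * keep_ratio))
    # Rank samples by loss, highest first, and keep the top-k "hard" subset.
    ranked = sorted(range(len(losses)), key=losses.__getitem__, reverse=True)
    return ranked[:k]
```

In practice such scoring would be done in batches and refreshed periodically during training; the single-pass loop here only illustrates the ranking-and-truncation idea.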
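
In the same spirit, here is a highly simplified, DoReMi-flavoured sketch of online domain mixing under assumed inputs: domains whose current loss exceeds a reference loss receive a multiplicatively larger sampling weight, which is then renormalized. The function name, the dictionaries, and the `step_size` parameter are illustrative only and do not reflect DataFlex's configuration.

```python
import math


def update_domain_weights(weights, domain_losses, reference_losses, step_size=0.1):
    """Upweight domains with excess loss (current minus reference) and renormalize."""
    updated = {
        d: weights[d] * math.exp(step_size * max(0.0, domain_losses[d] - reference_losses[d]))
        for d in weights
    }
    total = sum(updated.values())
    return {d: w / total for d, w in updated.items()}


# Example: the "code" domain lags its reference loss the most, so its share grows.
weights = {"web": 0.5, "code": 0.3, "papers": 0.2}
print(update_domain_weights(
    weights,
    domain_losses={"web": 1.9, "code": 2.6, "papers": 2.1},
    reference_losses={"web": 2.0, "code": 2.0, "papers": 2.0},
))
```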
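
Finally, a minimal sketch of loss-based reweighting during backpropagation: compute per-sample losses, turn them into normalized weights, and backpropagate the weighted sum so that some samples contribute more to the gradient. Whether higher- or lower-loss samples are emphasized is a strategy choice; this sketch upweights higher-loss samples, and the function name and `temperature` parameter are assumptions rather than DataFlex's API.

```python
import torch
import torch.nn.functional as F


def reweighted_loss(logits, labels, temperature=1.0):
    """Cross-entropy where each sample's contribution is softmax-weighted by its own loss."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # shape: (batch,)
    # Detach so the weights act as constants and only rescale each sample's gradient.
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)
    return (weights * per_sample).sum()


# Toy usage: a batch of 4 samples over 3 classes.
logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 2])
reweighted_loss(logits, labels).backward()
```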

📌 3. Quick Start

Please use the following commands for environment setup and installation👇

```bash
git clone https://github.com/OpenDCAI/DataFlex.git
cd DataFlex
pip install -e .
pip install llamafactory==0.9.3
```

The launch command is similar to LLaMA-Factory's. Below is an example using LESS:

```bash
FORCE_TORCHRUN=1 DISABLE_VERSION_CHECK=1 dataflex-cli train examples/train_lora/selectors/less.yaml
```

Unlike vanilla LLaMA-Factory, your .yaml config file must also include DataFlex-specific parameters; for details, please refer to DataFlex-Doc.

📚 4. Experimental Results

Using DataFlex can improve performance over the default LLaMA-Factory training.

Data Selector & Reweighter Results

We use a subset of Open-Hermes-2.5 as the training dataset. The data selection and data reweighting algorithms outperform the random-selection baseline on the subset of the MMLU benchmark relevant to the training dataset. For the LESS and NICE algorithms, we use the MMLU validation set as the validation set, with a GPT-5-generated trajectory.

Data Mixture Results

We use a subset of SlimPajama-627B for data mixture. The data mixture algorithms also outperform the baseline (the default data mixture) on the MMLU benchmark.

| Benchmark | Baseline | DoReMi | ODM |
| --- | --- | --- | --- |
| MMLU | 25.27 | 25.84 | 26.04 |

🤝 5. Acknowledgements

We thank LLaMA-Factory for offering an efficient and user-friendly framework for large model fine-tuning, which greatly facilitated rapid iteration in our training and experimentation workflows.
Our gratitude extends to all contributors in the open-source community—their efforts collectively drive the development of DataFlex.

🤝 6. Community & Support

We welcome contributions of new trainers and selectors! Please ensure code formatting is consistent with the existing style before submitting a PR.

We also welcome you to join the DataFlex and DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!

• 📮 GitHub Issues: Report bugs or suggest features

• 🔧 GitHub Pull Requests: Contribute code improvements

• 💬 Join our community groups to connect with us and other contributors!
