MMIRAGE, which stands for Modular Multimodal Intelligent Reformatting and Augmentation Generation Engine, is an advanced platform designed to streamline the processing of datasets using generative models. It is engineered to handle large-scale data reformatting and augmentation tasks with efficiency and precision. By leveraging state-of-the-art generative models, MMIRAGE enables users to perform complex dataset transformations, ensuring compatibility across various formats and schemas. Its multi-node support and parallel processing capabilities make it an ideal choice for scenarios demanding substantial computational power, such as distributed training and inference workflows. MMIRAGE not only simplifies the integration of powerful language models but also provides a customizable framework for diverse use cases, from reformatting conversational datasets to generating Q/A pairs from plain text.
To install the library, you can clone it from GitHub and then use pip to install it directly. It is recommended to have already installed torch and sglang to take advantage of GPU acceleration.
```
git clone git@github.com:EPFLiGHT/MMIRAGE.git
pip install -e ./MMIRAGE
```

For testing and scripts that make use of the library, it is advised to create a `.env` file. You can do this by running the following command:

```
curl https://raw.githubusercontent.com/EPFLiGHT/MMIRAGE/refs/heads/json-output/scripts/generate_env.sh | sh
```
- Easily configurable with a YAML file, which configures the following parameters:
  - The prompt to the LLM
  - Variables, each with a name and a key into the input JSON
- Parallelizable with multi-node support
- The pipeline supports distributed inference using accelerate
- Support for a variety of LLMs and VLMs (LLMs only in a first version)
- Support for any dataset schema (configurable in the YAML file)
- The ability to output either JSON (or any other structured format) or plain text (sketched below)
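To illustrate the last point, here is a minimal, hypothetical sketch of the difference between the two output modes. The function name and validation logic are assumptions for illustration, not MMIRAGE's actual API:

```python
import json

def parse_output(raw_reply: str, output_type: str, schema_keys=None):
    """Illustrative only: a 'plain' output keeps the raw model text,
    while a 'JSON' output is parsed and checked against the declared keys."""
    if output_type == "plain":
        return raw_reply
    parsed = json.loads(raw_reply)
    missing = [k for k in (schema_keys or []) if k not in parsed]
    if missing:
        raise ValueError(f"LLM reply is missing keys: {missing}")
    return parsed

# Plain text passes through unchanged.
print(parse_output("**Formatted** answer", "plain"))

# Structured output must contain the declared keys.
print(parse_output('{"question": "Q?", "answer": "A.", "explanation": "E."}',
                   "JSON", ["question", "answer", "explanation"]))
```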
Suppose you have a dataset with samples of the following format:

```
{
  "conversations": [{"role": "user", "content": "Describe the image"}, {"role": "assistant", "content": "This is a badly formatted answer"}],
  "modalities": [<the images>]
}
```

The dataset contains assistant answers that are badly formatted. The goal is to use an LLM to reformat these answers in Markdown. With MMIRAGE, this is as simple as defining a YAML configuration file, in which we could specify:
```yaml
inputs:
  - name: assistant_answer
    key: conversations[1].content
  - name: user_prompt
    key: conversations[0].content
  - name: modalities
    key: modalities
outputs:
  - name: formatted_answer
    type: llm
    output_type: plain
    prompt: |
      Reformat the answer in a markdown format without adding anything else:
      {assistant_answer}
output_schema:
  conversations:
    - role: user
      content: {user_prompt}
    - role: assistant
      content: {formatted_answer}
  modalities: {modalities}
```
Configuration explanation:
- `inputs`: specifies variables that are defined from the input dataset. For instance, by specifying the key `conversations[1].content`, we say that this variable corresponds to `sample["conversations"][1]["content"]`.
- `outputs`: specifies variables that are created by the pipeline, along with how each variable should be created. Here, `formatted_answer` is created using an LLM prompt and is a plain-text variable (as opposed to a JSON variable).
- `output_schema`: specifies the output schema of the dataset; each sample will follow this format. Here we know that each sample will contain 2 keys: `conversations` and `modalities`.
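To make the key syntax and prompt templating concrete, here is a minimal illustrative sketch in Python. `resolve_key` is a hypothetical helper that reproduces the semantics described above, not MMIRAGE's actual implementation:

```python
import re

def resolve_key(sample, key):
    """Resolve a key path such as 'conversations[1].content' against a
    nested sample, i.e. sample["conversations"][1]["content"]."""
    value = sample
    for part in key.split("."):
        # Separate the field name from optional list indices, e.g. 'conversations[1]'.
        match = re.fullmatch(r"(\w+)((?:\[\d+\])*)", part)
        value = value[match.group(1)]
        for idx in re.findall(r"\[(\d+)\]", match.group(2)):
            value = value[int(idx)]
    return value

sample = {
    "conversations": [
        {"role": "user", "content": "Describe the image"},
        {"role": "assistant", "content": "This is a badly formatted answer"},
    ],
    "modalities": [],
}

# Input variables are resolved from the sample...
assistant_answer = resolve_key(sample, "conversations[1].content")

# ...and substituted into the prompt template from the configuration.
prompt = (
    "Reformat the answer in a markdown format without adding anything else:\n"
    "{assistant_answer}"
).format(assistant_answer=assistant_answer)
print(prompt)
```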
In the second example, we want to generate questions from a plain-text document. The 3 keys that we want to generate are:
- "question"
- "answer"
- "explanation"
Suppose we have the following format:
```
{
  "text": "This is a very interesting article about cancer"
}
```

The corresponding YAML configuration would be:

```yaml
inputs:
  - name: plain_text
    key: text
outputs:
  - name: output_dict
    type: prompt
    output_type: JSON
    prompt: |
      I want to generate Q/A pairs from the following text:
      {plain_text}
    output_schema:
      - question
      - explanation
      - answer
output_schema:
  conversations:
    - role: user
      content: {question}
    - role: assistant
      content: |
        {explanation}
        Answer: {answer}
```
Here, we choose to output a JSON answer with 3 keys ("question", "explanation" and "answer") that we then reference in the final `output_schema`.
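To picture the end result, here is a hypothetical walk-through of one sample. The LLM reply is invented, and the substitution code merely mirrors the configuration above rather than MMIRAGE's internals:

```python
import json

# Input sample, following the schema from the example above.
sample = {"text": "This is a very interesting article about cancer"}

# Hypothetical JSON reply from the LLM, containing the 3 keys declared
# in the inner output_schema of output_dict.
llm_reply = json.loads(
    '{"question": "What is the article about?", '
    '"explanation": "The text states its own topic.", '
    '"answer": "Cancer."}'
)

# Apply the outer output_schema: substitute the generated variables
# into the conversations template.
result = {
    "conversations": [
        {"role": "user", "content": llm_reply["question"]},
        {
            "role": "assistant",
            "content": f"{llm_reply['explanation']}\nAnswer: {llm_reply['answer']}",
        },
    ]
}
print(json.dumps(result, indent=2))
```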