selfchatbot is an LM fine-tuning pipeline for fine-tuning base HuggingFace LMs to talk like you.
Disclaimer: This project is a personal learning experience rather than something designed for production-level applications. You can see this in many of the design choices I made while learning how to fine-tune LMs with HuggingFace.
Currently, only Discord direct messages exported with DiscordChatExporter are supported, but support for other message formats can be added.
Documentation on how to add your own message format support will be added later. If you want to add support for a new message format now, you can look at the Preprocessors.
selfchatbot requires Python 3.12 or higher. Before installing, make sure you have:

- Python 3.12+: Download the latest version from the official Python website.
- pip: Ensure you have the latest version of `pip` installed. Update it if necessary:

  ```shell
  python -m pip install --upgrade pip
  ```
Clone the project repository to your local machine:

```shell
git clone https://github.com/kagamiAL/selfchatbot
cd selfchatbot
```

Set up a virtual environment to manage dependencies:

```shell
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install the package along with its dependencies using pip:

```shell
pip install .
```

For more details about the installation process or dependencies, refer to the pyproject.toml.
selfchatbot requires several environment variables to be set for proper functionality. These variables determine the paths for storing raw data, preprocessed data, and results. You can declare these variables directly in your environment or use a .env file (built-in support is provided).
- `selfChatBot_raw`: Path to the directory containing raw datasets.
- `selfChatBot_preprocessed`: Path to the directory for storing preprocessed datasets.
- `selfChatBot_results`: Path to the directory where results will be saved.
- Using a `.env` File

  Create a `.env` file in the root directory of selfchatbot and add the following lines:

  ```
  selfChatBot_raw=/path/to/raw/datasets
  selfChatBot_preprocessed=/path/to/preprocessed/datasets
  selfChatBot_results=/path/to/results
  ```

- Setting Variables Manually

  On Linux/macOS:

  ```shell
  export selfChatBot_raw=/path/to/raw/datasets
  export selfChatBot_preprocessed=/path/to/preprocessed/datasets
  export selfChatBot_results=/path/to/results
  ```

  On Windows (Command Prompt):

  ```shell
  set selfChatBot_raw=C:\path\to\raw\datasets
  set selfChatBot_preprocessed=C:\path\to\preprocessed\datasets
  set selfChatBot_results=C:\path\to\results
  ```
- Ensure that the paths you provide are absolute paths for consistency.
- The project will automatically detect and load the .env file if it exists.
- For large datasets, ensure the directories have sufficient storage capacity.
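Before running any of the selfchatbot commands, it can help to sanity-check that all three variables are set and point to absolute paths. The helper below is an illustrative sketch (it is not part of selfchatbot itself):

```python
import os

# Names of the environment variables selfchatbot expects.
REQUIRED_VARS = ["selfChatBot_raw", "selfChatBot_preprocessed", "selfChatBot_results"]

def missing_vars(env: dict[str, str]) -> list[str]:
    """Return required variable names that are unset or not absolute paths."""
    return [
        name for name in REQUIRED_VARS
        if name not in env or not os.path.isabs(env[name])
    ]

# Pass os.environ in real use; a plain dict works for a quick check.
example_env = {
    "selfChatBot_raw": "/data/raw",
    "selfChatBot_preprocessed": "/data/preprocessed",
}
print(missing_vars(example_env))  # ['selfChatBot_results']
```

In practice you would call `missing_vars(os.environ)` after your `.env` file has been loaded.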
Datasets in the selfChatBot_raw directory must adhere to a specific folder structure to ensure proper functionality. Below are the details for organizing datasets:
Each dataset folder should follow the naming template:

```
Dataset_{ID}_{Name}
```

- `ID`: A unique integer identifier for the dataset. This ensures each dataset is distinct (e.g., `1`, `2`, `42`).
- `Name`: A descriptive, human-readable name for the dataset. This can be any string that helps you identify the dataset (e.g., `RedditChats`, `Discord_Data`).

Examples:

- `Dataset_1_Discord_Data`: here `1` is the ID and `Discord_Data` is the name.
- `Dataset_2_RedditChats`: here `2` is the ID and `RedditChats` is the name.
- `parameters.json`

  Each dataset folder must contain a `parameters.json` file specifying fine-tuning parameters. See the Parameters section for more details.

- Sub-Folders for Message Formats

  Each dataset folder contains sub-folders named after the format of the `.txt` files they contain. Examples of sub-folder names include:

  - `DiscordChatExporter` for data exported by DiscordChatExporter.

- `.txt` Files

  Each sub-folder contains `.txt` files representing the messages. These files must conform to the format associated with their sub-folder name.
Here's an example directory layout for a dataset:

```
selfChatBot_raw/
├── Dataset_1_Discord_Data/
│   ├── parameters.json
│   ├── DiscordChatExporter/
│   │   ├── channel1.txt
│   │   ├── channel2.txt
├── Dataset_2_RedditChats/
│   ├── parameters.json
│   ├── RedditFormat/
│   │   ├── thread1.txt
│   │   ├── thread2.txt
```
For more clarity, you can look at the SampleEnvironment folder.
- Use unique integers for ID to avoid conflicts.
- Choose meaningful names for Name to easily identify datasets.
- Ensure the parameters.json file is present in every dataset folder and contains all required parameters.
- Sub-folder names must reflect the format of their .txt files for clarity and proper processing.
Each dataset folder in selfChatBot_raw must include a parameters.json file that specifies the parameters for fine-tuning. Below is an example and detailed explanation of the structure and its fields.
```json
{
    "model": "gpt2-large",
    "type_fine_tune": "lora",
    "max_length": 1024,
    "preprocessor_data": {
        "DiscordChatExporter": {
            "username": "your_username_here_without_@"
        }
    }
}
```
- `model`
  - Description: Specifies the base model to use for fine-tuning.
  - Examples: `"gpt2-large"`, `"gpt2-xl"`, `"EleutherAI/gpt-neo-1.3B"`

- `type_fine_tune`
  - Description: Defines the fine-tuning method.
  - Allowed Values:
    - `lora`: Use LoRA fine-tuning (Low-Rank Adaptation).
    - `qlora`: Use Quantized LoRA.
    - `finetune`: Full model fine-tuning.

- `max_length`
  - Description: The maximum sequence length for training and inference. Defaults to 1024 if not specified.
  - Example: 1024 (recommended for GPT-2 models).

- `preprocessor_data`
  - Description: A nested field containing dataset-specific preprocessing parameters.
  - Structure:
    - The keys are the names of sub-folder formats (e.g., `DiscordChatExporter`, `SlackExporter`).
    - Each key maps to an object with format-specific settings.

  Example for `DiscordChatExporter`:

  ```json
  "preprocessor_data": {
      "DiscordChatExporter": {
          "username": "your_username_here_without_@"
      }
  }
  ```

  - Field: `username`
    - Description: Your Discord username without the `@`.
    - Example: `"john_doe"`
- Make sure all fields are correctly defined; missing or invalid values can cause errors during fine-tuning.
- The preprocessor_data field is optional but must be included if specific preprocessing is required for the dataset's format.
- For fine-tuning with different methods (`lora`, `qlora`, or `finetune`), ensure the base model and parameters are compatible with the chosen method.
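A small loader can catch missing or invalid fields before a long training run starts. The sketch below is illustrative only: the field names and the `max_length` default match this section, but the validation logic is an assumption, not selfchatbot's own code:

```python
import json

# Values documented for type_fine_tune in the Parameters section.
ALLOWED_FINE_TUNES = {"lora", "qlora", "finetune"}

def load_parameters(raw_json: str) -> dict:
    """Parse a parameters.json string, applying the documented max_length default."""
    params = json.loads(raw_json)
    for field in ("model", "type_fine_tune"):
        if field not in params:
            raise ValueError(f"missing required field: {field}")
    if params["type_fine_tune"] not in ALLOWED_FINE_TUNES:
        raise ValueError(f"invalid type_fine_tune: {params['type_fine_tune']}")
    params.setdefault("max_length", 1024)  # documented default
    return params

example = '{"model": "gpt2-large", "type_fine_tune": "lora"}'
print(load_parameters(example)["max_length"])  # 1024
```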
Before using selfchatbot make sure you have checked out Installation, Environment Variables, Data Folder Structure, and Parameters JSON. For a quick start, see SampleEnvironment.
You must preprocess your data before moving on to fine-tuning.
To preprocess a dataset, use the command-line tool selfChatBot_preprocess. This command processes data from various sources within a dataset folder and prepares it for fine-tuning.
```shell
selfChatBot_preprocess -d <Dataset_ID>
```

- `-d <Dataset_ID>`: The unique ID of the dataset to preprocess. This corresponds to the `ID` in the dataset folder name (`Dataset_{ID}_{Name}`).

```shell
selfChatBot_preprocess -d 1
```

This example preprocesses the dataset located in `selfChatBot_raw/Dataset_1_Discord_Data`.
- The preprocessed data will be saved in the `selfChatBot_preprocessed` directory.
- Refer to the Data Folder Structure and Parameters JSON sections for details on preparing datasets before fine-tuning.
You must have preprocessed your data before moving on to fine-tuning.
To fine-tune a model on a dataset, use the command-line tool selfChatBot_train. This command trains the model using preprocessed data from the specified dataset.
```shell
selfChatBot_train -d <Dataset_ID>
```

- `-d <Dataset_ID>`: The unique ID of the dataset to fine-tune the model on. This corresponds to the `ID` in the dataset folder name (`Dataset_{ID}_{Name}`).
- The fine-tuning results (including the model weights) will be saved in the `selfChatBot_results` directory.
To interact with the fine-tuned chat model, use the command-line tool selfChatBot_play. This command allows you to play with the model using either a session-based interaction or a prompt-based interaction.
```shell
selfChatBot_play -d <Dataset_ID> [-t <interaction_type>] [-p <prompt>] [-mt <model_type>]
```

- `-d <Dataset_ID>`: The unique ID of the dataset to use for interaction. This corresponds to the `ID` in the dataset folder name (`Dataset_{ID}_{Name}`).
- `-t <interaction_type>`: (Optional) Specifies the type of interaction with the model.
  - `session`: Initiates an ongoing chat session.
  - `prompt`: Interacts with the model using a custom prompt.

  Default is `session`.
- `-p <prompt>`: (Required if `-t prompt` is selected) The custom prompt to use for the interaction.
- `-mt <model_type>`: (Optional) Specifies which model to use for interaction.
  - `best`: Use the best-performing model (the one with the lowest validation loss).
  - `final`: Use the final model after training.

  Default is `best`.
```shell
selfChatBot_play -d 1 -t prompt -p "Hello, how are you?"
```

This example uses the dataset `Dataset_1_Discord_Data` and sends the prompt "Hello, how are you?" to the model for a prompt-based interaction.

```shell
selfChatBot_play -d 1 -t session
```

This example initiates a session-based interaction with the model, using the dataset `Dataset_1_Discord_Data`.
- Ensure that the fine-tuned model is available in the `selfChatBot_results` directory.
- The model used for interaction can be the best model or the final model, depending on your preference.
Alan Bach, [email protected]