Repository for the Paper Summarizer project for the course 'Machine Learning' @ Université Paris-Saclay, CentraleSupélec. BDMA, Fall 2023.
Authors:
The datasets can be found at:
- Arxiv-Summarization dataset: download from Arxiv-Summarization at HuggingFace
- DialogSum dataset: download from DialogSum at HuggingFace
Install the dependencies from the requirements.txt file. We recommend using a virtual environment and Python 3.9.18.
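For example, a minimal setup might look like this (a sketch assuming `python3.9` is available on your PATH and you use a POSIX shell):

```bash
# Create and activate a virtual environment, then install the dependencies
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```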
To preprocess the datasets, use the `source/preprocessing/preprocess.py` script. You can run it with the following command:
```bash
python source/preprocessing/preprocess.py --zip_dir <zip_location> --zip_to_stories [True/False] --toChunk [True/False] --ch_sum_sent <n_sents> --stories_dir <stories_location> --json_dir <json_location> --max_sentences <max_sents> --min_sentence_length <min_length> --model_name <model_name> --output_dir <output_dir> --preprocess [True/False]
```

Where:
- `zip_dir` is the location of the zip file containing the dataset
- `zip_to_stories` is a boolean indicating whether to convert the zip files to stories or not
- `toChunk` is a boolean indicating whether to chunk stories longer than 512 tokens or not
- `ch_sum_sent` is the number of sentences used to summarize each chunk
- `stories_dir` is the location of the stories folder
- `json_dir` is the location of the json file to save the preprocessed data
- `max_sentences` is the maximum number of sentences to keep in the whole summary
- `min_sentence_length` is the minimum number of tokens a sentence must have to be kept in the summary
- `model_name` is the name of the model used to tokenize the sentences
- `output_dir` is the location of the output folder, to save the summaries
- `preprocess` is a boolean indicating whether to preprocess the data or not
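For example, a preprocessing run might look like the following (all paths, the model name, and the numeric values below are illustrative placeholders, not project defaults):

```bash
# Illustrative invocation; adapt paths, model name, and values to your setup
python source/preprocessing/preprocess.py \
    --zip_dir data/arxiv-summarization.zip \
    --zip_to_stories True \
    --toChunk True \
    --ch_sum_sent 3 \
    --stories_dir data/stories \
    --json_dir data/preprocessed.json \
    --max_sentences 10 \
    --min_sentence_length 5 \
    --model_name bert-base-uncased \
    --output_dir data/summaries \
    --preprocess True
```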
If you use DialogSum, you might need to use the alternative script `source/preprocessing/preprocess_dialogue.py` instead, which works similarly.
To train the model, use the `source/train.py` script. You can run it with the following command:
```bash
python source/train.py --train_loc <train_dataset_loc> --valid_loc <validation_dataset_loc> --model_loc <path_to_model> --output_dir <output_dir> --model_type <model_type> --verbose [True/False] --batch_size <bsize> --train_size <tsize> --valid_size <vsize> --epochs <n_epochs>
```

Where:
- `train_loc` is the location of the training dataset
- `valid_loc` is the location of the validation dataset
- `model_loc` is the name of (or path to) the model to use
- `output_dir` is the location of the output folder, to save the model and the logs
- `model_type` is the type of model to use, 'linear' or 'transformer'
- `verbose` is a boolean indicating whether to print the logs or not
- `batch_size` is the batch size to use
- `train_size` is the number of training samples to use
- `valid_size` is the number of validation samples to use
- `epochs` is the number of epochs to train the model
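For example, a training run might look like this (all paths and hyperparameter values below are illustrative placeholders):

```bash
# Illustrative invocation; adapt paths and hyperparameters to your setup
python source/train.py \
    --train_loc data/train.json \
    --valid_loc data/valid.json \
    --model_loc bert-base-uncased \
    --output_dir models/run1 \
    --model_type transformer \
    --verbose True \
    --batch_size 16 \
    --train_size 10000 \
    --valid_size 1000 \
    --epochs 5
```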
You can use the script `experiment.sh` to run an experiment. To run several experiments in batch, use `run_experiments.sh`, where you can add all the parameter combinations you want to test.
You can also generate summaries with the baseline model, both with the sum_of_sums strategy and by summarizing the full text at once, by running:

```bash
python bert_summarize.py
```