Conversational Recommender Systems (CRSs) facilitate item discovery through multi-turn dialogues that elicit user preferences via natural language interaction. This field has gained significant attention following advancements in Natural Language Processing (NLP) enabled by Large Language Models (LLMs). However, current CRS research remains constrained by datasets with fundamental limitations. Human-generated datasets suffer from inconsistent dialogue quality, limited domain expertise, and insufficient scale for real-world application, while synthetic datasets created with proprietary LLMs ignore the diversity of real-world user behavior and present significant barriers to accessibility and reproducibility. The development of effective CRSs depends critically on addressing these deficiencies. To this end, we present DistillRecDial, a novel conversational recommendation dataset generated through a knowledge distillation pipeline that leverages smaller, more accessible open LLMs. Crucially, DistillRecDial simulates a range of user types with varying intentions, preference expression styles, and initiative levels, capturing behavioral diversity that is largely absent from prior work.
Human evaluation demonstrates that our dataset significantly outperforms widely adopted CRS datasets in dialogue coherence and domain-specific expertise, indicating its potential to advance the development of more realistic and effective conversational recommender systems.
> [!IMPORTANT]
> DistillRecDial exceeds GitHub's storage quota, so it is available exclusively through Hugging Face: https://huggingface.co/datasets/planeB/DistillRecDial
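Once downloaded, the dataset can be loaded directly with the Hugging Face `datasets` library. A minimal sketch, assuming the default configuration exposes standard `train`/`validation`/`test` splits (split names are an assumption; check the dataset card for the exact layout):

```python
# pip install datasets
from datasets import load_dataset

# Load DistillRecDial from the Hugging Face Hub.
# NOTE: the split names used below are assumed, not confirmed;
# consult the dataset card for the actual configuration.
dataset = load_dataset("planeB/DistillRecDial")

print(dataset)              # available splits and features
print(dataset["train"][0])  # inspect the first dialogue
```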
> [!TIP]
> To support researchers who want to extend or adapt our approach, we release the scripts used to generate the DistillRecDial dataset.
- Create a virtual environment (Python 3.10.12 recommended):

  ```bash
  python -m venv env
  source env/bin/activate  # On Windows: env\Scripts\activate
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The data sources needed to rebuild DistillRecDial from scratch are listed below (a loading sketch follows the list):

- Amazon Reviews 2023, Movies and TV category: downloaded automatically by `createDatasetToConversation_tmdb.py`. The original raw files are available in the Amazon Reviews 2023 Hugging Face repository.
- Item features sourced from TMDB: available in the Hugging Face repository as `tmdb_movies.json`.
- Visual captions from a VLM: already included in `tmdb_movies.json`.
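A minimal sketch of how these sources might be fetched, assuming the raw reviews come from the `McAuley-Lab/Amazon-Reviews-2023` Hub repository (its documented loading pattern) and that `tmdb_movies.json` sits at the root of the DistillRecDial dataset repository; both layout details are assumptions to verify before running:

```python
import json

from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Raw Movies and TV reviews from Amazon Reviews 2023.
# ASSUMPTION: config name "raw_review_Movies_and_TV" and split "full"
# follow the repository's documented loading pattern (a loading script
# is involved, which may require an older `datasets` version).
reviews = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",
    "raw_review_Movies_and_TV",
    split="full",
    trust_remote_code=True,
)

# TMDB metadata, with VLM-generated visual captions already merged in.
# ASSUMPTION: the file lives at the root of the dataset repository.
tmdb_path = hf_hub_download(
    repo_id="planeB/DistillRecDial",
    filename="tmdb_movies.json",
    repo_type="dataset",
)
with open(tmdb_path) as f:
    tmdb_movies = json.load(f)

print(len(reviews), "reviews;", len(tmdb_movies), "TMDB entries")
```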
The release includes the following scripts:

- `createDatasetToConversation_tmdb.py`: handles the preprocessing of multiple data sources (Amazon Movies and TV, TMDB metadata, and visual captions from a VLM) to populate the prompt templates for dataset creation.
- `knowledgeDistillationDataset.py`: formats the dialogues generated by the teacher model for supervised fine-tuning of the student model.
- `exportConversationDataset.py`: post-processes the student-generated dialogues and splits the data into training, validation, and test sets to produce the final DistillRecDial dataset.
- `prompts.py`: defines the prompt templates used to simulate the various user stereotypes during dialogue generation.
- `ConvertToCrsLab.py`: converts the DistillRecDial dataset into the CRSLab-compatible format.
- `EntityMentionDetector.py`: supports `ConvertToCrsLab.py` by performing Named Entity Recognition (NER) on dialogue content.
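The script descriptions imply a rough pipeline order. A minimal orchestration sketch under two loud assumptions: each script runs without required CLI arguments, and the teacher-dialogue generation and student fine-tuning stages (which are not part of these scripts) are carried out separately between steps:

```python
# Hypothetical end-to-end run of the released scripts, in the order
# implied by their descriptions. ASSUMPTION: no required CLI arguments.
import subprocess

pipeline = [
    "createDatasetToConversation_tmdb.py",  # 1. build prompts from the raw sources
    # (teacher model generates dialogues here)
    "knowledgeDistillationDataset.py",      # 2. format teacher dialogues for student SFT
    # (student is fine-tuned and generates dialogues here)
    "exportConversationDataset.py",         # 3. post-process and split train/val/test
    "ConvertToCrsLab.py",                   # 4. optional: export to the CRSLab format
]

for script in pipeline:
    subprocess.run(["python", script], check=True)
```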
If you use DistillRecDial, please cite:

```bibtex
@inproceedings{Martina2025DistillRecDial,
  author    = {Martina, Alessandro Francesco Maria and Petruzzelli, Alessandro and Musto, Cataldo and de Gemmis, Marco and Lops, Pasquale and Semeraro, Giovanni},
  title     = {{DistillRecDial}: A Knowledge-Distilled Dataset Capturing User Diversity in Conversational Recommendation},
  booktitle = {Proceedings of the Nineteenth ACM Conference on Recommender Systems (RecSys '25)},
  year      = {2025},
  month     = {September},
  day       = {22--26},
  address   = {Prague, Czech Republic},
  publisher = {ACM},
  doi       = {10.1145/3705328.3748161},
  isbn      = {979-8-4007-1364-4}
}
```