- Introduction
- Problem Statement
- Dataset Overview
- Challenges and How We Overcame Them
- Project Workflow
- Evaluation and Results
- Kaggle Competition Insights
- User Interface
- Conclusion and Future Work
- Video Presentation
- Setup Instructions
Imagine walking into a library with thousands of books. You love reading, but the sheer number of choices is overwhelming. What should you read next? Which books will match your taste? This is where NextRead, our personalized book recommendation system, steps in.
NextRead is a project we developed to make book discovery easy and enjoyable. By analyzing borrowing history and leveraging advanced machine learning techniques, the system provides highly tailored recommendations for each user. From handling messy, multilingual datasets to integrating state-of-the-art transformer models, this project represents our journey into building an intelligent recommendation system.
Libraries serve as treasure troves of knowledge, offering vast collections of books across genres, languages, and themes. However, this abundance often leaves users struggling to find titles they’d truly enjoy. Libraries rarely have personalized recommendation systems, and without guidance, users miss out on discovering hidden gems.
Our goal was to address this problem by building a recommendation system capable of:
- Understanding user preferences through their borrowing history.
- Analyzing book metadata to identify similarities and connections between titles.
- Providing personalized recommendations that align with user interests.
- Interactions Dataset:
  - Purpose: Contains records of books borrowed by users over time.
  - Key Features:
    - `user_id`: Unique identifier for each user.
    - `book_id`: Unique identifier for each book.
    - `timestamp`: Date and time of interaction.
  - Observations:
    - Clean and complete, with no missing data.
    - Sparse interactions, with users borrowing only a few books.
- Items Dataset:
  - Purpose: Contains metadata about books.
  - Key Features:
    - `Title`, `Author`, `ISBN`, `Subjects`, `Language`, `Publisher`, `book_id`.
  - Observations:
    - Missing values in most columns except `Title`.
    - Metadata in multiple languages, primarily French.
- Problem: The interactions dataset was sparse. Users had borrowed very few books, and there was little overlap between users and books, making it difficult to identify clear patterns.
- Solution:
- Augmented the dataset by generating synthetic user-item interactions. This helped improve the performance of collaborative filtering by simulating broader user behavior patterns.
- Enhanced the content-based filtering model, ensuring the system could recommend books even for users with minimal borrowing history.
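As a rough illustration of the augmentation step, here is a minimal sketch (the function name and the popularity-weighted sampling strategy are our illustrative choices, not necessarily the exact scheme used in the project):

```python
import numpy as np
import pandas as pd

def augment_interactions(interactions: pd.DataFrame,
                         n_synthetic: int,
                         seed: int = 42) -> pd.DataFrame:
    """Append synthetic user-book pairs, sampling books in proportion to
    how often they were actually borrowed so popular titles stay popular."""
    rng = np.random.default_rng(seed)
    users = interactions["user_id"].unique()
    book_counts = interactions["book_id"].value_counts()
    probs = (book_counts / book_counts.sum()).to_numpy()
    synthetic = pd.DataFrame({
        "user_id": rng.choice(users, size=n_synthetic),
        "book_id": rng.choice(book_counts.index.to_numpy(),
                              size=n_synthetic, p=probs),
    })
    combined = pd.concat([interactions[["user_id", "book_id"]], synthetic],
                         ignore_index=True)
    return combined.drop_duplicates()  # avoid duplicate (user, book) pairs
```

Weighting the sampling by observed popularity keeps the synthetic data consistent with the real borrowing distribution rather than flattening it.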
- Problem: Many books lacked critical information like descriptions and categories, which are essential for content-based recommendations.
- Solution:
- Enriched the metadata by using the Google Books API to fetch missing information. For books with valid ISBNs, we pulled detailed descriptions and categories, making the dataset more comprehensive and useful for recommendations.
- Re-cleaned the enriched data to ensure consistency and uniformity.
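A minimal sketch of the enrichment step, using the public Google Books volumes endpoint (the helper names `parse_volume` and `fetch_by_isbn` are our own; the exact fields fetched in the project may differ):

```python
import requests

GOOGLE_BOOKS_URL = "https://www.googleapis.com/books/v1/volumes"

def parse_volume(payload: dict) -> dict:
    """Extract the fields we care about from a Google Books API response."""
    items = payload.get("items") or []
    if not items:
        return {"description": None, "categories": None}
    info = items[0].get("volumeInfo", {})
    return {
        "description": info.get("description"),
        "categories": "; ".join(info.get("categories", [])) or None,
    }

def fetch_by_isbn(isbn: str) -> dict:
    """Query the API for a single ISBN (performs a network call)."""
    resp = requests.get(GOOGLE_BOOKS_URL,
                        params={"q": f"isbn:{isbn}"}, timeout=10)
    resp.raise_for_status()
    return parse_volume(resp.json())
```

Keeping the parsing separate from the HTTP call makes the extraction logic easy to test without hitting the API.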
- Problem: The metadata was in multiple languages, predominantly French, which complicated text preprocessing.
- Solution:
- Used `langdetect` to identify the language of each text field and applied language-specific preprocessing using SpaCy models. For unsupported languages, we fell back on basic text cleaning techniques to ensure no data was left unprocessed.
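The language-routing logic can be sketched roughly as follows (the function names, the English/French model mapping, and the regex-based fallback are illustrative assumptions, not the project's exact pipeline):

```python
import re
import unicodedata

def basic_clean(text: str) -> str:
    """Fallback cleaning: normalize unicode, strip punctuation and symbols,
    collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[^\w\s]", " ", text)   # drop non-word, non-space characters
    text = re.sub(r"\s+", " ", text)       # collapse runs of whitespace
    return text.strip().lower()

# Assumed mapping from detected language codes to SpaCy model names.
SPACY_MODELS = {"en": "en_core_web_sm", "fr": "fr_core_news_sm"}

def preprocess(text: str) -> str:
    """Detect the language and route to a language-specific SpaCy pipeline,
    falling back to basic cleaning when detection or model loading fails."""
    try:
        from langdetect import detect
        import spacy
        model = SPACY_MODELS.get(detect(text))
        if model is None:
            return basic_clean(text)
        nlp = spacy.load(model)
        return " ".join(tok.lemma_.lower() for tok in nlp(text)
                        if not tok.is_punct)
    except Exception:
        return basic_clean(text)
```

Wrapping detection in a broad fallback guarantees the property described above: no text field is left unprocessed, whatever its language.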
- Problem: Fetching metadata from the Google Books API for thousands of books was slow and time-consuming.
- Solution:
- Parallelized the API requests using Python's `multiprocessing` module, significantly reducing the runtime.
- Cached results to avoid redundant API calls for the same ISBNs.
- Problem: Evaluating recommendation quality was tricky due to the lack of explicit feedback from users.
- Solution:
- Used standard metrics like Precision@10 and Recall@10 to measure the system’s performance.
- Implemented 3-fold cross-validation to ensure robust evaluation.
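For reference, the per-user metrics can be computed as follows (a standard formulation; in the project these are averaged over users and over the 3 cross-validation folds):

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and Recall@k for one user.

    recommended: ranked list of recommended book IDs.
    relevant: set of held-out book IDs the user actually borrowed.
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```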
We started by cleaning and preprocessing both datasets. For the items dataset, we:
- Handled Missing Values: Filled in missing fields using placeholders or enriched data from the Google Books API.
- Text Cleaning: Removed non-standard symbols, lemmatized text, and standardized formats.
- Language Detection: Applied language-specific preprocessing to improve text quality.
The interactions dataset required minimal preprocessing, but we carefully analyzed its structure to identify patterns.
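The missing-value and standardization steps for the items dataset can be sketched as below (the placeholder values are illustrative; in the real pipeline, missing fields were also filled from the Google Books API where an ISBN was available):

```python
import pandas as pd

# Illustrative placeholders; "und" is the ISO code for undetermined language.
PLACEHOLDERS = {"Author": "Unknown Author", "Language": "und", "Subjects": ""}

def preprocess_items(items: pd.DataFrame) -> pd.DataFrame:
    """Fill metadata gaps with explicit placeholders so downstream text
    processing never sees NaN, then trim stray whitespace."""
    out = items.fillna(value=PLACEHOLDERS)
    for col in out.select_dtypes(include="object"):
        out[col] = out[col].str.strip()
    return out
```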
To overcome the lack of metadata, we enriched the dataset using the Google Books API. This added valuable fields like book descriptions and categories, enabling better content-based recommendations. The augmented data was then reprocessed for consistency.
Through EDA, we discovered key insights:
- Borrowing patterns were highly uneven, with some books being borrowed far more frequently than others.
- The dataset’s multilingual nature posed unique challenges but also added diversity to our recommendations.
- Collaborative Filtering:
- We implemented User-User Collaborative Filtering and Item-Item Collaborative Filtering to analyze user interaction patterns and item similarities.
- These methods rely on user borrowing history to compute similarities and recommend books. However, due to the sparse nature of user-item interaction data, their performance was limited.
- To improve results, we combined user-based and item-based CF into a Hybrid CF model, which outperformed individual collaborative filtering methods.
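The item-item variant can be sketched as follows, assuming a binary users-by-items interaction matrix and cosine similarity between item columns (function names and the masking of already-borrowed books are our illustrative choices):

```python
import numpy as np

def item_item_scores(interactions: np.ndarray) -> np.ndarray:
    """interactions: binary users x items matrix (float). Returns a
    users x items score matrix from item-item cosine similarity."""
    norms = np.linalg.norm(interactions, axis=0, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero
    normalized = interactions / norms
    item_sim = normalized.T @ normalized     # items x items cosine similarity
    np.fill_diagonal(item_sim, 0.0)          # ignore self-similarity
    return interactions @ item_sim           # users x items scores

def recommend(interactions: np.ndarray, user: int, k: int = 10):
    scores = item_item_scores(interactions)[user]
    scores[interactions[user] > 0] = -np.inf  # mask already-borrowed books
    return np.argsort(-scores)[:k]
```

User-user CF is the transpose of the same idea (similarity between user rows); the hybrid CF model blends the two score matrices.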
- Content-Based Filtering:
- For content-based filtering, we utilized transformer models:
- bert-base-uncased: Extracted dense embeddings from book metadata, such as titles, descriptions, and categories. Its strong contextual understanding allowed it to identify meaningful relationships between books.
- distilbert-base-uncased: A lighter and faster version of BERT that provided competitive results while requiring fewer computational resources.
- xlm-roberta-base: Designed for multilingual datasets, it generalized well across languages but performed slightly worse on this predominantly English dataset.
- These models generated rich embeddings, enabling semantic comparisons between books and enhancing recommendation quality.
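The similarity step over these embeddings can be sketched as below. The `embed_texts` docstring shows one common way to get embeddings from `bert-base-uncased` (mean-pooling the last hidden state); whether the project used mean pooling or the CLS token is an assumption on our part:

```python
import numpy as np

def embed_texts(texts):
    """Placeholder: in the project, embeddings come from a transformer,
    e.g. with Hugging Face transformers:

        from transformers import AutoTokenizer, AutoModel
        import torch
        tok = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModel.from_pretrained("bert-base-uncased")
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch).last_hidden_state.mean(dim=1)
        return out.numpy()
    """
    raise NotImplementedError("requires a transformer model")

def most_similar(embeddings: np.ndarray, idx: int, k: int = 5):
    """Cosine similarity of book `idx` against all books; top-k indices."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf  # never recommend the book itself
    return np.argsort(-sims)[:k]
```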
- Hybrid Model:
- To leverage the strengths of both collaborative filtering and content-based filtering, we developed a Hybrid Model that combined:
- User and item-based collaborative filtering similarities.
- Transformer-generated content embeddings.
- This approach resulted in a more robust and accurate recommendation system.
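One simple way to combine the two signals is a weighted blend of normalized scores; the min-max normalization and the `alpha` weighting below are our illustrative assumptions about how such a blend can work, not the project's exact formula:

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1] so CF and content scores are comparable."""
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

def hybrid_scores(cf: np.ndarray, content: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Weighted blend of collaborative and content-based scores
    for one user across all candidate books."""
    return alpha * minmax(cf) + (1 - alpha) * minmax(content)
```

Tuning `alpha` trades off the two signals: closer to 1 favors borrowing-pattern evidence, closer to 0 favors metadata similarity, which is useful for users with sparse histories.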
We evaluated our models using Precision@10 and Recall@10 metrics to assess their effectiveness. The results are summarized below:
| Model | Precision@10 | Recall@10 |
|---|---|---|
| User-User CF | 0.0398 | 0.3021 |
| Item-Item CF | 0.0378 | 0.2814 |
| Hybrid CF | 0.0418 | 0.3115 |
| bert-base-uncased | 0.0692 | 0.2426 |
| distilbert-base-uncased | 0.0691 | 0.2417 |
| xlm-roberta-base | 0.0689 | 0.2408 |
Among the models we evaluated, bert-base-uncased performed the best, demonstrating its ability to deeply understand the context within metadata, especially for English text. This strong contextual understanding allowed it to capture meaningful relationships between items and users, leading to the highest Precision@10 (0.0692) and Recall@10 (0.2426) scores in our tests.
Additionally:
- bert-base-uncased was particularly effective in handling our multilingual and sparse dataset, showcasing its robustness in extracting meaningful embeddings even from limited or diverse data.
- distilbert-base-uncased, as a lighter and faster version of BERT, offered a good balance of speed and accuracy but fell slightly behind due to its reduced capacity to capture deeper context.
- xlm-roberta-base, while powerful for multilingual scenarios, was less effective in this specific dataset, as its broader generalization came at the expense of precision.
Based on our findings:
- bert-base-uncased:
- Recommended for handling multilingual and sparse datasets, particularly those dominated by English text.
- Best choice when accuracy and robust contextual understanding are priorities.
- distilbert-base-uncased:
- Suitable for scenarios where computational resources are limited.
- Provides a strong balance between speed and accuracy.
- xlm-roberta-base:
- Best suited for multilingual datasets where language diversity is a critical factor.
- Performs well when generalization across languages is prioritized.
- Hybrid CF:
- Effective in sparse interaction scenarios by combining collaborative filtering methods.
- Achieved the best performance among CF approaches with Precision@10 = 0.0418 and Recall@10 = 0.3115.
- User 6:
- Borrowed: "La Suisse et l'esclavage des Noirs" → Recommended: "Héritages coloniaux : les Suisses d'Algérie" (similar author/genre).
- User 9:
- Borrowed: "Antiquitates rerum divinarum" → Recommended: "Before the collapse : a guide to the other side of growth" (genre mismatch).
- Submissions that included previously borrowed books in the recommendations scored significantly higher.
- A simple submission that recommended books users had already read (filled with popular books when necessary) achieved a score of 0.1580, outperforming more complex models.
- Sparse interactions limited the model's ability to generalize.
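The high-scoring simple submission can be sketched as follows (helper name and exact padding order are our illustrative choices):

```python
def baseline_recommend(history, popular_books, k=10):
    """Recommend the user's own previously borrowed books first, then pad
    with the most popular titles they have not borrowed."""
    recs = list(dict.fromkeys(history))[:k]   # dedupe, preserve order
    for book in popular_books:
        if len(recs) >= k:
            break
        if book not in recs:
            recs.append(book)
    return recs
```

That such a baseline beat the learned models is a telling symptom of the competition metric rewarding re-reads and of the sparsity of the interaction data.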
To make our system accessible, we developed a web application using Streamlit. NextRead offers:
- A login system for personalized access.
- Borrowing history display.
- Tailored book recommendations with metadata and an optional downloadable CSV file.
NextRead successfully combined collaborative filtering and transformer-based content models to create a robust recommendation system. Despite the challenges, bert-base-uncased emerged as the top performer, especially with enriched metadata.
- Incorporate implicit feedback (e.g., user ratings).
- Enhance multilingual support for global scalability.
- Optimize API fetching and caching for larger datasets.
Watch our video presentation here.
```shell
# Python
# Ensure Python version 3.7 or higher is installed on your system.

# Required Libraries
pip install pandas numpy seaborn matplotlib scikit-learn torch transformers requests Pillow langdetect spacy

# SpaCy Language Models
# Install the necessary SpaCy language models based on your dataset:
python -m spacy download en_core_web_sm
python -m spacy download fr_core_news_sm

# Add additional models if needed:
python -m spacy download de_core_news_sm
python -m spacy download es_core_news_sm
```