Conventional recommenders, while highly effective, rely on large distributed systems pre-trained on aggregate user data; incorporating new data requires long training cycles, making them slow to adapt to real-time user feedback and often opaque about their recommendation rationale. We explore the performance of smaller personal models trained on per-user data using Weightless Neural Networks (WNNs), an alternative to backpropagation that enables continuous learning by using the neural network as a state machine rather than a system with pre-trained weights. We contrast our approach against a classic weighted system (also at the per-user level) and the industry standard, collaborative filtering, achieving similar levels of accuracy on the MovieLens dataset.
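As background, here is a minimal sketch of the weightless idea in the WiSARD style: training writes addresses into RAM nodes (pure state updates), so the model can absorb feedback one example at a time with no gradient step. This is an illustration of the general technique under assumed sizes (16 input bits, 4-bit tuples), not this repository's implementation.

```python
import random

class Discriminator:
    """WiSARD-style discriminator: a set of RAM nodes, each watching a
    fixed random tuple of input bits."""

    def __init__(self, input_bits, tuple_size, seed=0):
        rng = random.Random(seed)
        order = list(range(input_bits))
        rng.shuffle(order)  # fixed random mapping of input bits to RAM nodes
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, input_bits, tuple_size)]
        self.rams = [set() for _ in self.tuples]  # each RAM remembers seen addresses

    def _addresses(self, bits):
        for idx, tup in enumerate(self.tuples):
            yield idx, tuple(bits[i] for i in tup)

    def train(self, bits):
        for idx, addr in self._addresses(bits):
            self.rams[idx].add(addr)  # training = writing state, not weights

    def score(self, bits):
        # Response = how many RAM nodes recognize their piece of the input.
        return sum(addr in self.rams[idx] for idx, addr in self._addresses(bits))

# One discriminator per class (e.g., "liked" vs "disliked"); predict the
# class whose discriminator fires the most RAM nodes.
liked, disliked = Discriminator(16, 4), Discriminator(16, 4)
liked.train([1, 0] * 8)
print(liked.score([1, 0] * 8), disliked.score([1, 0] * 8))  # -> 4 0
```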
Install the dependencies:

```bash
pip install -r requirements.txt
```
This program consists of 2 broad categories:
- Per-user level recommendation code (`PerUser` folder)
- Collaborative filtering code (`Collaborative` folder)
Along with the code for the different models, there are 2 additional programs:
- To download and preprocess the data that we need for our models
- To compare all the models and save the results in a .csv file.
A line-by-line breakdown of the most important lines of code can be found in the comments of the file each function belongs to. This section gives an overview of what each file does and describes a few important functions.
- We first define all the functions to binary-encode the columns we are interested in:
  - `encode_genres()`: encodes the genres the movie belongs to
  - `encode_lang()`: encodes the original language the movie was released in
  - `encode_vote_count()`: encodes the number of reviews the movie has in general
  - `encode_vote_average()`: encodes the overall rating the movie has
- We use binary encodings for all our input columns because weightless neural networks take only binary inputs and produce only binary outputs (a toy example follows this list).
- We then define the `data_prep()` function that prepares the data on which we train all 3 models (weightless neural networks, weighted neural networks, and collaborative filtering).
- We first use `kagglehub` to download the data; if the data has already been downloaded, the saved copy is used instead.
- We then load 2 datasets from the movies dataset: `movies_metadata`, which has all the information about each movie (actors, directors, language, genre, etc.), and the `ratings` dataset, which has all the ratings given by each user to the movies they've watched.
- We handle all the missing values in `movies_metadata` and filter to the most popular movies if `top_percentile` is specified in the `data_prep` parameters.
- There are 2 important parameters that we need to specify in the function: `num_user` and `reviews_per_user`. `num_user` gives us the number of unique users we want in our final dataset, and `reviews_per_user` specifies how many reviews each of those unique users should have. Depending on the values of both, there are 4 possible cases (sketched in code after this list):
  - If `reviews_per_user` is None or 0, AND `num_user` is None or 0, then we take all the reviews from all the users.
  - If `reviews_per_user` is None or 0, AND `num_user` is not None or 0, then we take all the reviews from `num_user` randomly sampled users.
  - If `reviews_per_user` is not None or 0, AND `num_user` is not None or 0, then we sample `reviews_per_user` reviews from `num_user` randomly sampled users.
  - If `reviews_per_user` is not None or 0, AND `num_user` is None or 0, then we sample `reviews_per_user` reviews from all the users with >= `reviews_per_user` reviews.
- Once we filter the number of users and the number of reviews for each user, we then binary encode all the necessary columns according to the functions we defined earlier.
- If the `per_user` parameter is False, we return the merged data after the previous processing.
- If the `per_user` parameter is True, then `Users_data` is returned. `Users_data` is a list where each element represents a user; each element is itself a list holding the encodings and label for that user paired with a movie. For example: `[[[user1, movie1, rating1], [user1, movie2, rating2]], [user2, ...], [user3, ...]]`.
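As a toy illustration of the binary encodings mentioned above, the snippet below shows the two common flavors such a pipeline needs: one-hot bits for a categorical column and thermometer bits for a numeric one. The genre vocabulary, bucket count, and function names are assumptions for illustration, not the repository's actual `encode_*` implementations.

```python
# Hypothetical encoders, illustrative only (not the repository's encode_*
# functions). Genre vocabulary and bucket count are assumptions.
GENRES = ["Action", "Comedy", "Drama", "Horror"]

def encode_genres_example(movie_genres):
    # One-hot: one bit per known genre, 1 if the movie has it.
    return [int(g in movie_genres) for g in GENRES]

def encode_vote_average_example(vote_average, buckets=10):
    # Thermometer: a rating in [0, 10] lights up the first k bits, which
    # preserves ordering information in a purely binary feature.
    k = int(round(vote_average))
    return [1] * k + [0] * (buckets - k)

print(encode_genres_example({"Comedy", "Drama"}))  # [0, 1, 1, 0]
print(encode_vote_average_example(7.8))            # [1]*8 + [0]*2
```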
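And here is a minimal sketch of the four sampling cases listed above, assuming `ratings` is a pandas DataFrame with a `userId` column; it illustrates the selection logic, not the repository's actual `data_prep()` code.

```python
import pandas as pd

def sample_reviews(ratings: pd.DataFrame, num_user=None, reviews_per_user=None):
    """Illustrative only: mirrors the four cases described above."""
    if not reviews_per_user and not num_user:
        return ratings                      # case 1: all reviews, all users
    if not reviews_per_user:
        # case 2: all reviews from num_user randomly sampled users
        users = pd.Series(ratings["userId"].unique()).sample(num_user)
        return ratings[ratings["userId"].isin(users)]
    # cases 3 and 4: only users with enough reviews are eligible
    counts = ratings["userId"].value_counts()
    pool = ratings[ratings["userId"].isin(counts[counts >= reviews_per_user].index)]
    if num_user:
        # case 3: restrict further to num_user randomly sampled eligible users
        users = pd.Series(pool["userId"].unique()).sample(num_user)
        pool = pool[pool["userId"].isin(users)]
    # draw exactly reviews_per_user reviews per remaining user
    return pool.groupby("userId", group_keys=False).apply(
        lambda g: g.sample(reviews_per_user))
```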
It has one function, `run_ao_model()`, which takes in `num_users`, `reviews_per_user`, and `split`:

- `num_users` takes an integer and is the number of users you want to run the model for (or the number of individual models you want to test). If 0 or None, all the users are taken.
- `reviews_per_user` takes an integer and is the number of each user's reviews you want to consider. If 0 or None, all the reviews are used to train/test the model.
- `split` is the ratio of the training set to the entire dataset. The default value is 0.8, so an 80-20 train/test split is used.
- The function returns 3 values (a usage example follows the list):
- the mean accuracy of all the users
- the median accuracy of all the users
- the time taken for the model to go through all the users
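A hypothetical usage, assuming `run_ao_model()` is importable from `perUser.py` with the signature described above (the import path is an assumption):

```python
from perUser import run_ao_model  # assumed import path

# 100 users, 50 reviews each, 80-20 train/test split
mean_acc, median_acc, elapsed = run_ao_model(num_users=100,
                                             reviews_per_user=50,
                                             split=0.8)
print(f"mean={mean_acc:.3f}  median={median_acc:.3f}  time={elapsed:.1f}s")
```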
It also has just one function, with exactly the same parameters as the one in perUser.py; it does the exact same thing and returns the exact same values.
It also has only one function, but with 2 arguments: `num_users` and `reviews_per_user`. Unlike the other two functions, it returns only the average accuracy and the time taken by the model to train and test.
Program to run all the models and save the results to a .csv file.
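A minimal sketch of what this comparison script might look like, assuming the three entry points described above; `run_weighted_model` and `run_collab_model` are placeholder names, not the repository's actual identifiers.

```python
import csv

from perUser import run_ao_model  # assumed import path
# from <weighted module> import run_weighted_model  # placeholder name
# from <collab module> import run_collab_model      # placeholder name

def compare_models(num_users=100, reviews_per_user=50, out_path="results.csv"):
    rows = []
    # Per-user models return (mean_acc, median_acc, time_s).
    for name, run in [("weightless", run_ao_model),
                      ("weighted", run_weighted_model)]:
        mean_acc, median_acc, t = run(num_users, reviews_per_user)
        rows.append([name, mean_acc, median_acc, t])
    # Collaborative filtering returns only (avg_acc, time_s); leave the
    # median column blank for it.
    avg_acc, t = run_collab_model(num_users, reviews_per_user)
    rows.append(["collaborative", avg_acc, "", t])
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "mean_accuracy", "median_accuracy", "time_s"])
        writer.writerows(rows)

if __name__ == "__main__":
    compare_models()
```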