Problem: Research and develop a recommendation system that generates a personalized list of products likely to suit a customer.
Python • Pandas • Data Preparation • Visualization • Collaborative-Filtering
March 2023
Solo Project
Popularity-Based Recommender: It offers generalized recommendations to every user, based on product popularity. This system recommends the same products to all users and does not personalize its recommendations.
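A popularity-based recommender can be sketched in a few lines of Pandas. This is only an illustration on made-up data: the toy ratings, the function name `most_popular`, and the choice to rank by number of ratings (ties broken by mean rating) are assumptions, not the project's actual definition of popularity.

```python
import pandas as pd

# Hypothetical toy ratings; the values are invented for illustration.
ratings = pd.DataFrame({
    "userID":    ["u1", "u2", "u3", "u1", "u2", "u3"],
    "productID": ["p1", "p1", "p1", "p2", "p2", "p3"],
    "rating":    [5, 4, 5, 3, 4, 5],
})

def most_popular(ratings: pd.DataFrame, n: int = 2) -> list:
    """Rank products by number of ratings, breaking ties by mean rating."""
    stats = ratings.groupby("productID")["rating"].agg(["count", "mean"])
    ranked = stats.sort_values(["count", "mean"], ascending=False)
    return ranked.head(n).index.tolist()

# Every user receives this same list, which is why it is not personalized.
top_products = most_popular(ratings)
```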
Content-Based Recommender: It builds an engine that computes similarity between products based on attributes such as their descriptions, and suggests the products most similar to one that a user liked. The general idea behind these recommender systems is that a person who liked a particular item is likely to also like an item that is similar to it.
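As a rough sketch of the content-based idea, the snippet below compares products by the words in their descriptions. The descriptions, the helper names `cosine` and `most_similar`, and the plain word-count representation are all assumptions made for illustration; a real engine would more likely use something like TF-IDF.

```python
import math
from collections import Counter

# Hypothetical product descriptions, invented for illustration.
descriptions = {
    "p1": "wireless bluetooth headphones with noise cancelling",
    "p2": "wired headphones with microphone",
    "p3": "stainless steel water bottle",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(liked: str, descriptions: dict) -> str:
    """Return the product whose description is closest to the liked one."""
    vectors = {pid: Counter(text.split()) for pid, text in descriptions.items()}
    others = [pid for pid in vectors if pid != liked]
    return max(others, key=lambda pid: cosine(vectors[liked], vectors[pid]))

suggestion = most_similar("p1", descriptions)
```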
Assumptions:
Users with similar interests will have similar preferences
What it does:
Makes filtering decisions for an individual user based on the judgements of other users: it infers an individual’s interests and preferences from those of other ‘similar’ users
User-User: Find users similar to a user U and use their ratings to predict the ratings of U
Item-Item: Find products similar to P and use the ratings given to those products to predict the rating for P
Item-Item CF often outperforms User-User CF because items are simpler than users. Items can generally be classified under categories or genres that help us establish ‘similar items’, whereas users are all unique, have different tastes and preferences, and are therefore difficult to categorize. The similarity established under item-item CF is hence more meaningful.
The dataset had variables such as time of review, name of reviewers, text along with rating, etc. Since we are not doing any text analysis and only need the ratings, we can drop all columns except userID, productID and rating given.
Due to computational restrictions because of working in Google Colaboratory, I decided to cut down the dataset and keep only those products that had over 100 reviews, which also helps make the data more robust and gives more accurate and relevant recommendations.
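The two preparation steps above could look roughly like this in Pandas. The column names `userID`, `productID` and `rating` come from the description; the extra columns, the toy values, and the function name `prepare_ratings` are hypothetical, and the threshold is lowered in the example only so the toy data shows the effect.

```python
import pandas as pd

def prepare_ratings(reviews: pd.DataFrame, min_reviews: int = 100) -> pd.DataFrame:
    """Keep only the rating columns and products with more than min_reviews reviews."""
    # Drop everything except the three columns needed for collaborative filtering.
    ratings = reviews[["userID", "productID", "rating"]]
    # Number of reviews per product, aligned row by row with the ratings.
    counts = ratings.groupby("productID")["rating"].transform("count")
    return ratings[counts > min_reviews].reset_index(drop=True)

# Hypothetical toy data; the real dataset has many more rows and columns.
reviews = pd.DataFrame({
    "userID":     ["u1", "u2", "u3", "u1"],
    "productID":  ["p1", "p1", "p1", "p2"],
    "rating":     [5, 4, 3, 2],
    "reviewText": ["great", "good", "ok", "poor"],
    "reviewTime": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04"],
})

small = prepare_ratings(reviews, min_reviews=2)
```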
Most of the products are highly rated (>4.3/5) and have fewer than 1,000 reviews.
• We make a matrix of user ratings with UserID on one axis and ProductID on the other. The cells are filled with the ratings.
• Calculate the average rating for each product. Normalize the data by subtracting each product's average rating from the ratings it received.
• We calculate a cosine similarity matrix on the normalized data, called mean-centered cosine similarity. It gives each pair of products a value from -1 to 1, with -1 meaning the two products are rated in opposite ways and 1 meaning very high product similarity.
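The three steps above can be sketched with Pandas and NumPy as follows. The toy ratings and the function name `item_similarity` are made up, and treating missing ratings as 0 after mean-centering (that is, as "average") is an assumption of this sketch rather than something the write-up specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, invented for illustration.
ratings = pd.DataFrame({
    "userID":    ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "productID": ["p1", "p2", "p1", "p2", "p1", "p2", "p3"],
    "rating":    [5, 4, 1, 2, 3, 3, 5],
})

def item_similarity(ratings: pd.DataFrame) -> pd.DataFrame:
    """Mean-centered cosine similarity between products."""
    # Step 1: rating matrix with products as rows and users as columns.
    matrix = ratings.pivot_table(index="productID", columns="userID", values="rating")
    # Step 2: subtract each product's average rating; unrated cells become 0.
    centered = matrix.sub(matrix.mean(axis=1), axis=0).fillna(0.0)
    # Step 3: cosine similarity between the centered product vectors.
    values = centered.to_numpy()
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = values / norms
    return pd.DataFrame(unit @ unit.T, index=matrix.index, columns=matrix.index)

sim = item_similarity(ratings)
```

In this toy data p1 and p2 are rated high and low by the same users, so their similarity is 1, while p3 has a single rating and ends up with similarity 0 to everything.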
To predict rating for product P by user U:
• Get list of products bought and rated by U
• Rank similarities between P and user-rated products
• Prepare a neighborhood of products with highest similarity scores to P
• Calculate the predicted rating for P by U as the weighted average of U's ratings for the neighborhood products, using each product's similarity to P as its weight
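A minimal sketch of this prediction step is shown below. The function name `predict_rating`, the neighborhood size `k`, and the decision to keep only positively similar products are assumptions for illustration; the project may have chosen its neighborhood differently.

```python
def predict_rating(user_ratings: dict, similarities: dict, k: int = 2):
    """Predict U's rating for a target product P.

    user_ratings: productID -> rating that U gave.
    similarities: productID -> similarity between that product and P.
    """
    # Rank the products U has rated by similarity to P and keep the top k.
    neighbours = sorted(
        (pid for pid in user_ratings if similarities.get(pid, 0.0) > 0),
        key=lambda pid: similarities[pid],
        reverse=True,
    )[:k]
    weight_sum = sum(similarities[pid] for pid in neighbours)
    if weight_sum == 0:
        return None  # no similar rated products, so no prediction
    # Similarity-weighted average of U's ratings over the neighborhood.
    return sum(similarities[pid] * user_ratings[pid] for pid in neighbours) / weight_sum

# Hypothetical numbers: U rated a, b and c; similarities are relative to P.
user_ratings = {"a": 5, "b": 3, "c": 1}
similarities = {"a": 0.8, "b": 0.4, "c": 0.1}
predicted = predict_rating(user_ratings, similarities, k=2)
```

With these numbers the neighborhood is a and b, so the prediction is (0.8 × 5 + 0.4 × 3) / (0.8 + 0.4), roughly 4.33.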
• Get the user's rated items and find unrated items that are similar to them
• Predict the user's ratings for the unrated items
• Sort the predicted rating in descending order and return the top 5 recommended items
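Putting the recommendation steps together might look like the sketch below. The nested-dictionary similarity table, the function name `recommend`, and the parameters `n` and `k` are hypothetical; for simplicity this version scores every product the user has not rated instead of first narrowing down to similar candidates.

```python
def recommend(user_ratings: dict, sim: dict, n: int = 5, k: int = 2) -> list:
    """Return up to n unrated products with the highest predicted ratings.

    sim[p][q] is the similarity between products p and q.
    """
    predictions = {}
    for candidate in sim:
        if candidate in user_ratings:
            continue  # only unrated products are candidates
        # Neighborhood: the k rated products most similar to the candidate.
        neighbours = sorted(
            (p for p in user_ratings if sim[candidate].get(p, 0.0) > 0),
            key=lambda p: sim[candidate][p],
            reverse=True,
        )[:k]
        weight = sum(sim[candidate][p] for p in neighbours)
        if weight > 0:
            predictions[candidate] = (
                sum(sim[candidate][p] * user_ratings[p] for p in neighbours) / weight
            )
    # Sort by predicted rating, highest first, and keep the top n.
    return sorted(predictions, key=predictions.get, reverse=True)[:n]

# Hypothetical similarity table and user history.
sim = {
    "p1": {"p2": 0.9, "p3": 0.2, "p4": 0.1},
    "p2": {"p1": 0.9, "p3": 0.3, "p4": 0.5},
    "p3": {"p1": 0.2, "p2": 0.3, "p4": 0.7},
    "p4": {"p1": 0.1, "p2": 0.5, "p3": 0.7},
}
recommendations = recommend({"p1": 5, "p3": 2}, sim)
```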
Some Cons of Collaborative Filtering
• Popularity bias: Tends to recommend popular items.
• First rater: Cannot recommend unrated items.
• Sparsity: The user/product matrix is sparse because it is difficult to find a set of users who have rated the same set of items. Hence many ratings have to be predicted from only a few observed ones.
Potential Improvements (Realistic)
• Hybrid Approach: Develop a hybrid model using content-based recommendation and collaborative filtering.
• Text Analysis: Analyze user reviews to get a better understanding of the product and hence provide better recommendations.
Thanks again to Jianmo Li at UCSD for his work on recommender systems and for making the free ratings dataset among many others available online, and Anand Rajaraman for his lectures at Stanford University on Collaborative Filtering.