On GitHub: tomaspdc / amazon-recsys
Filters information
Discovers preferences on a subject
Measures similarities between users
So before jumping into the project itself, I'd like to briefly explain what a recommender system is. At the most basic level, a recommender system is just a system that filters information...
But it is also now widely used by companies worldwide to discover the preferences of their user base...
And also to measure how similar their users are to one another...
Which can lead to other interesting conclusions as well...
After these abstract explanations, I'd like to give you a few examples of recommender systems "in the wild".
On the movie and music side, several companies rely heavily on recommender systems, such as Netflix, Last.fm and Pandora...
Google also uses recommendations, both in its search engine and in other products like Google+, while eBay and Valve's Steam both recommend items based on your historical preferences.
If what we want to recommend is not objects but other people, Twitter recommends whom to follow, also based on your previous preferences. And of course the core of dating sites like Lulu, OkCupid and Match.com is recommending whom to date based on your tastes and historical preferences. I focused my attention on what most people agree is the canonical example of a recommender system at the core of a business: Amazon.com
Over 144 million active customer accounts (~2.27 times the population of the UK).
Over 222 million products on sale.
426 items sold per second (Christmas 2013).
Customers can review products they bought on a scale of 1 to 5
~34.5 million reviews.
~6.5 million users.
~2.5 million products.
Spanning from Jun 1995 to Mar 2013
Source: http://snap.stanford.edu/data/web-Amazon.html
Permission granted by Julian McAuley (jmcauley@cs.stanford.edu)
J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
product/productId: B00005X3U4
product/title: The voice of Bugle Ann
product/price: unknown
review/userId: A169ZYI77GT1F3
review/profileName: Janet K. May
review/helpfulness: 0/0
review/score: 5.0
review/time: 1288051200
review/summary: Childhood Memories
review/text: My husband remembered this as a little boy. He tried to find one in the library but they had none. What a surprise he had on his birthday and thoroughly enjoyed it again. Brought a lot of old memories and stories to be told.
This is what a raw review looks like in the dataset. Reviews are simply separated by an empty line, so they need to be cleaned.
...
B000HEKTIW::A7EWCPD8COL3X::5::4.99::2/2::1292889600
B00005AQF1::A19CQRD6DIHMQL::5::unknown::0/3::1124409600
B000DZH89I::A2POGVCWFR6738::2::unknown::0/0::1358208000
B0007HEURA::A3C2A3D2KG1F1A::5::unknown::2/2::1266796800
B0002DJNNA::A1MFR5PGMZFQPX::1::5.93::0/1::1290297600
B003Y6ID2Y::ATGPAY0V61JO7::5::2.99::0/0::1178928000
B00029BM6A::A7M0T2XJM74DN::5::unknown::0/0::1333929600
B0000DD75Q::A1BKIHESLDFD95::4::9.89::3/3::1180656000
B743504704::A1IE6VWY0U0VNT::3::unknown::0/0::1204156800
B000E0C6SK::A16QQ78I8J29PA::4::unknown::3/3::1275264000
...
And this is what we have after a simple transformation.
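The transformation from the raw format to the compact one can be sketched as follows. This is a minimal illustration and not necessarily the script used in the project: the function names `parse_reviews` and `to_compact` are hypothetical, and the field order (product, user, score, price, helpfulness, time) is inferred from the compact records shown above. The sample review text is shortened.

```python
import re

def parse_reviews(raw_text):
    """Yield one dict per review; reviews are separated by an empty line."""
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        record = {}
        for line in block.splitlines():
            # Split on the first ": " only, so colons inside the text survive.
            key, sep, value = line.partition(": ")
            if sep:
                record[key.strip()] = value.strip()
        if record:
            yield record

def to_compact(record):
    """Join the fields needed for rating prediction with '::'."""
    score = int(float(record["review/score"]))  # "5.0" -> 5
    fields = [
        record["product/productId"],
        record["review/userId"],
        str(score),
        record["product/price"],
        record["review/helpfulness"],
        record["review/time"],
    ]
    return "::".join(fields)

RAW = """product/productId: B00005X3U4
product/title: The voice of Bugle Ann
product/price: unknown
review/userId: A169ZYI77GT1F3
review/profileName: Janet K. May
review/helpfulness: 0/0
review/score: 5.0
review/time: 1288051200
review/summary: Childhood Memories
review/text: My husband remembered this as a little boy.
"""

compact_lines = [to_compact(r) for r in parse_reviews(RAW)]
print(compact_lines[0])
# B00005X3U4::A169ZYI77GT1F3::5::unknown::0/0::1288051200
```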
True Blind subset: Random sample of ~9.8 million reviews
Second Blind subset: Random sample of ~6.5 million reviews
Training/Test sets: 80%/20% in random incremental samples with step size of 100k reviews, from 100,500 reviews to 11 million reviews.
That is, a 100,500-review subset, a 200,500-review subset, and so on.
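The incremental subsets and the 80%/20% split can be sketched like this. The helper names are hypothetical, and the exact upper bound of the series is an assumption: the slides only say the sizes run "to 11 million reviews", so the sketch stops at the last size that does not exceed 11,000,000.

```python
import random

def subset_sizes(first=100_500, step=100_000, upper=11_000_000):
    """Sizes of the incremental samples: 100,500, 200,500, ... up to `upper`."""
    return list(range(first, upper + 1, step))

def train_test_split(reviews, train_fraction=0.8, seed=0):
    """Shuffle a copy of the reviews and cut it into train/test parts."""
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

sizes = subset_sizes()
print(sizes[:3])  # [100500, 200500, 300500]

# Toy stand-in for a sampled subset of reviews.
train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```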
One SVD model was computed for each Train/Test subset.
5-fold averages of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
MAE and RMSE were measured against the Second Blind subset.
The model minimizing MAE on the Second Blind subset was chosen.
MAE and RMSE were measured against the True Blind subset.
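The slides do not say which SVD implementation was used. As an illustration only, here is a minimal sketch of one common approach in recommender systems: a latent-factor model with user and item biases, trained by stochastic gradient descent on the observed ratings. The function name `train_svd`, the hyperparameters and the toy ratings are all assumptions made for this example.

```python
import random

def train_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """ratings: list of (user, item, score). Returns predict(user, item)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}  # user biases
    bi = {i: 0.0 for i in items}  # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(a * b for a, b in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Gradient step with L2 regularization on every parameter.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)

    def predict(u, i):
        # Unknown users or items fall back to the mean plus known biases.
        est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
        if u in p and i in q:
            est += sum(a * b for a, b in zip(p[u], q[i]))
        return min(5.0, max(1.0, est))  # clip to the 1-5 star scale

    return predict

# Toy data, invented for this example.
ratings = [("u1", "a", 5), ("u1", "b", 4), ("u2", "a", 5),
           ("u2", "c", 1), ("u3", "b", 4), ("u3", "c", 2)]
predict = train_svd(ratings)
train_mae = sum(abs(r - predict(u, i)) for u, i, r in ratings) / len(ratings)
print(round(train_mae, 3))
```

A user or item never seen in training gets a prediction built only from the global mean and whatever bias is known, which is one simple way to see why such a model needs many reviews before it becomes useful.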
MODEL/ID: 7500500 MAE: 0.766031 RMSE: 1.596316
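For reference, the two error measures reported here can be computed as in the sketch below. The ratings in the example are made up purely to show the arithmetic.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average size of the miss, in stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: like MAE, but large misses weigh more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
print(mae(actual, predicted))             # 1.0
print(round(rmse(actual, predicted), 4))  # 1.2247
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few predictions miss by a lot, which is consistent with an RMSE of ~1.60 next to an MAE of ~0.77.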
Baseline model: guess at random, weighted:
1 "star" ~7.62% 2 "stars" ~5.13% 3 "stars" ~8.55% 4 "stars" ~19.38% 5 "stars" ~59.29%
MODEL/ID: WEIGHTED-RANDOM MAE: 3.93757 RMSE: 4.192413
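One way to build such a weighted-random guesser is sketched below, using the star percentages listed above as sampling weights. The slides do not describe how the baseline was implemented or scored, so this is an assumption and the figures it produces need not match the ones reported here.

```python
import random

# Star percentages from the slide, used as sampling weights.
STAR_WEIGHTS = {1: 7.62, 2: 5.13, 3: 8.55, 4: 19.38, 5: 59.29}

def weighted_random_predictions(n, weights=STAR_WEIGHTS, seed=0):
    """Draw n star ratings at random, each star chosen with its weight."""
    rng = random.Random(seed)
    stars = list(weights)
    return rng.choices(stars, weights=[weights[s] for s in stars], k=n)

guesses = weighted_random_predictions(20_000)
share_of_fives = guesses.count(5) / len(guesses)
print(round(share_of_fives, 2))
```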
MODEL/ID: 7500500 MAE: 0.766031 RMSE: 1.596316
The SVD model's MAE is ~5 times lower than that of the Weighted-Random model (~0.77 vs. ~3.94).
Big Data != Better Data
Model Storage Size
Predict Offline, Recommend Online
The current model can predict the rating a user will give to a product with a mean absolute error of ~0.77 "stars".
More data doesn't necessarily mean better results, although this model needed a considerable number of reviews to overcome the "cold-start" disadvantage of recommender systems (the final model was trained on ~7.5 million reviews).
An SVD model grows considerably in size (this one sits at ~1.1 GB) and cannot predict in real time without a considerable amount of processing power.
However, it is trivial to compute the predictions for all users periodically and, if needed, to re-train the model in parallel, which makes it a viable way to offer recommendations.
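The "predict offline, recommend online" idea can be sketched as a periodic batch job that stores the top-N items per user, plus a cheap lookup at request time. Everything here is hypothetical: the function names, the toy `scores` table standing in for the trained model, and the choice to skip items the user has already reviewed.

```python
import heapq

def precompute_top_n(predict, users, items, seen, n=2):
    """Offline step: score every unseen item per user and keep the best n."""
    table = {}
    for user in users:
        already = seen.get(user, set())
        candidates = [item for item in items if item not in already]
        table[user] = heapq.nlargest(
            n, candidates, key=lambda item: predict(user, item))
    return table

def recommend(table, user):
    """Online step: a dictionary lookup, no model evaluation needed."""
    return table.get(user, [])

# Toy predicted ratings standing in for the trained model.
scores = {("u1", "a"): 4.5, ("u1", "b"): 3.0, ("u1", "c"): 4.9,
          ("u2", "a"): 2.0, ("u2", "b"): 4.0, ("u2", "c"): 1.5}
seen = {"u1": {"c"}}  # u1 already reviewed item c

table = precompute_top_n(lambda u, i: scores[(u, i)],
                         ["u1", "u2"], ["a", "b", "c"], seen)
print(recommend(table, "u1"))  # ['a', 'b']
print(recommend(table, "u2"))  # ['b', 'a']
print(recommend(table, "u3"))  # []
```

The expensive part runs on a schedule, so the heavy model never sits on the request path; the trade-off is that recommendations are only as fresh as the last batch run.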