Crab: A Python Framework for Building Recommender Systems
Transcript of Crab: A Python Framework for Building Recommender Systems
Crab: A Python Framework for Building Recommendation Engines
Marcel Caraciolo (@marcelcaraciolo)
Bruno Melo (@brunomelo)
Ricardo Caspirro (@ricardocaspirro)
PythonBrasil 2011, São Paulo, SP
What is Crab?
A Python framework for building recommendation engines
A scikit module for collaborative, content-based and hybrid filtering
A Mahout alternative for Python developers :D
Open source under the BSD license
https://github.com/muricoca/crab
When did it start?
It began one year ago
Community-driven, 4 members
Since April 2011, part of the Muriçoca open-source labs
https://github.com/muricoca/
Since April 2011, being rewritten as a scikit
Knowing Scikits
Scikits are SciPy Toolkits: independent projects hosted under a common namespace.
Scikits Image, Scikits MlabWrap, Scikits AudioLab, Scikit-Learn, ...
http://scikits.appspot.com/scikits
Knowing Scikits
Scikit-Learn
Machine learning algorithms + scientific Python packages (NumPy, SciPy and Matplotlib)
http://scikit-learn.sourceforge.net/
Our goal: turn Crab into a scikit and contribute parts of it to scikit-learn
Why Recommendations?
The world is an over-crowded place
Why Recommendations?
We are overloaded:
Thousands of news articles and blog posts each day
Millions of movies, books and music tracks online
Several places, offers and events
Even friends can overload us sometimes!
Why Recommendations?
We really need and consume only a few of them!
"A lot of times, people don't know what they want until you show it to them." (Steve Jobs)
"We are leaving the Information Age, and entering the Recommendation Age." (Chris Anderson, The Long Tail)
Why Recommendations?
Can Google help? Yes, but only when we really know what we are looking for.
Can Facebook help? Yes, I tend to find my friends' stuff interesting. But what does "interesting" mean? And what if I have only a few friends, and what they like does not always attract me?
Can experts help? Yes, but it won't scale well. And it is what they like, not what I like: the exact same advice for everyone!
Why Recommendations?
Recommendation systems: systems designed to recommend things I may like
Graph Representation
The current Crab
Collaborative filtering algorithms: User-Based, Item-Based and Matrix Factorization (SVD)
Evaluation of the recommender algorithms: Precision, Recall, F1-Score, RMSE
Precision-Recall charts
Collaborative Filtering
[Diagram: users Marcel, Rafael and Amanda "like" items (O Vento Levou, Thor, Armagedon, ToyStore); similarity between items yields "recommends" edges.]
The current Crab
>>> # load the dataset
>>> from crab.datasets import load_sample_movies
>>> data = load_sample_movies()
>>> data
{'DESCR': 'sample_movies data set was collected by the book called \nProgramming the Collective Intelligence by Toby Segaran \n\nNotes\n----- \nThis data set consists of\n\t* n ratings with (1-5) from n users to n movies.', 'data': {1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0}, 2: {1: 3.0, 2: 4.0, 3: 2.0, 4: 3.0, 5: 3.0, 6: 2.0}, 3: {2: 3.5, 3: 2.5, 4: 4.0, 5: 4.5, 6: 3.0}, 4: {1: 2.5, 2: 3.5, 3: 2.5, 4: 3.5, 5: 3.0, 6: 3.0}, 5: {2: 4.5, 3: 1.0, 4: 4.0}, 6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5}, 7: {1: 2.5, 2: 3.0, 4: 3.5, 5: 4.0}}, 'item_ids': {1: 'Lady in the Water', 2: 'Snakes on a Planet', 3: 'You, Me and Dupree', 4: 'Superman Returns', 5: 'The Night Listener', 6: 'Just My Luck'}, 'user_ids': {1: 'Jack Matthews', 2: 'Mick LaSalle', 3: 'Claudia Puig', 4: 'Lisa Rose', 5: 'Toby', 6: 'Gene Seymour', 7: 'Michael Phillips'}}
The current Crab
>>> from crab.models import MatrixPreferenceDataModel
>>> m = MatrixPreferenceDataModel(data.data)
>>> print m
MatrixPreferenceDataModel (7 by 6)
          1         2         3         4         5  ...
1  3.000000  4.000000  3.500000  5.000000  3.000000
2  3.000000  4.000000  2.000000  3.000000  3.000000
3       ---  3.500000  2.500000  4.000000  4.500000
4  2.500000  3.500000  2.500000  3.500000  3.000000
5       ---  4.500000  1.000000  4.000000       ---
6  3.000000  3.500000  3.500000  5.000000  3.000000
7  2.500000  3.000000       ---  3.500000  4.000000
The current Crab
>>> # import pairwise distance
>>> from crab.metrics.pairwise import euclidean_distances
>>> # import similarity
>>> from crab.similarities import UserSimilarity
>>> similarity = UserSimilarity(m, euclidean_distances)
>>> similarity[1]
[(1, 1.0), (6, 0.66666666666666663), (4, 0.34054242658316669), (3, 0.32037724101704074), (7, 0.32037724101704074), (2, 0.2857142857142857), (5, 0.2674788903885893)]
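If the euclidean similarity here follows the common 1 / (1 + distance) rescaling over co-rated items (an assumption; Crab's exact formula lives in crab.metrics.pairwise), the 0.666... score between users 1 and 6 can be reproduced by hand without Crab at all:

```python
import math

# Ratings as {user_id: {item_id: rating}}, matching the sample_movies layout above.
ratings = {
    1: {1: 3.0, 2: 4.0, 3: 3.5, 4: 5.0, 5: 3.0},
    6: {1: 3.0, 2: 3.5, 3: 3.5, 4: 5.0, 5: 3.0, 6: 1.5},
}

def euclidean_similarity(prefs, a, b):
    """Similarity in (0, 1]: 1 / (1 + Euclidean distance over co-rated items)."""
    common = set(prefs[a]) & set(prefs[b])
    if not common:
        return 0.0
    dist = math.sqrt(sum((prefs[a][i] - prefs[b][i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)

print(euclidean_similarity(ratings, 1, 6))  # ~0.666..., matching (6, 0.666...) above
```

The only co-rated item where the two users disagree is item 2 (4.0 vs 3.5), so the distance is 0.5 and the similarity 1 / 1.5.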
The current Crab
>>> from crab.recommenders.knn import UserBasedRecommender
>>> recsys = UserBasedRecommender(model=m, similarity=similarity, capper=True, with_preference=True)
>>> recsys.recommend(5)
array([[ 5.        ,  3.45712869],
       [ 1.        ,  2.78857832],
       [ 6.        ,  2.38193068]])
>>> recsys.recommended_because(user_id=5, item_id=1)
array([[ 2. ,  3. ],
       [ 1. ,  3. ],
       [ 6. ,  3. ],
       [ 7. ,  2.5],
       [ 4. ,  2.5]])
The current Crab
Using REST APIs to deploy the recommender: django-piston, django-rest, django-tastypie
Crab is already in production
News recommendations for the Abril publisher: collecting over 10 magazines, 20 books and 100+ articles
Running on Python + SciPy + Django
Easy-to-use interface
Content-based filtering still in development
Content-Based Filtering
[Diagram: user Marcel "likes" an item (O Vento Levou); similar items (Duro de Matar, Armagedon, ToyStore) are recommended.]
Crab is already in production
PythonBrasil keynotes recommender: recommending keynotes based on a hybrid approach
Running on Python + SciPy + Django
Schedule your keynotes
Content-based filtering + collaborative filtering
Still in development
source, the recommendation architecture that we propose will aggregate the results of such filtering techniques. We aim at integrating the previously mentioned hybrid product recommendation approach in a mobile application, so that users can benefit from useful and logical recommendations. Moreover, we aim at providing a suitable explanation for each recommendation, since current approaches only deliver product recommendations with an overall score, without pointing out the appropriateness of the recommendation [13]. Besides the basic information provided by the suppliers, the system will deliver an explanation with relevant reviews from similar users; we believe this will increase confidence in the buying decision process and the product acceptance rate. In the mobile context this approach could help users in that process, and showing other users' opinions could contribute to this task.
Fig. 1. Meta Recommender Architecture (content-based filtering over the services/products repositories and collaborative-based filtering over reviews written by similar users feed a meta recommender, which produces the recommendations)
Since one of the goals of this work is to incorporate different data sources of user opinions and descriptions, we have adopted a meta recommendation architecture. By using a meta recommender architecture, the system can provide personalized control over the generated recommendation list formed by the combination of rich data [16]. The influence of each specific data source can be explicitly controlled by evaluating past user interaction with the recommender to decide how to balance the different knowledge sources. For instance, if the product or service to be recommended has a rich structured description (e.g. a restaurant), then the system tends to rely more on the content-based filtering approach. Otherwise, if the product is poorly described (few attributes), then the system relies more on collaborative filtering techniques, that is, on the reviews from similar users. Figure 1 shows an overview of our meta recommender approach. Combining content-based filtering and collaborative-based filtering into a hybrid recommender system, it uses the services/products repositories that catalogue the services to be recommended, and the review repository that contains the user opinions about those services. All this data can be extracted from data source containers on the web, such as the location-based social network Foursquare [17], as displayed in Figure 2, and the location recommendation engine from Google, Google HotPot [18].
Fig. 2. User Reviews from Foursquare Social Network
The content-based filtering approach will be used to filter the product/service repository, while the collaborative-based approach will derive the product review recommendations. In addition, we will use text mining techniques to determine the polarity of each user review, positive or negative. This summarized information will contribute to the computation of the product recommendation score. The final product recommendation score is computed by integrating the results of both recommenders. We are currently considering different options for this integration; one in particular is the symbolic data analysis (SDA) approach [19], in which each product description and the user ratings/reviews are modeled as a set of modal symbolic descriptions that summarizes the information provided by the corresponding data sources. It is a novel approach in hybrid recommender systems which, in our domain, can encapsulate in entities the levels of influence of both user reviews and product descriptions.
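As a toy illustration of the balancing described above (our own sketch, not the paper's SDA formulation), the meta recommender's final score can be viewed as a weighted blend whose weight leans toward whichever knowledge source is richer for the item:

```python
def meta_score(content_score, collab_score, alpha):
    """Blend the two component scores; a larger alpha trusts content-based
    filtering more (e.g. for a richly described restaurant), a smaller alpha
    trusts collaborative filtering / user reviews more."""
    return alpha * content_score + (1.0 - alpha) * collab_score

# Richly described item: lean on the content-based score.
print(meta_score(0.9, 0.4, alpha=0.7))  # ~0.75
# Poorly described item: lean on the collaborative score.
print(meta_score(0.9, 0.4, alpha=0.2))  # ~0.50
```

The alpha parameter is hypothetical here; in the paper's design it would be derived from past user interaction with the recommender.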
B. Symbolic Recommendation Approach
Symbolic Data Analysis (SDA) is a research field that provides suitable tools to manage aggregated data described by multi-valued variables, where data table entries are sets of categories, ordered lists of categories, intervals or weight histograms [19]. It also provides approaches for information filtering algorithms such as content-based, collaborative-based and hybrid-based ones. The main idea is to represent the user profile, in our domain the product to be recommended, through symbolic data structures; the user and item correlations are then computed through dissimilarity functions adapted from the SDA domain. The advantage of using SDA-based information filtering methods in the context of recommendations is that the user description synthesizes the entire body of information taken from the item descriptions belonging to the user profile. Items are therefore described by histogram-valued symbolic data, so they can be compared through a dissimilarity function. Modeling both the user reviews and the product descriptions as histogram-valued symbolic data would meet our requirements, since our recommendations would be balanced by both structures. Bezerra and Carvalho proposed approaches whose results proved very promising [19].
III. SYSTEM DESIGN
The application data of our mobile recommender system can be divided into two parts: the product description (such as location, description and its attributes) and the reviews or ratings provided by the user (such as rating, comments, tags, etc.). Figure 3 shows the system's architecture and its components.
Fig. 3. Mobile Recommender System Architecture (the user sends queries, reviews and preferences through a web interface to a recommendation engine backed by product and review repositories, and receives back a ranked list of scored products)
In our mobile product/service recommender, the user can filter products or services and get a list of recommendations. The user can also enter his preferences or give feedback on an offered product recommendation. Other functionalities are the retrieval of the next five best recommendations, and the search for reviews satisfying given constraints (text length, date, review attitude) or containing certain keywords. Let us illustrate a typical use scenario of this recommender with a restaurant query example. For instance, a user wants to know good restaurants for eating Chinese food around his current location (the system could also integrate location-based services). Refining the query, the user also searches for places with highly positive reviews. Using the coordinate information to calculate the distance from his current location, the system provides a list of restaurants ranked by highly positive reviews, joined with summarized reviews written by the most similar users who reviewed the restaurant services. According to his personal preferences, the user then reviews the recommended restaurant list and selects the places he wishes to go. After visiting the places he selected, the user can enter his feedback, in the form of wishes or critiques, on the recommended service. This information is added to the reviews repository for those places, where it is processed and summarized by our recommender, extracting useful information such as the polarity and adding new keywords to the product feature vector.
IV. METHODOLOGY AND EXPECTED RESULTS
A. Methodology
Our research focuses on methodologies and techniques for improving the user acceptance of product and service recommendations and explaining these suggestions in the mobile context. To achieve this, we have used a new approach where both product descriptions and user reviews are incorporated into the mobile recommendation process. We believe this approach will give the user more confidence in the recommendations and a better understanding of the products, especially supporting him in the buying decision process. To accomplish this task, we will give an overview of the context in which mobile recommender systems are included and the main research concerns about the topic. We will also analyze and investigate approaches for filtering algorithms, so that we can base our design and implementation choices on previous failures or successes and reuse and adapt successful solutions. We will also propose our meta-recommender system, providing a detailed explanation of our recommender architecture, implementation, main processes, and crucial features. Finally, to validate it, we will use standard recommendation system measures, comparing our suggestions to mobile users on a real dataset extracted from the web and discussing the results.
B. Expected Results
We believe that our mobile product recommender incorporating user reviews will increase the user's trust in the recommended products, since he can read opinions, both positive and negative, from a group of similar users. Moreover, viewing other reviews, the user may feel more stimulated to share his own experiences and also obtain recognition from other users. Reviews also provide a better product understanding, since there will be more information for the user to decide whether the product is (or is not) suitable for him. Finally, presenting a list of recommendations along with explanations may provide "local hidden knowledge": knowledge that you gain only after purchasing a product or visiting a place. It cannot be found using
Crab is already in production
Hybrid Meta Approach
Crab is already in production
Brazilian social network Atepassar.com: an educational network with more than 60,000 students and 120 video classes
Running on Python + NumPy + SciPy and Django
Backend for recommendations: MongoDB (mongoengine)
Daily recommendations with explanations
Evaluating your recommender
Crab implements the most used recommender metrics: Precision, Recall, F1-Score, RMSE
Using matplotlib for a plotter utility
Implement new metrics
Simulation support, maybe (??)
Evaluating your recommender
>>> from crab.metrics.classes import CfEvaluator
>>> evaluator = CfEvaluator()
>>> evaluator.evaluate(recommender=recsys, metric='rmse')
{'rmse': 0.69467177857026907}
>>> evaluator.evaluate_on_split(recommender=recsys, at=2)
({'error': [{'mae': 0.345, 'nmae': 0.4567, 'rmse': 0.568},
            {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788},
            {'mae': 0.456, 'nmae': 0.356778, 'rmse': 0.6788}],
  'ir': [{'f1score': 0.456, 'precision': 0.78557, 'recall': 0.55677},
         {'f1score': 0.64567, 'precision': 0.67865, 'recall': 0.785955},
         {'f1score': 0.45070, 'precision': 0.74744, 'recall': 0.858585}]},
 {'final_score': {'avg': {'f1score': 0.495955, 'mae': 0.429292, 'nmae': 0.373739,
                          'precision': 0.63932929, 'recall': 0.729939393, 'rmse': 0.3466868},
                  'stdev': {'f1score': 0.09938383, 'mae': 0.0593933, 'nmae': 0.03393939,
                            'precision': 0.0192929, 'recall': 0.031293939, 'rmse': 0.234949494}}})
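The metric names above follow their standard definitions; as a minimal reference sketch (toy numbers, unrelated to the evaluator output above), RMSE and precision/recall for a recommended list reduce to a few lines:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between paired rating lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def precision_recall(recommended, relevant):
    """Precision and recall of a recommended list against the relevant set."""
    hits = len(set(recommended) & set(relevant))
    return hits / len(recommended), hits / len(relevant)

print(rmse([3.5, 4.0, 2.0], [3.0, 4.0, 3.0]))   # ~0.6455
print(precision_recall([1, 2, 3, 4], {2, 4, 6}))  # (0.5, 0.666...)
```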
Distributing the recommendation computations
Use Hadoop and MapReduce intensively
https://github.com/pfig/mrjob
Investigating the Yelp mrjob framework
Implement the Netflix Prize and other state-of-the-art techniques: matrix factorization, singular value decomposition (SVD), Boltzmann machines
The most commonly used is the Slope One technique: simple algebra, y = a*x + b
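Slope One is actually simpler than y = a*x + b suggests: its predictors have the form y = x + b, where b is the average rating deviation between two items over the users who rated both. A weighted Slope One sketch with made-up ratings (our own illustration, not a Crab API):

```python
from collections import defaultdict

# Toy ratings: {user: {item: rating}} (hypothetical data)
ratings = {
    'alice': {'A': 5.0, 'B': 3.0, 'C': 2.0},
    'bob':   {'A': 3.0, 'B': 4.0},
    'carol': {'B': 2.0, 'C': 5.0},
}

def slope_one_predict(prefs, user, target):
    """Predict user's rating for `target` from average item-item deviations,
    weighted by how many users co-rated each item pair."""
    diffs, counts = defaultdict(float), defaultdict(int)
    for u_ratings in prefs.values():
        if target in u_ratings:
            for j, r_j in u_ratings.items():
                if j != target:
                    diffs[j] += u_ratings[target] - r_j  # deviation target - j
                    counts[j] += 1
    num = den = 0.0
    for j, r_j in prefs[user].items():
        if j != target and counts[j]:
            num += (diffs[j] / counts[j] + r_j) * counts[j]
            den += counts[j]
    return num / den if den else None

print(slope_one_predict(ratings, 'bob', 'C'))  # ~3.33, the weighted estimate
```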
Cache/Parallelism with joblib
http://packages.python.org/joblib/index.html

from joblib import Memory
memory = Memory(cachedir='', verbose=0)

class UserSimilarity(BaseSimilarity):
    ...
    @memory.cache
    def get_similarity(self, source_id, target_id):
        source_preferences = self.model.preferences_from_user(source_id)
        target_preferences = self.model.preferences_from_user(target_id)
        return self.distance(source_preferences, target_preferences) \
            if not source_preferences.shape[1] == 0 \
            and not target_preferences.shape[1] == 0 \
            else np.array([[np.nan]])
    ...
    def get_similarities(self, source_id):
        return [(other_id, self.get_similarity(source_id, other_id))
                for other_id, v in self.model]

>>> # Without memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 978 ms per loop
>>> # With memory.cache
>>> timeit similarity.get_similarities('marcel_caraciolo')
100 loops, best of 3: 434 ms per loop
Cache/Parallelism with joblib
http://packages.python.org/joblib/index.html
Investigate how to use the multiprocessing and parallel packages for the similarity computations

from joblib import Parallel, delayed
...
def get_similarities(self, source_id):
    others = [other_id for other_id, v in self.model]
    scores = Parallel(n_jobs=3)(
        delayed(self.get_similarity)(source_id, other_id)
        for other_id in others)
    return zip(others, scores)
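The same fan-out can be sketched with the standard library alone; a thread-pool version using toy ratings and a simplified 1 / (1 + distance) similarity (both made up for illustration, independent of Crab and joblib):

```python
import math
from concurrent.futures import ThreadPoolExecutor

# Toy ratings: {user_id: {item_id: rating}}
ratings = {
    1: {1: 3.0, 2: 4.0},
    2: {1: 2.0, 2: 5.0},
    3: {1: 3.0, 2: 3.0},
}

def similarity(a, b):
    """Simplified 1 / (1 + Euclidean distance) over co-rated items."""
    common = set(ratings[a]) & set(ratings[b])
    dist = math.sqrt(sum((ratings[a][i] - ratings[b][i]) ** 2 for i in common))
    return 1.0 / (1.0 + dist)

def get_similarities(source_id):
    """Fan the pairwise computations out over a small worker pool."""
    others = list(ratings)
    with ThreadPoolExecutor(max_workers=3) as pool:
        scores = pool.map(lambda other: similarity(source_id, other), others)
    return list(zip(others, scores))

print(get_similarities(1))
```

Threads stand in for joblib's process workers here; for CPU-bound NumPy code, process-based parallelism (as joblib provides) avoids the GIL.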
Distributed Computing with mrJob
https://github.com/Yelp/mrjob
It supports Amazon's Elastic MapReduce (EMR) service, your own Hadoop cluster, or local execution (for testing)

"""The classic MapReduce job: count the frequency of words."""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        yield (word, sum(counts))

if __name__ == '__main__':
    MRWordFreqCount.run()
Distributed Computing with mrJob
Elsayed et al.: Pairwise Document Similarity in Large Collections with MapReduce
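Elsayed et al.'s idea is to group postings by term, so that only documents sharing a term generate partial similarity products. A single-process sketch of that two-pass dataflow (toy documents, raw term-frequency weights instead of the paper's weighting scheme):

```python
from collections import defaultdict
from itertools import combinations

docs = {
    'd1': 'crab builds recommender systems',
    'd2': 'python recommender systems',
    'd3': 'crab python framework',
}

# Pass 1 (indexing): map each document to (term, (doc, term frequency)) postings.
postings = defaultdict(list)
for doc_id, text in docs.items():
    counts = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        postings[term].append((doc_id, tf))

# Pass 2 (pairwise): for each term, emit partial dot products for every pair of
# documents in its postings list; the "reducer" sums them per document pair.
sims = defaultdict(int)
for term, plist in postings.items():
    for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
        sims[(d1, d2)] += w1 * w2

print(dict(sims))  # unnormalized dot products between co-occurring documents
```

Each loop corresponds to one MapReduce stage, so the same logic ports directly to an mrjob mapper/reducer pair.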
Future studies with Sparse Matrices
Real datasets come with lots of empty values
Apontador Reviews Dataset
http://aimotion.blogspot.com/2011/05/evaluating-recommender-systems.html
Solutions:
scipy.sparse package
Sharding operations
Matrix factorization techniques (SVD)
Crab implements matrix factorization with an Expectation Maximization algorithm: the scikits.crab.svd package
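Crab's exact EM formulation isn't shown on the slides; as a sketch of the general idea, a ratings matrix with missing entries can be completed by alternating imputation of the missing cells (E-step) with a truncated SVD refit (M-step). Toy data and NumPy only; this is our own illustration, not the scikits.crab.svd code:

```python
import numpy as np

# Toy ratings matrix; np.nan marks missing preferences (hypothetical data).
R = np.array([[5.0, 3.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, np.nan, 5.0],
              [1.0, np.nan, 5.0, 4.0]])

def em_factorize(R, rank=2, n_iter=50):
    mask = ~np.isnan(R)
    filled = np.where(mask, R, np.nanmean(R))  # initial guess: global mean
    for _ in range(n_iter):
        # M-step: rank-k truncated SVD of the completed matrix
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # E-step: keep observed ratings, re-impute missing ones from the model
        filled = np.where(mask, R, approx)
    return filled

completed = em_factorize(R)
print(np.round(completed, 2))  # observed cells intact, gaps filled by the model
```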
Optimizations with Cython
http://cython.org/
Cython is a Python extension that lets developers annotate functions so they can be compiled to C.
http://aimotion.blogspot.com/2011/09/high-performance-computation-with_17.html

# setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

# for notes on compiler flags see:
# http://docs.python.org/install/index.html

setup(
    cmdclass={'build_ext': build_ext},
    ext_modules=[Extension("spearman_correlation_cython",
                           ["spearman_correlation_cython.pyx"])]
)
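The spearman_correlation_cython module itself isn't reproduced on the slides; for reference, a pure-Python Spearman rank correlation (our own sketch, not Crab's implementation) is exactly the kind of tight loop that benefits from Cython annotation:

```python
def rankdata(values):
    """Ranks starting at 1; ties receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho via the squared-rank-difference formula (exact without ties)."""
    n = len(x)
    rx, ry = rankdata(x), rankdata(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

print(spearman([3.0, 4.0, 3.5, 5.0], [3.0, 3.5, 4.0, 5.0]))  # 0.8
```

Compiling such a function with Cython (adding C types to the loop indices and arrays) is what the setup.py above wires up.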
Benchmarks
Time elapsed to recommend 5 items on the MovieLens 100k dataset (http://www.grouplens.org/node/73):

Version                                    Time
Old Crab (pure Python with dicts)          15.32 s
New Crab (Python with SciPy and NumPy)      9.56 s
Why migrate?
The old Crab ran in pure Python only
Recommendations demand heavy math and lots of processing
Compatible with the NumPy and SciPy libraries: high-standard, popular scientific libraries optimized for scientific calculation in Python
Scikits projects are amazing! Active communities, scientific conferences and actively updated projects (e.g. scikit-learn)
Make the Crab framework visible to the community: join the scientific researchers and machine learning developers around the globe coding in Python to help us in this project
Be fast and furious
Why migrate?
NumPy optimized with PyPy: 2x to 48x faster
http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html
How are we working?
Sprints, online discussions and issues
https://github.com/muricoca/crab/wiki/UpcomingEvents
How are we working?
Our project's home page: http://muricoca.github.com/crab
Future Releases
Planned release 0.1: collaborative filtering algorithms working, sample datasets to load and test
Planned release 0.11: sparse matrices and database models support
Planned release 0.12: Slope One algorithm, new factorization techniques implemented
...
Join us!
1. Read our wiki page: https://github.com/muricoca/crab/wiki/Developer-Resources
2. Check out our current sprints and open issues: https://github.com/muricoca/crab/issues
3. Fork us; pull requests are the way to contribute
4. Join us at irc.freenode.net #muricoca or on our discussion list: http://groups.google.com/group/scikit-crab
Recommended Books
Satnam Alag, Collective Intelligence in Action, Manning Publications, 2009
Toby Segaran, Programming Collective Intelligence, O'Reilly, 2007
ACM RecSys, KDD, SBSC...
Crab: A Python Framework for Building Recommendation Engines
Marcel Caraciolo (@marcelcaraciolo)
Bruno Melo (@brunomelo)
Ricardo Caspirro (@ricardocaspirro)
{marcel, ricardo, bruno}@muricoca.com
https://github.com/muricoca/crab