Recommendations with Python and Hadoop Streaming
Andrew Look
Senior Engineer, Shopzilla
Getting started
● Slides
  ○ http://bit.ly/J7vmx7
● Python/NumPy Installed
  ○ http://bit.ly/JWNWbq
● Sample code
  ○ http://aws-hadoop.s3.amazonaws.com/similarity.zip
Outline
● Problem
● Recommendation basics
● MapReduce review and conventions
● Python + Hadoop Streaming basics
● MapReduce jobs (data, code, data-flow)
● Recommendation algorithm
Problem - Music Recommendations
● We want to recommend similar artists
● We have data from Last.fm
● Which Last.fm users liked which artists?
● How can we decide which artists are similar?
(Figure: example artists Toby Keith, Tupac, De La Soul, Garth Brooks)
Solution - Find Artist Similarities
● We'll follow along with a tutorial from AWS
● By Data Wrangling blogger/AWS developer Peter Skomoroch
● Uses publicly available data from Last.fm
● User's rating of artist is number of plays
Solution - Find Artist Similarities
● We can look at co-ratings
● One user played artist A songs X times
● Same user played artist B songs Y times
co-rating = ((A,X),(B,Y))
Recommendation Basics
● User Based
  ○ Given a user, recommend the artists that are favored by users with similar artist preferences
● Item Based
  ○ Given an item (artist), recommend the artists that were most commonly favored by users that also liked the input artist
Recommendation Basics
● Types of data
  ○ Explicit - user rates a movie on Netflix
  ○ Implicit - user watches a YouTube video
● Types of ratings
  ○ Multivalued - bounded, ex. star rating (1-5)
  ○ Multivalued - unbounded, ex. number of plays (>0)
  ○ Binary - did a user play a movie or not?
Last.fm Recommendations
● Data was implicitly collected (as users play songs)
● Transform binary data (did user listen to artist?) into multivalued data (how many times?)
● We'll use item-based recommendations
Mapper Input
Map Output - Reduce Input
Chaining MapReduce Jobs
Distributed Cache
Python Shell and Hadoop Streaming
Streaming API requires shell commands
● Mapper
● Reducer
For mapper / reducer commands, the Streaming API will
● Partition the input
● Distribute across mappers and reducers
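As a rough sketch of what "shell commands" means here: a Streaming mapper or reducer is any executable that reads lines on standard input and writes tab-separated key/value lines on standard output. The commands in this deck call one script with a step name (`./similarity.py mapper1`), which suggests a dispatch on the first argument. The names `identity_mapper`, `dispatch`, and `STEPS` below are hypothetical and are not taken from similarity.py.

```python
#!/usr/bin/env python
import sys


def identity_mapper(lines, out):
    """Write each record as 'first field <TAB> remaining fields'."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or single-field rows
        out.write("%s\t%s\n" % (fields[0], " ".join(fields[1:])))


def dispatch(argv, stdin, stdout, steps):
    """Run the step named by argv[1]; extra arguments are passed as strings."""
    name, args = argv[1], argv[2:]
    steps[name](stdin, stdout, *args)


# Hypothetical step table; a real script would register mapper1, reducer1, ...
STEPS = {"identity": identity_mapper}

if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] in STEPS:
    dispatch(sys.argv, sys.stdin, sys.stdout, STEPS)
```

Because the step only touches stdin and stdout, the same script can be tested locally with `cat` and `sort` before it is handed to Hadoop.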
Python Shell and Hadoop Streaming
Full Recommendation Job Overview
Example - Working Data Set
○ Inspect your working data set ...
○ Each row is one "rating"
○ Each "number of plays" is the "rating value"
Code
cat input/sample_user_artist_data.txt | head
Example - Working Data Set
User ID Artist ID Number of Plays
1000020 1001820 20
1000020 1003557 1
1000021 700 1
1000029 1011819 1
1000036 1001820 34
1000036 1011819 2
1000036 700 2
1000040 1001820 1
1000057 1011819 37
1000060 700 17
Mapper 1 - Count Ratings per Artist
○ Prepend LongValueSum:<artist ID>
○ More on this later
○ Use a value of "1"
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper1
Mapper 1 - Count Ratings per Artist
Artist ID Number of Ratings
LongValueSum:1001820 1
LongValueSum:1003557 1
LongValueSum:700 1
LongValueSum:1011819 1
LongValueSum:1001820 1
LongValueSum:1011819 1
LongValueSum:700 1
LongValueSum:1001820 1
LongValueSum:1011819 1
LongValueSum:700 1
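A minimal sketch of what this first mapper appears to do, judging only from the input and output shown above: drop the user ID and play count, and emit the artist ID with the `LongValueSum:` prefix and a value of 1. The function body is an illustration, not the code from similarity.py, and it assumes whitespace-separated input rows.

```python
def mapper1(lines):
    """Yield 'LongValueSum:<artist ID><TAB>1' for every rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        user_id, artist_id, plays = fields
        # One rating per row, so the value is always 1.
        yield "LongValueSum:%s\t1" % artist_id


if __name__ == "__main__":
    sample = ["1000020 1001820 20", "1000021 700 1"]
    for record in mapper1(sample):
        print(record)
```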
Mapper 1 - Count Ratings per Artist
○ We use the sort command locally
○ We sort by artist ID
○ Emulates shuffle/sort in Hadoop
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper1 | sort
Mapper 1 - Count Ratings per Artist
Artist ID Number of Ratings
LongValueSum:1001820 1
LongValueSum:1001820 1
LongValueSum:1001820 1
LongValueSum:1003557 1
LongValueSum:1011819 1
LongValueSum:1011819 1
LongValueSum:1011819 1
LongValueSum:700 1
LongValueSum:700 1
LongValueSum:700 1
Reducer 1 - Count Ratings by Artist
○ LongValueSum tells the 'aggregate' reducer to sum the values for each key
○ Group by artist ID
○ Sum up the 1's
○ Emit artist ID as Key, count(ratings) as Value
Code
cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper1 | sort \
  | ./similarity.py reducer1 \
  > input/artist_playcounts.txt
Reducer 1 - Count Ratings by Artist
Artist ID Number of Ratings
1000143 1905
1000418 184
1001820 12950
700 7243
1003557 2976
1011819 7601
1012511 1881
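The summing step can be emulated locally with a sketch like the one below. It assumes the input is already sorted by key (as the `sort` command guarantees) and that key and value are tab-separated. The function is illustrative and not the reducer from similarity.py.

```python
from itertools import groupby

PREFIX = "LongValueSum:"


def reducer1(sorted_lines):
    """Sum the values of consecutive records that share a key."""
    def parse(line):
        key, value = line.rstrip("\n").split("\t")
        return key, int(value)

    pairs = (parse(line) for line in sorted_lines if line.strip())
    # groupby only merges adjacent keys, hence the sorted-input requirement.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(value for _, value in group)
        artist_id = key[len(PREFIX):] if key.startswith(PREFIX) else key
        yield "%s\t%d" % (artist_id, total)


if __name__ == "__main__":
    sample = ["LongValueSum:1001820\t1", "LongValueSum:1001820\t1", "LongValueSum:700\t1"]
    for record in reducer1(sample):
        print(record)
```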
Mapper 2 - User Artist Preferences
○ Mapper2 outputs key user ID, artist ID
○ Mapper2 outputs rating as value (# plays)
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper2 int
Mapper 2 - User Artist Preferences
User ID, Artist ID Number of Plays
1000020,1001820 20
1000020,1003557 1
1000021,700 1
1000029,1011819 1
1000036,1001820 34
1000036,1011819 2
1000036,700 2
1000040,1001820 1
1000057,1011819 37
1000060,700 17
Mapper 2 - User Artist Preferences
○ Can large counts skew our results?
○ Apply a log function to dampen large play counts
Code
cat input/sample_user_artist_data.txt | ./similarity.py mapper2 log | sort
Mapper 2 - Logarithmic Smoothing
User ID, Artist ID Smoothing Smoothed Count
1000020,1001820 log(20) 3
1000020,1003557 log(1) 1
1000021,700 log(1) 1
1000029,1011819 log(1) 1
1000036,1001820 log(34) 4
1000036,1011819 log(2) 1
1000036,700 log(2) 1
1000040,1001820 log(1) 1
1000057,1011819 log(37) 4
1000060,700 log(17) 3
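The smoothed counts in this table are all reproduced by `int(ln(plays)) + 1`, so the sketch below uses that formula. That choice is an inference from the six distinct values shown, not a confirmed reading of similarity.py, and the function names are illustrative.

```python
from math import log


def smooth(plays):
    """Natural log, truncated, plus one; expects plays >= 1."""
    return int(log(plays)) + 1


def mapper2(lines, scale="int"):
    """Yield '<user ID>,<artist ID><TAB><rating>' for every rating row."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed rows
        user_id, artist_id, plays = fields[0], fields[1], int(fields[2])
        rating = smooth(plays) if scale == "log" else plays
        yield "%s,%s\t%d" % (user_id, artist_id, rating)


if __name__ == "__main__":
    for record in mapper2(["1000020 1001820 20", "1000036 1001820 34"], "log"):
        print(record)
```

With this formula a single play still counts as a rating of 1, while 37 plays counts as only 4, which keeps heavy listeners from dominating the similarity scores.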
Reducer 2 - Aggregate User Prefs
○ Reduce for each user
○ Key - user ID
○ Value is complex
  ○ Count(ratings)
  ○ Sum(rating values)
  ○ Space delimited list of - artist ID, rating value
Code
cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2
Reducer 2 - Aggregated User Prefs
User ID Count(ratings) | Sum(rating values) | Artist ID,Rating list
1000020 2 | 4 | 1001820,3 1003557,1
1000021 1 | 1 | 700,1
1000029 1 | 1 | 1011819,1
1000036 3 | 6 | 1001820,4 1011819,1 700,1
1000040 1 | 1 | 1001820,1
1000057 1 | 4 | 1011819,4
1000060 1 | 3 | 700,3
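One possible way to build that composite value, assuming sorted `<user ID>,<artist ID><TAB><rating>` input. This is a sketch rather than the reducer from similarity.py, and it writes the `|` separators without the surrounding spaces shown in the table.

```python
from itertools import groupby


def reducer2(sorted_lines):
    """Yield '<user ID><TAB><count>|<sum>|<artist,rating artist,rating ...>'."""
    def parse(line):
        key, rating = line.rstrip("\n").split("\t")
        user_id, artist_id = key.split(",")
        return user_id, artist_id, int(rating)

    rows = (parse(line) for line in sorted_lines if line.strip())
    for user_id, group in groupby(rows, key=lambda row: row[0]):
        ratings = [(artist_id, rating) for _, artist_id, rating in group]
        count = len(ratings)
        total = sum(rating for _, rating in ratings)
        listing = " ".join("%s,%d" % pair for pair in ratings)
        yield "%s\t%d|%d|%s" % (user_id, count, total, listing)


if __name__ == "__main__":
    sample = ["1000036,1001820\t4", "1000036,1011819\t1", "1000036,700\t1"]
    for record in reducer2(sample):
        print(record)
```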
Mapper 3 - User Co-Ratings
○ Mapper3 culls users via a cutoff
○ Drops the user ID and emits each user's ratings pairwise as co-ratings
Code
cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2 \
  | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort
Mapper 3 - User Co-Ratings
Artist ID: X, Y Rating: X, Y
1000143 1003557 2 3
1000143 1011819 2 3
1001820 700 1 2
1001820 700 1 3
1011819 700 3 2
1011819 700 3 3
1011819 700 4 2
1011819 700 4 2
1011819 700 5 5
1012511 700 1 1
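A sketch of the pairwise emission, under two labeled assumptions: the cutoff is taken to mean "skip users with more than `cutoff` ratings", and the artist play-count file passed on the command line is not used. Neither detail is confirmed by the slides, so treat this as an illustration of the co-rating idea only.

```python
from itertools import combinations


def mapper3(lines, cutoff):
    """Yield '<artist X> <artist Y><TAB><rating X> <rating Y>' per user pair."""
    for line in lines:
        if not line.strip():
            continue
        user_id, value = line.rstrip("\n").split("\t")
        count, _total, listing = value.split("|")
        if int(count) > cutoff:
            continue  # assumption: users above the cutoff are dropped
        # Sort by artist ID (as a string) so each pair has one fixed order.
        ratings = sorted(item.split(",") for item in listing.split(" "))
        for (artist_x, rating_x), (artist_y, rating_y) in combinations(ratings, 2):
            yield "%s %s\t%s %s" % (artist_x, artist_y, rating_x, rating_y)


if __name__ == "__main__":
    for record in mapper3(["1000036\t3|6|1001820,4 1011819,1 700,1"], 100):
        print(record)
```

A user with n ratings produces n(n-1)/2 pairs, which is why some limit on very active users is useful.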
Reducer 3 - Artist Similarities
○ Given the number of artists, computes similarities
○ Each pair of artists is emitted with its similarity
Code
cat input/sample_user_artist_data.txt \
  | ./similarity.py mapper2 log | sort \
  | ./similarity.py reducer2 \
  | ./similarity.py mapper3 100 input/artist_playcounts.txt | sort \
  | ./similarity.py reducer3 147160 \
  > artist_similarities.txt
Reducer 3 - Artist Similarities
Artist ID, Similarity, Artist ID, Co-Ratings
1003557 0.121659425105 1012511 360
1012511 0.121659425105 1003557 360
1003557 0.0197107349416 700 212
700 0.0197107349416 1003557 212
1011819 0.0128808637553 1012511 259
1012511 0.0128808637553 1011819 259
1011819 0.297222927702 700 3050
700 0.297222927702 1011819 3050
1012511 0.0426446192482 700 270
700 0.0426446192482 1012511 270
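To illustrate how a reducer can turn sorted co-ratings into one similarity per artist pair, the sketch below accumulates running sums and computes a plain Pearson correlation. It is not expected to reproduce the numbers in the table above: the actual reducer may weight or regularize the score, and the `147160` argument is not modeled here. All names are illustrative.

```python
from itertools import groupby
from math import sqrt


def pearson_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Pearson correlation from running sums; 0.0 when it is undefined."""
    var_x = n * sum_xx - sum_x * sum_x
    var_y = n * sum_yy - sum_y * sum_y
    if var_x <= 0 or var_y <= 0:
        return 0.0  # constant ratings on one side, or a single co-rating
    return (n * sum_xy - sum_x * sum_y) / (sqrt(var_x) * sqrt(var_y))


def reducer3(sorted_lines):
    """Yield (artist X, similarity, artist Y, number of co-ratings)."""
    def parse(line):
        key, value = line.rstrip("\n").split("\t")
        x, y = value.split(" ")
        return key, float(x), float(y)

    rows = (parse(line) for line in sorted_lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[0]):
        n = sx = sy = sxx = syy = sxy = 0.0
        for _, x, y in group:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
        artist_x, artist_y = key.split(" ")
        yield artist_x, pearson_from_sums(n, sx, sy, sxx, syy, sxy), artist_y, int(n)


if __name__ == "__main__":
    sample = ["1011819 700\t3 2", "1011819 700\t4 2", "1011819 700\t5 5"]
    for record in reducer3(sample):
        print(record)
```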
Mapper 4 - Sort by Artist Correlation
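○ Emit artist ID and similarity concatenated as the key
○ The value in the key is 1 - similarity, so an ascending sort lists the most similar artists first
○ Sort by similarity = recommendation
Code
cat artist_similarities.txt | ./similarity.py mapper4 20 | sort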
Mapper 4 - Sort by Artist Correlation
Artist X-ID, Similarity Artist Y-ID, Num Co-Ratings
1012511,0.924219271937 1000143 237
1012511,0.945653412649 1001820 468
1012511,0.957355380752 700 270
1012511,0.961454917198 1000418 50
1012511,0.987119136245 1011819 259
700,0.702777072298 1011819 3050
700,0.898811337303 1001820 2250
700,0.95212801312 1000143 114
700,0.957355380752 1012511 270
700,0.980289265058 1003557 212
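Comparing this table with the Reducer 3 output shows that the number in the key is 1 minus the similarity (for example 0.297222927702 becomes 0.702777072298), which lets a plain ascending `sort` rank the closest artists first. A minimal sketch of that key construction follows; it does not model the `20` argument, whose meaning the slides do not state.

```python
def mapper4(lines):
    """Yield '<artist X>,<1 - similarity><TAB><artist Y><TAB><co-ratings>'."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        artist_x, similarity, artist_y, coratings = fields
        distance = 1.0 - float(similarity)
        # 12 significant digits, matching the precision shown in the slides.
        yield "%s,%.12g\t%s\t%s" % (artist_x, distance, artist_y, coratings)


if __name__ == "__main__":
    for record in mapper4(["700 0.297222927702 1011819 3050"]):
        print(record)
```

Note that `sort` compares the keys as text, which works here because every value has the same `0.` prefix.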
Reducer 4 - Cosmetic Results
○ Reducer attaches artist names
Code
cat artist_similarities.txt \
  | ./similarity.py mapper4 20 \
  | sort \
  | ./similarity.py reducer4 3 lastfm/artist_data.txt \
  > related_artists.tsv
Reducer 4 - Cosmetic Results
Artist ID, Related Artist ID, Similarity, Number of Co-Ratings, Artist Name
1000143 1000143 1 0 Toby Keith
1000143 1003557 0.2434 809 Garth Brooks
1000143 1000418 0.1068 120 Mark Chestnutt
1000143 1012511 0.0758 237 Kenny Rogers
1000418 1000418 1 0 Mark Chestnutt
1000418 1000143 0.1068 120 Toby Keith
1000418 1003557 0.056 114 Garth Brooks
1000418 1012511 0.0385 50 Kenny Rogers
Pearson Similarity - Visualization
covariance(A, B) = 2.44
covariance(C, D) = -2.36
Pearson Similarity - Equation
pearson(x, y) = covariance(x, y) / (stddev(x) * stddev(y))
pearson(A, B) = 0.772
pearson(C, D) = -0.746
Pearson Similarity - Summary
○ Pearson similarity normalizes covariance
○ Measures linear dependence between two variables
○ Normalized to a fixed range:
-1 <= pearson(x, y) <= 1
(for any x, y)
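The definition above translates directly into a few lines of Python. The sketch uses population covariance and standard deviation (dividing by n); the function names mirror the formula and are otherwise arbitrary.

```python
from math import sqrt


def mean(values):
    return sum(values) / float(len(values))


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / float(len(xs))


def stddev(xs):
    """Population standard deviation, the square root of covariance(x, x)."""
    return sqrt(covariance(xs, xs))


def pearson(xs, ys):
    """covariance(x, y) / (stddev(x) * stddev(y)); undefined for constant input."""
    denominator = stddev(xs) * stddev(ys)
    if denominator == 0:
        raise ValueError("pearson is undefined when either input is constant")
    return covariance(xs, ys) / denominator


if __name__ == "__main__":
    print(pearson([1, 2, 3], [2, 4, 6]))  # perfectly correlated ratings
    print(pearson([1, 2, 3], [6, 4, 2]))  # perfectly anti-correlated ratings
```

The two calls at the bottom hit the ends of the range, which is why the bounds are inclusive.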
Questions?
● Hadoop Streaming
  ○ http://hadoop.apache.org/common/docs/r0.20.1/streaming.html
● Explanation of LongValueSum
  ○ http://stackoverflow.com/questions/1946953/availiable-reducers-in-elastic-mapreduce
● Pearson Correlation
  ○ http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
● Finding Similar Items with Amazon Elastic MapReduce, Python, and Hadoop Streaming
  ○ http://aws.amazon.com/articles/2294
Appendix
● Anscombe's Quartet
  ○ http://en.wikipedia.org/wiki/Anscombe's_quartet
● Tau Coefficient
  ○ http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient
● Jaccard Index
  ○ http://en.wikipedia.org/wiki/Jaccard_index
● Quality of Recommendations
  ○ http://en.wikipedia.org/wiki/Mean_squared_error