Music recommendations @ MLConf 2014


Description

Presentation about Spotify's music discovery system at MLConf NYC 2014

Transcript of Music recommendations @ MLConf 2014

Page 1: Music recommendations @ MLConf 2014

April 14, 2014

Music discovery at Spotify

Music + ML = ❤

Page 2: Music recommendations @ MLConf 2014

April 14, 2014

I’m Erik Bernhardsson

Engineering Manager at Spotify in NYC @fulhack

Page 3: Music recommendations @ MLConf 2014

The “Prism” team

Chris Johnson, Andy Sloane, Sam Rozenberg, Ahmad Qamar, Romain Yon, Gandalf Hernandez, Neville Li, Rodrigo Araya, Edward Newett, Emily Samuels, Vidhya Murali, Rohan Agrawal

3

Page 4: Music recommendations @ MLConf 2014


40 million tracks... but where to start?

4

Page 5: Music recommendations @ MLConf 2014

Discover page

5

Page 6: Music recommendations @ MLConf 2014

Radio

6

Page 7: Music recommendations @ MLConf 2014

How do you scale this?

7

Page 8: Music recommendations @ MLConf 2014

How do we structure music understanding?

How do you teach music to machines?

- Editorial tagging
- Audio analysis
- Metadata
- Natural language processing
- Collaborative filtering

8

Page 9: Music recommendations @ MLConf 2014

Collaborative filtering

Find patterns in usage data. With millions of users and billions of streams, lots of patterns.

9

Hey, I like tracks P, Q, R, S!

Well, I like tracks Q, R, S, T!

Then you should check out track P!

Nice! Btw try track T!

Page 10: Music recommendations @ MLConf 2014

Some real data points: 36.5% of playlists containing Notorious B.I.G. also contain 2Pac (6.4% of playlists containing Notorious B.I.G. also contain Justin Bieber).

10

Page 11: Music recommendations @ MLConf 2014

Main problem: how similar are two items?

If you understand that well, you can do most other things. So our main problem is: how do you model a function similarity(x, y)?

For item similarity it’s also much easier to acquire good test set data than for personal recommendations. Personal recommendations are hard to evaluate – most offline metrics like precision are irrelevant.

11

Page 12: Music recommendations @ MLConf 2014

“Essentially, all models are wrong, but some are useful.” – George Box

We can’t perfectly model how users choose music. But modeling is a craft, not a science, and we can use common sense when building models.

For play counts, is a Poisson or a normal distribution better?

Always check your assumptions. E.g. SVD minimizes squared loss, which assumes the underlying data is Gaussian. Is it?

12
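As a concrete illustration of “check your assumptions” (a sketch, not from the talk; the play counts below are made up), you can compare how well a Poisson versus a normal distribution explains observed play counts by log-likelihood:

```python
# Minimal sketch: which distribution fits per-user play counts better?
import numpy as np
from scipy import stats

play_counts = np.array([0, 1, 1, 2, 3, 5, 8, 13, 0, 1, 2, 2])  # toy data

# Poisson: single rate parameter, MLE is the sample mean.
lam = play_counts.mean()
poisson_ll = stats.poisson.logpmf(play_counts, lam).sum()

# Normal: MLE mean and standard deviation.
mu, sigma = play_counts.mean(), play_counts.std()
normal_ll = stats.norm.logpdf(play_counts, mu, sigma).sum()

print(f"Poisson log-likelihood: {poisson_ll:.1f}")
print(f"Normal  log-likelihood: {normal_ll:.1f}")
```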

Page 13: Music recommendations @ MLConf 2014

OK so how do we do it?

There are a lot of interesting unsupervised language models that work really well for us. Docs = playlists/users, words = tracks/artists/albums. You could also call it implicit collaborative filtering because we have no ratings whatsoever.

Main approach: matrix factorization (or latent factor methods), historically with bag-of-words on play counts (but today sequence is also important).

13

Turns out people have been doing this in NLP for a while:

$$
M = \begin{pmatrix}
c_{11} & c_{12} & \dots & c_{1n} \\
c_{21} & c_{22} & \dots & c_{2n} \\
\vdots & & \ddots & \vdots \\
c_{m1} & c_{m2} & \dots & c_{mn}
\end{pmatrix}
$$

(columns: lots of words, rows: lots of documents)

Or more generally:

$$
P = \begin{pmatrix}
p_{11} & p_{12} & \dots & p_{1n} \\
p_{21} & p_{22} & \dots & p_{2n} \\
\vdots & & \ddots & \vdots \\
p_{m1} & p_{m2} & \dots & p_{mn}
\end{pmatrix}
$$

The idea with matrix factorization is to represent this probability distribution like this:

$$
p_{ui} = a_u^T b_i, \qquad M' = A^T B
$$

$$
\underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \vdots & & \vdots \\ \cdot & \cdots & \cdot \end{pmatrix}}_{\text{probabilities for next event}}
\;\approx\;
\underbrace{\begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix}}_{\text{user vectors}}
\underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix}}_{\text{item vectors}}
$$

where f is the number of latent factors (the inner dimension of the two factor matrices).

We can look at it as a probability distribution:

$$
P = \begin{pmatrix}
0 & 0.07 & 0.21 & 0 \\
0.05 & 0 & 0 & 0.01 \\
0.04 & 0 & 0.13 & 0.09 \\
0 & 0 & 0 & 0.07 \\
0.19 & 0.01 & 0 & 0.13 \\
0 & 0.03 & 0 & 0
\end{pmatrix}
$$
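For intuition, here is a tiny low-rank sketch using a truncated SVD on the probability matrix above (an assumed illustration only; as the previous slide warned, SVD’s squared loss is not necessarily the right model for this data):

```python
# Approximate P with the product of two thin matrices: P[u, i] ~ a_u . b_i
import numpy as np

P = np.array([[0, .07, .21, 0], [.05, 0, 0, .01], [.04, 0, .13, .09],
              [0, 0, 0, .07], [.19, .01, 0, .13], [0, .03, 0, 0]])

f = 2                                     # number of latent factors
U, s, Vt = np.linalg.svd(P, full_matrices=False)
A = U[:, :f] * s[:f]                      # user vectors (6 x f)
B = Vt[:f].T                              # item vectors (4 x f)

print(np.round(A @ B.T, 2))               # rank-f approximation of P
```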

Page 14: Music recommendations @ MLConf 2014

14

Page 15: Music recommendations @ MLConf 2014

Step 1: Put everything into a big sparse matrix

15

Using some definition of correlation, e.g. for Pearson:

$$
c_{ij} = \frac{\sum_u N_{ui} N_{uj}}{\sqrt{\sum_u N_{ui}^2}\,\sqrt{\sum_u N_{uj}^2}}
$$

but it’s very slow because:

$$
N = \begin{pmatrix}
0 & 7 & 21 & 0 \\
5 & 0 & 0 & 1 \\
4 & 0 & 13 & 9 \\
0 & 0 & 0 & 7 \\
19 & 1 & 0 & 13 \\
0 & 3 & 0 & 0
\end{pmatrix}
\qquad
N^T N = \begin{pmatrix}
402 & 19 & 52 & 288 \\
19 & 59 & 147 & 13 \\
52 & 147 & 610 & 117 \\
288 & 13 & 117 & 300
\end{pmatrix}
$$

$$
O(U \cdot (N/I)^2) \approx 10^7 \cdot (10^{10}/10^7)^2 = 10^{13} \text{ mapper outputs}
$$

where U = number of users, I = number of items, N = number of nonzero entries.

It’s an extremely sparse matrix:

$$
M = \begin{pmatrix}
\cdots & 53 & \cdots & & \\
& \cdots & & 12 & \cdots \\
\cdots & 7 & & \cdots & \\
& & \vdots & &
\end{pmatrix}
$$

It’s a very big matrix too:

$$
M = \begin{pmatrix}
c_{11} & c_{12} & \dots & c_{1n} \\
c_{21} & c_{22} & \dots & c_{2n} \\
\vdots & & \ddots & \vdots \\
c_{m1} & c_{m2} & \dots & c_{mn}
\end{pmatrix}
$$

(columns: 10^7 items, rows: 10^7 users)

Page 16: Music recommendations @ MLConf 2014

Matrix example

Roughly 25 billion nonzero entries. Total size is roughly 25 billion × 12 bytes = 300 GB (“medium data”).

16

Example entry: Erik listened to “Never gonna give you up” 1 time, so the cell (Erik, “Never gonna give you up”) contains 1.
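A minimal sketch of step 1, assuming (user, track, play count) triples and using scipy.sparse (hypothetical data; not Spotify's actual pipeline):

```python
# Pack (user, track, play_count) triples into a sparse user x item matrix.
from scipy.sparse import coo_matrix

# hypothetical stream log: (user_index, track_index, play_count)
triples = [(0, 0, 1),    # e.g. Erik listened to "Never gonna give you up" once
           (0, 3, 7),
           (1, 2, 21),
           (2, 0, 5)]

users, tracks, counts = zip(*triples)
M = coo_matrix((counts, (users, tracks)))   # shape inferred from the indices
M = M.tocsr()                               # efficient row slicing per user

print(M.shape, M.nnz, "nonzero entries")
```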

Page 17: Music recommendations @ MLConf 2014

For instance, for PLSA

Probabilistic Latent Semantic Indexing (Hofmann, 1999). Invented as a method intended for text classification.

17

$$
\underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \vdots & & \vdots \\ \cdot & \cdots & \cdot \end{pmatrix}}_{P(u,i)\,=\,\sum_z P(u\mid z)\,P(i,z)}
\;\approx\;
\underbrace{\begin{pmatrix} \cdot & \cdot \\ \vdots & \vdots \\ \cdot & \cdot \end{pmatrix}}_{P(u\mid z)\ \text{(user vectors)}}
\underbrace{\begin{pmatrix} \cdot & \cdots & \cdot \\ \cdot & \cdots & \cdot \end{pmatrix}}_{P(i,z)\ \text{(item vectors)}}
$$

with the constraints

$$
\sum_u P(u \mid z) = 1, \qquad \sum_{i,z} P(i,z) = 1
$$

So in general we want to optimize

$$
\log \prod_{u,i} P(u,i)^{N_{ui}} = \sum_{u,i} N_{ui} \log P(u,i) = \sum_{u,i} N_{ui} \log \sum_z P(u \mid z)\,P(i,z)
$$

where N is the matrix of play counts:

$$
N = \begin{pmatrix}
0 & 7 & 21 & 0 \\
5 & 0 & 0 & 1 \\
4 & 0 & 13 & 9 \\
0 & 0 & 0 & 7 \\
19 & 1 & 0 & 13 \\
0 & 3 & 0 & 0
\end{pmatrix}
$$
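For reference (not spelled out in the slides), the standard EM updates for this PLSA parameterization would be:

$$
q(z \mid u, i) = \frac{P(u \mid z)\,P(i,z)}{\sum_{z'} P(u \mid z')\,P(i,z')}
$$

$$
P(u \mid z) \leftarrow \frac{\sum_i N_{ui}\, q(z \mid u, i)}{\sum_{u',i} N_{u'i}\, q(z \mid u', i)}, \qquad
P(i, z) \leftarrow \frac{\sum_u N_{ui}\, q(z \mid u, i)}{\sum_{u,i} N_{ui}}
$$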

Page 18: Music recommendations @ MLConf 2014

Run this for n iterations

Start with random vectors around the origin. Then run alternating least squares, gradient descent, or something like that.

18
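A minimal alternating-least-squares sketch of what such iterations might look like (an assumed toy implementation on the small count matrix from earlier, not the production code, and it ignores the implicit-feedback weighting discussed later):

```python
# Start with random vectors around the origin, then alternate: solve user
# vectors with item vectors fixed, then item vectors with user vectors fixed.
import numpy as np

N = np.array([[0, 7, 21, 0], [5, 0, 0, 1], [4, 0, 13, 9],
              [0, 0, 0, 7], [19, 1, 0, 13], [0, 3, 0, 0]], dtype=float)
num_users, num_items = N.shape
f, reg = 2, 0.1                                   # latent factors, L2 penalty
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(num_users, f))    # user vectors
B = rng.normal(scale=0.1, size=(num_items, f))    # item vectors

def solve(X, target, reg):
    """Regularized least-squares solve for the other factor."""
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ target).T

for _ in range(20):              # run this for n iterations
    A = solve(B, N.T, reg)       # fix item vectors, solve user vectors
    B = solve(A, N, reg)         # fix user vectors, solve item vectors

print(np.round(A @ B.T, 1))      # reconstruction of N
```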

Page 19: Music recommendations @ MLConf 2014

Why are latent factor models nice?

They find vectors which are super small fingerprints of the musical style or the user’s taste. Usually something like 40–1000 elements.

19

Illustration: Track X’s vector = [0.87, 1.17, -0.26, 0.56, 2.21, 0.77, -0.03], shown as a point in the space spanned by latent factor 1 and latent factor 2.

Page 20: Music recommendations @ MLConf 2014

Why are latent factor models nice? (part 2)

- Fast (linear in input size)
- Do not have a big problem with overfitting
- Have a solid underlying model (i.e. not just a bunch of heuristics)
- Easy to scale (at least compared to other models)
- Gives a compact representation of items

20

Page 21: Music recommendations @ MLConf 2014

Similarity now becomes schoolbook trigonometry

21

Illustration: track x and track y plotted against latent factor 1 and latent factor 2, with cos(x, y) = HIGH.

Vectors:

$$
p_{ui} = a_u^T b_i, \qquad
\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i|\,|b_j|} \qquad O(f)
$$

IPMF item-item:

$$
P(i \to j) = \exp(b_j^T b_i)/Z_i = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}
$$

IPMF item-item MDS:

$$
P(i \to j) = \exp(-|b_j - b_i|^2)/Z_i = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}, \qquad
\mathrm{sim}_{ij} = -|b_j - b_i|^2
$$

i                        j                        sim(i, j)
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81
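A small sketch of the cosine computation (assumed; the vector for track x reuses the example numbers from the earlier slide, and track y is made up):

```python
# Once every track has a latent-factor vector, similarity is just the cosine
# between two vectors: O(f) for f latent factors.
import numpy as np

def cosine(x: np.ndarray, y: np.ndarray) -> float:
    """cos(x, y) = x . y / (|x| |y|)"""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

track_x = np.array([0.87, 1.17, -0.26, 0.56, 2.21, 0.77, -0.03])
track_y = np.array([0.80, 1.05, -0.30, 0.60, 2.10, 0.70, 0.00])

print(cosine(track_x, track_y))   # close to 1.0 for similar tracks
```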

Page 22: Music recommendations @ MLConf 2014

Why does cosine make sense?

Intuitively it makes sense, because we’re factoring out popularity and introducing a distance metric.

In fact, the best result seems to be: train a latent factor model as usual, but normalize all vectors as a post-processing step.

Even for models without any geometric interpretation (like LDA), cosine works.

22

Page 23: Music recommendations @ MLConf 2014

It’s still tricky to search for similar tracks though

Locality Sensitive Hashing: cut the space recursively by random planes. If two points are close, they are more likely to end up on the same side of each plane.

https://github.com/spotify/annoy

23
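A usage sketch of annoy's current Python API (which may differ slightly from the 2014 version shown in the talk), with randomly generated item vectors standing in for real ones:

```python
# Build a forest of random-projection trees over the item vectors, then query
# approximate nearest neighbours in the angular (cosine-like) metric.
import numpy as np
from annoy import AnnoyIndex

f = 40                                   # dimensionality of the item vectors
index = AnnoyIndex(f, 'angular')         # angular distance ~ cosine similarity

rng = np.random.default_rng(0)
for item_id in range(1000):              # hypothetical item vectors
    index.add_item(item_id, rng.normal(size=f))

index.build(10)                          # 10 trees; more trees = better recall
print(index.get_nns_by_item(0, 10))      # 10 approximate neighbours of item 0
```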

Page 24: Music recommendations @ MLConf 2014


…So what models have we experimented with?

24

Page 25: Music recommendations @ MLConf 2014


Old school models

- Latent Semantic Analysis (LSA)
- Probabilistic Latent Semantic Analysis (PLSA)
- Latent Dirichlet Allocation (LDA)

Bag-of-words models. Need a lot of topics, and usually not very great for music recs.

25

Page 26: Music recommendations @ MLConf 2014

What about scalability of models?

When we started experimenting with latent factor models, PLSA needed at least 400 factors (topics) to give decent results. That gives at least 10 billion parameters, or way more than you could conveniently fit in RAM. So what to do? We turned to Hadoop.

26

Page 27: Music recommendations @ MLConf 2014

One iteration, one map/reduce job

“Google News Personalization: Scalable Online Collaborative Filtering”

27

Diagram: all log entries are partitioned into a K × L grid of shards keyed by (u % K, i % L). In the map step, the shard with u % K = k and i % L = l is joined with the user vectors where u % K = k and the item vectors where i % L = l; the reduce step aggregates the outputs back into updated user vectors (per u % K) and item vectors (per i % L).
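A rough sketch of the sharding idea as I read the diagram (an assumption, not the actual job): each log entry (u, i, count) is keyed by (u % K, i % L), so a single mapper only needs one shard of user vectors and one shard of item vectors in memory:

```python
# Group toy log entries into a K x L grid of shards by (u % K, i % L).
from collections import defaultdict

K, L = 4, 4                              # number of user and item shards
buckets = defaultdict(list)
log_entries = [(17, 3021, 2), (17, 99, 1), (42, 3021, 5)]   # toy (u, i, count)
for u, i, count in log_entries:
    buckets[(u % K, i % L)].append((u, i, count))

# Mapper (k, l) loads only user vectors with u % K == k and item vectors with
# i % L == l, then emits partial updates that the reduce step aggregates.
for key, entries in sorted(buckets.items()):
    print(key, entries)
```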

Page 28: Music recommendations @ MLConf 2014


Other MF models

- Collaborative Filtering for Implicit Feedback Datasets (“Koren”)
- “vector_exp”: our own model: every stream is a softmax over all tracks

Need a much more compact representation of items, typically only say 40 elements. Benefit a lot from handling the zero case separately.

28

Page 29: Music recommendations @ MLConf 2014


New trendy models

- Recurrent neural networks (RNNs)
- word2vec

Take into account the sequence of events.

Future: take into account time – maybe hidden Markov models, etc.?

29

Page 30: Music recommendations @ MLConf 2014

Power of combining models

All models have their own objective and their own biases. Combining them (with Gradient Boosted Decision Trees) yields kickass results:

30
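A sketch of the ensembling step using scikit-learn's GradientBoostingClassifier (assumed; the per-model scores and labels here are synthetic, and this is not Spotify's actual pipeline):

```python
# Combine per-model similarity scores into one ensemble with gradient boosted
# decision trees, trained on a small ground-truth set (e.g. thumbs up/down).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 5000
# Each row: scores from the individual models (LDA, RNN, Koren, PLSA, ...)
# for one (user, track) pair; random data here.
X = rng.normal(size=(n, 5))
# Label: 1 = thumbs up, 0 = thumbs down / random negative (synthetic).
y = (X @ np.array([0.5, 1.0, 0.8, 0.3, 0.2]) + rng.normal(size=n) > 0).astype(int)

ensemble = GradientBoostingClassifier(n_estimators=200, max_depth=3)
ensemble.fit(X[:4000], y[:4000])
print("held-out accuracy:", ensemble.score(X[4000:], y[4000:]))
```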

Page 31: Music recommendations @ MLConf 2014


Album cover based models

Just a fun experiment that shows that any signal (weak learner) adds value to the ensemble. Turns out it probably just works as a classifier for minimal techno. We will most likely never put this in production :)

31

Page 32: Music recommendations @ MLConf 2014

What happened with Hadoop?

Most newer models don’t need a ton of latent factors, so all parameters fit nicely in RAM. Additionally, you can do more complex things on a single machine. Lately we’ve started focusing on a combination of non-scalable models (more complex, less data) and scalable models (simple, but with more data).

Hadoop makes things “scale”, but at a ridiculous constant I/O overhead. We are in the process of moving our models to Spark instead.

32

Page 33: Music recommendations @ MLConf 2014

Orders of magnitude numbers

                      Data points   Parameters   Time to train
Single-machine model  1B            100M         10h
Hadoop model          100B          10B          10h
Spark??               100B          10B          1h

33

Page 34: Music recommendations @ MLConf 2014


What are we optimizing for?... a story of surrogate loss functions

34

Page 35: Music recommendations @ MLConf 2014

We want to optimize Spotify’s “success”

Long term business value or something similar. Problem: You only get one shot!

35

Page 36: Music recommendations @ MLConf 2014

Let’s run A/B tests

Typically: DAU (daily active users), day 2 retention, etc. A super inefficient way of collecting roughly 1 bit of information!

36

Page 37: Music recommendations @ MLConf 2014

So let’s do offline testing

Editorial judgement. “Look at the results.”

37

Page 38: Music recommendations @ MLConf 2014

The “Daft Punk Test” … why does collaborative filtering always fail?

38

LDA                     | RNN            | Koren                  | PLSA            | vector_exp
Daft Punk               | Daft Punk      | Daft Punk              | Daft Punk       | Daft Punk
Daft Punk - Stardust    | Rizzle Kicks   | Coldplay               | The PURSUIT     | Gorillaz
Raccoon                 | Daft Funk      | Gotye                  | Junior Senior   | deadmau5
Dave Droid              | La Roux        | Lana Del Rey           | Chuckie & LMFAO | Macklemore & Ryan Lewis
The Local Abilities     | Rudimental     | Of Monsters And Men    | Beatbullyz      | M83
Daft Funk               | Pacjam         | The Lumineers          | Pursuit         | Gotye
M83 VS Big Black Delta  | Su Bailey      | Green Day              | La Roux         | The xx
Leandro Dutra           | Capital Cities | John Mayer             | Fatboy Slim     | Calvin Harris
Huw Costin              | YYZ            | Foster The People      | Chase & Drive   | Kavinsky
Jesús Alonso            | Various, WMGA  | Florence + The Machine | Knivez Out      | Coldplay

Page 39: Music recommendations @ MLConf 2014

Wait maybe machines can evaluate things?

Sure! We just need a ground truth data set. Use things like thumbs, skips, editorial data sets. Note that thumbs etc. have observation bias. It doesn’t have to be high volume; a few thousand data points is enough. We can also optimize for this using e.g. GBDTs.

39

Page 40: Music recommendations @ MLConf 2014

Again, GBDTs are pretty cool:

40

Page 41: Music recommendations @ MLConf 2014

Ensemble workflow

41

Diagram: Model 1, Model 2, Model 3, ..., Model n feed into a gradient boosted decision tree trained on thumbs and editorial data sets; the resulting combined model is cross-validated and checked against offline metrics before going to production.

Page 42: Music recommendations @ MLConf 2014

This One Weird Trick Sort of Fixes Observation Bias

Augment the data set with lots of random negative data. Works well in practice.

42

Illustration: labeled + and − data points from earlier batches plotted in parameter space (parameter 1 vs. parameter 2) around the current best estimate.
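A sketch of the trick (assumed implementation; the function and data are hypothetical): sample random (user, track) pairs and label them as negatives alongside the observed positives:

```python
# Augment a ground-truth set with random negatives to counter the bias that
# we only observe feedback on tracks we already chose to recommend.
import random

def add_random_negatives(positives, all_track_ids, ratio=5, seed=0):
    """Return positives (label 1) plus `ratio` random negatives (label 0) each."""
    rng = random.Random(seed)
    negatives = []
    for user_id, _track_id in positives:
        for _ in range(ratio):
            negatives.append((user_id, rng.choice(all_track_ids), 0))
    return [(u, t, 1) for u, t in positives] + negatives

positives = [(17, 3021), (17, 99), (42, 3021)]   # hypothetical thumbs-up pairs
dataset = add_random_negatives(positives, all_track_ids=list(range(10000)))
print(len(dataset), "labeled examples")
```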

Page 43: Music recommendations @ MLConf 2014

What have we learned so far?

- Figuring out what to optimize for is hard
- Combining lots of models really helps
- Large scale algorithms are great, but not everything has to scale

43

Page 44: Music recommendations @ MLConf 2014

So what are we working on now?

Combine even more signals:
- Content-based methods: use audio, lyrics, images
- Read about music and understand it
- Personalize everything
- Just acquired Echo Nest in Boston!

44