© 2009 IBM Corporation Extracting User Profiles from Large Scale Data Joint work with Michal...


Extracting User Profiles from Large Scale Data

Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass

IBM Haifa Research Lab

David Konopnicki


Motivating Example

san-francisco peer michael jackson alive

analysis

User Browsing

Large-scale content analysis for a very large number of users.

Update user profiles

Keywords Modeling: for each user, report the most meaningful keywords to describe her profile.

Profiles database

Track statistics about readers' interests

Dashboard

Advertisement System


Contributions

User Profiling Framework:

– User profile model
– KL approach to weight user profile

Large scale implementation:

– MapReduce flow

Experiments:

– Quality analysis
– Scalability analysis


User Profiling Framework- Setting

<userID, docID>

<u1,d1>

<u1,d2>

<u2,d2>

<docID, content>

<d1,{bla,bla,bla}>

<d2,{foo,foo}>

logging targeting
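The two logs above can be turned into per-user document sets with a simple grouping step. The following is a minimal sketch using the toy records from this slide; the function name `build_user_snapshots` and the choice to treat every logged document as part of the community set are assumptions made for illustration, not part of the original framework.

```python
from collections import defaultdict

# Toy logs copied from the slide: <userID, docID> and <docID, content>.
browsing_log = [("u1", "d1"), ("u1", "d2"), ("u2", "d2")]
documents = {"d1": ["bla", "bla", "bla"], "d2": ["foo", "foo"]}


def build_user_snapshots(log):
    """Group the <userID, docID> pairs into one document set per user."""
    snapshots = defaultdict(set)
    for user_id, doc_id in log:
        snapshots[user_id].add(doc_id)
    return dict(snapshots)


user_snapshots = build_user_snapshots(browsing_log)

# Assumption: the community set is the union of all user document sets.
community_snapshot = set().union(*user_snapshots.values())
```

Here `user_snapshots["u1"]` holds both documents, while `u2` has only `d2`.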


User Profiling - Definitions

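Vocabulary: $V$

Bag of words model (BOW): a user profile is a vector of term weights

$$p(u) = \big(w_j^u(t_1),\, w_j^u(t_2),\, \ldots,\, w_j^u(t_m)\big), \quad t_i \in V$$

Profile maintenance:

$$p_j(u) = \alpha_j\, \tilde{p}_j(u) + (1 - \alpha_j)\, p_{j-1}(u)$$

where $\alpha_j$ denotes the mixing coefficient between the profile computed from the current snapshot and the previous profile.

User snapshot: $D_j(u)$

Community snapshot: $D_j$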


User Profiling - Intuition

Find terms that are highly frequent in the user snapshot and that best separate the user snapshot from the community snapshot

{ Travel, Tennis ,Sport }


User Profiling – Naïve approach

Term frequency: the number of times a term t appears in document d – tf(t,d)

Document frequency: the number of documents containing the term t – df(t,D)

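$$w_j^u(t) = \overline{tf}(t, D_j(u)) \cdot idf(t, D_j) \cdot udf(t, D_j(u))$$

– $\overline{tf}(t, D_j(u))$: average tf over the user snapshot
– $idf(t, D_j)$: inverse document frequency (df) of a term in the community snapshot
– $udf(t, D_j(u))$: probability to find a term in the user snapshot

The tf and udf factors favor frequent terms; the idf factor favors terms that separate the user from the community.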
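The naïve weight can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name `naive_weights`, the idf form log(|D| / df) and the toy documents are assumptions, and the user documents are assumed to be a subset of the community documents so that every user term has df > 0.

```python
import math
from collections import Counter


def naive_weights(user_docs, community_docs):
    """Score each user term as avg-tf(user) * idf(community) * udf(user).

    Both arguments are lists of token lists. The idf form log(|D| / df)
    is an assumption; the slide does not spell it out.
    """
    n_user, n_comm = len(user_docs), len(community_docs)
    tf_sum = Counter()   # total occurrences of t in the user snapshot
    udf_cnt = Counter()  # number of user documents containing t
    for doc in user_docs:
        tf_sum.update(doc)
        udf_cnt.update(set(doc))
    df = Counter()       # number of community documents containing t
    for doc in community_docs:
        df.update(set(doc))
    weights = {}
    for t in tf_sum:
        avg_tf = tf_sum[t] / n_user
        idf = math.log(n_comm / df[t])
        udf = udf_cnt[t] / n_user
        weights[t] = avg_tf * idf * udf
    return weights


# Hypothetical toy data: the user read the first two community documents.
community_docs = [["travel", "tennis"], ["tennis", "news"],
                  ["news", "politics"], ["news"]]
user_docs = community_docs[:2]
weights = naive_weights(user_docs, community_docs)
```

In this toy data "tennis" gets the highest weight because it appears in every user document but in only half of the community documents, while "news" is penalized for being common in the community.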


Kullback-Leibler (KL) Divergence

Measures the difference between two probability distributions

$P_1$ and $P_2$:

$$D_{KL}(P_1 \,\|\, P_2) = \sum_{t \in V} P_1(t) \log \frac{P_1(t)}{P_2(t)}$$

KL measures the distance between the Community distribution and the User distribution

Each term is scored according to its contribution to the KL distance between the community and the user distributions.

The top-scored terms are then selected as the user's important terms.

User / Community
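The divergence and its per-term contributions can be computed directly from two term distributions. The sketch below is illustrative only: the function names and the toy probabilities are assumptions, and the user distribution is taken as the first argument and the community distribution as the second.

```python
import math


def kl_contributions(p1, p2):
    """Per-term summands P1(t) * log(P1(t) / P2(t)), in nats.

    p1 and p2 map each term to its probability. Terms with P1(t) = 0
    contribute nothing; P2(t) must be positive wherever P1(t) is.
    """
    return {t: p * math.log(p / p2[t]) for t, p in p1.items() if p > 0}


def kl_divergence(p1, p2):
    """D_KL(P1 || P2) as the sum of the per-term contributions."""
    return sum(kl_contributions(p1, p2).values())


# Hypothetical distributions over a three-term vocabulary.
user = {"travel": 0.5, "tennis": 0.4, "news": 0.1}
community = {"travel": 0.1, "tennis": 0.1, "news": 0.8}

scores = kl_contributions(user, community)
best_term = max(scores, key=scores.get)
```

Terms that the user reads far more often than the community get large positive contributions; terms that the user reads less often than the community get negative ones.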


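User Profiling – KL method

Community marginal term distribution:

$$P(t \mid D_j) = \frac{\overline{tf}(t, D_j) \cdot cdf(t, D_j)}{N_j}$$

– $\overline{tf}(t, D_j)$: average tf over the community snapshot
– $cdf(t, D_j)$: probability to find a term t in the community snapshot
– $N_j$: probability normalization factor

User marginal term distribution:

$$P(t \mid D_j(u)) = (1 - \lambda)\, \frac{w_j^u(t)}{\sum_{t' \in V} w_j^u(t')} + \lambda\, P(t \mid D_j)$$

– $w_j^u(t) / \sum_{t' \in V} w_j^u(t')$: relative initial weight of term t
– $\lambda\, P(t \mid D_j)$: smoothing with the community snapshot, with $\lambda = 0.001$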
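Putting the two distributions together, a term's KL score can be sketched as follows. This is a minimal illustration under stated assumptions: all function names and toy inputs are hypothetical, the smoothing parameter is called `lam` and set to 0.001 as on the slide, and every vocabulary term is assumed to have a positive community probability.

```python
import math


def community_distribution(avg_tf, cdf):
    """P(t|D) = avg_tf(t) * cdf(t) / N, with N chosen so the values sum to 1."""
    raw = {t: avg_tf[t] * cdf[t] for t in avg_tf}
    n = sum(raw.values())
    return {t: v / n for t, v in raw.items()}


def user_distribution(w, p_comm, lam=0.001):
    """Relative initial weights of the user, smoothed with the community."""
    total = sum(w.values())
    return {t: (1 - lam) * w.get(t, 0.0) / total + lam * p_comm[t]
            for t in p_comm}


def kl_term_scores(p_user, p_comm):
    """Contribution of each term to D_KL(user || community)."""
    return {t: p_user[t] * math.log(p_user[t] / p_comm[t]) for t in p_comm}


def top_terms(scores, k):
    """The k highest-scoring terms, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical community statistics and initial user weights.
avg_tf = {"travel": 1.0, "tennis": 1.0, "news": 2.0}
cdf = {"travel": 0.1, "tennis": 0.2, "news": 0.9}
w = {"travel": 0.35, "tennis": 0.69}

p_comm = community_distribution(avg_tf, cdf)
p_user = user_distribution(w, p_comm)
scores = kl_term_scores(p_user, p_comm)
profile = top_terms(scores, 2)
```

Because of the smoothing term, "news" gets a small non-zero user probability even though the user has no initial weight for it, so the logarithm is always defined.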


MapReduce Flow

Intermediate results are stored on HDFS between jobs.

Job |Dj(u)|:
– Mapper: input (u,d), output (u,1)
– Reducer: output (u,|Dj(u)|) // Sum

Job TF:
– Mapper: input (d,text), output ({t,d},1)
– Reducer: output ({t,d}, tf(t,d)) // Sum

Job DF:
– Mapper: input ({t,d}, tf(t,Dj)), output (t,1)
– Reducer: output (t, {df(t,Dj), idf(t,Dj), cdf(t,Dj)})

Job T̄F:
– Mapper: input (t, tf(t,d), |Dj|), output (t, {tf(t,d), |Dj|, 1})
– Reducer: output (t, tf(t,Dj)) // Avg

Job Nj:
– Mapper: input ({t}, {tf(t,Dj), cdf(t,Dj)}), output (t, Nj)
– Reducer: identity

Job P(t|Dj):
– Mapper: input ({t}, {tf(t,Dj), |Dj|, cdf(t,Dj), Nj}), output (t, P(t|Dj))
– Reducer: identity

Job UDF:
– Mapper: input ({u,t,d}, {tf(t,Dj(u)), |Dj(u)|}), output ({u,t,|Dj(u)|}, {1})
– Reducer: output ({u,t}, {udf(t,Dj(u))})

Records passed between jobs:

<d, text>
<u, d>
<d, t, tf(t,d)>
<u, d, |Dj(u)|>
<u, t, d, tf(t,d), |Dj(u)|>
<d, t, tf(t,d), |Dj|>
<t, tf(t,Dj), |Dj|>
<t, tf(t,Dj), df(t,Dj), |Dj|, idf(t,Dj), cdf(t,Dj)>
<t, tf(t,Dj), cdf(t,Dj)>
<t, Nj>
<t, tf(t,Dj), cdf(t,Dj), Nj>
<t, P(t|Dj)>
<u, t, udf(t,Dj(u))>
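The mapper/reducer pattern used throughout this flow can be tried out without a cluster. The sketch below runs one map, shuffle and reduce round in memory for the |Dj(u)| and TF jobs; it is a stand-in for Hadoop, not Hadoop code. The helper `run_job` and the function names are hypothetical, documents are tokenized by whitespace, and each (u,d) pair is assumed to appear once in the log.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map/shuffle/reduce round in memory (a stand-in for Hadoop)."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, vs) for k, vs in groups.items()}


# Job |Dj(u)|: input (u, d) -> map (u, 1) -> reduce by summing.
def snapshot_size_mapper(user, doc):
    yield user, 1


# Job TF: input (d, text) -> map ({t, d}, 1) -> reduce by summing.
def tf_mapper(doc, text):
    for term in text.split():
        yield (term, doc), 1


def sum_reducer(key, values):
    return sum(values)


# Toy records from the setting slide.
snapshot_sizes = run_job([("u1", "d1"), ("u1", "d2"), ("u2", "d2")],
                         snapshot_size_mapper, sum_reducer)
term_freqs = run_job([("d1", "bla bla bla"), ("d2", "foo foo")],
                     tf_mapper, sum_reducer)
```

The remaining jobs follow the same shape: only the key emitted by the mapper and the aggregation in the reducer change.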


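MapReduce Flow – cont.

Intermediate results are stored on HDFS between jobs.

Job w (initial weight):
– Input: <u, t, d, tf(t,d), udf(t,Dj(u)), idf(t,Dj)>
– Output: <u, t, w>

$$w_j^u(t) = \overline{tf}(t, D_j(u)) \cdot idf(t, D_j) \cdot udf(t, D_j(u))$$

Job ∑w (sum of the weights over the vocabulary):
– Input: <u, t, w>
– Output: <u, t, w, P(t|Dj), ∑w>

Job P(t|Dj(u)):
– Input: <u, t, w, P(t|Dj), ∑w>
– Output: <u, t, P(t|Dj), P(t|Dj(u))>

$$P(t \mid D_j(u)) = (1 - \lambda)\, \frac{w_j^u(t)}{\sum_{t' \in V} w_j^u(t')} + \lambda\, P(t \mid D_j)$$

Job w̃ (KL weight):
– Input: <u, t, P(t|Dj), P(t|Dj(u))>
– Output: <u, t, w̃>

$$\tilde{w}_j^u(t) = P(t \mid D_j(u)) \log \frac{P(t \mid D_j(u))}{P(t \mid D_j)}$$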


Experimental Data- quality analysis

Open Directory Project (ODP):

– Categories are associated with manual labels
– Considered as “ground-truth” in this work
– Examples:
  • ODP: Science/Technology/Electronics: Manual label: “Electronics”
  • ODP: Society/Religion/and/Spirituality/Buddhism: Manual label: “Buddhism”

Data Collection:

– 100 different categories randomly selected from ODP
– 100 documents randomly selected per category
– A total collection size of about 10,000 Web pages

Evaluation:

– A suggested label counts as a match if it is identical to the manual label, an inflection of it, or a WordNet synonym of it


Results

[Figure: Match@K (0 to 1) versus K (1 to 10) for Mutual Information, χ², tf-udf-idf and KL]

Match@K: the fraction of cases in which at least one of the top-K terms is correct.

KL outperforms all other approaches for feature selection
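The Match@K measure can be sketched as below. This is a simplified illustration, not the evaluation code behind the reported results: the function names are hypothetical, and the matcher only checks case-insensitive equality, whereas the evaluation described in these slides also accepts inflections and WordNet synonyms.

```python
def is_match(term, label):
    """Simplified matcher: case-insensitive equality only.

    Inflection and WordNet synonym checks are left out of this sketch.
    """
    return term.lower() == label.lower()


def match_at_k(ranked_terms_by_category, k):
    """Fraction of categories with at least one match in the top-k terms."""
    hits = sum(
        any(is_match(t, label) for t in terms[:k])
        for label, terms in ranked_terms_by_category.items()
    )
    return hits / len(ranked_terms_by_category)


# Two rows taken from the KL results table in this deck.
ranked = {
    "Buddhism": ["Buddhist", "Buddhism", "Buddha", "Zen", "dharma"],
    "Ice Hockey": ["hockey", "nhl", "hockey league", "coach", "head coach"],
}
score_at_1 = match_at_k(ranked, 1)
score_at_2 = match_at_k(ranked, 2)
```

With this strict matcher "Buddhism" is matched from K = 2 onward, while "Ice Hockey" is never matched; a matcher that also accepts inflections and synonyms would be more lenient.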

ODP Category Label    Top-5 KL important terms
Bowling               bowl, bowler, lane, bowl center, league
Buddhism              Buddhist, Buddhism, Buddha, Zen, dharma
Ice Hockey            hockey, nhl, hockey league, coach, head coach
Electronics           voltage, high voltage, circuit, laser, power supply


Experimental Data- scalability analysis: Blogger.com

Data Collection:

– We crawled 973,518 blog posts from March 2007 until January 2009
– Total collection size of 5.45GB, with ~120,000 users

Cluster setting:

– Cluster of 4 commodity machines (each machine with 4GB RAM, 60GB HD, 4 cores)
– Hadoop 0.20.1

Blog entry

http://grannyalong.blogspot.com/


Number of User Profiles

[Figure: document ratio, user profile ratio and time ratio (y-axis: ratio, 1 to 13) for datasets #2, #3 and #4]

The runtime ratio is correlated with the user profile ratio


Data Size

[Figure: running time in minutes (30 to 120) versus number of documents in thousands (40 to 200)]

Runtime increases linearly with data size

Users: 18,000 users chosen from March to April 2007


Related Work

Content-based user profiling:

– The profile contains a taxonomic hierarchy for the long-term model. The taxonomy is taken from the ODP. Short-term activities update the hierarchy.
– Adaptive user profile: uses words that appear in the Web pages and combines them using tf-idf over a window of browsing history, giving different weights according to the recency of the browsing.

KL approach to user tasks:

– Filter new documents that are not related to the user, based on his profile.
– Annotate a URL with the most descriptive query term for a given user, based on his profile.

User targeting in large-scale systems:

– Behavioral targeting system over Hadoop MapReduce.
– Large-scale CF technique for movie recommendations for users.
– Incremental algorithm to construct a user profile based on monitoring and user feedback, which trades off complexity against the quality of the profile.


Conclusions & Future Work

We proposed a scalable user profiling solution, implemented on top of Hadoop MapReduce

We showed quality and scalability results

We plan to extend the user model into a semantic model

We plan to extend the user profile to include structured data
