© 2009 IBM Corporation
Extracting User Profiles from Large Scale Data
Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass
IBM Haifa Research Lab
David Konopnicki
Motivating Example
san-francisco peer michael jackson alive
analysis
User Browsing
Large-scale content analysis for a massive number of users.
Update user profiles
Keywords Modeling: for each user, report the most meaningful keywords describing her profile.
Profiles database
Track statistics about readers' interests
Dashboard
Advertisement System
Contributions
User Profiling Framework:
– User profile model
– KL approach to weight user profile
Large scale implementation:
– MapReduce flow
Experiments:
– Quality analysis
– Scalability analysis
User Profiling Framework- Setting
<userID, docID>
<u1,d1>
<u1,d2>
<u2,d2>
<docID, content>
<d1,{bla,bla,bla}>
<d2,{foo,foo}>
logging targeting
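A minimal sketch of how the two logged relations could be joined into per-user document sets, assuming small in-memory Python structures. The names `visits`, `docs` and `build_snapshots` are illustrative and do not come from the original system.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two logged relations.
visits = [("u1", "d1"), ("u1", "d2"), ("u2", "d2")]          # <userID, docID>
docs = {"d1": ["bla", "bla", "bla"], "d2": ["foo", "foo"]}   # <docID, content>


def build_snapshots(visits, docs):
    """Group the browsed documents by user: {user: {docID: tokens}}."""
    snapshots = defaultdict(dict)
    for user, doc_id in visits:
        snapshots[user][doc_id] = docs[doc_id]
    return dict(snapshots)


user_snapshots = build_snapshots(visits, docs)
```

Here `user_snapshots["u1"]` holds both `d1` and `d2`, while `user_snapshots["u2"]` holds only `d2`.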
User Profiling - Definitions
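Bag of words model (BOW) over the vocabulary $V$:
$p_j(u) = \big(w_j^u(t_1), w_j^u(t_2), \ldots, w_j^u(t_m)\big), \quad t_i \in V$
Community snapshot: $D_j$
User snapshot: $D_j(u)$
Profile maintenance, with mixing coefficient $\alpha_j \in [0,1]$:
$p_j(u) = \alpha_j \, p_{j-1}(u) + (1 - \alpha_j) \, \tilde{p}_j(u)$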
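A minimal sketch of the profile-maintenance step, read here as a convex combination of the previous profile and the weights computed from the current snapshot. The coefficient name `alpha` and the function `update_profile` are assumptions made for illustration.

```python
def update_profile(previous, current, alpha):
    """Blend the previous profile with the current snapshot's weights.

    previous, current: dicts mapping term -> weight.
    alpha: assumed weight in [0, 1] given to the previous profile.
    """
    terms = set(previous) | set(current)
    return {
        t: alpha * previous.get(t, 0.0) + (1.0 - alpha) * current.get(t, 0.0)
        for t in terms
    }


old = {"travel": 0.6, "tennis": 0.4}
new = {"tennis": 0.5, "sport": 0.5}
blended = update_profile(old, new, alpha=0.5)
```

Terms missing from one side are treated as having weight 0, so a term that stops appearing decays gradually instead of disappearing at once.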
User Profiling - Intuition
Find terms that are highly frequent in the user snapshot and that best separate the user snapshot from the community snapshot
{ Travel, Tennis ,Sport }
User Profiling – Naïve approach
Term frequency tf(t,d): the number of times a term t appears in document d
Document frequency df(t,D): the number of documents containing the term t
$w_j^u(t) = tf(t, D_j(u)) \cdot idf(t, D_j) \cdot udf(t, D_j(u))$
– tf(t, D_j(u)): average tf over the user snapshot
– idf(t, D_j): inverse document frequency (df) of a term in the community snapshot
– udf(t, D_j(u)): probability to find a term in the user snapshot
The tf factor rewards frequent terms; the idf factor rewards terms that separate the user from the community.
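A minimal sketch of the naive weighting for a single user, assuming each document is a list of tokens and that the user's documents are part of the community snapshot. The idf form `log(|D| / df)` is an assumption, since the slides do not spell it out, and `naive_weights` is an illustrative name.

```python
import math
from collections import Counter


def naive_weights(user_docs, community_docs):
    """tf * idf * udf weights for one user.

    user_docs, community_docs: lists of token lists.
    Assumes every user document also appears in community_docs,
    so each user term has a community df of at least 1.
    """
    n_user = len(user_docs)
    n_comm = len(community_docs)
    tf_sum = Counter()    # total occurrences of t in the user snapshot
    user_df = Counter()   # number of user documents containing t
    for doc in user_docs:
        tf_sum.update(doc)
        user_df.update(set(doc))
    comm_df = Counter()   # number of community documents containing t
    for doc in community_docs:
        comm_df.update(set(doc))

    weights = {}
    for t in tf_sum:
        avg_tf = tf_sum[t] / n_user            # average tf over the user snapshot
        udf = user_df[t] / n_user              # share of user documents with t
        idf = math.log(n_comm / comm_df[t])    # assumed idf definition
        weights[t] = avg_tf * idf * udf
    return weights


community = [["tennis", "travel"], ["news", "travel"], ["news"], ["news", "sport"]]
user = [community[0]]
weights = naive_weights(user, community)
```

In this toy community "tennis" occurs in one document and "travel" in two, so "tennis" receives the larger weight for this user.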
Kullback-Leibler (KL) Divergence
Measures the difference between two probability distributions P1 and P2:
$D_{KL}(P_1 \| P_2) = \sum_{t \in V} P_1(t) \log \frac{P_1(t)}{P_2(t)}$
KL measures the distance between the Community distribution and the User distribution
Each term is scored according to its contribution to the KL distance between the community and the user distributions.
The top scored terms are then selected as the user important terms.
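A minimal sketch of scoring terms by their contribution to the KL divergence and keeping the top-scored ones. It takes P1 as the user distribution and P2 as the community distribution, which is an assumption about the direction. The function names and the toy distributions are illustrative.

```python
import math


def kl_term_scores(p_user, p_comm):
    """Per-term contribution P_u(t) * log(P_u(t) / P_c(t)) to D_KL(user || community)."""
    return {
        t: p * math.log(p / p_comm[t])
        for t, p in p_user.items()
        if p > 0.0  # terms with zero user probability contribute nothing
    }


def top_k_terms(p_user, p_comm, k):
    """Return the k terms with the highest KL contribution."""
    scores = kl_term_scores(p_user, p_comm)
    return sorted(scores, key=scores.get, reverse=True)[:k]


p_comm = {"news": 0.5, "travel": 0.3, "tennis": 0.1, "sport": 0.1}
p_user = {"news": 0.1, "travel": 0.4, "tennis": 0.4, "sport": 0.1}
top = top_k_terms(p_user, p_comm, k=2)
divergence = sum(kl_term_scores(p_user, p_comm).values())
```

Terms the user reads less than the community (here "news") get a negative score, so they never reach the top of the ranking.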
User Profiling – KL method
Community marginal term distribution:
$P(t \mid D_j) = \frac{tf(t, D_j) \cdot cdf(t, D_j)}{N_j}$
– tf(t, D_j): average tf over the community snapshot
– cdf(t, D_j): probability to find a term t in the community snapshot
– N_j: probability normalization factor
User marginal term distribution:
$P(t \mid D_j(u)) = (1 - \lambda) \frac{w_j^u(t)}{\sum_{t' \in V} w_j^u(t')} + \lambda \, P(t \mid D_j)$
– first term: relative initial weight of term t
– second term: smoothing with the community snapshot, $\lambda = 0.001$
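A minimal sketch of the two marginal distributions, assuming the normalization factor N_j is simply the sum of the unnormalized community values and that the smoothing parameter (called `lam` here) is 0.001 as on the slide. The function names and the toy inputs are illustrative.

```python
def community_distribution(avg_tf, cdf):
    """P(t|D_j) = tf(t, D_j) * cdf(t, D_j) / N_j, with N_j making the values sum to 1."""
    raw = {t: avg_tf[t] * cdf[t] for t in avg_tf}
    n_j = sum(raw.values())
    return {t: value / n_j for t, value in raw.items()}


def user_distribution(weights, p_comm, lam=0.001):
    """Smooth the user's relative initial weights with the community distribution."""
    total = sum(weights.values())
    return {
        t: (1.0 - lam) * weights.get(t, 0.0) / total + lam * p_comm[t]
        for t in p_comm
    }


p_comm = community_distribution(
    avg_tf={"news": 2.0, "travel": 1.0, "tennis": 1.0},
    cdf={"news": 0.75, "travel": 0.5, "tennis": 0.25},
)
p_user = user_distribution({"tennis": 3.0, "travel": 1.0}, p_comm)
```

The smoothing gives every community term a small non-zero user probability, which keeps the logarithm in the KL score defined for terms the user never saw.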
MapReduce Flow
Input (HDFS): <d, text> and <u, d>
Job |Dj(u)|:
– Mapper: input: (u,d); output: (u,1)
– Reducer: output: (u,|Dj(u)|) // Sum
Job TF:
– Mapper: input: (d,text); output: ({t,d},1)
– Reducer: output: ({t,d},tf(t,d)) // Sum
Job DF:
– Mapper: input: ({t,d},tf(t,Dj)); output: (t,1)
– Reducer: output: (t,{df(t,Dj),idf(t,Dj),cdf(t,Dj)})
Job T̄F:
– Mapper: input: (t,tf(t,d),|Dj|); output: (t,{tf(t,d),|Dj|,1})
– Reducer: output: (t,tf(t,Dj)) // Avg
Job Nj:
– Mapper: input: ({t},{tf(t,Dj),cdf(t,Dj)}); output: (t,Nj)
– Reducer: identity
Job P(t|Dj):
– Mapper: input: ({t},{tf(t,Dj),|Dj|,cdf(t,Dj),Nj}); output: (t,P(t|Dj))
– Reducer: identity
Job UDF:
– Mapper: input: ({u,t,d},{tf(t,Dj(u)),|Dj(u)|}); output: ({u,t,|Dj(u)|},{1})
– Reducer: output: ({u,t},{udf(t,Dj(u))})
Records passed between jobs through HDFS:
– <t, d, tf(t,d)>
– <u, d, |Dj(u)|>
– <u, t, d, tf(t,d), |Dj(u)|>
– <t, d, tf(t,d), |Dj|>
– <t, tf(t,Dj), |Dj|>
– <t, tf(t,Dj), df(t,Dj), |Dj|, idf(t,Dj), cdf(t,Dj)>
– <t, tf(t,Dj), cdf(t,Dj)>
– <t, Nj>
– <t, tf(t,Dj), cdf(t,Dj), Nj>
– <t, P(t|Dj)>
– <u, t, udf(t,Dj(u))>
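A minimal in-memory sketch of two of the jobs (TF and |Dj(u)|), meant only to illustrate the mapper and reducer contracts. The helper `run_job` is a hypothetical stand-in for a Hadoop job and is not part of any Hadoop API. Whitespace tokenization is also an assumption.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map / shuffle / reduce round in memory."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


# TF job: (d, text) -> ({t, d}, 1) -> ({t, d}, tf(t, d))
def tf_mapper(record):
    doc_id, text = record
    for term in text.split():
        yield (term, doc_id), 1


# |Dj(u)| job: (u, d) -> (u, 1) -> (u, |Dj(u)|)
def user_mapper(record):
    user, _doc_id = record
    yield user, 1


def sum_reducer(key, values):
    return key, sum(values)


tf_out = run_job([("d1", "bla bla bla"), ("d2", "foo foo")], tf_mapper, sum_reducer)
docs_per_user = run_job(
    [("u1", "d1"), ("u1", "d2"), ("u2", "d2")], user_mapper, sum_reducer
)
```

Both jobs share the same summing reducer; only the key emitted by the mapper changes.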
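MapReduce Flow- cont.
Job w:
– Input: <u, t, d, tf(t,d), udf(t,Dj(u)), idf(t,Dj)>
– $w_j^u(t) = tf(t, D_j(u)) \cdot idf(t, D_j) \cdot udf(t, D_j(u))$
– Output: <u, t, w>
Job ∑w:
– Input: <u, t, w>
– Output: <u, t, w, $\sum_{t \in V} w$>
Job P(t|Dj(u)):
– Input: <u, t, w, $\sum_{t \in V} w$, P(t|Dj)>
– $P(t \mid D_j(u)) = (1 - \lambda) \frac{w_j^u(t)}{\sum_{t' \in V} w_j^u(t')} + \lambda \, P(t \mid D_j)$
– Output: <u, t, P(t|Dj), P(t|Dj(u))>
Job w̃:
– Input: <u, t, P(t|Dj), P(t|Dj(u))>
– $\tilde{w}_j^u(t) = P(t \mid D_j(u)) \log \frac{P(t \mid D_j(u))}{P(t \mid D_j)}$
– Output: <u, t, w̃>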
Experimental Data- quality analysis
Open Directory Project (ODP):
– Categories are associated with manual labels
– Considered as “ground-truth” in this work
– Examples:
• ODP: Science/Technology/Electronics: Manual label: “Electronics”
• ODP: Society/Religion/and/Spirituality/Buddhism: Manual label: “Buddhism”
Data Collection:
– 100 different categories randomly selected from ODP
– 100 documents randomly selected per category
– A total collection size of about 10,000 Web pages
Evaluation:
– A match is counted if the suggested label is identical to, an inflection of, or a WordNet synonym of the manual label
Results
[Figure: Match@K (y-axis, 0 to 1) versus K (x-axis, 1 to 10) for four methods: Mutual Information, χ², tf-udf-idf, KL]
Match@K: the fraction of cases in which at least one of the top-K terms is correct.
KL outperforms all other approaches for feature selection
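A minimal sketch of the Match@K measure, assuming each category comes with a ranked list of suggested terms. The `synonyms` argument is a hypothetical stand-in for the inflection and WordNet lookups, which are not implemented here. The sample rows reuse example terms from this deck.

```python
def is_match(term, label, synonyms=None):
    """Case-insensitive match of a suggested term against the manual label.

    synonyms: optional dict mapping a label to its accepted variants
    (hypothetical replacement for inflection and WordNet checks).
    """
    accepted = {label.lower()}
    if synonyms:
        accepted |= {s.lower() for s in synonyms.get(label, ())}
    return term.lower() in accepted


def match_at_k(ranked_terms_by_category, k, synonyms=None):
    """Fraction of categories with at least one matching term in the top k."""
    hits = sum(
        any(is_match(t, label, synonyms) for t in terms[:k])
        for label, terms in ranked_terms_by_category.items()
    )
    return hits / len(ranked_terms_by_category)


ranked = {
    "Buddhism": ["Buddhist", "Buddhism", "Buddha", "Zen", "dharma"],
    "Electronics": ["voltage", "high voltage", "circuit", "laser", "power supply"],
}
score_at_2 = match_at_k(ranked, k=2)
```

With exact matching only, "Buddhism" is matched at K=2 and "Electronics" is never matched, so `score_at_2` is 0.5. Passing `synonyms={"Buddhism": ["Buddhist"]}` would already count a match at K=1.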
ODP Category Label | Top-5 KL important terms
Bowling | bowl, bowler, lane, bowl center, league
Buddhism | Buddhist, Buddhism, Buddha, Zen, dharma
Ice Hockey | hockey, nhl, hockey league, coach, head coach
Electronics | voltage, high voltage, circuit, laser, power supply
Experimental Data- scalability analysis
Blogger.com
Data Collection:
– We crawled 973,518 blog posts from March 2007 until January 2009
– Total collection size of 5.45GB, with ~120,000 users
Cluster setting:
– 4-node cluster of commodity machines (each machine with 4GB RAM, 60GB HD, 4 cores)
– Hadoop 0.20.1
Blog entry example: http://grannyalong.blogspot.com/
Number of User Profiles
[Figure: document ratio, user profile ratio, and time ratio (y-axis: Ratio, 1 to 13) for datasets #2, #3, #4]
The runtime ratio is correlated with the user profile ratio
Data Size
[Figure: running time [min.] (y-axis, 30 to 120) versus number of documents [thousands] (x-axis, 40 to 200)]
Running time increases linearly with data size
Users: 18,000 users chosen from March-April 2007
Related Work
Content-based user profiling:
– Profile contains a taxonomic hierarchy for the long-term model. The taxonomy is taken from the ODP. Short-term activities update the hierarchy.
– Adaptive user profile: use words that appear in the Web pages and combine them using tf-idf, looking at a time window and giving different weights according to the recency of the browsing.
KL approach to user tasks:
– Filter new documents that are not related to the user based on his profile.
– Annotate a URL with the most descriptive query term for a given user, based on his profile.
User targeting in large-scale systems:
– Behavioral targeting system over Hadoop MapReduce.
– Large scale CF technique for movie recommendations for users.
– Incremental algorithm to construct a user profile based on monitoring and user feedback, which trades off between complexity and quality of the profile.
Conclusions & Future Work
We proposed a scalable user profiling solution implemented on top of Hadoop MapReduce
We showed quality and scalability results
We plan to extend the user model into a semantic model
Extend the user profile to include structured data
Thank You !