Adaptive Information Filtering Lanbo Zhang (ISSDM fellow) Yi Zhang (UCSC advisor) Carla Kuiken (LANL mentor)

Outline

• Introduction
• Our Research
  – Interactive Retrieval Based on Faceted Feedback (SIGIR 2010)
  – Discriminative Factored Prior Models for Personalized Content-Based Recommendation (CIKM 2010)
• Future Work

Why Filtering?

• In some cases, users want to persistently track certain kinds of information on the Internet
  – CDC (Centers for Disease Control and Prevention) personnel
    • News reports about H1N1
  – Physicians
    • New treatments for a disease
  – FBI investigators
    • Potential terrorist threats
  – Financial analysts
    • News that may influence a stock
• For these tasks, search engines that require users to actively issue queries are not enough

We need an intelligent system that can PUSH the information we want to us whenever it becomes available!

Adaptive Information Filtering

• The central task
  – Identify the relevant documents from a document stream

The Cold-Start Problem

• Filtering performance for new users is usually poor because these users have not yet provided enough training data (user feedback)
• We follow two directions to handle this problem
  – Explore new user interaction mechanisms to encourage more user feedback
  – Research advanced filtering models that can borrow information from other users for new users

Outline

• Introduction
• Our Research
  – Direction 1: A New User Feedback Mechanism
    • Faceted Feedback
  – Direction 2: A New Filtering Model
    • Discriminative Factored Prior Model
• Future Work

Semi-Structured Documents

• Semi-structured documents with metadata are proliferating on the Internet
  – Authors, Topic, Publisher, Created Time, etc.
  – Metadata might be useful for filtering

Example from the New York Times, showing human-assigned metadata and algorithm-generated metadata

Definitions

• Facet
  – Each metadata field is called a facet
  – E.g., Date, Topic, Location, Author, etc.
• Facet-Value Pair
  – A metadata field with a specific value is called a facet-value pair
  – E.g., Publisher = New York Times

Faceted Feedback

• Traditional User Feedback Mechanism
  – Allows users to provide feedback on the relevance of documents
    • Doc1: Relevant
    • Doc2: Non-relevant
• Faceted Feedback
  – Allows users to provide feedback on facet-value pairs
  – Each facet-value pair represents a constraint on the desired documents
    • Topic = FIFA World Cup: Yes
    • Year = 2010: Yes
    • Year = 2006: No

Why Faceted Feedback

• Users may have clear ideas on some facets of the target documents
  – “FIFA World Cup”
    • Year = 2010
• May encourage user feedback
  – Facet-value pairs are short and easy to understand

Research Questions

• Question 1
  – How to select a small number of facet-value pair candidates?
• Question 2
  – How to make use of faceted feedback?

Q1: Facet-Value Pair Selection

• Four approaches to rank facet-value pairs
  – Top Document Frequency (TDF)
    • Frequency in the top N ranked documents
  – TDF*IDF (Inverse Document Frequency)
  – Query Likelihood (QL)
    • P(q|f=v)
  – TDF+QL
    • TDF: P(f=v|q)
    • QL: P(q|f=v)
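The first two ranking approaches can be sketched in a few lines. This is only an illustration, not the authors' implementation: the document representation (a set of `(facet, value)` tuples), the function name `rank_facet_value_pairs`, and the add-one smoothing in the IDF term are all assumptions made for the example. Query Likelihood is left out because it needs a language model that the slides do not specify.

```python
import math
from collections import Counter


def rank_facet_value_pairs(top_docs, collection, use_idf=False):
    """Rank facet-value pairs by TDF, optionally multiplied by IDF.

    top_docs:   the top N ranked documents for the query
    collection: all documents (used only for the IDF statistics)
    Each document is a set of (facet, value) tuples (an assumed representation).
    """
    # TDF: how many of the top-ranked documents carry each pair.
    tdf = Counter(pair for doc in top_docs for pair in doc)
    if use_idf:
        df = Counter(pair for doc in collection for pair in doc)
        n = len(collection)
        # Add-one smoothing is an assumption of this sketch, not from the slides.
        scores = {pair: count * math.log((n + 1) / (df[pair] + 1))
                  for pair, count in tdf.items()}
    else:
        scores = dict(tdf)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda pair: (-scores[pair], pair))


# Hypothetical toy collection.
d1 = {("Topic", "FIFA World Cup"), ("Year", "2010")}
d2 = {("Topic", "FIFA World Cup"), ("Year", "2006")}
d3 = {("Topic", "Politics"), ("Year", "2010")}
d4 = {("Topic", "Politics"), ("Year", "2010")}
d5 = {("Topic", "Economy"), ("Year", "2010")}
collection = [d1, d2, d3, d4, d5]
top_docs = [d1, d2, d3]

by_tdf = rank_facet_value_pairs(top_docs, collection)
by_tdf_idf = rank_facet_value_pairs(top_docs, collection, use_idf=True)
```

In this toy data, `Year = 2010` is frequent in the top documents but also frequent everywhere, so the IDF factor pushes it down the list, which is the usual motivation for weighting by IDF.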

Q2: How to Use Faceted Feedback?

• The commonly used method
  – Boolean Model
• Problem with Boolean Model
  – Document metadata is not perfect
    • Inaccurate / incomplete
  – This may badly hurt the retrieval performance

The Soft Model

• The basic idea
  – Reward documents that match user-identified facet-value pairs by adding a certain number of credits
  – The number of credits for each facet is learned on training queries

Score(d) = original score + rewards for facet match
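To make the contrast with the Boolean model concrete, here is a minimal sketch of both scoring rules. The document layout (`score` plus a `facets` set), the function names, and the credit values are hypothetical; in the approach described above the per-facet credits are learned on training queries, whereas here they are simply hard-coded. Only positively confirmed facet-value pairs are handled.

```python
def boolean_filter(docs, required_pairs):
    """Boolean model: keep only documents whose metadata contains every required pair."""
    return [d for d in docs if required_pairs <= d["facets"]]


def soft_score(doc, selected_pairs, credits):
    """Soft model: original score plus a per-facet credit for each matched pair."""
    reward = sum(credits.get(facet, 0.0)
                 for facet, value in selected_pairs
                 if (facet, value) in doc["facets"])
    return doc["score"] + reward


# Hypothetical documents; doc "b" is relevant but its Year metadata is missing.
docs = [
    {"id": "a", "score": 1.0, "facets": {("Topic", "FIFA World Cup"), ("Year", "2010")}},
    {"id": "b", "score": 1.2, "facets": {("Topic", "FIFA World Cup")}},
    {"id": "c", "score": 1.5, "facets": {("Topic", "Politics"), ("Year", "2010")}},
]
selected = {("Topic", "FIFA World Cup"), ("Year", "2010")}
credits = {"Topic": 1.0, "Year": 0.5}  # assumed values, normally learned

kept_by_boolean = [d["id"] for d in boolean_filter(docs, selected)]
ranked_by_soft = [d["id"] for d in
                  sorted(docs, key=lambda d: -soft_score(d, selected, credits))]
```

With incomplete metadata, the Boolean model drops document `b` entirely, while the soft model only ranks it slightly below the fully matching document.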

Experimental Settings

• Datasets
  – OHSUMED + Queries from TREC (Text REtrieval Conference) 2000 filtering track
    • 348,566 medical articles, 63 queries
  – RCV1 + Queries from TREC 2002 filtering track
    • ~810,000 news articles from Reuters, 50 queries
• User Study
  – We collected faceted feedback from users on Amazon Mechanical Turk

Chosen Facets

• OHSUMED
  – MeSH (Medical Subject Headings)
• RCV1
  – Region
  – Industry
  – Topic

Experimental Results: Overall Performance of Faceted Feedback

Faceted feedback significantly improves the retrieval performance

Experimental Results: Boolean Models vs. Soft Model

Result plots: OHSUMED, RCV1

The Boolean models don’t work well or even hurt, while the soft model always performs well

Outline

• Introduction
• Our Research
  – Direction 1: A New User Feedback Mechanism
    • Faceted Feedback
  – Direction 2: A New Filtering Model
    • Discriminative Factored Prior Model
• Future Work

Existing Filtering Approaches

• Two categories
  – Retrieval models + threshold setting methods
    • Rocchio, BM25, Language Models, etc.
  – Standard machine learning models for binary text classification
    • Naïve Bayes, logistic regression, SVM, neural networks, etc.
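As an illustration of the first category, the sketch below pairs a simplified, incremental Rocchio-style profile update with a fixed delivery threshold. It is a toy under stated assumptions: the class name `AdaptiveFilter`, the dot-product scoring, and the `beta`/`gamma` values are invented for the example, and no threshold-setting method is implemented (the threshold stays constant).

```python
class AdaptiveFilter:
    """Profile-based filter: dot-product score, fixed threshold, Rocchio-style update."""

    def __init__(self, initial_profile, threshold, beta=0.5, gamma=0.25):
        self.profile = dict(initial_profile)  # term -> weight
        self.threshold = threshold
        self.beta = beta      # step size for relevant feedback
        self.gamma = gamma    # step size for non-relevant feedback

    def score(self, doc):
        """Dot product between the document's term weights and the profile."""
        return sum(w * self.profile.get(t, 0.0) for t, w in doc.items())

    def should_deliver(self, doc):
        """Push the document to the user only if it scores at or above the threshold."""
        return self.score(doc) >= self.threshold

    def feedback(self, doc, relevant):
        """Move the profile toward relevant documents and away from non-relevant ones."""
        step = self.beta if relevant else -self.gamma
        for term, weight in doc.items():
            self.profile[term] = self.profile.get(term, 0.0) + step * weight


# Hypothetical stream: the profile starts from a single query term.
f = AdaptiveFilter({"h1n1": 1.0}, threshold=0.4)
doc1 = {"h1n1": 1.0, "vaccine": 1.0}
doc2 = {"vaccine": 1.0, "trial": 1.0}

delivered_before = f.should_deliver(doc2)
f.feedback(doc1, relevant=True)
delivered_after = f.should_deliver(doc2)
```

The second document is only delivered after the user marks the first one as relevant, which is the adaptive part of the task: the profile changes as feedback arrives.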

Characteristics of User Interests

• For example,
  – User 1: “Sports”, “Technology”
  – User 2: “Sports”, “Politics”, “Shopping”
  – User 3: “Politics”, “Technology”, “Travel”
• Characteristics
  – A single user may have multiple interests
  – Different users may have overlapping interests
• Existing filtering approaches don’t explicitly capture these characteristics

Discriminative Factored Prior Models (DFPM)

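• $\mathbf{w}_m \sim N(\Lambda \boldsymbol{\lambda}_m, \Sigma)$
  – $\mathbf{w}_m$: the profile/classifier of user m
  – $\Lambda$: the hidden factor matrix
  – $\boldsymbol{\lambda}_m$: the hidden vector of user m
  – $\Sigma$: the variance matrix
• $P(y_{mj} \mid \mathbf{w}_m, \mathbf{x}_{mj})$
  – $\mathbf{x}_{mj}$: the feature vector of the j-th training document of user m
  – $y_{mj}$: the label of the j-th training document of user m
• Priors
  – $\Lambda_i \sim N(\mathbf{0}, a\mathbf{I})$
  – $\Sigma \sim W(c_1, \mathbf{I})$
  – $\boldsymbol{\lambda}_m \sim N(\mathbf{0}, c_2\mathbf{I})$ or $\boldsymbol{\lambda}_m \sim \mathrm{Multinomial}(1/H, \ldots, 1/H)$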
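One way to read this model is as a generative story: draw a shared factor matrix, draw a hidden vector per user, then draw the user's profile around their product. The sketch below samples from that story under several assumptions that are not stated on the slide: the Gaussian prior is used for the hidden vector, the variance matrix is fixed to $c_1 \mathbf{I}$, and the label likelihood is logistic. All function names are hypothetical.

```python
import math
import random


def sample_factor_matrix(d, H, a, rng):
    """Lambda: d x H matrix with i.i.d. N(0, a) entries (shared by all users)."""
    return [[rng.gauss(0.0, math.sqrt(a)) for _ in range(H)] for _ in range(d)]


def sample_user_profile(Lam, c1, c2, rng):
    """Draw lambda_m ~ N(0, c2 I), then w_m ~ N(Lam lambda_m, c1 I)."""
    H = len(Lam[0])
    lam = [rng.gauss(0.0, math.sqrt(c2)) for _ in range(H)]
    mean = [sum(row[h] * lam[h] for h in range(H)) for row in Lam]
    w = [rng.gauss(mu, math.sqrt(c1)) for mu in mean]
    return lam, w


def p_relevant(w, x):
    """Assumed logistic likelihood: P(y = 1 | w, x) = sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    # Numerically stable sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


rng = random.Random(0)
Lam = sample_factor_matrix(d=5, H=2, a=1.0, rng=rng)
lam_m, w_m = sample_user_profile(Lam, c1=0.1, c2=1.0, rng=rng)
prob = p_relevant(w_m, [1.0, 0.0, 0.5, 0.0, 0.0])
```

Because every user's profile is centered on a combination of the same columns of the factor matrix, users with little feedback can still be pulled toward directions learned from other users.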

Advantages

• As discriminative models, our models can incorporate any kind of feature
  – Textual features (words)
  – Semantic features (very useful)
    • Topic = Lung Cancer
    • Source = Cancer Cause and Control
• Borrow information from other users when learning profiles for new users
  – All user profiles share a common hidden factor matrix
• Capture a single user’s multiple interests
  – Each user profile follows a factored prior distribution

Parameter Estimation

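$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D, \theta)$

$\theta = \{\Lambda, \Sigma, \boldsymbol{\lambda}_m, \mathbf{w}_m\}$

Assume $\Sigma$ is diagonal and all of its diagonal entries are equal to a constant value $c_1$, then (resulting objective not preserved in the transcript)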
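As a hedged illustration of what the MAP objective can look like under that assumption, the derivation below takes $\Sigma = c_1 \mathbf{I}$ as fixed and uses the Gaussian priors $\Lambda_i \sim N(\mathbf{0}, a\mathbf{I})$ and $\boldsymbol{\lambda}_m \sim N(\mathbf{0}, c_2\mathbf{I})$. It is a sketch derived from those priors, not necessarily the exact expression the authors presented; the choice of the Gaussian (rather than multinomial) prior for $\boldsymbol{\lambda}_m$ is an assumption.

```latex
\begin{aligned}
\log P(D, \theta)
  &= \sum_{m} \sum_{j} \log P(y_{mj} \mid \mathbf{w}_m, \mathbf{x}_{mj})
   - \frac{1}{2 c_1} \sum_{m} \lVert \mathbf{w}_m - \Lambda \boldsymbol{\lambda}_m \rVert^2 \\
  &\quad - \frac{1}{2 c_2} \sum_{m} \lVert \boldsymbol{\lambda}_m \rVert^2
   - \frac{1}{2 a} \sum_{i} \lVert \Lambda_i \rVert^2
   + \text{const}
\end{aligned}
```

Maximizing this expression trades off fit to each user's labeled documents against keeping each profile $\mathbf{w}_m$ close to $\Lambda \boldsymbol{\lambda}_m$, which is how information is shared across users.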

Optimization

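• Use an EM-like iterative algorithm to solve the above optimization problem

1: Initialize $\Lambda^0$
2: $\Lambda^i \rightarrow \boldsymbol{\lambda}_m^i, \mathbf{w}_m^i$ (conjugate gradient descent)
3: $\boldsymbol{\lambda}_m^i, \mathbf{w}_m^i \rightarrow \Lambda^{i+1}$ (closed-form solution!)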
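The closed-form step can be sketched as a ridge-style least-squares problem. This assumes that, with the user profiles and hidden vectors held fixed, the part of the objective that depends on $\Lambda$ is $\frac{1}{2c_1}\sum_m \lVert \mathbf{w}_m - \Lambda\boldsymbol{\lambda}_m\rVert^2 + \frac{1}{2a}\lVert\Lambda\rVert_F^2$, which follows from Gaussian priors with $\Sigma = c_1\mathbf{I}$ but is not spelled out on the slide. The function names are hypothetical, and the conjugate gradient step for the profiles is not shown.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def update_factor_matrix(W, L, c1, a):
    """Closed-form Lambda for fixed profiles W and hidden vectors L.

    W: list of user profiles w_m (each of length d)
    L: list of hidden vectors lambda_m (each of length H)
    Setting the gradient to zero gives Lambda (sum_m lam lam^T + (c1/a) I) = sum_m w lam^T.
    Returns Lambda as a list of d rows of length H.
    """
    H, d = len(L[0]), len(W[0])
    rho = c1 / a
    A = [[sum(lam[p] * lam[q] for lam in L) + (rho if p == q else 0.0)
          for q in range(H)] for p in range(H)]
    # A is symmetric, so row k of Lambda solves A x = sum_m w_m[k] * lambda_m.
    return [solve(A, [sum(w[k] * lam[p] for w, lam in zip(W, L)) for p in range(H)])
            for k in range(d)]


# Hypothetical check: profiles generated exactly from a known factor matrix.
true_Lam = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
profiles = [[sum(row[h] * lam[h] for h in range(2)) for row in true_Lam]
            for lam in hidden]
recovered = update_factor_matrix(profiles, hidden, c1=1e-12, a=1.0)
```

With almost no regularization the update recovers the matrix that generated the profiles; larger values of `c1 / a` shrink the solution toward zero.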

Experimental Settings

• Dataset
  – Collected from Digg.com, where users can “digg” news articles that interest them to promote the articles’ rankings
  – 15,162 users, 251 relevant documents per user
• Details
  – 80% (training), 10% (validation), 10% (test)
  – Words as features: 35,865 (TFIDF score)
  – Metrics: Precision, Recall, Macro-F1
• Baselines
  – L-2 normalized Logistic Regression (L2LR)
    • Learns each user profile separately without borrowing information
  – The standard Bayesian Hierarchical model with Logistic Regression (BHLR)
    • Uses a standard prior
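For reference, the three metrics can be computed as below. The sketch assumes that Macro-F1 means the unweighted average of per-user F1 scores, which is the usual reading of "macro" but is not stated on the slide; the function names and the example counts are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(per_user_counts):
    """Unweighted mean of per-user F1; each item is a (tp, fp, fn) tuple for one user."""
    scores = [precision_recall_f1(tp, fp, fn)[2] for tp, fp, fn in per_user_counts]
    return sum(scores) / len(scores)


# Hypothetical counts for two users.
counts = [(8, 2, 2), (1, 1, 3)]
overall = macro_f1(counts)
```

Averaging per user gives every user equal weight, so users with few relevant documents influence the score as much as heavy users.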

Performance Comparison

Our models outperform the baselines significantly

Outline

• Introduction
• Our Research
  – A New User Interaction Mechanism
    • Faceted Feedback
  – A New Filtering Approach
    • Discriminative Factored Prior Model
• Future Work

Future Work

• Active learning on facet-value pair selection
  – To maximize learning benefits
• Integrating multiple types of user feedback
  – Feedback on documents
  – Feedback on facets
  – …

Thanks!

• Comments & Questions?
