Scott Burton and Richard Morris CS 676 Presentation 12 April 2011.

Mining Rules from Surveys and Questionnaires


Page 1: Mining Rules from Surveys and Questionnaires

Scott Burton and Richard Morris
CS 676 Presentation
12 April 2011

Page 2: Surveys and Questionnaires

Frequently used

Problems for data mining
• Rarity
• Related and dependent questions
• Ordinal / Likert scale

Page 3: Association Rule Mining

Market basket analysis

Cookies -> Milk

Customer Milk Cookies Butter Bread
A x x
B x x x
C x x
D x x
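To make the Cookies -> Milk rule concrete, here is a minimal sketch of how support and confidence are usually computed for such a rule. The basket contents are hypothetical (the column positions in the slide's table did not survive transcription), and the function names are illustrative only.

```python
# Hypothetical baskets: NOT the data from the slide's table.
baskets = {
    "A": {"milk", "cookies"},
    "B": {"milk", "cookies", "bread"},
    "C": {"butter", "bread"},
    "D": {"milk", "butter"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)


# Rule: cookies -> milk
rule_support = support({"cookies", "milk"}, baskets)          # 2 of 4 baskets
rule_confidence = confidence({"cookies"}, {"milk"}, baskets)  # 2 of 2 cookie baskets
print(rule_support, rule_confidence)
```

With these made-up baskets, every customer who bought cookies also bought milk, so the rule has full confidence even though it covers only half of the baskets.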

Page 4: Our Goal: Improve Precision

Standard Algorithms/Approaches
• Apriori, MS-Apriori
• Too many rules
• Rules are not “interesting” or actionable
• Finding the needle in the haystack

Our goal
• Improve precision
• How do you measure “interestingness”?

Page 5: Interestingness Measures

• Mostly based on support or confidence
• Considered about 40 different metrics
• All seemed to favor the wrong types of rules
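The slides do not list which metrics were tried, so the following is only a sketch of three widely used measures built on support and confidence (lift, leverage, conviction). The numbers are hypothetical and show one well-known way confidence alone can mislead: a rule whose consequent is very common scores high confidence even when antecedent and consequent are independent.

```python
def lift(sup_a, sup_c, sup_ac):
    """Ratio of observed co-occurrence to that expected under independence."""
    return sup_ac / (sup_a * sup_c)


def leverage(sup_a, sup_c, sup_ac):
    """Difference between observed and expected co-occurrence."""
    return sup_ac - sup_a * sup_c


def conviction(sup_a, sup_c, sup_ac):
    """(1 - sup(C)) / (1 - confidence); infinite when confidence is 1."""
    conf = sup_ac / sup_a
    if conf == 1.0:
        return float("inf")
    return (1 - sup_c) / (1 - conf)


# Hypothetical supports: antecedent 50%, consequent 90%, both 45%.
sup_a, sup_c, sup_ac = 0.5, 0.9, 0.45
print("confidence:", sup_ac / sup_a)             # high, looks like a strong rule
print("lift:", lift(sup_a, sup_c, sup_ac))       # about 1: no real association
print("leverage:", leverage(sup_a, sup_c, sup_ac))
print("conviction:", conviction(sup_a, sup_c, sup_ac))
```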

Page 6: Our Datasets

Smoking habits of middle school students in Mexico
• Global Youth Tobacco Survey for the Pan American Health Organization (GYTSPAHO)
• ~65 questions and 13,000 responses

HINTS (Health Information National Trends Survey)
• hints.cancer.gov
• 2007 response data had ~475 questions and 8,000 responses
• We focused on a subset of ~100 questions

Page 7: Apriori vs. MS-Apriori

Apriori (Figure 1)

MS-Apriori (Figure 2)
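For readers unfamiliar with the difference, the sketch below illustrates the core idea usually attributed to MS-Apriori: each item gets its own minimum item support (MIS), commonly derived from its frequency, and an itemset must meet the lowest MIS among its items. The parameter names (`beta`, `least_support`), the item labels, and all numbers are assumptions for illustration; the slides do not state the settings that were used.

```python
def minimum_item_support(item_freq, beta, least_support):
    """Per-item threshold: a fraction of the item's frequency, with a floor."""
    return {item: max(beta * freq, least_support)
            for item, freq in item_freq.items()}


def is_frequent(itemset, itemset_support, mis):
    """An itemset qualifies if it meets the lowest MIS among its items."""
    return itemset_support >= min(mis[item] for item in itemset)


# Hypothetical answer frequencies from a survey with a rare answer.
item_freq = {"smokes=no": 0.90, "smokes=yes": 0.10, "parent_smokes=yes": 0.30}
mis = minimum_item_support(item_freq, beta=0.5, least_support=0.02)

rare_itemset = {"smokes=yes", "parent_smokes=yes"}
rare_support = 0.06  # hypothetical

print(is_frequent(rare_itemset, rare_support, mis))  # passes the per-item threshold
print(rare_support >= 0.20)                          # fails a single global threshold
```

The point of the comparison is the rarity problem: a single global threshold low enough to admit rare answers also admits a flood of rules about common ones.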

Page 8: Related and Dependent Questions

True but worthless rules
• Do you smoke=no -> Did you smoke last week=no

Our approach
• Cluster similar questions
• Remove any intra-cluster rules

(Diagram: nodes numbered 1 through 9)
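The pruning step above can be sketched as a simple filter. This is an illustration under stated assumptions, not the presenters' implementation: rules are represented as (antecedent, consequent) pairs of (question, answer) items, `cluster_of` is a hypothetical mapping from question to cluster id, and a rule is treated as intra-cluster when any antecedent question shares a cluster with any consequent question.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]                          # (question, answer)
Rule = Tuple[FrozenSet[Item], FrozenSet[Item]]  # (antecedent, consequent)


def is_intra_cluster(rule: Rule, cluster_of: Dict[str, int]) -> bool:
    """True if the two sides of the rule touch a common cluster (assumed test)."""
    antecedent, consequent = rule
    left = {cluster_of[question] for question, _ in antecedent}
    right = {cluster_of[question] for question, _ in consequent}
    return bool(left & right)


def prune_intra_cluster(rules: List[Rule], cluster_of: Dict[str, int]) -> List[Rule]:
    """Keep only rules whose sides come from different clusters."""
    return [rule for rule in rules if not is_intra_cluster(rule, cluster_of)]


# Hypothetical clustering: the two smoking questions are grouped together.
cluster_of = {"do_you_smoke": 1, "smoked_last_week": 1, "parent_smokes": 2}

rules: List[Rule] = [
    (frozenset({("do_you_smoke", "no")}), frozenset({("smoked_last_week", "no")})),
    (frozenset({("parent_smokes", "no")}), frozenset({("do_you_smoke", "no")})),
]

kept = prune_intra_cluster(rules, cluster_of)
print(len(kept))  # the "true but worthless" rule is dropped
```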

Page 9: Creating Clusters

Distance Metrics
◦ Bi-conditional prediction

Attribute vs. Attribute-Value pair

Involving the subject matter expert
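The slides name bi-conditional prediction as a distance metric but do not define it. One possible reading, offered here purely as an assumption, is that two questions are close when each one's answer predicts the other's. The sketch below works at the attribute level, scores each direction by the accuracy of guessing the most common paired answer, and turns the average into a distance. The function names and the sample answers are hypothetical.

```python
from collections import Counter, defaultdict


def prediction_accuracy(xs, ys):
    """Accuracy of guessing y as the most common y seen with each x."""
    by_x = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_x[x][y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_x.values())
    return correct / len(xs)


def biconditional_distance(xs, ys):
    """1 minus the mean of the two directional accuracies (assumed definition)."""
    both_ways = (prediction_accuracy(xs, ys) + prediction_accuracy(ys, xs)) / 2
    return 1.0 - both_ways


# Hypothetical answers from six respondents.
smoke = ["no", "no", "no", "yes", "yes", "no"]
last_week = ["no", "no", "no", "yes", "no", "no"]
gender = ["m", "f", "m", "f", "m", "f"]

d_related = biconditional_distance(smoke, last_week)
d_unrelated = biconditional_distance(smoke, gender)
print(d_related, d_unrelated)
```

A raw accuracy like this is inflated when answers are heavily skewed, which is one reason a subject matter expert's review of the resulting clusters would still matter.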

Page 10: A Sample Clustering of Questions

(see handout)

Page 11: Effects of Cluster Pruning

MS-Apriori (Figure 2)

After cluster pruning (Figure 3)

Page 12: Similar Rules

Abstract Viewpoint:
• A B -> C D
• A -> C D
• A B -> C
• A B Z -> C D
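The four abstract rules above differ by only an item or two, which suggests scoring rule pairs for overlap and keeping one representative. The sketch below uses Jaccard similarity as a simple stand-in; the slides do not say which matching technique was actually used, and the threshold, the extra rule X -> Y, and the assumption that rules arrive already ranked best-first are all hypothetical.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union of two item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def rule_similarity(r1, r2):
    """Average of antecedent overlap and consequent overlap."""
    return (jaccard(r1[0], r2[0]) + jaccard(r1[1], r2[1])) / 2


def prune_similar(ranked_rules, threshold):
    """Greedy pass: keep a rule only if it is not too close to one already kept."""
    kept = []
    for rule in ranked_rules:
        if all(rule_similarity(rule, other) < threshold for other in kept):
            kept.append(rule)
    return kept


# The slide's four rules plus one unrelated, hypothetical rule.
rules = [
    ({"A", "B"}, {"C", "D"}),
    ({"A"}, {"C", "D"}),
    ({"A", "B"}, {"C"}),
    ({"A", "B", "Z"}, {"C", "D"}),
    ({"X"}, {"Y"}),
]

kept = prune_similar(rules, threshold=0.7)
print(kept)
```

With this threshold, the three variants of A B -> C D are folded into the first rule, and only the unrelated rule survives alongside it.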

Page 13: Similar Rule Pruning

Page 14: Effects of Similar Rule Pruning

After cluster pruning (Figure 3)

After similar rule pruning (Figure 4)

Page 15: Ordinal and Likert Data

Two Approaches
• Pre-process
• Post-process

(Figure panels: Ordinal, Likert)
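As an illustration of the pre-processing approach, the sketch below merges adjacent Likert answers into coarser bins before any rule mining takes place, so that sparse extremes such as "strongly agree" pool their support with "agree". The five-point labels and the three-bin mapping are assumptions; the slides do not specify the bins that were used.

```python
from collections import Counter

# Hypothetical mapping from a five-point Likert scale to three bins.
LIKERT_BINS = {
    "strongly disagree": "disagree",
    "disagree": "disagree",
    "neutral": "neutral",
    "agree": "agree",
    "strongly agree": "agree",
}


def pre_bin(responses, bins=LIKERT_BINS):
    """Replace each answer with its bin; unknown answers pass through unchanged."""
    return [bins.get(answer, answer) for answer in responses]


# Hypothetical answers to one question.
responses = [
    "strongly agree", "agree", "agree", "neutral",
    "disagree", "strongly disagree", "strongly agree",
]

before = Counter(responses)
after = Counter(pre_bin(responses))
print(before["agree"], after["agree"])  # support for "agree" grows after binning
```

The post-processing alternative would instead mine rules on the raw answers and merge rules that differ only in adjacent scale values afterward.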

Page 16: Effects of Pre-Binning (Figure 5)

Page 17: Other Examples

HINTS Data

(see handout, Figures 6-10)

Page 18: Conclusions and Future Work

Conclusions
• Increased precision of “interesting” rules
• More work to be done

Future work
• Tuning of existing processes
• Handle numerical data
• Handle questions not asked to everyone
• Handle questions with multiple responses
• Try other record matching techniques for similar rule pruning