Scott Burton and Richard Morris CS 676 Presentation 12 April 2011.

Mining Rules from Surveys and Questionnaires

Scott Burton and Richard Morris

CS 676 Presentation

12 April 2011

Surveys and Questionnaires

• Frequently used
• Problems for data mining
◦ Rarity
◦ Related and dependent questions
◦ Ordinal / Likert scale

Association Rule Mining

Market basket analysis

Cookies -> Milk

Customer Milk Cookies Butter Bread

A x x

B x x x

C x x

D x x
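To make the Cookies -> Milk rule concrete, here is a minimal sketch of how support and confidence are computed for a rule. The column positions of the marks in the table above did not survive, so the baskets below are hypothetical and only match the number of marks per customer; the function names are illustrative as well.

```python
# Hypothetical baskets: the positions of the x marks in the slide's table
# were lost, so these purchases are illustrative only.
baskets = {
    "A": {"Milk", "Cookies"},
    "B": {"Milk", "Cookies", "Bread"},
    "C": {"Butter", "Bread"},
    "D": {"Milk", "Bread"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): supp(A and C) / supp(A)."""
    antecedent_support = support(antecedent, baskets)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the data")
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / antecedent_support


rule_support = support({"Cookies", "Milk"}, baskets)
rule_confidence = confidence({"Cookies"}, {"Milk"}, baskets)
print(rule_support, rule_confidence)
```

With these made-up baskets, two of the four customers bought both items, and every customer who bought cookies also bought milk.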

Our Goal: Improve Precision

Standard Algorithms/Approaches
• Apriori, MS-Apriori
• Too many rules
• Rules are not “interesting” or actionable
• Finding the needle in the haystack

Our goal
• Improve precision
• How do you measure “interestingness”?

Interestingness Measures

• Mostly based on support or confidence
• Considered about 40 different metrics
• All seemed to favor the wrong types of rules
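As an illustration of why a measure built on support and confidence can favor the wrong rules, the sketch below computes support, confidence, and lift from raw counts. Lift is one widely used measure; the slides do not say which of the roughly 40 metrics were examined, and the counts here are hypothetical.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence, and lift of a rule A -> C from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    # Lift compares the rule's confidence with the baseline frequency of C.
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


# Hypothetical counts: the consequent holds for 900 of 1000 respondents,
# so almost any antecedent "predicts" it with high confidence.
support, confidence, lift = rule_metrics(
    n_total=1000, n_antecedent=500, n_consequent=900, n_both=450
)
print(support, confidence, lift)
```

In this made-up case the rule has high support and confidence, yet a lift of about 1 shows that the antecedent tells us nothing beyond the base rate of the consequent.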

Our Datasets

• Smoking habits of middle school students in Mexico
◦ Global Youth Tobacco Survey for the Pan American Health Organization (GYTSPAHO)
◦ ~65 questions and 13,000 responses
• HINTS (Health Information National Trends Survey)
◦ hints.cancer.gov
◦ 2007 response data had ~475 questions and 8,000 responses
◦ We focused on a subset of ~100 questions

Apriori vs. MS-Apriori

Apriori (Figure 1)

MS-Apriori (Figure 2)
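The main difference between the two algorithms is that Apriori applies a single minimum support to every itemset, while MS-Apriori gives each item its own minimum item support (MIS) and requires an itemset to meet the lowest MIS among its items. The sketch below shows only that test, not the full candidate generation; the common assignment MIS(i) = max(beta * support(i), lowest_support) is assumed, and the item names, supports, and parameter values are hypothetical.

```python
def assign_mis(item_support, beta, lowest_support):
    """Assumed MIS assignment: scale each item's support, with a floor."""
    return {
        item: max(beta * s, lowest_support)
        for item, s in item_support.items()
    }


def is_frequent(itemset, itemset_support, mis):
    """MS-Apriori test: support must reach the lowest MIS in the itemset."""
    return itemset_support >= min(mis[item] for item in itemset)


# Hypothetical single-item supports; "smoke=yes" is the rare answer.
item_support = {
    "smoke=no": 0.90,
    "smoke=yes": 0.10,
    "parents_smoke=yes": 0.30,
}
mis = assign_mis(item_support, beta=0.5, lowest_support=0.01)

rare_itemset = ("smoke=yes", "parents_smoke=yes")
rare_itemset_support = 0.06  # hypothetical

single_minsup = 0.20  # what plain Apriori would apply to every itemset
frequent_ms = is_frequent(rare_itemset, rare_itemset_support, mis)
frequent_apriori = rare_itemset_support >= single_minsup
print(frequent_ms, frequent_apriori)
```

Under these assumptions the rare itemset survives in MS-Apriori but would be discarded by a single threshold high enough to keep the rule count manageable, which is the rarity problem noted earlier.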

Related and Dependent Questions

True but worthless rules
• Do you smoke=no -> Did you smoke last week=no

Our approach
• Cluster similar questions
• Remove any intra-cluster rules

(Diagram: questions 1-9 drawn in three groups: 1, 2, 3; 4, 5, 6; 7, 8, 9)
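A minimal sketch of intra-cluster rule removal follows. It assumes a rule counts as intra-cluster when every question it mentions belongs to one cluster; the slides do not spell out the exact criterion, and the question names, cluster ids, and rule representation are hypothetical.

```python
# Hypothetical mapping from question to cluster id.
cluster_of = {
    "smoke": 1,
    "smoke_last_week": 1,
    "parents_smoke": 2,
    "saw_ads": 3,
}


def question(item):
    """Extract the question name from an 'question=value' item."""
    return item.split("=", 1)[0]


def is_intra_cluster(rule, cluster_of):
    """True when all items on both sides of the rule share one cluster."""
    antecedent, consequent = rule
    clusters = {cluster_of[question(item)] for item in antecedent + consequent}
    return len(clusters) == 1


# Each rule is an (antecedent, consequent) pair of item tuples.
rules = [
    (("smoke=no",), ("smoke_last_week=no",)),   # true but worthless
    (("parents_smoke=yes",), ("smoke=yes",)),   # crosses clusters
]
kept = [rule for rule in rules if not is_intra_cluster(rule, cluster_of)]
print(kept)
```

Only the rule that spans two clusters is kept; the "Do you smoke" rule is dropped because both of its questions sit in the same cluster.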

Creating Clusters

• Distance metrics
◦ Bi-conditional prediction
• Attribute vs. attribute-value pair
• Involving the subject matter expert
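The slides do not define bi-conditional prediction, so the following is only one plausible reading, offered as a sketch: two questions are close when each one's answer predicts the other's well. Prediction here is a simple majority guess per answer value, the two directions are averaged, and all names and responses are hypothetical.

```python
from collections import Counter, defaultdict


def prediction_accuracy(xs, ys):
    """Accuracy of guessing y as the most common y seen for each x value."""
    by_x = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_x[x][y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_x.values())
    return correct / len(xs)


def biconditional_distance(xs, ys):
    """Assumed distance: 1 minus the mean accuracy of both directions."""
    forward = prediction_accuracy(xs, ys)
    backward = prediction_accuracy(ys, xs)
    return 1.0 - (forward + backward) / 2


# Hypothetical responses from six students to three questions.
smoke = ["no", "no", "no", "yes", "yes", "yes"]
smoke_last_week = ["no", "no", "no", "no", "yes", "yes"]
saw_ads = ["yes", "no", "yes", "no", "yes", "no"]

d_related = biconditional_distance(smoke, smoke_last_week)
d_unrelated = biconditional_distance(smoke, saw_ads)
print(d_related, d_unrelated)
```

A distance matrix built this way could then feed any standard clustering method, with the subject matter expert reviewing the resulting groups.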

A Sample Clustering of Questions

(see handout)

Effects of Cluster Pruning

MS-Apriori (Figure 2)

After cluster pruning (Figure 3)

Similar Rule Pruning

Similar Rules

Abstract Viewpoint:
• A B -> C D
• A -> C D
• A B -> C
• A B Z -> C D
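One way to sketch similar rule pruning is to score pairs of rules by set overlap and keep only the best-ranked rule from each group of near-duplicates. The Jaccard measure, the threshold, the greedy pass, and the extra X -> Y rule below are all assumptions for illustration; the slides say only that a record matching technique was used.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union (non-empty sets)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def rule_similarity(r1, r2):
    """Mean Jaccard overlap of the two antecedents and the two consequents."""
    return (jaccard(r1[0], r2[0]) + jaccard(r1[1], r2[1])) / 2


def prune_similar(ranked_rules, threshold):
    """Greedy pass: keep a rule only if it is unlike every rule kept so far.

    ranked_rules is assumed to be sorted best-first by some quality measure.
    """
    kept = []
    for rule in ranked_rules:
        if all(rule_similarity(rule, other) < threshold for other in kept):
            kept.append(rule)
    return kept


# The four rules from the slide plus one unrelated, hypothetical rule.
rules = [
    (("A", "B"), ("C", "D")),
    (("A",), ("C", "D")),
    (("A", "B"), ("C",)),
    (("A", "B", "Z"), ("C", "D")),
    (("X",), ("Y",)),
]
kept = prune_similar(rules, threshold=0.5)
print(kept)
```

With this threshold the three variants of A B -> C D collapse into the first rule, and the unrelated rule is kept.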

Effects of Similar Rule Pruning

After cluster pruning (Figure 3)

After Similar Rule Pruning (Figure 4)

Ordinal and Likert Data

Two Approaches
• Pre-process
• Post-process

Ordinal Likert

Effects of Pre-Binning (Figure 5)
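As a sketch of the pre-processing approach, Likert answers can be merged into coarser bins before mining, so that neighboring answers pool their support. The five-point labels and the three-bin mapping below are assumptions, not the binning used for Figure 5.

```python
from collections import Counter

# Assumed mapping from a five-point Likert scale to three bins.
LIKERT_BINS = {
    "Strongly disagree": "Disagree",
    "Disagree": "Disagree",
    "Neutral": "Neutral",
    "Agree": "Agree",
    "Strongly agree": "Agree",
}


def pre_bin(responses, bins=LIKERT_BINS):
    """Replace each raw answer with its bin label."""
    return [bins[answer] for answer in responses]


# Hypothetical answers to one question.
responses = ["Strongly agree", "Agree", "Neutral", "Agree", "Strongly disagree"]
binned = pre_bin(responses)
counts = Counter(binned)
print(counts)
```

After binning, "Agree" covers three of the five answers, where the separate "Agree" and "Strongly agree" values covered two and one.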

HINTS Data

(see handout, Figures 6-10)

Other Examples

Conclusions and Future Work

Conclusions
• Increased precision of “interesting” rules
• More work to be done

Future work
• Tuning of existing processes
• Handle numerical data
• Handle questions not asked to everyone
• Handle questions with multiple responses
• Try other record matching techniques for similar rule pruning
