Data Mining Algorithms - Stanford...
Transcript of Data Mining Algorithms - Stanford...
![Page 1: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/1.jpg)
CS102Data Mining
Data Mining Algorithms
CS102
Winter 2019
![Page 2: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/2.jpg)
CS102Data Mining
Big Data Tools and Techniques
§ Basic Data Manipulation and Analysis
Performing well-defined computations or asking
well-defined questions (“queries”)
§ Data Mining
Looking for patterns in data
§ Machine Learning
Using data to build models and make predictions
§ Data Visualization
Graphical depiction of data
§ Data Collection and Preparation
![Page 3: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/3.jpg)
CS102Data Mining
Data MiningLooking for patterns in data
Similar to unsupervised machine learning
• Popularity predates popularity of machine learning
• “Data mining” often associated with specific data
types and patterns
We will focus on “market-basket” data
• Widely applicable (despite the name)
And two types of data mining patterns
• Frequent item-sets
• Association rules
![Page 4: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/4.jpg)
CS102Data Mining
Other Data and Patterns
Other types of data• Networks/graphs
• Streams
• Text (“text mining”)
Other patterns• Similar items
• Structural patterns in large graphs/networks
• Clusters, anomalies
Specific techniquesfor each one
![Page 5: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/5.jpg)
CS102Data Mining
(In)Famous Early Success Stories
Victoria’s Secret
Walmart
Beer & Diapers
![Page 6: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/6.jpg)
CS102Data Mining
Market-Basket Data
Originated with retail data• Each shopper buys “market basket” of groceries• Mine data for patterns in buying habits
General definition• Domain of items• Transaction - one or more items occurring
together• Dataset – set of transactions (usually large)
![Page 7: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/7.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
![Page 8: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/8.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries
![Page 9: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/9.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
![Page 10: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/10.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods
![Page 11: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/11.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
![Page 12: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/12.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses
![Page 13: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/13.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
![Page 14: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/14.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students
![Page 15: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/15.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
![Page 16: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/16.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies
![Page 17: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/17.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
![Page 18: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/18.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
Symptoms
![Page 19: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/19.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
Symptoms Patient
![Page 20: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/20.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
Symptoms Patient
Menu items
![Page 21: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/21.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
Symptoms Patient
Menu items Restaurant customer
![Page 22: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/22.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
Symptoms Patient
Menu items Restaurant customer
Words
![Page 23: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/23.jpg)
CS102Data Mining
Market-Basket Examples
Items Transaction
Groceries Grocery cart
Online goods Virtual shopping cart
University courses Student transcript
University students Party
Movies Person
Symptoms Patient
Menu items Restaurant customer
Words Document
![Page 24: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/24.jpg)
CS102Data Mining
Data Mining Algorithms
Frequent Item-Sets – sets of items that occur
frequently together in transactions
• Groceries bought together
• Courses taken by same students
• Students going to parties together
• Movies watched by same people
Association Rules – When certain items occur
together, another item frequently occurs with them
• Shoppers who buy phone + charger also buy case
• Students who take Databases also take Machine
Learning
• Diners who order curry and rice also order bread
![Page 25: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/25.jpg)
CS102Data Mining
Frequent Item-SetsSets of items that occur frequently
together in transactions
ØHow large is a “set”?
ØWhat does “frequently” mean?
![Page 26: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/26.jpg)
CS102Data Mining
Frequent Item-SetsSets of items that occur frequently
together in transactions
ØHow large is a “set”?Usually specify a minimum min-set-size
Possibly also a maximum max-set-size
ØWhat does “frequently” mean?Notion of support
![Page 27: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/27.jpg)
CS102Data Mining
Support
Support for a set of items S in a dataset of transactions is the fraction of the transactions containing S:
Specify support-threshold for frequent item-setsOnly return sets wheresupport > support-threshold
# of transactions containing S____________________________________
total # of transactions
![Page 28: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/28.jpg)
CS102Data Mining
Your Turn
Transactions:
T1: milk, eggs, juice
T2: milk, juice, cookies
T3: eggs, chips
T4: milk, eggs
T5: milk, juice, cookies, chips
What are the frequent item-sets if:
• min-set-size = 2 (no max-set-size)
• support-threshold = 0.3
Support:# of transactions containing S
____________________________________
total # of transactions
![Page 29: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/29.jpg)
CS102Data Mining
Computing Frequent Item-Sets
“Apriori” algorithm
Efficiency relies on the following property:
Or the inverse:
If S is a frequent item-set satisfyingsupport-threshold t, then every subset of S is alsoa frequent item-set satisfying support-threshold t.
If S is not a frequent item-set satisfyingsupport-threshold t, then no superset of S can bea frequent item-set satisfying support-threshold t.
![Page 30: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/30.jpg)
CS102Data Mining
Association Rules
When a set of items S occurs together, another item i frequently occurs with them
S à i
ØHow large is a “set”?
ØWhat does “occurs together” mean?
ØWhat does “frequently occurs with them” mean?
![Page 31: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/31.jpg)
CS102Data Mining
Association Rules
When a set of items S occurs together, another item i frequently occurs with them
S à i
ØHow large is a “set”?Usually specify a minimum min-set-size for SPossibly also a maximum max-set-size for S
ØWhat does “occurs together” mean?
ØWhat does “frequently occurs with them” mean?
![Page 32: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/32.jpg)
CS102Data Mining
Association Rules
When a set of items S occurs together, another item i frequently occurs with them
S à i
ØHow large is a “set”?Usually specify a minimum min-set-size for SPossibly also a maximum max-set-size for S
ØWhat does “occurs together” mean?Notion of support
ØWhat does “frequently occurs with them” mean?Notion of confidence
![Page 33: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/33.jpg)
CS102Data Mining
Support and Confidence
Support for association rule S à i in a dataset of
transactions is fraction of transactions containing S:
Confidence for association rule S à i in a dataset of
transactions is the fraction of transactions
containing S that also contain i:
# of transactions containing S____________________________________
total # of transactions
# of transactions containing S and i__________________________________________
# of transactions containing S
![Page 34: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/34.jpg)
CS102Data Mining
Support and Confidence
Specify support-threshold and confidence-threshold
for association rules
Only return rules where:support > support-threshold andconfidence > confidence-threshold
![Page 35: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/35.jpg)
CS102Data Mining
Your Turn
Transactions:
T1: milk, eggs, juice
T2: milk, juice, cookies
T3: eggs, chips
T4: milk, eggs
T5: milk, juice, cookies, chips
What are the association rules S à i if:
• min-set-size = 1 (no max-set-size)
• support-threshold = 0.5
• confidence-threshold = 0.5
# of transactions containing S____________________________________
total # of transactions
# of transactions containing S and i__________________________________________
# of transactions containing SConfidence:
Support:
Reminder: support and
confidence must be >
threshold, not ≥
![Page 36: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/36.jpg)
CS102Data Mining
Computing Association Rules
1. Use frequent item-sets to find left-hand sides S satisfying support threshold
2. Then extend to find right-hand sides S à isatisfying confidence threshold
NOT a property:
If S à i is an association rule satisfyingsupport-threshold t and confidence-threshold c,
and S’Í S, then S’ à i is an an associationrule satisfying support-threshold t and
confidence-threshold c.
Why Not?
![Page 37: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/37.jpg)
CS102Data Mining
Association Rules: Lift
Association rule S à i might have high confidence
because item i appears frequently, not because it’s
associated with S.
Confidence for association rule S à i in a dataset of
transactions is the fraction of transactions
containing S that also contain i :
# of transactions containing S and i__________________________________________
# of transactions containing S
Lift
divided by the overall frequency of i :,
#trans containing S and i #trans containing i______________________________
÷_______________________
#trans containing S total #trans
![Page 38: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/38.jpg)
CS102Data Mining
Lift: Examples
Transactions:
T1: milk, eggs, juice
T2: milk, juice, cookies
T3: eggs, chips
T4: milk, eggs
T5: milk, juice, cookies, chips
juice à cookies Lift = (2/3) ÷ (2/5) = 10/6 = 1.67
eggs à milk Lift = (2/3) ÷ (4/5) = 10/12 = 0.83
Lift:#trans containing S and i #trans containing i
______________________________ ÷ _______________________
#trans containing S total #trans
Lift = 1: no association
Lift > 1: association
Lift < 1: anti-association
![Page 39: Data Mining Algorithms - Stanford Universityweb.stanford.edu/class/cs102/lecturenotes/DataMining.pdf · Data Mining CS102 Association Rules: Lift Association rule Sàimight have high](https://reader036.fdocuments.net/reader036/viewer/2022070704/5e821b1a982c5361695c3248/html5/thumbnails/39.jpg)
CS102Data Mining
Data Mining Algorithms
CS102
Winter 2019