Big Data Analysis Technology
![Page 1: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/1.jpg)
Big Data Analysis Technology
University of Paderborn
L.079.08013 Seminar: Cloud Computing and Big Data Analysis (in English)
Summer semester 2013
June 12, 2013
Tobias Hardes (6687549) – [email protected]
![Page 2: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/2.jpg)
Table of contents
- Introduction
  - Definitions
- Background
  - Example
- Related Work
  - Research
- Main Approaches
  - Association Rule Mining
  - MapReduce Framework
- Conclusion
![Page 3: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/3.jpg)
4 Big keywords
![Page 4: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/4.jpg)
Big Data vs. Business Intelligence
How can we predict cancer early enough to treat it successfully?
How can I make a significant profit on the stock market next month?
Which is the most profitable branch of our supermarket? In a specific country? During a specific period of time?
docs.oracle.com
![Page 5: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/5.jpg)
Background
home.web.cern.ch
![Page 6: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/6.jpg)
Big Science – The LHC
- 600 million times per second, particles collide within the Large Hadron Collider (LHC)
- Each collision generates new particles
- Particles decay in complex ways
- Each collision is detected
- The CERN Data Center reconstructs this collision event
- 15 petabytes of data are stored every year
- The Worldwide LHC Computing Grid (WLCG) is used to crunch all of the data
home.web.cern.ch
![Page 7: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/7.jpg)
Data Stream Analysis
- Just-in-time analysis of data
- Sensor networks
- Analysis over a certain time span (e.g. the last 30 seconds)
http://venturebeat.com
![Page 8: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/8.jpg)
Complex event processing (CEP)
- Provides queries for streams
- Usage of “Event Processing Languages” (EPL)
- `select avg(price) from StockTickEvent.win:time(30 sec)`
https://forge.fi-ware.eu
Tumbling Window (Slide = WindowSize)
Sliding Window (Slide < WindowSize)
[Figure labels: Window, Slide]
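The EPL statement above averages prices over the last 30 seconds. As a rough illustration of what such a time window does, here is a minimal Python sketch. The class name `TimeWindowAverage` and the sample events are assumptions made for this example; this is not how an EPL engine is actually implemented.

```python
from collections import deque


class TimeWindowAverage:
    """Average of the values seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Register a new event and drop events that left the window."""
        self.events.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def average(self):
        return self.total / len(self.events) if self.events else None


# Hypothetical stock ticks: (timestamp in seconds, price)
window = TimeWindowAverage(30)
for ts, price in [(0, 10.0), (10, 20.0), (25, 30.0), (40, 40.0)]:
    window.add(ts, price)
    print(ts, window.average())
```

Keeping a running total means each event is added and removed once, which matters when the stream is too fast to rescan the window on every query.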
![Page 9: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/9.jpg)
Complex Event Processing - Areas of application
- Just-in-time analysis
  - Complexity of algorithms
- CEP is used with Twitter:
  - Identify emotional states of users
  - Sarcasm?
![Page 10: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/10.jpg)
Related Work
![Page 11: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/11.jpg)
Big Data in companies
![Page 12: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/12.jpg)
Principles
- Statistics
- Probability theory
- Machine learning

Data Mining
- Association rule learning
- Cluster analysis
- Classification
![Page 13: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/13.jpg)
Association Rule Mining – Cluster analysis
Association Rule Mining
- Relationships between items
- Find associations, correlations or causal structures
- Apriori algorithm
- Frequent Pattern (FP)-Growth algorithm

Is soda purchased with bananas?
![Page 14: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/14.jpg)
Cluster analysis – Classification
Cluster Analysis
- Classification of similar objects into classes
- Classes are defined during the clustering
- K-Means
- K-Means++
![Page 15: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/15.jpg)
Research and future work
- Performance, performance, performance…
  - Passes over the data source
  - Parallelization
  - NP-hard problems
  - …
- Accuracy
  - Optimized solutions
![Page 16: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/16.jpg)
Example
- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
![Page 17: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/17.jpg)
Distributed computing – Motivation
- Complex computational tasks
- Several terabytes of data
- Limited hardware resources

Google's MapReduce framework
Prof. Dr. Erich Ehses (FH Köln)
![Page 18: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/18.jpg)
Main approaches
http://ultraoilforpets.com
![Page 19: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/19.jpg)
Structure
- Association rule mining
  - Apriori algorithm
  - FP-Growth algorithm
- Google's MapReduce
![Page 20: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/20.jpg)
Association rule mining
- Identify items that are related to other items
- Example: Analysis of baskets in an online shop or in a supermarket
http://img.deusm.com/
![Page 21: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/21.jpg)
Terminology
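- A stream or a database with n elements: $S$
- Item set: $A$
- Frequency of occurrence of an item set: $\Phi(A)$
- Association rule: $A \Rightarrow B$
- Support: $\mathrm{sup}(A \Rightarrow B) = \frac{\Phi(A \cup B)}{|S|}$
- Confidence: $\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)}$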
![Page 22: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/22.jpg)
Example
- Rule: “If a basket contains cheese and chocolate, then it also contains bread”
- 6 of 60 transactions contain cheese and chocolate
- 3 of the 6 transactions contain bread
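The numbers of this example can be plugged into the support and confidence measures. The worked calculation below assumes the definitions that are used in the rule-generation example later in the deck (support as the relative frequency of the combined item set, confidence as the ratio of the two supports); the symbols $A$ and $B$ are introduced here only for this calculation.

```latex
\begin{align*}
A &= \{\text{cheese}, \text{chocolate}\}, \quad B = \{\text{bread}\}, \quad |S| = 60 \\
\mathrm{sup}(A \Rightarrow B) &= \frac{\Phi(A \cup B)}{|S|} = \frac{3}{60} = 5\% \\
\mathrm{conf}(A \Rightarrow B) &= \frac{\Phi(A \cup B)}{\Phi(A)} = \frac{3}{6} = 50\%
\end{align*}
```

So the rule holds in half of the baskets that contain cheese and chocolate, but it is backed by only 5% of all transactions.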
![Page 23: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/23.jpg)
Common approach
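- Split the problem into two tasks:
1. Generation of frequent item sets
   • Find item sets that satisfy a minimum support value: $\mathit{min\_sup} \le \mathrm{sup}(A) = \frac{\Phi(A)}{|S|}$
2. Generation of rules
   • Find high-confidence rules using the frequent item sets: $\mathit{min\_conf} \le \mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{sup}(A \cup B)}{\mathrm{sup}(A)}$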
![Page 24: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/24.jpg)
Apriori algorithm – Frequent item sets
Input:
- Minimum support: min_sup
- Data source: S
![Page 25: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/25.jpg)
Apriori – Frequent item sets (I)
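Generation of frequent item sets: min_sup = 2

| TID | Transaction |
|---|---|
| 1 | (B,C) |
| 2 | (B,C) |
| 3 | (A,C,D) |
| 4 | (A,B,C,D) |
| 5 | (B,D) |

[Figure: item set lattice starting at {} with the single items A, B, C, D and their support counts]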
https://www.mev.de/
![Page 26: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/26.jpg)
Apriori – Frequent item sets (II)
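Generation of frequent item sets: min_sup = 2

| TID | Transaction |
|---|---|
| 1 | (B,C) |
| 2 | (B,C) |
| 3 | (A,C,D) |
| 4 | (A,B,C,D) |
| 5 | (B,D) |

Candidates per level, starting from {}:

| Level | Candidates (support count) | Frequent item sets |
|---|---|---|
| 1 | A (2), B (4), C (4), D (3) | L1 = {A, B, C, D} |
| 2 | AB (1), AC (2), AD (2), BC (3), BD (2), CD (2) | L2 = {AC, AD, BC, BD, CD} |
| 3 | ACD (2), BCD (1) | L3 = {ACD} |

https://www.mev.de/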
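The level-wise search of this example can be reproduced with a short Python sketch. It is a simplified illustration, not the pseudocode from the slides: the function name `apriori` is an assumption, support is counted in absolute numbers, and candidates are built from unions of frequent item sets rather than from an ordered prefix join.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {k: {itemset: count}} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data source per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent, k = {}, 1
    while level:
        frequent[k] = level
        k += 1
        # Join: unions of two frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = count(candidates)
    return frequent


transactions = ["BC", "BC", "ACD", "ABCD", "BD"]
result = apriori(transactions, min_sup=2)
for k, level in result.items():
    print(f"L{k}:", sorted("".join(sorted(s)) for s in level))
```

On the five transactions above this yields the same three levels as the slide: AB is dropped at level 2, and BCD is counted at level 3 but falls below min_sup.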
![Page 27: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/27.jpg)
Apriori Algorithm – Rule generation
- Uses frequent item sets to extract high-confidence rules
- Based on the same principle as the item set generation
- Done for all frequent item sets $L_k$
![Page 28: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/28.jpg)
Example: Rule generation
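| TID | Items |
|---|---|
| T1 | {Coffee; Pasta; Milk} |
| T2 | {Pasta; Milk} |
| T3 | {Bread; Butter} |
| T4 | {Coffee; Milk; Butter} |
| T5 | {Milk; Bread; Butter} |

Rule: $\text{Butter} \Rightarrow \text{Milk}$

$$\mathrm{sup}(\text{Butter} \Rightarrow \text{Milk}) = \frac{\Phi(\text{Butter} \cup \text{Milk})}{|S|} = \frac{2}{5} = 40\%$$

$$\mathrm{conf}(\text{Butter} \Rightarrow \text{Milk}) = \frac{\mathrm{sup}(\text{Butter} \cup \text{Milk})}{\mathrm{sup}(\text{Butter})} = \frac{40\%}{60\%} \approx 66.7\%$$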
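The two measures of this example can be checked with a few lines of Python. The helper names `support` and `confidence` are chosen for this sketch only.

```python
def support(itemset, transactions):
    """Relative frequency of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its left-hand side."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


transactions = [
    {"Coffee", "Pasta", "Milk"},
    {"Pasta", "Milk"},
    {"Bread", "Butter"},
    {"Coffee", "Milk", "Butter"},
    {"Milk", "Bread", "Butter"},
]

sup = support({"Butter", "Milk"}, transactions)        # 2 of 5 transactions
conf = confidence({"Butter"}, {"Milk"}, transactions)  # 2 of the 3 Butter transactions
print(f"sup = {sup:.0%}, conf = {conf:.1%}")
```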
![Page 29: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/29.jpg)
Summary Apriori algorithm
- n+1 scans of the database
- Expensive generation of the candidate item sets
- Implements a level-wise search using the frequent item set property
- Easy to implement
- Some opportunities for specialized optimizations
![Page 30: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/30.jpg)
FP-Growth algorithm
- Used for databases
- Features:
  - Requires 2 scans of the database
  - Uses a special data structure: the FP-Tree
    1. Build the FP-Tree
    2. Extract frequent item sets
- Compression of the database
- Divide this database and apply data mining
![Page 31: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/31.jpg)
Construct FP-Tree
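| TID | Items |
|---|---|
| 1 | {a,b} |
| 2 | {b,c,d} |
| 3 | {a,c,d,e} |
| 4 | {a,d,e} |
| 5 | {a,b,c} |
| 6 | {a,b,c,d} |
| 7 | {a} |
| 8 | {a,b,c} |
| 9 | {a,b,d} |
| 10 | {b,c,e} |

[Figure: FP-Tree built from the transactions above]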
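The construction of the tree from these ten transactions can be sketched in Python as two scans of the data. The names `Node`, `build_fp_tree` and `header` are assumptions of this sketch, and breaking ties between equally frequent items alphabetically is a choice made here, not something the slides specify.

```python
from collections import Counter


class Node:
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}  # item -> child Node


def build_fp_tree(transactions, min_sup):
    # First scan: support count of every item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, n in support.items() if n >= min_sup}

    root = Node(None)
    header = {}  # item -> list of all nodes that carry this item (node links)

    # Second scan: map every transaction onto a path of the tree.
    for t in transactions:
        # Items in decreasing support order, ties broken alphabetically.
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1  # shared prefixes are compressed into one path
    return root, header


transactions = ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]
root, header = build_fp_tree(transactions, min_sup=2)
for item, child in sorted(root.children.items()):
    print(item, child.count)
```

With this ordering the root gets two children, one for the eight transactions that contain a and one for the two that start with b.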
![Page 32: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/32.jpg)
Extract frequent itemsets (I)
- Bottom-up strategy
- Start with node “e”
- Then look for “de”
- Each path is processed recursively
- Solutions are merged
![Page 33: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/33.jpg)
Extract frequent itemsets (II)
Φ(e) = 3 – Assume the minimum support was set to 2
- Is e frequent?
- Is de frequent?
  - …
- Is ce frequent?
  - …
- Is be frequent?
  - …
- Is ae frequent?
  - …

Using subproblems to identify frequent itemsets
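The answers to the questions above can be checked directly against the ten transactions of the FP-Tree example. Note that the sketch below is a brute-force count over the transactions, not the FP-Growth traversal itself; it only shows which of the sub-problems lead to frequent item sets. The helper name `phi` is an assumption of this example.

```python
# The ten transactions from the FP-Tree construction example.
transactions = [set(t) for t in
                ["ab", "bcd", "acde", "ade", "abc", "abcd", "a", "abc", "abd", "bce"]]
min_sup = 2


def phi(itemset):
    """Absolute frequency of an item set in the transactions."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


counts = {name: phi(name) for name in ["e", "de", "ce", "be", "ae"]}
for name, n in counts.items():
    print(name, n, "frequent" if n >= min_sup else "not frequent")
```

Only be falls below the minimum support, so an FP-Growth run would not need to extend that branch any further.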
![Page 34: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/34.jpg)
Extract frequent itemsets (III)
1. Update the support count along the prefix path
2. Remove node e
3. Check the frequency of the paths

Find item sets with de, ce, ae or be
![Page 35: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/35.jpg)
Apriori vs. FP-Growth
- FP-Growth has some advantages
  - Two scans of the database
  - No expensive computation of candidates
  - Compressed data structure
  - Easier to parallelize

W. Zhang, H. Liao, and N. Zhao, “Research on the FP growth algorithm about association rule mining”
![Page 36: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/36.jpg)
MapReduce
- Map and Reduce functions are written by the developer
- map(key, val)
  - Emits new key-value pairs
- reduce(key, values)
  - Emits an arbitrary output
  - Usually a key with one value
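To make the two functions concrete, here is a minimal single-process sketch of a word count job in Python. It only imitates the map, shuffle and reduce steps in memory; the names `map_fn`, `reduce_fn` and `run_job` are assumptions of this example and do not correspond to the API of Google's framework or of Hadoop.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name, value: document text. Emits (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """key: a word, values: all counts emitted for it. Emits (word, total)."""
    yield key, sum(values)


def run_job(inputs, map_fn, reduce_fn):
    # Map phase, followed by a shuffle that groups the emitted values by key.
    intermediate = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per distinct intermediate key.
    output = {}
    for key in sorted(intermediate):
        for out_key, out_value in reduce_fn(key, intermediate[key]):
            output[out_key] = out_value
    return output


docs = {"doc1": "big data big analysis", "doc2": "data stream"}
print(run_job(docs, map_fn, reduce_fn))
```

In a real cluster the map calls, the grouping by key and the reduce calls run on different workers; the sketch keeps them in one loop only to show the data flow.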
![Page 37: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/37.jpg)
MapReduce – Word count
![Page 38: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/38.jpg)
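[Figure: MapReduce execution overview. The user program (1) forks a master and several workers. The master (2) assigns map and reduce tasks to the workers. In the map phase, workers (3) read the input files and (4) write intermediate files locally. After the shuffle, the reduce workers (one each for the red, blue and yellow keys) (5) fetch the intermediate data via RPC and (6) write the output files in the reduce phase. Finally, (7) control returns to the user program.]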
![Page 39: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/39.jpg)
Conclusion: MapReduce (I)
- MapReduce is designed as a batch processing framework
  - Not suited to ad-hoc analysis
  - Used for very large data sets
  - Used for time-intensive computations
- Open source implementation: Apache Hadoop
http://hadoop.apache.org/
![Page 40: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/40.jpg)
Conclusion
![Page 41: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/41.jpg)
Conclusion (I)
- Big Data is important for research and in daily business
- Different approaches
  - Data stream analysis
    - Complex event processing
  - Rule mining
    - Apriori algorithm
    - FP-Growth algorithm
![Page 42: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/42.jpg)
Conclusion (II)
- Clustering
  - K-Means
  - K-Means++
- Distributed computing
  - MapReduce
- Performance / Runtime
  - Multiple minutes
  - Hours
  - Days…
- Online analytical processing for Big Data?
![Page 43: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/43.jpg)
Thank you for your attention
![Page 44: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/44.jpg)
Appendix
![Page 45: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/45.jpg)
Big Data definitions
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (Gartner Inc.)

Every day, we create 2.5 quintillion bytes of …. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data. (IBM Corporate)

“Big data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. (McKinsey & Company)
![Page 47: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/47.jpg)
Complex Event Processing – Windows
Tumbling Window (Slide = WindowSize)
- Moves as much as the window size

Sliding Window (Slide < WindowSize)
- Slides in time
- Buffers the last x elements

[Figure labels: Window, Slide]
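The difference between the two window types comes down to the relation between slide and window size. The Python sketch below illustrates this for count-based windows over a finite list of events; the function name `windows` and its parameters are assumptions made for this example.

```python
def windows(events, size, slide):
    """Yield successive windows of `size` events, advancing by `slide` events."""
    for start in range(0, len(events) - size + 1, slide):
        yield events[start:start + size]


events = [1, 2, 3, 4, 5, 6]

# Tumbling window: slide == window size, so windows never overlap.
tumbling = list(windows(events, size=3, slide=3))

# Sliding window: slide < window size, so consecutive windows share events.
sliding = list(windows(events, size=3, slide=1))

print(tumbling)
print(sliding)
```

With a tumbling window every event is evaluated exactly once, while a sliding window re-evaluates the buffered events each time it moves.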
![Page 48: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/48.jpg)
MapReduce vs. BigQuery
![Page 49: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/49.jpg)
Apriori Algorithm (Pseudocode)
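[Pseudocode: only the control structure is recoverable; the loop conditions and set expressions are missing]
- for ( … )
  - for each … do
    - …
    - for each … do
    - end for
  - end for
  - if … then
  - end if
- end for
- return …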
![Page 53: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/53.jpg)
![Page 54: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/54.jpg)
Distributed computing of Big Data
CERN's Worldwide LHC Computing Grid (WLCG), launched in 2002
- Stores, distributes and analyses the 15 petabytes of data
- 140 centres across 35 countries
![Page 55: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/55.jpg)
Apriori Algorithm – aprioriGen Join
- Do not generate too many candidate item sets, while making sure not to lose any that turn out to be large.
- Assume that the items are ordered (alphabetically)
- If $\{a_1, a_2, \dots, a_{k-1}\} = \{b_1, b_2, \dots, b_{k-1}\}$ and $a_k < b_k$, then $\{a_1, a_2, \dots, a_k, b_k\}$ is a candidate $(k+1)$-item set.
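The join condition can be written down almost literally in Python when item sets are stored as sorted tuples. The sketch below also applies the prune step that follows the join; the function name `apriori_gen` and the tuple representation are assumptions of this example.

```python
from itertools import combinations


def apriori_gen(frequent_k):
    """Build candidate (k+1)-item sets from frequent k-item sets.

    `frequent_k` is a collection of sorted tuples that all have length k.
    """
    frequent_k = set(frequent_k)
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))

    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            # Join: the first k-1 items are equal and a_k sorts before b_k.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(a + (b[-1],))

    # Prune: drop candidates that contain an infrequent k-subset.
    return {c for c in candidates
            if all(s in frequent_k for s in combinations(c, k))}


# L2 of the Apriori example with min_sup = 2
l2 = {("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")}
print(sorted(apriori_gen(l2)))
```

For the frequent 2-item sets of the earlier example, the join produces ACD and BCD, and both survive the prune step because all of their 2-subsets are frequent.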
![Page 56: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/56.jpg)
Big Data vs. Business Intelligence
Big Data
- Large and complex data sets
- Temporal, historical, …
- Difficult to process and to analyse
- Used for deep analysis and reporting:
  - How can we predict cancer early enough to treat it successfully?
  - How can I make a significant profit on the stock market next month?

Business Intelligence
- Transformed data
- Historical view
- Easy to process and to analyse
- Used for reporting:
  - Which is the most profitable branch of our supermarket?
  - Which postcodes suffered the most dropped calls in July?
![Page 57: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/57.jpg)
Improvement approaches
- Selection of startup parameters for algorithms
- Reducing the number of passes over the database
- Sampling the database
- Adding extra constraints for patterns
- Parallelization
![Page 58: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/58.jpg)
Improvement approaches – Examples
![Page 59: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/59.jpg)
Example: FA-DMFI
- Algorithm for discovering frequent item sets
- Reads the database once
- Compresses it into a matrix
- Frequent item sets are generated by cover relations, so further costly computations are avoided
![Page 60: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/60.jpg)
K-Means algorithm
1. Select k entities as the initial centroids.
2. (Re)assign all entities to their closest centroids.
3. Recompute the centroid of each newly assembled cluster.
4. Repeat steps 2 and 3 until the centroids do not change or until the maximum number of iterations is reached.
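The four steps translate into a compact Python sketch. It makes two assumptions that the slide leaves open: the first k entities are taken as initial centroids, and a cluster that ends up empty keeps its previous centroid. The function name `k_means` and the sample points are chosen for illustration only.

```python
def k_means(points, k, max_iterations=100):
    """Cluster `points` (tuples of numbers) into k groups."""
    # Step 1: select k entities as initial centroids (assumption: the first k).
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(max_iterations):
        # Step 2: (re)assign every entity to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Step 3: recompute the centroid of each cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # assumption: empty cluster keeps its centroid
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Because the result depends on the initial centroids, the choice in step 1 is exactly what K-Means++ tries to improve.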
![Page 61: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/61.jpg)
Solving approaches
- K-Means clustering is NP-hard
- Optimization methods to handle NP-hard problems (K-Means clustering)
![Page 62: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/62.jpg)
Examples
- Apriori algorithm: n+1 database scans
- FP-Growth algorithm: 2 database scans
- K-Means: Exponential runtime
- K-Means++: Improve startup parameters
![Page 63: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/63.jpg)
Google's BigQuery
- Upload: Upload the data set to Google Storage
- Process: Import data to tables
- Analyse: Run queries

http://glenn-packer.net/
![Page 64: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/64.jpg)
The Apriori algorithm
- Best-known algorithm for rule mining
- Based on a simple principle:
  - “If an item set is frequent, then all subsets of this item set are also frequent”
- Input:
  - Minimum confidence: min_conf
  - Minimum support: min_sup
  - Data source: S
![Page 65: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/65.jpg)
Apriori Algorithm – aprioriGen
- Generates a candidate item set that might be larger
- Join: Generation of the item set
- Prune: Elimination of item sets with
![Page 66: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/66.jpg)
Apriori Algorithm – Rule generation -- Example
- {Butter, milk, bread} $\Rightarrow$ {cheese}
- {Butter, meat, bread} $\Rightarrow$ {cola}

{Butter, bread} $\Rightarrow$ {cheese, cola}
![Page 67: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/67.jpg)
How to improve the Apriori algorithm
- Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Sampling: mining on a subset of the given data
- Dynamic itemset counting:
![Page 68: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/68.jpg)
Construction of FP-Tree
- Compressed representation of the database
- First scan
  - Get the support of every item and sort the items by support count
- Second scan
  - Each transaction is mapped to a path
  - Compression is done if overlapping paths are detected
  - Generate links between same nodes
- Each node has a counter: the number of mapped transactions
![Page 69: Big Data Analysis Technology](https://reader035.fdocuments.net/reader035/viewer/2022062521/56816869550346895dded4c0/html5/thumbnails/69.jpg)
FP-Growth algorithm
[Flowchart, reconstructed as steps]
1. Calculate the support count of each item in S
2. Sort items in decreasing support counts
3. Read transaction t
   - No overlapped prefix found:
     - Create new nodes labeled with the items in t
     - Set the frequency count to 1
   - Overlapped prefix found:
     - Increment the frequency count for each overlapped item
     - Create new nodes for non-overlapped items
4. Create additional path to common items
5. hasNext: continue with step 3, otherwise return