Scalable Benchmarks and Kernels for Data Mining and Analytics
Vipin Kumar
University of Minnesota [email protected]
www.cs.umn.edu/~kumar
Joint work with Alok Choudhary and Gokhan Memik (Northwestern) and Michael Steinbach (University of Minnesota)
Research funded by NSF
Need for High Performance Data Mining
Today’s digital society has seen enormous data growth in both commercial and scientific databases
Data Mining is becoming a commonly used tool to extract information from large and complex datasets
Advances in computing capabilities and technological innovation needed to harvest the available wealth of data
Computational Simulations
Internet
Sensor Networks
Geo-spatial data
Biomedical Data
Homeland Security
[Figure: spatio-temporal Earth-science data cube — variables such as SST, Precipitation, NPP, and Pressure indexed by latitude, longitude, and time over grid cells and zones]
Data Mining for Climate Data
NASA ESE questions: How is the global Earth system changing?
What are the primary forcings?
How does the Earth system respond to natural and human-induced changes?
What are the consequences of changes in the Earth system?
How well can we predict future changes?
Global snapshots of values for a number of variables on land surfaces or water
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….
http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html
High Resolution EOS Data:
• EOS satellites provide high resolution measurements
– Finer spatial grids
• A 1 km × 1 km grid produces 694,315,008 data points
• Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in data size (each 0.5° cell, roughly 50 km × 50 km near the equator, contains about 2500 1 km × 1 km cells)
– More frequent measurements
– Multiple instruments
• High resolution data allows us to answer more detailed questions:
– Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
– Finding relationships between leaf area index (LAI) and topography of a river drainage basin
– Finding relationships between fire frequency and elevation as well as topographic position
• Leads to substantially higher computational and memory requirements

Disturbance Viewer
This interactive module displays the locations on the earth surface where significant disturbance events have been detected.
Detection of Ecosystem Disturbances:
Data Mining for Cyber Security
• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks
• Traditional Intrusion Detection Systems (IDS) have well-known limitations
– Too many false alarms
– Unable to detect sophisticated and novel attacks
– Unable to detect insider abuse / policy abuse
• Data Mining is well suited to address these challenges
• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
• Helps analyze data from multiple sensors at DoD sites around the country
• Routinely detects Insider Abuse / Policy Violations / Worms / Scans
Large Scale Data Analysis is needed for
• Correlation of suspicious events across network sites
– Helps detect sophisticated attacks not identifiable by single site analyses
• Analysis of long term data (months/years)
– Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)
MINDS – Minnesota Intrusion Detection System
Data Mining for Biomedical Informatics
Recent technological advances are helping to generate large amounts of both medical and genomic data
• High-throughput experiments/techniques
– Gene and protein sequences
– Gene-expression data
– Biological networks and phylogenetic profiles
• Electronic Medical Records
– IBM–Mayo Clinic partnership has created a DB of 5 million patients
– NIH Roadmap

Data mining offers a potential solution for the analysis of large-scale data
• Automated analysis of patient history for customized treatment
• Design of drugs/chemicals
• Prediction of the functions of anonymous genes
Protein Interaction Network
Role of Benchmarks in Architecture Design
Benchmarks guide the development of new processor architectures in addition to measuring the relative performance of different systems
• SPEC: General purpose architectures (“Advances in the microprocessor industry would not have been possible without the SPEC benchmarks” – David Patterson)
• TPC: Database Systems
• SPLASH: Parallel machine architectures
• Mediabench: Media and Communication Processors
• NetBench: Network/Embedded processors
Do We Need Benchmarks Specific to Data Mining?
Performance metrics of several benchmarks were gathered from VTune
• Cache miss ratios, bus usage, page faults, etc.
Benchmark applications were grouped using Kohonen clustering to spot trends:
[Figure: Kohonen clustering of benchmark applications by architectural performance metrics. The MineBench applications (apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe) fall into clusters distinct from SPEC INT (gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser), SPEC FP (apsi, art, equake, lucas, mesa, mgrid, swim, wupwise), MediaBench (rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast), and TPC-H queries (Q3, Q4, Q6, Q17).]
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW) , V. Kumar and M. Steinbach (UM)
Goal: Establish a comprehensive benchmarking suite for data mining applications.
Motivate the development of new processor architectures and system design for data mining
Motivate the implementation of more sophisticated data mining algorithms that can work with the constraints imposed by current architecture designs
Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains
Profiling dimensions:
• Types of data (streaming, file I/O)
• Types of applications (scientific, bioinformatics, security, …)
• Types of storage (memory, disks, …)
• Scalability (data-level, processor)
• Performance (execution time, cache behavior, …)
Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes
11   No      Married          60K            No
12   Yes     Divorced        220K            No
13   No      Single           85K            Yes
14   No      Married          75K            No
15   No      Single           90K            Yes
Data Mining Tasks: Predictive Modeling, Clustering, Association Rules, Anomaly Detection
Key Data Mining Algorithms
Clustering
• K-means, EM, SOM
• Single link / Group Average hierarchical clustering
• DBSCAN, SNN
Classification
• Bayes
• SVM
• Decision trees, rule-based systems
Association Rule Mining
• Apriori, FP-Growth
Anomaly Detection
• Statistical methods
• Distance-based
• Clustering-based
Preprocessing
• SVD, PCA
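Of these, K-means is the simplest to state; as an illustration (a minimal pure-Python sketch of Lloyd's iteration, not MineBench's C implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Update step (keep the old centroid if a cluster went empty)
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
cents = kmeans(pts, 2)
# converges to one centroid near (0.05, 0.1) and one near (5.1, 4.95)
```

The iterative nature and the per-iteration pass over the whole data set are exactly what make K-means representative of data mining workloads.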
Major Data Mining Kernels
Counting
• Given a set of data records, count the types of different categories to build a contingency table
• Count the occurrences of a set of items in a set of transactions
Pairwise computations
• Given a set of data records, perform pairwise distance/similarity computations
Linear algebra operations
• SVD, PCA
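The pairwise-computation kernel can be made concrete with a small illustrative Python sketch (the O(n²) loop underlying hierarchical clustering, DBSCAN/SNN, and distance-based anomaly detection):

```python
def pairwise_distances(records):
    """Build the full symmetric matrix of Euclidean distances
    between all pairs of records."""
    n = len(records)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum((a - b) ** 2
                       for a, b in zip(records[i], records[j])) ** 0.5
            d[i][j] = d[j][i] = dist  # exploit symmetry: compute once
    return d

m = pairwise_distances([(0, 0), (3, 4), (0, 4)])
# m[0][1] is 5.0 (the 3-4-5 right triangle)
```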
General Characteristics of Data Mining Algorithms
Dense/Sparse data
Hash table / Hash tree
Linked Lists
Iterative nature
Data often too large to fit in main memory
• Spatial locality is critical
Constructing a Decision Tree

Tid  Employed  Level of Education  # years at present address  Credit Worthy
 1   Yes       Graduate             5   Yes
 2   Yes       High School          2   No
 3   No        Undergrad            1   No
 4   Yes       High School         10   Yes
 5   Yes       Graduate             2   No
 6   No        High School          2   No
 7   Yes       Undergrad            3   No
 8   Yes       Graduate             8   Yes
 9   Yes       High School          4   Yes
10   No        Graduate             1   No

Employed = Yes: Worthy: 4, Not Worthy: 3
Employed = No: Worthy: 0, Not Worthy: 3
Education = Graduate: Worthy: 2, Not Worthy: 2
Education = High School / Undergrad: Worthy: 2, Not Worthy: 4
Key Computation

                 Worthy  Not Worthy
Employed = Yes     4         3
Employed = No      0         3
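This counting kernel takes only a few lines (an illustrative Python version, not the benchmark code; the records mirror the ten-record credit-worthiness example):

```python
from collections import defaultdict

# (Employed, Credit Worthy) pairs from the 10-record example
records = [("Yes", "Yes"), ("Yes", "No"), ("No", "No"), ("Yes", "Yes"),
           ("Yes", "No"), ("No", "No"), ("Yes", "No"), ("Yes", "Yes"),
           ("Yes", "Yes"), ("No", "No")]

def contingency(records):
    """One pass over the records builds the attribute-value x class
    contingency table used to evaluate a candidate split."""
    table = defaultdict(lambda: defaultdict(int))
    for attr_value, cls in records:
        table[attr_value][cls] += 1
    return table

t = contingency(records)
# t["Yes"] holds {"Yes": 4, "No": 3}; t["No"] holds {"No": 3}
```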
Employed
Yes → Worthy: 4, Not Worthy: 3
No → Worthy: 0, Not Worthy: 3
Constructing a Decision Tree
Splitting on Employed partitions the training records:

Employed = Yes:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
 1   Yes       Graduate             5   Yes
 2   Yes       High School          2   No
 4   Yes       High School         10   Yes
 5   Yes       Graduate             2   No
 7   Yes       Undergrad            3   No
 8   Yes       Graduate             8   Yes
 9   Yes       High School          4   Yes

Employed = No:
Tid  Employed  Level of Education  # years at present address  Credit Worthy
 3   No        Undergrad            1   No
 6   No        High School          2   No
10   No        Graduate             1   No
Constructing a Decision Tree in Parallel

Partitioning of data only
– global reduction per node is required
– large number of classification tree nodes gives high communication cost

n records, m categorical attributes; each processor builds a local contingency table, and the tables are combined by a global reduction, e.g.:

     Worthy  Not Worthy       Worthy  Not Worthy       Worthy  Not Worthy
Yes    4        3        Yes    2        5        Yes    6        1
No     0        3        No     1        2        No     1        2
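The per-node collective is just an element-wise sum of the local tables; an illustrative Python sketch standing in for what would be an MPI all-reduce, using three example local tables:

```python
def global_reduction(local_tables):
    """Element-wise sum of per-processor contingency tables --
    the single collective operation the data-partitioned decision
    tree algorithm needs per tree node (an all-reduce in MPI terms)."""
    rows = len(local_tables[0])
    cols = len(local_tables[0][0])
    return [[sum(t[r][c] for t in local_tables) for c in range(cols)]
            for r in range(rows)]

# Three processors' local [Worthy, Not Worthy] counts
# for Employed = Yes (row 0) and Employed = No (row 1)
p0 = [[4, 3], [0, 3]]
p1 = [[2, 5], [1, 2]]
p2 = [[6, 1], [1, 2]]
total = global_reduction([p0, p1, p2])
# total is [[12, 9], [2, 7]]
```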
Constructing a Decision Tree in Parallel

Partitioning of classification tree nodes
– natural concurrency
– load imbalance: the amount of work associated with each node varies
– limited concurrency on the upper portion of the tree
– child nodes use the same data as used by the parent node
– loss of locality
– high data movement cost

Example tree: 10,000 training records split into 7,000 and 3,000 records, then into 2,000 / 5,000 and 2,000 / 1,000.
Speedup Comparison of the Three Parallel Algorithms

Data set used in the SLIQ paper (Ref: Mehta, Agrawal and Rissanen, 1996); IBM SP2 with 128 processors.

[Charts: speedup of the hybrid, data partitioning, and tree partitioning algorithms on 0.8 million and 1.6 million examples]

Dynamic load balancing inspired by parallel sparse Cholesky factorization and parallel tree search

Speedup of the Hybrid Algorithm with Different Size Data Sets
Hash Table Access

• Some efficient decision tree algorithms require random access to large data structures.
• Example: SPRINT (Ref: Shafer, Agrawal, Mehta, 1996)

Sorted attribute lists, distributed across processors P0, P1, P2:

ID  Income     ID  Age
 0  25K         2  25
 2  28K         5  31
 8  30K         8  33
 4  30K         1  37
 5  35K         3  41
 1  50K         6  52
 3  52K         4  55
 6  55K         7  60
 7  70K         0  61

Hash table (ID → Left/Right child):
0 Left, 1 Left, 2 Right, 3 Right, 4 Right, 5 Left, 6 Right, 7 Left, 8 Left
[Figure: the hash table routes each record to the Left or Right child node, partitioning the credit-worthiness table into successively smaller subsets at each level of the tree.]
[Figure: processors P0, P1, and P2 each hold a partition of the sorted attribute lists (Income, Age), but every processor consults the same ID → Left/Right hash table.]
Storing the entire hash table on one processor makes the algorithm unscalable
ScalParC (Ref: Joshi, Karypis, Kumar, 1998)

ScalParC is a scalable parallel decision tree construction algorithm
• Scales to a large number of processors
• Scales to large training sets

ScalParC is memory efficient
• The hash table is distributed among the processors

ScalParC performs the minimum amount of communication
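One way to picture a distributed hash table (an illustrative sketch, not ScalParC's actual partitioning scheme): block-partition the entries by record ID, so any processor can compute an entry's owner locally and a lookup needs only a single message:

```python
def owner(record_id, num_procs, table_size):
    """Block-partition hash-table entries: each processor owns a
    contiguous range of record IDs, so the owner of any entry is
    computable locally without consulting a directory."""
    block = -(-table_size // num_procs)  # ceiling division
    return record_id // block

# 9 hash-table entries distributed over 3 processors
assignment = {rid: owner(rid, 3, 9) for rid in range(9)}
# IDs 0-2 go to P0, 3-5 to P1, 6-8 to P2
```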
This ScalParC Design is Inspired by…
Communication Structure of Parallel Sparse Matrix-Vector Algorithms

[Figure: hash table entries distributed across processors P0, P1, and P2, with a communication pattern matching that of parallel sparse matrix-vector multiplication]
Parallel Runtime (Ref: Joshi, Karypis, Kumar, 1998)

[Chart: runtime in seconds (0–120) versus number of processors (0–150) for training sets of 0.2M, 0.4M, 0.8M, 1.6M, 3.2M, and 6.4M records; 128-processor Cray T3D]
Computing Association Patterns

1. Market-basket transactions
2. Find item combinations (itemsets) that occur frequently in the data, e.g. {Bread}, {Diaper}
3. Generate association rules, e.g. {Milk, Diaper} → {Beer}
Counting Candidates

Frequent itemsets are found by counting candidates.

Simple way:
• Search for each candidate in each transaction

N transactions: {A,B,C,D}, {A,C,E}, {B,C,D}, {A,B,D,E}, {B,C,E}, {B,D}
M candidate itemsets, each with a count: {A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {A,B,E}, {B,C,D}, {A,B,D,E}, {A,B,C,D,E}

The naïve approach requires O(NM) comparisons. The number of comparisons can be reduced by using hash tables to store the candidate itemsets.
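The naïve O(NM) scheme can be sketched directly from the slide's transactions and candidates (illustrative Python; a real Apriori implementation would instead probe a hash structure of candidates per transaction):

```python
transactions = [set(t) for t in ("ABCD", "ACE", "BCD", "ABDE", "BCE", "BD")]
candidates = [frozenset(s) for s in
              ("AB", "AC", "AD", "AE", "BC", "BD", "ABE", "BCD",
               "ABDE", "ABCDE")]

def count_naive(transactions, candidates):
    """The simple O(N*M) scheme: test every candidate against every
    transaction via a subset check."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:  # is the candidate itemset contained in the transaction?
                counts[c] += 1
    return counts

counts = count_naive(transactions, candidates)
# e.g. {B, D} is contained in 4 of the 6 transactions
```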
Parallel Association Rules: Scaleup Results (100K,0.25%) (Ref: Han, Karypis, and Kumar, 2000)
DD (Agrawal & Shafer, 1996)
IDD (Han, Karypis, Kumar, 2000)
HD (Han, Karypis, Kumar, 2000)
Efficient implementation of collective communication
Dynamic restructuring of computation
Candidates for MineBench

Algorithm          Category             Description                                                                 Lang.        Parallel
PCA                Preprocessing        Principal component analysis                                                C/C++/FORT.  Y
ABB                Preprocessing        Automatic Branch and Bound                                                  C/C++        N
LVF                Preprocessing        A probabilistic feature selection algorithm                                 C/C++        N
Normalization      Preprocessing        Variable transformation                                                     C/C++        Y
ScalParC           Predictive Modeling  Decision tree classifier                                                    C            Y
Naïve Bayesian     Predictive Modeling  Statistical classifier based on class conditional independence             C++          N
RIPPER             Predictive Modeling  Rule-based predictive modeling                                              C/C++        Y
SVMlight           Predictive Modeling  Support Vector Machines                                                     C/C++        N
K-means            Clustering           Partitioning method                                                         C            Y
Bisecting K-means  Clustering           Partitioning method                                                         C            Y
Fuzzy K-means      Clustering           Fuzzy logic based K-means                                                   C            Y
EM Clustering      Clustering           Partitioning method                                                         C/C++        Y
MAFIA(N)           Clustering           Multidimensional clustering                                                 C            Y
BIRCH              Clustering           Hierarchical method                                                         C++          N
AHC                Clustering           Agglomerative Hierarchical Clustering                                       C/C++        N
DBSCAN             Clustering           Density-based method                                                        C/C++        Y
HOP                Clustering           Density-based method                                                        C            Y
LOF                Anomaly Detection    Local Outlier Factor                                                        C/C++        Y
Outlier Detection  Anomaly Detection    Distance-based outlier detection                                            C/C++        Y
Apriori            ARM                  Horizontal database, level-wise mining based on the Apriori property       C/C++        Y
MAFIA(C)           ARM                  Maximal frequent itemset mining                                             C/C++        N
Eclat              ARM                  Vertical database, breaks the large search space into equivalence classes  C++          N
FP-growth          ARM                  Encodes the database into a compact FP-tree                                 C/C++        N
Analysis of Benchmark Algorithms
Explore the bottlenecks associated with the current general purpose sequential and parallel machines
Explore how different architectural features impact the performance of data mining algorithms
Preliminary Evaluation of Some Sample Data Sets
Example small (S), medium (M), and large (L) data sets:

Dataset  Classification parameter  DB size (MB)  ARM parameter   DB size (MB)
Small    F26-A32-D125K              27           T10-I4-D1000K    47
Medium   F26-A32-D250K              54           T20-I6-D2000K   175
Large    F26-A64-D250K             108           T20-I6-D4000K   350

Execution time (seconds) for some algorithms in the MineBench suite (P1/P4/P8 = 1, 4, 8 processors):

Program         S: P1   P4    P8     M: P1   P4    P8     L: P1    P4    P8
HOP              6.3    1.8   1.2    52.7   27.4  18.7   435.3  128.0  81.5
K-means          5.7    2.0   1.3    12.9    3.3   2.6     -      -      -
Fuzzy K-means  164.1   54.6  26.4   146.8   42.7  27.1     -      -      -
BIRCH            3.5     -     -     31.7     -     -    172.6    -      -
ScalParC        51.0   13.5  10.4   110.6   28.5  21.6   225.9   56.2   36.5
Bayesian        12.6     -     -     25.1     -     -     51.5    -      -
Apriori          6.1    3.0   2.6   102.7   38.6  30.5   200.2   72.6   63.0
Eclat           11.8     -     -     81.5     -     -    127.8    -      -
Reference: [Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004]
Designing Efficient Kernels for Data Mining
Frequency of Kernel Operations in Representative Applications
Understanding of the bottlenecks in executing DM algorithms on current architectures will help design new, more efficient algorithms
Focus will be on designing the frequently used kernels that dominate the execution time of most DM algorithms
Both sequential and parallel versions will be developed
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
Conclusions
Data mining applications are becoming increasingly important
Current systems design approach not adequate for DM applications
MineBench – a new benchmark suite which encompasses many algorithms found in data mining
Initial findings:
• Data mining applications are unique in terms of performance characteristics
• There exists much room for optimization with regard to data mining workloads
Bibliography

• Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, April 2005
• Introduction to Parallel Computing (Second Edition), Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
• Data Mining for Scientific and Engineering Applications, edited by R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu, Kluwer Academic Publishers, 2001
• J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, "Emerging Scientific Applications in Data Mining", Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002
• C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, V. Genovese, "Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets", Global Change Biology 9(7), 1005-1021, 2003
• Vipin Kumar, "Parallel and Distributed Computing for Cyber Security", based on the author's keynote talk at the 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004), DS Online Journal, Volume 6, Number 10, October 2005
• Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey, "Performance Evaluation and Characterization of Scalable Data Mining Algorithms", Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004
• Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary, "Performance Characterization of Data Mining Applications using MineBench", Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006
• Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary, "Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design", Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006