Driving Big Data With Big Compute
Chansup Byun, William Arcand, David Bestor, Bill Bergeron, Matthew Hubbell, Jeremy Kepner, Andrew McCabe, Peter Michaleas, Julie Mullen, David O’Gwynn, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee
2012 IEEE High Performance Extreme Computing Conference
10 - 12 September 2012
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Outline
• Introduction
• LLGrid MapReduce
• Dynamic Distributed Dimensional Data Model (D4M)
• Demonstration
– Data Ingestion Performance
– Database Query Performance
• Summary
The Big Four Cloud Ecosystems
Ecosystems: Enterprise, Big Data, DBMS, Supercomputing
• Each ecosystem is at the center of a multi-$B market
• Pros/cons of each are numerous; diverging hardware/software
• Some missions can exist wholly in one ecosystem; some can’t
Service models shown on the slide:
– IaaS: Interactive, On-demand, Elastic
– PaaS: High performance, Parallel Languages, Scientific computing
– PaaS: Java, Map/Reduce, Easy admin
– SaaS: Indexing, Search, Security
IaaS: Infrastructure as a Service; PaaS: Platform as a Service; SaaS: Software as a Service
The Big Four Cloud Ecosystems
• LLGrid provides interactive, on-demand supercomputing
• Accumulo database provides high performance indexing, search, and authorizations within a Hadoop environment
The Big Four Cloud Ecosystems
• LLGrid MapReduce provides map/reduce interface to supercomputing
• D4M provides an interactive parallel scientific computing environment to databases
Big Compute + Big Data Stack
• Combining Big Compute and Big Data enables entirely new domains
Stack layers shown on the slide:
– Novel Analytics for: Text, Cyber, Bio
– Weak Signatures, Noisy Data, Dynamics
– High Level Composable API: D4M (“Databases for Matlab”)
– Array Algebra
– Distributed Database: Accumulo/HBase (triple store)
– Distributed Database / Distributed File System
– High Performance Computing: LLGrid + Hadoop
– Interactive Supercomputing
Hadoop Architecture Overview
Diagram labels: Hadoop MapReduce Jobs; Hadoop cluster
– NameNode with three DataNodes
– JobTracker with three TaskTrackers
LLGrid_MapReduce Diagram
Components: LLGrid_MapReduce, Scheduler, Mapper Task 1, Mapper Task 2, Reduce Task
Data: input (a, b); output (a.out, b.out); Reduce Out
Arrows: two labeled “scan”; numbered steps 1 to 5, with step 3 labeled “Set dependency on Mapper Tasks”
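The flow in the diagram can be imitated on a single machine. The sketch below is not LLGrid code: `run_map_reduce`, `count_words`, and `sum_counts` are hypothetical names, the mapper tasks run one after another instead of as a scheduler array job, and the reducer signature (output directory, result filename) is an assumption made for illustration.

```python
import os


def run_map_reduce(mapper, input_dir, output_dir, reducer=None, redout=None):
    """Sequential stand-in for the scheduler-driven flow (illustration only)."""
    os.makedirs(output_dir, exist_ok=True)
    # Scan the input directory; each input file becomes one mapper task.
    tasks = []
    for name in sorted(os.listdir(input_dir)):
        src = os.path.join(input_dir, name)
        dst = os.path.join(output_dir, name + ".out")
        tasks.append((src, dst))
    # On a cluster these would be elements of one array job; here they run in order.
    for src, dst in tasks:
        mapper(src, dst)
    # The reduce step runs only after every mapper task has finished,
    # which is what the job dependency enforces on a real scheduler.
    if reducer is not None:
        reducer(output_dir, redout)
    return [dst for _, dst in tasks]


def count_words(src, dst):
    """Hypothetical mapper: write the word count of src into dst."""
    with open(src) as f, open(dst, "w") as out:
        out.write(str(len(f.read().split())))


def sum_counts(output_dir, redout):
    """Hypothetical reducer: scan the mapper outputs and add up the counts."""
    total = 0
    for name in sorted(os.listdir(output_dir)):
        if name.endswith(".out"):
            with open(os.path.join(output_dir, name)) as f:
                total += int(f.read())
    with open(redout, "w") as out:
        out.write(str(total))
```

With input files a and b, the call produces a.out and b.out in the output directory and then a single reduce result, mirroring the file names on the slide.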
LLGrid_MapReduce API
LLGrid_MapReduce --np nTasks \
    --mapper MyMapper \
    --input input_dir \
    --output output_dir \
    [--reducer MyReducer \]
    [--redout output_filename]
– MyMapper must have two inputs: input filename and output filename.
– LLGrid_MapReduce creates an array job to process all the input files in the input directory for MyMapper.
– [Optional] LLGrid_MapReduce creates a job for MyReducer to process the output from MyMapper.
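The only mapper requirement stated above is the two-argument contract: an input filename and an output filename. A minimal sketch of a mapper written against that contract follows. The names `map_file` and `main`, the upper-casing transformation, and the exit codes are assumptions for illustration and are not taken from LLGrid.

```python
import sys


def map_file(input_filename, output_filename):
    """Hypothetical mapper body: upper-case every line of the input file."""
    with open(input_filename) as src, open(output_filename, "w") as dst:
        for line in src:
            dst.write(line.upper())


def main(argv):
    """Entry point taking an argv-style list: [program, input, output]."""
    # The slide fixes only the two-argument contract; the exit codes are our choice.
    if len(argv) != 3:
        print("usage: my_mapper.py INPUT_FILE OUTPUT_FILE", file=sys.stderr)
        return 2
    map_file(argv[1], argv[2])
    return 0
```

Saved as a script, this would typically finish by passing `sys.argv` to `main` and exiting with the returned code, so that each array-job task handles exactly one input file.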
High Level Language: D4M
D4M: Dynamic Distributed Dimensional Data Model
• A D4M query returns a sparse matrix or a graph for statistical signal processing or graph analysis in MATLAB
• D4M binds associative arrays to databases, enabling rapid prototyping of data-intensive cloud analytics and visualization
Diagram labels: Distributed Database; Associative Arrays; Numerical Computing Environment; Query: Alice, Bob, Cathy, David, Earl
Triple Store Representation: Graphs as Matrices
• Graphs can be represented as sparse matrices
– Multiply by the adjacency matrix to step to neighbor vertices
– Work-efficient implementation from sparse data structures
Figure: a graph with vertices 1 to 7, a vertex vector x, the transposed adjacency matrix A^T, and the product A^T x
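Stepping to neighbor vertices by multiplying with the transposed adjacency matrix can be sketched with a sparse, row-oriented structure. The edge list below is hypothetical (the edges of the figure are not recoverable from the transcript), and `step` is an illustrative name. Only edges leaving the current frontier are touched, which is the sense in which the sparse form is work-efficient.

```python
def step(adj_rows, x):
    """Compute y = A^T x, with A stored sparsely as {src: {dst: weight}}.

    x is a sparse vector {vertex: value}; only rows of A named in x are read.
    """
    y = {}
    for src, value in x.items():
        for dst, weight in adj_rows.get(src, {}).items():
            y[dst] = y.get(dst, 0) + weight * value
    return y


# Hypothetical 7-vertex directed graph; an edge (i, j) means A[i][j] = 1.
edges = [(1, 2), (1, 4), (2, 5), (2, 7), (3, 6),
         (4, 1), (4, 3), (5, 6), (6, 3), (7, 3)]
A = {}
for src, dst in edges:
    A.setdefault(src, {})[dst] = 1

x = {1: 1}                # start at vertex 1
one_hop = step(A, x)      # neighbors of vertex 1
two_hops = step(A, one_hop)
```

Applying `step` repeatedly gives the frontier of a breadth-first traversal, with the values counting the number of paths that reach each vertex.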
Associative Arrays Concept
• Like Perl associative arrays but in 2D and mixed data types
A('alice ','bob ') = 'talked '
or A('alice ','bob ') = 47.0
• 1-to-1 correspondence with triple store
('alice ','bob ','talked ')
or ('alice ','bob ',47.0)
Figure labels: alice, bob, talked
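The correspondence between a 2D associative array and a triple store can be shown with a plain dictionary keyed by (row, column) pairs. This is an illustrative sketch and not the D4M implementation; `to_triples` and `from_triples` are hypothetical helper names, and the second entry is invented to show mixed value types.

```python
def to_triples(assoc):
    """Flatten a {(row, col): value} array into sorted (row, col, value) triples."""
    return sorted((row, col, val) for (row, col), val in assoc.items())


def from_triples(triples):
    """Rebuild the 2D associative array from (row, col, value) triples."""
    return {(row, col): val for row, col, val in triples}


A = {}
A[("alice", "bob")] = "talked"   # string value
A[("bob", "alice")] = 47.0       # numeric value in the same array

triples = to_triples(A)
round_trip = from_triples(triples)
```

Each array entry maps to exactly one triple and back, so the round trip returns the original array.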
Associative Arrays Implementation
• Composable mathematical operations
A + B    A - B    A & B    A|B    A*B
• Composable query operations via array indexing
A('alice ', :)         alice row
A('alice bob ', :)     alice and bob row
A('al* ', :)           rows beginning with al
A('alice : bob ', :)   rows alice to bob
A(1:2, :)              first two rows
A == 47.0              all entries equal to 47.0
• Complex queries with ~50x less effort than Java/SQL
• Naturally leads to high performance parallel implementation
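The query operations listed above compose because each one returns another associative array. The sketch below imitates that behavior with a dictionary keyed by (row, column). It does not reproduce the D4M indexing syntax; all function names and the sample data are hypothetical.

```python
def select_rows(assoc, keep):
    """Return the sub-array whose row keys satisfy keep(row)."""
    return {(r, c): v for (r, c), v in assoc.items() if keep(r)}


def rows_named(assoc, *names):
    """Rows matching any of the given keys, like A('alice bob ', :)."""
    return select_rows(assoc, lambda r: r in names)


def rows_starting_with(assoc, prefix):
    """Rows whose key begins with prefix, like A('al* ', :)."""
    return select_rows(assoc, lambda r: r.startswith(prefix))


def rows_between(assoc, first, last):
    """Rows in the inclusive key range, like A('alice : bob ', :)."""
    return select_rows(assoc, lambda r: first <= r <= last)


def first_rows(assoc, count):
    """The first `count` rows in sorted key order, like A(1:2, :)."""
    wanted = sorted({r for r, _ in assoc})[:count]
    return select_rows(assoc, lambda r: r in wanted)


def entries_equal_to(assoc, value):
    """Entries with the given value, like A == 47.0."""
    return {key: v for key, v in assoc.items() if v == value}


A = {
    ("alfred", "carl"): 1.0,
    ("alice", "bob"): 47.0,
    ("bob", "alice"): 47.0,
    ("carl", "bob"): 3.0,
}
al_rows = rows_starting_with(A, "al")
alice_to_bob = rows_between(A, "alice", "bob")
composed = entries_equal_to(rows_starting_with(A, "al"), 47.0)
```

Because every helper takes and returns the same structure, queries chain in any order, as in the last line.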
Data Ingestion With LLGrid MapReduce
• A Python application to parse ASCII files and to ingest the result into an Accumulo database
• Presplit the table by letters+numbers+punctuation
• Prepend random string (32 in this case) to row keys
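Why a random prefix helps with a presplit table can be illustrated without a database. The sketch below is an assumption-laden model, not the authors' ingest code: split points are taken to be single characters (letters, digits, punctuation), the prefix is one random character, and `salted_key` and `tablet_index` are hypothetical names. The slide's "32" is not modeled, since the transcript does not say whether it is a length or a count.

```python
import bisect
import random
import string

# Assumed split points: one per letter, digit, and punctuation character.
SPLIT_POINTS = sorted(string.ascii_letters + string.digits + string.punctuation)


def salted_key(row_key, rng, alphabet=SPLIT_POINTS):
    """Prepend one random character so keys spread across the split points."""
    return rng.choice(alphabet) + row_key


def tablet_index(key, split_points=SPLIT_POINTS):
    """Index of the key range that a key falls into, given sorted split points."""
    return bisect.bisect_left(split_points, key)


rng = random.Random(0)  # seeded so the illustration is repeatable
plain_keys = ["%06d" % i for i in range(1000)]   # sequential keys, all start with "0"
salted_keys = [salted_key(k, rng) for k in plain_keys]

plain_tablets = {tablet_index(k) for k in plain_keys}
salted_tablets = {tablet_index(k) for k in salted_keys}
```

In this model the sequential keys all land in one key range, while the salted keys spread over many ranges, so writes would be shared among tablet servers. The cost is that range scans over the original key order are no longer contiguous.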
Accumulo Ingestion Scalability Study: LLGrid MapReduce With A Python Application
Accumulo Database: 1 Master + 7 Tablet servers
Data #1: 5 GB in 200 files
Data #2: 30 GB in 1000 files
Chart annotation: 4 Mil e/s (million entries per second)
Graph500 Benchmark
• Scalable benchmark specified by graph community
• Very large power law graph
– Local Rows, Cols, Vals: 220790, 220935, 2047790
Figures: Adjacency Matrix; Vertex In Degree Distribution (Number of Vertices vs. In Degree, labeled Power Law)
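The in-degree distribution plotted on this slide can be computed from an edge list in two counting passes. The sketch below uses a tiny hypothetical edge list rather than Graph500 data, and `in_degree_distribution` is an illustrative name. Vertices with no incoming edges do not appear in the result.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree value to the number of vertices having that in-degree."""
    in_degree = Counter(dst for _, dst in edges)   # vertex -> number of incoming edges
    histogram = Counter(in_degree.values())        # in-degree -> number of vertices
    return dict(sorted(histogram.items()))


# Hypothetical directed edges (src, dst); vertex 2 is a small "hub".
edges = [(1, 2), (3, 2), (4, 2), (5, 2), (1, 3), (2, 3), (1, 4), (2, 5)]
distribution = in_degree_distribution(edges)
```

For a power law graph, plotting this histogram on log-log axes gives an approximately straight line, which is what the slide's figure label refers to.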
Accumulo Data Ingestion Scalability: pMATLAB Application Using D4M
Ingestion Rate History
Figures: Ingestion Rate History with 6 Tablet Servers; Ingestion Rate History with 1 Tablet Server (Number of entries/second vs. Time (HH:MM))
Effect of Pre-Split
Accumulo with 8 tablet servers
Effect of Ingestion Block Size
Accumulo with 8 tablet servers
Accumulo Column Query Time: pMATLAB Application Using D4M
Accumulo Row Query Time: pMATLAB Application Using D4M
Scan Rate History: Accumulo DB With 1 Tablet Server
Figure: Scan Rate (entries/second) vs. Time (HH:MM), annotated with the start and the end of the query operation
Scan Rate History: Accumulo DB With 6 Tablet Servers
Figure: Scan Rate (entries/second) vs. Time (HH:MM), annotated with the start and the end of the query operation
Summary
• We have demonstrated using an MPI cluster (LLGrid) environment to drive big data applications on a Hadoop cluster environment.
– LLGrid MapReduce
– Parallel Matlab with D4M (Dynamic Distributed Dimensional Data Model)
• Data ingestion and database query results show good scalability in the following use-case scenarios.
– A Python application with LLGrid MapReduce
– A parallel Matlab application with D4M
• Graph500 benchmark