Analyzing Massive Data using Heterogeneous Computing
David A. Bader
OPPORTUNITIES
Opportunities
• Application-oriented Opportunities:
– High performance computing for massive graphs
– Streaming analytics
– Information visualization techniques for massive graphs
– Heterogeneous systems: methodologies for combining the use of the Cloud and manycore for high-performance computing
– Energy-efficient high-performance computing
Opportunity 1: High performance computing for massive graphs
• Traditional HPC has focused primarily on solving large problems from chemistry, physics, and mechanics, using dense linear algebra.
• HPC faces new challenges to deal with:
– time-varying interactions among entities, and
– massive-scale graph abstractions where the vertices represent the nouns or entities and the edges represent their observed interactions.
• Few parallel computers run well on these problems because
– they often lack the locality required to get high performance from distributed-memory, cache-based supercomputers.
• Case study: Massively threaded architectures are shown to run several orders of magnitude faster than the fastest supercomputers on these types of problems!
A focused research agenda is needed to design algorithms that scale on these new platforms.
Opportunity 2: Streaming analytics
• While our high performance computers have delivered a sustained petaflop, they have done so using the same antiquated batch-processing style, in which a program and a static data set are scheduled to compute in the next available slot.
• Today, data is overwhelming in volume and rate, and we struggle to keep up with these streams.
Fundamental computer science research is needed in the design of streaming architectures, and in data structures and algorithms that can compute important analytics while sitting in the middle of these torrential flows.
Opportunity 3: Information Visualization techniques for massive graphs
• Information Visualization today
– addresses traditional scientific computing (fluid flow, molecular dynamics), or
– when handling discrete data, scales to only hundreds of vertices at best.
• However, there is a strong need for visualization in the data sciences, so that analysts can gain understanding from data sets of millions to billions of interacting non-planar discrete entities.
– Applications include: data mining, intelligence, situational awareness
David A. Bader 6
NNDB Mapper of George Washington
Twitter social network using Large Graph Layout
Source: Akshay Java, from ebiquity group
Opportunity 4: Heterogeneous systems: methodologies for combining the use of the Cloud and Manycore for high-performance computing
• Today, there is a dichotomy between using clouds (e.g. Hadoop, MapReduce) for massive data storage, filtering, and summarization, and using massively parallel/multithreaded systems for data-intensive computation.
We must develop methodologies for employing these complementary systems for solving grand challenges in data analysis.
David A. Bader 7
Steve Mills, SVP of IBM Software (left), and Dr. John Kelly, SVP of IBM Research, view Stream Computing technology
Opportunity 5: Energy-efficient high-performance computing
• The main constraint on our ability to compute has changed
– from the availability of compute resources
– to the ability to power and cool our systems within budget.
Holistic research is needed that permeates from the architecture and systems up to the applications and data centers, whereby energy use is a first-class object that can be optimized at all levels.
Microsoft’s Chicago Million-Server Data Center
MOTIVATION
Exascale Streaming Data Analytics: Real-world challenges
All involve analyzing massive streaming complex networks:
• Health care: disease spread; detection and prevention of epidemics/pandemics (e.g. SARS, avian flu, H1N1 “swine” flu)
• Massive social networks: understanding communities, intentions, population dynamics, pandemic spread, transportation and evacuation
• Intelligence: business analytics, anomaly detection, security, knowledge discovery from massive data sets
• Systems biology: understanding complex life systems, drug design, microbial research, unraveling the mysteries of the HIV virus; understanding life and disease
• Electric power grid: communication, transportation, energy, water, and food supply
• Modeling and simulation: performing full-scale economic-social-political simulations
[Chart: Facebook active users in millions, Dec 2004 – Dec 2009 (y-axis 0–450)]
Exponential growth: more than 900 million active users
Sample queries:
• Allegiance switching: identify entities that switch communities.
• Community structure: identify the genesis and dissipation of communities.
• Phase change: identify significant changes in the network structure.
These queries require predicting and influencing change in real time, at scale.
Example: we discovered minimal changes in a complex network of O(billions) size that could hide or reveal top influencers in the community.
Ubiquitous High Performance Computing (UHPC)
Goal: develop highly parallel, security-enabled, power-efficient processing systems, supporting ease of programming, with resilient execution through all failure modes and intrusion attacks.
Program Objectives:
• One PFLOPS in a single cabinet, including self-contained cooling
• 50 GFLOPS/W (equivalent to 20 pJ/FLOP)
• Total cabinet power budget of 57 kW, including processing resources, storage, and cooling
• Security embedded at all system levels
• Parallel, efficient execution models
• Highly programmable parallel systems
• Scalable systems – from terascale to petascale
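As a consistency check (my arithmetic, using only the figures above), the two energy targets are the same number in different units, and together with the PFLOPS goal they imply the cabinet budget is attainable:

\[
\frac{1\,\mathrm{W}}{50\,\mathrm{GFLOPS}} = \frac{1\,\mathrm{J/s}}{5\times 10^{10}\,\mathrm{FLOP/s}} = 20\,\mathrm{pJ/FLOP},
\qquad
10^{15}\,\mathrm{FLOP/s} \times 20\,\mathrm{pJ/FLOP} = 20\,\mathrm{kW},
\]

so one PFLOPS of compute at 50 GFLOPS/W draws 20 kW, leaving the remainder of the 57 kW budget for storage, network, and cooling.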
Architectural Drivers:
• Energy efficiency
• Security and dependability
• Programmability
Echelon: Extreme-scale Compute Hierarchies with Efficient Locality-Optimized Nodes
“NVIDIA-Led Team Receives $25 Million Contract From DARPA to Develop High-Performance GPU Computing Systems” -MarketWatch
David A. Bader (CSE), Echelon Leadership Team
Example: Mining Twitter for Social Good
ICPP 2010
Image credit: bioethicsinstitute.org
• CDC / nation-scale surveillance of public health
• Cancer genomics and drug design
– computed betweenness centrality of the Human Proteome
[Plot: Human Genome core protein interactions, degree (1–100, log scale) vs. betweenness centrality (1e-7–1e+0, log scale)]
Massive Data Analytics: Protecting our Nation
US High Voltage Transmission Grid (>150,000 miles of line)
Public Health
ENSG00000145332.2: Kelch-like protein implicated in breast cancer
Graphs are pervasive in large-scale data analysis
• Sources of massive data: petascale simulations, experimental devices, the Internet, scientific applications.
• New challenges for analysis: data sizes, heterogeneity, uncertainty, data quality.
• Astrophysics. Problem: outlier detection. Challenges: massive datasets, temporal variations. Graph problems: clustering, matching.
• Bioinformatics. Problem: identifying drug target proteins. Challenges: data heterogeneity, quality. Graph problems: centrality, clustering.
• Social informatics. Problem: discover emergent communities, model the spread of information. Challenges: new analytics routines, uncertainty in data. Graph problems: clustering, shortest paths, flows.
Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
Flywheel has driven HPC into a corner
For decades, HPC has been on a vicious cycle of enabling applications that run well on HPC systems.
Real-World Applications (with natural parallelism)
Data-intensive computing has natural concurrency; yet HPC architectures are designed to achieve high floating-point rates by exploiting spatial and temporal localities.
• For the first time in decades, manycore and multithreaded systems let us rethink architecture.
These are today’s MPI applications!
Graph500 Benchmark, www.graph500.org
Defining a new set of benchmarks to guide the design of hardware architectures and software systems intended to support such applications, and to help procurements. Graph algorithms are a core part of many analytics workloads. Executive Committee: D.A. Bader, R. Murphy, M. Snir, A. Lumsdaine.
• Five Business Area Data Sets:
– Cybersecurity: 15 billion log entries/day (for large enterprises); full data scan with end-to-end join required
– Medical informatics: 50M patient records, 20–200 records/patient, billions of individuals; entity resolution is important
– Social networks: e.g. Facebook, Twitter; nearly unbounded dataset size
– Data enrichment: easily PB of data; example: Maritime Domain Awareness (hundreds of millions of transponders, tens of thousands of cargo ships, tens of millions of pieces of bulk cargo; may involve additional data such as images)
– Symbolic networks: e.g. the human brain; 25B neurons, 7,000+ connections/neuron
Breadth-first search comparison: Graph500
• Graph500 benchmark: BFS on a synthetic graph of a social network
• 2^Size vertices and 16 × 2^Size edges
• Processing rate: traversed edges per second (TEPS)
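To make the sizing and the metric concrete, here is a minimal sketch in C (my illustration, not the official Graph500 reference code); the scale value matches the mirasol entry below, while the BFS time and traversed-edge count are placeholders:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    int scale = 28;                            /* problem scale ("Size")    */
    int64_t nvertices = INT64_C(1) << scale;   /* 2^Size vertices           */
    int64_t nedges    = INT64_C(16) << scale;  /* 16 * 2^Size edges         */

    double bfs_seconds = 0.85;      /* placeholder: measured BFS time       */
    int64_t traversed  = nedges;    /* placeholder: edges traversed by BFS  */

    double teps = (double)traversed / bfs_seconds;  /* the reported metric */
    printf("scale %d: %lld vertices, %lld edges, %.3g TEPS\n",
           scale, (long long)nvertices, (long long)nedges, teps);
    return 0;
}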
Rank | Machine | Owner | Size | TEPS
1 | NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores) | NNSA and IBM Research, T.J. Watson | 32 | 254.3 B
2 | Hopper XE6 (1800 nodes / 43,200 cores) | LBNL | 37 | 113.3 B
3 | Lomonosov (4096 nodes / 32,768 cores) | Moscow State University | 37 | 103.2 B
17 | mirasol (quad-socket Intel E7-8870, 2.4 GHz, 10 cores / 20 threads per socket) | Intel/Georgia Tech (UCB implementation) | 28 | 5.1 B
20 | Dingus (Convey HC1ex, 4 FPGA) | SNL | 28 | 1.7 B
35 | Matterhorn (64-proc XMT2) | CSCS | 31 | 884.6 M
Massive Streaming Graph Analytics
Example edge stream flowing from streaming graphs to analysts:
(A, B, t1, poke), (A, C, t2, msg), (A, D, t3, view wall), (A, D, t4, post),
(B, A, t2, poke), (B, A, t3, view wall), (B, A, t4, msg)
STINGER: A Data Structure for Graphs with Streaming Updates
Light-weight data structure that supports efficient iteration and efficient updates.
Experiments with Streaming Updates to Clustering Coefficients
Working with bulk updates, it can handle almost 200,000 updates per second.
Tracking Connected Components; Tracking Betweenness Centrality
STING Extensible Representation (STINGER)
An enhanced representation for dynamic graphs, developed in consultation with David A. Bader, Jonathan Berry, Adam Amos-Binks, Daniel Chavarría-Miranda, Charles Hastings, Kamesh Madduri, and Steven C. Poulos.
Design goals:
• Be useful for the entire “large graph” community: portable semantics and high-level optimizations across multiple platforms & frameworks (XMT C, MTGL, etc.)
• Permit good performance: no single structure is optimal for all
• Assume globally addressable memory access
• Support multiple, parallel readers and a single writer
Operations:
• Insert/update & delete both vertices & edges
• Aging-off: remove old edges (by timestamp)
• Serialization to support checkpointing, etc.
STING Extensible Representation:
• Semi-dense edge list blocks with free space
• Compactly stores timestamps, types, weights
• Maps from application IDs to storage IDs
• Deletion by negating IDs, separate compaction
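A minimal sketch in C of how such an edge-block layout might be declared; the field names and the block size are my illustrative assumptions, not STINGER's actual definitions:

#include <stdint.h>

#define EDGES_PER_BLOCK 16   /* illustrative: sized so a block fits cache lines */

/* One edge: the neighbor ID is negated to mark deletion, and a separate
 * compaction pass reclaims the slot later. */
struct edge {
    int64_t dest;          /* storage ID of the neighbor; < 0 means deleted */
    int64_t weight;
    int64_t time_first;    /* timestamps enable aging-off old edges */
    int64_t time_recent;
};

/* Semi-dense block in a per-vertex linked list; the free space lets a
 * writer insert without reallocating while readers iterate in parallel. */
struct edge_block {
    struct edge_block *next;
    int64_t etype;         /* edge type shared by this block */
    int64_t high;          /* slots in use (including negated entries) */
    struct edge edges[EDGES_PER_BLOCK];
};

A separate table (not shown) would map application vertex IDs to the storage IDs used in dest.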
Goal: Streaming Graph Analytics
• O(Billion) vertices, O(Trillion) edges, 1M updates / sec
• Challenge: maintain analytics, update quickly
– Connected components
– Agglomerative clustering
• Problem:
– Irregular execution
– Bandwidth- and latency-bound
– Irregular memory access
– Low-to-no reuse
Case Study 1: Clustering Coefficients in a Streaming Graph
• Roughly, the ratio of actual triangles to possible triangles around a vertex.
• Defined in terms of triplets:
– i-j-v is a closed triplet (triangle).
– m-v-n is an open triplet.
• Clustering coefficient: # closed triplets / # all triplets.
– Locally, count those around v.
– Globally, count across the entire graph; multiple counting cancels (3/3 = 1).
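In symbols (a standard formulation consistent with the counts above; the notation is mine):

\[
C_v = \frac{\#\,\text{closed triplets centered at } v}{\#\,\text{triplets centered at } v} = \frac{2\,T(v)}{d_v (d_v - 1)},
\qquad
C = \frac{3 \times \#\,\text{triangles}}{\#\,\text{triplets}},
\]

where T(v) is the number of triangles containing v and d_v is the degree of v. The factor of 3 in the global coefficient is exactly the “multiple counting cancels (3/3 = 1)” remark: each triangle is counted once per corner.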
Updating clustering coefficients in a streaming graph
• Monitoring clustering coefficients could identify anomalies, find forming communities, etc.
• Computations stay local: a change to edge <u, v> affects only vertices u, v, and their neighbors.
• Need a fast method for updating the triangle counts and degrees when an edge is inserted or deleted:
– Dynamic data structure for edges & degrees: STINGER
– Rapid triangle-count update algorithms, exact and approximate (see the sketch below)
“Massive Streaming Data Analytics: A Case Study with Clustering Coefficients.” Ediger, David, Karl Jiang, E. Jason Riedy, and David A. Bader. MTAAP 2010, Atlanta, GA, April 2010.
[Diagram: inserting edge <u, v> increments the triangle counts of u and v by the number of their common neighbors (+2 each in the example) and the count of each common neighbor by +1]
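A minimal sketch of the exact update in C, assuming sorted adjacency lists (my illustration of the update rule, not the paper's implementation):

#include <stdio.h>
#include <stddef.h>

/* Merge two sorted adjacency lists to find common neighbors of u and v.
 * On inserting <u,v>, each common neighbor w gains one triangle (+1), and
 * u and v each gain the total count; deletions use sign = -1 instead. */
static size_t update_triangles(const int *adj_u, size_t nu,
                               const int *adj_v, size_t nv,
                               long *tri, int sign) {
    size_t i = 0, j = 0, common = 0;
    while (i < nu && j < nv) {
        if (adj_u[i] < adj_v[j]) i++;
        else if (adj_u[i] > adj_v[j]) j++;
        else { tri[adj_u[i]] += sign; common++; i++; j++; }
    }
    return common;   /* caller adds sign * common to tri[u] and tri[v] */
}

int main(void) {
    /* u = 0 with neighbors {1,2,5}; v = 4 with neighbors {2,3,5}. */
    int adj_u[] = {1, 2, 5}, adj_v[] = {2, 3, 5};
    long tri[6] = {0};
    long c = (long)update_triangles(adj_u, 3, adj_v, 3, tri, +1);
    tri[0] += c; tri[4] += c;
    printf("common neighbors: %ld\n", c);   /* prints 2, as in the diagram */
    return 0;
}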
Updating clustering coefficients in a streaming graph
• Using R-MAT as a graph and edge-stream generator
– Mix of insertions and deletions
• Result summary for single actions:
– Exact: from 8 to 618 actions/second
– Approximate: from 11 to 640 actions/second
• Alternative: batch changes
– Loses some temporal resolution within the batch
– Median update rates for batches of size B are shown below
• STINGER overhead is minimal; most time is spent in the metric computation.
Algorithm | B = 1 | B = 1000 | B = 4000
Exact | 90 | 25,100 | 50,100
Approx. | 60 | 83,700 | 193,300
(updates per second)
Case Study 2: Tracking Connected Components in a Streaming Graph
• Why? Interested in:
– small components over long periods of time
– communities that split off from the main group
• Global property:
– track the <vertex, component> mapping
– undirected, unweighted graphs
• Scale-free network: most changes are within one large component
• Inserting a new edge (see the sketch below):
– if the endpoints are differently labeled, re-label the smaller component
• Deleting an edge:
– must prove that the deletion does not cleave a component
– otherwise, recompute!
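A minimal sketch in C of the insertion rule, assuming a component-label array and per-label sizes (illustrative; a real implementation relabels only the smaller component's vertex list rather than scanning all vertices):

#include <stdio.h>

#define NV 8
static int  comp[NV];     /* comp[v]: component label of vertex v   */
static long csize[NV];    /* csize[c]: number of vertices labeled c */

static void insert_edge(int u, int v) {
    int cu = comp[u], cv = comp[v];
    if (cu == cv) return;                          /* already connected */
    int keep = (csize[cu] >= csize[cv]) ? cu : cv; /* keep the larger   */
    int drop = (keep == cu) ? cv : cu;
    for (int w = 0; w < NV; w++)                   /* re-label smaller  */
        if (comp[w] == drop) { comp[w] = keep; csize[keep]++; csize[drop]--; }
}

int main(void) {
    for (int v = 0; v < NV; v++) { comp[v] = v; csize[v] = 1; }
    insert_edge(0, 1);
    insert_edge(1, 2);
    printf("comp[2] = %d\n", comp[2]);   /* prints 0: all three merged */
    return 0;
}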
Tracking Connected Components: Performance on 32 of 128 Processors of a Cray XMT
[Plots: update rates for |V| = 1M, |E| = 8M and for |V| = 16M, |E| = 135M]
• Insertions-only improvement: 10x–20x over recomputation
• Greater parallelism within a batch yields higher performance
• Trade-off between time resolution and update speed
“Tracking Structure of Streaming Social Networks,” David Ediger, Jason Riedy, David A. Bader, and Henning Meyerhenke. MTAAP 2011, Anchorage, AK, May 2011.
Tracking Connected Components: Faster Deletions
• Finding triangles rules out 90% of deletions as “safe”
• Introduce a spanning tree to easily track deletions
• A deletion is “safe” (cannot cleave a component) if any of the following holds:
– the edge is not in the spanning tree
– the edge endpoints have a common neighbor
– a neighbor can still reach the root of the tree
[Diagram: a spanning tree over a small graph, distinguishing spanning tree edges from non-tree edges; the root vertex is shown in black and X marks the deleted edge]
• This method rules out 99.7% of deletions in an R-MAT graph with 2M vertices and 32M edges, and is often faster than triangle finding. A sketch of the safety test follows.
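A minimal sketch in C of the first two safety tests over a tiny hard-coded graph (the third test, re-routing a neighbor to the root, is omitted for brevity; all names and data here are illustrative):

#include <stdbool.h>
#include <stdio.h>

#define NV 5
static bool adj[NV][NV];   /* adjacency matrix of a tiny example graph */
static int  parent[NV];    /* spanning tree: parent[root] == root      */

static bool in_spanning_tree(int u, int v) {
    return parent[u] == v || parent[v] == u;
}
static bool have_common_neighbor(int u, int v) {
    for (int w = 0; w < NV; w++)
        if (w != u && w != v && adj[u][w] && adj[v][w]) return true;
    return false;
}
/* True when deleting <u,v> provably cannot split a component. */
static bool deletion_is_safe(int u, int v) {
    if (!in_spanning_tree(u, v)) return true;    /* tree still spans      */
    if (have_common_neighbor(u, v)) return true; /* a triangle reconnects */
    return false;                                /* must recompute        */
}

int main(void) {
    /* Triangle 0-1-2 plus pendant edge 2-3; tree edges: 0-1, 0-2, 2-3. */
    adj[0][1] = adj[1][0] = adj[0][2] = adj[2][0] = true;
    adj[1][2] = adj[2][1] = adj[2][3] = adj[3][2] = true;
    parent[0] = 0; parent[1] = 0; parent[2] = 0; parent[3] = 2;
    printf("delete 1-2 safe? %d\n", deletion_is_safe(1, 2)); /* 1: non-tree edge      */
    printf("delete 0-1 safe? %d\n", deletion_is_safe(0, 1)); /* 1: common neighbor 2  */
    printf("delete 2-3 safe? %d\n", deletion_is_safe(2, 3)); /* 0: must recompute     */
    return 0;
}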
Case Study 3: Tracking Betweenness Centrality (BC) in a Streaming Graph
• Key metric in social network analysis
[Freeman ’77, Goh ’02, Newman ’03, Brandes ’03]
• σ_st: number of shortest paths between vertices s and t
• σ_st(v): number of shortest paths between s and t passing through v

\[
BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}
\]
• Streaming updates:
– reuse previously computed traversals to reduce computational requirements
– avoid redundant computations
Updating BC for Edge Insertions in a Streaming Graph
• Currently, memory requirements limit problem size
• Speedup measured as

\[
\text{Speedup} = \frac{T_{\text{static}}}{T_{\text{insertion}}}
\]
Collaboration network | Vertices | Edges | Speedup
Astrophysics | 18,772 | 198,080 | 148x
Condensed matter | 23,133 | 93,468 | 91x
General relativity | 5,242 | 14,490 | 40x
High energy physics | 12,008 | 118,505 | 108x
High energy physics theory | 9,877 | 51,970 | 36x
Updating BC in a Streaming Graph: Experimental Results for Insertions-only
• 3.4 GHz Intel Core i7-2600K (16 GB RAM)
[Plot: R-MAT graph speedup (0–160x) vs. edge factor (average edge degree, 8–32), for scales 10 through 14]
STREAMING ON HYBRID SYSTEMS
Sandy Bridge: Intel Core i7-2600K
CPU:
• Clock 3.4 GHz (3.8 GHz peak)
• 4 cores with 2 threads each
• 4 x 1 MB L2, 8 MB L3
• 21 GB/s memory bandwidth
• 32 nm technology
• AVX (256-bit vectors)
• ~83 GFLOPS
• 95 W Thermal Design Power (TDP) for CPU + GPU
GPU:
• Clock 860 MHz (1350 MHz peak)
• Can use the L3 cache
• 12 execution units
Ivy Bridge: Intel Core i7-3770K
CPU:
• Clock 3.5 GHz (3.9 GHz peak)
• 4 cores with 2 threads each
• 4 x 1 MB L2, 8 MB L3
• 25.6 GB/s memory bandwidth
• 22 nm technology
• AVX (256-bit vectors)
• ~93 GFLOPS
• 77 W TDP for CPU + GPU
GPU:
• Clock 650 MHz (1150 MHz peak)
• Can use the L3 cache
• 16 execution units
Llano: AMD A8-3870K
CPU:
• Clock 3.0 GHz
• 4 cores
• 4 x 1 MB L2, no L3
• 21 GB/s memory bandwidth
• 32 nm technology
• ~60 GFLOPS
• 100 W TDP for CPU + GPU
GPU:
• Clock 600 MHz
• 400 Radeon cores
• 480 GFLOPS peak
CPU Cores vs. GPU Cores
CPU:
• Independent threads with vector units
• Up to 20 threads per socket
• Up to 10 cores per socket
• ~3 GHz; high power per core
• Branch prediction, out-of-order execution, register renaming
• ~100s of GFLOPS
GPU:
• SIMT threads with multiple execution units
• 1,000s of threads per socket
• Up to 1,000s of cores
• ~1 GHz; low power per core
• Simple execution pipeline
• Higher aggregate memory bandwidth (5x over CPU)
• ~1,000s of GFLOPS
NVIDIA Echelon Chip (DARPA UHPC)
Throughput-optimized cores:
• Clock 2 GHz (2.5 GHz peak)
• 256 TOCs per chip
• 8 lanes per TOC
• 64 x 4 MB L2
• 1600 GB/s local DRAM
• 400 GB/s remote DRAM
• 16 TFLOPS
• 10 nm technology
• 270 W per chip
Latency-optimized cores:
• Clock 3 GHz
• 8 per chip
• 8 x 256 KB L2
DARPA UHPC Challenge Problem (CP) 2
• Streaming graph analytics problem
• Designed to simulate real-time tracking and analysis of a dynamic social network graph
• Primary metric is updates per second
• Constraint: analytics must be updated every 100,000 updates
Deployment Scenario
• Single-cabinet system in a small server room
– multi-terabyte global memory
– multiple connected Echelon chips
– millions of threads
• Tracking a graph of billions of vertices and edges
– multiple edge and vertex types
– relevant temporal data per edge and vertex
– concurrent analytics kernels and graph updates
– millions of updates per second
• Applications
– tracking network connections for cyber security
– assembling / processing a semantic graph of input sensor data
– social networks: insider threat detection, intelligence applications
Problem Overview
• Initialization
– Kernel 0: generate / load a synthetic social network (R-MAT generated graph of scale 32, edge factor 8)
– Kernel 1: build the dynamic graph data structure (create and maintain the STINGER dynamic temporal graph structure)
– Kernel 2: connected components (track connected components within the graph)
– Kernel 3: clustering (track communities via agglomerative clustering, optimizing modularity)
• Streaming (K1, K2, K3)
– Update the analytics & data structure in batches of 100,000 actions
– Measure the rate at which batches can be applied (updates per second); a sketch of this measurement loop follows
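A minimal sketch in C of the measurement loop; the batch size is from the slide, while the kernel stubs and everything else are my illustrative assumptions:

#include <stdio.h>
#include <time.h>

#define BATCH 100000   /* analytics must be refreshed every 100,000 actions */

/* Stubs standing in for the real kernels. */
static void apply_batch_to_stinger(int b) { (void)b; } /* K1: ingest actions */
static void update_components(void) {}                 /* K2 */
static void update_clustering(void) {}                 /* K3 */

int main(void) {
    int nbatches = 100;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int b = 0; b < nbatches; b++) {
        apply_batch_to_stinger(b);
        update_components();    /* refresh analytics after every batch */
        update_clustering();
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.3g updates/sec\n", nbatches * (double)BATCH / secs);
    return 0;
}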
General Observations
• Memory
– Off-chip and DRAM bandwidth and latency will be the performance-limiting factors
– Low compute-to-memory ratio
– Random memory access to a large global data structure
– Edge blocks can be made to fit memory lines, but accessing single random words cannot be avoided
• Compute
– Irregular computation and workload
– Minimal floating point; mostly integer computation
– Large sorts and set intersections
– Pointer jumping
• Parallelism
– Fine grain and coarse grain
– Tasks, loops, scans, and reductions
– In the data structure (per edge, per vertex, within edge blocks)
– Need low thread overhead and fast fine-grain synchronization
Future Architectures
• Highly multithreaded
• High bandwidth (network and memory)
• Complex but flexible memory hierarchy
• Heterogeneous design in core capability and ISA
Heterogeneous Design
• Different cores to accomplish different tasks
• Latency-optimized cores (LOCs)
– for branchy, serial code and the OS / runtime
– fast clock (3 GHz+), higher energy (½ CPU core)
– out-of-order execution, branch prediction
• Throughput-optimized cores (TOCs)
– for highly parallel code, the majority of the workload
– slower clock (≤ 2 GHz), lower energy (1/10 GPU core)
– simple cores with many hardware thread contexts
Case Study: Clustering
• Compute on the community graph (initially the graph itself)
• Reduce edge and vertex weights into the community-graph volume
• Compute edge scores, then reduce to find their mean and variance:
– score(u,v) = weight(u,v) / (2 · volume) − deg(u) · deg(v) / (2 · volume)^2
– Variance = Variance + (score(u,v) − mean)^2
• Parallel for over edges to filter and match:
– if score(u,v) > mean − 1.5 · √Variance AND score(u,v) > best[u], then set match[u] = v and best[u] = score(u,v)
• Refine the matching
• Contract the community graph around the matching
• Repeat from the second step until no edge scores pass the filter
(A sketch of the score-and-match step appears below.)
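A minimal serial sketch in C of the score-and-match step over an edge list (the parallel-for becomes an ordinary loop here; the graph data and array names are my illustrative assumptions):

#include <math.h>
#include <stdio.h>

#define NV 4
#define NE 4

/* Tiny made-up (community) graph as an edge list with weights. */
static int    eu[NE] = {0, 0, 1, 2}, ev[NE] = {1, 2, 2, 3};
static double w[NE]  = {3, 1, 2, 1};
static double deg[NV] = {4, 5, 4, 1};   /* weighted vertex degrees */

int main(void) {
    double volume = 0;
    for (int e = 0; e < NE; e++) volume += w[e];

    /* Edge scores from the slide's formula, plus their mean and variance. */
    double score[NE], mean = 0, var = 0;
    for (int e = 0; e < NE; e++) {
        score[e] = w[e] / (2 * volume)
                 - deg[eu[e]] * deg[ev[e]] / ((2 * volume) * (2 * volume));
        mean += score[e] / NE;
    }
    for (int e = 0; e < NE; e++) var += (score[e] - mean) * (score[e] - mean);

    /* Filter and match: keep the best passing edge per vertex. */
    int match[NV]; double best[NV];
    for (int v = 0; v < NV; v++) { match[v] = -1; best[v] = -INFINITY; }
    double cutoff = mean - 1.5 * sqrt(var);
    for (int e = 0; e < NE; e++)
        if (score[e] > cutoff && score[e] > best[eu[e]]) {
            match[eu[e]] = ev[e];
            best[eu[e]]  = score[e];
        }
    for (int v = 0; v < NV; v++) printf("match[%d] = %d\n", v, match[v]);
    return 0;   /* a full clustering would refine, contract, and repeat */
}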
Case Study: Clustering
• Goal: find best score per vertex
• CPU strategy:
– OpenMP parallel for over vertices
– serially search adjacency lists for the best score
• GPU strategy:
– blocks: parallel for over vertices
– threads in a block: reduce the max score over the adjacency list
Case Study: Clustering
• Goal: find best score per vertex
• Hybrid strategy (see the sketch below):
– parallel per vertex on the CPU
– if low degree, serially determine the max score on the CPU
– if medium degree, use GPU thread blocks
– if large degree, the CPU divides edge blocks among GPU thread blocks
• Overcomes workload imbalance
• Provides parallelism
• Tunable to reduce the impact of overhead
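A minimal sketch in C of the host-side, degree-based dispatch; the thresholds and the three handlers are assumptions standing in for the serial CPU scan and the two GPU kernel launches:

#include <stdio.h>

#define LOW_DEGREE    32     /* illustrative cutoffs between strategies */
#define MEDIUM_DEGREE 1024

/* Stubs for: serial CPU scan, one GPU thread block per vertex, and
 * CPU-split edge blocks across many GPU thread blocks. */
static void cpu_best_score(int v)       { printf("v%d: CPU serial\n", v); }
static void gpu_block_best_score(int v) { printf("v%d: GPU block\n", v); }
static void gpu_split_best_score(int v) { printf("v%d: GPU split\n", v); }

static void dispatch(int v, long degree) {
    if (degree <= LOW_DEGREE)         cpu_best_score(v);
    else if (degree <= MEDIUM_DEGREE) gpu_block_best_score(v);
    else                              gpu_split_best_score(v);
}

int main(void) {
    long degree[] = {8, 500, 50000};   /* made-up vertex degrees */
    for (int v = 0; v < 3; v++) dispatch(v, degree[v]);
    return 0;
}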
Heterogeneous Exploration
• Analysis and updates run on TOCs
• Management, load balancing, and serial bottlenecks run on LOCs
• Traversal uses lightweight threading and shared memories
– the graph is stored in a blocked, linked adjacency structure split across system global memory
– TOC threads cooperatively copy blocks from global to shared memory and explore them in parallel
[Diagram: edge blocks of (vertex | weight) entries with next pointers, copied from system global memory into a local L2 scratchpad and explored in parallel by multiple TOCs]
CHALLENGES
Graph Analytics for Social Networks
• Are there new graph techniques? Do they parallelize? How do we exploit heterogeneous platforms? Can the computational systems (algorithms, machines) handle massive networks with millions to billions of individuals? Can the techniques tolerate noisy data, massive data, streaming data, etc.?
• Communities may overlap, exhibit different properties and sizes, and be driven by different models
– Detect communities (static or emerging)
– Identify important individuals
– Detect anomalous behavior
– Given a community, find a representative member of the community
– Given a set of individuals, find the best community that includes them
Open Questions for Massive Data Analytic Apps
• How do we diagnose the health of streaming systems?
• Are there new analytics for massive spatio-temporal interaction networks and graphs (STING)?
• Do current methods scale up from thousands to millions and billions?
• How do I model massive streaming data?
• Are algorithms resilient to noisy data?
• How do I visualize a STING with O(1M) entities? O(1B)? O(100B)? With a scale-free, power-law distribution of vertex degrees and diameter ≈ 6?
• Can accelerators aid in processing streaming graph data?
• How do we leverage the benefits of multiple architectures (e.g. map-reduce clouds and massively multithreaded architectures) in a single platform?
New opportunities for Big Data
• Don’t let programming paradigms limit your imagination of what’s possible!
• In the HPC community, Scientific Computing ≡ Message-Passing Programming (MPI)
– some grand challenge applications do very well and have led to numerous scientific insights
– other applications are impossible to solve efficiently
• For the Big Data community, Hadoop / MapReduce
– is great for some statistical analyses
– is problematic for deep insights
New opportunities using heterogeneous systems for Big Data problems
Collaborators and Acknowledgments
• Jason Riedy, Research Scientist (Georgia Tech)
• Graduate students (Georgia Tech):
– David Ediger
– Oded Green
– Rob McColl
– Pushkar Pande
– Anita Zakrzewska
– Karl Jiang
• Bader PhD graduates:
– Seunghwa Kang (Pacific Northwest National Lab)
– Kamesh Madduri (Penn State)
– Guojing Cong (IBM T.J. Watson Research Center)
• John Feo and Daniel Chavarría-Miranda (Pacific Northwest National Lab)
Bader, Related Recent Publications (2005-2008)
• D.A. Bader, G. Cong, and J. Feo, “On the Architectural Requirements for Efficient Execution of Graph Algorithms,” The 34th International Conference on Parallel Processing (ICPP 2005), pp. 547-556, Georg Sverdrups House, University of Oslo, Norway, June 14-17, 2005.
• D.A. Bader and K. Madduri, “Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors,” The 12th International Conference on High Performance Computing (HiPC 2005), D.A. Bader et al., (eds.), Springer-Verlag LNCS 3769, 465-476, Goa, India, December 2005.
• D.A. Bader and K. Madduri, “Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2,” The 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.
• D.A. Bader and K. Madduri, “Parallel Algorithms for Evaluating Centrality Indices in Real-world Networks,” The 35th International Conference on Parallel Processing (ICPP 2006), Columbus, OH, August 14-18, 2006.
• K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “Parallel Shortest Path Algorithms for Solving Large-Scale Instances,” 9th DIMACS Implementation Challenge -- The Shortest Path Problem, DIMACS Center, Rutgers University, Piscataway, NJ, November 13-14, 2006.
• K. Madduri, D.A. Bader, J.W. Berry, and J.R. Crobak, “An Experimental Study of A Parallel Shortest Path Algorithm for Solving Large-Scale Graph Instances,” Workshop on Algorithm Engineering and Experiments (ALENEX), New Orleans, LA, January 6, 2007.
• J.R. Crobak, J.W. Berry, K. Madduri, and D.A. Bader, “Advanced Shortest Path Algorithms on a Massively-Multithreaded Architecture,” First Workshop on Multithreaded Architectures and Applications (MTAAP), Long Beach, CA, March 30, 2007.
• D.A. Bader and K. Madduri, “High-Performance Combinatorial Techniques for Analyzing Massive Dynamic Interaction Networks,” DIMACS Workshop on Computational Methods for Dynamic Interaction Networks, DIMACS Center, Rutgers University, Piscataway, NJ, September 24-25, 2007.
• D.A. Bader, S. Kintali, K. Madduri, and M. Mihail, “Approximating Betweenness Centrality,” The 5th Workshop on Algorithms and Models for the Web-Graph (WAW2007), San Diego, CA, December 11-12, 2007.
• David A. Bader, Kamesh Madduri, Guojing Cong, and John Feo, “Design of Multithreaded Algorithms for Combinatorial Problems,” in S. Rajasekaran and J. Reif, editors, Handbook of Parallel Computing: Models, Algorithms, and Applications, CRC Press, Chapter 31, 2007.
• Kamesh Madduri, David A. Bader, Jonathan W. Berry, Joseph R. Crobak, and Bruce A. Hendrickson, “Multithreaded Algorithms for Processing Massive Graphs,” in D.A. Bader, editor, Petascale Computing: Algorithms and Applications, Chapman & Hall / CRC Press, Chapter 12, 2007.
• D.A. Bader and K. Madduri, “SNAP, Small-world Network Analysis and Partitioning: an open-source parallel graph framework for the exploration of large-scale networks,” 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), Miami, FL, April 14-18, 2008.
Bader, Related Recent Publications (2009-2010)• S. Kang, D.A. Bader, “An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest
of Sparse Graphs,” 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Raleigh, NC, February 2009.
• Karl Jiang, David Ediger, and David A. Bader. “Generalizing k-Betweenness Centrality Using Short Paths and a Parallel Multithreaded Implementation.” The 38th International Conference on Parallel Processing (ICPP), Vienna, Austria, September 2009.
• Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, Daniel Chavarría-Miranda. “A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets.” 3rd Workshop on Multithreaded Architectures and Applications (MTAAP), Rome, Italy, May 2009.
• David A. Bader, et al. “STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation.” 2009.
• David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. “Massive Streaming Data Analytics: A Case Study with Clustering Coefficients,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.
• Seunghwa Kang, David A. Bader. “Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System,” Fourth Workshop in Multithreaded Architectures and Applications (MTAAP), Atlanta, GA, April 2010.
• David Ediger, Karl Jiang, Jason Riedy, David A. Bader, Courtney Corley, Rob Farber and William N. Reynolds. “Massive Social Network Analysis: Mining Twitter for Social Good,” The 39th International Conference on Parallel Processing (ICPP 2010), San Diego, CA, September 2010.
• Virat Agarwal, Fabrizio Petrini, Davide Pasetto and David A. Bader. “Scalable Graph Exploration on Multicore Processors,” The 22nd IEEE and ACM Supercomputing Conference (SC10), New Orleans, LA, November 2010.
Acknowledgment of Support