
Using R for HPC Data Science

Session: Parallel Programming Paradigms

George Ostrouchov, Oak Ridge National Laboratory and University of Tennessee

and pbdR Core Team

Course at IT4Innovations, Ostrava, October 6-7, 2016


Outline

Brief introduction to parallel hardware and software

Parallel programming paradigms

Shared memory vs. distributed memory
Manager-workers and fork-join
MapReduce
SPMD - single program, multiple data
Data-flow


Three Basic Flavors of Hardware

[Figure: three flavors of parallel hardware: Distributed Memory (processors, each with cache and its own memory, connected by an interconnection network), Shared Memory (cores with caches sharing a single memory), and a Co-Processor attached over local memory (GPU: Graphics Processing Unit; MIC: Many Integrated Core).]

Your Laptop or Desktop

[Figure: the same hardware-flavors diagram, shown again in the context of a laptop or desktop.]

Server to Cluster to Supercomputer

[Figure: the same hardware-flavors diagram, shown again for a server, cluster, or supercomputer.]

“Native” Programming Models and Tools

[Figure: native programming models and tools mapped onto the three hardware flavors.]

Distributed memory: default is parallel; what is my data and what do I need from others? Tools: Sockets, SPMD (MPI), MapReduce (shuffle).

Shared memory: default is serial; which tasks can the compiler make parallel? Tools: OpenMP, Pthreads, fork.

Co-Processor (GPU, MIC): offload data and tasks; we are slow but many! Tools: CUDA, OpenCL, OpenACC, OpenMP.

30+ Years of Parallel Computing Research

[Figure: the same diagram of native models and tools, representing 30+ years of parallel computing research.]

Last 10 years of Advances

[Figure: the same diagram of native models and tools, highlighting the advances of the last 10 years.]

Distributed Programming Works in Shared Memory

[Figure: the same diagram; the check and cross marks indicate that distributed-memory programming (Sockets, SPMD/MPI, MapReduce) also works within a shared-memory machine, while shared-memory tools do not extend across distributed memory.]

R Interfaces to Low-Level Native Tools

[Figure: R interfaces to the low-level native tools, overlaid on the hardware diagram.]

Distributed memory (Sockets, MPI, MapReduce): snow, Rmpi, Rhpc, pbdMPI, RHadoop, SparkR.

Shared memory (fork): multicore; snow + multicore = parallel.

Foreign language interfaces (for native and co-processor code): .C, .Call, Rcpp, OpenCL, inline, ...

Some packages in R for parallel computing

parallel: multicore + snow

multicore: an interface to Unix fork (no Windows)

snow: simple network of workstations

pbdMPI, pbdDMAT, and other pbd packages: use HPC concepts, simplify, and use scalable libraries

foreach, doParallel: interface that hides the hardware reality, can be difficult to debug

Rmpi: simplified with pbdMPI for SPMD

RHadoop, RHIPE: need HDFS, slow because file-backed

datadr: divide-recombine, currently MapReduce/Hadoop back end

SparkR: in-memory, needs HDFS, limited to Shuffle; MPI is generally faster and more flexible
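As a concrete illustration of the first entries, here is a minimal sketch using only the parallel package; the worker count and the toy task are illustrative assumptions, not part of the slides.

## parallel = multicore + snow: the same task both ways.
## mclapply forks the R process (Unix only); parLapply uses a
## snow-style socket cluster of worker processes (works on Windows too).
library(parallel)

f <- function(i) sum(sqrt(seq_len(i)))      # a small CPU-bound toy task

res.fork <- mclapply(1:8, f, mc.cores = 2)  # multicore path: fork

cl <- makeCluster(2)                        # snow path: two socket workers
res.snow <- parLapply(cl, 1:8, f)
stopCluster(cl)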


Manager-Workers

1 A serial program (Manager) divides up work and/or data

2 Manager sends work (and data) to workers

3 Workers run in parallel without interaction

4 Manager collects/combines results from workers

Divide-Recombine fits this model

Concept appears similar to interactive use and to client-server (a minimal sketch follows)
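A hedged sketch of the four steps above using the parallel package; the worker count, the data, and the summation task are illustrative assumptions.

## Manager-workers with the 'parallel' package.
library(parallel)

cl <- makeCluster(4)                            # manager starts 4 workers
chunks <- clusterSplit(cl, 1:1e6)               # 1. divide up the data
partial <- parLapply(cl, chunks,                # 2./3. send chunks; workers
                     function(x) sum(sqrt(x)))  #      compute without interaction
total <- Reduce(`+`, partial)                   # 4. manager combines results
stopCluster(cl)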


MapReduce

A concept born of a search engine

Decouples certain coupled problems with an intermediate communication: the shuffle

User needs to decompose computation into Map and Reduce steps

User writes two serial codes: Map and Reduce
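For instance, the usual word-count example decomposes as below. This is a serial base-R sketch in which split() stands in for the shuffle; the documents and function names are illustrative assumptions.

## MapReduce word count, emulated serially in base R.
docs <- c("the quick brown fox", "the lazy dog", "the fox")

map <- function(doc) {                          # Map: record -> (key, value) pairs
  data.frame(key = strsplit(doc, " ")[[1]], value = 1)
}

mapped   <- do.call(rbind, lapply(docs, map))   # apply Map to every record
shuffled <- split(mapped$value, mapped$key)     # Shuffle: group values by key
reduce   <- function(values) sum(values)        # Reduce: combine values per key
counts   <- vapply(shuffled, reduce, numeric(1))
counts                                          # named vector of word counts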


MapReduce: a Parallel Search Engine Concept

[Figure: index entries for web pages (records), distributed by page across processes p0-p3, are shuffled (MPI Alltoallv) so that index words (keys) become the distributed dimension: build the index by searching MANY documents, then serve MANY users by key.]

Matrix transpose in another language?


Can use different sets of processors

[Figure: the shuffle can also move data to a different set of processes: here the entries for one key block (B1-B4) are streamed from processes p0-p3 to processes p4-p7 via MPI Scatter.]


SPMD: Single Program Multiple Data

Write one general program so that many copies of it can run asynchronously and cooperate (usually via MPI) to solve the problem.

The prevalent way of distributed programming in HPC for 30+ years

Can handle tightly coupled parallel computations

It is designed for batch computing

There is usually no manager - rather, all cooperate

Prime driver behind MPI specification

Way to program server side in client-server
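A minimal SPMD "hello world" with pbdMPI follows (a sketch; the file name and launch line are illustrative assumptions).

## Save as hello.r and run with:  mpirun -np 4 Rscript hello.r
## Every rank executes this same script.
library(pbdMPI)
init()

me   <- comm.rank()          # this copy's id: 0, 1, 2, ...
size <- comm.size()          # how many copies are cooperating
comm.cat("Hello from rank", me, "of", size, "\n", all.rank = TRUE)

finalize()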


SPMD example: compute A = X^T X, where X is split into row blocks X_1, ..., X_8 (row-block partition), with block X_i held by process i. [1]

Each process computes crossprod(X_i) = X_i^T X_i locally, and the local pieces are summed across processes:

A = reduce( crossprod( X_i ) )       (result collected on one process)
A = allreduce( crossprod( X_i ) )    (result on every process)

[1] Ostrouchov (1987). Parallel Computing on a Hypercube: An overview of the architecture and some applications. Proceedings of the 19th Symposium on the Interface of Computer Science and Statistics, pp. 27-32.
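In pbdMPI this is only a few lines. The sketch below assumes each process holds its own random row block X_i; the block size and column count are illustrative.

## SPMD: A = t(X) %*% X accumulated across ranks with pbdMPI.
## Run with:  mpirun -np 8 Rscript crossprod.r
library(pbdMPI)
init()

p <- 5
X.local <- matrix(rnorm(1000 * p), ncol = p)      # this rank's row block X_i
A <- allreduce(crossprod(X.local), op = "sum")    # sum of t(X_i) %*% X_i; result on every rank

comm.print(dim(A), rank.print = 0)                # print A's dimensions from rank 0
finalize()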


SPMD and MapReduce (MPI and Shuffle)

Both Concepts are about Communication

SPMD makes communication explicit, gives choices (MPI)

MapReduce hides communication, uses one choice (shuffle)


Data-flow: Parallel Runtime Scheduling and Execution Controller (PaRSEC)

Graphic from icl.cs.utk.edu

Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. "PaRSEC: Exploiting Heterogeneity to Enhance Scalability," IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November 2013.

Master data-flow controller runs distributed on all cores.

Dynamic generation of current level in flow graph

Effectively removes collective synchronizations


Recall: Hardware flavors and Low-Level Native Tools

[Figure: the hardware flavors, native tools, and R interfaces diagram, repeated from earlier.]

Scalable Libraries Mapped to Hardware

[Figure: scalable HPC libraries mapped onto the hardware diagram. Dense and sparse linear algebra: ScaLAPACK, PBLAS, BLACS, PLASMA, DPLASMA, MAGMA, PETSc, Trilinos, CombBLAS, LibSci (Cray), MKL (Intel), ACML (AMD), cuBLAS and cuSPARSE (NVIDIA). Profiling: PAPI, Tau, mpiP, fpmpi. I/O: NetCDF4, ADIOS. Messaging: MPI, ZeroMQ.]

Most are built for SPMD.


R and pbdR Interfaces to HPC Libraries

[Figure: pbdR packages layered over these HPC libraries on the same hardware diagram: pbdMPI, pbdSLAP, pbdBASE, pbdDMAT, pbdDEMO, pbdPROF, pbdPAPI, pbdNCDF4, pbdADIOS, pbdZMQ, pbdCS, remoter, getPass, pbdIO, and pbdML (machine learning), alongside rhdf5/HDF5; the diagram distinguishes released packages from those under development.]


Prepared by the pbdR Core Team

pbdR Core Team Developers: Wei-Chen Chen (FDA), George Ostrouchov (ORNL & UTK), Drew Schmidt (UTK)

Developers: Christian Heckendorf, Pragneshkumar Patel, Gaurav Sehrawat

Contributors: Whit Armstrong, Ewan Higgs, Michael Lawrence, Michael Matheson, David Pierce, Andrew Raim, Brian Ripley, ZhaoKang Wang, Hao Yu

Engaging parallel libraries at scale

R language unchanged

New distributed concepts

New profiling capabilities

New interactive SPMD

In situ distributed capability

In situ staging capability via ADIOS

Plans for DPLASMA GPU capability

Support: This material is based upon work supported by the National Science Foundation Division of Mathematical Sciences under Grant No. 1418195. This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work also used resources of the National Institute for Computational Sciences at the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation.
