Scaling High-Performance Python with Minimal Effort … · Spark MPI/C++ HPAT 20x-256x speedup of...

16
Scaling High-Performance Python with Minimal Effort 1 Ehsan Totoni Research Scientist, Intel Labs STAC Summit NYC, Nov. 1 st , 2017

Transcript of Scaling High-Performance Python with Minimal Effort … · Spark MPI/C++ HPAT 20x-256x speedup of...

Scaling High-Performance Python with Minimal Effort

1

Ehsan Totoni

Research Scientist, Intel Labs

STAC Summit NYC, Nov. 1st, 2017

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability,

fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course

of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here

is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications

and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from

published specifications. Current characterized errata are available on request.

Intel, the Intel logo, Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

2017 © Intel Corporation.

*Other names and brands may be claimed as the property of others

2

Legal Disclaimer

Data analytics is the greatest value driver in technology

Financial services need insights from data

• Exploit market data for financial modeling, etc.

High performance big data analytics is crucial

• Democratize HPC for data scientists

3

Motivation

http://www.businesscomputingworld.co.uk/

Scripting languages like Python are productive but slow and serial

Big data frameworks (Hadoop/Spark) are hard to use and slow

• High overhead runtime libraries

• Not based on parallel computing fundamentals

High performance requires low-level programming

• Not practical for interactive workflows of data scientists and their expertise

4

Productivity-Performance Gap

python.org

llnl.orgisocpp.orgInfoobjects.com

5

Motivation

127.22

64.1

4.28

2.151.11

1

10

100

1 (36) 2 (72) 4 (144)

Exe

cu

tio

n T

ime

(s)

Amazon AWS c4.8xlarge instances (vCPUs)

Spark

MPI/C++

53x

Logistic Regression on Amazon AWS

Totoni et al. “A Case Against Tiny Tasks in Iterative Analytics”, HotOS’17

NOT STAC BENCHMARK

High performance/scalability for analytics/ML/AI with little effort

• Minimal changes to scripting source code

Compiler optimization and parallelization

• Scripting program → efficient parallel binary

High Performance Analytics Toolkit (HPAT)

• Python (previously Julia)

6

Overview

https://github.com/IntelLabs/HPAT.jl

python.org

https://github.com/IntelLabs/hpat

7

HPAT Python Example

@hpat.jit

def logistic_regression(iterations):

f = h5py.File("lr.hdf5", "r")

X = f['points'][:]

Y = f['responses'][:]

D = X.shape[1]

w = np.ones(D) - 0.5

for i in range(iterations):

w -= np.dot(((1.0 / (1.0 + np.exp(-Y * np.dot(X, w))) - 1.0) * Y), X)

return w

33x speedup

on 4 nodes

Example launch command:

mpirun -n 144 python logistic_regression.py

Numpy code is implicitly data-parallel

8

Data Parallelism Extraction

1Anderson et al. “Parallelizing Julia with a Non-invasive DSL”, ECOOP’17

D = A * B + C

parfor i=1:n

t[i]=A[i]*B[i]

parfor i=1:n

D[i]=t[i]+C[i]

parfor i=1:n

t[i]=A[i]*B[i]+C[i]

Recognize parallelism

Fuse loops

*

+

A

B

C

=D

9

Spark Workflow HPAT Workflow

Python code Spark API code

Spark Runtime

Python code

Cluster/cloud

Parallel binary (MPI)

Cluster/cloud

Rewrite by programmer

Compile by HPAT

Driver

Executor 1 …

Rank 0 …Rank 1 Rank N-1

Driver

Executor N-1Executor 0

10

Performance Evaluation

61

767 830351

0.013

10.5

0.24

0.03

2.060.98 0.95

0.01

0.1

1

10

100

1000

KernelDensity

LinearRegression

LogisticRegression

K-MeansE

xe

cutio

n t

ime

(s)

Spark MPI/C++ HPAT

46.2102

64.1

547

0.08

1.47 1.09 0.83

0.18

5.08

1.812.91

0.01

0.1

1

10

100

1000

Kernel Density LinearRegression

LogisticRegression

K-Means

Exe

cutio

n t

ime

(s)

Spark MPI/C++ HPAT

20x-256x speedup of HPAT vs Spark

Cori at NERSC/LBL

64 nodes (2048 cores)

Amazon AWS

4 nodes c4.8xlarge (144 vCPUs)

370x-2000x speedup of HPAT vs Spark

HPAT is within 2-4x MPI/C++HPAT Julia used, Python will be similar

Totoni et al. “HPAT: High Performance Analytics with Scripting Ease-of-Use”, ICS’17

NOT STAC BENCHMARKS

11

Pandas Example

@hpat.jit(locals={'s_open': hpat.float64[:], …})

def intraday_mean_revert():

f = h5py.File("stock_data.hdf5", "r"); …

for i in prange(nsyms):

symbol = sym_list[i]

s_open = f[symbol+'/Open'][:]; …

df = pd.DataFrame({'Open': s_open, …})

df['Stdev'] = df['Close'].rolling(window=90).std()

df['Moving Average'] = df['Close'].rolling(window=20).mean()

df['Criteria1'] = (df['Open'] - df['Low'].shift(1)) < -df['Stdev']

df['Criteria2'] = df['Open'] > df['Moving Average']

df['BUY'] = df['Criteria1'] & df['Criteria2']

df['Pct Change'] = (df['Close'] - df['Open']) / df['Open']

df['Rets'] = df['Pct Change'][df['BUY'] == True]

n_days = len(df['Rets'])

res = np.zeros(max_num_days)

if n_days:

res[-n_days:] = df['Rets'].fillna(0)

all_res += res

100x speedup

on 36 cores

http://www.pythonforfinance.net/2017/02/20/intraday-stock-mean-reversion-trading-backtest-in-python/

Explicit loop parallelism

Numpy:

• Element-wise operations: +, /, ==, exp, log, sqrt, …

• Array creation: zeros, ones_like, random, normal, …

• Others: sum, prod, dot, …

Pandas:

• Column access, and operations: df.A, df[‘A’], df.A.std()

• Filter: df[df.A > .5]

• Rolling windows: df.A.rolling(window=5).mean()

Parallel loop:

for i in prange(n):

s += A[i]**2

12

HPAT Operations

Input code to HPAT should be statically compilable (type stable)

• Dynamic code example:

• Rare in analytics

13

Variable Type Limitation

if flag1:

a = 2

else:

a = np.ones(n)

if isinstance(a, np.ndarray):

doWork(a)

if flag2:

f = np.zeros

else:

f = np.ones

b = f(m)

Data Frame column accesses should be static

• Dynamic code example:

• Refactor to:

14

Pandas Limitation

for i in range(5):

A += df['c'+str(i)]

A += df['c0']

A += df['c1']

A += df['c2']

A += df['c3']

A += df['c4']

Compiler approach superior to library approach for analytics

HPAT bridges productivity-performance gap

• Compiles Python programs to efficient parallel binaries

• Available on GitHub: https://github.com/IntelLabs/hpat

15

Summary

Higher performanceEasier to use

Simpler infrastructureBroader functionality

E. Totoni, A. Roy, S. R. Dulloor, “A Case Against Tiny Tasks in Iterative Analytics”, HotOS’17

E. Totoni, T. A. Anderson, T. Shpeisman, “HPAT: High Performance Analytics with Scripting

Ease-of-Use”, ICS’17

https://arxiv.org/abs/1611.04934

T. A. Anderson, H. Liu, L. Kuper, E. Totoni, J. Vitek, T. Shpeisman, “Parallelizing Julia with a

Non-invasive DSL”, ECOOP’17

E. Totoni, W. Hassan, T. A. Anderson, T. Shpeisman, “HiFrames: High Performance Data

Frames in a Scripting Language”, (arxiv) 2017

https://arxiv.org/abs/1704.02341

16

References