
Page 1

Supercomputing, big data, machine learning and mathematical finance

David H. Bailey, Lawrence Berkeley Natl. Lab. (recently retired) and University of California, Davis

Marcos Lopez de Prado, Hess Energy Trading Corporation

Collaborators: Jonathan Borwein (U. Newcastle), Qiji Jim Zhu (Western Michigan U.)

This talk is available at: http://www.davidhbailey.com/dhbtalks/dhb-hpc-finance.pdf


Page 2

Performance of the world’s top 500 supercomputers, 1989-2013 (courtesy www.top500.org)

[Chart: performance over time of the #1 system, the #500 system, and the sum of #1 through #500.]

At any point in time, the world’s #1 supercomputer has more computing power than all of the world’s top 500 systems combined had just 3-4 years earlier!

Page 3

Statistics of four typical leading-edge supercomputers

                     “Edison”       “Mira”         “Titan”             “Tianhe-2”
                     Berkeley Lab   Argonne Lab    Oak Ridge Lab       Guangzhou, China
Peak performance     2.4 Pflop/s    10.0 Pflop/s   2.6 Pflop/s (CPU)   54.9 Pflop/s
(1 Pflop/s =                                       24.5 Pflop/s (GPU)
10^15 flop/s)
CPU cores            124,800        786,432        299,008 (CPU)       3,120,000
                                                   261,632 (GPU)
Memory               333 Tbyte      786 Tbyte      598 Tbyte (CPU)     1 Pbyte
                                                   112 Tbyte (GPU)
File system          6.4 Pbyte      35 Pbyte       10 Pbyte            12.4 Pbyte
Floor space          1,200 sq ft    ~1,500 sq ft   4,352 sq ft         7,750 sq ft
Power requirement    2.1 MW         4.0 MW         8.2 MW              17.8 MW

By comparison: 54.9 Pflop/s = 7.8 million arithmetic operations per second for every human on earth; 7,750 sq ft = the area of two NBA basketball courts; 17.8 MW = the power consumption of an average U.S. city with population ~13,000.

Page 4

New building to house Berkeley Lab (NERSC) supercomputers

·  Four stories, 140,000 sq ft.
·  300 offices on two floors.
·  20,000-29,000 sq ft computer floor.
·  12.5 MW to 42 MW power.
   §  Natural air and water cooling.
   §  Heat recovery system.
·  Initial occupancy: Fall 2014.
·  Cost of supercomputer systems in the facility: “tens of millions.”

Page 5

Big data astrophysics computations on Berkeley Lab / NERSC supercomputers

The Palomar Transient Factory, coupled with NERSC data processing, has discovered over 2,000 confirmed supernovae in the last five years, including the youngest and closest Type Ia supernova of the past 40 years.

[Image: arrow marking the onset of this supernova in the Pinwheel Galaxy (NGC 5457).]

The data processing pipeline runs on NERSC computers each night, making heavy use of special networks and data storage facilities. 67 refereed publications have resulted from this work to date, including two in Science and four in Nature.

PI: Shri Kulkarni (Caltech)

Page 6

Large labs and universities are leading the charge into big data supercomputing

·  The observational dataset for the Large Synoptic Survey Telescope will be over 100 Pbyte (i.e., 10^17 bytes).
·  The Square Kilometer Array project will generate 700 Tbyte per day.
·  The Daya Bay neutrino project will require numerical simulations using over 128 Pbyte of active memory.
·  By 2017, the ATLAS/CMS high-energy physics experiments will have generated 190 Pbyte of data.
·  The Next Generation Light Source facility is expected to generate 1 Tbyte per second.
·  DNA datasets are exploding, due to the rapidly dropping cost of DNA sequencing technology.

Expected data rate production (data rate in Mbps):

Institution   <2 years   2-5 years   >5 years
ALS           5.00E+03   1.00E+04    2.00E+06
NSLS          1.00E+02   5.00E+04    1.00E+05
SLAC          4.00E+03

[Chart: projected improvement over 2013, by year (2013-2018), for instruments, processors and memory bandwidth.]

Page 7

Supercomputing, big data, machine learning and mathematical finance

Supercomputing, coupled with huge datasets, has great potential for mathematical finance: analyzing financial data, social network data, news data, government data, etc. But these huge datasets mean that the sun is setting on the day when an analyst (in any field) can rely on “eyeballing” individual data values. Solutions:

·  Parallel computers can analyze data as fast as it is generated.
·  Advanced visualization facilities make it possible to “see” phenomena that otherwise are not evident.
·  “Machine learning” makes it possible for the computer to automatically detect “interesting” phenomena in data that humans, in most cases, fail to recognize.

These technologies have huge potential for the world of mathematical finance. Will you be out-computed by the competition?

Page 8

IBM’s “Watson” technology heads to New York City

·  On January 9, 2014, IBM announced the IBM Watson Group, a new unit that will bring its “Watson” technology to a wide range of businesses.

·  The unit is based in NYC, with a staff of 2000 and $1 billion startup capital.

·  Since defeating Jeopardy! champions Ken Jennings and Brad Rutter in February 2011, Watson has become 24 times faster and 90% smaller, according to IBM.

·  Among its target applications are:
   §  Medical diagnosis.
   §  Cancer research.
   §  Finance and investment.
   §  Wealth management; IBM has teamed with DBS Holdings of Singapore.

[Photo: Jeopardy! champ Ken Jennings concedes defeat to Watson.]

Page 9

Machine learning on supercomputers

Machine learning has great potential for finance, but the requisite computations, when applied to massive datasets, can be time-consuming. Parallel computing and supercomputers to the rescue! Here are some available machine learning software packages suitable for parallel clusters and supercomputers:

·  IBM’s parallel machine learning toolbox: https://www.research.ibm.com/haifa/projects/verification/ml_toolbox/index.html
·  AMPLab’s MLbase: http://www.mlbase.org
·  The Apache Mahout library: http://mahout.apache.org
·  Univ. of Waikato’s WEKA software: http://www.cs.waikato.ac.nz/ml/weka/

Some assembly required! This is still an area of active research and development. Don’t expect many turnkey solutions just yet.
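None of the packages above is shown here, but the pattern they all exploit is the same: many independent model fits distributed across processors. Below is a minimal, hypothetical single-node sketch using only Python’s standard library; the evaluate function and its fake scoring are placeholders, not any package’s API.

```python
# Hypothetical sketch: score many model configurations in parallel,
# using only Python's standard library (no package above is used).
import multiprocessing as mp
import random

def evaluate(config):
    """Stand-in for a real training/backtest run: score one configuration."""
    rng = random.Random(config["seed"])
    # ... a real version would fit a model here; we return a fake score ...
    return config["seed"], rng.gauss(0.0, 1.0)

if __name__ == "__main__":
    configs = [{"seed": s} for s in range(1000)]
    with mp.Pool() as pool:          # one worker per CPU core by default
        scores = pool.map(evaluate, configs)
    best = max(scores, key=lambda kv: kv[1])
    print("best configuration:", best)
```

On a real cluster the same map-style pattern would typically run under MPI, Hadoop or Spark rather than a single-node process pool.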

Page 10

DANGER AHEAD

Supercomputers, mathematical algorithms and machine learning can generate nonsense faster than ever before! In finance, the principal danger is statistical overfitting of backtest data:

·  When a computer can analyze thousands or millions of variations of a given strategy, it is almost certain that the best such strategy, measured by backtests, will be overfitted (and thus of dubious value); see the simulation below.
·  Many studies claim profitable investment strategies, but their results are based only on in-sample statistics, with no out-of-sample testing.
·  Overfitting is the most common reason that mathematical investment schemes look great in backtests, but then fall flat in the real world.
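A small illustrative simulation (ours, not taken from any reference): backtest N pure-noise “strategies” over five years of daily data and keep the best in-sample Sharpe ratio. Every strategy has zero true edge, yet the winner looks better and better as N grows.

```python
# Best in-sample Sharpe ratio among N strategies that are pure noise.
import numpy as np

rng = np.random.default_rng(42)
T = 5 * 252                                       # five years of daily returns
for N in (10, 100, 1000, 10000):
    returns = rng.normal(0.0, 0.01, size=(N, T))  # zero-mean noise: no true edge
    sharpe = returns.mean(axis=1) / returns.std(axis=1) * np.sqrt(252)
    print(f"N = {N:5d} trials -> best in-sample Sharpe = {sharpe.max():.2f}")
```

With a few thousand trials, the best backtest shows an in-sample Sharpe ratio well above 1 by luck alone.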

References:
1. C. Harvey, Y. Liu and H. Zhu, “… and the Cross-Section of Expected Returns,” SSRN, 2013, available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2249314.
2. D.H. Bailey, J.M. Borwein, M. Lopez de Prado and Q.J. Zhu, “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance,” available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2308659 or at http://www.financial-math.org.

Page 11

Why overfitting is so common

[Chart: minimum backtest length (years) versus the number of trials N.]

This graph shows the trade-off between the number of trials N and the minimum backtest length needed to prevent spurious strategies from being generated with an in-sample Sharpe ratio of 1, when the underlying data has mean zero. Thus, if only 5 years of financial data are available, no more than 45 independent model configurations should be tried, or else a spurious scheme with Sharpe ratio > 1 is almost certain to be found.
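The bound behind this graph can be computed directly. Here is a minimal sketch of the Minimum Backtest Length (MinBTL) from the Bailey-Borwein-Lopez de Prado-Zhu paper cited below, using the standard approximation E[max of N std normals] ≈ (1-γ)Φ⁻¹(1-1/N) + γΦ⁻¹(1-1/(Ne)), with γ the Euler-Mascheroni constant; the function name min_backtest_length is ours.

```python
# MinBTL (years) ~ E[max of N standard normals]^2: the backtest length
# needed so the expected best in-sample annualized Sharpe ratio of N
# zero-mean strategies stays below 1.
from math import e
from statistics import NormalDist

GAMMA = 0.5772156649             # Euler-Mascheroni constant
Phi_inv = NormalDist().inv_cdf   # inverse standard normal CDF

def min_backtest_length(n_trials: int) -> float:
    e_max = (1 - GAMMA) * Phi_inv(1 - 1 / n_trials) \
          + GAMMA * Phi_inv(1 - 1 / (n_trials * e))
    return e_max ** 2

for n in (10, 45, 100, 1000):
    print(f"N = {n:4d} trials -> MinBTL = {min_backtest_length(n):.1f} years")
```

For N = 45 this yields roughly 5 years, matching the rule of thumb stated above.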

Reference: D.H. Bailey, J.M. Borwein, M. Lopez de Prado and Q.J. Zhu, “Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance,” available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2308659 or at http://www.financial-math.org.

Page 12

How to avoid overfitting errors

·  Observe statistical limits when generating models.
   §  See the paper by DHB, Borwein, Lopez de Prado and Zhu cited on the previous page, available at http://www.financial-math.org.
·  Perform out-of-sample testing: Test the resulting investment algorithm on data that was not used in the development and backtest process (a minimal sketch follows at the end of this page).
·  Perform model sequestration: Announce a proposed investment strategy to others (either publicly, or within a firm), then subsequently publish the results of using this strategy for a pre-specified period of time.
   §  See D. Leinweber and K. Sisk, “Event Driven Trading and the ‘New News’,” Journal of Portfolio Management, vol. 38, no. 1, pg. 110-124.

Many other scientific disciplines are facing similar issues of reproducibility, to overcome the bias of only publishing “good” results:
·  There is a growing movement in the pharmaceutical industry to require the results of all prototype drug tests to be made public.

Reference: V. Stodden, D. Bailey, J. Borwein, R. LeVeque, W. Rider and W. Stein, “Setting the default to reproducible: Reproducibility in computational and experimental mathematics,” February 2013, available at http://www.davidhbailey.com/dhbpapers/icerm-report.pdf.
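As a companion to the out-of-sample bullet above, here is a minimal hold-out sketch (an illustration of the idea, not the authors’ procedure): select the best of many noise strategies on the first four years of data, then score that single choice on two years it never saw.

```python
# Hold-out illustration: pick the winner in-sample, judge it out-of-sample.
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=(500, 6 * 252))   # 500 noise "strategies", 6 years
in_sample, out_sample = returns[:, :4 * 252], returns[:, 4 * 252:]

def sharpe(r):
    """Annualized Sharpe ratio of a daily return series."""
    return r.mean() / r.std() * np.sqrt(252)

best = max(range(len(returns)), key=lambda i: sharpe(in_sample[i]))
print(f"in-sample Sharpe of the winner: {sharpe(in_sample[best]):+.2f}")
print(f"its out-of-sample Sharpe:       {sharpe(out_sample[best]):+.2f}")
```

The in-sample winner typically sports a Sharpe ratio well above 1, while its out-of-sample Sharpe ratio is just another draw from zero-mean noise; that gap is exactly what out-of-sample testing and model sequestration are designed to expose.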

Page 13

Why the silence in the mathematical finance community?

·  Historically, scientists have led the way in exposing those who utilize pseudoscience to extract a commercial benefit: in the 18th century, for example, physicists exposed the nonsense of astrologers.
·  Yet financial mathematicians in the 21st century have remained disappointingly silent with regard to those in the investment community who, knowingly or not:
   §  Fail to disclose the number of models that were used to develop a scheme.
   §  Make vague predictions that do not permit rigorous testing and falsification.
   §  Misuse probability theory, statistics and stochastic calculus.
   §  Use dubious technical jargon: “technical analysis,” “cycles,” “waves,” “Fibonacci ratios,” “golden ratios,” etc.
·  Our silence is consent, making us accomplices in these abuses.

“One has to be aware now that mathematics can be misused and that we have to protect its good name.” – Andrew Wiles (prover of Fermat’s Last Theorem), New York Times, 4 Oct 2013.

Page 14

Mathematicians Against Fraudulent Financial and Investment Advice (MAFFIA)

Websites:
·  http://www.financial-math.org (main site)
·  http://www.m-a-f-f-i-a.org (alias to main site)
·  http://www.financial-math.org/blog/ (blog)

This site was created out of concern with the proliferation of pseudo-mathematical investment claims and schemes in the past few years. We regularly post news, technical articles, essays and exposés of abuses.

Do yourself a favor: Don’t be guilty of the abuses mentioned in our blog and technical articles. Better still, join the MAFFIA: Visit the sites above, or send email to [email protected].

This talk is available at: http://www.davidhbailey.com/dhbtalks/dhb-hpc-finance.pdf