Light Up Your Dark Data

QuantCon “Light Up Your Dark Data” April 2016

Transcript of Light Up Your Dark Data

Page 1: Light Up Your Dark Data

QuantCon “Light Up Your Dark Data”

April 2016

Page 2: Light Up Your Dark Data

What is dark data?

[Diagram: data sources labeled SQL, CSV, REST, JSON]

Page 3: Light Up Your Dark Data

Example Datasets

Trade History

Signal History

Clearing Data

Log Files

Ref Data

Corp Actions

Market Data

Models

Firm Generated / Vendor Generated

Page 4: Light Up Your Dark Data

Compounding Challenges

Accumulates Quickly

Disparate Storage

Different Vendors

Format Changes

Ad-hoc Usage

Urgent!

Page 5: Light Up Your Dark Data

Workflow

Find Data

Ad-Hoc ETL

Store / Copy

Analysis

Report

Page 6: Light Up Your Dark Data

Sample Environment

Oracle MySQL MSSQL KDB ZIP CSV

SQL

Python

DSL

R Matlab

C++ Java

Storage

ETL

Analysis

REST

Page 7: Light Up Your Dark Data

Independent First Class Citizens

Expression

Compute

Data

Page 8: Light Up Your Dark Data

Datashape

Structured data description language

http://datashape.pydata.org

Page 9: Light Up Your Dark Data

Datashape Example

daily_bars: var * {
    date: string,
    symbol: string,
    open: float64,
    high: float64,
    low: float64,
    close: float64,
    volume: int64,
}

Language, compute, and storage independent
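Because the description is independent of language and storage, the same record type can be mirrored and enforced in any environment. The following is a minimal sketch in plain Python, assuming a simple mapping from the datashape types to Python types. It does not use the datashape library, and the names DAILY_BARS and conforms are hypothetical.

```python
# Hypothetical mirror of the daily_bars record type as plain Python types.
# Illustration only: this is not the datashape library's API.
DAILY_BARS = {
    "date": str,
    "symbol": str,
    "open": float,
    "high": float,
    "low": float,
    "close": float,
    "volume": int,
}


def conforms(row, schema=DAILY_BARS):
    """Return True if row has exactly the schema's fields with matching types."""
    if set(row) != set(schema):
        return False
    return all(isinstance(row[name], typ) for name, typ in schema.items())


good = {"date": "20151111", "symbol": "aaap", "open": 18.5, "high": 25.9,
        "low": 18.0, "close": 24.5, "volume": 1584600}
bad = dict(good, volume="1584600")  # volume arrives as text instead of an integer

print(conforms(good))
print(conforms(bad))
```

A check like this is useful at the boundary where a vendor file is first read, since format changes then surface as a failed check rather than as a wrong result later in the analysis.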

Page 10: Light Up Your Dark Data

Blaze

Write expressions independent of storage system

Push computations to the data

Lazy evaluation

Pandas-like API
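The ideas on this slide can be illustrated without Blaze itself. The sketch below is not the Blaze API: Field, Greater, compute_rows, and compute_sql are hypothetical names, and only the Python standard library is used. It builds one lazy expression and then evaluates it against two backends, an in-memory list and a SQLite table, where the filter is translated to SQL so the database does the work.

```python
import sqlite3


class Field:
    """A lazy reference to a column; comparing it builds an expression node."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return Greater(self.name, value)


class Greater:
    """Expression node for `column > value`. Nothing is computed here."""

    def __init__(self, column, value):
        self.column = column
        self.value = value


def compute_rows(expr, rows):
    """Backend 1: interpret the expression over a list of dicts."""
    return [r for r in rows if r[expr.column] > expr.value]


def compute_sql(expr, conn, table):
    """Backend 2: translate the expression to SQL so the database filters."""
    # Identifiers are interpolated for brevity; only the value is a bound parameter.
    query = 'SELECT * FROM "{}" WHERE "{}" > ?'.format(table, expr.column)
    cur = conn.execute(query, (expr.value,))
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


rows = [
    {"symbol": "aaap", "close": 24.5},
    {"symbol": "aaap", "close": 25.0},
    {"symbol": "zyne", "close": 9.64},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bars (symbol TEXT, close REAL)")
conn.executemany("INSERT INTO bars VALUES (:symbol, :close)", rows)

expr = Field("close") > 20  # built once, not evaluated yet
from_memory = compute_rows(expr, rows)
from_sqlite = compute_sql(expr, conn, "bars")
print(from_memory)
print(from_sqlite)
```

The expression is created before any backend is chosen, which is the sense in which it is lazy and storage independent. In the SQLite case only the matching rows leave the database.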

Page 11: Light Up Your Dark Data

Blaze

http://blaze.pydata.org/

Page 12: Light Up Your Dark Data

Blaze Expressions

Page 13: Light Up Your Dark Data

Flat File Repositories

Many directories and files

Dictated structure

Naming convention part of dataset

Requires one-off, ad-hoc scripts

Page 14: Light Up Your Dark Data

Vendor - directory structure

/daily/us/nasdaq stocks/
/daily/us/nasdaq stocks/1/
/daily/us/nasdaq stocks/2/
osn.us.txt
ostk.us.txt
…
zyne.us.txt
/daily/us/nyse etfs/
/daily/us/nyse stocks/1/
/daily/us/nyse stocks/2/

Contains ~8400 individual files

Page 15: Light Up Your Dark Data

Vendor – file contents

Date,Open,High,Low,Close,Volume,OpenInt
20151111,18.5,25.9,18,24.5,1584600,0
20151112,24.25,27.12,22.5,25,83000,0
20151113,25.47,26.2,24.55,25.26,67300,0
20151116,25.01,26.19,24.13,25.02,16900,0
20151117,24.46,25.51,24.38,24.62,25900,0
20151118,24.62,26.31,24.06,25,111100,0
20151119,24.85,26,24.71,25.9,113100,0
…

Symbol is not contained within the individual data files

/daily/us/nasdaq stocks/1/aaap.us.txt

Page 16: Light Up Your Dark Data

Lux

source: "lux://global-equities/data/daily/us/nasdaq stocks"
extractor: "{}/{Symbol}.{Region}.txt"

Date,Open,High,Low,Close,Volume,OpenInt,Symbol,Region
20151111,18.5,25.9,18,24.5,1584600,0,aaap,us
20151112,24.25,27.12,22.5,25,83000,0,aaap,us
20151113,25.47,26.2,24.55,25.26,67300,0,aaap,us
…
20160322,11.56,11.98,10.8894,11.09,517604,0,zyne,us
20160323,11.3,11.72,9.5,9.75,489743,0,zyne,us
20160324,9.5,10.24,9.22,9.64,188512,0,zyne,us

One dataset with ~5.5 million rows
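How an extractor pattern of this kind might turn file names into columns can be sketched in plain Python. This is not Lux's implementation: the functions extractor_to_regex and combine are hypothetical, the pattern semantics ("{}" skips one path segment, "{Name}" captures a field) are an assumption, and the directory placement of the two sample files is illustrative.

```python
import csv
import io
import re


def extractor_to_regex(pattern):
    """Turn '{}/{Symbol}.{Region}.txt' into a regex with named groups.

    Assumed semantics: '{}' matches one path segment and is discarded,
    '{Name}' captures a field called Name.
    """
    out = []
    for part in re.split(r"(\{\w*\})", pattern):
        if part == "{}":
            out.append(r"[^/]+")
        elif part.startswith("{") and part.endswith("}"):
            out.append(r"(?P<{}>[^/.]+)".format(part[1:-1]))
        else:
            out.append(re.escape(part))
    return re.compile("".join(out) + r"\Z")


def combine(files, pattern):
    """Yield rows from every file, with fields parsed from its path appended."""
    regex = extractor_to_regex(pattern)
    for path, text in files.items():
        match = regex.match(path)
        if match is None:
            continue  # path does not follow the naming convention
        for row in csv.DictReader(io.StringIO(text)):
            row.update(match.groupdict())
            yield row


# In-memory stand-ins for two vendor files, keyed by path relative to the source.
files = {
    "1/aaap.us.txt": "Date,Open,High,Low,Close,Volume,OpenInt\n"
                     "20151111,18.5,25.9,18,24.5,1584600,0\n",
    "2/zyne.us.txt": "Date,Open,High,Low,Close,Volume,OpenInt\n"
                     "20160324,9.5,10.24,9.22,9.64,188512,0\n",
}

rows = list(combine(files, "{}/{Symbol}.{Region}.txt"))
for row in rows:
    print(row["Date"], row["Symbol"], row["Region"], row["Close"])
```

Because combine is a generator, rows are produced file by file without first copying everything into a separate store, which loosely mirrors the "no separate ETL or storage" point.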

Page 17: Light Up Your Dark Data

Lux Benefits

Combines individual files

No separate ETL or storage

Names become part of data

Optimized compute

Page 18: Light Up Your Dark Data

Anaconda Mosaic

Interactive exploration

Intuitive interface

Advanced visualizations

Catalog of datasets and expressions

Provenance and Governance

Page 19: Light Up Your Dark Data

Live Walkthrough

Page 20: Light Up Your Dark Data

Project References

• Anaconda Mosaic - http://know.continuum.io/Anaconda-Mosaic

• Blaze Ecosystem - http://blaze.pydata.org

• Bokeh - http://bokeh.pydata.org