What's new in pandas and the SciPy stack for financial users

What’s new in pandas and the SciPy stack for financial users Wes McKinney


Page 1: What's new in pandas and the SciPy stack for financial users

What’s new in pandas and the SciPy stack for financial users

Wes McKinney

Page 2: What's new in pandas and the SciPy stack for financial users

Me

• AQR: August 2007 - July 2010

• Duke Statistics: 2010 - present (now on leave)

• My plans

• Improving Python libs for statistics and finance

• Building a financial software + consulting business based on said tools

Page 3: What's new in pandas and the SciPy stack for financial users

Core Python stack for finance

• NumPy, SciPy (heavy lifting)

• pandas (data handling / computation)

• IPython (dev and research env)

• Cython (perf optimization)

• matplotlib (visualization)

• statsmodels (statistics / econometrics)

Page 4: What's new in pandas and the SciPy stack for financial users

General sentiments

• Scientific Python growing solidly in finance and in many other fields

• Though good sci-pythonistas are still scarce

• Important work happening in many of the core projects

• Growing consensus: a new computational model is needed to better cope with “big data”

Page 5: What's new in pandas and the SciPy stack for financial users

NumPy

• Significantly refactored C internals

• Great progress on native datetime64 type

• Will significantly improve date-handling performance and usability

• Extensible business day / holiday logic planned / in progress

• Addition of low-level missing data (NA) support in the works
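
A minimal sketch of the datetime64 and business-day features, written against the API as it exists in later NumPy releases (the slide describes this work as still in progress). The dates and the holiday list are illustrative assumptions.

```python
import numpy as np

# Day-resolution dates stored natively in an array (no Python datetime objects).
dates = np.array(["2011-07-01", "2011-07-04", "2011-07-05"], dtype="datetime64[D]")

# Subtracting datetime64 values yields timedelta64 values.
gaps = np.diff(dates)

# Illustrative holiday list (an assumption for this example).
holidays = ["2011-07-04"]

# Element-wise business-day check, honoring the holiday list.
is_bday = np.is_busday(dates, holidays=holidays)

# Number of business days in the half-open range [begin, end).
n_bdays = np.busday_count("2011-07-01", "2011-07-06", holidays=holidays)
```

Both functions also accept a weekmask argument, which is one way non-standard trading weeks could be expressed.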

Page 6: What's new in pandas and the SciPy stack for financial users

IPython

• One of Python’s killer apps gets even better

• Rich Qt GUI console with inline plotting

• New and improved architecture for high perf parallel / distributed computing

• See Fernando Pérez’s SciPy 2011 talk / video

Page 7: What's new in pandas and the SciPy stack for financial users

Cython

• Still the first tool you should reach for to get better performance

• New: OpenMP integration (for multi-core)

• Supports (almost) all of standard Python now (some features, such as closures, previously did not work)

with nogil:
    for i in prange(n):
        # do something in parallel

Page 8: What's new in pandas and the SciPy stack for financial users

statsmodels

• Statistics and econometrics in Python

• Major work in time series models over last year+

• VAR, SVAR models, eventually (V)ECM models for cointegrated time series

• AR/ARMA, Kalman Filter, various macro filters (e.g. Hodrick-Prescott) implemented

• Soon: Bayesian state space models (DLMs), ARCH/GARCH models, etc.
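
To make one of the macro filters above concrete, here is a sketch of the Hodrick-Prescott filter in plain NumPy. It is not the statsmodels implementation; the function name hp_filter, the default lamb=1600 and the sample series are assumptions for illustration. The trend minimizes the squared deviation from the data plus lamb times the squared second differences of the trend, which reduces to solving one linear system.

```python
import numpy as np

def hp_filter(y, lamb=1600.0):
    """Return (cycle, trend) for a 1-D series using a dense linear solve."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-difference operator D with shape (n - 2, n).
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    # First-order condition of the penalized least-squares problem:
    # (I + lamb * D'D) trend = y
    trend = np.linalg.solve(np.eye(n) + lamb * (D.T @ D), y)
    return y - trend, trend

# Illustrative series: a linear trend plus a slow oscillation.
t = np.arange(40, dtype=float)
series = 0.5 * t + np.sin(t / 3.0)
cycle, trend = hp_filter(series)
```

The dense solve is fine for short series; a sparse solver would be the natural choice for long ones.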

Page 9: What's new in pandas and the SciPy stack for financial users

statsmodels

• Major criticism: weak user interface

• No R-style formula framework

• pandas not integrated (need to pass raw NumPy arrays)

• I have begun work on pandas integration; formulas have been implemented and will hopefully arrive within the next few months

Page 10: What's new in pandas and the SciPy stack for financial users

pandas

• Still the Python data hacker’s best friend?

• Most recent release: 0.3.0 on 2/20/2011

• However, last 4 months have been the most active development period in the library’s history

• ~375 commits since 0.3.0 release (more than the entire prior open source history)

Page 11: What's new in pandas and the SciPy stack for financial users

The state of data structures

Page 12: What's new in pandas and the SciPy stack for financial users

Ambitious big picture

• I want to make pandas the cornerstone of the “next generation” statistical computing environment

• Ease-of-use, performance, flexibility all equally important

Page 13: What's new in pandas and the SciPy stack for financial users

Ambitious big picture

• Taking the best features of other languages (R and friends) and making them better and easier to use

• See my recent blog article “A Roadmap for Rich Scientific Data Structures in Python”

Page 14: What's new in pandas and the SciPy stack for financial users

pandas: under the hood

• Complete redesign of DataFrame internals

• Now a single class for 2D data retaining optimal performance of old DataFrame and DataMatrix classes

• Significantly improved mixed-type and missing data handling

• Plan to use internal data structure to implement “NDFrame” for n-dimensional data

Page 15: What's new in pandas and the SciPy stack for financial users

Fancy indexing

• Index a Series / DataFrame in a matrix-like way via special .ix attribute, use:

• Slices with integers or labels

• Lists of integers, labels, or boolean vecs

• Integer or label locations

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']
df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
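
The .ix attribute was later removed from pandas; in current releases the same selections are split between .iloc (integer positions) and .loc (labels and boolean vectors). A sketch of the four cases above under that assumption, with a made-up 6x6 frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-07-01", periods=6, freq="D")
df = pd.DataFrame(np.arange(36, dtype=float).reshape(6, 6) - 15.0,
                  index=dates, columns=list("ABCDEF"))

# Integer location: first row as a Series.
first_row = df.iloc[0]

# Label slice on the index; both endpoints are included.
window = df.loc[dates[1]:dates[3]]

# Integer slice on rows, then a label slice on columns.
block = df.iloc[:5].loc[:, "A":"F"]

# Boolean vector on rows plus a list of column labels, used for assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```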

Page 16: What's new in pandas and the SciPy stack for financial users

Misc new features

• “Sparse” (mostly NA) versions of Series, DataFrame, WidePanel

• Many new functions on Series/DataFrame

• describe, quantile, select, drop, dropna, corrwith, ...

• New moving window methods: rolling_quantile and rolling_apply
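
A short sketch of a few of these functions on made-up data. The moving window calls use the .rolling() accessor found in current pandas, which took over from the rolling_quantile and rolling_apply functions named above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan, 4.0, 5.0],
                   "B": [2.0, 4.0, 6.0, 8.0, 10.0]})

summary = df.describe()                 # count, mean, std, quartiles per column
clean = df.dropna()                     # drop rows containing any NA
median_a = clean["A"].quantile(0.5)     # median of the remaining A values
corr = clean.corrwith(clean["B"])       # correlation of each column with B

# Moving-window median and a custom moving-window function (window range).
roll_med = df["B"].rolling(3).quantile(0.5)
roll_range = df["B"].rolling(3).apply(lambda w: w.max() - w.min(), raw=True)
```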

Page 17: What's new in pandas and the SciPy stack for financial users

Improved IO

• read_csv, read_table functions more flexible and robust, better type inferencing

• ExcelFile class for reading multiple sheets out of .xls files

df = read_table('foo.txt', skiprows=[0, 1], na_values=['#N/A'])
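
The same options can be tried without a file on disk. This sketch feeds read_csv an in-memory string; the file contents and column names are invented for the example.

```python
import io
import pandas as pd

# Two junk header lines followed by a small comma-separated table.
raw = (
    "exported by some system\n"
    "generated 2011-07-01\n"
    "date,price,volume\n"
    "2011-06-29,10.5,100\n"
    "2011-06-30,#N/A,200\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=[0, 1],        # skip the two junk lines by position
    na_values=["#N/A"],     # treat this marker as missing
    index_col="date",
    parse_dates=True,       # parse the index column as dates
)
```

The price column comes back as floating point with one missing value, and volume stays integer.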

Page 18: What's new in pandas and the SciPy stack for financial users

Improved IO

• HDFStore class provides a complete, tested dict-like PyTables storage container

• Experimental: store as Table and query

store = HDFStore('mydata.h5')
store['x'] = x
store['y'] = y
y = store['y']

store.put('df', df, table=True)
piece = store.select('df', [{'field': 'index', 'op': '>=', 'value': date}])

Page 19: What's new in pandas and the SciPy stack for financial users

Group by enhancements

• Can group by multiple columns or key functions, SQL-like but more general

• Syntactic sugar to invoke aggregation functions on groups

• Automatic exclusion of “nuisance” columns of DataFrames

• Various other usability enhancements
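
A sketch of grouping by multiple columns and by a key function, on an invented trade table. It selects the numeric columns explicitly rather than relying on automatic exclusion of non-numeric ones.

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy", "tech"],
    "side":   ["buy", "sell", "buy", "buy", "buy"],
    "ticker": ["AAPL", "MSFT", "XOM", "CVX", "AMZN"],
    "pnl":    [1.0, -2.0, 3.0, 4.0, 5.0],
    "qty":    [10, 20, 30, 40, 50],
})

# Group by two columns, then aggregate with the .sum() shortcut.
by_two = df.groupby(["sector", "side"])[["pnl", "qty"]].sum()

# Group by a key function applied to each index label (first letter of ticker).
by_func = df.set_index("ticker")["pnl"].groupby(lambda t: t[0]).mean()
```

The key function can be any callable on the index labels, which is what makes this more general than a SQL GROUP BY.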

Page 20: What's new in pandas and the SciPy stack for financial users

Very soon: hierarchical indexing

• Enable axis ticks to be identified by multiple labels instead of a single label

• Easily select subsets of data by “level”

• Create Excel-style pivot tables / cross-tabulations in a sensible way

• Will integrate naturally with groupby
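
A sketch of what this looks like with the MultiIndex class in released pandas. The level names and values are invented for the example.

```python
import pandas as pd

# Each tick on the axis is identified by two labels: (month, ticker).
idx = pd.MultiIndex.from_product(
    [["2011-06", "2011-07"], ["AAPL", "MSFT"]],
    names=["month", "ticker"],
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Select a subset of the data by level.
july = s.xs("2011-07", level="month")

# Pivot the inner level into columns, like a cross-tabulation.
table = s.unstack("ticker")

# Group by one level of the index.
by_ticker = s.groupby(level="ticker").sum()
```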

Page 21: What's new in pandas and the SciPy stack for financial users

Other misc things

• Flexible binary operators

• a.add(b, fill_value=0.)

• Some timezone support in DateRange

• Numerous performance optimizations

• See the (long) release notes =)

Page 22: What's new in pandas and the SciPy stack for financial users

Planned work

• Fast time series up/downsampling

• Improved support and perf for HF/tick data

• Even more sophisticated group by tools

• Better documentation, online screencast tutorials / examples
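
For the first planned item, this is roughly what downsampling looks like with the resample method that later pandas releases provide. The slide only announces the feature, so the API shown is an assumption about how it eventually landed, and the tick data is invented.

```python
import numpy as np
import pandas as pd

# Ten one-minute observations starting at the market open.
idx = pd.date_range("2011-07-01 09:30", periods=10, freq="1min")
ticks = pd.Series(np.arange(10.0), index=idx)

# Downsample to five-minute bars: mean per bar, and open/high/low/close.
means = ticks.resample("5min").mean()
bars = ticks.resample("5min").ohlc()
```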

Page 23: What's new in pandas and the SciPy stack for financial users

Thanks

• Email: [email protected]

• Twitter: @wesmckinn

• Blog: http://blog.wesmckinney.com

• pandas: http://github.com/wesm/pandas

• statsmodels: http://statsmodels.sourceforge.net