What's new in pandas and the SciPy stack for financial users


Wes McKinney

Me

• AQR: August 2007 - July 2010

• Duke Statistics: 2010 - present (now on leave)

• My plans

• Improving Python libs for statistics and finance

• Building a financial software + consulting business based on said tools

Core Python stack for finance

• NumPy, SciPy (heavy lifting)

• pandas (data handling / computation)

• IPython (dev and research env)

• Cython (perf optimization)

• matplotlib (visualization)

• statsmodels (statistics / econometrics)

General sentiments

• Scientific Python growing solidly in finance and in many other fields

• Though good sci-pythonistas are still scarce

• Important work happening in many of the core projects

• Growing consensus: a new computational model is needed to better cope with “big data”

NumPy

• Significantly refactored C internals

• Great progress on native datetime64 type

• Will significantly improve date-handling performance and usability

• Extensible business day / holiday logic planned / in progress

• Addition of low-level missing data (NA) support in the works
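As a rough illustration of the datetime64 and business-day points above, here is a minimal sketch using the functions as they shipped in later NumPy releases (`np.busday_count`, `np.busday_offset`). The holiday list is a made-up example, not a real calendar.

```python
import numpy as np

# Dates are stored as 64-bit integers with a unit (days here).
start = np.datetime64("2011-07-01")   # a Friday
end = np.datetime64("2011-07-11")

# Subtracting two datetime64 values yields a timedelta64.
elapsed = end - start

# Hypothetical holiday calendar, for illustration only.
holidays = np.array(["2011-07-04"], dtype="datetime64[D]")

# Business days in the half-open range [start, end),
# skipping weekends and the listed holiday.
n_busdays = np.busday_count(start, end, holidays=holidays)

# Step three business days forward from start.
settle = np.busday_offset(start, 3, roll="forward", holidays=holidays)

print(elapsed, n_busdays, settle)
```

The `holidays` argument is what makes the calendar extensible: any array of dates can be supplied per call.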

IPython

• One of Python’s killer apps gets even better

• Rich Qt GUI console with inline plotting

• New and improved architecture for high perf parallel / distributed computing

• See Fernando Pérez’s SciPy 2011 talk / video

Cython

• Still the first tool you should reach for to get better performance

• New: OpenMP integration (for multi-core)

• Supports (almost) all of standard Python now (some features, such as closures, previously did not work)

from cython.parallel import prange

with nogil:
    for i in prange(n):
        # do something in parallel

statsmodels

• Statistics and econometrics in Python

• Major work in time series models over last year+

• VAR, SVAR models, eventually (V)ECM models for cointegrated time series

• AR/ARMA, Kalman Filter, various macro filters (e.g. Hodrick-Prescott) implemented

• Soon: Bayesian state space models (DLMs), ARCH/GARCH models, etc.

statsmodels

• Major criticism: weak user interface

• No R-style formula framework

• pandas not integrated (need to pass raw NumPy arrays)

• I have begun work on pandas integration; formulas have been implemented and will hopefully arrive within the next few months

pandas

• Still the Python data hacker’s best friend?

• Most recent release: 0.3.0 on 2/20/2011

• However, last 4 months have been the most active development period in the library’s history

• ~375 commits since 0.3.0 release (more than the entire prior open source history)

The state of data structures

Ambitious big picture

• I want to make pandas the cornerstone of the “next generation” statistical computing environment

• Ease-of-use, performance, flexibility all equally important

Ambitious big picture

• Taking the best features of other languages (R and friends) and making them better and easier to use

• See my recent blog article “A Roadmap for Rich Scientific Data Structures in Python”

pandas: under the hood

• Complete redesign of DataFrame internals

• Now a single class for 2D data retaining optimal performance of old DataFrame and DataMatrix classes

• Significantly improved mixed-type and missing data handling

• Plan to use internal data structure to implement “NDFrame” for n-dimensional data

Fancy indexing

• Index a Series / DataFrame in a matrix-like way via the special .ix attribute, using:

• Slices with integers or labels

• Lists of integers, labels, or boolean vecs

• Integer or label locations

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']
df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
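The same selections can be sketched on a small frame. One caveat: `.ix` was removed in later pandas releases, so this sketch assumes a current pandas and uses its replacements, `.iloc` (integer positions) and `.loc` (labels). The data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-01-03", periods=6, freq="B")
df = pd.DataFrame(
    np.arange(24, dtype=float).reshape(6, 4) - 10.0,
    index=dates,
    columns=list("ABCD"),
)

first_row = df.iloc[0]                   # integer location
window = df.loc[dates[1]:dates[3]]       # label slice, both endpoints included
block = df.iloc[:5].loc[:, "A":"C"]      # integer slice on rows, label slice on columns

# Boolean row mask plus a list of column labels, with assignment.
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan

print(df)
```

Note that label slices include both endpoints, unlike integer slices.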

Misc new features

• “Sparse” (mostly NA) versions of Series, DataFrame, WidePanel

• Many new functions on Series/DataFrame

• describe, quantile, select, drop, dropna, corrwith, ...

• New moving window methods: rolling_quantile and rolling_apply
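A brief sketch of a few of the functions listed above, on random data. It assumes a current pandas, where the moving-window functions `rolling_quantile` and `rolling_apply` have since been replaced by the `.rolling(...)` method; the window length and column names are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 3)), columns=["A", "B", "C"])
df.iloc[::10, 1] = np.nan            # punch a few holes into column B

summary = df.describe()              # count, mean, std, quartiles per column
median_a = df["A"].quantile(0.5)
complete = df.dropna()               # keep only rows with no missing values
without_c = df.drop(columns=["C"])
corr_with_a = df.corrwith(df["A"])   # correlation of each column with A

# Moving-window statistics over 10 observations.
rolling_median = df["A"].rolling(window=10).quantile(0.5)
rolling_range = df["A"].rolling(window=10).apply(
    lambda w: w.max() - w.min(), raw=True
)

print(summary)
```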

Improved IO

• read_csv, read_table functions more flexible and robust, better type inferencing

• ExcelFile class for reading multiple sheets out of .xls files

df = read_table('foo.txt', skiprows=[0,1], na_values=['#N/A'])
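To make the call above concrete, here is a self-contained sketch that reads from an in-memory buffer instead of `foo.txt`. The file contents are invented; `index_col` and `parse_dates` are additions that are not part of the slide's example.

```python
import io
import pandas as pd

# Hypothetical tab-separated export with two junk lines before the header.
raw = io.StringIO(
    "exported by a hypothetical vendor tool\n"
    "generated 2011-07-01\n"
    "date\tprice\tvolume\n"
    "2011-06-28\t10.5\t100\n"
    "2011-06-29\t#N/A\t150\n"
    "2011-06-30\t10.9\t#N/A\n"
)

df = pd.read_table(
    raw,
    skiprows=[0, 1],        # drop the two junk lines
    na_values=["#N/A"],     # treat this marker as missing
    index_col="date",
    parse_dates=True,       # parse the index as dates
)

print(df.dtypes)
```

Columns containing a missing marker come back as floating point, since NaN is the missing-value representation here.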

Improved IO

• HDFStore class provides a complete, tested dict-like PyTables storage container

• Experimental: store as Table and query

store = HDFStore('mydata.h5')
store['x'] = x
store['y'] = y
y = store['y']

store.put('df', df, table=True)
piece = store.select('df', [{'field' : 'index', 'op' : '>=', 'value' : date}])

Group by enhancements

• Can group by multiple columns or key functions, SQL-like but more general

• Syntactic sugar to invoke aggregation functions on groups

• Automatic exclusion of “nuisance” columns of DataFrames

• Various other usability enhancements
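The group-by points above might look like this in practice. This is a sketch against a current pandas with invented trade data. Recent releases no longer drop non-numeric "nuisance" columns automatically, so the last call asks for that explicitly with `numeric_only=True`.

```python
import pandas as pd

trades = pd.DataFrame(
    {
        "sector": ["tech", "tech", "energy", "energy", "tech"],
        "side": ["buy", "sell", "buy", "buy", "buy"],
        "qty": [100, 50, 200, 100, 25],
        "note": ["a", "b", "c", "d", "e"],   # non-numeric "nuisance" column
    },
    index=pd.to_datetime(
        ["2010-12-30", "2010-12-31", "2011-01-03", "2011-01-04", "2011-01-05"]
    ),
)

# Group by multiple columns, as in SQL's GROUP BY sector, side.
by_keys = trades.groupby(["sector", "side"])["qty"].sum()

# Group by a key function applied to each index label.
by_year = trades.groupby(lambda ts: ts.year)["qty"].sum()

# Aggregate numeric columns only, leaving out "side" and "note".
means = trades.groupby("sector").mean(numeric_only=True)

print(by_keys, by_year, means, sep="\n\n")
```

Grouping by a function is what makes this more general than SQL: the key need not exist as a column.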

Very soon: hierarchical indexing

• Enable axis ticks to be identified by multiple labels instead of a single label

• Easily select subsets of data by “level”

• Create Excel-style pivot tables / cross-tabulations in a sensible way

• Will integrate naturally with groupby
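Hierarchical indexing was still upcoming when these slides were written, so the following is only a sketch of the idea using the API from later pandas releases (`set_index` with several keys, `xs`, `unstack`, `pivot_table`). Tickers and returns are invented.

```python
import pandas as pd

records = pd.DataFrame(
    {
        "date": ["2011-07-01", "2011-07-01", "2011-07-05", "2011-07-05"],
        "ticker": ["AAA", "BBB", "AAA", "BBB"],
        "ret": [0.01, -0.02, 0.03, 0.00],
    }
)

# Each observation is identified by two labels: (date, ticker).
panel = records.set_index(["date", "ticker"])["ret"]

# Select a subset of the data by level.
aaa = panel.xs("AAA", level="ticker")

# Pivot the inner level into columns (an Excel-style cross-tabulation).
wide = panel.unstack("ticker")
pivot = records.pivot_table(
    index="date", columns="ticker", values="ret", aggfunc="mean"
)

# Levels can be used directly as group keys.
by_ticker = panel.groupby(level="ticker").mean()

print(wide)
```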

Other misc things

• Flexible binary operators

• a.add(b, fill_value=0.)

• Some timezone support in DateRange

• Numerous performance optimizations

• See the (long) release notes =)
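The flexible binary operator mentioned above, `a.add(b, fill_value=0.)`, can be contrasted with plain `+` on two Series whose labels only partly overlap. The values are arbitrary.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

# Plain addition aligns on labels; a label missing on either side gives NaN.
plain = a + b

# With fill_value, a label missing on one side is treated as 0 before adding.
filled = a.add(b, fill_value=0.0)

print(plain)
print(filled)
```

Only the label `y` is present in both Series, so `plain` has a single non-missing entry while `filled` has none missing.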

Planned work

• Fast time series up/downsampling

• Improved support and perf for HF/tick data

• Even more sophisticated group by tools

• Better documentation, online screencast tutorials / examples

Thanks

• Email: [email protected]

• Twitter: @wesmckinn

• Blog: http://blog.wesmckinney.com

• pandas: http://github.com/wesm/pandas

• statsmodels: http://statsmodels.sourceforge.net
