What's new in pandas and the SciPy stack for financial users
What’s new in pandas and the SciPy stack for financial users
Wes McKinney
Me
• AQR: August 2007 - July 2010
• Duke Statistics: 2010 - present (now on leave)
• My plans
• Improving Python libs for statistics and finance
• Building a financial software + consulting business based on said tools
Core Python stack for finance
• NumPy, SciPy (heavy lifting)
• pandas (data handling / computation)
• IPython (dev and research env)
• Cython (perf optimization)
• matplotlib (visualization)
• statsmodels (statistics / econometrics)
General sentiments
• Scientific Python growing solidly in finance and in many other fields
• Though good sci-pythonistas are still scarce
• Important work happening in many of the core projects
• Growing consensus: a new computational model is needed to better cope with “big data”
NumPy
• Significantly refactored C internals
• Great progress on native datetime64 type
• Will significantly improve date-handling performance and usability
• Extensible business day / holiday logic planned / in progress
• Addition of low-level missing data (NA) support in the works
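A minimal sketch of what datetime64 plus business day / holiday logic can look like, assuming a NumPy release in which this work has shipped. The dates and the single holiday are made up for illustration.

```python
import numpy as np

# Eight consecutive calendar days as a native datetime64[D] array
# (the stop date is exclusive, as with any arange)
days = np.arange("2011-07-01", "2011-07-09", dtype="datetime64[D]")

# Hypothetical holiday list: Monday 2011-07-04
cal = np.busdaycalendar(holidays=["2011-07-04"])

# Boolean mask of valid business days under that calendar
is_bday = np.is_busday(days, busdaycal=cal)

# Number of business days in [2011-07-01, 2011-07-09)
n_bdays = np.busday_count("2011-07-01", "2011-07-09", busdaycal=cal)
```

The calendar object carries both the weekend mask and the holiday list, so the same definition can be reused across calls.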
IPython
• One of Python’s killer apps gets even better
• Rich Qt GUI console with inline plotting
• New and improved architecture for high perf parallel / distributed computing
• See Fernando Pérez’s SciPy 2011 talk / video
Cython
• Still the first tool you should reach for to get better performance
• New: OpenMP integration (for multi-core)
• Supports (almost) all of standard Python now (some things, like closures, used to not work)
from cython.parallel import prange

with nogil:
    for i in prange(n):
        # do something in parallel
statsmodels
• Statistics and econometrics in Python
• Major work in time series models over last year+
• VAR, SVAR models, eventually (V)ECM models for cointegrated time series
• AR/ARMA, Kalman Filter, various macro filters (e.g. Hodrick-Prescott) implemented
• Soon: Bayesian state space models (DLMs), ARCH/GARCH models, etc.
statsmodels
• Major criticism: weak user interface
• No R-style formula framework
• pandas not integrated (need to pass raw NumPy arrays)
• I have begun work on pandas integration; formulas have been implemented and will hopefully arrive within the next few months
pandas
• Still the Python data hacker’s best friend?
• Most recent release: 0.3.0 on 2/20/2011
• However, last 4 months have been the most active development period in the library’s history
• ~375 commits since 0.3.0 release (more than the entire prior open source history)
The state of data structures
Ambitious big picture
• I want to make pandas the cornerstone of the “next generation” statistical computing environment
• Ease-of-use, performance, flexibility all equally important
Ambitious big picture
• Taking the best features of other languages (R and friends) and making them better and easier to use
• See my recent blog article “A Roadmap for Rich Scientific Data Structures in Python”
pandas: under the hood
• Complete redesign of DataFrame internals
• Now a single class for 2D data retaining optimal performance of old DataFrame and DataMatrix classes
• Significantly improved mixed-type and missing data handling
• Plan to use internal data structure to implement “NDFrame” for n-dimensional data
Fancy indexing
• Index a Series / DataFrame in a matrix-like way via the special .ix attribute, using:
• Slices with integers or labels
• Lists of integers, labels, or boolean vecs
• Integer or label locations
df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']
df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
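A minimal sketch of the same kinds of selection, assuming a recent pandas release. There .ix is no longer available, and label-based .loc and position-based .iloc cover these cases. The frame, its dates, and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 8 business days, columns A..F
dates = pd.date_range("2011-01-03", periods=8, freq="B")
df = pd.DataFrame(
    np.arange(48, dtype=float).reshape(8, 6) - 20.0,
    index=dates,
    columns=list("ABCDEF"),
)

first_row = df.iloc[0]                   # integer location
window = df.loc[dates[1]:dates[3]]       # label slice, both endpoints included
block = df.loc[df.index[:5], "A":"F"]    # first five rows, columns A through F

# Boolean row mask plus a list of column labels, assigned in place
df.loc[df["A"] > 0, ["B", "C", "D"]] = np.nan
```

Note that label slices include the right endpoint, unlike integer slices.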
Misc new features
• “Sparse” (mostly NA) versions of Series, DataFrame, WidePanel
• Many new functions on Series/DataFrame
• describe, quantile, select, drop, dropna, corrwith, ...
• New moving window methods: rolling_quantile and rolling_apply
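A short sketch of a few of these functions, assuming a recent pandas release. In later versions the rolling_quantile and rolling_apply functions are spelled as methods on the object returned by .rolling(...), which is what is used here. The data is made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

clean = s.dropna()            # drops the missing entry
median = s.quantile(0.5)      # missing values are skipped
summary = s.describe()        # count, mean, std, min, quartiles, max

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
corr = df.corrwith(df["x"])   # correlation of each column with x

# Moving-window statistics over 3 observations
roll_med = df["x"].rolling(window=3).quantile(0.5)
roll_rng = df["x"].rolling(window=3).apply(lambda w: w.max() - w.min(), raw=True)
```

The first two entries of each rolling result are NaN because the window is not yet full.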
Improved IO
• read_csv, read_table functions more flexible and robust, better type inferencing
• ExcelFile class for reading multiple sheets out of .xls files
df = read_table('foo.txt', skiprows=[0,1], na_values=['#N/A'])
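A self-contained sketch of the call above, reading from an in-memory string instead of a file. The file contents and column names are hypothetical; the skiprows and na_values arguments are the ones shown on the slide.

```python
import io
import pandas as pd

# Hypothetical tab-separated file: two junk lines, then a header and data
raw = (
    "exported by some system\n"
    "generated 2011-06-01\n"
    "date\tprice\tvolume\n"
    "2011-05-02\t10.5\t100\n"
    "2011-05-03\t#N/A\t200\n"
    "2011-05-04\t11.0\t#N/A\n"
)

df = pd.read_table(
    io.StringIO(raw),
    skiprows=[0, 1],        # skip the two junk lines
    na_values=["#N/A"],     # treat this marker as missing
    index_col="date",
    parse_dates=True,       # parse the index column as dates
)
```

Because of the missing marker, both data columns are inferred as floating point with NaN in the affected rows.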
Improved IO
• HDFStore class provides a complete, tested dict-like PyTables storage container
• Experimental: store as Table and query
store = HDFStore('mydata.h5')
store['x'] = x
store['y'] = y
y = store['y']

store.put('df', df, table=True)
piece = store.select('df', [{'field': 'index', 'op': '>=', 'value': date}])
Group by enhancements
• Can group by multiple columns or key functions, SQL-like but more general
• Syntactic sugar to invoke aggregation functions on groups
• Automatic exclusion of “nuisance” columns of DataFrames
• Various other usability enhancements
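The group by points above can be sketched as follows, assuming a recent pandas release. The column names and values are made up. The numeric column is selected explicitly here, since newer releases may not drop non-numeric columns automatically.

```python
import pandas as pd

df = pd.DataFrame({
    "sector": ["tech", "tech", "energy", "energy", "tech"],
    "region": ["US", "EU", "US", "US", "US"],
    "ret": [0.02, -0.01, 0.03, 0.01, 0.04],
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],  # non-numeric column
})

# Group by several columns at once
by_cols = df.groupby(["sector", "region"])["ret"].mean()

# Group by a key function applied to each index label (even / odd row number)
by_func = df["ret"].groupby(lambda i: i % 2).sum()

# Shorthand methods such as .mean() and .sum(), or several aggregations via .agg
stats = df.groupby("sector")["ret"].agg(["count", "max"])
```

Grouping by more than one key yields a result indexed by the key combinations that actually occur in the data.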
Very soon: hierarchical indexing
• Enable axis ticks to be identified by multiple labels instead of a single label
• Easily select subsets of data by “level”
• Create Excel-style pivot tables / cross-tabulations in a sensible way
• Will integrate naturally with groupby
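A minimal sketch of the ideas in this list, written against the hierarchical index as it exists in later pandas releases (MultiIndex). The data and level names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2011-06-01", "2011-06-01", "2011-06-02", "2011-06-02"],
    "ticker": ["AAA", "BBB", "AAA", "BBB"],
    "price": [10.0, 20.0, 11.0, 19.0],
})

# Two labels per row: (date, ticker)
indexed = df.set_index(["date", "ticker"])

# Select everything for one value of a level
aaa = indexed.xs("AAA", level="ticker")

# Move the inner level into columns, as in a cross-tabulation
wide = indexed["price"].unstack("ticker")

# groupby can aggregate along a level of the index
mean_by_ticker = indexed.groupby(level="ticker")["price"].mean()
```

Here wide has one row per date and one column per ticker, which is the pivot-table layout the slide refers to.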
Other misc things
• Flexible binary operators
• a.add(b, fill_value=0.)
• Some timezone support in DateRange
• Numerous performance optimizations
• See the (long) release notes =)
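To illustrate the flexible binary operators mentioned in this list, here is a small sketch comparing the plain + operator with a.add(b, fill_value=0.). The two Series are made up.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

# Plain addition aligns on labels; a label missing on either side gives NaN
plain = a + b

# With fill_value, the missing side is treated as 0 before adding
filled = a.add(b, fill_value=0.0)
```

Only the label "y" appears in both inputs, so plain has a single non-missing value while filled has none missing.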
Planned work
• Fast time series up/downsampling
• Improved support and perf for HF/tick data
• Even more sophisticated group by tools
• Better documentation, online screencast tutorials / examples
Thanks
• Email: [email protected]
• Twitter: @wesmckinn
• Blog: http://blog.wesmckinney.com
• pandas: http://github.com/wesm/pandas
• statsmodels: http://statsmodels.sourceforge.net