Rethinking classical approaches to analysis and predictive modeling
Uploaded by analyticsweek. Category: Technology.
2
About 1010data
• Founded in 2000
• Based in NYC
• Big Data analytics platform in the cloud
• Library of pre-built analytical applications
• Speed, power and flexibility second to none
3
We Host/Analyze 14+ Trillion Rows of Data
All Quotes and Trades since 2003 on NYSE are done on 1010data
All mortgages ever issued are analyzed on 1010data
Nearly all real-estate transactions are completed on 1010data
Big Data - Granular Data - Time series Data
All data for ~35,000 Retail outlets across the US are analyzed on 1010data
4
A Typical BI Technology Stack
Administrators
Data Sources
ETL
Inter-Enterprise Users
EDW
Data Cubes/ Marts
Reporting / Visualization
Analysis / Modeling
5
The Stack Has Fallen!
6
The Analytics Continuum & A Single Version of the Truth
7
Intuitive Access to Unlimited Amounts of Data
Partner Data
3rd Party Data
1010data Cloud
Corporate Data
425,369,127,325 Rows!
8
The code: Chart 1
<layout background_="white" border_="1" height_="525" name="candlestick_layout" relpos_="0,50" width_="650">
  <widget base_="nyse.trades.hist.all" class_="graphics" invmode_="hide" name="candlestick" relpos_="25,25" update_="manual" width_="600">
    <sel value="between(date;'{@startdate}';'{@enddate}')"/>
    <sel value="(symbol='{@symbol}')"/>
    <tabu label="Candle Stick" breaks="date">
      <break col="date" sort="up"/>
      <tcol source="prc" fun="wavg" name="vwap" weight="vol" label="VWAP"/>
      <tcol source="prc" fun="hi" name="high" label="High"/>
      <tcol source="prc" fun="lo" name="low" label="Low"/>
      <tcol source="prc" fun="first" name="open" label="Open"/>
      <tcol source="prc" fun="last" name="close" label="Close"/>
    </tabu>
    <graphspec>
      <chart type="candlestick" title="Candlestick Chart for {@symbol}">
        <axes xlabel="Date" ylabel="Trading Price"/>
      </chart>
    </graphspec>
  </widget>
  <widget class_="button" name="candlestick_refresh" relpos_="475,475" submit_="candlestick" text_="Refresh" type_="submit"/>
  <widget class_="field" label_="Choose Symbol:" name="symbol_input" relpos_="125,475" value_="@symbol"/>
</layout>
Query Chart Spec
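For readers unfamiliar with the query syntax, the aggregation in the chart code (one row per date with VWAP, high, low, open and close) can be sketched in plain Python. This is only an illustration of the arithmetic, not 1010data's implementation; the function name `daily_candles` and the sample trades are hypothetical, and the sketch assumes trades arrive in time order with positive volume.

```python
from collections import defaultdict

def daily_candles(trades):
    """Aggregate (date, price, volume) trades into one candle per date.

    Assumes trades are in time order within each date and that total
    volume per date is positive (hypothetical helper, for illustration).
    """
    by_date = defaultdict(list)
    for date, price, volume in trades:
        by_date[date].append((price, volume))

    candles = {}
    for date in sorted(by_date):  # mirrors the ascending sort on date
        rows = by_date[date]
        prices = [p for p, _ in rows]
        total_volume = sum(v for _, v in rows)
        candles[date] = {
            # volume-weighted average price (price weighted by volume)
            "vwap": sum(p * v for p, v in rows) / total_volume,
            "high": max(prices),
            "low": min(prices),
            "open": prices[0],    # first trade of the day
            "close": prices[-1],  # last trade of the day
        }
    return candles

# Hypothetical trades, not real NYSE data.
sample_trades = [
    ("2013-01-02", 10.0, 100),
    ("2013-01-02", 12.0, 300),
    ("2013-01-02", 11.0, 100),
    ("2013-01-03", 11.5, 200),
]
candles = daily_candles(sample_trades)
```

Each dictionary entry corresponds to one candle on the chart; the five keys match the five computed columns in the query.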
9
Predictive Analytics on a Big Data Scale!
Big Data mandated Analytics and predictive modeling - an example:
The larger data sets have mandated more rigorous sampling strategies, as traditional systems have not kept up with the computational needs of predictive analytic solutions on Big Data.
• Can we use all but a small holdout set in predictive modeling?
• What are the challenges?
• What is an approach that works?
• Are the results any good?
• Is this solution only applicable to one industry?
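The first question, training on everything except a small holdout set, can be sketched as follows. This is a minimal illustration under assumptions: the function name `holdout_split`, the 5% holdout fraction and the fixed seed are all hypothetical choices, not values from the talk.

```python
import random

def holdout_split(rows, holdout_fraction=0.05, seed=0):
    """Reserve a small random holdout set and train on everything else.

    holdout_fraction and seed are illustrative assumptions.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    rng = random.Random(seed)          # seeded so the split is repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_holdout = max(1, round(len(rows) * holdout_fraction))
    holdout_idx = set(indices[:n_holdout])
    train = [r for i, r in enumerate(rows) if i not in holdout_idx]
    holdout = [r for i, r in enumerate(rows) if i in holdout_idx]
    return train, holdout

rows = list(range(1000))               # stand-in for real records
train, holdout = holdout_split(rows)
```

The point of the sketch is that no down-sampling of the training data takes place: only the holdout rows are set aside for testing.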
10
Common Predictive Modeling Approach
• CPU intensive & error prone steps:
» Data selection
» IV to DV relationship
» Transformations
» Sampling and validation
» Model estimation
» Model testing
» Repeat
http://onlinepubs.trb.org/onlinepubs/nchrp/cd-22/v2chapter5.html
11
“One Segment” => “A Segment of One”
“Any customer can have a car painted any color that he wants so long as it is black.” re: the Model-T in 1909 (from My Life and Work, Henry Ford, 1922, Chap. 4, p. 71)
12
Harry Truman displays a copy of the Chicago Daily Tribune newspaper that erroneously reported the election of Thomas Dewey in 1948. Truman’s narrow victory embarrassed pollsters, members of his own party, and the press who had predicted a Dewey landslide.
13
Build A 30 Day Shopping List For Each Loyal Shopper at a Retail Chain
Shopper    SKU     Probability of purchase in the next 30 days
A. Smith   12345   90%
A. Smith   23567   85%
A. Smith   ….
A. Smith   87996   30%
POS
Loyalty
Econ: House prices, Mortgage Rates, BLS - Unemployment
Inventory
With Permission from A&P
14
If The Shopper Bought “It” Before Will They Buy “It” Again?
• Classical modeling: variables as either positively or negatively correlated with target
• Shoppers don’t behave the same!
• The demographic attributes have distributions for each variable!
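One way to picture a per-shopper estimate, as opposed to a single pooled correlation, is a simple repeat-purchase frequency for each shopper and SKU. The sketch below is a stand-in estimator chosen for illustration and is not the model used in the talk; the function name `shopping_list`, the window indexing and the sample history (including shopper "B. Jones") are hypothetical.

```python
from collections import defaultdict

def shopping_list(purchases, n_windows, top_k=3):
    """Rank SKUs per shopper by the share of past 30-day windows
    in which that shopper bought the SKU.

    purchases: iterable of (shopper, sku, window_index), where
    window_index identifies one of n_windows past 30-day periods.
    This frequency estimate is an illustrative assumption.
    """
    windows_seen = defaultdict(set)
    for shopper, sku, window in purchases:
        windows_seen[(shopper, sku)].add(window)

    per_shopper = defaultdict(list)
    for (shopper, sku), windows in windows_seen.items():
        per_shopper[shopper].append((sku, len(windows) / n_windows))

    # Highest estimated probability first; ties broken by SKU.
    return {
        shopper: sorted(items, key=lambda item: (-item[1], item[0]))[:top_k]
        for shopper, items in per_shopper.items()
    }

# Hypothetical purchase history over four past windows.
history = [
    ("A. Smith", "12345", 0), ("A. Smith", "12345", 1),
    ("A. Smith", "12345", 2), ("A. Smith", "12345", 3),
    ("A. Smith", "23567", 0), ("A. Smith", "23567", 2),
    ("A. Smith", "23567", 3),
    ("A. Smith", "87996", 1),
    ("B. Jones", "12345", 3),
]
lists = shopping_list(history, n_windows=4)
```

The same SKU gets a different score for each shopper, which is the sense in which every shopper is treated as their own segment.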
15
Subscribers are “A Segment Of One”!
16
All sources of Prepay as analyzed in 1989
D
R
M
Interest Rates
House prices
Unemployment
Loan Age
Cost of option
Regional economy I
http://www.freeusandworldmaps.com/html/US_Counties/US_Counties.html
http://www.tradingeconomics.com/united-states/unemployment-rate
http://www.wfa.gov/
http://www.richmondfed.org/banking/markets_trends_and_statistics/trends/pdf/delinquency_and_foreclosure_rates.pdf
17
Quality Measures: Lift => AUC
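Both measures named on this slide can be computed from a list of true outcomes and model scores. The sketch below uses the standard definitions (AUC as the share of positive/negative pairs ranked correctly, lift as the response rate in the top-scored fraction divided by the overall rate); the function names and the sample labels and scores are hypothetical, and the pairwise AUC loop is quadratic, so it is meant for illustration rather than for large data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def lift_at(labels, scores, fraction=0.1):
    """Response rate in the top-scored fraction divided by the base rate."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    return top_rate / base_rate

# Hypothetical outcomes (1 = event occurred) and model scores.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
model_auc = auc(labels, scores)
top_lift = lift_at(labels, scores, fraction=0.2)
```

Lift describes performance at one cutoff, while AUC summarizes ranking quality across all cutoffs, which may be why the slide moves from one to the other.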
18
Fine vs. Coarse: Cash flows
19
InQuery analytics – User Defined Group Functions
• User defined
− KNN
− Naïve Bayes
− ARCH/AR
− PCA
− Kernel
− Decision Tree
− Logistics trees
− FFT
− Etc.
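The general idea of a user-defined group function, a caller-supplied function applied to each group of rows, can be sketched in Python. This is not the 1010data API; `group_apply` and `ar1_coefficient` are hypothetical names, the AR(1) fit (least squares, no intercept) stands in for the "AR" item in the list, and the sample rows are made up.

```python
from collections import defaultdict

def group_apply(rows, key, value, fn):
    """Group rows by `key`, collect `value` per group, apply `fn` to each."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: fn(v) for k, v in groups.items()}

def ar1_coefficient(series):
    """Least-squares AR(1) coefficient with no intercept: x_t ~ phi * x_(t-1)."""
    num = sum(curr * prev for prev, curr in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    if den == 0:
        raise ValueError("series has no variation to fit")
    return num / den

# Hypothetical time-ordered rows for two symbols.
rows = [
    {"symbol": "AAA", "value": 1.0},
    {"symbol": "AAA", "value": 2.0},
    {"symbol": "BBB", "value": 3.0},
    {"symbol": "AAA", "value": 4.0},
    {"symbol": "BBB", "value": 3.0},
    {"symbol": "AAA", "value": 8.0},
    {"symbol": "BBB", "value": 3.0},
]
coefficients = group_apply(rows, "symbol", "value", ar1_coefficient)
```

Any of the other listed techniques could be passed as `fn` in the same way, provided it takes one group's values and returns a result.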
20
Questions?