Kenett On Information NYU-Poly 2013
![Page 1: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/1.jpg)
Financial and Risk
Applications of InfoQ
Prof. Ron S. Kenett
KPA Ltd., Raanana, Israel
Universita degli Studi di Torino, Turin, Italy
NYU Poly, New York, USA
![Page 2: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/2.jpg)
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
http://galitshmueli.com/content/predicting-changes-quarterly-corporate-earnings-using-economic-indicators
This study looks at corporate earnings in relation to an existing theory of business forecasting developed by Joseph H. Ellis (former research analyst at Goldman Sachs).
2
![Page 3: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/3.jpg)
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
http://galitshmueli.com/content/predicting-zillowcom-s-zestimate-accuracy
Zillow.com is a free real estate service that calculates an estimated home valuation (a "Zestimate") for most homes in the U.S., offered as a starting point for anyone to see. The study looks at the accuracy of Zestimates.
3
![Page 4: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/4.jpg)
Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
http://galitshmueli.com/content/predicting-first-day-returns-japanese-ipos
An Initial Public Offering (IPO) is the first sale of stock by a company to the public. The study looks at the first-day returns on IPOs of Japanese companies.
4
![Page 5: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/5.jpg)
Information Quality
InfoQ(f, X, g) = U( f(X | g) )
The potential of a particular dataset to achieve a particular goal using a given empirical analysis method
Depends on the quality of g, X, f, U and the relationship between them
g: A specific analysis goal
X: The available dataset
f: An empirical analysis method
U: A utility measure
Kenett, R.S. and Shmueli, G. (2013) On Information Quality, Journal of the Royal Statistical Society, Series A (with discussion), 176(4). http://ssrn.com/abstract=1464444
5
![Page 6: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/6.jpg)
Analysis goal
g: Explain, predict, describe; enumerative, analytic; exploratory, confirmatory
Goal Specification
• “Error of the third kind”: giving the right answer to the wrong question (Kimball)
• “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise” (Tukey)
6
![Page 7: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/7.jpg)
Typical Goals of Customer Surveys
Analysis goal g
Goal 1. Decide where to launch improvement initiatives
Goal 2. Highlight drivers of overall satisfaction
Goal 3. Detect positive or negative trends in customer satisfaction
Goal 4. Identify best practices by comparing products
Goal 5. Determine strengths and weaknesses
Goal 6. Set up improvement goals
Goal 7. Design a balanced scorecard with customer inputs
Goal 8. Communicate the results using graphics
Goal 9. Assess the reliability of the questionnaire
Goal 10. Improve the questionnaire for future use
7
![Page 8: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/8.jpg)
X Available data
Data Source
• Primary, secondary
• Observational, experimental
• Single, multiple sources
• Collection instrument, protocol
Data Type
• Continuous, categorical, semantic
• Structured, unstructured, semi-structured
• Cross-sectional, time series, panel, network, geographical
Data Quality
• “Zeroth Problem - How do the data relate to the problem, and what other data might be relevant?” (Mallows)
• Quality of Statistical Data (IMF, OECD): usefulness of summary statistics for a particular goal (7 dimensions)
Data Size and Dimension
• # observations
• # variables
8
![Page 9: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/9.jpg)
f Data analysis method
Analysis Quality
• “poor models and poor analysis techniques, or even analyzing the data in a totally incorrect way.” (Godfrey)
• Analyst expertise
• Software availability
• The focus of statistics education
Statistical models and methods
• Parametric, semi-parametric, non-parametric
• Classic, Bayesian
Data mining algorithms
Graphical methods
Operations research methods
9
![Page 10: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/10.jpg)
Utility measure U
Utility Measure
• Adequate metric from analysis standpoint (R², holdout data)
• Adequate metric from domain standpoint

• Predictive accuracy, lift
• Goodness-of-fit
• Statistical power, statistical significance
• Strength-of-fit
• Expected costs, gains
• Bias reduction, bias-variance tradeoff
10
![Page 11: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/11.jpg)
11
An example….
Analysis goal g
Goal of study:
1. Predict the final price of an eBay auction at the start of the auction
2. Predict the price during an ongoing auction
3. Predict the auctions with the highest prices (ranking)
4. Identify factors that determine the final price of an eBay auction
X Available data
“Pennies from eBay: The determinants of price in online auctions”, Lucking-Reiley D., Bryan D., Prasad N. & Reeves D., Journal of Industrial Economics, 2007
![Page 12: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/12.jpg)
12
X Available data
461 eBay coin auctions (Indian Head pennies)
Auction characteristics
• Duration
• Open and close prices
• Number of bids and bidders
• Secret reserve price
• Weekday/weekend ending
Seller characteristics
• Seller rating
Item characteristics
• Year and grade of coin
“Pennies from eBay: The determinants of price in online auctions”, Lucking-Reiley D., Bryan D., Prasad N. & Reeves D., Journal of Industrial Economics, 2007
![Page 13: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/13.jpg)
13
An example….
f Data analysis method
Dimension Reduction
![Page 14: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/14.jpg)
14
An example….
f Data analysis method
Utility measure U
Prediction error:
• Holdout data
• Metrics such as MAPE and RMSE
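The two prediction-error metrics named on this slide can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the three holdout prices are made up, and the MAPE shown is the common mean-absolute-percentage form.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired observations."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical holdout auctions: final prices and model predictions (made-up numbers).
holdout_actual = [10.0, 20.0, 40.0]
holdout_predicted = [12.0, 18.0, 40.0]

print(f"RMSE = {rmse(holdout_actual, holdout_predicted):.3f}")
print(f"MAPE = {mape(holdout_actual, holdout_predicted):.2f}%")
```

Both metrics are computed only on the holdout records, so they estimate error on auctions the model has not seen during fitting.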
![Page 15: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/15.jpg)
Statistical Approaches for Increasing InfoQ
Study Design (Pre-Data)
• DOE
• Clinical trials
• Survey sampling
• Computer experiments
Randomization, Stratification, Blinding, Placebo, Blocking, Replication, Sampling frame, Link data collection protocol with appropriate design
Post-Data-Collection
• Data cleaning and preprocessing
• Re-weighting, bias adjustment
• Meta analysis
Recovering “real data” vs. “cleaning for the goal”
Handling missing values, outlier detection, re-weighting, combining results
15
![Page 16: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/16.jpg)
Assessing InfoQ
“Quality of Statistical Data” (Eurostat, OECD, NCSES, …)
• Relevance
• Accuracy
• Timeliness and punctuality
• Accessibility
• Interpretability
• Coherence
• Credibility
InfoQ dimensions
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
Marketing Research
• Recency
• Accuracy
• Availability
• Relevance
3 V’s of Big Data
• Volume
• Variety
• Velocity
4 V’s of Big Data
• Volume
• Variety
• Velocity
• Veracity
16
![Page 17: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/17.jpg)
#1 Data Resolution
17
![Page 18: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/18.jpg)
#2 Data Structure
Data Types
• Time series, cross-sectional, panel
• Structured, semi-structured, non-structured
• Geographic, spatial, network
• Text, audio, video, semantic
• Discrete, continuous
Data Characteristics
• Corrupted and missing values due to study design or data collection mechanism
18
![Page 19: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/19.jpg)
19
www.riscoss.eu
Managing Risk and Costs in OSS Adoption
#2 Data Structure
![Page 20: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/20.jpg)
20
#2 Data Structure
![Page 21: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/21.jpg)
21
Who talks to whom?
IRC chat archives: http://dev.xwiki.org/xwiki/bin/view/IRC/WebHome
XWiki Community
#2 Data Structure
![Page 22: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/22.jpg)
XWiki Community
Use association rules to characterize the content of the clusters (R packages tm, arules)
#2 Data Structure
![Page 23: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/23.jpg)
XWiki Community
#2 Data Structure
![Page 24: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/24.jpg)
#3 Data Integration
Linkage, privacy-preserving methods: Increase or decrease InfoQ?
24
![Page 25: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/25.jpg)
#4 Temporal Relevance
[Timeline figure: Data Collection, Data Analysis, and Study Deployment placed along time points t1 to t6; label: forecast]
Collection Timeliness (relevance to g)
Analysis Timeliness (solving the right problem too late)
g: Prospective vs. retrospective; longitudinal vs. snapshot
Nature of X, complexity of f
25
![Page 26: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/26.jpg)
#5 Chronology of Data & Goal
Data: Daily AQI in a city
g1: Reverse-engineer AQI
g2: Forecast AQI
• Retrospective/prospective
• Ex-post availability
• Endogeneity
26 http://www.airnow.gov/?action=aqibasics.aqi
![Page 27: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/27.jpg)
#6 Generalizability
Statistical generalizability
Scientific generalizability
Definition of g
Choice of X, f, U
27
![Page 28: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/28.jpg)
#7 (Construct) Operationalization
χ: construct
X = θ(χ): operationalization (measurable)
• Causal explanation vs. prediction, description
• Theory vs. data
• Data: Questionnaire, physiological measurement
28
![Page 29: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/29.jpg)
#7 (Action) Operationalization
29 http://www.spcpress.com/pdf/DJW187.pdf
![Page 30: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/30.jpg)
#7 Operationalization
30
National Education Goals Panel (NEGP) recommended that states answer four questions on their student reports:
1. How did my child do?
2. What types of skills or knowledge does his or her performance reflect?
3. How did my child perform in comparison to other students in the school, district, state, and, if available, the nation?
4. What can I do to help my child improve?
![Page 31: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/31.jpg)
31
#7 Operationalization
http://sat.collegeboard.org/practice/sat-skills-insight/writing/band/200
![Page 32: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/32.jpg)
32
#7 Operationalization
http://sat.collegeboard.org/practice/sat-skills-insight/writing/band/200
![Page 33: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/33.jpg)
33
#8 Communication
1992 NAEP Executive Summary Report
When asked what the 18% in line 1 meant, 53% of the policy makers responded incorrectly
![Page 34: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/34.jpg)
34
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
![Page 35: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/35.jpg)
35
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
![Page 36: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/36.jpg)
36
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
![Page 37: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/37.jpg)
37
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
![Page 38: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/38.jpg)
38
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
![Page 39: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/39.jpg)
39
#8 Communication
http://nces.ed.gov/nationsreportcard/itemmaps/index.asp
![Page 40: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/40.jpg)
40
#8 Communication
The Israeli version……
http://rama.education.gov.il
[Table: the Hebrew row and column labels were lost in extraction. Each row holds three groups of three values; the first value in each group is N.]
18,684 501 102 | 13,182 521 87 | 5,502 454 118
21,407 500 100 | 14,466 524 84 | 6,941 444 111
20,644 524 91 | 14,787 536 80 | 5,857 496 106
19,165 524 86 | 13,379 532 77 | 5,786 506 101
19,631 532 76 | 13,961 537 73 | 5,670 519 81
* 20,222 528 77 | 13,957 541 70 | 6,265 498 82
![Page 41: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/41.jpg)
41
http://www.madlan.co.il/education/schools
#8 Communication
The Israeli version……
![Page 42: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/42.jpg)
Assessing InfoQ in Practice
Rating-based assessment
1-5 scale on each dimension:
InfoQ Score = $[d_1(Y_1) \times d_2(Y_2) \times \cdots \times d_8(Y_8)]^{1/8}$
Experience from two research methods courses
– Preparing a PhD research proposal (U Ljubljana, 50 students, goo.gl/f6bIA)
– Post-hoc evaluation of five completed studies (CMU, 16 students, goo.gl/erNPF)
42
| # | Dimension | Note | Value | Index |
|---|---|---|---|---|
| 1 | Data resolution | | 5 | 1.0000 |
| 2 | Data structure | | 4 | 0.7500 |
| 3 | Data integration | | 5 | 1.0000 |
| 4 | Temporal relevance | | 5 | 1.0000 |
| 5 | Generalizability | | 3 | 0.5000 |
| 6 | Chronology of data and goal | | 5 | 1.0000 |
| 7 | Concept operationalization | | 2 | 0.2500 |
| 8 | Communication | | 3 | 0.5000 |

InfoQ Score = 0.68 (InfoQ = 68%)
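The score in this table can be reproduced with a short sketch. It assumes that the index column is obtained from the 1-5 rating as (rating - 1) / 4, which is inferred from the table values (5 gives 1.00, 4 gives 0.75, 3 gives 0.50, 2 gives 0.25) rather than stated on the slide; the function names are illustrative.

```python
import math

def desirability(rating):
    """Map a 1-5 rating to a 0-1 index, assuming index = (rating - 1) / 4."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return (rating - 1) / 4

def infoq_score(ratings):
    """Geometric mean of the per-dimension indices."""
    indices = [desirability(r) for r in ratings]
    return math.prod(indices) ** (1 / len(indices))

# Ratings for the eight dimensions, in the order listed in the table above.
ratings = [5, 4, 5, 5, 3, 5, 2, 3]
score = infoq_score(ratings)
print(f"InfoQ Score = {score:.2f}")
```

Because the score is a geometric mean, a rating of 1 on any single dimension (index 0) drives the whole score to 0, so under this mapping one very weak dimension cannot be offset by strong ratings elsewhere.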
![Page 43: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/43.jpg)
InfoQ: Strengths and Challenges
InfoQ approach streamlines questioning of data value
• “Why should we invest in data?” (management)
• Compare value of potential datasets, analyses
• Prioritize/rank projects
• Strengthen functional-analytical relationship
Multiple goals:
• Goals can change during study: Reevaluate InfoQ
• Multiple goals: Prioritize (e.g., clinical trials: effect of new drug, adverse effects)
To Do:
• Improve InfoQ assessment
• Alternative InfoQ assessment approaches (pilot study, EDA, other)
• Further dimensions (data privacy, human subject compliance and risk)
• Effect of technological advances on InfoQ
43
![Page 44: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/44.jpg)
Primary Data: Experimental, Observational
Secondary Data: Experimental, Observational
Data Quality
Information Quality
Analysis Quality
Knowledge
g A specific analysis goal
X The available dataset
f An empirical analysis method
U A utility measure
1.Data resolution
2.Data structure
3.Data integration
4.Temporal relevance
5.Chronology of data and goal
6.Generalizability
7.Operationalization
8.Communication
What
How
Goals InfoQ(f,X,g) = U(f(X|g))
Information Quality
![Page 45: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/45.jpg)
Russom, P., Big Data Analytics, TDWI Best Practices Report, Q4 2011
Massive data sets
1. Data resolution
2. Data structure
3. Data integration
4. Temporal relevance
5. Chronology of data and goal
6. Generalizability
7. Operationalization
8. Communication
Big data Analytics
![Page 46: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/46.jpg)
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
Stages in an economic downturn: 1) the peak, 2) modest slowing, 3) intensifying worry among investors (a lot of panic selling occurs in this stage), and 4) the advent of recession. Can we predict a slowdown in corporate earnings (S&P 500 EPS) well in advance?
Ellis claims (based on observations) that there is a 0-9 month lag between changes in wages and their effect on consumer spending, a 0-6 month lag until changes in consumer spending affect industrial production, another 6-12 months between industrial production and capital spending, and finally another 6-12 months between capital spending and its effect on corporate profits.
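As a rough worked example, the four lag ranges can be added up to see how far ahead wages might lead corporate profits. This assumes the stages form a strictly serial chain whose lags simply add, which is one reading of the slide and not a claim made by Ellis here.

```python
# Lag ranges in months as stated above, read as a serial chain (assumption).
lags = [
    ("wages -> consumer spending", 0, 9),
    ("consumer spending -> industrial production", 0, 6),
    ("industrial production -> capital spending", 6, 12),
    ("capital spending -> corporate profits", 6, 12),
]

min_total = sum(low for _, low, _ in lags)
max_total = sum(high for _, _, high in lags)
print(f"Cumulative lag from wages to corporate profits: {min_total}-{max_total} months")
```

Under that reading the cumulative lead time is wide, which is part of why the study tests empirically which lagged indicators actually help prediction.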
46
![Page 47: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/47.jpg)
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
Ellis model:
47
![Page 48: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/48.jpg)
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
The data: i) 180 quarters, 6 economic x variables; ii) y variable = change in S&P EPS; iii) all variables transformed to year-over-year % change; iv) all data used are publicly available via the websites of US agencies (BEA, BLS, FED) and S&P.
The analysis: XLMiner was run on the different versions of the dataset after partitioning. Methods applied: ACF plots, MLR, and regression trees (full and pruned).
48
Autocorrelation chart: based on this, Lag_1 was taken as one of the predictors, where Lag_1 = QEPS_YY(Q-1).
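The transformation and lag feature described on this slide can be sketched in plain Python. The study itself used XLMiner; the function names and the EPS numbers below are hypothetical and only illustrate a year-over-year % change on quarterly data, a lag-1 predictor, and a sample autocorrelation of the kind an ACF plot displays.

```python
def yoy_pct_change(quarterly):
    """Year-over-year % change: each quarter versus the same quarter four quarters earlier."""
    return [
        100.0 * (quarterly[i] - quarterly[i - 4]) / quarterly[i - 4]
        for i in range(4, len(quarterly))
    ]

def lagged_pairs(series, k):
    """Return (current, value k periods earlier) pairs, e.g. (QEPS_YY(Q), Lag_1) for k = 1."""
    return [(series[i], series[i - k]) for i in range(k, len(series))]

def autocorrelation(series, k):
    """Sample autocorrelation at lag k (the quantity plotted in an ACF chart)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i - k] - mean) for i in range(k, n))
    return cov / var

# Hypothetical quarterly EPS values (made-up numbers).
eps = [10.0, 10.0, 10.0, 10.0, 11.0, 12.0, 13.0, 14.0, 12.0]
qeps_yy = yoy_pct_change(eps)
pairs = lagged_pairs(qeps_yy, 1)
print(qeps_yy)
print(autocorrelation(qeps_yy, 1))
```

A large lag-1 autocorrelation is what would motivate including Lag_1 as a predictor, as the slide reports.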
![Page 49: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/49.jpg)
Three case studies (1/3)
1. Predicting Changes in Quarterly Corporate Earnings Using Economic Indicators
49
QEPS_YY%(t) = 0.0486 + 0.747*QEPS_YY%(t-1) -0.517*QRCAP_YY%(t-2)
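The fitted equation above can be applied directly for a one-step prediction. The sketch below copies the coefficients from the slide; the function name and the input values are hypothetical, and the slide does not say whether the series are expressed as fractions or as percentages, so inputs are assumed to be in whatever units were used when the model was fitted.

```python
def predict_qeps_yy(qeps_yy_lag1, qrcap_yy_lag2):
    """Prediction for QEPS_YY%(t) from QEPS_YY%(t-1) and QRCAP_YY%(t-2).

    Coefficients are copied from the slide; the units of the inputs are an assumption.
    """
    return 0.0486 + 0.747 * qeps_yy_lag1 - 0.517 * qrcap_yy_lag2

# Hypothetical inputs: last quarter's EPS change and QRCAP change two quarters back.
prediction = predict_qeps_yy(0.10, 0.05)
print(f"Predicted QEPS_YY%(t) = {prediction:.5f}")
```

The negative coefficient on QRCAP_YY%(t-2) means that, holding last quarter's earnings change fixed, a larger QRCAP change two quarters earlier lowers the predicted earnings change.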
| # | Dimension | Note | Value | Index |
|---|---|---|---|---|
| 1 | Data resolution | quarterly data | 2 | 0.2500 |
| 2 | Data structure | no externalities | 3 | 0.5000 |
| 3 | Data integration | | 4 | 0.7500 |
| 4 | Temporal relevance | | 5 | 1.0000 |
| 5 | Generalizability | | 5 | 1.0000 |
| 6 | Chronology of data and goal | quarterly data | 3 | 0.5000 |
| 7 | Concept operationalization | | 5 | 1.0000 |
| 8 | Communication | | 4 | 0.7500 |

InfoQ Score = 0.66 (InfoQ = 66%)
![Page 50: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/50.jpg)
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
50
“Zillow.com” is a real estate service launched in 2006.
It calculates a Zestimate home valuation for most homes in the U.S.
For MD and VA, only about 26% of its predictions fall within the +/-5% range.
Predictors:
1. Home Type (Single Family, Condo, etc.)
2. No. of Bedrooms
3. No. of Bathrooms
4. Total Area (sq ft)
5. Lot Size (sq ft)
6. No. of Stories
7. Total Rooms
8. Distance from Metro
9. Primary School Rank
10. Middle School Rank
11. High School Rank
12. Age of House at Sale
13. Sale Season (Fall, Winter, etc.)
14. Recession Period (Y/N)
15. Sales Volume
![Page 51: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/51.jpg)
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
51
• Data collected, cleansed and merged from 4 sources: Zillow, Redfin, School Digger and Google Maps
• 17 counties (29 Zip codes) in Northern VA
House sales data
• Before data clean-up: 3500+
• After data clean-up: 1416
• Y: Is Zestimate correct (Y/N): 37.6% / 62.43%
• X: 15 variables (5+ variables were discarded from the initial set)
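The binary outcome Y can be sketched as a labeling rule. The slide does not define "correct" explicitly; the sketch below assumes it means the Zestimate falls within +/-5% of the actual sale price, in line with the +/-5% range quoted earlier. The function name and the four house records are hypothetical.

```python
def zestimate_correct(zestimate, sale_price, tolerance=0.05):
    """True when the Zestimate is within +/- tolerance of the sale price (assumed definition of Y)."""
    return abs(zestimate - sale_price) <= tolerance * sale_price

# Hypothetical (Zestimate, sale price) pairs in dollars.
sales = [
    (310_000, 300_000),
    (250_000, 300_000),
    (296_000, 300_000),
    (400_000, 350_000),
]
labels = [zestimate_correct(z, s) for z, s in sales]
share_correct = sum(labels) / len(labels)
print(labels, f"share correct = {share_correct:.0%}")
```

With Y defined this way, the 15 X variables are used to classify which sales are likely to receive an accurate Zestimate.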
![Page 52: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/52.jpg)
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
52
| # | Dimension | Note | Value | Index |
|---|---|---|---|---|
| 1 | Data resolution | by individual house | 5 | 1.0000 |
| 2 | Data structure | no externalities | 4 | 0.7500 |
| 3 | Data integration | | 5 | 1.0000 |
| 4 | Temporal relevance | | 5 | 1.0000 |
| 5 | Generalizability | only VA counties | 3 | 0.5000 |
| 6 | Chronology of data and goal | | 5 | 1.0000 |
| 7 | Concept operationalization | | 4 | 0.7500 |
| 8 | Communication | | 4 | 0.7500 |

InfoQ Score = 0.82 (InfoQ = 82%)
![Page 53: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/53.jpg)
Three case studies (2/3)
2. Predicting ZILLOW.com’s Zestimate accuracy
53 http://www.madlan.co.il/education/schools
The Israeli version……
![Page 54: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/54.jpg)
Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
Goal: To predict the first-day returns on Japanese IPOs (based on first-day closing price), using public information available prior to the offer
The data:
i) Japanese IPO data from 1997-2009*
ii) 1561 IPOs
iii) Industry (categorical): 35 industries; 3 were spelling errors, corrected
– Removed Air Trans (1) and Fishery & Forestry (2) industries
– Removed first 128 entries (1997-1999) as they had no data for 2 columns: Underwriter’s fees & Allocation to BRLM
– New columns
  Minimum bid size
  Secondary offering %age
– Creation of dummy variables
  BRLMs: 3, on the basis of gross proceeds of IPO
  Industry: 4, binned by average return
  Market: whether the IPO was OTC or not
54
*Kaneko and Pettway’s Japanese IPO Database (KP-JIPO) http://www.fbc.keio.ac.jp/~kaneko/KP-JIPO/top.htm
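The target variable and two of the preprocessing steps listed above can be sketched as follows. The first-day return is assumed to be the usual (first-day close - offer price) / offer price; the slide only says it is based on the first-day closing price. The record fields, the market label "TSE", and all numbers are hypothetical and are not taken from the KP-JIPO database.

```python
from collections import defaultdict

def first_day_return(offer_price, first_day_close):
    """First-day return as a fraction of the offer price (assumed definition)."""
    return (first_day_close - offer_price) / offer_price

# Hypothetical IPO records (made-up values).
ipos = [
    {"industry": "Services", "market": "OTC", "offer": 1000.0, "close": 1500.0},
    {"industry": "Services", "market": "TSE", "offer": 2000.0, "close": 2200.0},
    {"industry": "Retail", "market": "OTC", "offer": 500.0, "close": 450.0},
]

by_industry = defaultdict(list)
for ipo in ipos:
    ipo["return"] = first_day_return(ipo["offer"], ipo["close"])
    ipo["is_otc"] = 1 if ipo["market"] == "OTC" else 0  # market dummy variable
    by_industry[ipo["industry"]].append(ipo["return"])

# Average return per industry, the quantity the slide uses to bin industries.
avg_return = {name: sum(r) / len(r) for name, r in by_industry.items()}
print(avg_return)
```

Binning the industries into four groups by these averages would then replace the 35-level categorical variable with a handful of dummy variables; the bin boundaries used in the study are not given on the slide.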
![Page 55: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/55.jpg)
Three case studies (3/3)
3. Predicting First Day Returns for Japanese IPOs
55
1) Age of company at time of IPO
2) Gross Proceeds (size of IPO)
3) Minimum Bid Amount
4) IS_OTC listing
5) Secondary offering as %age of total
6) Percentage shares allocated to Lead Manager 1
7) Underwriter’s Gross Spread (fees as %age of size of IPO)
8) Industry_Type (binned categorical variable, 4 categories)
9) Lead_Manager (binned categorical variable, 3 categories)
| # | Dimension | Note | Value | Index |
|---|---|---|---|---|
| 1 | Data resolution | | 5 | 1.0000 |
| 2 | Data structure | | 4 | 0.7500 |
| 3 | Data integration | no externalities | 2 | 0.2500 |
| 4 | Temporal relevance | | 5 | 1.0000 |
| 5 | Generalizability | no theory | 3 | 0.5000 |
| 6 | Chronology of data and goal | should be ex ante | 3 | 0.5000 |
| 7 | Concept operationalization | | 5 | 1.0000 |
| 8 | Communication | | 4 | 0.7500 |

InfoQ Score = 0.66 (InfoQ = 66%)
Prediction algorithms do not give a reasonable prediction of IPO returns from public information (high RMSE: 90%).
![Page 56: Kenett On Information NYU-Poly 2013](https://reader034.fdocuments.net/reader034/viewer/2022042819/55cb63ecbb61ebf6198b45b0/html5/thumbnails/56.jpg)
Thank you for your attention
56