Visualization of the data set used – spanning over 1300 trips and 2 million records
description
Transcript of Visualization of the data set used – spanning over 1300 trips and 2 million records
Visualization of the data set used – spanning over 1300 trips and 2 million records
Predicting Map-Matching Values using GPS Data from Navigation Software
Objective:• Map-Matching Values, or SnapWeights, are a measure of data point error. • Our aim is to model and predict this based on variables such as position,
velocity, and heading
Approaches:• Linear Regression of logit-transformed response with I.I.D. data assumed• Beta Regression of Independent responses that are Beta-Distributed with
different parameters• AR model, treating data as a Time Series.
Illustration of Data:
Overview
Time Series Models: AR/MA
• These models assume a linear dependence on previous data points and require stationarity and ergodicity
• AR(p): • MA(q): where all are White Noise error terms
• Different distributions for different trips. Generally appear stationary with no clear trend or seasonality.• Auto-correlations plotted. Generally slow decay indicates low probability of AR/MA process• Similarly, not likely high value in trying combination model ARMA
Linear Regression of Transformed SnapWeight
• Response variable transformed using a Logit Function and then linearly regressed on covariates such as speed, bearing, and acceleration
where
• To uphold I.I.D assumption, trips with similar distributions are to be combined• Similarity of distributions determined empirically using either Kolmogorov-Smirnov test or
Mutual Information function (less rigorous but so far less implementation problems)
• Not many covariates. Just a matter of training and then validating an assortment of covariate combinations using OLS and regularized methods such as ridge regression
• Initial runs on single trip data sets give very low prediction error values (R-squared)
Beta RegressionMotivation: • The SnapWeight is an indication of error rate and is constrained between 0 and 1.
We would like to respect this boundary.
Assumption:
Beta distribution, parameterized by mean (μ) and precision (ϕ):
Regression based on these assumptions:
Link Function (e.g. Logit, Cauchy,..)
Beta Regression
• Current Problems:– Huge dataset impedes the use of Summary()– Not sure about the effect of IDs and whether it’s necessary to separate them
beta_logit betareg(formula = SnapWeight ~ Speedx10, data = data.train.small)
Coefficients in two cases of regression formula:(Intercept) Speedx10 Headingx10 1.170e+00 1.034e-04 -4.888e-05
(Intercept) Speedx10 1.0773586 0.0001835
What’s Next
• Fit AR/MA models to many IDs and find most common p and q. Then validate using validation data set.
• Regress on larger aggregated data sets and then validate.• Do beta regression with the assumption of non-identical precision parameters, also
different link functions• Eliminate incomplete/irrelevant data from corpus, based on both intuition and regression
results• Compare different test values• Pick most likely models and apply to untouched Test data.
References
• [1] René A. Carmona. Statistical Analysis of Financial Data in S-PLUS. Springer, 2004.
• [2] Cribari-Neto, Francisco and Zeileis, Achim (2009) Beta Regression in R. Research Report Series / Department of Statistics and Mathematics, 98. Department of Statistics and Mathematics x, WU Vienna University of Economics and Business, Vienna.
• [3] Ferrari, Silvia and Cribari-Neto, Francisco (2004) Beta Regression for Modeling Rates and Proportions. Journal of Applied Sciences, Volume 31, Issue 7.