Data Driven Modelling using MATLAB - School of Computer Science
Data Driven Modelling
Data Driven Modelling using MATLAB
Shan He
School for Computational Science
University of Birmingham
Module 06-23836: Computational Modelling with MATLAB
Outline of Topics
What is data driven modelling?
Regression Analysis in MATLAB
Artificial Neural Networks
Conclusion
What is data driven modelling?
- For equation-based and agent-based models, we assume the model is known.
- However, sometimes we have a large amount of data but very little prior knowledge.
- Finding the model in the first place is then the most difficult and important question.
- A new research field: data driven modelling (DDM).
- Based on the data, a model is built from the connections between the system state variables (e.g., input, internal and output variables), with only limited assumptions about the system.
Goals/purposes of data driven modelling
- Extract and recognize patterns in data
- Interpret or explain observations
- Test validity of hypotheses
- Search the space of hypotheses
Tasks of data driven modelling
- Classification: the task consists of assigning a class to an input data point.
- Association: associations between the variables characterising the system are identified and then used in subsequent prediction.
- Regression: the task consists of predicting a real value associated with an input data point.
- Clustering: groups of data points with within-group similarity are to be determined.
It is new and old!
- Previously, it was called observational modelling.
- Based on methods in statistics, e.g., regression.
- These methods usually cannot handle nonlinear systems.
- In recent years, machine learning techniques have been applied.
- We will learn how to use regression and Artificial Neural Networks to build data-driven models in MATLAB.
Data driven modelling process
- Data preparation: obtain data / data checking / data cleaning.
- Feature selection: if you have high-dimensional data.
- Specify assumptions based on domain knowledge.
- Develop a model based on the assumptions.
- Specify a loss function, e.g., the mean squared error between the model output and the real data.
- Use algorithms to minimize the loss on the training data.
- Test the model using testing data.
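The process above can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: the data are synthetic (a noisy straight line invented for the example), the hold-out split is arbitrary, and the model is assumed to be a first-degree polynomial.

```matlab
% Synthetic data for illustration only (assumption): a noisy straight line.
rng(0);                                  % fix the random seed for repeatability
x = linspace(0, 10, 100)';
y = 2*x + 1 + 0.5*randn(size(x));

% Split into training and testing data (every 4th point is held out).
isTest = mod((1:numel(x))', 4) == 0;
xTrain = x(~isTest);   yTrain = y(~isTest);
xTest  = x(isTest);    yTest  = y(isTest);

% Model assumption: a first-degree polynomial.
% polyfit minimizes the sum of squared errors on the training data.
p = polyfit(xTrain, yTrain, 1);

% Loss function: mean squared error between model output and data.
mseLoss  = @(yHat, yObs) mean((yHat - yObs).^2);
trainMSE = mseLoss(polyval(p, xTrain), yTrain);
testMSE  = mseLoss(polyval(p, xTest),  yTest);
fprintf('train MSE = %.3f, test MSE = %.3f\n', trainMSE, testMSE);
```

A test error much larger than the training error would suggest that the assumed model does not generalise to unseen data.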
What tools can we use?
- Statistics:
  - Linear regression
  - Nonlinear regression
  - Logistic regression
  - Probit regression
- Machine Learning techniques:
  - Decision tree
  - Artificial Neural Network
  - Nearest Neighbours
  - Support Vector Machine
  - Association rule learning
Regression Analysis in MATLAB
Linear regression analysis in MATLAB
- For linear regression, we can use polynomial curve fitting.
- MATLAB function: p = polyfit(x,y,n)
- It finds the coefficients of a polynomial p(x) of degree n that fits the data, p(x(i)) to y(i), in a least squares sense.
- The output p is a row vector of length n+1 containing the polynomial coefficients in descending powers:
  $$p(x) = p_1 x^n + p_2 x^{n-1} + \cdots + p_n x + p_{n+1}$$
- To evaluate the polynomial at the data points: y = polyval(p,x)
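A minimal sketch of polyfit and polyval together. The data are made up for illustration: they lie exactly on a quadratic, so the fit should recover its coefficients up to rounding error.

```matlab
% Illustrative data (assumption): points on y = 2x^2 - 3x + 1.
x = [0 1 2 3 4];
y = 2*x.^2 - 3*x + 1;

% Fit a polynomial of degree n = 2; p has n+1 = 3 coefficients,
% ordered from the highest power down to the constant term.
p = polyfit(x, y, 2);

% Evaluate the fitted polynomial at the data points.
yFit = polyval(p, x);
disp(p);
```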
A very simple example: fitting error function
- Regression: we aim to fit data points sampled from the error function. erf(x) is twice the integral, from 0 to x, of the Gaussian density with mean 0 and variance 1/2:
  $$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$
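One way this example might look in MATLAB. The sampling interval [0, 2.5] and the polynomial degree 6 are assumptions chosen for illustration, not values given on the slide.

```matlab
% Sample the error function on an assumed interval [0, 2.5].
x = (0:0.1:2.5)';
y = erf(x);

% Fit a degree-6 polynomial (the degree is an illustrative choice).
p = polyfit(x, y, 6);

% Compare the fit with the data inside the fitted interval.
yFit   = polyval(p, x);
maxErr = max(abs(y - yFit));
fprintf('max abs error on [0, 2.5]: %.2e\n', maxErr);

% Evaluate on a wider interval to see how the polynomial behaves
% where it was not constrained by any data.
xWide = (0:0.1:5)';
plot(x, y, 'o', xWide, polyval(p, xWide), '-');
axis([0 5 0 2]);
legend('erf(x) samples', 'degree-6 fit');
```

A polynomial fit is only trustworthy inside the range of the data it was fitted to. erf(x) is bounded, while any non-constant polynomial is not, so the fit is expected to drift away from erf(x) outside the interval.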
A more complex example: fitting traffic data
- Hourly traffic counts at three intersections for a single day.
- Regression: we aim to fit the data with polyfit and evaluate the fitted polynomial with polyval.
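A possible sketch, assuming the data are the count.dat sample file that ships with MATLAB (24 hourly counts for 3 intersections, loaded into a matrix named count). The choice of the first intersection and of degree 4 is arbitrary and only for illustration.

```matlab
% Assumption: count.dat is MATLAB's bundled sample file; loading it
% creates a 24-by-3 matrix "count" (rows = hours, columns = intersections).
load count.dat

t = (1:size(count, 1))';   % hour of the day
y = count(:, 1);           % counts at the first intersection
n = 4;                     % polynomial degree (illustrative choice)

% The three-output form centres and scales t, which improves the
% numerical conditioning of higher-degree fits.
[p, S, mu] = polyfit(t, y, n);

% Evaluate the fit using the same centring and scaling.
yFit = polyval(p, t, S, mu);

plot(t, y, 'o', t, yFit, '-');
xlabel('Hour'); ylabel('Vehicle count');
legend('data', sprintf('degree-%d fit', n));
```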
Logistic regression
- Sometimes called the logistic model or logit model.
- Can be used for predicting the outcome of a binary dependent variable, i.e., classification.
- MATLAB function: b = glmfit(X,y,distr)
- Output: a vector b of coefficient estimates for a generalized linear regression of the responses in y on the predictors in X, using the distribution distr. For logistic regression, distr is 'binomial'. With p predictor columns in X and the default constant term, b is (p+1)-by-1.
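A minimal sketch of logistic regression with glmfit and glmval (Statistics Toolbox). The data are invented for illustration, and the 0.5 decision threshold is an assumption.

```matlab
% Illustrative data (assumption): one predictor and a binary response.
x = (1:10)';
y = [0 0 0 1 0 1 0 1 1 1]';

% Logistic regression: binomial distribution with the logit link.
% b(1) is the constant term, b(2) the coefficient of x.
b = glmfit(x, y, 'binomial', 'link', 'logit');

% Predicted probability of class 1 for each data point.
pHat = glmval(b, x, 'logit');

% Classification: threshold the probability at 0.5 (assumed threshold).
yPred    = pHat >= 0.5;
accuracy = mean(yPred == y);
fprintf('training accuracy = %.2f\n', accuracy);
```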
Australian Credit Card Assessment
- Task: to assess applications to an Australian bank for a credit card based on a number of attributes.
- 2 classes: granted (44.5% of the instances) or denied (55.5% of the instances).
- 14 attributes: names and values have been changed to meaningless symbols to protect confidentiality of the data.
- Mixed-value inputs: there are 5 continuous, 4 binary and 5 nominal attributes.
- Many missing values.
Artificial Neural Networks
What are Artificial Neural Networks (ANNs)?
[Figure: network diagram with Input, Hidden Layer and Output]
- ANN: mathematical model or computational model inspired by biological neural networks.
- Consists of an interconnected group of artificial neurons.
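To make the idea of interconnected neurons concrete, here is a sketch of one forward pass through a network with a single hidden layer, written in plain MATLAB. The layer sizes, random weights, tanh activation and input vector are all hypothetical choices for illustration.

```matlab
% Hypothetical network: 3 inputs, 4 hidden neurons, 1 output.
rng(1);                          % repeatable random weights
W1 = randn(4, 3);  b1 = randn(4, 1);   % input -> hidden weights and biases
W2 = randn(1, 4);  b2 = randn(1, 1);   % hidden -> output weights and bias

x = [0.2; -0.5; 1.0];            % one input data point (made up)

% Each hidden neuron forms a weighted sum of the inputs plus a bias
% and passes it through a nonlinear activation function.
h = tanh(W1*x + b1);

% The output neuron is a weighted sum of the hidden activations.
yOut = W2*h + b2;
disp(yOut);
```

Training a network means adjusting W1, b1, W2 and b2 so that yOut matches the target outputs over the training data.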
What are Artificial Neural Networks (ANNs)?
- Non-linear statistical data modelling tools:
  - Model complex relationships between inputs and outputs;
  - Discover patterns in data.
- Can be used for classification, association, regression and clustering.
- MATLAB Neural Network Toolbox
Example: Prediction of number of sun spots
- The sunspot series is a record of the activity on the surface of the sun.
- Important: telecommunication will be disrupted by a sufficiently large solar flare.
- Time series data for sunspot activity over the last 300 years.
- Sunspot activity is cyclical, reaching a maximum about every 11 years.
- Challenging: the sunspot series is nonlinear, non-stationary and non-Gaussian.
Prediction of sunspot number by ANNs
- Task: we use recorded sunspot data to train our ANN to predict the sunspot number based on the sunspot numbers of the previous 3 years.
- Training data: sunspot numbers from 1705 – 1884
- Test data: sunspot numbers from 1884 – 1987
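A sketch of how this task might be set up, not necessarily the setup used in the lecture. It assumes the sunspot.dat sample file bundled with MATLAB (column 1 = year, column 2 = sunspot number) and the toolbox functions feedforwardnet and train. The hidden layer size of 5 is arbitrary, and 1884 is assigned to the training set because the two ranges above overlap in that year.

```matlab
% Assumption: sunspot.dat is MATLAB's bundled sample file.
load sunspot.dat
yr = sunspot(:, 1);
s  = sunspot(:, 2);
N  = numel(s);

% Each input column holds the 3 previous years; the target is the next year.
X     = [s(1:N-3)'; s(2:N-2)'; s(3:N-1)'];
T     = s(4:N)';
tYear = yr(4:N)';

% Split by the year being predicted.
isTrain = tYear >= 1705 & tYear <= 1884;
isTest  = tYear > 1884;

net = feedforwardnet(5);               % 5 hidden neurons: arbitrary choice
net.divideFcn = 'dividetrain';         % train on every sample passed to train
net.trainParam.showWindow = false;     % suppress the training window
net = train(net, X(:, isTrain), T(isTrain));

% Predict the test years and measure the mean squared error.
pred    = net(X(:, isTest));
testMSE = mean((T(isTest) - pred).^2);
fprintf('test MSE = %.2f\n', testMSE);
```

Because the initial weights are random, the test error differs from run to run; fixing the seed with rng before creating the network makes a run repeatable.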
New direction for ANNs: Deep Learning
- ANNs fell out of favour in the 1990s because they were slow and inefficient.
- In 2006, Prof. Geoff Hinton made a breakthrough: deep learning.
- Excels at unsupervised learning, e.g., recognising handwritten words.
- Key idea: learn categories incrementally, e.g., lower-level categories (letters) → higher-level categories (words).
- Google, Microsoft and other big names have jumped on the bandwagon.
- Microsoft Project: Speech Recognition
Conclusion
- If you know the underlying mechanisms of the system (even partially), DO NOT use data-driven modelling methods.
- How to choose your tools: start from simple tools.
- Regression → Decision Tree → ANNs (SVM, Random Forest) → Hybrid methods, e.g., Evolutionary ANNs
- Interpretability also matters: simpler tools do better on this.
Assignment
- Based on the sunspot number prediction example, use linear regression (polyfit) and ANNs to model the Hudson Bay Company fur record data.
- Investigate how to use a decision tree for the Australian Credit Card Assessment problem. Compare the results with ANNs and logistic regression.