Data Driven Modelling using MATLAB - School of Computer Science

Data Driven Modelling

Data Driven Modelling using MATLAB

Shan He

School for Computational Science, University of Birmingham

Module 06-23836: Computational Modelling with MATLAB

Outline

Outline of Topics

What is data driven modelling?

Regression Analysis in MATLAB

Artificial Neural Networks

Conclusion

What is data driven modelling?

- For equation-based and agent-based models, we assume the model is known.

- However, sometimes we have a large amount of data but very little prior knowledge.

- Finding the model in the first place is then the most difficult and important question.

- This is the subject of a new research field: data driven modelling (DDM).

- In DDM, a model is built from the data, based on connections between the system state variables (e.g., input, internal and output variables), with only limited assumptions about the system.

Goals/purposes of data driven modelling

- Extract and recognize patterns in data

- Interpret or explain observations

- Test validity of hypotheses

- Search the space of hypotheses

Tasks of data driven modelling

- Classification: assigning a class to an input data point.

- Association: identifying associations between the variables characterising the system, which are then used in subsequent prediction.

- Regression: predicting a real value associated with an input data point.

- Clustering: finding groups of data points with within-group similarity.

It is new and old!

- It was previously called observational modelling.

- It was based on methods from statistics, e.g., regression.

- These methods usually cannot handle nonlinear systems.

- In recent years, machine learning techniques have been applied.

- We will learn how to use regression and Artificial Neural Networks to build data-driven models in MATLAB.

Data driven modelling process

- Data preparation: obtain, check and clean the data.

- Feature selection: needed if you have high-dimensional data.

- Specify assumptions based on domain knowledge.

- Develop a model based on the assumptions.

- Specify a loss function, e.g., the mean squared error between the model output and the real data.

- Use algorithms to minimize the loss on the training data.

- Test the model using the test data.
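
The steps above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the data are synthetic (a noisy straight line), the hold-out rule (every fourth point) is arbitrary, and the variable names are hypothetical rather than taken from the lecture.

```matlab
% Minimal sketch of the data driven modelling process on synthetic data.
rng(1);                                  % fixed seed so the sketch is repeatable
x = linspace(0, 10, 100)';
y = 3*x + 2 + randn(size(x));            % synthetic "measurements" (assumed)

% Data preparation: hold out every fourth point for testing.
isTest = mod((1:numel(x))', 4) == 0;
xTrain = x(~isTest);  yTrain = y(~isTest);
xTest  = x(isTest);   yTest  = y(isTest);

% Assumption and model: y = a*x + b.
A = [xTrain, ones(size(xTrain))];

% Minimize the squared-error loss on the training data (least squares).
theta = A \ yTrain;                      % theta = [a; b]

% Loss function: mean squared error, evaluated on both sets.
mse = @(yTrue, yPred) mean((yTrue - yPred).^2);
trainMSE = mse(yTrain, A * theta);
testMSE  = mse(yTest, [xTest, ones(size(xTest))] * theta);

fprintf('a = %.2f, b = %.2f, train MSE = %.2f, test MSE = %.2f\n', ...
        theta(1), theta(2), trainMSE, testMSE);
```

A test error much larger than the training error would suggest that the model assumptions or the training data need revisiting.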

What tools can we use?

- Statistics:
  - Linear regression
  - Nonlinear regression
  - Logistic regression
  - Probit regression

- Machine learning techniques:
  - Decision tree
  - Artificial Neural Network
  - Nearest Neighbours
  - Support Vector Machine
  - Association rule learning

Regression Analysis in MATLAB

Linear regression analysis in MATLAB

- For linear regression, we can use polynomial curve fitting.

- MATLAB function: p = polyfit(x,y,n)

- It finds the coefficients of a polynomial p(x) of degree n that fits the data, p(x(i)) to y(i), in a least-squares sense.

- The output p is a row vector of length n+1 containing the polynomial coefficients in descending powers:

$$p(x) = p_1 x^n + p_2 x^{n-1} + \cdots + p_n x + p_{n+1}$$

- To evaluate the polynomial at the data points: y = polyval(p,x)
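
As a quick check of the coefficient ordering, the sketch below fits a noise-free quadratic whose coefficients are known in advance. The data and variable names are made up for illustration only.

```matlab
% Fit a known quadratic to confirm that polyfit returns coefficients
% in descending powers: y = 2*x^2 - 3*x + 1  <->  p = [2 -3 1].
x = (-2:0.5:2)';
y = 2*x.^2 - 3*x + 1;            % noise-free data (assumed for illustration)

p = polyfit(x, y, 2);            % row vector of length n+1 = 3
yFit = polyval(p, x);            % evaluate the fitted polynomial at x
maxErr = max(abs(yFit - y));     % should be at rounding-error level

disp(p);
fprintf('largest fitting error: %.2e\n', maxErr);
```

With noisy data the recovered coefficients would only approximate the true ones, and the choice of degree n becomes a modelling decision.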

A very simple example: fitting error function

- Regression: we aim to fit data points sampled from the error function. erf(x) is twice the integral, from 0 to x, of the Gaussian distribution with mean 0 and variance 1/2:

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$$
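
One way this fit might look is sketched below. The sampling interval [0, 2.5] and the polynomial degree 6 are assumptions chosen for illustration, not values stated on the slide.

```matlab
% Sample erf on an interval and fit a polynomial to the samples.
x = (0:0.1:2.5)';                % sample points (assumed interval)
y = erf(x);                      % data to be fitted

p = polyfit(x, y, 6);            % degree 6 is an arbitrary choice
f = polyval(p, x);               % fitted values at the sample points
maxResidual = max(abs(y - f));   % worst-case error on the fitted interval

fprintf('largest residual on [0, 2.5]: %.2e\n', maxResidual);

% erf is bounded, but any polynomial is not, so the fit should only be
% trusted inside the interval it was fitted on.
plot(x, y, 'o', x, f, '-');
legend('erf(x)', 'polynomial fit', 'Location', 'southeast');
```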

A more complex example: fitting traffic data

- Data: hourly traffic counts at three intersections for a single day.

- Regression: we aim to fit the data with polyfit and evaluate the fitted polynomial with polyval.

Logistic regression

- Sometimes called the logistic model or logit model.

- Can be used for predicting the outcome of a binary dependent variable, i.e., classification.

- MATLAB function: b = glmfit(X,y,distr)

- Output: a (p+1)-by-1 vector b of coefficient estimates (an intercept followed by one coefficient per predictor) for a generalized linear regression of the responses in y on the p predictors in X, using the distribution distr. For logistic regression, distr is 'binomial'.
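
A minimal sketch of logistic regression with glmfit follows. It assumes the Statistics and Machine Learning Toolbox is available; the data are synthetic, and the "true" coefficients and the 0.5 decision threshold are illustrative choices only.

```matlab
% Logistic regression on synthetic two-class data with glmfit/glmval.
rng(0);                                   % repeatable random numbers
n = 200;
X = randn(n, 2);                          % two predictors (synthetic)
trueB = [-0.5; 2; -1];                    % assumed intercept and slopes
pTrue = 1 ./ (1 + exp(-(trueB(1) + X * trueB(2:3))));
y = double(rand(n, 1) < pTrue);           % binary responses (0 or 1)

% 'binomial' uses the logit link by default; an intercept is added
% automatically, so b has one more element than X has columns.
b = glmfit(X, y, 'binomial');

pHat = glmval(b, X, 'logit');             % fitted class-1 probabilities
yHat = pHat >= 0.5;                       % classify with a 0.5 threshold
accuracy = mean(yHat == y);

fprintf('estimated coefficients: %s\n', mat2str(b', 3));
fprintf('training accuracy: %.2f\n', accuracy);
```

The accuracy here is measured on the training data; a fair assessment would use separate test data.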

Australian Credit Card Assessment

- Task: to assess applications to an Australian bank for a credit card based on a number of attributes.

- 2 classes: granted (44.5% of the instances) or denied (55.5% of the instances).

- 14 attributes: names and values have been changed to meaningless symbols to protect the confidentiality of the data.

- Mixed-type inputs: 5 continuous, 4 binary and 5 nominal.

- Many missing values.

Military Trauma survival prediction

Artificial Neural Networks

What are Artificial Neural Networks (ANNs)?

[Figure: network diagram with an input layer, a hidden layer and an output layer]

- ANN: a mathematical or computational model inspired by biological neural networks.

- Consists of an interconnected group of artificial neurons.

What are Artificial Neural Networks (ANNs)?

- Non-linear statistical data modelling tools:
  - Model complex relationships between inputs and outputs;
  - Discover patterns in data.

- Can be used for classification, association, regression and clustering.

- In MATLAB: the Neural Network Toolbox.

Example: Predicting the number of sunspots

- The sunspot series is a record of the activity on the surface of the sun.

- Important: telecommunications can be disrupted by a sufficiently large solar flare.

- Time series data for sunspot activity is available for the last 300 years.

- Sunspot activity is cyclical, reaching a maximum about every 11 years.

- Challenging: the sunspot series is nonlinear, non-stationary and non-Gaussian.

Prediction of sunspot number by ANNs

- Task: we use recorded sunspot data to train an ANN to predict the sunspot number for a year from the sunspot numbers of the previous 3 years.

- Training data: sunspot numbers from 1705–1884

- Test data: sunspot numbers from 1884–1987
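
The set-up above might be sketched as follows. This is not the lecture's own code: it assumes the sample file sunspot.dat that ships with MATLAB (columns: year, sunspot number) and the Neural Network Toolbox function feedforwardnet. The hidden layer size of 5 is an arbitrary choice, and the boundary year 1884 is assigned to the training set here.

```matlab
% Predict a year's sunspot number from the three preceding years.
load sunspot.dat                       % assumed MATLAB sample data file
year = sunspot(:, 1);
num  = sunspot(:, 2);
N = numel(num);

% Each column of X holds three consecutive yearly values;
% the matching entry of T is the value for the following year.
X = [num(1:N-3)'; num(2:N-2)'; num(3:N-1)'];
T = num(4:N)';
targetYear = year(4:N)';

trainIdx = targetYear >= 1705 & targetYear <= 1884;
testIdx  = targetYear > 1884;

net = feedforwardnet(5);               % one hidden layer with 5 neurons
net.trainParam.showWindow = false;     % suppress the training GUI
net = train(net, X(:, trainIdx), T(trainIdx));

pred = net(X(:, testIdx));             % predictions for the test years
testMSE = mean((pred - T(testIdx)).^2);
fprintf('test MSE: %.1f\n', testMSE);

plot(targetYear(testIdx), T(testIdx), '-', targetYear(testIdx), pred, '--');
legend('recorded', 'predicted');
```

By default, train sets aside part of the data passed to it for validation, and the initial weights are random, so results vary between runs.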

New direction for ANNs: Deep Learning

- ANNs fell out of favour in the 1990s because they were slow and inefficient.

- In 2006, Prof. Geoff Hinton made a breakthrough: deep learning.

- Excels at unsupervised learning, e.g., recognising handwritten words.

- Key idea: learn categories incrementally, e.g., lower-level categories (letters) → higher-level categories (words).

- Google, Microsoft and other big names have jumped on the bandwagon.

- Microsoft project: speech recognition.

Conclusion

- If you know the underlying mechanisms of the system (even partially), DO NOT use data-driven modelling methods.

- How to choose your tools: start with simple tools and move to more complex ones only when needed:

- Regression → Decision Tree → ANNs (SVM, Random Forest) → Hybrid methods, e.g., Evolutionary ANNs

- Also consider interpretability: simpler tools do better here.

Assignment

- Based on the sunspot number prediction example, use linear regression (polyfit) and ANNs to model the Hudson Bay Company fur record data.

- Investigate how to use a decision tree for the Australian Credit Card Assessment problem. Compare the results with ANNs and logistic regression.