Predictive Modeling in Insurance in the context of (possibly) big data


  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Predictive Modeling in Insurance,

    in the context of (possibly) Big Data

    A. Charpentier (UQAM & Université de Rennes 1)

    Statistical & Actuarial Sciences Joint Seminar

    & Center of studies in Asset Management (CESAM)

    http://freakonometrics.hypotheses.org

    @freakonometrics 1

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Predictive Modeling in Insurance,

    in the context of (possibly) Big Data

    A. Charpentier (UQAM & Université de Rennes 1)

    Professor of Actuarial Sciences, Mathematics Department, UQAM (previously Economics Department, Univ. Rennes 1 & ENSAE ParisTech; actuary in Hong Kong, IT & Stats, FFSA)

    ACTINFO-Covéa Chair, Actuarial Value of Information; Data Science for Actuaries program, Institute of Actuaries; PhD in Statistics (KU Leuven), Fellow Institute of Actuaries; MSc in Financial Mathematics (Paris Dauphine) & ENSAE; Editor of the freakonometrics.hypotheses.org blog; Editor of Computational Actuarial Science, CRC

    @freakonometrics 2

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Actuarial Science, an American Perspective

    Source: Trowbridge (1989) Fundamental Concepts of Actuarial Science.

    @freakonometrics 3

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Actuarial Science, a European Perspective

    Source: Dhaene et al. (2004) Modern Actuarial Risk Theory.

    @freakonometrics 4

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Examples of Actuarial Problems: Ratemaking and Pricing

    $$\mathbb{E}[S\mid \boldsymbol{X}] = \mathbb{E}\Big[\sum_{i=1}^N Z_i \,\Big|\, \boldsymbol{X}\Big] = \underbrace{\mathbb{E}[N\mid \boldsymbol{X}]}_{\text{annual frequency}} \cdot \underbrace{\mathbb{E}[Z_i\mid \boldsymbol{X}]}_{\text{individual cost}}$$

    censoring / incomplete datasets (exposure + delay to report claims)

    We observe Y and E, but the variable of interest is N.

    $Y_i \sim \mathcal{P}(E_i\,\lambda_i)$ with $\lambda_i = \exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta} + Z_i]$.

    @freakonometrics 5

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Examples of Actuarial Problems: Pricing and Classification

    Econometric models on classes,

    $$Y_i \sim \mathcal{B}(p_i) \quad\text{with}\quad p_i = \frac{\exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta}]}{1 + \exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta}]}$$

    or on counts,

    $$Y_i \sim \mathcal{P}(\lambda_i) \quad\text{with}\quad \lambda_i = \exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta}]$$

    (too) large datasets: X can be large (or complex),

    factors with a lot of modalities, spatial data, text information, etc.

    @freakonometrics 6
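    A minimal sketch (not from the slides) of how such frequency models can be fitted as generalized linear models in Python with statsmodels; the data and the single covariate (age) are simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 80, n)
X = sm.add_constant(age)                           # design matrix with an intercept

# Occurrence model: Y_i ~ B(p_i), logit(p_i) = beta_0 + beta_1 * age_i
p = 1 / (1 + np.exp(-(-3.0 + 0.03 * age)))
occurrence = rng.binomial(1, p)
logit_fit = sm.GLM(occurrence, X, family=sm.families.Binomial()).fit()

# Count model: Y_i ~ P(lambda_i), lambda_i = exp(beta_0 + beta_1 * age_i)
lam = np.exp(-2.5 + 0.02 * age)
counts = rng.poisson(lam)
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

print(np.round(logit_fit.params, 3), np.round(poisson_fit.params, 3))
```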

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Examples of Actuarial Problems: Pricing and Classification

    How to avoid overfit? How to group modalities? How to choose between (very) correlated features?

    model selection issues

    Historically, Bailey (1963) margin method: $n_{i,j} = r_i c_j$, with row ($r$) and column ($c$) effects, and constraints

    $$\sum_i n_{i,j} = \sum_i r_i c_j \quad\text{and}\quad \sum_j n_{i,j} = \sum_j r_i c_j$$

    Related to Poisson regression,

    $$N \sim \mathcal{P}(\exp[\beta_0 + \boldsymbol{r}^T\boldsymbol{\beta}_R + \boldsymbol{c}^T\boldsymbol{\beta}_C])$$

    @freakonometrics 7

    https://www.casact.org/pubs/proceed/proceed63/63004.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Exemples of Actuarial Problems: Claims Reserving and Predictive Models

    predictive modeling issues

    In all those cases, the goal is to get a predictive model, $\widehat{y} = m(\boldsymbol{x})$, given some features $\boldsymbol{x}$. Recall that the main interest in insurance is either

    a probability $m(\boldsymbol{x}) = \mathbb{P}[Y = 1\mid \boldsymbol{X} = \boldsymbol{x}]$

    an expected value $m(\boldsymbol{x}) = \mathbb{E}[Y\mid \boldsymbol{X} = \boldsymbol{x}]$

    but sometimes, we need the (conditional) distribution of $Y$.

    @freakonometrics 8

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    History of Actuarial Models (in one slide)

    Bailey (1963) or Taylor (1977) considered deterministic models, $n_{i,j} = r_i c_j$ or $n_{i,j} = r_i d_{i+j}$. Some additional constraints are given to get an identifiable model.

    Then some stochastic versions of those models were introduced, see Hachemeister (1975) or de Vylder (1985), e.g.

    $$N_{i,j} \sim \mathcal{P}(\exp[\beta_0 + \boldsymbol{R}^T\boldsymbol{\beta}_R + \boldsymbol{C}^T\boldsymbol{\beta}_C])$$

    or $\log N_{i,j} \sim \mathcal{N}(\beta_0 + \boldsymbol{R}^T\boldsymbol{\beta}_R + \boldsymbol{C}^T\boldsymbol{\beta}_C,\ \sigma^2)$

    All those techniques are econometric-based techniques. Why not consider some statistical learning techniques?

    @freakonometrics 9

    https://www.casact.org/pubs/proceed/proceed63/63004.pdf
    https://www.casact.org/library/astin/vol9no1and2/vol9no1and2.pdf
    https://www.casact.org/pubs/forum/92spforum/92sp307.pdf
    http://www.sciencedirect.com/science/article/pii/0167668785900125

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Statistical Learning and Philosophical Issues

    From Machine Learning and Econometrics, by Hal Varian:

    Machine learning uses data to predict some variable as a function of other covariates,

    may, or may not, care about insight, importance, patterns

    may, or may not, care about inference (how y changes as some x changes)

    Econometrics uses statistical methods for prediction, inference and causal modeling of economic relationships,

    hopes for some sort of insight (inference is a goal)

    in particular, causal inference is a goal for decision making.

    machine learning, new tricks for econometrics

    @freakonometrics 10

    http://web.stanford.edu/class/ee380/Abstracts/140129-slides-Machine-Learning-and-Econometrics.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Statistical Learning and Philosophical Issues

    Remark: machine learning can also learn from econometrics, especially with non-i.i.d. data (time series and panel data).

    Remark: machine learning can help to get better predictive models, given good datasets. It is of no use for several data-science issues (e.g. selection bias).

    @freakonometrics 11

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Machine Learning and Statistics

    Machine learning and statistics seem to be very similar: they share the same goals (they both focus on data modeling) but their methods are affected by their cultural differences.

    The goal for a statistician is to predict an interaction between variables with some degree of certainty (we are never 100% certain about anything). Machine learners, on the other hand, want to build algorithms that predict, classify, and cluster with the most accuracy, see Why a Mathematician, Statistician & Machine Learner Solve the Same Problem Differently.

    Machine learning methods are about algorithms, more than about asymptotic statistical properties.

    @freakonometrics 12

    http://www.galvanize.com/blog/2015/08/26/why-a-mathematician-statistician-machine-learner-solve-the-same-problem-differently-2/

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Machine Learning and Statistics

    See also nonparametric inference: "Note that the non-parametric model is not none-parametric: parameters are determined by the training data, not the model. [...] non-parametric covers techniques that do not assume that the structure of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data." (see wikipedia)

    Validation is not based on mathematical properties, but on properties out of sample: we must use a training sample to train (estimate) a model, and a testing sample to compare algorithms.

    @freakonometrics 13

    https://en.wikipedia.org/wiki/Nonparametric_statistics

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    In statistics and in machine learning, there will be parameters and meta-parameters (or tuning parameters). The first ones are estimated, the second ones should be chosen.

    See the Hill estimator in extreme value theory. X has a Pareto distribution above some threshold u if

    $$\mathbb{P}[X > x\mid X > u] = \Big(\frac{u}{x}\Big)^{1/\xi} \quad\text{for } x > u.$$

    Given a sample $\boldsymbol{x}$, consider the Pareto QQ-plot, i.e. the scatterplot

    $$\Big\{ -\log\Big(1 - \frac{i}{n+1}\Big),\ \log x_{i:n} \Big\}_{i=n-k,\dots,n}$$

    for points exceeding $X_{n-k:n}$.

    @freakonometrics 14

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    The slope is $\xi$, i.e.

    $$\log X_{n-i+1:n} \simeq \log X_{n-k:n} + \xi\Big(\log\frac{n+1}{i} - \log\frac{n+1}{k+1}\Big)$$

    Hence, consider the estimator $\widehat{\xi}_k = \frac{1}{k}\sum_{i=0}^{k-1}\big[\log x_{n-i:n} - \log x_{n-k:n}\big]$.

    Standard mean-variance tradeoff,

    k large: bias too large, variance too small

    k small: variance too large, bias too small

    @freakonometrics 15
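    A minimal numpy sketch (not from the slides) of the Hill estimator above, on a simulated Pareto sample; the true tail index $\xi = 0.5$ and the choices of k are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, xi = 1000, 0.5
x = rng.pareto(1 / xi, n) + 1          # Pareto sample: P(X > x) = x^(-1/xi) for x >= 1
xs = np.sort(x)

def hill(xs_sorted, k):
    """Hill estimator based on the k largest order statistics."""
    top = xs_sorted[-(k + 1):]          # x_{n-k:n}, ..., x_{n:n}
    return np.mean(np.log(top[1:]) - np.log(top[0]))

for k in (20, 100, 500):
    print(k, round(hill(xs, k), 3))     # estimates of xi for several choices of k
```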

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    The same holds in kernel regression, with bandwidth h (length of the neighborhood),

    $$\widehat{m}_h(x) = \frac{\sum_{i=1}^n K_h(x - x_i)\, y_i}{\sum_{i=1}^n K_h(x - x_i)}$$

    for some kernel $K(\cdot)$.

    Standard mean-variance tradeoff,

    h large: bias too large, variance too small

    h small: variance too large, bias too small

    @freakonometrics 16
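    A minimal sketch of the Nadaraya-Watson estimator above, in Python with a Gaussian kernel on simulated data (the sine regression function is an assumption for illustration); a small h gives a wiggly estimate, a large h an oversmoothed one.

```python
import numpy as np

def nw(x0, x, y, h):
    """Nadaraya-Watson estimate at the points x0 (Gaussian kernel, bandwidth h)."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)
grid = np.linspace(0, 10, 101)
for h in (0.1, 1.0, 5.0):
    print(h, np.round(nw(grid, x, y, h)[:5], 2))   # first few fitted values for each bandwidth
```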

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    More generally, we estimate $\widehat{\theta}_h$ or $\widehat{m}_h(\cdot)$. Use the mean squared error for $\widehat{\theta}_h$,

    $$\mathbb{E}\big[(\theta - \widehat{\theta}_h)^2\big]$$

    or the mean integrated squared error for $\widehat{m}_h(\cdot)$,

    $$\mathbb{E}\Big[\int \big(m(x) - \widehat{m}_h(x)\big)^2\, dx\Big]$$

    In statistics, derive an asymptotic expression for these quantities, and find the $h^\star$ that minimizes them.

    @freakonometrics 17

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    In classical statistics, the MISE can be approximated by

    $$\frac{h^4}{4}\Big(\int x^2 K(x)\,dx\Big)^2 \int \Big(m''(x) + 2 m'(x)\frac{f'(x)}{f(x)}\Big)^2 dx \;+\; \frac{1}{nh}\int K^2(x)\,dx \int \frac{dx}{f(x)}$$

    where f is the density of the x's. Thus the optimal h is

    $$h^\star = n^{-1/5}\left(\frac{\displaystyle\int K^2(x)\,dx \int \frac{dx}{f(x)}}{\displaystyle\Big(\int x^2K(x)\,dx\Big)^2 \int\Big(m''(x)+2m'(x)\frac{f'(x)}{f(x)}\Big)^2 dx}\right)^{1/5}$$

    (hard to get a simple rule of thumb... up to a constant, $h^\star \sim n^{-1/5}$)

    In statistical learning, use the bootstrap, or cross-validation, to get an optimal h...

    @freakonometrics 18
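    A minimal sketch of bandwidth selection by leave-one-out cross-validation for the kernel regression above (simulated data, Gaussian kernel; the grid of bandwidths is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

def loo_cv(h):
    """Leave-one-out cross-validated squared error for bandwidth h."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)                     # observation i is not used to predict y_i
    m_loo = (w * y).sum(axis=1) / w.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

hs = np.linspace(0.05, 3, 60)
print("h* (LOO-CV):", round(hs[int(np.argmin([loo_cv(h) for h in hs]))], 3))
```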

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Randomization is too important to be left to chance!

    Consider some sample $\boldsymbol{x} = (x_1,\dots,x_n)$ and some statistic $\widehat{\theta}$. Set $\widehat{\theta}_n = \widehat{\theta}(\boldsymbol{x})$.

    The jackknife is used to reduce bias: set $\widehat{\theta}_{(i)} = \widehat{\theta}(\boldsymbol{x}_{(i)})$, and $\widetilde{\theta} = \frac{1}{n}\sum_{i=1}^n \widehat{\theta}_{(i)}$.

    If $\mathbb{E}(\widehat{\theta}_n) = \theta + O(n^{-1})$ then $\mathbb{E}(\widetilde{\theta}_n) = \theta + O(n^{-2})$.

    See also leave-one-out cross-validation, for $\widehat{m}(\cdot)$,

    $$\text{mse} = \frac{1}{n}\sum_{i=1}^n \big[y_i - \widehat{m}_{(i)}(x_i)\big]^2$$

    The bootstrap estimate is based on bootstrap samples: set $\widehat{\theta}_{(b)} = \widehat{\theta}(\boldsymbol{x}^{(b)})$, and $\widetilde{\theta} = \frac{1}{B}\sum_{b=1}^B \widehat{\theta}_{(b)}$, where $\boldsymbol{x}^{(b)}$ is a vector of size n whose values are drawn from $\{x_1,\dots,x_n\}$, with replacement. And then use the law of large numbers...

    See Efron (1979).

    @freakonometrics 19

    http://www.stat.cmu.edu/~fienberg/Statistics36-756/Efron1979.pdf
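    A minimal numpy sketch of the jackknife and the bootstrap for a simple statistic (the sample mean of simulated exponential data, chosen purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(2.0, 100)

# Jackknife: recompute the statistic leaving one observation out at a time
theta_i = np.array([np.mean(np.delete(x, i)) for i in range(x.size)])
jk_se = np.sqrt((x.size - 1) * np.mean((theta_i - theta_i.mean()) ** 2))

# Bootstrap: recompute the statistic on B samples drawn with replacement
B = 2000
theta_b = np.array([np.mean(rng.choice(x, x.size, replace=True)) for _ in range(B)])

print("estimate:", round(x.mean(), 3), " jackknife s.e.:", round(jk_se, 3),
      " bootstrap s.e.:", round(theta_b.std(ddof=1), 3))
```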

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Statistical Learning and Philosophical Issues

    From $(y_i,\boldsymbol{x}_i)$, there are different stories behind, see Freedman (2005):

    the causal story: $x_{j,i}$ is usually considered as independent of the other covariates $x_{k,i}$. For all possible $\boldsymbol{x}$, that value is mapped to $m(\boldsymbol{x})$ and a noise is attached, $\varepsilon$. The goal is to recover $m(\cdot)$, and the residuals are just the difference between the response value and $m(\boldsymbol{x})$.

    the conditional distribution story: for a linear model, we usually say that Y given $\boldsymbol{X} = \boldsymbol{x}$ is a $\mathcal{N}(m(\boldsymbol{x}), \sigma^2)$ distribution. $m(\boldsymbol{x})$ is then the conditional mean. Here $m(\cdot)$ is assumed to really exist, but no causal assumption is made, only a conditional one.

    the explanatory data story: there is no model, just data. We simply want to summarize the information contained in the x's to get an accurate summary, close to the response (i.e. $\min\{\sum_i \ell(y_i, m(\boldsymbol{x}_i))\}$) for some loss function $\ell$.

    @freakonometrics 20

    http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Machine Learning vs. Statistical Modeling

    In machine learning, given some dataset $(\boldsymbol{x}_i, y_i)$, solve

    $$\widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmin}} \left\{ \sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i)) \right\}$$

    for some loss function $\ell(\cdot,\cdot)$.

    In statistical modeling, given some probability space $(\Omega,\mathcal{A},\mathbb{P})$, assume that the $y_i$'s are realizations of i.i.d. variables $Y_i$ (given $\boldsymbol{X}_i = \boldsymbol{x}_i$) with distribution $F_i$. Then solve

    $$\widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmax}} \big\{ \log\mathcal{L}(m(\boldsymbol{x});\boldsymbol{y}) \big\} = \underset{m(\cdot)\in\mathcal{F}}{\text{argmax}} \left\{ \sum_{i=1}^n \log f(y_i; m(\boldsymbol{x}_i)) \right\}$$

    where $\log\mathcal{L}$ denotes the log-likelihood.

    @freakonometrics 21

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Loss Functions

    Fitting criteria are based on loss functions (also called cost functions). For a quantitative response, a popular one is the quadratic loss, $\ell(y, m(\boldsymbol{x})) = [y - m(\boldsymbol{x})]^2$.

    Recall that

    $$\mathbb{E}(Y) = \underset{m\in\mathbb{R}}{\text{argmin}}\big\{\|Y - m\|_{\ell_2}\big\} = \underset{m\in\mathbb{R}}{\text{argmin}}\big\{\mathbb{E}\big([Y-m]^2\big)\big\}$$

    $$\text{Var}(Y) = \underset{m\in\mathbb{R}}{\min}\big\{\mathbb{E}\big([Y-m]^2\big)\big\} = \mathbb{E}\big([Y - \mathbb{E}(Y)]^2\big)$$

    The empirical version is

    $$\overline{y} = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n \frac{1}{n}[y_i - m]^2\right\} \qquad s^2 = \underset{m\in\mathbb{R}}{\min}\left\{\sum_{i=1}^n \frac{1}{n}[y_i - m]^2\right\} = \sum_{i=1}^n \frac{1}{n}[y_i - \overline{y}]^2$$

    @freakonometrics 22

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Model Evaluation

    In linear models, the $R^2$ is defined as the proportion of the variance of the response y that can be obtained using the predictors.

    But maximizing the $R^2$ usually yields overfit (or unjustified optimism, in Berk (2008)).

    In linear models, consider the adjusted $R^2$,

    $$\overline{R}^2 = 1 - [1 - R^2]\,\frac{n-1}{n-p-1}$$

    where p is the number of parameters, or more generally $\text{trace}(\boldsymbol{S})$ when some smoothing matrix is considered,

    $$\widehat{y} = \widehat{m}(\boldsymbol{x}) = \sum_{i=1}^n \boldsymbol{S}_{\boldsymbol{x},i}\, y_i = \boldsymbol{S}_{\boldsymbol{x}}^T \boldsymbol{y}$$

    where $\boldsymbol{S}_{\boldsymbol{x}}$ is some vector of weights (called the smoother vector), related to an $n\times n$ smoother matrix, $\widehat{\boldsymbol{y}} = \boldsymbol{S}\boldsymbol{y}$, where prediction is done at the points $\boldsymbol{x}_i$'s.

    @freakonometrics 23

    http://www.springer.com/us/book/9780387775005

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Model Evaluation

    Alternatives are based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), based on a penalty imposed on some criterion (the logarithm of the variance of the residuals),

    $$AIC = \log\left(\frac{1}{n}\sum_{i=1}^n [y_i - \widehat{y}_i]^2\right) + \frac{2p}{n}$$

    $$BIC = \log\left(\frac{1}{n}\sum_{i=1}^n [y_i - \widehat{y}_i]^2\right) + \log(n)\,\frac{p}{n}$$

    In a more general context, replace p by $\text{trace}(\boldsymbol{S})$.

    @freakonometrics 24

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Model Evaluation

    One can also consider the expected prediction error (with a probabilistic model),

    $$\mathbb{E}\big[\ell(Y, \widehat{m}(\boldsymbol{X}))\big]$$

    We cannot claim (using the law of large numbers) that

    $$\frac{1}{n}\sum_{i=1}^n \ell(y_i, \widehat{m}(\boldsymbol{x}_i)) \overset{a.s.}{\longrightarrow} \mathbb{E}\big[\ell(Y, \widehat{m}(\boldsymbol{X}))\big]$$

    since $\widehat{m}$ depends on the $(y_i,\boldsymbol{x}_i)$'s.

    Natural option: use two (random) samples, a training one and a validation one.

    Alternative options: use cross-validation, leave-one-out or k-fold.

    @freakonometrics 25
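    A minimal sketch of the point above: the in-sample error underestimates the prediction error, while a k-fold cross-validated error does not (simulated linear data and scikit-learn, purely as an illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 300)

model = LinearRegression().fit(X, y)
print("in-sample MSE :", round(np.mean((y - model.predict(X)) ** 2), 3))

errors = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    m = LinearRegression().fit(X[train], y[train])
    errors.append(np.mean((y[test] - m.predict(X[test])) ** 2))
print("10-fold CV MSE:", round(np.mean(errors), 3))
```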

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Underfit / Overfit and Variance - Mean Tradeoff

    Goal in predictive modeling: reduce uncertainty in our predictions.

    Need more data to get better knowledge.

    Unfortunately, reducing the error of the prediction on a dataset does not generally give a good generalization performance:

    we need a training and a validation dataset.

    @freakonometrics 26

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Overfit, Training vs. Validation and Complexity (Vapnik Dimension)

    complexity ↔ polynomial degree

    @freakonometrics 27

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Overfit, Training vs. Validation and Complexity (Vapnik Dimension)

    complexity ↔ number of neighbors (k)

    @freakonometrics 28

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Logistic Regression

    Assume that $\mathbb{P}(Y_i = 1) = \pi_i$,

    $$\text{logit}(\pi_i) = \boldsymbol{x}_i^T\boldsymbol{\beta}, \quad\text{where}\quad \text{logit}(\pi_i) = \log\Big(\frac{\pi_i}{1-\pi_i}\Big),$$

    or

    $$\pi_i = \text{logit}^{-1}(\boldsymbol{x}_i^T\boldsymbol{\beta}) = \frac{\exp[\boldsymbol{x}_i^T\boldsymbol{\beta}]}{1 + \exp[\boldsymbol{x}_i^T\boldsymbol{\beta}]}.$$

    The log-likelihood is

    $$\log\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^n y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i) = \sum_{i=1}^n y_i\log(\pi_i(\boldsymbol{\beta})) + (1-y_i)\log(1-\pi_i(\boldsymbol{\beta}))$$

    and the first-order conditions are solved numerically,

    $$\frac{\partial \log\mathcal{L}(\boldsymbol{\beta})}{\partial \beta_k} = \sum_{i=1}^n X_{k,i}\,[y_i - \pi_i(\boldsymbol{\beta})] = 0.$$

    @freakonometrics 29
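    A minimal numpy sketch of solving those first-order conditions numerically, here with Newton-Raphson (equivalently, iteratively reweighted least squares) on simulated data; the true coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + two covariates
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(3)
for _ in range(25):                          # Newton-Raphson iterations
    pi = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - pi)                   # sum_i x_i [y_i - pi_i(beta)]
    hessian = X.T @ (X * (pi * (1 - pi))[:, None])
    beta = beta + np.linalg.solve(hessian, score)

print("beta_hat:", np.round(beta, 3))
```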

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Predictive Classifier

    To go from a score

    $$s(\boldsymbol{x}) = \frac{\exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}{1 + \exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}$$

    to a class: if $s(\boldsymbol{x}) > s$, then $\widehat{Y}(\boldsymbol{x}) = 1$, and if $s(\boldsymbol{x}) \le s$, then $\widehat{Y}(\boldsymbol{x}) = 0$.

    Plot $TP(s) = \mathbb{P}[\widehat{Y} = 1\mid Y = 1]$ against $FP(s) = \mathbb{P}[\widehat{Y} = 1\mid Y = 0]$: the ROC curve.

    @freakonometrics 30
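    A minimal sketch of the ROC construction above: sweep the threshold s, compute the empirical TP and FP rates, and integrate to get the AUC (the score below is simulated, purely for illustration).

```python
import numpy as np

def roc_points(y, score, thresholds):
    """Empirical TP(s) and FP(s) when predicting Y_hat = 1 whenever score > s."""
    tpr = np.array([(score[y == 1] > s).mean() for s in thresholds])
    fpr = np.array([(score[y == 0] > s).mean() for s in thresholds])
    return fpr, tpr

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.4, 1000)
score = np.clip(0.4 * y + rng.normal(0.3, 0.2, 1000), 0, 1)   # informative but noisy score
fpr, tpr = roc_points(y, score, np.linspace(0, 1, 101))
print("AUC ~", round(-np.trapz(tpr, fpr), 3))                 # minus sign: FPR decreases along the grid
```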

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Why a Logistic and not a Probit Regression?

    Bliss (1934) suggested a model such that

    $$\mathbb{P}(Y = 1\mid \boldsymbol{X} = \boldsymbol{x}) = H(\boldsymbol{x}^T\boldsymbol{\beta}) \quad\text{where}\quad H(\cdot) = \Phi(\cdot),$$

    the c.d.f. of the $\mathcal{N}(0,1)$ distribution. This is the probit model. This yields a latent model, $y_i = \mathbf{1}(y_i^\star > 0)$, where

    $y_i^\star = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$ is a non-observable score.

    In the logistic regression, we model the odds ratio,

    $$\frac{\mathbb{P}(Y = 1\mid \boldsymbol{X} = \boldsymbol{x})}{\mathbb{P}(Y \neq 1\mid \boldsymbol{X} = \boldsymbol{x})} = \exp[\boldsymbol{x}^T\boldsymbol{\beta}]$$

    $$\mathbb{P}(Y = 1\mid \boldsymbol{X} = \boldsymbol{x}) = H(\boldsymbol{x}^T\boldsymbol{\beta}) \quad\text{where}\quad H(\cdot) = \frac{\exp[\cdot]}{1 + \exp[\cdot]},$$

    which is the c.d.f. of the logistic variable, see Verhulst (1845).

    @freakonometrics 31

    http://www.sciencemag.org/content/79/2037/38
    http://gdz.sub.uni-goettingen.de/dms/load/img/?PPN=PPN129323640_0018&DMDID=dmdlog7

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    k-Nearest Neighbors (a.k.a. k-NN)

    In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. (Source: wikipedia).

    $$\mathbb{E}[Y\mid \boldsymbol{X} = \boldsymbol{x}] \approx \frac{1}{k}\sum_{i:\ d(\boldsymbol{x}_i,\boldsymbol{x})\ \text{small}} y_i$$

    For k-Nearest Neighbors, the class is usually the majority vote of the k closest neighbors of $\boldsymbol{x}$.

    Distance $d(\cdot,\cdot)$ should not be sensitive to units: normalize by the standard deviation.

    [Figure: k-NN classification regions on the (PVENT, REPUL) scatterplot]

    @freakonometrics 32

    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
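    A minimal scikit-learn sketch of the remark on units: with two simulated features on very different scales (loosely mimicking variables such as PVENT and REPUL), k-NN behaves much better once the features are standardized.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = np.column_stack([rng.uniform(0, 20, 500), rng.uniform(500, 3000, 500)])
y = 0.3 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.5, 500)

for scale in (False, True):
    Z = StandardScaler().fit_transform(X) if scale else X     # normalize each feature
    knn = KNeighborsRegressor(n_neighbors=10).fit(Z[:400], y[:400])
    mse = np.mean((y[400:] - knn.predict(Z[400:])) ** 2)
    print("scaled:" if scale else "raw:   ", round(mse, 3))
```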

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    k-Nearest Neighbors and Curse of Dimensionality

    The higher the dimension, the larger the distance to the closest neighbor,

    $$\min_{i\in\{1,\dots,n\}}\big\{d(\boldsymbol{a},\boldsymbol{x}_i)\big\}, \quad \boldsymbol{x}_i\in\mathbb{R}^d.$$

    [Figure: boxplots of the distance to the closest neighbor, for dimensions 1 to 5, with n = 10 (left) and n = 100 (right)]

    @freakonometrics 33

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Classification (and Regression) Trees, CART

    "one of the predictive modelling approaches used in statistics, data mining and machine learning [...] In tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels." (Source: wikipedia).

    [Figure: classification tree partition of the (PVENT, REPUL) scatterplot]

    @freakonometrics 34

    https://en.wikipedia.org/wiki/Decision_tree_learning

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Bagging

    Bootstrapped Aggregation (Bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification (Source: wikipedia).

    It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [bootstrap]. The predictions from each separate model are combined together to provide a superior result [aggregation].

    It can be used on any kind of model, but it is especially interesting for trees, see Breiman (1996).

    The bootstrap can be used to define the concept of margin,

    $$\text{margin}_i = \frac{1}{B}\sum_{b=1}^B \mathbf{1}(\widehat{y}_i = y_i) - \frac{1}{B}\sum_{b=1}^B \mathbf{1}(\widehat{y}_i \neq y_i)$$

    Remark: the probability that the i-th row is not selected is $(1 - n^{-1})^n \to e^{-1} \approx 36.8\%$, cf. training / validation samples (2/3 - 1/3).

    @freakonometrics 35

    https://en.wikipedia.org/wiki/Bootstrap_aggregating
    http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
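    A minimal sketch of bagging trees by hand (bootstrap resampling + aggregation of the predictions), on simulated data with scikit-learn regression trees; constants such as B = 100 and max_depth = 5 are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
def make(n):
    x = rng.uniform(0, 10, n)
    return x[:, None], np.sin(x) + rng.normal(0, 0.3, n)

Xtr, ytr = make(300)
Xte, yte = make(300)

B = 100
preds = np.zeros((B, yte.size))
for b in range(B):
    idx = rng.integers(0, ytr.size, ytr.size)        # bootstrap resample of the training set
    tree = DecisionTreeRegressor(max_depth=5).fit(Xtr[idx], ytr[idx])
    preds[b] = tree.predict(Xte)

print("single tree test MSE:", round(np.mean((yte - preds[0]) ** 2), 3))
print("bagged trees test MSE:", round(np.mean((yte - preds.mean(axis=0)) ** 2), 3))
```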

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Random Forests

    Strictly speaking, when bootstrapping among observations, and aggregating, we use a bagging algorithm. In the random forest algorithm, we combine Breiman's bagging idea and the random selection of features, introduced independently by Ho (1995) and Amit & Geman (1997).

    [Figure: random forest classification regions on the (PVENT, REPUL) scatterplot]

    @freakonometrics 36

    http://cm.bell-labs.com/cm/cs/who/tkh/papers/odt.pdf
    http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Bagging, Forests, and Variable Importance

    Given some random forest with M trees, set

    $$VI(X_k) = \frac{1}{M}\sum_{m}\sum_{t}\frac{N_t}{N}\,\Delta i(t)$$

    where the first sum is over all trees, and the second one is over all nodes where the split is done based on variable $X_k$ ($\Delta i(t)$ being the decrease in impurity at node t). But this is difficult to interpret with correlated features. Consider the model $y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \varepsilon$, and a model based on $\boldsymbol{x} = (x_1, x_2, x_3)$ where $x_1$ and $x_2$ are correlated. Compare AIC vs. variable importance as a function of r.

    @freakonometrics 37
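    A minimal scikit-learn sketch of the issue above: simulate $y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \varepsilon$ with $x_2$ correlated with $x_1$ but irrelevant, and look at the random-forest variable importances (the coefficients and the correlation r = 0.9 are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
n, r = 2000, 0.9
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r ** 2) * rng.normal(size=n)   # correlated with x1, no direct effect
x3 = rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x3 + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(np.column_stack([x1, x2, x3]), y)
print("importances (x1, x2, x3):", np.round(rf.feature_importances_, 3))
# part of the importance of x1 leaks to the correlated (but irrelevant) x2
```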

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machine

    SVMs were developed in the 90s based on previous work, from Vapnik & Lerner (1963), see Valiant (1984). Assume that the points are linearly separable, i.e. there exist $\boldsymbol{\omega}$ and b such that

    $$Y = \begin{cases} +1 & \text{if } \boldsymbol{\omega}^T\boldsymbol{x} + b > 0 \\ -1 & \text{if } \boldsymbol{\omega}^T\boldsymbol{x} + b < 0 \end{cases}$$

    Problem: there is an infinite number of solutions; we need a good one, that separates the data, (somehow) far from the data.

    The distance from $\boldsymbol{x}_0$ to the hyperplane $H_{\boldsymbol{\omega},b}$ is

    $$d(\boldsymbol{x}_0, H_{\boldsymbol{\omega},b}) = \frac{\boldsymbol{\omega}^T\boldsymbol{x}_0 + b}{\|\boldsymbol{\omega}\|}$$

    @freakonometrics 38

    http://www.cs.iastate.edu/~cs573x/vapnik-portraits1963.pdf
    https://people.mpi-inf.mpg.de/~mehlhorn/SeminarEvolvability/ValiantLearnable.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machine

    Define the support vectors as the observations such that

    $$|\boldsymbol{\omega}^T\boldsymbol{x}_i + b| = 1$$

    The margin is the distance between the hyperplanes defined by the support vectors.

    The distance from the support vectors to $H_{\boldsymbol{\omega},b}$ is $\|\boldsymbol{\omega}\|^{-1}$, and the margin is then $2\|\boldsymbol{\omega}\|^{-1}$.

    The algorithm is to minimize the inverse of the margin, s.t. $H_{\boldsymbol{\omega},b}$ separates the ±1 points, i.e.

    $$\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega}\right\} \quad\text{s.t.}\quad y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i + b) \ge 1,\ \forall i.$$

    @freakonometrics 39

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machine

    Now, what about the non-separable case?

    Here, we cannot have $y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i + b) \ge 1$ for all i.

    Introduce slack variables $\xi_i$,

    $$\begin{cases} \boldsymbol{\omega}^T\boldsymbol{x}_i + b \ge +1 - \xi_i & \text{when } y_i = +1 \\ \boldsymbol{\omega}^T\boldsymbol{x}_i + b \le -1 + \xi_i & \text{when } y_i = -1 \end{cases}$$

    where $\xi_i \ge 0$ for all i. There is a classification error when $\xi_i > 1$.

    The idea is then to solve

    $$\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + C\,\mathbf{1}^T\mathbf{1}_{\boldsymbol{\xi}>1}\right\} \quad\text{instead of}\quad \min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega}\right\}$$

    @freakonometrics 40

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machines, with a Linear Kernel

    So far, $d(\boldsymbol{x}_0, H_{\boldsymbol{\omega},b}) = \min_{\boldsymbol{x}\in H_{\boldsymbol{\omega},b}}\{\|\boldsymbol{x}_0 - \boldsymbol{x}\|_{\ell_2}\}$

    where $\|\cdot\|_{\ell_2}$ is the Euclidean ($\ell_2$) norm,

    $$\|\boldsymbol{x}_0 - \boldsymbol{x}\|_{\ell_2} = \sqrt{(\boldsymbol{x}_0 - \boldsymbol{x})\cdot(\boldsymbol{x}_0 - \boldsymbol{x})} = \sqrt{\boldsymbol{x}_0\cdot\boldsymbol{x}_0 - 2\,\boldsymbol{x}_0\cdot\boldsymbol{x} + \boldsymbol{x}\cdot\boldsymbol{x}}$$

    More generally, $d(\boldsymbol{x}_0, H_{\boldsymbol{\omega},b}) = \min_{\boldsymbol{x}\in H_{\boldsymbol{\omega},b}}\{\|\boldsymbol{x}_0 - \boldsymbol{x}\|_k\}$

    where $\|\cdot\|_k$ is some kernel-based norm,

    $$\|\boldsymbol{x}_0 - \boldsymbol{x}\|_k = \sqrt{k(\boldsymbol{x}_0,\boldsymbol{x}_0) - 2\,k(\boldsymbol{x}_0,\boldsymbol{x}) + k(\boldsymbol{x},\boldsymbol{x})}$$

    [Figures: SVM classification of the (PVENT, REPUL) scatterplot, with a linear kernel and with a non-linear kernel]

    @freakonometrics 41
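    A minimal scikit-learn sketch contrasting a linear and a kernel-based SVM on simulated data with a circular frontier (hence not linearly separable); C = 1 and the RBF kernel are default-style choices, not taken from the slides.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2).astype(int)      # circular decision frontier

for kernel in ("linear", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0)).fit(X[:300], y[:300])
    print(kernel, "test accuracy:", round(clf.score(X[300:], y[300:]), 3))
```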

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regression?

    In statistics, regression analysis is a statistical process for estimating the relationships among variables [...] In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: wikipedia).

    Here regression is opposed to classification (as in the CART algorithm): y is either a continuous variable, $y\in\mathbb{R}$, or a counting variable, $y\in\mathbb{N}$.

    In many cases in the econometric and actuarial literature we simply want a good fit for the conditional expectation, $\mathbb{E}[Y\mid \boldsymbol{X} = \boldsymbol{x}]$.

    @freakonometrics 42

    https://en.wikipedia.org/wiki/Regression_analysis

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Linear, Non-Linear and Generalized Linear

    $$(Y\mid \boldsymbol{X}=\boldsymbol{x})\sim\mathcal{N}(\mu_{\boldsymbol{x}},\sigma^2) \qquad (Y\mid \boldsymbol{X}=\boldsymbol{x})\sim\mathcal{N}(\mu_{\boldsymbol{x}},\sigma^2) \qquad (Y\mid \boldsymbol{X}=\boldsymbol{x})\sim\mathcal{L}(\theta_{\boldsymbol{x}},\varphi)$$

    $$\mathbb{E}[Y\mid \boldsymbol{X}=\boldsymbol{x}]=\boldsymbol{x}^T\boldsymbol{\beta} \qquad\quad \mathbb{E}[Y\mid \boldsymbol{X}=\boldsymbol{x}]=h(\boldsymbol{x}) \qquad\quad \mathbb{E}[Y\mid \boldsymbol{X}=\boldsymbol{x}]=h(\boldsymbol{x}^T\boldsymbol{\beta})$$

    @freakonometrics 43

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regression Smoothers, natura non facit saltus

    In statistical learning procedures, a key role is played by basis functions. We will see that it is common to assume that

    $$m(\boldsymbol{x}) = \sum_{j=0}^{k}\beta_j\, h_j(\boldsymbol{x}),$$

    where $h_0$ is usually a constant function and the $h_j$'s are basis functions.

    For instance, $h_j(x) = x^j$ for a polynomial expansion with a single predictor, or $h_j(x) = (x - s_j)_+$ for some knots $s_j$'s (for linear splines, but one can consider quadratic or cubic ones).

    @freakonometrics 44

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regression Smoothers: Polynomial or Spline

    Stone-Weierstrass theorem: every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.

    One can also use spline functions, e.g. piecewise linear ones,

    $$h(x) = \beta_0 + \sum_{j=1}^k \beta_j\,(x - s_j)_+$$

    @freakonometrics 45

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Linear Model

    Consider some linear model $y_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$ for all $i = 1,\dots,n$.

    Assume that the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}(\varepsilon) = 0$ (and finite variance). Write

    $$\underbrace{\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}}_{\boldsymbol{y},\ n\times 1} = \underbrace{\begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{pmatrix}}_{\boldsymbol{X},\ n\times(k+1)} \underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}}_{\boldsymbol{\beta},\ (k+1)\times 1} + \underbrace{\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{\boldsymbol{\varepsilon},\ n\times 1}.$$

    Assuming $\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\sigma^2\mathbb{I})$, the maximum likelihood estimator of $\boldsymbol{\beta}$ is

    $$\widehat{\boldsymbol{\beta}} = \text{argmin}\big\{\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$

    ... under the assumption that $\boldsymbol{X}^T\boldsymbol{X}$ is a full-rank matrix.

    What if $\boldsymbol{X}^T\boldsymbol{X}$ cannot be inverted? Then $\widehat{\boldsymbol{\beta}} = [\boldsymbol{X}^T\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ does not exist, but $\widehat{\boldsymbol{\beta}}_\lambda = [\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ always exists if $\lambda > 0$.

    @freakonometrics 46

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Ridge Regression

    The estimator $\widehat{\boldsymbol{\beta}}_\lambda = [\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ is the Ridge estimate, obtained as the solution of

    $$\widehat{\boldsymbol{\beta}}_\lambda = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i=1}^n \big[y_i - \beta_0 - \boldsymbol{x}_i^T\boldsymbol{\beta}\big]^2 + \lambda\,\|\boldsymbol{\beta}\|_{\ell_2}^2\right\}$$

    for some tuning parameter $\lambda$. One can also write

    $$\widehat{\boldsymbol{\beta}}_\lambda = \underset{\boldsymbol{\beta};\ \|\boldsymbol{\beta}\|_{\ell_2}\le s}{\text{argmin}}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\}$$

    Remark: note that we solve $\widehat{\boldsymbol{\beta}} = \text{argmin}\{\text{objective}(\boldsymbol{\beta})\}$ where

    $$\text{objective}(\boldsymbol{\beta}) = \underbrace{\mathcal{L}(\boldsymbol{\beta})}_{\text{training loss}} + \underbrace{\mathcal{R}(\boldsymbol{\beta})}_{\text{regularization}}$$

    @freakonometrics 47
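    A minimal numpy/scikit-learn sketch of the Ridge estimate above: with a duplicated column, $\boldsymbol{X}^T\boldsymbol{X}$ is singular, yet $[\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ is well defined and matches scikit-learn's Ridge (no intercept, $\lambda = 1$ chosen arbitrarily).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(10)
n, k = 50, 5
X = rng.normal(size=(n, k))
X[:, 4] = X[:, 3]                        # duplicated column: X'X is singular, OLS is not defined
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, n)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)   # [X'X + lambda I]^{-1} X'y
skl = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.round(beta_ridge, 3))
print(np.round(skl.coef_, 3))            # same solution
```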

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Going further on sparsity issues

    In several applications, k can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many j's. Let s denote the number of relevant features, with $s \ll k$.

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Going further on sparsity issues

    Define $\|\boldsymbol{a}\|_{\ell_0} = \sum \mathbf{1}(|a_i| > 0)$. Here, $\dim(\boldsymbol{\beta}) = s$.

    We wish we could solve

    $$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\ \|\boldsymbol{\beta}\|_{\ell_0}\le s}{\text{argmin}}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\}$$

    Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with k (very) large).

    Idea: solve the dual problem

    $$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\ \|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\le h}{\text{argmin}}\big\{\|\boldsymbol{\beta}\|_{\ell_0}\big\}$$

    where we might convexify the $\ell_0$ norm, $\|\cdot\|_{\ell_0}$.

    @freakonometrics 49

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regularization: $\ell_0$, $\ell_1$ and $\ell_2$

    @freakonometrics 50

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Going further on sparsity issues

    On $[-1,+1]^k$, the convex hull of $\|\cdot\|_{\ell_0}$ is $\|\cdot\|_{\ell_1}$. On $[-a,+a]^k$, the convex hull of $\|\cdot\|_{\ell_0}$ is $a^{-1}\|\cdot\|_{\ell_1}$. Hence,

    $$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\ \|\boldsymbol{\beta}\|_{\ell_1}\le s}{\text{argmin}}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\}$$

    is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem

    $$\widehat{\boldsymbol{\beta}} = \text{argmin}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2} + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\big\}$$

    @freakonometrics 51

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    LASSO Least Absolute Shrinkage and Selection Operator

    $$\widehat{\boldsymbol{\beta}} = \text{argmin}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2} + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\big\}$$

    is a convex problem (several algorithms [*]), but not strictly convex (no uniqueness of the minimum). Nevertheless, predictions $\widehat{y} = \boldsymbol{x}^T\widehat{\boldsymbol{\beta}}$ are unique.

    [*] MM (majorize-minimize), coordinate descent, see Hunter (2003).

    @freakonometrics 52

    http://sites.stat.psu.edu/~dhunter/papers/mmtutorial.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Optimal LASSO Penalty

    Use cross-validation, e.g. K-fold,

    $$\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda) = \text{argmin}\left\{\sum_{i\notin I_k}\big[y_i - \boldsymbol{x}_i^T\boldsymbol{\beta}\big]^2 + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\right\}$$

    then compute the sum of squared errors,

    $$Q_k(\lambda) = \sum_{i\in I_k}\big[y_i - \boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda)\big]^2$$

    and finally solve

    $$\lambda^\star = \text{argmin}\left\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\right\}$$

    Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) suggest the largest $\lambda$ such that

    $$\overline{Q}(\lambda) \le \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star] \quad\text{with}\quad \text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K \big[Q_k(\lambda) - \overline{Q}(\lambda)\big]^2$$

    http://statweb.stanford.edu/~tibs/ElemStatLearn/
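    A minimal scikit-learn sketch of the K-fold procedure above, including the one-standard-error rule, on simulated sparse data; the grid of $\lambda$'s, K = 10 and the 3 relevant coefficients are assumptions for illustration (and per-fold mean squared errors are used in place of sums).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(11)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]          # only 3 relevant features
y = X @ beta + rng.normal(0, 1, n)

lambdas, K = np.logspace(-3, 0.5, 30), 10
Q = np.zeros((K, lambdas.size))
for j, lam in enumerate(lambdas):
    for fold, (tr, te) in enumerate(KFold(K, shuffle=True, random_state=0).split(X)):
        m = Lasso(alpha=lam, max_iter=10000).fit(X[tr], y[tr])
        Q[fold, j] = np.mean((y[te] - m.predict(X[te])) ** 2)

Qbar, se = Q.mean(axis=0), Q.std(axis=0) / np.sqrt(K)
j_star = int(np.argmin(Qbar))
j_1se = int(np.where(Qbar <= Qbar[j_star] + se[j_star])[0].max())   # largest lambda within one s.e.
print("lambda*:", round(lambdas[j_star], 4), " one-s.e. lambda:", round(lambdas[j_1se], 4))
print("nonzero coefficients:", int(np.sum(Lasso(alpha=lambdas[j_1se]).fit(X, y).coef_ != 0)))
```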

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    Boosting is a machine learning ensemble meta-algorithm for reducing bias primarily, and also variance, in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones. (Source: wikipedia)

    The heuristic is simple: we consider an iterative process where we keep modeling the errors.

    Fit a model for y, $m_1(\cdot)$, from $\boldsymbol{y}$ and $\boldsymbol{X}$, and compute the error, $\boldsymbol{\varepsilon}_1 = \boldsymbol{y} - m_1(\boldsymbol{X})$.

    Fit a model for $\boldsymbol{\varepsilon}_1$, $m_2(\cdot)$, from $\boldsymbol{\varepsilon}_1$ and $\boldsymbol{X}$, and compute the error, $\boldsymbol{\varepsilon}_2 = \boldsymbol{\varepsilon}_1 - m_2(\boldsymbol{X})$, etc. Then set

    $$m(\cdot) = \underbrace{m_1(\cdot)}_{\sim \boldsymbol{y}} + \underbrace{m_2(\cdot)}_{\sim \boldsymbol{\varepsilon}_1} + \underbrace{m_3(\cdot)}_{\sim \boldsymbol{\varepsilon}_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim \boldsymbol{\varepsilon}_{k-1}}$$

    @freakonometrics 54

    https://en.wikipedia.org/wiki/Boosting_(machine_learning)
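    A minimal sketch of this residual-fitting heuristic with weak learners (depth-1 regression stumps from scikit-learn) and a shrinkage parameter; the simulated data, the shrinkage value 0.1 and the 200 iterations are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12)
x = np.sort(rng.uniform(0, 10, 400))
X, y = x[:, None], np.sin(x) + rng.normal(0, 0.3, x.size)

nu, M = 0.1, 200                     # shrinkage (to keep learning weak) and number of steps
pred = np.zeros_like(y)
for _ in range(M):
    resid = y - pred                                  # current residuals epsilon_k
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
    pred += nu * stump.predict(X)                     # m_{k+1} = m_k + nu * h
print("training MSE after boosting:", round(float(np.mean((y - pred) ** 2)), 3))
```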

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    With (very) general notations, we want to solve

    $$m^\star = \text{argmin}\big\{\mathbb{E}\big[\ell(Y, m(\boldsymbol{X}))\big]\big\}$$

    for some loss function $\ell$.

    It is an iterative procedure: assume that at some step k we have an estimator $m_k(\boldsymbol{X})$. Why not construct a new model that might improve our model,

    $$m_{k+1}(\boldsymbol{X}) = m_k(\boldsymbol{X}) + h(\boldsymbol{X})\,?$$

    What could $h(\cdot)$ be?

    @freakonometrics 55

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    In a perfect world, $h(\boldsymbol{X}) = \boldsymbol{y} - m_k(\boldsymbol{X})$, which can be interpreted as a residual.

    Note that this residual is the gradient of $\frac{1}{2}[y - m(\boldsymbol{x})]^2$.

    A gradient descent is based on the Taylor expansion

    $$f(\boldsymbol{x}_k) \sim f(\boldsymbol{x}_{k-1}) + (\boldsymbol{x}_k - \boldsymbol{x}_{k-1})\,\nabla f(\boldsymbol{x}_{k-1})$$

    But here, it is different. We claim we can write

    $$f_k(\boldsymbol{x}) \sim f_{k-1}(\boldsymbol{x}) + (f_k - f_{k-1})\,\star$$

    where $\star$ is interpreted as a gradient.

    @freakonometrics 56

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    Here, $f_k$ is a $\mathbb{R}^d \to \mathbb{R}$ function, so the gradient should live in such a (big) functional space: we want to approximate that function,

    $$m_k(\boldsymbol{x}) = m_{k-1}(\boldsymbol{x}) + \underset{f\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(y_i,\ m_{k-1}(\boldsymbol{x}_i) + f(\boldsymbol{x}_i)\big)\right\}$$

    where $f\in\mathcal{F}$ means that we seek in a class of weak learner functions.

    If the learners are too strong, the first loop leads to some fixed point, and there is no learning procedure; see linear regression $y = \boldsymbol{x}^T\boldsymbol{\beta} + \varepsilon$. Since $\widehat{\boldsymbol{\varepsilon}}\perp\boldsymbol{x}$, we cannot learn from the residuals.

    In order to make sure that we learn weakly, we can use some shrinkage parameter (or a collection of parameters).

    @freakonometrics 57

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting with Piecewise Linear Spline & Stump Functions

    @freakonometrics 58

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Take Home Message

    Similar goal: getting a predictive model, m(x)

    Different/Similar tools: minimize loss/maximize likelihood

    $$\widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i))\right\} \quad\text{vs.}\quad \widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmax}}\left\{\sum_{i=1}^n \log f(y_i; m(\boldsymbol{x}_i))\right\}$$

    Try to remove the noise and avoid overfit using cross-validation, $\sum_i \ell(y_i, \widehat{m}_{(i)}(\boldsymbol{x}_i))$

    Use computational tricks (bootstrap) to increase robustness

    Nice tools to select interesting features (LASSO, variable importance)

    @freakonometrics 59
