Predictive Modeling in Insurance in the context of (possibly) big data


  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Predictive Modeling in Insurance,

    in the context of (possibly) Big Data

    A. Charpentier (UQAM & Université de Rennes 1)

    Statistical & Actuarial Sciences Joint Seminar

    & Center of studies in Asset Management (CESAM)

    http://freakonometrics.hypotheses.org

    @freakonometrics 1

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Predictive Modeling in Insurance,

    in the context of (possibly) Big Data

    A. Charpentier (UQAM & Université de Rennes 1)

    Professor of Actuarial Sciences, Mathematics Department, UQAM (previously Economics Department, Univ. Rennes 1 & ENSAE ParisTech; actuary in Hong Kong, IT & Stats, FFSA)

    ACTINFO-Covéa Chair, Actuarial Value of Information; Data Science for Actuaries program, Institute of Actuaries; PhD in Statistics (KU Leuven), Fellow Institute of Actuaries; MSc in Financial Mathematics (Paris Dauphine) & ENSAE; Editor of the freakonometrics.hypotheses.org blog; Editor of Computational Actuarial Science, CRC

    @freakonometrics 2

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Actuarial Science, an American Perspective

    Source: Trowbridge (1989) Fundamental Concepts of Actuarial Science.

    @freakonometrics 3

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Actuarial Science, a European Perspective

    Source: Dhaene et al. (2004) Modern Actuarial Risk Theory.

    @freakonometrics 4

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Examples of Actuarial Problems: Ratemaking and Pricing

    $$\mathbb{E}[S\mid \boldsymbol{X}] = \mathbb{E}\Big[\sum_{i=1}^N Z_i \,\Big|\, \boldsymbol{X}\Big] = \underbrace{\mathbb{E}[N\mid \boldsymbol{X}]}_{\text{annual frequency}} \cdot \underbrace{\mathbb{E}[Z_i\mid \boldsymbol{X}]}_{\text{individual cost}}$$

    censoring / incomplete datasets (exposure + delay to report claims)

    We observe Y and E, but the variable of interest is N.

    $Y_i \sim \mathcal{P}(E_i\,\lambda_i)$ with $\lambda_i = \exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta} + Z_i]$.

    @freakonometrics 5

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Examples of Actuarial Problems: Pricing and Classification

    Econometric models on classes,

    $$Y_i \sim \mathcal{B}(p_i) \quad\text{with}\quad p_i = \frac{\exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta}]}{1 + \exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta}]}$$

    or on counts,

    $$Y_i \sim \mathcal{P}(\lambda_i) \quad\text{with}\quad \lambda_i = \exp[\beta_0 + \boldsymbol{x}_i^T\boldsymbol{\beta}]$$

    (too) large datasets: X can be large (or complex),

    factors with a lot of modalities, spatial data, text information, etc.

    @freakonometrics 6
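    A minimal sketch (not from the slides) of how such frequency models can be fitted as generalized linear models in Python with statsmodels; the data and the single covariate (age) are simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
age = rng.uniform(20, 80, n)
X = sm.add_constant(age)                           # design matrix with an intercept

# Occurrence model: Y_i ~ B(p_i), logit(p_i) = beta_0 + beta_1 * age_i
p = 1 / (1 + np.exp(-(-3.0 + 0.03 * age)))
occurrence = rng.binomial(1, p)
logit_fit = sm.GLM(occurrence, X, family=sm.families.Binomial()).fit()

# Count model: Y_i ~ P(lambda_i), lambda_i = exp(beta_0 + beta_1 * age_i)
lam = np.exp(-2.5 + 0.02 * age)
counts = rng.poisson(lam)
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()

print(np.round(logit_fit.params, 3), np.round(poisson_fit.params, 3))
```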

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Examples of Actuarial Problems: Pricing and Classification

    How to avoid overfit? How to group modalities? How to choose between (very) correlated features?

    model selection issues

    Historically, Bailey (1963) margin method: $n_{i,j} = r_i c_j$, with row ($r$) and column ($c$) effects, and constraints

    $$\sum_i n_{i,j} = \sum_i r_i c_j \quad\text{and}\quad \sum_j n_{i,j} = \sum_j r_i c_j$$

    Related to Poisson regression,

    $$N \sim \mathcal{P}(\exp[\beta_0 + \boldsymbol{r}^T\boldsymbol{\beta}_R + \boldsymbol{c}^T\boldsymbol{\beta}_C])$$

    @freakonometrics 7

    https://www.casact.org/pubs/proceed/proceed63/63004.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Exemples of Actuarial Problems: Claims Reserving and Predictive Models

    predictive modeling issues

    In all those cases, the goal is to get a predictive model, $\widehat{y} = m(\boldsymbol{x})$, given some features $\boldsymbol{x}$. Recall that the main interest in insurance is either

    a probability $m(\boldsymbol{x}) = \mathbb{P}[Y = 1\mid \boldsymbol{X} = \boldsymbol{x}]$

    an expected value $m(\boldsymbol{x}) = \mathbb{E}[Y\mid \boldsymbol{X} = \boldsymbol{x}]$

    but sometimes, we need the (conditional) distribution of $Y$.

    @freakonometrics 8

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    History of Actuarial Models (in one slide)

    Bailey (1963) or Taylor (1977) considered deterministic models, $n_{i,j} = r_i c_j$ or $n_{i,j} = r_i d_{i+j}$. Some additional constraints are given to get an identifiable model.

    Then some stochastic versions of those models were introduced, see Hachemeister (1975) or de Vylder (1985), e.g.

    $$N_{i,j} \sim \mathcal{P}(\exp[\beta_0 + \boldsymbol{R}^T\boldsymbol{\beta}_R + \boldsymbol{C}^T\boldsymbol{\beta}_C])$$

    or $\log N_{i,j} \sim \mathcal{N}(\beta_0 + \boldsymbol{R}^T\boldsymbol{\beta}_R + \boldsymbol{C}^T\boldsymbol{\beta}_C,\ \sigma^2)$

    All those techniques are econometric-based techniques. Why not consider some statistical learning techniques?

    @freakonometrics 9

    https://www.casact.org/pubs/proceed/proceed63/63004.pdf
    https://www.casact.org/library/astin/vol9no1and2/vol9no1and2.pdf
    https://www.casact.org/pubs/forum/92spforum/92sp307.pdf
    http://www.sciencedirect.com/science/article/pii/0167668785900125

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Statistical Learning and Philosophical Issues

    From Machine Learning and Econometrics, by Hal Varian:

    Machine learning uses data to predict some variable as a function of other covariates,

    may, or may not, care about insight, importance, patterns

    may, or may not, care about inference (how y changes as some x changes)

    Econometrics uses statistical methods for prediction, inference and causal modeling of economic relationships,

    hopes for some sort of insight (inference is a goal)

    in particular, causal inference is a goal for decision making.

    machine learning, new tricks for econometrics

    @freakonometrics 10

    http://web.stanford.edu/class/ee380/Abstracts/140129-slides-Machine-Learning-and-Econometrics.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Statistical Learning and Philosophical Issues

    Remark: machine learning can also learn from econometrics, especially with non-i.i.d. data (time series and panel data).

    Remark: machine learning can help to get better predictive models, given good datasets. It is of no use for several data-science issues (e.g. selection bias).

    @freakonometrics 11

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Machine Learning and Statistics

    Machine learning and statistics seem to be very similar: they share the same goals (they both focus on data modeling) but their methods are affected by their cultural differences.

    The goal for a statistician is to predict an interaction between variables with some degree of certainty (we are never 100% certain about anything). Machine learners, on the other hand, want to build algorithms that predict, classify, and cluster with the most accuracy, see Why a Mathematician, Statistician & Machine Learner Solve the Same Problem Differently.

    Machine learning methods are about algorithms, more than about asymptotic statistical properties.

    @freakonometrics 12

    http://www.galvanize.com/blog/2015/08/26/why-a-mathematician-statistician-machine-learner-solve-the-same-problem-differently-2/

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Machine Learning and Statistics

    See also nonparametric inference: "Note that the non-parametric model is not none-parametric: parameters are determined by the training data, not the model. [...] non-parametric covers techniques that do not assume that the structure of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data." (see wikipedia)

    Validation is not based on mathematical properties, but on properties out of sample: we must use a training sample to train (estimate) a model, and a testing sample to compare algorithms.

    @freakonometrics 13

    https://en.wikipedia.org/wiki/Nonparametric_statistics

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    In statistics and in machine learning, there will be parameters and meta-parameters (or tuning parameters). The first ones are estimated, the second ones should be chosen.

    See the Hill estimator in extreme value theory. X has a Pareto distribution above some threshold u if

    $$\mathbb{P}[X > x\mid X > u] = \Big(\frac{u}{x}\Big)^{1/\xi} \quad\text{for } x > u.$$

    Given a sample $\boldsymbol{x}$, consider the Pareto QQ-plot, i.e. the scatterplot

    $$\Big\{ -\log\Big(1 - \frac{i}{n+1}\Big),\ \log x_{i:n} \Big\}_{i=n-k,\dots,n}$$

    for points exceeding $X_{n-k:n}$.

    @freakonometrics 14

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    The slope is $\xi$, i.e.

    $$\log X_{n-i+1:n} \simeq \log X_{n-k:n} + \xi\Big(\log\frac{n+1}{i} - \log\frac{n+1}{k+1}\Big)$$

    Hence, consider the estimator $\widehat{\xi}_k = \frac{1}{k}\sum_{i=0}^{k-1}\big[\log x_{n-i:n} - \log x_{n-k:n}\big]$.

    Standard mean-variance tradeoff,

    k large: bias too large, variance too small

    k small: variance too large, bias too small

    @freakonometrics 15
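    A minimal numpy sketch (not from the slides) of the Hill estimator above, on a simulated Pareto sample; the true tail index $\xi = 0.5$ and the choices of k are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, xi = 1000, 0.5
x = rng.pareto(1 / xi, n) + 1          # Pareto sample: P(X > x) = x^(-1/xi) for x >= 1
xs = np.sort(x)

def hill(xs_sorted, k):
    """Hill estimator based on the k largest order statistics."""
    top = xs_sorted[-(k + 1):]          # x_{n-k:n}, ..., x_{n:n}
    return np.mean(np.log(top[1:]) - np.log(top[0]))

for k in (20, 100, 500):
    print(k, round(hill(xs, k), 3))     # estimates of xi for several choices of k
```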

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    The same holds in kernel regression, with bandwidth h (length of the neighborhood),

    $$\widehat{m}_h(x) = \frac{\sum_{i=1}^n K_h(x - x_i)\, y_i}{\sum_{i=1}^n K_h(x - x_i)}$$

    for some kernel $K(\cdot)$.

    Standard mean-variance tradeoff,

    h large: bias too large, variance too small

    h small: variance too large, bias too small

    @freakonometrics 16
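    A minimal sketch of the Nadaraya-Watson estimator above, in Python with a Gaussian kernel on simulated data (the sine regression function is an assumption for illustration); a small h gives a wiggly estimate, a large h an oversmoothed one.

```python
import numpy as np

def nw(x0, x, y, h):
    """Nadaraya-Watson estimate at the points x0 (Gaussian kernel, bandwidth h)."""
    w = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (w * y).sum(axis=1) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)
grid = np.linspace(0, 10, 101)
for h in (0.1, 1.0, 5.0):
    print(h, np.round(nw(grid, x, y, h)[:5], 2))   # first few fitted values for each bandwidth
```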

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    More generally, we estimate $\widehat{\theta}_h$ or $\widehat{m}_h(\cdot)$. Use the mean squared error for $\widehat{\theta}_h$,

    $$\mathbb{E}\big[(\theta - \widehat{\theta}_h)^2\big]$$

    or the mean integrated squared error for $\widehat{m}_h(\cdot)$,

    $$\mathbb{E}\Big[\int \big(m(x) - \widehat{m}_h(x)\big)^2\, dx\Big]$$

    In statistics, derive an asymptotic expression for these quantities, and find the $h^\star$ that minimizes them.

    @freakonometrics 17

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Goldilocks Principle: the Mean-Variance Tradeoff

    In classical statistics, the MISE can be approximated by

    $$\frac{h^4}{4}\Big(\int x^2 K(x)\,dx\Big)^2 \int \Big(m''(x) + 2 m'(x)\frac{f'(x)}{f(x)}\Big)^2 dx \;+\; \frac{1}{nh}\int K^2(x)\,dx \int \frac{dx}{f(x)}$$

    where f is the density of the x's. Thus the optimal h is

    $$h^\star = n^{-1/5}\left(\frac{\displaystyle\int K^2(x)\,dx \int \frac{dx}{f(x)}}{\displaystyle\Big(\int x^2K(x)\,dx\Big)^2 \int\Big(m''(x)+2m'(x)\frac{f'(x)}{f(x)}\Big)^2 dx}\right)^{1/5}$$

    (hard to get a simple rule of thumb... up to a constant, $h^\star \sim n^{-1/5}$)

    In statistical learning, use the bootstrap, or cross-validation, to get an optimal h...

    @freakonometrics 18
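    A minimal sketch of bandwidth selection by leave-one-out cross-validation for the kernel regression above (simulated data, Gaussian kernel; the grid of bandwidths is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.3, x.size)

def loo_cv(h):
    """Leave-one-out cross-validated squared error for bandwidth h."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)                     # observation i is not used to predict y_i
    m_loo = (w * y).sum(axis=1) / w.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

hs = np.linspace(0.05, 3, 60)
print("h* (LOO-CV):", round(hs[int(np.argmin([loo_cv(h) for h in hs]))], 3))
```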

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Randomization is too important to be left to chance!

    Consider some sample $\boldsymbol{x} = (x_1,\dots,x_n)$ and some statistic $\widehat{\theta}$. Set $\widehat{\theta}_n = \widehat{\theta}(\boldsymbol{x})$.

    The jackknife is used to reduce bias: set $\widehat{\theta}_{(i)} = \widehat{\theta}(\boldsymbol{x}_{(i)})$, and $\widetilde{\theta} = \frac{1}{n}\sum_{i=1}^n \widehat{\theta}_{(i)}$.

    If $\mathbb{E}(\widehat{\theta}_n) = \theta + O(n^{-1})$ then $\mathbb{E}(\widetilde{\theta}_n) = \theta + O(n^{-2})$.

    See also leave-one-out cross-validation, for $\widehat{m}(\cdot)$,

    $$\text{mse} = \frac{1}{n}\sum_{i=1}^n \big[y_i - \widehat{m}_{(i)}(x_i)\big]^2$$

    The bootstrap estimate is based on bootstrap samples: set $\widehat{\theta}_{(b)} = \widehat{\theta}(\boldsymbol{x}^{(b)})$, and $\widetilde{\theta} = \frac{1}{B}\sum_{b=1}^B \widehat{\theta}_{(b)}$, where $\boldsymbol{x}^{(b)}$ is a vector of size n whose values are drawn from $\{x_1,\dots,x_n\}$, with replacement. And then use the law of large numbers...

    See Efron (1979).

    @freakonometrics 19

    http://www.stat.cmu.edu/~fienberg/Statistics36-756/Efron1979.pdf
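    A minimal numpy sketch of the jackknife and the bootstrap for a simple statistic (the sample mean of simulated exponential data, chosen purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(2.0, 100)

# Jackknife: recompute the statistic leaving one observation out at a time
theta_i = np.array([np.mean(np.delete(x, i)) for i in range(x.size)])
jk_se = np.sqrt((x.size - 1) * np.mean((theta_i - theta_i.mean()) ** 2))

# Bootstrap: recompute the statistic on B samples drawn with replacement
B = 2000
theta_b = np.array([np.mean(rng.choice(x, x.size, replace=True)) for _ in range(B)])

print("estimate:", round(x.mean(), 3), " jackknife s.e.:", round(jk_se, 3),
      " bootstrap s.e.:", round(theta_b.std(ddof=1), 3))
```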

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Statistical Learning and Philosophical Issues

    From $(y_i,\boldsymbol{x}_i)$, there are different stories behind, see Freedman (2005):

    the causal story: $x_{j,i}$ is usually considered as independent of the other covariates $x_{k,i}$. For all possible $\boldsymbol{x}$, that value is mapped to $m(\boldsymbol{x})$ and a noise is attached, $\varepsilon$. The goal is to recover $m(\cdot)$, and the residuals are just the difference between the response value and $m(\boldsymbol{x})$.

    the conditional distribution story: for a linear model, we usually say that Y given $\boldsymbol{X} = \boldsymbol{x}$ is a $\mathcal{N}(m(\boldsymbol{x}), \sigma^2)$ distribution. $m(\boldsymbol{x})$ is then the conditional mean. Here $m(\cdot)$ is assumed to really exist, but no causal assumption is made, only a conditional one.

    the explanatory data story: there is no model, just data. We simply want to summarize the information contained in the x's to get an accurate summary, close to the response (i.e. $\min\{\sum_i \ell(y_i, m(\boldsymbol{x}_i))\}$) for some loss function $\ell$.

    @freakonometrics 20

    http://www.cambridge.org/us/academic/subjects/statistics-probability/statistical-theory-and-methods/statistical-models-theory-and-practice-2nd-edition

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Machine Learning vs. Statistical Modeling

    In machine learning, given some dataset $(\boldsymbol{x}_i, y_i)$, solve

    $$\widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmin}} \left\{ \sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i)) \right\}$$

    for some loss function $\ell(\cdot,\cdot)$.

    In statistical modeling, given some probability space $(\Omega,\mathcal{A},\mathbb{P})$, assume that the $y_i$'s are realizations of i.i.d. variables $Y_i$ (given $\boldsymbol{X}_i = \boldsymbol{x}_i$) with distribution $F_i$. Then solve

    $$\widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmax}} \big\{ \log\mathcal{L}(m(\boldsymbol{x});\boldsymbol{y}) \big\} = \underset{m(\cdot)\in\mathcal{F}}{\text{argmax}} \left\{ \sum_{i=1}^n \log f(y_i; m(\boldsymbol{x}_i)) \right\}$$

    where $\log\mathcal{L}$ denotes the log-likelihood.

    @freakonometrics 21

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Loss Functions

    Fitting criteria are based on loss functions (also called cost functions). For a quantitative response, a popular one is the quadratic loss, $\ell(y, m(\boldsymbol{x})) = [y - m(\boldsymbol{x})]^2$.

    Recall that

    $$\mathbb{E}(Y) = \underset{m\in\mathbb{R}}{\text{argmin}}\big\{\|Y - m\|_{\ell_2}\big\} = \underset{m\in\mathbb{R}}{\text{argmin}}\big\{\mathbb{E}\big([Y-m]^2\big)\big\}$$

    $$\text{Var}(Y) = \underset{m\in\mathbb{R}}{\min}\big\{\mathbb{E}\big([Y-m]^2\big)\big\} = \mathbb{E}\big([Y - \mathbb{E}(Y)]^2\big)$$

    The empirical version is

    $$\overline{y} = \underset{m\in\mathbb{R}}{\text{argmin}}\left\{\sum_{i=1}^n \frac{1}{n}[y_i - m]^2\right\} \qquad s^2 = \underset{m\in\mathbb{R}}{\min}\left\{\sum_{i=1}^n \frac{1}{n}[y_i - m]^2\right\} = \sum_{i=1}^n \frac{1}{n}[y_i - \overline{y}]^2$$

    @freakonometrics 22

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Model Evaluation

    In linear models, the $R^2$ is defined as the proportion of the variance of the response y that can be obtained using the predictors.

    But maximizing the $R^2$ usually yields overfit (or unjustified optimism, in Berk (2008)).

    In linear models, consider the adjusted $R^2$,

    $$\overline{R}^2 = 1 - [1 - R^2]\,\frac{n-1}{n-p-1}$$

    where p is the number of parameters, or more generally $\text{trace}(\boldsymbol{S})$ when some smoothing matrix is considered,

    $$\widehat{y} = \widehat{m}(\boldsymbol{x}) = \sum_{i=1}^n \boldsymbol{S}_{\boldsymbol{x},i}\, y_i = \boldsymbol{S}_{\boldsymbol{x}}^T \boldsymbol{y}$$

    where $\boldsymbol{S}_{\boldsymbol{x}}$ is some vector of weights (called the smoother vector), related to an $n\times n$ smoother matrix, $\widehat{\boldsymbol{y}} = \boldsymbol{S}\boldsymbol{y}$, where prediction is done at the points $\boldsymbol{x}_i$'s.

    @freakonometrics 23

    http://www.springer.com/us/book/9780387775005

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Model Evaluation

    Alternatives are based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), based on a penalty imposed on some criterion (the logarithm of the variance of the residuals),

    $$AIC = \log\left(\frac{1}{n}\sum_{i=1}^n [y_i - \widehat{y}_i]^2\right) + \frac{2p}{n}$$

    $$BIC = \log\left(\frac{1}{n}\sum_{i=1}^n [y_i - \widehat{y}_i]^2\right) + \log(n)\,\frac{p}{n}$$

    In a more general context, replace p by $\text{trace}(\boldsymbol{S})$.

    @freakonometrics 24

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Model Evaluation

    One can also consider the expected prediction error (with a probabilistic model),

    $$\mathbb{E}\big[\ell(Y, \widehat{m}(\boldsymbol{X}))\big]$$

    We cannot claim (using the law of large numbers) that

    $$\frac{1}{n}\sum_{i=1}^n \ell(y_i, \widehat{m}(\boldsymbol{x}_i)) \overset{a.s.}{\longrightarrow} \mathbb{E}\big[\ell(Y, \widehat{m}(\boldsymbol{X}))\big]$$

    since $\widehat{m}$ depends on the $(y_i,\boldsymbol{x}_i)$'s.

    Natural option: use two (random) samples, a training one and a validation one.

    Alternative options: use cross-validation, leave-one-out or k-fold.

    @freakonometrics 25
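    A minimal sketch of the point above: the in-sample error underestimates the prediction error, while a k-fold cross-validated error does not (simulated linear data and scikit-learn, purely as an illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 300)

model = LinearRegression().fit(X, y)
print("in-sample MSE :", round(np.mean((y - model.predict(X)) ** 2), 3))

errors = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    m = LinearRegression().fit(X[train], y[train])
    errors.append(np.mean((y[test] - m.predict(X[test])) ** 2))
print("10-fold CV MSE:", round(np.mean(errors), 3))
```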

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Underfit / Overfit and Variance - Mean Tradeoff

    Goal in predictive modeling: reduce uncertainty in our predictions.

    Need more data to get better knowledge.

    Unfortunately, reducing the error of the prediction on a dataset does not generally give a good generalization performance:

    we need a training and a validation dataset.

    @freakonometrics 26

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Overfit, Training vs. Validation and Complexity (Vapnik Dimension)

    complexity ↔ polynomial degree

    @freakonometrics 27

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Overfit, Training vs. Validation and Complexity (Vapnik Dimension)

    complexity ↔ number of neighbors (k)

    @freakonometrics 28

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Logistic Regression

    Assume that $\mathbb{P}(Y_i = 1) = \pi_i$,

    $$\text{logit}(\pi_i) = \boldsymbol{x}_i^T\boldsymbol{\beta}, \quad\text{where}\quad \text{logit}(\pi_i) = \log\Big(\frac{\pi_i}{1-\pi_i}\Big),$$

    or

    $$\pi_i = \text{logit}^{-1}(\boldsymbol{x}_i^T\boldsymbol{\beta}) = \frac{\exp[\boldsymbol{x}_i^T\boldsymbol{\beta}]}{1 + \exp[\boldsymbol{x}_i^T\boldsymbol{\beta}]}.$$

    The log-likelihood is

    $$\log\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^n y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i) = \sum_{i=1}^n y_i\log(\pi_i(\boldsymbol{\beta})) + (1-y_i)\log(1-\pi_i(\boldsymbol{\beta}))$$

    and the first-order conditions are solved numerically,

    $$\frac{\partial \log\mathcal{L}(\boldsymbol{\beta})}{\partial \beta_k} = \sum_{i=1}^n X_{k,i}\,[y_i - \pi_i(\boldsymbol{\beta})] = 0.$$

    @freakonometrics 29
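    A minimal numpy sketch of solving those first-order conditions numerically, here with Newton-Raphson (equivalently, iteratively reweighted least squares) on simulated data; the true coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + two covariates
beta_true = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(3)
for _ in range(25):                          # Newton-Raphson iterations
    pi = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - pi)                   # sum_i x_i [y_i - pi_i(beta)]
    hessian = X.T @ (X * (pi * (1 - pi))[:, None])
    beta = beta + np.linalg.solve(hessian, score)

print("beta_hat:", np.round(beta, 3))
```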

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Predictive Classifier

    To go from a score

    $$s(\boldsymbol{x}) = \frac{\exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}{1 + \exp[\boldsymbol{x}^T\widehat{\boldsymbol{\beta}}]}$$

    to a class: if $s(\boldsymbol{x}) > s$, then $\widehat{Y}(\boldsymbol{x}) = 1$, and if $s(\boldsymbol{x}) \le s$, then $\widehat{Y}(\boldsymbol{x}) = 0$.

    Plot $TP(s) = \mathbb{P}[\widehat{Y} = 1\mid Y = 1]$ against $FP(s) = \mathbb{P}[\widehat{Y} = 1\mid Y = 0]$: the ROC curve.

    @freakonometrics 30
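    A minimal sketch of the ROC construction above: sweep the threshold s, compute the empirical TP and FP rates, and integrate to get the AUC (the score below is simulated, purely for illustration).

```python
import numpy as np

def roc_points(y, score, thresholds):
    """Empirical TP(s) and FP(s) when predicting Y_hat = 1 whenever score > s."""
    tpr = np.array([(score[y == 1] > s).mean() for s in thresholds])
    fpr = np.array([(score[y == 0] > s).mean() for s in thresholds])
    return fpr, tpr

rng = np.random.default_rng(5)
y = rng.binomial(1, 0.4, 1000)
score = np.clip(0.4 * y + rng.normal(0.3, 0.2, 1000), 0, 1)   # informative but noisy score
fpr, tpr = roc_points(y, score, np.linspace(0, 1, 101))
print("AUC ~", round(-np.trapz(tpr, fpr), 3))                 # minus sign: FPR decreases along the grid
```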

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Why a Logistic and not a Probit Regression?

    Bliss (1934) suggested a model such that

    $$\mathbb{P}(Y = 1\mid \boldsymbol{X} = \boldsymbol{x}) = H(\boldsymbol{x}^T\boldsymbol{\beta}) \quad\text{where}\quad H(\cdot) = \Phi(\cdot),$$

    the c.d.f. of the $\mathcal{N}(0,1)$ distribution. This is the probit model. This yields a latent model, $y_i = \mathbf{1}(y_i^\star > 0)$, where

    $y_i^\star = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$ is a non-observable score.

    In the logistic regression, we model the odds ratio,

    $$\frac{\mathbb{P}(Y = 1\mid \boldsymbol{X} = \boldsymbol{x})}{\mathbb{P}(Y \neq 1\mid \boldsymbol{X} = \boldsymbol{x})} = \exp[\boldsymbol{x}^T\boldsymbol{\beta}]$$

    $$\mathbb{P}(Y = 1\mid \boldsymbol{X} = \boldsymbol{x}) = H(\boldsymbol{x}^T\boldsymbol{\beta}) \quad\text{where}\quad H(\cdot) = \frac{\exp[\cdot]}{1 + \exp[\cdot]},$$

    which is the c.d.f. of the logistic variable, see Verhulst (1845).

    @freakonometrics 31

    http://www.sciencemag.org/content/79/2037/38
    http://gdz.sub.uni-goettingen.de/dms/load/img/?PPN=PPN129323640_0018&DMDID=dmdlog7

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    k-Nearest Neighbors (a.k.a. k-NN)

    In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. (Source: wikipedia).

    $$\mathbb{E}[Y\mid \boldsymbol{X} = \boldsymbol{x}] \approx \frac{1}{k}\sum_{i:\ d(\boldsymbol{x}_i,\boldsymbol{x})\ \text{small}} y_i$$

    For k-Nearest Neighbors, the class is usually the majority vote of the k closest neighbors of $\boldsymbol{x}$.

    Distance $d(\cdot,\cdot)$ should not be sensitive to units: normalize by the standard deviation.

    [Figure: k-NN classification regions on the (PVENT, REPUL) scatterplot]

    @freakonometrics 32

    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
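    A minimal scikit-learn sketch of the remark on units: with two simulated features on very different scales (loosely mimicking variables such as PVENT and REPUL), k-NN behaves much better once the features are standardized.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = np.column_stack([rng.uniform(0, 20, 500), rng.uniform(500, 3000, 500)])
y = 0.3 * X[:, 0] + 0.001 * X[:, 1] + rng.normal(0, 0.5, 500)

for scale in (False, True):
    Z = StandardScaler().fit_transform(X) if scale else X     # normalize each feature
    knn = KNeighborsRegressor(n_neighbors=10).fit(Z[:400], y[:400])
    mse = np.mean((y[400:] - knn.predict(Z[400:])) ** 2)
    print("scaled:" if scale else "raw:   ", round(mse, 3))
```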

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    k-Nearest Neighbors and Curse of Dimensionality

    The higher the dimension, the larger the distance to the closest neighbor,

    $$\min_{i\in\{1,\dots,n\}}\big\{d(\boldsymbol{a},\boldsymbol{x}_i)\big\}, \quad \boldsymbol{x}_i\in\mathbb{R}^d.$$

    [Figure: boxplots of the distance to the closest neighbor, for dimensions 1 to 5, with n = 10 (left) and n = 100 (right)]

    @freakonometrics 33

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Classification (and Regression) Trees, CART

    "one of the predictive modelling approaches used in statistics, data mining and machine learning [...] In tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels." (Source: wikipedia).

    [Figure: classification tree partition of the (PVENT, REPUL) scatterplot]

    @freakonometrics 34

    https://en.wikipedia.org/wiki/Decision_tree_learning

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Bagging

    Bootstrapped Aggregation (Bagging) is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification (Source: wikipedia).

    It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [bootstrap]. The predictions from each separate model are combined together to provide a superior result [aggregation].

    It can be used on any kind of model, but it is especially interesting for trees, see Breiman (1996).

    The bootstrap can be used to define the concept of margin,

    $$\text{margin}_i = \frac{1}{B}\sum_{b=1}^B \mathbf{1}(\widehat{y}_i = y_i) - \frac{1}{B}\sum_{b=1}^B \mathbf{1}(\widehat{y}_i \neq y_i)$$

    Remark: the probability that the i-th row is not selected is $(1 - n^{-1})^n \to e^{-1} \approx 36.8\%$, cf. training / validation samples (2/3 - 1/3).

    @freakonometrics 35

    https://en.wikipedia.org/wiki/Bootstrap_aggregating
    http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
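    A minimal sketch of bagging trees by hand (bootstrap resampling + aggregation of the predictions), on simulated data with scikit-learn regression trees; constants such as B = 100 and max_depth = 5 are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
def make(n):
    x = rng.uniform(0, 10, n)
    return x[:, None], np.sin(x) + rng.normal(0, 0.3, n)

Xtr, ytr = make(300)
Xte, yte = make(300)

B = 100
preds = np.zeros((B, yte.size))
for b in range(B):
    idx = rng.integers(0, ytr.size, ytr.size)        # bootstrap resample of the training set
    tree = DecisionTreeRegressor(max_depth=5).fit(Xtr[idx], ytr[idx])
    preds[b] = tree.predict(Xte)

print("single tree test MSE:", round(np.mean((yte - preds[0]) ** 2), 3))
print("bagged trees test MSE:", round(np.mean((yte - preds.mean(axis=0)) ** 2), 3))
```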

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Random Forests

    Strictly speaking, when bootstrapping among observations, and aggregating, we use a bagging algorithm. In the random forest algorithm, we combine Breiman's bagging idea and the random selection of features, introduced independently by Ho (1995) and Amit & Geman (1997).

    [Figure: random forest classification regions on the (PVENT, REPUL) scatterplot]

    @freakonometrics 36

    http://cm.bell-labs.com/cm/cs/who/tkh/papers/odt.pdf
    http://www.cis.jhu.edu/publications/papers_in_database/GEMAN/shape.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Bagging, Forests, and Variable Importance

    Given some random forest with M trees, set

    $$VI(X_k) = \frac{1}{M}\sum_{m}\sum_{t}\frac{N_t}{N}\,\Delta i(t)$$

    where the first sum is over all trees, and the second one is over all nodes where the split is done based on variable $X_k$ ($\Delta i(t)$ being the decrease in impurity at node t). But this is difficult to interpret with correlated features. Consider the model $y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \varepsilon$, and a model based on $\boldsymbol{x} = (x_1, x_2, x_3)$ where $x_1$ and $x_2$ are correlated. Compare AIC vs. variable importance as a function of r.

    @freakonometrics 37
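    A minimal scikit-learn sketch of the issue above: simulate $y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \varepsilon$ with $x_2$ correlated with $x_1$ but irrelevant, and look at the random-forest variable importances (the coefficients and the correlation r = 0.9 are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
n, r = 2000, 0.9
x1 = rng.normal(size=n)
x2 = r * x1 + np.sqrt(1 - r ** 2) * rng.normal(size=n)   # correlated with x1, no direct effect
x3 = rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x3 + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(np.column_stack([x1, x2, x3]), y)
print("importances (x1, x2, x3):", np.round(rf.feature_importances_, 3))
# part of the importance of x1 leaks to the correlated (but irrelevant) x2
```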

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machine

    SVMs were developed in the 90s based on previous work, from Vapnik & Lerner (1963), see Valiant (1984). Assume that the points are linearly separable, i.e. there exist $\boldsymbol{\omega}$ and b such that

    $$Y = \begin{cases} +1 & \text{if } \boldsymbol{\omega}^T\boldsymbol{x} + b > 0 \\ -1 & \text{if } \boldsymbol{\omega}^T\boldsymbol{x} + b < 0 \end{cases}$$

    Problem: there is an infinite number of solutions; we need a good one, that separates the data, (somehow) far from the data.

    The distance from $\boldsymbol{x}_0$ to the hyperplane $H_{\boldsymbol{\omega},b}$ is

    $$d(\boldsymbol{x}_0, H_{\boldsymbol{\omega},b}) = \frac{\boldsymbol{\omega}^T\boldsymbol{x}_0 + b}{\|\boldsymbol{\omega}\|}$$

    @freakonometrics 38

    http://www.cs.iastate.edu/~cs573x/vapnik-portraits1963.pdf
    https://people.mpi-inf.mpg.de/~mehlhorn/SeminarEvolvability/ValiantLearnable.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machine

    Define the support vectors as the observations such that

    $$|\boldsymbol{\omega}^T\boldsymbol{x}_i + b| = 1$$

    The margin is the distance between the hyperplanes defined by the support vectors.

    The distance from the support vectors to $H_{\boldsymbol{\omega},b}$ is $\|\boldsymbol{\omega}\|^{-1}$, and the margin is then $2\|\boldsymbol{\omega}\|^{-1}$.

    The algorithm is to minimize the inverse of the margin, s.t. $H_{\boldsymbol{\omega},b}$ separates the ±1 points, i.e.

    $$\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega}\right\} \quad\text{s.t.}\quad y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i + b) \ge 1,\ \forall i.$$

    @freakonometrics 39

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machine

    Now, what about the non-separable case?

    Here, we cannot have $y_i(\boldsymbol{\omega}^T\boldsymbol{x}_i + b) \ge 1$ for all i.

    Introduce slack variables $\xi_i$,

    $$\begin{cases} \boldsymbol{\omega}^T\boldsymbol{x}_i + b \ge +1 - \xi_i & \text{when } y_i = +1 \\ \boldsymbol{\omega}^T\boldsymbol{x}_i + b \le -1 + \xi_i & \text{when } y_i = -1 \end{cases}$$

    where $\xi_i \ge 0$ for all i. There is a classification error when $\xi_i > 1$.

    The idea is then to solve

    $$\min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega} + C\,\mathbf{1}^T\mathbf{1}_{\boldsymbol{\xi}>1}\right\} \quad\text{instead of}\quad \min\left\{\frac{1}{2}\boldsymbol{\omega}^T\boldsymbol{\omega}\right\}$$

    @freakonometrics 40

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Support Vector Machines, with a Linear Kernel

    So far, $d(\boldsymbol{x}_0, H_{\boldsymbol{\omega},b}) = \min_{\boldsymbol{x}\in H_{\boldsymbol{\omega},b}}\{\|\boldsymbol{x}_0 - \boldsymbol{x}\|_{\ell_2}\}$

    where $\|\cdot\|_{\ell_2}$ is the Euclidean ($\ell_2$) norm,

    $$\|\boldsymbol{x}_0 - \boldsymbol{x}\|_{\ell_2} = \sqrt{(\boldsymbol{x}_0 - \boldsymbol{x})\cdot(\boldsymbol{x}_0 - \boldsymbol{x})} = \sqrt{\boldsymbol{x}_0\cdot\boldsymbol{x}_0 - 2\,\boldsymbol{x}_0\cdot\boldsymbol{x} + \boldsymbol{x}\cdot\boldsymbol{x}}$$

    More generally, $d(\boldsymbol{x}_0, H_{\boldsymbol{\omega},b}) = \min_{\boldsymbol{x}\in H_{\boldsymbol{\omega},b}}\{\|\boldsymbol{x}_0 - \boldsymbol{x}\|_k\}$

    where $\|\cdot\|_k$ is some kernel-based norm,

    $$\|\boldsymbol{x}_0 - \boldsymbol{x}\|_k = \sqrt{k(\boldsymbol{x}_0,\boldsymbol{x}_0) - 2\,k(\boldsymbol{x}_0,\boldsymbol{x}) + k(\boldsymbol{x},\boldsymbol{x})}$$

    [Figures: SVM classification of the (PVENT, REPUL) scatterplot, with a linear kernel and with a non-linear kernel]

    @freakonometrics 41
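    A minimal scikit-learn sketch contrasting a linear and a kernel-based SVM on simulated data with a circular frontier (hence not linearly separable); C = 1 and the RBF kernel are default-style choices, not taken from the slides.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(9)
X = rng.normal(size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.2).astype(int)      # circular decision frontier

for kernel in ("linear", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0)).fit(X[:300], y[:300])
    print(kernel, "test accuracy:", round(clf.score(X[300:], y[300:]), 3))
```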

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regression?

    In statistics, regression analysis is a statistical process for estimating the relationships among variables [...] In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: wikipedia).

    Here regression is opposed to classification (as in the CART algorithm): y is either a continuous variable, $y\in\mathbb{R}$, or a counting variable, $y\in\mathbb{N}$.

    In many cases in the econometric and actuarial literature we simply want a good fit for the conditional expectation, $\mathbb{E}[Y\mid \boldsymbol{X} = \boldsymbol{x}]$.

    @freakonometrics 42

    https://en.wikipedia.org/wiki/Regression_analysis

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Linear, Non-Linear and Generalized Linear

    $$(Y\mid \boldsymbol{X}=\boldsymbol{x})\sim\mathcal{N}(\mu_{\boldsymbol{x}},\sigma^2) \qquad (Y\mid \boldsymbol{X}=\boldsymbol{x})\sim\mathcal{N}(\mu_{\boldsymbol{x}},\sigma^2) \qquad (Y\mid \boldsymbol{X}=\boldsymbol{x})\sim\mathcal{L}(\theta_{\boldsymbol{x}},\varphi)$$

    $$\mathbb{E}[Y\mid \boldsymbol{X}=\boldsymbol{x}]=\boldsymbol{x}^T\boldsymbol{\beta} \qquad\quad \mathbb{E}[Y\mid \boldsymbol{X}=\boldsymbol{x}]=h(\boldsymbol{x}) \qquad\quad \mathbb{E}[Y\mid \boldsymbol{X}=\boldsymbol{x}]=h(\boldsymbol{x}^T\boldsymbol{\beta})$$

    @freakonometrics 43

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regression Smoothers, natura non facit saltus

    In statistical learning procedures, a key role is played by basis functions. We will see that it is common to assume that

    $$m(\boldsymbol{x}) = \sum_{j=0}^{k}\beta_j\, h_j(\boldsymbol{x}),$$

    where $h_0$ is usually a constant function and the $h_j$'s are basis functions.

    For instance, $h_j(x) = x^j$ for a polynomial expansion with a single predictor, or $h_j(x) = (x - s_j)_+$ for some knots $s_j$'s (for linear splines, but one can consider quadratic or cubic ones).

    @freakonometrics 44

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regression Smoothers: Polynomial or Spline

    Stone-Weierstrass theorem: every continuous function defined on a closed interval [a, b] can be uniformly approximated as closely as desired by a polynomial function.

    One can also use spline functions, e.g. piecewise linear ones,

    $$h(x) = \beta_0 + \sum_{j=1}^k \beta_j\,(x - s_j)_+$$

    @freakonometrics 45

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Linear Model

    Consider some linear model $y_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \varepsilon_i$ for all $i = 1,\dots,n$.

    Assume that the $\varepsilon_i$'s are i.i.d. with $\mathbb{E}(\varepsilon) = 0$ (and finite variance). Write

    $$\underbrace{\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}}_{\boldsymbol{y},\ n\times 1} = \underbrace{\begin{pmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{pmatrix}}_{\boldsymbol{X},\ n\times(k+1)} \underbrace{\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}}_{\boldsymbol{\beta},\ (k+1)\times 1} + \underbrace{\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}}_{\boldsymbol{\varepsilon},\ n\times 1}.$$

    Assuming $\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\sigma^2\mathbb{I})$, the maximum likelihood estimator of $\boldsymbol{\beta}$ is

    $$\widehat{\boldsymbol{\beta}} = \text{argmin}\big\{\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{y}$$

    ... under the assumption that $\boldsymbol{X}^T\boldsymbol{X}$ is a full-rank matrix.

    What if $\boldsymbol{X}^T\boldsymbol{X}$ cannot be inverted? Then $\widehat{\boldsymbol{\beta}} = [\boldsymbol{X}^T\boldsymbol{X}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ does not exist, but $\widehat{\boldsymbol{\beta}}_\lambda = [\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ always exists if $\lambda > 0$.

    @freakonometrics 46

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Ridge Regression

    The estimator $\widehat{\boldsymbol{\beta}}_\lambda = [\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ is the Ridge estimate, obtained as the solution of

    $$\widehat{\boldsymbol{\beta}}_\lambda = \underset{\boldsymbol{\beta}}{\text{argmin}}\left\{\sum_{i=1}^n \big[y_i - \beta_0 - \boldsymbol{x}_i^T\boldsymbol{\beta}\big]^2 + \lambda\,\|\boldsymbol{\beta}\|_{\ell_2}^2\right\}$$

    for some tuning parameter $\lambda$. One can also write

    $$\widehat{\boldsymbol{\beta}}_\lambda = \underset{\boldsymbol{\beta};\ \|\boldsymbol{\beta}\|_{\ell_2}\le s}{\text{argmin}}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\}$$

    Remark: note that we solve $\widehat{\boldsymbol{\beta}} = \text{argmin}\{\text{objective}(\boldsymbol{\beta})\}$ where

    $$\text{objective}(\boldsymbol{\beta}) = \underbrace{\mathcal{L}(\boldsymbol{\beta})}_{\text{training loss}} + \underbrace{\mathcal{R}(\boldsymbol{\beta})}_{\text{regularization}}$$

    @freakonometrics 47
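    A minimal numpy/scikit-learn sketch of the Ridge estimate above: with a duplicated column, $\boldsymbol{X}^T\boldsymbol{X}$ is singular, yet $[\boldsymbol{X}^T\boldsymbol{X} + \lambda\mathbb{I}]^{-1}\boldsymbol{X}^T\boldsymbol{y}$ is well defined and matches scikit-learn's Ridge (no intercept, $\lambda = 1$ chosen arbitrarily).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(10)
n, k = 50, 5
X = rng.normal(size=(n, k))
X[:, 4] = X[:, 3]                        # duplicated column: X'X is singular, OLS is not defined
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, n)

lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)   # [X'X + lambda I]^{-1} X'y
skl = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.round(beta_ridge, 3))
print(np.round(skl.coef_, 3))            # same solution
```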

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Going further on sparsity issues

    In several applications, k can be (very) large, but a lot of features are just noise: $\beta_j = 0$ for many j's. Let s denote the number of relevant features, with $s \ll k$.

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Going further on sparsity issues

    Define $\|\boldsymbol{a}\|_{\ell_0} = \sum \mathbf{1}(|a_i| > 0)$. Here, $\dim(\boldsymbol{\beta}) = s$.

    We wish we could solve

    $$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\ \|\boldsymbol{\beta}\|_{\ell_0}\le s}{\text{argmin}}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\}$$

    Problem: it is usually not possible to describe all possible constraints, since $\binom{k}{s}$ sets of coefficients should be considered here (with k (very) large).

    Idea: solve the dual problem

    $$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\ \|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\le h}{\text{argmin}}\big\{\|\boldsymbol{\beta}\|_{\ell_0}\big\}$$

    where we might convexify the $\ell_0$ norm, $\|\cdot\|_{\ell_0}$.

    @freakonometrics 49

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Regularization: $\ell_0$, $\ell_1$ and $\ell_2$

    @freakonometrics 50

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Going further on sparsity issues

    On $[-1,+1]^k$, the convex hull of $\|\cdot\|_{\ell_0}$ is $\|\cdot\|_{\ell_1}$. On $[-a,+a]^k$, the convex hull of $\|\cdot\|_{\ell_0}$ is $a^{-1}\|\cdot\|_{\ell_1}$. Hence,

    $$\widehat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta};\ \|\boldsymbol{\beta}\|_{\ell_1}\le s}{\text{argmin}}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2}\big\}$$

    is equivalent (Kuhn-Tucker theorem) to the Lagrangian optimization problem

    $$\widehat{\boldsymbol{\beta}} = \text{argmin}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2} + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\big\}$$

    @freakonometrics 51

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    LASSO Least Absolute Shrinkage and Selection Operator

    $$\widehat{\boldsymbol{\beta}} = \text{argmin}\big\{\|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|_{\ell_2} + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\big\}$$

    is a convex problem (several algorithms [*]), but not strictly convex (no uniqueness of the minimum). Nevertheless, predictions $\widehat{y} = \boldsymbol{x}^T\widehat{\boldsymbol{\beta}}$ are unique.

    [*] MM (majorize-minimize), coordinate descent, see Hunter (2003).

    @freakonometrics 52

    http://sites.stat.psu.edu/~dhunter/papers/mmtutorial.pdf

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Optimal LASSO Penalty

    Use cross-validation, e.g. K-fold,

    $$\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda) = \text{argmin}\left\{\sum_{i\notin I_k}\big[y_i - \boldsymbol{x}_i^T\boldsymbol{\beta}\big]^2 + \lambda\|\boldsymbol{\beta}\|_{\ell_1}\right\}$$

    then compute the sum of squared errors,

    $$Q_k(\lambda) = \sum_{i\in I_k}\big[y_i - \boldsymbol{x}_i^T\widehat{\boldsymbol{\beta}}_{(-k)}(\lambda)\big]^2$$

    and finally solve

    $$\lambda^\star = \text{argmin}\left\{\overline{Q}(\lambda) = \frac{1}{K}\sum_k Q_k(\lambda)\right\}$$

    Note that this might overfit, so Hastie, Tibshirani & Friedman (2009) suggest the largest $\lambda$ such that

    $$\overline{Q}(\lambda) \le \overline{Q}(\lambda^\star) + \text{se}[\lambda^\star] \quad\text{with}\quad \text{se}[\lambda]^2 = \frac{1}{K^2}\sum_{k=1}^K \big[Q_k(\lambda) - \overline{Q}(\lambda)\big]^2$$

    http://statweb.stanford.edu/~tibs/ElemStatLearn/
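    A minimal scikit-learn sketch of the K-fold procedure above, including the one-standard-error rule, on simulated sparse data; the grid of $\lambda$'s, K = 10 and the 3 relevant coefficients are assumptions for illustration (and per-fold mean squared errors are used in place of sums).

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(11)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]          # only 3 relevant features
y = X @ beta + rng.normal(0, 1, n)

lambdas, K = np.logspace(-3, 0.5, 30), 10
Q = np.zeros((K, lambdas.size))
for j, lam in enumerate(lambdas):
    for fold, (tr, te) in enumerate(KFold(K, shuffle=True, random_state=0).split(X)):
        m = Lasso(alpha=lam, max_iter=10000).fit(X[tr], y[tr])
        Q[fold, j] = np.mean((y[te] - m.predict(X[te])) ** 2)

Qbar, se = Q.mean(axis=0), Q.std(axis=0) / np.sqrt(K)
j_star = int(np.argmin(Qbar))
j_1se = int(np.where(Qbar <= Qbar[j_star] + se[j_star])[0].max())   # largest lambda within one s.e.
print("lambda*:", round(lambdas[j_star], 4), " one-s.e. lambda:", round(lambdas[j_1se], 4))
print("nonzero coefficients:", int(np.sum(Lasso(alpha=lambdas[j_1se]).fit(X, y).coef_ != 0)))
```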

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    Boosting is a machine learning ensemble meta-algorithm for reducing bias primarily, and also variance, in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones. (Source: wikipedia)

    The heuristic is simple: we consider an iterative process where we keep modeling the errors.

    Fit a model for y, $m_1(\cdot)$, from $\boldsymbol{y}$ and $\boldsymbol{X}$, and compute the error, $\boldsymbol{\varepsilon}_1 = \boldsymbol{y} - m_1(\boldsymbol{X})$.

    Fit a model for $\boldsymbol{\varepsilon}_1$, $m_2(\cdot)$, from $\boldsymbol{\varepsilon}_1$ and $\boldsymbol{X}$, and compute the error, $\boldsymbol{\varepsilon}_2 = \boldsymbol{\varepsilon}_1 - m_2(\boldsymbol{X})$, etc. Then set

    $$m(\cdot) = \underbrace{m_1(\cdot)}_{\sim \boldsymbol{y}} + \underbrace{m_2(\cdot)}_{\sim \boldsymbol{\varepsilon}_1} + \underbrace{m_3(\cdot)}_{\sim \boldsymbol{\varepsilon}_2} + \cdots + \underbrace{m_k(\cdot)}_{\sim \boldsymbol{\varepsilon}_{k-1}}$$

    @freakonometrics 54

    https://en.wikipedia.org/wiki/Boosting_(machine_learning)
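    A minimal sketch of this residual-fitting heuristic with weak learners (depth-1 regression stumps from scikit-learn) and a shrinkage parameter; the simulated data, the shrinkage value 0.1 and the 200 iterations are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(12)
x = np.sort(rng.uniform(0, 10, 400))
X, y = x[:, None], np.sin(x) + rng.normal(0, 0.3, x.size)

nu, M = 0.1, 200                     # shrinkage (to keep learning weak) and number of steps
pred = np.zeros_like(y)
for _ in range(M):
    resid = y - pred                                  # current residuals epsilon_k
    stump = DecisionTreeRegressor(max_depth=1).fit(X, resid)
    pred += nu * stump.predict(X)                     # m_{k+1} = m_k + nu * h
print("training MSE after boosting:", round(float(np.mean((y - pred) ** 2)), 3))
```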

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    With (very) general notations, we want to solve

    $$m^\star = \text{argmin}\big\{\mathbb{E}\big[\ell(Y, m(\boldsymbol{X}))\big]\big\}$$

    for some loss function $\ell$.

    It is an iterative procedure: assume that at some step k we have an estimator $m_k(\boldsymbol{X})$. Why not construct a new model that might improve our model,

    $$m_{k+1}(\boldsymbol{X}) = m_k(\boldsymbol{X}) + h(\boldsymbol{X})\,?$$

    What could $h(\cdot)$ be?

    @freakonometrics 55

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    In a perfect world, $h(\boldsymbol{X}) = \boldsymbol{y} - m_k(\boldsymbol{X})$, which can be interpreted as a residual.

    Note that this residual is the gradient of $\frac{1}{2}[y - m(\boldsymbol{x})]^2$.

    A gradient descent is based on the Taylor expansion

    $$f(\boldsymbol{x}_k) \sim f(\boldsymbol{x}_{k-1}) + (\boldsymbol{x}_k - \boldsymbol{x}_{k-1})\,\nabla f(\boldsymbol{x}_{k-1})$$

    But here, it is different. We claim we can write

    $$f_k(\boldsymbol{x}) \sim f_{k-1}(\boldsymbol{x}) + (f_k - f_{k-1})\,\star$$

    where $\star$ is interpreted as a gradient.

    @freakonometrics 56

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting

    Here, $f_k$ is a $\mathbb{R}^d \to \mathbb{R}$ function, so the gradient should live in such a (big) functional space: we want to approximate that function,

    $$m_k(\boldsymbol{x}) = m_{k-1}(\boldsymbol{x}) + \underset{f\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell\big(y_i,\ m_{k-1}(\boldsymbol{x}_i) + f(\boldsymbol{x}_i)\big)\right\}$$

    where $f\in\mathcal{F}$ means that we seek in a class of weak learner functions.

    If the learners are too strong, the first loop leads to some fixed point, and there is no learning procedure; see linear regression $y = \boldsymbol{x}^T\boldsymbol{\beta} + \varepsilon$. Since $\widehat{\boldsymbol{\varepsilon}}\perp\boldsymbol{x}$, we cannot learn from the residuals.

    In order to make sure that we learn weakly, we can use some shrinkage parameter (or a collection of parameters).

    @freakonometrics 57

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Boosting with Piecewise Linear Spline & Stump Functions

    @freakonometrics 58

  • Arthur CHARPENTIER - Predictive Modeling in Insurance, in the context of (possibly) Big Data

    Take Home Message

    Similar goal: getting a predictive model, m(x)

    Different/Similar tools: minimize loss/maximize likelihood

    $$\widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmin}}\left\{\sum_{i=1}^n \ell(y_i, m(\boldsymbol{x}_i))\right\} \quad\text{vs.}\quad \widehat{m}(\cdot) = \underset{m(\cdot)\in\mathcal{F}}{\text{argmax}}\left\{\sum_{i=1}^n \log f(y_i; m(\boldsymbol{x}_i))\right\}$$

    Try to remove the noise and avoid overfit using cross-validation, $\sum_i \ell(y_i, \widehat{m}_{(i)}(\boldsymbol{x}_i))$

    Use computational tricks (bootstrap) to increase robustness

    Nice tools to select interesting features (LASSO, variable importance)

    @freakonometrics 59
