Modeling Big Count Data: An IRLS Framework for COM-Poisson Regression and GAM
Suneel Chatla, Galit Shmueli
November 12, 2016
Institute of Service Science, National Tsing Hua University, Taiwan (R.O.C.)
Table of contents
1. Speed Dating Experiment - Count data models
2. Motivation
3. An IRLS framework
4. Simulation Study - Comparison of IRLS with MLE
5. A CMP Generalized Additive Model
6. Results & Conclusions
Speed Dating Experiment - Count data models
Speed dating experiment
Fisman et al. (2006) conducted a speed dating experiment to evaluate gender differences in mate selection [1].

Total sessions: 14
Decision: 1 or 0
Attractiveness: 1-10
Intelligence: 1-10
Ambition: 1-10
......
Control variables

[1] https://www.kaggle.com/annavictoria/speed-dating-experiment
Outcome/Count variables
Matches: when both persons decide Yes
Tot.Yes: total number of Yes for each subject in a particular session
Summary Statistics
Statistic       N     Mean     St. Dev.   Min      Max
matches         531   2.524    2.304      0        14
Tot.Yes         531   6.433    4.361      0        21
Tot.partner     531   15.311   4.967      5        22
age             531   26.303   3.735      18       55
perc.samerace   531   0.391    0.242      0.000    0.833
avg.intcor      531   0.190    0.167      -0.298   0.569
attr            531   6.195    1.122      1.818    10.000
sinc            531   7.205    1.108      2.773    10.000
intel           531   7.381    0.988      3.409    10.000
func            531   6.438    1.103      2.682    10.000
amb             531   6.812    1.133      3.091    10.000
shar            531   5.511    1.333      1.409    10.000
like            531   6.157    1.072      1.682    10.000
prob            531   5.234    1.525      0.778    10.000
mean.agep       531   26.314   1.674      20.444   31.667
attr_o          531   6.200    1.186      2.333    8.688
sinc_o          531   7.224    0.690      4.167    9.000
intel_o         531   7.410    0.614      4.875    9.150
fun_o           531   6.438    1.015      2.625    8.615
amb_o           531   6.827    0.756      4.600    8.842
shar_o          531   5.498    0.942      1.375    7.700
like_o          531   6.161    0.873      2.333    8.300
prob_o          531   5.256    0.736      3.200    7.200
Tot.part.Yes    531   6.420    4.128      0        20
Tools:
• Poisson Regression
• Negative Binomial Regression
• Conway-Maxwell Poisson (CMP) Regression
The CMP distribution
From Shmueli et al. (2005), $Y \sim \mathrm{CMP}(\lambda, \nu)$ implies

$$P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu} Z(\lambda, \nu)}, \quad y = 0, 1, 2, \ldots$$

$$Z(\lambda, \nu) = \sum_{s=0}^{\infty} \frac{\lambda^{s}}{(s!)^{\nu}}$$

for $\lambda > 0$, $\nu \geq 0$.

The CMP distribution includes three well-known distributions as special cases:

• Poisson ($\nu = 1$),
• Geometric ($\nu = 0$, $\lambda < 1$),
• Bernoulli ($\nu \to \infty$, with probability $\frac{\lambda}{1+\lambda}$).
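As a rough numerical illustration of the probability function above, the following Python sketch evaluates CMP probabilities by truncating the series for $Z(\lambda, \nu)$. The function name `cmp_pmf` and the `max_terms` cut-off are assumptions made for this example and do not come from the slides or from the authors' code.

```python
import math

def cmp_pmf(lam, nu, max_terms=500):
    """Return [P(Y=0), ..., P(Y=max_terms-1)] for Y ~ CMP(lam, nu).

    The infinite series Z(lam, nu) is truncated at max_terms terms (an
    assumption of this sketch); for nu = 0 the series needs lam < 1.
    """
    # Work on the log scale: log(lam^s / (s!)^nu) = s*log(lam) - nu*log(s!)
    logw = [s * math.log(lam) - nu * math.lgamma(s + 1.0) for s in range(max_terms)]
    top = max(logw)                        # subtract the maximum to avoid overflow
    w = [math.exp(v - top) for v in logw]
    z = sum(w)                             # truncated, rescaled normalising constant
    return [v / z for v in w]

# Special cases listed above: nu = 1 gives the Poisson, nu = 0 the geometric.
poisson_like = cmp_pmf(2.0, 1.0)
geometric_like = cmp_pmf(0.5, 0.0)
```

With $\nu = 1$ and $\lambda = 2$ the first probability agrees with the Poisson value $e^{-2}$, and with $\nu = 0$ and $\lambda = 0.5$ the probabilities follow $(1-\lambda)\lambda^{y}$, which gives a quick sanity check of the special cases.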
CMP distribution for different (λ, ν) combinations
[Figure: CMP densities for λ = 2, 8, 15 (rows) and ν = 0.5, 0.75, 1, 3 (columns); vertical axis: Density.]
CMP Regression
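CMP regression models can be formulated as follows:

$$\log(\lambda) = X\beta \qquad (1)$$
$$\log(\nu) = Z\gamma \qquad (2)$$

Maximizing the log-likelihood with respect to the parameters $\beta$ and $\gamma$ yields the following normal equations (Sellers and Shmueli, 2010):

$$U = \frac{\partial \log L}{\partial \beta} = X^{T}(y - E(y)) \qquad (3)$$
$$V = \frac{\partial \log L}{\partial \gamma} = \nu Z^{T}(-\log(y!) + E(\log(y!))) \qquad (4)$$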
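To make equations (3) and (4) concrete, here is a minimal Python sketch that evaluates the two score vectors for given coefficients. The helper names `cmp_means` and `cmp_scores`, the list-of-rows layout for `X` and `Z`, and the truncated summation used for the expectations are all assumptions of this example, not the authors' implementation.

```python
import math

def cmp_means(lam, nu, max_terms=500):
    """E(Y) and E(log(Y!)) for Y ~ CMP(lam, nu), by truncated summation."""
    logf = [math.lgamma(s + 1.0) for s in range(max_terms)]
    logw = [s * math.log(lam) - nu * logf[s] for s in range(max_terms)]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    z = sum(w)
    mean_y = sum(s * w[s] for s in range(max_terms)) / z
    mean_logf = sum(logf[s] * w[s] for s in range(max_terms)) / z
    return mean_y, mean_logf

def cmp_scores(beta, gamma, X, Z, y):
    """Score vectors U (for beta) and V (for gamma) from equations (3) and (4)."""
    U = [0.0] * len(beta)
    V = [0.0] * len(gamma)
    for xi, zi, yi in zip(X, Z, y):
        lam = math.exp(sum(b * x for b, x in zip(beta, xi)))    # log(lam) = x'beta
        nu = math.exp(sum(g * z for g, z in zip(gamma, zi)))    # log(nu) = z'gamma
        mean_y, mean_logf = cmp_means(lam, nu)
        for j, x in enumerate(xi):
            U[j] += x * (yi - mean_y)                            # X'(y - E(y))
        for j, z in enumerate(zi):
            V[j] += nu * z * (-math.lgamma(yi + 1.0) + mean_logf)
    return U, V
```

For an intercept-only model with $\nu = 1$, `U` vanishes when $\lambda$ equals the sample mean, which matches the familiar Poisson result.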
Motivation
Exploration of Speed Dating data
[Figure: four scatter plots of Tot.Yes (log) against Sincerity (Others), Intelligence (Others), Sincerity, and Fun seeking.]
More flexibility?
Generalized Additive Models
• Smoothing Splines
• Penalized Splines
Both implementations depend on the Iteratively Reweighted Least Squares (IRLS) estimation framework.
At present, there is no IRLS framework available for CMP!
An IRLS framework
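Update for each iteration

$$I \begin{bmatrix} \beta \\ \gamma \end{bmatrix}^{(m)} = I \begin{bmatrix} \beta \\ \gamma \end{bmatrix}^{(m-1)} + \begin{bmatrix} U \\ V \end{bmatrix}$$

which implies the following equations

$$X^{T}\Sigma_{y}X\beta^{(m)} - X^{T}\Sigma_{y,\log(y!)}\nu Z\gamma^{(m)} = X^{T}\Sigma_{y}X\beta^{(m-1)} - X^{T}\Sigma_{y,\log(y!)}\nu Z\gamma^{(m-1)} + X^{T}(y - E(y))$$

and

$$-\nu Z^{T}\Sigma_{y,\log(y!)}X\beta^{(m)} + \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m)} = -\nu Z^{T}\Sigma_{y,\log(y!)}X\beta^{(m-1)} + \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m-1)} + \nu Z^{T}(-\log(y!) + E(\log(y!)))$$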
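Holding $\gamma$ fixed in the first equation and $\beta$ fixed in the second, the cross terms cancel and the equations reduce to

$$X^{T}\Sigma_{y}X\beta^{(m)} = X^{T}\Sigma_{y}X\beta^{(m-1)} + X^{T}(y - E(y)) \qquad (5)$$
$$\nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m)} = \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m-1)} + \nu Z^{T}(-\log(y!) + E(\log(y!))) \qquad (6)$$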
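Solving (5) and (6) for the new coefficients gives two scoring steps of the same shape, coef + (D' W D)^{-1} D' r. The Python sketch below performs one such pass, under several assumptions: all function names are hypothetical, the cumulants are computed by truncated sums, $\Sigma_{y}$ and $\Sigma_{\log(y!)}$ are treated as diagonal matrices of per-observation variances, and no step halving or stopping rule is included. It is not the authors' implementation.

```python
import math

def cmp_cumulants(lam, nu, max_terms=500):
    """Means and variances of Y and log(Y!) for Y ~ CMP(lam, nu) (truncated sums)."""
    logf = [math.lgamma(s + 1.0) for s in range(max_terms)]
    logw = [s * math.log(lam) - nu * logf[s] for s in range(max_terms)]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    z = sum(w)
    p = [v / z for v in w]
    m_y = sum(s * p[s] for s in range(max_terms))
    m_f = sum(logf[s] * p[s] for s in range(max_terms))
    v_y = sum((s - m_y) ** 2 * p[s] for s in range(max_terms))
    v_f = sum((logf[s] - m_f) ** 2 * p[s] for s in range(max_terms))
    return m_y, v_y, m_f, v_f

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def scoring_step(D, weights, resid, coef):
    """Return coef + (D' W D)^{-1} D' resid, the common form of updates (5) and (6)."""
    k = len(coef)
    A = [[sum(w * d[a] * d[b] for d, w in zip(D, weights)) for b in range(k)]
         for a in range(k)]
    g = [sum(d[a] * r for d, r in zip(D, resid)) for a in range(k)]
    return [c + s for c, s in zip(coef, solve(A, g))]

def irls_iteration(beta, gamma, X, Z, y):
    """One pass: update beta with gamma fixed (5), then gamma with beta fixed (6)."""
    def fitted(b, g):
        lam = [math.exp(sum(bj * xj for bj, xj in zip(b, xi))) for xi in X]
        nu = [math.exp(sum(gj * zj for gj, zj in zip(g, zi))) for zi in Z]
        return nu, [cmp_cumulants(l, n) for l, n in zip(lam, nu)]
    nu, cum = fitted(beta, gamma)
    beta = scoring_step(X, [c[1] for c in cum],
                        [yi - c[0] for yi, c in zip(y, cum)], beta)
    nu, cum = fitted(beta, gamma)                 # refresh cumulants at the new beta
    w = [n * n * c[3] for n, c in zip(nu, cum)]   # nu^2 * Var(log(y!))
    r = [n * (-math.lgamma(yi + 1.0) + c[2]) for n, yi, c in zip(nu, y, cum)]
    gamma = scoring_step(Z, w, r, gamma)
    return beta, gamma
```

In practice `irls_iteration` would be called repeatedly until a convergence criterion is met. Splitting the update into two small weighted least-squares problems is the point of the reduction: each step only needs the cumulants of $y$ or of $\log(y!)$, not their covariance.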
Algorithm
https://arxiv.org/abs/1610.08244
Practical issues
Initial Values
• For $\lambda$: $(y + 0.1)^{\nu}$
• For $\nu$: 0.2
Calculation of Cumulants
• Bounding error $10^{-8}$ or $10^{-10}$
• Asymptotic expressions
Stopping Criterion
• Based on $-2\sum l(y_i; \hat{\lambda}_i, \hat{\nu}_i)$
Step size
• Step halving
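The bounded-error idea for the cumulants can be illustrated on the normalising constant $Z(\lambda, \nu)$. The sketch below stops summing once a geometric bound on the remaining tail falls below a relative tolerance. The function name `cmp_normalizer` and this particular bound are assumptions of the example; the slides do not say which bound the authors use.

```python
import math

def cmp_normalizer(lam, nu, tol=1e-10, max_terms=100000):
    """Approximate Z(lam, nu) with relative truncation error below tol.

    Returns the partial sum and the number of terms used.
    """
    if nu == 0 and lam >= 1:
        raise ValueError("Z diverges for nu = 0 and lam >= 1")
    total, term, s = 0.0, 1.0, 0       # term holds lam^s / (s!)^nu, starting at s = 0
    while s < max_terms:
        total += term
        next_term = term * lam / (s + 1) ** nu
        ratio = lam / (s + 2) ** nu    # every later term ratio is no larger than this
        # Tail after s is at most next_term * (1 + ratio + ratio^2 + ...)
        if ratio < 1 and next_term / (1 - ratio) <= tol * total:
            return total, s + 1
        term = next_term
        s += 1
    raise RuntimeError("series did not converge within max_terms")

z_poisson, terms_used = cmp_normalizer(2.0, 1.0)   # Z equals exp(2) when nu = 1
```

The same stopping rule could be applied to the sums defining $E(y)$ and $E(\log(y!))$. For large $\lambda$ combined with small $\nu$ many terms are needed, which is where the asymptotic expressions mentioned above become attractive.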
Simulation Study - Comparison of IRLS with MLE
Study design
We compare our IRLS algorithm with the existing implementation, which is based on maximizing the likelihood function (through optim in R).

(a) Set sample size $n = 100$
(b) Generate $x_1 \sim U(0, 1)$ and $x_2 \sim N(0, 1)$
(c) Calculate $x_3 = 0.2x_1 + U(0, 0.3)$ and $x_4 = 0.3x_2 + N(0, 0.1)$ (to create correlated variables)
(d) Generate $y \sim \mathrm{CMP}(\log(\lambda) = 0.05 + 0.5x_1 - 0.5x_2 + 0.25x_3 - 0.25x_4, \nu)$, where $\nu \in \{0.5, 2, 5\}$
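Steps (a) to (d) can be sketched in Python as follows. The sampler draws CMP variates by inverting a truncated CDF, which is an assumption of this example (the slides do not describe the sampling method), and the second argument of $N(0, 0.1)$ is treated as a standard deviation, which is likewise an assumption. The names `sample_cmp` and `simulate` are hypothetical.

```python
import math
import random

def sample_cmp(lam, nu, rng, max_terms=2000):
    """Draw one CMP(lam, nu) variate by inverting the truncated CDF."""
    logw = [s * math.log(lam) - nu * math.lgamma(s + 1.0) for s in range(max_terms)]
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    u = rng.random() * sum(w)          # uniform draw on the unnormalised scale
    acc = 0.0
    for s, ws in enumerate(w):
        acc += ws
        if u < acc:
            return s
    return max_terms - 1               # guard against rounding at the upper end

def simulate(n=100, nu=2.0, seed=0):
    """Generate covariates (x1, x2, x3, x4) and CMP responses as in steps (a)-(d)."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        x3 = 0.2 * x1 + rng.uniform(0.0, 0.3)
        x4 = 0.3 * x2 + rng.gauss(0.0, 0.1)   # 0.1 used as a standard deviation (assumption)
        log_lam = 0.05 + 0.5 * x1 - 0.5 * x2 + 0.25 * x3 - 0.25 * x4
        rows.append((x1, x2, x3, x4))
        y.append(sample_cmp(math.exp(log_lam), nu, rng))
    return rows, y

# One data set per dispersion level considered in the study.
datasets = {nu: simulate(n=100, nu=nu, seed=1) for nu in (0.5, 2.0, 5.0)}
```

Fixing the seed makes each simulated data set reproducible, so both estimation methods can be run on identical data.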
Results
[Figure: box plots of the IRLS (IR) and MLE estimates for the coefficients of x1, x2, x3 and x4 and for log(ν), shown for ν = 0.5, ν = 2 and ν = 5.]
A CMP Generalized Additive Model
Additive Model
$$\log(\lambda) = \alpha + \sum_{j=1}^{p} f_{j}(X_{j})$$
$$\log(\nu) = Z\gamma$$

where $f_j$ ($j = 1, 2, \ldots, p$) are the smooth functions for the $p$ variables.
Backfitting
Following Hastie and Tibshirani (1990) and Wood (2006), the algorithm is as follows:

1. Initialize: $f_j = f_j^{(0)}$, $j = 1, \ldots, p$
2. Cycle: $j = 1, \ldots, p, 1, \ldots, p, \ldots$

$$f_j = S_j\Big(y - \sum_{k \neq j} f_k \,\Big|\, x_j\Big)$$

3. Continue (2) until the individual functions don't change.

One more nested loop inside the IRLS framework!
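The backfitting cycle can be sketched in Python for an ordinary (Gaussian) additive model. The sketch uses a Nadaraya-Watson kernel smoother in place of the spline smoothers $S_j$, adds an explicit intercept `alpha`, and centres each $f_j$; these choices and the names `kernel_smooth` and `backfit` are assumptions of the example. In the CMP setting this loop would run on the working response inside each IRLS iteration, which is not shown here.

```python
import math

def kernel_smooth(x, r, bandwidth=0.3):
    """Nadaraya-Watson smoother: fitted values of r regressed on x."""
    fitted = []
    for x0 in x:
        k = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
        fitted.append(sum(ki * ri for ki, ri in zip(k, r)) / sum(k))
    return fitted

def backfit(columns, y, smoother=kernel_smooth, max_cycles=50, tol=1e-8):
    """Fit y = alpha + sum_j f_j(x_j) by cycling a smoother over partial residuals."""
    n, p = len(y), len(columns)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in range(p)]           # step 1: initialise every f_j to zero
    for _ in range(max_cycles):                 # step 2: cycle over j = 1, ..., p
        change = 0.0
        for j in range(p):
            partial = [y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                       for i in range(n)]
            new = smoother(columns[j], partial)
            mean_new = sum(new) / n             # centre f_j so alpha stays identifiable
            new = [v - mean_new for v in new]
            change = max(change, max(abs(a - b) for a, b in zip(new, f[j])))
            f[j] = new
        if change < tol:                        # step 3: stop when the f_j stop changing
            break
    return alpha, f
```

The `smoother` argument accepts any function of the form `smoother(x, r)`, so a spline smoother could be substituted without changing the loop.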
Results & Conclusions
Comparison of Regression models on Tot.Yes
                 Poisson     Negative Binomial   CMP
(Intercept)      0.49        0.59                0.14
                 (0.43)      (0.55)              (0.33)
GenderMale       0.05        0.05                0.03
                 (0.04)      (0.06)              (0.03)
age              -0.01       -0.01               -0.004
                 (0.01)      (0.01)              (0.004)
Tot.partner      0.07***     0.07***             0.04***
                 (0.00)      (0.01)              (0.003)
avg.intcor       -0.04       -0.04               -0.02
                 (0.11)      (0.15)              (0.09)
attr             0.19***     0.18***             0.11***
                 (0.03)      (0.04)              (0.02)
sinc             -0.06       -0.05               -0.04
                 (0.03)      (0.04)              (0.02)
intel            0.05        0.06                0.03
                 (0.04)      (0.05)              (0.03)
func             0.03        0.04                0.02
                 (0.04)      (0.05)              (0.03)
amb              -0.12***    -0.13**             -0.07**
                 (0.03)      (0.04)              (0.02)
shar             0.10***     0.10***             0.06***
                 (0.02)      (0.03)              (0.02)
mean.agep        -0.01       -0.01               -0.007
                 (0.01)      (0.02)              (0.009)
attr_o           -0.10***    -0.10***            -0.06***
                 (0.02)      (0.03)              (0.02)
sinc_o           0.02        0.02                0.01
                 (0.04)      (0.05)              (0.03)
intel_o          0.08        0.08                0.05
                 (0.05)      (0.07)              (0.04)
fun_o            -0.01       -0.01               -0.003
                 (0.03)      (0.04)              (0.02)
amb_o            -0.00       -0.01               0.0005
                 (0.04)      (0.05)              (0.03)
shar_o           0.02        0.03                0.01
                 (0.03)      (0.04)              (0.02)
ν                                                0.53***
AIC              2844.92     2777.24             2751.7
BIC              3011.64     2948.23             2922.66
Log Likelihood   -1383.46    -1348.62            -1335.33
Deviance         970.04      637.25
Num. obs.        531         531                 531
***p < 0.001, **p < 0.01, *p < 0.05 (standard errors in parentheses)
Comparison of Additive Models on Tot.Yes
Dependent variable: Tot.Yes

             CMP (Chi.Sq)   Poisson (Chi.Sq)
s(sinc)      7.16           11.53**
s(func)      7.51           11.40**
s(sinc_o)    13.96**        29.30***
s(intel_o)   14.06**        13.26***
ν            0.56
AIC          2737.03        2804.77

Note: *p<0.1; **p<0.05; ***p<0.01

It is more the behavior of the opposite person that guides us to select her/him.
Summary
• The IRLS framework is far more efficient than the existing likelihood-based method and provides more flexibility.
• Since CMP is computationally heavier than the other GLMs, we could parallelize some matrix computations in order to increase the speed.
• The IRLS framework allows CMP to have other modeling extensions such as LASSO.
Full paper available from https://arxiv.org/abs/1610.08244 and the source code is available from https://github.com/SuneelChatla/cmp
Suggestions and Questions?
References
Fisman, R., Iyengar, S. S., Kamenica, E., and Simonson, I. (2006). Gender differences in mate selection: Evidence from a speed dating experiment. The Quarterly Journal of Economics, pages 673-697.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models, volume 43. CRC Press.
Sellers, K. F. and Shmueli, G. (2010). A flexible regression model for count data. Annals of Applied Statistics, 4(2):943-961.
Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., and Boatwright, P. (2005). A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1):127-142.
Wood, S. (2006). Generalized additive models: an introduction with R. CRC Press.