Feature selection can hurt model inference
Wayne Tai Lee
I can see this happen...
Deep diving into your A/B test data:
- Your A/B test said the new feature did not improve the target metric
- It is tempting to deep-dive into the A/B test data to see if you can detect an effect through fancy modeling
- Part of your modeling involves some feature selection
- You now detected an impact of your feature!
Prep work - what is a collider?
[Diagram: A → C ← B]
- Let A be a coin toss
- Let B be a separate coin toss
- Let C be “Were the coin toss outcomes from A and B the same?”
- C is a “collider”
Colliders can pass knowledge from one variable to another
- Knowing A doesn’t tell you anything about B.
- BUT if you know C in addition, you know B, even if B is not measured.
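A minimal R sketch of the coin-toss example, assuming fair coins and an arbitrary seed. The names B_guess and all_recovered are illustrative and not from the slides.

```r
# Collider demo: A and B are independent coin tosses, C records whether they match.
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 10000
A <- rbinom(n, 1, 0.5)      # first coin toss
B <- rbinom(n, 1, 0.5)      # separate, independent coin toss
C <- as.integer(A == B)     # collider: 1 if the two outcomes are the same

# On its own, A carries no information about B: the correlation is near zero.
cor_AB <- cor(A, B)

# Once C is known as well, B can be reconstructed exactly from A and C.
B_guess <- ifelse(C == 1, A, 1 - A)
all_recovered <- all(B_guess == B)

print(cor_AB)
print(all_recovered)
```

Knowing A alone leaves B a fair coin, but A and C together pin B down for every toss, which is the sense in which a collider passes knowledge along.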
Possible Tech World Issue
[Diagram: nodes A, B, C and S; a "?" marks the link in question]
- A is a new Ad campaign
- C is the total Clicks (engagement) of the user
- B is the Background of the user
- S indicates whether the user signed-up
We want to know if the Ad affected sign-up chances
Ads affect click rate (with calls to action)
Your user’s background affects their clicks
A user’s background, like age, industry, personal interests will also affect sign-up
Your A/B testing should have eliminated the correlation between background and the exposure to Ads
If you did not perform an A/B test, your ad could have been shown to people who would sign-up anyway
Clicks themselves should not drive sign-ups…it is more common that something is driving both clicks and sign-ups
Most likely you cannot or do not know what to measure for the background
Feature selection that predicts sign-ups will likely pick up C since B drives both C and S.
Pred(S) = func(A, C)
Even if A and C do not affect S…
Prob(S) = f(B)
Even if A and C do not affect S…
Now if you predict S based on A and C
You will detect an effect from A!
Prob(S) = f(B)
But Pred(S) = func(A, C)
Implication: you will think your ads matter for sign-ups
Even though you performed an A/B test!
Intuition: your predictive model picked up knowledge of B through C.
Solution: stick to classic A/B testing and do not add features without careful thought!
Simulation: Generate the data
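B = Uniform(0, 1)
A = Bernoulli(0.2)
C = 1[B > 0.5] * ceiling(Exponential(1/(B + A)))
S = Bernoulli(B)
------------------------------------
n = 10000
B = runif(n, 0, 1)     # user background (usually unmeasured in practice)
A = rbinom(n, 1, 0.2)  # randomized ad exposure, independent of B
C = ifelse(B > 0.5, ceiling(rexp(n, 1/(B + A))), 0)  # clicks depend on both B and A
S = rbinom(n, 1, B)    # sign-up probability depends on B only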
Simulation: Feature selection is often some form of correlation check
> cor(S, C)
[1] 0.3083319
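A rough sketch of what such a correlation-based screen might look like, assuming B is unmeasured so that only A and C are candidates. The 0.1 cutoff and the names candidates, screen and selected are hypothetical choices for illustration, and the seed is arbitrary, so the numbers will differ from the slide.

```r
# Regenerate data from the same model as the simulation slide.
set.seed(42)  # arbitrary seed
n <- 10000
B <- runif(n, 0, 1)
A <- rbinom(n, 1, 0.2)
C <- ifelse(B > 0.5, ceiling(rexp(n, 1 / (B + A))), 0)
S <- rbinom(n, 1, B)

# Candidate features available to the analyst (B is assumed unmeasured).
candidates <- data.frame(A = A, C = C)

# Screen each candidate by its correlation with the target S.
screen <- sapply(candidates, function(x) cor(x, S))

# Keep features whose absolute correlation exceeds a hypothetical cutoff.
threshold <- 0.1
selected <- names(screen)[abs(screen) > threshold]

print(round(screen, 3))
print(selected)
```

Under these assumptions the screen keeps C, because C carries information about B, and that is how the collider ends up in the model next to A.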
Simulation: We detect a strong effect from A
> summary(lm(S ~ A + C))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.381158   0.006409  59.470  < 2e-16 ***
A           0.056145   0.011937   4.704 2.59e-06 ***
C           0.119463   0.003646  32.766  < 2e-16 ***
Simulation: No problems if you include everything (often not feasible in real life)
> summary(lm(S ~ A + C + B))
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0002407  0.0085286   0.028    0.977
A            0.0108208  0.0103250   1.048    0.295
C           -0.0048050  0.0037920  -1.267    0.205
B            1.0031360  0.0171023  58.655   <2e-16 ***
Simulation: Also no problems if you don’t add the extra variables from feature selection
> summary(lm(S ~ A))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.498498   0.005593  89.125   <2e-16 ***
A           0.012458   0.012482   0.998    0.318
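For comparison, a classic A/B read-out on the same kind of data could be sketched as a two-sample test of proportions. This assumes prop.test as the test of choice, which the slides do not specify, and an arbitrary seed. The names signups, exposed and ab_test are illustrative.

```r
# Regenerate data from the same model as the simulation slide.
set.seed(7)  # arbitrary seed
n <- 10000
B <- runif(n, 0, 1)
A <- rbinom(n, 1, 0.2)
S <- rbinom(n, 1, B)  # sign-ups depend on B only, not on A

# Sign-up counts and group sizes for the treated (A = 1) and control (A = 0) arms.
signups <- c(sum(S[A == 1]), sum(S[A == 0]))
exposed <- c(sum(A == 1), sum(A == 0))

# Two-sample test of equal proportions: compares sign-up rates by arm only,
# without conditioning on clicks or any other post-treatment feature.
ab_test <- prop.test(x = signups, n = exposed)

print(ab_test$estimate)  # sign-up rate in each arm
print(ab_test$p.value)
```

Because the comparison uses only the randomized assignment A, there is no path for B to leak into the estimate through C.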
Even with experimental data, adding features to your model without careful thought can lead to wrong inference!
How to spot this?
Recall that engagement metrics are ultimately proxies of more important indicators
If your model detected an impact when you know it is not true, your model likely picked up something else...
> summary(lm(S ~ A + C))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.381158   0.006409  59.470  < 2e-16 ***
A           0.056145   0.011937   4.704 2.59e-06 ***
C           0.119463   0.003646  32.766  < 2e-16 ***
Yes, looking at the features that you don’t really care about can help!
> summary(lm(S ~ A + C))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.381158   0.006409  59.470  < 2e-16 ***
A           0.056145   0.011937   4.704 2.59e-06 ***
C           0.119463   0.003646  32.766  < 2e-16 ***
Question?
Send me a LinkedIn message!
https://www.linkedin.com/in/waynetailee/