A framework for Multi-A(rmed)/B(andit) testing with online FDR control Fanny Yang, Aaditya Ramdas, Kevin Jamieson, Martin J. Wainwright UC Berkeley Spotlight, NIPS Conference, December 2017

A framework for Multi-A(rmed)/B(andit) testing with online FDR control

Fanny Yang, Aaditya Ramdas, Kevin Jamieson, Martin J. Wainwright
UC Berkeley

Spotlight, NIPS Conference, December 2017

Traditional A/B Testing

A (control, 50%) vs. B (alternative, 75%)

H0: A is at least as good as B

Hypothesis test:
accept H0: keep using A
reject H0: switch to B
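
The hypothesis test on this slide can be illustrated with a one-sided two-proportion z-test. This is a minimal sketch, not the test used in the talk: the function names, the sample sizes, the level 0.05, and the reading of 50% and 75% as observed success rates are all assumptions made for illustration.

```python
import math


def one_sided_p_value(successes_a, n_a, successes_b, n_b):
    """P-value for H0: rate of A >= rate of B (pooled two-proportion z-test)."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation observed, so no evidence against H0
    z = (rate_b - rate_a) / std_err
    # Upper-tail probability of a standard normal distribution
    return 0.5 * math.erfc(z / math.sqrt(2))


def decide(p_value, alpha=0.05):
    """Reject H0 (switch to B) only when the p-value falls below alpha."""
    return "switch to B" if p_value < alpha else "keep using A"


# Hypothetical counts chosen to match the slide's 50% vs. 75%
p = one_sided_p_value(50, 100, 75, 100)
print(decide(p))
```

With fewer samples per arm the same 50% vs. 75% gap may not be enough to reject H0, which is why the decision depends on the p-value and not on the observed rates alone.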

In reality: many alternatives, many tests

Sequence of tests …, each comparing Control (default) vs. Alternatives:

January: Phone App Layout
April: Website Layout
August: Teaser picture
…

Goal I (A/B testing)

Control the expected ratio #false discoveries / #discoveries (FDR)

Tests:
January: Phone App Layout
April: Website Layout
May: TV ad
June: Email ads
August: Teaser picture
Dec.: NIPS booth

Null hypothesis true: control is indeed better
Null hypothesis wrong: at least one alternative is better

Each test is either accepted or rejected. Rejected tests are discoveries; a rejected test whose null hypothesis is true is a false discovery.
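
The ratio on this slide can be computed for one batch of test outcomes as follows. This is a sketch under assumptions: the function name and the example outcomes are hypothetical, and the truth about each null hypothesis is unknown in a real experiment, so the computation is only possible in simulation. The FDR is the expectation of this ratio over repeated runs, with the usual convention that the ratio is 0 when nothing is rejected.

```python
def false_discovery_proportion(rejected, null_is_true):
    """Return #false discoveries / #discoveries for one run of tests.

    rejected[i]     -- True if test i rejected its null hypothesis
    null_is_true[i] -- True if the control really was at least as good
    """
    if len(rejected) != len(null_is_true):
        raise ValueError("both lists need one entry per test")
    discoveries = [i for i, was_rejected in enumerate(rejected) if was_rejected]
    if not discoveries:
        return 0.0  # convention: no discoveries means no false discoveries
    false_discoveries = sum(1 for i in discoveries if null_is_true[i])
    return false_discoveries / len(discoveries)


# Hypothetical outcomes for six tests (not taken from the slides)
rejected = [True, False, True, True, False, False]
null_is_true = [False, True, True, False, False, True]
print(false_discovery_proportion(rejected, null_is_true))
```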

Goal II (power and best alternative)

Maximize # true discoveries, find best alternative for each discovery

Tests:
January: Phone App Layout
April: Website Layout
May: TV ad
June: Email ads
August: Teaser picture
Dec.: NIPS booth

Null hypothesis true: control is indeed better
Null hypothesis wrong: at least one alternative is better

Each test is either accepted or rejected. Rejected tests are discoveries; a rejected test whose null hypothesis is wrong is a true discovery.

Best alternative: Alternative 3, Alternative 4, Alternative 2

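Our framework: MAB-FDR

MAB-FDR meta algorithm

Input: desired FDR level 𝛼

Test j:
Online FDR procedure → 𝛼𝑗
Best-arm MAB → 𝑝𝑗, best alternative
Test 𝑝𝑗 < 𝛼𝑗 → reject/accept, fed back to the online FDR procedure

Test j+1:
Online FDR procedure → 𝛼j+1
Best-arm MAB → 𝑝j+1, best alternative
Test 𝑝j+1 < 𝛼j+1 → reject/accept, fed back to the online FDR procedure

…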
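
The loop in this diagram can be sketched as below. Several pieces are assumptions and not the authors' implementation: the slides do not say which online FDR procedure is used, so a LOND-style rule (level proportional to the number of discoveries so far plus one) is swapped in as a stand-in; the best-arm MAB step is left as a caller-supplied function that returns a p-value and a best alternative; and all names and example values are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a best-arm MAB run takes its test level alpha_j
# and returns (p_value, best_alternative).
BestArmTest = Callable[[float], Tuple[float, str]]


def lond_style_level(alpha: float, j: int, discoveries_so_far: int) -> float:
    """Stand-in online FDR rule: alpha_j = alpha * gamma_j * (discoveries + 1)."""
    gamma_j = 6.0 / (math.pi ** 2 * j ** 2)  # positive weights summing to 1 over j >= 1
    return alpha * gamma_j * (discoveries_so_far + 1)


def mab_fdr(experiments: List[BestArmTest], alpha: float = 0.1) -> List[Dict]:
    """Run the tests in order, feeding each decision back into the next level."""
    results = []
    discoveries = 0
    for j, run_best_arm_test in enumerate(experiments, start=1):
        alpha_j = lond_style_level(alpha, j, discoveries)
        p_j, best_alternative = run_best_arm_test(alpha_j)
        rejected = p_j < alpha_j  # the comparison shown on the slide
        if rejected:
            discoveries += 1
        results.append({
            "test": j,
            "alpha_j": alpha_j,
            "p_j": p_j,
            "rejected": rejected,
            # Report a best alternative only for discoveries
            "best_alternative": best_alternative if rejected else None,
        })
    return results


def fixed_outcome(p_value: float, best_alternative: str) -> BestArmTest:
    """Hypothetical stand-in for a real best-arm MAB run."""
    return lambda alpha_j: (p_value, best_alternative)


experiments = [
    fixed_outcome(0.001, "Alternative 3"),
    fixed_outcome(0.2, "Alternative 1"),
    fixed_outcome(0.01, "Alternative 4"),
]
results = mab_fdr(experiments, alpha=0.1)
for row in results:
    print(row)
```

Passing the level into the MAB step reflects the diagram's arrow from the online FDR procedure to each test; a real best-arm routine could use that level to decide when it has sampled enough.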

Our framework…

1. Uses online FDR procedures to control FDR at any test

2. Uses a best-arm MAB algorithm for testing each hypothesis and finding the best alternative, while sampling only as much as needed

Come and learn more at Poster #2

“A framework for Multi-A(rmed)/B(andit) testing with online FDR control”

Fanny Yang, Aaditya Ramdas, Kevin Jamieson, Martin Wainwright