Reproducibility and Replicability in Deep Reinforcement...

37
Reproducibility and Replicability in Deep Reinforcement Learning (and Other Deep Learning Methods) Peter Henderson Statistical Society of Canada Annual Meeting 2018

Transcript of Reproducibility and Replicability in Deep Reinforcement...

Page 1: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Reproducibility and Replicability in Deep Reinforcement Learning

(and Other Deep Learning Methods)Peter Henderson

Statistical Society of Canada Annual Meeting 2018

Page 2: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Contributors

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. "Deep reinforcement learning that matters.”

Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18). 2018.

Riashat Islam Phil Bachman

Doina Precup David Meger Joelle Pineau

Page 3: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What are we talking about?“Repeatability (Same team, same experimental setup): A researcher can reliably repeat her own computation.

Replicability (Different team, same experimental setup): An independent group can obtain the same result using the author's own artifacts.

Reproducibility (Different team, different experimental setup): An independent group can obtain the same result using artifacts which they develop completely independently.”

Plesser, Hans E. "Reproducibility vs. replicability: a brief history of a confused terminology." Frontiers in Neuroinformatics 11 (2018): 76.

Page 4: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

“Reproducibility is a minimum necessary condition for a finding to be believable and informative.”

Cacioppo, John T., et al. "Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science." (2015).

Page 5: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

“An article about computational science in a scientific publication is not the scholarship itself, it is merely the advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.”

Buckheit, Jonathan B., and David L. Donoho. "Wavelab and reproducible research." Wavelets and statistics. Springer New York, 1995. 55-81.

Page 6: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Why should we care about this?

Fermat’s last theorem (1637)

“No three positive integers a, b, and c satisfy the equation an + bn = cn for any integer value of n greater than 2.”

Image Source: https://upload.wikimedia.org/wikipedia/commons/4/47/Diophantus-II-8-Fermat.jpg

Page 7: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Why should we care about this?

Claimed proof too long for margins, so not included.

Proof not able to be reproduced until 1995 (358 years!!)

Image Source: https://upload.wikimedia.org/wikipedia/commons/4/47/Diophantus-II-8-Fermat.jpg

Page 8: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Why should we care about this?

https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970

Page 9: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

So what’s the deal with Deep Learning?Generative Adversarial Networks (GANs): “We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. (…) We did not find evidence that any of the tested algorithms consistently outperforms the original one.”

Lucic, Mario, et al. "Are GANs Created Equal? A Large-Scale Study." arXiv preprint arXiv:1711.10337 (2017).

Neural Language Models (NLMs): “We reevaluate several popular architectures and regularisation methods with large-scale automatic black-box hyperparameter tuning and arrive at the somewhat surprising conclusion that standard LSTM architectures, when properly regularised, outperform more recent models”

Melis, Gábor, Chris Dyer, and Phil Blunsom. "On the state of the art of evaluation in neural language models." arXiv preprint arXiv:1707.05589 (2017).

Reinforcement Learning (RL): “Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful.”

Henderson, Peter, et al. "Deep reinforcement learning that matters." arXiv preprint arXiv:1709.06560 (2017).

Page 10: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

REUSABLE material (software, datasets, experimental platforms)

help us REPRODUCE, REPLICATE, REPEAT scientific methods

to establish the ROBUSTNESS of the findings

using fair comparisons and informative evaluation methods.

Page 11: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What is RL?

Page 12: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

25 years of RL papers# of RL papers per year scraped from Google Scholar searches.

P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger. Deep Reinforcement Learning that Matters. AAAI 2017.

Page 13: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

https://eng.uber.com/deep-neuroevolution/

https://www.quora.com/Why-did-AlphaGo-resign-in-the-4th-match-against-Lee-Sedol

https://neuralsculpt.com/2018/05/26/gym-retro/

Page 14: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

http://robohub.org/deep-learning-in-robotics/

http://www.cim.mcgill.ca/~dmeger/ICRA2015_GaitLearning/icra2015_submission3011_megerHigueraXuDudek.pdf

https://amp.businessinsider.com/images/574995b4dd089506238b45e6-750-419.jpg

Page 15: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Investigated Reproducibility in Policy Gradient Methods

Codebases

Hyperparameters

Random Seeds

Variable Performance Across Different Settings

Page 16: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Codebases

Page 17: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Codebases

Page 18: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

HyperparametersAn intricate interplay of hyper-parameters.

For many/most algorithms, hyper parameters can have a profound effect on performance.

Page 19: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

HyperparametersAn intricate interplay of hyper-parameters.

For many/most algorithms, hyper parameters can have a profound effect on performance.

When testing a baseline, how motivated are we to find the best hyperparameters?

Page 20: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Environments

Alg.1 Alg.2 Alg.3 Alg.4

Alg.1 Alg.2 Alg.3 Alg.4

Alg.1 Alg.2 Alg.3 Alg.4

Page 21: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Behaviour

Video taken from: https://gym.openai.com/envs/HalfCheetah-v1

Page 22: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

How do we evaluate?

• Average return over test set of trials and random seeds?

• Confidence interval? +

How do we pick n?

Alg.1 Alg.2 Alg.3 Alg.4

Page 23: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

How many trials?

Page 24: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

How many trials?

0

10

20

30

40

50

60

70

Baseline to beat

n=10

Page 25: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

How many trials?

0

10

20

30

40

50

60

70

Baseline to beat

0

10

20

30

40

50

60

70

Top-3 resultsn=10

Baseline to beat

• We beat the baseline! • And with low variance!

Page 26: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

How many trials?

n=5

n=5

Page 27: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

How many trials?

Both are same TRPO code with best hyper-parameter configuration but different random seeds!

Page 28: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Other Considerations

Overfitting and generalization to new conditions

(reproducible algorithm performance in varying environments)

Page 29: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Other Considerations

What constitutes a fair comparison?

Using different hyperparameters? What about different codebases?

Should hyperparameter optimization computation time be included when comparing algorithm sample efficiency?

Page 30: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do?

Commit to releasing all materials necessary for replicability, repeatability, and reproducibility

(e.g., code, hyperparameters, tricks, etc.)

Page 31: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do?

Develop new methods for and ensure the use of rigorous evaluation and experimental methodology

(at least use many trials and random seeds as indications of robustness, and better statistical indicators of significant findings)

Page 32: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do?

Develop new methods for and ensure the use of rigorous evaluation and experimental methodology

2nd Workshop on Reproducibility in Machine Learning (ICML 2018) (deadline to submit June 10th)

https://sites.google.com/view/icml-reproducibility-workshop/home

Page 33: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do?

Aim to make algorithms more robust to variation in different settings and at least across random seeds.

Better generalization and robustness

Page 34: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do?

Reduce the amount of hyperparameters for ease-of-use by outside communities, robustness, and easier fair comparisons.

(e.g., fewer hyperparameters, AutoML, etc.)

Page 35: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do?

Align rewards with desired behaviors

1st Workshop on Goal Specification (ICML 2018)

https://sites.google.com/view/goalsrl/home

Page 36: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

What can we do as a community?Commit to sharing reusable material.

Develop a culture of good experimental practice.Be thorough, be fair, be as critical of the “good” results as the “bad results”.

Contribute to the reproducibility effort!Organize an event, sign-up for a challenge, include it in your work, in your

course.

Page 37: Reproducibility and Replicability in Deep Reinforcement Learningweb.stanford.edu/~phend/slides/ssc2018.pdf · 2018. 6. 19. · “Repeatability (Same team, same experimental setup):

Submit your awesome work on reproducibility to the 2nd Workshop on Reproducibility in Machine Learning (ICML 2018) (deadline to submit June 10th)

Thank you!

Feel free to email with questions and comments.