
Lecture 7: DQN

Reinforcement Learning with TensorFlow & OpenAI Gym, Sung Kim <[email protected]>

Q-function Approximation: Q-Nets

(1) state, s

(2) quality (reward) for all actions (e.g., [0.5, 0.1, 0.0, 0.8], meaning LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8)
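The sketch below illustrates this input/output shape: a tiny Q-network that takes a one-hot state vector and returns one Q-value per action. It is a minimal illustration, not the lecture's code; the 16-state, 4-action sizes are assumptions in the spirit of the FrozenLake example used earlier in the course.

import numpy as np
import tensorflow as tf

n_states, n_actions = 16, 4   # assumed environment size (FrozenLake-style)

# Linear Q-network: one-hot state in, a Q-value for every action out.
q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(n_actions, input_shape=(n_states,))
])

state = 0
one_hot = np.eye(n_states, dtype=np.float32)[state][None, :]   # shape (1, 16)
q_values = q_net(one_hot).numpy()[0]    # e.g. [0.5, 0.1, 0.0, 0.8]
action = int(np.argmax(q_values))       # greedy action (DOWN in the example values above)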


Q-Nets are unstable

Convergence is not guaranteed when Q-learning is combined with a nonlinear function approximator.

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2
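As a rough sketch of what minimizing this objective looks like in code (the names q_net, gamma, and optimizer are illustrative, not the lecture's), one naive gradient step on the squared TD error could be written as follows. Note that the same parameters θ produce both the prediction and the bootstrap target, which is exactly what makes this naive form unstable.

import tensorflow as tf

gamma = 0.99                                     # assumed discount factor
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def naive_q_update(q_net, s, a, r, s_next, done):
    """One step on [Q(s,a|theta) - (r + gamma * max_a' Q(s',a'|theta))]^2."""
    target = r
    if not done:
        # The target is computed with the SAME parameters theta being trained.
        target += gamma * float(tf.reduce_max(q_net(s_next)))
    with tf.GradientTape() as tape:
        q_sa = q_net(s)[0, a]                    # Q(s, a | theta)
        loss = tf.square(q_sa - target)          # squared TD error
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)

Here s and s_next are batched one-hot state vectors as in the earlier sketch.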

Reinforcement + Neural Net

http://stackoverflow.com/questions/10722064/training-a-neural-network-with-reinforcement-learning

[Nature cover, 26 February 2015, Vol. 518, No. 7540: "Self-taught AI software attains human-level performance in video games", pages 486 & 529]

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Two big issues

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

1. Correlations between samples

Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.


Prerequisite: http://hunkim.github.io/ml/ or https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌/


2. Non-stationary targets

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2

2. Non-stationary targets

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2

\hat{Y} = \hat{Q}(s_t, a_t \mid \theta) \qquad Y_{\text{target}} = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta)

The prediction and the target are computed from the same parameters θ, so every gradient step that moves the prediction also moves the target it is chasing.

DQN’s three solutions

1. Go deep

2. Capture and replay: addresses the correlations between samples

3. Separate networks (create a target network): addresses the non-stationary targets

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Solution 1: go deep


ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Problem 2: correlations between samples

Solution 2: experience replay

Deep Q-Networks (DQN): Experience Replay

To remove correlations, build a data-set from the agent's own experience:

s_1, a_1, r_2, s_2
s_2, a_2, r_3, s_3          →  (s, a, r, s')
s_3, a_3, r_4, s_4
...
s_t, a_t, r_{t+1}, s_{t+1}  →  (s_t, a_t, r_{t+1}, s_{t+1})

Sample experiences from the data-set and apply the update

l = \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2

To deal with non-stationarity, the target parameters w^- are held fixed.

Capture random samples & replay:

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2

ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Solution 2: experience replay

Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.
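A minimal sketch of the replay buffer idea described above, using Python's deque; the buffer size and batch size are illustrative choices, not values from the paper or the lecture code.

import random
from collections import deque

replay_buffer = deque(maxlen=50000)   # old experience is discarded automatically

def store(s, a, r, s_next, done):
    # Save one transition (s, a, r, s', done) from the agent's own experience.
    replay_buffer.append((s, a, r, s_next, done))

def sample_minibatch(batch_size=32):
    # Random sampling breaks the temporal correlation between consecutive steps.
    return random.sample(replay_buffer, batch_size)

# Training loop outline:
#   1. act in the environment, then store(s, a, r, s_next, done)
#   2. once len(replay_buffer) >= batch_size, draw sample_minibatch()
#      and apply the squared TD-error update to each sampled transition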


Problem 3: non-stationary targets

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2

\hat{Y} = \hat{Q}(s_t, a_t \mid \theta) \qquad Y_{\text{target}} = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta)

Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Solution 3: separate target network

Before (the target uses the same parameters θ):

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \right) \right]^2

After (the target uses a separate, frozen copy \bar{\theta}):

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \bar{\theta}) \right) \right]^2

Solution 3: separate target network

\min_{\theta} \sum_{t=0}^{T} \left[ \hat{Q}(s_t, a_t \mid \theta) - \left( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \bar{\theta}) \right) \right]^2

[Diagram: state s passes through network weights W(1), W(2) to produce the prediction; a separate network produces Y (target)]

Solution 3: copy network

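A minimal sketch of copying the trained network θ into a separate target network θ̄, assuming the same tiny Q-network shape as earlier; the helper names (build_q_net, td_target) and the copy period C are illustrative assumptions, not the lecture's or the paper's exact values.

import tensorflow as tf

def build_q_net(n_states=16, n_actions=4):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(n_actions, input_shape=(n_states,))
    ])

main_net = build_q_net()      # theta: updated every training step
target_net = build_q_net()    # theta-bar: held fixed between copies
target_net.set_weights(main_net.get_weights())   # initial copy

gamma = 0.99

def td_target(r, s_next, done):
    # Bootstrap from the FROZEN target network, not the network being trained.
    if done:
        return r
    return r + gamma * float(tf.reduce_max(target_net(s_next)))

# Every C steps (e.g. C = 1000), refresh theta-bar from theta:
#   target_net.set_weights(main_net.get_weights())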

Understanding the Nature paper (2015)

Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Next

Lab: DQN