Implementing a CookieRun AI That Plays Better Than Me with Deep Learning and Reinforcement Learning, DEVIEW 2016

Speaker: Taehoon Kim (김태훈)

Transcript of "Implementing a CookieRun AI That Plays Better Than Me with Deep Learning and Reinforcement Learning", DEVIEW 2016

  • AI

  • EMNLP, DataCom, IJBDI

    http://carpedm20.github.io

  • Deep Learning + Reinforcement Learning

  • Playing Atari with Deep Reinforcement Learning (2013)

    https://deepmind.com/research/dqn/

  • Mastering the game of Go with deep neural networks and tree search (2016)

    https://deepmind.com/research/alphago/

  • ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning (2016)

    http://vizdoom.cs.put.edu.pl/

  • Deep Learning + Reinforcement Learning = ?

  • Machine Learning

  • Supervised learning

    e.g. Classification

  • Unsupervised learning

    e.g. Clustering

  • Reinforcement Learning

  • Reinforcement learning has two players: an Agent and an Environment

  • The Environment shows the Agent a state s_t, e.g. the game screen as a grid:

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 1 1 3 3 0 0 0
    0 1 1 3 3 0 0 0
    0 1 1 0 0 0 0 0
    -1 -1 -1 -1 -1 -1 -1 -1

  • Looking at s_t, the Agent chooses an action, e.g. a_t = 2

  • The action a_t = 2 moves the Environment into a new state, and the Agent receives a reward r_t = 1

  • The loop continues: in the next state the Agent chooses a_{t+1} = 0 and again receives a reward of 1

  • To summarize:

    the Agent takes an action,

    the action changes the state,

    and the Agent gets a reward

    Goal: learn the actions that maximize the reward

  • Learning from this loop of states s_t, actions a_t, and rewards r_t is Reinforcement Learning
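  • A minimal sketch of this agent-environment loop in Python; the environment, grid shape, action set, and reward here are illustrative assumptions, not the actual CookieRun setup:

    import numpy as np

    # Hypothetical action set for illustration: 0 = do nothing, 1 = slide, 2 = jump
    ACTIONS = [0, 1, 2]

    class ToyEnvironment:
        """Stand-in for the game: returns a small grid state and a reward."""

        def reset(self):
            return np.zeros((7, 8), dtype=np.int8)      # 7x8 grid state, as on the slide

        def step(self, action):
            next_state = np.zeros((7, 8), dtype=np.int8)
            reward = 1 if action == 2 else 0            # toy reward, not the real game's
            done = False
            return next_state, reward, done

    def pick_action(state):
        """Placeholder policy; a trained agent would pick argmax_a Q(state, a)."""
        return np.random.choice(ACTIONS)

    env = ToyEnvironment()
    state = env.reset()
    total_reward = 0
    for t in range(100):
        action = pick_action(state)               # the Agent acts on the state
        state, reward, done = env.step(action)    # the action changes the state
        total_reward += reward                    # the Agent receives a reward
        if done:
            break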

  • Deep Learning + Reinforcement Learning = ?

  • AlphaRun: a CookieRun-playing AI

  • Counting the game's cookies, pets, treasures, and other elements, the possible combinations multiply out to 5,040 in one case, and to a far larger number once every option is included

  • AlphaRun

  • The 8 techniques used to build the A.I.

    1. Deep Q-Network (2013)

    2. Double Q-Learning (2015)

    3. Dueling Network (2015)

    4. Prioritized Experience Replay (2015)

    5. Model-free Episodic Control (2016)

    6. Asynchronous Advantage Actor-Critic (A3C) method (2016)

    7. Human Checkpoint Replay (2016)

    8. Gradient Descent with Restart (2016)


  • 1. Deep Q-Network

  • State s_t:

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 1 1 3 3 0 -1 -1
    0 1 1 3 3 -1 -1 -1
    0 1 1 0 0 -1 -1 -1
    -1 -1 -1 -1 -1 -1 -1 -1

  • Action a_t = ? Given this state, should the cookie do nothing, jump, or slide?

  • To decide, we need a number that says how good each action is in this state

  • That number is the Q function

  • Q(s, a): how good it is to take Action a in State s

  • In a state s we compare Q(s, jump), Q(s, slide), and Q(s, do nothing)

  • For example, if the three values are Q = 5, Q = 0, and Q = 10, the Agent takes the action whose Q value is 10, the largest
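  • A minimal sketch of that greedy choice, assuming an array q_values holding one Q value per action (the action names and their order here are illustrative):

    import numpy as np

    ACTIONS = ['nothing', 'slide', 'jump']            # illustrative order

    def select_action(q_values):
        """Greedy policy: pick the action with the largest Q value."""
        return ACTIONS[int(np.argmax(q_values))]

    print(select_action(np.array([5.0, 0.0, 10.0])))  # picks the action with Q = 10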

  • Q is the sum of the rewards the Agent will receive from this point on

  • So how do we learn Q?

  • Running the game forward and adding up the rewards along the way (+1 +1 +5 + ..., +1 +1 + ..., +0 +1 + ...) gives an estimate of Q

  • The same holds when some steps are penalties (-1 -1 -1 + ..., -1 +1 -1 + ...)

  • Instead of waiting for whole runs, Q can be learned one step at a time from the Bellman target:

    target = r + γ max_a′ Q(s′, a′)
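  • A small sketch of the corresponding update for a tabular Q (DQN replaces the table with a neural network); the discount factor and learning rate are illustrative:

    import numpy as np

    n_states, n_actions = 100, 3
    Q = np.zeros((n_states, n_actions))
    gamma, alpha = 0.99, 0.1          # discount factor and learning rate (illustrative)

    def q_update(s, a, r, s_next):
        """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])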

  • Problem #1: the max in the target makes Q values get over-estimated, and each update propagates the over-estimate further

  • 2. Double Q-Learning

  • In the DQN target, the same network both picks the best next action and scores it

  • If the network happens to over-value some action, the max picks exactly that action, so the over-estimate feeds straight back into the target and keeps growing

  • DQN target, selection and evaluation by the same (target) network θ⁻:

    target = r + γ Q(s′, argmax_a′ Q(s′, a′; θ⁻); θ⁻)

  • Double DQN target, selection by the online network θ, evaluation by the target network θ⁻:

    target = r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ⁻)
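  • A sketch of the two targets side by side, assuming q_next_online and q_next_target are the Q values of the next state under the online and target networks (names illustrative):

    import numpy as np

    def dqn_target(r, q_next_target, gamma=0.99):
        """DQN: the target network both selects and evaluates the next action."""
        return r + gamma * np.max(q_next_target)

    def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
        """Double DQN: the online network selects, the target network evaluates."""
        a_star = int(np.argmax(q_next_online))
        return r + gamma * q_next_target[a_star]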

  • Double Q-learning keeps the Q estimates lower and closer to the truth

    Left: Deep Q-learning, Right: Double Q-learning

  • With Double Q-learning the Q values stopped blowing up, and the score climbed to 60!

  • But another problem followed

  • Problem #2... land 1, land 3, ...

  • 3. Dueling Network

  • Look at the state again:

    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 1 1 3 3 0 0 0
    0 1 1 3 3 0 0 0
    0 1 1 0 0 0 0 0
    -1 -1 -1 -1 -1 -1 -1 -1

  • Much of the reward that is about to arrive (+1 +1 +1 ...) is already fixed by what lies ahead in the state, whatever the cookie does

  • Only part of the outcome (+0 -1 -1 ...) actually depends on which action is taken

  • Putting an exact number on each action is hard: is this action worth 10 or 8? that one -2 or 1? the other 5 or 12?

  • But the exact numbers are not what we need

  • What matters for choosing is only the difference between the actions in the same state: +1 or +3? +1 or +2? -1 or -2?

  • So ask two easier questions instead: how good is this state by itself, and how much does each action add to or subtract from it? That split gives the Q decomposition below

  • Q(s,a) = V(s) + A(s,a)

  • V(s) is the Value of the state itself, independent of the action

  • A(s,a) is the Advantage of the action: how much better or worse taking a is compared to the state's baseline value

  • Deep Q-Network: one stream that outputs Q(s, a) for every action (e.g. do nothing, jump, slide)

  • Dueling Network: two streams, one estimating V(s) and one estimating A(s, a), recombined into Q(s, a)

  • The two streams have to be merged back into Q, and there is more than one way to do it:

    Sum:     Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)

    Max:     Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_a′ A(s, a′; θ, α))

    Average: Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_a′ A(s, a′; θ, α))
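  • A small numpy sketch of the three aggregations, given a state value v and per-action advantages adv (illustrative values; in the real network these come from the two output streams):

    import numpy as np

    def dueling_q(v, adv, mode='average'):
        """Combine a scalar state value and per-action advantages into Q values."""
        if mode == 'sum':
            return v + adv
        if mode == 'max':
            return v + (adv - np.max(adv))
        # 'average' is the aggregation the Dueling Network paper uses in practice
        return v + (adv - np.mean(adv))

    v = 10.0                          # V(s)
    adv = np.array([1.0, 3.0, -2.0])  # A(s, a) for each action
    print(dueling_q(v, adv))          # Q(s, a) under the average aggregation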

  • 3. Dueling network results

    Curves: DQN vs. Sum vs. Max

  • With the dueling network the score went from 60 to 100!!!

  • The 8 techniques used to build the A.I.

    1. Deep Q-Network (2013)

    2. Double Q-Learning (2015)

    3. Dueling Network (2015)

    4. Prioritized Experience Replay (2015)

    5. Model-free Episodic Control (2016)

    6. Asynchronous Advantage Actor-Critic (A3C) method (2016)

    7. Human Checkpoint Replay (2016)

    8. Gradient Descent with Restart (2016)


  • Wait, weren't there 8 of them..?

    1. Hyperparameter tuning

    2. Debugging

    3. Pretrained model

    4. Ensemble method

  • 1. Hyperparameter tuning

  • From the choice of optimizer to the reward definition, there are 70+ hyperparameters to tune

  • Network

    Training method

    Experience memory

    Environment

    Network:

    1. activation_fn: activation function (relu, tanh, elu, leaky)
    2. initializer: weight initializer (xavier)
    3. regularizer: weight regularizer (None, l1, l2)
    4. apply_reg: layers where the regularizer is applied
    5. regularizer_scale: scale of the regularizer
    6. hidden_dims: dimensions of the network
    7. kernel_size: kernel size of the CNN network
    8. stride_size: stride size of the CNN network
    9. dueling: whether to use a dueling network
    10. double_q: whether to use double Q-learning
    11. use_batch_norm: whether to use batch normalization

    Training method:

    1. optimizer: type of optimizer (adam, adagrad, rmsprop)
    2. batch_norm: whether to use batch normalization
    3. max_step: maximum steps to train
    4. target_q_update_step: # of steps between target network updates
    5. learn_start_step: # of steps before training begins
    6. learning_rate_start: the maximum value of the learning rate
    7. learning_rate_end: the minimum value of the learning rate
    8. clip_grad: value for the max gradient
    9. clip_delta: value for the max delta
    10. clip_reward: value for the max reward
    11. learning_rate_restart: whether to use learning rate restart

    Experience memory:

    1. history_length: the length of history
    2. memory_size: size of the experience memory
    3. priority: whether to use prioritized experience memory
    4. preload_memory: whether to preload a saved memory
    5. preload_batch_size: batch size of the preloaded memory
    6. preload_prob: probability of sampling from pre-mem
    7. alpha: alpha of prioritized experience memory
    8. beta_start: beta of prioritized experience memory
    9. beta_end_t: beta of prioritized experience memory

    Environment:

    1. screen_height: # of rows for a grid state
    2. screen_height_to_use: actual # of rows for a state
    3. screen_width: # of columns for a grid state
    4. screen_width_to_use: actual # of columns for a state
    5. screen_expr: expression that defines a state

       c: cookie, p: platform, s: score of cells, p: plain jelly
       i: plain item, h: heal, m: magic, hi: height, sp: speed, rj: remained jump

       ex) (c+2*p)-s, i+h+m, [rj,sp], ((c+2*p)-s)/2., i+h+m, [rj,sp,hi]

    6. action_repeat: # of repeats of an action (frame skip)
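  • As a sketch of how such options can be wired up, here is a small argparse-based config covering a handful of the flags above; the talk's own code may define them differently, e.g. as TensorFlow flags:

    import argparse

    parser = argparse.ArgumentParser()
    # network
    parser.add_argument('--activation_fn', default='relu', choices=['relu', 'tanh', 'elu', 'leaky'])
    parser.add_argument('--hidden_dims', default='[800, 400, 200]')
    parser.add_argument('--dueling', action='store_true', help='whether to use a dueling network')
    parser.add_argument('--double_q', action='store_true', help='whether to use double Q-learning')
    # training method
    parser.add_argument('--optimizer', default='adam', choices=['adam', 'adagrad', 'rmsprop'])
    parser.add_argument('--learning_rate_start', type=float, default=1e-3)
    parser.add_argument('--learning_rate_end', type=float, default=1e-5)
    # experience memory
    parser.add_argument('--memory_size', type=int, default=100000)
    parser.add_argument('--priority', action='store_true', help='use prioritized experience memory')
    config = parser.parse_args()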

  • Far too many combinations to run by hand, so launch the sweeps from code!

  • for land in range(3, 7):
        for cookie in cookies:
            options = []
            option = {
                'hidden_dims': ["[800, 400, 200]", "[1000, 800, 200]"],
                'double_q': [True, False],
                'start_land': land,
                'cookie': cookie,
            }
            options.append(option.copy())
            array_run(options, 'double_q_test')


  • 2. Debugging

  • When the agent misbehaves, the most useful things to inspect are...

  • the History fed to the network

  • the Q-values

  • and, for the Dueling Network, V(s) and A(s,a) separately

  • A bug we actually hit: the Q values for the actions kept coming out as 0

    TensorFlow's fully_connected applies an activation function by default,

    and that default is nn.relu,

    so any negative value at the output layer was clipped to 0

    The fix: set the output layer's activation function to None!

  • tensorflow/contrib/layers/python/layers/layers.py
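  • A minimal sketch of the fix with the contrib layers API of that era (layer sizes are illustrative):

    import tensorflow as tf
    from tensorflow.contrib import layers

    def q_network(state, num_actions):
        # hidden layers: the default activation_fn (tf.nn.relu) is fine here
        h = layers.fully_connected(state, 800)
        h = layers.fully_connected(h, 400)
        # output layer: Q values can be negative, so turn the default relu off
        return layers.fully_connected(h, num_actions, activation_fn=None)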

  • Also visualize the State and the Q values together

    ex) is the agent still Exploring enough?

    does the reward look the way you intended?

  • 3. Pretrained model

  • Instead of starting every experiment from random weights, load the weights of an already-trained model and continue from there

  • Max: 3,586,503 vs. Max: 3,199,324
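  • A sketch of that warm start with a TensorFlow 1.x Saver (the variable and the checkpoint path are placeholders):

    import tensorflow as tf

    # a stand-in for the Q network's weights (the real network defines many more)
    w = tf.get_variable('w', shape=[800, 3])

    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())           # fresh init as a fallback
        saver.restore(sess, '/path/to/previous/model.ckpt')   # placeholder checkpoint path
        # ... training continues from the restored weights ...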

  • 4. Ensemble methods

  • Experiment #1

    Experiment #2

    Experiment #3

    Experiment #4

  • Can the weights from several of these experiments be used together?

  • Ex) load each experiment's trained weights and combine the models

  • In TensorFlow, a Session executes a Graph

  • If several networks are built in the same default Graph, their nodes all pile into one Graph and their names collide

  • So give each model its own Graph, and open a separate Session for each Graph

  • Then any number of trained models can be loaded side by side

  • with tf.Session() as sess:
        network = Network()
        sess.run(network.output)

  • sessions = []
    g = tf.Graph()
    with g.as_default():
        network = Network()
        sess = tf.Session(graph=g)
        sessions.append(sess)

    sessions[0].run(network.output)

    One Graph and one Session per model

  • Now several models can be run at the same time!
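  • A sketch of how an ensemble over those sessions might combine predictions by averaging Q values across models; the Network class, its tensors, and the state size are assumptions for illustration:

    import numpy as np
    import tensorflow as tf

    class Network:
        """Tiny stand-in Q network so the sketch is self-contained."""
        def __init__(self, num_actions=3):
            self.state = tf.placeholder(tf.float32, [None, 56])   # 7x8 grid, flattened
            self.output = tf.contrib.layers.fully_connected(
                self.state, num_actions, activation_fn=None)

    models = []
    for _ in range(4):                                 # e.g. Experiment #1 .. #4
        g = tf.Graph()
        with g.as_default():
            network = Network()
            sess = tf.Session(graph=g)
            sess.run(tf.global_variables_initializer())   # the real code would restore weights here
            models.append((sess, network))

    def ensemble_q(state):
        """Average the Q values predicted by every model."""
        qs = [sess.run(net.output, {net.state: state}) for sess, net in models]
        return np.mean(qs, axis=0)

    state = np.zeros((1, 56), dtype=np.float32)
    action = int(np.argmax(ensemble_q(state)))         # act greedily on the averaged Q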

  • The finished A.I. in action