(Advances in) Quantum Reinforcement Learning
Vedran Dunjko, [email protected], 2017
Transcript of the slides: qtml2017.di.univr.it/resources/Slides/Progress-in...

  • (Advances in) Quantum Reinforcement Learning

    Vedran Dunjko
    [email protected]

    2017

  • Acknowledgements:

    Briegel, Makmal, Melnikov, Friis, Tiersch, Taylor, Poulsen Nautrup, Orsucci, Liu, Wu, Paparo, Martin-Delgado, Piater, Hangl, Boyajian, Krenn, Zeilinger

    (theoretical physics; intelligent & interactive systems)

  • Part 1: Intro: Machine learning, Reinforcement learning (and AI), Quantum machine learning

    Part 2: (a perspective on) Quantum Reinforcement Learning

    Quantum-enhanced agents and quantum-accessible environments; Quantum enhancements of learning agents:
oraculization, and three flavours of improvements

  • Part 1: Quantum information and machine learning

  • Quantum machine learning (plus)

    QIP→ML (quantum-enhanced ML) [’94; ‘12]


    ML→QIP (ML applied to QIP problems) [70’s? ’10]


    QIP↭ML (generalizations of learning concepts) [’00? ’06]


    ML-inspired QIP; QIP/QM-inspired ML; beyond (Q. AI)?

  • Learning P(labels|data) given samples from P(data, labels)

    Learning structure in P(data), given samples from P(data)

    mostly about inferring information from data (about the source): medicine, stock markets, climate/weather

    But… what is Machine Learning?

  • Unsupervised learning / Supervised learning

    Learning P(labels|data) given samples from P(data, labels)

    Learning structure in P(data), given samples from P(data)

    mostly about inferring information from data (about the source): medicine, stock markets, climate/weather (a short code sketch of these two modes follows this slide)

    Who cares?

    ML/Data mining in the spotlight

    industry 4.0

    “This Is the Year of the Machine Learning Revolution. To predict the future, we look at the data we have about the past.”

    Citation from https://www.entrepreneur.com/article/287324
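    The supervised/unsupervised distinction above can be made concrete in a few lines. A minimal illustrative sketch (not from the slides; assumes NumPy and scikit-learn are available): a classifier is fit on labelled samples drawn from P(data, labels), while a clustering algorithm looks for structure in the same data with the labels withheld.

    # Minimal sketch (assumption: NumPy + scikit-learn installed); not from the slides.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Samples from P(data, labels): two Gaussian blobs with labels 0 and 1.
    X0 = rng.normal(loc=-2.0, scale=1.0, size=(100, 2))
    X1 = rng.normal(loc=+2.0, scale=1.0, size=(100, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * 100 + [1] * 100)

    # Supervised learning: approximate P(labels | data) from (data, label) pairs.
    clf = LogisticRegression().fit(X, y)
    print("P(label=1 | x=[2,2]) ~", clf.predict_proba([[2.0, 2.0]])[0, 1])

    # Unsupervised learning: look for structure in P(data) alone (labels withheld).
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print("cluster sizes:", np.bincount(clusters))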

  • Figures taken from Wikipedia

  • Figures taken from Wikipedia

  • Reinforcement learning in the spotlight

    Figures taken from Wikipedia

    MIT technology review: 
one of “10 breakthrough technologies in 2017”

    The New Stack: 
“AI’s Next Big Step: Reinforcement learning”

    E.g. AlphaGo Zero: superhuman Go performance with no human supervision: pure RL self-play

  • Towards AI?

    (diagram of AI subfields: supervised learning, reinforcement learning, knowledge representation, OCR & vision, reasoning & inference)

  • What do agents learn?

    (PO) Markov Decision Process

    Corner for the serious

  • RL glossary

    finite-horizon, infinite-horizon, history of interaction

    learning model/agent; no-free-lunch: metaparameters, computational complexity, “model complexity”

    Figures of merit:

  • What do agents learn?

    (PO) Markov Decision Process

    Corner for the serious

    RL vs. (data-driven) ML

    Learning behavior vs. learning about data

    In RL, the agent changes the environment

    In ML the agent does not change P(x|y)
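    The glossary and the agent-environment picture above map onto very little code. A minimal illustrative sketch (not from the slides; the chain MDP, the ε-greedy rule and the Q-learning update are generic textbook choices, not the projective-simulation model): the agent emits actions, the environment returns states and rewards, the (s, a, r) history of interaction accumulates, and total reward within a finite horizon is the figure of merit.

    # Minimal agent-environment interaction loop over a toy MDP (illustrative sketch).
    # States S = {0..4} on a chain, actions A = {left, right}, reward 1 at the right end.
    import random

    N_STATES, ACTIONS = 5, (-1, +1)          # chain MDP; -1 = left, +1 = right
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, eps = 0.1, 0.9, 0.1        # metaparameters (no-free-lunch: must be chosen)

    def step(s, a):
        """Environment: deterministic chain dynamics with reward at the right end."""
        s_next = max(0, min(N_STATES - 1, s + a))
        return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

    history = []                             # history of interaction: (s, a, r) triples
    s = 0
    for t in range(2000):                    # finite-horizon figure of merit: reward within T steps
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next, r = step(s, a)
        # Q-learning update: the agent's behaviour changes in response to the environment.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        history.append((s, a, r))
        s = 0 if r > 0 else s_next           # restart the episode after a reward

    print("total reward:", sum(r for _, _, r in history))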

  • 16

    What […] Quantum Reinforcement Learning

    RL in QIP/QM (natural, related to feedback and control)

    Learning for and in quantum experiments ML building a QC?

    Quantum-enhanced MLQ. enhancements for specific algorithms

    Q. enhancements via q. interaction
(quantum computational learning theory)


    Q. generalizations

    QRL in Q. systems enabling QRL?

  • 17

    What […] Quantum Reinforcement Learning

    RL in QIP/QM (natural, related to feedback and control)

    Learning for and in quantum experiments ML building a QC?

    Quantum-enhanced MLQ. enhancements for specific algorithms

    Q. enhancements via q. interaction
(quantum computational learning theory)


    Q. generalizations

    QRL in Q. systems enabling QRL?

  • Part 2: Quantum information and reinforcement learning

  • 2

    XV. SYMBOLS FOR TALK

    S = {s1, s2, . . .}

    A = {a1, a2, . . .}

    ak

    , ak+1, sk, s

    XVI. INTRO

    The clear motivation for considering quantum internal processes in a an agent is to see whether such agent canbe more successful, in any sense, than a fully classical agent. By construction, a projective-simulation based agentrealizes its input-output relations by performing a random walk, that is a Markovian process, over the clips. It isvery tempting to see whether obvious generic improvements can be obtained if the particular walk is performedin a quantum way. This is also statable in terms of a search problem, if the walk indeed performs a search. Here,the question is ’what do we look for’? The closest result to a ’generic improvement theorem’ is given via the(generalized) framework of Szegedy, which relates the eigenvalues of a transition matrix of an (egodic, symmetric- this is dropped later) Markov chain, and the eigenvalues of a corresponding ’quantum walk unitary’ which isdefined in his formalism.The found relationship between the two can be readily applied on the known results for the classical hitting

    time of some subset of the nodes (the marked nodes) of the Markov chain (which has expected hitting time inO(1/�✏), with � being the spectral gap of the transition matrix, and ✏ the fraction of marked nodes vs. all nodes),and the analogous quantum hitting time which scales as O(1/

    p�✏) - this yields a type of generic improvement.

    Motivation story for the talk: Sure, getting an exponential speedup is much cooler than getting a quadratic one.In the problems we address we talk about agents, which are abstractions of entities which exist, and interact withan environment. And they do so in real time. So let us consider an agent, and a setting where reaction time isimportant, and see what happens if we, ’only’ quadratically, slow down the agent. We will do this by solving acartoon Fermi problem. Consider a particular agent: a mouse. A mouse can find itself in tricky situations, likebeing stalked by a cat. And the moment it notices a cat it must respond as fast as it can. A non-super-athleticmouse can escape a cats attack if he spots the charging cat from a distance of say 1m. If he spots the cat later,he’s a goner. Given that your average charging cat runs at 10m/s, this implies the mouses deliberation time mustbe 0.1s. Suppose the mouse was quadratically slower. Let us give meaning to this assumption. A mouse has 1011

    synapses, and say 100km of total axon length in its brain. This gives the average e↵ective axon length of 10�6m. Aneural pulse travels at, on average, at 50m/s, say a lower bound of 10m/s, which yields that in a second a mouse’spulses can traverse 107 average axonic lengths. Since his deliberation time is 0.1s, in deliberation, the mouse’sneurons traverse 106 axonic lengths. A quadratically slower mouse will have to traverse 1012 axonic lengths to’jump into action’. The time it takes him to do this is 1012⇥10�8 = 104s. Well, the average cat can travel 100 kmin this time, meaning, if the mouse is in the center of Innsbruck, and he spots a charging cat which is in Bolzano,he is a goner! If the cat has a train pass, the mouse can spot the cat in Salzburg while the cat is boarding thetrain to Innsbruck, and still the cat will munch him.

    XVII. DIRECT QUANTIZATION OF THE ’SIMPLE PS’ APPROACH

    A. The need for a ’fair comparison’ of agents

    As mentioned, one of the central ideas of this project is to use quantum walk-base approaches to produce ’better’agents than classically possible. Since quantum walks have di↵erent, and sometimes clearly better properties thanclassical walks, this should be possible. In particular, there seems to be a generic speedup of hitting times inquantized versions of Markov chains. And this is attractive. A noted criticism is that even if one produces anagent which ’walks faster’ this does not incur ’faster learning’ times in terms of ’external time’ - that is, if theagent has all the time in the world to produce his response to the environment, then such a generic speedup is notuseful. However, this can be remedied by considering ’active learning’ - a process in which external and internal
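    The excerpt's cat-and-mouse arithmetic can be replayed in a couple of lines. The sketch below only reuses the excerpt's own rounded figures (0.1 s reaction budget, 10^6 axonic lengths per deliberation, 10^-8 s per length); it introduces no new estimates.

    # Replays the excerpt's own order-of-magnitude chain (its rounded figures used as-is).
    deliberation = 1.0 / 10.0                  # 1 m spotting distance / 10 m/s cat speed = 0.1 s
    lengths_in_deliberation = 1e6              # axonic lengths traversed while deliberating (excerpt's figure)
    time_per_length = 1e-8                     # seconds per axonic length (excerpt's figure)

    slow_steps = lengths_in_deliberation ** 2  # quadratic slowdown: square the step count -> 1e12
    slow_time = slow_steps * time_per_length   # 1e12 * 1e-8 = 1e4 s
    cat_distance_km = 10.0 * slow_time / 1000  # a 10 m/s cat covers ~100 km in that time

    print(deliberation, slow_time, cat_distance_km)   # 0.1 s, 10000.0 s, 100.0 km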

    Environment


    Agent

    V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced learning In preparation (2016)

    C or Q C or Q

    V. Dunjko, J. M. Taylor, H. J. Briegel Framework for learning agents in quantum environments arXiv:1507.08482 (2015)

    Want something like:

    Quantum Agent - Environment paradigm?

  • Flavors or RL…what are agents, environments?

    CS: MDPs, POMDPs, Turing machines

    Robotics: Real environments

    QIP/QML: ?…systems

    Image: Sebastian Thrun & Chris Urmson/Google

  • is equivalent to

    Agents (environments) are sequences of CPTP maps, acting on a private and a common register - the memory and the interface, respectively. Memory channels = combs = quantum strategies

    Agent

    Envir.

    Quantum Agent - Environment paradigm
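    The comb/strategy statement above can be written compactly. A minimal sketch in standard notation (the register names and map symbols are illustrative, not the slides' own): the agent's CPTP maps act on its private memory register A and the common interface register C, the environment's CPTP maps act on its private register E and C, and a t-step interaction is their alternating composition.

    \[
      \rho_t \;=\;
      \bigl(\mathcal{E}_t \otimes \mathrm{id}_A\bigr)\circ
      \bigl(\mathcal{A}_t \otimes \mathrm{id}_E\bigr)\circ \cdots \circ
      \bigl(\mathcal{E}_1 \otimes \mathrm{id}_A\bigr)\circ
      \bigl(\mathcal{A}_1 \otimes \mathrm{id}_E\bigr)
      \bigl(\rho_0^{A} \otimes \rho_0^{C} \otimes \rho_0^{E}\bigr),
    \]

    where each \mathcal{A}_k (CPTP on registers A, C) is one of the agent's maps and each \mathcal{E}_k (CPTP on registers E, C) is one of the environment's maps; memory channels = combs = quantum strategies.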

  • 22

    But…. why go there?

    Fundamental meaning and role of learning in the quantum world

    Speed-ups! “faster”, “better” learning


    What can we make better?

a) computational complexity b) learning efficiency

  • 23

    But…. why go there?

    Fundamental meaning and role of learning in the quantum world

    Speed-ups! “faster”, “better” learning


    What can we make better?

a) computational complexity b) learning efficiency (“genuine learning-related figures of merit”)

    (plot: success probability vs. time-steps)

    related to query complexity

  • Classical (in a formal,

    well-defined sense)

    Not classical

    • All improvements are 
consequences of 
computational improvements


    intuition: “cannot run 
Grover’s search on 
a classical phonebook”

    • More radical computational improvements 


    • More interesting quantum effects
may be exploitable


    Oracular quantum computation: 
exponential separations in principle possible

    V. Dunjko, J. M. Taylor, H. J. Briegel Framework for learning agents in quantum environments arXiv:1507.08482 (2015)

    V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)

    Quantum Agent - Environment paradigm

  • V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)

    Environment


    (figure: Agent and Environment interacting, with the interaction registers labelled Q for quantum)

  • Inspiration from oracular quantum computation…

  • Inspiration from oracular quantum computation…

    Agent-like Environment-like

  • Inspiration from oracular quantum computation…

    Agent-like Environment-like

    think of Environment as oracle

  • ∈ “Large” set of environments

    If E is chosen at random, on average all agents perform equally (NFL)

    How could one use that… E.g. oracle identification!

  • (figure: environment E ∈ A = {A1, A2, A3, …, An}, a set of possible environments)

  • (figure: E ∈ {A1, A2, A3, …, An}; (meta) A)

    think of Environment as oracle; identify it faster using QC

  • But… environments are not like standard oracles…

    “Oraculization” (taming the open environment)

    (blocking, accessing purification and recycling)

  • 1)

    2)

    3)

    Oraculization (blocking) (taming the open environment)

    quantum comb

    causal network

    “blocking”

  • 34

    Oraculization (recovery and recycling) (taming the open environment)

    Classically specified oracle

    f

    “quantiza

    tion”

  • 4)

    Oraculization (continued) (taming the open environment)

  • 4)

    Oraculization (continued) (taming the open environment)

    Relevant information about (an aspect of) the environment becomes accessible, encoded in an effective quantum oracle

  • 37

    RL:

    Oraculization in data-driven ML?

  • Result schemata: for every A we construct Aq s.t. Aq > A…

    find useful info using quantum access

    optimize a (given) learning mechanism

    compare to given (best) classical agent

  • First result: quantum upgrade of a learning model

    V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)

    polynomial improvements for a broad class of task environments by using Grover-like amplification

    1. Grover finds “rewarding sequences”

    2. Algorithm is “pre-trained” to focus more 
on rewarding space of policies

    3. Works whenever algorithm makes sense, 
and environment s.t. it prefers winners
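    Point 1 can be illustrated by simulating the amplitudes classically. A toy sketch (illustrative numbers and a hard-coded “rewarding” set; a quantum agent would of course run this coherently against the oraculized environment): the oracle phase-flips rewarding action sequences and the diffusion step inverts about the mean, boosting the probability of sampling a rewarding sequence well above uniform trial-and-error.

    # Toy simulation of Grover-style amplitude amplification over action sequences.
    import numpy as np

    n_actions, length = 2, 6                      # action sequences of length 6 over 2 actions
    N = n_actions ** length                       # 64 candidate sequences
    rewarding = {13, 42}                          # toy set of "rewarding sequences" (oracle knowledge)

    amps = np.full(N, 1.0 / np.sqrt(N))          # uniform superposition over all sequences
    n_iter = int(round((np.pi / 4) * np.sqrt(N / len(rewarding))))

    for _ in range(n_iter):
        for i in rewarding:                       # oracle: phase-flip rewarding sequences
            amps[i] *= -1.0
        amps = 2 * amps.mean() - amps             # diffusion: inversion about the mean

    p_rewarding = sum(amps[i] ** 2 for i in rewarding)
    print(f"after {n_iter} Grover iterations, P(rewarding) = {p_rewarding:.3f}")
    print(f"uniform guessing would give P = {len(rewarding) / N:.3f}")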

  • V. Dunjko, J. M. Taylor, H. J. Briegel Advances in Quantum Reinforcement Learning IEEE SMC 2017

    polynomial improvements for meta-learning settings by using quantum optimization methods

    given a fixed model and environment, the joint AE 
interaction can be oracularized

    model parameters can be “quantum-optimized”
in (any) given quantum-accessible environment

    Second result: quantum speedup of meta-learning
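    The “Grover-based optimization” of model parameters can be caricatured classically. The sketch below is an illustrative toy, not the paper's construction: it mimics Durr-Hoyer-style maximum finding by repeatedly Grover-searching for any metaparameter setting that beats the current best, and simply tallies the roughly √(N/k) oracle calls each such search would cost.

    # Classical caricature of Grover-based maximum finding over a grid of metaparameter
    # settings; "performance" stands in for the agent's score in the quantum-accessible
    # environment. Illustrative sketch only.
    import math, random

    random.seed(1)
    performance = [random.random() for _ in range(256)]   # oracle: score of each parameter setting

    best = random.randrange(len(performance))             # random initial threshold setting
    quantum_queries = 0
    while True:
        better = [i for i, p in enumerate(performance) if p > performance[best]]
        if not better:
            break
        # A Grover search for "any setting better than the threshold" needs roughly
        # sqrt(N / #better) oracle calls; we only tally that cost here.
        quantum_queries += math.ceil(math.sqrt(len(performance) / len(better)))
        best = random.choice(better)                       # outcome of the (simulated) search

    print("best setting:", best, "score:", round(performance[best], 3))
    print("estimated quantum oracle calls:", quantum_queries,
          "vs", len(performance), "for exhaustive classical evaluation")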

  • quantum upgrade of a learning model  |  quantum speedup of meta-learning

    underlying quant. alg.: Grover-based optimization  |  underlying quant. alg.: Grover-based optimization

    oraculization of Environment  |  oraculization of Agent-Env. interaction

    oracle encoding properties of environment  ~  oracle encoding performance of Agent-Environment interaction

  • Quantum folklore:

    quadratic improvements a-plenty, super-polynomial hard to come by

    What about reinforcement learning?

  • Third result: absolute separation of classical & quantum learning

    RL is VERY BIG. Trivial answers aplenty

    Given an oracular problem, construct MDPs which:


    1) have “generic RL properties”: 
delayed rewards, genuine actions, 
controlled amount of degeneracy in optimal policies (stochasticity)


    2) Are hard to learn for any classical agent


    3) Oraculization possible, and reduces to oracles which allow 
superpolynomial speed-ups

  • super-polynomial improvements possible for a class of genuine RL task environments via reduction to (generalized) Recursive Fourier Sampling

    Complicated, and genuine RL problem

    oraculization process

    RFS-type oracle hiding a necessary “key”

    super-polynomial separations known

    V. Dunjko, Y. Liu, X. Wu, J. M. Taylor Super-polynomial separations for quantum reinforcement learning (in preparation)

  • super-polynomial improvements possible for a class of genuine RL task environments via reduction to (generalized) Recursive Fourier Sampling

    Complicated, and genuine RL problem

    oraculization process

    RFS-type oracle hiding a necessary “key”

    super-polynomial separations known

    V. Dunjko, Y. Liu, X. Wu, J. M. Taylor Super-polynomial separations for quantum reinforcement learning (in preparation)

    can be adapted to e.g. Simon’s problem: strict exponential separation possible!
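    For reference, the Simon separation invoked in the last line is the standard query-complexity statement (quoted, not derived here):

    % Simon's problem: f : {0,1}^n -> {0,1}^n with f(x) = f(y) iff y \in {x, x \oplus s}; find s.
    \[
      Q(\mathrm{Simon}) = O(n), \qquad R(\mathrm{Simon}) = \Omega\bigl(2^{n/2}\bigr),
    \]

    i.e. O(n) quantum queries suffice, while any classical bounded-error algorithm needs exponentially many.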

  • quantum upgrade of a learning model  |  absolute separation of classical & quantum learning

    “natural” environments: mazes  |  start with the solution: oracle problems with separation

    oraculization to most natural info: rewarding sequences  |  reverse-engineered matching environments

    Grover is (almost always) applicable: large “natural” class of settings where improvements possible  |  identify generalizations/deformations which a) maintain classical hardness b) allow for meaningful oraculization and application of known q. algorithms

    (shhh: call those above “the generic ones”!)

  • 47

    Perspectives of (this type of) Quantum Reinforcement learning 


    QAE: fundamental questions of learnability 
Quantum AI and quantum Turing test (behavioral intelligence)

    QRL-enhancements (theory): connections to oracular models. 
Elitzur-Vaidman & counterfactual computation. Q. Metrology, q. illumination.

    Making it concrete: connection to QML!

    QIP overlaps: generalization of standard oracular models

    QRL-enhancements: 
Minimal oraculizations. 
Quantum agents in quantum systems (quantum experiments, QQ ML).

Model-based agents. Quantum behavioral databases.

  • 48

    The end