Transcript of slides: (Advances in) Quantum Reinforcement Learning, QTML 2017 (qtml2017.di.univr.it)
-
(Advances in) Quantum Reinforcement Learning
Vedran Dunjko [email protected]
2017
-
Acknowledgements:
Briegel, Makmal, Melnikov, Friis, Tiersch, Taylor, Poulsen Nautrup, Orsucci, Liu, Wu, Paparo, Martin-Delgado, Piater, Hangl, Boyajian, Krenn, Zeilinger
(theoretical physics; intelligent & interactive systems)
-
Part 1: Intro: machine learning, reinforcement learning (and AI), quantum machine learning
Part 2: (a perspective on) Quantum Reinforcement Learning
Quantum-enhanced agents and quantum-accessible environments; quantum enhancements of learning agents: oraculization, and three flavours of improvements
-
Part 1: Quantum information and machine learning
-
Quantum machine learning (plus)
QIP→ML (quantum-enhanced ML) [’94; ‘12]
ML→QIP (ML applied to QIP problems) [70’s?; ’10]
QIP↭ML (generalizations of learning concepts) [’00?; ’06]
ML-inspired QIP; QIP/QM-inspired ML; beyond (Q. AI)?
-
But… what is Machine Learning?
-
Supervised learning: learning P(labels|data), given samples from P(data, labels)
Unsupervised learning: learning structure in P(data), given samples from P(data)
Mostly about inferring information from data (about the source): medicine, stock markets, climate/weather
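These two definitions can be illustrated with a toy sketch in pure Python (hypothetical 1-D data; a nearest-centroid rule stands in for supervised learning, and a crude largest-gap split for unsupervised clustering):

```python
# Supervised learning fits P(label | x) from labelled samples;
# unsupervised learning finds structure in P(x) alone.
# The data below is invented for illustration.

labelled = [(0.9, "A"), (1.1, "A"), (1.0, "A"), (4.8, "B"), (5.2, "B"), (5.0, "B")]
unlabelled = [x for x, _ in labelled]

# Supervised: nearest-centroid classifier built from (x, label) samples.
def centroids(samples):
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def classify(x, cents):
    return min(cents, key=lambda y: abs(x - cents[y]))

cents = centroids(labelled)
print(classify(4.5, cents))  # "B"

# Unsupervised: split the unlabelled points at the largest gap.
# No labels are used anywhere here.
xs = sorted(unlabelled)
gaps = [xs[i + 1] - xs[i] for i in range(len(xs) - 1)]
cut = xs[gaps.index(max(gaps)) + 1]
clusters = ([x for x in xs if x < cut], [x for x in xs if x >= cut])
print(clusters)
```

The same dataset, with and without labels, recovers the same two groups, which is the whole point of the distinction.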
Who cares?
Citation from https://www.entrepreneur.com/article/287324
ML/Data mining in the spotlight
industry 4.0
“This Is the Year of the Machine Learning Revolution. To predict the future, we look at the data we have about the past.”
-
Figures taken from Wikipedia
-
Reinforcement learning in the spotlight
Figures taken from Wikipedia
MIT technology review: one of “10 breakthrough technologies in 2017”
The New Stack: “AI’s Next Big Step: Reinforcement learning”
E.g. AlphaGo Zero: superhuman Go performance with no human supervision: pure RL self-play
-
Towards AI?
Supervised learning, knowledge representation, OCR & vision, reasoning & inference, reinforcement learning
-
What do agents learn?
(PO) Markov Decision Process
Corner for the serious
-
RL glossary
Figures of merit: finite-horizon, infinite-horizon; history of interaction
Learning model/agent; no-free-lunch: metaparameters; computational complexity; “model complexity”
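As a concrete instance of these notions, here is a minimal sketch of tabular Q-learning on a toy 5-state chain MDP (the MDP, parameter values, and optimistic initialisation are illustrative choices, not from the talk); the finite-horizon episodes and the learned greedy policy correspond to the glossary items above:

```python
import random

random.seed(0)

# Toy MDP: a chain of 5 states; action 0 moves left, 1 moves right.
# Reaching the rightmost state (the goal) yields reward 1 and ends the episode.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

# Tabular Q-learning with finite-horizon episodes and epsilon-greedy
# exploration; optimistic initialisation drives early exploration.
Q = [[1.0, 1.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(500):
    s = 0
    for t in range(20):  # finite horizon
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a: Q[s][a])
        s2, r, done = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if done:
            break

policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(policy)  # greedy action per non-goal state converges to 1 ("right")
```

The metaparameters (alpha, gamma, eps, horizon) are exactly the no-free-lunch knobs the glossary refers to: different environments favour different settings.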
-
What do agents learn?
(PO) Markov Decision Process
Corner for the serious
RL vs. (data-driven) ML
Learning behavior vs. learning about data
In RL, the agent changes the environment
In ML, the agent does not change P(x|y)
-
What […] Quantum Reinforcement Learning
RL in QIP/QM (natural, related to feedback and control)
Learning for and in quantum experiments; ML building a QC?
Quantum-enhanced ML: Q. enhancements for specific algorithms
Q. enhancements via q. interaction (quantum computational learning theory)
Q. generalizations
QRL in Q. systems; enabling QRL?
-
Part 2: Quantum information and reinforcement learning
-
XV. SYMBOLS FOR TALK
S = {s1, s2, . . .}
A = {a1, a2, . . .}
a_k, a_{k+1}, s_k, s_{k+1}
XVI. INTRO
The clear motivation for considering quantum internal processes in an agent is to see whether such an agent can be more successful, in any sense, than a fully classical agent. By construction, a projective-simulation-based agent realizes its input-output relations by performing a random walk, that is, a Markovian process, over the clips. It is very tempting to see whether obvious generic improvements can be obtained if this walk is performed in a quantum way. This is also statable as a search problem, if the walk indeed performs a search; here the question is ‘what do we look for?’. The closest result to a ‘generic improvement theorem’ is given via the (generalized) framework of Szegedy, which relates the eigenvalues of the transition matrix of an (ergodic, symmetric; the latter assumption is dropped later) Markov chain to the eigenvalues of a corresponding ‘quantum walk unitary’ defined in his formalism. This relationship can be readily applied to the known results for the classical hitting time of a subset of nodes (the marked nodes) of the Markov chain, which is in O(1/(δε)), with δ the spectral gap of the transition matrix and ε the fraction of marked nodes among all nodes, and to the analogous quantum hitting time, which scales as O(1/√(δε)). This yields a type of generic improvement.
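The quadratic gap between the two scalings can be made concrete numerically (a sketch: constants are dropped, and the example values of δ and ε are arbitrary):

```python
import math

# Generic hitting-time scalings from the Szegedy correspondence
# (constants dropped): classical ~ 1/(delta*eps), quantum ~ 1/sqrt(delta*eps),
# with delta the spectral gap and eps the fraction of marked nodes.
def classical_hitting(delta, eps):
    return 1.0 / (delta * eps)

def quantum_hitting(delta, eps):
    return 1.0 / math.sqrt(delta * eps)

# Example values (arbitrary): gap 1e-3, 1% of nodes marked.
delta, eps = 1e-3, 1e-2
print(classical_hitting(delta, eps))  # ~1e5 walk steps
print(quantum_hitting(delta, eps))    # ~3.2e2 steps: the quadratic saving
```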
Motivation story for the talk: Sure, getting an exponential speedup is much cooler than getting a quadratic one. But the problems we address concern agents, which are abstractions of entities that exist, interact with an environment, and do so in real time. So let us consider an agent in a setting where reaction time matters, and see what happens if we slow the agent down ‘only’ quadratically. We will do this by solving a cartoon Fermi problem. Consider a particular agent: a mouse. A mouse can find itself in tricky situations, like being stalked by a cat, and the moment it notices the cat it must respond as fast as it can. A non-super-athletic mouse can escape a cat’s attack if it spots the charging cat from a distance of, say, 1 m; if it spots the cat later, it is a goner. Given that your average charging cat runs at 10 m/s, this implies the mouse’s deliberation time must be 0.1 s. Now suppose the mouse were quadratically slower; let us give meaning to this assumption. A mouse has 10^11 synapses and, say, 100 km of total axon length in its brain, giving an average effective axon length of 10^-6 m. A neural pulse travels at roughly 50 m/s on average; take a lower bound of 10 m/s, so in one second the mouse’s pulses can traverse 10^7 average axonic lengths. Since its deliberation time is 0.1 s, in one deliberation the mouse’s neurons traverse 10^6 axonic lengths. A quadratically slower mouse would have to traverse 10^12 axonic lengths to ‘jump into action’, which takes 10^12 × 10^-7 s = 10^5 s. The average cat can travel 1000 km in that time: if the mouse is in the centre of Innsbruck and spots a charging cat in Bolzano, or even in Salzburg, it is a goner, train pass or not.
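Redoing the Fermi arithmetic explicitly, at the 10 m/s pulse-speed lower bound (all inputs are the order-of-magnitude guesses from the text):

```python
# Fermi estimate of the 'quadratically slower mouse'.
synapses = 1e11        # synapses in a mouse brain
axon_total = 100e3     # total axon length, metres (100 km)
pulse_speed = 10.0     # neural pulse speed lower bound, m/s
deliberation = 0.1     # normal deliberation time, s
cat_speed = 10.0       # charging cat, m/s

axon_len = axon_total / synapses                # average axon length: 1e-6 m
lengths_per_s = pulse_speed / axon_len          # axonic lengths per second: 1e7
lengths_normal = lengths_per_s * deliberation   # lengths per deliberation: 1e6
lengths_slow = lengths_normal ** 2              # quadratically slower: 1e12
t_slow = lengths_slow / lengths_per_s           # slowed deliberation: 1e5 s
cat_km = cat_speed * t_slow / 1e3               # cat's head start: 1000 km

print(t_slow, cat_km)
```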
XVII. DIRECT QUANTIZATION OF THE ‘SIMPLE PS’ APPROACH
A. The need for a ‘fair comparison’ of agents
As mentioned, one of the central ideas of this project is to use quantum-walk-based approaches to produce ‘better’ agents than is classically possible. Since quantum walks have different, and sometimes clearly better, properties than classical walks, this should be possible. In particular, there seems to be a generic speedup of hitting times in quantized versions of Markov chains, which is attractive. A noted criticism is that even if one produces an agent which ‘walks faster’, this does not imply ‘faster learning’ in terms of ‘external time’: if the agent has all the time in the world to produce its response to the environment, then such a generic speedup is not useful. However, this can be remedied by considering ‘active learning’, a process in which external and internal
[Figure: Agent and Environment interacting, each classical (C) or quantum (Q)]
V. Dunjko, J. M. Taylor, H. J. Briegel, Quantum-enhanced learning, in preparation (2016)
V. Dunjko, J. M. Taylor, H. J. Briegel, Framework for learning agents in quantum environments, arXiv:1507.08482 (2015)
…
Want something like: a Quantum Agent-Environment paradigm?
-
Flavors of RL… what are agents, environments?
CS: MDPs, POMDPs, Turing machines
Robotics: real environments
QIP/QML: ?…systems
Image: Sebastian Thrun & Chris Urmson/Google
-
Quantum Agent-Environment paradigm
Agents (environments) are sequences of CPTP maps acting on a private and a common register, the memory and the interface, respectively. Memory channels = combs = quantum strategies.
[Figure: the agent-environment interaction is equivalent to a network of such maps]
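A minimal numerical sketch of the paradigm, with unitary maps standing in for general CPTP maps (a toy, not the paper’s construction): the agent and environment act alternately, each touching its own private memory qubit and the shared interface qubit:

```python
import numpy as np

# Registers: [agent memory, interface, environment memory], one qubit each.
# Agent's map acts on qubits (0, 1); environment's map acts on qubits (1, 2).

def kron(*ops):
    out = np.array([[1.0 + 0j]])
    for op in ops:
        out = np.kron(out, op)
    return out

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

psi = np.zeros(8, dtype=complex)
psi[0] = 1.0  # |000>

# Agent's turn: Hadamard its memory, then copy it onto the interface.
U_agent = CNOT @ kron(H, I2)   # acts on (agent memory, interface)
psi = kron(U_agent, I2) @ psi

# Environment's turn: copy the interface onto its own memory.
U_env = CNOT                   # acts on (interface, environment memory)
psi = kron(I2, U_env) @ psi

# The three registers end up in a GHZ state (|000> + |111>)/sqrt(2):
# agent, interface and environment are correlated through the interaction.
print(np.round(psi.real, 3))
```

General CPTP maps would act on density matrices; restricting to unitaries keeps the sketch to state vectors while preserving the alternating memory/interface structure.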
-
But… why go there?
Fundamental meaning and role of learning in the quantum world
Speed-ups! “faster”, “better” learning
What can we make better? a) computational complexity b) learning efficiency (“genuine learning-related figures of merit”)
[Figure: learning curve, success probability vs. time-steps; related to query complexity]
-
Classical (in a formal, well-defined sense):
• All improvements are consequences of computational improvements (intuition: “cannot run Grover’s search on a classical phonebook”)
Not classical:
• More radical computational improvements
• More interesting quantum effects may be exploitable
Oracular quantum computation: exponential separations in principle possible
V. Dunjko, J. M. Taylor, H. J. Briegel Framework for learning agents in quantum environments arXiv:1507.08482 (2015)
V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)
Quantum Agent - Environment paradigm
-
V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)
Environment
2
XV. SYMBOLS FOR TALK
S = {s1, s2, . . .}
A = {a1, a2, . . .}
ak
, ak+1, sk, s
XVI. INTRO
The clear motivation for considering quantum internal processes in a an agent is to see whether such agent canbe more successful, in any sense, than a fully classical agent. By construction, a projective-simulation based agentrealizes its input-output relations by performing a random walk, that is a Markovian process, over the clips. It isvery tempting to see whether obvious generic improvements can be obtained if the particular walk is performedin a quantum way. This is also statable in terms of a search problem, if the walk indeed performs a search. Here,the question is ’what do we look for’? The closest result to a ’generic improvement theorem’ is given via the(generalized) framework of Szegedy, which relates the eigenvalues of a transition matrix of an (egodic, symmetric- this is dropped later) Markov chain, and the eigenvalues of a corresponding ’quantum walk unitary’ which isdefined in his formalism.The found relationship between the two can be readily applied on the known results for the classical hitting
time of some subset of the nodes (the marked nodes) of the Markov chain (which has expected hitting time inO(1/�✏), with � being the spectral gap of the transition matrix, and ✏ the fraction of marked nodes vs. all nodes),and the analogous quantum hitting time which scales as O(1/
p�✏) - this yields a type of generic improvement.
Motivation story for the talk: Sure, getting an exponential speedup is much cooler than getting a quadratic one.In the problems we address we talk about agents, which are abstractions of entities which exist, and interact withan environment. And they do so in real time. So let us consider an agent, and a setting where reaction time isimportant, and see what happens if we, ’only’ quadratically, slow down the agent. We will do this by solving acartoon Fermi problem. Consider a particular agent: a mouse. A mouse can find itself in tricky situations, likebeing stalked by a cat. And the moment it notices a cat it must respond as fast as it can. A non-super-athleticmouse can escape a cats attack if he spots the charging cat from a distance of say 1m. If he spots the cat later,he’s a goner. Given that your average charging cat runs at 10m/s, this implies the mouses deliberation time mustbe 0.1s. Suppose the mouse was quadratically slower. Let us give meaning to this assumption. A mouse has 1011
synapses, and say 100km of total axon length in its brain. This gives the average e↵ective axon length of 10�6m. Aneural pulse travels at, on average, at 50m/s, say a lower bound of 10m/s, which yields that in a second a mouse’spulses can traverse 107 average axonic lengths. Since his deliberation time is 0.1s, in deliberation, the mouse’sneurons traverse 106 axonic lengths. A quadratically slower mouse will have to traverse 1012 axonic lengths to’jump into action’. The time it takes him to do this is 1012⇥10�8 = 104s. Well, the average cat can travel 100 kmin this time, meaning, if the mouse is in the center of Innsbruck, and he spots a charging cat which is in Bolzano,he is a goner! If the cat has a train pass, the mouse can spot the cat in Salzburg while the cat is boarding thetrain to Innsbruck, and still the cat will munch him.
XVII. DIRECT QUANTIZATION OF THE ’SIMPLE PS’ APPROACH
A. The need for a ’fair comparison’ of agents
As mentioned, one of the central ideas of this project is to use quantum walk-base approaches to produce ’better’agents than classically possible. Since quantum walks have di↵erent, and sometimes clearly better properties thanclassical walks, this should be possible. In particular, there seems to be a generic speedup of hitting times inquantized versions of Markov chains. And this is attractive. A noted criticism is that even if one produces anagent which ’walks faster’ this does not incur ’faster learning’ times in terms of ’external time’ - that is, if theagent has all the time in the world to produce his response to the environment, then such a generic speedup is notuseful. However, this can be remedied by considering ’active learning’ - a process in which external and internal
2
XV. SYMBOLS FOR TALK
S = {s1, s2, . . .}
A = {a1, a2, . . .}
ak
, ak+1, sk, s
XVI. INTRO
The clear motivation for considering quantum internal processes in a an agent is to see whether such agent canbe more successful, in any sense, than a fully classical agent. By construction, a projective-simulation based agentrealizes its input-output relations by performing a random walk, that is a Markovian process, over the clips. It isvery tempting to see whether obvious generic improvements can be obtained if the particular walk is performedin a quantum way. This is also statable in terms of a search problem, if the walk indeed performs a search. Here,the question is ’what do we look for’? The closest result to a ’generic improvement theorem’ is given via the(generalized) framework of Szegedy, which relates the eigenvalues of a transition matrix of an (egodic, symmetric- this is dropped later) Markov chain, and the eigenvalues of a corresponding ’quantum walk unitary’ which isdefined in his formalism.The found relationship between the two can be readily applied on the known results for the classical hitting
time of some subset of the nodes (the marked nodes) of the Markov chain (which has expected hitting time inO(1/�✏), with � being the spectral gap of the transition matrix, and ✏ the fraction of marked nodes vs. all nodes),and the analogous quantum hitting time which scales as O(1/
p�✏) - this yields a type of generic improvement.
Motivation story for the talk: Sure, getting an exponential speedup is much cooler than getting a quadratic one.In the problems we address we talk about agents, which are abstractions of entities which exist, and interact withan environment. And they do so in real time. So let us consider an agent, and a setting where reaction time isimportant, and see what happens if we, ’only’ quadratically, slow down the agent. We will do this by solving acartoon Fermi problem. Consider a particular agent: a mouse. A mouse can find itself in tricky situations, likebeing stalked by a cat. And the moment it notices a cat it must respond as fast as it can. A non-super-athleticmouse can escape a cats attack if he spots the charging cat from a distance of say 1m. If he spots the cat later,he’s a goner. Given that your average charging cat runs at 10m/s, this implies the mouses deliberation time mustbe 0.1s. Suppose the mouse was quadratically slower. Let us give meaning to this assumption. A mouse has 1011
synapses, and say 100km of total axon length in its brain. This gives the average e↵ective axon length of 10�6m. Aneural pulse travels at, on average, at 50m/s, say a lower bound of 10m/s, which yields that in a second a mouse’spulses can traverse 107 average axonic lengths. Since his deliberation time is 0.1s, in deliberation, the mouse’sneurons traverse 106 axonic lengths. A quadratically slower mouse will have to traverse 1012 axonic lengths to’jump into action’. The time it takes him to do this is 1012⇥10�8 = 104s. Well, the average cat can travel 100 kmin this time, meaning, if the mouse is in the center of Innsbruck, and he spots a charging cat which is in Bolzano,he is a goner! If the cat has a train pass, the mouse can spot the cat in Salzburg while the cat is boarding thetrain to Innsbruck, and still the cat will munch him.
XVII. DIRECT QUANTIZATION OF THE ’SIMPLE PS’ APPROACH
A. The need for a ’fair comparison’ of agents
As mentioned, one of the central ideas of this project is to use quantum walk-base approaches to produce ’better’agents than classically possible. Since quantum walks have di↵erent, and sometimes clearly better properties thanclassical walks, this should be possible. In particular, there seems to be a generic speedup of hitting times inquantized versions of Markov chains. And this is attractive. A noted criticism is that even if one produces anagent which ’walks faster’ this does not incur ’faster learning’ times in terms of ’external time’ - that is, if theagent has all the time in the world to produce his response to the environment, then such a generic speedup is notuseful. However, this can be remedied by considering ’active learning’ - a process in which external and internal
2
XV. SYMBOLS FOR TALK
S = {s1, s2, . . .}
A = {a1, a2, . . .}
ak
, ak+1, sk, s
XVI. INTRO
The clear motivation for considering quantum internal processes in a an agent is to see whether such agent canbe more successful, in any sense, than a fully classical agent. By construction, a projective-simulation based agentrealizes its input-output relations by performing a random walk, that is a Markovian process, over the clips. It isvery tempting to see whether obvious generic improvements can be obtained if the particular walk is performedin a quantum way. This is also statable in terms of a search problem, if the walk indeed performs a search. Here,the question is ’what do we look for’? The closest result to a ’generic improvement theorem’ is given via the(generalized) framework of Szegedy, which relates the eigenvalues of a transition matrix of an (egodic, symmetric- this is dropped later) Markov chain, and the eigenvalues of a corresponding ’quantum walk unitary’ which isdefined in his formalism.The found relationship between the two can be readily applied on the known results for the classical hitting
time of some subset of the nodes (the marked nodes) of the Markov chain (which has expected hitting time in O(1/(δε)), with δ being the spectral gap of the transition matrix and ε the fraction of marked nodes vs. all nodes), and the analogous quantum hitting time, which scales as O(1/√(δε)) - this yields a type of generic improvement.
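As a concrete numerical illustration of the two scalings above, the sketch below computes the spectral gap of a transition matrix and compares the classical O(1/(δε)) and quantum O(1/√(δε)) hitting-time bounds. The choice of chain (a walk on the complete graph K_n) and all parameter values are illustrative, not from the talk:

```python
import numpy as np

def spectral_gap(P):
    """delta = 1 - |lambda_2| for a transition matrix P."""
    eigs = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - eigs[1]

def hitting_bounds(P, n_marked, n_total):
    """Classical O(1/(delta*eps)) vs. quantum O(1/sqrt(delta*eps)) bounds."""
    delta, eps = spectral_gap(P), n_marked / n_total
    return 1.0 / (delta * eps), 1.0 / np.sqrt(delta * eps)

# Illustrative Markov chain: random walk on the complete graph K_n.
n = 64
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)
classical, quantum = hitting_bounds(P, n_marked=1, n_total=n)
# the quantum bound is the square root of the classical one
```

For K_n the spectral gap is large (δ = 1 - 1/(n-1)), so the gain here comes mostly from the ε factor; for chains with small spectral gaps (e.g. walks on long paths) the quadratic improvement in δ dominates.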
Motivation story for the talk: Sure, getting an exponential speedup is much cooler than getting a quadratic one. But in the problems we address we talk about agents, which are abstractions of entities which exist and interact with an environment - and they do so in real time. So let us consider an agent, and a setting where reaction time is important, and see what happens if we, 'only' quadratically, slow down the agent. We will do this by solving a cartoon Fermi problem. Consider a particular agent: a mouse. A mouse can find itself in tricky situations, like being stalked by a cat, and the moment it notices a cat it must respond as fast as it can. A non-super-athletic mouse can escape a cat's attack if it spots the charging cat from a distance of, say, 1 m. If it spots the cat later, it's a goner. Given that your average charging cat runs at 10 m/s, this implies the mouse's deliberation time must be 0.1 s. Suppose the mouse were quadratically slower. Let us give meaning to this assumption. A mouse has 10^11
synapses, and say 100 km of total axon length in its brain. This gives an average effective axon length of 10^-6 m. A neural pulse travels, on average, at 50 m/s - take a lower bound of 10 m/s - which yields that in a second a mouse's pulses can traverse 10^7 average axonic lengths. Since its deliberation time is 0.1 s, in deliberation the mouse's neurons traverse 10^6 axonic lengths. A quadratically slower mouse will have to traverse 10^12 axonic lengths to 'jump into action'. The time it takes to do this is 10^12 × 10^-8 = 10^4 s. Well, the average cat can travel 100 km in this time, meaning that if the mouse is in the center of Innsbruck and it spots a charging cat which is in Bolzano, it is a goner! If the cat has a train pass, the mouse can spot the cat in Salzburg while the cat is boarding the train to Innsbruck, and still the cat will munch it.
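The arithmetic of this cartoon Fermi estimate can be checked in a few lines. Note that the talk mixes bounds deliberately: the 10 m/s lower bound on pulse speed sets the traversal rate, while the final per-length time of 10^-8 s corresponds to roughly the 50 m/s average speed:

```python
# Cartoon Fermi estimate: a quadratically slower mouse vs. a charging cat.
axon_avg_m = 100e3 / 1e11           # 100 km of axon over 1e11 synapses -> 1e-6 m

lengths_per_s = 10.0 / axon_avg_m   # 1e7 lengths/s at the 10 m/s lower bound
lengths_fast = 0.1 * lengths_per_s  # 1e6 lengths in the 0.1 s deliberation

lengths_slow = lengths_fast ** 2    # quadratic slowdown: 1e12 lengths
time_per_length = 1e-8              # s/length, using the ~50 m/s average speed
slow_time_s = lengths_slow * time_per_length  # 1e4 s to 'jump into action'

cat_km = 10.0 * slow_time_s / 1000.0          # 100 km: roughly Bolzano-Innsbruck
```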
XVII. DIRECT QUANTIZATION OF THE ’SIMPLE PS’ APPROACH
A. The need for a ’fair comparison’ of agents
(Figure: Agent-Environment interaction, in classical and quantum (Q) variants)
-
Inspiration from oracular quantum computation…
Agent-like Environment-like
think of Environment as oracle
-
E ∈ “Large” set of environments
If E is chosen at random, on average all agents perform equally (NFL)
How could one use that… E.g. oracle identification!
-
(Figure: agents A_1, A_2, A_3, ..., A_n ∈ A, with a meta-agent choosing among them)
think of Environment as oracle: identify faster using QC
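The simplest instance of "identify the oracle faster using QC" is Deutsch's problem: deciding whether f: {0,1} → {0,1} is constant or balanced takes two classical queries but only one quantum query. A minimal statevector sketch (my illustration, not from the talk):

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

def phase_oracle(f):
    """Phase oracle U_f |x> = (-1)^f(x) |x>."""
    return np.diag([(-1.0) ** f(0), (-1.0) ** f(1)])

def deutsch(f):
    """One oracle query decides constant vs. balanced."""
    state = H @ phase_oracle(f) @ H @ np.array([1.0, 0.0])
    # all amplitude stays on |0> iff f(0) == f(1)
    return "constant" if abs(state[0]) > 0.5 else "balanced"
```

Measuring the single qubit reveals the oracle class with certainty, whereas classically any single query leaves both classes possible.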
-
But… environments are not like standard oracles…
“Oraculization” (taming the open environment)
(blocking, accessing purification and recycling)
-
Oraculization (blocking) (taming the open environment)
(Figure, steps 1-3: quantum comb, causal network, “blocking”)
-
Oraculization (recovery and recycling) (taming the open environment)
Classically specified oracle f → “quantization”
-
4)
Oraculization (continued) (taming the open environment)
Relevant information about (aspect of) environment accessible encoded in effective quantum oracle
-
RL:
Oraculization in data-driven ML?
-
Result schemata: for every A we construct Aq s.t. Aq > A…
find useful info using quantum access
optimize a (given) learning mechanism
compare to given (best) classical agent
-
First result: quantum upgrade of a learning model
V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)
polynomial improvements for a broad class of task environments by using Grover-like amplification
1. Grover finds “rewarding sequences”
2. Algorithm is “pre-trained” to focus more on rewarding space of policies
3. Works whenever algorithm makes sense, and environment s.t. it prefers winners
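A minimal statevector sketch of the Grover step (point 1 above): the marked index stands in for a rewarding action sequence, and the phase-flip oracle stands in for the oraculized environment. The register size and the marked index are illustrative:

```python
import numpy as np

def grover_amplify(n_qubits, marked):
    """Amplify the amplitude of the single 'rewarding' basis state."""
    N = 2 ** n_qubits
    state = np.full(N, 1.0 / np.sqrt(N))      # uniform over action sequences
    for _ in range(int(np.floor(np.pi / 4 * np.sqrt(N)))):
        state[marked] *= -1.0                 # oracle: flip the winner's phase
        state = 2.0 * state.mean() - state    # inversion about the mean
    return np.abs(state) ** 2                 # measurement probabilities

probs = grover_amplify(n_qubits=6, marked=13)  # 13: illustrative winner
```

Here `2 * state.mean() - state` implements the diffusion operator 2|ψ⟩⟨ψ| - I for the uniform |ψ⟩; after ~(π/4)√N iterations the rewarding sequence is measured with probability close to 1.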
-
Second result: quantum speedup of meta-learning
V. Dunjko, J. M. Taylor, H. J. Briegel Advances in Quantum Reinforcement Learning IEEE SMC 2017
polynomial improvements for meta-learning settings by using quantum optimization methods
given a fixed model and environment, the joint AE interaction can be oraculized
model parameters can be “quantum-optimized” in (any) given quantum-accessible environment
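A sketch of the oraculized-interaction idea under stated assumptions: a toy interaction whose reward depends on a 6-bit model parameter theta is turned into the boolean test "does theta achieve reward ≥ threshold?", which Grover then amplifies over the parameter space. The reward function, the secret optimum, and the threshold are all illustrative:

```python
import numpy as np

def interaction_reward(theta, best=0b101101):
    """Toy agent-environment interaction: reward peaks at theta == best."""
    return 6 - bin(theta ^ best).count("1")

N = 64
# Oraculization of the joint interaction: mark parameters beating a threshold.
marks = np.array([interaction_reward(t) >= 6 for t in range(N)])

# Grover-amplify over the N parameter settings (a single mark here).
state = np.full(N, 1.0 / np.sqrt(N))
for _ in range(int(np.pi / 4 * np.sqrt(N / marks.sum()))):
    state = np.where(marks, -state, state)    # oracle phase flip
    state = 2.0 * state.mean() - state        # inversion about the mean
probs = state ** 2
```

Lowering the threshold step by step turns this into a Grover-style maximization over model parameters, which is the flavour of quantum optimization the result uses.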
-
quantum upgrade of a learning model vs. quantum speedup of meta-learning
underlying quant. alg.: Grover-based optimization (in both cases)
oraculization of Environment vs. oraculization of Agent-Env. interaction
oracle encoding properties of environment ~ oracle encoding performance of Agent-Environment interaction
-
Quantum folklore:
quadratic improvements a-plenty, super-polynomial hard to come by
What about reinforcement learning?
-
Third result: absolute separation of classical & quantum learning
RL is VERY BIG. Trivial answers aplenty
Given an oracular problem, construct MDPs which
1) have “generic RL properties”: delayed rewards, genuine actions, a controlled amount of degeneracy in optimal policies (stochasticity)
2) Are hard to learn for any classical agent
3) Oraculization possible, and reduces to oracles which allow superpolynomial speed-ups
-
super-polynomial improvements possible for a class of genuine RL task environments via reduction to (generalized) Recursive Fourier Sampling
Complicated, and genuine RL problem
oraculization process
RFS-type oracle hiding a necessary “key”
super-polynomial separations known
V. Dunjko, Y. Liu, X. Wu, J. M. Taylor Super-polynomial separations for quantum reinforcement learning (in preparation)
can be adapted to e.g. Simon’s problem: strict exponential separation possible!
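To see why a Simon-type oracle yields the strict exponential separation mentioned above: the quantum circuit only ever outputs strings y orthogonal (mod 2) to the hidden string s, so O(n) samples pin down s, while classical strategies need on the order of 2^(n/2) queries. An exact small-case calculation (the hidden string and size are illustrative):

```python
import numpy as np

n, s = 3, 0b101                         # illustrative hidden string s
N = 2 ** n
labels = {}
def f(x):                               # 2-to-1 function with f(x) = f(x ^ s)
    return labels.setdefault(min(x, x ^ s), len(labels))

fv = [f(x) for x in range(N)]
# Exact amplitudes of the input register after H^n, U_f, H^n on |0...0>:
amp = np.zeros((N, len(labels)))
for y in range(N):
    for x in range(N):
        amp[y, fv[x]] += (-1) ** bin(x & y).count("1") / N
p_y = (amp ** 2).sum(axis=1)            # distribution of the measured y

# p_y is supported exactly on {y : y . s = 0 mod 2}
```

The amplitudes of the two preimages x and x ^ s interfere destructively whenever y · s = 1, which is why only orthogonal outcomes survive.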
-
quantum upgrade of a learning model vs. absolute separation of c & q learning
“natural” environments: mazes vs. start with the solution: oracle problems with known separations
oraculization to the most natural info: rewarding sequences vs. reverse-engineered matching environments
Grover is (almost always) applicable: a large “natural” class of settings where improvements are possible vs. identify generalizations/deformations which a) maintain classical hardness b) allow for meaningful oraculization and application of known q. algorithms
(shhh: call those above “the generic ones”!)
-
Perspectives of (this type of) Quantum Reinforcement learning
QAE: fundamental questions of learnability
Quantum AI and quantum Turing test (behavioral intelligence)
QRL-enhancements (theory): connections to oracular models. Elitzur-Vaidman & counterfactual computation. Q. Metrology, q. illumination.
Making it concrete: connection to QML!
QIP overlaps: generalization of standard oracular models
QRL-enhancements: Minimal oraculizations. Quantum agents in quantum systems (quantum experiments, QQ ML). Model-based agents. Quantum behavioral databases.
-
The end