(Advances in) Quantum Reinforcement Learning
Vedran Dunjko, [email protected], 2017
Transcript of the slides: qtml2017.di.univr.it/resources/Slides/Progress-in...

  • (Advances in) Quantum Reinforcement Learning

    Vedran Dunjko
    [email protected]

    2017

  • Acknowledgements:

    Briegel, Makmal, Melnikov, Friis, Tiersch, Taylor, Poulsen Nautrup, Orsucci, Liu, Wu, Paparo, Martin-Delgado, Piater, Hangl, Boyajian, Krenn, Zeilinger

    (theoretical physics; intelligent & interactive systems)

  • Part 1: Intro: Machine learning, Reinforcement learning (and AI), Quantum machine learning

    Part 2: (a perspective on) Quantum Reinforcement Learning

    Quantum-enhanced agents and quantum-accessible environments; Quantum enhancements of learning agents:
oraculization, and three flavours of improvements

  • Part 1: Quantum information and machine learning

  • Quantum machine learning (plus)

    QIP→ML (quantum-enhanced ML) [’94; ‘12]


    ML→QIP (ML applied to QIP problems) [70’s? ’10]


    QIP↭ML (generalizations of learning concepts) [’00? ’06]


    ML-inspired QIP; QIP/QM-inspired ML; beyond (Q. AI)?

  • Learning P(labels|data) given samples from P(data, labels)

    Learning structure in P(data), given samples from P(data)

    mostly about inferring information from data (about the source): medicine, stock markets, climate/weather

    But… what is Machine Learning?

  • Unsupervised learning / Supervised learning

    Learning P(labels|data) given samples from P(data, labels)

    Learning structure in P(data), given samples from P(data)

    mostly about inferring information from data (about the source): medicine, stock markets, climate/weather (a short code sketch of these two modes follows this slide)

    Who cares?

    ML/Data mining in the spotlight

    industry 4.0

    “This Is the Year of the Machine Learning Revolution. To predict the future, we look at the data we have about the past.”

    Citation from https://www.entrepreneur.com/article/287324
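    The supervised/unsupervised distinction above can be made concrete in a few lines. A minimal illustrative sketch (not from the slides; assumes NumPy and scikit-learn are available): a classifier is fit on labelled samples drawn from P(data, labels), while a clustering algorithm looks for structure in the same data with the labels withheld.

    # Minimal sketch (assumption: NumPy + scikit-learn installed); not from the slides.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Samples from P(data, labels): two Gaussian blobs with labels 0 and 1.
    X0 = rng.normal(loc=-2.0, scale=1.0, size=(100, 2))
    X1 = rng.normal(loc=+2.0, scale=1.0, size=(100, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * 100 + [1] * 100)

    # Supervised learning: approximate P(labels | data) from (data, label) pairs.
    clf = LogisticRegression().fit(X, y)
    print("P(label=1 | x=[2,2]) ~", clf.predict_proba([[2.0, 2.0]])[0, 1])

    # Unsupervised learning: look for structure in P(data) alone (labels withheld).
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print("cluster sizes:", np.bincount(clusters))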

  • Figures taken from Wikipedia

  • Figures taken from Wikipedia

  • Reinforcement learning in the spotlight

    Figures taken from Wikipedia

    MIT technology review: 
one of “10 breakthrough technologies in 2017”

    The New Stack: 
“AI’s Next Big Step: Reinforcement learning”

    E.g. AlphaGo Zero: superhuman Go performance with no human supervision: pure RL self-play

  • Towards AI?

    (diagram of AI subfields: supervised learning, reinforcement learning, knowledge representation, OCR & vision, reasoning & inference)

  • What do agents learn?

    (PO) Markov Decision Process

    Corner for the serious

  • RL glossary

    finite-horizon, infinite-horizon, history of interaction

    learning model/agent; no-free-lunch: metaparameters, computational complexity, “model complexity”

    Figures of merit:

  • What do agents learn?

    (PO) Markov Decision Process

    Corner for the serious

    RL vs. (data-driven) ML

    Learning behavior vs. learning about data

    In RL, the agent changes the environment

    In ML the agent does not change P(x|y)
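    The glossary and the agent-environment picture above map onto very little code. A minimal illustrative sketch (not from the slides; the chain MDP, the ε-greedy rule and the Q-learning update are generic textbook choices, not the projective-simulation model): the agent emits actions, the environment returns states and rewards, the (s, a, r) history of interaction accumulates, and total reward within a finite horizon is the figure of merit.

    # Minimal agent-environment interaction loop over a toy MDP (illustrative sketch).
    # States S = {0..4} on a chain, actions A = {left, right}, reward 1 at the right end.
    import random

    N_STATES, ACTIONS = 5, (-1, +1)          # chain MDP; -1 = left, +1 = right
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    alpha, gamma, eps = 0.1, 0.9, 0.1        # metaparameters (no-free-lunch: must be chosen)

    def step(s, a):
        """Environment: deterministic chain dynamics with reward at the right end."""
        s_next = max(0, min(N_STATES - 1, s + a))
        return s_next, (1.0 if s_next == N_STATES - 1 else 0.0)

    history = []                             # history of interaction: (s, a, r) triples
    s = 0
    for t in range(2000):                    # finite-horizon figure of merit: reward within T steps
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next, r = step(s, a)
        # Q-learning update: the agent's behaviour changes in response to the environment.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in ACTIONS) - Q[(s, a)])
        history.append((s, a, r))
        s = 0 if r > 0 else s_next           # restart the episode after a reward

    print("total reward:", sum(r for _, _, r in history))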

  • 16

    What […] Quantum Reinforcement Learning

    RL in QIP/QM (natural, related to feedback and control)

    Learning for and in quantum experiments ML building a QC?

    Quantum-enhanced MLQ. enhancements for specific algorithms

    Q. enhancements via q. interaction
(quantum computational learning theory)


    Q. generalizations

    QRL in Q. systems enabling QRL?

  • 17

    What […] Quantum Reinforcement Learning

    RL in QIP/QM (natural, related to feedback and control)

    Learning for and in quantum experiments ML building a QC?

    Quantum-enhanced MLQ. enhancements for specific algorithms

    Q. enhancements via q. interaction
(quantum computational learning theory)


    Q. generalizations

    QRL in Q. systems enabling QRL?

  • Part 2: Quantum information and reinforcement learning

  • 2

    XV. SYMBOLS FOR TALK

    S = {s1, s2, . . .}

    A = {a1, a2, . . .}

    ak

    , ak+1, sk, s

    XVI. INTRO

    The clear motivation for considering quantum internal processes in a an agent is to see whether such agent canbe more successful, in any sense, than a fully classical agent. By construction, a projective-simulation based agentrealizes its input-output relations by performing a random walk, that is a Markovian process, over the clips. It isvery tempting to see whether obvious generic improvements can be obtained if the particular walk is performedin a quantum way. This is also statable in terms of a search problem, if the walk indeed performs a search. Here,the question is ’what do we look for’? The closest result to a ’generic improvement theorem’ is given via the(generalized) framework of Szegedy, which relates the eigenvalues of a transition matrix of an (egodic, symmetric- this is dropped later) Markov chain, and the eigenvalues of a corresponding ’quantum walk unitary’ which isdefined in his formalism.The found relationship between the two can be readily applied on the known results for the classical hitting

    time of some subset of the nodes (the marked nodes) of the Markov chain (which has expected hitting time inO(1/�✏), with � being the spectral gap of the transition matrix, and ✏ the fraction of marked nodes vs. all nodes),and the analogous quantum hitting time which scales as O(1/

    p�✏) - this yields a type of generic improvement.

    Motivation story for the talk: Sure, getting an exponential speedup is much cooler than getting a quadratic one.In the problems we address we talk about agents, which are abstractions of entities which exist, and interact withan environment. And they do so in real time. So let us consider an agent, and a setting where reaction time isimportant, and see what happens if we, ’only’ quadratically, slow down the agent. We will do this by solving acartoon Fermi problem. Consider a particular agent: a mouse. A mouse can find itself in tricky situations, likebeing stalked by a cat. And the moment it notices a cat it must respond as fast as it can. A non-super-athleticmouse can escape a cats attack if he spots the charging cat from a distance of say 1m. If he spots the cat later,he’s a goner. Given that your average charging cat runs at 10m/s, this implies the mouses deliberation time mustbe 0.1s. Suppose the mouse was quadratically slower. Let us give meaning to this assumption. A mouse has 1011

    synapses, and say 100km of total axon length in its brain. This gives the average e↵ective axon length of 10�6m. Aneural pulse travels at, on average, at 50m/s, say a lower bound of 10m/s, which yields that in a second a mouse’spulses can traverse 107 average axonic lengths. Since his deliberation time is 0.1s, in deliberation, the mouse’sneurons traverse 106 axonic lengths. A quadratically slower mouse will have to traverse 1012 axonic lengths to’jump into action’. The time it takes him to do this is 1012⇥10�8 = 104s. Well, the average cat can travel 100 kmin this time, meaning, if the mouse is in the center of Innsbruck, and he spots a charging cat which is in Bolzano,he is a goner! If the cat has a train pass, the mouse can spot the cat in Salzburg while the cat is boarding thetrain to Innsbruck, and still the cat will munch him.

    XVII. DIRECT QUANTIZATION OF THE ’SIMPLE PS’ APPROACH

    A. The need for a ’fair comparison’ of agents

    As mentioned, one of the central ideas of this project is to use quantum walk-base approaches to produce ’better’agents than classically possible. Since quantum walks have di↵erent, and sometimes clearly better properties thanclassical walks, this should be possible. In particular, there seems to be a generic speedup of hitting times inquantized versions of Markov chains. And this is attractive. A noted criticism is that even if one produces anagent which ’walks faster’ this does not incur ’faster learning’ times in terms of ’external time’ - that is, if theagent has all the time in the world to produce his response to the environment, then such a generic speedup is notuseful. However, this can be remedied by considering ’active learning’ - a process in which external and internal
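    The excerpt's cat-and-mouse arithmetic can be replayed in a couple of lines. The sketch below only reuses the excerpt's own rounded figures (0.1 s reaction budget, 10^6 axonic lengths per deliberation, 10^-8 s per length); it introduces no new estimates.

    # Replays the excerpt's own order-of-magnitude chain (its rounded figures used as-is).
    deliberation = 1.0 / 10.0                  # 1 m spotting distance / 10 m/s cat speed = 0.1 s
    lengths_in_deliberation = 1e6              # axonic lengths traversed while deliberating (excerpt's figure)
    time_per_length = 1e-8                     # seconds per axonic length (excerpt's figure)

    slow_steps = lengths_in_deliberation ** 2  # quadratic slowdown: square the step count -> 1e12
    slow_time = slow_steps * time_per_length   # 1e12 * 1e-8 = 1e4 s
    cat_distance_km = 10.0 * slow_time / 1000  # a 10 m/s cat covers ~100 km in that time

    print(deliberation, slow_time, cat_distance_km)   # 0.1 s, 10000.0 s, 100.0 km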

    Environment


    Agent

    V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced learning In preparation (2016)

    C or Q C or Q

    V. Dunjko, J. M. Taylor, H. J. Briegel Framework for learning agents in quantum environments arXiv:1507.08482 (2015)

    Want something like:

    Quantum Agent - Environment paradigm?

  • Flavors or RL…what are agents, environments?

    CS: MDPs, POMDPs, Turing machines

    Robotics: Real environments

    QIP/QML: ?…systems

    Image: Sebastian Thrun & Chris Urmson/Google

  • is equivalent to

    Agents (environments) are sequences of CPTP maps, acting on a private and a common register - the memory and the interface, respectively. Memory channels = combs = quantum strategies

    Agent

    Envir.

    Quantum Agent - Environment paradigm
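    The comb/strategy statement above can be written compactly. A minimal sketch in standard notation (the register names and map symbols are illustrative, not the slides' own): the agent's CPTP maps act on its private memory register A and the common interface register C, the environment's CPTP maps act on its private register E and C, and a t-step interaction is their alternating composition.

    \[
      \rho_t \;=\;
      \bigl(\mathcal{E}_t \otimes \mathrm{id}_A\bigr)\circ
      \bigl(\mathcal{A}_t \otimes \mathrm{id}_E\bigr)\circ \cdots \circ
      \bigl(\mathcal{E}_1 \otimes \mathrm{id}_A\bigr)\circ
      \bigl(\mathcal{A}_1 \otimes \mathrm{id}_E\bigr)
      \bigl(\rho_0^{A} \otimes \rho_0^{C} \otimes \rho_0^{E}\bigr),
    \]

    where each \mathcal{A}_k (CPTP on registers A, C) is one of the agent's maps and each \mathcal{E}_k (CPTP on registers E, C) is one of the environment's maps; memory channels = combs = quantum strategies.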

  • 22

    But…. why go there?

    Fundamental meaning and role of learning in the quantum world

    Speed-ups! “faster”, “better” learning


    What can we make better?

a) computational complexity b) learning efficiency

  • 23

    But…. why go there?

    Fundamental meaning and role of learning in the quantum world

    Speed-ups! “faster”, “better” learning


    What can we make better?

a) computational complexity b) learning efficiency (“genuine learning-related figures of merit”)

    (plot: success probability vs. time-steps)

    related to query complexity

  • Classical (in a formal,

    well-defined sense)

    Not classical

    • All improvements are 
consequences of 
computational improvements


    intuition: “cannot run 
Grover’s search on 
a classical phonebook”

    • More radical computational improvements 


    • More interesting quantum effects
may be exploitable


    Oracular quantum computation: 
exponential separations in principle possible

    V. Dunjko, J. M. Taylor, H. J. Briegel Framework for learning agents in quantum environments arXiv:1507.08482 (2015)

    V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)

    Quantum Agent - Environment paradigm

  • V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)

    Environment


    (figure: Agent and Environment interacting, with the interaction registers labelled Q for quantum)

  • Inspiration from oracular quantum computation…

  • Inspiration from oracular quantum computation…

    Agent-like Environment-like

  • Inspiration from oracular quantum computation…

    Agent-like Environment-like

    think of Environment as oracle

  • ∈ “Large” set of environments

    If E is chosen at random, on average all agents perform equally (NFL)

    How could one use that… E.g. oracle identification!

  • (figure: environment E ∈ A = {A1, A2, A3, …, An}, a set of possible environments)

  • (figure: E ∈ {A1, A2, A3, …, An}; (meta) A)

    think of Environment as oracle; identify it faster using QC

  • But… environments are not like standard oracles…

    “Oraculization” (taming the open environment)

    (blocking, accessing purification and recycling)

  • 1)

    2)

    3)

    Oraculization (blocking) (taming the open environment)

    quantum comb

    causal network

    “blocking”

  • 34

    Oraculization (recovery and recycling) (taming the open environment)

    Classically specified oracle

    f

    “quantiza

    tion”

  • 4)

    Oraculization (continued) (taming the open environment)

  • 4)

    Oraculization (continued) (taming the open environment)

    Relevant information about (an aspect of) the environment becomes accessible, encoded in an effective quantum oracle

  • 37

    RL:

    Oraculization in data-driven ML?

  • Result schemata: for every A we construct Aq s.t. Aq > A…

    find useful info using quantum access

    optimize a (given) learning mechanism

    compare to given (best) classical agent

  • First result: quantum upgrade of a learning model

    V. Dunjko, J. M. Taylor, H. J. Briegel Quantum-enhanced machine learning Phys. Rev. Lett. 117, 130501 (2016)

    polynomial improvements for a broad class of task environments by using Grover-like amplification

    1. Grover finds “rewarding sequences”

    2. Algorithm is “pre-trained” to focus more 
on rewarding space of policies

    3. Works whenever algorithm makes sense, 
and environment s.t. it prefers winners
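    Point 1 can be illustrated by simulating the amplitudes classically. A toy sketch (illustrative numbers and a hard-coded “rewarding” set; a quantum agent would of course run this coherently against the oraculized environment): the oracle phase-flips rewarding action sequences and the diffusion step inverts about the mean, boosting the probability of sampling a rewarding sequence well above uniform trial-and-error.

    # Toy simulation of Grover-style amplitude amplification over action sequences.
    import numpy as np

    n_actions, length = 2, 6                      # action sequences of length 6 over 2 actions
    N = n_actions ** length                       # 64 candidate sequences
    rewarding = {13, 42}                          # toy set of "rewarding sequences" (oracle knowledge)

    amps = np.full(N, 1.0 / np.sqrt(N))          # uniform superposition over all sequences
    n_iter = int(round((np.pi / 4) * np.sqrt(N / len(rewarding))))

    for _ in range(n_iter):
        for i in rewarding:                       # oracle: phase-flip rewarding sequences
            amps[i] *= -1.0
        amps = 2 * amps.mean() - amps             # diffusion: inversion about the mean

    p_rewarding = sum(amps[i] ** 2 for i in rewarding)
    print(f"after {n_iter} Grover iterations, P(rewarding) = {p_rewarding:.3f}")
    print(f"uniform guessing would give P = {len(rewarding) / N:.3f}")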

  • V. Dunjko, J. M. Taylor, H. J. Briegel Advances in Quantum Reinforcement Learning IEEE SMC 2017

    polynomial improvements for meta-learning settings by using quantum optimization methods

    given a fixed model and environment, the joint AE 
interaction can be oracularized

    model parameters can be “quantum-optimized”
in (any) given quantum-accessible environment

    Second result: quantum speedup of meta-learning
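    The “Grover-based optimization” of model parameters can be caricatured classically. The sketch below is an illustrative toy, not the paper's construction: it mimics Durr-Hoyer-style maximum finding by repeatedly Grover-searching for any metaparameter setting that beats the current best, and simply tallies the roughly √(N/k) oracle calls each such search would cost.

    # Classical caricature of Grover-based maximum finding over a grid of metaparameter
    # settings; "performance" stands in for the agent's score in the quantum-accessible
    # environment. Illustrative sketch only.
    import math, random

    random.seed(1)
    performance = [random.random() for _ in range(256)]   # oracle: score of each parameter setting

    best = random.randrange(len(performance))             # random initial threshold setting
    quantum_queries = 0
    while True:
        better = [i for i, p in enumerate(performance) if p > performance[best]]
        if not better:
            break
        # A Grover search for "any setting better than the threshold" needs roughly
        # sqrt(N / #better) oracle calls; we only tally that cost here.
        quantum_queries += math.ceil(math.sqrt(len(performance) / len(better)))
        best = random.choice(better)                       # outcome of the (simulated) search

    print("best setting:", best, "score:", round(performance[best], 3))
    print("estimated quantum oracle calls:", quantum_queries,
          "vs", len(performance), "for exhaustive classical evaluation")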

  • quantum upgrade of a learning model  |  quantum speedup of meta-learning

    underlying quant. alg.: Grover-based optimization  |  underlying quant. alg.: Grover-based optimization

    oraculization of Environment  |  oraculization of Agent-Env. interaction

    oracle encoding properties of environment  ~  oracle encoding performance of Agent-Environment interaction

  • Quantum folklore:

    quadratic improvements a-plenty, super-polynomial hard to come by

    What about reinforcement learning?

  • Third result: absolute separation of classical & quantum learning

    RL is VERY BIG. Trivial answers aplenty

    Given an oracular problem, construct MDPs which:


    1) have “generic RL properties”: 
delayed rewards, genuine actions, 
controlled amount of degeneracy in optimal policies (stochasticity)


    2) Are hard to learn for any classical agent


    3) Oraculization possible, and reduces to oracles which allow 
superpolynomial speed-ups

  • super-polynomial improvements possible for a class of genuine RL task environments via reduction to (generalized) Recursive Fourier Sampling

    Complicated, and genuine RL problem

    oraculization process

    RFS-type oracle hiding a necessary “key”

    super-polynomial separations known

    V. Dunjko, Y. Liu, X. Wu, J. M. Taylor Super-polynomial separations for quantum reinforcement learning (in preparation)

  • super-polynomial improvements possible for a class of genuine RL task environments via reduction to (generalized) Recursive Fourier Sampling

    Complicated, and genuine RL problem

    oraculization process

    RFS-type oracle hiding a necessary “key”

    super-polynomial separations known

    V. Dunjko, Y. Liu, X. Wu, J. M. Taylor Super-polynomial separations for quantum reinforcement learning (in preparation)

    can be adapted to e.g. Simon’s problem: strict exponential separation possible!
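    For reference, the Simon separation invoked in the last line is the standard query-complexity statement (quoted, not derived here):

    % Simon's problem: f : {0,1}^n -> {0,1}^n with f(x) = f(y) iff y \in {x, x \oplus s}; find s.
    \[
      Q(\mathrm{Simon}) = O(n), \qquad R(\mathrm{Simon}) = \Omega\bigl(2^{n/2}\bigr),
    \]

    i.e. O(n) quantum queries suffice, while any classical bounded-error algorithm needs exponentially many.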

  • quantum upgrade of a learning model  |  absolute separation of classical & quantum learning

    “natural” environments: mazes  |  start with the solution: oracle problems with separation

    oraculization to most natural info: rewarding sequences  |  reverse-engineered matching environments

    Grover is (almost always) applicable: large “natural” class of settings where improvements possible  |  identify generalizations/deformations which a) maintain classical hardness b) allow for meaningful oraculization and application of known q. algorithms

    (shhh: call those above “the generic ones”!)

  • 47

    Perspectives of (this type of) Quantum Reinforcement learning 


    QAE: fundamental questions of learnability 
Quantum AI and quantum Turing test (behavioral intelligence)

    QRL-enhancements (theory): connections to oracular models. 
Elitzur-Vaidman & counterfactual computation. Q. Metrology, q. illumination.

    Making it concrete: connection to QML!

    QIP overlaps: generalization of standard oracular models

    QRL-enhancements: 
Minimal oraculizations. 
Quantum agents in quantum systems (quantum experiments, QQ ML).

Model-based agents. Quantum behavioral databases.

  • 48

    The end