
Policy Search in Reproducing Kernel Hilbert Space

Ngo Anh Vien, Peter Englert, and Marc Toussaint
Machine Learning and Robotics Lab

University of Stuttgart, Germany
{vien.ngo, peter.englert, marc.toussaint}@ipvs.uni-stuttgart.de

Abstract
Modeling policies in reproducing kernel Hilbert space (RKHS) renders policy gradient reinforcement learning algorithms non-parametric. As a result, the policies become very flexible and have a rich representational potential without a predefined set of features. However, their performance might be either non-covariant under re-parameterization of the chosen kernel, or very sensitive to step-size selection. In this paper, we propose to use a general framework to derive a new RKHS policy search technique. The new derivation leads to both a natural RKHS actor-critic algorithm and an RKHS expectation maximization (EM) policy search algorithm. Further, we show that kernelization enables us to learn in partially observable (POMDP) tasks, which is considered daunting for parametric approaches. Via sparsification, a small set of "support vectors" representing the history is shown to be effectively discovered. For evaluation, we use three simulated (PO)MDP reinforcement learning tasks and a simulated PR2 robotic manipulation task. The results demonstrate the effectiveness of the new RKHS policy search framework in comparison to plain RKHS actor-critic, episodic natural actor-critic, plain actor-critic, and PoWER approaches.

1 Introduction
Policy search methods in reinforcement learning (RL) have recently been shown to be very effective in very high-dimensional robotics problems [Williams, 1992; Kober et al., 2013]. In such applications, the policy is compactly represented in a parametric space and effectively optimized using gradient descent [Sutton et al., 1999]. Theoretically, policy gradient methods have strong convergence guarantees. In larger problems, their performance and computation can be further enhanced when combined with function approximation. The function approximation techniques might be linear [Kober et al., 2013] or non-linear [Silver et al., 2014; Lillicrap et al., 2015].

Nonetheless, parametric approaches require a careful design of the features, which is inflexible in practical use and inefficient in large problems. One can overcome this inflexibility and limited adaptability by using nonparametric techniques, especially by modeling the policy in an RKHS (reproducing kernel Hilbert space) function space [Bagnell and Schneider, 2003a; Lever and Stafford, 2015; van Hoof et al., 2015]. Moreover, an adaptively compact policy representation can be achieved via sparsification in RKHS. For the case of a multi-dimensional output policy (multi-dimensional actions), a vector-valued kernel can be used [Micchelli and Pontil, 2005].

However, standard policy gradient methods might face difficult convergence problems. The issue stems from either using a non-covariant gradient or from selecting an ad-hoc step-size. The latter is a standard problem in any gradient-based method and is classically addressed using line search, which may be sample-inefficient. The step-size selection problem can be avoided by using EM-inspired techniques instead [Vlassis et al., 2009; Kober and Peters, 2011]. Using a non-covariant gradient means that the gradient direction might change significantly even with a simple re-parameterization of the policy representation. Fortunately, the inherent non-covariant behaviour due to coordinate transformation [Amari, 1998] can be fixed using the natural gradient with respect to one particular Riemannian metric, namely the natural metric or the Fisher information metric [Kakade, 2001; Bagnell and Schneider, 2003b; Peters and Schaal, 2008].

The EM-inspired and policy gradient algorithms have further been shown to be special cases of a common policy search framework, called free-energy minimization [Kober and Peters, 2011; Neumann, 2011]. Inspired by this general derivation, in this paper we propose a policy search framework in RKHS which results in two RKHS policy search algorithms: RKHS natural actor-critic and RKHS EM-inspired policy search. We derive the following:

• The natural gradient in RKHS is computed with respect to the Fisher information metric, which is a by-product of considering the path-distribution manifold in the policy update. We show that plain functional policy gradients are non-covariant under changes of the kernel's hyperparameters, e.g. the variance in RBF kernels or the free parameters in polynomial kernels. We prove that our functional natural policy gradient is a covariant update rule. As a result, we design a new Natural Actor-Critic in the RKHS framework, called RKHS-NAC. Using a simple example we show that RKHS-AC requires a very careful choice of the kernel's hyperparameters, as it drastically affects the performance and the quality of the converged policy. By contrast, RKHS-NAC frees us from that painful problem.

• The RKHS EM-inspired algorithm can be seen as a kernelized PoWER (Policy Learning by Weighting Exploration with the Returns). As a result, both the mean and variance functionals of the policy are updated analytically based on a notion of bounds on policy improvements.

• The general RKHS policy search framework turns out to be an efficient technique for learning in POMDPs, where the policy is modeled based on a sparse set of observed histories. This nonparametric representation renders our standard RKHS policy search algorithms a promising approach for POMDP learning, which is daunting for parametric ones.

We compare our new framework to the recently introduced RKHS actor-critic framework (RKHS-AC) [Lever and Stafford, 2015], other parametric approaches such as the episodic natural actor-critic [Peters and Schaal, 2008], the actor-critic framework [Konda and Tsitsiklis, 1999], and PoWER [Kober and Peters, 2011]. The comparisons are done on benchmark RL problems in both MDP and POMDP environments, and on a robotic manipulation task. The robotic task is implemented on a simulated PR2 platform.

2 Background
In this section, we briefly present related background that will be used in the rest of the paper.

2.1 Markov Decision Process
A Markov decision process (MDP) is defined as a 5-tuple $\{\mathcal{S}, \mathcal{A}, T, R, \gamma\}$, where $\mathcal{S} \subseteq \mathbb{R}^m$ is a state space, $\mathcal{A} \subseteq \mathbb{R}^n$ is an action space, $T(s,a,s') = p(s'|s,a)$ is a transition function, $R(s,a,s')$ is a reward function, and $\gamma$ is a discount factor. An RL agent aims to find an optimal policy $\pi$, without knowing $T$ and $R$, that maximizes the cumulative discounted reward

$$U(\pi) = \int p(\xi;\pi)\, R(\xi)\, d\xi = \int p(\xi;\pi) \sum_{t=0}^{\infty} \gamma^t r_t \, d\xi \qquad (1)$$

where $\xi = \{s_0, a_0, s_1, a_1, \ldots\}$ is a trajectory with $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$. We denote by $R(\xi) = \sum_{t=0}^{\infty} \gamma^t r_t$ the cumulative discounted reward of a trajectory $\xi$, and by $p(\xi;\pi)$ the trajectory distribution given $\pi$. In this paper, we use Gaussian policies

$$\pi(a|s) \propto \exp\Big(-\tfrac{1}{2}\,\big(h(s)-a\big)^{\top} \Sigma^{-1} \big(h(s)-a\big)\Big) \qquad (2)$$

where $h(s) : \mathcal{S} \mapsto \mathbb{R}^n$ is a vector-valued function of the current state $s$, and $\Sigma$ is an $n \times n$ covariance matrix. For instance, the linear parametrization approach assumes that the policy $\pi$ is parametrized by a parameter vector $\theta \in \mathbb{R}^d$ and that $h(s)$ depends linearly on predefined features $\phi_i(s)$, as

$$h(s) = \sum_{i=1}^{d} \theta_i \phi_i(s) \qquad (3)$$

Similarly, nonparametric approaches with a reproducing kernel $K$ represent $h(s) = \langle K(s,\cdot), h\rangle$ (the reproducing property).
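As a concrete illustration of Eqs. 2-3, the following sketch samples an action from a Gaussian policy with a linearly parametrized mean over RBF features; the feature centres, bandwidth, and dimensions are illustrative assumptions, not specifications from the paper.

import numpy as np

def rbf_features(s, centres, bandwidth=1.0):
    """Evaluate d RBF features phi_i(s) at state s (centres has shape (d, m))."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    sq_dists = np.sum((centres - s) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))        # shape (d,)

def gaussian_policy_sample(s, theta, centres, Sigma, rng):
    """Sample a ~ N(h(s), Sigma) with the linear parametrization h(s) = sum_i theta_i phi_i(s)."""
    phi = rbf_features(s, centres)
    mean = theta.T @ phi                                     # action mean h(s), shape (n,)
    return rng.multivariate_normal(mean, Sigma)

# illustrative usage: 1-D state, 1-D action, 5 hypothetical feature centres
rng = np.random.default_rng(0)
centres = np.linspace(-4.0, 4.0, 5).reshape(-1, 1)
theta = rng.normal(size=(5, 1))                              # d x n parameter matrix
action = gaussian_policy_sample(0.0, theta, centres, 0.1 * np.eye(1), rng)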

2.2 A Policy Search Framework
We briefly describe a general policy search framework whose main idea is to maximize a lower bound on the expected return at the new parameter value $\theta'$:

$$\log U(\theta) = \log \int p(\xi;\theta) R(\xi)\, d\xi = \int p(\xi;\theta') \log \frac{p(\xi;\theta) R(\xi)}{p(\xi;\theta')}\, d\xi + KL\big(p(\xi;\theta')\,\|\,p(\xi|R;\theta)\big)$$

where $KL\big(p(\xi;\theta')\,\|\,p(\xi|R;\theta)\big)$ is the Kullback-Leibler divergence and the rewards are assumed to be positive. In order to maximize $U(\theta)$, one can choose $p(\xi;\theta') \propto p(\xi|R;\theta)$ to make the KL term zero (E-step). Substituting this result into the above decomposition yields the new performance $U(\theta')$ (M-step) as

$$\log U(\theta') \geq \int p(\xi;\theta) R(\xi) \log \frac{p(\xi;\theta')}{p(\xi;\theta)}\, d\xi \;\propto\; -KL\big(p(\xi;\theta) R(\xi)\,\|\,p(\xi;\theta')\big) = L(\theta';\theta) \qquad (4)$$

This derivation can result in two different types of algorithms: 1) policy gradient, e.g. REINFORCE [Williams, 1992] and (natural) actor-critic algorithms [Konda and Tsitsiklis, 1999; Peters and Schaal, 2008]; 2) EM-inspired policy search, e.g. PoWER [Kober and Peters, 2011], MCEM [Vlassis et al., 2009], and VIP [Neumann, 2011]. All these approaches optimize $U(\theta')$ by maximizing the lower bound $L(\theta';\theta)$, hence they take the derivative $\nabla_{\theta'} L(\theta';\theta)$.

(Natural) Actor-Critic Algorithms
These methods use iterative update rules to increment $\theta$ along the gradient ascent direction $\nabla_\theta U(\theta) = \lim_{\theta' \to \theta} \nabla_{\theta'} L(\theta';\theta)$. Using the policy gradient theorem [Sutton et al., 1999; Baxter and Bartlett, 2001], we can derive that

$$\nabla_\theta U(\theta) = \mathbb{E}\Big[\sum_{t=0}^{H} \nabla_\theta \log \pi(a_t|s_t)\, A^\pi(s_t,a_t)\Big] \qquad (5)$$

where $A^\pi(s_t,a_t) = Q^\pi(s_t,a_t) - V(s_t)$ is called the advantage function and $Q^\pi(s,a)$ is the state-action Q-value function. Using function approximation, the advantage function $A^\pi(s,a)$ is approximated as $\tilde{A}(s,a) = \sum_{i=1}^{d} w_i \psi_i(s,a)$, in which the basis functions are $\psi_i(s,a) = \nabla_{\theta_i} \log \pi(a|s)$ in order to make the function approximation compatible with the policy parameterization. The parameter $\theta$ is updated along the gradient ascent direction $\nabla_\theta U(\pi)$. Interleaving updates of the weights $\theta$ (the actor's parameters) and $w$ (the critic's parameters) leads to the actor-critic framework.

With compatible function approximation, the actor update using Amari's natural gradient [Amari, 1998] is proven to become $\theta \leftarrow \theta + \alpha w$. This leads to the natural actor-critic framework, e.g. the episodic natural actor-critic [Peters and Schaal, 2008].


EM-Inspired Policy Search
The methods eRWR [Peters and Schaal, 2007], Cost-regularized Kernel Regression (CrKR) [Kober et al., 2010], and PoWER [Kober and Peters, 2011] introduce a notion of weighted exploration by changing the policy defined in Eq. 2 to $a = (\theta + \epsilon)^\top \phi(s)$, where $\epsilon \sim \mathcal{N}(\epsilon; 0, \Sigma)$. For brevity, we assume that the covariance $\Sigma$ is a diagonal matrix with entries $\sigma_i$. The policy is improved by computing the next parameter value $\theta'$ that makes $\nabla_{\theta'} L(\theta';\theta)$ equal to zero. As a result, the update takes the form of exploration weighted by the returns (for the $i$-th dimension):

$$\theta_i' = \theta_i + \frac{\mathbb{E}\big[\sum_{t=0}^{H} \epsilon_{i,t}\, \sigma_i^{-1} A^\pi(s_t,a_t)\big]}{\mathbb{E}\big[\sum_{t=0}^{H} \sigma_i^{-1} A^\pi(s_t,a_t)\big]} \qquad (6)$$

Taking the derivative of the lower bound w.r.t. $\Sigma'$ (or $\sigma_i$) yields a similar update expression for the variance $\Sigma'$.
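For illustration, the PoWER update of Eq. 6 for a single parameter dimension can be written as a few lines of code; the sketch below assumes the exploration noise and advantage estimates of the sampled episodes are stored as (N, H+1) arrays, which is our own convention.

import numpy as np

def power_update_1d(theta_i, eps_i, advantages, sigma_i):
    """Eq. 6 for dimension i: reward-weighted exploration.
    eps_i, advantages: arrays of shape (N, H+1) holding the exploration noise
    eps_{i,t} and the advantage estimates A(s_t, a_t) of N sampled episodes."""
    num = np.mean(np.sum(eps_i * advantages / sigma_i, axis=1))   # E[sum_t eps_{i,t} sigma_i^{-1} A]
    den = np.mean(np.sum(advantages / sigma_i, axis=1))           # E[sum_t sigma_i^{-1} A]
    return theta_i + num / den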

2.3 Actor-Critic in RKHS
As recently introduced by [Bagnell and Schneider, 2003a; Lever and Stafford, 2015], the policies are parameterized by functionals in reproducing kernel Hilbert spaces. More specifically, the functional $h(s)$ in Eq. 2 is an element of a vector-valued RKHS $\mathcal{H}_K$:

$$h(\cdot) = \sum_i K(s_i,\cdot)\,\alpha_i, \quad \text{where } \alpha_i \in \mathcal{A}\ (\subseteq \mathbb{R}^n) \qquad (7)$$

where the kernel $K$ is defined as a mapping $\mathcal{S} \times \mathcal{S} \mapsto \mathcal{L}(\mathcal{A})$ [Micchelli and Pontil, 2005], with $\mathcal{L}(\mathcal{A})$ the space of linear operators on $\mathcal{A}$. The simplest choice of $K$ might be $K(s,s') = \kappa(s,s')\, I_n$, where $I_n$ is an $n \times n$ identity matrix and $\kappa$ is a scalar-valued kernel [Scholkopf and Smola, 2002].

The functional gradient, i.e. the Frechet derivative, of the term $\log \pi(a_t|s_t)$ is computed as

$$\nabla_h \log \pi_h(a_t|s_t) = K(s_t,\cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big) \in \mathcal{H}_K \qquad (8)$$
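A minimal sketch of the nonparametric representation of Eqs. 7-8, for the simple case $K(s,s') = \kappa(s,s')\,I_n$ with a scalar RBF kernel $\kappa$; the class name and storage layout are our own.

import numpy as np

class RKHSPolicyMean:
    """h(.) = sum_i K(s_i, .) alpha_i with K(s, s') = kappa(s, s') I_n (Eq. 7)."""

    def __init__(self, bandwidth, n_actions):
        self.bandwidth = bandwidth
        self.n = n_actions
        self.centres = []      # states s_i
        self.alphas = []       # coefficients alpha_i in R^n

    def kappa(self, s, s2):
        return float(np.exp(-np.sum((np.asarray(s, dtype=float) - np.asarray(s2, dtype=float)) ** 2)
                            / (2.0 * self.bandwidth ** 2)))

    def __call__(self, s):
        if not self.centres:
            return np.zeros(self.n)
        weights = np.array([self.kappa(s, c) for c in self.centres])
        return weights @ np.array(self.alphas)          # h(s) in R^n

    def add(self, s, alpha):
        """Append one term K(s, .) alpha, e.g. the functional gradient of Eq. 8,
        which contributes the single centre s_t with coefficient Sigma^{-1}(a_t - h(s_t))."""
        self.centres.append(np.asarray(s, dtype=float))
        self.alphas.append(np.asarray(alpha, dtype=float))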

Therefore, the gradient of the performance measure w.r.t. $h$, using likelihood ratio methods, is approximated from a set of $N$ trajectories $\{\xi_i\}$ of $H$ steps as

$$\nabla_h U(\pi_h) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_h \log P(\xi_i;h)\, R(\xi_i) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{H} A(s_t,a_t)\, K(s_t,\cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big) \qquad (9)$$

As a result, $\nabla_h U(\pi_h) \in \mathcal{H}_K$ and the functional gradient update at iteration $k$ is

$$h_{k+1} \leftarrow h_k + \alpha\, \nabla_h U(\pi_{h_k}) \qquad (10)$$

where $\alpha$ is a step-size. After each gradient update, the number of centres $s_t$ of the next $h_{k+1}$ might increase rapidly; a sparse representation can be attained via sparsification.
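As an illustration, one plain functional-gradient step (Eqs. 9-10) amounts to adding one kernel centre per visited state, weighted by its advantage estimate; the sketch below builds on the RKHSPolicyMean class sketched above and treats the advantage estimates as given.

import numpy as np

def functional_policy_gradient_step(policy, trajectories, advantages, Sigma_inv, alpha):
    """One update h <- h + alpha * grad_h U(pi_h), with the gradient estimated as in Eq. 9.
    trajectories[i] is a list of (s_t, a_t) pairs; advantages[i] the matching A(s_t, a_t) values."""
    N = len(trajectories)
    # evaluate h at the old policy before adding any new centres
    h_old = [[policy(s) for (s, _a) in traj] for traj in trajectories]
    for traj, advs, h_vals in zip(trajectories, advantages, h_old):
        for (s, a), adv, h_s in zip(traj, advs, h_vals):
            coeff = (alpha / N) * adv * (Sigma_inv @ (np.asarray(a, dtype=float) - h_s))
            policy.add(s, coeff)    # adds the term (alpha/N) A K(s_t,.) Sigma^{-1}(a_t - h(s_t))
    return policy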

Figure 1: Discounted cumulative reward in the Toy domain for RKHS-AC with polynomial kernels ($c = 1.0, 0.2, 0.01$) and RBF kernels ($\mathrm{var} = 0.01, 0.1, 2.0$). The horizon $H$ is set to 20. We report the mean performance averaged over 50 runs, and its 95% confidence interval as shaded areas.

In addition, Lever and Stafford proved that the compatible kernel function for $A^\pi(s,a)$ is $K_h$, a compatible kernel constructed based on $K$ as

$$K_h\big((s,a),(s',a')\big) = \big(K(s,s')\,\Sigma^{-1}(a - h(s))\big)^{\top} \Sigma^{-1}\big(a' - h(s')\big) \qquad (11)$$

As a result of compatible function approximation, there exists a feature space $\mathcal{F}_\phi \equiv \mathcal{H}_K$ that contains $A^h(s,a)$ and the weight $w$ and satisfies

$$A^h(s,a) = \langle w, \nabla_h \log \pi(a|s)\rangle_{\mathcal{F}_\phi} = \langle w, K(s,\cdot)\,\Sigma^{-1}(a - h(s))\rangle_{\mathcal{H}_K} \qquad (12)$$

There are a number of ways to regress $A^h(s,a)$ from data generated by $\pi_h$. Lever and Stafford [Lever and Stafford, 2015] used kernel matching pursuit [Vincent and Bengio, 2002] for both function regression and sparsification. Accordingly, they proposed the compatible RKHS Actor-Critic framework, RKHS-AC.
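A small sketch of the compatible kernel in Eq. 11 for the scalar-kernel case $K(s,s') = \kappa(s,s')\,I_n$; here policy stands for the current mean functional $h$ and kappa for the scalar kernel (the names are ours).

import numpy as np

def compatible_kernel(sa, sa2, policy, Sigma_inv, kappa):
    """K_h((s,a),(s',a')) = kappa(s,s') (Sigma^{-1}(a - h(s)))^T Sigma^{-1}(a' - h(s'))   (Eq. 11)."""
    (s, a), (s2, a2) = sa, sa2
    u = Sigma_inv @ (np.asarray(a, dtype=float) - policy(s))
    v = Sigma_inv @ (np.asarray(a2, dtype=float) - policy(s2))
    return kappa(s, s2) * float(u @ v)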

However, the RKHS-AC’s gradient update is not covariantw.r.t. the change of the kernel’s hyperparameter. We show anexample use of RKHS-AC in the toy problem in Fig. 1 (seethe experiment section for the problem description). We usetwo types of kernel to represent our functional policy in Eq. 2with different values of hyperparameters: polynomial kernel(s, s0) = (s>s0 + c)d, where d = 5., and c = 0.01; 0.2; 1.0;and RBF kernel (s, s0) = exp(�ks� s0k/(2 ⇤ var)), wherevar = 0.01; 0.1; 2.0. Note that the changes of the hyperpa-rameters c, var lead to transformation of the coordinate sys-tem in the same corresponding feature space. The resultsshow that RKHS-AC is non-covariant. RKHS-AC does notguarantee good improvement in all setting. For instance, inthe cases of c = 0.01 and var = 2.0 some features are verypeaky, this would make the policy overwhelmingly exploitnon-optimal actions at early stage of learning.

3 Policy Search in RKHS
Inspired by the derivation of the parametric policy search framework, we derive a new policy search framework which enables policy modeling in RKHS. Assume that the controller is $a \sim \mathcal{N}(a; h(s), \Sigma(s))$, where $a \in \mathbb{R}^n$, $h(\cdot) \in \mathcal{H}_K$ as defined in Eq. 7, and $\Sigma(\cdot)$ is a matrix-valued function. By reshaping $\Sigma(\cdot)$ into a vector-valued function, one could embed it into another vector-valued RKHS. For brevity, we assume that $\Sigma(\cdot)$ is a diagonal matrix-valued function, which allows independent exploration in each action dimension.


The following derivation still applies in the case of general matrices. The variance in each dimension is a scalar-valued functional $g_i(\cdot) \in \mathcal{H}_{K_i}$ with a reproducing kernel $K_i$, for all $i = 1, \ldots, n$.

The lower bound on the expected return is similarly derived as

$$\log U(h') \geq \int p(\xi;h) R(\xi) \log \frac{p(\xi;h')}{p(\xi;h)}\, d\xi \;\propto\; L(h';h) \qquad (13)$$

Its derivative is

$$\nabla_{h'} L(h';h) = \mathbb{E}\Big[\sum_{t=0}^{H} \nabla_{h'} \log \pi(a_t|s_t)\, R(\xi)\Big] \qquad (14)$$

We next derive two RKHS policy search algorithms based on the above notion of bounds on policy improvement.

3.1 (Natural) Policy Gradients in RKHS
The RKHS policy gradient of Lever and Stafford [Lever and Stafford, 2015] can be re-derived exactly from the update in Eq. 10 if we set $\nabla_h U(h) = \lim_{h' \to h} \nabla L(h';h)$ (see Eq. 14).

In this paper, we are interested in applying the natural gradient method [Amari, 1998] to update the functional policy. The key idea is to constrain the change of $p(\xi;h)$ after each gradient update. For the vanilla gradient, the functional gradient direction is $g = \nabla_h U(\pi_h)$. With the constraint of a small change in the trajectory distribution, we compute the steepest gradient direction by solving the optimization problem

$$\max_{h'}\; L(h';h), \quad \text{s.t. } KL\big(p(\xi;h)\,\|\,p(\xi;h')\big) \leq \varepsilon \qquad (15)$$

Using a second-order Taylor expansion for the KL-divergence constraint, we can prove that the Fisher information operator [Nordebo et al., 2010] is

$$G = \mathbb{E}_{\xi \sim P(\xi;h)}\big[\nabla_h^2 \log P(\xi;h)\big] = \mathbb{E}_{\xi \sim P(\xi;h)}\big[\nabla_h \log P(\xi;h) \otimes \nabla_h \log P(\xi;h)\big] \qquad (16)$$

in which we use the fact $\int p(\xi)\, d\xi = 1$ twice.

We now prove an important property of $G$ needed to derive further results: boundedness. As a consequence, the inverse operator $G^{-1}$ of $G$ exists according to the bounded inverse theorem [Reed and Simon, 2003].

Proposition 1 The linear operator $G : \mathcal{H} \mapsto \mathcal{H}$ is bounded.

Proof: The linear operator $G$ is a mapping from a Hilbert space $\mathcal{H}$ to $\mathcal{H}$; hence, according to the Riesz representation theorem [Reed and Simon, 2003], the adjoint of $G$ exists. The adjoint is a continuous linear operator $G^* : \mathcal{H} \mapsto \mathcal{H}$ such that $\langle Gh, g\rangle = \langle h, G^* g\rangle$, where $h, g \in \mathcal{H}$. Moreover, we can easily see that $G^* G$ defines a Gram operator, hence it is positive definite and self-adjoint. Thus, $G^* G$ is a bounded linear operator [Reed and Simon, 2003]; in other words, there exists a number $M > 0$ such that $\|G^* G h\| \leq M \|h\|$. From these results on $G$ and $G^* G$, we can derive

$$\langle Gh, Gh\rangle = \langle h, G^* G h\rangle \leq \|h\|\,\|G^* G h\| \leq M \|h\|^2,$$

which means that the linear operator $G$ is also bounded. $\square$

From a set of $N$ trajectories $\{\xi_i\}$, $G$ can be estimated as

$$\hat{G} = \frac{1}{N} \sum_{i=1}^{N} \nabla_h \log P(\xi_i;h) \otimes \nabla_h \log P(\xi_i;h) = \frac{1}{N}\, \Phi(\xi)\Phi(\xi)^{\top} \qquad (17)$$

where $\Phi(\xi) = [\nabla_h \log P(\xi_1;h), \ldots, \nabla_h \log P(\xi_N;h)]$. Using Lagrange multipliers for the optimization problem in (15), the natural functional gradient is

$$\nabla_h^{NG} U(\pi_h) = G^{-1}\nabla_h U(\pi_h) \approx \frac{1}{N}\Big(\frac{1}{N}\Phi(\xi)\Phi(\xi)^{\top} + \lambda I\Big)^{-1} \Phi(\xi)\, R \qquad (18)$$

where $R = [R(\xi_1), \ldots, R(\xi_N)]$ and $\lambda$ is a regularizer. Using the Woodbury identity for Eq. 18, we obtain

$$\nabla_h^{NG} U(\pi_h) \approx \Phi(\xi)\big(\Phi(\xi)^{\top}\Phi(\xi) + \lambda N I\big)^{-1} R = \Phi(\xi)\big(\Lambda + \lambda N I\big)^{-1} R = \sum_{i=1}^{N} \sum_{t=0}^{H} K(s_t,\cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big)\, \hat{R}_i$$

where $\Lambda$ is the Gram matrix of trajectories, built from the trajectory kernel function $K_\xi(\xi_i,\xi_j)$, and $\hat{R} = \big(\Lambda + \lambda N I\big)^{-1} R$. We now define the scalar-valued trajectory kernel $K_\xi(\xi_i,\xi_j)$. Note that the gradient of the log trajectory distribution is

$$\nabla_h \log P(\xi_i;h) = \sum_{t=0}^{H} K(s_t,\cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big) \qquad (19)$$

Therefore,

$$K_\xi(\xi_i,\xi_j) = \sum_{t,t'} \big(K(s_t,s_{t'})\,\Sigma^{-1}(a_t - h(s_t))\big)^{\top} \Sigma^{-1}\big(a_{t'} - h(s_{t'})\big) = \sum_{t=1}^{T_i} \sum_{t'=1}^{T_j} K_h\big((s_t,a_t),(s_{t'},a_{t'})\big)$$

where $(s_t,a_t)$ and $(s_{t'},a_{t'})$ range over the state-action pairs of $\xi_i$ and $\xi_j$, respectively. The resulting trajectory kernel is a sum of kernel functions $K_h\big((s,a),(s',a')\big)$, defined in Eq. 11, over all visited state-action pairs. Being a sum of many proper kernels, the trajectory kernel is proper and specifies an RKHS $\mathcal{H}_{K_\xi}$ of trajectory feature maps.

Finally, the plain natural policy gradient in RKHS, reported in Algorithm 1, has the following update

$$h_{k+1} = h_k + \alpha\, \nabla_h^{NG} U(\pi_{h_k}) \qquad (20)$$
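The Gram-matrix form of this natural functional gradient can be sketched as follows; it reuses the compatible_kernel and RKHSPolicyMean sketches given earlier and assumes the trajectory returns $R(\xi_i)$ are precomputed.

import numpy as np

def trajectory_gram(trajectories, policy, Sigma_inv, kappa):
    """Lambda_{ij} = K_xi(xi_i, xi_j): a sum of compatible kernels K_h over all
    state-action pairs of the two trajectories (the trajectory kernel defined above)."""
    def k_xi(ti, tj):
        return sum(compatible_kernel(p, q, policy, Sigma_inv, kappa) for p in ti for q in tj)
    N = len(trajectories)
    return np.array([[k_xi(trajectories[i], trajectories[j]) for j in range(N)] for i in range(N)])

def natural_gradient_step(policy, trajectories, returns, Sigma_inv, kappa, lam, alpha):
    """Eq. 20 with the Woodbury form of Eq. 18: R_hat = (Lambda + lam*N*I)^{-1} R, then
    h <- h + alpha * sum_i sum_t K(s_t,.) Sigma^{-1}(a_t - h(s_t)) R_hat_i."""
    N = len(trajectories)
    Lam = trajectory_gram(trajectories, policy, Sigma_inv, kappa)
    R_hat = np.linalg.solve(Lam + lam * N * np.eye(N), np.asarray(returns, dtype=float))
    h_old = [[policy(s) for (s, _a) in traj] for traj in trajectories]
    for traj, h_vals, r_hat_i in zip(trajectories, h_old, R_hat):
        for (s, a), h_s in zip(traj, h_vals):
            policy.add(s, alpha * r_hat_i * (Sigma_inv @ (np.asarray(a, dtype=float) - h_s)))
    return policy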

3.2 Episodic Natural Actor-Critic in RKHS
We now use the above results to propose a new episodic natural actor-critic framework that enables policy modeling in RKHS.


Algorithm 1 RKHS Policy Search Framework
  Given the kernel $K$, initialize $h_0 = 0$
  for $k = 0, 1, 2, \ldots$ do
    Sample $N$ trajectories $\{\xi_i\}$ from $\pi_{h_k}$
    Estimate $A^\pi(s,a)$ using $\{\xi_i\}_{i=1}^N$ to obtain $w \in \mathcal{H}_K$
    RKHS-NPG:   $h_{k+1} = h_k + \alpha\, \nabla_h^{NG} U(\pi_{h_k})$
    RKHS-NAC:   $h_{k+1} = h_k + \alpha_k w$
    RKHS-PoWER: update $h_{k+1}$, $g_{k+1}$ as in Eqs. 23-24
    Sparsify $h_{k+1} \in \mathcal{H}_K$
  end for

The Actor Update
First, we rewrite the Fisher information operator in Eq. 16 by substituting the definition of $\nabla \log P(\xi)$ used in Eq. 5 to obtain

$$G = \mathbb{E}_{P(\xi;h)}\Big[\sum_{t=0}^{H} \nabla_h^2 \log \pi(a_t|s_t)\Big] = \mathbb{E}_{P(\xi;h)}\Big[\sum_{t=0}^{H} \nabla_h \log \pi(a_t|s_t)\, \nabla_h \log \pi(a_t|s_t)^{\top}\Big] \qquad (21)$$

where we use the fact $\int \pi(a|s)\, da = 1$ twice. On the other hand, substituting $A(s,a)$ from Eq. 12 into Eq. 9 gives

$$\nabla_h U(\pi) = \mathbb{E}_{P(\xi;h)}\Big[\sum_{t=0}^{H} \langle w, \nabla_h \log \pi(a_t|s_t)\rangle\, \nabla_h \log \pi(a_t|s_t)\Big] = \mathbb{E}_{P(\xi;h)}\Big[\sum_{t=0}^{H} \nabla_h \log \pi(a_t|s_t)\, \nabla_h \log \pi(a_t|s_t)^{\top} w\Big] = G w$$

As a result, the natural gradient in RKHS used in the actor update is

$$\nabla_h^{NG} U(\pi_h) = G^{-1} G w = w \in \mathcal{H}_K \qquad (22)$$

The Critic Evaluation
Similar to eNAC [Peters and Schaal, 2008], we write the entire rollout of Bellman equations for an episode and insert the definition of $A^\pi(s_t,a_t)$ from Eq. 12 to get

$$\sum_{t=0}^{H} \gamma^t r(s_t,a_t) = \sum_{t=0}^{H} \gamma^t A^\pi(s_t,a_t) + V^\pi(s_0) = \Big\langle w,\ \sum_{t=0}^{H} \gamma^t K(s_t,\cdot)\,\Sigma^{-1}\big(a_t - h(s_t)\big)\Big\rangle + V^\pi(s_0)$$

where $V^\pi(s_0)$ is the value function of the starting state. One could use another kernel to estimate $V(s_0)$. We propose to use the kernel matching pursuit algorithm [Vincent and Bengio, 2002] to regress $A^\pi(s,a)$ from a set of trajectories sampled from the policy $\pi_h$.

Putting all the ingredients together, we obtain the RKHS natural actor-critic framework (RKHS-NAC), as described in Algorithm 1. The algorithm uses the kernel matching pursuit method to sparsify the new functional $h_{k+1}$.
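The episode-level regression behind the critic can be set up as follows: the target for episode $i$ is its discounted return, the episode feature is $\psi_i = \sum_t \gamma^t \nabla_h \log \pi(a_t|s_t)$, and the pairwise inner products $\langle \psi_i, \psi_j\rangle$ reduce to discounted sums of the compatible kernel $K_h$ (Eq. 11). The sketch below, which reuses the compatible_kernel sketch from Section 2.3, only builds the targets and the episode Gram matrix; the paper itself fits $w$ (and $V(s_0)$) with kernel matching pursuit, for which any regression routine could be substituted here.

import numpy as np

def critic_targets_and_gram(trajectories, rewards, gamma, policy, Sigma_inv, kappa):
    """targets[i] = sum_t gamma^t r_t;  gram[i, j] = <psi_i, psi_j>, with
    psi_i = sum_t gamma^t K(s_t,.) Sigma^{-1}(a_t - h(s_t)) expressed through K_h."""
    targets = np.array([sum((gamma ** t) * r for t, r in enumerate(ep)) for ep in rewards])
    def inner(ti, tj):
        return sum((gamma ** u) * (gamma ** v) *
                   compatible_kernel(p, q, policy, Sigma_inv, kappa)
                   for u, p in enumerate(ti) for v, q in enumerate(tj))
    N = len(trajectories)
    gram = np.array([[inner(trajectories[i], trajectories[j]) for j in range(N)] for i in range(N)])
    return targets, gram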

3.3 EM-Inspired Policy Search in RKHS
The major problem of policy gradient approaches like RKHS-AC and RKHS-NAC is the step-size $\alpha_k$. Inspired by PoWER, we propose an alternative solution for maximizing the lower bound $L(h';h)$ by setting its derivative $\nabla_{h'} L(h';h)$ to zero and solving for $h'$, which gives

$$\mathbb{E}_{\pi_h}\Big\{\sum_{t=0}^{H} A(s_t,a_t)\, K(s_t,\cdot)\,\Sigma^{-1}(s_t)\big(a_t - h'(s_t)\big)\Big\} = 0$$

Hence,

$$h' = h + \mathbb{E}_{\pi_h}\Big\{\sum_{t=0}^{H} A(s_t,a_t)\, K(s_t,\cdot)\,\Sigma^{-1}(s_t)\, K^{\top}(s_t,\cdot)\Big\}^{-1} \mathbb{E}_{\pi_h}\Big\{\sum_{t=0}^{H} A(s_t,a_t)\, K(s_t,\cdot)\,\Sigma^{-1}(s_t)\big(a_t - h(s_t)\big)\Big\}$$

where we exploited the reproducing property $h'(s) = \langle h', K(s,\cdot)\rangle$. The update rules for the variance functionals $\Sigma(\cdot)$, i.e. the $g_i(\cdot)$, are similarly derived as

$$g_i' = \frac{\mathbb{E}_{\pi_h}\big[\sum_{t=0}^{H} \epsilon_t^{i\,2}\, A(s_t,a_t)\, K_i(s_t,\cdot)\, K_i(s_t,s_t)^{-1}\big]}{\mathbb{E}_{\pi_h}\big[\sum_{t=0}^{H} \epsilon_t^{i\,2}\, A(s_t,a_t)\big]} \in \mathcal{H}_{K_i}$$

where $\epsilon_t^i = \big(a_t - h(s_t)\big)_i$ (the $i$-th dimension). As an example of how to estimate $h'$ and $\Sigma'$ given a set of $N$ trajectories $(s_{jt}, a_{jt})_{j=1,t=0}^{N,H}$ sampled from $\pi_h$, we assume $K(s,s') = \kappa(s,s')\, I_n$ for simplicity, where $\kappa(s_i,s_j)$ is a scalar-valued kernel. The estimates for the $i$-th dimension, obtained using the Woodbury matrix identity, are

$$h_i' = h_i + \sum_{j=1}^{N} \sum_{t=0}^{H} \kappa(s_{jt},\cdot)\, \alpha_{jt}^i \qquad (23)$$

$$g_i' = \sum_{j=1}^{N} \sum_{t=0}^{H} K_i(s_{jt},\cdot)\, \beta_{jt}^i \qquad (24)$$

where we denote $\Phi = \big(\kappa(s_{1,0},\cdot), \ldots, \kappa(s_{N,H},\cdot)\big)$, $A_1 = \mathrm{diag}\big(A(s_{1,0},a_{1,0})/g_i(s_{1,0}), \ldots, A(s_{N,H},a_{N,H})/g_i(s_{N,H})\big)$, $A_2 = \big(\epsilon_{1,0}^i, \ldots, \epsilon_{N,H}^i\big)^{\top}$, and $\Gamma = \Phi^{\top}\Phi$ is a Gram matrix with $\Gamma_{ij} = \kappa(s_i,s_j)$, to compute $\alpha^i = \big(\Gamma A_1^{-1} + \lambda\big)^{-1} A_2$ (the constant $\lambda$ is a regularizer; $\alpha^i$ is an $NH \times 1$ vector, reshaped to an $N \times H$ matrix), and

$$\beta_{jt}^i = \frac{\epsilon_{jt}^{i\,2}\, A(s_{jt},a_{jt})}{K_i(s_{jt},s_{jt})\, \sum_{k=1}^{N} \sum_{l=0}^{H} \epsilon_{kl}^{i\,2}\, A(s_{kl},a_{kl})}$$

The overall algorithm, named RKHS-PoWER, is reported in Algorithm 1.
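As an illustration of the variance update in Eq. 24, the sketch below returns the new scalar-valued functional $g_i'(\cdot)$ for one action dimension; the inputs are flattened over all visited state-action pairs, and this flattening convention is ours.

import numpy as np

def power_variance_update(states, eps_sq, advantages, kappa_i):
    """g_i'(.) = sum_{j,t} kappa_i(s_{jt}, .) beta_{jt}  with
    beta_{jt} = eps^2_{jt} A_{jt} / (kappa_i(s_{jt}, s_{jt}) * sum_{k,l} eps^2_{kl} A_{kl})."""
    total = float(np.sum(np.asarray(eps_sq) * np.asarray(advantages)))
    betas = [e2 * A / (kappa_i(s, s) * total) for s, e2, A in zip(states, eps_sq, advantages)]
    def g_new(s_query):
        return sum(b * kappa_i(s, s_query) for s, b in zip(states, betas))
    return g_new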

3.4 Policy Search in POMDP Environments
One of the very interesting characteristics of the RKHS policy search framework is learning in POMDP environments. POMDP learning requires the action selection at time $t$, $\mu(a_t|h_t)$, to depend on the history $h_t = \{o_0, a_0, \ldots, o_t\}$, where $o_t$ is an observation. Feature design in POMDP learning is elusive for standard parametric policy search approaches.


Figure 2: Toy domain with the use of (left) the polynomial kernel (RKHS-AC and RKHS-NAC with $c = 1.0, 0.2, 0.01$) and (right) the RBF kernel (RKHS-AC and RKHS-NAC with $\mathrm{var} = 0.01, 0.1, 2.0$); discounted cumulative reward vs. iterations.

The history space is very large and high-dimensional. It is common for parametric methods to put regular features on each state dimension. Hence parametric methods suffer not only from the curse of dimensionality (the number of features is exponential in the number of state dimensions) but also from the curse of history (the number of features is exponential in the length of a history state). In contrast, our RKHS policy search approaches discover history features adaptively and sparsely. One may find this learning similar to non-controlled POMDP learning using conditional embeddings of predictive state representations [Boots et al., 2013]. In our experiments, we fix the length of $h_t$ (the number of history steps) and use an RBF kernel to measure the distance between two histories.
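A minimal sketch of such a history kernel, assuming each fixed-length history is flattened (and zero-padded if necessary) into one vector, which is our own convention.

import numpy as np

def history_kernel(h1, h2, bandwidth):
    """RBF kernel between two fixed-length histories h = (o_0, a_0, ..., o_t),
    each flattened into a single vector."""
    v1 = np.ravel(np.asarray(h1, dtype=float))
    v2 = np.ravel(np.asarray(h2, dtype=float))
    return float(np.exp(-np.sum((v1 - v2) ** 2) / (2.0 * bandwidth ** 2)))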

4 Experiments
We evaluate and compare on four domains (three MDP learning tasks and one POMDP learning task): a Toy domain, the benchmark Inverted Pendulum domain, a simulated PR2 robotic manipulation task, and the POMDP Inverted Pendulum. The presented methods are: RKHS-NAC, RKHS-AC [Lever and Stafford, 2015], PoWER [Kober and Peters, 2011], eNAC [Peters and Schaal, 2008], and actor-critic. All methods use a discount factor $\gamma = 0.99$ for learning and report the cumulative discounted reward. The number of centres used in the RKHS methods is comparable to the number of basis functions used in the parametric methods. All tasks use the RBF kernel. The bandwidth of the RBF kernels is chosen using the median trick.
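The median trick sets the RBF bandwidth from the data; a small sketch of one common variant (the exact variant used in the paper is not specified), which picks the median pairwise distance between sampled states.

import numpy as np

def median_trick_bandwidth(states):
    """Return the median pairwise Euclidean distance between sampled states."""
    X = np.asarray(states, dtype=float)
    if X.ndim == 1:                       # treat scalars as 1-D states
        X = X[:, None]
    dists = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    return float(np.median(dists[np.triu_indices(len(X), k=1)]))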

4.1 Toy Domain
The Toy domain was introduced by Lever and Stafford [Lever and Stafford, 2015]. It has a state space $\mathcal{S} = [-4.0, 4.0]$, an action space $\mathcal{A} = [-1.0, 1.0]$, a starting state $s_0 = 0$, and a reward function $r(s,a) = \exp(-|s - 3|)$. The dynamics are $s_{t+1} = s_t + a_t + \epsilon$, where $\epsilon$ is a small Gaussian noise. We set $N = 10$, $H = 20$. All controllers use 20 centres. The optimal value is around 14.5 [Lever and Stafford, 2015].
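A minimal simulator of this Toy domain is sketched below; the clipping of states and actions to their domains and the noise scale are our assumptions, since the paper only states that $\epsilon$ is a small Gaussian noise.

import numpy as np

def toy_step(s, a, rng, noise_std=0.05):
    """One transition: s' = s + a + eps, with reward r(s, a) = exp(-|s - 3|)."""
    reward = float(np.exp(-abs(s - 3.0)))
    a = float(np.clip(a, -1.0, 1.0))
    s_next = float(np.clip(s + a + rng.normal(0.0, noise_std), -4.0, 4.0))
    return s_next, reward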

We first study the covariant behaviour of the natural RKHS policy gradient method, which is not present in the standard RKHS-AC described in Section 2.3. Both algorithms use line search over a grid of 50 choices. The results, computed over 50 runs, are depicted in Fig. 2. The consistent performance of RKHS-NAC in all settings, for both polynomial and RBF kernels, confirms that RKHS-NAC is covariant.

Figure 3 reports the comparisons of the other methods without line search.

Figure 3: (left) Toy domain; (right) Inverted Pendulum. Discounted cumulative reward vs. iterations for AC, eNAC, PoWER, RKHS-AC, RKHS-NAC, and RKHS-PoWER.

We chose the best step-size for each algorithm to ensure its best performance (AC and RKHS-AC use 0.005; eNAC and RKHS-NAC use 0.1). An arbitrary choice of the step-size may yield instability and poor local optima, which is not a problem for the EM-inspired methods. All RKHS methods perform comparably well and slightly better than AC and eNAC. PoWER improves slowly but very stably.

4.2 Inverted Pendulum
This task [Lever and Stafford, 2015] has an action domain $[-3, 3]$ and a state $s = (\theta, \omega)$ consisting of the angular position $\theta \in [-\pi, \pi]$ and the velocity $\omega \in [-4\pi, 4\pi]$. The initial state is always $s_0 = (-\pi, 0)$. The reward function is $r(s,a) = \exp(-0.5\,\theta^2)$ and the dynamics are $\theta' = \theta + 0.05\,\omega + \epsilon$, $\omega' = \omega + 0.05\,a + 2\epsilon$, where $\epsilon$ is a small Gaussian noise with a standard deviation of 0.02. Each transition applies the above dynamics twice (with the same action value). We use $N = 10$, $H = 100$; the optimal return is roughly 46. We use 40 RBF centres in all controllers. The averaged performance is obtained over 15 runs. If the pendulum is balanced, the return should be at least 25. All methods use carefully selected step-sizes to achieve their best performance.
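For reference, a sketch of one pendulum transition as described above; whether the two applications of the dynamics share a single noise sample, and how the state is kept inside its domain, are our assumptions.

import numpy as np

def pendulum_step(theta, omega, a, rng, noise_std=0.02):
    """Apply the dynamics twice with the same action; reward r(s, a) = exp(-0.5 theta^2)."""
    a = float(np.clip(a, -3.0, 3.0))
    for _ in range(2):
        eps = rng.normal(0.0, noise_std)
        theta = theta + 0.05 * omega + eps
        omega = omega + 0.05 * a + 2.0 * eps
    theta = (theta + np.pi) % (2.0 * np.pi) - np.pi          # wrap to [-pi, pi)
    omega = float(np.clip(omega, -4.0 * np.pi, 4.0 * np.pi))
    return theta, omega, float(np.exp(-0.5 * theta ** 2))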

The mean performance and its first standard deviation for all algorithms are reported in Fig. 3. While RKHS-AC, AC, and PoWER are still struggling to find the optimal policy after 200 iterations, RKHS-NAC, RKHS-PoWER and eNAC achieve a return over 25 after 20 iterations. The performance of RKHS-NAC and eNAC is very promising in this domain, coming very close to the optimal performance.

4.3 POMDP Inverted Pendulum

Figure 4: The PR2 robot is opening a door.

The third task is the POMDP Inverted Pendulum. We keep the original dynamics and reward functions and add an observation function to create a POMDP learning task. We assume that the agent can only observe a noisy angular position $o_t = \theta_t + 2\epsilon$. We use 150 centres for all algorithms, line search over a grid of 50 step-sizes, and a history length of 20 steps. We use only eNAC as the best parametric contender; its state space consists of 3-step history states with 4 centres on each state dimension (46 features). Figure 5 reports the average performance over 10 runs together with the first standard deviations.


Figure 5: Results for (left) the POMDP Inverted Pendulum (eNAC, RKHS-AC, RKHS-NAC, RKHS-PoWER) and (right) the simulated robotic manipulation task (RKHS-AC, RKHS-NAC, RKHS-PoWER); discounted cumulative reward vs. iterations.

RKHS-NAC performs best, as this task has a very large history state space in which the bandwidth setting is very important, and RKHS-NAC is covariant under the coordinate transformations induced by changes of the bandwidth.

4.4 Robotic Manipulation Task

This task is a door-opening task on a simulated PR2 robot (using the ODE physics simulator internally) and is only a comparison of the three RKHS methods. We give a demonstration via kinesthetic teaching on a physical Willow Garage PR2 platform, whose setting is shown in Fig. 4. As action space we define a two-dimensional space that determines the gripper position on the door handle: the first action is the horizontal position of the finger on the door and the second action is the finger opening width. Both actions transform the demonstration trajectory. The resulting trajectory is then executed on the simulator to return a reward value. The reward function is $e^{-(R_1+R_2)}$ and consists of two parts: the first part, $R_1$, measures how close the current trajectory is to the demonstration; the second part, $R_2$, is a binary penalty cost that indicates whether the door is opened successfully. We set $N = 10$, $H = 1$.

Figure 5 reports the average performance over 5 runs with a fixed step-size $\alpha = 0.005$ for RKHS-AC and RKHS-NAC. Each run starts from a different initial pose of the robot. All three algorithms are making improvement; RKHS-PoWER improves more slowly but appears more stable.

5 Conclusion

We have proposed a new policy search framework that enables policy modeling in reproducing kernel Hilbert space. The main insight is a derivation that makes the previous RKHS policy gradient framework a special case and additionally results in two new algorithms: a natural actor-critic in RKHS and an EM-inspired policy search in RKHS. While RKHS-NAC is a fix for the non-covariant RKHS-AC algorithm, RKHS-PoWER is able to mitigate the step-size problem. The experimental results are very promising due to the covariant behaviour and the new application track in POMDPs. The covariant behaviour seems to play a very important role in RKHS methods, because the bandwidth-setting issue is not as easy as in parametric methods, especially in very large problems.

Acknowledgment
This work was supported by the DFG (German Science Foundation) under grant number 409/9-1 within the priority program Autonomous Learning SPP 1527 and the EU-ICT Project 3rdHand 610878. We would like to thank Guy Lever (UCL) for fruitful discussions.

References

[Amari, 1998] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, February 1998.

[Bagnell and Schneider, 2003a] J. Andrew Bagnell and Jeff Schneider. Policy search in reproducing kernel Hilbert space. Technical Report CMU-RI-TR-03-45, Robotics Institute, Pittsburgh, PA, November 2003.

[Bagnell and Schneider, 2003b] J. Andrew Bagnell and Jeff G. Schneider. Covariant policy search. In IJCAI, pages 1019–1024, 2003.

[Baxter and Bartlett, 2001] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. (JAIR), 15:319–350, 2001.

[Boots et al., 2013] Byron Boots, Geoffrey J. Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. In UAI, 2013.

[Kakade, 2001] Sham Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems 14 (NIPS), pages 1531–1538, 2001.

[Kober and Peters, 2011] Jens Kober and Jan Peters. Policy search for motor primitives in robotics. Machine Learning, 84(1-2):171–203, 2011.

[Kober et al., 2010] Jens Kober, Erhan Oztop, and Jan Peters. Reinforcement learning to adjust robot movements to new situations. In Robotics: Science and Systems VI, 2010.

[Kober et al., 2013] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. I. J. Robotic Res., 32(11):1238–1274, 2013.

[Konda and Tsitsiklis, 1999] Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1008–1014, 1999.

[Lever and Stafford, 2015] Guy Lever and Ronnie Stafford. Modelling policies in MDPs in reproducing kernel Hilbert space. In AISTATS, 2015.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

[Micchelli and Pontil, 2005] Charles A. Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.


[Neumann, 2011] Gerhard Neumann. Variational inference for policy search in changing situations. In ICML, pages 817–824, 2011.

[Nordebo et al., 2010] Sven Nordebo, Andreas Fhager, Mats Gustafsson, and Borje Nilsson. Fisher information integral operator and spectral decomposition for inverse scattering problems. Inverse Problems in Science and Engineering, Taylor & Francis, 18(8), 2010.

[Peters and Schaal, 2007] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In ICML, pages 745–750, 2007.

[Peters and Schaal, 2008] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[Reed and Simon, 2003] Michael Reed and Barry Simon. Functional Analysis. Elsevier, 2003.

[Scholkopf and Smola, 2002] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series, MIT Press, 2002.

[Silver et al., 2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin A. Riedmiller. Deterministic policy gradient algorithms. In ICML, pages 387–395, 2014.

[Sutton et al., 1999] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12 (NIPS), pages 1057–1063, 1999.

[van Hoof et al., 2015] Herke van Hoof, Jan Peters, and Gerhard Neumann. Learning of non-parametric control policies with high-dimensional state features. In AISTATS, 2015.

[Vincent and Bengio, 2002] Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.

[Vlassis et al., 2009] Nikos Vlassis, Marc Toussaint, Georgios Kontes, and Savas Piperidis. Learning model-free robot control by a Monte Carlo EM algorithm. Auton. Robots, 27(2):123–130, 2009.

[Williams, 1992] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
