
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 5, NO. 2, APRIL 2020

Learning to Walk a Tripod Mobile Robot Using Nonlinear Soft Vibration Actuators With Entropy Adaptive Reinforcement Learning

Jae In Kim, Mineui Hong, Kyungjae Lee, DongWook Kim, Yong-Lae Park, and Songhwai Oh

Abstract—Soft mobile robots have shown great potential in unstructured and confined environments by taking advantage of their excellent adaptability and high dexterity. However, there are several issues to be addressed in soft robots, such as actuating speed and controllability. In this letter, a new vibration actuator is proposed using the nonlinear stiffness characteristic of a hyperelastic material, which creates continuous vibration of the actuator. By integrating three of the proposed actuators, we also present an advanced soft mobile robot with high degrees of freedom of movement. However, since the dynamic model of a soft mobile robot is generally intractable to obtain, it is difficult to design a controller for the robot. In this regard, we present a method to train a controller using a novel reinforcement learning (RL) algorithm called adaptive soft actor-critic (ASAC). ASAC gradually reduces a parameter called the entropy temperature, which regulates the entropy of the control policy. In this way, the proposed method can narrow down the search space during training and reduce the duration of demanding data collection processes in real-world experiments. To verify the robustness and the controllability of our robot and the RL algorithm, experiments on zig-zagging path tracking and obstacle avoidance were conducted, and the robot successfully finished the missions with only an hour of training time.

Index Terms—Modeling, control, and learning for soft robots, hydraulic/pneumatic actuators, motion and path planning.

Manuscript received September 10, 2019; accepted January 12, 2020. Date of publication February 3, 2020; date of current version February 19, 2020. This letter was recommended for publication by Associate Editor K. Hagelskjaer Petersen and Editor K.-J. Cho upon evaluation of the reviewers' comments. This work was supported in part by the National Research Foundation under Grant NRF-2016R1A5A1938472, in part by the Institute of Information & Communications Technology Planning & Evaluation under Grant 2019-0-01190, [SW Star Lab] Robot Learning: Efficient, Safe and Socially-Acceptable Machine Learning, both funded by the Korea Government (MSIT), and in part by the Technology Innovation Program under Grant 2017-10069072 funded by the Ministry of Trade, Industry & Energy, Korea. (Jae In Kim and Mineui Hong contributed equally to this work.) (Corresponding authors: Yong-Lae Park; Songhwai Oh.)

J. I. Kim, D. Kim, and Y.-L. Park are with the Department of Mechanical Engineering, the Institute of Advanced Machine Design (IAMD), and the Institute of Engineering Research, Seoul National University, Seoul 08826, Republic of Korea (e-mail: [email protected]; [email protected]; [email protected]).

M. Hong, K. Lee, and S. Oh are with the Department of Electrical and Computer Engineering and ASRI, Seoul National University, Seoul 08826, Republic of Korea (e-mail: [email protected]; [email protected]; [email protected]).

This letter has supplementary downloadable material available at https://ieeexplore.ieee.org, provided by the authors.

Digital Object Identifier 10.1109/LRA.2020.2970945

I. INTRODUCTION

Soft mobile robots have great potential in the area of field robotics, since they can perform tasks that are difficult for conventional mobile robots, such as locomotion and navigation in unstructured and confined environments, by utilizing their high adaptability to their surroundings and the dexterity of manipulating their own bodies [1]. Among those, soft mobile robots with pneumatic actuators, which provide relatively high force-to-weight ratios and durability, have been widely employed due to their simplicity in design and light weight [2]. Nevertheless, there is a limitation in deploying those robots to real-world missions due to the relatively slow actuating speed of pneumatic actuators and the difficulty of control with traditional methods, such as feedback control [3], since it is hard to obtain accurate dynamic models of the robots.

To address these issues, a soft membrane vibration actuator was proposed in our prior work [4], composed of a soft membrane, a vibration shaft, and a rigid housing for continuous vibration with a constant input pressure. We were able not only to increase the actuating speed of this actuator but also to demonstrate the controllability of a tripod mobile robot composed of these vibration actuators by learning the dynamic model of the robot using Gaussian process regression (GPR), a non-parametric method convenient for modeling soft robots [5]. However, there still exist several challenges in the design and control of the actuator. First, the gap between the shaft and the hub inside the chamber, needed for vibration, caused air leakage during actuation, since the gap could not be removed or further reduced due to friction. Another issue is the reduced actuation performance when the actuator is in contact with an object, due to the exposed part of the shaft. In addition, the GPR used to model the dynamics of the robot is a supervised method, and therefore the training data had to be manually collected, which made the process expensive and time-consuming.

In this letter, we propose a new design of the vibration actuator by replacing the material of the rigid shaft with an elastomer with nonlinear stiffness. This solves the shaft friction and air leakage problems, since the actuator no longer requires a hub for the shaft. In addition, the soft shaft moves only inside the chamber, allowing for continuous vibration regardless of contact with external objects. A new tripod mobile robot was built using the new actuators, as shown in Fig. 1. For control, we utilize model-free reinforcement learning (RL) with neural networks as function approximators. Since RL autonomously prioritizes control actions based on their potential to obtain higher rewards during training, RL efficiently learns to control a robot with complex dynamics, as demonstrated in [6], [7]. Specifically, we propose a maximum entropy RL method with entropy temperature adaptation.





Fig. 1. A tripod mobile robot consisting of vibration actuators, a DC motor, and a rotation plate.

In general maximum entropy RL algorithms [8], [9], Shannon entropy maximization is employed to encourage exploration by promoting random behaviors. Since the data collected with maximum entropy RL cover a wide range of the state and control spaces, the learned feedback controller can be robust against unexpected situations. However, the maximum entropy framework may hamper the exploitation of the policy, as shown in [8], since it prevents convergence of the policy. To alleviate this issue while taking advantage of entropy maximization, we control the level of exploration by scheduling the entropy temperature α. Thus, at the beginning of learning, our method collects a wide range of data, and it converges as learning progresses by gradually decreasing α. As a result, our algorithm can control the developed soft robot to follow a desired path by training with only 2,500 data instances, whereas 46,000 manually collected data instances were required to learn the whole dynamics model using GPR in the previous work.

II. DESIGN

A. Tripod Mobile Robot

We designed a new mobile robot to enhance the robustness and the dexterity of the previous version [4], as shown in Fig. 2-(a). For robustness against external contacts, a new vibration actuator was designed, and three identical actuators were arranged to form an equilateral triangle (top view) in order to increase the stability during ground contact. For dexterity, the vibration amplitude of the actuator was increased by adding a 100 g weight to the top of each actuator. Also, a direct current (DC) motor was installed at the center of the robot, combined with a rotating plate, to control the direction of rotation. As a result, the mobile robot is capable of making various motions, such as bi-directional rotations and translation, with a combination of the three vibration modes of the actuators. When one of the actuators is driven, the robot moves in the direction of that actuator. Also, the orientation of the robot can be controlled by rotating the motor. In addition, by driving the motor clockwise and vibrating the three actuators at a constant frequency, the robot rotates counterclockwise without translation, and vice versa.

B. Soft Vibration Actuator

As shown in Fig. 2-(b), the actuator consists of a chamber housing, a soft shaft, and a soft membrane. The soft shaft is coupled with the soft membrane, and the membrane is combined with the chamber housing. The chamber housing has an air inlet and an outlet, which are marked by the blue and red circles. The air flow through the chamber housing is shown in Fig. 2-(c) using a cross-sectional view of the chamber housing. In this new design, we solved the friction and air leakage problems of the previous actuator. The soft shaft no longer vibrates along the passageway of the chamber housing, and the head of the shaft is able to completely block the exhaust of the chamber housing. In addition, since the soft shaft is located inside the chamber housing, the actuator can vibrate continuously and robustly regardless of contact with external objects.

C. Vibration Mechanism of Actuator

The proposed actuator makes use of the nonlinear stiffness characteristic of a hyperelastic material (Ecoflex 30), which shows a nonlinear stress-strain behavior over large strains, for vibration. Fig. 3-(a) shows the initial state of the actuator. In this state, the exhaust of the chamber housing is closed due to the length constraint of the soft shaft. The soft membrane then expands until the vertical distance from the soft membrane to the exhaust and the length of the soft shaft become the same, as shown in Fig. 3-(b). If the internal pressure were the same as the atmospheric pressure here, the exhaust valve would open. However, the flange of the soft shaft is pushed upward to close the exhaust by the relatively high internal pressure that has already contributed to the expansion of the soft membrane. With the exhaust closed, the soft membrane continues to expand, as shown in Fig. 3-(c), causing the soft shaft to begin to stretch. Due to the stress formed inside the soft shaft, the flange now begins to be pushed downward. As the membrane further inflates, the soft shaft is stretched further, and this downward force becomes stronger. Due to the nonlinear stiffness of the membrane and the shaft, there is a moment when the downward force becomes larger than the upward force. As a result, the exhaust opens and the upward force applied to the flange is canceled. Then, the stretched soft shaft quickly returns to its original state, as shown in Fig. 3-(d). Also, as the air inside the chamber escapes, the soft membrane returns to its initial state and the soft shaft closes the exhaust again. Through this process, the actuator continues to vibrate.

D. Opening Principle of Exhaust Using Nonlinear Stiffness Characteristic of Hyper-Elastic Structures

In order to explain the principle of opening of the exhaust, the relationship between the upward and downward forces of the soft shaft was analyzed. The nonlinear stiffness varies according to the shape of the hyper-elastic material. As shown in Fig. 4, the soft membrane is assumed to inflate into a partial sphere with respect to the internal pressure p, and the soft shaft elongates with the inflation of the soft membrane.

The elastic strain energy density W in the Ogden model [10] (with μ1 = 0, μ2 = E/6) is given as

W(\lambda_1, \lambda_2, \lambda_3) = \frac{E}{24}\left(\lambda_1^4 + \lambda_2^4 + \lambda_3^4 - 3\right). \quad (1)

The Cauchy stress is given as

\sigma_i = \lambda_i \frac{\partial W}{\partial \lambda_i} - p^{*}, \quad (2)

Authorized licensed use limited to: Univ of Calif San Diego. Downloaded on February 26,2020 at 02:36:16 UTC from IEEE Xplore. Restrictions apply.

Page 3: IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 5, NO. 2, …

KIM et al.: LEARNING TO WALK A TRIPOD MOBILE ROBOT USING NONLINEAR SOFT VIBRATION ACTUATORS 2319

Fig. 2. The design of (a) a tripod robot and (b) a soft vibration actuator. (c) Internal design of the chamber housing represented using cross sections.

Fig. 3. Vibration sequence of the actuator. (a) An initial state. The chamberhousing is closed. (b) The soft membrane is expanded according to the length ofthe soft shaft. The chamber is still closed due to internal pressure. (c) A maximuminflation state which upward force due to internal pressure and downward forcedue to internal stress of the soft shaft are equal. (d) A deflating state. As the airescapes, the soft membrane and soft shaft return to their original positions. Theactuator is vibrating repeatedly through (a) to (d).

Fig. 4. (a) An initial state of hyper-elastic structure. (b) An inflated state byinternal pressure p.

where p* is the hydrostatic pressure determined from the boundary conditions, and σi and λi are the principal stress and strain, respectively.

The hyper-elastic material is assumed to be incompressible, that is,

\lambda_1 \lambda_2 \lambda_3 = 1. \quad (3)

It is also assumed that the soft membrane is subjected to equibiaxial loading (σ1 = σ2 = σ, σ3 = 0). From Equations (2) and (3), the principal strains are

\lambda_1 = \lambda_2 = \lambda, \qquad \lambda_3 = \frac{1}{\lambda^2}. \quad (4)

Using the geometrical relationship of the volume of a spherical cap, V_c = \frac{\pi L^2}{3}(3R_c - L), and the radius of curvature, R_c = \frac{L^2 + r_0^2}{2L}, the principal strain λ is given as

\lambda = \frac{R_c \theta}{r_0} = \frac{L^2 + r_0^2}{2 L r_0} \arcsin\!\left(\frac{2 L r_0}{L^2 + r_0^2}\right), \quad (5)

where L is the inflation length of the actuator [11], r0 is the initial radius of the membrane, and θ = arcsin(r0/Rc).

The total potential energy Πtot is the strain energy W · V0 minus the work of pressure Vc · p, where W is the elastic strain energy density, V0 is the initial volume of the soft circular membrane, Vc is the volume of the spherical cap, and p is the internal pressure; therefore,

\Pi_{tot} = W \cdot V_0 - V_c \cdot p. \quad (6)

At the static equilibrium,

\frac{\partial \Pi_{tot}(L, p)}{\partial L} = 0. \quad (7)

Solving Equation (7), we can finally find the inflation length L as a function of the internal pressure p:

L = f(p). \quad (8)

The soft shaft is assumed to be subjected to uniaxial loading (σ1 = σ, σ2 = σ3 = 0). From Equations (2) and (3), the strains and stress can be expressed as

\lambda_1 = \lambda(p) = \frac{L(p) + h_0}{h_0}, \qquad \lambda_2 = \lambda_3 = \frac{1}{\sqrt{\lambda}},

\sigma(p) = \frac{E}{6}\left(\lambda(p)^4 - \frac{1}{\lambda(p)^2}\right). \quad (9)

Finally, we can find the upward force F_upward and the downward force F_downward, based on the internal pressure p and the stress σ(p) of the shaft acting on the flange, respectively, as shown in Fig. 4:

F_{upward} = p \times A_{red}, \quad (10)




Fig. 5. Maximum inflation state (a) without a weight and (b) with a weight. A weight increases the inflation length of the actuator.

F_{downward} = \sigma(p) \times A_{blue}. \quad (11)

From Equations (10) and (11), it can be noted that F_downward increases more rapidly than F_upward, since the increase of the stress σ(p) is larger than that of p. Therefore, due to this nonlinear characteristic of the soft structure with respect to the internal pressure p, the exhaust opens when F_downward becomes greater than F_upward. From this principle, the actuator can generate vibration.
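To make the force analysis above concrete, the following Python sketch numerically evaluates Equations (1)-(11): it solves the static equilibrium (7) for the inflation length L = f(p) and then evaluates F_upward and F_downward at a few pressures. All material and geometric values (E, r0, t0, h0, A_red, A_blue) are placeholder assumptions rather than the dimensions of the actual actuator, so the printed numbers only illustrate the computation, not the behavior of the real device.

# Numerical sketch of Equations (1)-(11) with assumed placeholder parameters.
import numpy as np
from scipy.optimize import minimize_scalar

E = 1.0e5         # elastic modulus of the elastomer [Pa] (assumed)
r0 = 0.015        # initial membrane radius [m] (assumed)
t0 = 0.002        # initial membrane thickness [m] (assumed)
h0 = 0.010        # initial soft-shaft length [m] (assumed)
A_red = 5.0e-5    # flange area loaded by the internal pressure [m^2] (assumed)
A_blue = 2.0e-5   # shaft cross-section acting on the flange [m^2] (assumed)
V0 = np.pi * r0**2 * t0   # initial volume of the circular membrane

def membrane_stretch(L):
    """Equibiaxial stretch of the inflated membrane, Eq. (5)."""
    ratio = np.clip(2.0 * L * r0 / (L**2 + r0**2), -1.0, 1.0)
    return (L**2 + r0**2) / (2.0 * L * r0) * np.arcsin(ratio)

def strain_energy_density(lam):
    """Ogden energy density, Eq. (1), with lambda1 = lambda2 = lam and lambda3 = 1/lam^2 (Eqs. (3)-(4))."""
    return E / 24.0 * (2.0 * lam**4 + lam**(-8) - 3.0)

def cap_volume(L):
    """Volume of a spherical cap of height L over a base of radius r0."""
    return np.pi * L * (3.0 * r0**2 + L**2) / 6.0

def inflation_length(p):
    """Inflation length L = f(p), Eqs. (6)-(8); the search is limited to L <= r0 (up to a hemisphere)."""
    total_energy = lambda L: strain_energy_density(membrane_stretch(L)) * V0 - cap_volume(L) * p
    return minimize_scalar(total_energy, bounds=(1e-6, r0), method="bounded").x

for p in (3e3, 6e3, 9e3, 12e3):                     # example input pressures [Pa]
    L = inflation_length(p)
    lam_shaft = (L + h0) / h0                       # uniaxial stretch of the shaft, Eq. (9)
    sigma = E / 6.0 * (lam_shaft**4 - lam_shaft**(-2))
    f_up, f_down = p * A_red, sigma * A_blue        # Eqs. (10) and (11)
    print(f"p = {p/1e3:4.1f} kPa  L = {L*1e3:5.2f} mm  F_up = {f_up:5.3f} N  F_down = {f_down:5.3f} N")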

E. Effect of Weight

If an additional weight is applied to the actuator, the vertical displacement of the chamber due to inflation will be smaller than that without the weight. In other words, a higher pressure is required to inflate the actuator to the same volume with a weight, which means an increase in the upward force applied to the flange of the soft shaft, so an additional downward force is required to open the exhaust. As a result, if a weight is added to the top of the actuator, the inflation displacement of the actuator increases, as shown in Fig. 5.

III. LEARNING FOR CONTROL

To control the proposed robot, it is essential to design a feedback controller. However, the complex dynamics of the robot make it difficult to design a controller explicitly. In particular, a motion generated by the soft membrane vibration actuators is affected by multiple complex physical properties of the environment, such as the pneumatic lines, the friction and flatness of the ground, or the surface condition of the membrane. Analyzing a dynamic system considering all these factors is a demanding task, if not an impossible one. To handle this issue, we employ a model-free RL method which can learn the parameters of a feedback controller even if the dynamics model is unknown. Furthermore, since it is difficult to collect a large amount of data for learning directly from the robot in the real world, the learning algorithm needs to be sample efficient, while exploring the search space enough to find the optimal control of the robot. In this regard, we utilize a novel RL method, which is sample efficient and robust to hyperparameters, to learn a feedback controller without knowledge of the dynamics model.

A. Problem Formulation

Our controller aims at making the robot move to a desired position while maintaining its orientation toward the target. For that, we first define a state space, which represents the current status of the robot, and an action space, the possible control inputs which can be chosen by the controller. The proposed robot has three soft membrane vibration actuators and one motor for controlling the angular momentum of the robot. Hence, our action space (or control input) is defined as a 4D vector a_t = [p_1, p_2, p_3, δM], where p_i indicates the input pressure of each vibration actuator, M indicates the angular velocity of the motor, and δM is the change in the motor speed. Note that, since the motor has a control delay, a drastic change of the motor signal may induce unstable motion and inconsistent movements. Hence, we smoothly change the motor speed by controlling δM rather than directly controlling M. Then, the state of the feedback controller is defined as s_t := [δθ_t, d_t, M_t], where δθ_t := θ_g − θ_t is the angular difference between the heading and the goal direction, d_t := \sqrt{x_t^2 + y_t^2} is the Euclidean distance to the desired position, and M_t is the current motor speed.

We define the feedback controller as a Gaussian policy function, (μ_t, σ_t) = f_φ(s_t), where μ_t and σ_t are the mean and standard deviation of a Gaussian distribution, and φ is the parameter of the controller. In particular, we model f_φ as a neural network, which has shown high performance in modeling complex nonlinear functions. Since the robot needs to explore the state and action spaces during the training phase, we sample a control from the Gaussian policy as

a_t \sim \pi_\phi(a_t|s_t) := \mathcal{N}(a_t; f_\phi(s_t)). \quad (12)

In the testing phase, the mean μ_t is used as the feedback control.

We also design a reward function r(s_t), which assigns a higher score as a control reduces the gap between the robot's current state and the desired state:

r(s_t) := -\sqrt{\delta x_t^2 + \delta y_t^2} - \delta\theta_t + c, \quad (13)

where c is an alive reward (c = 2 is used in the experiments).

During the learning phase, the robot starts from an initial state s_0 ∼ d(s_0), samples an action a_t from the Gaussian controller π_φ(·|s_t), and executes the sampled control. As a result of the control, the robot transitions to the next state s_{t+1} according to the unknown dynamics P(s_{t+1}|a_t, s_t) and receives a reward r_{t+1} = r(s_{t+1}). As we sequentially control the robot, a trajectory of states, actions, and rewards τ = (s_0, a_0, s_1, r_1, a_1, s_2, r_2, ...) is generated. Finally, the purpose of an RL algorithm is to maximize the following objective function (also known as the expected return):

\underset{\pi \in \Pi}{\text{maximize}} \;\; \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=1}^{\infty} \gamma^t r_t\right], \quad (14)

by updating the parameter φ based on sampled trajectories, where γ ∈ (0, 1) is a discount factor. If the robot achieves the maximum expected return, it indicates that we have found an optimal feedback controller.
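As a concrete illustration of this formulation, the short sketch below evaluates the state s_t, the reward of Equation (13), and a truncated version of the return in Equation (14) for a hypothetical logged trajectory; the pose values and the discount factor are made up, and only c = 2 follows the text.

# Sketch of the state s_t, the reward (13), and a truncated return (14).
import numpy as np

C_ALIVE = 2.0    # alive reward c = 2, as used in the experiments
GAMMA = 0.99     # discount factor gamma in (0, 1) (assumed value)

def state(dx, dy, theta_goal, theta_heading, motor_speed):
    """s_t = [delta_theta_t, d_t, M_t] as defined above."""
    d_theta = theta_goal - theta_heading
    d = np.hypot(dx, dy)                  # Euclidean distance to the desired position
    return np.array([d_theta, d, motor_speed])

def reward(dx, dy, d_theta):
    """r(s_t) = -sqrt(dx^2 + dy^2) - delta_theta + c, Eq. (13)."""
    return -np.hypot(dx, dy) - d_theta + C_ALIVE

# Hypothetical logged offsets (dx, dy) [m] and heading errors delta_theta [rad].
offsets = [(0.20, 0.05), (0.12, 0.03), (0.05, 0.01)]
heading_errors = [0.6, 0.3, 0.1]

rewards = [reward(dx, dy, dth) for (dx, dy), dth in zip(offsets, heading_errors)]
ret = sum(GAMMA**t * r for t, r in enumerate(rewards, start=1))   # Eq. (14), truncated to 3 steps
print("rewards:", np.round(rewards, 3), " discounted return:", round(ret, 3))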

B. Maximum Entropy Reinforcement Learning Preliminaries

Maximum entropy reinforcement learning [12]–[14] maximizes both the sum of expected rewards and the Shannon entropy of the policy distribution, i.e., H(π_φ(·|s)) = E_{a∼π_φ}[−log π_φ(a|s)], which is also known as an entropy bonus.




An optimal policy π*_α of the maximum entropy RL problem is defined by

\pi^*_\alpha := \arg\max_{\pi} \; \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right)\right], \quad (15)

where r(s, a) = E_{s'∼P(·|s,a)}[r(s')] and the entropy temperature α ∈ [0, ∞) is a parameter that determines the relative importance of the entropy with respect to the reward. Note that the objective of π*_α depends on the temperature α, and if α = 0, the objective of (15) becomes the original objective (14). On the contrary, if α is large, the robot tries random actions to maximize the Shannon entropy.

When the dynamics model P is fully known, Haarnoja et al. [8] proposed a soft policy iteration which can achieve the optimal policy of (15). In the soft policy iteration, the soft state value V^π_α and the soft state-action value (or soft action value) Q^π_α are defined as

Q^\pi_\alpha(s, a) := \mathbb{E}_{\tau \sim P, \pi}\left[r(s_0, a_0) + \gamma V^\pi_\alpha(s_1) \,\middle|\, s_0 = s, a_0 = a\right],

V^\pi_\alpha(s) := \mathbb{E}_{\tau \sim P, \pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\right) \,\middle|\, s_0 = s\right], \quad (16)

where these values indicate the expected sum of rewards and the entropy. The soft policy iteration iteratively evaluates Q^π_α and updates π based on Q^π_α. Furthermore, it has been shown in [12] that the soft policy iteration converges to the optimality condition of (15), which is given by

\pi^*_\alpha(a|s) = \frac{\exp\left(\frac{1}{\alpha} Q^*_\alpha(s, a)\right)}{\int_{\mathcal{A}} \exp\left(\frac{1}{\alpha} Q^*_\alpha(s, a')\right) da'}, \quad (17)

which is called the soft Bellman optimality (SBO) equation [15].

In [8], Haarnoja et al. extended the soft policy iteration to the soft actor-critic (SAC) method, which can be applied to RL problems. SAC has benefits in terms of exploration and empirically shows good efficiency. However, there is a disadvantage in that the algorithm is especially sensitive to the entropy temperature α, as mentioned in [8]. Since the temperature must be tuned manually for each learning task, it is tricky to apply the algorithm to learning in real-world experiments.
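The SBO policy (17) is a softmax over the soft action values with temperature α. The following toy sketch (a discrete example of ours, not taken from the letter) evaluates it for three hypothetical Q values and shows how the distribution sharpens toward the greedy action as α shrinks, which is exactly the effect exploited by the temperature schedule introduced next.

# Toy evaluation of the SBO policy (17) over a discrete action set.
import numpy as np

def sbo_policy(q_values, alpha):
    """pi*_alpha(a) proportional to exp(Q*_alpha(a) / alpha), Eq. (17)."""
    logits = (q_values - q_values.max()) / alpha    # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

q = np.array([1.0, 0.8, 0.2])                       # hypothetical soft action values
for alpha in (1.0, 0.2, 0.01):
    print(f"alpha = {alpha:>4}:", np.round(sbo_policy(q, alpha), 3))
# As alpha approaches zero, the distribution concentrates on argmax_a Q(a),
# i.e., the maximum entropy objective falls back to the original objective (14).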

C. Entropy Temperature Adaptation

While SAC empirically showed that a large entropy helps exploration, there exists a drawback of entropy maximization. From the optimality condition (17), we can observe that, as α goes to zero, the optimal policy of the maximum entropy RL problem (15) converges to the optimal policy of the original RL problem (14), since the effect of the entropy decreases as α goes to zero. From this observation, it can be seen that, by gradually reducing the entropy temperature to zero, we can recover the original optimal policy. Hence, we schedule the entropy temperature to decrease from an initial value to zero.

When it comes to the reduction of α, we should consider the estimation error of Q^π_α. As mentioned in Section III-B, due to the absence of the dynamics model in the RL problem, the expectation in (16) is intractable. Thus, we estimate V^π_α and Q^π_α by training neural networks, similarly to other existing methods [6], [8], [12], [16]. Since we update π based on Q^π_α, if we drastically reduce α to zero, the policy can become too greedy with respect to mis-estimated values and cannot explore the action space thoroughly enough to find a better policy. Therefore, we present a method to schedule the temperature using the trust region method [16]. The trust region method ensures a monotonic improvement of performance by limiting the Kullback-Leibler (KL) divergence between the old and new policies using a threshold δ. Applying the concept of the trust region to our approach, we can expect the algorithm to automatically adjust the temperature by an amount that is guaranteed not to hamper sufficient exploration.

The proposed method consists of two parts: first, for a given α_m, we obtain a near-optimal policy π*_{α_m} by running SAC. Second, α_m is reduced using the trust region method. Hence, the policy learned by SAC converges to π*_{α_m} and, as α_m decreases, π*_{α_m} converges to π*, which is the optimal policy of the original RL problem. Note that SAC converges to π*_{α_m} within a small number of iterations in practice; the detailed settings can be found in Section IV.

Now, let Q^π_α denote an estimated soft action value function of a policy π, Q^π = Q^π_α|_{α=0} denote an estimated state-action value function without considering the entropy, and ρ^π(s) = (1 − γ) \sum_{t=0}^{\infty} γ^t P(s_t = s) denote the discounted visitation frequency of a state s. Then, we update α_m by solving the following optimization problem:

\underset{\alpha_{m+1}}{\text{maximize}} \;\; \mathbb{E}_{s \sim \rho^{\pi_{\alpha_m}},\, a \sim \pi_{\alpha_m}}\left[\frac{\pi_{\alpha_{m+1}}(a|s)}{\pi_{\alpha_m}(a|s)}\, Q^{\pi}_{\alpha_m}(s, a)\right]

\text{subject to} \;\; \mathbb{E}_{s \sim \rho^{\pi_{\alpha_m}}}\left[D_{KL}\left(\pi_{\alpha_m} \,\|\, \pi_{\alpha_{m+1}}\right)\right] \le \delta, \quad (18)

where D_{KL}(π_{α_m} || π_{α_{m+1}}) indicates the KL divergence, defined as \int_{\mathcal{A}} \pi_{\alpha_m}(a|s) \log\frac{\pi_{\alpha_m}(a|s)}{\pi_{\alpha_{m+1}}(a|s)}\, da, which measures the difference between the two policies π_{α_m} and π_{α_{m+1}} for a state s. Note that π_{α_m} indicates the optimal policy of (15) when α = α_m, which is obtained by SAC. Now, note that

\frac{d}{d\alpha_{m+1}} \mathbb{E}\left[\frac{\pi_{\alpha_{m+1}}(a|s)}{\pi_{\alpha_m}(a|s)}\, Q^{\pi}_{\alpha_m}(s, a)\right] = -\frac{1}{\alpha_m^2} \mathbb{E}\left[\left(Q^{\pi}_{\alpha_m}(s, a) - V^{\pi}_{\alpha_m}(s)\right)^2\right] \le 0, \quad (19)

where V^{\pi}_{\alpha_m}(s) = \int_{\mathcal{A}} Q^{\pi}_{\alpha_m}(s, a)\, \pi_{\alpha_m}(a|s)\, da, and \mathbb{E}[\cdots] denotes \mathbb{E}_{s \sim \rho^{\pi_{\alpha_m}},\, a \sim \pi_{\alpha_m}}[\cdots] for simplicity. Equation (19) means that the objective always increases as α_{m+1} decreases. Thus, α_{m+1} lies on the boundary of the KL constraint, and we can solve Equation (18) with a quadratic approximation of the KL divergence using a Taylor expansion:

\mathbb{E}\left[D_{KL}\left(\pi_{\alpha_m} \,\|\, \pi_{\alpha_{m+1}}\right)\right] \approx \frac{(\alpha_{m+1} - \alpha_m)^2}{2\alpha_m^4}\, \mathbb{E}\left[\left(Q^{\pi}_{\alpha_m}(s, a) - V^{\pi}_{\alpha_m}(s)\right)^2\right] = \delta. \quad (20)

Then, we can obtain α_{m+1} as

\alpha_{m+1} = \alpha_m - \alpha_m^2 \sqrt{\frac{2\delta}{\mathbb{E}\left[A^{\pi}_{\alpha_m}(s, a)^2\right]}}, \quad (21)

where A^{\pi}_{\alpha_m}(s, a) = Q^{\pi}_{\alpha_m}(s, a) - V^{\pi}_{\alpha_m}(s).
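The update (21) only requires samples of the advantage A^π_{α_m}(s, a). A minimal sketch follows, with hypothetical advantage samples standing in for the values produced by the Q_λ and V_ω networks introduced in Section III-D; the KL threshold δ is an assumed value.

# Trust-region temperature update, Eq. (21), on hypothetical advantage samples.
import numpy as np

def update_temperature(alpha_m, advantages, delta):
    """alpha_{m+1} = alpha_m - alpha_m^2 * sqrt(2 * delta / E[A^2]), clipped at zero."""
    mean_sq_adv = np.mean(np.square(advantages))
    return max(alpha_m - alpha_m**2 * np.sqrt(2.0 * delta / mean_sq_adv), 0.0)

advantages = np.random.default_rng(0).normal(0.0, 1.0, size=256)  # placeholder A(s, a) samples
alpha = 0.2                                                       # initial temperature used in the letter
for m in range(5):
    alpha = update_temperature(alpha, advantages, delta=0.01)     # delta is an assumed KL threshold
    print(f"alpha_{m + 1} = {alpha:.4f}")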




Algorithm 1: Adaptive Soft Actor-Critic.

Initialize parameter vectors ψ, ψ̄, θ_i, φ, λ, ω, the entropy coefficient α, and the replay buffer D.
for each iteration do
  for each environment step do
    Sample a transition {s_t, a_t, r(s_t, a_t), s_{t+1}} and store it in the replay buffer D.
  end
  for each gradient step do
    Minimize J_{V^α}(ψ), J_{Q^α}(θ_{1,2}), J_Q(λ), J_V(ω), and J_π(φ) using stochastic gradient descent.
    ψ̄ ← (1 − τ)ψ̄ + τψ
  end
  if π_φ has converged then
    Update α with the trust region method.
  end
end
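For orientation, the following runnable Python skeleton mirrors only the control flow of Algorithm 1 (environment steps into a replay buffer, gradient steps, and a convergence-gated temperature decrease); the environment, the losses, and the temperature update are placeholder stubs, not the authors' implementation.

# Runnable skeleton of the control flow of Algorithm 1; every update is a stub.
import random
from collections import deque

random.seed(0)
replay_buffer = deque(maxlen=100_000)
alpha = 0.2          # initial entropy temperature
eps = 0.05           # convergence threshold on J_pi (assumed)
prev_loss = None
state = [0.0, 0.2, 0.0]                     # [delta_theta, d, M]

def env_step(s, a):
    """Placeholder dynamics: returns (next_state, reward)."""
    return [random.random() for _ in s], random.uniform(-1.0, 1.0)

def gradient_step(batch, alpha):
    """Placeholder for minimizing J_V_alpha, J_Q_alpha, J_Q, J_V, and J_pi; returns a stand-in for J_pi."""
    return -sum(r for (_, _, r, _) in batch) / len(batch)

def update_alpha(alpha):
    """Placeholder for the trust-region update of Eq. (21)."""
    return max(alpha - 0.01, 0.0)

for iteration in range(50):
    for _ in range(10):                     # environment steps
        action = [random.uniform(-1.0, 1.0) for _ in range(4)]
        next_state, r = env_step(state, action)
        replay_buffer.append((state, action, r, next_state))
        state = next_state
    for _ in range(10):                     # gradient steps on sampled minibatches
        batch = random.sample(list(replay_buffer), min(8, len(replay_buffer)))
        loss = gradient_step(batch, alpha)
        # target network update psi_bar <- (1 - tau) * psi_bar + tau * psi omitted in this stub
    if prev_loss is not None and (prev_loss - loss) / (abs(prev_loss) + 1e-8) < eps:
        alpha = update_alpha(alpha)         # policy considered converged: shrink the temperature
    prev_loss = loss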

Furthermore, we theoretically prove that the policy converges to the optimal policy when applying SAC with the sequence of temperatures {α_m}.

Theorem 1: Consider a sequence of coefficients {α_m} generated from Equation (21). Then, repeated application of the soft policy iteration with {α_m}, from any initial policy π_0, converges to an optimal policy π*.

Theorem 1 indicates that scheduling the temperature with the proposed method ensures not only the improvement of the performance but also the optimality of the algorithm. The detailed derivations of all the equations and theorems in this section are included in the supplementary material [17].

D. Algorithm

In this section, we present an actor-critic algorithm, called adaptive soft actor-critic (ASAC), which schedules the entropy temperature using the proposed trust region method. To handle continuous state and action domains, ASAC maintains seven networks to model value and policy functions: a soft action value Q^α_{θ_{1,2}}, a soft state value V^α_ψ, a target state value V^α_ψ̄, an action value Q_λ, a state value V_ω, and a policy π_φ, where the subscripts denote the network parameters of each function. The soft action value and the soft state value functions are needed to update the policy, and the action value and the state value functions are needed to update the temperature α. We also utilize a replay buffer D, which stores every transition (s_t, a_t, r_{t+1}, s_{t+1}) obtained by interaction with the environment.

The objective functions of the soft action value and the soft state value are defined in the same way as in SAC [8]:

J_{V^\alpha}(\psi) = \mathbb{E}_{\mathcal{D}}\left[\tfrac{1}{2}\left(V^\alpha_\psi(s) - \mathbb{E}_{a \sim \pi_\phi}\left[\min_i Q^\alpha_{\theta_i}(s, a) - \alpha \log \pi_\phi(a|s)\right]\right)^2\right],

J_{Q^\alpha}(\theta_i) = \mathbb{E}_{\mathcal{D}}\left[\tfrac{1}{2}\left(Q^\alpha_{\theta_i}(s, a) - \left(r + \gamma V^\alpha_{\bar{\psi}}(s')\right)\right)^2\right].

The target value network parameter ψ̄ is then updated by an exponential moving average of the value network parameter ψ.
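A sketch of how these two objectives and the target update could look in PyTorch; the network sizes, the batch, and the placeholder log-probabilities are assumptions, and the snippet only evaluates the losses without running an optimizer.

# PyTorch sketch of J_{V^alpha}, J_{Q^alpha}, and the target network update (placeholder batch).
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 3, 4, 300
q1 = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
q2 = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
v = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
v_target = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
v_target.load_state_dict(v.state_dict())

alpha, gamma, tau = 0.2, 0.99, 0.005
s = torch.randn(8, state_dim)            # placeholder minibatch from the replay buffer D
a = torch.rand(8, action_dim) * 2 - 1
r = torch.randn(8, 1)
s_next = torch.randn(8, state_dim)
log_pi = -torch.rand(8, 1)               # placeholder log pi_phi(a|s) from the policy network

# J_{V^alpha}(psi): regress V toward min_i Q_i(s, a) - alpha * log pi_phi(a|s).
sa = torch.cat([s, a], dim=1)
q_min = torch.min(q1(sa), q2(sa))
v_loss = 0.5 * ((v(s) - (q_min - alpha * log_pi).detach()) ** 2).mean()

# J_{Q^alpha}(theta_i): regress each Q_i toward r + gamma * V_target(s').
q_target = (r + gamma * v_target(s_next)).detach()
q_loss = 0.5 * ((q1(sa) - q_target) ** 2).mean() + 0.5 * ((q2(sa) - q_target) ** 2).mean()

# Exponential moving average update of the target value network parameters.
with torch.no_grad():
    for p_target, p in zip(v_target.parameters(), v.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)

print(float(v_loss), float(q_loss))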


We also model the policy function as the hyperbolic tangent of a Gaussian random variable, i.e., a := f_φ(s, ε) = tanh(μ_φ(s) + ε σ_φ(s)), where μ_φ(s) and σ_φ(s) are the outputs of π_φ and ε ∼ N(0, I). To approximate the soft policy iteration, the policy function is trained to minimize the expected KL divergence given by

\mathbb{E}_{s \sim \mathcal{D}}\left[D_{KL}\left(\pi_\phi(\cdot|s) \,\middle\|\, \frac{\exp\left(\frac{Q^\alpha_\theta(s, \cdot)}{\alpha}\right)}{\int_{\mathcal{A}} \exp\left(\frac{Q^\alpha_\theta(s, a')}{\alpha}\right) da'}\right)\right]. \quad (22)

Then, using the reparameterization trick as in [8], minimizing (22) can be reduced to minimizing the following objective:

J_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D},\, \varepsilon \sim \mathcal{N}}\left[\alpha \log \pi_\phi(f_\phi(\varepsilon; s)|s) - Q^\alpha_{\theta_1}(s, f_\phi(\varepsilon; s))\right]. \quad (23)
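A sketch of the tanh-squashed Gaussian policy and the reparameterized objective (23) in PyTorch; the layer sizes, the clamping range of the log standard deviation, and the stub Q network are our own assumptions.

# PyTorch sketch of the tanh-Gaussian policy and the reparameterized objective (23).
import math
import torch
import torch.nn as nn

state_dim, action_dim, hidden, alpha = 3, 4, 300, 0.2
trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
mu_head = nn.Linear(hidden, action_dim)
log_std_head = nn.Linear(hidden, action_dim)
q1 = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))  # stub Q^alpha_theta1

def sample_action(s):
    """a = tanh(mu_phi(s) + eps * sigma_phi(s)) together with log pi_phi(a|s)."""
    h = trunk(s)
    mu, log_std = mu_head(h), log_std_head(h).clamp(-5.0, 2.0)   # clamping range assumed
    std = log_std.exp()
    eps = torch.randn_like(mu)               # reparameterization trick: eps ~ N(0, I)
    u = mu + eps * std                       # pre-squash Gaussian sample
    a = torch.tanh(u)
    # log pi(a|s) = Gaussian log-density of u plus the change-of-variables term of tanh.
    gauss_logp = (-0.5 * ((u - mu) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)).sum(dim=1, keepdim=True)
    log_pi = gauss_logp - torch.log(1 - a ** 2 + 1e-6).sum(dim=1, keepdim=True)
    return a, log_pi

s = torch.randn(8, state_dim)                # placeholder batch of states from D
a, log_pi = sample_action(s)
j_pi = (alpha * log_pi - q1(torch.cat([s, a], dim=1))).mean()    # Eq. (23)
j_pi.backward()                              # gradients flow through the sampled action
print(float(j_pi))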

Furthermore, we estimate V^{π_φ} and Q^{π_φ} to compute (21) by adding two networks, V_ω and Q_λ, which are trained to minimize the squared residual errors:

J_V(\omega) = \mathbb{E}_{s \sim \mathcal{D}}\left[\tfrac{1}{2}\left(V_\omega(s) - \mathbb{E}_{a \sim \pi_\phi}\left[Q_\lambda(s, a)\right]\right)^2\right],

J_Q(\lambda) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_\lambda(s, a) - \left(r + \gamma V_\omega(s')\right)\right)^2\right]. \quad (24)

Finally, we determine whether the policy has converged by monitoring the change of J_π(φ). If the relative change in J_π(φ) is less than a threshold ε, i.e., \frac{J_\pi(\phi_{old}) - J_\pi(\phi_{new})}{J_\pi(\phi_{old})} < \varepsilon, we assume that π_φ has converged and decrease the temperature using (21). In practice, the proposed criterion can be satisfied not only when the policy converges to π*_α, but also when it struggles to find a better policy. In that case, reducing the coefficient leads to more efficient exploration and helps the policy escape from a sub-optimal policy.
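A minimal sketch of this convergence test; the threshold value is assumed, and we divide by the absolute value of J_π(φ_old) (our choice, not stated in the letter) so the test also behaves sensibly when the objective is negative.

# Sketch of the convergence test that gates the temperature decrease.
def policy_converged(j_old, j_new, threshold=0.05):
    """True if the relative decrease of J_pi(phi) falls below the threshold (value assumed)."""
    return (j_old - j_new) / abs(j_old) < threshold

print(policy_converged(1.00, 0.98))   # 2% improvement: considered converged -> True
print(policy_converged(1.00, 0.60))   # 40% improvement: keep optimizing at the current alpha -> False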

IV. EXPERIMENT

A. Platform Setup

Our experiment setup consists of a single workstation, a camera, and the proposed tripod robot. The workstation uses an Intel Core i5-6600 quad-core CPU and a Titan X GPU for learning the network parameters. Furthermore, an Intel RealSense D435 camera, mounted one meter above the robot, is used for sensing the robot. The camera captures an image of the robot and passes it to the workstation, where the position and heading direction of the robot are extracted by detecting colored markers on the robot. To control the robot, the pneumatic pressure is regulated by a pressure regulator (SMC ITV2050). Fig. 6 shows how the components of our experiment setup interact with each other.

B. Reinforcement Learning Setup

In the real-robot experiments, we train a feedback controller for the proposed robot using the reward function defined in (13). We compare the proposed method to SAC with automatic entropy adjustment (AEA) [6] and to SAC with fixed temperatures (α = 0.01 and α = 0.2). SAC-AEA also controls the temperature α of the entropy, where the temperature is adjusted to keep the entropy greater than a predefined threshold.




Fig. 6. Given a state of the robot extracted from the image taken by the camera, the next action is sampled from the policy of the controller. Also, a replay buffer stores transitions, which are used for training the controller.

The SAC variants with fixed α = 0.01 and α = 0.2 are designed to verify the effect of the entropy.

All of the value and policy functions of all algorithms are parameterized with a single hidden layer of 300 units, and we used the ADAM optimizer [18] to learn the parameters. The initial α of ASAC is set to 0.2 based on a simulation experiment, whose result is included in the supplementary material [17], and the target entropy of SAC-AEA is set to the negative of the action dimension, as suggested in [6]. For all algorithms, we train each controller for 50 episodes. For each episode, a goal point is sampled about 20 centimeters away from the robot in a uniformly random direction within ±π/4 rad, and the controller is required to move the robot to the goal in 50 steps. At each step, the controller samples an action and executes it for one second, i.e., a control frequency of 1 Hz. Therefore, a total of 2,500 steps of control actions (≈50 minutes) are used for training.
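For reference, the episode protocol described above (goals about 20 cm away within ±π/4 rad of the heading, 50 episodes of 50 steps at 1 Hz) can be sketched as follows; the function and constant names are ours, not taken from the authors' code.

# Sketch of the episode/goal-sampling protocol of Section IV-B (names are ours).
import math
import random

GOAL_DISTANCE = 0.20         # goal sampled about 20 cm away from the robot
HEADING_RANGE = math.pi / 4  # uniformly random direction within +/- pi/4 rad of the heading
EPISODES = 50                # each controller is trained for 50 episodes
STEPS_PER_EPISODE = 50       # 50 control steps per episode, one action per second (1 Hz)

def sample_goal(robot_x, robot_y, robot_heading):
    """Place a goal 20 cm from the robot within +/- pi/4 rad of its heading."""
    bearing = robot_heading + random.uniform(-HEADING_RANGE, HEADING_RANGE)
    return (robot_x + GOAL_DISTANCE * math.cos(bearing),
            robot_y + GOAL_DISTANCE * math.sin(bearing))

print(sample_goal(0.0, 0.0, 0.0))
print("total control steps:", EPISODES * STEPS_PER_EPISODE)   # 2,500 steps, about 50 minutes in the letter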

C. Robustness, Accuracy, and Success Rate Index

We compare the performance of the feedback controllers learned by four algorithms: SAC with α = 0.01, SAC with α = 0.2, SAC-AEA, and ASAC. For the comparative evaluation, we use each controller to track goal points placed in 20 different directions (from −π/2 to π/2 rad). Ten of the target points in this evaluation set lie in directions more than π/4 (or less than −π/4) away from the initial heading of the robot, which are not included in the training episodes. Through this experiment, the robustness of the learned controllers can be verified by evaluating them on the unexperienced tasks. Also, the accuracy of the controllers can be evaluated by measuring the root mean square (RMS) of the distance between the goal and the robot (d = \sqrt{\delta x^2 + \delta y^2}) and of the angular difference between the heading and the goal direction (δθ = θ_g − θ_t) over whole episodes. In addition, we measure the success rate, where the robot is assumed to succeed in tracking when it reaches within three centimeters of the target point.
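The evaluation metrics described here can be sketched as follows; the per-episode logs are hypothetical and only illustrate how RMS(d), RMS(δθ), and the success rate would be computed from recorded trajectories.

# Sketch of the evaluation metrics: RMS(d), RMS(delta_theta), and success rate.
import numpy as np

SUCCESS_RADIUS = 0.03   # success if the robot comes within three centimeters of the target

def rms(values):
    return float(np.sqrt(np.mean(np.square(values))))

def evaluate(episode_logs):
    """episode_logs: list of dicts with per-step 'dx', 'dy', 'dtheta' arrays (hypothetical)."""
    d_all, dtheta_all, successes = [], [], 0
    for log in episode_logs:
        d = np.hypot(log["dx"], log["dy"])
        d_all.extend(d)
        dtheta_all.extend(log["dtheta"])
        successes += int(d.min() <= SUCCESS_RADIUS)
    return rms(d_all), rms(dtheta_all), successes / len(episode_logs)

logs = [  # two hypothetical tracking episodes of 50 steps each
    {"dx": np.linspace(0.20, 0.01, 50), "dy": np.linspace(0.05, 0.00, 50), "dtheta": np.linspace(0.6, 0.05, 50)},
    {"dx": np.linspace(0.20, 0.06, 50), "dy": np.linspace(-0.04, 0.01, 50), "dtheta": np.linspace(-0.5, 0.1, 50)},
]
rms_d, rms_dtheta, success_rate = evaluate(logs)
print(f"RMS(d) = {rms_d:.3f} m, RMS(dtheta) = {rms_dtheta:.3f} rad, success rate = {success_rate:.0%}")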

D. Robot Controllability

In order to show the controllability and dexterity of the proposed tripod robot in advanced missions with the controller learned by our algorithm, we demonstrate a zig-zagging path-following experiment and an obstacle avoidance experiment. A path is given by a human or a planning algorithm, and the robot tracks the given waypoints in order.

TABLE I: COMPARISON OF THE PERFORMANCE OF THE LEARNED CONTROLLERS

V. RESULT

A. Learning Feedback Controller

We evaluate the return, i.e., the cumulative sum of rewards R = \sum_{t=1}^{T} r_t, to check the performance of the controller during training. Fig. 7(a) shows how the sum of rewards changes as the number of sampled transitions increases for each learning algorithm; ASAC shows the fastest convergence and the highest sum of rewards compared to the other algorithms. In particular, the controller learned by ASAC was able to reach any target point after only about 1,500 steps (≈30 minutes) of training. Furthermore, as the controller is trained for more episodes, the movement of the robot becomes more stable and faster. Since ASAC gradually reduces the influence of the entropy term as α decreases, ASAC shows stable convergence. In contrast, since SAC-AEA uses a constraint on the entropy, it hampers increasing the accuracy of the feedback controller. In this regard, we can conclude that ASAC makes the controller more precise with fewer training steps, while the constrained entropy in SAC-AEA hampers accurate control.

B. Robustness, Accuracy, and Success Rate of Learned Feedback Controllers

The results of the experiments for evaluating the performance of the learned controllers are shown in Table I. The accuracy of the controllers is measured as the root mean squared difference from the desired position (RMS(d)) and heading angle (RMS(δθ)). Also, we present the success rate for both the experienced (|δθ_0| ≤ π/4) and unexperienced scenarios (|δθ_0| > π/4). As a result, ASAC shows the best accuracy and success rate over the other algorithms. In particular, ASAC shows a 100% success rate even in the unexperienced scenarios, while SAC with less entropy maximization (α = 0.01) shows only a 30% success rate. This result indicates that ASAC takes advantage of maximizing entropy in that it explores diverse policies and is robust under unexpected situations, and it also makes the controller more accurate by automatically reducing the entropy in the end.

C. Controllability

In the zig-zagging path tracking and obstacle avoidance experiments, we used the rapidly-exploring random tree (RRT) [19] algorithm for planning the path, and the controller trained using ASAC is used to control the robot.




Fig. 7. (a) Comparison of the performance of the controllers trained using SAC with fixed temperatures (α = 0.2, 0.01), SAC-AEA, and ASAC. The cumulative sum of rewards is evaluated every 10 episodes, and five different scenarios are used for each evaluation. We repeated the whole training process five times for each algorithm; the mean value is plotted as the solid line and one standard deviation as the shaded region. (b) The tripod robot follows the zig-zagging path. (c) The tripod robot avoids the obstacle by following the planned path. In (b) and (c), blue circles are planned waypoints, the green line represents the trajectory of the robot, and red arrows show the heading direction of the robot at each point.

The zig-zagging path tracking is a challenging task, since the robot has to consistently change its heading direction by more than 90 degrees. Fig. 7(b) shows that the robot can follow the given path represented by waypoints, which indicates that the robot controlled by the learned controller has high controllability over its direction. Also, as shown in Fig. 7(c), the robot was able to dexterously avoid the obstacle block by tracking the planned path while keeping its heading direction toward each waypoint. Videos of the zig-zagging path tracking and obstacle avoidance experiments are presented in our video submission.

VI. CONCLUSION

We propose a new pneumatic vibration actuator that utilizes the nonlinear stiffness characteristics of a hyperelastic material in order to ensure vibration stability and robustness against the external environment. Based on this actuator, we also propose an advanced soft mobile robot capable of orientation control, which was not possible in our previous work. In order to control the robot, we present a reinforcement learning algorithm called adaptive soft actor-critic (ASAC), which provides an efficient exploration strategy and is adaptive to various control tasks. As a result, the feedback controller trained by ASAC not only accurately controls the robot but is also robust against unexpected situations, as demonstrated in the experiments.

REFERENCES

[1] S. Kim, C. Laschi, and B. Trimmer, "Soft robotics: A bioinspired evolution in robotics," Trends Biotechnol., vol. 31, no. 5, pp. 287–294, 2013.
[2] D. J. Preston et al., "A soft ring oscillator," Sci. Robot., vol. 4, no. 31, 2019, Art. no. eaaw5496.
[3] T. G. Thuruthel, E. Falotico, F. Renda, and C. Laschi, "Model-based reinforcement learning for closed-loop dynamic control of soft robotic manipulators," IEEE Trans. Robot., vol. 35, no. 1, pp. 124–134, 2019.
[4] D. Kim, J. I. Kim, and Y.-L. Park, "A simple tripod mobile robot using soft membrane vibration actuators," IEEE Robot. Autom. Lett., vol. 4, no. 3, pp. 2289–2295, 2019.
[5] C. E. Rasmussen, "Gaussian processes in machine learning," in Summer School on Machine Learning. Berlin, Germany: Springer, 2003, pp. 63–71.
[6] T. Haarnoja, A. Zhou, S. Ha, J. Tan, G. Tucker, and S. Levine, "Learning to walk via deep reinforcement learning," in Proc. Robot.: Sci. Syst., 2019, doi: 10.15607/RSS.2019.XV.011.
[7] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, "Benchmarking deep reinforcement learning for continuous control," in Proc. Int. Conf. Mach. Learn., New York City, NY, USA, 2016, pp. 1329–1338.
[8] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," in Proc. Int. Conf. Mach. Learn., Stockholm, Sweden, 2018, pp. 1856–1865.
[9] T. Haarnoja et al., "Soft actor-critic algorithms and applications," 2018, arXiv:1812.05905v2.
[10] G. Marckmann and E. Verron, "Comparison of hyperelastic models for rubber-like materials," Rubber Chem. Technol., vol. 79, no. 5, pp. 835–858, 2006.
[11] E. W. Weisstein, "Spherical cap," 2008. [Online]. Available: http://thznetwork.net/index.php/thz-images
[12] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proc. Int. Conf. Mach. Learn., Sydney, NSW, Australia, 2017, pp. 1352–1361.
[13] M. Bloem and N. Bambos, "Infinite time horizon maximum causal entropy inverse reinforcement learning," in Proc. IEEE Conf. Decis. Control, Dec. 2014, pp. 4911–4916.
[14] J. Schulman, P. Abbeel, and X. Chen, "Equivalence between policy gradients and soft Q-learning," 2017, arXiv:1704.06440v4.
[15] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ, USA: Wiley, 2014.
[16] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. 32nd Int. Conf. Mach. Learn., Lille, France, 2015, pp. 1889–1897.
[17] J. I. Kim, M. Hong, K. Lee, D. Kim, Y.-L. Park, and S. Oh, "Learning to walk a tripod mobile robot using nonlinear soft vibration actuators with entropy adaptive reinforcement learning: Supplementary material." [Online]. Available: http://rllab.snu.ac.kr/publications/letters/2020_ral_adasac_supp.pdf/
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014, arXiv:1412.6980v9.
[19] S. Karaman and E. Frazzoli, "Sampling-based algorithms for optimal motion planning," Int. J. Robot. Res., vol. 30, no. 7, pp. 846–894, 2011.
