Towards the development of a soft manipulator as an …...2018/08/06 · Research Article Towards...
Transcript of Towards the development of a soft manipulator as an …...2018/08/06 · Research Article Towards...
Research Article
Towards the development of a softmanipulator as an assistive robot forpersonal care of elderly people
Yasmin Ansari1, Mariangela Manti1, Egidio Falotico1,Yoan Mollard2, Matteo Cianchetti1 and Cecilia Laschi1
AbstractManipulators based on soft robotic technologies exhibit compliance and dexterity which ensures safe human–robotinteraction. This article is a novel attempt at exploiting these desirable properties to develop a manipulator for an assistiveapplication, in particular, a shower arm to assist the elderly in the bathing task. The overall vision for the soft manipulatoris to concatenate three modules in a serial manner such that (i) the proximal segment is made up of cable-based actuationto compensate for gravitational effects and (ii) the central and distal segments are made up of hybrid actuation toautonomously reach delicate body parts to perform the main tasks related to bathing. The role of the latter modules iscrucial to the application of the system in the bathing task; however, it is a nontrivial challenge to develop a robust andcontrollable hybrid actuated system with advanced manipulation capabilities and hence, the focus of this article. We firstintroduce our design and experimentally characterize its functionalities, which include elongation, shortening, omnidir-ectional bending. Next, we propose a control concept capable of solving the inverse kinetics problem using multiagentreinforcement learning to exploit these functionalities despite high dimensionality and redundancy. We demonstrate theeffectiveness of the design and control of this module by demonstrating an open-loop task space control where it suc-cessfully moves through an asymmetric 3-D trajectory sampled at 12 points with an average reaching accuracy of 0.79 cm+ 0.18 cm. Our quantitative experimental results present a promising step toward the development of the softmanipulator eventually contributing to the advancement of soft robotics.
KeywordsAssistive robotics, soft robotics, machine learning, robot control
Date received: 20 May 2016; accepted: 10 December 2016
Topic: Special Issue - Soft Robotics Interacting with the Real WorldTopic Editor: Li Wen
Introduction
Activities of daily living (ADLs) refer to the daily tasks
people should be able to independently accomplish1 and
are grouped into two general categories: (i) basic ADLs
(BADLs) and (ii) instrumental ADLs. However, achieving
these is a significant challenge for the elderly generation
with chronic ailments and loss of abilities.2,3 Initially, chil-
dren took care of their elders; however, nowadays the
majority of the population is dependent upon professional
healthcare services. This has brought about an interest to
the development of advanced assistive technical devices
(also referred to as assistive robotics) in order to promote
independent living and consequently provide the possibil-
ity to the elderly generation to improve their quality of
life.2,4,5 This work focuses on the development of assistive
devices for BADLs, in particular, the task of bathing which
1The BioRobotics Institute, Scuola Superiore Sant’Anna, Pisa, Italy2Flowers Lab, Inria, France
Corresponding author:
Yasmin Ansari, The BioRobotics Institute, Scuola Superiore Sant’Anna, via
Rinaldo Piaggio, 34 Pontedera 56025, Italy.
Email: [email protected]
International Journal of AdvancedRobotic Systems
March-April 2017: 1–17ª The Author(s) 2017
DOI: 10.1177/1729881416687132journals.sagepub.com/home/arx
Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License
(http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without
further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/
open-access-at-sage).
is considered challenging because the entire body is
exposed in this scenario. Literature reveals, that this area
has been scarcely investigated, with only two commercially
available technical solutions.
� Oasis Seated Shower system:6 provides an independent,
safe, and hygienic shower experience. The system is
equipped with a push-button soap application and rinse
system. On the downside, it completely lacks active
interaction between the user and the tool which also
includes lack of functionalities such as rubbing/wiping
the senior. Furthermore, it is bulky and expensive.
� Seat Lift Device:7 is much simpler, partially auto-
mated, and helps positioning the user in the bathing
task. However, it does not provide any assistance in
showering task.
There still exists a multitude of unaddressed issues in
this domain where the system should: (i) be client-oriented
i.e. translate customer requests into technical requirements;
(ii) be safe, reliable, and adaptable for human-robot inter-
action;8,9 (iii) be highly dexterous with a large reachable
workspace; and finally (iv) autonomously reach difficult
and sensitive regions of the body (refer to Figure 1).
Soft robotic manipulators
In order to address the above limitations and increase the
user autonomy at home, a suitable approach is to exploit
principles and technologies from the rapidly growing
research field of Soft Robotics10 for developing a manipu-
lator as a shower arm in the bathing task. Soft robotics refers
to a new generation of flexible robots that take inspiration
from muscular hydrostats such as octopus tentacles, elephant
trunks, etc., in the use of soft materials to achieve advanced
manipulation capabilities while being safe to interact
with.11–13 These natural manipulators have inspired the
development of novel soft robotic manipulators for obtain-
ing a desired gamut of motion. These manipulators comprise
of actuators elements that can be arranged into the single
body or repeated into single identical modules. The advan-
tages of the modular approach to soft robots, introduced by
Onal and Rus14 for the first time, demonstrates how actua-
tion units can be composed in various, possibly redundant
arrangements to achieve exceedingly complex motions. Tak-
ing advantage of this achievement, a mechanical modular
manipulator can be designed by concatenating modular
blocks, which, in turn, are an assembly of soft actuators that
can be intrinsic (actuators are located within the body thus
forming part of the animated mechanism), extrinsic (the
actuator is remote and motion is transferred into the mechan-
ism via a mechanical linkage), or hybrid (combination of
intrinsic and extrinsic actuation).15
We have considered that the technical requirements for the
design of a soft manipulator are defined by the workspace that
it should cover, which includes a volumetric space around and
close to the users’ body as shown in Figure 1. This will be
translated into a mechanical modular manipulator by conca-
tenating three modular blocks such that (i) the proximal mod-
ule is a structural module based on radially arranged cables
that are used to compensate for gravitational effects and (ii)
the centre and the distal modules on hybrid actuation are
identical functional modules. The success of the manipulator
is highly dependent upon the functional capabilities and
working of the hybrid actuated module which is a nontrivial
challenge. Consequently, the focus of this work lies in the
development of a reliable and controllable distal module.
Challenges in design
The challenge of developing manipulators that rely on the
combination of fluidic actuators and cables i.e. hybrid actua-
tion has been largely faced in the surgical robotics field open-
ing the door to other applications.16 Good examples of soft
manipulators based on tendons pulled in relation to pressur-
ized chambers are: the KSI tentacle manipulator powered by a
hybrid system of pneumatic bellows and electric motors,17 the
continuum robot proposed by Pritts19 uses 6-8 contracting/
extending McKibben actuators for providing two-axis bend-
ing, the Air-Octor manipulator that uses three cables and a
single chamber,20 and the OCTARM continuum robot21 that
is able to perform adaptive manipulation in challenging envir-
onments by using air muscle actuators. These examples have
been integrated into a more comprehensive literature analy-
sis, as reported in Table 1, that provides a chronological order
of some examples for bioinspired manipulators that cover all
three categories focusing on the actuation strategy that
enables specific functionalities (i.e. elongation/contraction,
bending and variable stiffness). This table also indicates the
major achievements in Soft Robotics concerning the design of
Figure 1. General overview of the bathing task. The softmanipulator approaches the back region of the user, one of themost critical and difficult regions to reach. The lower limbs havealso been identified as difficult task. An additional soft manipulatorcould be mounted on the wall for the lower limbs.
2 International Journal of Advanced Robotic Systems
Tab
le1.
Asu
mm
ary
ofth
edes
ign
and
contr
olofso
me
exam
ple
sofso
ftro
botic
man
ipula
tors
.
Man
ipula
tor
Public
atio
n
Tim
elin
eA
pplic
atio
nD
esig
n
Act
uat
ion
Funct
ional
itie
sC
ontr
ol
Typ
eA
ctuat
ors
KSI
tenta
cle1
71995
Endosc
opy
Agr
iculture
Nucl
ear
Modula
rH
ybri
dBel
low
-shap
edpneu
mat
icbla
dder
Rad
ially
arra
nge
dca
ble
s
Var
iable
stiff
nes
s
Short
enin
g
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
Model
bas
ed
Act
ive
Hose
18
2001
Sear
chan
dR
escu
eM
odula
rIn
trin
sic
Wounded
tube
actu
ator
Om
nid
irec
tional
ben
din
g—
Art
ifici
alm
usc
leco
ntinuum
robot1
9
2004
NSS
*M
odula
rIn
trin
sic
(Exte
nso
rþ
contr
acting)
pneu
mat
ic
Cham
ber
s
Om
nid
irec
tional
ben
din
g—
Air
-Oct
or2
02005
Sear
chan
dre
scue
Modula
rH
ybri
dSi
ngl
epneu
mat
icch
amber
Rad
ially
arra
nge
dca
ble
s
Var
iable
stiff
nes
s
Short
enin
g
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
—
Oct
Arm
21
2005–2008
Indust
rial
Spac
e
Mili
tary
Modula
rIn
trin
sic
Rad
ially
arra
nged
(Ext
enso
rþ
Cont
ract
ing)
pneu
mat
ics
cham
bers
Short
enin
g
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
Model
bas
ed22
Hyb
rid
23
Oct
opus2
42010–2016
Mar
ine
Singl
ebody
Extr
insi
cC
able
Elo
nga
tion
Ben
din
g
Model
bas
ed
Lear
nin
gbas
ed
BH
A25
2011–2016
Indust
rial
Modula
rIn
trin
sic
Rad
ially
arra
nge
dbel
low
-shap
edpneu
mat
ics
cham
ber
s
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
Lear
nin
gbas
ed26
Hig
hly
articu
late
man
ipula
tor
enab
led
by
jam
min
gm
edia
27
2012
NSS
*M
odula
rH
ybri
dG
ranula
rja
mm
ing
Rad
ially
arra
nge
dca
ble
s
Spri
ngs
Var
iable
stiff
nes
s
Short
enin
g
Om
nid
irec
tional
ben
din
g
—
Shri
nka
ble
,st
iffnes
s-co
ntr
olla
ble
soft
man
ipula
tor2
8
2014
NSS
*M
odula
rH
ybri
dLa
tex
cham
ber
Rad
ially
arra
nge
dca
ble
s
Bra
ided
fabri
c
Var
iable
stiff
nes
s
Short
enin
g
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
—
STIIFF
–FL
OP
29,3
02014–2016
Surg
ery
Modula
rIn
trin
sic
Rad
ially
arra
nge
dpneu
mat
ics
cham
ber
s
Gra
nula
rja
mm
ing
Var
iable
stiff
nes
s
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
—
Hyb
rid
Rad
ially
arra
nge
dpneu
mat
ics
Rad
ially
arra
nge
dca
ble
s
Var
iable
stiff
nes
s
Short
enin
g
Elo
nga
tion
Om
nid
irec
tional
ben
din
g
—
Multi-se
gmen
tso
ftflu
idic
actu
ator3
12015
NSS
*M
odula
rIn
trin
sic
Anis
otr
opic
pneu
mat
icch
amber
sO
mnid
irec
tional
ben
din
gM
odel
-bas
ed
*NSS
:not
spec
ifica
llyst
ated
.**
Sim
ula
tion.
manipulators based on variable stiffness mechanisms. More-
over the Application field of Table 1 highlights how none of
these is intended for assistive robotics. It is clear how our
proposal of using Soft Robotics methodologies and mechan-
isms for building the first assistive soft robot is challenging
and completely new. In order to develop a manipulator as a
shower arm in the bathing task, we investigate an actuation
strategy that enables all the mentioned specific functionalities
(i.e. elongation/contraction, bending and variable stiffness).
Challenges in control
Soft robots often lack an analytical model due to their nonlinear
material properties and high-dimensional articulated structure.
This makes predictable and controlled movements with posi-
tional accuracy a significantly challenging task. Consequently,
one of the open research questions is to achieve task space
control that allows these robotic systems to accurately follow
a trajectory, that is, point-to-point motion control. This is
essential for a bathing task where close contact is required
between the manipulator and the human to perform scrubbing
and cleaning. The objective of point-to-point motion control is
to find the correct combination of input actuation at each dis-
crete sample, that is, inverse kinematics/kinetics. Immega and
Antonelli17 proposed a cascaded spline arc approach that con-
trols the cables of the KSI tentacle that roughly positions it in 3-
D space, which is further improved through a closed-loop
vision feedback. Giorelli et al.24 used a Jacobian based
approach to reach an average tip accuracy of 6% of the total
manipulator length. Marchese et al.31 applied a closed-loop
controller on a 3D soft-arm to position the end effector to reach
a ball with a diameter of 0.04m. These traditional model-based
methods are limited in accuracy and computational costs by
modelling assumptions that need to be improved for practical
application of soft manipulators. Learning based control23,24,26
is a promising approach to automate low-level sensorimotor
skills. The underlying principle is to allow the manipulator to
autonomously explore its environment and correlate sensory
and motor spaces. This work investigates the application of
these methods for the previously unaddressed challenge of the
control of hybrid actuated systems.
The paper is organized as follows: Section ‘Design’
presents the design, fabrication, and characterization of
the single distal module. Section ‘Control Framework’
presents the inverse kinematics control approach that is
incorporated as a low-level module in the hierarchical
controller presented in Section ‘Hierarchical Controller’
Experimental methodology and results in point-to-point
motion control is reported in Section ‘Experiments’ fol-
lowed by a comprehensive discussion of the main out-
comes of the proposed work is stated in the last section
of the paper.
Design
The overall design of the distal segment relies on the total length
which should be able to reach the target workspace which
includes back and/or lower limbs (refer to Figure 1). On the one
hand, a McKibben-based actuation maximizes the reachable
length due to the elongation performances of the fluidic actua-
tors, thus, reducing the number of modules required to achieve
the desired length. However, it is unstable due to gravitational
Figure 2. Flowchart showing the fabrication process for the single McKibben actuator.
4 International Journal of Advanced Robotic Systems
effects. On the other hand, the use of cables enables multiple
functionalitiesof shortening,omnidirectionalbending,andcom-
pensation for gravity effects. However, it requires a higher num-
ber of modules for covering the same length. We propose to
embed both technologies within the distal module in order to
complement the limitations of each other while increasing
the overall functionality of the complete module. Thus, the
distal module is hybrid actuated. The innovative element in
its mechanical design is represented by the flexible fluidic
actuators that are McKibben based but with bellow-shaped
surface, discussed subsequently.
Design and manufacturing of the distal module
Three McKibben actuators and three cables are alternately dis-
placed at an angle of 60� along a circle with a radius of 3 cm. A
central hollow chamber of 1.4 cm diameter has been inserted to
facilitate the flow of water and soap. The total length of the
module is 20.5 cm, while the total weight is 180 g (refer
to Figure 2). The cables and chambers are decoupled, meaning
that each one has a dedicated activation line for tension and
pressure regulation, respectively.
Design of the McKibben actuator. The manufacturing process
for the McKibben actuator32–34 comprises of the following
steps: (i) forming the external chamber: A braided sheath (Pro
Power PETBK3B10, Farnell Components) with a length of
60 cm and internal nominal diameter of 0.3 cm is expanded
and inserted on a metallic cylinder with a diameter of 0.8 cm.
A mechanical deformation is applied in the inferior/superior
direction to introduce a fiber deformation until a cylindrical
bellow structure is created. The final shape is ‘‘memorized’’
by applying a cyclic uniform heat along the structure after
which the actuator has an outer diameter of 1 cm and a total
length of 19.5 cm (refer to Figure 2). A recommended tem-
perature is around 350�C for 3 min. (ii) Combining the inter-
nal and external chambers: A balloon (latex) is inserted into
Figure 3. CAD design of the single module; on the top right asection view of the module with the three McKibben-basedactuators (identified by A1–A2–A3), the three cables (identifiedby B1–B2–B3), and the internal channel for the provision of water.
Figure 4. Experimental setup for the single modulecharacterization.
Figure 5. Elongation–contraction module characterization. Theblue marker represents the starting configuration without activeelements.
Ansari et al. 5
the braided sheath. It is delimited at both ends by endcaps
which provide an air-source from one end and an anchoring
point to the lateral surface from the other. The bellow-shaped
external chamber facilitates a radial containment effect of the
internal elastic chamber, thus, improving the elongation per-
formances of the combined structure. (iii) Structural support:
The entire module is completed by adding a custom-made
helicoidally shaped structure along its length (Figure 3). It
houses the three cables, the three chambers, and the internal
channel, thus, providing a constraint for the structural ele-
ments and avoiding lateral expansions of the actuators. The
last part of the module consists of the lateral interfaces (end
plates) for confining the module.
Kinetic characterization of the distal module
The kinetic characterization refers to the evaluation of
the complete behavior of the system that includes
contraction/elongation, bending/stiffness, and the com-
plete reachable workspace, as discussed in the subse-
quent subsections.
Experimental setup. The experimental set-up for the character-
ization of the distal module consists of: (1) a rigid rectangular
frame made of 0.8 cm acrylic sheet that orients the module
vertically downward; it is custom designed to encompass the
actuation systems; (2) Pneumatic Set-up: three proportional
pressure-controlled electronic valves (K8P Series by EVP Sys-
tems, Italy, Input: 0 - 10 V, Output: 0 -3 bar), one filter (EVP
Systems, Italy: MC-104FB0); one manometer (EVP Systems,
Italy: M043-p12 0-12bar); one stand-alone air compressor; (3)
Cable Set-up: three Hs-645 Hitech Servomotors; (4) A six-
DOF electromagnetic sensing probe (Aurora® Tracking by
Northern Digital Inc., Canada) has been fixed at the tip of the
module base which is capable of sub-millimetre position track-
ing. This is a visualization tool that provides a log of tracking
movements at the end of an experiment (Figure 4).
Elongation/contraction. The linear motion of the module was
captured by moving it from a completely contracted state to a
completely extended state. The contracting state corresponds
to the cables pulled by the servomotors without the chambers
being inflated. All the servomotors are then simultaneously
rotated through 180� in steps of 20�. The position obtained
Figure 6. Activation pattern with a single cable (B3 in Figure 3) up to the maximum position and activation of the opposite chamber (A1in Figure 3). (a) Global bending movement of the element; (b) xz view; (c) yz view. Arrows with different colors show the orientation ofthe module: x, y, z directions in green, blue, and red, respectively. Arrows with different colors show the orientation of the module: x, y,z directions in green, blue, and red, respectively.
6 International Journal of Advanced Robotic Systems
after the completion of this rotation corresponds to the neutral
point of the manipulator. Next, the pneumatic chambers are
inflated from 0.5 to 1 bar. As explained in,35 we start from 0.5
bar as it is the inferior pressure limit of a single actuator i.e.,
below this limit the actuator does not detect any global elon-
gation movement. Figure 5 illustrates the displacement of the
module along the z-axis from the contracted state to the
extended state which is found to be 5 cm. With respect to the
neutral position, this implies that there is a contraction of 3 cm
and elongation of 2 cm.
Bending and stiffening. In order to perform the bending tests,
we evaluated two different activation patterns: (i) bending
due to a single cable and single chamber: a single randomly
chosen cable of the system is contracted from an untensioned
state into a tensioned state by rotating the servomotor from
the neutral position through 180� in incremental steps of 20�.When the maximum tensioned state is achieved, the diame-
trically opposite pneumatic chamber is inflated by applying
a pressure from 0.5 to 1 bar, with an incremental step of 0.1
bar. Referring to Figure 3, this activation pattern is achieved
when A1 collaborates with B3, A2 with B1, or A3 with B2.
Figure 6 illustrates the resulting behavior from this activa-
tion pattern. The global bending of the system contributed by
the cable undergoes a displacement of 60 mm along the z-
axis (viewed as the xz-view and yz-view in Figure 6). A
stiffness variation can be observed when pneumatic actua-
tion begins which is due to an antagonistic effect. (ii) Bend-
ing due to two cables and two chambers: two adjacent cables
of the system are contracted from an untensioned state into a
tensioned state by rotating the corresponding servomotors
simultaneously through 180� in incremental steps of 20�.Referring to Figure 3, this activation pattern can be achieved
when first B1–B2 or B2–B3 or B3–B1 are activated. When
the maximum tensioned state is achieved, the diametrically
opposite pneumatic chambers corresponding to each servo-
motor, that is, A2–A3, A3–A1, or A1–A2, respectively, are
inflated similar to the procedure explained previously.
Figure 7. Activation pattern with two cables (B1–B2 in Figure 3) up to the maximum position and activation of the two oppositechambers (A2–A3 in Figure 3). (a) Global bending movement of the element. (b) xz view. (c) yz view. Arrows with different colors showthe orientation of the module: x, y, z directions in green, blue, and red, respectively. Arrows with different colors show the orientationof the module: x, y, z directions in green, blue, and red, respectively.
Ansari et al. 7
Fig
ure
8.(
a)W
ork
spac
eev
aluat
ion
for
the
singl
em
odule
.The
blu
epoin
tre
pre
sents
the
initia
lposi
tion
oft
he
module
inan
unac
tiva
ted
stat
e.(b
)W
ork
spac
evi
ewfr
om
the
X–Y
pla
ne.
(c)
Work
spac
evi
ewfr
om
the
X–Z
pla
ne.
(d)
Work
spac
evi
ewfr
om
the
Y–Z
pla
ne.
8
Figure 7 depicts resulting behavior from this activation pat-
tern. The global bending of the system contributed by the
cables undergoes a displacement of 5 cm along the z-axis.
The reduced value in comparison to the first activation pat-
tern is due to the double-cable activation that pulls the cables
with an increased force. It can also be concluded from these
figures that the chambers contribute to adapt the tip orienta-
tion of the module. The hybrid actuation is promising
because of the stiffness variation for applying restrained
forces in the interaction with the user.
Reachable workspace. The workspace measurement (see
Figure 8) aims to measure all the reachable positions of the
module tip. A total of 8000 points were recorded which
resulted from a permutation of all six inputs applied using
the following values: (i) servomotors: 0�, 45�, 90�, 135�,180�; (ii) pneumatic pressure: 0 bar, 0.5 bar, 0.8 bar, and 1 bar.
Figure 8(a) depicts the isometric view of the resulting
point cloud, highlighting the workspace in the range of 21.5
cm � 17.2 cm � 10 cm in the x–y–z axes, respectively. The
blue marker corresponds to the starting configuration of the
end effector. The overall shape of the workspace can be cate-
gorized as a volumetric convex with the lower limits defined by
the maximum bending capabilities (approximately 90�) of the
module, as expected. The asymmetry of the shape can be cred-
ited to the manufacturing procedure i.e. the irregularities that
are introduced during the assembly phase which are not easily
identifiable when the module is at rest. For a detailed analysis,
the workspace is also presented from a 2D viewpoint from all
three planes. Figure 8b shows the top view of the workspace
from the X-Y plane. The shape is circular, as theoretically
expected. Figure 8c indicates an asymmetric response of the
system from the X-Z planar viewpoint, whereas a symmetric
response of the system from the Y-Z planar viewpoint is
depicted in Figure 8d. Data points are more pronounced toward
the left portion in the former graph as a result of a tightened
cable attachment to the respective servo arm.
Control framework
Learning based controllers36,37 offer a promising solution to
address the control challenge. In particular, we take inspiration
from a previous work38 where each actuator is considered an
autonomous agent that resides within and shares an environ-
ment forming a distributed multi-agent system (MAS).39 The
underlying objective is to enable the agents to coordinate their
actions to learn a joint optimum behaviour. Reinforcement
Learning (RL)40 is a very attractive online optimization tech-
nique in milieu of automating a solution for MAS as it assists in
(i) model-free learning and (ii) online learning. The former is
beneficial to learn optimal behaviour in unpredictable environ-
ments where the dynamics are not known a-priori, whereas, the
latter is particularly useful to learn behaviour without specify-
ing the underlying actuation technology of the system. This,
however, requires that the agents learn in an independent man-
ner41 similar to the Iterated Prisoner’s Dilemma.42
Reinforcement learning preliminaries
RL is a sequential decision-making tool where an agent
iteratively interacts with an environment modeled as a
Markov decision process (MDP). An MDP is a 4-tuple
<S, A, T, R>, where S is the state space, A is the action
space, T is the environment dynamics given by the condi-
tional probability Pr fs0 | s, ag that taking action a in state s
will lead to new state s0, and R is the reward received when
an action a in state s leads to a new state s0. An agent acts on
the environment by selecting actions according to a policy �,
that is, a mapping from states to actions. Starting from an
initial state, the agent iteratively interacts with the MDP
according to the policy to generate a trajectory which
refers to a sequence of state-action-reward tuples
< s0 ! a0 ! r1 ! s1 ! a1 ! r2 ! s2 ! a2; . . . >.
The return over an infinite horizon trajectory refers to the
cumulative discounted reward as Rt ¼P1
k¼0 gtrkþtþ1,
where g 2 ½0; 1Þ, and r is the reward received at each
time-step. This can also be extended to a finite horizon set-
ting for episodic tasks where g ¼ 1. In the context of con-
trol, an action-value function (the utility is described using
the action-value function as opposed to the more commonly
used value-function V�ðsÞ ¼ E ½Rt jst ¼ s� as a matter of
convenience to the underlying work) Q is widely employed
to evaluate the utility of executing an action a in a state s and
following a given policy � thereafter, given as43
Qðs; aÞ ¼ E
(X�k¼0
gtrkþtþ1jst ¼ s; at ¼ a; �
)(1)
where � is the terminating condition, s is the state of the
system at time t, and a is the action selected in state s at
time t in given policy �. This work considers trajectories that
finish within a finite horizon; hence, � represents when either
the goal sg is reached or a maximum number of action trials
are attempted in an episode. In optimal control problems, the
goal is to find an optimal control policy�� that maximizes the
expected cumulative discounted return regardless of the ini-
tial state. The focus of this work is to derive �� via policy
iteration (PI),44 one of the two main classes of dynamic pro-
graming to solve MDPs.
Formally, PI is a subroutine that runs trajectories iteratively
for a given duration, wherein it alternates the processes of
policy evaluation and policy improvement as follows: Starting
from any initial position at time-step t, an action in the current
state is chosen according to a policy �t and is evaluated by Qt.
In the next time-step t þ 1, an improved policy �tþ1 is gener-
ated by acting greedily (exploitation) with respect to Q�t. In
this framework, the policy is also referred to as the actor and the
Q-function the critic. The goal of PI is to learn an optimal Q-
function Q* such that it satisfies the condition
Q�ðs; aÞ > Q�ðs; aÞ 8 s 2 S ; a 2 A; 8 � (2)
Consequently, an optimal policy �* can be automatically
derived from Q* as
Ansari et al. 9
��ðsÞ ¼ argmaxa Q�ðs; aÞ (3)
For a finite-dimensional MDP, the Bellman equation is
employed to analytically optimize the Q-function. However,
for most real-world applications, including the problem-at-
hand, prior knowledge of environment dynamics is rendered
unknown due to stochasticity. The most powerful approach
in this scenario is temporal difference (TD)-based PI which
is model free and learns incrementally from the data.40 There
are two classical TD learning algorithms applied in the con-
text of control, that is, Q-learning (off-policy TD learning)
and SARSA (on-policy TD learning), where the focus of this
work lies on the latter approach.
In SARSA, an arbitrarily initialized Q-function is incre-
mentally updated with a tuple < st; at; rtþ1; stþ1; atþ1 >as follows
Qtþ1ðst; atÞ ¼ Qtðst; atÞþ � � ½rtþ1 þ gQtðstþ1; atþ1Þ � Qtðst; atÞ�
(4)
where � 2 ½0; 1Þ is the learning-rate/step-size parameter;
the term between square brackets is the TD error between
the current estimate Qtðst ; atÞ and the updated estimate
rtþ1 þ gQtðstþ1 ; atþ1Þ. Equation (4) highlights that every
update aims to estimate the Q-function of the policy being
followed, that is, on-policy TD learning.
This learning process can be further sped-up by
incorporating eligibility traces e 2 R which update Q
not only by the current tuple but rather with all visited
state-action pairs that lead to this tuple as it is the causal
result of an entire trajectory. This variant of SARSA,
more commonly known as SARSA(l) with accumulat-
ing eligibility traces,45 has memory parameters associ-
ated with each state-action pair that are initialized to 0
at the beginning of each trajectory and updated as
et ðs; aÞ ¼�
glet�1ðs; aÞ þ 1 if s ¼ st ; a ¼ at
glet�1ðs; aÞ otherwise(5)
where l 2 ½0; 1Þ is the accumulating trace-decay error that
controls the degree of how fast a trace is decreased to zero.
Then, equation (3) is modified to update Q as follows
Qtþ1ðst; atÞ ¼Qtðst; atÞ þ � � ½rtþ1 þ gQtðstþ1; atþ1Þ� Qtðst; atÞ� � etðs; aÞ (6)
Convergence to Q* is dependent upon two factors: (i)
Robbins-Monro criteria:46 the learning rate � should be opti-
mized in the range � 2 ½0; 1� to satisfyP1
i¼0 ai ¼ 1 andP1i¼0 ai
2 <1; (ii) exploration:47 to ensure that every state-
action pair is visited an infinite number of times. The former
criteria can be achieved through a manual selection process
whereas the latter requires that the updates occur by following
a stochastic policy (e.g. ] –greedy, softmax, etc.) that will
explore the environment by selecting random actions with
certain probability measure. Additionally for SARSA, this
exploration should be reduced to 0 as t !1. Under these
conditions, an optimal memoryless stationary policy, that is,
�*¼ � 8 t can be extracted from the Q-function to reach the
target according to equation (3). PI is motivating due to its
ability to reach optimality with asymptotic performance.
However, fulfilling criteria (ii) is nontrivial for a con-
tinuous domain environment, necessitating techniques to
approximate the Q-function, as discussed subsequently.
Parametric linear function approximation
For continuous domains, the Q-function can be estimated
either with linear/non-linear function approximation. The-
oretical stability and convergence properties favor linear
approximators over non-linear approximators.48
The choice opted for the parametric linear function
approximator is tile coding, which partitions the state space
into multiple layers called tilings in the spherical coordinate
system such that R¼ [0 max(R;)]; (ii) ϕ¼[0� 360�]; and (iii)
� ¼ [max(�) 90�]; where max(R) refers to the length of the
manipulator in a fully extended state; ϕ refers to the azimuth
which for omnidirectional bending will always be 360�; and
� refers the contraction capability of the manipulator mea-
sured from the z-axis. The complete feature space is then
represented by rectangular shaped tilings (refer to Figure 9)
and is automatically restricted to the reachable workspace of
the manipulator. Each tiling is divided into elemental blocks
called a tile that allows for local generalization of distances
close in Euclidean space. The resolution is different in each
dimension of the state-space, given mathematically as
’i¼li
N(7)
where i ¼ 1, . . . , n and n is the total number of dimensions
of the state-space; ’ represents the resolution of a tile in a
Figure 9. 3-D Cartesian plane is abstracted into 2-layeredrectangular tilings in R, ϕ, and � dimensions. There are 7 tiles,8 tiles, and 1 tile with a width of wR, wϕ, and w� in each dimension,respectively. The origin of the tilings are offset with respect toeach other. The position of the soft manipulator end effectoractivates one tile per tiling.
10 International Journal of Advanced Robotic Systems
given dimension; l represents the width of the tile in given
dimension; N represents the total number of tilings. The role
of resolution is crucial on the learning performance of the
algorithm: A larger resolution implies slower learning but more
optimal solutions, whereas smaller tiles imply faster learning
but less optimal solutions. The selection of the resolution is
mostly done through a manual trial and error process. Tiles act
as receptive fields, that is, only one tile per tiling is activated if
and only if the given state falls in the region delineated by that
tile. The origin of each tiling is offset to avoid singularities at
the boundaries of the tiles. Q-function is then simply approxi-
mated by a sum of the indexes of these activated tiles as
Q ðs; a Þ ¼Xk
j ¼ 1�
jðs Þwj (8)
where j ¼ 1 . . . k, where k is the total number of tilings;
�: S x A ! Rk is a vector of binary basis functions
½�1ðs; aÞ; . . . ; �kðs; aÞ�T ; w 2 Rk is the parameter vector.
Equation (8) shows that the computational complexity of
tile coding is linearly dependent on the number of til-
ings. Parameter tuning for tile coding is done offline and
kept constant throughout the experimental work.
SARSA(l) can be combined with linear approximation
by using gradient-based updates which will change the
parameter vector according to,
wtþ1 ¼wt þ �t � ½rtþ1 þ g�ðstþ1; atþ1ÞT wt � �ðst; atÞT wt�
� @
@wt
Qðst; atÞ � et
(9)
Reward structure and policy learning
In order to facilitate learning, we guide the policies through
the technique of reward shaping such that the MAS is
motivated to select actions that progressively lead towards
the goal. (Note: In this work, the optimal policy is learnt
from a single starting state towards the goal). This is
implemented in an information theoretic manner by
defining the reward using a distance measure from the
goal as follows: A bounded monotonically increasing
reward function is implemented representing unequally
spaced concentric spheres centered on the target, where
the number of spheres in this work are selected randomly
but once created, is applied in a similar manner for any
target. The motivation behind this reward structure is to
allot the distribution of information throughout the
abstracted state-space such that motor commands that
result in positions with distances further from the goal
will be updated with a higher negative value in compar-
ison to targets closer to the goal. Even though this pro-
vides efficient distribution of information, it would still
suffer from the curse of dimensionality49 as the policy
space increases exponentially with the number of agents.
Thus, we modularize the learning process that is the pol-
icy learns to reach the goal by learning by moving
through the state-space in the direction of greater reward.
This is achieved by taking advantage of our reward struc-
ture such that once it encounters an action that is
rewarded with a scalar value higher than previously
encountered ones, it will be the first action taken by the
system in the next episode onwards.
Hierarchical controller
A schematic overview of the motion controller is depicted
in Figure 10. It has a hierarchical architecture such that at
the higher level, a trajectory is generated within the reach-
able workspace of the manipulator using the spherical
Figure 10. Schematic view of the hierarchical control architecture for a point-to-point motion controller for soft robotic modules.
Ansari et al. 11
coordinates. It is sampled at n points ½s0; s1; . . . ; sn� and
follows them in a sequential order. In structured environ-
ments, this could be used to generate circular, curved, and
so on, trajectories offline. However, in this work, we are
motivated by the unpredictability of the showering envi-
ronment and, hence, aim to test the movement of the mod-
ule through an asymmetrical trajectory which essentially
implies that the task space is sampled at random points in
the state-space in a sequential order with the aim of the
manipulator to reach these points in the same order. These
samples are then fed sequentially (si) to the low-level mod-
ule where it generates a solution to reach the given Carte-
sian spatial target position using the control framework
discussed in section ‘‘Control framework’’. Once gener-
ated, the next sample is requested.
Experiments
The implementation of the tile coding software was
achieved through typically available software. Before
using it, all the necessary parameters required by the
algorithm are set as follows (refer to Table 2): (i) reach-
able workspace: It is defined in spherical coordinates in
consistence with the characterization of the module
shown in Figure 8, that is, R ¼ [10 cm 30 cm], ϕ ¼[0� 360�], � ¼ [0� 90�]. The upper range of the radius
was overextended to compensate for the unseen scenar-
ios, for example, the tear of a cable resulting in a larger
range for the workspace. (ii) Resolution: 10, 12, and 10
tiles were initialized in the R, ϕ, and � dimensions with
a total of 4 tilings offset with respect to one another
such that the state space has a total of 4800 tiles. (iii)
Learning parameters: The step-size (�) and exploration
(]) are set to 0.16 and 0.05, standard values found in
text. The eligibility traces (7) are set to 0.9. (iv) Accu-
racy: The radius of the target circle was kept at a ran-
domly chosen value of .99 cm. This indicates the
maximum accuracy that is acceptable as a solution, the
minimum being 0. (v) Action-set: The module is allowed
to choose actions from a heuristically initialized discrete
Table 2. A summary of applied parameters for the proximal module.
Module length 20.5 cm
Reachable workspace R 10–30 cm nR 10 wR 0.02ϕ 0�–360� nϕ 12 wϕ 0.52� 0�–45.8� n� 10 w� 0.08
State-space parameters T 4� 0.16] 0.1l 0.9! 1
Action set Pneumatic chambers [�0.087 �0.0349 �0.0175 0.0175 0.0349 0.087] (bar)Servomotors [�25 �20 �10 �5 0 5 10 20 25]�
Figure 11. Experimental setup.
12 International Journal of Advanced Robotic Systems
action-set that can simultaneously decrease/increase/keep
unchanged the length of the cables and pneumatics. The
length of the cable is controlled by a servomotor rotation
range of [0�, 180�], whereas the length of the chamber is
controlled by pressure-regulating valves in the range of
[0,1] bar. The cardinality of the action-set implies that the
module has to able to learn the required motor command
from a total of 96 ¼ 531,441 input action combinations.
(vi) Algorithmic settings: The algorithm runs online for a
total of 100 episodes with 50 trials per episode and is
evaluated on two criteria: (a) mean reaching error and
(b) trials needed for convergence.
Experimental setup
The experimental setup is similar to the one mentioned in
section ‘‘Kinetic characterization of the distal module’’;
however, an OptiTrack Motion Capture V120: Trio vision
feedback system was employed to provide real-time 3-D
positioning data. There are seven retroreflective markers
attached to the end effector in a random manner such that
four markers need to be detected for the 3-D reconstruction
of the tip position. The complete setup with the initial
position of the robotic module is shown in Figure 11. The
system represents a mapping from an R6 motor space to an
R3 Cartesian space.
Table 3. A summary of the obtained results for point-to-point motion control of the module through a sampled asymmetrictrajectory.a
No. of goals reached 12Desired accuracy 0.99 cmAverage reaching accuracy 0.79 cm + 0.18 cm
Sample number 1 2 3 4 5 6 7 8 9 10 11 12
Reaching accuracy (cm) .93 .59 .63 .93 .81 .79 .94 .66 .88 .98 .4 .93Actions taken per point before convergence 5394 4396 16 1020 787 410 30 423 27 8 230 661
aNote: The data from this table are color coordinated with the data provided in Figure 13.
Figure 12. The distal module moving point-to-point through an asymmetric 3-D trajectory generated offline and sampled at 12 points.Each frame corresponds to the learnt kinematic solution of the manipulator for a given sample. The green line depicts the trajectoryfollowed by the manipulator from sample to sample.
Ansari et al. 13
Results
We generated 12 points in an uneven shaped curve starting
from the top right end with respect to the module in the
resting position moving downward toward the resting posi-
tion. The higher-level controller then fed each sample
sequentially to the low-level controller.
Reaching error. Figure 12 depicts the kinematic solution
learnt by the manipulator for each sample. The algorithm
was found to converge for all 12 points with an average
reaching accuracy of 0.79 cm (+ 0.18) as summarized in
Table 2. It is worthy to mention that this result for accuracy
opens the possibility to apply this module to reach sensitive
areas for providing future services of scrubbing and rinsing.
Furthermore, the possibility to control both pneumatic and
cable actuators simultaneously implies that the control can
inherently produce stiffness, which otherwise require
tedious kinematic and force mappings. Thus, it facilitates
a model-free form of handling internal/external loads in an
efficient manner. The authors would also like to mention
that this accuracy can be improved by reducing the target
region size, provided that the system can also physically
reach that desired accuracy. The green line depicts the
trajectory the module moves through to go from one point
to another. The trajectory from the resting position to the
first target point is not shown as it obstructs the view from
the other targets.
Convergence time. For a given sample, the algorithm allows
the module to test a total of 50,000 actions (i.e. 100 epi-
sodes � 50 trials per episode); however, Table 3 demon-
strates that the algorithm is able to generate solutions for all
the given samples within a couple of hundred of actions
apart from targets 1 and 2, which will be discussed in the
next paragraph. This is a direct consequence of the reward
structure mentioned in the previous section. In a state
where the manipulator has experienced actions with both
lower and higher scalar reward, it will prefer actions with
higher reward. This implies high data efficiency but also
the ability to quickly learn optimal solutions high in accu-
racy. However, as a future research goal, the authors intend
to use the hierarchical control in conjunction with a learnt
forward model. A forward mapping is a causal unique rela-
tion between the motor and task space, and thus, are accu-
rate to learn. The inverse kinematic solutions can then be
learnt without executing real-time trajectories, saving both
time and hardware resources.
Targets 1 and 2 from Table 3 show that a significant
amount of actions were required to achieve convergence. This
is because the actions taken by the module to reach these two
targets positioned the retroreflective markers in such a way
that it was challenging to reconstruct the 3-D position for the
Opti-track system. As mentioned previously, an occlusion for
more than three markers at various places in the state space
returns no information to the algorithm which implies that
(0,0,0) is returned to the algorithm. This state has been given a
reward of�1000. The red plots in Figure 13(b) illustrate the
reward accumulation for these two targets, which is signifi-
cantly much higher as compared to the other samples. Despite
this challenge, the algorithm was able to learn a solution as
long as the target point was not occluded.
The quality of the learnt policies. As mentioned in section
‘‘Parametric linear function approximation’’, the worst per-
formance of this algorithm will be generating a suboptimal
policy. In this practical setup, this was found to be true. The
algorithm is tested from various starting positions; how-
ever, in majority of the cases, the solution was generated
from the positions that are closer to the goal rather than
further from the goal. One reason for this could be that the
resolution of the state-space is not optimized, that is, it
should be increased. However, this will result in an expo-
nential increase in computational time, whereas the policy
currently learnt very well meets our requirements. Conse-
quently, the authors believe that such an approach should
not be precluded from practical applications.
Figure 13 depicts the reward accumulation for the all
samples. The same trend is exhibited for all experiments
which is that the reward accumulation will increase until
convergence where it will receive a reward of 0, hence
becoming constant.
Discussion and conclusions
Soft robotics has already demonstrated promising results to
revolutionize key sectors such as industry and surgery pri-
marily because this technology guarantees safe human–
robot interaction. This article aims to apply the powerful
properties of soft robotic systems to design the first system
to assist the elderly community in the task of bathing. We
proposed an overall modular system; however, in order to
Figure 13. The reward accumulation for all samples of theasymmetric trajectory.
14 International Journal of Advanced Robotic Systems
reach this ambitious goal, we first needed to design, fabri-
cate, and test the key component module that will perform
the main bathing activities, that is the distal module which
will come in direct contact with the human.
For the design, a hybrid actuated system was proposed.
It comprises of radially arranged bellow-shaped pneumatic
chambers and cables. The kinetic characterization of the
module pointed out the following functional capabilities
of the robotic platform: The novel bellow shape of the
fluidic actuators contributes toward elongation and con-
traction, whereas the fluidic chambers contribute to adapt
the tip orientation. The overall advantages provided by the
combination of these actuators are (i) reachability to the
user workspace: It guarantees to cover larger distances with
a reduced number of modules and orient the module with
respect to the target region of the user’s body. This was
observed through the measured reachable workspace
which was 21.5 cm � 17.2 cm � 10 cm in the x–y–z axes,
respectively. Consequently, a single module has the capa-
bility to match the dimensions of the back region of the
human. However, further experimentations are required to
test how well this module can be oriented toward this
region in a modular form. We also acknowledge that these
results are specific to the experimental setup described in
section ‘‘Experimental setup’’. However, this is not con-
sidered a source of error as the overall behavior (apart
from the asymmetries) is in congruence with theoretical
expectations. For future work, a more optimized setup will
be prepared by introducing motors instead of the servo-
motors to facilitate variable adjustment of the module
motion, due to the extrinsic cable-driven actuation. (ii)
Stiffness variation: Hybrid actuation provides an antago-
nistic motion between cables and pneumatic actuators that
results in a stiffening which is particularly important to
compensate for internal/external loading in the bathing
environment.
For the control, a point-to-point motion hierarchical
controller based on a multiagent RL was used to control
the inverse kinetics of a high-dimensional, nonlinear,
hybrid actuated manipulator. We tested the controller to
move the manipulator through an asymmetric trajectory
generated offline sample at 12 discrete points. We found
the algorithm to provide solutions to all points with posi-
tional accuracy of 0.79 cm + 0.18 cm. It is worthy to
mention that this accuracy opens the possibility to apply
this module to reach sensitive areas for providing future
services of scrubbing and rinsing. The accuracy can also
be improved by setting the target region to a lower value, of
course provided that the system can physically reach that
limit. Furthermore, as the positions are being reached with
all six input activations, this implies that the control is able
to exploit the stiffness properties in a model-free manner
while reaching which otherwise is achieved through com-
plex kinematic and force models. The authors also demon-
strated the effectiveness of applying this approach despite
its ability to generate only suboptimal policies. A limitation
of the approach lies in the possible high number of steps
needed to reach a single point. This will be addressed
through the computation of the forward model on which
the proposed method can be applied and then the inverse
solution is computed, this can be executed by the robot.
Future works will focus on this improvement as well as on
the control of the tip orientation which is not considered in
this implementation. Also, it is also demonstrated that
vision-based systems may introduce additional constraints
and hence, it is recommended to find alternative sources
of feedback.
Acknowledgements
The authors would like to acknowledge the support by the
People Programme (Marie Curie Actions) of the European
Union’s Seventh Framework Programme FP7/2007-2013/
under REA grant agreement number 608022; the European
Commission through the I-SUPPORT project (#643666); and
by the Italian Ministry of Foreign Affairs, General Directorate
for the Promotion of the Country System’’, Bilateral and Mul-
tilateral Scientific and Technological Cooperation Unit, for the
support through the Joint Laboratory on Biorobotics Engineer-
ing project.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect
to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, author-
ship, and/or publication of this article.
References
1. Bock T, Georgoulas C and Linner T. Towards robotic assisted
hygienic services: concept for assisting and automating daily
activities in the bathroom. Gerontechnology 2012; 11(2): 362.
2. Brose SW, Weber DJ, Salatin BA, et al. The role of assistive
robotics in the lives of persons with disability. Am J Phys Med
Rehabil 2010; 89(6): 509–521.
3. Katz S, Ford AB, Moskowitz RW, et al. Studies of illness in the
aged: the index of ADL: a standardized measure of biological
and psychosocial function. JAMA 1963; 185(12): 914–919.
4. Miskelly FG. Assistive technology in elderly care. Age Age-
ing 2001; 30(6): 455–458.
5. Zollo L, Wada K and Van der Loos HF. Special issue on
assistive robotics [from the guest editors]. Robot Autom Mag
IEEE 2013; 20(1): 16–19.
6. http://www.seatedshower.com/
7. http://www.drivemedical.co.uk/sections/bathroom-toilet-aids
8. Kirchner EA, Albiez JC, Seeland A, et al. Towards Assistive
Robotics for Home Rehabilitation. Biodevices 2013; 168–177.
9. Meng Q and Lee MH. Design issues for assistive robotics for
the elderly. Adv Eng Inform 2006; 20(2): 171–186.
10. Laschi C, Mazzolai B and Cianchetti M. Soft robotics: tech-
nologies and systems pushing the boundaries of robot abil-
ities. Sci Robot 2016; 1: eaah3690.
Ansari et al. 15
11. Kier WM and Smith KK. Tongues, tentacles and trunks: the
biomechanics of movement in muscular-hydrostats. Zool J
Linn Soc 1985; 83(4): 307–324.
12. Kim S, Laschi C and Trimmer B. Soft robotics: a bioinspired
evolution in robotics. Trends Biotechnol 2013; 31(5):
287–294.
13. Trivedi D, Rahn CD, Kier WM, et al. Soft robotics: biological
inspiration, state of the art, and future research. Appl Bionics
Biomech 2008; 5(3): 99–117.
14. Onal CD and Rus D. A modular approach to soft robots. In:
Biomedical robotics and biomechatronics (BioRob), 2012 4th
IEEE RAS & EMBS International Conference, 24 June 2012,
pp. 1038–1045. IEEE.
15. Robinson G and Davies JB. Continuum robots-a state of the
art. In: Robotics and Automation, 1999. Proceedings. 1999
IEEE International Conference on 1999, Vol. 4, pp.
2849–2854. IEEE.
16. Loeve A, Breedveld P and Dankelman J. Scopes too flexible
and too stiff. IEEE Pulse 2010; 1(3): 26–41.
17. Immega G and Antonelli K. The KSI tentacle manipulator.
In: robotics and automation, 1995. proceedings., 1995 IEEE
International Conference on, 21 May 1995, Vol. 3, pp.
3149–3154. IEEE.
18. Tsukagoshi H, Kitagawa A, Segawa M, et al. Active hose: an
artificial elephant’s nose with maneuverability for rescue oper-
ation. In: Robotics and automation, 2001. proceedings 2001
ICRA. IEEE international conference on, 2001, Vol. 3. IEEE.
19. Pritts MB and Rahn CD. Design of an artificial muscle con-
tinuum robot. In: Robotics and automation, 2004, Proceed-
ings. ICRA’04. 2004 IEEE International Conference on, 26
April 2004, vol. 5, pp. 4742–4746. IEEE.
20. McMahan W, Jones BA and Walker ID. Design and implemen-
tation of a multi-section continuum robot: Air-Octor. In: Intelligent
Robots and Systems, 2005. (IROS 2005). 2005 IEEE/RSJ Interna-
tional Conference on, 2 August 2005, pp. 2578–2585. IEEE.
21. McMahan W, Chitrakaran V, Csencsits M, et al. Field trials
and testing of the OctArm continuum manipulator. In: Pro-
ceedings 2006 IEEE International Conference on Robotics
and Automation, 2006. ICRA, 2006. IEEE.
22. Kapadia A and Walker I. Task-space control of extensible
continuum manipulators. In: 2011 IEEE/RSJ International
Conference on Intelligent Robots and Systems.
23. Braganza D, Dawson M, Walker ID, et al. A neural network
controller for continuum robots. IEEE Trans Robot 2007;
23(6): 1270–1277.
24. Giorelli M, Renda F, Ferri G, et al. A feed-forward neural
network learning the inverse kinetics of a soft cable-driven
manipulator moving in three-dimensional space. In: Intelli-
gent Robots and Systems (IROS), 2013 IEEE/RSJ Interna-
tional Conference on, 3 Nov 2013, pp. 5033–5039. IEEE.
25. Grzesiak A, Becker R and Verl A. The bionic handling assis-
tant: a success story of additive manufacturing. Assem Autom
2011; 31(4): 329–333.
26. Rolf M and Steil JJ. Efficient exploratory learning of inverse
kinematics on a bionic elephant trunk. Neural Networks and
Learning Systems. IEEE Trans 2014; 25(6): 1147–1160.
27. Cheng NG, Lobovsky MB, Keating SJ, et al. Design and
analysis of a robust, low-cost, highly articulated manipulator
enabled by jamming of granular media. In: Robotics and
Automation (ICRA), 2012 IEEE International Conference
on, 2012. IEEE.
28. Stilli A, Helge AW, Althoefer K, et al. Shrinkable, stiff-
ness-controllable soft manipulator based on a bio-inspired
antagonistic actuation principle. In: 2014 IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems,
2014. IEEE.
29. Ranzani T, Cianchetti M, Gerboni G, et al. A Soft modular
manipulator for minimally invasive surgery: design and char-
acterization of a single module.
30. Shiva A, Stilli A, Noh Y, et al. Tendon-based stiffening for a
pneumatically actuated soft manipulator. IEEE Robot Autom
Lett 2016; 1(2): 632–637.
31. Marchese AD and Rus D. Design, kinematics, and control of
a soft spatial fluidic elastomer manipulator. Int J Robot Res
2016; 35(7).
32. Yang Q, Zhang L, Bao G, et al. Research on novel flexible
pneumatic actuator FPA. In: Robotics, Automation and
Mechatronics, 2004 IEEE Conference on, 1 December
2004, Vol. 1, pp. 385–389. IEEE.
33. Kang R, Branson DT, Zheng T, et al. Design, modeling and
control of a pneumatically actuated manipulator inspired by
biological continuum structures. Bioinspir Biomim 2013;
8(3): 036008.
34. De Volder M, Moers AJ and Reynaerts D. Fabrication and
control of miniature mckibben actuators. Sensor Actuat A
2011; 166(1): 111–116.
35. Manti M, Pratesi A, Falotico E, et al. Soft Assistive Robot for
Personal Care of Elderly People. In: 6th IEEE RAS / EMBS
International Conference on Biomedical Robotics and Bio-
mechatronics, 2016
36. Vannucci L, Cauli N, Falotico E, et al. Adaptive visual pur-
suit involving eye-head coordination and prediction of the
target motion. In: IEEE-RAS International Conference on
Humanoid Robots, 2014, pp. 541–546
37. Vannucci L, Falotico E, Di Lecce N, et al. Integrating feed-
back and predictive control in a bio-inspired model of visual
pursuit implemented on a humanoid robot. Lecture Notes in
Computer Science (including subseries Lecture Notes in Arti-
ficial Intelligence and Lecture Notes in Bioinformatics) 2015;
9222: 256–267
38. Ansari Y, Falotico E, Mollard Y, et al. A Multi-agent rein-
forcement learning approach for inverse kinematics of high
dimensional manipulators with precision positioning. In: 6th
IEEE RAS / EMBS International Conference on Biomedical
Robotics and Biomechatronics, 2016.
39. Wooldridge MJ. An introduction to multiagent systems. John
Wiley & Sons, 2009.
40. Sutton RS and Barto AG. Reinforcement learning: An intro-
duction. Cambridge: MIT Press, 1998.
41. Claus C and Boutilier C. The dynamics of reinforcement
learning in cooperative multiagent systems. In: AAAI/IAAI,
26 July 1998, pp. 746–752.
16 International Journal of Advanced Robotic Systems
42. Sandholm TW and Robert HC. Multiagent reinforcement
learning in the iterated prisoner’s dilemma. Biosystems
1996; 37(1): 147–166.
43. Watkins C. Learning from delayed rewards. University of
Cambridge, 1989.
44. Howard RA. Dynamic programming and Markov processes,
1960.
45. Rummery GA and Mahesan N. On-line Q-learning using
connectionist systems, University of Cambridge, Department
of Engineering, 1994.
46. Robbins H and Sutton M. A stochastic approximation
method. Ann Math Stat 1951: 400–407.
47. Singh S, Jaakkola T, Littman ML, et al. Convergence results
for single-step on-policy reinforcement-learning algorithms.
Mach Learn 2000; 38(3): 287–308.
48. Singh S, Jaakkola T, Littman ML, et al. Convergence results
for single-step on-policy reinforcement-learning algorithms.
Mach Learn 2000; 38(3): 287–308.
49. Bellman RE. Dynamic programming. Princeton University-
Press, BellmanDynamic programming, 1957: 151.
Ansari et al. 17