Learnable Strategies for Bilateral Agent Negotiation over ...

Learnable Strategies for Bilateral Agent Negotiation over Multiple Issues

Pallavi Bagga1 , Nicola Paoletti2 , Kostas Stathis3Royal Holloway, University of London, UK

pallavi.bagga.20171, nicola.paoletti2, [email protected]

AbstractWe present a novel bilateral agent negotiation model sup-porting multiple issues under user preference uncertainty.This model relies upon interpretable strategy templates rep-resenting agent negotiation tactics with learnable parame-ters. It also uses deep reinforcement learning to evaluatethreshold utilities for those tactics that require them. To han-dle user preference uncertainty, our approach uses stochas-tic search to derive the user model that best approximatesa given partial preference profile. Multi-objective optimiza-tion and multi-criteria decision-making methods are appliedat negotiation time to generate near Pareto-optimal bids. Weempirically show that our agent model outperforms win-ning agents of previous automated negotiation competitionsin terms of the individual as well as social-welfare utilities.Also, the adaptive nature of our learning-based agent modelprovides a significant advantage when our agent confrontsunknown opponents in unseen negotiation scenarios.

IntroductionAn important problem in multi-issue bilateral negotiation ishow to model a self-interested agent learn to adapt its strat-egy while negotiating with other agents. A model of thiskind mostly considers the application preferences of the userthe agent represents. Think, for instance, of bilateral negoti-ation in e-commerce, where a buyer agent settles the productprice according to user preferences such as product colour,payment method and delivery time (Fatima, Wooldridge,and Jennings 2006). In practice, users would express theirpreferences by ranking only a few representative examplesinstead of specifying a utility function (Tsimpoukis et al.2018). Hence, agents are uncertain about the complete userpreferences and lack knowledge about the preferences andcharacteristics of their opponents (Baarslag et al. 2016). Insuch uncertain settings, predefined one-size-fits-all heuris-tics are unsuitable for representing a strategy, which mustlearn from opponent actions in different domains.

To this aim, we introduce ANESIA (Adaptive NEgotiationmodel for a Self-Interested Autonomous agent), an agentmodel that provides a learnable, adaptive and interpretablestrategy under user and opponent preferences’ uncertainty.ANESIA builds on so-called strategy templates, i.e., para-metric strategies that incorporate multiple negotiation tac-

tics for the agent to choose from. A strategy template is de-scribed by a set of condition-action rules to be applied atdifferent stages during the negotiation. Templates require noassumptions from the agent developer as to which tactic tochoose during negotiation: from a template, we automati-cally learn the best combination of tactics to use at any timeduring the negotiation. Being logical combinations of indi-vidual tactics, the resulting strategies are interpretable and,thus, can be explained to the user. Although the templateparameters are learned before a negotiation begins, ANESIAallows for online learning and adaptation as well.

ANESIA is an actor-critic architecture combining DeepReinforcement Learning (DRL) (Lillicrap et al. 2016) withmeta-heuristic optimisation. DRL allows ANESIA to learna near-optimal threshold utility dynamically, adapting thestrategy in different domains and against unknown oppo-nents. ANESIA neither accepts nor proposes any bid belowthis utility. To deal with user preference uncertainty, we usea single-objective meta-heuristic (Yang and Karamanoglu2020; Yang 2009) that handles large search spaces optimally,and converges towards a good solution in a limited numberof iterations, using a limited population size (Talbi 2009).This estimated user model ensures that our agent receives asatisfactory utility from opponent bids. Also, ANESIA gener-ates near-Pareto-optimal bids leading to win-win situationsby combining multi-objective optimization (MOO) (Debet al. 2002) and multi-criteria decision-making (Hwang andYoon 1981) on top of estimated user and opponent models.

To evaluate the effectiveness of our approach against thestate-of-the-art, we conduct simulation experiments basedon the ANAC tournaments (Jonker et al. 2017). We havechosen domains available in the GENIUS tool (Lin et al.2014), with different domain sizes and competitiveness lev-els (Williams et al. 2014). We play against winning agentswith learning capabilities (from ANAC’17 and ’18) andagents that deal with user preference uncertainty (fromANAC’19). These agents span a wide range of strategiesand techniques1. Empirically, ANESIA outperforms existingstrategies in terms of individual and social welfare utilities.

Related work Existing approaches for a bilateral multi-

1E.g., Genetic algorithm-SAGA, Bayesian approach-FSEGA,Gaussian process-AgentGP, Tabu Search-KakeSoba, LogisticRegression-AgentHerb, and Statistical frequency model-AgentGG.

arX

iv:2

009.

0830

2v2

[cs

.MA

] 7

Jan

202

2

issue strategy with reinforcement learning have focused onTabular Q-learning to learn what to bid (Bakker et al. 2019)or DQN to learn when to accept a bid (Razeghi, Yavaz,and Aydogan 2020). Unlike our work, these approachesare neither optimal for continuous action spaces nor canhandle user preference uncertainty. The negotiation modelof (Bagga et al. 2020) uses DRL to learn a complete strat-egy, but it is for single issue only. All the above approachesuse DRL to learn acceptance or bidding strategies. Insteadwe learn a threshold utility as part of (possibly more sophis-ticated) tactics for acceptance and bidding.

Past work uses meta-heuristics to explore the outcomespace and find desired bids or for bid generation (Silva et al.2018; De Jonge and Sierra 2016; El-Ashmawi et al. 2020;Sato and Ito 2016; Kadono 2016; Klein et al. 2003; Sato andIto 2016). We, too, use population-based meta-heuristic, butfor estimating the user model that best agrees with the partialuser preferences. In particular, we employ the Firefly Algo-rithm (FA) (Yang 2009) that has shown its effectiveness incontinuous optimization, but has not been tested until nowin the context of automated negotiation. Moreover, while theGenetic Algorithm NSGA-II (Deb et al. 2002) for MOO hasbeen used previously to find multiple Pareto-optimal solu-tions during negotiation (Hashmi et al. 2013), we are thefirst to combine NSGA-II with TOPSIS (Hwang and Yoon1981) to choose one best among a set of ranked near Pareto-optimal outcomes.

Negotiation EnvironmentWe assume bilateral negotiations where two agents inter-act with each other in a domain D, over n different inde-pendent issues, D = (I1, I2, . . . In), with each issue tak-ing a finite set of k possible discrete or continuous valuesIi = (vi1, . . . v

ik). In our experiments, we consider issues

with discrete values. An agent’s bid ω is a mapping fromeach issue to a chosen value (denoted by ci for the i-th is-sue), i.e., ω = (v1

c1 , . . . vncn). The set of all possible bids or

outcomes is called outcome space Ω s.t. ω ∈ Ω. The out-come space is common knowledge to the negotiating partiesand stays fixed during a single negotiation session.

Negotiation protocol Before the agents can begin the ne-gotiation and exchange bids, they must agree on a ne-gotiation protocol P , which determines the valid movesagents can take at any state of the negotiation (Fatima,Wooldridge, and Jennings 2005). Here, we consider the al-ternating offers protocol (Rubinstein 1982) due to its sim-plicity and wide use. The set of protocol’s possible Actionsare: offer(ω), accept , reject. One of the agents (say Au)starts a negotiation by making an offer X to the other agent(say Ao). Agent Ao can either accept or reject the offer. If itaccepts, the negotiation ends with an agreement, otherwiseAo makes a counter-offer to Au. This process of making of-fers continues until one of the agents either accepts an offer(i.e., success) or the deadline is reached (i.e., failure). More-over, we assume that the negotiations are sensitive to time,i.e. time impacts the utilities of the negotiating parties. Inother words, the value of an agreement decreases over time.

Utility Each agent has a preference profile, reflecting the is-

sues important to the agent, which values per issue are pre-ferred over other values, and on the whole provides a (par-tial) ranking over all possible deals (Marsa-Maestre et al.2014). In contrast to Ω, the agent’s preference profile is pri-vate information and is given in terms of a utility functionU . U is defined as a weighted sum of evaluation functions,ei(v

ici) as shown in (1). Each issue is evaluated separately

and contributes linearly to U . Such a U is a very commonutility model and is also called a Linear Additive Utilityspace. Here, wi are the normalized weights indicating eachissue’s importance to the user and ei(vici) is an evaluationfunction that maps the vici value of the ith issue to a utility.U(ω) = U(v1

c1 , . . . vncn) =

∑ni=1 wi · ei(vici), where

∑ni=1 wi = 1

(1)Note that U does not take dependencies between issuesinto account. Whenever the negotiation terminates withoutany agreement, each negotiating party gets its correspond-ing utility based on the private reservation2 value (ures).In case the negotiation terminates with an agreement, eachagent receives the discounted utility of the agreed bid, i.e.,Ud(ω) = U(ω)dtD. Here, dD is a discount factor in the in-terval [0, 1] and t ∈ [0, 1] is current normalized time.

User and opponent utility models In our settings, the ne-gotiation environment is one with incomplete information,because the user utility model Uu is unknown. Only par-tial preferences are given, i.e., a partial order over B bidsw.r.t. Uu s.t. ω1 ω2 → Uu(ω1) ≤ Uu(ω2). Hence, duringthe negotiation, one of the objectives of our agent is to de-rive an estimate Uu of the real utility function Uu from thegiven partial preferences3. This leads to a single-objectiveconstrained optimization problem,

maxw1,...,wn,

e1(v1c1

),...,en(vncn

)

ρ

(n∑

i=1

wi · ei(vici), B

)

s. t.n∑

i=1

wi = 1

wi > 0 and 0 ≤ ei(vici),∀i ∈ n

(2)

where B is the incomplete sequence of known bid prefer-ences (ordered by ), and ρ is a measure of ranking similar-ity (e.g., Spearman correlation) between the estimated rank-ing of Uu and the true, but partial, bid ranking B. We alsoassume that our agent is unaware of the utility structure ofits opponent agent Uo. Hence, to increase the agreement rateover multiple issues, another objective of our agent is to gen-erate the (near) Pareto-optimal solutions during the negotia-tion which can be defined as a MOO problem as follows:

maxω∈Ω

(Uu(ω), Uo(ω)) (3)

2The reservation value is the minimum acceptable utility for anagent. It may vary for different parties and different domains. Inour settings, it is the same for both parties.

3Humans do not necessarily use an explicit utility function.Also, preference elicitation can be tedious for users since they haveto interact with the system repeatedly (Baarslag and Kaisers 2017).As a result, agents should accurately represent users under minimalpreference information (Tsimpoukis et al. 2018).

In (3), we have two objectives: Uu, the user’s estimated util-ity, and Uo, the opponent’s estimated utility. A bid ω∗ ∈ Ωis Pareto optimal if no other bid exists ω ∈ Ω that Pareto-dominates ω∗. In our case, a bid ω1 Pareto-dominates ω2 iff:(

Uu(ω1) ≥ Uu(ω2) ∧ Uo(ω1) ≥ Uo(ω2))∧(

Uu(ω1) > Uu(ω2) ∨ Uo(ω1) > Uo(ω2)) (4)

The ANESIA ModelOur agent Au is situated in an environment E (containingthe opponent agent Ao) where at any time t, Au senses thecurrent state St of E and represents it as a set of internalattributes. These include information derived from the se-quence of previous bids offered by Ao (e.g., utility of thebest opponent bid so far Obest, average utility of all the op-ponent bids Oavg and their variability Osd) and informationstored in our agent’s knowledge base (e.g., number of bidsB in the given partial order, dD, ures, Ω, and n), and thecurrent negotiation time t. This internal state representation,denoted with st, is used by the agent (in acceptance and bid-ding strategies) to decide what action at to execute. Actionexecution then changes the state of the environment to St+1.

Learning in ANESIA4 mainly consists of three compo-nents: Decide, Negotiation Experience, and Evaluate. De-cide refers to the negotiation strategy for choosing a near-optimal action at among a set of Actions at a particularstate st based on a protocol P . Action at is derived via twofunctions, fa and fb, for the acceptance and bidding strate-gies, respectively. Function fa takes as inputs st, a dynamicthreshold utility ut (defined later in the Methods section), thesequence of past opponent bids Ωo

t , and outputs a discreteaction at among accept or reject. When fa returns reject, fbcomputes what to bid next, with input st and ut, see (5–6).This separation of acceptance and bidding strategies is notrare, see for instance (Baarslag et al. 2014).

fa(st, ut,Ωot ) = at, at ∈ accept , reject (5)

fb(st, ut,Ωot ) = at, at ∈ offer(ω), ω ∈ Ω (6)

Since we assume incomplete user and opponent preferenceinformation, Decide uses the estimated models Uu and Uo.In particular, Uu is estimated once before the negotiationstarts by solving (2) and using the given partial prefer-ence profile . This encourages agent autonomy and avoidscontinuous user preference elicitation. Similarly, Uo is es-timated at time t using information from Ωo

t , see Methodssection for more details.

Negotiation Experience stores historical informationabout N previous interactions of an agent with other agents.Experience elements are of the form 〈st, at, rt, st+1〉, wherest is the internal state representation of the negotiation envi-ronment E, at is the performed action, rt is a scalar rewardreceived from the environment and st+1 is the new agentstate after executing at.

4See Appendix for the interaction between components of ANE-SIA architecture.

Evaluate refers to a critic helping ANESIA learn the dy-namic threshold utility ut, which evolves as new experienceis collected. More specifically, it is a function of random K(K < N ) experiences fetched from the agent’s memory.Learning ut is retrospective since it depends on the rewardrt obtained fromE by performing at at st. The reward valuedepends on the (estimated) discounted utility of the last bidreceived from the opponent, ωo

t , or of the bid accepted byeither parties ωacc and defined as follows:

rt =

Uu(ωacc, t), on agreementUu(ωo

t , t), on received offer−1, otherwise.

(7)

Uu(ω, t) is the discounted reward of ω defined as:

Uu(ω, t) = U(ω) · dt, d ∈ [0, 1] (8)

where d is a temporal discount factor to encourage the agentto negotiate without delay. We should not confuse d, whichis typically unknown to the agent, with the discount factorused to compute the utility of an agreed bid (dD).

Strategy templates: One common way to define the accep-tance (fa) and bidding (fb) strategies is via a combinationof hand-crafted tactics that, by empirical evidence or do-main knowledge, are known to work effectively. However, afixed set of tactics might not well adapt to multiple differentnegotiation domains. ANESIA does not assume pre-definedstrategies for fa and fb, and learns these strategies offline.We run multiple negotiations between our agent and a poolof opponents. We select the combination of tactics that max-imizes the true user utility over these negotiations. So, in thisstage only, we assume that the true user model is known.

To enable strategy learning, we introduce strategy tem-plates, i.e., parametric strategies incorporating a series oftactics, where each tactic is executed for a specific negotia-tion phase. The parameters describing the start and durationof each phase, as well as the particular tactic choice for thatphase are all learnable (blue-colored in (9), (10)). Moreover,tactics can expose, in turn, learnable parameters themselves.

We assume a collection of acceptance and bidding tactics,Ta and Tb. Each ta ∈ Ta maps the agent state, thresholdutility, opponent bid history, and a (possibly empty) vectorof learnable parameters p into a utility value: if the agent isusing tactic ta and ta(st, ut,Ω

ot ,p) = u, then it will not

accept any offer with utility below u, see (9) below. Eachtb ∈ Tb is of the form tb(st, ut,Ω

ot ,p) = ω where ω ∈

Ω is the bid returned by the tactic. An acceptance strategytemplate is a parametric function given by∧na

i=1 t ∈ [ti, ti+1)→(∧ni

j=1 ci,j → U(ωot ) ≥ ti,j(st, ut,Ω

ot ,pi,j)

)(9)

where na is the number of phases; t1 = 0, tna+1 = 1, andti+1 = ti + δi, where the δi parameter determines the dura-tion of the i-th phase; for each phase i, the strategy templateincludes ni tactics to choose from: ci,j is a Boolean choiceparameter determining whether tactic ti,j ∈ Ta should beused during the i-th phase. We note that (9) is a predicate

returning whether or not the opponent bid ωot is accepted.

Similarly, a bidding strategy template is defined by

⋃nb

i=1

ti,1(st, ut,Ω

ot ,pi,1) if t ∈ [ti, ti+1) and ci,1

· · · · · ·ti,ni(st, ut,Ω

ot ,pi,ni) if t ∈ [ti, ti+1) and ci,n

(10)where nb is the number of phases, ni is the number of op-tions for the i-th phase, and ti,j ∈ Tb. ti and ci,j are definedas in the acceptance template. The particular libraries of tac-tics used in this work are discussed in the next Section. Westress that both (9) and (10) describe time-dependent strate-gies where a given choice of tactics is applied at differentphases (denoted by the condition t ∈ [ti, ti+1)).

MethodsUser modelling: Before the negotiation begins, we estimatethe user model Uu by finding the weights wi and utilityvalues ei(vici) for each issue i, see (1), so that the result-ing bid ordering best fits the given partial order of bids.To solve this optimization problem (2), we use FA (Yang2009), a meta-heuristic inspired by the swarming and flash-ing behaviour of fireflies, because, in our preliminary anal-yses, it outperformed other traditional nature-inspired meta-heuristics such as GA and PSO (Rawat 2021). We computethe fitness of a candidate solution (i.e., the user model U ′u)as the Spearman’s rank correlation coefficient ρ between theestimated ranking of U ′u and the true, but partial, bid ranking. The coefficient ρ ∈ [−1, 1] is indeed a similarity measurebetween two rankings, assigning a value of 1 for identicaland −1 for opposed rankings.

Opponent modelling: To derive an estimate of the oppo-nent model Uo during negotiation, we use the distribution-based frequency model proposed in (Tunalı, Aydogan, andSanchez-Anguix 2017). In this model, the empirical fre-quency of the issue values in Ωo

t provides an educated guesson the opponent’s most preferred issue values. The issueweights are estimated by analysing the disjoint windows ofΩo

t , giving an idea of the shift of opponent’s preferencesfrom its previous negotiation strategy over time.

Utility threshold learning: We use an actor-critic archi-tecture with model-free deep reinforcement learning (i.e.,Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al.2016)) to predict the target threshold utility ut. We considera model-free RL approach because our problem is how tomake an agent decide what target threshold utility to set nextin a negotiation dialogue rather than predicting the new stateof the environment, which implies model-based RL. Thus,ut is expressed as a deep neural network function, whichtakes the agent state st as an input (see previous section forthe list of attributes). Prior to RL, our agent’s strategy is pre-trained with supervision from synthetic negotiation data. Tocollect supervision data, we use the GENIUS simulation en-vironment (Lin et al. 2014), which supports multi-issue bi-lateral negotiation for different domains and user profiles.In particular, data was generated by running the winner of

the ANAC’19 (AgentGG) against other strategies5 in threedifferent domains6 and assuming no user preference uncer-tainties (Aydogan et al. 2020). This initial supervised learn-ing (SL) stage helps our agent decrease the exploration timerequired for DRL during the negotiation, an idea primarilyinfluenced by the work of (Bagga et al. 2020).

Strategy learning: The parameters of the acceptance andbidding strategy templates (9–10) are learned by runningthe FA meta-heuristic. We define the fitness of a particu-lar choice of template parameters as the average true userutility over multiple negotiations rounds under the concretestrategy implied by those parameters. Negotiation data isobtained by running our agent on the GENIUS platformagainst three (readily available) opponents (AgentGG, Kake-Soba and SAGA) in three different negotiation domains6.

We now describe the libraries of tactics used in our tem-plates. As for the acceptance tactics, we consider:

• Uu(ωt), the estimated utility of the bid that our agentwould propose at the time t (ωt = fb(st, ut,Ω

ot )).

• QUu(Ωot )(a · t + b), where Uu(Ωo

t ) is the distribution of(estimated) utility values of the bids in Ωo

t ,QUu(Bo(t))(p)

is the quantile function of such distribution, and a andb are learnable parameters. In other words, we considerthe p-th best utility received from the agent, where p isa learnable (linear) function of the negotiation time t. Inthis way, this tactic automatically and dynamically de-cides how much the agent should concede at time t.

• ut, the dynamic DRL-based utility threshold.• u, a fixed, but learnable, utility threshold.

The bidding tactics in our library are:

• bBoulware , a bid generated by a time-dependent Boulwarestrategy (Fatima, Wooldridge, and Jennings 2001).

• PS(a · t + b) extracts a bid from the set of Pareto-optimal bids PS (see (4)), derived using the NSGA-IIalgorithm7 (Deb et al. 2002) under Uu and Uo. In par-ticular, it selects the bid that assigns a weight of a · t+ bto our agent utility (and 1− (a · t+ b) to the opponent’s),where a and b are learnable parameters telling how thisweight scales with the negotiation time t. The TOPSISalgorithm (Hwang and Yoon 1981) is used to derive sucha bid, given the weighting a · t+ b as input.

• bopp(ωot ), a tactic to generate a bid by manipulating the

last bid received from the opponent ωot . This is modified

in a greedy fashion by randomly changing the value ofthe least relevant issue (w.r.t. U ) of ωo

t .• ω ∼ U(Ω≥ut), a random bid above our DRL-based util-

ity threshold ut8.

5Gravity, HardDealer, Kagent, Kakesoba, SAGA, and SACRA.6Laptop, Holiday and Party.7Meta-heuristics (instead of brute-force) for Pareto-optimal so-

lutions have the potential to deal efficiently with continuous issues.8U(S) is the uniform distribution over S, and Ω≥ut is the sub-

set of Ω whose bids have estimated utility above ut w.r.t. U .

Below, we give an example of a concrete acceptance strat-egy learned in our experiments: it employs time-dependentquantile tactic during the middle of the negotiation, and theDRL threshold utility during the initial and final stages.

t ∈ [0.0, 0.4)→ U(ωot ) ≥ ut ∧ u

t ∈ [0.4, 0.7)→ U(ωot ) ≥ U(ωt) ∧QU(Ωo

t )(−0.67 · t+ 1.27)

t ∈ [0.7, 0.95)→ U(ωot ) ≥ U(ωt) ∧QU(Ωo

t )(−0.21 · t+ 0.9)

t ∈ [0.95, 1.0]→ U(ωot ) ≥ ut

Below is an example of a learned concrete bidding strategy:it behaves in a Boulware-like manner in the initial stage, af-ter which it proposes near Pareto-optimal bids (between time0.4 and 0.9) and opponent-oriented bid in the final stage.

t ∈ [0.0, 0.4)→ ω = bBoulware

t ∈ [0.4, 0.9)→ ω = PS(−0.75 · t+ 0.6)

t ∈ [0.9, 1.0]→ ω = bopp(ωot )

We stress that our approach allows to automatically devisesuch combinations of tactics so as to achieve optimal userutility, which would be infeasible manually.

Domain (n) Ordinal Accuracy (↑) Cardinal Inaccuracy (↓)|B| 5% of Ω 10% of Ω 5% of Ω 10% of ΩAirportSite (3) (0.75,0.75) (0.85,0.87) (0.47,0.54) (0.78,0.76)Camera (6) (0.77,0.63) (0.83,0.75) (0.32,0.32) (0.69,0.41)Energy (6) (0.74,0.78) (0.83,0.84) (0.56,0.61) (0.57,0.69)Fitness (5) (0.67,0.67) (0.70,0.75) (0.55,0.47) (0.46,0.59)Flight (3) (0.75,0.85) (0.82,0.90) (0.65,0.75) (0.58,0.79)Grocery (5) (0.67,0.67) (0.75,0.72) (0.35,0.42) (0.39,0.56)Itex-Cypress (4) (0.70,0.74) (0.78,0.80) (0.56,0.48) (0.74,0.56)Outfit (4) (0.70,0.75) (0.80,0.84) (0.71,0.88) (0.89,0.79)

Table 1: Evaluation of User Modelling using FA for two pro-files (separated by comma) in each domain

Domain IGD (↓) Domain IGD (↓)Airport Site (|Ω| = 420) 0.000 Flight (|Ω| = 48) 0.006Camera (|Ω| = 3600) 0.000 Grocery (|Ω| = 1600) 0.000Energy (|Ω| = 15625) 0.011 Itex-Cypress (|Ω| = 180) 0.000Fitness (|Ω| = 3520) 0.012 Outfit (|Ω| = 128) 0.000

Table 2: Evaluation of Pareto Frontier using Inverted Gener-ational Distance estimated using NSGA-II

Experimental ResultsAll the experiments have been performed using the GENIUSnegotiation platform (Lin et al. 2014). Our experiments aredesigned to prove the following hypotheses:Hypothesis A: Our approach can well approximate usermodels under user preference uncertainty.Hypothesis B: The set of NSGA-II estimated Pareto-optimal bids are close to the true Pareto-optimal front.Hypothesis C: ANESIA outperforms the “teacher” strategies(AgentGG, KakeSoba and SAGA) in known negotiation set-tings in terms of individual and social efficiency.Hypothesis D: ANESIA outperforms not-seen-before nego-tiation strategies and adapts to different negotiation settingsin terms of individual and social efficiency.

Performance metrics: We measure the performance of eachagent in terms of six widely-adopted metrics inspired by theANAC competition:

• U totalind : The utility gained by an agent averaged over all

the negotiations (↑);• Us

ind : The utility gained by an agent averaged over all thesuccessful negotiations (↑);

• Usoc : The utility gained by both negotiating agents aver-aged over all successful negotiations (↑);

• Pavg : Average minimal distance of agreements from thePareto Frontier (↓).

• Ravg : Average number of rounds before reaching theagreement (↓);

• S%: Proportion of successful negotiations (↑).The first and second measures represent individual efficiencyof an outcome, whereas the third and fourth correspond tothe social efficiency of agreements.

Experimental settings: ANESIA is evaluated against state-of-the-art strategies that participated in ANAC’17, ’18, and’19, and designed by different research groups indepen-dently. Each agent has no information about another agent’sstrategies beforehand. Details of all these strategies areavailable in (Aydogan et al. 2018; Jonker and Ito 2020;Aydogan et al. 2020). We assume incomplete informationabout user preferences, given in the form of B randomly-chosen partially-ordered bids. We evaluate ANESIA on 8negotiation domains which are different from each otherin terms of size and opposition (Baarslag et al. 2013) toensure good negotiation characteristics and to reduce anybiases. The domain size refers to the number of issues,whereas opposition9 refers to the minimum distance fromall possible outcomes to the point representing completesatisfaction of both negotiation parties (1,1). For our ex-periments, we choose readily-available 3 small-sized, 2medium-sized, and 3 large-sized domains. Out of these do-mains, 2 are with high, 3 with medium and 3 with low op-position (see (Williams et al. 2014) for more details).

For each configuration, each agent plays both roles inthe negotiation to compensate for any utility differences inthe preference profiles. We call user profile the agent’s rolealong with the user’s preferences. We set two user prefer-ences uncertainties for each role: |B| = 5%|Ω| and |B| =10%|Ω|. Also, we set the ures and dD to their respective de-fault values, whereas the deadline is set to 60s, normalizedin [0, 1] (known to both negotiating parties in advance).

Regarding the optimization algorithms, for FA (hypothe-ses A and C), we choose a population size of 20 and 200generations for user model estimation and learning of strat-egy template parameters. We also set the maximum attrac-tiveness value to 1.0 and absorption coefficient to 0.01. ForNSGA-II (hypothesis B), we choose the population size of2% × |Ω|, 2 generations and mutation count of 0.1. With

9The value of opposition reflects the competitiveness betweenparties in the domain. Strong opposition means a gain of one partyis at the loss of the other, whereas, weak opposition means that bothparties either lose or gain simultaneously (Baarslag et al. 2013).

Agent Ravg(↓) Pavg(↓) Usoc(↑) U totalind (↑) Us

ind(↑) S%(↑)ANESIA 301.64 ± 450.60 0.12 ± 0.36 1.55 ± 0.73 0.39 ± 0.41 0.92 ± 0.11 0.46AgentGG 1327.51 ± 2246.83 0.30 ± 0.37 1.10 ± 0.67 0.65 ± 0.39 0.87 ± 0.13 0.75KakeSoba 1154.83 ± 2108.11 0.22 ± 0.34 1.26 ± 0.62 0.71 ± 0.35 0.88 ± 0.12 0.81SAGA 287.34 ± 1058.52 0.18 ± 0.31 1.36 ± 0.56 0.58 ± 0.26 0.67 ± 0.14 0.87

Table 3: Performance Comparison of ANESIA with “teacher” strategies. Results are averaged for all the domains and profiles.See Appendix for separate results for each domains.

these hyperparameters, on our machine10 the run-time ofNSGA-II never exceeded the given timeout of 10s for de-ciding an action at each turn, while being able to retrieveempirically good solutions.

Empirical EvaluationHypothesis A: User Modelling We used two measures todetermine the difference between Uu and Uu (Wachowicz,Kersten, and Roszkowska 2019): First, Ordinal accuracy(OA) measures the proportion of bids put by U in the correctrank order (i.e., as defined by the true user model), wherean OA value of 1 implies a 100% correct ranking. Second,to capture the scale of cardinal errors, Cardinal Inaccuracy(CI) measures the differences in ratings assigned in the es-timated and true user models for all the elements in domainD. We produced results in 8 domains and two profiles (5%and 10% of total possible bids) which are averaged over 10simulations as shown in Table 1. All the values of OA (↑)and CI (↓), in each domain, for both the user profiles, are ≥0.67 and ≤ 0.90 respectively, which is quite accurate giventhe uncertainty and the fact that the CI value ∝ |D|.Hypothesis B: Pareto-Optimal Bids We used a popularmetric called Inverted Generational Distance (IGD) (Cheng,Shi, and Qin 2012) to compare the Pareto Fronts found bythe NSGA-II and the ground truth (found via brute force).Small IGD values suggest good convergence of solutions tothe Pareto Front and their good distribution over the entirePareto Front. Table 2 demonstrates the potential of NSGA-II11 for generating the Pareto-optimal bids as well as thecloseness of true utility models.

Hypothesis C: ANESIA outperforms “teacher” strate-gies We performed a total of 1440 negotiation sessions12

to evaluate the performance of ANESIA against the three“teacher” strategies (AgentGG, KakeSoba and SAGA) inthree domains (Laptop, Holiday, and Party) for two differ-ent profiles (|B| = 10, 20). These strategies were used tocollect the dataset in the same domains for supervised train-ing before the DRL process begins. Table 3 demonstratesthe average results over all the domains and profiles for eachagent. Clearly, ANESIA outperforms the “teacher” strategiesin terms of Us

ind (i.e., individual efficiency), Usoc , and Pavg

(i.e., social efficiency).

Hypothesis D: Adaptive Behaviour of ANESIA agent We10CPU: 8 Cores, 2.10GHz; RAM: 32GB11Population size = 0.02× |Ω|; Number of generations = 2.12n× (n−1)/2×x×y×z×w = 1440 where n = 4, number

of agents in a tournament; x = 2, because agents play both sides;y = 3, number of domains; z = 20, because each tournament isrepeated 20 times; w = 2, number of profiles in terms of B.

further evaluated the performance of ANESIA on agents(from ANAC’17, ANAC’18 and ANAC’19) unseen duringtraining. For this, we performed a total of 23040 negotiationsessions13. Results in Table 4(A) are averaged over all do-mains and profiles, and demonstrate that ANESIA learns tomake the optimal choice of tactics to be used at run time andoutperforms the other 8 strategies in terms of Us

ind and Usoc .

Ablation Study 1: We evaluated the ANESIA-DRL perfor-mance, i.e., an ANESIA agent that does not use templates tolearn optimal combinations of tactics, but uses only one ac-ceptance tactic, given by the dynamic DRL-based thresholdutility ut (and the Boulware and Pareto-optimal tactics forbidding) for the same negotiation settings of Hypothesis D.We observe from Table 4(B) that ANESIA-DRL outperformsthe other strategies in terms of Us

ind and Usoc .

Ablation Study 2: We evaluated the ANESIA-rand per-formance, i.e., an ANESIA agent which starts from a randomDRL policy, i.e., without any offline pre-training of the adap-tive utility threshold ut, for the same negotiation settings ofHypothesis D. From Table 4(B), we observe in ANESIA-rand some degradation of the utility metrics compared to thefully-fledged ANESIA and ANESIA-DRL, even though it re-mains equally competitive w.r.t. the ANAC’17,18 agents andoutperforms the ANAC’19 agents in Pavg , Us

ind and Usoc.This is not unexpected because, with a poorly informed (ran-dom) target utility tactic, the agent tends to accept offerswith little pay-off without negotiating for more rounds.

From the Table 4(B) results, we conclude that the combi-nation of both features (strategy templates and pre-trainingof the DRL model) are beneficial, even though these two fea-tures perform well in isolation too. During the experiments,we observed that ANESIA becomes picky by setting the dy-namic threshold utility higher and hence learns to focus ongetting the maximum utility from the end agreement. Thisleads to low success rate as compared to other agents. Fig-ure 1 shows an example of such threshold utility increaseover time in one of the domains (Grocery) against a set ofunknown opponents (Kakesoba and SAGA). We note thatAgentHerb is the best in terms of Pavg , which is not surpris-ing because this is one of the agents that know the true usermodel. This is clearly an unfair advantage over the agents,like ANESIA, that do not have this information. That said,ANESIA attains the best Pavg among ANAC’19 agents (un-aware of the true user model) and the second best lowestPavg among ANAC’17 and ANAC’18 agents (aware of thetrue user model). Even though they have an unfair advan-tage in knowing the true user model, we consider ANAC’17

13n× (n−1)/2×x×y×z×w = 23040 where n = 9; x = 2;y = 8; z = 20; and w = 2.

(A) Performance Analysis of fully-fledged ANESIAAgent Ravg(↓) Pavg(↓) Usoc(↑) U total

ind (↑) Usind(↑) S%(↑)

ANESIA 626.24 ± 432.45 0.17 ± 0.29 1.51 ± 0.47 0.66 ± 0.25 0.95 ± 0.06 0.51AgentGP • 1366.95 ± 1378.19 0.30 ± 0.29 0.82 ± 0.58 0.66 ± 0.22 0.88 ± 0.09 0.58FSEGA2019 • 1801.96 ± 1754.91 0.17 ± 0.19 1.15 ± 0.37 0.74 ± 0.18 0.82 ± 0.11 0.83AgentHerb 32.30 ± 50.53 0.01 ± 0.03 1.41 ± 0.10 0.45 ± 0.13 0.45 ± 0.13 1.00Agent33 4044.47 ± 4095.57 0.07 ± 0.16 1.31 ± 0.32 0.62 ± 0.15 0.64 ± 0.14 0.93Sontag 5129.47 ± 5855.11 0.10 ± 0.18 1.22 ± 0.37 0.73 ± 0.17 0.79 ± 0.11 0.86AgreeableAgent 5003.98 ± 5216.68 0.11 ± 0.23 0.9 ± 0.4 0.67 ± 0.20 0.73 ± 0.13 0.73PonpokoAgent ? 6609.73 ± 6611.18 0.18 ± 0.26 1.01 ± 0.49 0.76 ± 0.22 0.89 ± 0.07 0.74ParsCat2 ? 4971.43 ± 4911.8 0.11 ± 0.22 1.16 ± 0.42 0.75 ± 0.19 0.81 ± 0.12 0.85

(B) Performance Analysis of ANESIA-DRL (Ablation study 1)Agent Ravg(↓) Pavg(↓) Usoc(↑) U total


ANESIA-DRL 480.43 ± 418.81 0.10 ± 0.27 1.34 ± 0.51 0.65 ± 0.22 0.87 ± 0.14 0.56AgentGP • 296.58 ± 340.15 0.24 ± 0.25 0.95 ± 0.48 0.61 ± 0.20 0.69 ± 0.15 0.75FSEGA2019 • 1488.63 ± 1924.03 0.22 ± 0.20 0.97 ± 0.39 0.67 ± 0.20 0.79 ± 0.12 0.77AgentHerb 38.25 ± 79.16 0.09 ± 0.09 1.33 ± 0.15 0.48 ± 0.16 0.48 ± 0.16 0.99Agent33 2799.6 ± 3702.85 0.13 ± 0.15 1.22 ± 0.29 0.58 ± 0.15 0.59 ± 0.15 0.94Sontag 3490.01 ± 4909.81 0.17 ± 0.19 1.09 ± 0.37 0.69 ± 0.18 0.76 ± 0.13 0.85AgreeableAgent 3585.94 ± 4474.06 0.18 ± 0.21 0.78 ± 0.38 0.61 ± 0.19 0.68 ± 0.14 0.69PonpokoAgent ? 4470.86 ± 5482.31 0.25 ± 0.24 0.90 ± 0.45 0.71 ± 0.20 0.85 ± 0.09 0.72ParsCat2 ? 3430.55 ± 4204.93 0.17 ± 0.20 1.07 ± 0.39 0.69 ± 0.18 0.73 ± 0.16 0.85

(C) Performance analysis of Random ANESIA (Ablation study 2)Agent Ravg(↓) Pavg(↓) Usoc(↑) U total


ANESIA-rand 351.77 ± 343.24 0.17 ± 0.21 1.28 ± 0.43 0.67 ± 0.22 0.74 ± 0.16 0.80AgentGP • 258.12 ± 300.53 0.22 ± 0.24 1.0 ± 0.46 0.63 ± 0.19 0.70 ± 0.14 0.79FSEGA2019 • 1002.3 ± 1349.64 0.20 ± 0.19 1.02 ± 0.35 0.68 ± 0.2 0.79 ± 0.12 0.81AgentHerb 44.87 ± 80.16 0.09 ± 0.09 1.33 ± 0.15 0.48 ± 0.15 0.48 ± 0.15 0.99Agent33 1592.20 ± 2298.71 0.11 ± 0.12 1.26 ± 0.24 0.58 ± 0.15 0.58 ± 0.15 0.97Sontag 2276.96 ± 3366.20 0.15 ± 0.17 1.13 ± 0.34 0.70 ± 0.17 0.75 ± 0.13 0.88AgreeableAgent 2472.87 ± 3145.20 0.14 ± 0.17 0.86 ± 0.32 0.62 ± 0.18 0.67 ± 0.14 0.75PonpokoAgent ? 2825.54 ± 3745.98 0.22 ± 0.23 0.96 ± 0.41 0.73 ± 0.20 0.84 ± 0.09 0.78ParsCat2 ? 1992.60 ± 2674.90 0.13 ± 0.15 1.07 ± 0.30 0.66 ± 0.17 0.69 ± 0.14 0.86

Table 4: Performance comparison of ANESIA against other ANAC winning strategies averaged over all the 8 domains and twouncertain preference profiles in each domain. See Appendix for the separate results for each domain. ANAC’19 agents (•) haveuncertain user preferences, and no learning capabilities. ANAC’17 (?) and ANAC’18 () agents can learn from experience andare given real user preferences. In blue are the best among ANESIA and ANAC’19 agents. In purple, the overall best.

Figure 1: Increase in Dynamic Threshold Utility using DRL

and ANAC’18 agents since, like our approach, they enablelearning from past negotiations.

To this end, we note that ANESIA uses prior negotia-tion data from AgentGG to pre-train the DRL-based utilitythreshold and adjust the selection of tactics from the tem-plates. The effectiveness of our approach is demonstratedby the fact that ANESIA outperforms the same agents it wastrained on (see Hypothesis C), but, crucially, does so alsoon domains and opponents unseen during training. We fur-ther stress that the obtained performance metrics are affectedonly in part by an adequate pre-training of the strategies: thequality of the estimated user and opponent models – derivedwithout any prior training data from other agents – playsan important role too. The results in Tables 4 (A to C) evi-

dence that our agent consistently outperforms its opponentsin terms of individual and social efficiency, demonstratingthat ANESIA can learn to adapt at run-time to different ne-gotiation settings and against different unknown opponents.

ConclusionsANESIA is a novel agent model encapsulating different typesof learning to support negotiation over multiple issues andunder user preference uncertainty. An ANESIA agent usesstochastic search based on FA for user modelling and com-bines NSGA-II and TOPSIS for generating Pareto bids dur-ing negotiation. It further exploits strategy templates to learnthe best combination of acceptance and bidding tactics atany negotiation time, and among its tactics, it uses an adap-tive target threshold utility learned using the DDPG algo-rithm. We have empirically evaluated the performance ofANESIA against the winning agent strategies of ANAC’17,’18 and ’19 competitions in different settings, showing thatour agent both outperforms opponents known at trainingtime and can effectively transfer its knowledge to environ-ments with previously unseen opponent agents and domains.

An open problem worth pursuing in the future is how toupdate the synthesized template strategy dynamically.

ReferencesAydogan, R.; Baarslag, T.; Fujita, K.; Mell, J.; Gratch, J.;De Jonge, D.; Mohammad, Y.; Nakadai, S.; Morinaga, S.;Osawa, H.; et al. 2020. Challenges and main results of theautomated negotiating agents competition (anac) 2019. InMulti-agent systems and agreement technologies, 366–381.Springer.Aydogan, R.; Fujita, K.; Baarslag, T.; Jonker, C. M.; andIto, T. 2018. ANAC 2017: Repeated multilateral negotiationleague. In International Workshop on Agent-Based ComplexAutomated Negotiation, 101–115. Springer.Baarslag, T.; Fujita, K.; Gerding, E. H.; Hindriks, K.; Ito,T.; Jennings, N. R.; Jonker, C.; Kraus, S.; Lin, R.; Robu, V.;et al. 2013. Evaluating practical negotiating agents: Resultsand analysis of the 2011 international competition. ArtificialIntelligence, 198: 73–103.Baarslag, T.; Hendrikx, M. J.; Hindriks, K. V.; and Jonker,C. M. 2016. Learning about the opponent in automated bilat-eral negotiation: a comprehensive survey of opponent mod-eling techniques. Autonomous Agents and Multi-Agent Sys-tems, 30(5): 849–898.Baarslag, T.; Hindriks, K.; Hendrikx, M.; Dirkzwager, A.;and Jonker, C. 2014. Decoupling negotiating agents toexplore the space of negotiation strategies. In Novel In-sights in Agent-based Complex Automated Negotiation, 61–83. Springer.Baarslag, T.; and Kaisers, M. 2017. The value of informa-tion in automated negotiation: A decision model for elicitinguser preferences. In Proceedings of the 16th conference onautonomous agents and multiagent systems, 391–400.Bagga, P.; Paoletti, N.; Alrayes, B.; and Stathis, K. 2020.A Deep Reinforcement Learning Approach to ConcurrentBilateral Negotiation. In IJCAI.Bakker, J.; Hammond, A.; Bloembergen, D.; and Baarslag,T. 2019. RLBOA: A Modular Reinforcement LearningFramework for Autonomous Negotiating Agents. In AA-MAS, 260–268.Cheng, S.; Shi, Y.; and Qin, Q. 2012. On the performancemetrics of multiobjective optimization. In InternationalConference in Swarm Intelligence, 504–512. Springer.De Jonge, D.; and Sierra, C. 2016. GANGSTER: an auto-mated negotiator applying genetic algorithms. In Recent ad-vances in agent-based complex automated negotiation, 225–234. Springer.Deb, K.; Pratap, A.; Agarwal, S.; and Meyarivan, T. 2002.A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE transactions on evolutionary computation, 6(2):182–197.El-Ashmawi, W. H.; Abd Elminaam, D. S.; Nabil, A. M.;and Eldesouky, E. 2020. A chaotic owl search algorithmbased bilateral negotiation model. Ain Shams EngineeringJournal.Fatima, S. S.; Wooldridge, M.; and Jennings, N. R. 2001.Optimal negotiation strategies for agents with incompleteinformation. In International Workshop on Agent Theories,Architectures, and Languages, 377–392. Springer.

Fatima, S. S.; Wooldridge, M.; and Jennings, N. R. 2005. Acomparative study of game theoretic and evolutionary mod-els of bargaining for software agents. Artificial IntelligenceReview, 23(2): 187–205.Fatima, S. S.; Wooldridge, M. J.; and Jennings, N. R. 2006.Multi-issue negotiation with deadlines. Journal of ArtificialIntelligence Research, 27: 381–417.Hashmi, K.; Alhosban, A.; Najmi, E.; Malik, Z.; et al. 2013.Automated Web service quality component negotiation us-ing NSGA-2. In 2013 ACS International Conference onComputer Systems and Applications (AICCSA), 1–6. IEEE.Hwang, C.-L.; and Yoon, K. 1981. Methods for multiple at-tribute decision making. In Multiple attribute decision mak-ing, 58–191. Springer.Jonker, C.; Aydogan, R.; Baarslag, T.; Fujita, K.; Ito, T.; andHindriks, K. 2017. Automated negotiating agents competi-tion (ANAC). In Proceedings of the AAAI Conference onArtificial Intelligence, volume 31.Jonker, C. M.; and Ito, T. 2020. ANAC 2018: Repeated Mul-tilateral Negotiation League. In Advances in Artificial In-telligence: Selected Papers from the Annual Conference ofJapanese Society of Artificial Intelligence (JSAI 2019), vol-ume 1128, 77. Springer Nature.Kadono, Y. 2016. Agent yk: An efficient estimation of op-ponent’s intention with stepped limited concessions. In Re-cent advances in agent-based complex automated negotia-tion, 279–283. Springer.Klein, M.; Faratin, P.; Sayama, H.; and Bar-Yam, Y. 2003.Negotiating complex contracts. Group Decision and Nego-tiation, 12(2): 111–125.Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.;Tassa, Y.; Silver, D.; and Wierstra, D. 2016. Continuous con-trol with deep reinforcement learning. In Proceedings of the4th International Conference on Learning Representations(ICLR 2016).Lin, R.; Kraus, S.; Baarslag, T.; Tykhonov, D.; Hindriks, K.;and Jonker, C. M. 2014. Genius: An integrated environmentfor supporting the design of generic automated negotiators.Computational Intelligence, 30(1): 48–70.Marsa-Maestre, I.; Klein, M.; Jonker, C. M.; and Aydogan,R. 2014. From problems to protocols: Towards a negotiationhandbook. Decision Support Systems, 60: 39–54.Rawat, R. 2021. User modelling for multi-issue negoti-ation: The role of population-based meta- heuristic algo-rithms over discrete issues. Master’s thesis, Royal Holloway,University of London, UK.Razeghi, Y.; Yavaz, C. O. B.; and Aydogan, R. 2020. Deepreinforcement learning for acceptance strategy in bilateralnegotiations. Turkish Journal of Electrical Engineering &Computer Sciences, 28(4): 1824–1840.Rubinstein, A. 1982. Perfect equilibrium in a bargainingmodel. Econometrica: Journal of the Econometric Society,97–109.Sato, M.; and Ito, T. 2016. Whaleagent: Hardheaded strategyand conceder strategy based on the heuristics. In Recent ad-vances in agent-based complex automated negotiation, 273–278. Springer.

Silva, F.; Faia, R.; Pinto, T.; Praca, I.; and Vale, Z. 2018. Op-timizing Opponents Selection in Bilateral Contracts Negoti-ation with Particle Swarm. In International Conference onPractical Applications of Agents and Multi-Agent Systems,116–124. Springer.Talbi, E.-G. 2009. Metaheuristics: from design to implemen-tation, volume 74. John Wiley & Sons.Tsimpoukis, D.; Baarslag, T.; Kaisers, M.; and Paterakis,N. G. 2018. Automated negotiations under user prefer-ence uncertainty: A linear programming approach. In Inter-national conference on agreement technologies, 115–129.Springer.Tunalı, O.; Aydogan, R.; and Sanchez-Anguix, V. 2017. Re-thinking frequency opponent modeling in automated negoti-ation. In International Conference on Principles and Prac-tice of Multi-Agent Systems, 263–279. Springer.Wachowicz, T.; Kersten, G. E.; and Roszkowska, E. 2019.How do I tell you what I want? Agent’s interpretation ofprincipal’s preferences and its impact on understanding thenegotiation process and outcomes. Operational Research,19(4): 993–1032.Williams, C. R.; Robu, V.; Gerding, E. H.; and Jennings,N. R. 2014. An overview of the results and insightsfrom the third automated negotiating agents competition(ANAC2012). Novel Insights in Agent-based Complex Au-tomated Negotiation, 151–162.Yang, X.-S. 2009. Firefly algorithms for multimodal opti-mization. In International symposium on stochastic algo-rithms, 169–178. Springer.Yang, X.-S.; and Karamanoglu, M. 2020. Nature-inspiredcomputation and swarm intelligence: a state-of-the-artoverview. Nature-Inspired Computation and Swarm Intel-ligence, 3–18.

Figure 2: Interaction between the components of ANESIA

Metric ANESIA AgentGG KakeSoba SAGA|B| = 5 |B| = 10 |B| = 5 |B| = 10 |B| = 5 |B| = 10 |B| = 5 |B| = 10

Holiday DomainRavg 476.81 ±

847.48259.35 ±453.3

1224.04 ±1770.48

532.77 ±841.16

1066.18 ±1710.8

414.51 ±719.58

32.3 ±83.79

238.18 ±783.88

Pavg 0.15 ± 0.57 0.17 ± 0.26 0.4 ± 0.46 0.35 ± 0.44 0.28 ± 0.44 0.17 ± 0.34 0.23 ± 0.37 0.31 ± 0.42Usoc 1.79 ± 0.81 1.72 ± 0.78 1.25 ±0.66 1.33 ± 0.62 1.41 ± 0.63 1.57 ± 0.49 1.49 ± 0.53 1.39 ± 0.59Uind 0.46 ± 0.42 0.51 ± 0.39 0.67 ±0.37 0.73 ±0.34 0.73 ± 0.34 0.8 ± 0.26 0.65 ± 0.24 0.61 ± 0.29Usind 0.88 ± 0.11 0.88 ± 0.01 0.86 ±0.13 0.88 ±0.07 0.87 ± 0.13 0.87 ± 0.11 0.73 ± 0.1 0.71 ± 0.17

Psucc 0.56 0.64 0.79 0.83 0.84 0.92 0.89 0.86Laptop Domain

Ravg 256.27 ±393.1

254.62±446.96

2840.6 ±5055.29

2186.61 ±4523.85

2292.48 ±4559.91

2054.73 ±4445.48

875.89 ±3305.07

232.84 ±1316.79

Pavg 0.02 ±0.26 0.05 ± 0.26 0.14 ±0.23 0.13 ± 0.22 0.07 ± 0.18 0.07 ± 0.18 0.06 ± 0.16 0.06 ± 0.17Usoc 1.57 ±0.81 1.50 ± 0.8 0.99 ± 0.71 1.07 ± 0.7 1.26 ± 0.63 1.34 ± 0.62 1.45 ± 0.56 1.48 ± 0.56Uind 0.47 ± 0.48 0.5 ± 0.48 0.68 ±0.42 0.72 ± 0.41 0.78 ± 0.33 0.77 ± 0.32 0.71 ± 0.26 0.73 ± 0.28Usind 0.96 ± 0.08 0.95 ± 0.09 0.93 ± 0.1 0.94 ± 0.09 0.91 ± 0.11 0.89 ± 0.1 0.8 ± 0.07 0.82 ± 0.09

Psucc 0.48 0.52 0.73 0.76 0.86 0.87 0.89 0.89Party Domain

Tavg 47.24 ±25.46

54.67 ±40.18

38.85±33.71

44.35±27.26

41.19 ±32.82

45.15 ±26.6 7

16.11 ±23.87

23.58 ±44.59

Ravg 213.2 ±274.16

229.98 ±288.62

508.04±607.31

673.01±682.91

473.42 ±560.71

627.69 ±652.17

135.17 ±347.58

209.67 ±514.01

Pavg 0.15 ±0.42 0.18 ±0.4 0.39± 0.43 0.37± 0.46 0.4 ± 0.46 0.36 ± 0.46 0.21 ± 0.35 0.22 ± 0.39Usoc 1.41 ± 0.6 1.34 ± 0.59 0.97± 0.64 1.0± 0.69 0.96 ± 0.69 1.0 ± 0.69 1.18 ± 0.51 1.17 ± 0.58Uind 0.23 ±0.35 0.2 ±0.35 0.57± 0.4 0.54± 0.41 0.56 ± 0.42 0.62 ± 0.43 0.42 ± 0.25 0.39 ± 0.26Usind 0.92 ± 0.2 0.92 ± 0.17 0.81 ± 0.17 0.79 ± 0.22 0.82 ± 0.19 0.91 ± 0.11 0.5 ± 0.2 0.48 ± 0.19

Psucc 0.32 0.26 0.7 0.69 0.67 0.69 0.85 0.82

Table 5: Performance Comparison of ANESIA with “teacher” strategies.

Airport Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total






Table 6: Performance of fully-fledged ANESIA in the domain AIRPORT (1440 ×2 profiles = 2880 simulations)

Camera Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total





ANESIA 0.02 ± 0.25 1.64 ± 0.57 0.54 ± 0.46 0.93 ± 0.08 0.58AgentGP • 555.57 ± 485.57 0.23 ± 0.24 0.72 ± 0.67 0.5 ± 0.44 0.89 ± 0.09 0.56FSEGA2019 • 58.76 ± 35.67 0.06 ± 0.15 1.06 ± 0.41 0.78 ± 0.27 0.86 ± 0.06 0.90AgentHerb 4.63 ± 2.37 0.01 ± 0.02 1.41 ± 0.13 0.45 ± 0.13 0.45 ± 0.13 1.00Agent33 272.72 ± 306.3 0.01 ± 0.05 1.4 ± 0.22 0.64 ± 0.21 0.65 ± 0.2 0.99Sontag 731.21 ± 956.39 0.03 ± 0.12 1.24 ± 0.38 0.77 ± 0.22 0.82 ± 0.1 0.94AgreeableAgent 1231.69 ± 1291.81 0.07 ± 0.17 0.87 ± 0.4 0.76 ± 0.33 0.88 ± 0.15 0.86PonpokoAgent ? 898.9 ± 1035.82 0.12 ± 0.21 0.9 ± 0.54 0.69 ± 0.39 0.90 ± 0.06 0.77ParsCat2 ? 721.07 ± 836.11 0.05 ± 0.15 1.11 ± 0.44 0.75 ± 0.27 0.84 ± 0.09 0.90

Table 7: Performance of fully-fledged ANESIA in the domain Camera (1440 ×2 profiles = 2880 simulations)

Energy Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total






Table 8: Performance of fully-fledged ANESIA in the domain Energy (1440 ×2 profiles = 2880 simulations)

Grocery Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total






Table 9: Performance of fully-fledged ANESIA in the domain Grocery (1440 ×2 profiles = 2880 simulations)

Fitness Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total


ANESIA 8.13 ± 1.19 0.0 ± 0.0 1.51 ± 0.01 1.0 ± 0.0 1.0 ± 0.0 1.00AgentGP • 5025.1 ± 4975.25 0.01 ± 0.01 1.49 ± 0.05 0.92 ± 0.02 0.92 ± 0.02 1.00FSEGA2019 • 6462.11 ± 5295.59 0.0 ± 0.01 1.45 ± 0.07 0.81 ± 0.11 0.81 ± 0.11 1.00AgentHerb 7.25 ± 2.01 0.01 ± 0.02 1.51 ± 0.04 0.55 ± 0.07 0.55 ± 0.07 1.00Agent33 3483.88 ± 2350.29 0.01 ± 0.02 1.5 ± 0.04 0.7 ± 0.09 0.7 ± 0.09 1.00Sontag 6117.57 ± 6137.66 0.0 ± 0.01 1.47 ± 0.07 0.8 ± 0.11 0.8 ± 0.11 1.00AgreeableAgent 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00PonpokoAgent ? 9504.37 ± 8067.99 0.0 ± 0.01 1.4 ± 0.1 0.92 ± 0.05 0.92 ± 0.05 1.00ParsCat2 ? 8187.34 ± 6755.87 0.01 ± 0.02 1.42 ± 0.09 0.87 ± 0.1 0.87 ± 0.1 1.00



ANESIA 10.33 ± 1.05 0.0 ± 0.0 1.57 ± 0.01 1.0 ± 0.0 1.0 ± 0.0 1.00AgentGP • 7520.09 ± 8454.27 0.01 ± 0.01 1.57 ± 0.04 0.93 ± 0.03 0.93 ± 0.03 1.00FSEGA2019 • 10477.08 ± 7392.38 0.01 ± 0.04 1.56 ± 0.07 0.77 ± 0.12 0.77 ± 0.12 1.00AgentHerb 12.52 ± 3.01 0.01 ± 0.02 1.52 ± 0.04 0.54 ± 0.07 0.54 ± 0.07 1.00Agent33 18549.4 ± 14784.47 0.03 ± 0.04 1.55 ± 0.07 0.84 ± 0.09 0.84 ± 0.09 1.00Sontag 12031.32 ± 10180.53 0.01 ± 0.01 1.57 ± 0.05 0.76 ± 0.12 0.76 ± 0.12 1.00AgreeableAgent 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00PonpokoAgent ? 19677.92 ± 15053.14 0.01 ± 0.02 1.57 ± 0.04 0.9 ± 0.06 0.9 ± 0.06 1.00ParsCat2 ? 16583.31 ± 12421.79 0.01 ± 0.02 1.57 ± 0.05 0.84 ± 0.1 0.84 ± 0.1 1.00

Table 10: Performance of fully-fledged ANESIA in the domain Fitness (1440 ×2 profiles = 2880 simulations)

Flight Booking Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total






Table 11: Performance of fully-fledged ANESIA in the domain Flight Booking (1440 ×2 profiles = 2880 simulations)

Itex Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total






Table 12: Performance of fully-fledged ANESIA in the domain Itex (1440 ×2 profiles = 2880 simulations)

Outfit Domain B = 5% of ΩAgent Ravg(↓) Pavg(↓) Usoc(↑) U total






Table 13: Performance of fully-fledged ANESIA in the domain Outfit (1440 ×2 profiles = 2880 simulations)







Table 14: Performance of ANESIA-DRL - Ablation Study1 - over domain AIRPORT (1440 ×2 profiles = 2880 simulations)






ANESIA-DRL 280.15 ± 286.97 0.06 ± 0.2 1.47 ± 0.57 0.47 ± 0.37 0.88 ± 0.13 0.64AgentGP • 83.22 ± 220.98 0.14 ± 0.16 1.04 ± 0.49 0.61 ± 0.25 0.7 ± 0.11 0.88AgentHerb 3.15 ± 0.77 0.05 ± 0.1 1.4 ± 0.19 0.5 ± 0.19 0.5 ± 0.19 1.00FSEGA2019 • 52.5 ± 41.27 0.09 ± 0.17 1.02 ± 0.48 0.73 ± 0.3 0.85 ± 0.1 0.87Agent33 180.65 ± 213.62 0.05 ± 0.08 1.39 ± 0.2 0.68 ± 0.18 0.68 ± 0.18 1.00Sontag 637.83 ± 979.8 0.03 ± 0.08 1.26 ± 0.29 0.79 ± 0.17 0.8 ± 0.13 0.98AgreeableAgent 1115.14 ± 1334.16 0.07 ± 0.14 0.9 ± 0.38 0.76 ± 0.29 0.84 ± 0.17 0.91PonpokoAgent ? 821.57 ± 1069.73 0.1 ± 0.18 0.99 ± 0.52 0.74 ± 0.33 0.88 ± 0.06 0.83ParsCat2 ? 666.97 ± 968.01 0.06 ± 0.11 1.15 ± 0.38 0.76 ± 0.21 0.8 ± 0.13 0.95

Table 15: Performance of ANESIA-DRL - Ablation Study1 - over domain Camera (1440 ×2 profiles = 2880 simulations)







Table 16: Performance of ANESIA-DRL - Ablation Study1 - over domain Energy (1440 ×2 profiles = 2880 simulations)







Table 17: Performance of ANESIA-DRL - Ablation Study1 - over domain Grocery (1440 ×2 profiles = 2880 simulations)



ANESIA-DRL 6.5 ± 2.6 0.05 ± 0.06 1.48 ± 0.07 0.87 ± 0.15 0.87 ± 0.15 1.00AgentGP • 507.39 ± 1234.05 0.1 ± 0.2 1.32 ± 0.37 0.67 ± 0.16 0.68 ± 0.16 0.93FSEGA2019 • 2142.49 ± 3180.88 0.09 ± 0.14 1.32 ± 0.25 0.77 ± 0.13 0.78 ± 0.12 0.97AgentHerb 7.36 ± 3.51 0.05 ± 0.05 1.48 ± 0.05 0.58 ± 0.1 0.58 ± 0.1 1.00Agent33 761.68 ± 1289.88 0.07 ± 0.05 1.44 ± 0.07 0.68 ± 0.12 0.68 ± 0.12 1.00Sontag 2813.95 ± 5044.51 0.07 ± 0.07 1.36 ± 0.14 0.78 ± 0.13 0.79 ± 0.13 1.00AgreeableAgent 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00PonpokoAgent ? 3993.86 ± 5928.48 0.07 ± 0.09 1.3 ± 0.18 0.84 ± 0.11 0.84 ± 0.11 0.99ParsCat2 ? 3173.43 ± 4762.17 0.08 ± 0.08 1.32 ± 0.17 0.8 ± 0.12 0.8 ± 0.12 0.99



ANESIA-DRL 8.05 ± 2.67 0.05 ± 0.05 1.51 ± 0.09 0.81 ± 0.08 0.81 ± 0.08 1.00AgentGP • 86.82 ± 151.37 0.06 ± 0.03 1.50 ± 0.04 0.76 ± 0.07 0.76 ± 0.07 1.00FSEGA2019 • 7584.39 ± 7834.62 0.05 ± 0.04 1.51 ± 0.06 0.77 ± 0.11 0.77 ± 0.11 1.00AgentHerb 9.39 ± 3.46 0.06 ± 0.04 1.48 ± 0.05 0.6 ± 0.08 0.6 ± 0.08 1.00Agent33 11817.2 ± 11426.17 0.08 ± 0.06 1.48 ± 0.08 0.79 ± 0.09 0.79 ± 0.09 1.00Sontag 8291.78 ± 8988.64 0.06 ± 0.05 1.50 ± 0.06 0.73 ± 0.11 0.73 ± 0.11 1.00AgreeableAgent 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00PonpokoAgent ? 13363.2 ± 12016.88 0.08 ± 0.04 1.47 ± 0.05 0.81 ± 0.08 0.81 ± 0.08 1.00ParsCat2 ? 11296.37 ± 10537.68 0.07 ± 0.05 1.49 ± 0.06 0.77 ± 0.1 0.77 ± 0.1 1.00

Table 18: Performance of ANESIA-DRL - Ablation Study1 - over domain Fitness (1440 ×2 profiles = 2880 simulations)







Table 19: Performance of ANESIA-DRL - Ablation Study1 - over domain Flight Booking (1440×2 profiles = 2880 simulations)







Table 20: Performance of ANESIA-DRL - Ablation Study1 - over domain Itex (1440 ×2 profiles = 2880 simulations)







Table 21: Performance of ANESIA-DRL - Ablation Study1 - over domain Outfit (1440 ×2 profiles = 2880 simulations)







Table 22: Performance of ANESIA-Random - Ablation Study2 - over domain AIRPORT (1440 ×2 profiles = 2880 simulations)







Table 23: Performance of ANESIA-Random - Ablation Study2 - over domain Camera (1440 ×2 profiles = 2880 simulations)







Table 24: Performance of ANESIA-Random - Ablation Study2 - over domain Energy (1440 ×2 profiles = 2880 simulations)







Table 25: Performance of ANESIA-Random - Ablation Study2 - over domain Grocery (1440 ×2 profiles = 2880 simulations)



ANESIA-rand 7.11 ± 2.59 0.05 ± 0.07 1.49 ± 0.08 0.89 ± 0.13 0.89 ± 0.13 1.00AgentGP • 577.86 ± 1122.43 0.19 ± 0.31 1.16 ± 0.56 0.64 ± 0.16 0.67 ± 0.16 0.82FSEGA2019 • 1476.17 ± 1984.41 0.1 ± 0.16 1.3 ± 0.29 0.77 ± 0.13 0.78 ± 0.12 0.96AgentHerb 7.23 ± 3.22 0.04 ± 0.05 1.49 ± 0.05 0.58 ± 0.1 0.58 ± 0.1 1.00Agent33 569.85 ± 956.24 0.07 ± 0.05 1.43 ± 0.07 0.69 ± 0.12 0.69 ± 0.12 1.00Sontag 1708.57 ± 2728.7 0.09 ± 0.14 1.33 ± 0.26 0.78 ± 0.14 0.78 ± 0.13 0.97AgreeableAgent 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00PonpokoAgent ? 2805.01 ± 4277.85 0.09 ± 0.16 1.26 ± 0.28 0.83 ± 0.13 0.84 ± 0.11 0.96ParsCat2 ? 2641.97 ± 4328.13 0.11 ± 0.16 1.27 ± 0.29 0.79 ± 0.13 0.80 ± 0.12 0.96



ANESIA-rand 7.88 ± 2.6 0.08 ± 0.05 1.48 ± 0.08 0.77 ± 0.09 0.77 ± 0.09 1.00AgentGP • 121.18 ± 359.01 0.08 ± 0.12 1.48 ± 0.17 0.76 ± 0.08 0.76 ± 0.06 0.99FSEGA2019 • 2461.25 ± 3791.92 0.05 ± 0.04 1.51 ± 0.06 0.77 ± 0.11 0.77 ± 0.11 1.00AgentHerb 9.57 ± 3.75 0.06 ± 0.03 1.48 ± 0.05 0.6 ± 0.09 0.6 ± 0.09 1.00Agent33 4209.9 ± 6395.93 0.09 ± 0.13 1.46 ± 0.18 0.78 ± 0.11 0.79 ± 0.09 0.99Sontag 3223.29 ± 6109.35 0.06 ± 0.05 1.49 ± 0.07 0.73 ± 0.11 0.73 ± 0.11 1.00AgreeableAgent 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00PonpokoAgent ? 5035.09 ± 7372.96 0.08 ± 0.04 1.47 ± 0.05 0.8 ± 0.08 0.80 ± 0.08 1.00ParsCat2 ? 4005.65 ± 5637.52 0.07 ± 0.05 1.49 ± 0.07 0.77 ± 0.11 0.77 ± 0.11 1.00

Table 26: Performance of ANESIA-Random - Ablation Study2 - over domain Fitness (1440 ×2 profiles = 2880 simulations)



ANESIA-rand 579.56 ± 684.55 0.34 ± 0.27 1.16 ± 0.63 0.66 ± 0.21 0.77 ± 0.24 0.46AgentGP • 529.11 ± 275.89 0.43 ± 0.24 0.38 ± 0.57 0.55 ± 0.12 0.65 ± 0.16 0.32FSEGA2019 • 3146.95 ± 4221.66 0.46 ± 0.24 0.24 ± 0.49 0.60 ± 0.2 1.0 ± 0.0 0.21AgentHerb 553.35 ± 1149.49 0.12 ± 0.2 1.16 ± 0.47 0.52 ± 0.16 0.52 ± 0.17 0.87Agent33 1633.69 ± 2938.1 0.12 ± 0.2 1.01 ± 0.43 0.46 ± 0.15 0.45 ± 0.16 0.87Sontag 3183.22 ± 4665.68 0.38 ± 0.27 0.48 ± 0.62 0.66 ± 0.21 0.91 ± 0.11 0.39AgreeableAgent 3320.16 ± 4423.85 0.47 ± 0.23 0.25 ± 0.51 0.60 ± 0.19 0.98 ± 0.02 0.20PonpokoAgent ? 3631.62 ± 4792.51 0.4 ± 0.26 0.39 ± 0.56 0.64 ± 0.2 0.93 ± 0.06 0.33ParsCat2 ? 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0 ± 0 0.00




Table 27: Performance of ANESIA-Random - Ablation Study2 - over domain Flight Booking (1440 ×2 profiles = 2880 simula-tions)







Table 28: Performance of ANESIA-Random - Ablation Study2 - over domain Itex (1440 ×2 profiles = 2880 simulations)







Table 29: Performance of ANESIA-Random - Ablation Study2 - over domain Outfit (1440 ×2 profiles = 2880 simulations)

Learnable Strategies for Bilateral Agent Negotiation over ...

Documents

Transcript of Learnable Strategies for Bilateral Agent Negotiation over ...