
PARTIAL INFORMATION BASED COMPETITIVE ADVERTISING

VARTIKA SINGH∗ AND VEERARUNA KAVITHA,

IEOR, INDIAN INSTITUTE OF TECHNOLOGY BOMBAY, INDIA

Abstract. We study a competitive advertisement process, for example over a social network. Many potential customers are hidden in the network, and multiple content providers (CPs) attempt to contact them through advertising in order to acquire them. Any customer chooses one among the CPs that contacted it. The CPs have access only to local information; none of them is aware of the customers already contacted by their opponents. Further, the information updates are asynchronous, and our aim is to obtain relevant equilibrium policies. Towards this, we consider a stochastic game with partial, asymmetric and non-classical information. Our approach is to consider open-loop control till the next information update, which allows managing the belief updates in a structured manner. We analyse a general framework that is applicable to a variety of problems with such partial and asymmetric information structures and strategies. Using standard tools of optimal control theory and Markov decision processes (MDPs), we solve a bi-level control problem while deriving the relevant best responses; every stage of the dynamic programming equation of the MDP is solved using optimal control tools. We finally reduce the game, which has an infinite number of states and infinite-dimensional actions, to a finite-state game with finite-dimensional actions under certain assumptions. We prove the existence of a Nash equilibrium in the advertising problem and provide important structural results. We also provide closed-form expressions for some special cases.

Key words. Partial information, Stochastic game, Advertisement, Social network, Markov decision process, Optimal control.

1. Introduction. Social networks have become a useful platform for content providers (CPs) to propagate their content to potential customers. We consider all such networks, where potential customers are hidden and CPs try to reach them through advertising. But advertising is expensive and the customers are limited, so it may not be useful to advertise if the majority of the customers are already taken. Thus, one needs to consider the trade-off between the potential reward and the cost. The CPs compete among themselves to gain more customers by controlling their (dynamic) advertising policies. As one may anticipate, this competition depends significantly on the information available to the CPs at the various decision-making stages.

To be more precise, n CPs attempt to acquire M available potential customers (referred to as locks) till a given deadline T; each CP controls its rate of contact (advertisement) to increase its chances of winning the locks, while trading off the associated cost. A similar problem is considered in [1], where the authors consider two types of information structures: a) the no-information case, where the CPs are unaware of the number of locks contacted even by themselves, and b) the full-information case, where the CPs know the contact status of everyone. However, it is more realistic to assume that the agents/CPs have access only to local information, where CPs know their own contact status and not that of the opponents. This leads to a partial information game. Further, every contact does not automatically imply acquisition: the customers/locks wait for some time and choose one among the contacted CPs. We say a CP has acquired a lock if the latter chooses the former; this acquisition status is revealed to the CPs after the waiting time, which we refer to as the deadline T. This competition leads to a non-classical information game (as described in [2]).

Basar et al. [2] describe a game as being of the non-classical information type; we describe the same in our words: if the state of agent i depends upon the (previous) actions of agent j, and if agent j has some information which is not available to agent i, we have a non-classical information game. These games are typically hard to solve [2]; when one attempts to find a best response against a strategy profile of the others, one would require a belief about the others' states, a belief about the beliefs of others, and so on.

∗Work partially supported by the Prime Minister's Research Fellowship (PMRF), India.

In our problem, the global state (which reflects the further acquisition chances of the various CPs) is influenced by the contact status of all the CPs. This in turn depends upon the advertising policies (actions) of the agents; the agents choose their actions based only on the local information/state (available at the corresponding decision epochs), i.e., the locks contacted by themselves. Since this local information is unavailable to others, the competitive Advertisement process for Social Networks problem (a.k.a. ASN) leads to a non-classical information game.

When any CP contacts a lock, there is a change in its state/information, which calls for a new advertisement policy; the new policy should probably last till the deadline T. We indeed assume that the policy of any CP is exactly of this nature, which we refer to as "open loop control till information update (a.k.a. OLC-IU)", as in [3]. The policies in our case depend upon the local state (contacts) as well as the left-over time. With no information, one has to resort to open-loop policies (an action that changes with time but is oblivious to the state); this is the best one can do without access to information updates. With full information, one can have closed-loop policies (actions can also depend on the state). Further, in controlled Markov jump processes with full information, every agent is informed immediately of any jump in the global state and can change its action based on the change. In our partial information case, the agents can observe only some (local) jumps and not all; thus, in all, we need policies of the type OLC-IU, which are also asynchronous. These are natural policies that are usually adopted by individuals in various real-life scenarios¹ and they facilitate structured estimation of beliefs.

The structure of the game is known to all the CPs, i.e., they know that any other CP chooses its advertisement process for the rest of the horizon based on its local information (the number of contacts made by that CP). As in any standard game-theoretic setting, even with such an information and strategy structure, the CPs can estimate their expected rewards at any stage based on their local information. These estimates help them in choosing appropriate strategies, which might lead to a Nash equilibrium (NE). In effect, CPs are able to estimate the beliefs about their reward, i.e., the anticipated reward, in a structured manner against any given strategy profile of the opponents.

Some initial results for a problem with a similar information structure and strategies, but with sequential contacts, are available in [3]. The authors consider two agents with one/two locks. In [4], we considered a significant generalisation of [3] with n agents and M locks. But in both these problems, the locks were ordered, i.e., agents could contact the locks only in a given order, known to all the agents. These results are applicable to Project Completion in Multiple phases (a.k.a. PCM), where agents are not aware if the project is already acquired by others.

We first provide a general problem formulation, which can model a variety of applications, including the two above-mentioned problems, PCM and ASN. In a dynamic system with multiple agents, the state of the system is influenced by the actions of all the agents; but the agents know only a partial component of the state, and this information is different for different agents. Further, the information updates are asynchronous across the agents, and these observation epochs depend only on the actions of the associated agent; however, the rest of the details, e.g., the rewards, depend upon the actions of all. As already mentioned, the policies considered are of the type OLC-IU, where the action for any state dictates the rate of the next information-update epoch. Basically, at every information update, the corresponding agent chooses a new open-loop control (till the end) depending upon the new information.

¹An OLC-IU strategy for a student could be as follows: if the performance after an intermediate exam is poor, (s)he plans to drop the course; if average, (s)he plans to continue the course with more effort; and if the performance remains average or poor in the next intermediate exam, (s)he plans to drop.

Any strategy profile in our game is described by one open-loop policy for each (local) state, where each state is also described by the time of the previous information update; thus we have an infinite-dimensional game. We provide two general results that are applicable to the problems considered in [3] and [4], as well as to the social network problem (ASN) that we consider in this paper. We use the tools of optimal control theory (Hamilton-Jacobi-Bellman equations) and discrete-time Markov decision processes to reduce this infinite-dimensional game to a finite one, under certain assumptions; we show that the best response against any strategy profile of the others includes a time-threshold policy (maximum rate till a time threshold, after which there is no attempt for an information update); more importantly, we show that these thresholds can be specified by a finite number of constants (one for each private state), irrespective of the time of the information update.

We next prove the existence of, and characterise, the NE in the reduced game for ASN. We also derive important structural properties of the NE: a) the advertisement policy corresponding to a smaller number of left-over locks has a smaller threshold; b) one can have multiple (uncountably many) NE in some cases; and c) we have closed-form expressions of the NE for a two-player two-lock game, along with an example of multiple NE. We also provide an algorithm to compute the NE. This algorithm is based on fictitious play and stochastic approximation techniques.

In a related field of literature that considers advertising over a social network ([5]-[7]), the potential customers are known to the competing agents, and the acquired customers may also influence fellow customers; in [6], the firms allocate an initial customer-specific budget towards winning the customers, while [7] considers the same in a dynamic fashion. In our case, the customers are hidden and the agents blindly attempt to reach them. In [8]-[10], the customers are not known; however, the firms compete for visibility over the timelines of an online social network; these papers do not consider the acquisition/winning over of the hidden customers. The visibility metric includes the relative influence of an advertisement post residing at different positions of the timeline; the amount/fraction of time the post resides on a timeline is considered in [8, 10], while the fraction of live copies of the posts of a CP relative to those of the others is considered in [11].

The organisation of the paper is as follows. The general problem is considered in sections 2-3; the reduced game for the general problem and for PCM are considered in sections 4 and 5 respectively. The reduced game corresponding to ASN is available in sections 6-7; a possible extension of ASN is provided in section 8, followed by conclusions.

2. Problem statement. Our problem is motivated by a social network where multiple content providers (CPs) compete to acquire potential customers through advertising, which influences their chances of contacting the customers. We model the advertising as a Poisson search with controllable rates, where a higher rate results in better chances, but at a higher cost. The agents work asynchronously and are unaware of the contact status of the opponents. We consider a general problem formulation that supports many variants, including this problem; prior to that, we describe the details of this (ASN) problem, as well as the variant that studies project completion in multiple phases (PCM) with partial information.

In the PCM problem, n agents compete to win a project by acquiring M locks before the deadline T. One can view it as completing the M phases of the project. The locks are ordered, i.e., all the agents compete for the first lock in the beginning, after which the successful agent attempts the subsequent locks in the given order. We say the k-th contact (of the k-th lock) is successful if this contact happens before time T and if the agent was the first one to contact all the previous (k−1) locks along with the k-th lock; agents get some reward at every successful contact. Further, the agent that contacts all the locks successfully wins the project and gets a terminal reward. However, agents are not aware of the contact status of others before they contact the first lock. The acquisition/contact process of each agent is modelled by independent (possibly non-homogeneous) Poisson processes; each agent can choose the rate of its (Poisson) contact process as a function of time t ∈ [0, T]. A higher rate of contact increases the chances of success but also incurs a higher cost. The aim of each agent is to maximize its expected reward. This variant is presented in [4] with partial proofs.

In the second variant (i.e., in ASN), the locks are no longer ordered, i.e., any agent can contact any lock at any time. Each lock is associated with a reward, irrespective of the contact status of the agent with the other locks. The first agent to contact any lock is not guaranteed the reward, but has higher chances to win that reward. The reward of any lock is provided to one among the contacted agents, according to some probability distribution. This variant models the social network problem more realistically (than in [1]), as we consider the following partial information aspects. The reward is not revealed during the game; however, agents are aware of the probability distribution governing the rewards. So players will not know if their contact is successful, but they know that the probability of winning the reward increases the earlier the contact occurs. The overall rewards are revealed only after the deadline T. Further, the players know the number of locks contacted by themselves, but not by the others. The contact process is controlled as in the first variant.

The first variant is applicable to a project that needs to be acquired initially (the competition ends here) and then completed in several phases. The second variant represents the acquisition problem in a social network as in [1], but with partial information. In this paper we consider a general framework for a multi-agent sequential decision-making problem (under certain assumptions) and derive some important structural properties of the optimal policies (at an appropriate Nash equilibrium); the framework is sufficiently general and facilitates analysing the above two variants, and the structural properties significantly reduce the dimensionality of the problem.

General Problem Formulation: Consider a dynamic system with M stages and n decision makers/agents. The stages for different agents are asynchronous. At the end of any stage, an agent observes its own state and makes a decision; it cannot observe state components belonging to others, nor any other component that could reveal partial and useful information about others. Based on the current state and the decision chosen, the agent transits to a new state and the next stage begins; the time spent in any state (the time duration between two stages) also depends on these decisions. The agents control their processes using non-homogeneous Poisson processes, one for each stage and state. The decision-making process ends at the terminal time T, and it is possible that some agents have not gone through all M transitions by time T. Further, the agents can quit the process at any time.

At every stage, any agent gets some reward based on its current state, action and the new state, as well as the states of all other agents; however, the agents might or might not be aware of the reward immediately. In the first variant (PCM), agents derive the reward immediately if the contact is successful; in this case all the required information is revealed (only) to that agent. In the second variant (ASN), the agents cannot observe the rewards before the terminal time $T$. Throughout the game, no information related to other agents that could help an agent choose a better action is revealed. For example, CPs could be advertising to attract the customers, but they may not know if the same customers are already acquired/contacted by others. They can only speculate about the state of the other agents, and estimate their expected reward for any stage. This structure introduces a game-theoretic framework, as the states and/or the rewards of the agents depend on the strategies of others.

Information structure: The players have no information about the components of the state that correspond to other players; they only know their own state component (henceforth briefly referred to as the state). But they know the set of strategies used by the other players, using which they can derive the probability distribution of the states of others, for any given opponent-strategy profile. Using this information, and further using the private information of their own state, they can estimate their expected rewards at any time point, for every given opponent-strategy profile. Basically, they systematically estimate beliefs about the others' states, which in turn helps in estimating their expected rewards. This helps them in choosing a strategy that is a part of a Nash equilibrium (if one exists) of an appropriate game.

Decision epochs: Whenever a new piece of information is available to a player, i.e., whenever a stage ends (for example, upon contacting a lock in the two examples discussed above), the player can make a new decision based on this update. Hence, such information-update epochs become the natural decision epochs for the agents. These information updates are asynchronous for the agents, and hence the decision epochs are asynchronous and can also be random. Also, there can be at most $M$ decision epochs for each agent.

State: We denote the (private) state of player $i$ at the $k$-th decision epoch by $z^i_k$. Since the information is updated only at decision epochs, this state remains the same between two decision epochs. This state is represented by $z^i_k = (l^i_k, \tau^i_{k-1})$, where $l^i_k \in \mathcal{L}^i_k$ is problem dependent and $\tau^i_{k-1}$ is the time instance of the $(k-1)$-th transition; set $\tau^i_0 = 0$. For each $k$, the (private) state space $\mathcal{L}^i_k$ is of finite cardinality, i.e., $|\mathcal{L}^i_k| < \infty$.

Actions: The agents choose the rate functions (defined till $T$) at their respective decision epochs based on their states; these functions are open-loop policies, wherein the time-dynamic action is independent of the state; the agents change their action only at the next decision epoch. This approach is called "open loop control till information update (OLC-IU)", as in [3]. The rate of the Poisson process for agent $i$ at any time can take values in the interval $[0, \beta^i]$, and the rate function is measurable. To be precise, agent $i$ at decision epoch $k$, i.e., at time instance $\tau^i_{k-1}$, chooses an action $a^i_k \in L_\infty[\tau^i_{k-1}, T]$ as the control process to be used till the next information update. Here $L_\infty[\tau^i_{k-1}, T]$ is the space of all measurable functions that are uniformly bounded by the given constant; the bounds ($\beta^i$ for agent $i$) can be different for different agents. These form a closed subset of the Polish space of essentially bounded functions, i.e., functions with finite essential supremum norm, $\|a\|_\infty := \inf\{\beta : |a(t)| \le \beta \text{ for almost all } t \in [\tau^i_{k-1}, T]\}$.

Strategy: The strategy of player $i$ is a collection of open-loop policies, one for each decision epoch and each (private) state, as below:

$$\pi^i = \big\{a^i_k(\cdot\,; z^i_k) \in L_\infty : \text{for all } z^i_k \text{ and all } k \in \{1, \dots, M\}\big\}, \tag{2.1}$$

where $a^i_1(\cdot; z^i_1)$ represents the open-loop policy used at the start, while $a^i_k(\cdot; z^i_k)$ represents the OLC-IU to be used at the $k$-th decision epoch; this choice depends upon the available information $z^i_k$. When there is no ambiguity, the notation $z^i_k$ is dropped, and we use $L_\infty$ in place of $L_\infty[\tau^i_{k-1}, T]$, with $\beta^i, l^i_k$ represented as $\beta, l_k$, etc.

We begin by first analysing the best response (BR) of the players. Towards this, we introduce a few notations. Let $N := \{1, 2, \dots, n\}$ be the set of players. Without loss of generality, consider the BR of agent $i$, and let the ensemble of all other players be represented by $-i := N - \{i\}$. Let $\pi^{-i}$ be the strategy profile of the opponents.

Controlled transitions: Given any state $z^i_k = (l_k, \tau^i_{k-1})$, the transition to a new state $z^i_{k+1} = (l_{k+1}, \tau^i_k)$ also depends upon the action $a^i_k(\cdot)$ and $\pi^{-i}$. The policy $a^i_k$ determines the time of transition $\tau^i_k$, which is exponentially distributed with time-varying rate function $q(l_k)\, a^i_k(\cdot)$; here $q(l_k)$ is problem dependent, e.g., $q(l_k) = l_k$ in both the variants discussed at the beginning of section 2. The state- and action-dependent probability density function of $\tau^i_k$, and the probability of a related event of interest, after defining the cumulative rate $\bar a^i_k(t) := \int_{\tau^i_{k-1}}^{t} a^i_k(s)\, ds$, are given by:

$$f_q(\cdot) = q\, a^i_k(\cdot)\, e^{-q \bar a^i_k(\cdot)}, \quad\text{and}\quad P(\tau^i_k \ge T; a^i_k) = 1 - \int_{\tau^i_{k-1}}^{T} f_{q(l_k)}(t)\, dt = e^{-q(l_k) \bar a^i_k(T)}. \tag{2.2}$$
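To make (2.2) concrete, here is a minimal numerical sketch (our code, not from the paper; the rate function and parameter values are purely illustrative) that evaluates the cumulative rate $\bar a^i_k$ and the survival probability $P(\tau^i_k \ge T; a^i_k)$:

```python
import numpy as np

def cumulative_rate(a, tau_prev, t, num=2000):
    """bar_a(t): integral of the rate a(s) over [tau_prev, t] (trapezoidal rule)."""
    s = np.linspace(tau_prev, t, num)
    return np.trapz(a(s), s)

def survival_prob(a, q_lk, tau_prev, T):
    """P(tau_k >= T; a) = exp(-q(l_k) * bar_a(T)), as in (2.2)."""
    return np.exp(-q_lk * cumulative_rate(a, tau_prev, T))

# Example: a rate that is beta = 2 till time 0.6 and zero afterwards (a threshold policy).
a = lambda s: 2.0 * (s <= 0.6)
print(survival_prob(a, q_lk=1.0, tau_prev=0.0, T=1.0))  # exp(-1.2) ~ 0.301
```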

Further, given $\tau^i_k \approx t$, the term $P_k(l_{k+1} \mid l_k, t, \pi^{-i})$ represents the probability of transition to $l_{k+1}$. We assume that this transition probability satisfies:

A.1 $P_k(l_{k+1} \mid l_k, t, \pi^{-i})$ is Lipschitz continuous in $t$ for any $k, l_k, l_{k+1}$ and $\pi^{-i}$. Also, $l_{k+1}$ is stochastically dominated²: $P_k(\cdot \mid l_k, t, \pi^{-i}) \ge_d P_k(\cdot \mid l'_k, s, \pi^{-i})$ when $(t, -l_k) \le (s, -l'_k)$. Further, $q : \mathcal{L}_k \to \mathbb{R}$ is non-decreasing for every $k$.

We drop $k$ in the notation $P_k(\cdot \mid \cdot, \cdot, \cdot)$ when the context directly indicates $k$.

Rewards/Costs: At every decision epoch, the agent observes its state and (as explained before) calculates its expected reward till the next epoch, to make a new decision against any $\pi^{-i}$. Based on the rate-control function chosen, one has to pay a cost as well. These quantities correspond to the stage-wise rewards and costs. Recall that $T \wedge \tau^i_k$ represents the time instance³ at which the $k$-th stage ends. Then the cost spent on acceleration for the $k$-th stage is proportional to $\bar a^i_k(T \wedge \tau^i_k)$ (see (2.2)).

Thus the expected (immediate) utility of player $i$ for stage $k$, when it chooses a control $a^i_k$ against opponent strategy profile $\pi^{-i}$, equals (where $\mathbb{1}_{\{\cdot\}}$ represents the indicator):

$$r^i_k(z^i_k, a^i_k; \pi^{-i}) = E_{z^i_k}\big[c^i_k(T \wedge \tau^i_k; \pi^{-i}) - \nu\, \bar a^i_k(T \wedge \tau^i_k)\big]\, \mathbb{1}_{\{\tau^i_{k-1} < T\}}, \tag{2.3}$$

where the first term equals the expected reward (details to follow), $\nu$ is the trade-off factor, and the second term is the expected cost given by (using (2.2)):

$$E_{z^i_k}\big[\bar a^i_k(T \wedge \tau^i_k)\big] = \bar a^i_k(T)\, P(\tau^i_k \ge T; a^i_k) + \int_{\tau^i_{k-1}}^{T} \bar a^i_k(s)\, f_{q(l_k)}(s)\, ds = \bar a^i_k(T)\, e^{-q(l_k) \bar a^i_k(T)} + \int_{\tau^i_{k-1}}^{T} \bar a^i_k(s)\, f_{q(l_k)}(s)\, ds. \tag{2.4}$$
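The identity (2.4) can be checked numerically; the sketch below (our construction, with illustrative parameters and starting epoch $\tau^i_{k-1} = 0$) samples the transition time of the non-homogeneous Poisson clock by inversion and compares the Monte Carlo mean of $\bar a^i_k(T \wedge \tau^i_k)$ with the closed form:

```python
import numpy as np
rng = np.random.default_rng(0)

beta, T, q, theta = 2.0, 1.0, 1.0, 0.6        # illustrative values
abar = lambda t: beta * np.minimum(t, theta)  # cumulative rate of a threshold policy

def sample_tau():
    """First jump time: tau solves q * abar(tau) = Exp(1), or the clock never rings."""
    e = rng.exponential()
    return e / (q * beta) if e < q * abar(T) else np.inf

# Monte Carlo estimate of E[ abar(T ^ tau) ] ...
mc = np.mean([abar(min(T, sample_tau())) for _ in range(200_000)])

# ... versus the closed form (2.4)
s = np.linspace(0.0, T, 4000)
f = q * beta * (s <= theta) * np.exp(-q * abar(s))
closed = abar(T) * np.exp(-q * abar(T)) + np.trapz(abar(s) * f, s)
print(mc, closed)  # the two should agree up to Monte Carlo error
```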

If an agent fails to transit before the deadline $T$, it still has to pay for the entire duration $T - \tau^i_{k-1}$; hence the first term in (2.4), which is evaluated in (2.2). In the first term of equation (2.3), $c^i_k(s; \pi^{-i})$ is the conditional expected reward that can be received by agent $i$, given that the agent completes the $k$-th stage at time $\tau^i_k \approx s$ and the strategy profile of the opponents is $\pi^{-i}$. We require the following assumption:

A.2 The reward $c^i_k(\cdot; \pi^{-i})$ is a Lipschitz-continuous and non-increasing function for every $\pi^{-i}, k, i$.

²For any non-decreasing $f$, $E[f(l_{k+1}) \mid l_k, t, \pi^{-i}] \ge E[f(l_{k+1}) \mid l'_k, s, \pi^{-i}]$ if $(t, -l_k) \le (s, -l'_k)$.

³The contact clocks $\tau^i_k$ are free-running Poisson clocks; however, we are interested only in those contacts that occur before the deadline $T$.


Apart from continuity, the assumption also requires that the expected reward is better if the stage is completed earlier. Such an assumption is usually satisfied in most applications, including the two problem variants (as will be discussed later).

Game Formulation: This problem can be modelled as a non-cooperative strategic-form game, $\mathcal{G} = \langle N, S, \Phi \rangle$, with $S = \{S_i\}_i$ and $\Phi = \{\phi^i\}$. Here $S_i$ is the strategy set of player $i$, given by $S_i := \{\pi^i : \pi^i \text{ as in } (2.1)\}$, and the overall utility of agent $i$ equals

$$\phi^i_{l_1}(\pi^i, \pi^{-i}) = \sum_{k=1}^{M} E\big[r^i_k(z^i_k, a^i_k; \pi^{-i}) \,\big|\, z^i_1 = (l_1, 0)\big]. \tag{2.5}$$

Our aim is to find a tuple of strategies (that depend only upon the available information) that forms a Nash equilibrium (NE). We conclude this section by mentioning some of the main results of the paper.

2.1. Important results. We convert this problem to a much simplified and reduced strategic-form game, such that an NE of the reduced game is also an NE of the original game. We obtain the reduced game with the help of Theorem 2.1 and Theorem 2.2, provided after the following definitions.

Threshold Policy: A threshold policy is an open-loop policy uniquely identified by a threshold function $\theta(\cdot)$ and the starting point $s$; when $s \le \theta(s)$, this policy suggests the maximum possible rate $\beta$ in the interval $[s, \theta(s)]$ and the value zero in the remaining interval. If the starting point $s > \theta(s)$, then the rate function is $0$ for all $t$. Thus the threshold policy is represented by $\Gamma_{\theta(s)}(\cdot)$ when the starting epoch is $s$, where

$$\Gamma_\theta(t) := \beta\, \mathbb{1}_{\{t \le \theta\}}, \quad\text{for any } t \in [s, T].$$

Threshold (T) strategy: This is a strategy made up of threshold policies, one for each $k$ and each $l_k \in \mathcal{L}_k$. A typical T-strategy is defined by $M$ threshold (vector) functions $\{\theta_k(\cdot)\}_{k \le M}$, where $\theta_k(\cdot) = \{\theta_{l,k}(\cdot)\}_{l \in \mathcal{L}_k}$ for each $k$, and is defined as below:

$$\pi = \{\theta_1(\cdot), \dots, \theta_M(\cdot)\} = \big\{a_k(\cdot\,; z_k) : a_k(\cdot\,; z_k) = \Gamma_{\theta_{l,k}(s)}(\cdot) \text{ when } z_k = (l, s),\ \forall k\big\}.$$

Basically, the threshold functions depend on the starting time of the stage and the state of the system at that time instance; here $\theta_{l,k}(s)$ is the threshold used for the policy if the $k$-th lock, contacted at time $s$, results in (private) state $l \in \mathcal{L}_k$.

M-Thresholds (MT) Strategy: This is a special type of T-strategy with constant threshold functions, i.e., $\theta(s) = \theta$ for all starting epochs $s$. A typical MT-strategy is defined by $L$ thresholds as below, where $L := \sum_k |\mathcal{L}_k|$:

$$\pi = \{(\theta_{k,l_k}) : l_k \in \mathcal{L}_k\ \forall k\} = \big\{a_k(\cdot\,; z_k) : a_k(\cdot\,; z_k) = \Gamma_{\theta_{l,k}}(\cdot) \text{ for any } z_k = (l, s) \text{ and } k\big\}.$$

For better clarity, we explain how an agent controls the rate functions under an OLC-IU strategy of the MT type. An MT-strategy is completely represented by $L$ thresholds $\{\theta_{l,k}\}$, one for each $l$ and $k$, such that: a) $\theta_{l,k}$ represents the time threshold till which the agent attempts the $k$-th stage if $l_k = l$; b) the threshold vector $\{\theta_{l,k}\}_{l \in \mathcal{L}_k}$ is independent of $\tau_{k-1}$, the start time of the $k$-th stage; and c) if $\tau_{k-1}$ is bigger than $\theta_{l,k}$ (i.e., if the attempt for the $(k-1)$-th stage ends after $\theta_{l,k}$, when $l_k = l$), then the agent no longer attempts the $k$-th and subsequent stages. Observe that any MT strategy satisfies $\pi \in [0, T]^L$ and hence is a finite-dimensional strategy.
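As a concrete illustration of these definitions, the following sketch (our code; the class name, keys and parameter values are ours, not from the paper) encodes an MT-strategy as a finite table of thresholds and recovers the OLC-IU rate $\Gamma_{\theta_{l,k}}$ for any state $z_k = (l, s)$:

```python
from dataclasses import dataclass

@dataclass
class MTStrategy:
    """An MT-strategy: one constant threshold theta[(l, k)] per stage k and private state l."""
    beta: float   # maximum rate of the agent's Poisson clock
    theta: dict   # maps (l, k) -> time threshold in [0, T]

    def rate(self, t, z, k):
        """OLC-IU rate at time t in stage k with state z = (l, tau_prev): full rate
        beta till the threshold, zero afterwards; in particular, if the stage starts
        after its threshold (tau_prev > theta), the agent never attempts it."""
        l, tau_prev = z
        return self.beta if tau_prev <= t <= self.theta[(l, k)] else 0.0

# Example: M = 2 stages with singleton private states, as in the ASN variant below.
pi = MTStrategy(beta=2.0, theta={(2, 1): 1.0, (1, 2): 0.4})
print(pi.rate(0.3, z=(2, 0.0), k=1))   # 2.0: still attempting the first stage
print(pi.rate(0.5, z=(1, 0.45), k=2))  # 0.0: second stage started after its threshold
```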

To summarize, the strategy of any player is made up of OLC-IU policies, one for every (possible) state and for each decision epoch. Thus we have infinite-dimensional actions; however, the following structural results about the best response strategies, under assumptions A.1 and A.2, reduce the complexity of the game significantly:


Theorem 2.1. [Threshold Strategy] Assume A.1 and A.2. There exists a T-strategy that is a best response strategy of any given player, against any given strategy profile of the opponents. ∎

Theorem 2.2. [M-Thresholds strategy] Assume A.1 and A.2. There exists an MT-strategy that is a best response strategy of any given player, against any given strategy profile of the opponents. ∎

The proofs of both the theorems are in section 3. By virtue of the above theorems, there exists a best response strategy represented completely using $L$ thresholds, which is optimal among the (uncountably many) state-dependent strategies with infinite-dimensional strategy space. We thus have a reduced game in $\mathbb{R}^L$, which is further analyzed in sections 4-6; we prove the existence of an NE among MT-strategies in the reduced game (which implies existence for the original game) for both the variants; we also find the NE.

3. Best responses and Proofs. Our aim is to derive a Nash equilibrium (NE) for this partial and asymmetric information stochastic game. We begin by deriving the best response (BR) of player $i$ against any given strategy profile $\pi^{-i}$ of the opponents.

Dynamic programming equations: The BR is obtained by maximizing the objective function (2.5) with respect to the strategies $\pi^i \in S_i$. It is easy to observe that this optimization is an example of a Markov decision process, which can be solved using the ($M$-stage) dynamic programming (DP) equations given below (e.g., see [12], [13]):

$$v^i_k(z^i_k; \pi^{-i}) = 0 \ \text{ if } q(l_k) = 0, \text{ or if } \tau^i_{k-1} > T, \text{ or if } k > M; \text{ and otherwise,}$$
$$v^i_k(z^i_k; \pi^{-i}) = \sup_{a^i_k \in L_\infty} \Big\{ r^i_k(z^i_k, a^i_k; \pi^{-i}) + E\big[v^i_{k+1}(z^i_{k+1}; \pi^{-i}) \mid z^i_k, a^i_k\big] \Big\}. \tag{3.1}$$

It is well known that the value function in the above with $k = 1$ satisfies $v^i_1(z^i_1; \pi^{-i}) = \sup_{\pi^i \in S_i} \phi^i_{l_1}(\pi^i; \pi^{-i})$, which exactly gives the BR, i.e., the optimizer of the objective function (2.5). Thus the BR is obtained by solving the above DP equations (e.g., [13]). The $k$-th stage DP equation can be re-written as below:

$$v^i_k(z^i_k; \pi^{-i}) = \sup_{a^i_k \in L_\infty} J^i_k(z^i_k, a^i_k; \pi^{-i}), \tag{3.2}$$

where the cost $J^i_k$ is defined as (see equations (2.2)-(2.4) and (3.1)):

$$J^i_k(z^i_k, a^i_k; \pi^{-i}) = \int_{\tau^i_{k-1}}^{T} \big(c^i_k(t; \pi^{-i}) - \nu\, \bar a^i_k(t)\big)\, f_{q(l_k)}(t)\, dt - \nu\, \bar a^i_k(T)\, e^{-q(l_k) \bar a^i_k(T)}$$
$$\qquad + \int_{\tau^i_{k-1}}^{T} \sum_{l_{k+1} \in \mathcal{L}_{k+1}} v^i_{k+1}((l_{k+1}, t); \pi^{-i})\, P(l_{k+1} \mid l_k, t, \pi^{-i})\, f_{q(l_k)}(t)\, dt. \tag{3.3}$$

One can re-write the above cost as

$$J^i_k(z^i_k, a^i_k; \pi^{-i}) = \int_{\tau^i_{k-1}}^{T} \big(h^i_k(t; l_k, \pi^{-i}) - \nu\, \bar a^i_k(t)\big)\, f_{q(l_k)}(t)\, dt - \nu\, \bar a^i_k(T)\, e^{-q(l_k) \bar a^i_k(T)}, \ \text{ with}$$
$$h^i_k(t; l_k, \pi^{-i}) := c^i_k(t; \pi^{-i}) + \sum_{l_{k+1} \in \mathcal{L}_{k+1}} v^i_{k+1}((l_{k+1}, t); \pi^{-i})\, P(l_{k+1} \mid l_k, t, \pi^{-i}). \tag{3.4}$$

Optimal control: From the structure of the optimization problem (3.2)-(3.4) defining the $k$-th stage DP equation, it is clear that one can solve it using an appropriate optimal control problem. One can write this optimization problem as:

$$v^i_k(z^i_k; \pi^{-i}) = u(\tau, 0) \quad\text{when } z^i_k = (l_k, \tau), \tag{3.5}$$

where $u(s, x)$ (defined for any $s \in [\tau, T]$ and any $x$) is the value function of the optimal control problem with details:

$$u(s, x) := \sup_{a \in L_\infty[s, T]} J(s, x, a), \quad\text{where the objective function} \tag{3.6}$$
$$J(s, x, a) := \int_s^T \big(h^i_k(s') - \nu\, x(s')\big)\, f_{q(l_k)}(s')\, ds' + g(x(T)), \qquad h^i_k(s') := h^i_k(s', l_k; \pi^{-i}),$$

with the state process of the optimal control problem given by

$$\dot x(s') = a(s'), \quad\text{with } x(s) = x, \text{ for any } s' \in [s, T], \tag{3.7}$$

and thus $x(s') = x + \int_s^{s'} a(r)\, dr$, and with terminal cost $g(x') = -\nu x' e^{-q(l_k) x'}$; observe from (2.2) that $x(s') = \bar a^i_k(s')$ if $x = 0$ and $s = \tau^i_{k-1}$.

We need to solve this optimal control problem to get the BR, and the standard technique to solve such problems is via Hamilton-Jacobi-Bellman (HJB) PDEs [14]; the one corresponding to (3.6) is given by:

$$u_s(s, x) + \sup_{a \in [0, \beta^i]} a\,\big\{(h^i_k(s) - \nu x)\, q(l_k)\, e^{-q(l_k) x} + u_x(s, x)\big\} = 0, \quad\text{with } u(T, x) = g(x), \tag{3.8}$$

where $u_s, u_x$ are the partial derivatives of the (optimal-control) value function. Using the standard tools we immediately obtain the following result (proof in Appendix A):
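Note that the Hamiltonian in (3.8) is linear in $a$, so the pointwise supremum is attained at $a = \beta^i$ whenever the bracketed coefficient is positive, and at $a = 0$ otherwise; this bang-bang structure is what Theorem 2.1 formalizes. A minimal explicit finite-difference sketch of (3.8) (our own discretization choices, with an illustrative stand-in for $h^i_k$; not a statement about the paper's numerics) makes this visible:

```python
import numpy as np

beta, nu, T, q = 2.0, 0.5, 1.0, 1.0          # illustrative parameters
h = lambda s: 1.5 * np.exp(-s)               # stand-in for the (non-increasing) h_k^i
g = lambda x: -nu * x * np.exp(-q * x)       # terminal cost of (3.6)

ns, nx = 2000, 200
ds, xs = T / ns, np.linspace(0.0, beta * T, nx)
dx = xs[1] - xs[0]
assert beta * ds / dx < 1.0                  # CFL condition for the explicit scheme

u = g(xs)                                    # u(T, .); march backwards in s
for j in range(ns, 0, -1):
    s = j * ds
    ux = np.empty_like(u)
    ux[:-1] = (u[1:] - u[:-1]) / dx          # forward difference: the drift a >= 0 moves x up
    ux[-1] = ux[-2]
    coeff = (h(s) - nu * xs) * q * np.exp(-q * xs) + ux
    a_star = beta * (coeff > 0.0)            # bang-bang maximizer of the linear Hamiltonian
    u = u + ds * a_star * coeff              # u(s - ds, x) = u(s, x) + ds * sup_a {...}
print(u[0])                                  # approximates u(0, 0), i.e., the value in (3.5)
```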

Theorem 3.1. [Existence] At any stage $k$ and any state $z^i_k$:
(i) The optimal-control value function $u(\cdot, \cdot)$ is the unique viscosity solution of the HJB equation (3.8). Further, the value function $u(\cdot, \cdot)$ is Lipschitz continuous in $(s, x)$.
(ii) The function $h_k(\cdot)$ defined in equation (3.4) is Lipschitz continuous in $s$.
(iii) There exists an $a^*(\cdot)$ that solves the control problem (3.6). ∎

Remark: Theorem 3.1 implies that every stage of the DP equation has an optimizer, i.e., for every stage $k$ and state $z^i_k$, player $i$ has a best response policy, call it $a^{i*}_k(\cdot; z^i_k)$. Thus, $\pi^{i*} = \{a^{i*}_k(\cdot; z^i_k)\}_{k, z^i_k}$ is a BR strategy of player $i$ against $\pi^{-i}$ (see e.g., [12], [13]). In the following, we further show that the optimal strategy (in BR) can be a threshold strategy. We begin with (proof in Appendix A):

Lemma 3.2. For any $k$, $z^i_k$ and open-loop policy $a^i_k(\cdot)$, one can construct a threshold policy $\Gamma_\theta(\cdot)$ such that the transition time to the next stage under the threshold policy (denoted by $\tau_\theta$) stochastically dominates that under $a^i_k(\cdot)$ (call it $\tau_a$), i.e., $\tau_\theta \le_d \tau_a$; for any monotone decreasing function $f$, $E[f(\tau_a)] \le E[f(\tau_\theta)]$. Further, the expected costs (2.4) under both the policies are equal. ∎

Thus, at any stage $k$ and state $z^i_k$, agent $i$ can transition to the new stage faster using threshold policies, and the running rewards (2.3) are better (recall $c^i(\cdot; \pi^{-i})$ is monotone decreasing by assumption A.2). When the transitions are faster, i.e., when the next stage starts earlier, the "cost to go" (the value function) till the end is better; this is also true with a better initial state, as shown below (proof in Appendix A):

Lemma 3.3. For any stage $k$, and for states with $(t, -l'_k) \le (\tau, -l_k)$,

$$v^i_k(z'^i_k; \pi^{-i}) \ge v^i_k(z^i_k; \pi^{-i}) \quad\text{when } z'^i_k = (l'_k, t) \text{ and } z^i_k = (l_k, \tau). ∎$$

3.1. Completing the proof of Theorem 2.1. We need to prove that, for every stage, there exists a threshold policy which is a best response against any fixed strategy of the opponents. From Theorem 3.1 we have the existence of the optimal policy (the BR policy against any fixed $\pi^{-i}$) for every stage. Let us denote the optimal policy for the $k$-th stage and state $z^i_k$ by $a^*(\cdot)$, and say it is not a threshold policy. Let $\tau_{a^*}$ denote the contact epoch under policy $a^*$. If $a^*$ is the optimal policy, it must be the optimizer (maximizer) of the $k$-th stage DP equation (3.1) with state $z^i_k$, and so

$$v^i_k(z^i_k; \pi^{-i}) = r^i_k(z^i_k, a^*; \pi^{-i}) + E\big[v^i_{k+1}((l_{k+1}, \tau_{a^*}); \pi^{-i}) \mid z^i_k, a^*\big].$$

Construct a threshold policy $\Gamma_\theta$ from $a^*$ as in the proof of Lemma 3.2; then, from the same lemma and A.2 (see (2.3)), $r^i_k(z^i_k, a^*; \pi^{-i}) \le r^i_k(z^i_k, \Gamma_\theta; \pi^{-i})$. For the second term, from Lemma 3.3, for any $t \ge s$,

$$v(t) := \sum_{l \in \mathcal{L}_{k+1}} v^i_{k+1}((l, t); \pi^{-i})\, P(l \mid l_k, t, \pi^{-i}) \le \sum_{l \in \mathcal{L}_{k+1}} v^i_{k+1}((l, s); \pi^{-i})\, P(l \mid l_k, t, \pi^{-i})$$
$$\le \sum_{l \in \mathcal{L}_{k+1}} v^i_{k+1}((l, s); \pi^{-i})\, P(l \mid l_k, s, \pi^{-i}) = v(s), \quad\text{again using Lemma 3.3 and A.1.}$$

Hence, from Lemma 3.2 and (3.3), the second term is also not inferior under $\Gamma_\theta$:

$$E\big[v^i_{k+1}((l_{k+1}, \tau_{a^*}); \pi^{-i}) \mid z^i_k, a^*\big] = E[v(\tau_{a^*})] \le E[v(\tau_\theta)] = E\big[v^i_{k+1}((l_{k+1}, \tau_\theta); \pi^{-i}) \mid z^i_k, \Gamma_\theta\big].$$

This proves that a threshold policy is among the BR policies against $\pi^{-i}$, using (3.1). ∎

3.2. Proof of Theorem 2.2. We prove the second theorem in two steps; in the first step we prove that the optimal policies for the $k$-th stage coincide (for the same $l_k$) on all possible time intervals, irrespective of the start $\tau^i_{k-1}$ of this stage, as in the following:

Theorem 3.4. Let $\tau \ge t$ and fix $l$. The optimal/BR policy in the $k$-th stage, $a^{i*}_k(\cdot; z)$ with $z = (l, t)$, coincides with the BR policy $a^{i*}_k(\cdot; z')$ with $z' = (l, \tau)$ from $\tau$ onwards, i.e.,

$$a^{i*}_k(s; z) = a^{i*}_k(s; z') \quad\text{for all } \tau \le s \le T.$$

Proof: The optimal policy $a^{i*}_k(\cdot; z)$ for the $k$-th stage with state $z = (l, t)$ is the optimizer of the value function (3.6) with $x = 0$, i.e., $u(t, 0) = \sup_{a \in L_\infty[t, T]} J(t, 0, a)$. From the dynamic programming principle of optimal control problems [14, Theorem 5.12], we have:

$$u(t, 0) = \sup_{a \in L_\infty[t, \tau]} \Big\{ \int_t^\tau \big(h^i_k(s) - \nu x(s)\big)\, q(l)\, a(s)\, e^{-q(l) x(s)}\, ds + u(\tau, x(\tau)) \Big\}, \tag{3.9}$$

for any $\tau \in [t, T]$. As in [14, Lemma 4.2], one needs to find the optimizer for the time interval $[t, \tau)$, considering that the optimal control from $\tau$ onwards will be the one that attains the optimal $u(\tau, x(\tau))$, where $x(\tau)$ is the state at $\tau$. And if both the problems have optimal policies (the existence is established in Theorem 3.1), then the optimal policy for the entire interval is given by (as in [14, Page 10]):

$$a^{i*}_k(s) = a^*_1(s) \ \text{ for all } s \le \tau, \quad\text{and}\quad a^{i*}_k(s) = a^*_2(s) \ \text{ for all } s > \tau, \tag{3.10}$$

where $a^*_2(s)$ is the optimal policy attaining $u(\tau, x^*(\tau))$, $x^*(\tau)$ is the state at $\tau$ when $a^*_1$ is used in the interval $[t, \tau]$, and $a^*_1$ is the optimizer of (3.9). The optimal control from $\tau$ onwards in general depends on the state $x(\tau)$ at time $\tau$, but in our case, by Lemma A.1 given in Appendix A, (3.9) simplifies to:

$$u(t, 0) = \sup_{a \in L_\infty[t, \tau]} \Big\{ \int_t^\tau \big(h^i_k(s) - \nu x(s)\big)\, q(l)\, a(s)\, e^{-q(l) x(s)}\, ds + e^{-q(l) x(\tau)}\big[u(\tau, 0) - \nu x(\tau)\big] \Big\}.$$

By Lemma A.1, the optimal control from $\tau$ onwards is independent of the state at $\tau$, i.e., the optimal control policies defining $u(\tau, x(\tau))$ and $u(\tau, 0)$ are the same, and this common policy forms a part of $a^{i*}_k$ (see (3.10)); this completes the proof. ∎

Thus it suffices to optimize every stage with $z_k = (l_k, 0)$ (i.e., with $\tau_{k-1} = 0$); the rest of the optimal policies (with $\tau_{k-1} > 0$) can be constructed using these zero-starting optimal policies, which immediately leads to the following corollary:

Corollary 1. The optimal (BR) strategy can be completely specified by a finite ($L$) collection of control policies, $\pi^{i*}_0(\pi^{-i}) := \{a^{i*}_{0k}(\cdot; l_k)\}$, one for each state and stage $(l_k, k)$, with each of them starting at time zero, such that:

$$\pi^{i*}(\pi^{-i}) = \{a^{i*}_k(\cdot; z_k)\} \ \text{ for all } z_k, \text{ where (with } \tau_0 \equiv 0\text{)},$$
$$a^{i*}_k(s; z_k) = a^{i*}_{0k}(s; l_k) \ \text{ for all } s \ge \tau_{k-1}, \text{ when } z_k = (l_k, \tau_{k-1}), \text{ for any } l_k \in \mathcal{L}_k \text{ and } \tau_{k-1} \ge 0. ∎$$

Step 2: By Theorem 2.1, the (zero-starting) BR policies $a^{i*}_{0k}(\cdot; l_k)$ can be chosen to be threshold policies; in other words, any best response strategy can be described completely using $L$ thresholds, say $(\theta_{k, l_k})_{k, l_k}$. This implies the existence of an MT-strategy among the BR strategies against any given strategy profile of the opponents, which completes the proof of Theorem 2.2. ∎

4. Reduced Game. By Theorem 2.1 any BR includes a T-strategy, and further, by Theorem 2.2, at least one of the BR strategies is an MT-strategy. By virtue of these results, one can find an NE (if it exists) in a much reduced game consisting only of MT-strategies; the space of strategies in the original game is infinite dimensional, while that in the reduced game is $\mathbb{R}^L$.

In other words, one can reduce the problem to the following game, $\mathcal{G} = \langle N, S, \Phi \rangle$, where $N$ is the set of players as before, $S = \{S_i\}$ where each $S_i$ is simplified to a bounded set of $L$-dimensional vectors (the set of MT-strategies), $S_i = \{\Theta_i = (\theta^i_{l,k})\} \subset [0, T]^L$, and the utilities $\Phi = \{\phi^i\}$ are now redefined (see (2.2)-(2.5)) by the following:

$$\phi^i_{l_1}(\Theta_i; \Theta_{-i}) = \sum_{k=1}^{M} E\big[r^i_k(z^i_k, \theta^i_{l_k, k}; \Theta_{-i}) \,\big|\, z^i_1 = (l_1, 0)\big], \quad\text{where } \Theta_{-i} := \{\Theta_m\}_{m \ne i}. \tag{4.1}$$

The redefined terms ($r^i_k$, $\phi^i$, etc.) depend only upon the MT-strategies, the thresholds of which are defined using the zero-starting optimal policies $a^{i*}_{0k}(\cdot; l_k)$ of Corollary 1. In the following, we consider the two variants discussed in section 2 in detail.

5. Project completion in multiple phases (PCM). We now consider the PCM problem. There are n agents trying to acquire a project, which has multiple completion phases, before the deadline T; M locks have to be acquired sequentially before T. The locks are ordered, and the first one to contact the first lock acquires the project; the others lose the project completely. The successful agent has the potential to derive subsequent rewards depending upon the number of locks acquired thereafter.

The contact process of any agent is modelled by an independent Poisson process; the rate of the process is chosen by the agent and can possibly be non-homogeneous. A higher rate increases the probability of an earlier contact, but incurs a higher cost. After the first contact, the successful agent controls the rate to contact further locks, as every lock is associated with a specific reward.

The information structure of this problem is partial and asymmetric. Upon contact, the agents get partial information about the others; they get to know whether they are the first one to contact; until the first contact, they have no information about the others' progress, and may continue even if the lock is already taken by an opponent. As already mentioned, we presented a brief study of this problem in [4] with partial proofs. We now complete the proof, by modelling PCM using the framework of section 2.

The update in information occurs only at the contact epochs, and players can choose a new contact process for the remaining time interval (i.e., up to time $T$) based on the updated information. Hence the contact epochs become the natural decision epochs, which can be asynchronous across the agents. The state of any agent $i$ is the information available to it, represented by $z^i_k = (l_k, \tau^i_{k-1})$, where for $k \ge 2$, $\tau^i_{k-1}$ is the $(k-1)$-th contact epoch and $l_k \in \mathcal{L}_k = \{0, 1\}$, with $0$ representing the failure and $1$ the success of the $(k-1)$-th contact; also $\mathcal{L}_1 = \{1\}$ and $\tau^i_0 \equiv 0$. This (private) state remains the same between two decision epochs. If the agent acquires the project by a successful first contact, then all the subsequent contacts may result in some reward if they happen before $T$. We then call them successful contacts and set the corresponding $l_k = 1$; otherwise $l_k = 0$. The set of actions and the strategy of any agent are exactly as in section 2; parameters like $\beta^i$, $\nu$, etc., have the same meaning.

The agents can estimate their expected reward for any given strategy profile of the opponents. For any player $i$, the reward upon first contact is $c^i_1$ if it is the first one to contact, and zero otherwise. The chances of acquiring this reward depend upon the contact instance $\tau^i_1$ and the strategy profile $\pi^{-i}$ of the others. Conditioned on $\tau^i_1 \approx s$, the expected reward of agent $i$ is given by

$$c^i_1(s; \pi^{-i}) = c^i_1\, P^i_1(s; \pi^{-i}), \tag{5.1}$$

where $P^i_1(s; \pi^{-i})$ is the probability that $i$ is the first one to contact the first lock; this is the probability that none of the opponents have contacted the first lock before instance $s$, under opponent strategy profile $\pi^{-i}$, and it equals

$$P^i_1(s; \pi^{-i}) = e^{-\sum_{m \ne i} \bar a^m_1(s)}, \quad\text{where } \bar a^m_1(s) = \int_0^s a^m_1(t)\, dt. \tag{5.2}$$

Observe $\bar a^m_k(s)$ is exactly as in (2.2), and the policies $\{a^m_1(\cdot)\}_{m \ne i}$ are given by $\pi^{-i}$. If agent $i$ is successful in acquiring the project, then the expected reward for the subsequent stages ($k \ge 2$) is $c^i_k$ if the corresponding lock is acquired before $T$, and only then $l_{k+1} = 1$. To summarise, for $k \ge 2$, conditioned on $\tau^i_k \approx s$ (see footnote 3),

$$(c^i_k(s; \pi^{-i}),\ l_{k+1}) = \begin{cases} (c^i_k,\ 1) & \text{if } l_1 = 1 \text{ and } s \le T, \\ (0,\ 0) & \text{else.} \end{cases} \tag{5.3}$$

In all, the expected reward of any player $i$, conditioned on its state $z^i_k = (l_k, \tau^i_{k-1})$, equals zero if $l_k = 0$; otherwise it is given by the following when action $a^i_k(\cdot)$ is used:

$$E_{z^i_k}\big[c^i(T \wedge \tau^i_k); \pi^{-i}\big] = \int_{\tau^i_{k-1}}^{T} c^i(s; \pi^{-i})\, a^i_k(s)\, e^{-\bar a^i_k(s)}\, ds. \tag{5.4}$$

To summarise, the state- and action-dependent (Poisson) density, for any $k$ and any $l_k$, can be written as $f_{q(l_k)}(\cdot)$ of (2.2) with $q(l_k) = l_k$.

This problem can thus be formulated as in (2.5), with $r^i_k$ defined exactly as in (2.3)-(2.4) using (5.1)-(5.4). We begin by proving the required assumptions A.1 and A.2. For A.1, the transition probability for any $t \le T$ equals (set $z^i_0 = (l_1, \tau^i_0) = (1, 0)$):

$$P(l_{k+1} \mid t, l_k, \pi^{-i}) = \begin{cases} P^i_1(t; \pi^{-i}) & \text{if } k = 1 \text{ and } l_{k+1} = 1, \\ 1 - P^i_1(t; \pi^{-i}) & \text{if } k = 1 \text{ and } l_{k+1} = 0, \\ \mathbb{1}_{\{l_{k+1} = l_k\}} & \text{if } k \ge 2, \end{cases}$$

which is clearly Lipschitz continuous in $t$ for any $l_{k+1}, l_k, \pi^{-i}$. Further, $q(l) = l$, $\mathcal{L}_1 = \{1\}$ (a singleton), $P^i_1(\cdot; \pi^{-i})$ is monotone, and for $k \ge 2$ and any $f$ we have $E[f(l_{k+1}) \mid l_k, \cdot, \cdot] = f(l_k)$; these satisfy A.1. From equations (5.1)-(5.3), the reward functions $\{c^i_k(\cdot; \pi^{-i})\}_k$ are non-increasing and Lipschitz continuous for any $\pi^{-i}$, satisfying A.2.


Therefore Theorem 2.2 is applicable. This completes the proof of the statement that the reduced game with MT-strategies can provide an NE for the original game corresponding to PCM. Hence, one can find an NE (if it exists) in the reduced game.

We in fact have a unique NE in this reduced game, as discussed in [4]. This is proved by further reducing the game to a one-dimensional game (see [4]). The uniqueness of the NE among MT-strategies, and a method to find the unique NE, is provided in [4, Theorems 6-7]. We reproduce the result here for the sake of completeness (see [4, 15] for proof):

Theorem 5.1. The unique NE is given by the set of thresholds $\{\theta^{i*}_k\}_{i \le n, k \le M}$, defined recursively along with the 'NE-value functions' $\Upsilon^{i*}_k(\cdot)$, as below, for any $i \le n$ and $k \le M$:
(i) for $k = M$, i.e., for the last lock, we have

$$\theta^{i*}_M = T\, \mathbb{1}_{\{c^i_M > \nu\}}, \quad\text{and}\quad \Upsilon^{i*}_M(t) := (c^i_M - \nu)\big(1 - e^{-\beta^i (T - t)}\big)\, \mathbb{1}_{\{c^i_M > \nu\}};$$

(ii) for any $2 \le k < M$ (with $\emptyset$ the empty set),

$$\theta^{i*}_k := \inf\{t \ge 0 : c^i_k + \Upsilon^{i*}_{k+1}(t) \le \nu\}, \quad \inf \emptyset := T,$$
$$\Upsilon^{i*}_k(t) = \mathbb{1}_{\{t < \theta^{i*}_k\}} \int_t^{\theta^{i*}_k} \big(c^i_k + \Upsilon^{i*}_{k+1}(s) - \nu\big)\, \beta^i e^{-\beta^i (s - t)}\, ds;$$

and (iii) for $k = 1$, i.e., for the first lock, the thresholds simultaneously satisfy the following:

$$\theta^{i*}_1 = \inf\big\{t : \big(\Upsilon^{i*}_2(t) + c^i_1\big)\, e^{-\sum_{m \ne i} \beta^m (t \wedge \theta^{m*}_1)} \le \nu\big\} \wedge T, \quad\text{for all } 1 \le i \le n. ∎ \tag{5.5}$$
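The recursion of Theorem 5.1 is directly implementable; the sketch below (our code, with grid sizes, iteration counts and reward values chosen purely for illustration) evaluates the $\Upsilon^{i*}_k$ recursions on a time grid and solves the simultaneous condition (5.5) for the first lock by repeated substitution:

```python
import numpy as np

def pcm_thresholds(c, beta, nu, T, n_grid=2001, iters=200):
    """Thresholds of Theorem 5.1; c[i][k-1] is agent i's reward for lock k."""
    n, M = len(c), len(c[0])
    t = np.linspace(0.0, T, n_grid)
    theta, Ups2 = np.zeros((n, M)), []
    for i in range(n):
        # (i) last lock: attempt till T iff c_M > nu
        Ups = (c[i][M-1] - nu) * (1.0 - np.exp(-beta[i] * (T - t))) * (c[i][M-1] > nu)
        theta[i, M-1] = T * (c[i][M-1] > nu)
        # (ii) locks M-1, ..., 2 backwards: theta_k = inf{t : c_k + Ups_{k+1}(t) <= nu}
        for k in range(M - 2, 0, -1):
            below = np.nonzero(c[i][k] + Ups <= nu)[0]
            th = t[below[0]] if below.size else T
            theta[i, k] = th
            integrand = (c[i][k] + Ups - nu) * beta[i] * (t <= th)
            Ups = np.array([np.trapz(integrand[j:] * np.exp(-beta[i] * (t[j:] - t[j])), t[j:])
                            for j in range(n_grid)]) * (t < th)
        Ups2.append(Ups)  # this is Upsilon_2^i on the grid
    # (iii) first lock: iterate the simultaneous condition (5.5)
    th1 = np.full(n, T)
    for _ in range(iters):
        new = th1.copy()
        for i in range(n):
            expo = sum(beta[m] * np.minimum(t, th1[m]) for m in range(n) if m != i)
            below = np.nonzero((Ups2[i] + c[i][0]) * np.exp(-expo) <= nu)[0]
            new[i] = t[below[0]] if below.size else T
        if np.allclose(new, th1):
            break
        th1 = new
    theta[:, 0] = th1
    return theta

# Two symmetric agents, M = 3 locks (illustrative numbers):
print(pcm_thresholds(c=[[1.0, 0.8, 1.5], [1.0, 0.8, 1.5]], beta=[2.0, 2.0], nu=0.5, T=1.0))
```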

Remarks: We found the NE of the PCM game by reducing it to a much simpler game, such that an NE of the reduced game is also an NE of the original game. Theorem 5.1 gives a procedure to find the unique NE of the reduced game. In [4], we computed approximate closed-form expressions of the NE for some cases. We have also provided some interesting numerical examples in [4], and some more in [15]. Some of the important conclusions related to this application are: a) one either tries for the last lock till the end or never attempts it, i.e., $\theta^{i*}_M \in \{0, T\}$; b) the thresholds can be non-monotone (in $k$), depending upon the rewards associated with the various stages (locks acquired). The non-monotonicity is observed if the rewards in the later stages (after more lock acquisitions) are much higher than those in the earlier stages; then, at some time instances, the optimal policy allows the agent to continue with the further acquisition process only if the number of acquisitions is larger (see figures 2-4 in [15]).

In this example, the competition ends with the first lock. However, the other agents might continue their efforts, unaware of the end of the competition. This is already an interesting and practically important variant of partial information games. It would be more interesting if the competition extended beyond the first lock. We consider one such variant in the next section, wherein agents try till the end to acquire as many (identical and unordered) locks as possible, without really knowing if some/all of them are already acquired.

6. Advertising over a Social network (ASN). We now consider the second variant discussed in section 2, a more realistic variant of the problem in [1].

There are n CPs trying to acquire M potential customers through advertising. The contact process is modelled using independent Poisson search clocks, where a higher rate results in higher chances of contact⁴, but also incurs a higher cost. The customer chooses one among all the CPs that contacted it till a given deadline T, with a higher priority to the ones that contacted earlier. This choice, and hence the reward, is revealed only after T. The CPs only know the customers/locks contacted by them and have no information about the contact status of the opponents.

⁴We assume that any content provider can contact any customer only once.

We again model this problem using the framework of section 2. Unlike the first variant discussed in section 5, neither the success of a contact nor any partial information about the opponents is revealed upon contact. Throughout the game, the agents only know the locks contacted by them and are unaware of the success/failure of each contact. Whenever an agent contacts a lock, it can choose a new advertisement rate process for the remaining time period based on the left-over opportunities. Hence the contact epochs for any agent again become the natural decision epochs. These epochs are again asynchronous across the agents. The state $z^i_k$ of any agent $i$ is given by $(l_k, \tau^i_{k-1})$, where $l_k \in \mathcal{L}_k = \{M - k + 1\}$ is the number of left-over locks and $\tau^i_{k-1}$ is the previous contact epoch. This (private) state remains the same between two decision epochs. The actions and strategies are exactly as in section 2.

Rewards and costs: The agents do not receive any reward till the deadline $T$. They may not receive the reward at any intermediate epoch, but they can estimate the corresponding expected reward for any given strategy profile of the opponents, as explained in the following. For any player $i$, the expected reward upon the $k$-th contact depends upon the number of opponents that have already contacted the same lock; this in turn depends upon the exact contact instance $\tau^i_k$ and the opponent strategy profile $\pi^{-i}$. Conditioned on $\tau^i_k \approx s$, the expected reward of agent $i$ is given by

$$c^i_k(s; \pi^{-i}) = c^i \sum_{m=1}^{n} P^i_m(s; \pi^{-i})\, w_m, \tag{6.1}$$

where: a) $c^i$ is the fixed reward of agent $i$ associated with any lock; b) $P^i_m(s; \pi^{-i})$ is the probability that $i$ is the $m$-th player to contact the lock, given the contact instance of that particular lock is $s$ (the expression for $P^i_m(\cdot, \cdot)$ is given in (B.1)-(B.3) of Appendix B); and c) $w_m$ is the probability that the $m$-th one to contact will get the reward. We make the natural assumption that $w_m \ge w_{m+1}$ for every $m$, and these constants are known to all the players. We have identical locks, and hence $c^i_k(\cdot; \pi^{-i})$ does not depend on $k$, as is also evident from (6.1); thus we denote it by $c^i(\cdot; \pi^{-i})$.

After contacting $(k-1)$ locks, the $k$-th contact is made when the agent finds one among the $l_k = M - k + 1$ left-over locks (the contact epoch is the minimum of $l_k$ independent exponential clocks, each governed by $a^i_k(\cdot)$). In all, the expected reward of any player $i$, conditioned on its state $z^i_k = (l_k, \tau^i_{k-1})$ and action $a^i_k(\cdot)$, is given by:

$$E_{z^i_k}\big[c^i(T \wedge \tau^i_k); \pi^{-i}\big] = \int_{\tau^i_{k-1}}^{T} c^i(s; \pi^{-i})\, l_k a^i_k(s)\, e^{-l_k \bar a^i_k(s)}\, ds, \qquad \bar a^i_k(t) = \int_{\tau^i_{k-1}}^{t} a^i_k(s)\, ds. \tag{6.2}$$

Thus the state- and action-dependent (Poisson) density $f_q(\cdot)$ of (2.2) has the following form in (6.2): $q(l_k) = l_k$, and $f_{q(l_k)}(s; a^i_k)\, ds = l_k a^i_k(s)\, e^{-l_k \bar a^i_k(s)}\, ds$. For ease of notation, we let $f_{q(l_k)}(\cdot; a^i_k) = f_{l_k}(\cdot)$. This problem can be formulated as in (2.5), with $r^i_k$ defined exactly as in (2.3)-(2.4) using (6.1)-(6.2).

Assumptions: We begin by proving the assumptions A.1-A.2. For A.1, observe that $P(l_{k+1} \mid l_k, t, \pi^{-i}) = \mathbb{1}_{\{l_{k+1} = l_k - 1\}}$, which is a constant function (in $t$) and hence trivially Lipschitz continuous; further, $\mathcal{L}_k$ is a singleton and $q(l_k) = l_k$. Hence A.1 is satisfied.

For A.2, define $\bar P^i_j(s; \pi^{-i}) := \sum_{m \ge j} P^i_m(s; \pi^{-i})$, the probability that $i$ is at least the $j$-th player to contact the lock (i.e., at least $(j-1)$ opponents have already contacted it), given that it contacted that lock at time instance $s$. Then $c^i_k(s; \pi^{-i}) = c^i \sum_{m=1}^{n} P^i_m(s; \pi^{-i})\, w_m = c^i \sum_{j=1}^{n} \bar P^i_j(s; \pi^{-i})\, (w_j - w_{j-1})$, with $w_0 := 0$. Also, observe that $\bar P^i_j(\cdot; \pi^{-i})$ is a monotone increasing function⁵, $\bar P^i_1 \equiv 1$, and $(w_j - w_{j-1}) \le 0$ for $j \ge 2$. Thus $c^i(\cdot; \pi^{-i})$ is a monotone decreasing function. By Lemma B.1 in Appendix B, $c^i_k(s; \pi^{-i})$ is Lipschitz continuous, thus satisfying A.2.

Hence, using Theorem 2.2, one can find an NE (if it exists) in the reduced game (4.1), in the space of MT-strategies. Observe that the $\{\mathcal{L}_k\}_k$ are singletons, and so an MT-strategy is given by an $M$-dimensional vector, $\Theta_i = (\theta^i_1, \theta^i_2, \dots, \theta^i_M)$, and thus $a^i_k(t) = \beta^i\, \mathbb{1}_{\{t \le \theta^i_k\}}$. Thus, for any $k$, the cost (and the other terms) defined in (2.2)-(2.4) and (6.1)-(6.2) simplify as follows: $r^i_k(z^i_k, \theta^i_k; \Theta_{-i}) = 0$ if $\tau^i_{k-1} \ge \theta^i_k$, and otherwise, with $l = M - k + 1$,

$$r^i_k(z^i_k, \theta^i_k; \Theta_{-i}) = -\nu\beta^i(\theta^i_k - \tau^i_{k-1})\, e^{-l\beta^i(\theta^i_k - \tau^i_{k-1})} + \int_{\tau^i_{k-1}}^{\theta^i_k} \big(c^i(s; \Theta_{-i}) - \nu\beta^i(s - \tau^i_{k-1})\big)\, f_l(s)\, ds$$
$$\stackrel{a}{=} \int_{\tau^i_{k-1}}^{\theta^i_k} \big(c^i(s; \Theta_{-i}) - \nu_k\big)\, f_l(s)\, ds, \quad\text{where } \nu_k := \frac{\nu}{M - k + 1}, \quad f_l(s) = l\beta^i e^{-l\beta^i(s - \tau^i_{k-1})}. \tag{6.3}$$

The equality ‘a’ is obtained by simplifying after rewriting the first term as an integral.
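A direct numerical evaluation of (6.3) is straightforward; a minimal sketch (our naming, with the reward function passed in as a callable) reads:

```python
import numpy as np

def stage_utility(c, theta_k, tau_prev, l, beta, nu_k, num=4000):
    """r_k^i of (6.3): integral of (c(s) - nu_k) * f_l(s) over [tau_prev, theta_k],
    with f_l(s) = l * beta * exp(-l * beta * (s - tau_prev)); zero if tau_prev >= theta_k."""
    if tau_prev >= theta_k:
        return 0.0
    s = np.linspace(tau_prev, theta_k, num)
    f_l = l * beta * np.exp(-l * beta * (s - tau_prev))
    return np.trapz((c(s) - nu_k) * f_l, s)

# e.g., with a constant expected reward c(s) = 1 and nu_k = 0.5:
print(stage_utility(lambda s: np.ones_like(s), theta_k=0.8, tau_prev=0.0, l=2, beta=2.0, nu_k=0.5))
```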

6.1. Remarks on the $c^i$ function. Before we proceed, we note some important properties of the expected reward function $c^i(\cdot; \Theta_{-i})$, which are instrumental in analysing the reduced game. It is easy to verify the following statements. The expected reward function $c^i(\cdot; \Theta_{-i})$, against any fixed MT strategy $\Theta_{-i}$ of the opponents, is a strictly monotone decreasing function till $\max_{k, j: j \ne i} \theta^j_k$, and is constant afterwards. Further, for any $\Theta_{-i}$ and $\Theta'_{-i}$, $c^i(s; \Theta_{-i}) = c^i(s; \Theta'_{-i})$ if all the opponent thresholds smaller than $s$ are the same in $\Theta_{-i}$ and $\Theta'_{-i}$ (see Lemma B.2 for more details and a general statement).

Best response: We begin with the analysis of the BR of agent $i$ against an MT strategy $\Theta_{-i}$ of the opponents. From Theorem 3.4, the BR threshold for the $k$-th lock can be derived, for any arbitrary starting point ($\tau^i_{k-1}$), using the solution of the DP equations (3.1) with zero starting point ($\tau^i_{k-1} = 0$), i.e., with $z^i_k = (M - k + 1, 0)$:

$$v^i_k(z^i_k; \Theta_{-i}) = \sup_{\theta^i_k \in [0, T]} \Big\{ r^i_k(z^i_k, \theta^i_k; \Theta_{-i}) + E\big[v^i_{k+1}(z^i_{k+1}; \Theta_{-i}) \mid z^i_k, a^i_k\big] \Big\}. \tag{6.4}$$

In the above, the conditional expectation in the last term is only over $\tau^i_k$, as $l_{k+1} = M - k$ (a constant). This can be re-written as in (3.2), now for the specific application and when restricted to MT-strategies:

$$v^i_k(z^i_k; \Theta_{-i}) = \sup_{\theta^i_k \in [0, T]} J^i_k(z^i_k, \theta^i_k; \Theta_{-i}), \tag{6.5}$$

where the cost $J^i_k$ is again defined in integral form as below (see (6.3)):

$$J^i_k(z^i_k, \theta^i_k; \Theta_{-i}) = \int_0^{\theta^i_k} \big(h^i_k(s; \Theta_{-i}) - \nu_k\big)\, f_{M-k+1}(s)\, ds, \quad\text{with} \tag{6.6}$$
$$h^i_k(s; \Theta_{-i}) := c^i(s; \Theta_{-i}) + v^i_{k+1}(z^i_{k+1}; \Theta_{-i}), \quad\text{where } z^i_{k+1} = (M - k, s).$$

By Lemma 3.3, $v^i_{k+1}((M-k, \cdot); \Theta_{-i})$, and hence $h^i_k(\cdot; \Theta_{-i})$, is a non-increasing function for any given $k$ and $\Theta_{-i}$ (also using A.2). Therefore, from the integral form of $J^i_k$ given in (6.6), the set of thresholds corresponding to the $k$-th lock in the BR is given by:

$$\mathcal{B}^i_k(\Theta_{-i}) = \begin{cases} \{0\} & \text{if } h^i_k(0; \Theta_{-i}) \le \nu_k, \\ \{T\} & \text{if } h^i_k(T; \Theta_{-i}) > \nu_k, \\ \{t : h^i_k(t; \Theta_{-i}) = \nu_k\} & \text{else.} \end{cases} \tag{6.7}$$



Combining the above, the set of BRs of player $i$ against the opponent strategy $\Theta_{-i}$ is the product of the $M$ component-wise BR sets $\{B^i_k(\Theta_{-i})\}_k$ corresponding to each lock:

$$B^i(\Theta_{-i}) = B^i_1(\Theta_{-i}) \times B^i_2(\Theta_{-i}) \times \cdots \times B^i_M(\Theta_{-i}).$$
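Computationally, when the interior case of (6.7) applies, the threshold solving $h^i_k(t;\Theta_{-i}) = \nu_k$ can be located by bisection, since $h^i_k$ is continuous and non-increasing. A minimal sketch (ours, in Python; `h`, `nu_k` are an assumed callable and parameter, not objects defined in the paper):

```python
def br_threshold(h, nu_k, T, tol=1e-9):
    """Best-response threshold for one lock, following (6.7).

    h     : callable, a continuous non-increasing map t -> h_k^i(t; Theta_{-i})
    nu_k  : effective cost rate nu / (M - k + 1)
    Returns 0, T, or the root of h(t) = nu_k found by bisection.
    """
    if h(0.0) <= nu_k:          # advertising never pays: stop immediately
        return 0.0
    if h(T) > nu_k:             # advertising pays throughout: never stop
        return T
    lo, hi = 0.0, T             # here h(lo) > nu_k >= h(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > nu_k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```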

6.2. Characterizing Nash Equilibrium. We first continue with the BR analysis. In (6.7), the BR depends on $h^i_k(\cdot;\Theta_{-i})$, which in turn depends on the (next-stage) value function $v^i_{k+1}$. We immediately have the following monotonicity result (proof in Appendix B):

Lemma 6.1. [Monotone in contacted locks] For any agent $i$, the BR value function after $k$ contacts is greater than or equal to that after $(k+1)$ contacts, if the given contacts occur within the same time $t$; i.e., against any $\Theta_{-i}$,

$$v^i_k((M-k+1,t);\Theta_{-i}) \ge v^i_{k+1}((M-k,t);\Theta_{-i}). \qquad \blacksquare$$

Thus, the value function starting from the same time $t$ is higher when more opportunities are left over. Hence one might anticipate that the BR thresholds are monotone decreasing. This is indeed true; we further show that the BR can be characterized directly using the $c^i(\cdot;\Theta_{-i})$ function (proof in Appendix B):

Theorem 6.2. [BR using $c^i(\cdot;\Theta_{-i})$] For any player $i$ and strategy $\Theta_{-i}$ of the opponents, define $\underline{k} := \min\{k : c^i w_1 < \nu_k,\ k \ge 1\}$ and $\bar{k} := \max\{k : c^i(T;\Theta_{-i}) > \nu_k,\ k \ge 1\}$; set $\bar{k} = 0$ if the maximum is over an empty set. Then:

(i) Any best response is a monotone decreasing MT strategy, i.e., $\theta^{i*}_k \ge \theta^{i*}_{k+1}$.

(ii) $\theta^{i*}_k = 0$ for all $k \ge \underline{k}$, and $\theta^{i*}_k = T$ for all $k \le \bar{k}$.

(iii) $B^i_k(\Theta_{-i}) = \{t \ge 0 : c^i(t;\Theta_{-i}) = \nu_k\}$ if $\bar{k} < k < \underline{k}$.

(iv) These properties are true even at Nash equilibrium. $\blacksquare$

By the above theorem, the next-stage value functions are not required to find the BR thresholds; the reward function $c^i(\cdot;\Theta_{-i})$ is sufficient. Further, by part (i), the NE is made up of monotone decreasing thresholds. In view of the above, we immediately have the following characterization of the NE (proof in Appendix B):

Theorem 6.3. [Nash Equilibrium characterisation] The NE always exists. The tuple $(\Theta^{1*},\ldots,\Theta^{n*})$, with $\Theta^{i*} = (\theta^{i*}_1,\ldots,\theta^{i*}_M)$, is an NE if and only if, for all $k, i$:

$$\theta^{i*}_k \in \begin{cases} \{0\} & \text{if } c^i w_1 \le \nu_k, \\ \{T\} & \text{if } c^i(T;\Theta^*_{-i}) > \nu_k, \\ \{t : c^i(t;\Theta^*_{-i}) = \nu_k\} & \text{else.} \end{cases} \qquad\blacksquare \tag{6.8}$$

6.2.1. Multiple NE. Unlike the previous application, this reduced game can have multiple NE. We begin with an example with two symmetric players ($n=2$) and two locks ($M=2$). In this case $c^i = c$ and $\beta^i = \beta$ for both $i = 1,2$, and from equations (6.1) and (B.1)-(B.3) of Appendix B (with $j := -i$ and $\Theta^j = (\theta^j_1,\theta^j_2)$),

$$c^i(s;\Theta^j) = c\big(w_1 P^i_1(s;\Theta^j) + w_2(1 - P^i_1(s;\Theta^j))\big) = c\big(P^i_1(s;\Theta^j)(w_1-w_2) + w_2\big), \ \text{with}$$
$$P^i_1(s;\Theta^j) = \frac{1}{2}\Big(e^{-2\beta(\theta^j_1\wedge s)} + 2e^{-\beta(\theta^j_2\wedge s)} - e^{-2\beta(\theta^j_2\wedge s)}\Big). \tag{6.9}$$
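The closed form (6.9) is straightforward to evaluate numerically; the sketch below (Python) simply transcribes it and can be used to check the NE conditions of Theorem 6.3 at candidate thresholds (the function names are ours):

```python
import math

def P1(s, theta_j, beta):
    """P^i_1(s; Theta_j) from (6.9): probability that i is the first
    contacter of the lock, for n = 2 players and M = 2 locks."""
    t1, t2 = theta_j
    return 0.5 * (math.exp(-2 * beta * min(t1, s))
                  + 2 * math.exp(-beta * min(t2, s))
                  - math.exp(-2 * beta * min(t2, s)))

def c_i(s, theta_j, beta, c, w1, w2):
    """Expected reward c^i(s; Theta_j) of (6.9)."""
    return c * (P1(s, theta_j, beta) * (w1 - w2) + w2)
```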

One can have a unique NE or multiple NE, depending upon the problem parameters. By Theorem 6.3, it is easy to verify that $(\Theta^{1*},\Theta^{2*})$ (given below) is the unique and symmetric NE, i.e., $\Theta^{1*} = \Theta^{2*} = \Theta^*$, in the following cases (recall $\nu_1 = \nu/(M-1+1) = \nu/2$ and $\nu_2 = \nu$):

$$\Theta^* = (\theta^*_1,\theta^*_2) = \begin{cases} (0,0) & \text{if } cw_1 \le \nu_1, \\ (T,0) & \text{if } \nu_1 \le cw_1 \le \nu_2 \text{ and } c(T;(T,0)) \ge \nu_1, \\ (T,T) & \text{if } \nu_2 < cw_1 \text{ and } c(T;(T,T)) \ge \nu_2, \\ \Big(T,\ \frac{1}{\beta}\ln\big(\frac{cw_1-cw_2}{\nu_2-cw_2}\big)\Big) & \text{if } \nu_2 < cw_1 \text{ and } c(T;(T,T)) < \nu_2. \end{cases}$$


The last line follows from (6.9) because, for the last case, there is a unique $\theta^*_2$ satisfying the third line of (6.8), by the monotonicity of $P^i_1(s;\Theta^j)$. We are thus only left with the sub-case in which $\nu_1 \le cw_1 < \nu_2$ and $c(T;(T,0)) < \nu_1$. We have multiple NE for this sub-case, in fact uncountably many. Exactly one of them is a symmetric NE, given by $(\Theta^*,\Theta^*)$ with $\Theta^* = (\theta^*_1,\theta^*_2)$, where

$$\theta^*_1 := \frac{1}{2\beta}\ln\left(\frac{cw_1 - cw_2}{\nu - (cw_1 + cw_2)}\right) \ \text{ and } \ \theta^*_2 = 0. \tag{6.10}$$
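The expression (6.10) can be recovered directly from (6.8)-(6.9): with the opponent playing $\Theta^j = (\theta^*_1, 0)$ and $s \le \theta^*_1$, (6.9) reduces to $P^i_1(s;\Theta^j) = \frac{1}{2}(e^{-2\beta s} + 2 - 1) = \frac{1}{2}(1 + e^{-2\beta s})$, so the interior condition $c^i(\theta^*_1;\Theta^j) = \nu_1 = \nu/2$ reads

$$c\Big(\tfrac{1}{2}\big(1 + e^{-2\beta\theta^*_1}\big)(w_1 - w_2) + w_2\Big) = \frac{\nu}{2} \iff e^{-2\beta\theta^*_1} = \frac{\nu - (cw_1 + cw_2)}{c(w_1 - w_2)},$$

which rearranges to (6.10).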

Again from Theorem 6.3, the other NE of this symmetric game are asymmetric and are given by the following set (with $\theta^*_1$ as in (6.10), and $\tilde\theta^*_1$ denoting the larger first threshold of the deviating player):

$$\bigcup_{i=1,2}\Big\{(\Theta^{1*},\Theta^{2*}) : \Theta^{i*} = (\tilde\theta^*_1, 0),\ \Theta^{-i*} = (\theta^*_1, 0) \text{ and } \tilde\theta^*_1 \in (\theta^*_1, T]\Big\}.$$

It is evident from the above example that the NE in the ASN problem need not be unique. In fact, in any symmetric game, i.e., when $c^i = c^j$ and $\beta^i = \beta^j$ for all $i,j \in \mathcal{N}$, if there is an NE with an interior first threshold ($0 < \theta^{i*}_1 < T$ for some $i$), then there are uncountably many NE along with a symmetric NE (see subsection 6.1). Such NE are symmetric from the second threshold onwards and are given by the following set:

$$\bigcup_{i=1}^{n}\Big\{(\Theta^{1*},\Theta^{2*},\ldots,\Theta^{n*}) : \Theta^{j*} = (\theta^*_1,\Theta^*_{2:M})\ \forall j\ne i,\ \Theta^{i*} = (\tilde\theta^*_1,\Theta^*_{2:M}) \text{ and } \tilde\theta^*_1\in(\theta^*_1,T]\Big\}. \tag{6.11}$$

In the above, the vector $\Theta^*_{2:M} := (\theta^*_2,\theta^*_3,\ldots,\theta^*_M)$ represents the NE thresholds from the second lock onwards, which can be derived from (6.8) of Theorem 6.3.

We discuss other case studies through numerical examples in section 7. Prior to that, we conclude this section with the exact details of the NE strategy for a special case.

6.2.2. NE for a special case. From Theorems 6.2-6.3, one may anticipate that the NE thresholds equal $T$ initially and equal zero towards the end. This is indeed true for a special case with sufficiently small $\beta^i$ (proof in Appendix B).

Lemma 6.4. Assume there exists a $\bar\beta$ such that $0 < \beta^i < \bar\beta$ for all $i$. Say $\nu_{k_i} < c^i w_1 \le \nu_{k_i+1}$ for some $k_i \le M$. Then $(\Theta^{1*},\Theta^{2*},\ldots,\Theta^{n*})$ with $\Theta^{i*} = (T,\cdots,T,\theta^*_{k_i},0,\cdots,0)$ for all $i$ is an NE, where $\theta^*_{k_i}$ is the solution of $c^i(\theta^*_{k_i};\Theta^*_{-i}) = \nu_{k_i}$ (if it exists), and $T$ otherwise. $\blacksquare$

7. Numerical examples. We now study some numerical examples and provide an algorithm to compute the NE. The NE are verified by directly checking the BRs.

Algorithm to find the NE: The algorithm is constructed with the help of Theorem 6.3, and of fictitious-play [16] and stochastic-approximation [17] based ideas, so as to achieve the equality in (6.8) describing the NE. Towards this we need k-estimates, as explained below.

k-estimates, $c^{i,k}(s,\Theta_{-i})$: The expected reward $c^i(s,\Theta_{-i})$ is given by integrals, as in equations (6.1) and (B.1)-(B.3) of Appendix B. We obtain its k-estimate by computing the mean of $k$ realizations of the integrand, each obtained using realizations of the relevant random variables (generated using the densities $f_l(\cdot)$ under MT-strategies). For each such realization, a lock is chosen at random and the contact instance of player $i$ is set to $\tau^i = s$; the contact instances of the opponents for this particular lock are generated using $\Theta_{-i}$; and one of the contacted players is rewarded according to the distribution $\{w_m\}$.
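A minimal Monte Carlo sketch of such a k-estimate is given below (Python; the names `contacts_by`, `c_hat` and the simulation shortcuts are ours, not the authors'). It relies on two facts recorded in (B.2)-(B.3): under an MT strategy, an agent that fails to contact within the current advertising window never contacts again (the rate is zero afterwards), and each contact lands on a uniformly random left-over lock, so a tagged lock is among $\kappa$ contacted locks with probability $\kappa/M$.

```python
import random

def contacts_by(theta, beta, M, s):
    """Number of locks an opponent with MT thresholds theta = (theta_1..theta_M)
    and rate beta has contacted by time s; at stage k (k contacts made), the
    contact hazard is (M - k) * beta on [t, theta_{k+1}] and zero afterwards."""
    t, k = 0.0, 0
    while k < M:
        t_next = t + random.expovariate((M - k) * beta)
        if t_next > theta[k]:   # window closed without contact: no more contacts
            return k
        if t_next > s:          # next contact falls after the query time s
            return k
        t, k = t_next, k + 1
    return k

def c_hat(i, s, Theta, beta, M, c_i, w, n_samples=1000):
    """Monte Carlo k-estimate of c^i(s; Theta_{-i}); w[m] is the lock's
    probability of choosing the m-th contacter (w[0] unused)."""
    total = 0.0
    for _ in range(n_samples):
        rank = 1                                 # i contacts the tagged lock at s
        for j, theta_j in enumerate(Theta):
            if j == i:
                continue
            kj = contacts_by(theta_j, beta[j], M, s)
            if random.random() < kj / M:         # tagged lock among j's contacts
                rank += 1
        total += c_i * w[rank]
    return total / n_samples
```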

We compute the NE using Algorithm 7.1 and study a few numerical examples. In the first example, tabulated in Table 1, we consider symmetric players with $M=2$, $c=1$, $\beta=1$, $T=5$, $\nu=0.5$, $w_1=0.7$, $w_2=0.2$ and $w_i=0$ for $i\ge 3$. We vary the number of players $n$ from 2 to 6. The results affirm the findings of subsection 6.2.1: a) we have symmetric NE, which are tabulated in Table 1; b) the first threshold $\theta^{i*}_1 = T$ and the NE is unique for $n \le 3$; and c) for $n \ge 4$, we have an interior first threshold.


Algorithm 7.1 NE-Algorithm, with inputs: $M, n, \nu, T, \{\beta^i\}, \{w_m\}, \{c^i\}, k, \{\Theta^{i,0}\}$

Step 1: Set $t = 0$ and initialize with the strategy profile $\{\Theta^{i,0}\}$.
Step 2: For each agent $i = 1,\ldots,n$:
 i) find the k-estimate, and if $c^{i,k}(T,\Theta_{-i,t}) > \nu_l$ for some $l$, then set $\theta^i_{l',t+1} = T$ for all $l' \le l$;
 ii) if $c^i w_1 < \nu_l$ for some $l$, then set $\theta^i_{l',t+1} = 0$ for all $l' \ge l$;
 iii) stochastic approximation: for the remaining $l$, update
  $\theta^i_{l,t+1} = \theta^i_{l,t} + \epsilon_t\big(c^{i,k}(\theta^i_{l,t},\Theta_{-i,t}) - \nu_l\big)$, using the k-estimate $c^{i,k}(\theta^i_{l,t},\Theta_{-i,t})$.
Step 3: Check whether the iterates have converged (terminate if yes); else set $t = t+1$ and go to Step 2.
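A compact Python rendition of Algorithm 7.1 might look as follows. It reuses the k-estimate `c_hat` sketched above, clips the stochastic-approximation iterates to $[0,T]$, and uses a simple step size $\epsilon_t = 1/(t+1)$ together with a fixed-tolerance test for Step 3; these particular choices, and all names, are illustrative rather than the authors' exact implementation.

```python
def ne_algorithm(M, n, nu, T, beta, w, c, Theta0, n_samples=1000,
                 max_iter=500, tol=1e-3):
    """Sketch of Algorithm 7.1: fictitious-play / stochastic-approximation
    iteration towards the NE conditions (6.8)."""
    nus = [nu / (M - k + 1) for k in range(1, M + 1)]   # nu_k for k = 1..M
    Theta = [list(th) for th in Theta0]
    for t in range(max_iter):
        eps = 1.0 / (t + 1)                             # SA step size eps_t
        new = [list(th) for th in Theta]
        for i in range(n):
            cT = c_hat(i, T, Theta, beta, M, c[i], w, n_samples)
            for l in range(M):
                if cT > nus[l]:                         # Step 2(i)
                    new[i][l] = T
                elif c[i] * w[1] < nus[l]:              # Step 2(ii)
                    new[i][l] = 0.0
                else:                                   # Step 2(iii): SA update
                    ck = c_hat(i, Theta[i][l], Theta, beta, M, c[i], w, n_samples)
                    new[i][l] = min(T, max(0.0, Theta[i][l] + eps * (ck - nus[l])))
        if max(abs(new[i][l] - Theta[i][l])
               for i in range(n) for l in range(M)) < tol:
            return new                                  # Step 3: converged
        Theta = new
    return Theta
```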

Table 1: Identical players with M = 2.

 n    θ^{i*}_1   θ^{i*}_2
 2    5          0.52
 3    5          0.24
 4*   0.69       0.16
 5*   0.44       0.11
 6*   0.33       0.09

[Figure 1: NE as a function of $\beta^2 = \beta$. The NE components that remain fixed are $\theta^{3*}_1 = T = 5$ and $(\theta^{1*}_3, \theta^{2*}_3, \theta^{3*}_3) = (0,0,0)$.]

For the interior-first-threshold cases we also observe multiple NE, as in (6.11). For example, $\Theta^{1*} = (\theta, 0.16)$ and $\Theta^{i*} = (0.69, 0.16)$ for every $i \ge 2$ is an NE for any $\theta \in [0.69, 5]$ when $n = 4$. Further, the NE thresholds decrease as $n$ increases: when $\Theta_{-i}$ is appended with an arbitrary strategy of an extra player, $c^i(s;\cdot)$ clearly decreases, and this probably explains the above.

In another example, provided in Figure 1, we consider three asymmetric players ($n=3$) and three locks ($M=3$). The reward is fixed at $c^i = 1$ for all $i$, and we set $T=5$, $\nu=1$ and $(w_1,w_2,w_3) = (0.7, 0.2, 0.1)$. We fix $(\beta^1,\beta^3) = (0.1, 2)$ and vary $\beta^2$ from 0.1 to 3. We observe that four components of the NE remain fixed (given in the caption), while the remaining five components vary with $\beta^2$ and are plotted in the two sub-figures. The NE satisfy Theorem 6.3. Further, as anticipated, with an increase in $\beta^2$, the components of the second player increase (triangles in the figure), while the components of the third player decrease (diamonds in the left sub-figure). But, interestingly, some NE components of the weakest player, player 1 (circles in both sub-figures), exhibit non-monotone behaviour: the first component $\theta^{1*}_1$ initially decreases (as $\beta^2$ increases), then increases when $\beta^2$ is near $\beta^3 = 2$, and then decreases again. Further, $\theta^{2*}_1$ jumps to $T$ when $\beta^2$ is increased slightly beyond $\beta^1 = 0.1$ (left sub-figure), from $\theta^{2*}_1 = \theta^{1*}_1 = 1.95$.

In the next example, tabulated in Table 2, we consider three asymmetric players ($n=3$) and vary the number of locks from 2 to 5. The rewards are $c^1 = 1.33$, $c^2 = 1.66$ and $c^3 = 2$; the other parameters are $\beta^1 = 0.1$, $\beta^2 = 0.2$, $\beta^3 = 0.3$, $T=5$, $w_1 = 0.7$, $w_2 = 0.2$, $w_3 = 0.1$ and $\nu = 1$.

Table 2: n = 3, varying number of locks M.

 M   i   θ^{i*}_1   θ^{i*}_2   θ^{i*}_3   θ^{i*}_4   θ^{i*}_5
 2   1   1.04       0          -          -          -
 2   2   5          1.32       -          -          -
 2   3   5          5          -          -          -
 3   1   2.67       1.03       0          -          -
 3   2   5          5          1.26       -          -
 3   3   5          5          5          -          -
 4   1   3.95       2.61       1.03       0          -
 4   2   5          5          5          1.25       -
 4   3   5          5          5          5          -
 5   1   5          3.89       2.57       1.02       0
 5   2   5          5          5          5          1.25
 5   3   5          5          5          5          4.34

Interestingly, from the table, the last non-zero thresholds of the weaker players (smaller $c^i$) are more or less the same: it appears that the best response with one or two left-over locks is unperturbed by $M$. On the other hand, the strongest player tries till the end for all the locks when $M$ is small; but as the number of locks $M$ increases, its last threshold becomes interior. Further, the NE thresholds are clearly monotone in $M$, i.e., $\theta^{i*}_k$ is non-decreasing in $M$ for any $i, k$.

8. Extensions with random customers. We now discuss possible extensions. Say the number of potential customers $M_1$ is unknown, and let $(M, d_1)$ describe its distribution at the start, i.e., $P(M_1 = m) = d_1(m)$ for any $m \le M$.



The rest of the information structure and the decision epochs remain the same. After $(k-1)$ contacts, the (local) state is $z^i_k = (l_k, \tau^i_{k-1})$, now with $l_k := (M-k+1, d_k)$; here $d_k$ is the distribution of the number of left-over locks $M_k$, with $d_k(m) = P(M_k = m)$ for any $m \le M-k+1$; this distribution can be derived from $d_{k-1}$ using elementary tools. The actions and strategies remain the same.

Controlled transitions: For any state $z^i_k = (l_k,\tau^i_{k-1})$ and action $a^i_k(\cdot)$, the distribution of the time of contact $\tau^i_k$ and the example-related event of (2.2) modify to:

$$f_{l_k}(\cdot) = \sum_{m=1}^{M-k+1} d_k(m)\, m\, a(\cdot)\, e^{-m\bar a(\cdot)}, \quad P(\tau^i_k \ge T; a^i_k) = 1 - \int_{\tau^i_{k-1}}^T f_{l_k}(t)\,dt = \sum_{m=1}^{M-k+1} d_k(m)\, e^{-m\bar a(T)}. \tag{8.1}$$
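For simulation purposes, the mixture form of (8.1) suggests a direct sampler: draw the left-over count $m$ from $d_k$, then a contact time with hazard $m\,a(t)$. A small sketch (ours) under an MT strategy $a(t) = \beta X_{\{t\le\theta\}}$, with illustrative names:

```python
import random

def sample_tau(d_k, beta, theta, tau_prev, T):
    """Sample the next contact time from the mixture in (8.1);
    returns None for the event {tau_k >= T} (no contact by the deadline).
    d_k: dict m -> P(M_k = m) over left-over lock counts.
    Assumes an MT strategy, i.e. rate beta on [tau_prev, theta], zero after."""
    m = random.choices(list(d_k), weights=list(d_k.values()))[0]
    if m == 0:
        return None                      # no locks left to contact
    t = tau_prev + random.expovariate(m * beta)
    return t if t <= min(theta, T) else None
```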

Rewards and cost: The expected reward of player $i$ upon the $k$-th contact, conditioned on $\tau^i_k \approx s$, is given by (6.1), with the expression for $P^i_m(\cdot,\cdot)$ as in (B.1)-(B.3) of Appendix B, now defined using (8.1) in place of (2.2). Again, $c^i_k(\cdot;\pi_{-i})$ does not depend on $k$; observe that $P^i_m(\cdot,\cdot)$ depends only on $\pi_{-i}$. In all, the expected reward of any player $i$, conditioned on its state $z^i_k = (l_k,\tau^i_{k-1})$ and action $a^i_k(\cdot)$, is given by:

$$E_{z^i_k}\big[c^i(T\wedge\tau^i_k);\pi_{-i}\big] = \int_{\tau^i_{k-1}}^{T} \Big(c^i(s;\pi_{-i}) \sum_{m=1}^{M-k+1} d_k(m)\, m\, a(s)\, e^{-m\bar a(s)}\Big)\,ds. \tag{8.2}$$

This problem can be formulated as in (2.5), with $r^i_k$ defined exactly as in (2.3)-(2.4) using (6.1) and (8.2). Since $\mathcal{L}_k = \{(M-k+1,d_k)\}$ is a singleton and there is no $q(\cdot)$, A.1 is satisfied. Further, A.2 can be proved using arguments similar to those for the ASN problem, together with Lemma B.1.
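As an aside, the elementary belief update mentioned at the beginning of this section can be made explicit; the following is our sketch, not spelled out in the paper. Conditioning on the observed $k$-th contact time $\tau^i_k = \tau$, Bayes' rule applied to the mixture form (8.1) gives the posterior over $M_k$, and the shift $M_{k+1} = M_k - 1$ (one lock is taken by the contact) then yields

$$d_{k+1}(m) = \frac{d_k(m+1)\,(m+1)\, e^{-(m+1)\bar a(\tau)}}{\sum_{m'} d_k(m')\, m'\, e^{-m'\bar a(\tau)}},$$

where the common likelihood factor $a(\tau)$ has been cancelled.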

Since the structure of this problem is different, we cannot directly apply Theorem 2.1. We prove Theorem 2.1 for this case as follows: a) Theorem 3.1 and Lemma 3.2 follow in exactly the same manner; b) for Lemma 3.3, we do not have Lemma A.1. Instead, we directly apply the dynamic programming principle to (3.6):

$$u(t,0) = \sup_{a_k\in L^\infty[t,\tau]} \left\{\int_t^\tau \big(h_k(s) - \nu x(s)\big) f_{l_k}(s)\,ds + u(\tau, x(\tau))\right\}, \ \text{ where } \ u(\tau,x(\tau)) = \sup_{a_k\in L^\infty[\tau,T]} J(\tau, x(\tau), a^i_k).$$

Consider $a(\cdot) \in L^\infty[t,\tau]$ with $a \equiv 0$. Under this policy, the above objective function exactly equals $u(\tau,0)$, and so $v^i_k((l_k,t);\pi_{-i}) \ge v^i_k((l_k,\tau);\pi_{-i})$ for $t \le \tau$; since $\mathcal{L}_k$ is a singleton, the rest of the proof of Lemma 3.3 follows.

Now, using exactly the same arguments as in subsection 3.1, Theorem 2.1 holds even for random customers; thus threshold policies are optimal. Further study of this game is an interesting direction for future work. One can consider other such interesting extensions, for example, when agents are not fully aware of their own contacts, say they get to know about their contacts only with probability $p$. One can also consider the case where customers are added randomly, and so on.


9. Conclusions. We consider a stochastic game with partial information. The system state is influenced by the actions of every agent in pursuit of higher individual rewards, but only a partial component of the state is visible to any player. The aim of each agent is to determine an 'optimal policy' that depends only on the information available to it. This leads to a non-classical game with partial and asymmetric information.

We propose a set of strategies made up of open-loop policies (one for each local state/information); at every (asynchronous) update epoch the agent chooses one of the open-loop policies depending on its local information, and the policy lasts till the end if there is no further update. These policies facilitate structured estimation of beliefs.

We derive structural results for best-response policies under certain assumptions, using standard tools of MDPs and optimal control theory; a special type of threshold policy forms the best response. These results reduce the infinite-dimensional game to a finite one, such that a Nash equilibrium (NE) of the reduced game is an NE of the original game. We consider two applications that satisfy the assumptions and provide a simpler characterization of the NE; we also obtain closed-form expressions for some special cases. Further, we provide an algorithm to compute the NE using stochastic approximation and fictitious play, and present some numerical examples.

In particular, we consider advertising over a network where the customers are hidden and the content providers (CPs) are not aware of their availability. CPs try to contact the customers through advertising in an attempt to acquire them, unaware of the contact status of their opponents. The first result is the existence of an NE within a significantly simplified strategy space; a strategy is represented by one single threshold for each value of left-over customers (irrespective of the previous contact epoch). Further, these thresholds are decreasing in the number of left-over opportunities.

We derived more insights into the advertising problem with the help of numerical examples. With the remaining parameters fixed, the thresholds at NE decrease as the number of players increases and, in the majority of cases, increase as the number of potential customers increases. However, we also observe non-monotonicity in the thresholds, especially those of the weaker players, as the strength of the stronger players increases. We briefly discussed an extension to the case where the number of potential customers is random; we established the existence of contact-time-dependent threshold policies among the best responses.

REFERENCES

[1] E. Altman et al., "A stochastic game approach for competition over popularity in social networks," Dynamic Games and Applications, vol. 3, no. 2, pp. 313-323, 2013.
[2] T. Basar and J. B. Cruz Jr., "Concepts and methods in multi-person coordination and control," Decision and Control Laboratory, University of Illinois at Urbana, Tech. Rep., 1981.
[3] V. Kavitha, M. Maheshwari, and E. Altman, "Acquisition games with partial-asymmetric information," Allerton 2019, USA; also available at https://arxiv.org/abs/1909.06633.
[4] V. Singh and V. Kavitha, "Asymmetric information acquisition games," in 2020 59th IEEE Conference on Decision and Control (CDC), Dec. 2020, pp. 331-336.
[5] V. Tzoumas, C. Amanatidis, and E. Markakis, "A game-theoretic analysis of a competitive diffusion process over social networks," International Workshop on Internet and Network Economics, Springer, Berlin, Heidelberg, 2012.
[6] K. Bimpikis, A. Ozdaglar, and E. Yildiz, "Competitive targeted advertising over networks," Operations Research, vol. 64, no. 3, pp. 705-720, 2016.
[7] G. C. Chasparis and J. S. Shamma, "Control of preferences in social networks," 49th IEEE Conference on Decision and Control (CDC), IEEE, 2010.
[8] E. Altman et al., "Dynamic games for analyzing competition in the Internet and in on-line social networks," International Conference on Network Games, Control, and Optimization, Birkhauser, Cham, 2016.
[9] K. Touya et al., "A game theoretic approach for competition over visibility in social networks," Bulletin of Electrical Engineering and Informatics, vol. 8, no. 2, pp. 674-682, 2019.
[10] E. Altman et al., "Competition over timeline in social networks," 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), IEEE, 2013.
[11] R. Dhounchak, V. Kavitha, and Y. Hayel, "To participate or not in a coalition in adversarial games," Network Games, Control, and Optimization, Birkhauser, Cham, 2019, pp. 125-144.
[12] K. Hinderer, Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter, Springer, Berlin, Heidelberg, 1970, pp. 78-83.
[13] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.
[14] W. H. Fleming and H. M. Soner, Controlled Markov Processes and Viscosity Solutions, vol. 25, Springer Science & Business Media, 2006.
[15] V. Singh and V. Kavitha, "Asymmetric information acquisition games," arXiv preprint arXiv:2009.02053, 2020.
[16] D. Fudenberg et al., The Theory of Learning in Games, vol. 2, MIT Press, 1998.
[17] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48, Springer, 2009.
[18] E. Roxin, "The existence of optimal controls," The Michigan Mathematical Journal, vol. 9, no. 2, pp. 109-119, 1962.
[19] R. K. Sundaram, A First Course in Optimization Theory, Cambridge University Press, 1996.

Appendix A. Proofs related to Theorems 2.1 and 2.2.

Proof of Theorem 3.1: The control problem (3.6)-(3.7) for any $k$ can be rewritten as:

$$u(s,x) = \sup_{a(\cdot)} \int_s^T L_k(s', x(s'), a(s'))\,ds' + g(x(T)), \ \text{ with}$$
$$L_k(s,x,a) = \Big(c^i_k(s;\pi_{-i}) + \sum_{l_{k+1}\in\mathcal{L}_{k+1}} v^i_{k+1}((l_{k+1},s);\pi_{-i})\, P(l_{k+1}|l_k,s,\pi_{-i}) - \nu x\Big) f_{q(l_k)}(s)$$

and $g(x) = -\nu x\, e^{-q(l_k)x}$. For $k = M$, $v^i_{M+1}$ is defined to be 0, and

$$L_M(s,x,a) = \big(c^i_M(s;\pi_{-i}) - \nu x\big) f_{q(l_k)}(s).$$

Observe that $x$ can be confined to the range $[0,\beta T]$. Then $L_M(\cdot)$ is bounded on $[0,T]\times[0,\beta T]\times[0,\beta]$ and is also Lipschitz continuous by assumption A.2. Further, the RHS of the ODE (3.7) and the terminal cost $g$ are clearly bounded and Lipschitz continuous. Thus, by [14, Theorem 10.1 and the following Remark 10.1, Chapter 2], the value function $u(\cdot,\cdot)$ is the unique viscosity solution, and it is Lipschitz continuous, for $k = M$. This implies $v^i_M((l,s);\pi_{-i})$ is Lipschitz continuous in $s$; it is also bounded (see (3.5)). Assume this holds for $k+1, k+2, \ldots, M$. Then, for $k$, by the induction hypothesis $v_{k+1}$ is Lipschitz continuous, and by A.1 and A.2 $L_k(\cdot)$ is bounded on $[0,T]\times[0,\beta T]\times[0,\beta]$ and Lipschitz continuous. Again, the RHS of the ODE (3.7) and the terminal cost $g$ are bounded and Lipschitz continuous; we again use the results for the optimal control problem (3.5) defining the $k$-th stage DP equation. Thus, by the same result, the value function $u(\cdot,\cdot)$ is the unique viscosity solution, which is Lipschitz continuous. By backward induction on $k$, parts (i) and (ii) are true.

For part (iii), we apply the results of [18]. Towards this, the optimal control problem can be converted into Mayer type (a finite horizon problem with only a terminal cost) by the usual technique of augmenting the state with a new component representing

$$y(s') := \int_s^{s'} L_k(\tilde s, x(\tilde s), a(\tilde s))\,d\tilde s,$$

and equivalently maximizing $y(T) + g(x(T))$. By part (ii), all the required assumptions [18, Assumptions (i) to (vii)] are satisfied, with compact control space $U = [0,\beta]$ and compact state space $X = [0,\beta T]$ (it is easy to verify that the state variable can be confined to this range): assumptions (i)-(ii) are trivially satisfied; assumption (iii) follows by part (i); for assumption (iv), one can actually bound by a constant independent of $(s',(x,y))$; and the convexity requirement of (vii) is easy to verify, because for any given $(s',(x,y))$ the set in question is an interval. This proves part (iii). $\blacksquare$

Proof of Lemma 3.2: For any stage $k$ and any state $z^i_k$, consider the open-loop policy $a^i_k(\cdot)$, briefly represented as $a(\cdot)$. If $a(\cdot)$ is already of threshold type, there is nothing to prove. If not, choose two intervals $[t_1, t_1+\delta_1]$ and $[t_2, t_2+\delta_2]$, with $t_2 \ge t_1+\delta_1$, such that $\int_{t_1}^{t_1+\delta_1} a(t)\,dt < \beta\delta_1$ and $\int_{t_2}^{t_2+\delta_2} a(t)\,dt > 0$; further, by choosing appropriate end points, $\int_{t_1}^{t_1+\delta_1} a(t)\,dt + \int_{t_2}^{t_2+\delta_2} a(t)\,dt = \beta\delta_1$.

Now construct a policy $a'(t)$ such that $\int_{t_1}^{t_1+\delta_1} a'(t)\,dt = \beta\delta_1$ and $\int_{t_2}^{t_2+\delta_2} a'(t)\,dt = 0$, and on the rest of the intervals $a'(\cdot)$ matches $a(\cdot)$ completely. This new policy is constructed by shifting mass from the later interval $[t_2, t_2+\delta_2]$ to the former interval $[t_1, t_1+\delta_1]$ of the policy $a(\cdot)$; note that if one cannot find such intervals, the policy $a(\cdot)$ itself is a threshold policy and there is nothing to prove.

Observe that for all $t < t_1$ we have $a(t) = a'(t)$, and hence $\bar a(t) = \int_{\tau_{k-1}}^t a(s)\,ds = \bar a'(t)$. By similar logic, $\bar a(t) < \bar a'(t)$ for any $t \in (t_1, t_2]$; $\bar a(t) \le \bar a'(t)$ for any $t \in (t_2, t_2+\delta_2]$; and $\bar a(t) = \bar a'(t)$ for all $t > t_2+\delta_2$.

This implies that the time to transition to the next stage under $a'(\cdot)$, denoted $\tau_{a'}$, is stochastically dominated by that under $a(\cdot)$, denoted $\tau_a$ (the contact occurs stochastically earlier under $a'$), as explained below. Consider the CDFs under both policies: i) for any $x < t_1$,

$$F_a(x) = \text{Prob}(\tau_a \le x) = 1 - e^{-q(l_k)\bar a(x)} = 1 - e^{-q(l_k)\bar a'(x)} = F_{a'}(x);$$

ii) for any $x \in (t_1, t_2]$, we have $\bar a(x) < \bar a'(x)$ and so $F_a(x) < F_{a'}(x)$; iii) for any $x \in (t_2, t_2+\delta_2)$, we have $\bar a(x) \le \bar a'(x)$ and so $F_a(x) \le F_{a'}(x)$; and iv) for any $x \in [t_2+\delta_2, T]$, we have $\bar a(x) = \bar a'(x)$ and so $F_a(x) = F_{a'}(x)$. This proves the required stochastic dominance, $\tau_{a'} \stackrel{d}{\le} \tau_a$.

Further, observe that the expected cost under policy $a(\cdot)$ satisfies (by a change of variables, and using $\bar a(\tau_{k-1}) = 0$):

$$E[\bar a(\tau_a); \tau_a < T] + \bar a(T)\, e^{-q(l_k)\bar a(T)} = \int_{\tau_{k-1}}^T \bar a(t)\, q(l_k)\, a(t)\, e^{-q(l_k)\bar a(t)}\,dt + \bar a(T)\, e^{-q(l_k)\bar a(T)}$$
$$= \int_0^{\bar a(T)} q(l_k)\, x\, e^{-q(l_k)x}\,dx + \bar a(T)\, e^{-q(l_k)\bar a(T)} = \frac{1}{q(l_k)}\Big(1 - e^{-q(l_k)\bar a(T)}\Big),$$

which is the same as that under $a'$, because $\bar a(T) = \bar a'(T)$. One can keep improving the policy in this manner until it becomes a threshold policy. This completes the proof. $\blacksquare$
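The mass-shifting step above is easy to visualize numerically. The sketch below (ours, with illustrative numbers) builds such a pair $(a, a')$ on a grid and checks that the contact-time CDF under $a'$ dominates pointwise, while $\bar a(T) = \bar a'(T)$ keeps the expected cost unchanged:

```python
import numpy as np

q, T = 2.0, 1.0
ts = np.linspace(0.0, T, 1001)
dt = ts[1] - ts[0]

# a(.) places its mass late; a'(.) shifts the same mass to an earlier interval
a  = np.where((ts >= 0.6) & (ts <= 0.8), 1.0, 0.0)
a2 = np.where((ts >= 0.2) & (ts <= 0.4), 1.0, 0.0)

A, A2 = np.cumsum(a) * dt, np.cumsum(a2) * dt        # bar a(t), bar a'(t)
F, F2 = 1 - np.exp(-q * A), 1 - np.exp(-q * A2)      # contact-time CDFs

assert np.all(F2 >= F - 1e-12)       # F_{a'} >= F_a: contact occurs earlier
assert abs(A[-1] - A2[-1]) < 1e-9    # bar a(T) = bar a'(T): identical cost
```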

Proof of Lemma 3.3: We prove this in two steps; for brevity, we drop $\pi_{-i}$ and $i$.

Step 1: We prove $v_k(l_k,t) \ge v_k(l_k,\tau)$ if $\tau \ge t$. For any stage $k$ and state $z_k = (l_k,t)$, the $k$-th stage DP equation equals the following optimal control problem (see (3.6)):

$$u(t,0) = \sup_{a(\cdot)\in L^\infty} \int_t^T \big(h_k(s') - \nu x(s')\big)\, q(l_k)\, a(s')\, e^{-q(l_k)x(s')}\,ds' + g(x(T)).$$

By the dynamic programming principle [14, Theorem 5.1], applied to this optimal control problem, we can rewrite it as:

$$u(t,0) = \sup_{a_k\in L^\infty[t,\tau]} \left\{\int_t^\tau \big(h_k(s)-\nu x(s)\big)\, q(l_k)\, a_k(s)\, e^{-q(l_k)x(s)}\,ds + u(\tau, x(\tau))\right\},$$
$$\text{where} \quad u(\tau,x(\tau)) = \sup_{a_k\in L^\infty[\tau,T]} J(\tau, x(\tau), a^i_k).$$

Observe that the function $J(\tau, x(\tau), a_k)$ (see equation (3.6)) has the same structure as the function $J_h(t,x,a)$ of Lemma A.1, with $h = h_k$, $t = \tau$, $x = x(\tau)$ and $a = a_k$; continuity follows from Theorem 3.1. Hence we have

$$u(\tau, x(\tau)) = e^{-q(l_k)x(\tau)}\big[u(\tau,0) - \nu x(\tau)\big], \ \text{ and so}$$
$$u(t,0) = \sup_{a_k\in L^\infty[t,\tau]} \left\{\int_t^\tau \big(h_k(s)-\nu x(s)\big)\, q(l_k)\, a_k(s)\, e^{-q(l_k)x(s)}\,ds + e^{-q(l_k)x(\tau)}\big[u(\tau,0)-\nu x(\tau)\big]\right\}.$$

In the above, under the zero policy ($a_k(s) = 0$ for all $s \in [t,\tau]$), we have $x(\tau) = 0$ and the first (integral) term inside the supremum is zero; hence $u(t,0) \ge u(\tau,0)$. Thus $v_k(z_k) \ge v_k(\tilde z_k)$, as the optimal control value $u(\tau,0)$ corresponds to the state $\tilde z_k = (l_k,\tau)$.

Step 2: We prove $v^i_k((l,t);\pi_{-i}) \ge v^i_k((l',t);\pi_{-i})$ if $l \ge l'$. Let $q(l) = q$, $q(l') = q'$, with the respective transition times $\Omega \sim f_q$ and $\Omega' \sim f_{q'}$. By A.1, $\Omega \stackrel{d}{\le} \Omega'$. Begin the backward induction with $M$, $z_M = (l,t)$ and $z'_M = (l',t)$. For any policy $a$, from (3.3)-(3.4),

$$J_M(z_M,a) = \int_t^T \big(c_M(s) - \nu\bar a(s)\big) f_q(s)\,ds - \nu\bar a(T)\, e^{-q\bar a(T)}$$
$$\ge \int_t^T \big(c_M(s)-\nu\bar a(s)\big) f_{q'}(s)\,ds - \nu\bar a(T)\, e^{-q'\bar a(T)} = J_M(z'_M, a) \quad\big(\text{since } \Omega \stackrel{d}{\le} \Omega'\big).$$

Thus, from (3.2), $v_M(l,t) \ge v_M(l',t)$. Assume the claim for $M,\ldots,k+1$ and consider $k$, $z_k = (l,t)$ and $z'_k = (l',t)$. Fix a policy $a$; then $\sum_{\tilde l} v_{k+1}(\tilde l,s)\, P(\tilde l\,|\,l,s) \ge \sum_{\tilde l} v_{k+1}(\tilde l,s)\, P(\tilde l\,|\,l',s)$ for any $s$, by A.1 and the induction hypothesis. Thus, from (3.3)-(3.4) (since $\Omega \stackrel{d}{\le} \Omega'$),

$$J_k(z_k,a) \ge \int_t^T \Big(c_k(s) + \sum_{\tilde l\in\mathcal{L}_{k+1}} v_{k+1}(\tilde l,s)\, P(\tilde l\,|\,l',s) - \nu\bar a(s)\Big) f_q(s)\,ds - \nu\bar a(T)\, e^{-q\bar a(T)}$$
$$\ge \int_t^T \Big(c_k(s) + \sum_{\tilde l\in\mathcal{L}_{k+1}} v_{k+1}(\tilde l,s)\, P(\tilde l\,|\,l',s) - \nu\bar a(s)\Big) f_{q'}(s)\,ds - \nu\bar a(T)\, e^{-q'\bar a(T)} = J_k(z'_k,a).$$

Thus, from (3.2), $v_k(l,t) \ge v_k(l',t)$. Combining the two steps, we have the result. $\blacksquare$

Lemma A.1. Let $J_h(t,x,a)$ be a function of the form

$$J_h(t,x,a) = \int_t^T \big(h(s) - \nu x(s)\big)\, q\, a(s)\, e^{-qx(s)}\,ds - \nu x(T)\, e^{-qx(T)},$$

defined using a continuous function $h(\cdot)$ and the state process $\dot x(s) = a(s)$ with initial condition $x(t) = x$. Define $u(t,x) := \sup_{a\in L^\infty} J_h(t,x,a)$. Then:
(i) $J_h(t,x,a) = e^{-qx}\big[J_h(t,0,a) - \nu x\big]$ and $u(t,x) = e^{-qx}\big[u(t,0) - \nu x\big]$;
(ii) the optimal policy $a^*(\cdot)$ is independent of $x$.

Proof: By the change of variable $x(s) = x + \tilde x(s)$, we have $\dot{\tilde x}(s) = a(s)$, i.e., $\tilde x(s) = \int_t^s a(s')\,ds'$ with $\tilde x(t) = 0$, and hence

$$J_h(t,x,a) = \int_t^T \big(h(s) - \nu(x+\tilde x(s))\big)\, q\, a(s)\, e^{-q(x+\tilde x(s))}\,ds - \nu\big(x+\tilde x(T)\big) e^{-q(x+\tilde x(T))}$$
$$= e^{-qx}\left(\int_t^T \big(h(s)-\nu\tilde x(s)\big)\, q\, a(s)\, e^{-q\tilde x(s)}\,ds - \nu\tilde x(T)\, e^{-q\tilde x(T)} - \nu x\int_t^T q\, a(s)\, e^{-q\tilde x(s)}\,ds - \nu x\, e^{-q\tilde x(T)}\right)$$
$$= e^{-qx}\big(J_h(t,0,a) - \nu x\big), \ \text{ because } \ \int_t^T q\, a(s)\, e^{-q\tilde x(s)}\,ds + e^{-q\tilde x(T)} = 1$$

(the integrand is the derivative of $-e^{-q\tilde x(s)}$, so the integral telescopes). This proves the first claim of part (i). Taking the supremum over $a \in L^\infty$,

$$u(t,x) = \sup_{a\in L^\infty} e^{-qx}\big[J_h(t,0,a) - \nu x\big] = e^{-qx}\Big[\sup_{a\in L^\infty} J_h(t,0,a) - \nu x\Big] = e^{-qx}\big[u(t,0) - \nu x\big],$$

which completes part (i). Further, it is clear from the above that the optimal policy $a^*(\cdot)$ remains the same for all initial conditions $x$, which proves part (ii). $\blacksquare$

Appendix B. Proofs related to section 6.

Lemma B.1. The probability $P^i_m(s;\pi_{-i})$ is Lipschitz continuous in $s$.

Proof: Define $Q^j(s;\pi^j)$ to be the probability that agent $j$ (among the opponents) has contacted the particular lock $l$ by time $s$, using its own strategy $\pi^j$ contained in $\pi_{-i}$. Then,

$$P^i_m(s;\pi_{-i}) = \sum_{\mathcal{J}:\, |\mathcal{J}|=m-1,\, i\notin\mathcal{J}} \Big(\prod_{j\in\mathcal{J}} Q^j(s;\pi^j)\Big)\Big(\prod_{j\notin\mathcal{J},\, j\ne i} \big(1 - Q^j(s;\pi^j)\big)\Big). \tag{B.1}$$

Observe that for any agent $j$,

$$Q^j(s;\pi^j) = \sum_{k=1}^{M} \frac{k}{M}\, Q^j(s;k,\pi^j), \tag{B.2}$$

where $k/M$ is the probability that $l$ is among the $k$ contacted locks, and $Q^j(s;k,\pi^j)$ is the probability that $j$ has contacted exactly $k$ locks by time $s$, given by:

$$Q^j(s;k,\pi^j) = \int_0^s f_M(t_1)\int_{t_1}^s \cdots \int_{t_{k-1}}^s f_{M-k+1}(t_k)\, e^{-(M-k)\bar a^j_{k+1}(s)}\, dt_k\ldots dt_1, \tag{B.3}$$

where $f_{M-l+1}(\cdot) := (M-l+1)\, a^j_l(\cdot)\, e^{-(M-l+1)\bar a^j_l(\cdot)}$ is the density of the $l$-th contact time, and $\bar a^j_l(t) = \int_{t_{l-1}}^t a^j_l(t')\,dt'$ (see (2.2)). By the integral definitions given in equations (B.1)-(B.3) and boundedness (e.g., by $\beta^j$), the function $P^i_m(\cdot;\pi_{-i})$ is Lipschitz continuous in $s$. $\blacksquare$

Lemma B.2. For any strategies $\Theta_{-i}$ and $\Theta'_{-i}$ of the opponents, if the time $s$ satisfies the following for some $M'$,

$$\min_{j\ne i}\theta^j_m \ge s \ \text{ and } \ \min_{j\ne i}\theta'^j_m \ge s \ \text{ for all locks } m < M',$$

and $\theta^j_m = \theta'^j_m$ for all $m \ge M'$, then $c^i(t;\Theta_{-i}) = c^i(t;\Theta'_{-i})$ for all $t \le s$.

Proof: From equation (B.3), the probability that any agent $j$ has contacted $k$ locks by time $t$ using strategy $\Theta^j$ of $\Theta_{-i}$, namely $Q^j(t;k,\Theta_{-i})$, is the same as that using $\Theta'^j$ of $\Theta'_{-i}$, for all $t \le s$. Basically, both opponent strategies are the same till $s$; they differ only after $s$. Then $Q^j(t;k,\Theta_{-i}) = Q^j(t;k,\Theta'_{-i})$ for all $j$ and all $t \le s$. Using (B.2), (B.1) and (6.1), we have the result. $\blacksquare$

Proof of Lemma 6.1: For ease of notation, in this proof we denote $c^i(t;\Theta_{-i})$ by $c(t)$ and $v^i_k((M-k+1,t);\Theta_{-i})$ by $v_k(t)$. Let $\Omega_k$ represent a random variable distributed according to the density $f_k$. By the DP equations (6.5)-(6.6), now with $\tau^i_k = t$ for all $k$:

$$v_{k+1}(t) = \sup_{\theta\in[t,T]} \int_0^{\theta-t} \big(c(s+t) - \nu_{k+1} + v_{k+2}(s+t)\big) f_{M-k}(s)\,ds = \sup_{\theta\in[t,T]} E\big[g_{k+1}(\Omega_{M-k};\theta)\big],$$
$$\text{with} \quad g_k(s;\theta) := \big(c(s+t) - \nu_k + v_{k+1}(s+t)\, X_{\{k<M\}}\big)\, X_{\{s<\theta-t\}}. \tag{B.4}$$

Observe that for $k = M$, $v_M(t) = \sup_{\theta\in[t,T]} \int_0^{\theta-t} \big(c(s+t) - \nu_M\big) f_1(s)\,ds = \sup_{\theta\in[t,T]} E[g_M(\Omega_1;\theta)]$.

We prove the result using backward induction. As $v_M(s) \ge 0$, we have for $k = M-1$ and all $s,\theta$: $g_{M-1}(s;\theta) = \big(c(s+t)-\nu_{M-1}+v_M(s+t)\big) X_{\{s<\theta-t\}} \ge \big(c(s+t)-\nu_M\big) X_{\{s<\theta-t\}} = g_M(s;\theta)$. Further, $\Omega_1 \stackrel{d}{\ge} \Omega_2$ (stochastic dominance) and $g_M(\cdot)$ is decreasing by A.2 (used for inequality $a$), and so

$$v_{M-1}(t) = \sup_{\theta\in[t,T]} E[g_{M-1}(\Omega_2)] \ \ge\ \sup_{\theta\in[t,T]} E[g_M(\Omega_2)] \ \stackrel{a}{\ge}\ \sup_{\theta\in[t,T]} E[g_M(\Omega_1)] = v_M(t).$$

Assume $v_{l+1}(s) \ge v_{l+2}(s)$ for all $s$ and all $l \ge k$. Then, from (B.4), $g_k(s;\theta) \ge g_{k+1}(s;\theta)$ for all $s,\theta$, and

$$v_k(t) = \sup_{\theta\in[t,T]} E[g_k(\Omega_{M-k+1})] \ \ge\ \sup_{\theta\in[t,T]} E[g_{k+1}(\Omega_{M-k+1})] \ \stackrel{a}{\ge}\ \sup_{\theta\in[t,T]} E[g_{k+1}(\Omega_{M-k})] = v_{k+1}(t).$$

Inequality $a$ holds because $\Omega_{M-k} \stackrel{d}{\ge} \Omega_{M-k+1}$ and the function $g_{k+1}(\cdot)$ is decreasing, again by A.2 ($c(\cdot)$ is decreasing) and also using Lemma 3.3 ($v_{k+2}(\cdot)$ is decreasing). $\blacksquare$

Proof of Theorem 6.2: Fix the opponents' MT strategy $\Theta_{-i}$ and let $\nu_k := \frac{\nu}{M-k+1}$. For brevity, here we denote $c^i(t;\Theta_{-i})$ by $c(t)$, $h^i_k(t;\Theta_{-i})$ by $h_k(t)$, $v^i_k((M-k+1,t);\Theta_{-i})$ by $v_k(t)$, and $B^i_k(\Theta_{-i})$ by $B_k$. For any $k$, if $h_k(0) \le \nu_k$, then from Lemma 6.1, $h_{k'}(0) \le \nu_{k'}$ and so $B_{k'} = \{0\}$ from (6.7), for all $k' \ge k$. Clearly, the value function at the end satisfies $v_{k+1}(T) = 0$, and so $h_k(T) = c(T)$. Thus, for any $k$, $c(T) \ge \nu_k$ implies $c(T) \ge \nu_{k'}$, and $B_{k'} = \{T\}$ (from (6.7)) for all $k' \le k$.

Consider the other $k$, i.e., those with $c(T) < \nu_k < h_k(0)$. Then clearly the BR sets are $B_k = \{t \ge 0 : c(t)+v_{k+1}(t) = \nu_k\}$ and $B_{k+1} \subset \{t \ge 0 : c(t)+v_{k+2}(t) = \nu_{k+1}\} \cup \{0\}$. So, by monotonicity, for any $t \in B_k$, $c(t) = \nu_k - v_{k+1}(t) < \nu_{k+1} - v_{k+2}(t)$. This implies $t' < t$ for any $t \in B_k$ and $t' \in B_{k+1}$, for the above (interior) type of $k$. Thus, in all, any BR strategy $\Theta^{i*}$ has $\theta^{i*}_k \ge \theta^{i*}_{k+1}$, with equality only when $\theta^{i*}_k \in \{0,T\}$. This proves part (i). Further, by backward induction, (6.7) becomes

$$B_k = \begin{cases} \{0\} & \text{if } c^i(0;\Theta_{-i}) = c^i w_1 \le \nu_k, \\ \{T\} & \text{if } c^i(T;\Theta_{-i}) > \nu_k, \\ \{t : c^i(t;\Theta_{-i}) = \nu_k\} & \text{else,} \end{cases} \tag{B.5}$$

because: a) $h_M(t) = c(t)$, so $B_k$ is given by (B.5) for $k = M$; b) assume $B_k$ is given by (B.5) for $k = M,\ldots,l+1$, and consider any $\theta_{l+1} \in B_{l+1}$; c) from part (i), $B_l \subset [\theta_{l+1},T]$, and for any $t \in [\theta_{l+1},T]$, by Lemma 3.3, $v_{l+1}(t) \le v_{l+1}(\theta_{l+1}) = 0$; hence $h_l(t) = c(t)$, and thus, by (6.7), $B_l$ is also given by (B.5). The proofs of the other parts follow from (B.5). $\blacksquare$

Proof of Theorem 6.3: From Theorem 6.2 and (B.5), one can verify that only the tuples satisfying (6.8) are NE, so the second part is done. For the existence of an NE, define the correspondence $g : [0,T]^{nM} \to [0,T]^{nM}$, $g(\cdot) = \{g^i_k(\cdot)\}_{i,k}$, with components defined as below:

$$g^i_k(\Theta) = \arg\min_{0\le s\le T}\ \big(c^i(s;\Theta_{-i}) - \nu_k\big)^2, \ \text{ for any } i,k, \ \text{ where } \Theta = (\Theta^i,\Theta_{-i}) \in [0,T]^{nM}.$$

It is easy to verify that a fixed point of this correspondence (see [19]) is an NE. By the Maximum theorem [19], $g^i_k(\cdot)$ is an upper semi-continuous correspondence (uscc) for every $i,k$; hence $g(\Theta) = \times_{i,k}\, g^i_k(\Theta)$ is also uscc. Further, $g(\Theta)$ is non-empty, compact and convex for any $\Theta \in [0,T]^{nM}$, and $[0,T]^{nM}$ is compact and convex. Hence, by Kakutani's fixed point theorem [19], $g(\cdot)$ has a fixed point. $\blacksquare$

Proof of Lemma 6.4: From Theorem 6.2, for any $i$, given that $c^i w_1 \le \nu_{k_i+1}$ for some $k_i$, any BR of agent $i$ against any strategy of the opponents contains $\theta^*_k = 0$ for all $k \ge k_i+1$; hence the same is true at any NE. Further, from (B.3), as $\beta^j \to 0$ the probabilities $Q^j(\cdot,\cdot,\cdot)$ converge to zero pointwise. This implies the convergence $P^i_m(\cdot,\cdot) \to X_{\{m=1\}}$ from (B.1). Thus, from (6.1), $c^i(T;\Theta_{-i}) \to c^i w_1$. By the continuity of the $c^i(\cdot;\Theta_{-i})$ function, there exists a $\bar\beta > 0$ such that $c^i(T;\Theta_{-i}) > \nu_{k_i-1}$ for all $\beta^j \le \bar\beta$, with $k_i$ as in the hypothesis. The proof can then be completed using Theorem 6.2. $\blacksquare$
