Houjun Liu

multiagent reasoning

simple games


  • agent \(i \in X\) the set of agents.
  • joint action space: \(A = A’ \times A^{2} \times … \times A^{k}\)
  • joint action would be one per agent \(\vec{a} = (a_{1}, …, a_{k})\)
  • joint reward function \(R(a) = R’(\vec{a}), …, R(\vec{a})\)

additional information

prisoner’s dilemma

Cooperate-1, -1-4, 0
Defect0, -4-3, -3

traveler’s dilemma

  • two people write down the price of their luggage, between 2-100
  • the lower amount gets that value plus 2
  • the higher amount gets the lower amount minus 2

joint policy agent utility

for agent number \(i\)

\begin{equation} U^{i} (\vec{\pi}) = \sum_{a \in A}^{} R^{(i)}(\vec{a}) \prod_{j}^{} \pi^{(j)}(a^{(j)}) \end{equation}

this is essentially the reward you get given you took

response model

how would other agents respond to our system?

  • \(a^{-i}\): joint action except for agent \(i\)
  • \(\vec{a} = (a^{i}, a^{-i})\),
  • \(R(a^{i}, a^{-i}) = R(\vec{a})\)


deterministic best response model for agent \(i\):

\begin{equation} \arg\max_{a^{i} \in A^{i}} U^{i}(a^{i}, \pi^{-i}) \end{equation}

where the response to agent \(a\) is deterministically selected.

For prisoner’s dilemma, this results in both parties defecting because that would maximise the utility.

softmax response

its like Softmax Method:

\begin{equation} \pi^{i}(a^{i}) \propto \exp\qty(\lambda U^{i}(a^{i}, \pi^{-1})) \end{equation}

fictitious play

play at some kind of game continuously

Dominant Strategy Equilibrium

The dominant strategy is a policy that is the best response to all other possible agent policies. Not all games have a Dominant Strategy Equilibrium, because there are games for which the best response is never invariant to others’ strategies (rock paper scissors).

Nash Equilibrium

A Nash Equilibrium is a joint policy \(\pi\) where everyone is following their best response: i.e. no one is incentive to unilaterally change from their policy. This exists for every game. In general, Nash Equilibrium is very hard to compute: it is p-pad (which is unclear relationally to np-complete).