Markov Decision Process

A MDP is a decision network whereby a sequences of actions causes a sequence of states. Each state is dependent on the action we take and the state we are in, and each utility is dependent on action taken and the state we are in.

Note that, unlike a POMDP, we know what state we are in—the observations from the states are just unclear.

constituents

\(S\): state space (assuming discrete for now, there are \(n\) states) — “minimum set of information that allows you to solve a problem”
\(A\): action space — set of things your agent can do
\(T(s’ | s,a)\): “dynamics”, state-transition model “probability that we end up in \(s’\) given \(s\) and action \(a\)”: good idea to make a table of probabilities of source vs. destination variables
\(R(s,a,s’)\): expected reward given in an action and a state (real world reward maybe stochastic)
\(\pi_{t}(s_{1:t}, a_{1:t-1})\): the policy, returning an action, a system of assigning actions based on states
- however, our past states are d-seperated from our current action given knowing the state, so really we have \(\pi_{t}(s_{t})\)

additional information

We assume policy to be exact right now.

stationary Markov Decision Process

This is a stationary Markov Decision Process because at each node \(S_{n}\), we have: \(P(S_{n+1} | A_n, S_n)\). Time is not a variable: as long as you know what state you are in, and what you did, you know the transition probability.

(that is, the set of states is not dependent on time)

calculating utility with instantaneous rewards

Because, typically, in decision networks you sum all the utilities together, you’d think that we should sum the utilities together.

finite-horizon models

We want to maximize reward over time, over a finite horizon \(n\). Therefore, we try to maximize:

\begin{equation} \sum_{t=1}^{n}r_{t} \end{equation}

this function is typically called “return”.

infinite-horizon models

If you lived forever, small positive \(r_{t}\) and large \(r_{t}\) makes no utility difference. We therefore add discounting:

\begin{equation} \sum_{t=1}^{\infty} \gamma^{t-1} r_{t} \end{equation}

where, \(\gamma \in (0,1)\)

we discount the future by some amount—an “interest rate”—reward now is better than reward in the future.

\(\gamma \to 0\): “myopic” strategies, near-sighted strategies
\(\gamma \to 1\): “non-discounting”

average return models

We don’t care about this as much:

\begin{equation} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n}r_{t} \end{equation}

but its close to infinite-horizon models with Gama close to \(1\)

Solving an MDP

You are handed or can predict \(R(s,a)\), and know all transitions

Small, Discrete State Space
Get an exact solution for \(U^{*}(s)\) (and hence \(\pi^{ *}(a, s)\)) for the problem via…
- policy iteration
- value iteration

Large, Continuous State Space
- Parameterize Policy
  Optimize \(\pi_{\theta}\) to maximize \(U(\pi_{\theta})\) using Policy Optimization methods!
  Gradient Free: lower dimension policy space
  Gradient Based Method: higher dimension policy space
  Policy Gradient
- Parameterize Value Function
  Optimize \(U_{\theta}(S)\) via global approximation or local approximation methods, then use a greedy policy on that nice and optimized value function.

You can only reason about your immediate surroundings/local reachable states

online planning

or… “you don’t know the model whatsoever”

reinforcement learning

during these cases, you never argmax over all actions; hence, its important to remember the methods to preserve Exploration and Exploitation.