A MDP is a decision network whereby a sequences of actions causes a sequence of states. Each state is dependent on the action we take and the state we are in, and each utility is dependent on action taken and the state we are in.

Note that, unlike a POMDP, we know what state we are in—the observations from the states are just unclear.

## constituents

- \(S\): state space (assuming discrete for now, there are \(n\) states) — “minimum set of information that allows you to solve a problem”
- \(A\): action space — set of things your agent can do
- \(T(s’ | s,a)\): “dynamics”, state-transition model “probability that we end up in \(s’\) given \(s\) and action \(a\)”: good idea to make a table of probabilities of source vs. destination variables
- \(R(s,a,s’)\): expected reward given in an action and a state (real world reward maybe stochastic)
- \(\pi_{t}(s_{1:t}, a_{1:t-1})\): the policy, returning an action, a system of assigning actions based on states
- however, our past states are d-seperated from our current action given knowing the state, so really we have \(\pi_{t}(s_{t})\)

## additional information

We assume policy to be exact right now.

### stationary Markov Decision Process

This is a stationary Markov Decision Process because at each node \(S_{n}\), we have: \(P(S_{n+1} | A_n, S_n)\). Time is **not** a variable: as long as you know what state you are in, and what you did, you know the transition probability.

(that is, the set of states is not dependent on time)

### calculating utility with instantaneous rewards

Because, typically, in decision networks you sum all the utilities together, you’d think that we should sum the utilities together.

#### finite-horizon models

We want to maximize reward over time, over a finite horizon \(n\). Therefore, we try to maximize:

\begin{equation} \sum_{t=1}^{n}r_{t} \end{equation}

this function is typically called “return”.

#### infinite-horizon models

If you lived forever, small positive \(r_{t}\) and large \(r_{t}\) makes no utility difference. We therefore add discounting:

\begin{equation} \sum_{t=1}^{\infty} \gamma^{t-1} r_{t} \end{equation}

where, \(\gamma \in (0,1)\)

we discount the future by some amount—an “interest rate”—reward now is better than reward in the future.

- \(\gamma \to 0\): “myopic” strategies, near-sighted strategies
- \(\gamma \to 1\): “non-discounting”

#### average return models

We don’t care about this as much:

\begin{equation} \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n}r_{t} \end{equation}

but its close to infinite-horizon models with Gama close to \(1\)

### Solving an MDP

#### You are handed or can predict \(R(s,a)\), and know all transitions

Small, Discrete State Space

Get an exact solution for \(U^{*}(s)\) (and hence \(\pi^{ *}(a, s)\)) for the problem via…

Large, Continuous State Space

Parameterize Policy

Optimize \(\pi_{\theta}\) to maximize \(U(\pi_{\theta})\) using Policy Optimization methods!

: lower dimension policy space**Gradient Free**: higher dimension policy space**Gradient Based Method**

Parameterize Value Function

Optimize \(U_{\theta}(S)\) via global approximation or local approximation methods, then use a greedy policy on that nice and optimized value function.

#### You can only reason about your immediate surroundings/local reachable states

or… “you don’t know the model whatsoever”

during these cases, you never argmax over all actions; hence, its important to remember the methods to preserve Exploration and Exploitation.