Houjun Liu

model-free reinforcement learning

In model-based reinforcement learning, we tried real hard to get \(T\) and \(R\). What if we just estimated \(Q(s,a)\) directly? model-free reinforcement learning tends to be quite slow, compared to model-based reinforcement learning methods.

review: estimating mean of a random variable

we got \(m\) points \(x^{(1 \dots m)} \in X\) , what is the mean of \(X\)?

\begin{equation} \hat{x_{m}} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \end{equation}

\begin{equation} \hat{x}_{m} = \hat{x}_{m-1} + \frac{1}{m} (x^{(m)} - \hat{x}_{m-1}) \end{equation}

every time you get a new measurement \(x^{(m)}\). sometimes we don’t scale it by \(\frac{1}{m}\), you can scale it with constant \(\alpha\) which actually causes exponential decay of past samples (as it keeps getting scaled by \(\alpha\)).

\begin{equation} \hat{x} = \hat{x} + \alpha (x- \hat{x}) \end{equation}


Let us review the action-value function:

\begin{equation} Q(s,a) = R(s,a) + \gamma \sum_{s’}^{} T(s’|s,a) U(s’) \end{equation}

this is a model-free method, substituting in the definition of the value function:

\begin{equation} Q(s,a) = R(s,a) + \gamma \sum_{s’}^{} T(s’|s,a) \max_{a} Q(s’, a’) \end{equation}

Note! The second half is know in the shape of an expectation ("probability times the value"). Recall also that \(R(s,a)\) is the expected reward \(r\) when taking \(s,a\).


\begin{equation} r = \mathbb{E}[R(s,a)] \end{equation}

So we can say that:

\begin{equation} Q(s,a) = \mathbb{E} \qty[r + \gamma \max_{a’} Q(s’, a’)] \end{equation}

Finally, then, we can perform random variable mean estimation scheme given above; recall:

\begin{equation} \hat{x} = \hat{x} + \alpha (x- \hat{x}) \end{equation}

hence, we update our new mean with:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha \qty [(r + \gamma \max_{a’} Q(s’, a’)) - Q(s,a)] \end{equation}


SARSA is Q-Learning where you hope the model converges. You HAVE to perform some Exploration and Exploitation to try out other actions, and then you just update your function accordingly:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \alpha \qty [(r + \gamma Q(s’, a’)) - Q(s,a)] \end{equation}

this works in theory because over time, good Exploration and Exploitation assumes that:

\begin{equation} a’ \rightarrow \arg\max_{a’} Q(s’,a’) \end{equation}

Eligibility Traces

Eligibility Traces is a change to SARSA which uses the number of visits as an additional constraint that allows updates to propagate each reward backwards given the list of states which caused that reward to be distributed.

Meaning, let \(\lambda\) be some decay parameter, we have:

\begin{equation} \delta = r + \gamma Q(s’,a’) - Q(s,a) \end{equation}

and, we can write:

\begin{equation} Q(s,a) \leftarrow Q(s,a) + \lambda \delta N(s,a) \end{equation}

where by the visit counts are discounted such that:

\begin{equation} N(s,a) \leftarrow \gamma \lambda N(s,a) \end{equation}

See also Sarsa (Lambda).

Generalized Q-Learning with Gradient action-value

Consider Value Function Approximation. We were trying to fit a set of \(\theta\) at that time to find \(U_{\theta}\) that matches \(U^{*}\).

We now want to compute some \(Q_{\theta}\) in the same flavour:

\begin{equation} Q_{\theta}(s,a) \sim Q^{*}(s,a) \end{equation}

We can measure the difference between these two values like so:

\begin{equation} \ell(\theta) = \frac{1}{2}\mathbb{E}_{(s,a)\sim \pi^{*}}\qty[(Q^{*}(s,a) - Q_{\theta}(s,a))^{2}] \end{equation}

We want to write this expected value distributed over \(s,a\) of the optimal policy because we want to calculate more samples over those states that the optimal policy ends up at most.

To optimize \(Q_{\theta}\), then, you betcha know what’s happenin:

\begin{align} \nabla \ell &= \nabla \frac{1}{2} \nabla \mathbb{E}_{(s,a) \sim \pi^{*}} \qty[(Q^{*}(s,a)-Q_{\theta}(s,a))^{2}] \\ &= \mathbb{E}_{(s,a) \sim \pi^{*}} \qty[(Q^{*}(s,a)-Q_{\theta}(s,a)) (-1)\nabla Q_{\theta}(s,a)] \\ &= -\mathbb{E}_{(s,a) \sim \pi^{*}} \qty[(Q^{*}(s,a)-Q_{\theta}(s,a)) \nabla Q_{\theta}(s,a)] \end{align}

by a healthy dose of the chain rule.

Now, to minimize this loss, we go in the direction opposite the gradient. The negatives then cancel out to give us:

\begin{equation} \theta \leftarrow \theta + \alpha \qty[\mathbb{E}_{(s,a) \sim \pi^{*}} \qty[(Q^{*}(s,a)-Q_{\theta}(s,a)) \nabla Q_{\theta}(s,a)] ] \end{equation}

where \(\alpha \in (0,1)\).

Similar to the SARSA assumption, good Exploration and Exploitation assumes that:

\begin{equation} Q \to Q^{*} \end{equation}

so we can drop our expectation with:

\begin{equation} \theta \leftarrow \theta + \alpha \qty[(Q^{*}(s,a)-Q_{\theta}(s,a)) \nabla Q_{\theta}(s,a)] \end{equation}

Now, we can make one more assumption, the assumption from Q-Learning:

\begin{equation} Q^{*}(s,a) \approx r_{s} + \gamma \max_{a’} Q_{\theta}(s’,a’) \end{equation}

that taking the best actions with the \(Q\) you have will slowly approximate optimal \(Q\).

\begin{equation} \theta \leftarrow \theta + \alpha \qty[\qty((r_{s} + \gamma \max_{a’} Q_{\theta}(s’,a’))-Q_{\theta}(s,a)) \nabla Q_{\theta}(s,a)] \end{equation}

you will note! this is actually just Q-Learning multiplying with a gradient.

Policy Gradient

see also Policy Gradient