Model Evaluation
Last edited: August 8, 2025

Extrinsic Evaluation
Extrinsic Evaluation, also known as In-Vivo Evaluation, benchmarks language models against each other by comparing their performance on a downstream test task.
Intrinsic Evaluation
In-Vitro Evaluation or Intrinsic Evaluation focuses on evaluating the language models’ performance at, well, language modeling.
Typically, we use perplexity.
- directly measures language model performance
- doesn't necessarily correspond to performance on real applications
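As a rough sketch (the function name and setup here are illustrative, not from any particular library), perplexity is the exponential of the average negative log-likelihood a model assigns to held-out tokens:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-likelihood,
    given the model's per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 0.25 to each of 4 tokens has
# perplexity ~4: it is as "confused" as a uniform 4-way choice.
lp = [math.log(0.25)] * 4
print(perplexity(lp))  # ~4.0
```

Lower perplexity means the model assigns higher probability to the held-out text.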
model fitting
model-based reinforcement learning
Step 1: Getting the Model
We want a model of the environment, consisting of:
- \(T\): transition probabilities
- \(R\): rewards
Maximum Likelihood Parameter Learning Method
We keep a count

\begin{equation} N(s,a,s') \end{equation}

of transitions from \(s,a\) to \(s'\), incrementing it each time \(s, a, s'\) is observed. Maximum Likelihood Parameter Learning then gives:

\begin{equation} T(s' \mid s,a) = \frac{N(s,a,s')}{\sum_{s''} N(s,a,s'')} \end{equation}
We also keep a table:

\begin{equation} p(s,a) \end{equation}

which stores the sum of rewards observed when taking \(s,a\). The reward estimate is then the average:

\begin{equation} R(s,a) = \frac{p(s,a)}{\sum_{s''} N(s,a,s'')} \end{equation}
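The counting scheme above can be sketched in a few lines (the helper names `observe`, `N`, and `p` are illustrative, mirroring the tables in the text):

```python
from collections import defaultdict

# N[(s, a)][s'] counts observed transitions; p[(s, a)] sums rewards.
N = defaultdict(lambda: defaultdict(int))
p = defaultdict(float)

def observe(s, a, r, s_next):
    """Record one observed (s, a, r, s') transition."""
    N[(s, a)][s_next] += 1
    p[(s, a)] += r

def T(s_next, s, a):
    """Maximum-likelihood estimate of T(s' | s, a)."""
    return N[(s, a)][s_next] / sum(N[(s, a)].values())

def R(s, a):
    """Average reward observed when taking action a in state s."""
    return p[(s, a)] / sum(N[(s, a)].values())

# Three transitions observed from (s0, a0):
observe("s0", "a0", 1.0, "s1")
observe("s0", "a0", 1.0, "s1")
observe("s0", "a0", 4.0, "s2")
print(T("s1", "s0", "a0"))  # 2/3
print(R("s0", "a0"))        # (1 + 1 + 4) / 3 = 2.0
```

With more observed transitions, these estimates converge to the true \(T\) and \(R\) of the environment.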
model-free reinforcement learning
In model-based reinforcement learning, we tried real hard to estimate \(T\) and \(R\). What if we just estimated \(Q(s,a)\) directly? Model-free reinforcement learning tends to be slower (less sample-efficient) than model-based methods.
review: estimating mean of a random variable
We have \(m\) sample points \(x^{(1)}, \dots, x^{(m)}\) drawn from \(X\); what is our estimate of the mean of \(X\)?
\begin{equation} \hat{x}_{m} = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \end{equation}

Equivalently, we can compute the same estimate incrementally, updating it as each new sample arrives:

\begin{equation} \hat{x}_{m} = \hat{x}_{m-1} + \frac{1}{m} \left( x^{(m)} - \hat{x}_{m-1} \right) \end{equation}
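The incremental update can be sketched directly (the function name is illustrative):

```python
def running_mean(xs):
    """Incremental mean: at step m, nudge the previous estimate
    toward the new sample by 1/m of the difference."""
    mean = 0.0
    for m, x in enumerate(xs, start=1):
        mean += (x - mean) / m  # x̂_m = x̂_{m-1} + (1/m)(x^(m) - x̂_{m-1})
    return mean

data = [2.0, 4.0, 6.0, 8.0]
print(running_mean(data))  # 5.0, same as sum(data) / len(data)
```

This form needs only constant memory and, with the \(1/m\) step size replaced by a fixed learning rate, is the template for the model-free value updates that follow.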