
SU-CS229 NOV102025

Last edited: November 11, 2025

Key Sequence

Notation

New Concepts

Important Results / Claims

Questions

Interesting Factoids

  • “sometimes we may want to model more slowly than the data is collected; for instance, your helicopter really doesn’t move anywhere worth learning in a 100th of a second, but you can collect data that fast”

Debugging RL

RL should work when

  1. The simulator is good
  2. The RL algorithm correctly maximizes \(V^{\pi}\)
  3. The reward is such that the maximum expected payoff corresponds to your actual goal

Diagnostics

  • check your simulator: if your policy works in sim but not IRL, your sim is bad
  • if \(V^{\text{RL}} < V^{\text{human}}\), your RL algorithm is the problem: it is not actually maximizing \(V^{\pi}\)
  • if \(V^{\text{RL}} \geq V^{\text{human}}\) but the behavior is still wrong, your objective (reward) function is the problem (these checks are sketched as a decision procedure below)
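
A minimal sketch of the three diagnostics as one decision procedure (my own framing, not from the lecture; the flags and values are hypothetical inputs you would estimate from sim and real-world runs):

    # Sketch of the diagnostics above (hypothetical helper, not from the lecture).
    # v_rl / v_human: value achieved by the learned policy vs. a human/reference policy in sim.
    def diagnose_rl(works_in_sim: bool, works_in_real: bool,
                    v_rl: float, v_human: float) -> str:
        if works_in_sim and not works_in_real:
            return "simulator: policy works in sim but not IRL"
        if v_rl < v_human:
            return "RL algorithm: it is not actually maximizing V^pi"
        return "objective: V^pi is maximized, yet behavior is still wrong"

    # e.g. a policy that looks fine in sim (and beats the human baseline) but fails IRL
    print(diagnose_rl(works_in_sim=True, works_in_real=False, v_rl=1.2, v_human=1.0))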

EMNLP2025 Extra Things

Last edited: November 11, 2025

EMNLP2025 Yu: Long-Context LM Fail in Basic Retrieval

A synthetic-dataset study showing that needle-in-a-haystack retrieval fails once finding the needle requires reasoning

EMNLP2025 Friday Afternoon Posters

Last edited: November 11, 2025

EMNLP2025 Ghonim: concept-ediq

A massive bank of concepts, multimodal and semantically linked

EMNLP2025 Bai: understanding and leveraging expert specialization of context faithfulness

Two steps: step one uses router tuning to prioritize experts that rely on the context; step two fine-tunes especially those experts for improved context faithfulness. Big gains on HotpotQA and other QA datasets from the router tuning alone.

EMNLP2025 Vasu: literature grounded hypothesis generation

Use citation links to build a provenance graph of hypotheses; then fine-tune a language model to reproduce this provenance graph, and use the resulting model to improve RAG so that it is contextually grounded.

EMNLP2025 Wednesday Morning Posters

Last edited: November 11, 2025

EMNLP2025 Xu: tree of prompting

Evaluate the quote attribution score as a way to prioritize more factual quotes.

EMNLP2025 Fan: medium is not the message

Unwanted features, such as the language or the medium, show up in embeddings; use linear concept erasure to learn a projection that minimizes information about the unwanted features.
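
Not the paper’s method (linear concept erasure fits a stronger, whitening-aware projection), but a minimal numpy sketch of the underlying idea on made-up data: find a direction that predicts the unwanted attribute and project it out of the embeddings.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 16
    medium = rng.integers(0, 2, size=n)        # unwanted binary attribute (e.g. medium)
    X = rng.normal(size=(n, d))
    X[:, 0] += 3.0 * medium                    # the attribute leaks into the embeddings

    # crude "concept" direction: difference of class means (real concept erasure does better)
    w = X[medium == 1].mean(0) - X[medium == 0].mean(0)
    w /= np.linalg.norm(w)

    P = np.eye(d) - np.outer(w, w)             # orthogonal projection that removes w
    X_erased = X @ P

    # after erasure the class means no longer differ along w (prints ~0)
    print((X_erased[medium == 1].mean(0) - X_erased[medium == 0].mean(0)) @ w)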

EMNLP2025 Hong: variance sensitivity induces attention entropy collapse

Softmax is highly sensitive to the variance of its inputs, which is why pre-training loss spikes without QK norm.
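
A toy numpy illustration of the claim (my own example, not the paper’s experiment): scaling up the variance of the attention logits drives the softmax entropy toward zero, i.e. the distribution collapses onto a single key.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    logits = rng.normal(size=64)               # one query's attention logits
    for scale in [0.1, 1.0, 4.0, 16.0]:
        p = softmax(scale * logits)            # larger scale = larger logit variance
        entropy = -(p * np.log(p + 1e-12)).sum()
        print(f"logit std {scale * logits.std():6.2f} -> entropy {entropy:.3f}")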

policy evaluation

Last edited: November 11, 2025

See also Roll-out utility if you don’t want to compute a full vector of utilities over all states.

solving for the utility of a policy

We can solve for the utility of the policy given the transitions \(T\) and reward \(R\) by solving the following equation

\begin{equation} \bold{U}^{\pi} = (I - \gamma T^{\pi})^{-1} \bold{R}^{\pi} \end{equation}

where \(T^{\pi}\) is an \(|S| \times |S|\) square matrix whose entry in row \(s\) and column \(s'\) is the probability of transitioning from state \(s\) to state \(s'\) under \(\pi\); each row sums to \(1\).
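
A minimal numpy sketch of this computation on a made-up 3-state MDP (the numbers in T_pi and R_pi are arbitrary example values):

    import numpy as np

    gamma = 0.9
    T_pi = np.array([[0.8, 0.2, 0.0],          # row = current state, column = next state
                     [0.1, 0.7, 0.2],          # each row sums to 1
                     [0.0, 0.3, 0.7]])
    R_pi = np.array([1.0, 0.0, -1.0])          # expected reward in each state under pi

    # solve (I - gamma T^pi) U = R^pi rather than forming the inverse explicitly
    U_pi = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)
    print(U_pi)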