SU-CS229 NOV102025
Last edited: November 11, 2025
Key Sequence
Notation
New Concepts
Important Results / Claims
Questions
Interesting Factoids
- “sometimes we may want to model at a slower timescale than the data is collected; for instance, your helicopter really doesn’t move anywhere in a hundredth of a second, but you can collect data that fast”
Debugging RL
RL should work when
- The simulator is good
- The RL algorithm correctly maximizes \(V^{\pi}\)
- The reward is designed such that maximum expected payoff corresponds to your actual goal
Diagnostics
- check your simulator: if your policy works in sim but not IRL, your sim is bad
- if \(V^{\text{RL}} < V^{\text{human}}\), the RL algorithm is at fault: it is failing to maximize \(V^{\pi}\)
- if \(V^{\text{RL}} \geq V^{\text{human}}\) but the human still does the task better, your objective function is bad: maximizing it does not achieve your goal (see the sketch below)
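These three checks translate into a simple decision procedure. Below is a minimal sketch in Python; the inputs are hypothetical stand-ins for whatever evaluation machinery you actually have, not code from the lecture.

```python
def diagnose_rl(works_in_sim: bool, works_in_real: bool,
                v_rl: float, v_human: float) -> str:
    """Decision procedure for the three diagnostics above.

    v_rl / v_human: estimated values V^pi of the learned policy and the
    human (baseline) policy under the *same* reward function.
    All inputs are hypothetical stand-ins for your own evaluation code.
    """
    if works_in_sim and not works_in_real:
        return "bad simulator: the policy does not transfer to reality"
    if v_rl < v_human:
        return "bad RL algorithm: it is failing to maximize V^pi"
    # v_rl >= v_human, yet the human still does the task better:
    return "bad objective: maximizing this reward != achieving your goal"
```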
EMNLP2025 Extra Things
Last edited: November 11, 2025
EMNLP2025 Yu: Long-Context LM Fail in Basic Retrieval
A synthetic dataset shows that needle-in-a-haystack retrieval fails when finding the needle requires reasoning
EMNLP2025 Friday Afternoon Posters
Last edited: November 11, 2025
EMNLP2025 Ghonim: concept-ediq
A massive bank of concepts, multimodal and semantically linked
EMNLP2025 Bai: understanding and leveraging expert specialization of context faithfulness
Two steps: step one uses router tuning to prioritize experts that rely on context; step two fine-tunes especially those experts for improved context reliance. Big gains on HotpotQA and other QA datasets just from the router tuning.
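A minimal sketch of what step one (router tuning) might look like on a standard top-k MoE router: freeze the pretrained router weights and learn only a per-expert logit bias. `BiasedRouter` and all names here are my own illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class BiasedRouter(nn.Module):
    """Router tuning sketch: the pretrained router is frozen and only a
    per-expert bias is trained, letting tuning up-weight the experts that
    were identified as context-faithful. Hypothetical, illustrative code."""

    def __init__(self, pretrained_router: nn.Linear, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = pretrained_router
        for p in self.router.parameters():
            p.requires_grad_(False)                 # step 1: freeze the router itself
        self.bias = nn.Parameter(torch.zeros(n_experts))  # only this is tuned
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor):
        logits = self.router(hidden) + self.bias    # shift expert preferences
        weights, experts = logits.topk(self.top_k, dim=-1)
        return torch.softmax(weights, dim=-1), experts  # gating weights + expert ids
```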
EMNLP2025 Vasu: literature grounded hypothesis generation
Use citation links to generate a provenance graph of hypotheses; then fine-tune a language model to reproduce this provenance graph; use the resulting model to improve RAG so that it is contextually grounded
EMNLP2025 Wednesday Morning Posters
Last edited: November 11, 2025
EMNLP2025 Xu: tree of prompting
Evaluate the quote attribution score as a way to prioritize more factual quotes.
EMNLP2025 Fan: medium is not the message
Unwanted features such as language or medium show up in embeddings; use linear concept erasure (LEACE) to learn a projection that minimizes information about the unwanted features
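A minimal sketch of the idea, assuming a single binary unwanted feature. Note this is a simple mean-difference projection for illustration, not the LEACE algorithm itself (LEACE whitens first and erases optimally).

```python
import numpy as np

def erase_concept(X: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Remove the direction along which a binary unwanted feature z
    (e.g. language or medium) separates the embeddings.

    X: (n, d) embeddings; z: (n,) binary labels of the unwanted feature.
    """
    # Direction separating the two groups: difference of class means.
    w = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    w = w / np.linalg.norm(w)
    # Project every embedding onto the orthogonal complement of w.
    return X - np.outer(X @ w, w)
```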
EMNLP2025 Hong: variance sensitivity induces attention entropy collapse
Softmax is highly sensitive to the variance of its input logits, which is why pre-training loss spikes without QK norm
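A quick numerical demonstration of the claim (my own toy example, not from the paper): as logit variance grows, softmax entropy collapses toward zero, i.e. attention concentrates on a single key.

```python
import numpy as np

def softmax_entropy(logits: np.ndarray) -> float:
    p = np.exp(logits - logits.max())   # stable softmax
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
base = rng.standard_normal(128)          # 128 attention logits
for scale in [0.1, 1.0, 4.0, 16.0]:      # growing logit variance
    print(scale, softmax_entropy(scale * base))
# Entropy falls from ~log(128) toward 0 as variance grows: attention
# collapses onto one key, which QK norm prevents by bounding the variance.
```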
policy evaluation
Last edited: November 11, 2025
See also Roll-out utility if you don’t want to compute a vector of utilities over all states.
solving for the utility of a policy
We can solve for the utility of a policy \(\pi\), given the transitions \(T\) and rewards \(R\), by solving the following linear system:
\begin{equation} \mathbf{U}^{\pi} = (I - \gamma T^{\pi})^{-1} \mathbf{R}^{\pi} \end{equation}
where \(T^{\pi}\) is an \(|S| \times |S|\) square matrix whose rows each sum to \(1\): entry \((s, s')\) is the probability of transitioning from state \(s\) (the row) to state \(s'\) (the column) under \(\pi\).
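A minimal numerical sketch of this solve on a made-up 3-state toy MDP (the numbers are illustrative, not from the notes):

```python
import numpy as np

# Exact policy evaluation: U^pi = (I - gamma T^pi)^{-1} R^pi.
gamma = 0.9
T_pi = np.array([[0.8, 0.2, 0.0],   # rows: current state, columns: next state
                 [0.1, 0.6, 0.3],   # each row sums to 1
                 [0.0, 0.0, 1.0]])  # state 3 is absorbing
R_pi = np.array([1.0, 0.0, 5.0])    # expected reward per state under pi

# Solve (I - gamma T) U = R rather than forming the inverse explicitly.
U_pi = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)
print(U_pi)  # utility of following pi from each state
```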
