SU-CS229 Andrew's Advice
Last edited: November 11, 2025
“How quickly can we prototype?” 2-7 days.
SU-CS229 NOV062025
Last edited: November 11, 2025
Key Sequence
Notation
New Concepts
Important Results / Claims
Questions
Interesting Factoids
229 MDP notation
\(S\) (states), \(A\) (actions), \(P_{s,a}\qty(s') = T\qty(s' | s,a)\), \(\gamma\) (discount), \(R\qty(s,a)\).
FUN FACT: a discount factor \(\gamma < 1\) makes the Bellman backup a contraction, so value iteration converges.
\begin{equation} V^{\pi}\qty(s) = \mathbb{E}\qty[R\qty(s_{0},a_{0}) + \gamma R\qty(s_{1}, a_{1}) + \gamma^{2} R\qty(s_{2}, a_{2}) + \dots \mid s_{0} = s] \end{equation}
\begin{equation} V^{\pi}\qty(s) = R\qty(s) + \gamma \sum_{s'} P_{s,\pi\qty(s)}\qty(s') V^{\pi}\qty(s') \end{equation}
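A minimal sketch of value iteration under this notation, assuming a tabular MDP with a state-based reward \(R(s)\) stored as numpy arrays (the names `value_iteration`, `R`, `P` are illustrative, not from the lecture):

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, tol=1e-6):
    """Tabular value iteration.

    R: (S,) reward per state, matching V(s) = R(s) + ...
    P: (S, A, S) transitions, P[s, a, s2] = T(s2 | s, a)
    Returns the converged values V and the greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Q[s, a] = R[s] + gamma * sum_{s2} P[s, a, s2] * V[s2]
        Q = R[:, None] + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

With \(\gamma < 1\) each sweep shrinks the max-norm error by a factor of \(\gamma\), which is why the loop terminates.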
EMNLP2025 Eo: Expert Generalization in MoE in IFT
Last edited: November 11, 2025One-Liner
cluster the input, then activate a separate expert group for each cluster.
Motivation
- the heterogeneity of instruction-tuning data poses difficulty for MoE
- routing operates only at the token level, so it can’t handle sequence-level generalization
Novelty
Architecture to enable hierarchical expert routing.
Notable Methods
Mixture of Clustered Experts
Dual-stage routing mechanism.
- group the \(M\) experts into \(M/N\) groups of \(N\) experts each
- k-means cluster the sequence embedding of the input
- given the assigned cluster, route tokens only to experts in that subgroup (see the sketch after this list)
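A minimal sketch of the dual-stage routing in PyTorch; the names and shapes (`seq_emb`, `gate`, `group_experts`) are assumptions for illustration, not the paper's implementation:

```python
import torch

def moc_route(x, seq_emb, centroids, gate, group_experts, top_k=2):
    """Stage 1: sequence -> cluster/group; stage 2: token -> experts in group.

    x: (T, d) token states           seq_emb: (d,) pooled sequence embedding
    centroids: (C, d) k-means centers   gate: (d, M) router weights
    group_experts: (C, N) long tensor of expert ids, one row per cluster
    """
    # Stage 1: assign the whole sequence to its nearest k-means centroid.
    cluster = torch.cdist(seq_emb[None], centroids).argmin()
    experts = group_experts[cluster]           # (N,) ids of the active group
    # Stage 2: standard token-level top-k gating, restricted to that group.
    logits = x @ gate                          # (T, M) scores for all experts
    group_logits = logits[:, experts]          # (T, N) keep only the group
    weights, local_idx = group_logits.topk(top_k, dim=-1)
    weights = weights.softmax(dim=-1)          # per-token mixing weights
    return experts[local_idx], weights         # global expert ids + weights
```

The point of stage 1 is that every token in a sequence shares one expert subgroup, so the group, not just individual experts, can specialize per cluster.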
Results
- outperforms MoE baselines
- demonstrates expert-group specialization
EMNLP2025 Wu: Zero Shot Graph Learning via Explicit Reasoning
Last edited: November 11, 2025
One-Liner
Novelty
Background
How do LLMs do graphs?
- predict text from graphs (convert the graph into text, then autoregress; see the sketch after this list)
- align text with graph (late fusion of a GNN and an LLM)
- encode text with graph (attach an LLM embedding to a GNN as a prompt)
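A tiny sketch of the first family (graph-to-text serialization for autoregressive prediction); the serialization format and names are illustrative, not from the paper:

```python
def graph_to_prompt(nodes, edges, question):
    """Serialize a small graph as text for an autoregressive LM.

    nodes: {node_id: label}; edges: [(src, relation, dst)] triples.
    The textual format here is a made-up example, not the paper's.
    """
    triples = [f"{nodes[s]} -[{r}]-> {nodes[d]}" for s, r, d in edges]
    return "Graph:\n" + "\n".join(triples) + f"\nQuestion: {question}\nAnswer:"

# e.g. a two-edge graph flattened into a prompt for the LM
prompt = graph_to_prompt(
    {0: "Alice", 1: "Bob", 2: "Acme"},
    [(0, "knows", 1), (1, "works_at", 2)],
    "Where does Bob work?",
)
```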
