EMNLP2025 Zhang: Diffusion vs. Autoregression Language Models
One-Liner
Novelty
Notable Methods
Key Figs
New Concepts
Notes
EMNLP2025: MUSE, MCTS-Driven Red Teaming
One-Liner
Notable Methods
- construct a series of perturbation actions
- \(A\qty(s)\) = decomposition (skip), expansion (rollout), redirection
- sequence actions with MCTS
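The notes only capture the action names and the fact that they are sequenced with MCTS; below is a generic UCT-style sketch of that idea, not MUSE's actual method. `ACTIONS`, `apply_action`, and `attack_reward` are hypothetical placeholders standing in for the talk's perturbation operators and attack-success scorer.

```python
import math
import random

# Hypothetical action set mirroring A(s) above: decomposition, expansion, redirection.
ACTIONS = ["decompose", "expand", "redirect"]

class Node:
    def __init__(self, prompt, parent=None, action=None):
        self.prompt = prompt      # current perturbed prompt (state s)
        self.parent = parent
        self.action = action      # action that produced this node
        self.children = []
        self.visits = 0
        self.value = 0.0          # accumulated attack-success reward

    def ucb1(self, c=1.4):
        # Standard UCT score: exploit average reward, explore rarely visited nodes.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )

def apply_action(prompt, action):
    # Placeholder perturbation; a real system would call an LM to rewrite the prompt.
    return f"{prompt} [{action}]"

def attack_reward(prompt):
    # Placeholder scorer; a real system would query the target model and a judge.
    return random.random()

def mcts(root_prompt, iterations=100):
    root = Node(root_prompt)
    for _ in range(iterations):
        # 1. Selection: descend by UCB1 while the node is fully expanded.
        node = root
        while node.children and len(node.children) == len(ACTIONS):
            node = max(node.children, key=lambda n: n.ucb1())
        # 2. Expansion: try one untried perturbation action.
        tried = {c.action for c in node.children}
        untried = [a for a in ACTIONS if a not in tried]
        if untried:
            action = random.choice(untried)
            node = Node(apply_action(node.prompt, action), parent=node, action=action)
            node.parent.children.append(node)
        # 3. Simulation: score the perturbed prompt.
        reward = attack_reward(node.prompt)
        # 4. Backpropagation: propagate reward up the tree.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits) if root.children else root
```

Usage: `mcts("seed prompt", iterations=200)` returns the most-visited first-level perturbation, i.e. the action sequence's first step that most reliably increased the (placeholder) reward.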
Key Figs
New Concepts
Notes
EMNLP2025 Keynote: Heng Ji
Motivation: drug discovery is extremely slow and expensive; it is mostly incremental modification of previous iterations of work.
Principles of Drug Discovery
- observe: acquire/fuse knowledge from multiple data modalities (sequence, structure, etc.)
- think: critically generate genuinely new hypotheses, iterating on them
- allow LMs to code-switch between modalities (i.e. fuse different modalities together in a uniform way)
LM as a heuristic helps prune down search space quickly.
SU-CS229 Midterm Sheet
- matrix calculus
- supervised learning
- gradient descent
- Newton’s Method
- regression
- \(y|x,\theta\) is linear: Linear Regression
- What if \(X\) and \(y\) are not linearly related? Generalized Linear Model
- \(y|x, \theta\) can be any distribution that’s exponential family
- some exponential family distributions: SU-CS229 Distribution Sheet
- classification
- take linear regression, squish it: logistic regression; for multi-class, use softmax \(p\qty(y=k|x) = \frac{\exp \theta_{k}^{T} x}{\sum_{j} \exp \theta_{j}^{T} x}\) (see the sketch after this list)
- generative learning
- model each class's distribution, then check which one is more likely: GDA
- Naive Bayes
- bias variance tradeoff
- regularization
- unsupervised learning
- feature map and precomputing Kernel Trick
- Decision Tree
- boosting
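A minimal numpy sketch of the multi-class softmax probability \(p\qty(y=k|x)\) from the classification bullet above; the class count, weights, and input are arbitrary toy values, not anything from the course.

```python
import numpy as np

def softmax_probs(theta, x):
    """p(y = k | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x), computed stably."""
    logits = theta @ x          # one logit theta_k^T x per class
    logits -= logits.max()      # subtract max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# toy usage: 3 classes, 2 features (weights chosen arbitrarily for illustration)
theta = np.array([[ 1.0, -0.5],
                  [ 0.2,  0.3],
                  [-1.0,  0.8]])
x = np.array([0.5, 2.0])
print(softmax_probs(theta, x))  # probabilities over the 3 classes, summing to 1
```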
backpropagation
backpropagation is a special case of "backwards differentiation" used to compute gradients over a computation graph.
constituents
- chain rule: suppose \(J=J\qty(g_{1}, \ldots, g_{k})\), \(g_{i} = g_{i}\qty(\theta_{1}, \ldots, \theta_{p})\), then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\) (worked example after this list)
- a neural network
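A quick worked instance of this chain rule (my own toy example, not from the notes): take \(J = g_{1} g_{2}\) with \(g_{1} = \theta^{2}\) and \(g_{2} = \sin\theta\), so \(k = 2\) and \(p = 1\).
\begin{equation} \pdv{J}{\theta} = \pdv{J}{g_{1}}\pdv{g_{1}}{\theta} + \pdv{J}{g_{2}}\pdv{g_{2}}{\theta} = g_{2}\cdot 2\theta + g_{1}\cdot \cos\theta = 2\theta \sin\theta + \theta^{2}\cos\theta \end{equation}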
requirements
Consider the notation in the following two-layer NN:
\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}
\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}
- in a forward pass, compute each intermediate value \(z^{(1)}, a^{(1)}, \ldots\)
- in a backward pass, compute…
- \(\pdv{J}{z^{(f)}}\): by hand
- \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
- \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
- and so on… until we get to the first layer
- after obtaining all of these, compute the gradients with respect to the weight matrices:
- \(\pdv{J}{W^{(f)}}\): lemma 1 below
- \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
- …, until we get to the first layer
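A minimal numpy sketch of this forward/backward recipe for the two-layer network defined above; the dimensions, random initialization, and variable names are illustrative assumptions, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dimensions: 3 inputs, 4 hidden units, scalar output (arbitrary choices)
d_in, d_hidden = 3, 4
x = rng.normal(size=(d_in, 1))
y = np.array([[1.0]])

# parameters of the two-layer network from the equations above
W1 = rng.normal(size=(d_hidden, d_in)); b1 = np.zeros((d_hidden, 1))
W2 = rng.normal(size=(1, d_hidden));    b2 = np.zeros((1, 1))

# forward pass: compute and cache every intermediate value
z = W1 @ x + b1             # z = w^(1) x + b^(1)
a = np.maximum(z, 0.0)      # a = ReLU(z)
h = W2 @ a + b2             # h_theta(x) = w^(2) a + b^(2)
J = 0.5 * ((y - h) ** 2).item()

# backward pass: work from the loss back to the first layer
dJ_dh = h - y               # dJ/dh = -(y - h)
dJ_dW2 = dJ_dh @ a.T        # lemma-1 pattern: dJ/dW = (dJ/dz_out) a_prev^T
dJ_db2 = dJ_dh
dJ_da = W2.T @ dJ_dh        # lemma-3 pattern: dJ/da_prev = W^T dJ/dz
dJ_dz = dJ_da * (z > 0)     # lemma-2 pattern: elementwise ReLU'(z)
dJ_dW1 = dJ_dz @ x.T
dJ_db1 = dJ_dz

print(J, dJ_dW1.shape, dJ_dW2.shape)
```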
chain rule lemmas
Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
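The lemma statements themselves are not in these notes; the following reconstruction (an assumption, inferred from how lemmas 1-3 are referenced in the backward pass above) states the usual patterns for a generic layer \(z^{(i)} = W^{(i)} a^{(i-1)} + b^{(i)}\), \(a^{(i)} = \sigma\qty(z^{(i)})\):
- lemma 1 (weights and biases): \(\pdv{J}{W^{(i)}} = \pdv{J}{z^{(i)}} \qty(a^{(i-1)})^{T}\) and \(\pdv{J}{b^{(i)}} = \pdv{J}{z^{(i)}}\)
- lemma 2 (through the activation): \(\pdv{J}{z^{(i)}} = \pdv{J}{a^{(i)}} \odot \sigma'\qty(z^{(i)})\)
- lemma 3 (to the previous layer's activations): \(\pdv{J}{a^{(i-1)}} = \qty(W^{(i)})^{T} \pdv{J}{z^{(i)}}\)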
