EMNLP2025: MUSE, MCTS Driven Red Teaming
Last edited: November 11, 2025
One-Liner
Notable Methods
- construct a series of perturbation actions
- \(A\qty(s)\) = decomposition (skip), expansion (rollout), redirection
- sequence actions with MCTS
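A rough UCT-style sketch of how these perturbation actions could be sequenced with MCTS; `apply_action` and `attack_score` are hypothetical placeholders (the paper's actual perturbation operators and attack-success scoring are not reproduced here).

```python
import math, random

ACTIONS = ["decompose", "expand", "redirect"]  # the perturbation actions A(s)

class Node:
    def __init__(self, prompt, parent=None):
        self.prompt, self.parent = prompt, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(child, parent, c=1.4):
    # standard UCT: exploit average reward, explore rarely-visited children
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts_red_team(root_prompt, apply_action, attack_score, n_iters=100):
    """apply_action(prompt, action) -> perturbed prompt; attack_score(prompt) -> reward in [0, 1]."""
    root = Node(root_prompt)
    for _ in range(n_iters):
        # selection: descend via UCT until we reach a leaf
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node))
        # expansion: apply each perturbation action to the leaf's prompt
        for action in ACTIONS:
            node.children.append(Node(apply_action(node.prompt, action), parent=node))
        # simulation: score one new child with the attack-success heuristic
        leaf = random.choice(node.children)
        reward = attack_score(leaf.prompt)
        # backpropagation: push the reward up to the root
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).prompt
```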
Key Figs
New Concepts
Notes
EMNLP2025 Keynote: Heng Ji
Last edited: November 11, 2025
Motivation: drug discovery is extremely slow and expensive; most work just modulates previous iterations.
Principles of Drug Discovery
- observation: acquire/fuse knowledge from multiple data modalities (sequence, structure, etc.)
- think: critically generate genuinely new hypotheses, iterating on them
- allow LMs to code-switch between modalities (i.e. fuse different modalities together in the most uniform way)
An LM used as a heuristic helps prune the search space quickly.
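A minimal sketch of that idea, assuming a hypothetical `lm_score` callable (e.g. prompting an LM to rate how promising a candidate hypothesis is); this is an illustration, not the speaker's actual pipeline.

```python
def prune_candidates(candidates, lm_score, keep_top=10):
    """Keep only the candidates that an LM heuristic rates most promising.

    lm_score is a hypothetical callable mapping a candidate (e.g. a hypothesis
    string) to a plausibility score; higher means more worth exploring.
    """
    return sorted(candidates, key=lm_score, reverse=True)[:keep_top]
```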
SU-CS229 Midterm Sheet
Last edited: November 11, 2025
- matrix calculus
- supervised learning
- gradient descent
- Newton’s Method
- regression
- \(y|x,\theta\) is linear: Linear Regression
- What if \(X\) and \(y\) are not linearly related? Generalized Linear Model
- \(y|x, \theta\) can be any distribution in the exponential family
- some exponential family distributions: SU-CS229 Distribution Sheet
- classification
- take linear regression, squish it through a sigmoid: logistic regression; for multi-class, use softmax \(p\qty(y=k|x) = \frac{\exp \theta_{k}^{T} x}{\sum_{j} \exp \theta_{j}^{T} x}\) (see the sketch after this list)
- generative learning
- model each class's distribution, then check which one is more likely: GDA
- Naive Bayes
- bias variance tradeoff
- regularization
- unsupervised learning
- feature maps and precomputing inner products: Kernel Trick
- Decision Tree
- boosting
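A minimal NumPy sketch of the softmax classifier from the classification bullet above; the variable names and toy numbers are mine, not from the course.

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = k | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x).

    Theta: (K, d) matrix whose rows are the per-class parameters theta_k.
    x: (d,) feature vector.
    """
    logits = Theta @ x
    logits = logits - logits.max()   # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# toy usage: 3 classes, 2 features
Theta = np.array([[1.0, -0.5], [0.0, 0.3], [-1.0, 0.2]])
x = np.array([0.7, 1.2])
print(softmax_probs(Theta, x))       # three probabilities summing to 1
```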
backpropagation
Last edited: October 10, 2025
backpropagation is a special case of “backwards differentiation” used to compute gradient updates over a computation graph.
constituents
- chain rule: suppose \(J=J\qty(g_{1}, \dots, g_{k})\), \(g_{i} = g_{i}\qty(\theta_{1}, \dots, \theta_{p})\), then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\)
- a neural network
requirements
Consider the notation in the following two-layer NN:
\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}
\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}
- in a forward pass, compute each intermediate value \(z^{(1)}, a^{(1)}, \ldots\)
- in a backward pass, compute the following (see the sketch after the chain rule lemmas):
- \(\pdv{J}{z^{(f)}}\): by hand
- \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
- \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
- and so on… until we get to the first layer
- after obtaining all of these, we compute the weight matrices' gradients:
- \(\pdv{J}{W^{(f)}}\): lemma 1 below
- \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
- …, until we get to the first layer
chain rule lemmas
Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
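For concreteness, here is a minimal NumPy sketch of the forward and backward passes for the two-layer network above, using the squared loss as written; the shapes and variable names are my own assumptions rather than anything fixed by the notes.

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Forward and backward pass for the two-layer network above.

    Assumed shapes: x (d,), y scalar, W1 (m, d), b1 (m,), W2 (1, m), b2 (1,).
    Returns the loss J and the gradient of J w.r.t. every parameter.
    """
    # forward pass: cache every intermediate value
    z = W1 @ x + b1                  # pre-activation
    a = np.maximum(z, 0.0)           # ReLU
    y_hat = W2 @ a + b2              # h_theta(x)
    J = 0.5 * (y - y_hat.item()) ** 2

    # backward pass: start from dJ/dh_theta(x), walk back layer by layer
    dy_hat = y_hat - y               # dJ/dh_theta(x)
    dW2 = np.outer(dy_hat, a)        # gradient w.r.t. the output weights
    db2 = dy_hat
    da = W2.T @ dy_hat               # dJ/da
    dz = da * (z > 0)                # dJ/dz via the ReLU mask
    dW1 = np.outer(dz, x)            # gradient w.r.t. the first-layer weights
    db1 = dz
    return J, dW1, db1, dW2, db2
```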
deep learning
Last edited: October 10, 2025
supervised learning with non-linear models.
Motivation
Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the model was always linear in \(\theta\)). Today, with deep learning, we can have non-linearity in both \(\theta\) and \(x\).
constituents
- We have the dataset \(\qty{\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\)
- Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
- Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
- Optimization: \(\min_{\theta} J\qty(\theta)\)
- Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
- Hyperparameters:
- Learning rate: \(\alpha\)
- Batch size \(B\)
- Iterations: \(n_{\text{iter}}\)
- stochastic gradient descent (where we randomly sample a single data point per update, etc.) or batch gradient descent (where we compute the gradient over a batch and scale the learning rate by the batch size); see the sketch below
- neural network
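A minimal mini-batch gradient descent loop matching the constituents above; `grad_J` is a hypothetical callable that returns the average gradient over a batch, and the default hyperparameter values are placeholders.

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_J, alpha=0.01, B=32, n_iter=1000):
    """Run theta <- theta - alpha * grad_J on randomly sampled batches.

    grad_J(theta, X_batch, y_batch) is assumed to return the average gradient
    of the per-example losses J^(i)(theta) over the batch.
    """
    n = X.shape[0]
    for _ in range(n_iter):
        idx = np.random.choice(n, size=min(B, n), replace=False)  # sample a batch
        theta = theta - alpha * grad_J(theta, X[idx], y[idx])     # gradient step
    return theta
```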
requirements
additional information
Background
Notation:
