EMNLP2025: MUSE, MCTS Driven Red Teaming
Last edited: November 11, 2025
One-Liner
Notable Methods
- construct a series of perturbation actions
- \(A\qty(s)\) = decomposition (skip), expansion (rollout), redirection
- sequence actions with MCTS
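A rough UCT-style sketch of how these perturbation actions could be sequenced with MCTS; `apply_action` and `attack_score` are hypothetical placeholders (the paper's actual perturbation operators and attack-success scoring are not reproduced here).

```python
import math, random

ACTIONS = ["decompose", "expand", "redirect"]  # the perturbation actions A(s)

class Node:
    def __init__(self, prompt, parent=None):
        self.prompt, self.parent = prompt, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(child, parent, c=1.4):
    # standard UCT: exploit average reward, explore rarely-visited children
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts_red_team(root_prompt, apply_action, attack_score, n_iters=100):
    """apply_action(prompt, action) -> perturbed prompt; attack_score(prompt) -> reward in [0, 1]."""
    root = Node(root_prompt)
    for _ in range(n_iters):
        # selection: descend via UCT until we reach a leaf
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node))
        # expansion: apply each perturbation action to the leaf's prompt
        for action in ACTIONS:
            node.children.append(Node(apply_action(node.prompt, action), parent=node))
        # simulation: score one new child with the attack-success heuristic
        leaf = random.choice(node.children)
        reward = attack_score(leaf.prompt)
        # backpropagation: push the reward up to the root
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits).prompt
```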
Key Figs
New Concepts
Notes
EMNLP2025 Keynote: Heng Ji
Last edited: November 11, 2025
Motivation: drug discovery is extremely slow and expensive; most work just modulates previous iterations.
Principles of Drug Discovery
- observation: acquire/fuse knowledge from multiple data modalities (sequence, structure, etc.)
- think: critically generate genuinely new hypotheses, iterating on them
- allow LMs to code-switch between modalities (i.e. fuse different modalities together in the most uniform way)
An LM used as a heuristic helps prune the search space quickly.
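A minimal sketch of that idea, assuming a hypothetical `lm_score` callable (e.g. prompting an LM to rate how promising a candidate hypothesis is); this is an illustration, not the speaker's actual pipeline.

```python
def prune_candidates(candidates, lm_score, keep_top=10):
    """Keep only the candidates that an LM heuristic rates most promising.

    lm_score is a hypothetical callable mapping a candidate (e.g. a hypothesis
    string) to a plausibility score; higher means more worth exploring.
    """
    return sorted(candidates, key=lm_score, reverse=True)[:keep_top]
```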
SU-CS229 Midterm Sheet
Last edited: November 11, 2025
- matrix calculus
- supervised learning
- gradient descent
- Newton’s Method
- regression
- \(y|x,\theta\) is linear: Linear Regression
- What if \(X\) and \(y\) are not linearly related? Generalized Linear Model
- \(y|x, \theta\) can be any distribution in the exponential family
- some exponential family distributions: SU-CS229 Distribution Sheet
- classification
- take linear regression, squish it through a sigmoid: logistic regression; for multi-class, use softmax \(p\qty(y=k|x) = \frac{\exp \theta_{k}^{T} x}{\sum_{j} \exp \theta_{j}^{T} x}\) (see the sketch after this list)
- generative learning
- model each class's distribution, then check which one is more likely: GDA
- Naive Bayes
- bias variance tradeoff
- regularization
- unsupervised learning
- feature maps and precomputing inner products: Kernel Trick
- Decision Tree
- boosting
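A minimal NumPy sketch of the softmax classifier from the classification bullet above; the variable names and toy numbers are mine, not from the course.

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = k | x) = exp(theta_k^T x) / sum_j exp(theta_j^T x).

    Theta: (K, d) matrix whose rows are the per-class parameters theta_k.
    x: (d,) feature vector.
    """
    logits = Theta @ x
    logits = logits - logits.max()   # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# toy usage: 3 classes, 2 features
Theta = np.array([[1.0, -0.5], [0.0, 0.3], [-1.0, 0.2]])
x = np.array([0.7, 1.2])
print(softmax_probs(Theta, x))       # three probabilities summing to 1
```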
backpropagation
Last edited: October 10, 2025
backpropagation is a special case of “backwards differentiation” used to compute gradient updates over a computation graph.
constituents
- chain rule: suppose \(J=J\qty(g_{1}, \dots, g_{k})\), \(g_{i} = g_{i}\qty(\theta_{1}, \dots, \theta_{p})\), then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\)
- a neural network
requirements
Consider the notation in the following two-layer NN:
\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}
\begin{equation} a = \text{ReLU}\qty(z) \end{equation}
\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}
\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}
- in a forward pass, compute each intermediate value \(z^{(1)}, a^{(1)}, \ldots\)
- in a backward pass, compute the following (see the sketch after the chain rule lemmas):
- \(\pdv{J}{z^{(f)}}\): by hand
- \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
- \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
- \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
- and so on… until we get to the first layer
- after obtaining all of these, we compute the weight matrices' gradients:
- \(\pdv{J}{W^{(f)}}\): lemma 1 below
- \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
- …, until we get to the first layer
chain rule lemmas
Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
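For concreteness, here is a minimal NumPy sketch of the forward and backward passes for the two-layer network above, using the squared loss as written; the shapes and variable names are my own assumptions rather than anything fixed by the notes.

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """Forward and backward pass for the two-layer network above.

    Assumed shapes: x (d,), y scalar, W1 (m, d), b1 (m,), W2 (1, m), b2 (1,).
    Returns the loss J and the gradient of J w.r.t. every parameter.
    """
    # forward pass: cache every intermediate value
    z = W1 @ x + b1                  # pre-activation
    a = np.maximum(z, 0.0)           # ReLU
    y_hat = W2 @ a + b2              # h_theta(x)
    J = 0.5 * (y - y_hat.item()) ** 2

    # backward pass: start from dJ/dh_theta(x), walk back layer by layer
    dy_hat = y_hat - y               # dJ/dh_theta(x)
    dW2 = np.outer(dy_hat, a)        # gradient w.r.t. the output weights
    db2 = dy_hat
    da = W2.T @ dy_hat               # dJ/da
    dz = da * (z > 0)                # dJ/dz via the ReLU mask
    dW1 = np.outer(dz, x)            # gradient w.r.t. the first-layer weights
    db1 = dz
    return J, dW1, db1, dW2, db2
```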
deep learning
Last edited: October 10, 2025
supervised learning with non-linear models.
Motivation
Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we could use non-linear features of \(x\), but the model was always linear in \(\theta\)). Today, with deep learning, we can have non-linearity in both \(\theta\) and \(x\).
constituents
- We have the dataset \(\qty{\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\)
- Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
- Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
- Optimization: \(\min_{\theta} J\qty(\theta)\)
- Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
- Hyperparameters:
- Learning rate: \(\alpha\)
- Batch size \(B\)
- Iterations: \(n_{\text{iter}}\)
- stochastic gradient descent (where we randomly sample a single data point per update, etc.) or batch gradient descent (where we compute the gradient over a batch and scale the learning rate by the batch size); see the sketch below
- neural network
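A minimal mini-batch gradient descent loop matching the constituents above; `grad_J` is a hypothetical callable that returns the average gradient over a batch, and the default hyperparameter values are placeholders.

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_J, alpha=0.01, B=32, n_iter=1000):
    """Run theta <- theta - alpha * grad_J on randomly sampled batches.

    grad_J(theta, X_batch, y_batch) is assumed to return the average gradient
    of the per-example losses J^(i)(theta) over the batch.
    """
    n = X.shape[0]
    for _ in range(n_iter):
        idx = np.random.choice(n, size=min(B, n), replace=False)  # sample a batch
        theta = theta - alpha * grad_J(theta, X[idx], y[idx])     # gradient step
    return theta
```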
requirements
additional information
Background
Notation:
