Posts

EMNLP2025: MUSE, MCTS Driven Red Teaming

Last edited: November 11, 2025

One-Liner

Notable Methods

  1. construct a series of perturbation actions
    • \(A\qty(s)\) = decomposition (skip), expansion (rollout), redirection
  2. sequence actions with MCTS
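
Not the paper's implementation: a minimal UCT-style MCTS sketch over an abstract action set, just to illustrate how perturbation actions could be sequenced; the action names, state transition, and reward scorer below are placeholders.

```python
# Hedged sketch: sequencing abstract perturbation actions with vanilla UCT MCTS.
# apply_action / reward are stand-ins; in practice they would rewrite the prompt
# and score attack success with a judge model.
import math
import random

ACTIONS = ["decompose", "expand", "redirect"]        # placeholder action set A(s)

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def apply_action(state, action):
    return state + [action]                          # placeholder transition

def reward(state):
    return random.random()                           # placeholder scorer

def uct(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts(root_state, iterations=200, max_depth=4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. selection: descend while the node is fully expanded
        while node.children and len(node.children) == len(ACTIONS):
            node = max(node.children, key=lambda ch: uct(ch, node.visits))
        # 2. expansion: add one untried action (unless we are at max depth)
        if len(node.state) < max_depth:
            tried = {ch.action for ch in node.children}
            action = random.choice([a for a in ACTIONS if a not in tried])
            child = Node(apply_action(node.state, action), node, action)
            node.children.append(child)
            node = child
        # 3. simulation: random rollout to max depth, then score
        state = node.state
        while len(state) < max_depth:
            state = apply_action(state, random.choice(ACTIONS))
        r = reward(state)
        # 4. backpropagation: push the reward up to the root
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).action  # most-visited first action

print(mcts([]))
```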

Key Figs

New Concepts

Notes

EMNLP2025 Keynote: Heng Ji

Last edited: November 11, 2025

Motivation: drug discovery is extremely slow and expensive, and mostly consists of modulating previous iterations of work.

Principles of Drug Discovery

  • observation: acquire/fuse knowledge from multiple data modalities (sequence, structure, etc.)
  • think: critically generate genuinely new hypotheses, allowing iterative refinement
  • allow LMs to code-switch between modalities (i.e. fuse different modalities together in the most uniform way)

Using an LM as a heuristic helps prune down the search space quickly.

SU-CS229 Midterm Sheet

Last edited: November 11, 2025

backpropagation

Last edited: October 10, 2025

backpropagation is a special case of “backwards differentiation” used to compute gradients over a computation graph (which we then use to update the graph's parameters).

constituents

  • chain rule: suppose \(J=J\qty(g_{1}, \ldots, g_{k})\), \(g_{i} = g_{i}\qty(\theta_{1} \dots \theta_{p})\); then \(\pdv{J}{\theta_{i}} = \sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{i}}\) (see the numeric check after this list)
  • a neural network
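
A small numeric sanity check of the chain rule above (my own toy example, not from the notes): compare the analytic sum \(\sum_{j=1}^{k} \pdv{J}{g_{j}} \pdv{g_{j}}{\theta_{1}}\) against a finite-difference estimate of \(\pdv{J}{\theta_{1}}\).

```python
# Hedged toy example of the chain rule with J(g1, g2), g1 = θ1 + θ2, g2 = θ1·θ2.
import math

def g1(t1, t2): return t1 + t2
def g2(t1, t2): return t1 * t2
def J(t1, t2):  return g1(t1, t2) * g2(t1, t2)      # J = g1 · g2

t1, t2 = 1.5, -0.7

# analytic chain rule: ∂J/∂g1 = g2, ∂J/∂g2 = g1; ∂g1/∂θ1 = 1, ∂g2/∂θ1 = θ2
analytic = g2(t1, t2) * 1.0 + g1(t1, t2) * t2

# central finite-difference estimate of dJ/dθ1
eps = 1e-6
numeric = (J(t1 + eps, t2) - J(t1 - eps, t2)) / (2 * eps)

print(analytic, numeric)
assert math.isclose(analytic, numeric, rel_tol=1e-4)  # the two should agree
```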

requirements

Consider the notation in the following two-layer NN:

\begin{equation} z = w^{(1)} x + b^{(1)} \end{equation}

\begin{equation} a = \text{ReLU}\qty(z) \end{equation}

\begin{equation} h_{\theta}\qty(x) = w^{(2)} a + b^{(2)} \end{equation}

\begin{equation} J = \frac{1}{2}\qty(y - h_{\theta}\qty(x))^{2} \end{equation}


  1. in a forward pass, compute each intermediate value \(z^{(1)}, a^{(1)}, \ldots\)
  2. in a backward pass, compute…
    1. \(\pdv{J}{z^{(f)}}\): by hand
    2. \(\pdv{J}{a^{(f-1)}}\): lemma 3 below
    3. \(\pdv{J}{z^{(f-1)}}\): lemma 2 below
    4. \(\pdv{J}{a^{(f-2)}}\): lemma 3 below
    5. \(\pdv{J}{z^{(f-2)}}\): lemma 2 below
    6. and so on… until we get to the first layer
  3. after obtaining all of these, we compute the gradients of the weight matrices:
    1. \(\pdv{J}{W^{(f)}}\): lemma 1 below
    2. \(\pdv{J}{W^{(f-1)}}\): lemma 1 below
    3. …, until we get to the first layer

chain rule lemmas

Pattern match your expressions against these, from the last layer to the first layer, to amortize computation.
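
The lemmas themselves are not reproduced in this section; the bullets below are my reconstruction of what they most likely state for a ReLU network in the notation above (inferred from how the steps use them, so treat the exact forms as an assumption), with \(a^{(0)} = x\):

  • lemma 1 (weight gradient): \(\pdv{J}{W^{(\ell)}} = \pdv{J}{z^{(\ell)}} \qty(a^{(\ell - 1)})^{\top}\)
  • lemma 2 (through the activation): \(\pdv{J}{z^{(\ell)}} = \pdv{J}{a^{(\ell)}} \odot \text{ReLU}'\qty(z^{(\ell)})\)
  • lemma 3 (through the weights): \(\pdv{J}{a^{(\ell - 1)}} = \qty(W^{(\ell)})^{\top} \pdv{J}{z^{(\ell)}}\)

A hedged NumPy sketch (mine, not course code) of steps 1–3 for the two-layer network defined above, with a finite-difference check on one weight; the lemma labels in the comments follow the mapping used in the steps:

```python
# Hedged sketch of the forward/backward pass for the two-layer ReLU network above.
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 4                              # input dim, hidden dim
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(1, h)), np.zeros(1)
x, y = rng.normal(size=d), np.array([1.0])

# 1. forward pass: cache every intermediate value
z = W1 @ x + b1                          # z^{(1)}
a = np.maximum(z, 0)                     # a^{(1)} = ReLU(z^{(1)})
h_theta = W2 @ a + b2                    # h_θ(x)
J = 0.5 * np.sum((y - h_theta) ** 2)

# 2. backward pass, last layer to first
dJ_dh = h_theta - y                      # ∂J/∂z^{(f)} "by hand" (output layer is linear)
dJ_da = W2.T @ dJ_dh                     # lemma 3 pattern: pull back through W^{(2)}
dJ_dz = dJ_da * (z > 0)                  # lemma 2 pattern: elementwise ReLU'(z)

# 3. gradients of the weight matrices (and biases)
dJ_dW2 = np.outer(dJ_dh, a)              # lemma 1 pattern
dJ_db2 = dJ_dh
dJ_dW1 = np.outer(dJ_dz, x)              # lemma 1 pattern
dJ_db1 = dJ_dz

# finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
Jp = 0.5 * np.sum((y - (W2 @ np.maximum(W1p @ x + b1, 0) + b2)) ** 2)
print(dJ_dW1[0, 0], (Jp - J) / eps)      # should agree closely
```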

deep learning

Last edited: October 10, 2025

supervised learning with non-linear models.

Motivation

Previously, our learning methods were linear in the parameters \(\theta\) (i.e. we can have non-linear features of \(x\), but the hypothesis is always linear in \(\theta\)). Today: with deep learning we can have non-linearity in both \(\theta\) and \(x\).

constituents

  • We have \(\qty {\qty(x^{(i)}, y^{(i)})}_{i=1}^{n}\) the dataset
  • Our loss \(J^{(i)}\qty(\theta) = \qty(y^{(i)} - h_{\theta}\qty(x^{(i)}))^{2}\)
  • Our overall cost: \(J\qty(\theta) = \frac{1}{n} \sum_{i=1}^{n} J^{(i)}\qty(\theta)\)
  • Optimization: \(\min_{\theta} J\qty(\theta)\)
  • Optimization step: \(\theta = \theta - \alpha \nabla_{\theta} J\qty(\theta)\)
  • Hyperparameters:
    • Learning rate: \(\alpha\)
    • Batch size \(B\)
    • Iterations: \(n_{\text{iter}}\)
  • stochastic gradient descent (where we randomly sample a data point per step, etc.) or batch gradient descent (where we scale the learning rate by the batch size and compute the gradient over a batch); see the sketch after this list
  • neural network
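
A minimal sketch (mine, not lecture code) of the optimization loop described in this list, using the hyperparameters \(\alpha\), \(B\), and \(n_{\text{iter}}\); the hypothesis is kept linear here only so the gradient stays one line, and the loop itself is unchanged for a neural-network \(h_{\theta}\).

```python
# Hedged sketch of mini-batch SGD on the squared loss; all names are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)          # synthetic dataset {(x, y)}

alpha, B, n_iter = 0.1, 16, 500                        # learning rate, batch size, iterations
theta = np.zeros(d)

for _ in range(n_iter):
    idx = rng.choice(n, size=B, replace=False)         # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    resid = Xb @ theta - yb                            # h_θ(x) - y on the batch (linear h_θ here)
    grad = (2.0 / B) * Xb.T @ resid                    # ∇_θ of (1/B) Σ (y - h_θ(x))²
    theta -= alpha * grad                              # θ ← θ - α ∇_θ J(θ)

print(theta)  # should land close to theta_true
```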

requirements

additional information

Background

Notation: