Posts

MOEReview Zhang: Mixure of Attention Heads

Last edited: December 12, 2025

Split \(Q\) projection and attention out projection into experts, with one router coordinating them.

Better than MHA performanec.

Stanford UG Courses Index

Last edited: December 12, 2025

Stanford UG Y1, Aut

Stanford UG Y1, Win

Stanford UG Y1, Spr

Stanford UG Y2, Aut

Stanford UG Y2, Win

Stanford UG Y2, Spr

Stanford UG Y3, Aut

Stanford UG Talks

DateTopicPresenterLink
<2023-09-20 Wed>UG Research ProgramBrian ThomasStanford UG Research Program
<2023-09-28 Thu>Bld an Ecosystem, Not MonolithColin RaffelBuild a System
<2023-10-05 Thu>Training Helpful CHatbotsNazeen RajaniTraining Helpful Chatbots
<2023-10-26 Thu>AI Intepretability for BioGasper BegusAI Intepretability
<2023-11-02 Thu>PT Transformers on Long SeqsMike LewisPretraining Long Transformers
<2023-11-07 Tue>Transformers!A. VaswaniTransformers
<2023-11-09 Thu>Towards Interactive AgentsJessy LinInteractive Agent
<2023-11-16 Thu>Dissociating Language and ThoughtAnna IvanovaDissociating Language and Thought
<2024-01-11 Thu>Language AgentsKarthik NarasimhanLanguage Agents with Karthik
<2024-02-01 Thu>Pretraining Data
<2024-02-08 Thu>value alignmentBeen KimLM Alignment
<2024-02-15 Thu>model editingPeter HaseKnowledge Editing
<2024-07-18 Thu>Knowledge Localization
<2024-11-11 Mon>PresentationsSydney KatzPresentations
<2025-01-06 Mon>Video Generation with Learned PriorMeenakshi SarkarPriors
<2025-01-06 Mon>Theoretical Drone ControlSliding Mode UAV Control
<2025-01-09 Thu>VLM to AgentsTao YuVLM to Agents
<2025-01-13 Mon>Social RLNatasha JaquesSocial Reinforcement Learning
<2025-02-10 Mon>Model Predictive Control + PromptingGabriel MaherLLM MPC
<2025-03-03 Mon>Planning for Learning
<2025-03-06 Thu>Theorem ProvingSelf-Play Conjection Generalization
<2025-04-10 Thu>Safety for TrucksSafety for Autonomous Trucking
<2025-08-04 Mon>Collaborate Multiagent DMCollaborative Multiagent DM
<2025-09-22 Mon>AI Safety TalksAI Safety Annual Meeting
<2025-10-02 Thu>Pretraining under infinite computeLimited Samples and Infinite Compute
<2025-10-06 Mon>Mel KrusniakDecisions.jl
<2025-10-11 Sat>SISL Flash TalksSISL Talks
<2025-10-16 Thu>Predicting Scaling Performance
<2025-12-08 Mon>mixed-autonomy traffic with LLMSmixed-autonomy traffic with LLMs

Contacts

Talk Contacts

Houjun's Academic Home Page

Last edited: December 12, 2025

👋 Howdy, I'm Houjun Liu!

I’m a third-year coterminal MSCS and BSCS student in the Computer Science Department at Stanford University, grateful to be advised by Prof. Mykel Kochenderfer. In the course of my research, I have also had the fortunate opportunity to work with Stanford NLP under Prof. Chris Manning, CMU TalkBank under Prof. Brian MacWhinney, and Prof. Xin Liu at UC Davis Engineering. I am affiliated with the Stanford NLP Group and Stanford Intelligent Systems Lab.

Bellman-Ford Algorithm

Last edited: December 12, 2025

Bellman-Ford Algorithm is a dynamic programming problem that solves single-source shortest path.

d[v] = \infty
d[s] = 0
for i = 0, ..., n-2:
    for v in v:
        d_prev = d
        for u in v.in_neigbors:
            d[v] = min(d_prev[v], d_prev[u] + w(u,v))

Notice you need \(O\qty(n)\) space (2 \(d\) rounds, the previous round and the next round), and runtime is \(O\qty(nm)\) (outer \(n\) loop, inner is an iteration between \(deg\qty(v)*|v| = |e| = m\); that is, we have an minimum over the degree of each \(v\) for every \(v\), which adds up to the total number of edges.)

Dijikstra's Algorithm

Last edited: December 12, 2025

constituents

  • a weighted directed graph, meaning edges have weights
  • where cost of a path is the sum of the weights along that path
  • the shortest path is the path with the minimum cost
  • starting node \(s\), target node \(t\)

Each node has two states:

  • unlabeled
  • labeled

And stores a piece of information:

\(d\qty(v) = \text{distance}\qty(s, v)\)

We initialize \(d\qty(v) = \infty, v \neq s\) , and \(d\qty(s) = 0\).

requirements

  1. pick the node \(u\) that has unlabeled state with smallest \(d\qty(u)\)
  2. for all neighbor v of u:
    1. set \(d\qty(v) = \min \qty(d\qty(v), d\qty(u) + \text{edgeWeight}\qty(u,v))\)
  3. mark \(v\) as labeled
def dike(G,s,t):
    set verticies to State.NOTSURE
    for v in G.V:
        d[v] = float("+inf")
        p[v] = None # (for parents)
    d[s] = 0
    while State.NOTSURE in G.V:
        # get node with minimum distance that's not sure
        u = argmin(d[v] for v in G.V if v.state == State.NOTSURE)
        u.state = State.SURE
        for v in u.out_neighbors:
            d[v] = min(d[v], d[u] + edgeWeight(u,v))
            if d[v] was changed:
               p[v] = u # (for obtaining next path in chain for shortes paths)
    return d[t]

additional information

proof

Let’s start with a lemma