_index.org

Mahatma Ghandi

Last edited: August 8, 2025

Make Models Go Brrr: Model Parallel Whisper Training

Last edited: August 8, 2025

Happy Monday friends.

The deliverable of the week was to make the a ASR model for Batchalign. Essentially, most copies of Whisper is pretty bad at Language Sample Analysis (LSA), because they mostly don’t work in terms trying to actually capture the things that people doing LSA want to capture (disfluencies, stuttering, etc.). OpenAI even acknowledged in the paper that they filtered out the disfluencies from their gold transcript to prevent Whisper from writing down too much of them.

map restriction operator

Last edited: August 8, 2025

Suppose \(T \in \mathcal{L}(V)\), and \(U \subset V\), an invariant subspace under \(T\). Then:

\begin{equation} T|_{U}(u) = Tu,\ \forall u \in U \end{equation}

where \(T|_{U} \in \mathcal{L}(U)\)

mapping reduction

Last edited: August 8, 2025

A language \(A\) is mapping reducible to language \(B\), written as \(A \leq_{m} B\), if there is a computable function \(f: \Sigma^{*} \to \Sigma ^{ *}\) such that for every \(w\), \(w \in A \Leftrightarrow f(w) \in B\).

This is sometimes called a “many-to-one” reduction because often times you want to have multiple \(w\) mapping to the same \(f(w)\).

We remember this as “A is weaker (“not stronger”) than B”; or “A is reducable to B”

MapReduce

Last edited: August 8, 2025

MapReduce is an distributed algorithm.

https://www.psc.edu/wp-content/uploads/2023/07/A-Brief-History-of-Big-Data.pdf

  • Map: \((in\_key, in\_value) \Rightarrow list(out\_key, intermediate\_value)\).
  • Reduce:
    • Group map outputs by \(out\_key\)
    • \((out\_key, list(intermediate\_value)) \Rightarrow list(out\_value)\)

example of MapReduce

Say, if you want to count word frequencies in a set of documents.

  • Map: \((document\_name, document\_contents) \Rightarrow list(word, #\ occurrences)\)

You can see that this can be distributed to multiple processors. You can have each processor count the word frequencies in a single document. We have now broken the contents into divide and conquerable groups.