Combined bag-of-words features on the transcript with ADR features on the audio, fed to various classifiers for AD detection; ablated BERT's decision space via its attention weights to inform simpler models in the future.
- Pre-processed each of the two modalities separately before fusing them (late fusion)
- Achieved \(93.75\%\) accuracy on AD detection
- Because the data is forced-aligned and fused late, one can see which sounds/words the BERT model was focusing on simply by inspecting the attention weights over the words
- Used the classic Cookie Theft picture-description data
- Bag-of-words serves as the word-level analogue of ADR (which operates on the audio)
- Multimodal, but via late fusion into one (hot-swappable) classifier
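The fusion setup above can be sketched as follows. This is a minimal toy illustration, not the paper's pipeline: the random arrays stand in for the real bag-of-words transcript vectors and ADR audio vectors, and the feature sizes are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for the two modality feature sets (real ones would be
# bag-of-words counts for the transcript and ADR features for the audio).
n_samples = 40
text_feats = rng.random((n_samples, 50))    # transcript bag-of-words vectors
audio_feats = rng.random((n_samples, 20))   # ADR-style audio vectors
labels = rng.integers(0, 2, n_samples)      # 0 = TD, 1 = AD

# Late fusion: each modality is pre-processed on its own, then the
# per-modality feature vectors are concatenated per sample.
fused = np.hstack([text_feats, audio_feats])

# The single classifier on top is hot-swappable: any estimator with the
# same fit/score interface drops in.
for clf in (LogisticRegression(max_iter=1000), SVC()):
    clf.fit(fused, labels)
    print(type(clf).__name__, clf.score(fused, labels))
```

Swapping the classifier only touches the last loop; the fused feature matrix is untouched, which is what makes the classifier hot-swappable.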
How they did it
This is how they combined the forced-aligned (:tada:) audio and transcript.
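A rough sketch of what forced alignment buys you here: each transcript word gets a time span, which maps onto acoustic-feature frames, so token-level attention can be traced back to specific sounds. The alignment triples and the 100 Hz frame rate below are hypothetical illustration values, not from the paper.

```python
# Hypothetical forced-alignment output: (word, start_sec, end_sec) triples.
alignment = [("the", 0.00, 0.21), ("boy", 0.21, 0.55), ("falls", 0.55, 1.02)]

FRAME_RATE = 100  # acoustic feature frames per second (assumed)

def word_frames(start, end, rate=FRAME_RATE):
    """Map a word's time span onto acoustic-feature frame indices."""
    return list(range(int(start * rate), int(end * rate)))

# Pair each transcript token with the frame indices of its audio span,
# so attention on a token points at a concrete stretch of audio.
pairs = {w: word_frames(s, e) for w, s, e in alignment}
print("boy ->", pairs["boy"][0], "..", pairs["boy"][-1])
```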
Ablated BERT results.
The model overall tends to focus on early parts of sentences. y-axis: attention weight; x-axis: position in sentence; blue: TD, red: AD.