Houjun Liu

Chlasta 2021

# ntj

DOI: 10.3389/fpsyg.2020.623237

One-Liner (thrice)

  1. Used features extracted by VGGish from raw acoustic audio as input to an SVM, a Perceptron, and 1NN; got \(59.1\%\) classification accuracy for dementia
  2. Then, trained a CNN on raw waveforms and got \(63.6\%\) accuracy
  3. Then, they fine-tuned VGGish on the raw waveforms but didn’t report the results, and just said “we discovered that audio transfer learning with a pretrained VGGish feature extractor performs better”. Gah!
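
A minimal sketch of what the first approach boils down to, assuming VGGish is treated as a frozen feature extractor that emits one fixed-length embedding per audio frame (128-dim in the released model), the frames are averaged per recording, and 1NN labels a new recording by its closest training recording. The averaging step, the toy 3-dim vectors, and the names `mean_pool` and `nearest_neighbor_predict` are my illustrative assumptions, not details from the paper.

```python
import math


def mean_pool(frames):
    """Average per-frame embeddings into one fixed-length vector per recording."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def nearest_neighbor_predict(train_vecs, train_labels, query):
    """Return the label of the training vector closest to `query` (1NN, Euclidean)."""
    best = min(range(len(train_vecs)),
               key=lambda i: math.dist(train_vecs[i], query))
    return train_labels[best]


# Toy stand-ins for VGGish output (hypothetical values, 3-dim instead of 128-dim).
train_recordings = [
    [[0.9, 0.1, 0.0], [1.1, -0.1, 0.2]],  # hypothetical "dementia" recording
    [[0.0, 1.0, 0.1], [0.2, 0.8, -0.1]],  # hypothetical "control" recording
]
train_labels = ["dementia", "control"]
train_vecs = [mean_pool(r) for r in train_recordings]

# A new recording is pooled the same way, then matched to its nearest neighbor.
query_vec = mean_pool([[0.8, 0.2, 0.1], [1.0, 0.0, 0.1]])
prediction = nearest_neighbor_predict(train_vecs, train_labels, query_vec)
print(prediction)
```

The point is that no human-designed features enter anywhere: audio goes in, embeddings come out, and the classifier only ever sees vectors.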


Threw the kitchen sink at processing only raw acoustic input, and most of it missed; they wanted zero human involvement. The last method (fine-tuning VGGish) seems promising.

Notable Methods

Fine-tuning VGGish on raw acoustic waveforms to build a CNN-based classifier.

Key Figs

Their fancy network

It’s just a CNN, afaik, with a lot of max pooling; it could have used some skip connections. I wonder if it overfit?

Their actual training results

Looks generally pretty bad, but one run of their DemCNN seems to have gotten state-of-the-art results. Not sure where the transfer-learning data went.

New Concepts


Accuracy question

According to this, the state of the art at the time from pure audio was 56.6%? For a binary classifier (chance is 50% on a balanced set), isn’t that just doing nothing?

So somebody did get better before?
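
To make the chance-level question concrete, here is a small sanity check of my own, not anything from the paper. It computes the majority-class baseline for a label set and the probability of getting at least k of n right by coin-flipping. The balanced n = 22 test set is a hypothetical I picked because 13/22 ≈ 59.1% and 14/22 ≈ 63.6% match the reported numbers; these notes don't record the real test-set size or class balance.

```python
from collections import Counter
from math import comb


def majority_baseline(labels):
    """Accuracy of always predicting the most common class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def p_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical balanced test set of 22 recordings (an assumption, see above).
labels = ["dementia"] * 11 + ["control"] * 11
baseline = majority_baseline(labels)

# Chance of a coin-flip classifier scoring 14/22 (about 63.6%) or better.
p_value = p_at_least(14, 22)

print(f"majority baseline: {baseline:.3f}")
print(f"P(>= 14 of 22 by chance): {p_value:.3f}")
```

Under those assumptions the baseline is 50%, and 14 of 22 happens by chance roughly one time in seven, so on a test set that small the gap over "doing nothing" would be weak evidence either way.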