Houjun Liu

Chlasta 2021

# ntj

DOI: 10.3389/fpsyg.2020.623237

One-Liner (thrice)

Used features extracted by VGGish from raw acoustic audio as input to an SVM, a Perceptron, and 1NN; got \(59.1\%\) classification accuracy for dementia
Then, trained a CNN on raw waveforms and got \(63.6\%\) accuracy
Then, they fine-tuned VGGish on the raw waveforms but didn’t report the results; they just said “we discovered that audio transfer learning with a pretrained VGGish feature extractor performs better”. Gah!
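To make the first approach concrete, here is a minimal sketch of a 1NN classifier over fixed-length embedding vectors, scored with leave-one-out accuracy. It is not the paper's code: the embeddings are random stand-ins (VGGish produces 128-dimensional embeddings, hence the default `dim`), and `make_fake_embeddings`, its `shift` parameter, and the evaluation protocol are all assumptions for illustration.

```python
import math
import random


def nearest_neighbor_predict(train_X, train_y, query):
    """Return the label of the training vector closest to `query` (Euclidean distance)."""
    best = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    return train_y[best]


def leave_one_out_accuracy(X, y):
    """Classify each sample using all the other samples as the 1NN training set."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if nearest_neighbor_predict(rest_X, rest_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


def make_fake_embeddings(n_per_class, dim=128, shift=0.5, seed=0):
    """HYPOTHETICAL stand-in for per-recording VGGish embeddings.

    Class 0 is drawn around 0.0 and class 1 around `shift` in every dimension;
    real embeddings would come from running VGGish on the audio.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label * shift, 1.0) for _ in range(dim)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_fake_embeddings(n_per_class=20)
    print("leave-one-out 1NN accuracy:", leave_one_out_accuracy(X, y))
```

Swapping in an SVM or Perceptron only changes the predict step; the embeddings-in, label-out shape of the pipeline stays the same.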

Novelty

Threw the kitchen sink at processing only raw acoustic input, and most of it missed; they wanted zero human involvement. The last method (fine-tuning VGGish) seems promising.

Notable Methods

Fine-tuning VGGish on raw acoustic waveforms to build a CNN-based classifier.

Key Figs

Their fancy network

It’s just a CNN afaik, with a lot of max pooling; could have used some skip connections. I wonder if it overfit?
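For reference, a rough pure-Python sketch of what a skip connection adds on top of a plain conv + max-pool stack. This is an illustration only, not the paper's architecture: the function names, the single-channel 1-D setup, and the kernel values are all assumptions.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation, zero-padded so the output has the input's length.

    Assumes an odd-length kernel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Elementwise max(0, v)."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Non-overlapping max pooling; an incomplete trailing window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def plain_block(signal, kernel):
    """Conv followed by ReLU: the input only survives through the kernel."""
    return relu(conv1d_same(signal, kernel))


def residual_block(signal, kernel):
    """Conv + ReLU, then add the input back (the skip connection)."""
    transformed = relu(conv1d_same(signal, kernel))
    return [t + s for t, s in zip(transformed, signal)]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, 0.5]
    k = [0.25, 0.5, 0.25]  # HYPOTHETICAL smoothing kernel
    print("plain + pool:   ", max_pool1d(plain_block(x, k), 2))
    print("residual + pool:", max_pool1d(residual_block(x, k), 2))
```

With an all-zero kernel the plain block outputs zeros while the residual block passes the input through unchanged, which is why skip connections make deep stacks easier to train.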

Their actual training results

Looks generally pretty bad, but one run of their DemCNN seems to have gotten state-of-the-art results. Not sure where the transfer-learning training data went.

New Concepts

VGGish

Notes

Accuracy question

According to this, the state of the art at the time from pure audio was 56.6%? For a binary classifier, isn’t that just doing nothing (roughly what a coin flip gets on balanced classes)?

So somebody did get better before?
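One way to sanity-check the 56.6% question is to compare it against trivial baselines. The sketch below computes the always-predict-the-majority accuracy and the probability that a coin-flip classifier does at least as well on a test set of a given size. The test-set size and class balance used in the usage part are assumptions, not numbers from the paper.

```python
from math import comb


def majority_baseline_accuracy(n_positive, n_negative):
    """Accuracy obtained by always predicting the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)


def chance_tail_probability(n_test, n_correct):
    """P(a fair coin-flip classifier gets at least n_correct of n_test right)."""
    favorable = sum(comb(n_test, k) for k in range(n_correct, n_test + 1))
    return favorable / 2 ** n_test


if __name__ == "__main__":
    n_test = 50  # HYPOTHETICAL test-set size, not taken from the paper
    reported_accuracy = 0.566
    n_correct = round(reported_accuracy * n_test)
    print("majority baseline (balanced):", majority_baseline_accuracy(n_test // 2, n_test // 2))
    print("P(coin flip does at least as well):", chance_tail_probability(n_test, n_correct))
```

If that tail probability is large, the reported accuracy is hard to distinguish from guessing at that sample size.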