Houjun Liu

DementiaBank Acoustics Project

The DementiaBank Acoustics Project is a working title for an acoustic-only challenge for AD detection. This document serves as the lab notebook for this project.

This project will attempt to replicate some of the results of Wang 2019 and Martinc 2021 while focusing on minimizing human involvement: we will first work on raw transcript classification with ERNIE (cutting all CHAT annotations), then introduce pause encoding in a manner similar to Yuan 2021, automated via MFA. The goal is to replicate the results of Yuan 2021, or even Martinc 2021, in a completely automated manner.

Background Reading

I first began by doing a literature survey on the ADReSS Challenge results published in the Frontiers AD special interest group issue.

Proposal

And then, we wrote a proposal: DementiaBank Acoustics Project Proposal

Brainstorming

More notes from the meeting: DementiaBank Acoustics Brainstorming

Protocol Notes

July 1st

  • Began by moving a subsample of Pitt’s Cookie Theft to pitt-7-1 in the raw data folder
  • Ran flo on all collected samples. The arguments used are the same as those for batchalign, except we filter out the INV tier since we are detecting AD on the patient and not the investigator: so flo +d +ca +t* -tINV
  • Moved all collected samples (and changed extension to .txt) to the same sub-folder, but in transcripts_nodisfluency

July 2nd

  • Created a dataprep script dataprep.py which dumps a pickled copy of cleaned data to transcripts_nodisfluency/pitt-7-1.dat.
  • Created sliding windows of 5 concatenated pieces of dialogue, stored in transcripts_nodisfluency/pitt-7-1-windowed.dat (see the sketch after this list)
  • Used tencent/HuYong’s nghuyong/ernie-2.0-en ERNIE 2.0 model, the continual pre-training language model from Baidu (Layers: 12, Hidden: 768, Heads: 12)
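
Roughly, the windowing step does the following (a simplified sketch, not the actual dataprep.py; names are illustrative):

import pandas as pd

def make_windows(utterances, label, window_size=5):
    # concatenate every run of `window_size` consecutive utterances into one
    # sample, each carrying the session-level dementia/control label
    return [{"utterance": " ".join(utterances[i:i + window_size]), "target": label}
            for i in range(len(utterances) - window_size + 1)]

def dump_windows(sessions, path):
    # sessions: [(list_of_utterances, label), ...]; pool and pickle all windows
    rows = [w for utts, label in sessions for w in make_windows(utts, label)]
    pd.DataFrame(rows).to_pickle(path)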

July 4th

  • Finalized training code. Selected base hyperparameters {bs: 8, epochs: 2, lr: 3e-3, length: 60}. Again, we are using Baidu’s nghuyong/ernie-2.0-en (a rough sketch of the setup follows this list).
  • Started training fastcalculator on 24bc812
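
For reference, the training setup is roughly the following; this is a minimal sketch using the Hugging Face Trainer with the hyperparameters above, not the actual training code, and the dataset objects (train_windows/test_windows) are assumed:

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-2.0-en")
model = AutoModelForSequenceClassification.from_pretrained(
    "nghuyong/ernie-2.0-en", num_labels=2)

args = TrainingArguments(output_dir="out",
                         per_device_train_batch_size=8,  # bs: 8
                         num_train_epochs=2,             # epochs: 2
                         learning_rate=3e-3)             # lr: 3e-3

# train_windows / test_windows: the pickled windowed splits, tokenized to
# max length 60 with 0/1 labels attached (assumed prepared elsewhere)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_windows, eval_dataset=test_windows)
trainer.train()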

train: faithful-frog-3

{bs: 8, epochs: 2, lr: 3e-3, length: 60, pitt-7-1-windowed.dat }

  • Commentary: LR could be too high, looking at the divergent loss behavior.
  • Decision: dropping bs to 4 and lr to 1e-5, similar to previous transformers. Also training for 3 epochs.

train: revived-disco-5

{bs: 4, epochs: 3, lr: 1e-5, length: 60, pitt-7-1-windowed.dat }

  • Commentary: quintessential overfitting
  • Decision:
    • Made the corpus bigger
      • cleaned the entire Pitt corpus (pitt-7-4 in the raw folder) to become training data. As with pitt-7-1, ran flo on all collected samples; the arguments used are the same as those for batchalign, except we filter out the INV tier since we are detecting AD on the patient and not the investigator: so flo +d +ca +t* -tINV; the flo’d results are in transcripts_nodisfluency.
      • the notable difference between the previous dataset 7-1 and the current 7-4 is that the 7-4 filenames are prefixed with the task name (cookie/100-01.cha → cookie-100-01.txt)
      • The new (full) Pitt data prepared above is run through the dataprep script as of b325514cfad79da82d7a519ed29ea19ed87b2be4 (the difference being that empty/dummy files are ignored), and pickled at transcripts_nodisfluency/pitt-7-4.dat and transcripts_nodisfluency/pitt-7-4-windowed.dat respectively.
      • For the new data, the window size is still 5, with 10 cases now split out for testing instead of 5.

train: vocal-oath-6

{bs: 4, epochs: 3, lr: 1e-5, length: 60, pitt-7-4-windowed.dat}

  • Commentary: high recall, low precision. Perhaps classes aren’t balanced?
    • Spoiler alert: they are not.
    • An inspection of the data reveals that there are 3,211 rows of dementia and 2,397 rows of control
  • Decision:
    • Created the pitt-7-4-bal and pitt-7-4-windowed-bal series of data based on dataprep.py at 703f79248a20fd7a13a5033ca2bf7f691f42c941. This version force-crops to make sure that the dementia and control indices have exactly the same length for each class (see the sketch after this list).
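
The force-crop balancing amounts to something like the following (a simplified sketch, not the actual dataprep.py code):

def balance(df):
    # crop both classes down to the size of the smaller one, so the dementia
    # and control rows end up with exactly the same count
    n = df["target"].value_counts().min()
    return df.groupby("target", group_keys=False).apply(lambda g: g.iloc[:n])

# balanced = balance(windowed_df)   # -> pitt-7-4-windowed-bal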

train: helpful-leaf-7

{bs: 4, epochs: 3, lr: 1e-5, length: 60, pitt-7-4-windowed-bal.dat}

Beautiful. The question now is whether there is data leakage or some external heuristic at play. It is a good time to do some LOOCV; getting this result without any disfluency calculations seems unlikely.

In any case, going to discuss these results, as they seem to match the results we see in Yuan 2021 even without a top-N ensemble; though this is a single trial, and LOOCV may still show that we actually need it.

July 5th

  • Began the day by creating the k-fold validation script; I originally hoped to exactly replicate the procedure of Yuan 2021 for comparability, but it is not clear how they obtained a min/max range with LOOCV on a binary task; therefore, we will instead build a 95% confidence-interval analysis via a single-variable t-test on standard k-fold validation (sketch after this list). K=50
  • During one-off testing, another set of hyperparameters seems to work too: {bs: 72, epochs: 3, lr: 1e-5, length: 60, pitt-7-4-windowed-bal.dat}. As we have not yet begun hyperparameter tuning, we are just going to use this set, with K=50, for the first k-fold trial.
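
The interval computation is roughly the following (a sketch; kfold_accuracies is assumed to hold the K=50 per-fold validation scores):

import numpy as np
from scipy import stats

def confidence_interval(fold_accuracies, confidence=0.95):
    # one-sample t interval over the per-fold validation accuracies
    scores = np.asarray(fold_accuracies)
    return stats.t.interval(confidence, df=len(scores) - 1,
                            loc=scores.mean(), scale=stats.sem(scores))

# low, high = confidence_interval(kfold_accuracies)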

k-fold: F4ZVbGfdBAQvtvXemWZCZD

code: 55f77ff1dea03c3ed66967864dc52fd2c0062f23

{bs: 72, epochs: 3, lr: 1e-5, length: 60, pitt-7-4-windowed-bal.dat} K = 50

It seems like the results we got are consistent and validate in the manner we expect.

July 7th

Yesterday was filled with work on batchalign, but we are back now. Today, I aim to look into the heuristic I identified yesterday by playing with the model: it seems like the model prefers long, focused sentences about cookies, so the heuristic it is picking up is probably on-topic-ness.

I am first going to leverage the lovely cdpierse/transformers-interpret tool to build some explainability, adding it to validate.py. Upon some human validation with random sampling, the model seems to do less well than I’d hoped. Running a train cycle with the new params seen above to see if it does better.
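
For reference, wiring cdpierse/transformers-interpret into validate.py looks roughly like this (a sketch; model and tokenizer are the fine-tuned ERNIE classifier objects, and the exact integration differs):

from transformers_interpret import SequenceClassificationExplainer

explainer = SequenceClassificationExplainer(model, tokenizer)

# per-token attribution scores for one window of transcript text
word_attributions = explainer("well the boy is getting some cookies")
print(explainer.predicted_class_name)
print(word_attributions)   # [(token, attribution), ...]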

train: brisk-oath-10

{bs: 72, epochs: 3, lr: 1e-5, length: 60, pitt-7-4-windowed-bal.dat}

  • Commentary: It seems like the model is doing worse overall on the validation data, but it does fairly well on the test data.
  • Decision:
    • I can fairly confidently claim that the model is just fitting on topic. As in, if the topic is about cookies (theft/taking/cookie/mother/etc.), it will be classified as control.
    • One thing we can do is make this task directly task-controlled: that is, include no data except the Cookie Theft task, and thereby control for that difference
    • Then, the model wouldn’t be able to predict the result because variation in topic won’t have any influence.
    • This is going to be prepared as the cookiepitt-7-7-bal* datasets, based on dataprep.py at commit 518dec82bb961c0a8ad02e3080289b56102aa1a2

train: super-durian-11

{bs: 72, epochs: 3, lr: 1e-5, length: 60, cookiepitt-7-7-windowed-bal.dat}

  • Commentary: the model is nowhere near convergence
  • Decision: multiplying the LR by 10

train: floral-sunset-12

{bs: 72, epochs: 3, lr: 1e-4, length: 60, cookiepitt-7-7-windowed-bal.dat}

  • Commentary: There we go. This seems to be more in line with what we see in Yuan 2021
  • Decision: ok, let’s lengthen the actual content. Perhaps we can try 7-element windows instead? This is written as cookiepitt-7-7-*-long. Code based on 9e31f4bc13c4bfe193dcc049059c3d9bda46c8d0

train: sweet-plasma-13

{bs: 72, epochs: 3, lr: 1e-4, length: 60, cookiepitt-7-7-windowed-long-bal.dat}

  • Commentary: underfitting
  • Decision: dropping batch size down to 64 to add more steps

train: smart-river-14

{bs: 64, epochs: 3, lr: 1e-4, length: 60, cookiepitt-7-7-windowed-long-bal.dat}

  • Commentary: this finally fits in line with what Yuan 2021 report
  • Decision: running k-fold on this architecture

k-fold: XgsP4FVS6ScFxCZKFJoVQ5.

Code: 3870651ba71da8ddb3f481a7c3e046397a09d8b2

July 8th

  • Began the day by aligning the entirety of Cookie Theft for both control and dementia; named the dataset alignedpitt-7-8 in the RAW folder

  • Per what we discussed, we will add [pause] as a token to the model. Then, transcribe the text such that it contains normalized pause values for pauses > 0.250 seconds (see the sketch below the example). Therefore, the data would look like

    “hello my name is [pause] 262 [pause] bob”
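
The transform itself is roughly the following (a sketch; the word-timing format from the forced alignment is an assumption, and the actual transcript.py differs):

def encode_pauses(aligned_words, threshold=0.250):
    # aligned_words: [(word, start_sec, end_sec), ...] from the forced alignment
    out = []
    prev_end = None
    for word, start, end in aligned_words:
        if prev_end is not None and (start - prev_end) > threshold:
            # insert the pause duration in milliseconds, marked by [pause] tokens
            out.append(f"[pause] {round((start - prev_end) * 1000)} [pause]")
        out.append(word)
        prev_end = end
    return " ".join(out)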

July 9th

  • Created transcript.py, which converts the data in raw to transcripts_pauses, which contains pause values > 250 ms wrapped in [pause] tokens
  • The code above is adapted from check.py in batchalign; used transcript.py at 7e19a4912cf0ad5d269c139da5ce018615495ebb to clean out the dataset, placing it in a similar .txt format to alignedpitt-7-8
  • Ran dataprep with a window size of 5, creating alignedpitt-7-8.bat and alignedpitt-7-8-windowed.bat as the dataprep output files
  • Starting a new training run with [pause] added as a new token (see the sketch after this list), code 06846c6c95e6b1ccf17f0660c5da76aa50231567
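
Adding the token is the standard resize step (a sketch, assuming the Hugging Face tokenizer and model objects used for training):

# register [pause] so the wordpiece tokenizer keeps it whole, then grow
# the embedding matrix to cover the new vocabulary entry
num_added = tokenizer.add_tokens(["[pause]"])
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))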

train: golden-tree-16

{bs: 64, epochs: 3, lr: 1e-4, length: 60, alignedpitt-7-8-windowed.dat}

So realistically, we have the same F1 between the two, but pause encoding increased the accuracy of prediction yet dropped recall dramatically.

As a random check, let’s find out if simple fine-tuning (training only the classifier head) would work.
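
Training only the classifier head amounts to freezing everything else; a minimal sketch, assuming the usual Hugging Face parameter naming:

# freeze the ERNIE encoder; leave only the classification head trainable
for name, param in model.named_parameters():
    param.requires_grad = "classifier" in name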

train: jumping-blaze-17

{bs: 64, epochs: 3, lr: 1e-4, length: 60, alignedpitt-7-8-windowed.dat}. This time training only the classifier head.

  • Commentary: it did not even start converging
  • Bumping LR by a factor of 10

train: vital-water-18

{bs: 64, epochs: 3, lr: 1e-3, length: 60, alignedpitt-7-8-windowed.dat}. This time training only the classifier head.

  • Commentary: barely started converging; seems to be stuck in a local minimum
  • Training for 2 more epochs

train: fiery-smoke-19

{bs: 64, epochs: 5, lr: 1e-3, length: 60, alignedpitt-7-8-windowed.dat}. This time training only the classifier head.

  • Commentary: classic overfitting

At this point, unlocking the model would probably be a good bet

train: leafy-deluge-20

{bs: 64, epochs: 5, lr: 1e-4, length: 60, alignedpitt-7-8-windowed.dat}.

Training once again with the model unlocked (no freezing), and bumping LR back down

  • Commentary: the recall is slowly creeping up
  • Decision: let’s go for 8 epochs

train: royal-pond-21

{bs: 64, epochs: 8, lr: 1e-4, length: 60, alignedpitt-7-8-windowed.dat}.

Commentary: let’s run k-fold now, with these settings.

k-fold: QskZWfEsML52ofcQgGujE2.

{bs: 64, epochs: 8, lr: 1e-4, length: 60, alignedpitt-7-8-windowed.dat}.

Ok, the base hypothesis from Yuan 2021 is very much confirmed here: the same training and the same content, but pause encoding is very beneficial to the quality of the results. The results they reported used an ensemble, which is in the high 80s; we can now move on to something new, as Yuan 2021’s conclusion has essentially been reproduced.

We can probably call the replication stage done, with no dramatically better effect.

July 10th

  • FluCalc! Leonid’s lovely new program can be an über-useful feature extraction tool
  • Let’s try using it to build a new dataset and network: FluCalc + Pause Encoding + Textual Data late fusion
  • This is becoming alignedpitt-7-8-flucalc. As the program is currently under heavy development to include results from batchalign, we will specify version V 09-Jul-2022 11:00 for now.
  • Done; the new data has the same I/O shape, but additionally has a set of features (filtered for nulls) containing the outputs from FluCalc. Again, alignedpitt-7-8-flucalc from 4346fc07c4707343c507e32786b6769b6bd6fb49 does not take into account results from the %wor tier!

July 11th

  • ab19abd6486884141c9ab4e4e185255a77ae833e is the final-ish version of the late fusion model (a rough sketch of the architecture follows this list)
  • We are going to use alignedpitt-7-8-flucalc to start training
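
Roughly, the late fusion concatenates the ERNIE pooled output with an embedding of the FluCalc feature vector before classification; a minimal PyTorch sketch (layer names and sizes here are assumptions, not the actual model code):

import torch
import torch.nn as nn
from transformers import AutoModel

class LateFusionClassifier(nn.Module):
    def __init__(self, n_flucalc_features, hidden=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("nghuyong/ernie-2.0-en")
        self.feature_net = nn.Sequential(nn.Linear(n_flucalc_features, 64), nn.ReLU())
        self.classifier = nn.Linear(hidden + 64, 2)

    def forward(self, input_ids, attention_mask, flucalc_features):
        # text branch: pooled ERNIE representation of the (pause-encoded) window
        text = self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).pooler_output
        # feature branch: embedded FluCalc vector, fused by concatenation
        fused = torch.cat([text, self.feature_net(flucalc_features)], dim=-1)
        return self.classifier(fused)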

train: royal-pond-21

{bs: 64, epochs: 8, lr: 1e-4, length: 60, alignedpitt-7-8-flucalc-windowed.dat}.

  • Commentary: overfitting
  • Decision: dropping lr by a factor of 10; also increasing length to 70

train: fallen-dust-25

{bs: 64, epochs: 8, lr: 1e-5, length: 70, alignedpitt-7-8-flucalc-windowed.dat}.

  • Commentary: overfitting
  • Decision: dropping lr by a factor of 10, dropping batch size to 32, and training longer, to 10 epochs

train: dainty-meadow-26

{bs: 32, epochs: 10, lr: 1e-6, length: 70, alignedpitt-7-8-flucalc-windowed.dat}.

ah

  • At this point, I think it’d be good to do some feature selection
  • Let’s run a univariate feature-selection test (f_classif ANOVA F-test below; chi² requires non-negative inputs, and these features go negative) and select the 3 best features

import pandas as pd

DATA = "/Users/houliu/Documents/Projects/DBC/data/transcripts_pauses/alignedpitt-7-8-flucalc-windowed.bat"

# read pickle
df = pd.read_pickle(DATA)
# test
test_data = df[df.split=="test"]
# also, get only train data
df = df[df.split=="train"]
df
              target  mor_Utts  ...  split                                          utterance
trial sample                    ...
120-2 1049         1 -0.179084  ...  train  well the boy is getting some cookies handing o...
336-1 2492         0 -0.481740  ...  train  +oh okay, the the little girl askin(g) for the...
076-4 786          1 -0.179084  ...  train  well the little boy was looking at that cookie...
279-0 2250         1  1.980274  ...  train  kid's stool turnin(g) [pause]540[pause] over s...
014-2 151          1  0.746355  ...  train  he's fallin(g) off the chair  down here or try...
...              ...       ...  ...    ...                                                ...
208-0 1655         0 -0.481740  ...  train  the boy [pause]920[pause] is going after [paus...
492-0 2696         1 -0.179084  ...  train  oh yes quite a_lot the kid's tryin(g) to get t...
497-1 2727         1  0.129396  ...  train  what else ? &uh the see the [pause]2400[pause]...
175-2 1535         0  0.863668  ...  train  the window is open you can see out the curtain...
279-0 2261         1  1.980274  ...  train  the other kid with [pause]610[pause] the stool...

[2848 rows x 44 columns]

Let’s slice out the bits which are labels, etc., so only the feature columns remain.

in_data = df.drop(columns=["utterance", "target", "split"])
in_data.columns
Index(['mor_Utts', 'mor_Words', 'mor_syllables', '#_Prolongation',
       '%_Prolongation', '#_Broken_word', '%_Broken_word', '#_Block',
       '%_Block', '#_PWR', '%_PWR', '#_PWR-RU', '%_PWR-RU', '#_WWR', '%_WWR',
       '#_mono-WWR', '%_mono-WWR', '#_WWR-RU', '%_WWR-RU', '#_mono-WWR-RU',
       '%_mono-WWR-RU', 'Mean_RU', '#_Phonological_fragment',
       '%_Phonological_fragment', '#_Phrase_repetitions',
       '%_Phrase_repetitions', '#_Word_revisions', '%_Word_revisions',
       '#_Phrase_revisions', '%_Phrase_revisions', '#_Pauses', '%_Pauses',
       '#_Filled_pauses', '%_Filled_pauses', '#_TD', '%_TD', '#_SLD', '%_SLD',
       '#_Total_(SLD+TD)', '%_Total_(SLD+TD)', 'Weighted_SLD'],
      dtype='object')

And the labels:

out_data = df["target"]
out_data
trial  sample
120-2  1049      1
336-1  2492      0
076-4  786       1
279-0  2250      1
014-2  151       1
                ..
208-0  1655      0
492-0  2696      1
497-1  2727      1
175-2  1535      0
279-0  2261      1
Name: target, Length: 2848, dtype: int64

And now, let’s select the 3 best features.

from sklearn.feature_selection import SelectKBest, f_classif

k_best_tool = SelectKBest(f_classif, k=3)
k_best_tool.fit(in_data, out_data)

best_features = k_best_tool.get_feature_names_out()
best_features
array(['%_WWR', '%_mono-WWR', '%_Total_(SLD+TD)'], dtype=object)

OD = other disfluencies; SLD = stuttering-like disfluencies; TD = typical disfluencies; WWR = whole-word repetition

ok, let’s select those features
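
That is, roughly (variable names here are just for illustration):

selected_train = in_data[best_features]
selected_test = test_data[best_features]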

train: visionary-plasma-27

{bs: 32, epochs: 10, lr: 1e-6, length: 70, alignedpitt-7-8-flucalc-windowed.dat}. Also with feature selection.

hmmm.

I am curious what happens if we just run something like a decision tree.

in_features = df.drop(columns=["utterance", "target", "split"])
test_features = test_data.drop(columns=["utterance", "target", "split"])
in_targets = df["target"]
test_targets = test_data["target"]

seed the classifier, and fit.

from sklearn.ensemble import RandomForestClassifier

clsf = RandomForestClassifier()
clsf.fit(in_features, in_targets)
clsf.score(test_features, test_targets)
0.5932203389830508

OK nevermind. What about SVC?

from sklearn.svm import SVC

clsf = SVC()
clsf.fit(in_features, in_targets)
clsf.score(test_features, test_targets)
0.5932203389830508

Turns out deep learning still does better. I’m thinking maybe the output layer is at fault, say, in how it interacts with the loss function.

Decision: switching activation to sigmoid.
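
Concretely, that means a single-logit head squashed through a sigmoid, paired with a binary cross-entropy loss; a minimal sketch (the fused feature width is an assumption):

import torch.nn as nn

fused_width = 768 + 64          # assumed width of the fused text+feature vector
classifier = nn.Sequential(nn.Linear(fused_width, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
# loss = loss_fn(classifier(fused).squeeze(-1), targets.float())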

train: sunny-bush-31

{bs: 32, epochs: 10, lr: 1e-6, length: 70, alignedpitt-7-8-flucalc-windowed.dat}, selected features

Ok let’s think about this. Decision: added batch normalization.

train: autumn-jazz-32

{bs: 32, epochs: 10, lr: 1e-6, length: 70, alignedpitt-7-8-flucalc-windowed.dat}, selected features

The model may be overfitting on some simple heuristic; some basic statistics revealed that these variables are actually distributed quite differently.

Perhaps we should increase the complexity of the model?

train: fallen-microwave-33

{bs: 32, epochs: 10, lr: 1e-6, length: 70, alignedpitt-7-8-flucalc-windowed.dat}, selected features

Just to test, I am bumping the LR to 1e-5 to see what happens. I am very confused.

train: upbeat-flower-35

{bs: 32, epochs: 10, lr: 1e-5, length: 70, alignedpitt-7-8-flucalc-windowed.dat}, selected features

The more we work on this, the more overfit it gets. (I FORGOT A ReLU RECTIFIER)

a note

{bs: 32, epochs: 10, lr: 1e-5, length: 70, alignedpitt-7-11-flucalc-windowed.dat}, selected features

Pauses, no meta:

Pauses, meta:

so effectively a coin toss

Concerns and Questions

July 2nd

  • pitt7-1/dementia/493-0 PAR tier “tell me everything you see going on in that picture” doesn’t seem to be labeled correctly; I am guessing that’s supposed to be INV?
  • Has anyone tried to include investigator/participant cross-dialogue?

July 4th

  • Is the model overfitting on antiquated language?
  • Is the model overfitting on Cookie Theft on-topic-ness?

July 11th

  • LSTM only on pauses?