Abstract
Alzheimer’s Disease (AD) is a neurodegenerative disease marked by declines in cognitive function. Despite early diagnosis being critical for AD prognosis and treatment, currently accepted diagnostic mechanisms for AD require clinical outpatient testing with a medical professional, which reduces their accessibility. In this work, we propose a possible feature extraction mechanism that leverages the previously demonstrated errors of Hidden Markov-based forced alignment (FA) tools on cognitively impaired patients as an automated means of quantifying linguistic disfluency.
Background
Annotated linguistic disfluency features, used in combination with semantic features, have been shown (Antonsson et al. 2021) to improve the accuracy of AD classification systems. However, manual annotation of disfluency hinders the throughput of AD detection systems. Furthermore, there is a dearth (Guo et al. 2021) of data with preexisting disfluency annotations.
Existing acoustic-only approaches (Lindsay, Tröger, and König 2021; Shah et al. 2021) frequently place their focus on speech features such as silence, energy, rate, or loudness. While this approach has returned promising results (Wang et al. 2019), the extracted acoustic features remain independent of actual linguistic disfluency. Of course, some approaches (including that of Wang et al. 2019) perform separate, manual annotation of both aspects and treat them jointly with late fusion. However, no existing approach has an effective feature representation that bridges the acoustic-linguistic gap.
An incidental effect of Hidden Markov Model (HMM) based Viterbi forced alignment (FA) tools (such as P2FA) is that their alignment quality has been shown (Saz et al. 2009) to degrade for cognitively impaired speakers, resulting from a roughly \(50\%\) decrease in the power of discrimination between stressed and unstressed vowels. Other ASR and FA approaches (Tao, Xueqing, and Bian 2010) have since been designed to discriminate such changes more effectively.
Proposal
By encoding the FA results of HMM-based approaches in an embedding space, we introduce a novel feature representation of acoustic information. As FA requires an existing transcript, this method is considered semi-automated: the test must be either administered with a common transcript, transcribed manually afterwards, or transcribed using ASR techniques. After encoding, the proposed feature can be used in a few ways.
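As a concrete illustration of the encoding step, the sketch below embeds a forced-alignment result (a list of labeled segments with start and end times, the typical output shape of aligners such as P2FA) as a fixed-length vector of segment durations. The function name, the dimension, and the duration-based embedding itself are all illustrative assumptions, not a prescribed design; a learned encoder could replace it.

```python
import numpy as np

def embed_alignment(alignment, dim=16):
    """Embed a forced-alignment result as a fixed-length duration vector.

    `alignment` is a list of (label, start_s, end_s) tuples, the typical
    output of an HMM-based forced aligner. Alignments longer than `dim`
    segments are truncated; shorter ones are zero-padded. This is a
    deliberately simple, hypothetical embedding for illustration.
    """
    durations = [end - start for _, start, end in alignment]
    vec = np.zeros(dim)
    n = min(dim, len(durations))
    vec[:n] = durations[:n]
    return vec

# Toy alignment: "sp" marks a long silent pause, the kind of disfluency
# an HMM aligner tends to place poorly for cognitively impaired speakers.
fa = [("DH", 0.00, 0.08), ("AH", 0.08, 0.21), ("sp", 0.21, 0.95), ("K", 0.95, 1.10)]
v = embed_alignment(fa, dim=8)
```

In practice the vector would be computed per utterance and fed downstream as described in the following sections.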
Euclidean distance
The Euclidean distance approach compares the embedding of the HMM FA vector against a “reference” benchmark via the Euclidean (L2) distance in high-dimensional space.
There are two possible modalities by which the “reference” can be acquired. If the data was sourced via the patient reading a standardized transcript, a reference FA sample could be provided via the audio of another individual, traditionally screened as not having AD, reading the same transcript. The “deviation from reference” would then be used as an input feature group for any proposed model architecture.
Alternatively, as stated above, other FA approaches are less susceptible to the decreased discriminatory power exhibited by cognitively impaired speakers. We could therefore equally take the Euclidean distance between the embedded results of two different FA mechanisms (one shown to be more robust to cognitively impaired speakers and one not) as input features to the training architectures.
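Both modalities reduce to the same computation: a distance between two alignment embeddings. A minimal sketch, assuming fixed-length embedding vectors as inputs (the vector values below are made-up toy numbers):

```python
import numpy as np

def alignment_deviation(sample_vec, reference_vec):
    """Euclidean distance between two alignment embeddings.

    `reference_vec` may come from either modality described above: a
    non-AD speaker reading the same standardized transcript, or a second,
    more robust forced aligner run on the same audio.
    """
    return float(np.linalg.norm(np.asarray(sample_vec) - np.asarray(reference_vec)))

# Toy embeddings: the sample's third segment (a pause) is much longer
# than the reference's, so the deviation is dominated by that component.
sample = np.array([0.08, 0.13, 0.74, 0.15])
reference = np.array([0.07, 0.12, 0.20, 0.14])
d = alignment_deviation(sample, reference)
```

The scalar `d` (or a per-segment vector of deviations) would then join the input feature group of the downstream classifier.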
Cross-Attention
One key issue with the Euclidean distance approach is that it cannot distinguish prognostic disfluencies from “normal” pauses, changes in speaker pace, and similar variations, which differ between speakers even when controlling for AD prognosis.
In computer vision, few-shot classification with cross-attention (Hou et al. 2019) has shown promising results in discrimination; furthermore, trainable cross-attention allows more flexible handling of non-prognostic verbal disturbances, such as a normal change in pace, which would otherwise cause a large difference under the Euclidean distance approach.
In practice, a model similar to that proposed by (Hou et al. 2019) would be used as the basis to encode (or even discriminate between) pairwise samples of different FA approaches, or a sample against a non-AD control, as highlighted in the section above.
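To make the pairwise mechanism concrete, the sketch below shows scaled dot-product cross-attention between two sequences of alignment embeddings (e.g. a patient sample attending over a reference or control). The learned projection matrices of a real model are omitted for brevity; the shapes and sequence lengths are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, key_seq):
    """Scaled dot-product cross-attention between two alignment
    embedding sequences.

    query_seq: (n_q, d) patient-sample segment embeddings
    key_seq:   (n_k, d) reference/control segment embeddings
    In a trainable model, learned query/key/value projections would be
    applied first; they are omitted here for brevity.
    """
    d = query_seq.shape[-1]
    scores = query_seq @ key_seq.T / np.sqrt(d)  # (n_q, n_k)
    weights = softmax(scores, axis=-1)           # rows sum to 1
    return weights @ key_seq                     # (n_q, d)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))   # e.g. 5 segments from the patient sample
k = rng.normal(size=(7, 8))   # e.g. 7 segments from the control sample
attended = cross_attention(q, k)
```

Because the attention weights align each sample segment with the most similar reference segments, benign pacing differences can be absorbed by the attention pattern rather than inflating a raw distance.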
As input features
Of course, the raw FA embedding can also be used directly as an input feature. There is little prior work on this front, as this project would be, to our knowledge, the first to propose the use of forced-aligner outputs as an input feature heuristic.