Houjun Liu

Forced-Alignment Error for Feature Extraction for Acoustic AD Detection


Alzheimer’s Disease (AD) is a neurodegenerative disease marked by declines in cognitive function. Although early diagnosis is critical for AD prognosis and treatment, currently accepted diagnostic mechanisms for AD require clinical outpatient testing with a medical professional, which reduces their accessibility. In this work, we propose a possible feature extraction mechanism that leverages the previously demonstrated errors of Hidden Markov Model (HMM) based forced alignment (FA) tools on cognitively impaired patients as an automated means to quantify linguistic disfluency.


Annotated linguistic disfluency features, used in combination with semantic features, have been shown (Antonsson et al. 2021) to improve the accuracy of AD classification systems. However, manual annotation of disfluency limits the throughput of AD detection systems. Furthermore, there is a dearth (Guo et al. 2021) of data released with existing annotations.

Existing acoustic-only approaches (Lindsay, Tröger, and König 2021; Shah et al. 2021) frequently focus on low-level speech features such as silence, energy, rate, or loudness. While this approach has returned promising results (Wang et al. 2019), the acoustic features it extracts are independent of actual linguistic disfluency. Some approaches, including that of Wang et al. (2019), perform separate, manual annotation of both aspects and combine them with late fusion. However, no existing approach has an effective feature representation that bridges the acoustic-linguistic gap.

An incidental property of Hidden Markov Model (HMM) based Viterbi forced alignment (FA) tools (such as P2FA) is that their alignment quality has been shown (Saz et al. 2009) to be lower for cognitively impaired speakers, resulting from a roughly \(50\%\) decrease in the power of discrimination between stressed and unstressed vowels. Other ASR and FA approaches (Tao, Xueqing, and Bian 2010) have since been designed to handle such changes more effectively.


By encoding the FA results of HMM-based approaches in an embedding space, we introduce a novel feature representation of acoustic information. Because FA requires an existing transcript, this method is considered semi-automated: the test must either be administered using a common transcript, be transcribed manually afterwards, or be transcribed using ASR techniques. After encoding, the proposed feature can be used in a few ways.
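As a minimal sketch of what such an encoding could look like, the snippet below turns the word-level time intervals produced by a forced aligner into a fixed-length vector of pooled word durations and inter-word pauses. The function name `fa_feature_vector`, the `(start, end)` interval format, and the pooling scheme are all assumptions made for illustration; this hand-crafted summary is only a stand-in for whatever learned embedding is eventually chosen.

```python
import numpy as np

def fa_feature_vector(intervals, n_bins=8):
    """Summarise forced-alignment word intervals as a fixed-length vector.

    intervals: non-empty list of (start_sec, end_sec), one per aligned word,
               in utterance order (hypothetical input format).
    Returns a vector of length 2 * n_bins: mean word duration per chunk of
    the utterance, followed by mean pause-after-word per chunk.
    """
    arr = np.asarray(intervals, dtype=float)
    durations = arr[:, 1] - arr[:, 0]
    # Gap between the end of each word and the start of the next one;
    # the final word has no following gap, so it is assigned 0.
    pauses = np.append(arr[1:, 0] - arr[:-1, 1], 0.0)

    def pool(values):
        # Split into n_bins roughly equal chunks and average each chunk,
        # so utterances of different lengths map to the same dimension.
        chunks = np.array_split(values, n_bins)
        return np.array([c.mean() if c.size else 0.0 for c in chunks])

    return np.concatenate([pool(durations), pool(pauses)])

# Four aligned words, pooled into two chunks.
example = fa_feature_vector(
    [(0.0, 0.5), (0.7, 1.0), (1.0, 1.4), (2.0, 2.5)], n_bins=2
)
```

Pooling into a fixed number of chunks is one simple way to make recordings of different lengths comparable; it discards word identity, which a learned encoder would presumably retain.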

Euclidean distance

The Euclidean Distance approach compares the embedding of the HMM FA output with a “reference” benchmark by taking the Euclidean (L2) distance between the two vectors in the high-dimensional embedding space.

There are two possible ways in which the “reference” can be acquired. If the data was collected by having the patient read a standardized transcript, a reference FA sample could be obtained from the audio of another individual, screened by traditional means as not having AD, reading the same transcript. The “deviation from reference” would then be used as an input feature group for any proposed model architecture.

Alternatively, as stated before, other FA approaches are less susceptible to the decreased discriminatory power seen in impaired speech. Therefore, we could equally take the Euclidean distance between the embedded results of two different FA mechanisms (one shown to be more robust to cognitively impaired speakers and one not) as input features for training.
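The distance itself is simply \(d(x, r) = \lVert x - r \rVert_2\), where \(x\) is the embedded FA result for the sample and \(r\) is the reference embedding (from a non-AD reader, or from a second, more robust aligner). A minimal sketch follows; the function name `deviation_from_reference` and the assumption that both embeddings are already fixed-length vectors of equal dimension are ours.

```python
import numpy as np

def deviation_from_reference(sample_vec, reference_vec):
    """Euclidean (L2) distance between two equal-length embedding vectors."""
    sample = np.asarray(sample_vec, dtype=float)
    reference = np.asarray(reference_vec, dtype=float)
    if sample.shape != reference.shape:
        raise ValueError("embeddings must have the same shape")
    return float(np.linalg.norm(sample - reference))

# Toy 2-D example: a 3-4-5 right triangle.
distance = deviation_from_reference([3.0, 4.0], [0.0, 0.0])
```

A single scalar discards which dimensions differ; the element-wise difference `sample - reference` could be passed on instead if a model should see where the deviation occurs.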


One key issue with the Euclidean Distance approach is that it is sensitive to differences in “normal” pauses, changes in speaker pace, and similar effects, which vary between speakers even when controlling for AD status.

In computer vision, cross-attention for few-shot classification (Hou et al. 2019) has shown promising results in discrimination. Furthermore, trainable cross-attention allows more flexible handling of non-prognostic verbal disturbances, such as a normal change in pace, which would otherwise cause a large difference under the Euclidean Distance approach.

In practice, a model similar to that proposed by Hou et al. (2019) would be used as the basis to encode (or even discriminate between) pairwise samples, either from different FA approaches or against a non-AD control, as described in the section above.
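To make the idea concrete, the sketch below shows generic scaled dot-product cross-attention between two sequences of per-word FA features, one from the sample and one from the reference. This is not the Cross Attention Network of Hou et al. (2019), which is built around correlation maps over image feature maps; it is a simpler, widely used form of cross-attention substituted here for illustration. The projection matrices are random placeholders for weights that would be learned, and all names and shapes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Attend from query_seq (n_q, d_in) over context_seq (n_c, d_in).

    Returns the attended features (n_q, d_out) and the attention
    weights (n_q, n_c); each row of the weights sums to 1.
    """
    q = query_seq @ w_q
    k = context_seq @ w_k
    v = context_seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Hypothetical shapes: 5 sample words and 7 reference words,
# each described by 4 FA-derived features, projected to 3 dimensions.
rng = np.random.default_rng(0)
sample_seq = rng.normal(size=(5, 4))
reference_seq = rng.normal(size=(7, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 3)) for _ in range(3))
attended, weights = cross_attention(sample_seq, reference_seq, w_q, w_k, w_v)
```

Because each sample position takes a weighted combination of reference positions, a uniformly slower reading need not inflate the comparison the way a position-by-position distance would, provided the weights are trained to ignore such shifts.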

As input features

Of course, the raw FA embedding can also be used directly as an input feature. There is less prior work on this front because this project would be, as far as we know, the first to propose forced aligner outputs as an input feature.


Antonsson, Malin, Kristina Lundholm Fors, Marie Eckerström, and Dimitrios Kokkinakis. 2021. “Using a Discourse Task to Explore Semantic Ability in Persons with Cognitive Impairment.” Frontiers in Aging Neuroscience 12 (January): 607449. doi:10.3389/fnagi.2020.607449.
Guo, Yue, Changye Li, Carol Roan, Serguei Pakhomov, and Trevor Cohen. 2021. “Crossing the ‘Cookie Theft’ Corpus Chasm: Applying What BERT Learns from Outside Data to the ADReSS Challenge Dementia Detection Task.” Frontiers in Computer Science 3 (April): 642517. doi:10.3389/fcomp.2021.642517.
Hou, Ruibing, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. 2019. “Cross Attention Network for Few-Shot Classification.” Advances in Neural Information Processing Systems 32.
Lindsay, Hali, Johannes Tröger, and Alexandra König. 2021. “Language Impairment in Alzheimer’s Disease—Robust and Explainable Evidence for AD-Related Deterioration of Spontaneous Speech through Multilingual Machine Learning.” Frontiers in Aging Neuroscience 13 (May): 642033. doi:10.3389/fnagi.2021.642033.
Saz, Oscar, Javier Simón, W. Ricardo Rodríguez, Eduardo Lleida, and Carlos Vaquero. 2009. “Analysis of Acoustic Features in Speakers with Cognitive Disorders and Speech Impairments.” EURASIP Journal on Advances in Signal Processing 2009. Springer: 1–11.
Shah, Zehra, Jeffrey Sawalha, Mashrura Tasnim, Shi-ang Qi, Eleni Stroulia, and Russell Greiner. 2021. “Learning Language and Acoustic Models for Identifying Alzheimer’s Dementia from Speech.” Frontiers in Computer Science 3 (February): 624659. doi:10.3389/fcomp.2021.624659.
Tao, Ye, Li Xueqing, and Wu Bian. 2010. “A Dynamic Alignment Algorithm for Imperfect Speech and Transcript.” Computer Science and Information Systems 7 (1): 75–84. doi:10.2298/CSIS1001075T.
Wang, Tianqi, Chongyuan Lian, Jingshen Pan, Quanlei Yan, Feiqi Zhu, Manwa L. Ng, Lan Wang, and Nan Yan. 2019. “Towards the Speech Features of Mild Cognitive Impairment: Universal Evidence from Structured and Unstructured Connected Speech of Chinese.” In Interspeech 2019, 3880–84. ISCA. doi:10.21437/Interspeech.2019-2414.