Houjun Liu

Zhu 2021

# ntj

DOI: 10.3389/fcomp.2021.624683


Late fusion of multimodal signals on the Cookie Theft Picture (CTP) task using Transformers, MobileNet, YAMNet, and Mockingjay


  • Similar to Martinc 2021 and Shah 2021, but used the then-current neural-network state of the art
  • Again applied late fusion, after the base models were trained
  • Proposed that inconsistency in MMSE-score diagnoses could be a major factor hindering multi-task learning performance

Notable Methods

  • Proposed base models for transfer learning: MobileNet (image), YAMNet (audio), Mockingjay (speech), and BERT (text)
  • All data sourced from recording, transcribing, and recognizing the CTP task

Key Figs

Figures 3 and 4

These figures show the late fusion architecture used
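A minimal sketch of what a late fusion pipeline like the one in those figures could look like, assuming each pretrained backbone (e.g. YAMNet for audio, BERT for text) has already produced a fixed-size embedding per sample; the embedding dimensions, head weights, and decision-averaging rule here are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def unimodal_head(emb, w, b):
    """Per-modality linear head producing a single diagnosis logit."""
    return emb @ w + b

# Toy frozen-backbone embeddings for one sample (dims are made up).
audio_emb = rng.normal(size=1024)   # e.g. a YAMNet-style audio embedding
text_emb = rng.normal(size=768)     # e.g. a BERT [CLS] text embedding

# Independently trained per-modality heads (random weights for illustration).
audio_logit = unimodal_head(audio_emb, rng.normal(size=1024), 0.0)
text_logit = unimodal_head(text_emb, rng.normal(size=768), 0.0)

# Late fusion: combine per-modality *decisions* rather than raw features,
# here by averaging logits and squashing through a sigmoid.
fused_prob = 1 / (1 + np.exp(-np.mean([audio_logit, text_logit])))
print(f"fused probability = {fused_prob:.3f}")
```

The key property of late fusion is that each modality's encoder and head can be trained separately, which matches the paper's setup of fusing only after base-model training.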

Table 2

Pre-training with an existing dataset improved over a randomly initialized model (though the improvement was not statistically quantified).

Table 3

Concat/Add fusion between audio and text features yielded even better results, confirming Martinc 2021 on newer data
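The two feature-fusion variants named in Table 3 can be sketched on toy embeddings; the dimensions and the projection used to make "add" dimensionally legal are assumptions for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy audio and text embeddings with mismatched sizes (dims are made up).
audio_emb = rng.normal(size=1024)
text_emb = rng.normal(size=768)

# Concat fusion: stack the two embeddings into one longer feature vector.
concat_fused = np.concatenate([audio_emb, text_emb])    # shape (1792,)

# Add fusion: element-wise sum, which requires matching dimensions, so
# first project audio down to the text dimension with a (hypothetical)
# learned linear map; scaling keeps the projected values well-behaved.
proj = rng.normal(size=(1024, 768)) / np.sqrt(1024)
add_fused = audio_emb @ proj + text_emb                 # shape (768,)

print(concat_fused.shape, add_fused.shape)
```

Either fused vector would then feed a shared classification head; concat preserves all per-modality dimensions at the cost of a wider input, while add keeps the input compact.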