Houjun Liu

ConDef Abstract

Current automated lexicography (term definition) techniques cannot incorporate contextual or novel term information into their syntheses. We propose a novel data harvesting scheme that leverages the lead paragraphs of Wikipedia articles to train automated context-aware lexicographical models. Furthermore, we present ConDef, a fine-tuned BART model trained on the harvested data that defines vocabulary terms from a short context. ConDef is shown to be highly accurate in context-dependent lexicography, achieving ROUGE-1 and ROUGE-L scores of 46.40% and 43.26%, respectively, on a withheld 1000-item test set. Furthermore, we demonstrate that ConDef's syntheses serve as good proxies for term definitions, achieving a ROUGE-1 score of 27.79% directly against gold-standard WordNet definitions.

Accepted to the 2022 SAI Computing Conference; to be published in Springer Nature's Lecture Notes in Networks and Systems.