Word embeddings for biomedical natural language processing: A survey

Billy CHIU*, Simon BAKER

*Corresponding author for this work

Research output: Journal Publications, Journal Article (refereed), peer-review

5 Citations (Scopus)

Abstract

Word representations are mathematical objects that capture the semantic and syntactic properties of words in a way that is interpretable by machines. Recently, encoding word properties into low-dimensional vector spaces using neural networks has become increasingly popular. Word embeddings are now used as the main input to natural language processing (NLP) applications, achieving cutting-edge results. Nevertheless, most word-embedding studies are carried out with general-domain text and evaluation datasets, and their results do not necessarily apply to text from other domains (e.g., biomedicine) that are linguistically distinct from general English. To achieve maximum benefit when using word embeddings for biomedical NLP tasks, they need to be induced and evaluated using in-domain resources. Thus, it is essential to create a detailed review of biomedical embeddings that can be used as a reference for researchers to train in-domain models. In this paper, we review biomedical word embedding studies from three key aspects: the corpora, models and evaluation methods. We first describe the characteristics of various biomedical corpora, and then compare popular embedding models. After that, we discuss different evaluation methods for biomedical embeddings. For each aspect, we summarize the various challenges discussed in the literature. Finally, we conclude the paper by proposing future directions that will help advance research into biomedical embeddings.

Original language: English
Article number: e12402
Journal: Language and Linguistics Compass
Volume: 14
Issue number: 12
Publication status: Published - Dec 2020
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2020 John Wiley & Sons Ltd.

Keywords

  • biomedical NLP
  • evaluation
  • word embeddings
