Abstract
The quality of word representations is frequently assessed using correlation with human judgements of word similarity. Here, we question whether such intrinsic evaluation can predict the merits of the representations for downstream tasks. We study the correlation between results on ten word similarity benchmarks and tagger performance on three standard sequence labeling tasks using a variety of word vectors induced from an unannotated corpus of 3.8 billion words, and demonstrate that most intrinsic evaluations are poor predictors of downstream performance. We argue that this issue can be traced in part to a failure to distinguish specific similarity from relatedness in intrinsic evaluation datasets. We make our evaluation tools openly available to facilitate further study.
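For context, the intrinsic evaluation questioned in the abstract is conventionally scored by ranking word pairs by vector cosine similarity and correlating that ranking with human judgements using Spearman's ρ. Below is a minimal sketch of that scoring procedure, assuming word vectors are available as a plain word-to-array mapping; the function names and benchmark format are illustrative assumptions, not the authors' released evaluation tools.

```python
# Sketch of intrinsic word-similarity scoring: cosine similarity between word
# vectors, rank-correlated with human similarity ratings (Spearman's rho).
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_benchmark_score(vectors, benchmark):
    """vectors: dict mapping word -> np.ndarray
    benchmark: iterable of (word1, word2, human_rating) triples."""
    model_scores, human_scores = [], []
    for w1, w2, rating in benchmark:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    # Spearman correlation between model similarities and human judgements.
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

The extrinsic comparison in the paper then asks whether ranking a set of vector models by this kind of score agrees with ranking them by downstream tagger performance.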
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP |
| Editors | Omer Levy, Felix Hill, Anna Korhonen, Roi Reichart, Yoav Goldberg, Antoine Bordes |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 1-6 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781945626142 |
| Publication status | Published - Aug 2016 |
| Externally published | Yes |
| Event | The 54th Annual Meeting of the Association for Computational Linguistics: the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany. Duration: 7 Aug 2016 → 12 Aug 2016. https://aclanthology.org/volumes/W16-25/ |
Conference
| Conference | The 54th Annual Meeting of the Association for Computational Linguistics: the 1st Workshop on Evaluating Vector-Space Representations for NLP |
| --- | --- |
| Country/Territory | Germany |
| City | Berlin |
| Period | 7/08/16 → 12/08/16 |
| Internet address | https://aclanthology.org/volumes/W16-25/ |