Evaluating Word Similarity Measure of Embeddings Through Binary Classification

DOI: https://doi.org/10.30564/jcsr.v1i3.1268

Abstract


We consider the following problem: given several neural language models (embeddings), each trained on an unknown dataset, how can we determine which model would provide better results when used for feature representation in a downstream task such as text classification or entity recognition? In this paper, we assess the word similarity measure by analyzing its impact on word embeddings learned from various datasets and on how those embeddings perform in a simple classification task. Word representations were learned and assessed under the same conditions. To train the word vectors, we used the Continuous Bag of Words implementation described in [1]. To assess the quality of the vectors, we applied the analogy-questions test for word similarity described in the same paper. Further, to measure the retrieval rate of an embedding model, we introduce a new metric, Average Retrieval Error (ARE), which measures the percentage of words missing from the model. We observe that high accuracy on syntactic and semantic similarities between word pairs is not an indicator of better classification results. This observation can be explained by the fact that a domain-specific corpus contributes more to downstream performance than a general-purpose corpus. For reproducibility, we release our experiment scripts and results.


Keywords


Word embeddings; Embeddings evaluation; Binary classification; Word2vec
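
The abstract describes three concrete steps: training Continuous Bag of Words (CBOW) vectors as in [1], scoring them on the analogy-questions test, and computing the new Average Retrieval Error (ARE) metric. Since the abstract only characterizes ARE as the percentage of words missing from the model, the sketch below is a minimal illustration under that assumption, not the paper's exact formulation; it uses gensim's Word2Vec (sg=0 selects CBOW) and the questions-words.txt analogy file bundled with gensim, and the average_retrieval_error helper is a hypothetical reconstruction.

```python
from gensim.models import Word2Vec
from gensim.test.utils import datapath


def average_retrieval_error(keyed_vectors, documents):
    """Hypothetical ARE: mean over documents of the percentage of tokens
    with no vector in the model (the paper's exact formula is not given
    in the abstract)."""
    per_doc = []
    for tokens in documents:
        if not tokens:
            continue
        missing = sum(1 for t in tokens if t not in keyed_vectors.key_to_index)
        per_doc.append(100.0 * missing / len(tokens))
    return sum(per_doc) / max(len(per_doc), 1)


# Tiny stand-in corpus; the paper's training datasets are not specified.
corpus = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["word", "embeddings", "map", "words", "to", "dense", "vectors"],
] * 20

# sg=0 selects the CBOW architecture of [1].
model = Word2Vec(sentences=corpus, vector_size=50, window=5,
                 min_count=1, sg=0, epochs=10, seed=1)

# Intrinsic evaluation: the analogy-questions test from [1]. On this toy
# vocabulary nearly every question is out-of-vocabulary, so the score is
# only meaningful when the model is trained on a real corpus.
score, sections = model.wv.evaluate_word_analogies(datapath("questions-words.txt"))
print("analogy accuracy:", score)

# Retrieval check: how many test tokens the model actually covers.
test_docs = [["a", "quick", "red", "fox"], ["dense", "vector", "models"]]
print("ARE: %.1f%%" % average_retrieval_error(model.wv, test_docs))
```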
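
The downstream task is a simple binary classification with the embeddings as features. The abstract does not state the classifier or the feature construction, so the following is one common setup offered purely as an assumption: average the vectors of a document's in-vocabulary tokens and fit scikit-learn's LogisticRegression on the resulting document vectors.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression


def doc_vector(keyed_vectors, tokens):
    """Average the vectors of in-vocabulary tokens; zero vector if every
    token is out of vocabulary (a simple convention, assumed here)."""
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors.key_to_index]
    if not vecs:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(vecs, axis=0)


# Hypothetical labeled documents standing in for the paper's datasets.
docs = [["quick", "brown", "fox"], ["lazy", "dog", "sleeps"],
        ["word", "embeddings", "as", "features"], ["dense", "word", "vectors"]]
labels = [0, 0, 1, 1]

# CBOW embeddings (sg=0), trained here on the same toy documents.
model = Word2Vec(sentences=docs, vector_size=50, min_count=1,
                 sg=0, epochs=10, seed=1)

X = np.stack([doc_vector(model.wv, d) for d in docs])
clf = LogisticRegression().fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```

A model with a lower analogy score but trained on in-domain text can still yield better features in this setup, which is the paper's central observation.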
