How contextualized are BERT, GPT-2 and ELMo word representations?
Until these recent major breakthroughs, NLP approaches were built around static word representations such as word2vec. A static embedding of a word, say "mouse", accounts poorly for the variance in that word's contextualized representations (as a rodent or as a computer peripheral).
BERT, GPT-2 and ELMo changed all that, and in a big way. They produce different representations of the word "mouse", each specific to its context, and that has led to large improvements across a wide range of NLP tasks.
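To make the idea concrete, here is a minimal sketch (not part of the original study) that pulls two contextualized "mouse" vectors out of a pretrained BERT via the Hugging Face transformers library and compares them; the sentences and helper name are illustrative.

```python
# Minimal sketch: contextual embeddings of "mouse" in two different contexts,
# using a pretrained BERT from the Hugging Face `transformers` library.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def mouse_vector(sentence):
    """Return BERT's last-layer vector for the token 'mouse' in `sentence`.

    Assumes 'mouse' is a single wordpiece in this tokenizer's vocabulary.
    """
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("mouse")]

v_rodent = mouse_vector("the cat chased the mouse across the kitchen floor")
v_gadget = mouse_vector("she clicked the mouse to open the file")

# A static embedding would force these two occurrences onto one vector;
# a contextual model gives them distinct, context-specific vectors.
cos = torch.nn.functional.cosine_similarity(v_rodent, v_gadget, dim=0)
print(f"cosine similarity between the two 'mouse' vectors: {cos:.3f}")
```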
A team of researchers set out to measure just how contextual these three models really are, using measures such as self-similarity, intra-sentence similarity and maximum explainable variance. It turns out that all three models produce highly contextualized embeddings.
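As a rough illustration of the first of these measures, the sketch below computes self-similarity as the average pairwise cosine similarity between a word's contextualized representations across its occurrences; the function name is illustrative, and the paper additionally adjusts such measures for anisotropy, which is omitted here.

```python
# Sketch of a self-similarity measure: the average cosine similarity between a
# word's contextualized representations across its different contexts.
# A word whose representation never changes with context would score 1.0.
import itertools
import torch

def self_similarity(vectors):
    """`vectors`: list of 1-D tensors, one contextualized embedding per occurrence."""
    pairs = itertools.combinations(vectors, 2)
    sims = [torch.nn.functional.cosine_similarity(a, b, dim=0) for a, b in pairs]
    return torch.stack(sims).mean().item()

# e.g. with the two "mouse" vectors from the previous sketch:
# print(self_similarity([v_rodent, v_gadget]))
```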
That in itself isn't surprising, but the following findings were. One, the embeddings of all words occupy a narrow cone in the embedding space rather than being distributed throughout it. Two, each model contextualizes words very differently. Three, on average, less than 5 percent of the variance in a word's contextualized representations can be explained by a static embedding, so static embeddings are indeed a poor substitute for contextualized ones.
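The third finding corresponds to the paper's "maximum explainable variance": the share of variance in a word's occurrence vectors captured by their first principal component, i.e. by the best possible static embedding for that word. Below is a small sketch of that quantity, following the paper's definition as I understand it (singular values of the occurrence matrix); the function name is illustrative.

```python
# Sketch of maximum explainable variance (MEV) for one word:
# sigma_1^2 / sum_i sigma_i^2 over the singular values of the matrix whose rows
# are the word's contextualized embeddings across its occurrences.
import numpy as np

def max_explainable_variance(vectors):
    """`vectors`: (n_occurrences, dim) array of contextualized embeddings."""
    X = np.asarray(vectors)
    sigma = np.linalg.svd(X, compute_uv=False)
    return float(sigma[0] ** 2 / np.sum(sigma ** 2))
```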
To learn more about this work, read the blog post written by one of the authors of the research paper. Having said that, these state-of-the-art NLP models still have a lot of ground to cover.