How contextualized are BERT, GPT-2 and ELMo word representations?
Until recently, NLP approaches were built around static word representations such as word2vec. A static embedding of a word like “mouse” fares poorly at accounting for the variance among its contextualized representations (rodent vs. computer gadget) because it assigns the word a single vector regardless of context.
BERT, GPT-2 and ELMo changed all that – and in a big way. They create a different representation of “mouse” for each context it appears in, and that shift drove large improvements across NLP tasks.
A team of researchers asked: just how contextual are these three models, really? Using measures such as self-similarity, intra-sentence similarity and maximum explainable variance, they found that all three models do indeed produce highly contextualized embeddings.
That much isn’t surprising, but three findings were. One, the embeddings of all words occupy a narrow cone in the embedding space rather than being distributed throughout it. Two, each model contextualizes words very differently. Three, less than 5 percent of the variance in a word’s contextualized representations can be explained by a static embedding – so static embeddings are indeed a poor substitute for contextualized ones.
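To make these measures concrete, here is a minimal sketch (not the authors’ code) of two of them, assuming you have already collected a matrix of contextualized vectors for one word, one row per occurrence. Self-similarity is the average pairwise cosine similarity of those rows; maximum explainable variance is the fraction of their variance captured by a single direction – i.e., how well one static vector could stand in for them.

```python
import numpy as np

def self_similarity(vectors):
    """Average pairwise cosine similarity between a word's
    contextualized representations across its occurrences."""
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = V @ V.T                      # all pairwise cosine similarities
    n = len(V)
    return (sims.sum() - n) / (n * (n - 1))   # mean of off-diagonal entries

def max_explainable_variance(vectors):
    """Fraction of variance in the occurrence vectors explained by
    their first principal component."""
    X = vectors - vectors.mean(axis=0)  # center the occurrences
    s = np.linalg.svd(X, compute_uv=False)
    return (s[0] ** 2) / (s ** 2).sum()

# Toy occurrence matrix for one word (rows = contexts); real inputs
# would be BERT/GPT-2/ELMo layer outputs.
mouse = np.array([[1.0, 0.1, 0.0],
                  [0.9, 0.2, 0.1],
                  [0.1, 1.0, 0.8]])
print(self_similarity(mouse))
print(max_explainable_variance(mouse))
```

A self-similarity near 1 means the word is represented almost statically; a low maximum explainable variance (the paper’s under-5-percent finding) means no single static vector captures the word’s contextual behavior.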