Neural Networks are designed to learn from numerical data.
Word embedding is about improving the ability of networks to learn from text data by representing that data as lower-dimensional vectors. These vectors are called embeddings.
This technique reduces the dimensionality of text data, and the resulting models can also learn interesting relationships between the words in a vocabulary.
A general approach for dealing with words in text data is to one-hot encode them. A text vocabulary typically has tens of thousands of unique words, so each one-hot vector has tens of thousands of entries, all 0 except for a single 1. Computations with such vectors are very inefficient: in the matrix multiplication between a one-hot vector and the first hidden layer, almost every term is a multiplication by 0, and the result is simply the one row of the weight matrix that corresponds to the word.
We use embeddings to solve this problem and greatly improve the efficiency of our network. An embedding is just like a fully connected layer. We call this layer the embedding layer and its weights the embedding weights.
So, we use this weight matrix as a lookup table. We encode the words as integers; for example, ‘cool’ is encoded as 512 and ‘hot’ as 764. Then, to get the hidden layer output for ‘cool’, we simply look up row 512 of the weight matrix. This process is called embedding lookup. The number of dimensions in the hidden layer output is the embedding dimension. In summary:
a) The embedding layer is just a hidden layer
b) The lookup table is just the embedding weight matrix
c) The lookup is just a shortcut for matrix multiplication
d) The lookup table is trained just like any weight matrix
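The equivalence between the lookup and the matrix multiplication can be checked on a toy example. The sketch below is illustrative only: the four-word vocabulary, the embedding dimension of 3, the random weights, and the helper names (one_hot, matmul_row, embedding_lookup) are all assumptions made for the demonstration, and it uses plain Python lists rather than a deep learning framework.

```python
import random

# Toy vocabulary: each word is encoded as an integer index (assumed values).
vocab = {"cool": 0, "hot": 1, "warm": 2, "cold": 3}
vocab_size = len(vocab)
embedding_dim = 3

random.seed(0)
# Embedding weight matrix: one row per word, one column per embedding dimension.
weights = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
           for _ in range(vocab_size)]

def one_hot(index, size):
    """Return a one-hot vector with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matmul_row(vector, matrix):
    """Multiply a row vector of length V by a V x D matrix."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def embedding_lookup(index, matrix):
    """Shortcut: pick the row directly instead of multiplying."""
    return matrix[index]

idx = vocab["hot"]
via_matmul = matmul_row(one_hot(idx, vocab_size), weights)
via_lookup = embedding_lookup(idx, weights)

# Both routes give the same embedding vector for the word.
print(via_matmul)
print(via_lookup)
```

The multiplication touches every row of the weight matrix, while the lookup reads a single row, which is where the efficiency gain comes from when the vocabulary has tens of thousands of words.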
Popular off-the-shelf word embedding models in use today:
Word2Vec (by Google)
GloVe (by Stanford)
fastText (by Facebook)
Word2Vec is provided by Google and is trained on Google News data. The pretrained model has 300-dimensional vectors for a vocabulary of 3 million words and phrases. The team used skip-gram and negative sampling to build this model. It was released in 2013.
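To give a rough idea of what skip-gram with negative sampling trains on, here is a minimal sketch of how the training examples could be generated. It is not Google's implementation: the function names, the window size, and the example sentence are hypothetical, and the uniform sampling of negatives is a simplification (the original work samples negatives from a smoothed word-frequency distribution).

```python
import random

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_samples(center, context, vocabulary, k, rng):
    """Pick up to k words that are neither the center nor the context word.

    Assumption: uniform sampling, used here only to keep the sketch short.
    """
    candidates = [w for w in vocabulary if w not in (center, context)]
    return rng.sample(candidates, min(k, len(candidates)))

sentence = "the weather is cool today".split()
pairs = skipgram_pairs(sentence, window=1)

rng = random.Random(0)
vocabulary = sorted(set(sentence))
center, context = pairs[0]
negatives = negative_samples(center, context, vocabulary, k=2, rng=rng)

print(pairs)      # positive (center, context) examples
print(negatives)  # words treated as negative examples for the first pair
```

Each positive pair is a word together with one of its real neighbours; the negatives are words that did not appear next to it, so the model learns to tell the two apart.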
Global Vectors for Word Representation (GloVe) is provided by Stanford. They provide various models with 25, 50, 100, 200, or 300 dimensions, trained on corpora of 2, 6, 42, or 840 billion tokens. The team used word-to-word co-occurrence counts to build this model. In other words, if two words co-occur many times, they likely have some linguistic or semantic similarity.
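The word-to-word co-occurrence idea can be illustrated with a small counting sketch. This only shows how co-occurrence counts might be collected from a toy corpus; it is not the GloVe training procedure, and the function name cooccurrence_counts, the window size, and the three example sentences are assumptions made for the illustration.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often two words appear within `window` positions of each other.

    Pairs are stored in sorted order so that (a, b) and (b, a) share one count.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts

# Hypothetical toy corpus, already tokenized.
corpus = [
    "ice is cold".split(),
    "steam is hot".split(),
    "ice is solid".split(),
]
counts = cooccurrence_counts(corpus, window=2)

for pair, n in counts.most_common(3):
    print(pair, n)
```

On a real corpus of billions of tokens, words that keep appearing near the same neighbours end up with similar count profiles, which is the signal the embedding vectors are fitted to.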
fastText is developed by Facebook. They provide 3 models with 300 dimensions each. fastText achieves good performance for word representations and sentence classification because it makes use of character-level representations. Each word is represented as a bag of character n-grams in addition to the word itself. For example, for the word partial, with n=3, the fastText representation for the character n-grams is