# Using Gensim Word2Vec Embeddings in Keras

A short post about using Gensim Word2Vec embeddings in Keras, with example code.

# Introduction

This will be a quick post about using Gensim’s Word2Vec embeddings in Keras. This topic has been covered elsewhere by other people, but I thought another code example and explanation might be useful.

# Resources

- Keras Blog: Francois Chollet wrote a whole post about this exact topic a few weeks ago, so that's the authoritative source on how to do this.
- GitHub Issue: Another reference, with some relevant code.
- Discussion on the Google Group: This topic was hashed out about a year ago on the Keras Google Group, and has since migrated to its own Slack channel.

# Installing Dependencies

Usually `pip install ...` works if you don't already have Keras or Gensim installed.

# Tokenizing

> In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. (Wikipedia)

We want to tokenize each string to get a list of words, usually by lowercasing everything and splitting on whitespace. In contrast, lemmatization reduces each word to its root form, which can be helpful but is more computationally expensive (enough so that you would want to preprocess your text rather than do it on the fly).
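As a minimal sketch, that lowercase-and-split approach is a one-liner (the function name here is my own):

```python
def tokenize(text):
    """Lowercase a string and split it on whitespace into a list of tokens."""
    return text.lower().split()

print(tokenize("The quick brown Fox jumps"))
# ['the', 'quick', 'brown', 'fox', 'jumps']
```

Note that `str.split()` with no arguments collapses runs of whitespace, so stray double spaces or tabs don't produce empty tokens.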

# Create Embeddings

We first create a SentenceGenerator class which yields our text line by line, tokenized. This generator is passed to the Gensim Word2Vec model, which takes care of the training for us. We can pass keyword arguments through our function to the model via `**params`.
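A minimal sketch of that setup, assuming a plain-text corpus file with one sentence per line; the function and class names are my own, and any keyword arguments you pass (e.g. `vector_size`, `min_count` in the Gensim 4.x API) are forwarded straight to Word2Vec:

```python
class SentenceGenerator:
    """Yield one tokenized (lowercased, whitespace-split) line at a time.

    Implemented as an iterable class rather than a plain generator function
    so that Word2Vec can iterate over the corpus multiple times: once to
    build the vocabulary, then once per training epoch.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield line.lower().split()


def create_embeddings(corpus_path, **params):
    """Train a Word2Vec model on the corpus, forwarding **params to Gensim."""
    from gensim.models import Word2Vec  # imported here; requires gensim
    sentences = SentenceGenerator(corpus_path)
    return Word2Vec(sentences, **params)
```

Usage would look something like `model = create_embeddings("corpus.txt", vector_size=100, min_count=5)`.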

## Key Observation

The syn0 weight matrix in Gensim corresponds exactly to the weights of the Embedding layer in Keras, so we dump it to a file for later use. We also want to save the vocabulary so that we know which row of the Gensim weight matrix corresponds to which word; in Keras, this dictionary tells us which index to pass to the Embedding layer for a given word. We'll dump it as a JSON file to keep it human-readable.
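One way to sketch both the save step and the Keras side, assuming Gensim 4.x (where the old `syn0` matrix is exposed as `model.wv.vectors` and the word-to-row mapping as `model.wv.key_to_index`); the function names are my own:

```python
import json

import numpy as np


def save_embeddings(model, weights_path, vocab_path):
    """Persist the embedding matrix and the word -> row-index mapping."""
    # In Gensim 4.x the old syn0 matrix lives at model.wv.vectors;
    # each ROW is the vector for one vocabulary word.
    np.save(weights_path, model.wv.vectors)
    # key_to_index maps each word to its row in the matrix.
    with open(vocab_path, "w") as f:
        json.dump(dict(model.wv.key_to_index), f)


def load_embedding_layer(weights_path, trainable=False):
    """Build a Keras Embedding layer initialized with the saved weights."""
    from keras.layers import Embedding  # imported here; requires Keras
    from keras.initializers import Constant

    weights = np.load(weights_path)
    return Embedding(input_dim=weights.shape[0],   # vocabulary size
                     output_dim=weights.shape[1],  # embedding dimension
                     embeddings_initializer=Constant(weights),
                     trainable=trainable)
```

With `trainable=False` the pre-trained vectors stay frozen; set `trainable=True` if you want to fine-tune them along with the rest of the network.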