Noun Phrase to Vec
Overview
Noun Phrases (NPs) play a particular role in NLP applications. This code trains a word embedding model for NPs using the word2vec or fastText algorithm. It assumes that the NPs are already extracted and marked in the input corpus. All terms in the corpus are used as context to train the word embedding model; however, at the end of training, only the word embeddings of the NPs are stored. The exception is fastText training with word_ngrams=1: in this case, all word embeddings are stored, including those of non-NPs, so that word embeddings can be estimated for out-of-vocabulary NPs (NPs that do not appear in the training corpus).
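The "store only the NP embeddings" step can be pictured with a minimal sketch. It is not the code in this repository: the helper name keep_np_vectors is hypothetical, a plain dict stands in for the trained model's vocabulary, and it assumes NPs can be recognized by a trailing marking character ("_" here).

```python
def keep_np_vectors(vectors, mark_char="_"):
    """Return only the entries whose token is a marked NP.

    Assumption: a marked NP is any token that ends with mark_char.
    `vectors` is a plain dict (token -> vector) standing in for the
    vocabulary of a trained model.
    """
    return {
        token: vector
        for token, vector in vectors.items()
        if len(token) > len(mark_char) and token.endswith(mark_char)
    }


# Toy vectors for illustration only; real embeddings come from training.
trained = {
    "Natural_Language_Processing_": [0.1, 0.2],
    "is": [0.3, 0.4],
    "fun": [0.5, 0.6],
}
np_only = keep_np_vectors(trained)
```

In this sketch the context words "is" and "fun" contribute to training but are dropped before saving, which mirrors the behaviour described above for every case except fastText with word_ngrams=1.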
Note
This code can also be used to train a word embedding model on any marked corpus. For example, if you mark verbs in your corpus, you can train a verb2vec model.
NPs have to be marked in the corpus by a marking character placed between the words of the NP and as a suffix of the NP. For example, if the marking character is “_”, the NP “Natural Language Processing” will be marked as “Natural_Language_Processing_”.
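As an illustration of the marking rule stated above (marking character between the words of the NP and as a suffix), here is a small sketch of how a tokenized sentence could be marked before training. The function names and the (start, end) span format are assumptions made for this example; they are not part of this repository.

```python
def mark_noun_phrase(words, mark_char="_"):
    """Join the words of one NP with mark_char and append it as a suffix."""
    return mark_char.join(words) + mark_char


def mark_sentence(tokens, np_spans, mark_char="_"):
    """Replace each NP span in `tokens` with a single marked token.

    `np_spans` is a list of half-open (start, end) token index pairs,
    assumed to be non-overlapping and sorted by start index.
    """
    marked = []
    position = 0
    for start, end in np_spans:
        marked.extend(tokens[position:start])  # context words stay as-is
        marked.append(mark_noun_phrase(tokens[start:end], mark_char))
        position = end
    marked.extend(tokens[position:])  # remaining words after the last NP
    return marked


tokens = ["I", "study", "Natural", "Language", "Processing", "daily"]
marked_tokens = mark_sentence(tokens, [(2, 5)])
```

Merging each NP into one token lets the embedding algorithm treat it as a single vocabulary entry, while the trailing marking character makes NPs easy to tell apart from ordinary words.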
We use the CONLL2000 shared task dataset in the default parameters of our example for training the NP2vec model. The terms and conditions of the data set license apply. Intel does not grant any rights to the data files.
Files
Running Modalities
Training
To train the model with default parameters, the following command can be used:
python examples/np2vec/train.py \
--corpus sample_corpus.json \
--corpus_format json \
--np2vec_model_file sample_np2vec.model
Inference
To run inference with a saved model, the following command can be used:
python examples/np2vec/inference.py --np2vec_model_file sample_np2vec.model --np <noun phrase>
More details about the hyperparameters are available at https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec for word2vec and at https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText for fastText.