Information Extraction
Noun Phrase to Vec
Overview
Noun Phrases (NPs) play an important role in NLP applications. This code trains a word embedding model for NPs using the word2vec or FastText algorithm. It assumes that the NPs are already extracted and marked in the input corpus. All the terms in the corpus are used as context to train the word embedding model; however, at the end of training, only the embeddings of the NPs are stored, except in the case of FastText training with word_ngrams=1. In that case, we store all the word embeddings, including those of non-NPs, in order to be able to estimate embeddings of out-of-vocabulary NPs (NPs that do not appear in the training corpus).
Note
This code can also be used to train a word embedding model on any marked corpus. For example, if you mark verbs in your corpus, you can train a verb2vec model.
NPs have to be marked in the corpus by a marking character placed between the words of the NP and as a suffix of the NP. For example, if the marking character is “_”, the NP “Natural Language Processing” will be marked as “Natural_Language_Processing_”.
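For illustration, here is a minimal sketch of such marking, assuming the NPs have already been extracted as token spans (the helper below is illustrative and not part of the NP2vec code):

# Illustrative helper: mark pre-extracted NP spans in a tokenized sentence.
def mark_nps(tokens, np_spans, marker="_"):
    """Join the tokens of each NP span with the marking character and
    append the marker as a suffix."""
    spans = dict(np_spans)  # start index -> end index (exclusive)
    marked, i = [], 0
    while i < len(tokens):
        if i in spans:
            end = spans[i]
            marked.append(marker.join(tokens[i:end]) + marker)
            i = end
        else:
            marked.append(tokens[i])
            i += 1
    return marked

print(mark_nps(["I", "love", "Natural", "Language", "Processing"], [(2, 5)]))
# ['I', 'love', 'Natural_Language_Processing_']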
We use the CoNLL-2000 shared task dataset with the default parameters of our example for training the NP2vec model. The terms and conditions of the data set license apply. Intel does not grant any rights to the data files.
Files
- examples/np2vec/train.py: trains an NP2vec model on a marked corpus.
- examples/np2vec/inference.py: loads a trained NP2vec model and queries the embedding of a given NP.
Running Modalities
Training
To train the model with default parameters, the following command can be used:
python examples/np2vec/train.py \
--corpus sample_corpus.json \
--corpus_format json \
--np2vec_model_file sample_np2vec.model
Inference
To run inference with a saved model, the following command can be used:
python examples/np2vec/inference.py --np2vec_model_file sample_np2vec.model --np <noun phrase>
More details about the hyperparameters can be found at https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec for word2vec and https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText for FastText.
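To query the trained embeddings programmatically rather than through the inference script, a minimal sketch with gensim could look like the following. It assumes the model was saved in gensim’s word2vec keyed-vectors text format, which is an assumption and may not match the save options you used:

# Sketch only: assumes the np2vec model was saved in gensim word2vec text format.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("sample_np2vec.model", binary=False)
np_key = "Natural_Language_Processing_"  # NP words joined and suffixed by the marking character
if np_key in model:
    print(model.most_similar(np_key, topn=5))
else:
    print("NP not found in the trained vocabulary")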
Cross Document Co-Reference
Overview
Cross document coreference resolution is the task of determining which event or entity mentions expressed in language refer to the same real-world event or entity across different documents within the same topic.
Definitions:
- Event mention refers to verb and action phrases in a document text.
- Entity mention refers to object, location, person, time and similar phrases in a document text.
- Document refers to a text article (with one or more sentences) on a single subject that contains entity and event mentions.
- Topic refers to a set of documents that are on the same subject or topic.
Sieve-based System
The cross document coreference system provided is a sieve-based system. A sieve is a logical layer that uses a single semantic relation identifier to extract a certain relation type. See detailed descriptions of the relation identifiers and types of relations in Identifying Semantic Relations.
The sieve-based system consists of a set of configurable sieves. Each sieve uses rule-based computational logic or an external knowledge resource to extract semantic relations between pairs of event or entity mentions, with the purpose of clustering identical or semantically similar mentions across multiple documents.
Refer to the Configuration section below to see how to configure a sieve-based system.
Results
The sieve-based system was tested on the ECB+ [1] corpus and evaluated using the CoNLL F1 metric (Pradhan et al., 2014).
The ECB+ corpus component consists of 502 documents that belong to 43 topics, annotated with mentions of events and their times, locations, human and non-human participants as well as with within- and cross-document event and entity coreference information.
The system achieved the following:
- Best-in-class results achieved on ECB+ entity cross document coreference (69.8% F1) using the sieve set [Head Lemma, Exact Match, Wikipedia Redirect, Wikipedia Disambiguation, Elmo]
- Best-in-class results achieved on ECB+ event cross document coreference (79.0% F1) using the sieve set [Head Lemma, Exact Match, Wikipedia Redirect, Wikipedia Disambiguation, Fuzzy Head]
[1] Agata Cybulska and Piek Vossen. 2014. Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). ECB+ annotation is held copyright by Agata Cybulska, Piek Vossen and the VU University of Amsterdam.
Requirements
- Make sure all intended relation identifier resources are available and configured properly. Refer to Identifying Semantic Relations to see how to use and configure the identifiers.
- Prepare a JSON file with mentions to be used as input for the sieve-based cross document coreference system (a minimal loading sketch appears after the example file locations below):
[
{
"topic_id": "2_ecb", #Required (a topic is a set of multiple documents that share the same subject)
"doc_id": "1_10.xml", #Required (the article or document id this mention belong to)
"sent_id": 0, #Optional (mention sentence number in document)
"tokens_number": [ #Optional (the token number in sentence, will be required when using Within doc entities)
13
],
"tokens_str": "Josh", #Required (the mention text)
},
{
"topic_id": "2_ecb", #Required
"doc_id": "1_11.xml",
"sent_id": 0,
"tokens_number": [
3
],
"tokens_str": "Reid",
},
...
]
- An example of an ECB+ entity mentions JSON file can be found at:
<nlp architect root>/datasets/ecb/ecb_all_entity_mentions.json
- An example of an ECB+ event mentions JSON file can be found at:
<nlp architect root>/datasets/ecb/ecb_all_event_mentions.json
Configuration
There are two modes of operation:
- Entity mentions cross document coreference - for clustering entity mentions across multiple documents
- Event mentions cross document coreference - for clustering event mentions across multiple documents
For each mode of operation there is a corresponding method in cross_doc_sieves:
- run_event_coref() - runs event coreference resolution
- run_entity_coref() - runs entity coreference resolution
Each mode of operation requires a configuration. The configuration defines which sieves should run, in what order, and with which constraints and thresholds:
- Use EventSievesConfiguration for configuring the sieves needed for computing event mentions
- Use EntitySievesConfiguration for configuring the sieves needed for computing entity mentions
Configuring sieves_order enables control over the sieve pipeline; sieves_order is a list of tuples (RelationType, threshold).
Use SievesResources to set the correct paths to all files downloaded or created for the different types of sieves.
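A hedged sketch of an entity configuration is shown below. The import paths, the RelationType members chosen, and the threshold values are assumptions for illustration; refer to examples/cross_doc_coref/cross_doc_coref_sieves.py for a working configuration.

# Hedged sketch: module paths below are assumptions and may differ in your
# nlp_architect version; RelationType members and thresholds are placeholders.
from nlp_architect.models.cross_doc_coref.sieves_config import EntitySievesConfiguration  # assumed path
from nlp_architect.models.cross_doc_coref.sieves_resource import SievesResources          # assumed path
from nlp_architect.data.cdc_resources.relations.relation_types import RelationType        # assumed path

entity_config = EntitySievesConfiguration()
# sieves_order is a list of (RelationType, threshold) tuples, run in order.
entity_config.sieves_order = [
    (RelationType.WIKIPEDIA_REDIRECT_LINK, 0.1),
    (RelationType.WORDNET_SAME_SYNSET, 1.0),
]

resources = SievesResources()  # set paths to the downloaded/created resource files here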
Sieve-based system flow
The flow of the sieve-based system is identical for event and entity resolution:
1. Load all mentions from the input file (mentions JSON file).
2. Separate each mention into a singleton cluster (a cluster initialized with only one mention) and group the clusters by topic (so each topic has a set of clusters that belong to it) according to the input values.
3. Run the configured sieves iteratively in the order determined by the sieves_order configuration parameter. For each sieve:
3.1. Go over all clusters in a topic and try to merge two clusters at a time using the current sieve's RelationType.
3.2. Continue until no more mergers are possible with this RelationType.
4. Continue to the next sieve and repeat (3.1) on the current state of the clusters until no more sieves are left to run.
5. Return the cluster results.
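The following illustrative sketch (plain Python, not the library's implementation) mirrors this flow:

def run_sieves(mentions_by_topic, sieves_order, relation_holds):
    """Illustrative sketch of the flow above, not the library implementation.
    sieves_order: list of (relation_type, threshold) tuples.
    relation_holds(cluster_a, cluster_b, relation_type, threshold) -> bool."""
    results = {}
    for topic, mentions in mentions_by_topic.items():
        # Step 2: every mention starts in its own singleton cluster, grouped by topic.
        clusters = [[m] for m in mentions]
        # Step 3: run the sieves in the configured order.
        for relation_type, threshold in sieves_order:
            merged = True
            while merged:  # Step 3.2: repeat until no mergers remain for this sieve.
                merged = False
                for i, cluster_a in enumerate(clusters):
                    for cluster_b in clusters[i + 1:]:
                        # Step 3.1: try to merge two clusters with the current RelationType.
                        if relation_holds(cluster_a, cluster_b, relation_type, threshold):
                            cluster_a.extend(cluster_b)
                            clusters.remove(cluster_b)
                            merged = True
                            break
                    if merged:
                        break
        results[topic] = clusters  # Step 5: return the clusters per topic.
    return results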
See code example below for running a full cross document coreference evaluation or refer to the documentation for further details.
Code Example
You can find code example for running the system at: examples/cross_doc_coref/cross_doc_coref_sieves.py
Identifying Semantic Relations
Overview
Semantic relation identification is the task of determining whether there is a relation between two entities. Those entities could be event mentions (referring to verbs and action phrases) or entity mentions (referring to objects, locations, persons, time, etc.). Described below are 6 different methods for extracting relations using external data resources: Wikipedia, WordNet, word embeddings, computational rules, Referent-Dictionary and VerbOcean.
Each semantic relation identifier below is capable of identifying a set of pre-defined relation types between two events or two entity mentions.
Note
Each relation identifier can be configured to initialize and run in different modes, as described in the Initialization options code examples below: working online directly against the resource's website, against a locally stored resource dataset, or against a snapshot of the resource containing only relevant data (created according to some input dataset defined by the user).
In order to prepare a resource snapshot, refer to Downloading and generating external resources data.
Wikipedia
- Use WikipediaRelationExtraction to extract relations based on Wikipedia page information.
- Supports: Event and Entity mentions.
Relation types
- Redirect Links: the two mentions have the same Wikipedia redirect link (see: Wiki-Redirect for more details)
- Aliases: one mention is a Wikipedia alias of the other input mention (see: Wiki-Aliases for more details)
- Disambiguation: one input mention is a Wikipedia disambiguation of the other input mention (see: Wiki-Disambiguation for more details)
- Category: one input mention is a Wikipedia category of the other input mention (see: Wiki-Category for more details)
- Title Parenthesis: one input mention is a Wikipedia title parenthesis of the other input mention (see: Extracting Lexical Reference Rules from Wikipedia for more details)
- Be-Comp / Is-A: one input mention has an ‘is-a’ relation that contains the other input mention (see: Extracting Lexical Reference Rules from Wikipedia for more details)
Initialization options
# 3 methods for Wikipedia extractor initialization (running against wiki web site, data sub-set or local elastic DB)
# Online initialization for full data access against Wikipedia site
wiki_online = WikipediaRelationExtraction(WikipediaSearchMethod.ONLINE)
# Or use offline initialization if created a snapshot
wiki_offline = WikipediaRelationExtraction(WikipediaSearchMethod.OFFLINE, ROOT_DIR + '/mini_wiki.json')
# Or use elastic initialization if you created a local database of wikipedia
wiki_elastic = WikipediaRelationExtraction(WikipediaSearchMethod.ELASTIC, host='localhost', port=9200, index='enwiki_v2')
Wordnet
- Use WordnetRelationExtraction to extract relations based on WordNet.
- Supports: Event and Entity mentions.
Relation types
- Derivationally - Terms in different syntactic categories that have the same root form and are semantically related
- Synset - A synonym set; a set of words that are interchangeable in some context without changing the truth value of the proposition in which they are embedded
See: WordNet Glossary for more details.
Initialization options
# 2 methods for Wordnet extractor initialization (Running on original data or on a sub-set)
# Initialization for full data access
wn_online = WordnetRelationExtraction(OnlineOROfflineMethod.ONLINE)
# Or use offline initialization if created a snapshot
wn_offline = WordnetRelationExtraction(OnlineOROfflineMethod.OFFLINE, wn_file=ROOT_DIR + '/mini_wn.json')
Verb-Ocean
- Use VerboceanRelationExtraction to extract relations based on VerbOcean.
- Supports: Event mentions only.
Initialization options
# 2 methods for VerbOcean extractor initialization (with original data or a sub-set)
# Initialization for full data access
vo_online = VerboceanRelationExtraction(OnlineOROfflineMethod.ONLINE, ROOT_DIR + '/verbocean.unrefined.2004-05-20.txt')
# Or use offline initialization if created a snapshot
vo_offline = VerboceanRelationExtraction(OnlineOROfflineMethod.OFFLINE, ROOT_DIR + '/mini_vo.json')
© Timothy Chklovski and Patrick Pantel 2004-2016; All Rights Reserved. With any questions, contact Timothy Chklovski or Patrick Pantel.
Referent-Dictionary
- Use ReferentDictRelationExtraction to extract relations based on Referent-Dict.
- Supports: Entity mentions only.
Initialization options
# 2 methods for ReferentDict extractor initialization (with original data or a sub-set)
# Initialization for full data access
ref_dict_online = ReferentDictRelationExtraction(OnlineOROfflineMethod.ONLINE, ROOT_DIR + '/ref.dict1.tsv')
# Or use offline initialization if created a snapshot
ref_dict_offline = ReferentDictRelationExtraction(OnlineOROfflineMethod.OFFLINE, ROOT_DIR + '/mini_dict.json')
© Marta Recasens, Matthew Can, and Dan Jurafsky. 2013. Same Referent, Different Words: Unsupervised Mining of Opaque Coreferent Mentions. Proceedings of NAACL 2013.
Word Embedding
- Use WordEmbeddingRelationExtraction to extract relations based on w2v distance.
- Supports: Event and Entity mentions.
Supported embedding types
- Elmo pre-trained embeddings
- GloVe pre-trained embeddings
Initialization options
# 4 flavors of embedding model initialization (running Elmo, GloVe, or a data sub-set of them)
# Initialization for Elmo pre-trained vectors
embed_elmo_online = WordEmbeddingRelationExtraction(EmbeddingMethod.ELMO)
# Or use offline initialization if you created a snapshot
embed_elmo_offline = WordEmbeddingRelationExtraction(EmbeddingMethod.ELMO_OFFLINE, glove_file=ROOT_DIR + '/elmo_snippet.pickle')
# Embedding extractor initialization (GloVe)
# Initialization of GloVe pre-trained vectors
embed_glove_online = WordEmbeddingRelationExtraction(EmbeddingMethod.GLOVE, glove_file=ROOT_DIR + '/glove.840B.300d.txt')
# Or use offline initialization if you created a snapshot
embed_glove_offline = WordEmbeddingRelationExtraction(EmbeddingMethod.GLOVE_OFFLINE, glove_file=ROOT_DIR + '/glove_mini.pickle')
Computational
- Use ComputedRelationExtraction to extract relations based on rules such as Head match and Fuzzy fit.
- Supports: Event and Entity mentions.
Relation types
- Exact Match: Mentions are identical
- Fuzzy Match: Mentions are fuzzy similar
- Fuzzy Head: Mention heads are fuzzily similar (in cases where mentions are more than a single token)
- Head Lemma: Mentions have the same head lemma (in cases where mentions are more than a single token)
Initialization
# 1 method for Computed extractor initialization
computed = ComputedRelationExtraction()
Examples
- Using the Wikipedia relation identifier for mentions of ‘IBM’ and ‘International Business Machines’ will result in the relation types `WIKIPEDIA_CATEGORY, WIKIPEDIA_ALIASES, WIKIPEDIA_REDIRECT_LINK`
- Using the WordNet relation identifier for mentions of ‘lawyer’ and ‘attorney’ will result in the relation types `WORDNET_SAME_SYNSET, WORDNET_DERIVATIONALLY`
- Using the Referent-Dict relation identifier for mentions of ‘company’ and ‘apple’ will result in the `REFERENT_DICT` relation type.
- Using the VerbOcean relation identifier for mentions of ‘expedite’ and ‘accelerate’ will result in the `VERBOCEAN_MATCH` relation type.
Code Example
Each relation identifier implements two main methods to identify the relation types:
- extract_all_relations() - extracts all supported relation types using this relation model
- extract_sub_relations() - extracts a particular relation type using this relation model
See the detailed example below and the method documentation for more details on how to use the identifiers.
computed = ComputedRelationExtraction()
ref_dict = ReferentDictRelationExtraction(OnlineOROfflineMethod.ONLINE,
'<replace with Ref-Dict data location>')
vo = VerboceanRelationExtraction(OnlineOROfflineMethod.ONLINE,
'<replace with VerbOcean data location>')
wiki = WikipediaRelationExtraction(WikipediaSearchMethod.ONLINE)
embed = WordEmbeddingRelationExtraction(EmbeddingMethod.ELMO)
wn = WordnetRelationExtraction(OnlineOROfflineMethod.ONLINE)
mention_x1 = MentionDataLight(
'IBM',
mention_context='IBM manufactures and markets computer hardware, middleware and software')
mention_y1 = MentionDataLight(
'International Business Machines',
mention_context='International Business Machines Corporation is an '
'American multinational information technology company')
computed_relations = computed.extract_all_relations(mention_x1, mention_y1)
ref_dict_relations = ref_dict.extract_all_relations(mention_x1, mention_y1)
vo_relations = vo.extract_all_relations(mention_x1, mention_y1)
wiki_relations = wiki.extract_all_relations(mention_x1, mention_y1)
embed_relations = embed.extract_all_relations(mention_x1, mention_y1)
wn_relations = wn.extract_all_relations(mention_x1, mention_y1)
You can find the above example in this location: examples/cross_doc_coref/relation_extraction_example.py
Downloading and generating external resources data
This section describes how to download resources required for relation identifiers and how to prepare resources for working locally or with a snapshot of a resource.
Full External Resources
- Referent-Dict, used in ReferentDictRelationExtraction
- VerbOcean, used in VerboceanRelationExtraction
- GloVe, used in WordEmbeddingRelationExtraction
Generating resource snapshots
Using a large dataset with relation identifiers that work by querying an online resource might take a lot of time due to network latency and overhead. In addition, capturing an online dataset is useful for many train/test tasks that the user might run. For this purpose we include scripts to capture a snapshot (or subset) of an online resource; the downloaded snapshot can then be loaded by the relation identifiers as data input.
Each script requires a mentions file in JSON format, as shown below and in the small sketch that follows it. This file must contain the event or entity mentions that the user is interested in (i.e., the subset of data that needs to be captured):
[
{ # Mention 1
"tokens_str": "Intel" #Required,
"context": "Intel is the world's second largest and second highest valued semiconductor chip maker" #Optional (used in Elmo)
},
{ # Mention 2
"tokens_str": "Tara Reid"
},
...
]
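A small sketch of building such a mentions file from a plain list of mention strings (field names follow the example above; contexts are optional and the output path is a placeholder):

import json

# Illustrative mentions; "context" is optional and only used for the Elmo snapshot.
mentions = [
    {"tokens_str": "Intel",
     "context": "Intel is the world's second largest semiconductor chip maker"},
    {"tokens_str": "Tara Reid"},
]

with open("in_mentions.json", "w", encoding="utf-8") as f:
    json.dump(mentions, f, indent=4)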
Generate Scripts
Generate ReferentDict:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_reference_dict_dump --ref_dict=<ref.dict1.tsv downloaded file> --mentions=<in_mentions.json> --output=<output.json>
Generate VerbOcean:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_verbocean_dump --vo=<verbocean.unrefined.2004-05-20.txt downloaded file> --mentions=<in_mentions.json> --output=<output.json>
Generate WordEmbedding Glove:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_word_embed_glove_dump --mentions=<in_mentions.json> --glove=glove.840B.300d.txt --output=<output.pickle>
Generate Wordnet:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_wordnet_dump --mentions=<in_mentions.json> --output=<output.json>
Generate Wikipedia:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_wiki_dump --mentions=<in_mentions.json> --output=<output.json>
Note
For fast evaluation against Wikipedia at run time, on live data, there is an option to generate a local ElasticSearch database of the entire Wikipedia site using this resource: Wiki to Elastic. This is highly recommended, since online evaluation against the Wikipedia site can be very slow.
If you adopt the local Elastic database, initialize WikipediaRelationExtraction using WikipediaSearchMethod.ELASTIC.
Generate a Wikipedia snapshot using Elastic data instead of the online Wikipedia site:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_wiki_dump --mentions=<in_mentions.json> --host=<elastic_host eg:localhost> --port=<elastic_port eg:9200> --index=<elastic_index> --output=<output.json>
Noun Phrase Semantic Segmentation
Overview
A Noun-Phrase (NP) is a phrase which has a noun (or pronoun) as its head and zero or more dependent modifiers. The Noun-Phrase is the most frequently occurring phrase type, and its inner segmentation is critical for understanding its semantics. The most basic division of the semantic segmentation is into two classes:
- Descriptive Structure - a structure in which the dependent modifiers do not change the semantic meaning of the head.
- Collocation Structure - a sequence of words or a term that co-occur and change the semantic meaning of the head.
For example:
- fresh hot dog - hot dog is a collocation and changes the semantic meaning of the head (dog).
- fresh hot pizza - fresh and hot are descriptions of the pizza.
Model
The NpSemanticSegClassifier model is the first step of the semantic segmentation algorithm - the MLP classifier.
The semantic segmentation algorithm takes the dependency relations between the Noun-Phrase words and the MLP classifier inference as input, and builds a semantic hierarchy that represents the semantic meaning. The algorithm eventually creates a tree in which each tier represents a semantic meaning: if a sequence of words is a collocation, a collocation tier is created; otherwise the elements are broken down and each one is mapped to a different tier in the tree.
This model trains the MLP classifier and runs inference with it in order to conclude the correct segmentation for the given NP. For the examples above, the classifier will output 1 (collocation) for hot dog and 0 (not a collocation) for hot pizza.
Files
- NpSemanticSegClassifier: the MLP classifier model.
- examples/np_semantic_segmentation/data.py: prepares string data for both train.py and inference.py using pre-trained word embeddings, NLTK collocation scores, WordNet and Wikidata.
- examples/np_semantic_segmentation/feature_extraction.py: contains the feature extraction services.
- examples/np_semantic_segmentation/train.py: trains the MLP classifier.
- examples/np_semantic_segmentation/inference.py: loads the trained model and runs inference on the input data.
Dataset
The expected dataset is a CSV file with 2 columns: the first column contains the Noun-Phrase string (a Noun-Phrase containing 2 words), and the second column contains the correct label (if the 2-word Noun-Phrase is a collocation the label is 1, otherwise 0).
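A toy example of creating a dataset file in this format (the rows below are illustrative):

import csv

# Two-column rows: (Noun-Phrase string, label) where 1 = collocation, 0 = not.
rows = [
    ("hot dog", 1),
    ("fresh pizza", 0),
]
with open("toy_np_dataset.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)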
If you wish to use an existing dataset for training the model, you can download the Tratz 2011 dataset [1] [2] [3] [4] from the following link: Tratz 2011 Dataset; it is also available here. (The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database.)
After downloading and unzipping the dataset, run preprocess_tratz2011.py in order to construct the labeled data and save it in a CSV file (as expected by the model). The script reads 2 .tsv files (‘tratz2011_coarse_grained_random/train.tsv’ and ‘tratz2011_coarse_grained_random/val.tsv’) and outputs 2 corresponding .csv files to the same location.
Quick example:
python examples/np_semantic_segmentation/preprocess_tratz2011.py --data path_to_Tratz_2011_dataset_folder
Pre-processing the data
A feature vector is extracted from each Noun-Phrase string using the command python data.py. The extracted features are:
- Word2Vec word embeddings (a 300-dimensional vector for each word in the Noun-Phrase).
- The pre-trained Google News Word2vec model can be downloaded here.
- The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database.
- Cosine distance between the 2 words in the Noun-Phrase.
- NLTK collocation scores: PMI score (from Manning and Schutze 5.4) and Chi-square score (Manning and Schutze 5.3.3).
- A binary feature indicating whether the Noun-Phrase has an existing entity in Wikidata.
- A binary feature indicating whether the Noun-Phrase has an existing entity in WordNet.
Quick example:
python data.py --data input_data_path.csv --output prepared_data_path.csv --w2v_path <path_to_w2v>/GoogleNews-vectors-negative300.bin
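For illustration, a hedged sketch of computing a few of the features listed above with gensim and NLTK follows; the project's data.py and feature_extraction.py compute the full feature set, and the model path and details below are assumptions:

# Hedged sketch: computes only the cosine-distance and WordNet-existence features.
# Requires nltk.download('wordnet') and the Google News Word2vec binary.
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn
from scipy.spatial.distance import cosine

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def partial_np_features(np_string):
    w1, w2 = np_string.split()[:2]
    cos_dist = cosine(w2v[w1], w2v[w2]) if w1 in w2v and w2 in w2v else 1.0
    in_wordnet = int(len(wn.synsets(np_string.replace(" ", "_"))) > 0)
    return cos_dist, in_wordnet

print(partial_np_features("hot dog"))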
Running Modalities
Training
The command python examples/np_semantic_segmentation/train.py will train the MLP classifier and evaluate it. After training is done, the model is saved automatically.
Quick example:
python examples/np_semantic_segmentation/train.py \
--data prepared_data_path.csv \
--model_path np_semantic_segmentation_path.h5
Inference
In order to run inference you need pre-trained <model_name>.h5 and <model_name>.json files and a data CSV file that was generated by prepare_data.py. The result of python inference.py is a CSV file in which each row contains the model's inference with respect to the input data.
Quick example:
python examples/np_semantic_segmentation/inference.py \
--model np_semantic_segmentation_path.h5 \
--data prepared_data_path.csv \
--output inference_data.csv \
--print_stats
References
[1] Stephen Tratz and Eduard Hovy. 2011. A Fast, Accurate, Non-Projective, Semantically-Enriched Parser. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK.
[2] Dirk Hovy, Stephen Tratz, and Eduard Hovy. 2010. What's in a Preposition? Dimensions of Sense Disambiguation for an Interesting Word Class. In Proceedings of COLING 2010: Poster Volume. Beijing, China.
[3] Stephen Tratz and Dirk Hovy. 2009. Disambiguation of Preposition Sense Using Linguistically Motivated Features. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium. Boulder, Colorado.
[4] Stephen Tratz and Eduard Hovy. 2010. A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden.
Most Common Word Sense
Overview
The goal of the most common word sense algorithm is to extract the most common sense of a target word. The input to the algorithm is the target word, and the output is the set of senses of the target word, where each sense is scored according to how commonly it is used in the language. Note that most words in the language have many senses. The sense of a word consists of the definition of the word and the inherited hypernyms of the word.
For example: the most common sense of the target_word burger is:
definition: "a sandwich consisting of a fried cake of minced beef served on a bun, often with other ingredients"
inherited hypernyms: ['sandwich', 'snack_food']
whereas the least common sense is:
definition: "United States jurist appointed chief justice of the United States Supreme Court by Richard Nixon (1907-1995)"
Our approach:
Training: the training step takes as input a list of target_words, where each word is associated with a correct (true example) or incorrect (false example) sense. The sense consists of the definition and the inherited hypernyms of the target word in a specific sense.
Inference: the inference step extracts all the possible senses for a specific target_word and scores those senses according to how common they are for that target_word. The higher the score, the higher the probability that the sense is the most commonly used sense.
In both training and inference a feature vector is constructed as input to the neural network. The feature vector consists of:
- the word embedding distance between the target_word and the inherited hypernyms
- 2 variations of the word embedding distance between the target_word and the definition
- the word embedding of the target_word
- the CBOW word embedding of the definition
The model above is implemented in the MostCommonWordSense class.
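A hedged sketch of building such a feature vector with a gensim word2vec model follows; the exact distance variants and ordering used by the project may differ, and this only illustrates the ingredients listed above:

import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def cbow(words):
    """Average embedding of the in-vocabulary words."""
    vecs = [w2v[w] for w in words if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sense_features(target_word, definition_tokens, hypernyms):
    target_vec = w2v[target_word] if target_word in w2v else np.zeros(w2v.vector_size)
    definition_vec = cbow(definition_tokens)  # CBOW embedding of the definition
    hypernym_vec = cbow(hypernyms)
    return np.concatenate([
        [cos_sim(target_vec, hypernym_vec)],    # distance to inherited hypernyms
        [cos_sim(target_vec, definition_vec)],  # one distance variant to the definition
        target_vec,                             # embedding of the target word
        definition_vec,                         # CBOW embedding of the definition
    ])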
Dataset
The training module requires a gold standard csv file, which is a list of target_words where each word is associated with a CLASS_LABEL - a correct (true example) or incorrect (false example) sense. The sense consists of the definition and the inherited hypernyms of the target word in a specific sense. The user needs to prepare this gold standard csv file in advance. The file should include the following 4 columns:
TARGET_WORD | DEFINITION | SEMANTIC_BRANCH | CLASS_LABEL
where:
- TARGET_WORD: the word that you want to get the most common sense of.
- DEFINITION: the definition of the word (usually a single sentence), extracted from an external resource such as WordNet or Wikidata
- SEMANTIC_BRANCH: the inherited hypernyms of the word, extracted from an external resource such as WordNet or Wikidata
- CLASS_LABEL: a binary [0,1] value that represents whether the sense (definition and semantic branch) is the most common sense of the target word
Store the file in the data folder of the project.
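For illustration, a toy gold-standard row written in this format (values are illustrative, and whether a header row is expected is an assumption):

import csv

row = {
    "TARGET_WORD": "burger",
    "DEFINITION": "a sandwich consisting of a fried cake of minced beef served on a bun",
    "SEMANTIC_BRANCH": "['sandwich', 'snack_food']",
    "CLASS_LABEL": 1,
}
with open("data/gold_standard.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()  # assumption: a header row is acceptable
    writer.writerow(row)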
Running Modalities
Dataset Preparation
The script prepare_data.py prepares the dataset using the gold standard csv file described in the Dataset section above and a pre-trained Google News Word2vec model [1] [2] [3]. The pre-trained Google News Word2vec model can be downloaded here. The terms and conditions of the data set license apply. Intel does not grant any rights to the data files.
python examples/most_common_word_sense/prepare_data.py --gold_standard_file data/gold_standard.csv
--word_embedding_model_file pretrained_models/GoogleNews-vectors-negative300.bin
--training_to_validation_size_ratio 0.8
--data_set_file data/data_set.pkl
Training
Trains the MLP classifier (model) and evaluates it.
python examples/most_common_word_sense/train.py --data_set_file data/data_set.pkl
--model data/wsd_classification_model.h5
Inference
python examples/most_common_word_sense/inference.py --max_num_of_senses_to_search 3
--input_inference_examples_file data/input_inference_examples.csv
--word_embedding_model_file pretrained_models/GoogleNews-vectors-negative300.bin
--model data/wsd_classification_model.h5
where max_num_of_senses_to_search is the maximum number of senses that are checked per target word (default = 3) and input_inference_examples_file is a csv file containing the input inference data. This file includes a single column, where each entry in the column is a different target word.
Note
The results are printed to the terminal in different colors; therefore, a white terminal background is best for viewing the results.
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.