Noun Phrase Semantic Segmentation
Overview
A Noun-Phrase (NP) is a phrase that has a noun (or pronoun) as its head and zero or more dependent modifiers. The Noun-Phrase is the most frequently occurring phrase type, and its inner segmentation is critical for understanding its semantics. The most basic semantic segmentation divides Noun-Phrases into two classes:
- Descriptive Structure - a structure in which none of the dependent modifiers changes the semantic meaning of the head.
- Collocation Structure - a sequence of words or terms that co-occur and change the semantic meaning of the head.
For example:
- fresh hot dog - hot dog is a collocation, and changes the semantic meaning of the head (dog).
- fresh hot pizza - fresh and hot are descriptions of the pizza.
Model
The NpSemanticSegClassifier model is the first step in the Semantic Segmentation algorithm - the MLP classifier.
The Semantic Segmentation algorithm takes the dependency relations between the Noun-Phrase words and the MLP classifier's inference as input, and builds a semantic hierarchy that represents the semantic meaning.
The algorithm eventually creates a tree in which each tier represents a semantic meaning: if a sequence of words is a collocation, a collocation tier is created; otherwise the elements are broken down and each one is mapped to a different tier in the tree.
This model trains an MLP classifier and runs inference with it in order to determine the correct segmentation for a given NP.
For the examples above, the classifier outputs 1 (collocation) for hot dog and 0 (not a collocation) for hot pizza.
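The tier-building idea described above can be illustrated with a minimal sketch. It is not the library's implementation: it ignores dependency relations entirely, scans adjacent word pairs from left to right, and replaces the trained MLP classifier with a hypothetical lookup table (`KNOWN_COLLOCATIONS`, `toy_classifier` and `segment_noun_phrase` are names invented for this illustration).

```python
from typing import Callable, List

# Hypothetical stand-in for the trained MLP classifier: a lookup table of
# known collocations. The real classifier predicts from a feature vector.
KNOWN_COLLOCATIONS = {("hot", "dog")}


def toy_classifier(first: str, second: str) -> int:
    """Return 1 if the word pair is a collocation, else 0."""
    return 1 if (first, second) in KNOWN_COLLOCATIONS else 0


def segment_noun_phrase(
    words: List[str], is_collocation: Callable[[str, str], int]
) -> List[List[str]]:
    """Group adjacent word pairs flagged as collocations; split the rest.

    Simplification: scans left to right over adjacent pairs only and
    ignores the dependency relations that the full algorithm uses.
    """
    segments = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and is_collocation(words[i], words[i + 1]) == 1:
            segments.append([words[i], words[i + 1]])  # one collocation tier
            i += 2
        else:
            segments.append([words[i]])  # a tier of its own
            i += 1
    return segments


print(segment_noun_phrase("fresh hot dog".split(), toy_classifier))
print(segment_noun_phrase("fresh hot pizza".split(), toy_classifier))
```

With this toy classifier, fresh hot dog is split into fresh and hot dog, while fresh hot pizza is broken down into three single-word segments.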
Files
- NpSemanticSegClassifier: the MLP classifier model.
- examples/np_semantic_segmentation/data.py: prepares string data for both train.py and inference.py using pre-trained word embeddings, NLTKCollocations scores, WordNet and Wikidata.
- examples/np_semantic_segmentation/feature_extraction.py: contains the feature extraction services.
- examples/np_semantic_segmentation/train.py: trains the MLP classifier.
- examples/np_semantic_segmentation/inference.py: loads the trained model and runs inference on the input data.
Dataset
The expected dataset is a CSV file with 2 columns. The first column contains the Noun-Phrase string (a Noun-Phrase containing 2 words), and the second column contains the correct label (1 if the 2-word Noun-Phrase is a collocation, 0 otherwise).
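As an illustration of this format, the following sketch reads and validates such a file with Python's standard csv module. It assumes a comma-separated file without a header row, which the description above does not specify; the function name `read_labeled_noun_phrases` is hypothetical and not part of the example scripts.

```python
import csv
import io


def read_labeled_noun_phrases(csv_file):
    """Read (noun_phrase, label) rows and check the expected 2-column layout."""
    rows = []
    for line_number, row in enumerate(csv.reader(csv_file), start=1):
        if len(row) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(row)}")
        phrase, label = row[0].strip(), row[1].strip()
        if len(phrase.split()) != 2:
            raise ValueError(f"line {line_number}: expected a 2-word Noun-Phrase")
        if label not in ("0", "1"):
            raise ValueError(f"line {line_number}: label must be 0 or 1")
        rows.append((phrase, int(label)))
    return rows


# In-memory sample standing in for a real dataset file (assumed: no header row).
sample = io.StringIO("hot dog,1\nhot pizza,0\n")
examples = read_labeled_noun_phrases(sample)
print(examples)
```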
If you wish to use an existing dataset for training the model, you can download the Tratz et al. 2011 dataset [1] [2] [3] [4] from the following link: Tratz 2011 Dataset. It is also available here. (The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database.)
After downloading and unzipping the dataset, run preprocess_tratz2011.py to construct the labeled data and save it in a CSV file (as expected by the model). The script reads 2 .tsv files ('tratz2011_coarse_grained_random/train.tsv' and 'tratz2011_coarse_grained_random/val.tsv') and writes 2 corresponding .csv files to the same location.
Quick example:
python examples/np_semantic_segmentation/preprocess_tratz2011.py --data path_to_Tratz_2011_dataset_folder
Pre-processing the data
A feature vector is extracted from each Noun-Phrase string using the command python data.py. The vector consists of the following features:
- Word2Vec word embedding (a 300-dimensional vector for each word in the Noun-Phrase).
  - The pre-trained Google News Word2Vec model can be downloaded here.
  - The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database.
- Cosine distance between the 2 words in the Noun-Phrase.
- NLTKCollocations score (PMI score (Manning and Schutze 5.4) and Chi-square score (Manning and Schutze 5.3.3)).
- A binary feature indicating whether the Noun-Phrase has an existing entity in Wikidata.
- A binary feature indicating whether the Noun-Phrase has an existing entity in WordNet.
Quick example:
python data.py --data input_data_path.csv --output prepared_data_path.csv --w2v_path <path_to_w2v>/GoogleNews-vectors-negative300.bin
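To make the numeric features listed above more concrete, here is a minimal standard-library sketch of the cosine distance, PMI and chi-square formulas. It is an illustration, not the code in data.py or feature_extraction.py: the function names are hypothetical, cosine distance is taken to mean 1 minus cosine similarity (an assumption), and the collocation scores are written out from the textbook definitions rather than called through NLTK.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y)))."""
    return math.log2((n_xy * n_total) / (n_x * n_y))


def chi_square(n_xy, n_x, n_y, n_total):
    """Pearson chi-square statistic for the 2x2 bigram contingency table."""
    o11 = n_xy                          # first word followed by second word
    o12 = n_y - n_xy                    # second word after a different word
    o21 = n_x - n_xy                    # first word before a different word
    o22 = n_total - n_x - n_y + n_xy    # neither word in its position
    numerator = n_total * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Toy inputs: 2-dimensional vectors instead of 300-dimensional embeddings,
# and made-up corpus counts.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(pmi(n_xy=2, n_x=2, n_y=2, n_total=4))
print(chi_square(n_xy=2, n_x=2, n_y=2, n_total=4))
```

Here `n_xy` is the bigram count, `n_x` and `n_y` are the counts of the first and second word, and `n_total` is the total number of bigrams in the corpus.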
Running Modalities
Training
The command python examples/np_semantic_segmentation/train.py trains the MLP classifier and evaluates it. After training is done, the model is saved automatically.
Quick example:
python examples/np_semantic_segmentation/train.py \
--data prepared_data_path.csv \
--model_path np_semantic_segmentation_path.h5
Inference
To run inference you need pre-trained <model_name>.h5 and <model_name>.json files and a data CSV file generated by data.py. The result of python inference.py is a CSV file in which each row contains the model's inference for the corresponding row of the input data.
Quick example:
python examples/np_semantic_segmentation/inference.py \
--model np_semantic_segmentation_path.h5 \
--data prepared_data_path.csv \
--output inference_data.csv \
--print_stats
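If gold labels are available for the input data, the predicted labels can be compared against them. The sketch below computes common binary-classification statistics; it does not claim to reproduce what --print_stats reports, and it assumes the gold and predicted labels have already been loaded as lists of 0/1 integers (the function name `binary_classification_stats` is hypothetical).

```python
def binary_classification_stats(gold, predicted):
    """Precision, recall, F1 and accuracy for the collocation class (label 1)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    correct = sum(1 for g, p in pairs if g == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels for illustration only.
stats = binary_classification_stats(gold=[1, 0, 1, 0], predicted=[1, 1, 0, 0])
print(stats)
```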
References
[1] Stephen Tratz and Eduard Hovy. 2011. A Fast, Accurate, Non-Projective, Semantically-Enriched Parser. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK.
[2] Dirk Hovy, Stephen Tratz, and Eduard Hovy. 2010. What’s in a Preposition? Dimensions of Sense Disambiguation for an Interesting Word Class. In Proceedings of COLING 2010: Poster Volume. Beijing, China.
[3] Stephen Tratz and Dirk Hovy. 2009. Disambiguation of Preposition Sense using Linguistically Motivated Features. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Student Research Workshop and Doctoral Consortium. Boulder, Colorado.
[4] Stephen Tratz and Eduard Hovy. 2010. A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden.