Identifying Semantic Relations
Overview
Semantic relation identification is the task of determining whether there is a relation between two entities. Those entities could be event mentions (referring to verbs and action phrases) or entity mentions (referring to objects, locations, persons, time, etc.). Described below are six methods for extracting relations using external data resources: Wikipedia, WordNet, word embeddings, computational rules, Referent-Dictionary and VerbOcean.
Each semantic relation identifier below is capable of identifying a set of pre-defined relation types between two events or two entity mentions.
Note
Each relation identifier can be initialized to run in one of several modes, as shown in the Initialization options code examples below: online, working directly against the resource's website; against a locally stored copy of the resource dataset; or against a snapshot of the resource containing only the relevant data (created according to an input dataset defined by the user).
To prepare a resource snapshot, refer to Downloading and generating external resources data.
Wikipedia
- Use WikipediaRelationExtraction to extract relations based on Wikipedia page information.
- Supports: Event and Entity mentions.
Relation types
- Redirect Links: the two mentions have the same Wikipedia redirect link (see: Wiki-Redirect for more details)
- Aliases: one mention is a Wikipedia alias of the other input mention (see: Wiki-Aliases for more details)
- Disambiguation: one input mention is a Wikipedia disambiguation of the other input mention (see: Wiki-Disambiguation for more details)
- Category: one input mention is a Wikipedia category of the other input mention (see: Wiki-Category for more details)
- Title Parenthesis: one input mention is a Wikipedia title parenthesis of the other input mention (see: Extracting Lexical Reference Rules from Wikipedia for more details)
- Be-Comp / Is-A: one input mention has an ‘is-a’ relation which contains the other input mention (see: Extracting Lexical Reference Rules from Wikipedia for more details)
Initialization options
# 3 methods for Wikipedia extractor initialization (running against wiki web site, data sub-set or local elastic DB)
# Online initialization for full data access against Wikipedia site
wiki_online = WikipediaRelationExtraction(WikipediaSearchMethod.ONLINE)
# Or use offline initialization if you created a snapshot
wiki_offline = WikipediaRelationExtraction(WikipediaSearchMethod.OFFLINE, ROOT_DIR + '/mini_wiki.json')
# Or use elastic initialization if you created a local database of wikipedia
wiki_elastic = WikipediaRelationExtraction(WikipediaSearchMethod.ELASTIC, host='localhost', port=9200, index='enwiki_v2')
Wordnet
- Use WordnetRelationExtraction to extract relations based on WordNet.
- Supports: Event and Entity mentions.
Relation types
- Derivationally related - Terms in different syntactic categories that have the same root form and are semantically related
- Synset - A synonym set; a set of words that are interchangeable in some context without changing the truth value of the proposition in which they are embedded
See: WordNet Glossary for more details.
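The synset relation above can be illustrated with a minimal sketch: two words are related when they share at least one synset. The lookup table below is toy data with illustrative identifiers, and share_synset is a hypothetical helper; neither is part of WordnetRelationExtraction.

```python
# Toy lookup table: word -> set of synset identifiers (illustrative only).
TOY_SYNSETS = {
    "lawyer": {"lawyer.n.01"},
    "attorney": {"lawyer.n.01"},
    "judge": {"judge.n.01", "evaluator.n.01"},
}


def share_synset(word_x, word_y, synsets=TOY_SYNSETS):
    """Return True when the two words have at least one synset in common."""
    synsets_x = synsets.get(word_x.lower(), set())
    synsets_y = synsets.get(word_y.lower(), set())
    return bool(synsets_x & synsets_y)


print(share_synset("lawyer", "attorney"))
print(share_synset("lawyer", "judge"))
```

Words missing from the table are treated as having no synsets, so they never match.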
Initialization options
# 2 methods for Wordnet extractor initialization (Running on original data or on a sub-set)
# Initialization for full data access
wn_online = WordnetRelationExtraction(OnlineOROfflineMethod.ONLINE)
# Or use offline initialization if you created a snapshot
wn_offline = WordnetRelationExtraction(OnlineOROfflineMethod.OFFLINE, wn_file=ROOT_DIR + '/mini_wn.json')
Verb-Ocean
- Use VerboceanRelationExtraction to extract relations based on Verb-Ocean.
- Supports: Event mentions only.
Initialization options
# 2 methods for VerbOcean extractor initialization (with original data or a sub-set)
# Initialization for full data access
vo_online = VerboceanRelationExtraction(OnlineOROfflineMethod.ONLINE, ROOT_DIR + '/verbocean.unrefined.2004-05-20.txt')
# Or use offline initialization if you created a snapshot
vo_offline = VerboceanRelationExtraction(OnlineOROfflineMethod.OFFLINE, ROOT_DIR + '/mini_vo.json')
© Timothy Chklovski and Patrick Pantel 2004-2016; All Rights Reserved. With any questions, contact Timothy Chklovski or Patrick Pantel.
Referent-Dictionary
- Use ReferentDictRelationExtraction to extract relations based on Referent-Dict.
- Supports: Entity mentions only.
Initialization options
# 2 methods for ReferentDict extractor initialization (with original data or a sub-set)
# Initialization for full data access
ref_dict_online = ReferentDictRelationExtraction(OnlineOROfflineMethod.ONLINE, ROOT_DIR + '/ref.dict1.tsv')
# Or use offline initialization if you created a snapshot
ref_dict_offline = ReferentDictRelationExtraction(OnlineOROfflineMethod.OFFLINE, ROOT_DIR + '/mini_dict.json')
© Marta Recasens, Matthew Can, and Dan Jurafsky. 2013. Same Referent, Different Words: Unsupervised Mining of Opaque Coreferent Mentions. Proceedings of NAACL 2013.
Word Embedding
- Use WordEmbeddingRelationExtraction to extract relations based on w2v distance.
- Supports: Event and Entity mentions.
Supported Embeddings types
Initialization options
# 4 flavors of Embedding model initialization (running Elmo, Glove or a data sub-set of them)
# Initialization for Elmo Pre-Trained vectors
embed_elmo_online = WordEmbeddingRelationExtraction(EmbeddingMethod.ELMO)
# Or use offline initialization if you created a snapshot
embed_elmo_offline = WordEmbeddingRelationExtraction(EmbeddingMethod.ELMO_OFFLINE, glove_file=ROOT_DIR + '/elmo_snippet.pickle')
# Initialization of Glove Pre-Trained vectors
embed_glove_online = WordEmbeddingRelationExtraction(EmbeddingMethod.GLOVE, glove_file=ROOT_DIR + '/glove.840B.300d.txt')
# Or use offline initialization if you created a snapshot
embed_glove_offline = WordEmbeddingRelationExtraction(EmbeddingMethod.GLOVE_OFFLINE, glove_file=ROOT_DIR + '/glove_mini.pickle')
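As a rough illustration of an embedding-distance check, the sketch below compares two vectors by cosine similarity and applies a threshold. Cosine similarity is a common choice for this kind of test but is assumed here, not confirmed as what WordEmbeddingRelationExtraction computes; the function names and the 0.7 threshold are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_embedding_related(u, v, threshold=0.7):
    """Hypothetical decision rule: related when similarity reaches the threshold."""
    return cosine_similarity(u, v) >= threshold


# Toy 3-dimensional vectors; real pre-trained vectors have hundreds of dimensions.
print(is_embedding_related([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(is_embedding_related([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

In practice the threshold would be tuned on held-out data, since the similarity scale differs between embedding models.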
Computational
- Use
ComputedRelationExtraction
to extract relations based on rules such as Head match and Fuzzy Fit. - Support: Event and Entity mentions.
Relation types
- Exact Match: Mentions are identical
- Fuzzy Match: Mentions are fuzzy similar
- Fuzzy Head: Mention heads are fuzzy similar (in cases where mentions are more than a single token)
- Head Lemma: Mentions have the same head lemma (in cases where mentions are more than a single token)
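The rule-based relation types above can be approximated with the standard library. This is a minimal sketch and not the logic of ComputedRelationExtraction: the helper names, the 0.8 threshold, and the "last token is the head" shortcut are all assumptions. Head Lemma is left out because it needs a lemmatizer.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def exact_match(x, y):
    """Exact Match: the normalized mentions are identical."""
    return normalize(x) == normalize(y)


def fuzzy_match(x, y, threshold=0.8):
    """Fuzzy Match: character-level similarity ratio reaches the threshold."""
    return SequenceMatcher(None, normalize(x), normalize(y)).ratio() >= threshold


def naive_head(mention):
    """Assumption: treat the last token as the head of the mention."""
    tokens = normalize(mention).split()
    return tokens[-1] if tokens else ""


def fuzzy_head(x, y, threshold=0.8):
    """Fuzzy Head: the (naively chosen) heads are fuzzy similar."""
    return fuzzy_match(naive_head(x), naive_head(y), threshold)


print(exact_match("IBM", " ibm "))
print(fuzzy_match("colour", "color"))
print(fuzzy_head("the big organisation", "an organization"))
```

A real head finder would use a syntactic parser; the last-token shortcut only works for simple English noun phrases.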
Initialization
# 1 method for Computed extractor initialization
computed = ComputedRelationExtraction()
Examples
- Using the Wikipedia relation identifier for mentions of ‘IBM’ and ‘International Business Machines’ results in the following relation types: `WIKIPEDIA_CATEGORY, WIKIPEDIA_ALIASES, WIKIPEDIA_REDIRECT_LINK`
- Using the WordNet relation identifier for mentions of ‘lawyer’ and ‘attorney’ results in the following relation types: `WORDNET_SAME_SYNSET, WORDNET_DERIVATIONALLY`
- Using the Referent-Dict relation identifier for mentions of ‘company’ and ‘apple’ results in the `REFERENT_DICT` relation type.
- Using the VerbOcean relation identifier for mentions of ‘expedite’ and ‘accelerate’ results in the `VERBOCEAN_MATCH` relation type.
Code Example
Each relation identifier implements two main methods for identifying relation types:
- extract_all_relations(): extracts all supported relation types from this relation model
- extract_sub_relations(): extracts a particular relation type from this relation model
See the detailed example below and the method documentation for more details on how to use the identifiers.
# Initialize one identifier per resource
computed = ComputedRelationExtraction()
ref_dict = ReferentDictRelationExtraction(OnlineOROfflineMethod.ONLINE,
                                          '<replace with Ref-Dict data location>')
vo = VerboceanRelationExtraction(OnlineOROfflineMethod.ONLINE,
                                 '<replace with VerbOcean data location>')
wiki = WikipediaRelationExtraction(WikipediaSearchMethod.ONLINE)
embed = WordEmbeddingRelationExtraction(EmbeddingMethod.ELMO)
wn = WordnetRelationExtraction(OnlineOROfflineMethod.ONLINE)
# Build the two mentions to compare
mention_x1 = MentionDataLight(
    'IBM',
    mention_context='IBM manufactures and markets computer hardware, middleware and software')
mention_y1 = MentionDataLight(
    'International Business Machines',
    mention_context='International Business Machines Corporation is an '
                    'American multinational information technology company')
# Extract all supported relation types from each identifier
computed_relations = computed.extract_all_relations(mention_x1, mention_y1)
ref_dict_relations = ref_dict.extract_all_relations(mention_x1, mention_y1)
vo_relations = vo.extract_all_relations(mention_x1, mention_y1)
wiki_relations = wiki.extract_all_relations(mention_x1, mention_y1)
embed_relations = embed.extract_all_relations(mention_x1, mention_y1)
wn_relations = wn.extract_all_relations(mention_x1, mention_y1)
You can find the above example in this location: examples/cross_doc_coref/relation_extraction_example.py
Downloading and generating external resources data
This section describes how to download resources required for relation identifiers and how to prepare resources for working locally or with a snapshot of a resource.
Full External Resources
- Referent-Dict, used in ReferentDictRelationExtraction
- Verb-Ocean, used in VerboceanRelationExtraction
- Glove, used in WordEmbeddingRelationExtraction
Generating resource snapshots
Using a large dataset with relation identifiers that work by querying an online resource might take a lot of time due to network latency and overhead. In addition, capturing an online dataset is useful for many train/test tasks that the user might do. For this purpose we included scripts to capture a snapshot (or a subset) of an online resource. The downloaded snapshot can be loaded using the relation identifiers as data input.
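Conceptually, a snapshot is the full resource filtered down to the entries that the mentions of interest can hit. The sketch below shows that idea on a toy dictionary. It is not how the bundled scripts are implemented: make_snapshot, the resource layout, and the entry values are hypothetical, and mentions are assumed to be dictionaries with a tokens_str field, as in the mentions file format described below.

```python
def make_snapshot(resource, mentions):
    """Keep only the resource entries whose key matches a mention string."""
    wanted = {mention["tokens_str"].lower() for mention in mentions}
    return {key: value for key, value in resource.items() if key.lower() in wanted}


# Toy stand-in for a full resource; real resources have their own formats.
toy_resource = {
    "Intel": ["entry-1"],
    "Tara Reid": ["entry-2"],
    "Paris": ["entry-3"],
}
toy_mentions = [{"tokens_str": "intel"}, {"tokens_str": "Tara Reid"}]

snapshot = make_snapshot(toy_resource, toy_mentions)
print(sorted(snapshot))  # ['Intel', 'Tara Reid']
```

Because the snapshot only covers the supplied mentions, it has to be regenerated whenever the mention set changes.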
Each script requires a mentions file in JSON format, as seen below. This file must contain the event or entity mentions that the user is interested in (or the subset of data that needs to be captured):
[
    { # Mention 1
        "tokens_str": "Intel", # Required
        "context": "Intel is the world's second largest and second highest valued semiconductor chip maker" # Optional (used in Elmo)
    },
    { # Mention 2
        "tokens_str": "Tara Reid"
    },
    ...
]
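Standard JSON does not allow comments, so the annotations in the listing above are for explanation only. One way to produce a valid mentions file is to build it with Python's json module, as in this minimal sketch; build_mention and write_mentions are hypothetical helpers, and only the tokens_str and context field names come from the format above.

```python
import json


def build_mention(tokens_str, context=None):
    """Create one mention entry; 'context' is optional and omitted when absent."""
    mention = {"tokens_str": tokens_str}
    if context is not None:
        mention["context"] = context
    return mention


def write_mentions(mentions, path):
    """Write the mention list as a JSON array to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mentions, handle, indent=2, ensure_ascii=False)


mentions = [
    build_mention(
        "Intel",
        context="Intel is the world's second largest and second highest "
                "valued semiconductor chip maker",
    ),
    build_mention("Tara Reid"),
]
```

Calling write_mentions(mentions, "in_mentions.json") then produces a file that can be passed as the --mentions argument of the scripts below.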
Generate Scripts
Generate ReferentDict:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_reference_dict_dump --ref_dict=<ref.dict1.tsv downloaded file> --mentions=<in_mentions.json> --output=<output.json>
Generate VerbOcean:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_verbocean_dump --vo=<verbocean.unrefined.2004-05-20.txt downloaded file> --mentions=<in_mentions.json> --output=<output.json>
Generate WordEmbedding Glove:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_word_embed_glove_dump --mentions=<in_mentions.json> --glove=glove.840B.300d.txt --output=<output.pickle>
Generate Wordnet:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_wordnet_dump --mentions=<in_mentions.json> --output=<output.json>
Generate Wikipedia:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_wiki_dump --mentions=<in_mentions.json> --output=<output.json>
Note
For fast evaluation against live Wikipedia data at run time, you can generate a local ElasticSearch database of the entire Wikipedia site using this resource: Wiki to Elastic. This is highly recommended, since online evaluation against the Wikipedia site can be very slow.
If you adopt a local Elastic database, initialize WikipediaRelationExtraction using WikipediaSearchMethod.ELASTIC.
Generate a Wikipedia snapshot from Elastic data instead of from the online Wikipedia site:
python -m nlp_architect.data.cdc_resources.gen_scripts.create_wiki_dump --mentions=<in_mentions.json> --host=<elastic_host eg:localhost> --port=<elastic_port eg:9200> --index=<elastic_index> --output=<output.json>