Spacy-BIST Parser
Raw text parser based on the spaCy and BIST parsers
The parser uses spaCy’s English model for sentence breaking, tokenization, and token annotations (part-of-speech tags, lemmas, and named entities). Dependency relations between tokens are extracted with the BIST parser, which is described in Kiperwasser and Goldberg (2016) [1]; its code is covered in the BIST documentation.
Usage
To use the module, import it like so:
from nlp_architect.pipelines.spacy_bist import SpacyBISTParser
Training
By default, the parser uses a pre-trained BIST model and spaCy’s English model (en). The pre-trained BIST model is downloaded automatically on demand to spacy_bist/bist-pretrained/ and then loaded from that directory. To use other models, supply a path or link to each model at initialization (see example below).
For instructions on how to train a BIST model, see the BIST documentation. For instructions on how to obtain or train spaCy models, see the spaCy training instructions.
Example
parser = SpacyBISTParser(spacy_model='/path/or/link/to/spacy/model', bist_model='/path/to/bist/model')
Parsing
The parser accepts a document as a raw text string encoded in UTF-8 and outputs a CoreNLPDoc instance that contains the annotations (example output below).
Example
parser = SpacyBISTParser()
parsed_doc = parser.parse(doc_text='First sentence. Second sentence')
print(parsed_doc)
Output
{
    "doc_text": "First sentence. Second sentence",
    "sentences": [
        [
            {
                "start": 0,
                "len": 5,
                "pos": "JJ",
                "ner": "ORDINAL",
                "lemma": "first",
                "gov": 1,
                "rel": "amod",
                "text": "First"
            },
            {
                "start": 6,
                "len": 8,
                "pos": "NN",
                "ner": "",
                "lemma": "sentence",
                "gov": -1,
                "rel": "root",
                "text": "sentence"
            },
            {
                "start": 14,
                "len": 1,
                "pos": ".",
                "ner": "",
                "lemma": ".",
                "gov": 1,
                "rel": "punct",
                "text": "."
            }
        ],
        [
            {
                "start": 16,
                "len": 6,
                "pos": "JJ",
                "ner": "ORDINAL",
                "lemma": "second",
                "gov": 1,
                "rel": "amod",
                "text": "Second"
            },
            {
                "start": 23,
                "len": 8,
                "pos": "NN",
                "ner": "",
                "lemma": "sentence",
                "gov": -1,
                "rel": "root",
                "text": "sentence"
            }
        ]
    ]
}
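The fields in the output above can be read back with plain Python once the printed document is treated as JSON. The following is a minimal sketch, not part of the parser's API: it assumes the printed form is valid JSON, that "gov" is the index of the governing token within the same sentence (with -1 marking the root), and that "start" and "len" are character offsets into "doc_text". The helper names dependency_edges and spans_match are hypothetical, and the sample embeds only the first sentence of the output.

```python
import json

# Sample data: the first sentence of the output shown above, as JSON text.
SAMPLE = """
{
  "doc_text": "First sentence. Second sentence",
  "sentences": [
    [
      {"start": 0, "len": 5, "pos": "JJ", "ner": "ORDINAL",
       "lemma": "first", "gov": 1, "rel": "amod", "text": "First"},
      {"start": 6, "len": 8, "pos": "NN", "ner": "",
       "lemma": "sentence", "gov": -1, "rel": "root", "text": "sentence"},
      {"start": 14, "len": 1, "pos": ".", "ner": "",
       "lemma": ".", "gov": 1, "rel": "punct", "text": "."}
    ]
  ]
}
"""


def dependency_edges(doc):
    """Return (governor_text, relation, dependent_text) for every token.

    Assumption: "gov" indexes a token in the same sentence, -1 means root.
    """
    edges = []
    for sentence in doc["sentences"]:
        for token in sentence:
            gov = token["gov"]
            governor = "ROOT" if gov == -1 else sentence[gov]["text"]
            edges.append((governor, token["rel"], token["text"]))
    return edges


def spans_match(doc):
    """Check that each token's start/len slice of doc_text equals its text.

    Assumption: "start" and "len" are character offsets into "doc_text".
    """
    text = doc["doc_text"]
    return all(
        text[token["start"]:token["start"] + token["len"]] == token["text"]
        for sentence in doc["sentences"]
        for token in sentence
    )


doc = json.loads(SAMPLE)
for governor, rel, dependent in dependency_edges(doc):
    print(f"{rel}({governor}, {dependent})")
```

For the sample, this prints one line per token, such as amod(sentence, First), and spans_match(doc) confirms that the offsets agree with the token text.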
References
[1] Kiperwasser, E., & Goldberg, Y. (2016). Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics, 4, 313-327. https://transacl.org/ojs/index.php/tacl/article/view/885/198