Spacy-BIST Parser

A raw-text parser built on the spaCy and BIST parsers.

The parser uses spaCy’s English model for sentence breaking, tokenization, and token annotations (part-of-speech, lemma, NER). Dependency relations between tokens are extracted with the BIST parser. The BIST parser is described in [1], and its code is covered in the BIST documentation.

Usage

To use the module, import it as follows:

from nlp_architect.pipelines.spacy_bist import SpacyBISTParser

Training

By default, the parser uses a pre-trained BIST model and spaCy’s English model (en). The pre-trained BIST model is downloaded automatically on first use to spacy_bist/bist-pretrained/ and then loaded from that directory. To use other models, supply a path or link to each model at initialization (see the example below).

For instructions on how to train a BIST model, see the BIST documentation. For instructions on how to obtain or train spaCy models, see the spaCy training instructions.

Example

parser = SpacyBISTParser(spacy_model='/path/or/link/to/spacy/model', bist_model='/path/to/bist/model')

Parsing

The parser accepts a document as a raw UTF-8 text string and returns a CoreNLPDoc instance containing the annotations (see the example output below).

Example

parser = SpacyBISTParser()
parsed_doc = parser.parse(doc_text='First sentence. Second sentence')
print(parsed_doc)

Output

{
    "doc_text": "First sentence. Second sentence",
    "sentences": [
        [
            {
                "start": 0,
                "len": 5,
                "pos": "JJ",
                "ner": "ORDINAL",
                "lemma": "first",
                "gov": 1,
                "rel": "amod",
                "text": "First"
            },
            {
                "start": 6,
                "len": 8,
                "pos": "NN",
                "ner": "",
                "lemma": "sentence",
                "gov": -1,
                "rel": "root",
                "text": "sentence"
            },
            {
                "start": 14,
                "len": 1,
                "pos": ".",
                "ner": "",
                "lemma": ".",
                "gov": 1,
                "rel": "punct",
                "text": "."
            }
        ],
        [
            {
                "start": 16,
                "len": 6,
                "pos": "JJ",
                "ner": "ORDINAL",
                "lemma": "second",
                "gov": 1,
                "rel": "amod",
                "text": "Second"
            },
            {
                "start": 23,
                "len": 8,
                "pos": "NN",
                "ner": "",
                "lemma": "sentence",
                "gov": -1,
                "rel": "root",
                "text": "sentence"
            }
        ]
    ]
}
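
Reading the output

The following is a minimal sketch of how the structure above can be consumed, assuming it has already been loaded into a plain Python dict (for example by parsing the printed JSON). The helper names dependency_edges and offsets_consistent are hypothetical and not part of the library. Judging from the sample output, "gov" appears to be the index of the governing token within the same sentence, with -1 marking the root; the code relies on that reading.

```python
# Sample mirroring the output above, trimmed to the fields used here.
parsed = {
    "doc_text": "First sentence. Second sentence",
    "sentences": [
        [
            {"start": 0, "len": 5, "gov": 1, "rel": "amod", "text": "First"},
            {"start": 6, "len": 8, "gov": -1, "rel": "root", "text": "sentence"},
            {"start": 14, "len": 1, "gov": 1, "rel": "punct", "text": "."},
        ],
        [
            {"start": 16, "len": 6, "gov": 1, "rel": "amod", "text": "Second"},
            {"start": 23, "len": 8, "gov": -1, "rel": "root", "text": "sentence"},
        ],
    ],
}


def dependency_edges(sentence):
    """Return (dependent, relation, governor) triples for one sentence.

    Assumption: "gov" is a sentence-relative token index, and -1 marks the root.
    """
    edges = []
    for token in sentence:
        if token["gov"] == -1:
            governor = "ROOT"
        else:
            governor = sentence[token["gov"]]["text"]
        edges.append((token["text"], token["rel"], governor))
    return edges


def offsets_consistent(doc):
    """Check that each token's start/len slice of doc_text equals its text."""
    text = doc["doc_text"]
    return all(
        text[tok["start"]:tok["start"] + tok["len"]] == tok["text"]
        for sent in doc["sentences"]
        for tok in sent
    )


all_edges = [dependency_edges(sent) for sent in parsed["sentences"]]
for edges in all_edges:
    for dependent, relation, governor in edges:
        print(f"{dependent} --{relation}--> {governor}")
```

The offset check is a cheap sanity test that "start" and "len" are character offsets into doc_text, which is what makes it possible to map annotations back onto the original document.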

References

[1] Kiperwasser, E., & Goldberg, Y. (2016). Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Transactions of the Association for Computational Linguistics, 4, 313-327. https://transacl.org/ojs/index.php/tacl/article/view/885/198