nlp_architect.utils package

Submodules

nlp_architect.utils.ansi2html module

nlp_architect.utils.ansi2html.ansi2html(text, palette='solarized')[source]
nlp_architect.utils.ansi2html.run(file, out)[source]

nlp_architect.utils.embedding module

class nlp_architect.utils.embedding.ELMoEmbedderTFHUB[source]

Bases: object

get_vector(tokens)[source]
class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)[source]

Bases: object

Fasttext embedding trainer class

Parameters:
  • texts (List[List[str]]) – list of tokenized sentences
  • size (int) – embedding size
  • epochs (int, optional) – number of epochs to train
  • window (int, optional) – the maximum distance between the current and predicted word within a sentence
classmethod load(path)[source]

load model from path

save(path) → None[source]

save model to path

train(texts: List[List[str]], epochs: int = 100)[source]
vec(word: str) → numpy.ndarray[source]

Return the vector corresponding to the given word
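
Example (a minimal sketch; the toy corpus and save path are illustrative placeholders, not part of the library):

>>> from nlp_architect.utils.embedding import FasttextEmbeddingsModel
>>> texts = [['hello', 'world'], ['hello', 'there']]  # placeholder corpus
>>> model = FasttextEmbeddingsModel(size=5, window=3)
>>> model.train(texts, epochs=10)
>>> vector = model.vec('hello')  # numpy.ndarray of size 5
>>> model.save('/tmp/ft.model')  # placeholder path
>>> restored = FasttextEmbeddingsModel.load('/tmp/ft.model')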

nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)[source]

Creates a new matrix from a given matrix of word ids using the provided embedding model.

Parameters:
  • src_mat (numpy.ndarray) – source matrix
  • src_lex (dict) – source matrix lexicon
  • emb_lex (dict) – embedding lexicon
  • emb_size (int) – embedding vector size
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary, embedding_size: int = None, lowercase_only: bool = False) → numpy.ndarray[source]

Generate a matrix of word embeddings given a vocabulary

Parameters:
  • embeddings (dict) – a dictionary of embedding vectors
  • vocab (Vocabulary) – a Vocabulary
  • embedding_size (int) – custom embedding matrix size
Returns:

a 2D numpy matrix of lexicon embeddings
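Example (a sketch using a toy embedding dictionary; the 50-dim vectors are placeholders):

>>> import numpy as np
>>> from nlp_architect.utils.text import Vocabulary
>>> from nlp_architect.utils.embedding import get_embedding_matrix
>>> vocab = Vocabulary()
>>> _ = vocab.add('hello')
>>> embeddings = {'hello': np.ones(50)}  # toy 50-dim embedding dictionary
>>> mat = get_embedding_matrix(embeddings, vocab)  # 2D matrix, one row per vocabulary id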

nlp_architect.utils.embedding.load_embedding_file(filename: str, dim: int = None) → dict[source]

Load a word embedding file

Parameters:filename (str) – path to embedding file
Returns:dictionary with embedding vectors
Return type:dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)[source]

Loads a word embedding model text file into a word (str) to numpy vector dictionary

Parameters:
  • file_path (str) – path to model file
  • vocab (list of str) – optional - vocabulary
Returns:

a word (str) to numpy.ndarray vector dictionary, and the detected word embedding vector size (int)

Return type:

tuple (dict, int)
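
Example (a sketch; the file path is a placeholder, and the two-value unpacking follows the return description above):

>>> from nlp_architect.utils.embedding import load_word_embeddings
>>> vectors, emb_size = load_word_embeddings('embeddings.txt')  # placeholder path
>>> vector = vectors['hello']  # numpy.ndarray of length emb_size, if 'hello' is in the file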

nlp_architect.utils.file_cache module

Utilities for working with the local dataset cache.

nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str[source]

Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
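
Example (a sketch; the URL and local path are placeholders):

>>> from nlp_architect.utils.file_cache import cached_path
>>> path = cached_path('https://example.com/dataset.zip')  # placeholder URL: downloads and caches
>>> path = cached_path('/tmp/dataset.zip')  # local path: checked for existence, returned as-is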

nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str][source]

Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.

nlp_architect.utils.file_cache.get_from_cache(url: str, cache_dir: str = None) → str[source]

Given a URL, look for the corresponding dataset in the local cache. If it’s not there, download it. Then return the path to the cached file.

nlp_architect.utils.file_cache.http_get(url: str, temp_file: IO) → None[source]
nlp_architect.utils.file_cache.url_to_filename(url: str, etag: str = None) → str[source]

Convert url into a hashed filename in a repeatable way. If etag is specified, append its hash to the url’s, delimited by a period.

nlp_architect.utils.generic module

nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray[source]

Add a constant offset (default: +1) to all values in matrix mat

Parameters:
  • mat (numpy.ndarray) – A 2D matrix with int values
  • offset (int) – offset to add
Returns:

the input matrix with the offset added

Return type:

numpy.ndarray

nlp_architect.utils.generic.balance(df)[source]
nlp_architect.utils.generic.license_prompt(model_name, model_website, dataset_dir=None)[source]
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)[source]
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]

Convert a 1D matrix of ints into one-hot encoded vectors.

Parameters:
  • mat (numpy.ndarray) – A 1D matrix of labels (int)
  • num_classes (int) – Number of all possible classes
Returns:

A 2D matrix

Return type:

numpy.ndarray
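
Example (a minimal sketch):

>>> import numpy as np
>>> from nlp_architect.utils.generic import one_hot
>>> labels = np.array([0, 2, 1])
>>> mat = one_hot(labels, num_classes=3)  # shape (3, 3), one row per label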

nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]

Convert a 2D matrix of ints into one-hot encoded 3D matrix

Parameters:
  • mat (numpy.ndarray) – A 2D matrix of labels (int)
  • num_classes (int) – Number of all possible classes
Returns:

A 3D matrix

Return type:

numpy.ndarray

nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray[source]

Pad input sequences up to max_length; values are aligned to the right.

Parameters:
  • sequences (iter) – a 2D matrix (np.array) to pad
  • max_length (int, optional) – max length of resulting sequences
  • padding_value (int, optional) – padding value
  • padding_style (str, optional) – add padding values as a prefix (‘pre’) or as a suffix (‘post’)
Returns:

input sequences padded to size ‘max_length’
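
Example (a sketch assuming variable-length int sequences stored in an object array):

>>> import numpy as np
>>> from nlp_architect.utils.generic import pad_sentences
>>> seqs = np.array([np.array([1, 2, 3]), np.array([4, 5])], dtype=object)
>>> padded = pad_sentences(seqs, max_length=4)  # shape (2, 4); zeros appended ('post' style)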

nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})[source]

nlp_architect.utils.io module

nlp_architect.utils.io.check(validator)[source]
nlp_architect.utils.io.check_directory_and_create(dir_path)[source]

Check if given directory exists, create if not.

Parameters:dir_path (str) – path to directory
nlp_architect.utils.io.check_size(min_size=None, max_size=None)[source]
nlp_architect.utils.io.create_folder(path)[source]
nlp_architect.utils.io.download_unlicensed_file(url, sourcefile, destfile, totalsz=None)[source]

Download the file specified by the given URL.

Parameters:
  • url (str) – url to download from
  • sourcefile (str) – file to download from url
  • destfile (str) – save path
  • totalsz (int, optional) – total size of file
nlp_architect.utils.io.download_unzip(url: str, sourcefile: str, unzipped_path: str, license_msg: str = None)[source]

Downloads a zip file, extracts it to destination, deletes the zip file. If license_msg is supplied, user is prompted for download confirmation.

nlp_architect.utils.io.gzip_str(g_str)[source]

Compress a string into GZIP-encoded bytes

Parameters:g_str (str) – string of data
Returns:GZIP bytes data
nlp_architect.utils.io.json_dumper(obj)[source]

For objects that have members that can’t be serialized but implement a toJson() method

nlp_architect.utils.io.line_count(file)[source]

Utility function for getting number of lines in a text file.

nlp_architect.utils.io.load_files_from_path(dir_path, extension='txt')[source]

load all files from given directory (with given extension)

nlp_architect.utils.io.load_json_file(file_path)[source]

load a file into a json object

nlp_architect.utils.io.prepare_output_path(output_dir: str, overwrite_output_dir: str)[source]

Create the output directory, or raise an error if it exists and overwrite_output_dir is false

nlp_architect.utils.io.sanitize_path(path)[source]
nlp_architect.utils.io.uncompress_file(filepath: str, outpath='.')[source]

Uncompress a file to the same location as filepath; the decompression algorithm is selected by file extension

Parameters:
  • filepath (str) – path to file
  • outpath (str) – path to extract to
nlp_architect.utils.io.valid_path_append(path, *args)[source]

Helper to validate passed path directory and append any subsequent filename arguments.

Parameters:
  • path (str) – Initial filesystem path. Should expand to a valid directory.
  • *args (list, optional) – Any filename or path suffixes to append to path for returning.
Returns:

(list, str): path-prepended list of files from args, or path alone if no args specified.
Raises:

ValueError – if path is not a valid directory on this filesystem.

nlp_architect.utils.io.validate(*args)[source]

Validate that all arguments are of the correct type and in the correct range.

Parameters: *args (tuple of tuples) – each tuple represents one argument validation:
  • With range check: (arg, class, min_val, max_val)
  • Without range check: (arg, class)

If class is a tuple of type objects, arg may be an instance of any of those types. To allow a None-valued argument, include type None. To disable the lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as a string), the range check is applied to its length.
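
Example (a sketch; the argument names are hypothetical):

>>> from nlp_architect.utils.io import validate
>>> batch_size, dropout = 32, 0.5
>>> validate((batch_size, int, 1, 2048), (dropout, float, 0.0, 1.0))  # raises if a check fails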

nlp_architect.utils.io.validate_boolean(arg)[source]

Validates an input argument of type boolean

nlp_architect.utils.io.validate_existing_directory(arg)[source]

Validates an input argument is a path string to an existing directory.

nlp_architect.utils.io.validate_existing_filepath(arg)[source]

Validates an input argument is a path string to an existing file.

nlp_architect.utils.io.validate_existing_path(arg)[source]

Validates an input argument is a path string to an existing file or directory.

nlp_architect.utils.io.validate_parent_exists(arg)[source]

Validates an input argument is a path string, and its parent directory exists.

nlp_architect.utils.io.validate_proxy_path(arg)[source]

Validates an input argument is a valid proxy path or None

nlp_architect.utils.io.walk_directory(directory, verbose=False)[source]

Iterates a directory’s text files and their contents.

nlp_architect.utils.io.zipfile_list(filepath: str)[source]

List the files inside a given zip file

Parameters:filepath (str) – path to file
Returns:String list of filenames

nlp_architect.utils.metrics module

nlp_architect.utils.metrics.acc_and_f1(preds, labels)[source]

Return accuracy and F1 score

nlp_architect.utils.metrics.accuracy(preds, labels)[source]

Return simple accuracy in the expected dict format

nlp_architect.utils.metrics.classification_report(y_true, y_pred, digits=2, suffix=False)[source]

Build a text report showing the main classification metrics.

Parameters:
  • y_true – 2d array. Ground truth (correct) target values.
  • y_pred – 2d array. Estimated targets as returned by a classifier.
  • digits – int. Number of digits for formatting output floating point values.
Returns:

string. Text summary of the precision, recall, F1 score for each class.

Return type:

report

Examples

>>> from seqeval.metrics import classification_report
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
... ['B-PER', 'I-PER', 'O']]
>>> print(classification_report(y_true, y_pred))
             precision    recall  f1-score   support
<BLANKLINE>
       MISC       0.00      0.00      0.00         1
        PER       1.00      1.00      1.00         1
<BLANKLINE>
  micro avg       0.50      0.50      0.50         2
  macro avg       0.50      0.50      0.50         2
<BLANKLINE>
nlp_architect.utils.metrics.end_of_chunk(prev_tag, tag, prev_type, type_)[source]

Checks if a chunk ended between the previous and current word.

Parameters:
  • prev_tag – previous chunk tag.
  • tag – current chunk tag.
  • prev_type – previous type.
  • type – current type.
Returns:

boolean.

Return type:

chunk_end

nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')[source]

Get CoNLL-style scores (precision, recall, F1)

nlp_architect.utils.metrics.get_entities(seq, suffix=False)[source]

Gets entities from sequence.

Parameters:seq (list) – sequence of labels.
Returns:list of (chunk_type, chunk_start, chunk_end).
Return type:list

Example

>>> from seqeval.metrics.sequence_labeling import get_entities
>>> seq = ['B-PER', 'I-PER', 'O', 'B-LOC']
>>> get_entities(seq)
[('PER', 0, 1), ('LOC', 3, 3)]
nlp_architect.utils.metrics.pearson_and_spearman(preds, labels)[source]

Return Pearson and Spearman correlations

nlp_architect.utils.metrics.sequence_accuracy_score(y_true, y_pred)[source]

Accuracy classification score.

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Parameters:
  • y_true – 2d array. Ground truth (correct) target values.
  • y_pred – 2d array. Estimated targets as returned by a tagger.
Returns:

float.

Return type:

score

Example

>>> from seqeval.metrics import accuracy_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
... ['B-PER', 'I-PER', 'O']]
>>> accuracy_score(y_true, y_pred)
0.80
nlp_architect.utils.metrics.sequence_f1_score(y_true, y_pred, suffix=False)[source]

Compute the F1 score.

The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)
Parameters:
  • y_true – 2d array. Ground truth (correct) target values.
  • y_pred – 2d array. Estimated targets as returned by a tagger.
Returns:

float.

Return type:

score

Example

>>> from seqeval.metrics import f1_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
... ['B-PER', 'I-PER', 'O']]
>>> f1_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.sequence_performance_measure(y_true, y_pred)[source]

Compute the performance metrics: TP, FP, FN, TN

Parameters:
  • y_true – 2d array. Ground truth (correct) target values.
  • y_pred – 2d array. Estimated targets as returned by a tagger.
Returns:

dict

Return type:

performance_dict

Example

>>> from seqeval.metrics import performance_measure
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'O', 'B-ORG'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'O'], ['B-PER', 'I-PER', 'O']]
>>> performance_measure(y_true, y_pred)
{'TP': 3, 'FP': 3, 'FN': 1, 'TN': 4}
nlp_architect.utils.metrics.sequence_precision_score(y_true, y_pred, suffix=False)[source]

Compute the precision.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The best value is 1 and the worst value is 0.

Parameters:
  • y_true – 2d array. Ground truth (correct) target values.
  • y_pred – 2d array. Estimated targets as returned by a tagger.
Returns:

float.

Return type:

score

Example

>>> from seqeval.metrics import precision_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
... ['B-PER', 'I-PER', 'O']]
>>> precision_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.sequence_recall_score(y_true, y_pred, suffix=False)[source]

Compute the recall.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The best value is 1 and the worst value is 0.

Parameters:
  • y_true – 2d array. Ground truth (correct) target values.
  • y_pred – 2d array. Estimated targets as returned by a tagger.
Returns:

float.

Return type:

score

Example

>>> from seqeval.metrics import recall_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
... ['B-PER', 'I-PER', 'O']]
>>> recall_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.simple_accuracy(preds, labels)[source]

Return simple accuracy

nlp_architect.utils.metrics.start_of_chunk(prev_tag, tag, prev_type, type_)[source]

Checks if a chunk started between the previous and current word.

Parameters:
  • prev_tag – previous chunk tag.
  • tag – current chunk tag.
  • prev_type – previous type.
  • type – current type.
Returns:

boolean.

Return type:

chunk_start

nlp_architect.utils.metrics.tagging(preds, labels)[source]

nlp_architect.utils.string_utils module

class nlp_architect.utils.string_utils.StringUtils[source]

Bases: object

determiners = []
static find_head_lemma_pos_ner(x: str)[source]

Parameters:x – mention
Returns:the head word and the head word lemma of the mention
static is_determiner(in_str: str) → bool[source]
static is_preposition(in_str: str) → bool[source]
static is_pronoun(in_str: str) → bool[source]
static is_stop(token: str) → bool[source]
static normalize_str(in_str: str) → str[source]
static normalize_string_list(str_list: str) → List[str][source]
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []

nlp_architect.utils.testing module

class nlp_architect.utils.testing.NLPArchitectTestCase(methodName='runTest')[source]

Bases: unittest.case.TestCase

setUp()[source]

Hook method for setting up the test fixture before exercising it.

tearDown()[source]

Hook method for deconstructing the test fixture after testing it.

nlp_architect.utils.text module

class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True, n_jobs=8, batch_size=1500, spacy_doc=False, show_tok=True, show_doc=True, ptb_pos=False)[source]

Bases: object

Spacy pipeline wrapper which prompts user for model download authorization.

Parameters:
  • model (str, optional) – spacy model name (default: english small model)
  • disable (list of string, optional) – pipeline annotators to disable (default: [])
  • display_prompt (bool, optional) – flag to display/skip license prompt
  • n_jobs (int, optional) – maximum number of concurrent Python worker processes. If -1 all CPUs are used.
  • batch_size (int, optional) – number of docs per batch.
  • spacy_doc (bool, optional) – if True, parser outputs spacy.tokens.doc instead of CoreNLPDoc
  • show_tok (bool, optional) – include token text in CoreNLPDoc output
  • show_doc (bool, optional) – include document text in CoreNLPDoc output
  • ptb_pos (bool, optional) – convert spacy POS tags to Penn Treebank tags
parse(texts, output_dir=None)[source]

Parse a list of documents. If more than 1 document is passed, use multi-processing.

Parameters:
  • texts (list of str) – documents to parse
  • output_dir (Path or str, optional) – if given, parsed documents will be written here
parser

Return the instance’s spaCy parser

process_batch(texts, output_dir=None, batch_id=0)[source]
tokenize(text: str) → List[str][source]

Tokenize a sentence into tokens

Parameters:text (str) – text to tokenize

Returns:a list of str tokens of input
Return type:list
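
Example (a sketch; requires a downloadable spaCy English model, and the disabled pipeline component names are assumptions):

>>> from nlp_architect.utils.text import SpacyInstance
>>> nlp = SpacyInstance(disable=['tagger', 'parser', 'ner'])
>>> nlp.tokenize('The quick brown fox')
['The', 'quick', 'brown', 'fox']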
class nlp_architect.utils.text.Stopwords[source]

Bases: object

Stop words list class.

static get_words()[source]
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)[source]

Bases: object

A vocabulary that maps words to int ids

add(word)[source]

Add word to vocabulary

Parameters:word (str) – word to add
Returns:id of added word
Return type:int
add_vocab_offset(offset)[source]

Adds an offset to the ints of the vocabulary

Parameters:offset (int) – an int offset
id_to_word(wid)[source]

Word-id to word (string)

Parameters:wid (int) – word id
Returns:string of given word id
Return type:str
max
reverse_vocab()[source]

Return the vocabulary as a reversed dict object

Returns:reversed vocabulary object
Return type:dict
vocab

get the dict object of the vocabulary

Type:dict
word_id(word)[source]

Get the word_id of given word

Parameters:word (str) – word from vocabulary
Returns:int id of word
Return type:int
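
Example (a minimal sketch):

>>> from nlp_architect.utils.text import Vocabulary
>>> v = Vocabulary()
>>> wid = v.add('hello')
>>> v.id_to_word(wid)
'hello'
>>> v.word_id('hello') == wid
True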
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]][source]

Convert a BIO-tagged list of strings into span starts and ends

Parameters:
  • text (List[str]) – list of words
  • tags (List[str]) – list of tags

Returns:list of start, end and tag of detected spans
Return type:tuple
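
Example (a sketch; exact span index conventions follow the implementation):

>>> from nlp_architect.utils.text import bio_to_spans
>>> text = ['John', 'Smith', 'lives', 'in', 'London']
>>> tags = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
>>> spans = bio_to_spans(text, tags)  # a PER span over tokens 0-1 and a LOC span at token 4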
nlp_architect.utils.text.char_to_id(c)[source]

Return the int id of a given character; the OOV char id is len(all_letter) + 1.

Parameters:c (str) – string character
Returns:int value of given char
Return type:int
nlp_architect.utils.text.character_vector_generator(data, start=0)[source]

Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary

Parameters:
  • data (list) – list of list of strings
  • start (int, optional) – vocabulary index start integer
Returns:

a 2D numpy array of character ids, and the constructed Vocabulary

Return type:

tuple (np.array, Vocabulary)

nlp_architect.utils.text.extract_nps(annotation_list, text=None)[source]

Extract Noun Phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.

Parameters:
  • annotation_list (list) – a list of annotation tags in str
  • text (list, optional) – a list of token texts in str
Returns:

a list of start/end markers of noun phrases; if text is provided, also a list of noun phrase texts
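
Example (a sketch assuming BIO-style 'B-NP'/'I-NP' annotation tags):

>>> from nlp_architect.utils.text import extract_nps
>>> annotations = ['B-NP', 'I-NP', 'O', 'B-NP']
>>> tokens = ['The', 'fox', 'jumped', 'high']
>>> nps = extract_nps(annotations, tokens)  # start/end markers plus NP texts, since text was given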

nlp_architect.utils.text.id_to_char(c_id)[source]

return character of given char id

nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)[source]

Read a tab-separated sequential tagging file. Returns a list of sentences, where each sentence is a list of token tuples (a word and its tags).

Parameters:
  • file_path (str) – input file path
  • ignore_line_patterns (list, optional) – list of string patterns to ignore
Returns:

list of list of tuples

nlp_architect.utils.text.simple_normalizer(text)[source]

Simple text normalizer. Runs each token of a phrase through the WordNet lemmatizer and a stemmer.

nlp_architect.utils.text.spacy_normalizer(text, lemma=None)[source]

Simple text normalizer using the spaCy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.

Parameters:
  • text (string) – the text to normalize
  • lemma (string, optional) – lemma of the given text; if given, only the stemmer will run

nlp_architect.utils.text.try_to_load_spacy(model_name)[source]
nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)[source]

Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary

Parameters:
  • data (list) – list of list of strings
  • lower (bool, optional) – transform strings into lower case
  • start (int, optional) – vocabulary index start integer
Returns:

2D numpy array and Vocabulary of the detected words
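
Example (a minimal sketch; the two-value unpacking follows the return description above):

>>> from nlp_architect.utils.text import word_vector_generator
>>> data = [['Hello', 'world'], ['hello']]
>>> vecs, vocab = word_vector_generator(data, lower=True, start=1)  # int vectors + Vocabulary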

Module contents