nlp_architect.utils package
Submodules
nlp_architect.utils.ansi2html module
nlp_architect.utils.embedding module
class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)[source]
Bases: object
Fasttext embedding trainer class.
Parameters: - texts (List[List[str]]) – list of tokenized sentences
- size (int) – embedding size
- epochs (int, optional) – number of epochs to train
- window (int, optional) – the maximum distance between the current and predicted word within a sentence
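Example (a minimal sketch; only the constructor signature is documented above, so the train method and its texts/epochs arguments are assumptions inferred from the parameter list):
>>> from nlp_architect.utils.embedding import FasttextEmbeddingsModel
>>> texts = [['the', 'cat', 'sat'], ['the', 'dog', 'barked']]  # tokenized sentences
>>> model = FasttextEmbeddingsModel(size=50, window=3, min_count=1, skipgram=True)
>>> model.train(texts, epochs=10)  # assumed training entry point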
nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)[source]
Creates a new matrix from a given matrix of int-encoded words, using the embedding model provided.
Parameters: - src_mat (numpy.ndarray) – source matrix
- src_lex (dict) – source matrix lexicon
- emb_lex (dict) – embedding lexicon
- emb_size (int) – embedding vector size
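Example (a sketch; treating src_lex as an int-to-word lexicon is an assumption):
>>> import numpy as np
>>> from nlp_architect.utils.embedding import fill_embedding_mat
>>> src_mat = np.array([[1, 2], [2, 0]])                   # int-encoded words
>>> src_lex = {1: 'hello', 2: 'world'}                     # assumed int -> word mapping
>>> emb_lex = {'hello': np.ones(3), 'world': np.zeros(3)}  # word -> vector
>>> emb_mat = fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size=3)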
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary, embedding_size: int = None, lowercase_only: bool = False) → numpy.ndarray[source]
Generate a matrix of word embeddings given a vocabulary.
Parameters: - embeddings (dict) – a dictionary of embedding vectors
- vocab (Vocabulary) – a Vocabulary
- embedding_size (int) – custom embedding matrix size
Returns: a 2D numpy matrix of lexicon embeddings
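Example (a minimal sketch using the Vocabulary class documented in nlp_architect.utils.text below):
>>> import numpy as np
>>> from nlp_architect.utils.embedding import get_embedding_matrix
>>> from nlp_architect.utils.text import Vocabulary
>>> vocab = Vocabulary()
>>> _ = vocab.add('hello')
>>> _ = vocab.add('world')
>>> embeddings = {'hello': np.random.rand(50), 'world': np.random.rand(50)}
>>> emb_mat = get_embedding_matrix(embeddings, vocab, embedding_size=50)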
nlp_architect.utils.embedding.load_embedding_file(filename: str, dim: int = None) → dict[source]
Load a word embedding file.
Parameters: filename (str) – path to embedding file
Returns: dictionary with embedding vectors
Return type: dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)[source]
Loads a word embedding model text file into a word (str) to numpy vector dictionary.
Parameters: - file_path (str) – path to model file
- vocab (list of str) – optional - vocabulary
Returns: a dictionary of word to numpy.ndarray vectors, and the detected word embedding vector size (int)
Return type: tuple (dict, int)
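Example (the file path is hypothetical):
>>> from nlp_architect.utils.embedding import load_word_embeddings
>>> vectors, emb_size = load_word_embeddings('embeddings/glove.6B.100d.txt')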
nlp_architect.utils.file_cache module
Utilities for working with the local dataset cache.
nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str[source]
Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
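Example (the URL is hypothetical):
>>> from nlp_architect.utils.file_cache import cached_path
>>> local_file = cached_path('http://example.com/model.bin')  # downloaded and cached
>>> local_file = cached_path('/tmp/model.bin')                # returned as-is if it exists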
nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str][source]
Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.
nlp_architect.utils.generic module
nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray[source]
Add offset (default: 1) to all values in matrix mat.
Parameters: - mat (numpy.ndarray) – A 2D matrix with int values
- offset (int) – offset to add
Returns: input matrix with the offset added
Return type: numpy.ndarray
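Example:
>>> import numpy as np
>>> from nlp_architect.utils.generic import add_offset
>>> add_offset(np.array([[0, 1], [2, 3]]), offset=2)
array([[2, 3],
       [4, 5]])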
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)[source]
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]
Convert a 1D matrix of ints into one-hot encoded vectors.
Parameters: - mat (numpy.ndarray) – A 1D matrix of labels (int)
- num_classes (int) – Number of all possible classes
Returns: A 2D matrix
Return type: numpy.ndarray
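Example (a minimal sketch; the dtype of the returned matrix follows the implementation):
>>> import numpy as np
>>> from nlp_architect.utils.generic import one_hot
>>> encoded = one_hot(np.array([0, 2, 1]), num_classes=3)  # row i one-hot encodes label i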
nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source]
Convert a 2D matrix of ints into a one-hot encoded 3D matrix.
Parameters: - mat (numpy.ndarray) – A 2D matrix of labels (int)
- num_classes (int) – Number of all possible classes
Returns: A 3D matrix
Return type: numpy.ndarray
nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray[source]
Pad input sequences up to max_length; values are aligned to the right.
Parameters: - sequences (iter) – a 2D matrix (np.array) to pad
- max_length (int, optional) – max length of resulting sequences
- padding_value (int, optional) – padding value
- padding_style (str, optional) – add padding values as a prefix (value ‘pre’) or as a postfix (value ‘post’)
Returns: input sequences padded to size ‘max_length’
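Example (a minimal sketch; ragged Python lists are used for brevity):
>>> from nlp_architect.utils.generic import pad_sentences
>>> padded = pad_sentences([[1, 2, 3], [4, 5]], max_length=4, padding_style='post')
>>> # expected shape: (2, 4), e.g. [[1, 2, 3, 0], [4, 5, 0, 0]]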
nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})[source]
nlp_architect.utils.io module
nlp_architect.utils.io.check_directory_and_create(dir_path)[source]
Check if given directory exists, create if not.
Parameters: dir_path (str) – path to directory
nlp_architect.utils.io.download_unlicensed_file(url, sourcefile, destfile, totalsz=None)[source]
Download the file specified by the given URL.
Parameters: - url (str) – url to download from
- sourcefile (str) – file to download from url
- destfile (str) – save path
- totalsz (int, optional) – total size of file
nlp_architect.utils.io.download_unzip(url: str, sourcefile: str, unzipped_path: str, license_msg: str = None)[source]
Downloads a zip file, extracts it to destination, deletes the zip file. If license_msg is supplied, user is prompted for download confirmation.
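Example (URL and paths are hypothetical):
>>> from nlp_architect.utils.io import download_unzip
>>> download_unzip('http://example.com/datasets', 'dataset.zip', './data/dataset')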
nlp_architect.utils.io.gzip_str(g_str)[source]
Compress a string with gzip encoding.
Parameters: g_str (str) – string of data
Returns: GZIP bytes data
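Example:
>>> from nlp_architect.utils.io import gzip_str
>>> compressed = gzip_str('some payload')  # gzip-compressed bytes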
nlp_architect.utils.io.json_dumper(obj)[source]
JSON serializer helper for objects that have members that can’t be serialized directly but implement a toJson() method.
nlp_architect.utils.io.line_count(file)[source]
Utility function for getting the number of lines in a text file.
nlp_architect.utils.io.load_files_from_path(dir_path, extension='txt')[source]
Load all files with the given extension from the given directory.
nlp_architect.utils.io.prepare_output_path(output_dir: str, overwrite_output_dir: str)[source]
Create the output directory, or raise an error if it already exists and overwrite_output_dir is false.
nlp_architect.utils.io.uncompress_file(filepath: str, outpath='.')[source]
Uncompress a file into outpath; the decompression algorithm is chosen by file extension.
Parameters: - filepath (str) – path to file
- outpath (str) – path to extract to
nlp_architect.utils.io.valid_path_append(path, *args)[source]
Helper to validate the passed path directory and append any subsequent filename arguments.
Parameters: - path (str) – initial filesystem path; should expand to a valid directory.
- *args (list, optional) – any filename or path suffixes to append to path for returning.
Returns: (list, str) – path-prepended list of files from args, or path alone if no args specified.
Raises: ValueError – if path is not a valid directory on this filesystem.
nlp_architect.utils.io.validate(*args)[source]
Validate that all arguments are of the correct type and in the correct range.
Parameters: *args (tuple of tuples) – each tuple represents one argument validation, in one of two forms: with a range check, (arg, class, min_val, max_val), or without, (arg, class). If class is a tuple of type objects, arg may be an instance of any of the types. To allow a None-valued argument, include type None. To disable the lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as string), the range check is applied to its length.
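Example:
>>> from nlp_architect.utils.io import validate
>>> batch_size, name = 32, 'tagger'
>>> validate((batch_size, int, 1, 1024),  # with range check
...          (name, str))                 # type check only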
nlp_architect.utils.io.validate_existing_directory(arg)[source]
Validates an input argument is a path string to an existing directory.
nlp_architect.utils.io.validate_existing_filepath(arg)[source]
Validates an input argument is a path string to an existing file.
nlp_architect.utils.io.validate_existing_path(arg)[source]
Validates an input argument is a path string to an existing file or directory.
nlp_architect.utils.io.validate_parent_exists(arg)[source]
Validates an input argument is a path string, and its parent directory exists.
nlp_architect.utils.io.validate_proxy_path(arg)[source]
Validates an input argument is a valid proxy path or None.
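A sketch of how these validators can be wired up, assuming they follow the usual argparse type-callback contract (return the validated value, raise on invalid input); that usage pattern is an assumption:
>>> import argparse
>>> from nlp_architect.utils.io import validate_existing_filepath
>>> parser = argparse.ArgumentParser()
>>> _ = parser.add_argument('--data', type=validate_existing_filepath,
...                         help='path to an existing data file')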
nlp_architect.utils.metrics module
nlp_architect.utils.metrics.accuracy(preds, labels)[source]
Return simple accuracy in the expected dict format.
nlp_architect.utils.metrics.classification_report(y_true, y_pred, digits=2, suffix=False)[source]
Build a text report showing the main classification metrics.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a classifier.
- digits – int. Number of digits for formatting output floating point values.
Returns: string. Text summary of the precision, recall, F1 score for each class.
Return type: report
Examples
>>> from seqeval.metrics import classification_report
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> print(classification_report(y_true, y_pred))
             precision    recall  f1-score   support
<BLANKLINE>
       MISC       0.00      0.00      0.00         1
        PER       1.00      1.00      1.00         1
<BLANKLINE>
  micro avg       0.50      0.50      0.50         2
  macro avg       0.50      0.50      0.50         2
<BLANKLINE>
nlp_architect.utils.metrics.end_of_chunk(prev_tag, tag, prev_type, type_)[source]
Checks if a chunk ended between the previous and current word.
Parameters: - prev_tag – previous chunk tag.
- tag – current chunk tag.
- prev_type – previous type.
- type_ – current type.
Returns: boolean.
Return type: chunk_end
nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')[source]
Get CoNLL-style scores (precision, recall, F1).
nlp_architect.utils.metrics.get_entities(seq, suffix=False)[source]
Gets entities from sequence.
Parameters: seq (list) – sequence of labels.
Returns: list of (chunk_type, chunk_start, chunk_end).
Return type: list
Example
>>> from seqeval.metrics.sequence_labeling import get_entities
>>> seq = ['B-PER', 'I-PER', 'O', 'B-LOC']
>>> get_entities(seq)
[('PER', 0, 1), ('LOC', 3, 3)]
nlp_architect.utils.metrics.pearson_and_spearman(preds, labels)[source]
Get Pearson and Spearman correlation.
nlp_architect.utils.metrics.sequence_accuracy_score(y_true, y_pred)[source]
Accuracy classification score.
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import accuracy_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> accuracy_score(y_true, y_pred)
0.80
nlp_architect.utils.metrics.sequence_f1_score(y_true, y_pred, suffix=False)[source]
Compute the F1 score.
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import f1_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> f1_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.sequence_performance_measure(y_true, y_pred)[source]
Compute the performance metrics: TP, FP, FN, TN.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: dict
Return type: performance_dict
Example
>>> from seqeval.metrics import performance_measure
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'O', 'B-ORG'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'O'], ['B-PER', 'I-PER', 'O']]
>>> performance_measure(y_true, y_pred)
{'TP': 3, 'FP': 3, 'FN': 1, 'TN': 4}
nlp_architect.utils.metrics.sequence_precision_score(y_true, y_pred, suffix=False)[source]
Compute the precision.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The best value is 1 and the worst value is 0.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import precision_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> precision_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.sequence_recall_score(y_true, y_pred, suffix=False)[source]
Compute the recall.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The best value is 1 and the worst value is 0.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import recall_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> recall_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.start_of_chunk(prev_tag, tag, prev_type, type_)[source]
Checks if a chunk started between the previous and current word.
Parameters: - prev_tag – previous chunk tag.
- tag – current chunk tag.
- prev_type – previous type.
- type_ – current type.
Returns: boolean.
Return type: chunk_start
nlp_architect.utils.string_utils module
class nlp_architect.utils.string_utils.StringUtils[source]
Bases: object
determiners = []
static find_head_lemma_pos_ner(x: str)[source]
Parameters: x – mention
Returns: the head word and the head word lemma of the mention
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []
nlp_architect.utils.testing module
nlp_architect.utils.text module
class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True, n_jobs=8, batch_size=1500, spacy_doc=False, show_tok=True, show_doc=True, ptb_pos=False)[source]
Bases: object
Spacy pipeline wrapper which prompts user for model download authorization.
Parameters: - model (str, optional) – spacy model name (default: english small model)
- disable (list of string, optional) – pipeline annotators to disable (default: [])
- display_prompt (bool, optional) – flag to display/skip license prompt
- n_jobs (int, optional) – maximum number of concurrent Python worker processes. If -1 all CPUs are used.
- batch_size (int, optional) – number of docs per batch.
- spacy_doc (bool, optional) – if True, parser outputs spacy.tokens.doc instead of CoreNLPDoc
- show_tok (bool, optional) – include token text in CoreNLPDoc output
- show_doc (bool, optional) – include document text in CoreNLPDoc output
- ptb_pos (bool, optional) – convert spacy POS tags to Penn Treebank tags
parse(texts, output_dir=None)[source]
Parse a list of documents. If more than 1 document is passed, use multi-processing.
Parameters: - texts (list of str) – documents to parse
- output_dir (Path or str, optional) – if given, parsed documents will be written here
parser
Return Spacy’s instance parser.
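Example (a minimal sketch; with spacy_doc=False the parsed output is CoreNLPDoc objects, per the flags documented above):
>>> from nlp_architect.utils.text import SpacyInstance
>>> nlp = SpacyInstance(model='en', disable=['ner'], display_prompt=False)
>>> docs = nlp.parse(['The quick brown fox jumped over the lazy dog.'])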
class nlp_architect.utils.text.Stopwords[source]
Bases: object
Stop words list class.
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)[source]
Bases: object
A vocabulary that maps words to ints (storing a vocabulary).
add(word)[source]
Add word to vocabulary.
Parameters: word (str) – word to add
Returns: id of added word
Return type: int
add_vocab_offset(offset)[source]
Adds an offset to the ints of the vocabulary.
Parameters: offset (int) – an int offset
id_to_word(wid)[source]
Word-id to word (string).
Parameters: wid (int) – word id
Returns: string of given word id
Return type: str
max
reverse_vocab()[source]
Return the vocabulary as a reversed dict object.
Returns: reversed vocabulary object
Return type: dict
vocab
Get the dict object of the vocabulary.
Type: dict
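Example (a minimal sketch; the exact ids assigned depend on start and include_oov):
>>> from nlp_architect.utils.text import Vocabulary
>>> v = Vocabulary(start=0, include_oov=True)
>>> wid = v.add('hello')
>>> v.id_to_word(wid)
'hello'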
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]][source]
Convert a BIO-tagged list of strings into span starts and ends.
Parameters: - text – list of words
- tags – list of tags
Returns: list of start, end and tag of detected spans
Return type: tuple
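Example (a minimal sketch; the start/end index convention follows the implementation):
>>> from nlp_architect.utils.text import bio_to_spans
>>> text = ['John', 'Smith', 'lives', 'in', 'Paris']
>>> tags = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
>>> spans = bio_to_spans(text, tags)  # spans for 'John Smith' (PER) and 'Paris' (LOC)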
nlp_architect.utils.text.char_to_id(c)[source]
Return the int id of a given character; OOV characters map to len(all_letter) + 1.
Parameters: c (str) – string character
Returns: int value of given char
Return type: int
nlp_architect.utils.text.character_vector_generator(data, start=0)[source]
Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary.
Parameters: - data (list) – list of list of strings
- start (int, optional) – vocabulary index start integer
Returns: a 2D numpy array and the constructed Vocabulary
Return type: (np.array, Vocabulary)
nlp_architect.utils.text.extract_nps(annotation_list, text=None)[source]
Extract Noun Phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.
Parameters: - annotation_list (list) – a list of annotation tags in str
- text (list, optional) – a list of token texts in str
Returns: list of start/end markers of noun phrases; if text is provided, a list of noun phrase texts
nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)[source]
Read a tab-separated sequential tagging file. Returns a list of lists of tuples of tags (sentences, words).
Parameters: - file_path (str) – input file path
- ignore_line_patterns (list, optional) – list of string patterns to ignore
Returns: list of list of tuples
nlp_architect.utils.text.simple_normalizer(text)[source]
Simple text normalizer. Runs each token of a phrase through a WordNet lemmatizer and a stemmer.
nlp_architect.utils.text.spacy_normalizer(text, lemma=None)[source]
Simple text normalizer using the spacy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.
Parameters: - text (string) – the text to normalize.
- lemma (string, optional) – lemma of the given text; if given, only the stemmer will run.
nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)[source]
Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary.
Parameters: - data (list) – list of list of strings
- lower (bool, optional) – transform strings into lower case
- start (int, optional) – vocabulary index start integer
Returns: a 2D numpy array and the Vocabulary of the detected words
Return type: (np.array, Vocabulary)
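Example:
>>> from nlp_architect.utils.text import word_vector_generator
>>> sentences = [['hello', 'world'], ['hello', 'there']]
>>> vectors, vocab = word_vector_generator(sentences, lower=True, start=1)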