nlp_architect.utils package
Submodules
nlp_architect.utils.ansi2html module
nlp_architect.utils.embedding module
class nlp_architect.utils.embedding.FasttextEmbeddingsModel(size: int = 5, window: int = 3, min_count: int = 1, skipgram: bool = True)[source]
Bases: object
Fasttext embedding trainer class.
Parameters: - texts (List[List[str]]) – list of tokenized sentences
- size (int) – embedding size
- epochs (int, optional) – number of epochs to train
- window (int, optional) – the maximum distance between the current and predicted word within a sentence
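Example
A minimal construction sketch. This page does not list the class's methods, so the train call below (suggested by the documented texts and epochs parameters) is hypothetical.
>>> from nlp_architect.utils.embedding import FasttextEmbeddingsModel
>>> sentences = [['hello', 'world'], ['another', 'tokenized', 'sentence']]
>>> model = FasttextEmbeddingsModel(size=5, window=3, min_count=1, skipgram=True)
>>> model.train(sentences, epochs=10)  # hypothetical method name, see note above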
nlp_architect.utils.embedding.fill_embedding_mat(src_mat, src_lex, emb_lex, emb_size)[source] Creates a new matrix from a given matrix of word ints, using the provided embedding model.
Parameters: - src_mat (numpy.ndarray) – source matrix
- src_lex (dict) – source matrix lexicon
- emb_lex (dict) – embedding lexicon
- emb_size (int) – embedding vector size
nlp_architect.utils.embedding.get_embedding_matrix(embeddings: dict, vocab: nlp_architect.utils.text.Vocabulary, embedding_size: int = None, lowercase_only: bool = False) → numpy.ndarray[source] Generate a matrix of word embeddings given a vocabulary.
Parameters: - embeddings (dict) – a dictionary of embedding vectors
- vocab (Vocabulary) – a Vocabulary
- embedding_size (int) – custom embedding matrix size
Returns: a 2D numpy matrix of lexicon embeddings
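Example
A minimal sketch combining this with load_embedding_file and Vocabulary, both documented on this page; 'embeddings.txt' is a placeholder path, and the row-layout comment is an assumption.
>>> from nlp_architect.utils.embedding import load_embedding_file, get_embedding_matrix
>>> from nlp_architect.utils.text import Vocabulary
>>> vocab = Vocabulary()
>>> for word in ['hello', 'world']:
...     _ = vocab.add(word)
>>> embeddings = load_embedding_file('embeddings.txt')
>>> mat = get_embedding_matrix(embeddings, vocab)  # rows assumed indexed by word id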
nlp_architect.utils.embedding.load_embedding_file(filename: str, dim: int = None) → dict[source] Load a word embedding file.
Parameters: filename (str) – path to embedding file
Returns: dictionary with embedding vectors
Return type: dict
nlp_architect.utils.embedding.load_word_embeddings(file_path, vocab=None)[source] Loads a word embedding model text file into a word (str) to numpy vector dictionary.
Parameters: - file_path (str) – path to model file
- vocab (list of str) – optional - vocabulary
Returns: a dictionary of numpy.ndarray vectors, and the detected word embedding vector size (int)
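Example
A minimal sketch; 'model.txt' is a placeholder path, and the unpacking follows the Returns description above.
>>> from nlp_architect.utils.embedding import load_word_embeddings
>>> word_vectors, emb_size = load_word_embeddings('model.txt')
>>> len(word_vectors), emb_size  # number of words, detected vector size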
nlp_architect.utils.file_cache module
Utilities for working with the local dataset cache.
nlp_architect.utils.file_cache.cached_path(url_or_filename: Union[str, pathlib.Path], cache_dir: str = None) → str[source] Given something that might be a URL (or might be a local path), determine which. If it’s a URL, download the file and cache it, and return the path to the cached file. If it’s already a local path, make sure the file exists and then return the path.
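Example
A minimal sketch of the two behaviors described above; the URL is a placeholder.
>>> from nlp_architect.utils.file_cache import cached_path
>>> local_copy = cached_path('http://example.com/model.bin')  # URL: download and cache
>>> cached_path(local_copy)  # local path: verify existence, return as-is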
nlp_architect.utils.file_cache.filename_to_url(filename: str, cache_dir: str = None) → Tuple[str, str][source] Return the url and etag (which may be None) stored for filename. Raise FileNotFoundError if filename or its stored metadata do not exist.
nlp_architect.utils.generic module
nlp_architect.utils.generic.add_offset(mat: numpy.ndarray, offset: int = 1) → numpy.ndarray[source] Add offset (default: +1) to all values in matrix mat.
Parameters: - mat (numpy.ndarray) – a 2D matrix with int values
- offset (int) – offset to add
Returns: the input matrix with the offset added
Return type: numpy.ndarray
nlp_architect.utils.generic.normalize(txt, vocab=None, replace_char=' ', max_length=300, pad_out=True, to_lower=True, reverse=False, truncate_left=False, encoding=None)[source]
nlp_architect.utils.generic.one_hot(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source] Convert a 1D matrix of ints into one-hot encoded vectors.
Parameters: - mat (numpy.ndarray) – A 1D matrix of labels (int)
- num_classes (int) – Number of all possible classes
Returns: A 2D matrix
Return type: numpy.ndarray
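Example
A minimal sketch; the output shown assumes the documented behavior (row i is the one-hot vector of label i) and an illustrative dtype.
>>> import numpy as np
>>> from nlp_architect.utils.generic import one_hot
>>> one_hot(np.array([0, 2, 1]), num_classes=3)
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])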
nlp_architect.utils.generic.one_hot_sentence(mat: numpy.ndarray, num_classes: int) → numpy.ndarray[source] Convert a 2D matrix of ints into one-hot encoded 3D matrix
Parameters: - mat (numpy.ndarray) – A 2D matrix of labels (int)
- num_classes (int) – Number of all possible classes
Returns: A 3D matrix
Return type: numpy.ndarray
nlp_architect.utils.generic.pad_sentences(sequences: numpy.ndarray, max_length: int = None, padding_value: int = 0, padding_style='post') → numpy.ndarray[source] Pad input sequences up to max_length; values are aligned to the right.
Parameters: - sequences (iter) – a 2D matrix (np.array) to pad
- max_length (int, optional) – max length of resulting sequences
- padding_value (int, optional) – padding value
- padding_style (str, optional) – add padding values as a prefix ('pre') or as a suffix ('post')
Returns: input sequences padded to size ‘max_length’
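Example
A minimal sketch; the output shown assumes the default 'post' padding style, with ragged input given as a list of 1D arrays for brevity.
>>> import numpy as np
>>> from nlp_architect.utils.generic import pad_sentences
>>> pad_sentences([np.array([1, 2, 3]), np.array([4, 5])], max_length=4)
array([[1, 2, 3, 0],
       [4, 5, 0, 0]])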
nlp_architect.utils.generic.to_one_hot(txt, vocab={'!': 40, '#': 49, '$': 50, '%': 51, '&': 53, '(': 61, ')': 62, '*': 54, '+': 57, ', ': 37, '-': 36, '.': 39, '/': 44, '0': 26, '1': 27, '2': 28, '3': 29, '4': 30, '5': 31, '6': 32, '7': 33, '8': 34, '9': 35, ':': 42, ';': 38, '<': 59, '=': 58, '>': 60, '?': 41, '@': 48, '[': 63, '\\': 45, ']': 64, '_': 47, 'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 'q': 16, 'r': 17, 's': 18, 't': 19, 'u': 20, 'v': 21, 'w': 22, 'x': 23, 'y': 24, 'z': 25, '{': 65, '|': 46, '}': 66, 'ˆ': 52, '˜': 55, '‘': 56, '’': 43})[source]
nlp_architect.utils.io module
nlp_architect.utils.io.check_directory_and_create(dir_path)[source] Check if given directory exists, create if not.
Parameters: dir_path (str) – path to directory
nlp_architect.utils.io.download_unlicensed_file(url, sourcefile, destfile, totalsz=None)[source] Download the file specified by the given URL.
Parameters: - url (str) – url to download from
- sourcefile (str) – file to download from url
- destfile (str) – save path
- totalsz (int, optional) – total size of file
nlp_architect.utils.io.download_unzip(url: str, sourcefile: str, unzipped_path: str, license_msg: str = None)[source] Downloads a zip file, extracts it to destination, deletes the zip file. If license_msg is supplied, user is prompted for download confirmation.
nlp_architect.utils.io.gzip_str(g_str)[source] Compress a string into gzip-encoded bytes.
Parameters: g_str (str) – string of data
Returns: gzip-compressed bytes
nlp_architect.utils.io.json_dumper(obj)[source] JSON serialization helper for objects whose members cannot be serialized directly and that implement a toJson() method.
nlp_architect.utils.io.line_count(file)[source] Utility function for getting number of lines in a text file.
nlp_architect.utils.io.load_files_from_path(dir_path, extension='txt')[source] Load all files from the given directory (with the given extension).
nlp_architect.utils.io.prepare_output_path(output_dir: str, overwrite_output_dir: str)[source] Create the output directory, or raise an error if it exists and overwrite_output_dir is false.
nlp_architect.utils.io.uncompress_file(filepath: str, outpath='.')[source] Uncompress a file into outpath; the decompression algorithm is selected by file extension.
Parameters: - filepath (str) – path to file
- outpath (str) – path to extract to
nlp_architect.utils.io.valid_path_append(path, *args)[source] Helper to validate passed path directory and append any subsequent filename arguments.
Parameters: - path (str) – initial filesystem path. Should expand to a valid directory.
- *args (list, optional) – any filename or path suffixes to append to path for returning.
Returns: (list, str) – path-prepended list of files from args, or path alone if no args specified.
Raises: ValueError – if path is not a valid directory on this filesystem.
nlp_architect.utils.io.validate(*args)[source] Validate that all arguments are of the correct type and in the correct range.
Parameters: *args (tuple of tuples) – each tuple represents an argument validation: with range check, (arg, class, min_val, max_val); without range check, (arg, class). If class is a tuple of type objects, check whether arg is an instance of any of the types. To allow a None-valued argument, include type None. To disable the lower or upper bound check, set min_val or max_val to None, respectively. If arg has the len attribute (such as a string), the range is checked against its length.
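Example
A minimal sketch of the two tuple forms described above; batch_size and lr are illustrative local variables.
>>> from nlp_architect.utils.io import validate
>>> batch_size, lr = 32, 0.001
>>> validate((batch_size, int, 1, 100000), (lr, float, 0, 1))  # with range check
>>> validate(('out.txt', str))  # type check only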
nlp_architect.utils.io.validate_existing_directory(arg)[source] Validates an input argument is a path string to an existing directory.
nlp_architect.utils.io.validate_existing_filepath(arg)[source] Validates an input argument is a path string to an existing file.
nlp_architect.utils.io.validate_existing_path(arg)[source] Validates an input argument is a path string to an existing file or directory.
nlp_architect.utils.io.validate_parent_exists(arg)[source] Validates an input argument is a path string, and its parent directory exists.
nlp_architect.utils.io.validate_proxy_path(arg)[source] Validates an input argument is a valid proxy path or None.
nlp_architect.utils.metrics module
nlp_architect.utils.metrics.accuracy(preds, labels)[source] Return simple accuracy in the expected dict format.
nlp_architect.utils.metrics.classification_report(y_true, y_pred, digits=2, suffix=False)[source] Build a text report showing the main classification metrics.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a classifier.
- digits – int. Number of digits for formatting output floating point values.
Returns: string. Text summary of the precision, recall, F1 score for each class.
Return type: report
Examples
>>> from seqeval.metrics import classification_report
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> print(classification_report(y_true, y_pred))
             precision    recall  f1-score   support
<BLANKLINE>
       MISC       0.00      0.00      0.00         1
        PER       1.00      1.00      1.00         1
<BLANKLINE>
  micro avg       0.50      0.50      0.50         2
  macro avg       0.50      0.50      0.50         2
<BLANKLINE>
nlp_architect.utils.metrics.end_of_chunk(prev_tag, tag, prev_type, type_)[source] Checks if a chunk ended between the previous and current word.
Parameters: - prev_tag – previous chunk tag.
- tag – current chunk tag.
- prev_type – previous type.
- type – current type.
Returns: boolean.
Return type: chunk_end
nlp_architect.utils.metrics.get_conll_scores(predictions, y, y_lex, unk='O')[source] Get CoNLL-style scores (precision, recall, F1).
nlp_architect.utils.metrics.get_entities(seq, suffix=False)[source] Gets entities from sequence.
Parameters: seq (list) – sequence of labels.
Returns: list of (chunk_type, chunk_start, chunk_end).
Return type: list
Example
>>> from seqeval.metrics.sequence_labeling import get_entities
>>> seq = ['B-PER', 'I-PER', 'O', 'B-LOC']
>>> get_entities(seq)
[('PER', 0, 1), ('LOC', 3, 3)]
nlp_architect.utils.metrics.pearson_and_spearman(preds, labels)[source] Get Pearson and Spearman correlation.
nlp_architect.utils.metrics.sequence_accuracy_score(y_true, y_pred)[source] Accuracy classification score.
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import accuracy_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> accuracy_score(y_true, y_pred)
0.80
nlp_architect.utils.metrics.sequence_f1_score(y_true, y_pred, suffix=False)[source] Compute the F1 score.
The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import f1_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> f1_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.sequence_performance_measure(y_true, y_pred)[source] Compute the performance metrics: TP, FP, FN, TN
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: dict
Return type: performance_dict
Example
>>> from seqeval.metrics import performance_measure
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'O', 'B-ORG'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O', 'O'], ['B-PER', 'I-PER', 'O']]
>>> performance_measure(y_true, y_pred)
(3, 3, 1, 4)
nlp_architect.utils.metrics.sequence_precision_score(y_true, y_pred, suffix=False)[source] Compute the precision.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import precision_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> precision_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.sequence_recall_score(y_true, y_pred, suffix=False)[source] Compute the recall.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.
Parameters: - y_true – 2d array. Ground truth (correct) target values.
- y_pred – 2d array. Estimated targets as returned by a tagger.
Returns: float.
Return type: score
Example
>>> from seqeval.metrics import recall_score
>>> y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'],
...           ['B-PER', 'I-PER', 'O']]
>>> recall_score(y_true, y_pred)
0.50
nlp_architect.utils.metrics.start_of_chunk(prev_tag, tag, prev_type, type_)[source] Checks if a chunk started between the previous and current word.
Parameters: - prev_tag – previous chunk tag.
- tag – current chunk tag.
- prev_type – previous type.
- type – current type.
Returns: boolean.
Return type: chunk_start
nlp_architect.utils.string_utils module
class nlp_architect.utils.string_utils.StringUtils[source]
Bases: object
determiners = []
static find_head_lemma_pos_ner(x: str)[source]
Parameters: x – mention
Returns: the head word and the head word lemma of the mention
preposition = []
pronouns = []
spacy_no_parser = <nlp_architect.utils.text.SpacyInstance object>
spacy_parser = <nlp_architect.utils.text.SpacyInstance object>
stop_words = []
nlp_architect.utils.testing module
nlp_architect.utils.text module
class nlp_architect.utils.text.SpacyInstance(model='en', disable=None, display_prompt=True, n_jobs=8, batch_size=1500, spacy_doc=False, show_tok=True, show_doc=True, ptb_pos=False)[source]
Bases: object
Spacy pipeline wrapper which prompts the user for model download authorization.
Parameters: - model (str, optional) – spacy model name (default: english small model)
- disable (list of string, optional) – pipeline annotators to disable (default: [])
- display_prompt (bool, optional) – flag to display/skip license prompt
- n_jobs (int, optional) – maximum number of concurrent Python worker processes. If -1 all CPUs are used.
- batch_size (int, optional) – number of docs per batch.
- spacy_doc (bool, optional) – if True, parser outputs spacy.tokens.doc instead of CoreNLPDoc
- show_tok (bool, optional) – include token text in CoreNLPDoc output
- show_doc (bool, optional) – include document text in CoreNLPDoc output
- ptb_pos (bool, optional) – convert spacy POS tags to Penn Treebank tags
parse(texts, output_dir=None)[source] Parse a list of documents. If more than 1 document is passed, use multi-processing.
Parameters: - texts (list of str) – documents to parse
- output_dir (Path or str, optional) – if given, parsed documents will be written here
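Example
A minimal sketch; the spacy model is downloaded on first use (set display_prompt=False to skip the license prompt).
>>> from nlp_architect.utils.text import SpacyInstance
>>> nlp = SpacyInstance(model='en', disable=['ner'], display_prompt=False)
>>> docs = nlp.parse(['The quick brown fox jumped over the lazy dog.'])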
parser
Return Spacy's instance parser
class nlp_architect.utils.text.Stopwords[source]
Bases: object
Stop words list class.
stop_words = []
class nlp_architect.utils.text.Vocabulary(start=0, include_oov=True)[source]
Bases: object
A vocabulary that maps words to ints.
add(word)[source] Add word to vocabulary.
Parameters: word (str) – word to add
Returns: id of added word
Return type: int
add_vocab_offset(offset)[source] Adds an offset to the ints of the vocabulary
Parameters: offset (int) – an int offset
id_to_word(wid)[source] Word-id to word (string).
Parameters: wid (int) – word id
Returns: string of given word id
Return type: str
max
reverse_vocab()[source] Return the vocabulary as a reversed dict object.
Returns: reversed vocabulary object
Return type: dict
vocab
Get the dict object of the vocabulary
Type: dict
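Example
A minimal sketch using only the methods documented above.
>>> from nlp_architect.utils.text import Vocabulary
>>> v = Vocabulary(start=1)
>>> wid = v.add('hello')
>>> v.id_to_word(wid)
'hello'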
nlp_architect.utils.text.bio_to_spans(text: List[str], tags: List[str]) → List[Tuple[int, int, str]][source] Convert a BIO-tagged list of strings into span starts and ends.
Parameters: - text (List[str]) – list of words
- tags (List[str]) – list of tags
Returns: list of start, end and tag of detected spans
Return type: list of tuples
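Example
A minimal sketch; whether span ends are inclusive or exclusive is not specified on this page, so the exact output is omitted.
>>> from nlp_architect.utils.text import bio_to_spans
>>> spans = bio_to_spans(['John', 'Smith', 'visited', 'Paris'],
...                      ['B-PER', 'I-PER', 'O', 'B-LOC'])  # one PER span, one LOC span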
nlp_architect.utils.text.char_to_id(c)[source] Return the int id of a given character; OOV char = len(all_letter) + 1.
Parameters: c (str) – string character
Returns: int value of given char
Return type: int
nlp_architect.utils.text.character_vector_generator(data, start=0)[source] Character word vector generator util. Transforms a list of sentences into numpy int vectors of the characters of the words of the sentence, and returns the constructed vocabulary
Parameters: - data (list) – list of list of strings
- start (int, optional) – vocabulary index start integer
Returns: a 2D numpy array and the constructed Vocabulary
Return type: (np.array, Vocabulary)
nlp_architect.utils.text.extract_nps(annotation_list, text=None)[source] Extract Noun Phrases from given text tokens and phrase annotations. Returns a list of tuples with start/end indexes.
Parameters: - annotation_list (list) – a list of annotation tags in str
- text (list, optional) – a list of token texts in str
Returns: list of start/end markers of noun phrases, if text is provided a list of noun phrase texts
nlp_architect.utils.text.read_sequential_tagging_file(file_path, ignore_line_patterns=None)[source] Read a tab-separated sequential tagging file. Returns a list of sentences, each a list of per-word tag tuples.
Parameters: - file_path (str) – input file path
- ignore_line_patterns (list, optional) – list of string patterns to ignore
Returns: list of list of tuples
nlp_architect.utils.text.simple_normalizer(text)[source] Simple text normalizer. Runs each token of a phrase through the WordNet lemmatizer and a stemmer.
nlp_architect.utils.text.spacy_normalizer(text, lemma=None)[source] Simple text normalizer using the spacy lemmatizer. Runs each token of a phrase through a lemmatizer and a stemmer.
Parameters: - text (string) – the text to normalize
- lemma (string, optional) – lemma of the given text; if given, only the stemmer will run
nlp_architect.utils.text.word_vector_generator(data, lower=False, start=0)[source] Word vector generator util. Transforms a list of sentences into numpy int vectors and returns the constructed vocabulary
Parameters: - data (list) – list of list of strings
- lower (bool, optional) – transform strings into lower case
- start (int, optional) – vocabulary index start integer
Returns: 2D numpy array and Vocabulary of the detected words
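Example
A minimal sketch; per the Returns description, the call yields an int matrix and the Vocabulary of detected words.
>>> from nlp_architect.utils.text import word_vector_generator
>>> vectors, vocab = word_vector_generator([['Hello', 'world'], ['hello', 'again']], lower=True, start=1)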