nlp_architect.data package
Subpackages
- nlp_architect.data.cdc_resources package
- Subpackages
- nlp_architect.data.cdc_resources.data_types package
- nlp_architect.data.cdc_resources.embedding package
- nlp_architect.data.cdc_resources.gen_scripts package
- Submodules
- nlp_architect.data.cdc_resources.gen_scripts.create_reference_dict_dump module
- nlp_architect.data.cdc_resources.gen_scripts.create_verbocean_dump module
- nlp_architect.data.cdc_resources.gen_scripts.create_wiki_dump module
- nlp_architect.data.cdc_resources.gen_scripts.create_word_embed_elmo_dump module
- nlp_architect.data.cdc_resources.gen_scripts.create_word_embed_glove_dump module
- nlp_architect.data.cdc_resources.gen_scripts.create_wordnet_dump module
- Module contents
- nlp_architect.data.cdc_resources.relations package
- Submodules
- nlp_architect.data.cdc_resources.relations.computed_relation_extraction module
- nlp_architect.data.cdc_resources.relations.referent_dict_relation_extraction module
- nlp_architect.data.cdc_resources.relations.relation_extraction module
- nlp_architect.data.cdc_resources.relations.relation_types_enums module
- nlp_architect.data.cdc_resources.relations.verbocean_relation_extraction module
- nlp_architect.data.cdc_resources.relations.wikipedia_relation_extraction module
- nlp_architect.data.cdc_resources.relations.within_doc_coref_extraction module
- nlp_architect.data.cdc_resources.relations.word_embedding_relation_extraction module
- nlp_architect.data.cdc_resources.relations.wordnet_relation_extraction module
- Module contents
- nlp_architect.data.cdc_resources.wikipedia package
- nlp_architect.data.cdc_resources.wordnet package
- Module contents
Submodules
nlp_architect.data.conll module
nlp_architect.data.fasttext_emb module
- class nlp_architect.data.fasttext_emb.Dictionary(id2word, word2id, lang)[source]
  Bases: object
  Merges the id2word and word2id dictionaries.
  Parameters:
  - id2word (dict) – index-to-word dictionary
  - word2id (dict) – word-to-index dictionary
  - lang – language of the dictionary
  Usage:
  - dico.index(word) – returns the index of a word
  - dico[index] – returns the word at an index
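A minimal usage sketch of the two documented accessors; the toy id2word/word2id mappings below are illustrative (in practice they come from FastTextEmb):

```python
from nlp_architect.data.fasttext_emb import Dictionary

# toy vocabulary, for illustration only
id2word = {0: "hello", 1: "world"}
word2id = {w: i for i, w in id2word.items()}
dico = Dictionary(id2word, word2id, "en")

assert dico.index("hello") == 0   # word -> index
assert dico[1] == "world"         # index -> word
```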
- class nlp_architect.data.fasttext_emb.FastTextEmb(path, language, vocab_size, emb_dim=300)[source]
  Bases: object
  Downloads FastText embeddings for a given language to the given path.
  Parameters:
  - path (str) – local path to copy the embeddings to
  - language (str) – embeddings language
  - vocab_size (int) – size of the vocabulary
  Returns: a dictionary and reverse dictionary, and a numpy array with the embeddings in emb_dim x vocab_size shape
- nlp_architect.data.fasttext_emb.get_eval_data(eval_path, src_lang, tgt_lang)[source]
  Downloads evaluation cross-lingual dictionaries to eval_path.
  Parameters:
  - eval_path – path where the cross-lingual dictionaries are downloaded
  - src_lang – source language
  - tgt_lang – target language
  Returns: path to where the cross-lingual dictionaries were downloaded
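A hedged sketch of the download helpers above; paths, languages and vocabulary size are placeholders, and the methods used to materialize the downloaded embeddings are not shown on this page:

```python
from nlp_architect.data.fasttext_emb import FastTextEmb, get_eval_data

# download the en-fr evaluation dictionaries (placeholder path)
eval_dir = get_eval_data("data/crosslingual/dictionaries", "en", "fr")

# declare a FastText embedding download for English (placeholder path/size)
en_emb = FastTextEmb(path="data/fasttext", language="en", vocab_size=200000)
```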
nlp_architect.data.glue_tasks module
- class nlp_architect.data.glue_tasks.ColaProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the CoLA data set (GLUE version).
- class nlp_architect.data.glue_tasks.InputFeatures(input_ids, input_mask, segment_ids, label_id, valid_ids=None)[source]
  Bases: object
  A single set of features of data.
- class nlp_architect.data.glue_tasks.MnliMismatchedProcessor[source]
  Bases: nlp_architect.data.glue_tasks.MnliProcessor
  Processor for the MultiNLI Mismatched data set (GLUE version).
- class nlp_architect.data.glue_tasks.MnliProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the MultiNLI data set (GLUE version).
- class nlp_architect.data.glue_tasks.MrpcProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the MRPC data set (GLUE version).
- class nlp_architect.data.glue_tasks.QnliProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the QNLI data set (GLUE version).
- class nlp_architect.data.glue_tasks.QqpProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the QQP data set (GLUE version).
- class nlp_architect.data.glue_tasks.RteProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the RTE data set (GLUE version).
- class nlp_architect.data.glue_tasks.Sst2Processor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the SST-2 data set (GLUE version).
- class nlp_architect.data.glue_tasks.StsbProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the STS-B data set (GLUE version).
- class nlp_architect.data.glue_tasks.WnliProcessor[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the WNLI data set (GLUE version).
- nlp_architect.data.glue_tasks.convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end=False, pad_on_left=False, cls_token='[CLS]', sep_token='[SEP]', pad_token=0, sequence_a_segment_id=0, sequence_b_segment_id=1, cls_token_segment_id=1, pad_token_segment_id=0, mask_padding_with_zero=True)[source]
  Loads a data file into a list of InputFeatures.
  cls_token_at_end defines the location of the CLS token:
  - False (default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
  - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
  cls_token_segment_id defines the segment id associated with the CLS token (0 for BERT, 2 for XLNet).
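A hedged sketch of a typical call with BERT-style settings. It assumes the GLUE processors expose the usual get_train_examples()/get_labels() interface and that a Hugging Face BertTokenizer is installed; neither assumption is documented on this page, and output_mode='classification' is likewise assumed:

```python
from transformers import BertTokenizer  # assumption: Hugging Face tokenizer dependency
from nlp_architect.data.glue_tasks import MrpcProcessor, convert_examples_to_features

processor = MrpcProcessor()
examples = processor.get_train_examples("glue_data/MRPC")  # assumed method and path
label_list = processor.get_labels()                        # assumed method

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
features = convert_examples_to_features(
    examples, label_list, max_seq_length=128, tokenizer=tokenizer,
    output_mode="classification",  # assumed value for a classification task
    cls_token_at_end=False,        # BERT pattern: [CLS] + A + [SEP] + B + [SEP]
    cls_token_segment_id=0,        # 0 for BERT, 2 for XLNet
)
```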
nlp_architect.data.intent_datasets module
- class nlp_architect.data.intent_datasets.IntentDataset(sentence_length=50, word_length=12)[source]
  Bases: object
  Intent extraction dataset base class.
  Parameters: sentence_length (int) – max sentence length
  Attributes:
  - char_vocab (dict) – word character vocabulary
  - char_vocab_size (int) – char vocabulary size
  - intent_size (int) – intent label vocabulary size
  - intents_vocab (dict) – intent labels vocabulary
  - label_vocab_size (int) – label vocabulary size
  - tags_vocab (dict) – labels vocabulary
  - test_set (tuple of numpy.ndarray) – test set
  - train_set (tuple of numpy.ndarray) – train set
  - word_vocab (dict) – tokens vocabulary
  - word_vocab_size (int) – vocabulary size
- class nlp_architect.data.intent_datasets.SNIPS(path, sentence_length=30, word_length=12)[source]
  Bases: nlp_architect.data.intent_datasets.IntentDataset
  SNIPS dataset class.
  Parameters:
  - path (str) – dataset path
  - sentence_length (int, optional) – max sentence length
  - word_length (int, optional) – max word length
  Attributes:
  - files = ['train', 'test']
  - test_files = ['AddToPlaylist/validate_AddToPlaylist.json', 'BookRestaurant/validate_BookRestaurant.json', 'GetWeather/validate_GetWeather.json', 'PlayMusic/validate_PlayMusic.json', 'RateBook/validate_RateBook.json', 'SearchCreativeWork/validate_SearchCreativeWork.json', 'SearchScreeningEvent/validate_SearchScreeningEvent.json']
  - train_files = ['AddToPlaylist/train_AddToPlaylist_full.json', 'BookRestaurant/train_BookRestaurant_full.json', 'GetWeather/train_GetWeather_full.json', 'PlayMusic/train_PlayMusic_full.json', 'RateBook/train_RateBook_full.json', 'SearchCreativeWork/train_SearchCreativeWork_full.json', 'SearchScreeningEvent/train_SearchScreeningEvent_full.json']
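A minimal usage sketch, assuming the SNIPS corpus has already been placed under ./snips with the train/validate files listed above:

```python
from nlp_architect.data.intent_datasets import SNIPS

snips = SNIPS(path="./snips", sentence_length=30, word_length=12)
x_train = snips.train_set   # tuple of numpy.ndarray (see IntentDataset above)
x_test = snips.test_set
print(snips.word_vocab_size, snips.intent_size)
```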
- class nlp_architect.data.intent_datasets.TabularIntentDataset(train_file, test_file, sentence_length=30, word_length=12)[source]
  Bases: nlp_architect.data.intent_datasets.IntentDataset
  Tabular intent/slot tags dataset loader. Compatible with many sequence tagging datasets (ATIS, CoNLL, etc.). The data must be in tabular format (see the example following this entry) where:
  - each line holds one word with its tag annotation and intent type, separated by tabs: <token> <tag_label> <intent>
  - sentences are separated by an empty line
  Parameters:
  - train_file (str) – path to train set file
  - test_file (str) – path to test set file
  - sentence_length (int) – max sentence length
  - word_length (int) – max word length
  Attributes:
  - files = ['train', 'test']
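An illustrative sketch of the expected tabular format, written to temporary train/test files and loaded with TabularIntentDataset; the tokens, tags and intents are made up, and parsing details may differ:

```python
from nlp_architect.data.intent_datasets import TabularIntentDataset

# one token per line: <token>\t<tag_label>\t<intent>, empty line between sentences
sample = (
    "add\tO\tAddToPlaylist\n"
    "this\tO\tAddToPlaylist\n"
    "song\tB-music_item\tAddToPlaylist\n"
    "\n"
    "play\tO\tPlayMusic\n"
    "madonna\tB-artist\tPlayMusic\n"
)
for name in ("train.txt", "test.txt"):
    with open(name, "w") as f:
        f.write(sample)

ds = TabularIntentDataset("train.txt", "test.txt", sentence_length=30, word_length=12)
print(ds.intents_vocab)  # documented IntentDataset attribute
```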
nlp_architect.data.ptb module
Data loader for the Penn Treebank dataset.
- class nlp_architect.data.ptb.PTBDataLoader(word_dict, seq_len=100, data_dir='/home/runner/data', dataset='WikiText-103', batch_size=32, skip=30, split_type='train', loop=True)[source]
  Bases: object
  Class that defines the data loader.
  - decode_line(tokens)[source]
    Decode a given line from index to word.
    Parameters: tokens – list of indexes
    Returns: str, a sentence
- class nlp_architect.data.ptb.PTBDictionary(data_dir='/home/runner/data', dataset='WikiText-103')[source]
  Bases: object
  Class for generating a dictionary of all words in the PTB corpus.
  - add_word(word)[source]
    Method for adding a single word to the dictionary.
    Parameters: word – str, word to be added
    Returns: None
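A hedged sketch of the dictionary-plus-loader flow, assuming the WikiText-103 files already exist under ./data; all parameters follow the signatures above:

```python
from nlp_architect.data.ptb import PTBDictionary, PTBDataLoader

word_dict = PTBDictionary(data_dir="./data", dataset="WikiText-103")
loader = PTBDataLoader(word_dict, seq_len=100, data_dir="./data",
                       dataset="WikiText-103", batch_size=32, split_type="train")

# decode_line() (documented above) maps a list of token indexes back to a sentence:
# sentence = loader.decode_line(token_indexes)
```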
nlp_architect.data.sequence_classification module
- class nlp_architect.data.sequence_classification.SequenceClsInputExample(guid: str, text: str, text_b: str = None, label: str = None)[source]
  Bases: nlp_architect.data.utils.InputExample
  A single training/test example for simple sequence classification.
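A minimal construction sketch following the signature above; the guid, text and label values are placeholders:

```python
from nlp_architect.data.sequence_classification import SequenceClsInputExample

example = SequenceClsInputExample(
    guid="train-1",
    text="The movie was surprisingly good.",
    text_b=None,        # optional second sequence for pair tasks
    label="positive",   # task-dependent label string
)
```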
nlp_architect.data.sequential_tagging module
- class nlp_architect.data.sequential_tagging.CONLL2000(data_path, sentence_length=None, max_word_length=None, extract_chars=False, lowercase=True)[source]
  Bases: object
  CONLL 2000 POS/chunking task data set (numpy).
  Parameters:
  - data_path (str) – directory containing the CONLL2000 files
  - sentence_length (int, optional) – number of time steps to embed the data; a None value will not truncate vectors
  - max_word_length (int, optional) – max word length in characters; a None value will not truncate vectors
  - extract_chars (boolean, optional) – yield char-RNN features
  - lowercase (bool, optional) – lowercase sentence words
  Attributes:
  - char_vocab – character vocabulary
  - chunk_vocab – chunk label vocabulary
  - dataset_files = {'test': 'test.txt', 'train': 'train.txt'}
  - pos_vocab – POS label vocabulary
  - test_set – get the test set
  - train_set – get the train set
  - word_vocab – word vocabulary
- class nlp_architect.data.sequential_tagging.SequentialTaggingDataset(train_file, test_file, max_sentence_length=30, max_word_length=20, tag_field_no=2)[source]
  Bases: object
  Sequential tagging dataset loader. Loads train/test files with tabular separation.
  Parameters:
  - train_file (str) – path to train file
  - test_file (str) – path to test file
  - max_sentence_length (int, optional) – max sentence length
  - max_word_length (int, optional) – max word length
  - tag_field_no (int, optional) – index of the column to use as y-samples
  Attributes:
  - char_vocab – characters vocabulary
  - char_vocab_size – character vocabulary size
  - test_set – get the test set
  - train_set – get the train set
  - word_vocab – words vocabulary
  - word_vocab_size – word vocabulary size
  - y_labels – return y labels
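A hedged usage sketch, assuming tab-separated CoNLL-style train/test files exist at the (placeholder) paths:

```python
from nlp_architect.data.sequential_tagging import SequentialTaggingDataset

ds = SequentialTaggingDataset("data/train.txt", "data/test.txt",
                              max_sentence_length=30, max_word_length=20,
                              tag_field_no=2)
x_train = ds.train_set                    # documented attribute
print(ds.word_vocab_size, ds.y_labels)    # documented attributes
```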
- class nlp_architect.data.sequential_tagging.TokenClsInputExample(guid: str, text: str, tokens: List[str], shapes: List[int] = None, label: List[str] = None)[source]
  Bases: nlp_architect.data.utils.InputExample
  A single training/test example for simple sequence token classification.
- class nlp_architect.data.sequential_tagging.TokenClsProcessor(data_dir, tag_col: int = -1, ignore_token=None)[source]
  Bases: nlp_architect.data.utils.DataProcessor
  Sequence token classification processor / dataset loader. Loads a directory with train.txt/test.txt/dev.txt files in tab-separated format (one token per line, CoNLL style). The label dictionary is given in a labels.txt file.
  - get_test_examples(filename='test.txt')[source]
    Gets a collection of `InputExample`s for the test set.
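A hedged sketch of reading test examples from a directory containing test.txt and labels.txt; the path is a placeholder:

```python
from nlp_architect.data.sequential_tagging import TokenClsProcessor

processor = TokenClsProcessor("data/conll_dir", tag_col=-1)
test_examples = processor.get_test_examples("test.txt")  # documented method
```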
nlp_architect.data.utils module
- class nlp_architect.data.utils.DataProcessor[source]
  Bases: object
  Base class for data converters for sequence/token classification data sets.
- class nlp_architect.data.utils.InputExample(guid: str, text, label=None)[source]
  Bases: abc.ABC
  Base class for a single training/dev/test example.
- class nlp_architect.data.utils.Task(name: str, processor: nlp_architect.data.utils.DataProcessor, data_dir: str, task_type: str)[source]
  Bases: object
  A task definition class.
  Parameters:
  - name (str) – the name of the task
  - processor (DataProcessor) – a DataProcessor class containing a dataset loader
  - data_dir (str) – path to the data source
  - task_type (str) – the task type (classification/regression/tagging)
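A hedged sketch wiring one of the GLUE processors above into a Task definition; the data_dir path is a placeholder:

```python
from nlp_architect.data.glue_tasks import MrpcProcessor
from nlp_architect.data.utils import Task

mrpc_task = Task(name="mrpc",
                 processor=MrpcProcessor(),
                 data_dir="glue_data/MRPC",
                 task_type="classification")
```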
- nlp_architect.data.utils.get_cached_filepath(data_dir, model_name, seq_length, task_name, set_type='train')[source]
  Gets the cached file name.
  Parameters:
  - data_dir (str) – data directory
  - model_name (str) – model name
  - seq_length (int) – max sequence length
  - task_name (str) – name of the task
  Keyword Arguments: set_type (str) – set type (default: "train")
  Returns: str – cached filename
- nlp_architect.data.utils.read_column_tagged_file(filename: str, tag_col: int = -1, ignore_token: str = None)[source]
  Reads a column-tagged (CoNLL-style) file (tab separated, one token per line). tag_col is the column number to use as the token's tag (defaults to the last column in the line).
  Parameters:
  - filename (str) – input file path
  - tag_col (int) – the column that contains the labels
  - ignore_token (str) – a str token to exclude
  Return format: [['token', 'TAG'], ['token', 'TAG2'], ...]
- nlp_architect.data.utils.read_tsv(input_file, quotechar=None)[source]
  Reads a tab-separated value file.
- nlp_architect.data.utils.sample_label_unlabeled(samples: List[nlp_architect.data.utils.InputExample], no_labeled: int, no_unlabeled: int)[source]
  Randomly samples two sets of examples from a given collection of InputExamples (used for semi-supervised models).
- nlp_architect.data.utils.split_column_dataset(first_count: int, second_count: int, out_folder, dataset, first_filename, second_filename, tag_col=-1)[source]
  Splits a single column-tagged dataset into two files according to the number of examples requested in each file.
  Parameters:
  - first_count (int) – the number of examples to include in the first split file
  - second_count (int) – the number of examples to include in the second split file
  - out_folder (str) – the folder in which the result files will be stored
  - dataset (str) – the path to the original data file
  - first_filename (str) – the name of the first split file
  - second_filename (str) – the name of the second split file
  - tag_col (int) – the index of the tag column
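A hedged sketch combining split_column_dataset and read_column_tagged_file; all paths and counts are placeholders:

```python
from nlp_architect.data.utils import read_column_tagged_file, split_column_dataset

# split a column-tagged (CoNLL-style) file into a small labeled part
# and a larger "unlabeled" part
split_column_dataset(first_count=100, second_count=900,
                     out_folder="out", dataset="data/train.txt",
                     first_filename="labeled.txt",
                     second_filename="unlabeled.txt", tag_col=-1)

sentences = read_column_tagged_file("out/labeled.txt", tag_col=-1)
# each sentence is a list of [token, TAG] pairs, per the return format above
```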