Pipeline
DataPipeline
Class for creating a data pipeline.
The pipeline holds the list of steps and runs them one after the other. The datasets are stored in a global dictionary, where each dataset is referred to by a key name, as indicated in the inputs parameter of each step. The pipeline manages cache lookup and creation. A usage sketch follows the parameter list below.
Parameters:
- name (str) – Name of the pipeline.
- output_path (str, default: '.') – Path to store the cache files.
- cache (bool, default: True) – Whether to cache the datasets.
- steps (List[BaseStep], default: []) – List of steps in the pipeline.
- inputs (str, default: 'main_dataset') – Name of the main dataset.
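A minimal usage sketch, assuming the constructor signature listed above; the step list is left empty because concrete BaseStep subclasses depend on your processing needs, and the actual implementation lives in ragfit/processing/pipeline.py.

```python
from ragfit.processing.pipeline import DataPipeline

# Minimal sketch: construct a pipeline with the documented arguments and run it.
# An empty step list is used here; in practice you would pass BaseStep instances.
pipeline = DataPipeline(
    name="qa-processing",       # used when generating cache filenames
    output_path="./cache",      # where cache files are written
    cache=True,                 # enable per-step caching
    steps=[],                   # list of BaseStep instances
    inputs="main_dataset",      # key of the main dataset in the global dictionary
)

pipeline.process()              # run all steps, reusing cache files when possible
```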
cache_step(step, step_index)
Write the current state of the global datasets dictionary to cache files for the given step's inputs.
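A rough sketch of what this could look like, assuming the global datasets are Hugging Face Dataset objects serialized to JSON lines and that the step exposes its input keys as step.inputs; the real implementation in ragfit/processing/pipeline.py may differ.

```python
def cache_step(self, step, step_index):
    # Sketch of a DataPipeline method body (assumed attributes and cache format):
    # write each input dataset of this step to its generated cache path.
    for dataset_name in step.inputs:
        cache_path = self.gen_cache_fn(step, step_index, dataset_name)
        self.datasets[dataset_name].to_json(cache_path)  # assumes an HF Dataset
```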
delete_cache()
Remove cache files for all steps, cleaning the pipeline.
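One way such cleanup could look; the cache-file naming pattern below is an assumption made for illustration, not the library's actual scheme.

```python
import glob
import os

def delete_cache(self):
    # Sketch: remove every cache file under output_path that belongs to this
    # pipeline. The "{name}_*" filename pattern is assumed, not verified.
    pattern = os.path.join(self.output_path, f"{self.name}_*")
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            os.remove(path)
```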
gen_cache_fn(step, index, dataset_name)
Create a unique cache filename for a given dataset at a given step index. The filename is built from the step name, its inputs and hash, the pipeline's path and name, and the dataset name.
Returns a string.
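A sketch of how such a filename could be assembled from those ingredients; the hashing scheme and the '.jsonl' suffix are illustrative assumptions.

```python
import hashlib
import os

def gen_cache_fn(self, step, index, dataset_name):
    # Sketch: hash the step's identity and inputs so the filename changes
    # whenever the step configuration changes, then combine it with the
    # pipeline name, step index and dataset name.
    step_id = f"{step.__class__.__name__}-{step.inputs}"
    digest = hashlib.sha256(step_id.encode()).hexdigest()[:12]
    filename = f"{self.name}_step{index}_{dataset_name}_{digest}.jsonl"
    return os.path.join(self.output_path, filename)
```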
get_cache_mapping(step: BaseStep, index: int)
Returns a mapping between input datasets and cache filenames for a given step.
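A plausible shape for that mapping, assuming step.inputs is the list of dataset keys the step reads; purely illustrative.

```python
def get_cache_mapping(self, step, index):
    # Sketch: one cache filename per input dataset of the step, e.g.
    # {"main_dataset": "./cache/qa-processing_step0_main_dataset_ab12cd34ef56.jsonl"}
    return {
        dataset_name: self.gen_cache_fn(step, index, dataset_name)
        for dataset_name in step.inputs
    }
```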
load_from_cache(caches_map)
Load datasets from cache using a cache mapping and update the global datasets dictionary.
Internal function; not intended to be called by the user.
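A sketch of the loading side, matching the JSON-lines assumption used in the caching sketches above; the attribute names are assumptions.

```python
from datasets import Dataset

def load_from_cache(self, caches_map):
    # Sketch: read each cached dataset back into the global dictionary under
    # its original key. Assumes JSON-lines cache files and HF Dataset objects.
    for dataset_name, cache_path in caches_map.items():
        self.datasets[dataset_name] = Dataset.from_json(cache_path)
```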
process()
Run the pipeline, step after step.
Caching is handled here. A step is computed if a previous step in the pipeline changed, or if the current step has no cache file.
When a step is computed, its result is cached and self.last_update is set to the current step index. A sketch of this logic follows.
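A condensed sketch of that control flow, tying the previous sketches together; the step call step.process(self.datasets) and the attribute names are assumptions, and the actual implementation in ragfit/processing/pipeline.py may differ.

```python
import os

def process(self):
    # Sketch of the documented behaviour: a step is recomputed when an earlier
    # step changed during this run, or when its cache files are missing.
    self.last_update = -1  # index of the last recomputed step; -1 means none yet
    for index, step in enumerate(self.steps):
        caches_map = self.get_cache_mapping(step, index)
        cache_hit = self.cache and all(
            os.path.exists(path) for path in caches_map.values()
        )
        changed_earlier = self.last_update >= 0

        if cache_hit and not changed_earlier:
            # No upstream change and the cache is complete: reuse it.
            self.load_from_cache(caches_map)
        else:
            # Upstream change or cache miss: run the step, then cache the result.
            step.process(self.datasets)  # assumed BaseStep interface
            if self.cache:
                self.cache_step(step, index)
            self.last_update = index
```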