imputegap.tools.utils package¶
The imputegap.tools.utils package provides various utility functions and tools for handling algorithm parameters, evaluation, and other operations in the imputation process.
Submodules¶
Modules¶
Submodule Documentation¶
imputegap.tools.utils module¶
- imputegap.tools.utils.compute_batch_size(data, min_size=4, max_size=16, divisor=2, verbose=True)[source]¶
Compute an appropriate batch size based on the input data shape.
The batch size is computed as min(M // divisor, max_size), where M is the number of samples and divisor defaults to 2. If this computed batch size is less than min_size, it is set to min_size instead.
Parameters¶
- datanp.ndarray or torch.Tensor
Input 2D data of shape (M, N), where M is the number of samples.
- min_sizeint, optional
Minimum allowed batch size. Default is 4.
- max_sizeint, optional
Maximum allowed batch size. Default is 16.
- divisorint, optional
Divisor applied to the number of samples M in the dataset. Default is 2.
- verbosebool, optional
If True, prints the computed batch size. Default is True.
Returns¶
- int
Computed batch size.
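The rule described above can be restated as a minimal sketch. This is not the library source: batch_size_sketch is a hypothetical name, and the sketch assumes that the divisor parameter is the value M is divided by (2 by default).

```python
import numpy as np


def batch_size_sketch(data, min_size=4, max_size=16, divisor=2):
    """Illustrative restatement of the documented rule (hypothetical helper)."""
    m = data.shape[0]                    # M, the number of samples
    size = min(m // divisor, max_size)   # cap the batch size at max_size
    return max(size, min_size)           # never go below min_size


small = batch_size_sketch(np.zeros((6, 3)))     # 6 // 2 = 3, raised to min_size
medium = batch_size_sketch(np.zeros((20, 3)))   # 20 // 2 = 10, within bounds
large = batch_size_sketch(np.zeros((100, 3)))   # 100 // 2 = 50, capped at max_size
print(small, medium, large)
```

With the default bounds, the three inputs land on the lower bound, inside the range, and on the upper bound respectively.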
- imputegap.tools.utils.config_contamination(ts, pattern, dataset_rate=0.4, series_rate=0.4, block_size=10, offset=0.1, seed=True, limit=1, shift=0.05, std_dev=0.5, explainer=False, probabilities=None, verbose=True)[source]¶
Configure and execute contamination for the selected imputation algorithm and pattern.
Parameters¶
- ratefloat
Mean contamination rate (percentage of missing values).
- tsTimeSeries
A TimeSeries object containing the dataset.
- patternstr
Type of contamination pattern (e.g., “mcar”, “mp”, “blackout”, “disjoint”, “overlap”, “gaussian”).
- block_sizeint
Size of the blocks removed in MCAR. Default is 10.
Returns¶
- TimeSeries
TimeSeries object containing contaminated data.
- imputegap.tools.utils.config_forecaster(model, params)[source]¶
Configure and execute a forecaster model for downstream analytics.
Parameters¶
- modelstr
Name of the forecaster model.
- paramslist of params
List of parameters for the forecaster model.
Returns¶
- Forecaster object (sktime/Darts)
Forecaster object for downstream analytics.
- imputegap.tools.utils.config_impute_algorithm(incomp_data, algorithm, verbose=True)[source]¶
Configure and execute the selected imputation algorithm.
Parameters¶
- incomp_dataTimeSeries
TimeSeries object containing dataset.
- algorithmstr
Name of algorithm
- verbosebool, optional
Whether to display the contamination information (default is True).
Returns¶
- BaseImputer
Configured imputer instance with optimal parameters.
- imputegap.tools.utils.display_title(title='Master Thesis', aut='Quentin Nater', lib='ImputeGAP', university='University Fribourg')[source]¶
Display the title and author information.
Parameters¶
- titlestr, optional
The title of the thesis (default is “Master Thesis”).
- autstr, optional
The author’s name (default is “Quentin Nater”).
- libstr, optional
The library or project name (default is “ImputeGAP”).
- universitystr, optional
The university or institution (default is “University Fribourg”).
Returns¶
None
- imputegap.tools.utils.dl_integration_transformation(input_matrix, tr_ratio=0.8, inside_tr_cont_ratio=0.2, split_ts=1, split_val=0, nan_val=-99999, prevent_leak=True, offset=0.05, block_selection=True, seed=42, verbose=False)[source]¶
Prepares contaminated data and corresponding masks for deep learning-based imputation training, validation, and testing.
This function simulates missingness in a controlled way, optionally prevents information leakage, and produces masks for training, testing, and validation using different contamination strategies.
Parameters¶
- input_matrixnp.ndarray
The complete input time series data matrix of shape [T, N] (time steps × variables).
- tr_ratiofloat, default=0.8
The fraction of data to reserve for training when constructing the test contamination mask.
- inside_tr_cont_ratiofloat, default=0.2
The proportion of values to randomly drop inside the training data for internal contamination.
- split_tsfloat, default=1
Proportion of the total contaminated data assigned to the test set.
- split_valfloat, default=0
Proportion of the total contaminated data assigned to the validation set.
- nan_valfloat, default=-99999
Value used to represent missing entries in the masked matrix. Setting nan_val=-1 fills missing entries with mean values instead.
- prevent_leakbool, default=True
If True, replace NaN values with a large placeholder number to prevent leakage.
- offsetfloat, default=0.05
Minimum temporal offset at the beginning of the series.
- block_selectionbool, default=True
Whether to simulate missing values in contiguous blocks (True) or randomly (False).
- seedint, default=42
Seed for NumPy random number generation to ensure reproducibility.
- verbosebool, default=False
Whether to print logging/debug information during execution.
Returns¶
- cont_data_matrixnp.ndarray
The input matrix with synthetic missing values introduced.
- mask_trainnp.ndarray
Boolean mask of shape [T, N] indicating the training contamination locations (True = observed, False = missing).
- mask_testnp.ndarray
Boolean mask of shape [T, N] indicating the test contamination locations.
- mask_validnp.ndarray
Boolean mask of shape [T, N] indicating the validation contamination locations.
- errorbool
Flag that is set when the operation is impossible.
- imputegap.tools.utils.generate_random_mask(gt, mask_test, mask_valid, droprate=0.2, offset=None, verbose=False, seed=42)[source]¶
Generate a random training mask over the non-NaN entries of gt, excluding positions already present in the test and validation masks.
Parameters¶
- gtnumpy.ndarray
Ground truth data (no NaNs).
- mask_testnumpy.ndarray
Binary mask indicating test positions.
- mask_validnumpy.ndarray
Binary mask indicating validation positions.
- dropratefloat
Proportion of eligible entries to include in the training mask.
- offsetfloat
Offset of the dataset to protect, if any (default is None).
- verbosebool
Whether to print debug info.
- seedint, optional
Random seed for reproducibility.
Returns¶
- numpy.ndarray
Binary mask indicating training positions.
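The selection logic described above can be sketched as follows. This is a simplified illustration, not the library source: random_train_mask_sketch is a hypothetical name, the offset and verbose parameters are left out, and truncating the number of picks with int() is an assumption.

```python
import numpy as np


def random_train_mask_sketch(gt, mask_test, mask_valid, droprate=0.2, seed=42):
    """Pick a random share of the entries not used by test/validation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Eligible positions: observed in gt and not reserved by the other two masks.
    eligible = ~np.isnan(gt) & ~mask_test.astype(bool) & ~mask_valid.astype(bool)
    idx = np.flatnonzero(eligible)
    n_pick = int(len(idx) * droprate)            # assumption: truncate, do not round
    chosen = rng.choice(idx, size=n_pick, replace=False)
    mask = np.zeros(gt.size, dtype=bool)
    mask[chosen] = True
    return mask.reshape(gt.shape)


gt = np.arange(20, dtype=float).reshape(5, 4)
mask_test = np.zeros(gt.shape, dtype=bool)
mask_test[0, 0] = True
mask_valid = np.zeros(gt.shape, dtype=bool)
mask_valid[1, 1] = True

mask_train = random_train_mask_sketch(gt, mask_test, mask_valid)
print(int(mask_train.sum()))  # 18 eligible entries, 20% of them selected
```

Because the candidates are drawn only from eligible positions, the training mask cannot overlap the test or validation masks.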
- imputegap.tools.utils.get_missing_ratio(incomp_data)[source]¶
Check whether the proportion of missing values in the contaminated data is acceptable for training a deep learning model.
Parameters¶
- incomp_dataTimeSeries (numpy array)
TimeSeries object containing dataset.
Returns¶
- bool
True if the missing data ratio is less than or equal to 40%, False otherwise.
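As a rough illustration of this check on a plain NumPy matrix, the following sketch computes the share of NaN entries and compares it with the documented 40% limit. The name missing_ratio_ok_sketch and the threshold argument are hypothetical and not part of the library.

```python
import numpy as np


def missing_ratio_ok_sketch(matrix, threshold=0.4):
    """Return True when the share of NaN entries is at most `threshold` (hypothetical helper)."""
    ratio = np.isnan(matrix).mean()   # fraction of missing entries
    return bool(ratio <= threshold)


light = np.arange(10, dtype=float)
light[:3] = np.nan                    # 30% missing
heavy = np.arange(10, dtype=float)
heavy[:5] = np.nan                    # 50% missing

print(missing_ratio_ok_sketch(light), missing_ratio_ok_sketch(heavy))
```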
- imputegap.tools.utils.load_parameters(query: str = 'default', algorithm: str = 'cdrec', dataset: str = 'chlorine', optimizer: str = 'b', path=None, verbose=False)[source]¶
Load default or optimal parameters for algorithms from a TOML file.
Parameters¶
- querystr, optional
‘default’ or ‘optimal’ to load default or optimal parameters (default is “default”).
- algorithmstr, optional
Algorithm to load parameters for (default is “cdrec”).
- datasetstr, optional
Name of the dataset (default is “chlorine”).
- optimizerstr, optional
Optimizer type for optimal parameters (default is “b”).
- pathstr, optional
Custom file path for the TOML file (default is None).
- verbosebool, optional
Whether to display the contamination information (default is False).
Returns¶
- tuple
A tuple containing the loaded parameters for the given algorithm.
Load the shared library based on the operating system.
Parameters¶
- namestr, optional
The name of the shared library (default is “lib_cdrec”).
- libbool, optional
If True, the function loads the library from the default ‘imputegap’ path; if False, it loads from a local path (default is True).
- verbosebool, optional
Whether to display the contamination information (default is True).
Returns¶
- ctypes.CDLL
The loaded shared library object.
- imputegap.tools.utils.prepare_fixed_testing_set(incomp_m, tr_ratio=0.8, offset=0.05, block_selection=True, verbose=True)[source]¶
Introduces additional missing values (NaNs) into a data matrix to match a specified training ratio.
This function modifies a copy of the input matrix incomp_m by introducing NaNs such that the proportion of observed (non-NaN) values matches the desired tr_ratio. It returns the modified matrix and the corresponding missing data mask.
Parameters¶
- incomp_mnp.ndarray
A 2D NumPy array with potential pre-existing NaNs representing missing values.
- tr_ratiofloat
Desired ratio of observed (non-NaN) values in the output matrix. Must be in the range (0, 1).
- offsetfloat
Protected zone at the beginning of the series.
- block_selectionbool
Whether to select the missing values in blocks (True) or randomly (False).
- verbosebool
Whether to print debug info.
Returns¶
- data_matrix_contnp.ndarray
The modified matrix with additional NaNs introduced to match the specified training ratio.
- new_masknp.ndarray
A boolean mask of the same shape as data_matrix_cont where True indicates missing (NaN) entries.
Raises¶
- AssertionError:
If the final observed and missing ratios deviate from the target by more than 1%.
Notes¶
The function assumes that the input contains some non-NaN entries.
NaNs are added in row-major order from the list of available (non-NaN) positions.
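The row-major behaviour described in the notes can be sketched in a simplified form. This is an illustration under stated assumptions, not the library source: fixed_testing_set_sketch is a hypothetical name, the offset and block_selection parameters are ignored, and the final 1% tolerance check is omitted.

```python
import numpy as np


def fixed_testing_set_sketch(incomp_m, tr_ratio=0.8):
    """Add NaNs in row-major order until the observed share equals tr_ratio (hypothetical helper)."""
    data = np.array(incomp_m, dtype=float, order="C")   # C-ordered copy; the input is left untouched
    total = data.size
    target_missing = total - int(round(total * tr_ratio))
    already_missing = int(np.isnan(data).sum())
    to_add = max(target_missing - already_missing, 0)
    available = np.flatnonzero(~np.isnan(data))         # observed positions, row-major
    flat = data.reshape(-1)                             # view on the copy
    flat[available[:to_add]] = np.nan
    return data, np.isnan(data)


m = np.arange(20, dtype=float).reshape(4, 5)
m[0, 0] = np.nan                                        # one pre-existing gap

cont, new_mask = fixed_testing_set_sketch(m, tr_ratio=0.8)
print(int(new_mask.sum()), cont.size)                   # missing entries vs. total entries
```

In this example the pre-existing NaN counts toward the target, so only three additional values are removed.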
- imputegap.tools.utils.prepare_testing_set(incomp_m, original_missing_ratio, block_selection=True, tr_ratio=0.8, verbose=True)[source]¶
- imputegap.tools.utils.prevent_leakage(matrix, mask, replacement=0, verbose=True)[source]¶
Replaces missing values in a matrix to prevent data leakage during evaluation.
This function replaces all entries in matrix that are marked as missing in mask with a specified replacement value (default is 0). It then checks to ensure that there are no remaining NaNs in the matrix and that at least one replacement occurred.
Parameters¶
- matrixnp.ndarray
A NumPy array potentially containing missing values (NaNs).
- masknp.ndarray
A boolean mask of the same shape as matrix, where True indicates positions to be replaced (typically where original values were NaN).
- replacementfloat or int, optional
The value to use in place of missing entries. Defaults to 0.
- verbosebool
Whether to print debug info.
Returns¶
- matrixnp.ndarray
The matrix with missing entries replaced by the specified value.
Raises¶
- AssertionError:
If any NaNs remain in the matrix after replacement, or if no replacements were made.
Notes¶
This function is typically used before evaluation to ensure the model does not access ground truth values where data was originally missing.
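A minimal sketch of the replacement and the two checks described above is shown below. The name prevent_leakage_sketch is hypothetical, and working on a copy is an assumption; the documented function may modify the matrix in place.

```python
import numpy as np


def prevent_leakage_sketch(matrix, mask, replacement=0):
    """Overwrite masked entries and verify the result (hypothetical helper)."""
    out = matrix.copy()               # assumption: leave the caller's array untouched
    out[mask] = replacement
    assert not np.isnan(out).any(), "NaNs remain after replacement"
    assert mask.any(), "no replacement was made"
    return out


matrix = np.array([[1.0, np.nan], [np.nan, 4.0]])
mask = np.isnan(matrix)               # True where values were originally missing

cleaned = prevent_leakage_sketch(matrix, mask)
print(cleaned)
```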
- imputegap.tools.utils.save_optimization(optimal_params, algorithm='cdrec', dataset='', optimizer='b', file_name=None)[source]¶
Save the optimization parameters to a TOML file for later use without recomputing.
Parameters¶
- optimal_paramsdict
Dictionary of the optimal parameters.
- algorithmstr, optional
The name of the imputation algorithm (default is ‘cdrec’).
- datasetstr, optional
The name of the dataset (default is an empty string).
- optimizerstr, optional
The name of the optimizer used (default is ‘b’).
- file_namestr, optional
The name of the TOML file to save the results (default is None).
Returns¶
None
- imputegap.tools.utils.search_path(set_name='test')[source]¶
Find the accurate path for loading test files.
Parameters¶
- set_namestr, optional
Name of the dataset (default is “test”).
Returns¶
- str
The correct file path for the dataset.
- imputegap.tools.utils.split_mask_bwt_test_valid(data_matrix, test_rate=0.8, valid_rate=0.2, nan_val=None, verbose=False, seed=42)[source]¶
Dispatch NaN positions in data_matrix to test and validation masks only.
Parameters¶
- data_matrixnumpy.ndarray
Input matrix containing NaNs to be split.
- test_ratefloat
Proportion of NaNs to assign to the test set (default is 0.8).
- valid_ratefloat
Proportion of NaNs to assign to the validation set (default is 0.2). test_rate + valid_rate must equal 1.0.
- verbosebool
Whether to print debug info.
- seedint, optional
Random seed for reproducibility.
Returns¶
- tuple
- test_masknumpy.ndarray
Binary mask indicating positions of NaNs in the test set.
- valid_masknumpy.ndarray
Binary mask indicating positions of NaNs in the validation set.
- n_nanint
Total number of NaN values found in the input matrix.
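One way to picture the split is the sketch below, which shuffles the NaN positions and assigns the first share to the test mask and the rest to the validation mask. It is an illustration only: split_nan_positions_sketch is a hypothetical name, the nan_val and verbose parameters are left out, and rounding the test share is an assumption.

```python
import numpy as np


def split_nan_positions_sketch(data_matrix, test_rate=0.8, valid_rate=0.2, seed=42):
    """Distribute NaN positions between a test and a validation mask (hypothetical helper)."""
    assert abs(test_rate + valid_rate - 1.0) < 1e-9, "rates must sum to 1.0"
    rng = np.random.default_rng(seed)
    nan_idx = np.flatnonzero(np.isnan(data_matrix))
    rng.shuffle(nan_idx)                              # random assignment of positions
    n_nan = len(nan_idx)
    n_test = int(round(n_nan * test_rate))            # assumption: round the test share
    test_mask = np.zeros(data_matrix.size, dtype=bool)
    valid_mask = np.zeros(data_matrix.size, dtype=bool)
    test_mask[nan_idx[:n_test]] = True
    valid_mask[nan_idx[n_test:]] = True
    shape = data_matrix.shape
    return test_mask.reshape(shape), valid_mask.reshape(shape), n_nan


data = np.arange(20, dtype=float).reshape(5, 4)
data[:, :2] = np.nan                                  # 10 missing values

test_mask, valid_mask, n_nan = split_nan_positions_sketch(data)
print(n_nan, int(test_mask.sum()), int(valid_mask.sum()))
```

Every NaN position ends up in exactly one of the two masks, so their union matches the NaN pattern of the input.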
- imputegap.tools.utils.verification_limitation(percentage, low_limit=0.01, high_limit=1.0)[source]¶
Format and verify that the percentage given by the user is within acceptable bounds.
Parameters¶
- percentagefloat
The percentage value to be checked and potentially adjusted.
- low_limitfloat, optional
The lower limit of the acceptable percentage range (default is 0.01).
- high_limitfloat, optional
The upper limit of the acceptable percentage range (default is 1.0).
Returns¶
- float
Adjusted percentage based on the limits.
Raises¶
- ValueError
If the percentage is outside the accepted limits.
Notes¶
If the percentage is between 1 and 100, it will be divided by 100 to convert it to a decimal format.
If the percentage is outside the low and high limits, the function will print a warning and return the original value.
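The checks above can be sketched as follows. The name verify_percentage_sketch is hypothetical, and the sketch follows the Raises section by raising ValueError for out-of-range input rather than printing a warning; the library's actual behaviour may differ.

```python
def verify_percentage_sketch(percentage, low_limit=0.01, high_limit=1.0):
    """Normalise a user-supplied percentage (hypothetical helper)."""
    if low_limit <= percentage <= high_limit:
        return percentage             # already a valid decimal fraction
    if 1 <= percentage <= 100:
        return percentage / 100       # e.g. 40 becomes 0.4
    # Assumption: follow the Raises section instead of warning and returning.
    raise ValueError(f"percentage {percentage} is outside the accepted limits")


as_decimal = verify_percentage_sketch(0.4)
as_percent = verify_percentage_sketch(40)
print(as_decimal, as_percent)
```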