imputegap.tools.utils package

The imputegap.tools.utils package provides various utility functions and tools for handling algorithm parameters, evaluation, and other operations in the imputation process.

Submodule Documentation

imputegap.tools.utils module

imputegap.tools.utils.check_family(family, algorithm)
imputegap.tools.utils.compute_batch_size(data, min_size=4, max_size=16, divisor=2, verbose=True)

Compute an appropriate batch size based on the input data shape.

The batch size is computed as min(M // divisor, max_size), where M is the number of samples and divisor defaults to 2. If the result is less than min_size, min_size is used instead.

Parameters

data : np.ndarray or torch.Tensor

Input 2D data of shape (M, N), where M is the number of samples.

min_size : int, optional

Minimum allowed batch size. Default is 4.

max_size : int, optional

Maximum allowed batch size. Default is 16.

divisor : int, optional

Divisor applied to the number of samples M. Default is 2.

verbose : bool, optional

If True, prints the computed batch size. Default is True.

Returns

int

Computed batch size.
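
The rule described for compute_batch_size can be sketched in a few lines. This is a minimal illustration of the documented formula only, not the library's implementation: sketch_batch_size is a hypothetical name, and the sketch assumes that M is divided by the divisor argument (2 by default).

```python
import numpy as np


def sketch_batch_size(data, min_size=4, max_size=16, divisor=2):
    """Documented rule only: min(M // divisor, max_size), raised to min_size."""
    n_samples = data.shape[0]  # M, the number of samples (rows)
    batch_size = min(n_samples // divisor, max_size)
    # Fall back to min_size when the computed value is too small.
    return max(batch_size, min_size)


print(sketch_batch_size(np.zeros((100, 8))))  # 100 // 2 = 50, capped at 16
print(sketch_batch_size(np.zeros((20, 8))))   # 20 // 2 = 10
print(sketch_batch_size(np.zeros((6, 8))))    # 6 // 2 = 3, raised to 4
```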

imputegap.tools.utils.config_contamination(ts, pattern, dataset_rate=0.4, series_rate=0.4, block_size=10, offset=0.1, seed=True, limit=1, shift=0.05, std_dev=0.5, explainer=False, probabilities=None, verbose=True)

Configure and apply contamination to a time series using the selected pattern.

Parameters

rate : float

Mean missing-percentage rate used for the contamination.

ts : TimeSeries

A TimeSeries object containing the dataset.

pattern : str

Type of contamination pattern (e.g., “mcar”, “mp”, “blackout”, “disjoint”, “overlap”, “gaussian”).

block_size : int

Size of the blocks removed in MCAR.

Returns

TimeSeries

TimeSeries object containing contaminated data.

imputegap.tools.utils.config_forecaster(model, params)

Configure and execute a forecaster model for downstream analytics.

Parameters

model : str

Name of the forecaster model.

params : list

List of parameters for the forecaster model.

Returns

Forecaster object (sktime/Darts)

Forecaster object for downstream analytics.

imputegap.tools.utils.config_impute_algorithm(incomp_data, algorithm, verbose=True)

Configure the imputer for the selected imputation algorithm.

Parameters

incomp_data : TimeSeries

TimeSeries object containing the dataset.

algorithm : str

Name of the algorithm.

verbose : bool, optional

Whether to print information messages (default is True).

Returns

BaseImputer

Configured imputer instance with optimal parameters.

imputegap.tools.utils.display_title(title='Master Thesis', aut='Quentin Nater', lib='ImputeGAP', university='University Fribourg')

Display the title and author information.

Parameters

title : str, optional

The title of the thesis (default is “Master Thesis”).

aut : str, optional

The author’s name (default is “Quentin Nater”).

lib : str, optional

The library or project name (default is “ImputeGAP”).

university : str, optional

The university or institution (default is “University Fribourg”).

Returns

None

imputegap.tools.utils.dl_integration_transformation(input_matrix, tr_ratio=0.8, inside_tr_cont_ratio=0.2, split_ts=1, split_val=0, nan_val=-99999, prevent_leak=True, offset=0.05, block_selection=True, seed=42, verbose=False)

Prepares contaminated data and corresponding masks for deep learning-based imputation training, validation, and testing.

This function simulates missingness in a controlled way, optionally prevents information leakage, and produces masks for training, testing, and validation using different contamination strategies.

Parameters

input_matrix : np.ndarray

The complete input time series data matrix of shape [T, N] (time steps × variables).

tr_ratio : float, default=0.8

The fraction of data to reserve for training when constructing the test contamination mask.

inside_tr_cont_ratio : float, default=0.2

The proportion of values to randomly drop inside the training data for internal contamination.

split_ts : float, default=1

Proportion of the total contaminated data assigned to the test set.

split_val : float, default=0

Proportion of the total contaminated data assigned to the validation set.

nan_val : float, default=-99999

Value used to represent missing entries in the masked matrix. Use nan_val=-1 to fill them with mean values instead.

prevent_leak : bool, default=True

Whether to replace NaN values with a large-magnitude number to prevent leakage.

offset : float, default=0.05

Minimum temporal offset at the beginning of the series.

block_selection : bool, default=True

Whether to simulate missing values in contiguous blocks (True) or randomly (False).

seed : int, default=42

Seed for NumPy random number generation to ensure reproducibility.

verbose : bool, default=False

Whether to print logging/debug information during execution.

Returns

cont_data_matrix : np.ndarray

The input matrix with synthetic missing values introduced.

mask_train : np.ndarray

Boolean mask of shape [T, N] indicating the training contamination locations (True = observed, False = missing).

mask_test : np.ndarray

Boolean mask of shape [T, N] indicating the test contamination locations.

mask_valid : np.ndarray

Boolean mask of shape [T, N] indicating the validation contamination locations.

error : bool

Flag raised if the operation is impossible.

imputegap.tools.utils.generate_random_mask(gt, mask_test, mask_valid, droprate=0.2, offset=None, verbose=False, seed=42)

Generate a random training mask over the non-NaN entries of gt, excluding positions already present in the test and validation masks.

Parameters

gt : numpy.ndarray

Ground truth data (no NaNs).

mask_test : numpy.ndarray

Binary mask indicating test positions.

mask_valid : numpy.ndarray

Binary mask indicating validation positions.

droprate : float

Proportion of eligible entries to include in the training mask.

offset : float

Offset of the dataset to protect, if any (default is None).

verbose : bool

Whether to print debug info.

seed : int, optional

Random seed for reproducibility.

Returns

numpy.ndarray

Binary mask indicating training positions.
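
The selection described for generate_random_mask can be illustrated with plain NumPy. This is a hedged sketch rather than the library's code: sketch_random_train_mask is a hypothetical name, the offset protection is ignored, and the use of numpy.random.default_rng is an assumption about how the seed is applied.

```python
import numpy as np


def sketch_random_train_mask(gt, mask_test, mask_valid, droprate=0.2, seed=42):
    """Pick a random share of entries that are neither NaN, test, nor validation."""
    rng = np.random.default_rng(seed)  # assumption: the library may seed differently
    eligible = ~np.isnan(gt) & ~mask_test.astype(bool) & ~mask_valid.astype(bool)
    positions = np.flatnonzero(eligible)          # flat indices of eligible entries
    n_selected = int(len(positions) * droprate)   # share of eligible entries to keep
    chosen = rng.choice(positions, size=n_selected, replace=False)
    mask_train = np.zeros(gt.size, dtype=bool)
    mask_train[chosen] = True
    return mask_train.reshape(gt.shape)


gt = np.arange(40, dtype=float).reshape(10, 4)
mask_test = np.zeros_like(gt, dtype=bool)
mask_test[:2, :] = True       # 8 entries reserved for testing
mask_valid = np.zeros_like(gt, dtype=bool)
mask_valid[2, :] = True       # 4 entries reserved for validation
mask_train = sketch_random_train_mask(gt, mask_test, mask_valid)
print(mask_train.sum())       # 28 eligible entries * 0.2, truncated to 5
```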

imputegap.tools.utils.get_missing_ratio(incomp_data)

Check whether the proportion of missing values in the contaminated data is acceptable for training a deep learning model.

Parameters

incomp_data : TimeSeries (numpy array)

TimeSeries object containing dataset.

Returns

bool

True if the missing data ratio is less than or equal to 40%, False otherwise.
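
The 40% check for get_missing_ratio amounts to comparing the share of NaN entries against a threshold. The sketch below illustrates that idea under stated assumptions: sketch_missing_ratio_ok and its threshold parameter are hypothetical and not part of the documented API.

```python
import numpy as np


def sketch_missing_ratio_ok(data, threshold=0.4):
    """Return True when the share of NaN entries is at most `threshold`."""
    ratio = np.isnan(data).mean()  # fraction of missing entries
    return bool(ratio <= threshold)


data = np.ones((5, 4))
data[:2, :] = np.nan                   # 8 of 20 entries missing, ratio 0.4
print(sketch_missing_ratio_ok(data))   # True: 0.4 is still acceptable
data[2, 0] = np.nan                    # 9 of 20 entries missing, ratio 0.45
print(sketch_missing_ratio_ok(data))   # False
```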

imputegap.tools.utils.list_of_algorithms()
imputegap.tools.utils.list_of_algorithms_with_families()
imputegap.tools.utils.list_of_datasets(txt=False)
imputegap.tools.utils.list_of_downstreams()
imputegap.tools.utils.list_of_downstreams_darts()
imputegap.tools.utils.list_of_downstreams_sktime()
imputegap.tools.utils.list_of_extractors()
imputegap.tools.utils.list_of_families()
imputegap.tools.utils.list_of_metrics()
imputegap.tools.utils.list_of_normalizers()
imputegap.tools.utils.list_of_optimizers()
imputegap.tools.utils.list_of_patterns()
imputegap.tools.utils.load_parameters(query: str = 'default', algorithm: str = 'cdrec', dataset: str = 'chlorine', optimizer: str = 'b', path=None, verbose=False)

Load default or optimal parameters for algorithms from a TOML file.

Parameters

query : str, optional

‘default’ or ‘optimal’ to load default or optimal parameters (default is “default”).

algorithm : str, optional

Algorithm to load parameters for (default is “cdrec”).

dataset : str, optional

Name of the dataset (default is “chlorine”).

optimizer : str, optional

Optimizer type for optimal parameters (default is “b”).

path : str, optional

Custom file path for the TOML file (default is None).

verbose : bool, optional

Whether to print information messages (default is False).

Returns

tuple

A tuple containing the loaded parameters for the given algorithm.

imputegap.tools.utils.load_share_lib(name='lib_cdrec', lib=True, verbose=True)

Load the shared library based on the operating system.

Parameters

name : str, optional

The name of the shared library (default is “lib_cdrec”).

lib : bool, optional

If True, the function loads the library from the default ‘imputegap’ path; if False, it loads from a local path (default is True).

verbose : bool, optional

Whether to print information messages (default is True).

Returns

ctypes.CDLL

The loaded shared library object.

imputegap.tools.utils.prepare_fixed_testing_set(incomp_m, tr_ratio=0.8, offset=0.05, block_selection=True, verbose=True)

Introduces additional missing values (NaNs) into a data matrix to match a specified training ratio.

This function modifies a copy of the input matrix incomp_m by introducing NaNs such that the proportion of observed (non-NaN) values matches the desired tr_ratio. It returns the modified matrix and the corresponding missing data mask.

Parameters

incomp_m : np.ndarray

A 2D NumPy array with potential pre-existing NaNs representing missing values.

tr_ratio : float

Desired ratio of observed (non-NaN) values in the output matrix. Must be in the range (0, 1).

offset : float

Protected zone at the beginning of the series.

block_selection : bool

Whether to select the missing values in contiguous blocks (True) or randomly (False).

verbose : bool

Whether to print debug info.

Returns

data_matrix_cont : np.ndarray

The modified matrix with additional NaNs introduced to match the specified training ratio.

new_mask : np.ndarray

A boolean mask of the same shape as data_matrix_cont where True indicates missing (NaN) entries.

Raises

AssertionError

If the final observed and missing ratios deviate from the target by more than 1%.

Notes

  • The function assumes that the input contains some non-NaN entries.

  • NaNs are added in row-major order from the list of available (non-NaN) positions.
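
The core idea of prepare_fixed_testing_set, adding NaNs in row-major order until the observed share matches tr_ratio, can be sketched as follows. This is an illustration under simplifying assumptions, not the library's implementation: sketch_fixed_testing_set is a hypothetical name, and the offset, block_selection, and the final 1% assertion are left out.

```python
import numpy as np


def sketch_fixed_testing_set(incomp_m, tr_ratio=0.8):
    """Add NaNs in row-major order until about (1 - tr_ratio) of entries are missing."""
    cont = incomp_m.astype(float).copy()         # work on a copy of the input
    flat = cont.ravel()                          # writable view on that copy
    n_missing_target = int(round((1 - tr_ratio) * flat.size))
    n_to_add = max(n_missing_target - int(np.isnan(flat).sum()), 0)
    available = np.flatnonzero(~np.isnan(flat))  # observed positions, row-major
    flat[available[:n_to_add]] = np.nan
    new_mask = np.isnan(cont)                    # True marks missing entries
    return cont, new_mask


m = np.arange(20, dtype=float).reshape(4, 5)
m[0, 0] = np.nan                                 # one pre-existing missing value
cont, new_mask = sketch_fixed_testing_set(m, tr_ratio=0.8)
print(new_mask.sum())                            # 4 of 20 entries missing (20%)
```

In this example three NaNs are added at the first three observed positions of the first row, so 16 of 20 entries (80%) remain observed.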

imputegap.tools.utils.prepare_testing_set(incomp_m, original_missing_ratio, block_selection=True, tr_ratio=0.8, verbose=True)
imputegap.tools.utils.prevent_leakage(matrix, mask, replacement=0, verbose=True)

Replaces missing values in a matrix to prevent data leakage during evaluation.

This function replaces all entries in matrix that are marked as missing in mask with a specified replacement value (default is 0). It then checks to ensure that there are no remaining NaNs in the matrix and that at least one replacement occurred.

Parameters

matrix : np.ndarray

A NumPy array potentially containing missing values (NaNs).

mask : np.ndarray

A boolean mask of the same shape as matrix, where True indicates positions to be replaced (typically where original values were NaN).

replacement : float or int, optional

The value to use in place of missing entries. Defaults to 0.

verbose : bool

Whether to print debug info.

Returns

matrix : np.ndarray

The matrix with missing entries replaced by the specified value.

Raises

AssertionError

If any NaNs remain in the matrix after replacement, or if no replacements were made.

Notes

  • This function is typically used before evaluation to ensure the model does not access ground truth values where data was originally missing.
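
The replacement and the two checks described for prevent_leakage can be sketched as below. This is a minimal illustration, not the library's code: sketch_prevent_leakage is a hypothetical name, and working on a copy is an assumption (the actual function may modify the matrix in place).

```python
import numpy as np


def sketch_prevent_leakage(matrix, mask, replacement=0):
    """Overwrite masked entries, then run the two documented sanity checks."""
    cleaned = matrix.copy()      # assumption: work on a copy of the input
    cleaned[mask] = replacement  # replace every position flagged in the mask
    assert not np.isnan(cleaned).any(), "NaNs remain after replacement"
    assert (cleaned[mask] == replacement).any(), "no replacement was made"
    return cleaned


matrix = np.array([[1.0, np.nan], [np.nan, 4.0]])
mask = np.isnan(matrix)          # True where the original values are missing
print(sketch_prevent_leakage(matrix, mask))
```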

imputegap.tools.utils.save_optimization(optimal_params, algorithm='cdrec', dataset='', optimizer='b', file_name=None)

Save the optimization parameters to a TOML file for later use without recomputing.

Parameters

optimal_params : dict

Dictionary of the optimal parameters.

algorithm : str, optional

The name of the imputation algorithm (default is ‘cdrec’).

dataset : str, optional

The name of the dataset (default is an empty string).

optimizer : str, optional

The name of the optimizer used (default is ‘b’).

file_name : str, optional

The name of the TOML file to save the results (default is None).

Returns

None

imputegap.tools.utils.search_path(set_name='test')

Find the correct path for loading test files.

Parameters

set_name : str, optional

Name of the dataset (default is “test”).

Returns

str

The correct file path for the dataset.

imputegap.tools.utils.split_mask_bwt_test_valid(data_matrix, test_rate=0.8, valid_rate=0.2, nan_val=None, verbose=False, seed=42)

Distribute the NaN positions in data_matrix between the test and validation masks only.

Parameters

data_matrix : numpy.ndarray

Input matrix containing NaNs to be split.

test_rate : float

Proportion of NaNs to assign to the test set (default is 0.8).

valid_rate : float

Proportion of NaNs to assign to the validation set (default is 0.2). test_rate + valid_rate must equal 1.0.

verbose : bool

Whether to print debug info.

seed : int, optional

Random seed for reproducibility.

Returns

tuple

test_mask : numpy.ndarray

Binary mask indicating positions of NaNs in the test set.

valid_mask : numpy.ndarray

Binary mask indicating positions of NaNs in the validation set.

n_nan : int

Total number of NaN values found in the input matrix.
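
Splitting the NaN positions between the two masks can be sketched with a seeded shuffle. This is an illustration under stated assumptions, not the library's implementation: sketch_split_nan_positions is a hypothetical name, the nan_val argument is ignored, and numpy.random.default_rng is an assumed way of applying the seed.

```python
import numpy as np


def sketch_split_nan_positions(data_matrix, test_rate=0.8, valid_rate=0.2, seed=42):
    """Shuffle the NaN positions and split them into test and validation masks."""
    assert abs(test_rate + valid_rate - 1.0) < 1e-9, "rates must sum to 1.0"
    rng = np.random.default_rng(seed)  # assumption: the library may seed differently
    nan_positions = rng.permutation(np.flatnonzero(np.isnan(data_matrix)))
    n_nan = len(nan_positions)
    n_test = int(n_nan * test_rate)    # first share goes to the test set
    test_mask = np.zeros(data_matrix.size, dtype=bool)
    valid_mask = np.zeros(data_matrix.size, dtype=bool)
    test_mask[nan_positions[:n_test]] = True
    valid_mask[nan_positions[n_test:]] = True
    shape = data_matrix.shape
    return test_mask.reshape(shape), valid_mask.reshape(shape), n_nan


data = np.ones((10, 5))
data[:2, :] = np.nan                   # 10 missing entries
test_mask, valid_mask, n_nan = sketch_split_nan_positions(data)
print(test_mask.sum(), valid_mask.sum(), n_nan)  # 8 2 10
```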

imputegap.tools.utils.verification_limitation(percentage, low_limit=0.01, high_limit=1.0)

Format and verify that the percentage given by the user is within acceptable bounds.

Parameters

percentage : float

The percentage value to be checked and potentially adjusted.

low_limit : float, optional

The lower limit of the acceptable percentage range (default is 0.01).

high_limit : float, optional

The upper limit of the acceptable percentage range (default is 1.0).

Returns

float

Adjusted percentage based on the limits.

Raises

ValueError

If the percentage is outside the accepted limits.

Notes

  • If the percentage is between 1 and 100, it will be divided by 100 to convert it to a decimal format.

  • If the percentage is outside the low and high limits, the function will print a warning and return the original value.
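
The checks described for verification_limitation can be sketched as follows. This is a hedged illustration, not the library's code: sketch_verify_percentage is a hypothetical name, and the sketch follows the Raises section by raising ValueError for out-of-range values, whereas the last note describes a warning, so the actual behavior should be confirmed against the source.

```python
def sketch_verify_percentage(percentage, low_limit=0.01, high_limit=1.0):
    """Return a fraction within [low_limit, high_limit] or raise ValueError."""
    if low_limit <= percentage <= high_limit:
        return percentage            # already a valid fraction
    if 1 <= percentage <= 100:
        return percentage / 100      # e.g. 40 becomes 0.4
    raise ValueError(f"percentage {percentage} is outside the accepted limits")


print(sketch_verify_percentage(0.25))  # 0.25, returned unchanged
print(sketch_verify_percentage(40))    # 0.4, converted from a 1-100 value
```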