imputegap.tools.utils package¶

The imputegap.tools.utils package provides various utility functions and tools for handling algorithm parameters, evaluation, and other operations in the imputation process.

imputegap.tools.utils module¶

imputegap.tools.utils.check_family(family, algorithm)[source]¶

imputegap.tools.utils.compute_batch_size(data, min_size=4, max_size=16, divisor=2, verbose=True)[source]¶

Compute an appropriate batch size based on the input data shape.

The batch size is computed as min(M // divisor, max_size), where M is the number of samples and divisor defaults to 2. If the result is less than min_size, it is set to min_size instead.

Parameters¶

data (np.ndarray or torch.Tensor): Input 2D data of shape (M, N), where M is the number of samples.
min_size (int, optional): Minimum allowed batch size. Default is 4.
max_size (int, optional): Maximum allowed batch size. Default is 16.
divisor (int, optional): Divisor applied to the number of samples M. Default is 2.
verbose (bool, optional): If True, prints the computed batch size. Default is True.

Returns¶

int: Computed batch size.
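
The rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `batch_size_sketch` is hypothetical, and the sketch assumes that the `divisor` parameter takes the place of the literal 2 in the formula.

```python
def batch_size_sketch(n_samples, min_size=4, max_size=16, divisor=2):
    """Hypothetical restatement of the documented rule, not ImputeGAP's code."""
    # Cap the divided sample count at max_size.
    size = min(n_samples // divisor, max_size)
    # Raise the result to min_size if it fell below the floor.
    return max(size, min_size)


print(batch_size_sketch(6))    # 6 // 2 = 3, below min_size, so 4
print(batch_size_sketch(20))   # 20 // 2 = 10
print(batch_size_sketch(100))  # 100 // 2 = 50, capped at 16
```

With the default bounds, any dataset of 32 or more samples ends up with a batch size of 16.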

imputegap.tools.utils.compute_rank_check(M, rank, verbose=True)[source]¶

Validates and adjusts the rank used in matrix operations based on the number of time series.

Parameters¶

M (int): Number of series.
rank (int): The desired rank (e.g., for matrix factorization or low-rank approximation).
verbose (bool, optional): Whether to print the error message. Default is True.

Returns¶

rank (int): A valid rank value, adjusted to avoid exceeding the number of available series.

imputegap.tools.utils.compute_seq_length(M)[source]¶

Compute a sequence length based on the input length M using heuristic rules.

Parameters¶

M (int): Number of series in the dataset.

Returns¶

seq_length (int): A derived sequence length appropriate for processing or windowing.

imputegap.tools.utils.config_contamination(ts, pattern, dataset_rate=0.4, series_rate=0.4, block_size=10, offset=0.1, seed=True, limit=1, shift=0.05, std_dev=0.5, explainer=False, probabilities=None, verbose=True)[source]¶

Configure and execute the contamination of a time series with the selected pattern.

Parameters¶

rate (float): Mean rate of missing values for the contamination.
ts (TimeSeries): A TimeSeries object containing the dataset.
pattern (str): Type of contamination pattern (e.g., “mcar”, “mp”, “blackout”, “disjoint”, “overlap”, “gaussian”).
block_size (int): Size of the blocks removed in MCAR.

Returns¶

TimeSeries: TimeSeries object containing contaminated data.

imputegap.tools.utils.config_forecaster(model, params)[source]¶

Configure and execute a forecaster model for downstream analytics.

Parameters¶

model (str): Name of the forecaster model.
params (list): List of parameters for the forecaster model.

Returns¶

Forecaster object (sktime/Darts): Forecaster object for downstream analytics.

imputegap.tools.utils.config_impute_algorithm(incomp_data, algorithm, verbose=True)[source]¶

Configure and execute the selected imputation algorithm.

Parameters¶

incomp_data (TimeSeries): TimeSeries object containing the dataset.
algorithm (str): Name of the algorithm.
verbose (bool, optional): Whether to display information messages (default is True).

Returns¶

BaseImputer: Configured imputer instance with optimal parameters.

imputegap.tools.utils.display_title(title='Master Thesis', aut='Quentin Nater', lib='ImputeGAP', university='University Fribourg')[source]¶

Display the title and author information.

Parameters¶

title (str, optional): The title of the thesis (default is “Master Thesis”).
aut (str, optional): The author’s name (default is “Quentin Nater”).
lib (str, optional): The library or project name (default is “ImputeGAP”).
university (str, optional): The university or institution (default is “University Fribourg”).

Returns¶

None

imputegap.tools.utils.dl_integration_transformation(input_matrix, tr_ratio=0.8, inside_tr_cont_ratio=0.2, split_ts=1, split_val=0, nan_val=-99999, prevent_leak=True, offset=0.05, block_selection=True, seed=42, verbose=False)[source]¶

Prepares contaminated data and corresponding masks for deep learning-based imputation training, validation, and testing.

This function simulates missingness in a controlled way, optionally prevents information leakage, and produces masks for training, testing, and validation using different contamination strategies.

Parameters¶

input_matrix (np.ndarray): The complete input time series data matrix of shape [T, N] (time steps × variables).
tr_ratio (float, default=0.8): The fraction of data to reserve for training when constructing the test contamination mask.
inside_tr_cont_ratio (float, default=0.2): The proportion of values to randomly drop inside the training data for internal contamination.
split_ts (float, default=1): Proportion of the total contaminated data assigned to the test set.
split_val (float, default=0): Proportion of the total contaminated data assigned to the validation set.
nan_val (float, default=-99999): Value used to represent missing entries in the masked matrix. Use nan_val=-1 to fill with mean values instead.
prevent_leak (bool, default=True): Whether to replace NaN values with a large-magnitude placeholder to prevent leakage.
offset (float, default=0.05): Minimum temporal offset at the beginning of the series.
block_selection (bool, default=True): Whether to simulate missing values in contiguous blocks (True) or randomly (False).
seed (int, default=42): Seed for NumPy random number generation to ensure reproducibility.
verbose (bool, default=False): Whether to print logging/debug information during execution.

Returns¶

cont_data_matrix (np.ndarray): The input matrix with synthetic missing values introduced.
mask_train (np.ndarray): Boolean mask of shape [T, N] indicating the training contamination locations (True = observed, False = missing).
mask_test (np.ndarray): Boolean mask of shape [T, N] indicating the test contamination locations.
mask_valid (np.ndarray): Boolean mask of shape [T, N] indicating the validation contamination locations.
error (bool): Flag set when the operation is impossible.

imputegap.tools.utils.generate_random_mask(gt, mask_test, mask_valid, droprate=0.2, offset=None, verbose=False, seed=42)[source]¶

Generate a random training mask over the non-NaN entries of gt, excluding positions already present in the test and validation masks.

Parameters¶

gt (numpy.ndarray): Ground truth data (no NaNs).
mask_test (numpy.ndarray): Binary mask indicating test positions.
mask_valid (numpy.ndarray): Binary mask indicating validation positions.
droprate (float): Proportion of eligible entries to include in the training mask.
offset (float, optional): Whether to protect the offset at the start of the dataset (default is None).
verbose (bool): Whether to print debug info.
seed (int, optional): Random seed for reproducibility.

Returns¶

numpy.ndarray: Binary mask indicating training positions.
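
The selection described above can be sketched with NumPy. This is an illustrative approximation under stated assumptions, not the library's code: the helper name `random_train_mask_sketch` is hypothetical, the `offset` handling is left out, and `numpy.random.default_rng` is used for seeding, which may differ from how ImputeGAP seeds its generator.

```python
import numpy as np


def random_train_mask_sketch(gt, mask_test, mask_valid, droprate=0.2, seed=42):
    """Hypothetical sketch: pick a random share of entries not used for test/validation."""
    rng = np.random.default_rng(seed)
    # Eligible = observed in gt and not already claimed by another mask.
    eligible = ~np.isnan(gt) & ~mask_test.astype(bool) & ~mask_valid.astype(bool)
    positions = np.flatnonzero(eligible)
    n_pick = int(len(positions) * droprate)
    chosen = rng.choice(positions, size=n_pick, replace=False)
    mask_train = np.zeros(gt.shape, dtype=bool)
    mask_train.flat[chosen] = True
    return mask_train


gt = np.arange(40, dtype=float).reshape(10, 4)
mask_test = np.zeros_like(gt, dtype=bool)
mask_test[:2, :] = True          # 8 test positions
mask_valid = np.zeros_like(gt, dtype=bool)
mask_valid[2, :] = True          # 4 validation positions
mask_train = random_train_mask_sketch(gt, mask_test, mask_valid)
print(mask_train.sum())          # int(28 * 0.2) = 5 positions
```

Sampling without replacement from the flattened eligible positions guarantees that the training mask never overlaps the test or validation masks.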

imputegap.tools.utils.get_missing_ratio(incomp_data)[source]¶

Check whether the proportion of missing values in the contaminated data is acceptable for training a deep learning model.

Parameters¶

incomp_data (TimeSeries, as a numpy array): TimeSeries object containing the dataset.

Returns¶

bool: True if the missing data ratio is less than or equal to 40%, False otherwise.
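
The check described above amounts to comparing the share of NaN values with a 40% threshold. A minimal sketch, assuming a plain NumPy matrix as input; the helper name `missing_ratio_ok_sketch` and its `threshold` argument are hypothetical and not part of ImputeGAP:

```python
import numpy as np


def missing_ratio_ok_sketch(matrix, threshold=0.4):
    """Hypothetical check: True when the share of NaNs is at most `threshold`."""
    ratio = np.isnan(matrix).mean()
    return bool(ratio <= threshold)


data = np.ones((5, 4))
data[:1, :] = np.nan                      # 4 of 20 values missing (20%)
low_missing = missing_ratio_ok_sketch(data)
data[:3, :] = np.nan                      # 12 of 20 values missing (60%)
high_missing = missing_ratio_ok_sketch(data)
print(low_missing, high_missing)          # True False
```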

imputegap.tools.utils.list_of_algorithms()[source]¶

imputegap.tools.utils.list_of_algorithms_with_families()[source]¶

imputegap.tools.utils.list_of_datasets(txt=False)[source]¶

imputegap.tools.utils.list_of_downstreams()[source]¶

imputegap.tools.utils.list_of_downstreams_darts()[source]¶

imputegap.tools.utils.list_of_downstreams_sktime()[source]¶

imputegap.tools.utils.list_of_extractors()[source]¶

imputegap.tools.utils.list_of_families()[source]¶

imputegap.tools.utils.list_of_metrics()[source]¶

imputegap.tools.utils.list_of_normalizers()[source]¶

imputegap.tools.utils.list_of_optimizers()[source]¶

imputegap.tools.utils.list_of_patterns()[source]¶

imputegap.tools.utils.load_parameters(query: str = 'default', algorithm: str = 'cdrec', dataset: str = 'chlorine', optimizer: str = 'b', path=None, verbose=False)[source]¶

Load default or optimal parameters for algorithms from a TOML file.

Parameters¶

query (str, optional): ‘default’ or ‘optimal’ to load default or optimal parameters (default is “default”).
algorithm (str, optional): Algorithm to load parameters for (default is “cdrec”).
dataset (str, optional): Name of the dataset (default is “chlorine”).
optimizer (str, optional): Optimizer type for optimal parameters (default is “b”).
path (str, optional): Custom file path for the TOML file (default is None).
verbose (bool, optional): Whether to display information messages (default is False).

Returns¶

tuple: A tuple containing the loaded parameters for the given algorithm.

imputegap.tools.utils.load_share_lib(name='lib_cdrec', lib=True, verbose=True)[source]¶

Load the shared library based on the operating system.

Parameters¶

name (str, optional): The name of the shared library (default is “lib_cdrec”).
lib (bool, optional): If True, the function loads the library from the default ‘imputegap’ path; if False, it loads from a local path (default is True).
verbose (bool, optional): Whether to display information messages (default is True).

Returns¶

ctypes.CDLL: The loaded shared library object.

imputegap.tools.utils.prepare_fixed_testing_set(incomp_m, tr_ratio=0.8, offset=0.05, block_selection=True, verbose=True)[source]¶

Introduces additional missing values (NaNs) into a data matrix to match a specified training ratio.

This function modifies a copy of the input matrix incomp_m by introducing NaNs such that the proportion of observed (non-NaN) values matches the desired tr_ratio. It returns the modified matrix and the corresponding missing data mask.

Parameters¶

incomp_m (np.ndarray): A 2D NumPy array with potential pre-existing NaNs representing missing values.
tr_ratio (float): Desired ratio of observed (non-NaN) values in the output matrix. Must be in the range (0, 1).
offset (float): Protected zone at the beginning of the series.
block_selection (bool): If True, select the missing values in contiguous blocks; if False, select them randomly.
verbose (bool): Whether to print debug info.

Returns¶

data_matrix_cont (np.ndarray): The modified matrix with additional NaNs introduced to match the specified training ratio.
new_mask (np.ndarray): A boolean mask of the same shape as data_matrix_cont where True indicates missing (NaN) entries.

Raises¶

AssertionError: If the final observed and missing ratios deviate from the target by more than 1%.

Notes¶

The function assumes that the input contains some non-NaN entries.

NaNs are added in row-major order from the list of available (non-NaN) positions.
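
The row-major filling described in the notes can be sketched as follows. This is a simplified illustration, not the library's implementation: the helper name `fixed_testing_set_sketch` is hypothetical, and the `offset` and `block_selection` options are deliberately ignored.

```python
import numpy as np


def fixed_testing_set_sketch(incomp_m, tr_ratio=0.8):
    """Hypothetical sketch: add NaNs until the observed share equals tr_ratio."""
    data = incomp_m.copy()
    total = data.size
    target_missing = int(round(total * (1 - tr_ratio)))
    n_extra = max(target_missing - int(np.isnan(data).sum()), 0)
    # Row-major list of positions that still hold a value.
    available = np.flatnonzero(~np.isnan(data))
    data.flat[available[:n_extra]] = np.nan
    new_mask = np.isnan(data)
    # Documented tolerance: final ratios within 1% of the target.
    assert abs(new_mask.mean() - (1 - tr_ratio)) <= 0.01
    return data, new_mask


m = np.arange(100, dtype=float).reshape(10, 10)
m[5, 5] = np.nan                       # one pre-existing gap
cont, mask = fixed_testing_set_sketch(m, tr_ratio=0.8)
print(mask.sum())                      # 20 missing values out of 100
```

Because the pre-existing NaN already counts toward the target, only 19 new NaNs are introduced in this example.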

imputegap.tools.utils.prepare_testing_set(incomp_m, original_missing_ratio, block_selection=True, tr_ratio=0.8, verbose=True)[source]¶

imputegap.tools.utils.prevent_leakage(matrix, mask, replacement=0, verbose=True)[source]¶

Replaces missing values in a matrix to prevent data leakage during evaluation.

This function replaces all entries in matrix that are marked as missing in mask with a specified replacement value (default is 0). It then checks to ensure that there are no remaining NaNs in the matrix and that at least one replacement occurred.

Parameters¶

matrix (np.ndarray): A NumPy array potentially containing missing values (NaNs).
mask (np.ndarray): A boolean mask of the same shape as matrix, where True indicates positions to be replaced (typically where original values were NaN).
replacement (float or int, optional): The value to use in place of missing entries. Defaults to 0.
verbose (bool): Whether to print debug info.

Returns¶

matrix (np.ndarray): The matrix with missing entries replaced by the specified value.

Raises¶

AssertionError: If any NaNs remain in the matrix after replacement, or if no replacements were made.

Notes¶

This function is typically used before evaluation to ensure the model does not access ground truth values where data was originally missing.
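
The replace-then-check behaviour described above can be sketched with NumPy. The helper name `prevent_leakage_sketch` is hypothetical, and working on a copy is a choice of this sketch; the documentation does not say whether the library modifies the matrix in place.

```python
import numpy as np


def prevent_leakage_sketch(matrix, mask, replacement=0):
    """Hypothetical sketch of the documented replace-then-check behaviour."""
    out = matrix.copy()                 # copying is a choice of this sketch
    out[mask] = replacement
    # The two documented checks: no NaN left, and at least one replacement.
    assert not np.isnan(out).any(), "NaNs remain after replacement"
    assert mask.any(), "no replacement was made"
    return out


x = np.array([[1.0, np.nan], [np.nan, 4.0]])
clean = prevent_leakage_sketch(x, np.isnan(x), replacement=-99999)
print(clean)
```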

imputegap.tools.utils.save_optimization(optimal_params, algorithm='cdrec', dataset='', optimizer='b', file_name=None)[source]¶

Save the optimization parameters to a TOML file for later use without recomputing.

Parameters¶

optimal_params (dict): Dictionary of the optimal parameters.
algorithm (str, optional): The name of the imputation algorithm (default is ‘cdrec’).
dataset (str, optional): The name of the dataset (default is an empty string).
optimizer (str, optional): The name of the optimizer used (default is ‘b’).
file_name (str, optional): The name of the TOML file to save the results (default is None).

Returns¶

None

imputegap.tools.utils.search_path(set_name='test')[source]¶

Find the correct path for loading test files.

Parameters¶

set_name (str, optional): Name of the dataset (default is “test”).

Returns¶

str: The correct file path for the dataset.

imputegap.tools.utils.split_mask_bwt_test_valid(data_matrix, test_rate=0.8, valid_rate=0.2, nan_val=None, verbose=False, seed=42)[source]¶

Dispatch NaN positions in data_matrix to test and validation masks only.

Parameters¶

data_matrix (numpy.ndarray): Input matrix containing NaNs to be split.
test_rate (float): Proportion of NaNs to assign to the test set (default is 0.8).
valid_rate (float): Proportion of NaNs to assign to the validation set (default is 0.2). test_rate + valid_rate must equal 1.0.
verbose (bool): Whether to print debug info.
seed (int, optional): Random seed for reproducibility.

Returns¶

tuple

test_mask (numpy.ndarray): Binary mask indicating positions of NaNs in the test set.
valid_mask (numpy.ndarray): Binary mask indicating positions of NaNs in the validation set.
n_nan (int): Total number of NaN values found in the input matrix.
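
The split described above can be sketched by shuffling the NaN positions and cutting the list at the test share. This is an assumption-laden illustration, not the library's code: the helper name `split_nan_positions_sketch` is hypothetical, the `nan_val` argument is not modelled, and `numpy.random.default_rng` stands in for whatever seeding the library uses.

```python
import numpy as np


def split_nan_positions_sketch(data_matrix, test_rate=0.8, valid_rate=0.2, seed=42):
    """Hypothetical sketch: share the NaN positions between test and validation masks."""
    assert abs(test_rate + valid_rate - 1.0) < 1e-9, "rates must sum to 1.0"
    rng = np.random.default_rng(seed)
    nan_positions = np.flatnonzero(np.isnan(data_matrix))
    n_nan = len(nan_positions)
    shuffled = rng.permutation(nan_positions)
    n_test = int(round(n_nan * test_rate))
    test_mask = np.zeros(data_matrix.shape, dtype=bool)
    valid_mask = np.zeros(data_matrix.shape, dtype=bool)
    test_mask.flat[shuffled[:n_test]] = True
    valid_mask.flat[shuffled[n_test:]] = True
    return test_mask, valid_mask, n_nan


d = np.ones((5, 4))
d[:, 0] = np.nan
d[:, 1] = np.nan                        # 10 NaNs in total
test_mask, valid_mask, n_nan = split_nan_positions_sketch(d)
print(test_mask.sum(), valid_mask.sum(), n_nan)   # 8 2 10
```

Every NaN position lands in exactly one of the two masks, so their union reproduces the original missingness pattern.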

imputegap.tools.utils.verification_limitation(percentage, low_limit=0.01, high_limit=1.0)[source]¶

Format and verify that the percentage given by the user is within acceptable bounds.

Parameters¶

percentage (float): The percentage value to be checked and potentially adjusted.
low_limit (float, optional): The lower limit of the acceptable percentage range (default is 0.01).
high_limit (float, optional): The upper limit of the acceptable percentage range (default is 1.0).

Returns¶

float: Adjusted percentage based on the limits.

Raises¶

ValueError: If the percentage is outside the accepted limits.

Notes¶

If the percentage is between 1 and 100, it will be divided by 100 to convert it to a decimal format.
If the percentage is outside the low and high limits, the function will print a warning and return the original value.
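
The checks above can be sketched in plain Python. The helper name `verify_percentage_sketch` is hypothetical, and the sketch assumes that out-of-range values raise ValueError, as the Raises section states; the last note describes a warning instead, so the library's actual behaviour may differ.

```python
def verify_percentage_sketch(percentage, low_limit=0.01, high_limit=1.0):
    """Hypothetical sketch of the documented checks, not ImputeGAP's code."""
    if low_limit <= percentage <= high_limit:
        return percentage                 # already a valid fraction
    if 1 <= percentage <= 100:
        return percentage / 100           # e.g. 40 becomes 0.4
    # Assumption: follow the Raises section rather than printing a warning.
    raise ValueError(f"percentage {percentage} is outside the accepted limits")


print(verify_percentage_sketch(0.25))   # 0.25
print(verify_percentage_sketch(40))     # 0.4
```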