imputegap.tools.utils package

The imputegap.tools.utils package provides various utility functions and tools for handling algorithm parameters, evaluation, and other operations in the imputation process.

imputegap.tools.utils module

imputegap.tools.utils.auto_seq_llms(data_x, goal='seq', subset=False, high_limit=200, exception=False, b=None, verbose=True, deep_verbose=False)[source]

Brute-force search for nice (seq_len, batch_size) pairs.

If subset is False:

data_x: array of shape (T, …)

If subset is True:

We internally split T into train / test / val using 0.7 / 0.2 / 0.1 and ensure that batch_size is <= num_windows for each subset.

Returns:

(seq_len, batch_size)

imputegap.tools.utils.auto_seq_sample(matrix, tr_ratio, high_val=98, verbose=True)[source]

Automatically select a suitable sequence length and batch size based on the dataset size and a predefined batch-size table.

The function iteratively searches for an even seq_len, starting from high_val and decreasing by 2, until it is less than or equal to small_set, where:

small_set = int(T * (1 - tr_ratio)) // 2

with T being the number of time steps (rows) in matrix. If the search goes below 2, seq_len is clamped to 2.

Once seq_len is found, the batch size is chosen from a fixed table [2, 4, 8, 16, 32, 64, 96] as the value closest to seq_len.

Parameters

matrixnp.ndarray

Input 2D array of shape (T, F), where T is the number of time steps and F the number of features.

tr_ratiofloat

Training ratio in [0, 1]. Used to compute the size of the “smallest set” (typically validation/test portion) that seq_len should not exceed.

high_valint, optional

Initial (maximum) candidate sequence length from which the search starts and decreases by 2. Default is 98.

verbosebool, optional

If True, prints the selected seq_len, batch_size and the computed small_set. Default is True.

Returns

seq_lenint

Selected sequence length, guaranteed to be at least 2 and less than or equal to small_set.

batch_sizeint

Selected batch size from the fixed table [2, 4, 8, 16, 32, 64, 96] that is closest (in absolute difference) to seq_len.
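
The search described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the documented rules, not the library source: the name sketch_auto_seq_sample is hypothetical, it takes the number of time steps directly instead of a matrix, and it assumes an even high_val. Ties between two table entries go to the smaller one here, which the documentation does not specify.

```python
# Illustrative sketch of the documented search; not the ImputeGAP source.
BATCH_TABLE = [2, 4, 8, 16, 32, 64, 96]


def sketch_auto_seq_sample(n_steps, tr_ratio, high_val=98):
    """Return (seq_len, batch_size, small_set) following the documented rules."""
    small_set = int(n_steps * (1 - tr_ratio)) // 2

    # Walk down from high_val in steps of 2 until seq_len fits in small_set.
    seq_len = high_val
    while seq_len > small_set and seq_len > 2:
        seq_len -= 2
    seq_len = max(seq_len, 2)  # clamp when the search would go below 2

    # Closest table entry in absolute difference (assumption: ties pick the smaller entry).
    batch_size = min(BATCH_TABLE, key=lambda b: abs(b - seq_len))
    return seq_len, batch_size, small_set


print(sketch_auto_seq_sample(120, 0.5))  # (30, 32, 30)
```

For very short inputs the clamp wins over the small_set bound: with 6 time steps and tr_ratio=0.5, small_set is 1 but the sketch still returns a seq_len of 2.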

imputegap.tools.utils.check_contamination_series(ts_m, algo='the algorithm', verbose=True)[source]

Verify whether the input time series matrix meets the contamination constraints required by uni-dimensional algorithms (such as SPIRIT).

Specifically, this function checks if only the first series (column 0) contains missing (NaN) values. If any other series is contaminated, it reports an imputation error (optionally printing a message) and returns True to signal that an issue exists.

Parameters

ts_mnp.ndarray

A 2D NumPy array representing the time series matrix, where each column corresponds to a separate series.

algostr, optional

The name of the algorithm being validated. Used only for logging in the printed error message. Default is “the algorithm”.

verbosebool, optional

If True, prints an error message when contamination is detected outside of series 0. Default is True.

Returns

bool

False if only series 0 is contaminated (valid input). True if contamination exists in any other series (invalid input).
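
The check itself reduces to one NumPy expression. The sketch below assumes, as the parameter description states, that each column is one series; the function name is hypothetical and the logging behaviour is left out.

```python
import numpy as np


def sketch_has_invalid_contamination(ts_m):
    """Return True if any series other than column 0 contains NaN values."""
    ts_m = np.asarray(ts_m, dtype=float)
    # Only columns 1..end are inspected; NaNs in column 0 are allowed.
    return bool(np.isnan(ts_m[:, 1:]).any())


valid = np.array([[np.nan, 1.0], [2.0, 3.0]])
invalid = np.array([[1.0, np.nan], [2.0, 3.0]])
print(sketch_has_invalid_contamination(valid))    # False
print(sketch_has_invalid_contamination(invalid))  # True
```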

imputegap.tools.utils.check_family(family='DeepLearning', algorithm='')[source]

Check whether a given algorithm belongs to a specified family.

Parameters

familystr, optional

Name of the algorithm family to check against (e.g. "DeepLearning"). Defaults to "DeepLearning".

algorithmstr

Name of the algorithm to check for membership in the given family. Matching is case-insensitive and ignores underscores and hyphens.

Returns

bool

True if an algorithm with the given name exists within the specified family, False otherwise.
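
A minimal sketch of the case-insensitive matching that ignores underscores and hyphens. The registry and the algorithm names in it are invented for the illustration and do not reflect the real family lists of ImputeGAP.

```python
def _normalize(name):
    """Lower-case a name and drop underscores and hyphens."""
    return name.lower().replace("_", "").replace("-", "")


# Hypothetical registry, used only for this illustration.
HYPOTHETICAL_FAMILIES = {
    "DeepLearning": ["Deep-Model_A", "DeepModelB"],
    "Statistics": ["Simple_Mean"],
}


def sketch_check_family(family="DeepLearning", algorithm=""):
    """Return True if the normalized algorithm name is listed in the family."""
    members = HYPOTHETICAL_FAMILIES.get(family, [])
    target = _normalize(algorithm)
    return any(_normalize(member) == target for member in members)


print(sketch_check_family("DeepLearning", "deep_model-a"))  # True
print(sketch_check_family("Statistics", "deepmodelb"))      # False
```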

imputegap.tools.utils.clean_missing_values(raw_data=None, substitute='zero', mask=None)[source]

Replace all NaN values in a 2D matrix by a column-wise substitute.

Parameters

raw_datanp.ndarray

2D input array of shape (N, M) containing missing values encoded as NaN.

substitute{“mean”, “median”, “zero”}, optional

Strategy used to replace NaNs per column. Default is “zero”.

  • “mean”: replace NaNs with the column-wise mean (ignoring NaNs).

  • “median”: replace NaNs with the column-wise median (ignoring NaNs).

  • “zero”: replace NaNs with 0.

masknp.ndarray, optional

Mask used in place of the default NaN detection.

Returns

np.ndarray

2D array of shape (N, M) with NaNs replaced column-wise
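
The three strategies map directly onto NumPy's NaN-aware reductions. The sketch below is an assumption-laden illustration rather than the library code: the function name is hypothetical, a supplied mask is treated as a boolean array marking the entries to replace, and a column with no observed value would yield NaN for “mean” and “median”.

```python
import numpy as np


def sketch_clean_missing_values(raw_data, substitute="zero", mask=None):
    """Replace flagged entries column by column; returns a new array."""
    data = np.array(raw_data, dtype=float)  # np.array copies the input
    missing = np.isnan(data) if mask is None else np.asarray(mask, dtype=bool)
    observed = np.where(missing, np.nan, data)  # hide flagged entries from the statistics

    if substitute == "zero":
        fill = np.zeros(data.shape[1])
    elif substitute == "mean":
        fill = np.nanmean(observed, axis=0)
    elif substitute == "median":
        fill = np.nanmedian(observed, axis=0)
    else:
        raise ValueError(f"unknown substitute: {substitute!r}")

    rows, cols = np.nonzero(missing)
    data[rows, cols] = fill[cols]  # each entry gets the value of its own column
    return data


example = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
print(sketch_clean_missing_values(example, "mean"))
```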

imputegap.tools.utils.config_contamination(ts, pattern, dataset_rate=0.4, series_rate=0.4, block_size=10, offset=0.1, seed=True, limit=1, shift=0.05, std_dev=0.5, explainer=False, probabilities=None, logic_by_series=True, verbose=True)[source]

Configure and execute the contamination of a dataset with the selected pattern.

Parameters

ratefloat

Mean parameter for contamination missing percentage rate.

tsTimeSeries

A TimeSeries object containing the dataset.

patternstr

Type of contamination pattern (e.g., “mcar”, “mp”, “blackout”, “disjoint”, “overlap”, “gaussian”).

block_sizeint

Size of the blocks removed by the MCAR pattern.

Returns

TimeSeries

TimeSeries object containing contaminated data.

imputegap.tools.utils.config_forecaster(model, params)[source]

Configure and execute forecaster model for downstream analytics

Parameters

modelstr

Name of the forecaster model.

paramslist of params

List of parameters for the forecaster model.

Returns

Forecaster object (sktime/darts)

Forecaster object for downstream analytics

imputegap.tools.utils.config_impute_algorithm(incomp_data, algorithm, verbose=True)[source]

Configure and execute the selected imputation algorithm on the contaminated data.

Parameters

incomp_dataTimeSeries

TimeSeries object containing dataset.

algorithmstr

Name of algorithm

verbosebool, optional

Whether to display the contamination information (default is True).

Returns

BaseImputer

Configured imputer instance with optimal parameters.

imputegap.tools.utils.control_boundaries(rank, boundary, algorithm='Algorithm', reduction=1)[source]

Ensure that the rank does not exceed the boundary limit.

Parameters

rankint

The input rank, typically representing the number of components or factors.

boundaryint

The maximum allowed value, usually corresponding to the number of available series.

algorithmstr, optional

The name of the algorithm using this control check (default is “Algorithm”).

reductionint, optional

The amount to reduce the boundary by if the rank exceeds it (default is 1).

Returns

int

The adjusted rank value. If the input rank is valid, it is returned unchanged. If it exceeds the boundary, a reduced value is returned. If no valid reduction is possible, returns 1.
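
The return rule can be read as the following sketch. It assumes that "exceeds" means strictly greater than the boundary and that the reduced value is boundary minus reduction; both readings are assumptions, and the function name is hypothetical.

```python
def sketch_control_boundaries(rank, boundary, reduction=1):
    """Return rank unchanged if it fits, else a reduced value, never below 1."""
    if rank <= boundary:  # assumption: a rank equal to the boundary is still valid
        return rank
    reduced = boundary - reduction
    return reduced if reduced >= 1 else 1


print(sketch_control_boundaries(3, 10))   # 3
print(sketch_control_boundaries(12, 10))  # 9
print(sketch_control_boundaries(5, 1))    # 1
```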

imputegap.tools.utils.dataset_add_dimensionality(matrix, seq_length=24, reshapable=True, adding_nans=True, three_dim=True, window=False, verbose=False, deep_verbose=False)[source]

Prepare a 2D matrix for sequence-based models (sample strategy) by padding and optional reshaping to 3D.

Parameters

matrixnp.ndarray

Input 2D array of shape (N, M), where N is the number of time steps (rows) and M is the number of features (columns).

seq_lengthint, optional

Target sequence length (number of time steps per segment). Used for padding and reshaping. Default is 24.

reshapablebool, optional

If True, the matrix is padded (if needed) so that its number of rows is divisible by seq_length. If False, sequences are extracted in non-overlapping chunks of length seq_length without padding. Default is True.

adding_nans{True, False, None}, optional

Controls the padding values. Default is True (pad with NaNs).

  • None: pad with zeros.

  • True: pad with NaNs.

  • False: pad with per-column means (ignoring NaNs).

three_dimbool, optional

If True and reshapable is True, the padded matrix is reshaped to a 3D array of shape (num_sequences, seq_length, M). If False, the function returns the padded 2D matrix. Ignored when window=True or reshapable=False. Default is True.

windowbool, optional

If True, the function only appends a block of seq_length rows (using the chosen padding strategy) and returns the resulting 2D matrix without reshaping. Default is False.

verbosebool, optional

If True, prints information about padding and the resulting shape(s). Default is False.

deep_verbosebool, optional

If True and three_dim is True, prints the full reshaped 3D matrix for inspection. Default is False.

Returns

np.ndarray

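3D array of shape (N_padded // seq_length, seq_length, features) when reshapable and three_dim are both True; otherwise, as described for three_dim and window, the padded 2D matrix.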
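
A minimal sketch of the default path (reshapable=True, adding_nans=True, three_dim=True): pad with NaN rows until the row count is divisible by seq_length, then reshape to 3D. The function name is hypothetical, and appending the padding at the end of the matrix is an assumption.

```python
import numpy as np


def sketch_pad_and_reshape(matrix, seq_length=24):
    """Pad a (N, M) matrix with NaN rows and reshape it to (N_padded // seq_length, seq_length, M)."""
    matrix = np.asarray(matrix, dtype=float)
    n_rows, n_features = matrix.shape
    pad_rows = (-n_rows) % seq_length  # rows needed to reach the next multiple of seq_length
    if pad_rows:
        padding = np.full((pad_rows, n_features), np.nan)
        matrix = np.vstack([matrix, padding])  # assumption: padding is appended at the end
    return matrix.reshape(-1, seq_length, n_features)


cube = sketch_pad_and_reshape(np.arange(30.0).reshape(10, 3), seq_length=4)
print(cube.shape)  # (3, 4, 3)
```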

imputegap.tools.utils.dataset_reverse_dimensionality(matrix, expected_n: int, verbose: bool = True)[source]

Convert (1, N, T, L) -> (N*T, L) or (N, T, L) -> (N*T, L), then trim to expected_n rows.

Steps:
  1. If ndim==4, squeeze axis 0 (requires that axis to have size 1).

  2. Reshape first two dims together -> (N*T, L).

  3. Drop the last (N*T - expected_n) rows.

Parameters:
  • matrix – np.ndarray of shape (1, N, T, L) or (N, T, L)

  • expected_n – final number of rows after trimming (e.g., 1000)

  • verbose – print shapes and removed-row count

Returns:

np.ndarray of shape (expected_n, L)
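
The three steps above translate almost literally into NumPy. The sketch below uses a hypothetical function name and raises a ValueError when the leading axis of a 4D input is not of size 1, which is an assumption about the error handling.

```python
import numpy as np


def sketch_reverse_dimensionality(matrix, expected_n):
    """Flatten (1, N, T, L) or (N, T, L) to (N*T, L) and keep the first expected_n rows."""
    matrix = np.asarray(matrix)
    if matrix.ndim == 4:
        if matrix.shape[0] != 1:
            raise ValueError("the leading axis must have size 1")
        matrix = matrix[0]  # step 1: squeeze axis 0
    n_seq, seq_len, n_feat = matrix.shape
    flat = matrix.reshape(n_seq * seq_len, n_feat)  # step 2: merge the first two axes
    return flat[:expected_n]  # step 3: drop the trailing rows


cube = np.arange(24.0).reshape(3, 4, 2)
print(sketch_reverse_dimensionality(cube, 10).shape)  # (10, 2)
```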

imputegap.tools.utils.display_title(title='Master Thesis', aut='Quentin Nater', lib='ImputeGAP', university='University Fribourg')[source]

Display the title and author information.

Parameters

titlestr, optional

The title of the thesis (default is “Master Thesis”).

autstr, optional

The author’s name (default is “Quentin Nater”).

libstr, optional

The library or project name (default is “ImputeGAP”).

universitystr, optional

The university or institution (default is “University Fribourg”).

Returns

None

imputegap.tools.utils.dl_integration_transformation(input_matrix, tr_ratio=0.8, inside_tr_cont_ratio=0.2, split_ts=1, split_val=0, nan_val=-99999, prevent_leak=True, offset=0.05, block_selection=True, seed=42, verbose=False)[source]

Prepares contaminated data and corresponding masks for deep learning-based imputation training, validation, and testing.

This function simulates missingness in a controlled way, optionally prevents information leakage, and produces masks for training, testing, and validation using different contamination strategies.

Parameters:

input_matrixnp.ndarray

The complete input time series data matrix of shape [T, N] (time steps × variables).

tr_ratiofloat, default=0.8

The fraction of data to reserve for training when constructing the test contamination mask.

inside_tr_cont_ratiofloat, default=0.2

The proportion of values to randomly drop inside the training data for internal contamination.

split_tsfloat, default=1

Proportion of the total contaminated data assigned to the test set.

split_valfloat, default=0

Proportion of the total contaminated data assigned to the validation set.

nan_valfloat, default=-99999

Value used to represent missing entries in the masked matrix. nan_val=-1 can be used to set mean values

prevent_leakbool, default=True

Replace the value of NaN with a high number to prevent leakage.

offsetfloat, default=0.05

Minimum temporal offset at the beginning of the series.

block_selectionbool, default=True

Whether to simulate missing values in contiguous blocks (True) or randomly (False).

seedint, default=42

Seed for NumPy random number generation to ensure reproducibility.

verbosebool, default=False

Whether to print logging/debug information during execution.

Returns:

cont_data_matrixnp.ndarray

The input matrix with synthetic missing values introduced.

mask_trainnp.ndarray

Boolean mask of shape [T, N] indicating the training contamination locations (True = observed, False = missing).

mask_testnp.ndarray

Boolean mask of shape [T, N] indicating the test contamination locations.

mask_validnp.ndarray

Boolean mask of shape [T, N] indicating the validation contamination locations.

errorbool

Tag which is triggered if the operation is impossible.

imputegap.tools.utils.generate_random_mask(gt, mask_test, mask_valid, droprate=0.2, offset=None, series_like=True, verbose=False, seed=42)[source]

Generate a random training mask over the non-NaN entries of gt, excluding positions already present in the test and validation masks.

Parameters

gtnumpy.ndarray

Ground truth data (no NaNs).

mask_testnumpy.ndarray

Binary mask indicating test positions.

mask_validnumpy.ndarray

Binary mask indicating validation positions.

dropratefloat

Proportion of eligible entries to include in the training mask.

series_likebool

The mask must be set on free series

offsetfloat

Whether or not to protect the offset of the dataset.

verbosebool

Whether to print debug info.

seedint, optional

Random seed for reproducibility.

Returns

numpy.ndarray

Binary mask indicating training positions.

imputegap.tools.utils.get_missing_ratio(incomp_data)[source]

Check whether the proportion of missing values in the contaminated data is acceptable for training a deep learning model.

Parameters

incomp_dataTimeSeries (numpy array)

TimeSeries object containing dataset.

Returns

bool

True if the missing data ratio is less than or equal to 40%, False otherwise.
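
The 40% rule can be sketched as a single comparison on the fraction of NaN entries. The function name and the max_ratio parameter are hypothetical, and the sketch takes a plain NumPy array.

```python
import numpy as np


def sketch_missing_ratio_ok(incomp_data, max_ratio=0.4):
    """Return True if at most max_ratio of the entries are NaN."""
    values = np.asarray(incomp_data, dtype=float)
    ratio = float(np.isnan(values).mean())  # fraction of missing entries
    return ratio <= max_ratio


data = np.arange(10.0).reshape(5, 2)
data[:2, :] = np.nan  # 4 of 10 entries missing
print(sketch_missing_ratio_ok(data))  # True
```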

imputegap.tools.utils.get_resuts_unit_tests(algo_name, loader, verbose=True)[source]

Returns (dataset, rmse, mae) for the given algo name from loader.toml.

imputegap.tools.utils.handle_nan_input(raw_data, incomp_data)[source]

imputegap.tools.utils.list_of_algorithms()[source]

Return the list of available imputation algorithms.

Parameters

None

Returns

list of str

A sorted list of algorithm names supported by the framework.

imputegap.tools.utils.list_of_algorithms_deep_learning()[source]

Returns all imputation algorithms of the Deep Learning family.

imputegap.tools.utils.list_of_algorithms_llms()[source]

Returns all imputation algorithms of the LLMs family.

imputegap.tools.utils.list_of_algorithms_machine_learning()[source]

Returns all imputation algorithms of the Machine Learning family.

imputegap.tools.utils.list_of_algorithms_matrix_completion()[source]

Returns all imputation algorithms of the Matrix Completion family.

Returns all imputation algorithms of the Pattern Search family.

imputegap.tools.utils.list_of_algorithms_statistics()[source]

Returns all imputation algorithms of the Statistics family.

imputegap.tools.utils.list_of_algorithms_with_families(specify_family=None)[source]

Return the list of available imputation techniques (with families) from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of imputation techniques (with families) supported by ImputeGAP.

imputegap.tools.utils.list_of_datasets(txt=False)[source]

Return the list of available datasets from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of dataset names supported by ImputeGAP.

imputegap.tools.utils.list_of_downstreams()[source]

Return the list of available downstream models from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of downstream model names supported by ImputeGAP.

imputegap.tools.utils.list_of_downstreams_darts()[source]

Return the list of available downstream models (darts) from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of downstream model names (darts) supported by ImputeGAP.

imputegap.tools.utils.list_of_downstreams_sktime()[source]

Return the list of available downstream models (sktime) from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of downstream model names (sktime) supported by ImputeGAP.

imputegap.tools.utils.list_of_extractors()[source]

Return the list of available extractors from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of extractor names supported by ImputeGAP.

imputegap.tools.utils.list_of_families()[source]

Return the list of available families of imputation techniques from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of the names of the imputation technique families supported by ImputeGAP.

imputegap.tools.utils.list_of_metrics()[source]

Return the list of available metrics from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of imputation metrics supported by ImputeGAP.

imputegap.tools.utils.list_of_normalizers()[source]

Return the list of available normalizers from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of normalizers supported by ImputeGAP.

imputegap.tools.utils.list_of_optimizers()[source]

Return the list of available optimizers from ImputeGAP.

Parameters

None

Returns

list of str

A sorted list of optimizer names supported by ImputeGAP.

imputegap.tools.utils.list_of_patterns()[source]

Return the list of available imputation patterns.

Parameters

None

Returns

list of str

A sorted list of pattern names supported by the framework.

imputegap.tools.utils.load_parameters(query: str = 'default', algorithm: str = 'cdrec', dataset: str = 'chlorine', optimizer: str = 'b', path=None, verbose=False)[source]

Load default or optimal parameters for algorithms from a TOML file.

Parameters

querystr, optional

‘default’ or ‘optimal’ to load default or optimal parameters (default is “default”).

algorithmstr, optional

Algorithm to load parameters for (default is “cdrec”).

datasetstr, optional

Name of the dataset (default is “chlorine”).

optimizerstr, optional

Optimizer type for optimal parameters (default is “b”).

pathstr, optional

Custom file path for the TOML file (default is None).

verbosebool, optional

Whether to display the contamination information (default is False).

Returns

tuple

A tuple containing the loaded parameters for the given algorithm.

imputegap.tools.utils.load_share_lib(name='lib_cdrec', verbose=True)[source]

Load the shared library based on the operating system.

Parameters

namestr, optional

The name of the shared library (default is “lib_cdrec”).

libbool, optional

If True, the function loads the library from the default ‘imputegap’ path; if False, it loads from a local path (default is True).

verbosebool, optional

Whether to display the contamination information (default is True).

Returns

ctypes.CDLL

The loaded shared library object.

imputegap.tools.utils.prepare_deep_learning_params(incomp_data, seq_len, batch_size, sliding_windows, tr_ratio, verbose)[source]

imputegap.tools.utils.prepare_fixed_testing_set(incomp_m, tr_ratio=0.8, offset=0.05, block_selection=True, verbose=True)[source]

Introduces additional missing values (NaNs) into a data matrix to match a specified training ratio.

This function modifies a copy of the input matrix incomp_m by introducing NaNs such that the proportion of observed (non-NaN) values matches the desired tr_ratio. It returns the modified matrix and the corresponding missing data mask.

Parameters

incomp_mnp.ndarray

A 2D NumPy array with potential pre-existing NaNs representing missing values.

tr_ratiofloat

Desired ratio of observed (non-NaN) values in the output matrix. Must be in the range (0, 1).

offsetfloat

Protected zone at the beginning of the series.

block_selectionbool

Whether to select the missing values in blocks (True) or randomly (False).

verbosebool

Whether to print debug info.

Returns

data_matrix_contnp.ndarray

The modified matrix with additional NaNs introduced to match the specified training ratio.

new_masknp.ndarray

A boolean mask of the same shape as data_matrix_cont where True indicates missing (NaN) entries.

Raises

AssertionError:

If the final observed and missing ratios deviate from the target by more than 1%.

Notes

  • The function assumes that the input contains some non-NaN entries.

  • NaNs are added in row-major order from the list of available (non-NaN) positions.

imputegap.tools.utils.prepare_testing_set(incomp_m, original_missing_ratio, block_selection=True, tr_ratio=0.8, verbose=True)[source]

imputegap.tools.utils.prevent_leakage(matrix, mask, replacement=0, verbose=True)[source]

Replaces missing values in a matrix to prevent data leakage during evaluation.

This function replaces all entries in matrix that are marked as missing in mask with a specified replacement value (default is 0). It then checks to ensure that there are no remaining NaNs in the matrix and that at least one replacement occurred.

Parameters

matrixnp.ndarray

A NumPy array potentially containing missing values (NaNs).

masknp.ndarray

A boolean mask of the same shape as matrix, where True indicates positions to be replaced (typically where original values were NaN).

replacementfloat or int, optional

The value to use in place of missing entries. Defaults to 0.

verbosebool

Whether to print debug info.

Returns

matrixnp.ndarray

The matrix with missing entries replaced by the specified value.

Raises

AssertionError:

If any NaNs remain in the matrix after replacement, or if no replacements were made.

Notes

  • This function is typically used before evaluation to ensure the model does not access ground truth values where data was originally missing.
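
The replacement and its two sanity checks can be sketched as follows. The function name is hypothetical, the sketch works on a copy (whether the library modifies its input in place is not stated), and the "at least one replacement" check is implemented by testing that the mask selects at least one entry.

```python
import numpy as np


def sketch_prevent_leakage(matrix, mask, replacement=0):
    """Overwrite masked entries, then verify that no NaN remains."""
    matrix = np.array(matrix, dtype=float)  # work on a copy
    mask = np.asarray(mask, dtype=bool)
    matrix[mask] = replacement
    assert not np.isnan(matrix).any(), "NaNs remain after replacement"
    assert mask.any(), "no value was replaced"
    return matrix


contaminated = np.array([[1.0, np.nan], [np.nan, 4.0]])
print(sketch_prevent_leakage(contaminated, np.isnan(contaminated)))
```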

imputegap.tools.utils.reconstruction_window_based(preds, nbr_timestamps, sliding_windows=1, verbose=True, deep_verbose=False)[source]

Reconstruct the full time series after window-based imputation. This function restores the original univariate series or 2D matrix from the 3D windowed (multivariate-style) representation used during the deep learning process. See window_truncation() for the preprocessing transformation applied beforehand.

Parameters

predstorch.Tensor

Predicted windows of shape (N, L, F), where:

  • N is the number of windows,

  • L is the window length (sequence length),

  • F is the number of features per time step.

nbr_timestampsint

Target length T of the reconstructed time series along the time dimension (number of time steps).

sliding_windowsint, optional

Step size between the starting indices of consecutive windows in the original time series. The i-th window is placed starting at index i * sliding_windows. Default is 1.

verbosebool, optional

If True, prints a summary of the reconstruction process and basic completeness statistics. Default is True.

deep_verbosebool, optional

If True, prints detailed information about the index ranges used for each window and the internal count matrix. Useful for debugging. Default is False.

Returns

torch.Tensor

Reconstructed time series of shape (T, F), where overlapping windows have been averaged at each time step.
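
The overlap-averaging idea can be illustrated with a sum matrix and a count matrix. This sketch deliberately uses NumPy arrays instead of torch tensors to stay lightweight; the function name is hypothetical, and time steps covered by no window are left at 0, which is an assumption.

```python
import numpy as np


def sketch_reconstruct(preds, nbr_timestamps, sliding_windows=1):
    """Place window i at index i * sliding_windows and average the overlaps."""
    preds = np.asarray(preds, dtype=float)
    n_windows, win_len, n_feat = preds.shape
    total = np.zeros((nbr_timestamps, n_feat))
    counts = np.zeros((nbr_timestamps, 1))
    for i in range(n_windows):
        start = i * sliding_windows
        if start >= nbr_timestamps:
            break
        end = min(start + win_len, nbr_timestamps)  # clip windows that run past T
        total[start:end] += preds[i, : end - start]
        counts[start:end] += 1
    return total / np.maximum(counts, 1)  # avoid division by zero on uncovered steps


series = np.arange(12.0).reshape(6, 2)
windows = np.stack([series[i : i + 3] for i in range(4)])  # length 3, step 1
print(sketch_reconstruct(windows, nbr_timestamps=6).shape)  # (6, 2)
```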

imputegap.tools.utils.save_optimization(optimal_params, algorithm='cdrec', dataset='', optimizer='b', file_name=None, verbose=True)[source]

Save the optimization parameters to a TOML file for later use without recomputing.

Parameters

optimal_paramsdict

Dictionary of the optimal parameters.

algorithmstr, optional

The name of the imputation algorithm (default is ‘cdrec’).

datasetstr, optional

The name of the dataset (default is an empty string).

optimizerstr, optional

The name of the optimizer used (default is ‘b’).

file_namestr, optional

The name of the TOML file to save the results (default is None).

Returns

None

imputegap.tools.utils.search_path(set_name='test')[source]

Find the correct path for loading test files.

Parameters

set_namestr, optional

Name of the dataset (default is “test”).

Returns

str

The correct file path for the dataset.

imputegap.tools.utils.sets_splitter_based_on_training(tr, split=0.66667, verbose=False)[source]

Compute test and validation split ratios based on a given training ratio.

Ensures that the sum of training, validation, and test ratios equals 1.0 after rounding to one decimal place. Raises a ValueError if the resulting ratios do not sum to 1.0 within tolerance.

Parameters

trfloat

Training ratio (between 0 and 1).

splitfloat, optional

Percentage of test set. Default is 2/3.

verbosebool, optional

If True, prints the computed ratios for verification. Default is False.

Returns

  • test_ratio : Fraction of data allocated to the test set.

  • val_ratio : Fraction of data allocated to the validation set.

Raises

ValueError

If the computed ratios do not sum to 1.0 (after rounding).
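
One way to read the rule is sketched below: the non-training share is split so that `split` of it goes to the test set, both ratios are rounded to one decimal place, and the sum is checked. The formula is an assumption inferred from the description, and the function name is hypothetical.

```python
def sketch_sets_splitter(tr, split=0.66667):
    """Return (test_ratio, val_ratio) for a given training ratio."""
    remaining = 1.0 - tr
    test_ratio = round(remaining * split, 1)      # assumption: split applies to the remainder
    val_ratio = round(remaining - test_ratio, 1)
    if abs(tr + test_ratio + val_ratio - 1.0) > 1e-6:
        raise ValueError("training, test and validation ratios do not sum to 1.0")
    return test_ratio, val_ratio


print(sketch_sets_splitter(0.7))  # (0.2, 0.1)
```

Under this reading, a training ratio such as 0.75 cannot be completed with one-decimal ratios and triggers the ValueError.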

imputegap.tools.utils.split_mask_bwt_test_valid(data_matrix, test_rate=0.8, valid_rate=0.2, nan_val=None, verbose=False, seed=42)[source]

Dispatch NaN positions in data_matrix to test and validation masks only.

Parameters

data_matrixnumpy.ndarray

Input matrix containing NaNs to be split.

test_ratefloat

Proportion of NaNs to assign to the test set (default is 0.8).

valid_ratefloat

Proportion of NaNs to assign to the validation set (default is 0.2). test_rate + valid_rate must equal 1.0.

verbosebool

Whether to print debug info.

seedint, optional

Random seed for reproducibility.

Returns

tuple

test_masknumpy.ndarray

Binary mask indicating positions of NaNs in the test set.

valid_masknumpy.ndarray

Binary mask indicating positions of NaNs in the validation set.

n_nanint

Total number of NaN values found in the input matrix.
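
A minimal sketch of the dispatch: collect the NaN positions, shuffle them with a seeded generator, and assign the first test_rate share to the test mask and the rest to the validation mask. The function name, the use of numpy.random.default_rng, and the rounding of the test count are assumptions.

```python
import numpy as np


def sketch_split_nan_positions(data_matrix, test_rate=0.8, valid_rate=0.2, seed=42):
    """Return (test_mask, valid_mask, n_nan) covering every NaN exactly once."""
    if abs(test_rate + valid_rate - 1.0) > 1e-9:
        raise ValueError("test_rate + valid_rate must equal 1.0")
    data_matrix = np.asarray(data_matrix, dtype=float)
    nan_positions = np.argwhere(np.isnan(data_matrix))  # one (row, col) pair per NaN
    n_nan = len(nan_positions)

    rng = np.random.default_rng(seed)
    rng.shuffle(nan_positions)  # shuffles the pairs along axis 0, in place
    n_test = int(round(n_nan * test_rate))

    test_mask = np.zeros(data_matrix.shape, dtype=bool)
    valid_mask = np.zeros(data_matrix.shape, dtype=bool)
    test_idx, valid_idx = nan_positions[:n_test], nan_positions[n_test:]
    test_mask[test_idx[:, 0], test_idx[:, 1]] = True
    valid_mask[valid_idx[:, 0], valid_idx[:, 1]] = True
    return test_mask, valid_mask, n_nan


data = np.arange(20.0).reshape(5, 4)
data[:, :2] = np.nan  # 10 missing entries
test_mask, valid_mask, n_nan = sketch_split_nan_positions(data)
print(n_nan, int(test_mask.sum()), int(valid_mask.sum()))  # 10 8 2
```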

imputegap.tools.utils.verification_limitation(percentage, low_limit=0.001, high_limit=1.0)[source]

Format and verify that the percentage given by the user is within acceptable bounds.

Parameters

percentagefloat

The percentage value to be checked and potentially adjusted.

low_limitfloat, optional

The lower limit of the acceptable percentage range (default is 0.001).

high_limitfloat, optional

The upper limit of the acceptable percentage range (default is 1.0).

Returns

float

Adjusted percentage based on the limits.

Raises

ValueError

If the percentage is outside the accepted limits.

Notes

  • If the percentage is between 1 and 100, it will be divided by 100 to convert it to a decimal format.

  • If the percentage is outside the low and high limits, the function will print a warning and return the original value.
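
The conversion and range check can be sketched as below. The sketch follows the Raises section and raises a ValueError for out-of-range values; it does not model the warning-and-return behaviour mentioned in the Notes. Treating only values strictly greater than 1 as percentages is an assumption, and the function name is hypothetical.

```python
def sketch_verification_limitation(percentage, low_limit=0.001, high_limit=1.0):
    """Convert values in (1, 100] to a fraction, then check the limits."""
    if 1 < percentage <= 100:  # assumption: exactly 1 is read as the fraction 1.0
        percentage = percentage / 100
    if not (low_limit <= percentage <= high_limit):
        raise ValueError(f"percentage {percentage} is outside [{low_limit}, {high_limit}]")
    return percentage


print(sketch_verification_limitation(40))   # 0.4
print(sketch_verification_limitation(0.4))  # 0.4
```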

imputegap.tools.utils.window_truncation(feature_vectors, seq_len, stride=None, info='', verbose=True, deep_verbose=False)[source]

Segment a sequence of feature vectors into fixed-length windows. In ImputeGAP, this is used in deep learning to reshape a 2D univariate dataset into a 3D windowed representation, enabling multivariate-like processing. See reconstruction_window_based() to restore the imputed matrix to its original shape.

The code was inspired by: https://dl.acm.org/doi/10.1016/j.eswa.2023.119619

Parameters

feature_vectorsnp.ndarray

Input array of feature vectors. Windowing is applied along the first axis (typically the time or sequence dimension).

seq_lenint

Length of each window (number of time steps per segment).

strideint, optional

Step size between the starting indices of consecutive windows. Defaults to seq_len (non-overlapping windows).

infostr, optional

Additional descriptive string to include in the verbose log output. Defaults to an empty string.

verbosebool, optional

If True, prints a summary of the computed windows (shape and configuration). Defaults to True.

deep_verbosebool, optional

If True, prints the raw start indices used to generate the windows. Useful for debugging. Defaults to False.

Returns

np.ndarray

Array of shape (num_windows, seq_len, features) containing the extracted windows, cast to float32.
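
A minimal sketch of the windowing step. It assumes that trailing rows which do not fill a complete window are dropped, that the input holds at least seq_len rows, and it uses a hypothetical function name; the logging options are omitted.

```python
import numpy as np


def sketch_window_truncation(feature_vectors, seq_len, stride=None):
    """Cut windows of seq_len rows along axis 0, one every stride rows."""
    feature_vectors = np.asarray(feature_vectors)
    stride = seq_len if stride is None else stride  # default: non-overlapping windows
    starts = range(0, len(feature_vectors) - seq_len + 1, stride)
    windows = [feature_vectors[s : s + seq_len] for s in starts]
    return np.stack(windows).astype(np.float32)


data = np.arange(20).reshape(10, 2)
print(sketch_window_truncation(data, seq_len=4).shape)            # (2, 4, 2)
print(sketch_window_truncation(data, seq_len=4, stride=2).shape)  # (4, 4, 2)
```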