regression.learn

cleanlab can be used for learning with noisy data for any dataset and regression model.

For regression tasks, the regression.learn.CleanLearning class wraps any instance of an sklearn model to allow you to train more robust regression models, or use the model to identify corrupted values in the dataset. The wrapped model must adhere to the sklearn estimator API, meaning it must define three functions:

  • model.fit(X, y, sample_weight=None)

  • model.predict(X)

  • model.score(X, y, sample_weight=None)

where X contains the data (i.e. features, covariates, independent variables) and y contains the target value (i.e. label, response/dependent variable). The first index of X and of y should correspond to the different examples in the dataset, such that len(X) = len(y) = N (sample-size).

Your model should be correctly clonable via sklearn.base.clone, because cleanlab internally creates multiple instances of the model. If you manually wrap a model (e.g. a PyTorch model), ensure that every call to the estimator’s __init__() creates an independent instance of the model (for sklearn compatibility, the weights of neural network models should typically be initialized inside model.fit()).
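As a quick illustration of what "clonable" means here, the sketch below clones a fitted scikit-learn estimator and checks that the copy shares the constructor parameters but none of the learned state. Ridge and the toy data are arbitrary choices for the illustration.

```python
from sklearn.base import clone
from sklearn.linear_model import Ridge

original = Ridge(alpha=2.0)
original.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

# clone() builds a new, unfitted estimator from the constructor parameters only.
copy = clone(original)

same_params = copy.get_params() == original.get_params()
is_independent = copy is not original
copy_is_unfitted = not hasattr(copy, "coef_")
```

If a custom estimator keeps learned state in attributes set inside __init__(), clones may share or lose that state, which is why the weights should be created in fit().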

Example

>>> from cleanlab.regression.learn import CleanLearning
>>> from sklearn.linear_model import LinearRegression
>>> cl = CleanLearning(model=LinearRegression()) # Pass in any model.
>>> cl.fit(X, y_with_noise)
>>> # Estimate the predictions as if you had trained without label issues.
>>> predictions = cl.predict(X_test)

If your model is not sklearn-compatible by default, an existing adapter package may be able to wrap it. For example, you can adapt PyTorch models using skorch and adapt Keras models using SciKeras.

If an adapter doesn’t already exist, you can manually wrap your model to be sklearn-compatible. This is made easy by inheriting from sklearn.base.BaseEstimator:

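from sklearn.base import BaseEstimator

class YourModel(BaseEstimator):
    def __init__(self):
        pass  # Store hyperparameters only; build the underlying model in fit().
    def fit(self, X, y, sample_weight=None):
        pass  # Train on (X, y), then return self.
    def predict(self, X):
        pass  # Return one predicted value per row of X.
    def score(self, X, y, sample_weight=None):
        pass  # Return a float, e.g. the R^2 of the predictions.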
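To make the skeleton above concrete, here is a minimal sketch of a filled-in wrapper. The class name LeastSquaresModel and its fit_intercept parameter are hypothetical, chosen only for this illustration; the model is plain ordinary least squares computed with numpy.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class LeastSquaresModel(BaseEstimator):
    """Hypothetical example: ordinary least squares via numpy."""

    def __init__(self, fit_intercept=True):
        # Store hyperparameters only; learned state is created in fit().
        self.fit_intercept = fit_intercept

    def _design(self, X):
        X = np.asarray(X, dtype=float)
        if self.fit_intercept:
            X = np.column_stack([np.ones(len(X)), X])
        return X

    def fit(self, X, y, sample_weight=None):
        A = self._design(X)
        y = np.asarray(y, dtype=float)
        if sample_weight is not None:
            # Weighted least squares: scale each row by sqrt(weight).
            w = np.sqrt(np.asarray(sample_weight, dtype=float))
            A, y = A * w[:, None], y * w
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return self._design(X) @ self.coef_

    def score(self, X, y, sample_weight=None):
        # R^2, the conventional default score for regressors.
        y = np.asarray(y, dtype=float)
        w = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight, dtype=float)
        residual_ss = np.sum(w * (y - self.predict(X)) ** 2)
        total_ss = np.sum(w * (y - np.average(y, weights=w)) ** 2)
        return 1.0 - residual_ss / total_ss


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X[:, 0] + 1.0
model = LeastSquaresModel().fit(X, y)
r2 = model.score(X, y)
fresh = clone(model)  # works because __init__ only stores its arguments
```

Because __init__() does nothing except store its arguments, BaseEstimator can supply get_params(), set_params(), and cloning without any extra code.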

Classes:

CleanLearning([model, cv_n_folds, n_boot, ...])

CleanLearning = Machine Learning with cleaned data (even when training on messy, error-ridden data).

class cleanlab.regression.learn.CleanLearning(model=None, *, cv_n_folds=5, n_boot=5, include_aleatoric_uncertainty=True, verbose=False, seed=None)

Bases: BaseEstimator

CleanLearning = Machine Learning with cleaned data (even when training on messy, error-ridden data).

Automated and robust learning with noisy labels using any dataset and any regression model. For regression tasks, this class trains a model with error-prone, noisy labels as if the model had been instead trained on a dataset with perfect labels. It achieves this by estimating which labels are noisy (you might solely use CleanLearning for this estimation) and then removing examples estimated to have noisy labels, such that a more robust copy of the same model can be trained on the remaining clean data.

Parameters:
  • model (Optional[BaseEstimator]) –

    Any regression model implementing the sklearn estimator API, defining the following functions:

    • model.fit(X, y)

    • model.predict(X)

    • model.score(X, y)

    Default model used is sklearn.linear_model.LinearRegression.

  • cv_n_folds (int) – This class needs holdout predictions for every data example and, if they are not provided, uses cross-validation to compute them. This argument sets the number of cross-validation folds used to compute out-of-sample predictions for each example in X. Default is 5. Larger values may produce better results but take longer to run.

  • n_boot (int) – Number of bootstrap resampling rounds used to estimate the model’s epistemic uncertainty. Default is 5. Larger values are expected to produce better results but require longer runtimes. Set as 0 to skip estimating the epistemic uncertainty and get results faster.

  • include_aleatoric_uncertainty (bool) – Specifies if the aleatoric uncertainty should be estimated during label error detection. True by default, which is expected to produce better results but require longer runtimes.

  • verbose (bool) – Controls how much output is printed. Set to False to suppress print statements. Default False.

  • seed (Optional[bool]) – Seed for the random number generator used to split the data. By default, uses the current np.random state.
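The out-of-sample predictions that cv_n_folds controls can be pictured with scikit-learn's cross_val_predict. This is only a sketch of the general idea on synthetic data; how cleanlab splits the data internally is not shown here and may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv_n_folds = 5
folds = KFold(n_splits=cv_n_folds, shuffle=True, random_state=0)

# Each example is predicted by a model that never saw it during training.
oos_predictions = cross_val_predict(LinearRegression(), X, y, cv=folds)
```

More folds mean each held-out prediction comes from a model trained on more of the data, at the cost of fitting the model more times.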

Methods:

fit(X, y, *[, label_issues, sample_weight, ...])

Train regression model with error-prone, noisy labels as if the model had been instead trained on a dataset with the correct labels.

predict(X, *args, **kwargs)

Predict target values using your wrapped model.

score(X, y[, sample_weight])

Evaluates your wrapped regression model's score on a test set X with target values y.

find_label_issues(X, y, *[, uncertainty, ...])

Identifies potential label issues (corrupted y-values) in the dataset, and estimates how noisy each label is.

get_label_issues()

Accessor, returns label_issues_df attribute if previously computed.

get_epistemic_uncertainty(X, y[, predictions])

Compute the epistemic uncertainty of the regression model for each example.

get_aleatoric_uncertainty(X, residual)

Compute the aleatoric uncertainty of the data.

save_space()

Clears non-sklearn attributes of this estimator to save space (in-place).

__init_subclass__(**kwargs)

Set the set_{method}_request methods.

get_metadata_routing()

Get metadata routing of this object.

get_params([deep])

Get parameters for this estimator.

set_fit_request(*[, ...])

Request metadata passed to the fit method.

set_params(**params)

Set the parameters of this estimator.

set_score_request(*[, sample_weight])

Request metadata passed to the score method.

fit(X, y, *, label_issues=None, sample_weight=None, find_label_issues_kwargs=None, model_kwargs=None, model_final_kwargs=None)

Train regression model with error-prone, noisy labels as if the model had been instead trained on a dataset with the correct labels. fit achieves this by first training model via cross-validation on the noisy data, using the resulting out-of-sample predictions to identify label issues, pruning the examples with label issues, and finally training model on the remaining clean data.

Parameters:
  • X (Union[ndarray, DataFrame]) – Data features (i.e. covariates, independent variables), typically an array of shape (N, ...), where N is the number of examples (sample-size). Your model must be able to fit() and predict() data of this format.

  • y (Union[list, ndarray, Series, DataFrame]) – An array of shape (N,) of noisy labels (i.e. target/response/dependent variable), where some values may be erroneous.

  • label_issues (Union[DataFrame, ndarray, None]) –

    Optional already-identified label issues in the dataset (if previously estimated). Specify this to avoid re-estimating the label issues if already done. If pd.DataFrame, must be formatted as the one returned by: self.find_label_issues or self.get_label_issues. The DataFrame must have a column named is_label_issue.

    If np.ndarray, the input must be a boolean mask of length N where examples that have label issues have the value True, and the rest of the examples have the value False.

  • sample_weight (Optional[ndarray]) – Optional array of weights with shape (N,) that are assigned to individual samples. Specifies how to weight the examples in the loss function while training.

  • find_label_issues_kwargs (Optional[dict]) – Optional keyword arguments to pass into self.find_label_issues.

  • model_kwargs (Optional[dict]) – Optional keyword arguments to pass into model’s fit() method.

  • model_final_kwargs (Optional[dict]) – Optional extra keyword arguments to pass into the final model’s fit() on the cleaned data, but not the fit() in each fold of cross-validation on the noisy data. The final fit() will also receive the arguments in model_kwargs, but these may be overwritten by values in model_final_kwargs. This can be useful for training differently in the final fit() than during cross-validation.

Return type:

BaseEstimator

Returns:

self (CleanLearning) – Fitted estimator that has all the same methods as any sklearn estimator.

After calling self.fit(), this estimator also stores extra attributes such as:

  • self.label_issues_df: a pd.DataFrame containing label quality scores, boolean flags indicating which examples have label issues, and predicted label values for each example. Accessible via self.get_label_issues, of similar format as the one returned by self.find_label_issues. See documentation of self.find_label_issues for column descriptions.

  • self.label_issues_mask: a np.ndarray boolean mask indicating if a particular example has been identified to have issues.
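The train, detect, prune, retrain sequence that fit() describes can be sketched with plain scikit-learn. The scoring rule below (flag the largest absolute out-of-sample residuals, with an assumed 5% issue fraction) is a deliberately simple stand-in and is not cleanlab's label quality scoring; the synthetic data and variable names are likewise assumptions of this sketch.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y_noisy = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.05, size=N)
corrupted = rng.choice(N, size=10, replace=False)
y_noisy[corrupted] += 20.0  # inject gross errors into 5% of the labels

model = LinearRegression()

# 1. Out-of-sample predictions from cross-validation on the noisy data.
oos_predictions = cross_val_predict(clone(model), X, y_noisy, cv=5)

# 2. Stand-in scoring rule (an assumption): flag the largest absolute residuals.
abs_residual = np.abs(oos_predictions - y_noisy)
assumed_issue_fraction = 0.05
threshold = np.quantile(abs_residual, 1 - assumed_issue_fraction)
label_issues_mask = abs_residual > threshold

# 3. Prune the flagged examples and retrain a fresh copy on the rest.
final_model = clone(model).fit(X[~label_issues_mask], y_noisy[~label_issues_mask])
```

The boolean mask built in step 2 has the same form as the np.ndarray accepted by the label_issues argument: length N, True where an example is flagged.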

predict(X, *args, **kwargs)

Predict target values using your wrapped model. Works just like model.predict().

Parameters:

X (np.ndarray or DatasetLike) – Test data in the same format expected by your wrapped regression model.

Return type:

ndarray

Returns:

predictions (np.ndarray) – Predictions for the test examples.

score(X, y, sample_weight=None)

Evaluates your wrapped regression model’s score on a test set X with target values y. Uses your model’s default scoring function, or the r-squared score if your model has no "score" attribute.

Parameters:
  • X (Union[ndarray, DataFrame]) – Test data in the same format expected by your wrapped model.

  • y (Union[list, ndarray, Series, DataFrame]) – Test labels in the same format as labels previously used in fit().

  • sample_weight (Optional[ndarray]) – Optional array of shape (N,) or (N, 1) used to weight each test example when computing the score.

Return type:

float

Returns:

score (float) – Number quantifying the performance of this regression model on the test data.
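For scikit-learn regressors, the default scoring function is itself r-squared, so both branches described above usually yield the same quantity. The sketch below checks this on synthetic data with per-example weights; it exercises a plain LinearRegression rather than CleanLearning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
true_coef = np.array([1.0, -1.0])
X_train = rng.normal(size=(80, 2))
y_train = X_train @ true_coef + rng.normal(scale=0.2, size=80)
X_test = rng.normal(size=(20, 2))
y_test = X_test @ true_coef + rng.normal(scale=0.2, size=20)
weights = rng.uniform(0.5, 1.5, size=20)  # one weight per test example

model = LinearRegression().fit(X_train, y_train)

# The estimator's own score() and an explicit weighted R^2 agree.
default_score = model.score(X_test, y_test, sample_weight=weights)
manual_r2 = r2_score(y_test, model.predict(X_test), sample_weight=weights)
```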

find_label_issues(X, y, *, uncertainty=None, coarse_search_range=[0.01, 0.05, 0.1, 0.15, 0.2], fine_search_size=3, save_space=False, model_kwargs=None)

Identifies potential label issues (corrupted y-values) in the dataset, and estimates how noisy each label is.

Note: this method estimates the label issues from scratch. To access previously-estimated label issues from this CleanLearning instance, use the self.get_label_issues method.

This is the method called to find label issues inside CleanLearning.fit() and they share mostly the same parameters.

Parameters:
  • X (Union[ndarray, DataFrame]) – Data features (i.e. covariates, independent variables), typically an array of shape (N, ...), where N is the number of examples (sample-size). Your model must be able to fit() and predict() data of this format.

  • y (Union[list, ndarray, Series, DataFrame]) – An array of shape (N,) of noisy labels (i.e. target/response/dependent variable), where some values may be erroneous.

  • uncertainty (Union[ndarray, float, None]) – Optional estimated uncertainty for each example. Should be passed in as a float (constant uncertainty throughout all examples), or a numpy array of length N (estimated uncertainty for each example). If not provided, this method will estimate the uncertainty as the sum of the epistemic and aleatoric uncertainty.

  • save_space (bool) – If True, then returned label_issues_df will not be stored as attribute. This means some other methods like self.get_label_issues will no longer work.

  • coarse_search_range (list) – The coarse search range to find the value of k, which estimates the fraction of data which have label issues. More values represent a more thorough search (better expected results but longer runtimes).

  • fine_search_size (int) – Size of fine-grained search grid to find the value of k, which represents our estimate of the fraction of data which have label issues. A higher number represents a more thorough search (better expected results but longer runtimes).

For info about the other parameters, see the docstring of CleanLearning.fit().

Return type:

DataFrame

Returns:

label_issues_df (pd.DataFrame) – DataFrame with info about label issues for each example. Unless save_space=True, the same DataFrame is also stored as the self.label_issues_df attribute, accessible via get_label_issues.

Each row represents an example from our dataset and the DataFrame may contain the following columns:

  • is_label_issue: boolean mask for the entire dataset where True represents a label issue and False represents an example that is accurately labeled with high confidence.

  • label_quality: Numeric score that measures the quality of each label (how likely it is to be correct, with lower scores indicating potentially erroneous labels).

  • given_label: Values originally given for this example (same as y input).

  • predicted_label: Values predicted by the trained model.
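A DataFrame with these columns can be inspected with ordinary pandas operations. The values below are made up and only mimic the documented column layout; they are not output from cleanlab.

```python
import pandas as pd

# Made-up values that only mimic the documented column layout.
label_issues_df = pd.DataFrame(
    {
        "is_label_issue": [False, True, False, True],
        "label_quality": [0.93, 0.08, 0.71, 0.15],
        "given_label": [1.2, 9.7, 0.4, -6.0],
        "predicted_label": [1.1, 2.3, 0.6, 0.9],
    }
)

# Boolean mask of length N, the np.ndarray form accepted by fit(label_issues=...).
mask = label_issues_df["is_label_issue"].to_numpy()

# Review flagged examples starting with the lowest label quality.
flagged = label_issues_df[mask].sort_values("label_quality")
worst_first = flagged.index.tolist()
```

Sorting by label_quality in ascending order puts the most suspicious labels first, which is a convenient order for manual review.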

get_label_issues()

Accessor, returns label_issues_df attribute if previously computed. This pd.DataFrame describes the issues identified for each example (each row corresponds to an example). For column definitions, see the documentation of CleanLearning.find_label_issues.

Return type:

Optional[DataFrame]

Returns:

label_issues_df (pd.DataFrame) – DataFrame with (precomputed) info about the label issues for each example.

get_epistemic_uncertainty(X, y, predictions=None)

Compute the epistemic uncertainty of the regression model for each example. This uncertainty is estimated using the bootstrapped variance of the model predictions.

Parameters:
  • X (ndarray) – Data features (i.e. training inputs for ML), typically an array of shape (N, ...), where N is the number of examples.

  • y (ndarray) – An array of shape (N,) of target values (dependent variables), where some values may be erroneous.

  • predictions (Optional[ndarray]) – Model predicted values of y, will be used as an extra bootstrap iteration to calculate the variance.

Return type:

ndarray

Returns:

epistemic_uncertainty (np.ndarray) – The estimated epistemic uncertainty for each example.
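The phrase "bootstrapped variance of the model predictions" can be illustrated as follows: refit a copy of the model on several resampled versions of the data and measure how much its predictions for each example vary. This is a sketch of the general technique on synthetic data, not cleanlab's implementation, and details such as how the optional predictions argument is folded in are omitted.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1, 1, size=(N, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=N)

n_boot = 5
model = LinearRegression()
boot_predictions = np.empty((n_boot, N))
for b in range(n_boot):
    idx = rng.integers(0, N, size=N)  # sample N rows with replacement
    boot_model = clone(model).fit(X[idx], y[idx])
    boot_predictions[b] = boot_model.predict(X)

# Per-example variance across the bootstrap rounds.
epistemic = boot_predictions.var(axis=0)
```

Examples where the resampled models disagree strongly get a larger value, reflecting uncertainty about the model itself rather than noise in the data.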

get_aleatoric_uncertainty(X, residual)

Compute the aleatoric uncertainty of the data. This uncertainty is estimated by predicting the standard deviation of the regression error.

Parameters:
  • X (ndarray) – Data features (i.e. training inputs for ML), typically an array of shape (N, ...), where N is the number of examples.

  • residual (ndarray) – The difference between the given value and the model predicted value of each example, i.e. predictions - y.

Return type:

float

Returns:

aleatoric_uncertainty (float) – The overall estimated aleatoric uncertainty for this dataset.

save_space()

Clears non-sklearn attributes of this estimator to save space (in-place). This includes the DataFrame attribute that stored label issues, which may be large for big datasets. You may want to call this method before deploying this model (i.e. if you just care about producing predictions). After calling this method, certain non-prediction-related attributes/functionality will no longer be available.

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.

References

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing (MetadataRequest) – A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

set_fit_request(*, find_label_issues_kwargs: bool | None | str = '$UNCHANGED$', label_issues: bool | None | str = '$UNCHANGED$', model_final_kwargs: bool | None | str = '$UNCHANGED$', model_kwargs: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') -> CleanLearning

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • find_label_issues_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for find_label_issues_kwargs parameter in fit.

  • label_issues (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for label_issues parameter in fit.

  • model_final_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_final_kwargs parameter in fit.

  • model_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_kwargs parameter in fit.

  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self (object) – The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.
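The <component>__<parameter> form is easiest to see on a scikit-learn Pipeline. In the sketch below the step names "scale" and "ridge" are arbitrary. Because CleanLearning takes its wrapped estimator as the model parameter, the analogous key there would presumably be model__<parameter>, though that is an inference from the BaseEstimator convention rather than something stated on this page.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))])

# <component>__<parameter>: update parameters on the named steps.
pipeline.set_params(ridge__alpha=0.1, scale__with_mean=False)

# The same nested keys appear in get_params(deep=True).
alpha = pipeline.get_params(deep=True)["ridge__alpha"]
step_alpha = pipeline.named_steps["ridge"].alpha
```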

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> CleanLearning

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self (object) – The updated object.