regression.learn
cleanlab can be used for learning with noisy data for any dataset and regression model.
For regression tasks, the regression.learn.CleanLearning
class wraps any instance of an sklearn model to allow you to train more robust regression models,
or use the model to identify corrupted values in the dataset.
The wrapped model must adhere to the sklearn estimator API,
meaning it must define three functions:
model.fit(X, y, sample_weight=None)
model.predict(X)
model.score(X, y, sample_weight=None)
where X contains the data (i.e. features, covariates, independent variables) and y contains the target value (i.e. label, response/dependent variable). The first index of X and of y should correspond to the different examples in the dataset, such that len(X) = len(y) = N (sample-size).
Your model should be correctly clonable via sklearn.base.clone: cleanlab internally creates multiple instances of the model. If you manually wrap a model (e.g. a PyTorch model), ensure that every call to the estimator’s __init__() creates an independent instance of the model. For sklearn compatibility, the weights of neural network models should typically be initialized inside model.fit().
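One way to sanity-check the clonability requirement above is to clone the model yourself before passing it to CleanLearning. This is a minimal sketch; Ridge is only an illustrative choice of regressor and not something this page prescribes.

```python
from sklearn.base import clone
from sklearn.linear_model import Ridge

model = Ridge(alpha=0.5)    # illustrative regressor (an assumption, any sklearn model works)
model_copy = clone(model)   # unfitted copy built from the same hyperparameters

# A correct clone is a distinct object with identical hyperparameters.
is_independent = model_copy is not model
same_params = model_copy.get_params() == model.get_params()
print(is_independent, same_params)
```

If either check fails for a custom wrapper, the usual cause is state being created or mutated inside __init__() rather than inside fit().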
Example
>>> from cleanlab.regression.learn import CleanLearning
>>> from sklearn.linear_model import LinearRegression
>>> cl = CleanLearning(model=LinearRegression()) # Pass in any model.
>>> cl.fit(X, y_with_noise)
>>> # Estimate the predictions as if you had trained without label issues.
>>> predictions = cl.predict(X)
If your model is not sklearn-compatible by default, a standard adapter package may be able to adapt it. For example, you can adapt PyTorch models using skorch and adapt Keras models using SciKeras.
If an adapter doesn’t already exist, you can manually wrap your model to be sklearn-compatible. This is made easy by inheriting from sklearn.base.BaseEstimator:
from sklearn.base import BaseEstimator

class YourModel(BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        pass

    def predict(self, X):
        pass

    def score(self, X, y):
        pass
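To make the skeleton above concrete, here is a hedged sketch of a filled-in wrapper. LeastSquaresModel and its fit_intercept parameter are hypothetical names invented for this illustration; the model is plain ordinary least squares computed with numpy, chosen only because it is short.

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class LeastSquaresModel(BaseEstimator):
    """Hypothetical example model: ordinary least squares via numpy."""

    def __init__(self, fit_intercept=True):
        # Store hyperparameters only, so that sklearn.base.clone works.
        self.fit_intercept = fit_intercept

    def _design(self, X):
        X = np.asarray(X, dtype=float)
        if self.fit_intercept:
            X = np.column_stack([np.ones(len(X)), X])
        return X

    def fit(self, X, y, sample_weight=None):
        A = self._design(X)
        y = np.asarray(y, dtype=float)
        if sample_weight is not None:
            # Weighted least squares: scale each row by sqrt(weight).
            w = np.sqrt(np.asarray(sample_weight, dtype=float))
            A, y = A * w[:, None], y * w
        # Learned state is created in fit(), not in __init__().
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return self._design(X) @ self.coef_

    def score(self, X, y, sample_weight=None):
        # R^2, following the convention of sklearn regressors.
        y = np.asarray(y, dtype=float)
        w = np.ones(len(y)) if sample_weight is None else np.asarray(sample_weight, dtype=float)
        residual_ss = np.sum(w * (y - self.predict(X)) ** 2)
        total_ss = np.sum(w * (y - np.average(y, weights=w)) ** 2)
        return 1.0 - residual_ss / total_ss


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -3.0])   # noiseless linear target
model = LeastSquaresModel().fit(X, y)
fresh = clone(model)                   # unfitted copy with the same hyperparameters
```

Because __init__() only records hyperparameters, the clone carries no fitted coefficients, which is the behavior the clonability requirement asks for.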
Classes:
CleanLearning = Machine Learning with cleaned data (even when training on messy, error-ridden data).
- class cleanlab.regression.learn.CleanLearning(model=None, *, cv_n_folds=5, n_boot=5, include_aleatoric_uncertainty=True, verbose=False, seed=None)
Bases: BaseEstimator
CleanLearning = Machine Learning with cleaned data (even when training on messy, error-ridden data).
Automated and robust learning with noisy labels using any dataset and any regression model. For regression tasks, this class trains a model with error-prone, noisy labels as if the model had instead been trained on a dataset with perfect labels. It achieves this by estimating which labels are noisy (you might use CleanLearning solely for this estimation) and then removing examples estimated to have noisy labels, such that a more robust copy of the same model can be trained on the remaining clean data.
- Parameters:
model (Optional[BaseEstimator]) – Any regression model implementing the sklearn estimator API, defining the following functions:
model.fit(X, y)
model.predict(X)
model.score(X, y)
Default model used is sklearn.linear_model.LinearRegression.
cv_n_folds (int) – This class needs holdout predictions for every data example and, if they are not provided, uses cross-validation to compute them. This argument sets the number of cross-validation folds used to compute out-of-sample predictions for each example in X. Default is 5. Larger values may produce better results but require longer runtimes.
n_boot (int) – Number of bootstrap resampling rounds used to estimate the model’s epistemic uncertainty. Default is 5. Larger values are expected to produce better results but require longer runtimes. Set to 0 to skip estimating the epistemic uncertainty and get results faster.
include_aleatoric_uncertainty (bool) – Specifies whether the aleatoric uncertainty should be estimated during label error detection. True by default, which is expected to produce better results but require longer runtimes.
verbose (bool) – Controls how much output is printed. Set to False to suppress print statements. Default False.
seed (Optional[bool]) – Set the default state of the random number generator used to split the data. By default, uses the current random state of np.random.
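The cv_n_folds parameter above refers to out-of-sample (holdout) predictions. The sketch below illustrates that idea with sklearn's cross_val_predict on synthetic data; it is not cleanlab's internal code, and the data and splitter settings are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv_n_folds = 5
splitter = KFold(n_splits=cv_n_folds, shuffle=True, random_state=0)
# Each prediction comes from a model that never saw that example during training.
oos_predictions = cross_val_predict(LinearRegression(), X, y, cv=splitter)
residuals = oos_predictions - y
```

More folds mean each model trains on a larger share of the data, at the cost of fitting more models.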
Methods:
fit(X, y, *[, label_issues, sample_weight, ...]) – Train regression model with error-prone, noisy labels as if the model had been instead trained on a dataset with the correct labels.
predict(X, *args, **kwargs) – Predict target values using your wrapped model.
score(X, y[, sample_weight]) – Evaluates your wrapped regression model's score on a test set X with target values y.
find_label_issues(X, y, *[, uncertainty, ...]) – Identifies potential label issues (corrupted y-values) in the dataset, and estimates how noisy each label is.
get_label_issues() – Accessor, returns label_issues_df attribute if previously computed.
get_epistemic_uncertainty(X, y[, predictions]) – Compute the epistemic uncertainty of the regression model for each example.
get_aleatoric_uncertainty(X, residual) – Compute the aleatoric uncertainty of the data.
save_space() – Clears non-sklearn attributes of this estimator to save space (in-place).
__init_subclass__(**kwargs) – Set the set_{method}_request methods.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
set_fit_request(*[, ...]) – Request metadata passed to the fit method.
set_params(**params) – Set the parameters of this estimator.
set_score_request(*[, sample_weight]) – Request metadata passed to the score method.
- fit(X, y, *, label_issues=None, sample_weight=None, find_label_issues_kwargs=None, model_kwargs=None, model_final_kwargs=None)
Train regression model with error-prone, noisy labels as if the model had been instead trained on a dataset with the correct labels. fit achieves this by first training model via cross-validation on the noisy data, using the resulting out-of-sample predictions to identify label issues, pruning the data with label issues, and finally training model on the remaining clean data.
- Parameters:
X (Union[ndarray, DataFrame]) – Data features (i.e. covariates, independent variables), typically an array of shape (N, ...), where N is the number of examples (sample-size). Your model must be able to fit() and predict() data of this format.
y (Union[list, ndarray, Series, DataFrame]) – An array of shape (N,) of noisy labels (i.e. target/response/dependent variable), where some values may be erroneous.
label_issues (Union[DataFrame, ndarray, None]) – Optional already-identified label issues in the dataset (if previously estimated). Specify this to avoid re-estimating the label issues if already done. If pd.DataFrame, must be formatted as the one returned by self.find_label_issues or self.get_label_issues. The DataFrame must have a column named is_label_issue. If np.ndarray, the input must be a boolean mask of length N where examples that have label issues have the value True, and the rest of the examples have the value False.
sample_weight (Optional[ndarray]) – Optional array of weights with shape (N,) that are assigned to individual samples. Specifies how to weight the examples in the loss function while training.
find_label_issues_kwargs (Optional[dict]) – Optional keyword arguments to pass into self.find_label_issues.
model_kwargs (Optional[dict]) – Optional keyword arguments to pass into model’s fit() method.
model_final_kwargs (Optional[dict]) – Optional extra keyword arguments to pass into the final model’s fit() on the cleaned data, but not the fit() in each fold of cross-validation on the noisy data. The final fit() will also receive the arguments in model_kwargs, but these may be overwritten by values in model_final_kwargs. This can be useful for training differently in the final fit() than during cross-validation.
- Return type:
BaseEstimator
- Returns:
self (CleanLearning) – Fitted estimator that has all the same methods as any sklearn estimator. After calling self.fit(), this estimator also stores extra attributes such as:
self.label_issues_df: a pd.DataFrame containing label quality scores, boolean flags indicating which examples have label issues, and predicted label values for each example. Accessible via self.get_label_issues, of similar format as the one returned by self.find_label_issues. See documentation of self.find_label_issues for column descriptions.
self.label_issues_mask: a np.ndarray boolean mask indicating if a particular example has been identified to have issues.
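The train, flag, prune, retrain sequence that fit() describes can be sketched with plain sklearn. This is a simplified illustration under stated assumptions, not cleanlab's algorithm: the quality signal here is just the absolute out-of-sample residual, and the 5% cutoff is a fixed, hypothetical choice rather than an estimated fraction.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_clean = X @ np.array([3.0, -1.0])
y_noisy = y_clean.copy()
corrupted = rng.choice(200, size=10, replace=False)
y_noisy[corrupted] += 25.0   # inject gross label errors into 10 examples

model = LinearRegression()
# Step 1: out-of-sample predictions on the noisy data.
oos_pred = cross_val_predict(clone(model), X, y_noisy, cv=5)
# Step 2: flag the 5% of examples with the largest residuals (assumed rule).
abs_residual = np.abs(oos_pred - y_noisy)
label_issues_mask = abs_residual >= np.quantile(abs_residual, 0.95)
# Step 3: retrain a fresh copy of the model on the remaining examples.
final_model = clone(model).fit(X[~label_issues_mask], y_noisy[~label_issues_mask])
```

A boolean array like label_issues_mask has the same shape and meaning as the np.ndarray form accepted by the label_issues argument: True marks an example with a label issue.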
- predict(X, *args, **kwargs)
Predict target values using your wrapped model. Works just like model.predict().
- Parameters:
X (np.ndarray or DatasetLike) – Test data in the same format expected by your wrapped regression model.
- Return type:
ndarray
- Returns:
predictions (np.ndarray) – Predictions for the test examples.
- score(X, y, sample_weight=None)
Evaluates your wrapped regression model’s score on a test set X with target values y. Uses your model’s default scoring function, or r-squared score if your model has no "score" attribute.
- Parameters:
X (Union[ndarray, DataFrame]) – Test data in the same format expected by your wrapped model.
y (Union[list, ndarray, Series, DataFrame]) – Test labels in the same format as labels previously used in fit().
sample_weight (Optional[ndarray]) – Optional array of shape (N,) or (N, 1) used to weight each test example when computing the score.
- Return type:
float
- Returns:
score (float) – Number quantifying the performance of this regression model on the test data.
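For reference, the r-squared score mentioned above is also what sklearn regressors return from their own score() method. The sketch below compares LinearRegression.score with sklearn.metrics.r2_score under sample weights; the data is synthetic and only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.2, size=80)
sample_weight = rng.uniform(0.5, 1.5, size=80)

model = LinearRegression().fit(X, y)
# sklearn regressors score with weighted R^2 by default.
default_score = model.score(X, y, sample_weight=sample_weight)
manual_r2 = r2_score(y, model.predict(X), sample_weight=sample_weight)
```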
- find_label_issues(X, y, *, uncertainty=None, coarse_search_range=[0.01, 0.05, 0.1, 0.15, 0.2], fine_search_size=3, save_space=False, model_kwargs=None)
Identifies potential label issues (corrupted y-values) in the dataset, and estimates how noisy each label is.
Note: this method estimates the label issues from scratch. To access previously-estimated label issues from this CleanLearning instance, use the self.get_label_issues method.
This is the method called to find label issues inside CleanLearning.fit() and they share mostly the same parameters.
- Parameters:
X (Union[ndarray, DataFrame]) – Data features (i.e. covariates, independent variables), typically an array of shape (N, ...), where N is the number of examples (sample-size). Your model must be able to fit() and predict() data of this format.
y (Union[list, ndarray, Series, DataFrame]) – An array of shape (N,) of noisy labels (i.e. target/response/dependent variable), where some values may be erroneous.
uncertainty (Union[ndarray, float, None]) – Optional estimated uncertainty for each example. Should be passed in as a float (constant uncertainty throughout all examples), or a numpy array of length N (estimated uncertainty for each example). If not provided, this method will estimate the uncertainty as the sum of the epistemic and aleatoric uncertainty.
save_space (bool) – If True, then the returned label_issues_df will not be stored as an attribute. This means some other methods like self.get_label_issues will no longer work.
coarse_search_range (list) – The coarse search range to find the value of k, which estimates the fraction of data which have label issues. More values represent a more thorough search (better expected results but longer runtimes).
fine_search_size (int) – Size of fine-grained search grid to find the value of k, which represents our estimate of the fraction of data which have label issues. A higher number represents a more thorough search (better expected results but longer runtimes).
For info about the other parameters, see the docstring of CleanLearning.fit().
- Return type:
DataFrame
- Returns:
label_issues_df (pd.DataFrame) – DataFrame with info about label issues for each example. Unless the save_space argument is specified, the same DataFrame is also stored as the self.label_issues_df attribute, accessible via get_label_issues. Each row represents an example from our dataset and the DataFrame may contain the following columns:
is_label_issue: boolean mask for the entire dataset where True represents a label issue and False represents an example that is accurately labeled with high confidence.
label_quality: Numeric score that measures the quality of each label (how likely it is to be correct, with lower scores indicating potentially erroneous labels).
given_label: Values originally given for this example (same as y input).
predicted_label: Values predicted by the trained model.
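The coarse_search_range and fine_search_size parameters describe a coarse-then-fine search over k. The sketch below shows one generic way such a search can work; coarse_to_fine_search and its objective argument are hypothetical, the toy objective stands in for whatever score is actually evaluated, and cleanlab's internal search may differ in its details.

```python
import numpy as np


def coarse_to_fine_search(objective, coarse_search_range, fine_search_size):
    """Hypothetical helper: maximize objective(k) on a coarse grid, then refine."""
    coarse = sorted(coarse_search_range)
    best_idx = int(np.argmax([objective(k) for k in coarse]))
    # Refine between the neighbors of the best coarse value.
    low = coarse[max(best_idx - 1, 0)]
    high = coarse[min(best_idx + 1, len(coarse) - 1)]
    fine = np.linspace(low, high, fine_search_size + 2)[1:-1]
    candidates = [coarse[best_idx], *fine]
    return float(max(candidates, key=objective))


# Toy objective that peaks at k = 0.08 (a stand-in for a real validation score).
best_k = coarse_to_fine_search(
    objective=lambda k: -abs(k - 0.08),
    coarse_search_range=[0.01, 0.05, 0.1, 0.15, 0.2],
    fine_search_size=3,
)
```

With this toy objective the coarse grid picks 0.1, and the fine grid between 0.05 and 0.15 then moves the estimate closer to the true peak. A larger fine_search_size gives a denser fine grid at the cost of more evaluations.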
- get_label_issues()
Accessor, returns label_issues_df attribute if previously computed. This pd.DataFrame describes the issues identified for each example (each row corresponds to an example). For column definitions, see the documentation of CleanLearning.find_label_issues.
- Return type:
Optional[DataFrame]
- Returns:
label_issues_df (pd.DataFrame) – DataFrame with (precomputed) info about the label issues for each example.
- get_epistemic_uncertainty(X, y, predictions=None)
Compute the epistemic uncertainty of the regression model for each example. This uncertainty is estimated using the bootstrapped variance of the model predictions.
- Parameters:
X (ndarray) – Data features (i.e. training inputs for ML), typically an array of shape (N, ...), where N is the number of examples.
y (ndarray) – An array of shape (N,) of target values (dependent variables), where some values may be erroneous.
predictions (Optional[ndarray]) – Model predicted values of y, will be used as an extra bootstrap iteration to calculate the variance.
- Return type:
ndarray
- Returns:
epistemic_uncertainty (np.ndarray) – The estimated epistemic uncertainty for each example.
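The "bootstrapped variance of the model predictions" can be illustrated as follows. This is a minimal sketch of the general technique on synthetic data, not cleanlab's implementation; details such as how the optional predictions argument is folded in are not reproduced here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2))
y = X @ np.array([2.0, 1.0]) + rng.normal(scale=0.3, size=120)

n_boot = 5
model = LinearRegression()
boot_predictions = np.empty((n_boot, len(X)))
for b in range(n_boot):
    # Resample the training set with replacement and refit a fresh copy.
    idx = rng.integers(0, len(X), size=len(X))
    boot_predictions[b] = clone(model).fit(X[idx], y[idx]).predict(X)

# Spread of the predictions across bootstrap rounds, one value per example.
epistemic_uncertainty = boot_predictions.var(axis=0)
```

Examples where the resampled models disagree most get the largest values, which is why more bootstrap rounds (n_boot) give a steadier estimate.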
- get_aleatoric_uncertainty(X, residual)
Compute the aleatoric uncertainty of the data. This uncertainty is estimated by predicting the standard deviation of the regression error.
- Parameters:
X (ndarray) – Data features (i.e. training inputs for ML), typically an array of shape (N, ...), where N is the number of examples.
residual (ndarray) – The difference between the given value and the model predicted value of each example, i.e. predictions - y.
- Return type:
float
- Returns:
aleatoric_uncertainty (float) – The overall estimated aleatoric uncertainty for this dataset.
- save_space()
Clears non-sklearn attributes of this estimator to save space (in-place). This includes the DataFrame attribute that stored label issues which may be large for big datasets. You may want to call this method before deploying this model (i.e. if you just care about producing predictions). After calling this method, certain non-prediction-related attributes/functionality will no longer be available.
- classmethod __init_subclass__(**kwargs)
Set the set_{method}_request methods.
This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.
The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing (MetadataRequest) – A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params (dict) – Parameter names mapped to their values.
- set_fit_request(*, find_label_issues_kwargs: bool | None | str = '$UNCHANGED$', label_issues: bool | None | str = '$UNCHANGED$', model_final_kwargs: bool | None | str = '$UNCHANGED$', model_kwargs: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') -> CleanLearning
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
find_label_issues_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for find_label_issues_kwargs parameter in fit.
label_issues (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for label_issues parameter in fit.
model_final_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_final_kwargs parameter in fit.
model_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_kwargs parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
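The <component>__<parameter> form is easiest to see on a small Pipeline. In this sketch the step names "scale" and "reg" are arbitrary labels chosen for the example.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("reg", Ridge(alpha=1.0))])
# <component>__<parameter> addresses a parameter of a nested step.
pipe.set_params(reg__alpha=0.1, scale__with_mean=False)
params = pipe.get_params(deep=True)
```

With deep=True, get_params() reports the nested values under the same double-underscore keys.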
- set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> CleanLearning
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.