regression.learn
cleanlab can be used for learning with noisy data for any dataset and regression model.
For regression tasks, the regression.learn.CleanLearning
class wraps any instance of an sklearn model to allow you to train more robust regression models,
or use the model to identify corrupted values in the dataset.
The wrapped model must adhere to the sklearn estimator API,
meaning it must define three functions:
model.fit(X, y, sample_weight=None)
model.predict(X)
model.score(X, y, sample_weight=None)
where X contains the data (i.e. features, covariates, independent variables) and y contains the target
value (i.e. label, response/dependent variable). The first index of X and of y should correspond to the different
examples in the dataset, such that len(X) = len(y) = N (sample-size).
Your model should be correctly clonable via sklearn.base.clone:
cleanlab internally creates multiple instances of the model, so if you manually wrap a model yourself
(e.g. a PyTorch model), ensure that every call to the estimator's __init__() creates an independent
instance of the model (for sklearn compatibility, the weights of neural network models should typically
be initialized inside of model.fit()).
Example
>>> from cleanlab.regression.learn import CleanLearning
>>> from sklearn.linear_model import LinearRegression
>>> cl = CleanLearning(model=LinearRegression()) # Pass in any model.
>>> cl.fit(X, y_with_noise)
>>> # Estimate the predictions as if you had trained without label issues.
>>> predictions = cl.predict(X)
If your model is not sklearn-compatible by default, an existing adapter package may be able to wrap it. For example, you can adapt PyTorch models using skorch and Keras models using SciKeras.
If an adapter doesn’t already exist, you can manually wrap your model to be sklearn-compatible. This is made easy by inheriting from sklearn.base.BaseEstimator:
from sklearn.base import BaseEstimator

class YourModel(BaseEstimator):
    def __init__(self, ):
        pass
    def fit(self, X, y):
        pass
    def predict(self, X):
        pass
    def score(self, X, y):
        pass
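To make the skeleton concrete, here is a minimal sketch of a hand-written wrapper that satisfies the three required methods and survives sklearn.base.clone. The class name LeastSquaresRegressor and its fit_intercept parameter are illustrative assumptions, not part of cleanlab; the point is only that __init__ stores hyperparameters as-is and that all learned state is created inside fit().

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class LeastSquaresRegressor(BaseEstimator):
    """Hypothetical example model: ordinary (optionally weighted) least squares."""

    def __init__(self, fit_intercept=True):
        # Only store hyperparameters here so that clone() can rebuild the estimator.
        self.fit_intercept = fit_intercept

    def _design(self, X):
        X = np.asarray(X, dtype=float)
        if self.fit_intercept:
            X = np.column_stack([np.ones(len(X)), X])
        return X

    def fit(self, X, y, sample_weight=None):
        A = self._design(X)
        y = np.asarray(y, dtype=float)
        if sample_weight is not None:
            # Weighted least squares: scale rows by the square root of the weights.
            w = np.sqrt(np.asarray(sample_weight, dtype=float))
            A, y = A * w[:, None], y * w
        # Learned state is created in fit(), never in __init__().
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return self._design(X) @ self.coef_

    def score(self, X, y, sample_weight=None):
        # (Weighted) coefficient of determination, R^2.
        y = np.asarray(y, dtype=float)
        w = np.ones_like(y) if sample_weight is None else np.asarray(sample_weight, dtype=float)
        resid = y - self.predict(X)
        ss_res = np.sum(w * resid**2)
        ss_tot = np.sum(w * (y - np.average(y, weights=w)) ** 2)
        return 1.0 - ss_res / ss_tot


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X[:, 0] + 1.0
model = LeastSquaresRegressor().fit(X, y)
fresh_copy = clone(model)  # same hyperparameters, no fitted state
```

Because clone() rebuilds the estimator from get_params(), fresh_copy carries fit_intercept but no coef_, which is the kind of independence between instances that the cloning requirement above asks for.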
Classes:
CleanLearning – Machine Learning with cleaned data (even when training on messy, error-ridden data).
- class cleanlab.regression.learn.CleanLearning(model=None, *, cv_n_folds=5, n_boot=5, include_aleatoric_uncertainty=True, verbose=False, seed=None)
Bases: BaseEstimator
CleanLearning = Machine Learning with cleaned data (even when training on messy, error-ridden data).
Automated and robust learning with noisy labels using any dataset and any regression model. For regression tasks, this class trains a model with error-prone, noisy labels as if the model had instead been trained on a dataset with perfect labels. It achieves this by estimating which labels are noisy (you might use CleanLearning solely for this estimation) and then removing examples estimated to have noisy labels, such that a more robust copy of the same model can be trained on the remaining clean data.
- Parameters:
model (Optional[BaseEstimator]) – Any regression model implementing the sklearn estimator API, defining the following functions:
model.fit(X, y)
model.predict(X)
model.score(X, y)
Default model used is sklearn.linear_model.LinearRegression.
cv_n_folds (int) – This class needs holdout predictions for every data example and, if they are not provided, uses cross-validation to compute them. This argument sets the number of cross-validation folds used to compute out-of-sample predictions for each example in X. Default is 5. Larger values may produce better results but take longer to run.
n_boot (int) – Number of bootstrap resampling rounds used to estimate the model's epistemic uncertainty. Default is 5. Larger values are expected to produce better results but require longer runtimes. Set to 0 to skip estimating the epistemic uncertainty and get results faster.
include_aleatoric_uncertainty (bool) – Specifies whether the aleatoric uncertainty should be estimated during label error detection. True by default, which is expected to produce better results but requires longer runtimes.
verbose (bool) – Controls how much output is printed. Set to False to suppress print statements. Default False.
seed (Optional[bool]) – Sets the state of the random number generator used to split the data. By default, uses the current np.random random state.
Methods:
fit(X, y, *[, label_issues, sample_weight, ...]) – Train regression model with error-prone, noisy labels as if the model had instead been trained on a dataset with the correct labels.
predict(X, *args, **kwargs) – Predict target values using your wrapped model.
score(X, y[, sample_weight]) – Evaluates your wrapped regression model's score on a test set X with target values y.
find_label_issues(X, y, *[, uncertainty, ...]) – Identifies potential label issues (corrupted y-values) in the dataset, and estimates how noisy each label is.
get_label_issues() – Accessor, returns label_issues_df attribute if previously computed.
get_epistemic_uncertainty(X, y[, predictions]) – Compute the epistemic uncertainty of the regression model for each example.
get_aleatoric_uncertainty(X, residual) – Compute the aleatoric uncertainty of the data.
save_space() – Clears non-sklearn attributes of this estimator to save space (in-place).
__init_subclass__(**kwargs) – Set the set_{method}_request methods.
get_metadata_routing() – Get metadata routing of this object.
get_params([deep]) – Get parameters for this estimator.
set_fit_request(*[, ...]) – Request metadata passed to the fit method.
set_params(**params) – Set the parameters of this estimator.
set_score_request(*[, sample_weight]) – Request metadata passed to the score method.
- fit(X, y, *, label_issues=None, sample_weight=None, find_label_issues_kwargs=None, model_kwargs=None, model_final_kwargs=None)
Train regression model with error-prone, noisy labels as if the model had instead been trained on a dataset with the correct labels. fit achieves this by first training model via cross-validation on the noisy data, using the resulting out-of-sample predictions to identify label issues, pruning the data with label issues, and finally training model on the remaining clean data.
- Parameters:
X (Union[ndarray, DataFrame]) – Data features (i.e. covariates, independent variables), typically an array of shape (N, ...), where N is the number of examples (sample-size). Your model must be able to fit() and predict() data of this format.
y (Union[list, ndarray, Series, DataFrame]) – An array of shape (N,) of noisy labels (i.e. target/response/dependent variable), where some values may be erroneous.
label_issues (Union[DataFrame, ndarray, None]) – Optional already-identified label issues in the dataset (if previously estimated). Specify this to avoid re-estimating the label issues if already done.
If pd.DataFrame, must be formatted as the one returned by self.find_label_issues or self.get_label_issues. The DataFrame must have a column named is_label_issue.
If np.ndarray, the input must be a boolean mask of length N where examples that have label issues have the value True, and the rest of the examples have the value False.
sample_weight (Optional[ndarray]) – Optional array of weights with shape (N,) that are assigned to individual samples. Specifies how to weight the examples in the loss function while training.
find_label_issues_kwargs (Optional[dict]) – Optional keyword arguments to pass into self.find_label_issues.
model_kwargs (Optional[dict]) – Optional keyword arguments to pass into model's fit() method.
model_final_kwargs (Optional[dict]) – Optional extra keyword arguments to pass into the final model's fit() on the cleaned data, but not the fit() in each fold of cross-validation on the noisy data. The final fit() will also receive the arguments in model_kwargs, but these may be overwritten by values in model_final_kwargs. This can be useful for training differently in the final fit() than during cross-validation.
- Return type:
BaseEstimator
- Returns:
self (CleanLearning) – Fitted estimator that has all the same methods as any sklearn estimator.
After calling self.fit(), this estimator also stores extra attributes such as:
self.label_issues_df: a pd.DataFrame containing label quality scores, boolean flags indicating which examples have label issues, and predicted label values for each example. Accessible via self.get_label_issues, of similar format as the one returned by self.find_label_issues. See documentation of self.find_label_issues for column descriptions.
self.label_issues_mask: a np.ndarray boolean mask indicating if a particular example has been identified to have issues.
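The overall workflow that fit() describes (cross-validated predictions, flagging suspect labels, pruning, retraining) can be sketched with plain scikit-learn. This is not cleanlab's implementation: the flagging rule below (treating the largest absolute out-of-sample residuals as label issues) is a deliberately simple stand-in for cleanlab's label quality estimation, and fit_on_pruned_data and frac_issues are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict


def fit_on_pruned_data(model, X, y, frac_issues=0.1, cv_n_folds=5):
    """Illustrative only: prune the worst-fitting examples, then refit a fresh copy."""
    # 1. Out-of-sample prediction for every example via cross-validation.
    oos_pred = cross_val_predict(clone(model), X, y, cv=cv_n_folds)
    # 2. Stand-in flagging rule (assumption): largest residuals are label issues.
    abs_resid = np.abs(y - oos_pred)
    n_flag = int(round(frac_issues * len(y)))
    mask = np.zeros(len(y), dtype=bool)
    if n_flag > 0:
        mask[np.argsort(abs_resid)[-n_flag:]] = True
    # 3. Train a fresh copy of the model on the examples that were not flagged.
    final_model = clone(model).fit(X[~mask], y[~mask])
    return final_model, mask


# Synthetic data: a linear trend with 10 deliberately corrupted targets.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=200)
y_noisy = y.copy()
corrupted = rng.choice(200, size=10, replace=False)
y_noisy[corrupted] += 5.0

final_model, label_issues_mask = fit_on_pruned_data(
    LinearRegression(), X, y_noisy, frac_issues=0.05
)
```

A boolean mask of this shape (length N, True where a label issue was found) is also the np.ndarray format that the label_issues argument of fit() accepts.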
- predict(X, *args, **kwargs)
Predict target values using your wrapped model. Works just like model.predict().
- Parameters:
X (np.ndarray or DatasetLike) – Test data in the same format expected by your wrapped regression model.
- Return type:
ndarray
- Returns:
predictions (np.ndarray) – Predictions for the test examples.
- score(X, y, sample_weight=None)
Evaluates your wrapped regression model's score on a test set X with target values y. Uses your model's default scoring function, or the r-squared score if your model has no "score" attribute.
- Parameters:
X (Union[ndarray, DataFrame]) – Test data in the same format expected by your wrapped model.
y (Union[list, ndarray, Series, DataFrame]) – Test labels in the same format as labels previously used in fit().
sample_weight (Optional[ndarray]) – Optional array of shape (N,) or (N, 1) used to weight each test example when computing the score.
- Return type:
float
- Returns:
score (float) – Number quantifying the performance of this regression model on the test data.
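The fallback behaviour described for score() can be pictured with a short sketch. The helper score_with_fallback and the PredictOnlyModel class are hypothetical names used only for illustration, and the logic is an assumption about how such a fallback might look rather than cleanlab's own code.

```python
import numpy as np
from sklearn.metrics import r2_score


def score_with_fallback(model, X, y, sample_weight=None):
    """Use model.score() when it exists, otherwise fall back to r-squared."""
    if hasattr(model, "score"):
        if sample_weight is None:
            return model.score(X, y)
        return model.score(X, y, sample_weight=sample_weight)
    return r2_score(y, model.predict(X), sample_weight=sample_weight)


class PredictOnlyModel:
    """Hypothetical model that defines predict() but no score()."""

    def predict(self, X):
        return 2.0 * np.asarray(X)[:, 0]


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 2.0, 4.0, 6.0])
r2 = score_with_fallback(PredictOnlyModel(), X, y)  # predictions match y exactly
```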
- find_label_issues(X, y, *, uncertainty=None, coarse_search_range=[0.01, 0.05, 0.1, 0.15, 0.2], fine_search_size=3, save_space=False, model_kwargs=None)
Identifies potential label issues (corrupted y-values) in the dataset, and estimates how noisy each label is.
Note: this method estimates the label issues from scratch. To access previously-estimated label issues from this CleanLearning instance, use the self.get_label_issues method.
This is the method called to find label issues inside CleanLearning.fit() and they share mostly the same parameters.
- Parameters:
X (Union[ndarray, DataFrame]) – Data features (i.e. covariates, independent variables), typically an array of shape (N, ...), where N is the number of examples (sample-size). Your model must be able to fit() and predict() data of this format.
y (Union[list, ndarray, Series, DataFrame]) – An array of shape (N,) of noisy labels (i.e. target/response/dependent variable), where some values may be erroneous.
uncertainty (Union[ndarray, float, None]) – Optional estimated uncertainty for each example. Should be passed in as a float (constant uncertainty throughout all examples), or a numpy array of length N (estimated uncertainty for each example). If not provided, this method will estimate the uncertainty as the sum of the epistemic and aleatoric uncertainty.
save_space (bool) – If True, then the returned label_issues_df will not be stored as an attribute. This means some other methods like self.get_label_issues will no longer work.
coarse_search_range (list) – The coarse search range to find the value of k, which estimates the fraction of data which have label issues. More values represent a more thorough search (better expected results but longer runtimes).
fine_search_size (int) – Size of the fine-grained search grid to find the value of k, which represents our estimate of the fraction of data which have label issues. A higher number represents a more thorough search (better expected results but longer runtimes).
For info about the other parameters, see the docstring of CleanLearning.fit().
- Return type:
DataFrame
- Returns:
label_issues_df (pd.DataFrame) – DataFrame with info about label issues for each example. Unless the save_space argument is specified, the same DataFrame is also stored as the self.label_issues_df attribute accessible via get_label_issues.
Each row represents an example from our dataset and the DataFrame may contain the following columns:
is_label_issue: boolean mask for the entire dataset where True represents a label issue and False represents an example that is accurately labeled with high confidence.
label_quality: Numeric score that measures the quality of each label (how likely it is to be correct, with lower scores indicating potentially erroneous labels).
given_label: Values originally given for this example (same as y input).
predicted_label: Values predicted by the trained model.
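As a small illustration of working with a DataFrame in this format, the sketch below builds one by hand and ranks the flagged examples from least to most trustworthy. The numeric values are made up for the example and do not come from a real cleanlab run; only the column names are taken from the description above.

```python
import pandas as pd

# Hand-made stand-in for the DataFrame returned by find_label_issues (values are invented).
label_issues_df = pd.DataFrame(
    {
        "is_label_issue": [False, True, False, True],
        "label_quality": [0.93, 0.08, 0.71, 0.21],
        "given_label": [1.2, 9.5, 3.1, -4.0],
        "predicted_label": [1.1, 2.4, 3.4, 0.9],
    }
)

# Keep only flagged examples and review the lowest-quality labels first.
flagged = label_issues_df[label_issues_df["is_label_issue"]]
worst_first = flagged.sort_values("label_quality")
issue_indices = worst_first.index.to_list()

# The same information as a boolean mask of length N.
issue_mask = label_issues_df["is_label_issue"].to_numpy()
```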
- get_label_issues()
Accessor, returns label_issues_df attribute if previously computed. This pd.DataFrame describes the issues identified for each example (each row corresponds to an example). For column definitions, see the documentation of CleanLearning.find_label_issues.
- Return type:
Optional[DataFrame]
- Returns:
label_issues_df (pd.DataFrame) – DataFrame with (precomputed) info about the label issues for each example.
- get_epistemic_uncertainty(X, y, predictions=None)
Compute the epistemic uncertainty of the regression model for each example. This uncertainty is estimated using the bootstrapped variance of the model predictions.
- Parameters:
X (ndarray) – Data features (i.e. training inputs for ML), typically an array of shape (N, ...), where N is the number of examples.
y (ndarray) – An array of shape (N,) of target values (dependent variables), where some values may be erroneous.
predictions (Optional[ndarray]) – Model predicted values of y, which will be used as an extra bootstrap iteration to calculate the variance.
- Return type:
ndarray
- Returns:
epistemic_uncertainty (np.ndarray) – The estimated epistemic uncertainty for each example.
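The idea of a bootstrapped variance of predictions can be sketched as follows: refit the model on several resampled copies of the data and measure how much the prediction for each example moves. This is a simplified illustration under assumed details (the function name bootstrap_prediction_variance is hypothetical, and the exact resampling scheme and scaling cleanlab uses are not specified here).

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression


def bootstrap_prediction_variance(model, X, y, n_boot=5, seed=0):
    """Per-example variance of predictions across models fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_preds = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample N indices with replacement
        boot_preds[b] = clone(model).fit(X[idx], y[idx]).predict(X)
    return boot_preds.var(axis=0)


rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=100)
epistemic_var = bootstrap_prediction_variance(LinearRegression(), X, y, n_boot=20)
```

More bootstrap rounds give a steadier estimate at the cost of more model fits, which mirrors the trade-off described for the n_boot parameter.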
- get_aleatoric_uncertainty(X, residual)
Compute the aleatoric uncertainty of the data. This uncertainty is estimated by predicting the standard deviation of the regression error.
- Parameters:
X (ndarray) – Data features (i.e. training inputs for ML), typically an array of shape (N, ...), where N is the number of examples.
residual (ndarray) – The difference between the given value and the model predicted value of each example, i.e. predictions - y.
- Return type:
float
- Returns:
aleatoric_uncertainty (float) – The overall estimated aleatoric uncertainty for this dataset.
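One plausible reading of this description, offered only as a sketch and not as cleanlab's actual procedure, is to model the residuals as a function of X and summarize the spread of whatever remains as a single number. The helper name residual_spread is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict


def residual_spread(model, X, residual, cv_n_folds=5):
    """Assumed approach: std of the part of the residual that X cannot explain."""
    residual_pred = cross_val_predict(model, X, residual, cv=cv_n_folds)
    return float(np.std(residual - residual_pred))


# Synthetic data whose true noise standard deviation is 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

predictions = LinearRegression().fit(X, y).predict(X)
residual = predictions - y
aleatoric = residual_spread(LinearRegression(), X, residual)
```

On this synthetic data the result should land near the noise level used to generate y, which is the irreducible part of the error that more training data would not remove.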
- save_space()
Clears non-sklearn attributes of this estimator to save space (in-place). This includes the DataFrame attribute that stores label issues, which may be large for big datasets. You may want to call this method before deploying this model (i.e. if you just care about producing predictions). After calling this method, certain non-prediction-related attributes/functionality will no longer be available.
- classmethod __init_subclass__(**kwargs)
Set the set_{method}_request methods.
This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.
The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.
References
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing (MetadataRequest) – A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params (dict) – Parameter names mapped to their values.
- set_fit_request(*, find_label_issues_kwargs: Union[bool, None, str] = '$UNCHANGED$', label_issues: Union[bool, None, str] = '$UNCHANGED$', model_final_kwargs: Union[bool, None, str] = '$UNCHANGED$', model_kwargs: Union[bool, None, str] = '$UNCHANGED$', sample_weight: Union[bool, None, str] = '$UNCHANGED$') -> CleanLearning
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
find_label_issues_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for find_label_issues_kwargs parameter in fit.
label_issues (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for label_issues parameter in fit.
model_final_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_final_kwargs parameter in fit.
model_kwargs (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for model_kwargs parameter in fit.
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.
- Returns:
self (object) – The updated object.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self (estimator instance) – Estimator instance.
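The <component>__<parameter> naming can be seen with an ordinary scikit-learn Pipeline. The step names "scale" and "reg" below are arbitrary choices for this sketch; the same get_params and set_params calls are what a BaseEstimator subclass such as CleanLearning inherits.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: each step is addressed as <step name>__<parameter>.
pipe = Pipeline([("scale", StandardScaler()), ("reg", Ridge(alpha=1.0))])

# set_params returns the estimator itself, so calls can be chained.
pipe.set_params(reg__alpha=10.0, scale__with_mean=False)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
reg_alpha = params["reg__alpha"]
```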
- set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') -> CleanLearning
Request metadata passed to the score method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
New in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a pipeline.Pipeline. Otherwise it has no effect.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
- Returns:
self (object) – The updated object.