outlier
Methods for finding out-of-distribution examples in a dataset via scores that quantify how atypical each example is compared to the others.
The underlying algorithms are described in this paper.
Classes:
OutOfDistribution([params]) – Provides scores to detect out-of-distribution (OOD) examples that are outliers in a dataset.
- class cleanlab.outlier.OutOfDistribution(params=None)
Bases: object
Provides scores to detect out-of-distribution (OOD) examples that are outliers in a dataset.
Each example’s OOD score lies in [0,1], with smaller values indicating examples that are less typical under the data distribution. OOD scores may be estimated from either numeric feature embeddings or predicted probabilities from a trained classifier.
To get the indices of the most severe outliers, call the find_top_issues function on the returned OOD scores.
- Parameters:
params (dict, default = {}) – Optional keyword arguments to control how this estimator is fit. The effect of these arguments depends on whether the OutOfDistribution estimator relies on features or pred_probs. They are stored as the instance attribute self.params.
- If features is passed in during fit(), params could contain the following keys:
- knn: sklearn.neighbors.NearestNeighbors, default = None
Instantiated NearestNeighbors object that’s been fitted on a dataset in the same feature space. Note that the distance metric and n_neighbors are specified when instantiating this class. You can also pass in a subclass of sklearn.neighbors.NearestNeighbors, which allows you to use faster approximate neighbor libraries as long as you wrap them behind the same sklearn API. If you specify knn here, there is no need to later call fit() before calling score(). If knn = None, then by default: knn = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(features), where dist_metric == "cosine" if dim(features) > 3 or dist_metric == "euclidean" otherwise. See: https://scikit-learn.org/stable/modules/neighbors.html
- k: int, default = None
Optional number of neighbors to use when calculating the outlier score (average distance to neighbors). If k is not provided, then by default k = knn.n_neighbors, or k = 10 if knn is None. If an existing knn object is provided, you can still specify that outlier scores should use a different value of k than originally used in the knn, as long as your specified value of k is smaller than the value originally used in knn.
- t: int, default = 1
Optional hyperparameter only for advanced users. Controls the transformation of distances between examples into similarity scores that lie in [0,1]. The transformation applied to distances x is exp(-x*t). If you find your scores are all too close to 1, consider increasing t, although the relative scores of examples will still have the same ranking across the dataset.
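To make the roles of k and t concrete, here is a minimal NumPy sketch of a score built from the average distance to the k nearest neighbors, passed through exp(-x*t). It is not cleanlab's implementation: the helper name knn_outlier_scores, the brute-force Euclidean distances, and the choice to exclude each point from its own neighbor set are all assumptions made for illustration.

```python
import numpy as np

def knn_outlier_scores(features, k=2, t=1.0):
    """Hypothetical helper: exp(-t * mean distance to the k nearest other points)."""
    features = np.asarray(features, dtype=float)
    # Pairwise Euclidean distances (brute force; fine for small arrays only).
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Assumption: a point does not count as its own neighbor.
    np.fill_diagonal(dists, np.inf)
    nearest = np.sort(dists, axis=1)[:, :k]
    avg_dist = nearest.mean(axis=1)
    # Map distances in [0, inf) to similarity scores in (0, 1].
    return np.exp(-avg_dist * t)

features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [10.0, 10.0]])
scores_t1 = knn_outlier_scores(features, k=2, t=1.0)
scores_t5 = knn_outlier_scores(features, k=2, t=5.0)
print(scores_t1)
print(scores_t5)
```

The last point sits far from the rest and receives the lowest score. Raising t pushes every score toward 0 but, because exp(-x*t) is monotone in x, leaves the ordering of examples unchanged.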
- If pred_probs is passed in during fit(), params could contain the following keys:
- confident_thresholds: np.ndarray, default = None
An array of shape (K,) where K is the number of classes. The confident threshold for a class j is the expected (average) “self-confidence” for that class. If you specify confident_thresholds here, there is no need to later call fit() before calling score().
- adjust_pred_probs: bool, default = True
If True, account for class imbalance by adjusting predicted probabilities via subtraction of class confident thresholds and renormalization. If False, you do not have to pass in labels later to fit this OOD estimator. See Northcutt et al., 2021.
- method: {“entropy”, “least_confidence”}, default = “entropy”
OOD scoring method. Letting the length-K vector P = pred_probs[i] denote the predicted class probabilities for the i-th example, its OOD score is either:
'entropy': 1 - H(P) / log(K), where H(P) = -sum_{j} P[j] * log(P[j]) is the entropy of P
'least_confidence': max(P)
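The two scoring methods can be sketched in a few lines of NumPy. This is an illustration rather than the library's code: the function name ood_scores_from_pred_probs and the clipping constant used to avoid log(0) are assumptions. The entropy score is taken as one minus the entropy normalized by log(K), so that scores stay in [0,1] and uncertain predictions score low.

```python
import numpy as np

def ood_scores_from_pred_probs(pred_probs, method="entropy"):
    """Hypothetical helper mirroring the two documented scoring methods."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    if method == "entropy":
        num_classes = pred_probs.shape[1]
        # Assumption: clip before the log so zero probabilities contribute 0.
        clipped = np.clip(pred_probs, 1e-12, 1.0)
        entropy = -(pred_probs * np.log(clipped)).sum(axis=1)
        return 1.0 - entropy / np.log(num_classes)
    if method == "least_confidence":
        return pred_probs.max(axis=1)
    raise ValueError(f"unknown method: {method!r}")

pred_probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain prediction
    [0.60, 0.30, 0.10],   # in between
])
entropy_scores = ood_scores_from_pred_probs(pred_probs, method="entropy")
confidence_scores = ood_scores_from_pred_probs(pred_probs, method="least_confidence")
print(entropy_scores)
print(confidence_scores)
```

Under both methods the uniform row scores lowest, matching the convention that values near 0 indicate outliers.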
Methods:
fit_score(*[, features, pred_probs, labels, ...]) – Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
fit(*[, features, pred_probs, labels, verbose]) – Fits this estimator to a given dataset.
score(*[, features, pred_probs]) – Uses the fitted estimator and the passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
- fit_score(*, features=None, pred_probs=None, labels=None, verbose=True)
Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers). Exactly one of features or pred_probs needs to be passed in to calculate scores.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see fit.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. features should be in the same format expected by the fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. pred_probs should be in the same format expected by the fit function.
labels (array_like, optional) – A discrete array of given class labels for the data, of shape (N,). labels should be in the same format expected by the fit function.
verbose (bool, default = True) – Set to False to suppress all print statements.
- Return type:
ndarray
- Returns:
scores (np.ndarray) – If features are passed in, ood_features_scores are returned. If pred_probs are passed in, ood_predictions_scores are returned. For details, see the return value of the score function.
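Since lower scores mark more atypical examples, picking out the most severe outliers amounts to sorting the returned scores in ascending order. cleanlab offers find_top_issues for this purpose; the helper below is a hypothetical plain-NumPy stand-in (its name and its top argument are assumptions) that only shows the idea.

```python
import numpy as np

def lowest_scoring_indices(scores, top=3):
    """Hypothetical helper: indices of the `top` smallest scores, most severe first."""
    scores = np.asarray(scores, dtype=float)
    # A stable sort keeps the original order among tied scores.
    order = np.argsort(scores, kind="stable")
    return order[:top]

# Made-up scores for five examples; values near 0 indicate outliers.
scores = np.array([0.91, 0.05, 0.88, 0.40, 0.97])
worst = lowest_scoring_indices(scores, top=2)
print(worst)
```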
- fit(*, features=None, pred_probs=None, labels=None, verbose=True)
Fits this estimator to a given dataset.
One of features or pred_probs must be specified.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see the OutOfDistribution documentation.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. All features should be numeric. For less structured data (e.g. images, text, categorical values, …), you should provide vector embeddings to represent each example (e.g. extracted from some pretrained neural network).
pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1.
labels (array_like, optional) – A discrete vector of given labels for the data, of shape (N,). Supported array_like types include: np.ndarray or list. Format requirements: for a dataset with K classes, labels must be in 0, 1, …, K-1. All the classes (0, 1, …, and K-1) MUST be present in labels, such that: len(set(labels)) == pred_probs.shape[1]. If params["adjust_pred_probs"] was previously set to False, you do not have to pass in labels. Note: multi-label classification is not supported by this method; each example must belong to a single class, e.g. labels = np.array([1,0,2,1,1,0...]).
verbose (bool, default = True) – Set to False to suppress all print statements.
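The following sketch shows what fitting on pred_probs and labels could look like: one confident threshold per class, computed as the average self-confidence (the predicted probability of the given label) over examples carrying that label, followed by the subtract-and-renormalize adjustment. Both function names are hypothetical, and shifting by the largest threshold to keep values non-negative before renormalizing is an assumption of this sketch, not something the text above specifies.

```python
import numpy as np

def fit_confident_thresholds(pred_probs, labels):
    """Hypothetical helper: average self-confidence per class, shape (K,)."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    num_classes = pred_probs.shape[1]
    # Every class 0..K-1 must appear in labels, as the format requirements state.
    if len(set(labels.tolist())) != num_classes:
        raise ValueError("all classes 0, 1, ..., K-1 must be present in labels")
    # Self-confidence: predicted probability of each example's given label.
    self_confidence = pred_probs[np.arange(len(labels)), labels]
    return np.array([self_confidence[labels == j].mean() for j in range(num_classes)])

def adjust_pred_probs(pred_probs, thresholds):
    """Hypothetical helper: subtract class thresholds, then renormalize each row."""
    # Assumption: add the largest threshold so no entry becomes negative.
    shifted = np.asarray(pred_probs, dtype=float) - thresholds + thresholds.max()
    return shifted / shifted.sum(axis=1, keepdims=True)

pred_probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
labels = np.array([0, 0, 1, 1])
thresholds = fit_confident_thresholds(pred_probs, labels)
adjusted = adjust_pred_probs(pred_probs, thresholds)
print(thresholds)
print(adjusted)
```

The adjusted rows still sum to 1, so they can be fed to either scoring method in place of the raw probabilities.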
- score(*, features=None, pred_probs=None)
Uses the fitted estimator and the passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
The score for each example corresponds to the likelihood that this example stems from the same distribution as the dataset previously specified in fit() (i.e. is not an outlier).
If features are passed, returns an OOD score for each example based on its feature values. If pred_probs are passed, returns an OOD score for each example based on the classifier’s probabilistic predictions. You may first need to call fit(), or call fit_score() instead.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. For details, see features in the fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. For details, see pred_probs in the fit function.
- Return type:
ndarray
- Returns:
scores (np.ndarray) – Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers).
If features are passed, ood_features_scores are returned. The score is based on the average distance between the example and its k nearest neighbors in the dataset (in feature space).
If pred_probs are passed, ood_predictions_scores are returned. The score is based on the uncertainty in the classifier’s predicted probabilities.
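The fit-then-score workflow for features can be pictured with a toy estimator that memorizes a reference dataset in fit() and, in score(), rates new examples by their average distance to the k nearest reference points. The class TinyFeatureOOD below is a hypothetical stand-in written for illustration with brute-force Euclidean distances; it is not cleanlab's OutOfDistribution class and omits the pred_probs path entirely.

```python
import numpy as np

class TinyFeatureOOD:
    """Hypothetical stand-in for a features-based OOD estimator."""

    def __init__(self, k=2, t=1.0):
        self.k = k
        self.t = t
        self.train_features = None

    def fit(self, features):
        # "Fitting" here just stores the reference dataset.
        self.train_features = np.asarray(features, dtype=float)
        return self

    def score(self, features):
        if self.train_features is None:
            raise ValueError("call fit() before score()")
        features = np.asarray(features, dtype=float)
        # Distances from every new example to every reference example.
        diffs = features[:, None, :] - self.train_features[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=-1))
        nearest = np.sort(dists, axis=1)[:, : self.k]
        # Same exp(-x*t) transformation: far from the data means a score near 0.
        return np.exp(-nearest.mean(axis=1) * self.t)

train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = np.array([[0.5, 0.5], [8.0, 8.0]])
estimator = TinyFeatureOOD(k=2).fit(train)
new_scores = estimator.score(new_points)
print(new_scores)
```

The point inside the reference cluster scores far higher than the distant one. Calling score() on an unfitted instance raises an error, which mirrors the note that fit() or fit_score() may need to be called first.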