outlier
Methods for finding out-of-distribution examples in a dataset via scores that quantify how atypical each example is compared to the others.
The underlying algorithms are described in this paper.
Classes:
| OutOfDistribution | Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset. |
- class cleanlab.outlier.OutOfDistribution(params=None)
- Bases: object
  Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
  Each example’s OOD score lies in [0,1], with smaller values indicating examples that are less typical under the data distribution. OOD scores may be estimated from either numeric feature embeddings or predicted probabilities from a trained classifier.
  To get the indices of the most severe outliers, call the find_top_issues function on the returned OOD scores.
  Parameters:
- params (dict, default = {}) – Optional keyword arguments that control how this estimator is fit. The effect of these arguments depends on whether the OutOfDistribution estimator relies on features or pred_probs. They are stored as the instance attribute self.params.
  If features is passed in during fit(), params may contain the following keys:
- knn: sklearn.neighbors.NearestNeighbors, default = None
- Instantiated NearestNeighbors object that has been fitted on a dataset in the same feature space. Note that the distance metric and n_neighbors are specified when instantiating this class. You can also pass in a subclass of sklearn.neighbors.NearestNeighbors, which lets you use faster approximate-neighbor libraries as long as you wrap them behind the same sklearn API. If you specify knn here, there is no need to later call fit() before calling score(). If knn = None, then by default: knn = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(features), where dist_metric == "cosine" if dim(features) > 3 and dist_metric == "euclidean" otherwise. See: https://scikit-learn.org/stable/modules/neighbors.html
 
- k: int, default = None
- Optional number of neighbors to use when calculating the outlier score (average distance to neighbors). If k is not provided, then by default k = knn.n_neighbors, or k = 10 if knn is None. If an existing knn object is provided, you can still specify that outlier scores should use a different value of k than originally used in the knn, as long as your specified value of k is smaller than the value originally used in knn.
 
- t: int, default = 1
- Optional hyperparameter only for advanced users. Controls the transformation of distances between examples into similarity scores that lie in [0,1]. The transformation applied to distances x is exp(-x*t). If you find your scores are all too close to 1, consider increasing t; the relative scores of examples will still have the same ranking across the dataset.
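As a rough illustration of the feature-based score described above, the following NumPy sketch computes each example's average distance to its k nearest neighbors and maps it through exp(-x*t). It is not cleanlab's implementation: the helper name knn_ood_scores, the brute-force Euclidean distances, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def knn_ood_scores(features, k=10, t=1):
    """Hypothetical helper: exp(-t * average distance to the k nearest neighbors)."""
    # Brute-force pairwise Euclidean distances, shape (N, N).
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Exclude each example from its own neighbor list.
    np.fill_diagonal(dists, np.inf)
    # Average distance to the k closest other examples.
    avg_dist = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Map distances x >= 0 into similarity scores in (0, 1].
    return np.exp(-avg_dist * t)


rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
features[0] = [25.0, 25.0]  # plant one obvious outlier

scores_t1 = knn_ood_scores(features, k=5, t=1)
scores_t5 = knn_ood_scores(features, k=5, t=5)

# The planted outlier receives the lowest score.
print(int(np.argmin(scores_t1)))
```

Increasing t pushes all scores toward 0 but, because exp(-x*t) is monotone in x, the ordering of examples stays the same.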
 
 
- If pred_probs is passed in during fit(), params may contain the following keys:
- confident_thresholds: np.ndarray, default = None
- An array of shape (K,), where K is the number of classes. The confident threshold for class j is the expected (average) “self-confidence” for that class. If you specify confident_thresholds here, there is no need to later call fit() before calling score().
 
- adjust_pred_probs: bool, default = True
- If True, account for class imbalance by adjusting predicted probabilities via subtraction of class confident thresholds and renormalization. If False, you do not have to pass in labels later to fit this OOD estimator. See Northcutt et al., 2021.
 
- method: {“entropy”, “least_confidence”, “gen”}, default = “entropy”
- Method to use when computing outlier scores based on pred_probs. Letting the length-K vector P = pred_probs[i] denote the given predicted class probabilities for the i-th example in the dataset, its outlier score can be one of:
  - 'entropy': 1 - H(P) / log(K), where H(P) = -sum_{j} P[j] * log(P[j]) is the entropy of P
  - 'least_confidence': max(P) (equivalent to the Maximum Softmax Probability method from the OOD detection literature)
  - 'gen': Generalized ENtropy score from the paper of Liu, Lochman, and Zach (https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_GEN_Pushing_the_Limits_of_Softmax-Based_Out-of-Distribution_Detection_CVPR_2023_paper.pdf)
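The 'entropy' and 'least_confidence' scores can be sketched directly in NumPy. This is only an illustration of the formulas, not cleanlab's code: the function names, the clipping constant used to avoid log(0), and the toy probabilities are assumptions. The entropy score is taken here as one minus the entropy of P normalized by log(K), so that it lies in [0,1] and is lowest for the most uncertain predictions.

```python
import numpy as np


def entropy_score(pred_probs):
    """Hypothetical helper: 1 - (entropy of each row) / log(K)."""
    K = pred_probs.shape[1]
    # Clip to avoid log(0); the 1e-12 floor is an arbitrary choice for this sketch.
    p = np.clip(pred_probs, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return 1.0 - entropy / np.log(K)


def least_confidence_score(pred_probs):
    """Hypothetical helper: maximum predicted probability per example."""
    return pred_probs.max(axis=1)


pred_probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain prediction
    [0.60, 0.30, 0.10],   # moderately confident prediction
])

ent = entropy_score(pred_probs)
lc = least_confidence_score(pred_probs)
print(ent, lc)
```

Under both scores the uniform row ranks as the most atypical example, and the confident row as the least.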
 
 
 
 
- Methods:
  - fit_score(*[, features, pred_probs, labels, ...]): Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
  - fit(*[, features, pred_probs, labels, verbose]): Fits this estimator to a given dataset.
  - score(*[, features, pred_probs]): Uses the fitted estimator and the passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
- fit_score(*, features=None, pred_probs=None, labels=None, verbose=True)
- Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
  Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers). Exactly one of features or pred_probs needs to be passed in to calculate scores.
  If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see fit.
  Parameters:
- features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. Uses the same format expected by the fit function.
- pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. Uses the same format expected by the fit function.
- labels (array_like, optional) – A discrete array of given class labels for the data, of shape (N,). Uses the same format expected by the fit function.
- verbose (bool, default = True) – Set to False to suppress all print statements.
 
- Return type:
- ndarray
- Returns:
- scores (np.ndarray) – If features are passed in, ood_features_scores are returned. If pred_probs are passed in, ood_predictions_scores are returned. For details, see the return value of the score function.
 
- fit(*, features=None, pred_probs=None, labels=None, verbose=True)
- Fits this estimator to a given dataset.
  One of features or pred_probs must be specified.
  If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see the OutOfDistribution documentation.
  Parameters:
- features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. All features should be numeric. For less structured data (e.g. images, text, categorical values, …), you should provide vector embeddings to represent each example (e.g. extracted from some pretrained neural network).
- pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1.
- labels (array_like, optional) – A discrete vector of given labels for the data, of shape (N,). Supported array_like types include np.ndarray and list. Format requirements: for a dataset with K classes, labels must be in 0, 1, …, K-1. All the classes (0, 1, …, and K-1) MUST be present in labels, such that len(set(labels)) == pred_probs.shape[1]. If params["adjust_pred_probs"] was previously set to False, you do not have to pass in labels. Note: multi-label classification is not supported by this method; each example must belong to a single class, e.g. labels = np.array([1,0,2,1,1,0...]).
- verbose (bool, default = True) – Set to False to suppress all print statements.
 
 
- score(*, features=None, pred_probs=None)
- Uses the fitted estimator and the passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
  The score for each example corresponds to the likelihood that this example stems from the same distribution as the dataset previously specified in fit() (i.e. is not an outlier).
  If features are passed, returns an OOD score for each example based on its feature values. If pred_probs are passed, returns an OOD score for each example based on the classifier’s probabilistic predictions. You may have to previously call fit(), or call fit_score() instead.
  Parameters:
- features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. For details, see features in the fit function.
- pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. For details, see pred_probs in the fit function.
 
- Return type:
- ndarray
- Returns:
- scores (np.ndarray) – Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers).
  If features are passed, ood_features_scores are returned. The score is based on the average distance between the example and its k nearest neighbors in the dataset (in feature space).
  If pred_probs are passed, ood_predictions_scores are returned. The score is based on the uncertainty in the classifier’s predicted probabilities.