outlier#
Methods for finding out-of-distribution examples in a dataset via scores that quantify how atypical each example is compared to the others.
The underlying algorithms are described in this paper.
Classes:

OutOfDistribution – Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
- class cleanlab.outlier.OutOfDistribution(params=None)[source]#
Bases: object
Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
Each example's OOD score lies in [0,1], with smaller values indicating examples that are less typical under the data distribution. OOD scores may be estimated from either numeric feature embeddings or predicted probabilities from a trained classifier.
To get the indices of the examples that are the most severe outliers, call the find_top_issues function on the returned OOD scores.
- Parameters:
params (dict, default = {}) – Optional keyword arguments to control how this estimator is fit. The effect of the arguments passed in depends on whether the OutOfDistribution estimator will rely on features or pred_probs. These are stored as the instance attribute self.params.
- If features is passed in during fit(), params could contain the following keys (a configuration sketch follows this list):
- knn: sklearn.neighbors.NearestNeighbors, default = None
Instantiated NearestNeighbors object that's been fitted on a dataset in the same feature space. Note that the distance metric and n_neighbors are specified when instantiating this class. You can also pass in a subclass of sklearn.neighbors.NearestNeighbors, which allows you to use faster approximate-neighbor libraries as long as you wrap them behind the same sklearn API. If you specify knn here, there is no need to later call fit() before calling score(). If knn is None, then by default the knn object is instantiated as sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(features), where:
- If dim(features) > 3, the distance metric is set to "cosine".
- If dim(features) <= 3, the distance metric is set to "euclidean".
The implementation of the euclidean distance metric depends on the number of examples in the features array:
- For more than 100 rows, it uses scikit-learn's "euclidean" metric, for efficiency reasons.
- For 100 or fewer rows, it uses scipy's scipy.spatial.distance.euclidean metric, for numerical stability reasons.
See: https://scikit-learn.org/stable/modules/neighbors.html
See: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html
See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html
- k: int, default = None
Optional number of neighbors to use when calculating the outlier score (average distance to neighbors). If k is not provided, then by default k = knn.n_neighbors, or k = 10 if knn is None. If an existing knn object is provided, you can still specify that outlier scores should use a different value of k than was originally used in the knn, as long as your specified value of k is smaller than the value originally used in knn.
- t: int, default = 1
Optional hyperparameter, only for advanced users. Controls the transformation of distances between examples into similarity scores that lie in [0,1]. The transformation applied to distances x is exp(-x*t). If you find your scores are all too close to 1, consider increasing t, although the relative scores of examples will still have the same ranking across the dataset.
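For illustration, a minimal sketch of configuring these feature-based keys; the random features array here is a hypothetical stand-in for your own numeric embeddings:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from cleanlab.outlier import OutOfDistribution

    features = np.random.rand(500, 128)  # hypothetical embeddings: N=500 examples, M=128 features

    # Pre-fitted knn supplied via params, so no call to fit() is needed before score()
    knn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(features)

    ood = OutOfDistribution(params={"knn": knn, "k": 5, "t": 1})  # k=5 is smaller than n_neighbors=10
    scores = ood.score(features=features)  # smaller scores indicate more atypical examples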
- If pred_probs is passed in during fit(), params could contain the following keys (a sketch follows this list):
- confident_thresholds: np.ndarray, default = None
An array of shape (K,), where K is the number of classes. The confident threshold for a class j is the expected (average) "self-confidence" for that class. If you specify confident_thresholds here, there is no need to later call fit() before calling score().
- adjust_pred_probs: bool, default = True
If True, account for class imbalance by adjusting predicted probabilities via subtraction of the class confident thresholds and renormalization. If False, you do not have to pass in labels later to fit this OOD estimator. See Northcutt et al., 2021.
- method: {"entropy", "least_confidence", "gen"}, default = "entropy"
Method to use when computing outlier scores based on pred_probs. Letting the length-K vector P = pred_probs[i] denote the given predicted class-probabilities for the i-th example in the dataset, its outlier score can be:
- 'entropy': 1 - H(P) / log(K), where H(P) = -sum_{j} P[j] * log(P[j]) is the entropy of P (so near-uniform predictions score near 0 and confident one-hot predictions score near 1).
- 'least_confidence': max(P) (equivalent to the Maximum Softmax Probability method from the OOD detection literature).
- 'gen': Generalized ENtropy score from the paper of Liu, Lochman, and Zach (https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_GEN_Pushing_the_Limits_of_Softmax-Based_Out-of-Distribution_Detection_CVPR_2023_paper.pdf).
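To make these options concrete, here is a hedged sketch using made-up pred_probs, including a by-hand computation of the unadjusted 'entropy' score for one example:

    import numpy as np
    from cleanlab.outlier import OutOfDistribution

    # Hypothetical predicted probabilities for N=4 examples over K=3 classes
    pred_probs = np.array([
        [0.98, 0.01, 0.01],  # confident prediction -> higher score (more typical)
        [0.34, 0.33, 0.33],  # near-uniform prediction -> lower score (possible outlier)
        [0.70, 0.20, 0.10],
        [0.05, 0.05, 0.90],
    ])
    labels = np.array([0, 1, 0, 2])  # all K classes present

    ood = OutOfDistribution(params={"method": "entropy", "adjust_pred_probs": True})
    scores = ood.fit_score(pred_probs=pred_probs, labels=labels)

    # Unadjusted 'entropy' score for the near-uniform example, computed by hand:
    P = pred_probs[1]
    H = -np.sum(P * np.log(P))               # entropy of P
    entropy_score = 1 - H / np.log(len(P))   # close to 0 for near-uniform P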
Methods:

fit_score(*[, features, pred_probs, labels, ...]) – Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.

fit(*[, features, pred_probs, labels, verbose]) – Fits this estimator to a given dataset.

score(*[, features, pred_probs]) – Uses the fitted estimator and passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
- fit_score(*, features=None, pred_probs=None, labels=None, verbose=True)[source]#
Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
Scores lie in [0,1] with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers). Exactly one of features or pred_probs needs to be passed in to calculate scores.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see ~cleanlab.outlier.OutOfDistribution.fit.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. Provide features in the same format expected by the ~cleanlab.outlier.OutOfDistribution.fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. Provide pred_probs in the same format expected by the ~cleanlab.outlier.OutOfDistribution.fit function.
labels (array_like, optional) – A discrete array of given class labels for the data, of shape (N,). Provide labels in the same format expected by the ~cleanlab.outlier.OutOfDistribution.fit function.
verbose (bool, default = True) – Set to False to suppress all print statements.
- Return type:
np.ndarray
- Returns:
scores (np.ndarray) – If features are passed in, ood_features_scores are returned. If pred_probs are passed in, ood_predictions_scores are returned. For details, see the return of the ~cleanlab.outlier.OutOfDistribution.score function.
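For example, a brief sketch of this one-call workflow; the features array is a placeholder, and find_top_issues is assumed to be importable from cleanlab.rank:

    import numpy as np
    from cleanlab.outlier import OutOfDistribution
    from cleanlab.rank import find_top_issues

    features = np.random.rand(1000, 64)  # placeholder embeddings for your dataset

    ood = OutOfDistribution()
    scores = ood.fit_score(features=features)  # fit and score in a single call

    # Indices of the 10 most severe outliers, most severe first
    top_outlier_indices = find_top_issues(scores, top=10)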
- fit(*, features=None, pred_probs=None, labels=None, verbose=True)[source]#
Fits this estimator to a given dataset.
One of features or pred_probs must be specified.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see the ~cleanlab.outlier.OutOfDistribution documentation.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. All features should be numeric. For less structured data (e.g. images, text, categorical values, ...), you should provide vector embeddings to represent each example (e.g. extracted from some pretrained neural network).
pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, ..., K-1.
labels (array_like, optional) – A discrete vector of given labels for the data, of shape (N,). Supported array_like types include np.ndarray and list. Format requirements: for a dataset with K classes, labels must be in 0, 1, ..., K-1, and all K classes MUST be present in labels, such that len(set(labels)) == pred_probs.shape[1]. If params["adjust_pred_probs"] was previously set to False, you do not have to pass in labels. Note: multi-label classification is not supported by this method; each example must belong to a single class, e.g. labels = np.array([1, 0, 2, 1, 1, 0, ...]).
verbose (bool, default = True) – Set to False to suppress all print statements.
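A short sketch of fitting on pred_probs plus labels (all values here are illustrative placeholders):

    import numpy as np
    from cleanlab.outlier import OutOfDistribution

    # Placeholder model outputs: N=5 examples, K=3 classes
    pred_probs = np.array([
        [0.90, 0.05, 0.05],
        [0.10, 0.80, 0.10],
        [0.20, 0.20, 0.60],
        [0.40, 0.30, 0.30],
        [0.05, 0.90, 0.05],
    ])
    labels = np.array([0, 1, 2, 0, 1])
    assert len(set(labels)) == pred_probs.shape[1]  # all K classes must appear in labels

    ood = OutOfDistribution()
    ood.fit(pred_probs=pred_probs, labels=labels)  # fits confident_thresholds
    scores = ood.score(pred_probs=pred_probs)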
- score(*, features=None, pred_probs=None)[source]#
Use fitted estimator and passed in features or pred_probs to calculate out-of-distribution scores for a dataset.
The score for each example corresponds to the likelihood that this example stems from the same distribution as the dataset previously specified in fit() (i.e. is not an outlier).
If features are passed, returns an OOD score for each example based on its feature values. If pred_probs are passed, returns an OOD score for each example based on the classifier's probabilistic predictions. You may have to previously call fit(), or call fit_score() instead.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. For details, see features in the ~cleanlab.outlier.OutOfDistribution.fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. For details, see pred_probs in the ~cleanlab.outlier.OutOfDistribution.fit function.
- Return type:
np.ndarray
- Returns:
scores (np.ndarray) – Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers).
If features are passed, ood_features_scores are returned. The score is based on the average distance between the example and its K nearest neighbors in the dataset (in feature space).
If pred_probs are passed, ood_predictions_scores are returned. The score is based on the uncertainty in the classifier's predicted probabilities.
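A common pattern this separation of fit() and score() enables, sketched with placeholder arrays: fit on training features, then score new data against the training distribution:

    import numpy as np
    from cleanlab.outlier import OutOfDistribution

    train_features = np.random.rand(800, 32)  # placeholder training embeddings
    new_features = np.random.rand(200, 32)    # placeholder new data in the same feature space

    ood = OutOfDistribution()
    ood.fit(features=train_features)               # fits a NearestNeighbors estimator
    new_scores = ood.score(features=new_features)  # low scores flag examples unlike the training data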