outlier#
Methods for finding out-of-distribution examples in a dataset via scores that quantify how atypical each example is compared to the others.
The underlying algorithms are described in this paper.
Classes:

OutOfDistribution – Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
- class cleanlab.outlier.OutOfDistribution(params=None)[source]#
Bases: object
Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
Each example's OOD score lies in [0,1], with smaller values indicating examples that are less typical under the data distribution. OOD scores may be estimated from either numeric feature embeddings or predicted probabilities from a trained classifier.
To get the indices of the examples that are the most severe outliers, call the find_top_issues function on the returned OOD scores.
- Parameters:
params (dict, default = {}) – Optional keyword arguments to control how this estimator is fit. The effect of the arguments passed in depends on whether the OutOfDistribution estimator will rely on features or pred_probs. These are stored as the instance attribute self.params.
- If features is passed in during fit(), params could contain the following keys (a configuration sketch follows this list):
- knn: sklearn.neighbors.NearestNeighbors, default = None
Instantiated NearestNeighbors object that's been fitted on a dataset in the same feature space. Note that the distance metric and n_neighbors are specified when instantiating this class. You can also pass in a subclass of sklearn.neighbors.NearestNeighbors, which allows you to use faster approximate-neighbor libraries as long as you wrap them behind the same sklearn API. If you specify knn here, there is no need to later call fit() before calling score(). If knn is None, then by default the knn object is instantiated as sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(features), where:
- If dim(features) > 3, the distance metric is set to "cosine".
- If dim(features) <= 3, the distance metric is set to "euclidean".
The implementation of the euclidean distance metric depends on the number of examples in the features array:
- For more than 100 rows, it uses scikit-learn's "euclidean" metric, for efficiency reasons.
- For 100 or fewer rows, it uses scipy's scipy.spatial.distance.euclidean metric, for numerical stability reasons.
See: https://scikit-learn.org/stable/modules/neighbors.html
See: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html
See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html
- k: int, default = None
Optional number of neighbors to use when calculating the outlier score (average distance to neighbors). If k is not provided, then by default k = knn.n_neighbors, or k = 10 if knn is None. If an existing knn object is provided, you can still specify that outlier scores should use a different value of k than was originally used in the knn, as long as your specified value of k is smaller than the value originally used in knn.
- t: int, default = 1
Optional hyperparameter, only for advanced users. Controls the transformation of distances between examples into similarity scores that lie in [0,1]. The transformation applied to distances x is exp(-x*t). If you find your scores are all too close to 1, consider increasing t, although the relative scores of examples will still have the same ranking across the dataset.
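For illustration, a minimal sketch of configuring these feature-based keys; the random features array here is a hypothetical stand-in for your own numeric embeddings:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from cleanlab.outlier import OutOfDistribution

    features = np.random.rand(500, 128)  # hypothetical embeddings: N=500 examples, M=128 features

    # Pre-fitted knn supplied via params, so no call to fit() is needed before score()
    knn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(features)

    ood = OutOfDistribution(params={"knn": knn, "k": 5, "t": 1})  # k=5 is smaller than n_neighbors=10
    scores = ood.score(features=features)  # smaller scores indicate more atypical examples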
- If pred_probs is passed in during fit(), params could contain the following keys (a sketch follows this list):
- confident_thresholds: np.ndarray, default = None
An array of shape (K,), where K is the number of classes. The confident threshold for a class j is the expected (average) "self-confidence" for that class. If you specify confident_thresholds here, there is no need to later call fit() before calling score().
- adjust_pred_probs: bool, default = True
If True, account for class imbalance by adjusting predicted probabilities via subtraction of the class confident thresholds and renormalization. If False, you do not have to pass in labels later to fit this OOD estimator. See Northcutt et al., 2021.
- method: {"entropy", "least_confidence", "gen"}, default = "entropy"
Method to use when computing outlier scores based on pred_probs. Letting the length-K vector P = pred_probs[i] denote the given predicted class-probabilities for the i-th example in the dataset, its outlier score can be:
- 'entropy': 1 - H(P) / log(K), where H(P) = -sum_{j} P[j] * log(P[j]) is the entropy of P (so near-uniform predictions score near 0 and confident one-hot predictions score near 1).
- 'least_confidence': max(P) (equivalent to the Maximum Softmax Probability method from the OOD detection literature).
- 'gen': Generalized ENtropy score from the paper of Liu, Lochman, and Zach (https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_GEN_Pushing_the_Limits_of_Softmax-Based_Out-of-Distribution_Detection_CVPR_2023_paper.pdf).
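To make these options concrete, here is a hedged sketch using made-up pred_probs, including a by-hand computation of the unadjusted 'entropy' score for one example:

    import numpy as np
    from cleanlab.outlier import OutOfDistribution

    # Hypothetical predicted probabilities for N=4 examples over K=3 classes
    pred_probs = np.array([
        [0.98, 0.01, 0.01],  # confident prediction -> higher score (more typical)
        [0.34, 0.33, 0.33],  # near-uniform prediction -> lower score (possible outlier)
        [0.70, 0.20, 0.10],
        [0.05, 0.05, 0.90],
    ])
    labels = np.array([0, 1, 0, 2])  # all K classes present

    ood = OutOfDistribution(params={"method": "entropy", "adjust_pred_probs": True})
    scores = ood.fit_score(pred_probs=pred_probs, labels=labels)

    # Unadjusted 'entropy' score for the near-uniform example, computed by hand:
    P = pred_probs[1]
    H = -np.sum(P * np.log(P))               # entropy of P
    entropy_score = 1 - H / np.log(len(P))   # close to 0 for near-uniform P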
Methods:

fit_score(*[, features, pred_probs, labels, ...]) – Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.

fit(*[, features, pred_probs, labels, verbose]) – Fits this estimator to a given dataset.

score(*[, features, pred_probs]) – Uses the fitted estimator and passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
- fit_score(*, features=None, pred_probs=None, labels=None, verbose=True)[source]#
Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
Scores lie in [0,1] with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers). Exactly one of features or pred_probs needs to be passed in to calculate scores.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see ~cleanlab.outlier.OutOfDistribution.fit.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. Provide features in the same format expected by the ~cleanlab.outlier.OutOfDistribution.fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. Provide pred_probs in the same format expected by the ~cleanlab.outlier.OutOfDistribution.fit function.
labels (array_like, optional) – A discrete array of given class labels for the data, of shape (N,). Provide labels in the same format expected by the ~cleanlab.outlier.OutOfDistribution.fit function.
verbose (bool, default = True) – Set to False to suppress all print statements.
- Return type:
np.ndarray
- Returns:
scores (np.ndarray) – If features are passed in, ood_features_scores are returned. If pred_probs are passed in, ood_predictions_scores are returned. For details, see the return of the ~cleanlab.outlier.OutOfDistribution.score function.
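For example, a brief sketch of this one-call workflow; the features array is a placeholder, and find_top_issues is assumed to be importable from cleanlab.rank:

    import numpy as np
    from cleanlab.outlier import OutOfDistribution
    from cleanlab.rank import find_top_issues

    features = np.random.rand(1000, 64)  # placeholder embeddings for your dataset

    ood = OutOfDistribution()
    scores = ood.fit_score(features=features)  # fit and score in a single call

    # Indices of the 10 most severe outliers, most severe first
    top_outlier_indices = find_top_issues(scores, top=10)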
- fit(*, features=None, pred_probs=None, labels=None, verbose=True)[source]#
Fits this estimator to a given dataset.
One of features or pred_probs must be specified.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details, see the ~cleanlab.outlier.OutOfDistribution documentation.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. All features should be numeric. For less structured data (e.g. images, text, categorical values, ...), you should provide vector embeddings to represent each example (e.g. extracted from some pretrained neural network).
pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, ..., K-1.
labels (array_like, optional) – A discrete vector of given labels for the data, of shape (N,). Supported array_like types include np.ndarray and list. Format requirements: for a dataset with K classes, labels must be in 0, 1, ..., K-1, and all K classes MUST be present in labels, such that len(set(labels)) == pred_probs.shape[1]. If params["adjust_pred_probs"] was previously set to False, you do not have to pass in labels. Note: multi-label classification is not supported by this method; each example must belong to a single class, e.g. labels = np.array([1, 0, 2, 1, 1, 0, ...]).
verbose (bool, default = True) – Set to False to suppress all print statements.
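A short sketch of fitting on pred_probs plus labels (all values here are illustrative placeholders):

    import numpy as np
    from cleanlab.outlier import OutOfDistribution

    # Placeholder model outputs: N=5 examples, K=3 classes
    pred_probs = np.array([
        [0.90, 0.05, 0.05],
        [0.10, 0.80, 0.10],
        [0.20, 0.20, 0.60],
        [0.40, 0.30, 0.30],
        [0.05, 0.90, 0.05],
    ])
    labels = np.array([0, 1, 2, 0, 1])
    assert len(set(labels)) == pred_probs.shape[1]  # all K classes must appear in labels

    ood = OutOfDistribution()
    ood.fit(pred_probs=pred_probs, labels=labels)  # fits confident_thresholds
    scores = ood.score(pred_probs=pred_probs)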
- score(*, features=None, pred_probs=None)[source]#
Use fitted estimator and passed in features or pred_probs to calculate out-of-distribution scores for a dataset.
The score for each example corresponds to the likelihood that this example stems from the same distribution as the dataset previously specified in fit() (i.e. is not an outlier).
If features are passed, returns an OOD score for each example based on its feature values. If pred_probs are passed, returns an OOD score for each example based on the classifier's probabilistic predictions. You may have to previously call fit(), or call fit_score() instead.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. For details, see features in the ~cleanlab.outlier.OutOfDistribution.fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. For details, see pred_probs in the ~cleanlab.outlier.OutOfDistribution.fit function.
- Return type:
np.ndarray
- Returns:
scores (np.ndarray) – Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers).
If features are passed, ood_features_scores are returned. The score is based on the average distance between the example and its K nearest neighbors in the dataset (in feature space).
If pred_probs are passed, ood_predictions_scores are returned. The score is based on the uncertainty in the classifier's predicted probabilities.
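A common pattern this separation of fit() and score() enables, sketched with placeholder arrays: fit on training features, then score new data against the training distribution:

    import numpy as np
    from cleanlab.outlier import OutOfDistribution

    train_features = np.random.rand(800, 32)  # placeholder training embeddings
    new_features = np.random.rand(200, 32)    # placeholder new data in the same feature space

    ood = OutOfDistribution()
    ood.fit(features=train_features)               # fits a NearestNeighbors estimator
    new_scores = ood.score(features=new_features)  # low scores flag examples unlike the training data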