outlier
Methods for finding out-of-distribution examples in a dataset via scores that quantify how atypical each example is compared to the others.
The underlying algorithms are described in this paper.
Classes:
OutOfDistribution([params]) – Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
- class cleanlab.outlier.OutOfDistribution(params=None)
Bases: object
Provides scores to detect Out Of Distribution (OOD) examples that are outliers in a dataset.
Each example’s OOD score lies in [0,1], with smaller values indicating examples that are less typical under the data distribution. OOD scores may be estimated from either numeric feature embeddings or predicted class probabilities from a trained classifier.
To get the indices of the examples that are the most severe outliers, call the find_top_issues function on the returned OOD scores.
- Parameters:
params (dict, default = {}) – Optional keyword arguments to control how this estimator is fit. The effect of the arguments passed in depends on whether the OutOfDistribution estimator will rely on features or pred_probs. These are stored as the instance attribute self.params.
- If features is passed in during fit(), params could contain the following keys:
- knn: sklearn.neighbors.NearestNeighbors, default = None
Instantiated NearestNeighbors object that has been fitted on a dataset in the same feature space. Note that the distance metric and n_neighbors are specified when instantiating this class. You can also pass in a subclass of sklearn.neighbors.NearestNeighbors, which allows you to use faster approximate-neighbor libraries as long as you wrap them behind the same sklearn API. If you specify knn here, there is no need to later call fit() before calling score(). If knn = None, then by default: knn = sklearn.neighbors.NearestNeighbors(n_neighbors=k, metric=dist_metric).fit(features), where dist_metric == "cosine" if dim(features) > 3 and dist_metric == "euclidean" otherwise. See: https://scikit-learn.org/stable/modules/neighbors.html
- k: int, default = None
Optional number of neighbors to use when calculating the outlier score (average distance to neighbors). If k is not provided, then by default k = knn.n_neighbors, or k = 10 if knn is None. If an existing knn object is provided, you can still specify that outlier scores should use a different value of k than originally used in the knn, as long as your specified value of k is smaller than the value originally used in knn.
- t: int, default = 1
Optional hyperparameter only for advanced users. Controls the transformation of distances between examples into similarity scores that lie in [0,1]. The transformation applied to distances x is exp(-x*t). If you find your scores are all too close to 1, consider increasing t, although the relative scores of examples will still have the same ranking across the dataset.
- If pred_probs is passed in during fit(), params could contain the following keys:
- confident_thresholds: np.ndarray, default = None
An array of shape (K,) where K is the number of classes. The confident threshold for a class j is the expected (average) “self-confidence” for that class. If you specify confident_thresholds here, there is no need to later call fit() before calling score().
- adjust_pred_probs: bool, default = True
If True, account for class imbalance by adjusting predicted probabilities via subtraction of class confident thresholds and renormalization. If False, you do not have to pass in labels later to fit this OOD estimator. See Northcutt et al., 2021.
- method: {“entropy”, “least_confidence”}, default = “entropy”
OOD scoring method. Letting the length-K vector P = pred_probs[i] denote the given predicted class probabilities for the i-th example, its OOD score can be either:
'entropy': 1 - H(P) / log(K), where H(P) = -sum_{j} P[j] * log(P[j])
'least_confidence': max(P)
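The two pred_probs-based scoring methods can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the helper names entropy_score and least_confidence_score and the example probability vectors are hypothetical, and the entropy variant is read as one minus the normalized entropy so that scores stay in [0,1].

```python
import math

def entropy_score(probs):
    """One minus the normalized entropy of one example's class probabilities."""
    num_classes = len(probs)
    # Treat 0 * log(0) as 0, the usual convention when computing entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(num_classes)

def least_confidence_score(probs):
    """Largest predicted class probability for one example."""
    return max(probs)

# Made-up predictions for two examples over K = 3 classes.
confident = [0.98, 0.01, 0.01]   # classifier is sure: looks typical
uncertain = [1 / 3, 1 / 3, 1 / 3]  # classifier cannot decide: looks atypical

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(name, entropy_score(probs), least_confidence_score(probs))
```

Under both methods the uncertain example receives the lower score, matching the convention that smaller values indicate less typical examples.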
Methods:
fit_score(*[, features, pred_probs, labels, ...]) – Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
fit(*[, features, pred_probs, labels, verbose]) – Fits this estimator to a given dataset.
score(*[, features, pred_probs]) – Uses the fitted estimator and the passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
- fit_score(*, features=None, pred_probs=None, labels=None, verbose=True)
Fits this estimator to a given dataset and returns out-of-distribution scores for the same dataset.
Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers). Exactly one of features or pred_probs needs to be passed in to calculate scores.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details see fit.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. For details, see features in the fit function, which expects the same format.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. For details, see pred_probs in the fit function, which expects the same format.
labels (array_like, optional) – A discrete array of given class labels for the data, of shape (N,). For details, see labels in the fit function, which expects the same format.
verbose (bool, default = True) – Set to False to suppress all print statements.
- Return type:
ndarray
- Returns:
scores (np.ndarray) – If features are passed in, ood_features_scores are returned. If pred_probs are passed in, ood_predictions_scores are returned. For details see the return value of the score function.
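Because smaller scores indicate more atypical examples, the most severe outliers are the examples with the lowest scores. The sketch below shows that ranking step with the standard library only. The helper lowest_score_indices and the score values are hypothetical; it is a stand-in for illustration and not the implementation of find_top_issues.

```python
def lowest_score_indices(scores, top=3):
    """Return indices of the `top` smallest scores, most severe outlier first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:top]

# Made-up OOD scores for six examples; values near 0 suggest outliers.
ood_scores = [0.91, 0.12, 0.88, 0.47, 0.05, 0.79]

print(lowest_score_indices(ood_scores, top=2))
```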
- fit(*, features=None, pred_probs=None, labels=None, verbose=True)
Fits this estimator to a given dataset.
One of features or pred_probs must be specified.
If features are passed in, a NearestNeighbors object is fit. If pred_probs and labels are passed in, a confident_thresholds np.ndarray is fit. For details see the OutOfDistribution documentation.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. All features should be numeric. For less structured data (e.g. images, text, categorical values, …), you should provide vector embeddings to represent each example (e.g. extracted from some pretrained neural network).
pred_probs (np.ndarray, optional) – An array of shape (N, K) of model-predicted probabilities, P(label=k|x). Each row of this matrix corresponds to an example x and contains the model-predicted probabilities that x belongs to each possible class, for each of the K classes. The columns must be ordered such that these probabilities correspond to class 0, 1, …, K-1.
labels (array_like, optional) – A discrete vector of given labels for the data, of shape (N,). Supported array_like types include np.ndarray and list. Format requirements: for a dataset with K classes, labels must be in 0, 1, …, K-1. All the classes (0, 1, …, and K-1) MUST be present in labels, such that: len(set(labels)) == pred_probs.shape[1]. If params["adjust_pred_probs"] was previously set to False, you do not have to pass in labels. Note: multi-label classification is not supported by this method; each example must belong to a single class, e.g. labels = np.array([1,0,2,1,1,0...]).
verbose (bool, default = True) – Set to False to suppress all print statements.
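What gets fit from pred_probs and labels can be sketched as follows: a per-class confident threshold (the average self-confidence of the examples labeled with that class), followed by the subtract-and-renormalize adjustment. This is a standard-library illustration under stated assumptions, not the library's code. The function names and the data are hypothetical, and shifting by the largest threshold to keep values non-negative before renormalizing is an assumption of this sketch.

```python
def confident_thresholds(pred_probs, labels):
    """Per-class mean of pred_probs[i][labels[i]] over examples labeled with that class."""
    num_classes = len(pred_probs[0])
    sums = [0.0] * num_classes
    counts = [0] * num_classes
    for probs, label in zip(pred_probs, labels):
        sums[label] += probs[label]
        counts[label] += 1
    # Every class must appear in labels, otherwise a count is zero here.
    return [s / c for s, c in zip(sums, counts)]

def adjust_pred_probs(pred_probs, thresholds):
    """Subtract class thresholds, then renormalize each row to sum to 1."""
    shift = max(thresholds)  # assumption: keeps every entry non-negative
    adjusted = []
    for probs in pred_probs:
        row = [p - t + shift for p, t in zip(probs, thresholds)]
        total = sum(row)
        adjusted.append([v / total for v in row])
    return adjusted

# Made-up predictions for four examples over K = 2 classes.
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]
labels = [0, 0, 1, 1]

thresholds = confident_thresholds(pred_probs, labels)
adjusted = adjust_pred_probs(pred_probs, thresholds)
print(thresholds)
print(adjusted)
```

The division by the per-class count is why every class 0, 1, …, K-1 has to be present in labels.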
- score(*, features=None, pred_probs=None)
Uses the fitted estimator and the passed-in features or pred_probs to calculate out-of-distribution scores for a dataset.
The score for each example corresponds to the likelihood that this example stems from the same distribution as the dataset previously specified in fit() (i.e. is not an outlier).
If features are passed, returns an OOD score for each example based on its feature values. If pred_probs are passed, returns an OOD score for each example based on the classifier’s probabilistic predictions. You may have to call fit() beforehand, or call fit_score() instead.
- Parameters:
features (np.ndarray, optional) – Feature array of shape (N, M), where N is the number of examples and M is the number of features used to represent each example. For details, see features in the fit function.
pred_probs (np.ndarray, optional) – An array of shape (N, K) of predicted class probabilities output by a trained classifier. For details, see pred_probs in the fit function.
- Return type:
ndarray
- Returns:
scores (np.ndarray) – Scores lie in [0,1], with smaller values indicating examples that are less typical under the dataset distribution (values near 0 indicate outliers).
If features are passed, ood_features_scores are returned. The score is based on the average distance between the example and its k nearest neighbors in the dataset (in feature space).
If pred_probs are passed, ood_predictions_scores are returned. The score is based on the uncertainty in the classifier’s predicted probabilities.