outlier

Helper functions used internally for outlier detection tasks.

Functions:

transform_distances_to_scores(avg_distances, ...)

Returns an outlier score for each example based on its average distance to its k nearest neighbors.

correct_precision_errors(scores, ...[, C, p])

Ensures that any score whose corresponding avg_distances entry is below the tolerance threshold is set to one.

cleanlab.internal.outlier.transform_distances_to_scores(avg_distances, t, scaling_factor)

Returns an outlier score for each example based on its average distance to its k nearest neighbors.

The transformation of a distance, d, to a score, o, is based on the following formula:

o = \exp\left(-dt\right)

where t scales the distance to a score in the range [0,1].
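The formula can be tried out directly with NumPy. The following is a minimal sketch, not the library's implementation; `distance_to_score` is a hypothetical helper name introduced only for illustration, and it ignores the scaling factor.

```python
import numpy as np

def distance_to_score(d, t=1):
    """Hypothetical helper: apply o = exp(-d * t) elementwise."""
    d = np.asarray(d, dtype=float)
    return np.exp(-d * t)

distances = np.array([0.0, 0.5, 2.0])

# A distance of zero maps to a score of exactly one;
# larger distances map to scores closer to zero.
scores_t1 = distance_to_score(distances, t=1)

# A larger t pushes the scores of distant examples toward zero faster.
scores_t5 = distance_to_score(distances, t=5)
```

For non-negative distances and positive t, every score falls in (0, 1], with lower scores indicating examples that are farther from their neighbors.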

Parameters:
  • avg_distances (np.ndarray) – An array of distances of shape (N,), where N is the number of examples. Each entry represents an example’s average distance to its k nearest neighbors.

  • t (int) – A sensitivity parameter that modulates the strength of the transformation from distances to scores. Higher values of t result in more pronounced differentiation between the scores of examples lying in the range [0,1].

  • scaling_factor (float) – A scaling factor used to normalize the distances before they are converted into scores. A valid scaling factor is any positive number. The choice of scaling factor should be based on the distribution of distances between neighboring examples. A good rule of thumb is to set the scaling factor to the median distance between neighboring examples. A lower scaling factor results in more pronounced differentiation between the scores of examples lying in the range [0,1].
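The median rule of thumb for the scaling factor might be applied as in the sketch below. The `knn_distances` array is made-up data, and dividing the distances by the scaling factor before exponentiation is an assumption about how the normalization is applied, not something stated in the formula above.

```python
import numpy as np

# Hypothetical k-nearest-neighbor distances: 4 examples, k = 3 neighbors each.
knn_distances = np.array([
    [0.1, 0.2, 0.3],
    [0.2, 0.2, 0.5],
    [0.1, 0.4, 0.4],
    [2.0, 2.5, 3.0],  # far from its neighbors, a likely outlier
])

# Average distance of each example to its k nearest neighbors, shape (N,).
avg_distances = knn_distances.mean(axis=1)

# Rule of thumb from the text: the median distance between neighboring examples.
scaling_factor = float(np.median(knn_distances))

# Assumption: distances are divided by the scaling factor before the
# exponential transformation is applied.
t = 1
scores = np.exp(-(avg_distances / scaling_factor) * t)
```

With this data the last example receives by far the lowest score, which matches the intent that larger neighbor distances indicate likelier outliers.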

Return type:

ndarray

Returns:

ood_features_scores (np.ndarray) – An array of outlier scores of shape (N,) for N examples.

Examples

>>> import numpy as np
>>> from cleanlab.internal.outlier import transform_distances_to_scores
>>> distances = np.array([[0.0, 0.1, 0.25],
...                       [0.15, 0.2, 0.3]])
>>> avg_distances = np.mean(distances, axis=1)
>>> transform_distances_to_scores(avg_distances, t=1, scaling_factor=1)
array([0.88988177, 0.80519832])

cleanlab.internal.outlier.correct_precision_errors(scores, avg_distances, metric, C=100, p=None)

Ensures that any score whose corresponding avg_distances entry is below the tolerance threshold is set to one.

Parameters:
  • scores (ndarray) – An array of scores of shape (N,), where N is the number of examples. Each entry represents a score between 0 and 1.

  • avg_distances (ndarray) – An array of distances of shape (N,), where N is the number of examples. Each entry represents an example’s average distance to its k nearest neighbors.

  • metric (str) – The metric used by the knn algorithm to calculate the distances. It must be ‘cosine’, ‘euclidean’ or ‘minkowski’, otherwise this function does nothing.

  • C (int) – Multiplier applied to the machine epsilon to compute the tolerance; larger values accept larger precision differences. For the type of values used in the distances, 100 should be a sensible default when the distances are small (below the order of 1).

  • p (Optional[int]) – This value is only used when metric is ‘minkowski’. A ValueError is raised if metric is ‘minkowski’ and p is not provided.

Returns:

fixed_scores – An array of scores of shape (N,) for N examples with scores between 0 and 1.
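The idea behind the correction can be sketched as follows. This is a simplified illustration, not the library's code: `clamp_small_distances` is a hypothetical name, the tolerance is assumed to be simply C times the machine epsilon of the distance dtype, and the metric-specific handling (including the p parameter) is not reproduced.

```python
import numpy as np

def clamp_small_distances(scores, avg_distances, C=100):
    """Hypothetical sketch: set scores to 1 where distances are within tolerance of zero."""
    scores = np.asarray(scores, dtype=float)
    avg_distances = np.asarray(avg_distances, dtype=float)
    # Assumed tolerance: C times the machine epsilon of the distance dtype.
    tolerance = C * np.finfo(avg_distances.dtype).eps
    # Keep the original score unless the distance is effectively zero.
    return np.where(avg_distances < tolerance, 1.0, scores)

avg_distances = np.array([1e-15, 0.2, 0.0])
scores = np.exp(-avg_distances)
scores[0] = 0.9999999  # simulate a score disturbed by a precision error

fixed = clamp_small_distances(scores, avg_distances)
```

Only the entries whose distances fall below the tolerance are changed; the score for the distance of 0.2 is returned as is.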