metric

Data:

`HIGH_DIMENSION_CUTOFF`	If the number of columns (M) in the `features` array is greater than this cutoff value, then by default, K-nearest-neighbors will use the "cosine" metric.
`ROW_COUNT_CUTOFF`	Only affects settings where Euclidean metrics would be used by default.

Functions:

`decide_euclidean_metric`(features)	Decide the appropriate Euclidean metric implementation based on the size of the dataset.
`decide_default_metric`(features)	Decide the KNN metric to be used based on the shape of the feature array.

cleanlab.internal.neighbor.metric.HIGH_DIMENSION_CUTOFF: int = 3

If the number of columns (M) in the features array is greater than this cutoff value, then by default, K-nearest-neighbors will use the “cosine” metric. The cosine metric is more suitable for high-dimensional data. Otherwise, the “euclidean” distance will be used.

cleanlab.internal.neighbor.metric.ROW_COUNT_CUTOFF: int = 100

Only affects settings where Euclidean metrics would be used by default. If the number of rows (N) in the features array is greater than this cutoff value, then by default, Euclidean distances are computed via the “euclidean” metric (implemented in sklearn for efficiency reasons). Otherwise, Euclidean distances are computed by default via the euclidean metric function from scipy (slower but numerically more precise).

cleanlab.internal.neighbor.metric.decide_euclidean_metric(features)

Decide the appropriate Euclidean metric implementation based on the size of the dataset.

Parameters: features (ndarray) – The input features array.
Return type: Union[str, Callable]
Returns: metric – A string or a callable representing a specific implementation of computing the Euclidean distance.

Note

A choice is made between two implementations of the Euclidean metric based on the number of rows in the feature array. If the number of rows (N) in the feature array is greater than the predefined cutoff value (ROW_COUNT_CUTOFF), the "euclidean" string metric is used, because the sklearn implementation it selects is more efficient on larger datasets. Otherwise, the euclidean metric function from scipy is returned, which is slower but numerically more precise.
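
The decision rules described on this page can be sketched in a few lines. This is a minimal illustration written from the descriptions above, not cleanlab's actual source: the names `sketch_default_metric`, `sketch_euclidean_metric`, and `pairwise_euclidean` are hypothetical, the features are plain nested lists rather than an ndarray, and `pairwise_euclidean` is only a standard-library stand-in for the scipy function the real code is documented to return.

```python
import math
from typing import Callable, Sequence, Union

# Cutoff values as documented on this page.
HIGH_DIMENSION_CUTOFF = 3
ROW_COUNT_CUTOFF = 100


def pairwise_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Hypothetical stand-in for scipy's euclidean distance function."""
    return math.dist(u, v)


def sketch_euclidean_metric(features: Sequence[Sequence[float]]) -> Union[str, Callable]:
    """Pick a Euclidean implementation from the number of rows (N)."""
    n_rows = len(features)
    if n_rows > ROW_COUNT_CUTOFF:
        # Large dataset: the string selects the faster built-in implementation.
        return "euclidean"
    # Small dataset: return a callable (slower, but more precise per the docs).
    return pairwise_euclidean


def sketch_default_metric(features: Sequence[Sequence[float]]) -> Union[str, Callable]:
    """Pick the default KNN metric from the shape (N, M) of the features."""
    n_cols = len(features[0])
    if n_cols > HIGH_DIMENSION_CUTOFF:
        # High-dimensional data: cosine is used regardless of row count.
        return "cosine"
    return sketch_euclidean_metric(features)


wide = [[0.0] * 4 for _ in range(5)]        # M = 4 > 3
large = [[0.0, 0.0] for _ in range(101)]    # M = 2, N = 101 > 100
small = [[0.0, 0.0] for _ in range(100)]    # M = 2, N = 100 (not greater)

print(sketch_default_metric(wide))
print(sketch_default_metric(large))
print(callable(sketch_default_metric(small)))
```

Both comparisons are strict, so a feature array with exactly 3 columns or exactly 100 rows falls on the lower side of the cutoff. The row-count check only matters once the column check has already ruled out the cosine metric.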