knn_graph#

Data:

DEFAULT_K

Default number of neighbors to consider in the k-nearest neighbors search, unless the size of the feature array is too small or the user specifies a different value.

Functions:

features_to_knn(features, *[, n_neighbors, ...])

Build and fit a k-nearest neighbors search object from an array of numerical features.

construct_knn_graph_from_index(knn[, ...])

Construct a sparse distance matrix representation of KNN graph out of a fitted NearestNeighbors search object.

create_knn_graph_and_index(features, *[, ...])

Calculate the KNN graph from the features if it is not provided in the kwargs.

correct_knn_graph(features, knn_graph)

Corrects a k-nearest neighbors (KNN) graph by handling exact duplicates in the feature array.

correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace(...)

Corrects the distances and indices arrays of k-nearest neighbors (KNN) graphs by handling sets of exact duplicates explicitly.

correct_knn_distances_and_indices(features, ...)

Corrects the distances and indices of a k-nearest neighbors (KNN) graph based on all exact duplicates detected in the feature array.

cleanlab.internal.neighbor.knn_graph.DEFAULT_K = 10#

Default number of neighbors to consider in the k-nearest neighbors search, unless the size of the feature array is too small or the user specifies a different value.

This should be the largest desired value of k for all desired issue types that require a KNN graph.

E.g. if near duplicates wants k=1 but outliers wants 10, then DEFAULT_K should be 10. This way, all issue types can rely on the same KNN graph.

cleanlab.internal.neighbor.knn_graph.features_to_knn(features, *, n_neighbors=None, metric=None, **sklearn_knn_kwargs)[source]#

Build and fit a k-nearest neighbors search object from an array of numerical features.

Parameters:
  • features (Optional[ndarray]) – The input feature array, with shape (N, M), where N is the number of samples and M is the number of features.

  • n_neighbors (Optional[int]) – The number of nearest neighbors to consider. If None, a default value is determined based on the feature array size.

  • metric (Union[str, Callable, None]) – The distance metric to use for computing distances between points. If None, the metric is determined based on the feature array shape.

  • **sklearn_knn_kwargs – Additional keyword arguments to be passed to the search index constructor.

Return type:

NearestNeighbors

Returns:

knn – A k-nearest neighbors search object fitted to the input feature array.

Examples

>>> import numpy as np
>>> from cleanlab.internal.neighbor import features_to_knn
>>> features = np.random.rand(100, 10)
>>> knn = features_to_knn(features)
>>> knn
NearestNeighbors(metric='cosine', n_neighbors=10)
cleanlab.internal.neighbor.knn_graph.construct_knn_graph_from_index(knn, correction_features=None)[source]#

Construct a sparse distance matrix representation of KNN graph out of a fitted NearestNeighbors search object.

Parameters:
  • knn (NearestNeighbors) – A NearestNeighbors object that has been fitted to a feature array. The KNN graph is constructed based on the distances and indices of each feature row’s nearest neighbors.

  • correction_features (Optional[ndarray]) –

    The input feature array used to fit the NearestNeighbors object. If provided, the function the distances and indices of the neighbors will be corrected based on exact duplicates in the feature array. If not provided, no correction will be applied.

    Warning

    This function is designed to handle a specific case where a KNN index is used to construct a KNN graph by itself, and there is a need to detect and correct for exact duplicates in the feature array. However, relying on this function for such corrections is generally discouraged. There are other functions in the module that handle KNN graph construction with feature corrections in a more flexible and robust manner. Use this function only when there is a special need to correct distances and indices based on the feature array provided.

Return type:

csr_matrix

Returns:

knn_graph – A sparse, weighted adjacency matrix representing the KNN graph of the feature array.

Note

This is not intended to construct a KNN graph of test data. It is only used to construct a KNN graph of the data used to fit the NearestNeighbors object.

Examples

>>> import numpy as np
>>> from cleanlab.internal.neighbor.knn_graph import features_to_knn, construct_knn_graph_from_index
>>> features = np.array([
...    [0.701, 0.701],
...    [0.900, 0.436],
...    [0.000, 1.000],
... ])
>>> knn = features_to_knn(features, n_neighbors=1)
>>> knn_graph = construct_knn_graph_from_index(knn)
>>> knn_graph.toarray()  # For demonstration purposes only. It is generally a bad idea to transform to dense matrix for large graphs.
array([[0.        , 0.33140006, 0.        ],
       [0.33140006, 0.        , 0.        ],
       [0.76210367, 0.        , 0.        ]])
cleanlab.internal.neighbor.knn_graph.create_knn_graph_and_index(features, *, n_neighbors=None, metric=None, correct_exact_duplicates=True, **sklearn_knn_kwargs)[source]#

Calculate the KNN graph from the features if it is not provided in the kwargs.

Parameters:
  • features (Optional[ndarray]) – The input feature array, with shape (N, M), where N is the number of samples and M is the number of features.

  • n_neighbors (Optional[int]) – The number of nearest neighbors to consider. If None, a default value is determined based on the feature array size.

  • metric (Union[str, Callable, None]) – The distance metric to use for computing distances between points. If None, the metric is determined based on the feature array shape.

  • correct_exact_duplicates (bool) – Whether to correct the KNN graph to ensure that exact duplicates have zero mutual distance, and they are correctly included in the KNN graph.

  • **sklearn_knn_kwargs – Additional keyword arguments to be passed to the search index constructor.

Raises:

ValueError : – If features is None, as it’s required to construct a KNN graph from scratch.

Return type:

Tuple[csr_matrix, NearestNeighbors]

Returns:

  • knn_graph – A sparse, weighted adjacency matrix representing the KNN graph of the feature array.

  • knn – A k-nearest neighbors search object fitted to the input feature array. This object can be used to query the nearest neighbors of new data points.

Examples

>>> import numpy as np
>>> from cleanlab.internal.neighbor.knn_graph import create_knn_graph_and_index
>>> features = np.array([
...    [0.701, 0.701],
...    [0.900, 0.436],
...    [0.000, 1.000],
... ])
>>> knn_graph, knn = create_knn_graph_and_index(features, n_neighbors=1)
>>> knn_graph.toarray()  # For demonstration purposes only. It is generally a bad idea to transform to dense matrix for large graphs.
array([[0.        , 0.33140006, 0.        ],
       [0.33140006, 0.        , 0.        ],
       [0.76210367, 0.        , 0.        ]])
>>> knn
NearestNeighbors(metric=<function euclidean at ...>, n_neighbors=1)  # For demonstration purposes only. The actual metric may vary.
cleanlab.internal.neighbor.knn_graph.correct_knn_graph(features, knn_graph)[source]#

Corrects a k-nearest neighbors (KNN) graph by handling exact duplicates in the feature array.

This utility function takes a precomputed KNN graph and the corresponding feature array, identifies sets of exact duplicate feature vectors, and corrects the KNN graph to properly reflect these duplicates. The corrected KNN graph is returned as a sparse CSR matrix.

Parameters:
  • features (np.ndarray) – The input feature array, with shape (N, M), where N is the number of samples and M is the number of features.

  • knn_graph (csr_matrix) – A sparse matrix of shape (N, N) representing the k-nearest neighbors graph. The graph is expected to be in CSR (Compressed Sparse Row) format.

Return type:

csr_matrix

Returns:

csr_matrix – A corrected KNN graph in CSR format with adjusted distances and indices to properly handle exact duplicates in the feature array.

Notes

  • This function assumes that the input knn_graph is already computed and provided in CSR format.

  • The function modifies the KNN graph to ensure that exact duplicates are represented with zero distance and correctly updated neighbor indices.

  • This function is useful for post-processing a KNN graph when exact duplicates were not handled during the initial KNN computation.

cleanlab.internal.neighbor.knn_graph.correct_knn_distances_and_indices_with_exact_duplicate_sets_inplace(distances, indices, exact_duplicate_sets)[source]#

Corrects the distances and indices arrays of k-nearest neighbors (KNN) graphs by handling sets of exact duplicates explicitly. This function modifies the input arrays in-place.

This function ensures that exact duplicates are correctly represented in the KNN graph. It modifies the distances and indices arrays so that each set of exact duplicates points to itself with zero distance, and adjusts the nearest neighbors accordingly.

Parameters:
  • distances (ndarray) – A 2D array of shape (N, k) representing the distances between each point of the N points and their k-nearest neighbors. This array will be modified in-place to reflect the corrections for exact duplicates (whose mutual distances are explicitly set to zero).

  • indices (ndarray) – A 2D array of shape (N, k) representing the indices of the nearest neighbors for each of the N points. This array will be modified in-place to reflect the corrections for exact duplicates.

  • exact_duplicate_sets (List[ndarray]) – A list of 1D arrays, each containing the indices of points that are exact duplicates of each other. These sets will be used to correct the KNN graph by ensuring that duplicates are reflected as nearest neighbors with zero distance.

  • Overview (High-Level) –

  • -------------------

  • k (The function operates in two main scenarios based on the size of the duplicate sets relative to) –

  • 1 (2. Duplicate Set Size < k +) –

    • All nearest neighbors are exact duplicates.

    • The indices array is updated such that the first k+1 entries for each duplicate set point are used to represent the nearest neighbors

      of all points in the duplicate set.

    • The rows of the distances array belonging to the duplicate set are set to zero.

  • 1

    • Some of the nearest neighbors are not exact duplicates.

    • Non-duplicate neighbors are shifted to the back of the list.

    • The indices and distances arrays are updated accordingly to reflect the duplicates at the front with zero distance.

  • Considerations (User) –

  • -------------------

  • Validity (- Input) –

  • Modifications (- In-Place) –

  • Size (- Duplicate Set) –

  • Performance (-) –

  • Capabilities

  • ------------

  • efficiently (- Handles exact duplicate sets) –

  • representation. (ensuring correct KNN graph) –

  • duplicates. (- Adjusts neighbor indices to reflect the presence of) –

  • duplicates.

  • Limitations

  • -----------

  • graph. (- Assumes that the input arrays (distances and indices) come from a precomputed KNN) –

  • neighbors. (- Does not handle near-duplicates or merge non-duplicate) –

  • misidentification. (- Requires careful construction of exact_duplicate_sets to avoid) –

Return type:

None

cleanlab.internal.neighbor.knn_graph.correct_knn_distances_and_indices(features, distances, indices, exact_duplicate_sets=None)[source]#

Corrects the distances and indices of a k-nearest neighbors (KNN) graph based on all exact duplicates detected in the feature array.

Parameters:
  • features (ndarray) – The feature array used to construct the KNN graph.

  • distances (ndarray) – The distances between each point and its k nearest neighbors.

  • indices (ndarray) – The indices of the k nearest neighbors for each point.

  • exact_duplicate_sets (Optional[List[ndarray]]) – A list of numpy arrays, where each array contains the indices of exact duplicates in the feature array. If not provided, it will be computed from the feature array.

Return type:

tuple[ndarray, ndarray]

Returns:

  • corrected_distances – The corrected distances between each point and its k nearest neighbors. Exact duplicates (based on the feature array) are ensured to have zero mutual distance.

  • corrected_indices – The corrected indices of the k nearest neighbors for each point. Exact duplicates are ensured to be included in the k nearest neighbors, unless the number of exact duplicates exceeds k.

Example

>>> import numpy as np
>>> X = np.array(
...     [
...         [0, 0],
...         [0, 0], # Exact duplicate of the previous point
...         [1, 1], # The distances between this point and the others is sqrt(2) (equally distant from both)
...     ]
... )
>>> distances = np.array(  # Distance to the 1-NN of each point
...     [
...         [np.sqrt(2)],  # Should be [0]
...         [1e-16],       # Should be [0]
...         [np.sqrt(2)],
...     ]
... )
>>> indices = np.array(  # Index of the 1-NN of each point
...     [
...         [2],  # Should be [1]
...         [0],
...         [1],  # Might be [0] or [1]
...     ]
... )
>>> corrected_distances, corrected_indices = correct_knn_distances_and_indices(X, distances, indices)
>>> corrected_distances
array([[0.], [0.], [1.41421356]])
>>> corrected_indices
array([[1], [0], [0]])