underperforming_group

Classes:

UnderperformingGroupIssueManager(datalab[, ...])

Manages issues related to underperforming group examples.

class cleanlab.datalab.internal.issue_manager.underperforming_group.UnderperformingGroupIssueManager(datalab, metric=None, threshold=0.1, k=10, clustering_kwargs={}, min_cluster_samples=5, **_)

Bases: IssueManager

Manages issues related to underperforming group examples.

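Note: Do not confuse the min_cluster_samples argument with the min_samples argument of sklearn.cluster.DBSCAN. min_cluster_samples is the minimum number of datapoints a cluster must contain to be kept by filter_cluster_ids, whereas min_samples is a parameter of DBSCAN itself.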

Examples

>>> from cleanlab import Datalab
>>> import numpy as np
>>> X = np.random.normal(size=(50, 2))
>>> y = np.random.randint(2, size=50)
>>> pred_probs = np.exp(X) / np.exp(X).sum(axis=1, keepdims=True)  # softmax, so each row is a valid probability vector
>>> data = {"X": X, "y": y}
>>> lab = Datalab(data, label_name="y")
>>> issue_types={"underperforming_group": {"clustering_kwargs": {"eps": 0.5}}}
>>> lab.find_issues(pred_probs=pred_probs, features=X, issue_types=issue_types)

Attributes:

description

Short text that summarizes the type of issues handled by this IssueManager.

issue_name

Returns a key that is used to store issue summary results about the assigned Lab.

verbosity_levels

A dictionary mapping each verbosity level to the list of report items to print at that level.

OUTLIER_CLUSTER_LABELS

Specifies labels considered as outliers by the clustering algorithm.

NO_UNDERPERFORMING_CLUSTER_ID

Constant to signify absence of any underperforming cluster.

issue_score_key

Returns a key that is used to store issue score results about the assigned Lab.

info

issues

summary

Methods:

find_issues(pred_probs[, features, cluster_ids])

Finds occurrences of this particular issue in the dataset.

perform_clustering(knn_graph)

Perform clustering of datapoints using a knn graph as distance matrix.

filter_cluster_ids(cluster_ids)

Remove outlier clusters and return the IDs of clusters with at least self.min_cluster_samples datapoints.

get_underperforming_clusters(cluster_ids, ...)

Get ID and quality score of each underperforming cluster.

collect_info(knn_graph, n_clusters, ...)

Collects data for the info attribute of the Datalab.

make_summary(score)

Construct a summary dataframe.

report(issues, summary, info[, ...])

Compose a report of the issues found by this IssueManager.

description: ClassVar[str]

Short text that summarizes the type of issues handled by this IssueManager.

issue_name: ClassVar[str] = 'underperforming_group'

Returns a key that is used to store issue summary results about the assigned Lab.

verbosity_levels: ClassVar[Dict[int, List[str]]]

A dictionary mapping each verbosity level to the list of report items to print at that level.

Example

>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }

OUTLIER_CLUSTER_LABELS: ClassVar[Tuple[int]] = (-1,)

Specifies labels considered as outliers by the clustering algorithm.

NO_UNDERPERFORMING_CLUSTER_ID: ClassVar[int] = -2

Constant to signify absence of any underperforming cluster.

find_issues(pred_probs, features=None, cluster_ids=None, **kwargs)

Finds occurrences of this particular issue in the dataset.

Computes the issues and summary dataframes. Calls collect_info to compute the info dict.

Return type:

None

perform_clustering(knn_graph)

Perform clustering of datapoints using a knn graph as distance matrix.

Return type:

ndarray[Any, dtype[int64]]

Args:

knn_graph (csr_matrix): Sparse Distance Matrix.

Returns:

cluster_ids (npt.NDArray[np.int_]): Cluster IDs for each datapoint.
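As a rough illustration of clustering on a precomputed sparse distance matrix, the sketch below builds a k-nearest-neighbor distance graph with scikit-learn and passes it to DBSCAN with metric="precomputed". The toy data, the eps and min_samples values, and the use of kneighbors_graph to build the graph are assumptions made for this example; it is not necessarily how this method is implemented internally.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import kneighbors_graph

# Toy data (assumption): two well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=10.0, scale=0.3, size=(30, 2))
X = np.vstack([blob_a, blob_b])

# Sparse (CSR) matrix holding the distance from each point to its 10 nearest neighbors.
knn_graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# With metric="precomputed", DBSCAN treats the stored entries as distances.
# eps and min_samples are illustrative values, not library defaults.
clusterer = DBSCAN(metric="precomputed", eps=1.5, min_samples=5)
cluster_ids = clusterer.fit(knn_graph.copy()).labels_

# One integer per datapoint; -1 marks points that DBSCAN treats as noise.
print(cluster_ids.shape)
```

Because the graph stores only each point's nearest neighbors, points from the two distant blobs are never neighbors and so cannot end up in the same cluster.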

filter_cluster_ids(cluster_ids)

Remove outlier clusters and return the IDs of clusters with at least self.min_cluster_samples datapoints.

Return type:

ndarray[Any, dtype[int64]]

Args:

cluster_ids (npt.NDArray[np.int_]): Cluster IDs for each datapoint.

Returns:

unique_cluster_ids (npt.NDArray[np.int_]): Unique cluster IDs that remain after removing outlier clusters and clusters with fewer than self.min_cluster_samples datapoints.
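The filtering described above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the library's code: filter_cluster_ids_sketch is a hypothetical standalone function, and the outlier label (-1) and the default of 5 are taken from the class constants and signature shown on this page.

```python
import numpy as np

OUTLIER_CLUSTER_LABELS = (-1,)  # same value as the class constant documented here


def filter_cluster_ids_sketch(cluster_ids, min_cluster_samples=5):
    """Hypothetical helper: keep IDs of non-outlier clusters that are large enough."""
    unique_ids, counts = np.unique(cluster_ids, return_counts=True)
    is_outlier = np.isin(unique_ids, OUTLIER_CLUSTER_LABELS)
    keep = (counts >= min_cluster_samples) & ~is_outlier
    return unique_ids[keep]


# Cluster 0 has 6 points, cluster 1 has 3, cluster 2 has 5, and 7 points are outliers.
cluster_ids = np.array([0] * 6 + [1] * 3 + [2] * 5 + [-1] * 7)
kept = filter_cluster_ids_sketch(cluster_ids)
print(kept)  # [0 2]
```

Cluster 1 is dropped for being too small, and label -1 is dropped even though it covers 7 points, because it marks outliers rather than a real cluster.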

get_underperforming_clusters(cluster_ids, unique_cluster_ids, labels, pred_probs)

Get ID and quality score of each underperforming cluster.

Return type:

Tuple[Dict[int, float], int, float]

Args:

cluster_ids (npt.NDArray[np.int_]): Cluster IDs corresponding to each sample

unique_cluster_ids (npt.NDArray[np.int_]): Unique cluster IDs excluding noisy clusters

labels (npt.NDArray): Label of each sample

pred_probs (npt.NDArray): Prediction probability

Returns:

Tuple[Dict[int, float], int, float]: (Cluster IDs and their scores, Worst Cluster ID, Worst Cluster Quality Score)
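To make the shape of the returned tuple concrete, here is a hedged sketch that scores each cluster and picks the worst one. It assumes that a cluster's quality is the mean predicted probability of each sample's given label; the actual scoring used by this method may differ. cluster_scores_sketch is a hypothetical name, and the sketch expects at least one cluster ID.

```python
import numpy as np


def cluster_scores_sketch(cluster_ids, unique_cluster_ids, labels, pred_probs):
    """Hypothetical helper returning (scores per cluster, worst cluster ID, worst score)."""
    # Assumed per-sample quality: the probability the model assigns to the given label.
    quality = pred_probs[np.arange(len(labels)), labels]
    scores = {
        int(cid): float(quality[cluster_ids == cid].mean())
        for cid in unique_cluster_ids
    }
    worst_id = min(scores, key=scores.get)  # raises ValueError if there are no clusters
    return scores, worst_id, scores[worst_id]


cluster_ids = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([0, 0, 1, 1, 0, 1])
pred_probs = np.array(
    [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.7, 0.3], [0.5, 0.5]]
)

scores, worst_id, worst_score = cluster_scores_sketch(
    cluster_ids, np.array([0, 1]), labels, pred_probs
)
print(worst_id)  # 1
```

In this toy input the model is less confident in the given labels of cluster 1 than in those of cluster 0, so cluster 1 is reported as the worst cluster.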

collect_info(knn_graph, n_clusters, cluster_ids, performed_clustering, worst_cluster_id)

Collects data for the info attribute of the Datalab.

Return type:

Dict[str, Any]

Note

This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.

issue_score_key: ClassVar[str] = 'underperforming_group_score'

Returns a key that is used to store issue score results about the assigned Lab.

classmethod make_summary(score)

Construct a summary dataframe.

Parameters:

score (float) – The overall score for this issue.

Return type:

DataFrame

Returns:

summary – A summary dataframe.

classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)

Compose a report of the issues found by this IssueManager.

Parameters:
  • issues (DataFrame) –

    An issues dataframe.

    Example

    >>> import pandas as pd
    >>> issues = pd.DataFrame(
    ...     {
    ...         "is_X_issue": [True, False, True],
    ...         "X_score": [0.2, 0.9, 0.4],
    ...     },
    ... )
    

  • summary (DataFrame) –

    The summary dataframe.

    Example

    >>> summary = pd.DataFrame(
    ...     {
    ...         "issue_type": ["X"],
    ...         "score": [0.5],
    ...     },
    ... )
    

  • info (Dict[str, Any]) –

    The info dict.

    Example

    >>> info = {
    ...     "A": "val_A",
    ...     "B": ["val_B1", "val_B2"],
    ... }
    

  • num_examples (int) – The number of examples to print.

  • verbosity (int) – The verbosity level of the report.

  • include_description (bool) – Whether to include a description of the issue in the report.

Return type:

str

Returns:

report_str – A string containing the report.

info: Dict[str, Any]

issues: pd.DataFrame

summary: pd.DataFrame