underperforming_group
Classes:
| UnderperformingGroupIssueManager | Manages issues related to underperforming group examples. |
- class cleanlab.datalab.internal.issue_manager.underperforming_group.UnderperformingGroupIssueManager(datalab, metric=None, threshold=0.1, k=10, clustering_kwargs={}, min_cluster_samples=5, **_)
- Bases: IssueManager
- Manages issues related to underperforming group examples.
- Note: The min_cluster_samples argument should not be confused with the min_samples argument of sklearn.cluster.DBSCAN.
- Examples
>>> from cleanlab import Datalab
>>> import numpy as np
>>> X = np.random.normal(size=(50, 2))
>>> y = np.random.randint(2, size=50)
>>> # Softmax over the features so that each row is a valid probability vector
>>> exp_X = np.exp(X)
>>> pred_probs = exp_X / exp_X.sum(axis=1, keepdims=True)
>>> data = {"X": X, "y": y}
>>> lab = Datalab(data, label_name="y")
>>> issue_types = {"underperforming_group": {"clustering_kwargs": {"eps": 0.5}}}
>>> lab.find_issues(pred_probs=pred_probs, features=X, issue_types=issue_types)
- Attributes:
- description: Short text that summarizes the type of issues handled by this IssueManager.
- issue_name: Returns a key that is used to store issue summary results about the assigned Lab.
- verbosity_levels: A dictionary of verbosity levels and their corresponding lists of report items to print.
- OUTLIER_CLUSTER_LABELS: Specifies labels considered as outliers by the clustering algorithm.
- NO_UNDERPERFORMING_CLUSTER_ID: Constant to signify absence of any underperforming cluster.
- issue_score_key: Returns a key that is used to store issue score results about the assigned Lab.
- Methods:
- find_issues(pred_probs[, features, cluster_ids]): Finds occurrences of this particular issue in the dataset.
- perform_clustering(knn_graph): Perform clustering of datapoints using a knn graph as distance matrix.
- filter_cluster_ids(cluster_ids): Remove outlier clusters and return IDs of clusters with at least self.min_cluster_samples number of datapoints.
- get_underperforming_clusters(cluster_ids, ...): Get ID and quality score of each underperforming cluster.
- collect_info(knn_graph, n_clusters, ...): Collects data for the info attribute of the Datalab.
- make_summary(score): Construct a summary dataframe.
- report(issues, summary, info[, ...]): Compose a report of the issues found by this IssueManager.
- description: ClassVar[str]
- Short text that summarizes the type of issues handled by this IssueManager. 
 - issue_name: ClassVar[str] = 'underperforming_group'
- Returns a key that is used to store issue summary results about the assigned Lab. 
 - verbosity_levels: ClassVar[Dict[int, List[str]]]
- A dictionary of verbosity levels and their corresponding lists of report items to print.
- Example
>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }
 - OUTLIER_CLUSTER_LABELS: ClassVar[Tuple[int]] = (-1,)
- Specifies labels considered as outliers by the clustering algorithm. 
 - NO_UNDERPERFORMING_CLUSTER_ID: ClassVar[int] = -2
- Constant to signify absence of any underperforming cluster. 
 - find_issues(pred_probs, features=None, cluster_ids=None, **kwargs)
- Finds occurrences of this particular issue in the dataset.
- Computes the issues and summary dataframes. Calls collect_info to compute the info dict.
- Return type:
- None
 
 - perform_clustering(knn_graph)
- Perform clustering of datapoints using a knn graph as distance matrix.
- Return type:
- ndarray[Any, dtype[int64]]
 - Args:
- knn_graph (csr_matrix): Sparse Distance Matrix. 
- Returns:
- cluster_ids (npt.NDArray[np.int_]): Cluster IDs for each datapoint. 
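As a rough illustration of clustering on a precomputed sparse distance matrix, the sketch below builds a kNN distance graph with scikit-learn and passes it to DBSCAN. It is a minimal sketch under stated assumptions, not the library's implementation: the helper name `sketch_perform_clustering`, the toy data, and the `eps` and `min_samples` values are all chosen here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import kneighbors_graph


def sketch_perform_clustering(knn_graph, **clustering_kwargs):
    """Hypothetical helper: cluster datapoints from a sparse kNN distance graph."""
    # metric="precomputed" tells DBSCAN to treat the input as pairwise distances.
    clusterer = DBSCAN(metric="precomputed", **clustering_kwargs)
    return clusterer.fit(knn_graph).labels_


# Toy data (assumed): two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# Sparse CSR matrix holding the distance from each point to its 10 nearest neighbors.
knn_graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# One integer cluster ID per datapoint; DBSCAN uses -1 for outliers.
cluster_ids = sketch_perform_clustering(knn_graph, eps=1.0, min_samples=5)
```

Because the graph only stores distances to each point's nearest neighbors, points in different blobs are never connected and cannot end up in the same cluster.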
 
 - filter_cluster_ids(cluster_ids)
- Remove outlier clusters and return the IDs of clusters that contain at least self.min_cluster_samples datapoints.
- Return type:
- ndarray[Any, dtype[int64]]
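The filtering rule described above can be sketched with NumPy alone. This is an illustrative sketch, not the library's code: the helper name `sketch_filter_cluster_ids` is hypothetical, and the outlier label tuple is assumed to mirror the `OUTLIER_CLUSTER_LABELS` constant documented on this page.

```python
import numpy as np

# Assumed to mirror the class constant documented on this page.
OUTLIER_CLUSTER_LABELS = (-1,)


def sketch_filter_cluster_ids(cluster_ids, min_cluster_samples=5):
    """Hypothetical helper: keep IDs of non-outlier clusters that are large enough."""
    unique_ids, counts = np.unique(cluster_ids, return_counts=True)
    is_outlier = np.isin(unique_ids, OUTLIER_CLUSTER_LABELS)
    is_large_enough = counts >= min_cluster_samples
    return unique_ids[~is_outlier & is_large_enough]


# Cluster 0 has 6 points, cluster 1 has 2 points, and -1 marks 5 outliers.
cluster_ids = np.array([0] * 6 + [1] * 2 + [-1] * 5)
kept_ids = sketch_filter_cluster_ids(cluster_ids, min_cluster_samples=5)
```

Here only cluster 0 survives: cluster 1 is too small, and the outlier label is dropped even though it covers 5 points.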
 
 - get_underperforming_clusters(cluster_ids, unique_cluster_ids, labels, pred_probs)
- Get ID and quality score of each underperforming cluster.
- Return type:
- Tuple[Dict[int, float], int, float]
 - Args:
- cluster_ids (npt.NDArray[np.int_]): Cluster IDs corresponding to each sample.
- unique_cluster_ids (npt.NDArray[np.int_]): Unique cluster IDs excluding noisy clusters.
- labels (npt.NDArray): Label of each sample.
- pred_probs (npt.NDArray): Predicted probabilities for each sample.
- Returns:
- Tuple[Dict[int, float], int, float]: (Cluster IDs and their scores, Worst Cluster ID, Worst Cluster Quality Score) 
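To make the shape of the return value concrete, the sketch below scores each cluster and picks the worst one. The quality metric is an assumption: this page does not state how cluster quality is computed, so the sketch uses the mean probability that the model assigns to each sample's given label. The helper name `sketch_cluster_scores` and the toy arrays are likewise hypothetical.

```python
import numpy as np


def sketch_cluster_scores(cluster_ids, unique_cluster_ids, labels, pred_probs):
    """Hypothetical helper: score each cluster and report the worst one."""
    # Assumed quality metric: the probability assigned to each sample's given label.
    per_sample_quality = pred_probs[np.arange(len(labels)), labels]
    cluster_scores = {
        int(cid): float(per_sample_quality[cluster_ids == cid].mean())
        for cid in unique_cluster_ids
    }
    # The worst cluster is the one with the lowest mean quality.
    worst_cluster_id = min(cluster_scores, key=cluster_scores.get)
    return cluster_scores, worst_cluster_id, cluster_scores[worst_cluster_id]


# Toy data (assumed): cluster 0 is predicted well, cluster 1 is predicted poorly.
cluster_ids = np.array([0, 0, 0, 1, 1, 1])
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([[0.9, 0.1]] * 3 + [[0.6, 0.4]] * 3)

scores, worst_id, worst_score = sketch_cluster_scores(
    cluster_ids, np.unique(cluster_ids), labels, pred_probs
)
```

The returned triple matches the documented layout: a dict of per-cluster scores, the ID of the worst cluster, and that cluster's score.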
 
 - collect_info(knn_graph, n_clusters, cluster_ids, performed_clustering, worst_cluster_id)
- Collects data for the info attribute of the Datalab.
- Return type:
- Dict[str, Any]
- Note: This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.
 - issue_score_key: ClassVar[str] = 'underperforming_group_score'
- Returns a key that is used to store issue score results about the assigned Lab. 
 - classmethod make_summary(score)
- Construct a summary dataframe.
- Parameters:
- score (float) – The overall score for this issue.
- Return type:
- DataFrame
- Returns:
- summary – A summary dataframe.
 
 - classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)
- Compose a report of the issues found by this IssueManager.
- Parameters:
- issues (DataFrame) – An issues dataframe.
  Example
  >>> import pandas as pd
  >>> issues = pd.DataFrame(
  ...     {
  ...         "is_X_issue": [True, False, True],
  ...         "X_score": [0.2, 0.9, 0.4],
  ...     },
  ... )
- summary (DataFrame) – The summary dataframe.
  Example
  >>> summary = pd.DataFrame(
  ...     {
  ...         "issue_type": ["X"],
  ...         "score": [0.5],
  ...     },
  ... )
- info (Dict[str, Any]) – The info dict.
  Example
  >>> info = {
  ...     "A": "val_A",
  ...     "B": ["val_B1", "val_B2"],
  ... }
- num_examples (int) – The number of examples to print.
- verbosity (int) – The verbosity level of the report.
- include_description (bool) – Whether to include a description of the issue in the report.
 
- Return type:
- str
- Returns:
- report_str – A string containing the report.
 
 - info: Dict[str, Any]
 - issues: pd.DataFrame
 - summary: pd.DataFrame