underperforming_group
Classes:
UnderperformingGroupIssueManager: Manages issues related to underperforming group examples.
- class cleanlab.datalab.internal.issue_manager.underperforming_group.UnderperformingGroupIssueManager(datalab, metric=None, threshold=0.1, k=10, clustering_kwargs={}, min_cluster_samples=5, **_)
Bases: IssueManager
Manages issues related to underperforming group examples.
Note: The min_cluster_samples argument should not be confused with the min_samples argument of sklearn.cluster.DBSCAN.
Examples
>>> from cleanlab import Datalab
>>> import numpy as np
>>> X = np.random.normal(size=(50, 2))
>>> y = np.random.randint(2, size=50)
>>> pred_probs = X / X.sum(axis=1, keepdims=True)
>>> data = {"X": X, "y": y}
>>> lab = Datalab(data, label_name="y")
>>> issue_types = {"underperforming_group": {"clustering_kwargs": {"eps": 0.5}}}
>>> lab.find_issues(pred_probs=pred_probs, features=X, issue_types=issue_types)
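To make the note about min_cluster_samples concrete, the sketch below builds an issue_types configuration in the same form as the example above. The numeric values are illustrative assumptions. min_cluster_samples is an argument of this issue manager, while min_samples inside clustering_kwargs is assumed here to be forwarded to the clustering algorithm.

```python
# Hypothetical configuration; the numeric values are illustrative only.
issue_types = {
    "underperforming_group": {
        # Argument of UnderperformingGroupIssueManager:
        # clusters with fewer datapoints than this are filtered out.
        "min_cluster_samples": 5,
        # Assumed to be forwarded to the clustering algorithm (e.g. DBSCAN).
        "clustering_kwargs": {"eps": 0.5, "min_samples": 3},
    }
}

group_options = issue_types["underperforming_group"]
# The two settings live at different levels and are independent of each other.
print(group_options["min_cluster_samples"])
print(group_options["clustering_kwargs"]["min_samples"])
```

Such a dictionary would be passed as the issue_types argument of lab.find_issues, as in the example above.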
Attributes:
description: Short text that summarizes the type of issues handled by this IssueManager.
issue_name: Returns a key that is used to store issue summary results about the assigned Lab.
verbosity_levels: A dictionary of verbosity levels and their corresponding lists of report items to print.
OUTLIER_CLUSTER_LABELS: Specifies labels considered as outliers by the clustering algorithm.
NO_UNDERPERFORMING_CLUSTER_ID: Constant to signify absence of any underperforming cluster.
issue_score_key: Returns a key that is used to store issue score results about the assigned Lab.
Methods:
find_issues(pred_probs[, features, cluster_ids]): Finds occurrences of this particular issue in the dataset.
perform_clustering(knn_graph): Perform clustering of datapoints using a knn graph as distance matrix.
filter_cluster_ids(cluster_ids): Remove outlier clusters and return IDs of clusters with at least self.min_cluster_samples number of datapoints.
get_underperforming_clusters(cluster_ids, ...): Get ID and quality score of each underperforming cluster.
collect_info(knn_graph, n_clusters, ...): Collects data for the info attribute of the Datalab.
make_summary(score): Construct a summary dataframe.
report(issues, summary, info[, ...]): Compose a report of the issues found by this IssueManager.
- description: ClassVar[str]
Short text that summarizes the type of issues handled by this IssueManager.
- issue_name: ClassVar[str] = 'underperforming_group'
Returns a key that is used to store issue summary results about the assigned Lab.
- verbosity_levels: ClassVar[Dict[int, List[str]]]
A dictionary of verbosity levels and their corresponding lists of report items to print.
Example
>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }
- OUTLIER_CLUSTER_LABELS: ClassVar[Tuple[int]] = (-1,)
Specifies labels considered as outliers by the clustering algorithm.
- NO_UNDERPERFORMING_CLUSTER_ID: ClassVar[int] = -2
Constant to signify absence of any underperforming cluster.
- find_issues(pred_probs, features=None, cluster_ids=None, **kwargs)
Finds occurrences of this particular issue in the dataset.
Computes the issues and summary dataframes. Calls collect_info to compute the info dict.
- Return type:
None
- perform_clustering(knn_graph)
Perform clustering of datapoints using a knn graph as distance matrix.
- Return type:
ndarray[Any, dtype[int64]]
- Args:
knn_graph (csr_matrix): Sparse Distance Matrix.
- Returns:
cluster_ids (npt.NDArray[np.int_]): Cluster IDs for each datapoint.
- filter_cluster_ids(cluster_ids)
Remove outlier clusters and return IDs of clusters with at least self.min_cluster_samples number of datapoints.
- Return type:
ndarray[Any, dtype[int64]]
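The filtering step described for filter_cluster_ids can be sketched in plain Python as follows. This is a simplified illustration, not the library's implementation: the helper name filter_cluster_ids_sketch is hypothetical, and plain lists stand in for the NumPy arrays the method works with.

```python
from collections import Counter
from typing import List, Sequence, Tuple

OUTLIER_CLUSTER_LABELS: Tuple[int, ...] = (-1,)  # same value as the class constant


def filter_cluster_ids_sketch(
    cluster_ids: Sequence[int], min_cluster_samples: int = 5
) -> List[int]:
    """Return sorted IDs of non-outlier clusters with at least min_cluster_samples members."""
    counts = Counter(cluster_ids)  # number of datapoints per cluster ID
    return sorted(
        cid
        for cid, n in counts.items()
        if cid not in OUTLIER_CLUSTER_LABELS and n >= min_cluster_samples
    )


# Cluster 0 has 6 points, cluster 1 has 3, cluster 2 has 5; -1 marks outliers.
cluster_ids = [0] * 6 + [1] * 3 + [-1] * 2 + [2] * 5
kept = filter_cluster_ids_sketch(cluster_ids, min_cluster_samples=5)
print(kept)  # [0, 2]
```

Cluster 1 is dropped because it is too small, and the label -1 is dropped because it appears in OUTLIER_CLUSTER_LABELS.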
- get_underperforming_clusters(cluster_ids, unique_cluster_ids, labels, pred_probs)
Get ID and quality score of each underperforming cluster.
- Return type:
Tuple[Dict[int, float], int, float]
- Args:
cluster_ids (npt.NDArray[np.int_]): Cluster IDs corresponding to each sample.
unique_cluster_ids (npt.NDArray[np.int_]): Unique cluster IDs excluding noisy clusters.
labels (npt.NDArray): Label of each sample.
pred_probs (npt.NDArray): Prediction probability.
- Returns:
Tuple[Dict[int, float], int, float]: (Cluster IDs and their scores, Worst Cluster ID, Worst Cluster Quality Score)
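A rough sketch of how a return value of this shape could be produced is shown below. It rests on assumptions that this page does not confirm: the quality score of a cluster is taken to be the mean predicted probability of each sample's given label, and NO_UNDERPERFORMING_CLUSTER_ID with a score of 1.0 is returned when no cluster is available. The helper name is hypothetical, and the role of the threshold argument is not modelled.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

NO_UNDERPERFORMING_CLUSTER_ID = -2  # same value as the class constant


def underperforming_clusters_sketch(
    cluster_ids: Sequence[int],
    unique_cluster_ids: Sequence[int],
    labels: Sequence[int],
    pred_probs: Sequence[Sequence[float]],
) -> Tuple[Dict[int, float], int, float]:
    """Score each cluster (assumed: mean self-confidence) and pick the lowest one."""
    keep = set(unique_cluster_ids)
    per_cluster: Dict[int, List[float]] = defaultdict(list)
    for cid, label, probs in zip(cluster_ids, labels, pred_probs):
        if cid in keep:
            # Self-confidence: predicted probability of the sample's given label.
            per_cluster[cid].append(probs[label])
    scores = {cid: sum(vals) / len(vals) for cid, vals in per_cluster.items()}
    if not scores:
        # Assumed fallback when there is no cluster to evaluate.
        return {}, NO_UNDERPERFORMING_CLUSTER_ID, 1.0
    worst_id = min(scores, key=scores.get)
    return scores, worst_id, scores[worst_id]


cluster_ids = [0, 0, 1, 1, -1]
labels = [0, 1, 0, 1, 0]
pred_probs = [[0.75, 0.25], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [0.5, 0.5]]
scores, worst_id, worst_score = underperforming_clusters_sketch(
    cluster_ids, [0, 1], labels, pred_probs
)
print(scores, worst_id, worst_score)  # {0: 0.75, 1: 0.375} 1 0.375
```

The last sample carries the outlier label -1, which is absent from unique_cluster_ids, so it does not contribute to any cluster score.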
- collect_info(knn_graph, n_clusters, cluster_ids, performed_clustering, worst_cluster_id)
Collects data for the info attribute of the Datalab.
- Return type:
Dict[str, Any]
Note
This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.
- issue_score_key: ClassVar[str] = 'underperforming_group_score'
Returns a key that is used to store issue score results about the assigned Lab.
- classmethod make_summary(score)
Construct a summary dataframe.
- Parameters:
score (float) – The overall score for this issue.
- Return type:
DataFrame
- Returns:
summary – A summary dataframe.
- classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)
Compose a report of the issues found by this IssueManager.
- Parameters:
issues (DataFrame) – An issues dataframe.
Example
>>> import pandas as pd
>>> issues = pd.DataFrame(
...     {
...         "is_X_issue": [True, False, True],
...         "X_score": [0.2, 0.9, 0.4],
...     },
... )
summary (DataFrame) – The summary dataframe.
Example
>>> summary = pd.DataFrame(
...     {
...         "issue_type": ["X"],
...         "score": [0.5],
...     },
... )
info (Dict[str, Any]) – The info dict.
Example
>>> info = {
...     "A": "val_A",
...     "B": ["val_B1", "val_B2"],
... }
num_examples (int) – The number of examples to print.
verbosity (int) – The verbosity level of the report.
include_description (bool) – Whether to include a description of the issue in the report.
- Return type:
str
- Returns:
report_str – A string containing the report.
- info: Dict[str, Any]
- issues: pd.DataFrame
- summary: pd.DataFrame