issue_finder
Note
This module is not intended to be used directly by users. It is used by the cleanlab.datalab.datalab module.
Specifically, it is used by the Datalab.find_issues method.
Module for the IssueFinder class, which is responsible for configuring,
creating and running issue managers.
It determines which types of issues to look for, instantiates the IssueManagers
via a factory, runs the issue managers
(IssueManager.find_issues),
and collects the results into DataIssues.
Note
This module is not intended to be used directly. Instead, use the public-facing
Datalab.find_issues method.
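The four steps described above (determine, instantiate via a factory, run, collect) can be illustrated with a minimal, self-contained sketch. Every name here (`ToyIssueManager`, `TOY_REGISTRY`, `toy_factory`, `ToyIssueFinder`) is hypothetical and exists only for illustration; none of it is cleanlab's actual implementation, and a plain dictionary stands in for DataIssues.

```python
from typing import Any, Dict, List, Optional, Type


class ToyIssueManager:
    """Hypothetical base class standing in for an IssueManager."""

    issue_name = "toy"

    def __init__(self, data: List[float], **kwargs: Any) -> None:
        self.data = data
        self.kwargs = kwargs

    def find_issues(self) -> List[bool]:
        raise NotImplementedError


class NegativeValueManager(ToyIssueManager):
    issue_name = "negative"

    def find_issues(self) -> List[bool]:
        return [x < 0 for x in self.data]


class LargeValueManager(ToyIssueManager):
    issue_name = "large"

    def find_issues(self) -> List[bool]:
        threshold = self.kwargs.get("threshold", 100.0)
        return [x > threshold for x in self.data]


# Hypothetical registry mapping issue-type names to manager classes.
TOY_REGISTRY: Dict[str, Type[ToyIssueManager]] = {
    cls.issue_name: cls for cls in (NegativeValueManager, LargeValueManager)
}


def toy_factory(names: List[str]) -> List[Type[ToyIssueManager]]:
    """Look up the manager class for each requested issue type."""
    return [TOY_REGISTRY[name] for name in names]


class ToyIssueFinder:
    def __init__(self, data: List[float]) -> None:
        self.data = data
        self.collected: Dict[str, List[bool]] = {}  # stand-in for DataIssues

    def find_issues(
        self, issue_types: Optional[Dict[str, Dict[str, Any]]] = None
    ) -> None:
        # 1. Determine which issue types to look for.
        if issue_types is None:
            issue_types = {name: {} for name in TOY_REGISTRY}
        # 2. Instantiate the managers via the factory.
        names = list(issue_types)
        for name, manager_cls in zip(names, toy_factory(names)):
            manager = manager_cls(self.data, **issue_types[name])
            # 3. Run the manager and 4. collect its result.
            self.collected[name] = manager.find_issues()


finder = ToyIssueFinder([-1.0, 5.0, 250.0])
finder.find_issues({"negative": {}, "large": {"threshold": 200.0}})
print(finder.collected)
```

As in the real class, the toy `find_issues` returns nothing and stores its results on the finder instead.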
Classes:
| IssueFinder(datalab[, verbosity]) | The IssueFinder class is responsible for managing the process of identifying issues in the dataset by handling the creation and execution of relevant IssueManagers. |
- class cleanlab.datalab.internal.issue_finder.IssueFinder(datalab, verbosity=1)
- Bases: object
  The IssueFinder class is responsible for managing the process of identifying issues in the dataset by handling the creation and execution of relevant IssueManagers. It serves as a coordinator or helper class for the Datalab class to encapsulate the specific behavior of the issue finding process.
  At a high level, the IssueFinder is responsible for:
  - Determining which types of issues to look for.
  - Instantiating the appropriate IssueManagers using a factory.
  - Running the IssueManagers’ find_issues methods.
  - Collecting the results into a DataIssues instance.
 - Parameters:
- datalab (Datalab) – The Datalab instance associated with this IssueFinder.
- verbosity (int) – Controls the verbosity of the output during the issue finding process.
 
 - Note
   This class is not intended to be used directly. Instead, use the Datalab.find_issues method, which internally utilizes an IssueFinder instance.
 - Methods:
   | find_issues(*[, pred_probs, features, ...]) | Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values). |
   | list_possible_issue_types() | Returns a list of all registered issue types. |
   | list_default_issue_types() | Returns a list of the issue types that are run by default when find_issues() is called without specifying issue_types. |
   | get_available_issue_types(**kwargs) | Returns a dictionary of issue types that can be used in the Datalab.find_issues method. |
 - find_issues(*, pred_probs=None, features=None, knn_graph=None, issue_types=None)
- Checks the dataset for all sorts of common issues in real-world data (in both labels and feature values).
  You can use Datalab to find issues in your data, utilizing any model you have already trained. This method only interacts with your model via its predictions or embeddings (and other functions thereof). The more of these inputs you provide, the more types of issues Datalab can detect in your dataset/labels. If you provide a subset of these inputs, Datalab will output what insights it can based on the limited information from your model.
  Note
  This method is not intended to be used directly. Instead, use the Datalab.find_issues method.
  Note
  The issues are saved in the self.datalab.data_issues.issues attribute, but are not returned.
- Parameters:
- pred_probs (Optional[ndarray]) – Out-of-sample predicted class probabilities made by the model for every example in the dataset. To best detect label issues, provide this input obtained from the most accurate model you can produce.
  If provided, this must be a 2D array with shape (num_examples, K) where K is the number of classes in the dataset.
- features (Optional[np.ndarray]) – Feature embeddings (vector representations) of every example in the dataset.
  If provided, this must be a 2D array with shape (num_examples, num_features).
- knn_graph (Optional[csr_matrix]) – Sparse matrix representing distances between examples in the dataset in a k nearest neighbor graph.
  If provided, this must be a square CSR matrix with shape (num_examples, num_examples) and (k*num_examples) non-zero entries (k is the number of nearest neighbors considered for each example) evenly distributed across the rows. The non-zero entries must be the distances between the corresponding examples. Self-distances must be omitted (i.e. the diagonal must be all zeros and the k nearest neighbors of each example must not include itself).
  For any duplicated examples i,j whose distance is 0, there should be an explicit zero stored in the matrix, i.e. knn_graph[i,j] = 0.
  If both knn_graph and features are provided, the knn_graph will take precedence. If knn_graph is not provided, it is constructed based on the provided features. If neither knn_graph nor features is provided, certain issue types like (near) duplicates will not be considered.
- issue_types (Optional[Dict[str, Any]]) – Collection specifying which types of issues to consider in audit and any non-default parameter settings to use. If unspecified, a default set of issue types and recommended parameter settings is considered.
  This is a dictionary of dictionaries, where the keys are the issue types of interest and the values are dictionaries of parameter values that control how each type of issue is detected (only for advanced users). More specifically, the values are constructor keyword arguments passed to the corresponding IssueManager, which is responsible for detecting the particular issue type.
  See also
 
- Return type:
- None
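The layout required of knn_graph above (k stored entries per row, no self-distances, explicit zeros for exact duplicates) can be made concrete with a small sketch. The helper `knn_csr_components` is hypothetical and uses brute-force Euclidean distances from the standard library only; it returns the three CSR arrays rather than a matrix object, and it is not how cleanlab itself builds the graph.

```python
import math
from typing import List, Sequence, Tuple


def knn_csr_components(
    points: Sequence[Sequence[float]], k: int
) -> Tuple[List[float], List[int], List[int]]:
    """Return (data, indices, indptr) for a k-nearest-neighbor distance graph."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < num_examples")
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for i in range(n):
        # Distances to every other example; the example itself is skipped,
        # so no self-distance is ever stored.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for dist, j in dists[:k]:
            data.append(dist)  # may be 0.0 for exact duplicates: stored explicitly
            indices.append(j)
        indptr.append(len(data))  # every row holds exactly k entries
    return data, indices, indptr


# Examples 0 and 1 are exact duplicates of each other.
points = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
data, indices, indptr = knn_csr_components(points, k=2)
print(len(data), indptr)
```

Assuming SciPy is installed, the three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(n, n))`, which keeps the explicitly stored zeros. For datasets of realistic size, a dedicated nearest-neighbor library is likely a better choice than this quadratic loop.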
 
 - static list_possible_issue_types()
- Returns a list of all registered issue types.
  Any issue type that is not in this list cannot be used in the find_issues() method.
  See also
  REGISTRY: All available issue types and their corresponding issue managers can be found here.
- Return type:
- List[str]
 
 - static list_default_issue_types()
- Returns a list of the issue types that are run by default when find_issues() is called without specifying issue_types.
  See also
  REGISTRY: All available issue types and their corresponding issue managers can be found here.
- Return type:
- List[str]
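How the two lists above interact with the issue_types argument can be sketched as follows. The lists, the `resolve_issue_types` helper, the `k` parameter, and the choice to raise `ValueError` are all assumptions made for illustration; the real values come from `list_possible_issue_types()` and `list_default_issue_types()`, and the library's own validation may behave differently.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-ins for the lists returned by the two static methods.
POSSIBLE_ISSUE_TYPES: List[str] = ["label", "outlier", "near_duplicate", "non_iid"]
DEFAULT_ISSUE_TYPES: List[str] = ["label", "outlier", "near_duplicate"]


def resolve_issue_types(
    issue_types: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Map each issue type to run onto its constructor keyword arguments."""
    if issue_types is None:
        # Nothing requested: fall back to the defaults with no overrides.
        return {name: {} for name in DEFAULT_ISSUE_TYPES}
    unknown = sorted(set(issue_types) - set(POSSIBLE_ISSUE_TYPES))
    if unknown:
        # Issue types that are not registered cannot be used.
        raise ValueError(f"Unregistered issue types: {unknown}")
    # Copy the inner dictionaries so the caller's input is not mutated later.
    return {name: dict(kwargs) for name, kwargs in issue_types.items()}


print(resolve_issue_types())
print(resolve_issue_types({"outlier": {"k": 5}}))  # "k" is a hypothetical parameter
```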
 
 - get_available_issue_types(**kwargs)
- Returns a dictionary of issue types that can be used in the Datalab.find_issues method.
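One way to picture such a method is as a filter over issue types based on which inputs were actually supplied, in line with the note that (near) duplicates cannot be considered without knn_graph or features. The sketch below is only an illustration: the `REQUIREMENTS` table and the `available_issue_types` function are invented for this example, and the real method's rules and return structure may differ.

```python
from typing import Any, Dict, List, Set

# Hypothetical requirements: each issue type lists alternative sets of inputs,
# and any one fully provided set is enough to make the issue type available.
REQUIREMENTS: Dict[str, List[Set[str]]] = {
    "label": [{"pred_probs"}],
    "outlier": [{"features"}, {"knn_graph"}],
    "near_duplicate": [{"features"}, {"knn_graph"}],
}


def available_issue_types(**kwargs: Any) -> Dict[str, Dict[str, Any]]:
    """Return the issue types whose input requirements are met by kwargs."""
    provided = {name for name, value in kwargs.items() if value is not None}
    return {
        issue_type: {}
        for issue_type, alternatives in REQUIREMENTS.items()
        if any(needed <= provided for needed in alternatives)
    }


print(available_issue_types(pred_probs=[[0.9, 0.1]]))
print(available_issue_types(features=[[0.0, 1.0]], knn_graph=None))
```

Each value in the returned dictionary is an empty parameter dictionary here, mirroring the dictionary-of-dictionaries shape of issue_types.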