Creating Your Own Issues Manager#
This guide walks through the process of creating your own
IssueManager
to detect a custom-defined type of issue alongside the pre-defined issue types in
Datalab.
See also
- register: You can either use this function at runtime to register a new issue manager:
      from cleanlab.datalab.internal.issue_manager_factory import register
      register(MyIssueManager)
  or add it as a decorator to the class definition:
      @register
      class MyIssueManager(IssueManager):
          ...
 
Prerequisites#
To get started, we import the modules used in the following sections and create a dummy dataset.
Note
Using Datalab requires additional dependencies beyond the rest of the cleanlab package. To install them, run:
$ pip install "cleanlab[datalab]"
For the developmental version of the package, install from source:
$ pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]"
import numpy as np
import pandas as pd
from cleanlab import IssueManager
# Create a dummy dataset
N = 20
data = pd.DataFrame(
    {
        "text": [f"example {i}" for i in range(N)],
        "label": np.random.randint(0, 2, N),
    },
)
Implementing IssueManagers#
Basic Issue Check#
To create a basic issue manager, inherit from the
IssueManager class,
assign a name to the issue through the issue_name class variable, and implement the
find_issues method.
The find_issues
method should mark each example in the dataset as an issue or not with a boolean array.
It should also provide a score for each example in the dataset that quantifies the quality of the example
with regard to the issue (a lower score indicates a more severe issue).
class Basic(IssueManager):
    # Assign a name to the issue
    issue_name = "basic"
    def find_issues(self, **kwargs) -> None:
        # Compute scores for each example
        scores = np.random.rand(len(self.datalab.data))
        # Construct a dataframe where examples are marked for issues
        # and the score for each example is included.
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue" : scores < 0.1,
                self.issue_score_key : scores,
            },
        )
        # Score the dataset as a whole based on this issue type
        self.summary = self.make_summary(score = scores.mean())
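The Basic example uses random scores to keep things short. As a sketch of what a deterministic rule might look like, the helper below scores texts by their length. The function `score_text_lengths`, the `min_chars` parameter, and the issue name `short_text` are hypothetical, and plain lists are used so the snippet runs without any dependencies. Inside a real `find_issues` method, the two columns would be placed in the `self.issues` DataFrame as shown above.

```python
from typing import Dict, List


def score_text_lengths(texts: List[str], min_chars: int = 5) -> Dict[str, list]:
    """Hypothetical rule: score each text in [0, 1] by its relative length.

    Lower scores mean shorter (worse) examples. A text is flagged as an
    issue when it has fewer than `min_chars` characters.
    """
    # Avoid division by zero when every text is empty or the list is empty.
    longest = max((len(t) for t in texts), default=0) or 1
    scores = [len(t) / longest for t in texts]
    is_issue = [len(t) < min_chars for t in texts]
    # Column names follow the pattern visible in the report output:
    # "is_<issue_name>_issue" and "<issue_name>_score".
    return {"is_short_text_issue": is_issue, "short_text_score": scores}


columns = score_text_lengths(["ok", "a longer example", ""])
print(columns["is_short_text_issue"])  # [True, False, True]
print(columns["short_text_score"])     # [0.125, 1.0, 0.0]
```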
Intermediate Issue Check#
To create an intermediate issue check:
- Perform the same steps as in the basic issue check section.
- Populate the info attribute with a dictionary of information about the identified issues.
The information can be included in a report generated by Datalab,
if you add any of the keys to the verbosity_levels class-attribute.
Optionally, you can also add a description of the type of issue this issue manager handles to the description class-attribute.
class Intermediate(IssueManager):
    issue_name = "intermediate"
    # Map each verbosity level to the info keys reported at that level
    verbosity_levels = {
        0: [],
        1: ["std"],
        2: ["raw_scores"],
    }
    # Add a description of the issue
    description = "Intermediate issues are a bit more involved than basic issues."
    def find_issues(self, *, intermediate_arg: int, **kwargs) -> None:
        N = len(self.datalab.data)
        raw_scores = np.random.rand(N)
        std = raw_scores.std()
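        # Note: min(0, ...) caps the threshold at 0. Since the scores below are
        # non-negative, this toy check never flags an example.
        threshold = min(0, raw_scores.mean() - std)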
        sin_filter = np.sin(intermediate_arg * np.arange(N) / N)
        kernel = sin_filter ** 2
        scores = kernel * raw_scores
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue" : scores < threshold,
                self.issue_score_key : scores,
            },
        )
        self.summary = self.make_summary(score = scores.mean())
        # Useful information that will be available in the Datalab instance
        self.info = {
            "std": std,
            "raw_scores": raw_scores,
            "kernel": kernel,
        }
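One way to read the verbosity_levels mapping is as a lookup from a verbosity level to the info keys that become visible at that level. The sketch below assumes that levels are cumulative, so a higher verbosity also includes the keys of all lower levels. This is an assumption made for illustration, and `keys_for_verbosity` is a hypothetical helper, not part of cleanlab. Datalab's actual selection logic may differ.

```python
from itertools import chain
from typing import Dict, List


def keys_for_verbosity(levels: Dict[int, List[str]], verbosity: int) -> List[str]:
    """Collect the info keys of every level up to and including `verbosity`."""
    selected = (keys for level, keys in sorted(levels.items()) if level <= verbosity)
    return list(chain.from_iterable(selected))


# Same mapping as in the Intermediate class above.
verbosity_levels = {0: [], 1: ["std"], 2: ["raw_scores"]}

print(keys_for_verbosity(verbosity_levels, 0))  # []
print(keys_for_verbosity(verbosity_levels, 1))  # ['std']
print(keys_for_verbosity(verbosity_levels, 2))  # ['std', 'raw_scores']
```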
Advanced Issue Check#
Note
WIP: This section is a work in progress.
Use with Datalab#
We can create a
Datalab
instance and run issue checks with the custom issue managers we created like so:
from cleanlab.datalab.internal.issue_manager_factory import register
from cleanlab import Datalab
# Register the issue managers
for issue_manager in [Basic, Intermediate]:
    register(issue_manager)
# Create a Datalab instance
datalab = Datalab(data, label_name="label")
# Run the issue check
issue_types = {"basic": {}, "intermediate": {"intermediate_arg": 2}}
datalab.find_issues(issue_types=issue_types)
# Print report
datalab.report(verbosity=0)
The report will look something like this:
Here is a summary of the different kinds of issues found in the data:
  issue_type     score  num_issues
       basic  0.477762           2
intermediate  0.286455           0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
------------------------------------------- basic issues -------------------------------------------
Number of examples with this issue: 2
Overall dataset quality in terms of this issue: 0.4778
Examples representing most severe instances of this issue:
    is_basic_issue  basic_score
13            True     0.003042
8             True     0.058117
11           False     0.121908
15           False     0.169312
17           False     0.229044
--------------------------------------- intermediate issues ----------------------------------------
About this issue:
    Intermediate issues are a bit more involved than basic issues.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.2865
Examples representing most severe instances of this issue:
    is_intermediate_issue  intermediate_score    kernel
0                   False            0.000000       0.0
1                   False            0.007059  0.009967
3                   False            0.010995  0.087332
2                   False            0.016296   0.03947
11                  False            0.019459  0.794251
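In both tables of the report above, the rows appear in ascending order of score, which matches the note that a lower score indicates a more severe issue. A minimal sketch of that ordering, using the "basic" scores copied from the report (the `most_severe` helper is hypothetical and not a Datalab function):

```python
from typing import Dict, List


def most_severe(scores: Dict[int, float], k: int = 5) -> List[int]:
    """Return the indices of the k lowest-scoring examples, worst first."""
    return sorted(scores, key=scores.get)[:k]


# Example index -> basic_score, copied from the report above.
basic_scores = {8: 0.058117, 11: 0.121908, 13: 0.003042, 15: 0.169312, 17: 0.229044}

print(most_severe(basic_scores, k=2))  # [13, 8]
```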