Creating Your Own Issues Manager#
This guide walks through the process of creating your own IssueManager to detect a custom-defined type of issue alongside the pre-defined issue types in Datalab.
See also

register: You can either use this function at runtime to register a new issue manager:

from cleanlab.datalab.internal.issue_manager_factory import register

register(MyIssueManager)  # Defaults to task="classification"
# register(MyIssueManagerForRegression, task="regression")  # Alternative for regression tasks
or add as a decorator to the class definition (currently only works for classification tasks):
@register
class MyIssueManager(IssueManager):
    ...
Prerequisites#
As a starting point for this guide, we'll import the required modules and create a dummy dataset.
Note
Using Datalab requires additional dependencies beyond the rest of the cleanlab
package. To install them, run:
$ pip install "cleanlab[datalab]"
For the development version of the package, install from source:
$ pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]"
import numpy as np
import pandas as pd
from cleanlab import IssueManager
# Create a dummy dataset
N = 20
data = pd.DataFrame(
    {
        "text": [f"example {i}" for i in range(N)],
        "label": np.random.randint(0, 2, N),
    },
)
Implementing IssueManagers#
Basic Issue Check#
To create a basic issue manager, inherit from the IssueManager class, assign a name to the class via the issue_name class-variable, and implement the find_issues method.

The find_issues method should mark each example in the dataset as an issue or not with a boolean array. It should also provide a score for each example that quantifies the quality of the example with regard to the issue.
class Basic(IssueManager):
    # Assign a name to the issue
    issue_name = "basic"

    def find_issues(self, **kwargs) -> None:
        # Compute scores for each example
        scores = np.random.rand(len(self.datalab.data))
        # Construct a dataframe where examples are marked for issues
        # and the score for each example is included.
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": scores < 0.1,
                self.issue_score_key: scores,
            },
        )
        # Score the dataset as a whole based on this issue type
        self.summary = self.make_summary(score=scores.mean())
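Note that self.issue_score_key is inherited from the base class rather than something you define: it names the per-example score column and, as the report output later in this guide shows, it resolves to "basic_score" here, i.e. it is derived from issue_name. A quick way to confirm this in your installed cleanlab version:

print(Basic.issue_score_key)  # expected output: "basic_score"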
Intermediate Issue Check#
To create an intermediate issue manager:

1. Perform the same steps as in the basic issue check section.

2. Populate the info attribute with a dictionary of information about the identified issues. This information can be included in a report generated by Datalab if you add any of its keys to the verbosity_levels class-attribute.

3. Optionally, add a description of the type of issue this issue manager handles to the description class-attribute.
class Intermediate(IssueManager):
    issue_name = "intermediate"
    # Add a dictionary of information to include in the report
    verbosity_levels = {
        0: [],
        1: ["std"],
        2: ["raw_scores"],
    }
    # Add a description of the issue
    description = "Intermediate issues are a bit more involved than basic issues."

    def find_issues(self, *, intermediate_arg: int, **kwargs) -> None:
        N = len(self.datalab.data)
        raw_scores = np.random.rand(N)
        std = raw_scores.std()
        threshold = min(0, raw_scores.mean() - std)
        sin_filter = np.sin(intermediate_arg * np.arange(N) / N)
        kernel = sin_filter ** 2
        scores = kernel * raw_scores
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": scores < threshold,
                self.issue_score_key: scores,
            },
        )
        self.summary = self.make_summary(score=scores.mean())
        # Useful information that will be available in the Datalab instance
        self.info = {
            "std": std,
            "raw_scores": raw_scores,
            "kernel": kernel,
        }
Advanced Issue Check#
Different types of issues can be detected in a dataset. A local issue affects individual data points and can be tracked via the Datalab.issues dataframe (to see which data points exhibit this type of issue). A global issue, by contrast, affects the overall dataset but is not easily attributable to individual data points (it is hard to say that one data point exhibits the issue while another does not). Even for global issues, we recommend trying to assign a per-data-point score (and boolean) if possible; see the Non-IID IssueManager as an example of this. Note that a global issue must have num_issues greater than 0 in its issue_summary, otherwise it won't show up in Datalab.report() by default. A rough sketch of such a check follows.
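The class below is only an illustrative sketch, not part of cleanlab: the name Advanced, the stand-in statistic, and the 0.4/0.5 thresholds are all invented for this example. It relies only on the same IssueManager attributes used above (issues, issue_score_key, summary, info).

class Advanced(IssueManager):
    issue_name = "advanced"
    description = "A dataset-level (global) check that still assigns per-example scores."

    def find_issues(self, **kwargs) -> None:
        N = len(self.datalab.data)
        # Stand-in for a real dataset-level statistic (e.g. a distribution-shift test).
        per_example_stat = np.random.rand(N)
        global_stat = per_example_stat.mean()

        # Even though the issue is global, assign each example a score in [0, 1]
        # so results can still be inspected via Datalab.issues.
        scores = 1.0 - np.abs(per_example_stat - global_stat)

        # Only flag examples if the dataset-level check fails; otherwise num_issues
        # stays 0 and this issue is omitted from Datalab.report() by default.
        dataset_level_failure = global_stat < 0.4  # invented threshold
        is_issue = (scores < 0.5) if dataset_level_failure else np.zeros(N, dtype=bool)

        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": is_issue,
                self.issue_score_key: scores,
            },
        )
        self.summary = self.make_summary(score=scores.mean())
        self.info = {"global_stat": global_stat}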
Use with Datalab#
We can create a Datalab instance and run issue checks with the custom issue managers we created like so:
from cleanlab.datalab.internal.issue_manager_factory import register
from cleanlab import Datalab

# Register the custom issue managers
for issue_manager in [Basic, Intermediate]:
    register(issue_manager)

# Instantiate a Datalab instance
datalab = Datalab(data, label_name="label")

# Run the issue checks
issue_types = {"basic": {}, "intermediate": {"intermediate_arg": 2}}
datalab.find_issues(issue_types=issue_types)

# Print the report
datalab.report(verbosity=0)
The report will look something like this:
Here is a summary of the different kinds of issues found in the data:
  issue_type     score  num_issues
       basic  0.477762           2
intermediate  0.286455           0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
------------------------------------------- basic issues -------------------------------------------
Number of examples with this issue: 2
Overall dataset quality in terms of this issue: 0.4778
Examples representing most severe instances of this issue:
    is_basic_issue  basic_score
13            True     0.003042
8             True     0.058117
11           False     0.121908
15           False     0.169312
17           False     0.229044
--------------------------------------- intermediate issues ----------------------------------------
About this issue:
Intermediate issues are a bit more involved than basic issues.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.2865
Examples representing most severe instances of this issue:
    is_intermediate_issue  intermediate_score    kernel
0                   False            0.000000  0.000000
1                   False            0.007059  0.009967
3                   False            0.010995  0.087332
2                   False            0.016296  0.039470
11                  False            0.019459  0.794251
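Beyond the report, the results can also be inspected programmatically. Here is a brief sketch, assuming the Datalab.get_issues and Datalab.get_info accessors available in recent cleanlab versions (note that higher report verbosity levels would additionally include the info entries registered in verbosity_levels above):

# Per-example results for one issue type (boolean flag + score columns)
basic_issues = datalab.get_issues("basic")
print(basic_issues.head())

# Information stored in `self.info` by the Intermediate issue manager
intermediate_info = datalab.get_info("intermediate")
print(intermediate_info["std"])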