Creating Your Own Issues Manager#
This guide walks through the process of creating your own IssueManager to detect a custom-defined type of issue alongside the pre-defined issue types in Datalab.
See also
register: You can either use this function at runtime to register a new issue manager:

from cleanlab.datalab.internal.issue_manager_factory import register

register(MyIssueManager)

or add it as a decorator on the class definition:

@register
class MyIssueManager(IssueManager):
    ...
Prerequisites#
As a starting point for this guide, we’ll import the necessary things for the next section and create a dummy dataset.
Note
Using Datalab requires additional dependencies beyond the rest of the cleanlab
package. To install them, run:
$ pip install "cleanlab[datalab]"
For the development version of the package, install from source:
$ pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]"
import numpy as np
import pandas as pd

from cleanlab import IssueManager

# Create a dummy dataset
N = 20
data = pd.DataFrame(
    {
        "text": [f"example {i}" for i in range(N)],
        "label": np.random.randint(0, 2, N),
    },
)
Implementing IssueManagers#
Basic Issue Check#
To create a basic issue manager, inherit from the IssueManager class, assign a name to the class via the issue_name class variable, and implement the find_issues method.
The find_issues method should mark each example in the dataset as exhibiting the issue or not with a boolean array.
It should also provide a score for each example in the dataset that quantifies the quality of that example with regard to the issue.
class Basic(IssueManager):
    # Assign a name to the issue
    issue_name = "basic"

    def find_issues(self, **kwargs) -> None:
        # Compute scores for each example
        scores = np.random.rand(len(self.datalab.data))
        # Construct a dataframe where examples are marked for issues
        # and the score for each example is included.
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": scores < 0.1,
                self.issue_score_key: scores,
            },
        )
        # Score the dataset as a whole based on this issue type
        self.summary = self.make_summary(score=scores.mean())
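Before registering this class with Datalab (covered below), you can sanity-check it by calling find_issues directly. The following is a minimal sketch, assuming the IssueManager base class accepts the Datalab instance as its datalab constructor argument (which the method above relies on through self.datalab):

from cleanlab import Datalab

# Quick manual check of the Basic issue manager on the dummy dataset.
lab = Datalab(data, label_name="label")
manager = Basic(datalab=lab)
manager.find_issues()

# `issues` holds the per-example flags and scores; `summary` is a one-row overview.
print(manager.issues.head())
print(manager.summary)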
Intermediate Issue Check#
To create an intermediate issue manager:

1. Perform the same steps as in the basic issue check section.
2. Populate the info attribute with a dictionary of information about the identified issues. This information can be included in a report generated by Datalab if you add any of its keys to the verbosity_levels class attribute. Optionally, you can also add a description of the type of issue this issue manager handles to the description class attribute.
class Intermediate(IssueManager):
    issue_name = "intermediate"
    # Add a dictionary of information to include in the report
    verbosity_levels = {
        0: [],
        1: ["std"],
        2: ["raw_scores"],
    }
    # Add a description of the issue
    description = "Intermediate issues are a bit more involved than basic issues."

    def find_issues(self, *, intermediate_arg: int, **kwargs) -> None:
        N = len(self.datalab.data)
        raw_scores = np.random.rand(N)
        std = raw_scores.std()
        threshold = min(0, raw_scores.mean() - std)
        sin_filter = np.sin(intermediate_arg * np.arange(N) / N)
        kernel = sin_filter ** 2
        scores = kernel * raw_scores
        self.issues = pd.DataFrame(
            {
                f"is_{self.issue_name}_issue": scores < threshold,
                self.issue_score_key: scores,
            },
        )
        self.summary = self.make_summary(score=scores.mean())
        # Useful information that will be available in the Datalab instance
        self.info = {
            "std": std,
            "raw_scores": raw_scores,
            "kernel": kernel,
        }
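Because intermediate_arg is declared keyword-only, it must be supplied whenever the check runs; with Datalab this happens through the issue_types dictionary shown in the next section. As a hedged sketch under the same constructor assumption as the Basic example above, you can also call the manager directly:

from cleanlab import Datalab

lab = Datalab(data, label_name="label")
manager = Intermediate(datalab=lab)
# intermediate_arg is required here because find_issues declares it keyword-only.
manager.find_issues(intermediate_arg=2)

# The info dict now contains the extra values referenced by verbosity_levels.
print(manager.info["std"])
print(manager.summary)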
Advanced Issue Check#
Note
WIP: This section is a work in progress.
Use with Datalab#
We can create a Datalab instance and run issue checks with the custom issue managers we created like so:
from cleanlab.datalab.internal.issue_manager_factory import register
from cleanlab import Datalab

# Register the custom issue managers
for issue_manager in [Basic, Intermediate]:
    register(issue_manager)

# Instantiate a Datalab instance
datalab = Datalab(data, label_name="label")

# Run the issue checks
issue_types = {"basic": {}, "intermediate": {"intermediate_arg": 2}}
datalab.find_issues(issue_types=issue_types)

# Print the report
datalab.report(verbosity=0)
The report will look something like this:
Here is a summary of the different kinds of issues found in the data:
  issue_type     score  num_issues
       basic  0.477762           2
intermediate  0.286455           0
(Note: A lower score indicates a more severe issue across all examples in the dataset.)
------------------------------------------- basic issues -------------------------------------------
Number of examples with this issue: 2
Overall dataset quality in terms of this issue: 0.4778
Examples representing most severe instances of this issue:
    is_basic_issue  basic_score
13            True     0.003042
8             True     0.058117
11           False     0.121908
15           False     0.169312
17           False     0.229044
--------------------------------------- intermediate issues ----------------------------------------
About this issue:
Intermediate issues are a bit more involved than basic issues.
Number of examples with this issue: 0
Overall dataset quality in terms of this issue: 0.2865
Examples representing most severe instances of this issue:
    is_intermediate_issue  intermediate_score    kernel
0                   False            0.000000  0.000000
1                   False            0.007059  0.009967
3                   False            0.010995  0.087332
2                   False            0.016296  0.039470
11                  False            0.019459  0.794251
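Beyond the printed report, the per-example results and the extra information recorded by Intermediate.find_issues remain attached to the Datalab instance. A short sketch, assuming the usual Datalab accessors get_issues and get_info:

# Per-example flags and scores for the custom "basic" check.
basic_issues = datalab.get_issues("basic")
print(basic_issues.sort_values("basic_score").head())

# Extra values stored by Intermediate.find_issues via self.info.
intermediate_info = datalab.get_info("intermediate")
print(intermediate_info["std"])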