data_valuation

Classes:

DataValuationIssueManager(datalab[, metric, ...])

Detect which examples in a dataset are least valuable via an approximate Data Shapley value.

class cleanlab.datalab.internal.issue_manager.data_valuation.DataValuationIssueManager(datalab, metric=None, threshold=None, k=10, **kwargs)

Bases: IssueManager

Detect which examples in a dataset are least valuable via an approximate Data Shapley value.

Examples

>>> from cleanlab import Datalab
>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>>
>>> # Generate two distinct clusters
>>> X = np.vstack([
...     np.random.normal(-1, 1, (25, 2)),
...     np.random.normal(1, 1, (25, 2)),
... ])
>>> y = np.array([0]*25 + [1]*25)
>>>
>>> # Initialize Datalab with data
>>> lab = Datalab(data={"y": y}, label_name="y")
>>>
>>> # Creating a knn_graph for data valuation
>>> knn = NearestNeighbors(n_neighbors=10).fit(X)
>>> knn_graph = knn.kneighbors_graph(mode='distance')
>>>
>>> # Specifying issue types for data valuation
>>> issue_types = {"data_valuation": {}}
>>> lab.find_issues(knn_graph=knn_graph, issue_types=issue_types)

Attributes:

description

Short text that summarizes the type of issues handled by this IssueManager.

issue_name

Returns a key that is used to store issue summary results about the assigned Lab.

issue_score_key

Returns a key that is used to store issue score results about the assigned Lab.

verbosity_levels

A dictionary mapping each verbosity level to the list of report items to print at that level.

DEFAULT_THRESHOLD

Methods:

find_issues([features])

Calculate the data valuation score with a provided or existing knn graph.

collect_info(issues, knn_graph)

Collects data for the info attribute of the Datalab.

make_summary(score)

Construct a summary dataframe.

report(issues, summary, info[, ...])

Compose a report of the issues found by this IssueManager.

description: ClassVar[str]

Short text that summarizes the type of issues handled by this IssueManager.

issue_name: ClassVar[str] = 'data_valuation'

Returns a key that is used to store issue summary results about the assigned Lab.

issue_score_key: ClassVar[str] = 'data_valuation_score'

Returns a key that is used to store issue score results about the assigned Lab.

verbosity_levels: ClassVar[Dict[int, List[str]]]

A dictionary mapping each verbosity level to the list of report items to print at that level.

Example

>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }
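
As an illustration of how such a mapping might be consumed, the sketch below gathers the info keys for a requested verbosity. The helper `keys_to_report` is hypothetical, and the rule that a higher verbosity also includes the keys of all lower levels is an assumption, not something stated on this page.

```python
from typing import Dict, List

verbosity_levels: Dict[int, List[str]] = {
    0: [],
    1: ["some_info_key"],
    2: ["additional_info_key"],
}


def keys_to_report(levels: Dict[int, List[str]], verbosity: int) -> List[str]:
    """Hypothetical helper: collect info keys for every level up to `verbosity`."""
    keys: List[str] = []
    for level in sorted(levels):
        # Assumption: levels are cumulative, so lower levels are always included.
        if level <= verbosity:
            keys.extend(levels[level])
    return keys


print(keys_to_report(verbosity_levels, 1))  # ['some_info_key']
```
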
DEFAULT_THRESHOLD = 0.5
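
A minimal sketch of how a threshold like this could turn scores into flags. It assumes an example is flagged when its score is strictly below the threshold, and it names the boolean column after the `is_X_issue` pattern used in the `report` examples on this page. The scores are made up for illustration.

```python
import numpy as np
import pandas as pd

DEFAULT_THRESHOLD = 0.5  # value documented for this issue manager

# Made-up scores, for illustration only.
scores = np.array([0.48, 0.50, 0.53, 0.41])

# Assumption: scores strictly below the threshold mark low-value examples.
issues = pd.DataFrame(
    {
        "is_data_valuation_issue": scores < DEFAULT_THRESHOLD,
        "data_valuation_score": scores,
    }
)
print(issues)
```
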
find_issues(features=None, **kwargs)

Calculate the data valuation score with a provided or existing knn graph. The score is based on the KNN-Shapley value described in https://arxiv.org/abs/1911.07128. The larger the score, the more valuable the data point is and the more it contributes to the model’s training.

Parameters:

knn_graph (csr_matrix) – A sparse matrix representing the knn graph.

Return type:

None
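
To make the KNN-Shapley idea concrete, the sketch below computes values for all training points with respect to a single test point, using the recursion commonly given for unweighted KNN classifiers. It is an illustration only and not this method's implementation, which works from a knn graph and is approximate. The function name `knn_shapley_single_test` and the random data are hypothetical.

```python
import numpy as np


def knn_shapley_single_test(X_train, y_train, x_test, y_test, k=10):
    """Sketch: KNN-Shapley value of each training point for one test point."""
    n = len(y_train)
    # Sort training points from nearest to farthest from the test point.
    dist = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dist)
    match = (y_train[order] == y_test).astype(float)

    values_sorted = np.zeros(n)
    # Base case: the farthest training point.
    values_sorted[n - 1] = match[n - 1] / n
    # Walk inward toward the nearest point; rank is the 1-based sorted position.
    for i in range(n - 2, -1, -1):
        rank = i + 1
        values_sorted[i] = (
            values_sorted[i + 1]
            + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original order of the training set.
    values = np.zeros(n)
    values[order] = values_sorted
    return values


rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 2))
y_train = rng.integers(0, 2, size=20)
values = knn_shapley_single_test(X_train, y_train, x_test=np.zeros(2), y_test=1, k=5)
print(values.round(3))
```

In this sketch the values sum to the fraction of the k nearest neighbors whose label matches the test label, which is a quick sanity check on the recursion.
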

collect_info(issues, knn_graph)

Collects data for the info attribute of the Datalab.

Return type:

dict

Note

This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.

classmethod make_summary(score)

Construct a summary dataframe.

Parameters:

score (float) – The overall score for this issue.

Return type:

DataFrame

Returns:

summary – A summary dataframe.
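
For orientation, a summary of this shape can be sketched as a one-row dataframe. The column names `issue_type` and `score` are taken from the `summary` example shown for `report` on this page. The helper `make_summary_sketch` is hypothetical and is not the library's implementation.

```python
import pandas as pd


def make_summary_sketch(issue_name: str, score: float) -> pd.DataFrame:
    """Hypothetical stand-in: one row holding the issue type and its overall score."""
    return pd.DataFrame({"issue_type": [issue_name], "score": [score]})


summary = make_summary_sketch("data_valuation", 0.5)
print(summary)
```
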

classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)

Compose a report of the issues found by this IssueManager.

Parameters:
  • issues (DataFrame) –

    An issues dataframe.

    Example

    >>> import pandas as pd
    >>> issues = pd.DataFrame(
    ...     {
    ...         "is_X_issue": [True, False, True],
    ...         "X_score": [0.2, 0.9, 0.4],
    ...     },
    ... )
    

  • summary (DataFrame) –

    The summary dataframe.

    Example

    >>> summary = pd.DataFrame(
    ...     {
    ...         "issue_type": ["X"],
    ...         "score": [0.5],
    ...     },
    ... )
    

  • info (Dict[str, Any]) –

    The info dict.

    Example

    >>> info = {
    ...     "A": "val_A",
    ...     "B": ["val_B1", "val_B2"],
    ... }
    

  • num_examples (int) – The number of examples to print.

  • verbosity (int) – The verbosity level of the report.

  • include_description (bool) – Whether to include a description of the issue in the report.

Return type:

str

Returns:

report_str – A string containing the report.

info: Dict[str, Any]
issues: DataFrame
summary: DataFrame