data_valuation
Classes:
DataValuationIssueManager
Detect which examples in a dataset are least valuable via an approximate Data Shapley value.
- class cleanlab.datalab.internal.issue_manager.data_valuation.DataValuationIssueManager(datalab, metric=None, threshold=None, k=10, **kwargs)
Bases:
IssueManager
Detect which examples in a dataset are least valuable via an approximate Data Shapley value.
Examples
>>> from cleanlab import Datalab
>>> import numpy as np
>>> from sklearn.neighbors import NearestNeighbors
>>>
>>> # Generate two distinct clusters
>>> X = np.vstack([
...     np.random.normal(-1, 1, (25, 2)),
...     np.random.normal(1, 1, (25, 2)),
... ])
>>> y = np.array([0]*25 + [1]*25)
>>>
>>> # Initialize Datalab with data
>>> lab = Datalab(data={"y": y}, label_name="y")
>>>
>>> # Create a knn_graph for data valuation
>>> knn = NearestNeighbors(n_neighbors=10).fit(X)
>>> knn_graph = knn.kneighbors_graph(mode='distance')
>>>
>>> # Specify issue types for data valuation
>>> issue_types = {"data_valuation": {}}
>>> lab.find_issues(knn_graph=knn_graph, issue_types=issue_types)
Attributes:
description
Short text that summarizes the type of issues handled by this IssueManager.
issue_name
Returns a key that is used to store issue summary results about the assigned Lab.
issue_score_key
Returns a key that is used to store issue score results about the assigned Lab.
verbosity_levels
A dictionary mapping verbosity levels to the lists of report items to print.
Methods:
find_issues([features])
Calculate the data valuation score with a provided or existing knn graph.
collect_info(issues, knn_graph)
Collects data for the info attribute of the Datalab.
make_summary(score)
Construct a summary dataframe.
report(issues, summary, info[, ...])
Compose a report of the issues found by this IssueManager.
- description: ClassVar[str]
Short text that summarizes the type of issues handled by this IssueManager.
- issue_name: ClassVar[str] = 'data_valuation'
Returns a key that is used to store issue summary results about the assigned Lab.
- issue_score_key: ClassVar[str] = 'data_valuation_score'
Returns a key that is used to store issue score results about the assigned Lab.
- verbosity_levels: ClassVar[Dict[int, List[str]]]
A dictionary mapping verbosity levels to the lists of report items to print.
Example
>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }
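One plausible reading of this mapping is that a report at verbosity v prints the keys listed for every level up to and including v. The sketch below illustrates that reading only; the helper `keys_for_verbosity` is hypothetical and not part of cleanlab, and the cumulative behavior is an assumption rather than something this page states.

```python
from typing import Dict, List

verbosity_levels: Dict[int, List[str]] = {
    0: [],
    1: ["some_info_key"],
    2: ["additional_info_key"],
}


def keys_for_verbosity(levels: Dict[int, List[str]], verbosity: int) -> List[str]:
    """Hypothetical helper: gather report keys for all levels <= verbosity."""
    keys: List[str] = []
    for level in sorted(levels):
        if level <= verbosity:
            keys.extend(levels[level])
    return keys


# Assumed cumulative behavior: level 2 also includes the level-1 keys.
print(keys_for_verbosity(verbosity_levels, 2))
```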
- DEFAULT_THRESHOLD = 0.5
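As a rough illustration of how a threshold like this could turn scores into issue flags, the sketch below marks an example as an issue when its score falls strictly below the threshold. The comparison direction is an assumption, the scores are made up, and the `is_data_valuation_issue` column name is inferred from the `is_X_issue` pattern used elsewhere on this page.

```python
import numpy as np
import pandas as pd

DEFAULT_THRESHOLD = 0.5  # value documented above

# Made-up scores for five examples (illustrative only)
scores = np.array([0.42, 0.50, 0.73, 0.10, 0.95])

# Assumption: an example is flagged when its score is strictly below the threshold
issues = pd.DataFrame(
    {
        "is_data_valuation_issue": scores < DEFAULT_THRESHOLD,
        "data_valuation_score": scores,
    }
)
print(issues)
```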
- find_issues(features=None, **kwargs)
Calculate the data valuation score with a provided or existing knn graph. The score is based on the KNN-Shapley value described in https://arxiv.org/abs/1911.07128. The larger the score, the more valuable the data point is and the more it contributes to the model’s training.
- Parameters:
knn_graph (csr_matrix) – A sparse matrix representing the knn graph.
- Return type:
None
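To make the KNN-Shapley idea concrete, here is a minimal sketch of the exact recursion from the referenced paper for an unweighted KNN classifier and a single test point. It is not cleanlab's internal implementation, which works from a knn graph over the dataset itself; the function name `knn_shapley_single_test` and the toy data are assumptions for illustration.

```python
import numpy as np


def knn_shapley_single_test(X_train, y_train, x_test, y_test, k=10):
    """Exact KNN-Shapley values of training points for one test point."""
    n = len(y_train)
    # Sort training points from nearest to farthest from the test point
    dists = np.linalg.norm(X_train - x_test, axis=1)
    order = np.argsort(dists)
    match = (y_train[order] == y_test).astype(float)

    values_sorted = np.zeros(n)
    # Base case: the farthest point
    values_sorted[n - 1] = match[n - 1] / n
    # Recurse from the second-farthest point back to the nearest
    for i in range(n - 2, -1, -1):
        rank = i + 1  # 1-based rank in the sorted order
        values_sorted[i] = (
            values_sorted[i + 1]
            + (match[i] - match[i + 1]) / k * min(k, rank) / rank
        )

    # Map values back to the original training order
    values = np.empty(n)
    values[order] = values_sorted
    return values


# Toy data: the point at index 1 is close to the test point but has a different label
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1, 0, 1, 1])
values = knn_shapley_single_test(X_train, y_train, np.array([0.1]), 1, k=2)
print(values)
```

In this toy setup the nearby point whose label disagrees with the test label receives the lowest value, which matches the reading that a larger score means a more valuable data point.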
- collect_info(issues, knn_graph)
Collects data for the info attribute of the Datalab.
- Return type:
dict
Note
This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.
- classmethod make_summary(score)
Construct a summary dataframe.
- Parameters:
score (float) – The overall score for this issue.
- Return type:
DataFrame
- Returns:
summary – A summary dataframe.
- classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)
Compose a report of the issues found by this IssueManager.
- Parameters:
issues (DataFrame) – An issues dataframe.
Example
>>> import pandas as pd
>>> issues = pd.DataFrame(
...     {
...         "is_X_issue": [True, False, True],
...         "X_score": [0.2, 0.9, 0.4],
...     },
... )
summary (DataFrame) – The summary dataframe.
Example
>>> summary = pd.DataFrame(
...     {
...         "issue_type": ["X"],
...         "score": [0.5],
...     },
... )
info (Dict[str, Any]) – The info dict.
Example
>>> info = {
...     "A": "val_A",
...     "B": ["val_B1", "val_B2"],
... }
num_examples (int) – The number of examples to print.
verbosity (int) – The verbosity level of the report.
include_description (bool) – Whether to include a description of the issue in the report.
- Return type:
str
- Returns:
report_str – A string containing the report.
- info: Dict[str, Any]
- issues: DataFrame
- summary: DataFrame
-
description: