duplicate
Classes:
NearDuplicateIssueManager – Manages issues related to near-duplicate examples.
- class cleanlab.datalab.internal.issue_manager.duplicate.NearDuplicateIssueManager(datalab, metric=None, threshold=0.13, k=10, **_)
 Bases: IssueManager
 Manages issues related to near-duplicate examples.
Attributes:
description – Short text that summarizes the type of issues handled by this IssueManager.
issue_name – Returns a key that is used to store issue summary results about the assigned Lab.
verbosity_levels – A dictionary of verbosity levels and their corresponding lists of report items to print.
issue_score_key – Returns a key that is used to store issue score results about the assigned Lab.
Methods:
find_issues([features]) – Finds occurrences of this particular issue in the dataset.
collect_info(knn_graph, median_nn_distance) – Collects data for the info attribute of the Datalab.
make_summary(score) – Construct a summary dataframe.
report(issues, summary, info[, ...]) – Compose a report of the issues found by this IssueManager.
- description: ClassVar[str]
 Short text that summarizes the type of issues handled by this IssueManager.
- issue_name: ClassVar[str] = 'near_duplicate'
 Returns a key that is used to store issue summary results about the assigned Lab.
- verbosity_levels: ClassVar[Dict[int, List[str]]]
 A dictionary of verbosity levels and their corresponding lists of report items to print.
 Example
 >>> verbosity_levels = {
 ...     0: [],
 ...     1: ["some_info_key"],
 ...     2: ["additional_info_key"],
 ... }
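As a rough illustration of how a mapping in this shape could drive a report, the sketch below collects the info keys for every level up to a requested verbosity. The helper `keys_to_report` and the cumulative behavior are assumptions made for illustration; this page does not state how the library consumes `verbosity_levels`.

```python
from typing import Any, Dict, List

# Mapping in the same shape as the documented example.
verbosity_levels: Dict[int, List[str]] = {
    0: [],
    1: ["some_info_key"],
    2: ["additional_info_key"],
}


def keys_to_report(levels: Dict[int, List[str]], verbosity: int) -> List[str]:
    """Hypothetical helper: gather keys from all levels <= `verbosity`."""
    keys: List[str] = []
    for level in sorted(levels):
        if level <= verbosity:
            keys.extend(levels[level])
    return keys


# A made-up info dict; only keys selected by the verbosity level are kept.
info: Dict[str, Any] = {
    "some_info_key": 1,
    "additional_info_key": 2,
    "unlisted_key": 3,
}
selected = {k: info[k] for k in keys_to_report(verbosity_levels, 1) if k in info}
print(selected)
```

Under this assumption, a higher verbosity only ever adds keys to the report and never removes them.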
- near_duplicate_sets: List[List[int]]
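To make the `List[List[int]]` shape concrete, here is a minimal pure-Python sketch that builds one list of neighbor indices per example. It assumes that entry `i` holds the indices of the near duplicates of example `i`, and that `threshold` is applied relative to the median nearest-neighbor distance (suggested by the `threshold` and `median_nn_distance` names, but not confirmed on this page). The function name, the Euclidean metric, and the brute-force search are illustrative choices, not the library's implementation.

```python
import math
from statistics import median
from typing import List, Sequence


def near_duplicate_sets_sketch(
    points: Sequence[Sequence[float]], threshold: float = 0.13
) -> List[List[int]]:
    """Hypothetical sketch: indices within threshold * median NN distance."""
    n = len(points)
    # Brute-force pairwise Euclidean distances (assumed metric).
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Distance from each point to its nearest other point.
    nn = [min(d for j, d in enumerate(row) if j != i) for i, row in enumerate(dist)]
    radius = threshold * median(nn)
    # One list per example, holding the indices of its near duplicates.
    return [
        [j for j in range(n) if j != i and dist[i][j] <= radius]
        for i in range(n)
    ]


points = [(0.0, 0.0), (0.0, 0.01), (1.0, 1.0), (2.0, 0.0), (0.0, 2.0)]
sets = near_duplicate_sets_sketch(points)
print(sets)  # [[1], [0], [], [], []]
```

The first two points are almost identical, so each lists the other. The remaining points have no neighbor within the radius and get empty lists.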
- find_issues(features=None, **kwargs)
 Finds occurrences of this particular issue in the dataset.
 Computes the issues and summary dataframes. Calls collect_info to compute the info dict.
 - Return type:
  None
- collect_info(knn_graph, median_nn_distance)
 Collects data for the info attribute of the Datalab.
 - Return type:
  dict
 Note
 This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.
- issue_score_key: ClassVar[str] = 'near_duplicate_score'
 Returns a key that is used to store issue score results about the assigned Lab.
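The sketch below shows how the two keys might name the per-example results. It uses plain dictionaries in place of a DataFrame, and the `is_near_duplicate_issue` column name is an assumption made by analogy with the `is_X_issue` / `X_score` pattern in the `report` example on this page. The rows are invented for illustration.

```python
issue_name = "near_duplicate"
issue_score_key = f"{issue_name}_score"  # matches the documented 'near_duplicate_score'
is_issue_key = f"is_{issue_name}_issue"  # assumed, by analogy with "is_X_issue"

# Made-up per-example rows standing in for an issues dataframe.
rows = [
    {is_issue_key: True, issue_score_key: 0.02},
    {is_issue_key: False, issue_score_key: 0.91},
    {is_issue_key: True, issue_score_key: 0.05},
]

# Indices of the examples flagged under this issue type.
flagged = [i for i, row in enumerate(rows) if row[is_issue_key]]
print(flagged)  # [0, 2]
```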
- classmethod make_summary(score)
 Construct a summary dataframe.
 - Parameters:
  score (float) – The overall score for this issue.
 - Return type:
  DataFrame
 - Returns:
  summary – A summary dataframe.
- classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)
 Compose a report of the issues found by this IssueManager.
 - Parameters:
  issues (DataFrame) – An issues dataframe.
  Example
  >>> import pandas as pd
  >>> issues = pd.DataFrame(
  ...     {
  ...         "is_X_issue": [True, False, True],
  ...         "X_score": [0.2, 0.9, 0.4],
  ...     },
  ... )
  summary (DataFrame) – The summary dataframe.
  Example
  >>> summary = pd.DataFrame(
  ...     {
  ...         "issue_type": ["X"],
  ...         "score": [0.5],
  ...     },
  ... )
  info (Dict[str, Any]) – The info dict.
  Example
  >>> info = {
  ...     "A": "val_A",
  ...     "B": ["val_B1", "val_B2"],
  ... }
  num_examples (int) – The number of examples to print.
  verbosity (int) – The verbosity level of the report.
  include_description (bool) – Whether to include a description of the issue in the report.
 - Return type:
  str
 - Returns:
  report_str – A string containing the report.
- info: Dict[str, Any]
- issues: DataFrame
- summary: DataFrame
- 
description: