noniid

Functions:

simplified_kolmogorov_smirnov_test(...)

Computes the Kolmogorov-Smirnov statistic between two groups of data.

Classes:

NonIIDIssueManager(datalab[, metric, k, ...])

Manages issues related to non-iid data distributions.

cleanlab.datalab.internal.issue_manager.noniid.simplified_kolmogorov_smirnov_test(neighbor_histogram, non_neighbor_histogram)

Computes the Kolmogorov-Smirnov statistic between two groups of data. The statistic is the largest difference between the empirical cumulative distribution functions (ECDFs) of the two groups.

Parameters:
  • neighbor_histogram (ndarray[Any, dtype[float64]]) – Histogram data for the nearest neighbor group.

  • non_neighbor_histogram (ndarray[Any, dtype[float64]]) – Histogram data for the non-neighbor group.

Return type:

float

Returns:

statistic – The KS statistic between the two ECDFs.

Note

  • Both input arrays should have the same length.

  • The input arrays are histograms: each entry holds the count or frequency of values that fall in one bin. Each histogram should be normalized so that it sums to one.

To calculate the KS statistic, the function first calculates the ECDFs for both input arrays, which are step functions that show the cumulative sum of the data up to each point. The function then calculates the largest absolute difference between the two ECDFs.
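
The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration under the stated assumptions (equal-length, normalized histograms), not the library's source; the function name `ks_statistic_from_histograms` and the sample histograms are hypothetical.

```python
import numpy as np


def ks_statistic_from_histograms(hist_a, hist_b):
    """Largest absolute gap between the cumulative sums of two normalized histograms."""
    hist_a = np.asarray(hist_a, dtype=np.float64)
    hist_b = np.asarray(hist_b, dtype=np.float64)
    if hist_a.shape != hist_b.shape:
        raise ValueError("Both histograms must have the same length.")
    # The running sum of a normalized histogram is its ECDF evaluated at each bin.
    ecdf_a = np.cumsum(hist_a)
    ecdf_b = np.cumsum(hist_b)
    return float(np.max(np.abs(ecdf_a - ecdf_b)))


# Hypothetical histograms over three bins, each summing to one.
neighbor_histogram = np.array([0.5, 0.5, 0.0])
non_neighbor_histogram = np.array([0.0, 0.5, 0.5])
statistic = ks_statistic_from_histograms(neighbor_histogram, non_neighbor_histogram)
```

Here the ECDFs are [0.5, 1.0, 1.0] and [0.0, 0.5, 1.0], so the largest gap is 0.5. Two identical histograms give a statistic of 0.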

class cleanlab.datalab.internal.issue_manager.noniid.NonIIDIssueManager(datalab, metric=None, k=10, num_permutations=25, seed=0, significance_threshold=0.05, **_)

Bases: IssueManager

Manages issues related to non-iid data distributions.

Parameters:
  • datalab (Datalab) – The Datalab instance that this issue manager searches for issues in.

  • metric (Union[str, Callable, None]) – The distance metric used to compute the KNN graph of the examples in the dataset. If set to None, the metric will be automatically selected based on the dimensionality of the features used to represent the examples in the dataset.

  • k (int) – The number of nearest neighbors to consider when computing the KNN graph of the examples.

  • num_permutations (int) – The number of trials to run when performing permutation testing to determine whether the distribution of index-distances between neighbors in the dataset is IID or not.

Note

If the dataset is considered non-IID, this class flags only a single example as an issue. This type of issue concerns the dataset as a whole rather than individual examples.
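
To make the role of k and num_permutations concrete, here is a rough sketch of a permutation test on index-distances between nearest neighbors. It is an illustration under simplifying assumptions, not cleanlab's implementation: it uses brute-force Euclidean neighbors, compares neighbor index-distances against the index-distances of all pairs (rather than a separate non-neighbor group), and every function name in it is hypothetical.

```python
import numpy as np


def ks_statistic(hist_a, hist_b):
    # Largest absolute gap between the two cumulative sums (ECDFs).
    return float(np.max(np.abs(np.cumsum(hist_a) - np.cumsum(hist_b))))


def knn_indices(features, k):
    # Brute-force Euclidean k-nearest neighbors; each example is excluded from its own list.
    diffs = features[:, None, :] - features[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(distances, np.inf)
    return np.argsort(distances, axis=1)[:, :k]


def permutation_p_value(features, k=10, num_permutations=25, seed=0):
    n = len(features)
    neighbors = knn_indices(features, k)
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()

    # Reference histogram over index distances 1..n-1 for all pairs of examples:
    # exactly n - d pairs lie d positions apart, whatever the ordering.
    all_pairs = (n - np.arange(1, n)).astype(np.float64)
    all_pairs /= all_pairs.sum()

    def statistic(position):
        # position[i] is where example i sits in the (possibly shuffled) ordering.
        index_distances = np.abs(position[rows] - position[cols])
        counts = np.bincount(index_distances, minlength=n)[1:].astype(np.float64)
        return ks_statistic(counts / counts.sum(), all_pairs)

    rng = np.random.default_rng(seed)
    observed = statistic(np.arange(n))
    permuted = [statistic(rng.permutation(n)) for _ in range(num_permutations)]
    exceed = sum(s >= observed for s in permuted)
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + exceed) / (1 + num_permutations)


# Features that drift with the row index: neighbors in feature space are also close in order.
drifting = np.arange(200, dtype=np.float64).reshape(-1, 1)
p_value = permutation_p_value(drifting)
```

With the default of 25 permutations, the smallest p-value this sketch can report is 1/26, which is what the drifting example yields. That value is below the default significance_threshold of 0.05, so such data would be treated as non-IID under these assumptions.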

Attributes:

description

Short text that summarizes the type of issues handled by this IssueManager.

issue_name

Returns a key that is used to store issue summary results about the assigned Lab.

verbosity_levels

A dictionary of verbosity levels and their corresponding lists of report items to print.

issue_score_key

Returns a key that is used to store issue score results about the assigned Lab.

info

issues

summary

Methods:

find_issues([features, pred_probs])

Finds occurrences of this particular issue in the dataset.

collect_info([knn_graph, knn])

Collects data for the info attribute of the Datalab.

make_summary(score)

Construct a summary dataframe.

report(issues, summary, info[, ...])

Compose a report of the issues found by this IssueManager.

description: ClassVar[str]

Short text that summarizes the type of issues handled by this IssueManager.

issue_name: ClassVar[str] = 'non_iid'

Returns a key that is used to store issue summary results about the assigned Lab.

verbosity_levels: ClassVar[Dict[int, List[str]]]

A dictionary of verbosity levels and their corresponding lists of report items to print.

Example

>>> verbosity_levels = {
...     0: [],
...     1: ["some_info_key"],
...     2: ["additional_info_key"],
... }
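
One way to read this mapping is that a report at a given verbosity prints the items of every level up to and including that level. That cumulative behavior is an assumption made for illustration, and the helper `keys_to_report` below is hypothetical rather than part of the library.

```python
verbosity_levels = {
    0: [],
    1: ["some_info_key"],
    2: ["additional_info_key"],
}


def keys_to_report(levels, verbosity):
    # Assumption: a report at `verbosity` includes the keys of every level <= verbosity.
    selected = []
    for level in sorted(levels):
        if level <= verbosity:
            selected.extend(levels[level])
    return selected


keys_at_level_1 = keys_to_report(verbosity_levels, 1)
keys_at_level_2 = keys_to_report(verbosity_levels, 2)
```

Under that assumption, level 1 selects only "some_info_key", while level 2 selects both keys.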

find_issues(features=None, pred_probs=None, **kwargs)

Finds occurrences of this particular issue in the dataset.

Computes the issues and summary dataframes. Calls collect_info to compute the info dict.

Return type:

None

collect_info(knn_graph=None, knn=None)

Collects data for the info attribute of the Datalab.

Return type:

dict

Note

This method is called by find_issues() after find_issues() has set the issues and summary dataframes as instance attributes.

issue_score_key: ClassVar[str] = 'non_iid_score'

Returns a key that is used to store issue score results about the assigned Lab.

classmethod make_summary(score)

Construct a summary dataframe.

Parameters:

score (float) – The overall score for this issue.

Return type:

DataFrame

Returns:

summary – A summary dataframe.
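
As a rough picture of what such a summary might look like, the sketch below builds a one-row dataframe using the column names from the summary example given for report() ("issue_type" and "score"). The helper `make_summary_sketch` is hypothetical, the score is illustrative, and the dataframe returned by the real method may contain other columns.

```python
import pandas as pd


def make_summary_sketch(issue_name, score):
    # One row per issue type, mirroring the columns of the summary example for report().
    return pd.DataFrame({"issue_type": [issue_name], "score": [score]})


# 'non_iid' is the issue_name documented for this manager; 0.5 is an arbitrary score.
summary = make_summary_sketch("non_iid", 0.5)
```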

classmethod report(issues, summary, info, num_examples=5, verbosity=0, include_description=False, info_to_omit=None)

Compose a report of the issues found by this IssueManager.

Parameters:
  • issues (DataFrame) –

    An issues dataframe.

    Example

    >>> import pandas as pd
    >>> issues = pd.DataFrame(
    ...     {
    ...         "is_X_issue": [True, False, True],
    ...         "X_score": [0.2, 0.9, 0.4],
    ...     },
    ... )
    

  • summary (DataFrame) –

    The summary dataframe.

    Example

    >>> summary = pd.DataFrame(
    ...     {
    ...         "issue_type": ["X"],
    ...         "score": [0.5],
    ...     },
    ... )
    

  • info (Dict[str, Any]) –

    The info dict.

    Example

    >>> info = {
    ...     "A": "val_A",
    ...     "B": ["val_B1", "val_B2"],
    ... }
    

  • num_examples (int) – The number of examples to print.

  • verbosity (int) – The verbosity level of the report.

  • include_description (bool) – Whether to include a description of the issue in the report.

Return type:

str

Returns:

report_str – A string containing the report.

info: Dict[str, Any]
issues: pd.DataFrame
summary: pd.DataFrame