Datalab guides#
Guides for using Datalab and understanding the issues it detects.
Note
Using Datalab requires additional dependencies beyond the rest of the cleanlab
package. To install them, run:
$ pip install "cleanlab[datalab]"
For the development version of the package, install from source:
$ pip install "git+https://github.com/cleanlab/cleanlab.git#egg=cleanlab[datalab]"
Types of issues#
Guides for using Datalab with greater control: selecting which issues to search for and which non-default settings to use when detecting them.
- Datalab Issue Types
  - Types of issues Datalab can detect
  - Estimates for Each Issue Type
  - Inputs to Datalab
  - Label Issue
  - Outlier Issue
  - (Near) Duplicate Issue
  - Non-IID Issue
  - Class Imbalance Issue
  - Image-specific Issues
  - Spurious Correlations between image-specific properties and labels
  - Underperforming Group Issue
  - Null Issue
  - Data Valuation Issue
  - Identifier Column Issue
  - Optional Issue Parameters
    - Label Issue Parameters
    - Outlier Issue Parameters
    - Duplicate Issue Parameters
    - Non-IID Issue Parameters
    - Imbalance Issue Parameters
    - Underperforming Group Issue Parameters
    - Null Issue Parameters
    - Data Valuation Issue Parameters
    - Identifier Column Parameters
    - Image Issue Parameters
    - Spurious Correlations Issue Parameters
  - Cleanlab Studio (Easy Mode)
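As a rough illustration of what "selecting what issues to search for" can look like in practice, the sketch below builds an `issue_types` dictionary and passes it to `Datalab.find_issues`. The helper names `build_issue_types` and `run_audit` are hypothetical, and the specific option names (`k` for outliers, `threshold` for near duplicates) are assumptions that should be checked against the Optional Issue Parameters guide for your installed version. The audit function returns `None` when the optional Datalab dependencies are missing.

```python
def build_issue_types(outlier_k=10, near_duplicate_threshold=0.13):
    """Build an `issue_types` dict for `Datalab.find_issues`.

    Keys select which checks run; each value holds that check's overrides.
    The option names `k` and `threshold` are assumptions to verify against
    the Optional Issue Parameters guide for your cleanlab version.
    """
    return {
        "label": {},  # empty dict: run this check with its default settings
        "outlier": {"k": outlier_k},
        "near_duplicate": {"threshold": near_duplicate_threshold},
    }


def run_audit(data, label_name, features, pred_probs, issue_types):
    """Run only the selected checks.

    Returns the Datalab object, or None if the optional Datalab
    dependencies are not installed.
    """
    try:
        from cleanlab import Datalab  # needs: pip install "cleanlab[datalab]"

        lab = Datalab(data=data, label_name=label_name)
        lab.find_issues(
            features=features, pred_probs=pred_probs, issue_types=issue_types
        )
    except ImportError:
        return None
    return lab


# Select three checks and override one setting for the outlier check.
issue_types = build_issue_types(outlier_k=5)
print(sorted(issue_types))
```

Calling `run_audit` with your own dataset, feature embeddings, and predicted probabilities would then restrict the audit to these three checks, and `lab.report()` on the returned object summarizes what was found. Issue types left out of the dictionary are not searched for.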
Customizing issue types#
Guides for developers on creating a custom issue type that Datalab audits for alongside its built-in issue types.
Cleanlab Studio (Easy Mode)#
Cleanlab Studio is a fully automated platform that can detect the same data issues as this package, plus many more types of issues, without requiring you to do any machine learning or write any code. Beyond being 100x faster to use and producing more useful results, Cleanlab Studio also provides an intelligent data correction interface for quickly fixing the issues detected in your dataset (a single data scientist can fix millions of data points thanks to AI suggestions).
Cleanlab Studio also offers a powerful AutoML system (with foundation models) that is useful for more than improving data quality. With a few clicks, you can find and fix issues in your dataset, identify the best type of ML model and train/tune it, and deploy this model to serve accurate predictions for new data. You can also use the same AutoML system to auto-label large datasets (a single user can label millions of data points thanks to powerful foundation models). Try Cleanlab Studio for free!