fasttext

Text classification with fastText models that are compatible with cleanlab. This module allows you to easily find label issues in your text datasets.

You must have fastText installed: pip install fasttext.

Functions:

data_loader([fn, indices, label, batch_size])

Returns a generator, yielding two lists containing [labels], [text].

Classes:

FastTextClassifier(train_data_fn[, ...])

Instantiate a fastText classifier that is compatible with CleanLearning.

cleanlab.models.fasttext.data_loader(fn=None, indices=None, label='__label__', batch_size=1000)

Returns a generator, yielding two lists containing [labels], [text]. Items are always returned in the order in which they appear in the file, regardless of whether indices are provided.
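The batching behaviour described above can be illustrated with a minimal sketch. This is not cleanlab's implementation: it assumes each line starts with a single label token carrying the label prefix, followed by the text, and the helper name iter_fasttext_batches is hypothetical.

```python
import os
import tempfile


def iter_fasttext_batches(fn, indices=None, label="__label__", batch_size=1000):
    """Yield (labels, texts) batches in file order (hypothetical helper)."""
    wanted = None if indices is None else set(indices)
    labels, texts = [], []
    with open(fn, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if wanted is not None and i not in wanted:
                continue  # skip unselected lines; order still follows the file
            first, _, rest = line.rstrip("\n").partition(" ")
            # Assumption: exactly one label token at the start of each line.
            labels.append(first[len(label):] if first.startswith(label) else first)
            texts.append(rest)
            if len(labels) == batch_size:
                yield labels, texts
                labels, texts = [], []
    if labels:
        yield labels, texts  # final, possibly shorter, batch


def demo():
    """Write a tiny file, then request lines 2 and 0 in batches of one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("__label__pos great phone\n")
            f.write("__label__neg battery died\n")
            f.write("__label__pos fast shipping\n")
        return list(iter_fasttext_batches(path, indices=[2, 0], batch_size=1))
```

Although the indices are passed as [2, 0], the sketch yields line 0 before line 2, mirroring the file-order guarantee stated above.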

class cleanlab.models.fasttext.FastTextClassifier(train_data_fn, test_data_fn=None, labels=None, tmp_dir='', label='__label__', del_intermediate_data=True, kwargs_train_supervised={}, p_at_k=1, batch_size=1000)

Bases: BaseEstimator

Instantiate a fastText classifier that is compatible with CleanLearning.

Parameters:
  • train_data_fn (str) – File name of the training data in the format compatible with fastText.

  • test_data_fn (str, optional) – File name of the test data in the format compatible with fastText.
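fastText's supervised format places one example per line, with the label prefixed by __label__ (the default prefix) and followed by the text. A minimal sketch for producing such a file is shown here; the helper name write_fasttext_file and the sample reviews are hypothetical.

```python
import os
import tempfile


def write_fasttext_file(path, examples, label="__label__"):
    """Write (label, text) pairs as '<prefix><label> <text>' lines (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for lab, text in examples:
            # Collapse whitespace so a newline inside a text cannot split
            # one example across two lines.
            clean = " ".join(text.split())
            f.write(f"{label}{lab} {clean}\n")


def demo():
    """Write two examples to a temporary file and return its lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "train.txt")
        write_fasttext_file(path, [("pos", "great\nphone"), ("neg", "battery died")])
        with open(path, encoding="utf-8") as f:
            return f.read().splitlines()
```

The path of a file written this way is what would be passed as train_data_fn or test_data_fn.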

Methods:

fit([X, y, sample_weight])

Trains the fastText classifier.

predict_proba([X, train_data, return_labels])

Produces a probability matrix with examples on rows and classes on columns, where each row sums to 1 and captures the probability of the example belonging to each class.

predict([X, train_data, return_labels])

Predict labels of X.

score([X, y, sample_weight, k])

Compute the average precision @ k (single label) of the labels predicted from X and the true labels given by y.

get_params([deep])

Get parameters for this estimator.

set_params(**params)

Set the parameters of this estimator.

fit(X=None, y=None, sample_weight=None)

Trains the fastText classifier. Typical usage requires no arguments: clf.fit().

Parameters:
  • X (iterable, e.g. list or numpy array, default None) – Indices of the examples to use. When in doubt, leave as None, which defaults to range(len(data)).

  • y (None) – Leave this as None. It is a placeholder that keeps the signature compatible with scikit-learn.

  • sample_weight (None) – Leave this as None. It is a placeholder that keeps the signature compatible with scikit-learn.

predict_proba(X=None, train_data=True, return_labels=False)

Produces a probability matrix with examples on rows and classes on columns, where each row sums to 1 and captures the probability of the example belonging to each class.
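To make the shape of this output concrete, the sketch below builds such a matrix from per-example label scores. It is an illustration only, not cleanlab's code; the helper name to_prob_matrix and the input layout (one dict of label to score per example) are assumptions.

```python
def to_prob_matrix(predictions, classes):
    """Turn per-example {label: score} dicts into rows that sum to 1.

    Hypothetical helper: `predictions` holds one dict per example and
    `classes` fixes the column order of the result.
    """
    matrix = []
    for pred in predictions:
        # Missing classes get 0; negative scores are clipped to 0.
        row = [max(pred.get(c, 0.0), 0.0) for c in classes]
        total = sum(row)
        if total == 0:
            # No usable scores: fall back to a uniform distribution.
            row = [1.0 / len(classes)] * len(classes)
        else:
            row = [p / total for p in row]
        matrix.append(row)
    return matrix


probs = to_prob_matrix(
    [{"pos": 0.6, "neg": 0.2}, {"neg": 0.5, "pos": 0.5}, {}],
    classes=["neg", "pos"],
)
```

Each row of probs corresponds to one example and each column to one class, in the order given by classes.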

predict(X=None, train_data=True, return_labels=False)

Predict labels of X.

score(X=None, y=None, sample_weight=None, k=None)

Compute the average precision @ k (single label) of the labels predicted from X and the true labels given by y. score expects a y argument; in this case, y holds the noisy labels.
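One common reading of precision @ k for single-label data is the fraction of examples whose given label is among the k highest-probability classes. The sketch below implements that reading as an assumption; it is not taken from cleanlab's source, and precision_at_k is a hypothetical helper.

```python
def precision_at_k(prob_matrix, y, k=1):
    """Fraction of rows whose label index in y is among the top-k columns."""
    if not prob_matrix:
        raise ValueError("prob_matrix must contain at least one row")
    hits = 0
    for row, true_idx in zip(prob_matrix, y):
        # Column indices sorted by descending probability; keep the first k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += true_idx in top_k
    return hits / len(prob_matrix)


# Three examples, three classes; labels are column indices (possibly noisy).
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
noisy_labels = [0, 1, 0]
p1 = precision_at_k(probs, noisy_labels, k=1)
p2 = precision_at_k(probs, noisy_labels, k=2)
```

With k=1 only the first example's label matches its top prediction, while with k=2 every label falls within the top two classes.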

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.