fasttext

Text classification with fastText models that are compatible with cleanlab. This module allows you to easily find label issues in your text datasets.

You must have fastText installed: pip install fasttext.

Tips:

Check out our example using this class: fasttext_amazon_reviews
Our unit tests also provide basic usage examples.

Functions:

data_loader([fn, indices, label, batch_size])

Returns a generator, yielding two lists containing [labels], [text].

Classes:

FastTextClassifier(train_data_fn[, ...])

Instantiate a fastText classifier that is compatible with CleanLearning.

cleanlab.models.fasttext.data_loader(fn=None, indices=None, label='__label__', batch_size=1000)

Returns a generator, yielding two lists containing [labels], [text]. Items are always returned in the order in which they appear in the file, regardless of whether indices are provided.
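The behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: the function name iter_label_text_batches is hypothetical, and the sketch assumes one example per line, with the label token separated from the text by the first space.

```python
import os
import tempfile


def iter_label_text_batches(path, indices=None, label="__label__", batch_size=1000):
    """Yield (labels, texts) lists from a fastText-format file, in file order."""
    wanted = None if indices is None else set(indices)
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        row = 0
        for line in f:
            line = line.strip()
            if not line.startswith(label):
                continue  # skip blank or malformed lines
            if wanted is None or row in wanted:
                tag, _, text = line.partition(" ")
                labels.append(tag)
                texts.append(text)
                if len(labels) == batch_size:
                    yield labels, texts
                    labels, texts = [], []
            row += 1
    if labels:
        yield labels, texts  # final, possibly smaller, batch


lines = [
    "__label__pos great product",
    "__label__neg broke in a week",
    "__label__pos works well",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    # Indices are passed out of order, but rows come back in file order.
    batches = list(iter_label_text_batches(path, indices=[2, 0], batch_size=2))
```

Because the requested indices are turned into a set and the file is scanned once from top to bottom, the order of the indices argument has no effect on the order of the output.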

class cleanlab.models.fasttext.FastTextClassifier(train_data_fn, test_data_fn=None, labels=None, tmp_dir='', label='__label__', del_intermediate_data=True, kwargs_train_supervised={}, p_at_k=1, batch_size=1000)

Bases: BaseEstimator

Instantiate a fastText classifier that is compatible with CleanLearning.

Parameters:

train_data_fn (str) – File name of the training data in the format compatible with fastText.
test_data_fn (str, optional) – File name of the test data in the format compatible with fastText.
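As a rough illustration of a fastText-compatible file, the sketch below writes one example per line, each starting with the label prefix followed by the class name and then the text. The helper write_fasttext_file, the class names, and the review texts are all hypothetical and are used only for this example.

```python
import os
import tempfile


def write_fasttext_file(path, examples, label="__label__"):
    """Write (class_name, text) pairs as one '<label><class_name> <text>' line each."""
    with open(path, "w", encoding="utf-8") as f:
        for class_name, text in examples:
            # Each example must stay on a single line, so collapse whitespace runs.
            one_line = " ".join(text.split())
            f.write(f"{label}{class_name} {one_line}\n")


examples = [
    ("positive", "Great value,\nwould buy again"),
    ("negative", "Stopped working"),
]
with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.txt")
    write_fasttext_file(train_path, examples)
    with open(train_path, encoding="utf-8") as f:
        written = f.read().splitlines()
```

A file written this way could then be passed as train_data_fn, with the same label prefix supplied through the label parameter.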

Methods:

`fit`([X, y, sample_weight])	Trains the fastText classifier.
`predict_proba`([X, train_data, return_labels])	Produces a probability matrix with examples on rows and classes on columns, where each row sums to 1 and captures the probability of the example belonging to each class.
`predict`([X, train_data, return_labels])	Predict labels of X
`score`([X, y, sample_weight, k])	Compute the average precision @ k (single label) of the labels predicted from X and the true labels given by y.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.

fit(X=None, y=None, sample_weight=None)

Trains the fastText classifier. Typical usage requires no arguments: clf.fit().

Parameters:

X (iterable, e.g. list or numpy array, default None) – The indices of the data to use. When in doubt, leave as None, which defaults to range(len(data)).
y (None) – Leave this as None. It is a placeholder required by the scikit-learn estimator interface.
sample_weight (None) – Leave this as None. It is a placeholder required by the scikit-learn estimator interface.
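A minimal usage sketch follows, assuming cleanlab and fastText are installed and that train_path points to an existing fastText-format file. The wrapper function fit_and_predict is hypothetical; only the constructor argument and the method defaults shown in the signatures on this page are used.

```python
def fit_and_predict(train_path):
    """Train on a fastText-format file and return class probabilities for it."""
    # Imported inside the function so this sketch can be loaded even where
    # cleanlab or fastText is not installed.
    from cleanlab.models.fasttext import FastTextClassifier

    clf = FastTextClassifier(train_data_fn=train_path)
    clf.fit()  # typical usage: no arguments
    return clf.predict_proba()  # defaults: X=None, train_data=True
```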

predict_proba(X=None, train_data=True, return_labels=False)

Produces a probability matrix with examples on rows and classes on columns, where each row sums to 1 and captures the probability of the example belonging to each class.
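To make the shape of such a matrix concrete, here is a small sketch with made-up values (not real model output) for three examples and two classes. It checks the row-sum property and picks the most likely class per example.

```python
import math

# Illustrative values only: 3 examples (rows), 2 classes (columns).
pred_probs = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.55, 0.45],
]

# Each row is a probability distribution over the classes, so it sums to 1.
row_sums_ok = all(math.isclose(sum(row), 1.0) for row in pred_probs)

# The most likely class for each example is the column with the largest value.
predicted = [max(range(len(row)), key=row.__getitem__) for row in pred_probs]
```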

predict(X=None, train_data=True, return_labels=False)

Predicts the labels of X.

score(X=None, y=None, sample_weight=None, k=None)

Computes the average precision @ k (single label) of the labels predicted from X against the labels given by y. score expects a y argument; in this case, y holds the noisy labels.
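One way to read precision @ k in the single-label setting is sketched below. It rests on an assumption not stated on this page: an example counts as a hit when its given label is among the k classes with the highest predicted probability, and the score is the fraction of hits. The function name and the data are hypothetical.

```python
def precision_at_k_single_label(pred_probs, labels, k=1):
    """Fraction of examples whose given label is among the top-k predicted classes."""
    hits = 0
    for probs, label in zip(pred_probs, labels):
        # Class indices sorted from most to least probable, truncated to k.
        top_k = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Illustrative values only: 3 examples, 3 classes, with given (noisy) labels.
pred_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],
]
noisy_labels = [0, 2, 1]

p_at_1 = precision_at_k_single_label(pred_probs, noisy_labels, k=1)
p_at_2 = precision_at_k_single_label(pred_probs, noisy_labels, k=2)
```

With k=1 only the first example is a hit; with k=2 every given label falls within the top two classes.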

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict) – Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self (estimator instance) – Estimator instance.
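The <component>__<parameter> convention can be seen with any scikit-learn estimator. The sketch below assumes scikit-learn is installed and uses a Pipeline with step names "scale" and "clf" chosen purely for illustration; these are not part of this module.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Nested parameters are addressed as <step name>__<parameter name>.
returned = pipe.set_params(clf__C=0.5, scale__with_mean=False)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
```

Since set_params returns the estimator itself, calls can be chained, for example pipe.set_params(clf__C=0.5).get_params().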