token_classification_utils#

Helper methods used internally in cleanlab.token_classification

Functions:

get_sentence(words)

Get the sentence formed by a list of words, with minor processing for readability

filter_sentence(sentences[, condition])

Filter sentences by some condition, and return the filter mask

process_token(token[, replace])

Replace special characters in the token

mapping(entities, maps)

Map a list of entities to their corresponding mapped entities

merge_probs(probs, maps)

Merge model-predicted probabilities according to the desired class mapping

color_sentence(sentence, word)

Search for a given word in the sentence and return the sentence with every occurrence of that word colored red

cleanlab.internal.token_classification_utils.get_sentence(words)[source]#

Get the sentence formed by a list of words, with minor processing for readability

Parameters:

words (List[str]) – list of word-level tokens

Return type:

str

Returns:

sentence – sentence formed by list of word-level tokens

Examples

>>> from cleanlab.internal.token_classification_utils import get_sentence
>>> words = ["This", "is", "a", "sentence", "."]
>>> get_sentence(words)
'This is a sentence.'
cleanlab.internal.token_classification_utils.filter_sentence(sentences, condition=None)[source]#

Filter sentences by some condition, and return the filter mask

Parameters:
  • sentences (List[str]) – list of sentences

  • condition (Optional[Callable[[str], bool]]) – sentence filtering condition

Return type:

Tuple[List[str], List[bool]]

Returns:

  • sentences – filtered list of sentences satisfying the condition

  • mask – boolean mask such that mask[i] == True if the i'th sentence is included in the filtered list, otherwise mask[i] == False

Examples

>>> from cleanlab.internal.token_classification_utils import filter_sentence
>>> sentences = ["Short sentence.", "This is a longer sentence."]
>>> condition = lambda x: len(x.split()) > 2
>>> long_sentences, _ = filter_sentence(sentences, condition)
>>> long_sentences
['This is a longer sentence.']
>>> document = ["# Headline", "Sentence 1.", "&", "Sentence 2."]
>>> sentences, mask = filter_sentence(document)
>>> sentences, mask
(['Sentence 1.', 'Sentence 2.'], [False, True, False, True])
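
The mask aligns index-by-index with the input sentences, which makes it convenient for filtering parallel lists. A small sketch using the condition from the first example (the per-sentence labels here are hypothetical, purely for illustration):

>>> sentences = ["Short sentence.", "This is a longer sentence."]
>>> labels = [[0, 1], [0, 0, 0, 2, 1]]
>>> long_sentences, mask = filter_sentence(sentences, lambda x: len(x.split()) > 2)
>>> [l for l, keep in zip(labels, mask) if keep]
[[0, 0, 0, 2, 1]]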
cleanlab.internal.token_classification_utils.process_token(token, replace=[('#', '')])[source]#

Replace special characters in the token

Parameters:
  • token (str) – token which potentially contains special characters

  • replace (List[Tuple[str, str]]) – list of tuples (s1, s2), where all occurrences of s1 are replaced by s2

Return type:

str

Returns:

processed_token – processed token with its special characters replaced

Note

Replacements only apply to characters present in the original input token; characters introduced by a replacement are not themselves replaced by later rules (see the second example below).

Examples

>>> from cleanlab.internal.token_classification_utils import process_token
>>> token = "#Comment"
>>> process_token("#Comment")
'Comment'

Specify custom replacement rules

>>> replace = [("C", "a"), ("a", "C")]
>>> process_token("Cleanlab", replace)
'aleCnlCb'
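
Because the default rule [("#", "")] removes every "#", subword markers produced by wordpiece-style tokenizers are stripped as well (a small illustrative case):

>>> process_token("##ing")
'ing'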
cleanlab.internal.token_classification_utils.mapping(entities, maps)[source]#

Map a list of entities to their corresponding mapped entities

Parameters:
  • entities (List[int]) – a list of given entities

  • maps (List[int]) – a list defining the mapping, such that entity i is mapped to maps[i]

Return type:

List[int]

Returns:

mapped_entities – a list of mapped entities

Examples

>>> unique_identities = [0, 1, 2, 3, 4]  # ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
>>> maps = [0, 1, 1, 2, 2]  # ["O", "PER", "PER", "LOC", "LOC"]
>>> mapping(unique_identities, maps)
[0, 1, 1, 2, 2]  # ["O", "PER", "PER", "LOC", "LOC"]
>>> mapping([0, 0, 4, 4, 3, 4, 0, 2], maps)
[0, 0, 2, 2, 2, 2, 0, 1]  # ["O", "O", "LOC", "LOC", "LOC", "LOC", "O", "PER"]
cleanlab.internal.token_classification_utils.merge_probs(probs, maps)[source]#

Merge model-predicted probabilities according to the desired class mapping

Parameters:
  • probs (ndarray[Any, dtype[np.floating[T]]]) – A 2D np.array of shape (N, K), where N is the number of tokens, and K is the number of classes for the model

  • maps (List[int]) – a list of mapped indices, such that the probability of a token belonging to the i'th class is merged (summed) into column maps[i] of the output. If maps[i] == -1, the i'th column of probs is ignored. If np.any(maps == -1), the returned probabilities are re-normalized.

Return type:

ndarray[Any, dtype[np.floating[T]]]

Returns:

probs_merged – A 2D np.array of shape (N, K'), where K' is the number of new classes. Probabilities are merged and re-normalized if necessary.

Examples

>>> import numpy as np
>>> from cleanlab.internal.token_classification_utils import merge_probs
>>> probs = np.array([
...     [0.55, 0.0125, 0.0375, 0.1, 0.3],
...     [0.1, 0.8, 0, 0.075, 0.025],
... ])
>>> maps = [0, 1, 1, 2, 2]
>>> merge_probs(probs, maps)
array([[0.55, 0.05, 0.4 ],
       [0.1 , 0.8 , 0.1 ]])
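
Continuing the example above, a sketch of the maps[i] == -1 case, assuming the documented behavior that ignored columns are dropped and the remaining probabilities are re-normalized. Here class 0 (e.g. an "O" tag) is ignored:

>>> maps = [-1, 0, 0, 1, 1]
>>> np.round(merge_probs(probs, maps), 3)
array([[0.111, 0.889],
       [0.889, 0.111]])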
cleanlab.internal.token_classification_utils.color_sentence(sentence, word)[source]#

Search for a given word in the sentence and return the sentence with every occurrence of that word colored red

Parameters:
  • sentence (str) – a sentence where the word is searched

  • word (str) – keyword to find in sentence. Assumes the word exists in the sentence.

Return type:

str

Returns:

colored_sentence – sentence where every occurrence of the word is colored red, using termcolor.colored

Examples

>>> from cleanlab.internal.token_classification_utils import color_sentence
>>> sentence = "This is a sentence."
>>> word = "sentence"
>>> color_sentence(sentence, word)
'This is a sentence.'

Also works for multiple occurrences of the word

>>> document = "This is a sentence. This is another sentence."
>>> word = "sentence"
>>> color_sentence(document, word)
'This is a sentence. This is another sentence.'
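
The red coloring is applied with ANSI escape codes via termcolor.colored, so it is not visible in the plain output above. Inspecting the raw string with repr typically reveals escape sequences such as '\x1b[31m' and '\x1b[0m' around each match, though recent termcolor versions may suppress coloring when output is not attached to a terminal or when NO_COLOR is set:

>>> print(repr(color_sentence(sentence, word)))  # doctest: +SKIP
'This is a \x1b[31msentence\x1b[0m.'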