token_classification_utils#

Helper methods used internally in cleanlab.token_classification

Functions:

get_sentence(words)

Get the sentence formed by a list of words, with minor processing for readability

filter_sentence(sentences[, condition])

Filter sentences by some condition, and return the filter mask

process_token(token[, replace])

Replace special characters in the token

mapping(entities, maps)

Map a list of entities to their corresponding mapped entities

merge_probs(probs, maps)

Merge model-predicted probabilities according to the desired class mapping

color_sentence(sentence, word)

Search for a given word in the sentence and return the sentence with every occurrence of that word colored red

cleanlab.internal.token_classification_utils.get_sentence(words)[source]#

Get the sentence formed by a list of words, with minor processing for readability

Parameters:

words (List[str]) – list of word-level tokens

Return type:

str

Returns:

sentence – sentence formed by list of word-level tokens

Examples

>>> from cleanlab.internal.token_classification_utils import get_sentence
>>> words = ["This", "is", "a", "sentence", "."]
>>> get_sentence(words)
'This is a sentence.'
cleanlab.internal.token_classification_utils.filter_sentence(sentences, condition=None)[source]#

Filter sentences by some condition, and return the filter mask

Parameters:
  • sentences (List[str]) – list of sentences

  • condition (Optional[Callable[[str], bool]]) – sentence filtering condition

Return type:

Tuple[List[str], List[bool]]

Returns:

  • sentences – filtered list of sentences satisfying the condition

  • mask – boolean mask such that mask[i] == True if the i'th sentence is included in the filtered list, otherwise mask[i] == False

Examples

>>> from cleanlab.internal.token_classification_utils import filter_sentence
>>> sentences = ["Short sentence.", "This is a longer sentence."]
>>> condition = lambda x: len(x.split()) > 2
>>> long_sentences, _ = filter_sentence(sentences, condition)
>>> long_sentences
['This is a longer sentence.']
>>> document = ["# Headline", "Sentence 1.", "&", "Sentence 2."]
>>> sentences, mask = filter_sentence(document)
>>> sentences, mask
(['Sentence 1.', 'Sentence 2.'], [False, True, False, True])
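
The mask aligns index-by-index with the input sentences, which makes it convenient for filtering parallel lists. A small sketch using the condition from the first example (the per-sentence labels here are hypothetical, purely for illustration):

>>> sentences = ["Short sentence.", "This is a longer sentence."]
>>> labels = [[0, 1], [0, 0, 0, 2, 1]]
>>> long_sentences, mask = filter_sentence(sentences, lambda x: len(x.split()) > 2)
>>> [l for l, keep in zip(labels, mask) if keep]
[[0, 0, 0, 2, 1]]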
cleanlab.internal.token_classification_utils.process_token(token, replace=[('#', '')])[source]#

Replace special characters in the token

Parameters:
  • token (str) – token which potentially contains special characters

  • replace (List[Tuple[str, str]]) – list of tuples (s1, s2), where all occurrences of s1 are replaced by s2

Return type:

str

Returns:

processed_token – processed token with its special characters replaced

Note

Replacements only apply to characters present in the original input token; characters introduced by a replacement are not themselves replaced by later rules (see the second example below).

Examples

>>> from cleanlab.internal.token_classification_utils import process_token
>>> token = "#Comment"
>>> process_token("#Comment")
'Comment'

Specify custom replacement rules

>>> replace = [("C", "a"), ("a", "C")]
>>> process_token("Cleanlab", replace)
'aleCnlCb'
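
Because the default rule [("#", "")] removes every "#", subword markers produced by wordpiece-style tokenizers are stripped as well (a small illustrative case):

>>> process_token("##ing")
'ing'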
cleanlab.internal.token_classification_utils.mapping(entities, maps)[source]#

Map a list of entities to their corresponding mapped entities

Parameters:
  • entities (List[int]) – a list of given entities

  • maps (List[int]) – a list defining the mapping, such that entity i is mapped to maps[i]

Return type:

List[int]

Returns:

mapped_entities – a list of mapped entities

Examples

>>> unique_identities = [0, 1, 2, 3, 4]  # ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
>>> maps = [0, 1, 1, 2, 2]  # ["O", "PER", "PER", "LOC", "LOC"]
>>> mapping(unique_identities, maps)
[0, 1, 1, 2, 2]  # ["O", "PER", "PER", "LOC", "LOC"]
>>> mapping([0, 0, 4, 4, 3, 4, 0, 2], maps)
[0, 0, 2, 2, 2, 2, 0, 1]  # ["O", "O", "LOC", "LOC", "LOC", "LOC", "O", "PER"]
cleanlab.internal.token_classification_utils.merge_probs(probs, maps)[source]#

Merge model-predicted probabilities according to the desired class mapping

Parameters:
  • probs (ndarray[Any, dtype[np.floating[T]]]) – A 2D np.array of shape (N, K), where N is the number of tokens, and K is the number of classes for the model

  • maps (List[int]) – a list of mapped indices, such that the probability of a token belonging to the i'th class is merged (summed) into column maps[i] of the output. If maps[i] == -1, the i'th column of probs is ignored. If np.any(maps == -1), the returned probabilities are re-normalized.

Return type:

ndarray[Any, dtype[np.floating[T]]]

Returns:

probs_merged – A 2D np.array of shape (N, K'), where K' is the number of new classes. Probabilities are merged and re-normalized if necessary.

Examples

>>> import numpy as np
>>> from cleanlab.internal.token_classification_utils import merge_probs
>>> probs = np.array([
...     [0.55, 0.0125, 0.0375, 0.1, 0.3],
...     [0.1, 0.8, 0, 0.075, 0.025],
... ])
>>> maps = [0, 1, 1, 2, 2]
>>> merge_probs(probs, maps)
array([[0.55, 0.05, 0.4 ],
       [0.1 , 0.8 , 0.1 ]])
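
Continuing the example above, a sketch of the maps[i] == -1 case, assuming the documented behavior that ignored columns are dropped and the remaining probabilities are re-normalized. Here class 0 (e.g. an "O" tag) is ignored:

>>> maps = [-1, 0, 0, 1, 1]
>>> np.round(merge_probs(probs, maps), 3)
array([[0.111, 0.889],
       [0.889, 0.111]])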
cleanlab.internal.token_classification_utils.color_sentence(sentence, word)[source]#

Search for a given word in the sentence and return the sentence with every occurrence of that word colored red

Parameters:
  • sentence (str) – a sentence where the word is searched

  • word (str) – keyword to find in sentence. Assumes the word exists in the sentence.

Return type:

str

Returns:

colored_sentence – sentence where every occurrence of the word is colored red, using termcolor.colored

Examples

>>> from cleanlab.internal.token_classification_utils import color_sentence
>>> sentence = "This is a sentence."
>>> word = "sentence"
>>> color_sentence(sentence, word)
'This is a sentence.'

Also works for multiple occurrences of the word

>>> document = "This is a sentence. This is another sentence."
>>> word = "sentence"
>>> color_sentence(document, word)
'This is a sentence. This is another sentence.'
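
The red coloring is applied with ANSI escape codes via termcolor.colored, so it is not visible in the plain output above. Inspecting the raw string with repr typically reveals escape sequences such as '\x1b[31m' and '\x1b[0m' around each match, though recent termcolor versions may suppress coloring when output is not attached to a terminal or when NO_COLOR is set:

>>> print(repr(color_sentence(sentence, word)))  # doctest: +SKIP
'This is a \x1b[31msentence\x1b[0m.'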