token_classification_utils

Helper methods used internally in cleanlab.token_classification

Functions:

- get_sentence – Get the sentence formed by a list of words, with minor processing for readability
- filter_sentence – Filter sentences based on some condition, and return the filter mask
- process_token – Replace special characters in a token
- mapping – Map a list of entities to their corresponding mapped entities
- merge_probs – Merge model-predicted probabilities according to a desired mapping
- color_sentence – Search for a given token in the sentence and return the sentence with every occurrence of the token colored red
- cleanlab.internal.token_classification_utils.get_sentence(words)

Get the sentence formed by a list of words, with minor processing for readability.

- Parameters:
  words (List[str]) – list of word-level tokens
- Return type:
  str
- Returns:
  sentence – sentence formed by the list of word-level tokens

Examples

>>> from cleanlab.internal.token_classification_utils import get_sentence
>>> words = ["This", "is", "a", "sentence", "."]
>>> get_sentence(words)
'This is a sentence.'
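The "minor processing" mainly concerns spacing around punctuation. As a minimal sketch of this kind of joining logic (an illustration under that assumption, not the library's exact implementation):

>>> from string import punctuation
>>> def get_sentence_sketch(words):
...     # Join word-level tokens; punctuation and "'s" attach to the
...     # preceding word instead of getting a leading space.
...     sentence = ""
...     for word in words:
...         if word in punctuation or word == "'s":
...             sentence += word
...         else:
...             sentence += " " + word
...     return sentence.strip()
>>> get_sentence_sketch(["This", "is", "a", "sentence", "."])
'This is a sentence.'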
- cleanlab.internal.token_classification_utils.filter_sentence(sentences, condition=None)

Filter sentences based on some condition, and return the filter mask.

- Parameters:
  sentences (List[str]) – list of sentences
  condition (Optional[Callable[[str], bool]]) – sentence filtering condition
- Return type:
  Tuple[List[str], List[bool]]
- Returns:
  sentences – list of filtered sentences
  mask – boolean mask such that mask[i] == True if the i'th sentence is included in the filtered sentences, otherwise mask[i] == False

Examples

>>> from cleanlab.internal.token_classification_utils import filter_sentence
>>> sentences = ["Short sentence.", "This is a longer sentence."]
>>> condition = lambda x: len(x.split()) > 2
>>> long_sentences, _ = filter_sentence(sentences, condition)
>>> long_sentences
['This is a longer sentence.']
>>> document = ["# Headline", "Sentence 1.", "&", "Sentence 2."]
>>> sentences, mask = filter_sentence(document)
>>> sentences, mask
(['Sentence 1.', 'Sentence 2.'], [False, True, False, True])
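The returned mask makes it easy to keep other per-sentence data aligned with the filtered sentences. Reusing the documented example above (the labels list here is hypothetical):

>>> from cleanlab.internal.token_classification_utils import filter_sentence
>>> document = ["# Headline", "Sentence 1.", "&", "Sentence 2."]
>>> labels = [[0], [0, 1], [0], [0, 3]]  # hypothetical per-sentence token labels
>>> _, mask = filter_sentence(document)
>>> [l for l, m in zip(labels, mask) if m]
[[0, 1], [0, 3]]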
- cleanlab.internal.token_classification_utils.process_token(token, replace=[('#', '')])

Replace special characters in the token.

- Parameters:
  token (str) – token which potentially contains special characters
  replace (List[Tuple[str, str]]) – list of tuples (s1, s2), where all occurrences of s1 are replaced by s2
- Return type:
  str
- Returns:
  processed_token – processed token whose special characters have been replaced

Note
Replacements only apply to characters in the original input token.

Examples

>>> from cleanlab.internal.token_classification_utils import process_token
>>> token = "#Comment"
>>> process_token("#Comment")
'Comment'

Specify custom replacement rules

>>> replace = [("C", "a"), ("a", "C")]
>>> process_token("Cleanlab", replace)
'aleCnlCb'
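The note above matters when replacement rules overlap: rules are decided against the original token, so ("C", "a") followed by ("a", "C") swaps the two characters rather than undoing itself. A sketch of this behavior for single-character rules (an illustration, not the library's implementation):

>>> def process_token_sketch(token, replace=[("#", "")]):
...     # Decide every replacement against the original characters, so a
...     # later rule never rewrites the output of an earlier rule.
...     out = list(token)
...     for s1, s2 in replace:
...         for i, ch in enumerate(token):
...             if ch == s1:
...                 out[i] = s2
...     return "".join(out)
>>> process_token_sketch("Cleanlab", [("C", "a"), ("a", "C")])
'aleCnlCb'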
- cleanlab.internal.token_classification_utils.mapping(entities, maps)

Map a list of entities to their corresponding mapped entities.

- Parameters:
  entities (List[int]) – a list of given entities
  maps (List[int]) – a list of mapped entities, such that the i'th indexed token should be mapped to maps[i]
- Return type:
  List[int]
- Returns:
  mapped_entities – a list of mapped entities

Examples

>>> from cleanlab.internal.token_classification_utils import mapping
>>> unique_identities = [0, 1, 2, 3, 4]  # ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
>>> maps = [0, 1, 1, 2, 2]  # ["O", "PER", "PER", "LOC", "LOC"]
>>> mapping(unique_identities, maps)
[0, 1, 1, 2, 2]  # ["O", "PER", "PER", "LOC", "LOC"]
>>> mapping([0, 0, 4, 4, 3, 4, 0, 2], maps)
[0, 0, 2, 2, 2, 2, 0, 1]  # ["O", "O", "LOC", "LOC", "LOC", "LOC", "O", "PER"]
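Since each entry is looked up independently, the same maps can collapse BIO-style classes across nested per-sentence labels as well (the label lists below are hypothetical):

>>> from cleanlab.internal.token_classification_utils import mapping
>>> maps = [0, 1, 1, 2, 2]
>>> per_sentence_labels = [[0, 1, 2, 0], [3, 4, 0]]  # hypothetical token labels for two sentences
>>> [mapping(labels, maps) for labels in per_sentence_labels]
[[0, 1, 1, 0], [2, 2, 0]]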
- cleanlab.internal.token_classification_utils.merge_probs(probs, maps)

Merge model-predicted probabilities according to the desired mapping.

- Parameters:
  probs (np.ndarray of floats) – A 2D np.array of shape (N, K), where N is the number of tokens and K is the number of classes for the model
  maps (List[int]) – a list of mapped indices, such that the probability of the token being in the i'th class is mapped to the maps[i] index. If maps[i] == -1, the i'th column of probs is ignored. If np.any(maps == -1), the returned probabilities are re-normalized.
- Return type:
  np.ndarray of floats
- Returns:
  probs_merged – A 2D np.array of shape (N, K'), where K' is the number of new classes. Probabilities are merged and re-normalized if necessary.

Examples

>>> import numpy as np
>>> from cleanlab.internal.token_classification_utils import merge_probs
>>> probs = np.array([
...     [0.55, 0.0125, 0.0375, 0.1, 0.3],
...     [0.1, 0.8, 0, 0.075, 0.025],
... ])
>>> maps = [0, 1, 1, 2, 2]
>>> merge_probs(probs, maps)
array([[0.55, 0.05, 0.4 ],
       [0.1 , 0.8 , 0.1 ]])
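The maps[i] == -1 case is not shown above. A sketch of the documented semantics, written here for illustration rather than taken from the library: ignored columns are dropped, the rest are summed into their mapped class, and each row is re-normalized.

>>> import numpy as np
>>> def merge_probs_sketch(probs, maps):
...     # Sum each original column into its mapped class; columns mapped
...     # to -1 are dropped entirely.
...     maps = np.asarray(maps)
...     merged = np.zeros((probs.shape[0], maps.max() + 1))
...     for old_class, new_class in enumerate(maps):
...         if new_class != -1:
...             merged[:, new_class] += probs[:, old_class]
...     if (maps == -1).any():
...         # Dropped mass means rows no longer sum to 1; re-normalize.
...         merged /= merged.sum(axis=1, keepdims=True)
...     return merged
>>> probs = np.array([[0.55, 0.0125, 0.0375, 0.1, 0.3]])
>>> merge_probs_sketch(probs, [-1, 0, 0, 1, 1]).round(4)
array([[0.1111, 0.8889]])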
- cleanlab.internal.token_classification_utils.color_sentence(sentence, word)

Search for a given token in the sentence and return the sentence with every occurrence of the token colored red.

- Parameters:
  sentence (str) – a sentence in which the word is searched for
  word (str) – keyword to find in sentence. Assumes the word exists in the sentence.
- Return type:
  str
- Returns:
  colored_sentence – sentence where every occurrence of the word is colored red, using termcolor.colored

Examples

>>> from cleanlab.internal.token_classification_utils import color_sentence
>>> sentence = "This is a sentence."
>>> word = "sentence"
>>> color_sentence(sentence, word)
'This is a \x1b[31msentence\x1b[0m.'

Also works for multiple occurrences of the word

>>> document = "This is a sentence. This is another sentence."
>>> word = "sentence"
>>> color_sentence(document, word)
'This is a \x1b[31msentence\x1b[0m. This is another \x1b[31msentence\x1b[0m.'
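The escape codes in the output render as red text in a terminal; each match is wrapped the same way termcolor.colored wraps a string (note that recent termcolor versions may suppress the codes when output is not attached to a terminal):

>>> from termcolor import colored
>>> colored("sentence", "red")
'\x1b[31msentence\x1b[0m'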