Tokenization

class bistring.Token(text, start, end)

Bases: object

A token extracted from a string.

Parameters

text (Union[str, bistr]) – The text of this token.
start (int) – The starting index of this token.
end (int) – The ending index of this token (exclusive).

text: bistring._bistr.bistr: The actual text of the token.

start: int: The start position of the token.

end: int: The end position of the token.

property original: str

The original value of this token.

Return type: str

property modified: str

The modified value of this token.

Return type: str

classmethod slice(text, start, end)

Create a Token from a slice of a bistr.

Parameters

text (Union[str, bistr]) – The (bi)string to slice.
start (int) – The starting index of the token.
end (int) – The ending index of the token.

Return type: Token
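The relationship between a slice and the resulting token can be sketched without bistring. `SimpleToken` below is a hypothetical stand-in that uses plain `str` instead of `bistr`, and it assumes the token text is simply `text[start:end]`; it is an illustration, not the library's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleToken:
    """Hypothetical stand-in for a token: its text plus its position."""
    text: str
    start: int
    end: int

    @classmethod
    def slice(cls, text: str, start: int, end: int) -> "SimpleToken":
        # Assumption: the token text is the slice of the source text,
        # and the slice indices are kept as the token's position.
        return cls(text[start:end], start, end)


token = SimpleToken.slice("hello, world!", 7, 12)
print(token)
# SimpleToken(text='world', start=7, end=12)
```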

class bistring.Tokenization(text, tokens)

Bases: object

A string and its tokenization.

Parameters

text (Union[str, bistr]) – The text from which the tokens have been extracted.
tokens (Iterable[Token]) – The tokens extracted from the text.

text: bistring._bistr.bistr: The text that was tokenized.

alignment: bistring._alignment.Alignment: The alignment from text indices to token indices.

classmethod infer(text, tokens)

Infer a Tokenization from a sequence of tokens.

>>> tokens = Tokenization.infer('hello, world!', ['hello', 'world'])
>>> tokens[0]
Token(bistr('hello'), start=0, end=5)
>>> tokens[1]
Token(bistr('world'), start=7, end=12)

Because the same token text can occur more than once in the source string, inferred positions may be ambiguous. It is much better to use a Tokenizer, or some other method that produces Tokens with their positions set explicitly.

Return type: Tokenization
Returns: The inferred tokenization, with token positions found by simple forward search.
Raises: ValueError if the tokens can’t be found in the source string.
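The forward search described above can be sketched in plain Python. This illustrates the idea only and is not bistring's implementation: `infer_spans` is a hypothetical helper that works on plain `str` values and returns `(start, end)` pairs instead of `Token` objects.

```python
from typing import Iterable, List, Tuple


def infer_spans(text: str, tokens: Iterable[str]) -> List[Tuple[int, int]]:
    """Locate each token by searching forward from the end of the previous one."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after index {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end  # the next search starts where this token ended
    return spans


print(infer_spans("hello, world!", ["hello", "world"]))
# [(0, 5), (7, 12)]

# Ambiguity: the first match wins, even if the later "a" was intended.
print(infer_spans("a b a", ["a"]))
# [(0, 1)]
```

The second call shows why explicit positions are preferable: a forward search cannot tell which occurrence of a repeated token was meant.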

__getitem__(index: int) → bistring._token.Token

__getitem__(index: slice) → bistring._token.Tokenization

Indexing a Tokenization returns the nth token:

>>> tokens = Tokenization.infer(
...     "The quick, brown fox",
...     ["The", "quick", "brown", "fox"],
... )
>>> tokens[0]
Token(bistr('The'), start=0, end=3)

Slicing a Tokenization returns a new one with the requested slice of tokens:

>>> tokens = tokens[1:-1]
>>> tokens[0]
Token(bistr('quick'), start=4, end=9)

Return type: Union[Token, Tokenization]

substring(*args)

Map a span of tokens to the corresponding substring. With no arguments, returns the substring from the first to the last token.

Return type: bistr

text_bounds(*args)

Map a span of tokens to the bounds of the corresponding text. With no arguments, returns the bounds from the first to the last token.

Return type: Tuple[int, int]
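As a rough illustration of mapping a token span to text, the sketch below works on a plain list of `(start, end)` positions. The helper name `token_span_to_text_bounds` and the slice-like `(start, stop)` argument convention are assumptions made for this example; they are not taken from the library.

```python
from typing import List, Tuple

TEXT = "The quick, brown fox"
# Text positions of the tokens "The", "quick", "brown", "fox".
SPANS = [(0, 3), (4, 9), (11, 16), (17, 20)]


def token_span_to_text_bounds(
    spans: List[Tuple[int, int]], start: int, stop: int
) -> Tuple[int, int]:
    """Text bounds covering tokens start..stop-1 (stop is exclusive)."""
    if start >= stop:
        raise ValueError("empty token span")
    # From the start of the first token to the end of the last one.
    return spans[start][0], spans[stop - 1][1]


lo, hi = token_span_to_text_bounds(SPANS, 1, 3)
print((lo, hi))
# (4, 16)
print(TEXT[lo:hi])
# quick, brown
```

The substring includes the text between the tokens (here the comma and space), since it runs from the start of the first token to the end of the last.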

original_bounds(*args)

Map a span of tokens to the bounds of the corresponding original text. With no arguments, returns the bounds from the first to the last token.

Return type: Tuple[int, int]

bounds_for_text(*args)

Map a span of text to the bounds of the corresponding span of tokens.

Return type: Tuple[int, int]
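The reverse direction, from a span of text to a range of token indices, might look like the following sketch. It assumes that a token belongs to the result when it overlaps the text span at all; `text_span_to_token_bounds` is a hypothetical helper, and the library's exact edge-case behavior may differ.

```python
from typing import List, Optional, Tuple

# Text positions of the tokens "The", "quick", "brown", "fox"
# in "The quick, brown fox".
SPANS = [(0, 3), (4, 9), (11, 16), (17, 20)]


def text_span_to_token_bounds(
    spans: List[Tuple[int, int]], start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Token index range [first, last) of tokens overlapping text[start:end]."""
    overlapping = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
    if not overlapping:
        return None  # the span touches no token, e.g. only punctuation
    return overlapping[0], overlapping[-1] + 1


print(text_span_to_token_bounds(SPANS, 0, 13))
# (0, 3)
print(text_span_to_token_bounds(SPANS, 9, 11))
# None
```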

bounds_for_original(*args)

Map a span of original text to the bounds of the corresponding span of tokens.

Return type: Tuple[int, int]

slice_by_text(*args)

Map a span of text to the corresponding span of tokens.

Return type: Tokenization

slice_by_original(*args)

Map a span of the original text to the corresponding span of tokens.

Return type: Tokenization

snap_text_bounds(*args)

Expand a span of text to align it with token boundaries.

Return type: Tuple[int, int]

snap_original_bounds(*args)

Expand a span of original text to align it with token boundaries.

Return type: Tuple[int, int]
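Snapping can be pictured as widening a span until it starts and ends on token boundaries. The sketch below is a minimal illustration on plain `(start, end)` positions, assuming sorted, non-overlapping tokens; `snap_to_tokens` is a hypothetical helper, and leaving a span unchanged when it touches no token is a choice made for this example.

```python
from typing import List, Tuple

# Text positions of the tokens "The", "quick", "brown", "fox"
# in "The quick, brown fox".
SPANS = [(0, 3), (4, 9), (11, 16), (17, 20)]


def snap_to_tokens(
    spans: List[Tuple[int, int]], start: int, end: int
) -> Tuple[int, int]:
    """Expand [start, end) so it covers every token it overlaps in full."""
    touched = [(s, e) for s, e in spans if s < end and e > start]
    if not touched:
        return start, end  # nothing to align with
    # Widen to the start of the first touched token and the end of the last.
    return min(start, touched[0][0]), max(end, touched[-1][1])


print(snap_to_tokens(SPANS, 2, 13))
# (0, 16)
print(snap_to_tokens(SPANS, 4, 9))
# (4, 9)
```

A span that already sits on token boundaries, as in the second call, comes back unchanged.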