Tokenization
- class bistring.Token(text, start, end)
Bases: object
A token extracted from a string.
- Parameters
- text: bistring._bistr.bistr
The actual text of the token.
- start: int
The position of the first character of the token in the text.
- end: int
The position one past the last character of the token in the text.
- class bistring.Tokenization(text, tokens)
Bases: object
A string and its tokenization.
- Parameters
- text: bistring._bistr.bistr
The text that was tokenized.
- alignment: bistring._alignment.Alignment
The alignment from text indices to token indices.
- classmethod infer(text, tokens)
Infer a Tokenization from a sequence of tokens.
>>> tokens = Tokenization.infer('hello, world!', ['hello', 'world'])
>>> tokens[0]
Token(bistr('hello'), start=0, end=5)
>>> tokens[1]
Token(bistr('world'), start=7, end=12)
Due to the possibility of ambiguity, it is much better to use a Tokenizer or some other method of producing Tokens with their positions explicitly set.
- Return type
Tokenization
- Returns
The inferred tokenization, with token positions found by simple forward search.
- Raises
ValueError – If the tokens can’t be found in the source string.
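The forward-search strategy can be sketched in plain Python. This is a simplified illustration rather than bistring's actual implementation: it uses a hypothetical infer_tokens helper and plain str in place of bistr and the real Tokenization machinery.

```python
# Simplified sketch of forward-search token inference: find each token in
# order, always searching from the end of the previous match, so earlier
# occurrences of a token's text never shadow later tokens.
from typing import NamedTuple, List

class Token(NamedTuple):
    text: str
    start: int
    end: int

def infer_tokens(text: str, tokens: List[str]) -> List[Token]:
    result = []
    pos = 0
    for token in tokens:
        start = text.find(token, pos)
        if start < 0:
            raise ValueError(f"Couldn't find token {token!r} in text")
        end = start + len(token)
        result.append(Token(token, start, end))
        pos = end  # keep searching forward, never backtracking
    return result

# Matches the positions in the doctest above: (0, 5) and (7, 12)
print(infer_tokens('hello, world!', ['hello', 'world']))
```

The forward-only search is what makes the result deterministic, but it is also why ambiguous token sequences are better produced by a Tokenizer with explicit positions.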
- __getitem__(index: int) → bistring._token.Token
- __getitem__(index: slice) → bistring._token.Tokenization
Indexing a Tokenization returns the nth token:
>>> tokens = Tokenization.infer(
...     "The quick, brown fox",
...     ["The", "quick", "brown", "fox"],
... )
>>> tokens[0]
Token(bistr('The'), start=0, end=3)
Slicing a Tokenization returns a new one with the requested slice of tokens:
>>> tokens = tokens[1:-1]
>>> tokens[0]
Token(bistr('quick'), start=4, end=9)
- Return type
Union[Token, Tokenization]
- substring(*args)
Map a span of tokens to the corresponding substring. With no arguments, returns the substring from the first to the last token.
- Return type
bistr
- text_bounds(*args)
Map a span of tokens to the bounds of the corresponding text. With no arguments, returns the bounds from the first to the last token.
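The relationship between text_bounds and substring can be sketched in plain Python. This is a simplified model using a hypothetical Token tuple and plain str; real bistring returns bistr values that preserve the alignment.

```python
# Simplified sketch: a token span's text bounds run from the first token's
# start to the last token's end, and substring() is just the text sliced
# to those bounds.
from typing import NamedTuple, List, Tuple

class Token(NamedTuple):
    text: str
    start: int
    end: int

def text_bounds(tokens: List[Token], first: int, last: int) -> Tuple[int, int]:
    """Bounds of the text covered by tokens[first:last]."""
    span = tokens[first:last]
    return span[0].start, span[-1].end

def substring(text: str, tokens: List[Token], first: int, last: int) -> str:
    start, end = text_bounds(tokens, first, last)
    return text[start:end]

text = 'The quick, brown fox'
tokens = [Token('The', 0, 3), Token('quick', 4, 9),
          Token('brown', 11, 16), Token('fox', 17, 20)]
print(text_bounds(tokens, 1, 3))      # (4, 16)
print(substring(text, tokens, 1, 3))  # 'quick, brown'
```

Note that the substring includes any text between the tokens (here the comma and space), since only the outer bounds of the span matter.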
- original_bounds(*args)
Map a span of tokens to the bounds of the corresponding original text. With no arguments, returns the bounds from the first to the last token.
- bounds_for_text(*args)
Map a span of text to the bounds of the corresponding span of tokens.
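One plausible reading of this inverse mapping, sketched with plain Python and hypothetical helper names (not bistring's actual implementation), returns the range of token indices whose tokens overlap the given text span:

```python
# Simplified sketch: map a [start, end) span of text to the half-open range
# of token indices whose tokens overlap that span.
from typing import NamedTuple, List, Tuple

class Token(NamedTuple):
    text: str
    start: int
    end: int

def bounds_for_text(tokens: List[Token], start: int, end: int) -> Tuple[int, int]:
    # First token that extends past the start of the span
    first = next((i for i, t in enumerate(tokens) if t.end > start), len(tokens))
    # One past the last token that begins before the end of the span
    last = next((i for i in range(len(tokens), 0, -1) if tokens[i - 1].start < end), 0)
    return first, last

tokens = [Token('The', 0, 3), Token('quick', 4, 9),
          Token('brown', 11, 16), Token('fox', 17, 20)]
print(bounds_for_text(tokens, 5, 12))  # (1, 3): overlaps 'quick' and 'brown'
```

slice_by_text can then be thought of as composing this mapping with token slicing: taking the tokens at indices first through last.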
- bounds_for_original(*args)
Map a span of original text to the bounds of the corresponding span of tokens.
- slice_by_text(*args)
Map a span of text to the corresponding span of tokens.
- Return type
Tokenization
- slice_by_original(*args)
Map a span of the original text to the corresponding span of tokens.
- Return type
Tokenization
- snap_text_bounds(*args)
Expand a span of text to align it with token boundaries.
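A simplified sketch of the snapping behaviour, under the assumption that snapping widens the span to the boundaries of every token it touches (hypothetical helper names, plain str positions instead of bistr):

```python
# Simplified sketch of snapping: widen [start, end) so it lands on token
# boundaries, by taking the outer bounds of every token the span touches.
from typing import NamedTuple, List, Tuple

class Token(NamedTuple):
    text: str
    start: int
    end: int

def snap_text_bounds(tokens: List[Token], start: int, end: int) -> Tuple[int, int]:
    touched = [t for t in tokens if t.end > start and t.start < end]
    if not touched:
        return start, end  # span touches no token; leave it unchanged
    return touched[0].start, touched[-1].end

tokens = [Token('The', 0, 3), Token('quick', 4, 9),
          Token('brown', 11, 16), Token('fox', 17, 20)]
print(snap_text_bounds(tokens, 6, 13))  # (4, 16): widened to 'quick'..'brown'
```

This is useful for turning an arbitrary character range (say, a user selection mid-word) into one that cleanly covers whole tokens.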