Tokenization

class bistring.Token(text, start, end)

Bases: object

A token extracted from a string.

Parameters
  • text (Union[str, bistr]) – The text of this token.

  • start (int) – The starting index of this token.

  • end (int) – The ending index of this token.

text: bistr = None

The actual text of the token.

start: int = None

The start position of the token.

end: int = None

The end position of the token.

property original

The original text of this token, before any modifications recorded in its bistr.

Return type

str

property modified

The modified text of this token, after any modifications recorded in its bistr.

Return type

str

classmethod slice(text, start, end)

Create a Token from a slice of a bistr.

Parameters
  • text (Union[str, bistr]) – The (bi)string to slice.

  • start (int) – The starting index of the token.

  • end (int) – The ending index of the token.

Return type

Token

class bistring.Tokenization(text, tokens)

Bases: object

A string and its tokenization.

Parameters
  • text (Union[str, bistr]) – The text from which the tokens have been extracted.

  • tokens (Iterable[Token]) – The tokens extracted from the text.

text: bistr = None

The text that was tokenized.

alignment: Alignment = None

The alignment from text indices to token indices.

classmethod infer(text, tokens)

Infer a Tokenization from a sequence of tokens.

>>> tokens = Tokenization.infer('hello, world!', ['hello', 'world'])
>>> tokens[0]
Token(bistr('hello'), start=0, end=5)
>>> tokens[1]
Token(bistr('world'), start=7, end=12)

Because inferred positions can be ambiguous, it is much better to use a Tokenizer or some other method that produces Tokens with their positions set explicitly.

Return type

Tokenization

Returns

The inferred tokenization, with token positions found by simple forward search.

Raises

ValueError if the tokens can’t be found in the source string.
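The forward search described above can be sketched in plain Python. This is only an illustration under stated assumptions, not bistring's implementation: `infer_spans` is a hypothetical helper, and it operates on a plain `str` rather than a `bistr`.

```python
def infer_spans(text, tokens):
    """Locate each token by searching forward from the end of the previous one.

    Hypothetical helper for illustration; not part of bistring.
    """
    spans = []
    end = 0
    for token in tokens:
        # str.index raises ValueError when the token cannot be found
        start = text.index(token, end)
        end = start + len(token)
        spans.append((start, end))
    return spans


spans = infer_spans("hello, world!", ["hello", "world"])
print(spans)  # [(0, 5), (7, 12)]

# A forward search always picks the earliest match, even if the caller
# meant the second "be".
ambiguous = infer_spans("to be or not to be", ["be"])
print(ambiguous)  # [(3, 5)]
```

The second call shows the kind of ambiguity the warning refers to: nothing in the token list says which occurrence was intended, so explicit positions are the more reliable input.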

__getitem__(index)

Indexing a Tokenization returns the nth token:

>>> tokens = Tokenization.infer(
...     "The quick, brown fox",
...     ["The", "quick", "brown", "fox"],
... )
>>> tokens[0]
Token(bistr('The'), start=0, end=3)

Slicing a Tokenization returns a new one with the requested slice of tokens:

>>> tokens = tokens[1:-1]
>>> tokens[0]
Token(bistr('quick'), start=4, end=9)

Return type

Union[Token, Tokenization]

substring(*args)

Map a span of tokens to the corresponding substring. With no arguments, returns the substring from the first to the last token.

Return type

bistr

text_bounds(*args)

Map a span of tokens to the bounds of the corresponding text. With no arguments, returns the bounds from the first to the last token.

Return type

Tuple[int, int]
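Conceptually, mapping a span of tokens to text bounds takes the start of the first token and the end of the last one, and the substring is the text between those bounds. The sketch below illustrates this with plain tuples; `token_span_to_text_bounds` is a hypothetical helper, it assumes a non-empty token span, and it does not reproduce how bistring handles edge cases.

```python
text = "The quick, brown fox"
# (start, end) text positions of the tokens "The", "quick", "brown", "fox"
spans = [(0, 3), (4, 9), (11, 16), (17, 20)]


def token_span_to_text_bounds(spans, first, last):
    """Text bounds covering tokens first..last-1 (assumes first < last)."""
    return spans[first][0], spans[last - 1][1]


start, end = token_span_to_text_bounds(spans, 1, 3)
print((start, end))     # (4, 16)
print(text[start:end])  # quick, brown
```

Note that the resulting substring includes any text between the tokens, such as the comma and space here.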

original_bounds(*args)

Map a span of tokens to the bounds of the corresponding original text. With no arguments, returns the bounds from the first to the last token.

Return type

Tuple[int, int]
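The difference between text bounds and original bounds only shows up when the tokenized text has been modified. The sketch below is a simplified model, not bistring's mechanism: it assumes the modified text was produced by dropping two leading spaces and lowercasing, and it uses a hypothetical list `mod_to_orig` to map each boundary in the modified text back to the original.

```python
original = "  Hello World"
modified = "hello world"  # assumed result of strip + lowercase

# Hypothetical boundary map: position i in `modified` corresponds to
# position i + 2 in `original`, because two leading spaces were removed.
mod_to_orig = [i + 2 for i in range(len(modified) + 1)]

# Bounds of the token "world" in the modified text
token_start, token_end = 6, 11
orig_bounds = (mod_to_orig[token_start], mod_to_orig[token_end])

print(modified[token_start:token_end])            # world
print(orig_bounds)                                # (8, 13)
print(original[orig_bounds[0]:orig_bounds[1]])    # World
```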

bounds_for_text(*args)

Map a span of text to the bounds of the corresponding span of tokens.

Return type

Tuple[int, int]
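The reverse mapping, from a span of text to a span of tokens, can be approximated by collecting the tokens that overlap the text span. This is a sketch under that assumption; `text_span_to_token_bounds` is a hypothetical helper, and returning `None` for a span that touches no token is a choice made here, not behavior taken from bistring.

```python
# (start, end) text positions of the tokens of "The quick, brown fox"
spans = [(0, 3), (4, 9), (11, 16), (17, 20)]


def text_span_to_token_bounds(spans, start, end):
    """Return (first, last) so that tokens first..last-1 overlap text[start:end]."""
    overlapping = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
    if not overlapping:
        # No token overlaps the span; reporting None is a choice of this sketch.
        return None
    return overlapping[0], overlapping[-1] + 1


print(text_span_to_token_bounds(spans, 5, 12))  # (1, 3)
print(text_span_to_token_bounds(spans, 9, 11))  # None
```

The first call starts inside "quick" and ends inside "brown", so both tokens are included. The second covers only the ", " between them.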

bounds_for_original(*args)

Map a span of original text to the bounds of the corresponding span of tokens.

Return type

Tuple[int, int]

slice_by_text(*args)

Map a span of text to the corresponding span of tokens.

Return type

Tokenization

slice_by_original(*args)

Map a span of the original text to the corresponding span of tokens.

Return type

Tokenization

snap_text_bounds(*args)

Expand a span of text to align it with token boundaries.

Return type

Tuple[int, int]
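Snapping can be thought of as mapping the text span to the tokens it overlaps and then mapping those tokens back to text bounds. The following plain-Python sketch illustrates that idea; `snap_to_tokens` is a hypothetical helper, and leaving a span unchanged when it overlaps no token is an assumption of the sketch rather than documented bistring behavior.

```python
text = "The quick, brown fox"
# (start, end) text positions of the tokens "The", "quick", "brown", "fox"
spans = [(0, 3), (4, 9), (11, 16), (17, 20)]


def snap_to_tokens(spans, start, end):
    """Expand (start, end) outward to the boundaries of the overlapping tokens."""
    overlapping = [(s, e) for s, e in spans if s < end and e > start]
    if not overlapping:
        # Nothing to snap to; this sketch returns the span unchanged.
        return start, end
    return overlapping[0][0], overlapping[-1][1]


snapped = snap_to_tokens(spans, 6, 13)
print(snapped)                       # (4, 16)
print(text[snapped[0]:snapped[1]])   # quick, brown
```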

snap_original_bounds(*args)

Expand a span of original text to align it with token boundaries.

Return type

Tuple[int, int]