Tokenizer

class bistring.Tokenizer

Bases: abc.ABC

Abstract base class for tokenizers.

abstract tokenize(text)

Tokenize some text.

Parameters

text (Union[str, bistr]) – The text to tokenize, as either a str or a bistr. Implementations should convert a plain str to a bistr before processing.

Return type

Tokenization

Returns

A Tokenization holding the text and its tokens.
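
Concrete subclasses implement tokenize() by finding token boundaries and packaging them up. Below is a minimal sketch of a custom subclass; it relies on the Token.slice() convenience constructor and the Tokenization(text, tokens) constructor documented elsewhere in this reference, and the CommaTokenizer name and behavior are purely illustrative:

from bistring import Token, Tokenization, Tokenizer, bistr

class CommaTokenizer(Tokenizer):
    """Hypothetical tokenizer: tokens are the runs between commas."""

    def tokenize(self, text):
        text = bistr(text)  # normalize plain str input, per the contract above
        tokens = []
        start = 0
        for i, ch in enumerate(text.modified):
            if ch == ',':
                if i > start:
                    tokens.append(Token.slice(text, start, i))
                start = i + 1
        if start < len(text.modified):
            tokens.append(Token.slice(text, start, len(text.modified)))
        return Tokenization(text, tokens)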

class bistring.RegexTokenizer(regex)

Bases: bistring._token.Tokenizer

Breaks text into tokens using a regex that matches the tokens themselves.

>>> tokenizer = RegexTokenizer(r'\w+')
>>> tokens = tokenizer.tokenize('the quick brown fox jumps over the lazy dog')
>>> tokens[0]
Token(bistr('the'), start=0, end=3)
>>> tokens[1]
Token(bistr('quick'), start=4, end=9)

Parameters

regex (Union[str, Pattern[str]]) – A (possibly compiled) regular expression that matches tokens to extract.
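
Because tokenize() accepts a bistr, token offsets refer to the modified text, and each token's text is itself a bistr that maps back to the original. A brief sketch in doctest style (casefold() and the original/modified attributes are part of bistr's documented API; the values shown assume the one-to-one alignment a simple casefold produces):

>>> text = bistr('The QUICK Brown Fox').casefold()
>>> tokens = RegexTokenizer(r'\w+').tokenize(text)
>>> tokens[1].text.modified
'quick'
>>> tokens[1].text.original
'QUICK'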

class bistring.SplittingTokenizer(regex)

Bases: bistring._token.Tokenizer

Splits text into tokens using a regex that matches the separators between them.

>>> tokenizer = SplittingTokenizer(r'\s+')
>>> tokens = tokenizer.tokenize('the quick brown fox jumps over the lazy dog')
>>> tokens[0]
Token(bistr('the'), start=0, end=3)
>>> tokens[1]
Token(bistr('quick'), start=4, end=9)

Parameters

regex (Union[str, Pattern[str]]) – A (possibly compiled) regular expression that matches the regions between tokens.
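
The practical difference from RegexTokenizer shows up around punctuation: a splitting regex leaves punctuation attached to the neighboring token, while a matching regex can exclude it. An illustrative comparison:

>>> SplittingTokenizer(r'\s+').tokenize('well, hello')[0]
Token(bistr('well,'), start=0, end=5)
>>> RegexTokenizer(r'\w+').tokenize('well, hello')[0]
Token(bistr('well'), start=0, end=4)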

class bistring.CharacterTokenizer(locale)

Bases: bistring._token._IcuTokenizer

Splits text into user-perceived characters (grapheme clusters).

>>> tokenizer = CharacterTokenizer('th_TH')
>>> tokens = tokenizer.tokenize('กำนัล')
>>> tokens[0]
Token(bistr('กำ'), start=0, end=2)
>>> tokens[1]
Token(bistr('นั'), start=2, end=4)
>>> tokens[2]
Token(bistr('ล'), start=4, end=5)

Parameters

locale (str) – The name of the locale to use for computing user-perceived character boundaries.
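
The same clustering applies to combining marks in any script. For example, with a decomposed accent the mark stays attached to its base character; in the illustrative output below, the 'é' is the two-code-point sequence e + U+0301, hence the two-unit span:

>>> tokenizer = CharacterTokenizer('en_US')
>>> tokenizer.tokenize('cafe\u0301')[3]
Token(bistr('é'), start=3, end=5)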

class bistring.WordTokenizer(locale)

Bases: bistring._token._IcuTokenizer

Splits text into words based on the Unicode word-boundary rules (UAX #29).

>>> tokenizer = WordTokenizer('en_US')
>>> tokens = tokenizer.tokenize('the quick brown fox jumps over the lazy dog')
>>> tokens[0]
Token(bistr('the'), start=0, end=3)
>>> tokens[1]
Token(bistr('quick'), start=4, end=9)

Parameters

locale (str) – The name of the locale to use for computing word boundaries.
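
Unlike a simple whitespace split, the Unicode rules drop punctuation tokens and keep word-internal apostrophes attached, so contractions survive as single tokens. An illustrative run (exact boundaries can vary with the underlying ICU version):

>>> tokenizer = WordTokenizer('en_US')
>>> tokens = tokenizer.tokenize("Can't stop, won't stop.")
>>> tokens[0]
Token(bistr("Can't"), start=0, end=5)
>>> tokens[2]
Token(bistr("won't"), start=12, end=17)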

class bistring.SentenceTokenizer(locale)

Bases: bistring._token._IcuTokenizer

Splits text into sentences based on the Unicode sentence-boundary rules (UAX #29).

>>> tokenizer = SentenceTokenizer('en_US')
>>> tokens = tokenizer.tokenize(
...     'Word, sentence, etc. boundaries are hard. Luckily, Unicode can help.'
... )
>>> tokens[0]
Token(bistr('Word, sentence, etc. boundaries are hard. '), start=0, end=42)
>>> tokens[1]
Token(bistr('Luckily, Unicode can help.'), start=42, end=68)

Parameters

locale (str) – The name of the locale to use for computing sentence boundaries.
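
Note that each sentence token carries its trailing whitespace, which is why the first token above ends at 42. The Token fields can be read directly for post-processing; a brief continuation of the example above:

>>> tokens[0].text.modified.strip()
'Word, sentence, etc. boundaries are hard.'
>>> tokens[1].start, tokens[1].end
(42, 68)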