Tokenizer
- class bistring.Tokenizer
Bases: abc.ABC
Abstract base class for tokenizers.
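Every concrete tokenizer below implements the same interface: a tokenize() method that returns tokens carrying their start/end offsets into the original text. The following stdlib-only sketch illustrates that shape under stated assumptions; the Token and WhitespaceTokenizer names here are illustrative, not part of bistring, and the real Token wraps a bistr rather than a plain str.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """A token plus its offsets into the source text (illustrative stand-in)."""
    text: str
    start: int
    end: int


class Tokenizer(ABC):
    @abstractmethod
    def tokenize(self, text: str) -> List[Token]:
        """Break text into tokens, each recording where it came from."""


class WhitespaceTokenizer(Tokenizer):
    """Toy implementation: tokens are maximal runs of non-whitespace."""

    def tokenize(self, text: str) -> List[Token]:
        tokens = []
        start = None
        for i, ch in enumerate(text):
            if ch.isspace():
                if start is not None:
                    tokens.append(Token(text[start:i], start, i))
                    start = None
            elif start is None:
                start = i
        if start is not None:
            tokens.append(Token(text[start:], start, len(text)))
        return tokens
```

The offsets are what make this interface useful: because each token knows its start and end in the original string, results computed on tokens can be mapped back to the source text.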
- class bistring.RegexTokenizer(regex)
Bases: bistring._token.Tokenizer
Breaks text into tokens based on a regex that matches the tokens themselves.
>>> tokenizer = RegexTokenizer(r'\w+')
>>> tokens = tokenizer.tokenize('the quick brown fox jumps over the lazy dog')
>>> tokens[0]
Token(bistr('the'), start=0, end=3)
>>> tokens[1]
Token(bistr('quick'), start=4, end=9)
- class bistring.SplittingTokenizer(regex)
Bases: bistring._token.Tokenizer
Splits text into tokens based on a regex that matches the separators between them.
>>> tokenizer = SplittingTokenizer(r'\s+')
>>> tokens = tokenizer.tokenize('the quick brown fox jumps over the lazy dog')
>>> tokens[0]
Token(bistr('the'), start=0, end=3)
>>> tokens[1]
Token(bistr('quick'), start=4, end=9)
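The difference between the two classes is what the regex describes: RegexTokenizer's pattern matches the tokens themselves, while SplittingTokenizer's pattern matches the separators between them. On whitespace-separated words the results coincide, but punctuation exposes the distinction. A stdlib-only sketch of that contrast (not bistring's implementation, which also tracks offsets):

```python
import re

text = "it's the quick brown fox"

# Match-based (like RegexTokenizer): the pattern describes the tokens,
# so anything \w+ does not match is dropped, and "it's" becomes two tokens.
matched = [m.group() for m in re.finditer(r'\w+', text)]
# → ['it', 's', 'the', 'quick', 'brown', 'fox']

# Split-based (like SplittingTokenizer): the pattern describes the
# separators, so everything between separators survives intact.
split = [s for s in re.split(r'\s+', text) if s]
# → ["it's", 'the', 'quick', 'brown', 'fox']
```

Choose the match-based form when you can describe what a token looks like, and the split-based form when it is easier to describe what comes between tokens.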
- class bistring.CharacterTokenizer(locale)
Bases: bistring._token._IcuTokenizer
Splits text into user-perceived characters (grapheme clusters).
>>> tokenizer = CharacterTokenizer('th_TH')
>>> tokens = tokenizer.tokenize('กำนัล')
>>> tokens[0]
Token(bistr('กำ'), start=0, end=2)
>>> tokens[1]
Token(bistr('นั'), start=2, end=4)
>>> tokens[2]
Token(bistr('ล'), start=4, end=5)
- Parameters
locale (str) – The name of the locale to use for computing user-perceived character boundaries.
- class bistring.WordTokenizer(locale)
Bases: bistring._token._IcuTokenizer
Splits text into words based on Unicode rules.
>>> tokenizer = WordTokenizer('en_US')
>>> tokens = tokenizer.tokenize('the quick brown fox jumps over the lazy dog')
>>> tokens[0]
Token(bistr('the'), start=0, end=3)
>>> tokens[1]
Token(bistr('quick'), start=4, end=9)
- Parameters
locale (str) – The name of the locale to use for computing word boundaries.
- class bistring.SentenceTokenizer(locale)
Bases: bistring._token._IcuTokenizer
Splits text into sentences based on Unicode rules.
>>> tokenizer = SentenceTokenizer('en_US')
>>> tokens = tokenizer.tokenize(
...     'Word, sentence, etc. boundaries are hard. Luckily, Unicode can help.'
... )
>>> tokens[0]
Token(bistr('Word, sentence, etc. boundaries are hard. '), start=0, end=42)
>>> tokens[1]
Token(bistr('Luckily, Unicode can help.'), start=42, end=68)
- Parameters
locale (str) – The name of the locale to use for computing sentence boundaries.
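The example text above hints at why a rule-based tokenizer helps here: a naive regex that splits after every period mishandles abbreviations like "etc.". A stdlib-only illustration of the failure mode (not how SentenceTokenizer works internally; it relies on ICU's locale-aware segmentation rules):

```python
import re

text = 'Word, sentence, etc. boundaries are hard. Luckily, Unicode can help.'

# Naive approach: treat every '.' followed by whitespace as a sentence
# boundary. The abbreviation "etc." triggers a spurious split.
naive = re.split(r'(?<=\.)\s+', text)
# → ['Word, sentence, etc.', 'boundaries are hard.',
#    'Luckily, Unicode can help.']
```

The Unicode-aware rules that SentenceTokenizer uses keep "etc. boundaries" together, yielding the two sentences shown in the example above instead of three fragments.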