Tokenizer

class Tokenizer()

interface, exported from token

A tokenizer that produces Tokenization objects.

Tokenizer.tokenize(text)

Tokenize a string.

Arguments
  • text (bistring.AnyString) – The text to tokenize, either a plain string or a BiString.

Returns

token.Tokenization – A Tokenization holding the text and its tokens.
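
To make the shape of this contract concrete, here is a minimal sketch of a class with a tokenize(text) method that returns the text together with its tokens. All names in it (SimpleToken, SimpleTokenization, SimpleTokenizer, CharTokenizer) are hypothetical stand-ins invented for this illustration, and it accepts plain strings only; it is not the library's Tokenizer, Tokenization, or BiString.

```typescript
// Hypothetical, simplified stand-ins for illustration only.
interface SimpleToken {
  text: string;
  start: number; // inclusive offset into the input
  end: number;   // exclusive offset into the input
}

interface SimpleTokenization {
  text: string;          // the text that was tokenized
  tokens: SimpleToken[]; // the tokens found in it
}

interface SimpleTokenizer {
  tokenize(text: string): SimpleTokenization;
}

// Emits one token per non-whitespace UTF-16 code unit.
class CharTokenizer implements SimpleTokenizer {
  tokenize(text: string): SimpleTokenization {
    const tokens: SimpleToken[] = [];
    for (let i = 0; i < text.length; i++) {
      const ch = text.charAt(i);
      if (ch.trim() === "") continue; // skip whitespace
      tokens.push({ text: ch, start: i, end: i + 1 });
    }
    return { text, tokens };
  }
}

const charResult = new CharTokenizer().tokenize("a b");
console.log(charResult.tokens.map((t) => t.text)); // prints [ 'a', 'b' ]
```

Keeping the original text next to the tokens, as the Returns entry above describes, lets each token's offsets be resolved against the input later.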

class RegExpTokenizer(pattern)

exported from token

Implements:
  • token.Tokenizer()

Breaks text into tokens, where each match of a RegExp becomes a token.

Create a RegExpTokenizer.

Arguments
  • pattern (RegExp) – The regex that will match tokens.

RegExpTokenizer.tokenize(text)

Tokenize a string.

Arguments
  • text (bistring.AnyString) – The text to tokenize, either a plain string or a BiString.

Returns

token.Tokenization – A Tokenization holding the text and its tokens.
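
As a rough illustration of match-based tokenization, the sketch below treats every match of a regex as a token and records its offsets. It works on plain strings, and SimpleToken and matchTokens are hypothetical names invented here; this is not the library's implementation, and skipping empty matches is a choice of the sketch rather than documented behavior.

```typescript
// Hypothetical token shape for illustration; not the library's Token class.
interface SimpleToken {
  text: string;
  start: number; // inclusive offset into the input
  end: number;   // exclusive offset into the input
}

// Hypothetical helper: every non-empty match of `pattern` becomes a token.
function matchTokens(text: string, pattern: RegExp): SimpleToken[] {
  // Copy the pattern with the global flag so exec() advances through the text.
  const flags = pattern.flags.indexOf("g") >= 0 ? pattern.flags : pattern.flags + "g";
  const matcher = new RegExp(pattern.source, flags);
  const tokens: SimpleToken[] = [];
  let match: RegExpExecArray | null;
  while ((match = matcher.exec(text)) !== null) {
    if (match[0].length === 0) {
      matcher.lastIndex++; // avoid looping forever on an empty match
      continue;
    }
    tokens.push({
      text: match[0],
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return tokens;
}

const words = matchTokens("The quick, brown fox", /\w+/);
console.log(words.map((t) => t.text)); // prints [ 'The', 'quick', 'brown', 'fox' ]
```

Punctuation and whitespace fall outside every match, so they simply do not appear in the token list.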

class SplittingTokenizer(pattern)

exported from token

Implements:
  • token.Tokenizer()

Splits text into tokens, treating each match of a RegExp as a separator between tokens.

Create a SplittingTokenizer.

Arguments
  • pattern (RegExp) – A regex that matches the regions between tokens.

SplittingTokenizer.tokenize(text)

Tokenize a string.

Arguments
  • text (bistring.AnyString) – The text to tokenize, either a plain string or a BiString.

Returns

token.Tokenization – A Tokenization holding the text and its tokens.
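
Splitting is the complement of matching: the regex marks the gaps, and whatever lies between the gaps becomes a token. The sketch below shows that idea on plain strings. SimpleToken and splitTokens are hypothetical names invented for this illustration, not the library's code, and dropping empty tokens at the edges is an assumption of the sketch.

```typescript
// Hypothetical token shape for illustration; not the library's Token class.
interface SimpleToken {
  text: string;
  start: number; // inclusive offset into the input
  end: number;   // exclusive offset into the input
}

// Hypothetical helper: the regions between matches of `pattern` become tokens.
function splitTokens(text: string, pattern: RegExp): SimpleToken[] {
  // Copy the pattern with the global flag so exec() advances through the text.
  const flags = pattern.flags.indexOf("g") >= 0 ? pattern.flags : pattern.flags + "g";
  const separator = new RegExp(pattern.source, flags);
  const tokens: SimpleToken[] = [];
  let last = 0; // offset just past the previous separator
  let match: RegExpExecArray | null;
  while ((match = separator.exec(text)) !== null) {
    if (match[0].length === 0) {
      separator.lastIndex++; // avoid looping forever on an empty match
      continue;
    }
    if (match.index > last) {
      tokens.push({ text: text.slice(last, match.index), start: last, end: match.index });
    }
    last = match.index + match[0].length;
  }
  // Whatever follows the final separator is the last token.
  if (last < text.length) {
    tokens.push({ text: text.slice(last), start: last, end: text.length });
  }
  return tokens;
}

const parts = splitTokens("  one two  three ", /\s+/);
console.log(parts.map((t) => t.text)); // prints [ 'one', 'two', 'three' ]
```

Because the tokens carry offsets, leading and trailing separators shift the positions but never produce empty tokens in this sketch.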