Tokenizer

class Tokenizer()

A tokenizer that produces Tokenization() objects.

interface

Tokenizer.tokenize(text)

Tokenize a string.

Arguments
  • text (AnyString()) – The text to tokenize, either a string or a BiString().

Returns

Tokenization – A Tokenization() holding the text and its tokens.
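The contract above is small: an object with a tokenize(text) method that returns the text together with its tokens. The sketch below illustrates that shape only. CharTokenizer and the { text, tokens } result object are hypothetical stand-ins invented for this example, not the library's own classes.

```javascript
// Hypothetical stand-in for the Tokenizer interface: any object that
// exposes tokenize(text). The { text, tokens } result shape is an
// assumption for illustration, not the real Tokenization() class.
class CharTokenizer {
  tokenize(text) {
    const str = String(text);
    const tokens = [];
    let offset = 0;
    // for...of walks the string by code point, so surrogate pairs
    // stay together in a single token.
    for (const ch of str) {
      tokens.push({ text: ch, start: offset, end: offset + ch.length });
      offset += ch.length;
    }
    return { text: str, tokens };
  }
}

const charResult = new CharTokenizer().tokenize("abc");
console.log(charResult.tokens.map((t) => t.text));
```

Each token records its start and end offsets so that it can be mapped back to the text it came from.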

class RegExpTokenizer(pattern)

Breaks text into tokens based on a RegExp().

Implements: Tokenizer()

Create a RegExpTokenizer.

Arguments
  • pattern (RegExp()) – The regex that will match tokens.

RegExpTokenizer.tokenize(text)

Tokenize a string.

Arguments
  • text (AnyString()) – The text to tokenize, either a string or a BiString().

Returns

Tokenization – A Tokenization() holding the text and its tokens.
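To make "the regex that will match tokens" concrete, here is a minimal sketch of match-based tokenization on a plain string. matchTokens and its { text, start, end } token objects are hypothetical names used for illustration. The sketch does not call RegExpTokenizer and may differ from how the class is implemented.

```javascript
// Illustrative only: every match of the pattern becomes one token.
function matchTokens(text, pattern) {
  // String.prototype.matchAll requires the global flag, so add it if absent.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  const tokens = [];
  for (const m of text.matchAll(re)) {
    tokens.push({ text: m[0], start: m.index, end: m.index + m[0].length });
  }
  return tokens;
}

const matched = matchTokens("The quick, brown fox", /\w+/);
console.log(matched.map((t) => t.text));
```

With the pattern /\w+/, the punctuation and whitespace between words are left out of the token list.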

class SplittingTokenizer(pattern)

Splits text into tokens based on a RegExp().

Implements: Tokenizer()

Create a SplittingTokenizer.

Arguments
  • pattern (RegExp()) – A regex that matches the regions between tokens.

SplittingTokenizer.tokenize(text)

Tokenize a string.

Arguments
  • text (AnyString()) – The text to tokenize, either a string or a BiString().

Returns

Tokenization – A Tokenization() holding the text and its tokens.
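Splitting inverts the matching approach: the pattern describes the separators, and the tokens are whatever lies between them. The sketch below shows the idea on a plain string. splitTokens is a hypothetical helper, and skipping empty tokens is an assumption of this example rather than documented behavior of SplittingTokenizer.

```javascript
// Illustrative only: the pattern matches the gaps, and the text
// between consecutive gaps becomes the tokens.
function splitTokens(text, pattern) {
  // String.prototype.matchAll requires the global flag, so add it if absent.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const re = new RegExp(pattern.source, flags);
  const tokens = [];
  let last = 0; // end offset of the previous separator
  for (const m of text.matchAll(re)) {
    // Assumption: empty tokens (adjacent or leading separators) are skipped.
    if (m.index > last) {
      tokens.push({ text: text.slice(last, m.index), start: last, end: m.index });
    }
    last = m.index + m[0].length;
  }
  // Whatever follows the final separator is the last token.
  if (last < text.length) {
    tokens.push({ text: text.slice(last), start: last, end: text.length });
  }
  return tokens;
}

const pieces = splitTokens("The quick, brown fox", /[\s,]+/);
console.log(pieces.map((t) => t.text));
```

Which style fits better depends on whether the tokens or the separators are easier to describe with a regex.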