Tokenization

class Token(text, start, end)

exported from token

A token extracted from a string.

Create a token.

Arguments
  • text (bistring.AnyString) – The text of this token.

  • start (number) – The start position of the token.

  • end (number) – The end position of the token.

Token.end

type: number

The end position of the token.

Token.modified

type: string

The modified value of the token.

Token.original

type: string

The original value of the token.

Token.slice(text, start, end)

Create a token from a slice of a string.

Arguments
  • text (bistring.AnyString) – The text to slice.

  • start (number) – The start index of the token.

  • end (number) – The end index of the token.

Returns

token.Token
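The Token shape above can be sketched over plain strings (an illustrative reimplementation, not the library's code; the `SimpleToken` name is hypothetical, and with a real bistring.BiString the original and modified values may differ, e.g. after normalization):

```typescript
// Minimal sketch of a token: its text plus its position in the source string.
class SimpleToken {
    constructor(
        readonly text: string,   // the text of this token
        readonly start: number,  // start position in the source string
        readonly end: number,    // end position in the source string
    ) {}

    // Over a plain string, original and modified are the same value.
    get original(): string { return this.text; }
    get modified(): string { return this.text; }

    // Create a token from a slice of a string, mirroring Token.slice().
    static slice(text: string, start: number, end: number): SimpleToken {
        return new SimpleToken(text.slice(start, end), start, end);
    }
}

const token = SimpleToken.slice("The quick brown fox", 4, 9);
console.log(token.text, token.start, token.end); // prints: quick 4 9
```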

Token.start

type: number

The start position of the token.

Token.text

type: bistring.BiString

The actual text of the token.

class Tokenization(text, tokens)

exported from token

A string and its tokenization.

Create a Tokenization.

Arguments
  • text (bistring.AnyString) – The text from which the tokens have been extracted.

  • tokens (Iterable) – The tokens extracted from the text.

Tokenization.alignment

type: alignment.Alignment

The alignment between the text and the tokens.

Tokenization.boundsForOriginal(start, end)

Map a span of original text to the bounds of the corresponding span of tokens.

Arguments
  • start (number) – The start of the span of original text.

  • end (number) – The end of the span of original text.

Returns

alignment.Bounds

Tokenization.boundsForText(start, end)

Map a span of text to the bounds of the corresponding span of tokens.

Arguments
  • start (number) – The start of the span of text.

  • end (number) – The end of the span of text.

Returns

alignment.Bounds

Tokenization.infer(text, tokens)

Infer a Tokenization from a sequence of tokens.

Due to the possibility of ambiguity, it is much better to use a Tokenizer, or some other method that produces Tokens with their positions explicitly set.

Arguments
  • text (bistring.AnyString) – The text that was tokenized.

  • tokens (Iterable) – The extracted tokens.

Returns

token.Tokenization – The inferred tokenization, with token positions found by simple forward search.
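The simple forward search mentioned above can be sketched over plain strings (an illustrative reimplementation, not the library's code; the `inferPositions` name is hypothetical). Each token is located at or after the end of the previous one, which is exactly where the ambiguity comes from: a repeated token is always matched at its first remaining occurrence.

```typescript
// Infer [text, start, end] positions for tokens by simple forward search.
function inferPositions(text: string, tokens: string[]): Array<[string, number, number]> {
    const result: Array<[string, number, number]> = [];
    let pos = 0;
    for (const tok of tokens) {
        const start = text.indexOf(tok, pos); // search from the previous token's end
        if (start < 0) throw new Error(`token not found: ${tok}`);
        const end = start + tok.length;
        result.push([tok, start, end]);
        pos = end;
    }
    return result;
}

console.log(inferPositions("The quick brown fox", ["The", "quick", "fox"]));
// → [["The", 0, 3], ["quick", 4, 9], ["fox", 16, 19]]
```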

Tokenization.length

type: number

The number of tokens.

Tokenization.originalBounds(start, end)

Map a span of tokens to the bounds of the corresponding original text.

Arguments
  • start (undefined|number) – The index of the first token in the span.

  • end (undefined|number) – The index after the last token in the span.

Returns

alignment.Bounds

Tokenization.slice(start, end)

Compute a slice of this tokenization.

Arguments
  • start (undefined|number) – The position to start from.

  • end (undefined|number) – The position to end at.

Returns

token.Tokenization – The requested slice as a new Tokenization.

Tokenization.sliceByOriginal(start, end)

Map a span of original text to the corresponding span of tokens.

Arguments
  • start (number) – The start of the span of original text.

  • end (number) – The end of the span of original text.

Returns

token.Tokenization

Tokenization.sliceByText(start, end)

Map a span of text to the corresponding span of tokens.

Arguments
  • start (number) – The start of the span of text.

  • end (number) – The end of the span of text.

Returns

token.Tokenization

Tokenization.snapOriginalBounds(start, end)

Expand a span of original text to align it with token boundaries.

Arguments
  • start (number) – The start of the span of original text.

  • end (number) – The end of the span of original text.

Returns

alignment.Bounds

Tokenization.snapTextBounds(start, end)

Expand a span of text to align it with token boundaries.

Arguments
  • start (number) – The start of the span of text.

  • end (number) – The end of the span of text.

Returns

alignment.Bounds
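Snapping can be sketched over plain strings (an illustrative reimplementation with assumed semantics, not the library's code): the result runs from the start of the first token overlapping the span to the end of the last one, so a span that cuts through tokens grows to cover them whole; a span touching no token is returned unchanged here as an assumed convention.

```typescript
// A token's position, as a half-open [start, end) span.
type Span = { start: number; end: number };

// Align a text span with the boundaries of the tokens it overlaps.
function snapTextBounds(tokens: Span[], start: number, end: number): [number, number] {
    const hit = tokens.filter(t => t.end > start && t.start < end);
    if (hit.length === 0) return [start, end]; // nothing to snap to
    return [hit[0].start, hit[hit.length - 1].end];
}

// Token spans for "The quick brown fox"; [6, 12) cuts through "quick" and "brown".
const spans: Span[] = [
    { start: 0, end: 3 },
    { start: 4, end: 9 },
    { start: 10, end: 15 },
    { start: 16, end: 19 },
];
console.log(snapTextBounds(spans, 6, 12)); // → [4, 15], both tokens in full
```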

Tokenization.substring(start, end)

Map a span of tokens to the corresponding substring.

Arguments
  • start (undefined|number) – The index of the first token in the span.

  • end (undefined|number) – The index after the last token in the span.

Returns

bistring.BiString
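The mapping back to text can be sketched over plain strings (an illustrative reimplementation, not the library's code; `substringOfTokens` is a hypothetical name). Assumed semantics: with half-open token indices, the result runs from the start of token `start` through the end of token `end - 1`, including any text between the tokens.

```typescript
// A token's position, as a half-open [start, end) span.
type Span = { start: number; end: number };

// Map a half-open span of token indices back to the underlying text.
function substringOfTokens(text: string, tokens: Span[], start: number, end: number): string {
    if (start >= end) return "";
    return text.slice(tokens[start].start, tokens[end - 1].end);
}

const text = "The quick brown fox";
const spans: Span[] = [
    { start: 0, end: 3 },
    { start: 4, end: 9 },
    { start: 10, end: 15 },
    { start: 16, end: 19 },
];
console.log(substringOfTokens(text, spans, 1, 3)); // → "quick brown"
```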

Tokenization.text

type: bistring.BiString

The text that was tokenized.

Tokenization.textBounds(start, end)

Map a span of tokens to the bounds of the corresponding text.

Arguments
  • start (undefined|number) – The index of the first token in the span.

  • end (undefined|number) – The index after the last token in the span.

Returns

alignment.Bounds

Tokenization.tokens

type: token.Token[]

The tokens extracted from the text.