Tokenization
- class Token(text, start, end)
A token extracted from a string.
Create a token.
- Arguments
  - text (AnyString) – The text of this token.
  - start (number) – The start position of the token.
  - end (number) – The end position of the token.
- Token.end
type: number
The end position of the token.
- Token.modified
  type: string
  The modified form of the token's text.
- Token.original
  type: string
  The original form of the token's text.
- Token.start
type: number
The start position of the token.
- Token.text
type: BiString
The actual text of the token.
- Token.slice(text, start, end)
Create a token from a slice of a string.
- Arguments
  - text (AnyString) – The text to slice.
  - start (number) – The start index of the token.
  - end (number) – The end index of the token.
- Returns
  Token – The extracted token.
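As a concrete illustration of the slice semantics above, here is a minimal stand-in (the real Token stores a BiString rather than a plain string, so this is a sketch, not the library's implementation):

```typescript
// Simplified stand-in for Token, using a plain string instead of a BiString.
class SimpleToken {
    constructor(
        readonly text: string,
        readonly start: number,
        readonly end: number,
    ) {}

    // Create a token from the [start, end) slice of a string,
    // remembering where in the string it came from.
    static slice(text: string, start: number, end: number): SimpleToken {
        return new SimpleToken(text.slice(start, end), start, end);
    }
}

const token = SimpleToken.slice("The quick, brown fox", 4, 9);
// token.text === "quick"; token.start === 4; token.end === 9
```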
- class Tokenization(text, tokens)
A string and its tokenization.
Create a Tokenization.
- Arguments
  - text (AnyString) – The text from which the tokens have been extracted.
  - tokens (Iterable) – The tokens extracted from the text.
- Tokenization.alignment
type: Alignment
The alignment between the text and the tokens.
- Tokenization.length
type: number
The number of tokens.
- Tokenization.text
type: BiString
The text that was tokenized.
- Tokenization.tokens
type: readonly Token[]
The tokens extracted from the text.
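For intuition, a Tokenization pairs a text with position-annotated tokens like the ones below. This hypothetical whitespace splitter (not part of the library) produces the kind of (text, start, end) triples the tokens array holds:

```typescript
interface TokenSpan {
    text: string;
    start: number;
    end: number;
}

// Hypothetical whitespace tokenizer: records each token's position so a
// tokenization can map between text offsets and token indices.
function whitespaceTokens(text: string): TokenSpan[] {
    const tokens: TokenSpan[] = [];
    const re = /\S+/g;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
        tokens.push({ text: m[0], start: m.index, end: m.index + m[0].length });
    }
    return tokens;
}

whitespaceTokens("The quick brown fox");
// → [{text: "The", start: 0, end: 3}, {text: "quick", start: 4, end: 9}, …]
```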
- Tokenization.boundsForOriginal(start, end)
Map a span of original text to the bounds of the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of original text.
  - end (number) – The end of the span of original text.
- Returns
  Bounds – The bounds of the corresponding span of tokens.
- Tokenization.boundsForText(start, end)
Map a span of text to the bounds of the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of text.
  - end (number) – The end of the span of text.
- Returns
  Bounds – The bounds of the corresponding span of tokens.
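The token-bounds mapping can be sketched in a simplified model where tokens are plain [start, end) spans. This is a hedged illustration of the semantics (it assumes a token "corresponds" when it overlaps the text span, and the [0, 0] fallback for an empty result is this sketch's choice); the real method works through the Alignment:

```typescript
interface Span { start: number; end: number; }

// Sketch of boundsForText semantics: find the half-open range
// [first, last) of token indices whose tokens overlap [start, end).
function boundsOfOverlapping(tokens: Span[], start: number, end: number): [number, number] {
    let first = tokens.length;
    let last = 0;
    tokens.forEach((t, i) => {
        if (t.start < end && t.end > start) { // token overlaps the span
            first = Math.min(first, i);
            last = Math.max(last, i + 1);
        }
    });
    // Assumption of this sketch: report [0, 0] when nothing overlaps.
    return first < last ? [first, last] : [0, 0];
}

// Token spans for "The quick brown fox"
const spans: Span[] = [
    { start: 0, end: 3 },   // "The"
    { start: 4, end: 9 },   // "quick"
    { start: 10, end: 15 }, // "brown"
    { start: 16, end: 19 }, // "fox"
];
boundsOfOverlapping(spans, 5, 12); // → [1, 3]: "quick" and "brown"
```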
- Tokenization.originalBounds(start, end)
Map a span of tokens to the bounds of the corresponding original text.
- Arguments
  - start (number) – The start of the span of tokens.
  - end (number) – The end of the span of tokens.
- Returns
  Bounds – The bounds of the corresponding original text.
- Tokenization.slice(start, end)
Compute a slice of this tokenization.
- Arguments
  - start (number) – The position to start from.
  - end (number) – The position to end at.
- Returns
  Tokenization – The requested slice as a new Tokenization.
- Tokenization.sliceByOriginal(start, end)
Map a span of original text to the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of original text.
  - end (number) – The end of the span of original text.
- Returns
  Tokenization – The corresponding span of tokens, as a new Tokenization.
- Tokenization.sliceByText(start, end)
Map a span of text to the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of text.
  - end (number) – The end of the span of text.
- Returns
  Tokenization – The corresponding span of tokens, as a new Tokenization.
- Tokenization.snapOriginalBounds(start, end)
Expand a span of original text to align it with token boundaries.
- Arguments
  - start (number) – The start of the span of original text.
  - end (number) – The end of the span of original text.
- Returns
  Bounds – The expanded bounds, aligned to token boundaries.
- Tokenization.snapTextBounds(start, end)
Expand a span of text to align it with token boundaries.
- Arguments
  - start (number) – The start of the span of text.
  - end (number) – The end of the span of text.
- Returns
  Bounds – The expanded bounds, aligned to token boundaries.
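Snapping can be sketched in the same simplified span model (a hedged illustration of the semantics, not the library's implementation): expand [start, end) so that it lines up with the boundaries of every token it overlaps.

```typescript
interface Span { start: number; end: number; }

// Sketch of snapTextBounds semantics: widen [start, end) to the
// boundaries of all tokens that overlap it.
function snapToTokens(tokens: Span[], start: number, end: number): [number, number] {
    let lo = start;
    let hi = end;
    for (const t of tokens) {
        if (t.start < end && t.end > start) { // token overlaps the span
            lo = Math.min(lo, t.start);
            hi = Math.max(hi, t.end);
        }
    }
    return [lo, hi];
}

// Token spans for "The quick brown fox"
const spans: Span[] = [
    { start: 0, end: 3 },   // "The"
    { start: 4, end: 9 },   // "quick"
    { start: 10, end: 15 }, // "brown"
    { start: 16, end: 19 }, // "fox"
];
snapToTokens(spans, 5, 12); // → [4, 15]: expands to cover "quick brown"
```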
- Tokenization.substring(start, end)
Map a span of tokens to the corresponding substring.
- Arguments
  - start (number) – The start of the span of tokens.
  - end (number) – The end of the span of tokens.
- Returns
  BiString – The corresponding substring.
- Tokenization.textBounds(start, end)
Map a span of tokens to the bounds of the corresponding text.
- Arguments
  - start (number) – The start of the span of tokens.
  - end (number) – The end of the span of tokens.
- Returns
  Bounds – The bounds of the corresponding text.
- Tokenization.infer(text, tokens)
Infer a Tokenization from a sequence of tokens.
Due to the possibility of ambiguity, it is much better to use a Tokenizer or some other method of producing Tokens with their positions explicitly set.
- Arguments
  - text (AnyString) – The text that was tokenized.
  - tokens (Iterable) – The extracted tokens.
- Returns
Tokenization – The inferred tokenization, with token positions found by simple forward search.
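The forward search mentioned above can be sketched as follows (a simplified model returning plain spans; the real method builds a full Tokenization over a BiString). The sketch also shows why ambiguity is a concern: a repeated token always matches its earliest remaining occurrence.

```typescript
interface Span { start: number; end: number; }

// Sketch of the simple forward search behind Tokenization.infer:
// locate each token at its first occurrence after the previous match.
function inferSpans(text: string, tokens: string[]): Span[] {
    const spans: Span[] = [];
    let pos = 0;
    for (const tok of tokens) {
        const start = text.indexOf(tok, pos);
        if (start < 0) {
            throw new Error(`token not found in text: ${tok}`);
        }
        spans.push({ start, end: start + tok.length });
        pos = start + tok.length;
    }
    return spans;
}

inferSpans("the quick fox", ["the", "fox"]);
// → [{start: 0, end: 3}, {start: 10, end: 13}]
```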