Tokenization
- class Token(text, start, end)
A token extracted from a string.
Create a token.
- Arguments
  - text (AnyString) – The text of this token.
  - start (number) – The start position of the token.
  - end (number) – The end position of the token.
- Token.end
type: number
The end position of the token.
- Token.modified
  type: string
  The modified form of the token's text.
- Token.original
  type: string
  The original form of the token's text.
- Token.start
type: number
The start position of the token.
- Token.text
type: BiString
The actual text of the token.
- Token.slice(text, start, end)
Create a token from a slice of a string.
- Arguments
  - text (AnyString) – The text to slice.
  - start (number) – The start index of the token.
  - end (number) – The end index of the token.
- Returns
  Token – The extracted token.
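As a concrete illustration of the slice semantics above, here is a minimal stand-in (the real Token stores a BiString rather than a plain string, so this is a sketch, not the library's implementation):

```typescript
// Simplified stand-in for Token, using a plain string instead of a BiString.
class SimpleToken {
    constructor(
        readonly text: string,
        readonly start: number,
        readonly end: number,
    ) {}

    // Create a token from the [start, end) slice of a string,
    // remembering where in the string it came from.
    static slice(text: string, start: number, end: number): SimpleToken {
        return new SimpleToken(text.slice(start, end), start, end);
    }
}

const token = SimpleToken.slice("The quick, brown fox", 4, 9);
// token.text === "quick"; token.start === 4; token.end === 9
```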
- class Tokenization(text, tokens)
A string and its tokenization.
Create a Tokenization.
- Arguments
  - text (AnyString) – The text from which the tokens have been extracted.
  - tokens (Iterable) – The tokens extracted from the text.
- Tokenization.alignment
type: Alignment
The alignment between the text and the tokens.
- Tokenization.length
type: number
The number of tokens.
- Tokenization.text
type: BiString
The text that was tokenized.
- Tokenization.tokens
type: readonly Token[]
The tokens extracted from the text.
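For intuition, a Tokenization pairs a text with position-annotated tokens like the ones below. This hypothetical whitespace splitter (not part of the library) produces the kind of (text, start, end) triples the tokens array holds:

```typescript
interface TokenSpan {
    text: string;
    start: number;
    end: number;
}

// Hypothetical whitespace tokenizer: records each token's position so a
// tokenization can map between text offsets and token indices.
function whitespaceTokens(text: string): TokenSpan[] {
    const tokens: TokenSpan[] = [];
    const re = /\S+/g;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
        tokens.push({ text: m[0], start: m.index, end: m.index + m[0].length });
    }
    return tokens;
}

whitespaceTokens("The quick brown fox");
// → [{text: "The", start: 0, end: 3}, {text: "quick", start: 4, end: 9}, …]
```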
- Tokenization.boundsForOriginal(start, end)
Map a span of original text to the bounds of the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of original text.
  - end (number) – The end of the span of original text.
- Returns
  Bounds – The bounds of the corresponding span of tokens.
- Tokenization.boundsForText(start, end)
Map a span of text to the bounds of the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of text.
  - end (number) – The end of the span of text.
- Returns
  Bounds – The bounds of the corresponding span of tokens.
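The token-bounds mapping can be sketched in a simplified model where tokens are plain [start, end) spans. This is a hedged illustration of the semantics (it assumes a token "corresponds" when it overlaps the text span, and the [0, 0] fallback for an empty result is this sketch's choice); the real method works through the Alignment:

```typescript
interface Span { start: number; end: number; }

// Sketch of boundsForText semantics: find the half-open range
// [first, last) of token indices whose tokens overlap [start, end).
function boundsOfOverlapping(tokens: Span[], start: number, end: number): [number, number] {
    let first = tokens.length;
    let last = 0;
    tokens.forEach((t, i) => {
        if (t.start < end && t.end > start) { // token overlaps the span
            first = Math.min(first, i);
            last = Math.max(last, i + 1);
        }
    });
    // Assumption of this sketch: report [0, 0] when nothing overlaps.
    return first < last ? [first, last] : [0, 0];
}

// Token spans for "The quick brown fox"
const spans: Span[] = [
    { start: 0, end: 3 },   // "The"
    { start: 4, end: 9 },   // "quick"
    { start: 10, end: 15 }, // "brown"
    { start: 16, end: 19 }, // "fox"
];
boundsOfOverlapping(spans, 5, 12); // → [1, 3]: "quick" and "brown"
```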
- Tokenization.originalBounds(start, end)
Map a span of tokens to the bounds of the corresponding original text.
- Arguments
  - start (number) – The start of the span of tokens.
  - end (number) – The end of the span of tokens.
- Returns
  Bounds – The bounds of the corresponding original text.
- Tokenization.slice(start, end)
Compute a slice of this tokenization.
- Arguments
  - start (number) – The position to start from.
  - end (number) – The position to end at.
- Returns
  Tokenization – The requested slice as a new Tokenization.
- Tokenization.sliceByOriginal(start, end)
Map a span of original text to the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of original text.
  - end (number) – The end of the span of original text.
- Returns
  Tokenization – The corresponding span of tokens, as a new Tokenization.
- Tokenization.sliceByText(start, end)
Map a span of text to the corresponding span of tokens.
- Arguments
  - start (number) – The start of the span of text.
  - end (number) – The end of the span of text.
- Returns
  Tokenization – The corresponding span of tokens, as a new Tokenization.
- Tokenization.snapOriginalBounds(start, end)
Expand a span of original text to align it with token boundaries.
- Arguments
  - start (number) – The start of the span of original text.
  - end (number) – The end of the span of original text.
- Returns
  Bounds – The expanded bounds, aligned to token boundaries.
- Tokenization.snapTextBounds(start, end)
Expand a span of text to align it with token boundaries.
- Arguments
  - start (number) – The start of the span of text.
  - end (number) – The end of the span of text.
- Returns
  Bounds – The expanded bounds, aligned to token boundaries.
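Snapping can be sketched in the same simplified span model (a hedged illustration of the semantics, not the library's implementation): expand [start, end) so that it lines up with the boundaries of every token it overlaps.

```typescript
interface Span { start: number; end: number; }

// Sketch of snapTextBounds semantics: widen [start, end) to the
// boundaries of all tokens that overlap it.
function snapToTokens(tokens: Span[], start: number, end: number): [number, number] {
    let lo = start;
    let hi = end;
    for (const t of tokens) {
        if (t.start < end && t.end > start) { // token overlaps the span
            lo = Math.min(lo, t.start);
            hi = Math.max(hi, t.end);
        }
    }
    return [lo, hi];
}

// Token spans for "The quick brown fox"
const spans: Span[] = [
    { start: 0, end: 3 },   // "The"
    { start: 4, end: 9 },   // "quick"
    { start: 10, end: 15 }, // "brown"
    { start: 16, end: 19 }, // "fox"
];
snapToTokens(spans, 5, 12); // → [4, 15]: expands to cover "quick brown"
```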
- Tokenization.substring(start, end)
Map a span of tokens to the corresponding substring.
- Arguments
  - start (number) – The start of the span of tokens.
  - end (number) – The end of the span of tokens.
- Returns
  BiString – The corresponding substring.
- Tokenization.textBounds(start, end)
Map a span of tokens to the bounds of the corresponding text.
- Arguments
  - start (number) – The start of the span of tokens.
  - end (number) – The end of the span of tokens.
- Returns
  Bounds – The bounds of the corresponding text.
- Tokenization.infer(text, tokens)
Infer a Tokenization from a sequence of tokens.
Due to the possibility of ambiguity, it is much better to use a Tokenizer or some other method of producing Tokens with their positions explicitly set.
- Arguments
  - text (AnyString) – The text that was tokenized.
  - tokens (Iterable) – The extracted tokens.
- Returns
Tokenization – The inferred tokenization, with token positions found by simple forward search.
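The forward search mentioned above can be sketched as follows (a simplified model returning plain spans; the real method builds a full Tokenization over a BiString). The sketch also shows why ambiguity is a concern: a repeated token always matches its earliest remaining occurrence.

```typescript
interface Span { start: number; end: number; }

// Sketch of the simple forward search behind Tokenization.infer:
// locate each token at its first occurrence after the previous match.
function inferSpans(text: string, tokens: string[]): Span[] {
    const spans: Span[] = [];
    let pos = 0;
    for (const tok of tokens) {
        const start = text.indexOf(tok, pos);
        if (start < 0) {
            throw new Error(`token not found in text: ${tok}`);
        }
        spans.push({ start, end: start + tok.length });
        pos = start + tok.length;
    }
    return spans;
}

inferSpans("the quick fox", ["the", "fox"]);
// → [{start: 0, end: 3}, {start: 10, end: 13}]
```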