Function tokenize

  • Segments the provided text into tokens, then normalizes and classifies each token.

    Parameters

    • text: string

      The input text to segment into tokens.

    Returns Token[]

    An array of tokens, where each token is a normalized and classified segment of the input text.

    Example

    import { tokenize, TokenKind } from 'charabia-js';
    import assert from 'node:assert';

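    // Tokenize a sentence containing words, punctuation, and numbers.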
    const tokens = tokenize(
      'The quick ("brown") fox can\'t jump 32.3 feet, right? Brr, it\'s 29.3°F'
    );

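    // 'The' is normalized to lowercase and classified as a word.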
    let token = tokens[0];
    assert.equal(token.lemma, 'the');
    assert.equal(token.kind, TokenKind.Word);

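    // Whitespace between words is classified as a soft separator.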
    token = tokens[1];
    assert.equal(token.lemma, ' ');
    assert.equal(token.kind, TokenKind.SoftSeparator);

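    // The next word token follows the same pattern.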
    token = tokens[2];
    assert.equal(token.lemma, 'quick');
    assert.equal(token.kind, TokenKind.Word);
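
    Separator tokens can be filtered out to recover only the words. The sketch
    below assumes, based on the example above, that punctuation and whitespace
    are classified as separator kinds rather than TokenKind.Word:

    import { tokenize, TokenKind } from 'charabia-js';

    // Keep only word tokens, dropping separators, then collect their
    // normalized lemmas.
    const words = tokenize('The quick ("brown") fox')
      .filter((token) => token.kind === TokenKind.Word)
      .map((token) => token.lemma);

    console.log(words); // expected: ['the', 'quick', 'brown', 'fox']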