Tokenization and Lemmatization API
Tokenization and Lemmatization: Core Components of Natural Language Processing
Tokenization
Tokenizing a text means splitting it up into smaller units called tokens. Depending on the tokenizer used, a token can be a word, a character, a punctuation mark, or a sub-word (the English word "longer" consists of two sub-words: "long" and "er").
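As a minimal sketch of word-level tokenization, the following Python snippet splits text into words and punctuation marks with a naive regular expression. Real tokenizers handle far more edge cases (contractions, hyphenation, sub-words) than this illustration does, and this is not the implementation behind the API described below.

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Tokenization splits text into tokens."))
# ['Tokenization', 'splits', 'text', 'into', 'tokens', '.']
```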
Tokenization is a step in almost every Natural Language Processing workflow. Because languages differ in structure, tokenization also works differently depending on the language of the text being tokenized.
Lemmatization
When lemmatizing, one is looking for the base form of a word (the form one would typically find in a dictionary). The lemma of the word "are", for example, is "be", and the lemma of the word "pears" is "pear".
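The two examples above can be reproduced with the open-source NLTK library; again, this is an illustration of the operation itself, not the implementation behind the API described below.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer relies on

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'   (pos="v" marks a verb)
print(lemmatizer.lemmatize("pears"))         # 'pear' (nouns are the default)
```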
Just like tokenization, lemmatization is almost always necessary when doing Natural Language Processing, and it too differs for every language.
Reasons for using Tokenization and Lemmatization
Tokenization and lemmatization are typically not used on their own but as preparatory steps whose results feed into the later stages of a Natural Language Processing workflow, as sketched below. Tokenization especially can be costly, so the choice of tokenizer is crucial.
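Here is a minimal pipeline sketch that reuses the naive tokenizer and NLTK lemmatizer from above, with a simple frequency count standing in for the downstream step:

```python
import re
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

def preprocess(text: str) -> list[str]:
    """Tokenize naively, then lemmatize each token (noun lemmas by default)."""
    lemmatizer = WordNetLemmatizer()
    tokens = re.findall(r"\w+", text.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens]

lemmas = preprocess("Pears and apples. A pear, two apples.")
print(Counter(lemmas).most_common(2))
# [('pear', 2), ('apple', 2)]
```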
Tokenization and Lemmatization Inference API
An application programming interface (API) is a way for two or more computer programs to communicate with each other. Using an API allows you not only to automate your processes but to do so from any programming language, because the API acts as a messenger and translator between computers and programs.
Trustami Cloud's Tokenization and Lemmatization API
Trustami supplies a tokenization and lemmatization API that lets you perform tokenization and lemmatization operations out of the box, with excellent performance. Tokenization and lemmatization are not very resource-intensive, so the response time (latency) is very short. For more details, see our tokenization and lemmatization documentation here.
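To show what calling such an API over HTTP typically looks like, here is a sketch using Python's requests library. The endpoint URL, payload shape, and response fields are hypothetical placeholders, not Trustami's actual schema; the documentation linked above defines the real interface.

```python
import requests

# Hypothetical endpoint and payload: the real URL, fields, and
# authentication are defined in the Trustami API documentation.
API_URL = "https://api.example.com/v1/tokenize-lemmatize"

payload = {"text": "Pears are tasty.", "language": "en"}
response = requests.post(API_URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # e.g. tokens and their lemmas, per the real schema
```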