Technical Glossary

Tokenizer

Definition: Component that splits text into tokens (subword units) that LLMs can process, affecting cost, context limits, and multilingual performance.

— Source: NERVICO, Product Development Consultancy

What is a Tokenizer

A tokenizer is the component that splits text into tokens, the smallest units an LLM can process. Tokens are not necessarily complete words: they can be whole words, subword fragments, or individual characters. Each model uses its own tokenization strategy, which directly affects cost per request, context window limits, and performance across different languages.

How it works

The tokenizer receives raw text and converts it into a sequence of numeric identifiers (token IDs) the model can interpret. Algorithms like BPE (Byte Pair Encoding) or SentencePiece analyze a training corpus to build a vocabulary of frequent tokens. During tokenization, text is decomposed into the longest pieces that exist in the vocabulary: common words are typically represented as a single token, while rare or technical words are split into multiple tokens. A typical vocabulary contains between 32,000 and 200,000 tokens.
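As a minimal sketch of the decomposition step above, a greedy longest-match over a toy vocabulary shows how common words stay whole while rare words split into pieces. The vocabulary and input text are hypothetical, and real BPE tokenizers apply learned merge rules rather than this simple loop:

```python
# Greedy longest-match tokenization over a toy vocabulary.
# Illustrative sketch only: production tokenizers (BPE, SentencePiece)
# use learned merge rules and byte-level fallbacks.

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text into the longest pieces found in vocab."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Fallback: emit a single character (real tokenizers fall
            # back to bytes so no input is ever unrepresentable).
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary: "the" and "token" are frequent enough to be
# whole tokens; "tokenizers" must be assembled from subword pieces.
vocab = {"the", " ", "token", "iz", "er", "s"}
print(tokenize("the tokenizers", vocab))
# → ['the', ' ', 'token', 'iz', 'er', 's']
```

Note how the rare word "tokenizers" costs four tokens while the common word "the" costs one; this is exactly why rare or technical vocabulary inflates token counts.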

Why it matters

The tokenizer determines how much each LLM request costs, since providers bill per processed token. It also defines how much information fits within the model’s context window. For teams working with multilingual content, tokenizer efficiency is critical: text in languages like Spanish, Chinese, or Arabic can consume 1.5 to 3 times more tokens than the same content in English, increasing costs and reducing available context.
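A back-of-the-envelope calculation makes the billing impact concrete. The per-million-token prices and the 1.5x multilingual inflation factor below are illustrative assumptions, not any provider's published rates:

```python
# Back-of-the-envelope request cost. Prices are hypothetical
# (USD per million tokens), not any provider's actual rates.

def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one request, billed per processed token."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

# Same request in English vs. a language whose tokenization is 1.5x longer.
english = request_cost(1_000, 500, in_price_per_m=3.0, out_price_per_m=15.0)
spanish = request_cost(1_500, 750, in_price_per_m=3.0, out_price_per_m=15.0)

print(f"English: ${english:.5f}  Spanish: ${spanish:.5f}")
# English: $0.01050  Spanish: $0.01575
```

The inflated request costs 50% more for identical content, and it also consumes 50% more of the context window, so both effects in the paragraph above come from the same token count.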

Practical example

A development team evaluates the costs of integrating Claude into their support platform. They discover that their Spanish-language tickets consume 40% more tokens than English equivalents due to tokenization. With this information, they optimize their system prompts to be more concise and configure context caching, reducing costs by 30% without sacrificing response quality.
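The arithmetic behind that evaluation can be sketched as follows. The ticket volume, baseline token count, and per-token price are hypothetical; only the 40% Spanish inflation and the 30% reduction come from the example above:

```python
# Monthly cost sketch for the support-platform example.
# Volume, baseline token count, and price are hypothetical assumptions;
# the 40% Spanish inflation and 30% optimization come from the example.

TICKETS_PER_MONTH = 10_000          # hypothetical volume
ENGLISH_TOKENS_PER_TICKET = 800     # hypothetical baseline
PRICE_PER_TOKEN = 3.0 / 1e6         # hypothetical USD per token

spanish_tokens = ENGLISH_TOKENS_PER_TICKET * 1.40    # 40% more tokens
before = TICKETS_PER_MONTH * spanish_tokens * PRICE_PER_TOKEN
after = before * (1 - 0.30)                          # 30% cost reduction

print(f"before: ${before:.2f}/month  after: ${after:.2f}/month")
# before: $33.60/month  after: $23.52/month
```

Because billing is linear in token count, shortening system prompts and caching repeated context translate directly into proportional savings at any volume.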

Related terms

  • LLM - Language models that depend on tokenizers
  • Context Window - The maximum number of tokens a model can process at once
  • Embedding - Vector representations generated from tokens

Last updated: February 2026
Category: Artificial Intelligence
Related to: LLM, Tokens, Context Window, NLP
Keywords: tokenizer, tokens, bpe, sentencepiece, llm tokens, tokenization, subword units
