Beyond Text Compression: Evaluating Tokenizers Across Scales
Authors: Jonas F. Lotz†‡, António Vilarinho Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili
Tokenizer design significantly impacts language model performance, yet evaluating tokenizer quality remains challenging. While text compression has emerged as a common intrinsic metric, recent work questions its reliability as a quality indicator. We investigate whether evaluating tokenizers on smaller models (350M parameters) reliably predicts their impact at larger scales (2.7B parameters). Through experiments with established tokenizers from widely adopted language models, we find that tokenizer choice minimally affects English tasks but yields significant, scale-consistent differences in machine translation performance. Based on these findings, we propose additional intrinsic metrics that correlate more strongly with downstream performance than text compression. We combine these metrics into an evaluation framework that enables more reliable intrinsic tokenizer comparisons.
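As a rough illustration of the text-compression intrinsic metric discussed in the abstract, the sketch below computes bytes per token for a given tokenizer on a text sample. The Hugging Face `AutoTokenizer` and the `gpt2` checkpoint are illustrative assumptions, not the tokenizers or evaluation code used in the paper.

```python
# Minimal sketch of a text-compression metric (bytes per token), assuming a
# Hugging Face tokenizer as a stand-in for the tokenizers under comparison.
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, text: str) -> float:
    """Higher values mean the tokenizer compresses the text into fewer tokens."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return len(text.encode("utf-8")) / max(len(token_ids), 1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice only
sample = "Tokenizer design significantly impacts language model performance."
print(f"bytes/token: {bytes_per_token(tokenizer, sample):.2f}")
```

Comparing this number across tokenizers on the same corpus gives the kind of intrinsic ranking whose reliability the paper examines.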
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
February 19, 2025. Research area: Computer Vision.
This work was done in collaboration with the Swiss Federal Institute of Technology Lausanne (EPFL).
Image tokenization has enabled major advances in autoregressive image generation by providing compressed, discrete representations that are more efficient to process than raw pixels. While traditional approaches use 2D grid tokenization, recent methods like TiTok have shown that 1D tokenization can achieve high generation quality by eliminating grid…
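To make the contrast between 2D grid tokenization and 1D tokenization of flexible length concrete, the sketch below compares a fixed patch grid with a learned-query resampler that cross-attends a chosen number of queries to the patch features. The module names, sizes, and architecture are illustrative assumptions, not the FlexTok or TiTok implementations.

```python
# Illustrative sketch: a 2D patch grid yields (H/p) * (W/p) tokens, while a
# resampler with K learned queries produces a 1D sequence whose length K can
# be chosen freely. All sizes and names here are assumptions for illustration.
import torch
import torch.nn as nn

H = W = 256     # image resolution
p = 16          # patch size -> 2D grid gives (256 // 16) ** 2 = 256 tokens
d = 64          # feature dimension
K = 32          # desired 1D sequence length (flexible)

patch_tokens = torch.randn(1, (H // p) * (W // p), d)   # [1, 256, d] grid tokens

class QueryResampler(nn.Module):
    """Cross-attend K learned queries to the patch tokens to get a K-token sequence."""
    def __init__(self, num_queries: int, dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)    # queries attend to the patch features
        return out                      # [batch, K, dim]

resampled = QueryResampler(K, d)(patch_tokens)
print(patch_tokens.shape, "->", resampled.shape)   # [1, 256, 64] -> [1, 32, 64]
```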
Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage
April 15, 2024. Research area: Speech and Natural Language Processing. Conference: EACL.
Long prompts present a significant challenge for practical LLM-based systems that need to operate with low latency and limited resources. We investigate prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may take up hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist…
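For readers unfamiliar with gisting, the sketch below builds the kind of attention mask used in gist-style prompt compression (Mu et al., 2023): positions after the gist tokens are blocked from attending to the raw prompt, so the prompt's information must flow through the gist tokens. The token counts and layout are illustrative assumptions, not this paper's hierarchical or dynamic compression method.

```python
# Illustrative attention mask for gist-style prompt compression. True = allowed.
# Post-gist positions cannot see the raw prompt tokens directly, only the gists.
import torch

n_prompt, n_gist, n_rest = 6, 2, 4   # prompt tokens, gist tokens, later dialogue tokens
T = n_prompt + n_gist + n_rest

# Start from a standard causal mask.
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Block attention from post-gist positions back to the raw prompt tokens.
mask[n_prompt + n_gist:, :n_prompt] = False

print(mask.int())
```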