(1D) Ordered Tokens Enable Efficient Test-Time Search
Authors: Zhitong Gao†, Parham Rezaei†, Ali Cy†, Mingqiao Ye†, Nataša Jovanović†, Jesse Allardice, Afshin Dehghan, Amir Zamir†, Roman Bachmann†*, Oğuzhan Fatih Kar†*
Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
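As a rough illustration of the search procedures named above, the following minimal sketch runs best-of-N and beam search over token sequences scored by a verifier. Everything in it is an assumption made for illustration and not the authors' implementation: the names `TARGET`, `toy_verifier`, `uniform_proposals`, `best_of_n`, and `beam_search` are hypothetical, the verifier is a toy stand-in for an image-text verifier, and the uniform proposal function stands in for the case without a trained AR prior. The toy verifier weights early positions more heavily to mimic a coarse-to-fine ordering in which partial sequences can already be scored meaningfully.

```python
import random
from typing import Callable, List, Tuple

Tokens = Tuple[int, ...]

VOCAB = list(range(10))            # hypothetical toy vocabulary
TARGET: Tokens = (3, 1, 4, 1, 5)   # hypothetical "ideal" token sequence


def toy_verifier(prefix: Tokens) -> float:
    """Score a (possibly partial) sequence; earlier tokens carry more weight."""
    return sum(1.0 / (i + 1) for i, t in enumerate(prefix) if t == TARGET[i])


def uniform_proposals(prefix: Tokens) -> List[int]:
    """No learned prior: every vocabulary token is a candidate continuation."""
    return VOCAB


def best_of_n(verifier: Callable[[Tokens], float], n: int, length: int,
              rng: random.Random) -> Tokens:
    """Sample n complete sequences at random and keep the highest-scoring one."""
    candidates = [tuple(rng.choice(VOCAB) for _ in range(length))
                  for _ in range(n)]
    return max(candidates, key=verifier)


def beam_search(propose: Callable[[Tokens], List[int]],
                verifier: Callable[[Tokens], float],
                length: int, beam_width: int) -> Tokens:
    """Extend prefixes token by token, keeping the top-scoring partial sequences."""
    beams: List[Tokens] = [()]
    for _ in range(length):
        expanded = [prefix + (tok,) for prefix in beams for tok in propose(prefix)]
        expanded.sort(key=verifier, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


if __name__ == "__main__":
    print(best_of_n(toy_verifier, n=64, length=len(TARGET), rng=random.Random(0)))
    print(beam_search(uniform_proposals, toy_verifier, len(TARGET), beam_width=2))
```

Beam search only helps here because the toy verifier returns a useful score for partial sequences. In this sketch that property is built in by construction, whereas the abstract attributes it to the coarse-to-fine token ordering.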