Robust Robotic Control from Pixels Using Contrastive Recurrent State-Space Models

Authors: Nitish Srivastava, Walter Talbott, Martin Bertran Lopez, Shuangfei Zhai, Josh Susskind

Modeling the world can benefit robot learning by providing a rich training signal for shaping an agent’s latent state space. However, learning world models in unconstrained environments over high-dimensional observation spaces such as images is challenging. One source of difficulty is the presence of irrelevant but hard-to-model background distractions, and unimportant visual details of task-relevant entities. We address this issue by learning a recurrent latent dynamics model which contrastively predicts the next observation. This simple model leads to surprisingly robust robotic control even with simultaneous camera, background, and color distractions. We outperform alternatives such as bisimulation methods which impose state-similarity measures derived from divergence in future reward or future optimal actions. We obtain state-of-the-art results on the Distracting Control Suite, a challenging benchmark for pixel-based robotic control.
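To make the idea of contrastively predicting the next observation more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not specify the architecture or the exact loss, so this uses a generic InfoNCE-style objective and a toy tanh recurrence. The function names (`info_nce_loss`, `recurrent_step`), the weight matrices, and all shapes are assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def info_nce_loss(predicted, targets):
    """Generic InfoNCE loss (an assumption, not the paper's exact objective).

    Row i of `predicted` is the model's prediction for the next observation
    embedding in row i of `targets`. The other rows in the batch serve as
    negatives, so the model only needs to tell observations apart rather
    than reconstruct every pixel.
    """
    logits = predicted @ targets.T            # (B, B) similarity scores
    return -np.mean(np.diag(log_softmax(logits)))


def recurrent_step(h, obs_embedding, action, params):
    # Hypothetical single-layer tanh recurrence standing in for the
    # recurrent latent dynamics model; params = (W_h, W_o, W_a).
    W_h, W_o, W_a = params
    return np.tanh(h @ W_h + obs_embedding @ W_o + action @ W_a)
```

Because the loss only asks the latent state to identify the correct next observation among the negatives in the batch, details that do not help with that discrimination need not be modeled, which is one plausible reading of why such an objective could be less sensitive to background distractions than pixel reconstruction.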

Related readings and updates.

GenCtrl — A Formal Controllability Toolkit for Generative Models

March 6, 2026 | Research area: Methods and Algorithms | Conference: ICLR

As generative models become ubiquitous, there is a critical need for fine-grained control over the generation process. Yet, while controlled generation methods from prompting to fine-tuning proliferate, a fundamental question remains unanswered: are these models truly controllable in the first place? In this work, we provide a theoretical framework to formally answer this question. Framing human-model interaction as a control process, we propose…

ARMADA: Augmented Reality for Robot Manipulation and Robot-Free Data Acquisition

December 17, 2024 | Research areas: Data Science and Annotation, Human-Computer Interaction

Teleoperation for robot imitation learning is bottlenecked by hardware availability. Can high-quality robot data be collected without a physical robot? We present a system for augmenting Apple Vision Pro with real-time virtual robot feedback. By providing users with an intuitive understanding of how their actions translate to robot motions, we enable the collection of natural barehanded human data that is compatible with the limitations of…
