COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization
Authors: Tian Qin**†, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu‡, Bowen Jin§, Mert Cemri¶, Jiarui Lu, Zirui Wang, Meng Cao
Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent’s ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
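The abstract's framing of travel planning as constrained preference optimization can be illustrated with a minimal sketch. All names, fields, and preference weights below are hypothetical, not from the COMPASS benchmark itself: an itinerary must satisfy hard constraints (here, a budget) to be acceptable, and among acceptable plans the agent should return the one that best optimizes soft preferences.

```python
from dataclasses import dataclass

@dataclass
class Itinerary:
    total_cost: float          # USD for flight + hotel
    flight_arrival_hour: int   # 0-23, earlier arrival preferred
    hotel_rating: float        # 1.0-5.0, higher preferred

def satisfies_hard_constraints(it: Itinerary, budget: float) -> bool:
    # Hard constraint: the plan must fit the user's budget.
    return it.total_cost <= budget

def preference_score(it: Itinerary) -> float:
    # Soft preferences with hypothetical weights: cheaper, earlier
    # arrival, better hotel. Higher score is better.
    return (-0.5 * it.total_cost / 100
            - 0.2 * it.flight_arrival_hour
            + 1.0 * it.hotel_rating)

def best_itinerary(candidates: list[Itinerary], budget: float):
    # First filter on hard constraints ("acceptable"), then
    # maximize soft preferences ("optimal").
    feasible = [c for c in candidates if satisfies_hard_constraints(c, budget)]
    if not feasible:
        return None  # no plan meets the hard constraints
    return max(feasible, key=preference_score)
```

In this framing, the "acceptable-optimal gap" corresponds to an agent that returns any element of `feasible` rather than the argmax of `preference_score`.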
Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents
May 1, 2026 · Research areas: Methods and Algorithms; Tools, Platforms, Frameworks · Workshop at ACL
This paper was accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026.
Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot…
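The evaluation dimensions named above (tool selection and parameter accuracy) can be sketched as a simple post-hoc trajectory check. The dict schema and field names here are assumptions for illustration, not the paper's actual evaluation harness:

```python
def assess_tool_call(predicted: dict, reference: dict) -> dict:
    """Post-hoc check of one tool call against a reference call.

    Each call is assumed to be {"tool": str, "args": {param: value}}.
    Returns per-dimension booleans of the kind commonly reported
    for tool-calling agents.
    """
    # Tool selection: did the agent pick the right tool?
    tool_ok = predicted["tool"] == reference["tool"]
    # Parameter accuracy: every reference argument present with
    # the right value (only meaningful if the tool matches).
    args_ok = tool_ok and all(
        predicted["args"].get(k) == v for k, v in reference["args"].items()
    )
    return {"tool_selection": tool_ok, "parameter_accuracy": args_ok}
```

Such a check runs only after the trajectory is complete, which is exactly the disconnect from the active execution loop that the abstract highlights.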
Aligning LLMs by Predicting Preferences from User Writing Samples
June 27, 2025 · Research areas: Human-Computer Interaction; Methods and Algorithms · Conference: ICML
Accommodating human preferences is essential for creating aligned LLM agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs acting as writing agents to infer a description of user preferences. Agent alignment then comes from conditioning on the inferred preference description. However, existing methods often produce generic preference descriptions that fail to capture the unique and…