SceneScout: Towards AI Agent-driven Access to Street View Imagery for Blind Users
Authors: Gaurav Jain†‡, Leah Findlater, Cole Gleason
People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people. In this work, we introduce SceneScout, a multimodal large language model (MLLM)-driven AI agent that enables accessible interactions with street view imagery. SceneScout supports two modes: (1) Route Preview, enabling users to familiarize themselves with visual details along a route, and (2) Virtual Exploration, enabling free movement within street view imagery. Our user study (N=10) demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. A technical evaluation shows that most descriptions are accurate (72%) and describe stable visual elements (95%) even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of using street view imagery to enhance navigation experiences.
Apple Workshop on Human-Centered Machine Learning 2024
July 24, 2025 | Research areas: Accessibility, Fairness, Human-Computer Interaction
A human-centered approach to machine learning (HCML) involves designing ML & AI technology that prioritizes the needs and values of the people using it. This leads to AI that complements and enhances human capabilities, rather than replacing them. Research in the area of HCML includes the development of transparent and interpretable machine learning systems to help people feel safer using AI, as well as strategies for predicting and preventing…
Contrastive learning typically matches pairs of related views among a number of unrelated negative views. Views can be generated (e.g., by augmentations) or observed. We investigate matching when there are more than two related views, which we call poly-view tasks, and derive new representation learning objectives using information maximization and sufficient statistics. We show that with unlimited computation, one should maximize the number…