A new benchmark designed to assess artificial intelligence’s spatial awareness, dubbed SAW-Bench, reveals a significant gap between machine perception and human comprehension of everyday surroundings. Researchers presented the benchmark alongside a dataset of 786 real-world videos and 2,071 annotated question-answer pairs, captured using Ray-Ban Meta smart glasses.
The work, led by Chuhan Li of UC Santa Barbara, Ruilin Han of Yale University, and Joy Hsu of Stanford University, alongside collaborators from the University of Maryland, Amazon, and the University of California, Merced, addresses a limitation in current multimodal foundation models (MFMs). According to the research team, existing benchmarks focus primarily on relationships among objects, neglecting the observer-centric perspective that true spatial understanding requires.
SAW-Bench challenges AI models to reason about space from an embodied perspective, mirroring how humans perceive and interact with the world. Tasks within the benchmark require models to determine relative directions, plan routes, and assess spatial affordances – the possibilities for action within an environment. These tasks necessitate an understanding of the observer’s location, orientation, and trajectory.
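To make the task format concrete, here is a minimal sketch of what a single multiple-choice item in an egocentric benchmark of this kind might look like. The field names and the sample question are hypothetical illustrations, not taken from the SAW-Bench release.

```python
# Hypothetical egocentric spatial-reasoning item; field names and content
# are invented for illustration and may not match SAW-Bench's actual schema.
sample_item = {
    "video": "clip_0042.mp4",       # first-person footage from smart glasses
    "task": "relative_direction",   # other tasks: route planning, affordances
    "question": "After passing the doorway, is the couch to your left or right?",
    "choices": ["left", "right", "directly ahead", "behind you"],
    "answer": "left",               # ground truth from human annotation
}
```

Answering such an item requires tracking the wearer’s position and heading over time, which is precisely the observer-centric reasoning the benchmark targets.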
Initial evaluations show a performance gap of 37.66% between humans and Gemini 3 Flash, currently the best-performing MFM tested on SAW-Bench; measured against the overall human accuracy of 91.55%, that gap implies an overall score of roughly 53.89% for the model. Human performance peaked at 94.00% on the Self-Localization task, while Reverse Route Plan proved the most challenging task for humans, at 79.01%.
Other models tested included Qwen3-VL 235B-A22B (41.40%), Qwen2.5-VL 32B (36.46%), Qwen3-VL 8B (36.12%), and LLaVA OneVision 72B (33.70%). These results underscore how far current AI systems remain from human-level spatial reasoning in active, real-world environments.
The research team emphasized the importance of using naturally captured footage, despite the variability in lighting, occlusion, and camera motion that it introduces. The use of Ray-Ban Meta smart glasses was intended to mirror human visual experience more closely than traditional camera setups. According to his LinkedIn profile, Chuhan Li is a postdoctoral researcher at Stanford, having previously studied at Yale University and Boston University.
The development of SAW-Bench aims to move beyond simply identifying objects and their relationships, focusing instead on how a model interprets the environment from the perspective of an agent. Researchers suggest that current models often rely on superficial cues rather than building a genuine understanding of camera geometry.
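As a rough illustration of the kind of camera-geometry reasoning involved, the sketch below rotates a world-frame displacement into an observer’s egocentric frame and classifies the object’s relative direction. The function, coordinate conventions, and example values are invented for this illustration and are not drawn from the paper.

```python
import math

def relative_direction(observer_xy, observer_yaw, object_xy):
    """Classify an object's direction in the observer's egocentric frame.

    observer_yaw is the heading in radians (0 = facing +x, counterclockwise
    positive). A model with genuine camera-geometry understanding must perform
    an equivalent frame change; surface cues in pixels do not generalize.
    """
    dx = object_xy[0] - observer_xy[0]
    dy = object_xy[1] - observer_xy[1]
    # Rotate the displacement by -yaw into the observer's frame:
    # forward = +x axis, left = +y axis.
    fwd = math.cos(observer_yaw) * dx + math.sin(observer_yaw) * dy
    left = -math.sin(observer_yaw) * dx + math.cos(observer_yaw) * dy
    angle = math.degrees(math.atan2(left, fwd))  # 0 deg = ahead, +90 deg = left
    if -45 <= angle <= 45:
        return "ahead"
    if 45 < angle <= 135:
        return "left"
    if -135 <= angle < -45:
        return "right"
    return "behind"

# Example: observer at the origin facing +x; an object at (0, 1) is to the left.
print(relative_direction((0.0, 0.0), 0.0, (0.0, 1.0)))  # -> "left"
```

In a benchmark setting the observer’s pose is not given; a model must infer it from the video itself, which is where superficial cues tend to fail.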