Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI

Xinyue Gui¹, Ding Xia¹, Mark Colley², Yuan Li¹, Vishal Chauhan¹, Anubhav Anubhav¹, Zhongyi Zhou³, Ehsan Javanmardi¹, Stela Hanbyeol Seo⁴, Chia-Ming Chang⁵, Manabu Tsukada¹, Takeo Igarashi¹

¹The University of Tokyo, ²UCL Interaction Centre, ³Google
⁴Kyoto University, ⁵National Taiwan University of Arts
CHI 2026 · Barcelona, Spain


Peeking Ahead of the Field Study explores using Vision-Language Model (VLM) personas as low-cost proxy participants to preview field study outcomes. Left: Field study with real participants and VLM personas. Right: Data comparison and researcher interviews showing the effectiveness of VLM personas.

Abstract

Field studies are irreplaceable in human-computer interaction research but are costly and error-prone. To address these challenges, we propose using Vision-Language Model (VLM) personas as low-cost proxy participants to preview field study outcomes before running the studies with real people.

We conducted two parallel studies on a street-crossing task in the presence of an autonomous vehicle (AV): (1) a real-world field study with 20 human participants and (2) a video-based simulation using 20 VLM personas modeled after them. The VLM personas navigate through 5 spatial positions across up to 8 time steps, choosing forward, stop, or backward at each step, producing decision trajectories that can be compared directly to human behavior.

Our findings demonstrate that VLM personas can effectively predict pedestrian behavior patterns, offering a promising approach to reduce the cost and risk of field studies while providing valuable insights for research design.

How the Simulator Works

The VLM persona navigates through 5 positions (0.8 m apart) across up to 8 time steps (1 second each), choosing forward, stop, or backward at each step. At every step, the simulator plays the video clip that matches the current (position, time) state and records the choice, producing a full decision trajectory that can be compared directly to human behavior.
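The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual code: the names `run_trajectory`, `clip_id`, and `choose_action` are hypothetical, the clip naming scheme is invented, and ending the run once the last position is reached is an assumption based on the phrase "up to 8 time steps". In a real setup, `choose_action` would wrap a call to the VLM persona.

```python
from typing import Callable, List, Tuple

NUM_POSITIONS = 5   # spatial positions, 0.8 m apart (from the page)
STEP_M = 0.8        # distance between adjacent positions, in meters
MAX_STEPS = 8       # at most 8 time steps of 1 second each

# Each action shifts the position index by this amount.
ACTIONS = {"forward": 1, "stop": 0, "backward": -1}

Step = Tuple[int, int, str]  # (time step, position index, chosen action)


def clip_id(position: int, t: int) -> str:
    """Hypothetical naming scheme for the clip shown at a (position, time) state."""
    return f"pos{position}_t{t}"


def run_trajectory(choose_action: Callable[[str, int, int], str]) -> List[Step]:
    """Roll out one decision trajectory.

    `choose_action(clip, position, t)` stands in for the VLM persona and
    must return "forward", "stop", or "backward".
    """
    position = 0
    trajectory: List[Step] = []
    for t in range(MAX_STEPS):
        action = choose_action(clip_id(position, t), position, t)
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        trajectory.append((t, position, action))
        # Clamp so the persona cannot leave the 5-position grid.
        position = min(max(position + ACTIONS[action], 0), NUM_POSITIONS - 1)
        # Assumption: the run ends once the last position is reached.
        if position == NUM_POSITIONS - 1:
            break
    return trajectory


if __name__ == "__main__":
    # A trivial stand-in policy that always walks forward.
    for t, pos, action in run_trajectory(lambda clip, pos, t: "forward"):
        print(f"t={t}s  x={pos * STEP_M:.1f} m  action={action}")
```

Because each trajectory is a list of (time, position, action) tuples, a human participant's recorded path can be encoded in the same form and compared step by step.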

Video Scenarios

Six conditions: 3 external human-machine interface (eHMI) types (light strip, animated eyes, no eHMI) × 2 AV behaviors (stop or pass).
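As a small illustration of the 3 × 2 design, the six conditions can be enumerated as a Cartesian product. The constant names and string labels below are assumptions chosen for readability, not identifiers from the project's code.

```python
from itertools import product

# Labels taken from the condition description; the names are illustrative.
EHMI_TYPES = ("light strip", "animated eyes", "no eHMI")
AV_BEHAVIORS = ("stop", "pass")

# Every (eHMI type, AV behavior) pair is one video scenario.
CONDITIONS = list(product(EHMI_TYPES, AV_BEHAVIORS))

if __name__ == "__main__":
    for ehmi, behavior in CONDITIONS:
        print(f"{ehmi} / AV will {behavior}")
```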

BibTeX


@inproceedings{gui2026personavlm,
  title     = {Peeking Ahead of the Field Study: Exploring VLM Personas as Support Tools for Embodied Studies in HCI},
  author    = {Gui, Xinyue and Xia, Ding and Colley, Mark and Li, Yuan and Chauhan, Vishal and Anubhav, Anubhav and Zhou, Zhongyi and Javanmardi, Ehsan and Seo, Stela Hanbyeol and Chang, Chia-Ming and Tsukada, Manabu and Igarashi, Takeo},
  booktitle = {Proceedings of the CHI Conference on Human Factors in Computing Systems},
  year      = {2026},
  publisher = {ACM},
  address   = {Barcelona, Spain},
  doi       = {10.1145/3772318.3790537}
}