Space3D-Bench

Sixty participants took part in the user study, which was divided into two major sections:

40 evaluation questions
This part consisted of 40 questions drawn randomly from a scene with 100 questions. For each one, participants were shown the question, the response from the answering system, and either the ground-truth information or an image of the relevant scene, followed by an example answer (offered only as a suggestion). In each case, participants were asked to decide whether they accepted the system's answer as correct.
The pie chart for each question shows the distribution of participants' responses. We also state the decision of the automatic assessment system, whose correctness we aimed to evaluate through the user study. The icons next to the questions (✅ and ❌) indicate whether the automatic assessment system matched the decision of the majority of participants.
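The comparison between the majority of participants and the automatic assessment can be sketched as follows. This is only an illustration, not the code used in the study: the function names and the example votes are hypothetical, and treating a tie as "no majority" is an assumption, since the page does not say how ties were handled.

```python
from collections import Counter


def majority_decision(votes):
    """Return True if most participants accepted the answer, False if most
    rejected it, and None on a tie (assumption: ties yield no majority)."""
    counts = Counter(votes)
    accepted, rejected = counts[True], counts[False]
    if accepted == rejected:
        return None
    return accepted > rejected


def matches_majority(votes, auto_decision):
    """Return True if the automatic assessment agrees with the majority vote."""
    majority = majority_decision(votes)
    return majority is not None and majority == auto_decision


# Hypothetical votes for one question: 45 accept, 15 reject.
example_votes = [True] * 45 + [False] * 15
icon = "✅" if matches_majority(example_votes, auto_decision=True) else "❌"
```

With these made-up votes the majority accepts the answer, so an automatic decision of "accept" would receive the ✅ icon and a decision of "reject" would receive ❌.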
Participants were also given the following conditions:

the tolerance was 0.5 m for navigable distance, 0.2 m for straight-line distance, and 0.1 m for specific coordinates;
for questions about viewpoints, the provided images showed the viewpoint of the person described in the question.
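As a minimal sketch of how the tolerances above could be applied when judging a numeric answer, consider the snippet below. The dictionary keys and function names are hypothetical. The page does not state whether the coordinate tolerance applies per axis or to the Euclidean distance, so the per-axis check here is an assumption.

```python
# Tolerances in metres, as stated in the conditions given to participants.
TOLERANCE_M = {
    "navigable_distance": 0.5,
    "straight_line_distance": 0.2,
    "coordinates": 0.1,
}


def distance_within_tolerance(predicted, truth, kind):
    """Check a predicted distance (in metres) against the ground truth."""
    return abs(predicted - truth) <= TOLERANCE_M[kind]


def coordinates_within_tolerance(predicted, truth):
    """Check predicted coordinates against the ground truth.

    Assumption: the 0.1 m tolerance is applied to each axis separately.
    """
    if len(predicted) != len(truth):
        return False
    return all(
        abs(p - t) <= TOLERANCE_M["coordinates"]
        for p, t in zip(predicted, truth)
    )
```

For example, under these assumptions an answer of 12.3 m against a ground truth of 12.0 m would be accepted as a navigable distance but rejected as a straight-line distance, because the 0.3 m error exceeds the 0.2 m tolerance.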

10 abstracted questions
We also wanted to draw conclusions on how to address ambiguities. To that end, we abstracted 10 questions and posed them to the participants. We believe the divided opinions in some of the cases give valuable insight into the ambiguities of assessing natural-language answers.

Space3D-Bench: Spatial 3D Question Answering Benchmark

Detailed User Study Results
