Sixty participants took part in the user study, which was divided into two major sections:
40 evaluation questions
This part consisted of 40 questions drawn randomly from the 100-question scene. For each question, participants were shown the question, the response from the answering system, and either the ground truth information or an image of the relevant scene, followed by an example answer (offered only as a suggestion). In each case, participants were asked to decide whether they accepted the system's answer as correct.
The pie chart for each question shows the distribution of participants' responses. We additionally state the decision of the automatic assessment system, whose correctness we aimed to evaluate through the user study. The icons next to the questions (✅ and ❌) indicate whether the automatic assessment system matched the decision of the majority of participants.
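The comparison between the majority of participants and the automatic assessment system can be illustrated with a minimal Python sketch. The function names, the boolean encoding of votes (True for "accept", False for "reject"), and the handling of ties are assumptions made for illustration and are not taken from the study; the vote counts in the usage example are made up.

```python
from collections import Counter


def majority_decision(votes):
    """Return True if most participants accepted the answer,
    False if most rejected it, and None on an exact tie
    (tie handling is an assumption, not part of the study)."""
    counts = Counter(votes)
    accepted, rejected = counts[True], counts[False]
    if accepted == rejected:
        return None
    return accepted > rejected


def agreement_icon(votes, automatic_decision):
    """Return the icon shown next to a question: a check mark if the
    automatic assessment matches the participants' majority, a cross
    if it does not, and '?' when there is no majority."""
    majority = majority_decision(votes)
    if majority is None:
        return "?"
    return "✅" if majority == automatic_decision else "❌"


# Hypothetical example: 45 of 60 participants accept the answer.
example_votes = [True] * 45 + [False] * 15
print(agreement_icon(example_votes, automatic_decision=True))
print(agreement_icon(example_votes, automatic_decision=False))
```

With an even number of participants an exact tie is possible, which is why the sketch returns a separate marker in that case rather than forcing a decision.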
Participants were also given the following additional conditions: