

Exploring the Potential of Multimodal Large Language Models for Question Answering on Artworks

Ferrato, Alessio; Limongelli, Carla; Gasparetti, Fabio; Sansonetti, Giuseppe; Micarelli, Alessandro
2025-01-01

Abstract

This paper investigates the application of a Multimodal Large Language Model to enhance visitor experiences in cultural heritage settings through Visual Question Answering (VQA) and Contextual Question Answering (CQA). We evaluate the zero-shot capabilities of LLaVA-7b (Large Language and Vision Assistant) on QA using the AQUA dataset. We assess how effectively it can answer questions about artworks, visual content, and contextual information through three experimental approaches. Our findings reveal that LLaVA demonstrates promising performance on visual questions, outperforming previous baselines, but faces challenges with questions requiring contextual understanding. The selective knowledge integration approach showed the best overall performance, suggesting that an efficient knowledge retrieval system could further enhance results. Moreover, we show how to exploit such models to provide correct personalized answers using a well-established visitor model.
Ferrato, A., Limongelli, C., Gasparetti, F., Sansonetti, G., Micarelli, A. (2025). Exploring the Potential of Multimodal Large Language Models for Question Answering on Artworks. In UMAP 2025 - Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization (pp. 432-436). New York, NY: Association for Computing Machinery. https://doi.org/10.1145/3708319.3733648
Files for this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/521177
Citations
  • PMC: n/a
  • Scopus: 1
  • Web of Science: 0