Gregori, L., Lazzaro, P.L., Lazzaro, M., Missier, P., Torlone, R. (2025). An LLM-guided platform for multi-granular collection and management of data provenance. JOURNAL OF BIG DATA, 12(1) [10.1186/s40537-025-01209-3].

An LLM-guided platform for multi-granular collection and management of data provenance

Gregori L.; Lazzaro P. L.; Lazzaro M.; Missier P.; Torlone R.

2025-01-01

Abstract

As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining trust in them. It is widely accepted that knowing how data are altered in the pre-processing phase provides fundamental support for this goal, with data provenance used to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying the data provenance of data preparation pipelines in data science. We investigate publicly available machine learning pipelines to identify the features the tool most needs in order to cover a broad selection of pre-processing data manipulations. Building on this study, we present an approach for transparently collecting data provenance that uses an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for provenance capture and (ii) store an accurate description of all the activities in the input pipelines, so that each of them can be explained. We then illustrate and test implementation choices aimed at capturing provenance for data preparation pipelines efficiently and transparently for data scientists.

Year of publication: 2025
Appears in types: 1.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/520376