As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining trust in them. It is widely accepted that knowing how data are altered in the pre-processing phase provides fundamental support for this goal, with data provenance used to track such changes. This paper focuses on the design and development of a system for collecting and managing the data provenance of data preparation pipelines in data science. We investigate publicly available machine learning pipelines to identify the features the tool most needs in order to cover a broad selection of data manipulations used in preprocessing. The investigation reveals that the operations used in practice can be implemented by combining a rather limited set of basic operators. We then illustrate and test implementation choices aimed at supporting provenance capture for those operations efficiently and with minimal effort for data scientists.
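To make the idea of capturing provenance over a small set of basic operators more concrete, the following is a minimal sketch and not the system described in the paper. It assumes a table held as a plain dictionary of records and two hypothetical operators, `drop_rows` and `transform_column`; the class `ProvenanceLog`, the action labels and all operation names are illustrative assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

# Assumption: a table is a mapping from a stable record id to a row (column -> value).
Table = Dict[str, Dict[str, Any]]


@dataclass
class ProvenanceLog:
    """Hypothetical log: one entry for every record an operation touches."""
    entries: List[Dict[str, Any]] = field(default_factory=list)

    def record(self, op: str, record_id: str, action: str,
               column: Optional[str] = None) -> None:
        self.entries.append(
            {"op": op, "record": record_id, "action": action, "column": column}
        )

    def history(self, record_id: str) -> List[Tuple[str, str]]:
        """Return (operation, action) pairs for one record, in pipeline order."""
        return [(e["op"], e["action"]) for e in self.entries
                if e["record"] == record_id]


def drop_rows(table: Table, predicate: Callable[[Dict[str, Any]], bool],
              log: ProvenanceLog, op: str) -> Table:
    """Basic operator (assumed): remove matching rows and log each removal."""
    kept: Table = {}
    for rid, row in table.items():
        if predicate(row):
            log.record(op, rid, "removed")
        else:
            kept[rid] = row
    return kept


def transform_column(table: Table, column: str, fn: Callable[[Any], Any],
                     log: ProvenanceLog, op: str) -> Table:
    """Basic operator (assumed): rewrite one column and log rows whose value changed."""
    out: Table = {}
    for rid, row in table.items():
        new_value = fn(row[column])
        if new_value != row[column]:
            log.record(op, rid, "updated", column)
        out[rid] = {**row, column: new_value}  # copy, so the input table is untouched
    return out


raw: Table = {
    "r1": {"age": 34, "city": " Rome"},
    "r2": {"age": None, "city": "Newcastle"},
    "r3": {"age": 51, "city": "Rome "},
}
log = ProvenanceLog()
step1 = drop_rows(raw, lambda row: row["age"] is None, log, op="drop_missing_age")
step2 = transform_column(step1, "city", str.strip, log, op="strip_city")

print(log.history("r1"))  # [('strip_city', 'updated')]
print(log.history("r2"))  # [('drop_missing_age', 'removed')]
```

In this sketch the operators themselves write the log entries, so the pipeline author only has to name each step. That is one possible way to keep the extra effort for the data scientist low, but it is a design choice of this example rather than a description of the authors' platform.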

Gregori, L., Missier, P., Stidolph, M., Torlone, R., Wood, A. (2024). Design and Development of a Provenance Capture Platform for Data Science. In Proceedings - 2024 IEEE 40th International Conference on Data Engineering Workshops, ICDEW 2024 (pp. 285-290). Institute of Electrical and Electronics Engineers Inc. [10.1109/ICDEW61823.2024.00042].

Design and Development of a Provenance Capture Platform for Data Science

Gregori L.; Missier P.; Torlone R.; Wood A.
2024-01-01


Use this identifier to cite or link to this document: https://hdl.handle.net/11590/478247