In this work we analyze the typical data preparation operations in a machine learning pipeline and provide infrastructure for generating fine-grained provenance records from them, at the level of individual elements within a dataset. Our contributions are: (i) a formal definition of a core set of preprocessing operators, (ii) a provenance pattern for each of these operators, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
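To make the idea of element-level provenance concrete, the following is a minimal sketch of what capturing provenance for a single preprocessing operator (mean imputation) might look like. It is not the paper's library or its API: the names `ProvLog`, `cell_id`, and `impute_mean`, the record fields, and the `v<version>/r<row>/<column>` identifier scheme are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ProvLog:
    """Hypothetical in-memory store of element-level provenance records."""
    records: list = field(default_factory=list)

    def derive(self, activity, generated, derived_from, used):
        # One record per output element that the operator changed.
        self.records.append({
            "activity": activity,
            "generated": generated,
            "derived_from": derived_from,
            "used": used,
        })


def cell_id(version, row, col):
    """Illustrative identifier for one element (cell) of a dataset version."""
    return f"v{version}/r{row}/{col}"


def impute_mean(rows, col, log, version=0):
    """Replace None values in `col` with the column mean.

    Records, for each imputed cell, which input cell it replaces and
    which observed cells contributed to the imputed value.
    """
    observed = [(i, r[col]) for i, r in enumerate(rows) if r[col] is not None]
    fill = mean(value for _, value in observed)
    used = [cell_id(version, i, col) for i, _ in observed]

    out = []
    for i, r in enumerate(rows):
        new = dict(r)  # copy so the input version stays untouched
        if r[col] is None:
            new[col] = fill
            log.derive(
                activity="impute_mean",
                generated=cell_id(version + 1, i, col),
                derived_from=cell_id(version, i, col),
                used=used,
            )
        out.append(new)
    return out


rows = [{"age": 30}, {"age": None}, {"age": 50}]
log = ProvLog()
clean = impute_mean(rows, "age", log)
```

In this sketch only the imputed cell produces a record, which keeps the log proportional to the number of changed elements rather than to the size of the dataset. A real capture library would also need to handle unchanged elements, row and column removal, and joins.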
Chapman, A., Missier, P., Simonelli, G., Torlone, R. (2021). Fine-grained provenance for high-quality data science. In CEUR Workshop Proceedings. CEUR-WS.