Crossing the finish line faster when paddling the Data Lake with kayak

Maccioni, Antonio; Torlone, Riccardo
2017-01-01

Abstract

Paddling in a data lake is strenuous for a data scientist. Because a data lake is a loosely structured collection of raw data with little or no meta-information available, the difficulties of extracting insights from it begin in the initial phases of data analysis. Indeed, data preparation, which involves many complex operations (such as source and feature selection, exploratory analysis, data profiling, and data curation), is a long and involved activity that must be completed while navigating the lake, before reaching precious insights at the finish line. In this context, we demonstrate kayak, a framework that supports data preparation in a data lake with ad-hoc primitives and allows data scientists to cross the finish line sooner. kayak takes into account the user's tolerance for waiting on the primitives' results, and it uses incremental execution strategies to produce informative previews of these results. The framework relies on careful management of metadata and on features that limit human intervention, and it therefore scales smoothly as the data lake evolves.
Maccioni, A., Torlone, R. (2017). Crossing the finish line faster when paddling the Data Lake with kayak. Proceedings of the VLDB Endowment, 10(12), 1853-1856.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/329525