Paddling in a data lake is strenuous for a data scientist. Because a data lake is a loosely structured collection of raw data with little or no meta-information available, the difficulties of extracting insights from it start in the initial phases of data analysis. Indeed, data preparation, which involves many complex operations (such as source and feature selection, exploratory analysis, data profiling, and data curation), is a long and involved activity for navigating the lake before getting precious insights at the finish line. In this context, we demonstrate kayak, a framework that supports data preparation in a data lake with ad-hoc primitives and allows data scientists to cross the finish line sooner. kayak takes into account how long the user is willing to wait for the primitives' results, and it uses incremental execution strategies to produce informative previews of these results. The framework is based on careful management of metadata and on features that limit human intervention, thus scaling smoothly as the data lake evolves.
Maccioni, A., & Torlone, R. (2017). Crossing the finish line faster when paddling the Data Lake with kayak. Proceedings of the VLDB Endowment, 10(12), 1853-1856.
|Title:||Crossing the finish line faster when paddling the Data Lake with kayak|
|Publication date:||2017|
|Citation:||Maccioni, A., & Torlone, R. (2017). Crossing the finish line faster when paddling the Data Lake with kayak. Proceedings of the VLDB Endowment, 10(12), 1853-1856.|
|Appears in types:||1.1 Journal article|