Noah: Creating data integration pipelines over continuously extracted web data
Cetorelli V.; Crescenzi V.; Merialdo P.; Voyat R.
Cetorelli, V., Crescenzi, V., Merialdo, P., Voyat, R. (2021). Noah: Creating data integration pipelines over continuously extracted web data. In CEUR Workshop Proceedings. CEUR-WS.
2021-01-01
Abstract
We present Noah, an ongoing research project that aims to develop a system for semi-automatically creating end-to-end Web data processing pipelines. The pipelines continuously extract and integrate information from multiple sites by leveraging the redundancy of the data published on the Web. The system is based on a novel hybrid human-machine learning approach in which the same types of questions can be posed interchangeably to human crowd workers and to automatic responders based on machine learning (ML) models. From the early stages of a pipeline, crowd workers are engaged to guarantee the quality of the output data and to collect training data, which are then used to progressively train and evaluate the automatic responders. The latter are later fully deployed into the data processing pipelines to scale the approach and to contain crowdsourcing costs. The combination of guaranteed quality and progressively reduced costs in the pipelines generated by our system can improve the investment and development processes of many applications that rely on the availability of such data processing pipelines.
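The hybrid routing idea described in the abstract, where the same question type is answered by crowd workers early on and handed off to an ML responder once enough training data has accumulated, can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the class names, the readiness criterion, and the handoff threshold are all assumptions made for the example.

```python
class CrowdResponder:
    """Stand-in for posting a question to human crowd workers."""

    def answer(self, question):
        # A real pipeline would call a crowdsourcing platform here.
        return f"crowd-answer({question})"


class ModelResponder:
    """Stand-in for an ML model trained on crowd-collected answers."""

    def __init__(self):
        self.training_data = []

    def train(self, question, answer):
        self.training_data.append((question, answer))

    @property
    def ready(self):
        # Toy readiness criterion (assumption): enough labeled examples.
        # The paper instead evaluates responders before full deployment.
        return len(self.training_data) >= 3

    def answer(self, question):
        return f"model-answer({question})"


class HybridRouter:
    """Routes a question type to the crowd while collecting training
    data, then to the automatic responder once it is deemed ready."""

    def __init__(self):
        self.crowd = CrowdResponder()
        self.model = ModelResponder()

    def ask(self, question):
        if self.model.ready:
            return "model", self.model.answer(question)
        answer = self.crowd.answer(question)
        # Crowd answers double as training data for the responder.
        self.model.train(question, answer)
        return "crowd", answer
```

Under this sketch, the first few questions are routed to (and paid for by) the crowd, and once the toy threshold is reached the model answers for free, mirroring the progressive cost reduction the abstract describes.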