In this paper, we discuss the application of the concept of data quality to big data, highlighting how complex it is to define in a general way. Data quality is already a multidimensional concept, difficult to characterize precisely even for well-structured data. Big data add two further dimensions of complexity: (i) they are highly source specific, for which we adopt the UNECE classification, and (ii) they are highly unstructured and schema-less, often with gold standards that are either unavailable or very difficult to access. After providing a tutorial on data quality in traditional contexts, we analyze big data through the UNECE classification and then, for each type of data source, choose a specific instance of that type (notably deep Web data, sensor-generated data, and Twitter posts/short texts) and discuss how quality dimensions can be defined in these cases. The overall aim of the paper is therefore to identify further research directions in the area of big data quality while providing an up-to-date state of the art on data quality.
Firmani, D., Mecella, M., Scannapieco, M., Batini, C. (2016). On the Meaningfulness of “Big Data Quality”. Data Science and Engineering, 1(1), 6-20. doi:10.1007/s41019-015-0004-7.
On the Meaningfulness of “Big Data Quality”
Firmani Donatella;
2016-01-01