Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P. (2010). Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources. In 22nd International Conference, CAiSE 2010, Proceedings. Lecture Notes in Computer Science, vol. 6051. Berlin: Springer [10.1007/978-3-642-13094-6_8].
Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources
CRESCENZI, VALTER; MERIALDO, PAOLO
2010-01-01
Abstract
Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.
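To make the idea of jointly estimating value probabilities and source accuracies concrete, the following is a minimal illustrative sketch. It is not the model proposed in the paper: it is a generic iterative Bayesian voting scheme, it treats one attribute at a time, and it ignores copying between sources entirely. The function name `reconcile`, the parameters `n_false` (assumed number of wrong values a source might report) and `init_accuracy`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def reconcile(claims, n_false=10, init_accuracy=0.8, iterations=20):
    """Alternate between estimating value probabilities and source accuracies.

    claims: {source: {object: reported_value}} (hypothetical input layout).
    Returns (probs, accuracy), where probs[object][value] is the estimated
    probability that `value` is correct and accuracy[source] is in (0, 1).
    """
    # Regroup the claims by object: {object: {source: value}}.
    by_object = defaultdict(dict)
    for source, facts in claims.items():
        for obj, value in facts.items():
            by_object[obj][source] = value

    accuracy = {source: init_accuracy for source in claims}
    probs = {}
    for _ in range(iterations):
        # Step 1: score each observed candidate value, assuming sources are
        # independent (no copier handling in this sketch).
        probs = {}
        for obj, votes in by_object.items():
            scores = {}
            for candidate in set(votes.values()):
                score = 1.0
                for source, value in votes.items():
                    a = accuracy[source]
                    # A source is right with probability a; otherwise its
                    # error is spread uniformly over n_false wrong values.
                    score *= a if value == candidate else (1.0 - a) / n_false
                scores[candidate] = score
            total = sum(scores.values())
            probs[obj] = {c: s / total for c, s in scores.items()}

        # Step 2: a source's accuracy is the mean probability of its claims,
        # clamped away from 0 and 1 so that scores never collapse to zero.
        for source, facts in claims.items():
            if not facts:
                continue
            mean = sum(probs[obj][v] for obj, v in facts.items()) / len(facts)
            accuracy[source] = min(max(mean, 0.01), 0.99)
    return probs, accuracy


# Toy data, invented for illustration: s3 disagrees with s1 and s2 on "x".
claims = {
    "s1": {"x": "a", "y": "b"},
    "s2": {"x": "a", "y": "b"},
    "s3": {"x": "c", "y": "b"},
}
probs, accuracy = reconcile(claims)
```

In this sketch two sources that always agree reinforce each other, which is exactly the kind of consensus that becomes misleading when one of them is a copier; accounting for that dependence, and for several attributes at once, is what the abstract describes and what this simplified version leaves out.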