Data extraction from the Web is an important problem. Several approaches have been developed to bring the wrapper generation process to web scale. Although they rely on different techniques and formalisms, they all learn a wrapper from a set of sample pages. Unsupervised approaches require only a set of sample pages; supervised ones also need training data. Unfortunately, the accuracy obtained by unsupervised techniques is not sufficient for many applications. On the other hand, obtaining training data is not cheap at web scale. This paper addresses the issue of minimizing the cost of collecting training data for learning web wrappers. We show that two intertwined problems affect this issue: the choice of the sample pages and the expressiveness of the wrapper language. We propose a solution that leverages contributions from the field of learning theory, and we discuss the promising results of an experimental evaluation of our approach.
Creo, R., Crescenzi, V., Qiu, D., Merialdo, P. (2012). Minimizing the costs of the training data for learning Web wrappers. In CEUR Workshop Proceedings (pp.35-40).
Minimizing the costs of the training data for learning Web wrappers
Crescenzi, Valter; Qiu, Disheng; Merialdo, Paolo
2012-01-01