Minimizing the costs of the training data for learning Web wrappers

Creo, Rolando; Crescenzi, Valter; Qiu, Disheng; Merialdo, Paolo

2012-01-01

Abstract

Data extraction from the Web is an important problem. Several approaches have been developed to bring wrapper generation to web scale. Although they rely on different techniques and formalisms, they all learn a wrapper from a set of sample pages. Unsupervised approaches require only the sample pages, whereas supervised ones also need training data. Unfortunately, the accuracy of unsupervised techniques is not sufficient for many applications, while obtaining training data is expensive at web scale. This paper addresses the problem of minimizing the cost of collecting training data for learning web wrappers. We show that two interleaved problems affect this cost: the choice of the sample pages and the expressiveness of the wrapper language. We propose a solution that leverages results from learning theory, and we discuss the promising results of an experimental evaluation of our approach.

	Publication year
		2012

	Citation
		Creo, R., Crescenzi, V., Qiu, D., Merialdo, P. (2012). Minimizing the costs of the training data for learning Web wrappers. In CEUR Workshop Proceedings (pp. 35-40).

	Publication type
		4.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/307635
