Cetorelli, V., Atzeni, P., Crescenzi, V., Milicchio, F. (2021). The smallest extraction problem. PROCEEDINGS OF THE VLDB ENDOWMENT, 14(11), 2445-2458 [10.14778/3476249.3476293].

The smallest extraction problem

Cetorelli V.;Atzeni P.;Crescenzi V.;Milicchio F.

2021-01-01

Abstract

We introduce landmark grammars, a new family of context-free grammars designed to describe the HTML source code of pages published by large, templated websites, and thereby to tackle Web data extraction problems effectively. They address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which has been largely neglected by the approaches presented in the literature despite over twenty years of research. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar in a family that best describes a set of pages while, at the same time, extracting their data. Finally, we present an unsupervised learning algorithm that induces a landmark grammar from a set of pages sharing a common HTML template, together with an automatic Web data extraction system. Experiments on established benchmarks show that the approach can contribute substantially to improving the state of the art.

Year of publication: 2021
Appears in categories: 1.1 Journal article

Files in this record:

File	Size	Format
VLDB 2021 p2445-crescenzi.pdf (open access; type: publisher's version (PDF); license: Creative Commons)	997.92 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/404519
