Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a targetWeb site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.

Crescenzi, V., Merialdo, P., Paolo, M. (2005). Clustering Web pages based on their structure. DATA & KNOWLEDGE ENGINEERING, 54(3), 279-299 [10.1016/j.datak.2004.11.004].

Clustering Web pages based on their structure

CRESCENZI, VALTER;MERIALDO, PAOLO;
2005-01-01

Abstract

Several techniques have been recently proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a targetWeb site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
Crescenzi, V., Merialdo, P., Paolo, M. (2005). Clustering Web pages based on their structure. DATA & KNOWLEDGE ENGINEERING, 54(3), 279-299 [10.1016/j.datak.2004.11.004].
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11590/143276
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 60
  • ???jsp.display-item.citation.isi??? 35
social impact