Clustering Web pages based on their structure

Crescenzi, Valter; Merialdo, Paolo; Missier, P.

doi:10.1016/j.datak.2004.11.004

Several techniques have recently been proposed to automatically generate web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe the structure of a web site as a graph: nodes are classes of pages that share a common structure, and edges represent links among instances of the page classes. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target web site, visits a limited number of pages, and produces an accurate model of the site structure. We also report on experiments performed on actual web sites.
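The graph model described in the abstract can be illustrated with a minimal sketch. It is not the algorithm from the paper: the structural signature used here (the set of tag paths that lead to links in a page) is an assumption chosen for illustration, and the names `LinkPathParser`, `signature`, and `build_site_model` are hypothetical. Pages with the same signature are placed in the same class, and an edge is added between two classes whenever a page of the first links to a page of the second.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkPathParser(HTMLParser):
    """Records, for each <a href>, the tag path from the root to the anchor."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.links = []   # (tag_path, href) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Assumed structural signature: the set of tag paths leading to links."""
    parser = LinkPathParser()
    parser.feed(html)
    paths = frozenset(path for path, _ in parser.links)
    hrefs = [href for _, href in parser.links]
    return paths, hrefs


def build_site_model(pages):
    """pages maps URL -> HTML. Returns (classes, edges) of the class graph."""
    class_of, outlinks = {}, {}
    members = defaultdict(list)
    for url, html in pages.items():
        sig, hrefs = signature(html)
        members[sig].append(url)
        class_of[url] = sig
        outlinks[url] = hrefs
    ids = {sig: i for i, sig in enumerate(members)}
    classes = {ids[sig]: urls for sig, urls in members.items()}
    edges = set()
    for url, hrefs in outlinks.items():
        for href in hrefs:
            if href in class_of:  # only links to pages that were visited
                edges.add((ids[class_of[url]], ids[class_of[href]]))
    return classes, edges
```

Exact signature equality is a deliberately crude grouping criterion; a practical system would need a tolerance for small structural differences between pages of the same class, and a policy for choosing which links to follow so that only a small portion of the site is visited.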

Crescenzi, V., Merialdo, P., Missier, P. (2005). Clustering Web pages based on their structure. DATA & KNOWLEDGE ENGINEERING, 54, 279-299 [10.1016/j.datak.2004.11.004].