In data-intensive web sites, pages are generated by scripts that embed data from a back-end database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, pages with data about athletes are likely to be associated with a template that differs from the one used to generate pages about coaches or referees. This article presents a method to classify web pages according to their associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method relies on a simple yet effective model that abstracts some structural features of a web page. We present the results of an extensive experimental analysis that shows the performance of our method, in terms of both recall and precision, on a large number of real-world web pages.
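The abstract does not say which structural features the model abstracts. As a rough illustration of the general idea only, the sketch below assumes a page can be summarized by the set of its root-to-element tag paths and that two pages can be compared with the Jaccard coefficient of those sets. Both choices, along with the names `TagPathCollector`, `tag_paths`, `similarity`, and the three toy pages, are assumptions made for this example and are not taken from the article.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # e.g. {"html", "html/body", "html/body/h1"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural summary (set of tag paths) of an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def similarity(page_a, page_b):
    """Jaccard coefficient of the two pages' tag-path sets (assumed measure)."""
    a, b = tag_paths(page_a), tag_paths(page_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: two "athlete" pages from one template, one "coach" page.
athlete_1 = ("<html><body><h1>Ann</h1>"
             "<table><tr><td>100m</td></tr></table></body></html>")
athlete_2 = ("<html><body><h1>Bob</h1>"
             "<table><tr><td>200m</td></tr><tr><td>400m</td></tr></table>"
             "</body></html>")
coach = "<html><body><h1>Carl</h1><ul><li>Team A</li></ul></body></html>"
```

Under these assumptions, the two athlete pages share every tag path even though their data differ, while the coach page shares only the outer skeleton, so `similarity(athlete_1, athlete_2)` is higher than `similarity(athlete_1, coach)`. A classifier in this spirit would group pages whose similarity exceeds a chosen threshold.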
LORENZO BLANCO, VALTER CRESCENZI, & PAOLO MERIALDO (2008). Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships. JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 14(11), 1877-1892 [10.3217/jucs-014-11].
Title: | Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships | |
Authors: | Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo | |
Publication date: | 2008 | |
Journal: | Journal of Universal Computer Science | |
Handle: | http://hdl.handle.net/11590/118233 | |
Appears in categories: | 1.1 Journal article |