Several studies have concentrated on the generation of wrappers for web data sources. Because wrappers can easily be described as grammars, the body of work on grammatical inference could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstracts the structures usually found in the HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a polynomial-time (PTIME) unsupervised learning algorithm has previously been developed for it. Unfortunately, many real-life web pages do not fall into this class of languages. In this article we analyze the roots of the problem and propose a technique that transforms pages in order to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without giving up the formal foundations defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.
Valter Crescenzi & Paolo Merialdo (2008). Wrapper Inference for Ambiguous Web Pages. Applied Artificial Intelligence, 22(1), 21-52.
Title: | Wrapper Inference for Ambiguous Web Pages |
Authors: | Valter Crescenzi, Paolo Merialdo |
Publication date: | 2008 |
Journal: | Applied Artificial Intelligence |
Handle: | http://hdl.handle.net/11590/118185 |
Appears in types: | 1.1 Journal article |