Several recent studies have concentrated on the generation of wrappers for web data sources. As wrappers can easily be described as grammars, results from grammatical inference could play a significant role in this research field. Recent work has identified a new subclass of regular languages, called Prefix Mark-Up Languages, that nicely abstracts the structures usually found in the HTML pages of large web sites; this class has been proven to be identifiable in the limit, and a polynomial-time unsupervised learning algorithm has been developed for it. Unfortunately, many real-life web pages do not fall into this class of languages. We argue that this is mainly due to the ambiguity of HTML. In this paper we present an approach to detect and remove HTML ambiguities. Our approach is based on preprocessing techniques that analyze pages and transform them so that they fall within the class of Prefix Mark-Up Languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We also report on experiments that we have conducted to evaluate the approach.
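To make the idea that a wrapper can be described as a grammar concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: the wrapper is written by hand rather than inferred, and the names `WRAPPER`, `ITEM`, `extract`, and the two sample pages are hypothetical, chosen only for illustration. The regular expression plays the role of the grammar: literal tags stand for the page template, and capture groups mark the data fields.

```python
import re

# Hypothetical hand-written wrapper (an assumption for illustration, not
# inferred as in the paper): literal tags are the template, capture groups
# are the data fields.
WRAPPER = re.compile(
    r"<html><h1>(?P<title>[^<]+)</h1>"
    r"<ul>(?P<items>(?:<li>[^<]+</li>)+)</ul></html>"
)
ITEM = re.compile(r"<li>([^<]+)</li>")


def extract(page: str) -> dict:
    """Apply the wrapper to one page; raise if the page is outside its language."""
    match = WRAPPER.fullmatch(page)
    if match is None:
        raise ValueError("page is not generated by this wrapper's grammar")
    return {
        "title": match.group("title"),
        "items": ITEM.findall(match.group("items")),
    }


# Two made-up pages that share one template but carry different data.
PAGE_A = "<html><h1>Databases</h1><ul><li>Codd</li><li>Ullman</li></ul></html>"
PAGE_B = "<html><h1>Logic</h1><ul><li>Tarski</li></ul></html>"

RESULT_A = extract(PAGE_A)
RESULT_B = extract(PAGE_B)
```

Both pages belong to the language of the same expression, so a single wrapper extracts the data from each. A page that deviates from the template, for instance through inconsistently used mark-up, is rejected.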
Crescenzi, V., Merialdo, P. (2008). Wrapper Inference for Ambiguous Web Pages. Applied Artificial Intelligence, 22, 21-52 [10.1080/08839510701853093].
Wrapper Inference for Ambiguous Web Pages
Crescenzi, Valter; Merialdo, Paolo
2008-01-01