Several recent studies have concentrated on the generation of wrappers for web data sources. As wrappers can easily be described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called Prefix Mark-Up Languages, that nicely abstracts the structures usually found in the HTML pages of large web sites; this class has been proved to be identifiable in the limit, and a polynomial unsupervised learning algorithm has been developed for it. Unfortunately, many real-life web pages do not fall within this class of languages. We argue that this is mainly due to the ambiguity of HTML. In this paper we present an approach to detect and remove HTML ambiguities. Our approach is based on preprocessing techniques that analyze pages and transform them so that they fall within the class of Prefix Mark-Up Languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We also report on experiments conducted to evaluate the approach.

Crescenzi, V., Merialdo, P. (2008). WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES. APPLIED ARTIFICIAL INTELLIGENCE, 22, 21-52 [10.1080/08839510701853093].

WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

CRESCENZI, VALTER;MERIALDO, PAOLO
2008-01-01


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/148365