Red: Redundancy-driven data extraction from result pages

Guo J.; Crescenzi V.; Furche T.; Grasso G.; Gottlob G.

2019-01-01

Abstract

Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not yet been fully exploited for unsupervised data extraction: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of a single object and links to a page dedicated to the details of that object. We present red, an automatic approach and prototype system for extracting data records from sites that follow this publishing pattern. red leverages the inherent redundancy between result records and their corresponding detail pages to obtain an effective yet fully unsupervised and domain-independent method. It can extract from result pages all the object attributes that appear both in the result records and in the corresponding detail pages. Compared with previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology) and can achieve significantly higher accuracy while automatically selecting only object attributes, a task that is out of scope for traditional fully unsupervised approaches. Compared with previous supervised or semi-supervised methods, red can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.
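
The redundancy idea in the abstract can be illustrated with a toy sketch. This is not the red algorithm, which this record does not describe in detail; it only assumes that a result record and its detail page are available as HTML strings and that an attribute value shows up as an identical text node in both. The names `TextCollector`, `text_nodes`, and `redundant_values`, as well as the sample markup, are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty, whitespace-normalized text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.texts.append(text)


def text_nodes(html):
    """Return the text nodes of `html` in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def redundant_values(record_html, detail_html):
    """Return the record's text nodes that also occur on the detail page.

    Assumption for this sketch: exact (case-insensitive) text equality is
    enough to call a value redundant. A real system would need something
    more robust than this.
    """
    detail = {t.casefold() for t in text_nodes(detail_html)}
    return [t for t in text_nodes(record_html) if t.casefold() in detail]


# Hypothetical result record and the detail page it links to.
record = '<li><a href="/jobs/42">Data Engineer</a> <span>Rome</span> <span>New!</span></li>'
detail = '<h1>Data Engineer</h1><p>Location: <b>Rome</b></p><p>Apply by email.</p>'

print(redundant_values(record, detail))
```

In this sketch the decorative label "New!" is dropped because it never occurs on the detail page, while the job title and location are kept. That loosely mirrors how redundancy can separate object attributes from page decoration without domain knowledge.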

Year of publication: 2019

ISBN: 9781450366748

Citation: Guo, J., Crescenzi, V., Furche, T., Grasso, G., Gottlob, G. (2019). Red: Redundancy-driven data extraction from result pages. In The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019 (pp. 605-615). 1515 Broadway, New York, NY 10036-9998, USA: Association for Computing Machinery, Inc. [10.1145/3308558.3313529].

Appears in types: 4.1 Contribution in conference proceedings

Files in this record:

File	Access	Type	Size	Format
p605-guo.pdf	Open access	Post-print document	1.73 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/372507
