Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making the manual annotation process more affordable. However, the tasks submitted to these platforms should be extremely simple, so that non-expert workers can perform them, and their number should be minimized to contain costs. We demonstrate ALFRED, a wrapper inference system supervised by the workers of a crowdsourcing platform. Training data are labeled values generated by means of membership queries, the simplest form of queries, posed to the crowd. ALFRED includes several original features: it automatically selects a representative sample set from the input collection of pages; to minimize wrapper inference costs, it dynamically sets the expressiveness of the wrapper formalism and adopts an active learning algorithm to select the queries posed to the crowd; and it can manage inaccurate answers provided by the workers engaged through crowdsourcing platforms.
Crescenzi, V., Merialdo, P., Qiu, D. (2013). ALFRED: Crowd assisted data extraction. In WWW 2013 Companion - Proceedings of the 22nd International Conference on World Wide Web (pp. 297-300).
ALFRED: Crowd assisted data extraction
CRESCENZI, VALTER;MERIALDO, PAOLO;QIU, DISHENG
2013-01-01
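The idea in the abstract of supervising wrapper inference through membership queries chosen by active learning can be illustrated with a toy sketch. This is not ALFRED's algorithm: the names `pick_query` and `infer_wrapper`, the dictionary-based page model, and the "split the candidates as evenly as possible" heuristic are all assumptions made for illustration, and the sketch assumes every answer is correct, so it does not cover the handling of inaccurate workers mentioned above.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy model (assumption): a page is a dict from a location name to its text,
# and a candidate wrapper rule maps a page to the value it would extract.
Page = Dict[str, str]
Rule = Callable[[Page], Optional[str]]
# A membership query asks: "is `value` the correct value on page `index`?"
Oracle = Callable[[int, str], bool]


def pick_query(pages: List[Page], rules: Dict[str, Rule]) -> Optional[Tuple[int, str]]:
    """Pick the (page index, value) pair whose yes/no answer splits the
    candidate rules most evenly. Returns None if no query can discriminate."""
    best, best_score = None, 0
    for index, page in enumerate(pages):
        supporters: Dict[Optional[str], int] = {}
        for rule in rules.values():
            value = rule(page)
            supporters[value] = supporters.get(value, 0) + 1
        for value, count in supporters.items():
            if value is None:
                continue
            # Rules discarded in the worst case, whatever the answer is.
            score = min(count, len(rules) - count)
            if score > best_score:
                best, best_score = (index, value), score
    return best


def infer_wrapper(pages: List[Page], rules: Dict[str, Rule],
                  oracle: Oracle) -> Tuple[List[str], int]:
    """Discard rules inconsistent with the answers until one rule remains or
    the remaining rules cannot be told apart. Returns (rule names, #queries)."""
    candidates = dict(rules)
    queries = 0
    while len(candidates) > 1:
        query = pick_query(pages, candidates)
        if query is None:
            break
        index, value = query
        answer = oracle(index, value)
        queries += 1
        candidates = {
            name: rule for name, rule in candidates.items()
            if (rule(pages[index]) == value) == answer
        }
    return list(candidates), queries


# Hypothetical example: three candidate rules for extracting a book title.
pages = [
    {"h1": "Dune", "h2": "Frank Herbert", "span": "Dune"},
    {"h1": "Emma", "h2": "Jane Austen", "span": "Penguin"},
]
rules = {
    "h1": lambda page: page.get("h1"),
    "h2": lambda page: page.get("h2"),
    "span": lambda page: page.get("span"),
}
truth = ["Dune", "Emma"]  # stands in for the crowd's (assumed correct) answers
result = infer_wrapper(pages, rules, lambda index, value: value == truth[index])
```

In this example the first query cannot separate the `h1` and `span` rules, because both extract the same value on the first page; a second query on the second page is needed. This is the kind of situation in which the choice of sample pages and of the queries posed affects the total annotation cost.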