Many approaches have recently been introduced to automatically create or augment Knowledge Graphs (KGs) with facts extracted from Wikipedia, particularly from its structured components such as infoboxes. Although these structures are valuable, they represent only a fraction of the information expressed in the articles. In this work, we quantify the number of facts that can be harvested with high precision from the text of Wikipedia articles using information extraction techniques bootstrapped from the entities and relations already in a KG. Our experimental evaluation, which uses Freebase as the reference KG, reveals that we can augment several relations in the domain of people by more than 10%, with facts whose accuracy is over 95%. Moreover, the vast majority of these facts are missing from the infoboxes, YAGO, and DBpedia.
Cannaviccio, M., Barbosa, D., Merialdo, P. (2016). Accurate fact harvesting from natural language text in Wikipedia with Lector. In Proceedings of the 19th International Workshop on Web and Databases, WebDB 2016 (pp. 1-6). Association for Computing Machinery, Inc. [10.1145/2932194.2932203].
Accurate fact harvesting from natural language text in Wikipedia with Lector
CANNAVICCIO, MATTEO;MERIALDO, PAOLO
2016-01-01