Robust entity resolution using random graphs

Firmani Donatella
2018-01-01

Abstract

Entity resolution (ER) seeks to identify which records in a data set refer to the same real-world entity. Given the diversity of ways in which entities can be represented, matched and distinguished, ER is known to be a challenging task for automated strategies, but relatively easier for expert humans. In our work, we abstract the knowledge of experts with the notion of a binary oracle. Our oracle can answer questions of the form "do records u and v refer to the same entity?" under a flexible error model, allowing for some questions to be more difficult to answer correctly than others. Our contribution is a general error correction tool that can be leveraged by a variety of hybrid human-machine ER algorithms, based on a formal way of selecting indirect "control queries". In our experiments we demonstrate that correction-less ER algorithms equipped with our tool can perform even better than recent ER algorithms specifically designed for correcting errors. Our control queries are selected among those that provide the strongest connectivity between records of each cluster, based on the concept of graph expanders (which are sparse graphs with formal connectivity properties). We give formal performance guarantees for our toolkit and provide experiments on real and synthetic data.
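
To make the idea of a noisy binary oracle and "control queries" more concrete, the following is a minimal illustrative sketch, not the method of the paper. It assumes a uniform error rate (the abstract describes a more flexible error model) and replaces the expander-based selection of control queries with a plain majority vote over randomly sampled cluster members. The names `make_noisy_oracle`, `belongs_to_cluster`, `truth`, `error_rate` and `num_controls` are hypothetical and introduced only for this example.

```python
import random


def make_noisy_oracle(truth, error_rate, rng):
    """Build an oracle answering "do records u and v refer to the same entity?".

    Assumption: every question is answered wrongly with the same probability
    `error_rate`. Answers are cached, so asking the same question twice
    returns the same (possibly wrong) answer.
    """
    cache = {}

    def oracle(u, v):
        key = (min(u, v), max(u, v))  # the question is symmetric in u and v
        if key not in cache:
            correct = truth[u] == truth[v]
            flipped = rng.random() < error_rate
            cache[key] = (not correct) if flipped else correct
        return cache[key]

    return oracle


def belongs_to_cluster(record, cluster, oracle, num_controls, rng):
    """Decide membership by majority vote over several control queries.

    Assumption: control records are sampled uniformly at random from the
    cluster, a simple stand-in for an expander-based choice.
    """
    controls = rng.sample(sorted(cluster), min(num_controls, len(cluster)))
    yes_votes = sum(oracle(record, member) for member in controls)
    return 2 * yes_votes > len(controls)
```

Asking several indirect questions instead of a single one means that one wrong oracle answer does not by itself place a record in the wrong cluster, which is the intuition behind error correction through control queries.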
Year: 2018
ISBN: 9781450317436
Galhotra, S., Firmani, D., Saha, B., Srivastava, D. (2018). Robust entity resolution using random graphs. In 44th ACM SIGMOD International Conference on Management of Data (SIGMOD), Winner of the REPRODUCIBILITY AWARD, Class A++ (GII-GRIN rating) (pp. 3-18). Association for Computing Machinery [10.1145/3183713.3183755].

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/349019