Entity Resolution (ER) is the task of identifying pairs of records that refer to the same real-world entity (e.g., the same product or person). Pretrained Language Models (PLMs) achieve state-of-the-art ER performance but rely on large labeled datasets, which are costly to acquire. Data augmentation (DA) addresses this issue by generating synthetic training samples, yet methods like MixDA often produce unrealistic samples, which limits their effectiveness. In this paper, we propose a novel DA pipeline that integrates MixDA with symbolic rule-based validation. Our approach generates samples in the embedding space, reconstructs their textual representations via a generative model, and applies domain-specific rules to ensure real-world validity. Experiments on public ER benchmarks demonstrate that our method achieves F1 scores comparable to, and in some cases exceeding, those of PLM-based baselines trained on the full dataset. This work advances data-centric AI by reducing labeling costs.
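The three stages named in the abstract (generation in the embedding space, textual reconstruction, rule-based validation) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that MixDA-style generation amounts to a convex combination of two embeddings, it replaces the generative model with a toy decoder, and the two rules are hypothetical examples for a product domain. All names (mixup, toy_decode, price_is_positive, year_is_plausible, augment) are illustrative only.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Record = Dict[str, str]
Rule = Callable[[Record], bool]
Embedding = Sequence[float]


def mixup(a: Embedding, b: Embedding, lam: float) -> List[float]:
    """Convex combination lam * a + (1 - lam) * b (assumed form of the mixing step)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def toy_decode(embedding: Embedding) -> Record:
    """Toy stand-in for the generative model: reads price and year off the vector."""
    return {"price": f"{embedding[0]:.2f}", "year": str(round(embedding[1]))}


def price_is_positive(record: Record) -> bool:
    """Hypothetical domain rule: a product price must be a positive number."""
    try:
        return float(record.get("price", "")) > 0
    except ValueError:
        return False


def year_is_plausible(record: Record) -> bool:
    """Hypothetical domain rule: the year must fall in a plausible range."""
    year = record.get("year", "")
    return year.isdigit() and 1900 <= int(year) <= 2100


def augment(
    pairs: Sequence[Tuple[Embedding, Embedding]],
    decode: Callable[[Embedding], Record],
    rules: Sequence[Rule],
    lam: float,
) -> List[Record]:
    """Mix each pair, decode the result, and keep only records that pass every rule."""
    kept = []
    for emb_a, emb_b in pairs:
        candidate = decode(mixup(emb_a, emb_b, lam))
        if all(rule(candidate) for rule in rules):
            kept.append(candidate)
    return kept


# Two toy embedding pairs: the second mixes to a negative price and is rejected.
pairs = [
    ([10.0, 2000.0], [20.0, 2010.0]),
    ([-5.0, 1990.0], [1.0, 2000.0]),
]
kept = augment(pairs, toy_decode, [price_is_positive, year_is_plausible], lam=0.5)
```

In this toy run only the first mixed sample survives, which mirrors the role the abstract assigns to the rules: discarding synthetic samples that could not correspond to a real-world record.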
Napoleone, M., Console, M., Lenzerini, M., Merialdo, P., Papi, L., Poggi, A., et al. (2026). Domain-Constrained Data Augmentation for Entity Resolution. In Communications in Computer and Information Science (pp. 554-566). Springer Science and Business Media Deutschland GmbH [10.1007/978-3-032-19096-3_37].
Domain-Constrained Data Augmentation for Entity Resolution
Merialdo P.; Torlone R.
2026-01-01


