Domain-Constrained Data Augmentation for Entity Resolution

Napoleone, M.; Console, M.; Lenzerini, M.; Merialdo, P.; Papi, L.; Poggi, A.; Scafoglieri, F.; Torlone, R.

doi:10.1007/978-3-032-19096-3_37

Entity Resolution (ER) is the task of identifying pairs of records that refer to the same real-world entity (e.g., same products or persons). Pretrained Language Models (PLMs) achieve state-of-the-art ER performance but rely on large labeled datasets, which are costly to acquire. Data augmentation (DA) addresses this issue by generating synthetic training samples, yet methods like MixDA often produce unrealistic samples, limiting their effectiveness. In this paper, we propose a novel DA pipeline that integrates MixDA with symbolic rule-based validation. Our approach generates samples in the embedding space, reconstructs their textual representations via a generative model, and applies domain-specific rules to ensure real-world validity. Experiments on public ER benchmarks demonstrate that our method achieves F1-scores comparable to, and in some cases exceeding, those of PLM-based baselines trained on the full dataset. This work advances data-centric AI by reducing the costs associated with labeling.
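The abstract names three stages: interpolation in the embedding space, reconstruction of text by a generative model, and rule-based validation. The sketch below illustrates only the first and last stages under stated assumptions, and is not the authors' implementation. It assumes a mixup-style convex combination of two embedding vectors and a small set of hypothetical product-domain rules. All function names, the record fields `title` and `price`, and the rules themselves are illustrative and do not come from the paper. The decoding step that turns an interpolated embedding back into a record is omitted because it depends on a generative model the abstract does not specify.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
Rule = Callable[[Record], bool]


def mix_embeddings(a: Sequence[float], b: Sequence[float], lam: float) -> List[float]:
    """Return the convex combination lam * a + (1 - lam) * b.

    Assumption: a mixup-style interpolation stands in for the
    embedding-space generation step mentioned in the abstract.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


# Hypothetical domain rules for a product catalogue (not from the paper).
def has_title(record: Record) -> bool:
    """A product record needs a non-empty title."""
    return bool(record.get("title", "").strip())


def price_is_positive(record: Record) -> bool:
    """A product price must parse as a number greater than zero."""
    try:
        return float(record.get("price", "")) > 0
    except ValueError:
        return False


RULES: List[Rule] = [has_title, price_is_positive]


def filter_valid(candidates: List[Record], rules: List[Rule]) -> List[Record]:
    """Keep only the candidate records that satisfy every rule."""
    return [r for r in candidates if all(rule(r) for rule in rules)]


if __name__ == "__main__":
    mixed = mix_embeddings([0.0, 2.0], [2.0, 0.0], lam=0.5)
    print(mixed)

    # Stand-ins for records decoded from interpolated embeddings.
    decoded = [
        {"title": "USB cable", "price": "4.99"},
        {"title": "", "price": "3"},
        {"title": "Mouse", "price": "-1"},
    ]
    print(filter_valid(decoded, RULES))
```

Expressing each rule as a plain predicate keeps the validation step independent of how the candidates were generated, so rules for another domain can be substituted without touching the interpolation code.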

Napoleone, M., Console, M., Lenzerini, M., Merialdo, P., Papi, L., Poggi, A., et al. (2026). Domain-Constrained Data Augmentation for Entity Resolution. In Communications in Computer and Information Science (pp.554-566). Springer Science and Business Media Deutschland GmbH [10.1007/978-3-032-19096-3_37].

2026-01-01

Year of publication: 2026

ISBN: 9783032190956

Publication type: 4.1 Conference proceedings contribution


Use this identifier to cite or link to this document: https://hdl.handle.net/11590/548638
