In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers

Nieddu E.; Firmani D.; Merialdo P.; Maiorino M.

2021-01-01

Abstract

In Codice Ratio is a research project that studies techniques for analyzing the contents of historical documents conserved in the Vatican Apostolic Archives. In this paper, we present our efforts to develop a system that supports the automatic transcription of medieval manuscripts while keeping the training data collection effort minimal. We focus on crowdsourcing as a scalable means of collecting training data without experts: using crowdsourced character symbols, we train a custom convolutional neural network that jointly learns correct character shape identification and character recognition. Our approach generates candidate transcriptions by submitting over-segmented character strokes and their combinations to this classifier, then ranks the candidates and chooses the most promising ones by combining the recognition confidence with character-level and word-level statistical language models. We conducted experiments on an unreleased corpus, the Vatican Registers: training our system on 20 pages annotated by the crowd, we obtained a character error rate (CER) of 19%. Comparisons with an off-the-shelf system trained on 20 pages annotated with the same process, and with a professional system trained on more than 300 pages transcribed by skilled paleographers, demonstrate the opportunities of the proposed approach.
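For readers unfamiliar with the metric reported above, CER is commonly defined as the Levenshtein edit distance between the system output and the reference transcription, divided by the length of the reference. The sketch below illustrates that common definition only; the function names `edit_distance` and `cer` are hypothetical, and the abstract does not state the exact evaluation protocol used in the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, assuming the common definition:
    edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Illustrative strings only, not data from the paper:
    # one missing character out of seven in the reference.
    print(cer("dominus", "domnus"))
```

Under this definition, a CER of 19% corresponds to roughly 19 character edits per 100 reference characters. Note that the value can exceed 1 when the output is much longer than the reference.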

Year of publication: 2021

Citation: Nieddu, E., Firmani, D., Merialdo, P., Maiorino, M. (2021). In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers. INFORMATION PROCESSING & MANAGEMENT, 58(5), 102606 [10.1016/j.ipm.2021.102606].

Appears in types: 1.1 Journal article

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/388014
