Nieddu, E., Firmani, D., Merialdo, P., Maiorino, M. (2021). In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers. INFORMATION PROCESSING & MANAGEMENT, 58(5), 102606 [10.1016/j.ipm.2021.102606].
In Codice Ratio: A crowd-enabled solution for low resource machine transcription of the Vatican Registers
Nieddu E.; Firmani D.; Merialdo P.; Maiorino M.
2021-01-01
Abstract
In Codice Ratio is a research project that studies techniques for analyzing the contents of historical documents preserved in the Vatican Apostolic Archives. In this paper, we present our efforts to develop a system that supports the automatic transcription of medieval manuscripts while keeping the training data collection effort minimal. We focus on crowdsourcing as a scalable means of collecting training data without relying on experts: using crowdsourced character symbols, we train a custom convolutional neural network that jointly learns to identify correct character shapes and to recognize characters. Our approach generates candidate transcriptions by submitting over-segmented character strokes and their combinations to this classifier, then ranks and selects the most promising outputs by combining the recognition confidence with character-level and word-level statistical language models. We conducted experiments on an unreleased corpus, the Vatican Registers: training our system on 20 pages annotated by the crowd, we obtained a character error rate (CER) of 19%. Comparisons with an off-the-shelf system trained on 20 pages annotated with the same process, and with a professional system trained on more than 300 pages transcribed by skilled paleographers, demonstrate the potential of the proposed approach.
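The abstract reports its headline result as a CER. As a point of reference, the sketch below shows one common way to compute a character error rate: the Levenshtein edit distance between a reference transcription and a system output, divided by the length of the reference. This is an illustrative assumption, not the authors' evaluation code; the function names `edit_distance` and `cer` and the toy strings in the usage note are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string
    into the other (hypothetical helper, not from the paper)."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For instance, `cer("abcde", "abxde")` counts one substitution against five reference characters. Under this definition, a CER of 19% corresponds to roughly 19 character edits per 100 reference characters.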