In Codice Ratio is a research project that studies tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system that supports the transcription of medieval manuscripts. The goal is to give paleographers a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and statistical language models to compose word transcriptions. Our approach requires minimal training effort, which makes the transcription process more scalable: producing a training set requires only a few pages and can be easily crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has produced good transcriptions that paleographers can use as a solid basis to speed up the transcription process at large scale.
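
The abstract pairs a character classifier with a statistical language model to compose word transcriptions. The sketch below illustrates one common way such a combination can work: Viterbi decoding over per-segment character probabilities and a character bigram model. It is an illustrative assumption, not the authors' implementation. The function name `viterbi_decode`, the `lm_weight` parameter, the `"^"` word-start marker, and all probabilities are hypothetical.

```python
import math


def viterbi_decode(segment_probs, bigram_logp, alphabet, lm_weight=1.0):
    """Return the most likely character string for a sequence of segments.

    segment_probs: list of dicts mapping character -> classifier probability
                   (a stand-in for the output of a character classifier).
    bigram_logp:   dict mapping (previous_char, char) -> log-probability;
                   "^" marks the start of a word (hypothetical convention).
    """
    floor = math.log(1e-6)  # fallback log-score for unseen characters or bigrams

    def emit(t, c):
        p = segment_probs[t].get(c, 0.0)
        return math.log(p) if p > 0 else floor

    def trans(prev, c):
        return lm_weight * bigram_logp.get((prev, c), floor)

    # best[c] = (score, string) of the best path that ends in character c
    best = {c: (emit(0, c) + trans("^", c), c) for c in alphabet}
    for t in range(1, len(segment_probs)):
        new_best = {}
        for c in alphabet:
            score, path = max(
                (best[p][0] + trans(p, c), best[p][1]) for p in alphabet
            )
            new_best[c] = (score + emit(t, c), path + c)
        best = new_best
    return max(best.values())[1]


# Made-up numbers: the classifier slightly prefers "o" for the first segment,
# but the language model considers "ad" far more plausible than "od".
alphabet = ["a", "d", "o"]
segment_probs = [{"a": 0.45, "o": 0.55}, {"d": 0.9, "a": 0.1}]
bigram_logp = {
    ("^", "a"): math.log(0.7),
    ("^", "o"): math.log(0.2),
    ("a", "d"): math.log(0.6),
    ("o", "d"): math.log(0.1),
}

# Picking the top classifier guess per segment ignores the language model.
greedy = "".join(max(p, key=p.get) for p in segment_probs)
decoded = viterbi_decode(segment_probs, bigram_logp, alphabet)
print(greedy, decoded)  # od ad
```

In this toy case the language model overrides a low-confidence classifier decision, which is the kind of correction that matters when segmentation is imperfect.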

Firmani, D., Merialdo, P., Maiorino, M., Nieddu, E. (2018). Towards knowledge discovery from the Vatican secret archives. In codice ratio - episode 1: Machine transcription of the manuscripts. In 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Class A++ (GII-GRIN rating) (pp.263-272). Association for Computing Machinery [10.1145/3219819.3219879].

Towards knowledge discovery from the Vatican secret archives. In codice ratio - episode 1: Machine transcription of the manuscripts

Firmani Donatella; Merialdo Paolo; Maiorino Marco; Nieddu Elena
2018-01-01

2018
ISBN: 9781450355520


Use this identifier to cite or link to this document: https://hdl.handle.net/11590/338949
Citations
  • PubMed Central: n/a
  • Scopus: 15
  • Web of Science (ISI): 7