Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework

ATZENI, Paolo; POLTICELLI, Fabio
2012-01-01

Abstract

This paper describes a methodology for discovering and resolving protein name abbreviations in the full text of scientific articles. The methodology is implemented in the PRAISED framework, whose ultimate purpose is to build a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, which identifies all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which applies a number of syntactic and semantic criteria to match each abbreviation with its potential explanation; and iii) a dictionary-based protein name identification phase, which selects only those abbreviations belonging to the protein science domain. A local copy of the UniProt database serves as the source repository of known proteins. The PRAISED implementation has been tested against several well-known annotated corpora, such as the Medstract Gold Standard Corpus, the AB3P Corpus, the BioText Corpus, and the Ao and Takagi Corpus, achieving high recall and very fast performance while maintaining promising precision and overall F-measure in comparison with the most relevant similar methods. This comparison covers Phases 1 and 2 only, since those methods stop at expanding abbreviations and perform no entity recognition. The entity recognition performed in Phase 3, by contrast, gives PRAISED an effective strategy for protein discovery, taking it beyond existing context-free techniques. The implementation also addresses the complexity of full-text papers rather than the simpler abstracts more commonly used. Accordingly, the whole PRAISED process (Phases 1, 2 and 3) has also been tested against a manually annotated subset of full-text papers retrieved from the PubMed repository, again with significant results.
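The three phases can be pictured with a deliberately simplified Python sketch. It is not the PRAISED implementation: the parenthesis-based candidate detection, the initial-letter matching rule, and every name in the code (`PROTEIN_NAMES`, `find_candidates`, `resolve`, `is_protein`, `extract_protein_abbreviations`) are assumptions made for illustration, and the small in-memory set only stands in for the local UniProt copy mentioned above.

```python
import re

# Hypothetical stand-in for a local UniProt copy; entries are illustrative only.
PROTEIN_NAMES = {"tumor necrosis factor", "epidermal growth factor receptor"}


def find_candidates(text):
    """Phase 1 (sketch): short parenthesised tokens containing an uppercase letter."""
    pattern = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")
    return [
        (m.group(1), m.start())
        for m in pattern.finditer(text)
        if any(c.isupper() for c in m.group(1))
    ]


def resolve(text, abbr, pos):
    """Phase 2 (sketch): accept the words just before the parenthesis
    if their initials spell the abbreviation, one word per character."""
    words = re.findall(r"[A-Za-z0-9-]+", text[:pos])
    letters = [c.lower() for c in abbr if c.isalnum()]
    if len(letters) > len(words):
        return None
    window = words[-len(letters):]
    if all(w[0].lower() == l for w, l in zip(window, letters)):
        return " ".join(window)
    return None


def is_protein(expansion):
    """Phase 3 (sketch): exact, case-insensitive dictionary lookup."""
    return expansion.lower() in PROTEIN_NAMES


def extract_protein_abbreviations(text):
    """Run the three phases and keep only protein abbreviations."""
    result = {}
    for abbr, pos in find_candidates(text):
        expansion = resolve(text, abbr, pos)
        if expansion is not None and is_protein(expansion):
            result[abbr] = expansion
    return result


if __name__ == "__main__":
    sample = ("Tumor necrosis factor (TNF) was measured by "
              "polymerase chain reaction (PCR).")
    print(extract_protein_abbreviations(sample))
```

In this sketch both TNF and PCR pass the first two phases, but only TNF survives the dictionary filter. That mirrors the role the abstract assigns to the third phase, which restricts the resolved abbreviations to the protein domain.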
Toti, D., Atzeni, P., Polticelli, F. (2012). Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework. BIO-ALGORITHMS AND MED-SYSTEMS, 8(1), 13-52 [10.2478/bams-2012-0002].

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/148995