HIGH THROUGHPUT PROTEIN STRUCTURE PREDICTION IN A GRID ENVIRONMENT

Minervini, Giovanni; La Rocca, Giuseppe; Luisi, Pier Luigi; Polticelli, Fabio

The number of known natural protein sequences, though quite large, is vanishingly small compared with the number of proteins theoretically possible with the twenty natural amino acids. Thus, there exists a huge number of protein sequences that have never been observed in nature, the so-called "never born proteins". The study of the structural and functional properties of never born proteins is a way to improve our knowledge of the fundamental properties that make existing protein sequences so unique. Furthermore, it is of great interest to understand whether extant proteins are merely the result of contingency or rather the result of a selection process based on the peculiar physico-chemical properties of their sequences. Protein structure prediction tools, combined with large computing resources, make it possible to tackle this problem. In fact, the study of never born proteins requires the generation of a large library of protein sequences not present in nature and the prediction of their three-dimensional structures. This is not trivial when facing 10^5 to 10^7 protein sequences. Indeed, on a single CPU it would take years to predict the structures of such a large library. On the other hand, this is an embarrassingly parallel problem in which the same computation (i.e. the prediction of the three-dimensional structure of a protein sequence) must be repeated many times (i.e. on a large number of protein sequences). The use of grid infrastructures makes it feasible to approach this problem in an acceptable time frame. In this paper we describe the setup of a simulation environment within the EUChinaGRID infrastructure that allows user-friendly exploitation of grid resources for large-scale protein structure prediction.
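
The library-generation and job-splitting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sequence length of 70 residues, the uniform sampling of amino acids, and the batch size are all assumptions made for illustration, and the check that generated sequences are absent from natural databases is omitted.

```python
import random

# One-letter codes of the twenty natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequences(count, length, seed=0):
    """Return `count` random sequences of `length` residues (uniform sampling, an assumption)."""
    rng = random.Random(seed)  # seeded so that a library can be regenerated
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for _ in range(count)
    ]


def split_into_batches(sequences, batch_size):
    """Split the library into batches; each batch could be submitted as one independent job."""
    return [
        sequences[i:i + batch_size]
        for i in range(0, len(sequences), batch_size)
    ]


def sequence_space_size(length):
    """Number of distinct sequences of `length` residues: 20 ** length."""
    return len(AMINO_ACIDS) ** length


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to keep the example small.
    library = random_sequences(count=1000, length=70)
    batches = split_into_batches(library, batch_size=64)
    print(len(batches), "independent batches")
    print(f"{sequence_space_size(70):.2e} possible sequences of 70 residues")
```

Because no batch depends on the result of any other, the batches can be processed in any order and on any number of worker nodes, which is what makes the problem embarrassingly parallel.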

Minervini, G., La Rocca, G., Luisi, P.L., Polticelli, F. (2007). HIGH THROUGHPUT PROTEIN STRUCTURE PREDICTION IN A GRID ENVIRONMENT. BIO-ALGORITHMS AND MED-SYSTEMS, 3, 39-43.