The wfRosetta-MUfold Branch

Posted in WeFold3

Protein structure refinements by using MUfold to sample Rosetta decoys
Hongbo Li, School of Computer Science and Information Technology, Northeast Normal University, Changchun, 130117, China, and Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, 65211, USA, and
Dong Xu, Department of Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, Missouri, 65211, USA.

xudong@missouri.edu

A number of effective tools for protein structure prediction, such as I-TASSER, Modeller and Rosetta, have been developed. These tools often generate many models for a given target protein sequence, and there is significant room to improve prediction accuracy by better sampling, ranking and combining these models. In this work, we propose a method to refine protein structures by sampling models generated by the Rosetta server. We assemble similar models from the Rosetta server and use them as templates to generate new models with MUfold[1]. We then use a quality assessment (QA) strategy that combines a consensus method with several single-model scores to select good candidates from the new models.

Methods

For each target, more than 10,000 models were generated by the Rosetta server[2]. The quality of these models was estimated using ProQ2[3], and the top 1000 were used as the input to our method. The output is 5 refined models, produced through the following 4 steps:

Step 1. Filtering redundant models from the initial pool of 1000 models: We calculate the pairwise GDT-TS between the input models and remove redundant models with a filtering cutoff of 0.95.
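
The filtering in Step 1 can be sketched as a greedy pass over the input pool. This is a minimal illustration, not the authors' implementation: the function name filter_redundant, the greedy order (models are visited in input order, for example by ProQ2 rank), and the choice to treat a similarity of exactly 0.95 as redundant are all assumptions, and the similarity callable stands in for a real GDT-TS calculation on a 0-1 scale.

```python
from typing import Callable, List, Sequence, TypeVar

M = TypeVar("M")


def filter_redundant(models: Sequence[M],
                     similarity: Callable[[M, M], float],
                     cutoff: float = 0.95) -> List[M]:
    """Keep a model only if its similarity to every kept model is below cutoff.

    `similarity` is a hypothetical stand-in for pairwise GDT-TS (0-1 scale).
    Models are visited in input order, so earlier models take precedence.
    """
    kept: List[M] = []
    for model in models:
        if all(similarity(model, k) < cutoff for k in kept):
            kept.append(model)
    return kept


if __name__ == "__main__":
    # Toy "models" are numbers; toy similarity is 1 - |a - b|.
    toy = [0.0, 0.02, 0.5, 0.53, 0.9]
    print(filter_redundant(toy, lambda a, b: 1.0 - abs(a - b)))
```

Because the pass is greedy, the result depends on the input order; sorting the pool by estimated quality first means the better of two near-duplicates is the one retained.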

Step 2. Selecting good starting models: We use affinity propagation[4] to cluster the filtered input models. We select the representatives of up to the top 30 clusters, sorted by cluster size, as the starting models.
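
Once a clustering is available, the selection in Step 2 reduces to ranking clusters by size and taking their representatives. The sketch below assumes the clustering has already been run and has produced a cluster label per model and one exemplar index per cluster (affinity propagation yields both); the function name pick_starting_models and the tie-breaking by cluster id are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def pick_starting_models(labels: Sequence[int],
                         exemplars: Dict[int, int],
                         max_clusters: int = 30) -> List[int]:
    """Return exemplar model indices of the largest clusters, largest first.

    labels[i]    -- cluster id assigned to model i
    exemplars[c] -- index of the representative model of cluster c
    """
    sizes = Counter(labels)
    # Descending cluster size; ties broken by cluster id for determinism.
    ranked = sorted(sizes, key=lambda c: (-sizes[c], c))
    return [exemplars[c] for c in ranked[:max_clusters]]


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 1, 2]
    exemplars = {0: 1, 1: 3, 2: 5}
    print(pick_starting_models(labels, exemplars))
```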

Step 3. Selecting good candidates from the new models by MUfold-QA: For each starting model sm, we find the models that are most similar to sm and use them (including sm) as templates to build new models with MUfold and Modeller[5]. A scoring function SC in MUfold-QA is defined to measure the quality of the new models. If, according to SC, a new model nm is better than sm, then nm becomes the current sm and the procedure is repeated. The iteration ends when no new model is better than the current sm, and the candidate is the model with the highest SC score.
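
The iteration in Step 3 has the shape of a simple hill climb. The following is a sketch under stated assumptions rather than the MUfold-QA code: build_models is a hypothetical callable standing in for template selection plus model building with MUfold and Modeller, score stands in for SC, and max_rounds is a safety cap that the text does not mention.

```python
from typing import Callable, List, TypeVar

M = TypeVar("M")


def refine(start: M,
           build_models: Callable[[M], List[M]],
           score: Callable[[M], float],
           max_rounds: int = 100) -> M:
    """Replace the current model with the best new model until none improves."""
    current = start
    for _ in range(max_rounds):
        candidates = build_models(current)
        if not candidates:
            break
        best = max(candidates, key=score)
        if score(best) <= score(current):
            break  # no new model beats the current one
        current = best
    return current


if __name__ == "__main__":
    # Toy problem: "models" are integers, the score peaks at 5.
    print(refine(0, lambda x: [x + 1, x - 1], lambda x: -(x - 5) ** 2))
```

Running this once per starting model gives one candidate per cluster representative, which is the pool that the final ranking step sorts.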

Step 4. Ranking good candidates from the new models by a QA strategy: We use the SC score to sort all the candidates, and the best 5 models are the final output. The filtered input decoys are used as a reference set ref. Given a predicted model mi, the scoring function is defined as follows:

SC(mi)=3*norm(cus(mi))+norm(dfire(mi))+norm(opusca(mi))+norm(cheng(mi))

Here norm denotes min-max normalization to the range 0-1, norm(si) = (si − smin) / (smax − smin), where smax and smin are the maximum and minimum of score s over the models in ref. For the consensus score, we do not calculate the pairwise similarity between the models, because we generate thousands of models and computing all pairwise similarities is very time consuming. Instead, the consensus score is calculated using MUfold-CL[6]: for each model, we calculate the distance between every pair of Cα atoms, and these distances form a vector for that model. We first generate the vectors for all the filtered input models and then calculate their centroid vector. The consensus score of a model mi is defined by Dscore1(Di, Dc)[6], where Di is the vector of mi and Dc is the centroid vector. The scoring functions dfire, opusca and cheng are defined in [7, 8, 9], respectively.
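
The pieces of the scoring function can be sketched as follows. This is an illustration only: it builds the Cα distance vector and the centroid vector, applies min-max normalization (s − smin) / (smax − smin) against the reference set, and combines four already computed raw scores with the weights 3, 1, 1, 1. The function names are hypothetical, Dscore1 itself is not implemented because its definition is given in the MUfold-CL paper rather than here, and the sketch assumes that every raw score has been oriented so that larger means better. Values for new models that fall outside the reference range are not clamped, which is also an assumption.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]

# Weights taken from the SC formula: 3*cus + dfire + opusca + cheng.
WEIGHTS = {"cus": 3.0, "dfire": 1.0, "opusca": 1.0, "cheng": 1.0}


def distance_vector(ca: Sequence[Coord]) -> List[float]:
    """All pairwise C-alpha distances of one model, in fixed (i < j) order."""
    return [math.dist(ca[i], ca[j])
            for i in range(len(ca)) for j in range(i + 1, len(ca))]


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equal-length distance vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def min_max(value: float, ref_scores: Sequence[float]) -> float:
    """Normalize value using the min and max of the reference-set scores."""
    lo, hi = min(ref_scores), max(ref_scores)
    if hi == lo:
        return 0.0  # degenerate reference set; assumption, not from the text
    return (value - lo) / (hi - lo)


def sc(raw: Dict[str, float], ref: Dict[str, Sequence[float]]) -> float:
    """Weighted sum of normalized scores for one model.

    raw[name] -- raw score of the model for term `name`
    ref[name] -- raw scores of the reference-set models for the same term
    """
    return sum(w * min_max(raw[name], ref[name]) for name, w in WEIGHTS.items())


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
    print(distance_vector(model))
    ref = {name: [0.0, 1.0] for name in WEIGHTS}
    print(sc({name: 1.0 for name in WEIGHTS}, ref))
```

Computing one centroid over the filtered input models makes the consensus term linear in the number of new models, which is the cost saving the paragraph above describes.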

References
1. Zhang J., Wang Q., Barz B., He Z., Kosztin I., Shang Y. and Xu D. (2010). MUFOLD: A new solution for protein 3D structure prediction. Proteins. 78, 1137–1152.
2. Kim D. E., Chivian D. and Baker D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32, 526–531.
3. Ray A., Lindahl E. and Wallner B. (2012). Improved model quality assessment using ProQ2. BMC Bioinformatics. 13, 1567–1587.
4. Frey B. J. and Dueck D. (2007). Clustering by passing messages between data points. Science. 315, 972–976.
5. Sali A. and Blundell T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 234, 779–815.
6. Zhang J. and Xu D. (2013). Fast algorithm for population-based protein structural model analysis. Proteomics. 13, 221–229.
7. Zhou H. and Zhou Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11, 2714–2726.
8. Wu Y., Lu M., Chen M., Li J. and Ma J. (2007). OPUS-Ca: A knowledge-based potential function requiring only Cα positions. Protein Sci. 16, 1449–1463.
9. Wang Z., Tegge A. and Cheng J. (2009). Evaluating the absolute quality of a single protein model using structural features and support vector machines. Proteins: Struct Funct Bioinformatics. 75, 638–647.