The wfRstta-PQ-MESHI-MSC Branch


Selection of Rosetta decoys using ProQ2 and MESHI-MSC
S. Mirzaei, California State Polytechnic University, Pomona; Industrial and Manufacturing Engineering Department, USA,
T. Sidi and C. Keasar, Department of Computer Science, Ben-Gurion University of the Negev, Israel,
D. E. Kim and D. Baker, Dept. of Biochemistry, University of Washington, USA,
B. Wallner, Division of Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Linköping, Sweden, and
S.N. Crivelli, Lawrence Berkeley National Laboratory, USA

sncrivelli@lbl.gov

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings together researchers through the science gateway http://wefold.nersc.gov/ and provides computing and storage resources through the National Energy Research Scientific Computing (NERSC) center. WeFold enables interaction among groups that work on different components of the protein structure prediction pipeline. Combining these components creates hybrid protein structure prediction pipelines, each submitting its own models. Here we describe the wfRosetta-ProQ-MESHI-MSC pipeline, which applied a two-stage selection process to domain decoys generated by Robetta.

Methods

Decoy generation: Decoy sets were generated by the Robetta server [1] (see BAKER-ROSETTASERVER abstract for details). Given the target sequence, Robetta first predicts domain boundaries by identifying PDB templates with optimal sequence similarity and structural coverage to the target through an iterative process. For each iteration, HHSearch [2], Sparks [3], and RaptorX [4] are used to identify templates and generate alignments. The target sequence is threaded onto the template structures to generate partial-thread models, which are then clustered to identify distinct topologies that are ranked based on the likelihood of the alignments. Regions of the target sequence that are not covered by the partial threads or are not similar in structure within the top-ranked cluster are passed on to the next search iteration. Through this iterative process, non-overlapping clusters are identified that, together, cover the full length of the target sequence, and domain boundaries are assigned at the transitions between the clusters. The modeling difficulty of each domain is determined by the degree of structural consensus between the top-ranked partial threads from each alignment method. For each predicted domain, Robetta uses the Rosetta comparative modeling protocol, RosettaCM [5], which recombines structural elements from the clustered partial threads and models missing segments using a combination of fragment insertion and mixed torsion-Cartesian space minimization. Conformational sampling is performed using the Rosetta low-resolution score function [6] with spatial restraints that are generated separately from each cluster [7]. If enough co-evolutionary sequence data exist to accurately predict residue-residue contacts using GREMLIN [8], the clusters are re-ranked using this information, and the spatial restraints are supplemented with the predicted contacts.
For difficult domains, models are also generated using the Rosetta fragment assembly methodology [6] (RosettaAbinitio), and if GREMLIN contacts are predicted, they are used as restraints for sampling and refinement. All models are refined using a relax protocol [9] that minimizes the Rosetta full-atom energy [10] in torsion and Cartesian space to allow bond angle flexibility. Up to four RosettaCM decoy sets were generated for each domain, and all RosettaCM models were provided. Two RosettaAbinitio decoy sets were provided for each difficult domain: the top 5% of models by all-atom Rosetta score and the top 15% of models by contact order [11]. Large-scale sampling was possible through Rosetta@home.
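As a rough illustration of the contact order measure [11] used to pick one of the RosettaAbinitio decoy sets, the sketch below computes relative contact order as the mean sequence separation of contacting residue pairs divided by the chain length. It is not the Rosetta implementation; the function name, the representation of contacts as residue index pairs, and the toy input are assumptions made for the example, and how contacts are detected (distance cutoff, atom types) is left out.

```python
from typing import Iterable, Tuple


def relative_contact_order(contacts: Iterable[Tuple[int, int]], n_residues: int) -> float:
    """Relative contact order: (1 / (L * N)) * sum of |i - j| over N contacts.

    contacts   -- residue index pairs (i, j) that are in contact (hypothetical input)
    n_residues -- chain length L
    """
    separations = [abs(i - j) for i, j in contacts]
    if not separations or n_residues <= 0:
        return 0.0  # no contacts or empty chain: define the value as zero
    return sum(separations) / (len(separations) * n_residues)


# Toy example (made-up data): a 10-residue chain with three contacts.
toy_contacts = [(1, 4), (2, 9), (3, 5)]
toy_rco = relative_contact_order(toy_contacts, 10)
```

Here the separations are 3, 7 and 2, so the toy value is 12 / (3 x 10) = 0.4.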

Filtering: The accuracy of all generated domain models was estimated using ProQ2 [12], which was recently implemented as a scoring function in Rosetta [13], and the top 1,000 models for each domain were selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.
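The filtering step amounts to a per-domain top-N selection on a quality score. A minimal sketch follows, assuming the scores are already available as (domain, model, score) records and that a higher score means a better predicted accuracy; the record layout, function name, and identifiers are hypothetical and do not reflect the actual WeFold file formats or the ProQ2 interface in Rosetta.

```python
import heapq
from collections import defaultdict


def top_n_per_domain(records, n=1000):
    """Return, for each domain, the ids of the n highest-scoring models.

    records -- iterable of (domain_id, model_id, score) tuples (assumed layout);
               higher score is assumed to mean higher predicted accuracy.
    """
    by_domain = defaultdict(list)
    for domain_id, model_id, score in records:
        by_domain[domain_id].append((score, model_id))
    # heapq.nlargest avoids fully sorting very large decoy sets.
    return {
        domain_id: [model_id for _, model_id in heapq.nlargest(n, scored)]
        for domain_id, scored in by_domain.items()
    }


# Toy example with made-up identifiers and scores.
toy_records = [
    ("domA", "a", 0.61), ("domA", "b", 0.75), ("domA", "c", 0.42),
    ("domB", "d", 0.30), ("domB", "e", 0.55),
]
toy_selection = top_n_per_domain(toy_records, n=2)
```

With n=2, the toy input keeps models b and a for domA and models e and d for domB, in descending score order.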

Feature extraction: The top 1,000 comparative modeling models and the top 1,000 ab-initio models (the latter, when available) of each predicted domain were downloaded from the WeFold server. To standardize the models before computing MESHI features [15,16], they were subjected to SCWRL4 [14] rotamer optimization followed by MESHI energy minimization. 106 structural features were then extracted from each decoy and uploaded to the WeFold server. These features are detailed in the MESHI_SERVER abstract.

Final selection: A scoring function based on a Support Vector Machine (SVM) was used for final selection [16]. In addition, backward feature elimination was implemented: starting from the full set of MESHI features, we removed one feature at a time and examined the effect on the result, repeating the process until no further improvement was observed. For CASP12 the objective was loss minimization. The SVM was evaluated with 10-fold cross-validation, and a grid search was used to fine-tune its parameters around the values suggested in [16].
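The backward elimination loop can be sketched generically as a greedy search over feature subsets. The version below is an illustration under stated assumptions, not the code used in CASP12: `evaluate` stands in for whatever returns the loss of a model trained on a given feature subset (in the pipeline this would be the cross-validated loss of the SVM), and the toy loss function and feature names are invented so that the example runs on its own.

```python
def backward_elimination(features, evaluate):
    """Greedy backward feature elimination.

    features -- list of feature names to start from
    evaluate -- callable mapping a feature list to a loss (lower is better);
                a placeholder for, e.g., a 10-fold cross-validated SVM loss
    Repeatedly drops the single feature whose removal lowers the loss most,
    and stops when no single removal improves on the current loss.
    """
    current = list(features)
    best_loss = evaluate(current)
    while len(current) > 1:
        trials = [
            (evaluate([f for f in current if f != dropped]), dropped)
            for dropped in current
        ]
        loss, dropped = min(trials)
        if loss >= best_loss:
            break  # no further improvement
        best_loss = loss
        current.remove(dropped)
    return current, best_loss


def toy_loss(selected):
    """Made-up loss: missing informative features cost 1.0, noise features cost 0.1."""
    informative = {"f1", "f2"}
    missing = len(informative - set(selected))
    noise = len(set(selected) - informative)
    return 1.0 * missing + 0.1 * noise


kept, final_loss = backward_elimination(["f1", "f2", "n1", "n2"], toy_loss)
```

On the toy loss, the two noise features are removed one after the other and the search stops with f1 and f2 retained. Passing the evaluation as a callable keeps the search independent of the learner, so parameter tuning such as a grid search could be wrapped inside `evaluate` without changing the loop.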

Availability: The MESHI software package (version 9.29, which was used in CASP12) is available at:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21 (7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). RaptorX: Exploiting structure information for protein alignment by statistical inference. Proteins 79, 161-171.
5. Song,Y. et al. (2013). High-resolution comparative modeling with RosettaCM. Structure 21 (10), 1735-1742.
6. Leaver-Fay,A. et al. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39) 15674-15679.
9. Conway,P. et al. (2014). Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1), 47-55.
10. Tyka,M.D. et al. (2011). Alternate states of proteins revealed by detailed energy landscape mapping. J. Mol. Biol. 405, 607-618.
11. Plaxco,K.W., Simons,K.T. & Baker,D. (1998). Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277 (4), 985-994.
12. Ray,A., Lindahl,E. & Wallner,B. (2012). Improved model quality assessment using ProQ2. BMC Bioinformatics 13, 224.
13. Uziela,K. & Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics 32 (9), 1411-1413.
14. Krivov,G. et al. (2009). Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77, 778-795.
15. Kalisman,N. et al. (2005). MESHI: a new library of Java classes for molecular modeling. Bioinformatics 21, 3931-3932.
16. Mirzaei,S. et al. (2016). Purely Structural Protein Scoring Functions Using Support Vector Machine and Ensemble Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics, in press.