The wfRosetta-ProQ-MESHI Branch

WeFold – the wfRosetta-ProQ-MESHI pipeline
T. Sidi and C. Keasar, Department of Computer Science, Ben-Gurion University of the Negev, Israel,
D. E. Kim and D. Baker, Institute for Protein Design, University of Washington, USA,
B. Wallner, Division of Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Sweden, and
S.N. Crivelli, Lawrence Berkeley National Laboratory, USA

chen@cs.bgu.ac.il

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings together researchers through the science gateway http://wefold.nersc.gov/ and provides computing and storage resources through the National Energy Research Scientific Computing Center. WeFold enables interaction among groups that work on different components of the protein structure prediction pipeline. Combining these components creates hybrid protein structure prediction pipelines, each submitting its own models. The collaboration aims to promote a synergistic effect among the participants and ultimately produce better results than those achieved by the individual methods. In its third round, the collaboration resulted in 12 different pipelines. Here we describe the wfRosetta-ProQ-MESHI pipeline, which applied a two-stage selection process to domain decoys generated by Robetta.

Methods

Decoy generation: Decoy sets were generated from the Robetta server [1] (see BAKER-ROSETTASERVER abstract for details). Given the target sequence, Robetta first predicts domain boundaries by identifying PDB templates with optimal sequence similarity and structural coverage to the target through an iterative process. For each iteration, HHSearch [2], Sparks [3], and RaptorX [4] are used to identify templates and generate alignments. The target sequence is threaded onto the template structures to generate partial-threaded models, which are then clustered to identify distinct topologies that are ranked based on the likelihood of the alignments. Regions of the target sequence that are not covered by the partial threads or are not similar in structure within the top-ranked cluster are passed on to the next search iteration. Through this iterative process, non-overlapping clusters are identified that, together, cover the full length of the target sequence, and domain boundaries are assigned at the transitions between the clusters. The modeling difficulty of each domain is determined by the degree of structural consensus between the top-ranked partial threads from each alignment method. For each predicted domain, Robetta uses the Rosetta comparative modeling protocol, RosettaCM [5], which recombines structural elements from the clustered partial threads and models missing segments using a combination of fragment insertion and mixed torsion-Cartesian space minimization. Conformational sampling is performed using the Rosetta low-resolution score function [6] with spatial restraints that are generated separately from each cluster [7]. If enough co-evolutionary sequence data exists to accurately predict residue-residue contacts using GREMLIN [8], the clusters are re-ranked using this information, and the spatial restraints are supplemented with the predicted contacts.
For difficult domains, models are also generated using the Rosetta fragment assembly methodology [6] (RosettaAbinitio), and if GREMLIN contacts are predicted, they are used as restraints for sampling and refinement. All models are refined using a relax protocol [9] that minimizes the Rosetta full-atom energy [10] in torsion and Cartesian space to allow bond angle flexibility. Up to four RosettaCM decoy sets were generated for each domain, depending on the number of template clusters, and all RosettaCM models were provided. Two RosettaAbinitio decoy sets were provided for each difficult domain: the top 5 percent of models by all-atom Rosetta score and the top 15 percent of models by contact order [11]. Large-scale sampling of around 10,000 to 300,000 models was possible through the use of the distributed computing project Rosetta@home.
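
For readers unfamiliar with contact order [11], the following is a minimal Python sketch of relative contact order: the average sequence separation between contacting residues, normalized by chain length. The function name, the representation of contacts as residue-index pairs, and the choice to return 0.0 for a model without contacts are assumptions made for illustration. How contacts are detected (for example, the distance cutoff) is left to the caller, and this is not the code used by Robetta.

```python
def relative_contact_order(contacts, length):
    """Relative contact order of one model.

    contacts: iterable of (i, j) residue-index pairs that are in contact
              (hypothetical input format; contact detection is not shown).
    length:   number of residues in the chain.
    """
    contacts = list(contacts)
    if length <= 0:
        raise ValueError("length must be positive")
    if not contacts:
        # Assumption: a model with no contacts gets contact order 0.0.
        return 0.0
    # Sum of sequence separations over all contacting pairs.
    total_separation = sum(abs(i - j) for i, j in contacts)
    # Normalize by the number of contacts and by the chain length.
    return total_separation / (len(contacts) * length)


if __name__ == "__main__":
    # Two contacts with separations 4 and 8 in a 20-residue chain.
    print(relative_contact_order([(1, 5), (2, 10)], 20))
```

A decoy set could then be ranked by this value, but whether higher or lower contact order was preferred in the selection is not stated here.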

Filtering: The accuracy of all generated domain models was estimated using ProQ2 [12], which was recently implemented as a scoring function in Rosetta [13], and the top 1,000 models for each domain were selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.
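
The filtering step amounts to a per-domain top-N selection. The Python sketch below illustrates the idea under stated assumptions: the record layout (domain, model identifier, score), the function name, and the convention that a higher score means a better predicted model are all hypothetical, and computing the ProQ2 score itself is not shown.

```python
import heapq
from collections import defaultdict


def top_models_per_domain(scored_models, n=1000):
    """Keep the n highest-scoring models for each domain.

    scored_models: iterable of (domain, model_id, score) tuples
                   (hypothetical layout; higher score assumed better).
    Returns a dict mapping each domain to model ids, best first.
    """
    by_domain = defaultdict(list)
    for domain, model_id, score in scored_models:
        by_domain[domain].append((score, model_id))
    # heapq.nlargest avoids fully sorting very large decoy sets.
    return {
        domain: [model_id for _, model_id in heapq.nlargest(n, items)]
        for domain, items in by_domain.items()
    }


if __name__ == "__main__":
    scored = [
        ("d1", "a", 0.2),
        ("d1", "b", 0.9),
        ("d1", "c", 0.5),
        ("d2", "x", 0.1),
    ]
    print(top_models_per_domain(scored, n=2))
```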

Final selection and model submission: The top 1,000 comparative modeling models and, when available, the top 1,000 ab initio models for each predicted domain were downloaded from the WeFold server and fed to the MESHI_SERVER protocol, which is described in the MESHI_SERVER abstract. In a nutshell, it first standardizes the decoys by SCWRL4 [14] rotamer optimization followed by energy minimization. The protocol then extracts 106 structural features from each decoy and feeds them to MESHI_Score, an ensemble of a thousand independent predictors. Each of these predictors is trained to predict decoy quality using a unique subset of the features. The final score is the weighted median of the thousand individual scores. For each domain, we submitted the five standardized decoys with the highest MESHI_Score.
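
The aggregation of the individual predictor scores can be illustrated with a short Python sketch of a weighted median. This is only an illustration, not the MESHI implementation: the function name is hypothetical, the weights are assumed to be positive, and the lower weighted median is returned when the cumulative weight lands exactly on the halfway point. How MESHI_Score assigns its weights is not covered here.

```python
def weighted_median(scores, weights):
    """Lower weighted median of scores.

    Returns the smallest score at which the cumulative weight reaches
    half of the total weight. Weights are assumed to be positive.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    half = sum(weights) / 2.0
    cumulative = 0.0
    # Walk through the scores in ascending order, accumulating weight.
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= half:
            return score
    # Unreachable for positive weights; kept as a defensive fallback.
    return max(scores)


if __name__ == "__main__":
    # Equal weights reduce to the ordinary median.
    print(weighted_median([0.1, 0.5, 0.9], [1, 1, 1]))
    # A heavily weighted predictor pulls the median toward its score.
    print(weighted_median([0.1, 0.5, 0.9], [1, 1, 5]))
```

Compared with a weighted mean, a median of this kind is less sensitive to a few predictors that return extreme scores for a given decoy.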

Availability: The MESHI software package (version 9.29, which was used in CASP12) is available at:
https://www.dropbox.com/sh/mb31bjdvvydhuzh/AADVcclTZKtFiSl6I9hBx8Dxa?dl=0

1. Kim,D.E., Chivian,D., & Baker,D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21 (7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). RaptorX: Exploiting structure information for protein alignment by statistical inference. Proteins 79, 161-171.
5. Song,Y. et al. (2013). High-resolution comparative modeling with RosettaCM. Structure 21 (10), 1735-1742.
6. Leaver-Fay,A. et al. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39) 15674-15679.
9. Conway,P., Tyka,M.D., DiMaio,F., Konerding, D.E. & Baker,D. (2014). Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1), 47-55.
10. Tyka,M.D. et al. (2011). Alternate states of proteins revealed by detailed energy landscape mapping. J. Mol. Biol. 405, 607–618.
11. Plaxco,K.W., Simons,K.T., Baker,D. (1998). Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277 (4) 985–994.
12. Ray,A., Lindahl,E., Wallner,B. (2012). Improved model quality assessment using ProQ2. BMC Bioinformatics 13, 224.
13. Uziela,K., Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics 32 (9), 1411-1413.
14. Krivov,G.G. et al. (2009). Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77, 778–795.