The wfDB_BW_SVGroup Branch


Assessing Protein Structure Models using Protein Structure Networks [PSN-QA]
B.K. Dhanasekaran, Sambit Ghosh, Saraswathi Vishveshwara, Molecular Biophysics Unit, Indian Institute of Science, Bangalore-560012, India, and the WeFold Community

WeFold is an open collaboration initiative for protein structure prediction within CASP. It brings together labs and individuals through the science gateway and provides computing and storage resources through the National Energy Research Scientific Computing (NERSC) center. WeFold enables interaction among groups that work on different components of the protein structure prediction pipeline, making it possible to leverage expertise on a scale that has not been attempted before. Combining these components creates hybrid protein structure prediction pipelines, each of which submits its own models. The collaboration aims to promote a synergistic effect among the participants and ultimately to produce better results than those achieved by the individual methods. In its third round, the collaboration resulted in 12 different pipelines. Here we describe the wfDB_BW_SVGroup pipeline, which combines the David Baker group from the University of Washington, Seattle, WA, United States of America; the Björn Wallner group from Linköping University, Linköping, Sweden; and SVGroup from the Indian Institute of Science, Bangalore, India.


Decoy generation:
Decoy sets were generated by the fully automated structure prediction server, Robetta [1] (see BAKER-ROSETTASERVER abstract for details). Given the target sequence, Robetta first predicts domain boundaries by identifying PDB templates with optimal sequence similarity and structural coverage to the target through an iterative process. In each iteration, HHSearch [2], Sparks [3], and RaptorX [4] are used to identify templates and generate alignments. The target sequence is threaded onto the template structures to generate partial-threaded models, which are then clustered to identify distinct topologies that are ranked by the likelihood of the alignments. Regions of the target sequence that are not covered by the partial threads, or that are not structurally similar within the top-ranked cluster, are passed on to the next search iteration. Through this iterative process, non-overlapping clusters are identified that together cover the full length of the target sequence, and domain boundaries are assigned at the transitions between the clusters. The modeling difficulty of each domain is determined by the degree of structural consensus between the top-ranked partial threads from each alignment method. For each predicted domain, Robetta uses the Rosetta comparative modeling protocol, RosettaCM [5], which recombines structural elements from the clustered partial threads and models missing segments using a combination of fragment insertion and mixed torsion-Cartesian space minimization. Conformational sampling is performed using the Rosetta low-resolution score function [6] with spatial restraints that are generated separately from each cluster [7]. If enough co-evolutionary sequence data exists to accurately predict residue-residue contacts using GREMLIN [8], the clusters are re-ranked using this information, and the spatial restraints are supplemented with the predicted contacts.
For difficult domains, models are also generated using the Rosetta fragment assembly methodology [6] (RosettaAbinitio), and if GREMLIN contacts are predicted, they are used as restraints for sampling and refinement. All models are refined using a relax protocol [9] that minimizes the Rosetta full-atom energy [10] in torsion and Cartesian space to allow bond angle flexibility. Up to four RosettaCM decoy sets were generated for each domain, depending on the number of template clusters, and all RosettaCM models were provided. Two RosettaAbinitio decoy sets were provided for each difficult domain: the top 5 percent of models by all-atom Rosetta score and the top 15 percent of models by contact order [11]. Large-scale sampling of around 10,000 to 300,000 models was possible through the use of the distributed computing project Rosetta@home.
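As an illustration of the contact order measure used to select RosettaAbinitio decoys, the following is a minimal sketch of relative contact order in the sense of [11]: the average sequence separation of contacting residues, normalized by chain length. It assumes residue-level contacts supplied as index pairs, a simplification of the atom-level definition. The function name and input format are hypothetical and are not taken from Rosetta.

```python
def relative_contact_order(contacts, n_residues):
    """Relative contact order of a model.

    contacts   : list of (i, j) residue-index pairs that are in contact
                 (assumed input; how contacts are detected is not shown here)
    n_residues : total number of residues in the chain
    """
    if not contacts or n_residues <= 0:
        return 0.0
    # Sum of sequence separations |i - j| over all contacts
    total_separation = sum(abs(i - j) for i, j in contacts)
    # Normalize by the number of contacts and by the chain length
    return total_separation / (len(contacts) * n_residues)


# Three contacts in a 10-residue chain: separations 4, 8 and 1
example = relative_contact_order([(1, 5), (2, 10), (3, 4)], n_residues=10)
```

Mostly local contacts give a low value, while long-range contacts, as in more complex topologies, give a higher one.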
The accuracy of all generated domain models was estimated using ProQ2 [12], which was recently implemented as a scoring function in Rosetta [13], and the top 1,000 models for each domain were selected. In total, 32,474,636 domain models were scored using ProQ2 in CASP12.
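The per-domain selection step can be sketched as follows. This is only an illustration, assuming that each model comes as a (domain, model, score) record and that a higher ProQ2 score means a better predicted accuracy. The function name and record layout are hypothetical and do not reflect the actual pipeline scripts.

```python
import heapq
from collections import defaultdict


def select_top_models(scored_models, n_top=1000):
    """Keep the n_top highest-scoring models for each domain.

    scored_models : iterable of (domain_id, model_id, score) tuples
                    (assumed layout; higher score = better model)
    Returns a dict mapping domain_id to model_ids, best first.
    """
    by_domain = defaultdict(list)
    for domain_id, model_id, score in scored_models:
        by_domain[domain_id].append((score, model_id))
    return {
        domain_id: [model_id for _, model_id in heapq.nlargest(n_top, entries)]
        for domain_id, entries in by_domain.items()
    }


# Toy input with two domains; keep the best two models per domain
toy = [("d1", "a", 0.5), ("d1", "b", 0.9), ("d1", "c", 0.7), ("d2", "x", 0.1)]
selected = select_top_models(toy, n_top=2)
```

`heapq.nlargest` avoids sorting every domain's full list, which matters when each domain has tens or hundreds of thousands of decoys.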
The Protein Structure Network Quality Assessment tool (PSN-QA) was used to rank the final set of models. PSN-QA is based on a graph-theoretical approach in which a protein structure is treated as a network: the amino acids are the nodes, and edges are constructed between nodes based on the strength of the non-covalent interaction [14-15] between their side-chain atoms. The interaction strength (Iij) is based on the number of interacting atoms between a pair of amino acids. Network properties such as the size of the largest cluster (SLClu), the largest K-2 communities (ComSk2) and the clustering coefficients capture the general pattern exhibited by native proteins. Notably, native protein structures consistently display a steeper transition profile as a function of the interaction strength cut-off (Imin) than modeled structures do (Fig1).
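To make the idea of a transition profile concrete, here is a minimal sketch that computes the size of the largest cluster at several Imin cut-offs. It assumes the pairwise interaction strengths Iij have already been calculated and are supplied as a dictionary, and that an edge is kept when Iij is greater than or equal to Imin. Both the input format and the function names are hypothetical, and this is not the PSN-QA implementation.

```python
def largest_cluster_size(n_residues, strengths, i_min):
    """Size of the largest connected cluster at one cut-off.

    n_residues : number of residues (nodes), indexed 0..n_residues-1
    strengths  : dict mapping (i, j) residue pairs to interaction strength Iij
    i_min      : cut-off; an edge is kept if Iij >= i_min (assumed convention)
    """
    adjacency = {}
    for (i, j), strength in strengths.items():
        if strength >= i_min:
            adjacency.setdefault(i, set()).add(j)
            adjacency.setdefault(j, set()).add(i)

    seen = set()
    largest = 0
    for start in range(n_residues):
        if start in seen:
            continue
        # Depth-first traversal of the component containing `start`
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        largest = max(largest, size)
    return largest


def transition_profile(n_residues, strengths, cutoffs):
    """Largest-cluster size at each cut-off, as (i_min, size) pairs."""
    return [(c, largest_cluster_size(n_residues, strengths, c)) for c in cutoffs]


# Toy network of five residues with three side-chain interactions
toy_strengths = {(0, 1): 5.0, (1, 2): 3.0, (3, 4): 1.0}
profile = transition_profile(5, toy_strengths, cutoffs=[0, 2, 4, 6])
```

As the cut-off rises, weaker edges drop out and the largest cluster shrinks. How sharply it shrinks is the transition behaviour that the text contrasts between native and modeled structures.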

Fig1: Transition profile of network parameters for native and decoy structures.
This transition profile is a characteristic property of native protein structures and can be used to distinguish good models, which show native-like properties, from non-native-like structures. Across different Imin values, these parameters, together with the main-chain hydrogen bonds, total 94 parameters for a single protein structure. An SVM model is then trained on these 94 parameters, with 5422 native structures from the PDB forming the positive set and 29543 decoy models from various sources forming the negative set [16-17]. The tool classifies a given protein structure model as good or bad and also provides a score for it. The scores lie between -30 and 20 and imply the following:
<14: Bad models.
14-16: Transition zone. Models classified either as good or bad.
16-20: Good models. Models show native-like properties.
A group of very similar structures can be ranked based on their scores. PSN-QA was used to obtain the ranks of all group and server models in the regular category from the CASP12 website. It was also used to classify and rank the 1000 models produced by the wfDB_BW_SVGroup pipeline.

Results.
Table 1 shows the average and standard deviation of the scores of the top five models and of the remaining 995 models for four example targets released in CASP12. Fig 2 shows the best and last ranked decoys of target T0879 superimposed on its native protein structure. It is clear from the figure that the model with the low score (Purple) does not have a well-defined secondary structure. A top-ranked model (Cyan) with a score in the range of 18-20 shows well-folded secondary structure and side-chain orientations suitable for capturing global connectivity.

Fig 2: Superimposed images of the first and last model of target T0879 with PDB structure 5JMU. (PDB structure: Green, First Model: Cyan, Last Model: Purple)

Target   Top 5 average   Top 5 std. dev.   Remaining 995 average   Remaining 995 std. dev.
T0879    18.6488         0.05332511        17.2918                 0.4775453
T0891    18.5991         0.33084655        16.4774                 0.6303654
T0900    18.2128         0.10767417        16.1957                 0.7295866
T0944    18.2893         0.04707691        17.0408                 0.4282697

Table 1: Average and standard deviation of the scores for four example targets.

WeFold provided us with a platform to be part of a prediction-assessment pipeline for the CASP12 regular category. The PSN-QA tool was used to assess the models generated by members of the pipeline. Forty-six regular targets with 1000 models each were studied, and the top five ranked models that were classified as native-like protein structures were submitted to CASP12.

Availability.
PSN-QA is available at https://

1. Kim,D.E., Chivian,D. & Baker,D. (2004). Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 32, W526-W531.
2. Söding,J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics 21 (7), 951-960.
3. Yang,Y., Faraggi,E., Zhao,H. & Zhou,Y. (2011). Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics 27 (15), 2076-2082.
4. Peng,J. & Xu,J. (2011). Raptorx: Exploiting structure information for protein alignment by statistical inference. Proteins 79, 161-171.
5. Song,Y., DiMaio,F., Wang,R., Kim,D., Miles,C., Brunette,T., Thompson,J. & Baker,D. (2013). High-resolution comparative modeling with RosettaCM. Structure 21 (10), 1735-1742.
6. Leaver-Fay,A., Tyka,M., Lewis,S., Lange,O.F., Thompson,J., Jacak,R., Kaufman,K., Renfrew,P.D., Smith,C., Sheffler,W., Davis,I., Cooper,S., Treuille,A., Mandell,D., Richter,F., Ban,Y.A., Fleishman,S., Corn,J., Kim,D.E., Lyskov,S., Berrondo,M., Mentzer,S., Popović,Z., Havranek,J., Karanicolas,J., Das,R., Meiler,J., Kortemme,T., Gray,J.J., Kuhlman,B., Baker,D. & Bradley,P. (2010). ROSETTA3.0: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules. Methods in Enzymology 487, 545-574.
7. Thompson,J. & Baker,D. (2011). Incorporation of evolutionary information into Rosetta comparative modeling. Proteins 79 (8), 2380-2388.
8. Kamisetty,H., Ovchinnikov,S. & Baker,D. (2013). Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. PNAS 110 (39), 15674-15679.
9. Conway,P., Tyka,M.D., DiMaio,F., Konerding,D.E. & Baker,D. (2014). Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23 (1), 47-55.
10. Tyka,M.D., Keedy,D.A., André,I., DiMaio,F., Song,Y., Richardson,D.C., Richardson,J.S. & Baker,D. (2011). Alternate states of proteins revealed by detailed energy landscape mapping. J. Mol. Biol. 405, 607-618.
11. Plaxco,K.W., Simons,K.T. & Baker,D. (1998). Contact order, transition state placement and the refolding rates of single domain proteins. J. Mol. Biol. 277 (4), 985-994.
12. Ray,A., Lindahl,E. & Wallner,B. (2012). Improved model quality assessment using ProQ2. BMC Bioinformatics 13, 224.
13. Uziela,K. & Wallner,B. (2016). ProQ2: estimation of model accuracy implemented in Rosetta. Bioinformatics 32 (9), 1411-1413.
14. Kannan,N. & Vishveshwara,S. (1999). Identification of side-chain clusters in protein structures by a graph spectral method. J. Mol. Biol. 292, 441-464.
15. Brinda,K. & Vishveshwara,S. (2005). A network representation of protein structures: implications for protein stability. Biophys. J. 89 (6), 4159-4170.
16. Chatterjee,S., Ghosh,S. & Vishveshwara,S. (2013). Network properties of decoys and CASP predicted models: A comparison with native protein structures. Mol. Biosyst. 9 (7), 1774-1788.
17. Ghosh,S. & Vishveshwara,S. (2014). Ranking the quality of protein structure models using sidechain based network properties [v1; ref status: indexed]. F1000Research 3:17 (doi:10.12688/f1000research.3-17.