Machine Learning Forum

Accurately predicting a protein's 3D structure (its native structure) from its amino acid sequence is key to new biological and medical discoveries. Current computational methods can generate hundreds of thousands of models. However, selecting the models that most closely match the native structure remains an open problem and one of the biggest challenges in the field.

The ML challenge: By observing experimentally determined proteins, Chen Keasar has listed 42 features that describe a good protein model. Each feature represents a measurable aspect of a protein model. We invite the machine learning community to use those features to build scoring functions that select the best models out of the hundreds of thousands that our computational methods generate.

Chen has created training and test sets of models (CASP8 and CASP9 server models, respectively). He also calculated all the features for each model and added them to the datasets. The training and test sets can be organized as a table with two levels of rows (see table below): the first level is the target ID and the second is the model ID. Models are created for a specific target, so the rows fall into per-target groups. The columns of the table are the features (numerical values) and the GDT-TS value, which is the objective function that measures the accuracy of each model. According to Chen, other important points to keep in mind include:
1) Performance should be measured at the target level rather than at the model level.
2) Targets may have different numbers of models. Those with fewer models are not less important.
3) There is no point in defining a per-target scoring function because our goal is to predict new structures, not ones from the training set.
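
The three points above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not Chen's method: the row layout (target ID, model ID, feature list, GDT-TS), the function names, and the metric, which here is the GDT-TS gap between the best model of a target and the model the scoring function ranks first. The gap is averaged over targets so that every target counts equally regardless of how many models it has, and whole targets are held out so that a scoring function is always judged on targets it has not seen.

```python
import random
from collections import defaultdict
from statistics import mean

# Hypothetical row layout: (target_id, model_id, features, gdt_ts).
# The real datasets ship as MATLAB classes or XML files; this is a toy stand-in.
toy_rows = [
    ("T1", "1.1", [0.9], 80.0),
    ("T1", "1.2", [0.2], 60.0),
    ("T1", "1.3", [0.5], 70.0),
    ("T2", "2.1", [0.1], 50.0),
    ("T2", "2.2", [0.8], 40.0),
]


def group_by_target(rows):
    """Collect model rows under their target ID (the first row level)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return dict(groups)


def split_by_target(rows, test_fraction=0.2, seed=0):
    """Hold out whole targets so no target appears on both sides of the split."""
    targets = sorted(group_by_target(rows))
    random.Random(seed).shuffle(targets)
    n_test = max(1, round(len(targets) * test_fraction))
    held_out = set(targets[:n_test])
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test


def mean_top1_loss(rows, score):
    """Average over targets of (best GDT-TS) minus (GDT-TS of the top-scored model)."""
    losses = []
    for models in group_by_target(rows).values():
        picked = max(models, key=lambda r: score(r[2]))
        best = max(r[3] for r in models)
        losses.append(best - picked[3])
    # One loss per target, so targets with few models weigh as much as large ones.
    return mean(losses)


def toy_score(features):
    """Placeholder scoring function: just returns the first feature."""
    return features[0]


train_rows, test_rows = split_by_target(toy_rows, test_fraction=0.5)
loss = mean_top1_loss(toy_rows, toy_score)
```

In this toy data the placeholder score picks the best model for T1 but not for T2, and the two per-target losses are averaged with equal weight even though T1 has more models. Any scoring function that maps a feature list to a number can be passed in place of `toy_score`.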

For CASP10 and CASP ROLL, Chen developed a new set of scoring functions using machine learning techniques. These functions did very well in both experiments and we hope that the ML community can help us develop even better scoring functions.

Chen has generously shared these datasets with us, and they are available through the WeFold downloads page. Each set has about 70,000 models corresponding to 150 targets. The sets are available in two formats: 1) a tar file containing two MATLAB classes, data files, and a Readme file, and 2) a tar file containing a set of folders (one per target), each containing a group of XML files (one per model).

In addition, we plan to curate the almost 9 million models generated by the WeFold1 community and create a database for the protein scoring problem. The new database will contain the models as well as the features calculated for each model. More information about this project will be posted soon.

Protein Model     Feature 1    Feature 2    ...    Feature 42    GDT-TS
Target 1
    Model 1.1
    Model 1.2
    ...
    Model 1.i
Target 2
    Model 2.1
    Model 2.2
    ...
    Model 2.j
...
Target n
    Model n.1
    Model n.2
    ...
    Model n.k