Covering complete proteomes with X-ray structures: a current snapshot

The current and the attainable coverage by X-ray structures of proteins and their functions on the scale of the ‘protein universe’ are estimated. A detailed analysis of the coverage across nearly 2000 proteomes from all superkingdoms of life and functional annotations is performed, with particular focus on the human proteome and the family of GPCR proteins.

(3 predictors/algorithms × 2 predictions/assignments per predictor/algorithm × 4 segment sizes = 24 features)

− PRSegmentComposition_{IUpredL, IUPredS, Complexity}_{0, 1}_{1-5, 6-10, 11-15, >15}: count of the number of AAs in the input protein sequence that are in short (1-5 residues)/medium (6-10 residues)/long (11-15 residues)/very long (over 15 residues) segments for each binary prediction/complexity value {0, 1} of each predictor/algorithm {IUpredL, IUPredS, Complexity}. These counts were normalized by the length of the protein.

… and not with each other. These features were selected empirically using the Training data set. Three of these features are based on amino acid compositions, another three are based on free energy terms, and two are based on hydrophobicity indices. The remaining three correspond to the instability index, the distance of the first amino acid of medium polarizability from the N-terminus, and the fraction of long segments (15 AAs or longer) that are characterized by high amino acid complexity.
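As an illustration of how one group of the PRSegmentComposition features described above could be computed, the Python sketch below takes a per-residue binary string (such as might be derived from IUPredL, IUPredS, or a complexity assignment) and returns the eight length-normalized counts for that one predictor. The function name, the string encoding of the input, and the treatment of a segment as a maximal run of identical labels are assumptions made for illustration; the text does not specify them.

```python
from itertools import groupby

# Segment-size bins named in the feature description: 1-5, 6-10, 11-15, >15.
BINS = [("1-5", 1, 5), ("6-10", 6, 10), ("11-15", 11, 15), (">15", 16, None)]


def segment_composition(labels):
    """Return {(value, bin_name): fraction of residues} for a binary label string.

    `labels` holds one binary call per residue ('0' or '1'). A segment is
    assumed to be a maximal run of identical labels (illustrative assumption).
    """
    n = len(labels)
    counts = {(value, name): 0 for value in "01" for name, _, _ in BINS}
    for value, run in groupby(labels):
        run_len = len(list(run))
        for name, lo, hi in BINS:
            if run_len >= lo and (hi is None or run_len <= hi):
                counts[(value, name)] += run_len  # count residues, not segments
                break
    if n == 0:
        return {key: 0.0 for key in counts}
    # Normalize by protein length, as stated in the text.
    return {key: count / n for key, count in counts.items()}


# Hypothetical 30-residue label string: 3 x '0', 7 x '1', 20 x '0'.
features = segment_composition("000" + "1" * 7 + "0" * 20)
```

Because every residue falls into exactly one run, the eight fractions for a given predictor sum to 1 for any non-empty sequence.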
Supporting Figure S5 presents the box plots of values of the 11 features on the Training dataset along with their biserial correlation (with the binary crystallization output).
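For readers who want to reproduce this kind of feature-versus-outcome correlation, the following Python sketch computes the point-biserial coefficient between a continuous feature and a binary label. The text says only "biserial correlation", so the choice of the point-biserial variant is an assumption, and the function name and toy inputs are hypothetical.

```python
import math


def point_biserial(values, outcomes):
    """Point-biserial correlation between feature values and 0/1 outcomes.

    Uses r = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), where s_n is the
    population standard deviation of all feature values.
    """
    if len(values) != len(outcomes):
        raise ValueError("values and outcomes must have the same length")
    n = len(values)
    pos = [v for v, y in zip(values, outcomes) if y == 1]
    neg = [v for v, y in zip(values, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    mean_all = sum(values) / n
    sd = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    if sd == 0:
        raise ValueError("feature has zero variance")
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    return (mean_pos - mean_neg) / sd * math.sqrt(len(pos) * len(neg) / n ** 2)


# Toy data: higher feature values coincide with the positive class.
r = point_biserial([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1])
```

This quantity is numerically identical to the Pearson correlation between the feature and the 0/1 label, which makes it easy to cross-check against a statistics package.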
The hydrophobicity-based features (those based on the MANP780101 "Average surrounding hydrophobicity" (Manavalan & Ponnuswamy, 1978) and CASG920101 "Hydrophobicity scale from native protein structures" (Casari & Sippl, 1992) indices) show that non-crystallizable proteins tend to have longer segments characterized by a wider range of hydrophobicity (lower values for the minimum in the MANP780101-based feature and higher values for the maximum in the CASG920101-based feature), whereas crystallizable proteins tend to exclude long segments with either high or low hydrophobicity. The hydrophobicity of the protein chain has been linked with crystallization outcome in many studies (Goh et al., 2004; Overton & Barton, 2006; Chen et al., 2007; Overton et al., 2008; Kurgan et al., 2009; Price et al., 2009; Babnigg & Joachimiak, 2010; Overton et al., 2011), and two studies also investigated hydrophobicity in segments of a protein sequence (Babnigg & Joachimiak, 2010; Mizianty & Kurgan, 2011).
The distributions of values of the three free energy-based features (those based on the WERD780102 "Free energy change of epsilon(i) to epsilon(ex)" (Wertz & Scheraga, 1978), RADA880103 "Transfer free energy from vap to chx" (Radzicka & Wolfenden, 1988), and WERD780103 "Free energy change of alpha(Ri) to alpha(Rh)" (Wertz & Scheraga, 1978) indices) show that non-crystallizable proteins are more likely to include segments with both higher and lower free energy change values, whereas crystallizable proteins consist of regions with medium free energy change values; this is similar to the observation for hydrophobicity. Indices related to free energy changes were also used to design PPCpred (Mizianty & Kurgan, 2011) and MCSG Zscore (Babnigg & Joachimiak, 2010).
Crystallizable proteins are shown to be enriched in Glu, whereas a high content of Ser and Cys is anticorrelated with crystallization and is characteristic of proteins that are hard to crystallize. This agrees with earlier observations (Babnigg & Joachimiak, 2010; Mizianty & Kurgan, 2011); the Glu content has also been used by Price et al. (2009), whereas the Ser and Cys contents were used by Overton et al. (2008) and by Overton et al. (2008) and Slabinski et al. (2007), respectively. The instability index, for which higher values denote unstable proteins with a shorter in vivo half-life, tends to be higher for the non-crystallizable proteins, which agrees with the finding of Slabinski et al. (2007).
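The composition-based observations above rest on simple per-residue fractions. A minimal Python sketch of such a calculation for Glu, Ser, and Cys follows; it assumes that "content" means the number of occurrences divided by the sequence length, which the text does not define explicitly, and the function name and example sequence are hypothetical. The instability index is not sketched here because it depends on a published dipeptide weight table that is not reproduced in this text.

```python
def residue_fractions(sequence, residues="ESC"):
    """Return the fraction of the sequence made up by each one-letter residue.

    The default residues are Glu (E), Ser (S), and Cys (C).
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {aa: seq.count(aa) / len(seq) for aa in residues}


# Hypothetical six-residue sequence used only to show the call.
fractions = residue_fractions("MEESCA")
```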
Crystallizable proteins are shown to have a large fraction of long high-complexity segments predicted by SEG (Wootton & Federhen, 1993). In fact, over 50% of crystallizable proteins have no low-complexity segments at all. Low-complexity regions were linked to disorder, with a general rule that inclusion of a larger number of and longer low-complexity regions implies a higher content of disorder (Romero et al., 2001). Information about predicted disorder was used to determine protein crystallizability in four previous studies (Slabinski et al., 2007; Price et al., 2009; Mizianty & Kurgan, 2011; Mizianty & Kurgan, 2012).
Interestingly, it seems that the non-crystallizable proteins have amino acids with medium polarizability (Cys, Pro, Asn, Val, Glu, Gln, Ile, Leu) closer to the N-terminus than the crystallizable targets. We hypothesize that this could be due to disorder of the protein's N-terminus or the influence of affinity tags or other N-terminal modifications.
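The N-terminal distance feature discussed above can be sketched in a few lines of Python. The residue set is taken from the list given in the text; the zero-based definition of distance, the return value for sequences without such a residue, and the function name are assumptions made for illustration.

```python
# Medium-polarizability residues as listed in the text:
# Cys, Pro, Asn, Val, Glu, Gln, Ile, Leu.
MEDIUM_POLARIZABILITY = frozenset("CPNVEQIL")


def first_medium_polarizability_distance(sequence):
    """Return the zero-based position of the first medium-polarizability residue.

    Returns None if the sequence contains no such residue (assumed convention).
    """
    for index, aa in enumerate(sequence.upper()):
        if aa in MEDIUM_POLARIZABILITY:
            return index
    return None


# Hypothetical N-terminus carrying a His-tag-like prefix.
distance = first_medium_polarizability_distance("MGSSHHHHHHSSGLVPR")
```

For the hypothetical tagged sequence above, the first qualifying residue is the Leu at zero-based position 13, which shows how an N-terminal tag could shift this feature.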
Except for the last feature, the characteristics associated with the features utilized by fDETECT are well grounded in the literature and have been shown to be markers of crystallization outcomes. This study formulates a novel combination of these characteristics that can be calculated quickly and that offers competitive predictive performance for crystallization propensity.

The names are based on the nomenclature from the AAIndex1 database.

Figure S1
Relation between predicted crystallization propensity and crystal resolution. The box plot shows the 25th, 50th, and 75th percentiles of scores for the crystal structures with resolution from a given range. Ranges for resolutions were selected to reflect the inverted cubical nature of crystal diffraction resolution. The dashed line represents a fitted third-degree polynomial.

Figure S2
The relative difference in crystallization score between the PDB and UniProt. The graph compares the predicted crystallization propensity for all proteins from a given proteome with that of the proteins from the same proteome that were deposited in the PDB. We selected proteomes with at least 20 chains deposited in the PDB.

… respectively. Proteins with a given GO annotation were mapped into modelling families. A given modelling family can be structurally covered if it includes at least one protein with a crystallization propensity above a cut-off value provided on the x-axis; the remaining structures in that family can be obtained using homology modelling. The solid lines assume that a given GO annotation is covered when one or more of its annotated modelling families has an attainable structure. The dashed/dotted lines assume that a given annotation is covered when at least 50%/all of its modelling families are structurally covered. The vertical lines show the cut-off values that correspond to the 25th centile, median, and 75th centile of the crystallization propensity scores of the clustered proteins from the PDB dataset.
To ensure statistically sound estimates and to account for the incompleteness of the GO annotations, we limited the analysis to annotations with at least 20 modelling families.
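The coverage rules described in this caption can be expressed compactly in code. The Python sketch below is an illustration under stated assumptions: the data structures (a mapping from annotation to modelling families and a mapping from family to per-protein propensity scores), the function names, and the rule labels "any", "half", and "all" are hypothetical, and "above a cut-off" is read as a strict inequality.

```python
def family_is_covered(scores, cutoff):
    """A modelling family is covered if any member scores above the cut-off."""
    return any(score > cutoff for score in scores)


def annotation_coverage(annotation_to_families, family_scores, cutoff,
                        min_families=20):
    """Return the fraction of eligible annotations covered under three rules.

    "any":  one or more families covered (solid lines in the caption)
    "half": at least 50% of families covered (dashed lines)
    "all":  every family covered (dotted lines)
    Annotations with fewer than `min_families` families are skipped.
    """
    covered = {"any": 0, "half": 0, "all": 0}
    eligible = 0
    for families in annotation_to_families.values():
        if len(families) < min_families:
            continue
        eligible += 1
        n_covered = sum(
            family_is_covered(family_scores[family], cutoff)
            for family in families
        )
        covered["any"] += n_covered >= 1
        covered["half"] += n_covered / len(families) >= 0.5
        covered["all"] += n_covered == len(families)
    if eligible == 0:
        return {rule: 0.0 for rule in covered}
    return {rule: count / eligible for rule, count in covered.items()}
```

Sweeping `cutoff` over a range of propensity values and plotting the three returned fractions would give curves of the kind the caption describes; the default `min_families=20` mirrors the filter stated in the text.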