Logistic regression models to predict solvent accessible residues using sequence- and homology-based qualitative and quantitative descriptors applied to a domain-complete X-ray structure learning set

This article describes the development, validation and application of simple logistic regression models for the prediction of solvent accessible residues in oligomer and non-oligomer sets using a domain-complete X-ray structure learning set.

A working example of relative solvent accessibility (RSA) prediction for proteins is presented. Novel logistic regression models with various qualitative descriptors that include amino acid type and quantitative descriptors that include 20-and six-term sequence entropy have been built and validated. A domain-complete learning set of over 1300 proteins is used to fit initial models with various sequence homology descriptors as well as query residue qualitative descriptors. Homology descriptors are derived from BLASTp sequence alignments, whereas the RSA values are determined directly from the crystal structure. The logistic regression models are fitted using dichotomous responses indicating buried or accessible solvent, with binary classifications obtained from the RSA values. The fitted models determine binary predictions of residue solvent accessibility with accuracies comparable to other less computationally intensive methods using the standard RSA threshold criteria 20 and 25% as solvent accessible. When an additional non-homology descriptor describing Lobanov-Galzitskaya residue disorder propensity is included, incremental improvements in accuracy are achieved with 25% threshold accuracies of 76.12 and 74.79% for the Manesh-215 and CASP(8+9) test sets, respectively. Moreover, the described software and the accompanying learning and validation sets allow students and researchers to explore the utility of RSA prediction with simple, physically intuitive models in any number of related applications.

Introduction
The characterization of solvent accessible surfaces is a key task, the results of which are useful in protein design, structural biology and proteomics, notably identifying certain protein-protein interactions. There has been significant effort in the large-scale screening of sequences for characterization of function, including key interactions with ligands and other proteins (Watson et al., 2005;Tuncbag & Gursoy, 2009;Zhang et al., 2013). Possible approaches include the characterization of solvent accessible surfaces directly from sequence information, allowing for the identification of relevant clusters on the surface (Xue et al., 2011;Tygai et al., 2012). Other structural features may prove amenable to prediction from sequence, including specifically the identification of critical ISSN 1600-5767 core hydrophobic residues (Berezofsky & Trifonov, 2001;Gromiha & Selvaraj, 2004). Such residues can describe important constraints in modeling protein folding and structure and may be useful when designing modifications for proteins.
Methods using protein sequence information, including first-generation machine learning approaches such as with neural networks, typically have shown a percent accuracy of the order of 70-75% (Rost & Sander, 1994;Richardson & Barlow, 1999;Ahmad et al., 2003). Subsequent methods including large-scale linear regression, support vector regression, support vector machines, k-nearest-neighbor analysis and random forest approaches have reported somewhat better results for certain proteins (Wagner et al., 2005;Wang et al., 2007;Joo et al., 2012;Pugalenthi et al., 2012). The advantage of regression-based approaches is their relative simplicity in both implementation and interpretation. Most recently, a simple, computationally efficient machine learning approach utilizing a general neural network has also been implemented to determine accessible surface area values (Faraggi et al., 2014).
Linear and generalized linear models can include both quantitative and qualitative predictors, here referred to as 'descriptors' (Kutner et al., 2004). Following our previous work (Rose et al., 2011;Nepal, 2013), we include a qualitative descriptor representing query residue type (AA) as well as four homology descriptors: 20-term (E20) and six-term (E6) sequence entropy, fraction of aligned residues that are strongly hydrophobic (FSHP), and fraction of small residue glycine or alanine (FSR). The AA descriptor alone has been shown to be a significant predictor ($70%) of residue solvent accessibility.
Here we implement logistic regression methods to further elucidate the relationship between simple homology and query sequence descriptors with surface accessibility. We consider both estimation and validation tasks, fitting logistic regression models using learning sets and evaluating our fitted models on test data. Logistic models are one method that should be explored in the context of protein topology, easily incorporating the corresponding quantitative and qualitative descriptors. Logistic regression models have proven to be an adaptable choice for binary classification problems in fields such as economics, epidemiology and the social sciences and are now well established. Logistic approaches have the advantage of directly modeling the probability of success, restricting predicted probabilities to [0, 1] and providing for a natural interpretation in terms of the odds of success.
Our goal is to establish the utility of simple logistic models for the long-established problem of the binary identification of solvent accessible residues. Notably, this is a first step in exploring the functionality of such methods in predicting protein-protein interfaces and interactions from sequence. Such exploration may be accelerated by these methods, given the associated computational complexity.
In addition to providing a working example of such prediction for residue solvent accessibility and applying it to problems associated with protein-protein interaction, we introduce students and researchers in structural biology, biophysical chemistry, bioinformatics and allied fields to an application of logistic regression that includes explicitly both qualitative and quantitative predictors.
Our software (http://www.iucr.org/education/resources) and the accompanying learning and validation sets allow users to explore the utility of residue solvent accessibility (RSA) prediction with simple, physically intuitive models in any number of related applications. Moreover, the corresponding R and Python code can be implemented or otherwise modified to directly predict certain structural features (e.g. possible protein-protein interfaces), and new test sequences can be explored by simple implementation of NCBI BLASTp (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) and the Protein Data Bank (http://www.rcsb.org/pdb/home/home. do).

Sequence entropy and other homology-based descriptors
As an alternative to the standard 20-term (E20) sequence entropy (Gerstein & Altman, 1995) we calculated the six-term (E6) sequence entropy (Mirny & Shakhnovich, 1999). The entropy terms are summed over six classes of amino acids (e.g. polar), where the sequence entropy at the kth residue position is expressed as Here the probability P jk at amino acid sequence position k is derived from the frequency for an amino acid class j at sequence position k for N aligned residues. Including gaps as the 21st element in our original 20-term sequence entropy calculations was problematic (Liao et al., 2005). Thus, we treated the remaining sequence homology terms, fraction strongly hydrophobic (FSHP) and small residue (FSR), as just fractions of aligned residues. Note the strongly hydrophobic residues (i.e. V, L, I, F, Y, M, W), and the remaining 13 are the non-strongly hydrophobic residues.

Data assembly
Our original learning set of 268 protein chains included query proteins with known three-dimensional X-ray crystallographic structure and was confirmed to be a diverse, representative set (Mishra, 2010). A more extensive learning set of 1363 protein chains (Nepal, 2013) was similarly selected with a resolution < 2.5 Å and an R value < 0.3 using PISCES (Wang & Dunbrack, 2003) from an exhaustive list involving some 6500 protein domains (Bondugula et al., 2011). Redundant and other problematic sequences were also culled for chains with percent identities greater than 25. To determine our sets of aligned protein sequences and their individual residue elements, we chose a straightforward and non-biased standard application of BLASTp (Altschul et al., 1997) to a non-redundant database (GenBank; http://www.ncbi.nlm.nih. gov/genbank/), with BLOSUM62 (Henikoff & Henikoff, 1992) allowing for bit scores greater than or equal to 40% of the best score (see Fig. 1). Test sets are similarly treated. Corresponding homology-based calculations such as E20 using psi-BLAST (Altschul et al., 1997) showed little added value in prediction accuracy. Alternative applications of multiple sequence alignment using MUSCLE (Edgar, 2004) indicated no significant differences in accuracy (E. Chung & B. Lustig, unpublished results).
We continue with simple approaches to parsing homology data, noting that requiring at least ten aligning BLASTp subject sequences given the 40% threshold relative to the highest bit score is a reasonable condition for fully reliable sequence entropies (Liao et al., 2005;Rose et al., 2011). We validate our models using two test sets, where one standard test set, Manesh-215 (Naderi-Manesh et al., 2001), has been thoroughly evaluated with respect to many standard RSA approaches (Nguyen & Rajapske, 2006). Additionally, we evaluated the models on the more recent CASP(8+9), a comparably sized test set evaluated by more current methods (Joo et al., 2012). However, for the Manesh-215 and CASP(8+9) test sets we are required to use a non-optimum homology subset, given that the corresponding sequence and coordination data are listed externally. Fourteen such proteins are noted for the Manesh-215 set, six proteins from the set of 102 X-ray structures of CASP8 and 12 proteins from the set of 88 X-ray structures of CASP9 (see Table 1).

Nearest neighbors and propensity for disorder
One simple and physically intuitive approach to account for the effects of sequence neighbors is to utilize the Ising approach of Lobanov et al. (2012). Here the propensity for each residue being disordered, the Lobanov-Galzitskaya disorder probability (LGDP), is introduced as an additional quantitative descriptor (IsUnstruct; http://bioinfo.protres.ru/IsUnstruct/). Calculations of Lobanov-Galzitskaya propensities include a very limited number (< 2%) of FASTA (http://www.ncbi.nlm.nih.gov/ BLAST/blastcgihelp.shtml) neighbors not included in the PDB coordinates. Undefined FASTA residues, typically less than 1% of our sequences, are assumed to be alanine for the purposes of adjacent neighbor analysis only. Also, specifically the two sequence adjacent residues, adjacent query neighbors (AQN), are defined by FASTA and can be treated as qualitative descriptors. All regression documented here is scalable on a single E5410 Intel Xeon 4P Linux box or comparable 64bit laptop. Lobanov-Galzitskaya Ising-model calculations appear algorithmically straightforward and are not significantly intensive.

Surface accessibility
One gold standard for validation of solvent accessibility predictions is the RSA values determined from the query X-ray structures using NACCESS (Lee & Richards, 1971;Hubbard & Thornton, 1993). Alternative calculations using DSSP-based (Kabsch & Sander, 1983;Ahmad et al., 2004) teaching and education Flowchart of key inputs and outputs. Table 1 Non-optimum homology subsets for test set proteins. Manesh-215  CASP8  CASP9   1axna  3d3oa  3mqza  1bhmb  3d5pa  3n53a  1ceoa  3dewa  3n6za  1cnva  3df8a  3na2a  1esca  3dm3a  3ngwa  1exnb  3doua  3ni8a  1hlba  3njaa  1kpta  3nkga  1udii  3nrga  1vcaa  3nrva  1wbaa  3nwza  2ccya  3nyma  2scpa  2sila solvent accessibility data showed no significant differences in prediction accuracy (J. Spencer, B. Lustig & A. Gottlieb, unpublished results). Here we initially treated amino acids with less than 20% relative exposure to solvent (RSA) as buried (Carugo, 2000). An alternative threshold of 25% was subsequently explored. Model accuracies are estimated on the basis of the standard expression of Richardson & Barlow (1999), where the accuracy is accuracy ¼ number of assignments to correct category total number of assignments : Solvent accessibility for interdomain (interface) regions between chains is directly evaluated from the relevant PDB (RCSB or PISA-PDBe) coordinate files via NACCESS, consistent with Bahadur et al. (2004). The 1363-protein learning set is considered diverse, robust and exhaustive (Nepal, 2013 (8+9), respectively] are noted as such for analysis.

Logistic regression
Here we directly model the probability of accessibility utilizing the naturally dichotomous outcome variable, accessibility status. This approach avoids the common pitfalls of inappropriately applying multiple regression to percentage data, such as obtaining negative probabilities or probabilities that exceed 1. We assume the independence of all residues, although in reality there is some correlation between neighbors. A logistic regression model using residue solvent accessibility as a dichotomous response with Y = 0 for buried residues and Y = 1 for accessible residues was implemented.
Logistic models are a generalization of multiple linear regression in which a binary dependent variable is described as a function of both quantitative and qualitative variables (Kutner et al., 2004;Hosmer et al., 2013). We utilized four quantitative descriptors and 19 dummy variables corresponding to the descriptor AA, such that linear predictor is of the form for j = 1, . . . , 19. Here X j are the indicator variables associated with qualitative descriptor AA. We used treatment coding with Val as the baseline treatment via R 3.1.0, where the X j values are either 0 or 1, with 1 corresponding to the given amino acid type. Let p denote the probability that a particular residue is accessible and 1 À p the probability of a residue being buried. Under the logistic regression framework, the odds of success, p/(1 À p), are modeled as an exponentiated linear combination of the descriptors: The probability that a particular residue is accessible can be expressed by rearranging the terms in equation (4), so that Estimates for the model parameters are obtained using maximum likelihood methods, determining parameters that optimize the likelihood of the observed data (consisting of zeros and ones) being described by the model (Long, 1997). The probability distribution function associated with equation (5) is an S-shaped curve and a more realistic representation of probabilities associated with binary outcomes. The corresponding fitted values are probabilities, which then can be converted to classifications of either 1 or 0 using a 0.5 standard threshold. Shown in Fig. 2  tions superimposed on the least-squares regression fit for the 18-protein transient-binding set as identified by Pettit et al. (2007). The expression in equation (5) is commonly referred to as the odds, where the difference between two log odds can be expressed as As an example, we assume all other non-E6 terms are held constant such that the ratio of the odds of success corresponding to a increase of 1 entropy unit, so that E6* = E6 + 1, is given by Table 2 reports the results of a logistic regression fit using the 1363-based learning set. Included are the parameter estimates, corresponding standard errors, and z and p values based on the Wald test (Ghosh & Sen, 1991). We find that all of the continuous descriptors and nearly all of the dummy variables associated with the categorical descriptor are significant at a 0.05 significance level.

Learning and test sets
As a first example, included for pedagogical purposes, we fit a logistic model with a qualitative descriptor describing all 20 amino acid types (AA) and a quantitative predictor E6 with the results shown in Fig. 2. The learning set corresponds to all 18 available transient-binding proteins from our original 268 learning set (Mishra, 2010;Rose et al., 2011) and current 1363based learning set. These transient-binding proteins were originally characterized as such by Pettit et al. (2007). Included in this figure is the least-squares fit with slope of 10.56 for each amino acid type and varying intercepts (Val and Asp samples are shown in Fig. 2). The logistic fit generates probabilities, which allows for binary classification of all residues as buried versus solvent accessible. We then analysed the accuracies for 13 transient-binding proteins similarly identified from the standard Manesh-215 test set. Our next step is to build various models based on the domain-complete 1363-protein learning set and validate them on the complete standard Manesh-215 test set and the more recent CASP(8+9). We were initially interested in evaluating our models on relevant subsets of the test sets, in particular, transient-binding proteins. However, defining transientbinding proteins is not currently a consensus exercise (Acuner-Ozbabacan et al., 2011). Here, we make a more conventional delineation between oligomers and non-oligomers that is effectively universal in its application. A more complete examination of oligomers and the remaining nonoligomers is warranted.

Specific strategies for optimum homology versus nonoptimum homology proteins
As shown in the flowchart (see Fig. 1) there is a key branch point involving test sets being partitioned into an optimum homology (i.e. ten or more aligned sequences) subset and a remaining non-optimum homology subset (see Table 1 for complete list). The reason is that homology-based descriptors are not reliable predictors for non-optimum homology proteins, where we only validate those models with exclusively non-homology descriptors. The resulting strategy is to use models with exclusively non-homology descriptors for the non-optimum proteins. Therefore, we report a weighted average for all proteins based on the accuracies of two components: (1) predictions for optimum homology proteins from models that include both homology-based descriptors (e.g. E6) and non-homology ones (e.g. LGDP); (2) predictions for non-optimum homology proteins from models including exclusively non-homology descriptors.

General
We have evaluated various models with different subsets of the descriptors. Table 2 reports an initial regression for the saturated E20+E6+FSR+FSHP+AA model (referred to here as the classic model). Note that for this logistic regression E20 and E6 were shown in the corresponding correlation matrix  Table 2 Classic model fit (E20+E6+FSR+FSHP+AA) including parameter estimates, corresponding standard errors, and z and p values based on the Wald test.
Descriptors included are sequence entropies E20 and E6, the amino acid qualitative predictor (AA) with 20 classes, and FSHP and FSR indicating the fraction of optimum homology residues that are strongly hydrophobic (V, I, L, F, M, Y, W) and small (A, G  (Mishra, 2010). So, most recently Box-Tidwell power transformations were applied to test/optimize log-linearity. The resulting relative order and sign of the regression coefficients, compared to those originally determined for Table 2, remain unchanged and no significant increases in resulting classification accuracy are indicated. Table 3 reports the accuracies for a family of models for the optimum homology subsets for the Manesh-215 and CASP(8+9) test sets, and Table 4 reports the accuracies of models also evaluated on oligomer and non-oligomer subsets. The classic model of Table 2 remains the best homology-based model in terms of prediction accuracy. However, we note the high correlation (0.83) between E6 and E20, unlike other quantitative predictors. Also, even the simpler E6+AA model performs nearly as well for both the 20% and the alternative 25% RSA threshold. In both the multiple linear regression (Rose et al., 2011;Nepal, 2013) and logistic models, E6 appears to be the more reliable predictor of residue solvent accessibility as opposed to E20 (see Table 3). This is consistent with the notion that E6 entropy is the more sensitive measure of residue packing (Mirny & Shakhnovich, 2001).
We augmented our set of descriptors with several query sequence-based variables. Inclusion of the LGDP quantitative descriptor and two AQN qualitative descriptors describing the amino acid type for the two sequence-adjacent amino acids generated incremental associated gains in accuracy (Table 4). A comprehensive model consisting of sequence homologybased descriptors (E6, FSR, FSHP) and sequence-based descriptors (AA, LGDP, AQN) was found to produce the highest accuracies for both Manesh-215 and CASP(8+9) optimum homology sub-sets. Note that the total all-proteins accuracies are, for the respective 25 and 20% thresholds, 76.11 and 76.00 for Manesh-215, and 74.79 and 74.55% for CASP(8+9).
Moreover, non-oligomers are often associated with higher accuracy than the oligomers (see Table 4). This is especially true when looking at models with a significant homologybased descriptor component. More importantly, the removal   (41 967 residues). § Non-optimum homology Manesh-215 subset (3113 residues). } Non-optimum homology CASP(8+9) subset (2832 residues). † † Note the other two models with AA and a single quantitative descriptor, E20+AA and FSHP+AA, are not reported as they have less predictive accuracy than E6+AA. ‡ ‡ As shown in Table 2, E20+E6+FSR+FSHP+AA saturated model. Table 4 Selected logistic model accuracies for test sets based on X-ray crystal structures.
LGDP and AQN are included as additional descriptors. We list prediction accuracies for oligomers and non-oligomers together. The standard 1363-based learning set was utilized for model fitting. The difference in accuracy for oligomers minus non-oligomers is scaled, M if difference <À0.5, P if >0.5 and otherwise O. The change in total accuracy for oligomers with likely interfacial residues removed is scaled M if difference <À0.5, P if >0.5 and otherwise O.

Implementation of simple logistic models
In this study, the initial focus has been the examination of sequence-and homology-based descriptors in logistic regression models for protein residue accessibility. Here, a logistic model including only the E6 descriptor sequence entropy for a particular sequence position in conjunction with AA, the amino acid type, offers nearly the same degree of prediction accuracy as the classic model that also includes E20 entropy and the fraction of aligned residues that are strongly hydrophobic (FSHP) as well as being small (FSR). But E6 and E20 are highly correlated, so we preferred the former descriptor because of its generally better prediction capabilities. We have demonstrated that the inclusion of an additional quantitative descriptor, the LGDP, gives associated gains in prediction accuracy of the order of 1%, while the adjacent neighbor descriptors give somewhat less associated gain in prediction accuracy. Certainly, further model building including other simple descriptors is a reasonable next step. One could even consider further refinement of the coordinate data sets themselves based on established structure validation criteria (Reddy et al., 2003;Read et al., 2011). Though computationally intensive, it is a one-time event and does not interfere with the model simplicity.
Even the most computationally intensive machine-learning methods involving optimal threshold RSA criteria show binary prediction limits at or below 80% in accuracy (Joo et al., 2012). Indeed, our logistic regression models with homologybased and qualitative query sequence descriptors compare favorably with existing single-stage methods and other approaches that preclude an extensive machine-learning task (Gianese et al., 2003). Simple models have great utility, including computational simplicity and efficiency. Prioritization of parsimony in model selection is consistent with the Pareto principle, suggesting that any model for which some 80% of outcomes can be described by 20% of the causes is preferable (Pareto, 1971). This prioritization may be required for the greater challenges associated with very large scale calculations involving the applications of these and related models to exhaustive searches of the proteome, including the search for possible protein-protein interactions.
3.3. Intrinsic limitations and future opportunities 3.3.1. General. There appears to be a fundamental limitation for prediction accuracy for surface accessible residues, which probably involves the coupled nature of surface residue accessibility with inter-chain interactions. This constraint is similar to the intrinsic limitation of secondary structure prediction (Rost, 2001), which involves the coupling of secondary with tertiary structure (Kihara, 2005). Specifically, chain residue accessibility prediction is complicated by hydrophobic residues being found not only in core positions but also on surfaces involving interactions with other protein chains (Yan et al., 2008). Prediction of solvent accessibility is further obfuscated by the still largely uncharacterized nature of transient and related protein interactions. It is clear, however, that at least for oligomers, when likely interface residues are removed, small but measurable increases in accuracy are indicated. This unique character for such residues is consistent with previously detected residue conservation signals for oligomeric interface regions (Elcock & McCammon, 2001;Valdar & Thornton, 2001;Guharoy & Chakrabarti, 2005). Interestingly, the 1363-based learning set partitioned into oligomers and non-oligomers showed significantly less accuracy for their respective application to the oligomers and non-oligomers.
Some 5% of PDB residues are identified as missing and possibly disordered and are a non-trivial component of the actual protein chains (Brandt et al., 2008;Lobanov et al., 2010). These generally correspond to residues not identified in an X-ray structure. We note, at least for the aligned portions of FASTA and PDB sequences, a comparable number of such unaccounted residues exist. Of course, no gold standard such as the simple application of NACCESS is available for these types of residues. However, in the future as we continue to build models predicting protein-protein interfaces, the inclusion of such disordered residues is of likely advantage (Mé szá ros et al., 2009). Our successful incorporation of an intrinsic residue disorder propensity as a descriptor suggests the suitability of this sort of approach in related problems, notably the prediction of protein interfaces with respect to other proteins, nucleic acids and small ligands.
Large-scale exploration of sequence space, generated off a set of structural ensembles for three-dimensional protein structures, has shown clustering of sequence entropy values corresponding to a particular fold (Larson et al., 2002). In a similar fashion, Shannon entropies for protein sequence have been shown to correlate with configurational entropies calculated from local physical parameters, including backbone geometry (Koehl & Levitt, 2002). Our earlier calculations indicated a correlation between sequence entropy and inverse packing density (a measure of local flexibility) as well as query hydrophobicity (Liao et al., 2005;Mishra, 2010). Here we can reasonably assume that there are a cohort of very buried residues that are both well packed and not likely to have any portion of their surface area accessible to solvent.
3.3.2. Exploring additional features of protein structure in the classroom with logistic regression. For MAML-1 and other proteins, the propensity of residue disorder as a possible indicator of local environments was explored as part of a graduate course seed project (Nedunuri, 2013). A broader exploration of proteins involving the 18 learning set and 13 test set transient-binding proteins (noted in Fig. 2) was made in the next such course offering, integrating the related LGDP descriptor into prediction and further classification of solvent teaching and education accessible residues (Kadiyala, 2015). Logistic regression outputs such as those indicated by Table 2 can be considered tools for exploration. Note, for instance, the very large and negative regression coefficient of À1.543 for Cys, which is unique among the AA qualitative descriptors. Such a value is consistent with Cys forming a tertiary contact (Dosztá nyi et al., 1997) and the fact that residues most likely to form tertiary contacts are significantly less likely to have side chains accessible to solvent (Kim & Park, 2004).

Conclusion
Homology-and sequence-based descriptors can be used to build classifiers of residue accessibility. We establish their utility as both quantitative and qualitative descriptors in logistic regression models, demonstrating competitive accuracy using as few as two descriptors. Removal of likely oligomeric interfacial residues typically increases prediction accuracies, suggesting specific homology-based and other signals for such interfaces. In predicting solvent accessible residues, E6 sequence entropy is a very useful descriptor, consistent with the notion that to a first approximation highly conserved residues are typically found in the interior of the protein. Moreover, such conserved residues probably involve hydrophobic residues given their tendency to be buried. But also some conserved residues can be partitioned on the surface, accessible to solvent, and this includes some that are hydrophobic. Our methods are well suited for simple integration of homology variables with a variety of potential sequence and other sequence neighbor descriptors, allowing for application of a variety of models, including new ones, to even more challenging computational problems involving the characterization of protein interfaces and protein-protein interactions.