Identification and characterization of two classes of G1 β-bulge

In standard β-bulges, a residue in one strand of a β-sheet forms hydrogen bonds to two successive residues (‘1’ and ‘2’) of a second strand. Two categories, ‘classic’ and ‘G1’ β-bulges, are distinguished by their dihedral angles: 1,2-αRβR (classic) or 1,2-αLβR (G1). It had previously been observed that G1 β-bulges are most often found as components of two quite distinct composite structures, suggesting that a basis for further differentiation might exist. Here, it is shown that two subtypes of G1 β-bulges, G1α and G1β, may be distinguished by their conformation (αR or βR) at residue ‘0’ of the second strand. β-Bulges that are constituents of the composite structure named the β-bulge loop are of the G1α type, whereas those that are constituents of the composite structure named β-link here are of the G1β type. A small proportion of G1β β-bulges, but not G1α β-bulges, occur in other contexts. There are distinctive differences in amino-acid composition and sequence pattern between these two types of G1 β-bulge which may have practical application in protein design.


Introduction
The β-bulge was first described (Richardson et al., 1978) as a small motif in which, in its commonest and standard form, a residue ('X') in one antiparallel strand of a β-sheet forms main-chain to main-chain hydrogen bonds to two successive residues ('1' and '2') of a second strand instead of making both hydrogen bonds to a single residue. This disrupts the regular β-sheet so that a bulge occurs, in some cases ending the participation of one or both of the strands in the sheet. Originally, two main types of β-bulge were distinguished: the classic β-bulge, with an αR conformation at position 1, and the G1 β-bulge, with an αL conformation at position 1 (Richardson et al., 1978). (The definitions of αR, αL etc. used here can be found in Section 5.) The name G1 derives from the frequent, but not invariable (Chan et al., 1993), occurrence of glycine at this position. Other variants ('wide', 'bent' and 'special') have been described (Chan et al., 1993) but are much less frequent, at only 10% of all β-bulges (Craveur et al., 2013).
There has recently been renewed interest in β-bulges because the inclusion of both classic (Marcos et al., 2017) and G1 (Dou et al., 2018) β-bulges in protein design has proved to be necessary to achieve certain structural features. It was originally observed that G1 β-bulges occurred in the context of two quite different composite structures: the β-bulge loop (Milner-White, 1987) and what we call the β-link [a structure incorporating a β-bulge and a type II β-turn (Venkatachalam, 1968) directed away from the β-sheet (Richardson et al., 1978)]. The question arises whether features of G1 β-bulges exist that favour the formation of one or the other of these composites and, if so, whether this information can be used in the design of synthetic proteins. We show here that by considering the conformation of the amino-acid residue N-terminal to the doubleton of the G1 β-bulge, such a distinction can be made.
ISSN 2059-7983

Materials and methods
This work employed two MySQL relational databases that modelled the atoms, residues and hydrogen bonds in different sets of proteins. The smaller one, Protein Motif, which was used in the initial phases of this work (Leader & Milner-White, 2009), contains information on 417 globular proteins from the 500 Protein Data Bank files from the Richardson laboratory (Lovell et al., 2003). (Not all proteins in this and the larger data set were used because some contained duplicated amino-acid positions and other nonstandard features that conflicted with our database schema, causing them to be rejected.) Secondary-structure information and the φ and ψ dihedral angles of residues were derived using DSSP (Kabsch & Sander, 1983), whereas for the χ and ω angles we utilized BBDEP (Dunbrack & Karplus, 1993). Backbone and inter-residue hydrogen bonds were derived using HBPlus (McDonald & Thornton, 1994).
The Protein Motif database was populated with a range of motifs derived from SQL queries specifying residue numbers and identities, dihedral angles and hydrogen bonds. For β-bulges the initial specification for the query was two consecutive residues (1 and 2) with a hydrogen bond between the main-chain CO of residue 1 and the main-chain NH of a third residue (X) and a hydrogen bond between the main-chain NH of residue 2 and the main-chain CO of residue X. A further stipulation was that residue 2 should have the βR conformation (defined in Section 5). These β-bulges were divided into two classes: 1,2-αRβR (classic) and 1,2-αLβR (G1).
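The hydrogen-bond specification above can be sketched as a self-join query. This is a minimal illustration only: the schema of the Protein Motif database is not given in the text, so the `residue` and `hbond` tables, their columns and the helper names `find_bulges` and `build_demo` are all hypothetical, and SQLite (via Python) stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical, simplified schema: the text does not describe the real
# Protein Motif tables, so every table and column name is an assumption.
SCHEMA = """
CREATE TABLE residue (
    id INTEGER PRIMARY KEY,    -- sequential index along one chain
    conformation TEXT          -- region label: 'αR', 'αL', 'βR' or 'βL'
);
CREATE TABLE hbond (
    donor_id INTEGER,          -- residue supplying the main-chain NH
    acceptor_id INTEGER        -- residue supplying the main-chain CO
);
"""

# Residues 1 and 2 are consecutive. h1 is the bond CO(1)...HN(X) and
# h2 is the bond NH(2)...OC(X); both must involve the same residue X.
BULGE_QUERY = """
SELECT r1.id, r2.id, h1.donor_id, r1.conformation
FROM residue r1
JOIN residue r2 ON r2.id = r1.id + 1
JOIN hbond h1 ON h1.acceptor_id = r1.id
JOIN hbond h2 ON h2.donor_id = r2.id AND h2.acceptor_id = h1.donor_id
WHERE r2.conformation = 'βR'
  AND r1.conformation IN ('αR', 'αL')
  AND h1.donor_id NOT IN (r1.id, r2.id)
ORDER BY r1.id
"""

# Class is decided by the conformation at position 1 (position 2 is βR).
CLASS_BY_CONF1 = {"αR": "classic", "αL": "G1"}


def find_bulges(conn):
    """Return (residue 1, residue 2, residue X, class) for each β-bulge found."""
    rows = conn.execute(BULGE_QUERY).fetchall()
    return [(r1, r2, x, CLASS_BY_CONF1[conf1]) for r1, r2, x, conf1 in rows]


def build_demo():
    """Build a toy ten-residue chain with one G1 bulge (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conformations = {5: "αL"}  # every other residue is labelled βR
    conn.executemany(
        "INSERT INTO residue VALUES (?, ?)",
        [(i, conformations.get(i, "βR")) for i in range(1, 11)],
    )
    # (donor, acceptor): the first two bonds form the bulge around X = 2;
    # the last two are an ordinary ladder pair that must not be reported.
    conn.executemany(
        "INSERT INTO hbond VALUES (?, ?)", [(2, 5), (6, 2), (8, 3), (3, 8)]
    )
    return conn
```

Calling `find_bulges(build_demo())` on the toy data reports the single bulge formed by residues 5 and 6 with residue 2 as X. The ordinary pair of hydrogen bonds between residues 3 and 8 is not reported, because both of its bonds go to a single residue.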
This database is part of the public web application Motivated Proteins (Leader & Milner-White, 2009) incorporating the molecular viewer Jmol (Herráez, 2006) and is also part of the desktop application Structure Motivator (Leader & Milner-White, 2012). Motivated Proteins allows the visualization of individual motifs in the context of the protein, whereas Structure Motivator allows the visualization of dihedral angles at different motif positions.
The second, larger, database, Proteins4K, was constructed specifically for this work. It contains information on 4485 globular proteins from the 'Top 8000' filtered structures from the Richardson laboratory (http://kinemage.biochem.duke.edu/databases/top8000.php). It was built using the same pipeline as Protein Motif, except that a script, dihedral.pl, kindly provided by Roland Dunbrack, was used instead of BBDEP. We used Proteins4K for command-line queries and populated it with β-bulges and the composite motifs encompassing them: β-bulge loops and β-links. The SQL queries for β-bulges made the same hydrogen-bond specifications as above, with the inclusion of dihedral angles at positions 0, 1 and 2 to provide subclasses.
Our approach differs from others employed to study structural motifs such as the PROMOTIF program (Hutchinson & Thornton, 1996). Although computationally less powerful than dedicated programs written in a language such as Fortran, SQL queries of a relational database modelling protein structure were used because of their flexibility. Regardless of the motifs that already populate the database, one can quickly retrieve and visualize information about constructs that suggest themselves in the course of an investigation.

Results and discussion
Differentiation between the G1 β-bulges in β-bulge loops and β-links
The relational database of protein structural information, Protein Motif (Leader & Milner-White, 2009; see Section 2), containing 417 proteins was used for our initial work and for that in Fig. 2. In addition to primary data, it is populated with derived small structural motifs, including the β-bulge loop (Milner-White, 1987) and the β-link. The latter is a composite of a β-bulge and a type II β-turn where the 1,2-positions of the β-bulge constitute the 3,4-positions of the β-turn (Fig. 1). [The β-link was originally described by Richardson et al. (1978), but was not named by them and has been somewhat neglected until recently.] While visualizing the dihedral angles of β-bulge loops and β-links as Ramachandran plots in the desktop application Structure Motivator (Leader & Milner-White, 2012), it became evident that the G1 β-bulges belonging to these two composite motifs differed at what would be position '0', N-terminal to the doubleton. In the β-bulge loop this had the αR conformation, whereas in the β-link it had the βR conformation. When modified versions of β-bulges, extended to include position '0', were viewed in the Structure Motivator application two separate distributions of dihedral angles were apparent (Fig. 2).
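The distinction at position 0 amounts to a simple lookup on the conformations at positions 0, 1 and 2. The sketch below assumes that each residue has already been assigned a region label (αR, αL, βR or βL); the function name `bulge_subtype` and the label "G1 (other)" for G1 β-bulges with neither αR nor βR at position 0 are illustrative choices, not terms from the databases described here.

```python
def bulge_subtype(conf0, conf1, conf2):
    """Name a standard β-bulge from the region labels at positions 0, 1 and 2.

    Returns None when positions 1 and 2 do not match either standard class.
    """
    if conf2 != "βR":
        return None  # both standard classes require βR at position 2
    if conf1 == "αR":
        return "classic"
    if conf1 == "αL":
        # G1 β-bulges are subdivided by the conformation at position 0.
        if conf0 == "αR":
            return "G1α"
        if conf0 == "βR":
            return "G1β"
        return "G1 (other)"  # illustrative label for the remaining cases
    return None
```

For example, `bulge_subtype("αR", "αL", "βR")` gives `"G1α"`, the subtype found in β-bulge loops, and `bulge_subtype("βR", "αL", "βR")` gives `"G1β"`, the subtype found in β-links.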
We have therefore altered the definition of the β-bulge to include position '0' and have subdivided the G1 β-bulges into two subtypes, G1α and G1β.
Table 1. Occurrence of different types of β-bulge and their participation in composite motifs.
Standard β-bulges were retrieved from a database of 4485 proteins by queries specifying the hydrogen-bonding pattern in Fig. 1(a) and the dihedral angles given in the Subtype column. Queries were made to determine the number of each subtype present in the two composite motifs indicated. For β-bulge loops this involved the additional specification that the singleton residue X was at position −2, −3 or −4 for β-bulge loop-5, loop-6 or loop-7, respectively. For β-links this involved the additional specification of a hydrogen bond between the peptide-bond O atom at position −1 and the peptide-bond N atom at position 2 in the numbering of Fig. 1(c).
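The two additional specifications in the legend can be written as a short check on residue indices. This is a sketch under stated assumptions: residues are indexed sequentially along one chain, residue 1 of the doubleton is position 1 with positions 0, −1, −2 and so on preceding it, and hydrogen bonds are supplied as (acceptor, donor) index pairs. The names `bulge_position` and `composite_motifs` are hypothetical.

```python
# Assumed numbering: residue 1 of the doubleton is position 1, the residue
# before it is position 0, then -1, -2, ... towards the N-terminus.
LOOP_BY_X_POSITION = {
    -2: "β-bulge loop-5",
    -3: "β-bulge loop-6",
    -4: "β-bulge loop-7",
}


def bulge_position(index, index_res1):
    """Convert a sequence index into the bulge numbering defined above."""
    return index - index_res1 + 1


def composite_motifs(index_x, index_res1, hbonds):
    """List the composite motifs that a β-bulge belongs to.

    hbonds is a set of (acceptor_index, donor_index) pairs, i.e. the residue
    supplying the peptide O atom followed by the one supplying the N atom.
    """
    found = []
    loop = LOOP_BY_X_POSITION.get(bulge_position(index_x, index_res1))
    if loop is not None:
        found.append(loop)
    # β-link: hydrogen bond from the O atom at position -1 to the N atom at
    # position 2, which are sequence indices res1 - 2 and res1 + 1.
    if (index_res1 - 2, index_res1 + 1) in hbonds:
        found.append("β-link")
    return found
```

With residue 1 at index 13 and X at index 10, X is at position −2 and the bulge is reported as part of a β-bulge loop-5. With residue 1 at index 20, a distant X and a hydrogen bond from the O atom of residue 18 to the N atom of residue 21, it is reported as part of a β-link.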
Having established that the extended definition of G1 β-bulges allows one to distinguish those present in β-bulge loops from those in β-links, it was pertinent to ask whether β-bulges occurred in other contexts than within these composites. We performed the following analysis using the tenfold larger database Proteins4K. We first queried the database for all β-bulges with the conformations αRβR (classic) or αLβR (G1) at positions 1 and 2 and any of the four conformations (αL, αR, βL or βR) at position 0, stipulating that the pattern of hydrogen bonding to residue X be as in Fig. 1(a). The numbers of instances of each of the eight subtypes so defined are given in the 'Total' column of Table 1. It can be seen that almost all classic β-bulges are of subtype 0,1,2-βRαRβR and that the vast majority of G1 β-bulges are of the subtypes G1α or G1β. (The proportions of these three types are included in Fig. 1.) As can be seen in Fig. 2, for any specification such as αR, the values of the dihedral angles found at different positions and in different motifs vary. Mean values for the major types of β-bulge are given in Supplementary Table S1. The Proteins4K database was populated with these subclasses of β-bulges, which were then queried to determine the proportion in higher-order structures. To identify β-bulges in loops such as the β-bulge loop-5 (Fig. 1c), loop-6 or higher, the query was for the position of residue X relative to residue 1. To identify β-bulges in β-links the query was for a hydrogen bond between positions −1 and 2 of the β-bulge (Fig. 1c). The final two columns of Table 1 show that 99% of G1α β-bulges occur in β-bulge loops and 85% of G1β β-bulges occur in β-links. The 15% of G1β β-bulges that are not in β-links are considered below.
Figure 1. The different types of β-bulges and their relationship to composite structures. The singleton is designated 'X' and the doubleton residues as '1' and '2' in the N to C direction. In the diagrams, inter-main-chain hydrogen bonds are represented as broken lines, with the red circles representing O atoms and the blue circles representing N atoms.

Amino-acid preferences of G1α and G1β β-bulges
[…] (63% Ala/Ile/Leu/Val). Position 0, the conformation of which differentiates the G1 β-bulges, also shows differences in amino-acid composition, with G1α β-bulges being rich in residues with polar side chains (73% Asn/Asp/Gln/Glu/Ser/Thr), whereas G1β β-bulges have 48% Ala/Lys/Pro/Val. A degree of similarity occurs at position 2, with both types of G1 β-bulge having many residues with polar side chains, although G1β β-bulges are enriched in aspartate (G1β, 21%; G1α, 4%).
We separated G1α β-bulges into those that are components of β-bulge loop-5 and β-bulge loop-6 structures, and separated G1β β-bulges into those that are components of β-links and the 15% that are not. Their amino-acid compositions are shown in Fig. 3(b). It is evident that G1α β-bulges belonging to β-bulge loop-5 motifs have a higher proportion of glycine residues at position 1 than those in β-bulge loop-6 motifs, and that at position 0 their polar amino acids are skewed to aspartate and asparagine at the expense of threonine. G1β β-bulges within β-links are likewise enriched in glycine at position 1 compared with those not in β-links. Also noteworthy is that the enrichment in aspartate at position 2 of G1β β-bulges is confined to those in β-links.
Some of these differences in amino-acid composition can be rationalized in terms of constraints imposed by the composite structures of which G1 β-bulges are components. This is illustrated in Fig. 4. The polar side chain at position X of approximately 70% of G1α β-bulges (which is rare at this position in G1β β-bulges) may be involved in either backbone (Fig. 4a) […] within the β-bulge loop. In the case of G1β β-bulges additional side-chain hydrogen bonding is often found from a polar side chain at position 2 to the backbone NH or CO at position −1 (Figs. 4d, 4e and 4f). The amino-acid residue frequently involved is aspartate, which is much less abundant in G1β β-bulges that are not parts of β-links. Aspartate is equally rare at this position in G1α β-bulges. Hydrogen bonding by aspartate and asparagine side chains to nearby main-chain atoms has previously been observed in small motifs (Eswar & Ramakrishnan, 1999; Wan & Milner-White, 1999; Duddy et al., 2004). The greater abundance of glycine residues at position 1 in G1 β-bulges in the more tightly constrained β-bulge loop-5 motifs and β-links suggests a role in stabilizing the respective β-turns in these latter structures. It should also be mentioned that there is a clear difference in the distribution of dihedral angles found at position X of β-bulge loop-5 and β-bulge loop-6 motifs within the general βR region, as indicated by arrowheads and asterisks, respectively, in Fig. 2.

Sequence patterns and heterogeneity of G1 β-bulges
The difference in the dihedral angles of G1α and G1β β-bulges enables one to distinguish them in proteins of known three-dimensional structure. In a similar way, a machine-learning approach allows one to assign the most probable structure of the two on the basis of amino-acid preferences (D. P. Leader, E. J. Milner-White & S. Rogers, unpublished work). However, in engineering proteins with specific subtypes of β-bulge a sequence of amino acids must be selected that is likely to produce the desired structure: a choice made from the many combinations of the most frequent amino acids in the four positions 0, 1, 2 and X.
Supplementary Table S2 contains a list of sequence patterns for the G1 β-bulges. Although the number of variants is large, it is instructive to examine the five that occur most frequently in each category, as shown in Table 2. For G1α β-bulges present in β-bulge loop-5 motifs, tripeptides for the 0, 1, 2 sequence of the type DG(S/T/N) are common, as expected from the amino-acid composition, and allow the selection of combinations with residue X that are uncommon in other subtypes. The frequent occurrence of the 0, 1, 2, X combination KGEN is less expected: it is as abundant as all other -GEN combinations in total. Its structure is shown in Fig. 4(c), with hydrogen bonds between the asparagine side chain at position X and the glutamate side-chain O atom and backbone NH group. The lysine residue is oriented away from the β-bulge hydrogen bonds towards the surface of the protein and, in all instances except one, does not interact with the carboxyl group of the glutamate. For the G1α β-bulges in β-bulge loop-6 motifs the most common sequences are consistent with the frequencies of amino acids. The situation for the majority of G1β β-bulges, those that form β-links, is that the most abundant combinations 0, 1, 2, X are of the type -GDV, as in the amino-acid compositions. The most frequent sequence pattern is PGDV, a reflection of proline being most frequent at position 0. What is not evident from Table 2 is that the amino acid at position −1 is either lysine or arginine in half of the 27 instances. The disposition of these side chains is towards the surface of the protein away from the β-bulge hydrogen bonds (Fig. 4f), resembling that of lysine at position 0 in the KGEN motif of G1α β-bulges. In this case, however, about half of the basic side chains interact with the carboxylate group of the aspartate.
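Tabulating the most frequent patterns is a counting exercise, and the sketch below shows one way to do it. It assumes each motif occurrence has been reduced to a four-letter string for positions 0, 1, 2 and X; the function names are hypothetical, and rounding to the nearest integer per thousand is an assumption about how the frequencies were reported.

```python
from collections import Counter


def per_thousand(count, total):
    """Frequency per thousand motifs, rounded to the nearest integer."""
    return round(1000 * count / total)


def top_patterns(patterns, n=5):
    """Return (pattern, frequency per thousand, count) for the n commonest.

    patterns holds one four-letter string (positions 0, 1, 2, X) per motif
    occurrence, so its length is the total number of motifs.
    """
    total = len(patterns)
    counts = Counter(patterns)
    return [
        (pattern, per_thousand(count, total), count)
        for pattern, count in counts.most_common(n)
    ]
```

As a check on the arithmetic, a pattern seen twice among 223 motif occurrences has a frequency of 9 per thousand.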
These observations are consistent with previous analysis of the distribution of amino acids in β-sheets, which showed that lysine and arginine are often found at the edges of sheets (Fujiwara et al., 2014), where most G1 β-bulges are located.
Although we believe that this analysis of sequence patterns will be useful in protein design, it is evident that other factors determine whether a particular pattern will be appropriate in any instance.

Conclusions
This work answers a longstanding question about G1 β-bulges by showing that there are two subtypes, G1α and G1β, which can be differentiated on the basis of the conformation at position 0. A reclassification of β-bulges on this basis has been implemented in the Protein Motif database and the publicly available web (Leader & Milner-White, 2009) and desktop (Leader & Milner-White, 2012) applications that incorporate it.
An important aspect of this reclassification is that these two types of G1 β-bulge are integral components of two different composite structures: G1α β-bulges in β-bulge loops and G1β β-bulges in β-links. G1α β-bulges and the loops containing them occur in different types of β-sheet as an alternative to the simple β-turn in β-hairpin and β-meander structures. In β-barrels, these loops may serve to reduce strain (Dou et al., 2018). The β-links (Richardson et al., 1978), in which the majority of G1β β-bulges reside, have received less attention, but our unpublished work shows that they are important in small β-barrels and in β-sandwich proteins. The analysis of G1 β-bulges in the present work should help to inform the design of engineered proteins in these categories.

Abbreviations
αR encompasses the range of dihedral angles −140° < φ < −20°, −90° < ψ < 40°; αL the range 20° < φ < 140°, −40° < ψ < 90°; βR the range 150° < φ or φ < −25°, 40° < ψ or ψ < −150°; and βL the range 20° < φ < 140°, −180° < ψ < −80° (here the […]L region is included within the βL region). These abbreviations are used in shorthand descriptions of β-bulges to indicate the conformations at residues 0, 1 and 2 on the 'bulged' strand: for example, 0,1,2-αRαLβR indicates a β-bulge in which residue 0 has the αR conformation, residue 1 has the αL conformation and residue 2 has the βR conformation.
Table 2. Sequence patterns for G1 β-bulges.
The five most frequently occurring patterns are shown for each motif. The frequency is per thousand motifs, with the actual number of instances in parentheses. Where no instance of a sequence pattern was found for a particular motif the entry in the […]
[Table body not recovered; surviving entries: AGIT 9 (2), AGVT 9 (2), KDYY 9 (2).]
† G1α β-bulge within a β-bulge loop-5 (1143 unique patterns in 2153 motif occurrences).
‡ G1α β-bulge within a β-bulge loop-6 (854 unique patterns in 1152 motif occurrences).
§ G1β β-bulge within a β-link (824 unique patterns in 1283 motif occurrences).
¶ G1β β-bulge not within a β-link (213 unique patterns in 223 motif occurrences).
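The four dihedral-angle ranges listed under Abbreviations translate directly into a region lookup. The sketch below assumes that the ranges refer, in order, to αR, αL, βR and βL, that angles are in degrees within −180 to 180, and that the inequalities are strict, so that boundary values and angles outside all four regions are left unassigned. The function name `region` is hypothetical.

```python
def region(phi, psi):
    """Assign a (phi, psi) pair in degrees to αR, αL, βR or βL, else None."""
    if -140 < phi < -20 and -90 < psi < 40:
        return "αR"
    if 20 < phi < 140 and -40 < psi < 90:
        return "αL"
    # The βR range wraps around ±180° in both angles, hence the 'or' tests.
    if (phi > 150 or phi < -25) and (psi > 40 or psi < -150):
        return "βR"
    if 20 < phi < 140 and -180 < psi < -80:
        return "βL"
    return None
```

The four regions as written do not overlap, so the order of the tests does not affect the result. For example, `region(-60, -45)` gives `"αR"` and `region(-120, 130)` gives `"βR"`.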