Electronic Reprint Biological Crystallography Rapid Model Building of β-sheets in Electron-density Maps

permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

A method for rapidly building β-sheets into electron-density maps is presented. β-Strands are identified as tubes of high density adjacent to and nearly parallel to other tubes of density. The alignment and direction of each strand are identified from the pattern of high density corresponding to carbonyl and Cβ atoms along the strand averaged over all repeats present in the strand. The β-strands obtained are then assembled into a single atomic model of the β-sheet regions. The method was tested on a set of 42 experimental electron-density maps at resolutions ranging from 1.5 to 3.8 Å. The β-sheet regions were nearly completely built in all but two cases, the exceptions being one structure at 2.5 Å resolution in which a third of the residues in β-sheets were built and a structure at 3.8 Å in which under 10% were built. The overall average r.m.s.d. of main-chain atoms in the residues built using this method compared with refined models of the structures was 1.5 Å.


Introduction
Many methods for the automatic interpretation of macromolecular electron-density maps and model building have recently been developed. These methods address the critical problem of building an atomic model that is consistent with the known sequence of the macromolecule and the expected geometrical features of the polymer. Automated map-interpretation methods are a natural extension of the powerful tools for interactive building of models into maps [e.g. O (Jones et al., 1991), MAIN (Turk, 1992), XtalView (McRee, 1999) and Coot (Emsley & Cowtan, 2004)], which include semi-automated procedures for the generation of models after the user specifies some information about the chain location or geometry (Oldfield, 1994; Jones & Kjeldgaard, 1997; McRee, 1999). Recently developed highly automated methods for model building of proteins and nucleic acids include procedures that first identify Cα-atom positions and then extend these to create a model (Oldfield, 2002, 2003; Ioerger & Sacchettini, 2003; Cowtan, 2006) as well as methods that first find regular secondary structure followed by extension to build loops and other structures (Levitt, 2001; Terwilliger, 2003). Other methods begin with the identification of atomic positions and their interpretation in terms of a polypeptide chain (Perrakis et al., 1999) or begin with some information about the location of the chain and extensive conformational sampling to identify the conformation of the polymer (DePristo et al., 2005). Recently, probabilistic methods based on the recognition of density patterns in electron-density maps have been developed that extend automated model building to lower resolution ranges than were previously accessible (DiMaio et al., 2007; Baker et al., 2007) and methods for building nucleic acids into electron-density maps have been demonstrated (Pavelcik & Schneider, 2008).
There are several uses for automated model building. The most important of these is to build an atomic model that will form a basis for understanding the biology of the molecule that has been crystallized. An important additional use of automated methods for map interpretation is the evaluation of the quality of electron-density maps during the structure-determination process itself. Although many techniques exist for choosing a high-quality electron-density map (see Terwilliger et al., 2009), by far the strongest indication that a structure has been 'solved' and an accurate electron-density map has been obtained is the ability to interpret the map in terms of an atomic model.
In the context of model building as a tool for map evaluation, the speed of model building is important. The faster the process, the more cases it can be applied to in a short period of time. In particular, the faster the process, the more possibilities for values of parameters in all steps of structure determination can be tested. In our previous template-based methods for main-chain building of protein structures, a key slow step consists of finding which three-amino-acid fragment from a large library best fits the electron density when placed at the tip of a growing chain (Terwilliger, 2003). Although this step can be optimized, for example by grouping of similar fragments and testing only a small subset of the library, it is intrinsically quite slow.
A much faster overall approach to model building is to look specifically for regular secondary structure in an electron-density map (Jones & Kjeldgaard, 1997; Cowtan, 1998, 2008; Terwilliger, 2003). As typically more than half the polypeptide chains in protein molecules have either α-helical or β-strand structure, a large part of a protein molecule can potentially be built in this way. Furthermore, as both α-helices and β-strands are in many cases quite regular, it is possible to carry out this analysis without considering a large number of different backbone configurations.
One way to look for a specific feature in a map is to use an FFT-based convolution search (Cowtan, 1998, 2008). We have previously used such an approach to find α-helices and β-strands in electron-density maps as the starting point for full model building (Terwilliger, 2003); however, this process is not as fast as it might be because it requires a separate FFT for each orientation of the search model (e.g. a β-strand or α-helical fragment).
A faster approach might be to identify features that are observable at low resolution or for which the locations can be identified or at least limited and only rotational components need to be sampled (Jones, 2004; Cowtan, 2008). For example, it might be possible to place α-helices or β-strands directly in the map and follow this by adjustments of their orientations and positions based on any additional information from the map that has not already been used. We have used this type of approach to build α-helices into electron-density maps at low resolution (7 Å), where they appear simply as cylinders of density (Terwilliger, 2010). Higher resolution map information was then used to identify the positioning and direction of the helices. As the initial placement was performed with information that was both low resolution and symmetrical (as the α-helices appear as cylinders), it could be carried out rapidly. Although subsequent steps were more time-intensive, they were only applied to the relatively small number of helix placements found, so the entire process was rapid.
Here, we develop a hierarchical method for β-sheet model building in which adjacent strands in a sheet are identified as nearly parallel tubes of density and the direction and register of the strands are identified using density correlations based on the periodicity of β-strands.

Modeling β-sheets in an electron-density map
Our approach for modeling the β-sheets in an electron-density map of a protein focuses on speed by examining the map for characteristic features of these structures. The method consists of three steps.
(i) Identification of the location of sheets based on the presence of nearly parallel tubes of density.
(ii) Identification of β-strand alignment and direction using the pattern of high density corresponding to carbonyl and Cβ atoms along the strand averaged over all repeats present in the strand.
(iii) Assembly of β-strands into a single model.
The result of this process is a model of the β-sheet portions of the structure. It can be used as a starting point for further model building and map interpretation in combination with a model of the α-helical portions of the structure. The steps carried out are described in detail below.
2.1. Identification of sheet locations as nearly parallel tubes of density
Fig. 1(a) illustrates a model of an antiparallel β-sheet and corresponding density at a resolution of 2.5 Å. The first step in our procedure for building β-sheets is the identification of where strands are located in the electron-density map. At moderate resolution (2.5-4 Å) the polypeptide backbone resembles a tube of density and for β-strands the tubes have only a small amount of curvature. In β-sheets these strands are arranged in a nearly parallel or antiparallel fashion, with a small (typically up to about 30°) inclination between adjacent strands. To simplify the identification of strands in a map and to make it as rapid as possible, a pair of strands in a β-sheet is therefore initially considered to consist of two tubes of density that are nearly parallel and that are separated by approximately 4.5 Å at their closest approach. In this analysis, tubes of density are identified, then pairs of nearly parallel tubes are found and finally the tubes are extended into density, allowing curvature of the tubes.
Tubes of density are found in the electron-density map by finding points along ridgelines of high density. Firstly, a set of points in the map that are on ridgelines of high density and are separated typically by 2 Å is identified (green spheres in Fig. 1b). All pairs of points that are connected by high density are then identified. The criteria for two points being connected are that (i) the density sampled along the line connecting the points has a value of at least ρ_max × cut_1, where ρ_max is the higher of the densities at the two end points and cut_1 typically has a value of 0.5, and (ii) that the mean density ρ_mean along the line is at least ρ_max × cut_2, where the typical value of cut_2 is 0.75. These pairs of connected points and the lines connecting them represent the locations and directions of tubes of density that might be β-strands.
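The two connectivity criteria can be illustrated with a minimal Python sketch. This is not the PHENIX implementation: the function names, the callable `density(x, y, z)` interface, the number of samples along the line and the toy Gaussian tube are all assumptions made for illustration. Only the thresholds cut_1 = 0.5 and cut_2 = 0.75 come from the text.

```python
import math

def sample_line(density, p, q, n_samples=11):
    """Sample density(x, y, z) at evenly spaced points from p to q inclusive."""
    values = []
    for i in range(n_samples):
        t = i / (n_samples - 1)
        point = tuple(a + t * (b - a) for a, b in zip(p, q))
        values.append(density(*point))
    return values

def are_connected(density, p, q, cut1=0.5, cut2=0.75, n_samples=11):
    """Return True if points p and q are joined by high density."""
    values = sample_line(density, p, q, n_samples)
    rho_max = max(values[0], values[-1])  # higher of the two end-point densities
    if rho_max <= 0:
        return False
    # (i) every sampled value is at least rho_max * cut1
    min_ok = min(values) >= rho_max * cut1
    # (ii) the mean along the line is at least rho_max * cut2
    mean_ok = sum(values) / len(values) >= rho_max * cut2
    return min_ok and mean_ok

def tube_density(x, y, z, sigma=1.0):
    """Toy map (hypothetical): a Gaussian tube of density along the x axis."""
    return math.exp(-(y * y + z * z) / (2.0 * sigma * sigma))
```

With this toy map, two points 2 Å apart on the tube axis are connected, whereas a point on the axis and a point 4 Å away from it are not, because the density along the connecting line falls well below ρ_max × cut_1.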
Next, pairs of nearly parallel tubes of density separated by about 4.5 Å are identified. This is performed by finding two nearby nearly parallel pairs of connected points (representing two tubes of density) with no high-density connections between the pairs (such that the density sampled along the line connecting the points has a value of at most ρ_max × cut_1 as defined above). The cosine of the angle between the two tubes of density is typically required to be at least 0.5. The distance between the tubes of density is typically required to be 4.5 ± 2.0 Å at their closest approach. These tubes representing high density in the map are then extended into the available density, allowing the curvature of the tubes to match the high density in the map, as illustrated for the two tubes of density identified by red spheres in Fig. 1(b). To simplify the analysis, this curvature is only allowed in the direction perpendicular to a line connecting the midlines of the two tubes of density at their closest approach. This direction was chosen because the strands in β-sheets typically have a curvature that is roughly perpendicular to the plane of the β-sheets.
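The geometric part of this pairing test can be sketched as follows. This is a simplified illustration, not the PHENIX code: each tube is reduced to a straight segment between two connected points, the closest approach is approximated by sampling points along both segments, and the absolute value of the cosine is used on the assumption that a tube of density has no intrinsic direction (so antiparallel segments also qualify). The check for the absence of a high-density bridge between the tubes is omitted. All function names are hypothetical.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def abs_cosine(seg_a, seg_b):
    """Absolute cosine of the angle between two segments ((p0, p1) tuples)."""
    da = _sub(seg_a[1], seg_a[0])
    db = _sub(seg_b[1], seg_b[0])
    na, nb = _norm(da), _norm(db)
    if na == 0 or nb == 0:
        return 0.0
    return abs(sum(x * y for x, y in zip(da, db))) / (na * nb)

def closest_approach(seg_a, seg_b, n=21):
    """Approximate minimum distance between two segments by point sampling."""
    def points(seg):
        p, q = seg
        return [tuple(a + (i / (n - 1)) * (b - a) for a, b in zip(p, q))
                for i in range(n)]
    return min(_norm(_sub(u, v)) for u in points(seg_a) for v in points(seg_b))

def is_strand_pair(seg_a, seg_b, min_cos=0.5, target=4.5, tol=2.0):
    """Nearly parallel (|cos| >= 0.5) and 4.5 +/- 2.0 A apart at closest approach."""
    nearly_parallel = abs_cosine(seg_a, seg_b) >= min_cos
    right_distance = abs(closest_approach(seg_a, seg_b) - target) <= tol
    return nearly_parallel and right_distance
```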
This procedure as a whole identifies tubes of density in the electron-density map that have the characteristics expected of a strand that is part of a β-sheet. Additionally, for each tube of density the direction to a neighbouring tube of density is also identified, yielding the expected direction of the carbonyl O atoms relative to the backbone of the β-strand.
To ensure that the tubes of density being considered have a shape that is approximately that expected for a β-strand, each tube of density is scored in two ways. Firstly, the correlation coefficient between the density in the map and an ideal tube of density with a value of 1 along its axis and 0 at a radius of 1.5 Å is estimated. If this is less than the value of cc_strand_min (typically set at 0.5) then the tube of density is discarded as a candidate β-strand position. Otherwise, the score for the tube consists of the mean density along the axis of the tube multiplied by the square root of the length of the tube of density. This is similar to the scoring procedure that we have used previously to evaluate the quality of fit of a model to density (Terwilliger, 2003).
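A minimal sketch of this two-part scoring is given below. The text specifies only that the ideal tube has a value of 1 on its axis and 0 at a radius of 1.5 Å; the linear fall-off between these values, the use of a Pearson correlation over sampled (radius, density) pairs and all function names are assumptions made for illustration.

```python
import math

def ideal_tube_value(r, radius=1.5):
    """Ideal tube: 1 on the axis, 0 at `radius` (linear fall-off is assumed)."""
    return max(0.0, 1.0 - r / radius)

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def score_tube(samples, axis_densities, length, cc_strand_min=0.5):
    """Score a candidate tube, or return None if it is rejected.

    samples        : (distance_from_axis, map_density) pairs near the tube
    axis_densities : map density sampled along the tube axis
    length         : length of the tube in Angstroms
    """
    cc = pearson([ideal_tube_value(r) for r, _ in samples],
                 [rho for _, rho in samples])
    if cc < cc_strand_min:
        return None  # shape does not resemble a strand-like tube
    mean_axis = sum(axis_densities) / len(axis_densities)
    return mean_axis * math.sqrt(length)
```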

Identification of β-strand alignment and direction
Once tubes of density that could represent β-strands have been identified as described above, they are each considered individually for their fit to a model of a β-strand, allowing the curvature of the strand to match the curvature of the tube of density. Fig. 1(c) shows a close-up view of a model strand and of the curved axis of the tube of density corresponding to it. In this step the goal is to start with the density map and the points marking the tube of density and to end with a strand fitted into the density. One way to do this would be to model a strand in all possible positions near the axis of the tube of density and find the one that fits the best. We chose instead to use a faster but less comprehensive method in which the density near the curve marking the tube of density is examined for periodic patterns corresponding to the pattern of carbonyl O and Cβ atoms along a β-sheet. Fig. 1(c) illustrates the features of the electron density that we used in this process. The carbonyl O atoms of the strand in the middle of the figure point alternately up towards the β-strand above it and down towards the β-strand below it in the figure. The Cβ atoms point alternately into the page and out from it. Note that the Cβ atoms have a specific relationship to the direction of the chain and the positions of the carbonyl atoms: they are located about two-thirds of the distance from one carbonyl to the next going from right to left (N-terminus to C-terminus) along the chain in the middle of Fig. 1(c). This relationship is what we use to identify the positioning and direction of the β-strand.
The representation of the density for this strand as a tube is marked by the red spheres in Fig. 1(c). Note that the red spheres very nearly coincide with the main-chain atoms of the strand. As the periodicity of the β-strand is known (about 6.7 Å), it is simple and rapid to average the density near the strand over all corresponding locations along the strand. This produces average density for one repeat of the strand. Then, as the direction towards the neighbouring strand is already known, the positioning of the carbonyl atoms can readily be identified as being where the density approximately 1.5 Å from the axis of the strand in the direction of the neighbouring strand is maximal (as illustrated by the two carbonyl O atoms pointing up from the middle strand in Fig. 1c). With the same alignment, another carbonyl O atom points down towards the strand on the other side and this density should be offset by half of the period of the β-strand. In our approach, if these two estimates of the locations of the carbonyls agree to within approximately 1/12 of the period then the identification of the location is considered to be a possible match.
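The averaging over repeats and the consistency check on the two carbonyl estimates can be sketched as below. This is an illustrative simplification, not the PHENIX routine: the density is assumed to have been sampled already at known distances along the strand axis on the 'up' and 'down' sides, one period is divided into 12 bins (so that one bin corresponds to the 1/12-period tolerance) and the peak is taken at the centre of the highest bin. The function names are hypothetical.

```python
def fold_density(positions, densities, period=6.7, n_bins=12):
    """Average density over all repeats, binned by phase within one period."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, rho in zip(positions, densities):
        phase = (s % period) / period            # fractional position in repeat
        b = min(int(phase * n_bins), n_bins - 1)
        sums[b] += rho
        counts[b] += 1
    return [t / c if c else 0.0 for t, c in zip(sums, counts)]

def peak_phase(profile):
    """Phase (0-1) at the centre of the bin with the highest averaged density."""
    n = len(profile)
    best = max(range(n), key=lambda i: profile[i])
    return (best + 0.5) / n

def phase_difference(a, b):
    """Smallest circular difference between two phases, in units of the period."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)

def carbonyl_phases_agree(up_profile, down_profile, tol=1.0 / 12.0):
    """Check that the 'down' peak, shifted by half a period, matches the 'up' peak."""
    up = peak_phase(up_profile)
    down_shifted = (peak_phase(down_profile) + 0.5) % 1.0
    return phase_difference(up, down_shifted) <= tol + 1e-9
```

Binning at 1/12 of the period is a convenience of this sketch; a finer sampling with interpolation of the peak position would give a more precise phase estimate.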
The same approach can then be applied to examine the density, again about 1.5 Å from the axis of the strand, this time in the directions perpendicular to the plane of the β-sheet. This density corresponds to that of the Cβ atoms and is offset by about 1/3 of a period from that of the carbonyl O atoms (Fig. 1c). The pattern of high density and direction along the β-strand going from the N-terminus to the C-terminus can now be readily seen. For the β-strand in the middle of Fig. 1(c), starting at the carbonyl O atom pointing up at the right of the figure and moving to the left one atom at a time, it may be seen that the pattern of high density will be (i) up at position 1 (at the carbonyl C atom), (ii) right at atom 3 (the Cβ atom into the plane of the figure), (iii) down at atom 4 (the next carbonyl), (iv) left at atom 6 (the next Cβ atom) and then up again at atom 7 (the next carbonyl O atom pointing up). Note that if the strand were in the opposite direction then the pattern would be different. Consequently, the position and direction of the strand can be identified. In our procedure, if all the positions of highest density in this pattern are aligned within a tolerance of 1/12 of the period of the β-strand then the position and direction are considered to be likely to be correct (this would happen in about 1% of cases by chance, as there are two possible directions of the strand, the position of the first carbonyl is defined as the start and the highest density would be within 1/12 of the target position 1/6 of the time for each of the other three atoms).
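The estimate of about 1% for a chance match follows directly from the numbers given: a tolerance of ±1/12 of the period is a window of 1/6 of the period, three further peaks must each fall in their window, and there are two possible strand directions. A short check of this arithmetic, with a hypothetical function name and the assumption that the three peaks are independent:

```python
from fractions import Fraction

def chance_match_probability(n_directions=2, n_other_atoms=3,
                             tolerance=Fraction(1, 12)):
    """Probability that a random pattern passes the alignment test."""
    per_atom = 2 * tolerance          # window of +/- tolerance = 1/6 of a period
    return n_directions * per_atom ** n_other_atoms

p = chance_match_probability()        # 2 * (1/6)**3 = 1/108, roughly 0.9%
```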
Given the direction and alignment of a strand as in Fig. 1(c) and the curve corresponding approximately to the main-chain atoms (the red spheres in Fig. 1c), an idealized β-strand can be placed. In cases where the direction cannot be identified using the method described above, two candidate β-strands are created, one in each direction. Note that if the curvature is substantial then there can be some distortion of the strand.

Assembly of β-strands, elimination of overlaps and joining of adjacent segments
The analysis described above yields a group of modelled β-strands that match the electron density. However, these strands may contain overlapping fragments. We use the main-chain assembly routines in the RESOLVE software to assemble these fragments and resolve any overlaps (Terwilliger, 2003). All the β-strands are ranked based on their match to the density using the scoring function described above.
β-Strands that have two or more sequential Cα atoms that overlap within about 1 Å are connected into longer chains. The highest scoring chain is selected and all overlapping fragments are deleted. The process is continued until no further fragments with a length of at least four residues are found. Fig. 1(d) shows the result of carrying this out when the strands found from analyses of this map using data to a resolution of 2.5, 3 and 4 Å are merged in this assembly process (the default procedure).
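The select-and-delete loop can be sketched as a simple greedy procedure. This is not the RESOLVE assembly routine: the joining of fragments into longer chains is omitted, an overlap is assumed to mean any Cα atom of one fragment lying within 1 Å of a Cα atom of another, and overlapping fragments are discarded whole rather than trimmed. The dictionary layout and function names are hypothetical.

```python
import math

def overlaps(frag_a, frag_b, cutoff=1.0):
    """True if any CA atom of frag_a lies within `cutoff` A of one in frag_b."""
    return any(math.dist(p, q) <= cutoff
               for p in frag_a["ca"] for q in frag_b["ca"])

def assemble(fragments, min_length=4):
    """Greedy selection: keep the best fragment, drop what overlaps it, repeat.

    Each fragment is a dict with a "score" and a list "ca" of CA coordinates.
    """
    remaining = [f for f in fragments if len(f["ca"]) >= min_length]
    selected = []
    while remaining:
        best = max(remaining, key=lambda f: f["score"])
        selected.append(best)
        remaining = [f for f in remaining
                     if f is not best and not overlaps(f, best)]
    return selected
```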

Application to experimental electron-density maps
We tested our approach for modeling β-strands using a set of 42 density-modified electron-density maps from the PHENIX structure library previously solved by MAD, SAD, MIR and a combination of SAD and SIR procedures with data extending to resolutions ranging from 1.5 to 3.8 Å. Maps were calculated with the PHENIX AutoSol wizard (Adams et al., 2002; Terwilliger et al., 2009) using the data that had previously led to refined models for each of the structures considered. Each map was examined for β-strands using the procedure described above. Table 1 summarizes the results of these tests. For each structure it shows the number of residues of β-strand in the refined structure (as calculated with DSSP; Kabsch & Sander, 1983), the number of residues of β-strand found, the number of residues found that were correctly identified as β-strand (those for which the Cα atom was within 3 Å of a Cα atom of a β-strand residue in the refined structure of the protein), the quality of the map (the correlation of the map with a map calculated from the refined model of the structure), the r.m.s. coordinate difference between main-chain atoms in the modeled β-strands compared with those in the refined structure and the correlation between the map and a map calculated from the β-strand model.
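The 3 Å criterion used to count correctly identified residues can be written as a short function. This is an illustrative sketch with a hypothetical name; the reference list is assumed to contain only the Cα coordinates of residues that DSSP assigns as β-strand in the refined structure.

```python
import math

def count_matched(built_ca, reference_strand_ca, cutoff=3.0):
    """Number of built CA atoms within `cutoff` A of a reference strand CA atom."""
    return sum(
        1 for p in built_ca
        if any(math.dist(p, q) <= cutoff for q in reference_strand_ca)
    )
```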
On average, 58% of the residues in β-strands as identified by DSSP were built using our approach. Of the residues built, 60% were in fact in β-strands (the Cα atom was within 3 Å of a Cα atom of a β-strand residue in the refined structure of the protein). The 40% of the residues built by our procedure that did not match β-strands as identified by DSSP were either incorrectly built (e.g. traced into helices) or were built into less regular secondary structure (such as loop regions). Therefore, the method built β-sheets reasonably well, but some β-strands were missed and some residues were identified as β-strand that were in fact another type of structure. Overall, the r.m.s.d. between modelled β-strands and refined coordinates was about 1.5 Å. The CPU time required (using 2.9 GHz Intel Xeon processors) to analyze all 42 maps was 66 min or about 0.8 s per residue of β-strand placed.
To compare these results with a standard procedure for automated model building, the same 42 maps were analyzed with the PHENIX AutoBuild wizard (Terwilliger et al., 2008) using one cycle of model building and refinement. The AutoBuild wizard built 65% of the residues in β-strands as identified with DSSP, with an overall r.m.s.d. (including all main-chain and Cβ atoms built, whether strand or not) of 0.95 Å and required 43 h for the entire set of structures.
One structure for which most β-strand residues were missed was the GroEL structure (PDB entry 1oel; Braig et al., 1995; Berman et al., 2000; Bernstein et al., 1977). This structure has 644 residues in β-strands; however, only 18 of these were found. This structure was at a much lower resolution (3.8 Å) than all the others in this test and the map was of lower quality than most (correlation with a model map of 0.55), suggesting that the method may not work well at lower resolutions or with maps of poor quality.
In a few cases significantly more β-strand residues were built than were identified by DSSP. For example, S-hydrolase (PDB entry 1a7a; Turner et al., 1998) had 247 β-strand residues built at a resolution of 2.8 Å, but only 83 of these matched a β-strand residue identified by DSSP. Examination of this model showed that much of it was built quite accurately (Fig. 2a); however, there were other places where β-strands have been built into density that corresponds to α-helices or to side-chain density (Fig. 2b).
The procedure produced very complete structures of the β-sheets in many cases. The largest number of β-sheet residues built was for the structure 1038B at a resolution of 3 Å (PDB entry 1lql; Choi et al., 2003), for which 472 residues of β-sheet were built (and 399 of these matched β-sheet residues identified by DSSP) with an r.m.s.d. from the refined model of 1.3 Å. The structure has tenfold NCS but this was not used in the model-building process. A ribbon diagram of this model is shown in Fig. 2(c).
Figure 2
Model building of β-sheets in density-modified experimental electron-density maps. (a, b) Sections of the electron-density map and model from S-hydrolase (Turner et al., 1998)

It would be useful to have a way to estimate the quality of a model produced with this method in real cases where the structure is not known. One approach to this is simply to calculate the correlation coefficient (CC) between the electron-density map and the β-sheet model, only including points in the map that are near (within 2 Å of) an atom in the model. Fig. 3 shows that this map correlation does give an indication of the quality of the model (as measured by the r.m.s.d. between model atoms and corresponding atoms in the refined model of the protein).
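A local map correlation of this kind can be sketched as follows. The sketch assumes that the experimental map and a map calculated from the β-sheet model have already been sampled on the same list of grid points in Cartesian coordinates; how the model map is calculated is outside its scope. The function name and argument layout are hypothetical.

```python
import math

def local_map_cc(grid_points, rho_obs, rho_calc, atoms, radius=2.0):
    """Correlation of two maps using only grid points within `radius` A of an atom.

    Returns None if fewer than two points qualify or a map has no variance there.
    """
    idx = [i for i, g in enumerate(grid_points)
           if any(math.dist(g, a) <= radius for a in atoms)]
    if len(idx) < 2:
        return None
    xs = [rho_obs[i] for i in idx]
    ys = [rho_calc[i] for i in idx]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)
```

Restricting the calculation to points near the model means that unmodelled regions of the map (helices, loops and solvent) do not lower the score of a partial model.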
One adjustable parameter in this procedure that would be expected to affect both the accuracy of the models and the number of residues built is cc_strand_min, the minimum correlation between the density near a potential strand and that of an idealized tube of density. Fig. 4(a) illustrates the mean value of the r.m.s.d. between main-chain atoms of the β-sheet models built and the corresponding atoms in refined models as a function of this parameter and Fig. 4(b) illustrates the number of residues built. The accuracy generally improves with increasing stringency, but as expected the number of residues built decreases. Values of cc_strand_min in the range 0.3-0.5 would appear to be reasonable compromises between these competing effects.

Conclusions
The procedure that we have developed for modelling β-sheets is quite rapid and reasonably accurate. It identifies most of the β-sheets in the tests we have carried out. The method does show some overprediction and can accidentally build β-structure into helical or side-chain density in some cases (Fig. 2b), but in general it builds β-sheets very well (Fig. 2c).
Several improvements can readily be imagined for this procedure. One would be to take account of the hydrogen-bonding pattern in β-sheets, which would be expected to improve the register and alignment of the models. Another would be to explicitly look for deviations from regular β-structure, such as β-bulges or the start of helices, so as to more precisely define the start and end of regular β-strands.
The method may be useful in several ways. Firstly, it can be a good indicator of whether a structure has been solved, as a picture such as that in Fig. 2(c) is not likely to be found unless this is the case. Secondly, the procedure can be part of a rapid scoring procedure for evaluation of electron-density maps by analysis of the regular secondary structure evident in the maps. Lastly, the procedure could be used as part of a more complete model-building process in which the secondary structure built with this method is used as a starting point for chain extension and further model building.
The author would like to thank the NIH Protein Structure Initiative for generous support of the Phenix project (1P01 GM063210; P. D. Adams, PI) and the members of the Phenix project for extensive collaboration and discussions. The author is grateful to the many researchers who contributed their data to the PHENIX structure library. The algorithm described here is carried out by the PHENIX routine phenix.find_helices_strands with the keywords trace_chain=False and strands_only=True.

Figure 3
Accuracy of models as a function of map correlation of the models.

Figure 4
Accuracy of models and residues built as a function of the threshold for strand-map correlation (cc_strand_min). (a) The mean r.m.s.d. between β-sheet models and refined structures is shown for the 42 maps in Table 1. (b) The total number of residues built is shown.