research papers
Autoindexing with outlier rejection and identification of superimposed lattices
^{a}Physical Biosciences Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA
^{*}Correspondence email: nksauter@lbl.gov
Constructing a model labelit.index (http://cci.lbl.gov/labelit ).
to fit the observed Bragg diffraction pattern is straightforward for perfect samples, but indexing can be challenging when artifacts are present, such as poorly shaped spots, split crystals giving multiple closely aligned lattices and outright superposition of patterns from aggregated microcrystals. To optimize the model against marginal data, can be performed using a subset of the observations from which the poorly fitting spots have been discarded. Outliers are identified by assuming a Gaussian error distribution for the bestfitting spots and points diverging from this distribution are culled. The set of remaining observations produces a superior model, while the rejected observations can be used to identify a second crystal if one is present. The prevalence of outliers provides a potentially useful measure of sample quality. The described procedures are implemented for macromolecular crystallography within the autoindexing programKeywords: autoindexing; outlier rejection; superimposed lattices; sample quality.
1. Introduction
`How good is this sample?' and `How well does the model fit the data?' are pertinent questions throughout the process of structure solution, driving critical experimental decisions even at the initial step of eliciting the crystal MOSFLM (Leslie, 1999), HKL (Otwinowski & Minor, 1997), XDS (Kabsch, 2010a,b) and d*TREK (Pflugrath, 1999). Generally, candidate Bragg spots are selected from the diffraction image and their observed laboratory coordinates are converted into by an appropriate geometric construction (Arndt & Wonacott, 1977; reviewed by Dauter, 1999). periodicity is detected by one of several autoindexing procedures (e.g. Steller et al., 1997), leading to a model with nine three unitcell lengths, three unitcell angles and three librations of the with respect to the laboratory axes (Fig. 1a). The predictive power of the model makes accurate data integration possible; in particular, it is used to deduce image coordinates for all of the reflections (Rossmann & van Beek, 1999), even those with a low intensity level that would not otherwise be distinguishable from the background level.
from the raw diffraction image. Algorithms for determining and refining the description are well understood, and are implemented by many dataprocessing packages such asThe unitcell model, or rather seven of its nine z axis or direction of incident Xrays) are readily optimized, since they produce direct changes in the expected Bragg spot positions. By comparing the predicted and observed Bragg positions, a bestfit model can be obtained by leastsquares Highaccuracy refinements of librations about the x and y axes are outside the scope of this paper, as small rotations about these axes do not produce firstorder changes in the expected spot positions. The accepted approach in this regard is to integrate the Bragg intensities over a full or partial data set, giving spotintensity profiles over a sequence of y rotations. These profiles determine highaccuracy x and yrotational orientations through a postrefinement algorithm (Rossmann et al., 1979; Winkler et al., 1979), and the resulting model is then used as the basis for a second, more accurate, round of integration.
(cell lengths, cell angles and libration about theThis paper focuses only on what is achievable with the observed centerofmass spot positions from the bestmeasured spots (Fig. 2a). There is good reason to emphasize this initial characterization of raw images (prior to data reduction), since highthroughput crystal screening is relied upon increasingly to identify the best crystalline samples prior to the collection of full data sets. Crystal screening, which is now a standard option at many synchroton beamlines (Soltis et al., 2008), examines numerous samples sequentially under robotic control. A typical protocol involves the collection of two diffraction patterns spaced 90° apart on the yrotational axis, which is enough data to gain a general understanding of the quality of the sample. Software server frameworks such as WebIce (González et al., 2008) or EDNA (Incardona et al., 2009) execute autoindexing trials for each crystal, generating a summary report that lists characteristics such as the signal strength and limiting resolution. One important quality measure is the r.m.s. deviation (in laboratory space) between the observed and predicted positions of the bestmeasured spots. Even in the absence of postrefinement, the model ought to predict spot positions with subpixel accuracy, so in the best cases the r.m.s. spot displacement is expected to be less than the CCD pixel size, typically about 100 µm.
In less favorable cases, it is challenging to refine the r_{obs} and predicted spot positions r_{calc} for the N bestmeasured spots,
model in a well behaved manner. If the operative method is to minimize the variance between observed spot centroidsthen one must check the implicit assumption that the observations have been paired with the correct Miller index h_{i}. This assumption breaks down for many experimental samples where are difficult or impossible to assign. One example is seen in Fig. 2, where the may be modulated (Wagner & Schönleber, 2009), generating satellite spots and streaks that are spaced along the c* axis. Another problematic phenomenon (§3) is the superposition of multiple lattices, which requires a decision as to which spots to associate with a given model. Clearly, in these cases, the assignment of observations to the wrong Miller index will artificially inflate the r.m.s. deviation value [which is σ_{r} in equation (1)]. Moreover, the refined model will be biased by the outliers, thus degrading any crystal quality measures that depend on an accurate knowledge of the (including the signal strength and limiting resolution). Here we develop a simple statistical test to decide, automatically, which observations should be included in equation (1) and which should be rejected as outliers, thus improving the general computational outcome.
2. Computational methods
Raw diffraction images for a number of published protein structures were downloaded from the Joint Center for Structural Genomics (JCSG; http://www.jcsg.org ) to be used as test cases. Images from hen eggwhite lysozyme containing superimposed lattices from two crystals were obtained by Peter Zwart at Beamline 5.0.1 of the Advanced Light Source at Berkeley. Software development was greatly facilitated by the framework provided by the opensource Computational Crystallography Toolbox (cctbx; GrosseKunstleve et al., 2002).
Raw data are analyzed with a spotfinding program (DISTL; Zhang et al., 2006), with a view to eliminating various types of signal artifacts prior to any further analysis. The rules applied (determined by trial and error) have been described elsewhere (Sauter et al., 2004; Sauter & Zwart, 2009) and are only briefly repeated here. Spots are retained for analysis only if they have smooth profiles that are well separated from their neighbors. Falloff of the spot count as a function of resolution is used to determine conservative outer and innerresolution cutoffs. Additional filters reject spots that are unusually intense, large or small in pixel area, or skewed in shape. Spots are ignored if they are too close to the rotation spindle for accurate positional evaluation.
Further analysis is performed on the spots once the tentative et al. (2004)] converts an observed spot position r to a realvalued Miller index f, the components of which are rounded off to produce the likely integervalued Miller index h. Index h is not always well defined for a rotation photograph; the rotation of the sample about the y axis (typically anywhere from 0.1 to 1.0°) may produce differing h values for a particular r position at the beginning and end of the rotation, in which case the observed spot is removed from consideration. The difference f − h is expected to have small fractional components. Conversely, large component values may indicate an outlier. Indeed, we are able to filter out icering artifacts by detecting peaks in a plot of f − h versus resolution, usefully supplementing the detection of ice rings by additional plots of backgroundcorrected pixel intensities and number of candidate spots versus resolution. Yet despite all this attention to heuristics for classifying outliers, there are still numerous examples of undetected split spots, fragmented ice rings and superimposed lattices that work their way into the target function [equation (1)] for optimizing the model.
model is established by autoindexing. A formula given previously [equation (8) of SauterWe therefore resort to statistical assumptions to help sort the spots. Two populations are posited, a collection of outliers that do not fit the x and y. We make the further simplifying assumptions that the individual coordinate deviations Δx and Δy (Fig. 3a) are uncorrelated and that their variances are equal (). This permits the total deviation between the observed and predicted positions,
model, and a normally distributed collection of spots that fit the model, albeit with some uncertainty. The normally distributed spots are taken to deviate from their predicted positions in the plane of the detector with a Gaussian probability distribution in each coordinate,to be modeled with a Rayleigh probability distribution (which may be thought of as an extension of the Gaussian distribution into two dimensions),
where σ^{2} = = = .
With this framework in place, the goal now is to separate observations that do and do not appear to fit a Rayleigh distribution. The total group of N observations is sorted in order of increasing Δr (Fig. 3c), indexed by the symbol k (k = 0, 1, …, N − 1). A useful tool for interpreting this series is the cumulative distribution function, or probability that an observation will have a Δr lower than a given value,
It is assumed that the spots with the smallest Δr values will form a safe group (with no outliers) from which to model the variance in this equation. We therefore take a reasonable subset (40% of the well measured spots with the lowest Δr; see below) and optimize the value of σ by leastsquares minimization of a function f that characterizes the vertical difference between the observed and calculated cumulative distribution curves in Fig. 3(c),
In this equation, the observed distribution function Φ_{obs} is just a straightline function, (2k + 1)/2N, that spreads the observations along the vertical axis of Fig. 3(c), while the calculated distribution function Φ_{calc} is equation (4) evaluated at the Δr position of the kth observation. Modeling the variance with equation (5) is expected to be superior to using equation (1), since we are fairly confident that equation (5) samples only the variation of spots that truly belong to the and not the outliers that may be present in the total ensemble of N spots. With the cumulative distribution function [equation (4)] now established, it is possible to define an outlier as an observation for which Δr is too far away from the idealized curve of equation (4). A good cutoff is a distance of σ [represented by the green bar in Fig. 3(c)], i.e. spots are classified as outliers when
It is important to understand that this cutoff criterion does not penalize observations that depart from the mean by many standard deviations. For example, if the sample size is extremely large it would be perfectly acceptable for a nonoutlying spot to have Δr = 6σ. Rather than impose an arbitrary σ cutoff, outliers are identified when the incidence of high Δr values exceeds that expected from a Rayleigh distribution. This approach accounts for the sample size in a natural way, as others have done in different contexts (Read, 1999; Zwart et al., 2005).
The statistical model for outlier rejection is applied immediately after autoindexing at the computational stage (Fig. 1b) where the model is still expressed as a triclinic cell. Subsequent to outlier rejection, the model is rerefined [again based on the target function in equation (1)] and the cell is analyzed to discover the possible Bravais types (Sauter et al., 2006) prior to data integration.
3. Results
To assess the , diffraction properties such as limiting resolution varied over a considerable range, as did the experimental conditions such as the width of the rotation angle (not shown). deviation statistics computed from one rotation image in each data set (generally the first image) reveal a broad spectrum of sample qualities, with one sample (3bgu ) exhibiting a second and two others presenting Bragg spots that are split in half (1vr8 ) or streaked in long rows (1vk8 ; Fig. 2). Separately, a test image from a lysozyme sample containing two lattices was analyzed (Fig. 4).
quality from a variety of crystal samples, 22 protein structures were selected from the JCSG data archive, spanning a wide subset of Bravais types. As indicated in Table 1

Each diffraction pattern was autoindexed with LABELIT (Sauter et al., 2004) and the resulting refined based on its agreement with the centerofmass positions of the N bestcandidate Bragg spots [equation (1)]. Our procedure for rejecting outliers is illustrated in Fig. 3 for the 1vk8 diffraction pattern. Fig. 3(a) plots the Δx and Δy deviations of the observed spots from their predicted positions, although the outliers that are several millimeters away from the nearest expected point (Fig. 2b) are beyond the extent of this graph. The ordered sequence of Δr values is shown in Fig. 3(c), along with the bestfit cumulative distribution function [Φ(Δr), blue curve] which is modeled on the 40% of observations with the lowest Δr values (red spots). In all cases examined, choosing the bestconforming subset of Bragg spots in this manner shows that the errors in the spot positions are described reasonably well by a Rayleigh distribution [equations (3) and (4)]. The modeled parameter describing the distribution (σ) is not very sensitive to the exact fraction of spots included; comparable results are obtained upon inclusion of anywhere from 25 to 55% of the spots. A commandline option was added to labelit.index to set this parameter (see §4).
In the case of sample 1vk8 , a full 80% of the chosen spots coincide with the Rayleigh distribution (blue line, Fig. 3c), while the remaining observations (those above the green cutoff line) depart markedly from this ideal. These are interpreted as outliers, and are thus removed from the list of observations. Another round of ensues, this time producing Δx and Δy deviations with a much tighter distribution about the model (Fig. 3b). Also, an ordered plot of Δr values shows that the entire set of remaining observations adheres extremely well to a Rayleigh distribution (Fig. 3d). The improvement in the model is not confined to the 1vk8 case; indeed, the r.m.s. deviation between model and experiment (σ_{r}) decreases after a second round of model sometimes dramatically, in every instance where outliers are rejected (Table 1).
For the crystals containing two lattices (3bgu and lysozyme), the initial target (for refining the predominant lattice) is contaminated with spot observations from the second The r.m.s. spot deviations (887 and 586 µm, respectively) are therefore exceptionally large. However, the statistical rejection of outliers successfully removes these observations, such that the second round of produces much lower spot deviations (66 and 100 µm) that are typical of singlecrystal samples. Furthermore, the identification of the outlying spots provides an opportunity to index the second a separate round of autoindexing based on just the outliers (Fig. 1b) produces good models, although the spot deviations of 586 and 321 µm are fairly high. The significance of this result is that it is not necessary to determine manually which spots to identify with the second prior to autoindexing based on painstaking visual inspection. The two superimposed lysozyme lattices discerned in this manner are depicted in Fig. 4.
As explained in §1 above, the automated crystalscreening experiments that are the intended focus of this paper often rely on acquiring two images spaced 90° apart on the yrotational axis. We simulated such data sets by selecting widely spaced images from the JCSG archive. Statistical rejection of outliers from these twoimage data sets (data not shown) did not differ remarkably from the oneimage trials listed in Table 1, either in the ability to improve the model fit or the ability to conform the remaining spots to a Rayleigh model. Certain minor details did change when considering the two images together, e.g. the additional was not as pronounced in the second 3bgu image and the spot splitting was not as severe in the second 1vr8 image. In other data sets such as 1vm6 , the percentage of spots rejected as outliers increased slightly, apparently because the model proved to be a better fit for one of the two images (data not shown).
4. Discussion
Within the crystalscreening paradigm, tens or hundreds of similar crystal samples may be evaluated for optimal diffraction before one or two are selected for data collection (Page et al., 2005). Automated software tools can facilitate this process by providing measures of sample quality (such as the signaltonoise ratio, limiting resolution, r.m.s. deviation, mosaicity and number of icering artifacts) in real time as the data are acquired. Except for the quantification of ice rings, these standard quality measures focus on the diffracted itself. What have been lacking are reliable measures for the interference caused by nonlattice artifacts, which might degrade the integration of the Bragg signal or the subtraction of background, thus reducing the quality of the structure factors. Two of the statistics presented in Table 1 appear to capture this information. Firstly, the number of outliers (expressed as a percentage of total Bragg spot candidates after ice rings have been removed) measures the prevalence of signals that do not belong to the canonical Secondly, the severity of the outliers (computed as the Fig. 3c area bounded by the observed spots, the blue curve and the green cutoff line, in units of σ) gauges how far the outliers are from the High values of these measures in Table 1 correlate with the visual recognition of stray spots in the 1vk8 , 3bgu , 1vr8 and lysozyme images. These statistics may therefore prove useful for crystal screening in combination with the standard measures mentioned above.
The new methods described here – statistical rejection of outliers followed by rerefinement of the labelit.index (Fig. 1b). Although the program normally operates with image data as the only input, a few commandline options have been added (described in the online manual) for generating plots such as those shown in Figs. 2–4, or for disabling outlier rejection altogether. Presently, the indexing of a second is not part of the default procedure. Rather, it is intended that a high percentage of outliers will alert the user to the possibility of a second whereupon labelit.index may be run a second time with a commandline flag set to follow the `additional indexing path shown in Fig. 1(b).
model – have been added to the default flowchart within the autoindexing programThe discovery of additional lattices performed here and elsewhere (Buts et al., 2004) raises the question of how to handle data reduction. Standard programs such as MOSFLM, HKL, XDS and d*TREK treat only one at a time, so separate datareduction runs will have to be performed to integrate the Bragg signals from each Care must be taken with pairs of reflections from different lattices that either overlap or come close enough such that the background subtraction performed for one is biased by the Bragg signal from another One approach, implemented in the program UNTANGLE (Buts et al., 2004), is to enumerate and reject problematic pairs of reflections before the data sets are merged together. Other programs such as PROW (Bourgeois et al., 1998) and EVAL15 (Schreurs et al., 2010) preserve the information that is present in near or overlapping signals by jointly integrating and deconvoluting the neighboring spots.
The need for software tools to analyze multiple lattices is likely to increase. In contrast with the present practice of transferring individual crystals to the goniometer stage, some new samplepreparation technologies have focused on the in situ collection of data without removing crystals from their growth chamber. These protocols, involving samples grown in capillaries (Yadav et al., 2005) or on microfluidic plates (Ng et al., 2008; Gerdts et al., 2008; Emamzadah et al., 2009), omit the step where crystals are separated from each other, so it is quite possible for the incident Xray beam to impinge on two crystals simultaneously. The early outlierbased detection of additional lattices (§3) could either be used automatically to trigger special datareduction protocols, or be deployed as part of a system of safeguards to avoid collecting such data sets altogether.
At the same time, one must keep in mind the assumptions underlying the present methods. Autoindexing must succeed prior to outlier analysis, so even if there are multiple lattices, one of them must be sufficiently predominant so that its unitcell axes can be discerned by the autoindexing methodology (Steller et al., 1997, in the case of labelit.index). Outlier detection relies on the assumption that there is a safely known subset of observations (we assume 40% here) that truly coincides with the main Finally, if the outliers are used as a basis for determining a second we again assume that there is a predominant signal to be found, so it may be difficult to apply the present method to a superposition of three or more lattices.
The procedures described here are included in the software package LABELIT, available for download by noncommercial users at http://cci.lbl.gov/labelit , and for licensing by commercial users. LABELIT is also included with the PHENIX package (Adams et al., 2002, 2010), available for download at http://www.phenixonline.org .
Acknowledgements
The authors thank Ashley Deacon (Joint Center for Structural Genomics) for creating the archive of full data sets associated with published JCSG structures, making it possible to test new methods such as those described here. Peter Zwart (Berkeley Center for Structural Biology of the Lawrence Berkeley National Laboratory) provided the lysozyme data set. Ralf GrosseKunstleve (Lawrence Berkeley National Laboratory) and David Richardson (Duke University) engaged in helpful discussions. We thank the NIH for financial support of the LABELIT project (grant No. 1R01GM77071). This work was partially supported by DOE contract No. DEAC0205CH11231.
References
Adams, P. D. et al. (2010). Acta Cryst. D66, 213–221. Web of Science CrossRef CAS IUCr Journals
Adams, P. D., GrosseKunstleve, R. W., Hung, L.W., Ioerger, T. R., McCoy, A. J., Moriarty, N. W., Read, R. J., Sacchettini, J. C., Sauter, N. K. & Terwilliger, T. C. (2002). Acta Cryst. D58, 1948–1954. Web of Science CrossRef CAS IUCr Journals
Arndt, U. W. & Wonacott, A. J. (1977). Editors. The Rotation Method in Crystallography, p. 83. Amsterdam: North Holland.
Bourgeois, D., Nurizzo, D., Kahn, R. & Cambillau, C. (1998). J. Appl. Cryst. 31, 22–35. Web of Science CrossRef CAS IUCr Journals
Buts, L., DaoThi, M.H., Wyns, L. & Loris, R. (2004). Acta Cryst. D60, 983–984. Web of Science CrossRef CAS IUCr Journals
Dauter, Z. (1999). Acta Cryst. D55, 1703–1717. Web of Science CrossRef CAS IUCr Journals
Emamzadah, S., Petty, T. J., De Almeida, V., Nishimura, T., Joly, J., Ferrer, J.L. & Halazonetis, T. D. (2009). Acta Cryst. D65, 913–920. Web of Science CrossRef CAS IUCr Journals
Gerdts, C. J., Elliott, M., Lovell, S., Mixon, M. B., Napuli, A. J., Staker, B. L., Nollert, P. & Stewart, L. (2008). Acta Cryst. D64, 1116–1122. Web of Science CrossRef CAS IUCr Journals
González, A., Moorhead, P., McPhillips, S. E., Song, J., Sharp, K., Taylor, J. R., Adams, P. D., Sauter, N. K. & Soltis, S. M. (2008). J. Appl. Cryst. 41, 176–184. Web of Science CrossRef IUCr Journals
GrosseKunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136. Web of Science CrossRef CAS IUCr Journals
Incardona, M.F., Bourenkov, G. P., Levik, K., Pieritz, R. A., Popov, A. N. & Svensson, O. (2009). J. Synchrotron Rad. 16, 872–879. Web of Science CrossRef IUCr Journals
Kabsch, W. (2010a). Acta Cryst. D66, 125–132. Web of Science CrossRef CAS IUCr Journals
Kabsch, W. (2010b). Acta Cryst. D66, 133–144. Web of Science CrossRef CAS IUCr Journals
Leslie, A. G. W. (1999). Acta Cryst. D55, 1696–1702. Web of Science CrossRef CAS IUCr Journals
Ng, J. D., Clark, P. J., Stevens, R. C. & Kuhn, P. (2008). Acta Cryst. D64, 189–197. Web of Science CrossRef CAS IUCr Journals
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326. CrossRef CAS Web of Science
Page, R., Deacon, A. M., Lesley, S. A. & Stevens, R. C. (2005). J. Struct. Funct. Genomics, 6, 209–217. CrossRef PubMed CAS
Pflugrath, J. W. (1999). Acta Cryst. D55, 1718–1725. Web of Science CrossRef CAS IUCr Journals
Read, R. J. (1999). Acta Cryst. D55, 1759–1764. Web of Science CrossRef CAS IUCr Journals
Rossmann, M. G. & van Beek, C. G. (1999). Acta Cryst. D55, 1631–1640. Web of Science CrossRef CAS IUCr Journals
Rossmann, M. G., Leslie, A. G. W., AbdelMeguid, S. S. & Tsukihara, T. (1979). J. Appl. Cryst. 12, 570–581. CrossRef CAS IUCr Journals Web of Science
Sauter, N. K., GrosseKunstleve, R. W. & Adams, P. D. (2004). J. Appl. Cryst. 37, 399–409. Web of Science CrossRef CAS IUCr Journals
Sauter, N. K., GrosseKunstleve, R. W. & Adams, P. D. (2006). J. Appl. Cryst. 39, 158–168. Web of Science CrossRef CAS IUCr Journals
Sauter, N. K. & Zwart, P. H. (2009). Acta Cryst. D65, 553–559. Web of Science CrossRef CAS IUCr Journals
Schreurs, A. M. M., Xian, X. & KroonBatenburg, L. M. J. (2010). J. Appl. Cryst. 43, 70–82. Web of Science CrossRef CAS IUCr Journals
Soltis, S. M. et al. (2008). Acta Cryst. D64, 1210–1221. Web of Science CrossRef CAS IUCr Journals
Steller, I., Bolotovsky, R. & Rossmann, M. G. (1997). J. Appl. Cryst. 30, 1036–1040. Web of Science CrossRef CAS IUCr Journals
Wagner, T. & Schönleber, A. (2009). Acta Cryst. B65, 249–268. Web of Science CrossRef CAS IUCr Journals
Winkler, F. K., Schutt, C. E. & Harrison, S. C. (1979). Acta Cryst. A35, 901–911. CrossRef CAS IUCr Journals Web of Science
Yadav, M. K., Gerdts, C. J., Sanishvili, R., Smith, W. W., Roach, L. S., Ismagilov, R. F., Kuhn, P. & Stevens, R. C. (2005). J. Appl. Cryst. 38, 900–905. Web of Science CrossRef CAS IUCr Journals
Zhang, Z., Sauter, N. K., van den Bedem, H., Snell, G. & Deacon, A. M. (2006). J. Appl. Cryst. 39, 112–119. Web of Science CrossRef CAS IUCr Journals
Zwart, P. H., GrosseKunstleve, R. W. & Adams, P. D. (2005). CCP4 Newsl. 43, contribution 7.
This is an openaccess article distributed under the terms of the Creative Commons Attribution (CCBY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.