Autoindexing with outlier rejection and identification of superimposed lattices

After autoindexing, Bragg spot candidates that do not fit on the model lattice can be identified, providing a potentially useful measure of sample quality and giving an avenue for indexing a second lattice, if one is present.


Introduction
'How good is this sample?' and 'How well does the model fit the data?' are pertinent questions throughout the process of structure solution, driving critical experimental decisions even at the initial step of eliciting the crystal lattice from the raw diffraction image. Algorithms for determining and refining the lattice description are well understood, and are implemented by many data-processing packages such as MOSFLM (Leslie, 1999), HKL (Otwinowski & Minor, 1997), XDS (Kabsch, 2010a,b) and d*TREK (Pflugrath, 1999). Generally, candidate Bragg spots are selected from the diffraction image and their observed laboratory coordinates are converted into reciprocal space by an appropriate geometric construction (Arndt & Wonacott, 1977; reviewed by Dauter, 1999). Lattice periodicity is detected by one of several autoindexing procedures (e.g. Steller et al., 1997), leading to a lattice model with nine degrees of freedom: three unit-cell lengths, three unit-cell angles and three librations of the lattice with respect to the laboratory axes (Fig. 1a). The predictive power of the lattice model makes accurate data integration possible; in particular, it is used to deduce image coordinates for all of the reflections (Rossmann & van Beek, 1999), even those with a low intensity level that would not otherwise be distinguishable from the background level.
The unit-cell model, or rather seven of its nine degrees of freedom (cell lengths, cell angles and libration about the z axis or direction of incident X-rays) are readily optimized, since they produce direct changes in the expected Bragg spot positions. By comparing the predicted and observed Bragg positions, a best-fit model can be Protocol for outlier detection. (a) Data-acquisition geometry, showing librations of the crystal about the incident beam (z axis) and goniometer rotation axis (y axis). (b) Computational procedure showing steps that are executed within the program labelit.index. New outlier detection steps developed in this paper are indicated by an asterisk (*) and the alternative pathway for indexing the second lattice (if one is present) is indicated by a dagger symbol ( †). obtained by least-squares refinement. High-accuracy refinements of librations about the x and y axes are outside the scope of this paper, as small rotations about these axes do not produce first-order changes in the expected spot positions. The accepted approach in this regard is to integrate the Bragg intensities over a full or partial data set, giving spot-intensity profiles over a sequence of y rotations. These profiles determine high-accuracy x-and y-rotational orientations through a post-refinement algorithm (Rossmann et al., 1979;Winkler et al., 1979), and the resulting lattice model is then used as the basis for a second, more accurate, round of integration.
This paper focuses only on what is achievable with the observed center-of-mass spot positions from the bestmeasured spots (Fig. 2a). There is good reason to emphasize this initial characterization of raw images (prior to data reduction), since high-throughput crystal screening is relied upon increasingly to identify the best crystalline samples prior to the collection of full data sets. Crystal screening, which is now a standard option at many synchroton beamlines (Soltis et al., 2008), examines numerous samples sequentially under robotic control. A typical protocol involves the collection of two diffraction patterns spaced 90 apart on the y-rotational axis, which is enough data to gain a general understanding of the quality of the sample. Software server frameworks such as Web-Ice (Gonzá lez et al., 2008) or EDNA (Incardona et al., 2009) execute autoindexing trials for each crystal, generating a summary report that lists characteristics such as the signal strength and limiting resolution. One important quality measure is the r.m.s. deviation (in laboratory space) between the observed and predicted positions of the best-measured spots. Even in the absence of post-refinement, the lattice model ought to predict spot positions with sub-pixel accuracy, so in the best cases the r.m.s. spot displacement is expected to be less than the CCD pixel size, typically about 100 mm.
In less favorable cases, it is challenging to refine the lattice model in a well behaved manner. If the operative method is to minimize the variance 2 r between observed spot centroids r obs and predicted spot positions r calc for the N best-measured spots, then one must check the implicit assumption that the observations have been paired with the correct Miller index h i . This assumption breaks down for many experimental samples where Miller indices are difficult or impossible to assign. One example is seen in Fig. 2, where the lattice may be modulated (Wagner & Schö nleber, 2009), generating satellite spots and streaks that are spaced along the c* axis. Another problematic phenomenon (x3) is the superposition of multiple lattices, which requires a decision as to which spots to associate with a given lattice model. Clearly, in these cases, the assignment of observations to the wrong Miller index will artificially inflate the r.m.s. deviation value [which is r in equation (1)]. Moreover, the refined lattice model will be biased by the outliers, thus degrading any crystal quality measures that depend on an accurate knowledge of the lattice (including the signal strength and limiting resolution). Here we develop a simple statistical test to decide, automatically, which observations should be included in equation (1) and which should be rejected as outliers, thus improving the general computational outcome.  , 2002). Raw data are analyzed with a spot-finding program (DISTL; Zhang et al., 2006), with a view to eliminating various types of signal artifacts prior to any further analysis. The rules applied (determined by trial and error) have been described elsewhere (Sauter et al., 2004;Sauter & Zwart, 2009) and are only briefly repeated here. Spots are retained for analysis only if they have smooth profiles that are well separated from their neighbors. Falloff of the spot count as a function of resolution is used to determine conservative outer-and inner-resolution cutoffs. Additional filters reject spots that are unusually  Detail of the 1 rotation image used for analyzing outliers for Protein Data Bank (PDB) entry 1vk8. (a) Red circles are the candidate Bragg spots accepted for the initial lattice refinement, after two rounds of heuristic spot filters. Spot intensity is one criterion for accepting these candidate observations, but other characteristics have disqualifed many of the brightest Bragg spots in this case. Specifically, the heuristic rules select signals for which the centroids are extremely well measured, namely round sharp spot profiles that are baseline-separated from neighboring signals. Thus, the Bragg signals exhibiting satellite spots and streaks oriented along the vertical c*-axis direction are excluded from refinement. (b) Yellow boxes show the predicted Bragg positions on the initially refined lattice model. Observations have been recolored to indicate their status with respect to this predicted lattice. Red dots represent the 40% of spots closest to their predicted positions, used for determining the Rayleigh distribution in Fig. 3(c). Pink spots are those determined to be outliers by the statistical test [equation (6)], very much in agreement with the visual impression. Orange dots are the remaining observations. intense, large or small in pixel area, or skewed in shape. Spots are ignored if they are too close to the rotation spindle for accurate positional evaluation.

Computational methods
Further analysis is performed on the spots once the tentative lattice model is established by autoindexing. A formula given previously [equation (8) of Sauter et al. (2004)] converts an observed spot position r to a real-valued Miller index f, the components of which are rounded off to produce the likely integer-valued Miller index h. Index h is not always well defined for a rotation photograph; the rotation of the sample about the y axis (typically anywhere from 0.1 to 1.0 ) may produce differing h values for a particular r position at the beginning and end of the rotation, in which case the observed spot is removed from consideration. The difference f À h is expected to have small fractional components. Conversely, large component values may indicate an outlier. Indeed, we are able to filter out ice-ring artifacts by detecting peaks in a plot of |f À h| versus resolution, usefully supplementing the detection of ice rings by additional plots of backgroundcorrected pixel intensities and number of candidate spots versus resolution. Yet despite all this attention to heuristics for classifying outliers, there are still numerous examples of undetected split spots, fragmented ice rings and superimposed lattices that work their way into the target function [equation (1)] for optimizing the lattice model.
We therefore resort to statistical assumptions to help sort the spots. Two populations are posited, a collection of outliers that do not fit the lattice model, and a normally distributed collection of lattice spots that fit the model, albeit with some uncertainty. The normally distributed lattice spots are taken to deviate from their predicted positions in the plane of the detector with a Gaussian probability distribution in each coordinate, x and y. We make the further simplifying assumptions that the individual coordinate deviations Áx and Áy (Fig. 3a) are uncorrelated and that their variances are equal ( 2 x ¼ 2 y ). This permits the total deviation between the observed and predicted positions, to be modeled with a Rayleigh probability distribution (which may be thought of as an extension of the Gaussian distribution into two dimensions), where 2 = 2 x = 2 y = 2 r =2. With this framework in place, the goal now is to separate observations that do and do not appear to fit a Rayleigh distribution. The total group of N observations is sorted in order of increasing Ár (Fig. 3c), indexed by the symbol k (k = 0, 1, . . . , N À 1). A useful tool for interpreting this series is the cumulative distribution function, or probability that an observation will have a Ár lower than a given value, It is assumed that the spots with the smallest Ár values will form a safe group (with no outliers) from which to model the variance in this equation. We therefore take a reasonable subset (40% of the well measured spots with the lowest Ár; see below) and optimize the value of by least-squares minimization of a function f that characterizes the vertical difference between the observed and calculated cumulative distribution curves in Fig. 3(c), In this equation, the observed distribution function È obs is just a straight-line function, (2k + 1)/2N, that spreads the observations along the vertical axis of Fig. 3(c), while the calculated distribution function È calc is equation (4) evaluated at the Ár position of the kth observation. Modeling the variance with equation (5) is expected to be superior to using equation (1), since we are fairly confident that equation (5)   outliers that may be present in the total ensemble of N spots.
With the cumulative distribution function [equation (4)] now established, it is possible to define an outlier as an observation for which Ár is too far away from the idealized curve of equation (4). A good cutoff is a distance of [represented by the green bar in Fig. 3(c)], i.e. spots are classified as outliers when It is important to understand that this cutoff criterion does not penalize observations that depart from the mean by many standard deviations. For example, if the sample size is extremely large it would be perfectly acceptable for a non-outlying spot to have Ár = 6. Rather than impose an arbitrary cutoff, outliers are identified when the incidence of high Ár values exceeds that expected from a Rayleigh distribution. This approach accounts for the sample size in a natural way, as others have done in different contexts (Read, 1999;Zwart et al., 2005). The statistical model for outlier rejection is applied immediately after autoindexing at the computational stage (Fig. 1b) where the lattice model is still expressed as a triclinic cell. Subsequent to outlier rejection, the model is re-refined [again based on the target function in equation (1)] and the cell is analyzed to discover the possible Bravais types  prior to data integration.

Results
To assess the lattice quality from a variety of crystal samples, 22 protein structures were selected from the JCSG data archive, spanning a wide subset of Bravais types. As indicated in Table 1, diffraction properties such as limiting resolution varied over a considerable range, as did the experimental conditions such as the width of the rotation angle (not shown). Lattice deviation statistics computed from one rotation image in each data set (generally the first image) reveal a broad spectrum of sample qualities, with one sample (3bgu) exhibiting a second lattice and two others presenting Bragg spots that are split in half (1vr8) or streaked in long rows (1vk8; Fig. 2). Separately, a test image from a lysozyme sample containing two lattices was analyzed (Fig. 4).
Each diffraction pattern was autoindexed with LABELIT (Sauter et al., 2004) and the resulting lattice refined based on its agreement with the center-ofmass positions of the N best-candidate Bragg spots [equation (1)]. Our procedure for rejecting outliers is illustrated in Fig. 3 for the 1vk8 diffraction pattern. Fig. 3(a) plots the Áx and Áy deviations of the observed spots from their predicted positions, although the outliers that are several millimeters away from the nearest expected lattice point (Fig. 2b) Table 1 Improvement of the lattice model after outlier rejection.
Results shown here represent the analysis of a single rotation image. r is the r.m.s. difference between observed and predicted spot positions as defined in equation (1). is a standard deviation fitted to the 40% of observations (before outlier rejection) that are closest to their corresponding predicted positions, as illustrated in Fig. 3

Figure 4
Indexing of two superimposed lysozyme diffraction patterns. (a) The primary lattice model (yellow boxes) after an initial round of refinement, where colored dots represent the spots included in the lattice refinement.
Color coding is the same as used in Figs. 2 and 3. Red spots are the subset safely considered to be part of the primary lattice, while pink spots are the outliers identified by equation (6)  which is modeled on the 40% of observations with the lowest Ár values (red spots). In all cases examined, choosing the bestconforming subset of Bragg spots in this manner shows that the errors in the spot positions are described reasonably well by a Rayleigh distribution [equations (3) and (4)]. The modeled parameter describing the distribution () is not very sensitive to the exact fraction of spots included; comparable results are obtained upon inclusion of anywhere from 25 to 55% of the spots. A command-line option was added to labelit.index to set this parameter (see x4).
In the case of sample 1vk8, a full 80% of the chosen spots coincide with the Rayleigh distribution (blue line, Fig. 3c), while the remaining observations (those above the green cutoff line) depart markedly from this ideal. These are interpreted as outliers, and are thus removed from the list of observations. Another round of lattice refinement ensues, this time producing Áx and Áy deviations with a much tighter distribution about the model (Fig. 3b). Also, an ordered plot of Ár values shows that the entire set of remaining observations adheres extremely well to a Rayleigh distribution (Fig. 3d). The improvement in the model is not confined to the 1vk8 case; indeed, the r.m.s. deviation between model and experiment ( r ) decreases after a second round of model refinement, sometimes dramatically, in every instance where outliers are rejected (Table 1).
For the crystals containing two lattices (3bgu and lysozyme), the initial lattice refinement target (for refining the predominant lattice) is contaminated with spot observations from the second lattice. The r.m.s. spot deviations (887 and 586 mm, respectively) are therefore exceptionally large. However, the statistical rejection of outliers successfully removes these observations, such that the second round of lattice refinement produces much lower spot deviations (66 and 100 mm) that are typical of single-crystal samples. Furthermore, the identification of the outlying spots provides an opportunity to index the second lattice: a separate round of autoindexing based on just the outliers (Fig. 1b) produces good lattice models, although the spot deviations of 586 and 321 mm are fairly high. The significance of this result is that it is not necessary to determine manually which spots to identify with the second lattice prior to autoindexing based on painstaking visual inspection. The two superimposed lysozyme lattices discerned in this manner are depicted in Fig. 4.
As explained in x1 above, the automated crystal-screening experiments that are the intended focus of this paper often rely on acquiring two images spaced 90 apart on the y-rotational axis. We simulated such data sets by selecting widely spaced images from the JCSG archive. Statistical rejection of outliers from these two-image data sets (data not shown) did not differ remarkably from the one-image trials listed in Table 1, either in the ability to improve the model fit or the ability to conform the remaining spots to a Rayleigh model. Certain minor details did change when considering the two images together, e.g. the additional lattice was not as pronounced in the second 3bgu image and the spot splitting was not as severe in the second 1vr8 image. In other data sets such as 1vm6, the percentage of spots rejected as outliers increased slightly, apparently because the lattice model proved to be a better fit for one of the two images (data not shown).

Discussion
Within the crystal-screening paradigm, tens or hundreds of similar crystal samples may be evaluated for optimal diffraction before one or two are selected for data collection (Page et al., 2005). Automated software tools can facilitate this process by providing measures of sample quality (such as the signal-tonoise ratio, limiting resolution, r.m.s. deviation, mosaicity and number of ice-ring artifacts) in real time as the data are acquired. Except for the quantification of ice rings, these standard quality measures focus on the diffracted lattice itself. What have been lacking are reliable measures for the interference caused by non-lattice artifacts, which might degrade the integration of the Bragg signal or the subtraction of background, thus reducing the quality of the structure factors. Two of the statistics presented in Table 1 appear to capture this information. Firstly, the number of outliers (expressed as a percentage of total Bragg spot candidates after ice rings have been removed) measures the prevalence of signals that do not belong to the canonical lattice. Secondly, the severity of the outliers (computed as the Fig. 3c area bounded by the observed spots, the blue curve and the green cutoff line, in units of ) gauges how far the outliers are from the lattice. High values of these measures in Table 1 correlate with the visual recognition of stray spots in the 1vk8, 3bgu, 1vr8 and lysozyme images. These statistics may therefore prove useful for crystal screening in combination with the standard measures mentioned above.
The new methods described here -statistical rejection of outliers followed by re-refinement of the lattice model -have been added to the default flowchart within the autoindexing program labelit.index (Fig. 1b). Although the program normally operates with image data as the only input, a few command-line options have been added (described in the online manual) for generating plots such as those shown in Figs. 2-4, or for disabling outlier rejection altogether. Presently, the indexing of a second lattice is not part of the default procedure. Rather, it is intended that a high percentage of outliers will alert the user to the possibility of a second lattice, whereupon labelit.index may be run a second time with a command-line flag set to follow the 'additional lattice' indexing path shown in Fig. 1(b).
The discovery of additional lattices performed here and elsewhere (Buts et al., 2004) raises the question of how to handle data reduction. Standard programs such as MOSFLM, HKL, XDS and d*TREK treat only one lattice at a time, so separate data-reduction runs will have to be performed to integrate the Bragg signals from each lattice. Care must be taken with pairs of reflections from different lattices that either overlap or come close enough such that the background subtraction performed for one lattice is biased by the Bragg signal from another lattice. One approach, implemented in the program UNTANGLE (Buts et al., 2004), is to enumerate and reject problematic pairs of reflections before the data sets are merged together. Other programs such as PROW (Bourgeois et al., 1998) and EVAL15 (Schreurs et al., 2010) preserve the information that is present in near or overlapping signals by jointly integrating and deconvoluting the neighboring spots.
The need for software tools to analyze multiple lattices is likely to increase. In contrast with the present practice of transferring individual crystals to the goniometer stage, some new sample-preparation technologies have focused on the in situ collection of data without removing crystals from their growth chamber. These protocols, involving samples grown in capillaries (Yadav et al., 2005) or on microfluidic plates (Ng et al., 2008;Gerdts et al., 2008;Emamzadah et al., 2009), omit the step where crystals are separated from each other, so it is quite possible for the incident X-ray beam to impinge on two crystals simultaneously. The early outlier-based detection of additional lattices (x3) could either be used automatically to trigger special data-reduction protocols, or be deployed as part of a system of safeguards to avoid collecting such data sets altogether.
At the same time, one must keep in mind the assumptions underlying the present methods. Autoindexing must succeed prior to outlier analysis, so even if there are multiple lattices, one of them must be sufficiently predominant so that its unitcell axes can be discerned by the autoindexing methodology (Steller et al., 1997, in the case of labelit.index). Outlier detection relies on the assumption that there is a safely known subset of observations (we assume 40% here) that truly coincides with the main lattice. Finally, if the outliers are used as a basis for determining a second lattice, we again assume that there is a predominant signal to be found, so it may be difficult to apply the present method to a superposition of three or more lattices.
The procedures described here are included in the software package LABELIT, available for download by non-commercial users at http://cci.lbl.gov/labelit, and for licensing by commercial users. LABELIT is also included with the PHENIX package (Adams et al., , 2010, available for download at http://www.phenix-online.org.