Fully automatic characterization and data collection from crystals of biological macromolecules

A fully automatic system has been developed that performs X-ray centring and characterization of, and data collection from, large numbers of cryocooled crystals without human intervention.


Introduction
Automation has been a key development in structural biology, from protein production and purification to crystallization, data collection and reduction, structure solution and model building (Aishima et al., 2010;Arzt et al., 2005;Beteva et al., 2006;Bowler et al., 2010;Dauter, 1999;Elsliger et al., 2010;Ferrer et al., 2013;Heinemann et al., 2003;Holton & Alber, 2004;Kabsch, 2010;Monaco et al., 2013;Ohana et al., 2004;Panjikar et al., 2005;Soltis et al., 2008;Perrakis et al., 1999;van den Bedem et al., 2011). In combination with advances in molecular biology and computation, this has allowed not only large numbers of structures to be solved (Berman et al., 2013) but also some incredibly challenging complexes to be revealed in detail (Ben-Shem et al., 2003Selmer et al., 2006;Warne et al., 2008;Zouni et al., 2001). Efforts are now being aimed at the areas that still require manual intervention, in particular targeting the mounting of crystals on sample supports and the centring of crystals in an X-ray beam, two steps which still require considerable human involvement (Cipriani et al., 2012;Deller & Rupp, 2014).
Fully automatic data collection at synchrotron sources has been discussed for many years within the macromolecular crystallography (MX) community, and efforts in automating the various steps have led to great advances over the last ten years. Developments in crystal-characterization software (Incardona et al., 2009;Sauter et al., 2004) and automatic sample changers (Cipriani et al., 2006;Cohen et al., 2002;Jacquamet et al., 2009;Pohl et al., 2004;Ueno et al., 2004), in combination with standardization of sample mounts (Cipriani et al., 2006), has greatly increased productivity. In addition to these developments, mail-in crystallography has been introduced at several synchrotron sites, with the pharmaceutical industry making the most use of these services (Malbet- Monaco et al., 2013;Okazaki et al., 2008;Robinson et al., 2006). Pioneering efforts in the full automation of the data-collection process have been made at the LRL-CAT beamline at the APS, USA and the SSRL, California, USA, which offer an automatic service primarily for the pharmaceutical industry, but with a significant proportion dedicated to academic users (Wasserman et al., 2012;Tsai et al., 2013). However, as sample centring is based on optical image processing, tight restrictions on sample mounting are required in order to provide this service.
The increasing number of challenging projects and the movement of the academic community into the development of small-molecule inhibitor projects (Heikkila et al., 2009) has led to a huge increase in the number of samples for MX, as screening both for crystal quality and for the presence of ligands has become more important. Many of the steps in this process are repetitive and, with modern software and detectors, can often be performed better automatically, especially in cases where crystals are embedded in opaque media and cannot be centred visually. This stimulated the development of a system with sufficient flexibility for fully automatic characterization and data collection from a wide variety of samples, allowing very few restrictions to be put in place. At the heart of this new system is a routine to locate crystals and centre optimal volumes to the beam. Automated routines based on scanning across the face of sample supports and coupling the output to data analysis (Aishima et al., 2010;Bowler et al., 2010;Cherezov et al., 2009;Hilgart et al., 2011) have been implemented at many synchrotron sources. This approach, variously known as mesh, grid or raster scanning, is highly effective in the location of crystals and in defining the best part of a crystal in one direction. However, these routines lack the ability to define the best diffraction volumes needed to be centred to the X-ray beam, and still require user input to define the mesh. Here, we describe routines to define a mesh area, locate the optimal diffraction volume within a support, or the crystals themselves, characterize the volume and subsequently collect an optimized data set according to user requirements. The routines are deployed on the new highly automated ESRF beamline MASSIF-1 and are reliant on well established technology to deliver a stable beam and new developments in sample handling (the RoboDiff sample changer and goniometer) developed at the ESRF (Nurizzo et al., in preparation). In combination with the ESRF MX beamline environment, which includes automatic beam delivery (Gabadinho et al., 2010), data processing and analysis (Brockhauser et al., 2012;Incardona et al., 2009;Monaco et al., 2013), and a database (ISPyB) for sample tracking and data display (Delageniè re et al., 2011), the prospect of analysing many more samples than can be performed manually becomes possible, as the system runs unsupervised and presents data in an intuitive manner. With over 8000 user samples now research papers 1758 Svensson et al. Automatic characterization and data collection processed, the system has been thoroughly tested and presents additional data from the consistent collection of information from multiple crystals from the same project. This allows the comparison of many parameters that are not usually recorded during experiments, such as exact crystal dimensions and diffraction variation within crystals (Bowler & Bowler, 2014  Image processing for the automesh algorithm. A series of images are taken 30 apart; examples are shown at 0 (a) and 90 (b). The background is then subtracted (c) and thresholding is used to detect the edges of the sample mount (d). The edges are then measured as a function of ! and pixels in order to determine the maximum and minimum ! orientations (e) and an area is selected for the mesh scan ( f ).
The system is not designed to replace all user visits to the synchrotron, but rather to perform the repetitive work of screening crystals or collecting data, freeing researchers to spend time on more challenging data-collection problems or to study the underlying biology.

Experimental details and results
The system presented here is based on earlier developments made in the automation of complex experiments (Brockhauser et al., 2012). The process is executed by a Passerelle workflow engine running on a server called the Beamline Expert System (BES). The BES is a customized version of Passerelle EDM (http://isencia.be/passerelle-edm-en) running on a central computing cluster independent of the beamline.
The beamline-control GUI MXCuBE2 (Gabadinho et al., 2010) starts the process by connecting to the BES via a webservice call. The workflow requests operations, such as data collections and motor movements, by connecting back to MXCuBE2 through an XML-RPC server. The process begins once a sample has been mounted and the loop optically centred; the steps and major decision points are shown schematically in Fig. 1. Each of the major steps is described in detail below.

MASSIF-1
The automatic routine described here, and the modules such as X-ray centring, have been designed to run on any of the ESRF MX beamlines. However, it is only run as a dedicated service on the new highly automated beamline MASSIF-1. MASSIF-1 is a ESRF undulator beamline dedicated to the fully automatic characterization of and data collection from crystals of macromolecules. The beamline uses an artificial asymmetric Laue [110] diamond as a monochromator and a compound refractive lens (CRL) as the only focusing element. This simple optical setup produces a highly stable beam of 100 Â 65 mm (H Â V, FWHM) with a typical flux of 2 Â 10 12 photons s À1 at a fixed energy of 12.8 keV, allowing structure solution by singlewavelength anomalous diffraction (SAD). The beamline is equipped with a sample changer that also acts as a diffractometer (RoboDiff), a high-capacity dewar able to store 240 SPINE standard samples and a PILATUS3 2M detector. The beam diameter can be tailored to the crystal size using a series of apertures. The sphere of confusion (SOC) of the RoboDiff is less than 2 mm in diameter over 360 , allowing a minimum aperture of 10 mm to be used. The beamline will be fully described elsewhere (Bowler et al., in preparation).

Automatic determination of mesh parameters
Crystals vary widely in their morphology and can be mounted in a wide variety of supports. Optical centring will place the  Automesh analysis of various sample holders. (a) A mount produced by laser-induced photoablation using the CrystalDirect robot (Cipriani et al., 2012). (b) A bent loop. (c) A micromesh. The final area selected for the mesh scan is shown as a red box (right). In each case the entire mount is covered by the mesh scan and the vertical size is minimized in order to reduce the time taken for the scan. support at the beam position, but the crystal will not necessarily be in the beam over the full rotation range. It is therefore necessary to scan the loop through the X-ray beam in order to accurately locate small crystals and to define the optimum part of larger crystals. Initially, a fixed grid size was used. The fixed grid was sufficient to cover the largest loops available and also take into account that the loop might not be optimally centred. However, the fixed grid was substituted with an automatically determined grid for two main reasons: first, for small loops time is lost in scanning empty space, and second, for some types of sample holders and bent loops a fixed size was often too small.
The automatic mesh algorithm can be used in two scenarios, where either the minimum or the maximum sample-mount size is determined as a function of rotation angle. The minimum grid size is used for sample holders containing a single crystal. The maximum grid size is used when locating more than one crystal in the sample holder in order to optimally detect all crystals. The input for the automatic determination of the mesh-scan area is a series of 12 images acquired from the online video microscope where the goniometer rotation axis is rotated by 30 between each image ( Fig. 2 and Supplementary Movie). For each image, a background subtraction, Gaussian smoothing and the application of a threshold to find the contour (Figs. 2c and 2d) is performed. The contours of all 12 images are then analysed and the images corresponding to the minimum and maximum vertical sample support size are selected (Fig. 2e). The corresponding rotation angle at which the mesh scan should be carried out and the size of the grid are returned as results. A snapshot of the area is also taken to ensure that the loop was covered in the mesh (Fig. 2f ). X-ray centring. (a) Once optical centring is completed, a crystal in a loop (red) will rotate about the goniometer centre of rotation (!). A mesh scan determines the horizontal and vertical translations required to bring the crystal to the centre of rotation at this angle using the goniometer horizontal translation and the centring-table motors (b, c). Rotating the spindle by 90 (d) and performing a vertical scan determines the final movement of the centring-table motors required to place the optimum diffraction volume of the crystal at the centre of rotation of the spindle (e, f ). In cases where the centre of rotation of the spindle is not placed at the beam position, it is determined by two vertical scans separated by 180 (g, h). The difference in position between the scans determines the vertical displacement of the goniometer required to place the centre of rotation of the spindle on the beam position (i, j). This example shows values of ! of 0 and 90 for convenience; arbitrary starting values of ! can be used.
The algorithm has allowed the automatic and optimal handling of many types of sample holder (Fig. 3). For small sample holders a minimum grid size is determined and for exotic sample holders, or bent loops, the grid is optimized to cover the whole sample holder while minimizing the mesh size. When the maximum size is selected, multiple crystals can be located and centred separately.

X-ray centring
The goal of X-ray centring is to locate and define, in three dimensions, the point within the crystal with the best diffraction signal and to place this point on the rotation axis. A twodimensional mesh combined with a line scan, using only the centring-table and goniometer-translation motors, allows a point within the crystal to be centred over 360 while only sampling two orientations 90 apart (Fig. 4), making the routine extremely fast. The X-ray dose is minimized, with each position within the mesh receiving approximately 1 Gy, less than a single image in a complete data collection.
The two-dimensional mesh scan is performed over the area previously defined by analysis of the sample mount. The grid is divided into points, where the distances between points in the horizontal and vertical directions are defined by the beam size (either full beam or defined by the aperture preselected by the user; see x2.5). As the mesh is typically larger horizontally, and a delay is associated between measuring horizontal lines, the mesh is oversampled in the horizontal direction only. For each position, an image is acquired and processed using the program Dozor. The result of Dozor is a scalar value which estimates the diffraction signal of each image by combining an analysis of Bragg peaks with an intensity distribution as a function of resolution. Once all images have been processed by Dozor, the two-dimensional array containing the measured diffraction signal at each point is analysed (Fig. 4c). The position of the best point is determined by first applying a threshold (50% of the maximum) to the signal and then performing a centre-of-mass calculation over any connected regions containing the strongest diffraction signal. The resulting position can be a fraction of the distance between the grid points. For crystals smaller than the vertical beam size it is necessary to perform a small vertical line scan to accurately position the sample as the two-dimensional mesh is not oversampled in the vertical direction. The decision to make this extra line scan is taken in cases where the best diffraction signal after the two-dimensional scan extends only one step in the vertical direction. If no signal is detected at this stage then the sample is unmounted.
After successful determination of the optimal position, the grid coordinates are translated into relative movements of the centring-table and goniometer-translation motors and the sample is moved to this position (Fig. 4b). For a well calibrated goniometer this allows the sample to be centred over 360 by determining the position 90 away from the mesh scan with a single line scan (Figs. 4d, 4e and 4f ). However, in the case of MASSIF-1, the RoboDiff goniometer is both sample changer and goniometer. Therefore, the vertical position of the rota-tion axis must be determined and aligned to the X-ray beam, as it fluctuates between mounting cycles. This is determined using two scans of the goniometer with vertical translation 200 mm above and below the beam position performed 180 away from each other (Figs. 4g-j). The centre of mass is then taken for each scan and the midpoint between these values defines the vertical offset required to ensure that a crystal is fully centred. This calculation uses segmentation to define the object of interest and the threshold is set at 25% of the maximum in order to define the crystal edges rather than the optimum diffraction points. Once determined, the axis is moved to the beam position and a corresponding vertical displacement of the sample is performed in order to maintain the optimal diffraction volume centred to the beam. The sample is then rotated 90 and a vertical line scan is performed on the centring table 400 mm above and below the beam position. After the sample has then been moved to the position with the strongest diffraction signal, it is centred on the rotation axis in the beam over the full rotation range (Fig. 4).
The X-ray centring procedure also provides measurements of crystal dimensions (height, width and depth) as a function of the rotation angle !. This provides a simple model of the crystal volume used as input for strategy calculation (see x2.4).
The automesh algorithm will generally orient a crystal such that its smallest vertical dimension is parallel to the rotation axis, allowing maximum and minimum dimensions to be defined. Occasionally, owing to sample mounting, crystals will not be optimally oriented, leading to small inaccuracies in the model of the crystal volume. However, having a rough estimate of volume relative to the beam size for correct dose estimations in strategy calculation is of enormous benefit and is superior to default values that generally bear no relationship to the actual crystal volume.

Strategy calculation and data collection
At this stage, the workflow can be programmed to start a data collection with pre-defined parameters or to collect reference images in order to calculate an optimized strategy. The pre-defined parameters available are either a 180 or 360 rotation range with a 0.2 oscillation width and an exposure time based on a dose calculated from the measured flux (final dose of $31.3 MGy). After collection of data is completed, the workflow stops and the next sample is mounted.
If the workflow has not been programmed to start a fixed data-collection strategy, four reference images are collected 90 apart with predefined parameters of 100% transmission, 1 oscillation and an exposure time of 0.1 s, corresponding to a total dose of 10 Gy. The reference images are then analysed by the program EDNA (Incardona et al., 2009), which uses a combination of LABELIT (Sauter et al., 2004), MOSLFM (Leslie, 2006), RADDOSE (Paithankar et al., 2009) and BEST (Bourenkov & Popov, 2010) to index the images and calculate an optimized data-collection strategy taking radiation damage into account. If reference images are not optimally collected (for example, using an incorrect exposure time, oscillation width and/or detector resolution), it can be difficult to calcu-research papers late the optimal data-collection strategy. EDNA/BEST can suggest an optimized detector distance if required; therefore, the ability to automatically re-collect reference images with a different detector distance has been added to this process. BEST can predict from B-factor estimation the maximum resolution obtainable from a crystal. If the detector distance is below the value that BEST predicts, it will suggest a new distance at which to collect images. If this is the case, the characterization process is repeated with the new detector distance. In this way, optimal data are always collected. All data are subsequently processed automatically  using pipelines based on XDS (Kabsch, 2010) and, if a significant anomalous signal is detected, are also fed into SAD pipelines for structure determination. The results from all steps in the process are displayed in ISPyB. After collection of data is completed the workflow stops and the next sample can be mounted.

Pipelines and the diffraction plan
The automation of processes necessarily involves standardization, such as the requirement for SPINE standard sample mounts. However, in order for an automatic data-collection service to be successful for a range of projects, considerable flexibility in data-collection parameters must be allowed. This has been implemented by combining different workflows with user input via the diffraction plan in the beamline database ISPyB. Here, information and specific data-collection requirements for each sample can be entered and pipelines for differing data-collection strategies can be selected (Table 1). Where no entry is given, a default value is used (Table 1). In this way, the characterization and data collection can be tailored to each individual sample.
Essential entries in the diffraction plan are the sample acronym (the protein being studied that has been approved by the ESRF safety group) and a unique sample name. Together, these items determine the nomenclature for the directories and diffraction images and, via the acronym, link to information about the crystal, such as space group and unit-cell parameters, if they have been entered. These are then used during the characterization step for strategy calculation and at the autoprocessing stage. The type of experiment can also be defined in the diffraction plan, with the user selecting default data collection (MXPressO), data collection using an EDNA strategy [MXPressE; a strategy optimized for 100% completeness with an hI/(I)i of 2 in the outer resolution shell taking radiation damage into account], SAD-optimized data collection [MXPressE SAD; a strategy optimized for high redundancy (360 ) with the resolution set to where the R merge between Bijvoet pairs is 5%, taking radiation damage into account] or characterization of diffraction properties without data collection (MXScore). The resolution that has already been observed is used as the basis for the detector distance for the mesh and line scans where characterization images are collected and is also used as the resolution for data collections where an EDNA strategy is not requested or indexing fails. A required resolution can also be  Table 1 Diffraction-plan entries.
The diffraction plan is filled for each sample in ISPyB and is used by the system to tailor the experiment to the sample. In the absence of an entry, sensible default values are used.  entered; this value is used as a threshold below which data collection is not performed. Finally, the experiment can be further tailored to the samples by defining a required multiplicity or completeness or by specifying a certain radiationsensitivity or beam size. The ability to add specific information at the level of the sample leads to a highly flexible system without compromising automation.

Error handling
One of the most important aspects of an automated system is the correct handling of errors. Errors arising from robotic mounting of a sample are handled by low-level software which is designed to escape from the error and mount the next sample. Workflows are also paused when there is no beam (by monitoring the flux on a diode near the sample position) or the cryostream temperature rises above 120 K.
Each workflow executed by the BES has an overall error monitor. This ensures that workflows end cleanly in cases of software errors and automatically send an error report to a pre-configured mailing list. In the case of software errors, the current sample is unmounted and the process moves on to the next sample.
More specific error handling exists for particular steps in the process. For the automatic determination of the area of mesh scans, if the algorithm fails (<1% for more than 1000 samples analysed) a default grid of 400 Â 200 mm is returned. Errors can also occur during the selection of the optimum diffraction volume. As the centre of mass is often between grid points, this can sometimes result in a poor position being selected (Fig. 5a). In order to avoid this scenario, the nearest image to the selected point is analysed and if the signal is less than 50% of the strongest position the process is repeated with the threshold increased by 5% until the selected position has the desired score (Fig. 5b). In cases where EDNA characterization fails to produce a strategy, because, for example, the images cannot be indexed or integrated or BEST fails to calculate the intensity distribution correctly, a default strategy is used (180 rotation, 0.2 oscillation, final dose $31.3 MGy). Where a SAD strategy has been requested and fails, the default strategy is for 360 with the same final absorbed dose. In this way, even poorly diffracting samples have an associated data set, which can often yield useful information.

Chronometry
The time taken to process each sample is important as it determines the number of samples that can be processed during scheduled beamtime. This varies depending on the size and shape of the mount and the size of the sample, as these will affect the time taken for the mesh scan and the eventual data-collection time. Fig. 6 and Table 2  Log-normal distributions of durations for major steps in the automatic treatment of the first 1240 samples processed in 2015 on MASSIF-1.  Table 3 The use of crystal dimensions in strategy calculation.
Differences in the calculated strategy to obtain native data with an hI/(I )i of 3.0 in the outer resolution shell based on the measured crystal size or the default crystal size in RADDOSE and BEST. The strategies are taken from the largest and smallest crystals processed with a strategy on MASSIF-1 in 2015.
Crystal larger than beam EDNA strategies and default collection, respectively). The time taken for default data collection has a very tight distribution, reflecting only small changes in the flux owing to differing ring currents. Data collections based on EDNA strategy calculations are far more varied and often take much less time (most frequently 72 s as opposed to 113 s), but there are also many with much longer data-collection times, showing that longer exposure times can be used when crystal size permits (Table 3). This will lead to higher signal-to-noise ratios and demonstrates the usefulness of data-collection strategy calculation.

Discussion
The ability to collect data unattended should make the use of both available beam time and, it is hoped, the time of the users much more efficient. While completely automatic data collection can remove the more mundane tasks in data collection, it also presents new scientific opportunities. The degree of automation allows the experiments to be performed consistently with available details that are not normally used when scientists are present. An important example of this is the precise measurement of crystal dimensions. The X-ray centring procedure described is performed on all samples and not only consistently centres crystals to the optimal diffraction volume but also provides accurate information on crystal size. This information is then used by RADDOSE (Paithankar et al., 2009) and BEST (Bourenkov & Popov, 2010) during EDNA characterization to calculate the absorbed dose used in a data collection. Properly defining crystal dimensions makes a considerable difference in the calculation of data-collection strategies (Table 3), leading to increased resolution and signalto-noise ratios for crystals larger than the X-ray beam and reduced radiation damage for crystals smaller than the X-ray beam. The accurate measurement of crystal dimensions coupled to the real-time measurement of the flux of the X-ray beam leads to the best possible data being collected for a given sample when EDNA characterization is used. The consistent measurement of all parameters related to not only each experiment but also to whole projects will allow information to be fed back into future experiments. For example, the analysis of the crystal dimensions the first 1240 crystals processed on MASSIF-1 in 2015 is informative (Table 4 and Fig. 7). The average values for crystal volume and individual dimensions are as expected, being roughly 100 mm. However, the histogram of measured volumes is striking in that most crystals are much smaller than this average (Fig. 7a), and this is also reflected in the distributions of individual dimensions, with peaks below 50 mm in all three dimensions ( Fig. 7b and Table 4). Over 44% of the crystals processed on MASSIF-1 are below 250 000 mm in volume, equivalent to a cube with sides of 63 mm. The most commonly observed volume is 20 209 mm 3 (edges of $27 mm), which is within two orders of magnitude of the theoretical limit of a 8.3 mm diameter required to obtain 2 Å resolution data from a 'realistic' lysozyme crystal (Holton & Frankel, 2010). In fact, half of the crystals so far processed on MASSIF-1 in 2015 are   Distributions of crystal volumes and dimensions observed for the first 1240 samples processed in 2015 on MASSIF-1. (a) Histogram of crystal volumes; 30% of samples fall below the average volume of $10 6 mm 3 (red line). Theoretical crystal volumes (Holton & Frankel, 2010) required for data at 2.0 Å (blue line) or 2.8 Å (green line) resolution for a protein of 79 kDa on MASSIF-1 are shown for reference. Volume data are shown only to 1.2 Â 10 7 mm 3 for clarity. (b) Log-normal distributions of crystal dimensions. The dimensions are x, the measured crystal length parallel to the spindle axis, y, the length orthogonal to the spindle axis, and z, the length orthogonal to the spindle axis 90 away in !.
smaller in volume than the size theoretically required to obtain 2.8 Å resolution data from a protein crystal with an entity of 79 kDa (PDB average) in the asymmetric unit (304 980 mm 3 ; Holton & Frankel, 2010), showing that the beamline is working at the leading edge of scientific projects and is not just for well diffracting samples; it is also heavily used for screening. This analysis has led to the selection of a default beam diameter of 50 mm, as this most accurately reflects the size of most crystals observed so far on the beamline. The distributions also reflect the tendency of crystals to lie along the spindle axis owing to the way they are mounted, leading to the height of the crystals having the smallest range (Fig. 7b). Analysis of this type should also prove useful for specific projects. Reports can be generated showing crystal size distribution and variation in diffraction quality (Bowler & Bowler, 2014) for each protein. This information can then be fed back into projects, informing, for example, on whether larger or smaller crystals are better and the optimal selection of beam size, depending on variation in diffraction.
Fully automatic characterization and data collection from large numbers of crystals is a new tool for structural biologists. The process is used heavily at opposing ends of the project spectrum. In the initial stages of projects, preliminary hits from high-throughput crystallization or large numbers of poorquality crystals need to be screened, a time-intensive stage that results in little data collection but requires rapid feedback. The opposing use is the collection of hundreds of data sets from well established systems, as is the case in smallmolecule fragment-screening projects. Again, this process is labour-intensive but requires more effort in the interpretation of results and again requires rapid feedback. The process described here facilitates both stages, allowing effort to be redirected from the beamline to the laboratory. In combination with recent developments in the automation of crystal mounting (Cipriani et al., 2012), it is hoped that project lifecycles can be reduced, leading to an increase in the quantity and the quality of structural results available.