Recent improvements to the automatic characterization and data collection algorithms on MASSIF-1

Macromolecular crystallography (MX) is now a mature and widely used technique essential in the understanding of biology and medicine. Increases in computing power combined with robotics have enabled not only large numbers of samples to be screened and characterised but also better decisions to be taken on data collection itself. This led to the development of MASSIF-1 at the ESRF, the world’s first beamline to run fully automatically while making intelligent decisions taking user requirements into account. Since opening in late 2014 the beamline has processed over 39,000 samples. Improvements have been made to the speed of the sample handling robotics and to error management within the software routines. The workflows initially put in place, while highly innovative at the time, have been expanded to include increased complexity and additional intelligence using the information gathered during characterisation. This includes adapting the beam diameter dynamically to match the diffraction volume within the crystal. Complex multi-position and multi-crystal data collections are now also integrated into the selection of experiments available. This has led to increased data quality and throughput, allowing even the most challenging samples to be treated automatically.

Synopsis
Significant improvements in the sample location, characterisation and data collection algorithms on the autonomous ESRF beamline MASSIF-1 are described. The workflows now include dynamic beam diameter adjustment and multi-position and multi-crystal data collections.


Introduction
Automation is transforming the way scientific data are collected, allowing large amounts of high quality data to be gathered in a consistent manner (Quintana & Plätzer, 2015; Foster, 2005). Advances in robotics and software have been key in these developments and have had a particular impact on structural biology, allowing multiple constructs to be screened and purified (Camper & Viola, 2009; Hart & Waldo, 2013; Vijayachandran et al., 2011), huge numbers of crystallisation experiments to be performed (Elsliger et al., 2010; Ferrer et al., 2013; Heinemann et al., 2003; Joachimiak, 2009; Calero et al., 2014), samples to be mounted at synchrotrons (Cipriani et al., 2006; Cohen et al., 2002; Jacquamet et al., 2009; Papp et al., 2017; Snell et al., 2004), data to be analysed and processed (Bourenkov & Popov, 2010; Holton & Alber, 2004; Incardona et al., 2009; Leslie et al., 2002; Monaco et al., 2013; Winter, 2010) and the entire PDB to be validated (Joosten et al., 2012). The combination of robotic sample mounting and on-line data analysis has been particularly important in macromolecular crystallography (MX) as it has saved time, allowed large numbers of samples to be screened and enabled the remote operation of beamlines. However, despite these advances, a human presence is still required to sequence actions. Pioneering beamlines that fully automated the process, such as LRL-CAT at the APS (Wasserman et al., 2012) and the SSRL MX beamlines (Tsai et al., 2013), removed the need for human presence. However, because they rely on optical loop centring, restrictions have to be placed on the size of crystals, and they tend to be used for robust, well diffracting samples, generally those from proprietary research for the pharmaceutical industry.
In 2014 the ESRF beamline MASSIF-1 opened to users as the first beamline to fully automate MX data collection, including sample location and complex decision making algorithms. The combination of sample location and characterisation allows even the smallest and most weakly diffracting samples to be treated automatically. This opened full automation to any sample presented in any mount and has provided a new tool to structural biologists, allowing the process of collecting hundreds of data sets or screening hundreds of crystals to be 'outsourced', freeing their time and often collecting better data. At the time of writing, the beamline has processed more than 39,000 samples representing a wide range of projects, from those that require extensive screening to find the best diffracting crystal (Na et al., 2017; Sorigué et al., 2017; Naschberger et al., 2017) to small molecule fragment screening (Cheeseman et al., 2017; Hiruma et al., 2017) and experimental phasing at high and low resolutions (Kharde et al., 2015; Muir et al., 2016). The beamline is able to deal with a wide range of samples by combining parameters provided by the user with information gathered during processing. The workflows initially put in place have performed very well but many enhancements remained possible.
Here, we describe how the algorithms have been improved to increase the amount and quality of data collected on MASSIF-1. Additional subroutines have been added to monitor and correct errors in centring, to account for low resolution data collection and to dynamically adjust the beam diameter to match homogeneous diffraction volumes. In combination with new multiple position and multiple crystal data collection workflows, fully automatic data collection is now possible for the most challenging samples.

Hardware improvements
One of the most time consuming, and important, steps in the X-ray centring process is the initial mesh scan that locates and characterises the crystal. When first implemented on MASSIF-1, a rotation of omega was included in the scan. This implementation required that triggering of the acquisition of images was instigated by the omega axis, meaning that each line of the mesh was treated as a separate data collection. The preparation required between data collections led to additional time being taken for the scan. We have now implemented a scan that includes no rotation of the omega axis and requires only the movements of the high precision Y/Z stage beneath the RoboDiff. This allows the triggering of acquisition to be made by the motor position, allowing the whole mesh to be launched as a single data collection. This method was implemented in 2017 and comparison of the elapsed times for mesh scans performed in the last 2 months of 2016 with the first 2 months of 2017 shows that the time required for these scans has been reduced by an average of 1 minute (Figure 1).

Dynamic beam sizing
One of the benefits of running a completely automated system is the ability to collect large amounts of data on samples and use these data in improving strategies for data collection. We initially realised that the volumes of all crystals were determined during the X-ray centring routine and this information was subsequently included in the strategy calculation, having the biggest effect on the calculation of the maximum dose given to a crystal during data collection (Svensson et al., 2015). Additionally, these measurements provided a distribution of crystal volumes, allowing us to use a default beam diameter of 50 µm, as this was the crystal dimension most frequently observed on the beamline. A specific beam diameter can be selected on a per sample basis in the diffraction plan in ISPyB (Delagenière et al., 2011); however, this option is usually used when users are sure that crystal volumes are significantly smaller than the default beam diameter (Figure 2). Using the information gathered during the mesh scan we can determine an optimised beam diameter.
By accurately matching beam diameter to the crystal, it has been shown that the background can be dramatically reduced (Holton & Frankel, 2010; Moukhametzianov et al., 2008). This is particularly striking when crystals are very small (Evans et al., 2011) but if a crystal is large, the additional diffraction power should not be wasted. This also has to be balanced with the degree of variability within each crystal (Bowler & Bowler, 2014; Bowler et al., 2010; Pozharski, 2012).
We have now introduced a dynamic beam diameter adjustment into all workflows running on MASSIF-1 where no value has been pre-selected by the user. The X-ray centring routine determines the crystal position relative to the beam, the crystal dimensions and the best homogeneous volume within the crystal. The centre of mass of this volume is then used as the centring position and it is the dimensions of this volume that are used to select the beam diameter. There are 5 beam diameters available on MASSIF-1 (100 µm, 50 µm, 30 µm, 15 µm and 10 µm) and the smallest vertical volume dimension is used to select the aperture that matches most closely. In this way, the largest volume can be illuminated without increasing background or 'contaminating' the diffraction from variable areas. All steps during X-ray centring are performed with the 50 µm aperture. As the scans are performed with an overlap, the smallest dimension that can be measured is ~20 µm, meaning that the 10 µm option is only used when selected by the user.
Once X-ray centring is completed, the new aperture diameter is selected and characterisation images are collected using this diameter. The strategy calculation will include the new flux for the aperture as well as the crystal volume determined during the centring procedure. Since introducing the adaptable beam diameter, the system has most frequently changed the beam size to 30 µm (Figure 2), followed by the 100 µm and the 15 µm diameters. Half of all data collections are still performed with a diameter of 50 µm, reinforcing its choice as the default value.
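The aperture selection described above can be illustrated with a minimal Python sketch. It is not the beamline's code: the function name, the tie-breaking rule and the way the user override is passed are assumptions made for illustration; only the list of apertures and the rule that 10 µm is used solely on user request come from the text.

```python
# Apertures available on MASSIF-1 in micrometres, as listed in the text.
APERTURES_UM = (100, 50, 30, 15, 10)


def select_beam_diameter(smallest_vertical_um, user_choice_um=None):
    """Return the aperture (µm) that most closely matches the volume.

    smallest_vertical_um: smallest vertical dimension of the best
        homogeneous volume found during X-ray centring.
    user_choice_um: diameter pre-selected by the user, if any
        (hypothetical parameter standing in for the diffraction plan).
    """
    if user_choice_um is not None:
        # A user-selected diameter always takes precedence.
        return user_choice_um
    # The 10 µm aperture is only used when selected by the user.
    candidates = [a for a in APERTURES_UM if a != 10]
    # Assumption: ties are resolved in favour of the smaller aperture.
    return min(candidates, key=lambda a: (abs(a - smallest_vertical_um), a))


if __name__ == "__main__":
    for size in (20, 25, 40, 80):
        print(size, "->", select_beam_diameter(size))
```

Resolving ties towards the smaller aperture is one possible choice; it favours lower background over illuminated volume, and the opposite choice would be equally easy to express.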
Can the advantages of dynamic beam adjustment be demonstrated in a consistent manner? It is always problematic to clearly show that one data collection method is better than another. However, here we show that dynamic beam sizing makes a significant difference for weakly diffracting samples.
We initially tested the adaptive beam diameter protocol on crystals of the β1-adrenergic GPCR (Warne et al., 2008). These crystals diffract weakly, exhibit considerable variation in diffraction quality and tend to form as thin plates or needles. A total of 30 crystals were run on MASSIF-1, first using the classic protocol, where a 50 µm beam diameter is the default, and then running a second protocol on the same crystal including the adaptive beam diameter. In many cases, data sets were collected from the same crystal using both procedures; however, for some cases, data sets were only processed where the beam diameter had been reduced to match the crystal size. Table 1 shows crystal dimensions and data processing statistics from crystals where an automatically processed data set was produced (8 out of 30 crystals). Where crystals were of sufficient quality, data sets were mostly produced from both protocols, but it is for the smaller crystals that a difference is discernible. For crystals with a y dimension below 30 µm the data sets produced have a higher <I/σ(I)> or resolution limit (Table 1; For42, For48, For59 and For67) even though the crystals have already been exposed. For one of these crystals, For42, the data set collected with the smaller beam is significantly better (Table 2).
What effect has the protocol had on overall data collection? In order to analyse the difference we looked at the average signal to noise ratios, <I/σ(I)>, for all data sets processed automatically (Monaco et al., 2013; Vonrhein et al., 2011; Sparta et al., 2016; Winter, 2010) for the year preceding and the year following the introduction of the protocol. This amounts to data for approximately 22,000 samples. Figure 3 shows the distributions of overall <I/σ(I)> for the data sets. While the distributions are similar for high <I/σ(I)>, they diverge at the lower values, with a significant shift lower for the adaptive beam diameter data sets. The average before the procedure was put in place is 14.4, moving down to 12.2 after, with the modal values 8.55 before and 5.9 after (Figure 3). We initially found this surprising as we had expected a general increase in <I/σ(I)>. However, given the effect seen on the GPCR crystal, the distribution change is understandable. While the beam diameter is increased or decreased to match the diffraction volume, the <I/σ(I)> values for strongly diffracting crystals will remain the same, given that the dose is calculated to achieve a certain resolution without radiation damage, taking the crystal volume and changed flux into account. It is for the weakly diffracting crystals that the adaptive beam diameter has the most significant effect. Whether the diameter is increased or decreased, there is a large increase in the number of processed data sets with rather low <I/σ(I)> after the adaptive beam diameter protocol was implemented. This implies that by introducing this routine into the regular data collection workflow, the beamline is able to increase the number of data sets processed from these samples by reducing the background noise.

Improved error handling
The correct handling of errors is paramount in an automated system. We initially introduced many error handling routines at both a high level, such as the collection of a data set with default settings when indexing fails, and a low level, such as escaping from small robotic errors. After processing more than 39,000 samples we have now been able to observe most errors encountered and have extended the processes to catch them.

Centring errors
We have occasionally observed that after the X-ray centring routine the crystal was still not correctly aligned over the full rotation range. This may be due to movements of the support after the routine is completed or to errors in the routine arising from multiple peaks being selected for centring. Whatever the reason, it can lead to a data set being lost. We therefore introduced a check in the characterisation step that ensures that the 4 images have a diffraction signal. If one image has a signal that is below 10% of the top signal, a recovery routine is launched. This involves three short line scans (50 µm above and below) launched over the currently centred position. In most cases it corrects the error. In 2017, 13,776 samples were processed on MASSIF-1 and centring recovery was launched 221 times. This represents a centring error rate of 1.6%, which should be reduced further by being able to detect and recover incorrectly centred samples.
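The check described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the representation of the diffraction signal as one number per characterisation image are assumptions, and the treatment of the case where no image shows any signal is also assumed rather than taken from the text.

```python
def needs_centring_recovery(signals, threshold_fraction=0.10):
    """Decide whether the centring recovery routine should be launched.

    signals: one diffraction signal value per characterisation image
        (4 images in the workflow described in the text).
    threshold_fraction: an image below this fraction of the top signal
        triggers recovery (10% in the text).
    """
    top = max(signals)
    if top <= 0:
        # Assumption: no signal in any image is also treated as an error.
        return True
    return any(s < threshold_fraction * top for s in signals)


# Centring error rate quoted in the text: 221 recoveries in 13,776 samples.
error_rate_percent = 100 * 221 / 13776

if __name__ == "__main__":
    print(needs_centring_recovery([100, 90, 80, 5]))
    print(f"{error_rate_percent:.1f}%")
```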

Low resolution data collection
Unless a resolution is specified by the user, all mesh and characterisation images are collected at 2 Å. If the predicted resolution extends beyond the corners of the detector (1.42 Å) the detector is moved to the new resolution and a further characterisation is launched. This allows the highest possible resolution to be obtained and ensures that characterisation is performed at an optimal detector distance. However, for low resolution, the resolution determined by BEST is used and data collection proceeds according to the determined strategy. It would seem sensible that if very low resolution diffraction is determined the characterisation images should also be re-collected at this resolution. We therefore introduced a routine to re-collect the characterisation images at 4 Å for all samples where the determined resolution is below this value. This allows the distribution of intensity to be better estimated and should lead to better strategy calculations (Popov & Bourenkov, 2003).
By analysing the relationship between predicted and determined resolution from all data sets collected so far we can also try to improve the quality of data collected on MASSIF-1 (Figure 4). The distribution shows that the agreement is excellent, with the prediction usually slightly underestimating the achievable resolution. This may well be due to the different criteria used for resolution limit determination: <I/σ(I)> for characterisation and CC1/2 for complete data sets. A clear trend is that for weakly diffracting crystals the strategy tends to underestimate the resolution (Figure 4). This is due to the difficulty in estimating a B factor at very low resolution. In addition to re-collecting the characterisation images, the procedure now always sets the detector resolution to 4 Å for all data collections where the predicted resolution is lower than this value. In this way, we hope that higher resolution data will not be missed, as complete data to 4 Å are more important than sub-optimal data collection at 7 Å.
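The resolution handling described in this section can be summarised as a small decision function. The following Python sketch is a reading of the text under stated assumptions, not the workflow's implementation: the function and constant names are hypothetical, and resolution is expressed in Å, so a numerically smaller value means higher resolution.

```python
DETECTOR_CORNER_RES_A = 1.42  # resolution at the detector corners at the default setting
LOW_RES_LIMIT_A = 4.0         # detector resolution used for weakly diffracting samples


def plan_detector_resolution(predicted_res_a):
    """Return (detector_resolution_in_A, recollect_characterisation).

    predicted_res_a: resolution predicted from the first characterisation,
        which is collected at 2 Å unless the user specifies otherwise.
    """
    if predicted_res_a < DETECTOR_CORNER_RES_A:
        # Diffraction extends beyond the detector corners: move the
        # detector to the predicted resolution and characterise again.
        return predicted_res_a, True
    if predicted_res_a > LOW_RES_LIMIT_A:
        # Very low resolution: re-collect characterisation images at 4 Å
        # and keep the detector at 4 Å for the data collection.
        return LOW_RES_LIMIT_A, True
    # Otherwise the predicted resolution is used directly.
    return predicted_res_a, False


if __name__ == "__main__":
    for res in (1.2, 2.5, 7.0):
        print(res, "->", plan_detector_resolution(res))
```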

Multiple crystal and multiple position data collection strategies
The possibility to input a number of positions from which to collect data was introduced into the diffraction plan early in the operation of MASSIF-1 (Svensson et al., 2015). This has allowed complete data sets to be collected either from separate crystals contained on a sample support or from multiple positions within a single crystal and has proved to be a popular option, with many samples received on MASSIF-1 having a number of positions between 2 and 12 requested (Figure 5). While extremely useful, this protocol does not cover the scenario where radiation sensitive samples can benefit from a large dose being spread over multiple partial data sets, a procedure known generally as helical data collection that has been shown to be beneficial in many cases (Polsinelli et al., 2017).
Radiation damage can often make it difficult to collect complete data, or data with sufficient anomalous signal, from a single crystal or a single position within a crystal. A new experiment type is now available that will automatically collect multiple partial data sets from positions within a homogeneous volume of a crystal. This can lead to improved data quality, increased resolution and higher anomalous peaks. This is the first fully automated helical data collection protocol that also accounts for the heterogeneity of crystal diffraction quality.
When a multiple position experiment is selected, the automesh algorithm, which determines the area to scan to locate the crystal, uses the widest orientation of the sample support, rather than the smallest, in order to avoid overlapping crystals or positions in ω. After the mesh scan is complete, the map is analysed for the number of peaks requested (if no positions are specified for MXPressP the default is 5). Determined peaks must be within 10% of the value of the top peak for multiple crystals or 70% of the second highest peak for pseudo-helical.
Additionally, any peaks that are closer together than a beam diameter or that will overlap in ω are eliminated. The number of allowed detected peaks is then specified by a comment in ISPyB. If multiple crystals have been selected, each point is then centred as usual and a complete data set collected according to user input requirements. If MXPressP is selected, the top peak is centred and 4 characterisation images are collected from the best position. A strategy is then calculated for a complete data set and the data collected; for MXPressP_SAD the strategy is optimised for structure solution by SAD. As usual, in case of indexing failure, a default data collection of 180° is collected (240° for triclinic and 360° for SAD data collection). Once completed, a strategy is then calculated to collect a complete data set from the N positions determined in the mesh scan that are within 70% of the value of position 2. The strategy, as usual, accounts for the volume of the positions, beam diameter etc. Again, in case of a failure in indexing, default data collections are performed at each position using a full dose and the rotation range determined by 180°/N (240°/N for triclinic and 360°/N for SAD). Each partial data set has a 5° overlap with the next to assist with scaling.
We are, for the moment, remaining cautious with helical data collection by collecting a full single position complete data set from the best volume. The reason for this is twofold: (1) we have observed that crystal heterogeneity can often lead to a number of the partial data sets being of varying quality despite the stringent quality threshold we have implemented and (2) we are eager to compile a large amount of data on how and when helical data collection is superior to single position. This is extremely important as, so far, the few studies on helical data collection have not considered crystal heterogeneity (Bowler & Bowler, 2014). Strategy parameters and data processing statistics for two example systems using the pseudo-helical routines for native and SAD data collections are shown in Table 3. Two proteins that tend to form crystals with a needle morphology were selected: β phosphoglucomutase (βPGM) in an open conformation (Baxter et al., 2010) and ferulic acid esterase (FAE) that contains 8 × Se and 5 × Cd2+ (Prates et al., 2001) with a significant anomalous signal at the MASSIF-1 wavelength of 0.966 Å. Comparing the single position data collection to the merged multiple position data sets shows that in these cases there is not a significant overall increase in data quality. However, in the SAD case the helical data set has considerably higher <I/σ(I)>, anomalous correlation coefficients and mid-slope of anomalous probability than the single position data set. For the native data sets, the single position is slightly better. This may reflect the heterogeneity within the crystal and highlights the importance of this parameter in deciding whether to select helical or single position data collection for a certain project. The ability to automatically run clustering algorithms (Giordano et al., 2012; Zander, Cianci, et al., 2016) on these partial data sets may also improve the quality of the final data.
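The default rotation ranges used when indexing fails can be illustrated with a short Python sketch. The total ranges (180°, 240°, 360°), the division by N and the 5° overlap come from the text; how the overlap is applied is not stated, so extending every wedge except the last by 5° is an assumption, as are the function name and the mode labels.

```python
# Total default rotation range in degrees for each case named in the text.
TOTAL_RANGE_DEG = {"native": 180.0, "triclinic": 240.0, "sad": 360.0}


def partial_wedges(n_positions, mode="native", overlap_deg=5.0):
    """Return a list of (start, end) rotation ranges, one per position.

    Each position covers total/N degrees. Assumption: every wedge except
    the last is extended by overlap_deg so that it overlaps the next one.
    """
    width = TOTAL_RANGE_DEG[mode] / n_positions
    wedges = []
    for i in range(n_positions):
        start = i * width
        end = start + width
        if i < n_positions - 1:
            end += overlap_deg
        wedges.append((start, end))
    return wedges


if __name__ == "__main__":
    for wedge in partial_wedges(5):
        print(wedge)
```

With the default of 5 positions for a native data collection, each position covers 36° before the overlap is added.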
We hope that by being able to analyse the variation in diffraction quality, and compare single position data to multi-position data from the same crystal, a more general strategy for these types of data collection may emerge.

Discussion
The results presented here demonstrate not only the increase in the speed and reliability of automatic data collections but also that more complex strategies can be brought into the arena of autonomous experiments. Automation is often seen as a way to deal with mundane experiments that require little human input. The autonomous system presented here is different in that, in addition to automating mounting and centring, it also uses data gathered during the process to improve data collection strategies. We have already demonstrated that MASSIF-1 collects, on average, better quality data than humans are able to. The additional routines presented here add even more intelligence into the system that should further enhance its ability to extract the best possible data from every sample. This built-in intelligence means that the system is excellent for not only robust and routine data collections but also for challenging systems that diffract weakly. We have demonstrated that adapting the beam diameter can increase the number of data sets that can be processed from these types of sample. We hope that by providing more data on more samples we can improve feedback into experiment cycles and increase the amount of useful data produced.
All the developments described here have been exported to the human operated ESRF beamlines. As structural biologists now turn to an ever wider variety of techniques, we hope that fully automatic data collection will become the standard data collection method for MX as the best possible data can be collected from samples, be they large and robust or small and weakly diffracting. In combination with developments in the robotic mounting and soaking of crystals (Zander, Hoffmann, et al., 2016) we envision that the future of macromolecular crystallography is the provision of a fully automated high throughput service able to rapidly produce high quality structural models and screen for potential therapeutic and probe molecules.

Acknowledgments
We thank Tony Warne (MRC-LMB, Cambridge, UK) for the gift of

[Figure caption, beginning not recoverable] A. (Zander, Hoffmann, et al., 2016) support that has 3 crystals mounted. The widest orientation of the mount was selected. B. Mesh scan of the mount shown in A. Three positions were requested and three detected. C. Mesh scan for a βPGM crystal where a native pseudo-helical data collection was requested; 5 positions were detected and a beam diameter of 30 µm selected. D. Mesh scan for a FAE crystal where a SAD pseudo-helical data collection was requested; 5 positions were detected and a beam diameter of 100 µm selected.