Simulation of small-angle X-ray scattering data of biological macromolecules in solution

An application is presented to simulate images on an area detector for isotropic small-angle X-ray scattering, explicitly accounting for the experimental geometry and yielding radially averaged 1D scattering patterns with statistically appropriate variations.


Introduction
A small-angle X-ray scattering (SAXS) experiment performed on a solution of biological macromolecules typically produces an isotropic 2D scattering pattern originating from the random orientations of the molecules in solution. To access shape and size information of the solute, the image is first radially (or azimuthally) averaged into a 1D curve, which depicts the scattering intensity I(s) against the momentum transfer s. The momentum transfer depends on the wavelength (λ) and the angle between the incoming and scattered photons (2θ) as s = (4π sin θ)/λ. To extract the structural parameters of the molecule, the background contribution to the scattering from components such as the capillary or the buffer needs to be subtracted from that of the sample.
So far, the most common approach to calculate SAXS data has been at the level of the 1D curve, typically on the basis of atomic structures (Svergun et al., 1995; Schneidman-Duhovny et al., 2010; Knight & Hub, 2015; Grishaev et al., 2010; Liu et al., 2012; Tjioe & Heller, 2007; Virtanen et al., 2011) or from geometric bodies (Konarev et al., 2003). There are significantly fewer options for generating 2D scattering images. Of those available, most are tailored to calculate the scattering from oriented systems and/or for single-particle diffraction experiments, both for coherent and incoherent sources, using explicit equations or numerical methods (Chen & Luo, 2017; McAlister & Grady, 2002; Fritz-Popovski, 2013; Saldin et al., 2010; Alves et al., 2017). All these latter methods result in theoretically calculated 2D images or 1D scattering patterns without any random components. For simulations of experimental data, approximations of random noise are often added by ad hoc methods.
As many aspects of the beamline setup, e.g. detector type, sample-detector distance, X-ray wavelength and the incident beam position on the detector, are propagated into the uncertainties of the final 1D scattering pattern, it is necessary to simulate scattering patterns on area detectors. Importantly, the instrumental parameters influence not only the angular range and spacing but also the point-to-point variation of the individual data points in the 1D scattering data. Here we present IMSIM, an application to simulate a 2D SAXS scattering image from a calculated 1D SAXS curve. The focus is that the image, when radially averaged again, accurately reproduces the source data up to a scaling factor and the desired random noise.
An overview of all command-line options and arguments of IMSIM, as well as a full usage example, is provided online: https://www.embl-hamburg.de/biosaxs/manuals/imsim.html.

IMSIM architecture
Simulation method
The simulation is initialized with a continuous rectangular virtual detector plane of infinite size in polar coordinates. Individual random photon-like arrival events are generated in polar coordinates as well, then mapped and quantized onto a rectangular pixel grid, such that the resulting Cartesian image exhibits the same statistical properties and overall features as observed with experimental data. Here, sampling in polar coordinates is a canonical choice as isotropic scattering implies no preference of direction; hence a random variable of uniform distribution over [−π, π) may be employed to determine the direction. In turn, the probability of a certain distance is proportional to the intensities of the calculated input scattering scaled by the Cartesian circle area and normalized to sum to 1.0 [Fig. 1(a)]. Drawing randomly from a uniform variable over the interval [0, 1] and applying the quantile function of the intensity probability distribution yields the radial position, i.e. the scattering angle, of the virtual photon event [Fig. 1(b)]. The total number of events generated depends on the input parameters given on application start-up, e.g. virtual flux, exposure time and concentration. Each random set of polar coordinates is then mapped to Cartesian space and, in the last step, quantized to the configured detector pixel grid with a point-spread function of one pixel. The physical dimensions of the final image depend on the wavelength, sample-detector distance and pixel size of the virtual detector [Fig. 1(c)].
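
The sampling scheme described above can be illustrated with a minimal Python sketch. It is not the IMSIM implementation: the function names, the tabulated radial grid, the weighting of the intensity by the radius (ring circumference) and the beam centre placed in the middle of the detector are all assumptions made for illustration. Events are placed at the tabulated radii only, whereas a real implementation would likely interpolate between them.

```python
import bisect
import math
import random


def build_radial_cdf(radii, intensities):
    """Cumulative distribution of the radial distance of an event.

    Assumption: the probability of a radius is proportional to I(r) * r,
    i.e. the intensity weighted by the circumference of the ring.
    """
    weights = [i * r for r, i in zip(radii, intensities)]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating-point round-off
    return cdf


def simulate_image(radii, intensities, n_events, n_pixels, pixel_size, rng):
    """Accumulate n_events random events on an n_pixels x n_pixels grid.

    Assumption: the beam centre sits in the middle of the detector.
    Events that fall outside the grid are discarded.
    """
    cdf = build_radial_cdf(radii, intensities)
    image = [[0] * n_pixels for _ in range(n_pixels)]
    centre = n_pixels / 2.0
    for _ in range(n_events):
        # Direction: uniform angle, no preferred orientation.
        phi = rng.uniform(-math.pi, math.pi)
        # Distance: quantile function (inverse CDF) applied to a uniform draw.
        r = radii[bisect.bisect_left(cdf, rng.random())]
        # Polar to Cartesian, then quantize to whole pixels.
        col = math.floor(centre + r * math.cos(phi) / pixel_size)
        row = math.floor(centre + r * math.sin(phi) / pixel_size)
        if 0 <= row < n_pixels and 0 <= col < n_pixels:
            image[row][col] += 1
    return image


if __name__ == "__main__":
    rng = random.Random(1)
    radii = [0.5e-3 * k for k in range(1, 101)]  # metres from the beam centre
    intensities = [math.exp(-(r / 0.02) ** 2) for r in radii]  # toy profile
    image = simulate_image(radii, intensities, 20000, 128, 1e-3, rng)
    print("total counts:", sum(map(sum, image)))
```

Because every pixel simply counts the events that land in it, the counting noise arises from the sampling itself and does not have to be added afterwards.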

IMSIM input/output
IMSIM is a command-line application that takes as input a file with calculated one-dimensional scattering intensities on an absolute scale (Svergun et al., 1995), with the first and second columns corresponding to the momentum transfer in inverse ångström and calculated intensities, respectively. To simulate the 2D scattering pattern, setup details such as the detector type, exposure time (seconds), X-ray wavelength (metres), detector distance (metres) and flux (events per second) need to be specified through command-line options. There are currently two types of detectors pre-defined in IMSIM, namely Pilatus and Eiger photon-counting detectors of various sizes (Broennimann et al., 2006; Dinapoli et al., 2011). Here, the detector selection defines the physical dimensions as pixel count and size of the final image. Non-photon-counting detectors, e.g. CCDs, are not (yet) implemented in IMSIM as they have a point-spread function larger than the one pixel of the Pilatus and Eiger detectors. Introducing a point-spread function would be conceptually simple and could be done on popular request. The sample concentration has to be provided as well, where a concentration of 0.0 corresponds to simulation of background scattering only. Analogous to the sample scattering, a background scattering pattern on an absolute scale may be supplied to be used as background, for example scattering of a filled capillary. Otherwise a default 'flat' background modeled by a double-exponential function is applied.
For convenience, IMSIM also provides means to obtain a template for a beamstop mask with module gaps suitable for the selected detector and the angular axis for the defined experimental setup. The axis file is generated using information about the wavelength and sample-detector distance, while the shape of the mask is defined by the detector geometry (e.g. size and positions of the detector modules). Masks may be exported in the FIT2D binary .MSK format (Hammersley, 2016) and it is possible to shift the mask on the 2D scattering pattern by defining offsets in the x and y directions to account for different locations of the beam center on the 2D image [Fig. 1(c)]. The default output of IMSIM is an image with the simulated 2D scattering pattern in TIFF, EDF or CBF format.
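
How an angular axis follows from the wavelength and the sample-detector distance can be sketched as below. This is an illustrative calculation under stated assumptions, not the IMSIM code: the function names are hypothetical, the detector is assumed to be flat and perpendicular to the beam so that tan 2θ = r/d, and the axis is evaluated at the centres of one-pixel-wide radial bins. The pixel size of 172 µm used in the example corresponds to a Pilatus detector; the wavelength and distance are arbitrary example values.

```python
import math


def momentum_transfer(radius_m, distance_m, wavelength_m):
    """s = 4*pi*sin(theta)/lambda in inverse angstrom.

    Assumes a flat detector perpendicular to the beam, so that the full
    scattering angle 2*theta satisfies tan(2*theta) = radius / distance.
    """
    theta = 0.5 * math.atan2(radius_m, distance_m)
    wavelength_angstrom = wavelength_m * 1e10
    return 4.0 * math.pi * math.sin(theta) / wavelength_angstrom


def radial_axis(n_bins, pixel_size_m, distance_m, wavelength_m):
    """Momentum transfer at the centre of each one-pixel-wide radial bin."""
    return [
        momentum_transfer((k + 0.5) * pixel_size_m, distance_m, wavelength_m)
        for k in range(n_bins)
    ]


if __name__ == "__main__":
    # Example values: 172 um pixels, 3.0 m distance, 1.24 angstrom wavelength.
    axis = radial_axis(1000, 172e-6, 3.0, 1.24e-10)
    print("s range: %.5f to %.5f per angstrom" % (axis[0], axis[-1]))
```

Moving the detector closer to the sample widens the covered s range and coarsens the spacing between bins, which is why the angular range differs between detector positions.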

Statistical properties
As presented in Fig. 2, the 1D scattering patterns after radial averaging with RADAVER/IM2DAT (RADAVER in ATSAS 2.8 and earlier, IM2DAT as of ATSAS 3.0; Franke et al., 2017), background subtraction and normalization are similar to the calculated input data up to the simulated noise. In particular, the standardized residuals are randomly distributed and follow a standard normal distribution. This may be statistically validated with (a) the reduced χ² statistic (Pearson, 1900) to test the hypothesis whether the observed differences between calculated input scattering and simulated output data may be explained by chance alone, i.e. whether they are similar up to noise or systematically different; and (b) the Anderson-Darling statistic (Anderson & Darling, 1954; Stephens, 1974) to test whether the empirical distribution of the standardized residuals is similar to a standard normal distribution with mean zero and standard deviation of one. As detailed in Table 1, none of the hypotheses of similarity up to noise and standard normal distribution of residuals may be rejected at a significance threshold α = 0.01.
In addition, 1000 independent background frames were simulated and compared pairwise with the reduced χ² test. The obtained histogram of the reduced χ² values of the resulting 499 500 comparisons closely follows the expected χ² distribution (not shown).
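
The two test statistics can be computed from the standardized residuals as in the following sketch. It is a minimal illustration, not the procedure used to produce Table 1: the function names are hypothetical, the number of degrees of freedom is assumed to equal the number of data points unless fitted parameters are specified, and the Anderson-Darling statistic is evaluated against a fully specified standard normal distribution. The conversion of the statistics to p values is omitted.

```python
import math
import random


def standardized_residuals(observed, calculated, sigma):
    """(observed - calculated) / error estimate, point by point."""
    return [(o - c) / s for o, c, s in zip(observed, calculated, sigma)]


def reduced_chi_square(residuals, n_fitted=0):
    """Sum of squared standardized residuals per degree of freedom.

    Assumption: degrees of freedom = number of points - n_fitted.
    """
    return sum(z * z for z in residuals) / (len(residuals) - n_fitted)


def anderson_darling_std_normal(residuals):
    """A^2 statistic against a fully specified N(0, 1) distribution."""
    z = sorted(residuals)
    n = len(z)
    total = 0.0
    for i in range(n):
        # log of the normal CDF at z[i], via the complementary error function
        log_cdf = math.log(0.5 * math.erfc(-z[i] / math.sqrt(2.0)))
        # log of the survival function at the mirrored order statistic
        log_sf = math.log(0.5 * math.erfc(z[n - 1 - i] / math.sqrt(2.0)))
        total += (2 * i + 1) * (log_cdf + log_sf)
    return -n - total / n


if __name__ == "__main__":
    rng = random.Random(0)
    calculated = [100.0 * math.exp(-k / 500.0) for k in range(2647)]
    sigma = [math.sqrt(c) for c in calculated]  # toy error estimates
    observed = [rng.gauss(c, s) for c, s in zip(calculated, sigma)]
    res = standardized_residuals(observed, calculated, sigma)
    print("reduced chi-square:", reduced_chi_square(res))
    print("Anderson-Darling A^2:", anderson_darling_std_normal(res))
```

For data that agree with the model up to noise, the reduced χ² is expected to be close to 1, while a systematic offset of the residuals inflates both statistics.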

Applications
IMSIM is implemented as a cross-platform command-line application that can easily be used interactively, integrated into shell scripts and also employed as a data source, for example, for testing data-analysis pipelines. As IMSIM can quickly generate a large number of realistic scattering patterns from a multitude of setups (Fig. 2), it may be used for all aspects of method development. Owing to the intentional absence of instrument or sample effects, it is straightforward to validate radial-averaging and data-analysis applications by testing on the simulated data provided by IMSIM. Such tests could validate the error propagation, provide insights into the effects of statistical noise on the analysis procedures, or, for example, reveal how over- and/or undersubtraction would influence the results. The methods that so far ignored the information contained in error estimates may be re-evaluated.

Figure 2. Examples of simulated data from samples of varying size and shape with constant wavelength at multiple distances of the Pilatus 6M detector from the sample position. All backgrounds were generated with an exposure time of 10 s, the samples with an exposure of 1 s. (a) The simulated data of three different samples after background subtraction and scaling by concentration, offset for clarity. These are native horse spleen ferritin (1ier) at 6.0 m detector distance at a concentration of 0.5 mg ml⁻¹ (green) and 2 mg ml⁻¹ (light blue); bovine serum albumin (4f5s; Bujacz, 2012) at 3.0 m detector distance at a concentration of 0.5 mg ml⁻¹ (yellow) and 2.0 mg ml⁻¹ (violet); and myoglobin (1wla; Maurus et al., 1997) at 1.5 m detector distance at a concentration of 0.5 mg ml⁻¹ (dark blue) and 2.0 mg ml⁻¹ (red). The calculated scattering curves are overlaid in white. Error bars of all data are relative errors in logarithmic scale {SE(I)/[I log(10)]} and partially truncated for visibility. (b)-(d) The respective standardized residuals of the simulated data minus the calculated input scattering, divided by the error estimate. Standardized residuals are expected to be randomly distributed and to follow a standard normal distribution, i.e. >99% of all standardized residuals should be located in the range [−3, +3].

Table 1. Simulation summary and statistics.
The three samples (exposure time 1 s) are simulated together with a flat background (exposure time 10 s). Each sample is simulated at two concentrations for the given sample-detector distance (different distances are used for all samples). At all positions a 3 mm circular 'beamstop' was masked out at the incident beam position, resulting in a different angular range at each detector position. All generated data sets have n = 2647 data points. Reduced χ² and Anderson-Darling test statistics and associated probabilities of obtaining a result more extreme than the ones observed (p values) are provided for each setup.

Limitations and future developments
The method described generates experimentally idealized, but statistically appropriate, data by emulating a virtual beamline with photons being spontaneously created from the provided sample without any physical components that might interfere, i.e. no ray-tracing is performed. At present, the sample does not allow for a structure factor coming from specific or unspecific aggregation or repulsion effects in solution. The effects of excluded volume at high solute concentrations are not considered owing to a lack of theoretical background to rely on. Future developments of IMSIM may include support for beam profiles, detector point-spread functions and wavelength spread, allowing for the generation of instrumentally smeared (e.g. neutron scattering) data.

Availability
IMSIM is part of the ATSAS software package (3.0.0; Franke et al., 2017), which is freely available for academic use (http://www.embl-hamburg.de/biosaxs/download.html).

Conclusions
We present a program to simulate isotropic 2D scattering patterns from one-dimensional small-angle X-ray scattering data. The principle is based on simulating the photon arrival on a virtual detector by modeling it as a probabilistic event, generating appropriately randomized data and their errors. As all of the files needed for the radial averaging can be produced by IMSIM, i.e. mask, axis and uniform background, it is possible to rapidly obtain a subtracted data set based on simulated data that are virtually indiscernible from experimental data. IMSIM may be utilized by experimentalists, methods developers and educational facilities in need of a large number of data sets with close resemblance to experimental data.