research papers
The accurate assessment of small-angle X-ray scattering data
^{a}Hauptman–Woodward Medical Research Institute, 700 Ellicott Street, Buffalo, NY 14203, USA,^{b}Department of Structural Biology, SUNY Buffalo, 700 Ellicott Street, Buffalo, NY 14203, USA, and ^{c}Stanford Synchrotron Radiation Lightsource, 2575 Sand Hill Road, MS69, Menlo Park, CA 94025, USA
^{*}Correspondence email: esnell@hwi.buffalo.edu
Small-angle X-ray scattering (SAXS) has grown in popularity in recent times with the advent of bright synchrotron X-ray sources, powerful computational resources and algorithms enabling the calculation of increasingly complex models. However, the lack of standardized data-quality metrics presents difficulties for the growing user community in accurately assessing the quality of experimental SAXS data. Here, a series of metrics to quantitatively describe SAXS data in an objective manner using statistical evaluations are defined. These metrics are applied to identify the effects of radiation damage, concentration dependence and interparticle interactions on SAXS data from a set of 27 previously described targets for which high-resolution structures have been determined via X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy. The studies show that these metrics are sufficient to characterize SAXS data quality on a small sample set with statistical rigor and sensitivity similar to or better than manual analysis. The development of data-quality analysis strategies such as these initial efforts is needed to enable the accurate and unbiased assessment of SAXS data quality.
Keywords: SAXS data quality; SAXStats.
1. Introduction
X-ray crystallography and NMR have proven to be highly effective methods to determine the high-resolution structures of many biological macromolecules. However, limitations of each technique have restricted their applicability. This is illustrated using data gathered as part of the Protein Structure Initiative (PSI), which was established in 2000 by the National Institute of General Medical Sciences in order to determine the structures of a broad range of macromolecules pertaining to biological and biomedical problems. As part of this initiative, large-scale centers were created to establish and optimize structural pipelines. Success and failure, tracked at each stage of the process, showed that ∼12% of soluble purified proteins from the initiative resulted in structures deposited in the Protein Data Bank (PDB; Chen et al., 2004). Given the nature of the problem this success rate is commendable, but it also illustrates that a large amount of effort was expended to produce the 88% of the samples that to date have provided little to no structural detail.
A complementary technique to X-ray crystallography and NMR is small-angle X-ray scattering (SAXS; Grant et al., 2011). SAXS is a solution technique that can yield low-resolution structural information about the size, shape and flexibility of a macromolecule (Putnam et al., 2007). It is virtually unlimited by protein size and can characterize both those samples that do provide structural information and the majority that do not. If SAXS could be performed in a high-throughput manner, it could provide limited structural information on the majority of the samples produced by the PSI, as well as those resulting from other biological investigations that are recalcitrant to crystallographic or NMR approaches.
The ability to perform high-throughput SAXS exists (Blanchet et al., 2012; Classen et al., 2013; Perry & Tainer, 2013), along with the potential to generate vast quantities of data that require analysis. Some semi-automated software is available to quickly provide users with parameters such as the radius of gyration (R_{g}), the forward scattering intensity [I(0)] and the maximum particle dimension (D_{max}) necessary to enable rapid characterization of SAXS data (Franke et al., 2012). However, subjective interpretation of data is still required to fully assess the data quality and to ensure that the conclusions that are drawn are not erroneous. While X-ray crystallography and NMR have quantitative standards by which data quality can be assessed, such as minimum d-spacing, merging statistics, dispersion or peak line width, SAXS has no equivalents of such values.
A set of publication requirements and standards for small-angle scattering data, both X-ray and neutron, has been proposed (Jacques et al., 2012). These focus on ensuring that the scattering data and any subsequent analysis are presented in a manner such that the interpretations presented can be independently evaluated. However, even with these guidelines, evaluating the raw experimental data is still somewhat subjective and dependent on expertise in the technique. The use of small-angle scattering is growing rapidly in the biological community, and rigorous metrics are needed to assess the initial scattering data in a non-subjective manner. With this in mind, we have built on the proposed requirements and standards to develop metrics that rapidly yield a quantitative assessment of the quality of the initial X-ray scattering data. These are useful for those gaining experience in the technique but also for the rapid evaluation of data in real time, allowing feedback during the experiment. We detail these metrics and the results of their application to previously described SAXS data.
2. Experimental
2.1. Macromolecular samples
Our set of samples consisted of 27 proteins supplied by the Northeast Structural Genomics Consortium (NESG) for which a crystallographic structure, an NMR structure or both were available. The samples are described in detail elsewhere (Grant et al., 2011). The proteins are representatives from large protein domain families or biomedical themes, or have been selected as targets whose known structure would be significant to the biomedical community (Wunderlich et al., 2004). Each target is characterized via a series of biochemical experiments including analytical gel filtration, static NMR spectroscopy for determining rotational correlation time and (if possible) high-resolution structural data, and, if diffraction-quality crystals can be formed, X-ray crystallography (Goh et al., 2003; Bertone et al., 2001). Most targets are full-length polypeptide chains of shorter than 340 amino acids selected from domain sequence clusters (Liu et al., 2004; Liu & Rost, 2004) which are organized in the PEP/CLUP database (Carter et al., 2003). Each protein cluster corresponds to putative structural domains whose three-dimensional structure is not known and cannot be accurately modeled through homology. The taxa of the targets range from bacteria and archaea to eukaryotes, with a focus on human proteins. Details of the target set are provided in Table 1.

The initial 27 samples (samples 18 and 19 are identical in the study of Grant et al., 2011) were obtained from the remainder of the proteins used for crystallization screening by the High-Throughput Crystallization Screening laboratory (HTSlab; Luft et al., 2003). After the crystallization screening experiments had been set up by the liquid-handling systems, approximately 60 µl of recovered sample remained. While each of these samples has undergone extensive quality control, each also undergoes at least two freeze–thaw cycles: first when the protein is shipped to the HTSlab and a second time when the remaining protein is shipped to the synchrotron as described below. For many proteins, a freeze–thaw cycle can be detrimental (Murphy et al., 2013), causing the protein to aggregate or precipitate. Each protein target was prepared in identical buffer conditions consisting of 100 mM NaCl, 0.02%(w/v) NaN_{3}, 5 mM DTT, 10 mM Tris pH 7.5. This consistency in sample preparation greatly aids efficiency during SAXS data collection.
2.2. Data collection
SAXS data were collected on beamline 4-2 (Smolsky et al., 2007) of the Stanford Synchrotron Radiation Lightsource (SSRL) utilizing high-throughput data-collection strategies (Martel et al., 2012). The protein solutions used for SAXS data collection underwent two freeze–thaw cycles as described above with a typical sample volume of 60 µl. Each sample was diluted with its matching sample buffer to prepare three solutions using sample-to-buffer dilution ratios of 2:1, 1:2 and 1:5. Scattering data from a buffer blank were measured, followed by each of the three concentrations of each sample and a subsequent buffer blank for comparison. A wash cycle took place between each sample concentration. Eight consecutive 2 s exposures were collected at a wavelength of 1.13 Å for each of the three sample concentrations. Solutions were oscillated in a quartz capillary cell during data collection to minimize exposure of the same volume. This series of short exposures is essential in order to reliably identify a signature for radiation damage.
2.3. Data analysis
Scripts were developed and used to statistically test the SAXS data for indications of radiation damage and interparticle interactions; both of these effects can distort SAXS profiles and lead to conclusions that are not experimentally valid. Unwanted trends in SAXS data resulting from radiation damage or interparticle interactions are measured and the significance of these trends is examined using the linear regression t-test described below. Since the detection of radiation damage and interparticle interactions relies on statistical significance, the identification of problematic data is achieved in an objective fashion.
2.4. Radiation damage
SAXS data acquisition from proteins has the potential for an inherent experimental artifact: radiation damage. Radiation exposure can cause biological macromolecules to form high-molecular-weight oligomers owing to the generation of intermolecular cross-linking reactions, the formation of disulfide bonds or other hydrophobic and electrostatic interactions that lead to tertiary-structural or quaternary-structural changes (Davies & Delsignore, 1987; Le Maire et al., 1990). These effects manifest themselves as changes in the radius of gyration (R_{g}), the maximum particle dimension (D_{max}) and the forward scattering intensity [I(0)]. By monitoring these parameters as a function of exposure time, radiation damage can be tracked and evaluated.
2.4.1. Using the linear regression t-statistic to evaluate radiation damage
R_{g}, I(0), D_{max} and the similarity of each exposure to the first exposure, χ^{2}, are calculated using available software packages described later, and a linear regression analysis is used to obtain a t-statistic (Kenney & Keeping, 1962) as a function of the exposure (or the dose received). The t-statistic describes the likelihood that a slope is significant. From this we can determine whether or not the trends in SAXS parameters as a function of radiation are significant and therefore whether indications of radiation damage are present. The method of ordinary least squares is used to minimize the sum of the squared residuals of the linear regression model. For simple linear regression, the t-statistic is given by

$$t = \frac{a}{s_{a}}, \tag{1}$$

where a is the slope of the linear regression, s_{a} is the standard error in the estimate of the slope, t has n − 2 degrees of freedom and n is the sample size, i.e. the number of exposures. The t-statistic is converted to a p-value to determine the statistical significance of radiation damage, independent of the number of degrees of freedom, using a two-tailed t-statistic to p-value conversion table (Goulden, 1956). Radiation damage is evaluated as statistically present if the p-value is less than 0.05, a threshold that is commonly chosen to indicate statistical significance. The extent of radiation damage is correlated with the exposure time. Therefore, if the p-value is less than 0.05, indicative of damage, then the last exposure is rejected and the fit and t-statistic are recalculated for the remaining exposures. This process is repeated until the p-value of the linear regression for the remaining exposures is greater than 0.05 or the entire data set is rejected. Exposures that are free from radiation-damage effects in all of the SAXS parameters analyzed are averaged together to improve the signal to noise using the program DATAVER (Petoukhov et al., 2012).
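The rejection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts used in this work: the function names are hypothetical, the p-value is obtained by numerical integration of the Student t density rather than from a conversion table, and a series with fewer than three remaining exposures is treated as entirely rejected because the regression then has no degrees of freedom.

```python
import math

def two_tailed_p(t, df, steps=1000):
    """Two-tailed p-value of Student's t by Simpson's rule.

    The substitution x = sqrt(df) * tan(theta) maps the integral of the
    density over [0, |t|] onto a bounded interval of a smooth integrand.
    """
    theta = math.atan(abs(t) / math.sqrt(df))
    h = theta / steps
    f = lambda a: math.cos(a) ** (df - 1)
    total = f(0.0) + f(theta)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    norm = math.gamma((df + 1) / 2) / (math.sqrt(math.pi) * math.gamma(df / 2))
    return max(0.0, 1.0 - 2.0 * norm * total * h / 3.0)

def slope_p_value(x, y):
    """Least-squares slope a, t = a / s_a, and the two-tailed p-value."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    sse = sum((yi - my - a * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    s_a = math.sqrt(sse / (n - 2) / sxx)
    if s_a == 0.0:
        t = 0.0 if a == 0.0 else math.copysign(math.inf, a)
    else:
        t = a / s_a
    return a, t, two_tailed_p(t, n - 2)

def undamaged_exposures(values, alpha=0.05):
    """Number of leading exposures kept after dropping the last exposure
    repeatedly while the trend with exposure number stays significant."""
    n = len(values)
    while n >= 3:  # assumption: fewer than three points cannot be tested
        _, _, p = slope_p_value(list(range(1, n + 1)), values[:n])
        if p >= alpha:
            return n
        n -= 1
    return 0  # entire data set rejected
```

In use, `undamaged_exposures` would be called once per parameter (R_{g}, I(0), D_{max}, χ^{2}) on its per-exposure values, and only exposures accepted for every parameter would be averaged.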
In utilizing the linear regression t-test to identify radiation damage, assumptions are inherently made, including the normality of the null hypothesis, the independence of frames and the equal variance of errors. It may be that in specific experimental circumstances these requirements are not met; however, this will generally lead to an inflation of false positives, i.e. frames are identified as damaged when they are not. This results in fewer exposures being averaged and therefore in a decrease in the signal to noise of the averaged profile. Thus, this is an overly cautious test to ensure that radiation-damaged exposures are not included in the final analysis.
2.4.2. Detecting overall changes in the scattering profile
Typical scattering profiles cover structural information ranging from resolutions of hundreds of Å to as high as 10 Å. Two scattering profiles can be directly compared for overall similarity using the reduced χ^{2} statistic employed in the program DATCMP (Petoukhov et al., 2012), defined as

$$\chi^{2} = \frac{1}{\nu}\sum_{i=1}^{n}\left[\frac{I_{1}(q_{i}) - I_{2}(q_{i})}{\sigma_{i}}\right]^{2}. \tag{2}$$

Here, n is the number of data points, indexed by i, I_{1}(q_{i}) and I_{2}(q_{i}) are the intensities of the scattering profiles of interest at q_{i} with error σ_{i}, and ν is the number of degrees of freedom. For two identical scattering profiles χ^{2} = 0, while two similar profiles will have χ^{2} approximately equal to 1 and two dissimilar profiles will have χ^{2} much greater than 1. While in general the comparison of two scattering profiles is a non-trivial task, this simple discrepancy criterion can be used because we are comparing multiple exposures of the same sample, which will be similar in both scale and noise.
Each scattering profile for subsequent exposures is compared against the first exposure and the χ^{2} is calculated. The first exposure is used for comparison since it received the lowest dose of radiation. Exposures showing evidence of radiation damage based on χ^{2} are identified and removed from averaging.
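As an illustration of this comparison, the sketch below computes a reduced χ^{2} for every exposure against the first one. It is not the DATCMP implementation: the function names are hypothetical, the profiles are assumed to share the same q grid and error estimates, no scale factor is fitted, and the number of degrees of freedom defaults to n − 1, which is an assumption.

```python
def reduced_chi2(i1, i2, sigma, dof=None):
    """Reduced chi-square between two profiles on a common q grid."""
    if dof is None:
        dof = len(i1) - 1  # assumed number of degrees of freedom
    return sum(((a - b) / s) ** 2 for a, b, s in zip(i1, i2, sigma)) / dof

def chi2_against_first(frames, sigma):
    """Chi-square of each later exposure relative to the first exposure."""
    first = frames[0]
    return [reduced_chi2(first, frame, sigma) for frame in frames[1:]]
```

The resulting list, one value per subsequent exposure, is what would then be passed to the linear regression test as a function of exposure number.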
2.4.3. Detecting changes in maximum particle dimension
Radiation exposure can alter the surface and structural properties of proteins as well as solution properties affecting interparticle interactions, all of which may influence the effective maximum particle dimension (D_{max}) of a protein. SAXS data can be used to estimate D_{max} from the pair distribution function [P(r)] using the program GNOM (Semenyuk & Svergun, 1991). A Fourier transform is used to evaluate P(r) from I(q) according to the equation

$$P(r) = \frac{r}{2\pi^{2}}\int_{0}^{\infty} q\,I(q)\sin(qr)\,\mathrm{d}q. \tag{3}$$
In practice data cannot be collected from zero to infinity and an indirect Fourier transform method is instead used (Glatter, 1977) with D_{max} selected such that the resulting P(r) decays smoothly to zero without significant oscillations or systematic deviations in the curve. Typically, a predicted D_{max} of 3.5R_{g} is used as a starting point and the value is increased or decreased until a suitable value is identified. In addition to calculating P(r), GNOM also calculates the inverse Fourier transform to analyze how well the resulting I(q) fits the scattering profile. This is an important step to determine the most accurate values of D_{max} and P(r). In the program DATGNOM (Petoukhov et al., 2012) this process has been automated and a series of perceptual criteria such as oscillation, stability and deviation of the fitted I(q) versus the experimental I(q) are used to select the optimal D_{max}. Exposures showing evidence of radiation damage based on their D_{max} are identified and removed from averaging.
2.4.4. Detecting changes in R_{g} and I(0)
Increases in the average size of particles in solution can increase the measured R_{g} and I(0). R_{g} and I(0) are examined as a function of exposure number. R_{g} can be calculated from scattering data using two independent methods: a Guinier analysis and via the pair distribution function P(r). The first method calculates R_{g} from the Guinier approximation. For a monodisperse solution of globular particles, the plot of ln I(q) versus q^{2} is linear in the low-resolution regime where q < 1.3/R_{g} (Guinier & Fournet, 1955). In this region the slope of the line passing through the data is related to R_{g},

$$\ln I(q) = \ln I(0) - \frac{R_{g}^{2}}{3}q^{2}. \tag{4}$$

The value of R_{g} calculated from the Guinier region varies with user interpretation and expertise. R_{g} is calculated from the slope of the line through the data points in the Guinier region, but the extent of the Guinier region itself depends on R_{g}. The value of R_{g} is therefore typically determined through an iterative cycle of calculating R_{g}, adjusting the Guinier region accordingly and recalculating R_{g}, repeated for several cycles. The final Guinier region and the calculated value of R_{g} are determined through a somewhat subjective interpretation of what is an acceptable linear region. This becomes particularly difficult when particles are large, resulting in very few data points and a heightened sensitivity of the estimated R_{g} to changes in the Guinier region of as little as one data point. This procedure is automated in the program AutoRg (Petoukhov et al., 2012), which determines the Guinier region by fitting several slightly different regions, calculating the R_{g} for each region and evaluating how the R_{g} changes as a function of additional data points, and then accepts the region that minimizes the variance in R_{g}. Cases that contain few points in the Guinier region as a result of a large R_{g}, or data that are particularly noisy, are inherently difficult to treat using this approach. Methods to utilize non-ideal data with a high success rate are desirable, but it is critical to ensure that the Guinier region and R_{g} are determined accurately.
To minimize subjectivity in estimating the Guinier region, we employ an independent method to determine the R_{g}. In the previous section we described the determination of the maximum particle dimension and the pair distribution function calculated using DATGNOM. Using this P(r) one can calculate the R_{g} of the particle from

$$R_{g}^{2} = \frac{\int_{0}^{D_{\max}} r^{2}P(r)\,\mathrm{d}r}{2\int_{0}^{D_{\max}} P(r)\,\mathrm{d}r}, \tag{5}$$
where D_{max} is the maximum particle dimension, r is the interatomic distance and P(r) is the pair distribution function (Putnam et al., 2007). While the determination of P(r) still requires an estimation of D_{max}, there is an advantage to calculating R_{g} using this method. In the Guinier approximation, even slight modifications to the Guinier region by as little as a few data points can result in a significantly different R_{g}. Using (5), the R_{g} is estimated from the pair distribution function, which in turn has been calculated using all available data points in I(q), greatly exceeding the number of data points in a Guinier region and incorporating information from all regions of reciprocal space. While small errors in the estimation of D_{max} may alter P(r) slightly, they have little effect on integration over all r, thus providing a robust calculation of R_{g} (Jacques & Trewhella, 2010). Once we have determined the R_{g} by this method, we use this R_{g} to determine the Guinier region.
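The real-space estimate in (5) reduces to two numerical integrals over a tabulated P(r). The sketch below uses simple trapezoidal integration; the function names are hypothetical, and the P(r) table is assumed to run from r = 0 to r = D_{max}, as in the output of an indirect Fourier transform program.

```python
import math

def trapezoid(x, y):
    """Trapezoidal integral of tabulated y(x)."""
    return sum(0.5 * (y[j] + y[j + 1]) * (x[j + 1] - x[j])
               for j in range(len(x) - 1))

def rg_from_pr(r, p):
    """Radius of gyration from a tabulated pair distribution function."""
    second_moment = trapezoid(r, [ri * ri * pi for ri, pi in zip(r, p)])
    return math.sqrt(second_moment / (2.0 * trapezoid(r, p)))
```

A convenient check is a homogeneous sphere of radius R, for which P(r) is proportional to r^{2}[1 − 3r/(4R) + r^{3}/(16R^{3})] on 0 ≤ r ≤ 2R and R_{g}^{2} = 3R^{2}/5.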
Until now, we have considered the upper limit of the Guinier region to be 1.3/R_{g}. Occasionally, data at very low resolution close to the beam stop can be influenced by external factors such as parasitic scatter and divergence in the beam (Wignall et al., 1990; Li et al., 2012). To alleviate the adverse effects of such factors, it is advantageous to select a minimum cutoff for q such that the desired information about shape and size is not lost or distorted. The minimum q value required to accurately restore the size and shape information present in the form factor (see §2.5.1) is given by the Shannon sampling theorem, which states that the information content in the continuous function I(q) can be represented by its values on a discrete set of points termed Shannon channels (Svergun & Koch, 2003), such that

$$qI(q) = \sum_{k=1}^{\infty} q_{k}I(q_{k})\left[\frac{\sin D_{\max}(q - q_{k})}{D_{\max}(q - q_{k})} - \frac{\sin D_{\max}(q + q_{k})}{D_{\max}(q + q_{k})}\right], \tag{6}$$
where q_{k} = kπ/D_{max}. The number of parameters required to represent I(q) over an interval (q_{min}, q_{max}) is given by the number of Shannon channels

$$N_{s} = \frac{D_{\max}(q_{\max} - q_{\min})}{\pi}, \tag{7}$$
where q_{max} and q_{min} here refer to the highest and lowest resolutions collected in the experiment, respectively. This constrains q_{min} such that its value does not exceed the first Shannon channel, i.e. that

$$q_{\min} \le \frac{\pi}{D_{\max}}. \tag{8}$$
By utilizing this boundary as the lower limit of the Guinier region, q_{min,G}, and our previously described upper boundary of q_{max,G} = 1.3/R_{g}, we limit the Guinier region to the interval

$$\frac{\pi}{D_{\max}} \le q \le \frac{1.3}{R_{g}}. \tag{9}$$
After selecting the data points in this Guinier region, we calculate R_{g} using the Guinier method. This second, independent measure of R_{g} is compared against the value estimated from P(r) (Putnam et al., 2007). If the R_{g} from the Guinier approximation differs significantly from the R_{g} calculated using the pair distribution function, then there may be additional interparticle interactions affecting only the low-resolution data.
In addition to the slope of the fit to the Guinier region, we can also obtain the scattering intensity extrapolated to q = 0. This extrapolated intensity, I(0), is directly proportional to the square of the number of electrons in the particle and is therefore related to the molecular weight (Putnam et al., 2007). The R_{g} and I(0) are subsequently analyzed and the linear regression t-statistic is used to reject data that show changes as a function of exposure indicative of potential radiation damage.
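A minimal sketch of this procedure is given below: the Guinier interval is fixed by D_{max} and by the R_{g} obtained from P(r), and R_{g} and I(0) are then taken from an unweighted least-squares line through ln I(q) versus q^{2}. The function names are hypothetical, experimental errors are ignored for brevity and the choice to require at least three points is an assumption.

```python
import math

def shannon_channels(d_max, q_min, q_max):
    """Number of Shannon channels covered by the measured q range."""
    return d_max * (q_max - q_min) / math.pi

def guinier_fit(q, intensity, rg_pr, d_max):
    """Fit ln I versus q^2 on pi/D_max <= q <= 1.3/R_g(P(r)).

    Returns (R_g, I(0), number of points used).
    """
    lo, hi = math.pi / d_max, 1.3 / rg_pr
    pts = [(qi * qi, math.log(ii)) for qi, ii in zip(q, intensity)
           if lo <= qi <= hi and ii > 0]
    if len(pts) < 3:
        raise ValueError("too few points in the Guinier interval")
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    slope = (sum((x - mx) * (y - my) for x, y in pts)
             / sum((x - mx) ** 2 for x, _ in pts))
    if slope >= 0:
        raise ValueError("non-negative Guinier slope")
    intercept = my - slope * mx
    return math.sqrt(-3.0 * slope), math.exp(intercept), n
```

The returned point count makes explicit how narrow this interval can become for compact particles, which is the difficulty discussed above.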
2.5. Interparticle interactions
2.5.1. Concentration dependence
The intensity data collected by SAXS experiments are given by

$$I(q) = F(q)\,S(q), \tag{10}$$
where F is the form factor of the particle and S is the structure factor. Ideally, particles in solution act independently of one another, exhibiting no interparticle effects, resulting in a structure factor of unity. Increasing the X-ray exposure time improves signal to noise at the expense of radiation damage. Another variable that can be used to increase the signal to noise is the protein concentration. However, as the concentration of the particle in solution is increased, the average distance between individual particles decreases, and therefore the likelihood of their interaction increases. Variations in electrostatic charge or hydrophobic regions distributed across the surface of the particle can result in either attractive or repulsive forces between neighboring particles with increased concentration, effects that are dependent on the surrounding chemical environment. These interparticle interactions directly affect the structure factor in (10), causing it to deviate from unity. This results in a breakdown of the assumption that I(q) collected in a SAXS experiment can be treated as the form factor, which contains the size and shape information desired. It is thus important that no interparticle interactions are present in the course of a SAXS experiment.
We use a method similar to the evaluation of multiple exposures as a function of radiation dose to monitor for the presence of interparticle interactions, evaluating the SAXS-derived parameters as a function of concentration. A minimum of three concentrations is required for linear regression analysis. If the t-statistic shows a significant trend, this suggests that interparticle interactions exist and are correlated with sample concentration. Interparticle interactions may be alleviated by lowering the sample concentration or by modifying the solution conditions.
If very slight levels of interparticle interactions are detected, then either the data can be merged from high and low concentrations or zero extrapolation can be used to generate a curve with sufficient signal to noise while alleviating the influence of interparticle interactions (Petoukhov et al., 2012).
2.5.2. Scaling SAXS profiles
An important step in assessing the linear regression for a series of data points is determining the independent variable. In the case of testing for concentration dependence, this step is non-trivial. The independent variable is no longer the exposure number but is the concentration of the particle in solution. Multiple concentrations of a sample are typically prepared either by performing serial dilutions, which aids in creating a linearly dependent set of concentrations, or individually. However, in practice it is often the case that errors in estimated concentration occur. Examples of this include inaccurate initial protein concentration values, the standard deviation associated with dispensing microlitre volumes of liquid and effects such as solvent evaporation over time. All of these can contribute to unpredictable changes in the concentration. To some extent this can be alleviated through the use of in-line UV spectroscopy; however, this is not always available.
To determine an appropriate abscissa for each concentration, we evaluate each concentration in a sample series on a relative scale, i.e. one that is not dependent on knowing the absolute solution concentration in mg ml^{−1}. However, this is a non-trivial task. Simply dividing the I(0) of each concentration by the I(0) of the lowest concentration has the flaw that I(0) is also dependent upon the molecular weight of the particle in solution, and if interparticle interactions are occurring as a result of increasing concentration then the relative scaling factors will be skewed. Scaling the scattering profiles using the full q range is likely to be more accurate but also suffers from increased noise and buffer-subtraction inaccuracies that become more pronounced at higher resolutions.
We select from the data a region of 50 data points beginning at q = 0.07 Å^{−1} as this region (in the case of the experiments described here) should not experience distortion from small changes in interparticle interactions for most proteins while still being at a low enough resolution to avoid the region most sensitive to low signal to noise. Each data point in the scaling region for each concentration is divided by the corresponding data point in the first concentration, yielding a list of ratios. This list is sorted from least to greatest and the median ratio is selected as the scale factor for that concentration. Lastly, the first concentration is given an abscissa of 1, while each additional concentration is given an abscissa equal to its corresponding scale factor. These abscissae are now used as the regressors in the linear regression analysis to detect changes as a function of concentration.
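The median-ratio scaling described above is straightforward to express in code. The sketch below is illustrative only: the function name is hypothetical, all profiles are assumed to share one q grid, and for an even number of ratios `statistics.median` returns the mean of the two central values, which is an assumption about how the median is taken.

```python
import statistics

def scale_factor(q, i_first, i_other, q_start=0.07, n_points=50):
    """Median of point-wise intensity ratios in the scaling region.

    The region is the block of n_points data points that begins at the
    first q value not smaller than q_start (in inverse angstroms).
    """
    start = next(k for k, qk in enumerate(q) if qk >= q_start)
    window = slice(start, start + n_points)
    ratios = [b / a for a, b in zip(i_first[window], i_other[window])]
    return statistics.median(ratios)
```

The first concentration then takes an abscissa of 1 and every other concentration takes its `scale_factor` value; using the median rather than the mean keeps a few badly subtracted points from shifting the scale.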
2.5.3. Detecting changes in R_{g}, D_{max} and I(0)
Interparticle interactions or changes in particle size as a function of concentration may manifest as changes in R_{g} and D_{max}. Firstly, the pair distribution function is used to calculate R_{g} and D_{max} using DATGNOM. These values are subsequently used to determine the Guinier region, from which (4) is used to calculate R_{g} according to the Guinier approximation after using the least-squares method to best fit the data. Three different Guinier regions are used to calculate R_{g}. All three Guinier regions terminate with a maximum value of q_{max,G} = 1.3/R_{g}, where R_{g} has been calculated from the P(r) distribution. However, each of the three regions utilizes a different minimum q value. The first Guinier region uses all available data points less than q = 1.3/R_{g}. However, as mentioned above, data points at the lowest q values closest to the beam stop may suffer from parasitic scatter. Another Guinier region, previously described by (9), uses the maximum particle dimension to determine the minimum q value necessary to reconstruct the continuous I(q) function from a discrete set of points, and is defined as q_{min,G} = π/D_{max}. However, a potential drawback of using this Guinier region is that depending on the particle size and shape this region may contain very few data points, making it more susceptible to inaccuracies in R_{g} calculation owing to noise. To alleviate the potential problems associated with both Guinier regions, we have utilized an additional Guinier region that may ensure the most accurate Guinier estimates in these cases. This region is defined by q_{min,G} = 0.65/R_{g}, i.e. half of q_{max,G}, which for most particle sizes and experimental setups falls between the minimum q value collected and the theoretical minimum q value required by information theory. To summarize, the three Guinier regions evaluated are

$$q_{\min} \le q \le \frac{1.3}{R_{g}}, \qquad \frac{\pi}{D_{\max}} \le q \le \frac{1.3}{R_{g}}, \qquad \frac{0.65}{R_{g}} \le q \le \frac{1.3}{R_{g}}. \tag{11}$$

Interparticle interactions may also result in changes in I(0), which is proportional to the square of the number of electrons in a particle and hence related to the molecular weight. Similarly to R_{g}, I(0) can also be determined from the real-space P(r) function according to

$$I(0) = 4\pi\int_{0}^{D_{\max}} P(r)\,\mathrm{d}r. \tag{12}$$

For each concentration (after scaling) the I(0) is calculated using both (12) and the y intercept of the linear regression in the Guinier approximation. In total, four R_{g} values and four I(0) values are calculated [one from the real-space P(r) and one from each of the three Guinier regions]. These eight values, along with D_{max}, are each analyzed as a function of concentration (using the scale factors described in the previous section) and tested for concentration dependence using the linear regression t-test. The resulting likelihood of dependence is expressed as a p-value and reported.
2.5.4. Detecting changes in particle volume
Occasionally, owing to the shape of a particle, increases in particle size may not significantly or detectably alter R_{g} or D_{max}. Another measure of particle size is the particle volume. This value can be calculated from a SAXS profile and is based on the observation by Porod that globular particles that have a sharp interface between the surface and the solvent display a decay in intensity in the high-resolution region proportional to q^{−4} (Porod, 1951). Porod found that the volume of the particle could be calculated according to the equation

$$V = \frac{2\pi^{2}I(0)}{Q}, \tag{13}$$

where V is the Porod volume and Q is the Porod invariant such that

$$Q = \int_{0}^{\infty} q^{2}\left[I(q) - k\right]\mathrm{d}q, \tag{14}$$
where k is a constant subtracted to ensure that the asymptotic intensity decays proportionally to q^{−4}. Experimentally, one cannot collect data from zero to infinity and instead Q is estimated from the convergence at high q values. This calculation is provided by the program DATPOROD (Petoukhov et al., 2012) and requires the output from DATGNOM, described in the previous section. The Porod volume is directly proportional to the molecular weight of the protein, which can be estimated assuming a typical density of 1.37 g cm^{−3} for globular particles according to

$$\mathrm{MW} = \rho N_{\mathrm{A}} V, \tag{15}$$

where MW is the molecular weight in daltons, ρ is the assumed particle density and N_{A} is Avogadro's number (Rambo & Tainer, 2011). The molecular weight is tested for concentration dependence, and can be compared with the predicted molecular weight for consistency and to check for the presence of oligomers.
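The arithmetic behind the volume and molecular-weight estimate can be sketched as follows. This is not the DATPOROD algorithm: the function names are hypothetical, the invariant is simply integrated by the trapezoidal rule over whatever q range is supplied (no high-q extrapolation, and k must be provided by the caller), q is assumed to be in Å^{−1} so that the volume is in Å^{3}, and the density is converted using 1 Å^{3} = 10^{−24} cm^{3}.

```python
import math

AVOGADRO = 6.02214076e23  # particles per mole

def porod_volume(q, intensity, i0, k=0.0):
    """Porod volume 2*pi^2*I(0)/Q with Q integrated over the given q range."""
    f = [qi * qi * (ii - k) for qi, ii in zip(q, intensity)]
    invariant = sum(0.5 * (f[j] + f[j + 1]) * (q[j + 1] - q[j])
                    for j in range(len(q) - 1))
    return 2.0 * math.pi ** 2 * i0 / invariant

def molecular_weight(volume_a3, density=1.37):
    """Molecular weight in daltons from a volume in cubic angstroms,
    assuming the given density in g cm^-3."""
    return volume_a3 * 1e-24 * density * AVOGADRO
```

With the density quoted above the conversion amounts to roughly 0.825 Da per Å^{3}, so a Porod volume of 10 000 Å^{3} corresponds to about 8.25 kDa under these assumptions.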
2.5.5. Estimating concentration
Determining the likelihood of concentration dependence does not require knowledge of the absolute concentration; however, calculating the impact of concentration dependence, i.e. the slope of the regression, does. To quantify the degree of concentration effects, we estimate the concentration of each sample directly from the data. To obtain an estimate of the concentration of a sample, calibration can be performed using the scattering of water or of a model protein of known concentration such as lysozyme or glucose isomerase. These methods are accurate to ∼10% (Mylonas & Svergun, 2007; Orthaber et al., 2000). These data, along with I(0) calculated from the P(r) distribution and the scaling procedure described above, allow an estimation of the absolute concentration required to quantify the impact of concentration dependence. The data described here were calibrated using water scattering as the standard.
2.6. Evaluating linearity in the Guinier region
In discussing radiationdamage indicators, we described in detail the method for properly estimating the Guinier region for calculating R_{g}. While this is an effective method, it does so regardless of the linearity of the data in the Guinier region. Linearity in the Guinier region is an important prerequisite to ensure monodispersity (Jacques & Trewhella, 2010). If the data here are nonlinear, this suggests that interparticle interactions or aggregation are present in the sample.
To supplement our current analysis, we evaluate whether or not the data in the Guinier region are linear for each concentration. After determining the three intervals for the Guinier region, the method of least squares is applied to fit each block of three consecutive data points, the minimum required to calculate a linear regression. The slope of the line through these three points is calculated, the block of points is shifted by one data point and the procedure is repeated (Fig. 1). Next, a linear regression is calculated for the set of slopes, i.e. the slope of the slopes (Fig. 1, inset). If the region is linear then the slope of each consecutive block of three points should be constant and independent of the location of the block in the Guinier region. Using the t-statistic, we calculate a p-value for the likelihood that the trend is significant and therefore that the Guinier region is nonlinear. The slope of the linear regression is used to determine whether the interparticle interactions are attractive or repulsive. If the slope of the regression is positive, this suggests that the interactions are attractive. If the slope of the regression is negative, the interactions are repulsive.
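The sliding-block procedure can be sketched as follows. This is an illustrative reimplementation in Python rather than the SAXStats shell code: the function names are hypothetical, the slopes are regressed against the block index (an assumption, since the abscissa of the second regression is not specified here), and the two-sided p-value of Student's t distribution is obtained by numerical integration so that only the standard library is needed.

```python
import math


def linear_fit(x, y):
    """Ordinary least squares; returns (slope, intercept, Sxx)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxx


def t_two_sided_p(t, dof, steps=2000):
    """Two-sided p-value for Student's t (Simpson's rule, u = sqrt(dof)*tan(theta))."""
    coef = math.exp(math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)) / math.sqrt(math.pi)
    theta_max = math.atan(abs(t) / math.sqrt(dof))
    h = theta_max / steps
    total = 1.0 + math.cos(theta_max) ** (dof - 1)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * math.cos(i * h) ** (dof - 1)
    return max(0.0, 1.0 - 2.0 * coef * total * h / 3.0)


def guinier_linearity(q, intensity):
    """Return (trend, p): the slope of the block slopes and its p-value."""
    x = [qi * qi for qi in q]               # Guinier abscissa q^2
    y = [math.log(i) for i in intensity]    # Guinier ordinate ln I(q)
    # slope of every block of three consecutive points, shifted by one point
    slopes = [linear_fit(x[k:k + 3], y[k:k + 3])[0] for k in range(len(x) - 2)]
    if len(slopes) < 3:
        raise ValueError("at least five data points are required")
    index = list(range(len(slopes)))
    trend, intercept, sxx = linear_fit(index, slopes)
    resid = sum((s - (intercept + trend * i)) ** 2 for i, s in zip(index, slopes))
    dof = len(slopes) - 2
    se = math.sqrt(resid / dof / sxx)       # standard error of the trend
    if se == 0.0:
        return trend, (1.0 if trend == 0.0 else 0.0)
    return trend, t_two_sided_p(trend / se, dof)


def interaction_type(trend, p, alpha=0.05):
    """Positive trend: attractive; negative trend: repulsive (as in the text)."""
    if p > alpha:
        return "none detected"
    return "attractive" if trend > 0 else "repulsive"
```

With real, noisy data the p-value reflects whether the drift in the block slopes exceeds their scatter; for noise-free synthetic data that are perfectly linear the result is dominated by rounding error and is not meaningful.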
2.7. Robustness
To help the analysis succeed across a wide range of data quality, we have included an optional mechanism for outlier detection. R_{g} estimated from the pair distribution function is the parameter used to detect outliers, since it is one of the most robustly determined parameters (Jacques & Trewhella, 2010). We have utilized the modified z-score methodology (NIST/SEMATECH e-Handbook of Statistical Methods, 2012) of outlier detection with a cutoff of 3.5. This method is particularly robust with small sample sizes as it uses the median of the distribution instead of the mean, and is therefore less likely to be skewed by outliers. In the present study we have enabled outlier detection for radiation-damage analysis only, and have disabled it for the analysis of multiple concentrations since all three concentrations are required to assess concentration dependence.
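The modified z-score is defined as M_{i} = 0.6745(x_{i} − median)/MAD, where MAD is the median absolute deviation from the median. A minimal sketch of the test with the cutoff of 3.5 is given below; the function names and the example R_{g} values are hypothetical, and the handling of a zero MAD is a simplification chosen for this illustration.

```python
from statistics import median


def modified_z_scores(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; no score can be formed.
        # Returning zeros (flagging nothing) is a simplification.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, cutoff=3.5):
    """True for every value whose absolute modified z-score exceeds the cutoff."""
    return [abs(z) > cutoff for z in modified_z_scores(values)]


if __name__ == "__main__":
    # Hypothetical R_g values (in angstroms) from eight exposures
    rg = [20.1, 20.0, 20.2, 19.9, 20.1, 20.0, 20.3, 25.0]
    print(flag_outliers(rg))
```

Because both the centre and the spread are medians, a single aberrant exposure barely shifts the scores of the remaining exposures, which is what makes the test usable with only eight values.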
2.8. Scripts
The basic data-analysis steps described have been coded into a script called SAXStats, which makes use of many of the programs provided in the ATSAS package (Petoukhov et al., 2012). SAXStats has been written in shell language and, although designed around the protocols and data format output by beamline 4-2 at SSRL, is readily available from the authors for adoption, adaptation or the production of a more generally applicable version. This has been applied to the data described in §2.2 and the results are described below.
3. Results
For all 27 samples studied, SAXStats successfully calculated all parameters and analyzed each sample for radiation damage and concentration dependence. To rapidly visualize the results of the analysis we introduce a new plot, termed the correlation frequency plot, to describe graphically the information presented in tabular form in the Supporting Information. Here, the sample ID is on the vertical axis, while the number of parameters with a given p-value is shown on the horizontal axis. The likelihood of a correlation being present is determined by the p-value, which is identified by color as unlikely (green, p > 0.20), possible (yellow, 0.05 < p ≤ 0.20) or probable (red, p ≤ 0.05). For example, in Fig. 2 sample 1 had one parameter identified as unlikely to be affected by radiation damage, two parameters identified as possibly damaged and two parameters that are probably affected by radiation damage. The specific information about which parameters are affected can be found in the Supporting Information. For the concentration-dependence analysis, the degree of correlation is presented alongside the graph as the median impact of concentration dependence in units of percent per mg ml^{−1}. For each parameter, the impact of the concentration dependence is given by the slope of the regression, and the median impact for all parameters collectively is then calculated. For example, in Fig. 3 sample 1 shows a median impact of concentration dependence of 3.7% per mg ml^{−1} of solution, meaning that each parameter changed by approximately 3.7% for every mg ml^{−1} increase in concentration; however, since nearly all of the parameters are green, the impact of each was determined to be insignificant given the standard errors in the parameters.
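The bookkeeping behind one row of a correlation frequency plot can be sketched as follows. The p-value thresholds are those given above; everything else is an assumption made for illustration: the function names are hypothetical, the regression slope is converted to a percentage by dividing by a reference value of the parameter (for example its mean over the series), and the median is taken over absolute impacts.

```python
from statistics import median


def classify_p(p):
    """Map a p-value onto the three likelihood classes used in the plot."""
    if p <= 0.05:
        return "probable"   # red
    if p <= 0.20:
        return "possible"   # yellow
    return "unlikely"       # green


def percent_impact(slope, reference_value):
    """Regression slope as a percentage of a reference parameter value.

    Assumption: the reference is a representative value of the parameter,
    giving units of percent per mg/ml for a concentration series.
    """
    return 100.0 * slope / reference_value


def summarize_sample(results):
    """results maps parameter name -> (p_value, slope, reference_value).

    Returns the class counts and the median absolute impact.
    """
    counts = {"unlikely": 0, "possible": 0, "probable": 0}
    impacts = []
    for p, slope, reference in results.values():
        counts[classify_p(p)] += 1
        impacts.append(abs(percent_impact(slope, reference)))
    return counts, median(impacts)


if __name__ == "__main__":
    # Hypothetical regression results for one sample
    sample = {
        "Rg": (0.50, 0.2, 20.0),
        "I0": (0.03, 0.6, 5.0),
        "Dmax": (0.10, -1.4, 70.0),
    }
    print(summarize_sample(sample))
```

Reporting the class counts and the median impact together keeps the two questions separate: whether a trend is statistically supported, and whether it is large enough to matter.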
3.1. Radiation-damage analysis
The likelihood of radiation damage affecting all five parameters analyzed [i.e. R_{g} via Guinier, R_{g} via P(r), I(0), D_{max} and the similarity of each exposure to the first exposure, χ^{2}] for all eight exposures for each of the protein samples is shown in the correlation frequency plot, i.e. a plot showing the total number of parameters for which a correlation has been found between the parameter and exposure number (Fig. 2). For brevity, only the highest concentration is shown. While several parameters analyzed do appear to be affected by radiation damage (Supplementary Table S1), in nearly all cases the effect averaged less than 1% (Supplementary Table S2), demonstrating that the high-throughput experimental protocol is sufficient for collecting data while minimizing radiation damage and that most samples could endure even more exposure before experiencing deleterious effects caused by radiation damage.
3.2. Interparticle interactions
The results from the concentration-series analysis for all samples are shown in a correlation frequency plot in Fig. 3 and Supplementary Tables S3 and S4. For many samples some parameters are affected by concentration dependence; the impact of this effect is usually less than 5% per mg ml^{−1}. However, in cases such as samples 24 and 27 not only is concentration dependence detected in multiple parameters, but the effect is large relative to other samples, showing an impact of 5.8 and 8.9% per mg ml^{−1}, respectively. Nevertheless, this dependence did not appear to affect modeling, as shown by the agreement with high-resolution structural data (Grant et al., 2011).
For sample number 11, several parameters were possibly (yellow) or probably (red) affected by concentration dependence. Additionally, the I(0) values determined from P(r) and each of the three Guinier regions were impacted by a factor of more than 12% per mg ml^{−1} with possible (yellow) or probable (red) likelihood of concentration dependence (Fig. 4, Supplementary Tables S3 and S4) even though other parameters such as R_{g} and D_{max} were not as greatly impacted (∼1%). The large increase in I(0), which was not reflected in the particle dimension, suggests that the average molecular weight of the particles in solution is increasing while having little effect on the D_{max} and R_{g} (§2.5). Similarly, the Porod volume showed an increase of 8% per mg ml^{−1} and a possible likelihood of concentration dependence. Sample 11 was one of two samples (along with sample 4) which in our previous study was shown to exist as a mixture of dimers and tetramers in solution. The concentration-dependent increase in size detected by SAXStats for sample 11 may therefore reflect a small shift in population from dimer to tetramer.
The evaluation of nonlinearity in each of the three Guinier regions revealed that nonlinearity was occasionally detected in Guinier region 1, the widest region encompassing data at the very lowest q values, and was rarely detected in Guinier regions 2 or 3 (Fig. 5). This suggests that the nonlinearity identified in Guinier region 1 is not the result of interparticle interactions but is only an artifact of slight parasitic scatter closest to the beam stop.
4. Discussion
The development of the basic data-quality analysis and the SAXStats script has been carried out on a unique set of SAXS data where we have extremely well characterized samples (a necessary requirement before collecting SAXS data) and structural data from crystallographic, NMR or a combination of both methods (Grant et al., 2011). The parameters and the molecular envelopes produced from these data in a non-automated manner have been validated by comparison to the known structures. Using SAXStats, we confirmed our experimental strategy to reduce radiation-damage effects and the sensitivity of our methods to detect them, and we detected small concentration dependences that had not initially been noted. These did not affect modeling or conclusions in the initial analysis, as shown by the agreement with high-resolution structural data, but serve to highlight the sensitivity of the techniques employed by the SAXStats script. Interestingly, we were also able to identify the case of a mixture from the initial data in sample 11 without resorting to supplementary structural knowledge. The results of the automated SAXStats analysis showed that most values were within 5%, and all were within 10%, of the manually determined values. Similar to the manual analysis, when compared with high-resolution structural information from X-ray crystallography or NMR, the average difference in R_{g} was less than 1 Å, showing that the SAXS data agreed well with the high-resolution data. The manual analysis was time-consuming and could not be performed with this level of statistical rigor in real time at the beamline. In comparison, the computational time required for SAXStats to be run on these samples, using a single 2.53 GHz Intel processor, was only 31 min, an average time of approximately 1 min per sample. This could be significantly reduced with more efficient coding. However, even as it stands, the reduction in processing time using SAXStats compared with manual analysis makes high-throughput real-time SAXS analysis a reality.
Our high-throughput protocol is designed to use the minimum necessary sample and the minimum beam time. Eight short exposures are used at each of three concentrations. Exploring the results in detail shows that a cutoff of p < 0.05 to identify radiation damage works well for these eight short exposures. Additionally, p < 0.05 is also likely to successfully identify nonlinearity in most cases since the Guinier region is likely to include a sufficient quantity of data points for well planned experiments and instrumentation setups. Using only three concentrations decreases the likelihood of identifying concentration-dependent effects, but the high number of parameters analyzed proved to be sufficient to accomplish this successfully. While using a greater number of concentrations is preferable, we have shown that it is possible to combine both the slope of the regression and the p-value to correctly assess the likelihood of concentration dependence and its impact on modeling and conclusions. The same analysis applied to more thorough data collection will only improve the accuracy of the results.
We note that the current analysis works most effectively for globular particles. Some parameters, such as D_{max} and the Porod volume, are inherently determined less accurately for highly elongated particles or for those with large disordered regions since the equations used assume globularity. Additionally, the Guinier method must be corrected for elongated or disk-shaped particles. However, given that the described method primarily detects changes in particle properties, the analyses of most parameters are still likely to be informative even in these cases. We note that the test set is small and ranges in molecular weight from ∼8.2 kDa (PDB entry 2kw9) to ∼48.5 kDa (PDB entry 3hxl). We are limited in fully testing our techniques by the availability of a robust data set for which both SAXS profiles and structural information exist. As these data become available the robustness of the analysis can be assessed and improvements, if necessary, are encouraged.
Historically, much SAXS data analysis has been performed 'by eye' in a highly subjective manner. In particular, the linearity of the Guinier region can be difficult to assess. The application of linear regression statistical analysis to identify radiation damage, concentration dependence and interparticle interactions provides quantitative data-quality metrics. This is particularly useful considering the growing user community for the technique. Wide-scale adoption of the methods employed in SAXStats (or similar analysis) will increase the reliability of subsequent information derived from SAXS data by quantitatively separating data that may mask unanticipated effects from data that can be accurately extrapolated to a single particle.
Currently, SAXStats is being integrated with the existing data-collection and processing software at SSRL BL4-2. This provides near-instantaneous sample-quality feedback and thereby allows almost instant identification of sample and experimental problems which can be addressed during the data collection. We employed this procedure at BL4-2 in a testing mode. It has worked well to alert us to problematic data, demonstrating the potential to increase the success rate of SAXS experiments utilizing valuable X-ray resources and subsequently the number of scientific results produced using these resources.
While SAXStats is currently only employed at SSRL BL4-2, the scripts and methodology that it employs can easily be adapted to virtually any data-analysis pipeline. SAXStats can be used to flag, for further manual analysis, problematic data that might otherwise go undetected. High-throughput pipelines will particularly benefit from the large increase in efficiency of data analysis. However, the methodology can be applied to individual systems as well. Not only does SAXStats enable the objective evaluation of data quality, it can also provide comparisons between varying solution conditions that may make the effects of solvent conditions on conformational changes, or oligomer organization, easier to identify. This in turn could be used to guide the sample into more desirable solution conditions, for example to increase solubility or monodispersity, or even to guide crystallization efforts. SAXStats does not replace expert analysis, but does flag those cases where a change in experimental design or more experience may be helpful to successfully perform the experiment and analyze the results.
5. Conclusions
The methods employed in the SAXStats protocol along with the results of the present analysis demonstrate not only that SAXS can be performed in high throughput but that the resulting data can also be analyzed in an objective, statistically significant and efficient manner. While improvements in the efficiency and applicability of the method can certainly be made in the future, no degree of automation can remove the necessity for human intervention and analysis when it comes to drawing accurate conclusions from the data. SAXStats is very sensitive to the effects of concentration dependence or radiation damage; however, the influence of these effects on modeling, interpretation and the ultimate determination of biological mechanism requires human insight. The simple methodology presented allows a significant amount of human intervention and subjective analysis of SAXS data to be supplemented with statistical, quantitative analyses. It provides a much-needed starting point to develop objective metrics that enable automation of data-quality assessment and opens up the technique to those more experienced in complementary data analysis.
Acknowledgements
We are extremely grateful to Dr Hiro Tsuruta who was a mentor and friend during this research. This research is supported by DTRA HDTRA10C0057, NIH R01 GM088396 and NSF 1231306. TG developed the mathematics and scripts used. TG, EHS and JL collected and processed the data. TW, LC and TM enabled the high-throughput SAXS data collection. All contributed to the manuscript. Portions of this research were carried out at the Stanford Synchrotron Radiation Lightsource, a Directorate of SLAC National Accelerator Laboratory and an Office of Science User Facility operated for the US Department of Energy Office of Science by Stanford University. The SSRL Structural Molecular Biology Program is supported by the DOE Office of Biological and Environmental Research and by the National Institutes of Health, National Institute of General Medical Sciences (including P41 GM103393) and the National Center for Research Resources (P41 RR001209). The samples used came from work supported in part by the Protein Structure Initiative of the National Institutes of Health, NIGMS grant U54 GM094597. We are grateful to Professor Gaetano Montelione of the Northeast Structural Genomics consortium for access to these samples and for useful comments during this work. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of any funding agency.
References
Bertone, P., Kluger, Y., Lan, N., Zheng, D., Christendat, D., Yee, A., Edwards, A. M., Arrowsmith, C. H., Montelione, G. T. & Gerstein, M. (2001). Nucleic Acids Res. 29, 2884–2898.
Blanchet, C. E., Zozulya, A. V., Kikhney, A. G., Franke, D., Konarev, P. V., Shang, W., Klaering, R., Robrahn, B., Hermes, C., Cipriani, F., Svergun, D. I. & Roessle, M. (2012). J. Appl. Cryst. 45, 489–495.
Carter, P., Liu, J. & Rost, B. (2003). Nucleic Acids Res. 31, 410–413.
Chen, L., Oughtred, R., Berman, H. M. & Westbrook, J. (2004). Bioinformatics, 20, 2860–2862.
Classen, S., Hura, G. L., Holton, J. M., Rambo, R. P., Rodic, I., McGuire, P. J., Dyer, K., Hammel, M., Meigs, G., Frankel, K. A. & Tainer, J. A. (2013). J. Appl. Cryst. 46, 1–13.
Davies, K. J. & Delsignore, M. E. (1987). J. Biol. Chem. 262, 9908–9913.
Franke, D., Kikhney, A. G. & Svergun, D. I. (2012). Nucl. Instrum. Methods Phys. Res. A, 689, 52–59.
Glatter, O. (1977). J. Appl. Cryst. 10, 415–421.
Goh, C.-S., Lan, N., Echols, N., Douglas, S. M., Milburn, D., Bertone, P., Xiao, R., Ma, L.-C., Zheng, D., Wunderlich, Z., Acton, T., Montelione, G. T. & Gerstein, M. (2003). Nucleic Acids Res. 31, 2833–2838.
Goulden, C. H. (1956). Methods of Statistical Analysis, 2nd ed. New York: Wiley.
Grant, T. D., Luft, J. R., Wolfley, J. R., Tsuruta, H., Martel, A., Montelione, G. T. & Snell, E. H. (2011). Biopolymers, 95, 517–530.
Guinier, A. & Fournet, G. (1955). Small-Angle Scattering of X-rays. New York: Wiley Interscience.
Jacques, D. A., Guss, J. M., Svergun, D. I. & Trewhella, J. (2012). Acta Cryst. D68, 620–626.
Jacques, D. A. & Trewhella, J. (2010). Protein Sci. 19, 642–657.
Kenney, J. F. & Keeping, E. S. (1962). Mathematics of Statistics, 3rd ed. Princeton: Van Nostrand.
Le Maire, M., Thauvette, L., de Foresta, B., Viel, A., Beauregard, G. & Potier, M. (1990). Biochem. J. 267, 431–439.
Li, Z., Li, D., Wu, Z., Wu, Z. & Liu, J. (2012). J. X-ray Sci. Technol. 20, 331–338.
Liu, J., Hegyi, H., Acton, T. B., Montelione, G. T. & Rost, B. (2004). Proteins, 56, 188–200.
Liu, J. & Rost, B. (2004). Proteins, 55, 678–688.
Luft, J. R., Collins, R. J., Fehrman, N. A., Lauricella, A. M., Veatch, C. K. & DeTitta, G. T. (2003). J. Struct. Biol. 142, 170–179.
Martel, A., Liu, P., Weiss, T. M., Niebuhr, M. & Tsuruta, H. (2012). J. Synchrotron Rad. 19, 431–434.
Murphy, B. M., Swarts, S., Mueller, B. M., van der Geer, P., Manning, M. C. & Fitchmun, M. I. (2013). Nature Methods, 10, 278–279.
Mylonas, E. & Svergun, D. I. (2007). J. Appl. Cryst. 40, s245–s249.
NIST/SEMATECH e-Handbook of Statistical Methods (2012). https://www.itl.nist.gov/div898/handbook/.
Orthaber, D., Bergmann, A. & Glatter, O. (2000). J. Appl. Cryst. 33, 218–225.
Perry, J. J. & Tainer, J. A. (2013). Methods, 59, 363–371.
Petoukhov, M. V., Franke, D., Shkumatov, A. V., Tria, G., Kikhney, A. G., Gajda, M., Gorba, C., Mertens, H. D. T., Konarev, P. V. & Svergun, D. I. (2012). J. Appl. Cryst. 45, 342–350.
Porod, G. (1951). Colloid Polym. Sci. 124, 83–114.
Putnam, C. D., Hammel, M., Hura, G. L. & Tainer, J. A. (2007). Q. Rev. Biophys. 40, 191–285.
Rambo, R. P. & Tainer, J. A. (2011). Biopolymers, 95, 559–571.
Semenyuk, A. V. & Svergun, D. I. (1991). J. Appl. Cryst. 24, 537–540.
Smolsky, I. L., Liu, P., Niebuhr, M., Ito, K., Weiss, T. M. & Tsuruta, H. (2007). J. Appl. Cryst. 40, s453–s458.
Svergun, D. I. & Koch, M. H. J. (2003). Rep. Prog. Phys. 66, 1735–1782.
Wignall, G. D., Lin, J. S. & Spooner, S. (1990). J. Appl. Cryst. 23, 241–245.
Wunderlich, Z., Acton, T. B., Liu, J., Kornhaber, G., Everett, J., Carter, P., Lan, N., Echols, N., Gerstein, M., Rost, B. & Montelione, G. T. (2004). Proteins, 56, 181–187.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.