BioXTAS RAW 2: new developments for a free open-source program for small-angle scattering data reduction and analysis

BioXTAS RAW is a free open-source program for reduction, analysis and modelling of small-angle scattering data. This article describes the new and improved features in RAW version 2, including new tools for liquid-chromatography coupled data processing, advanced reporting capabilities and a new API.

RAW's basic approach to finding a good buffer range is to scan a window of defined size along the measured profiles, and test each range (described below) to see if it is a valid buffer range. If no range is found, the window size is narrowed and the scan repeated until either a valid range is found or the minimum size is reached. RAW constrains the set of buffer ranges to test, both to avoid false positives and to improve the speed of the algorithm. Initially, it uses a peak finding algorithm (the find_peaks function in scipy) on a smoothed version of the intensity vs. frame data, which provides the position of all peaks in the dataset. If no peaks are found in the data, the algorithm starts the search from the first frame (earliest point in the elution) and proceeds from there. Assuming peaks are found, this defines the starting search range and window size. The initial window size is twice the width of the largest peak at 40% of its maximum intensity above the baseline. The initial search range starts to the left (early frame/time) of the first peak, and proceeds towards the start of the dataset (earliest frame/time) in a series of steps whose size depends on the size of the window. This prioritizes buffer measurements closer to the elution peak. For example, if RAW found a single peak at frame 100 with width 10 at 40% maximum intensity, the initial search window would be 20, and the step size would be 4. The ranges tested would be: 75-95, 71-91, 67-87, and so on until either a valid buffer range is found or the start of the dataset is reached.
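The window construction in the worked example can be sketched in a few lines of Python. This is an illustrative sketch only, not RAW's implementation: the function names, the step size of one fifth of the window, and the choice to end the first window half a peak width before the peak position are assumptions inferred from the numbers in the example above. The peak width is measured with scipy's peak_widths, which uses the peak prominence as a stand-in for the height above baseline.

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths


def largest_peak_width(intensity, rel_level=0.4):
    """Return (position, width) of the tallest peak, or None if no peak.

    The width is measured at rel_level of the peak prominence above its
    base. The input is assumed to be already smoothed.
    """
    intensity = np.asarray(intensity, dtype=float)
    peaks, _ = find_peaks(intensity)
    if len(peaks) == 0:
        return None
    # peak_widths measures down from the peak top, so 40% of the height
    # above the base corresponds to rel_height = 1 - 0.4.
    widths = peak_widths(intensity, peaks, rel_height=1.0 - rel_level)[0]
    tallest = int(np.argmax(intensity[peaks]))
    return int(peaks[tallest]), float(widths[tallest])


def buffer_candidate_ranges(peak_pos, peak_width, first_frame=0,
                            step_fraction=0.2):
    """List (start, end) windows to the left of a peak, nearest first.

    Assumptions (hypothetical, chosen to reproduce the worked example):
    window = 2 * width, step = window / 5, and the first window ends
    half a peak width before the peak position.
    """
    window = int(round(2 * peak_width))
    step = max(1, int(round(window * step_fraction)))
    end = int(round(peak_pos - peak_width / 2))
    ranges = []
    while end - window >= first_frame:
        ranges.append((end - window, end))
        end -= step
    return ranges


print(buffer_candidate_ranges(100, 10)[:3])  # [(75, 95), (71, 91), (67, 87)]
```

Each candidate window would then be passed to the validity test described below, stopping at the first window that passes.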
If a valid buffer range is not found for the initial window size, RAW narrows the window size and redoes the search. This is repeated until a valid range is found or a defined minimum window size is reached. If a valid range has not been found once the minimum window size is reached, the algorithm then searches for a buffer range in the data collected after the last elution peak, using the same range of window sizes and again starting with ranges closer to the peak and then testing those further away.
If a buffer range still is not found, RAW then searches ranges between the peaks.
The test for a valid buffer range has two inputs. The first is the total intensity (or mean intensity or intensity in a given q-range or at a particular q value, depending on user choice) vs. frames (or time) data, sometimes called the scattergram, and the second is the scattering profiles at each measured point in the elution. RAW evaluates a buffer range in three ways. In the first part of the test it calculates the Spearman rank-order correlation coefficient and associated p value for the intensity vs. frame and smoothed intensity vs. frame data in the specified range. The p value, though not technically valid for small ranges, is an indicator of whether there are correlations in the intensity.
Buffer scattering should have the same intensity at all measured points, so correlations are indicative of something eluting in the data (or an issue with the baseline), and RAW marks ranges with possible correlations as not valid.
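The first part of the test can be illustrated with scipy's spearmanr function. This is a minimal sketch rather than RAW's code: the function name and the 0.05 significance threshold are assumptions, and only the unsmoothed intensity is tested here.

```python
import numpy as np
from scipy.stats import spearmanr


def intensity_is_uncorrelated(frames, intensity, p_threshold=0.05):
    """Return True if intensity shows no significant monotonic trend.

    p_threshold is a hypothetical cutoff; the source does not state the
    value RAW uses. A perfectly constant input gives an undefined (NaN)
    p value, which this sketch treats as a failure.
    """
    rho, p_value = spearmanr(frames, intensity)
    return bool(p_value > p_threshold)


frames = np.arange(20)
drifting = 1.0 + 0.01 * frames           # steadily rising baseline
flat = np.tile([1.0, 1.1], 10)           # fluctuates without a trend
print(intensity_is_uncorrelated(frames, drifting))  # False
print(intensity_is_uncorrelated(frames, flat))      # True
```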
In the second part of the test RAW checks the similarity of the scattering profiles in the selected range compared to the scattering profile with the median total intensity in the range, using the CORMAP test. The algorithm tests three different q ranges: the full q range, the low q range (first 100 points) and the high q range (last 100 points). A p value calculated for each range determines if all profiles are similar across each tested q range. RAW uses three different q ranges to test for different artifacts in the data. Changes at low q may be indicative of capillary fouling or other unwanted damage effects on the data. Changes at high q may come from beam or temperature drift (though these can also show up at low q), while changes in the full profile act as a catch-all metric. Because CORMAP relies on the number of outlier points vs. the total number of points to generate a p value, doing the test on a smaller number of points makes it more sensitive to a few outliers, such as a small number of points changing at the lowest q values that might indicate capillary fouling. RAW marks ranges where any of the buffer profiles are different from the median profile in any of these three q ranges as not valid.
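To give a feel for how a pairwise similarity p value of this kind can be computed, the sketch below implements a simplified run-length test in the spirit of CORMAP. It is not the CORMAP implementation used by RAW or ATSAS. The assumption here is that two profiles are compared through the longest run of same-sign differences, and that the p value is the probability of a run at least that long in the same number of fair coin tosses. All function names are hypothetical.

```python
import numpy as np


def longest_run(signs):
    """Length of the longest run of identical consecutive values."""
    longest = current = 0
    previous = None
    for s in signs:
        current = current + 1 if s == previous else 1
        longest = max(longest, current)
        previous = s
    return longest


def prob_longest_run_at_least(n, c):
    """P(longest run of heads or tails >= c) in n fair coin tosses."""
    if c <= 1:
        return 1.0
    if c > n:
        return 0.0
    x = c - 2
    # A[m]: number of length-m sequences whose longest run of heads is <= x.
    A = [2 ** m if m <= x else 0 for m in range(n)]
    for m in range(x + 1, n):
        A[m] = sum(A[m - 1 - j] for j in range(x + 1))
    # Sequences of length n with every run (heads or tails) <= c - 1
    # number 2 * A[n - 1].
    return 1.0 - 2 * A[n - 1] / 2 ** n


def run_test_p_value(profile, reference):
    """p value for the longest same-sign run in (profile - reference)."""
    diff = np.asarray(profile, dtype=float) - np.asarray(reference, dtype=float)
    signs = (diff > 0).tolist()
    if not signs:
        return 1.0
    return prob_longest_run_at_least(len(signs), longest_run(signs))
```

Restricting the comparison to the first or last 100 points, as described above, would simply mean slicing both arrays before calling run_test_p_value. With fewer points, a given run length yields a smaller p value, which is why the restricted ranges are more sensitive to a few systematic deviations.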
In the third part of the test, RAW performs a singular value decomposition (SVD) on the entire selected set of profiles and finds the number of significant singular values. Buffer ranges should only have one significant singular value, so if there is more than one significant singular value in the tested range then it is not a valid buffer range. In our experience this is typically the least sensitive test, but as RAW has the capability already available it is included for completeness.
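A rough sketch of the SVD check using numpy is given below. The criterion for significance is an assumption made for illustration (a singular value counts if it exceeds a fixed fraction of the largest one). The source does not describe how RAW decides which singular values are significant, and the function name and threshold are hypothetical.

```python
import numpy as np


def count_significant_singular_values(profiles, rel_threshold=0.05):
    """Count singular values larger than rel_threshold * the largest.

    profiles: array-like of shape (n_profiles, n_q).
    rel_threshold is a hypothetical cutoff, not RAW's criterion.
    """
    matrix = np.asarray(profiles, dtype=float).T  # rows: q, columns: frames
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(singular_values > rel_threshold * singular_values[0]))


base = np.linspace(1.0, 2.0, 50)
other = np.cos(np.linspace(0.0, np.pi, 50))

# One component with varying scale: a single significant value expected.
single = [base * a for a in (1.0, 1.2, 0.8, 1.1)]
# A second component appears in some frames: two significant values.
mixed = [base, base + other, base - other, base]

print(count_significant_singular_values(single))  # 1
print(count_significant_singular_values(mixed))   # 2
```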
In order to optimize the speed of the automated buffer finding, RAW runs the parts of the test in the order listed above, from fastest to slowest, and if any part fails on the selected range RAW does not run the subsequent parts.
RAW uses the same general approach for the automated sample range determination as it does for the automated buffer range finding: a window is scanned along the data and each range is tested to see whether it is a valid sample range (described below). RAW constrains the sample ranges to test, both to avoid false positives and to improve the speed of the algorithm. The midpoint of the largest peak is selected as the starting point, and an initial search window size is set equal to the width of the peak at 40% of the maximum intensity above baseline. A search range is defined as twice that width. The search window is then shifted alternately to earlier in the elution and then later in the elution, with a step size for the shift based on the window size, with each alternation getting further from the midpoint. For example, if the midpoint was 100, the window size 10, and the shift step size 2, the windows tested would be: 95-105, 93-103, 97-107, 91-101 and so on until a valid range was found. If no valid range is found for the initial window size, the window size is reduced and the range retested until either a valid range is found or the minimum range size is reached and the algorithm fails to find a valid sample range. A valid sample range will not be found if no peaks are found in the dataset.
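The alternating window order in the example can be reproduced with the short sketch below. It is illustrative only: the function name is hypothetical, and the stopping rule (every window must lie inside a search range of twice the window size, centred on the midpoint) is one reading of the description, not a confirmed detail of RAW.

```python
def sample_candidate_ranges(midpoint, window, step, search_width=None):
    """List (start, end) windows, alternating earlier/later from midpoint.

    Assumption: windows are kept only while they lie fully inside a
    search range of search_width (default 2 * window) centred on midpoint.
    """
    if search_width is None:
        search_width = 2 * window
    lower = midpoint - search_width / 2
    upper = midpoint + search_width / 2
    first_start = midpoint - window // 2
    ranges = [(first_start, first_start + window)]
    k = 1
    while True:
        added = False
        # Earlier in the elution first, then later, at increasing distance.
        for shift in (-k * step, k * step):
            start = first_start + shift
            if start >= lower and start + window <= upper:
                ranges.append((start, start + window))
                added = True
        if not added:
            break
        k += 1
    return ranges


print(sample_candidate_ranges(100, 10, 2))
# [(95, 105), (93, 103), (97, 107), (91, 101), (99, 109)]
```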
The test for a valid sample range has three inputs: the scattering profiles and the Rg and MW calculated for each profile in the selected range. There are five parts to the test. The first part is simply whether all profiles in the selected range have calculated Rg and MW values. If the values could not be calculated for all profiles in the range, then the range is not a valid sample range. The second part calculates the Spearman correlation coefficient and p value for the Rg and MW values in the range. If the sample is uniform across the selected range there should be no correlation, so if the p value from this test indicates a correlation, RAW marks the range as not valid.
The third part tests for similarity between the subtracted scattering profiles in the selected range and the subtracted scattering profile with the maximum total intensity in that range, using the CORMAP test. Because we expect that the intensity of the profiles will change across the peak due to the changing concentration in elution, RAW scales all profiles to the profile with the maximum intensity before the similarity test is done. As with the buffer similarity test, the full q range, the low-q range, and the high-q range are all tested, and if there are any profiles that are not similar to the highest intensity profile in any of the three q ranges then RAW marks the sample range as not valid.
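The scaling step could, for instance, be done with a single least-squares scale factor, as sketched below. The source does not state how RAW computes the scale, so both the method and the function name are assumptions made for illustration.

```python
import numpy as np


def scale_to_reference(profile, reference):
    """Scale profile by the scalar that best matches reference.

    Uses the least-squares factor a = (ref . prof) / (prof . prof), which
    minimizes sum((reference - a * profile) ** 2). This is an assumed
    approach, not necessarily the one RAW uses.
    """
    profile = np.asarray(profile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    scale = np.dot(reference, profile) / np.dot(profile, profile)
    return scale * profile, scale


reference = np.array([4.0, 2.0, 1.0])
scaled, factor = scale_to_reference(0.5 * reference, reference)
print(factor)  # 2.0
```

After scaling, each profile would be compared with the highest intensity profile over the three q ranges in the same way as for the buffer test.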
For the fourth part, RAW performs an SVD on the entire selected set of profiles and finds the number of significant singular values. As with the buffer range SVD test, subtracted sample ranges should only have one significant singular value, so if there is more than one significant singular value in the selected range then it is not a valid sample range. Again, this tends to be the least sensitive test, but we include it because it was already available in RAW.
The fifth and final part of the test is to check whether including all the profiles in the selected range improves the signal to noise of the final averaged subtracted scattering profile. Here, RAW sorts the profiles in the range by their overall intensity. An average subtracted profile is created by starting with just the most intense profile, and then subsequently averaging that with the next most intense profile, and so on. Every time a new profile is included in the average, RAW calculates the mean of the intensity/uncertainty in the average profile across all q points, yielding the overall signal to noise ratio. If that signal to noise ratio decreases when a profile is included, then that profile should not be included in the final dataset for optimal signal to noise, and so the selected range is not valid.
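The cumulative averaging check can be sketched as follows. This is a minimal illustration under stated assumptions: profiles are averaged without weighting, the uncertainties are treated as independent and propagated in quadrature, and the function name is hypothetical. RAW's own averaging and error propagation may differ.

```python
import numpy as np


def range_improves_signal_to_noise(intensities, errors):
    """Return True if adding each profile never lowers the mean I/sigma.

    intensities, errors: arrays of shape (n_profiles, n_q).
    Assumes an unweighted average with independent uncertainties.
    """
    intensities = np.asarray(intensities, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # Most intense profile first.
    order = np.argsort(intensities.sum(axis=1))[::-1]
    previous = None
    for n in range(1, len(order) + 1):
        idx = order[:n]
        avg_i = intensities[idx].mean(axis=0)
        avg_err = np.sqrt((errors[idx] ** 2).sum(axis=0)) / n
        snr = float(np.mean(avg_i / avg_err))
        if previous is not None and snr < previous:
            return False
        previous = snr
    return True


errs = np.ones((2, 3))
# Two equally strong profiles: averaging raises the signal to noise.
print(range_improves_signal_to_noise([[10, 10, 10], [10, 10, 10]], errs))  # True
# A weak profile with the same uncertainty lowers it.
print(range_improves_signal_to_noise([[10, 10, 10], [1, 1, 1]], errs))     # False
```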
As with the automated buffer range selection, in order to optimize the speed of the algorithm RAW runs the tests in the order listed above, fastest to slowest, and if any test fails on the range subsequent tests are not run.

S2.1. Automated Dmax determination
The auto Dmax function can run in several ways. If the ATSAS package is not available, it simply returns the Dmax value found by BIFT. However, if the ATSAS package is available then Dmax can be fine-tuned to get a more accurate value. The basic idea is simple. First, RAW runs other automated methods, namely BIFT, DATGNOM (Petoukhov et al., 2007) and DATCLASS, to determine a good starting point for the search. Based on the results from the SASBDB when the algorithm was written,

Figure S2 Plots of the automatically calculated Rg by a) the RAW automatic Guinier function, and b) the ATSAS AUTORG function on the y-axis vs. the experimenter determined Rg from a SASBDB entry on the x-axis. Results are shown for all SASBDB entries with Rg values that were classified as either Protein, DNA, or RNA. Perfect agreement between the automated method and the experimental method would be equal Rg values, shown by the black line in each figure.

Figure S3 Plots of automatically calculated Dmax by a) the RAW auto Dmax function, b) the ATSAS DATGNOM function, c) the ATSAS DATCLASS function, and d) BIFT (as implemented in RAW) on the y-axis vs. the experimenter determined Dmax from a SASBDB entry on the x-axis. Results are shown for all SASBDB entries with Dmax values that were classified as either Protein, DNA, or RNA. Perfect agreement between the automated method and the experimental method would be equal Dmax values, shown by the black line in each figure.