Single-particle imaging without symmetry constraints at an X-ray free-electron laser

Data-processing workflow for single-particle imaging experiments at X-ray free-electron lasers is presented. The analysis developed here revealed nanoscale features of the PR772 virus with a resolution better than 10 nm and without any symmetry constraints.


Introduction
Single-particle imaging (SPI) performed using hard X-ray free-electron lasers (XFELs) (Altarelli et al., 2006;Emma et al., 2010;Ishikawa et al., 2012) was proposed more than a decade ago (Neutze et al., 2000;Miao et al., 2001;Gaffney & Chapman, 2007) as a method of determining the structure of individual biological samples from viruses to single molecules in their native environment. This is different from cryoelectron microscopy (Bai et al., 2015), where single biological particles have to be preserved at liquid nitrogen temperatures to determine their structure. Importantly, XFELs may provide another dimension for the study of biological systems, namely, time evolution in pump-probe experiments on extremely small time scales.
XFELs generate pulses with ultra-high brilliance and high spatial coherence on the femtosecond time scale that is a prerequisite for the success of SPI at XFEL sources. The first experiments performed on protein nanocrystals  and single viruses  at the Linac Coherent Light Source (LCLS) were very promising and raised high expectations in the community. Significant progress in the field and several successful structure recoveries of biological samples were reported later (Kimura et al., 2014;Hantke et al., 2014;Ekeberg et al., 2015).
At the same time it was realized that the target of highresolution (potentially atomic resolution) structure determination of biological particles at XFELs is more challenging to achieve than initially anticipated. This lead to formation of the SPI initiative at LCLS and an international team with the goal to further progress single-particle imaging with XFELs .
Here we present the results of the structure determination of the PR772 virus from experimental data collected using soft X-ray pulses at LCLS as part of the SPI initiative (Reddy et al., 2017). Complementary to recent work published on the same experimental data Kurta et al., 2017), we chose a different approach for the analysis using the workflow illustrated in Fig. 1. We implemented a strategy consisting of several steps previously outlined (Gaffney & Chapman, 2007). Preliminary filtering (Reddy et al., 2017) resulted in an initial data set that was further refined by an advanced classification approach (Bobkov et al., 2015) and additional filtering procedures. Next, an orientation determination procedure (Loh & Elser, 2009) was applied to determine the full three-dimensional intensity distribution originating from the virus particle. The final step of reconstruction from this intensity distribution was performed to obtain the three-dimensional structure of the PR772 virus. Importantly, symmetry constraints were not used during the reconstruction. A detailed description of all steps leading to a final reconstruction of the particle structure is presented in this work.

Experiment and initial data processing
The experiment was conducted using the Atomic Molecular Optics (AMO) instrument at the LCLS. The PR772 bacteriophage with a diameter of about 70 nm was chosen. The sample was aerosolized by a Gas Dynamic Virtual Nozzle (GDVN), and the particle stream was focused by an aerodynamic lens stack into the XFEL beam which had a photon energy of 1.6 keV (wavelength 0.775 nm). The far-field diffraction patterns from randomly oriented particles were measured by a pnCCD detector (Strü der et al., 2010). The details of the raw data processing, sample preparation and experimental conditions were recently reported (Reddy et al., 2017).
The data analysis consisted of multiple steps of classification and filtering; useful single-hit diffraction patterns were defined as those where only a single virus was present in the XFEL beam. This single-hit class contains the most valuable data for structure determination by SPI.
The full data set that includes all XFEL pulses consists of the initial number of Set 3M = 2 976 568 images (Reddy et al., 2017). From this data set, Set 44k = 44 039 was selected. This hitselection procedure was based on a -squared metric previously described by Reddy et al. (2017), which selects images with strong scattering (for example, single or multiple particles). The detector images were then downsampled by 4 Â 4 and the detector intensities were converted to photon counts. Typical diffraction patterns from Set 44k are shown in Figs. 2(a)-2(c). Single-hit diffraction patterns providing Set 14k = 14 772 images were further selected using the diffusion-map approach described by Reddy et al. (2017).
Contrary to the above approach, we first sorted the diffraction patterns from Set 44k by their integrated intensities [see the histogram in Fig. 2(d)]. Weak hits with an integrated intensity less than 5 Â 10 4 ph (photons) were excluded from further single-hit analysis since their classification was difficult to determine with sufficient accuracy. We also identified very strong hits with integrated intensity higher than 2 Â 10 6 ph. These consisted predominantly of multiple hits that were  Workflow of the SPI experiment towards single-particle reconstruction: (a) single-hit classification of the initial data, (b) refined filtering of the classified data, (c) orientation determination, (d) particle reconstruction using phase retrieval.  mostly filtered out in the next step. Since the transition between strong and weak hits was smooth, we used different thresholds for the total number of photons counted in a diffraction pattern as an initial filtering step [see vertical lines in Fig. 2(d)]. The respective number of diffraction patterns for each set is shown in Table 1. This selection was then used in the single-hit classification by the principal component analysis (PCA) technique explained below.

Classification by the principal component analysis
The novel and most important filtering step used in this work is based on the classification by the PCA technique (Bobkov et al., 2015). From our experience, direct application of PCA methods to experimental diffraction patterns is not efficient and single-hit classification is hindered substantially as a result of different particle orientations and incident fluctuating X-ray-pulse intensities. Therefore, we compress each diffraction pattern into a feature vector (FV), which describes the diffraction patterns in relation to the real-space structure of a particle (see Supporting Information for the definition of the FV). The FV compression (resulting FV consists of 50 components) is based on X-ray cross-correlation analysis (XCCA) (Wochner et al., 2009;Altarelli et al., 2010;Kurta et al., 2016).
The PCA method is then applied to the set of FVs instead of the full diffraction patterns to separate single hits from other classes. With the PCA we projected the FV values (representing the diffraction patterns) onto a plane of the first and second principal components (PC 1 and PC 2 ) as shown in Fig. 3. Diffraction patterns with similar features cluster in a distinct region on the plane. Diffraction patterns with large differences are spread widely throughout the plane. We manually selected about 50 singlehit diffraction patterns similar to the one shown in Fig. 2(b). These are marked as red dots on the PCA plane in Fig. 3. We also selected about 50 diffraction patterns similar to the multiple-hit diffraction shown in Fig. 2(c) and shown as blue dots on the PCA plane in Fig. 3.
In the PCA plane we discovered a densely packed region which we attributed to single hits. For comparison, we also show the initial data Set 44k as green empty dots and another single-hit selection of Set 14k as yellow dots from the paper by Reddy et al. (2017). From these observations, we suggest that our PCA technique with prior FV compression may be useful in selecting single hits from SPI experimental data.
The PCA densities are visualized as three-dimensional plots to show the dense single-hit areas [ Figs Projection of feature vectors onto the PCA plane. Each dot corresponds to a diffraction pattern. The green empty dots represent diffractions patterns of Set 44k and the yellow dots represent single hits of Set 14k . The manually classified patterns are marked by red (single hits) and blue dots (multiple hits).   Table 1). (d)-( f ) Projection view of the PCA densities with manually classified single-hit patterns shown as red dots. Blue and yellow dots correspond to the same selections as in Fig. 3. The black contour level corresponds to 3.3% of the maximum value of the PCA density (selected hits in Table 1). For lowintensity thresholds, the black contour contains a region that is not clearly represented by the manual single hit selection (a), (d) and (b), (e). The manual hit selection is most precisely matched by data Set 10k PCA within the contour line for an intensity threshold at 2 Â 10 5 ph in (c) and (f). PCA single hits and multiple hits. Here we used the area at 3.3% of the maximum value of the FV density which is visualized as a black contour line in the PCA density plots in Fig. 4. The result of all selections is summarized in Table 1. For the threshold of 2 Â 10 5 ph, the best single-hit selection was expected because the black contour line most tightly encircles the manually selected single hits. A further increase of the integrated intensity threshold was not desired as this could substantially decrease selected amount of data. Within the black contour line Set 10k PCA = 10 082 diffraction patterns were assigned as single-hit candidates.

Filtering of particle-size distribution
The particle size can be estimated, for example, from the power spectral density (PSD), which is the intensity averaged over the azimuthal angle '.
Here q is the magnitude of the momentum transfer vector. We show the PSD of all individual diffraction patterns contained in Set 10k PCA obtained by the PCA technique in Fig. 5(a). From visual inspection we find clear outliers that are not filtered out by the PCA technique. In order to apply further filtering stages we approximated the diffraction patterns by the form factor of a sphere (Pedersen, 1997), where a is a constant and R is the sphere radius.
The q-value distribution of the first characteristic minimum was determined by fitting each PSD with equation (2). The histogram of positions of the first minimum from Set 10k PCA has a broad range of q values [Fig. 5(d)]. This suggests that Set 10k PCA still contains some diffraction patterns which correspond to particles of different sizes. In order to narrow the size distribution, we selected AE1 r.m.s. value around the mean value of the distribution in Fig. 5(d) and obtained Set 8k PCA containing 8459 diffraction patterns (see Table 1). However, the PSDs of Set 8k PCA still show some outliers [ Fig. 5(b)]. Besides the minimum position, we also exploited the quality of each fit to the PSDs. The fit quality was defined as By its definition it compares the fitted data with the measured data. We used AE1 r.m.s. value around the mean value of the fit quality histogram [see Fig. 5(e)] as the last filtering step and obtained the final single-hit selection of Set 7k PCA containing 7 303 diffraction patterns (see Table 1). After this final filtering step, the PSD for our final selection is cleaned from the obvious outliers as seen in Fig. 5(c).
Multiple filtering steps have been taken to obtain a well defined data set that comprises predominantly single hits with a narrow size distribution of particles. A block diagram giving an overview of the selection steps is shown in Fig. 6(a). The PCA technique as the main filtering stage passes 47% after the intensity threshold. The contributions from the other filtering steps are given in Fig. 6(a).
We compared the data selection described here with the initial single-hit selection (Reddy et al., 2017). Fig. 6(b) clearly shows that a large part of Set 6.6k PCA , with 6 677 diffraction patterns, is shared between our data (Set 7k PCA ) and the data Set 14k previously selected by Reddy et al. (2017).
After all of the filtering stages, we reduced the number of diffraction patterns from the initial Set 44k to 17% with the PCA technique and kept only the high-quality data Set 7k PCA for further orientation determination. Before orientation determination, we will first compare in detail two data sets: Set 14k and Set 7k PCA by the X-ray crosscorrelation analysis (XCCA) approach.

Angular X-ray cross-correlation analysis
A two-point angular XCCA (Altarelli et al., 2010;Kurta et al., 2013)   correlation function (CCF) defined here is similar to Kurta et al. (2017), where q 1 and q 2 are the momentum-transfer magnitudes at two points with the respective intensity values I i and I j of the ith and jth diffraction pattern, ' and Á are the angular coordinates and h . . . i ' denotes angular average. The analysis comprises the Fourier components (FCs) of correlation maps calculated by the ensemble-averaged difference spectra defined as (Kurta et al., 2017) e C C n ðq 1 ; q 2 Þ ¼ hC n i;j ðq 1 ; q 2 Þi i¼j À hC n i; j ðq 1 ; q 2 Þi i6 ¼j ; where C n i,j (q 1 , q 2 ) are the FCs of the CCFs of order n over the angle Á and h . . . i i,j denotes the average over diffraction patterns i and j. The Fourier components here are related to the CCFs in equation (4) by Only the FCs of even orders (n = 2; 4; 6; 8; 10; 12) were found to significantly contribute to the difference spectra FCs. In Fig. 7, we show two-dimensional maps of the amplitudes j e C C n ðq 1 ; q 2 Þj for each FC of order n for the data Set 14k , the excluded data Set 8k excluded and the PCA selected data Set 7k PCA (see Fig. 6). The correlation-map results of both Set 14k and Set 8k excluded look very similar. This suggests that the data of Set 14k is dominated by the contribution of data Set 8k excluded excluded by the PCA technique.
Although correlation maps are convenient for visual comparison of data sets, the Fourier quadrant correlation We show the FQC for the pairwise comparison between data Set 14k , Set 8k excluded and Set 7k PCA in Fig. 8. The blue line is the comparison between the PCA selection and the excluded data. We find values much lower than unity at low resolution and conclude on major difference between these two data sets. By comparing the PCA selection and data Set 14k (black line), we find the same tendencies with only slightly improved FQC values. This suggests that the data Set 14k is heavily influenced by the excluded data and we confirm that by showing the FQC of Set 14k and Set 8k excluded , which shows high correlation close to one for all q values (red line).
The XCCA and FQC analysis retrieved a quantitative difference between two data sets. This pointed to a strong influence of the excluded data on the whole Set 14k .

Angular-orientation determination
For the orientation determination we used the well documented expand-maximize-compress (EMC) algorithm (Loh & Elser, 2009) implemented in the software Dragonfly (Ayyer et al., 2016) (see Supporting Information for Dragonfly parameters used for orientation determination). For comparison, we retrieved orientations of the selected diffraction patterns for the two data sets: Set 14k and Set 7k PCA . The three-dimensional intensity distribution for these two data sets is presented in Figs. 9(a) and 9(b). In addition, for the three-dimensional intensity distribution of data Set 7k PCA , the background, in the form of a linear combination of two Gaussian functions, was subtracted [see Fig. 9(c) and the Supporting Information for details].
From the three-dimensional intensity distribution for each data set we conclude that there is strong improvement in the fringe contrast going from the selection Set 14k to Set 7k PCA and further improvement with background subtraction. In Figs. 9(d)-9(f) we show line profiles taken in different directions in reciprocal space. To quantify an improvement in contrast from these line profiles we calculate contrast values as where I max and I min are maximum and minimum values along the line profiles. To characterize each data set with a single number we averaged the contrast values over all neighboring maxima and minima along the line profiles. The diffraction fringe contrast analysis revealed higher contrast of the value ¼ 0:40 for our data selection Set 7k PCA over the previously reported data selection Set 14k with the value of ¼ 0:13. As a consequence of the background subtraction procedure applied to the three-dimensional intensity distribution we obtained a further improved contrast of ¼ 0:84.
Line plots of Figs. 9(d)-9(f) also show shifted peak positions which indicate a non-symmetric particle shape. Importantly, this would not be possible to observe if symmetry constraints were applied at the angular orientation determination step.
7. Electron-density reconstruction 7.1. Virus particle reconstruction It is well known that in the frame of kinematical approximation the scattered intensity represents the squared amplitude of the Fourier transform of the three-dimensional electron density ðrÞ of the particle (Als-Nielsen & McMorrow, 2011), where AðqÞ is the scattered amplitude. To determine the three-dimensional electron-density, iterative phase-retrieval tech-niques (Fienup, 1982;Marchesini, 2007) may be applied to the three-dimensional intensity distribution of a virus particle determined above. By applying phase retrieval, a three-dimensional electron-density distribution averaged over 100 reconstructions was obtained. We used a combination of algorithms including continuous hybrid input-output (cHIO) (Fienup, 2013), solvent flipping (SF) (Marchesini, 2007) and error reduction (ER) (Fienup, 1982) in combination with shrink-wrap (SW) (Marchesini et al., 2003). This gave the most stable reconstructions after 1680 iterations from random initial starts (see Supporting Information for details). The averaging ensures the statistical significance from random starts of the reconstruction algorithm.
Isosurface plots (10% of maximum electron density) of the three-dimensional PR772 virus shape are shown in the first row of Fig. 10. The second row gives the internal structure at 10%, 82% and 89% isosurface level and the third row comprises slices through the center of the particle reconstructions on the same color scale.
The reconstructed electron density from Set 14k contains a small high-density peak in the center of the particle   10(e) and 10(h)] without background subtraction exhibits a broader central density region. Importantly, the reconstruction with background subtraction from data Set 7k PCA [see Figs. 10( f ) and 10(i)] shows the expected density distribution that exhibits the concentric structure with an outer protein capsid and internal lipid membrane surrounding the viral DNA that is characteristic of the Tectiviridae family (Miyazaki et al., 2010).
From the results of reconstruction we also see that the particle shape deviates from initially expected icosahedral symmetry. The asymmetry is most evident from the background subtracted reconstruction from our selection set [see Fig. 10(i)] and shows a low-density region indicated by a black arrow.
The improvement of the PCA-filtered data with respect to Set 14k is seen from the reconstructions. At the same time, background subtraction plays an important role in the final reconstruction result.

Virus size and shape analysis
The particle electron-density reconstruction was particularly helpful for estimating the virus size and shape. We extracted multiple values for the virus size corresponding to different distances between opposite points of the virus capsid. It is useful to distinguish between the distances from facet to facet and from vertex to vertex. From this we can quantify the shape distortion in various directions.
In Fig. 11 the virus dimensions are analyzed in different directions for both cases of data processing, with and without background subtraction. Each size value corresponds to 10% of the maximum electron-density value as shown by the isosurface. The maximum size of the virus may be estimated as the one measured from vertex to vertex. Our results give for this distance values of 66.2 and 68.5 nm for reconstructions without and with background subtraction, respectively (see Table 2). This size corresponds very well to an average size of the virus estimated by other means (Reddy et al., 2017). At the same time for both data sets we observe a pronounced shape distortion of the virus that is about 4.8% and 4.2% of the average virus size for the data sets without and with background subtraction, respectively. This is also consistent with observations made previously (Kurta et al., 2017), where similar distortion of the virus shape was identified by the XCCA approach.

Resolution estimate
The three-dimensional voxel size of the reconstruction determined by the detector size and experiment geometry was 4.2 nm 3 . Finally, the reconstructed particle volume contained about 17 576 resolution elements (voxels). The first estimate of the resolution in our reconstruction was obtained by the phase-retrieval transfer function (PRTF) (Chapman et al., 2006) that provided a value of 9 nm at the 0.5 threshold [see Fig. 12(a)]. The second estimate of the resolution was obtained by the Fourier-shell correlation (FSC) analysis ( van Heel & Schatz, 2005). For the FSC analysis the data were split into two parts, each half was oriented independently with EMC and   Table 2 Particle size determined from facet to facet and vertex to vertex distances shown in Fig. 11  independently reconstructed. The commonly used resolution criterion ( van Heel & Schatz, 2005) of the 1/2-bit threshold line (equal to a signal-to-noise ratio of one in Fourier space of the reconstructed object) intersects the FSC at a resolution value of 7.81 nm. By that analysis we conclude that the resolution of our three-dimensional virus reconstruction presented in Fig. 10 is in the range 7.8-9 nm and was primarily limited by the detector size.

Conclusions
We have presented a workflow from measured single-particle imaging XFEL data to a three-dimensional reconstruction. The workflow consisted of several steps, including single hit diffraction data classification; refined filtering of the classified data; reconstruction of three-dimensional intensity distribution by orientation determination and reconstruction of the particle electron density by phase-retrieval methods (Fig. 1). Our research was performed on data taken at the AMO beamline at LCLS as part of the SPI initiative (Reddy et al., 2017). The analysis was based on initial data selection Set 44k of diffraction patterns which were free of obvious faulty data such as empty images. First, using threshold of the diffraction patterns according to their integrated scattered intensity we separated weak single-particle hits from more useful strong hits. Furthermore, diffraction images above a photon count threshold of 2 Â 10 5 ph were classified by the PCA technique. At that step we identified diffraction images as single hits as a result of their clustering into a dense region on the PCA plane.
At the next step, additional refinement filtering was implemented based on comparative analysis of measured diffraction data with the spherical form-factor model. As a result, the final PCA selection after refinement filtering was a fraction of 17% of the initial data and consisted of the Set 7k PCA of diffraction images.
Furthermore, a newly developed correlation approach based on XCCA was used to compare our data selection with the excluded data of the bigger selection previously reported (Reddy et al., 2017). Our analysis showed that the selected data were indeed distinct from the excluded data.
The three-dimensional intensity distribution in reciprocal space was determined by the EMC algorithm. In addition, we performed background subtraction from the PCA-selected data set, which substantially improved the contrast of the three-dimensional intensity distribution. At this step we observed a nonsymmetric virus shape that would not be possible to identify if symmetry constraints were applied.
At the final step of our workflow we used three-dimensional phase retrieval for the real-space particle reconstruction to reveal the electron density of the measured virus particle. The subsequent size analysis showed a nonsymmetric particle shape with an average size of 68.5 nm, and size variation from an ideal icosahedral shape on the order of 4.2%. The resolution was estimated to be better than 10 nm based on PRTF and FSC analyses. With the presented analyses of the reconstructions, we showed that the current limitation on resolution was primarily imposed by the detector size.
Overall, our results demonstrate a feasible approach for analysis of large SPI data sets collected at XFELs. A major observation from our analysis is that PR772 particles do not exhibit true icosahedral symmetry, which is in agreement with analyses of the data set using other approaches (Kurta et al., 2017;Hosseinizadeh et al., 2017). Our analysis shows that particles exhibit a low-density region beneath one of the facets [see Fig. 10(i)]. The biological significance of this remains to be shown, but it is worth noting that the internal membrane of the closely related PRD1 virus undergoes remodeling in response to environmental conditions (like osmolarity) that result in changes to the membrane-capsid interactions (Peralta et al., 2013). The changes are thought to occur during interactions with the host cell receptor during infection, resulting in destabilization of the icosahedral intermembrane vesicle.
Going forward with XFEL SPI studies, we would like to note that a further increase of the scattering angle in future SPI experiments may be achieved with higher XFEL fluence [with certain limitations imposed by radiation damage (Lorenz et al., 2012;Gorobtsov et al., 2015)]. Importantly, a sufficient number of single hits should be collected in these experiments to produce a useful signal at larger diffraction angles. A substantial enhancement of the hit rate is expected at higher luminosity XFEL facilities such as European XFEL and LCLS II, as well as due to improved sample-injection techniques. We also expect that our PCA technique based on single-hit classification workflow is an important step forward for future data analysis prior to orientation determination and phase retrieval.

Funding information
Portions of this research were carried out at the LCLS at the SLAC National Accelerator Laboratory; use of the LCLS is supported by the US Department of Energy, Office of Science,Office of Basic Energy Sciences (under Contract No. DE-AC02-76SF00515). MR, SB, VI and IAV acknowledge support from the Helmholtz Association's Initiative and Networking Fund and the Russian Science Foundation . MR acknowledges financial support from the Joachim Herz Foundation Hamburg. MR, KA, YYK, AJM and AB acknowledge support of Program-Oriented Funds of the Helmholtz Association to DESY. JAS and FRNCM acknowledge support from the Swedish Research Council and the Swedish Foundation for Strategic Research. BGH acknowledges support from the US National Science Foundation (NSF) Science and Technology Center BioXFEL Award 1231306.