Introducing robustness to maximum-likelihood refinement of electron-microsopy data

An expectation-maximization algorithm for maximum-likelihood refinement of electron-microscopy data is presented that is based on finite mixtures of multivariate t-distributions. Compared with the conventionally employed Gaussian mixture model, the t-distribution provides robustness against outliers in the data.


Introduction
Whereas maximum-likelihood approaches have become a gold standard in many areas of macromolecular X-ray crystallography (e.g. Bricogne, 1997;de La Fortelle & Bricogne, 1997;Read, 2001;Blanc et al., 2004), in single-particle threedimensional electron microscopy (3D-EM) such statistical approaches have only recently been started to be explored. An important characteristic of the maximum-likelihood approach is the natural way in which one may model the experimental noise in the data. Because the noise levels in 3D-EM data are typically extremely high, one would expect that 3D-EM refinement problems could greatly benefit from a proper error model. However, for many years image processing in 3D-EM has been addressed using methods that do not take the noisy character of the experimental data into account in a statistical way (Frank, 2006). Provencher and Vogel performed early work on a statistical model for the noise in 3D-EM data  and it was only in 1998 that Sigworth introduced a maximum-likelihood algorithm for the alignment of a set of two-dimensional images (Sigworth, 1998). Thereafter, Doerschuk and coworkers used the same principles for the three-dimensional reconstruction of icosahedral viruses (Doerschuk & Johnson, 2000;Yin et al., 2003;Lee et al., 2007), Pascual-Montano and coworkers introduced a maximum-likelihood algorithm for self-organizing maps (Pascual-Montano et al., 2001) and Zeng and coworkers applied this approach to two-dimensional alignment of crystal images (Zeng et al., 2007).
We were the first to address the problem of simultaneous two-dimensional image alignment and classification using maximum-likelihood principles (Scheres, Valle, Nuñ ez et al., 2005) and we then extended this methodology to the general case of three-dimensional reconstruction from structurally heterogeneous data (Scheres, Gao et al., 2007). The latter is of special relevance for 3D-EM single-particle analysis, in which many samples constitute large and flexible macromolecular complexes. These complexes typically adopt multiple conformations that are often directly related to their function in living organisms. In principle, provided that one can sort the projections from distinct structures using a computer, multiple three-dimensional reconstructions of the particles in their distinct functional states may be obtained from a single 3D-EM experiment. However, this sorting is strongly intertwined with the orientational assignment of the projections and at present still represents one of the major challenges in single-particle image processing (Leschziner & Nogales, 2007).
We model structurally heterogeneous data as a finite mixture and treat the unavailable information about the orientation and the structural class of each experimental projection as missing data. We then tackle the mixture problem using expectation maximization, which can be shown to converge to the maximum-likelihood estimation of the mixture parameters under relatively mild conditions (Dempster et al., 1977;McLachlan & Peel, 2000). The resulting algorithm is a multi-reference refinement procedure which is similar to conventional refinement approaches in the field (Radermacher, 1994;Penczek et al., 1994). However, the most important difference of the maximum-likelihood approach is that the underlying statistical data model allows one to marginalize over the missing variables. That is, whereas conventional approaches assign a single orientation and class membership to each projection, the maximum-likelihood approach calculates probability-weighted assignments for all possibilities. This provides an intrinsic stabilization of the possibly unstable reconstruction problem. Together with the typical use of relatively small images (also to reduce computational costs) and an early stopping criterion in the underlying algrebraic reconstruction algorithm with smooth basis functions, or blobs (Marabini et al., 1998;Scheres, Gao et al., 2007), this yields a stable algorithm in practice which has been shown to be highly effective on multiple occasions (see, for example, Nickell et al., 2007;Cuellar et al., 2008;Juliá n et al., 2008;Rehmann et al., 2008).
Despite the importance of the underlying data model in statistical approaches, little work has been performed to explore alternative models for maximum-likelihood approaches in 3D-EM. All the approaches mentioned above share the assumption of additive white Gaussian noise in real space. A large part of the noise may result from shot noise owing to the small number of imaging electrons (10-20 per squared angstrom). The latter would require a multiplicative noise model with a Poisson distribution. However, in practice the additive Gaussian model is a good approximation when each pixel represents many squared angstroms (Sigworth, 2004). The pixel areas for the classifications described in this paper, for example, range from 12 to 30 squared angstroms. Moreover, several additional sources of noise exist such as structural noise arising from the surrounding ice and detector noise; the combination of these multiple independent sources of noise has been shown to follow a Gaussian distribution (Sorzano, de la Fraga et al., 2004). The additive character of the Gaussian noise model results in a computationally attractive algorithm, but the assumption of whiteness is known to be a poor one for electron-microscopy projections. Therefore, we recently introduced an alternative data model in reciprocal space that allows the modelling of nonwhite, or coloured, Gaussian noise (Scheres, Nunez-Ramirez et al., 2007). The Gaussian distribution still remains a common factor, while in other pattern-recognition fields a notable interest has developed in the use of alternative distributions. For many applied problems the tails of the Gaussian are shorter than required and mixtures of Gaussians may lack robustness in the presence of atypical observations. In particular, the use of multivariate t-distributions has repeatedly been proposed as a more robust alternative. The t-distribution has wider tails and its degree of freedom essentially plays the role of rejecting atypical observations. As tends to infinity, the t-distribution approaches the Gaussian, so that may be viewed as a robustness tuning parameter. Several contributions defining frameworks of expectation-maximization algorithms for mixtures of t-distributions have appeared and mixtures of t-distributions have been successfully applied to a range of different types of data (Lange et al., 1989;McLachlan & Peel, 2000;Wang et al., 2004).
In this contribution, we explore the suitability of modelling structurally heterogeneous 3D-EM data as a mixture of multivariate t-distributions. We derive the corresponding expectation-maximization algorithm in x2. In x3 we illustrate its intrinsic properties of providing robustness against outliers and compare the performance of the new algorithm with the conventional algorithm for Gaussian mixtures in twodimensional and three-dimensional classification. We conclude this paper with a discussion on the potential usefulness of the proposed algorithm in x4.

The optimization problem
We model two-dimensional images X 1 , X 2 , . . . , X N as follows: where (i) X i 2 R J are the recorded data. Typical data sets comprise N = 20 000 to N = 200 000 images with J = 50 Â 50 up to J = 100 Â 100 pixels.
(ii) G i 2 R J is independent zero-mean (additive) noise.
(iv) R È i V K i 2 R J are the two-dimensional projection data (uncontaminated by noise) of the unknown object V i in an unknown random orientation in space and position in the plane. We parametrize the unknown orientation and position in the plane by a discretized distribution of p = 1, . . . , P projection directions (described by P combinations of two Euler angles) and q = 1, . . . , Q in-plane transformations consisting of Q rot rotations and Q trans translations and the corresponding discrete transformations are denoted as R pq .
The reconstruction problem at hand is to estimate V 1 , V 2 , . . . , V K from the observed data X 1 , X 2 , . . . , X N . We view this estimation problem as a missing data problem, where the missing data associated with the observed data elements X i are the position È i and the random index i . Thus, the complete data set is We solve the reconstruction problem by way of maximumlikelihood estimation, where we aim to find those parameters Â* that maximize the logarithm of the joint probability of observing the entire set of images X 1 , X 2 , . . . , X N : PðX i jk; p; q; ÂÞPðk; p; qjÂÞ: As described previously (see, for example, the Supplementary Note in Scheres, Gao et al., 2007), we assume that particle picking has left a two-dimensional Gaussian distribution of residual translations q x and q y centred at the origin and with standard deviation . Furthermore, we assume an even distribution of the Q rot sampled in-plane rotations and a discretized distribution of estimated proportions kp of the data belonging to the pth projection of the kth underlying three-dimensional structure (with kp ! 0 and P K k¼1 P P p¼1 kp = 1). Thereby, P(k, p, q|Â) is calculated as follows: In contrast to previous contributions where a Gaussian distribution was employed, we calculate P(X i |k, p, q, Â) as a multivariate t-distribution, with a diagonal covariance matrix with all diagonal elements equal to 2 : and ||Á|| denoting Euclidian distance. Following McLachlan & Peel (2000), we notice that the multivariate t-distribution may be viewed as a weighted average Gaussian distribution with the weight given by the Gamma distribution: Here, q(u i ) is the p.d.f. of a Gamma distribution with equal scale and degrees of freedom, G (/2, /2), and n J ( and again with a diagonal covariance matrix, which has all diagonal elements equal to 2 /u i , Therefore, it is convenient to introduce another set of 'missing' variables u 1 , . . . , u N , which are defined such that independently for i = 1, . . . , N and all u i are independently distributed according to Thus, the complete data set becomes and the function to be optimized becomes In analogy with (3) and with the previously introduced algorithm for Gaussian distributions (Scheres, Gao et al., 2007), the reconstruction problem at hand is to find the parameter set Â* that maximizes (12). However, Â now includes an additional parameter and the missing data vector has been augmented to not only include positions È i and random indices i but also variables u i . In this way, atypical observations in the data (i.e. observations with relatively large residuals) may be accomodated by relatively wide Gaussian distributions (i.e. with small values of u i ) and the additional parameter is used to describe the assumed distribution of all u i according to (10).

The optimization algorithm
This optimization problem may be solved by expectation maximization (Dempster et al., 1977). This algorithm is used for finding maximum-likelihood estimates of parameters in probabilistic models that depend on unobserved or hidden variables. Expectation maximization is an iterative method that alternates between expectation (E) and maximization (M) steps. In the E-step one computes the expectation of the likelihood by including the hidden variables as if they were observed. In the M-step that follows, the maximum-likelihood estimate of the model parameters is computed by maximizing the expected likelihood found in the previous E-step. The parameters found in the M-step are then used to begin another E-step and the process is repeated. As stated above, the missing variables in this case are u i , È i and i and the parameters to be estimated are contained in Â.
In the E-step, again following McLachlan & Peel (2000), we calculate the expectation value of the log-likelihood function using the current estimates of the model parameters (Â old ):

new algorithms workshop
old ikpq Â ½log PðX i jk; p; q; u i ; ÂÞ þ log Pðu i jk; p; q; ÂÞ þ log Pðk; p; qjÂÞ: Here, old ikpq is the conditional probability distribution of k, p and q given X i , and for the conditional expectation of u i , given X i , k, p and q, we obtain In the subsequent M-step of the algorithm, we maximize the lower bound (13) with respect to all model parameters in Â.
Since there exists no closed form for the update of new , we will consider to be known (i.e. user-determined). The updates for and the proportions pk new may be calculated independently from the updates of the other model parameters as follows: For the updates of V and , we note that they are a weighted version of the corresponding updates in the case of Gaussian distributions, with standard deviations 2 i =u i , . . . , 2 N =u N and with the weights being the additional missing variables u 1 , . . . , u N . Therefore, as for the Gaussian case, updating V may be performed by separately solving K least-squares problems, for which we use a modified algebraic reconstruction algorithm (wlsART; see Scheres, Gao et al., 2007). In this case, the least-squares problems are and the updated is obtained as

Implementation
We implemented a total of four variants of the abovedescribed algorithm in the open-source package XMIPP (Sorzano, Marabini et al., 2004;Scheres et al., 2008). The proposed algorithm for three-dimensional classification can be adapted with only minor changes to a two-dimensional classification algorithm. In this case, instead of optimizing (13) with respect to three-dimensional structures V 1 , . . . , V K , one optimizes this function with respect to two-dimensional images A 1 , . . . , A K . The algorithm remains basically the same, except for the fact that in this case R pq represents an in-plane transformation (parametrized by a single rotation and two inplane coordinates) and the least-squares problem in (18) is replaced by the following updated formula: In addition, both the two-dimensional and the three-dimensional variants may also be expressed in reciprocal space. In this case, X 1 , . . . , X N , A 1 , . . . , A K and V 1 , . . . , V K represent the Fourier transforms of the observed data and the twodimensional or three-dimensional models, respectively, G i is independent zero-mean additive noise in reciprocal space and R pq represents the reciprocal-space equivalent of either a projection operation or an in-plane transformation in real space. In the former model one describes the noise by independent distributions on the real-space pixels, while in the latter the noise is modelled as being spatially stationary, which allows one to describe nonwhite or coloured noise. For a more extensive elaboration on these characteristics and their implementation, the reader is referred to Scheres, Nunez-Ramirez et al. (2007). Finally, we mention that the summations over k, p and q are extremely computing-intensive operations. Therefore, we have implemented three deviations from the strict expectationmaximization algorithm that result in a considerable speed-up of the calculations without hampering the classification performance in practice. The first two deviations were also implemented as such in the algorithms using Gaussian distributions, whereas the third deviation is specific for the t-distribution case: (i) instead of integrating over the entire search space of k and q, we employ a reduced-space approach , (ii) the update of is performed using V old instead of V new and (iii) following the proposal of McLachlan & Peel (2000), we replace the division by N in (19)

Robustness to outliers
We used a simplified two-dimensional test case to illustrate the potential of the t-distribution in providing robustness to outliers. The test data consisted of 1000 experimental cryo-EM projections of a 70S Escherichia coli ribosome particle in a single orientation. In 50 of the 1000 images, we positioned circles of constant density with radii varying uniformly between 10 and 15 pixels and with centres varying between À15 and 15 pixels from the image origin. The intensity of these circles was set to a constant value of 5, 10, 15 or 20 times the standard deviation of the original experimental images. We then performed two-dimensional real-space maximumlikelihood refinements with a single reference image for these data sets, comparing the performance of the Gaussian and new algorithms workshop Acta Cryst. (2009). D65, 672-678 t-mixtures (Fig. 1). The resulting averages clearly showed the effect of the improved robustness to outliers provided by the t-mixture. For the data sets with the strongest outliers in particular, the averages obtained with the Gaussian mixture showed clear artefacts that were not visible in the averages obtained using the t-mixture model. Analysis of the converged estimates for the standard deviation in the noise indicated that the algorithm for the Gaussian case tries to accommodate the outliers by increasing the widths of the Gaussians. This is much less the case for the t-distribution case, where low values for u old ik' downweight the contribution of the outliers in the calculation of the averages and the standard deviation of the noise. The stronger the outliers, the larger this downweighting effect and the larger the differences between the two algorithms.

Performance in two-dimensional classification
To explore the potential of the new algorithm in twodimensional image classification, we performed maximumlikelihood multi-reference refinements on a cryo-EM data set of MCM top views (Gó mez-Llorente et al., 2005) and on a negative-stain data set of G40P top views (Nunez-Ramirez et al., 2006). For each data set we performed four runs, using mixtures of Gaussians or of t-distributions with six degrees of freedom, and performing two-dimensional refinements in real or in reciprocal space. All four runs were started from identical seeds, which were obtained as average images over three random subsets of the data sets. Fig. 2 shows the resulting images of these runs, which show only minor differences between the two types of mixtures either in real or in reciprocal space. In all cases, the refined images look very similar to those obtained with a Gaussian mixture. Not only do the densities for the averaged particles in the centre of the images look very similar, the two mixture types even result in common characteristics in the noise background. The optimization path and the optimal orientation and classification parameters of the individual images upon convergence also showed only small differences (not shown).

Performance in three-dimensional classification
For three-dimensional classification, we compared the performance of both types of mixtures using a data set of 20 000 ribosome particles. This data set was previously shown to be structurally heterogeneous as only part of the ribosomes are complexed with elongation factor G (EF-G; see Scheres, Nunez-Ramirez et al., 2007). Refinements with four references typically converge to a single class corresponding to ribosomes in complex with EF-G and three classes of ribosomes without EF-G. We again performed four runs using real-space or reciprocal-space refinement and using a Gaussian or a t-distribution mixture with six degrees of freedom. The intensity of segmented EF-G density in the class corresponding to ribosomes in complex with EF-G may serve as an indicator of classification quality, since remaining heterogeneity will generally yield lower density levels for EF-G. Fig. 3 shows segmented EF-G densities for the four runs performed. Starting from identical seeds, the t-mixture model in real space gave somewhat higher EF-G densities than the Gaussian mixture and the corresponding classes overlapped by 87%. Refinement in reciprocal space yielded stronger EF-G densities than in real space for both types of mixtures. The differences between the Gaussian and the t-mixture were smaller in this case, as no obvious difference in the intensity of EF-G Class averages as obtained in two-dimensional classifications with three references for the G40P and MCM data sets using a Gaussian or t-mixture model in real or in reciprocal space. (a) Converged reference images for the runs with a mixture of Gaussians (top row) and t-distributions with six degrees of freedom (bottom row) for test sets containing outliers of increasing intensity. A value of zero for the outlier intensity is used to indicate the original test set without outliers. (b) Converged estimates for the standard deviation in the noise () for the runs with a Gaussian (black) or a t-mixture (grey). (b) Average and standard-deviation values for the converged estimates for u old ik' at maximal old ik' of the 50 outliers (black) and the remaining 950 images (grey) upon convergence for the runs with a t-mixture. density was observed and the EF-G-containing classes overlapped by 94%.

Discussion
The selection of individual particles from electron micrographs, called particle picking, is typically a difficult task. For cryo-EM data on relatively small particles (200-500 kDa) in particular, automated procedures may have relatively high error rates and the collection of good data is often strongly dependent on the specialized skills of the electron microscopist (Zhu et al., 2004). Therefore, it is relatively common for cryo-EM data sets to contain significant amounts of outliers. Atypical observations that were mistakenly assumed to be a particle of interest may deteriorate the quality of the three-dimensional reconstruction. In the best case scenario they only affect the resolution obtained. In the worst case scenario artefacts introduced by outliers may affect the interpretation of the structure itself. Conventionally, outliers have been dealt with by removing those particles with the lowest cross-correlation coefficients with the reference from the refinement process (Frank, 2006). Although effective in practice, such discrete decisions are hard to accommodate in the statistical framework of maximum-likelihood refinement.
The algorithm proposed in this contribution provides an alternative statistical solution to outlier removal. The problem of structurally heterogeneous projection data is modelled as a finite mixture of multivariate t-distributions with a given degree of freedom. In the resulting expectation-maximization algorithm, images with atypically large residuals contribute relatively little to the model estimates through lower values of u old ik' ; see (15), (18) and (19). Note that the residuals used to calculate the weights u old ik' are closely related to the crosscorrelation coefficient, but instead of taking discrete decisions the statistical approach applies a continuous downweighting of outliers as their residuals increase. We illustrated this effect for a small experimental test set with artificially generated outliers. In the Gaussian model all particles contribute equally to the model estimates. Consequently, especially in the presence of strong outliers, the average images obtained showed outlierrelated artefacts and the variance of the noise was overestimated. In contrast, the t-distribution model resulted in clean average images and reliable noise estimates through an effective downweighting of the outliers.
In practice, however, there are limits to the downweighting of outliers in the proposed algorithm. From (15) and the example in Fig. 1, we can see that even for a few degrees of freedom (e.g. six) significant downweighting is only achieved for images with squared residuals that exceed the standard deviation of the noise in the images several times. This may restrict the usefulness of the proposed algorithm in identifying aberrant particles. In practice, particles with such large residuals may be easily recognizable at earlier stages of image processing, whereas one would ideally want to downweight any particle that does not correspond to a projection of one of the K reference structures. Analyses of the two-dimensional classifications presented in x3.2 indeed did not reveal an obvious relation between low values of u old ik' and what one would consider atypical images in terms of the underlying signal (results not shown).
The number of degrees of freedom is a free parameter of the proposed algorithm. Although in theory the optimal value of any free parameter should be tested, we performed all calculations presented in this work with a fixed value of six degrees of freedom. Because 3D-EM images typically contain many pixels, the right-hand side of both the numerator and the denominator in (15) will dominate the calculation of u old ik' when using few degrees of freedom. Together with the observation made above that even for few degrees of freedom the effect of outlier downweighting may be relatively small, this suggests that in practice it may be sufficient to run this algorithm only with few degrees of freedom. This is confirmed by our calculations. When using three, nine or 30 degrees of freedom in the runs shown in Fig. 2, almost identical results were obtained (not shown) compared with using six degrees of freedom.
Improved image classification would be the ultimate aspiration of introducing a novel algorithm for maximumlikelihood refinement of 3D-EM data. Despite the fact that both the MCM and the G40P data set contained significant amounts of neighbouring particles and other artefacts that were not accounted for in the model, we did not observe any significant improvement in using the t-mixture model over the conventional Gaussian model in our two-dimensional classifications. One could attribute this to the observation that strong downweighting may only be achieved for outliers with very large residuals and that such strong artefacts were not present in these data. However, although the differences in the Segmented EF-G density from the maps obtained for the EF-Gcontaining class in three-dimensional maximum-likelihood refinements using a real-space (a, b) or reciprocal-space (c, d) target function and a Gaussian (b, d) or a t-mixture model with six degrees of freedom (a, c). Superimposed on the (transparent) densities obtained with the Gaussian mixture model are the positive (green) and negative (red) difference maps, i.e. the density obtained with the t-mixture minus the density obtained with the Gaussian mixture. All maps, including the difference maps, are rendered at the same isosurface value. u old ik' weights may appear to be relatively small in practice, more subtle effects may still play an important role in the complicated convergence process. This may perhaps explain why three-dimensional classification of a structurally heterogeneous ribosome data set with a real-space t-mixture model may have given better classification results than the Gaussian mixture, as hinted at by a stronger signal for the complexed EF-G density.
We have presented too few tests to allow the drawing of general conclusions on the relative suitability of the t-mixture model and the conventional Gaussian model. A continuing application of the proposed algorithms on multiple test cases may provide further insights, but this falls beyond the scope of this contribution. Most probably, the optimal choice of algorithm will depend on the data set at hand. Therefore, we have made all algorithms described in this work accessible to the community by implementing them in our open-source package XMIPP (Sorzano, Marabini et al., 2004;Scheres et al., 2008). Apart from modifications to the maximum-likelihood classification approach as presented here, we also foresee the exploration of alternative algorithms, such as maximum a posteriori (MAP) estimation, which may offer signifant benefits in additional stabilization of the reconstruction problem through the incorporation of prior information.