Clustering and Averaging of Images in Single-Particle Analysis

Genome Informatics 11: 151–160 (2000)


Kiyoshi Asai1, Yutaka Ueno1, Chikara Sato1, Katsutoshi Takahashi2

1 Bioinformatics Group, Electrotechnical Laboratories, 1-1-4 Umezono, Tsukuba, Japan
2 Japan Advanced Institute of Science and Technology

Abstract Single particle analysis is a straightforward method for studying the structures of macromolecules that cannot be crystallized. It builds three-dimensional structures of particles by estimating the projection angles of their randomly oriented electron-microscopic images. The existing methods divide the images into clusters, build class averages for the clusters, and estimate the projection angle of each cluster. However, the clustering and the averaged images are highly sensitive to the choice of reference images and mask patterns for each cluster. Thus, the analyses are neither robust nor automatic, and their results depend heavily on the intuition and experience of researchers who set references. We have been developing a software system for single-particle analysis with new clustering and averaging algorithms for building the three-dimensional structures of target molecules. In this paper, we focus on the algorithms for the robust image-processing of the electron microscopic images in our system.

Keywords: single-particle analysis, protein structure, clustering

1 Introduction

Recently, the number of proteins whose three-dimensional structures have been determined by X-ray crystallography has been increasing. However, in relation to the huge variety of proteins, this number is still limited, owing to the difficulty of crystallization and to the molecular-weight limits of NMR analysis. In particular, membrane proteins, which are functionally intriguing, are very difficult to crystallize, and only a limited number of their structures have been determined. Since transmission electron microscopy allows for the direct observation of the atomic density, single-particle analysis has allowed for the structural analysis of ribosomes and membrane proteins [3]. Single-particle analysis is considered to be an advantageous method for observing structural changes and molecular complexes of working proteins in structural biology studies of the next generation.

Negative staining is a technique for fixing a specimen in vacuum optics with heavy-atom chemicals on a fine carbon plate. Recently, the use of an ice-embedded specimen, which avoids the deformation caused by staining, has been found to be an ideal method for cryo-electron microscopy, enabling us to provide more illumination to particles in order to improve the resolution of their images. However, particles embedded in ice give less contrast than a negatively stained specimen. In addition, the random orientation of the particles makes three-dimensional reconstruction more difficult.

Figure 1 summarizes the necessary image processing for single-particle analysis. Images with the same orientation are aligned and classified to obtain their class averages. With these clearer class images, the three-dimensional density is reconstructed by the back-projection method. Because of the abiding rotational variance and subtle inhomogeneity among the particles, careless clustering of images easily fails at low signal-to-noise ratios [10]. Therefore, robust clustering has been a matter of great interest to biologists [6, 9], and improved clustering and sophisticated image-processing protocols are in demand, especially for processing thousands of particles or more.

We have studied methods for clustering images from single-particle analysis using negatively stained sodium channels [8]. Our target is an automated software system for single-particle analysis with three-dimensional reconstruction from low-contrast images. In this paper, we describe our clustering algorithm for these single-particle images.

Currently, correspondence analysis [2] is widely employed for clustering images. A hierarchical clustering has also been proposed [4]. However, the reported methods [6, 9] require a substantial amount of human decision-making in complicated protocols, together with knowledge of the characteristics of the target particles. They therefore prevent us from implementing the clustering in an automated software system.

Our robust clustering strategy is based on bottom-up clustering with an iterative alignment algorithm [6] that attempts to extract common motifs among images. Iterative alignment plays an important role in the refinement of reconstructed three-dimensional models. By aligning observed images to a series of simulated projections of the model [6], it can produce a consistent set of projection images for the particle. In this paper, we also discuss the characteristics of projection images that constrain objective image clusters.

2 Definition of the Problem

Single-particle analysis involves the processing of information on projected electron microscopic images of particles with unknown orientation. The purpose of the analysis is to reconstruct the 3D structure of the particle. The problem is a joint-estimation problem involving both the original 3D structure of the particle and the projection angle of each image, all of which are unknown before the analysis.

The images are distributed on the larger electron microscopic picture plane. Therefore, our first step is to pick up the single-particle images from the picture plane. In the current study, we carry out this process manually, using a picking-up editor that we have developed. An automatic image-collection process is currently under development.

Because the raw images of the particle are highly noisy, the second step is to divide the images into clusters and build the class averages, the so-called characteristic views. The purpose of the clustering is to collect raw images with similar projection angles. Figure 3 briefly explains the situation. The clustering should be performed based on the two angles that define the direction of the projection. The third angle, the rotation angle in the projection plane, is neutralized by the rotation in the alignment procedure.

The clustering and the averaged images are highly sensitive to the selection of references and mask patterns. In existing methods, the analyses are neither robust nor automatic, and the results strongly depend on the intuition and experience of the researcher. We have been developing a software system for single-particle analysis with new algorithms that can actually build the 3D structure of the target molecule. In this paper, we focus on the robust image processing, clustering, and averaging of the electron microscopic images in our system.

3 Clustering and Alignment of Images

In single-particle analysis of proteins, the raw images are clustered by some algorithm, and the class averages are used for the first-step 3D model of the reconstruction. The class averages of the clusters have a better signal-to-noise ratio than the raw images and are much more easily handled in the 3D reconstruction procedure. It is very difficult to perform direct 3D reconstruction from noisy raw images because the 3D reconstruction involves estimating the projection angles, which have three-dimensional freedom that includes the rotation in the projection plane. The purpose of the clustering is to collect raw images having similar projection angles. Figure 3 briefly explains the situation. The clustering should be performed based on the projection direction, which has two-dimensional angular freedom. The third angle, the rotation angle in the projection plane, is neutralized by the rotation in the alignment procedure.

Figure 1: Overview of 3D reconstruction in single-particle analysis. The class averages are constructed by clustering the picked-up images. The class averages correspond to projections in certain directions, and the 3D structure is reconstructed by estimating the projection angles. The upper right figure shows the common-line analysis of two class averages, which is a basic algorithm related to the estimation of projection angles.

Figure 2: Picking-up editor for the electron microscopic images.

Figure 3: Sketch of a cluster and projection. The particle is projected in various directions. In this case, the cluster has a range ω of 3D angles for projection. The three images, which have different rotation angles θ1, θ2, θ3 in the projection plane, are aligned into one image.

3.1 Overview of Robust Clustering

We provide here an overview of our clustering system, which automatically builds robust clusters from manually picked-up particle images. The picked-up images are only roughly centered, so the clustering finds the best affine transformation (parallel shift and rotation) during the alignment of images. Before the clustering, the images are smoothed by a Gaussian filter and normalized to have uniform variance in their image values. See Sections 3.2 and 4.3 for an explanation of this procedure.

In the clustering, the clusters are represented by class averages. At each stage, the class averages are calculated by the iterative aligner starting from random alignments; the use of the iterative aligner is the key to building robust, reference-free images.

The first part of the clustering process is bottom-up clustering. Because bottom-up clustering requires a pairwise calculation of similarities among the raw images, traditional methods have avoided this approach. We, however, have developed parallelized software on workstation clusters, which makes these computationally heavy analyses possible.

1. Begin with individual images to represent all clusters.
2. Calculate the pairwise similarity. Calculate the similarity of all pairs of images, which is the covariance of the two images under the best affine alignment.
3. Calculate the class average for the most similar pair. The average image is built based on the best affine alignment. If either of the images is the class average of a cluster, use the iterative aligner on the original images. If the size of the cluster exceeds the threshold, move the cluster into storage.
4. Calculate the similarity of the new class average and the other images.
5. If more than two images remain, return to step 3.

The second part of the clustering involves a refinement of the clusters using a K-means-like algorithm. The traditional K-means algorithm is inappropriate because the space of the 2D images, which is scaled by similarity scores, is non-Euclidean. The class averages of our refinement do not exactly match the centroids in the K-means algorithm. The raw images are re-classified by using the class averages from the bottom-up clustering as references. The new class averages are calculated from the re-classified images using the iterative aligner. Traditional methods build the new class averages using the old class averages as references, but in our experiments the convergence of their average images is much worse than with our new method. We repeat this refinement process until the clusters converge, whereas the traditional method often requires us to stop the iteration before the class averages become deformed.
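The bottom-up steps above can be sketched in a few lines of Python. This is only an illustration under strong simplifications, not the system described here: images are flat lists that are assumed to be already aligned, the class average is a plain pixel-wise mean instead of the output of the iterative aligner, all pairs are rescored on every pass instead of only the new class average, and the stopping rule `min_score` is a hypothetical addition. The names `bottom_up`, `class_average`, `max_size` and `min_score` are ours.

```python
def cov(x, y):
    """Covariance of two images stored as flat lists of pixel values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def class_average(images, members):
    """Pixel-wise mean of the raw images of a cluster (alignment omitted)."""
    return [sum(images[m][i] for m in members) / len(members)
            for i in range(len(images[0]))]

def bottom_up(images, max_size, min_score):
    """Greedy bottom-up clustering; returns lists of raw-image indices."""
    active = {i: [i] for i in range(len(images))}      # step 1: singletons
    averages = {i: list(img) for i, img in enumerate(images)}
    stored = []                                        # clusters set aside
    next_id = len(images)
    while len(active) >= 2:
        ids = sorted(active)
        # steps 2 and 4: score every pair of current class averages
        score, a, b = max((cov(averages[p], averages[q]), p, q)
                          for k, p in enumerate(ids) for q in ids[k + 1:])
        if score < min_score:                          # hypothetical stop rule
            break
        # step 3: merge the most similar pair and rebuild its class average
        members = active.pop(a) + active.pop(b)
        del averages[a]
        del averages[b]
        if len(members) >= max_size:
            stored.append(members)                     # move into storage
        else:
            active[next_id] = members
            averages[next_id] = class_average(images, members)
            next_id += 1
    return stored + list(active.values())

# Two noisy copies of each of two toy patterns.
images = [[1, 1, 0, 0, 0, 0], [1, 0.9, 0.1, 0, 0, 0],
          [0, 0, 0, 0, 1, 1], [0, 0, 0.1, 0, 0.9, 1]]
clusters = bottom_up(images, max_size=3, min_score=0.0)
```

In this toy run the two copies of each pattern end up in the same cluster, and the negative covariance between the two class averages stops further merging.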

3.2 Scores of Alignment

In order to scale the similarity of the images, traditional methods have used the cross-correlation function to score the alignment.


Let $x = (x_0, x_1, \ldots, x_{N-1})$ and $y = (y_0, y_1, \ldots, y_{N-1})$ be the two images. Each image has a single index because its pixel values are expanded into a vector. The covariance of the two images is defined as follows:

$$\mathrm{Cov}(x, y) = \frac{1}{N}\sum_{i=0}^{N-1} (x_i - \bar{x})(y_i - \bar{y}). \tag{1}$$

After normalizing the covariance by the standard deviations of the images, the cross-correlation function is defined as:

$$\mathrm{Crr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y},$$

where

$$\sigma_x^2 = \frac{1}{N}\sum_{i=0}^{N-1} (x_i - \bar{x})^2, \qquad \sigma_y^2 = \frac{1}{N}\sum_{i=0}^{N-1} (y_i - \bar{y})^2 .$$

If the variances of the images are normalized to 1, the covariance and the cross correlation are the same. The average image of two normalized images has a smaller standard deviation because the cross correlation $\mathrm{Crr}(x, y)$ in the following equation is not greater than 1:

$$\sigma_{(x+y)/2}^2 = \frac{1}{4N}\sum_{i=0}^{N-1} \{(x_i + y_i) - (\bar{x} + \bar{y})\}^2 = \frac{1}{4}\sigma_x^2 + \frac{1}{4}\sigma_y^2 + \frac{1}{2}\mathrm{Cov}(x, y) = \frac{1}{4}\left(\sigma_x^2 + \sigma_y^2 + 2\sigma_x\sigma_y\,\mathrm{Crr}(x, y)\right).$$

If we use the cross correlation for the similarity score, the class averages should be normalized every time. We can avoid this problem by using covariance as the similarity score, because the raw images are normalized before the analysis and covariance is a linear function, as follows:

$$\mathrm{Cov}\left(\tfrac{1}{2}\{x + y\}, w\right) = \tfrac{1}{2}\mathrm{Cov}(x, w) + \tfrac{1}{2}\mathrm{Cov}(y, w). \tag{2}$$
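The two properties used above, the reduced variance of an average image and the linearity of the covariance, can be checked numerically. The sketch below assumes that the covariance carries the same 1/N factor as the variances; the toy four-pixel images and the function names are ours.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    """Covariance of two images flattened to equal-length vectors."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def crr(x, y):
    """Cross correlation: covariance divided by both standard deviations."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))

def normalize(x):
    """Shift to zero mean and scale to unit variance."""
    m, s = mean(x), math.sqrt(cov(x, x))
    return [(a - m) / s for a in x]

x = normalize([1.0, 2.0, 4.0, 3.0])
y = normalize([2.0, 1.0, 3.0, 5.0])
w = normalize([0.0, 1.0, 0.0, 2.0])
avg = [(a + b) / 2 for a, b in zip(x, y)]

# For unit-variance inputs the variance of the average is (1 + Crr(x, y)) / 2,
# which cannot exceed 1.
var_avg = cov(avg, avg)

# Linearity: the score of an average image is the average of the scores.
lhs = cov(avg, w)
rhs = 0.5 * cov(x, w) + 0.5 * cov(y, w)
```

Because of the linearity, the score of a class average against any image can be obtained from the scores of its members under the same alignment, without renormalizing the class average.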

3.3 Alignment with Reference

When the reference pattern is given, the alignment of the source images is calculated by independently aligning each image so as to maximize the score between the reference and the image. The alignment is achieved by rotation or affine transformation (parallel shift and rotation) in the image plane. The scores are usually calculated using mask patterns. When the original images are square, we should use mask patterns that are smaller than the largest circle inscribed in the square, in order to guarantee that every rotation gives corresponding data. The mask patterns should cover the image of the whole molecule, and carefully designed masks sometimes allow us to avoid the negative effects of noise outside the molecule.

The alignment is straightforward, but the result heavily depends on the choice of the reference. The same kind of distortion can also arise from the mask patterns, and by using distorted references and masks we can virtually 'write' any pattern by alignment. However, as long as the reference is properly selected, the influence of the mask is limited. We can repeat the alignment procedure by updating the reference pattern with the average image from the alignment. However, this method does not guarantee convergence to the local optimum. Figure 4 shows the influence of the reference and of the repeated alignment.
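A minimal sketch of such a reference-based alignment is given below. It is an illustration rather than the implementation used in our system: it searches a small grid of rotations and integer shifts exhaustively, uses nearest-neighbour sampling, and scores candidates with the covariance inside a circular mask. All function names and the 9 × 9 test pattern are hypothetical.

```python
import math

def circular_mask(n):
    """Pixel coordinates inside the largest circle that fits an n x n image."""
    c = (n - 1) / 2.0
    return [(r, col) for r in range(n) for col in range(n)
            if (r - c) ** 2 + (col - c) ** 2 <= c * c]

def transform(img, angle_deg, dx, dy):
    """Rotate about the image centre, then shift by (dx, dy) pixels.
    Nearest-neighbour sampling; pixels taken from outside are set to 0."""
    n = len(img)
    c = (n - 1) / 2.0
    ca = math.cos(math.radians(angle_deg))
    sa = math.sin(math.radians(angle_deg))
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for col in range(n):
            x, y = col - dx - c, r - dy - c           # undo the shift
            src_c = int(round(c + ca * x + sa * y))   # undo the rotation
            src_r = int(round(c - sa * x + ca * y))
            if 0 <= src_r < n and 0 <= src_c < n:
                out[r][col] = float(img[src_r][src_c])
    return out

def masked_cov(a, b, mask):
    """Covariance of two images restricted to the masked pixels."""
    xs = [a[r][c] for r, c in mask]
    ys = [b[r][c] for r, c in mask]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / len(xs)

def align(img, ref, angles, max_shift):
    """Return (score, angle, dx, dy, aligned image) with the best score."""
    mask = circular_mask(len(ref))
    best = None
    for ang in angles:
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                cand = transform(img, ang, dx, dy)
                s = masked_cov(cand, ref, mask)
                if best is None or s > best[0]:
                    best = (s, ang, dx, dy, cand)
    return best

ref = [[0] * 9 for _ in range(9)]
for r, col in [(2, 3), (2, 4), (2, 5), (3, 3), (4, 3), (4, 4), (5, 3), (6, 3)]:
    ref[r][col] = 1                       # a small 'F'-like shape
img = transform(ref, 90, 1, 0)            # rotated and shifted copy
score, angle, dx, dy, aligned = align(img, ref, range(0, 360, 90), max_shift=1)
```

In this noise-free case the search recovers a transformation that maps the copy back onto the reference inside the mask. With a wrong reference the same procedure still returns a best score, which is exactly why the choice of reference matters.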


Figure 4: An example of the effect of references. Left: the average images that have been aligned using the references 'X', 'F', and 'S'. Right: the 'correct' average image obtained with the iterative aligner. In each row, the adjacent image on the left is used as the reference for the alignment, and the averages are calculated from left to right. Starting from bad references ('S' and 'X'), the averages converge to wrong images. 'F' happened to be a good reference, and the converged average is similar to the 'correct' average image. All of these are reference-based alignments. Even when starting from such bad alignments, the iterative aligner finds the same alignment as in the right figure.

3.4 Iterative Aligner

Because of the importance of the reference pattern and our usual lack of knowledge of this pattern before and during the clustering, it is necessary to adopt a reference-free algorithm to align the images. If the images belong to a specific cluster and have a certain similarity, the iterative aligner can create a reasonably good alignment starting from random alignments. In order to make the alignment robust, we use an unbiased circular mask for this iterative alignment. The iterative aligner repeats the following process:

1. Remove one image from the alignment.
2. Calculate the average image using the remaining images.
3. Align the removed image to the average image.

Because the score in the last step never becomes worse, the iterative aligner always converges to a local optimum. This algorithm is exactly the same as the iterative multiple sequence alignment of amino acid sequences. Furthermore, this alignment problem itself has almost the same framework as the multiple sequence alignment problem. N-dimensional dynamic programming (N = number of sequences/images) is out of the question; in contrast, pairwise alignment, iterative alignment, and the A* algorithm are all applicable, but the iterative aligner is in practice the most powerful tool.
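The three-step loop can be illustrated with a deliberately reduced model. In the sketch below, which is an assumption-laden toy rather than our implementation, the "images" are 1D cyclic signals and the only transformation is a cyclic shift, standing in for the shift and rotation of 2D images; the score is the covariance. The names `iterative_align`, `total_score` and `max_sweeps` are ours.

```python
import random

def shift(v, s):
    """Cyclic shift of a 1D signal by s samples (toy stand-in for alignment)."""
    s %= len(v)
    return v[-s:] + v[:-s] if s else list(v)

def cov(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def iterative_align(signals, initial_shifts, max_sweeps=50):
    """Leave-one-out realignment; returns one shift per signal."""
    shifts = list(initial_shifts)
    n = len(signals[0])
    for _ in range(max_sweeps):
        changed = False
        for k, sig in enumerate(signals):
            # steps 1 and 2: average of all other signals as currently aligned
            others = [shift(s, d)
                      for j, (s, d) in enumerate(zip(signals, shifts)) if j != k]
            avg = [sum(col) / len(others) for col in zip(*others)]
            # step 3: realign the removed signal to that average
            best = max(range(n), key=lambda d: cov(shift(sig, d), avg))
            if cov(shift(sig, best), avg) > cov(shift(sig, shifts[k]), avg):
                shifts[k] = best
                changed = True
        if not changed:          # no single realignment improves the score
            break
    return shifts

def total_score(signals, shifts):
    """Sum of pairwise covariances of the aligned signals."""
    al = [shift(s, d) for s, d in zip(signals, shifts)]
    return sum(cov(al[i], al[j])
               for i in range(len(al)) for j in range(i + 1, len(al)))

base = [0, 0, 1, 3, 1, 0, 0, 0]
signals = [shift(base, d) for d in (1, 4, 6, 7)]
rng = random.Random(0)
start = [rng.randrange(len(base)) for _ in signals]   # random initial alignment
final = iterative_align(signals, start)
```

Each accepted realignment raises the covariance between one signal and the average of the others, which by the linearity of the covariance raises the sum of pairwise scores, so the loop can only stop at a local optimum of that sum.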

3.5 Application to Electron Micrographs

The membrane ion channel (about 200 kDa in vertebrates) is an essential element for the activity of neurons. Negatively stained images of the membrane ion channel were used in our analysis. Films were digitized into 16-bit values at 50 µm per pixel (LeafScan 45, Scitechs Inc.), which corresponds to 6.25 Å. The defocus values were estimated to be approximately 0.6 µm by Fourier transformation of the images. The results of our clustering are shown as class averages in Fig. 1. The results are consistent with those that were deduced by reference-based alignment with correspondence analysis. Our method successfully eliminated the use of references in clustering, as well as the experimental mask parameters that had been hard to obtain. The details of the obtained structure will be discussed elsewhere, together with its three-dimensional reconstruction. In addition to the negatively stained specimens, we have started image processing of low-contrast images of this protein obtained by cryo-microscopy.

4 Discussion

4.1 Comparisons with Existing Methods

There are two major software packages available for general single-particle analysis. SPIDER [1] facilitates computational tasks for tilted-pair observations, which have been the predominant method used for three-dimensional reconstructions. The package can also carry out iterative alignment, but appropriate use of the algorithm depends on user scripts. Furthermore, the understanding required to work with this legacy package and its rigorous scripting environment is not trivial. IMAGIC [5] features Euler-angle estimation for randomly oriented particles. Besides lacking an iterative aligner, this proprietary package prevents us from customizing code to avoid the pitfalls of reference-based alignment. We therefore decided to construct our own software system, implementing the described method and taking advantage of modern computational techniques. Our system also features a user-friendly particle picking-up editor and an implementation of the simultaneous minimization algorithm [7] to resolve the Euler angles of characteristic views.

4.2 Problems in Clustering

The problem in single-particle analysis is to estimate both the 3D structure of the molecule and the projection angles of the images at the same time, without any initial knowledge of either. Image clustering should collect the images with similar projection angles, but there is no way to guess the projection angle from each image. Instead, the clustering is performed based on image similarities, on the assumption that images will be similar if their projection angles are similar. However, there is of course no guarantee that similar images will have similar projection angles. It appears to be impossible to avoid this problem, and bottom-up clustering assumes that highly similar images must have very similar projection angles.

Therefore, it is necessary to remove nonsense clusters from the running set at every good opportunity. During the bottom-up clustering, the class averages derived by the iterative aligner from random alignments are compared with the direct alignment of the class averages of the parent clusters, and new clusters are removed if the two are not close enough. During the K-means-like refinement of the clustering, clusters that fail to attract enough raw images in the re-classification are removed. The iterative alignment of each new cluster is also compared with the previous class average, and the cluster is removed if they are not close enough.

Finally, even if the clustering produces beautiful average images, some of these images may not represent proper projection angles. Our last opportunity to reject these pseudo-clusters is to consider their consistency with other clusters in the 3D reconstruction process.

4.3 Principle of Smoothing

The pixel values in the projected images are the integrated density of the 3D structure along the projecting direction. Therefore, we can find the corresponding points of two projected images if the projecting directions of the two images are exactly the same. The images in a cluster, however, cover a range of projection angles, so their projection directions differ slightly from each other. The integration of the density of the 3D structure is then carried out along different lines, depending on the projection direction.


In such cases, it is impossible to find the exact correspondence of the points in the images. Therefore, it is necessary to smooth the raw images before clustering. Let us consider a simple situation in which the target molecule has a uniform density and the same radius R for each projection angle. If two images have different projection directions separated by an angle ω, the corresponding points in the 3D structure are projected at different locations on the plane. The distance between the two projected points is r sin(ω), where r is the distance from the center of the 3D structure to these points. If a cluster consists of images whose projection directions are uniformly distributed within the range of an angle ω*, corresponding points are uniformly projected within a circle of radius r sin(ω*). Therefore, by integrating the uniform density over r, the smoothing function should be an averaging filter whose weights decrease with distance.

Although this discussion of smoothing is intuitively easy, such analyses of smoothing have not been well studied in single-particle analysis. The parameters of the smoothing function can be calculated from the range of the angles that are covered by a cluster and the resolution of the images. In other words, we must change the smoothing level depending on how many clusters we are going to build. Because the number of clusters is related to the 3D reconstruction performance, it should be carefully selected. The number of available raw images should also be considered. If the number of clusters varies during the clustering process, it may be necessary to change the respective smoothing parameters. We have not implemented dynamic changes of smoothing during bottom-up clustering.

The raw images include several types of noise and signal transformations, as is their nature. It is necessary to pre-process the raw images in that sense also, but we do not discuss the details of this type of processing in the present paper.
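As a rough numerical illustration of the relation r sin(ω*), the sketch below converts an angular range into a blur radius in pixels and builds a small smoothing kernel whose weights decrease with distance. The 6.25 Å pixel size is taken from Section 3.5; the values r = 50 Å and ω* = 10°, the Gaussian shape, and the choice of sigma as half the blur radius are assumptions made only for this example, and the function names are ours.

```python
import math

def blur_radius_pixels(r_angstrom, omega_deg, pixel_angstrom):
    """Largest displacement r * sin(omega) of a projected point, in pixels."""
    return r_angstrom * math.sin(math.radians(omega_deg)) / pixel_angstrom

def smoothing_kernel(radius_px):
    """Normalized 2D kernel with weights that decrease with distance.
    A Gaussian with sigma = radius / 2 is an arbitrary illustrative choice."""
    half = max(1, math.ceil(radius_px))
    sigma = max(radius_px / 2.0, 1e-6)
    k = [[math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
          for dx in range(-half, half + 1)]
         for dy in range(-half, half + 1)]
    total = sum(sum(row) for row in k)
    return [[v / total for v in row] for row in k]

# Points 50 Å from the particle centre, cluster spanning 10 degrees,
# 6.25 Å per pixel: the displacement is about 1.4 pixels.
radius = blur_radius_pixels(50.0, 10.0, 6.25)
kernel = smoothing_kernel(radius)
```

Halving the angular range of a cluster roughly halves the blur radius for small angles, which is the sense in which the smoothing level depends on the number of clusters.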

4.4 Computation

Because the calculation of the covariance or correlation between two images is time consuming, the search space for finding the best affine transformation for the alignment should be minimized. The resolution of the transformation is related to the resolution of the images and the principle of smoothing. For the rotation, we combine a rough search that changes the angle in steps of 5 degrees with a fine search in steps of 1 degree. For affine transformations, which combine parallel shifts and rotations, we avoid an exhaustive search and instead repeat separate searches over parallel shifts and over rotations.

It is necessary to calculate the pairwise similarity scores in bottom-up clustering. If the number M of raw images is large, it is practically impossible to calculate similarity scores for all M(M − 1)/2 pairs. In such cases, the raw images are roughly classified into several groups and the bottom-up clustering is attempted in each group. In recent experiments, when attempting to analyze 8000 raw images with 96 × 96 resolution, the bottom-up clustering took one week to complete using 32 CPUs from Alpha workstation clusters and 20 CPUs from SGI machines.
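The two quantities discussed above are easy to make concrete. The following sketch shows a rough-then-fine rotation search over any scoring function and the pair count M(M − 1)/2; the cosine score used in the usage lines is a made-up stand-in for an image similarity, and the function names are ours.

```python
import math

def coarse_to_fine(score, coarse=5, fine=1):
    """Angle in degrees that maximizes score(angle): rough search in steps
    of `coarse` degrees, then a fine search around the best rough angle."""
    rough = max(range(0, 360, coarse), key=score)
    window = range(rough - coarse + fine, rough + coarse, fine)
    return max(window, key=score) % 360

def n_pairs(m):
    """Number of unordered image pairs among m images."""
    return m * (m - 1) // 2

# Toy score with a single peak at 37 degrees.
best = coarse_to_fine(lambda a: math.cos(math.radians(a - 37)))
pairs = n_pairs(8000)   # 31,996,000 pairs for 8000 raw images
```

With 5-degree and 1-degree steps the search evaluates 72 + 9 = 81 rotations instead of 360, at the risk of missing a narrow peak that lies between rough steps.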

5 Conclusion

We have developed a new clustering system for single-particle analysis. This system requires no manually designed reference and automatically builds robust class averages. The iterative alignment algorithm and bottom-up clustering with large computational power are the keys to achieving this robustness. We have also studied the implications of smoothing the raw images used in the clustering. The level of smoothing is directly related to the range of the projection angles of each cluster. The clustering system has been developed as part of a 3D reconstruction system for single-particle analysis. The total system will be discussed elsewhere.


References

[1] Frank, J., Shimkin, B., and Dowse, H., SPIDER – A Modular Software System For Electron Image Processing, Ultramicroscopy, 6:343–358, 1981.
[2] Frank, J. and Voublik, M., Multivariate Statistical Analysis of Ribosome Electron Micrographs, J. Mol. Biol., 161:117–137, 1982.
[3] Frank, J., Three-dimensional Electron Microscopy of Macromolecular Assemblies, Academic Press, London, 1996.
[4] van Heel, M., Classification of very large electron microscopical image data sets, Optik, 82:114–126, 1989.
[5] van Heel, M. and Keegstra, W., IMAGIC: a fast, flexible and friendly image analysis software system, Ultramicroscopy, 7:113–130, 1981.
[6] Penczek, P., Radermacher, M., and Frank, J., Three-dimensional reconstruction of single particles embedded in ice, Ultramicroscopy, 40:33–53, 1992.
[7] Penczek, P., Zhu, J., and Frank, J., A common-lines based method for determining orientations for N > 3 particle projections simultaneously, Ultramicroscopy, 63:205–218, 1996.
[8] Sato, C., Sato, M., Iwasaki, A., Doi, T., and Engel, A., The Sodium Channel Has Four Domains Surrounding a Central Pore, J. Struct. Biol., 121:314–325, 1998.
[9] Serysheva, E.V., Orlova, M., Chiu, W., Sherman, M.B., Hamilton, S., and van Heel, M., Electron cryomicroscopy and angular reconstitution used to visualize the skeletal muscle calcium release channel, Nature Struct. Biol., 2(1):18–24, 1995.
[10] Sigworth, F.J., A Maximum-Likelihood Approach to Single-Particle Image Refinement, J. Struct. Biol., 122:328–339, 1998.
[11] Ueno, Y., Takahashi, K., Asai, K., and Sato, C., BESPA: Software Tools for Three-dimensional Structure Reconstruction from Single Particle Images of Proteins, Genome Informatics, 10:241–242, 1999.
