Representation of viruses in the remediated PDB archive

A new data model for PDB entries of viruses and other biological assemblies with regular noncrystallographic symmetry is described.


Introduction
Recent improvements in structural biology methods have given rise to an increasing body of structural data for biological assemblies composed of tens to thousands of individual protein and/or nucleic acid polymer chains. Structures of such quaternary complexes or assemblies present many challenges for archival representation and validation, graphical display and analysis (Dutta & Berman, 2005).
Large biological assemblies are often composed of multiple copies of one or more polymer entities, with the arrangement of repeating units following a regular point or helical symmetry (Goodsell & Olson, 2000). The largest class of biological assemblies with regular symmetry currently represented in the Protein Data Bank (PDB) archive (Berman et al., 2000) comprises the icosahedral viruses, with approximately 250 structures determined either by X-ray crystallography or cryoelectron microscopy (cryoEM; reviewed by Harrison, 2001; Chiu & Rixon, 2002; Lee & Johnson, 2003). A smaller group of virus entries have helical symmetry: approximately 30 structures determined mainly by fiber X-ray diffraction methods (Marvin, 1998; Stubbs, 1999).
Assemblies may have multiple embedded symmetries or adjacent symmetries. For instance, the icosahedral Paramecium bursaria chlorella virus type 1 (PBCV-1) algal virus shell has thousands of copies of a membrane-embedded coat protein arranged with pseudocrystalline symmetry (Nandhagopal et al., 2002). The T4 tailed bacteriophage has fivefold, sixfold and helical symmetries aligned along a common axis (Leiman et al., 2003).
The PDB entries of icosahedral and helical viruses and a handful of other large biological assemblies with regular noncrystallographic symmetry were previously archived in an inconsistent manner and were prone to errors. To address these problems, we have developed a flexible scheme to represent assemblies with regular symmetry. The scheme involves four key elements: (i) a set of atomic coordinates representing the repeating unit, (ii) parameters defining the regular symmetry, (iii) an operations list containing regular symmetry operations plus any frame transformations (transformations between different coordinate frames) and (iv) a compact set of assembly-generation instructions, with the possibility of defining multiple assemblies. Using this scheme, instructions may be given to build a full icosahedral virus in the deposited frame, a pentamer subassembly of the virus in the standard icosahedral point frame and the asymmetric unit of the virus crystal in the standard space-group frame.
This representation was developed to provide uniformity among virus structures within the PDB as part of a larger remediation project to remove legacy errors and improve the uniformity of the entire archive (Henrick et al., 2008). The representation has been fully implemented in the PDB exchange dictionary and has been incorporated in the remediated entries of over 280 structures, mainly viruses but also several nonvirus assemblies (Table 1). The new scheme will permit routine annotation of future entries with regular and complex symmetries and will also make it possible to more easily build and view such assemblies within graphical display programs.

Background: remediation of virus entries
A review of 250 icosahedral virus structure entries and 30 helical virus entries deposited into the PDB between 1984 and 2006 revealed three major issues to be addressed in remediation: missing or erroneous sets of transformation operations, inconsistency in coordinate-frame representations and overly complex building instructions. For each issue, corrected information was gathered and validated in a systematic way.
For approximately 40% of virus entries, the set of matrix transformations needed to build up the full biological assembly either was absent or contained errors. Problem entries were identified by inspection of images generated via an automated script using the Multiscale Model module of Chimera (http://www.cgl.ucsf.edu/chimera/; Goddard et al., 2005; Pettersen et al., 2004). Corrected transformations were obtained from the Virus Particle Explorer database (VIPERdb; http://viperdb.scripps.edu; Reddy et al., 2001; Natarajan et al., 2005; Shepherd et al., 2006) or the Protein Quaternary Structure server (PQS; http://pqs.ebi.ac.uk; Henrick & Thornton, 1998). For helical viruses, parameters to construct representative matrix transformations were collected from PQS.
The atomic coordinates of virus entries have been archived in a variety of different coordinate reference frames. CryoEM structures and early crystal structures of icosahedral viruses are typically presented in one of two standard icosahedral reference frames. However, the recent trend for crystal structures is to deposit in the frame of the crystal lattice (Fig. 1). For each icosahedral virus, the transformation [P] that moves the deposited coordinates into the VIPER standard icosahedral frame was determined using the PDB2VIPER program (Shepherd et al., 2006) with minor modifications. 60 transformations [T_m], m = 1-60, were calculated for each assembly from a standard ordered set of icosahedral operations [I_m] using
$$[T_m] = [P]^{-1}\,[I_m]\,[P]. \qquad (1)$$
For 210 icosahedral virus crystal structures, transformations to the crystal lattice frame were collected from author text remarks or primary citations, extracted from SCALE records, or set to identity, as appropriate. One transformation was defined for each independent particle in a crystal asymmetric unit. Noncrystallographic symmetry (NCS) operations defining crystal asymmetric units were determined automatically using software developed in-house. Crystal packing was inspected using the Crystal Contacts module of Chimera.

Table 1
Remediated entries. Columns: symmetry type; entry IDs, sorted by experiment type.

Figure 1
Deposition frame of remediated icosahedral virus crystal structure entries. The number of entries is plotted by year of release and coordinate frame type. Entries with coordinates provided in the standard frame of the crystal lattice are represented by light yellow bars. Entries presented in an icosahedral frame and requiring one or more non-identity transformations to place virus particles into the crystal lattice are represented by dark blue bars.
Of 88 crystal structure entries with deposited structure factors, 70 yielded R factors below 0.40 (56 below 0.30) using SFCHECK (Vaguine et al., 1999). Before remediation, only a handful of these entries yielded reasonable validation statistics.
For the majority of virus-structure entries with atomic coordinates representing one regular symmetry (point or helical) asymmetric unit, application of regular symmetry operations is all that is required to build a full or representative assembly. However, several entries contain explicit atom coordinates for larger assemblies, e.g. an icosahedral pentamer, or a full crystal asymmetric unit with one quarter or one half of a full virus capsid. In some of these cases coordinates were presumably duplicated for convenient viewing of a particular interface, but in others regular symmetry is only approximate and explicit coordinates are required to represent the unique part of a lower symmetry structure. For the PBCV-1 virus (PDB code 1m4x; Nandhagopal et al., 2002), atomic coordinates are only provided for a small fraction (1/28th) of one icosahedral asymmetric unit containing three chains: a total of 3 × 28 × 60 chains and 16 284 240 atoms are required to build the complete capsid. In all of these special situations, symmetry-parameter representation and instructions for building complete assemblies from selections of matrix operations, selections of coordinates and/or hierarchical application of transformation operations were defined on a case-by-case basis.
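As a quick consistency check on the PBCV-1 figures quoted above, the chain count and an average atom count per chain can be worked out directly. The per-chain average is derived here for illustration and is not stated in the source.

$$3 \times 28 \times 60 = 5040\ \text{chains}, \qquad \frac{16\,284\,240\ \text{atoms}}{5040\ \text{chains}} = 3231\ \text{atoms per chain (average)}$$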

Representation of complexes with regular symmetry
In order to archive the corrected information gathered in the virus remediation process, the PDB exchange dictionary was extended (http://mmcif.pdb.org). New terms enable explicit definition of regular noncrystallographic point and helical symmetries and provide for definition of transformation operations and implementation of a compact notation for assembly generation. The new dictionary categories are used in conjunction with existing data items for crystal symmetry and logical groups of atomic coordinates. The resulting representation permits the description of biological assemblies with any regular symmetry and determined by any experimental method. An example of the representation in mmCIF format is provided as supplementary material.

Regular symmetry definitions
Regular symmetries include point, helical and crystal symmetries. Given parameters appropriate to the symmetry type and a standard reference frame with a defined relationship between symmetry axes and Cartesian coordinate axes, a complete set of symmetry operations can be defined for any point group and representative symmetry operations can be defined for any helical or crystal symmetry. The PDB follows standard definitions for crystal symmetry (Hahn, 2002). Parameter and standard frame definitions used for point and helical symmetries are described below and follow the conventions for cryoEM structural studies proposed by Heymann et al. (2005).

Table 2 (partial)
Standard frame (last recoverable entry): threefold on (1, 1, 1)
Asymmetric unit center-of-mass position | Circular: on +x | Dihedral: nearest +x and +z | Tetrahedral†: between +x, +z and (1, 1, 1) | Octahedral†: nearest +x and (1, 1, 1) | Icosahedral‡: T = 3, nearest (0, 1, φ) and +z; else nearest (0, 1, φ) and threefold on (φ/3, 0, (2φ + 1)/3)
† Tetrahedral and octahedral standard frames and hierarchy of symmetry operations follow International Tables for Crystallography definitions for cubic space groups P23 (No. 195) and P432 (No. 207), respectively (Hahn, 2002).
‡ The icosahedral standard frame is identical to that utilized by VIPERdb (Reddy et al., 2001), but the hierarchy of symmetry operations follows tetrahedral symmetry after the application of fivefold symmetry. φ = (5^{1/2} + 1)/2.
3.1.1. Point symmetry. The five point symmetries that can be adopted by biological assemblies are circular, dihedral, tetrahedral, octahedral and icosahedral, corresponding to Schönflies symbols C, D, T, O and I, respectively. For structures with circular or dihedral symmetry, a circular symmetry parameter is required to define the number of repeats around the major symmetry axis. Examples include a viral toxin with C38 symmetry (Fig. 2c), a clathrin cage with D6 symmetry (Fig. 2d) and a four-layer ring with D17 symmetry (Fig. 2e).
Standard frames and hierarchical order of symmetry operations for the point symmetries are defined in Table 2. In every case the symmetry center is at the origin and symmetry elements are aligned to major orthogonal coordinate axes. The icosahedral standard frame is identical to the VIPERdb frame, with twofolds aligned to the x, y, z axes and fivefolds closest to the z axis lying in the yz plane (Fig. 3). Icosahedral point-symmetry operations are initiated by the application of fivefold symmetry around the vector (0, 1, φ), where φ = (5^{1/2} + 1)/2, followed by application of tetrahedral symmetry operations. Where possible, the hierarchical order of symmetry operations follows the related space group: P23 for tetrahedral symmetry, P432 for octahedral symmetry.
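The construction described above (fivefold rotation about (0, 1, φ) applied first, then tetrahedral operations) can be sketched with NumPy. This is only an illustrative sketch: the function names are hypothetical, and the internal ordering of the twelve tetrahedral operations is an assumption, since the text fixes only the overall hierarchy and not the exact order used in the archive.

```python
import numpy as np

PHI = (np.sqrt(5.0) + 1.0) / 2.0  # golden ratio

def rotation(axis, angle_deg):
    """Rotation matrix about `axis` by `angle_deg` degrees (Rodrigues formula)."""
    x, y, z = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    a = np.radians(angle_deg)
    k = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(a) * k + (1.0 - np.cos(a)) * (k @ k)

def icosahedral_operations():
    """Return 60 rotations: fivefold about (0, 1, PHI) first, then tetrahedral."""
    fivefold = [rotation((0, 1, PHI), 72 * i) for i in range(5)]
    # Tetrahedral group: twofolds on z, y, x combined with a threefold on (1, 1, 1).
    # The order of these twelve operations is an assumption of this sketch.
    twofolds = [np.eye(3)] + [rotation(ax, 180) for ax in ((0, 0, 1), (0, 1, 0), (1, 0, 0))]
    threefolds = [rotation((1, 1, 1), 120 * i) for i in range(3)]
    tetrahedral = [c3 @ c2 for c3 in threefolds for c2 in twofolds]
    # The product t @ c5 applies the fivefold rotation first, then the tetrahedral one.
    return [t @ c5 for t in tetrahedral for c5 in fivefold]

ops = icosahedral_operations()
```

With this ordering the first operation is the identity and the first five operations form the fivefold ring, which matches the convention that a pentamer is built from the first five point-symmetry operations.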
The VIPER database restricts the position of the primary icosahedral asymmetric unit center of mass within the icosahedral standard frame (Natarajan et al., 2005; Shepherd et al., 2006). The advantage of restricted placement is that the transformation from an arbitrary deposited frame into the standard frame {[P] in (1)} has one unique solution. We utilize the same boundaries, as illustrated in Fig. 3: for triangle-shaped icosahedral asymmetric units (e.g. Fig. 2a) the center of mass must fall within the yellow outline, or for rhomboid-shaped icosahedral asymmetric units (e.g. Fig. 2b) within the green outline. Restricted placement conditions for the primary asymmetric unit center of mass are also defined for the other point symmetries (last row in Table 2).
3.1.2. Helical symmetry. Symmetry parameters, standard frames, hierarchy of symmetry operations and asymmetric unit placement for polar and nonpolar helical symmetries are defined in Table 3. Polar and nonpolar helical symmetries closely follow the definitions for related circular and dihedral point symmetries.
Helical screw symmetry is defined using three parameters in order to allow an exact repeat: rotation around the helical axis for n subunit repeats, translation along the helical axis for n subunit repeats and number of subunit repeats divisor (n). For example, the fiber-diffraction structure of cucumber green mottle mosaic virus (CGMMV; Fig. 2f) with 49 subunits in three turns has a rotation per subunit repeat of 1080/49 degrees and translation per subunit repeat of 70.8/49 Å. When there is no exact repeat, rotation and translation are defined for a single subunit repeat with the divisor set to unity.
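The three screw parameters can be turned into a transformation matrix as in the following sketch, which uses the CGMMV values quoted above. The function name is hypothetical, the helical axis is taken to lie on z, and the sign convention (counterclockwise rotation about +z for positive steps) is an assumption of the sketch.

```python
import numpy as np

def screw_operation(k, rotation_n_deg, translation_n, divisor):
    """4 x 4 matrix for k subunit steps of a helix whose axis lies on z.

    rotation_n_deg and translation_n describe `divisor` subunit repeats,
    so one step is rotation_n_deg / divisor and translation_n / divisor.
    """
    angle = np.radians(k * rotation_n_deg / divisor)
    rise = k * translation_n / divisor
    c, s = np.cos(angle), np.sin(angle)
    return np.array([
        [c, -s, 0.0, 0.0],
        [s, c, 0.0, 0.0],
        [0.0, 0.0, 1.0, rise],
        [0.0, 0.0, 0.0, 1.0],
    ])

# CGMMV: 49 subunits in three turns (1080 degrees) over 70.8 angstroms.
one_step = screw_operation(1, 1080.0, 70.8, 49)
full_repeat = screw_operation(49, 1080.0, 70.8, 49)
```

Storing the parameters for the full repeat keeps them exact: after 49 steps the rotation part returns to the identity and only the 70.8 Å translation remains.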
Two additional parameters define rotational symmetries of a helical assembly. The presence or absence of dyad symmetry perpendicular to the helical axis distinguishes nonpolar helical structures (two ends equivalent) and polar helical structures (each end unique). Circular symmetry is a positive integer that defines the number of subunit strands twisting in parallel about the helical axis. Circular symmetry is onefold for CGMMV (Fig. 2f) and fivefold for the filamentous phage illustrated in Fig. 2(g). Both of these helical viruses are polar.

Table 3 (partial)
Row | Polar helical | Nonpolar helical
Standard frame | n-fold and screw on z | n-fold and screw on z; twofold on x
Hierarchy of symmetry operations | n-fold on z; screw on z | n-fold on z; twofold on x; screw on z
Asymmetric unit center-of-mass position | On +x | Nearest +x and +z

Figure 3
Icosahedral standard frame, shown with respect to orthogonal coordinate axes. Fivefolds and threefolds nearest to the z axis are identified with symbols. Numbers show the order of symmetry operations for positions visible in this view. Yellow and green lines delimit the two alternate restricted placement boundaries for the first point asymmetric unit position.
Although not an essential parameter, the number of symmetry operations needed to generate a representative helical assembly should be defined. The number is arbitrary but should be large enough to represent the overall symmetry and all unique intersubunit interactions. It should also ideally be a multiple of the circular symmetry parameter, a multiple of 2 if dyad symmetry is present and a multiple of an odd number so that generated operations may be centered about the identity operation.
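One way to satisfy these recommendations is to choose an odd number of screw steps and multiply it by the rotational factors. The sketch below illustrates that reading; the function names and the decomposition into factors are assumptions made for illustration, not a rule taken from the dictionary.

```python
def representative_count(circular, dyad, screw_steps):
    """Total operations = circular symmetry x (2 if dyad) x odd number of screw steps."""
    if circular < 1:
        raise ValueError("circular symmetry must be a positive integer")
    if screw_steps < 1 or screw_steps % 2 == 0:
        raise ValueError("screw_steps must be odd so the run can be centered")
    return circular * (2 if dyad else 1) * screw_steps

def centered_indices(screw_steps):
    """Screw-step indices in a continuous run centered on the identity (index 0)."""
    half = screw_steps // 2
    return list(range(-half, half + 1))

# Example: a polar helix with fivefold circular symmetry and seven screw steps.
count = representative_count(circular=5, dyad=False, screw_steps=7)
indices = centered_indices(7)
```

Here `count` is 35 and `indices` runs from -3 to 3, so the identity operation sits in the middle of the run.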

Transformation operations list
All transformation operations that may be applied to the deposited orthogonal angstrom coordinate positions are gathered into a single unified list. The list can include transformations to other orthogonal coordinate frames, as well as regular point, helical and crystal symmetry operations in the deposited frame. Inverse transformations (i.e. transformations from other frames/positions into the deposited frame/position) are not included, since they do not meet the criteria of being applicable to the deposited coordinates.
Each operation is identified by a unique ID and is represented as a nine-element rotation matrix plus a three-element translation vector. To convert to the more convenient 16-element 4 × 4 matrix form, the rotation matrix is placed in the first three rows and columns and the translation vector becomes the first three elements of the fourth column. The fourth row is set to 0, 0, 0, 1. The resulting 4 × 4 matrix that operates on four-element vectors is
$$\begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$
where r_{ij} are the elements of the rotation matrix and t_i are the elements of the translation vector.
3.2.1. Frame transformations. Assemblies in experimental orthogonal coordinate frames other than the deposited frame may be defined. The deposited frame can be any arbitrary orthogonal coordinate frame favored by the deposition authors, although a standard frame is preferred. The relationship between the deposited frame and standard point, helical, crystal and/or other frames is then explicitly defined by including frame transformations in the operations list.

Figure 4
Assembly generation with regular point-symmetry example: 1al0, crystal structure of φX174 procapsid (Dokland et al., 1997). The pathway to generate assemblies in standard point, author-defined and crystal frames is shown. Frame transformations are represented by yellow arrows connecting the deposited frame, standard icosahedral point frame and crystal frame. See §3.3 for details.
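As a minimal sketch of the 4 × 4 form used for entries in the operations list, the following NumPy code packs a rotation matrix and translation vector into one matrix and applies it to coordinates. The function names are hypothetical, and the row-major ordering of the nine rotation elements is an assumption.

```python
import numpy as np

def to_4x4(rotation, translation):
    """Pack a nine-element rotation and three-element translation into a 4 x 4 matrix."""
    m = np.eye(4)  # fourth row is already 0, 0, 0, 1
    m[:3, :3] = np.asarray(rotation, dtype=float).reshape(3, 3)  # assumes row-major order
    m[:3, 3] = np.asarray(translation, dtype=float)
    return m

def apply_operation(matrix, coords):
    """Apply a 4 x 4 operation to an (N, 3) array of orthogonal coordinates."""
    coords = np.asarray(coords, dtype=float)
    homogeneous = np.hstack([coords, np.ones((len(coords), 1))])
    return (homogeneous @ matrix.T)[:, :3]

# Example: rotate 90 degrees about z, then shift 10 angstroms along x.
op = to_4x4([0, -1, 0, 1, 0, 0, 0, 0, 1], [10.0, 0.0, 0.0])
moved = apply_operation(op, [[1.0, 0.0, 0.0]])
```

A practical benefit of the 4 × 4 form is that successive operations, such as a symmetry operation followed by a frame transformation, combine into a single matrix product.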
3.2.2. Regular symmetry operations. Point, helical or crystal symmetry operations in the deposited frame of the entry may be included in the transformation list. By convention, point-symmetry operations begin with the identity operation and the order of subsequent operations follows the hierarchy for the defined symmetry in the standard frame (e.g. fivefold, twofold, twofold, threefold for icosahedral symmetry; see Table 2). For point symmetries deposited in nonstandard frames, symmetry operations are calculated using (1) after determination of the frame transformation matrix [P] (see §3.1.1). This method ensures that relative spatial relationships among symmetry-related asymmetric units are consistent across the database. For example, the pentamer subassembly of every remediated icosahedral virus entry may be built by applying the first five point-symmetry operations. Helical symmetry operations are defined in a continuous run centered about the identity operation.
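The calculation of point-symmetry operations for a nonstandard deposited frame can be sketched as follows. The sketch assumes the operation in the deposited frame is obtained by moving into the standard frame with [P], applying the standard operation, and moving back with the inverse of [P]. A fivefold ring about z stands in for the standard ordered set, and the matrix used for [P] is a hypothetical example.

```python
import numpy as np

def rotation_z(angle_deg):
    """4 x 4 rotation about the z axis."""
    a = np.radians(angle_deg)
    m = np.eye(4)
    m[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    return m

def operations_in_deposited_frame(p, standard_ops):
    """Conjugate each standard-frame operation S by the frame transformation P."""
    p_inv = np.linalg.inv(p)
    # Reading right to left: into the standard frame, apply S, back to the deposited frame.
    return [p_inv @ s @ p for s in standard_ops]

# Stand-in for a standard ordered set: identity first, then the rest of a fivefold ring.
standard_ops = [rotation_z(72 * k) for k in range(5)]

# Hypothetical frame transformation: 90-degree rotation about x plus a translation.
p = np.eye(4)
p[:3, :3] = [[1, 0, 0], [0, 0, -1], [0, 1, 0]]
p[:3, 3] = [5.0, -2.0, 7.0]

deposited_ops = operations_in_deposited_frame(p, standard_ops)
```

Because every operation is conjugated by the same [P], the ordering of the standard set carries over unchanged, which is what keeps the first five operations of each entry equivalent to a pentamer.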

Assembly generation
Here, we describe the logic for generating complete macromolecular assemblies for a PDB entry containing minimal coordinates plus a set of regular noncrystallographic symmetry operations. Fig. 4 presents an overview of generation of assemblies in multiple coordinate frames using the example of the icosahedral φX174 procapsid (PDB entry 1al0; Dokland et al., 1997), a structure determined by X-ray crystallography with two independent virus-particle positions in the crystal asymmetric unit. Atomic coordinates were deposited in an alternate icosahedral frame.
The assembly path begins at the top center of Fig. 4 with the deposited chains represented as enveloped ribbons and proceeds counterclockwise. The coordinates are moved into the standard icosahedral frame (upper left) by application of the frame-transformation matrix [P]. The complete biological assembly (lower left) is produced in the standard icosahedral frame by the application of 60 point-symmetry operations and is moved back to the deposited frame (bottom center) by the application of [P-inv], calculated as the inverse of matrix [P].
[X0] and [X1] are author-provided transformations that place two independent copies of the virus assembly onto the cubic (I2₁3) crystal lattice body diagonal (lower right). A subset of operations defines the crystal asymmetric unit (upper right).
Assembly definitions corresponding to the path in Fig. 4 are summarized in Table 4. Each definition includes a text description and a list of one or more operation expressions with associated coordinate selections. Operation expressions are given in a compact notation and specify matrices from the operations list, which includes frame transformations  [20]. Similarly, '(X1)(1-20)' specifies the portion of the crystal asymmetric unit belonging to the second independent virus particle. The two specifications listed together define the full crystal asymmetric unit (see bottom row of Table 4). Coordinate selections are given as lists of comma-separated coordinate-group identities (Bourne et al., 1997).
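The compact notation can be expanded mechanically into explicit combinations of operation IDs, as in the sketch below. The grammar is an assumption based on expressions of the form '(X1)(1-20)': each parenthesized group lists IDs or numeric ranges separated by commas, and adjacent groups are combined as a Cartesian product, with the rightmost operation assumed to be applied first. The function names are hypothetical.

```python
import itertools
import re

def expand_group(group):
    """Expand one group such as '1-20' or 'X0,X1' into a list of operation IDs."""
    ids = []
    for item in group.split(","):
        item = item.strip()
        match = re.fullmatch(r"(\d+)-(\d+)", item)
        if match:
            start, stop = int(match.group(1)), int(match.group(2))
            ids.extend(str(i) for i in range(start, stop + 1))
        elif item:
            ids.append(item)
    return ids

def expand_expression(expression):
    """Expand an operation expression into tuples of operation IDs."""
    groups = re.findall(r"\(([^()]*)\)", expression)
    if not groups:
        groups = [expression]  # bare expression without parentheses
    return list(itertools.product(*(expand_group(g) for g in groups)))

second_particle = expand_expression("(X1)(1-20)")
full_capsid = expand_expression("(1-60)")
```

Each resulting tuple names the matrices to multiply for one copy of the selected coordinates, so '(X1)(1-20)' yields 20 combined operations.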

Discussion
Remediated entries for the viruses and other assemblies listed in Table 1 were released into the PDB archive on 31 July 2007 and are available by ftp or web interface from any of the wwPDB partners (RCSB PDB, EBI MSD, PDBj; see http://wwpdb.org and Berman et al., 2003). PDB-format files automatically generated from remediated mmCIFs hold much of the updated information, including corrected BIOMT matrices to build the full biological assembly and a text description of the regular symmetry. For crystal structures deposited in the crystal frame, noncrystallographic symmetry operations to build the crystal asymmetric unit are provided in MTRIX records. The mmCIF files or their PDBML translations should be consulted for the most complete machine-readable representations of these entries.

Table 4
Assembly definitions, icosahedral virus crystal illustrated in Fig. 4.

Table 5
Assembly definitions, complex symmetry (PBCV-1).
One immediate consequence of remediation is that routine visualization of complete biological assemblies of viruses is now possible. Biological unit files containing explicit coordinates for the full assembly are available in the PDB archive and can be viewed with a number of different software programs. However, the downloading, storage and manipulation of a biological unit file is inefficient compared with handling the equivalent representation in matrices and coordinates. PBCV-1 virus (PDB entry 1m4x) is the most extreme case: the compressed storage size for the biounit file with 5040 chains is 1000 times bigger than the mmCIF or PDB file with three chains and matrices (0.3 Gb versus 0.3 Mb). The Chimera Multiscale Module was designed specifically for displaying large assemblies and can calculate full assemblies on the fly from PDB BIOMT records (Goddard et al., 2005); examples of its use are shown in Figs. 2 and 4.
Adoption of this mmCIF (or equivalently, PDBML) representation will further enhance the capabilities of visualization tools to display complex biological assemblies.
To optimally represent future entries of this type, we encourage the deposition of coordinates representing the minimal unique repeating unit along with a clear description of the symmetry, including all local, point, helical, two-dimensional and/or three-dimensional crystal parameters. A complete set of point-symmetry operations or representative set of helical operations should be provided in the deposited frame, along with known transformations to other experimental frames. We anticipate that continued progress in development of X-ray diffraction, cryoEM and other structural biology methods will result in many more examples of large biological assemblies with regular symmetry in years to come.