research papers
Scotty: lattice coincidences in the Protein Data Bank
aCambridge Institute for Medical Research, Department of Haematology, University of Cambridge, The Keith Peters Building, Hills Road, Cambridge CB2 0XY, United Kingdom, bRonin Institute for Independent Scholarship 2.0, 9515 NE 137th Street, Kirkland, WA 98034, USA, and cFresh Pond Research Institute c/o Brookhaven National Laboratory, NSLS II, Upton, NY 11973, USA
*Correspondence e-mail: [email protected]
The application Scotty, implemented within the Phasertng codebase, was used to perform a PDB-wide analysis of lattice coincidences. Using a broad definition of lattice coincidence, the number of distinct lattice clusters is approximately half the total number of crystallographic PDB entries. In over one thousand lattice clusters entries are reported in different space groups, consistent with pseudo-symmetric variation within a common lattice framework. Space-group frequencies computed at the lattice-coincidence level update those obtained by entry-based counting, and more accurately reflect priors for novel crystal forms. Combining lattice clustering with sequence identity and deposited oligomeric annotations reveals multiple cases of inconsistent biological assembly assignments among structures sharing near-identical lattices, suggesting an opportunity to improve annotations. The survey also identifies a range of protein systems forming extended in cellulo paracrystalline arrays, including storage, sequestration, toxin and membrane-associated proteins, an understudied area of structural biology. Overall, the results demonstrate that lattice-level analysis provides a valuable perspective on macromolecular self-association.
Keywords: non-isomorphism; PDB annotation; oligomerization; paracrystalline arrays.
1. Introduction
Searching the Protein Data Bank (PDB) for crystals with similar macromolecules arranged within a similar lattice is not trivial. Even in the simplest case of the same macromolecule in the same crystal form, crystallographers in different laboratories may choose different settings. Non-isomorphism will make the search more difficult, as unit-cell differences between related lattices can be very high; for example, the 10 Å change in a unit-cell dimension in HIV-1 reverse transcriptase (Esnouf et al., 1998
). The problem compounds if there has been a breakdown in one or more symmetry relationships within the lattice, so that the space groups also differ; even the crystal system can vary.
To improve searches for similar lattices we have developed Scotty, a software tool that searches the PDB to find lattice matches regardless of space-group setting, annotation or subgroup relationship, and allowing significant perturbations in lattice distances and angles, and other types of non-isomorphism such as skewed translations and small-molecular rotations (McCoy et al., 2026
). The search uses NCDist, a Niggli reduced cell distance metric (McGill et al., 2014
), to find nearest-neighbour lattices in the Niggli reduced cell space, followed by structure-factor intensity correlation (CCI) to filter for underlying atomic distribution matches. By identifying `coincident lattices', where crystal forms share the same fundamental three-dimensional macromolecular packing, we can recognize recurring higher order assemblies. These can be grouped to form `lattice clusters' where members of a cluster share three-dimensional intermolecular associations that build the crystal.
Lattice clusters can be evidence of consistent oligomerization of homologues. Crystal lattices habitually incorporate the protein solution oligomerization state (Gaber & Pavšič, 2021
). Protein oligomers usually possess point-group symmetry compatible with a space-group symmetry (such as Schönflies C2, D2), and the oligomer effectively reduces the number of unique crystal contacts required for lattice formation (Wukovitz & Yeates, 1995
). This mechanism increases crystallization propensity, and engineering oligomerization has even been proposed as a technique to encourage crystallization (Banatao et al., 2006
). That oligomeric symmetry is incorporated into crystallographic lattices can be seen indirectly in the altered space-group propensities for proteins that oligomerize versus those that are monomers in solution (Gaur, 2021
).
In some specialized cases the entire crystalline lattice has biological relevance, such as in the case of insulin (Asai et al., 2022
). While rare, functionally crystalline proteins exist across all kingdoms of life, linked to storage, secretion, structural rigidity or protection (Doye & Poon, 2006
; Schönherr et al., 2018
). In vivo crystalline arrays are generally understudied and many remain uncharacterized.
Biologically relevant interfaces tend to recur across different crystal forms, bury larger surface areas, display shape and electrostatic complementarity, and exhibit evolutionary conservation of key residues. Quantitative criteria estimating the likelihood that a given interface corresponds to a stable oligomer in solution are implemented in software tools such as PISA (Krissinel & Henrick, 2007
) and EPPIC (Duarte et al., 2012
). ProtCID (Xu & Dunbrack, 2020
, 2011
; Xu et al., 2008
) develops the idea of interface recurrence as evidence of biological relevance by clustering protein interfaces across PDB crystal structures at the Pfam domain level, to identify recurring protein–protein, domain–peptide, nucleic acid and ligand interactions and thereby generate hypotheses about biologically relevant assemblies.
Better identification of matching and near-matching lattices could strengthen approaches such as ProtCID by separating true recurrence of an interface from simple repetition of the lattice. If the same macromolecule crystallizes repeatedly in crystal forms falling in the same lattice cluster, this in itself is little evidence for biological relevance. In contrast, detection of related lattices among homologous proteins is highly informative. Finding lattice clusters could therefore act both as a filter, down-weighting redundant observations from essentially identical crystal forms, and as a discovery tool, highlighting conserved domain contacts in homologues where sequence, domain architecture and crystal packing have varied but the same interface persists. This would be particularly useful for weak oligomeric interfaces, which are difficult to distinguish from crystallization-induced contacts and which are often missed by biological assembly annotations or interface-prediction tools (Xu & Dunbrack, 2020
).
In this study, we use Scotty to survey the PDB to group the entries into lattice clusters and probe the resulting classification for physical and biological properties, giving examples of how the method can deepen analysis of difficult-to-study interactions.
2. Survey of the PDB for coincident lattices
A survey of lattice coincidences within the PDB archive was undertaken with the Scotty software (McCoy et al., 2021
, 2026
). Scotty was used to probe the PDB metric-space index built from each crystallographic structure in the 20260101 PDB snapshot with each structure in the same snapshot (all-on-all).
2.1. Measuring lattice coincidences
The Scotty software uses the NCDist (Niggli-cell distance) metric to find matches between Niggli reduced unit cells, a canonical representation of the lattice (unit cell) agnostic to and space-group setting. Matches are then filtered by CCI, which quantifies how similar the underlying atomic arrangements are via the structure factors, either observed or calculated. Results are reported in terms of √CC (= ), which is an estimate of σA, the complex correlation between true and model structure factors, and is also the expected real-space map correlation. Scotty allows a speed optimization to avoid costly structure-factor calculations for CCI calculation by allowing NCDist matches to be (conservatively) filtered by applying a (low) sequence-identity threshold between the match and the query.
The NCDist threshold for lattice matches was 5 Å. Lattice coincidences were filtered from the NCDist lattice matches returned from the PDB index for having at least one of the two criteria: either the √CCI was over 0.4 or the sequence identity between the model and target was over 30%, a low threshold for indicating the same fold.
2.2. Correlation between NCDist and √CCI
To evaluate the link between lattice similarity and electron-density agreement, the joint distribution of NCDist and √CCI was analysed across three sequence-identity regimes (Fig. 1
).
|
| Figure 1 Kernel density estimates of the relationship between NCDist and √CCI for lattice coincidences in PDB snapshot 20260101 by sequence identity. (a) Identical sequences (100% sequence identity), (b) moderate pairwise sequence identity (between 100% and 65%) and (c) low pairwise sequence identity (between 65% and 30%). In each panel, the distribution is shown as filled density contours, with darker shading indicating higher probability density. |
Cases in which there is a lattice match by NCDist < 5 Å and yet √CCI ≃ 0 can arise for three reasons.
|
2.3. Space groups with lattice coincidences
Matches with √CCI > 0.4 and differing space groups were identified. The matrix of space-group pairs (Fig. 2
) is generally sparse. The matrix is nearly symmetric above and below the diagonal, which confirms that the procedure is reciprocal and stable: if A matches B, B matches A. Rare nonsymmetric entries are due to minor numerical border-crossing effects around the 0.4 inclusion threshold.
| Figure 2 Matrix of space groups of lattice coincidences in PDB snapshot 20260101: probe space group (horizontal axis) and matched space group (vertical axis). Crystal systems for probe space group are coloured for clarity and boundaries between crystal systems are shown with a grey line for both probe and matched space groups. Grid positions are blue where the relationship is found in the data and orange where the relationship is missing but the reciprocal relationship is present; these isolated cases are due to numerical instability in the calculation of √CCI at the selection boundary. Within the 87 404 mismatch pairs, 3174 entries matched with higher symmetry space groups in 30 821 pairs, 13 536 entries matched with lower symmetry space groups in 54 568 pairs, and 1068 entries matched with different space groups with the same number of symmetry operations in 2015 pairs. |
The matrix has a block-like structure, revealing that most space-group transformations stay within the same by gaining or losing a single symmetry operator within tetragonal, hexagonal or cubic systems, or from orthorhombic to monoclinic β = 90°.
Certain high-connectivity space groups act as operational `hubs', linking families of subgroup/supergroup transitions. Low-symmetry space groups P1 and C2 are prominent in this role, often involving the loss/gain of more than one symmetry operator.
There are isolated matrix entries that are not explained by loss of symmetry but rather change of space-group screw-axis symmetry, for example P212121 to P21212. These cases have more than one molecule in the asymmetric unit. Where the screw has been lost, it is substituted for translational noncrystallographic symmetry (tNCS) that can be characterized as a pseudo-centring operator; for example, a tNCS vector at (0, 0, ½). The ambiguity lies in the alternative selections of the space-group symmetry/noncrystallographic symmetry, which may not always have been made optimally in the deposited structure.
Our observations do not necessarily imply that there is an error in any of the analysed crystal entries. When clusters contain structures spanning different space groups, these variations need not be byproducts of poor data analysis or flawed processing. These crystallographic differences may be fully justified by the underlying data-merging statistics, indicating that the lattice-matching algorithm is successfully clustering genuine, physically real perturbations in the crystal lattice symmetry. Furthermore, even in specific instances where the data might be under-merged, or there are differences in space-group symmetry/noncrystallographic symmetry pairings, there is no reason per se to expect the resulting structural models to be incorrect in any biologically significant way.
2.4. Lattice coverage of PDB
PDB entries were clustered into lattice-coincident groups using an undirected graph where the edges represent the pairwise matches and then finding all connected components. The database reduced to 88 846 lattice clusters (Fig. 3
), demonstrating that lattice diversity can be represented by approximately half of the entries in the PDB.
| Figure 3 Effect of duplicate identification on the effective search space for 190 770 entries in PDB snapshot 20260101. The main panel shows that the number of clusters decreases approximately linearly with increasing cluster size on log–log scaling consistent with a power-law-like decay. A long tail corresponds to a small number of highly over-represented entries. The largest ten clusters are labelled. Inset: proportion of entries retained after collapsing each duplicated cluster to a single representative. (a) Duplicates defined as the same unit cell, same space group, same sequence and low r.m.s.d. without fitting, as described in the text. (b) Duplicates defined by lattice clustering, as described in the text. |
We compared the number and size of the lattice clusters generated by Scotty with the number and size of trivially isomorphous replicate groups in the PDB. To qualify as trivially isomorphous – a duplicate – entries had to contain the same macromolecules with 100% primary-sequence identity, share identical deposited space groups and settings, exhibit tightly matched deposited unit-cell parameters (within 1.5 Å on unit-cell edges and 1.5° on unit-cell angles and an overall NCDist < 1.0 Å) and display a full-atom r.m.s.d. of less than 2 Å calculated strictly without rigid-body fitting (McCoy et al., 2026
). Because this overlay condition relies entirely on unmodified deposited coordinates, the resulting count is sensitive to the historical archival issues, where identical structures were sometimes processed in inappropriately low-symmetry space groups or alternative coordinate frames (Wlodawer et al., 2018
, 2025
; Dauter & Wlodawer, 2018
). Fortunately, there has been recent work to remediate the PDB archive so that structural duplicates can be cleanly recognized in automated workflows. Constructing an undirected graph where edges represent verified duplicates, finding all connected components yielded a total count of 161 121 replicate clusters, meaning that lattice clusters offer considerably larger groups and a reduced lattice diversity compared with that offered by lattice duplicates (Fig. 3
).
2.5. Lattice coincidences with multiple space groups
Out of the 88 846 lattice clusters, 1538 had two or more space groups (Table 1
). Clusters containing four or more space groups were limited to short DNA or α-helical peptide segments. These clusters were caused by the propensity of DNA and α-helices to form end-to-end continuous helical stacking in crystals. The resulting diffraction is dominated by the shared helical molecular transform, inflating the √CCI. Excluding these pathological cases, multi-space-group variations occurred in 1442 clusters and were capped at a maximum of four space groups.
| ||||||||||||||||||||||||||||||||||||||||||
The 1442 clusters containing multiple space groups generated a total of 87 404 directed pairwise mismatches between entries. This tally is heavily skewed by a few exceptionally large clusters due to cross-combination scaling with 2 × N1 × N2, where N1 and N2 are the numbers in each space-group category: a large cluster containing one entry with an outlier (N1 = 1, N2 ≫ N1) contributes disproportionately to the total. The median cluster containing multiple space groups had only two space groups and three entries (N1 = 1, N2 = 2), contributing four pairwise mismatches to the total.
We further assessed the multiple space-group clusters with other validation tools.
2.5.1. Under-merging (higher symmetry pairs)
When analysed via POINTLESS (Evans, 2011
), 51% of the 3008 entries (out of 3174) with deposited experimental data were flagged as potentially under-merged. An additional 27% had metrics indicating under-merging that fell just short of internal software thresholds for being reported.
The remaining 22% of the entries with deposited data were not identified as having higher lattice symmetry, showing that the Scotty procedure is more sensitive to detecting coincident lattices than the lattice matching and merging methods implemented in POINTLESS.
That the PDB contains under-merged data has been known since the audit by Dauter et al. (2014
), who characterized these crystallographic discrepancies as avoidable errors. The authors argued that while these lower symmetry models are technically valid, they obscure the true symmetry of the assembly, complicate crystal-packing analysis and introduce artificial structural clutter that hinders automated data harvesting.
2.5.2. Over-merging/twinning (lower symmetry pairs)
PDB-REDO identifies using SFCHECK (Vaguine et al., 1999
) twinning statistics followed by REFMAC R-value statistics (Joosten et al., 2014
). Tested against PDB-REDO gold-standard flags, only 1.7% of the 13 536 entries matched with lower symmetry counterparts were annotated as twinned. Because this is far below the baseline PDB-wide average of 4.2%, classic crystal does not explain these over-merged classifications.
2.6. Space-group statistics
Historical surveys of PDB space-group propensities consistently confirm a tripartite dominance of P212121, P21 and C2. However, the specific reported percentages vary significantly across the literature (published between 1990 and 2021) depending on database size, inclusion criteria, curation and/or categorization (Table 2
).
|
Space-group statistics for the PDB were calculated with one per lattice-coincidence cluster. Where a cluster had multiple space groups, the with the highest symmetry was taken to represent the cluster (Supplementary Table S1). This study finds that the space-group frequencies by lattice cluster closely match the frequencies of Chruszcz et al. (2008
), but push P21 even further towards equal frequency to P212121.
3. Lattice coincidences and quaternary structure
To evaluate the quaternary-structure information that lattice coincidences provide, we focused on crystals of single macromolecular protein entities (which excluded hetero-oligomers) from the 20260101 snapshot of the PDB archive. A structured query submitted to the RCSB PDB search API retrieved entries meeting the following criteria: (i) having all unit-cell dimensions greater than 5 Å, (ii) including one protein entity, (iii) excluding any nucleic acid polymer entities (DNA, RNA or hybrid), (iv) consisting of over 50 amino acids per assembly, (v) having only one deposited entity instance count and (vi) a polymer entity mutation count equal to zero.
The search returned 47 588 PDB identifiers in 6272 clusters. The assembly records of all entries in each of the 6272 clusters were extracted from the archive and were used to infer the oligomeric state of each entry.
3.1. Space-group statistics
Because biological oligomers generally have point-group symmetry, space groups P1, P21, P212121, P41, P42, P43, P31, P32, P61, P65, P62, P64 and P63, which lack point-group symmetry operations (Schönflies C2, C3, C4, C6, D2, D3, D4, D6, T or O), are underrepresented (Table 3
). However, a space group's maximal point-group symmetry does not necessarily match the oligomeric symmetry, since crystal-packing interactions can impose additional symmetry elements: this is as true for oligomeric assemblies as it is for monomeric proteins.
| |||||||||||||||||||||||||||||||
3.2. Oligomers
The survey enabled us to checkpoint the consistency of annotation of oligomeric symmetry within the PDB archive.
To check PDB oligomeric annotation consistency, clusters were filtered using MUSCLE alignments to eliminate low-diversity groupings, where the internal sequence identity exceeded 95%. This left 304 clusters that had retained lattice coincidences through evolution. The 304 cases were equally divided between enzymes (158) and non-enzymes (147). Individual cases of lattice coincidences were examined on a case-by-case basis for biological relevance. We give four examples below.
3.2.1. Keto–acid reductoisomerases
Keto–acid reductoisomerases from Campylobacter jejuni (PDB entry 7lat) and Methanothermococcus thermolithotrophicus (PDB entry 7q03) with 50% sequence identity share an I23 lattice coincidence after applying a re-indexing operation (−h, l, k), with √CCI = 0.42 and NCDist = 2.14 Å. The latter is annotated as a dodecamer (with point-group symmetry T) while the former is annotated as a dimer (with point group C2).
In EPPIC (Bliven et al., 2018
), both are identified as dimers. PISA (Krissinel & Henrick, 2007
) identifies the latter as a dodecamer and the former as a dimer, which matches the annotation (and is likely the source of the annotation). However, a closer look at the interfaces interpreted by PISA for the two cases show that the differences are very marginal (Table 4
).
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
In addition, the keto–acid reductoisomerase from Bacillus anthracis, with 57% sequence identity to that from C. jejuni and 54% sequence identity to that from M. thermolithothrophicus, crystallizes in space group I222 with three monomers in the asymmetric unit. This crystal form has NCDist = 3.5 Å and √CCI = 0.41 to that from M. thermolithotrophicus. It too is annotated as a dimer but shares the same dodecameric assembly within the lattice.
The coincident lattices suggest that the dodecamer is likely to be present in solution prior to crystallization.
3.2.2. Cardiotoxin
Cardiotoxins are small, disulfide-rich, non-enzymatic proteins that disrupt cell membranes. Cardiotoxin γ (PDB entry 1tgx) and cardiotoxin IV (PDB entry 1ug4) from two different species of cobra, with 85% sequence identity, share a lattice coincidence after applying a re-indexing operation from C2 to H32 (−h + k, h + k, ⅓h − ⅓k − ⅓l) with NCDist = 2.228 Å and √CCI = 0.49. POINTLESS does not identify the coincidence.
Cardiotoxin γ, the C2 crystal form, contains three molecules in the asymmetric unit arranged around a noncrystallographic threefold axis. This trimer has been proposed to represent a biologically relevant assembly, an ion-conducting trimeric pore, which is then stacked back-to-back by a crystallographic twofold axis to form a continuous hexameric channel. Cardiotoxin γ is annotated as a hexamer, and its pore-forming ability would be consistent with the known membrane-disruptive activity of cardiotoxins (Bilwes et al., 1994
). The structure of cardiotoxin IV has the same hexameric arrangement, but around the crystallographic threefold and twofold axes in H32, with one monomer in the asymmetric unit. Nonetheless, cardiotoxin IV is annotated as a monomer in the archive.
3.2.3. Methionine adenosyltransferase
Across the four human methionine adenosyltransferase (MAT) structures with PDB entries 8oog, 2p02, 8xam and 2obv, a consistent picture emerges once the crystals are compared at the lattice level rather than by deposited space-group annotations alone. The three PDB entries 8oog, 2p02 and 8xam are structures of MAT2A, while PDB entry 2obv corresponds to MAT1A, a closely related paralogue expressed in adult liver with 86% identity to MAT1A. The MAT2A structure 8xam in space group I121 with two molecules in the asymmetric unit shares a lattice coincidence with the other three, all of which are in space group I222 with one molecule in the asymmetric unit, and therefore all four have the same crystalline assembly. However, PDB entry 8oog is annotated as a monomer, PDB entries 2p02 and 8xam as C2 dimers and PDB entry 2obv as a D2 tetramer. It is most parsimonious to infer that MAT1A and MAT2A share a common tetrameric structure in solution, with the tetramer consistently represented in the even when not explicitly annotated as such in the PDB.
3.2.4. Ras-family GTPase
An intriguing case is provided by Ras-family GTPases. A lattice cluster of 86 structures (for example, PDB entries 7ycc, 5us4, 1lf0, 4l9w, 2cl0 and 3i3s), which includes H-Ras and K-Ras, and has pairwise sequence identities between 90% and 100%, have closely related hexagonal settings H32 (one molecule in the asymmetric unit), H3 (two molecules in the asymmetric unit) or C121 (PDB entry 7ycc, three molecules in the asymmetric unit). They are annotated variously as a monomer, dimer, trimer and hexamer. Despite this heterogeneity in space group and biological assembly annotation, the underlying crystallographic arrangement is a consistent recurrent hexamer. Ras has been shown to form nanoclusters on the plasma membrane, with each cluster containing on average ∼6–7.7 Ras molecules and having a radius of approximately 60 Å (Plowman et al., 2005
). The repeated observation of a conserved lattice across multiple Ras crystal structures with divergent sequences suggests that there is an energetically favourable set of intermolecular contacts compatible with nanocluster formation.
3.3. In cellulo paracrystalline arrays
Protein crystals can occur naturally in cellulo (McPherson, 1999
), an observation dating back to the mid-19th century. The assemblies often lack the long-range periodicity required for classification as true crystals and are often termed paracrystalline arrays. Naturally occurring crystals and paracrystalline arrays are known to serve a large variety of specific biological roles (Doye & Poon, 2006
). In archaea and bacteria, 2D crystals or paracrystalline arrays are well characterized, serving protective and structural roles. Helical assemblies of proteins are also known to be functionally important, enabling dense storage, dynamic organization and mechanical force generation across a variety of biological systems.
Paracrystalline arrays naturally lend themselves to study by X-ray crystallography; however, cryo-electron microscopy has also been utilized, for example in the study of ribosome microcrystals found in chick embryos (Nurullina et al., 2024
; Barbieri, 1979
)
Interestingly, given the rarity of in cellulo 3D, 2D and 1D helical arrays, the clusters of lattice coincidences with divergent sequences were highly enriched in these arrays, with 26 of the 304 clusters representing known cases; these are summarized in Table 5
, along with information about space groups and sequence identities. When considering only non-enzymes (or non-proenzymes), 22 of the 124, or nearly one-fifth, are proteins found to form in cellulo arrays. The 26 clusters represent 17 well known biological categories, as discussed below. While the enrichment in in cellulo arrays in this set of lattice coincidences could arise if the crystal packing recapitulates the cellular association, not all associations have relevant evidence from in cellulo imaging and characterization.
‡PISA: area is give in Å2 and ΔG in kcal mol−1; NHB is the number of hydrogen bonds and NSB is the number of salt bridges. §EPPIC: the prediction of assembly interface (BIO) versus crystal contact (XTAL) is qualified with a Z-score of the sequence entropy of core residues at 70% burial versus random samples of all surface residues [in brackets], with a gold star for high confidence and grey star for medium confidence, as per the EPPIC markup. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
3.3.1. Ferritin and bacterioferritin
Under conditions of iron overload, ferritin can accumulate in cells and assemble into arrays in a strategy to safely sequester excess iron in a compact, inert form. The building block of the arrays is a 24-subunit ferritin nanocage capable of mineralizing thousands of Fe atoms. Ferritin paracrystalline arrays have been documented in hepatocytes, Kupffer cells and tissue macrophages in pathological states such as hereditary hemochromatosis and transfusion hemosiderosis, within lysosomes or specialized siderosomes (Massover, 1993
). A closely related strategy is observed in bacteria through bacterioferritin, a homologous iron-storage protein that likewise assembles into a 24-subunit nanocage but incorporates haem groups at subunit interfaces.
3.3.2. Woronin bodies
Woronin bodies are specialized, membrane-bound found exclusively in filamentous fungi, such as Neurospora crassa, where they prevent cytoplasmic leakage following cell damage. Hex-1, the major protein component of Woronin bodies, self-assembles into crystalline cores within these organelles, providing the mechanical rigidity necessary for rapid pore sealing (Yuan et al., 2003
).
3.3.3. Parasporal crystals
Bacterial δ-endotoxins (Cry toxins) are highly specific insecticidal pore-forming proteins that facilitate propagation of the bacterium by killing susceptible insect larvae after ingestion, thereby converting the host carcass into a nutrient-rich environment for spore germination and proliferation. The proteins are produced in extremely large quantities during sporulation and are packed into crystals to be stored at high local concentration without being toxic to the bacterium itself. The crystals only dissolve in the alkaline pH and protease-rich environment of an insect midgut. Electron microscopy and X-ray diffraction studies have extensively characterized these in vivo crystals, which can reach micrometre size (Jisha et al., 2013
).
3.3.4. Charcot–Leyden crystals
Human galectin-10/Charcot–Leyden crystal protein is abundantly expressed in eosinophils. In vivo, it forms distinctive bipyramidal crystals: Charcot–Leyden crystals (CLCs). CLCs are a feature of asthma, are often found in parasitic diseases due to the influx of eosinophils and can be found in some haematological malignancies. These crystals are highly ordered and are easily identifiable by light and electron microscopy in tissue biopsies (Persson et al., 2019
).
3.3.5. Pseudo-Charcot–Leyden crystals
Ym1/Ym2 crystals in mice produce pseudo-Charcot–Leyden crystals (Ym1/Ym2 crystals) from chitinase-like proteins. These proteins are not homologous with galectin-10. However, both types of crystals are associated with type 2 immunity and, like CLCs, can be found in the context of allergic reactions and parasitic infections (Aegerter et al., 2021
).
3.3.6. Cypovirus polyhedra
Cypoviruses (cytoplasmic polyhedrosis viruses) are a genus of insect-infecting viruses belonging to the family Reoviridae. They possess a segmented double-stranded RNA genome and have non-enveloped, icosahedral capsids. Cypoviruses replicate exclusively in the cytoplasm of midgut epithelial cells, primarily in insect larvae from the order Lepidoptera. A distinctive feature of cypoviruses is their formation of occlusion bodies, in which crystalline polyhedrin proteins embed the viral particles, stabilizing and protecting them to enhance environmental persistence and transmission (Chen et al., 2011
).
3.3.7. Baculovirus polyhedra
Baculoviruses are large, double-stranded DNA viruses within the family Baculoviridae, also known for infecting insect larvae from the order Lepidoptera. These viruses have a circular dsDNA genome and a distinctive rod-shaped enveloped nucleocapsid. Unlike cypoviruses, baculoviruses replicate in the nucleus of infected cells. They also produce occlusion bodies that encase virions (Chiu et al., 2012
).
3.3.8. Entomopoxvirus spindles
Fusolin is a protein that forms part of `spindles', crystalline structures produced by entomopoxviruses (EPVs). These spindles enhance viral infection by disrupting the insects' peritrophic membrane (PM), a protective barrier in the gut. This disruption allows the virus to pass through the PM and reach the midgut epithelium, the primary site of infection (Mitsuhashi et al., 2007
).
3.3.9. Prorenin
Prorenin is the inactive precursor of renin, an enzyme in the renin–angiotensin–aldosterone system (RAAS), which regulates blood pressure, electrolyte balance and fluid homeostasis. Synthesized primarily in the juxtaglomerular cells of the kidney, prorenin can stored intracellularly in dense granules and released into the circulation under conditions of physiological demand (Galen et al., 1984
). Prorenin crystals, which may have the same form as the deposited structures (Yan, 2017
), have been observed in the cytoplasm of human tissue biopsies (Squires et al., 1984
), including from a patient with a juxtaglomerular cell tumour (Kuten et al., 2004
).
3.3.10. Annexins
Annexin V is a phospholipid-binding protein consisting of a single annexin core domain, with a role in membrane-related processes such as membrane repair, exocytosis, and apoptosis. It is a monomer in solution, but when bound on the cytoplasmic face of phospholipid membranes forms trimers that assemble into highly ordered 2D arrays in a calcium-dependent manner (Olofsson et al., 1994
).
3.3.11. Bacterial microcompartment shell proteins
Bacterial microcompartments (BMCs) are found in many bacteria that spatially organize specific metabolic pathways without using lipid membranes. BMC-H (hexameric) shell proteins are structural components of BMCs, forming flat, hexameric tiles that assemble into the proteinaceous shells of these EutM and CsoS1A are canonical BMC-H proteins that share a conserved fold consisting of a single BMC domain and form symmetric hexamers with small central pores that regulate metabolite diffusion. Despite their structural similarity, they occur in distinct functional contexts: EutM is a component of the ethanolamine-utilization (Eut) metabolosome, which catabolizes ethanolamine in enteric bacteria, whereas CsoS1A is found in α-carboxysomes, specialized BMCs used by autotrophic bacteria for CO2 fixation via the Calvin cycle (Trettel et al., 2024
).
3.3.12. Microbial rhodopsin
Bacteriorhodopsin is a light-driven proton pump embedded in the purple membrane of halophilic archaea, where it converts photon energy directly into a transmembrane proton gradient. It forms exceptionally well ordered 2D crystalline arrays in native membranes, which made it a model system for electron crystallography and early atomic-resolution studies of membrane-protein structure. Cruxrhodopsin-3 from Haloarcula vallismortis clusters with bacteriorhodopsin and can also oligomerize as trimers that laterally associate within the lipid bilayer to form extended, hexagonally ordered 2D arrays, analogous to the classic purple membrane. The structure in the cluster, in P321, is related to the structures in the cluster that includes the first structure of bacteriorhodopsin, PDB entry 1ap9 in space group P63, with √CCI of 0.3 and NCDist of 3.5 Å, which is outside the clustering criteria used for this study but nevertheless is significant.
3.3.13. Vicilin
Vicilins are plant seed storage proteins (7S globulins) that accumulate to extremely high concentrations during seed development and frequently organize into paracrystalline arrays within protein-storage vacuoles. Paracrystallinity allows vicilins to be stored in a compact, osmotically inert form while remaining readily mobilizable during germination, when proteolysis rapidly dismantles the ordered assemblies. Paracrystalline organization represents a functional storage strategy: structural order is sufficient to stabilize and condense large protein reserves, yet deliberately imperfect to permit efficient biological turnover.
3.3.14. RecA
RecA is a bacterial DNA-repair protein that plays a role in homologous recombination. In vivo, RecA polymerizes along single-stranded DNA to form a nucleoprotein filament in bacterial cells under stress. The RecA–DNA filament is considered a dynamic, functional crystal-like structure, exhibiting helical symmetry with a characteristic pitch (∼95 Å) and repeating subunits that align along the DNA backbone (Nishinaka et al., 2007
).
3.3.15. Amyloidosis
Amyloid light-chain amyloidosis is a polymerization-driven disease where misfolding events in light chains, from an abnormal clone of plasma cells in the bone marrow, lead to the formation of small amyloid nuclei, recruitment of further chains and rapid fibril elongation. This devastating process makes the disease progression self-amplifying. The structures of highly fibrillogenic light-chain variable-domain mutants shows molecules packing into extended arrays dominated by β-sheet-mediated contacts, producing linear, zipper-like interactions. These crystal-packing arrangements are widely interpreted as the fibrillar assemblies that accumulate in amyloidosis (Luna-Martínez et al., 2017
).
3.3.16. Phospholipase A2
Phospholipase A2 (PLA2) is a small, secreted enzyme that hydrolyses phospholipids in a calcium-dependent manner, releasing fatty acids and lysophospholipids and acting as a potent toxin. Because this activity would be catastrophic inside the producing cell, PLA2 must be rendered inactive prior to secretion. PLA2 is stored in dense secretory granules, as observed ultrastructurally in venom glands. Reversible and quasi-ordered packing provide an effective strategy to neutralize enzymatic activity by restricting access to membranes and substrates. Although true intracellular crystals of PLA2 have not been demonstrated, the combination of ready crystallizability, extreme local concentration in the cells and known venom-granule biology strongly supports storage in highly ordered granular assemblies that suppress activity until venom secretion.
4. Discussion
Multiple crystal forms sharing a lattice coincidence can appear across different investigations of the same or similar systems by different crystallographers separated by time and laboratory. Our PDB-wide analysis shows that lattice coincidences are common amongst deposited structures; the number of distinct lattice clusters is approximately half the total number of PDB entries, of which only half are identifiable through structure-overlap tests.
Our software Scotty can rapidly scan the PDB archive for closely and distantly related crystal forms. In general, the tolerances used for probing lattice coincidences for biological relevance can be broad, as although most matches suitable for DFFT phasing are made with NCDist < 3 Å, significant matches are made up to NCDist = 5 Å. Likewise, thresholds for √CCI (default 0.4) can be lowered, for example to 0.3, to consider matches that may be of interest when other supporting information suggests that the relationships are biologically relevant. However, care should be taken with short sections of DNA and coiled coils, where convergences in cell dimensions and strong helical diffraction patterns can give high √CCI regardless of sequence.
In the future, we aim to make the Scotty search available through server access. This will synchronize the search with PDB updates and further improve access.
Biological interpretation of lattice coincidences for any given use case must be approached with a nuanced understanding of the biology of the system under study. For example, from our own work, the identification of prorenin amongst the number of structures in the isomorphous set of distant homologues may indeed have biological relevance. Prorenin is the inactive precursor of renin, an enzyme in the renin–angiotensin–aldosterone system (RAAS), which regulates blood pressure, electrolyte balance and fluid homeostasis. Synthesized primarily in the juxtaglomerular cells of the kidney, prorenin can be stored intracellularly in dense granules, activated enzymatically to mature renin and released into the circulation under conditions of physiological demand (Galen et al., 1984
). Prorenin crystals have been observed in the cytoplasm of human tissue biopsies (Squires et al., 1984
). The identification of prorenin in our survey, which has a high proportion of biologically relevant paracrystalline assemblies, supports the proposition that the characterized prorenin crystals are likely to be those seen in cellulo. We aim to pursue experimental confirmation of this hypothesis with the aid of recent advances in cryo-electron tomography, in situ diffraction techniques and correlative light and (CLEM), which are now starting to enable characterization of these protein crystals within challenging native cellular environments (Yang et al., 2024
). However, this work remains technically difficult, not least in the difficulty of obtaining and preparing suitable tissue for analysis.
Surprisingly, our limited survey of the PDB archive, restricted to the subset of those entries with only one protein entity in the asymmetric unit, showed that lattices conserved through evolution were highly enriched in structures known to naturally form 3D, 2D and 1D helical assemblies. Note that our survey restrictions excluded known paracrystalline arrays such as insulin, which after processing of proinsulin is annotated as chain A and chain B, and hence does not meet the criteria for inclusion in the survey.
Our survey focused on clusters of proteins with divergent sequences as a proxy for biological significance in the absence of specific domain knowledge. Conservation of crystal packing between different crystal forms of the same or closely related proteins is also of interest. Given the speed and ease of crystallographic structure solution at modern synchrotron facilities, we would encourage crystallographers to consider solving all crystal forms observed in crystallization trials in order to extract any biologically relevant information that may be present in lattice coincidences.
We hope that those with specialized knowledge of interesting systems will find promising research questions, within and without the scope of this survey, with this new research tool.
Supporting information
Supplementary Table S1. DOI: https://doi.org/10.1107/S2059798326005723/nz5022sup1.pdf
Conflict of interest
There are no conflicts of interest.
Data availability
Scotty will be made available through the Phenix and CCP4 software packages.
Funding information
RJR and AJM acknowledge funding from the Biotechnology and Biological Sciences Research Council (UK) grant No. BB/Y009398/1. HJB was supported by the US DIALS National Resource, NIH R24GM154040.
References
Aegerter, H., Smole, U., Heyndrickx, I., Verstraete, K., Savvides, S. N., Hammad, H. & Lambrecht, B. N. (2021). Curr. Opin. Immunol. 72, 72–78. CrossRef CAS PubMed Google Scholar
Andersson, K. M. & Hovmöller, S. (2000). Acta Cryst. D56, 789–790. Web of Science CrossRef CAS IUCr Journals Google Scholar
Asai, S., Moravcová, J., Žáková, L., Selicharová, I., Hadravová, R., Brzozowski, A. M., Nováček, J. & Jiráček, J. (2022). Open Biol. 12, 220322. CrossRef PubMed Google Scholar
Banatao, D. R., Cascio, D., Crowley, C. S., Fleissner, M. R., Tienson, H. L. & Yeates, T. O. (2006). Proc. Natl Acad. Sci. USA, 103, 16230–16235. Web of Science CrossRef PubMed CAS Google Scholar
Barbieri, M. (1979). J. Supramol. Struct. 10, 349–357. CrossRef CAS PubMed Google Scholar
Bilwes, A., Rees, B., Moras, D., Ménez, R. & Ménez, A. (1994). J. Mol. Biol. 239, 122–136. CrossRef CAS PubMed Google Scholar
Bliven, S., Lafita, A., Parker, A., Capitani, G. & Duarte, J. M. (2018). PLoS Comput. Biol. 14, e1006104. CrossRef PubMed Google Scholar
Chen, J., Sun, J., Atanasov, I., Ryazantsev, S. & Zhou, Z. H. (2011). J. Virol. 85, 6077–6081. CrossRef CAS PubMed Google Scholar
Chiu, E., Coulibaly, F. & Metcalf, P. (2012). Curr. Opin. Struct. Biol. 22, 234–240. Web of Science CrossRef CAS PubMed Google Scholar
Chruszcz, M., Potrzebowski, W., Zimmerman, M. D., Grabowski, M., Zheng, H., Lasota, P. & Minor, W. (2008). Protein Sci. 17, 623–632. Web of Science CrossRef PubMed CAS Google Scholar
Dauter, Z. & Wlodawer, A. (2018). Acta Cryst. D74, 939–945. CrossRef IUCr Journals Google Scholar
Dauter, Z., Wlodawer, A., Minor, W., Jaskolski, M. & Rupp, B. (2014). IUCrJ, 1, 179–193. Web of Science CrossRef CAS PubMed IUCr Journals Google Scholar
Doye, J. P. K. & Poon, W. C. K. (2006). Curr. Opin. Colloid Interface Sci. 11, 40–46. Web of Science CrossRef CAS Google Scholar
Duarte, J. M., Srebniak, A., Schärer, M. A. & Capitani, G. (2012). BMC Bioinformatics, 13, 334. Google Scholar
Esnouf, R. M., Ren, J., Garman, E. F., Somers, D. O., Ross, C. K., Jones, E. Y., Stammers, D. K. & Stuart, D. I. (1998). Acta Cryst. D54, 938–953. Web of Science CrossRef CAS IUCr Journals Google Scholar
Evans, P. R. (2011). Acta Cryst. D67, 282–292. Web of Science CrossRef CAS IUCr Journals Google Scholar
Gaber, A. & Pavšič, M. (2021). Int. J. Mol. Sci. 22, 9081. CrossRef PubMed Google Scholar
Galen, F. X., Devaux, C., Houot, A. M., Menard, J., Corvol, P., Corvol, M. T., Gubler, M. C., Mounier, F. & Camilleri, J. P. (1984). J. Clin. Invest. 73, 1144–1155. CrossRef CAS PubMed Google Scholar
Gaur, R. K. (2021). Acta Cryst. F77, 187–191. Web of Science CrossRef IUCr Journals Google Scholar
Jisha, V. N., Smitha, R. B. & Benjamin, S. (2013). Adv Microbiol. 3, 462–472. CrossRef CAS Google Scholar
Joosten, R. P., Long, F., Murshudov, G. N. & Perrakis, A. (2014). IUCrJ, 1, 213–220. Web of Science CrossRef CAS PubMed IUCr Journals Google Scholar
Krissinel, E. & Henrick, K. (2007). J. Mol. Biol. 372, 774–797. Web of Science CrossRef PubMed CAS Google Scholar
Kuten, A., Olumi, A., Goldsmith, J., Monahan-Earley, R. A., Dvorak, A. M. & Genega, E. M. (2004). Arch. Pathol. Lab. Med. 128, e112–e114. CrossRef PubMed Google Scholar
Luna-Martínez, O. D., Hernández-Santoyo, A., Villalba-Velázquez, M. I., Sánchez-Alcalá, R., Fernández-Velasco, D. A. & Becerril, B. (2017). FEBS J. 284, 3702–3717. PubMed Google Scholar
Massover, W. H. (1993). Micron, 24, 389–437. CrossRef CAS Web of Science Google Scholar
McCoy, A. J., Andrews, L. C., Bernstein, H. J. & Read, R. J. (2026). Acta Cryst. D82, 813–823. CrossRef IUCr Journals Google Scholar
McCoy, A. J., Stockwell, D. H., Sammito, M. D., Oeffner, R. D., Hatti, K. S., Croll, T. I. & Read, R. J. (2021). Acta Cryst. D77, 1–10. Web of Science CrossRef IUCr Journals Google Scholar
McGill, K. J., Asadi, M., Karakasheva, M. T., Andrews, L. C. & Bernstein, H. J. (2014). J. Appl. Cryst. 47, 360–364. Web of Science CrossRef CAS IUCr Journals Google Scholar
McPherson, A. (1999). Crystallization of Biological Macromolecules. New York: Cold Spring Harbor Laboratory Press. Google Scholar
Mitsuhashi, W., Kawakita, H., Murakami, R., Takemoto, Y., Saiki, T., Miyamoto, K. & Wada, S. (2007). J. Virol. 81, 4235–4243. CrossRef PubMed CAS Google Scholar
Nishinaka, T., Doi, Y., Hara, R. & Yashima, E. (2007). J. Mol. Biol. 370, 837–845. CrossRef PubMed CAS Google Scholar
Nurullina, L., Terrosu, S., Myasnikov, A. G., Jenner, L. B. & Yusupov, M. (2024). FEBS Lett. 598, 537–547. CrossRef CAS PubMed Google Scholar
Olofsson, A., Mallouh, V. & Brisson, A. (1994). J. Struct. Biol. 113, 199–205. CrossRef CAS PubMed Google Scholar
Padmaja, N., Ramakumar, S. & Viswamitra, M. A. (1990). Acta Cryst. A46, 725–730. CrossRef CAS Web of Science IUCr Journals Google Scholar
Persson, E. K., Verstraete, K., Heyndrickx, I., Gevaert, E., Aegerter, H., Percier, J.-M., Deswarte, K., Verschueren, K. H. G., Dansercoer, A., Gras, D., Chanez, P., Bachert, C., Gonçalves, A., Van Gorp, H., De Haard, H., Blanchetot, C., Saunders, M., Hammad, H., Savvides, S. N. & Lambrecht, B. N. (2019). Science, 364, eaaw4295. CrossRef PubMed Google Scholar
Plowman, S. J., Muncke, C., Parton, R. G. & Hancock, J. F. (2005). Proc. Natl Acad. Sci. USA, 102, 15500–15505. CrossRef PubMed CAS Google Scholar
Schönherr, R., Rudolph, J. M. & Redecke, L. (2018). Biol. Chem. 399, 751–772. Web of Science PubMed Google Scholar
Squires, J. P., Ulbright, T. M., Deschryver-Kecskemeti, K. & Engleman, W. (1984). Cancer, 53, 516–523. CrossRef CAS PubMed Google Scholar
Trettel, D. S., Kerfeld, C. A. & Gonzalez-Esquer, C. R. (2024). Curr. Opin. Microbiol. 80, 102497. CrossRef PubMed Google Scholar
Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). Acta Cryst. D55, 191–205. Web of Science CrossRef CAS IUCr Journals Google Scholar
Wlodawer, A., Dauter, Z., Porebski, P. J., Minor, W., Stanfield, R., Jaskolski, M., Pozharski, E., Weichenberger, C. X. & Rupp, B. (2018). FEBS J. 285, 444–466. Web of Science CrossRef CAS PubMed Google Scholar
Wlodawer, A., Dauter, Z., Rubach, P., Minor, W., Jaskolski, M., Jiang, Z., Jeffcott, W., Anosova, O. & Kurlin, V. (2025). Acta Cryst. D81, 170–180. Web of Science CrossRef IUCr Journals Google Scholar
Wukovitz, S. W. & Yeates, T. O. (1995). Nat. Struct. Mol. Biol. 2, 1062–1067. CrossRef CAS Web of Science Google Scholar
Xu, Q., Canutescu, A. A., Wang, G., Shapovalov, M., Obradovic, Z. & Dunbrack, R. L. (2008). J. Mol. Biol. 381, 487–507. Web of Science CrossRef PubMed CAS Google Scholar
Xu, Q. & Dunbrack, R. L. (2011). Nucleic Acids Res. 39, D761–D770. Web of Science CrossRef CAS PubMed Google Scholar
Xu, Q. & Dunbrack, R. L. (2020). Nat. Commun. 11, 711. CrossRef PubMed Google Scholar
Yan, Y. (2017). Structural Understanding of Prorenin, Angiotensinogen and the Release of Angiotensin I. University of Cambridge. Google Scholar
Yang, J. E., Mitchell, J. M., Bingman, C. A., Mosher, D. F. & Wright, E. R. (2024). bioRxiv, 2024.10.09.617336. Google Scholar
Yuan, P., Jedd, G., Kumaran, D., Swaminathan, S., Shio, H., Hewitt, D., Chua, N.-H. & Swaminathan, K. (2003). Nat. Struct. Biol. 10, 264–270. CrossRef PubMed CAS Google Scholar
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.

journal menu
access



