Elucidating polymorphs of crystal structures by intensity-based hierarchical clustering analysis of multiple diffraction data sets

Single-step intensity-based hierarchical clustering is demonstrated to allow the detection of structural polymorphs in diffraction data sets obtained from multiple crystals. By splitting data sets collected using a continuous helical scheme into several chunks, both inter-crystal and intra-crystal polymorphs can successfully be analyzed.

Two simulations were performed based on the CC models of trypsin. Firstly, assuming 50 datasets for each of the two structural polymorphs, HCA was performed with the parameters (s, loc, scale) estimated from the observed datasets to evaluate whether the simulation is valid. After 100 repetitions, the mean and standard deviation of the Ward distance for classifying the two polymorphs were calculated (W1 in Fig S4).
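The parameter names (s, loc, scale) match the parameterization of `scipy.stats.lognorm`, so the estimation step can be sketched as follows. This is a minimal illustration that assumes SciPy was the fitting tool, which the text does not state; `fit_lognormal` is a hypothetical helper name and the input array is synthetic, not observed data.

```python
import numpy as np
from scipy.stats import lognorm


def fit_lognormal(values):
    """Fit a three-parameter log-normal and return (s, loc, scale)."""
    # lognorm.fit returns the shape parameter s followed by loc and scale.
    s, loc, scale = lognorm.fit(values)
    return float(s), float(loc), float(scale)


# Synthetic stand-in for an observed distribution (NOT real data).
rng = np.random.default_rng(0)
observed = lognorm.rvs(0.7, loc=0.003, scale=0.013, size=2000, random_state=rng)

s, loc, scale = fit_lognormal(observed)
print(f"s={s:.3f}, loc={loc:.4f}, scale={scale:.4f}")
```

The fitted triple can then be passed back to `lognorm.rvs` to draw simulated values from the same model.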
Acta Cryst. (2023). D79, https://doi.org/10.1107/S2059798323007039 Supporting information

Secondly, using the CC model, we searched for a suitable R in equation 1. If the structures of apo- and benzamidine-bound trypsin become very similar, the distributions of CC should overlap, making classification difficult. In the CC model, the loc parameter represents the location of the CC distribution. In this simulation, the distributions of CCapo-benz and CCbenz-benz were hypothetically brought closer by gradually moving the loc parameter of CCapo-benz toward that of CCbenz-benz. We divided the difference between the original CCapo-benz and CCbenz-benz loc parameters by ten and repeated the HCA simulation for ten steps, moving the CCapo-benz loc parameter closer to that of CCbenz-benz by one increment at each step. We performed the HCA simulation assuming an equal number of apo- and benz-trypsin datasets and evaluated whether they could be cleanly classified into two clusters on the dendrogram.
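The ten-step shift of the loc parameter can be sketched as below. The two loc values are those quoted in the Figure S13 caption (0.015 for CCapo-benz and 0.006 for CCbenz-benz); `shifted_locs` is a hypothetical helper, and the sketch assumes the steps are counted 1 to 10 so that the last step coincides with the CCbenz-benz value, which the text does not state explicitly.

```python
import numpy as np


def shifted_locs(loc_hetero, loc_homo, n_steps=10):
    """Move loc_hetero toward loc_homo in n_steps equal increments.

    ASSUMPTION: steps are counted 1..n_steps, so the final value equals loc_homo.
    """
    increment = (loc_hetero - loc_homo) / n_steps
    return np.array([loc_hetero - k * increment for k in range(1, n_steps + 1)])


# loc values quoted in the Figure S13 caption.
locs = shifted_locs(loc_hetero=0.015, loc_homo=0.006)
for step, value in enumerate(locs, start=1):
    print(f"step {step:2d}: loc = {value:.4f}")
```

Each value would replace the loc of the CCapo-benz model in one round of the HCA simulation while s and scale stay fixed.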
For the evaluation, we defined a score that quantifies whether apo- and benz-trypsin can be classified into two clusters on the dendrogram.

Purity is the fraction of A or B structure in each cluster, a number between 0 and 1, where 1 indicates 100% purity. We compared the purity of A and B in cluster 1 and set the larger of the two as the score, s1, for cluster 1. Similarly, the greater of the purity of A and the purity of B in cluster 2 was set as the purity of cluster 2, s2. The purity score was defined as the smaller of s1 and s2.
Finally, the total score was defined as the equally weighted sum of the balance score and the purity score:

total score = 0.5 × balance score + 0.5 × purity score

W0 is the Ward distance of the top cluster, and W1 is the larger of the Ward distances of the two clusters classified in the dendrogram. The value of W1 ultimately equals the isomorphic threshold (equation 1 and Figure S3). By plotting the resulting scores against R, we selected a valid R to estimate the 'isomorphic threshold'. This simulation was performed for 100, 200, 300, 500, and 1000 data sets.
An NLS peptide of Trn1 was purchased from Eurofins and dissolved in a gel filtration buffer. The Trn1-peptide complex solution was prepared by mixing Trn1Δloop with the peptide solution (the final concentrations of Trn1Δloop and NLS peptide were 5 mg ml⁻¹ and 5 mM, respectively). Then, 1 μl of the Trn1-peptide complex solution and 1 μl of reservoir solution (0.5 M NaK phosphate pH 5.0) were mixed at room temperature, and crystallization was performed using the sitting-drop vapor diffusion method at 10 °C. The obtained crystals were cryoprotected with reservoir solution containing 30% (w/v) glycerol prior to data collection.

S4. Data statistics for merged data obtained at cluster nodes
Crystallographic statistics may help interpret the existence of polymorphs in the obtained datasets. In particular, some statistics, e.g. <I/σ(I)>, Rmeas and CC1/2, obtained from the merged datasets at each cluster from intensity-based HCA may be informative. These statistics are summarized in Tables S4-S7 for the four datasets used in this study (two kinds of in silico mixed trypsin datasets, the Trn1-peptide complex, and AaHypD-C360S), respectively. The resolution cut-off for each merged dataset was estimated using kamo.decide_resolution_cutoff with the CC1/2 of the outer resolution shell (approximately CC1/2 ~ 0.50). All statistics were obtained from the XSCALE output for the final merged data after the outlier rejection process of KAMO. The number of chunks (30° each) is obtained from the input files for XSCALE; however, some frames may be rejected.

Figure S2
Dendrograms obtained with different linkage methods. The distance matrix was obtained from the intensity-based HCA on the in silico mixed dataset including apo and benzamidine-bound trypsin. Seven linkage methods available in the scipy module, namely "Ward", "single", "complete", "average", "weighted", "centroid", and "median", were applied. The data label at each leaf is colored green (apo-trypsin) or orange (benzamidine-bound trypsin). The color threshold of each dendrogram was left at the default (0.7 × maximum distance), except for the "Ward" linkage, for which 0.6 was used, as in Fig. 2.
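A comparison of this kind can be sketched with `scipy.cluster.hierarchy.linkage`, which accepts all seven method names. The sketch below is illustrative only: the CC matrix is a toy example (not data from this study), `linkages_for_all_methods` is a hypothetical helper, and the conversion from CC to distance uses the CC distance [1 − CC²]^(1/2) given in the Figure S11 caption. Note that SciPy documents "ward", "centroid" and "median" as strictly defined only for Euclidean distances, so applying them to a precomputed matrix is an approximation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

METHODS = ["ward", "single", "complete", "average", "weighted", "centroid", "median"]


def linkages_for_all_methods(cc):
    """Convert a symmetric CC matrix to distances and run every linkage method."""
    dist = np.sqrt(np.clip(1.0 - cc ** 2, 0.0, None))  # dCC = sqrt(1 - CC^2)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)  # upper triangle as a 1-D vector
    return {method: linkage(condensed, method=method) for method in METHODS}


# Toy CC matrix: two groups of three datasets (hypothetical values).
n = 6
cc = np.full((n, n), 0.90)   # CC between the groups
cc[:3, :3] = 0.99            # CC within group 1
cc[3:, 3:] = 0.99            # CC within group 2
np.fill_diagonal(cc, 1.0)

results = linkages_for_all_methods(cc)
groups = fcluster(results["ward"], t=2, criterion="maxclust")
print(groups)
```

Each linkage matrix can be passed to `scipy.cluster.hierarchy.dendrogram`, whose `color_threshold` defaults to 0.7 times the maximum linkage distance.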

Figure S19
The number of common reflections (black) and the ratio of rejected data (magenta) in the case of apo-trypsin. The high-resolution limit dmin in the CC calculation was set to 1.5 Å. The amount of rejected data increased drastically when the chunk size was set below 3.0°, owing to the lack of common reflections for calculating the CC value.
Classification into two clusters on the dendrogram was evaluated with two types of indicators: balance score and purity score. The balance score assumes that the HCA simulation divides the datasets into two clusters containing n1 and n2 datasets, respectively:

balance score = 1 − |n1 − n2| / (n1 + n2)

Since we assume an equal number of the two structures in this simulation, this score should be 1.0 if the HCA classification is successful.
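The balance, purity and total scores defined in this section can be written out as a short sketch. The function names and the example cluster contents are hypothetical; the formulas follow the definitions in the text (total score = 0.5 × balance score + 0.5 × purity score).

```python
from collections import Counter


def balance_score(n1, n2):
    """1 - |n1 - n2| / (n1 + n2); equals 1.0 for two equally sized clusters."""
    return 1.0 - abs(n1 - n2) / (n1 + n2)


def cluster_purity(labels):
    """Fraction of the most common structure label in one cluster."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def purity_score(cluster1, cluster2):
    """The smaller of the two per-cluster purities s1 and s2."""
    return min(cluster_purity(cluster1), cluster_purity(cluster2))


def total_score(cluster1, cluster2):
    """Equally weighted sum of balance score and purity score."""
    balance = balance_score(len(cluster1), len(cluster2))
    return 0.5 * balance + 0.5 * purity_score(cluster1, cluster2)


# Hypothetical outcomes of one simulated classification of 50 A + 50 B datasets.
perfect = total_score(["A"] * 50, ["B"] * 50)
mixed = total_score(["A"] * 30 + ["B"] * 10, ["A"] * 20 + ["B"] * 40)
print(perfect, round(mixed, 4))
```

In the mixed example the clusters hold 40 and 60 datasets, giving a balance score of 0.8 and a purity score of 40/60, so the total score falls below 1.0.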

Figure S1
2Fo-Fc and Fo-Fc maps around the inhibitor binding site of trypsin from a large-wedge dataset without splitting into chunks. Only the results from a single crystal are shown as a representative of each trypsin dataset. Structural formulas of the inhibitors used in this study: (a) 4-methoxybenzamidine (referred to as 'benzamidine') and (b) 5-chlorotryptamine (referred to as 'tryptamine'). 2Fo-Fc (gray mesh) and Fo-Fc (green) density maps obtained from (c) apo-trypsin, (d) benzamidine-bound, and (e) tryptamine-bound trypsin datasets are depicted. Inhibitors were omitted in map calculation. The contour levels of the 2Fo-Fc and Fo-Fc maps are 1.0 and 3.0, respectively. Structural formulas of the inhibitors were drawn with Molview (https://molview.org). 2Fo-Fc and Fo-Fc maps were generated with Coot.

Figure S3
Schematic diagram of our HCA simulation. In this figure, we assume a mixture of two structures, Structure A and Structure B, each with 50 datasets. We first fit a log-normal function to the observations to obtain probability density functions for each of the three CC distributions (CCA-A, CCB-B, and CCA-B). CC values were randomly drawn from the models, and a CC matrix was created to compute the distance matrix. Each colored box in the table contains the drawn CCs, and its color matches that of the model used for the draw. HCA was performed on the distance matrix to see whether the dendrogram classified the A and B structures. The isomorphic threshold is calculated as W1/W0, where W0 and W1 are the Ward distances indicated by the red arrows.
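The workflow in this schematic can be sketched in a few lines of SciPy. This is an illustration under stated assumptions, not the authors' implementation: the (s, loc, scale) triples are those quoted in the Figure S13 caption, the function names are hypothetical, and the drawn values are used directly as pairwise distances (their magnitude suggests a distance scale, but this is an assumption; if they are CC values they would first need converting to distances).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.stats import lognorm

# (s, loc, scale) quoted in the Figure S13 caption; A = apo, B = benzamidine-bound.
PARAMS = {
    "AA": (0.704, 0.003, 0.013),
    "BB": (0.765, 0.006, 0.025),
    "AB": (0.787, 0.015, 0.025),
}


def simulate_condensed_distances(n_a, n_b, params, rng):
    """Draw one value per dataset pair from the model matching the pair type.

    ASSUMPTION: drawn values are treated directly as pairwise distances.
    """
    is_b = np.array([False] * n_a + [True] * n_b)
    i, j = np.triu_indices(n_a + n_b, k=1)  # same pair order as SciPy's condensed form
    n_b_in_pair = is_b[i].astype(int) + is_b[j].astype(int)  # 0: AA, 1: AB, 2: BB
    values = np.empty(i.size)
    for code, key in enumerate(("AA", "AB", "BB")):
        mask = n_b_in_pair == code
        s, loc, scale = params[key]
        values[mask] = lognorm.rvs(s, loc=loc, scale=scale,
                                   size=int(mask.sum()), random_state=rng)
    return values


def top_ward_distances(Z):
    """Return W0 (height of the top node) and W1 (larger height of its two children)."""
    n = Z.shape[0] + 1
    w0 = Z[-1, 2]
    children = Z[-1, :2].astype(int)
    # Child ids >= n refer to merged clusters; ids < n are single leaves (height 0).
    w1 = max(Z[c - n, 2] if c >= n else 0.0 for c in children)
    return w0, w1


rng = np.random.default_rng(0)
distances = simulate_condensed_distances(50, 50, PARAMS, rng)
Z = linkage(distances, method="ward")
w0, w1 = top_ward_distances(Z)
print(f"W0={w0:.3f}, W1={w1:.3f}, W1/W0={w1 / w0:.3f}")
```

Repeating the draw with different seeds gives a distribution of W1/W0 from which a mean and standard deviation can be computed.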

Figure S4
Comparison of unit-cell constants among apo-trypsin and inhibitor-bound trypsin datasets. The distribution of each cell parameter is illustrated with a combination of box plot and swarm plot. The vertical axes of the plots are drawn on the same scale. Although the datasets differed slightly along each unit-cell axis, the distributions of the axis lengths largely overlapped.

Figure S6
Result of unit cell-based HCA on in silico mixed datasets containing benzamidine-bound and tryptamine-bound trypsin datasets, represented by electron density maps. Electron density maps calculated from merged data at different nodes in the dendrogram are illustrated. The contour levels of the 2Fo-Fc map (gray mesh) and Fo-Fc map (green mesh) are 1.0 and 3.0, respectively. Figures were generated with Coot as used in the NABE pipeline.

Figure S7
Result of intensity-based HCA on in silico mixed datasets containing benzamidine-bound and tryptamine-bound trypsin datasets, represented by electron density maps. Electron density maps calculated from merged data at different nodes in the dendrogram are illustrated. The contour levels of the 2Fo-Fc map (gray mesh) and Fo-Fc map (green mesh) are 1.0 and 3.0, respectively. Figures were generated with Coot as used in the NABE pipeline.

Figure S8
Results of the intensity-based HCA on the in silico mixed dataset containing apo-, benzamidine-bound, and tryptamine-bound trypsin. Clusters 115, 114 and 116 were obtained as isomorphic clusters within the suggested 'isomorphic threshold'.

Figure S9
Dendrogram from unit cell-based HCA on in silico mixed datasets consisting of apo-, benzamidine-bound, and tryptamine-bound trypsin. The leaf label of each dataset is colored green (apo-trypsin), orange (benzamidine-bound trypsin), or blue (tryptamine-bound trypsin). A cluster in the dendrogram is drawn in the same color as its leaf labels when more than three chunks of the same data form the cluster.

Figure S10
Change of CC upon rotation of a moiety of the trypsin molecule. The CC between the original and rotated trypsin molecule is plotted. The plot is colored by the rotated moiety; green: whole trypsin molecule (223 amino acid (AA) residues), blue: a quarter of the molecule (57 AA), and magenta: a terminal helix (10 AA in length).

Figure S11
Histograms of dCC (CC distance: [1 − CC²]^(1/2)) calculated between homogeneous and heterogeneous chunks of trypsin. (a) All histograms drawn in the same plot. (b) dCC obtained between two apo chunks, CCapo-apo. (c) dCC obtained between two benzamidine-bound chunks, CCbenz-benz. (d) dCC obtained for the heterogeneous combination of apo and benzamidine-bound chunks, CCapo-benz.

Figure S13
HCA simulation based on the observed CC distributions using the trypsin test case. (a) A log-normal distribution was fitted to each of the observed CCapo-apo, CCbenz-benz, and CCapo-benz distributions. The fitted parameters (s, loc, scale) of the log-normal distribution for CCapo-apo, CCbenz-benz, and CCapo-benz were estimated as (0.704, 0.003, 0.013), (0.765, 0.006, 0.025) and (0.787, 0.015, 0.025), respectively. Each model seems to fit the observed CC distribution well. (b) One of the dendrograms from the HCA simulation assuming a log-normal distribution of CC, with

Figure S15
Dendrogram from unit cell-based HCA on the Trn1-NLS peptide complex. The data labels are colored by crystal ID; crystal 1: orange, crystal 2: blue, crystal 3: green, and crystal 4: magenta. A cluster in the dendrogram is drawn in the same color as its leaf labels when more than three chunks from the same crystal form the cluster.

Figure S16
Peptide-model omitted Fo-Fc maps obtained at different clusters from the unit cell-based HCA on the Trn1-peptide complex. The contour level of the Fo-Fc map (green mesh: positive, red mesh: negative) is set to 3.0. The maps were calculated without a model of the bound peptide. Only the main chains of Trn1 and the bound peptide are depicted. In contrast to the results from the intensity-based HCA, both peptide-binding forms were observed with similar density. Figures were generated with Coot.

Figure S17
Intra-crystal polymorphism implied by intensity-based HCA on the Trn1-NLS peptide complex. Each 30° chunk is colored according to the result of intensity-based HCA; green: included in Cluster 76 (Form 1 dominant cluster), orange: included in Cluster 72 (Form 2 dominant cluster), gray: not well clustered (outlier-like in the dendrogram), white: rejected in pre-processing by KAMO prior to the clustering.

Figure S18
Electron density maps around the N-terminal region and the [4Fe-4S] cluster obtained from the merged data of AaHypD-C360S at (a)-(b) Cluster 51 and (c)-(d) Cluster 42. The contour level of the 2Fo-Fc map (blue mesh) is 1.0, except for the [4Fe-4S] region in Cluster 42, where it is set to 1.5. The contour level of the Fo-Fc map (green mesh: positive, red mesh: negative) is 3.0. Only the main chains of the unfolded (purple) and folded (pink) models of the N-terminal region are depicted. The variable N-terminal region (Ser7-Tyr12) was omitted and the occupancy of [4Fe-4S] was set to 1.0 in map calculation. Figures were generated with Coot.

Table S1
Summary of pre-processing prior to HCA by KAMO

Table S2
Number of chunks in merged data at each node from unit cell-based clustering (apo-trypsin + benzamidine-bound trypsin)

Table S3
Number of chunks in merged data at each node from unit cell-based clustering (benzamidine-bound trypsin + tryptamine-bound trypsin)

Table S4
Data statistics for merged data obtained from intensity-based clustering (apo +
Values in parentheses indicate inner/outer resolution shell. Number of chunks is the value based on the final merged datasets after the outlier rejection process by KAMO. *

Table S5
Data statistics for merged data obtained from intensity-based clustering (benzamidine-bound + tryptamine-bound trypsin)
Values in parentheses indicate inner/outer resolution shell. Number of chunks is the value based on the final merged datasets after the outlier rejection process by KAMO. *