research papers

JOURNAL OF
APPLIED
CRYSTALLOGRAPHY
ISSN: 1600-5767

The AI-based phase-seeding (AI-PhaSeed) method: early applications and statistical analysis

Institute of Crystallography, National Research Council, via Amendola 122/o, Bari, 70126, Italy
*Correspondence e-mail: [email protected]

Edited by Th. Proffen, Oak Ridge National Laboratory, USA (Received 24 July 2025; accepted 19 September 2025; online 18 October 2025)

The crystallographic challenge of structure determination is nowadays effectively supported by advanced computational methods, such as direct methods and Patterson techniques, implemented in sophisticated software. With the rapid expansion of artificial intelligence (AI) across diverse scientific domains, exploring its potential contribution to structure solution and its ability to overcome the limitations of traditional approaches has become increasingly compelling. This work builds upon and extends the findings of two recent studies on AI-driven phasing. The first, by Larsen et al. [Science (2024), 385, 522–528], focused on designing and applying a neural network architecture to solve small structures (with unit-cell volumes up to 1000 Å³), primarily within the most common centrosymmetric space group P2₁/c. The second, by Carrozzini et al. [Acta Cryst. (2025), A81, 188–201], introduced a novel phase-seeding method applicable to both centrosymmetric and non-centrosymmetric crystal structures of varying complexity, from small molecules to proteins. Although designed with AI integration in mind, this latter method had not yet been tested within an AI framework. In this paper, we apply the method proposed by Carrozzini et al. to cases where seed phases are generated by the AI network developed by Larsen et al. We demonstrate that this combined approach, termed AI-PhaSeed, successfully extends the applicability of Larsen's neural network to structures with unit-cell volumes exceeding 1000 Å³, even under conditions of limited experimental resolution. The proposed procedure has been extensively tested on a set of structures taken from the Crystallography Open Database, proving it to be a powerful and reliable tool for structure solution. We also provide insights into the use of AI for crystallographic phasing and introduce statistical tools to evaluate the robustness of the solution process based on AI-calculated phases.

1. Abbreviations

AI: artificial intelligence.

DM: direct methods.

MPE: mean phase error.

Rf: crystallographic agreement factor.

Nasym: number of non-hydrogen atoms in the asymmetric unit.

Nrefl: number of measured symmetry-independent reflections.

EDM: electron-density modification.

Eh: normalized structure factor for reflection h.

CORR: the correlation coefficient between phased and true electron-density maps.

2. Introduction

The phase problem in crystallography has long represented a central challenge, complicating the determination of crystal structures for organic, inorganic or metal–organic compounds regardless of their asymmetric unit size, whether small (<80 non-hydrogen atoms), medium (<300) or large (≥300). Today, the routine solution of small- and medium-sized single-crystal structures using X-ray diffraction data collected in standard laboratories is largely feasible, provided the experimental data are of sufficient quality and resolution. This progress is the result of substantial theoretical and methodological advancements, including the development of direct methods (Giacovazzo, 2013), Patterson techniques (Rius, 2014) and dual-space approaches such as charge flipping (Oszlányi & Sütő, 2011). These approaches have been further empowered by significant improvements in diffractometer instrumentation and, crucially, in software packages that enhance modern computational capabilities (Burla et al., 2015; Rius, 2011; Palatinus & Chapuis, 2007). Nevertheless, structure solution remains a challenge in some cases, especially for large molecules. Difficulties also emerge when the experimental resolution is far from atomic, ∼1.2 Å according to the 'Sheldrick rule' (Sheldrick, 1990), which is the threshold at which individual atoms can be clearly resolved in an electron-density map.

The application of artificial intelligence (AI) is currently being explored across various scientific disciplines. In the field of structural crystallography, one of the most significant breakthroughs has been the development of AlphaFold by Google DeepMind (Jumper et al., 2021), which has demonstrated unprecedented accuracy in predicting protein 3D structures directly from amino acid sequences. This milestone has underscored the transformative potential of AI in structural biology and opened new avenues for accelerating structure determination beyond traditional experimental approaches.

In a pioneering study, Larsen et al. (2024) demonstrated that AI can be effectively harnessed to address the phase problem in crystallography, successfully solving ab initio crystal structures with unit-cell volumes up to 1000 Å³ using only experimental amplitude data. Their deep learning architecture was specifically designed for small structures, primarily within the space group P2₁/c, and was trained on a dataset comprising millions of artificial crystal structures. While the work of Larsen et al. (2024) was focused on a relatively narrow class of structures, some of which may be solvable through traditional direct methods or Patterson techniques, its findings represent a major step forward. Notably, the study demonstrated that AI can determine missing phases, a task that has traditionally relied on decades of theoretical and algorithmic development in crystallography. Beyond its immediate achievements, this work opened new avenues for applying AI to more complex scenarios that remain challenging for conventional methods, particularly when the experimental resolution is limited. These opportunities can be further explored using the recently developed phase-seeding method (Carrozzini et al., 2025), which was explicitly designed to be AI compatible and applicable to crystal structures of any size and space group. The method is based on the accurate identification of a small subset of seed phases; when these are combined with experimentally derived amplitudes and randomly initialized phases, they can drive the reconstruction of the full electron-density map. This is accomplished by iterative procedures in both direct and reciprocal space, enhanced by electron-density modification (EDM) cycles to extend and refine the phase set. For non-centrosymmetric structures, the method introduces a discretization of both seed and random phases into a limited number of distinct values, thereby enabling classification and refinement through an AI-driven algorithm. Although this method has been robustly defined and tested in silico, it had not yet been implemented in a fully AI-based framework prior to the present work.

This paper presents the first application of the phase-seeding method, in combination with the neural network developed by Larsen et al. (2024), extending its use to structures still within the space group P2₁/c but with unit-cell volumes ranging from 1000 Å³ up to 3500 Å³, well beyond the previously studied limit below 1000 Å³. These structures include organic, inorganic and metal–organic compounds. Despite the original design constraints of the neural network, the combined AI-based phase-seeding approach has demonstrated promising performance, benefiting from the advances introduced by the phase-seeding method, which requires reliable phase estimates for only a limited subset of reflections.

The novelty of this study lies in its clear demonstration that the AI model developed by Larsen et al. (2024) can be extended beyond its initial volumetric limitations through the integration of the phase-seeding strategy. We provide a comprehensive assessment of the capabilities and limitations of the AI-based phase-seeding (AI-PhaSeed) method. Validated on a large dataset of experimentally determined structures, AI-PhaSeed proves to be a robust and competitive alternative to traditional phasing approaches such as direct methods (DM).

Particularly noteworthy is the section addressing phasing from limited-resolution data, where we demonstrate that AI-PhaSeed can successfully solve structures that remain intractable using traditional methods.

3. Methods

3.1. The AI-based phase-seeding (AI-PhaSeed) method

The AI-based phase-seeding method integrates the concept of phase seeding with AI to solve crystal structures from X-ray diffraction data, once the unit cell and space group have been determined. Its effectiveness has been tested on real small-sized structures with unit-cell volumes ranging from 1000 to 3500 Å³ and space group P2₁/c, deposited in the Crystallography Open Database (COD) (Gražulis et al., 2009; Downs & Hall-Wallace, 2003).

AI-PhaSeed involves the following steps, outlined in Fig. 1:

[Figure 1]
Figure 1
Scheme of the AI-PhaSeed method. The image of the phase seeding is taken from Madsen (2025) and reproduced with permission.

(i) For each real structure in the centrosymmetric space group P2₁/c, a subset of experimental reflections is selected to meet the input size requirements of the PhAI neural network (Larsen et al., 2024). The selected reflections follow the constraint on Miller indices (hkl) such that (h² + k² + l²)^(1/2) ≤ 10 and, due to the Laue symmetry 2/m, −10 < h < 10, 0 < k < 10 and 0 < l < 10. This selection is performed automatically by a built-in tool within SIR2024, the latest updated version of the SIR2014 package (Burla et al., 2015). For each hkl reflection, the amplitude and an initial phase set to zero are provided as input to PhAI, and each run uses five phasing cycles.

This subset constitutes only a small fraction of the total number of reflections for each structure.

(ii) The phases predicted by the PhAI network (restricted to 0 or π) provide the phase-seed values required by the phase-seeding procedure (Carrozzini et al., 2025). The accuracy of the AI-generated phases is evaluated using the mean phase error (MPEseed), calculated as the average difference between the AI-derived phases and those derived from the known structural model, as well as the correlation coefficient (CORRseed) as defined by Larsen et al. (2024).

(iii) For non-seed reflections, phases are randomly assigned as either 0 or π. The subsequent phase extension and refinement procedure employs iterative EDM cycles, using the experimental amplitudes as constraints (Carrozzini et al., 2025). The final electron-density map is computed during this process, which is executed using the SIR2024 software (Burla et al., 2015). Three validation parameters are used to evaluate the reliability and accuracy of both the phasing process and the resulting structure solution: (1) the final mean phase error (MPEfinal), calculated in reciprocal space as the deviation between the refined and model-derived phases; (2) the final correlation coefficient (CORRfinal); and (3) the crystallographic agreement factor (Rf) obtained by comparing calculated and observed amplitudes.
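Steps (i) and (ii) can be illustrated with a minimal Python sketch. It is not the SIR2024 or PhAI code: the function names are hypothetical, the toy reflections and phases are invented, and the `strict` flag exists only because the bounds as printed are strict inequalities while an inclusive reading is also conceivable. The mean phase error shown here assumes that predicted and reference phases refer to the same origin.

```python
import math

def in_phai_window(h, k, l, max_radius=10.0, strict=True):
    """Return True if (h, k, l) satisfies the index constraints quoted in step (i)."""
    if math.sqrt(h * h + k * k + l * l) > max_radius:
        return False
    if strict:
        # Bounds exactly as printed in the text (strict inequalities).
        return -10 < h < 10 and 0 < k < 10 and 0 < l < 10
    # Assumed alternative reading: inclusive bounds.
    return -10 <= h <= 10 and 0 <= k <= 10 and 0 <= l <= 10

def select_seed(reflections, strict=True):
    """Keep the (hkl, amplitude) pairs that fall inside the index window."""
    return [(hkl, amp) for hkl, amp in reflections
            if in_phai_window(*hkl, strict=strict)]

def mean_phase_error_deg(phases, reference):
    """Mean absolute phase difference in degrees, wrapped to [0, 180]."""
    total = 0.0
    for p, q in zip(phases, reference):
        d = abs(p - q) % (2 * math.pi)
        total += math.degrees(min(d, 2 * math.pi - d))
    return total / len(phases)

# Invented toy data: ((h, k, l), amplitude)
reflections = [((1, 2, 3), 5.0), ((9, 9, 9), 1.0), ((0, 0, 1), 2.0), ((-3, 1, 1), 4.0)]
seed = select_seed(reflections)

# Centrosymmetric phases are restricted to 0 or pi; one of four is wrong here.
predicted = [0.0, math.pi, math.pi, 0.0]
reference = [0.0, math.pi, 0.0, 0.0]
mpe_seed = mean_phase_error_deg(predicted, reference)
```

For phases restricted to 0 or π, this error is simply 180° times the fraction of incorrectly assigned signs.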

3.2. AI-PhaSeed combined with direct methods

In the AI-PhaSeed method, the phases generated by AI are extended and refined using direct-space methods applying cycles of electron-density map modifications (Burla et al., 2010) (path A in Fig. 2). The conventional phasing protocol, typically performed using DM for small- and medium-sized compounds, involves initial phase determination followed by phase extension and refinement in direct space, the same approach as adopted in the AI-PhaSeed method (path B in Fig. 2).

[Figure 2]
Figure 2
Workflow of the AI-PhaSeed method (path A), the classical phasing procedure (path B) and their combination DM+AI-PhaSeed (path C).

In this study we present the first applications of the AI-PhaSeed method and compare its performance with the classical DM phasing procedure (path A versus path B in Fig. 2). In addition, we propose a strategy, called DM+AI-PhaSeed (see Section 4.2), that combines the two procedures (path C in Fig. 2), based on the theory outlined in Section 3.3. In this approach, the AI-generated phase seed is actively integrated into DM to drive the multi-solution process more effectively towards a reliable phase set.

3.3. AI-based tangent formula

The direct methods approach exploits structure invariants, phase combinations unaffected by origin shifts, typically derived from a set of strong reflections (|Eh| > 1.2), referred to as Nlarge. Phases are estimated from known amplitudes via the tangent formula (Karle & Hauptman, 1956), which provides probabilistic phase values based on statistical relationships. The phasing process starts with random phase assignment to Nlarge reflections followed by iterative refinement. Since convergence is not always achieved, the procedure is repeated with different random inputs, a multi-solution approach that increases the chance of identifying correct phase estimates.

According to the formalism described by Giacovazzo (1998, pp. 123–127), the total probability P(θh) that the phase of Eh involved in r structure-invariant relationships (specifically triplets, in this case) equals θh can be expressed as a suitably normalized product of r specific probabilities,

Mathematical equation

where L is a normalizing factor, and ϕ_k and ϕ_{h−k} are the phases of the reflections contributing to the invariants. Usually, for the tangent formula, we have

Mathematical equation

where Mathematical equation and Zj is the atomic number of the jth atom.
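For orientation, the standard textbook forms of the product of triplet (Cochran-type) distributions, the triplet reliability parameter and the resulting tangent formula are sketched below. The notation is an assumption: the index s over triplets, the symbols G_s and σ_n, and the overall layout are chosen here for illustration and may differ from the paper's equations (1) and (2).

```latex
% Standard textbook forms; notation assumed, may differ from equations (1) and (2).
P(\theta_{\mathbf h}) \simeq L^{-1} \prod_{s=1}^{r}
  \exp\!\left[ G_s \cos\!\left(\theta_{\mathbf h}
      - \phi_{\mathbf k_s} - \phi_{\mathbf h-\mathbf k_s}\right) \right],
\qquad
G_s = 2\,\sigma_3\,\sigma_2^{-3/2}\,
      \left| E_{\mathbf h}\, E_{\mathbf k_s}\, E_{\mathbf h-\mathbf k_s} \right|,
\qquad
\sigma_n = \sum_{j} Z_j^{\,n}

% Most probable phase (tangent formula) implied by the product above
\tan\theta_{\mathbf h} =
  \frac{\sum_{s} G_s \sin\!\left(\phi_{\mathbf k_s} + \phi_{\mathbf h-\mathbf k_s}\right)}
       {\sum_{s} G_s \cos\!\left(\phi_{\mathbf k_s} + \phi_{\mathbf h-\mathbf k_s}\right)}
```

For a structure of N equal atoms in the unit cell, σ_3 σ_2^(−3/2) reduces to N^(−1/2), so the reliability of each triplet decreases as the structure grows.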

When a priori phase information is available, such as AI-derived phase estimates for a subset of Nlarge reflections, this information can be directly combined into equation (1), with random phases assigned to the rest. A priori phase information could be generated by running a neural network multiple times with different random seeds. However, results from this study indicate that performing multiple random trials to generate phase histograms offers limited benefit. Consequently, a single AI-derived phase seed, as used in this work, is sufficient to drive the phasing process effectively, reducing computational cost without compromising accuracy.

Thus, incorporating a priori knowledge from AI-derived phase values with reliability Mathematical equation, the total probability Ptot(θh) that the phase of Eh is θh can be written as (Giacovazzo, 1998, pp. 564–565)

Mathematical equation

If we treat the AI-derived phase estimates as triplets, we can always estimate Mathematical equation through equation (2) and we have

Mathematical equation

with statistical weight given by

Mathematical equation

Equations (3)–(5) generalize the tangent formula to account for a priori knowledge from AI-derived phase estimates. When AI estimates of both ϕ_k and ϕ_{h−k} are available, they jointly contribute to equation (5), enhancing the reliability of the corresponding structure invariant. If only one phase is available, it still provides a partial but meaningful enhancement to the triplet's reliability. This approach allows external information (AI estimates) to be incorporated in a natural and consistent manner within the formalism of DM. The a priori phase information does not override the estimates derived from triplet relationships, but rather merges with them according to their relative reliability. Mathematically, this is equivalent to the weighted vector sum of phase estimates in the complex plane.
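The weighted vector sum mentioned above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the SIR2024 implementation: the function name, the `prior` argument and the prior weight are hypothetical, and the actual weighting scheme of equations (3)–(5) is not reproduced here.

```python
import cmath
import math

def combine_phase_estimates(triplet_terms, prior=None):
    """Weighted vector sum of phase estimates in the complex plane.

    triplet_terms: list of (G, phi_k, phi_h_minus_k), phases in radians
    prior:         optional (weight, phase) for an a priori estimate (assumed form)
    Returns (alpha, theta): modulus (reliability) and phase of the resultant.
    """
    # Each triplet contributes a vector of length G pointing at phi_k + phi_{h-k}.
    resultant = sum(G * cmath.exp(1j * (pk + phk)) for G, pk, phk in triplet_terms)
    if prior is not None:
        weight, phase = prior
        # The prior is just one more vector; it does not override the triplets.
        resultant += weight * cmath.exp(1j * phase)
    return abs(resultant), cmath.phase(resultant)

# Two triplets that disagree: a strong one indicating 0 and a weaker one indicating pi.
triplets = [(1.0, 0.0, 0.0), (0.5, math.pi, 0.0)]
alpha_dm, theta_dm = combine_phase_estimates(triplets)

# Adding a confident a priori estimate of pi shifts the balance.
alpha_ai, theta_ai = combine_phase_estimates(triplets, prior=(2.0, math.pi))
```

In this toy case the triplets alone indicate a phase of 0 with low reliability, whereas including the heavily weighted prior yields π with higher reliability, which is the merging behaviour described in the text.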

3.4. Test structures and relevant variables

Test structures were obtained from the COD (Gražulis et al., 2009; Downs & Hall-Wallace, 2003) using tools integrated into the software package EXPO (Altomare et al., 2013). We selected 1505 crystal structures in the P2₁/c space group, all solved using single-crystal X-ray diffraction data. This choice aligns with the PhAI neural network (Larsen et al., 2024) used in our study to generate AI phases, which was specifically trained on structures in the P2₁/c space group. In the AI-PhaSeed method, the AI-generated phases are subsequently extended and refined. Additionally, in accordance with the capabilities of the PhAI neural network designed to perform phasing on datasets with unit-cell volumes below 1000 Å³ and constrained Miller indices (hkl), we restricted our selection to structures with unit-cell volumes between 1000 and 3500 Å³. This range ensures compatibility with the AI-PhaSeed method, which, according to Carrozzini et al. (2025), requires phase estimates for a subset (typically 10%) of the total reflections. The main features of the selected structures are summarized in Table 1, and they are also illustrated as box plots in Fig. S1 in the supporting information.

Table 1
Summary of key features of the test structures, presented as the variable names (first column), their description (second column) and the corresponding variability range (third column)

Variable name | Description | Variability range
Nasym | Number of non-H atoms in the asymmetric unit | 9.5–55
maxCellSize (Å) | Maximum linear dimension of the unit cell (maximum of a, b, c) | 10.7–43.1
minCellSize (Å) | Minimum linear dimension of the unit cell (minimum of a, b, c) | 3.68–14.8
Perc (%)† | Percentage of phase-seed reflections that meet the neural network's input size criteria, over the total number of symmetry-independent reflections | 5.03–53.9
Compl (%) | Crystallographic data completeness | 29.7–100
maxW | Atomic weight of the heaviest element | ∼12–238
Pseudo (%) | Pseudosymmetry percentage affecting atoms in the structure | 0–96
Nrefl | Number of measured symmetry-independent reflections | 1508–20169
RES (Å) | Experimental data resolution | 0.49–0.98
REFLECseed | Number of reflections phased by AI | 894–1042
RESseed (Å) | Resolution for reflections phased by AI | 0.59–1.37
Vol (Å³) | Volume of the unit cell | 1005–3495
†On average, Perc exceeds 10%, ensuring sufficient data for effective phase prediction.

4. Results

As a first check, we verified the performance of the standalone PhAI method (without phase seeding or EDM cycles) on the dataset used in this study. The results, shown in Fig. S2, indicate that, even for structures for which the phase values assigned by PhAI are reliable (MPEseed < 20°), the low-resolution electron-density map that can be calculated using them cannot be properly interpreted in terms of an atomic model. As a result, the percentage of correctly positioned atoms is less than 20% for most of the test structures. Thus, adopting the phase-seeding strategy to extend and refine the PhAI phases is essential for the structures considered in this study.

Given the size of the test dataset (1505 structures), the performance of the AI-PhaSeed method must be evaluated by a robust metric capable of automatically and unambiguously distinguishing solved structures from unsolved ones. To this end, we considered the space defined by three validation parameters (MPEfinal, CORRfinal and Rf), each calculated with respect to the published structure, to evaluate the outcome of the structure solution process. Representative points of each test structure are projected into this space [Fig. 3(a)] and clustered by a k-means clustering procedure with k = 2 (Hartigan & Wong, 1979). This results in two identified clusters: one representing the solved structures [blue points in Fig. 3(a)] and the other representing the unsolved structures [red points in Fig. 3(a)]. The two clusters are clearly separated, with the points corresponding to the solved structures accumulating around the optimal values of the validation parameters. As a result, the silhouette plot shown in Fig. 3(b) is dominated by the cluster of solved structures, most of which have a silhouette width larger than 0.8.

[Figure 3]
Figure 3
(a) A 3D scatter plot illustrating the k-means clustering results, with k = 2 (Hartigan–Wong algorithm), based on MPEfinal (x axis), Rf (y axis) and CORRfinal (z axis). Data points are colour-coded to distinguish between 'Solved structures' (blue) and 'Unsolved structures' (red), with the legend indicating the number of observations in each category. (b) A silhouette plot for k-means clustering with k = 2. Each block contains a number of vertical bars corresponding to the elements assigned to that cluster (blue for solved and red for unsolved structures). The height of each bar represents the silhouette width, indicating how well each element fits within its assigned cluster compared with the other one. Silhouette values closer to 1 correspond to better-defined clustering. The value of the mean silhouette width, shown above the plot, quantifies the overall clustering quality; values above 0.7 are generally considered indicative of strong and well separated clusters (Kaufman & Rousseeuw, 1990).
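The clustering and silhouette analysis can be illustrated with a small, dependency-free Python sketch. Several points are assumptions rather than the authors' procedure: plain Lloyd iterations are used instead of the Hartigan–Wong algorithm cited in the text, MPE is rescaled by 90° so that the three variables have comparable ranges (the text does not state whether any scaling was applied), and the five data points are invented.

```python
import math

def kmeans_two_clusters(points, init_centres, max_iter=100):
    """Lloyd's algorithm with k = 2 (a simpler stand-in for Hartigan-Wong)."""
    centres = [tuple(c) for c in init_centres]
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if math.dist(p, centres[0]) <= math.dist(p, centres[1]) else 1
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:               # keep the old centre if a cluster empties
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres

def silhouette_widths(points, labels):
    """Silhouette width s = (b - a) / max(a, b) for each point (two clusters)."""
    widths = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if labels[j] != labels[i]]
        if not own or not other:      # singleton or single cluster: width set to 0
            widths.append(0.0)
            continue
        a, b = sum(own) / len(own), sum(other) / len(other)
        widths.append((b - a) / max(a, b))
    return widths

# Invented points: (MPEfinal / 90, Rf, CORRfinal)
points = [(5 / 90, 0.10, 0.95), (8 / 90, 0.12, 0.92), (6 / 90, 0.09, 0.96),
          (85 / 90, 0.55, 0.10), (80 / 90, 0.50, 0.15)]
labels, centres = kmeans_two_clusters(points, [points[0], points[-1]])
widths = silhouette_widths(points, labels)
mean_width = sum(widths) / len(widths)
```

With well separated groups such as these, the mean silhouette width is close to 1; the 0.7 guideline quoted in the caption is the point below which the separation would be regarded as less clear-cut.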

The unambiguous definition of the solved structures allows us to calculate the efficiency of the AI-PhaSeed method as the ratio of solved to total structures, which comes to 86.7%. This value should be compared with that obtained using a random phase seed (random phases replacing AI phases), which is 84.3%, and that from a true phase seed (phases calculated from published structures replacing AI phases), which is 97.1%. As expected, the AI-PhaSeed method performs slightly better than the random-phase case but is quite far from what can be obtained using known phases. This gap is mainly due to a subset of structures for which AI fails to assign reliable phase values to seed reflections. However, if we focus on structures whose AI-generated seed phases exhibit a mean phase error (MPEseed) below 25° and a reflection correlation (CORRseed) greater than 0.8, i.e. the 300 structures for which the AI predictions can be considered reliable, the efficiency of the AI-PhaSeed method reaches 98%, closely aligning with the 97% achieved by the true-phase method on the same subset of structures. This is better shown in Fig. 4, where the k-means clusters of solved and unsolved structures are projected in the space defined by the two variables MPEseed and CORRseed (see Section 3.1): the green box contains the 300 structures with MPEseed below 25° and CORRseed greater than 0.8, and only four of them (red dots in the green box) belong to the unsolved k-means cluster. This outcome suggests that the AI-PhaSeed method performs as proposed by Carrozzini et al. (2025), assuming the AI-predicted seed phases are reliable.

[Figure 4]
Figure 4
Scatter plot of MPEseed versus CORRseed, showing clusters of solved (blue dots) and unsolved (red dots) structures. The green region corresponds to the area defined by MPEseed ≤ 25° and CORRseed ≥ 0.8 where the AI-generated phases can be considered reliable, i.e. sufficiently accurate to approximate the true phases. The yellow region corresponds to the area defined by optimizing the values MPEseed ≤ Q1 and CORRseed ≥ Q3 (Class 1). This region predominantly contains solved structures, which tend to accumulate at CORRseed values close to 1.

However, this high efficiency is achieved for only approximately 20% of the total structures, specifically those that closely match the features of the neural network's training set. To investigate the nature of the decline in AI efficiency and identify the structures for which this occurs, we developed a classification model to explain the AI-PhaSeed results.

To this end, we focused on analysing the intermediate result of the method, the one obtained after application of AI and related to the seed phases, by considering the two variables MPEseed and CORRseed.

Well separated distributions, with a minimal overlap, are obtained when comparing the MPEseed and CORRseed values for structures belonging to the solved k-means cluster with those belonging to the unsolved k-means cluster. This result is illustrated in Fig. S3. The observed separation, while confirming the discriminative power of MPEseed and CORRseed in distinguishing between solved and unsolved structures and thus confirming the impact of AI-generated phases on the performance of the AI-PhaSeed method, also allows the identification of a threshold value for both variables, i.e. the first quartile (Q1) for MPEseed and the third quartile (Q3) for CORRseed, both of the unsolved cluster. This ensures the identification of the maximum number of solved structures, while simultaneously minimizing the number of unsolved ones. These thresholds identify the yellow box in Fig. 4, which mostly contains solved structures accumulating at CORRseed values close to 1.

We employed a random forest (RF) classification model (Breiman, 2001) as a supervised learning algorithm, labelling structures within the yellow region in Fig. 4 as Class 1 and all others as Class 0. The model was thus trained to classify as successful (Class 1) those structures with MPEseed ≤ Q1 and CORRseed ≥ Q3, i.e. those falling within the yellow box in Fig. 4, and as unsuccessful (Class 0) those lying outside this region. The main goal in building the classification model is to study how the quality of the AI phases influences the outcome of the structure solution. Consequently, a small number of structures (approximately 3% of the overall dataset) that remain unsolved even with the true phase seed are excluded from this stage of the analysis to avoid biases unrelated to phase quality.
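The labelling rule can be written down compactly. The sketch below is illustrative: the function names and numbers are invented, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption, since the text does not say which definition of the quartiles was used.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q3); the 'inclusive' convention is an assumed choice."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

def label_classes(structures, unsolved):
    """Assign Class 1 where MPEseed <= Q1 and CORRseed >= Q3, else Class 0.

    structures, unsolved: lists of (mpe_seed_deg, corr_seed); both thresholds
    are taken from the unsolved k-means cluster, as described in the text.
    """
    mpe_q1, _ = quartiles([m for m, _ in unsolved])
    _, corr_q3 = quartiles([c for _, c in unsolved])
    labels = [1 if (m <= mpe_q1 and c >= corr_q3) else 0 for m, c in structures]
    return labels, mpe_q1, corr_q3

# Invented values for the unsolved cluster: (MPEseed in degrees, CORRseed)
unsolved = [(40, 0.1), (50, 0.2), (60, 0.3), (70, 0.4), (80, 0.5)]
# Invented structures to be labelled
structures = [(20, 0.90), (45, 0.45), (60, 0.95), (30, 0.30)]
labels, mpe_q1, corr_q3 = label_classes(structures, unsolved)
```

A structure is placed in Class 0 as soon as either condition fails, which corresponds to lying outside the yellow box.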

The RF results are shown in Fig. 5. The optimal probability threshold on the receiver operating characteristic (ROC) curve (Fawcett, 2006) [red dot in Fig. 5(a)] was determined to classify correctly a significant portion of Class 1 structures by selecting the highest sensitivity value corresponding to a specificity of at least 80%. The area under the curve (AUC) value is 0.87, indicating an excellent overall model performance. The resulting confusion matrix for the tenfold cross-validated RF model is reported in Table 2.
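The threshold-selection rule (highest sensitivity subject to a specificity of at least 80%) can be sketched as follows. The function name, the scores and the labels are invented for illustration, and ties are broken in favour of the lowest threshold, which is a choice made here rather than something stated in the text.

```python
def pick_threshold(scores, labels, min_specificity=0.80):
    """Scan candidate cut-offs; return (threshold, sensitivity, specificity).

    A structure is predicted Class 1 when its score is >= threshold.
    Both classes must be present in `labels`.
    """
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        # Keep the first (lowest) threshold reaching the highest sensitivity.
        if specificity >= min_specificity and (best is None or sensitivity > best[1]):
            best = (t, sensitivity, specificity)
    return best

# Invented predicted probabilities of Class 1 and true classes
scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
threshold, sensitivity, specificity = pick_threshold(scores, labels)
```

Lowering the cut-off further would recover the remaining Class 1 structure but only at the cost of dropping below the required specificity.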

Table 2
Confusion matrix for the tenfold cross-validation RF model, showing predicted and actual class distributions

Class labels were assigned using an optimal probability threshold, as determined by identifying the highest sensitivity value corresponding to a specificity of at least 80% (for correct classification of a significant portion of Class 1 structures) based on the global ROC curve [red dot in Fig. 5(a)]. Class 1 corresponds to MPEseed ≤ Q1 and CORRseed ≥ Q3, and Class 0 to all other structures (MPEseed > Q1 or CORRseed < Q3).

 | Actual: 0 | Actual: 1
Predicted: 0 | 657 | 133
Predicted: 1 | 183 | 532
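
The counts in Table 2 can be turned into the usual summary metrics with a few lines of Python. The sketch assumes that columns are actual classes, rows are predicted classes, and Class 1 is taken as the positive class; the variable names are chosen here for illustration.

```python
# Counts read from Table 2 (rows: predicted class, columns: actual class)
tn, fn = 657, 133   # predicted 0: actual 0, actual 1
fp, tp = 183, 532   # predicted 1: actual 0, actual 1

total = tn + fn + fp + tp
sensitivity = tp / (tp + fn)     # fraction of actual Class 1 predicted as Class 1
specificity = tn / (tn + fp)     # fraction of actual Class 0 predicted as Class 0
accuracy = (tp + tn) / total     # overall fraction of correct predictions

print(f"N = {total}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
```

These values are computed from the cross-validated predictions in the table, whereas the cut-off itself was chosen on the global ROC curve.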
[Figure 5]
Figure 5
(a) The ROC curve of the RF model, illustrating sensitivity and specificity values across varying thresholds. The red dot marks the optimal cut-off point, identified as the highest sensitivity value corresponding to a specificity of at least 80%, for the correct classification of a significant portion of Class 1 structures. The AUC is reported as a measure of the overall performance of the model. (b) Feature importance, ordered by the descending mean decrease Gini (Kaufman & Rousseeuw, 1990) resulting from the RF model.

The RF model can be effectively used to optimize the decision-making process and resource allocation in crystal structure determination by applying the AI-PhaSeed method only to structures classified as Class 1. For these structures, the efficiency of AI-PhaSeed increases to 90%, compared with 85% with a random phase seed. This indicates that when AI provides reliable seed phases the AI-PhaSeed method performs close to the optimal scenario where true phases are used as seeds (96% efficiency).

As noted above, a key advantage of developing a classification model to assess the performance of AI-PhaSeed is its ability to evaluate the importance of different structural variables in determining the success of the phasing process. The feature importance, ordered by the descending mean decrease Gini coefficient (Kaufman & Rousseeuw, 1990), is shown in Fig. 5(b) and reveals that maxW, i.e. the largest atomic weight among the elements in the unit cell, is a key variable. This reflects a well established principle in crystallography: the presence of heavy atoms can facilitate the phasing process. What is particularly noteworthy is that the importance of this feature was identified by the RF classification model, even though no explicit information about atomic weight was provided during the neural network's training phase. This represents a significant insight into the application of AI in structure determination, as it highlights the deep learning model's capacity to extract physically meaningful information from data autonomously.

Subsequent relevant variables, ordered by the mean decrease Gini (Kaufman & Rousseeuw, 1990) in Fig. 5(b), are Nasym, Vol and maxCellSize, related to the size of the unit cell in direct space, and Perc and REFLECseed, related to the size of the unit cell in reciprocal space. These findings indicate that another key parameter influencing the efficiency of the AI-based phase assignment is the complexity of the crystal structure (e.g. in terms of number of non-H atoms in the asymmetric unit). This is closely linked to the limited size of the phase seed: for a larger crystal structure more effort is required to extend the phase information to all measured reflections. To verify the effect of the variables highlighted by the feature importance analysis, we plot in Fig. 6 the distribution of CORRseed across different intervals of the three most important features according to the RF mean decrease Gini [Fig. 5(b)]: maxW (maxW ≤ 50, 50 < maxW ≤ 100 and maxW > 100), Nasym (Nasym ≤ 25, 25 < Nasym ≤ 40 and Nasym > 40) and Vol (Vol ≤ 1500, 1500 < Vol ≤ 2000 and Vol > 2000). As expected, a clear increase in CORRseed values is observed in intervals closer to the optimal values of the variable, i.e. for higher maxW values and lower Nasym and Vol values.

[Figure 6]
Figure 6
Distribution of CORRseed across different intervals of the three most important features according to the RF mean decrease Gini [Fig. 5(b)]: maximum atomic weight among the structure elements (maxW ≤ 50, 50 < maxW ≤ 100 and maxW > 100), number of non-H atoms in the asymmetric unit (Nasym ≤ 25, 25 < Nasym ≤ 40 and Nasym > 40) and unit-cell volume (Vol ≤ 1500 Å³, 1500 < Vol ≤ 2000 Å³ and Vol > 2000 Å³). Horizontal black bars indicate median values.

As a result of the classification method, we can now estimate the performance on Class 1 structures. The AI-PhaSeed method shows improved efficiency, increasing from 86.7% on the entire dataset to 91.1% when applied specifically to the Class 1 subset.

4.1. AI phasing at limited data resolution

Given the challenging task of solving structures at resolutions far from the atomic one, we explored the performance of AI-PhaSeed as the data resolution was progressively decreased, being aware that the size of the seed relative to the total number of observed reflections plays a significant role. The analysis revealed that both MPEseed and CORRseed values are distributed differently across the datasets obtained after applying the resolution cut-offs (i.e. 1.0, 1.2, 1.4 and 1.6 Å). Additionally, the criteria for distinguishing solved from unsolved structures and for separating Class 0 and Class 1 structures after application of AI become less distinct compared with the corresponding results without a resolution cut-off. However, the mean silhouette width value remains above 0.7 (Kaufman & Rousseeuw, 1990), still indicating an appreciable degree of separation. Figs. S4, S5 and S6 present the statistical analysis of the results obtained using a 1.6 Å resolution cut-off, following the same procedure as the tests on native uncut structures. At this resolution, 23 structures exhibited statistical parameters that were unsuitable for applying DM (e.g. an insufficient number of structure-invariant relationships) and were therefore excluded from the dataset. The results of the RF model (Fig. S7 and Table S1) and the validation of the feature importance analysis (Fig. S8) are also provided.
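Applying a resolution cut-off amounts to discarding reflections whose d spacing is smaller than the chosen limit. The sketch below uses the standard d-spacing expression for a monoclinic cell with unique axis b, which is the setting relevant to P2₁/c; the function names, the cell and the reflection list are invented for illustration and do not reproduce the tools actually used by the authors.

```python
import math

def d_spacing_monoclinic(h, k, l, a, b, c, beta_deg):
    """d spacing (Å) for a monoclinic cell, unique axis b; (0, 0, 0) is not allowed."""
    beta = math.radians(beta_deg)
    sin2 = math.sin(beta) ** 2
    inv_d2 = (h * h / (a * a)
              + k * k * sin2 / (b * b)
              + l * l / (c * c)
              - 2.0 * h * l * math.cos(beta) / (a * c)) / sin2
    return 1.0 / math.sqrt(inv_d2)

def apply_resolution_cutoff(reflections, cell, d_min):
    """Keep (hkl, amplitude) pairs with d >= d_min, i.e. at or below the cut-off."""
    return [(hkl, amp) for hkl, amp in reflections
            if d_spacing_monoclinic(*hkl, *cell) >= d_min]

# Invented cell (a, b, c in Å, beta in degrees) and reflections
cell = (10.0, 10.0, 10.0, 90.0)
reflections = [((1, 0, 0), 9.0), ((4, 4, 4), 3.0), ((5, 5, 5), 2.0), ((10, 0, 0), 1.0)]
kept = apply_resolution_cutoff(reflections, cell, d_min=1.2)
```

Because the seed reflections are defined by index limits rather than by resolution, truncating the data in this way raises the fraction of reflections covered by the seed, which is consistent with the increased importance of Perc reported below.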

The performance of the RF model is characterized by a still acceptable AUC of 0.75. The variable Perc shows increased importance compared with its role in the model trained on uncut data, emerging as the second most important variable after maxW, as shown in Fig. S7(b) [compare with Fig. 5(b)].
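For readers less familiar with the AUC, the following sketch computes it through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are illustrative assumptions and are unrelated to the RF model output.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: classifier scores (higher means more likely positive).
    labels: 1 for positive cases (e.g. Class 1), 0 for negative cases.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so 0.75 means that three out of four positive/negative pairs are ranked correctly.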

Similar evaluations have been performed for all datasets corresponding to the various resolution cut-offs. The trend in AI-PhaSeed efficiency (E) plotted against the resolution cut-off is shown in Fig. 7 and compared with that obtained by feeding the seed with random or true phase values. In Fig. 7, error bars represent the propagated uncertainty on E, computed from D and N, where E is the efficiency (%), D is the number of correctly classified structures and N is the total number of structures. A common decreasing trend is observed across the curves, although the rates of decline differ. In particular, the efficiency of the true phase seed remains unaffected by the resolution cut-off up to 1.2 Å, while it drops substantially at 1.4 Å. Conversely, the efficiency of the random phase seed decreases steadily for data resolution up to 1.2 Å and less steeply on going to 1.4 Å. The efficiency of the AI-PhaSeed method is always intermediate between those of the random and true phase seeds. Its rate of decrease follows that of the random-phase-seed curve for data resolution between 1.0 and 1.2 Å, while it is less steep for data resolution <1.0 Å or between 1.2 and 1.4 Å.
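One common way of propagating counting uncertainties into an efficiency is sketched below. It assumes that D and N are independent counts with Poisson uncertainties √D and √N, so that E = 100·D/N carries the uncertainty E·√(1/D + 1/N). This is an assumption made for illustration and may differ from the expression used for the error bars in Fig. 7; the function name is hypothetical.

```python
import math


def efficiency_with_uncertainty(d, n):
    """Return (E, sigma_E) in percent for d successes out of n structures.

    Assumption (not taken from the paper): d and n are treated as
    independent Poisson counts, giving sigma_E = E * sqrt(1/d + 1/n).
    """
    if n <= 0 or d <= 0 or d > n:
        raise ValueError("require 0 < d <= n")
    efficiency = 100.0 * d / n
    sigma = efficiency * math.sqrt(1.0 / d + 1.0 / n)
    return efficiency, sigma
```

A binomial error model would be an equally defensible alternative; the Poisson form is used here only because it involves E, D and N explicitly.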

[Figure 7]
Figure 7
Efficiency of the AI-PhaSeed method as a function of data resolution cut-offs, compared with the efficiency obtained using random phase and true phase seeds. Five datasets were analysed: no resolution cut (no_cut), and cut-offs at 1.0, 1.2, 1.4 and 1.6 Å resolution. Error bars represent the propagated uncertainty on E, computed from D and N, where E is the efficiency (%), D is the number of correctly classified structures and N is the total number of structures.

An RF classification model was also built for each resolution cut-off, following the same procedure described above. As shown in Fig. 8, an increase in the efficiency of the AI-PhaSeed method for Class 1 structures with respect to the full set of structures (Fig. 7) is observed across all datasets, resulting in a smaller gap with the true phase-seeding efficiency. A satisfactory efficiency, close to 60%, for Class 1 structures is reached even at 1.6 Å resolution. Fig. 8 also shows the AI-PhaSeed efficiency calculated for structures with AI-generated phases very close to the true phases (MPEseed < 25 and CORRseed > 0.8), located in their corresponding green boxes as shown in Fig. S6 for the case at 1.6 Å; the efficiency is very close to 100%, regardless of the resolution cut-off, and almost coincides with the true phase-seeding values.
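The subset efficiencies discussed above can be sketched as a simple filter followed by a ratio. The thresholds MPEseed < 25 and CORRseed > 0.8 are taken from the text; the record type, the field names and the assumption that MPEseed is expressed in degrees are illustrative.

```python
from dataclasses import dataclass


@dataclass
class SeedResult:
    mpe_seed: float   # mean phase error of the AI seed (degrees assumed)
    corr_seed: float  # CORRseed value of the AI seed
    solved: bool      # whether the structure was eventually solved


def green_box(results, mpe_max=25.0, corr_min=0.8):
    """Select structures whose AI seed is very close to the true phases."""
    return [r for r in results
            if r.mpe_seed < mpe_max and r.corr_seed > corr_min]


def efficiency(results):
    """Percentage of solved structures in a non-empty list of results."""
    if not results:
        raise ValueError("empty subset")
    return 100.0 * sum(r.solved for r in results) / len(results)
```

Calling `efficiency(green_box(results))` on a list of `SeedResult` records gives the efficiency restricted to the best seeds, which can then be compared with `efficiency(results)` for the whole set.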

[Figure 8]
Figure 8
Efficiency of the AI-PhaSeed method applied on subsets of test structures as a function of the data resolution cut-offs, compared with the efficiency obtained using true phases (true phase seed). Full and dashed lines represent the efficiency calculated for structures belonging to their corresponding resolution-cut-off-based yellow (Class 1) and green boxes, as reported in Fig. S6. Error bars are also shown.

4.2. AI-PhaSeed combined with DM

The newly introduced AI-PhaSeed needs to be compared with the gold standard for ab initio crystal structure solution of small molecules, i.e. DM. With an efficiency of 86.7% on the entire dataset, AI-PhaSeed performs significantly below DM, which achieves 99.1% efficiency. This difference is expected: the DM approach has been under continuous development since the 1970s and is optimized for a wide variety of structures, whereas the potential of AI phasing is only beginning to be explored. With the aim of developing a more efficient phasing procedure than either method alone, we have integrated AI-PhaSeed with DM into a combined approach, denoted DM+AI-PhaSeed. The theoretical basis for using AI-derived phases to initiate DM phasing is outlined in Section 3.3. However, we have demonstrated that there are structures for which AI is not able to supply reliable phases, and these cases can be predicted with good accuracy by an RF model. Thus, we envisage that the combination of AI-PhaSeed with DM could be conditioned by the results of the RF model prediction, by applying AI-PhaSeed only if the structure is classified as Class 1 by the RF model. In addition, by evaluating the efficiency of DM+AI-PhaSeed and DM separately for Class 0 and Class 1 structures (Fig. 9), we found that for structures with a data resolution (RES) of about 1.4 Å or worse (i.e. higher values such as 1.6 Å) DM+AI-PhaSeed outperforms DM for both classes. Thus, we can define an optimal solution strategy denoted by DM&AI-PhaSeed, according to which DM+AI-PhaSeed is applied to solve structures with lower data resolution (i.e. RES > 1.4 Å), while AI-PhaSeed alone is applied in cases of structures with higher resolution (i.e. RES ≤ 1.4 Å). From Fig. 10 it can be seen that this protocol has the highest efficiency on the entire structure dataset.
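The decision logic described in this section can be sketched as a small selection function. The 1.4 Å threshold and the conditioning on the RF class come from the text; the function name, the string labels and, in particular, the fallback to DM for Class 0 structures at RES ≤ 1.4 Å are assumptions made for illustration.

```python
def choose_protocol(res, rf_class, res_threshold=1.4):
    """Pick a phasing protocol from data resolution and RF prediction.

    res: data resolution in angstroms (larger value = lower resolution).
    rf_class: 1 if the RF model predicts a reliable AI seed, else 0.
    """
    if rf_class not in (0, 1):
        raise ValueError("rf_class must be 0 or 1")
    if res > res_threshold:
        # Lower-resolution data: combined approach for both classes.
        return "DM+AI-PhaSeed"
    if rf_class == 1:
        # Higher-resolution data with a seed predicted to be reliable.
        return "AI-PhaSeed"
    # Assumption: fall back to DM when the seed is predicted unreliable.
    return "DM"
```

For example, `choose_protocol(1.6, 0)` selects the combined approach, while `choose_protocol(1.0, 1)` selects AI-PhaSeed alone.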

[Figure 9]
Figure 9
Efficiency of the DM and DM+AI-PhaSeed methods when applied to Class 0 and Class 1 test structures, plotted as a function of applied resolution cut-off level. Five different datasets were compared: no resolution cut (no_cut) and cut-offs at 1.0, 1.2, 1.4 and 1.6 Å resolution. Class 1 corresponds to MPEseed ≤ Q1 and CORRseed ≥ Q3, and Class 0 to MPEseed > Q1 and CORRseed < Q3. Error bars are also shown.

[Figure 10]
Figure 10
Efficiency of the AI-PhaSeed method combined with DM, plotted as a function of resolution cut-off level, compared with that obtained using DM alone and with the protocol DM&AI-PhaSeed.

5. Discussion and perspectives

We have demonstrated that the method proposed by Carrozzini et al. (2025), when suitably implemented in the AI-PhaSeed procedure, can successfully phase crystal structures that cannot be solved by the stand-alone PhAI neural network currently available. When the AI-predicted phase seed is reliable, its performance approaches that achieved using true phase seeds.

The results were assessed in two steps. First, the reliability of the AI-generated phase seed was evaluated, monitored by the variables MPEseed and CORRseed. Then, the phases obtained at the end of the AI-based solution process, where the AI-generated phases serve as a seed for traditional crystallographic methods, were assessed, using MPEfinal and CORRfinal as indicators. The performance of the first step was optimized using an RF classification model to predict the quality of the AI-generated phase seed. This model enabled us to analyse the dependence of AI performance on key variables related to the test structures, such as the percentage of the phase seed and the presence of heavy atoms.
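As an illustration of the kind of indicator used in the first step, the sketch below computes an unweighted mean phase error between predicted and reference phases. It assumes that phases are given in degrees and that both sets refer to the same origin; the indicators used in this work may include weighting or origin handling that is not shown here.

```python
def mean_phase_error(predicted, reference):
    """Unweighted mean absolute phase difference in degrees.

    Each difference is wrapped into the range [0, 180], so that, for
    example, 350 and 10 degrees differ by 20 degrees.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for p, t in zip(predicted, reference):
        diff = abs(p - t) % 360.0
        total += min(diff, 360.0 - diff)
    return total / len(predicted)
```

For a centrosymmetric structure, where phases are restricted to 0 or 180°, this value is simply 180° times the fraction of wrongly assigned signs.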

For structures classified as optimal, the efficiency of AI-PhaSeed does not decline even at resolutions >1 Å, indicating that AI-PhaSeed is capable of phasing at resolution lower than the atomic level.

Besides the application of the method proposed by Carrozzini et al. (2025), we have made a step forward in integrating AI with ab initio phasing techniques, specifically DM, using the values of the phase seed generated by AI to support and reinforce the phase assignment performed by DM. To this end, we developed a DM+AI-PhaSeed integrated approach based on a modified tangent formula that combines AI-generated phases with those assigned by DM starting from random phases for reflections included in the phase seed.

The RF classification model also allowed us to optimize the application of the DM+AI-PhaSeed integrated approach. This was achieved by introducing a decision-making step based on structural features that are readily accessible after synthesis and X-ray measurement. With this fully integrated AI-based approach, we are able to phase real structures more efficiently than DM, even when the data resolution is >1 Å. This highlights the strong potential of the AI-PhaSeed method with low-resolution data.

Looking ahead, we expect that a substantial improvement in the AI-PhaSeed method and its integration with DM will result from the use of a neural network specifically trained on a larger number of reflections. In the present study, we employed the PhAI neural network developed by Larsen et al. (2024), trained on reflection grids of fixed size (21 × 11 × 11). Our analysis identified the percentage of the seed, defined as the number of seed reflections relative to the total number of experimental reflections, as a critical factor for AI-PhaSeed performance.
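The seed percentage can be sketched as a count of the measured reflections that fall inside the fixed grid. The mapping of the 21 × 11 × 11 grid onto the index ranges −10 ≤ h ≤ 10, 0 ≤ k ≤ 10 and 0 ≤ l ≤ 10 is an assumption made here for illustration, as are the function names.

```python
def in_seed_grid(hkl, h_max=10, k_max=10, l_max=10):
    """True if a reflection lies inside the assumed 21 x 11 x 11 grid."""
    h, k, l = hkl
    return -h_max <= h <= h_max and 0 <= k <= k_max and 0 <= l <= l_max


def seed_percentage(reflections):
    """Percentage of experimental reflections covered by the seed grid.

    reflections: non-empty list of (h, k, l) integer tuples.
    """
    if not reflections:
        raise ValueError("no reflections supplied")
    n_seed = sum(1 for hkl in reflections if in_seed_grid(hkl))
    return 100.0 * n_seed / len(reflections)
```

Because the grid is fixed, the percentage decreases as the number of measured reflections grows, for instance with larger unit cells or higher-resolution data.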

Two key challenges must be addressed in the future to make this method suitable for routine structural determination of real structures: (i) extending AI-PhaSeed to structures with space groups other than P2₁/c, including non-centrosymmetric structures, and (ii) developing alternative criteria for selecting the phase seed. Carrozzini et al. (2025) have laid the groundwork for addressing these challenges. In particular, the application to non-centrosymmetric structures could be addressed by applying a phase binning strategy, where continuous phase values in the range [0, 2π] are sampled by using two, three, four or six phase values equally distributed in the same range. Implementing these enhancements will require the design of a new neural network, offering opportunities for changes and improvements to its architecture.
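The phase binning idea can be sketched as rounding each continuous phase to the nearest of n equally spaced values. The choice of bin centres at multiples of 2π/n starting from zero, and the function name, are assumptions made for illustration.

```python
import math


def bin_phase(phi, n_bins):
    """Map a phase in radians to the nearest of n_bins equally spaced values.

    Allowed values are k * 2*pi / n_bins for k = 0 .. n_bins - 1; phases
    close to 2*pi wrap around to 0.
    """
    if n_bins not in (2, 3, 4, 6):
        raise ValueError("n_bins must be 2, 3, 4 or 6")
    step = 2.0 * math.pi / n_bins
    index = round((phi % (2.0 * math.pi)) / step) % n_bins
    return index * step
```

With two bins the scheme reduces to the centrosymmetric case, in which every phase is either 0 or π.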

6. Conclusions

The AI-based approach proposed by Carrozzini et al. (2025) has been implemented by integrating AI with traditional ab initio phasing techniques. Its first application, the AI-PhaSeed method, is presented here using the neural network originally trained by Larsen et al. (2024) on real structures of limited size and P2₁/c symmetry.

We have demonstrated that the performance of the AI-PhaSeed method depends on two key factors: (i) the extent of the phase seed, defined as the number of seed reflections relative to the total number of observed reflections, which should be high, and (ii) the quality of the phase values assigned to those seed reflections, which should be good. When applied to structures meeting both conditions, AI-PhaSeed can achieve optimal performance, in some cases surpassing that of classical DM. Notably, this advantage is observed even at data resolution lower than 1 Å, indicating that the AI approach is able to phase at resolution lower than the atomic level.

We have also developed machine learning tools, such as an RF classification model, to introduce advanced decision-making strategies in the phasing process. Guided by this model, we have developed a first-of-its-kind integration of AI-based tools with classical DM approaches, thereby enhancing the overall efficiency and reliability of crystal structure determination. This study highlights the strong potential of developing faster and more robust AI-based ab initio phasing methods, particularly for challenging cases involving low-resolution data, incomplete datasets or entirely novel protein folds.

Our results also suggest that the performance of AI-PhaSeed could be further improved by training a dedicated neural network optimized for medium- and large-sized structures, incorporating the key features emphasized by the AI-PhaSeed approach. This development is planned for future investigations.

Footnotes

These authors contributed equally to this work.

Acknowledgements

The authors thank Dr Claudia Favia and Dr Mauro de Feudis for their help with the data analysis. Open access publishing facilitated by Consiglio Nazionale delle Ricerche, as part of the Wiley–CRUI-CARE agreement.

Data availability

The data supporting the results reported in this article, taken from the Crystallography Open Database, are available upon request from the authors.

Funding information

Financial support by ICSC – Centro Nazionale di Ricerca in High Performance Computing, Big Data and Quantum Computing, funded by the European Union – NextGenerationEU – PNRR, Missione 4 Componente 2 Investimento 1.4 (grant No. CN00000013) to F. Fedele, A. Moliterni and C. Cuocci, MUR PRIN (project 20223B4JWC) (Valorization of carbon oxides by sequential catalysis: combining the reverse water gas shift reaction with catalytic carbonylation for the synthesis of high value added compounds – COXSECAT) to A. Altomare, and MENDELEEV PRIN (project 2022KMS84P) (Green revolution by merging metal–organic frameworks with deep eutectic solvents for the development of sustainable technologies and artificial nitrogen fixation) to R. Caliandro is acknowledged.

References

Altomare, A., Cuocci, C., Giacovazzo, C., Moliterni, A., Rizzi, R., Corriero, N. & Falcicchio, A. (2013). J. Appl. Cryst. 46, 1231–1235.
Breiman, L. (2001). Mach. Learn. 45, 5–32.
Burla, M. C., Caliandro, R., Carrozzini, B., Cascarano, G. L., Cuocci, C., Giacovazzo, C., Mallamo, M., Mazzone, A. & Polidori, G. (2015). J. Appl. Cryst. 48, 306–309.
Burla, M. C., Caliandro, R., Giacovazzo, C. & Polidori, G. (2010). Acta Cryst. A66, 347–361.
Carrozzini, B., De Caro, L., Giannini, C., Altomare, A. & Caliandro, R. (2025). Acta Cryst. A81, 188–201.
Downs, R. T. & Hall-Wallace, M. (2003). Am. Mineral. 88, 247–250.
Fawcett, T. (2006). Pattern Recognit. Lett. 27, 861–874.
Giacovazzo, C. (1998). Direct phasing in crystallography: fundamentals and applications. Oxford University Press.
Giacovazzo, C. (2013). Phasing in crystallography: a modern perspective. Oxford University Press.
Gražulis, S., Chateigner, D., Downs, R. T., Yokochi, A. F. T., Quirós, M., Lutterotti, L., Manakova, E., Butkus, J., Moeck, P. & Le Bail, A. (2009). J. Appl. Cryst. 42, 726–729.
Hartigan, J. A. & Wong, M. A. (1979). J. R. Stat. Soc. Ser. C Appl. Stat. 28, 100–108.
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P. & Hassabis, D. (2021). Nature 596, 583–589.
Karle, J. & Hauptman, H. (1956). Acta Cryst. 9, 635–651.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding groups in data: an introduction to cluster analysis. Chichester: Wiley.
Larsen, A. S., Rekis, T. & Madsen, A. Ø. (2024). Science 385, 522–528.
Madsen, A. Ø. (2025). Acta Cryst. A81, 251–253.
Oszlányi, G. & Sütő, A. (2011). Acta Cryst. A67, 284–291.
Palatinus, L. & Chapuis, G. (2007). J. Appl. Cryst. 40, 786–790.
Rius, J. (2011). Acta Cryst. A67, 63–67.
Rius, J. (2014). IUCrJ 1, 291–304.
Sheldrick, G. M. (1990). Acta Cryst. A46, 467–473.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.
