How far are we from automatic crystal structure solution via molecular-replacement techniques?

An automatic pipeline based on molecular-replacement phases is described for the automatic crystal structure solution of protein and DNA/RNA molecules.

Although the success of molecular-replacement techniques requires the solution of a six-dimensional problem, this is often subdivided into two three-dimensional problems. REMO09 is one of the programs which have adopted this approach. It has been revisited in the light of a new probabilistic approach which is able to directly derive conditional distribution functions without passing through a previous calculation of the joint probability distributions. The conditional distributions take into account various types of prior information: in the rotation step the prior information may concern a non-oriented model molecule alone or together with one or more located model molecules. The formulae thus obtained are used to derive figures of merit for recognizing the correct orientation in the rotation step and the correct location in the translation step. The phases obtained by this new version of REMO09 are used as a starting point for a pipeline which in its first step extends and refines the molecular-replacement phases, and in its second step creates the final electron-density map which is automatically interpreted by CAB, an automatic model-building program for proteins and DNA/RNA structures.

Symbols and abbreviations
EDM: electron-density modification.
$C_s = (R_s, T_s)$, with $s = 1, \ldots, m$: the symmetry operators of the target structure. $R_s$ is the rotational part, $T_s$ is the translational part and $m$ is the number of symmetry operators.
$t$, $t_p$: the numbers of atoms in the asymmetric units of the target and model structure, respectively.
$N = mt$, $N_p = mt_p$: the numbers of atoms in the unit cells of the target structure and model structure, respectively. It is supposed, for the sake of simplicity, that all of the atoms are in general positions. Usually $N_p \le N$, but it may also be the case that $N_p > N$.
$f_j$: the atomic scattering factor of the $j$th atom (thermal factor included).
$F_p = \sum_{s=1}^{m} \sum_{j=1}^{t_p} f_j \exp[2\pi i \mathbf{h}(R_s \mathbf{r}_{pj} + T_s)] = |F_p| \exp(i\varphi_p)$: structure factor of the model structure. $\mathbf{r}_{pj}$ are the atomic positions of the model structure when it has been well oriented and located.
$F = \sum_{s=1}^{m} \sum_{j=1}^{t} f_j \exp[2\pi i \mathbf{h}(R_s \mathbf{r}_j + T_s)] = |F| \exp(i\varphi)$: structure factor of the target structure. $\mathbf{r}_j$ are the true atomic positions. It is supposed that the target and model molecules are isomorphous, so that $\mathbf{r}_j = \mathbf{r}_{pj} + \Delta\mathbf{r}_j$. $\Delta\mathbf{r}_j$ is the misfit between the atomic position $\mathbf{r}_j$ in the target and the corresponding $\mathbf{r}_{pj}$ in the model structure.
$E = A + iB = R \exp(i\varphi)$, $E_p = A_p + iB_p = R_p \exp(i\varphi_p)$: the normalized structure factors corresponding to $F$ and $F_p$, respectively.
$\Sigma_N = \sum_{j=1}^{N} f_j^2$, $\Sigma_{N_p} = \sum_{j=1}^{N_p} f_j^2$: the scattering power at a given $\sin\theta/\lambda$ for the target and model structure, respectively.
ISSN 2059-7983

Introduction
Molecular-replacement (MR) techniques (Rossmann & Blow, 1962; Rossmann, 1972, 1990) aim at phasing an unknown target structure using a known search molecule. The problem to solve is of a six-dimensional nature because it implies the correct orientation and location of the search molecule. Some MR programs face this in six-dimensional space [for example EPMR (Kissinger et al., 1999), SOMoRe (Jamrog et al., 2003) and Queen Of Spades (Glykos & Kokkinidis, 2000); see also Fujinaga & Read (1987)], even if an exhaustive six-dimensional search is generally avoided. Such programs are, in general, very time-consuming. More frequent is the practice of splitting the MR process into two three-dimensional steps: a rotation and a translation step. The most popular related programs are X-PLOR/CNS (Brünger, 1992), AMoRe (Navaza, 1994), BEAST (Read, 1999), MOLREP (Vagin & Teplyakov, 2010) and Phaser (McCoy et al., 2007). In BEAST and Phaser, maximum-likelihood-based conditional distributions are applied (see Read & McCoy, 2016; McCoy et al., 2018). Comprehensive reviews of the various techniques (updated up to 2007) have been collected in the January 2008 issue of Acta Crystallographica Section D. In recent years, more effort has been dedicated to cases in which the available experimental structures used as search models are only distantly homologous to the target; see, for example, Simpkin et al. (2018), Rigden et al. (2018), Pröpper et al. (2014), Millán et al. (2015) and Caballero et al. (2018).
In 2009, an MR program (REMO09; Caliandro et al., 2009) was proposed in which a probabilistic approach based on the joint probability distribution method was described. Joint distributions were derived in the absence of or under various prior conditions. For example, in the rotation step the correct rotation of a monomer is found via a figure of merit calculated when other monomers were previously oriented or located, or also when such information is not available. Joint distributions were also derived for the translation step: a monomer is located given its own orientation or the orientations and/or locations of other monomers. Burla et al. (2017), starting from REMO09 phases, checked the efficiency of a phase-refinement pipeline which synergistically combines mainstream refinement techniques (specifically DM; Cowtan, 2001) with out-of-mainstream techniques [specifically, free lunch (Caliandro et al., 2005a,b), low-density Fourier transform (Giacovazzo & Siliqi, 1997), vive la difference (Burla, Caliandro et al., 2010), Phantom derivative (Giacovazzo, 2015b; Carrozzini et al., 2016) and phase-driven model refinement (Giacovazzo, 2015a)]. For simplicity, we will refer to this module as SYNERGY. Burla et al. (2017) automatically submitted the protein data obtained by SYNERGY to the automatic model-building (AMB) procedure CAB (Burla et al., 2017), which applies Buccaneer (Cowtan, 2006) in a cyclic way.
In a recent paper (Giacovazzo, 2019), the standard method of joint probability distribution functions has been revised and updated. In particular, two-phase, three-phase and four-phase invariants are estimated directly via conditional distributions without passing through a previous calculation of the related joint probability distributions. The probabilistic formulae thus obtained do not coincide, in general, with the corresponding formulae established through the standard study of the joint probability distribution functions. Some of them are immediately applicable to MR, and some others, also suitable for MR, are derived here via this new approach. The formulae thus obtained form the basis for the modified version of REMO09 used in this paper.
In this paper, in accordance with the talk given by one of us at the 2019 CCP4 Study Weekend in Nottingham, England, we show the default results obtained on applying the modified REMO09 → SYNERGY → CAB pipeline to a large set of protein and nucleic acid structures. To obtain these results, we extended CAB to nucleic acid structures (unpublished work) by making the use of Nautilus (Cowtan, 2014) cyclical. The purposes are twofold: to check the efficiency of the new probabilistic formulae used in the modified version of REMO09 and to check how far a modern crystallographic pipeline based on MR phases is from the automatic crystal structure solution of macromolecules.

General features of REMO09
Various directives allow REMO09 users to choose proper approaches for solving macromolecular structures. In this section, we will summarize the default approach used in all of our applications.
(i) The observed and calculated data are scaled by Wilson techniques, which are also used to calculate the normalized structure factors (the observed and calculated $\langle R^2 \rangle$ are scaled to unity shell by shell). The isotropic thermal factors of the model atoms are automatically modified to make them compatible with the overall temperature factor of the target structure.
(ii) The target and model sequences are read.
(iii) The orientation space is sampled in terms of Lattman angles (Lattman, 1972) with an angular step depending on the resolution of the active reflections (the maximum angular step is 5°). The extent of the orientation space is limited to the asymmetric region of the rotation group (Hirshfeld, 1968). For the first monomer to be located, only the Cheshire cell is explored in the translation step.
(iv) The map grid used in the translation search along each axis is 1/3 of the data resolution for proteins and 1/4 for nucleic acids.
(v) The active reflections for calculating figures of merit used in the rotation and translation searches are automatically selected. Low-resolution reflections (up to 7 Å) are eliminated from the calculations unless the sequence identity (SI) between the model and target is less than 0.5. The highest accepted resolution is 2.5 Å. This limit is extended a little for the translation step owing to the increased prior information gained during the rotation step. The SI is usually less critical for nucleic acids, mostly because nucleic acid helices can adopt similar conformations even when their sequences are drastically different.
(vi) The rotations are ordered according to the rotation figure of merit (RFOM; see Section 4). The good solutions are usually dispersed at the top of the list of ordered solutions: therefore, to speed up calculations only a subset are submitted to the translation step, in which the new figure of merit TFOM is used (see Section 5).
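The default rules in items (iv) and (v) can be sketched in a few lines of Python. This is an illustration only, not REMO09 code: the function names are hypothetical, "low resolution" is assumed to mean a d-spacing larger than 7 Å, and the small extension of the high-resolution limit in the translation step is not modelled.

```python
def translation_grid_spacing(resolution: float, nucleic_acid: bool = False) -> float:
    """Grid spacing (in Å) along each axis for the translation search:
    1/3 of the data resolution for proteins, 1/4 for nucleic acids."""
    return resolution / (4.0 if nucleic_acid else 3.0)


def is_active(d_spacing: float, seq_identity: float,
              low_res_limit: float = 7.0, high_res_limit: float = 2.5) -> bool:
    """Decide whether a reflection with the given d-spacing (in Å) is active.

    Assumption: reflections with d > 7 Å are dropped unless SI < 0.5, and
    reflections with d < 2.5 Å are always dropped.
    """
    if d_spacing < high_res_limit:
        return False
    if d_spacing > low_res_limit and seq_identity >= 0.5:
        return False
    return True
```

For example, `is_active(8.0, 0.3)` keeps a low-resolution reflection because the SI is below 0.5, whereas `is_active(8.0, 0.6)` rejects it.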

Rotational search when only one monomer lies in the asymmetric unit of the target structure
The rotational search is performed by locating the model molecule in a P1 cubic unit cell. According to Rabinovich et al. (1998), the structure factors of the model are calculated only once: fitting to the observed data is obtained by rotating the observed reciprocal lattice with respect to the model lattice.
The figure of merit designed for picking up the correct orientation of the model molecule is RFOM, the correlation factor between the observed $R^2$ and its expected value $\langle R^2 \rangle$ as calculated by the probabilistic approach described by Giacovazzo (2019). RFOM is expected to be maximum for the correct model orientation. $\langle R^2 \rangle$ is the expected value of $R^2$ given the prior information on the model stereochemistry and is given by expression (1), in which $F_{ps}$ is the contribution to the calculated model structure factor arising from the asymmetric unit of the model structure and $E_{ps}$ is its normalized form (normalized with respect to the scattering power of the model structure, symmetry-equivalent molecules included). The $E_{ps}$ are calculated and stored for each reflection via FFT of the electron density of the model structure in the enlarged cubic cell.
(1) has appropriate asymptotic behaviours: when $\sigma_A = 0$ then $\langle R^2 \rangle = 1$, as it should be in the absence of prior information, and when $\sigma_A = 1$ then $\langle R^2 \rangle = \sum_{s=1}^{m} |E_{ps}|^2$. The identity $\langle R^2 \rangle = R^2$ may only occur in P1 when the asymmetric unit contains only one monomer showing a high similarity index to the target molecule.
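As an aid to the reader, the simplest expression that is linear in $\sigma_A^2$ and reproduces both limits quoted above is sketched below. It is a reconstruction from those two limits and from the definition of $E_{ps}$, not a transcription of (1), so its exact form should be treated as an assumption.

```latex
\[
  \langle R^2 \rangle \;=\; 1 + \sigma_A^2 \Bigl( \sum_{s=1}^{m} |E_{ps}|^2 - 1 \Bigr)
\]
% sigma_A = 0  gives  <R^2> = 1                      (no prior information)
% sigma_A = 1  gives  <R^2> = sum_{s=1}^{m} |E_ps|^2 (model fully trusted)
```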
Despite its good asymptotic properties, the use of (1) did not lead to a very efficient RFOM. The reason may lie in the mathematical definition of $\sigma_A^2$: according to Carrozzini et al. (2013) it coincides with the correlation factor between $|F|^2$ and the calculated squared structure factor. In the rotation step the experimental values of $\sigma_A^2$ are generally small, mostly because $\sum_{s=1}^{m} |E_{ps}|^2$ is not the dominant component of the calculated squared structure factor. Thus, in some resolution shells $\sigma_A < 0$ (anticorrelation situation), while the $\sigma_A^2$ parameter to be used in (1) remains positive. This suggested that we eliminate the calculation of $\sigma_A$ from (1), which then simplifies to expression (2).
The 200 orientations corresponding to the highest values of RFOM are selected for the translation step: this number is increased to 300 if more than one monomer is in the target molecule and to 400 if SI < 0.4.
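A minimal Python sketch of the rotation figure of merit and of the orientation-selection rule follows. It assumes that the simplified expectation is the limit quoted in the text for $\sigma_A = 1$, that is, the sum of $|E_{ps}|^2$ over the symmetry operators, and that the SI < 0.4 rule takes precedence when both conditions hold. All function names are hypothetical.

```python
import math
from typing import Sequence


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant series carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def rfom(r2_obs: Sequence[float], e_ps_sq: Sequence[Sequence[float]]) -> float:
    """Correlation between observed R^2 and the assumed expectation.

    e_ps_sq[k][s] holds |E_ps|^2 for reflection k and symmetry operator s.
    """
    expected = [sum(per_symop) for per_symop in e_ps_sq]
    return correlation(r2_obs, expected)


def n_orientations(n_monomers: int, seq_identity: float) -> int:
    """Number of top-RFOM orientations passed to the translation step."""
    if seq_identity < 0.4:
        return 400
    if n_monomers > 1:
        return 300
    return 200
```

Because a correlation coefficient is invariant under a positive rescaling and a constant offset of either series, dropping the constant terms of the expectation does not change the ranking of the orientations.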

Translation search when only one monomer lies in the asymmetric unit of the target structure
The orientations selected according to Section 4 are submitted to the translation search one by one. This is performed by using the T2 function of Crowther & Blow (1967) in the form modified by Harada et al. (1981) and by Navaza (1994). T2 is implemented via FFT, as suggested by Vagin & Teplyakov (1997).
Only peaks falling inside the Cheshire unit cell are considered. For the same orientation, several peaks may be found: to save computing time, only the five largest translations per orientation are saved. The selection of the best translations is made via the figure of merit TFOM, which coincides with the correlation factor between the observed amplitude $|F|$ and the structure-factor amplitude $|F_p|$ as calculated for each translation.
Some further controls modify the simple approach above.
(i) The translations with the largest TFOM values are submitted to the SIMPLEX method (Rowan, 1990), an unconstrained optimization technique related to the downhill method (Nelder & Mead, 1965), which is here applied to a six-dimensional parameter space (three parameters for rotation and three for translation). The method is applied twice to the selected five (or ten for nucleic acids or if SI < 0.4) roto-translations with the largest values of TFOM: they are then submitted to REFMAC optimization cycles. The purpose is to optimize the model and better recognize the best solution. The final figure of merit is given by expression (3).
(ii) The clash test (among symmetry-equivalent molecules) is applied, which damps the TFOM value calculated above when a nonvanishing clash is found. The damping factor is set to $\mathrm{dump} = 1.0 - 0.8\,cl$, where $cl$ is the percentage of C$\alpha$ atoms in the clash condition. The damping factor cannot be <0.2.
Acta Cryst. (2020). D76, 9-18
The roto-translation with the highest figure of merit is automatically submitted to the SYNERGY step and to the CAB procedure.
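The clash damping of item (ii) above can be illustrated with a short Python sketch. Two assumptions are made that the text does not state explicitly: `cl` is expressed as a fraction between 0 and 1 (so that the factor reaches its floor of 0.2 when every atom clashes), and the factor multiplies TFOM. The function names are hypothetical.

```python
def clash_damping(cl: float) -> float:
    """Damping factor 1.0 - 0.8*cl, never smaller than 0.2.

    cl is the fraction (0..1) of C-alpha atoms in the clash condition;
    treating it as a fraction rather than a 0..100 percentage is an assumption.
    """
    if not 0.0 <= cl <= 1.0:
        raise ValueError("cl must lie between 0 and 1")
    return max(0.2, 1.0 - 0.8 * cl)


def damped_tfom(tfom: float, cl: float) -> float:
    """Apply the clash damping to a translation figure of merit
    (multiplicative use of the factor is an assumption)."""
    return tfom * clash_damping(cl)
```

With these assumptions a clash-free solution keeps its TFOM unchanged, while a solution in which half of the atoms clash retains 60% of it.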
6. Rotational search when more than one monomer lies in the asymmetric unit of the target molecule
In the standard REMO09 program, when several monomers with the same stereochemistry are present in the asymmetric unit, the following three-step approach is used.
(i) A number of orientations are selected when the orientation of the first monomer is searched.
(ii) Once the first monomer has been located, the orientation of the second monomer is searched among the most probable orientations selected in step (i).
(iii) After the location of the second monomer, steps (i) and (ii) are repeated until all monomers are located.
This simple procedure may not work when the number of monomers in the asymmetric unit is large (more than three) or when the target is constituted of a number of components with different stereochemistry, each contributing a fraction of the scattering power in the asymmetric unit. This is the case for PDB entries 1lat and 2iff. The first test structure shows two chains of 71 and 74 residues, respectively, and two identical nucleic acid chains, each with 19 nucleotides. The structure with PDB code 2iff is composed of three protein chains: two with 212 and 214 residues and a third chain with only 129 residues. The model coincides with the third target protein chain.
We therefore modified the REMO09 approach as follows: once the first molecule has been located, the rotations of the second and subsequent molecules are searched for using an ex novo rotation step and, where appropriate, a different model.
In both of the approaches the figures of merit to be used for recognizing the correct rotation must be designed to take into account that one or more monomers have been previously oriented and located. This increases the signal-to-noise ratio in the search for the new monomer.
Let us consider the simplest case: the first monomer has been located and we want to orient the second monomer (no other monomers are supposed to lie in the asymmetric unit). Appendix A suggests that RFOM may still be the correlation factor between the observed $R^2$ and its expected value $\langle R^2 \rangle$, which is now given by expression (4). In (4), $R_{p1}^2$ is the squared amplitude of the normalized model structure factor corresponding to the already located first model monomer (normalized with respect to the scattering power of the structure containing the first monomer and its symmetry equivalents) and $\sigma_{A1}$ is the $\sigma_A$ value corresponding to the pairs $(R, R_{p1})$. The last term on the right-hand side of (4) corresponds to the contribution of the second model monomer (the correct orientation of which we are searching for). $\sigma_{A2}$ is the $\sigma_A$ value corresponding to the pairs $(R, \langle R_2^2 \rangle^{1/2})$, where $\langle R_2^2 \rangle$ is defined by (5).
Let us briefly discuss the expected behaviour of (4). The probabilistic approach used to derive (4) excludes the existence of a mixed nonzero term relating the monomer already positioned to the monomer for which the orientation is searched. Thus, the two contributions are simply additive.
When the first monomer is badly oriented and/or located $\sigma_{A1}^2$ is expected to be close to zero. Since $\sigma_{A2}^2$ is always expected to be a small value (at least for non-P1 space groups; see Section 4), RFOM is expected to be small. When the first monomer is well located and the second is well oriented then RFOM is expected to be larger. However, values of $\sigma_{A1}^2$ and $\sigma_{A2}^2$ that are both close to unity are not expected because $\Sigma_{p1}/\Sigma_N$ and $\Sigma_{p2}/\Sigma_N$ values that are both close to unity are not allowed. Sections 4 and 5 suggest avoiding the use of $\sigma_A$ values, so that $\langle R^2 \rangle$ reduces to expression (6). The final RFOM is the correlation coefficient between the observed $R^2$ and its expected value $\langle R^2 \rangle$.
Let us now generalize (6) to the case in which three monomers are contained in the asymmetric unit under the condition that the first and second monomers have already been oriented and located. The expression (6) is still valid; we only have to change the meaning of the symbols. $R_{p1}$ will represent the normalized amplitude of the model structure corresponding to the first and second monomers (symmetry equivalents included), and $\sum_{s=1}^{m} |E_{sp2}|^2 - 1$ will represent the contribution arising from the monomer for which the correct orientation is searched.
The procedure is now cyclic: the same equation may be applied to any number of monomers.
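The additive structure described above (one term for the monomers already placed, one for the monomer being oriented) can be sketched as a single Python function. The functional form is an assumption reconstructed from the description of the terms and from the normalizations given in Appendix A; it is not quoted from expressions (4) or (6). Setting both weights to unity corresponds to the simplified, weight-free expectation.

```python
from typing import Sequence


def expected_r2(r2_p1: float, e_sp2_sq: Sequence[float],
                sigma_a1_sq: float = 1.0, sigma_a2_sq: float = 1.0) -> float:
    """Assumed expectation of R^2 for one reflection.

    r2_p1       : squared normalized amplitude of the already located part
    e_sp2_sq    : |E_sp2|^2 for each symmetry operator of the oriented,
                  not yet located, monomer
    sigma_a*_sq : reliability weights; both default to 1 (weight-free form)
    """
    located_term = sigma_a1_sq * (r2_p1 - 1.0)
    oriented_term = sigma_a2_sq * (sum(e_sp2_sq) - 1.0)
    return 1.0 + located_term + oriented_term
```

The cyclic use described in the text then amounts to recomputing `r2_p1` from all monomers placed so far before each new rotation search.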
7. Translational search when more than one monomer lies in the asymmetric unit of the target molecule
Let us first suppose that one monomer has already been oriented and located ($F_1$ is its generic structure factor) and that a second monomer has been oriented. If we use the Crowther T2 function to locate the second monomer in the translation step, then the expected squared structure factor $\langle |F|^2 \rangle$ of the structure constituted by the two monomers and their symmetry equivalents in correct positions provides only a weak relation, because it does not include the mixed term $F_1 F_2$.
A better approach is to use a translation function involving $F$ rather than its square. Let $\mathbf{r}_{pj}$ be the current positional vector of the $j$th atom of the second model monomer: the structure factor of the structure constituted by the second monomer and its symmetry equivalents in correct positions is then given by expression (7), in which $\Delta\mathbf{r}$ is a suitable unknown positional shift applied to the components of the current model structure factor. The algorithm is very simple. $F_2$ is calculated for each active reflection only once, in the initial position of the second monomer. The second monomer is then moved by the shift $\Delta\mathbf{r}$ on all of the grid points of the asymmetric unit, where $F_2$ is calculated via (7) and summed with $F_1$ to obtain $\langle F \rangle$ [expression (8)]. The correct grid position is expected to be that for which TFOM, the correlation factor between the observed amplitude $|F|$ and the structure-factor amplitude $|\langle F \rangle|$, is a maximum.
The method is readily generalized to locate an $n$th well oriented monomer when the first $n - 1$ monomers have been well oriented and located.
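The translation search described above can be sketched in plain Python. The sketch relies on a standard identity that follows from the structure-factor definition: shifting a molecule by $\Delta\mathbf{r}$ multiplies the contribution of each symmetry operator $s$ by $\exp(2\pi i\,\mathbf{h} R_s \Delta\mathbf{r})$, so the per-operator contributions need to be computed only once. The function names, the data layout and the use of an exhaustive grid over the whole cell are assumptions made for illustration; this is not the REMO09 implementation.

```python
import cmath
import math
from itertools import product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotate(R, v):
    """Apply a 3x3 rotation matrix (tuple of rows) to a vector."""
    return tuple(dot(row, v) for row in R)


def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def partial_structure_factors(h, atoms, symops):
    """One term per symmetry operator: sum_j f_j exp[2*pi*i h.(R_s r_j + T_s)]."""
    out = []
    for R, T in symops:
        total = 0j
        for f, r in atoms:
            pos = tuple(a + b for a, b in zip(rotate(R, r), T))
            total += f * cmath.exp(2j * math.pi * dot(h, pos))
        out.append(total)
    return out


def shifted_structure_factor(h, partials, symops, shift):
    """Structure factor after moving the molecule by `shift` (fractional)."""
    return sum(p * cmath.exp(2j * math.pi * dot(h, rotate(R, shift)))
               for p, (R, _) in zip(partials, symops))


def translation_search(hkls, f_obs, f1, partials2, symops, n_grid):
    """Return (best TFOM, best shift) over an n_grid^3 grid of shifts.

    f1[k] is the structure factor of the already located part and
    partials2[k] the per-operator terms of the monomer being located.
    """
    best_tfom, best_shift = -2.0, None
    for idx in product(range(n_grid), repeat=3):
        shift = tuple(i / n_grid for i in idx)
        f_calc = [abs(f1[k] + shifted_structure_factor(h, partials2[k], symops, shift))
                  for k, h in enumerate(hkls)]
        tfom = correlation(f_obs, f_calc)
        if tfom > best_tfom:
            best_tfom, best_shift = tfom, shift
    return best_tfom, best_shift
```

In this layout `f1` would simply be replaced by the summed structure factor of all monomers placed so far when a third or later monomer is located.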

Applications
We applied the automatic modified pipeline REMO09 → SYNERGY → CAB to an extended set of test structures, proteins and nucleic acids. We used 80 protein and 38 nucleic acid test structures, the PDB codes of which are reported in Tables 1 and 2. The first 34 protein test structures had previously been used by Burla et al. (2017) to check the SYNERGY refinement process on standard REMO09 phases. Proteins 25-34 belong to the set of 13 structures studied by DiMaio et al. (2011) and characterized by an SI between the model and target structures of lower than 0.30. The experimental data and models for the remaining 46 protein test structures had been deposited in the PDB by the Joint Centre for Structural Genomics, Wilson Laboratory, Scripps Institute: they were used to verify the efficiency of our pipeline on a larger number of test structures (most of them were not originally solved by MR).
The 38 nucleic acid structures were selected from the PDB: we downloaded the observed diffraction data, information on
Table 1. The 80 protein test structures are identified by their PDB codes. Their experimental data were submitted to the REMO09 + SYNERGY + CAB pipeline. For each test structure we show MRP, the average phase error/weighted average phase error in degrees at the end of REMO09; SYN, the average phase error in degrees at the end of the SYNERGY step; and MA, the ratio 'number of C$\alpha$ atoms within 0.6 Å distance from the published positions/number of C$\alpha$ atoms in the asymmetric unit'. Dashes indicate that useful roto-translations were not found by the MR program.
For all of the test structures the same small set of directives was used (coinciding with our default set) such as those shown in Table 3 for PDB entry 1xyg.
The experimental results are reported in Tables 1 and 2. For each test structure PDB is the PDB code, MRP is the average phase error in degrees at the end of the REMO09 step and SYN is the average phase error in degrees at the end of the SYNERGY step. For proteins, MA is the ratio 'number of C$\alpha$ atoms within 0.6 Å distance from the published positions/number of C$\alpha$ atoms in the asymmetric unit' as obtained by CAB. For nucleic acids, MA is the ratio 'number of residues with P atoms within 1.3 Å distance from the published positions/number of residues in the asymmetric unit' in accordance with CAB interpretation. We will assume that good models are obtained by CAB when MA is sufficiently large: as a rough rule of thumb, we will assume that a good solution has been automatically found when MA > 0.5.
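The MA criterion can be sketched as follows. The sketch assumes that both atom lists are Cartesian coordinates in Å already expressed in the same origin and asymmetric unit, and that a published position counts as matched when any built atom lies within the cutoff; the real comparison must also handle symmetry and origin ambiguities, which are ignored here. The function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def model_accuracy(built: Sequence[Point], published: Sequence[Point],
                   cutoff: float = 0.6) -> float:
    """Fraction of published positions with a built atom within `cutoff` Å.

    Use cutoff=0.6 for C-alpha atoms (proteins) and cutoff=1.3 for P atoms
    (nucleic acids), following the definitions in the text.
    """
    if not published:
        raise ValueError("the published model must contain at least one atom")
    matched = sum(1 for ref in published
                  if any(math.dist(ref, atom) <= cutoff for atom in built))
    return matched / len(published)


def is_solved(ma: float, threshold: float = 0.5) -> bool:
    """Rule of thumb from the text: a good solution has MA > 0.5."""
    return ma > threshold
```

Note that the threshold is strict, so a model that matches exactly half of the published positions is not counted as solved.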
For proteins we observe the following.
(i) Good solutions were found for 64 of the 80 test proteins. The 16 failures are essentially owing to the limited efficiency of REMO09. Indeed, for 14 of the 16 failures MRP was ≥74°: in these conditions SYNERGY is often unable to substantially reduce the average phase error so as to allow CAB to succeed. REMO09 failures are frequent for DiMaio structures because, owing to the extremely low value of SI, the MR step often ends with a large model bias which SYNERGY is unable to correct.
(ii) When MRP is not extremely large, SYNERGY dramatically reduces the average phase error. In 15 cases MRP values in the interval 73-80° are broken down to values of less than 43°, thus allowing CAB to succeed.
(iii) CAB for proteins is extremely efficient. The MA value is very often close to 100 (a clear signal of successful map interpretation), even in nine of the cases for which SYNERGY ended with SYN > 50°.
The panorama is different for nucleic acids. Such behaviour is in part expected because of the special stereochemistry of DNA/RNA structures. They have a large number of rotatable bonds in the main chain (six, while there are two for proteins); consequently, the conformation at low resolution is often ambiguous (Keating & Pyle, 2012; Murray et al., 2003). Our experimental results may be summarized as follows: of the 38 nucleic acid structures only 24 are routinely solved. Ten of the 14 failures may be ascribed to REMO09 (i.e. for these MRP ≥ 77°). Four of the remaining five failures are owing to CAB failures (CAB is unable to interpret the electron-density maps of PDB entries 3tok, 4gsg, 4xqz and 5ihd, for which SYN ≤ 51°).
SYNERGY is again efficient (MRP values of >70° are broken down to values smaller than 40°).
The above experimental tests indicate that the application of REMO09 and CAB to DNA/RNA is the weakest point of the pipeline. On the contrary, SYNERGY, applied to both nucleic acids and proteins, and the application of CAB to proteins are particularly efficient. The existence of weak points in the pipeline does not allow us to positively answer the question in the title of this paper. There are three simple ways to improve the present situation.
(i) Modify REMO09 to give a more modern and efficient version.
(ii) Replace REMO09 with a more efficient program.
Table 2. The 38 nucleic acid test structures are identified by their PDB codes. Their experimental data were submitted to the REMO09 + SYNERGY + CAB pipeline. For each test structure we give MRP, the average phase error/weighted average phase error in degrees at the end of REMO09; SYN, the average phase error in degrees at the end of the SYNERGY step; and MA, the ratio 'number of residues with P atoms within 1.3 Å distance from the published positions/number of residues in the asymmetric unit'.
Table 3. Directives for the default use of the REMO09/SYNERGY/CAB pipeline.
Table 4. The 80 protein test structures are identified by their PDB codes. Their experimental data were submitted to the MOLREP + SYNERGY + CAB pipeline. For each test structure we give MRP, the average phase error/weighted average phase error in degrees at the end of MOLREP; SYN, the average phase error at the end of the SYNERGY step; and MA, the ratio 'number of C$\alpha$ atoms within 0.6 Å distance from the published positions/number of C$\alpha$ atoms in the asymmetric unit'. Dashes indicate that useful roto-translations were not found by the MR program.
Table 5. The 38 nucleic acid test structures are identified by their PDB codes. Their experimental data were submitted to the MOLREP + SYNERGY + CAB pipeline. For each test structure we give MRP, the average phase error/weighted average phase error in degrees at the end of MOLREP; SYN, the average phase error in degrees at the end of the SYNERGY step; and MA, the ratio 'number of residues with P atoms within 1.3 Å distance from the published positions/number of residues in the asymmetric unit'. Dashes indicate that useful roto-translations were not found by the MR program.
than that corresponding to the naïve default we choose. However, the experimental results obtained by the pipeline MOLREP → SYNERGY → CAB, shown in Tables 4 and 5, help to better answer the general question regarding automatic crystal structure solution via MR. The results in Table 4 for proteins may be summarized as follows.
(i) Solutions are found for 61 of the 80 test structures. Most of the 19 failures are owing to our non-optimal MOLREP default choice.
(ii) The efficiency of SYNERGY and CAB is similar to that described for the REMO09 → SYNERGY → CAB pipeline.
(iii) REMO09 and MOLREP have a complementary behaviour. Indeed, only nine of the 80 protein test structures remained unsolved by both pipelines.
The experimental results in Table 5 for nucleic acid structures may be summarized as follows.
(i) Of the 38 nucleic acids only 20 are automatically solved: 16 of the 18 failures may be ascribed to the limited effectiveness of our default MOLREP procedure (for these MRP ≥ 86°) and two to CAB (PDB entries 3tok, for which SYN = 47°, and 4gsg, for which SYN = 55°).
(ii) 14 of the 38 nucleic acid structures remained unsolved by both pipelines.
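The complementarity of the two pipelines can be quantified from the counts quoted above by simple inclusion-exclusion. The helper below is only bookkeeping on the published totals (the function name and dictionary keys are hypothetical), with pipeline A being REMO09 → SYNERGY → CAB and pipeline B being MOLREP → SYNERGY → CAB.

```python
def combine(total: int, failed_a: int, failed_b: int, failed_both: int) -> dict:
    """Derive combined success counts for two pipelines from failure counts."""
    failed_either = failed_a + failed_b - failed_both
    return {
        "solved_by_either": total - failed_both,
        "solved_by_both": total - failed_either,
        "failed_only_a": failed_a - failed_both,
        "failed_only_b": failed_b - failed_both,
    }


# Proteins: 80 structures, 16 REMO09-pipeline failures, 19 MOLREP-pipeline
# failures, 9 unsolved by both.
proteins = combine(80, 16, 19, 9)

# Nucleic acids: 38 structures, 14 and 18 failures, 14 unsolved by both.
nucleic_acids = combine(38, 14, 18, 14)
```

On these counts 71 of the 80 proteins are solved by at least one pipeline, whereas for nucleic acids every structure that fails with REMO09 also fails with MOLREP, so the combined total stays at 24 of 38.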

Conclusions
The phase problem for small molecules is considered to be universally solved in practice. The main purpose of this paper is to check whether a similar situation is, or will soon be, available for macromolecules if MR techniques are used. We applied the two pipelines REMO09 → SYNERGY → CAB and MOLREP → SYNERGY → CAB to 80 protein structures and 38 nucleic acid structures. Only nine of the 80 protein structures remained unsolved by both of the pipelines; most of the failures occurred when the SI was extremely low (below 0.30). The increasing availability of better models, the selection of improved default procedures for REMO09 and MOLREP, and the possible use of more efficient MR programs (e.g. SYNERGY and CAB may use Phaser) suggest that automatic crystal structure solution is close for proteins. The situation for nucleic acid structures is different: 14 of the 38 nucleic acid structures remained unsolved by both of the pipelines. Further efforts are therefore necessary to obtain their automatic crystal structure solution: the necessary improvements involve the MR programs (in particular the treatment of ligands, which may be a non-negligible part of the structure) and the AMB section.

APPENDIX A On the orientation of a second monomer
The problem that we will treat in this appendix is the following: if the first monomer has been correctly oriented and located, how do we fix the orientation of a second monomer? To answer this question, in the following probabilistic approach we will explicitly consider the case in which the orientation of the second monomer has been fixed while its location is unknown. We will see that the conclusive formulae thus obtained may be applied to fix the orientation of the second monomer.
Let $t_1$ and $t_{p1}$ be the number of non-H atoms of the first target monomer and of its model molecule, respectively: for simplicity, we are supposing that $t_1 \ge t_{p1}$. $t_2$ and $t_{p2}$ are the equivalent numbers for the second target monomer and for its model molecule. We order the atoms in the target asymmetric unit so that its structure factor may be represented as in expression (8), where $t = t_1 + t_2$ is the number of non-H atoms in the target asymmetric unit. In our probabilistic approach $\mathbf{h}$ is fixed while the positional vectors are the primitive random variables. $\mathbf{U}$ is an overall free translation vector that is necessary to locate the second monomer in the correct position and $\Delta\mathbf{r}_j$ are local variables relating the atomic positions of the target monomers to the corresponding positions of the model. Accordingly, (8) may be rewritten as expression (9), in which $F$ is split into the contributions $F_1$, $F_2$, $F_{q1}$ and $F_{q2}$. The atoms contributing to $F_1$ are related to the atoms of the model molecule of the first monomer via the local shift vectors $\Delta\mathbf{r}_j$ only (the first monomer has been already located). The atoms contributing to $F_2$ are related to the atoms of the model molecule of the second monomer through the local shift vectors $\Delta\mathbf{r}_j$ and through the unknown overall translation vector $\mathbf{U}$ (indeed, the second monomer has not been located). The coordinates of the atoms contributing to $F_{q1}$ and $F_{q2}$ are not related to the atoms of the model molecules; they may be thought of as unconstrained unknown variables.
We now calculate the average value of $|F|^2$ given the prior information described above [expression (10)]. This expression may be written more explicitly if the cases in which $i = j$ and/or $s_1 = s_2$ are emphasized; in the resulting form, $D_1$ and $D_2$ are the $D$ values (see Section 1) calculated for monomers 1 and 2, respectively. Let us now take into account the relations (11), (12) and (14) below.
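$$\sum_{s_1 = s_2 = 1}^{m} \sum_{i \ne j = 1}^{t_{p1}} f_i f_j \exp[2\pi i \mathbf{h}(R_{s_1}\mathbf{r}_{pi} - R_{s_2}\mathbf{r}_{pj} + T_{s_1} - T_{s_2})] + \sum_{s_1 \ne s_2 = 1}^{m} \sum_{j=1}^{t_{p1}} f_j^2 \exp\{2\pi i \mathbf{h}[(R_{s_1} - R_{s_2})\mathbf{r}_{pj} + T_{s_1} - T_{s_2}]\} + \sum_{s_1 \ne s_2 = 1}^{m} \sum_{i \ne j = 1}^{t_{p1}} f_i f_j \exp[2\pi i \mathbf{h}(R_{s_1}\mathbf{r}_{pi} - R_{s_2}\mathbf{r}_{pj} + T_{s_1} - T_{s_2})] = |F_{p1}|^2 - m \sum_{j=1}^{t_{p1}} f_j^2. \qquad (13)$$
$F_{p1}$ is the structure factor corresponding to the structure constituted of the model molecule that has already been located (and its symmetry equivalents).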
In (14), $F_{sp2} = \sum_{j=t_1+1}^{t_1+t_{p2}} f_j \exp(2\pi i \mathbf{h} R_s \mathbf{r}_{pj})$ is the contribution to the structure factor of the model molecule of the second monomer (oriented but not located) arising from the asymmetric unit. In accordance with (14), we have
$$\sum_{s=1}^{m} \sum_{i \ne j = t_1+1}^{t_1+t_{p2}} f_i f_j \exp\{2\pi i \mathbf{h}[R_s(\mathbf{r}_{pi} - \mathbf{r}_{pj})]\} = \sum_{s=1}^{m} |F_{sp2}|^2 - m \sum_{j=t_1+1}^{t_1+t_{p2}} f_j^2. \qquad (15)$$
Substituting (11), (13) and (15) into (10) gives expression (16). Dividing the left- and right-hand sides of (16) by $\Sigma_N$ leads to expression (17), where $\sigma_{A1}^2 = D_1^2\,\Sigma_{p1}/\Sigma_N$ and $\sigma_{A2}^2 = D_2^2\,\Sigma_{p2}/\Sigma_N$. $R^2$ is normalized with respect to the scattering power of the full target unit cell, $R_{p1}^2$ is normalized with respect to the scattering power of the structure constituted of the oriented and located molecule (symmetry equivalents included) and $|E_{sp2}|^2$ is normalized with respect to the scattering power of the model molecules (symmetry equivalents included) that are oriented but not located.
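The normalization step from (16) to (17) can be illustrated as follows. The additive form written on the first line is an assumption, reconstructed from the right-hand sides of (13) and (15) and from the statement that the two monomer contributions are additive; it is not quoted from (16). Here $\Sigma_{p1} = m\sum_{j=1}^{t_{p1}} f_j^2$ and $\Sigma_{p2} = m\sum_{j=t_1+1}^{t_1+t_{p2}} f_j^2$.

```latex
\begin{align*}
\langle |F|^2 \rangle
  &= \Sigma_N + D_1^2 \bigl( |F_{p1}|^2 - \Sigma_{p1} \bigr)
     + D_2^2 \Bigl( \sum_{s=1}^{m} |F_{sp2}|^2 - \Sigma_{p2} \Bigr)
     && \text{(assumed form)} \\
\langle R^2 \rangle = \frac{\langle |F|^2 \rangle}{\Sigma_N}
  &= 1 + \frac{D_1^2 \Sigma_{p1}}{\Sigma_N}
         \Bigl( \frac{|F_{p1}|^2}{\Sigma_{p1}} - 1 \Bigr)
       + \frac{D_2^2 \Sigma_{p2}}{\Sigma_N}
         \Bigl( \sum_{s=1}^{m} \frac{|F_{sp2}|^2}{\Sigma_{p2}} - 1 \Bigr) \\
  &= 1 + \sigma_{A1}^2 \bigl( R_{p1}^2 - 1 \bigr)
       + \sigma_{A2}^2 \Bigl( \sum_{s=1}^{m} |E_{sp2}|^2 - 1 \Bigr)
\end{align*}
```

Each bracket is the excess of a normalized model intensity over its random-atom expectation of unity, weighted by the corresponding $\sigma_A^2$, which is why both terms vanish when the model carries no information.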
We can now return to the question: why did we formulate a probabilistic theory for the case in which one monomer is well located and the second well oriented, when we are primarily interested in the case in which one monomer is well located and we are looking for the orientation of the second monomer? The answer is simple. Indeed, when we continuously rotate reciprocal space and look for the best fit between $R^2$ and $\langle R^2 \rangle$ we hope to find a rotation in which the second monomer is well oriented. In this case $\langle R^2 \rangle$ will really be the expected value of $R^2$ in accordance with (17), while for all of the other orientations this condition will not be obeyed. Accordingly, the correlation will be a maximum.