Automated matching of two-time X-ray photon correlation maps from phase-separating proteins with Cahn–Hilliard-type simulations using auto-encoder networks

Two-time correlation maps are classified in a simulation framework using an auto-encoder network.


Introduction
X-ray photon correlation spectroscopy (XPCS) is an experimental method capable of accessing the dynamics of protein systems over length scales that range from the atomic to micrometre scale, and timescales from microseconds to hours (X. Lu et al., 2008;Zhang et al., 2011Zhang et al., , 2017Madsen et al., 2016;Mö ller et al., 2016;Zinn et al., 2018;Perakis & Gutt, 2020;Ruta et al., 2012;Begam et al., 2021). In XPCS, time series of coherent X-ray speckle patterns are measured, giving access to dynamical properties via time-resolved correlation maps -the two-time correlation function (TTC) (Brown et al., 1997;Bikondoa, 2017). With fast megapixel 2D X-ray detectors, XPCS experiments can produce large quantities of data in a short time interval with up to thousands of TTCs per hour of beamtime. Given the typical duration of a synchrotron experiment of a few days, the resulting quantities of TTCs are difficult to handle. Therefore, there is a generic need for methods facilitating fast and reliable analysis including classification of the data. This is especially important with regards to steering, selecting and controlling the experimental parameters during the beamtime. However, in the aftermath of the experiment, quick classification methods are important for benchmarking models which, for example, simulate a gel transition or the solidification process of a protein solution upon liquid-liquid phase separation (LLPS).
Phase separation is a ubiquitous process in nature, with applications and consequences for a wide range of scientific disciplines such as solid-state physics, material sciences and biology. Phase separation in biological systems attracted considerable attention with the discovery that LLPS in protein solutions (Ishimoto & Tanaka, 1977;Schurtenberger et al., 1989;Broide et al., 1991;Berland et al., 1992;Muschol & Rosenberger, 1997) constitutes a possible pathway for organizing membraneless structures in living cells (Brangwynne et al., 2009;Shin & Brangwynne, 2017;Berry et al., 2018). Detailed investigations were carried out with respect to the biological functions of these protein condensates, encompassing biochemical reaction rates, buffering protein concentrations, and sensing or signaling (Shin & Brangwynne, 2017). A variety of diseases which result from a loss or change of function of these condensates are accompanied by phase transitions (Malinovska et al., 2013;Weber & Brangwynne, 2012).
Phase separation is initiated by quenching a system into the metastable region of the phase diagram, launching a process of out-of-equilibrium self-organization (Dong & Granick, 2021). The state of the condensates depends on the dynamic and kinetic processes during their formation, with dynamical asymmetries between the two phases on a hierarchy of length and timescales and invoking viscoelastic properties of the resulting network structures (Berry et al., 2018;Zaccarelli, 2007;Tanaka, 2000). In the biological context, an understanding of the component timescales is important as they are expected to meet the intrinsic timescales of biochemical processes in the condensates.
Upon phase separation, the dynamics slow down on molecular length scales due to local concentration changes. This microscopic deacceleration can eventually lead to the arrest of phase separation on larger length scales complemented by the formation of bicontinuous gel network structures -as observed in colloidal and protein systems (Manley et al., 2005;Conrad et al., 2010;P. J. Lu et al., 2008). Employing timeresolved scattering experiments, the kinetics during arrested phase separations have been observed frequently as a slowdown of the growth of the static structure factor S(Q) in both Q position and peak height for low-temperature quenches (Gibaud & Schurtenberger, 2009;Gibaud et al., 2011;Cardinaux et al., 2007;Da Vela et al., 2016, 2017Bucciarelli et al., 2015). In contrast, the dynamics of protein solutions evolving into an arrested phase transition are still poorly understood as studies require the concurrent monitoring of an extraordinarily broad range of length and timescales. Likewise, experimental validation of models of the dynamics of critical phenomena during LLPS like the Cahn-Hilliard equation (CHE) (Cahn, 1965) and other correlated models is still needed, particularly with respect to glass-gel transitions displaying considerable dynamical asymmetries between the dilute and concentrated phase (Berry et al., 2018;Girelli et al., 2021;Ragulskaya et al., 2021).
The focus of X-ray scattering applications of ML has been mainly on static properties up to now. We note, however, that many processes in nature, such as the LLPS of protein solutions, display dynamic phenomena which require the development of schemes and methods capable of classifying the corresponding dynamic X-ray signatures of the processes. MLbased classification of process data evolving in time is computationally very expensive due to the large quantity of training data required. Employing a minimum 2D model which allows a network with finite computer resources to be trained, we report on a first step towards this goal and present a neural network method for analysis and classification of dynamical X-ray information during LLPS.
The CHE in a simplified 2D version still captures the essence of the physics involved (Sciortino et al., 1993;Sappelt & Jä ckle, 1997;Lamorgese & Mauri, 2009), while allowing us to generate a large number of simulations on the timescale of hours, and couple it with a concentration-induced gelation as proposed by Sciortino et al. (1993) and Sappelt & Jä ckle (1997). We aim for an automated assignment of the measured TTCs in the work of Girelli et al. (2021) with TTCs from simulations that differ in the parameters , i.e. proportional to reduced quench depth, and É gel , i.e. the order parameter at which the mobility drops by a factor of 1/2. We train an autoencoder neural network with TTC simulation data and employ a differential evolution based algorithm for matching encoded experimental TTCs with simulations. Our work constitutes a first step to making use of the static, kinetic and dynamical information contained in XPCS data for ML-based analysis of processes in phase separation.

Simulation of the Cahn-Hilliard equation
We model the dynamic processes during the spinodal decomposition with the help of the CHE ( Here Éðr; tÞ is the order parameter at a given time t and position r which is related to the local protein concentration È by È ¼ ðÉ þ 1Þ=2 (Sciortino et al., 1993). The symmetry breaking is introduced by the parameter which can be identified as being proportional to the reduced quench depth / ðT c À TÞ=T c , where T c is the upper critical temperature of the system. If fulfills the condition > 3É 2 (e.g. Gunton et al., 1983), the system is in an unstable state below the spinodal curve and separates into regions with higher and lower local protein concentrations [Figs. 1(a) and 1(b)].
The phase separation by spinodal decomposition is coupled to a gelation process by introducing a concentrationdependent mobility parameter M in equation (1) (Sappelt & Jä ckle, 1997) which reduces the mobility of the highly concentrated phase via In this model the mobility decreases from values M ¼ 1 to M ¼ 1 2 when the order parameter É exceeds the value of É gel [ Fig. 1(c)]. The parameter is fixed to ¼ 10; providing a steep decrease of the mobility as required for an arrested phase separation (Sappelt & Jä ckle, 1997).
We perform the simulation on a 2D grid of size 256 Â 256. The grid uses periodic boundary conditions and is initialized with a mean order parameter of É 0 ¼ 0:4 such that we observe droplets with a low concentration instead of an interconnected network, which would be the case for É 0 ¼ 0. The fluctuations necessary for LLPS are added as random noise with a maximum amplitude of 0:5 ð1 À Þ. Ten thousand time steps are calculated for each simulation with a fixed time step of Át ¼ 0:03. The temporal evolution of the order parameter for different parameter configurations is depicted in Figs. 1(a) and 1(b). Apart from the initial noise configuration, the simulations differ only in the values of the parameters and É gel . Both were systematically varied within the intervals 2 ½0:88; 0:99 and É gel 2 ½0:55; 1: These intervals were chosen with É gel > É 0 , thus avoiding an arrest of the phase separation before the formation of droplets. We restrict our analysis to large values of as for small values of the concentration of the dense phase is too low [gray line in Fig 1(a)] to introduce visible changes in the simulations for different values of É gel . A single simulation run yields a time series of the protein density's real-space configuration Éðr; tÞ [ Fig. 2(a)] which is converted to an X-ray speckle pattern by means of Fourier transform, yielding a 256 Â 256 image. The temporal evolution of the square of the azimuthal integration of these Fourier images jÉðR Q ; tÞj 2 is shown in Fig. 2(b) and is equivalent to the X-ray intensity. We introduce R Q as the distance from the center of the Fourier image in units of pixels. In Fig. 2  Order parameter averaged over all lattice sites having an order parameter higher/lower than the initial order parameter (É ¼ 0:4) as a function of time. In (a), É gel was fixed to É gel ¼ 1 and was varied. For ¼ 0:7, the phase separation is much slower and the concentration of the dense phase is lower at the end of the chosen time window. This makes it hard to see the effects of high É gel , which is why is restricted to ½0:88; 0:99 for the generation of the simulated training data. In (b), was fixed to ¼ 0:9 and É gel was varied. The phase separation starts later for larger values of due to the smaller initial fluctuations. For constant , the phase separations show similar behavior at early times but the droplet formation is faster for larger values of É gel . (c) Mobility as a function of the order parameter and the gelation point É gel marking the order parameter at which the mobility drops to 0.5.  c ð2Þ ðt 1 ; t 2 Þ ¼ ½Iðt 1 Þ À hIðt 1 Þi½Iðt 2 Þ À hIðt 2 Þi ½hI 2 ðt 1 Þi À hIðt 1 Þi 2 ½hI 2 ðt 2 Þi À hIðt 2 Þi 2 È É 1=2 ; ð4Þ with hÁi being the average over pixels within a distance ½R Q À dR Q ; R Q þ dR Q to the center. The azimuthally integrated intensity profile [ Fig. 2(b)] shows the growth dynamics of the droplets by a shift of the peak position towards smaller values of R Q . These dynamics can be observed in the TTC by choosing a value of R Q such that it is located at the falling slopes of the peaks in Fig. 2(b), which is R Q 2 ½26; 45. For our training data, we chose R Q to be an integer in this range, while dR Q increases linearly from 0.7 at R Q ¼ 26 to 3 at R Q ¼ 45; as done in experimental analyses as well to compensate for lower photon counts at higher Q values.

Experimental data
The XPCS experiments (Fig. 3) were conducted at the Coherence Applications beamline P10 at PETRA III, DESY, employing an X-ray beam of photon energy 8.54 keV, size 100 Â 100 mm and maximum photon density 10 7 photons s À1 mm À2 [for further experimental details, see Girelli et al. (2021)]. We measured time series of coherent diffraction patterns that were collected with an EIGER 4 Mpixel detector covering a Q range from 3 to 50 mm À1 . The samples consisted of immunoglobulin G (IgG) with polyethylene glycol (PEG) and NaCl, and were quenched from 37 C to six different quench temperatures below the binodal line. Further details about the sample preparation can be found in the work of Da Vela et al. (2017).
In the experiment, 4000 frames were recorded for every quench temperature. The sample was illuminated for 0.02 s followed by 0.1 s where the shutter was closed such that the measurement covers a time range of 4000 Â 0.12 s = 480 s. The experimental TTCs were calculated similarly to the TTCs from the simulation using equation (4). Regarding the temporal shift of the integrated intensity profiles as a function of Q, the falling edge can be restricted to Q 2 [7.5, 10] mm À1 . The contrast of the experimental TTCs is lower than that of the simulated ones where we assume completely coherent scattering and detection. This issue is addressed by normalizing the experimental TTCs to the speckle visibility which is determined by the values of the pixels adjacent to the diagonal of the experimental TTC, as the values directly on the diagonal are distorted by shot noise (Inoue et al., 2012).

Data preparation and auto-encoding
For the training data, 12 000 pairs of the parameters and É gel were sampled from a uniform distribution within the intervals. To increase the simulation robustness against the random initial conditions, TTCs from five simulations with different initial noise were averaged. Thus, 5 Â 12 000 = 60 000 simulations were performed in total, which took 7 h on two NVIDIA 2080 Ti graphic cards. TTCs were calculated for 20 R Q ranges, so that the training data consisted of 20 Â 12 000 = 240 000 TTCs.
All simulated TTCs display a broad, slow part in the beginning (Fig. 4) which is absent in the experimental data. This part is associated with the still homogeneous phase at early times before the onset of density differences [ Fig. 2(a), t = 2000]. Approaching smaller values of R Q this slow part becomes even more pronounced.
These broad parts are eliminated by cropping the TTCs. The maximum of the intensity in the associated Q rings is well defined for both simulated and experimental TTCs and is chosen to be the time where the TTCs are cropped. By doing this, the early part of the phase separation is removed from the TTC, such that only the coarsening stages of simulated and experimental TTCs are compared. The cutting procedure is illustrated in Fig. 4 for three simulated TTCs. For different models, this preprocessing via cropping needs to be adapted to remove, for example, well known deficiencies from simulated TTCs or make sure that the timescales of simulation and experiment are comparable. We derive the latter from the facts that the intensity curves show similar behavior within the coarsening stage for both experimental and simulated data and the TTCs show the same overall features in the whole Schematic of the experimental setup. The coherent X-ray beam is scattered from the protein solution during the LLPS yielding a time series of X-ray speckle patterns. A constant Q ring is selected in the speckle patterns and the intensities are correlated pixel by pixel for different times.

Figure 4
The simulated TTCs show a broad feature at early times (lower-left corner of the TTCs). In the second row, the mean intensity of the respective Q range is shown as a function of time. The maximum of this intensity curve defines the point where the TTCs are cropped. measurement time . The cropped TTCs are finally scaled down to a 64 Â 64 resolution.
Instead of matching raw TTCs between the experiment and the simulations, we compare their encoded representations via an auto-encoder (Rumelhart et al., 1986). The auto-encoder network (Starostin, 2022) was trained with the simulated TTCs, encoding them in 32-dimensional vectors Z. For the encoder, the first three layers of the pretrained ResNet-18 model (He et al., 2016) were used as a feature extractor, followed by adaptive average pooling with an output size of 3 Â 3 and a fully connected network. The decoder architecture follows the work of Hou et al. (2017).
The performance of the auto-encoder network on simulated and experimental data is shown in Fig. 5, from which one can see that the network is capable of reducing the experimental TTCs to the main dynamic components. Thus, the encoded representations allow comparison of TTCs in a more reliable and faster way. If a series of TTCs from the same simulation but at different values of R Q is created, the entries of the Z vector change continuously (Fig. 6) which enabled us to interpolate TTCs for arbitrary values of R Q in between the selected values.

Results
Our goal is to develop tools for a fast classification of experimental TTCs in the context of different models. We approach this task via matching encoded TTCs from measurements at different quench temperatures with encoded TTCs from simulations of our simplified model. During this matching process, we find the quench temperature dependence of the two simulation parameters and the values of the three additional calibration parameters t low ; t up ; k r , where t low and t up are the times at which the experimental TTCs are cropped such that they match the timescale of the simulation. k r denotes the factor for converting the Q range in units of pixels, which were used in the simulation, to units of inverse micrometre as they were used in the experiment via A differential evolution (Price, 1996) such that we obtain a set of 30 experimental TTCs. The differential evolution algorithm optimizes the mean dot product D between the normalized encoded representations of 30 experimental TTCs and 30 simulated TTCs from the training data set with respect to , É gel , k r , t up and t low : Z q sim ½ðTÞ; É gel ðTÞ; k r Á Z T;q exp ðt up ; t low Þ: ð7Þ The choice of boundaries for the calibration parameters was based on the assumption that the duration of the experiment is longer than that of the simulation, and that the chosen Q range [7.02; 8.2] mm À1 lies within the used R Q range for the simulations. Knowing that is proportional to ðT c À TÞ=T c , we apply additional boundaries to consider only those solutions where the average value of is decreasing as a function of experimental quench temperature.
During the fitting procedure, the experimental TTCs were cropped and encoded for each generated combination of the parameters. A caching technique was used to accelerate these calculations. The encoded vectors of the simulated TTCs were taken from the training data and interpolated on the basis of the generated calibration coefficient k r . After 1500 iterations the result of the fit converged to D ' 0:72. Fig. 7(a) shows the averaged outcomes for the simulation parameters É gel and . The rise in É gel for shallower quenches indicates that the gelation is occurring at higher concentrations at these temperatures. The (É gel ; ) pairs were integrated into a phase diagram [ Fig. 7(b)] which displays the spinodal and binodal lines as determined from the Landau free energy [equation (1)]. The predictions of the neural network enable us to estimate the corresponding gel line [line in the lower- Performance of the auto-encoder on simulated and experimental TTCs. The encoder network encodes the TTCs into a 32-dimensional vector from which the decoder network restores the TTC. On the experimental data, this procedure reduces the experimental TTC to the main dynamic components, enabling a comparison with simulated TTCs.

Figure 6
Example of encoded vector components' Z i dependence on R Q for a simulation with ¼ 0:88 and É gel ¼ 0:7. The error bars result from averaging over ten simulations with different initial noise. right part of Fig. 7(b) as a guide to the eye]. This line bends towards the spinodal line, resulting in lower concentrations of the dense phase for deeper quenches. Such a behavior of the gel line in the coexistence region has also been observed for the LLPS of BSA-YCl 3 (Da Vela et al., 2020), lysozyme (Cardinaux et al., 2007) and -globulin (Da Vela et al., 2017).
Averaging over the 30 outcomes for the calibration parameters yields k r ¼ ð0:185 AE 0:002Þ mm À1 pixel À1 ; ð8Þ t low ¼ ð5:02 AE 1:72Þ s; ð9Þ t up ¼ ð353:3 AE 5:7Þ s: A further validation of these results with respect to and É gel for the experimental TTCs is difficult as neither parameter is directly accessible in the experiment.

Discussion and conclusion
In conclusion, we successfully trained an auto-encoder neural network with TTC data originating from Cahn-Hilliard simulations. This auto-encoder was used to extract the main dynamic components out of TTCs originating from XPCS experiments of LLPS in a model protein solution of IgG. We used a differential evolution based algorithm for matching encoded TTCs from the experiment and the simulation and determine thereby the parameters necessary for connecting length and timescales of simulation to experiment. The simulation parameters of the classified experimental TTCs were used to construct a phase diagram with the experimental data indicating the position of a gel line. The training of neural networks based on dynamical simulation data is challenging. The Cahn-Hilliard model employed here, including the gelation, is rather simplistic and uses two labeling parameters only. However, we emphasize the possibility of extending this methodology to more complicated models capable of making strong predictions and eventually also benchmarking of theoretical models based on very large XPCS data sets, possibly by including the kinetic information in the model as well. This is a first step towards handling large amounts of XPCS data on short timescales and will be of general interest not only for protein dynamics but for material science aspects of spinodal decomposition.