Quantitative selection of sample structures in small-angle scattering using Bayesian methods

A Bayesian method is proposed for quantitatively selecting a mathematical model of a sample for small-angle scattering. The performance of this method is evaluated through numerical experiments on artificial data for a sample containing a mixture of multiple spherical particles.


Introduction
In recent years, the analysis of nano-scale structures of materials has become increasingly important in advancing the development of new materials and understanding biological phenomena. Small-angle scattering (SAS) is a fundamental experimental method for analyzing such nano-scale structures. It involves irradiating substances with X-rays or neutron beams and analyzing the resulting scattering intensity data at small angles, typically 5 degrees or less (Guinier & Fournet, 1955). SAS is versatile and applicable to a wide array of heterogeneous materials, including nanoparticles, polymers, soft materials, and fibers, and is utilized across many fields, including materials science, chemistry, and biology. SAS measurement data are expressed in terms of scattering intensity as a function of the scattering vector, a physical quantity representing the scattering angle.
Data analysis requires the selection and parameter estimation of a mathematical model of the scattering intensity, which contains information about the structure of the specimen. This selection process is critical because it involves assumptions about the structure of the specimen.
Traditionally, model selection in SAS data analysis has been performed by listing candidate models based on theoretical or empirical rules, conducting parameter fitting against the measurements, and comparing suitability using criteria such as the χ-squared error (Da Vela & Svergun, 2020; Svergun et al., 1995; Kline, 2006; Schneidman-Duhovny et al., 2010; Breßler et al., 2015). Alternatively, models may be chosen based on the general shape of the measurement data. However, these methods each have drawbacks: the former risks overfitting, which can lead to an overestimation of the model's degrees of freedom (Rambo & Tainer, 2013), while the latter yields only qualitative model selections. Furthermore, quantitatively evaluating the reliability of the results is challenging with traditional methods.
In this study, we propose a novel framework for SAS model selection that quantitatively assesses the validity of mathematical models representing specimen structures in measurements. This approach uses Bayesian model selection within the framework of Bayesian inference, a method increasingly applied to the analysis of various types of physical experimental data (Nagata et al., 2012; Nagata et al., 2019; Rappel et al., 2020; Machida et al., 2021; Nagai et al., 2021; Kashiwamura et al., 2022; Katakami et al., 2022; Ueda et al., 2023). In the context of SAS data, Bayesian inference has been used for grain size distributions (Asahara et al., 2021) and parameter estimation (Hayashi et al., 2023). This method solves inverse problems by establishing the likelihood, which is the data generation model, and the prior distribution, which corresponds to the prior knowledge about the target being estimated. The posterior distribution of the model and parameters is then calculated from the acquired data using Bayes' theorem. In our proposed method, the posterior probability of the data generation model given the measured data is calculated using the exchange Monte Carlo method (Hukushima & Nemoto, 1996), also known as parallel tempering; the resulting values are then compared among the candidate models while Bayesian estimates of the model parameters are obtained concurrently. Moreover, since the validity of the measured data model is obtained as a posterior probability, the reliability of the results can be quantified by comparing these probabilistic values. In this study, we conducted numerical experiments to assess the effectiveness of our proposed method.
These experiments are based on synthetic data used to estimate the number of distinct components in a specimen, which was modeled as a mixture of two types of monodisperse spheres of varying radii, scattering length densities, and volume fractions.This type of problem is challenging due to the risk of overfitting, as the candidate models have similar structures; however, the results demonstrate high accuracy, interpretability, and stability of our method, even in the presence of measurement noise.
The structure of this paper is as follows: we first formalize the proposed analytical method, then describe the model of multicomponent monodisperse spheres used in our numerical experiments. In Sect. 4, we detail the set-up and results of these experiments, using the proposed method to estimate the number of mixed components in the synthetic data. We then discuss the analytical capabilities of our method and the performance of the traditional method based on the degree of fitting. We conclude with implications and potential applications of our method.

Formulation of the Proposed Framework
In this section, we present a detailed formulation of our algorithm for selecting mathematical models for SAS specimens using Bayesian model selection. The pseudocode for this algorithm is provided in Algorithm 1.

Bayesian Model Selection
The process of generating experimental measurement data is generally described by a probabilistic model that includes noise components. The SAS measurement data consist of scattering intensities that correspond to the scattering vector. Because the scattering intensity is a measure of the number of photons incident on the detector, the scattering intensity values are assumed to follow a Poisson distribution (Durant et al., 2021; Katakami et al., 2022; Nagata et al., 2019; Straasø et al., 2013). Let I_K(q, Θ) be the mathematical model of the scattering intensity, characterized by the parameter K, for sample parameters Θ and scattering vector q. The likelihood, which is the probability of generating the measured value y, is then given by

p(y | q, Θ, K) = [I_K(q, Θ)^y / y!] exp(−I_K(q, Θ)).    (1)

Assuming that the measurement data D = {q_i, y_i}_{i=1}^N, which consist of N data points, are samples from an independent and identically distributed population under the given K and Θ, the likelihood is expressed as

p(D | Θ, K) = ∏_{i=1}^N [I_K(q_i, Θ)^{y_i} / y_i!] exp(−I_K(q_i, Θ)).    (2)

Here, we introduce the Poisson cost function

E(Θ; K) = (1/N) ∑_{i=1}^N [I_K(q_i, Θ) − y_i log I_K(q_i, Θ)],    (3)

which transforms the likelihood of the measured data in Eq. (2) into

p(D | Θ, K) = [∏_{i=1}^N 1/y_i!] exp(−N E(Θ; K)).    (4)

Let φ(K) be the prior distribution of the parameter K that characterizes the model, and let φ(Θ|K) be the prior distribution of the model parameters Θ. Then, from Bayes' theorem, the posterior distribution of the parameters given the measurement data can be written as

p(Θ | D, K) = p(D | Θ, K) φ(Θ | K) / Z(K),    (5)

where Z(K) is the marginal likelihood, which corresponds to the normalization constant of the posterior distribution of the parameters:

Z(K) = ∫ p(D | Θ, K) φ(Θ | K) dΘ    (6)
     ∝ ∫ exp(−N E(Θ; K)) φ(Θ | K) dΘ.    (7)

Furthermore, the probability of model K given the data D, denoted as p(K|D), is given by

p(K | D) = p(D | K) φ(K) / p(D)    (8)
         = Z(K) φ(K) / ∑_{K′} Z(K′) φ(K′)    (9)
         ∝ exp(−F(K)) φ(K),    (10)

where

F(K) = −log Z(K)    (11)

is referred to as the Bayesian free energy, also known as the stochastic complexity. The posterior probability of the model, p(K|D), can be interpreted as the validity of model K for the measurement data D.
In other words, calculating and comparing the value of p(K|D) for all candidate models {K} enables quantitative model selection. Note that in Bayesian model selection, the parameter K does not need to explicitly appear within the mathematical model of the specimen.
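To make the roles of Eqs. (1)-(4) concrete, the short sketch below computes the Poisson cost and verifies numerically that it differs from the exact Poisson log-likelihood only by the Θ-independent constant ∑_i log(y_i!). The function names and sample numbers are our own illustrations, not from the paper.

```python
import numpy as np
from math import lgamma

def poisson_cost(y, intensity):
    """Poisson cost E(Theta; K), Eq. (3): the average of
    I_K(q_i) - y_i * log I_K(q_i) over the N data points."""
    y = np.asarray(y, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    return float(np.mean(intensity - y * np.log(intensity)))

def poisson_log_likelihood(y, intensity):
    """Exact Poisson log-likelihood, sum_i log p(y_i | I_i), Eq. (2)."""
    return float(sum(yi * np.log(Ii) - Ii - lgamma(yi + 1)
                     for yi, Ii in zip(y, intensity)))

# The log-likelihood equals -N * E(Theta; K) minus the Theta-independent
# constant sum_i log(y_i!), so minimizing the cost maximizes the likelihood.
y = [3, 5, 2, 4]
I = [3.2, 4.8, 2.1, 4.1]
N = len(y)
const = sum(lgamma(yi + 1) for yi in y)
assert abs(poisson_log_likelihood(y, I) - (-N * poisson_cost(y, I) - const)) < 1e-9
```

Because the constant ∏ 1/y_i! does not depend on Θ or K, it cancels in both parameter estimation and model comparison, which is why the cost function of Eq. (3) suffices.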

Calculation of Marginal Likelihood
In our Bayesian model selection method, the Bayesian free energy F(K) and the probability p(K|D) are calculated and compared for all candidate models. This computation relies on determining the marginal likelihood Z(K), as expressed in Eq. (7).
The marginal likelihood generally involves multi-dimensional integration, which can be computationally intensive and unstable. To address this challenge, our framework uses the replica exchange Monte Carlo method (REMC) to calculate the marginal likelihood (Hukushima & Nemoto, 1996).
This method facilitates sampling from the desired probability distribution at multiple inverse temperatures, referred to as replicas, using the Markov chain Monte Carlo (MCMC) method, and strategically exchanges states between adjacent inverse temperatures at arbitrary intervals, thus avoiding local minima. To calculate the marginal likelihood using REMC, we establish a series of L inverse temperatures {β_l}_{l=1}^L that satisfy the relation

0 = β_1 < β_2 < ⋯ < β_L = 1.    (12)

Sampling is performed from the joint probability distribution over all inverse temperatures,

p(Θ_1, …, Θ_L | D, K) = ∏_{l=1}^L p(Θ_l | D, K, β_l),    (13)

where Θ_l denotes the model parameters at the l-th inverse temperature β_l. The posterior distribution p(Θ_l | D, K, β_l) satisfies the following relation:

p(Θ_l | D, K, β_l) ∝ exp(−N β_l E(Θ_l; K)) φ(Θ_l | K).    (14)

These distributions are sampled using MCMC at each inverse temperature, as expressed in Eq. (14), and states at adjacent inverse temperatures are periodically exchanged with a probability that satisfies the detailed balance condition. The probability of exchanging the l-th and (l + 1)-th states, p(Θ_l ↔ Θ_{l+1}), is

p(Θ_l ↔ Θ_{l+1}) = min[1, exp(N (β_{l+1} − β_l) (E(Θ_{l+1}; K) − E(Θ_l; K)))].    (15)

The marginal likelihood expressed in Eq. (7) can be efficiently determined using samples from the various inverse temperatures sampled by REMC. The marginal likelihood Z(K, β) at inverse temperature β is expressed as

Z(K, β) = ∫ exp(−N β E(Θ; K)) φ(Θ | K) dΘ.    (16)

The target marginal likelihood expressed in Eq. (7) is then equivalent to Z(K, β = 1). Using the relation in Eq. (12), which implies Z(K, β_1 = 0) = 1, Z(K, β = 1) can be expressed as follows:

Z(K, β = 1) = ∏_{l=1}^{L−1} [Z(K, β_{l+1}) / Z(K, β_l)]
            = ∏_{l=1}^{L−1} ⟨exp(−N (β_{l+1} − β_l) E(Θ_l; K))⟩_{p(Θ_l | D, K, β_l)}.    (20)

In Eq. (20), the symbol ⟨•⟩_{p(Θ_l|D,K,β_l)} denotes the expectation value with respect to p(Θ_l | D, K, β_l). Computing Eq. (20) using sampling with REMC provides the marginal likelihood expressed in Eq. (7). Once the marginal likelihood Z(K) is determined, we can find the Bayesian free energy expressed in Eq. (11) and the posterior probability of model K given the measurement data expressed in Eq. (10). For the numerical experiments presented here, we used the Metropolis method (Metropolis et al., 1953) for MCMC sampling of the posterior distributions at each inverse temperature, as expressed in Eq. (14).
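The REMC procedure can be sketched on a deliberately simple toy problem: estimating the mean of Poisson counts under a uniform prior. All choices below (the problem itself, the temperature ladder, step size, and replica count) are illustrative assumptions, not the paper's settings; the code only shows the mechanics of within-replica Metropolis updates, adjacent-replica exchanges, and the product-of-expectations estimator of Eq. (20).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (not the paper's sphere model): infer the mean mu of Poisson
# counts y_i under a uniform prior on [0.1, 20]; all settings are illustrative.
N = 50
y = rng.poisson(5.0, N)

def cost(mu):
    # Poisson cost E(mu) of Eq. (3) for the constant-intensity model I = mu
    return np.mean(mu - y * np.log(mu))

lo, hi = 0.1, 20.0
L = 16                                    # number of replicas
betas = np.concatenate(([0.0], np.geomspace(1e-3, 1.0, L - 1)))
mus = rng.uniform(lo, hi, L)              # one Metropolis chain per replica
sweeps, step = 2000, 1.0
ratios = np.zeros(L - 1)                  # accumulators for Eq. (20)

for _ in range(sweeps):
    for l in range(L):                    # within-replica Metropolis update
        prop = mus[l] + step * rng.uniform(-1.0, 1.0)
        if lo < prop < hi and np.log(rng.random()) < -N * betas[l] * (cost(prop) - cost(mus[l])):
            mus[l] = prop
    for l in range(L - 1):                # adjacent-replica exchange, Eq. (15)
        if np.log(rng.random()) < N * (betas[l + 1] - betas[l]) * (cost(mus[l + 1]) - cost(mus[l])):
            mus[l], mus[l + 1] = mus[l + 1], mus[l]
    for l in range(L - 1):                # <exp(-N (b_{l+1} - b_l) E)> at b_l
        ratios[l] += np.exp(-N * (betas[l + 1] - betas[l]) * cost(mus[l]))

# Eq. (20): log Z(beta = 1) as a sum of log expectation ratios; Z at
# beta = 0 equals 1 because the prior is normalized.
log_Z = float(np.sum(np.log(ratios / sweeps)))
```

The acceptance tests are done in log space to avoid overflow of the exponential; in a real analysis the ladder spacing and step sizes would be tuned from the observed exchange and acceptance rates, as described in Sect. 4.
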

Estimation of Model Parameters
During the marginal likelihood calculation, the posterior distribution p(Θ_L | D, K, β_L = 1) is obtained, which directly represents the Bayesian estimate of the model parameters (Hayashi et al., 2023). Therefore, parameter estimation is conducted simultaneously as the Bayesian model selection is performed. Moreover, since the posterior distribution is sampled using REMC, it can provide a global parameter estimate. Additionally, the reliability of the estimation can be assessed from the statistical properties of the sampled posterior distribution.
In Bayesian estimation, the maximum a posteriori (MAP) solution provides a point estimate of the parameters. From Eq. (14), the MAP solution Θ_MAP for the parameters of model K is

Θ_MAP = argmax_Θ p(Θ | D, K, β = 1) = argmax_Θ exp(−N E(Θ; K)) φ(Θ | K).    (21)
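In practice, once REMC sampling is complete, the MAP estimate can be taken as the sampled state that maximizes the β = 1 posterior; with a flat prior this is the sample minimizing the Poisson cost. A minimal sketch with hypothetical values:

```python
import numpy as np

# Hypothetical posterior samples of a single parameter mu from the beta = 1
# replica, and stand-in count data; with a flat prior the MAP estimate is the
# sample with the lowest Poisson cost.
y = np.array([5, 4, 6, 5, 5, 7, 3])
samples = np.array([4.8, 5.1, 5.3, 4.9, 5.0])

def cost(mu):
    # Poisson cost of Eq. (3) for the constant-intensity model I = mu
    return np.mean(mu - y * np.log(mu))

theta_map = samples[np.argmin([cost(s) for s in samples])]
assert theta_map == 5.0  # the cost is minimized at mu = mean(y) = 5.0
```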

Formulation of a Multicomponent Monodisperse Spheres Model
In this section, we describe a model for the scattering intensity of a dilute sample composed of multicomponent monodisperse spheres (Guinier & Fournet, 1955; Hashimoto, 2022). This model serves as the basis for evaluating the performance of the proposed method.
Let e_i and e_s represent the unit vectors in the direction of the wave number vector of the incident and scattered beams, respectively. If e_i and e_s form an angle 2α and the wavelength of the beam is λ, then the scattering vector q is given by

q = (2π/λ)(e_s − e_i).    (22)

Assuming isotropic scattering, we consider only the magnitude q = |q| = (4π/λ) sin α of the scattering vector.
Monodisperse spheres are spherical particles of uniform radius. The scattering intensity I(q, θ) of a specimen composed of sufficiently dilute monodisperse spheres of a single type is, for the scattering vector magnitude q, given by

I(q, θ) = S V [(sin(qR) − qR cos(qR)) / (qR)^3]^2 + B,    (23)
where V = (4/3)πR^3. If the difference in scattering length density between the solute and solvent of the specimen is ∆ρ and the volume fraction is ϕ, then S = (3∆ρ)^2 ϕ. The parameters θ of this model are the particle radius R, the scale S, and the background B.
To formulate the scattering intensity of a specimen composed of K types of monodisperse spheres, we assume a dilute system and denote the particle radius of the k-th component in the sample as R_k and the scale as S_k. The scattering intensity of a sample composed of K types of monodisperse spheres is then given by

I_K(q, Θ) = ∑_{k=1}^{K} S_k V_k [(sin(qR_k) − qR_k cos(qR_k)) / (qR_k)^3]^2 + B,    (24)

where V_k = (4/3)πR_k^3. The model parameters Θ for this scattering intensity are Θ = {{R_k, S_k}_{k=1}^{K}, B}.
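A direct implementation of Eqs. (23) and (24) is straightforward; the sketch below uses our own function names and checks the small-q limit, where the bracketed factor tends to 1/3 so that I(0) ≈ S V / 9 + B.

```python
import numpy as np

def sphere_intensity(q, R, S, B=0.0):
    """Scattering intensity of dilute monodisperse spheres, Eq. (23):
    I(q) = S * V * [(sin(qR) - qR cos(qR)) / (qR)^3]^2 + B, V = 4/3 pi R^3."""
    qR = np.asarray(q, dtype=float) * R
    V = 4.0 / 3.0 * np.pi * R**3
    amp = (np.sin(qR) - qR * np.cos(qR)) / qR**3
    return S * V * amp**2 + B

def multi_sphere_intensity(q, radii, scales, B):
    """K-component mixture, Eq. (24): sum of single-sphere terms plus background."""
    total = np.zeros_like(np.asarray(q, dtype=float))
    for R, S in zip(radii, scales):
        total += sphere_intensity(q, R, S)
    return total + B

# Small-q check: the bracketed factor tends to 1/3, so I -> S*V/9 + B.
R, S = 10.0, 2.0
V = 4.0 / 3.0 * np.pi * R**3
val = sphere_intensity(np.array([1e-3]), R, S)[0]
assert abs(val - S * V / 9.0) < 0.1
```

Note that very small qR values invite catastrophic cancellation in sin(qR) − qR cos(qR); a production implementation would switch to the series expansion of the bracket near qR = 0.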

Numerical Experiments
Here, we present numerical experiments evaluating model selection among models with K ranging from 1 to 4 components, to demonstrate the capabilities of the proposed framework. We apply the framework to synthetic data generated to represent a system with two types (K = 2) of monodisperse spheres, as described by Eq. (24). This experimental design poses a challenge for discerning the true structure of the specimen: despite the simple structure, the similarity of the candidate models increases the risk of overfitting.
In typical SAS experiments, the scale parameter S_k in Eq. (24) tends to be small. Therefore, we work with a normalized scale parameter S̃_k, and accordingly refer to the model parameters as Θ = {{R_k, S̃_k}_{k=1}^{K}, B}.
The numerical experiments reported in this section were conducted with a burn-in period of 10^5 steps and a sample size of 10^5 for the REMC. We set the number of replicas for REMC, the values of the inverse temperatures, and the step size of the Metropolis method taking into consideration the state exchange rate and the acceptance rate.

Generation of Synthetic Data
The scattering intensity in SAS experiments, which is typically recorded as count data, is subject to Poisson noise, as described by Eq. (1). We therefore generated synthetic data D using the following procedure:

1. Set the number of data points to N = 400, and define the scattering vectors {q_i}_{i=1}^N at the N equally spaced points within the interval [0.1, 3].
2. Assume K = 2 and set the true model parameters Θ to Θ*.
3. Calculate the scattering intensity at the scattering vectors {q_i}_{i=1}^N obtained in step 1, using the model in Eq. (24) and Θ*. Introduce a pseudo-measurement time T to adjust the noise intensity in the data, to obtain {I(q_i, Θ*, K) × T}_{i=1}^N.
4. Generate measurement values {y_i}_{i=1}^N by sampling from the Poisson distribution with mean I(q_i, Θ*, K) × T for each data point.

In this section, we consider cases with pseudo-measurement times of T = 1 and T = 0.1. Generally, smaller values of T indicate greater effects from measurement noise.
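The steps above can be sketched as follows; the true parameter values here are hypothetical placeholders (Table 1 gives the values actually used), and the final rescaling of the counts by 1/T is our reading of the procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

def sphere_term(q, R, S):
    # Single-sphere term of Eq. (24)
    qR = q * R
    V = 4.0 / 3.0 * np.pi * R**3
    return S * V * ((np.sin(qR) - qR * np.cos(qR)) / qR**3)**2

# Step 1: N = 400 equally spaced scattering vectors on [0.1, 3]
N = 400
q = np.linspace(0.1, 3.0, N)

# Step 2: true K = 2 parameters (hypothetical values for illustration only)
R1, S1, R2, S2, B = 10.0, 2.5e-4, 5.0, 1.0e-4, 0.01

# Step 3: noiseless intensity, scaled by the pseudo-measurement time T
T = 1.0
I_true = sphere_term(q, R1, S1) + sphere_term(q, R2, S2) + B

# Step 4: Poisson counts with mean I * T; dividing by T restores the
# original intensity scale (an assumption of this sketch)
y = rng.poisson(I_true * T) / T
assert y.shape == (N,) and (y >= 0).all()
```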

Setting the Prior Distributions
In the Bayesian model selection framework, prior knowledge concerning the parameters Θ and the model-characterizing parameter K is set as their prior distributions.
In this numerical experiment, the prior distributions for the parameters Θ were set as Gamma distributions based on the pseudo-measurement time T used during data generation, while the prior for K was a discrete uniform distribution over the interval [1, 4].

Results for Two-Component Monodisperse Spheres Based on Scale Ratio
The ratio of the scale parameters S_1 and S_2 for spheres 1 and 2 during data generation is denoted as r_S. We now present the analysis results from applying our proposed method to 6 types of data generated by varying the value of r_S, for pseudo-measurement times of T = 1 and 0.1. Table 1 displays the parameter values used for generating the synthetic data. Figure 1 shows the fitting results for each model using the MAP solution for the synthetic data generated with the parameter values given in Table 1. In Fig. 1 (a)-(c) and (g)-(i), it is apparent that the model with K = 1 fails to accurately represent the data. However, the fitting curves for models with K = 2-4 are almost identical in shape. Furthermore, the data shown in Fig. 1 (d)-(f) and (j)-(l) are difficult to distinguish from the well-known scattering data of a single type of monodisperse sphere (K = 1), making it challenging to qualitatively compare the goodness of fit among the models with K = 1-4. Table 2 summarizes the number of times each model was found to have the highest probability in numerical experiments using the 10 separate datasets shown in Fig. 2.
For values of r_S = 0.0004 and above, the true model K = 2 was most highly supported (Table 2). Aside from the cases of r_R = 0.5 in (d) and (k), the profiles of the data in Fig. 3 are very similar to those of a single monodisperse sphere, and the fitting curves for models K = 1 to K = 4 are nearly identical in shape. In contrast, the data for cases (d) and (k) with r_R = 0.5 have a complex profile, and the model with K = 1 represents the data poorly. Table 4 presents the results of numerical experiments for the 10 separate datasets shown in Fig. 4, summarizing the number of times each model K = 1-4 was most highly supported. Near the analytical limits of the proposed method, it is advisable to verify the analysis using synthetic data whose noise intensity and anticipated parameter values are similar to those of the measured data. This step can help ensure a more reliable analysis, as detailed below.
The scale parameter S is the product of the square of the difference in scattering length density between the solvent and the specimen and the volume fraction. Consequently, r_S can become extremely small when there is little difference in scattering length density between the solvent and a component of the specimen, or when there is a significant difference in the mixing ratio of the components. The results in Fig. 2 and Table 2 for a pseudo-measurement time of T = 1 (Panel A) indicate that the model selection favored non-true models at a scale ratio of r_S = 0.0002. Similarly, for T = 0.1 (Panel B), non-true models were favored at scale ratios of r_S = 0.0004 and r_S = 0.0002, indicating that these cases exceed the analytical capabilities of the proposed method. These findings imply that, within the experimental parameters of this study, the proposed method reliably identifies the true model with a high probability for scale ratios down to r_S = 0.0004 at T = 1 and down to r_S = 0.002 at T = 0.1.
In Sect. 4.2, we investigated the effect of varying the radius ratio r_R. When components of different radii are mixed, it is important to consider not only simple mixtures but also instances of aggregated specimens. The findings shown in Fig. 4 and Table 4 indicate that the proposed method reaches its analytical limits as r_R approaches 1 and as it approaches 0. As r_R nears 1, the scattering profiles of the two-component system become similar to that of a single-component system, leading to the selection of the single-component model (K = 1). We identified an analytical limit at r_R = 0.99 for both T = 1 and T = 0.1. The results for r_R = 0.97 show that at T = 0.1, which has a higher noise intensity than T = 1, the posterior probability of the single-component model (K = 1) increases, resulting in an unstable analysis. Conversely, as r_R approaches 0, the results for r_R = 0.03 at T = 1 and for r_R = 0.04 and r_R = 0.03 at T = 0.1 associate the single-component model (K = 1) with high probability, indicating an analytical limit. Overall, the proposed method demonstrates the ability to select the true model with high probability for radius ratios ranging from r_R = 0.04 to 0.99 at T = 1, and from r_R = 0.05 to 0.99 at T = 0.1.

Model Selection Based on Cost Function Values
Selecting an appropriate mathematical model is a critical step in SAS data analysis, yet it remains a challenge. Traditionally, model fitting has involved comparing a set of candidate models using metrics such as the χ-squared error to assist in model selection.
However, the χ-squared error assumes a Gaussian distribution of the measurement noise. In this study, we assume Poisson-distributed measurement noise, and therefore discuss the results of model selection by calculating and comparing the values of the Poisson cost function, as formulated in Eq. (3), across models K = 1-4, using the measurement data and fitting curves derived from the MAP solutions.
Table 5 reports the frequency with which each model, from K = 1 to K = 4, minimized the Poisson cost value. These results are based on the same datasets described in Sect. 4.4, which were generated by varying the random seed for each of the 6 distinct r_S values determined by the parameters listed in Table 1, with 10 datasets produced for each r_S value at T = 1. The results in Table 5 indicate that for (a) and (e) two models are tied as most supported, suggesting that it is difficult to obtain reliable results. In cases (b) and (c), the K = 3 model is most supported, likely as a result of overfitting, failing to select the correct model K = 2. Furthermore, in cases (d) and (f), where the true model K = 2 was selected, the K = 3 model is supported similarly often. In contrast, the results of the proposed method in Table 2-A demonstrate that the correct model K = 2 is supported in all 10 out of 10 cases from (a) to (e). Within the analyzable range described in the previous subsection, the model selection results are thus highly stable against data noise.
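The mechanics of this cost-based comparison can be illustrated with two hypothetical fitted curves evaluated against the same stand-in count data; the curves below are not MAP fits from the paper, merely an example in which the better-matching model attains the lower Poisson cost.

```python
import numpy as np

def poisson_cost(y, intensity):
    """Poisson cost of Eq. (3)."""
    return float(np.mean(intensity - y * np.log(intensity)))

# Stand-in data and two hypothetical fitted curves: model "A" matches the
# data-generating curve, while model "B" has a deliberately wrong scale.
q = np.linspace(0.1, 3.0, 200)
curve = 100.0 * np.exp(-q)
y = np.maximum(np.round(curve), 1.0)   # integer stand-in "counts"
fits = {"A": curve, "B": 0.9 * curve}

# Select the candidate with the lowest Poisson cost
best = min(fits, key=lambda k: poisson_cost(y, fits[k]))
assert best == "A"
```

As the discussion above shows, however, a lower residual alone does not guarantee selection of the true model: a more flexible candidate can attain an equal or lower cost by overfitting, which is precisely the failure mode the Bayesian free energy penalizes.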

Conclusions
In this paper, we introduced a Bayesian model selection framework for SAS data analysis that quantitatively evaluates model validity through posterior probabilities. We conducted numerical experiments using synthetic data for a two-component system of monodisperse spheres to assess the performance of the proposed method. We identified the analytical limits of the proposed method, under the settings of this study, with respect to the scale and radius ratios of two-component spherical particles. Additionally, we compared its performance with that of traditional fit-based model selection using the Poisson cost function. The numerical experiments and subsequent discussion revealed the range of parameters that can be analyzed using the proposed method. Within that range, our method provides stable and highly accurate model selection, even for data with significant noise or in situations in which qualitative model determination is challenging. Furthermore, compared with the traditional method of selecting models based on fitting curves and data residuals, the proposed method offers greater accuracy and stability.
SAS is used to study specimens with a variety of structures other than spheres, including cylinders, core-shells, and lamellae. The proposed method should be applied to other sample models to determine the feasibility of extending the analysis beyond the case examined here to broader experimental settings. Future work can benefit from using the proposed method for real data analysis, which is expected to yield new insights through a more efficient analysis approach.

Fig. 1 .
Fig. 1. Fitting to synthetic data generated at various r_S values with MAP solutions. Panels A and B show the cases for pseudo-measurement times of T = 1 and T = 0.1, respectively. In cases (a)-(f) and (g)-(l), the scale ratio r_S is displayed in descending order for T = 1 and T = 0.1, respectively. Black circles represent the generated data, the black dashed line represents the true scattering intensity curve, and the solid blue, red, green, and orange lines represent the fitting curves using the MAP solution for K = 1, K = 2, K = 3, and K = 4, respectively.

Figure 2
Figure 2 presents the Bayesian model selection results using our proposed framework. Figure 2-A contains results for the case with T = 1, and Figure 2-B contains results for T = 0.1, each showing the probability p(K|D) of model K based on the synthetic data D for each scale ratio r_S. Here, 10 datasets were created for each parameter value by varying the random seed during data generation; the average value of p(K|D) is indicated by the height of the bar graph, with error bars indicating the maximum and minimum values. For the relatively large scale ratios r_S in (a)-(e) of Fig. 2-A, the true model with K = 2 has a high probability, while the average value of p(K|D) is highest for K = 1 in (f). Conversely, in Fig. 2-B, the true model with K = 2 is associated with high probability in cases (g)-(j), while in cases (k) and (l), K = 1 is associated with the highest probability.

Fig. 3 .
Fig. 3. Fitting to synthetic data generated at various r_R values with MAP solutions. Panels A and B show the cases for pseudo-measurement times of T = 1 and T = 0.1, respectively. In cases (a)-(g) and (h)-(n), the radius ratio r_R is displayed in descending order for T = 1 and T = 0.1, respectively. The black dots represent the generated data, the black dashed line represents the true scattering intensity curve, and the solid blue, red, green, and orange lines represent the fitting curves using the MAP solution for K = 1, K = 2, K = 3, and K = 4, respectively.

Figure 4
Figure 4 displays the results of Bayesian model selection using synthetic data generated by varying the radius ratio r_R. Here, 10 datasets were created for each parameter value by varying the random seed during data generation; the average value of p(K|D) is indicated by the height of the bar graph, with the maximum and minimum values shown as error bars. Unlike the results for variations in the scale ratio shown in Fig. 2, the model selection procedure fails not only when the radius ratio r_R is close to 0 but also when it is close to 1, with K = 1 being most highly supported. Additionally, in the case of r_R = 0.04, the result for T = 1 in case (f) supports the true model K = 2, but for T = 0.1 in case (m), the alternative model K = 1 is most supported. However, in cases (b)-(f) and (i)-(l), the true model K = 2 is associated with a high average probability (Fig. 4).

Fig. 4 .
Fig. 4. Results of Bayesian model selection among models K = 1-4 for varying r_R values. Panel A shows the posterior probability of each model using data generated with a pseudo-measurement time of T = 1, and Panel B shows results for T = 0.1. In cases (a)-(g) and (h)-(n), the radius ratio r_R is displayed in descending order for T = 1 and T = 0.1, respectively. The height of each bar corresponds to the average value calculated for 10 datasets generated with different random seeds, with the maximum and minimum values shown as error bars. Areas highlighted in red indicate cases where the true model K = 2 was most highly supported, while blue backgrounds indicate that a model other than K = 2 had the highest probability.
The above findings demonstrate that evaluating how well a model fits the measurement data based on residual values such as the Poisson cost, and comparing candidate models on that basis, does not consistently result in selection of the true model.
This work was supported by JST CREST (Grant Nos. JPMJCR1761 and JPMJCR1861) from the Japan Science and Technology Agency (JST) and by a JSPS KAKENHI Grant-in-Aid for Scientific Research (A) (No. 23H00486).

Table 1 .
Parameter values used for data generation with varying r S .

Table 3 .
Parameter values used for data generation when varying r R .

Table 5 .
Results of model selection based on the Poisson cost function. The cases (a)-(f) correspond to the settings in Figs. 1-3 of Sect. 4.4. The table indicates the number of times each model was found to have the lowest Poisson cost value for 10 datasets generated with different random seeds for each r_S value at T = 1. The most highly supported model, if unique, is shown in bold.