computer programs

JOURNAL OF APPLIED CRYSTALLOGRAPHY
ISSN: 1600-5767

refnx: neutron and X-ray reflectometry analysis in Python


aANSTO, Locked Bag 2001, Kirrawee DC, NSW 2232, Australia, and bSchool of Chemical Engineering, University of New South Wales, Sydney, NSW 2052, Australia
*Correspondence e-mail: andrew.nelson@ansto.gov.au

Edited by G. Renaud, CEA-Grenoble DSM/INAC/SP2M/NRS, Grenoble, France (Received 31 August 2018; accepted 5 December 2018)

refnx is a model-based neutron and X-ray reflectometry data analysis package written in Python. It is cross platform and has been tested on Linux, macOS and Windows. Its graphical user interface is browser based, through a Jupyter notebook. Model construction is modular, being composed from a series of components that each describe a subset of the interface, parameterized in terms of physically relevant parameters (volume fraction of a polymer, lipid area per molecule etc.). The model and data are used to create an objective, which is used to calculate the residuals, log-likelihood and log-prior probabilities of the system. Objectives are combined to perform co-refinement of multiple data sets and mixed-area models. Prior knowledge of parameter values is encoded as probability distribution functions or bounds on all parameters in the system. Additional prior probability terms can be defined for sets of components, over and above those available from the parameters alone. Algebraic parameter constraints are available. The software offers a choice of fitting approaches, including least-squares (global and gradient-based optimizers) and a Bayesian approach using a Markov-chain Monte Carlo algorithm to investigate the posterior distribution of the model parameters. The Bayesian approach is useful for examining parameter covariances, model selection and variability in the resulting scattering length density profiles. The package is designed to facilitate reproducible research; its use in Jupyter notebooks, and subsequent distribution of those notebooks as supporting information, permits straightforward reproduction of analyses.

1. Introduction

The use of specular X-ray and neutron reflectometry for the morphological characterization of thin films in the approximate size range 10–5000 Å has grown remarkably over the past few years (Wood & Clarke, 2017[Wood, M. & Clarke, S. (2017). Metals, 7, 304.]; Daillant & Gibaud, 2009[Daillant, J. & Gibaud, A. (2009). Editors. X-ray and Neutron Reflectivity: Principles and Applications, Lecture Notes in Physics, Vol. 770. Heidelberg: Springer Verlag.]). Most neutron and X-ray sources have instruments to perform reflectometry measurements, and there is an ongoing need for accessible software programs for users of those instruments to analyse their data in a straightforward fashion, including the co-refinement of multiple contrast data sets. Several programs are available for this purpose, with a variety of different features (Nelson, 2006[Nelson, A. (2006). J. Appl. Cryst. 39, 273-276.]; Björck & Andersson, 2007[Björck, M. & Andersson, G. (2007). J. Appl. Cryst. 40, 1174-1178.]; Kienzle et al., 2011[Kienzle, P. A., Krycka, J., Patel, N. & Sahin, I. (2011). Refl1d - Depth Profile Modelling, https://reflectometry.org/danse/docs/refl1d/.]; Gerelli, 2016[Gerelli, Y. (2016). J. Appl. Cryst. 49, 330-339.]; Hughes et al., 2016[Hughes, A. V., Ciesielski, F., Kalli, A. C., Clifton, L. A., Charlton, T. R., Sansom, M. S. P. & Webster, J. R. P. (2016). Acta Cryst. D72, 1227-1240.]). These programs typically create a model of the interface, and either incrementally refine the model against the data using least-squares methods or use Bayesian approaches (Sivia & Skilling, 2006[Sivia, D. & Skilling, J. (2006). Data Analysis: A Bayesian Tutorial. Oxford Science Publications.]; Kienzle et al., 2011[Kienzle, P. A., Krycka, J., Patel, N. & Sahin, I. (2011). Refl1d - Depth Profile Modelling, https://reflectometry.org/danse/docs/refl1d/.]; Hogg et al., 2010[Hogg, D. W., Bovy, J. & Lang, D. (2010). ArXiv e-prints. arXiv: 1008.4686.]) to examine the posterior probability distribution of the parameters (i.e. the statistical variation of the parameters in a model).

Given the number of publications arising from the reflectometry technique, it is vital that both the experiments and analyses are reproducible. Reproducibility in research is an underlying principle of science, but unfortunately it is not always possible to reproduce the results of others (Stark, 2018[Stark, P. (2018). Nature, 557, 613.]), because there is frequently not enough information provided in journal articles to repeat the analyses. Even if the data sets and the software packages used to analyse them are supplied in supporting information (most often they are not), a comprehensive, ordered set of instructions or a codified workflow would need to be provided (Möller et al., 2017[Möller, S., Prescott, S. W., Wirzenius, L., Reinholdtsen, P., Chapman, B., Prins, P., Soiland-Reyes, S., Klötzl, F., Bangnacani, A., Kalaš, M., Tille, A. & Crusoe, M. R. (2017). Data Sci. Eng. 2, 232-244.]). One example addressing this reproducibility issue is the set of guidelines from the small-angle scattering community for the deposition of data and associated models (Trewhella et al., 2017[Trewhella, J., Duff, A. P., Durand, D., Gabel, F., Guss, J. M., Hendrickson, W. A., Hura, G. L., Jacques, D. A., Kirby, N. M., Kwan, A. H., Pérez, J., Pollack, L., Ryan, T. M., Sali, A., Schneidman-Duhovny, D., Schwede, T., Svergun, D. I., Sugiyama, M., Tainer, J. A., Vachette, P., Westbrook, J. & Whitten, A. E. (2017). Acta Cryst. D73, 710-728.]; Pauw, 2013[Pauw, B. R. (2013). J. Phys. Cond. Matter, 25, 383201.]).

Here, we outline a new reflectometry analysis package, refnx (version number 0.1 is used in this paper; Nelson & Prescott, 2018b[Nelson, A. & Prescott, S. W. (2018b). refnx - Neutron and X-ray Reflectometry Analysis in Python, https://www.github.com/refnx/refnx.]), that helps to address the reproducibility issue for the reflectometry community (we do not mean that other programs are irreproducible, rather that the information provided in journal articles is often lacking). It does this by creating a scripted analysis workflow that is readily deposited alongside the publication, such as we have done with this paper (see the supporting information). The refnx Python package is specifically designed for use in Jupyter notebooks (Kluyver et al., 2016[Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S. & Willing, C. (2016). Positioning and Power in Academic Publishing: Players, Agents and Agendas, edited by F. Loizides & B. Schmidt, pp. 87-90. Amsterdam: IOS Press.]), which provide a literate programming environment that mixes executable code cells, rich documentation of the steps that were performed and the computational output. By including the analysis, as performed by the authors, in such a notebook, and appending it as supporting information along with the data, readers are empowered to replicate the exact data analysis and potentially extend the analysis, provided they have set up the same computing environment (Millman & Pérez, 2014[Millman, K. J. & Pérez, F. (2014). Implementing Reproducible Research, edited by V. Stodden, F. Leisch & R. D. Peng, ch. 6. London: Chapman & Hall.]). Setting up the computing environment is simplified using the conda package manager (Continuum Analytics, 2017[Continuum Analytics (2017). Conda - Package, Dependency and Environment Management for any Language, https://conda.io/docs/.]) and an environment file, although other approaches are available.

2. Method

refnx is written in Python with an extensible object-oriented design (Fig. 1[link]) in which the user creates a model of the sample based on what they know about its composition, with refinement of that model against the data. As with Motofit (Nelson, 2006[Nelson, A. (2006). J. Appl. Cryst. 39, 273-276.]), it calculates reflectivity using the Abeles method (Heavens, 1955[Heavens, O. (1955). Optical Properties of Thin Films. London: Butterworth.]) for specular reflection from a stratified medium. Detailed documentation for refnx is available online (https://refnx.readthedocs.io/) and is distributed with the package.
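As a sketch of what such a calculation involves, the Abeles method for a stack of uniform slabs can be written in a few lines of NumPy. This is a minimal illustration of the physics, not refnx's own (optimized) implementation; the function name and slab-array layout are our own choices:

```python
import numpy as np

def abeles(q, slabs):
    """Specular reflectivity of a stratified medium (Abeles matrix method).

    q     : array of momentum-transfer values (A^-1)
    slabs : (nlayers + 2, 3) array of [thickness (A), SLD (A^-2), roughness (A)];
            the first row is the fronting medium and the last the backing
            (their thicknesses are ignored).
    """
    d, sld, sig = slabs[:, 0], slabs[:, 1], slabs[:, 2]
    refl = np.empty(len(q))
    for i, qq in enumerate(q):
        # wavevector component normal to the interface in each medium
        kz = np.sqrt((qq / 2) ** 2 - 4 * np.pi * (sld - sld[0]) + 0j)
        M = np.eye(2, dtype=complex)
        for n in range(len(sld) - 1):
            # Fresnel coefficient, damped by the Nevot-Croce roughness factor
            r = (kz[n] - kz[n + 1]) / (kz[n] + kz[n + 1])
            r = r * np.exp(-2 * kz[n] * kz[n + 1] * sig[n + 1] ** 2)
            beta = kz[n] * d[n] if n > 0 else 0.0  # phase thickness of layer n
            e = np.exp(1j * beta)
            M = M @ np.array([[e, r * e], [r / e, 1 / e]])
        refl[i] = np.abs(M[1, 0] / M[0, 0]) ** 2
    return refl
```

For a bare substrate the recursion collapses to the Fresnel reflectivity, with total reflection below the critical edge.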

[Figure 1]
Figure 1
A schematic flow chart, showing the relationship between the classes that make up a typical reflectometry curve-fitting problem. The key step for the user is assembling materials (Component) such as a `Slab: Component' (a Component that is a Slab) and encoding prior knowledge into each Parameter that describes that Component.

The building block of the analysis is the Parameter object, which holds a model value [e.g. the scattering length density (SLD) of a material], a flag specifying whether that value is allowed to vary in a fit, and a bounds attribute. The bounds are a probability distribution representing pre-existing knowledge of a parameter's value, called a prior probability. A prior might be a simple uniform distribution that specifies a lower and upper bound (e.g. volume fraction is in the interval [0, 1]), or a normal distribution that represents an experimentally derived value and its associated uncertainty (e.g. thickness is 100 ± 4 Å). Any of the scipy.stats (Jones et al., 2001[Jones, E., Oliphant, T., Peterson, P. et al. (2001). SciPy: Open Source Scientific Tools for Python, https://www.scipy.org/.]) continuous distributions, or other distributions created by the user, can be used for this purpose. Algebraic relationships between Parameter objects can be applied to permit more sophisticated constraints that cross between Component objects (e.g. the sum of the thicknesses of several layers is known to some uncertainty).
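For illustration, this prior machinery can be mimicked directly with scipy.stats distributions; the dictionary and function below are our own sketch, not refnx's API:

```python
import numpy as np
from scipy import stats

# prior probability distributions encoding pre-existing knowledge
priors = {
    "thickness": stats.norm(loc=100.0, scale=4.0),      # 100 +/- 4 A
    "vol_fraction": stats.uniform(loc=0.0, scale=1.0),  # in [0, 1]
}

def log_prior(values):
    """Sum of log-prior probabilities for a dict of parameter values."""
    return sum(priors[name].logpdf(v) for name, v in values.items())
```

A value outside a uniform bound yields a log-prior of negative infinity, immediately excluding that point from the posterior.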

2.1. Structure representation

The Structure object represents the interfacial model, assembled from individual Component objects in series. Each Component represents a subset of the interface, and selected attributes of the Component are described by physically relevant Parameter objects. The simplest and most familiar Component is a Slab, which has a uniform SLD, thickness, roughness and volume fraction of solvent. The simplest models are just a series of Slab objects. More sophisticated components include LipidLeaflet (a lipid monolayer, or one-half of a lipid bilayer) and Spline (for free-form modelling of an SLD profile using spline interpolation). A consequence of the modular design is that new components providing different structural functionality are straightforward to develop.

To include further prior knowledge of the real sample into the model, each Component can contribute to the prior probability in addition to its constituent Parameter objects. This is useful when a Component has a derived value, such as surface excess, which is already known.

To calculate the reflectivity from the series of Component objects that form the model, each Component has a slabs property that represents a discretized `slice' approximation to a continuous SLD profile for its particular region of the interface. A Slab object has a single slice because it is a single thickness of uniform SLD. A LipidLeaflet is made of two slices (head/tail regions), but the Spline has many thin slices approximating the smooth curve. Each of these slices has uniform SLD, with the Névot–Croce approach being used to describe the roughness between them (Névot & Croce, 1980[Névot, L. & Croce, P. (1980). Rev. Phys. Appl. 15, 761-769.]).

The Structure object is used to construct a ReflectModel object. This object is responsible for calculating the resolution-smeared reflectivity of the Structure, scaling the data and adding a Q-independent constant background [via the scale and background Parameter objects; Q = (4π/λ)sin(Ω), where Ω is the angle of incidence and λ is the wavelength of the incident radiation]. There are different types of smearing available: constant dQ/Q, point-by-point resolution smearing read from the data set of interest or via a smearing probability kernel of arbitrary shape (Nelson & Dewhurst, 2014[Nelson, A. R. J. & Dewhurst, C. D. (2014). J. Appl. Cryst. 47, 1162.]). The constant dQ/Q and point-by-point smearing use Gaussian convolution, with dQ representing the full width at half-maximum (FWHM) of a Gaussian approximation to the instrument resolution function (van Well & Fredrikze, 2005[Well, A. A. van & Fredrikze, H. (2005). Physica B, 357, 204-207.]).
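The constant dQ/Q Gaussian smearing described above can be sketched as a per-point convolution of the model with a Gaussian kernel; the function name, the 5% FWHM default and the kernel truncation are illustrative assumptions, not refnx's internals:

```python
import numpy as np

def smear_constant_dq_q(q, model, res=0.05, nsigma=3.5, npts=51):
    """Smear model(q) with a Gaussian of FWHM = res * q at each point."""
    fwhm = res * np.asarray(q, dtype=float)
    sd = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> standard deviation
    out = np.empty(len(q))
    for i, (qq, s) in enumerate(zip(q, sd)):
        # evaluate the model on a fine grid around each nominal Q value
        qs = np.linspace(qq - nsigma * s, qq + nsigma * s, npts)
        kernel = np.exp(-0.5 * ((qs - qq) / s) ** 2)
        out[i] = np.sum(kernel * model(qs)) / kernel.sum()
    return out
```

Because the kernel is symmetric, smearing leaves constant and linear models unchanged; only curvature in R(Q) is blurred.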

2.2. Model/data comparison

The Objective class is the comparator of the predicted and measured reflectivities, using the ReflectModel and a data set, Data1D, to calculate χ2, log-likelihood [equation (2)[link]], log-prior, residuals and the generative model. The Data1D object has x, x_err, y and y_err attributes to represent Q, dQ, R and dR, respectively. As is standard for many reflectometry data files, the Data1D object reads a three- or four-column plain-text datafile. A three-column data set represents Q (Å−1), R and dR (one standard deviation). A four-column data set represents Q (Å−1), R, dR and dQ (Å−1). dR is the uncertainty in the reflectivity, R, and dQ specifies the FWHM of the instrument resolution function, for each data point. Extending Data1D would allow other formats to be read – at the moment there is no standardized data format for reflectometry. One example of this could be a wavelength-dispersive file using (Ω, λ) data instead of Q, such as that used in energy-scanned X-ray reflectometry, or sometimes produced by wavelength-dispersive neutron reflectometers. In such a case, ReflectModel could be subclassed to make full use of this energy-dispersive information. The creation of a standardized data format for reflectometry would facilitate ingestion of data and allow other important information, such as experimental metadata, to be used.

An Objective can be given a Transform object to permit fitting as log10R versus Q or RQ4 versus Q; the default (no Transform) is R versus Q. Several Objective objects can be combined to form a GlobalObjective for co-refinement. The object-oriented nature of the program allows reuse of Parameter and Component objects, and this is the basis for linking parameters between samples for co-refinement. For a comprehensive demonstration of multiple-contrast co-refinement, see the annotated notebook in the supporting information.
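The essence of co-refinement is that one shared parameter enters the statistics of every data set; a GlobalObjective-style sum can be sketched with synthetic data (the data, models and names here are toy assumptions, not refnx code):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# two synthetic "contrasts" generated from one shared parameter (true value 3)
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
datasets = [
    (x, 3.0 * x + rng.normal(0.0, 0.01, x.size), lambda p, xi: p * xi),
    (x, 3.0 * x ** 2 + rng.normal(0.0, 0.01, x.size), lambda p, xi: p * xi ** 2),
]

def global_chi2(p):
    # GlobalObjective-style sum: the shared parameter p is refined
    # against all data sets simultaneously
    return sum(np.sum((y - model(p, xi)) ** 2) for xi, y, model in datasets)

fit = minimize_scalar(global_chi2, bounds=(0.0, 10.0), method="bounded")
```

Because both data sets constrain the same parameter object, the co-refined estimate is tighter than either fit alone.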

2.3. Statistical comparison and model refinement

The Objective statistics are used directly by the CurveFitter class to perform least-squares fitting with the functionality provided by the SciPy package (Differential Evolution; Levenberg–Marquardt; L-BFGS-B, the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm with bounds). Additional SciPy solvers can be added relatively simply, and it would be possible for other minimizers to use Objective directly. CurveFitter can also perform Bayesian Markov-chain Monte Carlo (MCMC) sampling of the system, examining the posterior probability distribution of the parameters [equation (1)[link]]. The posterior distribution is proportional to the product of the prior probability and the likelihood (equivalently, the log-posterior is the sum of the log-probabilities, up to a constant):
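A least-squares fit with SciPy's differential evolution, in which parameter bounds play the role of uniform priors, can be sketched on a toy model (the model and data here are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import differential_evolution

# synthetic data from a two-parameter model: amplitude * exp(-decay * x)
rng = np.random.default_rng(3)
x = np.linspace(0.0, 5.0, 60)
y = 2.0 * np.exp(-1.3 * x) + rng.normal(0.0, 0.002, x.size)

def chi2(p):
    amp, decay = p
    return np.sum((y - amp * np.exp(-decay * x)) ** 2)

# bounds on each parameter act like uniform priors for the global optimizer
result = differential_evolution(chi2, bounds=[(0.0, 10.0), (0.0, 5.0)], seed=1)
```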

[p (\theta \mid D, I ) = {{p(\theta \mid I) \, p(D \mid \theta, I)} \over {p (D \mid I)}} , \eqno(1)]

[\log p (D \mid \theta, I) = -{{1} \over {2}} \sum _{n} \left [ \left ( {{y_{n} - y_{{\rm model},n}} \over {\sigma _{n}}} \right )^{2} + \log (2 \pi \sigma _{n}^{2} ) \right ] . \eqno(2)]

The prior, p(θ | I), is the probability distribution function for a parameter, θ, given pre-existing knowledge of the system, I, as outlined above. The likelihood, p(D | θ, I), is the probability of the observed data, D, given the model parameters and other prior information; equation (2)[link] gives its logarithm. It is calculated from the measured data, yn (with uncertainties σn), and the generative model, ymodel,n. The likelihood used here assumes that the measurement uncertainties are normally distributed [equation (2)[link]]. However, other types of measurement uncertainties (e.g. Poissonian) could be implemented by a subclass of Objective, overriding the log-likelihood method. The model evidence, p(D | I), is a normalizing factor.

The posterior probability, p(θ | D, I), describes the distribution of parameter values consistent with the data and prior information. In the simplest form, this is akin to a confidence interval for a parameter derived by least-squares analysis. However, when parameters are correlated, or two models give a similar quality of fit (`multi-modality'), a simple confidence interval can be misleading. The posterior probability is derived by encoding the likelihood and prior distributions and then using an MCMC algorithm (via the emcee and ptemcee packages) to perform affine-invariant ensemble sampling (Foreman-Mackey et al., 2013[Foreman-Mackey, D., Hogg, D. W., Lang, D. & Goodman, J. (2013). Publ. Astron. Soc. Pac. 125, 306-312.]; Vousden et al., 2016[Vousden, W., Farr, W. M. & Mandel, I. (2016). Mon. Not. R. Astron. Soc. 455, 1919-1937.]). At the end of an MCMC run, the parameter set possesses a number of samples (called a `chain'); the samples reveal the distribution and covariance of the parameters, the spread of the model-predicted measurements around the data and, in a reflectometry context, the range of SLD profiles that are consistent with the data. The chain statistics are used to update each Parameter value and assign a standard uncertainty. For the sampling, these represent the median and half of the [15.87, 84.13] percentile range, respectively; the latter approximates the standard deviation for a normally distributed statistic.
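The chain statistics described above reduce to percentiles of the flattened samples (the function name is ours, for illustration):

```python
import numpy as np

def chain_summary(samples):
    """Median and standard-uncertainty estimate from an MCMC chain.

    The uncertainty is half of the [15.87, 84.13] percentile range,
    which approximates the standard deviation for a normally
    distributed statistic.
    """
    lo, median, hi = np.percentile(samples, [15.87, 50.0, 84.13])
    return median, 0.5 * (hi - lo)
```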

The ptemcee package is a variant (a `fork' in open-source software development terms) of the emcee package that has been extended to implement the parallel-tempering algorithm for the characterization of multi-modal probability distributions; different modes can be traversed by chain populations at higher `temperatures', while individual modes are efficiently explored by chains at lower `temperatures' (Vousden et al., 2016[Vousden, W., Farr, W. M. & Mandel, I. (2016). Mon. Not. R. Astron. Soc. 455, 1919-1937.]). Having multiple populations in the parallel-tempering algorithm allows the sampler to escape local maxima, greatly aiding its ability to explore the most probable regions of the posterior. ptemcee is also able to estimate the log-evidence term [the denominator in equation (1)[link]], which is useful when calculating the Bayes factor for model comparison.

Parallelization of the sampling is automatic, making full use of multi-core machines, and can use MPI (message passing interface) on a cluster for yet greater parallelization. Visualization of the samples produced by MCMC sampling is performed using the corner package for scatter-plot matrices (Foreman-Mackey, 2016[Foreman-Mackey, D. (2016). J. Open Source Software, 1(2), 24.]), which gives a representation of the probability distribution function for each individual parameter and also the covariance for each pair of parameters. As will be seen later, the plot for two normally distributed and uncorrelated parameters is isotropic, while covariant parameters show significant anisotropy. An evaluation of the impact of hard bounds can also be made by looking for plots where the bounds are clearly truncating the distribution function, allowing the bounds to be re-evaluated and adjusted if necessary.

2.4. User interface

A significant motivation in the development of refnx has been the facilitation of reproducible analysis by helping the user describe how an analysis was performed. A few lines of computer code provide a concise and powerful description, conveying the details of an analysis with a precision that is hard to match in written text. Example analyses within the refnx code base are often sufficient as starting points for this task. These few lines of Python code can be further extended to produce publication-quality plots, saved and ready to import into the next publication, or to be used in a loop for batch-fitting purposes. Python's popularity as a language for instruction and data analysis means that the relatively few lines of code required for a refnx analysis are not a large hurdle; nevertheless, a simpler graphical user interface (GUI) is also provided. The browser-based GUI is available for fitting within a Jupyter notebook (Fig. 2[link]), making use of the ipywidgets modules (Project Jupyter Contributors, 2015[Project Jupyter Contributors (2015). ipywidgets, https://github.com/jupyter-widgets/ipywidgets.]). The GUI has a `To code' button that turns the current model into the few lines of code required to perform the analysis without using the GUI, thus providing the desired instructions for a reproducible analysis. The ability to generate analysis code also provides a starting point for building more advanced models independently. The current GUI is able to use slab-based models for fitting a single data set, and a fully functional web-based reflectometry analysis notebook is currently available (Nelson & Prescott, 2018a[Nelson, A. & Prescott, S. W. (2018a). Online Reflectivity Fitting with refnx, https://mybinder.org/v2/gh/refnx/refnx-binder.git/master.]).
If desired, it is possible to execute Jupyter notebooks in batch mode or to run the generated Python code within a Python program to complete batch-mode fitting of larger data sets. The refnx repository contains a growing set of examples of different uses of the refnx package.

[Figure 2]
Figure 2
A screenshot of the Jupyter/ipywidgets GUI; this Jupyter notebook is available in the supporting information.

3. Example data analysis with a lipid bilayer

Neutron reflectometry is an ideal technique for the study of biologically relevant lipid membrane mimics and their interactions with proteins etc. Multiple contrast-variation measurements are necessary to reduce modelling ambiguity (due to loss of phase information in the scattering experiment) and improve the ability to determine the structure of various components in the system. The gold-standard approach for analysis of these data sets is co-refinement with a common model, and to parameterize the model in terms of chemically relevant parameters, such as the area per molecule (Campbell et al., 2018[Campbell, R. A., Saaka, Y., Shao, Y., Gerelli, Y., Cubitt, R., Nazaruk, E., Matyszewska, D. & Lawrence, M. J. (2018). J. Colloid Interface Sci. 531, 98-108.]). Sometimes a patchy coverage (distinct from low area per molecule) necessitates the use of an (incoherent) sum of reflectivities from different areas. refnx has functionality for all these requirements, such as the LipidLeaflet component for describing the head and tail groups of a lipid leaflet, and MixedReflectModel to account for patchiness.

The parameters used in the LipidLeaflet component are area per molecule (A), thicknesses for each of the head and tail regions (tx), sums of scattering lengths of the head and tail regions (bx), partial volumes of the head and tail groups (Vx), roughness between head and tail region, and SLDs of the solvents for the head and tail groups (ρx,solv). The overall SLDs of each of the head and tail group regions are given by

[\varphi _{x} = {{V_x} \over {At_x}} , \eqno(3)]

[\rho _{x} = \varphi _{x} {{b_x} \over {V_x}} + (1 - \varphi _{x}) \, \rho _{x,{\rm solv}} . \eqno(4)]
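Equations (3) and (4) combine into a short function; the function name and the example numbers used below are our own, purely illustrative values:

```python
def leaflet_sld(A, t, b, V, rho_solv):
    """SLD of a head or tail region from equations (3) and (4).

    A : area per molecule (A^2)       t : region thickness (A)
    b : summed scattering length (A)  V : partial volume (A^3)
    rho_solv : solvent SLD (A^-2)
    """
    phi = V / (A * t)                              # eq (3): lipid volume fraction
    return phi * (b / V) + (1.0 - phi) * rho_solv  # eq (4): volume-weighted SLD
```

When A·t equals V the region contains no solvent and the SLD reduces to b/V; for larger A·t the solvent contribution grows accordingly.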

The approach used in the LipidLeaflet component ensures that there is a 1:1 correspondence of heads to tails. By default, the head and tail solvents are assumed to be the same as the solvent that is used throughout the Structure. This will be the case when using LipidLeaflet for a solid–liquid reflectometry experiment. However, at the air–liquid or liquid–liquid interfaces the solvents for the head and tail regions may be different, and it is possible to use different solvent SLDs for each. We note that the LipidLeaflet component may also be used to describe other amphiphiles adsorbing at an interface.

Here, LipidLeaflet is used to co-refine three contrasts [D2O, Si contrast-match (HDmix, SLD = 2.07 × 10−6 Å−2) and H2O] of a 1,2-dimyristoyl-sn-glycero-3-phosphocholine (DMPC) bilayer at the solid–liquid interface (Fig. 3[link]). [The validity of LipidLeaflet does depend on the area per molecule being equal for the headgroup and tailgroup regions, as pointed out by Gerelli (2016[Gerelli, Y. (2016). J. Appl. Cryst. 49, 330-339.]), which can be violated if there are guest molecules that insert into the membrane.] Two LipidLeaflet objects are required to describe the inner and outer leaflets of a bilayer; hence the component contains an attribute which can reverse the direction of one of the leaflets. The use of individual objects to describe each leaflet leads to great flexibility: it becomes easy to model asymmetric bilayers (the inner leaflet can be a different lipid from the outer lipid), and one can model interstitial water layers between the leaflets as well.

[Figure 3]
Figure 3
(a) Neutron reflectivity from a DMPC bilayer supported on a silicon crystal, measured at three contrasts, with 500 samples from the posterior distribution in grey and the median of the distribution in red. Data for the contrast-matched (HDmix) and H2O contrasts are offset by 0.1 and 0.01, respectively. (b) The SLD profile of the D2O model, showing 500 samples from the posterior distribution, with the median in red. It is seen that the uncertainty in the reflectivity at high Q is associated with an uncertainty in the SLD profile at the lipid–D2O interface.

The Jupyter notebook used for the analysis, lipid.ipynb, is available in the supporting information. The corner plot (Fig. 4[link]) produced from the MCMC analysis shows the covariance between parameters, with an area per molecule of 57.00 ± 0.15 Å2. Fig. 3[link] shows the probability distribution of the generative model around the data and in the SLD profile. These are families of plausible fits that are obtained by plotting a subset of samples from the MCMC chain. The spread in the SLD profiles is used to determine what range of structures is consistent with the data. Multi-modalities in these SLD profiles can be due to statistical uncertainties, the Q ranges measured and the loss of phase information in neutron reflectometry (Majkrzak, 1999[Majkrzak, C. (1999). Acta Phys. Pol. A, 96, 81-99.]; Heinrich et al., 2009[Heinrich, F., Ng, T., Vanderah, D. J., Shekhar, P., Mihailescu, M., Nanda, H. & Lösche, M. (2009). Langmuir, 25, 4219-4229.]).

[Figure 4]
Figure 4
A corner plot for the varying parameters of DMPC bilayers supported on a silicon crystal, measured at three contrasts. The sampling took ∼33 min on a 2.8 GHz quad-core computer for 20 saved steps, corresponding to 4000 samples, with the steps being thinned by a factor of 400. A larger-scale image is available in the supporting information.

4. Distribution and modification

Each submodule in refnx possesses its own unit testing code for checking that the functions and classes in the module operate correctly, both individually and collectively. For example, there are tests that check that the reflectivity of a model is calculated correctly, or that the behaviour of a function is correct for the different possible inputs and code paths through it. Since the test suite is an integral part of the package, each installation is testable. In addition, there is a benchmarking suite to track changes in performance, specifically the speed of critical calculations, over time. This development approach is important for providing assurances to the community that the code is tested and works.

The source code for refnx is held in a publicly accessible version-controlled git repository (Nelson & Prescott, 2018b[Nelson, A. & Prescott, S. W. (2018b). refnx - Neutron and X-ray Reflectometry Analysis in Python, https://www.github.com/refnx/refnx.]). User contributions may be made using the standard GitHub workflow, in which contributors create their own `fork' of the main refnx repository and create a feature branch to which they make modifications. They then submit a pull request (PR) against the main repository. The modifications made in the PR are checked on continuous-integration (CI) web services that run the test suite against a matrix of Python versions on the macOS, Linux and Windows operating systems. Features are merged into the main repository if all tests pass, and if manual code review concludes that the changes are scientifically correct, of sufficiently high standard and useful. When a sufficient number of features have accumulated, a new release is made. Successive releases have an incrementing semantic version number which can be obtained from the installed package, with each release being given its own digital object identifier (DOI). We encourage users to submit models for inclusion in a user-contributed models repository (refnx-models, https://github.com/refnx/refnx-models). We will work with users to develop a suitable way of documenting and sharing their models.

The recommended way of using refnx is from a conda environment, which offers package, dependency and environment management (Continuum Analytics, 2017[Continuum Analytics (2017). Conda - Package, Dependency and Environment Management for any Language, https://conda.io/docs/.]), using the pre-compiled distributables on the refnx conda-forge channel. These distributables are made as part of the release process using the same CI web services as are used to test the code. The matrix of distributables covers the major Python versions currently in use, across the macOS, Windows and Linux operating systems. Alternatively the package can be installed from source, either directly from the git repository or via pip from the version uploaded to PyPI (https://pypi.python.org/pypi/refnx; the installation command is pip install refnx). Building from source requires a C compiler and the Cython (https://cython.org/) and NumPy (https://www.numpy.org/) packages to be installed; further dependencies should be installed to run the test suite to verify that compilation and installation have been successful.

refnx is released under the BSD permissive open-source licence. In addition, all of the dependencies of refnx are released under open-source licences, which means that use is free of cost to the end user and, more importantly, the user is free to modify, improve and inspect this software.

5. Comments on reproducibility of analyses

In order for a given scattering analysis to be fully reproducible by others, a general set of conditions need to be met (Helliwell et al., 2017[Helliwell, J. R., McMahon, B., Guss, J. M. & Kroon-Batenburg, L. M. J. (2017). IUCrJ, 4, 714-722.]; Möller et al., 2017[Möller, S., Prescott, S. W., Wirzenius, L., Reinholdtsen, P., Chapman, B., Prins, P., Soiland-Reyes, S., Klötzl, F., Bangnacani, A., Kalaš, M., Tille, A. & Crusoe, M. R. (2017). Data Sci. Eng. 2, 232-244.]):

(i) The processed data sets used in the analysis need to be deposited with a journal article, or be freely available. Ideally, the raw data sets, and the means to create the processed data sets, should also be made available.

(ii) The exact software environment needs to be recreatable.

(iii) The exact ordered set of steps taken during the analysis needs to be listed.

Each of these points is often inadequately addressed in the literature. For example, the use of different software versions may change the output of an analysis, or the use of a GUI program may preclude recording the full set of steps, or options, applied by a user (Chirigati et al., 2013[Chirigati, F. S., Troyer, M., Shasha, D. E. & Freire, J. (2013). IEEE Data Eng. Bull. 36(4), 54-59.]). Whilst it is unable to meet the first criterion by itself, the use of refnx in a Jupyter notebook can fulfil the other two requirements, providing a little care is taken. As we have already noted, the ordered set of steps to perform the analysis is the Jupyter notebook in which the analysis was performed, and this is an artefact that can be archived.

The exact software environment can be recreated by noting the versions of the software packages used during an analysis (refnx, SciPy, NumPy, Python etc.). At a later date, those exact versions can be installed against the same Python version using the conda package manager, by building from source at a given version tag in the git repository, or via pip. The conda package manager can also use environment files to recreate a specific setup. An alternative way of recreating the environment is a virtual machine, or another container technology such as Docker. The strengths and weaknesses of various software distribution practices, and their relationship with reproducible science, have been discussed in detail elsewhere (Möller et al., 2017).
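As a minimal illustration of the first step, the versions in use can be recorded from within the analysis notebook itself. The sketch below (ours, not part of refnx) uses only the standard library plus NumPy and SciPy; the same pattern extends directly to refnx and any other dependency:

```python
# Record the interpreter and package versions used during an analysis, so
# that the same environment can be recreated later (e.g. with a conda
# environment file or a pip requirements file).
import sys

import numpy
import scipy

versions = {
    "python": sys.version.split()[0],
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
}

for package, version in versions.items():
    print(package, version)
```

Archiving this output alongside the notebook makes the environment specification part of the analysis record itself.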

The usefulness of open-source software in a git (or other version-controlled) repository must be emphasized here (Möller et al., 2017). With closed-source or proprietary software, the ability to return to a specific software version or environment can be frustrated, and different versions may contain modifications that unknowingly change the output of an analysis. In addition, reduced accessibility (owing to cost etc.) to the wider scientific community can also hinder reproducibility.

Moreover, there are important ramifications for verifiability (Chirigati et al., 2013). refnx is built on a fully open software stack, with good unit-test coverage. The user can run the tests for each component and inspect the code for correctness. For example, the behaviour of the reflectivity calculation in refnx is checked from first principles in the test suite, and this can be done now and in several years' time. If problems are discovered, they can be corrected. With a fully or partially closed-source program such checking is much harder, as one does not possess full knowledge of what happens internally.
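A first-principles check of this kind is straightforward to express in a notebook. The sketch below is our own illustration, not code from the refnx test suite: it implements the Abeles matrix method for specular reflectivity of a layered stack in plain NumPy and asserts that, for a bare substrate, it reproduces the analytic Fresnel reflectivity of a single interface.

```python
import numpy as np

def abeles(q, slds, thicknesses):
    """Specular reflectivity of a stack via the Abeles matrix method.

    slds        : SLD of each medium, fronting -> backing (1e-6 Angstrom^-2)
    thicknesses : thickness (Angstrom) of each internal layer,
                  len(slds) - 2 entries
    """
    q = np.atleast_1d(np.asarray(q, dtype=float))
    slds = np.asarray(slds, dtype=float)
    # d[0] = 0 for the semi-infinite fronting medium
    d = np.concatenate(([0.0], np.asarray(thicknesses, dtype=float)))

    # wavevector normal to the surface in each medium
    kz0 = q[:, None] / 2.0
    k = np.sqrt(kz0**2 - 4.0e-6 * np.pi * (slds - slds[0]) + 0j)

    R = np.empty(q.size)
    for i in range(q.size):
        M = np.eye(2, dtype=complex)
        for j in range(len(slds) - 1):
            # Fresnel coefficient at interface j (no roughness)
            r = (k[i, j] - k[i, j + 1]) / (k[i, j] + k[i, j + 1])
            beta = k[i, j] * d[j]  # phase accumulated in layer j
            C = np.array([[np.exp(1j * beta), r * np.exp(1j * beta)],
                          [r * np.exp(-1j * beta), np.exp(-1j * beta)]])
            M = M @ C
        R[i] = np.abs(M[1, 0] / M[0, 0])**2
    return R

def fresnel(q, sld_back, sld_front=0.0):
    """Analytic single-interface (Fresnel) reflectivity."""
    kz0 = np.asarray(q, dtype=float) / 2.0
    kz1 = np.sqrt(kz0**2 - 4.0e-6 * np.pi * (sld_back - sld_front) + 0j)
    r = (kz0 - kz1) / (kz0 + kz1)
    return np.abs(r)**2

q = np.linspace(0.005, 0.3, 200)
# a bare Si substrate (SLD 2.07e-6 Angstrom^-2) in air: the matrix and
# analytic results must agree, with total reflection below the critical edge
assert np.allclose(abeles(q, [0.0, 2.07], []), fresnel(q, 2.07))
```

Such a check is independent of any reflectometry package and can be rerun against any future version of the software being verified.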

6. Conclusions

refnx is a powerful tool for least-squares or Bayesian analysis of neutron and X-ray reflectometry data. It is well suited to reproducible research with Jupyter notebooks and has been built with extensibility in mind. Its features include MCMC sampling of the posterior distribution of the parameters, structural models constructed from modular components with physically relevant parameterization, algebraic inter-parameter constraints, mixed-area models, co-refinement of multiple data sets, probability distributions for parameter bounds that are used directly as log-prior terms, and a Jupyter (ipywidgets) GUI.

6.1. Supporting information

The files available in the supporting information for this article are as follows:

(i) gui.ipynb – the Jupyter notebook used to create the GUI screenshot.

(ii) lipid.ipynb – the Jupyter notebook used for the lipid analysis example.

(iii) rg5158sup2.pdf – a PDF view of the Jupyter notebook used for the lipid analysis example.

(iv) rg5158sup3.pdf – a larger-scale image of the corner plot shown in Fig. 4[link].

(v) rg5158sup4.zip – containing example lipid data sets c_PLP0016596.dat, c_PLP0016601.dat and c_PLP0016607.dat.

(vi) reduction.ipynb – the notebook for reducing example data sets.

(vii) rg5158sup5.zip – raw files for the example data sets.

(viii) refnx-paper.yml – the conda environment file to reproduce the analysis environment in this paper.



Acknowledgements

We acknowledge Anton Le Brun (ANSTO) for the provision of the lipid bilayer data sets in the example, James Hester (ANSTO) for comments made on the draft manuscript, and Andrew McCluskey (Bath University) and Isaac Gresham (UNSW) for important feedback on refnx development.

References

Björck, M. & Andersson, G. (2007). J. Appl. Cryst. 40, 1174–1178.
Campbell, R. A., Saaka, Y., Shao, Y., Gerelli, Y., Cubitt, R., Nazaruk, E., Matyszewska, D. & Lawrence, M. J. (2018). J. Colloid Interface Sci. 531, 98–108.
Chirigati, F. S., Troyer, M., Shasha, D. E. & Freire, J. (2013). IEEE Data Eng. Bull. 36(4), 54–59.
Continuum Analytics (2017). Conda – Package, Dependency and Environment Management for any Language, https://conda.io/docs/.
Daillant, J. & Gibaud, A. (2009). Editors. X-ray and Neutron Reflectivity: Principles and Applications, Lecture Notes in Physics, Vol. 770. Heidelberg: Springer Verlag.
Foreman-Mackey, D. (2016). J. Open Source Software, 1(2), 24.
Foreman-Mackey, D., Hogg, D. W., Lang, D. & Goodman, J. (2013). Publ. Astron. Soc. Pac. 125, 306–312.
Gerelli, Y. (2016). J. Appl. Cryst. 49, 330–339.
Heavens, O. (1955). Optical Properties of Thin Films. London: Butterworth.
Heinrich, F., Ng, T., Vanderah, D. J., Shekhar, P., Mihailescu, M., Nanda, H. & Lösche, M. (2009). Langmuir, 25, 4219–4229.
Helliwell, J. R., McMahon, B., Guss, J. M. & Kroon-Batenburg, L. M. J. (2017). IUCrJ, 4, 714–722.
Hogg, D. W., Bovy, J. & Lang, D. (2010). ArXiv e-prints. arXiv:1008.4686.
Hughes, A. V., Ciesielski, F., Kalli, A. C., Clifton, L. A., Charlton, T. R., Sansom, M. S. P. & Webster, J. R. P. (2016). Acta Cryst. D72, 1227–1240.
Jones, E., Oliphant, T., Peterson, P. et al. (2001). SciPy: Open Source Scientific Tools for Python, https://www.scipy.org/.
Kienzle, P. A., Krycka, J., Patel, N. & Sahin, I. (2011). Refl1d – Depth Profile Modelling, https://reflectometry.org/danse/docs/refl1d/.
Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S. & Willing, C. (2016). Positioning and Power in Academic Publishing: Players, Agents and Agendas, edited by F. Loizides & B. Schmidt, pp. 87–90. Amsterdam: IOS Press.
Majkrzak, C. (1999). Acta Phys. Pol. A, 96, 81–99.
Millman, K. J. & Pérez, F. (2014). Implementing Reproducible Research, edited by V. Stodden, F. Leisch & R. D. Peng, ch. 6. London: Chapman & Hall.
Möller, S., Prescott, S. W., Wirzenius, L., Reinholdtsen, P., Chapman, B., Prins, P., Soiland-Reyes, S., Klötzl, F., Bangnacani, A., Kalaš, M., Tille, A. & Crusoe, M. R. (2017). Data Sci. Eng. 2, 232–244.
Nelson, A. (2006). J. Appl. Cryst. 39, 273–276.
Nelson, A. R. J. & Dewhurst, C. D. (2014). J. Appl. Cryst. 47, 1162.
Nelson, A. & Prescott, S. W. (2018a). Online Reflectivity Fitting with refnx, https://mybinder.org/v2/gh/refnx/refnx-binder.git/master.
Nelson, A. & Prescott, S. W. (2018b). refnx – Neutron and X-ray Reflectometry Analysis in Python, https://www.github.com/refnx/refnx.
Névot, L. & Croce, P. (1980). Rev. Phys. Appl. 15, 761–769.
Pauw, B. R. (2013). J. Phys. Condens. Matter, 25, 383201.
Project Jupyter Contributors (2015). ipywidgets, https://github.com/jupyter-widgets/ipywidgets.
Sivia, D. & Skilling, J. (2006). Data Analysis: A Bayesian Tutorial. Oxford Science Publications.
Stark, P. (2018). Nature, 557, 613.
Trewhella, J., Duff, A. P., Durand, D., Gabel, F., Guss, J. M., Hendrickson, W. A., Hura, G. L., Jacques, D. A., Kirby, N. M., Kwan, A. H., Pérez, J., Pollack, L., Ryan, T. M., Sali, A., Schneidman-Duhovny, D., Schwede, T., Svergun, D. I., Sugiyama, M., Tainer, J. A., Vachette, P., Westbrook, J. & Whitten, A. E. (2017). Acta Cryst. D73, 710–728.
Vousden, W., Farr, W. M. & Mandel, I. (2016). Mon. Not. R. Astron. Soc. 455, 1919–1937.
Well, A. A. van & Fredrikze, H. (2005). Physica B, 357, 204–207.
Wood, M. & Clarke, S. (2017). Metals, 7, 304.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.
