teaching and education
High-performance Python for crystallographic computing
^{a}Institut de Recherche sur les Céramiques, CNRS UMR 7315, Centre Européen de la Céramique, 12 rue Atlantis, 87068 Limoges Cedex, France, and ^{b}European Synchrotron Radiation Facility, 71 Avenue des Martyrs, 38000 Grenoble, France
^{*}Correspondence email: alexandre.boulle@unilim.fr
The Python programming language, combined with the numerical computing library NumPy and the scientific computing library SciPy, has become the de facto standard for scientific computing in a variety of fields. This popularity is mainly due to the ease with which a Python program can be written and executed (easy syntax, dynamic typing, no compilation etc.), coupled with the existence of a large number of specialized third-party libraries that aim to lift all the limitations of the raw Python language. NumPy introduces vector programming, improving execution speeds, whereas SciPy brings a wealth of highly optimized and reliable scientific functions. There are cases, however, where vector programming alone is not sufficient to reach optimal performance. This issue is addressed with dedicated compilers that aim to translate Python code into native and statically typed code with support for the multi-core architectures of modern processors. In the present article it is shown how these approaches can be efficiently used to tackle different problems, with increasing complexity, that are relevant to crystallography: the 2D Laue function, scattering from a strained 2D crystal, scattering from 3D nanocrystals and, finally, diffraction from films and multilayers. For each case, detailed implementations and explanations of the functioning of the algorithms are provided. Different Python compilers (namely NumExpr, Numba, Pythran and Cython) are used to improve performance and are benchmarked against state-of-the-art NumPy implementations. All examples are also provided as commented and didactic Python (Jupyter) notebooks that can be used as starting points for crystallographers curious to enter the Python ecosystem or wishing to accelerate their existing codes.
1. Introduction
Python is a high-level programming language which is very popular in the scientific community. Among the various reasons which make Python an attractive platform for scientific computing one can cite the following (Oliphant, 2007):
(i) it is an interpreted (as opposed to compiled) and dynamically typed (i.e. the variable types can change depending on the context) programming language which allows for fast prototyping of scientific applications,
(ii) the syntax is clean and simple, and hence easily accessible to nonprofessional programmers,
(iii) Python runs on different operating systems (OSs), including the three major desktop OSs (Windows, MacOS and Linux),
(iv) it is distributed freely with a permissive license which favors an easy distribution of programs and library modules, and
(v) Python has a huge number of libraries (either included in the standard library or available through third parties) that allow almost every task one can possibly imagine to be addressed.
In the field of science and engineering these tasks include, for instance, vector programming with NumPy (https://www.numpy.org/), general-purpose scientific computing with SciPy (https://www.scipy.org/), symbolic computing with SymPy (https://www.sympy.org), image processing (scikit-image, https://scikit-image.org/), statistical analysis (pandas, https://pandas.pydata.org/), machine learning (scikit-learn, https://scikit-learn.org), plotting (Matplotlib, https://matplotlib.org/) and many others.
This popularity also extends to the crystallographic community. Fig. 1 shows the percentage of articles published every year by the International Union of Crystallography containing the word `Python' in the title or in the abstract. Between 2012 and 2016 this fraction has been multiplied by a factor of ∼12.
The dynamically typed and interpreted nature of Python is also its greatest weakness when it comes to performance: the fact that the interpreter has to determine the type of each variable before running an operation results in increased execution times. Whereas the consequences of this remain limited when working with small data sets (i.e. a small number of operations), the performance dramatically decreases when repeating a large number of operations, i.e. in the case of loops (Oliphant, 2007; Behnel et al., 2011).
This behavior is well known, and several improvements (in the form of third-party libraries) have been made to Python to mitigate or even completely suppress this issue. For instance, the NumPy library (Oliphant, 2007; van der Walt et al., 2011) introduces the n-dimensional array, which often permits the replacement of loops with vectorized operations (i.e. operations implicitly working on all components of a vector), resulting in significantly improved performance. Associated with the SciPy library, which provides a wealth of scientific functions (linear algebra, optimization, integration, interpolation, statistics, signal processing, Fourier transformation etc.), the NumPy module has established itself as the de facto standard for scientific computing within the Python ecosystem. One of the limits of the vectorized approach implemented in NumPy is the evaluation of complicated mathematical formulae on large arrays. When those arrays are too large to fit into the cache system of the processor (a few tens of MB), every single operation needs to fetch and store the data from/to the central memory. The performance of NumPy can be improved with the NumExpr library (https://numexpr.readthedocs.io). NumExpr is focused on the evaluation of complicated mathematical expressions and, among other improvements, it allows programmers to take full advantage of the, now common, multi-core central processing units (CPUs).
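As a minimal sketch of the loop-versus-vector contrast (example and function names ours, not taken from the article), compare an explicit Python loop with its vectorized NumPy equivalent:

```python
import math
import numpy as np

def norm_loop(x):
    # pure Python: the interpreter dispatches every element individually
    total = 0.0
    for v in x:
        total += v * v
    return math.sqrt(total)

def norm_vec(x):
    # vectorized: a single C-level pass over the whole array
    return float(np.sqrt(np.sum(x * x)))

x = np.linspace(0.0, 1.0, 100_001)
assert abs(norm_loop(x) - norm_vec(x)) < 1e-6
```

On large arrays the vectorized version is typically one to two orders of magnitude faster; the exact factor depends on the hardware and array size.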
There are cases, however, where looping constructs cannot be avoided and where vector programming is of no help. This motivated the development of several projects aiming at replacing the critical parts of Python code with compiled code. This is the case for the popular Cython library (https://cython.org/), which translates Python into C code (Behnel et al., 2011). This is then compiled and linked against the standard Python interpreter, which is also referred to as CPython (as it is mainly written in C) in order to avoid confusion with the Python language itself. Another project with a similar purpose is the Pythran library (https://pythran.readthedocs.io/), which is a compiler to the C++ language (Guelton et al., 2015; Guelton, 2018). The latest contender in this `high-performance Python' game is the Numba project (Lam et al., 2015). Numba (https://numba.pydata.org/) was first released in 2012 but has been fully open source only since 2017. There are several technical differences between these libraries, which will be briefly discussed in the next section from the viewpoint of the user.
What motivated the writing of the present article (and the associated Jupyter notebooks given in the supporting information) is that, despite the performance boosts promised by the above-mentioned libraries, finding online resources that are relevant to crystallography can sometimes be very difficult. Most of the time the examples are either too simplistic or, conversely, too specialized to a particular field, making the transposition to crystallography not straightforward, especially for scientists who are not necessarily experts in high-performance computing (HPC). Despite the importance of the Python programming language in the scientific community, a set of didactic examples of high-performance crystallographic computing, structured around the NumPy standard, was lacking.
In the present article we show how to accelerate Python-based crystallographic computing using NumPy, NumExpr, Numba, Pythran and Cython. For this purpose we present four examples with increasing complexity: the calculation of a simple 2D Laue function, the scattering from a 2D circular crystal with strain, the scattering from an arbitrary 3D collection of atoms using the Debye scattering equation (Warren, 1969) and, finally, the scattering from an epitaxial multilayer structure using the recursive solution (Bartels et al., 1986) to the Takagi–Taupin equations (Takagi, 1969; Taupin, 1964). In all cases the algorithms are explained in detail and compared with state-of-the-art Python and NumPy implementations. The performances of each implementation are compared and discussed.
As a side note we acknowledge that the open-source communities evolving in the Python ecosystem are extremely active and it is therefore not possible to review all existing libraries aiming at accelerating Python computing. For instance, we deliberately omitted the intelpython library (https://software.intel.com/en-us/distribution-for-python), which, as the name suggests, is an implementation of the NumPy library developed by Intel Corporation and optimized for Intel architectures. As a consequence, it might not operate correctly with chips from different providers (e.g. AMD) or with different architectures (e.g. ARM, which is the leading architecture for the mobile market and is now targeting the PC and server markets). Moreover, the performance of intelpython is tightly linked to the exploitation of the multi-core architecture, a feature which is explicitly included in all of the above-mentioned libraries. We also omitted the PyPy project (https://pypy.org/), which is a general-purpose interpreter, written in Python, aiming at replacing CPython and targeting high performance. However, with the examples presented in this article, we did not observe any performance improvement.
2. Computational details
Broadly speaking, performance bottlenecks may originate either from the power of the CPU, i.e. the number of instructions it is able to execute over a given time period, or from data (memory) input/output (Alted, 2010). The former issue fueled the race to the maximum CPU clock speed (now capped at around 4 GHz) and the subsequent development of multi-core CPU architectures. The latter led to what is known as the hierarchical memory model: since the random-access memory (RAM) effectively operates at a much lower rate than the CPU, processor manufacturers introduced on-chip, smaller but faster, cache memory. Modern CPUs now have up to three levels of memory cache (L1, L2, L3) with progressively increasing size and decreasing speed. Below we present the hardware as well as the computing libraries used in this work. Their functioning will be briefly discussed in terms of these two concepts.
2.1. Library specifications^{1}
In this work, version 1.13.3 of the NumPy library is used. As mentioned earlier, NumPy introduces the multidimensional array object which, contrary to the native Python `list' object (which stores other Python objects), is of fixed size at creation and contiguously stores data of the same type and size. Vectorized operations are performed on all elements of an array, without the necessity to loop over the array coordinates from the Python code. As a consequence, as illustrated below, the code is cleaner and closer to mathematical notation. These vectorized operations are implemented in C, which results in significant speed improvements (Oliphant, 2007; van der Walt et al., 2011). The fact that the data are stored contiguously in the RAM enables one to take advantage of vectorized instructions of modern CPUs such as SSE (streaming SIMD extensions, with SIMD meaning `single instruction multiple data') or AVX (advanced vector extensions), which allow several data elements to be processed per CPU clock cycle (Rossant, 2018). For instance, the AVX2 standard is characterized by 256-bit-wide registers, hence containing four double-precision (64-bit) floating-point numbers.
NumPy operates optimally when the calculations are performed on data fitting the cache of the CPU (say, a few tens of MB). For larger data sets, the performance is limited by the bandwidth to access the data stored in the RAM. NumExpr (version 2.6.4 in this work) has been specifically developed to address these issues with NumPy. NumExpr is a `just-in-time' (JIT) compiler^{2} for mathematical formulae (relying on an integrated computing virtual machine), which splits the data into smaller chunks that fit within the L1 cache of the processor (a technique known as `blocking') and avoids allocating memory for intermediate results for optimal cache utilization and reduced memory access. The chunks are seamlessly distributed among the available cores of the CPU, which results in highly parallelized code execution.
One important factor limiting Python's performance is the conversion from the Python object (`PyObject'^{3}) to a native data type. Although NumPy and NumExpr allow this limitation to be lifted by working with arrays with fixed data types, conversions back and forth from PyObjects to native data types are still required, which becomes critical when loops or branching are not avoidable.
The Numba library (version 0.34.0) is a JIT compiler for Python. Numba relies on LLVM (low level virtual machine, https://llvm.org/) to produce machine code from Python's bytecode that runs at speeds very close to compiled C (Lam et al., 2015). One of the greatest appeals of Numba is that this increase in performance is reached with very little code modification with respect to pure Python code. Moreover, Numba is compatible with NumPy arrays, supports SIMD vectorized operations and allows for a straightforward parallelization of loops.
Pythran (version 0.9.0) is an ahead-of-time (AOT) compiler for Python, which also aims at accelerating scientific computations; it is therefore also fully compatible with NumPy and parallel execution (via the OpenMP project, https://www.openmp.org/). Pythran not only relies on static typing of variables but also performs various compiler optimizations to produce C++ code which is then compiled into native code (Guelton et al., 2015; Guelton, 2018). Although there are major technical differences between Numba and Pythran, from the point of view of the user, their implementation is quite similar (except for the fact that Pythran requires a separate compilation step).
The most well known compiler for Python is Cython (here used in version 0.26.1). Cython produces statically typed C code which is then executed by the CPython interpreter (Behnel et al., 2011). Contrary to Pythran and Numba, which are domain-specific (i.e. scientific) languages, Cython is a general-purpose AOT compiler. As such, its syntax is more complex and, unlike Numba and Pythran code, (uncompiled) Cython code is not compatible with the CPython interpreter. However, this library potentially allows for the highest level of code optimization.
2.2. Hardware and operating system
All computations were performed on a Linux workstation equipped with two Intel Xeon E5-2698 v4 processors (2 × 20 cores, operating at 2.2 GHz base frequency and 3.6 GHz maximum frequency) and 194 GB of RAM. The operating system is Kubuntu 18.04 with version 3.6.6 of the Python programming language. The Python language and all third-party libraries were installed from the official software repository of the Linux OS used in this work [except Pythran, which has been installed from the Python package index using the package installer pip (https://pypi.org/project/pip/)]. Windows and MacOS users might want to use the Anaconda Python distribution (https://www.anaconda.com/), which allows for simple module management similar to what is found in the Linux ecosystem.
Note that, in the context of this article, we are not going to consider HPC based on graphics processing units (GPUs). The main reason is that programming such devices requires specific (computer-scientist) skills which are out of the scope of this article. Although high-level Python wrappers have been developed for the two main GPU computing frameworks (CUDA and OpenCL), getting the best possible performance out of a given device in general requires the development of low-level kernels adapted to the device. The second reason is that, compared with multi-core CPUs, GPUs able to perform HPC are much less widespread and are in general only to be found in desktop workstations (i.e. most laptops are excluded).
3. Implementation
3.1. Two-dimensional Laue function
In this section we are going to present in detail the different implementations of a given problem using the different libraries discussed above. For this purpose we shall use a simple example, namely a two-dimensional Laue function (Warren, 1969):
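Written explicitly (a reconstruction consistent with the description that follows and with the exact solution quoted in the discussion; the double sum runs over the N × N unit cells):

```latex
I(H, K) = \left| \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} \exp\left[ 2\pi i \left( H n + K m \right) \right] \right|^{2} \qquad (1)
```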
This equation describes the scattering from a square crystal having N unit cells in both directions; n and m are the indices of the unit cells in each direction, and H and K are the continuous Miller indices.
3.1.1. Pure Python implementation
A naive Python implementation of this equation into a function is as follows:^{4}
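A sketch consistent with the line-by-line description that follows (variable names are ours; the line numbers quoted in the text refer to the article's original listing, not to this sketch):

```python
import cmath

def laue_python(n, m, h, k):
    # list of lists receiving the intensities, one cell per (h, k) point
    intensity = [[0.0] * len(k) for _ in range(len(h))]
    for i, hh in enumerate(h):
        for j, kk in enumerate(k):
            tmp = 0.0 + 0.0j          # temporary accumulator: avoids accessing
            for nn in n:              # the Python list from inside the loops
                for mm in m:
                    tmp += cmath.exp(2j * cmath.pi * (hh * nn + kk * mm))
            intensity[i][j] = abs(tmp) ** 2
    return intensity
```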
This function takes four arrays as arguments: two containing the unit-cell coordinates (n and m) and two containing the reciprocal-space coordinates (h and k) (line 1). Line 2 creates the Python list which will receive the results [the range(x) function generates a sequence of integers ranging from 0 to x − 1, whereas the len(y) function returns the size of the array y]. Lines 3–7 are the loops over the reciprocal- and real-space coordinates: the for x in y instruction loops over all x values stored in the y list, whereas the for i, x in enumerate(y) instruction loops over all x values stored in the y list while keeping a loop counter in the variable i. Line 8 computes the complex exponential and the sum, and the result is stored in a temporary variable in order to avoid accessing the Python list from inside the loop. When the loops over the real-space coordinates are completed, the result is stored in one of the cells of the result list (line 9).
In the following, a square crystal with 100 unit cells in each direction is taken as an example. In addition, a complete reciprocal-lattice unit cell is sampled with a resolution of 1/(100 × 6), where 6 is the oversampling rate. The oversampling rate can be viewed as the number of points needed to describe a single Laue fringe. With a sampling rate of 1, only the main peak and the minima are sampled (Neder & Proffen, 2008). With a sampling rate of 2, i.e. the Nyquist rate, the main peak, the minima and all secondary maxima are sampled, and so on. With these values, this implies that the complex exponential has to be computed 3.6 × 10^{9} times. The above implementation of the Laue function requires 37 minutes of computing time (all values are reported in Table 1). The result of the calculation is shown in Figs. 2(a) and 2(b), where the well known Laue function is easily recognized, and it will be used as a reference to evaluate possible errors induced by the different implementations.
3.1.2. NumPy implementation
As mentioned in the Introduction, the performance of pure Python dramatically decreases when looping over large data sets. A simpler, and much more efficient, implementation is possible using NumPy:
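A reconstruction consistent with the description below (names ours; the line numbers quoted in the text refer to the article's original listing): empty dimensions are added to the four input arrays so that h*n + k*m broadcasts to a four-dimensional array, and the sum runs over the two real-space axes:

```python
import numpy as np

def laue_numpy(n, m, h, k):
    # add empty dimensions so that the inputs broadcast to a 4D (H, K, N, M) array
    h = h[:, np.newaxis, np.newaxis, np.newaxis]
    k = k[np.newaxis, :, np.newaxis, np.newaxis]
    n = n[np.newaxis, np.newaxis, :, np.newaxis]
    m = m[np.newaxis, np.newaxis, np.newaxis, :]
    # complex exponential on every cell, then sum over the real-space axes (2, 3)
    return np.abs(np.exp(2j * np.pi * (h * n + k * m)).sum(axis=(2, 3))) ** 2
```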
The first striking observation is that the code is much more concise. It is also much clearer, since the actual calculation requires one line (line 6) with a syntax similar to equation (1). Lines 2–5 add new (empty) dimensions to the input arrays. With this transformation the calculation of h*n + k*m actually returns a four-dimensional array. This important feature of NumPy is known as `broadcasting'. In mathematical notation this is equivalent to e_{ijkl} = a_{i}b_{j} + c_{k}d_{l}. The exponential then operates on all cells of this array, and the sum over the real-space coordinates is performed using the sum() method of the NumPy arrays (line 6) [the axis = (2,3) argument specifies that the summation has to be performed over the last two dimensions of the array which contain the real-space variables]. Besides the cleaner syntax, this implementation is also much faster, with an execution time of ∼4 min. Since NumPy is the established standard for scientific computing, all computing times will be compared with this NumPy implementation.
3.1.3. NumExpr implementation
The implicit trade-off that has been made with the NumPy implementation is that, since the arrays are all stored in RAM at creation, this operation requires up to ∼107 GB of RAM; the four-dimensional array of complex numbers (coded over 128 bits) requires 53.6 GB, but twice this space is needed to compute the exponential. This issue regarding memory usage is clearly a limitation of NumPy when dealing with very large arrays. This could be avoided by creating only one 2D output array and looping over the real-space coordinates (see `NumPy + for loop' in the supporting information). However, a more efficient implementation is to use NumExpr, which, as explained above, makes a better use of memory allocation and avoids memory transfers. A very interesting feature of NumExpr is that the code is very similar to that for NumPy:
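A sketch along the lines described below (names ours; the line numbers quoted in the text refer to the original listing). The NumExpr calls used here (ne.set_num_threads, ne.evaluate with a sum(..., axis=...) reduction) are standard NumExpr API, but the exact expressions of the original notebook may differ, and a plain-NumPy fallback keeps the sketch runnable when NumExpr is absent:

```python
import numpy as np
try:
    import numexpr as ne
except ImportError:   # fallback so the sketch also runs without NumExpr
    ne = None

def laue_numexpr(n, m, h, k):
    pi = np.pi
    # broadcast the phase to 4D, then merge the two real-space axes into one:
    # NumExpr's sum() reduction operates over a single axis only
    phase = (h[:, None, None, None] * n[None, None, :, None]
             + k[None, :, None, None] * m[None, None, None, :])
    phase = phase.reshape(len(h), len(k), -1)
    if ne is not None:
        ne.set_num_threads(ne.detect_number_of_cores())   # use all cores
        amp = ne.evaluate("sum(exp(2j * pi * phase), axis=2)")
    else:
        amp = np.exp(2j * pi * phase).sum(axis=2)
    return np.abs(amp) ** 2
```

Note that this sketch still materializes the broadcast phase array with NumPy; the article's version evaluates more of the expression inside NumExpr, which is where the memory savings come from.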
There are two differences compared with NumPy. Firstly, the expression to be evaluated is passed as an argument to a ne.evaluate() function (lines 8–9). Secondly, the sum function of NumExpr does not operate over several dimensions (line 9). Therefore the 4D array has been reshaped into a 3D array, where the last dimension contains the real-space variables (line 8).
When only one thread is used, the acceleration is rather modest (×1.19). The real power of NumExpr is revealed when all cores are used (line 7). In the present case, the computing time drops to 17 s (instead of 4 min for NumPy), that is a ×13.9 speedup with almost no code modification.
3.1.4. Numba implementation
As mentioned earlier, there are cases where the mathematical problem cannot be solved using vector operations (an example will be given in the last section). In such cases, it can be useful to consider replacing the critical parts of the program with compiled, statically typed, code. Below is such an implementation using Numba:
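A reconstruction following the description that follows (names ours; the line numbers quoted in the text refer to the original listing); an import fallback keeps the sketch executable as plain Python when Numba is absent:

```python
import cmath
import numpy as np

try:
    import numba as nb
    # eager compilation: the output is a 2D float array, the inputs are two 1D
    # integer arrays and two 1D float arrays; nopython = True forbids any
    # fallback to PyObjects
    numba_decorator = nb.jit(
        "float64[:,:](int64[:], int64[:], float64[:], float64[:])",
        nopython=True)
except ImportError:   # run as pure Python so the sketch stays executable
    numba_decorator = lambda f: f

@numba_decorator
def laue_numba(n, m, h, k):
    intensity = np.zeros((h.size, k.size))
    for i in range(h.size):
        for j in range(k.size):
            tmp = 0.0 + 0.0j          # complex temporary: the real-valued
            for nn in n:              # result array cannot store the amplitude
                for mm in m:
                    tmp += cmath.exp(2j * cmath.pi * (h[i] * nn + k[j] * mm))
            intensity[i, j] = abs(tmp) ** 2
    return intensity
```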
This code is almost identical to the pure Python implementation described above. The common way to use Numba consists in `decorating' the function to be compiled with an @nb.jit() statement (line 1). Although this can be sufficient, the following options turn out to be important from the point of view of performance. First of all, the nopython = True statement tells the compiler to compile the code without any remaining Python objects. If this condition is not fulfilled an error is raised. If this option is omitted, the code will execute normally, but all parts of the program containing data types incompatible with Numba or data-type inconsistencies will be treated as PyObjects, resulting in significantly degraded performance. Another performance improvement is achieved by explicitly stating the data types that are produced (here a 2D array of floating-point numbers, nb.float64[:,:]) and those that are given as arguments (two 1D integer arrays, nb.int64[:], and two 1D floating-point arrays, nb.float64[:]). Because the variables have to be statically typed, the (complex-valued) amplitude cannot be stored in the final intensity array (which is real valued). This is why a temporary complex variable is created on line 6 and used in line 9. Finally, it is important to mention that Numba operates more efficiently when using Python's math and cmath (for complex numbers) libraries rather than NumPy's mathematical functions (see e.g. line 9).
This function runs in 2 min 46 s, which is a ×1.43 acceleration compared with NumPy (see Table 1). The gain, compared with NumPy, is relatively small, and since significant code rewriting is needed when moving from NumPy to Numba, this option should only be considered if no vectorized solution exists. However, the gain is significant when compared with pure Python, and considering the little effort needed to compile the code, Numba appears as a straightforward solution for accelerating Python-based computing.
Examination of line 9 reveals that the complex exponential can be evaluated independently for each value of h, k, m and n, i.e. the evaluation of one value of the exponential is independent of the others. Therefore, significant improvement in the execution speed can be expected if these values are effectively evaluated in parallel, using the different cores of the CPU. Instead of using Python's range() function, we can use Numba's prange() function to select the loop to be parallelized. Below we have chosen to evaluate in parallel the values corresponding to different h/k values:
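The parallel variant can be sketched as follows (a reconstruction, names ours: the h and k loops are fused into a single prange loop, the result is stored in a 1D array to be reshaped by the caller, and the parallel and fastmath options are enabled as discussed in the text):

```python
import cmath
import numpy as np

try:
    import numba as nb
    prange = nb.prange
    numba_parallel = nb.jit(nopython=True, parallel=True, fastmath=True)
except ImportError:   # fallback: serial pure Python
    prange = range
    numba_parallel = lambda f: f

@numba_parallel
def laue_numba_par(n, m, h, k):
    nh, nk = h.size, k.size
    intensity = np.zeros(nh * nk)     # 1D result, reshaped to (nh, nk) by the caller
    for p in prange(nh * nk):         # single fused loop over all (h, k) pairs
        i, j = p // nk, p % nk
        tmp = 0.0 + 0.0j
        for nn in n:
            for mm in m:
                tmp += cmath.exp(2j * cmath.pi * (h[i] * nn + k[j] * mm))
        intensity[p] = abs(tmp) ** 2
    return intensity
```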
For this purpose the loops over h and k have been condensed into a single loop (line 4) and the corresponding intensity is hence stored in a 1D array with 36 × 10^{4} elements (line 3), which can later be reshaped into a 2D 600 × 600 array. Additionally, the option parallel = True should be passed to the decorator. With this implementation and the hardware used in this work the computation time drops to 5.04 s, that is a ×47 acceleration compared with NumPy. The fastmath = True statement allows one to bypass the strict compliance to the IEEE 754 standard regarding floating-point number operations (Goldberg, 1991). Briefly, this refers to the fact that floating-point numbers have a finite precision and, hence, cannot exactly represent all real numbers. An important consequence of this is that floating-point operations are not associative (Goldberg, 1991) and compilers usually perform calculations in the strict order defined by the code. The fastmath option allows this condition to be lifted, and the compiler is able to reassociate floating-point operations to optimize performance. Without this option the above function is ∼25% slower, i.e. the computation time is 6.3 s instead of 5.04 s. In the present case, this does not introduce any computing error since the result is strictly equal to that of the pure Python implementation (Table 1). However, in general, relaxing the IEEE standard compliance can introduce significant bias and this should be used carefully.
3.1.5. Pythran implementation
Let us now consider the case of Pythran. The corresponding code is the following:
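A Pythran kernel is an ordinary Python file whose exported signature is declared in a comment, so the sketch below (a reconstruction; the export types, the file name and the OpenMP directive are our assumptions) also runs uncompiled under CPython. Compiling it with `pythran -fopenmp laue_pythran.py` produces the native module:

```python
# laue_pythran.py
# pythran export laue_pythran(int[], int[], float64[], float64[])
import cmath
import numpy as np

def laue_pythran(n, m, h, k):
    intensity = np.zeros(h.size * k.size)
    #omp parallel for
    for p in range(h.size * k.size):
        i, j = p // k.size, p % k.size
        tmp = 0.0 + 0.0j
        for nn in n:
            for mm in m:
                tmp += cmath.exp(2j * cmath.pi * (h[i] * nn + k[j] * mm))
        intensity[p] = abs(tmp) ** 2
    return intensity
```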
Line 1 is a so-called `magic command' which allows the Pythran compiler to be used within Jupyter notebooks; the -fopenmp option tells the compiler to use OpenMP parallelization where indicated. Line 2 indicates the name of the function to be compiled, together with the type of the arguments (similarly to Numba). Finally, line 3 loads the external libraries needed to build the compiled code. The rest of the code is identical to pure Python or Numba. The performance is slightly better than that of Numba, with an execution time of 4.7 s, i.e. a ×50 acceleration compared with NumPy. Similarly to what is observed for Numba, the performance boost principally originates from the parallelization (when the code runs on a single thread, the speedup is ×1.62).
3.1.6. Cython implementation
Finally, the Cython implementation of the Laue function is given below:
The first striking observation is that this block of code is significantly less readable than a NumPy or even a Numba/Pythran implementation (30 lines versus 10–15 lines). Let us briefly review this code:
Line 1 allows use of Cython within a notebook and specifies that the code has to be compiled using OpenMP parallelization and linked to the OpenMP library.
Lines 2–4 load external libraries.
Lines 5–7 overwrite some Python functions (here the complex exponential and the complex modulus) with native C functions (improves performance).
Lines 8 and 9 are decorators which specify that, when using arrays, Cython is not compatible with negative indexing and does not check whether the array indices are in the bounds of the actual array range (improves performance).
Line 10 is the function definition with its argument types. The syntax double[::1] is called a `typed memory view' and allows efficient access to memory buffers (with contiguous data storage in RAM), such as those underlying NumPy arrays.
Lines 11–15 are the declaration of all variable types used in the code (similarly to what is done with other statically typed languages such as C, Fortran etc.).
In line 22, similarly to what is done with Numba, Python's range instruction is replaced with Cython's prange instruction to indicate the loop to be parallelized. This instruction takes the additional option nogil = True, which indicates that Python's global interpreter lock (GIL) has to be deactivated. Briefly, the GIL is a feature of Python that prevents several threads from simultaneously using Python objects. Although this feature is useful for Python's memory management, it inherently prevents multithreading and, contrary to Numba and Pythran, Cython requires an explicit declaration to deactivate the GIL. The rest of the code is similar to the previous implementations.
Performance-wise, this code runs in 4.98 s (i.e. a ×47.6 speedup compared with NumPy), which is extremely close to Pythran and Numba (the performances are actually contained within three standard deviations). Considering the complexity of this code one might wonder, why use Cython? Actually, for such a simple case, Cython is definitely not an interesting option. In the next sections, we will show that, in some cases, Cython allows access to optimization levels that are not possible with other options.
3.1.7. Discussion
Figs. 2(a) and 2(b) show the 2D and 1D intensity distributions (zoomed at ±0.2 around the Bragg position). As expected, the result equals the exact solution to equation (1), namely sin^{2}(πNH)sin^{2}(πNK)/sin^{2}(πH)sin^{2}(πK). Given that an exact solution exists, the previous implementations are of course useless and have only been presented for didactic purposes. It should also be emphasized that for this class of problem, i.e. computing a Fourier series, if an exact solution cannot be worked out, the direct summation approach is, in general, a poor choice of algorithm as the computation time scales as ∝N^{4} (where N is the number of unit cells per direction^{5}). On the other hand, the fast Fourier transform (FFT) algorithm (Cooley & Tukey, 1965), which is the state-of-the-art algorithm to compute Fourier series, scales as ∝N^{2}log(N^{2}), resulting in orders of magnitude faster computations. In the present case it is ∼13 860 times faster than the NumPy implementations and more than 250 times faster than any parallelized implementation (the calculations have been performed using NumPy arrays and the FFT algorithm included in the NumPy library). This illustrates the fact that, when possible, using existing, well tested and heavily optimized algorithms is much more important than optimizing a mediocre algorithm.
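For completeness, the FFT route can be sketched as follows (our example, not the article's notebook): the Laue intensity is the squared modulus of the discrete Fourier transform of the crystal's support function, and zero-padding the N × N support to sN × sN points sets the reciprocal-space sampling to 1/(sN):

```python
import numpy as np

N, s = 100, 6                       # unit cells per side, oversampling rate
support = np.zeros((s * N, s * N))
support[:N, :N] = 1.0               # square crystal: shape function is 1 inside
intensity = np.abs(np.fft.fft2(support)) ** 2
# the maximum, at the reciprocal-lattice node, equals N**4
```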
Another point to take into account is whether the increase in speed results in a lowering of the arithmetic precision. To quantify this effect, we evaluate the maximum deviation from the exact calculation normalized to the maximum intensity, i.e.
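a quantity of the form (reconstructed from this description):

```latex
\epsilon = \frac{\max_{H,K} \left| \, I_{\rm exact}(H,K) - I(H,K) \, \right|}{\max_{H,K} \, I_{\rm exact}(H,K)}
```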
The results are given in Table 1. All errors are below one part in 10^{13}, which is several orders of magnitude lower than the dynamic range that can be experimentally achieved on diffractometers, even at synchrotron facilities (∼10^{8}–10^{10}). Moreover, as soon as lattice disorder is present, the experimental dynamic range rapidly decreases (Favre-Nicolin et al., 2011), so that the present implementations can be used safely without worrying about the numerical precision.
To conclude this first part, it is worth mentioning that the results presented here should in no way be taken as general. The performance of a given library depends on many external parameters (i.e. it does not solely depend on the implementation of the algorithm). For example, the size of the problem is a common factor influencing performance. Fig. 3 shows the evolution of the execution time and speedup as a function of the number of unit cells in the crystal (all other conditions being identical). All implementations effectively converge to a ∝N^{4} behavior for large N^{2} values, with more or less pronounced deviations at smaller values [Fig. 3(a)]. Using NumPy as a time reference, it can be observed that the speedup of all parallel implementations (JIT or AOT) increases with the crystal size [Fig. 3(b)]. All (single-threaded) compiled codes exhibit an acceleration of a few tens of percent compared with NumPy, which mainly illustrates the fact that NumPy is an extremely well optimized library with programs running close to what is observed with compiled languages.
The two JIT compilers (NumExpr and Numba) seem to suffer from an increased overhead when using multithreading for crystal sizes smaller than 10 × 10 unit cells. However, above this limit, NumExpr allows one to accelerate NumPy code by ×6 to ×16 with almost no code rewriting, whereas Numba converges to the values exhibited by the AOT compilers, Pythran and Cython. These last two allow Python code to be accelerated by ×8 to ×50 for increasing crystal size.
In the next section we consider a slightly more complicated example, for which no exact solution can be worked out and where the FFT algorithm might yield incorrect results.
3.2. Circular crystal with strain
3.2.1. Implementation
Let us now consider a case where the crystal does not have an orthogonal shape and additionally exhibits strain. The first point implies that the unit-cell coordinates cannot simply be represented by a rectangular array, and it is necessary to introduce a support function equal to 1 when the unit-cell coordinate is located inside the crystal and 0 otherwise. For conciseness, a 2D circular crystal is considered, but, as in the previous case, the extension to three dimensions is straightforward. This yields A(H, K) = ∑_{n_{1}}∑_{n_{2}} s(r) exp[2πi(Hn_{1} + Kn_{2})],
where r = (n_{1}^{2} + n_{2}^{2})^{1/2} is the distance from the center of the crystal, R is the crystal radius and s(r) is the support function (s = 1 for r ≤ R and 0 otherwise). The second point implies that unit cells are shifted from their regular position according to n′ = n + Δn, where Δn is the displacement. For the case of a circular crystal, a function describing a progressive dilatation when moving from the center towards the periphery can be written:
where Δr(r) is the displacement of the unit cell located at a distance r from the crystal center, e_{0} is the maximum strain and w is a parameter describing the width of the displacement profile (for small values of w the strain is confined to the periphery of the crystal, whereas for large values it also affects the interior of the crystal). The NumPy implementation is as follows:
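The listing itself is provided in the notebooks (the line numbers quoted in the text refer to it). For readers without the supporting information, the following self-contained sketch reproduces its logic; since the strain profile of equation (3) is not reproduced here, an exponential profile confined near the periphery is assumed, and the array sizes are illustrative:

```python
import numpy as np

def strained_disc_intensity(N=20, e0=0.01, w=5.0, nH=5, nK=5):
    """Intensity from a 2D circular crystal with a radial strain gradient."""
    n1 = (np.arange(N) - N / 2)[:, None, None, None]   # axes: (n1, n2, H, K)
    n2 = (np.arange(N) - N / 2)[None, :, None, None]
    radius = np.sqrt(n1**2 + n2**2)                # distance from the center
    support = np.where(radius <= N / 2, 1.0, 0.0)  # 1 inside, 0 outside
    # assumed strain profile (equation (3) is not reproduced in the text):
    # maximum strain e0 at the periphery, decaying inwards over a width w
    strain = e0 * np.exp((radius - N / 2) / w)
    H = np.linspace(-0.2, 0.2, nH)[None, None, :, None]
    K = np.linspace(3.8, 4.2, nK)[None, None, None, :]
    # displaced coordinates n' = n(1 + dr/r); sum the phases over all cells
    tmp = support * np.exp(2j * np.pi * (H * n1 * (1.0 + strain)
                                         + K * n2 * (1.0 + strain)))
    return np.abs(tmp.sum(axis=(0, 1)))**2
```

The two empty trailing dimensions of radius, support and strain are what allow them to broadcast against the reciprocal-space coordinates, as described in the text.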
The radius from the center of the crystal is computed in line 7, and it is a 2D array (with two additional empty dimensions). The support function is computed in line 8 and sets to 0 all values for which the radius is larger than the crystal radius (here chosen as R = N/2). The strain Δr/r is computed in line 9 and, similarly to the radius and the support function, it is a 2D array with two additional empty dimensions dedicated to receiving the reciprocal-space coordinates (line 10). Strictly speaking, it is not necessary to explicitly define the support function (line 8), in the sense that we could limit the evaluation of the tmp variable only to those regions where the condition radius < N/2 is fulfilled. Thereby we would save the corresponding memory space (although without improvement regarding the execution speed). We nonetheless keep this implementation here for didactic purposes; the modified version is implemented in the following versions below. Afterwards, similarly to the Laue function, the summation over the unit-cell coordinates is performed using the sum method of NumPy. As observed in the previous section, the broadcasting and vectorizing properties of NumPy allow for a painless implementation of mathematical equations.
Fig. 2(b) shows a section along the K direction (performed at H = 0 and around K = 4) with e_{0} = 0.01 and w = 20, and Fig. 2(c) shows a portion of the 2D intensity distribution. Fig. 2(d) shows the distribution of Δr/r within the crystal computed with equation (3). The crystal is under tensile strain, so that the peak is shifted towards lower K values and, since the distribution of strain is inhomogeneous, the intensity distribution is broadened and asymmetric compared with the Laue function.
This implementation runs in ∼299 s, which is more than 100 times faster than the pure Python implementation (given in the corresponding notebook). In this case, the acceleration is more than 10 times greater than in the previous section, i.e. the Python implementation required more than 9 h instead of 37 min for the Laue function, whereas the NumPy implementation only required ∼62 s more (299 versus 237 s). This is because the relative efficiency of NumPy increases when the number of floatingpoint operations within the loops increases.
The NumExpr implementation is similar to the NumPy code and only differs by calls to the ne.evaluate() function:
Notice that in this implementation we did not explicitly define the support function and we merged its definition with the evaluation of the complex exponential (line 11). Among all (single-threaded) implementations, NumExpr exhibits the best performance with a ×1.32 speedup, whereas the worst result is obtained with Numba with a ×1.07 speedup compared with NumPy (see Table 1). As mentioned in the previous section, this again illustrates the efficiency of the NumPy library. In the rest of the article we shall therefore focus on the results obtained with the parallel versions of the algorithms, since this is where the best performance is obtained.
The parallel version of the NumExpr algorithm is obtained by modifying the n_cpu variable. The resulting speedup is ×12.4. The Pythran parallel implementation is given below (the Numba and Cython versions are similar and can be found in the corresponding notebook in the supporting information):
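The Pythran listing for this case is in the notebook; a sketch in the same spirit is given below. It is plain Python (so it runs uncompiled, only slowly), with the Pythran export line and the OpenMP annotation kept as comments; the strain profile is again an assumed exponential since equation (3) is not reproduced here:

```python
import numpy as np

# pythran export circle_strain(float64[], float64[], int, float, float)
def circle_strain(H, K, N, e0, w):
    result = np.zeros((H.shape[0], K.shape[0]))
    # omp parallel for
    for i in range(H.shape[0]):
        for j in range(K.shape[0]):
            tmp = 0j
            for n1 in range(-N // 2, N // 2):
                for n2 in range(-N // 2, N // 2):
                    r = (n1 * n1 + n2 * n2) ** 0.5
                    if r > N / 2:
                        continue        # outside the circular support: skip
                    strain = e0 * np.exp((r - N / 2) / w)  # assumed profile
                    phase = (H[i] * n1 * (1.0 + strain)
                             + K[j] * n2 * (1.0 + strain))
                    tmp += np.exp(2j * np.pi * phase)
            result[i, j] = abs(tmp) ** 2
    return result
```

The continue statement discussed below plays the role of the where function of the NumPy version: cells outside the disc simply do not contribute to the sum.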
From the earlier descriptions, this code is rather straightforward to understand. The only novelty here is the continue statement (line 15). This instruction is used in conjunction with the preceding for instruction and continues to the next iteration of the for loop if the if condition is fulfilled, i.e. this corresponds to the where function used with NumPy. This implementation is the most efficient, with a speedup of ×47.7 (very close to the value obtained for the Laue function), whereas Numba and Cython, respectively, yield speedups of ×37.8 and ×45.6.
3.2.2. Discussion
Let us briefly consider the FFT algorithm. As shown in Table 1, it again outperforms any of the above implementations with an execution time of 0.046 s. However, as also shown in Table 1, the FFT implementation now yields an incorrect result with a relative error of 0.011. This is because, in the presence of lattice strain, replacing the Fourier series with a Fourier transform of exp(2πiH_{B}Δn) (where Δn is the displacement and H_{B} is the scattering vector at the Bragg position) is valid only when (H − H_{B})Δn ≪ 1 (Favre-Nicolin et al., 2011). This hypothesis fails for large strains and large deviations from the Bragg position. In such cases, despite its rather poor performance, the direct summation algorithm is the only remaining option. Moreover, the FFT algorithm can only be used when the data to be transformed (i.e. the unit-cell coordinates, atomic coordinates etc.) can be represented on a regular grid, like a 3D array of (x, y, z) coordinates. This constraint therefore excludes atomistic simulation results, for which the direct summation algorithm is thus the only remaining option to compute the diffracted intensity. The possibility to accelerate such computations can hence be relevant in such situations.
As mentioned in Section 2, in this work we are not considering the implementation on GPUs. However, it is worth mentioning the existence of a Python library called PyNX (Favre-Nicolin et al., 2011), specifically developed (among other tasks) to optimize the direct summation algorithm of the Fourier transform (https://ftp.esrf.fr/pub/scisoft/PyNX/doc/). Given its highly parallel nature, PyNX strongly benefits from the architecture of GPUs and yields far better results than the present implementations.
3.3. The Debye scattering equation
For didactic purposes, the two previous sections were focused on theoretical 2D crystals. We now consider a slightly more complex example, namely the scattering from actual nanocrystals. Assuming that the crystals exhibit all possible orientations (as in a powder sample), the intensity scattered by an ensemble of N atoms is correctly described by the Debye scattering equation (Warren, 1969): I(Q) = ∑_{i}∑_{j} f_{i} f_{j} sin(Qr_{ij})/(Qr_{ij}),
where Q = 4πsin(θ)/λ (θ being half the scattering angle and λ the radiation wavelength), f_{i} is the atomic scattering factor of the ith atom and r_{ij} are the distances between all possible pairs of atoms in the crystal. For the sake of simplicity we shall consider a monoatomic crystal. In such a case, the scattering equation can be simplified to I(Q) = f^{2}[N + 2∑_{i}∑_{j>i} sin(Qr_{ij})/(Qr_{ij})].
The extension to polyatomic crystals is straightforwardly obtained by computing equation (7) for all possible homoatomic and heteroatomic pairs.
An interesting feature of this equation is that it does not require the existence of an average lattice, so that, contrarily to the previous section, the intensity can be computed even from highly disordered crystals, or from liquids, gases etc. From a computational point of view, this also implies that one can make use of atomic coordinates obtained by atomic-scale calculations (like molecular dynamics, for instance) to compute the corresponding intensity and compare it with the diffraction pattern from actual nanostructured materials. The examination of equation (7) reveals that the calculation actually requires two distinct steps:
(i) The calculation of all pairwise distances, r_{ij}, from a given collection of (x, y, z) coordinates.
(ii) The calculation of the double sum.
3.3.1. Calculation of the pairwise distances
Computing all pairwise distances actually consists in computing the Euclidean distance matrix with components r_{ij} = ‖x_{i} − x_{j}‖,
where ‖·‖ denotes the Euclidean norm. All diagonal elements are zero (r_{ii} = 0) and the matrix is symmetric (r_{ij} = r_{ji}), so that only a triangular part needs to be computed. A naive Python implementation is as follows:
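The naive listing is in the notebooks; a sketch of it (function and variable names are ours) reads:

```python
import numpy as np

def pairwise_distances_naive(coords):
    """All pairwise distances r_ij (i < j) of an (N, 3) coordinate array."""
    N = coords.shape[0]
    r = np.zeros(N * (N - 1) // 2)
    l = 0
    for i in range(N - 1):
        for j in range(i + 1, N):
            d2 = 0.0
            for k in range(3):
                d2 += (coords[i, k] - coords[j, k]) ** 2
            r[l] = d2 ** 0.5
            l += 1                 # running index into the condensed array
    return r
```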
For a collection of N = 4753 atoms (with coordinates stored in the 2D [N × 3] array coords) this algorithm runs in 24.2 s (Table 2). As usual, far better results are to be expected with a NumPy implementation. For instance, developing equation (8) yields r_{ij}^{2} = ‖x_{i}‖^{2} + ‖x_{j}‖^{2} − 2x_{i}·x_{j}.
The corresponding NumPy code reads (Bauckhage, 2014)
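The listing (whose lines 2–5 are described below) can be sketched as follows; the np.maximum guard against small negative round-off values is our addition and may differ from the notebook version:

```python
import numpy as np

def pairwise_distances_numpy(coords):
    """Pairwise distances from the Gram (dot-product) matrix, cf. equation (9)."""
    N = coords.shape[0]
    G = coords @ coords.T                    # all dot products x_i . x_j
    H = np.tile(np.diag(G), (N, 1))          # ||x_j||^2 repeated over the rows
    d2 = np.maximum(H + H.T - 2.0 * G, 0.0)  # squared distances, round-off clipped
    r = np.triu(np.sqrt(d2))                 # zero the lower triangle
    return r[r > 0]                          # keep only the nonzero elements
```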

Line 2 computes the dot product; line 3 creates the matrix by extracting the diagonal of the dot product computed in line 2 and repeating (tile) it over N columns, and line 4 computes equation (9). Finally, line 5 sets to 0 all elements of the lower triangular matrix (triu) and only the nonzero elements are returned. This implementation runs in 1.374 s.
The NumExpr code is quite similar (see notebooks) and only differs from the previous NumPy implementation in lines 4 and 5 by calls to the ne.evaluate() function. The corresponding implementation runs in 814 ms, which is a ×1.68 speedup compared with NumPy.
Before proceeding to the parallelized implementations we note that the SciPy library has a collection of functions devoted to analysis of data structures representing spatial properties (scipy.spatial). Using this library, the Euclidean matrix can be computed in a single line of code:
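That single line is essentially the following (shown here with a three-atom toy array; in the article coords holds the 4753 gold positions):

```python
import numpy as np
from scipy.spatial.distance import pdist

coords = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0],
                   [0.0, 0.0, 12.0]])
r = pdist(coords)   # condensed vector of the N(N-1)/2 pairwise distances
```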
where pdist is the function that computes all pairwise distances within the list of coordinates coords. Interestingly this function runs at the same speed as a compiled implementation, 71.5 ms (×19.2), which outperforms both NumPy and NumExpr. This result, again, illustrates the importance of using thirdparty libraries that contain efficient and well tested code rather than reinventing the wheel.
Inspection of the Python implementation reveals that the algorithm could benefit from further acceleration by parallelizing the first for loop. The Pythran implementation is shown below:
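A sketch of that implementation is given below (runnable as plain Python; the export line and OpenMP annotation are comments). The closed-form index of line 10 is reconstructed here as an assumption consistent with the ordering of the naive version:

```python
import numpy as np

# pythran export pairwise_distances_par(float64[:, :])
def pairwise_distances_par(coords):
    N = coords.shape[0]
    r = np.zeros(N * (N - 1) // 2)
    # omp parallel for
    for i in range(N - 1):
        for j in range(i + 1, N):
            # closed-form index: threads cannot share an incremented counter
            l = i * N - i * (i + 1) // 2 + (j - i - 1)
            d2 = 0.0
            for k in range(3):
                d2 += (coords[i, k] - coords[j, k]) ** 2
            r[l] = d2 ** 0.5
    return r
```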
Apart from the Pythran-specific instructions (lines 1–3, 7) the code differs from the Python implementation only at line 10. Whereas in the Python implementation the index, l, of the distance array, r, was evaluated by incrementing its value at each iteration of the for loop over j, this is not possible here since these iterations are distributed among different threads. Although it is in principle possible to communicate between threads, the simplest solution is to evaluate the index l on the basis of the corresponding i and j values. The Numba and Cython implementations are similar (see notebook). The Pythran implementation yields the best results with a ×115.5 speedup (11.9 ms). Cython and Numba run in, respectively, 12.9 and 16.5 ms, with corresponding speedups of ×106.5 and ×83.3. The Pythran and Cython execution times only differ by 1 ms, which is within three standard deviations, while Numba appears to be slower.
At this point it is also interesting to briefly consider the performance of other languages. Python is often referred to as a glue language in the sense that it can be easily interfaced with other programming languages. For instance, the Fortran programming language has long been the favorite development tool of scientists and remains a very important language in the scientific community. The notebook Pairs_Fortran provides a detailed implementation. When a single thread is used, Fortran is ∼27% faster than Cython (65.0 versus 82.5 ms). When multiple threads are used, Fortran is still slightly faster than Cython (10.4 versus 12.9 ms) and very close to Pythran (11.9 ms). This result mainly illustrates the fact that a high-level programming language such as Python, when combined with appropriate libraries, is able to reach levels of performance equivalent to the most efficient compiled languages. Moreover, the fact that Python is easily interfaced with other languages allows reuse of existing code without having to rewrite a single line.
3.3.2. Summation over r_{ij}
The double sum in equation (7) can actually be condensed into a single loop over all indices of the Euclidean distance matrix r. The naive Python implementation is straightforward:
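A sketch of the naive listing follows, assuming the monoatomic form of equation (7) with the N term accounting for the i = j pairs; f_at is passed in as a precomputed array:

```python
import numpy as np

def debye_intensity(Q, r, f_at, N):
    """Direct evaluation of the Debye sum, monoatomic case (sketch)."""
    intensity = np.zeros(Q.shape[0])
    for i in range(Q.shape[0]):
        tmp = 0.0                  # accumulate in a scalar, not in the array
        for j in range(r.shape[0]):
            qr = Q[i] * r[j]
            tmp += np.sin(qr) / qr
        intensity[i] = f_at[i] * f_at[i] * (N + 2.0 * tmp)
    return intensity
```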
The first for loop runs over the values of the vector Q, whereas the second for loop performs the summation over all r_{ij} values. The variable f_at is the atomic scattering factor, which is evaluated using the Waasmaier & Kirfel (1995) method (see notebook). As for the Laue function, a tmp variable is used to store the result of the summation in order to avoid accessing the array from inside the loop. It takes 53 min to evaluate this function for 850 values of Q (corresponding to a 2θ range of 170° with a 0.2° step) and a gold nanocrystal containing 4753 atoms (see Table 2). This corresponds to a crystal with a ∼2.4 nm diameter exhibiting {100}, {110} and {111} facets (Fig. 4). The corresponding intensity distribution is labeled (2) in Fig. 4. The gold clusters were generated using the ase (atomic simulation environment) Python library (Hjorth Larsen et al., 2017; https://wiki.fysik.dtu.dk/ase/). The atomic structures were rendered with the VESTA program (Momma & Izumi, 2011).
The NumPy implementation is also straightforward to write:
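Up to variable names, the one-liner in question is equivalent to the following sketch, where broadcasting replaces both explicit loops:

```python
import numpy as np

def debye_intensity_numpy(Q, r, f_at, N):
    # Q as a column vector broadcasts against the vector of pairwise distances
    qr = Q[:, np.newaxis] * r[np.newaxis, :]
    return f_at**2 * (N + 2.0 * (np.sin(qr) / qr).sum(axis=1))
```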
The elegance of NumPy can here really be appreciated, with a calculation requiring a single line of code and a strong similarity with the mathematical formulation. The computation requires 7 min and 15 s (Table 2).
The NumExpr implementation requires some tweaking in order to benefit from multiprocessing: the sin(Qr)/(Qr) array should be evaluated independently from the summation; otherwise the workload is not dispatched over different threads, resulting in poor performance:
The speedup is appreciable, ×16.7, with a 26 s computing time. Finally, let us consider the compiled implementations. For conciseness, only the Pythran implementation is shown (the others can be found in the notebook):
Apart from the now-familiar Pythran-specific instructions (lines 1–3, 6) this code is rigorously identical to the Python implementation and runs in 5.73 s, that is a ×75.9 speedup. Cython performs similarly (5.71 s), whereas Numba here clearly fails to operate optimally (18.3 s), although no clear reason could be identified. This illustrates the fact, already mentioned in Section 2, that the performance of different libraries should not be taken for granted on the basis of a single example, and the performance of different implementations should always be critically compared.
3.3.3. The histogram approximation
Examination of Table 2 reveals that most of the computation time is spent in the evaluation of the sum over r_{ij}. This is easily understandable since this implies the evaluation of a transcendental function (i.e. a function that cannot be described by a sequence of simple algebraic operations, here the sine function) for a large number [N(N − 1)/2] of r_{ij} values. A widely used method to reduce the number of floating-point operations is to cast the r_{ij} values into a histogram which enumerates all atomic pairs separated by distances between r_{i} and r_{i} + dr, where dr is the bin width (Glatter & Kratky, 1982). With this approximation the Debye scattering equation can be rewritten I(Q) = f^{2}[N + 2∑_{i} n_{i} sin(Qr_{i})/(Qr_{i})],
where n_{i} is the number of atomic pairs in the bin corresponding to the distance r_{i}.
In contrast to the previous case, this approximation therefore requires the evaluation of the distance histogram. This implies the generation of a series of bins and counting the number of r_{ij} values falling inside each bin. NumPy has a built-in histogram generation function which turns out to be too slow to be usable in the present case.
A fast histogram algorithm could be as follows:
(i) Loop over all r_{ij} values.
(ii) For each r_{ij} value compute the corresponding bin index: (r_{ij} − r_{0})/dr (where r_{0} is the distance corresponding to the first bin).
(iii) Increment the corresponding value of the histogram by 1.
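Steps (i)–(iii) can be sketched in plain Python before turning to Cython (the bounds check is our addition):

```python
import numpy as np

def fast_histogram(r, r0, dr, nbins):
    """One pass over r with a constant-time bin update per value."""
    hist = np.zeros(nbins, dtype=np.int64)
    for k in range(r.shape[0]):
        b = int((r[k] - r0) / dr)   # bin index of r_k
        if 0 <= b < nbins:          # guard against out-of-range values
            hist[b] += 1
    return hist
```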
The Cython implementation is given below:
The Cython-specific instructions (explained in Section 2) are given in lines 1–7. Lines 8–18 declare and define the variables needed to compute the index of the bins and its value. Lines 19–21 strictly correspond to the fast-histogram algorithm outlined previously. This code runs in 0.052 s, in addition to the 0.0119 s already needed to generate the Euclidean distance matrix (Table 2). However, this additional time will allow us to achieve a tremendous speedup in the evaluation of the summation. Before proceeding to this summation, one might wonder whether this fast histogram could be further accelerated using multithreading.
A straightforward parallelization of the for loop (by replacing the range instruction with a prange instruction, line 18) results in erroneous results since there is no mechanism that prevents the different threads from writing in the same memory block at the same time, hence resulting in a wrong numbering of distances in the histogram. This issue can be circumvented by allocating a different histogram to each thread and then combining them together. The corresponding code is as follows:
Lines 22–24 correspond to the same algorithm as previously; the difference lies in the fact that the local histogram (i.e. corresponding to a specific thread) is correctly updated by using the correct thread number [via the threadid() function]. Lines 25–27 then simply add up the different local histograms. This type of fine-grained optimization is only possible with Cython and is one of the advantages of using this library over Pythran or Numba.
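The thread-local strategy can be mimicked sequentially in pure Python (each chunk below stands in for the iterations of one thread; in the Cython version the chunks run concurrently and threadid() selects the row):

```python
import numpy as np

def histogram_per_thread(r, r0, dr, nbins, nthreads=4):
    local = np.zeros((nthreads, nbins), dtype=np.int64)  # one histogram per thread
    for t, chunk in enumerate(np.array_split(r, nthreads)):
        for v in chunk:
            local[t, int((v - r0) / dr)] += 1
    return local.sum(axis=0)        # merge the local histograms
```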
For the nanocrystal used here, the parallel implementation actually runs slower than the singlethreaded implementation (70 ms instead of 52 ms). This is due to the overhead in spawning the different threads compared with the number of floatingpoint operations performed in each thread. For larger crystals (i.e. with a larger number of atoms), significant accelerations are obtained. For instance for a 10 times larger crystal (48 500 atoms), the speedup is ×7.5 (0.219 s versus 1.64 s).
Finally, let us compute the scattered intensity using equation (10). For consistency, we stay with the Cython implementation:
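In Python form (a sketch of the same algorithm, without the Cython typing), the histogram-weighted sum reads:

```python
import numpy as np

def debye_intensity_hist(Q, bins, n_pairs, f_at, N):
    """Histogram-weighted Debye sum: loop over bins, not over all r_ij."""
    intensity = np.zeros(Q.shape[0])
    for i in range(Q.shape[0]):
        tmp = 0.0
        for j in range(bins.shape[0]):           # one term per bin
            qr = Q[i] * bins[j]
            tmp += n_pairs[j] * np.sin(qr) / qr  # n_i pairs at distance r_i
        intensity[i] = f_at[i] * f_at[i] * (N + 2.0 * tmp)
    return intensity
```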
This implementation differs from the one presented in Section 3.3.2 in lines 21–22: the for loop runs over bins (instead of r_{ij}), and the sum is weighted by the number of atomic pairs within the bin (w) [equation (10)]. With a bin width dr = 5 × 10^{−4} Å, this code runs more than 100 times faster than without the histogram approximation, that is a ×9477 acceleration compared with NumPy.
However, as clearly stated earlier, this procedure is an approximation; it is therefore necessary to evaluate the error it induces. As in Section 3.1, we evaluated the maximum relative error, which here amounts to 9 × 10^{−4}. Although this is several orders of magnitude higher than the errors observed without this approximation (10^{−16}–10^{−11}), this remains acceptable, especially in the case of nanocrystals for which the observed dynamic range rarely exceeds 10^{4}. An error of 9 × 10^{−4} would therefore be hidden in the background noise. Obviously, the error can be reduced by reducing the bin width at the expense of an increased computing time. A compromise between speed and accuracy therefore has to be made.
Similarly to the Laue equation, the Debye scattering equation belongs to this class of `embarrassingly parallel' problems, i.e. a class of problems for which the parallelization is straightforward and might hence benefit the most from multicore and other massively parallel architectures. As such, GPUs provide the best results for that type of problem and there have been several successful attempts to compute the Debye scattering equation on that type of device. Restricting ourselves to open-source solutions, for which the code can hence be reused and redistributed, one can cite for example the PHAISTOS program (Antonov et al., 2012) and the XaNSoNS program (Neverov, 2017), the latter being partly written in Python. Although we did not test these programs in the present work, it can be expected that better results will be obtained because of the massively parallel architecture of GPUs. However, as outlined earlier, this increase in performance comes with an increased development complexity and more limited hardware support.
3.4. The Takagi–Taupin equation
Films and multilayers are another important class of materials. In the case of high-quality heteroepitaxial materials, the calculation of the diffracted intensity has to be performed in the framework of the dynamical theory of diffraction, rather than within the kinematical theory of diffraction that has implicitly been used so far in the present article. Within the dynamical theory of diffraction, the Takagi–Taupin equations are coupled differential equations which describe the scattering from distorted crystals (Takagi, 1969; Taupin, 1964). In the case of epitaxial films and multilayers which are laterally homogeneous and infinite (compared with the coherence length of the incident radiation), a simple recursive solution to the Takagi–Taupin equations can be derived (Bartels et al., 1986).
Below, we address the case of irradiated single crystals where radiation damage (consisting of lattice strain and random lattice displacements) is confined in a thin subsurface layer. Such materials can be described as arbitrary multilayers where each individual layer exhibits a distinct level of strain and disorder.
It is not within the scope of this article to provide the details of these equations; we here only give the minimum number of equations needed to understand the algorithms. The solution of Bartels et al. (1986) allows one to compute the X-ray amplitude ratio at the interface n + 1, knowing the ratio at the interface n: X_{n+1} = η + (η^{2} − 1)^{1/2}(S_{1} + S_{2})/(S_{1} − S_{2}), where S_{1,2} = [X_{n} − η ± (η^{2} − 1)^{1/2}]exp[∓iT(η^{2} − 1)^{1/2}].
The amplitude ratio X is a function of the electric field amplitude ratio, the structure factors and the geometry of the measurement. η is the deviation parameter of dynamical theory, i.e. the deviation from the center of the Bragg peak, corrected for refraction and absorption, and T is a function of the individual layer thickness, the structure factors and the geometry of the measurement (Bartels et al., 1986). In order not to overload the notebooks, several quantities have been precomputed and are loaded from text files (the strain and disorder depth profiles as well as the instrumental resolution function) or hardcoded in the notebook (the structure factors). A more detailed description can be found elsewhere (Boulle & Debelle, 2010; Souilah et al., 2016).
The NumPy implementation is as follows:
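The NumPy listing (described line by line below) is in the notebooks. The sketch below implements the recursive solution of Bartels et al. (1986) in a simplified, self-contained form: eta and T are taken directly as inputs, layer 0 plays the role of the perfect crystal, the structure-factor and resolution machinery of the article is omitted, and the branch choice for the perfect crystal is a standard assumption:

```python
import numpy as np

def bartels_recursion(eta, T):
    """Sketch of the recursive Takagi-Taupin (Bartels) solution.

    eta: deviation parameter, shape (n_layers, n_angles); layer 0 stands for
    the semi-infinite perfect crystal. T: layer phase-thicknesses, same shape.
    """
    eta = np.asarray(eta, dtype=complex)
    T = np.asarray(T, dtype=complex)
    s = np.sqrt(eta[0] * eta[0] - 1.0)
    sign = np.where(eta[0].real >= 0, 1.0, -1.0)
    X = eta[0] - sign * s                  # amplitude ratio of the perfect crystal
    for n in range(1, eta.shape[0]):       # recursion through the damaged layers
        s = np.sqrt(eta[n] * eta[n] - 1.0) # stored once per layer
        s1 = (X - eta[n] + s) * np.exp(-1j * T[n] * s)
        s2 = (X - eta[n] - s) * np.exp(1j * T[n] * s)
        X = eta[n] + s * (s1 + s2) / (s1 - s2)
    return np.abs(X)**2                    # reflectivity
```

For a bare perfect crystal (a single layer) this reproduces the Darwin-like profile: total reflection (|X|^{2} = 1) inside |η| < 1 and decaying tails outside.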
Lines 2–3 compute the amplitude scattered by the perfect (unirradiated) crystal. Lines 5–15 compute the scattering from the strained and disordered region above the perfect crystal. This is achieved by dividing this region into layers with thickness t_l, and the scattering from a given layer is computed with the knowledge of the scattering from the previous layer. The information concerning strain and disorder is contained in the thB (Bragg angle) and DW (Debye–Waller factor) arrays, respectively. FH, FmH and wl are the structure factors of the hkl and h̄k̄l̄ reflections and the X-ray wavelength, respectively. The quantity (eta*eta-1)**0.5 is stored in a separate variable (line 11) to avoid multiple evaluations.
Starting from the interface with the perfect crystal, the scattering from the whole `crystal + damaged region' system is progressively computed for a given angle. The recursive nature of this algorithm makes it impossible to vectorize. However, as can be observed in line 1, this function takes an array of angles (th) as an argument, indicating that the computation for the different scattering angles can still be vectorized, which will result in acceptable performance. Indeed, typical computation times are around 0.01–0.1 s depending on the size of the crystal. Fig. 5 shows the 400 diffraction profile of a (100)-oriented ZrO_{2} single crystal irradiated with 300 keV Cs ions (Boulle & Debelle, 2010). The computed curve in Fig. 5 has been generated in 18.5 ms (see Table 3). This data set comprises 501 intensity values; the strained region is divided into 200 slices, i.e. the recursive loop runs over 200 elements for each intensity point.

Although the recursive loop cannot be parallelized, the loop over the different scattering angles should in principle benefit from parallelization. Although it might seem useless to accelerate a program that already runs in 18.5 ms, it must be borne in mind that such calculations are in general implemented as part of a least-squares fitting procedure, in order to extract the structural characteristics of the films/multilayers, which implies several hundreds to several thousands of function evaluations.
Somewhat unexpectedly, NumExpr runs slower than NumPy (×0.45), whereas Numba provides a modest ×4.51 acceleration (Table 3). The most likely reason for these negative and modest accelerations is that, as shown in Fig. 3, the number of floatingpoint operations is too small compared with the overhead associated with JIT compilation and multithreading initialization. Here again, the best results are obtained with Pythran and Cython. Below is the Pythran code:
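A sketch of such a Pythran version is given below (plain Python with the annotations kept as comments; simplified inputs as an assumption: eta and T are given directly and the structure-factor prefactors are omitted, the recursion formula following Bartels et al. (1986)):

```python
import numpy as np

# pythran export bartels_recursion_loop(complex128[:, :], complex128[:, :])
def bartels_recursion_loop(eta, T):
    """Loop-over-angles variant of the recursive Takagi-Taupin solution."""
    nlayers, nth = eta.shape
    R = np.zeros(nth)
    # omp parallel for
    for i in range(nth):                     # explicit loop over the angles
        s = np.sqrt(eta[0, i] * eta[0, i] - 1.0)
        X = eta[0, i] - s if eta[0, i].real >= 0 else eta[0, i] + s
        for n in range(1, nlayers):          # recursive loop (not parallelizable)
            s = np.sqrt(eta[n, i] * eta[n, i] - 1.0)
            s1 = (X - eta[n, i] + s) * np.exp(-1j * T[n, i] * s)
            s2 = (X - eta[n, i] - s) * np.exp(1j * T[n, i] * s)
            X = eta[n, i] + s * (s1 + s2) / (s1 - s2)
        R[i] = abs(X) ** 2
    return R
```

Only the outer (angle) loop carries the OpenMP annotation; each iteration is independent, whereas the inner recursion over the layers must remain sequential.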
The only significant difference from the previous NumPy implementation is the explicit loop over the angles (line 9). The execution time drops down to 1 and 0.87 ms for Pythran and Cython, respectively, which correspond to an acceleration of ×18.5 and ×21.2, the difference between the two being negligible (smaller than three standard deviations). In all cases, the errors induced by the acceleration are negligibly small (Table 3).
4. Conclusion
Within the Python ecosystem, the NumPy library is the de facto standard when it comes to scientific computing. As long as the algorithms are appropriately vectorized and the memory is large enough to store the arrays, NumPy can achieve high computational performance while keeping clean and simple code, close to the mathematical notation. Used in combination with the NumExpr library, simple NumPy code can benefit from multicore CPUs as well as optimized memory management, with very little code modification.
In cases where it is not possible to vectorize the algorithms, or when increased performance is critical, one must make use of compilers that translate Python code into statically typed code that also provides an improved support of multicore architectures. We have shown that Pythran and Cython in general exhibit very similar performance and, given the heavier syntax of Cython, Pythran is easier to implement. Cython, on the other hand, allows access to more advanced options regarding thread and memory management. Within the examples examined in this article and with the present hardware, accelerations ranging between 1 and 2 orders of magnitude (compared with NumPy) can be obtained while staying in the Python ecosystem. Finally, the performance of Numba seems to be less predictable than that of Pythran and Cython. However, Numba is a relatively recent project with very interesting features, such as compatibility with GPUs, and it should therefore not be excluded when looking for highperformance Python programming.
Finally, the ability of Python to interface with other languages such as C, C++ or Fortran makes it a prominent language for the development of libraries combining an easy syntax with the performance of compiled languages. In the field of crystallography this is for instance the case of the Computational Crystallography Toolbox project, cctbx (Grosse-Kunstleve et al., 2002; https://cctbx.github.io/). The cctbx library is developed as a set of C++ classes that can be accessed via a Python interface. As such, cctbx can be used in combination with the libraries presented above. This library is also a component of several crystallographic software tools. cctbx is for instance used to compute scattering factors in the PyNX library mentioned earlier (Favre-Nicolin et al., 2011). In the field of macromolecular crystallography, it is the key component of the PHENIX program (Adams et al., 2010; https://www.phenixonline.org/) which is also written in Python. Besides cctbx, large-scale facilities are particularly active in the development of X-ray and neutron scattering data manipulation and visualization software based on Python; one can cite DIALS (Winter et al., 2018; https://dials.github.io/), Mantid (Arnold et al., 2014; https://www.mantidproject.org/), DAWN (Basham et al., 2015; Filik et al., 2017; https://dawnsci.org/about/) and silx (https://www.silx.org/).
Supporting information
Jupyter notebooks containing all examples discussed in the article. DOI: https://doi.org/10.1107/S1600576719008471/gj5229sup1.zip
Footnotes
^{1}Different versions of these libraries can produce slightly different timings, but this should not change the conclusions.
^{2}A JIT compiler compiles the code when it is first called, contrary to ahead-of-time compilers which require a separate compilation step.
^{3}Any Python object (variable, list, dictionary, class etc.) is implemented as a PyObject in C. It contains the object type and the references to this object by other objects.
^{4}In all code snippets, line numbers are only given to facilitate the description of the algorithm. They are not part of the code. The complete Python files are provided as Jupyter notebooks in the supporting information.
^{5}Usually the performance is expressed in terms of the total number of elements in the starting array, so that the asymptotic behavior is N^{2} and Nlog(N) for the direct summation and fast Fourier transform algorithms, respectively. In order to remain consistent with the notation used in the Laue equation, we keep N as being the number of unit cells in a given direction, hence the difference in the expressions of the asymptotic behaviors.
Acknowledgements
All figures have been generated with the scientific plotting program Veusz (https://veusz.github.io/). AB is thankful to D. André for the fruitful discussions about the performance of various languages.
References
Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.
Alted, F. (2010). Comput. Sci. Eng. 12, 68–71.
Antonov, L. D., Andreetta, C. & Hamelryck, T. (2012). Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms, pp. 102–108. SciTePress. https://doi.org/10.5220/0003781501020108.
Arnold, O., Bilheux, J. C., Borreguero, J. M., Buts, A., Campbell, S. I., Chapon, L., Doucet, M., Draper, N., Ferraz Leal, R., Gigg, M. A., Lynch, V. E., Markvardsen, A., Mikkelson, D. J., Mikkelson, R. L., Miller, R., Palmen, K., Parker, P., Passos, G., Perring, T. G., Peterson, P. F., Ren, S., Reuter, M. A., Savici, A. T., Taylor, J. W., Taylor, R. J., Tolchenov, R., Zhou, W. & Zikovsky, J. (2014). Nucl. Instrum. Methods Phys. Res. A, 764, 156–166.
Bartels, W. J., Hornstra, J. & Lobeek, D. J. W. (1986). Acta Cryst. A42, 539–545.
Basham, M., Filik, J., Wharmby, M. T., Chang, P. C. Y., El Kassaby, B., Gerring, M., Aishima, J., Levik, K., Pulford, B. C. A., Sikharulidze, I., Sneddon, D., Webber, M., Dhesi, S. S., Maccherozzi, F., Svensson, O., Brockhauser, S., Náray, G. & Ashton, A. W. (2015). J. Synchrotron Rad. 22, 853–858.
Bauckhage, C. (2014). Technical Report. https://doi.org/10.13140/2.1.4426.1127.
Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D. S. & Smith, K. (2011). Comput. Sci. Eng. 13, 31–39.
Boulle, A. & Debelle, A. (2010). J. Appl. Cryst. 43, 1046–1052.
Cooley, J. W. & Tukey, J. W. (1965). Math. Comp. 19, 297–301.
Favre-Nicolin, V., Coraux, J., Richard, M.-I. & Renevier, H. (2011). J. Appl. Cryst. 44, 635–640.
Filik, J., Ashton, A. W., Chang, P. C. Y., Chater, P. A., Day, S. J., Drakopoulos, M., Gerring, M. W., Hart, M. L., Magdysyuk, O. V., Michalik, S., Smith, A., Tang, C. C., Terrill, N. J., Wharmby, M. T. & Wilhelm, H. (2017). J. Appl. Cryst. 50, 959–966.
Glatter, O. & Kratky, O. (1982). Small Angle X-ray Scattering. London: Academic Press.
Goldberg, D. (1991). ACM Comput. Surv. 23, 5–48.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Guelton, S. (2018). Comput. Sci. Eng. 20, 83–89.
Guelton, S., Brunet, P., Amini, M., Merlini, A., Corbillon, X. & Raynaud, A. (2015). Comput. Sci. Disc. 8, 014001.
Hjorth Larsen, A., Jørgen Mortensen, J., Blomqvist, J., Castelli, I. E., Christensen, R., Dułak, M., Friis, J., Groves, M. N., Hammer, B., Hargus, C., Hermes, E. D., Jennings, P. C., Bjerre Jensen, P., Kermode, J., Kitchin, J. R., Leonhard Kolsbjerg, E., Kubal, J., Kaasbjerg, K., Lysgaard, S., Bergmann Maronsson, J., Maxson, T., Olsen, T., Pastewka, L., Peterson, A., Rostgaard, C., Schiøtz, J., Schütt, O., Strange, M., Thygesen, K. S., Vegge, T., Vilhelmsen, L., Walter, M., Zeng, Z. & Jacobsen, K. W. (2017). J. Phys. Condens. Matter, 29, 273002.
Lam, S. K., Pitrou, A. & Seibert, S. (2015). Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, 7. New York: ACM. https://doi.org/10.1145/2833157.2833162.
Momma, K. & Izumi, F. (2011). J. Appl. Cryst. 44, 1272–1276.
Neder, B. N. & Proffen, T. (2008). Diffuse Scattering and Defect Structure Simulations. A Cook Book Using the Program DISCUS. Oxford University Press.
Neverov, V. S. (2017). SoftwareX, pp. 63–68.
Oliphant, T. E. (2007). Comput. Sci. Eng. 9, 10–20.
Rossant, C. (2018). IPython Interactive Computing and Visualization Cookbook, 2nd ed. Birmingham: Packt Publishing.
Souilah, M., Boulle, A. & Debelle, A. (2016). J. Appl. Cryst. 49, 311–316.
Takagi, S. (1969). J. Phys. Soc. Jpn, 26, 1239–1253.
Taupin, D. (1964). Bull. Soc. Franç. Minér. Crist. 87, 469–511.
Waasmaier, D. & Kirfel, A. (1995). Acta Cryst. A51, 416–431.
Walt, S. van der, Colbert, S. C. & Varoquaux, G. (2011). Comput. Sci. Eng. 13, 22–30.
Warren, B. E. (1969). X-ray Diffraction. New York: Addison-Wesley.
Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. & Evans, G. (2018). Acta Cryst. D74, 85–97.
This article is published by the International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.