tomoCAM: fast model-based iterative reconstruction via GPU acceleration and non-uniform fast Fourier transforms

tomoCAM is an open-source implementation of model-based iterative reconstructions that reformulates the iterative tomographic reconstruction problem to leverage the faster time complexity of non-uniform fast Fourier transforms. Moreover, it harnesses the computational capabilities of modern GPUs, leading to significant performance gains compared with currently available publicly accessible tools. tomoCAM is accessible via Python and is compatible with NumPy, facilitating its incorporation into pre-existing workflows at synchrotron light sources.


INTRODUCTION
Micro-and nano-tomography using synchrotron technology is crucial in uncovering the inner makeup of modern materials, particularly in dynamic settings.Its diverse applications include the investigation of the fractures and deterioration of ceramic matrix composites, which are novel lightweight materials used in jet engines that operate under high temperatures and pressure (Forna-Kreutzer et al., 2021); the study of the flow of oil, brine, and carbon dioxide through rocks (Walsh et al., 2014); and the analysis of dendrite formation in batteries, which causes capacity reduction and eventual failure (Dienemann et al., 2023).Many synchrotron micro-CT facilities now have cameras that can acquire many-megapixel images at thousands of frames per second (MacDowell et al., 2012;Nikitin et al., 2022;Ge et al., 2018;Thuering et al., 2011;Mokso et al., 2017).These advances in instrumentation have encouraged users to push the boundaries of what can be imaged at synchrotron beamlines.An increasing number of investigators are conducting in-situ (Larson & Zok, 2018;French et al., 2022) and in-operando (Kulkarni et al., 2020;Dienemann et al., 2023) measurements.Typically the initial technique attempted for micro-CT reconstructions is the filtered back-projection (FBP) method (Herman, 2009), which is available in tomopy (Gürsoy et al., 2014) and tomocupy (Nikitin, 2023).However, in the case of many dynamic experiments, where the specimen under observation is changing rapidly, it is generally not possible to capture sufficient projections to satisfy angular Shannon sampling conditions (Crowther et al., 1970) and overcome the noise.FBP is not a suitable option in such situations.The reconstructions obtained through this method tend to have excessive noise levels and exhibit streaking artifacts, making it difficult or even impossible to carry out further analysis.
As an alternative, iterative methods, such as simultaneous iterative reconstruction technique (SIRT) (Tarantola & Valette, 1982) and model-based iterative reconstruction (MBIR) (Venkatakrishnan et al., 2013;Mohan et al., 2014) aim to mitigate these shortcomings.They formulate the reconstruction as an optimization problem.The solution is obtained through an iterative process that aims to minimize the mismatch between the measured data and a forward model (Radon transform) of a digital representation of the sample.This iterative approach enables the incorporation of prior knowledge, such as total-variation constraints, into the optimization process, as demonstrated in various studies (Trampert & Leveque, 1990;Zhang et al., 2014;Venkatakrishnan et al., 2013;Mohan et al., 2014).However, current CPU-based implementations of MBIR typically require a large compute cluster to achieve turnaround times that are comparable to data collection times.This not only adds extra time to the experiment-to-analysis loop but also places an additional burden on material scientists, who must acquire a new set of expertise in using a compute cluster.This paper introduces tomoCAM, a GPU-accelerated implementation of MBIR that is based on the NUFFT approach.With the computational power provided by modern GPU devices and the relatively affordable cost of computer memory, it has become possible to perform these reconstructions on a single machine within a reasonable amount of time.
To design our GPU-accelerated algorithm and implementation, we build Radon and back-projection operators based on Non-uniform Fast Fourier Transforms (NUFFT) (Greengard & Lee, 2004;Fessler & Sutton, 2003), which significantly reduces the computational cost, while maintaining high accuracy.We also leverage highly optimized cuFFT libraries that are native to the CUDA software development kit (NVIDIA et al., 2020).We follow the mathematical outline laid out in (Venkatakrishnan et al., 2013;Mohan et al., 2014) to add a total variation constraint, which helps in reducing noise while preserving the sharp edges.An important feature of our implementation is the flexibility to introduce a different constraint.The choice of the constraint is not limited by the algorithm design.
We test our computational framework through a series of numerical experiments on known phantoms and experimental data made publicly available through Tomobank (Carlo et al., 2018).We compare the reconstructions with those obtained from filtered back-projection (Gürsoy et al., 2014) and SVMBIR (Team, 2020), a publicly available CPU-based package.
In our numerical experiments, we found that: • when compared to FBP, the reconstruction quality produced by tomoCAM was superior with less noise, and required a lower number of projections, • additionally, we observed that tomoCAM was around 15 times faster on a single machine than SVMBIR. .Finally, we provide a python front-end that exposes tomoCAM functionality to the widely-used Numpy package (Harris et al., 2020), using Pybind11 (Jakob et al., 2017).This makes it easy to integrate tomoCAM into existing workflows.The code is freely available at https://github.com/lbl-camera/tomocam.

Radon Transform and Non-uniform Fast Fourier Transform
This section provides a brief overview of the fundamental concepts related to tomography, including the Radon transform (as well as its adjoint) and its connection to the Fourier transform.To perform tomographic measurements, a series of images, referred to as projections, are captured at various angles by rotating either the camera or the sample being studied.

Radon Transform
The Radon transform is fundamental to any tomographic reconstruction.It transforms a function f (x, z), x ∈ R 2 , z ∈ R, to R f (t, n, z), t ∈ R, n ∈ S 1 through a line integral (2), see fig. 1.Given a set of oriented lines t,n defined as t,n = {x : x, n = t} = {tn + sn ⊥ : where n⊥ is direction of the X-ray beam, n is perpendicular to beam in same plane as , and t is the distance to from the origin.The Radon transform Rf of function f is defined as,  2) that reconstructs f from the data, and is of primary importance in tomographic reconstruction.By the central slice theorem, the Fourier transform of the Radon transform of f in direction n is equivalent to the Fourier transform of f along n, i.e., where z is the dimension along the axis of rotation and F d denotes the d-dimensional Fourier transform.Radon transform and its adjoint are two-dimensional operators that are applied slice-by-slice on three-dimensional data.For simplification, we will drop the z dependency from the subsequent notations.Assuming f and F 1 f are integrable everywhere, the inverse of the Radon transform (2) is given by, where we denote the θ dependency as n(θ) = (cos θ, sin θ).
It is computationally very expensive to exactly compute equation ( 4).In practice, ( 4) is efficiently approximated with a Non-Uniform Fast Fourier Transform (Gürsoy et al., 2014) or directly estimated in real space through the use of various filters such as Shepp-Logan (Shepp & Logan, 1974), Ram-Lak (G.N. & A.V., 1971), and Butterworth (Butterworth et al., 1930), which approximate and weight by the Fourier sampling density, hence the name Filtered Back-projection.These filters are additionally designed to dampen out the higher Fourier frequencies.This is the most commonly used method in the reconstruction of tomographic data, in part because of the sheer speed by which the inversion can be performed.However, in cases when the view is partially blocked, or the specimen is evolving, it may not be possible to collect enough projections to sufficiently sample the Fourier space.In such cases, the Filtered Back-projection results in poor image quality.

Non-Uniform Fast Fourier Transform (NUFFT)
On a discrete uniform grid, inversion of the Radon transform entails computing Fourier coefficients along radial lines using a 1-dimensional Fast Fourier Transform (FFT), followed by 2-dimensional backward Fourier transforms from a non-uniform polar grid {k j = k j (cos θ j , sin θ j )} onto a Cartesian grid {(x n , y n )}, which can be represented as the summation c j e 2πikj (xn cos θj +yn sin θj ) , n ∈ [1, N ], (5) where c j is the Fourier coefficient at k j , N is number of discrete points that represent the sample density f on a uniform Cartesian grid, and M is the number of polar grid points representing the projection data.However, directly computing equation ( 5) is computationally expensive, as the complexity is O(M N ).Data taken at synchrotron light-sources can usually reach up to M = O(10 10 ) pixels, and the final reconstructed image size N has a similar order or magnitude for the final reconstructed image.Non-Uniform Fast Fourier transforms (NUFFTs) offer a precise and efficient method for computing equation (5).This method involves first computing the Fourier coefficients on a polar grid using a sequence of one-dimensional FFTs along radial lines.Then, the computed coefficients are convolved with a compactly supported spreading kernel ϕ, and this convolution is evaluated on a uniform grid.Subsequently, an inverse Fourier transformation is performed on the convolution values on the uniform grid, followed by division by the Fourier transform φ of the kernel, i.e., where {k c r } is a Cartesian grid, F 2 is the 2D Fourier transform, equations ( 6) and ( 8) are computed via fast Fourier transforms, and W is the spreading width of the convolution in equation ( 7).For an appropriately chosen kernel, the NUFFT has an error of if W is chosen to span approximately w = log 10 (1/ ) grid points per dimension.The computational complexity of equations ( 6)-( 8) is O(M logM t + w 2 N + N logN ), where M t is the number of points in the radial direction of the polar grid.Since w 2 M , this results in a massive speedup compared to the direct computation of equation ( 5).The Radon transform can similarly be computed by performing the above steps in reverse order.In this work we have used the cuFINUFFT (Shih et al., 2021) library to compute NUFFTs.For a detailed discussion on the topic, we refer the reader to (Dutt & Rokhlin, 1993;Fessler & Sutton, 2003;Greengard & Lee, 2004;Barnett et al., 2019;Barnett, 2021).Fig. 2. The NUFFT is used to transform intensity on a uniform grid to its Fourier transform on a polar grid and vice-versa.

Model-based Iterative reconstruction
An alternate approach to FBP methods is to rely on iterative methods, such as MBIR.Although these methods have longer processing times, they produce better-quality reconstructions when compared with FBP methods.This is especially noticeable when a smaller number of projections are available.This is because iterative methods are able incorporate a priori information as a constraint on the optimization process.We refer the reader to (ASTM-E1441-19, 2019) for a more detailed discussion.Iterative methods seek a solution f by minimizing the difference between its Radon transform and projection data b, i.e., Here we set up the objective function as a least-squared problem.The target is to iteratively search for f that minimizes the 2 -norm of difference between Rf and b while penalizing violation of the constraint by g.Now we differentiate equation ( 9) with respect to f and equate the result to 0. The gradient of the first term is where R * is the adjoint of equation ( 2), which is simply equation (4) without the scaling |k|.The operators R and R * can be efficiently computed using NUFFT.
In the results presented here, we choose H to be a total-variation penalty in equation ( 9).We follow the mathematical approach presented in (Venkatakrishnan et al., 2013;Mohan et al., 2014) to model H as a q-Generalized Gaussian Random Field (qGGRMF), where h is defined over 1-hop neighborhood of m, with m and n being integer coordinates on the threedimensional uniform grid.The weights w mn are the Gaussian weights that partition the unity and are inversely proportional to the distance between m and n.Hyper-parameters c, p, and σ are used to control the strength of the penalty term.The term H is an algebraic expression, and can easily be differentiated.
In this work, we have used a monotonic accelerated gradient method with restart detailed in (Giselsson & Boyd, 2014), but it is possible to use other optimizers.

Implementation
When it comes to implementing software solutions, performance is a critical factor.In this section, we discuss some of the important implementation details that have a significant impact on the performance of tomoCAM.These include factors such as memory management, and hiding PCIe latency efficient GPU caching.To achieve both high performance and user-friendliness, we utilize a blend of C++, CUDA, and Python.The data structures of tomoCAM are implemented in C++, while most of the mathematical functions are coded using CUDA.To efficiently handle large datasets, a two-tier partition scheme is employed to seamlessly stream data into and out of GPU memory.To address the vast number of pixels in a typical synchrotron micro-or nano-CT sinogram, which can exceed O(10 10 ), we have carefully optimized the memory usage in the implementation of tomoCAM.For instance, to minimize memory footprint, we pass large arrays that contain frequently accessed data such as the most recent solution, projection data, and gradient as references rather than copies, which is the default behavior in C++.We have implemented various strategies to minimize the memory footprint, including: • Quantities are never stored as complex numbers in the host memory.This additionally helps with the amount of data copied to and from the GPU memory.• Instead of duplicating data, partitions contain pointers to memory locations in the parent array.
• Gradients are updated in place when computing the total-variation constraint.
• Projection data is reordered into sinogram form for fast contiguous readouts.

GPU Optimizations
While GPUs are highly efficient in performing complex calculations, the latency over the PCIe bus remains a significant bottleneck for GPU-accelerated software implementations.In order to minimize runtime and maximize throughput from CPU to GPU memory, we employ a combination of techniques.These include asynchronous transfers, OpenMP threads, and a two-tier data partitioning scheme.The partitioning is done along the axis of rotation, with the data first divided into as many partitions as there are available GPU devices.Each partition is then further subdivided into smaller chunks, with the optimal size depending on the GPU device's available memory.The sub-partitions are streamed to GPU memory, and to minimize memory footprint, they do not create deep copies of the data.Figure 3 provides an overview of this process.By utilizing these techniques, we can significantly reduce the impact of the PCIe bottleneck and achieve higher performance in our GPU-accelerated software implementations.Some of the other optimizations and features of tomoCAM include: • Since the axis of rotation may not be aligned with the center of the image, we use the Fourier shift property to efficiently move the rotation axis to the center of the image.• We use OpenMP threads to parallelly launch level-1 partitions on all the available GPUs, as well as to stream data into GPU memory.• To improve cache efficiency, we utilize GPUs' shared memory to store data that is accessed multiple times, such as when computing the total-variation constraint.
• A python front-end and numpy compatibility are provided via pybind11 project (Jakob et al., 2017).
Fig. 3. Large arrays are partitioned along the axis of rotation using a two-tier partitioning scheme.

NUMERICAL EXPERIMENTS
We tested tomoCAM with publicly available phantoms and measured datasets.Here, we present a comparison of reconstructed results using tomoCAM, SVMBIR (Team, 2020) and filtered back-projection using gridrec available in the Tomopy package (Gürsoy et al., 2014).Each reconstruction and line-profile (B) is scaled with s and shifted with ∆, where s, ∆ = arg min s,∆ A − s B + ∆ to the ground truth (A) before plotting.In the case of experimental data, we rescale reconstructions from SVMBIR and gridrec with the one obtained from tomoCAM.The total-variation constraint used in SVMBIR is slightly different from the one used in tomoCAM, see the theory section in (Team, 2020).SVMBIR uses 10 nearest neighbors, while the tomoCAM uses 26 of them, to evaluate (12).We believe parameters can be fine-tuned for tomoCAM and SVMBIR to produce equivalent results.The primary comparison with SVMBIR is to demonstrate performance gains, rather than comparing two different constraints or image quality.All the tests were done on a single machine with • 128 GB RAM In the first experiment, we compare the reconstruction of a foam phantom from all three codes.A foam phantom and its projection data of size (128×16×2048) was generated using the foam ct phantom package (Pelt et al., 2022).A full slice from the phantom in fig 4(a), is compared with the reconstruction obtained from each code (SVMBIR, tomoCAM, and gridrec) in 4(b-d).This is followed by zoomed-in regions of each image in 4(e-h).A line profile from each of the zoomed-in regions is then compared in 4(i).It is evident from the results, that both tomoCAM and SVMBIR are effective at suppressing the noise.One major advantage of tomoCAM is that it can get equivalent results in an order of magnitude faster time.
Next, we evaluate the reconstruction of two experimental datasets obtained from diverse synchrotron lightsources that are accessible through Tomobank (Carlo et al., 2018).For each dataset, the available number of projections is notably lower than what is typically expected, which follows the general rule of thumb that it should be as many as the number of pixel columns in the camera sensor.
Using Beer-Lambert's law (Swinehart, 1962) b in equation ( 9) is defined as − log(I/I 0 ), where I is the measured intensity and I 0 is the beam intensity without the sample blocking the view.The hyper parameters used for tomoCAM are, p = 1.2, σ = 1 0.7 and c = 0.0001.For the gridrec we chose the Butterworth filter with order 2, and the cutoff frequency was set to 0.25, which is typical for a synchrotron tomographic reconstruction.We choose T = 1 and σ x = {0.98,2.1, 1.1} × 10 −4 for the phantom, Tomobank dataset id 25 (TB-25), and Tomobank dataset id 86 (TB-86) respectively for SVMBIR runs, in order to produce similar quality reconstructions as tomoCAM.
We follow a similar pattern to Fig. 4 for plotting images and line profiles.The first row of images depicts full slices, followed by zoomed-in regions, and then a line profile is taken from the middle of each zoomed-in region.We expect that conducting a thorough hyper-parameter search would yield comparable outcomes from both tomoCAM and SVMBIR, given their mathematical similarity.A circular mas was applied to all the reconstructions.While tomoCAM and SVMBIR both do an excellent job at suppressing the noise when compared to gridrec, tomoCAM is approximately 15× faster.

CONCLUSIONS
In this work, we have presented tomoCAM, a new GPU-accelerated software for reconstructing high-quality tomographic images.tomoCAM is capable of running model-based iterative reconstructions for large datasets with relatively modest hardware requirements, within a reasonable time.The resulting reconstructed images have lower noise when compared with the prevalent filtered-back projection methods, while being an order of magnitude faster than CPU-only MBIR implementations.
A Python-based front-end has been created for tomoCAM, which is specifically designed to receive Numpy arrays as both input and output for reconstructions.This facilitates seamless integration of tomoCAM into the existing workflows of beamline scientists.Although the use of MBIR is particularly advantageous in cases where there is a scarcity of available projection data, the current implementation of MBIR is quite timeconsuming.Consequently, this is the primary reason why beamline scientists do not utilize MBIR even when it is obviously advantageous.tomoCAM overcomes this problem, thus making MBIR reconstruction more practical, by • improving efficiency: the run time has been reduced by an order of magnitude, making it faster than previous MBIR versions, • reducing hardware requirements: it can run on machines as small as an individual desktop with a GPU, making it more accessible, • simplifying hyper-parameter search: tomoCAM's speed makes it easier to search for hyper-parameters, allowing for faster and more efficient experimentation, • enhancing compatibility: the implementation provides a Python interface, which makes it easy to integrate with existing workflows that use FBP.

Fig. 1 .
Fig. 1.Radon transform of f is its line integral along each line perpendicular to n

Fig. 5 .
Fig. 4. A compassion of reconstruction methods for a foam phantom with 128 projections (a) Ground Truth, (b) tomoCAM, (c) SVMBIR, and (d) gridrec.(e), (f), (g) and (h) are the zoomed-in regions of interest represented by the boxes in (a), (b), (d) and (e) respectively, and (i) displays the line profiles on (e), (f), (g) and (h).Reconstructions using tomoCAM and SVMBIR result in images with low noise, when compared to gridrec.Here tomoCAM is about 9× faster than SVMBIR.

Fig. 6 .
Fig. 6.Reconstructions for Tomobank Dataset ID: 86, Nano-CT data with sparse projection angles using 202 projections, with (a) tomoCAM, (b) SVMBIR, and (c) gridrec.(d), (e), and (f) are zoomed-in regions of interest represented by the boxes in (a), (b), and (c) respectively.(e) displays the line profiles for (d), (e), and (f).A circular mas was applied to all the reconstructions.While tomoCAM and SVMBIR both do an excellent job at suppressing the noise when compared to gridrec, tomoCAM is approximately 15× faster.

Table 1 .
Table 1 shows a comparison of the time taken by each code.A comparison of time taken to reconstruct various datasets.Both tomoCAM and SVMBIR were timed for 100 iterations.