## short communications

## An algorithm for the automatic deglitching of X-ray absorption spectroscopy data

^{a}Department of Civil and Environmental Engineering, Northwestern University, 2145 Sheridan Road, Evanston, USA, and ^{b}Department of Construction Engineering and Management, University of Talca, Camino Los Niches Km 1, Curicó, Chile^{*}Correspondence e-mail: jf-gaillard@northwestern.edu

Analysis of X-ray absorption spectroscopy data often involves the removal of artifacts or *glitches* from the acquired signal, a process commonly known as *deglitching*. Glitches result either from specific orientations of monochromator crystals or from scattering by crystallites in the sample itself. Since the precise energy – or wavelength – location and the intensity of glitches in a spectrum cannot always be predicted, deglitching is often performed on a per spectrum basis by the analyst. Some routines have been proposed, but they are prone to arbitrary selection of spectral artifacts and are often inadequate for processing large data sets. Here, a statistically robust algorithm, implemented as a Python program, for the automatic detection and removal of glitches that can be applied to a large number of spectra, is presented. It uses a Savitzky–Golay filter to smooth spectra and the generalized extreme Studentized deviate test to identify outliers. Robust, repeatable, and selective removal of glitches is achieved using this algorithm.

Keywords: X-ray absorption spectroscopy; glitches; deglitching.

### 1. Introduction

In ; Bunker, 2010; Calvin, 2013). Glitches often arise from multiple-diffraction events within the crystal monochromator used to set the energy of the X-rays, resulting in a significant decrease or rise in the intensity of the beam delivered to the sample (Bauchspiess & Crozier, 1984). Alternatively, diffraction from other crystalline phases, for instance in diamond anvil cells, may also create glitches, impacting the measurement of the [μ(*E*)] (Sapelkin & Bayliss, 2002).

Analysts can minimize the impact of glitches by using best practices at the time of data collection, which includes ensuring uniform sample preparation, confirming the linearity of equipment at the beamline, and detuning the monochromator to reduce the intensity of the multiple diffraction events; however, these practices do not completely eliminate glitches (Abe *et al.*, 2018). Collecting data free from significant artifacts is particularly challenging in some energy ranges, such as at the Fe *K*-edge using a Si(111) monochromator (Pickering, 1999). Leaving spurious points in place may interfere with the analysis of data; for instance, glitches may introduce error either in the normalization of the spectrum or the transformation of the extended X-ray absorption fine structure (EXAFS) data into *R*-space (Abe *et al.*, 2018), and spectra may be improperly clustered. As such, the processing of spectra may require a deglitching step, where data points corresponding to glitches are removed.

Given the erratic nature of glitches, their removal is commonly based on the judgment of the analyst through visual inspection of the data. For example, the program *Athena* offers a graphical user interface for glitch removal, where one may either conduct point-and-click identification of spurious data points or set a threshold value based on the post-edge normalization curve outside of which data points are removed (Ravel, 2016). Other programs include the capability to remove data points at specified indices (Wellenreuther & Meyer-Klaucke, 2009) or the option to compare μ(*E*) with respect to the incoming incident beam intensity *I*_{0} (Aberdam, 1998). As a result, these methods require the analyst to manually inspect each spectrum to remove the aberrant values, an approach that is readily applicable to large glitches and small datasets.

However, the growing relevance of time-resolved *et al.*, 2018) and continuous scanning `quick modes (Prestipino *et al.*, 2011) along with increasing accessibility to laboratory-source instruments (Anklamm *et al.*, 2014; Jahrman *et al.*, 2019) promises large datasets where manual deglitching of spectra becomes impractical. A rapid and robust method for automated deglitching of spectra is needed to address this shortcoming. Previous methods with greater potential for automation used the first derivative of the to identify glitches (Zhuchkov *et al.*, 2001). While useful, such strategies can result in points adjacent to glitches being erroneously identified, and the applicability to the X-ray absorption near-edge structure (XANES) region is limited. Moreover, a manual threshold must still be set by the analyst to identify glitches based on visual inspection of the data.

Here, we propose a method for the high-throughput deglitching of

data. Outliers are identified in the normalized residuals between original and low-pass-filtered data. A second iteration of this process minimizes the occurrence of type I `false positive' errors. With this method, we achieve accurate and repeatable removal of glitches from full-spectrum data. We present deglitching examples on XANES and spectra, as well as a negative control. Finally, we discuss potential limitations of our method as well as strategies for improvement.### 2. Methods

The deglitching algorithm was implemented as a computer program written in the Python 3.7 programming language. The packages *Scipy* (Virtanen *et al.*, 2020) and *Numpy* (Oliphant, 2006) were used for all calculations, and *Larch* (Newville, 2013) was used for data processing in a Jupyter Notebook environment (Kluyver *et al.*, 2016).

The deglitching program was designed to be compatible with the data structure used by *Larch* for analysis. The only required input for the deglitching program are data channels corresponding to the energy (*E*) and the [μ(*E*)] within a *Larch* group. Absorption data may or may not be normalized prior to deglitching. Additional parameters may be specified here to tune the deglitching program, though default parameters should work in most circumstances.

All spectra presented here were collected at the bending magnet beamline of the Dow–Northwestern–Dupont Collaborative Access Team (DND-CAT), Sector 5 at the Advanced Photon Source at Argonne National Laboratory, Lemont, IL, USA. Fe *K*-edge data are presented for drinking water treatment residual samples collected from drinking water treatment plants in the USA. The other samples include a denture adhesive cream and cobalt (II, III) oxide (Aldrich). Energy was set using a Si(111) double-crystal monochromator. Intensity values for the incident beam (*I*_{0}), the transmitted beam (*I*_{T1}), and the secondary transmitted beam (*I*_{T2}) were collected using Oxford ionization chambers with 29.6 cm path lengths. Fluorescence data were captured using Vortex ME4 silicon drift detectors. data were collected at an energy interval of 10 eV from 150 to 20 eV below the edge energy, 0.5 eV in the XANES region (up to *k* = 3 Å^{−1}), and 0.05 Å^{−1} in the region.

### 3. Algorithm

In the following a description of the deglitching process, also depicted in Fig. 1, is given. After data channels for energy and μ(*E*) are provided to the algorithm (Fig. 1, box 1), the μ(*E*) channel is fit with a Savitzky–Golay filter to provide a smoothed representation of the data, μ_{SG}(*E*) (Fig. 1, box 2 and subplot A). The Savitzky–Golay filter uses a least-squares fitting procedure to fit each point over a rolling window of odd length (*w*_{SG}), effectively acting as a low-pass filter (Savitzky & Golay, 1964) (Fig. 2). Savitzky–Golay window length and the polynomial order of the rolling fit are adjustable parameters set at 9 and 3, respectively, by default. The results of the filter are then subtracted from the normalized absorption to compute the residuals between the original and filtered data, δμ(*E*)_{A} (Fig. 1, box 3),

Given that the Savitzky–Golay filter acts as a low-pass filter, δμ(*E*)_{A} will reflect the contribution of high frequencies to μ(*E*). One expects a normal distribution of residuals should Gaussian noise be the only source of misfit between the data and a fit aiming to represent the true values of the data (Trutna *et al.*, 2003). Glitches, being aberrant high-frequency spectral features, will result in residuals that are outliers compared with the rest of the residuals. In simple cases, only glitches will be outliers in these residuals. However, in the absence of a glitch, these residuals will vary slightly based on regional features of the spectrum. One can generally expect larger values for |δμ(*E*)_{A}| near the owing to the wider range of absorbances contained in a fitting window within this region (Fig. 3) and, potentially, the attenuation of some high-frequency features. Identifying outliers on untreated residuals may result in the false positive identification of glitches in the XANES region or failure to identify subtle glitches in the region. In addition, low-pass filters cannot completely attenuate high-frequency signals. This is apparent in Fig. 1, box 2 and subplot C: the initial Savitzky–Golay filter is not smooth in the region surrounding aberrant points. Glitches will impact δμ(*E*)_{A} at any point within (*w*_{SG}/2) − 1 points of a glitch, increasing the variance of these residuals relative to those outside the Savitzky–Golay filter window.

Residuals must be normalized regionally to account for differences in residual values between XANES and δμ(*E*)_{A}| to acquire a rolling median absolute deviation. Much like the Savitzky–Golay filter, these median values are calculated from a rolling window, *w*_{m}. The window size for the median calculation is selected such that, should a glitch be contained within *w*_{m}, more than half the points in *w*_{m} will not have included the glitch in their Savitzky–Golay filter fitting window *w*_{SG}. This may be defined as

where *w*_{m} is the window length for calculating the median absolute deviation, *w*_{SG} is the Savitzky–Golay filter window length, and *L*_{g} is the maximum number of points corresponding to a single glitch. Normalized residuals *r* are computed by dividing δμ(*E*)_{A} by the local median absolute deviation of residuals to account for regional variability in the fit,

where *r*_{i} corresponds to the normalized residual value at *E*_{i}. At the beginning and end of the data, calculations are performed using truncated windows. Provided Savitzky–Golay parameters that do not filter true signal, these normalized residuals will be normally distributed with a standard deviation of approximately 1.48, corresponding to the conversion from median absolute deviation to standard deviation based on the cumulative distribution function of the standard normal distribution (Filliben & Heckert, 2003) (Fig. 1, box 4 and subplot B). This may be confirmed by plotting the normalized residuals and a histogram of their distribution; well defined oscillations in the residuals are indicative of an overlong Savitzky–Golay filter window length; and non-normal distributions, such as bimodal distributions, are an indication of poor normalization due to the selection of the filter window length (*w*_{SG}) and/or the maximum glitch length (*L*_{g}).

From the normalized residuals, outliers are mathematically identified using a generalized extreme Studentized deviate test (generalized e.s.d.) (Fig. 1, box 5 and subplot B). The generalized e.s.d. identifies outliers in normally distributed data when provided with a maximum number of outliers and a significance value for the outlier identification (Rosner, 1983; Czesla *et al.*, 2019); these values are set at 10% of the data points contained in the spectrum and 0.025 by default, respectively. The first step of the generalized e.s.d. is to calculate the mean of the dataset. Next, the data point furthest from the mean value is identified. The *test statistic* is calculated by finding the distance of this point from the mean value in terms of number of standard deviations. Using a *t* distribution at the provided significance level, the maximum acceptable distance from the mean is calculated for a dataset of a given length. This provides a *critical value* which is, again, in units of number of standard deviations from the mean. The most distant point is removed and the calculation is repeated until the maximum number of outliers are reached. If the *critical value* of a given point is greater than the test statistic, that point and all previously analyzed points are identified as outliers (Rosner, 1983).

Outliers identified in this first stage are taken as *candidate points* for glitches. A sufficiently large glitch may result in a poor fit from the Savitzky–Golay filter on points within the same filter window as the glitch (Fig. 1, box 2 and subplot C), so outliers in the first pass (*candidate points*) may not all be glitches. In the data in Fig. 1, for instance, although there are only four obvious glitch points, seven points are identified as outliers (Fig. 1, box 5 and subplot B). An interpolation step is included in the deglitching procedure to account for the limitations of low-pass filtering (Fig. 1, subplot C); because high frequencies are dampened and not eliminated by low-pass filters (including the Savitzky–Golay filter), points near the glitch are poorly fit by the filter. This leads to the identification of three *candidate points* that, while near glitches, are not part of any glitches. *Candidate points* are removed from a copy of the data, and μ(*E*) at these points is interpolated using a cubic spline (Fig. 1, box 6). The resulting μ(*E*)_{B} only differs from μ(*E*) at the *candidate points*. A second set of Savitzky–Golay filtered data, μ_{SG}(*E*)_{B}, is generated based on this copy of the data using the same parameters as before (Fig. 1, box 7A). From here, the initial process is repeated, finding the residuals [δμ(*E*)_{B}] between the original data [μ(*E*)] and the second Savitzky–Golay filter [μ_{SG}(*E*)_{B}], defined as

As before, these residuals are normalized by the regional median absolute deviation, this time calculated on δμ(*E*)_{B}. The window for this second normalization shrinks to (2*L*_{g}) + 1 points, effectively limiting the maximum number of points in a glitch to *L*_{g}. This adjustment is made possible by the interpolation of *candidate points*, which minimizes the impact of glitches on the filter's fit for other points within *w*_{SG}. Outliers are identified in δμ(*E*)_{B} using the generalized e.s.d. The outliers identified in this second pass that are within one half of the Savitzky–Golay window length, *w*_{SG}, of a candidate point are taken to be glitches and removed. In our example, the interpolated points for the non-glitch candidate points essentially overlap with the original data (Fig. 1, box 6). As a result, the residuals between the second filter and these three *candidate points* are small, and these points are not identified as glitches.

### 4. Results and discussion

The deglitching algorithm was tested on a range of elements and data collection modes, as shown in Fig. 2. For all data, the deglitching procedure was applied across the full spectrum and used a Savitzky–Golay window length of 9, a significance level of 0.025, and a maximum glitch length of 4 (the default values for the program). In each of these circumstances, the algorithm successfully identified and removed glitches across the full spectrum of data without removing non-glitch points. Given the diverse datasets included in these examples – which include XANES glitches, minor glitches, and data for various elements collected in both absorbance and fluorescence mode – the default parameters should be adequate for most applications wherein the data are sampled on a similar energy grid (0.5 eV in the XANES, 0.05 Å^{−1} in the EXAFS).

Using the sampling regime of the example data, the default parameters would result in a Savitzky–Golay filtering window that spans 4.0 eV in the XANES and 0.4 Å^{−1} in the Initial normalization of the residuals would occur over a 12.0 eV window in the XANES and a 1.2 Å^{−1} window in the and normalization of the final residuals would cover 4.0 eV in the XANES and 0.4 Å^{−1} in the For full-spectrum deglitching, these ranges should not be larger than the range of energies with high-derivative features (*i.e.* near the rising edge and the white line). This is especially important for the normalization of the final residuals, which is based on the maximum glitch length. In most cases, in the interest of maintaining more consistent rolling window lengths, *L*_{g} may be kept equal to (*w*_{SG} − 1)/2.

Figs. 4, 5, and 6 show the results of the deglitching algorithm as applied to distinct datasets. Fig. 4 shows an example of the algorithm simultaneously removing glitches in the XANES [Fig. 4(B)] and [Fig. 4(C)] regions for data collected in fluorescence mode at the Fe *K*-edge. Fig. 5 shows data for a sample where the algorithm removed a subtle glitch at a high *K* value in a sample of denture adhesive cream collected in transmission mode at the Zn *K*-edge. Finally, Fig. 6 presents a negative control for analysis, a cobalt (II–III) oxide sample collected in transmission mode with no apparent glitches, where the deglitching algorithm identified no points as glitches. All glitches identified and removed in these examples correspond to monochromator glitches, but not all monochromator glitches in these examples resulted in aberrant data points.

The deglitching algorithm is unlikely to require major adjustments when applied to data with sporadic glitches sampled on a similar energy grid as the provided examples. Extended glitches with regular, absorption-like features present a greater challenge for this algorithm, given the intended use case of full-spectrum deglitching consistent across the XANES and α for outlier identification may also be changed, where a higher value will result in more aggressive outlier identification, which may be useful for low-intensity glitches; and the algorithm may be limited to a specific region of interest, either specified in terms of energy (*e.g.* 7300–7800 eV) or as XANES or regions (defined in the program as being divided at 150 eV above the absorption threshold).

This algorithm provides repeatable and robust glitch removal in typical

data across the full spectrum. An analyst may confirm that the algorithm is performing according to expectations on one data scan, then apply those settings to all similar data, permitting the rapid processing of large datasets.The deglitching algorithm is available for download at https://github.com/wallacesam/deglitching.

### Acknowledgements

We thank Dr Qing Ma for his technical assistance while performing

experiments at the Advanced Photon Source (APS). Portions of this work were performed at the DuPont–Northwestern–Dow Collaborative Access Team (DND-CAT) located at Sector 5 of the APS. DND-CAT is supported by Northwestern University, The Dow Chemical Company, and DuPont de Nemours, Inc. This research used resources of the Advanced Photon Source, a US Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357.### Funding information

Partial funding for this research was provided by the Strategic Environmental Research and Development Program (grant No. ER18-1428 to Jean-François Gaillard); Northwestern University, Institute of Sustainability and Engineering at Northwestern (scholarship to Samuel Wallace).

### References

Abe, H., Aquilanti, G., Boada, R., Bunker, B., Glatzel, P., Nachtegaal, M. & Pascarelli, S. (2018). *J. Synchrotron Rad.* **25**, 972–980. Web of Science CrossRef CAS IUCr Journals Google Scholar

Aberdam, D. (1998). *J. Synchrotron Rad.* **5**, 1287–1297. Web of Science CrossRef CAS IUCr Journals Google Scholar

Anklamm, L., Schlesiger, C., Malzer, W., Grötzsch, D., Neitzel, M. & Kanngießer, B. (2014). *Rev. Sci. Instrum.* **85**, 053110. Web of Science CrossRef PubMed Google Scholar

Bak, S.-M., Shadike, Z., Lin, R., Yu, X. & Yang, X.-Q. (2018). *NPG Asia Mater.* **10**, 563–580. Web of Science CrossRef Google Scholar

Bauchspiess, K. R. & Crozier, E. D. (1984). In *EXAFS and Near Edge Structure III*, pp. 514–516. Berlin, Heidelberg: Springer. Google Scholar

Bunker, G. (2010). *Introduction to XAFS: A Practical Guide to X-ray Absorption Fine Structure Spectroscopy.* Cambridge University Press. Google Scholar

Calvin, S. (2013). *XAFS for Everyone.* CRC Press. Google Scholar

Czesla, S., Schröter, S., Schneider, C. P., Huber, K. F., Pfeifer, F., Andreasen, D. T. & Zechmeister, M. (2019). *Astrophysics Source Code Library*, p. ascl-1906. Google Scholar

Filliben, J. J. & Heckert, A. (2003). *NIST/SEMATECH e-handbook of Statistical Methods*, https://www.itl.nist.gov/div898/handbook/. Google Scholar

Jahrman, E. P., Holden, W. M., Ditter, A. S., Kozimor, S. A., Kihara, S. L. & Seidler, G. T. (2019). *Rev. Sci. Instrum.* **90**, 013106. Web of Science CrossRef PubMed Google Scholar

Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S. & Willing, C. (2016). *Positioning and Power in Academic Publishing: Players, Agents and Agendas*, edited by F. Loizides & B. Schmidt, pp. 87–90. IOS Press. Google Scholar

Newville, M. (2013). *J. Phys. Conf. Ser.* **430**, 012007. CrossRef Google Scholar

Oliphant, T. E. (2006). *A Guide to NumPy*, Vol. 1. Trelgol Publishing USA. Google Scholar

Pickering, I. (1999). *Monochromator crystal glitch library*, https://www-ssrl.slac.stanford.edu/xas/glitch/glitch.html. Google Scholar

Prestipino, C., Mathon, O., Hino, R., Beteva, A. & Pascarelli, S. (2011). *J. Synchrotron Rad.* **18**, 176–182. Web of Science CrossRef CAS IUCr Journals Google Scholar

Ravel, B. (2016). *Athena: XAS Data Processing*, section 9.5, https://bruceravel.github.io/demeter/documents/Athena/index.html. Google Scholar

Rosner, B. (1983). *Technometrics*, **25**, 165–172. CrossRef Google Scholar

Sapelkin, A. V. & Bayliss, S. C. (2002). *High. Press. Res.* **21**, 315–329. CrossRef Google Scholar

Savitzky, A. & Golay, M. J. E. (1964). *Anal. Chem.* **36**, 1627–1639. CrossRef CAS Web of Science Google Scholar

Stern, E. A. & Lu, K. (1982). *Nucl. Instrum. Methods Phys. Res.* **195**, 415–417. CrossRef CAS Web of Science Google Scholar

Trutna, L., Spagon, P., del Castillo, E., Moore, T., Hartley, S. & Hurwitz, A. (2003). *NIST/SEMATECH e-handbook of Statistical Methods*, https://www.itl.nist.gov/div898/handbook/. Google Scholar

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P. & SciPy Contributors (2020). *Nat. Methods*, **17**, 261–272. Web of Science CrossRef CAS PubMed Google Scholar

Wellenreuther, G. & Meyer-Klaucke, W. (2009). *J. Phys. Conf. Ser.* **190**, 012033. CrossRef Google Scholar

Zhuchkov, K. N., Shuvaeva, V. A., Yagi, K. & Terauchi, H. (2001). *J. Synchrotron Rad.* **8**, 302–304. Web of Science CrossRef CAS IUCr Journals Google Scholar

This article is published by the International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.