Q2XAFS workshop
Towards data format standardization for X-ray absorption spectroscopy
(a) National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, (b) Bragg Institute, ANSTO, Locked Bag 2001, Kirrawee DC, NSW 2232, Australia, (c) European Synchrotron Radiation Facility, 6 rue Jules Horowitz, BP 220, 38043 Grenoble Cedex 9, France, and (d) Center for Advanced Radiation Studies, University of Chicago, Building 434A, Argonne National Laboratory, Argonne, IL 60439, USA
*Correspondence e-mail: bravel@bnl.gov
A working group on data format standardization for X-ray absorption spectroscopy (XAS) has recently formed under the auspices of the International X-ray Absorption Society and the Commission of the International Union of Crystallography. This group of beamline scientists and practitioners has been tasked to propose data format standards to meet the needs of the world-wide community. In this report, concepts for addressing three data storage needs are presented: a single spectrum interchange format, a hierarchical format for multispectral X-ray experiments, and a relational database format for data libraries.
Keywords: XAFS; standardization; data formats.
1. Introduction
In the years since the seminal 1971 paper (Sayers et al., 1971) demonstrating quantitative analysis of the X-ray absorption fine structure (XAFS), X-ray absorption spectroscopy (XAS) has developed into a mature field used by researchers in a broad array of scientific disciplines. During those decades the world-wide user community has developed best practices for all aspects of the measurement, including beamline design, sample preparation, measurement practice, theory, and data processing and analysis. The measurement of XAS data is today one of the core competencies of synchrotron science, and beamlines dedicated to XAS are available at almost every synchrotron in the world. In these four decades there has never been a sustained and broadly supported effort within the community towards standardization of data formats, despite the substantial benefit that data format standardization has brought to other disciplines, notably X-ray scattering. The authors of this report represent a working group consisting of beamline scientists and practitioners from around the world working under the auspices of the International X-ray Absorption Society and the Commission of the International Union of Crystallography (IUCr) to develop a proposal for data format standardization for XAS and related measurement techniques. This report summarizes the results of the first meeting of this working group.
Data format standardization addresses a number of problems shared by measurement facilities, the scientists and engineers who develop and maintain the hardware and software at those facilities, the scientists who develop software related to theory and analysis, and the scientists who use all of those things in their research. These problems include (i) representing spectral data, (ii) representing metadata related to the spectral data, (iii) composing relationships between spectra measured at different times or at different facilities, (iv) composing relationships between spectra and other data measured during the same experiment, (v) archiving of data and metadata for storage and future use, (vi) preparation of data for subsequent processing and analysis, (vii) comparison of measured data with applicable theory, and (viii) deposition of data with journals or other repositories. Stated more simply, we seek standards for data formatting to facilitate sharing of data in the broadest possible sense. We seek to share data across continents, decades and analysis toolkits.
In the following we outline data format proposals for three different applications. In this report the term `data format' includes both the syntax specification and the semantic and scientific meanings assigned to the data names appearing in the file. Two data format proposals address data interchange of a minimal unit of currency: a single XAS spectrum. A further proposal suggests using a hierarchical format to represent the result of a complex multispectral experiment in which one or more spectra are measured along with other X-ray or non-X-ray measurement techniques. The final proposal is a database standard for XAS data libraries.
2. An interchange format for XAS spectra
The irreducible unit of XAS data is a single spectrum. The bare minimum of data and metadata required to encode an XAS spectrum is:
(i) A table of numbers representing the energy axis, μ(E), and the intensity of the incident X-ray beam (which is often referred to as I0), as well as the uncertainties in the measurement of those numbers. In some situations this table consists instead of the photoelectron wavenumber and χ(k) and their associated uncertainties.
(ii) Identification of the absorbing species and absorption edge.
(iii) The value of the d-spacing of the crystal monochromator or line spacing of the grating monochromator used to provide monochromatic beam for the experiment.
In the language of this report the table of numbers is the data and the other items are examples of metadata, i.e. descriptive information about the specific instance of data. This set of data and metadata has been identified as essential and irreducible for a full assessment and analysis of an XAS spectrum. The incident intensity is included in the data table so that systematic errors in a spectrum related to the photon source and optics can be distinguished from those related to the measured sample. The identification of the absorbing element and edge is required to identify unambiguously the elemental origin of the spectrum. The value of the monochromator d-spacing or line spacing is required to apply corrections to the energy axis owing to miscalibration of energy, angle or encoder value.
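To illustrate why the d-spacing belongs in the minimal metadata, the following sketch re-derives an energy axis after an angular miscalibration using Bragg's law, λ = 2d sin θ. It assumes a crystal monochromator and a constant angular offset; the function names and the example numbers are illustrative assumptions and are not part of any proposal discussed here.

```python
import math

# h*c in eV*Angstrom (rounded CODATA value)
HC_EV_ANGSTROM = 12398.419843


def energy_to_angle(energy_ev, d_spacing):
    """Bragg angle (radians) for a photon energy in eV and a d-spacing in Angstrom."""
    return math.asin(HC_EV_ANGSTROM / (2.0 * d_spacing * energy_ev))


def angle_to_energy(theta, d_spacing):
    """Photon energy in eV for a Bragg angle (radians) and a d-spacing in Angstrom."""
    return HC_EV_ANGSTROM / (2.0 * d_spacing * math.sin(theta))


def correct_energy_axis(energies_ev, d_spacing, angle_offset_rad):
    """Recompute the energy axis after adding a constant offset to every Bragg angle.

    Without the d-spacing the recorded energies cannot be mapped back to
    angles, so an angular miscalibration could not be corrected after the fact.
    """
    return [
        angle_to_energy(energy_to_angle(e, d_spacing) + angle_offset_rad, d_spacing)
        for e in energies_ev
    ]


# Hypothetical example: three energies near 9 keV, Si(111) d-spacing of about 3.1356 Angstrom
axis = [8900.0, 8979.0, 9100.0]
corrected = correct_energy_axis(axis, d_spacing=3.1356, angle_offset_rad=1.0e-4)
```

Because energy depends on angle through 1/sin θ, a constant angular offset produces a shift that varies along the energy axis, which is why a single additive energy shift is not always an adequate correction.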
In addition to the required data table entries and metadata values, a particular representation of data might require additional data table entries or metadata values. The data interchange format must be sufficiently flexible to allow the addition of new data table columns and metadata values. For example, the data table might include signals from additional detectors or encoder values related to the energy axis. Additional metadata might include such things as the start and end times of a scan, details about the photon source or the optics used to condition the photon beam, details about the preparation of the sample, details about the sample environment, or other descriptive information.
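The split between a required core and optional extensions can be sketched as a small in-memory structure. This is only an illustration under stated assumptions: the class name, the column labels (`energy`, `mu`, `i0`) and the metadata keys are hypothetical and do not come from the working group's dictionary; uncertainty columns would be carried in the same way as any other column.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical labels for the columns treated as required in this sketch
REQUIRED_COLUMNS = ("energy", "mu", "i0")


@dataclass
class Spectrum:
    element: str                      # absorbing species, e.g. "Cu"
    edge: str                         # absorption edge, e.g. "K"
    d_spacing: float                  # monochromator d-spacing or grating line spacing
    columns: Dict[str, List[float]]   # data table: column label -> values
    extra_metadata: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        """Check that the required columns exist and all columns have equal length."""
        missing = [name for name in REQUIRED_COLUMNS if name not in self.columns]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        lengths = {len(values) for values in self.columns.values()}
        if len(lengths) != 1:
            raise ValueError("all columns must have the same number of rows")


# Made-up numbers, for illustration only
spectrum = Spectrum(
    element="Cu",
    edge="K",
    d_spacing=3.1356,
    columns={
        "energy": [8970.0, 8980.0, 8990.0],
        "mu": [0.012, 0.483, 0.951],
        "i0": [10500.0, 10480.0, 10510.0],
        "fluorescence": [120.0, 4300.0, 8700.0],  # optional extra detector column
    },
    extra_metadata={"sample.temperature": "room temperature"},
)
spectrum.validate()
```

Keeping the optional items in open-ended mappings, rather than fixed fields, mirrors the requirement that new columns and metadata values can be added without changing the format.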
The spectrum encoded in the data table and described by the metadata might represent a single scan of μ(E) data. An individual scan, however, is only one example of a spectrum that might be represented by an interchange standard. It is common practice, for example, to measure two or more scans on an individual sample and to merge those spectra by performing a statistical average. Subsequent data processing and analysis is then performed on the merged μ(E) spectrum. The merged μ(E) spectrum, not the individual scans, is the unit of currency we wish to capture in the interchange format.
Often an XAS spectrum is but one result of a complex experiment. As an example, an XAS spectrum might be measured at a point in a map, or an XAS spectrum measured in an energy-loss channel might be extracted from a non-resonant experiment. The μ(E) spectrum that might be extracted from this larger data set is the unit of currency we wish to capture in the interchange format.
With the established contents of the data table, a dictionary of commonly recognized metadata and a specified format for the interchange file, several problems related to data transmission are obviated. XAS data in this interchange format can be readily transferred between people, beamlines, desktop and Internet applications, data archives and journals. Our working group is approaching a general accord on the content of the data table and metadata dictionary and we are currently evaluating two proposals for the syntax of the interchange format.
2.1. The XAS data interchange format proposal
The XAS data interchange (or XDI) proposal is a plain-text format loosely based on the structure of Internet email. Like the syntactic separation of email headers from the body of the email, the XDI header precedes and is syntactically separated from the data table. The header contains a small set of syntax elements used to positively identify metadata content in a manner that is easily understood both by human and computer readers. The representation of the metadata is line-based with obvious separations between metadata names and values. These names can correspond to a dictionary of defined metadata, but the syntax is sufficiently flexible to allow the introduction of additional kinds of metadata. An example might be parameters related to specific software for data processing or analysis.
The data table follows the header as white-space-separated columns of numbers. Fig. 1 shows an example of XAS data in the XDI format. The syntax elements distinguishing the metadata header from the data table are chosen so that data in the XDI format can be imported without filters into many existing XAS analysis packages, although most existing programs are not equipped to use the metadata. Additionally, many general-purpose data processing and visualization packages can readily use files in this format. Thus, the XDI format is immediately useful even in existing software that does not explicitly recognize and use the format, and it is able to represent data uniformly and consistently.
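The general idea of a line-based header followed by white-space-separated columns can be sketched with a few lines of parsing code. The concrete syntax below is an assumption made only for illustration (a `#` comment prefix, `name: value` header lines, and made-up metadata names and numbers); Fig. 1 and the working group's own specification remain the authority on the actual XDI syntax.

```python
def parse_interchange_text(text, comment="#"):
    """Split a plain-text spectrum into (metadata dict, list of numeric rows).

    Assumed layout for this sketch: header lines start with `comment` and hold
    `name: value` pairs; every other non-blank line is a row of numbers.
    """
    metadata, rows = {}, []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(comment):
            body = stripped.lstrip(comment).strip()
            name, sep, value = body.partition(":")
            # Header lines without a name/value separator (e.g. a rule of
            # dashes) carry no metadata and are skipped.
            if sep and name.strip():
                metadata[name.strip()] = value.strip()
            continue
        rows.append([float(token) for token in stripped.split()])
    return metadata, rows


# Hypothetical file content with invented values
SAMPLE = """\
# Element.symbol: Cu
# Element.edge: K
# Mono.d_spacing: 3.1356
# -----
8970.0  0.012  10500.0
8980.0  0.483  10480.0
8990.0  0.951  10510.0
"""

metadata, rows = parse_interchange_text(SAMPLE)
```

A program that knows nothing about the header can still read such a file by treating the prefixed lines as comments, which is the property that makes the format usable in existing software.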
2.2. The xasCIF proposal
The Crystallographic Information File (CIF) (Hall et al., 1991) was adopted by the IUCr as its preferred format for data exchange in 1991, and is now the de facto standard for communicating crystal structures in the scientific community. Despite its name, there is nothing particularly `crystallographic' in the CIF specifications. CIF consists of a tightly specified text-based syntax coupled with a comprehensive dictionary mechanism for defining the meanings of tags appearing in data files. A data file written according to the CIF syntax consists of one or more data blocks, with each data block notionally corresponding to a single structure. Each data block contains a set of key-value pairs interspersed with tabular information. There are no restrictions on the order in which key-value pairs or tables appear in data blocks. Thus, with appropriate choice of data names to act as relational keys, a data block may be mapped directly into a relational database.
A simple example of what a CIF data file containing XAS data might look like is shown in Fig. 2. Note that multiple tables in a single data block allow data files to organize metadata in tabular form where appropriate. The data names _xafs_detector.label and _xafs_ionisation_detector.label in this example act as relational keys linking the two metadata tables together. Further details of the CIF syntax can be found in International Tables for Crystallography (Hall & Westbrook, 2005), and the advantages of relational databases are discussed further in §4.
3. Hierarchical formatting of complex data sets
While text-based file formats are adequate for single spectra, they are not well suited for large or complex data sets. Although a single spectrum is our unit of currency, its informational value changes considerably when it can be compared with other spectra acquired at other sample points in a mapping experiment. The relationship among those spectra is well described by a hierarchical approach. This need becomes more obvious when we combine an arbitrary number of XAS measurements with other X-ray and non-X-ray measurements in a single experiment. For instance, one might save complete fluorescence spectra while performing a fluorescence mapping experiment. While all this can be achieved with text files, binary file formats are far better suited to that task. A clear example in the field is full-field transmission X-ray microscopy, where image data are seldom stored as plain text.
XAS scientists are often uncomfortable with binary file formats. The main concern is transferability and readability of a binary file among individuals and across computing platforms. On the other hand, those same XAS scientists never object to storing experimental data on a physical medium, such as a hard disk, a compact disk or a memory stick. It is taken for granted that the data can be recovered from the physical medium, and data are no longer stored in printed form. Clearly they are confident about the availability of tool sets to access the file system of the physical medium and to make the stored data accessible.
The Hierarchical Data Format version 5 (HDF5) is an extant binary format which offers a close analogy to a file system. This file format is developed and maintained by the HDF Group (HDF GROUP, 2012). An overview of the HDF5 capabilities from a scientific perspective can be found in the article by Dougherty et al. (2009). To summarize, any set of data can be stored in an HDF5 file in a self-descriptive way: numbers are retrieved as numbers retaining their original precision, text is retrieved as text, multi-dimensional data sets are retrieved in their original type and dimensions, machine endian-ness is automatically handled, automatic compression and decompression is supported, and so on. Data analysis programmers face fewer problems correctly reading an HDF5 file than a text-based file for which there is not a good library.
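The file-system analogy can be illustrated without any HDF5 library. The sketch below uses nested Python dictionaries as a stand-in for groups and data sets and resolves slash-separated paths the way one would address items in a hierarchical file. The layout, the names (`map`, `point_0000`, `xas`) and the numbers are hypothetical; they are not a proposed schema, and a real implementation would use an HDF5 library rather than dictionaries.

```python
def get_node(tree, path):
    """Resolve a slash-separated path, file-system style, in a nested dict."""
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


def walk_leaves(tree, prefix=""):
    """Yield (path, value) for every leaf, i.e. every value that is not a dict."""
    for name, child in tree.items():
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            yield from walk_leaves(child, path)
        else:
            yield path, child


# Hypothetical mapping experiment: one spectrum per sample point
experiment = {
    "sample": {"name": "hypothetical foil"},
    "map": {
        "point_0000": {
            "x_mm": 0.0,
            "y_mm": 0.0,
            "xas": {"energy": [8970.0, 8980.0], "mu": [0.01, 0.48]},
        },
        "point_0001": {
            "x_mm": 0.1,
            "y_mm": 0.0,
            "xas": {"energy": [8970.0, 8980.0], "mu": [0.02, 0.51]},
        },
    },
}

mu_at_second_point = get_node(experiment, "/map/point_0001/xas/mu")
all_paths = [path for path, _ in walk_leaves(experiment)]
```

Each spectrum keeps its identity as a single unit while its position in the tree records how it relates to the other measurements, which is the relationship a flat text file cannot express on its own.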
HDF5 itself provides the support and the tools needed to ensure that the data are readable in their original form, but it adds nothing concerning their meaning. The versatility of the HDF5 format assures that the data conventions and the dictionary of XAS metadata developed for the XAS data interchange format can be readily mapped onto HDF5. This, along with the fact that most European synchrotrons are already using HDF5 or planning to do so, implies that an HDF5-based implementation will be available for use in analysis programs.
4. XAS data libraries
While the XDI format presented in §2.1 represents a single spectrum and is the fundamental quantum of exchangeable data, it is unable to convey much context in relation to other measurements. This includes not only other measurements made at the same time, as discussed in §3, but also measurements made on the sample under different conditions, such as temperature, or on different samples measured during the same experimental session. And yet, one of the motivations for adopting a standardized format is to compare and exchange spectra, and especially spectra on model or reference compounds measured at different places and times that may be used in comparative analysis of XANES and EXAFS spectra. For this, libraries of spectra are needed. While the XDI format itself is not capable of holding a library of spectra, its well described header format and clear meanings of column data make it easily inserted into and extracted from spectral libraries.
While some attempts (Lytle et al., 1995; Newville et al., 1999) have been made at building searchable on-line spectral libraries, these have suffered from difficulties in uniformity of format, vetting the quality of data, control and management of a centralized library, and the mechanism for adding and extracting spectra in a usable format. Parts of these difficulties are technical, while other parts are clearly more social and political. In this section we will focus on the technical aspects, and note that these may mitigate some of the more complicated non-technical issues.
The organization of digital data is well matched to using a relational model (Codd, 1970), and relational database management systems using this model and especially the structured query language (SQL) are both ubiquitous and well supported. The relational model differs from the hierarchical model described in §3 in that data are organized in tables without a strict hierarchy, but with relations defined between table entries. This allows more complex and interconnected relationships among data properties. For example, a database of XAS spectra using a relational model could readily and efficiently support searching through a collection of spectra by absorbing element, beamline of data collection or date of collection.
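A search of this kind can be sketched with two related tables and an SQL join. The example uses Python's built-in sqlite3 module with an in-memory database purely as a convenient SQL engine; the table names, column names and rows are hypothetical and do not represent a proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE beamline (
        id INTEGER PRIMARY KEY,
        name TEXT,
        facility TEXT
    );
    CREATE TABLE spectrum (
        id INTEGER PRIMARY KEY,
        element TEXT,
        edge TEXT,
        collected TEXT,                              -- ISO date, sorts as text
        beamline_id INTEGER REFERENCES beamline(id)  -- relation between tables
    );
    """
)
# Invented rows, for illustration only
conn.executemany(
    "INSERT INTO beamline VALUES (?, ?, ?)",
    [(1, "BL-A", "Facility X"), (2, "BL-B", "Facility Y")],
)
conn.executemany(
    "INSERT INTO spectrum VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Cu", "K", "2011-03-02", 1),
        (2, "Fe", "K", "2011-06-15", 2),
        (3, "Cu", "K", "2012-01-20", 2),
    ],
)


def find_spectra(connection, element, beamline=None):
    """Return (spectrum id, beamline name, date) rows for an absorbing element."""
    sql = (
        "SELECT s.id, b.name, s.collected FROM spectrum s "
        "JOIN beamline b ON b.id = s.beamline_id WHERE s.element = ?"
    )
    params = [element]
    if beamline is not None:
        sql += " AND b.name = ?"
        params.append(beamline)
    sql += " ORDER BY s.collected"
    return connection.execute(sql, params).fetchall()


copper_everywhere = find_spectra(conn, "Cu")
copper_at_bl_b = find_spectra(conn, "Cu", beamline="BL-B")
```

The beamline details are stored once and referenced by key from each spectrum, so the same query pattern extends to searches by facility or date without duplicating metadata in every record.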
Several high-quality and freely available SQL database systems exist and are proven solutions for many database applications. In particular for XAS spectral libraries, which we would expect to be relatively small by today's standards, relational databases that map to a single file are a reasonable solution. For reference, the Lytle collection contains roughly 16000 files of raw unfiltered data, and uses 500 MB, which is probably a good estimate of the size needed for a database encompassing model compound spectra. The Public Domain SQLite (Hipp, 2012) library provides a relational database that can be accessed with SQL in a single portable file, and is used in a wide range of desktop applications, such as web browsers and media players, and is widely deployed for embedded data management in commercial mobile devices. The file format is well described, if binary, and many tools and programming languages can access the data contained in these files.
XAFS spectral libraries could be built using SQLite for data storage and the same metadata used in XDI files. This would allow, for example, a desktop application that worked like a modern media player to import, sort and organize spectra by absorbing element, beamline, facility, sample name and so on as described in the XDI metadata. Some of the advantages of such an approach, in which a standard SQLite database file contains a large suite of data, including the relations between spectra, include (i) a library can be shared in its entirety, (ii) libraries do not have to be centralized and public, but can hold data to be shared between trusted colleagues or publicly, and (iii) centralized repositories can hold `libraries of libraries' as well as a full suite of data. We propose to use SQLite for spectral libraries, and are in the process of building a portable desktop application based on SQLite to hold and manage libraries of XAS spectra, using metadata imported from the XDI format. The spectral libraries used by this application are portable, and XDI-format spectra can be extracted from the library for use in existing data analysis software. While this application will try to read in many data formats, it will use the same metadata fields as described for XDI and output plain text files only in XDI format. An example screenshot of the still-in-development application is shown in Fig. 4.
5. Conclusion
The current draft of the metadata dictionary is under discussion using Internet social media resources, including a mailing list (XAS Data Format Working Group, 2012a) and a wiki (XAS Data Format Working Group, 2012b). The wiki is the forum at which the content of the metadata dictionary and the format proposals are presented. Further development continues on both the mailing list and the wiki. All community members are invited to contribute to our deliberations.
One of the challenges of defining the metadata dictionary is to distill the very large number of details that comprise an XAS measurement into a set of agreed-upon items of sufficient importance to merit a dictionary entry. Another challenge is to capture the salient aspects of all photon sources used for XAS measurement. For example, the polychromator for a dispersive XAS beamline is not adequately described only by its d-spacing. A final challenge is that metadata might not be well represented by the model of name/value pairs presented in the XDI proposal. xasCIF inherits from CIF a syntax for handling non-shallow data structures, as demonstrated by the first two loop_ structures in Fig. 2. A modification to the XDI proposal to handle non-shallow metadata in that fashion would be straightforward.
We note that the data format needs of other growing fields of X-ray spectroscopy are quite similar to those of the XAS community. The solutions suggested for XAS here map naturally onto those fields. The data interchange and library standards can be adopted with little modification for their spectra, while the hierarchical format can be adopted for various kinds of dispersive measurement geometries or resonant experiments.
The proposals outlined above cover storage of a single spectrum (XDI and xasCIF), multi-dimensional data sets with potentially complex interrelationships (HDF5) and data libraries. Our working group will provide a common dictionary of metadata for use across all applications. For each of these applications our working group plans to provide both a vetted standard and appropriate application programming interfaces (APIs) so that these data format standards can be easily incorporated into new and existing software, including software for data acquisition, data archiving and data analysis. The metadata dictionary, the file format standards and the details of the APIs will be topics of future reports from this working group.
Finally, we acknowledge that two important topics have not been discussed in this report. We have addressed neither the representation of raw data as collected at the beamline nor the representation of the chain of analysis of an XAS spectrum. We in the working group consider both of these topics to be beyond the scope of our work. The nature of raw data is often idiosyncratic and characteristic of the particular beamline at which it was measured. Furthermore, existing beamlines have existing control and acquisition systems with working solutions for the representation of raw data. Our concern for data interchange is, therefore, focused on the representation of data that have been processed into the form of μ(E) [or perhaps χ(k)]. We note that the syntax of either XDI or xasCIF is adequate for conventional measurements consisting of signals from a small number of scalars. Either format could also be used by theory to encode μ(E) or other functions. The HDF5-based format from §3 is an attractive solution for experiments involving more complex arrangements of detectors. That hierarchical format could also be applied to the capture of a complete analysis chain, including algorithm parametrization, user interaction and application of theory.
Acknowledgements
The authors thank Hiroyuki Oyanagi for organizing our working group and for offering continual encouragement and many suggestions. We are grateful to all the organizers of the Q2XAFS workshop. Finally, we thank Ken McIvor for his initial work on the formal grammar of the XDI format and for many helpful discussions.
References
Codd, E. F. (1970). Commun. ACM, 13, 377–387.
Dougherty, M. T., Folk, M. J., Zadok, E., Bernstein, H. J., Bernstein, F. C., Eliceiri, K. W., Benger, W. & Best, C. (2009). Commun. ACM, 52, 42–47.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). Acta Cryst. A47, 655–685.
Hall, S. R. & Westbrook, J. D. (2005). International Tables for Crystallography, Vol. G, ch. 2.2. Dordrecht: Kluwer.
HDF GROUP (2012). HDF5, https://www.hdfgroup.org/HDF5/.
Hipp, D. R. (2012). Sqlite, https://www.sqlite.org/.
Lytle, F., Boyanov, B. & Segre, S. (1995). IXAS XAFS database, https://ixs.iit.edu/database.
Newville, M., Carroll, S. A., O'Day, P. A., Waychunas, G. & Ebert, M. (1999). J. Synchrotron Rad. 6, 276–277.
Sayers, D. E., Stern, E. A. & Lytle, F. W. (1971). Phys. Rev. Lett. 27, 1204–1207.
XAS Data Format Working Group (2012a). Xasformat mailing list, https://millenia.cars.aps.anl.gov/mailman/listinfo/xasformat.
XAS Data Format Working Group (2012b). Data format working group wiki, https://github.com/XraySpectroscopy/Data-Format-Working-Group/wiki.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.