CIF applications
CIFXML: a schema and toolkit for managing CIFs in XML
Unilever Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
*Correspondence e-mail: pm286@cam.ac.uk
CIFXML applies the XML strategies and technologies to create a general interface for processing CIF documents that conform to the CIF syntax and DDL1. Both a DTD and an XML schema for CIFs are presented. CIFXOM provides an easy way of converting CIFs to XML and vice versa using Java. CIFs can be read, edited, validated syntactically, sorted, normalized, filtered, stored as an XML document object model, transformed and output.
1. Introduction
The Crystallographic Information File (CIF; Hall et al., 1991) is a structured document format that is in common use for the interchange of crystallographic information [the same acronym is used for the broader system of exchange protocols known as the Crystallographic Information Framework (Hall & McMahon, 2005)]. It has a formal syntax, which describes `well formed' CIFs and which is now almost completely honoured in practice. Some semantics are formalized but current usage is variable. Ontology is provided by dictionaries, which in principle allow machine validation of data instances, but relatively few tools exist for semantic integration.
Several excellent CIF parsers have been developed, but most of them store the parsed information (infoset) within the memory of the program and users will need to know the internals and language of each program to extract information. Programming libraries for working with CIF files have already been described for Fortran (Hall & Bernstein, 1996) and C, or variants (Westbrook et al., 1997; Hester, 2006), Python (Chang & Bourne, 1998; Edgington, 1997), and Perl (Bluhm, 2000).
The CIF standard pre-dated XML (W3C, 1997) by about a decade, and its components [datafiles, dictionaries and DDLs (dictionary definition languages)] are essentially isomorphous to the XML infrastructure [documents, schemas, XSD (XML Schema Definition)]. It is possible to represent much of the formal power of DDLs in XSD. One of us (PM-R) has been through this exercise and established that when the constructs are translated into their XML equivalents it is possible to carry out a large amount of validation (Murray-Rust, 1998). However, there are a number of constructs in CIF that cannot be trivially converted to XML, and doing this explicitly is considerably more laborious than hard-coding pragmatic implicit semantics where they are essential (https://www.iucr.org/__data/iucr/cif/software/ciftbx/ ).
The XML community has developed many strategies and tools for semantics and ontological operations on structured documents, and we have transferred these to support CIFs by developing the XML dialect CIFXML. XML provides schema-based validation of data instances and a variety of strategies for transforming documents [Simple API (application programming interface) for XML (SAX; https://www.saxproject.org/ ), Document Object Model (DOM; W3C, 2005), Extensible Stylesheet Language Transformations (XSLT; W3C, 1999) and XSD (W3C, 2000)]. XML has the great benefit that it allows the infoset to be serialized independently of the program that created it. There are a very large number of tools for validating XML so that it is possible to check the structure and content of the serialized XML without knowing the domain specifics. Indeed many web browsers now contain good XML parsers which allow searching and filtering through JavaScript and related languages. Many other XML tools exist for searching, indexing and other manipulation, and the type of information is easily transformed into RDF (Resource Description Framework) and other web-friendly languages. This makes it possible to search for name–value constructs (CIFItem) in RDF-ized CIFXML.
XML serialization also allows `round-tripping', which is an important tool for checking consistency and completeness of parsing and representation. Some information (mainly whitespace and other formatting) will be lost during the round trips but it is possible to carry out the process CIF to CIFXML to CIF to CIFXML with a high degree of stability.
There are two main strategies for processing structured documents:
(a) SAX. After lexical processing a document is broken into chunks, which fire events in a linear order. In XML this normally corresponds to start and end tags and contained text.
(b) DOM. The document is converted into a tree structure (often representable by a DTD or XML schema). The tree is held in memory and can be navigated and transformed in many ways.
SAX and DOM are complementary. SAX has the advantage of being rapid and not limited by memory. DOM preserves the context of every piece of information. In practice many XML parsers provide both strategies and use SAX to build a DOM. The use of SAX, DOM and callbacks may be unfamiliar so a brief description is given later (§7.2).
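The contrast between strategies (a) and (b) can be illustrated with a minimal sketch that uses only the SAX and DOM parsers bundled with the JDK, not CIFXOM itself. The XML fragment, its element names (doc, block, item) and the class name SaxVersusDom are hypothetical and chosen purely for illustration; they are not the CIFXML vocabulary.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVersusDom {
    // Hypothetical XML fragment; the element and attribute names are illustrative only.
    static final String XML =
        "<doc><block id=\"a\"><item name=\"_x\">1</item>"
      + "<item name=\"_y\">2</item></block></doc>";

    private static InputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    // SAX: the parser fires events in document order and calls back into the handler.
    // Only start tags of <item> are trapped here; all other events are ignored.
    static List<String> saxItemNames() {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("item".equals(qName)) {
                    names.add(atts.getValue("name"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(input(), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    // DOM: the whole tree is held in memory, so the context of a node
    // (here, the block that contains the first item) can be navigated to.
    static String domBlockIdOfFirstItem() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(input());
            NodeList items = doc.getElementsByTagName("item");
            Element parent = (Element) items.item(0).getParentNode();
            return parent.getAttribute("id");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the SAX handler never sees the enclosing block unless it tracks that state itself, whereas the DOM call reaches the parent directly; this is the trade-off between speed and preserved context described above.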
2. CIFXML
In 1995 one of us (PM-R) visited the Protein Data Bank (PDB; Berman et al., 2000) in Brookhaven and worked with Professor H. J. Bernstein and colleagues on representing the emerging mmCIF (Fitzgerald et al., 1996) specification in a bespoke structured markup language. Later one of the authors (PM-R) envisaged a complete suite of XML tools (Murray-Rust, 1998) that mapped onto the emerging DDLs and dictionaries, and much of this was discussed with Professors S. R. Hall and N. Spadaccini. A prototype of DDL1 and DDL2 was created in an early precursor of CIFXML, as well as a dictionary validator for the complete infrastructure of emerging DDLs and dictionaries. However, the specification was still evolving at this time and even small changes gave rise to large downstream implications in the software. The conclusion of these explorations was that building the complete infrastructure for XML representation and validation through DDLs and dictionaries was a very considerable labour and was also likely to throw up a number of semantic concerns which would have needed to have been addressed by the community. At that stage, therefore, it seemed practical to hard-code the semantics into DDL1-compatible dictionaries. Since the core dictionary has become relatively stable, it can be used without `on-the-fly' validation against DDL1 (Hester, 2006).
To support CIF through standard XML methods and tools, we have now created CIFXML, an XML dialect with a corresponding XML DTD and schema. Alongside this, we have developed CIFXOM, a Java library for converting CIFs to valid CIFXML and vice versa. CIFXOM is based on the XML parsing strategies, and this article describes the fundamental engine for transforming CIFs into XML. The DDL-validated transformation of CIFXML documents into Chemical Markup Language (CML) will be described elsewhere (Murray-Rust et al., 2011).
CIFXML currently supports the CIF syntax and DDL1-based (Hall & Cook, 1995) dictionaries [but not STAR (Cook, 1991) or DDL2 (Westbrook et al., 2005), i.e. save frames]. It interprets any CIF as a structured document (CIF), which may contain the following:
(a) datablocks: these must have unique identifiers and may contain items, loops and comments.
(b) items: all item names must be unique within a datablock.
(c) loops: all loops within a datablock must belong to different categories (or have specific reference items), and all names in the loop should be unique.
(d) comments: comments can occur anywhere within a CIF where whitespace can occur. It is unclear whether comments are technically part of the content of a CIF or simply annotations for human readers only. We deprecate their use for holding information, but since they are often used for metadata we retain them in the CIFXML model.
(e) whitespace: elements can be separated by inline and interline whitespace, but this is not included in the CIFXML data model.
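The uniqueness constraints in (a) and (b) above lend themselves to a simple dictionary-free check. The following is a minimal sketch only, not the CIFXOM implementation: the class UniquenessCheck and its method are hypothetical, and the case-insensitive comparison of names is an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class UniquenessCheck {
    /**
     * Returns the names that occur more than once in the given list,
     * each reported once, in the order in which the first repeat is met.
     * The same routine can be applied to datablock identifiers within a
     * document or to item names within a single datablock.
     */
    static List<String> duplicates(List<String> names) {
        Set<String> seen = new HashSet<>();
        List<String> repeated = new ArrayList<>();
        for (String name : names) {
            // Assumption: names are compared without regard to case.
            String key = name.toLowerCase(Locale.ROOT);
            if (!seen.add(key) && !repeated.contains(key)) {
                repeated.add(key);
            }
        }
        return repeated;
    }
}
```

A caller could treat a non-empty result as an error or a warning, depending on whether recovery from duplicated names is to be attempted.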
The CIF syntax allows for a number of syntactic variants such as delimiters on values or tokens used for whitespace. These are not held in the CIFXML data model so the precise lexical variant will not be recovered in round trips.
There is no formal concept of order in CIF. The data blocks, the elements within each data block and the components of a loop can be reordered without affecting the abstract data model of a CIF. According to the specification the ordering of `rows' in a loop is not significant. However, XML supports the order of document elements and CIFXML preserves precisely all order in the input document. This allows CIFs to be `round-tripped' (i.e. read into the DOM and re-output without loss). In addition, the order of the components can be canonicalized so that it is possible to compare documents with differing ordering but identical semantic content.
The CIF standard requires that data instances are valid against one or more dictionaries. In practice few tools validate CIFs against any dictionary [and we shall report elsewhere a CIFXML-based dictionary and document validation tool (Murray-Rust et al., 2011)]. Certain semantics can only be applied if a dictionary is available (e.g. the requirement that elements in a loop must belong to the same category). These semantics are omitted from the core CIFXML model.
2.1. CIF conformance
To establish the correctness of CIFXML with respect to the schema/DTD (described below, in §4, and included in full in Appendix A) and to act as a validator we have written a Java toolkit, CIFXOM. CIFXOM has been created to implement the CIF standard as described in the specification. We notice, however, that a small but significant fraction of CIFs do not adhere to the specification precisely. The most common deviations (which probably arise from using normal text editors rather than CIF-aware ones) are
(a) incorrect use of delimiters (e.g. assuming that end-of-line closes quotes),
(b) duplication of items,
(c) duplicate datablock names,
(d) improper insertion of `comments' (sometimes apparently added by technical editors) that do not start with `#',
(e) illegal characters (especially non-printing characters).
CIFXOM provides some optional heuristics to attempt recovery from these, but cannot, of course, guarantee that the result is what was intended. We note that the proportion of these errors is declining, presumably as a result of the greater use of checkCIF (mandated by some publishers), CIF conformance in software and the greater familiarity with CIF in the editing processes. Until relatively recently, few if any CIF-aware editing tools existed and manual editing was required for the majority of the creation process. The Cambridge Crystallographic Data Centre (CCDC) provides a free (for individual research and teaching use) tool (enCIFer; CCDC, 2004) which allows even inexperienced users to generate syntactically correct CIFs. Another aid for pre-publication validation and formatting of CIFs is publCIF (Westrip, 2010), available from the International Union of Crystallography web site (https://www.iucr.org/resources/cif/software/ ).
3. CIFXOM functionality
CIFXOM supports the following operations:
(a) Complete syntactic validation of CIF documents.
(b) Dictionary-free semantic validation against the CIF standard.
(c) Conversion of escaped characters to their Unicode equivalents.
(d) Reporting of errors and warnings with original line numbers. Further processing continues after warnings and we attempt optional recovery from some errors.
(e) Optional parsing of numbers with standard uncertainty (s.u.) fields [e.g. 123.45(6)].
(f) Choice of DOM or SAX strategies and choice of parsers.
(g) Creation of a CIFXML object from CIF or XML.
(h) Normalization of document structure.
(i) Canonicalization of document structure.
(j) Optional sorting of part or whole document.
(k) Identification of differences between data models for two CIFs (i.e. independent of syntax and ordering).
(l) Output as XML, HTML or CIF for round-tripping.
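Operation (e) can be illustrated by a short, self-contained sketch that separates a value such as 123.45(6) into the number and its bracketed uncertainty. This is not the CIFXOM parser: the class SuNumberSketch and its parse method are hypothetical, and the sketch assumes plain decimal notation (no exponent) in which the bracketed digits refer to the last decimal places of the value.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SuNumberSketch {
    // A signed decimal number followed by digits in round brackets, e.g. 123.45(6).
    private static final Pattern SU =
        Pattern.compile("([+-]?(?:\\d+\\.?\\d*|\\.\\d+))\\((\\d+)\\)");

    /**
     * Returns {value, uncertainty}, or null if the text does not have the
     * assumed form number(digits).
     */
    static BigDecimal[] parse(String text) {
        Matcher m = SU.matcher(text.trim());
        if (!m.matches()) {
            return null;
        }
        BigDecimal value = new BigDecimal(m.group(1));
        // The bracketed digits are scaled to the same number of decimal
        // places as the value, so 123.45(6) gives an uncertainty of 0.06.
        BigDecimal su = new BigDecimal(new BigInteger(m.group(2)), value.scale());
        return new BigDecimal[] { value, su };
    }
}
```

BigDecimal is used rather than double so that the number of decimal places in the input, which determines the scale of the uncertainty, is not lost.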
4. Representation of CIF documents in XML
The DTD to which the XML serialization of CIFs must conform is included in full in Appendix A, as is its XML schema representation. The elements are listed and described in Table 1.
Using this DTD/schema, we show in Table 2 how a fragment of a typical CIF is translated.
An alternative syntax for the numeric fields, which avoids the problems of parsing suffixed brackets, is exemplified by the following:
5. CIFXOM architecture
CIFXOM is a single package based closely on the SAX model. We have used the simple and elegant XOM (https://www.xom.nu/ ) model rather than the overly engineered and difficult W3C DOM model. CIFXOM contains the following main classes, most of whose functionality is obvious from the name or the position in the class hierarchy. The parsing uses a SAX-like model where events cause callbacks to the content or error handlers.
(a) AbstractBlock.java
(b) AbstractTextElement.java
(c) AbstractValueElement.java
(d) CIFComment.java
(e) CIFContentHandler.java
(f) CIFDataBlock.java
(g) CIFElement.java
(h) CIFErrorHandler.java
(i) CIFException.java
(j) CIFItem.java
(k) CIFLoop.java
(l) CIFParser.java
(m) CIFRow.java
(n) CIFSaveFrame.java
(o) CIFTableCell.java
(p) DOMBuilderContentHandler.java
(q) DefaultContentHandler.java
(r) DefaultErrorHandler.java
The inheritance hierarchy of the main CIFXOM concrete classes is shown in Fig. 1. All CIFXOM elements are descendants of the XOM Element class.
The base class is CIFElement, which defines a basic API for processes common to all subclasses.
(i) String toCIFString ()
This returns the CIFElement as a CIF-formatted string.
(ii) void writeXML (Writer w) throws IOException
This will output the CIFElement and all of its children in an XML format.
(iii) void writeHTML (Writer w) throws IOException
This will output the CIFElement and all of its children in an HTML format with lists converted into HTML tables.
(iv) void writeCIF (Writer w) throws IOException
This will output the CIFElement and all of its children in CIF format, thus showing that CIFXOM is a lossless library. It uses the toCIFString() method described above.
(v) void normalize ()
This will attempt to remove any lexical variants.
(vi) void canonicalize ()
Within a CIF file the order of the datablocks, items and loops (including the row/column ordering) is arbitrary. This will reorganize the order of the various CIFElements within a CIFDocument into a lexical order. The default behaviour of canonicalize () is to apply the following heuristics during its reordering:
(1) CIFItems occur lexically before CIFLoops,
(2) CIFItems are sorted alphabetically by name,
(3) the columns of each CIFLoop are sorted alphabetically by name, then the rows are sorted upon their lexical ordering,
(4) the CIFLoops are sorted alphabetically using the name of their first column.
(vii) void processSu (boolean b)
This determines whether numeric variables with standard uncertainties in brackets should be parsed and analysed.
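The default heuristics (1)-(4) listed under canonicalize () can be sketched on a simplified data model. The classes below are hypothetical stand-ins and do not reproduce the CIFXOM classes or their signatures: items are held as a name-to-value map, a loop as a list of column names plus rows of cell values, and plain String ordering is assumed for all `alphabetical' and `lexical' comparisons.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CanonicalOrderSketch {
    /** Hypothetical stand-in for a loop: column names plus rows of cell values. */
    static final class Loop {
        final List<String> names;
        final List<List<String>> rows;
        Loop(List<String> names, List<List<String>> rows) {
            this.names = names;
            this.rows = rows;
        }
    }

    /** Compares two rows cell by cell; a shorter row sorts first on a tie. */
    static int compareRows(List<String> a, List<String> b) {
        for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.size(), b.size());
    }

    /** Heuristic (3): sort columns by name, carrying the cells along, then sort the rows. */
    static Loop canonicalize(final Loop in) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < in.names.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator.comparing((Integer i) -> in.names.get(i)));
        List<String> names = new ArrayList<>();
        for (int i : order) {
            names.add(in.names.get(i));
        }
        List<List<String>> rows = new ArrayList<>();
        for (List<String> row : in.rows) {
            List<String> reordered = new ArrayList<>();
            for (int i : order) {
                reordered.add(row.get(i));
            }
            rows.add(reordered);
        }
        rows.sort(CanonicalOrderSketch::compareRows);
        return new Loop(names, rows);
    }

    /** Heuristics (1), (2) and (4): sorted items first, then loops sorted by first column name. */
    static List<String> canonicalOrder(Map<String, String> items, List<Loop> loops) {
        List<String> out = new ArrayList<>(new TreeMap<>(items).keySet());
        List<Loop> sorted = new ArrayList<>();
        for (Loop loop : loops) {
            sorted.add(canonicalize(loop));
        }
        sorted.sort(Comparator.comparing((Loop l) -> l.names.get(0)));
        for (Loop loop : sorted) {
            out.add("loop" + loop.names);
        }
        return out;
    }
}
```

Two data blocks with the same content but different ordering yield the same result from canonicalOrder, which is what allows documents to be compared independently of their original layout.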
As a further illustration, an example of the canonicalization algorithm for a small set of CIF data is given in Fig. 2.
6. Installing CIFXOM
CIFXOM requires Java 1.5 or higher (https://www.javasoft.com ) and is available under Artistic License 2.0 (https://www.opensource.org/licenses/artistic-license-2.0 ) from the CML project at Sourceforge (https://sourceforge.net/projects/cml/ ). The latest distribution can be downloaded as a jar file (https://sourceforge.net/projects/cml/ ), or the source code can be downloaded from the Subversion/CVS repositories (https://sourceforge.net/projects/cml/develop ) using an appropriate client. To build the source code, Maven 2.0 (https://maven.apache.org/ ) is recommended. Simple examples, expected output and unit tests can be found in both the distribution and the code repository.
7. Using CIFXOM
CIFXOM is a toolkit and can be used for many purposes. A few standard tasks have been programmed and these will also be valuable for understanding how to use the toolkit. All classes are fully documented and are thus supported by Javadoc (https://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html ), which is recommended as a useful ancillary tool.
As all CIFXOM elements are subclassed from the XOM Element class, CIFXOM uses many XML functions from the XOM library. Therefore, application builders may find it useful to refer to the documentation and tutorials about XOM (https://www.xom.nu/ ).
7.1. Examples of the API
Each of the classes has an API to facilitate the programmatic adding, removing, setting and getting of its particular data fields. For example, some of the methods of CIFItem are shown in Table 3.
7.2. Parsing and callbacks
CIFXOM has a default parsing system which can be subclassed should a different parsing mechanism be needed. This allows the implementer or user to choose between parsers (including at runtime), perhaps on the basis of speed or conformance. In practice most programmers will use the default.
The SAX strategy is that a parser provides callbacks when lexical/document events are fired. This means that the user delegates the parsing process to a parser and only regains control after a complete parse (unless exceptions are thrown). The user provides callbacks to trap the events so that any that are not required can be ignored.
The following code is an excerpt from the readToken method of the CIFParser class, which shows a callback to the CIFContentHandler (contentHandler in the code) to add a CIFItem (item) to the current instance of a CIFDataBlock (this). If there is an error during this method call, there is a callback to the CIFErrorHandler (errorHandler) to provide the error message.
7.3. Example use of the CIFParser class
Code to read a CIF into CIFXML, canonicalize it and then write out the CIFXML is included in full in an appendix.
7.4. A simple editor
A simple use case involves reading a CIF into CIFXML and manipulating it through DOM-like calls, thus providing some of the features of a simple editing system. After creating the CIF document, the process iterates over the datablocks and, for instance, manipulates the _cell_measurement_temp item. In the example provided, it will either add a new item or change the value of the current one. The code is included in full in Appendix C.
8. Deployment
CIFXOM has already been implemented in the following:
(a) The CrystalEye (https://wwmm.ch.cam.ac.uk/crystaleye/ ) web site, a crystallographic repository containing over 120 000 files, all of which have been processed by CIFXOM (parsing, manipulation of the data structure and input for conversion into CML). This has exposed CIFXML to CIFs from a wide range of laboratories with varying degrees of conformance to the exact standard.
(b) The SPECTRa (Downing et al., 2008) and SPECTRa-T (Downing et al., 2010) projects, in which CIFXOM was similarly implemented as a component of repository software implemented at the University of Cambridge, Imperial College London and the University of Southampton.
APPENDIX A
DTD and XSD schema
The DTD to which the XML serialization of CIFs must conform is as follows:
This DTD can also be expressed as an XML schema, as in the following:
Supporting information
CIFXML DTD. DOI: 10.1107/S0021889811011058/he5526sup1.sgml
CIFXML schema. DOI: 10.1107/S0021889811011058/he5526sup2.txt
Acknowledgements
We thank the DTI/EPSRC for support under the UK eScience program. NED thanks the EPSRC for a studentship. The invaluable assistance of Dr Charlotte Bolton in the preparation of this manuscript is acknowledged.
References
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Bluhm, W. (2000). STAR (CIF) Parser, https://pdb.sdsc.edu/STAR/index.html.
CCDC (2004). enCIFer, https://www.ccdc.cam.ac.uk/free_services/encifer/.
Chang, W. & Bourne, P. E. (1998). J. Appl. Cryst. 31, 505–509.
Cook, A. P. F. (1991). Implementing SMD in STAR: Dictionary Definition Language. ORAC Ltd, Leeds, UK.
Downing, J., Harvey, M. J., Morgan, P. B., Murray-Rust, P., Rzepa, H. S., Stewart, D. C., Tonge, A. P. & Townsend, J. A. (2010). J. Chem. Inf. Model. 50, 251–261.
Downing, J., Murray-Rust, P., Tonge, A. P., Morgan, P., Rzepa, H. S., Cotterill, F., Day, N. & Harvey, M. J. (2008). J. Chem. Inf. Model. 48, 1571–1581.
Edgington, P. R. (1997). HICCuP: High-Integrity CIF Checking Using Python. Cambridge Crystallographic Data Centre, UK.
Fitzgerald, P. M. D., Berman, H. M., Bourne, P. E., McMahon, B., Watenpaugh, K. D. & Westbrook, J. (1996). Acta Cryst. A52(Suppl), MSWK.CF.06.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). Acta Cryst. A47, 655–685.
Hall, S. R. & Bernstein, H. J. (1996). J. Appl. Cryst. 29, 598–603.
Hall, S. R. & Cook, A. P. F. (1995). J. Chem. Inf. Comput. Sci. 35, 819–825.
Hall, S. R. & McMahon, B. (2005). Editors. International Tables for Crystallography, Volume G, Definition and Exchange of Crystallographic Data. Heidelberg: Springer.
Hester, J. R. (2006). J. Appl. Cryst. 39, 621–625.
Murray-Rust, P. (1998). Acta Cryst. D54, 1065–1070.
Murray-Rust, P., Adams, S. E., Day, N. E., Downing, J., England, N. W. & Townsend, J. A. (2011). In preparation.
W3C (1997). XML Core Working Group Public Page, https://www.w3.org/XML/Core/.
W3C (1999). XSL Transformations (XSLT), https://www.w3.org/TR/xslt.
W3C (2000). XML Schema, https://www.w3.org/XML/Schema.
W3C (2005). Document Object Model (DOM), https://www.w3.org/DOM/.
Westbrook, J. D., Berman, H. & Hall, S. R. (2005). International Tables for Crystallography, Volume G, Definition and Exchange of Crystallographic Data, ch. 2.6, edited by S. R. Hall & B. McMahon. Heidelberg: Springer.
Westbrook, J. D., Hsieh, S.-H. & Fitzgerald, P. M. D. (1997). J. Appl. Cryst. 30, 79–83.
Westrip, S. P. (2010). J. Appl. Cryst. 43, 920–925.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.