CIF applications
A validating CIF parser: PyCIFRW
ASRP, ANSTO, PMB 1, Menai, NSW, 2234, Australia
*Correspondence e-mail: jrh@anbf2.kek.jp
PyCIFRW is a general-purpose Python package providing a simple, powerful interface for working with CIF files. Objects and methods are available for reading, writing and manipulating CIF files and dictionaries. Comprehensive validation of CIF files and dictionaries against DDL1 or DDL2 dictionaries is also possible. PyCIFRW is easily included in larger projects and is portable across a large number of platforms. Although written in an interpreted language, parsing and validation times are acceptable for most applications.
Keywords: CIF; computer programs.
1. Introduction
The Crystallographic Information File (CIF) syntax (Hall et al., 1991) for data archiving and exchange is well established in crystallography. An important feature of the format is the availability of standard dictionaries which codify the meanings of discipline-specific sets of data items in both human and machine-readable form. These dictionaries are written using a Dictionary Definition Language (DDL) (Hall & Cook, 2005; Westbrook et al., 2005), which is itself simply a set of standard tags. The machine-readable tag–value pairs in these dictionaries describe conditions that each data item should satisfy; for example, the attributes for a particular data item may restrict values to positive integers, or allow this data item to occur only in a loop with another specific data item.
Programming libraries for working with CIF and STAR files (CIF syntax is a restricted form of STAR syntax) significantly ease the task of adding CIF functionality to software projects. Such libraries have long been available in the well established compiled languages Fortran (Hall & Bernstein, 1996) and C, or variants (Westbrook et al., 1997; Chang & Bourne, 1998). Among other capabilities, these libraries read and write syntactically correct CIF files, and in some cases check data values and structures against one or more dictionaries.
Such broad support is generally lacking for interpreted languages, despite a number of advantages that these languages have over compiled languages. These advantages include: an identical programming interface on a variety of the common desktop and mainframe platforms, allowing creation of portable programs with no additional effort from the programmer; dynamic typing and high-level constructs, which mean that similar programming tasks require significantly fewer lines of code than their C/Fortran equivalents, leading to better program maintainability; simple tasks can be accomplished using built-in interactive interpreters; and program distribution and installation is usually simpler for both author and installer. Such advantages come at the cost of a much slower execution speed and larger memory footprint during execution. However, with the rise of desktop computer power over the past decade, these costs have diminished to the point where interpreted languages have become viable alternatives for performing common computational tasks. Among the most established of these languages are Tcl (Ousterhout, 1994), Perl (Wall et al., 2000) and Python (van Rossum & Drake, 2003). A degree of CIF parsing and validation support in these languages is provided by HICCuP (Edgington, 1997), written in Python, and STAR::Parser (Bluhm, 2000), written in Perl. HICCuP was an early stand-alone application for working with CIFs, and offered a series of validation tests against preprocessed DDL1 dictionaries. STAR::Parser is a set of modules offering both parsing and validity checking against DDL1 and DDL2 dictionaries.
The project described here is a comprehensive native Python library for CIF parsing and validating. It differs from the above efforts in the broader scope of its validation tests against both DDL1 and DDL2 dictionaries and the ability to work with multiple merged dictionaries. Unlike HICCuP, it is a general purpose, native implementation designed for rapid integration into larger projects.
2. Using PyCIFRW
This section gives a sample of the simple and powerful interface provided by PyCIFRW, using the example CIF file of Fig. 1.
An important task for a CIF application is to extract data from a pre-existing CIF file. Fig. 2 shows an interactive session reading in a file and accessing data blocks. A CIF file is read in by passing a file name or URL when initializing a CifFile object. Data blocks within the file and data items within the data blocks are then accessed using square-bracket notation, emulating the syntax for access into the built-in Python `dictionary' data type. This emulation includes a series of standard Python dictionary access methods for extracting and setting values.
Each CIF data block (e.g. mycif['II'] in Fig. 2) is an instance of a CifBlock object. Data values for items contained in a CifBlock are either strings or numbers. If the value of a looped item name is requested, a Python list of strings or numbers is returned; however, a complete set of looped items is typically more useful than lists of a single item's values. For this reason, the CifBlock method GetLoop(loop_dataname) returns a CifLoopBlock object representing the loop block containing loop_dataname. As a CifLoopBlock object represents an entire loop, it is possible to add, remove and alter co-looped items by simple assignment using the square-bracket notation.
A CIF loop is composed of one or more loop `packets', where a single loop packet contains the values taken by all of the data items in one iteration of the loop. CifLoopBlock objects allow retrieval of a packet from a loop block by packet number, as shown in Fig. 3. In this style of access, the link between position in the returned Python list and data name is obtained by calling the method GetItemOrder(). More usefully, Python-style iterators are also defined, allowing economical selection of packets satisfying an arbitrary condition. The final command in Fig. 3 demonstrates the use of this feature. The filter command returns only those packets in the second argument for which the function given in the first argument returns True. When the function is executed for each packet in the CifLoopBlock object anisos, the packet is passed as the first variable to the function; in the example, an anonymous (`lambda') function of one variable (a) is specified. This function checks whether the variable's _atom_site_aniso_label is equal to `C4A', thereby selecting only packets for atom C4A.
These loop methods and iterators are also available for the more general case of STAR nested loops (see §3 below).
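The filter-based packet selection described above can be illustrated without PyCIFRW itself. The following is a minimal sketch in which loop packets are emulated by plain Python objects; the helper make_packet and the numerical values are hypothetical and are not part of the PyCIFRW interface.

```python
from types import SimpleNamespace


def make_packet(label, u11):
    # Hypothetical stand-in for a loop packet: one attribute per co-looped data name.
    return SimpleNamespace(_atom_site_aniso_label=label, _atom_site_aniso_U_11=u11)


# Made-up values, for illustration only.
anisos = [
    make_packet("C3A", 0.031),
    make_packet("C4A", 0.047),
    make_packet("C5A", 0.052),
]

# filter() keeps only those packets for which the lambda returns True.
selected = list(filter(lambda a: a._atom_site_aniso_label == "C4A", anisos))
print([p._atom_site_aniso_U_11 for p in selected])
```

Because filter accepts any iterable, the same expression would apply to any object that yields packets one at a time.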
2.1. Save frames
While CIF data files should not use save frames, PyCIFRW reads, manipulates and writes save frames in order to support DDL2 dictionaries. The save frames in a data block are available through the special key `saves', the value of which is also a dictionary. The keys in this dictionary are the frame names. Each save frame behaves identically to a normal data block.
For example, if df = CifFile('cif_mm_2.0.03.dic'), then df['cif_mm.dic']['saves'].keys() will give a list of save-frame names (in this case, data and category name definitions from the macromolecular dictionary), and df['cif_mm.dic']['saves']['atom_site'] will be a CifBlock containing the attributes of the atom_site category. Save frames are invisible during normal operations on a data block: for example, df['cif_mm.dic'].keys() returns a short list of dictionary global data names, excluding the 1800 data name definitions found inside the save frames.
2.2. CIF output
A string object suitable for writing to a file is obtained by calling the built-in Python str function on a CifFile object. Internal formatting functions insert quotation marks and semicolons where necessary, and illegal characters will be absent, as the presence of such characters in a data value would have caused an error to be raised when the data value was first set. Overlong lines are broken at the last whitespace before the line length limit, or, if no whitespace is available, at an arbitrary 80 characters. Special handling of long lines using the backslash convention is not yet implemented.
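The line-breaking rule described above can be sketched as follows. This is an illustration of the stated rule only, not PyCIFRW's own formatting code; the function name break_long_line and the default values of limit and fallback are assumptions, and the sketch takes no account of quoting.

```python
def break_long_line(line, limit=80, fallback=80):
    """Split `line` into pieces no longer than `limit` characters.

    Breaks at the last space at or before `limit`; if there is none,
    breaks after `fallback` characters (assumed to satisfy 1 <= fallback <= limit).
    """
    pieces = []
    while len(line) > limit:
        # Search for the last space in positions 1..limit (a leading space is ignored).
        cut = line.rfind(" ", 1, limit + 1)
        if cut == -1:
            # No usable whitespace: hard break at the fallback width.
            pieces.append(line[:fallback])
            line = line[fallback:]
        else:
            # Break at the space and drop it from the output.
            pieces.append(line[:cut])
            line = line[cut + 1:]
    pieces.append(line)
    return pieces


text = " ".join(["word"] * 30)
for piece in break_long_line(text):
    print(len(piece), piece)
```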
2.3. Validation
In the following, `validation' is used to mean checking that the values and placement of data items in a CIF data block conform to the specifications contained in one or more machine-readable dictionaries.
PyCIFRW defines a validate function which returns validation results for the given file and dictionary or dictionaries. Alternatively, PyCIFRW bundles a simple command-line program called cif_validate.py which executes all relevant validation functions on the given file when passed a list of dictionaries. Dictionaries may be specified by name and version, or by file name, and are downloaded if necessary. An example use of this program is shown in Fig. 4. DDL2 dictionaries may be used to validate DDL1-style files; this is achieved by creating new dictionary entries during internal initialization using the value of the DDL2 _item_aliases.alias_name attribute.
A list of the validation tests performed by PyCIFRW is given in Table 1. Each test was developed by examining the description of each attribute in the DDL1 and DDL2 specifications published in the International Tables for Crystallography, Vol. G (Hall & McMahon, 2005), where necessary referring to canonical dictionaries for examples of correct attribute use. Note that, while an output file produced by PyCIFRW is guaranteed to be syntactically correct, it is not guaranteed to be valid unless the validate function returns no errors.
2.3.1. Treatment of multiple dictionaries
When multiple dictionaries are provided to the validation routines, they are first merged according to the protocol suggested by McMahon (2005) (BM), with the following variations.
(a) The calling function is responsible for ordering the dictionary list [BM step (i)].
(b) Contrary to BM, when a dictionary definition contains unlooped attributes, and the new dictionary to be merged with it also contains some or all of these attributes, `overlay' mode will not attempt to construct a loop including both old and new attributes, even if those attributes may be looped. Instead, the merged dictionary will contain the attribute as it appears in the new dictionary, if necessary removing it from any loop block or adding the complete loop block in which it appears. Behaviour as specified by BM requires access to the DDL1/2 dictionaries specifying looping properties for the data items, and will be implemented in a future release.
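Variation (b) amounts to letting an attribute from the new dictionary replace the same attribute in the old one. A minimal sketch of that replacement rule is given below, with each dictionary modelled as a plain mapping from data name to attributes. The function overlay_merge and the attribute values are hypothetical, and loop handling is omitted entirely.

```python
def overlay_merge(old_dic, new_dic):
    """Return a merged copy in which attributes from new_dic replace those in old_dic."""
    # Copy each definition so that the inputs are left unmodified.
    merged = {name: dict(attrs) for name, attrs in old_dic.items()}
    for name, new_attrs in new_dic.items():
        # New definitions are added; existing attributes are overwritten, not looped.
        merged.setdefault(name, {}).update(new_attrs)
    return merged


# Illustrative definitions only; the values are made up.
old = {"_cell_length_a": {"_type": "numb", "_enumeration_range": "0.0:"}}
new = {
    "_cell_length_a": {"_enumeration_range": "1.0:"},
    "_local_item": {"_type": "char"},
}

merged = overlay_merge(old, new)
print(merged)
```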
2.3.2. Comments on the validation of DDL1 and DDL2 dictionaries
Just as CIF data files are subject to constraints expressed in dictionaries, dictionaries are themselves subject to constraints expressed in the dictionaries that define the DDL1 and DDL2 attributes. The use of save frames in DDL2 dictionaries sometimes leads to subtle differences in interpretation of attribute meaning when validating dictionaries compared with validating data files. In particular, it becomes important to identify the correct object for validation: in the general case, we conclude that it is the complete data block which is valid or invalid, rather than, for example, a single definition block. While it might be expected that the combination of enclosing data block and single definition save frame for a DDL2 dictionary would be sufficient to satisfy validity constraints, this is also not generally true. This behaviour arises because ascertaining the correctness of a number of DDL2 attributes requires checking the presence or value of certain attributes which can only be found in other save frames. For example, _item.category_id, which appears in most save frames, is a child of _category.id, so values taken by _category.id in category-definition save frames must be examined to check that _item.category_id takes legal values.
The interpretation of the _category.mandatory_code attribute is somewhat unclear in the context of a DDL2 dictionary. Vol. G of International Tables for Crystallography (Hall & McMahon, 2005) states that this attribute `specifies whether the category must appear in any data block based on this dictionary' (p. 64). Uncertainty arises from the setting of this attribute to yes for the ITEM_DESCRIPTION category. In currently available DDL2 dictionaries, attributes belonging to this category appear in those save frames that define data item names. The intention of the authors of the ITEM_DESCRIPTION category definition in specifying that this is a mandatory category would appear to have been to force all defined names to have an associated description, which implies from a validation point of view that the scope of a search when checking for the presence of such a mandatory category item is a single save frame.
However, under such an interpretation, all category definition save frames in DDL2 dictionaries would be non-conformant, as they do not contain any data items from the ITEM_DESCRIPTION category. Therefore, an alternative interpretation is adopted, where an item is considered to appear in a data block even if the only appearance is in one or more save frames.
As implied by the above discussion, PyCIFRW will correctly validate DDL1 dictionaries in the same way as ordinary files; however, DDL2 dictionaries require a special flag to the validation routines. This flag causes PyCIFRW firstly to make the translation save frame → data block for the purposes of validation, but then to search outside a single save frame to resolve parent–child references and check for the presence of items from mandatory categories in the dictionary as a whole.
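The need to search outside a single save frame can be illustrated with the _item.category_id example. The sketch below models the save frames of a dictionary as plain Python dictionaries; the function check_category_ids and the frame contents are hypothetical, and the comparison is case-sensitive for simplicity.

```python
def check_category_ids(save_frames):
    """Return names of frames whose _item.category_id matches no _category.id."""
    # Parent values are gathered from every save frame in the dictionary.
    known = {
        frame["_category.id"]
        for frame in save_frames.values()
        if "_category.id" in frame
    }
    # Each child value is then checked against the full set of parent values.
    return [
        name
        for name, frame in save_frames.items()
        if "_item.category_id" in frame and frame["_item.category_id"] not in known
    ]


# Illustrative save frames only.
saves = {
    "atom_site": {"_category.id": "atom_site"},
    "_atom_site.id": {"_item.category_id": "atom_site"},
    "_bogus.item": {"_item.category_id": "bogus"},
}

print(check_category_ids(saves))
```

A check restricted to one save frame at a time could not distinguish the second and third frames, since neither contains a _category.id value.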
3. Implementation
The various PyCIFRW objects are built out of two fundamental objects defined in the underlying STAR file implementation: BlockCollection and LoopBlock.
LoopBlock objects are collections of key–value pairs with a special key `loops' containing a possibly empty list of LoopBlocks, corresponding to nested loops. This list is also searched when retrieving or setting data values, so that data names act as if they are keys of the outermost LoopBlock. LoopBlocks are STAR-conformant objects; in particular, they may be arbitrarily deeply nested, and have no name length restrictions. A number of methods are provided in the underlying STAR file implementation for iterating over nested loop packets, some of which are useful in the CIF context, as described in the previous section.
A CifLoopBlock is a LoopBlock with restricted-length key names, and values which are either simple lists or atomic values. A CifBlock is a CifLoopBlock with non-list data values. The underlying generalization is that a STAR data block is a special case of a STAR loop block; the set of all non-looped key–value pairs in a CIF or STAR block could be equally well expressed as a loop with a single data packet.
A BlockCollection object represents a collection of objects derived from LoopBlock and is used to construct the CifFile object and each block's set of save frames, which are also collections of blocks. BlockCollections add case-insensitivity of key names and preservation of input order to the standard Python dictionary type.
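The two properties that BlockCollection adds, case-insensitive keys and preservation of input order, can be sketched with the standard library alone. The class below is an illustration of the idea and is not PyCIFRW's BlockCollection; its name is hypothetical, and it relies on the insertion ordering of built-in dictionaries in Python 3.7 and later.

```python
from collections.abc import MutableMapping


class CaseInsensitiveOrderedDict(MutableMapping):
    """Mapping with case-insensitive string keys that remembers input order."""

    def __init__(self):
        # Lower-cased key -> (key as originally given, value).
        self._data = {}

    def __getitem__(self, key):
        return self._data[key.lower()][1]

    def __setitem__(self, key, value):
        self._data[key.lower()] = (key, value)

    def __delitem__(self, key):
        del self._data[key.lower()]

    def __iter__(self):
        # Yield keys in insertion order, with their original spelling.
        return (original for original, _ in self._data.values())

    def __len__(self):
        return len(self._data)


blocks = CaseInsensitiveOrderedDict()
blocks["Global"] = "first block"
blocks["II"] = "second block"
print(blocks["global"], list(blocks))
```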
3.1. CIF input
The data representation described earlier is built up during parsing of a CIF file. Parsing is accomplished using a parser constructed by the Yapps2 lexer/parser (Patel, 2003) from a simple implementation of the STAR grammar specification. This grammar, together with pre- and post-parsing checks, is designed to be rigorously conformant to the CIF 1.1 standard (Hall et al., 2005) and as such PyCIFRW can be used as a CIF syntax checker. It has been tested against the IUCr `trip' test suite (http://www.iucr.org/iucr-top/cif/developers/trip) and correctly identifies both conformant and non-conformant files.
3.2. Dictionaries, validation and merging
The CifDic object is subclassed from a BlockCollection object, and requires one or more DDL1 and/or DDL2 dictionaries to be provided at initialization. These dictionaries are normalized to a uniform internal structure, allowing data name definitions to be accessed using square bracket notation instead of needing to access the various save frames in the case of DDL2.
The validation routines listed in Table 1 are methods of the CifDic object. Value type checking for DDL2 dictionaries is performed by direct use of the regular expressions contained in the dictionary file.
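Type checking by regular expression can be sketched as below. The table type_constructs stands in for the type definitions that would be read from a DDL2 dictionary; its contents, and the function name value_matches_type, are assumptions made for illustration.

```python
import re

# Hypothetical type table: type code -> regular expression a value must match in full.
type_constructs = {
    "int": r"[+-]?[0-9]+",
    "code": r"[A-Za-z0-9_.-]+",
}


def value_matches_type(value, type_code, constructs=type_constructs):
    """Return True if the whole of `value` matches the expression for `type_code`."""
    return re.fullmatch(constructs[type_code], value) is not None


for candidate in ("42", "-7", "4.2"):
    print(candidate, value_matches_type(candidate, "int"))
```

Using re.fullmatch rather than re.match ensures that trailing characters not covered by the expression cause the value to be rejected.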
A number of obvious transformations are performed when initializing DDL1 dictionaries in order to produce DDL2-like behaviour: category-wide information (e.g. _list_mandatory, _list_uniqueness) is transferred into a category block; definitions containing looped _name data items are expanded to include one item name per definition; enumeration ranges are expressed using DDL2-style maximum/minimum specifications; and specific _type_construct attributes are moved to the dictionary global level. DDL2 dictionaries are also transformed by moving parent/child attributes to the corresponding data name definition as for DDL1 dictionaries. Due to the difficulty of reversing this latter transformation, dictionary merging in PyCIFRW is performed at the CifFile level, and the final merged CifFile object is used to initialize the CifDic object.
4. Discussion
Table 2 gives some representative times for input and validation of typical small-molecule and protein data files, as well as dictionary preparation times (which are dominated by the parsing stage). As expected for an interpreted language, parsing of typical input files is of the order of seconds with current hardware, compared with compiled-language parsers which would take small fractions of a second. Validation times give an order of magnitude estimate of data access times for data-hungry applications, as validation requires accessing every data item at least once.
For applications which do not access large numbers of CIF files, these times are well within acceptable limits. As compensation for the loss in execution efficiency, programmers obtain maintainable, easily distributable code which runs without change on all platforms, and end users have the option of simple command-line interaction with CIF files.
5. Availability
PyCIFRW runs on all platforms supported by Python, which includes Windows, Linux and Mac OS X. The program code and developer documentation are produced from single files in the literate programming noweb (Ramsey, 1994) format to ensure maximum accessibility and maintainability. PyCIFRW is copyright the Australian Synchrotron Research Program and is freely downloadable under liberal licensing terms from http://anbf2.kek.jp/CIF. It is also bundled as part of the CCTBX project (Grosse-Kunstleve et al., 2002).
Acknowledgements
The author is grateful to a number of early users of PyCIFRW, especially R. Grosse-Kunstleve and D. du Boulay.
References
Bluhm, W. (2000). Star (CIF) parser, http://pdb.sdsc.edu/STAR/index.html.
Chang, W. & Bourne, P. E. (1998). J. Appl. Cryst. 31, 505–509.
Edgington, P. R. (1997). HICCuP: High-Integrity CIF Checking Using Python, Cambridge Crystallographic Data Centre, UK.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Hall, S. R. & McMahon, B. (2005). Editors. International Tables for Crystallography, Vol. G. IUCr/Springer.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). Acta Cryst. A47, 655–685.
Hall, S. R. & Bernstein, H. J. (1996). J. Appl. Cryst. 29, 598–603.
Hall, S. R. & Cook, A. P. F. (2005). International Tables for Crystallography, Vol. G, edited by S. R. Hall & B. McMahon, ch. 2.5, pp. 53–60. IUCr/Springer.
Hall, S. R., Spadaccini, N., Brown, I. D., Bernstein, H. J., Westbrook, J. D. & McMahon, B. (2005). International Tables for Crystallography, Vol. G, edited by S. R. Hall & B. McMahon, ch. 2.2.7, pp. 25–36. IUCr/Springer.
Leduc, Y. A., Prasad, L., Laivenieks, M. J. G. Z. & Delbaere, L. T. (2005). Acta Cryst. D61, 903–912.
McMahon, B. (2005). International Tables for Crystallography, Vol. G, edited by S. R. Hall & B. McMahon, ch. 3.1.9, pp. 88–89. IUCr/Springer.
Mondal, S., Mukherjee, M., Roy, A., Mukherjee, D. & Helliwell, M. (2002). Acta Cryst. C58, o474–o476.
Ousterhout, J. K. (1994). Tcl and the Tk Toolkit, London: Addison-Wesley.
Patel, A. (2003). Parsing with yapps. http://theory.stanford.edu/~amitp/Yapps.
Ramsey, N. (1994). IEEE Softw. 11(5), 97–105.
Rossum, G. van & Drake, F. L. Jr (2003). The Python Language Reference Manual. Bristol: Network Theory Ltd.
Wall, L., Christiansen, T. & Orwant, J. (2000). Programming Perl. 3rd ed. Cambridge, MA: O'Reilly.
Westbrook, J. D., Berman, H. M. & Hall, S. R. (2005). International Tables for Crystallography, Vol. G, edited by S. R. Hall & B. McMahon, ch. 2.6, pp. 61–70. IUCr/Springer.
Westbrook, J. D., Hsieh, S.-H. & Fitzgerald, P. M. D. (1997). J. Appl. Cryst. 30, 79–83.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.