CIF applications
VCIF2: extended validation software
(a) 95 Biltmore Avenue, Oakdale, NY 11769, USA, and (b) Department of Mathematics and Computer Science, Dowling College, 150 Idle Hour Boulevard, Oakdale, NY 11769-1999, USA
*Correspondence e-mail: terahz@geodar.com
Recent revisions to the CIF standard, the growing number of dictionaries and the critical role played by CIF in the IUCr publication process led the IUCr to fund a two-year project to upgrade portions of the existing software base to support longer lines and more rigorous validation of CIFs against multiple layered dictionaries. A database-based approach to validation is presented here. It ensures compliance with data-range and enumeration specifications and with parent–child relationships, and it detects missing and duplicated tags. This approach to validation is being extended to support the handling of binary synchrotron imgCIF data.
Keywords: CIF validation; binary imgCIF data; compliance.
1. Introduction
`The term `crystallographic information file' (CIF) refers to data and dictionary files conforming to the conventions adopted by the IUCr in 1990 and revised by the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS). The format is intended to meet the needs of a wide range of scientific applications within, and without, the discipline of crystallography.' (Hall et al., 2005.)
Validation of a CIF involves two checks: one check for the syntax (i.e. checking against the formal rules of the grammar) and then a check for the domain content. In this paper we are concerned with checking the syntax. Such syntax validation is carried out by many existing programs, for example the program VCIF (McMahon, 1998, 2005). CIF has grown over the years and a revised version of VCIF, VCIF2, has become necessary.
We present a software suite for easy extended lexical, parser and dictionary CIF validation for IUCr publications. The chemical and biological content are not validated except in the presence of and to the extent provided by a dictionary.
Acta Crystallographica Section C: Crystal Structure Communications (//journals.iucr.org/c/) and the more recent Section E: Structure Reports Online (//journals.iucr.org/e/) are two popular journals of the IUCr used for publishing crystal structures. Section F: Structural Biology and Crystallization Communications Online (//journals.iucr.org/f/) is a new journal of the IUCr. All three journals use the crystallographic information file (CIF; Hall et al., 1991) for submission of new structures and require careful validation of the CIFs. Recently, so-called long-line CIFs have been introduced for Acta C and E, and new software for handling long lines was required. For Acta F there is a need for validation programs that can handle the complexity of mmCIF (see Appendix A for a glossary of terms). mmCIF introduces parent–child relationships among categories (tables) that require extensions to the existing validation software. The parent–child relationship (a child is a subtable and the parent is the corresponding supertable) is one of the key features of a relational database. Hence, we introduce a new generation of validation software that uses the database model for dictionary validation without needing a local database server. The software is available as a server web page and as a downloadable kit.
2. CBF and imgCIF
The acronym CIF is used both for the crystallographic information file, the data exchange standard file format of Hall et al. (1991), and for the crystallographic information framework, a broader system of exchange protocols based on data dictionaries and relational rules expressible in different machine-readable manifestations, including, but not restricted to, crystallographic information file and XML (Bray et al., 2004). Fig. 1 is a short snippet of 1zrt.cif.
The very large sizes and short data collection times of raw synchrotron data images make pure ASCII text formats less desirable than binary formats. Since CIFs are pure ASCII text files, a separate binary format had to be defined to allow the combination of pseudo-ASCII sections and binary data sections to handle raw synchrotron data images within the context of CIF. CBF and imgCIF (Bernstein & Hammersley, 2005) are two aspects of the same format. The binary file format is the crystallographic binary file (CBF). The ASCII sections are very close to the CIF standard but must use operating-system-independent `line separators'. imgCIF is also the name of the dictionary (Hammersley et al., 2005) that contains the terms specific to describing the binary data. The imgCIF dictionary is layered on the macromolecular CIF (mmCIF) dictionary (Fitzgerald et al., 2005).
3. VCIF2 overview
The default input for VCIF2 is stdin, but a file can be specified with -i filename. Both CIF and CBF formats are supported. The program writes the output file to stdout or to a file specified with -o filename. If base64 or quoted-printable encoding is used, the output file will be in CIF format, otherwise CBF. All errors and warnings are sent to stderr. When long-line CIFs are processed, the -w option is required in order to avoid `over line size limit' warnings and to output wide lines instead of folding them.
VCIF2 also supports dictionary validation. A dictionary is specified with the -v option.
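The options described so far (-i, -o, -w and -v) can be collected into an argument list by a small wrapper. The following is only a sketch: the function name build_vcif2_args and its parameters are hypothetical, the program name vcif2 is assumed to be on the search path, and no option beyond those named in the text is used.

```python
from typing import List, Optional


def build_vcif2_args(
    input_path: Optional[str] = None,
    output_path: Optional[str] = None,
    dictionary: Optional[str] = None,
    wide_lines: bool = False,
    program: str = "vcif2",  # assumption: the alias is installed under this name
) -> List[str]:
    """Assemble an argument list from the options described in the text."""
    args = [program]
    if wide_lines:
        args.append("-w")            # accept long lines and do not fold output
    if dictionary is not None:
        args += ["-v", dictionary]   # validate against this dictionary
    if input_path is not None:
        args += ["-i", input_path]   # without -i, stdin is read
    if output_path is not None:
        args += ["-o", output_path]  # without -o, stdout is written
    return args


if __name__ == "__main__":
    print(build_vcif2_args("1zrt.cif", None, "mmcif_pdbx.dic", wide_lines=True))
```

The resulting list could be handed to subprocess.run with capture_output=True, in which case the errors and warnings would be expected on the captured stderr stream.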
On read, the parser checks every token (e.g. word, punctuation etc.). First it does a syntax check and then it performs more in-depth validation, such as checks against the dictionary and of parent–child relationships.
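One of the simplest syntax-level checks is the line size limit that the -w option relaxes. The sketch below illustrates such a check and is not the VCIF2 implementation. The limits of 80 and 2048 characters are taken from the original and the revised CIF specifications, not from this paper, and the function name find_overlong_lines is hypothetical.

```python
from typing import List, Tuple

NARROW_LIMIT = 80    # assumption: line limit of the original CIF specification
WIDE_LIMIT = 2048    # assumption: line limit for long-line CIFs


def find_overlong_lines(text: str, wide: bool = False) -> List[Tuple[int, int]]:
    """Return (line_number, length) for every line longer than the active limit."""
    limit = WIDE_LIMIT if wide else NARROW_LIMIT
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > limit:
            problems.append((number, len(line)))
    return problems


if __name__ == "__main__":
    sample = "data_demo\n_demo.text " + "x" * 100 + "\n"
    print(find_overlong_lines(sample))             # narrow mode flags line 2
    print(find_overlong_lines(sample, wide=True))  # wide mode accepts it
```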
An example of validating a wide-line CIF against the PDBx dictionary would be
vcif2 -w -v mmcif_pdbx.dic -i 1zrt.cif /dev/null
where mmcif_pdbx.dic is the dictionary, 1zrt.cif is the file to be validated and /dev/null means the output will be discarded (on Unix machines).
4. VCIF2 web interface
Because of the popularity of the World Wide Web we created a web interface to VCIF2. This simplifies the process of using the program. The user is required to have a web browser and a file for validation. This web interface can be accessed via the CIF Validation Webpage (Todorov, 2006). Currently supported dictionaries are coreCIF, pdCIF, msCIF, rhoCIF, mmCIF, PDBx, imgCIF and symCIF.
The input file is specified via an open file dialog and the rest of the options are radio buttons on the web page (Fig. 2). The output is generated after the Validate button is pressed and will contain the original output of VCIF2. The line numbers where errors were detected will be hyperlinks to sections after VCIF2's output that correspond to the detected error. Each corresponding section provides five context lines from the validated file with the problem line in bold in the middle. Additionally, if common mistakes are detected, suggestions are provided in the same section. See Figs. 3 and 4 for examples of a string error and two quote errors, respectively, detected by VCIF2.
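The context display described above (five lines with the problem line in bold in the middle) can be sketched as follows. This is an illustration written in Python, not a reproduction of the server script. The function name context_block, the line-number prefix and the use of a b element for bold are assumptions.

```python
import html
from typing import List


def context_block(lines: List[str], problem_line: int, radius: int = 2) -> str:
    """Render the lines around problem_line (1-based) as HTML text.

    With radius=2 this gives up to five lines, the problem line in bold.
    Fewer lines are returned near the start or end of the file.
    """
    first = max(1, problem_line - radius)
    last = min(len(lines), problem_line + radius)
    rendered = []
    for number in range(first, last + 1):
        text = html.escape(lines[number - 1])  # keep '<' and '&' from the CIF literal
        if number == problem_line:
            text = "<b>" + text + "</b>"
        rendered.append(f"{number}: {text}")
    return "\n".join(rendered)


if __name__ == "__main__":
    demo = ["data_demo", "_a 1", "_b 'unterminated", "_c 3", "_d 4", "_e 5"]
    print(context_block(demo, 3))
```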
5. Implementation
VCIF2 has been embedded in an existing utility called cif2cbf, which is part of the CBF library (CBFlib; Ellis & Bernstein, 2001, 2005). The program name VCIF2 is simply an alias for cif2cbf with appropriate command-line options. Most of the code is written in standard C, with small parts making use of Fortran and the yacc parser. The web interface uses standard HTML forms with a PHP script in the back end for parsing and executing the VCIF2 binary on our server. When VCIF2 validates against a dictionary, the dictionary populates a database-like table, represented as a file in memory. After the lexical and parser validations are performed, the input is checked against the dictionary for validity.
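The database model of dictionary validation can be illustrated with a toy example. The sketch below uses Python's sqlite3 module with an in-memory database as a stand-in for the database-like table. It is not the C implementation in cif2cbf, whose internal layout is not described here. The table names, the validate function and the entries in TOY_DICTIONARY (including the tag _demo.flag and all ranges and enumerations) are invented for illustration.

```python
import sqlite3
from typing import Dict, List, Tuple

# Hypothetical dictionary: optional range, enumeration and parent per tag.
TOY_DICTIONARY: Dict[str, dict] = {
    "_struct_asym.id": {},
    "_atom_site.label_asym_id": {"parent": "_struct_asym.id"},
    "_atom_site.occupancy": {"min": 0, "max": 1},
    "_demo.flag": {"enum": ["yes", "no"]},
}


def validate(dictionary: Dict[str, dict], data: List[Tuple[str, str]]) -> List[str]:
    """Check (tag, value) pairs against a dictionary held in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE dict_item(tag TEXT PRIMARY KEY, lo REAL, hi REAL, parent TEXT);
        CREATE TABLE dict_enum(tag TEXT, value TEXT);
        CREATE TABLE cif_data(tag TEXT, value TEXT);
        """
    )
    for tag, spec in dictionary.items():
        db.execute(
            "INSERT INTO dict_item VALUES (?, ?, ?, ?)",
            (tag, spec.get("min"), spec.get("max"), spec.get("parent")),
        )
        db.executemany(
            "INSERT INTO dict_enum VALUES (?, ?)",
            [(tag, value) for value in spec.get("enum", [])],
        )
    db.executemany("INSERT INTO cif_data VALUES (?, ?)", data)

    errors: List[str] = []
    # Tags that the dictionary does not define.
    unknown = db.execute(
        "SELECT DISTINCT tag FROM cif_data "
        "WHERE tag NOT IN (SELECT tag FROM dict_item) ORDER BY tag"
    )
    errors += [f"unknown tag: {tag}" for (tag,) in unknown]
    # Numeric values outside the permitted data range.
    out_of_range = db.execute(
        "SELECT d.tag, d.value FROM cif_data d JOIN dict_item i ON i.tag = d.tag "
        "WHERE (i.lo IS NOT NULL AND CAST(d.value AS REAL) < i.lo) "
        "   OR (i.hi IS NOT NULL AND CAST(d.value AS REAL) > i.hi) "
        "ORDER BY d.tag, d.value"
    )
    errors += [f"out of range: {tag} = {value}" for tag, value in out_of_range]
    # Values that are not members of the tag's enumeration.
    bad_enum = db.execute(
        "SELECT d.tag, d.value FROM cif_data d "
        "WHERE d.tag IN (SELECT tag FROM dict_enum) "
        "  AND NOT EXISTS (SELECT 1 FROM dict_enum e "
        "                  WHERE e.tag = d.tag AND e.value = d.value) "
        "ORDER BY d.tag, d.value"
    )
    errors += [f"not in enumeration: {tag} = {value}" for tag, value in bad_enum]
    # Child values with no matching value under the parent tag.
    orphans = db.execute(
        "SELECT d.tag, d.value, i.parent FROM cif_data d "
        "JOIN dict_item i ON i.tag = d.tag "
        "WHERE i.parent IS NOT NULL "
        "  AND NOT EXISTS (SELECT 1 FROM cif_data p "
        "                  WHERE p.tag = i.parent AND p.value = d.value) "
        "ORDER BY d.tag, d.value"
    )
    errors += [f"no parent in {parent}: {tag} = {value}" for tag, value, parent in orphans]
    db.close()
    return errors


if __name__ == "__main__":
    rows = [("_struct_asym.id", "A"), ("_atom_site.label_asym_id", "B")]
    for message in validate(TOY_DICTIONARY, rows):
        print(message)
```

Expressing each rule as a query keeps the checks declarative, and an in-memory database avoids the need for a database server. Whether VCIF2 organizes its checks this way is not stated in this paper.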
For binary synchrotron imgCIF data, the program checks the validity of the header tags and the ranges of their values, and the validity and checksum of the MIME header of the actual binary image, but the image itself is not validated, other than for the checksum and size.
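The checksum and size checks on a binary section might look like the following sketch. It assumes that the MIME-style header carries the size in an X-Binary-Size field and a base64-encoded MD5 digest in a Content-MD5 field, and that both refer to the binary data as stored. These field names and the function check_binary_section are assumptions, not taken from this paper. Parsing of the header itself is omitted.

```python
import base64
import hashlib
from typing import Dict, List


def check_binary_section(headers: Dict[str, str], payload: bytes) -> List[str]:
    """Compare the declared size and MD5 digest with the stored binary payload."""
    problems = []

    size = headers.get("X-Binary-Size")
    if size is None:
        problems.append("missing X-Binary-Size")
    elif not size.strip().isdigit():
        problems.append("X-Binary-Size is not an integer")
    elif int(size) != len(payload):
        problems.append("size mismatch")

    digest = headers.get("Content-MD5")  # treated as optional in this sketch
    if digest is not None:
        actual = base64.b64encode(hashlib.md5(payload).digest()).decode("ascii")
        if actual != digest.strip():
            problems.append("MD5 mismatch")

    return problems


if __name__ == "__main__":
    data = bytes(range(16))
    good = {
        "X-Binary-Size": str(len(data)),
        "Content-MD5": base64.b64encode(hashlib.md5(data).digest()).decode("ascii"),
    }
    print(check_binary_section(good, data))         # no problems expected
    print(check_binary_section(good, data + b"x"))  # both checks should fail
```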
6. Distribution
VCIF2 is called cif2cbf in CBFlib and is located in its examples folder. Current development of CBFlib is being carried out on our GForge server (Arcib Laboratory, 2006a). Complete developer or binary kits can be found via the file release system on the project's website (Arcib Laboratory, 2006b) or via CVS (Arcib Laboratory, 2006c). The latest testing version of CBFlib is in the CVS repository under module name CBFlib_bleeding_edge. The latest stable version can be found in the CVS with module name CBFlib_latest_stable. The web interface code, along with dictionaries and VCIF2 binaries for Linux, can be found in the CVS under module name CBFlibHTML or at the download page (Arcib Laboratory, 2006d).
For more information, readers are invited to send e-mail to yaya@bernstein-plus-sons.com or terahz@geodar.com.
APPENDIX A
Glossary
The following is a glossary of terms used in this paper.
Acknowledgements
This work was supported in part by the International Union of Crystallography, the US National Science Foundation and the US Department of Energy.
References
Arcib Laboratory (2006a). GForge CBFlib Project, https://blondie.dowling.edu/projects/cbflib/.
Arcib Laboratory (2006b). CBFlib Releases Webpage, https://blondie.dowling.edu/frs/?group_id=10.
Arcib Laboratory (2006c). CBFlib CVS Repository, https://blondie.dowling.edu/scm/?group_id=10.
Arcib Laboratory (2006d). CIF Validation Webpage Download, https://www.vcif.org/get.html.
Bernstein, H. J. & Hammersley, A. P. (2005). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, pp. 37–43. Heidelberg: Springer.
Bray, T., Paoli, J. & Sperberg-McQueen, C. M. (2004). World Wide Web Consortium, February issue, https://www.w3.org/TR/2004/REC-xml-20040204.
Ellis, P. J. & Bernstein, H. J. (2001). CBFlib: An API for CBF/imgCIF Crystallographic Binary Files with ASCII Support, https://www.bernstein-plus-sons.com/software/CBF.
Ellis, P. J. & Bernstein, H. J. (2005). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, pp. 544–556. Heidelberg: Springer.
Fitzgerald, P. M. D., Westbrook, J. D., Bourne, P. E., McMahon, B., Watenpaugh, K. D. & Berman, H. M. (2005). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, pp. 295–443. Heidelberg: Springer.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). Acta Cryst. A47, 655–685.
Hall, S. R., Westbrook, J. D., Spadaccini, N., Brown, I. D., Bernstein, H. J. & McMahon, B. (2005). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, pp. 20–36. Heidelberg: Springer.
Hammersley, A. P., Bernstein, H. J. & Westbrook, J. D. (2005). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, pp. 444–458. Heidelberg: Springer.
McMahon, B. (1998). VCIF: a utility to validate the syntax of a crystallographic information file, https://www.iucr.org/iucr-top/cif/software/vcif/index.html.
McMahon, B. (2005). International Tables for Crystallography, Vol. G, Definition and Exchange of Crystallographic Data, pp. 499–525. Heidelberg: Springer.
Todorov, G. (2006). CIF Validation Webpage, https://www.vcif.org.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.