International tables for crystallography, Vol. G: Definition and exchange of crystallographic data. Edited by Sydney Hall and Brian McMahon. Dordrecht: Springer, 2005. Pp. xii + 594. Price (hardback) EUR 205.00, USD 220.00, GBP 135.00 for institutions; EUR 102.50, USD 110.00, GBP 67.50 for individuals. ISBN 1-4020-3138-6.
Keywords: book review.
A set of recommendations that facilitates the efficient exchange of data between scientists can only be welcomed, and such a set is described in the new Volume G of International Tables for Crystallography, which covers the definition and exchange of crystallographic data. Definition is a two-edged sword in that it can be restrictive, but in this case, owing to the flexibility of the system described in the book, it can only be regarded in its positive sense in that it leads to clarity. This volume of 32 contributions by 24 authors describes the methods and tools for organizing, archiving and retrieving crystallographic data in a machine-readable form. It also sets the standards that scientists in an increasing number of fields are using to store and retrieve crystal-structure data in their work, be they physicists, chemists, biologists, biochemists, materials and surface scientists, mineralogists or pharmacists. The volume describes the Crystallographic Information File (CIF), the standard data exchange and archival format adopted by the International Union of Crystallography, the dictionaries containing definitions of the data items used in the CIF and the dictionary definition languages that define the dictionaries. The CIF format is currently in use by several international scientific databases (CCDC, PDB, ICDD) and an increasing number of scientific journals around the world are using the format for archiving and disseminating crystallographic data.
My first contact with the CIF concept was back in 1988 when Syd Hall spent a sabbatical in Mülheim. He introduced us to the idea of the self-defining text archival and retrieval (STAR) file. This was a time when many new operating systems were coming onto the market and crystallographers were working on many different computer platforms. At the same time, instruments and computers were becoming more powerful, and data archiving and retrieval were becoming an increasing problem. Two important aspects of the STAR concept were that the data file should have a free format, to avoid the problems associated with a fixed-format file structure, and that it could be understood by both machines and human readers. In the summer of 1988, the Working Party on Crystallographic Information (WPCI) first convened at the ECM11 conference in Vienna. At the meeting, it was decided that a WPCI working group, led by Syd Hall, should investigate the development of a universal file protocol that would be suitable for crystallographic data needs. The working group eventually proposed the CIF format, which had a syntax similar to but simpler than that of the STAR file. Before Syd Hall left Mülheim in 1989, we had submitted our first paper to Acta Crystallographica electronically and it was typeset directly from the CIF. All that had to be sent by post to the journal were the two figures and the signed transfer of copyright agreement.
Later on, the STAR concept was applied to chemistry, in particular to the two-dimensional graphical representation of molecules and chemical information, and the Molecular Information File was born. As a result of this work, it was recognized that the attributes of the data items needed for a particular application could be recorded using the same formalism as the data files themselves. This gave rise to the idea of a dictionary definition language (DDL), which meant that software written to parse data files could also parse the associated dictionaries. A further refinement arose when it came to defining a crystallographic information file format for macromolecular data. It became apparent at an early stage that there was a need to specify relationships between data items describing the different elements of a complex macromolecular structure, and this led to the development of a richer dictionary definition language known as DDL2. In addition, the need to allow the storage of synchrotron data images led to the development of a Crystallographic Binary File. This is not strictly a CIF file in the original sense of the term, since binary data are involved, but it has become an integral part of the CIF. It is now recognized that it can be applied to other fields that use binary image data, including the publication of articles, the creation of web pages and the production of movies. Volume G describes all these developments and presents the tools needed to archive and exchange crystallographic data.
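The point that a dictionary can share the formalism of the data files it describes can be illustrated with a minimal Python sketch. It assumes a much-simplified subset of the syntax (data_ block headers and single-line tag-value pairs only; no loops, multi-line text fields or comments). The function names parse_blocks and invalid_items and both sample texts are hypothetical, and the dictionary fragment is written only in the spirit of a DDL entry rather than quoted from a real dictionary.

```python
import shlex


def parse_blocks(text):
    """Parse a simplified STAR/CIF-like text into {block_name: {tag: value}}.

    Assumption: only 'data_' headers and single-line '_tag value' pairs are
    handled; loops, multi-line text fields and comments are not supported.
    """
    blocks = {}
    current = None
    for line in text.splitlines():
        tokens = shlex.split(line)  # splits on whitespace, honours quotes
        if not tokens:
            continue
        if tokens[0].startswith("data_"):
            current = blocks.setdefault(tokens[0][len("data_"):], {})
        elif tokens[0].startswith("_") and current is not None and len(tokens) == 2:
            current[tokens[0]] = tokens[1]
    return blocks


def invalid_items(data_block, dictionary):
    """Return tags whose values contradict a '_type numb' definition."""
    types = {d["_name"]: d.get("_type") for d in dictionary.values() if "_name" in d}
    bad = []
    for tag, value in data_block.items():
        if types.get(tag) == "numb":
            try:
                float(value)
            except ValueError:
                bad.append(tag)
    return bad


# Hypothetical data file and dictionary fragment, both in the same syntax.
DATA_TEXT = """
data_example
_cell_length_a 5.959
_cell_length_b 14.956
"""

DICT_TEXT = """
data_cell_length_a
_name '_cell_length_a'
_type numb
_units A
"""

data = parse_blocks(DATA_TEXT)        # the data file
dictionary = parse_blocks(DICT_TEXT)  # the dictionary, read by the same parser
problems = invalid_items(data["example"], dictionary)
```

The same parse_blocks call reads both texts; only the interpretation of the resulting tags differs, which is the property the DDL idea relies on. Items without a definition in the dictionary fragment are simply left unchecked in this sketch.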
The volume is divided into five clearly defined parts. Part 1 is an historical introduction to the genesis of the Crystallographic Information File. It is an important chapter, especially for the reader who is not familiar with the CIF concept, because the historical development helps one to understand why the current CIF format is the way it is. The second part details concepts and specifications of the files and languages. It begins with the specification of the STAR file, a data language applicable to all scientific disciplines. This chapter will be of interest to programmers and data managers interested in archiving and retrieval of scientific data. This is followed by chapters giving the specification of the CIF family of files. The third part gives general considerations for defining a CIF data item and covers the classification and use of core data (data items that are of interest to many fields), powder diffraction data, modulated and composite structure data, electron-density data, macromolecular data, image data and symmetry data. Part 4 details the data dictionaries in text format. Currently, there are ten dictionaries. These are the core dictionary, the powder dictionary, the modulated and composite structures dictionary, the electron-density dictionary, the macromolecular dictionary, the image dictionary, the symmetry dictionary, the molecular information dictionary and two additional dictionaries that provide a machine-readable description of the data items in the core and related dictionaries.
Part 5 is an applications section. It covers general considerations in programming CIF applications, STAR utilities, syntactic utilities for CIF, Fortran tools for manipulating CIFs, the use of macromolecular CIF architecture for PDB (Protein Data Bank) data management, an ANSI C library for manipulating image data, and small-molecule crystal structure publication using CIF. The last of these includes in the Appendix a request list for Acta Crystallographica Section C, which gives all the data items that can be displayed in an article, and details of the data-validation tests applied by checkcif, which are based mainly on completeness of a CIF file and self-consistency of individual or closely related data items. This last part will be of interest to the practising crystallographer since it gives information helpful for the preparation, modification and verification of CIFs. The volume is rounded off with a comprehensive CD-ROM which contains STAR and CIF specifications, dictionaries, software libraries, applications and web links, all in machine-readable form.
The concepts described in the volume may seem difficult to understand at first glance, but the volume is so well organized and written that even someone who is not familiar with the philosophy will grasp the idea within a short while of opening the book. The historical development of the dictionaries has led to some inconsistencies in the definition of data items and it is to be hoped that they will be ironed out in due course. How they occurred and what is being done to rectify them are presented clearly and understandably. The editors and authors are to be thoroughly congratulated on providing us with a very workable set of rules that will advance the exchange of data and ideas between crystallographers and other scientists.