Received 27 August 2013
A response to this article has been published.
Timely deposition of macromolecular structures is necessary for peer review
Most of the macromolecular structures in the Protein Data Bank (PDB), which are used daily by thousands of educators and scientists alike, are determined by X-ray crystallography. It was examined whether the crystallographic models and data were deposited to the PDB at the same time as the publications that describe them were submitted for peer review. This condition is necessary to ensure pre-publication validation and the quality of the PDB public archive. It was found that a significant proportion of PDB entries were submitted to the PDB after peer review of the corresponding publication started, and many were only submitted after peer review had ended. It is argued that clear description of journal policies and effective policing are important for pre-publication validation, which is key in ensuring the quality of the PDB and of peer-reviewed literature.
Since the mid-1990s, peer-reviewed journals and the crystallographic community have worked towards the notion that crystallographic models and the associated diffraction data should be submitted to the Protein Data Bank (Baker et al., 1996) and publicly released upon publication (Wlodawer et al., 1998; Editorial, 1998; Baker & Saenger, 1999). This is nowadays the norm, and deviations from that rule are rare. As many as 99.8% of the crystallographic structures submitted to the PDB within 2011-2013 make available both the model and the experimental data. This availability also enables critical re-evaluation of submitted models, based on the original diffraction data but in the light of improved methods and software (Joosten et al., 2009). However, the time frame for data submission has been less well defined: should data be available in one of the wwPDB (Berman et al., 2003) sites before the paper is submitted, before it is accepted for publication, or merely after the paper is accepted, just before publication?
Recently, a Validation Task Force assigned by the PDB has published a recommendation (Read et al., 2011) that the submission of papers that report on crystallographic data should be accompanied by a validation report issued by the PDB. An obvious prerequisite for this is that both the experimental data and the model coordinates are submitted to the PDB before the paper is submitted. Such reports are indispensable tools for technical review of the paper by the assigned referees (Read et al., 2011), and crucial for ensuring that any claims based on the structure are supported by data of appropriate quality.
The original data presented in this paper are available in public databases (PDB and PubMed); a data digest relevant to our conclusions is included as Supplementary Material; and all the code and the database, as well as minimal instructions to reproduce all the results, have been uploaded to GitHub, at the repository https://github.com/massyah/PdbMine.
Briefly, the identifiers of PDB records with an associated `Primary citation' were retrieved from the RCSB webserver on 28 June 2013 at 15:25 GMT+1 (91 738 unique IDs). The corresponding PDB entries were downloaded from the ftp.wwpdb.org FTP server and parsed, and the PDB fields relevant for this study (namely PDB ID, date of deposition and associated PubMed ID) were stored in a SQLite3 database. The PubMed entries of all associated citations were downloaded from the PubMed web server using the EUTILS suite and then parsed and stored in the same SQLite3 database. From the MEDLINE records associated with the PubMed entries, we extracted (if available) the following dates: the received, revised, accepted and ahead of print dates from the publication history (PHST) field; the date of publication (DP); the date created (DA); the PubMed Central release date (PMCR); the date of electronic publication (DEP) and the Entrez Date (EDAT). The `earliest public date' is then defined as the earliest of the PubMed dates, while the `earliest publication date' is defined as the earliest of the DP, EDAT, DA, DEP and the `ahead of print' and `accepted' dates from the PHST. We then considered for this analysis the inner join of the PDB entries table with the PubMed table, where we only kept entries for which (i) the earliest public date was after 1 January 1995; (ii) the published date and accepted date were before 1 January 2014 or available; and (iii) either the publication history was available or the received date was earlier than the accepted or published date; totalling 69 026 unique PDB entries joined with 35 924 unique PubMed entries.
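The two date definitions and the inner join described above can be sketched as follows. This is a minimal illustration only, not the code in the authors' repository: the table names (`pdb_entries`, `pubmed`), the column and dictionary key names, and the demonstration rows are all assumptions made for the example.

```python
import sqlite3
from datetime import date
from typing import Optional, Sequence

# Hypothetical key names for the extracted MEDLINE dates (assumed, not the authors' schema).
PUBLICATION_FIELDS = ("dp", "edat", "da", "dep", "aheadofprint", "accepted")
PUBLIC_FIELDS = PUBLICATION_FIELDS + ("received", "revised", "pmcr")


def earliest(record: dict, fields: Sequence[str]) -> Optional[date]:
    """Return the earliest non-missing date among the given fields, or None."""
    dates = [record[name] for name in fields if record.get(name) is not None]
    return min(dates) if dates else None


def earliest_public_date(record: dict) -> Optional[date]:
    """Earliest of all PubMed dates extracted for a citation."""
    return earliest(record, PUBLIC_FIELDS)


def earliest_publication_date(record: dict) -> Optional[date]:
    """Earliest of DP, EDAT, DA, DEP and the PHST 'ahead of print' and 'accepted' dates."""
    return earliest(record, PUBLICATION_FIELDS)


def join_entries(conn: sqlite3.Connection) -> list:
    """Inner join: keep only PDB entries whose PubMed ID has a stored PubMed record."""
    query = """
        SELECT p.pdb_id, p.deposited, m.pmid
        FROM pdb_entries AS p
        INNER JOIN pubmed AS m ON m.pmid = p.pmid
        ORDER BY p.pdb_id
    """
    return conn.execute(query).fetchall()


def demo_database() -> sqlite3.Connection:
    """Build a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pdb_entries (pdb_id TEXT, deposited TEXT, pmid INTEGER)")
    conn.execute("CREATE TABLE pubmed (pmid INTEGER PRIMARY KEY, received TEXT, accepted TEXT)")
    conn.executemany(
        "INSERT INTO pdb_entries VALUES (?, ?, ?)",
        [("demo1", "2012-01-10", 1), ("demo2", "2012-02-01", 1), ("demo3", "2012-03-01", 99)],
    )
    conn.execute("INSERT INTO pubmed VALUES (1, '2012-01-05', '2012-02-20')")
    return conn
```

In this sketch, `join_entries(demo_database())` drops the made-up entry `demo3`, because its PubMed ID has no matching row in the `pubmed` table. The three date-based filters (i) to (iii) are not reproduced here.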
All entries were considered to be `on time' by default. We defined as `deposited after acceptance' those entries for which the date of deposition with the PDB was more than two days after the `earliest publication date'. We identified as `deposited after submission' those entries that were not `deposited after acceptance' but for which deposition with the PDB was more than two days after the `earliest public date'. The impact-factor estimates used to build Table 1 originate from the Thomson Reuters Journal Citation Reports Science Edition 2011 (http://thomsonreuters.com/journal-citation-reports/).
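The classification rule above can be written compactly. The following is a minimal sketch under the stated definitions; the function name `classify`, its argument names and the handling of missing dates (treated as `on time') are assumptions for illustration and are not taken from the authors' code.

```python
from datetime import date, timedelta
from typing import Optional

# "More than two days after" is read here as a strict inequality on a two-day margin.
GRACE = timedelta(days=2)


def classify(
    deposited: date,
    earliest_public: Optional[date],
    earliest_publication: Optional[date],
) -> str:
    """Assign a PDB entry to one of the three categories described in the text."""
    # Checked first: deposition more than two days after the earliest publication date.
    if earliest_publication is not None and deposited > earliest_publication + GRACE:
        return "deposited after acceptance"
    # Otherwise: deposition more than two days after the earliest public date.
    if earliest_public is not None and deposited > earliest_public + GRACE:
        return "deposited after submission"
    # Default category.
    return "on time"
```

For example, with a made-up manuscript received on 1 January 2012 and accepted on 1 February 2012, a deposition on 15 January 2012 would be classified as `deposited after submission', whereas a deposition on 3 January 2012 falls within the two-day margin and remains `on time'.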
The results from the analysis of the PDB deposition date against the submission and acceptance dates were manually curated to select journals with at least 100 publications that referred to PDB entries over the last 12 years, and are presented in Table 1. The number of structures submitted to the PDB only after the paper was accepted for publication has historically been rather low (less than 10% since 1999) and has decreased over the years, to just 3.4% (205 of 6003 papers) in 2012 (Fig. 1). However, the number of structures submitted to the PDB after the paper has been submitted for review is, somewhat surprisingly, high. Although tracing the submission date is not possible for all publications, we were able to extract that information for about 50% of the structures published in 2012, and about one third of them were deposited after the paper was submitted to the journal for peer review. It is also noteworthy that a quarter of the depositions in the window between manuscript submission and manuscript acceptance occurred within the last six days before manuscript acceptance (Supplementary Fig. S1). It is unlikely that referees had access to PDB validation reports in that time window, and more likely that formal acceptance of the manuscript was postponed until the structure was deposited.
Figure 1. Deposition dates of structures during the different editorial phases of the corresponding manuscript. Red columns show the percentage of structures that were deposited after the manuscript was accepted (or after it was published if acceptance dates were not available) and blue columns show the percentage of structures deposited after the manuscript was submitted for review but before it was accepted/published. The lines show the number of manuscripts for which the appropriate editorial history was available for each of these categories. Note that before 2000 insufficient data were available on manuscript submission dates.
Many authors are worried that submission of a structure to the PDB will trigger competitors to accelerate their own paper submission. This is a legitimate concern and, having been at the receiving end of this practice, we can attest that it is not a pleasant experience. However, this concern is ameliorated by an existing submission-time option whereby the sequences corresponding to the submitted structures are not made publicly available before the entry is finally released. The possibility of not directly disclosing the sequence is popular: it is currently used by about two thirds of entries awaiting release. A submission-time option to also withhold the title, currently only possible upon request, would undoubtedly prove equally popular and could help to remove the remaining concerns.
Urban legend has it that high-impact journals are notorious for tolerating late submission as they typically publish `hot' structures, which many research groups are competing to be the first to determine: to paraphrase a well known quotation (Orwell, 1945), all journals are equal, but some journals are more equal than others. Indeed, we find that journals with a high impact factor for which we could trace the full publication history are more likely to tolerate late submission of crystallographic data (Supplementary Fig. S2). Most regrettably, the list does not include important journals like Science, Proc. Natl Acad. Sci. USA and J. Biol. Chem., which do not make the complete publication history available in the PubMed/MEDLINE records. A notable exception to this rule is Acta Crystallographica Section D, which traditionally had a significantly lower impact factor (between 1 and 3) and has only shot to impact-factor prominence over the last couple of years (mainly owing to the publication of highly cited methodological papers). One of the best performing journals in recent years is Proteins, which unsurprisingly has a simple, clear and short policy statement in its instructions for authors: `For all crystallographic studies, coordinates and structure factors should be deposited in the Protein Data Bank at the time of manuscript submission'. This policy, unlike others (a survey of the policies of different journals is available as Supplementary Table S1), is explicit about the timing of deposition. Clarity about policies is crucial, but ensuring that the policies are honored is key.
As we are confident that all journals strive for transparency in the publication procedure and for rigor in the reported results, we strongly advocate that the editorial teams improve the clarity of their policies, and enforce these effectively. The structural biologists, authors and reviewers alike, should also share the responsibility for following these policies. As a community we must strive to ensure that coordinates and experimental data for macromolecular models are submitted to the PDB at the same time as the paper is submitted for review. Only then will validation reports also become available to the referees as part of the necessary material for peer review.
RPJ is supported by a Veni grant 722.011.011 from the Netherlands Organization for Scientific Research (NWO). HS is supported by an ERASysBio+ EU ERA-NET Plus scheme in FP7 (project LymphoSys).
Baker, E. N., Blundell, T. L., Vijayan, M., Dodson, E., Gilliland, G. L. & Sussman, J. L. (1996). Acta Cryst. D52, 609.
Baker, E. N. & Saenger, W. (1999). Acta Cryst. D55, 2-3.
Berman, H., Henrick, K. & Nakamura, H. (2003). Nature Struct. Biol. 10, 980.
Editorial (1998). Nature Struct. Biol. 5, 83-84.
Joosten, R. P., Womack, T., Vriend, G. & Bricogne, G. (2009). Acta Cryst. D65, 176-185.
Orwell, G. (1945). Animal Farm. London: Secker & Warburg.
Read, R. J. et al. (2011). Structure, 19, 1395-1412.
Wlodawer, A., Davies, D., Petsko, G., Rossmann, M., Olson, A. & Sussman, J. L. (1998). Science, 279, 306-307.