An estimated 5% of new protein structures solved today represent a new Pfam family

This study uses the Pfam database to show that the sequence redundancy of protein structures deposited in the PDB is increasing. The possible reasons behind this trend are discussed.

High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquired their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.
Detailed knowledge of the three-dimensional structure of protein chains, of the way in which they assemble into multimeric protein complexes and of their interaction with other molecules (such as DNA, RNA and small ligands) provides us with invaluable information about their molecular function. This information can give important insights into, for example, the effect of mutations such as those caused by some nonsynonymous single-nucleotide polymorphisms or can impact significantly on our ability to design new drugs (Jorgensen, 2009).
The Protein Data Bank (PDB; i.e. the repository of all publicly available protein structures; Rose et al., 2013;Velankar et al., 2012) contains the structures of about 241 000 individual protein chains (as of 17 September 2013). This number can be contrasted with almost 42 million sequences in the UniProt Knowledgebase (UniProtKB, Release 2013_08;The UniProt Consortium, 2013). The gap between the numbers of available structures and available sequences, however, can be considerably reduced by the use of homology modelling. This technique allows us to computationally predict the three-dimensional coordinates of a previously structurally uncharacterized protein using information from the experimental structure of one or more of its homologues (Kiefer et al., 2009;Pieper et al., 2004). While homology models have lower resolution compared with the experimentally determined structures that they are derived from, they are of sufficient quality to be of use during drug discovery (Patny et al., 2006;Cavasotto & Phatak, 2009).
The work presented here belongs to and offers an update to a line of studies that have addressed the question of how close we are to the goal of structurally covering the proteinsequence space; that is, of having the structure of a homologue for each (or, more realistically, most) of the protein sequences available (Chothia, 1992;Yan & Moult, 2005;Chandonia & Brenner, 2006;Levitt, 2007;Nair et al., 2009).

PDB data
We downloaded release dates for PDB structures from the PDB website (http://www.rcsb.org) using the customized table download option. The download was performed on 18 March 2013; however, we excluded any entries released in 2013. Among the resulting entries we considered only those PDB chains that had a 'true' label in the file provided to us by the CATH team (personal communication, 14 March 2013). Chains with a 'true' label are those that CATH considers suitable for reliable domain classification (although not all of them have been classified by CATH as yet), according to the following constraints: (i) the experimental method employed for solving the structure is either NMR or X-ray crystallography with resolution 4.0 Å or a different experimental method with resolution 4.0 Å ; (ii) the fraction of non-C atoms is !0.7; and (iii) the protein-chain sequence length is !40 residues. Additionally, as mentioned above, we only considered structures that had been released by the end of December 2012. This resulted in a total of 196 469 distinct PDB protein chains (belonging to 81 395 structures; see Supplementary Material 1 ). 91.1% of these structures were solved by X-ray crystallography, 8.7% by NMR and 0.2% using other techniques.
Note that over time a certain number of PDB entries are superseded by corrected/improved versions and the new structures are assigned the (release) date on which they replaced the old entries. Other structures are simply withdrawn. This means that the entity and number of released structures per year are likely to change to a certain extent. It also means that the numbers of released structures are to some extent inflated in more recent years, i.e. years for which there has been less time for entries to be replaced or withdrawn.
Information about X-ray structures solved by molecular replacement was downloaded from the PDB website using the customized table download option.

Calculation of yearly Pfam family structural coverage
We used pfam_scan (Finn et al., 2010) with the -clan_overlap option to run all PDB chain sequences that we considered (see x2.1) against all 14 831 Pfam 27.0 profile hidden Markov models (HMMs). The -clan_overlap option was used to ensure that for each PDB chain we retrieved matches to all families, including matches to multiple families within the same Pfam clan, when they occurred. This meant that a single PDB chain could structurally cover more than one Pfam family. The pfam_scan default option for the choice of significance thresholds was used, which applied the Pfamdefined family-specific gathering thresholds (GA) for establishing alignment significance. We used PDB chain sequences as found in the pdb_seqres file (downloaded 18 March 2013). The pfam_scan results together with the release dates of the structures extracted from the PDB (see x2.1) were used to calculate the number of families structurally covered per year (for the period 1976-2012, where 1976 was the first year for which released structures were available in the PDB). For each year, a Pfam family was considered to be structurally covered if at least one protein chain released in that year matched the family profile HMM with a significant alignment score.
Additionally, we calculated the number of newly structurally covered families per year. We started by calculating structurally covered families for 1976 as above. All families assigned to 1976 were removed from the family list. We calculated structurally covered families for 1977 from the remaining families and we then removed them from the list. The same procedure was repeated for all following years until 2012 (inclusive).
As an alternative way to calculate new structural coverage of Pfam families, we used the PDB chain sequence-to-Pfam family mapping provided by the PDBfam website (Xu & Dunbrack, 2012; http://dunbrack2.fccc.edu/ProtCiD/PDBfam/ default.aspx). PDBfam is primarily based on alignments between Pfam family profile HMMs and consensus sequences for the PDB chains that are derived from PSI-BLAST profiles (Altschul et al., 1997). This protocol allows more PDB chains to be mapped to existing families (0.8% of PDB sequences remain unmapped compared with 2.5% when using pfam_scan). Note however that, contrary to what we have implemented with pfam_scan (see above), PDBfam does not allow the same region of a PDB protein sequence to be assigned to more than one family per clan. As a consequence, although more PDB structures are assigned to Pfam families by PDBfam, the number of families that we classify as structurally covered is essentially the same with the two methods (6499 with pfam_scan versus 6496 with PDBfam). families per year was performed as described above, this time using PDBfam to associate PDB chains with Pfam families rather than the pfam_scan output. Note that we discarded any mapping to Pfam-B families present in the downloaded PDBfam file (i.e. only mapping to Pfam-A families was considered).

Clustering of PDB chains that do not match any Pfam family
We clustered PDB chain sequences that had no match to any Pfam family according to the following protocol. (i) We used each of the N unmatched chain sequences as a query in an all-against-all phmmer search. N was equal to 4815 when using pfam_scan to map PDB chains to Pfam families and to 1488 when using the PDBfam mapping. (ii) We stored in an N Â N matrix E values for all sequence pairs for which phmmer returned an alignment with E value 0.001. (iii) We used the matrix calculated in (ii) as input to the clustering program MCL (Enright et al., 2002). Structural coverage of the resulting clusters over the years was calculated in the same way as for Pfam families (see x2.2). Note that a considerable number of PDB sequences carry histidine-rich expression/ purification tags that can cause phmmer to return significant matches between unrelated sequences. This in turn may lead MCL to place these proteins in the same cluster(s). To mitigate this problem, we attempted to remove the most significant part of such tags by monitoring the number of histidines found at the N-and C-termini of the PDB sequences. If we found !5 histidines in the first 12 and/or last 12 positions, we removed the first 12 and/or last 12 amino acids from the sequence.

Running MCL
We ran MCL as follows: (i) mcxload -abc file.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o file.mci -write-tab file.tab, where file.abc was the pairwise E-value matrix calculated from the phmmer runs (see x2.3); and (ii) mcl file.mci -I 2.3 -use-tab file.tab. Note that the --stream-mirror option symmetrizes phmmer E values for (a, b) versus (b, a) chain sequence pairs (the lower E value is selected for both pairs). The choice of the inflation parameter I influences the granularity of the clustering (with higher values of I corresponding to a higher number of clusters). We ran a simple experiment to estimate a reasonable level of granularity. We considered all PDB protein chain sequences released in 2012 that had at least one significant match to a Pfam family according to the pfam_scan assignment (a total of 21 342 sequences). We applied the clustering procedure described above. The only difference was that in order to account for the increased number of phmmer searches being performed, we considered only E values of 10 À4 (this gives a similar number of estimated false positives). We ran MCL using different inflation values and compared the number of clusters we obtained with the number of Pfam families that were matched by these sequences. We found that an inflation value of I = 2.3 gener-ated a number of MCL clusters (2451) that was comparable to the number of Pfam families (2454 using pfam_scan). We hence chose I = 2.3 for clustering all chain sequences with no current match in Pfam, assuming that such a value would generate a reasonable estimate of the number of Pfam families needed to additionally cover them. We used I = 2.3 for clustering unmatched PDB chains both when the PDB-to-Pfam mapping was calculated via pfam_scan and when it was provided by PDBfam.
2.5. Analysis of Pfam families with or without a structural representative in the PDB Transmembrane helices, coiled-coil regions and disordered regions were predicted in the seed alignments of all Pfam families in Pfam 27.0. Coiled-coil regions were predicted using default parameters with ncoils (http://www.russelllab.org/ cgi-bin/coils/coils-svr.pl). Transmembrane regions were predicted using Phobius (Kä ll et al., 2004) with the default options. Disordered regions were predicted using IUPred (Dosztá nyi et al., 2005) with the long option.
Pfam families in Pfam 27.0 were divided into two categories depending on whether or not they contained a structural member. In particular, families with a structural member were those that had a match to a least one PDB sequence according to pfam_scan (as described in x2.2). For families in each of these two categories, we determined the mean size of the families, the proportion of domains of unknown function (DUFs) and the proportion of residues predicted to be coiled coil, transmembrane or disordered in the seed alignments.

Analysis of human Pfam families
We downloaded (19 October 2012) the UniProtKB/ Swiss-Prot-reviewed protein sequences for Homo sapiens (taxonomic identifier 9606; 20 234 sequences) from the UniProtKB website (http://www.uniprot.org/). The human sequences were searched against the Pfam 27.0 profile HMMs using pfam_scan with default parameters. The resulting Pfam families that matched were taken as the set of human Pfam families. Within this set, we calculated the proportion of families that had a structural member, using the pfam_scan results (described in x2.2) to determine which families contained a structural member. Families were predicted to be multispan transmembrane domains if !50% of the seed members contained !2 predicted transmembrane helices.

The number of structures released in the PDB each year continues to increase
In Fig. 1, we show the numbers of structures and chains released in the PDB each year between 1976 and 2012 (for an account of recent PDB data-deposition trends, see Berman et al., 2013). Note that we limited our analysis to a filtered list of PDB chains provided to us by the CATH database (Orengo et al., 1998) team (x2). We found that the number of both yearly released structures and yearly released chains has increased over time. In 2012, 24% more structures and 35% more research papers protein chains were released in the PDB than five years before in 2007; 211% and 242% more, respectively, were released than ten years before in 2002.

Analysis of PDB structures: from individual sequences to families
In order to better understand what the numbers reported in Fig. 1 mean in terms of progress towards more complete structural coverage of the protein sequence space, we considered PDB entries in the context of protein-sequence families (i.e. sets of homologous protein regions) and measured the increase in the number of families that are being structurally covered (i.e. that have at least one member with a known experimental structure). For this purpose we could use, in principle, the structure-based classification systems provided by SCOP (Andreeva et al., 2008) or CATH (Orengo et al., 1998). Using these resources, however, presents two problems. The first is that many of the structures released in recent years have not yet been included in the latest versions of SCOP and CATH (SCOP 1.75 and CATH v.3.5). The second is that by definition these databases only classify proteins for which structures have been solved. This means that they cannot provide us with any information on the number of protein families that are yet to be structurally characterized. To partially overcome these shortcomings, we decided to use the manually curated, mostly sequence-based Pfam database of protein families (Punta et al., 2012). Pfam provides a higher coverage of PDB structures than either CATH or SCOP, and attempts to classify all protein regions, regardless of whether they fall into a family that contains a member whose structure has been characterized.
Each Pfam family has an associated seed alignment of representative protein regions and, built from this, a profile hidden Markov model (HMM) that can be used to search for further homologues in any collection of protein sequences.
Profile HMMs are built using the HMMER3 suite of programs (http://hmmer.janelia.org/; Eddy, 2009). About 31% of Pfam families are grouped into clans (release 27.0); families in the same clan are believed to be evolutionarily related (Finn et al., 2006). Homologous relationships between protein regions found in different families within a clan, however, are generally more remote than observed within a single family. For the purpose of this analysis, clan relationships were ignored.
It is important to note that Pfam families do not always correspond to structural domains. In particular, when sequence conservation extends seamlessly across more than one domain, such domains may end up as part of a single Pfam family. In other cases, the presence of weakly conserved regions within a domain may cause the domain to be split into two different Pfam families. Although structural information is used by Pfam to guide and improve family boundary definitions, this clearly does not apply to Pfam families for which no structural representative is available. Also, some Pfam families by definition do not represent structural domains. These include repeat and motif families (1.4% and 0.5% of the total families, respectively, in Pfam 27.0). The former are repeated structural elements that fold only when they occur in more than one copy along the protein sequence, e.g. the ankyrin (PF00023) and the WD40 (PF00400) repeats, while the latter are short conserved sequence motifs such as the AThook (PF02178) and the helix-hairpin-helix (PF00633) motifs. Finally, a few Pfam families cover intrinsically disordered protein regions (e.g. the human eukaryotic translation initiation factor 4E-binding protein 1; PF05456). Notwithstanding differences that may exist between Pfam families and structural domains, the Pfam classification provides a well founded means of studying sequence diversity in a protein data set. It was used, for example, as a source of novel domain targets by the Protein Structure Initiative (PSI-2; Chandonia & Brenner, 2005;Dessailly et al., 2009).
Of all protein chains released in the PDB as of 2012, 97.5% can be mapped to at least one Pfam family (using pfam_scan  and Pfam release 27.0; x2). This figure increases to 99.2% when the mapping is performed using profile-profile alignments as in PDBfam (Xu & Dunbrack, 2012;x2). The percentage of PDB residues that are found in a Pfam family is lower (80.6% when using pfam_scan). This is owing to the fact that chains with one or more Pfam matches may still have some regions/domains that are not yet classified in Pfam. Along the same lines, Pfam covers 80% of UniProtKB sequences (i.e. sequences with at least one Pfam match) and 58% of UniProtKB residues.

Latest trend: about five new protein families/clusters are structurally covered every 100 released structures
We ran pfam_scan for all 14 831 profile HMMs from Pfam release 27.0 against the collection of protein chain sequences available in the PDB at the end of 2012 (196 469 after filtering; see x2) and stored the significant Pfam matches for each sequence. We clustered the 2.5% of chains that returned no significant match to any profile HMM (4815 total) using the program MCL (Enright et al., 2002;x2). Clusters obtained using MCL were used to provide a rough estimate of the actual number of families that will need to be built to include these chains into Pfam.
In the histogram in Fig. 2, the bars represent the number of Pfam families and MCL clusters that match structures released each year from 1976 to 2012 (Pfam families and MCL clusters are summed to give a single number for each year). We see that the number of families/clusters for which structures are solved each year seems to have reached a plateau at around 2600-2800. This is despite the growing number of released structures (Fig. 1). In 2007, 6858 structures were released which fell into 2705 Pfam families and MCL clusters; in 2012, 8507 structures (+24%) were released which fell into 2757 families and clusters (+2%). This means that according to this measure the sequence redundancy in the sets of structures that are released every year is increasing.
We now look at the figure that is of most interest to us, the number of families that acquired their first structural representative each year (newly structurally covered families). In Fig. 3(a), we show the number of newly structurally covered families for the period 1976-2012 (dark blue bars). Stacked on top of the families, we show the number of newly covered MCL clusters for the corresponding year (light blue bars). We see that the number of newly covered Pfam families/MCL clusters (i.e. the sum of the newly covered families and clusters) has remained relatively stable over the last five years at around 450. This is considerably fewer than achieved in 2004 and 2007 (605 and 588, respectively), and fewer, on average, than observed in the four years between 2004 and 2007 (an average of 436 versus an average of 547).   Fig. 3(b), we show newly covered protein families and MCL clusters per year when using as an alternative method to assign PDB sequences to Pfam families the mapping provided by the PDBfam database. The trends appear similar to those observed when using the pfam_scan mapping.
Finally, we looked at the ratio between newly covered protein families/clusters and released PDB structures for each year from 1993 to 2012 , where 1993 was the first year in which more than 500 structures were released. In Fig. 3(c), we plot the number of newly covered families/clusters per 100 released structures. We see that over the last 20 years we have gone from about 20 to about five newly covered families/ clusters per 100 released structures. Thus, at the current pace (Figs. 3a and 3b), achieving complete structural coverage for families that are currently in Pfam (release 27.0, 6499 families still to cover) would take an estimated 15-20 years.
6. Families that have no structural representatives include many families that are small, not functionally characterized or enriched in challenging protein regions According to our analysis based on the Pfam classification, protein chains of newly solved structures ever more often fall into families that have a structural representative already (Fig. 3c). In this section, we investigate the reasons behind this trend.
A simple explanation would be that by now most of the sequence space could be structurally covered. A quick look at the list of Pfam families that still lack structural characterization would suggest that this is not the case. Of the 14 831 families that are part of Pfam release 27.0, just 44% have been structurally characterized as of 2012 (Fig. 4a). Also, Pfam is growing. For example, the number of Pfam families increased from 13 672 in Pfam 26.0 (released in November 2011) to 14 831 in Pfam 27.0 (released in March 2013). A more detailed analysis of structurally uncharacterized Pfam families, however, shows that on average they are much smaller in size with respect to their structurally characterized counterparts (average number of 549 versus 3594 members; Fig. 4b). Indeed, at the end of 2012, only 17% of all protein regions classified by Pfam fell into structurally uncharacterized families (Fig. 4c). Thus, whenever targets are not specifically selected based on their sequence novelty, the fact that more and more newly solved structures fall into already characterized families is to be expected, as these families cover most of the sequence space currently classified by Pfam. This also has important consequences for the techniques used to solve new structures. Indeed, the percentage of X-ray structures that are solved each year by molecular replacement, a methodology that uses information derived from homologous proteins of known structure, is now approaching 80% (Fig. 5). It should be noted, however, that when applied to multi-domain proteins or to protein complexes, molecular replacement can also allow the structure of members of new Pfam families, i.e. members of families lacking a structural representative, to be determined.  On the other hand, family size is likely to not be the only reason behind the slow down in the structural characterization of novel Pfam families. We see that families that still lack structural representatives are enriched in domains of unknown function (DUFs; 36% compared with only 10% among families with structural representatives; Fig. 4d). DUFs are families to which Pfam has not yet assigned any functional annotation. At the same time, these families are predicted (x2) to be enriched in coiled-coil, disordered and transmembrane residues (Fig. 4e), which constitute regions that often make protein experimental structural characterization all the more challenging.
Finally, we should not forget that there are proteins that are relevant for increasing structural coverage of the proteinsequence space but that may not be very important/interesting in terms of their biology or their medical benefit. On the contrary, a number of already structurally characterized families and proteins are of special (e.g. biomedical) interest and are being studied in great detail by structural biologists. Some of these protein families, such as G-protein-coupled receptors (Venkatakrishnan et al., 2013), are very functionally diverse and their members may interact with a number of different partners or adopt different conformational states. Structural data can go a long way in helping to characterize this variation if structures are determined for additional members of a family or for the same protein but under different experimental conditions. In Table 1, we report the ten most targeted Pfam families for 2012. We can see that during 2012 hundreds of structures were solved of protein kinases, immunoglobulins and trypsins.

Conclusions and (at least) one good reason not to give up on increasing the structural coverage of Pfam families
We have pointed to many valid reasons why structural biologists focus their efforts on trying to better characterize families and proteins for which at least some structural information is already available. We have also shown why a number of currently uncovered families appear to be 'unpalatable' for structure determination. While these considerations are important, we believe that there are still a considerable number of structurally uncharacterized families that are worth pursuing. Efforts have been made by some groups to identify such targets. One example is the JCSG structural genomics consortium, which is targeting structurally uncharacterized protein families that are overrepresented in metagenomic data (Ellrott et al., 2010) and, more specifically, families in the human gut microbiome (some of these currently have very few members in UniProtKB). Also, a list of uncharacterized membrane proteins has recently been brought to the attention of researchers (Pieper et al., 2013). Here, we add a brief perspective on structural coverage of the human proteome. For 90% of human sequences we have at least one match to a Pfam family (Pfam release 27.0). At the residue level, 45% of human residues fall into a Pfam region (Mistry et al., 2013). Of all Pfam families with a match in the human proteome (5494 in total), 44% have no structural representative. If we remove families that are DUFs and families where the seed alignment contains !25% of residues that are predicted to be disordered, we are left with 1238 families that do not have a structural representative, of which 1003 are not predicted to be multispan transmembrane domains (x2). This is an example of a set of human families for which some functional information is in most cases already available (81% of them have at least one literature reference in Pfam) and for which solving the structure of a first member, if by no means easy (see also Bray, 2012), may be within reach of current structural determination techniques. A concerted effort to solve families such as these could yield a structural representative for most of them within the next few years.  Table 1 The ten families with the highest number of structures released in 2012 (families from Pfam release 27.0, matches according to pfam_scan; see x2).
Note that if a structure had multiple chains that matched the same family then all of these matching chains counted as one structure in the last column of the  Proportion (%) of X-ray structures that have been solved using molecular replacement each year from 1976 to 2012.