Structural Biology and Crystallization Communications
Ligands in PSI structures

Approximately 65% of PSI structures report some type of ligand bound in the crystal structure. Here, we describe how such ligands are handled and analyzed at the JCSG and survey the types, variety and frequency of ligands observed in PSI structures, including illustrations of how these bound ligands have provided functional clues for the annotation of proteins with little or no previous experimental characterization. Furthermore, a web server was developed as a tool to mine and analyze the PSI structures for bound ligands and other identifying features.


Introduction
International structural genomics initiatives, including the US-based Protein Structure Initiative (PSI; http://www.nigms.nih.gov/Initiatives/PSI/), have led to an unprecedented increase in the rate at which new protein structures are being solved and made available to the scientific community (Levitt, 2007). To date, these efforts have contributed over 7500 protein structures to the Protein Data Bank (PDB; Dutta et al., 2009), more than half of which have come from the PSI. For the most part, the PSI effort has focused on determining unique structures from protein families that previously lacked any structural representative and on providing better structural coverage for large diverse protein families where more structures are needed to accurately model the entire family. Consequently, many of the proteins solved have little or no previous experimental characterization and have been classified as domains of unknown function (DUFs; Bateman et al., 2010) or have only a tentative functional annotation based on amino-acid sequence homology. A variety of online tools and web-based search engines, such as EBI-SSM (Krissinel & Henrick, 2004), DALI (Holm et al., 2008), VAST (Gibrat et al., 1996) and fast-SCOP (Chi et al., 2006), allow the inference of function based on structural similarity. However, these approaches have their limitations.
A significant number of the structures solved by structural genomics efforts can be assigned to superfolds (Orengo et al., 1994), such as TIM-barrel and ferredoxin folds, whose members perform a wide diversity of biological functions. Thus, knowledge of the structure is often not sufficient to deduce the exact cellular function of a protein.
To further aid in functional annotation, additional methods can be explored, such as catalytic residue matching and analysis of protein surface properties, although these methods usually only partially enhance the functional assignment (Binkowski et al., 2005; Laskowski et al., 2005a,b; Porter et al., 2004).
Another challenge faced by large-scale structural biology efforts is to effectively disseminate the structural results to a broad scientific community. Although all of the PSI structures are deposited immediately into the PDB and rapidly released, only a small fraction of them have been described in publications in the scientific literature. Recently, efforts have been made to develop more streamlined web-based tools to rapidly disseminate key findings and new insights derived from these structures, as exemplified by the PSI Knowledgebase (http://kb.psi-structuralgenomics.org) and The Open Protein Structure Annotation Network (TOPSAN; http://www.topsan.org/; Krishna et al., 2010). However, it is clear that complementary user-friendly tools would be extremely beneficial to mine the latest structural data for functional and methodological insights. A rich source of functional data that is often overlooked in the PSI structures is the set of ligands that are identified during interpretation of the electron-density maps and subsequently built into the deposited structures. More than half of the PSI structures (65%) contain bound ligands, such as metal ions, cofactors, substrates and effectors. Many of these ligands are acquired during protein production, whereas the remainder are incorporated into the protein at various steps during the purification and crystallization stages, as, for example, buffer reagents, salts, precipitants and cryoprotectants. In many cases, these non-native ligands act as surrogates for the natural ligands owing to their similar biophysical properties. Their identification can often pinpoint favorable electrostatic regions or 'hot spots' on the protein and these surrogates often mimic the natural ligand-protein interactions, thus providing functional clues and insights.
The Joint Center for Structural Genomics (JCSG; http://www.jcsg.org) has designed the Ligand Search Server to be a fast and intuitive way to mine the PSI structures for detailed information regarding bound ligands. Searches can also be readily generated for entire families or for distinct classes of proteins or ligands, thus furthering collation and analysis of the functional knowledge derived from otherwise diverse sets of structures.

The Ligand Search Server
The JCSG Ligand Search Server (http://smb.slac.stanford.edu/jcsg/Ligand_Search/) was created to mine PSI structures and to identify and classify the different types of bound ligands, whether of functional relevance or not. The server also serves as a portal to complementary sites such as the Protein Data Bank (PDB; http://www.pdb.org), TOPSAN and Pfam (http://pfam.sanger.ac.uk; Finn et al., 2008), which facilitate further exploration. The main user interface provides eight different search fields, including (i) the PDB ligand code, (ii) the PSI target name, (iii) the PDB code, (iv) the Pfam accession, (v) the protein/gene product accession ID, (vi) the structure description, as listed in the title of the PDB header, (vii) the source organism name and (viii) the name of the PSI center. Each of these fields accepts multiple entries that are combined with a logical 'or' and entries in any of the eight search fields are then combined with a logical 'and' to generate the search query. A few search tips and examples are listed alongside the search form on the main page (see Fig. 1).
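The way the search form combines entries (a logical 'or' within a field, a logical 'and' across fields) can be illustrated with a minimal Python sketch. This is not the server's implementation: the record layout, the field names (`ligand`, `description` and so on), the case-insensitive substring matching and the `matches` and `search` helpers are all assumptions made for illustration. The ligand codes CIT and TRS are the standard PDB component identifiers for citrate and Tris, and the descriptions are placeholders rather than the actual PDB titles.

```python
from typing import Dict, List

Record = Dict[str, str]  # hypothetical record layout, one dict per structure


def matches(record: Record, query: Dict[str, List[str]]) -> bool:
    """Return True if the record satisfies every non-empty field of the query.

    Entries within one field are alternatives (logical 'or');
    all constrained fields must be satisfied (logical 'and').
    """
    for field, entries in query.items():
        if not entries:
            continue  # an empty field places no constraint on the search
        value = record.get(field, "").lower()
        # Assumption: an entry matches when it occurs as a substring.
        if not any(entry.lower() in value for entry in entries):
            return False
    return True


def search(records: List[Record], query: Dict[str, List[str]]) -> List[Record]:
    """Return all records that match the query."""
    return [record for record in records if matches(record, query)]


# Toy records; the descriptions are placeholders, not the actual PDB titles.
records = [
    {"pdb_code": "3giw", "ligand": "UNL", "center": "JCSG",
     "description": "protein of unknown function (DUF574)"},
    {"pdb_code": "3gxg", "ligand": "SO4", "center": "JCSG",
     "description": "putative phosphatase"},
    {"pdb_code": "3g68", "ligand": "CIT", "center": "JCSG",
     "description": "placeholder description"},
    {"pdb_code": "3h3l", "ligand": "TRS", "center": "JCSG",
     "description": "placeholder description"},
]

# (ligand is UNL or SO4) and (description contains 'unknown')
hits = search(records, {"ligand": ["UNL", "SO4"], "description": ["unknown"]})
print([record["pdb_code"] for record in hits])
```

With these toy records only 3giw satisfies both fields; 3gxg matches the ligand field but not the description field.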
The 'Search' button submits the query against a locally maintained database which contains information on all of the PSI structures deposited in the PDB. The query results are returned as a single page with a concise tabular report at the top. The report contains a row for every PDB structure that matches the query and lists the protein identifier used by the individual PSI center, the PDB code, the Pfam family name, the gene accession ID, the structure description, the source organism name, the bound ligands, the contributing PSI center and the deposition date. An additional column, 'Xtal ID', is included for JCSG structures and provides a link to specific information on the crystal used for structure solution, including all of the data and log files produced at various stages of structure solution and refinement. Most of the report fields are linked to other web resources to explore the structures further. This tabular report can also be exported to an Excel spreadsheet. Next, a ligand-visualization section provides links to HIC-Up (http://alpha2.bmc.uu.se/hicup/; Kleywegt, 2007) and Ligand Expo (http://ligand-depot.rcsb.org; Feng et al., 2004) for each of the ligands found. Several summary sections follow that include information on the nature of the ligands found, the associated Pfam families and the source organisms. A 'Summary' button is also provided which, if used without any search query, generates an overall statistical report on all of the PSI structures in the database. More concise 'Summary' reports can be produced by including query values in the form fields.
Figure caption: Percentage of PSI structures that have any small-molecule ligand bound to them. The small molecules are categorized by their types.

Treatment of ligands at JCSG
During structure determination at the JCSG, attempts are made to account for all significant electron density observed during refinement. In addition to solvent molecules and chemical reagents used during protein production and crystallization, potential biological molecules, such as enzyme cofactors, substrates, products or their derivatives, which are presumably relevant to the protein's function and clearly supported by the electron density and chemical environment, are modeled into the structures, even if these molecules were not explicitly present in the reagents used during the protein preparation and crystallization stages.
The JCSG routinely uses X-ray fluorescence to identify metals that are bound in the structures. This technique allows the identification of most metals in the sample with a single experimental spectrum. When multiple metals are detected, X-ray diffraction data sets are then collected above and below the relevant X-ray absorption edges of the metals and anomalous difference Fourier maps are calculated in order to unequivocally locate and confirm the identity of the bound metals. Lighter metals, such as Mg and Na, cannot be determined by X-ray fluorescence owing to limitations in our experimental setup; therefore, these are usually identified based on their binding geometries and environment.
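The reasoning behind collecting data above and below an absorption edge can be sketched as follows: the anomalous signal of a metal rises sharply once the X-ray energy crosses its edge, so a site whose anomalous difference peak is strong above a given edge and weak below it is most plausibly occupied by that metal. The Python sketch below is only an illustration of that logic, not the JCSG procedure. The K-edge energies are approximate tabulated values, while the 0.05 keV offset, the peak-height ratio threshold and the function names are assumptions.

```python
# Approximate K absorption-edge energies in keV (standard tabulated values).
K_EDGE_KEV = {
    "Mn": 6.539, "Fe": 7.112, "Co": 7.709,
    "Ni": 8.333, "Cu": 8.979, "Zn": 9.659,
}


def bracketing_energies(metal, offset_kev=0.05):
    """Return (below, above) energies around the K edge of a metal.

    The offset is an assumed value chosen only for illustration.
    """
    edge = K_EDGE_KEV[metal]
    return (round(edge - offset_kev, 3), round(edge + offset_kev, 3))


def assign_metal(peaks, min_ratio=3.0):
    """Pick the metal whose edge crossing gives the largest jump in signal.

    peaks maps a candidate metal to (height_below, height_above), the
    anomalous difference peak heights at one site for the data sets
    collected below and above that metal's edge. Returns None when no
    candidate shows a jump of at least min_ratio (an assumed threshold).
    """
    best, best_ratio = None, min_ratio
    for metal, (below, above) in peaks.items():
        ratio = above / max(below, 1e-6)  # guard against division by zero
        if ratio >= best_ratio:
            best, best_ratio = metal, ratio
    return best


# Toy peak heights (in sigma units) for a single site: the signal jumps
# across the Zn edge but stays flat across the Ni edge.
site = {"Zn": (1.2, 15.0), "Ni": (1.0, 1.1)}
print(bracketing_energies("Zn"), assign_metal(site))
```

In this toy case the site is assigned to Zn, because the peak height changes roughly twelvefold across the Zn edge and hardly at all across the Ni edge.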
Nevertheless, in many cases a suitable ligand cannot be unambiguously assigned to the electron density and the true identity of the ligand is inconclusive without further experimentation. The JCSG has adopted the policy of including these ligands as 'unknown ligands' and they are identified in the PDB file as UNL. The density is modeled by positioning a group of connected atoms that match the overall shape and a relevant description is included in the 'REMARK 3' field of the PDB header. To date, this strategy has, surprisingly, not been widely adopted by other PSI centers, even though it provides extremely valuable information that can be searched by a simple query; as a result, the majority (90%) of these UNL-bound structures have been deposited by the JCSG. Furthermore, all structures, including bound ligands, are internally peer-reviewed by at least one other scientist as a quality-control step prior to deposition in the PDB.
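Because unknown ligands carry the residue name UNL, they can be located in a PDB-format file with a few lines of code. The sketch below assumes the standard fixed-column layout of HETATM records (residue name in columns 18-20, chain identifier in column 22, residue number in columns 23-26). The function names and the toy coordinates are hypothetical, and the `hetatm` helper exists only to build correctly aligned sample lines.

```python
def hetatm(serial, atom, res, chain, seq, x, y, z, element):
    """Build one HETATM line in the fixed-column PDB layout (toy helper)."""
    return ("HETATM%5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f          %2s"
            % (serial, atom, res, chain, seq, x, y, z, 1.00, 20.00, element))


def find_residues(pdb_text, res_name="UNL"):
    """Return sorted (chain, residue number) pairs for a hetero residue name."""
    found = set()
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM") or len(line) < 26:
            continue
        if line[17:20].strip() == res_name:      # columns 18-20: residue name
            chain = line[21]                     # column 22: chain identifier
            res_seq = int(line[22:26])           # columns 23-26: residue number
            found.add((chain, res_seq))
    return sorted(found)


# Toy input with made-up coordinates: two UNL groups and one ethylene glycol.
sample = "\n".join([
    hetatm(1, "O1", "UNL", "A", 301, 10.0, 11.0, 12.0, "O"),
    hetatm(2, "O2", "UNL", "A", 301, 10.5, 11.5, 12.5, "O"),
    hetatm(3, "C1", "EDO", "A", 302, 20.0, 21.0, 22.0, "C"),
    hetatm(4, "O1", "UNL", "B", 301, 30.0, 31.0, 32.0, "O"),
])

print(find_residues(sample))
```

Each UNL group is reported once, however many atoms were used to model its density.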

Overall statistics
A preliminary analysis of the 4200 currently available PSI structures shows that more than 2700 structures (~65%) contain small-molecule ligands of some kind. These ligands can be loosely classified as biological ligands (substrates, products, cofactors, inhibitors and their analogs) or surrogates, as well as peptides, ions, buffer molecules, crystallization reagents and cryoprotectants. This classification scheme is described in more detail in Table 1 and Fig. 1. Most of the functionally relevant biological ligands, including cofactors, were not explicitly added to the crystallization experiments. Hence, these ligands are endogenous to the expression systems and were acquired during protein production.
The overall distribution of the various types of ligands bound to PSI structures is shown in Fig. 2. It is of note that the JCSG reports more ligands in its structures than other PSI centers, particularly for the various ligands used as crystallization agents (buffers, precipitants and cryoprotectants); however, we also report more ligands in other categories. One possible explanation for this increased reporting of ligands comes from the standardized refinement and structure validation procedures implemented at the JCSG, in which specific steps (manual inspection and modeling of appropriate ligands) are undertaken to verify that all unmodeled electron density is properly accounted for. Indeed, a significant number of JCSG structures also contain 'unknown ligands' (UNLs), which refer to bound ligands that could not be unambiguously identified based on the electron density. The majority of these UNLs appear to be of biological importance since they are often located in crevices or cavities that resemble known active-site pockets or are identified based on comparison to structural homologs or other biochemical evidence. A survey of the number of biological ligands bound to PSI structures (Fig. 3) indicates that succinic acid (SIN), thymidine-5'-monophosphate (DT) and palmitic acid (PLM) are the most frequently observed and are likely to originate from the expression system. Similarly, flavin mononucleotide (FMN), nicotinamide adenine dinucleotide (NAD) and flavin adenine dinucleotide (FAD) are the most common cofactors. Magnesium (Mg²⁺), zinc (Zn²⁺) and sodium (Na⁺) are the most common metal ions and sulfate (SO₄²⁻), chloride (Cl⁻) and phosphate (PO₄³⁻) are the most common nonmetal ions that are found in PSI structures. These particular ions are often present in the expression, purification and crystallization solutions, which may account for their frequent observation.
A further analysis of the biological ligands reveals that 25 are unique to PSI structures and have not been observed previously in other structures deposited in the PDB, again indicating the richness and diversity of the information that is being derived from such structure determinations of proteins of unknown function (Table 2).
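The kind of tally behind these statistics can be sketched in a few lines of Python. The mapping below is only a partial, illustrative stand-in for the classification in Table 1: the class labels follow the text above, the ion and glycerol identifiers (MG, ZN, NA, SO4, CL, PO4, GOL) are the standard PDB component codes, which are an addition here, and the `summarize` helper and the toy input are hypothetical.

```python
from collections import Counter

# Partial, illustrative mapping of PDB ligand codes to ligand classes.
LIGAND_CLASS = {
    "SIN": "biological ligand", "PLM": "biological ligand",
    "FMN": "cofactor", "NAD": "cofactor", "FAD": "cofactor",
    "MG": "metal ion", "ZN": "metal ion", "NA": "metal ion",
    "SO4": "non-metal ion", "CL": "non-metal ion", "PO4": "non-metal ion",
    "EDO": "cryoprotectant", "GOL": "cryoprotectant",
    "UNL": "unknown ligand",
}


def summarize(structures):
    """Summarize ligand content for a set of structures.

    structures maps a structure ID to the list of ligand codes it contains.
    Returns the percentage of structures with at least one ligand and a
    Counter giving, per class, the number of structures containing it.
    """
    with_ligand = sum(1 for codes in structures.values() if codes)
    per_class = Counter()
    for codes in structures.values():
        # A set ensures each class is counted once per structure.
        per_class.update({LIGAND_CLASS.get(code, "other") for code in codes})
    percent = 100.0 * with_ligand / len(structures) if structures else 0.0
    return percent, per_class


# Toy input with made-up structure IDs.
toy = {
    "s1": ["ZN", "SO4"],
    "s2": [],
    "s3": ["EDO", "EDO", "UNL"],
    "s4": ["FAD"],
}
percent_with_ligand, structures_per_class = summarize(toy)
print(percent_with_ligand, dict(structures_per_class))
```

Counting structures rather than individual ligand copies keeps a structure with many bound ethylene glycol molecules from dominating the cryoprotectant class.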

Unknown ligands (UNLs)
Examples of some UNL structures are shown in Fig. 4. About 75% of the UNL-bound structures now have some functional annotation and, therefore, biophysical and biochemical experiments can be designed to confirm the identity of the unknown ligands based on the size and shape of the electron density as well as the nature of the environment surrounding the bound ligand. For example, in several instances the UNL resembles benzoic acid or nitrobenzene (PDB codes 2f4p, 2ig6, 2pbl, 3d82, 3ecf, 3ejv and 3ff0). However, these compounds were not modeled as such since neither was present in any of the reagents used nor was there any correlation with the protein function. Uptake of endogenous molecules by proteins during the expression/purification stages is more common than is often appreciated, as exemplified by the occurrence of benzoic acid in 59 other structures in the PDB. However, in other cases, the UNL can provide functional clues about the protein. For instance, protein NP_823353.1 (PDB code 3giw) is annotated as a protein of unknown function (Pfam DUF574) with an unknown ligand bound (http://www.topsan.org/Proteins/JCSG/3giw). The UNL resembles phenylalanine and the protein is structurally similar to SAM-dependent methyltransferases (Martin & McMillan, 2002; Fig. 4a), suggesting the possibility that it could be a phenylethanolamine N-methyltransferase (PNMT; Wong et al., 1992), histamine N-methyltransferase (HNMT; Rutherford et al., 2008) or catechol-O-methyl transferase (COMT; Weinshilboum et al., 1999).
Figure 4. Unknown ligands (UNL) in a few PSI structures. The UNL atoms are represented as red spheres enveloped by electron-density mesh (2Fo − Fc density contoured at 1σ level above the mean) and surrounded by the protein rendered in cartoon representation. In many cases, the ligand could have been assigned as one or a few potential compounds, but is still annotated as a UNL since we have no definitive proof of its identity.
Table 3. Ligands bound to proteins of unknown function, excluding common crystallization reagents and cryoprotectants.
structural communications. Acta Cryst. (2010). F66, 1309-1316

Metals bound to PSI structures
[...] Ni²⁺, 43 Mg²⁺, 41 Ca²⁺, 47 Na⁺, 11 K⁺, three Mn²⁺, two Co²⁺ and one Li⁺). The majority of Fe³⁺ and Zn²⁺ ions in PSI structures have a higher probability of being biologically relevant, since they are less frequently present in the crystallization buffers. For example, only 20% of the structures containing Zn²⁺ ions report a zinc salt in the crystallization conditions. Other metals are potentially less biologically relevant as they are more frequently used during protein purification or crystallization. For PSI structures containing Ca²⁺, Mg²⁺ and Na⁺ ions, the corresponding salts were present in the crystallization conditions in 77, 64 and 61% of cases, respectively.
The identification of a bound metal can often aid in identification of the active site in a protein. For example, the crystal structures of three proteins of unknown function, YP_164873.1 from Silicibacter pomeroyi DSS-3 (PDB code 3chv), YP_556190.1 from Burkholderia xenovorans LB400 (PDB code 3e49) and YP_555544.1 from B. xenovorans LB400 (PDB code 3e02), revealed structural similarity to 3-keto-5-aminohexanoate cleavage protein (YP_293392.1) from Ralstonia eutropha Jmp123 (PDB code 3c6c), although their sequence identity (27-32%) is relatively low. Pairwise structural alignments gave a Cα r.m.s.d. of 1.6 Å for 264 aligned residues between 3chv and 3c6c, a Cα r.m.s.d. of 1.6 Å for 275 aligned residues between 3e49 and 3c6c and a Cα r.m.s.d. of 1.7 Å for 259 aligned residues between 3e02 and 3c6c. All four structures share a conserved Zn²⁺-binding site in which almost all of the active-site residues are identical. Other examples of using structural knowledge about a bound metal to enhance the functional annotation are presented elsewhere in this issue. Bakolitsa and coworkers provide an example of the identification of Zn and Ni bound to the structure of the DUF1470 protein. Axelrod and coworkers provide another good example where binding of Zn²⁺ in the zinc-finger domain combined with structural comparisons suggests that two of the PF02663 Pfam family members in this study may bind nucleic acids and possibly function as transcriptional regulators (Axelrod et al., 2010). These results have revealed functional and structural diversity within the PF02663 family.

6. Functional clues

6.1. Proteins of unknown function
Submitting the query 'Unknown', 'Uncharacterized', 'Hypothetical' or 'DUF' in the Description field of the Ligand Search Server finds 593 PSI structures (~14% of the total) that lack any functional annotation. The vast majority (474 structures) have been assigned to families in Pfam based on their amino-acid sequence.
About 66% of these 600 or so functionally unannotated proteins have one or more bound ligands. A closer examination of those ligands that are most likely to be biologically relevant (excluding crystallization and cryogenic reagents, although in some cases these may also provide clues to function) indicates that the most frequently found are either metal ions (22% of all ligands) or ligands with unknown identity (UNL; 5%), as shown in Table 3. Further analysis is necessary to determine their functional relevance. In a few cases, analysis of these ion-binding sites has already yielded definitive functional insights (see §5).
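The Description-field query and the proportions quoted above can be reproduced with a short sketch. The keyword list comes from the query given in the text, the counts (593, 474 and the total of 4200 PSI structures) are those reported in this article, and the case-insensitive substring test and the function names are assumptions about how such a search might be approximated.

```python
# Keywords from the Description-field query described in the text.
KEYWORDS = ("unknown", "uncharacterized", "hypothetical", "duf")


def lacks_annotation(description):
    """Return True if a structure description contains any query keyword.

    Assumption: matching is a case-insensitive substring test.
    """
    text = description.lower()
    return any(keyword in text for keyword in KEYWORDS)


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(lacks_annotation("Protein of unknown function (DUF574)"))
print(percent(593, 4200))  # unannotated structures among all PSI structures
print(percent(474, 593))   # unannotated structures assigned to a Pfam family
```

The first ratio corresponds to the figure of about 14% quoted above; the second shows that roughly four out of five of these structures have a Pfam assignment.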

PSI contribution to new Pfam families
One of the key goals of the PSI has been to increase the structural coverage of protein family space. Pfam coverage by the current set of PSI structures now extends to 1630 families; for approximately 700 (~43%) of these the PSI has provided the first structural representative. Over 150 of these Pfam families are populated by a single structure. Analysis of the first structural representatives of these 700 families indicates that over 175 of these structures contain some biologically relevant ligands. Of these, Zn²⁺ tops the list as the most frequently observed ligand (about 38 structures), followed by Mg²⁺ in 35 structures, Na⁺ in 23 structures, UNL in 16 structures and Ni²⁺ in 12 structures.

Biological relevance of common molecules bound to proteins
Common reagents used during purification and crystallization, such as SO₄²⁻, Cl⁻ or PO₄³⁻ ions, buffer molecules such as Tris (2-amino-2-hydroxymethyl-propane-1,3-diol) or citrate, and precipitants such as polyethylene glycols, often bind to proteins and are identified during structure refinement. In some cases, these bound reagents improve our understanding of putative binding sites on proteins and help to identify functionally relevant interactions by mimicking substrates. Here, we discuss three such examples (Fig. 5). A SO₄²⁻ ion bound in the active site of YP_001181608.1 (PDB code 3gxg; http://www.topsan.org/Proteins/JCSG/3gxg) mimics a substrate phosphate moiety and lends support to its annotation as a phosphatase. Similarly, a citrate molecule helped to identify the active site in YP_001089791.1 (PDB code 3g68; http://www.topsan.org/Proteins/JCSG/3g68), where comparison of structurally similar proteins with a substrate bound in a similar location to the citrate led to the identification of likely active-site residues. In another example, the buffer molecule Tris is bound in the active site of the protein and emulates a sugar moiety in YP_001304206.1 (PDB code 3h3l; http://www.topsan.org/Proteins/JCSG/3h3l).

Data mining of ligands in crystal structures for improving methodology
In addition to being a rich source of functional clues, the ligands bound to PSI structures can also serve as a source of data to improve crystallographic methods and map interpretation. As an example, we examined the frequency with which various cryoprotectant reagents are observed in crystal structures. We limited our analysis to JCSG structures, since we also had the precise crystallization and cryoprotective conditions used for each structure. Analysis of about 800 structures indicates that the most frequently observed cryoprotectant is ethylene glycol (EDO), with a probability of ~82% of being found in the structure if used in the crystallization/cryoprotective conditions, as shown in Table 4 [...] respectively, of being observed in the structure. A comparative analysis performed with all of the structures in the PDB, although limited because the crystal growth and cryoprotective conditions are often missing from the deposition record, indicates a much smaller frequency of observation of these compounds in crystal structures. For example, of the 888 structures that list ethylene glycol as a crystallization/cryoprotective component in the PDB header, it is observed in only 184 (20.7%) of these structures. Similarly, of the 3079 structures that report the use of glycerol during crystallization, only 723 (23.5%) contain bound glycerol. The high frequency of occurrence of these cryoprotectants in our structures suggests that more care should be taken in general to identify these molecules during model building and refinement if present in the crystallization/cryoprotective conditions and to include cryoprotectants in addition to the crystallization conditions in the PDB header.
Figure 5. Common reagents bound in the active sites of proteins. The protein structures are shown in cartoon representation and colored green or gray. The bound ligands are drawn as sticks and are colored yellow (carbon), red (oxygen) and blue (nitrogen). The interacting residues are also drawn as sticks with their C atoms colored cyan. (a) An SO₄²⁻ ion bound in the active site of protein YP_001181608.1 (PDB code 3gxg). (b) A citrate molecule bound to YP_001089791.1 (PDB code 3g68) helped to identify the potential active site and was supported by substrates (gray) bound to the same location in structurally similar proteins (gray; PDB codes 1mos, 2bpl, 2poc and 2v4m). (c) A Tris molecule bound in the active site of YP_001304206.1 (PDB code 3h3l).
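The quantity compared here is a conditional frequency: among the structures whose conditions list a reagent, the fraction in which that reagent is also modeled as a bound ligand. A minimal Python sketch of this calculation is given below. The record layout, the `observation_rate` helper and the toy records are hypothetical, while the final two lines simply recompute the PDB-wide percentages from the counts quoted in the text.

```python
def observation_rate(structures, reagent):
    """Fraction of structures that model a reagent, given that it was used.

    Each structure is a dict with two sets of ligand codes: 'conditions'
    (reagents listed in the crystallization/cryoprotective conditions) and
    'ligands' (molecules modeled in the structure). Returns None if the
    reagent was not used in any structure.
    """
    used = [s for s in structures if reagent in s["conditions"]]
    if not used:
        return None
    seen = sum(1 for s in used if reagent in s["ligands"])
    return seen / len(used)


# Toy records with made-up content; EDO is ethylene glycol, GOL is glycerol.
toy = [
    {"conditions": {"EDO"}, "ligands": {"EDO", "ZN"}},
    {"conditions": {"EDO"}, "ligands": set()},
    {"conditions": {"EDO", "GOL"}, "ligands": {"EDO"}},
    {"conditions": {"GOL"}, "ligands": set()},
]
print(observation_rate(toy, "EDO"), observation_rate(toy, "GOL"))

# PDB-wide counts quoted in the text, as percentages to one decimal place.
print(round(100.0 * 184 / 888, 1))   # ethylene glycol
print(round(100.0 * 723 / 3079, 1))  # glycerol
```

Restricting the denominator to structures that actually used the reagent is what makes the JCSG and PDB-wide figures comparable.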

Conclusions
We have provided an overview of the various types of ligands bound in PSI structures and have tabulated their relative frequencies. Furthermore, we have described how ligands are identified and modeled into the structures at the JCSG. The sheer number and diversity of ligands found in JCSG structures, based on a rigorous and systematic interpretation of the electron-density maps, suggest that for many structures in the PDB, ligands may have been overlooked or not adequately characterized. The observation of bound ligands, including unknown ligands and common chemical reagents mimicking potential biological ligands, often enhances the functional annotation of novel, uncharacterized proteins and generates hypotheses which can be validated experimentally. The JCSG Ligand Search Server provides an easy tool to survey the large collection of novel PSI structures for their bound ligands.