The structure of the first representative of Pfam family PF09836 reveals a two-domain organization and suggests involvement in transcriptional regulation

The crystal structure of the NGO1945 gene product from N. gonorrhoeae (UniProt Q5F5I0) reveals that the N-terminal domain, annotated as a domain of unknown function (DUF2063), is likely to bind DNA and that the protein may be involved in transcriptional regulation.


Introduction
NGO1945 from Neisseria gonorrhoeae FA 1090 is a protein of unknown function with a molecular weight of 28.6 kDa (residues 1-248) and a calculated isoelectric point of 4.65. It contains an N-terminal DUF2063 domain (residues 10-92) that has recently been classified as a new Pfam family, PF09836 (Finn et al., 2008), which currently comprises 216 proteins from 173 species (170 from proteobacteria and one each from acidobacteria, bacteroidetes and planctomycetes). Pfam also indicates that this domain is observed in 68 sequences from NCBI's METASEQ metagenomics data set. Of these 216 proteins, 215 have a single N-terminal DUF2063 domain and, in one instance, this domain is found with a DUF692 domain (PF05114; TIM barrel superfamily, Pfam clan CL0152, which includes endonuclease IV), in which the DUF692 domain is at the N-terminus. Bioinformatics analysis (Altschul et al., 1997; Marchler-Bauer et al., 2007; Jaroszewski et al., 2005; Tatusov et al., 2000) revealed that of the significant sequence homologs of NGO1945, five proteins are from different species of Neisseria (94-99% sequence identity) and 14 (including predicted RNA polymerase sigma factor and repair proteins) are from different strains of Haemophilus (40-90% sequence identity). We have determined the crystal structure of NGO1945 to 2.1 Å resolution in order to expand structural coverage of novel protein sequence space and to assign putative functions to proteins that have not previously been studied and whose functions cannot be inferred from sequence homology. This study allows us to assign a putative molecular function to the NGO1945 protein and to the DUF2063 domain.

Protein production and crystallization
Clones were generated using the Polymerase Incomplete Primer Extension (PIPE) cloning method (Klock et al., 2008). The gene encoding NGO1945 (GenBank YP_208969) was amplified by polymerase chain reaction (PCR) from N. gonorrhoeae FA 1090 genomic DNA using PfuTurbo DNA polymerase (Stratagene) and I-PIPE primers (forward primer, 5′-ctgtacttccagggcATGCAGCCTGAAACCTCCGCCCAATACC-3′; reverse primer, 5′-aattaagtcgcgttaTGCGGATAGATGGTTTTGGCTCGGGG-3′; target sequence in upper case) that included sequences for the predicted 5′ and 3′ ends. The genomic DNA used here contained a single amino-acid substitution, P94L, when compared with the available GenBank sequence from N. gonorrhoeae FA 1090. The expression vector pSpeedET, which encodes an amino-terminal tobacco etch virus (TEV) protease-cleavable expression and purification tag (MGSDKIHHHHHHENLYFQ/G), was PCR-amplified with V-PIPE primers (forward primer, 5′-taacgcgacttaattaactcgtttaaacggtctccagc-3′; reverse primer, 5′-gccctggaagtacaggttttcgtgatgatgatgatgatg-3′). The V-PIPE and I-PIPE PCR products were mixed to anneal the amplified DNA fragments together. Escherichia coli GeneHogs (Invitrogen) competent cells were transformed with the V-PIPE/I-PIPE mixture and dispensed onto selective LB-agar plates. The cloning junctions were confirmed by DNA sequencing. Expression was performed in selenomethionine-containing medium. At the end of fermentation, lysozyme was added to the culture to a final concentration of 250 µg ml⁻¹ and the cells were harvested and frozen. After one freeze-thaw cycle, the cells were homogenized in lysis buffer [50 mM HEPES pH 8.0, 50 mM NaCl, 10 mM imidazole, 1 mM tris(2-carboxyethyl)phosphine-HCl (TCEP)] and the lysate was clarified by centrifugation at 32 500 g for 30 min.
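The annealing step relies on the 5′ extensions of the I-PIPE primers being complementary to the ends of the V-PIPE product. This can be checked directly from the primer sequences quoted above. The following is a minimal sketch, assuming that the lower-case portion of each I-PIPE primer is the vector-overlap extension; the helper names (`reverse_complement`, `split_primer`) are illustrative and not part of any published PIPE tool.

```python
# Sketch: verify that the I-PIPE primer extensions can pair with the
# 5' ends of the V-PIPE primers (sequences as quoted in the text, with
# line-break hyphens removed).

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, preserving case."""
    return seq.translate(COMPLEMENT)[::-1]


def split_primer(primer: str) -> tuple[str, str]:
    """Split a primer into (lower-case extension, upper-case target part)."""
    extension = "".join(ch for ch in primer if ch.islower())
    target = "".join(ch for ch in primer if ch.isupper())
    return extension, target


I_PIPE_FWD = "ctgtacttccagggcATGCAGCCTGAAACCTCCGCCCAATACC"
I_PIPE_REV = "aattaagtcgcgttaTGCGGATAGATGGTTTTGGCTCGGGG"
V_PIPE_FWD = "taacgcgacttaattaactcgtttaaacggtctccagc"
V_PIPE_REV = "gccctggaagtacaggttttcgtgatgatgatgatgatg"

fwd_ext, fwd_target = split_primer(I_PIPE_FWD)
rev_ext, rev_target = split_primer(I_PIPE_REV)

# The insert's forward extension should pair with the start of the
# vector's reverse primer, and vice versa.
fwd_pairs = reverse_complement(fwd_ext) == V_PIPE_REV[: len(fwd_ext)]
rev_pairs = reverse_complement(rev_ext) == V_PIPE_FWD[: len(rev_ext)]

print("forward junction complementary:", fwd_pairs)
print("reverse junction complementary:", rev_pairs)
```

Both 15-nucleotide extensions are exact reverse complements of the corresponding V-PIPE primer ends, which is consistent with the two PCR products annealing without a ligation step.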
The soluble fraction was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with lysis buffer, the resin was washed with wash buffer [50 mM HEPES pH 8.0, 300 mM NaCl, 40 mM imidazole, 10%(v/v) glycerol, 1 mM TCEP] and the protein was eluted with elution buffer [20 mM HEPES pH 8.0, 300 mM imidazole, 10%(v/v) glycerol, 1 mM TCEP]. The eluate was buffer-exchanged with HEPES crystallization buffer (20 mM HEPES pH 8.0, 200 mM NaCl, 40 mM imidazole, 1 mM TCEP) using a PD-10 column (GE Healthcare) and incubated with 1 mg TEV protease per 15 mg eluted protein. The protease-treated eluate was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with HEPES crystallization buffer and the resin was washed with the same buffer. The flowthrough and wash fractions were combined and concentrated to 16.2 mg ml⁻¹ by centrifugal ultrafiltration (Millipore) for crystallization trials. NGO1945 was crystallized using the nanodroplet vapor-diffusion method (Santarsiero et al., 2002) with standard JCSG crystallization protocols (Lesley et al., 2002) by mixing 100 nl protein solution with 100 nl crystallization solution and equilibrating against a 50 µl reservoir volume. The crystallization reagent contained 0.2 M magnesium chloride, 8.2%(v/v) ethanol and 0.1 M imidazole pH 8.36. A rod-shaped crystal of approximate dimensions 20 × 20 × 100 µm was harvested after 34 d at 277 K for data collection. To determine its oligomeric state in solution, NGO1945 was analyzed using a 1 × 30 cm Superdex 200 size-exclusion column (GE Healthcare) coupled with miniDAWN static light-scattering (SEC/SLS) and Optilab differential refractive-index detectors (Wyatt Technology). The mobile phase consisted of 20 mM Tris pH 8.0, 150 mM NaCl and 0.02%(w/v) sodium azide. The molecular weight was calculated using ASTRA 5.1.5 software (Wyatt Technology).

Data collection, structure solution and refinement
No additional cryoprotectant was added to the crystal during data collection. Initial screening for diffraction was carried out using the Stanford Automated Mounting system (SAM; Cohen et al., 2002) at the Stanford Synchrotron Radiation Lightsource (SSRL; Menlo Park, California, USA). The diffraction data were indexed in the monoclinic space group C2. Multi-wavelength anomalous diffraction (MAD) data were collected at SSRL on beamline BL9-2 at wavelengths corresponding to the high-energy remote (λ1) and inflection (λ2) of a selenium MAD experiment. The data sets were collected at 100 K using a MarMosaic 325 CCD detector. The MAD data were integrated and reduced using MOSFLM (Leslie, 1992) and scaled with the program SCALA from the CCP4 suite (Collaborative Computational Project, Number 4, 1994). Phasing was performed with SOLVE (Terwilliger & Berendzen, 1999) and automated model building was performed with ARP/wARP (Perrakis et al., 1999) and RESOLVE (Terwilliger, 2000). Model completion was performed with Coot (Emsley & Cowtan, 2004). Refinement was performed with REFMAC5 (Winn et al., 2003) with one TLS group, using the high-energy remote (λ1) data set and excluding diffraction maxima present in ice rings spanning the resolution ranges 3.97-3.82, 3.72-3.61, 3.50-3.39, 2.69-2.64 and 2.26-2.24 Å and an additional 14 unusually strong reflections (near ice rings) that had intensities 15× greater than the average intensity for their resolution bin. The removal of reflections affected by ice rings resulted in a difference in completeness between the λ1 and λ2 data sets. To ensure that the 14 unusually strong reflections were indeed spurious (near ice rings) and did not reflect real intensities from the protein component of the crystal, we calculated structure factors from the final model after refinement, which revealed that these 14 reflections were not unusually strong. 
This analysis confirmed that the 14 reflections in the measured data should be removed from the refinement data set. Crystallographic data and refinement statistics are summarized in Table 1.
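The two exclusion criteria described above (ice-ring resolution ranges and reflections more than 15 times the mean intensity of their resolution bin) can be illustrated with a short sketch. The ice-ring ranges and the factor of 15 are taken from the text. The binning scheme, the function names and the choice to include the outlier itself in the bin mean are assumptions made for illustration only, and this is not the procedure implemented in MOSFLM, SCALA or REFMAC5.

```python
from collections import defaultdict

# Ice-ring resolution ranges in Å, as quoted in the text (high d, low d).
ICE_RINGS = [(3.97, 3.82), (3.72, 3.61), (3.50, 3.39), (2.69, 2.64), (2.26, 2.24)]


def in_ice_ring(d_spacing):
    """True if a reflection's d-spacing falls inside any listed ice ring."""
    return any(low <= d_spacing <= high for high, low in ICE_RINGS)


def bin_index(d_spacing, edges):
    """Index of the resolution bin; edges are d-spacings in descending order.

    Bin i covers edges[i] >= d > edges[i + 1]. Returns None if out of range.
    """
    for i in range(len(edges) - 1):
        if edges[i] >= d_spacing > edges[i + 1]:
            return i
    return None


def filter_reflections(reflections, edges, factor=15.0):
    """Drop ice-ring reflections, then drop per-bin intensity outliers.

    reflections: iterable of (d_spacing, intensity) pairs (hypothetical layout).
    A reflection is rejected if its intensity exceeds `factor` times the mean
    intensity of its bin (the mean here includes the reflection itself).
    """
    kept = [r for r in reflections if not in_ice_ring(r[0])]
    per_bin = defaultdict(list)
    for d, intensity in kept:
        per_bin[bin_index(d, edges)].append(intensity)
    means = {b: sum(v) / len(v) for b, v in per_bin.items()}
    return [(d, i) for d, i in kept if i <= factor * means[bin_index(d, edges)]]


if __name__ == "__main__":
    # Synthetic data: 30 ordinary reflections, one spike and one ice-ring hit.
    data = [(3.0 - 0.001 * k, 10.0) for k in range(30)]
    data += [(2.95, 1000.0), (3.90, 10.0)]
    clean = filter_reflections(data, edges=[50.0, 3.5, 2.5, 2.1])
    print(len(data), "->", len(clean), "reflections kept")
```

In the study itself the flagged reflections were additionally cross-checked against structure factors calculated from the refined model, a step that this sketch does not attempt to reproduce.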

Validation and deposition
The quality of the crystal structure was analyzed using the JCSG Quality Control server, which verifies the stereochemical quality of the model using AutoDepInputTool, MolProbity (Davis et al., 2004) and WHATIF 5.0 (Vriend, 1990), the agreement between the atomic model and the data using SFCHECK 4.0 (Vaguine et al., 1999) and RESOLVE (Terwilliger, 2000), the protein sequence using ClustalW (Thompson et al., 1994) and the atom occupancies using MOLEMAN2 (Kleywegt, 2000). It also evaluates the difference in R_cryst/R_free, expected R_free/R_cryst and maximum/minimum B values by parsing the refinement log file and PDB header. Protein quaternary-structure analysis was performed using the PISA server (Krissinel & Henrick, 2005). Fig. 1(b) was adapted from an analysis using PDBsum (Laskowski et al., 2005) and all other figures were prepared with PyMOL (DeLano, 2002). Atomic coordinates and experimental structure factors have been deposited in the PDB under accession code 3dee.
Table 1. Crystallographic data and refinement statistics for NGO1945 (PDB code 3dee). Values in parentheses are for the highest resolution shell. (Diederichs & Karplus, 1997). § Typically, the number of unique reflections used in refinement is slightly less than the total number that were integrated and scaled. Reflections are excluded owing to negative intensities and rounding errors in the resolution limits and unit-cell parameters. In addition, ice-ring regions were excluded prior to integration and 14 reflections with intensity greater than 15 times the average for their shell were omitted prior to refinement. ¶ $R_{\mathrm{cryst}} = \sum_{hkl} \bigl| |F_{\mathrm{obs}}| - |F_{\mathrm{calc}}| \bigr| / \sum_{hkl} |F_{\mathrm{obs}}|$, where F_calc and F_obs are the calculated and observed structure-factor amplitudes, respectively. †† R_free is the same as R_cryst but for 4.9% of the total reflections chosen at random and omitted from refinement. ‡‡ This value represents the total B, which includes TLS and residual B components. §§ Estimated overall coordinate error (Collaborative Computational Project, Number 4, 1994; Cruickshank, 1999).
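The R_cryst and R_free definitions given in the Table 1 footnotes can be written out as a short calculation. The sketch below is illustrative only: the amplitudes are made-up numbers, the function names are hypothetical, and the random split merely mimics the idea of setting aside 4.9% of reflections; it is not how REFMAC5 or the CCP4 suite assign the free set.

```python
import random


def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|) over the given reflections."""
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def split_work_free(n_reflections, free_fraction=0.049, seed=0):
    """Randomly assign reflection indices to a working set and a free set."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_free = round(n_reflections * free_fraction)
    free = sorted(indices[:n_free])
    work = sorted(indices[n_free:])
    return work, free


if __name__ == "__main__":
    # Made-up amplitudes purely to exercise the formula.
    f_obs = [100.0, 200.0, 300.0]
    f_calc = [90.0, 210.0, 300.0]
    print(f"R = {r_factor(f_obs, f_calc):.4f}")

    work, free = split_work_free(1000)
    print(len(work), "working reflections,", len(free), "free reflections")
```

R_cryst would be obtained by applying `r_factor` to the working-set reflections and R_free by applying it to the free set, which is never used during refinement.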

Results and discussion
Overall structure
The crystal structure of NGO1945 was determined to 2.10 Å resolution using the MAD method (Fig. 1). Data-collection, model and refinement statistics are summarized in Table 1. The final model includes one monomer (residues 31-230), two chloride ions, one imidazole molecule and 95 water molecules in the asymmetric unit. No electron density was observed for residues 1-30 and 231-248, even though LC-MS confirmed their presence in the purified protein before crystallization. Thus, they are either disordered in the crystal (note that there is sufficient space in the lattice to accommodate the disordered residues) or the protein may have undergone limited proteolysis in the crystallization drop. The Matthews coefficient (V_M; Matthews, 1968) is ~2.6 Å³ Da⁻¹ and the estimated solvent content is ~53%. The Ramachandran plot produced by MolProbity (Davis et al., 2004) shows that 96.5% of the residues are in favored regions, with no outliers. Crystal-packing analysis predicts that NGO1945 may dimerize via its N-terminal or C-terminal domains [using helices H1, H5 and H7 (Fig. 2a) or helices H8, H9 and H11 (Fig. 2b), respectively] with total buried surface areas of 1550 and 1930 Å² and ΔG_int values of −53.2 and −82.0 kJ mol⁻¹, respectively. Analytical size-exclusion chromatography in combination with static light scattering (SEC/SLS) revealed the oligomeric form in solution to be a dimer, but with medium confidence (~30% was clearly dimeric, ~50% was interchanging between monomer and dimer and ~20% was likely to be monomeric, suggesting that at any given time the dimer:monomer ratio was ~60:40). An imidazole molecule (likely to be from the buffer) is bound on the surface to Glu114, Asp166 and Arg162, but its significance is unknown.
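The relationship between the Matthews coefficient and the solvent content quoted above can be checked with a few lines of arithmetic. The sketch assumes the usual approximation that the protein fraction of the crystal volume is 1.23/V_M (Matthews, 1968). The unit-cell volume used in the example is a made-up value chosen only to give V_M = 2.6 Å³ Da⁻¹ and is not the value from Table 1; the function names are illustrative.

```python
def matthews_coefficient(cell_volume_A3, asu_per_cell, mw_da, mol_per_asu=1):
    """V_M in Å^3 per Da: unit-cell volume divided by total protein mass in the cell."""
    return cell_volume_A3 / (asu_per_cell * mol_per_asu * mw_da)


def solvent_fraction(v_m):
    """Estimated solvent fraction, using protein fraction ~ 1.23 / V_M."""
    return 1.0 - 1.23 / v_m


if __name__ == "__main__":
    # Hypothetical cell volume; space group C2 has four asymmetric units per
    # cell, and the text reports one 28.6 kDa monomer per asymmetric unit.
    v_m = matthews_coefficient(297440.0, asu_per_cell=4, mw_da=28600.0)
    print(f"V_M = {v_m:.2f} A^3/Da")
    print(f"solvent content = {100 * solvent_fraction(v_m):.0f}%")
```

With V_M = 2.6 Å³ Da⁻¹ the estimate is 1 − 1.23/2.6 ≈ 0.53, in agreement with the ~53% solvent content reported in the text.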
A systematic search for other proteins of similar structure was conducted using several different methods, including the DALI server (Holm et al., 2008), the protein structure-comparison service SSM at the European Bioinformatics Institute (http://www.ebi.ac.uk/msd-srv/ssm; Krissinel & Henrick, 2005) and the flexible structure-alignment method FATCAT (Ye & Godzik, 2004). No significant matches of the full protein structure with any other known protein structures were found. However, significant structural similarities are seen with other proteins when queried with individual NGO1945 domains, as discussed below.

N-terminal domain
The ordered region of the NGO1945 N-terminal domain (residues 31-116) includes the DUF2063 domain (Pfam PF09836; residues 31-92) and is similar to many proteins that contain α-helical bundles. Some of the significant hits are with the σ2 domain of RNA polymerase sigma factor SigR from Streptomyces coelicolor [Li et al., 2002; Burgess & Anthony, 2001; PDB code 1h3l; SCOP fold 88945, DALI Z score 4.4 (Z scores above 2.0 are significant), 2.9 Å r.m.s.d. over 61 Cα atoms, 0% sequence identity; Fig. 3a] and the σ3 domain of sigma factor 28 FliA from Aquifex aeolicus (Sorenson et al., 2004; PDB code 1rp3; SCOP fold 46688, Z score 3.9, 3.3 Å r.m.s.d. over 61 Cα atoms, 5% sequence identity). The N-terminal domain is also similar to the ~70-residue SAM (sterile alpha motif) domain (SCOP fold 47768), which contains a helix-hairpin-helix (HhH) motif (Shao & Grishin, 2000) that is found in several hundred proteins that are involved in signal transduction and transcriptional regulation (Grimshaw et al., 2004). For example, the SAM domain is found within the C-terminal domain of the transcription elongation factor NusA (Bonin et al., 2004; PDB code 1u9l; Z score 4.2, 2.5 Å r.m.s.d. over 57 Cα atoms, 11% sequence identity; Fig. 3a), which is involved in interaction with the C-terminal domain (CTD) of the α subunit of RNA polymerase to inhibit RNA binding during transcription termination. It is also seen as the N-terminal domain of STE50, a modulator of mitogen-activated protein kinase signaling in yeast (Grimshaw et al., 2004; PDB code 1uqv; Z score 3.6, 2.80 Å r.m.s.d. over 61 Cα atoms, 10% sequence identity; Fig. 3a). SAM domains are involved in protein-protein interactions, either for self-association or for binding to non-SAM-domain proteins (Peterson et al., 1997). SAM domains are also implicated in RNA binding, as in the case of the positively charged residues in the SAM domain of Smaug (PDB code 1oxj; Z score 2.4, 3.10 Å r.m.s.d. 
over 57 Cα atoms with 14% sequence identity to the N-terminal domain of NGO1945; Green et al., 2003; Kim & Bowie, 2003). The Smaug SAM domain has been defined as a new family of regulators of post-transcriptional control (Aviv et al., 2003). However, none of the functionally important residues in any of the proteins discussed above are conserved in NGO1945. Inspection of the putative N-terminal dimer (Fig. 2a) and electrostatic surfaces (Fig. 4a) supports the possibility of this domain being involved in protein-protein interactions, as well as DNA/RNA binding, via charged residues in a basic patch (separate from the dimerization interface) comprising Arg36, Arg39, Arg48, Arg54, Lys60, Arg64, Lys66, Arg71 and Arg74. The surface-exposed aromatic residues Tyr34, Phe45, Trp62 and Phe80 in the monomer, which are mostly hidden in an N-terminal dimer, could play a role in base-stacking interactions with DNA if the functional form of the protein requires dissociation into monomers, but this is only speculation at this point. Phe13 and Arg18 in the disordered N-terminus may also become ordered on binding to DNA and are conserved in many PF09836 proteins (Pfam alignment), suggesting functional importance.
Figure 2. NGO1945 dimerization. Based on crystal-packing analysis, dimerization may occur via (a) N-terminal domain helices H1, H5 and H7, with a total buried surface area of 1550 Å² (red), or (b) C-terminal domain helices H8, H9 and H11, with a total buried surface area of 1930 Å² (gray) (for clarity, the monomer encircled by a dashed oval is depicted in approximately the same orientation in both panels).

C-terminal domain
A ~20-residue loop (residues 117-137) connects the N-terminal domain to the remainder of the protein and could facilitate movement of the N-terminal and C-terminal domains with respect to each other. Residues 138-174 (Fig. 3b) are similar in structure to the WW domain (a three-stranded β-sheet structure, but without the characteristic tryptophan residues) of human FE65 (Meiyappan et al., 2007; PDB code 2idh; Z score 1.7, 1.98 Å r.m.s.d. over 27 Cα atoms, 7% sequence identity) and to the intensely sweet protein monellin (Ogata et al., 1987; PDB code 3mon; Z score 1.9, 2.5 Å r.m.s.d. over 32 Cα atoms, 3% sequence identity). Some similarity is also found to the N-terminal domain of the ribosomal protein L11 from Thermotoga maritima (Wimberly et al., 1999; PDB code 1mms; Z score 1.6, 3.0 Å r.m.s.d. over 36 Cα atoms, 11% sequence identity). Thus, this region (residues 138-174) with its negatively charged surface (Fig. 4a) may be involved in interactions with a binding partner.

Genome-context analysis
Analysis of the ngo1945 phylogenetic co-occurrence and genomic neighborhood in related species (Jensen et al., 2009)
Figure 4. Electrostatic surface of potential NGO1945 dimers. (a) The N-terminal dimer. The monomer on the left is drawn as an electrostatic surface (± kT/e), which highlights the basic patch (Arg36, Arg39, Arg48, Arg54, Lys60, Arg64, Lys66, Arg71 and Arg74) that may be involved in DNA/RNA-binding interactions. This basic region is distinct from the portion involved in the protein-protein interactions that form the dimer. These residues are conserved in many proteins belonging to DUF2063 (Pfam website alignment), suggesting functional importance. (b) The C-terminal dimer. The monomer on the left is represented as an electrostatic surface, which highlights the surface-exposed Lys209 in NGO1945 corresponding to Lys40 that is implicated in Z-DNA binding in vaccinia virus E3L protein. The other E3L protein residues implicated in Z-DNA binding, Arg41, Asn44, Lys45, Tyr48 and Trp66, are not conserved in NGO1945.
The genomic context of other DUF2063 homologs supports an involvement in virulence. For example, MCA3109 and ABO1516, which are DUF2063 homologs from Methylococcus capsulatus and Alcanivorax borkumensis, respectively, show a predicted functional association with lipoprotein VacJ. The vacJ gene is required for intercellular spreading and virulence in Shigella flexneri and enteroinvasive E. coli (Suzuki et al., 1994). H. influenzae homologs, such as NTHI1444 and HI1599, co-occur with hemoglobin-haptoglobin binding proteins that are virulence determinants (Seale et al., 2006). Similarly, Pseudomonas aeruginosa homologs co-occur with heme-exporter protein D (a cytochrome c-type biogenesis protein) that has been implicated in invasion and virulence in Legionella pneumophila (Polesky et al., 2001), while Burkholderia and Bordetella homologs are present with hemolysin-related and exported proteins.
In conclusion, the NGO1945 crystal structure allows the assignment of a putative function for this protein and for the PF09836 family in general. Structural similarity to transcription factors and the presence of a surface-exposed basic patch in the N-terminal DUF2063 domain indicate the possibility of DNA binding. The nonconservation of DNA-binding residues in NGO1945 compared with structurally similar proteins may give rise to a different mode of DNA binding. The multi-domain architecture, potential DNA binding and genome context of ngo1945 are consistent with a possible role in transcription pathways and indicate that the members of this family may be transcription factors; the genome context further supports involvement in virulence. Alternatively, they may have some novel functionality that remains to be determined. Since significant sequence homologs of NGO1945 are primarily found in Neisseria and Haemophilus, bacterial genera that include human pathogens involved in sexually transmitted diseases, in meningitis and in ear, eye or sinus infections in infants and children, further structure-based biochemical investigation of NGO1945 may be of therapeutic value.