Structure of Bacteroides thetaiotaomicron BT2081 at 2.05 Å resolution: the first structural representative of a new protein family that may play a role in carbohydrate metabolism

The crystal structure of BT2081 from B. thetaiotaomicron reveals a two-domain protein with a putative carbohydrate-binding site in the C-terminal domain.

BT2081 from Bacteroides thetaiotaomicron (GenBank accession code NP_810994.1) is a member of a novel protein family consisting of over 160 members, most of which are found in the different classes of Bacteroidetes. Genome-context analysis lends support to the involvement of this family in carbohydrate metabolism, which plays a key role in B. thetaiotaomicron as a predominant bacterial symbiont in the human distal gut microbiome. The crystal structure of BT2081 at 2.05 Å resolution represents the first structure from this new protein family. BT2081 consists of an N-terminal domain, which adopts a -sandwich immunoglobulin-like fold, and a larger C-terminal domain with a -sandwich jelly-roll fold. Structural analyses reveal that both domains are similar to those found in various carbohydrate-active enzymes. The C-terminal -jelly-roll domain contains a potential carbohydrate-binding site that is highly conserved among BT2081 homologs and is situated in the same location as the carbohydrate-binding sites that are found in structurally similar glycoside hydrolases (GHs). However, in BT2081 this site is partially occluded by surrounding loops, which results in a deep solvent-accessible pocket rather than a shallower solvent-exposed cleft.

Introduction
Bacteroides thetaiotaomicron is a Gram-negative anaerobic bacterium that is a dominant member of the human intestinal tract microbiome. This bacterial species is essential for the metabolism and uptake of dietary plant polysaccharides by the human host (Comstock & Coyne, 2003;Xu et al., 2003;Zocco et al., 2007). To utilize these polysaccharides, B. thetaiotaomicron expresses a large number of carbohydrate-processing enzymes. Indeed, it is estimated that $6.6% of the B. thetaiotaomicron genome encodes such proteins, many of which are novel and have not been functionally or structurally characterized (Davies et al., 2005).
The BT2081 gene product of B. thetaiotaomicron encodes a protein with a molecular weight of 37 198 Da (residues 1-341) and a calculated isoelectric point of 4.9. It contains a predicted N-terminal signalling peptide, suggesting that it is secreted from the cytoplasm. Sequence and genomic context analysis suggest that it is a putative carbohydrate-processing protein that is part of the extracellular polysaccharide-processing machinery. Its N-terminal region (residues 1-114) shares high sequence similarity to the carbohydrate-binding domains of endo-1,4--xylanase from Streptomyces griseoflavus (35% sequence identity), -xylosidase from Magnetospirillum magnetotacticum (29% sequence identity) and a putative -fructosidase from Sarcoptes scabiei (39% sequence identity), all of which are glycoside hydrolases (GHs), which catalyze the cleavage of the glycosidic bonds in monosaccharide, disaccharides and polysaccharides into their constituent sugar units. Moreover, genome-context analysis shows that BT2081 and several of its paralogs belong to characteristic PULs (polysaccharide-utilization loci) of B. thetaiotaomicron and are associated with transmembrane porin domains that are homologous to the B. thetaiotaomicron outer membrane protein transporter SusC. SusC is part of the well characterized eight-component sus (starchutilization system) operon used by B. thetaiotaomicron in carbohydrate metabolism. Furthermore, in the sus operon two other BT2081 paralogs (BT0450 and BT1761) are immediately followed by GHs. To further investigate the role that BT2081 may play in carbohydrate metabolism, we have determined its crystal structure using the semiautomated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG; Lesley et al., 2002) as part of the Protein Structure Initiative (PSI) of the National Institute of General Medical Sciences, NIH.

Protein production and crystallization
Clones were generated using the Polymerase Incomplete Primer Extension (PIPE) cloning method (Klock et al., 2008). The gene encoding BT2081 (GenBank NP_810994, Swiss-Prot Q8A605) was amplified by polymerase chain reaction (PCR) from B. thetaiotaomicron VPI-5482 genomic DNA using PfuTurbo DNA polymerase (Stratagene) and I-PIPE (Insert) primers (forward primer, 5 0 -ctgtacttccagggcCGCGAAGAAGCTCCCAATGCAGAGGCAG-3 0 ; reverse primer, 5 0 -aattaagtcgcgttaGTCTTCCGAGCGATAGATT-AGTTCGACT-3 0 ; target sequences in upper case) that included sequences for the predicted 5 0 and 3 0 ends. The expression vector pSpeedET, which encodes an amino-terminal tobacco etch virus (TEV) protease-cleavable expression and purification tag (MGSDK-IHHHHHHENLYFQ/G), was PCR-amplified with V-PIPE (Vector) primers (forward primer, 5 0 -taacgcgacttaattaactcgtttaaacggtctccagc-3 0 ; reverse primer, 5 0 -gccctggaagtacaggttttcgtgatgatgatgatgatg-3 0 ). V-PIPE and I-PIPE PCR products were mixed to anneal the amplified DNA fragments together. Escherichia coli GeneHogs (Invitrogen) competent cells were transformed with the I-PIPE/V-PIPE mixture and dispensed onto selective LB-agar plates. The cloning junctions were confirmed by DNA sequencing. Using the PIPE method, the gene segment encoding Met1-Ile20 was deleted because it was predicted to be a signal peptide. Expression was performed in selenomethionine-containing medium at 310 K with suppression of normal methionine synthesis. At the end of fermentation, lysozyme was added to the culture to a final concentration of 250 mg ml À1 and the cells were harvested and frozen. After one freeze-thaw cycle, the cells were sonicated in lysis buffer [50 mM HEPES pH 8.0, 50 mM NaCl, 10 mM imidazole, 1 mM tris(2-carboxyethyl)phosphine-HCl (TCEP)] and the lysate was clarified by centrifugation at 32 500g for 30 min. The soluble fraction was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with lysis buffer, the resin was washed with wash buffer [50 mM HEPES pH 8.0, 300 mM NaCl, 40 mM imidazole, 10%(v/v) glycerol, 1 mM TCEP] and the protein was eluted with elution buffer [20 mM HEPES pH 8.0, 300 mM imidazole, 10%(v/v) glycerol, 1 mM TCEP]. The eluate was bufferexchanged with TEV buffer (20 mM HEPES pH 8.0, 200 mM NaCl, 40 mM imidazole, 1 mM TCEP) using a PD-10 column (GE Healthcare) and incubated with 1 mg TEV protease per 15 mg eluted protein. The protease-treated eluate was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with HEPES crystallization buffer (20 mM HEPES pH 8.0, 200 mM NaCl, 40 mM imidazole, 1 mM TCEP) and the resin was washed with the same buffer. The flowthrough and wash fractions were combined and concentrated to 13.5 mg ml À1 by centrifugal ultrafiltration (Millipore) for crystallization trials. BT2081 was crystallized by mixing 100 nl protein solution with 100 nl crystallization solution in a sitting drop over a 50 ml reservoir volume using the nanodroplet vapor-diffusion method (Santarsiero et al., 2002) with standard JCSG crystallization protocols (Lesley et al., 2002). The crystallization reagent consisted of 42%(v/v) polyethylene glycol 600, 0.25 M calcium acetate and 0.1 M sodium cacodylate pH 6.33. PEG 400 was added to the crystal as a cryoprotectant to a final concentration of 5%(v/v). A triangular prismshaped crystal of approximately 200 Â 200 Â 200 mm in size was harvested after 43 d at 277 K for data collection. Initial screening for diffraction was carried out using the Stanford Automated Mounting system (SAM; Cohen et al., 2002) at the Stanford Synchrotron Radiation Lightsource (SSRL, Menlo Park, California, USA). The diffraction data were indexed in the trigonal space group P3 2 21. The oligomeric state of BT2081 in solution was determined using a 1 Â 30 cm Superdex 200 size-exclusion column (GE Healthcare). The mobile phase consisted of 20 mM Tris pH 8.0, 150 mM NaCl and 0.02%(w/v) sodium azide. The molecular weight was calculated using ASTRA v.5.1.5 software (Wyatt Technology). Protein concentrations were determined using the Coomassie Plus assay (Pierce).

Data collection, structure solution and refinement
Single-wavelength anomalous diffraction (SAD) data were collected on beamline 11-1 at the SSRL at a wavelength corresponding to the peak ( 1 ) of a selenium SAD experiment. The data set was collected at 100 K using a MAR Mosaic 325 mm CCD detector (Rayonix) using the Blu-Ice data-collection environment (McPhillips et al., 2002). The SAD data were integrated and reduced using MOSFLM (Leslie, 1992)   i jI i ðhklÞ À hIðhklÞij= P hkl P i I i ðhklÞ. ‡ Typically, the number of unique reflections used in refinement is slightly less than the total number that were integrated and scaled. Reflections are excluded owing to systematic absences, negative intensities and rounding errors in the resolution limits and unit-cell parameters. § R cryst = P hkl jF obs j À jF calc j = P hkl jF obs j, where F calc and F obs are the calculated and observed structure-factor amplitudes, respectively. } R free is the same as R cryst but for 5.0% of the total reflections chosen at random and omitted from refinement. † † This value represents the total B that includes TLS and residual B components. ‡ ‡ Estimated overall coordinate error (Collaborative Computational Project, Number 4, 1994;Cruickshank, 1999). anomalous scatterers per asymmetric unit (eight SeMet, two cacodylate ions and one calcium ion); Vonrhein et al., 2007]. Automatic model building was performed with ARP/wARP (Cohen et al., 2004). Model completion was performed with Coot (Emsley & Cowtan, 2004) and TLS refinement with REFMAC5 (Winn et al., 2003). The refinement included experimental phase restraints in the form of Hendrickson-Lattman coefficients and two TLS groups per chain, with the TLS groups being assigned with the aid of the TLSMD server (Painter & Merritt, 2006). Data-collection and refinement statistics are summarized in Table 1. X-ray fluorescence emission peaks for selenium, arsenic, nickel and calcium were observed when the crystal was scanned on SSRL beamline 1-5 with X-rays above (500 eV) the Se K edge. Calcium was assigned based on binding geometry and its presence in the crystallization conditions. Additional diffraction data were collected above and below the arsenic and nickel edges, with the resultant anomalous electron-density maps revealing the presence of arsenic in the form of cacodylate (which was present in the crystallization condition) and the absence of nickel.

Validation and deposition
The quality of the crystal structure was analyzed using the JCSG Quality Control server (http://smb.slac.stanford.edu/jcsg/QC). This server processes the coordinates and data through a variety of validation tools including AutoDepInputTool (Yang et al., 2004), MolProbity (Chen et al., 2010), WHAT IF v.5.0 (Vriend, 1990), RESOLVE (Terwilliger, 2004), MOLEMAN2 (Kleywegt, 2000) as well as several in-house scripts and summarizes the results. Protein quaternary-structure analysis used the PISA server (Krissinel & Henrick, 2007). Fig. 1(b) was adapted from PDBsum (Laskowski, 2009) and all other figures were prepared with PyMOL (DeLano Scientific). Atomic coordinates and experimental structure factors for BT2081 from B. thetaiotaomicron at 2.05 Å resolution were deposited in the PDB (http://www.pdb.org) under code 3hbz.

Overall structure
The crystal structure of BT2081 ( Fig. 1) was determined to 2.05 Å resolution using the SAD method. Data-collection, model and refinement statistics are summarized in Table 1. The final model includes one BT2081 molecule (residues 21-360), two calcium ions, two sodium ions, two cacodylate anions, two acetate anions, 19 polyethylene glycol molecules and 251 water molecules in the asymmetric unit. Gly0, which is part of the expression construct and   (a)] is in accord with PDBsum (http://www.ebi.ac.uk/pdbsum), in which -helices (H1, H3, H5 and H6), 3 10 -helices (H2, H4 and H7) and -strands (1-28) are labeled sequentially, -turns and -turns are designated by Greek letters (, ) and -hairpins by red loops. remained after cleavage of the N-terminal purification tag, is also part of the final model. The nucleotide sequence corresponding to residues 1-20 was omitted from the expression construct as this region was predicted to encode either a lipoprotein signal peptide or, more likely, a single transmembrane helix anchoring BT2081 in the outer membrane of the cell. Electron density was not observed for the C-terminal Asp361 or for some of the side-chain atoms of Arg21, Glu23, Glu139, Lys239, Lys249, Glu299, Lys317 and Glu318. The Matthews coefficient (V M ; Matthews, 1968) for BT2081 is 3.69 Å 3 Da À1 and the estimated solvent content is 67%. The Ramachandran plot produced by MolProbity (Chen et al., 2010) shows that 97.9% and 99.7% of the residues are in the favored and allowed regions, respectively. The single residue in the disallowed region, Ala27, is in a section of poorly defined electron density.

Similarity to other proteins
Carbohydrate-active enzymes are quite modular, often containing various carbohydrate-binding and catalytic domains in different combinations. Glycoside hydrolases (GHs), for example, often contain both catalytic and noncatalytic carbohydrate-binding modules (CBMs) which can be assembled in different orders (Davies et al., 2005;Henrissat & Davies, 1997;Cantarel et al., 2009). CBMs, which can bind a range of different polysaccharides, function to increase the catalytic efficiency of GHs by bringing the catalytic module into closer proximity with its substrate (Bolam et al., 1998;Tomme et al., 1995).
The Ig-like and jelly-roll domains of BT2081 have previously been observed as modules in carbohydrate-active enzymes. CtCel9D-Cel44A, a multi-enzyme GH complex from Clostridium thermocellum, for example, contains polycystic kidney disease (PKD) and CBM family 44 (CBM44) domains (PDB code 2c26; Najmudin et al., 2006) that also adopt Ig-like and -jelly-roll folds, respectively. However, the relative orientation of the two domains differs significantly and as a result the superposition of these two structures using FATCAT (http://fatcat.burnham.org; Ye & Godzik, 2004) required two rotations or twists around the linker region that connects the two domains in order to obtain optimal alignment of the full-length structures. When BT2081 and PKD-CBM44 are structurally aligned in this way, the r.m.s.d. between 202 C atoms is 3.15 Å despite only 6.7% sequence identity (Fig. 2). An important distinction between these two proteins, which highlights the modularity of carbohydrateactive enzymes, is that, in addition to the PKD and CBM44 domains, CtCel9D-Cel44A contains additional domains (Ig-like, CBM30, GH9 and GH44) that are absent from BT2081.
Similarly, some bacterial sialidases also contain both Ig-like and -jelly-roll domains. NedA, a sialidase from Micromonospora viridifaciens (PDB code 1euu; Gaskell et al., 1995), contains these two domains, but also has a third GH family 33 (GH33) catalytic domain at its N-terminus that adopts a six-bladed -propeller fold. The C-terminal -jelly-roll domain of NedA belongs to CBM family 32 (CBM32), while the middle Ig-like domain is thought to act as a linker region between the CBM32 and GH33 domains. Similar to CtCel9D-Cel44A, the relative orientation of the Ig-like and -jellyroll domains differs significantly between BT2081 and NedA and, as a result, four consecutive twists in the region between these domains were required to obtain an optimal full-length alignment of the structures by FATCAT. The resultant r.m.s.d. between 184 C atoms, which share only 4.8% sequence identity, is 3.11 Å .

BT2081 N-terminal domain
A structural similarity search of the N-terminal domain of BT2081 only performed using the FATCAT server, revealed many proteins that belong to the Ig-like fold family. The closest match was to the soluble upper-middle domain (UMD; residues 232-320) of the outer membrane protein usher from Yersinia pestis (PDB code 3fcg; Yu et al., 2009), with an r.m.s.d. of 2.65 Å between 49 C atoms and a sequence identity of 7.6%. The second and third closest structural neighbors were SoxY from Paracoccus pantotrophus (PDB code 2oxg; Sauve et al., 2007), with an r.m.s.d. of 3.08 Å between 64 C atoms and a sequence identity of 5.6%, and the I-set domain (resi- Structural comparison of BT2081 with PKD-CBM44 of C. thermocellum CtCel9D-Cel44A. Superposition of BT2081 (orange/yellow) with PKD-CBM44 (lilac/pale blue; PDB code 2c26), which also contain both the Ig-like and -jelly-roll domains. Calcium ions are represented as spheres. Protein regions which were used for alignment by FATCAT are shown in darker shades (orange for BT2081 and lilac for PKD-CBM44) to highlight similarities in the protein cores. Please note that two twists have been introduced by FATCAT into the structure of PKD-CBM44 in order to obtain optimal full-length alignments of the structures.
2.5%. Although none of the top structural neighbors from this search matched carbohydrate-active enzymes, the Ig-like fold has been observed in a number of these proteins. In addition to CtCel9D-Cel44A and NedA, structures from CBM families 9, 20, 25, 26, 31, 33 and 34 have also been found to adopt Ig-like folds (Hashimoto, 2006). Despite their structural similarity, the BT2081 N-terminal domain shares little sequence similarity with these CBMs (see Table 2a for superposition statistics).
Sequence analysis of the BT2081 N-terminal domain with BLAST revealed significant sequence similarity to portions of several GHs, including endo-1,4--xylanase from St. griseoflavus (35% sequence identity over 90 residues), -xylosidase from M. magnetotacticum (29% sequence identity over 94 residues) and a putative -fructosidase from Sa. scabiei (39% sequence identity over 76 residues). In all of these sequences, the putative Ig-like domain immediately precedes a putative GH43 catalytic domain, which is known to adopt a fivefold -propeller fold. The role of the putative Ig-like domain in these GHs is still currently unknown.
A calcium ion is present in the N-terminal domain on the surface near the start of the domain (Fig. 3). This calcium is octahedrally coordinated by two waters, a cacodylate ion, the carbonyl O atom of Lys103, the carboxylate group of Glu28 and the carboxylate group of Asp162 from a crystallographically related molecule. The role of this calcium (if any) is unclear. CBM9 from Thermotoga maritima xylanase 10A (PDB code 1i82; Notenboom et al., 2001) and the PKD domain of CtCel9D-Cel44A (PDB code 2c26; Najmudin et al., 2006) also contain calcium(s); however, the location of the calcium-binding site is different from that in BT2081. It has been postulated that these calcium ions may play a structural role given the nature of the buried binding sites (Najmudin et al., 2006;Notenboom et al., 2001). Unlike those proteins, however, the calcium in the BT2081 N-terminal domain is quite solvent-exposed and, as three of the six ligands are solvent molecules and another is a crystallographically related molecule, this bound calcium ion is most likely to be an artifact of crystallization.
Analysis of BT2081 and its homologs reveals a highly conserved region of the N-terminal domain that may be functionally important. These conserved residues include Glu23, Ala24, Asn26, Glu28 and  3.05 115 4.6 1.95 Â 10 À3 † The FATCAT P-value measures the probability of obtaining a similar result between two random structures. This P-value is calculated based on empirical fitting of the extreme value distribution to the FATCAT similarity score (Ye & Godzik, 2004). The smaller the P-value, the more statistically significant the similarity between corresponding structures (P-values of <0.05 are considered to be significant).

Figure 3
Molecular surface of the BT2081 N-terminal domain colored according to residue conservation by ConSurf (http://consurf.tau.ac.il; Landau et al., 2005). The most conserved residues are shown in magenta, the least conserved residues are shown in cyan and those with insufficient data to determine the conservation level are shown in yellow. The molecular surface of the C-terminal domain is shown in gray. (a) and (b) are views of opposite surfaces of the N-terminal domain, showing that residue conservation is predominantly on one side. The calcium ion in (b) is represented as a green sphere. Ile31 situated between the N-terminus and -strand 1, as well as residues from two adjacent regions: Phe69, Gly74, Ala75 and Ile77 between -strands 4 and 5, and Val95, Thr96, Asp99 and Trp102 between -strands 6 and 7. Together, these residues form a highly conserved patch on one surface of the N-terminal domain, in contrast to the minimal conservation that is observed on the opposite surface (Fig. 3).
In light of the fact that Ig-like domains in GHs are sometimes CBMs, one possibility is that the N-terminal domain of BT2081 may also be a CBM and, as such, the highly conserved region may be a carbohydrate-binding site. Another possible role for the BT2081 N-terminal domain is as a linker region, similar to the Ig-like domain in NedA. BT2081 is predicted to contain a signal sequence at the N-terminus, which, based on its predominantly hydrophobic aminoacid composition, may correspond to a single transmembrane helix that anchors BT2081 to the bacterial outer membrane. If this is indeed the case, the N-terminal domain may act as a linker region between the membrane surface and the C-terminal domain.
Despite the low sequence identity between the C-terminal domain of BT2081 and its top structural neighbors, they share the same core -jelly-roll topology (Fig. 4). Another shared feature is a conserved calcium-binding site on the convex surface of the -jelly roll (Fig. 5). The role of this conserved calcium is most likely to be structural, as it has previously been shown in the case of CBM4 to contribute greater stability to the protein fold (Abou-Hachem et al., 2002). In BT2081, this calcium is octahedrally coordinated by a water molecule, the carbonyl O atoms of Asn120, Thr174 and Lys176 and the carboxylate O atoms of Glu122 and Asp351. Of this cluster, Glu122 and Asp351 Superposition of the BT2081 C-terminal domain (yellow) with two representative top structural neighbors as assessed by FATCAT: the catalytic module of a Bacillus 1,3-1,4--glucanase (red; PDB code 1byh) and CBM29 of a mannanase from T. polysaccharolyticum (green; PDB code 2zew). (a) View from above the convex surface of the -jelly-roll core, highlighting the conserved calcium ion represented as spheres and circled. (b) View from above the concave surface of the -jelly-roll core, highlighting the extended loop regions of BT2081 (circled) which help to form a pocket on the concave surface of the core.

Figure 5
Coordination of the conserved C-terminal calcium ion of BT2081. Electron density from 2F o À F c (contoured at 2.5 level) and F o À F c (contoured at 3.0 level) maps is represented as green and blue mesh, respectively. A 13 level anomalous signal obtained from data collected below the nickel edge was seen for the calcium and is shown here contoured at the 6.0 level as magenta mesh. Distances between the calcium ion and its ligands are indicated in Å . are strictly conserved, Lys176 is highly conserved and Asn120 and Thr174 are poorly conserved among BT2081 sequence homologs, although the latter three would not necessarily be expected to be conserved in amino-acid type as they only make main-chain contacts to the calcium. Several of the immediate neighboring residues (e.g. Phe133, Trp145, Lys237, Trp300 and Phe331) are also highly conserved. Comparison of this site in BT2081 and its structural neighbors reveals that it is most similar, in terms of residue conservation, to the corresponding sites in members from the CBM11 (PDB code 1v0a) and CBM16 (PDB code 2zew) families.

Putative carbohydrate-binding pocket in the C-terminal domain
The C-terminal domain of BT2081 contains extended solventexposed loop regions (Glu144-Pro165, Thr182-Gly199, Ile203-Ile228, Ala240-Cys258, Thr267-Asn284, Phe307-Lys324 and Ser333-Ser346) that are not found in its closest structural neighbors. Several of these loops also contain additional secondary-structural elements including -strands, -helices and 3 10 -helices. As a consequence of these extended loop regions, a significant distinction between BT2081 and its structural counterparts is the presence of a deep solventaccessible pocket in BT2081 formed by highly conserved residues from the concave surface of the -jelly-roll core and five of the loop regions extending from this core. The concave surface of the -jellyroll core forms the base of this pocket, with loop 1 (residues 144-165) forming the pocket terminus, loop 2 (residues 182-199) and loop 3 (residues 203-228) comprising the middle section of the pocket, and loop 4 (residues 267-284) and loop 5 (residues 333-346) forming the side walls near the pocket entrance.
The pocket, which has a volume of $1160 Å 3 as calculated by PDBsum (Laskowski, 2009), has a slight kink near its entrance and measures $4 Å from the entrance to this bend and $19 Å from there to the bottom of the pocket. The width varies along the length of the pocket, ranging from $10 Å at the entrance to $7.5 Å at the narrowest portion at the bend and widening to $13.5 Å near the pocket terminus (Fig. 7).
Several features of this pocket suggest that it may be a carbohydrate-binding site. First, it is located in a similar location to the carbohydrate-binding sites of structurally similar GH catalytic modules and CBMs. In CBMs that can be classified as 'glycan-chainbinding' or type B CBMs (Boraston et al., 2004), the carbohydratebinding site is a cleft that extends along the concave surface of the -jelly-roll fold. These clefts consist of several subsites to which individual sugar moieties of a polysaccharide chain can bind. The topography of the cleft is defined in a large part by several aromatic residues, which are key determinants of polysaccharide binding specificity. Second, hydrogen bonding between the polysaccharide and polar residues in the binding site is also important for binding affinity in these CBMs (Boraston et al., 2004).
Similar to the carbohydrate-binding sites found in type B CBMs, the putative binding pocket of BT2081 contains several highly conserved aromatic residues arranged in such a way that they form a twisted platform to which a potential polysaccharide could bind. These aromatic residues include Phe202, Phe206, Phe221, Tyr261, Phe280 and Phe340, of which Phe202, Phe206, Tyr261 and Phe340 are strictly conserved among sequence homologs. Moreover, the BT2081 pocket contains a number of polar residues (e.g. Asn149, Thr155, Thr166, Thr185, Asn200, Thr219 and Thr332) whose side chains are positioned for potential hydrogen bonding with a bound Putative carbohydrate-binding pocket as seen from above the concave surface of the jelly-roll fold. An arrow indicates the pocket entrance. The residues that line the pocket are highlighted in stick representation and are color-coded according to type as follows: aromatic, green; hydrophobic, light green; polar, lilac; acidic, red; basic, blue.

Figure 7
Cutaway side view of the putative carbohydrate-binding pocket highlighting the nature of the residues lining the pocket, including several that may play a role in catalysis. The color scheme of the residues is the same as in Fig. 6. polysaccharide, in another characteristic feature which helps to define carbohydrate-binding specificity in type B CBMs.
The evidence from genomic context analysis and the similarity of the fold and putative binding pocket of BT2081 to those of CBMs and catalytic modules of GHs strongly suggest that the pocket in BT2081 may be a carbohydrate-binding site. If so, the next question is whether the C-terminal domain of BT2081 displays GH enzymatic activity or whether it acts instead as a noncatalytic CBM.
3.6. BT2081 as a potential GH The hydrolysis of glycosidic bonds by GHs usually occurs via general acid catalysis involving two GH carboxylate residues that act as a proton donor and a nucleophile/base (Koshland, 1953;Sinnott, 1990). Hydrolysis can occur via two canonical mechanisms, which result in either the inversion or retention of the configuration of the anomeric carbon that undergoes nucleophilic attack.
In Bacillus 1,3-1,4--glucanase (PDB code 1byh), a member of GH family 16 (GHF16) and one of the top structural neighbors of the BT2081 C-terminal domain, Glu105 and Glu109 act as the nucleophile and acid/base in a retaining mechanism for the glycosidic bond cleavage (Keitel et al., 1993). A comparison with the BT2081 pocket reveals that BT2081 lacks these two catalytic glutamate residues at the corresponding locations. However, several other charged residues line the pocket, including Asp259, Arg290, Glu336 and Asp338, all of which are near each another at the pocket entrance (Figs. 5, 6 and 7). Of these residues, Asp259 and Asp338 at the bend of the pocket are particularly interesting candidates for performing catalytic roles because they are positioned opposite each other in the pocket, with their carboxylate O atoms being sufficiently far apart ($7.5 Å ) to Sequence alignment between BT2081 and its top 20 PSI-BLAST homologs (the alignment is only shown for residues 255-361 of BT2081 for clarity). Residues in white characters on a red background are strictly conserved, while those in red characters with a white background are highly conserved. Red arrows indicate the two residues of BT2081 (Asp259 and Asp338) which may play a role in catalysis. Sequences with aspartates at these positions are highlighted in yellow. accommodate a carbohydrate substrate. The feasibility of Asp259 as a potential catalytic residue in a hydrolysis reaction may further be supported by its interaction with the neighboring Arg290, which would be expected to reduce the pK a of Asp259 and thereby prime it for a role as a nucleophile in a retaining mechanism (Fig. 7).
Among the top 20 BT2081 homologs derived from PSI-BLAST Asp259 is almost absolutely conserved (19/20 = 95%), while Asp338 (12/20 = 60%) is less so (Fig. 8). In the eight sequences in which the position equivalent to Asp338 is not an aspartate, seven have an alanine and one has a glycine. The lack of strict conservation at these sites, particularly Asp338, may raise doubts as to the functional role of these residues; however, such an apparent lack of conservation among active-site residues is not unprecedented. In GH family 97 (GH97), for example, its members have evolutionarily diverged into two main subfamilies which differ in their catalytic mechanism, so that the members of one subfamily are retaining GHs while the members of the other are inverting GHs. In addition to these two main subfamilies, six outlier sequences in the GH97 family were found which lacked key catalytic residues, indicating that these members may be inactive or have evolved a different catalytic mechanism (Gloster et al., 2008). It is feasible that a similar evolutionary divergence may also have occurred in the BT2081 protein family.
The fact that the putative carbohydrate-binding site in BT2081 is a pocket rather than a tunnel or a cleft, as seen in other carbohydrateactive enzymes, suggests that BT2081, if it were indeed a GH, would be likely to be classified as an 'exo' GH (i.e. it would cleave the polysaccharide chain at its ends rather than in the middle) and in this regard would be similar to other exopolysaccharidases, such as glucoamylase and -amylase (Davies & Henrissat, 1995).
Manual docking of various polysaccharides (e.g. cellopentaose, mannopentaose and cellohexaose) into the BT2081 pocket reveals that the pocket can accommodate a polysaccharide as large as a pentaose, with three of the sugar units fitting into the section of the pocket extending from the bend to the pocket terminus and the other two sugar units fitting into the portion that extends from the bend to the pocket entrance. Consequently, if BT2081 were indeed proven to be a GH with Asp259 and Asp338 being the catalytic residues, it would be feasible that up to three terminal sugar units of a bound polysaccharide chain could be cleaved off at one time.

Conclusions
BT2081 is the first structural representative of a new protein family which is likely to play a role in carbohydrate metabolism in the distal human gut based on structural and genomic context analyses. BT2081 contains Ig-like and jelly-roll domains, which have been observed to be modules in carbohydrate-active enzymes, such as GHs. The N-terminal Ig-like domain may act as a CBM and/or a linker region between the outer membrane and the C-terminal domain. The C-terminal domain contains a pocket that is similar in many respects to carbohydrate-binding sites in CBMs and catalytic modules of GHs. This domain may be catalytic, with Asp259 and Asp338 acting as the catalytic residues in glycosidic bond cleavage. Further biochemical investigation of BT2081 is needed to ascertain whether this is indeed the case.