The hypothetical periplasmic protein PA1624 from Pseudomonas aeruginosa folds into a unique two-domain structure

Crystal structure analysis of the hypothetical protein PA1624 from P. aeruginosa reveals a novel two-domain protein architecture that is only distantly reminiscent of previously characterized structural domains.


Introduction
As of November 2020, the Protein Data Bank (PDB; Berman et al., 2000) contains more than 170 000 structural entries of biological macromolecules, of which more than 90% have been determined by X-ray crystallography. However, most of the newly deposited entries comprise folds that have already been observed in other, homologous structures. This is reflected in the notion that most of the new structures determined by X-ray crystallography are solved by molecular replacement (Long et al., 2008) and also in the fact that the number of unique protein folds has not significantly increased over the last 15 years (Liu et al., 2004). This suggests that the structural universe is much smaller than the sequence universe (Chothia, 1992; Levitt, 2009). Completing the catalog of protein folds invented by nature is a prerequisite for unveiling and comprehending the rules governing protein evolution, understanding the relationship between protein structure and function, and advancing de novo protein design. New folds may not be expected in well characterized genetic landscapes but are more likely to be found within uncharacterized gene products. Therefore, incompletely characterized genomes offer a comparatively higher chance of identifying novel and probably therapeutically interesting protein structures. One of these incompletely understood organisms is the human pathogen Pseudomonas aeruginosa. This Gram-negative bacterium is ubiquitous in nature (Green et al., 1974) and can colonize a variety of different host organisms ranging from insects and animals to plants and mammals (D'Argenio et al., 2001; Mahajan-Miklos et al., 1999; Walker et al., 2004). Its versatile metabolism provides a prominent evolutionary advantage, enabling P. aeruginosa to inhabit niches that are harmful or toxic to others (Tsuji et al., 1982; Tümmler et al., 2014).
This makes the bacterium a severe threat to immunocompromised individuals such as AIDS patients or persons suffering from neutropenia and cystic fibrosis (CF) (Aloush et al., 2006; Hidron et al., 2008; Hogardt & Heesemann, 2013) and establishes it as one of the most prevalent nosocomial pathogens worldwide (Bereket et al., 2012; Santajit & Indrawattana, 2016). In CF, P. aeruginosa evokes chronic lung infections, which are one of the main reasons for reduced life expectancy and a significant determinant of morbidity and mortality in these patients (Kosorok et al., 2001; Li et al., 2005). During later stages of infection, the bacteria can disseminate via the bloodstream and affect any part of the body, making antimicrobial treatment almost impossible (van Delden, 2007; Shorr, 2009). Therefore, it is not surprising that Pseudomonas has been listed amongst the five top pathogens in modern times (Santajit & Indrawattana, 2016).
P. aeruginosa possesses a large genome that contains more than 5500 open reading frames (ORFs) in the case of the well researched strain P. aeruginosa PAO1. However, even though the genome sequence was completed in 2000 (Stover et al., 2000), and despite the existence of a large community-based annotation effort (Winsor et al., 2016), there are still more than 2200 genes predicted by bioinformatics, amounting to 35% of all predicted ORFs using DOOR (Mao et al., 2009), that lack characterization. This uncharted territory is likely to harbor potential drug targets, and it is expected that amongst these uncharacterized genes those that encode proteins with nonpredictable folds will be highly attractive for drug development because it is less probable that they will have an overlapping function with proteins of the host organism.
Here, we describe the X-ray crystallographic structural characterization of one such gene product with unknown function and novel structure, namely the hypothetical protein PA1624 from P. aeruginosa PAO1.

Macromolecule production
The coding region of PA1624 lacking the first 18 amino acids, representing the periplasmic localization signal, was PCR-amplified from P. aeruginosa PAO1 genomic DNA using the appropriate DNA primer set for cloning into p10$, which generates a rhinovirus 3C protease-cleavable N-terminally tagged His6-T7-lysozyme fusion construct, p10$_Δ18PA1624 (Table 1; Bock et al., 2017). The amino-acid sequence of the entire construct is mghhhhhhaenlyfqghTARVQFKQRESTDAIFVHCSATKPSQNVGVREIRQWHKEQGWLDVGYHFIIKRDGTVEAGRDEMAVGSHAKGYNHNSIGVCLVGGIDDKGKFDANFTPAQMQSLRSLLVTLLAKYEGAVLRAHHEVAPKACPSFDLKRWWEKNELVTSDRGHTlevlfq|gphMADLPGSHDLDILPRFPRAEIVDFRQAPSEERIYPLGAISRISGRLRMEGEVRAEGELTALTYRLPPEHSSQEAFAAARTALLKADATPLFWCERRDCGSSSLLANAVFGNAKLYGPDEQQAYLLVRLAAPQENSLVAVYSITRGNRRAYLQAEELKADAPLAELLPSPATLLRLLKANGELTLSHVPAEPAGSWLELLVRTLRLDTGVRVELSGKHAQEWRDALRGQGVLNSRMELGQSEVEGLHLNWLR, with lower case letters indicating the His6 tag and TEV cleavage site, italic letters indicating the T7-lysozyme moiety and bold letters indicating the Δ18PA1624 part. The symbol | denotes the cleavage site of rhinovirus 3C protease. The plasmid is available upon request.
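The tag-removal step implied by this construct design can be illustrated with a short Python sketch. It is not part of the original protocol: the function name `cleave_3c` and the toy sequence are hypothetical, and the sketch assumes the canonical rhinovirus 3C recognition motif LEVLFQ|GP with cleavage between Q and G, as marked by `|` in the construct sequence.

```python
def cleave_3c(sequence, motif="LEVLFQGP", cut_offset=6):
    """Split a protein sequence at every 3C site (after the Q of LEVLFQ|GP).

    Hypothetical helper for illustration; case is ignored so that the
    lower-case tag residues used in the construct listing are handled.
    """
    seq = sequence.upper()
    fragments = []
    start = 0
    pos = seq.find(motif)
    while pos != -1:
        cut = pos + cut_offset          # position directly after ...LEVLFQ
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(motif, cut)
    fragments.append(seq[start:])       # remainder after the last site
    return fragments


# Toy example (shortened, hypothetical sequence; not the full construct)
toy_construct = "mghhhhhhTAGlevlfqgphMADLPG"
tag_part, target_part = cleave_3c(toy_construct)
print(tag_part)     # fragment retained on the nickel column
print(target_part)  # fragment expected in the flowthrough
```

Applied to the full construct sequence, the second fragment would start with the residues GPH that remain on the target protein after cleavage.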
Plasmid-harboring Escherichia coli BL21(DE3) cells were grown in TB medium in a 2 l fermenter at 37 °C. When an OD600 of 2.8 was reached, the temperature was lowered to 20 °C, 0.5 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) was added and overexpression was carried out for 16 h. The harvested cells were resuspended in buffer A (150 mM NaxHyPO4 pH 8.0, 300 mM NaCl) and lysed. Cell debris and insoluble matter were separated from the soluble fraction before loading onto a precharged nickel HiTrap Chelating HP column equilibrated in buffer A. Nonspecifically bound proteins were removed by washing with 2% buffer B (buffer A with 500 mM imidazole) before a gradient elution to 100% buffer B was performed. 1 mg rhinovirus 3C protease (Cordingley et al., 1990; Stanway et al., 1984) was added to 40 mg of the fusion protein to remove the His6-T7-lysozyme tag during dialysis (10 kDa cutoff membrane) against buffer GF (50 mM HEPES, 150 mM NaCl pH 8.0) at 4 °C overnight. The next day, the protein solution was loaded onto a HiTrap Chelating HP column precharged with nickel to separate noncleaved fusion protein from Δ18PA1624. The concentrated flowthrough (Macrosep Advance, 10 kDa; Pall Corporation) was applied to size-exclusion chromatography using a Superdex 26/600 S75 prep-grade column mounted on an ÄKTA system (GE Healthcare).
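For orientation, the imidazole concentrations corresponding to the wash and elution steps can be worked out with a minimal Python sketch. It assumes ideal linear mixing of buffers A and B and that buffer A contains no imidazole, as its stated composition suggests; the function name `imidazole_mm` is hypothetical.

```python
def imidazole_mm(percent_b, imidazole_in_b_mm=500.0):
    """Imidazole concentration (mM) at a given percentage of buffer B.

    Assumes ideal linear mixing and an imidazole-free buffer A.
    """
    if not 0.0 <= percent_b <= 100.0:
        raise ValueError("percent_b must be between 0 and 100")
    return imidazole_in_b_mm * percent_b / 100.0


wash = imidazole_mm(2.0)       # wash step at 2% buffer B
elution = imidazole_mm(100.0)  # end point of the gradient
print(f"wash: {wash:.0f} mM, gradient end: {elution:.0f} mM imidazole")
```

Under these assumptions the 2% wash corresponds to about 10 mM imidazole.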
Seleno-L-methionine-labeled (SeMet) protein was expressed in E. coli Rosetta2 pLysS cells harboring p10$_Δ18PA1624. Briefly, a preculture was grown in LB medium supplemented with appropriate antibiotics at 37 °C overnight and harvested the next day. The cells were resuspended in M9 medium, incubated for 1 h and used as an inoculum for the primary culture in prewarmed M9 medium supplemented with selective antibiotics. The cell cultures were incubated at 37 °C and vigorously shaken. When the cell density reached an OD600 of 0.6, an amino-acid mixture inhibiting natural methionine biosynthesis was added (100 mg l⁻¹ lysine, phenylalanine and threonine; 50 mg l⁻¹ isoleucine, leucine and valine) and incubation was continued. The temperature was decreased to 20 °C after 10 min and 0.5 mM IPTG and 60 mg l⁻¹ seleno-L-methionine were added. The cultures were shaken for 10 h. Purification steps were performed as described for the native protein. Seleno-L-methionine incorporation was confirmed by MALDI-MS analysis.

Table 1
Macromolecule-production information.
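The mass shift that such a MALDI-MS check looks for can be estimated with a small Python sketch. It is an illustration rather than the analysis performed here: it assumes standard average atomic masses for sulfur and selenium, and the function name `semet_mass_shift` is hypothetical.

```python
# Standard average atomic masses in Da (assumed values)
MASS_S = 32.06
MASS_SE = 78.971


def semet_mass_shift(n_met, incorporation=1.0):
    """Expected average mass increase (Da) when Met is replaced by SeMet.

    n_met: number of methionine residues in the protein
    incorporation: fraction of methionines carrying selenium (0 to 1)
    """
    if not 0.0 <= incorporation <= 1.0:
        raise ValueError("incorporation must be between 0 and 1")
    return n_met * incorporation * (MASS_SE - MASS_S)


# Two labeled methionines, as stated for this protein, at full incorporation
shift_full = semet_mass_shift(2)
print(f"expected shift: {shift_full:.1f} Da")
```

Each substituted methionine adds roughly 47 Da, so full labeling of two methionines would shift the average mass by about 94 Da.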

Crystallization
Crystallization screening using both the native and the SeMet variant was carried out in 96-well plates. Standard sitting-drop vapor-diffusion experiments were set up at 20 °C employing the commercial screens JCSG Core Suites I-IV (Qiagen). An automated liquid-dispensing robot (Phoenix, Art Robbins Instruments, USA) was employed to mix 0.1 µl concentrated protein solution (12 mg ml⁻¹) with an equal volume of precipitant solution. Initial small plate-shaped crystals were obtained after five days and were refined in a grid screen using a hanging-drop vapor-diffusion setup in 24-well Linbro plates (Table 2). The final mother-liquor composition for the native crystals was 0.2 M sodium acetate, 0.1 M HEPES pH 7.7, 24.5%(w/v) PEG 4000. SeMet protein crystals were obtained using concentrated protein solution (15 mg ml⁻¹) with 0.15 M sodium acetate, 0.1 M HEPES pH 7.1, 23.3%(w/v) PEG 4000. Typical protein crystals grew in thin plates to about 250 × 900 µm within ten days for native and 15 days for SeMet protein (Fig. 1). Harvested crystals were cryoprotected in mother liquor supplemented with 20%(v/v) PEG 400 and then flash-cooled in liquid nitrogen.

Data collection and processing
Diffraction data were collected on beamline BL14.1 at the electron-storage ring operated by the Helmholtz-Zentrum Berlin (Mueller et al., 2015). Data were collected from native crystals using a CCD detector. Data from the derivatized crystal were collected in eight 360° passes. The crystal was translated between passes. All data were indexed and integrated with XDSAPP (Sparta et al., 2016) and scaled with AIMLESS (Evans & Murshudov, 2013

Figure 1
Crystals of native Δ18PA1624 (a) and selenomethionine-derivatized protein (b) could be obtained with slightly different shapes. The native crystals (a) grew as thin fragile plates with sizes of up to 950 × 250 µm. SeMet crystals (b) could be grown to a size of about 450 × 180 µm with a substantial third dimension.
non-optimal crystal-to-detector distance during data collection, suggesting that this crystal may have diffracted to an even higher resolution than the reported 1.96 Å. The calculated Matthews coefficient of 2.27 Å³ Da⁻¹ indicated the presence of two monomers in the asymmetric unit. All relevant data-collection and processing statistics are given in Table 3.
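The reasoning behind the Matthews coefficient can be made explicit with a brief Python sketch. It uses the standard definition V_M = V_cell / (Z × n × M) and the common approximation for the solvent fraction, 1 − 1.23/V_M. The unit-cell volume, number of symmetry operators and molecular mass in the example are placeholder values chosen for illustration, not the values from Table 3.

```python
def matthews_coefficient(cell_volume_a3, n_symmetry_ops, n_copies, mol_mass_da):
    """Matthews coefficient V_M in A^3 per Da.

    cell_volume_a3: unit-cell volume in cubic angstroms
    n_symmetry_ops: number of asymmetric units per unit cell (Z)
    n_copies: assumed number of molecules per asymmetric unit
    mol_mass_da: molecular mass of one molecule in Da
    """
    return cell_volume_a3 / (n_symmetry_ops * n_copies * mol_mass_da)


def solvent_fraction(v_m):
    """Approximate solvent fraction from V_M (protein crystals, 1 - 1.23/V_M)."""
    return 1.0 - 1.23 / v_m


# Solvent content implied by the reported V_M of 2.27 A^3/Da
reported_solvent = solvent_fraction(2.27)
print(f"solvent content: {100 * reported_solvent:.1f}%")

# Placeholder cell (hypothetical numbers): compare one to three copies
for n in (1, 2, 3):
    vm = matthews_coefficient(400_000.0, 4, n, 25_000.0)
    print(n, round(vm, 2), round(solvent_fraction(vm), 2))
```

The number of copies is normally chosen so that V_M and the implied solvent content fall in the range typically observed for protein crystals; the reported value of 2.27 Å³ Da⁻¹ corresponds to a solvent content of roughly 46%.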

Structure solution and refinement
Initial phases were obtained by the single-wavelength anomalous dispersion (SAD) method using the SeMet crystals. Since the protein sequence contains only two selenium-labeled methionine residues (Table 1), highly redundant data were collected. After data reduction and scaling using XDSAPP, structure solution was achieved with SHELX (Sheldrick, 2010). The anomalous signal was extracted using SHELXC, the substructure was determined with SHELXD, and SHELXE was used to carry out the initial model building of a polyalanine chain.
The initial model was manually corrected and adjusted in Coot (Emsley et al., 2010). Automated refinement was carried out with the Phenix application phenix.refine (Afonine et al., 2012; Liebschner et al., 2019). MolProbity (Williams et al., 2018) was used for Ramachandran analysis and evaluation of the model quality. The final model was refined to an Rcryst of 17.5% and an Rfree of 23.9% against the higher resolution (1.96 Å) SeMet data set. The collected diffraction data were processed to 1.96 Å resolution. Despite a rather high signal-to-noise ratio of 2.6 in the outermost resolution bin, data beyond this resolution limit are incomplete owing to a non-optimal crystal-to-detector distance during the experiment. The Ramachandran plot shows all residues to be in the allowed region and 97% to be in the favored region. Atomic coordinates and structure factors have been deposited in the PDB with accession code 6td9. All relevant refinement and validation statistics are shown in Table 4. The secondary-structure elements were defined using DSSP as included in the Phenix suite and PSIPRED (Buchan & Jones, 2019).
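For readers less familiar with the refinement statistics quoted above, the following Python sketch shows the conventional definition of the crystallographic R factor, R = Σ| |Fobs| − k|Fcalc| | / Σ|Fobs|. Rfree is the same statistic evaluated on a test set of reflections excluded from refinement. The amplitudes below are toy numbers and the function name `r_factor` is hypothetical; this is not how phenix.refine is invoked.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """Conventional R factor between observed and calculated amplitudes.

    f_obs, f_calc: sequences of structure-factor amplitudes for the same
    reflections, in the same order; scale is the overall scale factor k.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("amplitude lists must have equal length")
    numerator = sum(abs(abs(fo) - scale * abs(fc))
                    for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    return numerator / denominator


# Toy amplitudes (hypothetical values, for illustration only)
f_obs = [100.0, 50.0, 25.0, 25.0]
f_calc = [90.0, 55.0, 30.0, 20.0]
r_toy = r_factor(f_obs, f_calc)
print(f"R = {100 * r_toy:.1f}%")
```

Applying the same function separately to the working and test reflections would give values analogous to Rcryst and Rfree.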

Results and discussion
Here, we present the crystal structure of PA1624, a 268-amino-acid hypothetical protein from the human opportunistic pathogen P. aeruginosa strain PAO1 that is localized in its periplasm (Fig. 2). The protein was heterologously expressed in E. coli without its periplasmic localization signal (Δ18PA1624). We tested several standard expression plasmids, including, for example, pET-19, pMal and pET-28, but using an N-terminal T7-lysozyme fusion as encoded in our self-designed p10$ plasmid provided the best results with respect to the yield of soluble protein.
PA1624 does not display any detectable sequence homology to previously determined protein structures. Structure prediction using Phyre2 (Kelley et al., 2015) failed to produce a reliable structure model for molecular replacement. We therefore resorted to phasing by the Se-SAD method, allowing us to determine and refine the structure to 1.96 Å resolution with Rcryst = 17.5% and Rfree = 23.9%. The data collected from the native crystal were not further used for refinement and structure analysis as the diffraction data obtained from the SeMet crystals were of higher quality.
The asymmetric unit of the orthorhombic crystal form studied here contained two chains of Δ18PA1624, which superpose with a Cα r.m.s.d. of 0.5 Å, only slightly higher than the coordinate error. PISA analysis (Krissinel & Henrick, 2007) indicates that the protein is monomeric, which is in line with observations made during the course of purification by size-exclusion chromatography.
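The Cα r.m.s.d. values quoted in this section follow the usual definition, the square root of the mean squared distance between paired Cα atoms. A minimal Python sketch is given below; it assumes that the two coordinate sets have already been optimally superposed (for example by a least-squares fit, which is not implemented here) and that atoms are paired in the same order. The coordinates are toy values and the function name `ca_rmsd` is hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """R.m.s.d. (in the units of the input) between paired, pre-superposed atoms.

    coords_a, coords_b: sequences of (x, y, z) tuples of equal length.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                      for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared_sum / len(coords_a))


# Toy coordinates in angstroms (hypothetical, two atoms per chain)
chain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.5), (1.0, 0.0, -0.5)]
rmsd_toy = ca_rmsd(chain_a, chain_b)
print(f"r.m.s.d. = {rmsd_toy:.2f} A")
```

In practice, the pairing of residues and the superposition are provided by the alignment program, and only the aligned residues enter the sum.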
Except for a handful of flexible residues at the N-terminus, both chains could be traced with confidence. The Δ18PA1624 monomer has approximate dimensions of 54 × 45 × 48 Å. It folds into two distinguishable domains, comprising residues 24-184 and 185-268, as determined by PiSQRD (Aleksiev et al., 2009). The two domains interact through a relatively small hydrophobic interface covering about 600 Å². The larger domain is dominated by a six-stranded antiparallel β-sheet that is covered by one α-helix on the face that also harbors the N-terminus and by a mixed α/β structure on the other. The smaller domain features a four-stranded mixed β-sheet lined by four α-helices on the face contacting the N-terminal domain (Fig. 2a). A disulfide bridge between cysteine residues 110 and 115 provides rigidity to the structure (Fig. 2c).
The presence of two domains in PA1624 was not anticipated, since an automated Pfam sequence analysis (Finn et al., 2010) had predicted only one domain, namely a DUF4892 domain extending from positions 20 to 202. Consequently, the question arose whether the two observed domains may be related to other, already known structural building blocks or whether they indeed represent new folds. Although no apparent sequence similarity or large conserved protein regions could be identified (Fig. 2b), we found that PA1624 is composed of two previously identified domains. For the N-terminal domain, analysis with DALI (Holm & Laakso, 2016) reveals distant yet significant structural homology to the DUF1795-containing lipoprotein DcrB from Salmonella enterica (Z-score 8.5; PDB entry 6e8a; Rasmussen et al., 2018). The proteins align with a Cα r.m.s.d. of 3.2 Å over 101 residues with only 7% sequence identity, and differences are mainly owing to a β-structure insertion between β-strands 1 and 2 and additional α-helical structure between β-strands 3 and 4 in PA1624 (Fig. 3a). The closest homolog of the C-terminal domain is a building block of Tp0624 from Treponema pallidum (Z-score 8.3; PDB entry 5jir; Parker et al., 2016), which aligns with a Cα r.m.s.d. of 2.8 Å over 78 residues, displaying a sequence identity of 15% (Figs. 3b and 3c). The Tp0624 domain appears to be larger owing to an additional α-helix inserted between the β-strands corresponding to the third and fourth β-strand of the domain in PA1624, as well as a significantly longer α-helix following the second β-strand. Further, the first secondary-structure element of this domain in PA1624 is an α-helix, whereas Tp0624 possesses a β-strand in this position (Fig. 3b, lower panel).
It is interesting to speculate about the implications for the function of PA1624 that these similarities may suggest. Previous analysis indicated that the DcrB protein is a membrane-anchored periplasmic protein that belongs to the  proteins that perform diverse functions but may be associated with membrane-anchored complexes in bacteria. The identified domain of Tp0624, on the other hand, possesses strong similarities to the OmpA family, a class of proteins involved in peptidoglycan binding. Taken together, this hints at a membrane-associated function within the periplasm of P. aeruginosa for PA1624, in line with the anticipated and the experimentally confirmed location of the protein (Imperi et al., 2009). However, there are also indications that contradict such direct conclusions. Firstly, the N-terminal domain of PA1624 does not contain a cysteine at its N-terminus, as is implicated in lipid modification and membrane anchoring in DcrB. Secondly, the C-terminal OmpA-like domain lacks the conserved sequence motifs that are required for peptidoglycan binding in these proteins. These motifs reside in the missing secondary-structure elements mentioned above. Therefore, additional studies will be necessary to identify the function of PA1624. Towards this, it is interesting to note that the interior of the C-terminal domain of the protein is not optimally packed, leaving a cavity lined by hydrophobic residues unoccupied. This cavity may sequester a hydrophobic ligand, such as a lipidic component of the membrane (Fig. 2d).
Overall, the structure of PA1624 described here confirms that the vast amount of available structural data makes it challenging to discover new protein folds, even if relationships are not apparent at the sequence level. This seems particularly true for smaller building blocks such as the two unanticipated domains found here in PA1624, since these domains will be dominated by secondary-structure elements that can only fold into a limited number of arrangements. Consequently, domains with no common ancestry will display similar structures, requiring further structure determination to reveal these relationships and inform structure-prediction programs. Therefore, we suggest that PA1624 has a novel, yet-to-be-named architectural domain arrangement.

Figure 3
Superposition of the C- and N-terminal domains of Δ18PA1624 with structurally related proteins. Δ18PA1624 is color-coded according to its secondary-structure elements. (a) The N-terminal domain superposes on the full-length lipoprotein DcrB from S. enterica, colored light blue (PDB entry 6e8a), with a Cα r.m.s.d. of 3.2 Å over 101 amino acids. (b) The smaller C-terminal domain structurally aligned with the blue-colored domain of Tp0624 from Treponema pallidum with a Cα r.m.s.d. of 2.8 Å over 78 residues (PDB entry 5jir). (c) Superposition of the C-terminal domain of PA1624 with full-length Tp0624 from T. pallidum.