Gene Composer in a structural genomics environment
The structural genomics effort at the Seattle Structural Genomics Center for Infectious Disease (SSGCID) requires the manipulation of large numbers of amino-acid sequences and the underlying DNA sequences which are to be cloned into expression vectors. To improve efficiency in high-throughput protein structure determination, a database software package, Gene Composer, has been developed which facilitates the information-rich design of protein constructs and their underlying gene sequences. With its modular workflow design and numerous graphical user interfaces, Gene Composer enables researchers to perform all common bioinformatics steps used in modern structure-guided protein engineering and synthetic gene engineering. An example of the structure determination of H1N1 RNA-dependent RNA polymerase PB2 subunit is given.
The Seattle Structural Genomics Center for Infectious Disease (SSGCID; https://www.ssgcid.org/home/index.asp) is devoted to the application of state-of-the-art structural genomics technologies to structurally characterize targeted proteins from NIAID Category A–C pathogens and organisms. The goal is to create a collection of three-dimensional protein structures that are widely available to the broad scientific community and serve as a blueprint for structure-based drug development for infectious diseases. The SSGCID uses an escalating-tier approach to protein production (Fig. 1). The overall SSGCID structure-determination pipeline involves a number of activities that are distributed between the Target Selection, Cloning & Expression Screening, Protein Production, Crystallization and Data Collection & Structure Solution teams. In order to maximize the likelihood of success of each target, yet minimize the cost per structure, we have adopted a multipronged serial escalation approach, whereby targets initially enter a standard high-throughput bacterial protein-expression system (Tier 1) and enter more resource-intensive 'rescue pathways' (Tiers 2–9) only after failing the initial approach. Tier 3 uses gene synthesis and multiple construct design as its technological focus.
To maximize our chances of success, we choose homologues and orthologues from multiple organisms with the goal of characterizing multiple examples of targets. It is well established in structural biology that making multiple different constructs of each target with various N- and C-terminal deletions or surface substitutions increases the chance of successfully obtaining soluble crystallizable protein (Chandonia & Brenner, 2006; Gräslund et al., 2008). Deciding on where to make deletions and modifications can be accomplished in several ways. Making multiple sequence alignments using, for example, ClustalW (Chenna et al., 2003; Larkin et al., 2007) is straightforward and can help the researcher decide on conserved domains. Using secondary-structure information to identify structured domains can be accomplished by examining the structures of related entities using software tools such as PyMOL (DeLano, 2002) or Coot (Emsley & Cowtan, 2004). Creating constructs for cloning can be accomplished with molecular-biology tools such as Vector NTI (Invitrogen). Obtaining genomic DNAs from hard-to-obtain species or from metagenomic projects is difficult if not impossible. In such cases gene synthesis is the most attractive approach. We have developed a single computer application called Gene Composer that incorporates all of these functions into one package. The researcher can create information-rich multiple sequence alignments incorporating not only amino-acid sequences but also secondary-structural information from PDB files. Individual sequences can then be extracted from the alignment and used to create a fully codon- and sequence-engineered synthetic DNA sequence that can be incorporated into a virtual cloning strategy. Vector constructs are given unique vector-construct identifications (VCIDs) that are stored in the database. The VCID database can be queried and the necessary information extracted for each individual clone.
A typical design cycle starts by defining a desired target protein or family of targets. We first download the relevant target sequences from the SSGCID central target tracking database (CTTdb). We then pull in additional information from multiple sources such as FASTA files from GenBank (https://www.ncbi.nlm.nih.gov/), structure files from the Protein Data Bank (PDB; https://www.rcsb.org/pdb/home/home.do) or simple text (.txt) files with homologous sequences of related proteins or orthologues. Gene Composer automatically creates the familiar ClustalW (Chenna et al., 2003; Larkin et al., 2007) multiple sequence alignments, pointing out areas of conservation, gaps and dissimilar regions. Adding structural information is simple. Coordinate files from the PDB of related proteins or domains can be added to the alignment and used to display experimental information. Secondary-structural information is annotated and amino acids are identified that participate in ligand-binding sites, are water-exposed or form crystal contacts. At this point the researcher may decide that it is sufficient to express the activity-bearing domain only or that multiple amino-acid sequence variants should be generated, including N- and C-terminal truncations, variants with surface mutations and single tags or combinations of tags at either end of the protein. Once domains and constructs have been identified, the user then defines the underlying DNA sequence, either as the native sequence or an engineered sequence. In the final step the user can virtually clone the constructs into vectors stored in the database using user-defined adapters and assemblies to create proteins with desired purification tags and features.
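As an illustration of what 'areas of conservation' means in such an alignment, the following minimal Python sketch scores each column of a pre-aligned set of sequences by the fraction of sequences sharing the most common residue. It is not the scoring used by Gene Composer or ClustalW; the function name `column_conservation` and the toy sequences are assumptions made purely for illustration.

```python
from collections import Counter

def column_conservation(aligned):
    """Score each alignment column by the share of its most common residue.

    `aligned` is a list of equal-length strings in which '-' marks a gap.
    Gaps never count as the conserved residue but do count in the total.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column made up entirely of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores

# Hypothetical toy alignment, not taken from any SSGCID target.
alignment = ["MKT-YI", "MKTAYV", "MRTAYI"]
scores = column_conservation(alignment)
for position, score in enumerate(scores, start=1):
    print(f"column {position}: {score:.2f}")
```

Columns scoring 1.00 are fully conserved, whereas lower values flag gaps or dissimilar regions that may be candidates for truncation or mutation.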
The H1N1 RNA-dependent RNA polymerase subunit PB2 offers an example of Gene Composer engineering leading to structure determination. This target has multiple SSGCID identifiers of the form InvaX.07055, where the X refers to multiple different strains or variants (Table 1). We started with several recent isolates and a published structure of the influenza A virus RNA-dependent RNA polymerase PB2 subunit (PDB entry 2vy6; Guilligay et al., 2008; Tarendeau et al., 2008) and created the alignment shown in Fig. 2. The degree of similarity is quite high amongst the SSGCID targets and 2vy6 (Guilligay et al., 2008; Tarendeau et al., 2008). The software not only extracts the structural information from the PDB but also displays (i) the chain sequence, which represents the protein put into crystallization trials, and (ii) the model, which represents the visible amino acids. In addition to displaying the conservation between sequences and the secondary structure, we can also interrogate the PDB file to extract other structure information such as B factors, crystal contacts, solvent accessibility and water contacts for any residue. For example, in Fig. 2 we have highlighted Gln591, which makes crystal contacts with Gln566, has a B factor of 29.3 Å² and makes two water contacts in the structure.
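To give a sense of how per-residue B factors can be read from a PDB-format file, the sketch below averages the temperature factor over the atoms of each residue using the fixed column positions of ATOM records. It is an assumption-laden illustration rather than Gene Composer's implementation: the helper names `atom_record` and `mean_b_factors` are hypothetical, and the records are synthetic values generated in the script, not data from entry 2vy6.

```python
from collections import defaultdict

def atom_record(serial, atom, res, chain, resseq, x, y, z, bfac):
    """Format one fixed-column ATOM record (synthetic demo data only)."""
    return (f"ATOM  {serial:>5} {atom:<4} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfac:6.2f}")

def mean_b_factors(pdb_lines):
    """Average the B factor over all atoms of each residue.

    Keys are (chain ID, residue number, residue name). Columns follow the
    PDB format: residue name 18-20, chain 22, residue number 23-26 and
    temperature factor 61-66 (1-based).
    """
    totals = defaultdict(lambda: [0.0, 0])
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        key = (line[21], int(line[22:26]), line[17:20].strip())
        totals[key][0] += float(line[60:66])
        totals[key][1] += 1
    return {key: total / n for key, (total, n) in totals.items()}

# Synthetic records; a real run would iterate over the lines of a PDB file.
records = [
    atom_record(1, "N", "GLN", "A", 10, 1.0, 2.0, 3.0, 20.0),
    atom_record(2, "CA", "GLN", "A", 10, 1.5, 2.5, 3.5, 30.0),
    atom_record(3, "N", "ALA", "A", 11, 2.0, 3.0, 4.0, 40.0),
]
b_means = mean_b_factors(records)
for (chain, number, name), value in sorted(b_means.items()):
    print(f"{name}{number} chain {chain}: mean B = {value:.1f}")
```

Crystal contacts, solvent accessibility and water contacts need geometric calculations on the coordinates and are outside the scope of this sketch.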
At this point the researcher can define a base construct from which to further design the expressed protein and to begin the gene-design process. Fig. 3 shows how N- and C-terminal truncations are set by inserting a character at the desired end points. Gene Composer then makes all combinations of truncations.
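The combinatorial expansion of truncation end points can be pictured with a short Python sketch. This is only an illustration of the idea of 'all combinations', assuming 1-based inclusive residue numbering; the function name `truncation_constructs` and the example sequence are hypothetical and do not reflect Gene Composer's internal code.

```python
from itertools import product

def truncation_constructs(sequence, n_starts, c_ends):
    """Return every N-/C-terminal truncation as (start, end, subsequence).

    Positions are 1-based and inclusive, as in residue numbering.
    """
    constructs = []
    for start, end in product(sorted(n_starts), sorted(c_ends)):
        if start <= end:  # skip pairs that would leave no residues
            constructs.append((start, end, sequence[start - 1:end]))
    return constructs

# Hypothetical 20-residue sequence, not taken from the PB2 target.
seq = "MKTAYIAKQRQISFVKSHFS"
constructs = truncation_constructs(seq, n_starts=[1, 4], c_ends=[15, 20])
for start, end, sub in constructs:
    print(f"{start}-{end}: {sub}")
```

Two N-terminal start points and two C-terminal end points give four constructs, so the number of constructs grows as the product of the numbers of chosen end points.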
From information such as that displayed in Figs. 2 and 3 the user can now start to design constructs, making N- and C-terminal deletions, amino-acid insertions/deletions or surface mutations (Figs. 3, 4 and 5). We choose to make truncations at the N- and C-termini or remove internally disordered loops by simply indicating in the Construct Design Viewer where the N- and C-termini should be by right-clicking at the position to start or stop (Fig. 3). To remove an internal sequence we simply highlight the residues to delete, right-click and choose 'delete residues' from the list of options. Multiple combinations of truncations, mutations, insertions and deletions can be specified and all combinations are generated by the program.
The features of the Gene Design Module have been presented in detail elsewhere (Lorimer et al., 2009), so we will only briefly discuss them here. Once the user has defined the base construct to be made, he or she must now define a DNA sequence for use in the virtual cloning steps. The user can choose to use the natural sequence as found, for example, in GenBank or the PDB. Alternatively, the user may choose to design a gene from scratch by back-translating the target protein sequence using a codon-usage table (CUT) defined for the expression host. The CUTs define the frequency of codon occurrence in the host genes, allowing the user to match the codon frequencies of the gene to those of the host and therefore avoid rare codons. Table 2 shows an example of the Escherichia coli codon-usage table. Gene Composer comes pre-loaded with CUTs for mammalian, insect cell/baculovirus, E. coli and combined E. coli/baculovirus genes. Once the basic back-translated gene has been designed, we can choose to introduce or remove restriction-enzyme sites, remove cryptic Shine–Dalgarno sequences, remove repeated sequences, introduce out-of-frame stop codons etc. (Fig. 4; Lorimer et al., 2009).
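Back-translation with a codon-usage table can be sketched in a few lines of Python. The example below simply picks the most frequent codon for each amino acid, which is a deliberate simplification and not necessarily the strategy used by Gene Composer (see Lorimer et al., 2009, for the actual algorithm). The table `TOY_CUT` covers only a few amino acids and its frequencies are illustrative placeholders, not measured E. coli values.

```python
# Toy codon-usage table: amino acid -> {codon: relative frequency}.
# Frequencies are illustrative placeholders, NOT measured E. coli values.
TOY_CUT = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.75, "AAG": 0.25},
    "L": {"CTG": 0.50, "TTA": 0.15, "TTG": 0.10,
          "CTT": 0.10, "CTC": 0.10, "CTA": 0.05},
    "E": {"GAA": 0.70, "GAG": 0.30},
    "*": {"TAA": 0.60, "TGA": 0.30, "TAG": 0.10},
}

def back_translate(protein, cut):
    """Back-translate a protein using the most frequent codon per residue.

    Simplification: always taking the top codon avoids rare codons but
    does not reproduce the host's full codon-frequency distribution.
    """
    codons = []
    for aa in protein:
        if aa not in cut:
            raise KeyError(f"no codons defined for residue {aa!r}")
        usage = cut[aa]
        codons.append(max(usage, key=usage.get))
    return "".join(codons)

# Hypothetical five-symbol peptide ending in a stop ('*').
gene = back_translate("MKLE*", TOY_CUT)
print(gene)  # ATGAAACTGGAATAA
```

Later engineering steps, such as removing an unwanted restriction site, would then swap individual codons for synonymous alternatives from the same table.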
Once all constructs have been decided on and the underlying DNA sequence has been defined, the user can perform a virtual cloning step in which the gene sequence is inserted, in silico, into a vector sequence (Fig. 6). The user can import vectors from commercial sources or create their own vectors. The user defines the potential cloning sites, vector-encoded tags, starts and stops. The user can also define adapters such as restriction-enzyme recognition sequences as well as tags and other types of features to be added to the target gene. These adapters are entered as DNA sequences and can be added to the ends of coding sequences as desired. These features are color-coded for easy visualization of the final construct-vector map. Adapters can be concatenated into assemblies that are stored for easy use. Once a gene sequence has been virtually cloned and stored in the database it is given a unique identifier called a Vector Construct ID or VCID. Once created, a VCID cannot be altered in any way so as to protect the integrity of the database. Deletion of VCIDs is password-protected to prevent accidental loss of data. Since the VCID is a software-generated piece of code, users can freely use it as a secure identifier in communication between parties.
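The virtual cloning step and the read-only nature of a stored construct can be mimicked with a small Python sketch. Everything here is an assumption for illustration: the class `VectorConstruct`, the function `virtual_clone`, the sequential `VCID00001`-style numbering and the toy vector are hypothetical and do not describe Gene Composer's database schema or its actual identifier format. The adapters used are the BamHI (GGATCC) and XhoI (CTCGAG) recognition sequences.

```python
from dataclasses import dataclass, FrozenInstanceError
from itertools import count

_next_id = count(1)  # hypothetical sequential numbering scheme

@dataclass(frozen=True)
class VectorConstruct:
    """A stored construct; frozen so that it cannot be edited afterwards."""
    vcid: str
    sequence: str

def virtual_clone(vector, site_index, gene, adapter5="", adapter3=""):
    """Insert adapter5 + gene + adapter3 into the vector at site_index."""
    insert = adapter5 + gene + adapter3
    full = vector[:site_index] + insert + vector[site_index:]
    return VectorConstruct(vcid=f"VCID{next(_next_id):05d}", sequence=full)

# Toy vector and gene; the adapters are the BamHI and XhoI sites.
construct = virtual_clone("AAAATTTT", 4, "ATGAAATAA",
                          adapter5="GGATCC", adapter3="CTCGAG")
print(construct.vcid, construct.sequence)

try:
    construct.sequence = "edited"  # any attempt to modify is rejected
except FrozenInstanceError:
    print("stored constructs are read-only")
```

Making the record immutable at the type level is one simple way to express the rule that a construct, once assigned its identifier, is never changed in place.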
Structural genomics efforts are becoming commonplace in the field of functional and structural biology. For example, there are currently 36 centers tracked in the Protein Data Bank, including sites in the US, Japan and Europe. As of July 2011 these centers have submitted 10 370 structures to the PDB. At present the SSGCID is the eighth most productive center tracked. A key to our success is the ability to clone, express, purify and crystallize multiple versions of any single target in our pipeline. The SSGCID uses an escalating-tier approach to protein production (Fig. 1). Tier 3 uses gene synthesis and multiple construct design as its technology. As such, in Tier 3 Gene Composer provides an excellent tool for performing rational construct design using parameters such as sequence homology, synthetic DNA engineering and structural features. Examples of Tier 3 rescue are discussed in Raymond et al. (2011). For the example given here of the H1N1 RNA-dependent RNA polymerase subunit PB2, specific DNAs were difficult if not impossible to obtain for this work and therefore gene synthesis was the only option. Multiple versions of the protein were created and expression and purification were accomplished (Raymond et al., 2011; Yamada et al., 2010). Purified protein entered into crystallization trials and a structure model was determined (Fig. 7). The structure and its impact on our understanding of cross-species transmission are discussed elsewhere (Yamada et al., 2010).
Gene Composer software can be downloaded for free from https://www.genecomposer.net by following a simple click-through registration license. The operating system required is Windows 2000/XP and the hardware is a Pentium 4 or Athlon at 1 GHz with 1 GB RAM. Gene Composer is a registered trademark of Emerald BioStructures Inc.
The authors wish to thank all of the members of the SSGCID team. This research was funded under Federal Contract No. HHSN272200700057C from the National Institute of Allergy and Infectious Diseases, the National Institutes of Health, Department of Health and Human Services and by the NIGMS–NCRR co-sponsored PSI-2 Specialized Center Grant U54 GM074961 for the Accelerated Technologies Center for Gene to Three-Dimensional Structure.
Chandonia, J. M. & Brenner, S. E. (2006). Science, 311, 347–351.
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G. & Thompson, J. D. (2003). Nucleic Acids Res. 31, 3497–3500.
DeLano, W. L. (2002). PyMOL. https://www.pymol.org.
Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126–2132.
Gräslund, S., Sagemark, J., Berglund, H., Dahlgren, L. G., Flores, A., Hammarström, M., Johansson, I., Kotenyova, T., Nilsson, M., Nordlund, P. & Weigelt, J. (2008). Protein Expr. Purif. 58, 210–221.
Guilligay, D., Tarendeau, F., Resa-Infante, P., Coloma, R., Crepin, T., Sehr, P., Lewis, J., Ruigrok, R. W., Ortin, J., Hart, D. J. & Cusack, S. (2008). Nature Struct. Mol. Biol. 15, 500–506.
Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J. & Higgins, D. G. (2007). Bioinformatics, 23, 2947–2948.
Lorimer, D., Raymond, A., Walchli, J., Mixon, M., Barrow, A., Wallace, E., Grice, R., Burgin, A. & Stewart, L. (2009). BMC Biotechnol. 9, 36.
Raymond, A., Haffner, T., Ng, N., Lorimer, D., Staker, B. & Stewart, L. (2011). Acta Cryst. F67, 992–997.
Tarendeau, F., Crepin, T., Guilligay, D., Ruigrok, R. W., Cusack, S. & Hart, D. J. (2008). PLoS Pathog. 4, e1000136.
Yamada, S. et al. (2010). PLoS Pathog. 6, e1001034.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.