How do I get the most out of my protein sequence using bioinformatics tools?

Bioinformatics tools, primarily those available through the MPI Bioinformatics Toolkit, for the annotation of protein sequences are described. These include tools for the identification of homologs of known structure, protein domains, sequence repeats, coiled coils, transmembrane segments and signal sequences.


Introduction
With a protein sequence of interest at hand, life scientists aim at obtaining all possible information towards uncovering its biological role. Wet-laboratory experiments are fundamental to such a task. However, the increasing availability of protein sequence, structural and functional data has allowed the development of multiple computational resources that help to make informative predictions to guide experiments. These resources include methods for detecting homologs in protein sequence and structure databases, detecting sequence features such as repeats, coiled coils, transmembrane segments, signal sequences and secondary structures, and predicting three-dimensional structures. Their combined results generally help to answer various questions regarding a protein of interest, including (i) which domains may be present, (ii) what its cellular localization may be, (iii) which segments may be fibrous and be responsible for its function or impair some experimental steps and (iv) which molecular functions may be expected.
Bioinformatics tools for protein sequence analysis have been developed over more than 30 years by groups worldwide, and their list is so extensive that choosing the most suitable and performant ones to use can often be overwhelming. Integrative web resources, where multiple best-performing tools are available within the same platform, provide a great solution. Examples include the EMBL-EBI Bioinformatics Web Services (Madeira et al., 2019), the SIB Bioinformatics Resource Portal (SIB Swiss Institute of Bioinformatics Members, 2016), the National Center for Biotechnology Information Web Resources (NCBI Resource Coordinators, 2018), the PredictProtein server (Bernhofer et al., 2021) and the Max Planck Institute (MPI) Bioinformatics Toolkit (Zimmermann et al., 2018).
The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de/) was launched in 2005 to provide life scientists with easy, web-based access to the best-performing bioinformatics tools and databases. It currently includes 36 in-house and external tools for (i) sequence-similarity searching, (ii) sequence-repeat detection and (iii) sequence-feature prediction, including that of secondary structure, disordered regions, coiled-coil regions, transmembrane segments and signal sequences. The Toolkit also offers easy, web-based access to HHblits and HHpred (Steinegger et al., 2019), two of the most sensitive tools for the detection of remote evolutionary relationships. Because of the popularity of these two tools, the Toolkit has established itself as an important integrative resource for molecular-biology research.
Here, we provide practical tips for using some of the main tools available within the Toolkit; for comprehensive protocols, please refer to Gabler et al. (2020). To demonstrate the different steps involved in the annotation of an uncharacterized protein, we use a metagenome-derived hypothetical protein, EHM23_20970 (EntrezID RPJ57313.1), thought to originate from an acidobacterium, as an example. This protein was recently predicted to contain a β-propeller domain of the VCBS superfamily, but its biological role is currently unknown (Pereira & Lupas, 2021).

Homology searches
When analyzing an uncharacterized protein sequence of interest, the first step is to identify functionally or structurally characterized homologs in protein sequence and structure databases such as the nonredundant database (nr) at NCBI or the Protein Data Bank (PDB). This helps in the inference of function and the modeling of three-dimensional structures through extrapolation by homology. Common sequence-search methods include (i) BLASTp (Altschul et al., 1997; Ladunga, 2017), which compares a single sequence with a sequence database, (ii) PSI-BLAST (Altschul et al., 1997), which compares a position-specific scoring matrix (PSSM), also commonly referred to as a profile, with a sequence database, (iii) HMMER (Prakash et al., 2017; Potter et al., 2018), which compares a profile hidden Markov model (HMM) with a sequence database, and (iv) HHsearch (Steinegger et al., 2019; Söding, 2005), which compares a profile HMM with a profile HMM database. Due to their different underlying approaches, each of these methods has different sensitivities and detects homologous relationships at different evolutionary distances. These four tools are available through the 'Search' section of the Toolkit and allow searches in various sequence databases.
By comparing single sequences, BLASTp searches for close homologs of a query protein sequence. The horizon of its search can be expanded further by iteration using PSI-BLAST (Altschul et al., 1997), wherein the multiple sequence alignment (MSA) of the matches in one round is used to build a PSSM that captures the conservation pattern in the alignment and records it as a matrix of scores for each position in the alignment. This PSSM is used in the next round to detect new matches, and after each round it is updated with the newly detected matches, allowing a continuous expansion of the sampled sequence space until no new homologs are found.
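To make the idea of a PSSM concrete, the following minimal Python sketch derives a log-odds matrix from a toy alignment and uses it to score a segment. It is an illustration only and not the PSI-BLAST procedure: the uniform background frequencies, the flat pseudocount and the absence of sequence weighting are simplifying assumptions, and the function names `build_pssm` and `score_segment` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column.

    Assumptions for illustration: uniform background frequencies, a flat
    pseudocount for every residue type, and no sequence weighting.
    """
    if not msa or len({len(row) for row in msa}) != 1:
        raise ValueError("MSA rows must be non-empty and of equal length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*msa):
        # Gap characters and unknown residues are simply ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores of an ungapped segment as long as the PSSM."""
    return sum(column[res] for column, res in zip(pssm, segment))


toy_msa = ["ACD", "ACE", "ACD"]  # made-up alignment, not real data
toy_pssm = build_pssm(toy_msa)
print(score_segment(toy_pssm, "ACD") > score_segment(toy_pssm, "WWW"))
```

In an iterative search, a matrix of this kind would be rebuilt from the enlarged alignment after every round, which is what allows the search horizon to widen.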
HMMER, on the other hand, compares a profile HMM, another statistical description of the conservation pattern of a sequence alignment, with a database of protein sequences. As in PSSMs, for each column in an MSA, the equivalent column in the corresponding HMM contains the probability of occurrence for each of the 20 amino acids; the difference lies in the presence of four additional probabilities that describe how often amino acids are inserted and deleted at that position. With this, HMMER evaluates the probability of a database sequence containing the sequence pattern of a given profile HMM and can often detect more distant evolutionary relationships. Like BLAST, its search horizon can be expanded by iteration, as implemented in JackHMMER (not available through the Toolkit; https://www.ebi.ac.uk/Tools/hmmer/search/jackhmmer; Johnson et al., 2010).
HHsearch and its accelerated and iterative version HHblits (Remmert et al., 2012) achieve a further increase in sensitivity by comparing profile HMMs with a database of profile HMMs and by incorporating secondary-structure information in the underlying profile HMMs, either as predicted by PSIPRED (Jones, 1999) or assigned from a 3D structure by DSSP (Touw et al., 2015;Kabsch & Sander, 1983). They are currently the most sensitive methods for detecting distant evolutionary relationships that typically remain undetected by other search methods. HHsearch and HHblits are therefore at the core of multiple state-of-the-art structure-prediction workflows, from template-based methods such as HHpred (Hildebrand et al., 2009;Zimmermann et al., 2018) and SWISS-MODEL (Waterhouse et al., 2018) to ab initio contact-based methods such as AlphaFold (Senior et al., 2020;Jumper et al., 2021), trRosetta (Yang, Anishchenko et al., 2020) and RoseTTAFold (Baek et al., 2021).

HHpred for sensitive protein-homology detection and structure prediction
The most widely used tool within the Toolkit is HHpred, a server for protein-domain annotation and structure prediction based on HHsearch (Hildebrand et al., 2009;Gabler et al., 2020;Zimmermann et al., 2018). Starting from an input protein sequence, HHpred builds an MSA and generates a profile HMM. By default, MSA generation is carried out with three iterations of HHblits over the UniRef clusters database filtered for a maximum pairwise sequence identity of 30% (UniRef30; Mirdita et al., 2017). The number of iterations, the E-value cutoff for sequence inclusion and the search method itself can be changed depending on how deep the user desires the MSA to be. If PSI-BLAST is used for this step, sequence searches are carried out against the nr protein-sequence database filtered for a maximum sequence identity of 70% (nr70; Zimmermann et al., 2018).
The calculated profile HMM is then searched against user-selected profile HMM databases. By default, the Protein Data Bank filtered for a maximum pairwise sequence identity of 70% (PDB70) is searched, but several other databases are also offered, including the Structural Classification of Proteins (Chandonia et al., 2019) and the Evolutionary Classification of Protein Domains (Cheng et al., 2014) databases filtered for a maximum sequence identity of 70% (SCOPe70 and ECOD_F70, respectively), Pfam-A (Mistry et al., 2020) and the NCBI database of Conserved Domains (CD; Yang, Derbyshire et al., 2020). Presently, up to four databases can be searched at a time. While searches against the PDB70 database allow the identification of homologs of known structure that may be used for homology modeling, searches against domain databases aid in the identification and annotation of putative domain regions and the inference of function. HHpred also offers profile HMM databases for several representative archaeal, bacterial and eukaryotic proteomes (Zimmermann et al., 2018).
Figure 1. Identification of homologs of known structure for the hypothetical protein EHM23_20970 using HHpred. Output pages for searches against the (a) PDB70 and (b) ECOD70 (ECOD_F70) profile HMM databases are shown. The alert message displayed when coiled-coil segments and/or signal peptides are predicted is highlighted by a red box. (c) Sequence alignments for the best match for the N- and C-terminal regions. The sevenfold repetition of a conserved putative cation-binding motif in the C-terminal region is highlighted.
Upon completion of the search, the results page provides three outputs: (i) a visual summary of the matches color-coded by their HHsearch probability [red (100%) to blue (20%)] (Figs. 1a and 1b), (ii) a table summarizing the matches and (iii) pairwise query-template alignments (Fig. 1c). The matches are sorted by their HHpred probability value and by default only the top 250 matches are displayed, but a maximum of 10 000 can be shown. Most representations on the results page are interactive; for example, clicking on a match in the visual summary takes the user to the corresponding alignment. Before selecting any match for downstream analysis (for example homology modeling or further sequence searches), it is advisable to analyze the corresponding alignment for conserved sequence motifs (Fig. 1c) or important deletions or insertions to make the most informed predictions or manual selection of templates. For detailed information on various search parameters and on understanding the search results, please refer to Gabler et al. (2020).
After careful inspection of the query-template alignments, manually selected alignments from a search against the PDB70 database can be forwarded to MODELLER (Webb & Sali, 2021) for homology modeling; that is, for building a structural model of the query protein sequence by using the match as a template (Fig. 2). This can be achieved by clicking 'Model using selection', which will start a new Toolkit job, generating an alignment in PIR format to be forwarded as input to MODELLER; if necessary, this alignment can be manually adjusted before starting the MODELLER job.
Users with a precomputed HHpred query-template alignment in PIR format can also run MODELLER directly from the '3ary Structure' section. As with any method for homology modeling, some important considerations should be taken into account: (i) errors in the alignment will introduce errors in the model, and the lower the sequence similarity between the query and the template, the higher the probability of such errors will be, (ii) side-chain placement becomes unreliable at sequence identities below 70% and (iii) as no dedicated loop-modeling tool is employed, long loops for which no templates are available are not modeled reliably. For these reasons, before any downstream application of the calculated model its quality should be evaluated and, if necessary, it should be refined; for a detailed review of this topic, please refer to Haddad et al. (2020).
Often, HHpred searches may not identify homologs for specific regions of the input sequence. The reasons for this could be manifold: (i) the region may not have any homologs of known structure, (ii) it may correspond to an intrinsically disordered region or (iii) it may be a highly diverged form of a known domain. In such cases, it is typically helpful to re-run the search for that region alone. Highly conserved sequences tend to bias the profile HMM by contributing a high number of homologs, down-weighting less conserved regions and making them undetectable. By running HHpred with the region of interest alone, the profile HMM will not be biased by its flanking regions in the full-length sequence.
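As a small aid for deciding which regions to resubmit, the sketch below lists the stretches of a query that are not covered by any match and extracts the corresponding subsequence. It is a hedged illustration that operates on match ranges copied by hand from a results table; the function names and the `min_length` cutoff are hypothetical and not part of the Toolkit.

```python
def uncovered_regions(query_length, matches, min_length=30):
    """Return 1-based inclusive (start, end) ranges not covered by any match.

    `matches` is a list of 1-based inclusive (start, end) ranges on the query.
    Gaps shorter than `min_length` (an arbitrary illustrative cutoff) are dropped.
    """
    gaps = []
    next_uncovered = 1
    for start, end in sorted(matches):
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= query_length:
        gaps.append((next_uncovered, query_length))
    return [(s, e) for s, e in gaps if e - s + 1 >= min_length]


def extract_region(sequence, start, end):
    """Return the subsequence for a 1-based inclusive residue range."""
    return sequence[start - 1:end]


# Ranges of the kind reported for a 1051-residue query with two matched regions.
print(uncovered_regions(1051, [(60, 700), (711, 1050)]))
```

The extracted subsequence can then be submitted on its own, so that its profile HMM is built without the flanking regions.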
The following recommendations are made.
(i) When only very close homologs are to be found, set the 'MSA generation iterations' to 0.
(ii) Set 'Max target hits' to 10 000 to obtain a more comprehensive set of matches.
(iii) To compare two sequences or MSAs, use the pairwise mode of HHpred; click on the switch labeled 'Align two sequences/MSAs' located below the input textbox to activate it.
(iv) When an HHpred search yields no matches for certain regions of a protein, re-run the searches with those regions alone.
(v) Always inspect the alignments for conserved sequence motifs. In particular, check the row between the query and template consensus sequences for clusters of three or more matching columns (marked by a '|' sign). Check whether the identified conserved motifs have a characterized function in homologs detected by HHpred.
(vi) Always check the quality of a homology-based model before any downstream application.
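Recommendation (v) can be partly automated once an alignment has been saved as text. The following sketch, which assumes that the row between the query and template consensus sequences has been copied into a string, reports every run of three or more consecutive '|' signs. The function name is hypothetical, and the other characters in the example row are placeholders.

```python
import re


def conserved_clusters(match_row, min_run=3):
    """Return 0-based half-open (start, end) column spans of runs of '|'.

    Only runs of at least `min_run` consecutive '|' characters are reported.
    """
    pattern = re.compile(r"\|{%d,}" % min_run)
    return [m.span() for m in pattern.finditer(match_row)]


example_row = "..|||.+|.||||"  # made-up row for illustration
print(conserved_clusters(example_row))
```

The reported spans only indicate where to look; whether a cluster corresponds to a functionally characterized motif still has to be checked in the homologs themselves.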

Repeat detection
In addition to identifying experimentally characterized homologs in sequence and structure databases, it is often also helpful to detect internal sequence repeats in the protein sequence of interest. More than 14% of all proteins, and as many as 25% in eukaryotes, are predicted to contain internal sequence repeats (Marcotte et al., 1999), which often correspond to structural or functional units (Andrade, Perez-Iratxeta et al., 2001; Söding & Lupas, 2003). Therefore, their identification provides clues about the domain organization, fold and function of proteins, especially of those without any homologs of known structure and function. Additionally, it may help to gain insights into possible fibrous, elongated or symmetric segments of the protein that may affect its experimental characterization. In the specific case of macromolecular crystallography, local structural symmetry is a source of noncrystallographic symmetry (NCS), which can be both a valuable asset (Kleywegt, 1996; Terwilliger, 2002) and a complication (Ruf et al., 2016; Jamshidiha et al., 2019) in crystallographic structure determination.
Figure 2. Homology modeling of the hypothetical protein EHM23_20970. In the top panel, a screenshot of the HHpred results page header is shown, highlighting the option 'Model using selection'. The top match for each domain in EHM23_20970 was selected (PDB entries 5ife chain C and 2bwr chain A) and forwarded to MODELLER. The resulting model is shown in the bottom panel.

HHrepID for de novo repeat detection
HHrepID employs HMM-HMM comparison for the de novo detection of highly divergent tandem repeats in protein sequences (Biegert & Söding, 2008; Remmert et al., 2010). It starts by generating a profile HMM for the input sequence using three iterations of HHblits over the UniRef30 database. Next, it uses HMM-HMM self-comparison to search for local suboptimal alignments and to detect sequence signatures of repeats. The result is a graphical representation of the self-comparison matrix, where entries (i, j) correspond to the probability of residue i being aligned with residue j, and a multiple sequence alignment of the repeat units found by analyzing the matrix (including their significance value and boundaries; Fig. 3).
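To illustrate how repeats show up in a self-comparison matrix, the sketch below averages a square matrix of pairwise probabilities along each off-diagonal and reports the offset with the strongest signal as a candidate repeat length. This is a deliberately simplified stand-in and not the HHrepID algorithm; the toy matrix, the averaging scheme and the function names are assumptions made for illustration.

```python
def offdiagonal_signal(matrix):
    """Return {offset k: mean of matrix[i][i + k]} for 1 <= k < n."""
    n = len(matrix)
    signal = {}
    for k in range(1, n):
        values = [matrix[i][i + k] for i in range(n - k)]
        signal[k] = sum(values) / len(values)
    return signal


def candidate_period(matrix):
    """Return the smallest offset with the highest mean off-diagonal value."""
    signal = offdiagonal_signal(matrix)
    return min(signal, key=lambda k: (-signal[k], k))


# Toy 12 x 12 matrix with a perfect repeat every four residues (made-up data).
n = 12
toy_matrix = [
    [1.0 if i != j and (j - i) % 4 == 0 else 0.0 for j in range(n)]
    for i in range(n)
]
print(candidate_period(toy_matrix))
```

In a real matrix the off-diagonal lines are shorter, noisier and interrupted by insertions, which is why visual inspection of the matrix remains useful.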
A repeat sequence is considered to be significant if the self-alignment p-value is below a given threshold (1 × 10⁻¹ by default) and is highlighted in the self-comparison matrix as a blue line. However, highly divergent repeats may not pass this threshold and may not be included in the alignment, but signals for them may still be observed in the matrix as dark or light gray lines (Fig. 3a). Such divergent repeats can be found at the termini, between detected repeat-containing regions or even as long linkers. Therefore, it is always helpful to analyze the self-comparison matrix and the linkers between repeats for divergent repeats. If the linkers are about the same size as the detected repeats, they may represent degenerate forms. In such cases, it is advisable to realign the automatically detected and manually included repeats with multiple sequence alignment tools (available through the 'Alignment' section).
HHrepID works best with protein sequences containing a single domain or a single repeat type. Ideally, sequences should be between 100 and 300 residues long. While shorter sequences usually gather too few homologs in the MSA generation step, longer sequences may contain multiple domains or different types of repeats. HHpred could be used first to detect domains, and subsequently HHrepID could be run on the individual domains. Also, if an HHrepID job yields repeats of different types, the detected repeats could be refined by re-running HHrepID with sequence segments corresponding to one repeat type.
The following recommendations are made.
(i) Carry out an HHpred search against a database of domains (for example SCOPe, ECOD or Pfam-A) and subsequently run HHrepID for each domain individually.
(ii) Pay attention to linker regions between repeats. If they are about the same length as the detected repeats, they may represent highly divergent repeats that scored below the significance threshold.
(iii) When analyzing an alignment of repeats yielded by HHrepID, it can be useful to realign them using other sequence-alignment tools (for example MSAProbs; Liu et al., 2010).
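The linker check in recommendation (ii) can be sketched as follows, assuming that the boundaries of the detected repeats have been copied from the HHrepID output. The 30% length tolerance and the function name are arbitrary, hypothetical choices made for illustration.

```python
def candidate_degenerate_linkers(repeats, tolerance=0.3):
    """Return linkers whose length is close to the mean repeat length.

    `repeats` holds 1-based inclusive (start, end) boundaries of detected
    repeats. A linker is flagged when its length differs from the mean repeat
    length by at most `tolerance` (a fraction; 0.3 is an arbitrary choice).
    """
    if len(repeats) < 2:
        return []
    repeats = sorted(repeats)
    mean_length = sum(end - start + 1 for start, end in repeats) / len(repeats)
    flagged = []
    for (_, prev_end), (next_start, _) in zip(repeats, repeats[1:]):
        linker_length = next_start - prev_end - 1
        if linker_length > 0 and abs(linker_length - mean_length) <= tolerance * mean_length:
            flagged.append((prev_end + 1, next_start - 1))
    return flagged


# Made-up boundaries: four 40-residue repeats with one repeat-sized linker.
print(candidate_degenerate_linkers([(1, 40), (41, 80), (121, 160), (165, 204)]))
```

Flagged linkers are candidates for manual inclusion and realignment, not confirmed repeats.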

Coiled-coil prediction
Coiled coils are a ubiquitous class of repetitive protein segments that support various biological roles from transport to structural rigidity and signal transduction. They consist of two or more α-helices that wind around each other in a parallel or antiparallel orientation to form a superhelical bundle. The bundle is held together by a primarily hydrophobic core following a 'knobs-into-holes' packing. Canonical coiled coils are characterized by a seven-residue sequence repeat (the 'heptad'), where each position is labeled a-g; the residues at positions a and d are usually oriented towards the core and are primarily hydrophobic. However, coiled coils with other periodicities are also known and are referred to as 'noncanonical' coiled coils. These are described as combinations of three- and four-residue sequence segments (for example an 11-residue repeat, or hendecad, is the result of the combination of 3 + 4 + 4 segments), and their packing deviates from the knobs-into-holes geometry. The MPI Toolkit offers four tools for the prediction of coiled-coil regions from sequence alone: PCOILS (Gruber et al., 2006; Lupas et al., 1991), MARCOIL (Delorenzi & Speed, 2002), DeepCoil (Ludwiczak et al., 2019) and DeepCoil2.
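The heptad labeling described above can be illustrated with a short sketch that assigns positions a-g to a sequence and measures how often the core positions a and d carry hydrophobic residues. The hydrophobic residue set, the synthetic test sequence and the function names are assumptions made for illustration; none of the prediction tools mentioned here works in this simplified way.

```python
HEPTAD_POSITIONS = "abcdefg"
HYDROPHOBIC = set("AILMFVWY")  # assumption: one simplified hydrophobic set


def heptad_register(length, first_position="a"):
    """Return the a-g label for each residue, starting at `first_position`."""
    offset = HEPTAD_POSITIONS.index(first_position)
    return "".join(HEPTAD_POSITIONS[(offset + i) % 7] for i in range(length))


def core_hydrophobic_fraction(sequence, first_position="a"):
    """Fraction of residues at positions a and d that are hydrophobic."""
    register = heptad_register(len(sequence), first_position)
    core = [res for res, pos in zip(sequence, register) if pos in "ad"]
    if not core:
        return 0.0
    return sum(res in HYDROPHOBIC for res in core) / len(core)


def best_frame(sequence):
    """Return the starting position that maximizes the core hydrophobic fraction."""
    return max(HEPTAD_POSITIONS, key=lambda p: core_hydrophobic_fraction(sequence, p))


synthetic = "LKELEDK" * 3  # idealized, made-up heptad repeat
print(heptad_register(len(synthetic)))
print(best_frame(synthetic))
```

For a noncanonical periodicity such as a hendecad, a fixed seven-residue register of this kind would drift out of phase, which is one reason why heptad-based methods miss such segments.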
PCOILS detects coiled-coil segments in a protein sequence or an MSA using sequence-profile or profile-profile comparisons. The Toolkit implementation of PCOILS allows the user to run the predictions on a protein sequence, a custom MSA or an MSA built internally by the Toolkit. Additionally, the user can set two parameters: the profile matrix (MTIDK, MTK, PDB or Iterated) and a weighting option for core residues (yes or no). The MTIDK and MTK matrices are based on myosins, paramyosins, tropomyosins, intermediate filaments type IV, desmosomal proteins and kinesins (Lupas et al., 1991), whereas the PDB and Iterated matrices are based on larger coiled-coil data sets derived from the PDB and the nr database, respectively. The Iterated matrix performs the best and the MTK matrix the worst when the prediction is carried out using the input sequence alone. However, when the predictions are carried out using an MSA, all matrices perform similarly.
As coiled coils are typically fibrous and solvent-exposed, all but the core positions (a and d) have a high probability of being occupied with hydrophilic residues. Consequently, since all positions are weighted equally in the unweighted mode of PCOILS, highly charged hydrophilic sequences are often predicted to be coiled coils, even in the absence of heptad periodicity. This issue can be resolved using the weighted mode of PCOILS, which assigns the same weight to the two hydrophobic positions a and d as to the five hydrophilic positions b, c, e, f and g (2.5:1 in the weighted mode versus 1:1 in the unweighted mode). PCOILS proceeds by comparing the input sequence or MSA with the user-selected matrix using sliding windows of three different sizes (14, 21 and 28 residues, corresponding to two, three and four heptads, respectively). The output is the coiled-coil-forming probability and frame (a-g) for each residue in the input sequence; as a rule of thumb, residues with probabilities above 50% can be considered to be part of a coiled-coil segment. In the MPI Toolkit, these probabilities are made available through a table and a graph, shown together with the secondary-structure prediction carried out with PSIPRED (Jones, 1999;Figs. 4a and 4b).
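The effect of the 2.5:1 weighting can be made explicit with a toy calculation. The sketch below computes a weighted average of hypothetical per-residue scores over a sliding window, with positions a and d weighted 2.5-fold in the weighted mode. It is not the PCOILS scoring function: the arithmetic mean, the input scores and the function names are illustrative assumptions, and only the weights and window sizes are taken from the text.

```python
WINDOWS = (14, 21, 28)  # two, three and four heptads


def position_weights(weighted):
    """Per-position weights: 2.5 for a and d in the weighted mode, otherwise 1."""
    core_weight = 2.5 if weighted else 1.0
    return {pos: (core_weight if pos in "ad" else 1.0) for pos in "abcdefg"}


def window_scores(residue_scores, register, window, weighted=True):
    """Weighted mean of per-residue scores for every window start position.

    `residue_scores` are hypothetical values in [0, 1]; `register` gives the
    a-g label of each residue and must have the same length.
    """
    weights = [position_weights(weighted)[pos] for pos in register]
    scores = []
    for start in range(len(residue_scores) - window + 1):
        w = weights[start:start + window]
        s = residue_scores[start:start + window]
        scores.append(sum(si * wi for si, wi in zip(s, w)) / sum(w))
    return scores


# Two heptads in which only the core positions score well (made-up values).
register = "abcdefg" * 2
toy_scores = [1.0 if pos in "ad" else 0.0 for pos in register]
print(window_scores(toy_scores, register, 14, weighted=True))
print(window_scores(toy_scores, register, 14, weighted=False))
```

With these weights the two core positions of a heptad contribute as much as the five remaining positions (2 × 2.5 = 5 × 1), so a segment cannot score well on hydrophilic positions alone.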
Unlike PCOILS, MARCOIL is a windowless HMM-based tool for the detection of canonical coiled-coil regions. Its output is similar to that of PCOILS, with a graphical representation of the posterior probabilities along the sequence and a probability list with the corresponding predicted per-residue heptad frame. The performance of MARCOIL is comparable to that of PCOILS but, like PCOILS, it is highly susceptible to false positives in highly charged sequences.
DeepCoil and its updated version DeepCoil2 are the most recent addition to the Toolkit's repertoire of coiled-coil prediction methods. DeepCoil is a neural network-based method trained on more than 10 400 nonredundant canonical coiled coils of known structure. It detects both canonical and noncanonical coiled coils, including many that are undetectable with PCOILS and MARCOIL. DeepCoil predictions can be carried out based on a single sequence or an MSA provided by the user or built by the Toolkit using three iterations of PSI-BLAST over the nr90 database. While DeepCoil uses PSSMs generated by PSI-BLAST to capture evolutionary information, DeepCoil2 uses the pre-trained protein language model SeqVec that is based on the ELMo language model from the domain of natural language processing (Heinzinger et al., 2019;Peters et al., 2018). The output of DeepCoil is similar to that of PCOILS and MARCOIL, with a graphic summary of the predictions (Fig. 4b) and a text output with per-residue probability values; DeepCoil2 also predicts the heptad registers.
The decision on which of these methods to use is a trade-off between required runtime and accuracy. PCOILS and MARCOIL are extremely fast and good at predicting canonical coiled coils (Ludwiczak et al., 2019;Gruber et al., 2006;Li et al., 2016). However, they often assign high coiled-coil probabilities to highly charged sequences. DeepCoil and DeepCoil2 are slower but are more accurate and can detect noncanonical coiled coils.
The following recommendations are made.
(i) When using PCOILS, the probabilities assigned using a 28-residue window are best suited for detecting new coiled-coil regions in a protein of interest, whereas 14-residue windows are good for defining boundaries and heptad registers of detected coiled coils.
(ii) Since PCOILS is biased towards highly charged sequences, predictions should be made and compared using the weighted and unweighted modes and corroborated further using DeepCoil2.

Integrative annotation of sequence features
Sequence features are segments that confer specific characteristics on a protein and are important for its function or structure. These include not only domains, short sequence motifs and repeats, but also secondary structure, intrinsically disordered regions, transmembrane segments and signal sequences. While the prediction of secondary structure and intrinsically disordered regions provides information complementary to the annotations carried out with HHpred and repeat detection, their annotation is especially important when no homology is found to any protein of known structure. Transmembrane segments and signal sequences, on the other hand, provide additional information regarding the possible cellular localization of a protein of interest.
If a signal sequence is detected, a notification is displayed at the top of the output page. Additionally, if the first 20-35 residues of a protein are predicted to be disordered (Fig. 5b), this segment is highly likely to be a signal peptide. However, while SignalP also predicts the potential secretory pathway that a query protein is targeted to based on whether it originates from a eukaryote, an archaeon or a Gram-positive or Gram-negative bacterium (Almagro Armenteros et al., 2019), Quick2D does not display such information. Similarly, while TMHMM and Phobius also predict the topology of membrane segments, Quick2D does not. Such information could be obtained using the SignalP, TMHMM or Phobius servers directly or using the TOPCONS metaserver (Tsirigos et al., 2015; https://topcons.net), which runs several methods for the prediction of transmembrane α-helices. We note that all search tools within the Toolkit (for example HHpred) also display a message if coiled coils, signal peptides or transmembrane segments are detected (Figs. 1a and 1b).
Quick2D does not predict transmembrane β-strands, such as those found in outer membrane β-barrels (OMBBs). OMBBs can be predicted using the HMM-based tool HHomp within the Toolkit, or using the external servers BOCTOPUS2 (Hayat et al., 2016) or BetAware-Deep (Madeo et al., 2020).
The following recommendations are made.
(i) To obtain further insights about a sequence predicted to contain a signal peptide or α-helical transmembrane segments, re-run the prediction using a dedicated server such as SignalP, TMHMM, Phobius or TOPCONS.
(ii) If a notification concerning the prediction of a signal peptide is displayed and the N-terminal part, i.e. the first 20-35 residues, is predicted to be disordered, it is likely to correspond to the detected signal sequence.
(iii) Always inspect the output of search tools (for example HHpred) within the Toolkit for notifications regarding the presence of sequence features.
Figure 4. Coiled-coil prediction for the hypothetical protein EHM23_20970 using PCOILS and DeepCoil2. The graphical output of each tool, including the secondary-structure prediction carried out by PCOILS with PSIPRED, is shown and aligned to provide a comparison.

Example: annotation of the hypothetical protein EHM23_20970
The hypothetical protein EHM23_20970 (EntrezID RPJ57313.1) is a 1051-residue putative protein derived from a sediment metagenome that was obtained from the sequencing of environmental samples collected at the North Dakota Cottonwood Lake Study Area and the Prairie Pothole Region wetland (Dalcin Martins et al., 2018). EHM23_20970 is thought to originate from an acidobacterium, and we came across it while studying the evolutionary relationships between the β-propeller domains in α-integrin, tachylectin-2 and proteins of the VCBS superfamily (Pereira & Lupas, 2021). Its β-propeller domain formed a distinct group with the β-propellers of several other hypothetical proteins.
A BLASTp search over the nr database (version of January 2021) resulted in 5002 hits to hypothetical proteins at an E-value cutoff of 10⁻³. While ten of these hits made full-length matches, the rest only matched its C-terminal segment (residues 700-1051) which, as we will see in the following, corresponds to its β-propeller domain. Running three iterations of PSI-BLAST yielded the same result, suggesting that while EHM23_20970 is not a singleton, it is also not a close homolog of any hitherto characterized protein. Similarly, when we searched for homologs of known structure in PDB70 (version of January 2021) with HHpred, no full-length match was found (Fig. 1a). However, the obtained matches indicated the presence of two distinct regions: an N-terminal region from residues 60 to 700 and a C-terminal region from residues 711 to 1050. The best match to the N-terminal region was made by an all-α-helical region of human splicing factor 3B subunit 1 (PDB entry 5ife, chain C; probability 99.71%) and that to the C-terminal region was made by an all-β-fold lectin (PVL) from Psathyrella velutina (PDB entry 2bwr, chain A; probability 98.84%) (Fig. 1c). A subsequent HHpred search against the ECOD70 domain database indicated that the N-terminal region contains HEAT repeats (Jernigan & Bordenstein, 2015; Andrade, Petosa et al., 2001) and the C-terminal region contains a VCBS-like β-propeller domain (Fig. 1b). Both of these domains are repetitive: while the HEAT repeat comprises two α-helices connected by a short linker and is typically tandemly repeated 3-36 times to form open-ended solenoids, β-propellers are toroids built of 4-12 four-stranded β-meanders. By forwarding the aforementioned best-scoring templates, PDB entry 5ife chain C and PDB entry 2bwr chain A, to MODELLER, we built a preliminary full-length model of EHM23_20970 (Figs. 2 and 6b).
Figure 5. Sequence-feature annotation in the hypothetical protein EHM23_20970 using Quick2D. (a) The list of tools executed by Quick2D, depicting their target features. (b) Example output for the first 170 residues, highlighting its all-helical propensity and the presence of a putative signal peptide and an intrinsically disordered N-terminal segment.
To analyze the tandem repeats within the two domains of EHM23_20970 at the sequence level, we used HHrepID with default settings. We detected 20 short α-hairpins in the N-terminal region and seven four-stranded β-meanders in the C-terminal β-propeller region. These repeats are, however, quite degenerate and hard to find, especially in the propeller domain (Fig. 3a), and therefore they had to be manually realigned using the structural model as a reference (Fig. 6a). The obtained sequence alignments for the two regions highlight the presence of conserved sequence motifs (Fig. 1c): while the HEAT repeats show a pattern of hydrophobic residues characteristic of amphiphilic α-helices, the β-propeller contains seven conserved DxDGDGxxD sequence motifs. A very similar aspartic acid-rich motif is characteristic of the VCBS superfamily of β-propeller-containing proteins, especially PVL lectins, where it is usually involved in binding cations; this motif is also found in α-integrin (Pereira & Lupas, 2021; Rigden & Galperin, 2004; Rigden et al., 2011).
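Once a motif such as DxDGDGxxD has been recognized in an alignment, its occurrences can be located with a regular expression, reading x as any residue. The sketch below is a minimal illustration; the test sequence is synthetic and is not the sequence of EHM23_20970, and the function name is hypothetical.

```python
import re

# 'x' in DxDGDGxxD is read as any residue; the lookahead reports overlapping hits.
MOTIF = re.compile(r"(?=(D.DGDG..D))")


def find_motifs(sequence):
    """Return (1-based start, matched segment) for every motif occurrence."""
    return [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(sequence)]


synthetic = "MKT" + "DADGDGKPD" + "AAAA" + "DSDGDGRLD"  # made-up sequence
print(find_motifs(synthetic))
```

An exact pattern of this kind will miss degenerate copies of the motif, so it complements rather than replaces inspection of the repeat alignment.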
PSI-BLAST and HHpred searches for homologs of EHM23_20970 alerted us to the possible presence of putative coiled-coil regions, detected with PCOILS, and a signal peptide, detected with SignalP (Figs. 1a and 1b). While we could not detect coiled coils using the sensitive coiled-coil prediction method DeepCoil and manual inspection (Fig. 4), we detected a signal sequence using Quick2D (Fig. 5b), SignalP and TOPCONS. Put together, our annotation suggests that the hypothetical protein EHM23_20970 is a secreted, two-domain protein, with N-terminal HEAT repeats and a C-terminal seven-bladed, PVL-like β-propeller with seven conserved cation-binding motifs. Given that HEAT repeats are usually involved in protein-protein interactions and PVL is a lectin (Yoshimura & Hirano, 2016; Cioci et al., 2006), it is likely that EHM23_20970 is a secreted binder (perhaps a lectin) involved in a scaffolding role. However, it remains unclear whether it is a periplasmic protein or whether it is exported further across the outer membrane.

Summary
The MPI Bioinformatics Toolkit provides easy and integrative access to a wide variety of bioinformatics tools and databases. It includes tools for the annotation of sequence features, the detection of remote homologs and the generation of homology models. Most tools within the Toolkit are interconnected, allowing the output of one to be forwarded as input to another. Starting from the amino-acid sequence of a hypothetical protein (EHM23_20970), the combination of these tools allowed us to predict that it contains two repetitive domains, which are likely to be involved in macromolecular binding, that it contains seven putative cation-binding sites and that it is likely to be transported across the inner membrane. Although no full-length homologs of known structure are presently available for this protein, we could build a preliminary three-dimensional model for it. This knowledge could now be used to design more streamlined experiments for its biochemical and biophysical characterization or to solve its structure using molecular replacement.
In addition to the tools described here, the Toolkit offers several other useful tools such as CLANS (Frickey & Lupas, 2004), which allows the generation of sequence-similarity networks (SSNs) for the visualization of relationships in large protein sequence sets (see Gabler et al., 2020). Furthermore, we note that most of the analyses described in this article can also be performed using other web-based bioinformatics resources. For instance, the CBS (https://services.healthtech.dtu.dk/) and PredictProtein (Bernhofer et al., 2021) servers are excellent resources for the prediction of sequence features in proteins, the NCBI BLAST (NCBI Resource Coordinators, 2018) and EBI HMMER (Potter et al., 2018) servers for sequence-similarity searching, EFI-EST for the generation of SSNs (Zallot et al., 2021) and, for structure prediction, AlphaFold (Jumper et al., 2021; https://colab.research.google.com/github/deepmind/alphafold/blob/main/notebooks/AlphaFold.ipynb or https://github.com/sokrypton/ColabFold) and RoseTTAFold (Baek et al., 2021; https://robetta.bakerlab.org), both of which promise to revolutionize the field of structural biology.
Figure 6. The domain organization and structural model of the hypothetical protein EHM23_20970. (a) In the sequence annotation, the N-terminal signal peptide detected by SignalP is colored gray and the alignments of repeats in the two domains are shown. Predicted α-helices are highlighted in red and β-strands in blue, and the repetitive, putative cation-binding motif in the β-propeller domain is highlighted by a dashed box. The region predicted to be a putative coiled coil by PCOILS is underlined. (b) A full-length homology model constructed with MODELLER is shown; secondary structure is colored as in (a).