High-performance macromolecular data delivery and visualization for the web

This article provides a survey of available web services and tools for data delivery and visualization of macromolecular structures.


Introduction
Biomacromolecular structural data, originating from more than seven decades of intensive research, form a highly valuable and scientifically vital resource that provides a mechanistic understanding of biological systems. Currently, more than 166 000 experimentally determined three-dimensional structures of biological macromolecules are available from the open-access Protein Data Bank (PDB) that is jointly managed by the Worldwide Protein Data Bank consortium (wwPDB Consortium, 2019), with 200-300 new structures being added every week. Biomacromolecular structural data (atomic coordinates) obtained from macromolecular crystallography (MX) or electron microscopy (3DEM) are accompanied by experimental data: measured structure factors yielding electron-density maps or directly measured Coulomb electric potential maps. NMR entries in the PDB are represented by ensembles of 10-20 structures, reflecting the uncertainty in the structural model estimated from experimentally derived restraints.
The PDB archive continues to grow both in terms of the number of new entries (count of structures; see Fig. 1a) and also in the complexity and size of the structures deposited in the PDB (the size of individual structures; see Fig. 1b). The largest single entry is the HIV-1 capsid (PDB entry 3j3q), with 2.4 million atoms (Zhao et al., 2013). However, the entries in the PDB also serve as building blocks for much larger models of biological systems, such as the cellPACK (Johnson et al., 2013a(Johnson et al., ,b, 2015 model of the HIV-1 capsid in blood serum (Johnson et al., 2014), with nearly 68 million atoms, as shown in Fig. 2(a). The PDB-Dev (Johnson et al., 2013a(Johnson et al., ,b, 2014 prototype system collects gigantic structural models obtained using integrative/hybrid modelling; for example, the nuclear pore complex (PDB-Dev ID PDBDEV_00000012; Kim et al., 2018), which contains 243 000 residues (represented as coarse elements; see Fig. 2b).
Value-added annotations provide the necessary biological context for the macromolecular structure data. The increasing amount of value-added annotations include information about many biological properties (mutation positions and effects, functional and active sites, channels and pores etc.), physicochemical properties (charges, flexibility etc.) and structural properties (ligand binding, various quality criteria etc.). These data are stored in multiple databases, and many of them are collected in the PDBe-KB database (PDBe- KB Consortium, 2020) and the RCSB PDB database (Goodsell et al., 2020). Moreover, not surprisingly, the number of available annotation types is also growing.
Biological macromolecules are inherently dynamic in nature, and hence the research community is not only interested in the static data available in the PDB but also generates information on the dynamics of the structures using molecular dynamics and, increasingly, experimental techniques such as electron microscopy, X-ray free-electron lasers and serial crystallography. These data are the basis of various complex analyses and examinations providing mechanistic insights into the function of biological macromolecules.
Visualization of these scientific data is pivotal in the analysis of biomacromolecular structures by the broader scientific user base, with most non-structural biology users expecting access to these data via web-based visualization tools. Specifically, researchers require a fast, responsive and fully interactive web visualization of all of the data mentioned above: atomic coordinates, experimental data, annotations and molecular dynamics.     Here, we describe the currently available cutting-edge web services that address these challenges.
Firstly, we focus on data-delivery problems. We then discuss the visualization of a single structure, including experimental data and annotations, and conclude with a focus on dynamic data and the visualization of structural ensembles.
For alternative surveys of tools for molecular visualization, we direct the reader to Martinez et al. (2020) and Miao et al. (2019).

Data delivery: coordinates
Until now, the naïve approach of delivering the complete coordinate model worked owing to the modest size of the macromolecular structures studied using traditional structuredetermination methods. This approach was used successfully even when only a small part was visualized, such as a binding site or a cartoon representation of a backbone without side chains. With rapid advances in structure-determination techniques, increasingly large macromolecular machines are within the reach of structure-determination studies (Zhao et al., 2013;Kim et al., 2018). The naïve approach of delivering the complete coordinate model is increasingly inadequate because of its low performance for large systems. To alleviate this issue, advanced approaches such as selective data delivery are required.
One approach is to precompute static subsets of data on the server for delivery, but this would require every possible combination to be precomputed, making this approach unfeasible. Another more practical approach is to create a system with a dynamic query language to compute the subsets of macromolecular structures on the fly  for efficient data delivery. This second approach was further developed in the LiteMol suite (Sehnal et al., 2017), which includes a module called CoordinateServer. Coordinate-Server performs an on-the-fly selection of several critical parts of the biomacromolecule; for example, backbone, side chains, heteroatoms, atoms necessary for cartoon model visualization, chains, residues and ligands including their surroundings etc. Based on the user request, a relevant subset of atoms is selected by CoordinateServer and delivered for further analysis or visualization. The same approach of selective delivery is also implemented in Mol* . CoordinateServer (and its Mol* successor ModelServer 1 ) is available for use by any software that supports the standard mmCIF format for the storage and efficient delivery of macromolecular structure data.
Further improvements were made to the data-delivery process by introducing novel coordinate-file formats requiring markedly less space than standard PDBx/mmCIF and PDB files; namely, the MMTF format (Bradley et al., 2017) and the BinaryCIF format (originally introduced as part of the LiteMol suite), which was later integrated into Mol*. Fig. 3(a) shows a marked reduction in the size of the data delivered for the visualization of the HIV-1 capsid via the LiteMol suite.
In parallel, enhancements can be achieved by utilizing an efficient in-memory representation of the macromolecular structure. Fig. 3(b) illustrates the efficient memory usage in Mol* that is necessary to represent large viral assemblies.
2.2. Data delivery: volumetric data (electron-density and electric potential maps) Volumetric data such as electron-density or electric potential maps are markedly larger than coordinate files. For example, the Zika virus capsid (PDB entry 5ire) electric potential map file (EMDB entry EMD-8116; Sirohi et al., 2016) is 1.6 GB when stored using the CCP4 format. This amount of data is prohibitively large for display in a web browser. Therefore, a more efficient system for data delivery is even more crucial than for coordinates. The selective delivery of electron-density and electric potential maps was first   introduced in the LiteMol suite. It includes DensityServer, which can adaptively downsample, slice and compress the volumetric data. For the Zika virus capsid data set, it can reduce the size to 1 MB while still maintaining visual fidelity (see Fig. 4).
In Mol*, DensityServer was superseded by VolumeServer (available to users at https://maps.rcsb.org/), which adds support for more data formats. Web tools that support DensityServer/VolumeServer are JSmol (Hanson et al., 2013), the LiteMol suite and Mol*.

Data delivery: annotations
Many tools and biological databases provide enriched annotations for macromolecular data. Some of them (for example, structure-validation data, as collected in wwPDB validation reports; Gore et al., 2017) are part of the Protein Data Bank FTP site. Many others are collected together in PDBe-KB, which is a community-driven resource for structural and functional annotations. Specifically, PDBe-KB currently contains more than 500 million manually curated or predicted residue-level annotations for PDB structures obtained from close to 20 partner resources: for example, residue depths, binding-site predictions, ligand interactions, interaction interfaces, backbone flexibility predictions, solvent accessibility, kinase-target predictions, molecular channels, functional site predictions, druggable pocket predictions, energetic consequences of mutations, curated regulatory sites, curated post-translational modification (PTM) sites, curated catalytic sites, mutations in the human proteome, curated metal-binding sites and short linear motifs. PDBe-KB data are integrated with the core PDBe data in a graph database. Weekly snapshots of the graph database index are made available on the PDBe FTP area (ftp://ftp.ebi.ac.uk/pub/ databases/msd/graphdb/). To ensure consistent and robust access to all of the PDBe-KB data, PDBe-KB also maintains a REST API (implemented in the Flask framework for Python), which includes 50 public endpoints. This API provides a source of annotation for web visualization tools.

Visualization: coordinate and experimental data
NGL (Rose & Hildebrand, 2015), LiteMol and Mol* can provide interactive and highly responsive visualizations. Electron-density and electric potential maps can be displayed using JSmol, the LiteMol suite and Mol*. The LiteMol suite visualizes either experimental data for the whole structure (an overview, no details depicted) or for individual residues and their surroundings (only part of the data is shown). Mol* uses a similar approach to the LiteMol suite. JSmol, the LiteMol suite and Mol* all have support for DensityServer/Volume-Server. Mol* is the successor to the NGL and LiteMol viewers, combining the strengths of both viewers.

Visualization: annotations
Annotation data include structural properties, physicochemical properties, charges and biological properties. They can be based on the whole structure or only a subset, such as the ligands.
Because validation annotation data are directly supported in the PDB archive, they are also integrated into the LiteMol suite, i.e. residues are coloured according to a value that describes the cumulative number of validation problems that occur in the residue (see Fig. 5a). In this way, residue quality is also included in JSmol and NGL. Mol* offers even more detail: each residue can also be coloured according to several individual validation problems (atom clashes, Ramachandran outliers etc.). Moreover, the LiteMol suite and Mol* also integrate information about the quality of ligands (e.g. chirality problems; see Fig. 5d). The LiteMol suite and Mol* also include nomenclature annotations, i.e. Carbohydrate Symbols (3D-SNFG; see Fig. 5f ).
External tools and data resources deliver other types of annotation data, and their visualization is performed on the individual web pages. Specifically, these resources integrate and customize a web visualization tool and utilize it for displaying niche data specific to that particular resource. This integration and customization can be very straightforward, such as merely colouring residues or atoms according to some property (for example partial atomic charge; see Figs. 5b and 5f ), or may require some code extensions, such as showing defined protein surfaces, which is necessary for channel visualization (see Fig. 5c). Other similar examples include Fragalysis (https://fragalysis.diamond.ac.uk), a web-based platform for fragment-based drug discovery, ProteinPlus Reduction of data delivery for Zika virus (PDB entry 5ire) in the LiteMol suite by the application of DensityServer. The data size is reduced from 1.6 GB to 1 MB by lowering the resolution while maintaining visual fidelity. For a live version, see https://v.litemol.org/?example=zika-cryo-em. modelling support server, and 3DBionotes-WS, which automatically annotates biochemical and biomedical information onto structural models (Segura et al., 2019). First of all, online data-storage capacity increased to allow the storage of gigabytes of data daily. Molecular-dynamics trajectories are thus shared in general-purpose scientific datastorage systems such as Zenodo (https://zenodo.org/), OSF (Center for Open Science; https://osf.io) and FigShare (https:// figshare.com), on the webpages of individual institutions or within specialized journals such as Scientific Data (Hoffmann et al., 2020). This is the method that is used for the sharing of simulation data within the NMRlipids (Botan et al., 2015) project and the current BioExcel/MolSSI COVID-19 repository (Amaro et al., 2020;https://covid.molssi.org/simulations/). This approach allows data sharing, but it does not provide easy-to-use visualization, nor does it support any form of unified metadata describing the simulations.
Visualization can be provided based on the download and visualization of individual frames, such as with MDsrv (Tiemann et al., 2017), HTMol (Carrillo-Tripp et al., 2018, JSmol or even with generic 3D viewers such as Autodesk360 (https://a360.autodesk.com/). Autodesk360 was used for the visualization of cellPACK ensembles (Johnson et al., 2016). JSmol is used in the BiGNAsim repository for the visualization of molecular-dynamics simulations of nucleic acids (Hospital et al., 2016). However, JSmol is only capable of reading very few trajectory file formats and requires expert knowledge to use, and also has size limits. MDsrv together with NGLviewer (Rose & Hildebrand, 2015) was used in the novel GPCRmd platform (http://gpcrmd.org/), which offers insight into the 3D-GPCRome (analysis of the intrinsic flexibility of the structures of G-protein-coupled receptors; GPCRs), which are the primary targets for many drugs and endogenous signalling pathways (Rodríguez-Espigares et al., 2020). To date, GPCRmd contains more than 580 moleculardynamics trajectories along with an interactive on-site analysis provided by MDtraj (McGibbon et al., 2015) and Caver 3.0 (Chovancova et al., 2012), and custom scripts for more than 200 GPCR systems. GPCRmd thus allows comparative analyses of the whole GPCR protein family to be performed in an open, collaborative and reproducible manner. Furthermore, this is currently a goal of many approaches in the field of molecular simulation (Abraham et al., 2019).
However, network capabilities limit making such approaches available on a general basis, as molecular-dynamics simulations can generate almost limitless amounts of data depending on the system size and sampling. Downloads of complete simulation data are thus slow owing to their size, which often can range from hundreds of megabytes to terabytes or more. Since the analysis of such data requires only a subset of all the generated data, it is probably the best option to transfer only the subset of data necessary for the analysis (i.e. visualization for insight and a list of critical molecular variables over time). A possible solution would be the utilization of selective data delivery that enables web visualization to speed up. MDSrv enables the selection and display of a specific frame and atomic selection. Mol* allows the animated visualization of the trajectory, and hence optimization of the data-delivery protocol via ModelServer allows the more efficient online visualization of molecular-dynamics trajectories.

Conclusion
Data delivery and responsive web visualization of biomacromolecular structures, including their experimental data, annotations and data on dynamics, is a large challenge in the utilization of macromolecular structure data, especially since the sizes and numbers of snapshots of structures are still growing. To solve this complex challenge, relevant algorithms and software solutions have been developed and are in the process of further improvement. Specifically, selective data delivery, providing only the coordinate data necessary for visualization, was introduced in the LiteMol suite (using CoordinateServer and later improved by ModelServer in Mol*) and adopted by NGL and JSmol. In parallel, the selective delivery of volumetric data was implemented first in the LiteMol suite (via DensityServer) and then also in NGL. This was followed by the development of VolumeServer, which is integrated with Mol* and JSmol. Annotation data are collected into PDBe-KB and are accessible using its REST API. The visualization tools LiteMol suite, NGL, JSmol and Mol* benefit from these data-delivery services. These tools are integrated into many tools focused on providing structure annotations for specific domains. Structural ensembles can be visualized using MDSrv with NGL or Mol* to perform analysis of structure ensembles such as MD trajectories, and also support selective data delivery for its effective visualization, even for large systems. The future thus might lead to the establishment of online environments for structural data analysis based on cloud solutions, effectively speeding up their investigation.