Vagabond: bond-based parametrization reduces overfitting for refinement of proteins

Reparametrizing the positions and flexibility of protein models for refinement against diffraction data results in a substantial reduction in the number of parameters. This produces maps that are less affected by model bias, which allow the maps to more faithfully reflect the true crystal contents and reveal more biological information.

Structural biology methods have delivered over 150 000 high-resolution structures of macromolecules, which have fundamentally altered our understanding of biology and our approach to developing new medicines. However, the description of molecular flexibility is instrinsically flawed and in almost all cases, regardless of the experimental method used for structure determination, there remains a strong overfitting bias during molecular model building and refinement. In the worst case this can lead to wholly incorrect structures and thus incorrect biological interpretations. Here, by reparametrizing the description of these complex structures in terms of bonds rather than atomic positions, and by modelling flexibility using a deterministic ensemble of structures, it is demonstrated that structures can be described using fewer parameters than in conventional refinement. The current implementation, applied to X-ray diffraction data, significantly reduces the extent of overfitting, allowing the experimental data to reveal more biological information in electron-density maps.

Introduction
Overfitting bias arises from the inherent mismatch between the paucity of experimental data and the complexity and size of biological macromolecules. It is exacerbated by the currently universal atomistic description of molecular structure, in which structures are described by atomic coordinates (x, y, z) combined with some indication of flexibility (typically the so-called B factor, which applies a Gaussian blur to the atomic position) (Adams et al., 2010;Murshudov et al., 2011;Bricogne et al., 2017;Sheldrick, 2015). In the vast majority of cases this description requires a minimum of four parameters per atom to be refined against the experimental data. Additional parameters must be refined to permit more complex descriptions of macromolecular flexibility (anisotropic B factors, alternate conformations and rigid-body motions). In order to control overfitting, aspects of the model (bond lengths, angles, chirality, neighbouring B factors) are restrained using prior chemical knowledge, effectively increasing the number of observations (Konnert & Hendrickson, 1980), and a cross-validation metric (i.e. R free ; Brü nger, 1992) is used to check for undue overfitting. Despite these efforts, individual bond lengths in atomic models vary in a chemically unreasonable manner, frequently by over 0.01 Å from the expected values, in part owing to their systematic shortening due to anharmonic motion (Stuart & Phillips, 1985). In addition, quality metrics for the fit of the model to the data (for example R factors and real-space correlation coefficients) do not match the quality of the data. Thus, R factors generally stall 10-15% higher than the precision of the data itself (Holton et al., 2014). A number of papers have addressed the damage to biological interpretation from these ISSN 2059-7983 various pitfalls (Wlodawer et al., 2017;Dauter et al., 2014;Baker et al., 2010;Jeffrey, 2009). Early work in model refinement began in earnest in the 1980s (Hendrickson, 1985;Sussman, 1985;Tronrud et al., 1987) and some researchers began to explore hybrid models of positional and bond-based models (Sussman, 1985;Oldfield, 2001). Although maximum likelihood does not require a certain parameterization scheme, the success of maximumlikelihood methods cemented the use of the atomistic models with which they were co-developed, and these models have become standard procedure. One major beneficial consequence of atomistic parametrization is that R values can be rapidly brought down from those calculated with an initial molecular-replacement solution, often 35-40%, to an interpretable structure with R values of around 20-25%, with some manual structural rearrangement required by the researcher. However, after decades of improvement in X-ray data quality, overfitting issues remain in model-building tools and in refinement software, suggesting that they cannot be addressed using the current atomistic modelling scheme. Therefore, I have developed an alternative approach that reduces the number of parameters by $60% and is freely available as a new macromolecular refinement package, Vagabond. This paper provides a proof of principle that changing the parametrization scheme can clarify difficultto-interpret electron density, and that refinement is broadly stable across a random selection of entries from the PDB.

Methods
Defined terms used throughout this section are highlighted in bold and are defined in Table 1. Key terms are illustrated in Figs. 1(a) and 1(b). Symbols are defined in Table 2. This section is divided into a description of the model and the calculation of electron density, followed by the refinement method.

Generation of an initial bondbased model
Atoms from a PDB file are loaded into memory and the connectivity between atoms is calculated from the sequence, residue numbers and prior information of residue connectivity. Atoms listed under the heteroatom (HETATM) category are included using their original isotropic or anisotropic B-factor definition, and are not currently remodelled by the Vagabond refinement process. An anchor atom is chosen per amino-acid chain, from which the rest of the same chain will be generated. There is no clear dependency of the outcome of the refinement on the choice of the anchor atom, and so as a conservative choice, to avoid the extremes of the model, Vagabond defaults to the backbone N atom of the residue closest to the centre of mass of the chain. Backbone N atoms are not considered if they have an alternative conformation as defined in the PDB file. The initial positions from the PDB file are held in memory, and are hereby referred to as original positions. Canonical bond lengths, angles (Engh & Huber, 2001) and chirality are imposed on the model, whereas initial torsion angles are calculated for appropriate atoms from the original positions relating each set of four sequential atoms. A number of fixed torsion angles can also be enforced (such as within the tyrosine ring). Downstream bonds connect the anchor atom to the N-and C-termini. At branch points, such as carbons, carbonyl O atoms and all H atoms, multiple downstream bonds constitute sister bonds (Fig. 1b). The branched bonds off the main chain are all affected by the torsion angles along the main chain. The bonds for these branches are related to the main-chain upstream and sister  Table 1 List of commonly used terms in this text beyond those attributable to the standard description of the chemical structure.
These are also written in bold throughout Section 2.

Original position
Coordinate describing an atom position as found in the initial PDB file Atom point Coordinate at which a delta function describes a single potential position of an atom Atom-point cloud A set of atom points that cumulatively describe all modelled positions of a given atom Bond A chemical link between two atoms for which various parameters may be defined Point-to-point bond A link pairing an atom point each from one atom-point cloud connected by a bond (Fig. 1a Sister bonds are those which share the same upstream bond

Major atom
The major atom of a bond has an atom-point cloud inherited from the upstream bond

Minor atom
The minor atom of a bond is under direct control by the set of parameters of that bond Flexless structure Ensemble for all atoms in a chain calculated without any description of flexibility Real-space structure segment Calculated electron density for a portion of a structure on a cubic voxel grid Asymmetric unit structure Special case of a real-space structure segment corresponding to the full modelled contents of the asymmetric unit Unit-cell grid Grid of voxels with appropriate morphology such that the reciprocal-structure factors correspond to the amplitudes of crystal reflections bonds by two bond angles. H atoms are regenerated at appropriate torsion angles at a bond length of 0.968 Å .

Atom-point cloud generation
A cloud of points for the anchor atom is generated by taking the desired number of conformers (J; default 120) and arranging them into ten concentric spherical surfaces centred around the original position of the anchor atom (Fig. 1b), populated by a number of points proportional to the surface area of each layer. The outermost layer has a radius !, which has an initial value of 0.356 Å and may be changed during refinement. The atom points on each layer are arranged using a Fibonacci lattice (Gonzá lez, 2010) to produce a roughly uniform distribution on each layer.

Calculation of the flexless structure
All sequentially bonded atom-point clouds are recursively placed towards both the N-and C-termini, according to their relationship to the previous three atom-point clouds using only the bond lengths, angles ( n ) and mean torsion angles (t n ) for each atom n. Each atom point is uniquely paired with an atom point in each of the sequentially bonded atom-point clouds. This calculation does not include any description of flexibility, leading to a simple duplication of the initial atompoint cloud at every atom average position. The resulting ensemble is referred to as the flexless structure (Fig. 1c).

Calculation of whole-molecule movements
The set of whole-molecule movements provide rules upon which to apply rotations, translations and rotation-translation coupling, and are applied to the flexless structure before any other contributions from intramolecular flexibility. The description is inspired by TLS (Winn et al., 2001), but is not equivalent, as it defines an explicit and unique combination of rotations and translations for each atom point of the anchor atom, while TLS describes the average motion derived from   The translation is conferred by a symmetrical tensor T with six refineable parameters. The initial value of T is set to the identity matrix. To apply translations to each atom point, singular value decomposition (SVD) is performed on this matrix, As T is invertible, the matrix U contains the eigenvectors of this matrix in each column and the matrix W contains the eigenvalues. For each atom point in the atom-point cloud of the anchored atom in the flexless structure, an offset per atom point Áa j T is calculated (see Table 2), Any number of additional screw motions (a default of three, but user-changeable) are applied to these updated positions. Each comprises a rotation and a rotation-dependent translation. For each screw motion, a three-dimensional vector r defines the rotation axis and angle, and a three-dimensional vector w defines a rotation-dependent translation, of which the third vector component is set to zero. This constitutes five parameters per screw motion.
Let d j be the dot product between Áa j T and r for every atom point j. A rotation matrix D j is constructed to rotate by an angle of d j around the unit vector for r,r r. The corresponding point-to-point bond vector is multiplied by D j . This has the effect of 'fanning out' the conformers in the ensemble. Additionally, a basis matrix S is calculated as in (3) to produce an additional translational offset per atom point: Àr r 2 x r r zr r y 1 Àr r x r r z 0r r x r r ẑ r r xr r yr r z 8 > > > > > > > > < > > > > > > > > : This produces a matrix of orthogonal basis vectors. The vector Sw produces a translational offset in the plane perpendicular tor r, which is then applied as in (4). The magnitude of the second term in (4) is proportional to the value of d j and is therefore a rotation-dependent translation. Here, the summation symbol indicates the sequential application of each pair of w and r parameters.
The translated positions for each atom point of the anchor atom are then calculated as Áa j S þ ha n i. The application of whole-molecule movements only changes the values of the atom points and point-to-point bonds of the anchor atom directly. The global motions are then propagated through the rest of the polypeptide chain. Certain intramolecular move-ments are further applied to the atom-point cloud of the anchor atom in the next section.

Calculation of intramolecular flexibility
This section describes how torsion angles are allowed to vary between conformers of an ensemble, describing how the magnitude of the variation is defined and also the use of two variable angles to modify the axis upon which this variation is applied. Kick parameters associated with a bond confer flexibility through deviation of the torsion angles between the point-to-point bonds controlling the distribution of the atom points of the minor atom of the bond. The flexless structure is used to determine how the kicks will be applied to each pointto-point bond torsion angle. The plane P intersects the average vector of each bond hb n i and the previous bond hb nÀ1 i. The unit vectorp p n is perpendicular to plane P. A rotation matrix M n is defined in terms of two angles, n and n .  Table 2 List of commonly used symbols in this text. * denotes a potentially refinable parameter.

Symbol
Meaning J Number of atom points per atom-point cloud N Number of atoms in bonded chain . . . , n À 1, n, n + 1, . . .
Numbering scheme for sequentially bonded atoms n *Bond angle directly dictating the placement of the nth atompoint cloud t n *Mean torsion angle directly dictating the placement of the nth atom-point cloud k n *Magnitude of torsion deviations affecting atom point placements for the nth atom-point cloud a j n jth atom point of the atom-point cloud of the nth atom ha n i Atom average position of nth atom; ¼ Vector describing point-to-point bonds connecting the (n À 1)th atom point to the nth atom point; = ða j nÀ1 À a j n Þ hb n i Average bond vector connecting the (n À 1)th atom average position to the nth atom average position; = P J j¼1 ðb j n Þ=J Áa j n Displacement of the jth atom point from the atom average position for atom n; ¼ ðPa j n À ha n iÞ P n Plane intersecting ha n i and parallel to hb n i and hb nÀ1 î p p n Unit vector perpendicular to P n n , n **Two angles used to form a rotation matrix associated with the nth atom M n Rotation matrix M n defined in terms of n and n as in (5)  This custom rotation definition rotates around the x axis by an angle of . The multiplication of this rotation matrix with a vector in the direction of the y axis will produce vector y 0 . A rotation angle of is then applied around the y 0 axis. A single matrix describing the product of these sequential rotations is described in (5): sin 2 n ð1 À cos n Þ À cos n sin n sin n cos n Â ð1 À cos n Þ cos n sin n cos n À sin n sin n cos n sin n Â ð1 À cos n Þ sin n sin n cos n þ cos 2 n ð1 À cos n Þ 2 6 6 6 6 6 6 6 6 6 4 3 7 7 7 7 7 7 7 7 7 5 : This definition is used as this matrix can produce a custom eigenvector with two angles without the risk of gimbal lock.
The default values of n and n are zero, and therefore produce no effect, but may optionally be refined. For each atom point, a vector can be used to describe the displacement of the atom point from the average atom position, Áa j n . The scalar s j n is the sine of the angle between the vectors M np p n and Áa j n per point in the atom-point cloud. These are only ever calculated from the information in the flexless structure and are then stored for future use (Fig. 1d).
Each C atom-point cloud in the structure may have two kicks associated with it. Flexes onwards from the N-terminal and C-terminal sides of the anchor atom are considered to always be in the forward direction, with an associated kick parameter k n . The sign of k n is interchangeable with the direction of the custom eigenvector generated by n and n . The reverse direction has a kick parameter k 0 n . For forwarddirection kicks, the torsion-angle deviation Át j n per atom point is calculated as the product of its pre-calculated sine angles and both of the associated kick parameters of the bond, s j n ðk n þ k 0 n Þ. These deviations will then be propagated to the downstream terminus on calculation of the downstream bonds (Fig. 1e). Reverse-direction propagation is conceptually different as it involves breaking the recursion. In this case, for each atom point of the C atom, an appropriate rotation is applied to the atom point of the anchor atom in the same conformer. This rotation has an angle of Às j n k 0 n around the corresponding point-to-point bond vector b j n that has the C atom as its minor atom, and the centre of rotation is the C atom. Note that the s j n k 0 n term in the forward direction cancels out the Às j n k 0 n term in the reverse direction to terminate flexibility in the downstream bonds (Fig. 1f). The order in which these kicks are applied to the chain runs from N-terminus to C-terminus; the order choice is arbitrary but must be maintained for correct recalculation of the structure.

Calculation of real-space structure segments
In order to calculate a portion of real-space structure from the model, electron density is generated on a grid composed of voxels with cubic morphology, known as a real-space structure segment. The method described here differs from the usual method (Ten Eyck, 1977) as this method is computationally very slow for Vagabond models, and the lack of individual atomic B factors provides new opportunities for calculating maps with efficiency. For a nominal resolution d, the cube voxel length is d/4. The nominal resolution is taken as the highest resolution recorded reflection in the reflection list. This cube is capped at a maximum length of 0.8 Å . A cuboid grid is generated by choosing a number of voxels in each dimension capable of encompassing the calculated volume of interest (the bounding box containing the full ensemble) and an additional margin of 2 Å on each cuboid face.
Atoms to be inserted into the calculated regions are grouped by element and treated separately. Atoms currently fall into two categories. Some employ a B-factor-based model. These are recorded as HETATMs in the original PDB entry and have not been remodelled in the current study. The second category are those generated by Vagabond as detailed above. In the former case, the anisotropic or isotropic B factor associated with the atom produces an atomic distribution, which is first calculated in reciprocal space and then transformed to real space. This is added to the electron-density map at the appropriate grid points near the atom average position using 11-point interpolation as implemented in the General Averaging Program (GAP; Ginn & Stuart, 2016). For the Vagabond atoms, the atom-point clouds are added individually to the map. Each atom point will land between eight grid points and contribute some density to each. Eight smaller fractional cuboids are delineated by drawing dividing planes passing through the atom points orthogonal to each principal axis. Each voxel vertex is then assigned a proportional contribution according to the volume of the diagonally opposed fractional cuboid. Although this is not necessarily in the spirit of discrete sampling for Fourier transforms, it has the advantage of maintaining the total electron density per atom and tends towards similar behaviour with sufficiently small grid spacing. The addition of a large number of interpolated samples for each atom point at slightly different positions smoothens the map, protecting against Fourier truncation errors in a similar manner to applying a real-space B factor. Once all the atoms of a given element have been placed in the map, the map is transformed to reciprocal space. The reciprocal-space scattering factors of the given element from  (Brown et al., 2006) are used to calculate the appropriate amplitude for each structure factor. These two structure-factor lists are multiplied and then transformed to produce a real-space map of all atoms of a given element type. The partial maps for every element are then summed together to produce the final calculated real-space map.
Note that the reciprocal-space transformation of this map does not correspond to F model . It is generated on a cubic grid spacing with an unrelated origin and different dimensions to the crystallographic unit cell, and may also only contain a region of interest rather than the entire structure. In the special case where the region of interest encompasses the research papers entire modelled content of the asymmetric unit, this is referred to as the asymmetric unit structure.

Calculation of asymmetric unit structure factors
On the way to producing F calc structure factors, the asymmetric unit structure is remapped onto a unit-cell grid, which has a voxel grid spacing consistent with the unit-cell dimensions of the crystal. The voxel dimensions for the unit-cell grid are also chosen to sample the contents at a d/4 spacing. The contents of the asymmetric unit structure, as generated above, are added into the appropriate positions of the unit-cell grid. By looping through the voxel vertices within the bounding volume in the unit-cell grid, the appropriate fractional voxel position is calculated from the asymmetric unit structure. The density value at this position is estimated via 11-point interpolation (Ginn & Stuart, 2016) and added to the unit-cell grid. F expl_asu are the structure factors of defined atoms belonging to one asymmetric unit, calculated by taking the Fourier transform of this final real-space map.

Calculation of bulk-solvent model
The solvent mask is calculated for the unit-cell grid. A separate solvent mask is calculated from each conformer in the ensemble. Protein voxels are set to zero and solvent voxels are set to a nominal positive density of 1.0 for the mask of each conformer, and then all J solvent masks are averaged. Due to the lack of an F 000 measurement and a separate solvent scale factor, the exact value of the nominal positive density does not matter. To create each jth solvent mask, as previously determined (Jiang & Brü nger, 1994), non-H atom points of the jth conformer are used to mask out the voxel positions within the specific radii defined previously (Jiang & Brü nger, 1994) around their designated position by element or atom type. The solvent mask is then expanded by switching all model voxels that occur at most 0.4 Å away from a solvent voxel to solvent. The final stage removes small internal strips of solvent density which may have been retained inside the protein interior by setting solvent voxels to protein if they occur in strips of less than 2.0 Å along each crystallographic axis.
In order to make multi-conformer solvent-mask calculations feasible for large values of J, in terms of both memory consumption and computation time, solvent masks are calculated using bitwise operators over 32 bits of memory, allowing 16 masks to be calculated concurrently without overwhelming memory consumption and with an increase in computation speed within a single CPU thread. F solv_asu are the structure factors for the average bulk solvent belonging to one asymmetric unit, calculated by taking the Fourier transform of this final real-space map.

Application of space-group symmetry
The space-group operators corresponding to those listed in the original PDB header file are applied by cumulatively adding the symmetry-transformed complex Fourier coefficients of each structure factor to its symmetry-related Miller indices in reciprocal space. Reciprocal-space addition is preferred to avoid interpolation errors for space groups where symmetry operations would not correspond to integral voxel spacing in real space. This is performed separately for both F solv_asu , to produce F solvent , and F expl_asu , to produce F explicit .

Calculation of F calc structure factors
The F solvent structure factors are scaled using two parameters, an absolute scale k and a B factor B, such that ðF explicit þ F scaled solvent Þ has the highest correlation coefficient when compared with F observed , where h is the reciprocal-lattice point, F scaled solvent ðhÞ is the structure factor for a reciprocal-lattice point at vector h, d is the reciprocal of the magnitude of h, and (x) is the density of the solvent at fractional real-space position x. The values of the two parameters are re-refined during each recalculation of the weighted electron-density map using the simplex method of gradient descent for 100 cycles (Nelder & Mead, 1965), using initial step sizes of k = 0.4 and B = 40 Å 2 , with initial values of 0 and 40 Å 2 , respectively. The addition of the two sets of structure factors according to these chosen scales, ðF explicit þ F scaled solvent Þ, produces the set of F calc structure factors.

Calculation of weighted electron-density maps
F calc is scaled to F observed by multiplying the F calc values so that the average amplitude is equal to that for F observed in 20 equal-volume resolution bins in reciprocal space. A weighted electron-density map is then generated using a newly devised weighting scheme to generate the target for real-space refinement. This relies on the principle that downweighting reflections by multiplying the amplitude by a weighting factor is a simulation of uncertainty in phase angle. A standard error, s k , is calculated for each reflection k equal to the ratio between (F observed,k ) and F observed,k . In this weighting scheme, a Gaussian distribution of phase angles is determined for each acentric reflection. A downweighting term d k is defined as expðÀs 2 k Þ and lies between 0 and 1. An appropriate standard deviation for a Gaussian-distributed set of phase angles ( k ' ) must be calculated. Considering a circle of radius 1, inclusion of the radial point at each phase angle ' at its relative weight as determined by the Gaussian function N ½0; ð k ' Þ 2 produces a weighted arc. The centroid of this weighted arc should have a radius of d k . This centroid was calculated for values of k ' between 0 and 6.27 radians (%2) in steps of 0.01, covering values of d k between 0.0308 and 1. This is then used as a lookup table in reverse for establishing values of k ' for a given value of d k . The maximum value of k ' used is 6.27 radians. Each reflection is introduced into a unit-cell grid in reciprocal space at an amplitude of 2F o À F c in this current implementation. 25 separate Fourier transforms are calculated, where each reflection is included at its calculated phase plus an incremented phase shift ranging from À2 k ' to +2 k ' in 25 equal steps, and then multiplied by its Gaussian-derived weight. These are transformed to real space and summed to produce a new weighted density map for further refinement. Reflections included in the R free set are omitted from this calculation. By including a range of phases, this allows interference which would occur between structure factors due to the uncertainty in their phase, as the sum of the 25 contributions for each structure factor is no longer truly harmonic.

Vagabond refinement engine
Refinement is split into several modes: positional refinement against the original PDB file (mode 1), mainchain flexibility macrocycles comprising the refinement of one whole molecule (mode 2a) and intramolecular flexibility (mode 2b) against the electron density, and a final mode of side-chain refinement against the electron density (mode 3). After each stage of whole-molecule refinement, if the R work value increases the structure is reverted to the parameters in the previous cycle. This allows compensation to be made for more motion using intramolecular movement instead. If the value of R work is increased after three cycles of intramolecular refinement, the overall flexibility is incrementally reduced by a 10% reduction in the parameter ! for each polymer chain for as long as R work continues to decrease due to the contraction. Additionally, before the refinement of side-chain motion, the algorithm reverts to the cycle with the lowest value of R work . If side-chain refinement causes an increase in R work then these parameter changes are also reverted.

Positional refinement (mode 1)
Refined torsion parameters and bond angles are shown for the backbone and each amino acid in Fig. 2. The simplex method of gradient descent (Nelder & Mead, 1965 Refined torsion-angle (blue ink) and bondangle parameters (red ink) marked on the backbone and additional parameters for each amino acid. All marked torsion angles also have refined kick parameters apart from proline side-chain bonds and the peptide bond. unconstrained torsion-angle parameters of the bond-based model to match the mean absolute difference between the original positions reported in the original PDB file and the average atom positions calculated from the Vagabond model. This removes the initial error propagated through the chain due to imperfect initial torsion-angle estimations. Torsion angles around four sequential backbone bonds are considered in one simplex minimization run, extending from the anchor atom to the N-terminus and then from the anchor atom to the C-terminus. Atoms considered in evaluation of the target function include all major and non-H minor atoms of refined bonds, as well as non-H minor atoms of any sister bonds. This therefore includes the carbonyl O and C atoms. After minimization of four bonds, the starting bond is advanced by one and the process is repeated until the chain ends. Convergence is considered to be no shift of torsion angles beyond 0.005 or when a maximum of 60 simplex descent steps is reached. After all backbone atoms which may affect a given C atom have been refined, refinement of the torsion angles in the side chains is allowed to proceed. This scheme therefore prioritizes fitting the path of the backbone over the side chain to the original positions. This is carried out five times before calculation of the first electron-density map.
Two of these cycles are carried out without refining bond angles and with a torsion step size of 2 . A further three cycles allow bond-angle refinement for all C atoms and C atoms of aromatic residues with starting torsion-angle and bond-angle step sizes of 0.2 . This therefore prioritizes the reduction of errors with torsion angles before introducing bond angles into refinement.

Correlation coefficients between weighted and calculated electron density
When a portion of the structure is selected as a target for a number of cycles of refinement, several aspects of the realspace structure segment are calculated only once and reused, including the map segment position and dimensions. The exact choice of real-space structure segment are clarified in the following sections. A number of atoms may be selected for refinement (active atoms). The real-space structure segment containing this portion of the structure will encompass additional atoms (surrounding atoms). The real-space structure segment for the surrounding atoms is calculated once in a separate map in memory and added to the calculated density on each cycle of refinement. On each update of the model, the set of atom-point clouds for the active atoms are recalculated and the real-space structure segment is recalculated. This segment will fall in a region of the weighted electron density. In order to compare the calculated and the weighted structures, a list of voxels is prepared in the coordinate system of the unit-cell grid which map onto the full volume of the realspace structure segment. Each weighted density grid point will then fall upon a fractional voxel coordinate in the real-space structure segment. The calculated density is then estimated via 11-point interpolation, both for the active-atom map alone and for the full map corresponding to the addition of the active-atom map and the surrounding-atom map. A Pearsonweighted correlation coefficient is calculated comparing the weighted density values with the full calculated map density values, and the weights are for the active-atom map values alone. This target function is then maximized by changing various model parameters.

Whole-molecule refinement (mode 2a)
Each cycle of whole-molecule refinement will cycle through each polymer chain in the asymmetric unit, choosing the limiting real-space structure segment which contains all starting atom point positions of the polymer backbone with a 3 Å margin. The eventual target is to have three screw motions defined, which are introduced one by one during the first macrocycle. Initially only a single screw motion is introduced. Each of the following refinement protocols occur using the simplex method of gradient descent for a maximum of 100 cycles. The target function is the correlation coefficient for the set of active atoms encompassing the backbone atoms of the given polymer chain, including C atoms.
For an introduced screw motion, a new rotation direction is determined by testing 31 roughly equally spaced directions generated using a Fibonacci lattice. This is followed by refinement of the three parameters of r, and then a separate refinement of the two parameters of w for all existing screw motions. In the first cycle, second and third screw motions are introduced directly afterwards. Finally, the six components of the translation tensor T are refined. In subsequent macrocycles no additional screw motions are introduced, but the r, w and T parameters each receive one round of refinement against the target function.

Intramolecular flexibility refinement (mode 2b)
Using the same real-space structure segment as in mode (2a), bonds with similar effects are grouped for refinement in batches. To determine the batches, each bond receives a similarity score with every other bond under the following regime. For every pair of bonds in a polymer, each C atom is considered in turn. For each atom point of the C atom, the direction and magnitude of a rotation around each of the point-to-point bonds in the bond pair is known. The similarity of the directions and magnitudes of the potential rotation vectors are scored against one another by the cosine of the angle between them, multiplied by the smaller ratio of the two magnitudes. This score is multiplied by both of the s j n values associated with the atom points of the minor atoms of the two bonds. These scores are unity in the case of a perfect match. These are summed over all atom points of a C atom and are further summed over the set of C atoms of the same polymer. If two bonds (p, q) have a score of x, the bonds (q, p) have a score of À x. The value of (p, p) is set to zero instead of unity. SVD is performed on the full set of bond-bond relationships. Either the top ten vectors, or the number required to cover 50% of the total sum of decomposed vector lengths, whichever is larger, are considered for refinement. For each cluster, the k and k 0 parameters can be varied independently, thereby research papers producing a minimum of 20 refined parameters at this stage. For each round of refinement, the relative contribution of a cluster to the values of k and k 0 associated with each bond are calculated and summed over all clusters for each bond. The target function, as for whole-molecule refinement, is the correlation coefficient for the active atoms encompassing the backbone and C angles of the polymer chain. This is repeated once for each polymer chain.

Side-chain refinement (mode 3)
Every side chain undergoes refinement in turn. The chosen real-space structure segment bounds all atoms in the amino acid being refined with a 3 Å margin. For each side chain, the target function is the correlation coefficient against the realspace structure segment for the entire monomer, including the backbone, as the set of active atoms. The parameters refined are the side-chain torsion and refinable bond angles (step size of 0.1 ), kick parameters (a dimensionless step size of 0.5) and and angles (step size of 30 ).

Choice of PDB entries from the Protein Data Bank
111 803 structures were marked as suitable for refinement; they were required to contain X-ray data and no nucleic acid polymer chains. Resolution was limited to between 1.2 and 3.5 Å . Of these, every hundredth structure was taken for benchmarking. Structures were removed if they contained alternative conformers of varying sequences, and other edge cases were removed (such as incomplete amino-acid backbones). This left a total of 920 structures in the benchmarking set.

Results
Vagabond describes a macromolecular structure as a series of deterministically related conformers, collectively forming an ensemble, with one ensemble calculated for each protein chain in the model. Each conformer comprises a network of bonds connecting atoms (Rice & Brü nger, 1994). The construction of the ensemble starts from a single anchored atom, which is defined with an absolute position within the crystallographic unit cell. Each atom is described by an atom-point cloud (Figs. 3a and 3b). Applying a forwards and backwards kick to a C atom spreads the distribution of every atom accordingly (Fig. 3c). The number of atom points calculated in each atompoint cloud determines the level of sampling of the statistical distribution, but does not affect the overall flexibility or the number of parameters used to describe the model (Fig. 3d). This description of flexibility allows a bulk-solvent model comprising the sum of all solvent distributions calculated for each explicit individual conformer. The workflow for refinement in Vagabond is summarized in Fig. 3(e). For the present purposes, models from the endpoint of refinement using current state-of-the-art atomistic refinement serve as the input to generate a starting model for Vagabond. The vagabond GUI requires two inputs: a reflection list supplied as an MTZ file and atomic coordinates supplied as a PDB file. Output files from Vagabond are weighted structure factors (MTZ file) for the electron density, summary output coordinates (PDB files, average and ensemble) and a Vagabond (.vbond) file format to store the bond-based model description.
The following analysis will discuss model bias. In this paper, the term 'model bias' is used to refer to the total observable effect of model-derived errors on the electron density. This comprises overfitting bias, noise bias and phase bias. Overfitting bias is caused by errors in the model definition itself and/or insufficient data to support its refinement. Noise bias involves misfitting a model to noise in the data, and can also exacerbate overfitting bias. Phase bias is the intrinsic unavoidable effect of using estimated phases in the Fourier transform from an incorrect model, leading to a bias in the electron density towards the incorrect model. This phase bias cannot be avoided in Vagabond, but model bias can be reduced through a reduction of the overfitting bias.
Describing structures through torsion-angle-mediated flexibility facilitates the subdivision of a protein into constituent domains and subdomains. Fig. 3( f ) shows the effect of introducing equal backwards and forwards flexibility in each C backbone torsion angle on the change in r.m.s.d. for each C atom-point cloud for the Streptococcus pneumoniae ABCtransporter protein FusA (Culurgioni et al., 2017). This can be visually segmented into regions corresponding to domains (and subdomains) within the polypeptide chain, each of which has been given distinct colours (Fig. 3g). This simple analysis should also facilitate the automated definition of appropriate domains for TLS refinement.
To objectively investigate the applicability of Vagabond, 920 PDB entries were used as a benchmark set (Section 2). The data and the corresponding models were taken from the PDB-REDO server (Joosten et al., 2014), which generates optimized structures using the latest algorithms and a state-of-the-art refinement engine. Vagabond was run with default settings (no user-defined per-data-set parameters) across this data set to produce mean R factors of R work = 24.5% and R free = 27.1%, which are compared in Figs. 4(a) and 4(b). These are higher than the original mean R factors from PDB-REDO (R work = 18.5%, R free = 22.2%) but exhibit a substantial reduction in the R work /R free gap (Fig. 4c), which is a generally accepted metric for the degree of overfitting (Brü nger, 1992). This reduction remains where Vagabond achieves comparable R free values. Thus, where the Vagabond-derived R free values are within 4% of the original value, the R work /R free gap is 2.5%, compared with the original value of 3.9%. Structures at a higher resolution than 1.5 Å have the largest discrepancy between the original and Vagabond R free values (6.7%). The discrepancy is smaller for the remainder of the structures (4.8%). The B-factor equivalents derived from each C atom-point cloud in a structure sometimes show very high correlation with the original B factors in the PDB file (Fig. 4d). For 224 structures (average resolution 2.7 Å ), PDB-REDO did not attempt to model individual B factors and these results are not shown. In several cases the maps from Vagabond show additional information beyond that gained from conventional refinement. Two examples are detailed below.

research papers
The first example shows that even at high resolution, atomistic refinement overfitting can obscure electron-density maps. The small immunoglobulin-binding domain G1 is a frequent target for in vitro evolution and computational design in protein engineering (Wunderlich et al., 2007;Ross et al., 2001;He et al., 2005;Thoms et al., 2009;Reinert & Horne, 2014;Tavenor et al., 2016). PDB entry 2on8 is an engineered mutant solved at 1.35 Å resolution (Wunderlich et al., 2007). The original deposition had many lysine side-chain atoms reduced to an occupancy of 0.01; after correction and atomistic refinement with anisotropic B factors, R work and R free reached 16.0% and 20.7%, respectively. Refinement with Vagabond produced similar derived relative B factors to the original model (Fig. 5a), with the exception of an absolute B-factor offset and a reduced B-factor peaks in loops. This resulted in higher R factors, except in the lowest resolution bin (Fig. 5b), but with a smaller R work /R free gap (24.7% and 26.3%, respectively). Vagabond clarified the original density (Fig. 4c) for Lys10 to show a flipped peptide bond (Fig. 5d), which was remodelled in Coot ( Fig. 5e; Emsley et al., 2010) to occupy a different valid minimum in the Ramachandran plot (Fig. 5 f). This also explains the reduced peak in B factors in the loop region including Lys10, as this has been modelled with a single conformation in atomistic refinement, whereas the true structure is likely to have a major occupancy for the flipped Lys10 peptide bond and minor occupancy for the unflipped peptide bond. The B factors from the original atomistic distribution would be likely to inflate under these circumstances in order to attempt to compensate for the assignment of full occupancy to a single conformer in this region.  However, for the purposes of preserving the number of parameters in refinement for a clear comparison, the following discussion will not consider modelling alternative conformations, but switching the predominant conformation modelled. The deposited structure has the same conformation as the molecular-replacement model from which it was derived (PDB entry 1pgb, no X-ray data deposition, not shown). When refined with REFMAC5 (Murshudov et al., 2011), the remodelled structure showed a reduction in the R work /R free gap (16.0/20.2%). Re-running refinement on this structure also reduced the R work /R free gap in Vagabond (25.0/25.7%).
To objectively judge whether Vagabond was providing a stronger indication of this required correction, the real-space correlation coefficients (Adams et al., 2010) for the backbone of the tripeptide (residues 9-11) in the corrected and original states were calculated against electron density for maps calculated using both models with sufficient refinement using the chosen software. This was carried out using both REFMAC5 and Vagabond (the results are summarized in Table 3). In the case of REFMAC5, the correlation coefficient was higher for whichever model was used to generate the input map. Comparing the non-input and input models, this   correlation increased from 58.5% to 59.4% against the map generated from the corrected model, but also increased from 56.6% to 57.0% if the noncorrected model was used as the input. These maps were therefore unable to clearly distinguish between the true observation and the calculated model due to bias from the input. On the other hand, Vagabond shows a clear favour towards the corrected model regardless of which is used for the input map: if the correct model is used to generate the map, the incorrect model correlation of 53.6% is significantly lower than that of the corrected model, 59.6%. However, contrary to conventional refinement, this preference holds even in the case where the input model is incorrect, as the correlation increases from 53.0% for the incorrect model to 58.0% for the correct model. This shows that Vagabond refinement produces maps which are not as strongly biased by the input model.
The second example demonstrates the clarification of electron-density features in a binding site achieved using Vagabond. The S. pneumoniae ABC-transporter protein FusA (PDB entry 5g5y, 1.73 Å resolution; Culurgioni et al., 2017) was re-refined using Vagabond to produce R factors of R work = 20.6% and R free = 22.8%, reducing the R work /R free gap compared with the PDB-REDO calculation (R work = 15.7%, R free = 18.7%). Upon refinement with Vagabond, density in the ligand-binding site (Fig. 6a) (Merritt & Bacon, 1997). ( f ) A Kleywegt plot drawn using Coot shows large movement of three residues, of which Lys10 switches from one minimum in the Ramachandran plot to another. Table 3 Comparison of real-space correlation coefficients for the analysis of PDB entry 2on8 in the text using main-chain atoms for the tripeptide residues 9-11. Atomistic parametrization was provided by REFMAC5 and bond-based parametrization by Vagabond.

Atomistic
Bond-based  (Fig. 6b). Although this unexpected ligand is not identifiable, the density packed against Trp314 is consistent with a pyrimidine ring (demonstrated by modelling methylguanine; Fig. 6c). This structure had been assumed to be of the apo state, consistent with the observed domain structure. Other structures (PDB entries 5g62, 5g61, 5g60 and 5g5z) show that cognate ligands can successfully displace this contaminant and cause functional domain shifts, none of which exhibit stacking interactions against Trp314. This suggests that Trp314 has bound to a low-affinity metabolite in an unusual binding mode which does not trigger domain closure. The ability to pinpoint such weak binding events may facilitate the detection of unsuspected druggable binding sites. Once again, the Vagabond flexibility model follows that derived from atomistic refinement (Fig. 6d).

Conclusions
This study has revisited the fundamentals of how the structural biology community defines macromolecules. The use of atomistic parametrization to refine crystallographic models has been standard since the introduction of the maximumlikelihood method in the late 1990s (Bricogne & Irwin, 1996;Murshudov et al., 1996). Since then, macromolecular refine- (a) For PDB entry 5g5y, averaged weighted density over two copies in the asymmetric unit (mF o À DF c , 1), 0.7 level, and difference density (F o À F c ), 2.7 level, for the PDB-REDO structure recalculated without water molecules. (b) Refinement with Vagabond from the initial PDB file without water molecules; averaged weighted density over two copies in the asymmetric unit (2F o À F c and phase spread, Section 2), 0.7 level, and difference density (F o À F c ), 2.7 level. (c) Methylguanine placed using Coot real-space refinement to demonstrate the fit to density. (a), (b) and (c) were drawn in Coot (Emsley et al., 2010) and Raster3D (Merritt & Bacon, 1997 ment has been treated mostly as an optimization problem (Afonine et al., 2012), and the model definition itself has not been successfully revisited, except for further variations on the theme of atomistic schemes. Vagabond now provides an alternative method. There is still considerable scope for improvements, such as ensuring a comprehensive fit of backbone flexibility parameters to electron density, refining the details of the interaction between the protein and the solvent, and finding robust validation metrics for these bond-based structures.
Validation is an open area of investigation, as validation metrics tuned to Cartesian refinement are unsuitable for Vagabond models. For example, bond length and angle geometry should not be imposed on the average structure, but individual conformers will report near-perfect results. As the benefits of this model are independent of the refinement method, these should, for instance, provide a fundamentally fresh approach to modelling flexibility in cryoelectron microscopy data, where B factors are currently biochemically nonsensical (Wlodawer et al., 2017). The results presented here suggest that the potential impact of this more biochemically relevant parameter space may extend to fields such as structural bioinformatics and molecular dynamics. Often, these fields balance atomic motions against bond geometry within the target function, whereas Vagabond, by incorporating bond geometry into the parameter space, aligns it with the most biochemically accessible motions.

Code availability
This software is distributed as free, open-source software under the General Public Licence (GPL) version 3, with both a command-line and graphical user interface. Vagabond (and the libraries on which it depends) can be installed on Linux or Mac OS X without expense. See https://vagabond.hginn.co.uk for download, a manual, installation instructions and documentation.