Computational strategies and challenges for using native ion mobility mass spectrometry in biophysics and structural biology

Native mass spectrometry (MS) allows the interrogation of structural aspects of macromolecules in the gas phase, under the premise of having initially maintained their solution-phase non-covalent interactions intact. In the more than 25 years since the first reports, the utility of native MS has become well established in the structural biology community. The experimental and technological advances during this time have been rapid, resulting in dramatic increases in sensitivity, mass range, resolution, and complexity of possible experiments. As experimental methods are improved, there have been accompanying developments in computational approaches for analysing and exploiting the profusion of MS data in a structural and biophysical context. Here, based on discussions within the EU COST Action BM1403 on Native MS and Related Methods for Structural Biology with broad participation from Europe and North America, we consider the computational strategies currently being employed by the community, aspects of best practice, and the challenges that remain to be addressed.


Introduction
Native mass spectrometry (MS) involves the transfer of proteins and other macromolecules intact into the gas phase with minimal disruption to the non-covalent interactions that are present in their solvated form. This then allows a range of experiments to probe the macromolecules' higher-order structure, including their fold, assembly and non-covalent interactions [1][2][3][4]. Native MS has helped elucidate various aspects of biomolecular structure, including the subunit composition, stoichiometry and stability of complexes, as well as the dynamic behaviour they display. When combined with ion mobility (IM), where ions are separated based on their mobility through an inert buffer gas kept at constant pressure and temperature under a weak electric field, the size of a macromolecule, in the form of a rotationally averaged collision cross section (CCS), can be probed [5].
By virtue of being inherently dispersive, native IM-MS has a unique capability to characterize individual states in heterogeneous and dynamic systems, such as co-populated conformations or assembly states of complexes.
Thus, native IM-MS has enabled a large number of insights into a diverse array of macromolecular systems, encompassing proteins, nucleic acids, carbohydrates and lipids, and combinations thereof [6][7][8].
Proteins and other macromolecules are typically dynamic, in that they populate a range of interconverting structures at equilibrium. Frequently, this heterogeneity is such that macromolecules are better described as structural ensembles (of conformations and/or assemblies), defined by the free-energy landscape accessible at given conditions. IM-MS is sensitive to some of this complexity, providing sparse data that can be a powerful descriptor of molecular states. These data on their own are not sufficient for characterizing molecular structure at atomic detail, but they can, in combination with other information, provide insight into the native state and surrounding free-energy landscape [9][10].
Native IM-MS is conducted in the absence of bulk solvent, a factor which may induce some structural changes in the molecules under analysis. Because the gas-phase structures of large biomolecules are dictated by numerous non-covalent interactions, many of which are far from the molecular surface, they typically retain the vast majority of their solution-phase character [11][12]. However, the removal of solvent and acquisition of charges alter the physico-chemical environment of the protein, and lead to some degree of restructuring into different conformations, particularly for states that are intrinsically disordered or only marginally stable [13]. This provides an opportunity for experimental exploration of their free-energy landscape, albeit one reflecting, and dependent on, the gas-phase interaction strengths of the residues involved [14].
The large body of work developing and employing native IM-MS has indicated that a wealth of information is obtainable from such experiments. Yet structural interpretation and translation of the data into structural biology information is often not straightforward. Here we give a perspective on the computational frameworks that must be put in place to address this challenge, and we describe the current thinking and state of the art of the approaches that are being developed. We chart where we believe the field stands in terms of progress in five key computational themes (and their interconnections), namely: 1) IM-MS data extraction and analysis, 2) CCS calculation, 3) determining charge locations, 4) computational modelling, and 5) gas-phase molecular dynamics (MD) (Figure 1). Our thoughts are heavily influenced by the discussions and contributions from the wider native IM-MS community, nucleated through the EU COST Action BM1403, and we refer to our companion article that details and directs the reader to specific software that will aid users in extracting the most, and most reliable, information from their data.

Computational considerations in converting native IM-MS data into information
The first step in using IM-MS data is to extract the raw data into a format from which it is possible to determine the key physical properties of the ions under investigation. At the most basic level, these comprise the mass, charge and mobility. None of these properties has a single value; rather, each populates a distribution, reflecting at least in part the heterogeneity of the system at hand (Table 1).
While instrument manufacturers' software typically allows the transformation of the measured mass-to-charge (m/z) spectrum onto a mass axis via the assignment of charge states, the frequent complexity of native MS data can make this process difficult. Charge-state assignment can be ambiguous for high charge states, and residual adducts are typical for large macromolecules [15]. Moreover, the samples under analysis themselves frequently contain multiple components, and can sometimes be extremely heterogeneous. Another challenge is that spectral peaks can be poorly resolved, due to the gentle nature of the ionization process employed. To overcome these challenges, both researchers and instrument vendors have developed software and algorithms tailored specifically to native MS data in order to aid users in their analysis (see our companion article for a comprehensive catalogue of available tools).
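As a minimal illustration of the charge-state assignment step described above, the sketch below assumes the simplest possible case: two adjacent, well-resolved peaks from the same species, charged only by protons and carrying no adducts. The function names are hypothetical and do not correspond to any particular vendor or community software.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def mz_of(mass, z):
    """m/z of a species of neutral mass `mass` carrying z protons."""
    return (mass + z * PROTON_MASS) / z


def charge_from_adjacent_peaks(mz_upper, mz_lower):
    """Charge z of the peak at mz_upper, assuming that the neighbouring
    peak at mz_lower (lower m/z) is the same species with z + 1 protons."""
    return round((mz_lower - PROTON_MASS) / (mz_upper - mz_lower))


def neutral_mass(mz, z):
    """Neutral mass implied by a peak at `mz` with assigned charge z."""
    return z * (mz - PROTON_MASS)
```

With adducts, overlapping charge-state series or poorly resolved peaks, this two-peak relation becomes unreliable, which is why dedicated deconvolution tools are used in practice.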
While calibration of the m/z axis is straightforward, in order to transform the mobility information (typically acquired in the form of an arrival time distribution, ATD) into a CCS axis, a calibration procedure is typically required [16]. In the overwhelming majority of cases, this is achieved using reference standards appropriate to the target analyte [17]. This process is sensitive to the conditions under which the experiments are performed, and care must be taken to minimise biases associated with the choice of solution and sampling conditions, instrument settings, selection of standards, and the calibration procedures [16]. The information encoded in the CCS (and in CCS distributions, CCSDs) is often used to infer structural properties of a given analyte and can inform computational modelling and (in principle) molecular dynamics (MD) simulations. It also enables direct comparisons of molecular states without additional calibration and computational modelling, as systematic biases cancel when making relative measurements. Nonetheless, in all these uses, an important (but underexplored) consideration is the appropriate incorporation of uncertainties associated with the native IM-MS measurement and its transformation into CCS.
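To make the idea of calibrating against reference standards concrete, the following sketch fits an empirical power law, CCS = A * t^B, to calibrant data by linear regression in log-log space. The power-law form is an assumption chosen here for illustration (the text above does not prescribe a functional form), and the sketch omits the charge, reduced-mass and drift-time corrections that real calibration procedures apply.

```python
import math


def fit_power_law(drift_times, ccs_values):
    """Least-squares fit of ln(CCS) = ln(A) + B * ln(t); returns (A, B)."""
    xs = [math.log(t) for t in drift_times]
    ys = [math.log(c) for c in ccs_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def apply_calibration(a, b, drift_time):
    """Convert an analyte drift time into a CCS using the fitted curve."""
    return a * drift_time ** b
```

The scatter of the calibrants around the fitted curve gives one (incomplete) handle on the calibration uncertainty that should be carried forward into any structural interpretation.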
The ATDs and corresponding CCSDs can differ considerably in profile and width, reflecting (after accounting for instrument-dependent resolving power and other effects [18]) the conformational heterogeneity of the analyte [19][20]. The width of these distributions can be exploited directly, or deconvolved into multiple Gaussian contributions in the case of feature-rich peak shapes [21]. IM-MS experiments can be data-rich, but objective deconvolution of complex ATDs into information of value remains challenging. The difficulty arises in having to decide on the number of conformational families present in the data, and the selection of an appropriate width for each Gaussian. Higher-resolution IM instrumentation and/or use of tandem IM-MS approaches might enable the separation and resolution of overlapping populations, at least for certain types of samples [22][23][24][25].
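One simple way to picture "accounting for instrument-dependent resolving power" is sketched below. It assumes that both the instrumental broadening and the intrinsic conformational spread are Gaussian, so that their widths add in quadrature, and that resolving power is defined as CCS divided by the full width at half maximum (FWHM). Both are simplifying assumptions made for this illustration, not statements about any specific instrument.

```python
import math


def instrument_fwhm(ccs_centre, resolving_power):
    """FWHM expected for a single rigid conformer, taking the
    resolving power to be defined as CCS / FWHM."""
    return ccs_centre / resolving_power


def intrinsic_fwhm(observed_fwhm, instrumental_fwhm):
    """Width attributable to conformational heterogeneity, assuming
    Gaussian contributions that add in quadrature."""
    if observed_fwhm <= instrumental_fwhm:
        return 0.0  # peak is no broader than the instrumental limit
    return math.sqrt(observed_fwhm ** 2 - instrumental_fwhm ** 2)
```

A peak whose observed width equals the instrumental limit carries no resolvable heterogeneity under these assumptions; anything broader hints at multiple or flexible conformations.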

Calculating CCSs from structures and models
The translation of CCS data obtained during native IM-MS experiments into structural information involves several challenges, including determining how best to obtain the CCS values of the relevant reference structure or computational model (generation of structures and models is outside the scope of this review). For instance, the user may wish to compare their experimental CCS to available atomic coordinates or to use the CCS to distinguish between various structural hypotheses. A number of approaches exist, and selection of the most appropriate method depends on a multitude of factors including the chemical nature of the system under investigation, its shape and intrinsic dynamics, and experimental conditions such as the IM buffer gas [5]. A practical consideration is the trade-off between computational expediency and accuracy in CCS estimations: building a large number of models lets one screen a wider structural space, while performing higher-accuracy calculations necessitates screening a smaller range of structures.
In its simplest form, the CCS can be viewed as the rotationally averaged projected area ("shadow") of an object [26], plus a layer having a thickness related to the gas radius and its polarizability [5]. For any convex object, the rotationally averaged projected area is equal to a quarter of its surface area [27]. This simple analytical relationship is useful when considering protein structure at an extremely coarse-grained level [28]. At higher resolution, however, it is clear that proteins are not convex, but feature cavities and protrusions that can lead to multiple collisions or occlude portions of the protein surface from collisions with the buffer gas [29]. On a finer scale, the surface roughness due to the amino acids that decorate the exterior influences the drag a protein experiences during the IM-MS experiment and severs the relation between surface area and projected area. Furthermore, the charge on the protein is inherently non-zero in ion mobility and is expected to impact on CCSs, modulated by the dipole moment and polarizability volume of the gas. The exact distribution of charge can in principle affect the mobility [30], but appears to have a minor effect on the CCSs of proteins [31][32]. For moderate charge states (i.e. the low amount of charge per unit mass typical in native mass spectra), the CCS appears to be relatively constant in He, but less so in N2 [31][32]. How this phenomenon manifests itself for proteins of all sizes and shapes, and for other types of macromolecules, is currently not known, but neglecting these effects is unlikely to be the major source of bias; more important perhaps are the perturbations [33][34]. Considerable scope remains to ensure that local charges and interaction potentials are effectively accommodated in CCS calculations.
Different computational approaches (and implementations thereof) for estimating CCSs from structures exist, at differing levels of complexity and computational cost (see our companion paper, and others [5][16]). The simplest and fastest approach is to consider a protein in terms of its area when projected from different viewpoints. Here the gas atoms are represented by hard spheres that are 'fired' through the sampling volume, and the projected area is calculated from the fraction of trajectories that collide with the protein. Somewhat more advanced, the exact hard-spheres scattering model computes the angle of deflection of the gas to calculate the corresponding momentum transfer to the ion. Both approaches ignore electrostatic interactions as well as London dispersion forces acting at long range.
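The projection idea can be sketched in a few lines of Monte Carlo code. The version below is a deliberately bare-bones illustration, not a reimplementation of any published program: atoms are hard spheres inflated by a single probe (gas) radius, orientations are drawn at random, and the function names, default sample counts and atom format are all assumptions made for this example.

```python
import math
import random


def random_unit_vector(rng):
    """Draw a direction uniformly distributed on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def plane_basis(n):
    """Two orthonormal vectors spanning the plane perpendicular to n."""
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = cross(n, helper)
    norm = math.sqrt(dot(u, u))
    u = (u[0] / norm, u[1] / norm, u[2] / norm)
    return u, cross(n, u)


def projection_area(atoms, probe_radius, n_orientations=100, n_shots=500, seed=0):
    """Monte Carlo estimate of the rotationally averaged projected area.

    atoms: list of (x, y, z, radius) tuples, all lengths in the same unit.
    """
    rng = random.Random(seed)
    n_atoms = len(atoms)
    centre = tuple(sum(a[i] for a in atoms) / n_atoms for i in range(3))
    # Shift to the centroid and inflate each atom by the probe (gas) radius.
    spheres = [((a[0] - centre[0], a[1] - centre[1], a[2] - centre[2]),
                a[3] + probe_radius) for a in atoms]
    bound = max(math.sqrt(dot(p, p)) + r for p, r in spheres)
    hits = 0
    for _ in range(n_orientations):
        u, v = plane_basis(random_unit_vector(rng))
        discs = [(dot(p, u), dot(p, v), r * r) for p, r in spheres]
        for _ in range(n_shots):
            x = rng.uniform(-bound, bound)
            y = rng.uniform(-bound, bound)
            if any((x - a) ** 2 + (y - b) ** 2 <= r2 for a, b, r2 in discs):
                hits += 1
    # Hit fraction times the area of the square the probes were fired through.
    return (2.0 * bound) ** 2 * hits / (n_orientations * n_shots)
```

For a single sphere the estimate should approach pi times the squared inflated radius, which is also a quarter of that sphere's surface area, consistent with the relation for convex objects quoted above. Momentum transfer, multiple collisions and long-range interactions are all ignored here.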
In the methods at the other end of the complexity spectrum (several methods are found between these extremes), however, the short- and long-range interactions of the protein with the gas molecules are modelled explicitly, accounting for the physico-chemical properties (polarizability, charge, van der Waals interactions, and potentially internal degrees of freedom) of both the gas and the atoms in the protein. This requires numerical integration of gas-particle trajectories, with numerous iterations for each such trajectory. While this more rigorous and explicit consideration of the physical processes underpinning the IM separation might provide more accurate CCSs for atomistic structure models, it does not readily lend itself to coarse-grained structural representations, whereas it is straightforward to calculate the projected area of e.g. SAXS-derived bead models or iso-surfaces from electron microscopy [20][35]. Consequently, the nature of the structure model can effectively narrow the repertoire of applicable methods for CCS calculation.
The difference in computational cost between these two extremes currently spans several orders of magnitude, with the most complex approaches taking hours to converge when applied to macromolecules. This renders them intractable for assessing the hundreds of thousands of models needed to explore adequately the rototranslational space associated with structure modelling, or the thousands of frames from MD simulations. As a result, it is often only feasible to use simpler approaches, potentially compromising on the accuracy of the CCS estimation. However, in order to deduce ion shapes from IM-MS, what matters is not so much the accuracy of the absolute calculated values but rather how accurately they can be matched to experiment. For example, for large and globular proteins the simplest projection approximation method can be generally parameterised (i.e. scaled, or calibrated) to reproduce the results from the most computationally costly trajectory method with a relative error within 1% [20], and experimental drift-tube helium CCS values to within 3% RMSD [36]. In general, appropriate parameterization of the CCS calculation is as important as the underlying physical model that is being used [16], and one must pay attention to the type and size of system for which a given parameterization was developed, as well as to the type of experiment it was designed to match. For example, no simple parameterization has been thoroughly validated for proteins that are grossly convex, intrinsically disordered, or in extreme charge states. For smaller systems, the relative effect of surface interactions will be proportionally greater than for very large ones. For highly concave structures, a simple projection approach will not take into account "parachute" effects on ion friction. In all these cases, or whenever in doubt, more expensive methods are necessary for good accuracy [37][38].
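The parameterisation (scaling) step mentioned above can be illustrated with a one-parameter least-squares fit. The sketch assumes that a single multiplicative factor relates projection-approximation values to reference values (from a costlier method or from experiment) for a chosen training set. The function names are hypothetical, and no particular scale factor is implied.

```python
def fit_scale_factor(pa_values, reference_values):
    """Least-squares scale s (through the origin) so that s * PA ~ reference."""
    numerator = sum(p * r for p, r in zip(pa_values, reference_values))
    denominator = sum(p * p for p in pa_values)
    return numerator / denominator


def max_relative_error(pa_values, reference_values, scale):
    """Largest relative deviation of the scaled PA values from the references."""
    return max(abs(scale * p - r) / r
               for p, r in zip(pa_values, reference_values))
```

A factor obtained this way is only as transferable as the training set: applying it to systems of a different size, shape class or charge regime than those used in the fit reintroduces the biases discussed in the text.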

Modelling protein structures using IM-MS data
Computational methods are needed to exploit native IM-MS data for validating or modelling three-dimensional protein structures. A typical workflow involves distinct steps: converting the experimental data acquired into modelling restraints, building models that sample the conformational space of individual proteins or protein assemblies, and evaluating the models in light of the data. Currently, there are two strategies for building models using MS and other related structural datasets. The first strategy filters models generated by computational methods based on their "goodness-of-fit" to the experimental datasets [39][40][41][42]. The second strategy samples models by directly integrating the experimentally derived restraints with an appropriate scoring function (see footnote a) into the computational workflow, i.e. using the restraint to optimise dynamically the model building [43][44].
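The first (filtering) strategy can be sketched as follows. The example assumes that each candidate model comes with pre-computed CCS values for one or more species (e.g. subcomplexes), that each experimental value has an associated uncertainty, and that a chi-squared-like sum is an acceptable goodness-of-fit measure. The data layout, names and threshold are all hypothetical.

```python
def goodness_of_fit(calculated, observed):
    """Sum of squared, uncertainty-weighted deviations.

    calculated: dict mapping species name -> calculated CCS
    observed:   dict mapping species name -> (experimental CCS, uncertainty)
    """
    return sum(((calculated[name] - value) / sigma) ** 2
               for name, (value, sigma) in observed.items())


def filter_models(models, observed, max_score):
    """Keep models whose score does not exceed max_score, best first.

    models: list of dicts with keys "name" and "ccs" (dict as above).
    """
    scored = [(goodness_of_fit(m["ccs"], observed), m["name"]) for m in models]
    return sorted(pair for pair in scored if pair[0] <= max_score)
```

In the second strategy the same kind of score would instead be evaluated during sampling and used to steer model building, rather than being applied after the fact.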
For modelling analysis, it is important to use appropriate "building blocks". In general, the individual subunits and/or complexes can be represented as atomic coordinates (e.g. crystal structures, homology models), as coarse-grained models (e.g. spheroids), or as density maps. Furthermore, it can be important to consider multiple alternative starting structures to ensure that the space is suitably explored [45]. This is pertinent for proteins or complexes that are particularly flexible or are characterised by intrinsically dynamic regions, and where perhaps only one particularly stable or abundant structure has been characterized previously, e.g. by X-ray crystallography. In such cases, developing robust methods for building alternative starting structures for downstream model building becomes a critical aspect of the computational workflow.
An important aspect of any modelling pipeline is the consideration of the uncertainty introduced at each step of the analysis. First, one must consider ambiguity in the data caused by the limited resolving power of the instruments, the conformational heterogeneity of the protein (which manifests itself as a CCSD broader than the instrumentation limit), and the possibility of low-quality data which can compromise the discriminatory ability of the CCS measurements [46][47]. [...] values if proteins undergo a significant degree of structural change upon transfer to the gas phase, and these discrepancies bring challenges for modelling. Side chains that are solvent-exposed in solution take advantage of the low permittivity of vacuum to collapse onto the surface by forming new interactions [48][49][50]. In the case of protein ions that are intrinsically malleable, e.g. hollow structures, those with hinges, or low charge states of intrinsically disordered proteins, these additional (non-native) non-covalent interactions can lead to unstimulated compaction of the overall protein structure (Rolland, 2019; Hall, 2012; Hansen, 2018; Pacholarz, 2014; Pagel, 2013; Landreh, 2017; van der Spoel, 2011). Gas-phase induced unfolding happens when the native intramolecular interactions are too weak compared to the repulsion between like charges, and is more likely to occur for high charge states (and at higher activation energies). Gas-phase structural changes require some energy barriers to be overcome, which in turn depend on the native interactions, on the charge state adopted during electrospray, on the internal energy uptake and on the time spent in the mass spectrometer. Despite notable advances made (Konermann, 2017; Marchese, 2012), gas-phase structural changes remain hard to fully predict, and thus contribute to the uncertainty of the CCS calculation.
Footnote a: A modelling restraint is defined as an assembly/protein feature (e.g. volume, shape, flexibility) quantified with respect to the data used to generate it. It represents the 'force' that glues the individual subunits together and forms configurations consistent with the input data. The scoring function sums up all restraints and may be thought of as the force field that drives the formation of the assembly.
Uncertainty from computations that aim to match experimental data to structural models comprises contributions from the choice of representations [55][56], the completeness of the information available, the use of the appropriate scoring function, and the biases of individual sampling algorithms (e.g. if they do not accurately capture the data). In addition, measurable errors may be introduced by the post-processing step, which typically scores models based on how well they match the input datasets and may include clustering approaches for generating an ensemble of computational models. A final challenge comes in weighting the merits, and biases, of individual methods based on their ability to contribute to accurate models. As such, the final output of a combined experimental and modelling effort is best represented by an ensemble of structures that encapsulates the convolution of both the inherent conformational heterogeneity of the protein and the various sources of uncertainty in the IM-MS pipeline [42][55]. Benchmarking studies have provided some ways of efficiently integrating the different methods by taking into account their relative uncertainty [57][58], such that it is becoming increasingly possible to bring together the individual techniques in a single workflow.

Combining molecular dynamics with native IM-MS
The integration of native IM-MS experiments with molecular dynamics (MD) simulations is highly desirable, as the two methods are complementary with respect to the resolution of structural information they provide, and the timescales that they operate on [9]. In the first instance, solvent-free MD plays an important role in understanding the fundamentals of MS and for interpreting MS data [12][50]. For example, the effects of solvent, temperature and charge on protein structure have been studied in this way, and there are numerous examples of system-specific investigations where MD has been used together with MS [9]. The most widespread MD methods have been developed mainly for condensed-phase calculations, which presents specific challenges when applying them to simulations in vacuum. For example, electrostatic interactions are significant over much longer distances in the absence of solvent which, if taken into account, slows down the calculations considerably, thus limiting the sampling and simulation timescales. Moreover, the commonly used force fields are designed to match the solution phase, and hence the effective polarization at the solution interface might not reflect gas-phase conditions. The magnitude of this inaccuracy is currently unquantified; however, employing polarizable force fields could be a means to mitigate such errors at an additional computational cost [50].
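The longer reach of electrostatics in vacuum can be made tangible with a back-of-the-envelope calculation. The sketch below treats the solvent as a uniform dielectric (a crude assumption, with a relative permittivity of roughly 80 for water) and asks at what separation two unit charges interact with an energy equal to the thermal energy kT. The constants are standard; the function names are illustrative.

```python
COULOMB_CONSTANT = 1389.35   # kJ mol^-1 Å e^-2, i.e. 1 / (4 pi eps0) in these units
GAS_CONSTANT = 0.0083144626  # kJ mol^-1 K^-1


def coulomb_energy(q1, q2, distance, relative_permittivity=1.0):
    """Coulomb energy (kJ/mol) of two point charges (in e) separated by
    `distance` (Å) in a uniform dielectric."""
    return COULOMB_CONSTANT * q1 * q2 / (relative_permittivity * distance)


def thermal_distance(relative_permittivity=1.0, temperature=300.0):
    """Separation (Å) at which two unit charges interact with energy kT."""
    return COULOMB_CONSTANT / (relative_permittivity * GAS_CONSTANT * temperature)
```

Under these assumptions the separation at which the interaction drops to kT is a few ångströms in water but hundreds of ångströms in vacuum, which is why short electrostatic cut-offs tuned for solution simulations are problematic for gas-phase MD.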
Another challenge stems from considering how charge is distributed on a macromolecule. While the locations of charges do not appear critical for CCS calculations on large molecules, they remain an integral part of the physical model and help determine the system dynamics at the atomic level, thereby greatly influencing the accuracy of the simulations. This, of course, reflects the fact that the location of charges to a large extent 'drives' the structural dynamics, and vice versa. For macromolecules, charging in electrospray takes place via the protonation of basic sites and deprotonation of acidic sites (see footnote b), with the note that additional sites become available during electrospray due to their high gas-phase basicity or acidity [60], that zwitterionic states are frequently stable in the gas phase [61][62], and that, depending on solution conditions, charged buffer components can act as charge carriers. Experimentally pinpointing the location of charges is extremely difficult, however, and one cannot assume that protonation states simply carry over from solution to the gas phase. Depending on the conditions under which the electrospray process generates charged particles, particularly the presence/absence of protic solvent and the time frame of ionization, the removal of solvent greatly affects the energetics of both the protonated and deprotonated form. However, because of a certain amount of kinetic trapping, the site might still carry some "memory" of its protonation state in solution over the experimental time scales [63].
Footnote b: Note that 'basic/acidic sites' is here used according to the Brønsted-Lowry definition, that is, their ability to accept or donate a proton. As such, aspartate and glutamate residues are basic sites, as they are the conjugate bases of aspartic acid and glutamic acid, whereas they are typically considered to be acidic residues in biochemistry, regardless of protonation state.
The number of possible charge isomers grows rapidly with the number of (de)protonatable sites, meaning that a complete consideration of isomers is usually not feasible. In lieu of complete enumeration, Monte Carlo approaches, where protons are moved randomly between basic sites to generate new charge isomers, have been developed to address this issue [51][64]. While the details in how the energies are evaluated and in how the charge isomers are sampled differ between the different approaches, they all compute energy as the sum of the proton affinities for all protonated sites and the electrostatic interactions between charged sites and their surroundings (including other charged sites). The interplay between charge and conformation means that even if the lowest-energy charge isomer can be identified for a crystal structure, relaxation of sidechain conformations, as well as on higher structural levels, might shift the energy considerably [64]. Therefore, care must be taken to not let the rich structural detail in a crystal structure, obtained under considerably different conditions, bias the calculations towards "incorrect" charge isomers.
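A toy version of such a Monte Carlo search is sketched below. It follows the energy decomposition stated above (proton affinities plus Coulomb interactions between charged sites) but makes several simplifying assumptions that are specific to this illustration: only positive-mode protonation of otherwise neutral sites is considered, interactions with the uncharged surroundings are ignored, the structure is static, and moves are accepted with a Metropolis criterion. Published approaches differ in all of these details.

```python
import itertools
import math
import random

COULOMB_CONSTANT = 1389.35  # kJ mol^-1 Å e^-2
KT_300K = 2.494             # kJ mol^-1, thermal energy at about 300 K


def isomer_energy(sites, protonated):
    """Energy of a charge isomer: minus the proton affinities of the
    protonated sites plus pairwise Coulomb repulsion between them.

    sites: list of ((x, y, z), proton_affinity) with Å and kJ/mol units.
    """
    energy = -sum(sites[i][1] for i in protonated)
    for i, j in itertools.combinations(sorted(protonated), 2):
        energy += COULOMB_CONSTANT / math.dist(sites[i][0], sites[j][0])
    return energy


def monte_carlo_search(sites, n_protons, n_steps=2000, kT=KT_300K, seed=0):
    """Random proton hops between sites; returns the best isomer found."""
    rng = random.Random(seed)
    current = set(rng.sample(range(len(sites)), n_protons))
    e_current = isomer_energy(sites, current)
    best, e_best = set(current), e_current
    for _ in range(n_steps):
        donor = rng.choice(sorted(current))
        acceptor = rng.choice([i for i in range(len(sites)) if i not in current])
        trial = (current - {donor}) | {acceptor}
        e_trial = isomer_energy(sites, trial)
        accept = (e_trial <= e_current
                  or rng.random() < math.exp(-(e_trial - e_current) / kT))
        if accept:
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = set(current), e_current
    return best, e_best


def exhaustive_search(sites, n_protons):
    """Brute-force reference, feasible only for a handful of sites."""
    best = min(itertools.combinations(range(len(sites)), n_protons),
               key=lambda combo: isomer_energy(sites, combo))
    return set(best), isomer_energy(sites, best)
```

For a handful of sites the exhaustive search provides a check on the sampler; for realistic proteins only the stochastic route is tractable, and, as noted above, the result is tied to the conformation on which the energies were evaluated.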
Hybrid MD and Monte Carlo approaches have been developed for the combined search of conformer and charge-isomer space in the gas phase. These have shown that side chains have a propensity to fold onto the protein surface with consequent structure contraction and formation of new charged and neutral hydrogen bonds and salt bridges [62]. These structural rearrangements promote self-solvation and are compatible with maintenance of a native-like fold. An interesting feature in the emerging picture of folded protein ions in the gas phase is the capability to compensate for the energetic penalty of charge separation in vacuo with favourable, conformation-specific intramolecular interactions, in line with growing experimental and theoretical evidence [65][66]. Persistence of zwitterionic states in protein structures provides a rationale for conformational stability in the gas phase and conformational effects on charge-state distributions and is a feature that simulation methods should accommodate.
In addition to the combinatorial challenges in choosing a "correct" charge isomer, there may be several coexisting charge isomers, and protons could in principle transfer between sites in the gas phase (the "mobile proton model" [67]), following or promoting structural transitions [68]. As classical MD typically disallows the formation or cleavage of chemical bonds, protonation dynamics cannot readily be incorporated into such simulations. Recently there has been progress in accommodating proton mobility, with simulations being stopped at regular intervals, and charges being transferred at random towards charge isomers of lower energy [61][69][70][71]. Current implementations of this approach are, however, not truly thermodynamic, in the sense that they do not adhere to Boltzmann statistics, and consequently they might be error-prone in quantifying how probable the different charge isomers are. Nevertheless, this represents an important step towards accommodating the important role of charges in gas-phase MD, and future integration with popular MD software will be instrumental for the community. Combined quantum mechanics/molecular mechanics (QM/MM) would be a more accurate way to account for proton transfer [72]; although computationally much more costly than force field MD, it may prove valuable to IM-MS modelling in the future.
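For reference, adherence to Boltzmann statistics means that, at equilibrium, coexisting charge isomers would be populated according to their relative energies as in the minimal sketch below. It assumes that a set of isomer energies is already available and treats them as if they were free energies, which is itself an approximation.

```python
import math

GAS_CONSTANT = 0.0083144626  # kJ mol^-1 K^-1


def boltzmann_populations(energies, temperature=300.0):
    """Equilibrium populations for states with the given energies (kJ/mol)."""
    kT = GAS_CONSTANT * temperature
    lowest = min(energies)  # shift energies to avoid overflow in exp()
    weights = [math.exp(-(e - lowest) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]
```

A scheme that only ever moves protons towards lower-energy isomers will overpopulate the minimum relative to these weights, which is one way to see why such schemes can misjudge isomer probabilities.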
The transition from solution to the gas phase can also incur changes in the structure of the protein. Though these are often small in amplitude [73], they can significantly alter the contacts made between amino acids [50].
This, together with the need to consider electrostatic interactions over long distances, means that MD might struggle to explore experimentally relevant parts of the conformational landscape [50][54]. Experimental data from solution-phase methods are frequently used to restrain the MD simulations, facilitating the transition from the starting structure to the conformations that pertain to the question at hand. In principle, experimentally derived CCSs can be used in a similar fashion, but the considerable overhead required for continuously calculating the CCS during the simulation, and comparing with a given reference value, has so far limited the use of CCS-based restraints [9]. Instead, other, more computationally expedient quantities, such as the radius of gyration or solvent accessible surface area (SASA), have been used as proxies for the CCS [38][74][75]. Recent speed increases in CCS calculations might enable explicit CCS restraints, strengthening the link between simulation and experiments, especially for systems where non-globular structures or conformational transitions might complicate the relationship between proxies and CCSs.
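The appeal of a proxy such as the radius of gyration is that it is trivially cheap to evaluate at every simulation step. The sketch below computes a mass-weighted radius of gyration and a harmonic penalty relative to a target value. How a target radius of gyration would be derived from an experimental CCS is not specified here and would require its own, system-dependent calibration; the harmonic form and the function names are assumptions for illustration.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Mass-weighted radius of gyration of a set of (x, y, z) coordinates."""
    if masses is None:
        masses = [1.0] * len(coords)
    total = sum(masses)
    centre = [sum(m * c[i] for m, c in zip(masses, coords)) / total
              for i in range(3)]
    mean_square = sum(m * sum((c[i] - centre[i]) ** 2 for i in range(3))
                      for m, c in zip(masses, coords)) / total
    return math.sqrt(mean_square)


def harmonic_restraint(value, target, force_constant):
    """Harmonic penalty that grows as `value` departs from `target`."""
    return 0.5 * force_constant * (value - target) ** 2
```

Because the radius of gyration weights interior and surface atoms alike, its relationship to the CCS is expected to degrade for elongated or hollow structures, which are the cases where explicit CCS restraints would matter most.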