## Efficient sampling of equilibrium states

Molecular dynamics or Monte Carlo methods can be used to sample equilibrium states, but these methods become computationally expensive for complex systems, where the transition from one equilibrium state to another may only occur through rare events. Noé *et al.* used neural networks and deep learning to generate distributions of independent soft condensed-matter samples at equilibrium (see the Perspective by Tuckerman). Supervised training is used to construct invertible transformations between the coordinates of the complex system of interest and simple Gaussian coordinates of the same dimensionality. Thus, configurations can be sampled in this simpler coordinate system and then transformed back into the complex one using the correct statistical weighting.

## Structured Abstract

### INTRODUCTION

Statistical mechanics aims to compute the average behavior of physical systems on the basis of their microscopic constituents. For example, what is the probability that a protein will be folded at a given temperature? If we could answer such questions efficiently, then we could not only comprehend the workings of molecules and materials, but we could also design drug molecules and materials with new properties in a principled way.

To this end, we need to compute statistics of the equilibrium states of many-body systems. In the protein-folding example, this means to consider each of the astronomically many ways to place all protein atoms in space, to compute the probability of each such “configuration” in the equilibrium ensemble, and then to compare the total probability of unfolded and folded configurations.

As enumeration of all configurations is infeasible, one instead must attempt to sample them from their equilibrium distribution. However, we currently have no way to generate equilibrium samples of many-body systems in “one shot.” The main approach is thus to start with one configuration, e.g., the folded protein state, and make tiny changes to it over time, e.g., by using Markov-chain Monte Carlo or molecular dynamics (MD). However, these simulations get trapped in metastable (long-lived) states: For example, sampling a single folding or unfolding event with atomistic MD may take a year on a supercomputer.

### RATIONALE

Here, we combine deep machine learning and statistical mechanics to develop Boltzmann generators. Boltzmann generators are trained on the energy function of a many-body system and learn to provide unbiased, one-shot samples from its equilibrium state. This is achieved by training an invertible neural network to learn a coordinate transformation from a system’s configurations to a so-called latent space representation, in which the low-energy configurations of different states are close to each other and can be easily sampled. Because of the invertibility, every latent space sample can be back-transformed to a system configuration with high Boltzmann probability (Fig. 1). We then employ statistical mechanics, which offers a rich set of tools for reweighting the distribution generated by the neural network to the Boltzmann distribution.

### RESULTS

Boltzmann generators can be trained to directly generate independent samples of low-energy structures of condensed-matter systems and protein molecules. When initialized with a few structures from different metastable states, Boltzmann generators can generate statistically independent samples from these states and efficiently compute the free-energy differences between them. This capability could be used to compute relative stabilities between different experimental structures of protein or other organic molecules, which is currently a very challenging problem. Boltzmann generators can also learn a notion of “reaction coordinates”: Simple linear interpolations between points in latent space have a high probability of corresponding to physically realistic, low-energy transition pathways. Finally, by using established sampling methods such as Metropolis Monte Carlo in the latent space variables, Boltzmann generators can discover new states and gradually explore state space.

### CONCLUSION

Boltzmann generators can overcome rare event–sampling problems in many-body systems by learning to generate unbiased equilibrium samples from different metastable states in one shot. They differ conceptually from established enhanced sampling methods, as no reaction coordinates are needed to drive them between metastable states. However, by applying existing sampling methods in the latent spaces learned by Boltzmann generators, a plethora of new opportunities opens up to design efficient sampling methods for many-body systems.

## Abstract

Computing equilibrium states in condensed-matter many-body systems, such as solvated proteins, is a long-standing challenge. Lacking methods for generating statistically independent equilibrium samples in “one shot,” vast computational effort is invested for simulating these systems in small steps, e.g., using molecular dynamics. Combining deep learning and statistical mechanics, we developed Boltzmann generators, which are shown to generate unbiased one-shot equilibrium samples of representative condensed-matter systems and proteins. Boltzmann generators use neural networks to learn a coordinate transformation of the complex configurational equilibrium distribution to a distribution that can be easily sampled. Accurate computation of free-energy differences and discovery of new configurations are demonstrated, providing a statistical mechanics tool that can avoid rare events during sampling without prior knowledge of reaction coordinates.

Statistical mechanics is concerned with computing the average behavior of many copies of a physical system based on its microscopic constituents and their interactions. For example, what is the average magnetization in an Ising model of interacting magnetic spins? Or what is the probability of a protein to be folded as a function of the temperature? Under a wide range of conditions, the equilibrium probability of a microscopic configuration **x** (setting of all spins, positions of all protein atoms, etc.) is proportional to its Boltzmann weight, e^{−u(**x**)}, where u(**x**) is the reduced potential energy, e.g., u(**x**) = U(**x**)/k_{B}T in the canonical ensemble at temperature T.

Except for simple model systems, we presently have no approach to directly draw “one-shot,” i.e., statistically independent, samples **x** from Boltzmann-type distributions to compute statistics of the system, such as free-energy differences. Therefore, one currently must rely on trajectory methods such as Markov chain Monte Carlo (MCMC) or molecular dynamics (MD) simulations that make tiny changes to **x** in each step. These methods sample from the Boltzmann distribution in the long run, but many simulation steps are needed to produce a statistically independent sample. This is because complex systems often have metastable (long-lived) phases or states and the transitions between them are rare events; for example, 10^{9} to 10^{15} MD simulation steps are needed to fold or unfold a protein. As a result, MCMC and MD methods are extremely expensive and consume much of the worldwide supercomputing resources.

A common approach to enhance sampling is to speed up rare events by biasing user-defined order parameters, or “reaction coordinates” (RCs), that may be of a mechanical (*1*–*4*), thermodynamic (*5*–*7*), or alchemical nature (*8*, *9*). Applying these techniques to high-dimensional systems with a priori unknown transition mechanisms is challenging, as identifying suitable order parameters and avoiding rare events in other, unbiased directions becomes extremely difficult. For example, the development of enhanced simulation protocols for the binding of small drug molecules to proteins has become a research area in its own right (*10*).

In this study, we set out to develop a “Boltzmann generator” machine that is trained on a given energy function u(**x**) and learns to provide unbiased, one-shot samples from the corresponding Boltzmann distribution.

Key to the solution is combining the strengths of deep machine learning (*11*) and statistical mechanics (Fig. 1A). We train a deep invertible neural network to learn a coordinate transformation from **x** to a so-called “latent” representation **z**, in which the low-energy configurations of different states are close to each other and can be easily sampled, e.g., using a Gaussian normal distribution. Enhancing MD sampling by user-defined coordinate transformations has been proposed previously (*12*). The novelty of Boltzmann generators is that this transformation is learned and, owing to the deep transformation network, can be as complicated as needed to represent state changes in the many-body system. As Boltzmann generators are invertible, every sample **z** can be back transformed to a configuration **x** with high Boltzmann probability. We can improve the ability to find relevant parts of configuration space by “learning from example”: the network is initialized with a few valid configurations **x**, e.g., from the folded or unfolded state of a protein, but without knowing the probabilities of these states. Then, we use statistical mechanics, which offers a rich set of tools to reweight the generated distribution to the target Boltzmann distribution.

This study demonstrates that Boltzmann generators can be trained to generate low-energy structures of condensed matter systems and protein molecules in one shot, as shown for model systems and a millisecond-timescale conformational change of the bovine pancreatic trypsin inhibitor (BPTI) protein. When the Boltzmann generator is initialized with a few structures from different metastable states, it can generate statistically independent samples from these states and can compute the free-energy profiles of the corresponding transitions without suffering from rare events. Although Boltzmann generators do not require RCs, they can be included in the training to sample continuous free-energy profiles and low-probability states. When trained in this way, Boltzmann generators can also generate physically realistic transition pathways by performing simple linear interpolations in latent space. We also show that multiple independent Boltzmann generators, trained on disconnected MD or MCMC simulations of different states, can be used to compute free-energy differences between these states in a direct and inexpensive way and without requiring any RCs. Finally, we demonstrate that when using established sampling methods such as Metropolis Monte Carlo in the latent space of a Boltzmann generator, efficient methods can be constructed to find new states and gradually explore state space.

## Boltzmann generators

Neural networks that can draw statistically independent samples from a desired distribution are called directed generative networks (*13*, *14*). Such networks have been demonstrated to generate photorealistic images (*15*), to produce deceivingly realistic speech audio (*16*), and even to sample formulae of chemical compounds with certain physicochemical properties (*17*). In these domains, the exact target distribution is not known and the network is “trained by example” using large databases of images, audio, or molecules. Here, we are in the inverse situation, as we can compute the Boltzmann weight of each generated sample **x**, but we do not have samples from the Boltzmann distribution a priori. The idea of Boltzmann generators is as follows (Fig. 1A):

1) A neural network is trained to transform samples **z** drawn from a simple prior, e.g., a Gaussian normal distribution, into configurations **x** = F_{zx}(**z**) that have a high Boltzmann weight, i.e., that come from a generated distribution p_{X}(**x**) close to the Boltzmann distribution.

2) To obtain unbiased samples and to compute Boltzmann-weighted averages, the generated distribution p_{X}(**x**) is reweighted to the Boltzmann distribution: each generated configuration **x** is assigned the statistical weight w(**x**) ∝ e^{−u(**x**)}/p_{X}(**x**), and desired statistics, such as free-energy differences, are then computed using these weights.
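The reweighting step (2) can be sketched in one dimension. In this self-contained NumPy illustration (a toy Gaussian target and proposal, not one of the paper's systems), samples from a known generation density q are assigned self-normalized importance weights proportional to e^{−u(x)}/q(x):

```python
import numpy as np

rng = np.random.default_rng(2)

# Suppose the generator proposes samples x with known generation density
# q(x) (here: a standard normal), while the target Boltzmann density is
# proportional to exp(-u(x)).
def u(x):
    return 0.5 * (x - 1.0) ** 2             # target: N(1, 1) up to a constant

x = rng.normal(0.0, 1.0, size=200_000)      # proposals from q = N(0, 1)
log_q = -0.5 * x ** 2 - 0.5 * np.log(2.0 * np.pi)

# Importance weights w(x) ∝ exp(-u(x)) / q(x), self-normalized:
log_w = -u(x) - log_q
w = np.exp(log_w - log_w.max())             # subtract max for numerical safety
w /= w.sum()

# Boltzmann-weighted average, e.g. the mean of x under the target:
mean_est = np.sum(w * x)
assert abs(mean_est - 1.0) < 0.05           # target mean is exactly 1
```

The same weighting, applied to indicator functions of states instead of the mean, yields the free-energy differences referred to in the text.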

For both training and reweighting, it is important that we can compute the generation probability p_{X}(**x**) of each configuration **x**. This can be achieved when the transformation between **z** and **x** is invertible (*18*, *19*). Physically, invertible transformations are analogous to flows of a fluid that transform the probability density from configuration space to latent space, or vice versa. Volume-preserving transformations, comparable to incompressible fluids, were introduced in (*19*). Here, we use the non–volume-preserving transformations introduced in (*20*) (Fig. 1B), as they allow the probability distribution to be scaled differently at different parts of configuration space. Alternatively, Boltzmann generators can be built using more general invertible transformations (*21*–*23*). Invertibility is achieved by adopting special neural network architectures (Fig. 1B; materials and methods). Multiple trainable invertible “blocks” can be stacked, thus encoding complicated variable transformations in the form of a deep invertible neural network (Fig. 1A).
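The mechanics of such an invertible block can be illustrated with a single affine coupling layer of the kind introduced in (*20*). The NumPy sketch below uses random toy weights (it is not the paper's trained network) and shows why invertibility plus a triangular Jacobian make the generated density computable in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """Minimal RealNVP-style affine coupling layer (toy weights). The input
    is split into halves (a, b); b is scaled and shifted by functions of a,
    so the Jacobian is triangular and its log-determinant is simply the sum
    of the log-scales."""
    def __init__(self, dim):
        half = dim // 2
        self.Ws = rng.normal(0.0, 0.1, (half, half))  # toy "network" weights
        self.Wt = rng.normal(0.0, 0.1, (half, half))

    def _s(self, a):
        return np.tanh(a @ self.Ws)   # bounded log-scale for stability

    def _t(self, a):
        return a @ self.Wt

    def forward(self, z):             # z -> x, plus log|det J_zx|
        a, b = np.split(z, 2, axis=-1)
        s = self._s(a)
        return np.concatenate([a, b * np.exp(s) + self._t(a)], axis=-1), s.sum(-1)

    def inverse(self, x):             # x -> z, plus log|det J_xz|
        a, b = np.split(x, 2, axis=-1)
        s = self._s(a)
        return np.concatenate([a, (b - self._t(a)) * np.exp(-s)], axis=-1), -s.sum(-1)

layer = AffineCoupling(dim=4)
z = rng.normal(size=(5, 4))
x, logdet_fwd = layer.forward(z)
z_back, logdet_inv = layer.inverse(x)
assert np.allclose(z, z_back)               # exact invertibility
assert np.allclose(logdet_fwd, -logdet_inv)

# Change of variables gives the generated density at each sample:
# log p_X(x) = log p_Z(z) - log|det J_zx(z)|
log_pz = -0.5 * (z ** 2).sum(-1) - 2.0 * np.log(2.0 * np.pi)
log_px = log_pz - logdet_fwd
```

Because each half of the input passes through unchanged, stacking several such layers with alternating splits (as in Fig. 1A) yields an expressive yet exactly invertible deep network.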

Boltzmann generators are trained by combining two modes: training by energy and training by example. Training by energy is the main principle behind Boltzmann generators and proceeds as follows: we sample random vectors **z** from a Gaussian prior distribution and transform them through the neural network into proposal configurations **x** = F_{zx}(**z**), whose generation probability can be computed from the network's Jacobian (materials and methods). We then minimize the Kullback–Leibler (KL) divergence between the generated distribution and the Boltzmann distribution (Eq. 1). The invertible network layers are designed such that the Jacobian determinant is easy to compute.
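Numerically, the training-by-energy objective is the mean generated energy minus the mean log-Jacobian (an entropy-like term). The following sketch, using a toy double-well energy and a simple affine generator with made-up parameters (not the paper's network or training code), shows that collapsing all samples onto one minimum is penalized by the entropy term:

```python
import numpy as np

rng = np.random.default_rng(1)

def u(x):
    """Toy 2D double-well potential (stand-in for the system energy)."""
    return (x[:, 0] ** 2 - 1.0) ** 2 + 0.5 * x[:, 1] ** 2

def kl_loss(scale, shift, n=10_000):
    """Training-by-energy loss for a toy affine generator x = scale*z + shift:
    E_z[ u(F_zx(z)) ] - E_z[ log|det J_zx(z)| ],
    i.e., mean generated energy minus an entropy-like log-Jacobian term."""
    z = rng.normal(size=(n, 2))
    x = scale * z + shift
    log_det = np.sum(np.log(np.abs(scale)))   # constant for an affine map
    return np.mean(u(x)) - log_det

# Collapsing all samples onto one minimum lowers the energy term but is
# penalized by the strongly negative log-Jacobian (entropy) term:
collapsed = kl_loss(np.array([1e-3, 1e-3]), np.array([1.0, 0.0]))
spread = kl_loss(np.array([0.3, 0.5]), np.array([0.0, 0.0]))
assert collapsed > spread
```

This is the mode-collapse mechanism discussed below: without the log-Jacobian term, the collapsed generator would appear optimal.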

The KL divergence (Eq. 1) is equivalent to the free-energy difference of transforming the Gaussian prior distribution to the generated distribution (materials and methods, supplementary materials): the first term is the mean potential energy of the generated configurations, i.e., an internal energy, and the second term is an entropic contribution. The entropy term prevents mode collapse (*13*), i.e., the repetitive sampling of a single minimum-energy configuration that would minimize the first term.

Despite the entropy term in Eq. 1, training by energy alone is not sufficient, as it tends to focus sampling on the most stable metastable state (fig. S2). We therefore additionally use training by example, which is the standard training method used in other machine learning applications and is implemented here with the maximum likelihood (ML) principle. We initialize the Boltzmann generator with some “valid” configurations **x**, e.g., from short initial MD simulations or experimental structures, transform them to latent space via the inverse transformation F_{xz} = F_{zx}^{−1}, and maximize their likelihood under the Gaussian prior distribution (*18*, *19*).
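Training by example reduces to a negative log-likelihood under the Gaussian prior after the inverse transformation. A minimal sketch for a toy affine generator (illustrative parameters only, not the paper's implementation):

```python
import numpy as np

def ml_loss(x, scale, shift):
    """Training-by-example (maximum likelihood) loss for a toy affine
    generator x = scale * z + shift with a standard Gaussian prior on z.
    Examples are mapped to latent space via the inverse map F_xz and their
    negative log-likelihood under the prior is minimized:
      -log p_X(x) = 0.5*||z||^2 + sum(log scale) + const,  z = (x - shift)/scale
    """
    z = (x - shift) / scale
    return np.mean(0.5 * np.sum(z ** 2, axis=-1) + np.sum(np.log(scale)))

rng = np.random.default_rng(3)
data = rng.normal(2.0, 0.5, size=(5000, 2))    # toy "valid configurations"

# The loss is lower for parameters that actually map the data to the prior:
good = ml_loss(data, scale=np.array([0.5, 0.5]), shift=np.array([2.0, 2.0]))
bad = ml_loss(data, scale=np.array([1.0, 1.0]), shift=np.array([0.0, 0.0]))
assert good < bad
```

In practice both losses are combined, so the generator is anchored to the example configurations while being pushed toward low energies elsewhere.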

By combining training by energy and training by example, we can sample configurations that have high probabilities and low free energies. However, sometimes we want to sample states with low equilibrium probabilities, such as transition states along a certain RC whose free-energy profile is of interest. For this purpose, we introduce an RC loss that can optionally be used to enhance the sampling of a Boltzmann generator along a chosen RC (materials and methods).

## Results

### Illustration on model systems

We first illustrate Boltzmann generators using two-dimensional model potentials that have metastable states separated by high energy barriers: the double-well potential and the Mueller potential (Fig. 2, A and G). MD simulations stay in one metastable state for a long time before a rare transition event occurs. Hence, the configuration-space distributions sampled within any limited simulation time remain confined to individual metastable states.

We use the Boltzmann generators by sampling from their latent spaces according to Gaussian distributions. After transforming these variables via F_{zx}, the generated configurations populate the metastable states with low energies, and their statistics can be reweighted to the Boltzmann distribution (Fig. 2).

The Boltzmann generator repacks the high-probability regions of configuration space into a concentrated latent space density. We therefore wondered about the physical interpretation of direct paths in latent space. Specifically, we interpolate linearly between the latent space representations of samples from different energy minima, similar to what is done with generative networks in other disciplines (*22*, *17*). When mapping these linear interpolations back to configuration space, they result in nonlinear pathways that have low energies and high probabilities (Fig. 2, F and L). Although there is no general guarantee that linear paths in latent space will result in low energies, this result indicates that the latent spaces learned by Boltzmann generators can be used to provide candidates of order parameters for bias-enhanced or path-based sampling methods (*1*, *3*, *24*).
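The effect described above can be reproduced with a toy invertible map (a simple shear, chosen for illustration; it is not a trained Boltzmann generator): a straight segment between two latent points maps back to a curved path in configuration space.

```python
import numpy as np

# A toy invertible map that distorts latent space nonlinearly.
def F_zx(z):
    z1, z2 = z[..., 0], z[..., 1]
    return np.stack([z1, z2 + z1 ** 2], axis=-1)   # invertible shear

def F_xz(x):
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2 - x1 ** 2], axis=-1)

xa, xb = np.array([-2.0, 4.0]), np.array([2.0, 4.0])   # two "energy minima"
za, zb = F_xz(xa), F_xz(xb)

# Linear interpolation in latent space...
lam = np.linspace(0.0, 1.0, 11)[:, None]
path = F_zx((1.0 - lam) * za + lam * zb)

# ...becomes a nonlinear pathway in configuration space: the midpoint does
# not lie on the straight x-space segment between xa and xb.
assert np.allclose(path[0], xa) and np.allclose(path[-1], xb)
assert not np.allclose(path[5], 0.5 * (xa + xb))
```

For a trained generator the analogous curved paths tend to stay in high-probability regions, which is what makes them candidates for order parameters.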

For the double-well system, the unbiased MD simulation needs, on average, 4 × 10^{6} MD steps for a single return trip between the two states (supplementary materials), and ~100 such crossings are required to compute the free-energy difference with the same precision as the Boltzmann generator results shown in Fig. 2E. The total effort of training the Boltzmann generator (including generating the initial simulation data) corresponds to ~10^{6} steps, but once this is done, statistically independent samples can be generated at no significant cost. For this simple system, the Boltzmann generator is therefore about a factor of 100 more efficient than direct simulation, but much more extreme savings can be observed for complex systems, as shown below.

### Thermodynamics of condensed-matter systems

As a second example, we demonstrate that Boltzmann generators can sample high-probability structures and efficiently compute the thermodynamics of crowded condensed-matter systems. We simulated a dense system of two-dimensional particles confined to a box, as suggested in (*25*) (Fig. 3A). Immersed in the fluid is a bistable particle dimer whose open and closed states are separated by a high barrier (Fig. 3B). Because of the high density of the system, the dimer cannot open or close directly; doing so requires a concerted rearrangement of the surrounding solvent particles. At close distances, particles repel each other strongly, resulting in a crowded system. Thus, the fraction of low-energy configurations is vanishingly small, and manually designing a sampling method that simultaneously places all 38 particles and achieves low energies appears unfeasible.

We trained a Boltzmann generator to sample one-shot low-energy configurations and used it to compute the free-energy profiles of opening or closing the dimer. Key to treating explicit-solvent systems such as this one is to incorporate the indistinguishability of solvent molecules. If physically identical solvent molecules were to be distinguished, then every exchange of solvent molecule positions due to diffusion would represent a new configuration, resulting in an enormous configuration space even for this 38-particle system. We therefore removed identical-particle permutations from all configurations input into or sampled by the Boltzmann generator by exchanging particle labels to minimize the distance to a reference configuration (supplementary materials).
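The relabeling idea can be sketched as an optimal-assignment problem: permute particle labels so that the summed squared distance to a reference configuration is minimal. This sketch uses the Hungarian algorithm via SciPy; the exact procedure of the supplementary materials may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(config, reference):
    """Remove identical-particle permutations: relabel the particles in
    `config` (N x 2 positions) so that the summed squared distance to the
    correspondingly labeled particles in `reference` is minimal."""
    cost = ((config[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
    row, col = linear_sum_assignment(cost)   # optimal label assignment
    out = np.empty_like(config)
    out[col] = config[row]                   # particle row[i] takes label col[i]
    return out

rng = np.random.default_rng(4)
ref = np.array([[0., 0.], [0., 3.], [3., 0.], [3., 3.], [6., 0.], [6., 3.]])
perm = rng.permutation(6)
shuffled = ref[perm] + rng.normal(0.0, 1e-3, size=(6, 2))  # diffused labels

restored = relabel(shuffled, ref)
assert np.allclose(restored, ref, atol=1e-2)   # labels match the reference again
```

After relabeling, configurations that differ only by solvent exchange collapse onto the same point, which keeps the effective configuration space manageable.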

The training is initialized with examples from separate, disconnected simulations of the open and closed states, but in later stages, training by energy Eq. 1 dominates (supplementary materials, fig. S1, and table S1). The trained Boltzmann generator has learned a transformation of the complex configuration space density to a concentrated, 76-dimensional ball in latent space (Fig. 3C). Indeed, direct sampling from a 76-dimensional Gaussian in latent space and transformation via *F _{zx}* generates configurations in which all particles are placed without significant clashes and potential energies that overlap with the energy distribution of the unbiased MD trajectories (Fig. 3D). Also, realistic transition states that have not been included in any training data are sampled (Fig. 3D, middle).

To demonstrate the computation of thermodynamic quantities, we perform training by energy (Eq. 1) simultaneously for a range of temperatures (supplementary materials). Although the temperature changes the configuration space distribution in a complex way, it can be modeled as a simple scaling factor in the width of the Gaussian prior distribution in latent space (materials and methods). Then, we sample the Boltzmann generator for a range of temperatures and use simple reweighting to compute the free energies along the dimer distances. As shown in Fig. 3E, these temperature-dependent free energies agree precisely with extensive umbrella-sampling simulations that use bias potentials along the dimer distance (*1*; supplementary materials).
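The prior-scaling idea can be made concrete on a system where the temperature dependence is exact. For a harmonic energy u(x) = x²/2, the Boltzmann distribution at relative temperature τ is a Gaussian of variance τ, so an identity generator reaches any temperature simply by widening the prior (a toy sketch; real systems additionally need the reweighting described in the text):

```python
import numpy as np

rng = np.random.default_rng(5)

# For u(x) = x^2 / 2, the Boltzmann distribution at relative temperature
# tau is N(0, tau). With an identity generator F_zx(z) = z trained at
# tau = 1, other temperatures are reached by scaling the prior width.
def sample_at_temperature(tau, n=100_000):
    z = rng.normal(0.0, np.sqrt(tau), size=n)  # scaled Gaussian prior
    return z                                   # F_zx is the identity here

for tau in (0.5, 1.0, 2.0):
    x = sample_at_temperature(tau)
    assert abs(x.var() - tau) < 0.05           # matches Boltzmann variance
```

For anharmonic systems the scaled prior only approximates the hotter or colder ensemble, which is why the generated samples are reweighted afterward.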

We estimate that the MD simulation needs at least 10^{12} steps to spontaneously sample a single transition from closed to open state and back (supplementary materials), and ~100 such transitions would be needed to compute free-energy differences with the precision of the Boltzmann generator results shown in Fig. 3E. The total effort to train the Boltzmann generator is ~3 × 10^{7} energy evaluations, but then statistically independent samples can be drawn in one shot across the entire temperature range trained, resulting in at least seven orders of magnitude speed-up compared with MD.

As above, we perform linear interpolations between the latent space representations of open- and closed-dimer samples. A significant fraction of all pair interpolations results in low-energy pathways. The lowest-energy interpolation of 100 randomly selected pairs of end states is shown in Fig. 3F, representing a physically meaningful rearrangement of the dimer and solvent.

### Exploring configuration space

In the previous examples, Boltzmann generators were used to sample known regions of configuration space and compute statistics thereof. Here, we demonstrate that Boltzmann generators can help to explore configuration space. The basic idea is as follows: we construct an exploratory sampling method using an established sampling algorithm in latent space while simultaneously training the Boltzmann generator transformation using the configurations found so far.

We initialize the method with a (possibly small) set of configurations *X*. Training is done here by minimizing a symmetric loss function that combines training by energy and training by example on *X*; as *X* approaches an unbiased Boltzmann sample, the symmetric loss converges to a meaningful distance between the generated and the Boltzmann distributions (materials and methods). As an example, here we use Metropolis Monte Carlo in the Boltzmann generator latent space to update *X* (materials and methods). The step size is chosen adaptively but reaches the order of the latent space distribution width. Thus, large-scale configuration transitions in physical space can be made in a single Monte Carlo step.
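Metropolis Monte Carlo in latent space can be sketched with a toy generator (a fixed volume-scaling map with made-up parameters, not a trained network; its constant Jacobian cancels in the acceptance ratio, whereas a general flow would need the log-Jacobian included):

```python
import numpy as np

rng = np.random.default_rng(6)

def u(x):
    """Toy double-well energy with minima near x = (-1.8, 0) and (1.8, 0)."""
    return 0.25 * (x[0] ** 2 - 3.24) ** 2 + 2.0 * x[1] ** 2

def F_zx(z):
    return 1.8 * z        # toy volume-scaling "generator"; log|det J| constant

def metropolis_latent(n_steps=20_000, step=1.0):
    """Metropolis Monte Carlo in latent space: propose large jumps in z and
    accept by the Boltzmann weight of the back-transformed configuration."""
    z = np.array([1.0, 0.0])
    e = u(F_zx(z))
    traj = []
    for _ in range(n_steps):
        z_new = z + step * rng.normal(size=2)
        e_new = u(F_zx(z_new))
        if rng.random() < np.exp(min(0.0, e - e_new)):   # Metropolis rule
            z, e = z_new, e_new
        traj.append(F_zx(z))
    return np.array(traj)

traj = metropolis_latent()
# Latent steps of order of the prior width hop between both wells directly:
assert (traj[:, 0] > 1.0).any() and (traj[:, 0] < -1.0).any()
```

In the actual method, the transformation is trained on the growing sample set while the chain runs, so the latent geometry improves as new states are discovered.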

We now revisit the three previous examples and initialize *X* with only a single input configuration from the most stable state (Fig. 4, A, D, and G). The exploration method quickly fills the local metastable states and finds new metastable states within a few 10^{5} energy calls, i.e., orders of magnitude faster than direct MD (Fig. 4, B, E, and H). This demonstrates that Boltzmann generators sample new, previously unseen states with a significant probability, and that this ability can be turned into exploring configuration space when past samples are stored and reused for training.

The Metropolis Monte Carlo method causes the sample to converge toward the Boltzmann distribution. However, we do not need to wait for this method to converge because, with sufficient samples in the states of interest, the equilibrium free energies can be computed by reweighting as in Figs. 2 and 3 above. Although new states are found, the data-based part of the loss ensures that previously discovered states remain represented during training.

Due to the invertible transformation between latent and configuration space, any sampling method that involves reweighting or Monte Carlo acceptance steps can be reformulated in Boltzmann generator latent space and potentially yield enhanced performance.

### Complex molecules

We demonstrate that Boltzmann generators can generate equilibrium all-atom structures of macromolecules in one shot using BPTI in an implicit solvent model (Fig. 5 and supplementary materials). To train Boltzmann generators for complex molecular models, we wrapped the energy and force computation functions of the OpenMM simulation software (*26*) in the standard deep learning library TensorFlow (*27*).

Training a Boltzmann generator directly on the Cartesian coordinates resulted in large energies and unrealistic structures with distorted bond lengths and angles. This problem was solved by incorporating the following coordinate transformation in the first layer of the Boltzmann generator that is invertible up to rotation and translation of the molecule (Fig. 5A and materials and methods): the coordinates are split into Cartesian and internal coordinate sets. The Cartesian coordinates include heavy atoms of the backbone and the disulfide bridges. Cartesian coordinates are whitened, i.e., decorrelated and normalized, using a principal component analysis (PCA) of the input data. During whitening, the six degrees of freedom corresponding to global translation and rotation of the molecule are discarded. The remaining side-chain atoms are measured in internal coordinates (bond lengths, angles, and torsions with respect to parent atoms) and subsequently normalized.
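The whitening step for the Cartesian block can be sketched as follows (an illustrative PCA whitening on synthetic correlated data, not the paper's exact implementation; for an aligned molecule the six discarded lowest-variance components correspond to global translation and rotation):

```python
import numpy as np

rng = np.random.default_rng(7)

def pca_whiten(X, n_discard=6):
    """Whiten coordinates by PCA: decorrelate and normalize to unit variance,
    discarding the `n_discard` lowest-variance principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (len(X) - 1)
    evals, evecs = np.linalg.eigh(cov)            # eigenvalues ascending
    evals, evecs = evals[n_discard:], evecs[:, n_discard:]
    W = evecs / np.sqrt(evals)                    # whitening transformation
    return Xc @ W, (mean, W)

# Synthetic correlated "Cartesian coordinates" (500 frames x 30 dimensions):
X = rng.normal(size=(500, 30)) @ rng.normal(size=(30, 30))
Z, _ = pca_whiten(X)
assert Z.shape == (500, 24)                       # 6 degrees of freedom removed
assert np.allclose(Z.T @ Z / (len(Z) - 1), np.eye(24), atol=1e-6)
```

Because the whitened coordinates have zero mean and unit covariance, they are already close to the Gaussian prior, which is the simplification noted in the next paragraph.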

After this coordinate transformation, the learning problem is substantially simplified as the transformed input data are already nearly Gaussian distributed. We first demonstrate that Boltzmann generators can learn to sample heterogeneous equilibrium structures when trained with examples from different configurations. To this end, we generated six short simulations of 20 ns each, starting from snapshots of the well-known 1-ms simulation of BPTI produced on the Anton supercomputer (*28*). In the more common situation that no ultralong MD trajectory is available, the Boltzmann generator can be seeded with different crystallographic structures or homology models. To promote simultaneous sampling of high- and low-probability configurations, we included an RC loss using two slow collective coordinates that have been used earlier in the analysis of this BPTI simulation (supplementary materials) (*29*, *30*). Boltzmann generators for BPTI that do not use RCs are discussed in the subsequent section.

A trained Boltzmann generator with eight invertible blocks can sample all 892 atom positions (2676 dimensions) in one shot and produce locally and globally valid structures (Fig. 5C). The potential energies of samples exhibit significant overlap with the potential energy distribution of MD simulations (Fig. 5D), and thus samples can be reweighted for free-energy calculations. The probability distributions of most bond lengths and valence angles are almost indistinguishable from the distributions of the equilibrium MD simulations (Fig. 5E). The only exception is that the distributions of valence angles involving sulfur atoms are slightly narrower.

The trained Boltzmann generator learns to encode and sample significantly different structures (Fig. 5F). In particular, it generates independent one-shot samples of the near-crystallographic structure “X” (1.4-Å mean backbone RMSD to crystal structure), and the open “O” structure, which involves significant changes in flexible loops and repacking of side chains (Fig. 5, G and H). The X→O transition has been sampled only once in the millisecond Anton trajectory, which is consistent with the observation of a millisecond-timescale “major–minor” state transition observed in nuclear magnetic resonance spectroscopy (*31*). We note that X↔O transition states are not included in the Boltzmann generator training data.

Sampling such a transition and collecting statistics for it is challenging for any existing simulation method. Brute-force MD simulations would require several milliseconds of simulation data. To use enhanced sampling, an order parameter or RC able to drive the complex rearrangement shown in Fig. 5, G and H, would need to be found, but because BPTI has multiple states with lifetimes on the order of ~10 to 100 μs (*28*, *30*), the simulation time required for convergence would still be extensive. The computation of free-energy differences using Boltzmann generators will be discussed in the next section.

### Thermodynamics between disconnected states

We develop a reaction-coordinate–free approach to compute free-energy differences from disconnected MD or MCMC simulations in separated states, such as two conformations of a protein. As demonstrated above, this can be achieved by a single Boltzmann generator that simultaneously captures multiple metastable states and maps them to the same latent space *Z*, where they are connected via the Gaussian prior distribution (Figs. 2, B and H, and 3C). However, a more direct statistical mechanics idea that has been successfully applied to certain simple liquids and solids is to compute free-energy differences by relating to a tractable reference state, e.g., ideal gas or crystal (*32*–*34*). Here, we show that Boltzmann generators can turn this idea into a general method applicable to complex many-body systems.

Recall that the value of the training-by-energy loss (Eq. 1) equals the free-energy difference of transforming the Gaussian prior distribution into the generated distribution. If one Boltzmann generator is trained per metastable state, all generators share the same Gaussian prior as a tractable reference state, and the free-energy difference between two states can therefore be obtained from the difference of the corresponding converged loss values (materials and methods).
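This two-generator construction can be checked on an analytically solvable toy case: two one-dimensional harmonic wells, each with a perfectly trained affine generator mapping the shared Gaussian prior onto its Boltzmann distribution (a sketch under these toy assumptions, not the paper's training procedure):

```python
import numpy as np

rng = np.random.default_rng(8)

# Two disconnected "states": 1D harmonic wells with stiffnesses kA, kB.
# Each has exact free energy F = -log sqrt(2*pi/k), so
# dF = F_B - F_A = 0.5 * log(kB / kA).
kA, kB = 1.0, 4.0

def kl_loss(k, n=200_000):
    """Training-by-energy loss for the exact affine generator x = z/sqrt(k),
    which maps the standard Gaussian prior onto the Boltzmann distribution
    of a harmonic well with stiffness k."""
    z = rng.normal(size=n)
    x = z / np.sqrt(k)
    energy = 0.5 * k * x ** 2
    log_det = -0.5 * np.log(k)       # log|det J| of the affine map
    return np.mean(energy) - log_det

# Both losses share the same Gaussian reference, so their difference
# estimates the free-energy difference between the two wells:
dF_est = kl_loss(kB) - kl_loss(kA)
dF_exact = 0.5 * np.log(kB / kA)
assert abs(dF_est - dF_exact) < 0.02
```

The prior-independent constants cancel in the difference, which is why no reaction coordinate connecting the states is ever needed.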

We illustrate our method by computing temperature-dependent free-energy differences for the four systems discussed above, each using two completely disconnected MD simulations as input. Because the estimate of the free-energy difference is readily available from the value of the loss function, it can be conveniently tracked for convergence while the Boltzmann generators are trained (Fig. 6B and fig. S3).

For the two-dimensional systems (double well, Mueller potential), exact reference values for the free-energy differences can be computed, and the Boltzmann generator method recovers them accurately with a small statistical uncertainty over the entire temperature range (Fig. 6, C and D) using 10-fold less simulation data and an ~10-fold shorter training time than for the estimates using a single joint Boltzmann generator reported in Fig. 2 (supplementary materials).

For the solvated bistable dimer, we use simulations that are 10-fold shorter than for the single joint Boltzmann generator reported in Fig. 3 and train two independent Boltzmann generators at multiple temperatures. As a reference, three independent umbrella-sampling simulations were conducted at each of five different temperatures. Both predictions of the free-energy difference between open and closed dimer states are consistent and have overall similar uncertainties (Fig. 6E and fig. S3B; note that the uncertainty of umbrella sampling is strongly temperature dependent). Although umbrella sampling is well-suited for this system with a clear RC, the two–Boltzmann-generator method required 50 times fewer energy calls than the umbrella-sampling simulations at five temperatures and yet makes predictions across the full temperature range (supplementary materials).

Finally, we used the same method to predict the temperature-dependent free-energy difference of the “X” and “O” states in the BPTI protein. Sampling this millisecond transition 10 times by brute-force MD would take ~30 years on one of the GTX1080 graphics cards that are used for computations in this paper and would only give us the free-energy difference at a single temperature. Although speeding up this complex transition with enhanced sampling may be possible, engineering a suitable order parameter is a time-consuming trial-and-error task. Training two Boltzmann generators does not require any RC to be defined.

Here, we used the two–Boltzmann-generator method starting from two simulations of 20 ns each, which were started from selected frames of the 1-ms trajectory, but could generally be started from crystallographic structures or homology models. Conducting the MD simulations and training and analyzing the Boltzmann generators used a total of <3 × 10^{7} energy calls, which results in a converged free-energy estimate within a total of about 10 hours on a GTX1080 graphics card, i.e., about five orders of magnitude faster than the brute-force approach (Fig. 6F and fig. S3C). Although no reference for this free-energy difference in the given simulation model is known, the temperature profile admits basic consistency checks: the x-ray structure is identified as the most stable structure at temperatures below 330 K. The internal energy and entropy terms of the free-energy difference (Eq. 1) are both positive across all temperatures. Therefore, the free energy decreases at high temperatures as the entropic stabilization becomes stronger. A higher configurational entropy of the “O” state is consistent with its more open loop structure (compare Fig. 5, G and H) and the higher degree of fluctuations in the “O” state observed by the analysis in (*30*).

## Discussion and conclusion

Boltzmann generators can overcome rare-event sampling problems in many-body systems by generating independent samples from different metastable states in one shot. We have demonstrated this for dense and unstructured many-body systems with up to 892 atoms (over 2600 dimensions) that are placed simultaneously, with most samples having globally and locally valid structures and potential energies in the range of the equilibrium distribution. In contrast to other generative neural networks, Boltzmann generators produce unbiased samples, as the generated probability density is known at each sample point and can be reweighted to the target Boltzmann distribution. This feature directly translates into being able to compute free-energy differences between different metastable states.

In contrast to enhanced sampling methods that directly operate in configuration space, such as umbrella sampling or metadynamics, Boltzmann generators can sample between metastable states without any predefined RC connecting them. This is achieved by learning a coordinate transformation in which different metastable states become neighbors in the transformed space where the sampling occurs. If suitable RCs are known, then these can be incorporated into the training to sample continuous pathways between states, e.g., to compute a continuous free-energy profile along a reaction.

As in many areas of machine learning, the key to success is to choose a representation of the input data that supports the learning problem. For macromolecules, we have found a successful representation to be describing the backbone in Cartesian coordinates and all atoms that branch off from the backbone in internal coordinates. Additionally, normalizing these coordinates to mean 0 and variance 1 already makes their probability distribution close to a Gaussian, thus simplifying the learning problem considerably.

We have shown, in principle, how explicit solvent systems can be treated. For this, it is essential to build the physical invariances into the learning problem. Specifically, we need to account for permutation invariance: when two equivalent solvent molecules exchange positions, the potential energy of the system is unchanged and so is the Boltzmann probability.

We have demonstrated scaling of Boltzmann generators to thousands of dimensions. Generative networks in other fields have been able to generate photorealistic images with 10^{6} dimensions in one shot (*15*). However, for the present application, the statistical efficiency, i.e., the usefulness of these samples to compute equilibrium free energies, will decline with increasing dimension. Scaling to systems with hundreds of thousands of dimensions or more, such as solvated atomistic models of large proteins, can be achieved in different ways. Sampling of the full atomistic system may be addressed with a divide-and-conquer approach: In each iteration of such an approach, one would resample the positions of a cluster of atoms using the sum of potential energies between cluster atoms and all system atoms and then, e.g., perform Monte Carlo steps using these cluster proposals. Alternatively, Boltzmann generators could be used to sample lower-dimensional free-energy surfaces learned from all-atom models (*35*, *36*).

A caveat of Boltzmann generators is that, depending on the training method, they may not be ergodic, i.e., they may not be able to reach all configurations. Here, we have proposed a training method that promotes state space exploration by performing Monte Carlo steps in the Boltzmann generator’s latent space while training the network. This may be viewed as a general recipe: The whole plethora of existing sampling algorithms, such as umbrella sampling, metadynamics, and replica exchange, can be reformulated in the latent space of a Boltzmann generator, potentially leading to dramatic performance gains. Any such approach can always be combined with MD or MCMC moves in configuration space to ensure ergodicity.

Finally, the Boltzmann generators described here learned a system-specific coordinate transformation. The approach would become much more general and efficient if Boltzmann generators could be pretrained on certain building blocks of a molecular system, such as oligopeptides in solvent or a protein, and then reused on a complex system consisting of these building blocks. A promising approach is to adopt transferable featurization methods developed in the context of machine learning for quantum mechanics (*37*–*40*).

In summary, Boltzmann generators represent a powerful approach to addressing the long-standing rare-event sampling problem in many-body systems and open the door for new developments in statistical mechanics.

## Materials and methods

### Invertible networks

We used invertible networks with trainable parameters θ to learn the transformation between the Gaussian random variables **z** and the Boltzmann-distributed random variables **x**:

$$\mathbf{x} = F_{zx}(\mathbf{z}; \theta), \qquad \mathbf{z} = F_{xz}(\mathbf{x}; \theta) = F_{zx}^{-1}(\mathbf{x}; \theta)$$

Under such a transformation, the probability density of **z** is scaled by the absolute Jacobian determinant of the transformation. Below, we will omit the symbol θ and use the abbreviations

$$R_{zx}(\mathbf{z}) = \left|\det\frac{\partial F_{zx}(\mathbf{z})}{\partial \mathbf{z}}\right|, \qquad R_{xz}(\mathbf{x}) = \left|\det\frac{\partial F_{xz}(\mathbf{x})}{\partial \mathbf{x}}\right|$$

*Trainable invertible layers*

We use the RealNVP transformation as the trainable part of invertible networks (*20*). The main idea is to split the variables into two channels, $(\mathbf{x}_1, \mathbf{x}_2)$, and to employ two arbitrary (generally non-invertible) neural networks *S* and *T* that respectively scale and translate the second input channel as a function of the first:

$$\mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 \odot e^{S(\mathbf{x}_1)} + T(\mathbf{x}_1)$$

Because the first channel passes through unchanged, the transformation can be inverted exactly for any choice of *S* and *T*, and its Jacobian determinant is simply $e^{\sum_i S_i(\mathbf{x}_1)}$. Successive layers swap the roles of the two channels so that all variables are transformed (Fig. 1B).
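The coupling transformation just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; in particular, *S* and *T* are stand-in fixed random maps here rather than trained deep networks:

```python
# Minimal sketch of one RealNVP coupling layer (NumPy).
# S and T are stand-ins: fixed random maps instead of trained networks.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # dimension of each channel
Ws = 0.1 * rng.standard_normal((d, d))  # stand-in "scale network" weights
Wt = 0.1 * rng.standard_normal((d, d))  # stand-in "translation network" weights
S = lambda x1: np.tanh(x1 @ Ws)         # scale network (bounded for stability)
T = lambda x1: x1 @ Wt                  # translation network

def forward(x1, x2):
    """y1 = x1; y2 = x2 * exp(S(x1)) + T(x1); also return log|det J|."""
    s = S(x1)
    return x1, x2 * np.exp(s) + T(x1), s.sum()

def inverse(y1, y2):
    """Exact inverse for any S and T, since the first channel is untouched."""
    s = S(y1)
    return y1, (y2 - T(y1)) * np.exp(-s), -s.sum()

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
y1, y2, logdet = forward(x1, x2)
r1, r2, _ = inverse(y1, y2)
```

Note that invertibility and the cheap log-Jacobian ($\sum_i S_i(\mathbf{x}_1)$) hold by construction, which is what makes the density of generated samples computable.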

*PCA whitening layer*

We define a fixed-parameter layer “*W*” to transform the input coordinates into whitened principal coordinates. For systems with rototranslationally invariant energy, we first remove global translation and rotation by superimposing each configuration onto a reference configuration. We then perform PCA on the N input coordinates **X** by solving the eigenvalue problem

$$\mathbf{C} = \frac{1}{N}\bar{\mathbf{X}}^{\top}\bar{\mathbf{X}} = \mathbf{R}\,\boldsymbol{\Lambda}\,\mathbf{R}^{\top}$$

where $\bar{\mathbf{X}}$ are the mean-free coordinates, **R** contains the eigenvectors, and Λ the eigenvalues of the covariance matrix **C**. The output coordinates of the layer *W* are the whitened coordinates

$$\mathbf{z} = \boldsymbol{\Lambda}^{-1/2}\mathbf{R}^{\top}(\mathbf{x} - \bar{\mathbf{x}})$$

where $\bar{\mathbf{x}}$ is the mean configuration.
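The whitening step can be sketched on synthetic data as follows; the variable names mirror the text above and are our own, not the paper's code:

```python
# Sketch of PCA whitening: diagonalize the covariance, rescale so the
# output has identity covariance, and keep the exact inverse transform.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 6)) @ rng.standard_normal((6, 6))  # correlated coords

mean = X.mean(axis=0)
C = np.cov(X, rowvar=False)              # covariance of the input coordinates
Lam, R = np.linalg.eigh(C)               # eigenvalues Lam, eigenvectors R
W = R / np.sqrt(Lam)                     # whitening matrix R @ diag(Lam^{-1/2})
Z = (X - mean) @ W                       # whitened principal coordinates
X_back = Z * np.sqrt(Lam) @ R.T + mean   # inverse transformation (de-whitening)
```

Because the layer is linear with fixed parameters, its Jacobian determinant is a constant and the inverse is exact, matching the requirements of the invertible network.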

*Mixed coordinate transformation layer*

To treat macromolecules, we defined a new transformation layer “*M*” that transforms into mixed whitened Cartesian and normalized internal coordinates. We first split the coordinates into a Cartesian and an internal coordinate set. Each particle *i* in the internal set is assigned parent particles *j*, *k*, and *l*, and the Cartesian coordinates of particles *i*, *j*, *k*, and *l* are converted into a distance, an angle, and a dihedral. The Cartesian set is whitened with a *W* layer, and the internal coordinates are normalized to mean 0 and variance 1.
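The conversion of one particle's position into a (distance, angle, dihedral) triple relative to its parent particles can be sketched as below; the sign and ordering conventions are our own assumptions for illustration and may differ from the paper's code:

```python
# Sketch of the Cartesian -> internal coordinate conversion for one
# particle i with parents j, k, l (conventions are illustrative).
import numpy as np

def internal_coords(xi, xj, xk, xl):
    """Distance, angle, dihedral of particle i relative to parents j, k, l."""
    b0, b1, b2 = xi - xj, xk - xj, xl - xk
    d = np.linalg.norm(b0)                                # bond length i-j
    ang = np.arccos(b0 @ b1 / (d * np.linalg.norm(b1)))   # angle i-j-k
    bu = b1 / np.linalg.norm(b1)                          # unit j-k axis
    v = b0 - (b0 @ bu) * bu                               # remove axis components
    w = b2 - (b2 @ bu) * bu
    dih = np.arctan2(np.cross(bu, v) @ w, v @ w)          # dihedral i-j-k-l
    return d, ang, dih

# a planar trans (zigzag) test geometry
xi, xj, xk, xl = map(np.array, ([0.0, 1, 0], [0.0, 0, 0], [1.0, 0, 0], [1.0, -1, 0]))
d, ang, dih = internal_coords(xi, xj, xk, xl)
```

For this planar trans geometry the triple is (1, π/2, π); inverting the map (placing particle *i* from its parents and the triple) reconstructs the Cartesian position, which is what makes the *M* layer invertible particle by particle.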

The inverse transformation is straightforward: The Cartesian set is first restored by applying the inverse whitening transformation; the particles of the internal set are then placed one by one, first those whose parent particles are all in the Cartesian set, then particles with parents that have just been placed, etc. As for the *W* layer, the *M* layer is invertible up to global translation and rotation of the molecule that may have been removed during whitening. Additionally, we prevent noninvertibility in dihedral space by avoiding the generation of angle values outside the range $[-\pi, \pi)$.

The Jacobians of the *M* layer are computed using TensorFlow’s automatic differentiation methods.

### Training and using Boltzmann generators

The Boltzmann generator is trained by minimizing a loss functional of the following form:

$$J = w_{ML} J_{ML} + w_{KL} J_{KL} + w_{RC} J_{RC}$$

where the individual terms train the generator by example ($J_{ML}$), by energy ($J_{KL}$), and along a reaction coordinate ($J_{RC}$), and the *w*’s control their weights. Below, we will derive these terms in detail.

We call the “exact” distributions μ and the generated distributions *q*. In particular, $\mu_X$ denotes the Boltzmann distribution in configuration space and $q_X$ the distribution of configurations generated by the Boltzmann generator (*41*). The Boltzmann distribution has the form

$$\mu_X(\mathbf{x}) \propto \exp\!\left(-\frac{U(\mathbf{x})}{k_B T}\right)$$

with potential energy *U*(**x**) and temperature *T*. When we only have one temperature, we define the reduced energy $u(\mathbf{x}) = U(\mathbf{x})/(k_B T)$, such that $\mu_X(\mathbf{x}) \propto e^{-u(\mathbf{x})}$. In latent space, we sample **z** from the isotropic Gaussian distribution

$$\mu_Z(\mathbf{z}) \propto \exp\!\left(-\frac{\|\mathbf{z}\|^{2}}{2}\right)$$

*Latent KL divergence*

The KL divergence measures the difference between two distributions *q* and *p*:

$$KL(q \,\|\, p) = \mathbb{E}_{\mathbf{x} \sim q}\left[\log q(\mathbf{x}) - \log p(\mathbf{x})\right]$$

where the expectation value is over samples from *q*. Here, we minimize the difference between the probability densities predicted by the Boltzmann generator and the respective reference distribution. Using Eqs. 3, 4, and 10, the KL divergence in latent space is, up to an additive constant:

$$J_{KL} = \mathbb{E}_{\mathbf{z} \sim \mu_Z}\left[u\!\left(F_{zx}(\mathbf{z})\right) - \log R_{zx}(\mathbf{z})\right]$$

which trades off the generated energies *U* against the entropic factor $\log R_{zx}$, which favors a broad generated distribution.
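As a minimal sanity check of this loss, consider a toy case (not the paper's setup): a 1D Gaussian target with reduced energy u(x) = x²/2 and a linear “flow” x = a·z, for which $J_{KL} = a^2/2 - \log a$ can be estimated by sampling and is minimized at a = 1, where the generated and target distributions coincide:

```python
# Toy check of the latent KL loss J_KL = E_z[u(F_zx(z)) - log R_zx(z)]
# for a 1D linear flow F_zx(z) = a*z and Gaussian target u(x) = x^2/2.
import numpy as np

u = lambda x: 0.5 * x**2                    # reduced energy of a 1D Gaussian
rng = np.random.default_rng(5)
z = rng.standard_normal(100_000)            # latent Gaussian samples

def J_KL(a):
    # log R_zx = log a for the linear map x = a*z
    return np.mean(u(a * z)) - np.log(a)

a_grid = np.linspace(0.2, 3.0, 281)
best_a = a_grid[np.argmin([J_KL(a) for a in a_grid])]
```

The minimizer `best_a` lands at 1 (up to sampling and grid resolution), illustrating that minimizing $J_{KL}$ pulls the push-forward distribution onto the Boltzmann distribution.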

We can extend Eq. 13 to simultaneously train at multiple temperatures: sampling the latent variable with a variance scaled by the relative temperature and evaluating the reduced energies at the corresponding temperatures yields a sum of per-temperature KL losses (*16*).

*Reweighting and interpretation of latent KL as reweighting loss*

A simple way to compute quantitative statistics using Boltzmann generators is to use reweighting of probability densities by assigning each generated sample $\mathbf{x} = F_{zx}(\mathbf{z})$ the statistical weight

$$w(\mathbf{x}) \propto \frac{e^{-u(\mathbf{x})}}{q_X(\mathbf{x})} = \exp\!\left(-u(\mathbf{x}) + \frac{\|\mathbf{z}\|^{2}}{2} + \log R_{zx}(\mathbf{z})\right)$$

Equilibrium expectation values are then computed as weighted averages over the generated samples.

Using Eq. 15, it can be shown that minimization of the KL divergence (Eq. 13) is equivalent to maximizing the expected log weight of the generated samples, $\mathbb{E}_{\mathbf{z} \sim \mu_Z}\!\left[\log w\!\left(F_{zx}(\mathbf{z})\right)\right]$: a Boltzmann generator trained by energy thus directly optimizes its own reweighting efficiency.
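Reweighting can be illustrated with a toy 1D double well, using a fixed Gaussian as a stand-in for the generator density $q_X$ (for a trained flow, $\log q_X$ would come from the latent Gaussian density and the Jacobian, as in the weight formula above):

```python
# Toy check of reweighting: weight Gaussian "generated" samples by
# e^{-u}/q and compare a reweighted observable against quadrature.
import numpy as np

u = lambda x: (x**2 - 1.0)**2                  # bistable reduced energy
rng = np.random.default_rng(2)
sig = 1.5
x = rng.normal(0.0, sig, size=200_000)         # "generated" samples from q
log_q = -0.5 * (x / sig)**2 - np.log(sig * np.sqrt(2.0 * np.pi))

log_w = -u(x) - log_q                          # log of unnormalized weights e^{-u}/q
w = np.exp(log_w - log_w.max())                # stabilize before exponentiating
mean_x2 = (w * x**2).sum() / w.sum()           # reweighted <x^2>

grid = np.linspace(-4.0, 4.0, 8001)            # reference by direct quadrature
p = np.exp(-u(grid))
ref = (p * grid**2).sum() / p.sum()
```

The reweighted estimate of ⟨x²⟩ agrees with the quadrature reference even though the proposal density is unimodal, which is the essence of obtaining unbiased Boltzmann statistics from generated samples.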

*Configuration KL divergence and ML*

Likewise, we can express the KL divergence in **x** space, where we compute the divergence between the distribution of given equilibrium samples and the distribution generated by the network. Using Eqs. 3, 4, and 11, this yields, up to an additive constant:

$$J_{ML} = \mathbb{E}_{\mathbf{x} \sim \mu_X}\left[\frac{\|F_{xz}(\mathbf{x})\|^{2}}{2} - \log R_{xz}(\mathbf{x})\right]$$

When the expectation is approximated with a given set of configurations, minimizing this loss corresponds to maximum-likelihood (ML) training, i.e., training by example.
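Training by example can be illustrated with the simplest possible flow, a 1D affine map $x = \mu + e^{\log s} z$ fit by plain gradient descent on $J_{ML}$; the data, learning rate, and parameterization are illustrative choices, not the paper's:

```python
# Toy ML training: fit a 1D affine flow x = mu + exp(log_s)*z to data by
# minimizing J_ML = E_x[ z(x)^2/2 - log R_xz(x) ] with gradient descent.
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(2.0, 0.5, size=10_000)    # "equilibrium samples" x

mu, log_s = 0.0, 0.0                        # flow parameters, crude start
for _ in range(2000):
    z = (data - mu) * np.exp(-log_s)        # F_xz(x); log R_xz = -log_s
    # analytic gradients of J_ML = mean(z^2)/2 + log_s w.r.t. mu, log_s
    g_mu = -np.mean(z) * np.exp(-log_s)
    g_ls = 1.0 - np.mean(z**2)
    mu -= 0.05 * g_mu
    log_s -= 0.05 * g_ls
```

At the optimum the flow maps the data to a standard Gaussian, so `mu` recovers the sample mean and `exp(log_s)` the sample standard deviation.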

*Symmetric divergence*

The two KL divergences above can be naturally combined into the symmetric divergence

$$J = J_{ML} + J_{KL}$$

which trains the Boltzmann generator simultaneously by example and by energy.

*Reaction coordinate loss*

In some applications, we do not want to sample from the Boltzmann distribution but rather promote the sampling of high-energy states in a specific direction of configuration space, for example, to compute a free-energy profile along a predefined RC $r(\mathbf{x})$. To this end, we add a loss term $J_{RC}$ that penalizes concentration of the generated distribution along the RC, thereby promoting broad, approximately uniform sampling of the RC within a given range.

### Adaptive sampling and training

We define the following adaptive sampling method that trains a Boltzmann generator while simultaneously using it to propose new samples. The method has a sample buffer *X* that stores a predefined number of **x** samples. This number is chosen such that low-probability states of interest still have a chance to be part of the buffer when it represents an equilibrium sample. For the examples in Fig. 4, it was chosen to be 10,000 (double well, Mueller potential) and 100,000 (solvated particle dimer). *X* can be initialized with any candidates for configurations; in the examples in Fig. 4, it was initialized with only one configuration (copied to all elements of *X*); for the particle dimer system, we additionally added small Gaussian noise with standard deviation 0.05 nm to avoid the initial Boltzmann generator overfitting on a single point. The Boltzmann generator was initially trained by example, minimizing $J_{ML}$ on the configurations in *X*. Then, the following steps are iterated:

1. Draw a training batch of configurations **x** from the buffer *X*.
2. Update the Boltzmann generator parameters θ by training on the batch.
3. For each **x** in the batch, propose a Metropolis Monte Carlo step in latent space with step size *s*: compute $\mathbf{z} = F_{xz}(\mathbf{x})$, propose $\mathbf{z}' = \mathbf{z} + s\,\boldsymbol{\epsilon}$ with Gaussian noise $\boldsymbol{\epsilon}$, map back to $\mathbf{x}' = F_{zx}(\mathbf{z}')$, and accept or reject the proposed configuration **x**′ by a Metropolis criterion.
4. Replace the configurations in the buffer *X* for which proposals were accepted.
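The Metropolis step in latent space can be sketched in isolation; for illustration we take the flow to be the identity, so the latent-space target reduces to the Boltzmann density $e^{-u}$ of a 1D double well (with a real generator, the Jacobian terms would enter the acceptance ratio):

```python
# Stripped-down sketch of latent-space Metropolis Monte Carlo, assuming
# an identity flow so the target is simply exp(-u(z)) for a double well.
import numpy as np

u = lambda x: 4.0 * (x**2 - 1.0)**2         # bistable reduced energy, barrier at x = 0
rng = np.random.default_rng(3)
z, s = 1.0, 0.5                             # start in one well; latent step size s
traj = []
for _ in range(50_000):
    z_new = z + s * rng.standard_normal()
    if rng.random() < np.exp(u(z) - u(z_new)):   # Metropolis acceptance
        z = z_new
    traj.append(z)
traj = np.asarray(traj)
```

Even this plain chain visits both wells; the point of performing such moves in the latent space of a trained generator is that the metastable states become neighbors there, so well-to-well moves are accepted far more often than in configuration space.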

## Supplementary Materials

This is an article distributed under the terms of the Science Journals Default License.

## References and Notes

**Acknowledgments:** We are grateful to C. Clementi (Rice University), B. Husic, M. Sadeghi, M. Hoffmann (FU Berlin), and P. Shanahan (MIT) for valuable comments and discussions.

**Funding:** We acknowledge funding from the European Commission (ERC CoG 772230 “ScaleCell”), the Deutsche Forschungsgemeinschaft (CRC1114/A04, GRK2433 DAEDALUS), the MATH+ Berlin Mathematics Research Center (AA1x8, EF1x2), and the Alexander von Humboldt Foundation (postdoctoral fellowship to S.O.).

**Author contributions:** F.N., S.O., and J.K. designed and conducted research. F.N., J.K., and H.W. developed theory. F.N., S.O., and J.K. developed computer code. F.N. and S.O. wrote the paper.

**Competing interests:** The authors declare no competing interests.

**Data and materials availability:** Data and computer code for generating results of this paper are available at Zenodo (*42*).