A comprehensive exploration of BoltzGen's architecture: from molecular representations to diffusion-based generation of protein binders
We have entered the era of generative biology. While tools like AlphaFold solved the problem of predicting how nature folds proteins, the frontier has shifted to designing novel molecules that nature never evolved: proteins that can bind to specific targets, neutralize pathogens, or catalyze new reactions.
Unlike previous approaches that often simplified proteins to their backbones, BoltzGen is an all-atom diffusion model capable of "co-folding" binders and targets simultaneously.
It extends beyond just proteins, integrating DNA, RNA, and small molecules into a single, unified generative framework. But how does a neural network hallucinate a physically valid, atomic-level 3D structure from pure noise? In this article, we will tear apart the architecture to understand the machinery under the hood.
If you, like me, come mostly from an ML background, you may be used to thinking in terms of inputs and outputs and how to represent them effectively for a task. To do that here, we have to take a quick detour and learn the basics of the domain we are dealing with. In our case, the goal is to generate binders (proteins or peptides) given as input a biomolecular target (other proteins, small molecules, DNA/RNA) and some specified conditions.
Let's not be afraid of all this terminology. Learning in detail about all these entities might require you to follow an entire biochem curriculum, but I am going to try to introduce the bare minimum to make some sense of this.
Both proteins and peptides are fundamentally chains of the same basic units, called amino acids, which are essentially small organic molecules.
All amino acids share a fundamental structure built around a central carbon atom known as the alpha-carbon (α-carbon). This central carbon is bonded to four different groups: an amino group (NH2), a carboxyl group (COOH), a hydrogen atom, and a variable side chain (often called the R group).
While all amino acids share the same first three groups, what characterizes their chemical properties is the structure of the side chain, which varies from one amino acid to another.
In order to form a chain, these amino acids bind through a dehydration reaction (losing an H2O molecule). We call shorter chains (usually between 2 and 50 amino acids) peptides, while longer chains are called proteins.
Each unit of a chain is called a residue, and has two main parts: the backbone (the repeating core that links residues together) and the side chain (which determines the residue's identity).
At this point, we can formulate a preliminary hypothesis about how to encode this information. It would be really tempting to express a chain the same way we express a sentence: as a sequence of tokens! Could we just get away with defining something like:
# tripeptide glycylhistidyllysine (Gly-His-Lys residues sequence)
tripeptide = ["GLY","HIS","LYS"]
and then let the model train on a next token prediction task?
Right now, we are actually missing another key piece of information that we need to model: these are not simple sequences of residues; each of the atoms making up those residues has a particular position in 3D space. Modelling this is of fundamental importance because binding occurs in physical space, where the geometry of both binder and target is essential.
Now that we've established the dual nature of our problem (continuous 3D coordinates and discrete amino acid labels), we face a challenge: how do we handle this duality in a single model? The solution is the 14-atom representation.
Here's how it works: each residue is represented by 14 atoms, matching the atom count of the largest standard residue. The first 4 are always the backbone atoms (N, Cα, C, O), while the remaining 10 slots hold the side-chain atoms and, for smaller residues, "virtual atoms" that act as markers. The model signals which amino acid it wants by placing a specific number of these virtual atoms directly onto the backbone atoms. We then decode the amino acid identity by counting how many virtual atoms land on each backbone position (within 0.5 Å).
For example: threonine is signaled by placing 3 virtual atoms on the backbone nitrogen and 4 on the backbone oxygen. Proline uses 7 atoms on the oxygen. Glycine uses 10 on the oxygen. The remaining atoms that aren't used as markers become the actual side chain.
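To make the decoding rule concrete, here is a toy sketch of it (not the actual BoltzGen implementation; the marker table covers only the three examples above):

```python
import numpy as np

# Toy decoding of the 14-atom representation: rows 0-3 are the backbone
# (N, CA, C, O); the other 10 slots hold side-chain atoms and/or virtual
# "marker" atoms placed on top of backbone positions.
# (markers_on_N, markers_on_O) -> residue; only the three examples above.
MARKER_TABLE = {
    (3, 4): "THR",   # threonine: 3 markers on N, 4 on O
    (0, 7): "PRO",   # proline: 7 markers on O
    (0, 10): "GLY",  # glycine: 10 markers on O
}

def decode_residue(atoms14: np.ndarray) -> str:
    """atoms14: (14, 3) array of coordinates; first four rows are N, CA, C, O."""
    backbone_n, backbone_o = atoms14[0], atoms14[3]
    rest = atoms14[4:]
    # count virtual atoms sitting on a backbone position (within 0.5 Å)
    on_n = int((np.linalg.norm(rest - backbone_n, axis=-1) < 0.5).sum())
    on_o = int((np.linalg.norm(rest - backbone_o, axis=-1) < 0.5).sum())
    return MARKER_TABLE.get((on_n, on_o), "UNK")
```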
This clever encoding lets the model work entirely in continuous 3D space, avoiding the messy problem of mixing discrete and continuous representations. It enables efficient joint training for both structure prediction and sequence design. To see this in action, I prepared a minimal example from the original codebase in this notebook.
Recalling what we introduced at the beginning, we are interested in designing binders that interact with a defined target. BoltzGen allows non-designed inputs that can be proteins (or protein sequences), nucleic acids (DNA/RNA), and small molecules. Being an all-atom model, it can take detailed atomic-level information such as atom positions, element types, and charges.
Raw information for these entities is typically stored in PDB or mmCIF formats. Let's take a look at an example description of a small protein chain in PDB format:
ATOM 1 N GLY A 1 25.864 33.665 -2.618 ... N
ATOM 2 CA GLY A 1 26.039 32.227 -2.528 ... C
ATOM 3 C GLY A 1 27.247 31.766 -3.284 ... C
ATOM 4 O GLY A 1 27.359 32.164 -4.437 ... O
ATOM 5 N ALA A 2 28.140 30.984 -2.678 ... N
ATOM 6 CA ALA A 2 29.336 30.435 -3.271 ... C
ATOM 7 C ALA A 2 30.455 31.439 -3.133 ... C
ATOM 8 O ALA A 2 30.334 32.498 -3.611 ... O
ATOM 9 CB ALA A 2 29.623 29.088 -2.639 ... C
...
Taking a look at this we can see that a chain is described as a long list of atoms. Let's break down what we're seeing:
- ATOM: tells us this is an atom record.
- N, CA, C, CB: the specific atom names (e.g., Nitrogen, Carbon-alpha, Carbon-beta).
- GLY, ALA: the residue (or "token") names (Glycine, Alanine).
- A: the chain ID.
- 1, 2: the residue (token) numbers.
- 25.864 33.665 -2.618: the X, Y, and Z coordinates in 3D space.
- N, C, O: the element type.

Once we have defined our inputs and their nature, it's finally time to talk about machine learning. The first piece we need is an encoder, which in BoltzGen is called the trunk. The trunk module is responsible for creating embeddings for designed and non-designed inputs, together with the conditioning specifications. Within the trunk, we have a multi-stage encoding approach that aims at extracting meaningful information at different levels. Let's now see how it works in detail.
When studying a certain molecule, we can derive meaningful information at different scales, from the individual atomic level to the interactions between residues. It makes sense, then, to design an encoder that extracts information in a hierarchical way.
In BoltzGen, inputs are handled by an Input Embedder model, which is made up of two core modules: an atom encoder and an atom attention encoder.
The Atom Encoder takes the raw features for each atom and returns several outputs: single-atom embeddings (q in the snippet below), per-atom conditioning features (c), pairwise atom features (p), and to_keys, which is passed along to the attention stage.
# Compute raw input embedding
q, c, p, to_keys = self.atom_encoder(feats)
# Project pairwise features
atom_enc_bias = self.atom_enc_proj_z(p)
After deriving representations at the atomic level, it's time to go up the hierarchy and model the interaction between residues. This operation is done by the Atom Attention Encoder in two stages: refining the atom-level representations with attention, and aggregating them into per-token embeddings.
# a are the token embeddings from the attention encoder
a, _, _, _ = self.atom_attention_encoder(
feats=feats,
q=q,
c=c,
atom_enc_bias=atom_enc_bias,
to_keys=to_keys,
)
We already discussed that the generation procedure can be conditioned on a number of different specifications. We can steer generation by specifying:
The conditioning process in the trunk is fairly simple and is computed only at the initial stage. All of the specs are embedded using either nn.Embedding modules or linear layers. Once embedded, this information is added to the token embeddings:
# condition the token embeddings
s = (
a
+ self.res_type_encoding(res_type)
+ self.msa_profile_encoding(torch.cat([profile, deletion_mean], dim=-1))
)
if self.add_method_conditioning:
s = s + self.method_conditioning_init(feats["method_feature"])
# ... sum of all other possible conditions
Let's make a small recap. At this stage we have computed token-level embeddings $s$ that carry atom-level structure together with all the conditioning information.
Building the pairwise representation is a two-stage process, where we first initialize pairwise features $z_{ij}$ and then refine them with the Pairformer module described below.
As a first step, starting from our token embeddings $s_i$, we build a pairwise feature matrix $z$ through an outer sum of linear projections: $z_{ij} = \mathrm{Linear}_1(s_i) + \mathrm{Linear}_2(s_j)$. As you can see, after applying the outer sum we obtain a matrix where each entry $z_{ij}$ mixes information from token $i$ and token $j$.
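Here's a minimal sketch of this outer-sum initialization (module and dimension names are illustrative, not BoltzGen's actual code):

```python
import torch
import torch.nn as nn

class PairwiseInit(nn.Module):
    """Illustrative outer-sum initialization of pairwise features."""
    def __init__(self, token_dim: int, pair_dim: int):
        super().__init__()
        self.proj_i = nn.Linear(token_dim, pair_dim, bias=False)
        self.proj_j = nn.Linear(token_dim, pair_dim, bias=False)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (num_tokens, token_dim) -> z: (num_tokens, num_tokens, pair_dim)
        # z[i, j] mixes information from token i and token j.
        return self.proj_i(s)[:, None, :] + self.proj_j(s)[None, :, :]
```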
We now go a step further and enrich this representation with geometric and topological information for each pair. This is done by adding relative position encodings, derived from the input features, to the pairwise feature matrix. At this stage we add information about:
Now we proceed to add connectivity information by deriving a binary adjacency matrix token_bonds, where the entry token_bonds[i,j] is 1 if tokens i and j are covalently bonded, and 0 otherwise.
This adds information about bond type (single, double, triple, aromatic, etc.) via an embedding layer.
In order to steer conditional generation, the user is allowed to enforce specific distances between two residues, and to specify whether these residues should be in one of three states:
This information is computed using Fourier embeddings and added to our pairwise feature matrix.
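To give a feel for it, here's a hedged sketch of what embedding a target distance with Fourier features can look like; the frequency choice and dimensions are arbitrary here:

```python
import torch

def fourier_embed(dist: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Embed scalar distances (in Å) with sines/cosines at fixed frequencies."""
    freqs = 2.0 ** torch.arange(num_freqs)                   # (F,)
    angles = dist[..., None] * freqs                         # (..., F)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., 2F)
```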
The following image shows schematically the initialization of pairwise features: starting from individual token embeddings, we create the pairwise matrix and condition it to obtain the final feature matrix.
As you might have noticed, we are slowly building up representations at increasingly higher levels of abstraction. We started from single atoms, moved to tokens, and arrived at pairs of tokens. What we are still missing at this point is a more complete, 3D-aware understanding of these relationships. The initial pairwise feature matrix is not enough to capture this, as it only blindly looks at pairs of tokens.
For this reason we use a Pairformer module, whose job is to take this basic information and build a sophisticated, geometrically-aware understanding of the entire molecular complex. It does so by iteratively encouraging the representations to respect the triangle inequality.
Let's use an example to explain this rule: if I told you that city A is 10 km from city B, and city B is 10 km from city C, would you believe that the distance from A to C is 100 km? The short answer is no, for the simple reason that the shortest path between two points is always a straight line: the distance between A and C cannot be longer than the path from A to C passing through B (at most 20 km in this case).
The problem right now is that our pairwise feature matrix might encode inconsistent information that violates this property, because we only looked at pairs of tokens in isolation.
My second question for you now is: why should we care about ensuring that this inequality is satisfied among the triplets in our representations?
Because the triangle is the smallest, simplest polygon: it's the bare minimum unit needed to introduce spatial logic. What we expect is that by enforcing this rule for all triplets, the model can compositionally build a representation of the whole 3D geometric structure that actually makes sense.
Let's now get to how the Pairformer works. But first, I want to show you a picture of how a triplet looks from the pairwise-matrix perspective and from the graph perspective.
Notice that for each pair of nodes $(i, j)$ in the graph there is a corresponding entry $z_{ij}$ in the pairwise matrix, so a triplet of tokens $(i, j, k)$ corresponds to the three entries $z_{ij}$, $z_{ik}$ and $z_{jk}$.
What the Pairformer does is apply a stack of "reasoning layers". Each layer in this stack has two main jobs.
First, it updates the pairwise matrix $z$ using triangular updates and triangular attention: each pair $(i, j)$ is refined by looking at every third token $k$, i.e., at the edges $(i, k)$ and $(k, j)$ that close the triangle.
Second, it updates the single token representations $s$ with attention whose logits are biased by the corresponding pairwise entries, keeping the sequence-level view consistent with the geometric one.
This whole two-part process (updating pairs, then updating tokens) forms one single Pairformer layer. This block is stacked on top of itself multiple times, allowing the model to iteratively refine its understanding of the 3D geometry until it represents a physically plausible structure.
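As a rough sketch of what one such triangular refinement can look like, here is a stripped-down "outgoing" multiplicative update (the real Pairformer also uses gating and triangle attention, omitted here):

```python
import torch
import torch.nn as nn

class TriangleUpdateOutgoing(nn.Module):
    """Simplified triangular multiplicative update: every pair (i, j)
    is refined using the 'third sides' of all triangles through k."""
    def __init__(self, pair_dim: int, hidden_dim: int = 32):
        super().__init__()
        self.norm = nn.LayerNorm(pair_dim)
        self.a_proj = nn.Linear(pair_dim, hidden_dim)
        self.b_proj = nn.Linear(pair_dim, hidden_dim)
        self.out_proj = nn.Linear(hidden_dim, pair_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (n_tokens, n_tokens, pair_dim)
        z_norm = self.norm(z)
        a = self.a_proj(z_norm)  # information along edges (i, k)
        b = self.b_proj(z_norm)  # information along edges (j, k)
        # combine over the shared third token k to update pair (i, j)
        update = torch.einsum("ikc,jkc->ijc", a, b)
        return z + self.out_proj(update)
```

Each entry z[i, j] is updated with evidence gathered from every triangle it participates in, which is the mechanism that pushes the pairwise matrix toward geometric consistency.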
We have now established a complete overview of the processing performed by the trunk. This module encodes physical and context-aware features, transforming raw atomic data into rich token embeddings and pairwise representations.
Operating at this atomic level enables the model to unify a diverse range of input modalities within a single framework. These extracted features serve as the critical conditioning signal used to steer the generative diffusion process, which we're going to discuss in the next section.
Let's now get to the core functionality of BoltzGen, which is designing proteins and peptides that bind to given biomolecular targets. The generative process is based on the paradigm of diffusion models: our model starts from purely noisy atomic coordinates and iteratively denoises them until we reach a final generated structure.
While the most traditional approaches are based on docking (given existing structures, predict how a binder fits onto the target), one of the most notable aspects of this generation process is that our model is trained on a co-folding task. This means the model generates the structures of both the target and the binder at the same time.
The key insight is that by forcing the model to reason about sequence and structure together at every step, the resulting designs are more likely to be physically coherent and plausible. Moreover, the model should learn to capture all-atom interactions: this is critical because it allows the model to learn how the sequence (which determines the sidechains) can influence the backbone structure, and vice-versa.
As we already discussed, the generation process is driven by a diffusion module. I will not go too deep into explaining how diffusion models work in general, as there are plenty of good resources already (I really like this video by Depth First). In this section I'd rather describe how the standard diffusion process has been adapted to our molecular generation task.
What happens at a high level is that our model, starting from purely noisy atomic coordinates for both the target and the binder, learns to reconstruct both structures at the same time via a multi-step denoising process, steered by all the meaningful information embedded by the trunk.
Mathematically, given a training sample expressed as 3D atomic coordinates $x_0 \in \mathbb{R}^{N \times 3}$, we increasingly add noise for $t = 1, \dots, T$ by sampling $x_t = x_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, where the noise level $\sigma_t$ grows with $t$. The goal of our denoiser module $D_\theta$ is to recover the clean coordinates, i.e., $D_\theta(x_t, \sigma_t) \approx x_0$.
At each denoising step, sampling from the denoiser follows the EDM framework proposed by Karras et al., where at each step, before querying the denoiser, we add a bit of random noise to the previous sample.
This may sound counterintuitive, but it's one of the key ingredients for improving the quality of our generation. This kind of "shake" given to the denoising trajectory helps with two things: bumping the sample back onto a good path if the model starts drifting toward an incorrect route, and, most of all, encouraging our model to explore diverse protein structures.
This procedure is regulated by two scaling factors: one controlling how much the noise level is raised before denoising, and one scaling the freshly injected noise.
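To make the mechanics concrete, here is a hedged sketch of one EDM-style sampling step; the parameter names gamma and noise_scale are illustrative stand-ins and may not match BoltzGen's exact factors:

```python
import torch

def edm_step(x: torch.Tensor, sigma: float, sigma_next: float,
             denoiser, gamma: float = 0.2, noise_scale: float = 1.0):
    """One EDM-style sampling step with stochastic 'churn'."""
    # 1. Shake: bump the noise level up and inject fresh noise.
    sigma_hat = sigma * (1.0 + gamma)
    extra = noise_scale * (sigma_hat**2 - sigma**2) ** 0.5
    x_hat = x + extra * torch.randn_like(x)
    # 2. Denoise: estimate the clean coordinates, then step toward them.
    x0_pred = denoiser(x_hat, sigma_hat)
    d = (x_hat - x0_pred) / sigma_hat          # score-like direction
    return x_hat + (sigma_next - sigma_hat) * d
```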
This is another clever trick implemented in the generation phase, which comes from an observation by the authors: the most critical part of the design happens between 60% and 80% of the denoising process. Within this window, the model makes the actual design decision of the amino acid type for each residue.
The natural consequence was to dilate time in this window. This means that if we define a number of denoising steps $T$, the schedule is warped so that a disproportionate share of those $T$ steps falls inside the 60-80% window, giving the model finer temporal resolution exactly where the design decisions are made.
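One simple way to implement such a dilation (purely illustrative; the authors' exact warp may differ) is to remap a uniform schedule so that more steps land in the critical window:

```python
import numpy as np

def dilated_schedule(num_steps: int, lo: float = 0.6, hi: float = 0.8,
                     density: float = 3.0) -> np.ndarray:
    """Return `num_steps` progress values in [0, 1], with `density`x more
    steps placed inside the window [lo, hi]."""
    # fine grid of progress values and a step-density over them
    p = np.linspace(0.0, 1.0, 1001)
    w = np.where((p >= lo) & (p <= hi), density, 1.0)
    cdf = np.cumsum(w)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])
    # place steps at equal quantiles of this density: more steps land
    # where the density (and hence the CDF slope) is high
    u = np.linspace(0.0, 1.0, num_steps)
    return np.interp(u, cdf, p)
```

Mapping these warped progress values onto the noise schedule means more denoiser evaluations are spent exactly in the window where the residue identities are decided.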
With the model's architecture fully defined, we can now summarize the operational workflow. BoltzGen takes a fixed target and a binder specification as inputs, encoding them into rich, property-aware structural embeddings. In the subsequent design phase, the process begins with atom coordinates initialized as random noise. The model then iteratively denoises them, using the conditioning embeddings as a guide to progressively reveal the final binder structure.
Now that we have seen the various parts that compose the model, it's time to talk about how it's trained. As we discussed, we want the model to learn how to behave given a diverse set of conditions.
To achieve this, the model isn't just trained on one specific problem (like only predicting a structure), but on a mixture of different tasks and conditions all at once.
This "jack-of-all-trades" approach is what makes BoltzGen a universal model. During training, the pipeline randomly samples a known biomolecular structure from its database, crops it, and then "pretends" parts of it are unknown. This setup can create several different types of problems for the model to solve.
The model learns its versatility by being randomly assigned one of several key tasks for each training example. The main tasks include:
By constantly switching between these tasks, BoltzGen learns the fundamental rules of protein physics, folding, and interaction simultaneously, all within a single model.
On top of the general task, the model is also trained to obey a rich set of specific conditions. This is what gives the user precise control over the final output. During training, these conditions are randomly applied to the known structures:
The loss used to train BoltzGen ensures that our model learns to reconstruct the correct molecular structure at different scales, and takes the name of diffusion objective.
The diffusion objective computes measures on the structure of the prediction with respect to the ground truth. Let's start by defining our denoiser's prediction as $\hat{x}$ and the ground-truth atomic coordinates as $x$.
The first measure we compute is the Mean Squared Error between the model's predicted atomic coordinates and the true ones:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} w_i \left\lVert \hat{x}_i - \tilde{x}_i \right\rVert^2$$

where $\tilde{x}$ is the ground truth after rigid alignment to the prediction and $w_i$ is a per-atom weight, both explained below.
This is a fairly standard measure, but we have to focus on two things. First, before computing the MSE, a rigid alignment algorithm is applied to find the optimal rototranslation that best superposes the two structures. The reason for this is that, since we're in 3D, our model might predict the correct structure but rotated or translated with respect to the ground truth; this is why we compare against an aligned version $\tilde{x}$ of the ground truth. Second, we apply a per-atom weight $w_i$, so that some atoms contribute more to the loss than others.
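To illustrate these two details, here is a generic sketch of a rigid (Kabsch) alignment followed by a per-atom weighted MSE; this is not the exact training code:

```python
import numpy as np

def kabsch_align(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Rigidly align ground truth y onto prediction x (Kabsch algorithm)."""
    xc, yc = x - x.mean(0), y - y.mean(0)
    u, _, vt = np.linalg.svd(yc.T @ xc)
    d = np.sign(np.linalg.det(u @ vt))      # avoid reflections
    r = u @ np.diag([1.0, 1.0, d]) @ vt
    return yc @ r + x.mean(0)

def weighted_mse(x_pred: np.ndarray, x_true: np.ndarray,
                 w: np.ndarray) -> float:
    """Per-atom weighted MSE after optimal rigid alignment."""
    aligned = kabsch_align(x_pred, x_true)
    return float(np.average(((x_pred - aligned) ** 2).sum(-1), weights=w))
```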
The second component of our diffusion objective is the Bond Loss. Here we encourage the model to generate correct bond lengths. For all the pairs of atoms that are bonded in the ground truth, we compare their distance with the distance in the predicted structure. This ensures the local atomic geometry is chemically correct:

$$\mathcal{L}_{\text{bond}} = \frac{1}{|\mathcal{B}|} \sum_{(i,j) \in \mathcal{B}} \left( \lVert \hat{x}_i - \hat{x}_j \rVert - \lVert x_i - x_j \rVert \right)^2$$

where $\mathcal{B}$ is the set of atom pairs bonded in the ground truth.
Third and last component is the Smooth lDDT Loss. While before we focused on individual bond distances, now we check whether the local environment around each atom is correct. We define a region cutoff and compare all distances between neighbors in that region. We call it smooth because it gives partial credit (being 0.6 Å off is penalized less than being 4.1 Å off):

$$\mathcal{L}_{\text{lDDT}} = 1 - \frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \frac{1}{4} \sum_{c \in \{0.5,\, 1,\, 2,\, 4\}} \sigma\!\left( c - \left| \hat{d}_{ij} - d_{ij} \right| \right)$$

where $\mathcal{P}$ is the set of neighboring atom pairs within the cutoff, $d_{ij} = \lVert x_i - x_j \rVert$ and $\hat{d}_{ij} = \lVert \hat{x}_i - \hat{x}_j \rVert$ are the ground-truth and predicted distances, and $\sigma$ is the sigmoid function granting partial credit around each threshold.
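A compact sketch of this computation (the cutoff value and masking are illustrative):

```python
import torch

def smooth_lddt_loss(d_pred: torch.Tensor, d_true: torch.Tensor,
                     cutoff: float = 15.0) -> torch.Tensor:
    """Illustrative smooth lDDT: partial credit via sigmoids at 0.5/1/2/4 Å.
    d_pred, d_true: (N, N) pairwise distance matrices."""
    mask = (d_true < cutoff) & (d_true > 0)     # local neighborhood pairs
    delta = (d_pred - d_true).abs()
    thresholds = torch.tensor([0.5, 1.0, 2.0, 4.0])
    # credit in [0, 1]: near 1 when the error is well below a threshold
    credit = torch.sigmoid(thresholds - delta[..., None]).mean(-1)
    return 1.0 - credit[mask].mean()
```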
The final formulation of our diffusion objective is:

$$\mathcal{L}_{\text{diffusion}} = \lambda(\sigma_t) \left( \mathcal{L}_{\text{MSE}} + \alpha_{\text{bond}} \, \mathcal{L}_{\text{bond}} \right) + \mathcal{L}_{\text{lDDT}}$$

Notice how here we have a shared weighting $\lambda(\sigma_t)$ for the MSE and bond losses, which depends on the noise level of the current step following the EDM parameterization.
We have seen that BoltzGen was structured to work on the widest possible set of inputs and conditions. A key usability feature of the design process is the Design Specification Language, which allows the user to tell the model exactly what to design, what the target is, and which rules to follow. The specification is a YAML file where we define entities and constraints. Let's now break down a couple of examples from the paper:
Our goal here is to design a small, cyclic peptide (between 8 and 18 residues long) that binds to the specific chain A of the protein streptavidin.
Notice how here we are adding even more specs than just residue length and binding site. This example also shows how we can characterize parts of the input: in this case we are saying that our target has a flexible group, meaning that the model will take this into account, generating a better-fitting binder while simultaneously predicting how the target will adapt.
Although BoltzGen is the core generator of candidate designs, it is best seen as one part of a larger pipeline: a funnel that applies a series of increasingly rigorous computational checks to filter the designs down to a small, diverse, and high-quality set of candidates ready for expensive and time-consuming wet-lab validation.
The whole BoltzGen pipeline is a 6-step process, which we will now look at in detail:
The BoltzGen pipeline represents a significant shift from isolated generative models to a comprehensive, end-to-end framework for biomolecular design. By integrating a powerful all-atom generative model with a rigorous filtering and ranking system, BoltzGen bridges the gap between computational generation and real-world applicability. Its ability to design across diverse modalities (from nanobodies to cyclic peptides) and its success in targeting novel proteins demonstrate its potential as a general-purpose tool for drug discovery.
BoltzGen represents a significant leap forward in the field of generative biology. Through extensive wet-lab validation (achieving success rates as high as 80% on complex targets), it has demonstrated how all-atom diffusion models can effectively design binders across a diverse set of modalities, including proteins, nucleic acids, and small molecules. Furthermore, by releasing the model and its comprehensive filtering pipeline as open source, the project democratizes access to these advanced tools, fostering a collaborative environment that accelerates the pace of discovery.
However, we are far from considering drug design a solved problem. While we now possess the capability to computationally generate thousands of promising candidates in a quick and integrated pipeline, the bottleneck has shifted to the laborious and costly phase of downstream validation. The rigorous demands of clinical trials remain a significant hurdle that must be cleared to translate designs into therapies. There is immense potential for artificial intelligence to bridge this gap, specifically through technologies like virtual cells and predictive modeling for clinical outcomes, which could drastically reduce the time and cost of bringing a drug to market.
Finally, there remains substantial room for the architectural optimization of these models. As detailed in the technical specifications, BoltzGen currently relies on a heavy "trunk" architecture for complex multi-level feature extraction, employing intricate components like PairFormer stacks and triangular attention mechanisms. This approach is now being challenged by emerging models like SimpleFold, which aim for simpler, general-purpose backbones. This tension raises a critical question for the future of the field: are we witnessing another instance of the "bitter lesson," where simpler, scalable architectures will eventually supersede complex, domain-specific engineering?
In short, BoltzGen shows what is possible today, but also highlights the vast landscape of challenges that remain. Improving controllability, accelerating downstream validation, simplifying the architecture, and better integrating predictive models of biological function will all be crucial steps toward making AI-driven drug design not just powerful, but practical.
I would like to thank @mozzarellapesto and @tensorqt for the friendly help in the review process.