PLAID: Repurposing Protein Folding Models for Latent-Diffusion Generated Multimodal Proteins

Source: http://bair.berkeley.edu/blog/2025/04/08/plaid (bair.berkeley.edu)

TL;DR

  • PLAID is a multimodal generative model that simultaneously generates protein 1D sequence and 3D structure by learning the latent space of protein folding models. PLAID article
  • It supports compositional prompts for function and organism, and is trained on sequence databases that are 2–4 orders of magnitude larger than structure databases. PLAID article
  • The method uses diffusion over a latent space of a folding model and decodes structure with frozen weights from the folding model, here using ESMFold as the decoder. PLAID article
  • CHEAP (Compressed Hourglass Embedding Adaptations of Proteins) reduces embedding dimensionality to tackle large latent spaces and regularization challenges. PLAID article
  • The work demonstrates how multimodal generation can be guided by functional and organism prompts and discusses potential extensions to more complex molecular contexts. PLAID article

Context and background

The PLAID project arrives at a moment of heightened attention to AI in biology, following the 2024 Nobel Prize recognition of AlphaFold2. In this context, PLAID asks what comes after protein folding: how to move from generating proteins to generating useful proteins in a controlled, multimodal fashion. The work develops a method that samples from the latent space of protein folding models to generate new proteins, producing sequence and structure in a single unified process. PLAID article

What makes PLAID notable is its emphasis on multimodal co-generation: unlike prior models that focus on a single modality (sequence or structure), PLAID produces both a discrete sequence and continuous all-atom coordinates in one cohesive generation. The motivation is that simply generating proteins is insufficient without mechanisms to control the outputs for practical use. PLAID article

What’s new

PLAID introduces a diffusion-based approach that operates in the latent space of a protein folding model. Training uses only sequence data to learn embeddings; structure is decoded at inference time by applying frozen weights from the folding model. The authors use ESMFold, a successor to AlphaFold2 that replaces its retrieval step with a protein language model. The key idea is to sample embeddings that correspond to valid proteins and then decode both the sequence and the full atomic structure from those embeddings. PLAID article

A central challenge is that the latent spaces of transformer-based models are very large and require substantial regularization. To address this, PLAID proposes CHEAP (Compressed Hourglass Embedding Adaptations of Proteins), a compression model for the joint embedding of protein sequence and structure. Combining diffusion on a latent space with a compressed embedding space makes training and inference more tractable while preserving structural and sequence diversity. PLAID article

This design leverages the structural understanding embedded in pretrained protein folding models as a prior, analogous to how vision-language-action models reuse priors from large multimodal training. Probing the latent space of ESMFold, and by extension many transformer-based models, the authors find it is highly compressible, with notable activation patterns across layers; this observation informs both CHEAP and the overall sampling strategy. PLAID article

Beyond sequence-to-structure generation, PLAID is framed as a general recipe for multimodal generation: wherever a predictor maps an abundant modality to a scarce one (e.g., sequence to structure), the same latent-diffusion paradigm enables joint generation across modalities. As protein design advances into more complex contexts, including interactions with nucleic acids and ligands, the authors suggest the method could be extended accordingly. They invite collaboration to extend the method or test it in wet-lab settings, and provide BibTeX references and pointers to the preprints and codebases for PLAID and CHEAP. PLAID article
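
To make the pipeline concrete, here is a minimal, illustrative PyTorch sketch of the sampling-and-decoding loop. Everything in it is a stand-in: `DenoisingNet`, `cheap_decode`, and the two decoding heads are toy placeholders for components the article names (the diffusion backbone, CHEAP, and the frozen ESMFold decoder) but does not specify.

```python
import torch
import torch.nn as nn

LATENT_DIM, SEQ_LEN, STEPS = 32, 128, 50  # invented sizes for illustration

class DenoisingNet(nn.Module):
    """Toy stand-in for PLAID's diffusion backbone over CHEAP latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 1, 256), nn.GELU(), nn.Linear(256, LATENT_DIM)
        )

    def forward(self, x, t):
        # Concatenate the timestep to every position as crude conditioning.
        t_feat = t.expand(x.shape[0], x.shape[1], 1)
        return self.net(torch.cat([x, t_feat], dim=-1))

@torch.no_grad()
def sample_latent(denoiser, n=1):
    """Simplified iterative denoising; a real noise schedule would be tuned."""
    x = torch.randn(n, SEQ_LEN, LATENT_DIM)
    for step in reversed(range(STEPS)):
        t = torch.tensor([[step / STEPS]])
        x = x - denoiser(x, t) / STEPS  # crude denoising update
    return x

# After sampling, PLAID decompresses the latent with CHEAP and decodes BOTH
# modalities with frozen folding-model weights (ESMFold in the article).
# These three layers are untrained placeholders, not the real components.
cheap_decode = nn.Linear(LATENT_DIM, 1024)   # compressed latent -> full embedding
sequence_head = nn.Linear(1024, 20)          # per-residue amino-acid logits
structure_head = nn.Linear(1024, 3)          # per-residue coordinates (all-atom in reality)

z = sample_latent(DenoisingNet())
h = cheap_decode(z)
aa_logits, coords = sequence_head(h), structure_head(h)
print(aa_logits.shape, coords.shape)         # (1, 128, 20) and (1, 128, 3)
```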

Why it matters (impact for developers/enterprises)

  • Multimodal protein generation could streamline design workflows by jointly optimizing sequence and structure rather than treating them separately. This may enable more efficient exploration of functional protein variants guided by prompts. PLAID article
  • Training on large sequence databases, which are 2–4 orders of magnitude larger than structure databases, hints at scalability and practicality for data-rich protein design tasks. The approach leverages latent embeddings rather than requiring structural data for every training example. PLAID article
  • Using a frozen folding model during decoding allows practitioners to exploit established structural representations without retraining the entire folding stack, potentially reducing development time and computational cost (see the sketch after this list). PLAID article
  • The compositional control interface—functions and organism prompts—emulates intuitive user control patterns from image generation, suggesting a path toward user-friendly protein design tooling. PLAID article
  • The CHEAP compression approach addresses technical challenges in high-dimensional latent spaces, which is relevant to any organization seeking scalable multimodal design tools that integrate sequence and structure. PLAID article
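
As referenced above, keeping the folding model frozen is operationally simple. A minimal PyTorch sketch, with a placeholder module standing in for a pretrained folding model such as ESMFold:

```python
import torch.nn as nn

# Placeholder standing in for a pretrained folding model; the real ESMFold
# decoder would be loaded from its published checkpoint instead.
folding_model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 3))

# Freeze: no gradients flow into the decoder, so only the generative
# model's parameters are ever updated during training.
folding_model.eval()
for p in folding_model.parameters():
    p.requires_grad = False
```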

Technical details or Implementation

  • Core concept: diffusion in the latent space of a protein folding model, enabling sampling of valid proteins whose sequence and structure can be decoded from the embedding. The decoder leverages a pretrained folding model with frozen weights, specifically ESMFold, a successor to AlphaFold2, to produce all-atom structures. PLAID article
  • Training regime: only sequences are required to train the generative model, taking advantage of the abundance of sequence data relative to structural data. This enables learning a robust latent space that maps to realistic structure when decoded. PLAID article
  • CHEAP (Compressed Hourglass Embedding Adaptations of Proteins): a compression model for the joint embedding of sequence and structure to manage the large latent spaces and facilitate learning. PLAID article
  • Multimodal co-generation: PLAID addresses the challenge of generating both a discrete sequence and continuous structural coordinates in a single pass, enabling end-to-end generation with controllable prompts. PLAID article
  • Control and prompts: PLAID supports compositional prompts for function and organism as a proof-of-concept for controlling both axes of generation; the goal is to eventually enable full-text prompt control (one common conditioning recipe is sketched after this list). PLAID article
  • Notable results and examples: PLAID demonstrates capabilities such as learning the tetrahedral cysteine-Fe2+/Fe3+ coordination patterns common in metalloproteins while preserving sequence-level diversity; transmembrane proteins with hydrophobic cores are also observed under function-based prompting. PLAID article
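
The article does not describe the conditioning mechanism behind these prompts. One common recipe for compositional prompting in diffusion models is classifier-free guidance, sketched below as the promised example; the embedding tables, null-prompt indices, and guidance weight are illustrative assumptions, not details from PLAID.

```python
import torch
import torch.nn as nn

LATENT_DIM = 32

class CondDenoiser(nn.Module):
    """Toy conditional denoiser: function and organism IDs enter as embeddings."""
    def __init__(self, n_functions=100, n_organisms=50):
        super().__init__()
        # Reserve the last index of each table as a learned "null" prompt.
        self.func_emb = nn.Embedding(n_functions + 1, LATENT_DIM)
        self.org_emb = nn.Embedding(n_organisms + 1, LATENT_DIM)
        self.net = nn.Linear(3 * LATENT_DIM, LATENT_DIM)

    def forward(self, x, func_id, org_id):
        cond = torch.cat([x,
                          self.func_emb(func_id).expand_as(x),
                          self.org_emb(org_id).expand_as(x)], dim=-1)
        return self.net(cond)

def guided_eps(model, x, func_id, org_id, null_f, null_o, w=2.0):
    # Classifier-free guidance: push the unconditional prediction toward
    # the jointly conditioned one. A standard recipe, not confirmed as PLAID's.
    eps_uncond = model(x, null_f, null_o)
    eps_cond = model(x, func_id, org_id)
    return eps_uncond + w * (eps_cond - eps_uncond)

model = CondDenoiser()
x = torch.randn(1, 16, LATENT_DIM)
func, org = torch.tensor([3]), torch.tensor([7])          # e.g. a function class and an organism
null_f, null_o = torch.tensor([100]), torch.tensor([50])  # the null indices
print(guided_eps(model, x, func, org, null_f, null_o).shape)  # (1, 16, 32)
```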

Data considerations and tables

| Data type | Characteristic | Notes |
|---|---|---|
| Protein sequences | Abundant | Sequence databases are 2–4 orders of magnitude larger than structure databases. |
| Protein structures | Scarce | Used for decoding, via frozen folding-model weights. |
| Latent space | Large but compressible | Demonstrated by activations across layers and the CHEAP approach. |
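
The compressibility noted in the last row is what CHEAP exploits. As a toy illustration of the hourglass idea (shrink the per-residue channel dimension and the sequence length, then reconstruct), here is a minimal autoencoder; the dimensions and layers are invented for the example, and the real CHEAP architecture differs:

```python
import torch
import torch.nn as nn

class HourglassAE(nn.Module):
    """Toy hourglass autoencoder over per-residue embeddings.

    Compresses channels 1024 -> 64 and halves the length, then
    reconstructs. Illustrative only; not the CHEAP architecture.
    """
    def __init__(self, d_in=1024, d_bottleneck=64):
        super().__init__()
        self.down = nn.Conv1d(d_in, d_bottleneck, kernel_size=2, stride=2)
        self.up = nn.ConvTranspose1d(d_bottleneck, d_in, kernel_size=2, stride=2)

    def forward(self, h):                   # h: (batch, length, d_in)
        z = self.down(h.transpose(1, 2))    # (batch, d_bottleneck, length // 2)
        recon = self.up(z).transpose(1, 2)  # back to (batch, length, d_in)
        return recon, z

h = torch.randn(2, 128, 1024)               # fake folding-model-sized embeddings
recon, z = HourglassAE()(h)
print(z.shape, recon.shape)                  # compressed latent vs. reconstruction
```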

Key takeaways

  • PLAID combines diffusion in a folding-model latent space with sequence-to-structure decoding to generate proteins multimodally. PLAID article
  • Training relies on sequences; structure is produced at inference time using frozen weights from a folding model (ESMFold). PLAID article
  • CHEAP offers a compression strategy to manage the joint embedding of sequence and structure, addressing high-dimensional latent spaces. PLAID article
  • The approach supports function- and organism-based prompts as a proof-of-concept for controllable generation. PLAID article
  • The authors see potential for extending multimodal generation to more complex systems, including interactions with nucleic acids and ligands. PLAID article

FAQ

  • What does PLAID generate?

    PLAID generates both protein sequences (1D) and full all-atom 3D structures by sampling the latent space of protein folding models. [PLAID article](http://bair.berkeley.edu/blog/2025/04/08/plaid)

  • What data does PLAID require for training?

    Training uses sequence databases; structure data is not required for training but is produced during inference via decoding from frozen folding-model weights. [PLAID article](http://bair.berkeley.edu/blog/2025/04/08/plaid)
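
    In outline: embed a sequence with the frozen folding-model encoder, compress it, add noise, and train a denoiser to undo the noise. A minimal sketch with untrained placeholder modules standing in for ESMFold's encoder and CHEAP:

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Placeholders: in PLAID these would be the frozen folding-model encoder
    # and the CHEAP compressor; here they are random, untrained layers.
    encoder = nn.Embedding(20, 1024)          # amino-acid tokens -> embeddings
    compressor = nn.Linear(1024, 32)          # -> compact latent
    denoiser = nn.Linear(32, 32)              # toy denoiser to be trained
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

    tokens = torch.randint(0, 20, (8, 128))   # a fake batch of sequences only
    with torch.no_grad():                     # encoder and compressor stay frozen
        z0 = compressor(encoder(tokens))

    noise = torch.randn_like(z0)
    t = torch.rand(1)                         # one noise level, for simplicity
    zt = (1 - t) * z0 + t * noise             # crude interpolation noising
    loss = F.mse_loss(denoiser(zt), noise)    # train the denoiser to predict noise
    loss.backward()
    opt.step()
    print(float(loss))                        # note: no structure labels anywhere
    ```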

  • What role does CHEAP play?

    CHEAP is a compression model for the joint embedding of sequence and structure to manage large latent spaces. [PLAID article](http://bair.berkeley.edu/blog/2025/04/08/plaid)

  • How is control implemented?

    The system uses compositional prompts for function and organism as a proof-of-concept for controllable generation. [PLAID article](http://bair.berkeley.edu/blog/2025/04/08/plaid)

  • What is the potential scope for future work?

    The method could be extended to multimodal generation over more complex systems, including interactions with nucleic acids and ligands. [PLAID article](http://bair.berkeley.edu/blog/2025/04/08/plaid)
