[Paper review] Sequence modeling and design from molecular to genome scale with Evo

2024-05-19 3 minute read

Introduction

DNA language model

DNA is the fundamental layer of biological information -> DNA sequencing has enabled systematic mapping of evolutionary diversity at the whole-genome scale
Goal: a general biological foundation model that learns the intrinsic logic of whole genomes
Current efforts to model molecular biology with ML are modality-specific (proteins, regulatory DNA, RNA) and limited to the design of single molecules, simple complexes, or short DNA sequences
A DNA model that unifies information across the molecular, systems, and genome scales could learn systems-wide interactions and enable the design of more sophisticated biological functions

LLM in biology

Recent success of LLMs built on the Transformer architecture
- existing attempts to model DNA as a language: limited by computational cost, and generally underperform at single-nucleotide or byte-level resolution
- Transformer-based DNA models: constrained to short contexts, and sacrifice single-nucleotide resolution by aggregating nucleotides into tokens
Evo: a 7B-parameter genomic foundation model, trained to generate DNA sequences at whole-genome scale
- context length of 131k tokens
- based on the StripedHyena architecture: hybridizes attention and data-controlled convolutional operators to handle long sequences efficiently
- trained on a prokaryotic whole-genome dataset (300B nucleotides)
- byte-level, single-nucleotide tokenizer
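
To make the tokenizer difference concrete, here is a minimal sketch contrasting byte-level tokenization (one token per base) with k-mer aggregation. The function names are made up for illustration, and mapping each base to its ASCII byte value is an assumption about how a byte-level tokenizer would behave, not a copy of Evo's actual tokenizer code.

```python
from typing import List


def byte_level_tokenize(seq: str) -> List[int]:
    """One token per nucleotide: each base maps to its ASCII byte value (assumed scheme)."""
    return list(seq.upper().encode("ascii"))


def kmer_tokenize(seq: str, k: int = 6) -> List[str]:
    """Non-overlapping k-mers: k nucleotides are aggregated into a single token."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]


seq = "ATGGCGTACGTTAGC"  # 15 bases
byte_tokens = byte_level_tokenize(seq)
kmer_tokens = kmer_tokenize(seq, k=6)

print(len(byte_tokens))  # 15 tokens: context length equals number of bases
print(kmer_tokens)       # 3 tokens: shorter input, but no per-base resolution

# A single-base substitution changes exactly one byte-level token,
# whereas it swaps an entire k-mer for a different vocabulary entry.
mutant = seq[:4] + "T" + seq[5:]
n_changed = sum(
    a != b for a, b in zip(byte_level_tokenize(seq), byte_level_tokenize(mutant))
)
print(n_changed)
```

With byte-level tokens, a 131k-token context covers 131k bases directly, which is why the long-context architecture matters so much here.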

Introduction to Evo

Downstream tasks – used in both prediction and generation tasks at the molecular, systems, and genome scale
- zero-shot prediction
  - predicting the fitness effects of mutations on proteins
  - predicting the fitness effects of mutations on noncoding RNAs
  - predicting gene expression from combinations of prokaryotic promoter-ribosome binding site (RBS) pairs, using the regulatory sequence alone
- designing synthetic multi-component biological systems
  - learns the co-evolutionary linkage of coding and noncoding sequences
  - designs CRISPR-Cas systems and transposable elements
- whole-genome scale
  - can predict essential genes in bacteria/bacteriophages without any supervision
  - generates sequences over 650kb long with plausible genomic coding architecture

-> Evo establishes a foundational paradigm for predictive & generative biological sequence modeling

Model architecture

StripedHyena architecture

first alternative model architecture competitive with Transformers
efficient autoregressive generation
lower latency, faster decoding, and higher throughput than Transformers
faster training and finetuning at long context (>3x at 131k)
robust to training beyond the compute-optimal frontier

hybrid of 29 Hyena layers (data-controlled convolutional operators)
interleaved with 3 layers (~10%) of multi-head attention equipped with RoPE (rotary position embeddings)
Reference: Hyena Hierarchy: Towards Larger Convolutional Language Models
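
The 29 + 3 layer split can be sketched as a simple layout list. The even spacing of the attention layers below is a hypothetical placement chosen for illustration; the notes above only give the layer counts, not the positions.

```python
N_LAYERS = 32      # 29 Hyena + 3 attention, per the counts above
N_ATTENTION = 3


def striped_layout(n_layers: int, n_attention: int) -> list:
    """Return layer types with attention spaced evenly (hypothetical placement)."""
    stride = n_layers // n_attention
    # Put one attention layer in the middle of each stride-sized group.
    attention_idx = {i * stride + stride // 2 for i in range(n_attention)}
    return ["attention" if i in attention_idx else "hyena" for i in range(n_layers)]


layout = striped_layout(N_LAYERS, N_ATTENTION)
attention_fraction = layout.count("attention") / len(layout)

print(layout.count("hyena"), layout.count("attention"))
print(f"{attention_fraction:.1%}")  # 3 of 32 layers, i.e. roughly the 10% quoted above
```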

StripedHyena was an optimal architecture for pretraining on long DNA sequences
Compared with the previous DNA model HyenaDNA, which also uses the Hyena architecture, the model size has grown by over 1000x and the training data by over 100x.
maximum number of tokens (bases) at the compute-optimal point (minimum eval perplexity): 250 billion
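
For reference, eval perplexity is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming natural-log probabilities and a made-up helper name:

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that guesses uniformly over the 4 bases has perplexity close to 4.
uniform = [math.log(0.25)] * 8
# A model that assigns 0.97 to every observed base has perplexity close to 1.
confident = [math.log(0.97)] * 8

print(perplexity(uniform), perplexity(confident))
```

Lower is better, so for nucleotide-level models a perplexity of 4 is the "knows nothing" baseline.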

Training data

trained on bacterial and archaeal genomes from GTDB, plasmid sequences from IMG/PR, and viral sequences from IMG/VR
viruses that infect eukaryotic hosts were excluded

Zero-shot function prediction

Predicting mutational effects on protein function

Task: predict fitness (a study-specific metric quantifying how well a protein performs a certain function) after mutation
Uses deep mutational scanning (DMS) datasets -> an exhaustive set of mutations to a protein-coding sequence, each with experimentally measured fitness
Competitive with protein language models despite never being trained on protein sequences, only on DNA
However, Evo failed to predict mutational effects on human-protein DMS datasets, as it was pretrained only on prokaryotic data
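
Zero-shot evaluation of this kind usually boils down to scoring each variant with the model's likelihood and rank-correlating the scores with measured fitness. A minimal sketch of that loop follows; `toy_log_likelihood` is a hypothetical stand-in for a real model call (it is not Evo), and the variants and fitness values are invented.

```python
from typing import Callable, List, Sequence


def rank(values: Sequence[float]) -> List[float]:
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def zero_shot_eval(variants: Sequence[str], fitness: Sequence[float],
                   log_likelihood: Callable[[str], float]) -> float:
    """Score each variant with the model, then correlate scores with fitness."""
    scores = [log_likelihood(v) for v in variants]
    return spearman(scores, fitness)


def toy_log_likelihood(seq: str) -> float:
    """Hypothetical scorer: penalizes mismatches to a fixed wild type."""
    wild_type = "ATGGCTAAAGGT"
    return -float(sum(a != b for a, b in zip(seq, wild_type)))


variants = ["ATGGCTAAAGGT", "ATGGCTAAAGGA", "ATGACTAAAGGA", "TTGACTAAAGGA"]
fitness = [1.0, 0.8, 0.5, 0.1]  # invented DMS-style measurements
rho = zero_shot_eval(variants, fitness, toy_log_likelihood)
print(rho)
```

No fitness labels are used to fit anything here, which is what "zero-shot" means in this context: the labels only appear in the final correlation.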

Predicting mutational effects on ncRNA function / predicting gene expression from regulatory DNA

Fine-tuning: Generative design of complex systems

Generative design of CRISPR-Cas molecular complexes

finetuned to generate CRISPR-Cas systems (~8kb)

Generative design of transposable biological systems

finetuned to generate IS200/IS605 systems (~2kb)

Analyzing whole genome

Predicting gene essentiality with long genomic context

second-stage pretraining with species-level special tokens, extending the context to genomic segments of 131k tokens
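
As far as I understand the setup, essentiality is probed in silico by disrupting a gene (for example with a premature stop codon) and checking how much the model's likelihood drops, with or without surrounding genomic context. The sketch below illustrates that idea only; the insertion offset, the `flank` parameter, and `toy_log_likelihood` are all assumptions made for illustration, not the paper's exact protocol or a real model call.

```python
from typing import Callable

STOP_CODON = "TAA"  # one of the three standard stop codons (TAA, TAG, TGA)


def insert_premature_stop(seq: str, gene_start: int, offset_codons: int = 4) -> str:
    """Insert an in-frame stop codon shortly after the start of a forward-strand gene."""
    pos = gene_start + 3 * offset_codons
    return seq[:pos] + STOP_CODON + seq[pos:]


def essentiality_score(genome: str, gene_start: int, gene_end: int,
                       log_likelihood: Callable[[str], float], flank: int = 0) -> float:
    """Log-likelihood drop caused by the knockout; `flank` bases of context on each side."""
    lo = max(0, gene_start - flank)
    hi = min(len(genome), gene_end + flank)
    wild_type = genome[lo:hi]
    mutant = insert_premature_stop(wild_type, gene_start - lo)
    return log_likelihood(wild_type) - log_likelihood(mutant)


def toy_log_likelihood(seq: str) -> float:
    """Hypothetical scorer (not Evo): penalizes each occurrence of TAA."""
    return -2.0 * seq.count("TAA")


gene = "ATGGCTGCTGCTGCTGCTGCTTGA"   # 8 codons, ends in a TGA stop
genome = "CCCC" + gene + "CCCC"
score_gene_only = essentiality_score(genome, 4, 28, toy_log_likelihood, flank=0)
score_with_context = essentiality_score(genome, 4, 28, toy_log_likelihood, flank=4)
print(score_gene_only, score_with_context)
```

With a real long-context model, the interesting comparison is how the score changes as `flank` grows, since that is where the 131k context would come into play.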

Generating DNA sequences at genome scale

Synthetic sequence generation

prompted with the species-level tokens introduced during second-stage pretraining
bacterial species prompts -> generated sequences of ~650kb in length

Evaluation

Figure: organization of coding sequences in the generated sequences
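
One simple way to quantify "coding organization" is coding density, the fraction of bases covered by predicted genes. A real analysis would use a proper gene-prediction tool; the sketch below swaps in a naive forward-strand ORF scan (ATG to the next in-frame stop) purely as an illustration, and the `min_codons` threshold is an arbitrary assumption.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list:
    """Forward-strand ORFs as (start, end) pairs; end is exclusive and includes the stop."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def coding_density(seq: str, min_codons: int = 3) -> float:
    """Fraction of bases covered by at least one detected ORF."""
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    for s, e in find_orfs(seq, min_codons):
        for i in range(s, e):
            covered[i] = True
    return sum(covered) / len(seq)


toy = "CC" + "ATGAAACCCGGGTAA" + "CCC"  # one 5-codon ORF inside 20 bases
print(find_orfs(toy))
print(coding_density(toy))
```

This ignores the reverse strand, alternative start codons, and ribosome binding sites, so it should only be read as a rough proxy for what a dedicated gene caller reports.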

ESMFold structure predictions

Usage

Paper and GitHub repository. You can also use the Evo model on Together AI.

Discussion

Safety and ethics discussion

whole-genome foundation models have the potential for misuse
threat to biosafety and biosecurity?
can also catalyze the development of harmful synthetic microorganisms
whole-genome foundation models could contribute to social and health inequity
companies may accelerate research that prioritizes returns-on-investment over the global disease burden or health equity
may enable an organization to bypass current intellectual property protections
whole-genome foundation models could contribute to disruptions to the natural environment
intellectual property law should evolve as generative models increasingly automate the biological discovery and design process

The path forward

establishment of clear, comprehensive guidelines that delineate ethical practices
community partnerships and international collaborations – address disparities in access and capabilities
create a dynamic feedback loop that engages all stakeholders in a continuous dialogue

-> lays the groundwork for a future where genetic engineering advances in harmony with ethical principles and societal values
