GenerRNA: A generative pre-trained language model for de novo RNA design

Zhao, Yichong; Oono, Kenta; Takizawa, Hiroki; Kotera, Masaaki

doi:10.1371/journal.pone.0310814

GenerRNA at a glance

What it is: Generative RNA language model (GPT-style Transformer)
What it does: Designs novel RNA sequences de novo
Parameters: ~350M · 24 layers · dim 1280
Context: 1024 tokens (~4000 nucleotides)
Training data: ~16M sequences / ~17.4B nucleotides (RNAcentral)
Input needed: None — no structure or alignment
License: MIT (free to use & modify)
Published: PLOS ONE 19(10):e0310814 (2024)

What is GenerRNA?

GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional label, or sequence alignment. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.

Use it zero-shot to explore the RNA sequence space, or fine-tune it to design RNAs from a particular family or with specific characteristics, such as high binding affinity to a target protein.
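To make the idea of generating a sequence one token at a time concrete, here is a minimal Python sketch. It is not GenerRNA's code: `toy_next_token_logits` is a hypothetical stand-in for the Transformer, and the four single-nucleotide tokens stand in for the model's BPE vocabulary. Only the loop itself (temperature scaling, optional top-k filtering, sampling one token, appending it to the prefix) reflects how a decoder-only language model is commonly sampled.

```python
import math
import random

VOCAB = ["A", "C", "G", "U"]  # toy vocabulary; GenerRNA itself uses 1024 BPE tokens


def toy_next_token_logits(prefix):
    """Hypothetical stand-in for the Transformer: score each possible next token.

    The score depends only on the last token (favouring G after C and U after A),
    whereas a real model conditions on the whole prefix.
    """
    logits = [0.0, 0.0, 0.0, 0.0]
    if prefix:
        last = prefix[-1]
        if last == "C":
            logits[VOCAB.index("G")] += 1.0
        elif last == "A":
            logits[VOCAB.index("U")] += 1.0
    return logits


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from logits with temperature and optional top-k."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        # Keep the k highest scores; tokens tied with the k-th score are all kept.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalised softmax
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate(max_new_tokens, temperature=1.0, top_k=None, seed=0):
    """Autoregressive loop: sample a token, append it, repeat."""
    rng = random.Random(seed)
    seq = []
    for _ in range(max_new_tokens):
        idx = sample_next(toy_next_token_logits(seq), temperature, top_k, rng)
        seq.append(VOCAB[idx])
    return "".join(seq)


print(generate(max_new_tokens=20, temperature=0.8, top_k=3))
```

Lower temperatures and smaller `top_k` values make samples more conservative; higher values trade realism for diversity.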

Key capabilities

🧬 De novo RNA generation — create novel sequences from scratch; no structure, label, or alignment required.
🎯 Zero-shot or fine-tuned — explore RNA space out of the box, or specialize for a target family or function.
🔬 Structurally plausible — generated sequences fold into stable secondary structures (low minimum free energy).
🧩 Transformer / GPT architecture — a scalable decoder-only design (~350M parameters).
⚡ Ready-to-use checkpoints — hosted on Hugging Face.
📖 Open & reproducible — MIT-licensed code, tokenizer, and figure data.
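One way to check the structural-plausibility point on your own samples is to predict each sequence's minimum free energy (MFE). The sketch below assumes the ViennaRNA Python bindings (`import RNA`) are installed and calls `RNA.fold`; this page does not state the folding settings used in the paper, so treat it as an illustration only. The helper names are hypothetical.

```python
def fold_mfe(sequence):
    """Return (dot-bracket structure, MFE in kcal/mol) as predicted by ViennaRNA."""
    # Imported lazily so the pure-Python helpers below work without ViennaRNA.
    import RNA
    structure, mfe = RNA.fold(sequence)
    return structure, mfe


def mfe_per_nucleotide(mfe, length):
    """Length-normalised MFE, so sequences of different lengths can be compared."""
    if length <= 0:
        raise ValueError("length must be positive")
    return mfe / length


def paired_fraction(structure):
    """Fraction of positions that are base-paired in a dot-bracket string."""
    if not structure:
        return 0.0
    paired = sum(1 for ch in structure if ch in "()")
    return paired / len(structure)


# Example use (requires ViennaRNA; the sequence is made up, not a GenerRNA output):
#   structure, mfe = fold_mfe("GGGAAAUCCC")
#   print(structure, mfe, mfe_per_nucleotide(mfe, 10), paired_fraction(structure))
```

Comparing the length-normalised MFE of generated sequences with that of natural and shuffled sequences of similar length gives a rough sense of whether the samples are more stable than chance.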

How GenerRNA's approach differs

GenerRNA is generative, which sets it apart from the predictive and alignment-based paradigms commonly used for RNA.

Aspect	GenerRNA (generative LM)	Structure / property predictors	Alignment / covariance models
Primary goal	Generate novel sequences	Predict structure or properties from a sequence	Model a known family from an alignment
Produces entirely new sequences	Yes	No	Limited (within-family)
Needs structure / alignment as input	No	Often	Yes (MSA)
Prior family knowledge required	No (zero-shot) or optional (fine-tune)	—	Yes
Core method	Decoder-only Transformer trained on ~16M sequences	Varies (CNN/Transformer/thermodynamic)	Statistical / covariance

Comparison of paradigms, not specific tools; characteristics for GenerRNA are from the PLOS ONE (2024) paper.

Model details

Model type	Generative language model (decoder-only Transformer, GPT-style)
Domain	RNA / nucleotide sequences
Parameters	~350M (24 transformer layers, model dimension 1280)
Context window	1024 tokens (~4000 nucleotides)
Tokenizer	Byte-Pair Encoding (BPE), vocabulary size 1024
Training data	~16 million RNA sequences (16.09M) / ~17.4 billion nucleotides, from RNAcentral (release 22), deduplicated with MMseqs2 at 80% identity
Weights	huggingface.co/pfnet/GenerRNA
License	MIT
Paper	PLOS ONE 19(10):e0310814 (2024) · doi:10.1371/journal.pone.0310814
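The figures in the table can be related with a little arithmetic: a 1024-token window covering roughly 4000 nucleotides implies about four nucleotides per BPE token on average. The sketch below shows why a BPE token spans several nucleotides by learning a few merges on a toy corpus. It is a from-scratch illustration of the BPE idea, not the project's tokenizer; the corpus and the number of merges are made up.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace each left-to-right occurrence of the adjacent pair (a, b) with a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn BPE merges on nucleotide strings; return (merges, tokenised corpus)."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))  # count adjacent token pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent pair is merged
        merges.append((a, b))
        corpus = [merge_pair(toks, a, b) for toks in corpus]
    return merges, corpus


toy_corpus = ["GGGAAACCC", "GGGAAAUCCC", "GGAAACC"]  # made-up sequences
merges, tokenised = learn_bpe(toy_corpus, num_merges=4)
n_nt = sum(len(s) for s in toy_corpus)
n_tok = sum(len(t) for t in tokenised)
print(merges)
print(f"{n_nt / n_tok:.2f} nucleotides per token on the toy corpus")

# Back-of-the-envelope figures from the table above
print(4000 / 1024)        # nucleotides per token implied by the context window
print(17.4e9 / 16.09e6)   # average nucleotides per training sequence
```

With a vocabulary of 1024 tokens, the real tokenizer has far more merges than this toy run, which is what brings the average token length to about four nucleotides.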

How to use

🤗 The model weights are hosted on Hugging Face. The GitHub repository contains the code and documentation.

# Get the code
git clone https://github.com/ekkkkki/GenerRNA && cd GenerRNA

# Download the weights from Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download pfnet/GenerRNA model_updated.pt --local-dir .

# Generate RNA de novo
python sampling.py --out_path generated.txt --max_new_tokens 256 \
    --ckpt_path model_updated.pt --tokenizer_path tokenizer
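The same steps can be scripted from Python. This sketch assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function; the `sampling.py` flags are the ones shown in the command above, and the helper names are hypothetical. It should be run from the root of the cloned repository.

```python
import subprocess
import sys


def download_weights(local_dir="."):
    """Fetch model_updated.pt from the pfnet/GenerRNA repository on Hugging Face."""
    # Imported lazily so the command builder below works without huggingface_hub.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(
        repo_id="pfnet/GenerRNA",
        filename="model_updated.pt",
        local_dir=local_dir,
    )


def build_sampling_command(out_path, ckpt_path, tokenizer_path="tokenizer",
                           max_new_tokens=256):
    """Build the sampling.py invocation as an argument list (no shell quoting needed)."""
    return [
        sys.executable, "sampling.py",
        "--out_path", str(out_path),
        "--max_new_tokens", str(max_new_tokens),
        "--ckpt_path", str(ckpt_path),
        "--tokenizer_path", str(tokenizer_path),
    ]


def generate(out_path="generated.txt"):
    """Download the checkpoint, then run sampling.py and fail loudly on errors."""
    ckpt = download_weights()
    subprocess.run(build_sampling_command(out_path, ckpt), check=True)
```

Calling `generate()` downloads the checkpoint on first use and writes samples to `generated.txt`; passing the arguments as a list avoids shell-quoting problems with unusual paths.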

Build on GenerRNA

GenerRNA is MIT-licensed — you are free to use, modify, fine-tune, and build on it, including for commercial work. It's meant to be a starting point, not a black box.

🔧 Fine-tune for your target — adapt GenerRNA to a specific RNA family or function with your own data.
🧪 Design functional RNA — e.g., aptamers or protein binders (ELAVL1 and SRSF1 demonstrated in the paper).
🧱 Use as a backbone — a pre-trained foundation for downstream RNA modeling and design tasks.
🔁 Reproduce & extend — code, tokenizer, and the data behind the paper's figures are all open.

Frequently asked questions

What is GenerRNA? A generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences de novo, without requiring structural information, functional labels, or sequence alignments.
Can AI design RNA sequences? Yes. GenerRNA is a generative AI model that designs novel RNA sequences de novo: it learns from millions of natural RNAs and samples new, realistic sequences that are predicted to fold into stable secondary structures (low minimum free energy).
How is GenerRNA different from other RNA models? Most RNA models are discriminative (they predict structure or properties). GenerRNA is generative: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
How can I design RNA that binds a target protein? Fine-tune GenerRNA on a target-specific dataset. The paper demonstrates designing RNA with high binding affinity to the RNA-binding proteins ELAVL1 and SRSF1.
What data was GenerRNA trained on? About 16 million RNA sequences (~17.4 billion nucleotides) derived from RNAcentral release 22 and deduplicated with MMseqs2 at 80% sequence identity.
How large is GenerRNA? ~350 million parameters, 24 layers, model dimension 1280, a 1024-token context window (~4000 nucleotides), and a BPE tokenizer with vocabulary size 1024.
Can I use and modify GenerRNA freely? Yes. The code and weights are MIT-licensed, so you can use, modify, fine-tune, and build on them (including commercially).
Where can I download GenerRNA? The model weights are on Hugging Face: pfnet/GenerRNA. Code and docs are on GitHub.
How do I cite GenerRNA? See the citation below, or use the BibTeX in the repository.

Cite this work

Found GenerRNA useful in your research or software? Please cite the paper — it's the best way to support the project and help others discover it.

@article{zhao2024generrna,
  title     = {GenerRNA: A generative pre-trained language model for de novo RNA design},
  author    = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
  journal   = {PLOS ONE},
  volume    = {19},
  number    = {10},
  pages     = {e0310814},
  year      = {2024},
  doi       = {10.1371/journal.pone.0310814},
  publisher = {Public Library of Science}
}

Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814