GenerRNA at a glance
- What it is: Generative RNA language model (GPT-style Transformer)
- What it does: Designs novel RNA sequences de novo
- Parameters: ~350M · 24 layers · dim 1280
- Context: 1024 tokens (~4000 nucleotides)
- Training data: ~16M sequences / ~17.4B nucleotides (RNAcentral)
- Input needed: None (no structure or alignment)
- License: MIT (free to use & modify)
- Published: PLOS ONE 19(10):e0310814 (2024)
What is GenerRNA?
GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional label, or sequence alignment. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.
Use it to generate RNA zero-shot to explore the RNA sequence space, or fine-tune it to design RNAs from a particular family or with specific characteristics — such as high binding affinity to a target protein.
Key capabilities
- 🧬 De novo RNA generation — create novel sequences from scratch; no structure, label, or alignment required.
- 🎯 Zero-shot or fine-tuned — explore RNA space out of the box, or specialize for a target family or function.
- 🔬 Structurally plausible — generated sequences fold into stable secondary structures (low minimum free energy).
- 🧩 Transformer / GPT architecture — a scalable decoder-only design (~350M parameters).
- ⚡ Ready-to-use checkpoints — hosted on Hugging Face.
- 📖 Open & reproducible — MIT-licensed code, tokenizer, and figure data.
How GenerRNA's approach differs
GenerRNA is generative, which sets it apart from the predictive and alignment-based paradigms commonly used for RNA.
| | GenerRNA (generative LM) | Structure / property predictors | Alignment / covariance models |
|---|---|---|---|
| Primary goal | Generate novel sequences | Predict structure or properties from a sequence | Model a known family from an alignment |
| Produces entirely new sequences | Yes | No | Limited (within-family) |
| Needs structure / alignment as input | No | Often | Yes (MSA) |
| Prior family knowledge required | No (zero-shot) or optional (fine-tune) | — | Yes |
| Core method | Decoder-only Transformer trained on ~16M sequences | Varies (CNN/Transformer/thermodynamic) | Statistical / covariance |
Comparison of paradigms, not specific tools; characteristics for GenerRNA are from the PLOS ONE (2024) paper.
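To make "generative" concrete: a decoder-only language model emits a sequence one token at a time, drawing each token from a distribution conditioned on what came before. The sketch below is a toy stand-in, not GenerRNA's code. It replaces the Transformer with a hand-written table of next-nucleotide probabilities (`TOY_PROBS`, with made-up values) that looks only at the previous symbol, whereas GenerRNA conditions on the whole context and works on BPE tokens.

```python
import random

# Hypothetical next-nucleotide probabilities given the previous symbol.
# A real model computes these with a Transformer over the full context.
TOY_PROBS = {
    "<s>": {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
    "A": {"A": 0.2, "C": 0.2, "G": 0.3, "U": 0.3},
    "C": {"A": 0.3, "C": 0.2, "G": 0.3, "U": 0.2},
    "G": {"A": 0.2, "C": 0.4, "G": 0.2, "U": 0.2},
    "U": {"A": 0.3, "C": 0.2, "G": 0.2, "U": 0.3},
}

def sample_sequence(max_new_tokens, temperature=1.0, seed=None):
    """Autoregressively sample a nucleotide string from the toy model."""
    rng = random.Random(seed)
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_new_tokens):
        probs = TOY_PROBS[tokens[-1]]
        symbols = list(probs)
        # Temperature scaling: p ** (1/T), renormalized by random.choices.
        weights = [probs[s] ** (1.0 / temperature) for s in symbols]
        tokens.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(tokens[1:])

print(sample_sequence(max_new_tokens=32, seed=0))
```

A predictive model would instead take a finished sequence as input and return a label or structure; here nothing is supplied except a length budget and a random seed.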
Model details
| Field | Value |
|---|---|
| Model type | Generative language model (decoder-only Transformer, GPT-style) |
| Domain | RNA / nucleotide sequences |
| Parameters | ~350M (24 transformer layers, model dimension 1280) |
| Context window | 1024 tokens (~4000 nucleotides) |
| Tokenizer | Byte-Pair Encoding (BPE), vocabulary size 1024 |
| Training data | ~16 million RNA sequences (16.09M) / ~17.4 billion nucleotides, from RNAcentral (release 22), deduplicated with MMseqs2 at 80% identity |
| Weights | huggingface.co/pfnet/GenerRNA |
| License | MIT |
| Paper | PLOS ONE 19(10):e0310814 (2024) · doi:10.1371/journal.pone.0310814 |
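A quick back-of-the-envelope check relates the figures in the table. The snippet uses only the numbers quoted above; the derived averages are rough estimates, not values reported in the paper.

```python
# Figures quoted in the model details table.
num_sequences = 16.09e6       # training sequences
total_nucleotides = 17.4e9    # training nucleotides
context_tokens = 1024         # context window in tokens
context_nucleotides = 4000    # approximate nucleotide equivalent, as quoted

mean_length = total_nucleotides / num_sequences
nt_per_token = context_nucleotides / context_tokens

print(f"mean sequence length ~ {mean_length:.0f} nt")             # ~1081 nt
print(f"nucleotides per token ~ {nt_per_token:.1f}")              # ~3.9
print(f"mean sequence ~ {mean_length / nt_per_token:.0f} tokens")  # ~277
```

On these averages, a typical training sequence occupies roughly a quarter of the context window. The actual length distribution is not given here, so individual sequences may be far shorter or longer.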
How to use
🤗 The model weights are hosted on Hugging Face. The GitHub repository contains the code and documentation.
# Get the code
git clone https://github.com/ekkkkki/GenerRNA && cd GenerRNA
# Download the weights from Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download pfnet/GenerRNA model_updated.pt --local-dir .
# Generate RNA de novo
python sampling.py --out_path generated.txt --max_new_tokens 256 \
    --ckpt_path model_updated.pt --tokenizer_path tokenizer
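Once sampling finishes, a quick sanity pass over the output can be useful. The sketch below assumes, without confirmation from the repository, that `generated.txt` holds one sequence per line. It also treats `T` as `U`, since the alphabet the checkpoint emits is not stated here. The helper names are invented for this example, and `demo_lines` stands in for the file contents.

```python
def clean_sequences(lines):
    """Uppercase each line, drop non-RNA lines, and deduplicate in order."""
    kept = []
    for line in lines:
        seq = line.strip().upper().replace("T", "U")  # assumption: T means U
        if seq and set(seq) <= set("ACGU"):
            kept.append(seq)
    return list(dict.fromkeys(kept))  # dict keys preserve first-seen order

def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical stand-in for the lines of generated.txt.
demo_lines = ["GGGAAACCC\n", "gggaaaccc\n", "\n", "ACGUNN\n", "AUGCUAGCUA\n"]

for seq in clean_sequences(demo_lines):
    print(f"{seq}\tlen={len(seq)}\tGC={gc_content(seq):.2f}")
```

To run it on real output, pass an open file handle, for example `clean_sequences(open("generated.txt"))`, after checking that the file really uses one sequence per line.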
Build on GenerRNA
GenerRNA is MIT-licensed — you are free to use, modify, fine-tune, and build on it, including for commercial work. It's meant to be a starting point, not a black box.
- 🔧 Fine-tune for your target — adapt GenerRNA to a specific RNA family or function with your own data.
- 🧪 Design functional RNA — e.g., aptamers or protein binders (ELAVL1 and SRSF1 demonstrated in the paper).
- 🧱 Use as a backbone — a pre-trained foundation for downstream RNA modeling and design tasks.
- 🔁 Reproduce & extend — code, tokenizer, and the data behind the paper's figures are all open.
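Before fine-tuning on your own data, some basic preparation usually helps: removing duplicates, dropping sequences that cannot fit in the context window, and holding out a validation set. The sketch below is a generic illustration and does not reflect the repository's actual fine-tuning scripts or input format, which are not described here. `prepare_split` and its parameters are hypothetical; the default `max_nt=4000` comes from the approximate context size quoted above, although the true limit is measured in tokens and depends on the tokenizer.

```python
import random

def prepare_split(sequences, max_nt=4000, val_fraction=0.1, seed=0):
    """Deduplicate, drop over-long sequences, and return (train, val) lists."""
    unique = list(dict.fromkeys(s.strip() for s in sequences if s.strip()))
    usable = [s for s in unique if len(s) <= max_nt]
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(usable)
    n_val = int(round(len(usable) * val_fraction))
    return usable[n_val:], usable[:n_val]

# Hypothetical family data: placeholder sequences, one duplicate,
# and one entry that is too long to keep.
family = ["GGGAAACCC", "GGGAAACCC", "AUGCUAGCUA", "GCGCAAAGCGC", "A" * 5000] + [
    "GC" * n + "AAAA" + "GC" * n for n in range(2, 9)
]

train, val = prepare_split(family, val_fraction=0.2)
print(len(train), len(val))  # 8 2
```

Exact-match deduplication is the simplest option. For closely related family members, clustering by sequence identity (as was done for the pre-training data with MMseqs2) gives a stricter separation between training and validation sets.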
Frequently asked questions
- What is GenerRNA?
  - A generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences de novo, without requiring structural information, functional labels, or sequence alignments.
- Can AI design RNA sequences?
  - Yes. GenerRNA is a generative AI model that designs novel RNA sequences de novo: it learns from millions of natural RNAs and samples new, realistic sequences that fold into stable secondary structures.
- How is GenerRNA different from other RNA models?
  - Most RNA models are discriminative (they predict structure or properties). GenerRNA is generative: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
- How can I design RNA that binds a target protein?
  - Fine-tune GenerRNA on a target-specific dataset. The paper demonstrates designing RNA with high binding affinity to the RNA-binding proteins ELAVL1 and SRSF1.
- What data was GenerRNA trained on?
  - About 16 million RNA sequences (~17.4 billion nucleotides) derived from RNAcentral release 22 and deduplicated with MMseqs2 at 80% sequence identity.
- How large is GenerRNA?
  - ~350 million parameters, 24 layers, model dimension 1280, a 1024-token context window (~4000 nucleotides), and a BPE tokenizer with vocabulary size 1024.
- Can I use and modify GenerRNA freely?
  - Yes. The code and weights are MIT-licensed, so you can use, modify, fine-tune, and build on it (including commercially).
- Where can I download GenerRNA?
  - The model weights are on Hugging Face: pfnet/GenerRNA. Code and docs are on GitHub.
- How do I cite GenerRNA?
  - See the citation below, or use the BibTeX in the repository.
Cite this work
Found GenerRNA useful in your research or software? Please cite the paper — it's the best way to support the project and help others discover it.
@article{zhao2024generrna,
  title = {GenerRNA: A generative pre-trained language model for de novo RNA design},
  author = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
  journal = {PLOS ONE},
  volume = {19},
  number = {10},
  pages = {e0310814},
  year = {2024},
  doi = {10.1371/journal.pone.0310814},
  publisher = {Public Library of Science}
}
Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814