Open-source · MIT licensed · Generative model for RNA design

GenerRNA

A generative pre-trained language model for de novo RNA design — generate novel RNA sequences with no structural input required.

GenerRNA at a glance

What it is: Generative RNA language model (GPT-style Transformer)
What it does: Designs novel RNA sequences de novo
Parameters: ~350M · 24 layers · dim 1280
Context: 1024 tokens (~4000 nucleotides)
Training data: ~16M sequences / ~17.4B nucleotides (RNAcentral)
Input needed: None (no structure or alignment)
License: MIT (free to use & modify)
Published: PLOS ONE 19(10):e0310814 (2024)

What is GenerRNA?

GenerRNA is a generative pre-trained language model for de novo RNA sequence design. It is a Transformer (decoder-only, GPT-style) model that learns the "language" of RNA from millions of natural sequences and can generate novel, realistic RNA sequences without any structural input, functional label, or sequence alignment. To our knowledge, GenerRNA is the first application of a generative language model to RNA generation.

Use it zero-shot to explore the RNA sequence space, or fine-tune it to design RNAs from a particular family or with specific characteristics, such as high binding affinity to a target protein.

How GenerRNA's approach differs

GenerRNA is generative, which sets it apart from the predictive and alignment-based paradigms commonly used for RNA.

| | GenerRNA (generative LM) | Structure / property predictors | Alignment / covariance models |
|---|---|---|---|
| Primary goal | Generate novel sequences | Predict structure or properties from a sequence | Model a known family from an alignment |
| Produces entirely new sequences | Yes | No | Limited (within-family) |
| Needs structure / alignment as input | No | Often | Yes (MSA) |
| Prior family knowledge required | No (zero-shot) or optional (fine-tune) | | Yes |
| Core method | Decoder-only Transformer trained on ~16M sequences | Varies (CNN/Transformer/thermodynamic) | Statistical / covariance |

Comparison of paradigms, not specific tools; characteristics for GenerRNA are from the PLOS ONE (2024) paper.

Model details

Model type: Generative language model (decoder-only Transformer, GPT-style)
Domain: RNA / nucleotide sequences
Parameters: 350M (24 transformer layers, model dimension 1280)
Context window: 1024 tokens (~4000 nucleotides)
Tokenizer: Byte-Pair Encoding (BPE), vocabulary size 1024
Training data: ~16 million RNA sequences (16.09M) / ~17.4 billion nucleotides, from RNAcentral (release 22), deduplicated with MMseqs2 at 80% identity
Weights: huggingface.co/pfnet/GenerRNA
License: MIT
Paper: PLOS ONE 19(10):e0310814 (2024) · doi:10.1371/journal.pone.0310814
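
The figures above imply a few useful ratios. The Python sketch below works them out using only the numbers stated on this page. The derived values (nucleotides per token, mean sequence length, approximate training tokens) are rough averages, not figures reported by the authors, and `tokens_to_nucleotides` is a hypothetical helper introduced for illustration.

```python
# Back-of-the-envelope arithmetic from the model details listed above.
# All inputs are the approximate figures quoted on this page.
CONTEXT_TOKENS = 1024        # context window, in BPE tokens
APPROX_CONTEXT_NT = 4000     # "~4000 nucleotides" for a full context
TRAIN_SEQUENCES = 16.09e6    # ~16.09M training sequences
TRAIN_NUCLEOTIDES = 17.4e9   # ~17.4B training nucleotides

# Average nucleotides covered by one BPE token (implied, not measured).
nt_per_token = APPROX_CONTEXT_NT / CONTEXT_TOKENS

# Average training sequence length in nucleotides.
mean_seq_len = TRAIN_NUCLEOTIDES / TRAIN_SEQUENCES

# Rough size of the training corpus in tokens, assuming the same ratio.
approx_train_tokens = TRAIN_NUCLEOTIDES / nt_per_token


def tokens_to_nucleotides(n_tokens, ratio=nt_per_token):
    """Estimate how many nucleotides a token budget covers on average."""
    return round(n_tokens * ratio)


print(f"nucleotides per token: {nt_per_token:.2f}")          # about 3.9
print(f"mean training sequence length: {mean_seq_len:.0f}")  # about 1081
print(f"approx. training tokens: {approx_train_tokens:.2e}")
print(f"256 tokens ~ {tokens_to_nucleotides(256)} nucleotides")
```

Under this average, a budget of 256 new tokens (the value used in the sampling command in "How to use") corresponds to roughly 1000 nucleotides. The real length of any one generated sequence depends on which tokens are sampled.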

How to use

🤗 The model weights are hosted on Hugging Face. The GitHub repository contains the code and documentation.

# Get the code
git clone https://github.com/ekkkkki/GenerRNA && cd GenerRNA

# Download the weights from Hugging Face
pip install -U "huggingface_hub[cli]"
huggingface-cli download pfnet/GenerRNA model_updated.pt --local-dir .

# Generate RNA de novo
python sampling.py --out_path generated.txt --max_new_tokens 256 \
    --ckpt_path model_updated.pt --tokenizer_path tokenizer
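
Once sampling finishes, a quick sanity check of the output can be useful. The sketch below is an illustration under stated assumptions, not part of the GenerRNA repository: it assumes `generated.txt` holds one sequence per line (check the file's actual format first), and `clean_sequences` and `gc_content` are hypothetical helpers. It keeps only sequences made of A, C, G, and U (treating T as U), drops exact duplicates, and reports length and GC content.

```python
from pathlib import Path

RNA_ALPHABET = set("ACGU")


def clean_sequences(lines, min_len=1):
    """Return unique, upper-cased RNA sequences, in first-seen order.

    Assumes one candidate sequence per input line (an assumption about
    the output format, not something documented on this page).
    """
    seen = set()
    kept = []
    for raw in lines:
        seq = raw.strip().upper().replace("T", "U")
        # Skip blank/short lines and anything with non-ACGU characters.
        if len(seq) < min_len or not set(seq) <= RNA_ALPHABET:
            continue
        if seq in seen:
            continue
        seen.add(seq)
        kept.append(seq)
    return kept


def gc_content(seq):
    """Fraction of G and C nucleotides in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    path = Path("generated.txt")  # the --out_path used in the command above
    if path.exists():
        sequences = clean_sequences(path.read_text().splitlines())
        print(f"{len(sequences)} unique valid sequences")
        for seq in sequences[:5]:
            print(len(seq), f"{gc_content(seq):.2f}", seq[:60])
```

A filter like this only checks the alphabet and uniqueness. It says nothing about whether a sequence folds or functions, which needs downstream tools or experiments.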

Build on GenerRNA

GenerRNA is MIT-licensed — you are free to use, modify, fine-tune, and build on it, including for commercial work. It's meant to be a starting point, not a black box.

Frequently asked questions

What is GenerRNA?
A generative, pre-trained language model (a decoder-only Transformer) that designs novel RNA sequences de novo, without requiring structural information, functional labels, or sequence alignments.
Can AI design RNA sequences?
Yes. GenerRNA is a generative AI model that designs novel RNA sequences de novo — it learns from millions of natural RNAs and samples new, realistic sequences that fold into stable secondary structures.
How is GenerRNA different from other RNA models?
Most RNA models are discriminative (they predict structure or properties). GenerRNA is generative: it samples entirely new sequences. To our knowledge, it is the first application of a generative language model to RNA generation.
How can I design RNA that binds a target protein?
Fine-tune GenerRNA on a target-specific dataset. The paper demonstrates designing RNA with high binding affinity to the RNA-binding proteins ELAVL1 and SRSF1.
What data was GenerRNA trained on?
About 16 million RNA sequences (~17.4 billion nucleotides) derived from RNAcentral release 22 and deduplicated with MMseqs2 at 80% sequence identity.
How large is GenerRNA?
~350 million parameters, 24 layers, model dimension 1280, a 1024-token context window (~4000 nucleotides), and a BPE tokenizer with vocabulary size 1024.
Can I use and modify GenerRNA freely?
Yes — the code and weights are MIT-licensed, so you can use, modify, fine-tune, and build on it (including commercially).
Where can I download GenerRNA?
The model weights are on Hugging Face: pfnet/GenerRNA. Code and docs are on GitHub.
How do I cite GenerRNA?
See the citation below, or use the BibTeX in the repository.

Cite this work

Found GenerRNA useful in your research or software? Please cite the paper — it's the best way to support the project and help others discover it.

@article{zhao2024generrna,
  title     = {GenerRNA: A generative pre-trained language model for de novo RNA design},
  author    = {Zhao, Yichong and Oono, Kenta and Takizawa, Hiroki and Kotera, Masaaki},
  journal   = {PLOS ONE},
  volume    = {19},
  number    = {10},
  pages     = {e0310814},
  year      = {2024},
  doi       = {10.1371/journal.pone.0310814},
  publisher = {Public Library of Science}
}

Zhao Y, Oono K, Takizawa H, Kotera M (2024) GenerRNA: A generative pre-trained language model for de novo RNA design. PLOS ONE 19(10): e0310814. https://doi.org/10.1371/journal.pone.0310814