---
tags:
- biology
- DNA
- genomics
---
This is the official pre-trained model introduced in the paper [DNA language model GROVER learns sequence context in the human genome](https://www.nature.com/articles/s42256-024-00872-0).



```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("PoetschLab/GROVER")
model = AutoModelForMaskedLM.from_pretrained("PoetschLab/GROVER")
```


Preliminary analysis shows that Byte Pair Encoding (BPE) re-tokenization changes significantly when a sequence is shorter than 50 nucleotides; even for longer sequences, tokenization near the sequence edges can differ from the original whole-genome tokenization.
We advise adding 100 nucleotides at the beginning and end of every sequence to guarantee that it is represented with the same tokens as the original tokenization.
We also provide the tokenized chromosomes with their respective nucleotide mappers (available in the tokenized chromosomes folder).
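The flank-padding recommendation above can be sketched as follows. Note that `pad_with_flanks` and the placeholder sequences are hypothetical illustrations, not part of the GROVER release:

```python
def pad_with_flanks(seq, upstream, downstream, flank=100):
    """Surround `seq` with `flank` nucleotides of its genomic context,
    so that BPE tokenization of the region of interest matches the
    original whole-genome tokenization (only edge tokens change)."""
    if len(upstream) < flank or len(downstream) < flank:
        raise ValueError(f"need at least {flank} nt of context on each side")
    return upstream[-flank:] + seq + downstream[:flank]

# Hypothetical example: a short region of interest with placeholder context
context_left = "A" * 150   # upstream genomic sequence (placeholder)
context_right = "G" * 150  # downstream genomic sequence (placeholder)
region = "ACGTACGTAC"

padded = pad_with_flanks(region, context_left, context_right)
assert len(padded) == len(region) + 200
assert padded[100:110] == region
```

The padded sequence, rather than the bare region, is then what you would pass to the tokenizer loaded above.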

### BibTeX entry and citation info

```bibtex
@article{sanabria2024dna,
  title={DNA language model GROVER learns sequence context in the human genome},
  author={Sanabria, Melissa and Hirsch, Jonas and Joubert, Pierre M and Poetsch, Anna R},
  journal={Nature Machine Intelligence},
  pages={1--13},
  year={2024},
  publisher={Nature Publishing Group UK London}
}
```