hdallatorre committed on
Commit
a5bd6ee
1 Parent(s): 2caa091

Update README.md

Files changed (1)
  1. README.md +91 -1
README.md CHANGED
@@ -4,4 +4,94 @@ tags:
4
  - DNA
5
  - biology
6
  - genomics
7
- ---
7
+ datasets:
8
+ - InstaDeepAI/human_reference_genome
9
+ ---
10
+ # nucleotide-transformer-500m-human-ref model
11
+
12
+ The Nucleotide Transformers are a collection of foundation language models pre-trained on DNA sequences from whole genomes. Unlike approaches that integrate information from single reference genomes only, our models leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.
13
+
14
+ Part of this collection is the **nucleotide-transformer-500m-human-ref**, a 500M-parameter transformer pre-trained on the human reference genome.
15
+
16
+ **Developed by:** InstaDeep, NVIDIA and TUM
17
+
18
+ ### Model Sources
19
+
20
+ <!-- Provide the basic links for the model. -->
21
+
22
+ - **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
23
+ - **Paper:** [The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics](https://www.biorxiv.org/content/10.1101/2023.01.11.523679v1)
24
+
25
+ ### How to use
26
+
27
+ <!-- Need to adapt this section to our model. Need to figure out how to load the models from huggingface and do inference on them -->
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+ import torch
+
+ # Load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
+ model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-500m-human-ref")
+
+ # Create a dummy DNA sequence and tokenize it
+ sequences = ['ATTCTG' * 9]
+ tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt")["input_ids"]
+
+ # Run the model; the attention mask flags the non-padding tokens
+ attention_mask = tokens_ids != tokenizer.pad_token_id
+ torch_outs = model(
+     tokens_ids,
+     attention_mask=attention_mask,
+     encoder_attention_mask=attention_mask,
+     output_hidden_states=True
+ )
+
+ # Take the per-token embeddings from the last hidden layer (kept as a torch tensor)
+ embeddings = torch_outs['hidden_states'][-1].detach()
+ print(f"Embeddings shape: {embeddings.shape}")
+ print(f"Embeddings per token: {embeddings}")
+
+ # Compute mean embeddings per sequence, ignoring padding tokens
+ mask = attention_mask.unsqueeze(-1)
+ mean_sequence_embeddings = torch.sum(mask * embeddings, dim=-2) / torch.sum(mask, dim=-2)
+ print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
+ ```
58
+
59
+
60
+ ## Training data
61
+
62
+ The **nucleotide-transformer-500m-human-ref** model was pretrained on the [GRCh38 human reference genome](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/), which is available as a Hugging Face dataset [here](https://huggingface.co/datasets/InstaDeepAI/human_reference_genome). It consists of 3B nucleotides, amounting to roughly 500M 6-mer tokens.
63
+ ## Training procedure
64
+
65
+ ### Preprocessing
66
+
67
+ The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which splits sequences into 6-mers when possible and otherwise tokenizes each nucleotide separately, as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:
68
+
69
+ ```
70
+ <CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
71
+ ```
72
+
73
+ The tokenized sequences have a maximum length of 1,000 tokens.
74
+
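The tokenization rule described above can be illustrated with a minimal sketch. It is not the Nucleotide Transformer Tokenizer itself: the function name `tokenize_6mers` is hypothetical, and the greedy left-to-right scan with a single-nucleotide fallback is an assumption about how "6-mers when possible" is applied. The actual behaviour is documented in the associated repository.

```python
def tokenize_6mers(sequence, k=6):
    """Illustrative k-mer tokenization (hypothetical helper, not the library API).

    Assumption: scan left to right, emit a k-mer when the next k characters
    are all A/C/G/T, otherwise emit a single nucleotide and move on by one.
    """
    tokens = ["<CLS>"]  # the model inputs start with a CLS token
    i = 0
    while i < len(sequence):
        window = sequence[i:i + k]
        if len(window) == k and set(window) <= set("ACGT"):
            tokens.append(window)       # a full, unambiguous 6-mer
            i += k
        else:
            tokens.append(sequence[i])  # fall back to one nucleotide
            i += 1
    return tokens


print(tokenize_6mers("ACGTGTACGTGC"))
print(tokenize_6mers("ACGTGTAC"))
```

Under these assumptions, a sequence whose length is a multiple of 6 becomes 6-mer tokens only, while the trailing `AC` in the second call is emitted as two single-nucleotide tokens.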
75
+ The masking procedure used is the standard one for BERT-style training:
76
+ - 15% of the tokens are masked.
77
+ - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
78
+ - In 10% of the cases, the masked tokens are replaced by a random token (different from the one they replace).
79
+ - In the 10% remaining cases, the masked tokens are left as is.
80
+
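The masking steps listed above can be sketched in plain Python. This is an illustration, not the training code used for the model: the helper `mask_tokens`, its parameters, and the token ids in the usage lines are hypothetical, and the sketch does not exclude special tokens such as `<CLS>` from masking, which a real pipeline would likely do.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_ids, mask_prob=0.15, rng=None):
    """BERT-style masking sketch (hypothetical helper).

    Returns (inputs, labels); labels[i] is None where no prediction is required.
    """
    rng = rng if rng is not None else random.Random()
    inputs = list(token_ids)
    labels = [None] * len(inputs)
    for i, token in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                    # ~85% of tokens are not selected
        labels[i] = token               # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            inputs[i] = mask_id         # 80%: replace with the mask token
        elif draw < 0.9:
            # 10%: replace with a random token different from the original
            inputs[i] = rng.choice([v for v in vocab_ids if v != token])
        # remaining 10%: leave the token unchanged
    return inputs, labels


# Usage with made-up token ids: id 2 stands in for the mask token
inputs, labels = mask_tokens([10, 11, 12, 13, 14, 15], mask_id=2, vocab_ids=range(5, 100))
print(inputs, labels)
```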
81
+ ### Pretraining
82
+
83
+ The model was trained on 8 A100 80GB GPUs on 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1,000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During an initial warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, after which it decreased following a square root decay until the end of training.
84
+
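The learning rate schedule described above can be sketched as follows. The warmup values come from the text; the exact form of the square root decay is not given there, so the inverse square root form used here (`PEAK_LR * sqrt(WARMUP_STEPS / step)`) is an assumption, and all names are hypothetical.

```python
import math

WARMUP_STEPS = 16_000   # from the text: 16k warmup steps
INITIAL_LR = 5e-5       # learning rate at step 0
PEAK_LR = 1e-4          # learning rate at the end of warmup

# 300B training tokens at 1M tokens per batch gives the number of update steps
TOTAL_STEPS = 300_000_000_000 // 1_000_000


def learning_rate(step):
    """Linear warmup followed by an (assumed) inverse square root decay."""
    if step < WARMUP_STEPS:
        return INITIAL_LR + (PEAK_LR - INITIAL_LR) * step / WARMUP_STEPS
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)


for step in (0, 8_000, 16_000, 64_000, TOTAL_STEPS):
    print(step, learning_rate(step))
```

With this form the schedule is continuous at the end of warmup, and the learning rate halves each time the step count quadruples.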
85
+
86
+ ### BibTeX entry and citation info
87
+
88
+ ```bibtex
89
+ @article{dalla2023nucleotide,
90
+ title={The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics},
91
+ author={Dalla-Torre, Hugo and Gonzalez, Liam and Mendoza Revilla, Javier and Lopez Carranza, Nicolas and Henryk Grywaczewski, Adam and Oteri, Francesco and Dallago, Christian and Trop, Evan and Sirelkhatim, Hassan and Richard, Guillaume and others},
92
+ journal={bioRxiv},
93
+ pages={2023--01},
94
+ year={2023},
95
+ publisher={Cold Spring Harbor Laboratory}
96
+ }
97
+ ```