hdallatorre committed
Commit 3173a3e
1 Parent(s): c13dc5f

Update README.md

Files changed (1): README.md (+5 -5)
README.md CHANGED
@@ -10,11 +10,11 @@ datasets:
  - InstaDeepAI/multi_species_genome
  - InstaDeepAI/nucleotide_transformer_downstream_tasks
  ---
- # nucleotide-transformer-v2-50-multi-species
+ # nucleotide-transformer-v2-50m-multi-species
 
  The Nucleotide Transformers are a collection of foundational language models that were pre-trained on DNA sequences from whole genomes. Compared to other approaches, our models not only integrate information from single reference genomes, but also leverage DNA sequences from over 3,200 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Through robust and extensive evaluation, we show that these large models provide extremely accurate molecular phenotype prediction compared to existing methods.
 
- Part of this collection is the **nucleotide-transformer-v2-50-multi-species**, a 50M-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms.
+ Part of this collection is the **nucleotide-transformer-v2-50m-multi-species**, a 50M-parameter transformer pre-trained on a collection of 850 genomes from a wide range of species, including model and non-model organisms.
 
  **Developed by:** InstaDeep, NVIDIA and TUM
 
@@ -39,8 +39,8 @@ from transformers import AutoTokenizer, AutoModelForMaskedLM
  import torch
 
  # Import the tokenizer and the model
- tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50-multi-species")
- model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50-multi-species")
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
+ model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
 
  # Create a dummy DNA sequence and tokenize it
  sequences = ['ATTCTG' * 9]
@@ -68,7 +68,7 @@ print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
 
  ## Training data
 
- The **nucleotide-transformer-v2-50-multi-species** model was pretrained on a total of 850 genomes downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/). Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were included in the collection of genomes, which represents a total of 174B nucleotides, i.e. roughly 29B tokens. The data has been released as a HuggingFace dataset [here](https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes).
+ The **nucleotide-transformer-v2-50m-multi-species** model was pretrained on a total of 850 genomes downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/). Plants and viruses are not included in these genomes, as their regulatory elements differ from those of interest in the paper's tasks. Some heavily studied model organisms were included in the collection of genomes, which represents a total of 174B nucleotides, i.e. roughly 29B tokens. The data has been released as a HuggingFace dataset [here](https://huggingface.co/datasets/InstaDeepAI/multi_species_genomes).
 
  ## Training procedure
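The README's usage snippet is visible only in fragments across the hunks above: the loading lines, the dummy sequence, and a final `print` of mean sequence embeddings in the last hunk's context. For orientation, here is a minimal sketch of what the full example plausibly does with the renamed checkpoint. The tokenizer call, the `output_hidden_states` flag, and the masked-mean computation are reconstructions from those fragments, not a verbatim copy of the README.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load the tokenizer and model under the corrected checkpoint name
# (the "+" lines of this commit)
tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")
model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/nucleotide-transformer-v2-50m-multi-species")

# Create a dummy DNA sequence and tokenize it (as in the diff context)
sequences = ['ATTCTG' * 9]
tokens = tokenizer(sequences, return_tensors="pt", padding=True)
input_ids = tokens["input_ids"]
attention_mask = tokens["attention_mask"]

# Forward pass without gradients; the last layer's hidden states serve as
# per-token embeddings
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask,
                    output_hidden_states=True)
embeddings = outputs.hidden_states[-1]  # (batch, seq_len, hidden_dim)

# Attention-masked mean over the token dimension, matching the
# "Mean sequence embeddings" print visible in the last hunk
mask = attention_mask.unsqueeze(-1)
mean_sequence_embeddings = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(f"Mean sequence embeddings: {mean_sequence_embeddings}")
```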
 
 
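The training-data paragraph links the released pre-training corpus on the Hub. As a hedged convenience, the sketch below shows one way to pull it with the `datasets` library; the `"train"` split name is an assumption, and the dataset may define named configurations that `load_dataset` would additionally require.

```python
from datasets import load_dataset

# Stream the multi-species corpus linked in the "Training data" section;
# streaming avoids materializing the ~174B-nucleotide corpus locally.
# The "train" split is an assumption, not confirmed by this commit.
dataset = load_dataset("InstaDeepAI/multi_species_genomes", streaming=True)
print(next(iter(dataset["train"])))
```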