hdallatorre committed on
Commit 4e68cd2
1 Parent(s): 62ceedb

Create README.md

Files changed (1)
  1. README.md +103 -0
README.md ADDED
@@ -0,0 +1,103 @@
---
license: cc-by-nc-sa-4.0
widget:
- text: ACCTGA<mask>TTCTGAGTC
tags:
- DNA
- biology
- genomics
- segmentation
---
# segment-nt-30kb-multi-species

Segment-NT-30kb-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic
elements in a sequence at single-nucleotide resolution. It is the result of finetuning the [Segment-NT-30kb](https://huggingface.co/InstaDeepAI/segment_nt_30kb) model on a dataset encompassing not only the human genome
but also the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.

For the finetuning on the multi-species genomes, we curated a dataset from a subset of the annotations used to train **Segment-NT-30kb**, mainly because only this subset of annotations is
available for these species. The annotations therefore cover the 7 main gene elements available from Ensembl [REF], namely protein-coding gene, 5’UTR, 3’UTR, intron, exon, and
splice acceptor and donor sites.


**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models]() TODO: Add link to preprint

### How to use

<!-- Need to adapt this section to our model. Need to figure out how to load the models from huggingface and do inference on them -->
Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models:
```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

The small snippet of code below retrieves both logits and embeddings from a dummy DNA sequence.
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb_multi_species", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb_multi_species", trust_remote_code=True)


# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length

# Create a dummy DNA sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Compute the logits and embeddings
attention_mask = tokens_ids != tokenizer.pad_token_id
outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Keep the logits as a tensor so that softmax can be applied to obtain probabilities
logits = outs.logits.detach()
probabilities = torch.nn.functional.softmax(logits, dim=-1)
```
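
To inspect the predictions for one particular genomic element, a minimal sketch along the following lines can be used. It assumes that the model config exposes the ordered list of annotated elements as `model.config.features` and that the feature axis is the third dimension of the output; both points should be checked against the downloaded config.
```python
# Sketch: extract the probabilities predicted for a single genomic element (here the intron track).
# Assumptions: `model.config.features` lists the annotated elements in the order used by the
# output tensor, and the feature axis is the third dimension of `probabilities`.
print(f"Available features: {model.config.features}")
print(f"Probabilities shape: {probabilities.shape}")

idx_intron = model.config.features.index("intron")
probabilities_intron = probabilities[:, :, idx_intron]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
```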


## Training data

The **segment-nt-30kb-multi-species** model was finetuned on the human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is held out as
a validation set for training monitoring and as a test set for the final evaluation.

## Training procedure

### Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, which tokenizes sequences as 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

```
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```
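
To see this tokenization in practice, the small check below can be run with the tokenizer loaded in the "How to use" section; the exact token strings depend on the tokenizer's vocabulary.
```python
# Sketch: inspect how a DNA sequence is split into 6-mer tokens.
# Reuses the `tokenizer` loaded in the "How to use" section above.
sequence = "ACGTGTACGTGCACGGACGACTAGTCAGCA"  # 30 nucleotides, i.e. 5 complete 6-mers
ids = tokenizer.batch_encode_plus([sequence], return_tensors="pt")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids[0].tolist()))
# Expected to be along the lines of: ['<CLS>', 'ACGTGT', 'ACGTGC', 'ACGGAC', 'GACTAG', 'TCAGCA']
print(f"Vocabulary size: {tokenizer.vocab_size}")  # 4105 according to the description above
```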

### Training

The model was finetuned on a DGX H100 node with 8 GPUs on a total of 8B tokens for 3 days.


### Architecture

The model is composed of the [nucleotide-transformer-v2-500m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) encoder, from which we removed
the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these
blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters
to 562M.
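
For illustration only, a schematic PyTorch sketch of such a head is given below. It is not the exact implementation: the kernel sizes, activations, pooling and upsampling operators, and the wiring of the skip connections are assumptions; only the overall structure (2 downsampling and 2 upsampling blocks, each made of convolutions with 1,024 and 2,048 filters, followed by a per-position classification layer) follows the description above.
```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 1D convolutions with 1,024 and 2,048 filters; other choices here are illustrative."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 1024, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(1024, 2048, kernel_size=3, padding=1),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)


class UNet1DSegmentationHead(nn.Module):
    """Schematic 1D U-Net head: 2 downsampling and 2 upsampling blocks with skip connections."""

    def __init__(self, embed_dim: int, num_features: int):
        super().__init__()
        self.down1 = ConvBlock(embed_dim)
        self.down2 = ConvBlock(2048)
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.up1 = nn.ConvTranspose1d(2048, 2048, kernel_size=2, stride=2)
        self.up_block1 = ConvBlock(2048 + 2048)  # upsampled features concatenated with the skip connection
        self.up2 = nn.ConvTranspose1d(2048, 2048, kernel_size=2, stride=2)
        self.up_block2 = ConvBlock(2048 + 2048)
        self.classifier = nn.Conv1d(2048, num_features, kernel_size=1)  # per-position logits

    def forward(self, x):                    # x: (batch, embed_dim, length)
        d1 = self.down1(x)                   # (batch, 2048, length)
        d2 = self.down2(self.pool(d1))       # (batch, 2048, length / 2)
        bottom = self.pool(d2)               # (batch, 2048, length / 4)
        u1 = self.up_block1(torch.cat([self.up1(bottom), d2], dim=1))
        u2 = self.up_block2(torch.cat([self.up2(u1), d1], dim=1))
        return self.classifier(u2)           # (batch, num_features, length)
```

In the released model, a head of this kind consumes the encoder's output embeddings and produces logits at single-nucleotide resolution over the annotated genomic elements.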

### BibTeX entry and citation info

TODO: Add bibtex citation here
```bibtex

```