---
license: cc-by-nc-sa-4.0
widget:
- text: ACCTGA<mask>TTCTGAGTC
tags:
- DNA
- biology
- genomics
- segmentation
---
# segment-nt-30kb

Segment-NT-30kb is a segmentation model that leverages the Nucleotide Transformer (NT) DNA foundation model to predict the location of several types of genomic
elements in a sequence at single-nucleotide resolution. It was trained on 14 different classes of human genomic elements in input sequences up to 30kb. These
include gene elements (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory elements (polyA signal,
tissue-invariant and tissue-specific promoters and enhancers, and CTCF-bound sites).
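
For reference, the 14 classes can be written out as a list; the label strings below are illustrative and should be checked against the model configuration rather than taken as the checkpoint's own naming:

```python
# Illustrative listing of the 14 annotation classes described above.
# The exact strings used by the released checkpoint may differ.
GENE_ELEMENTS = [
    "protein_coding_gene", "lncRNA", "5UTR", "3UTR",
    "exon", "intron", "splice_acceptor", "splice_donor",
]
REGULATORY_ELEMENTS = [
    "polyA_signal", "promoter_tissue_invariant", "promoter_tissue_specific",
    "enhancer_tissue_invariant", "enhancer_tissue_specific", "CTCF_bound_site",
]
assert len(GENE_ELEMENTS) + len(REGULATORY_ELEMENTS) == 14
```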

**Developed by:** InstaDeep, NVIDIA and TUM

### Model Sources

- **Repository:** [Nucleotide Transformer](https://github.com/instadeepai/nucleotide-transformer)
- **Paper:** [Segmenting the genome at single-nucleotide resolution with DNA foundation models]() TODO: Add link to preprint

### How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models:
```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
```

The following snippet retrieves both logits and embeddings from a dummy DNA sequence.
```python
# Load model and tokenizer
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb", trust_remote_code=True)
model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb", trust_remote_code=True)

# Choose the length to which the input sequences are padded. By default, the
# model max length is chosen, but feel free to decrease it as the time taken to
# obtain the embeddings increases significantly with it.
max_length = tokenizer.model_max_length

# Create a dummy DNA sequence and tokenize it
sequences = ["ATTCCGATTCCGATTCCG", "ATTTCTCTCTCTCTCTGAGATCGATCGATCGAT"]
tokens_ids = tokenizer.batch_encode_plus(sequences, return_tensors="pt", padding="max_length", max_length=max_length)["input_ids"]

# Run inference
attention_mask = tokens_ids != tokenizer.pad_token_id
outs = model(
    tokens_ids,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Obtain the logits over the genomic features and turn them into probabilities
logits = outs.logits.detach()
probabilities = torch.nn.functional.softmax(logits, dim=-1)
```
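
If this checkpoint follows the SegmentNT convention of predicting, for every nucleotide and every genomic feature, a two-way absent/present distribution, per-feature probabilities can be extracted as sketched below. The `model.config.features` attribute and the `(batch, num_nucleotides, num_features, 2)` logits layout are assumptions to verify against the loaded model:

```python
# Sketch of post-processing the probabilities computed above.
# Assumes logits of shape (batch, num_nucleotides, num_features, 2), with the
# last axis scoring absent/present for each feature, and that the feature
# names are exposed on the configuration. Verify both for this checkpoint.
features = model.config.features  # assumed attribute name
print(f"Probabilities shape: {probabilities.shape}")

# Probability that each nucleotide belongs to an intron
idx_intron = features.index("intron")
probabilities_intron = probabilities[:, :, idx_intron, 1]
print(f"Intron probabilities shape: {probabilities_intron.shape}")
```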


## Training data

The **segment-nt-30kb** model was trained on all human chromosomes except for chromosomes 20 and 21, which are kept as a test set, and chromosome 22, which is used as a validation set.
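
Written out explicitly, the split is as follows (an illustrative restatement of the sentence above, not a configuration file from the training code; whether the sex chromosomes are included is not specified here):

```python
# Chromosome-level split for segment-nt-30kb, restating the description above.
TEST_CHROMOSOMES = ["chr20", "chr21"]
VALIDATION_CHROMOSOMES = ["chr22"]
TRAIN_CHROMOSOMES = [
    f"chr{i}"
    for i in range(1, 23)
    if f"chr{i}" not in TEST_CHROMOSOMES + VALIDATION_CHROMOSOMES
]
```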

## Training procedure

### Preprocessing

The DNA sequences are tokenized using the Nucleotide Transformer tokenizer, which tokenizes sequences as 6-mer tokens as described in the [Tokenization](https://github.com/instadeepai/nucleotide-transformer#tokenization-abc) section of the associated repository. This tokenizer has a vocabulary size of 4105. The inputs of the model are then of the form:

```
<CLS> <ACGTGT> <ACGTGC> <ACGGAC> <GACTAG> <TCAGCA>
```
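
The 6-mer tokenization can be checked directly with the tokenizer loaded in the snippet above; the exact special-token handling is tokenizer-dependent, so the printed output is indicative only:

```python
# Inspect how a short sequence is split into 6-mer tokens,
# using the `tokenizer` loaded in the "How to use" snippet.
example_sequence = "ACGTGTACGTGCACGGACGACTAGTCAGCA"
example_ids = tokenizer(example_sequence)["input_ids"]
print(tokenizer.convert_ids_to_tokens(example_ids))
# Expected: a <CLS> token followed by 6-mer tokens such as 'ACGTGT', 'ACGTGC', ...
```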

### Training

The model was trained on a DGX H100 for a total of 23B tokens. It was trained successively on 3kb, 10kb, 20kb and finally 30kb sequences, each time with an effective batch size of 256 sequences.
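
Written as a simple schedule, the curriculum above looks as follows (illustrative restatement only, not the actual training configuration):

```python
# Context-length curriculum described above; illustrative only.
TRAINING_STAGES = [
    {"sequence_length_bp": 3_000, "effective_batch_size": 256},
    {"sequence_length_bp": 10_000, "effective_batch_size": 256},
    {"sequence_length_bp": 20_000, "effective_batch_size": 256},
    {"sequence_length_bp": 30_000, "effective_batch_size": 256},
]
```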


### Architecture

The model is composed of the [nucleotide-transformer-v2-50m-multi-species](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-50m-multi-species) encoder, from which we removed the language model head and replaced it with a 1-dimensional U-Net segmentation head [4] made of 2 downsampling convolutional blocks and 2 upsampling convolutional blocks. Each of these blocks is made of 2 convolutional layers with 1,024 and 2,048 kernels respectively. This additional segmentation head accounts for 53 million parameters, bringing the total number of parameters to 562M.
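
A minimal PyTorch sketch of such a segmentation head is given below. It assumes 512-dimensional token embeddings from the 50M encoder, one token per 6-mer, additive skip connections, and generic kernel sizes and activations; none of these details are taken from the released implementation:

```python
import torch
import torch.nn as nn


class UNet1DSegmentationHead(nn.Module):
    """Illustrative 1D U-Net head with 2 downsampling and 2 upsampling blocks
    of 1,024 and 2,048 channels. Kernel sizes, activations, skip-connection
    style and the final per-nucleotide projection are assumptions."""

    def __init__(self, embed_dim=512, num_features=14, nucleotides_per_token=6):
        super().__init__()
        self.down1 = nn.Sequential(  # embed_dim -> 1,024 channels, halves the length
            nn.Conv1d(embed_dim, 1024, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.down2 = nn.Sequential(  # 1,024 -> 2,048 channels, halves the length again
            nn.Conv1d(1024, 2048, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(2048, 2048, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
        self.up1 = nn.Sequential(  # 2,048 -> 1,024 channels, doubles the length
            nn.ConvTranspose1d(2048, 1024, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(1024, 1024, kernel_size=3, padding=1), nn.GELU(),
        )
        self.up2 = nn.Sequential(  # 1,024 -> embed_dim channels, back to the token length
            nn.ConvTranspose1d(1024, embed_dim, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=3, padding=1), nn.GELU(),
        )
        # Project each token embedding to 2 logits per feature per nucleotide.
        self.classifier = nn.Linear(embed_dim, nucleotides_per_token * num_features * 2)
        self.num_features = num_features
        self.nucleotides_per_token = nucleotides_per_token

    def forward(self, token_embeddings):
        # token_embeddings: (batch, num_tokens, embed_dim); num_tokens should be
        # divisible by 4 in this sketch so the down/up-sampling lengths match.
        x = token_embeddings.transpose(1, 2)   # (batch, embed_dim, num_tokens)
        d1 = self.down1(x)                     # (batch, 1024, num_tokens / 2)
        d2 = self.down2(d1)                    # (batch, 2048, num_tokens / 4)
        u1 = self.up1(d2) + d1                 # additive skip connection (simplification)
        u2 = self.up2(u1)                      # (batch, embed_dim, num_tokens)
        logits = self.classifier(u2.transpose(1, 2))
        batch, num_tokens, _ = logits.shape
        # (batch, num_nucleotides, num_features, 2)
        return logits.view(batch, num_tokens * self.nucleotides_per_token,
                           self.num_features, 2)
```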

### BibTeX entry and citation info

#TODO: Add bibtex citation here
```bibtex

```