hdallatorre committed on
Commit
3baab28
1 Parent(s): 82d9b23

Update README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -34,12 +34,14 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
  ```
 
  A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.
-
-
  ⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However,
  Segment-NT-multi-species has been shown to generalize up to sequences of 50,000 bp. In case you need to infer on sequences between 30kbp and 50kbp, make sure to change
- the `rescaling_factor` argument in the config to `num_dna_tokens_inference / max_num_tokens_nt` where `num_dna_tokens_inference` is the number of tokens at inference
- (i.e 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the max number of tokens on which the backbone nucleotide-transformer was trained on, i.e `2048`.
+ the `rescaling_factor` of the Rotary Embedding layer in the esm model `num_dna_tokens_inference / max_num_tokens_nt` where `num_dna_tokens_inference` is the number of tokens at inference
+ (i.e 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the max number of tokens on which the backbone nucleotide-transformer was trained on, i.e `2048`.
+
+ [![Open All Collab](https://colab.research.google.com/assets/colab-badge.svg)]
+ The `inference_segment_nt.ipynb` notebook shows how to set the rescaling factor and infer on a 50kb sequence of the human chromosome 20 in order to reproduce Fig.3 of the
+ paper.
 
  ```python
  # Load model and tokenizer
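
The token counts quoted in the README text of this hunk (5001 tokens for 30,000 nucleotides, 6669 tokens for 40008 base pairs) are consistent with one token per 6 nucleotides plus one CLS token. The sketch below works through that arithmetic and the resulting `rescaling_factor`. The 6-nucleotide token size is an assumption inferred from those numbers, and the helper names `num_dna_tokens` and `rescaling_factor` are hypothetical; they are not part of the model's API, and how the value is applied to the Rotary Embedding layer is left to the notebook the README mentions.

```python
KMER_SIZE = 6             # assumption: one token per 6 nucleotides (inferred from the README's numbers)
MAX_NUM_TOKENS_NT = 2048  # max tokens the backbone was trained on, as stated in the README


def num_dna_tokens(sequence_length_bp: int, kmer_size: int = KMER_SIZE) -> int:
    """Token count for a sequence of the given length, including the CLS token."""
    if sequence_length_bp % kmer_size != 0:
        # Simplification for this sketch: only whole k-mers are handled.
        raise ValueError("sequence length must be a multiple of the k-mer size")
    return sequence_length_bp // kmer_size + 1  # +1 accounts for the CLS token


def rescaling_factor(sequence_length_bp: int,
                     max_num_tokens_nt: int = MAX_NUM_TOKENS_NT) -> float:
    """num_dna_tokens_inference / max_num_tokens_nt, following the README's formula."""
    return num_dna_tokens(sequence_length_bp) / max_num_tokens_nt


if __name__ == "__main__":
    for length_bp in (30_000, 40_008):
        tokens = num_dna_tokens(length_bp)
        factor = rescaling_factor(length_bp)
        print(f"{length_bp} bp: {tokens} tokens, rescaling_factor = {factor:.4f}")
```

For the README's 40008 bp example this gives 6669 tokens and a factor of about 3.256.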