hdallatorre commited on
Commit
ea67c45
1 Parent(s): 8a46454

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -2
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
10
  ---
11
  # segment-nt
12
 
13
- Segment-NT is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
14
  elements in a sequence at a single nucleotide resolution. It was trained on 14 different classes of human genomics elements in input sequences up to 30kb. These
15
  include gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and
16
  tissue-specific promoters and enhancers, and CTCF-bound sites) elements.
@@ -36,7 +36,7 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
36
  A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.
37
 
38
  ⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However,
39
- Segment-NT-multi-species has been shown to generalize up to sequences of 50,000 bp. In case you need to infer on sequences between 30kbp and 50kbp, make sure to change
40
  the `rescaling_factor` of the Rotary Embedding layer in the esm model `num_dna_tokens_inference / max_num_tokens_nt` where `num_dna_tokens_inference` is the number of tokens at inference
41
  (i.e 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the max number of tokens on which the backbone nucleotide-transformer was trained on, i.e `2048`.
42
 
 
10
  ---
11
  # segment-nt
12
 
13
+ SegmentNT is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
14
  elements in a sequence at a single nucleotide resolution. It was trained on 14 different classes of human genomics elements in input sequences up to 30kb. These
15
  include gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and
16
  tissue-specific promoters and enhancers, and CTCF-bound sites) elements.
 
36
  A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.
37
 
38
  ⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However,
39
+ SegmentNT-multi-species has been shown to generalize up to sequences of 50,000 bp. In case you need to infer on sequences between 30kbp and 50kbp, make sure to change
40
  the `rescaling_factor` of the Rotary Embedding layer in the esm model `num_dna_tokens_inference / max_num_tokens_nt` where `num_dna_tokens_inference` is the number of tokens at inference
41
  (i.e 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the max number of tokens on which the backbone nucleotide-transformer was trained on, i.e `2048`.
42