hdallatorre committed on
Commit
9b6c4ab
1 Parent(s): 941645b

Update README.md

  1. README.md +5 -5
README.md CHANGED
@@ -8,9 +8,9 @@ tags:
  - genomics
  - segmentation
  ---
- # segment-nt-30kb
+ # segment-nt
 
- Segment-NT-30kb is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
+ Segment-NT is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
  elements in a sequence at a single nucleotide resolution. It was trained on 14 different classes of human genomics elements in input sequences up to 30kb. These
  include gene (protein-coding genes, lncRNAs, 5’UTR, 3’UTR, exon, intron, splice acceptor and donor sites) and regulatory (polyA signal, tissue-invariant and
  tissue-specific promoters and enhancers, and CTCF-bound sites) elements.
@@ -63,8 +63,8 @@ features = [
  "promoter_Tissue_invariant",
  ]
 
- tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb", trust_remote_code=True)
- model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
+ model = AutoModel.from_pretrained("InstaDeepAI/segment_nt", trust_remote_code=True)
 
  # Choose the length to which the input sequences are padded. By default, the
  # model max length is chosen, but feel free to decrease it as the time taken to
@@ -106,7 +106,7 @@ print(f"Intron probabilities shape: {probabilities_intron.shape}")
 
  ## Training data
 
- The **segment-nt-30kb** model was trained on all human chromosomes except for chromosomes 20 and 21, kept as test set, and chromosome 22, used as a validation set.
+ The **segment-nt** model was trained on all human chromosomes except for chromosomes 20 and 21, kept as test set, and chromosome 22, used as a validation set.
  During training, sequences are randomly sampled in the genome with associated annotations. However, we keep the sequences in the validation and test set fixed by
  using a sliding window of length 30,000 over the chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
 
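
The training-data paragraph in the last hunk mentions fixed evaluation sequences obtained with a sliding window of length 30,000. A minimal sketch of how such windows could be enumerated follows. The function name `sliding_windows`, the non-overlapping stride, the dropping of a trailing partial window, and the toy chromosome length are all assumptions for illustration; the README does not state them.

```python
from typing import Iterator, Tuple


def sliding_windows(
    chrom_length: int,
    window: int = 30_000,
    stride: int = 30_000,  # assumption: non-overlapping windows
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates of fixed-size windows.

    A trailing window shorter than `window` is dropped (assumption).
    """
    for start in range(0, chrom_length - window + 1, stride):
        yield start, start + window


# Toy chromosome length, chosen only to keep the example small.
windows = list(sliding_windows(100_000))
for start, end in windows:
    print(start, end)
```

Because the coordinates depend only on the chromosome length, the window size, and the stride, the same evaluation sequences are produced on every run, which is what keeping the validation and test sets fixed requires.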