hdallatorre committed
Commit
870bbf9
1 Parent(s): 4260bc0

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -68,6 +68,8 @@ probabilities = torch.nn.functional.softmax(logits, dim=-1)
 ## Training data
 
 The **segment-nt-30kb** model was trained on all human chromosomes except for chromosomes 20 and 21, kept as a test set, and chromosome 22, used as a validation set.
+During training, sequences are randomly sampled in the genome with their associated annotations. However, we keep the sequences in the validation and test sets fixed by
+using a sliding window of length 30,000 over chromosomes 20 and 21. The validation set was used to monitor training and for early stopping.
 
 ## Training procedure
 
@@ -81,7 +83,7 @@ The DNA sequences are tokenized using the Nucleotide Transformer Tokenizer, whic
 
 ### Training
 
-The model was trained on a DGXH100 on a total of 23B tokens. The model was trained on 3kb, 10kb, 20kb and finally 30kb sequences, each time with an effective batch size of 256 sequences.
+The model was trained on a DGXH100 node with 8 GPUs on a total of 23B tokens for 3 days. The model was trained on 3kb, 10kb, 20kb and finally 30kb sequences, each time with an effective batch size of 256 sequences.
 
 
 ### Architecture
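
The fixed validation/test split described in this commit — a sliding window of length 30,000 over the held-out chromosomes — can be sketched as below. This is a minimal illustration, not code from the repository; the `sliding_windows` helper and the non-overlapping stride are assumptions (the commit does not state the stride).

```python
def sliding_windows(chromosome_seq: str, window: int = 30_000, step: int = 30_000):
    """Cut a chromosome sequence into fixed windows of `window` bases.

    Assumption: stride equals window length (non-overlapping windows);
    trailing bases shorter than `window` are dropped.
    """
    return [
        chromosome_seq[i:i + window]
        for i in range(0, len(chromosome_seq) - window + 1, step)
    ]

# Toy example: a 100,000-bp "chromosome" yields 3 full 30,000-bp windows.
toy_chromosome = "ACGT" * 25_000  # 100,000 bases
windows = sliding_windows(toy_chromosome)
print(len(windows))     # 3
print(len(windows[0]))  # 30000
```

Because the windows are deterministic, the validation and test sequences stay identical across training runs, unlike the randomly sampled training sequences.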