hdallatorre committed on
Commit
3683fa4
1 Parent(s): 69f5390

Update README.md

  1. README.md +7 -7
README.md CHANGED
@@ -8,13 +8,13 @@ tags:
  - genomics
  - segmentation
  ---
- # segment-nt-30kb-multi-species
+ # segment-nt-multi-species

- Segment-NT-30kb-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
- elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [Segment-NT-30kb](https://huggingface.co/InstaDeepAI/segment_nt_30kb) model on a dataset encompassing the human genome
+ Segment-NT-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
+ elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [Segment-NT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome
  but also the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.

- For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **Segment-NT-30kb**, mainly because only this subset of annotations is
+ For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **Segment-NT**, mainly because only this subset of annotations is
  available for these species. The annotations therefore concern the 7 main gene elements available from Ensembl [REF], namely protein-coding gene, 5’UTR, 3’UTR, intron, exon,
  splice acceptor and donor sites.

@@ -59,8 +59,8 @@ features = [
  "promoter_Tissue_invariant",
  ]

- tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_30kb_multi_species", trust_remote_code=True)
- model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_30kb_multi_species", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
+ model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)

  # Choose the length to which the input sequences are padded. By default, the
  # model max length is chosen, but feel free to decrease it as the time taken to
@@ -100,7 +100,7 @@ print(f"Intron probabilities shape: {probabilities_intron.shape}")

  ## Training data

- The **segment-nt-30kb-multi-species** model was finetuned on human, mouse, chicken, fly, zebrafish and worm genomes. For each specie, a subset of chromosomes is kept as
+ The **segment-nt-multi-species** model was finetuned on human, mouse, chicken, fly, zebrafish and worm genomes. For each specie, a subset of chromosomes is kept as
  validation for training monitoring and test for final evaluation.

  ## Training procedure