lgq12697 committed
Commit 3b7feee
Parent: 308629b

Update README.md

Files changed (1): README.md (+11 -22)
README.md CHANGED
@@ -18,43 +18,32 @@ All the models have a comparable model size between 90 MB and 150 MB, BPE tokeni
  ### Model Sources

  - **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- - **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]()

  ### Architecture

- The model is trained based on the OpenAI GPT-2 model with modified tokenizer specific for DNA sequence.
 
 
  ### How to use

  Install the runtime library first:
  ```bash
  pip install transformers
  ```

- Here is a simple code for inference:
- ```python
- from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
-
- model_name = 'plant-dnagpt-H3K27ac'
- # load model and tokenizer
- model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
-
- # inference
- sequences = ['GCTTTGGTTTATACCTTACACAACATAAATCACATAGTTAATCCCTAATCGTCTTTGATTCTCAATGTTTTGTTCATTTTTACCATGAACATCATCTGATTGATAAGTGCATAGAGAATTAACGGCTTACACTTTACACTTGCATAGATGATTCCTAAGTATGTCCT',
-              'TAGCCCCCTCCTCTCTTTATATAGTGCAATCTAATATATGAAAGGTTCGGTGATGGGGCCAATAAGTGTATTTAGGCTAGGCCTTCATGGGCCAAGCCCAAAAGTTTCTCAACACTCCCCCTTGAGCACTCACCGCGTAATGTCCATGCCTCGTCAAAACTCCATAAAAACCCAGTG']
- pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
-                 trust_remote_code=True, top_k=None)
- results = pipe(sequences)
- print(results)
-
- ```

  ### Training data
- We use GPT2ForSequenceClassification to fine-tune the model.
  Detailed training procedure can be found in our manuscript.

  #### Hardware
- Model was trained on a NVIDIA GTX1080Ti GPU (11 GB).
 
  ### Model Sources

  - **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
+ - **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

  ### Architecture

+ The model is trained based on the Mamba-130m state-space model with a modified tokenizer specific to DNA sequences.
+
+ This model is fine-tuned for predicting open chromatin regions.
 
  ### How to use

  Install the runtime library first:
  ```bash
  pip install transformers
+ pip install "causal-conv1d<=1.2.0"
+ pip install "mamba-ssm<2.0.0"
  ```
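+
+ The Mamba CUDA kernels only run on an NVIDIA GPU, so it is worth verifying the environment before loading a model. A minimal sanity check (an illustrative addition, not from the original README):
+ ```python
+ # Confirm that a CUDA device and the compiled Mamba kernels are available
+ import torch
+ import mamba_ssm  # provided by the mamba-ssm package installed above
+
+ assert torch.cuda.is_available(), 'Plant DNAMamba requires an NVIDIA GPU'
+ print(torch.__version__, mamba_ssm.__version__)
+ ```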

+ Since the `transformers` library (version < 4.43.0) does not provide a MambaForSequenceClassification class, we wrote a custom script to train the Mamba model for sequence classification.
+ Example inference code can be found in our [GitHub](https://github.com/zhangtaolab/plant_DNA_LLMs).
+ Note that the Plant DNAMamba model requires an NVIDIA GPU to run.
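+
+ As a quick illustration, inference can follow the same pipeline pattern as the GPT-2 snippet removed above, assuming the Hugging Face repository ships the custom classification code via `trust_remote_code` (the model name below is a placeholder, not a confirmed identifier):
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+
+ # placeholder model id; substitute the actual repository name
+ model_name = 'plant-dnamamba-open-chromatin'
+ # trust_remote_code loads the repo's custom MambaForSequenceClassification class
+ model = AutoModelForSequenceClassification.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
+
+ # device=0 places the pipeline on the first NVIDIA GPU
+ pipe = pipeline('text-classification', model=model, tokenizer=tokenizer,
+                 trust_remote_code=True, top_k=None, device=0)
+ print(pipe(['GATTACA' * 20]))  # toy sequence; use real genomic regions in practice
+ ```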
  ### Training data
+ We use a custom MambaForSequenceClassification script to fine-tune the model.
  Detailed training procedure can be found in our manuscript.
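+
+ The custom script itself lives in the GitHub repository; as a rough sketch of what such a wrapper can look like (the class layout, mean pooling, and the `state-spaces/mamba-130m-hf` backbone are our illustrative assumptions, not the repository's actual code):
+ ```python
+ import torch.nn as nn
+ from transformers import MambaModel
+
+ class MambaForSequenceClassification(nn.Module):
+     """Illustrative sketch: Mamba backbone plus a mean-pooled linear head."""
+     def __init__(self, base='state-spaces/mamba-130m-hf', num_labels=2):
+         super().__init__()
+         self.backbone = MambaModel.from_pretrained(base)
+         self.classifier = nn.Linear(self.backbone.config.hidden_size, num_labels)
+
+     def forward(self, input_ids, labels=None):
+         hidden = self.backbone(input_ids=input_ids).last_hidden_state
+         logits = self.classifier(hidden.mean(dim=1))  # pool over sequence positions
+         loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
+         return {'loss': loss, 'logits': logits}
+ ```
+ Returning a dict with `loss` and `logits` keeps a wrapper like this compatible with the standard Hugging Face `Trainer`.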

  #### Hardware
+ The model was trained on an NVIDIA RTX 4090 GPU (24 GB).