Update README.md
### Model Sources

- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA large language models in plant genomes]()

### Architecture

The model is based on the state-space Mamba-130m model, with a modified tokenizer specific to DNA sequences.

This model is fine-tuned for predicting open chromatin.
29 |
### How to use
|
30 |
|
31 |
Install the runtime library first:
|
32 |
```bash
|
33 |
pip install transformers
|
34 |
+
pip install causal-conv1d<=1.2.0
|
35 |
+
pip install mamba-ssm<2.0.0
|
36 |
```
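To confirm the pinned packages actually resolved, a quick check such as the following can help (a minimal sketch; the package names are taken from the install commands above):

```python
from importlib.metadata import PackageNotFoundError, version

def report(packages):
    """Map each package name to its installed version, or None if missing."""
    out = {}
    for name in packages:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = None
    return out

# The three packages from the install step; None means "not installed yet".
print(report(["transformers", "causal-conv1d", "mamba-ssm"]))
```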

Since the `transformers` library (version < 4.43.0) does not provide a `MambaForSequenceClassification` class, we wrote a custom script to train the Mamba model for sequence classification.
Inference code can be found in our [GitHub](https://github.com/zhangtaolab/plant_DNA_LLMs).
Note that the Plant DNAMamba model requires an NVIDIA GPU to run.
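
For reference, the other Plant DNA model cards load their checkpoints through the standard `transformers` pipeline; a hedged sketch along those lines is below. The checkpoint name is a placeholder assumption (check the Hugging Face repository for the real one), and the `transformers` imports are done lazily so the snippet parses even before the dependencies are installed:

```python
def load_pipeline(model_name: str = "zhangtaolab/plant-dnamamba-open-chromatin"):
    """Build a text-classification pipeline (model name above is hypothetical)."""
    # Lazy imports so this module loads even where transformers is absent.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, pipeline)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    return pipeline("text-classification", model=model, tokenizer=tokenizer,
                    trust_remote_code=True, top_k=None)

def is_dna(seq: str) -> bool:
    """Cheap input check before inference: A/C/G/T/N only."""
    return bool(seq) and set(seq.upper()) <= set("ACGTN")

if __name__ == "__main__":
    seqs = ["GCTTTGGTTTATACCTTACACAACATAAATCACATAGTTAATCCCTAATC"]
    assert all(is_dna(s) for s in seqs)
    pipe = load_pipeline()  # requires an NVIDIA GPU and network access
    print(pipe(seqs))
```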

### Training data

We use a custom `MambaForSequenceClassification` script to fine-tune the model.
Detailed training procedure can be found in our manuscript.
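
Conceptually, such a classification wrapper adds a pooling step and a linear head on top of the backbone's hidden states. The sketch below is an illustration under assumed shapes and mean-pooling, not the actual training script:

```python
import numpy as np

def classify(hidden_states, W, b):
    """Pool token states over the sequence, project to logits, softmax.

    hidden_states: (seq_len, d_model) backbone outputs (assumed layout)
    W: (d_model, num_labels) head weights, b: (num_labels,) bias
    """
    pooled = hidden_states.mean(axis=0)   # (d_model,)
    logits = pooled @ W + b               # (num_labels,)
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

# Two labels (e.g. open vs. closed chromatin) over a toy 4-token sequence.
probs = classify(np.ones((4, 8)), np.zeros((8, 2)), np.zeros(2))
print(probs)  # uniform because the toy head weights are all zero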

#### Hardware

The model was trained on an NVIDIA RTX 4090 GPU (24 GB).