philipp-zettl committed
Commit
193de8a
1 Parent(s): a64809f

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ pretrained/tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,209 @@
+ ---
+ license: mit
+ language: multilingual
+ library_name: torch
+ tags: []
+ base_model: BAAI/bge-m3
+ datasets:
+ - philipp-zettl/GGU-xx
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1-score
+ model_name: Multi-Head Sequence Classification Model
+ pipeline_tag: text-classification
+ widget:
+ - text: "Hello, how are you?"
+   label: "[GGU] Greeting"
+ - text: "Thank you for your help"
+   label: "[GGU] Gratitude"
+ - text: "Hallo, wie geht es dir?"
+   label: "[GGU] Greeting (de)"
+ - text: "Danke dir."
+   label: "[GGU] Gratitude (de)"
+ - text: "I am not sure what you mean"
+   label: "[GGU] Other"
+ - text: "Generate me an image of a dog!"
+   label: "[GGU] Other"
+ - text: "What is the weather like today?"
+   label: "[GGU] Other"
+ - text: "Wie ist das Wetter heute?"
+   label: "[GGU] Other (de)"
+ ---
+ # Multi-Head Sequence Classification Model
+ ## Model description
+ This is a multi-head sequence classification model built on the hidden outputs of a pre-trained transformer backbone. Multiple classification heads are attached to the backbone output, each classifying the input sequence for its own task.
+
+ ### Model architecture
+ The backbone of the model is [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3), which produces hidden states of size 1024.
+
+ An additional classification head (`GGU`, 3 classes) is attached to the backbone output to classify the input sequence.
+
+ The model was trained with the `MultiHeadClassificationTrainer` implementation provided in this repository.
+
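+ For illustration, the following is a minimal sketch of what such a multi-head classifier can look like. The class name, the mean-pooling step, and the forward signature are assumptions made for this example and may differ from the `MultiHeadSequenceClassificationModel` implementation shipped with this repository.
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from transformers import AutoModel
+
+
+ class MultiHeadClassifierSketch(nn.Module):
+     """Illustrative only: a shared transformer backbone with one linear head per task."""
+
+     def __init__(self, backbone_name='BAAI/bge-m3', head_config=None, dropout=0.25):
+         super().__init__()
+         head_config = head_config or {'GGU': 3}
+         self.backbone = AutoModel.from_pretrained(backbone_name)
+         hidden_size = self.backbone.config.hidden_size  # 1024 for BAAI/bge-m3
+         self.dropout = nn.Dropout(dropout)
+         self.heads = nn.ModuleDict({name: nn.Linear(hidden_size, n) for name, n in head_config.items()})
+
+     def forward(self, input_ids, attention_mask, head_names=None):
+         out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
+         # mean-pool the last hidden state over non-padding tokens (an assumption in this sketch)
+         mask = attention_mask.unsqueeze(-1).float()
+         pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
+         pooled = self.dropout(pooled)
+         return {name: self.heads[name](pooled) for name in (head_names or self.heads.keys())}
+ ```
+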
+ ### Use cases
+ Typical use cases include text classification and sentiment analysis.
+
+ ## Model Inference
+ The following snippet shows how to load the model and run inference:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # MultiHeadSequenceClassificationModel is the model class provided with this repository
+ from model import MultiHeadSequenceClassificationModel
+
+ model = MultiHeadSequenceClassificationModel.from_pretrained('philipp-zettl/multi-head-sequence-classification-model')
+ tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')
+
+
+ def predict(text):
+     inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
+     outputs = model(**inputs)
+     return outputs
+ ```
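+
+ As a usage sketch, assuming the model returns a dictionary of logits keyed by head name (as the evaluation code below suggests), predictions can be mapped back to labels like this:
+
+ ```python
+ # label ids of the GGU head, taken from the training code below
+ label_map = {0: 'Greeting', 1: 'Gratitude', 2: 'Other'}
+
+ outputs = predict("Hello, how are you?")
+ pred_id = outputs['GGU'].argmax(dim=-1).item()
+ print(label_map[pred_id])  # expected: 'Greeting'
+ ```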
+
+ ## Model Training
+
+ ### Confusion Matrix
+ **GGU**
+ ![Confusion Matrix GGU](assets/confusion_matrix_GGU.png)
+
+ ### Training Loss
+ **GGU**
+ ![Loss GGU](assets/loss_plot_GGU.png)
+
+ ### Training data
+ The model was trained on the following datasets, using the implementation provided by `MultiHeadClassificationTrainer`:
+ - [philipp-zettl/GGU-xx](https://huggingface.co/datasets/philipp-zettl/GGU-xx)
+
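+ To take a quick look at the data (a small sketch; the split name and example fields are assumptions about the dataset layout):
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset('philipp-zettl/GGU-xx')
+ print(ds)              # available splits and their sizes
+ print(ds['train'][0])  # one raw example
+ ```
+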
+ ### Training procedure
+ The following code was executed to train the model:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # MultiHeadClassificationTrainer is the trainer implementation provided with this repository.
+
+
+ def train_classifier():
+     backbone = AutoModel.from_pretrained('BAAI/bge-m3').to(torch.float16)
+     tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')
+     device = 'cuda' if torch.cuda.is_available() else 'cpu'
+
+     # label ids of the single GGU head
+     label_map = {
+         0: 'Greeting',
+         1: 'Gratitude',
+         2: 'Other'
+     }
+     map_label = {
+         label_map[i]: i
+         for i in label_map.keys()
+     }
+
+     num_labels = len(label_map.keys())
+
+     # HParams
+     dropout = 0.25
+     learning_rate = 3e-5
+     momentum = 0.9
+     l2_reg = 0.25
+
+     num_epochs = 35
+     l2_loss_weight = 0.25
+
+     model_conf = {
+         'backbone': backbone,
+         'head_config': {
+             'GGU': 3,
+         },
+         'dropout': dropout,
+         'l2_reg': l2_reg,
+     }
+
+     optimizer_conf = {
+         'lr': learning_rate,
+         'momentum': momentum
+     }
+
+     scheduler_conf = {
+         'factor': 0.2,
+         'patience': 3,
+         'min_lr': 1e-8
+     }
+
+     # note: the trainer call below overrides the learning rate (1e-4) and the epoch count (1)
+     train_run = 1000
+     trainer = MultiHeadClassificationTrainer(
+         model_conf=model_conf,
+         optimizer_conf={**optimizer_conf, 'lr': 1e-4},
+         scheduler_conf=scheduler_conf,
+         num_epochs=1,
+         l2_loss_weight=l2_loss_weight,
+         use_lr_scheduler=True,
+         train_run=train_run
+     )
+
+     new_model, history = trainer.train(dataset_name='philipp-zettl/GGU-xx', target_heads=['GGU'])
+     metrics = history['metrics']
+     history['loss_plot'] = trainer._plot_history(**metrics)
+     return new_model, history, trainer, label_map
+ ```
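+
+ One possible way to invoke the function above and persist the trained weights (a sketch; it assumes the returned model is a standard `torch.nn.Module`, and the file name matches the checkpoint shipped in this repository):
+
+ ```python
+ import torch
+
+ model, history, trainer, label_map = train_classifier()
+ torch.save(model.state_dict(), 'multi-head-sequence-classification-model-model.pth')
+ ```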
+
+ ## Evaluation
+ ### Evaluation data
+ For model evaluation, a 20% validation split of the training data was used.
+
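+ A minimal sketch of such an 80/20 split with the `datasets` library (the actual split is handled inside the trainer; the split name and seed here are assumptions):
+
+ ```python
+ from datasets import load_dataset
+
+ splits = load_dataset('philipp-zettl/GGU-xx')['train'].train_test_split(test_size=0.2, seed=42)
+ train_ds, eval_ds = splits['train'], splits['test']
+ ```
+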
+ ### Evaluation procedure
+ The model was evaluated with the evaluation method of the `MultiHeadClassificationTrainer` class, shown below:
+
+ ```python
+ # Dependencies used by this method (it lives on the trainer class, hence `self`):
+ import pandas as pd
+ import torch
+ from sklearn.metrics import (
+     ConfusionMatrixDisplay, accuracy_score, classification_report,
+     confusion_matrix, f1_score, recall_score,
+ )
+ from tqdm import tqdm
+ from transformers import BatchEncoding
+
+
+ def _eval_model(self, dataloader, label_map):
+     self.classifier.train(False)
+     eval_heads = list(label_map.keys())
+     y_pred = {h: [] for h in eval_heads}
+     y_test = {h: [] for h in eval_heads}
+     for sample in tqdm(dataloader, total=len(dataloader), desc='Evaluating model...'):
+         labels = {name: sample['label'] for name in eval_heads}
+         # re-assemble the tokenized batch and move it to the trainer's device
+         embeddings = BatchEncoding({k: torch.stack(v, dim=1).to(self.device) for k, v in sample.items() if k not in ['label', 'sample']})
+         output = self.classifier(embeddings.to('cuda'), head_names=eval_heads)
+         for head in eval_heads:
+             y_pred[head].extend(output[head].argmax(dim=1).cpu())
+             y_test[head].extend(labels[head])
+         torch.cuda.empty_cache()
+
+     accuracies = {h: accuracy_score(y_test[h], y_pred[h]) for h in eval_heads}
+     f1_scores = {h: f1_score(y_test[h], y_pred[h], average="macro") for h in eval_heads}
+     recalls = {h: recall_score(y_test[h], y_pred[h], average='macro') for h in eval_heads}
+
+     # build a per-head report: classification report, confusion matrix and summary metrics
+     report = {}
+     for head in eval_heads:
+         cm = confusion_matrix(y_test[head], y_pred[head], labels=list(label_map[head].keys()))
+         disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=list(label_map[head].values()))
+         clf_report = classification_report(
+             y_test[head], y_pred[head], output_dict=True, target_names=list(label_map[head].values())
+         )
+         del clf_report["accuracy"]
+         clf_report = pd.DataFrame(clf_report).T.reset_index()
+         report[head] = dict(
+             clf_report=clf_report, confusion_matrix=disp, metrics={'accuracy': accuracies[head], 'f1': f1_scores[head], 'recall': recalls[head]}
+         )
+     return report
+ ```
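+
+ A hypothetical way to consume the returned report (assuming access to a trainer instance and an evaluation dataloader); per head, `clf_report` is a `pandas.DataFrame` and `confusion_matrix` a `ConfusionMatrixDisplay`, as constructed above:
+
+ ```python
+ report = trainer._eval_model(
+     eval_dataloader,  # hypothetical: the validation DataLoader used above
+     label_map={'GGU': {0: 'Greeting', 1: 'Gratitude', 2: 'Other'}},
+ )
+ print(report['GGU']['metrics'])       # accuracy, f1, recall
+ print(report['GGU']['clf_report'])    # per-class precision/recall/f1/support
+ report['GGU']['confusion_matrix'].plot()  # the confusion matrix shown above
+ ```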
+
+ ### Metrics
+ For evaluation, we used the following metrics: accuracy, precision, recall, and f1-score. A detailed classification report is given below.
+
+ **GGU:**
+ | label        | precision | recall   | f1-score | support |
+ |:-------------|----------:|---------:|---------:|--------:|
+ | Greeting     | 0.725     | 0.935484 | 0.816901 | 31      |
+ | Gratitude    | 0.952381  | 0.740741 | 0.833333 | 27      |
+ | Other        | 0.954545  | 0.893617 | 0.923077 | 47      |
+ | macro avg    | 0.877309  | 0.856614 | 0.857771 | 105     |
+ | weighted avg | 0.886218  | 0.866667 | 0.868653 | 105     |
+
assets/confusion_matrix_GGU.png ADDED
assets/loss_plot_GGU.png ADDED
heads/GGU.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f81781b33246b9e3a70fe2b4954bb548d29884b5902be26965e94e6653312345
+ size 7552
multi-head-sequence-classification-model-model.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bf5bed442997472409a9e1a7115df63c9712e773362c31621f60be199f9935d3
+ size 1135694619
pretrained/backbone/config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "BAAI/bge-m3",
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 8194,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float16",
+   "transformers_version": "4.41.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
pretrained/backbone/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68440cc1b73b9af8ab85ecdc138b51877493ffbcec92a0a16e5d7e518eb22908
+ size 1135554344
pretrained/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
pretrained/tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c119aa9bc83a5d76efbbc831b23e5790727c12fde474f6519dd96cde6550ffd7
+ size 17083052
pretrained/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 128,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ transformers
+ accelerate
+ datasets
+ torch
+ scikit-learn
+ pandas
+ matplotlib
train.py ADDED
@@ -0,0 +1 @@
+ /home/phil/work/mb/easybits/model-zoo/model_zoo/train_classifier.py