philipp-zettl
committed on
Commit • 193de8a
1 Parent(s): a64809f
Upload folder using huggingface_hub
Browse files
- .gitattributes +1 -0
- README.md +209 -0
- assets/confusion_matrix_GGU.png +0 -0
- assets/loss_plot_GGU.png +0 -0
- heads/GGU.pth +3 -0
- multi-head-sequence-classification-model-model.pth +3 -0
- pretrained/backbone/config.json +28 -0
- pretrained/backbone/model.safetensors +3 -0
- pretrained/tokenizer/special_tokens_map.json +51 -0
- pretrained/tokenizer/tokenizer.json +3 -0
- pretrained/tokenizer/tokenizer_config.json +55 -0
- requirements.txt +7 -0
- train.py +1 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pretrained/tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,209 @@
---
license: mit
language: multilingual
library_name: torch
tags: []
base_model: BAAI/bge-m3
datasets:
- philipp-zettl/GGU-xx
metrics:
- accuracy
- precision
- recall
- f1-score
model_name: Multi-Head Sequence Classification Model
pipeline_tag: text-classification
widget:
- text: "Hello, how are you?"
  label: "[GGU] Greeting"
- text: "Thank you for your help"
  label: "[GGU] Gratitude"
- text: "Hallo, wie geht es dir?"
  label: "[GGU] Greeting (de)"
- text: "Danke dir."
  label: "[GGU] Gratitude (de)"
- text: "I am not sure what you mean"
  label: "[GGU] Other"
- text: "Generate me an image of a dog!"
  label: "[GGU] Other"
- text: "What is the weather like today?"
  label: "[GGU] Other"
- text: "Wie ist das Wetter heute?"
  label: "[GGU] Other (de)"
---
# Multi-Head Sequence Classification Model

## Model description

The model is a simple sequence classification model built on the hidden output layers of a pre-trained transformer model. Multiple classification heads are added on top of the backbone's output to classify the input sequence.

### Model architecture

The backbone of the model is BAAI/bge-m3, with a hidden size of 1024.

An additional classification head (GGU: 3 classes) is added on top of the backbone's output to classify the input sequence.

The model was trained using the `MultiHeadClassificationTrainer` implementation provided in this repository.
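The trainer implementation itself is not reproduced in this card. As a rough sketch of what such a multi-head architecture typically looks like (the class name, pooling strategy, and forward signature here are illustrative assumptions, not the repository code):

```python
import torch.nn as nn

class MultiHeadClassifier(nn.Module):
    """Illustrative sketch: a shared transformer backbone with one linear head per task."""

    def __init__(self, backbone, head_config, dropout=0.25):
        super().__init__()
        self.backbone = backbone
        hidden_size = backbone.config.hidden_size  # 1024 for BAAI/bge-m3
        self.dropout = nn.Dropout(dropout)
        # one classification head per entry, e.g. {'GGU': 3}
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_size, num_classes)
            for name, num_classes in head_config.items()
        })

    def forward(self, inputs, head_names=None):
        # pool the first ([CLS]) token as the sequence representation (assumed pooling)
        hidden = self.backbone(**inputs).last_hidden_state[:, 0]
        hidden = self.dropout(hidden)
        names = head_names or list(self.heads.keys())
        # return one logit tensor per requested head
        return {name: self.heads[name](hidden) for name in names}
```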
### Use cases

Possible use cases include text classification and sentiment analysis.
## Model Inference

Inference code:

```python
import torch
from transformers import AutoTokenizer

# MultiHeadSequenceClassificationModel is provided with this repository
from model import MultiHeadSequenceClassificationModel

model = MultiHeadSequenceClassificationModel.from_pretrained('philipp-zettl/multi-head-sequence-classification-model')
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')

def predict(text):
    inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    return outputs
```
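Assuming the model returns a dict of logits keyed by head name (as the evaluation code below suggests), a prediction can be mapped back to a label like this:

```python
label_map = {0: 'Greeting', 1: 'Gratitude', 2: 'Other'}

outputs = predict("Hello, how are you?")
pred = outputs['GGU'].argmax(dim=1).item()  # assumes a {'GGU': logits} output format
print(label_map[pred])  # e.g. 'Greeting'
```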
## Model Training

#### Confusion Matrix

**GGU**

![Confusion Matrix GGU](assets/confusion_matrix_GGU.png)

#### Training Loss

**GGU**

![Loss GGU](assets/loss_plot_GGU.png)
### Training data

The model has been trained on the following datasets:
- [philipp-zettl/GGU-xx](https://huggingface.co/datasets/philipp-zettl/GGU-xx)

using the data loading implementation provided by `MultiHeadClassificationTrainer`.
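For reference, the raw dataset can be inspected directly with the `datasets` library (the `train` split name and the `sample`/`label` fields are assumptions based on the evaluation code below):

```python
from datasets import load_dataset

ds = load_dataset('philipp-zettl/GGU-xx')
print(ds)              # available splits and sizes
print(ds['train'][0])  # expected to expose 'sample' and 'label' fields
```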
### Training procedure

The following code has been executed to train the model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# MultiHeadClassificationTrainer is provided in this repository

def train_classifier():
    backbone = AutoModel.from_pretrained('BAAI/bge-m3').to(torch.float16)
    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    label_map = {
        0: 'Greeting',
        1: 'Gratitude',
        2: 'Other'
    }
    map_label = {
        label_map[i]: i
        for i in label_map.keys()
    }

    num_labels = len(label_map.keys())

    # HParams
    dropout = 0.25
    learning_rate = 3e-5
    momentum = 0.9
    l2_reg = 0.25

    num_epochs = 35
    l2_loss_weight = 0.25

    model_conf = {
        'backbone': backbone,
        'head_config': {
            'GGU': 3,
        },
        'dropout': dropout,
        'l2_reg': l2_reg,
    }

    optimizer_conf = {
        'lr': learning_rate,
        'momentum': momentum
    }

    scheduler_conf = {
        'factor': 0.2,
        'patience': 3,
        'min_lr': 1e-8
    }

    train_run = 1000
    trainer = MultiHeadClassificationTrainer(
        model_conf=model_conf,
        optimizer_conf={**optimizer_conf, 'lr': 1e-4},
        scheduler_conf=scheduler_conf,
        num_epochs=1,
        l2_loss_weight=l2_loss_weight,
        use_lr_scheduler=True,
        train_run=train_run
    )

    new_model, history = trainer.train(dataset_name='philipp-zettl/GGU-xx', target_heads=['GGU'])
    metrics = history['metrics']
    history['loss_plot'] = trainer._plot_history(**metrics)
    return new_model, history, trainer, label_map
```
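Invoked as-is, the function returns the trained model together with its training history:

```python
new_model, history, trainer, label_map = train_classifier()
print(history['metrics'])  # per-head training metrics collected by the trainer
```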
155 |
+
|
156 |
+
### Evaluation
|
157 |
+
### Evaluation data
|
158 |
+
For model evaluation, a 20% validation split was used from the training data.
|
159 |
+
|
160 |
+
### Evaluation procedure
|
161 |
+
The model was evaluated using the `eval` method provided by the `MultiHeadClassificationTrainer` class:
|
162 |
+
|
163 |
+
```python
import pandas as pd
import torch
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, classification_report,
    confusion_matrix, f1_score, recall_score,
)
from tqdm import tqdm
from transformers import BatchEncoding

def _eval_model(self, dataloader, label_map):
    self.classifier.train(False)
    eval_heads = list(label_map.keys())
    y_pred = {h: [] for h in eval_heads}
    y_test = {h: [] for h in eval_heads}
    for sample in tqdm(dataloader, total=len(dataloader), desc='Evaluating model...'):
        labels = {name: sample['label'] for name in eval_heads}
        embeddings = BatchEncoding({
            k: torch.stack(v, dim=1).to(self.device)
            for k, v in sample.items() if k not in ['label', 'sample']
        })
        output = self.classifier(embeddings.to('cuda'), head_names=eval_heads)
        for head in eval_heads:
            y_pred[head].extend(output[head].argmax(dim=1).cpu())
            y_test[head].extend(labels[head])
        torch.cuda.empty_cache()

    accuracies = {h: accuracy_score(y_test[h], y_pred[h]) for h in eval_heads}
    f1_scores = {h: f1_score(y_test[h], y_pred[h], average="macro") for h in eval_heads}
    recalls = {h: recall_score(y_test[h], y_pred[h], average='macro') for h in eval_heads}

    report = {}
    for head in eval_heads:
        cm = confusion_matrix(y_test[head], y_pred[head], labels=list(label_map[head].keys()))
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=list(label_map[head].values()))
        clf_report = classification_report(
            y_test[head], y_pred[head], output_dict=True, target_names=list(label_map[head].values())
        )
        del clf_report["accuracy"]
        clf_report = pd.DataFrame(clf_report).T.reset_index()
        report[head] = dict(
            clf_report=clf_report,
            confusion_matrix=disp,
            metrics={'accuracy': accuracies[head], 'f1': f1_scores[head], 'recall': recalls[head]},
        )
    return report
```
### Metrics

For evaluation, we used the following metrics: accuracy, precision, recall, f1-score. You can find a detailed classification report below:

**GGU:**

|    | index        |   precision |   recall |   f1-score |   support |
|---:|:-------------|------------:|---------:|-----------:|----------:|
|  0 | Greeting     |    0.725    | 0.935484 |   0.816901 |        31 |
|  1 | Gratitude    |    0.952381 | 0.740741 |   0.833333 |        27 |
|  2 | Other        |    0.954545 | 0.893617 |   0.923077 |        47 |
|  3 | macro avg    |    0.877309 | 0.856614 |   0.857771 |       105 |
|  4 | weighted avg |    0.886218 | 0.866667 |   0.868653 |       105 |
assets/confusion_matrix_GGU.png
ADDED
assets/loss_plot_GGU.png
ADDED
heads/GGU.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f81781b33246b9e3a70fe2b4954bb548d29884b5902be26965e94e6653312345
size 7552
multi-head-sequence-classification-model-model.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bf5bed442997472409a9e1a7115df63c9712e773362c31621f60be199f9935d3
size 1135694619
pretrained/backbone/config.json
ADDED
@@ -0,0 +1,28 @@
{
  "_name_or_path": "BAAI/bge-m3",
  "architectures": [
    "XLMRobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float16",
  "transformers_version": "4.41.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
pretrained/backbone/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:68440cc1b73b9af8ab85ecdc138b51877493ffbcec92a0a16e5d7e518eb22908
size 1135554344
pretrained/tokenizer/special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
pretrained/tokenizer/tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c119aa9bc83a5d76efbbc831b23e5790727c12fde474f6519dd96cde6550ffd7
size 17083052
pretrained/tokenizer/tokenizer_config.json
ADDED
@@ -0,0 +1,55 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 128,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "sp_model_kwargs": {},
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}
requirements.txt
ADDED
@@ -0,0 +1,7 @@
transformers
accelerate
datasets
torch
scikit-learn
pandas
matplotlib
train.py
ADDED
@@ -0,0 +1 @@
/home/phil/work/mb/easybits/model-zoo/model_zoo/train_classifier.py