license: mit
language: multilingual
library_name: torch
tags: []
base_model: BAAI/bge-m3
datasets:
  - philipp-zettl/GGU-xx
  - philipp-zettl/sentiment
metrics:
  - accuracy
  - precision
  - recall
  - f1-score
model_name: Multi-Head Sequence Classification Model
pipeline_tag: text-classification
widget:
  - text: Hello, how are you?
    label: '[GGU] Greeting'
  - text: Thank you for your help
    label: '[GGU] Gratitude'
  - text: Hallo, wie geht es dir?
    label: '[GGU] Greeting (de)'
  - text: Danke dir.
    label: '[GGU] Gratitude (de)'
  - text: I am not sure what you mean
    label: '[GGU] Other'
  - text: Generate me an image of a dog!
    label: '[GGU] Other'
  - text: What is the weather like today?
    label: '[GGU] Other'
  - text: Wie ist das Wetter heute?
    label: '[GGU] Other (de)'

Multi-Head Sequence Classification Model

Model description

This is a sequence classification model built on the hidden-state output of a pre-trained transformer backbone. Several classification heads, one per task, sit on top of the backbone output and classify the input sequence.

Model architecture

The backbone of the model is BAAI/bge-m3, which produces 1024-dimensional output embeddings.

One classification head per task is attached to the backbone output: GGU with 3 classes and sentiment with 3 classes.
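
The layout described above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's MultiHeadSequenceClassificationModel: the class name `MultiHeadSketch`, the stand-in `DummyBackbone`, and the choice of the first token's hidden state as the sequence embedding are all assumptions made for the example.

```python
import torch
from torch import nn


class DummyBackbone(nn.Module):
    """Hypothetical stand-in for BAAI/bge-m3: maps token ids to 1024-dim hidden states."""

    def __init__(self, vocab_size=100, hidden_size=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        return self.embed(input_ids)  # (batch, seq_len, hidden_size)


class MultiHeadSketch(nn.Module):
    """One linear classification head per task on top of a shared backbone."""

    def __init__(self, backbone, head_config, hidden_size=1024, dropout=0.25):
        super().__init__()
        self.backbone = backbone
        self.dropout = nn.Dropout(dropout)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, n) for name, n in head_config.items()}
        )

    def forward(self, input_ids, head_names=None):
        hidden = self.backbone(input_ids)
        pooled = self.dropout(hidden[:, 0])  # assumption: pool via the first token
        names = head_names or list(self.heads.keys())
        return {name: self.heads[name](pooled) for name in names}


model = MultiHeadSketch(DummyBackbone(), {"GGU": 3, "sentiment": 3}).eval()
with torch.no_grad():
    logits = model(torch.randint(0, 100, (2, 8)))  # batch of 2, 8 tokens each
```

Each entry of `logits` holds one row of three scores per input, so heads can be added or evaluated independently while sharing the same backbone pass.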

The class indices of each head map to the following labels:

GGU

  • 0: Greeting
  • 1: Gratitude
  • 2: Other

sentiment

  • 0: Positive
  • 1: Negative
  • 2: Neutral

The joint architecture was trained with the MultiHeadClassificationTrainer implementation provided in this repository.

Use cases

The model is intended for text classification (greeting, gratitude, or other) and sentiment analysis.

Model Inference

Inference code:

import torch
from transformers import AutoTokenizer

# model.py ships with this repository; adjust the import to wherever the file lives.
from model import MultiHeadSequenceClassificationModel

model = MultiHeadSequenceClassificationModel.from_pretrained('philipp-zettl/multi-head-sequence-classification-model')
model.eval()  # disable dropout for inference
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')

def predict(text):
    # Tokenize a single text as a batch of one.
    inputs = tokenizer([text], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs
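
The predict function returns the raw model output. A minimal sketch of turning such output into labels follows, assuming the output can be read as a mapping from head name to three logits per input (the output format is not documented here). `softmax` and `decode` are hypothetical helpers, and the logits are made up for illustration; the label maps are the ones listed under Model architecture.

```python
import math

LABEL_MAPS = {
    "GGU": {0: "Greeting", 1: "Gratitude", 2: "Other"},
    "sentiment": {0: "Positive", 1: "Negative", 2: "Neutral"},
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(head_logits):
    """Map {head: logits} to {head: (label, probability)} via argmax."""
    result = {}
    for head, logits in head_logits.items():
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        result[head] = (LABEL_MAPS[head][best], probs[best])
    return result


# Made-up logits for a single input, for illustration only.
example = decode({"GGU": [2.0, 0.1, -1.0], "sentiment": [0.0, 0.0, 3.0]})
```

If the model returns tensors, each head's row would first need converting to a plain list (for instance with `.tolist()`) before being passed to `decode`.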

Model Training

Confusion Matrix

[Figure: confusion matrix for the GGU head]

[Figure: confusion matrix for the sentiment head]

Training Loss

[Figure: training loss for the sentiment head]

Training data

The model has been trained on the following datasets:

  • philipp-zettl/GGU-xx
  • philipp-zettl/sentiment

Training used the implementation provided by MultiHeadClassificationTrainer.

Training procedure

The following code has been executed to train the model:

from copy import deepcopy

import torch
from transformers import AutoModel, AutoTokenizer

# MultiHeadClassificationTrainer is provided in this repository.


def train_classifier():
    backbone = AutoModel.from_pretrained('BAAI/bge-m3').to(torch.float16)
    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    ggu_label_map = {
        0: 'Greeting',
        1: 'Gratitude',
        2: 'Other'
    }
    sentiment_label_map = {
        0: 'Positive',
        1: 'Negative',
        2: 'Neutral'
    }

    num_labels = len(ggu_label_map.keys())

    # HParams
    dropout = 0.25
    learning_rate = 3e-5
    momentum = 0.9
    l2_reg = 0.25

    l2_loss_weight = 0.25

    model_conf = {
        'backbone': backbone,
        'head_config': {
            'GGU': num_labels,
        },
        'dropout': dropout,
        'l2_reg': l2_reg,
    }

    optimizer_conf = {
        'lr': learning_rate,
        'momentum': momentum
    }

    scheduler_conf = {
        'factor': 0.2,
        'patience': 3,
        'min_lr': 1e-8
    }

    train_run = 1000
    trainer = MultiHeadClassificationTrainer(
        optimizer_conf={**optimizer_conf, 'lr': 1e-4},
        # The remaining constructor arguments are not listed here; model_conf,
        # scheduler_conf, l2_loss_weight and train_run defined above are
        # presumably passed in as well.
    )

    # Stage 1: train and evaluate the GGU head.
    new_model, history = trainer.train(dataset_name='philipp-zettl/GGU-xx', target_heads=['GGU'])
    metrics = history['metrics']
    history['loss_plot'] = trainer._plot_history(**metrics)
    res = trainer.eval({'GGU': ggu_label_map})
    history['evaluation'] = res['GGU']

    total_history = {
        'GGU': deepcopy(history),
    }

    # Stage 2: add a sentiment head to the same classifier, then train and evaluate it.
    trainer.classifier.add_head('sentiment', 3)
    trainer.auto_find_batch_size = False
    new_model, history = trainer.train(dataset_name='philipp-zettl/sentiment', target_heads=['sentiment'], sample_key='text', num_epochs=10, lr=1e-4)
    metrics = history['metrics']
    history['loss_plot'] = trainer._plot_history(**metrics)
    res = trainer.eval({'sentiment': sentiment_label_map}, sample_key='text')
    history['evaluation'] = res['sentiment']

    total_history['sentiment'] = deepcopy(history)

    label_maps = {
        'GGU': ggu_label_map,
        'sentiment': sentiment_label_map,
    }

    return new_model, total_history, trainer, label_maps
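
The keys in optimizer_conf (lr, momentum) and scheduler_conf (factor, patience, min_lr) match the arguments of `torch.optim.SGD` and `torch.optim.lr_scheduler.ReduceLROnPlateau`. Which optimizer and scheduler the trainer actually builds is not stated here, so the following is only a sketch, under that assumption, of how these values would behave together.

```python
import torch

# Same values as optimizer_conf and scheduler_conf, with lr overridden to 1e-4.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.2, patience=3, min_lr=1e-8
)

# Feed a validation loss that never improves. The first epoch sets the baseline;
# once more than `patience` stagnant epochs follow, the rate is multiplied by `factor`.
for _ in range(5):
    optimizer.step()
    scheduler.step(1.0)

current_lr = optimizer.param_groups[0]["lr"]
```

With factor 0.2, every plateau cuts the learning rate to a fifth of its previous value, and min_lr stops the decay at 1e-8.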


Evaluation data

For model evaluation, a 20% validation split was used from the training data.
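
How the split was drawn is not stated. One common way to hold out 20% reproducibly is to shuffle the sample indices with a fixed seed, sketched below. The helper name `train_val_split`, the seed, and the toy data are hypothetical.

```python
import random


def train_val_split(samples, val_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out `val_fraction` of the samples."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(samples) * val_fraction)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val


train, val = train_val_split([f"sample {i}" for i in range(100)])
```

Fixing the seed keeps the validation set identical across runs, which makes metrics from different training runs comparable.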

Evaluation procedure

The model was evaluated using the eval method provided by the MultiHeadClassificationTrainer class:

import pandas as pd
import torch
from sklearn.metrics import (
    ConfusionMatrixDisplay, accuracy_score, classification_report,
    confusion_matrix, f1_score, recall_score,
)
from tqdm import tqdm
from transformers import BatchEncoding


def _eval_model(self, dataloader, label_map, sample_key, label_key):
    eval_heads = list(label_map.keys())
    y_pred = {h: [] for h in eval_heads}
    y_test = {h: [] for h in eval_heads}
    for sample in tqdm(dataloader, total=len(dataloader), desc='Evaluating model...'):
        labels = {name: sample[label_key] for name in eval_heads}
        # Stack the tokenized fields into batch-first tensors on the target device.
        embeddings = BatchEncoding({k: torch.stack(v, dim=1).to(self.device) for k, v in sample.items() if k not in [label_key, sample_key]})
        output = self.classifier(embeddings.to(self.device), head_names=eval_heads)
        for head in eval_heads:
            # Assumption: output[head] holds logits of shape (batch, num_classes)
            # and labels[head] is a tensor of class indices.
            y_pred[head].extend(output[head].argmax(dim=1).cpu().tolist())
            y_test[head].extend(labels[head].cpu().tolist())

    accuracies = {h: accuracy_score(y_test[h], y_pred[h]) for h in eval_heads}
    f1_scores = {h: f1_score(y_test[h], y_pred[h], average="macro") for h in eval_heads}
    recalls = {h: recall_score(y_test[h], y_pred[h], average='macro') for h in eval_heads}
    report = {}
    for head in eval_heads:
        cm = confusion_matrix(y_test[head], y_pred[head], labels=list(label_map[head].keys()))
        disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=list(label_map[head].values()))
        clf_report = classification_report(
            y_test[head], y_pred[head], output_dict=True, target_names=list(label_map[head].values())
        )
        # Accuracy is a single scalar, so drop it before building the per-class table.
        del clf_report["accuracy"]
        clf_report = pd.DataFrame(clf_report).T.reset_index()
        report[head] = dict(
            clf_report=clf_report, confusion_matrix=disp, metrics={'accuracy': accuracies[head], 'f1': f1_scores[head], 'recall': recalls[head]}
        )
    return report


For evaluation, we used the following metrics: accuracy, precision, recall, f1-score. You can find a detailed classification report here:


GGU

label         precision  recall    f1-score  support
Greeting      0.904762   0.974359  0.938272  39
Gratitude     0.958333   0.851852  0.901961  27
Other         1          1         1         39
macro avg     0.954365   0.94207   0.946744  105
weighted avg  0.953912   0.952381  0.951862  105

sentiment

label         precision  recall    f1-score  support
Positive      0.783088   0.861878  0.820596  12851
Negative      0.802105   0.819524  0.810721  14229
Neutral       0.7874     0.6913    0.736227  13126
macro avg     0.790864   0.790901  0.789181  40206
weighted avg  0.791226   0.7912    0.789557  40206
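
As a worked check on how the summary rows relate to the per-class rows, the snippet below recomputes the GGU averages from the reported precision, recall and support values. It uses only the numbers in the report above; the variable names are illustrative.

```python
# Per-class rows of the GGU report: (precision, recall, support).
rows = {
    "Greeting": (0.904762, 0.974359, 39),
    "Gratitude": (0.958333, 0.851852, 27),
    "Other": (1.0, 1.0, 39),
}

total = sum(support for _, _, support in rows.values())


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Macro average: unweighted mean over classes.
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)
macro_f1 = sum(f1(p, r) for p, r, _ in rows.values()) / len(rows)
# Weighted average: mean over classes weighted by support.
weighted_precision = sum(p * s for p, _, s in rows.values()) / total
# Support-weighted recall equals overall accuracy in single-label classification.
accuracy = sum(r * s for _, r, s in rows.values()) / total
```

The macro average treats the three classes equally, while the weighted average follows the class distribution. For GGU the two are close because the class supports (39, 27, 39) are fairly balanced.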