Finetuning PaliGemma with AutoTrain

Community article · Published July 25, 2024

In this blog post, we will see how to finetune PaliGemma using AutoTrain for visual question answering (VQA) and captioning tasks.

AutoTrain is a no-code solution designed to make life easier for data scientists, machine learning engineers, and enthusiasts. It allows you to train (almost) any state-of-the-art model without writing a single line of code. To get started with AutoTrain, check out the docs and the GitHub repo.

Dataset

You can use a dataset from the Hugging Face Hub or a local dataset.

Hub Dataset

A Hub dataset should be in the following format:

[Screenshot: a VQA dataset in the Hub dataset viewer, showing image, question and multiple_choice_answer columns]

The columns of interest are:

  • image: the image (image_column)
  • question: the question (prompt_text_column)
  • multiple_choice_answer: the answer (text_column)

Note: we use all three columns above for VQA. For the captioning task, we use only the image and text_column (see the captioning config sketch later in this post).
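
To sanity-check that a Hub dataset has the columns you expect before training, you can inspect it with the datasets library. A minimal sketch, using the abhishek/vqa_small dataset referenced in the config below:

from datasets import load_dataset

# load the example VQA dataset used in the config later in this post
ds = load_dataset("abhishek/vqa_small", split="train")

# these are the columns you will point AutoTrain's column mapping at
print(ds.column_names)  # expect: image, question, multiple_choice_answer
print(ds[0]["question"], "->", ds[0]["multiple_choice_answer"])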

Local Dataset

If using a dataset locally, it should be formatted like this:

train/
├── 0001.jpg
├── 0002.jpg
├── 0003.jpg
├── ...
└── metadata.jsonl

where metadata.jsonl looks like the following:

{"file_name": "0001.jpg", "question": "What vehicles are shown?", "multiple_choice_answer": "motorcycles"}
{"file_name": "0002.jpg", "question": "Is the plane upside down?", "multiple_choice_answer": "no"}
{"file_name": "0003.jpg", "question": "What is the boy doing?", "multiple_choice_answer": "batting"}

The metadata.jsonl must have a file_name column; the other column names are up to you, since you can map them in the column mapping.

If you have validation data, add a validation folder in the same format as above.
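
If you are generating the annotations programmatically, writing metadata.jsonl is straightforward. A minimal sketch, assuming your records already live in a Python list (the records below are hypothetical examples):

import json

# hypothetical records; replace with your own image names, questions and answers
records = [
    {"file_name": "0001.jpg", "question": "What vehicles are shown?", "multiple_choice_answer": "motorcycles"},
    {"file_name": "0002.jpg", "question": "Is the plane upside down?", "multiple_choice_answer": "no"},
]

# one JSON object per line, saved next to the images in the train/ folder
with open("train/metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")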

NOTE: When using the AutoTrain UI, the folders need to be compressed as ZIP files. When train.zip is expanded, it should contain all the images and metadata.jsonl directly: no folders, no subfolders.
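
One way to build such a ZIP on Linux/macOS is to zip the contents of the folder rather than the folder itself, so no subfolder ends up inside the archive:

$ cd train
$ zip ../train.zip *.jpg metadata.jsonl
$ cd ..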

Training Locally

Locally, AutoTrain can be used in either UI mode or CLI mode.

To install AutoTrain, use pip:

$ pip install -U autotrain-advanced

Once done, you can start the UI using the command:

$ autotrain app
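
The UI is served locally; to see the available options (host, port, etc.), check the built-in help:

$ autotrain app --help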

Training using CLI/config

To train using a config file, create a config.yml that looks like the following:

task: vlm:vqa
base_model: google/paligemma-3b-pt-224
project_name: autotrain-paligemma-finetuned-vqa
log: tensorboard
backend: local

data:
  path: abhishek/vqa_small
  train_split: train
  valid_split: validation
  column_mapping:
    image_column: image
    text_column: multiple_choice_answer
    prompt_text_column: question

params:
  epochs: 3
  batch_size: 2
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 4
  mixed_precision: fp16
  peft: true
  quantization: int4

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true

The above config uses a dataset from the Hub. If you are using a local dataset instead, change the data section as follows:

data:
  path: local_dataset_folder_path # where training and validation (optional) folders are
  train_split: train # name of training folder
  valid_split: validation # name of validation folder or none
  column_mapping:
    image_column: image
    text_column: multiple_choice_answer
    prompt_text_column: question

Please double-check the column mappings!
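
If you want to finetune for captioning instead of VQA, the config is almost identical. A hedged sketch, assuming the captioning task id is vlm:captioning (check the AutoTrain docs for the exact name in your version) and a hypothetical local dataset whose text lives in a caption column:

task: vlm:captioning
base_model: google/paligemma-3b-pt-224
project_name: autotrain-paligemma-finetuned-captioning
log: tensorboard
backend: local

data:
  path: local_dataset_folder_path
  train_split: train
  valid_split: null # or the name of your validation folder
  column_mapping:
    image_column: image
    text_column: caption # no prompt_text_column for captioning

The params and hub sections stay the same as in the VQA config above.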

Once your config file is ready, run:

$ export HF_USERNAME=your_hugging_face_username
$ export HF_TOKEN=your_hugging_face_write_token
$ autotrain --config path_to_config.yml

And wait and watch the training progress :)

Training using UI

Here's a screenshot of the UI with a Hugging Face Hub dataset:

[Screenshot: AutoTrain UI configured with a Hub dataset]

And one with a local dataset:

[Screenshot: AutoTrain UI configured with a local dataset]

Again, take special care of column mappings ;)

Finally, your model can be pushed to the Hub (if you enabled push_to_hub) and will be available for use. In case of any issues, use the GitHub issue tracker on the AutoTrain repo.
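
Once the finetuned model is on the Hub, you can try it out with transformers. A minimal inference sketch, assuming a hypothetical repo id and that the weights were pushed as a full model (if only a PEFT adapter was pushed, load the base model and attach the adapter with the peft library instead):

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image

repo_id = "your-username/autotrain-paligemma-finetuned-vqa"  # hypothetical repo id
model = PaliGemmaForConditionalGeneration.from_pretrained(repo_id)
processor = AutoProcessor.from_pretrained(repo_id)

# ask a question about a local image, mirroring the VQA training format
image = Image.open("0001.jpg")
inputs = processor(text="What vehicles are shown?", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)

# strip the prompt tokens and decode only the generated answer
answer = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)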

Happy AutoTraining! 🤗