
language: ca
datasets:
  - projecte-aina/3catparla_asr
tags:
  - audio
  - automatic-speech-recognition
  - catalan
  - whisper-large-v3
  - projecte-aina
  - barcelona-supercomputing-center
  - bsc
license: apache-2.0
model-index:
  - name: whisper-large-v3-ca-3catparla
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: 3CatParla (Test)
          type: projecte-aina/3catparla_asr
          split: test
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 0.96
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: 3CatParla (Dev)
          type: projecte-aina/3catparla_asr
          split: dev
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 0.92
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 17.0 (Test)
          type: mozilla-foundation/common_voice_17_0
          split: test
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 10.32
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 17.0 (Dev)
          type: mozilla-foundation/common_voice_17_0
          split: validation
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 9.26
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Balearic fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Balearic female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 12.25
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Balearic male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Balearic male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 12.18
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Central fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Central female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.51
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Central male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Central male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.73
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northern fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northern female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.09
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northern male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northern male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.28
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northwestern fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northwestern female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 7.88
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Northwestern male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Northwestern male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 8.44
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Valencian fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Valencian female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 9.58
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Valencian male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Valencian male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 9.1
library_name: transformers

whisper-large-v3-ca-3catparla


Model Description
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Summary

"whisper-large-v3-ca-3catparla" is an acoustic model for Automatic Speech Recognition (ASR) in Catalan, fine-tuned from "openai/whisper-large-v3".

Model Description

"whisper-large-v3-ca-3catparla" is an acoustic model for Automatic Speech Recognition in Catalan. It was obtained by fine-tuning "openai/whisper-large-v3" on 710 hours of Catalan data released by Projecte AINA (Barcelona, Spain).

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan. It is intended to transcribe Catalan audio files into plain text without punctuation.
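Since the model produces text without punctuation, punctuated references (subtitles, scripts) have to be stripped in a comparable way before they can be compared with its output. The following is a minimal sketch built around a hypothetical helper, to_plain_text, which is not part of transformers or of this model card. It assumes that apostrophes, word-internal hyphens and the Catalan middle dot (l·l) should be kept, and that case should be ignored. It is a simplification and is not the Whisper normalizer used in the evaluation example.

```python
import re
import unicodedata

# ASSUMPTION: characters kept even though Unicode classifies them as
# punctuation: the apostrophe, the hyphen and the Catalan middle dot (l·l).
KEEP = {"'", "-", "\u00b7"}


def to_plain_text(text: str) -> str:
    """Hypothetical helper: lowercase text and strip punctuation so it can be
    compared with the model's unpunctuated output."""
    # Compose accents (NFC), map the typographic apostrophe to ASCII, lowercase.
    text = unicodedata.normalize("NFC", text).replace("\u2019", "'").lower()
    # Replace every other punctuation character (Unicode category P*) by a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") and ch not in KEEP else ch
        for ch in text
    )
    # Drop apostrophes and hyphens that are not between two word characters.
    text = re.sub(r"(?<!\w)['-]|['-](?!\w)", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


if __name__ == "__main__":
    print(to_plain_text("L’home va dir: «Col·laborem!»"))  # prints l'home va dir col·laborem
```

Keeping elisions such as "l'home" intact avoids merging two words into one, which would otherwise be counted as extra errors when computing WER.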

How to Get Started with the Model

Installation

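To use this model, install datasets (with its audio extra, needed to decode and resample audio) and transformers, together with torch, evaluate and jiwer, which the inference example below also needs (the "wer" metric of evaluate depends on jiwer):

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

pip install "datasets[audio]" transformers torch evaluate jiwer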
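Before running the inference example, you can check that the virtual environment can import every module it uses. This is a minimal sketch that relies only on the standard library; the module list is an assumption based on the imports of the inference example, plus jiwer, which the "wer" metric of evaluate relies on.

```python
import importlib.util
from typing import Iterable, List

# ASSUMPTION: modules imported by the inference example, plus "jiwer",
# which the "wer" metric of the evaluate library depends on.
REQUIRED_MODULES = ("torch", "datasets", "transformers", "evaluate", "jiwer")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

find_spec only locates a module without importing it, so the check stays fast even for heavy packages such as torch.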

For Inference

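The following example transcribes the test split of 3CatParla with this model and computes the word error rate (WER) of the transcriptions. It assumes that a CUDA-capable GPU is available: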

import torch
from datasets import load_dataset, Audio
from evaluate import load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the processor and the model (on a CUDA GPU).
MODEL_NAME = "projecte-aina/whisper-large-v3-ca-3catparla"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

# Load the test split of 3CatParla.
ds = load_dataset("projecte-aina/3catparla_asr", split="test")

# Resample the audio to 16 kHz, the sampling rate Whisper expects.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Transcribe one example and normalize both the reference and the prediction.
def map_to_pred(batch):
    audio = batch["audio"]
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    batch["reference"] = processor.tokenizer.normalize(batch["normalized_text"])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    # Drop special tokens such as <|startoftranscript|> before normalizing.
    transcription = processor.decode(predicted_ids, skip_special_tokens=True)
    batch["prediction"] = processor.tokenizer.normalize(transcription)

    return batch

# Run inference over the whole split.
result = ds.map(map_to_pred)

# Compute the overall WER, as a percentage.
wer = load("wer")
WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
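The cast_column call in the example asks datasets to resample every clip to 16 kHz, and the Whisper processor then pads or truncates each input to a 30-second window (480,000 samples at 16 kHz) with its default settings. To make both steps concrete, here is a simplified sketch with two hypothetical helpers, resample_linear and exceeds_whisper_window. It uses plain linear interpolation without an anti-aliasing filter, so it is illustrative only; datasets relies on a dedicated audio library for proper resampling.

```python
import numpy as np

WHISPER_SAMPLING_RATE = 16_000
# Whisper's feature extractor works on 30-second windows: by default, longer
# input is truncated and shorter input is zero-padded to this many samples.
WHISPER_WINDOW_SAMPLES = 30 * WHISPER_SAMPLING_RATE


def resample_linear(audio: np.ndarray, orig_sr: int,
                    target_sr: int = WHISPER_SAMPLING_RATE) -> np.ndarray:
    """Hypothetical helper: resample by linear interpolation (no anti-aliasing)."""
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    n_out = int(round(len(audio) * target_sr / orig_sr))
    t_in = np.arange(len(audio)) / orig_sr    # input sample times, in seconds
    t_out = np.arange(n_out) / target_sr      # output sample times, in seconds
    return np.interp(t_out, t_in, audio).astype(np.float32)


def exceeds_whisper_window(audio_16k: np.ndarray) -> bool:
    """Hypothetical helper: True if a 16 kHz clip is longer than 30 seconds."""
    return len(audio_16k) > WHISPER_WINDOW_SAMPLES


if __name__ == "__main__":
    sr = 48_000
    t = np.arange(sr) / sr                     # one second of audio
    clip = np.sin(2 * np.pi * 440 * t)         # 440 Hz test tone
    clip_16k = resample_linear(clip, sr)
    print(len(clip_16k))                       # prints 16000
    print(exceeds_whisper_window(clip_16k))    # prints False
```

A clip for which exceeds_whisper_window returns True would lose everything after its first 30 seconds in the evaluation loop, so long recordings would need to be segmented before transcription.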

Test result (WER on the 3CatParla test split, in %): 0.96
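The script prints 100 × WER, so 0.96 corresponds to roughly one word error per 104 reference words. WER is the number of word substitutions, deletions and insertions needed to turn the reference into the prediction, divided by the number of reference words. The evaluate library delegates this computation to jiwer; the sketch below is an independent reimplementation for illustration, not its code. As far as we understand, evaluate's "wer" metric aggregates at the corpus level in the same way (total errors over total reference words).

```python
from typing import List


def word_edit_distance(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum substitutions + deletions + insertions turning reference into hypothesis."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references: List[str], predictions: List[str]) -> float:
    """Corpus-level WER in percent: total word errors / total reference words * 100."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        errors += word_edit_distance(ref_words, hyp.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return 100 * errors / total_words


if __name__ == "__main__":
    refs = ["el gat dorm", "bon dia"]
    preds = ["el gat dormia", "bon dia a tothom"]
    print(corpus_wer(refs, preds))  # 1 substitution + 2 insertions over 5 words: prints 60.0
```

Because errors are summed before dividing, long utterances weigh more than short ones, which is why a corpus-level WER can differ from the average of per-utterance WERs.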

Training Details

Training data

The model was trained on the "3CatParla" dataset (projecte-aina/3catparla_asr).

Training procedure

This model is the result of fine-tuning "openai/whisper-large-v3" following a fine-tuning tutorial provided by Hugging Face.

Training Hyperparameters

language: catalan
hours of training audio: 710
learning rate: 1.95e-07
sample rate: 16000
train batch size: 32 (x4 GPUs)
gradient accumulation steps: 1
eval batch size: 32
save total limit: 3
max steps: 19842
warmup steps: 1984
eval steps: 3307
save steps: 3307
shuffle buffer size: 480
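The hyperparameters above imply a few derived quantities that are easy to miss. The sketch below computes them. It assumes that "32 (x4 GPUs)" means 32 examples per GPU on each of 4 GPUs, and that the learning rate followed linear warmup and then linear decay to zero (the default "linear" scheduler of the Hugging Face Trainer); the card does not state which scheduler was used. Hyperparams and linear_schedule_lr are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparams:
    # Values copied from the "Training Hyperparameters" list.
    learning_rate: float = 1.95e-07
    train_batch_size: int = 32          # ASSUMPTION: per GPU
    num_gpus: int = 4
    gradient_accumulation_steps: int = 1
    max_steps: int = 19842
    warmup_steps: int = 1984
    eval_steps: int = 3307
    save_steps: int = 3307

    @property
    def effective_batch_size(self) -> int:
        # Examples consumed per optimizer step across all GPUs.
        return self.train_batch_size * self.num_gpus * self.gradient_accumulation_steps

    @property
    def warmup_fraction(self) -> float:
        return self.warmup_steps / self.max_steps

    @property
    def num_evaluations(self) -> int:
        return self.max_steps // self.eval_steps


def linear_schedule_lr(step: int, hp: Hyperparams) -> float:
    """Learning rate at `step`: linear warmup from 0, then linear decay to 0.

    ASSUMPTION: mirrors the Trainer's default "linear" scheduler; not confirmed
    by the model card.
    """
    if step < hp.warmup_steps:
        return hp.learning_rate * step / max(1, hp.warmup_steps)
    remaining = max(0, hp.max_steps - step)
    return hp.learning_rate * remaining / max(1, hp.max_steps - hp.warmup_steps)


if __name__ == "__main__":
    hp = Hyperparams()
    print("effective batch size:", hp.effective_batch_size)
    print(f"warmup fraction: {hp.warmup_fraction:.3f}")
    print("evaluations / checkpoints:", hp.num_evaluations)
    print("training samples processed:", hp.max_steps * hp.effective_batch_size)
    for step in (0, hp.warmup_steps, hp.max_steps // 2, hp.max_steps):
        print(step, linear_schedule_lr(step, hp))
```

Under these assumptions the effective batch size is 128 examples per optimizer step, warmup covers about 10% of the 19,842 steps, and evaluation and checkpointing happen exactly 6 times, since 6 × 3307 = 19842.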

Citation

If this model contributes to your research, please cite the work:

@misc{mena2024whisperlarge3catparla,
      title={Acoustic Model in Catalan: whisper-large-v3-ca-3catparla.}, 
      author={Hernandez Mena, Carlos Daniel and Armentano-Oller, Carme and Solito, Sarah and Külebi, Baybars},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/whisper-large-v3-ca-3catparla},
      year={2024}
}

Additional Information

Author

The fine-tuning was performed in July 2024 at the Language Technologies Unit of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.

Contact

For further information, please send an email to [email protected].

Copyright

License

Apache-2.0

Funding

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

The training of the model was possible thanks to the compute time provided by Barcelona Supercomputing Center through MareNostrum 5.
