AdrienB134/ColBERTv1.0-BERTugues-base-portuguese-mmarcoPT

metadata

license: mit
datasets:
  - unicamp-dl/mmarco
language:
  - pt
tags:
  - colbert
  - ColBERT

Disclaimer: This model is based on a model trained for Brazilian Portuguese. In addition, mMARCO was translated from MS MARCO using Google Translate, which also tends to be biased towards Brazilian Portuguese. The model may therefore perform worse on European Portuguese.

Training

Details

The model is initialized from the ricardoz/BERTugues-base-portuguese-cased model and fine-tuned on 10M triples via a pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single Tesla A100 GPU with 40 GB of memory for 200k steps, with 10% of the steps used for warmup, using a batch size of 96 and the AdamW optimizer with a constant learning rate of 3e-06. Total training time was around 12 hours.
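
To make the training objective concrete, here is a minimal sketch of a pairwise softmax cross-entropy in plain Python. It is an illustration only, not the training code used for this model: the function name `pairwise_softmax_ce` and the example scores are hypothetical, and a real implementation would work on batched tensors.

```python
import math


def pairwise_softmax_ce(pos_score: float, neg_score: float) -> float:
    """Cross-entropy of a 2-way softmax over (positive, negative) scores.

    The positive passage is treated as the correct class, so the loss is
    -log(exp(pos) / (exp(pos) + exp(neg))).
    """
    # Subtract the larger score before exponentiating for numerical stability.
    m = max(pos_score, neg_score)
    log_sum_exp = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_sum_exp - pos_score


# Hypothetical scores for one (query, positive, negative) triple.
loss_good = pairwise_softmax_ce(pos_score=12.0, neg_score=8.0)  # positive scored higher
loss_bad = pairwise_softmax_ce(pos_score=8.0, neg_score=12.0)   # negative scored higher
print(f"{loss_good:.4f} {loss_bad:.4f}")
```

The loss approaches zero when the positive passage outscores the negative one by a wide margin, and grows roughly linearly with the score gap when the order is reversed.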

Data

The model is fine-tuned on the Portuguese version of the mMARCO dataset, a multilingual machine-translated version of the MS MARCO dataset. The triples are sampled from the ~39.8M triples of triples.train.small.tsv.
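
How the triples were drawn is not stated here. One plausible approach, sketched below as an assumption, is uniform reservoir sampling, which picks a fixed-size sample in a single pass without loading the whole file. The function name `sample_triples`, the seed, and the assumed layout of three tab-separated fields per line (query, positive, negative, whether as text or as ids) are all hypothetical.

```python
import random
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (query, positive passage, negative passage)


def sample_triples(lines: Iterable[str], k: int, seed: int = 0) -> List[Triple]:
    """Uniformly sample up to k triples from tab-separated lines in one pass."""
    rng = random.Random(seed)
    reservoir: List[Triple] = []
    seen = 0  # number of well-formed triples read so far
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip malformed rows
        seen += 1
        triple = (fields[0], fields[1], fields[2])
        if len(reservoir) < k:
            reservoir.append(triple)
        else:
            # Keep the new triple with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = triple
    return reservoir


# Toy stand-in for the real file. With the real data one would pass an open
# file handle for triples.train.small.tsv and k=10_000_000.
toy_lines = [f"q{i}\tpos{i}\tneg{i}\n" for i in range(100)]
sample = sample_triples(toy_lines, k=10)
```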

Evaluation

The model is evaluated on the smaller development set of mMARCO-pt, which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

model	Vocab.	#Param.	Size	MRR@10	R@50	R@1000
ColBERTv1.0-BERTugues-base-portuguese-mmarcoPT	portuguese	110M	440MB	26.90	65.26	70.21
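
For reference, the two metrics in the table above can be computed as in the sketch below. This is a minimal illustration, not the evaluation script used for this model: the function names, the dictionary layout, and the toy ids are hypothetical. The functions return fractions in [0, 1], whereas the table appears to report values multiplied by 100.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant passage in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> float:
    """Mean over queries of the fraction of relevant passages found in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        if relevant:
            total += len(relevant & set(ranked[:k])) / len(relevant)
    return total / len(rankings)


# Toy example: ranked passage ids per query, and the relevant ids per query.
rankings = {"q1": ["p3", "p1", "p7"], "q2": ["p5", "p2", "p9"]}
qrels = {"q1": {"p1"}, "q2": {"p9", "p4"}}

mrr10 = mrr_at_k(rankings, qrels, k=10)
r2 = recall_at_k(rankings, qrels, k=2)
r3 = recall_at_k(rankings, qrels, k=3)
```

In this sketch, queries with no relevant passage in the top k contribute zero to both averages.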