Edit model card

TavBERT base model

An Arabic BERT-style masked language model operating over characters, pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020).

How to use

import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("tau/tavbert-ar")
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-ar")

def mask_sentence(sent, span_len=5):
    start_pos = np.random.randint(0, len(sent) - span_len)
    masked_sent = sent[:start_pos] + '[MASK]' * span_len + sent[start_pos + span_len:]
    print("Masked sentence:", masked_sent)
    output = model(**tokenizer.encode_plus(masked_sent, 
                                           return_tensors='pt'))['logits'][0][1:-1]
    preds = [int(x) for x in torch.argmax(torch.softmax(output, axis=1), axis=1)[start_pos:start_pos + span_len]]
    pred_sent = sent[:start_pos] + ''.join(tokenizer.convert_ids_to_tokens(preds)) + sent[start_pos + span_len:]
    print("Model's prediction:", pred_sent)

Training data

OSCAR (Ortiz, 2019) Arabic section (32 GB text, 67 million sentences).

Downloads last month
25
Inference Examples
Inference API (serverless) is not available, repository is disabled.

Dataset used to train tau/tavbert-ar