DCU-NLP committed on
Commit 4e635b9
1 Parent(s): 600973b
Files changed (3)
  1. README.md +30 -50
  2. config.json +5 -0
  3. tf_model.h5 +3 -0
README.md CHANGED
@@ -1,67 +1,47 @@
  ---
- language:
- - ga
- license: apache-2.0
  tags:
- - irish
- - bert
- widget:
- - text: "Ceoltóir [MASK] ab ea Johnny Cash."
+ - generated_from_keras_callback
+ model-index:
+ - name: bert-base-irish-cased-v1
+   results: []
  ---

- # gaBERT
+ <!-- This model card has been generated automatically according to the information Keras had access to. You should
+ probably proofread and complete it, then remove this comment. -->

- [gaBERT](https://arxiv.org/abs/2107.12930) is a BERT-base model trained on 7.9M Irish sentences. For more details, including the hyperparameters and pretraining corpora used please refer to our paper.
+ # bert-base-irish-cased-v1

- ### How to use gaBERT with HuggingFace
+ This model was trained from scratch on an unknown dataset.
+ It achieves the following results on the evaluation set:

- ```
- from transformers import AutoModelWithLMHead, AutoTokenizer
- import torch

- tokenizer = AutoTokenizer.from_pretrained("DCU-NLP/bert-base-irish-cased-v1")
- model = AutoModelWithLMHead.from_pretrained("DCU-NLP/bert-base-irish-cased-v1")
+ ## Model description

- sequence = f"Ceoltóir {tokenizer.mask_token} ab ea Johnny Cash."
+ More information needed

- input = tokenizer.encode(sequence, return_tensors="pt")
- mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
+ ## Intended uses & limitations

- token_logits = model(input)[0]
- mask_token_logits = token_logits[0, mask_token_index, :]
+ More information needed

- top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+ ## Training and evaluation data

- for token in top_5_tokens:
-     print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))
- ```
+ More information needed

- ### Limitations and bias
- Some data used to pretrain gaBERT was scraped from the web which potentially contains ethically problematic text (bias, hate, adult content, etc.). Consequently, downstream tasks/applications using gaBERT should be thoroughly tested with respect to ethical considerations.
+ ## Training procedure

- ### BibTeX entry and citation info
+ ### Training hyperparameters

- If you use this model in your research, please consider citing our paper:
+ The following hyperparameters were used during training:
+ - optimizer: None
+ - training_precision: float32

- ```
- @article{DBLP:journals/corr/abs-2107-12930,
-   author = {James Barry and
-             Joachim Wagner and
-             Lauren Cassidy and
-             Alan Cowap and
-             Teresa Lynn and
-             Abigail Walsh and
-             M{\'{\i}}che{\'{a}}l J. {\'{O}} Meachair and
-             Jennifer Foster},
-   title = {gaBERT - an Irish Language Model},
-   journal = {CoRR},
-   volume = {abs/2107.12930},
-   year = {2021},
-   url = {https://arxiv.org/abs/2107.12930},
-   archivePrefix = {arXiv},
-   eprint = {2107.12930},
-   timestamp = {Fri, 30 Jul 2021 13:03:06 +0200},
-   biburl = {https://dblp.org/rec/journals/corr/abs-2107-12930.bib},
-   bibsource = {dblp computer science bibliography, https://dblp.org}
- }
- ```
+ ### Training results
+
+
+
+ ### Framework versions
+
+ - Transformers 4.20.1
+ - TensorFlow 2.9.1
+ - Datasets 2.3.2
+ - Tokenizers 0.12.1
config.json CHANGED
@@ -1,8 +1,10 @@
  {
+   "_name_or_path": "DCU-NLP/bert-base-irish-cased-v1",
    "architectures": [
      "BertForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
@@ -14,6 +16,9 @@
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.20.1",
    "type_vocab_size": 2,
+   "use_cache": true,
    "vocab_size": 30101
  }
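
The config.json hunks only add keys; none of the existing values visible in the diff change. A minimal Python sketch of that comparison follows. It uses only the fields shown in the two hunks (the lines elided between the hunks are deliberately left out), and `added_keys` and `changed_values` are hypothetical helper names for illustration, not part of any library.

```python
import json

# Fields visible on the old side of the two hunks. Lines elided between
# the hunks are not reproduced here.
OLD_VISIBLE = json.loads("""
{
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30101
}
""")

# Fields visible on the new side of the same hunks.
NEW_VISIBLE = json.loads("""
{
  "_name_or_path": "DCU-NLP/bert-base-irish-cased-v1",
  "architectures": ["BertForMaskedLM"],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30101
}
""")


def added_keys(old, new):
    """Return the keys present in `new` but absent from `old`, sorted."""
    return sorted(set(new) - set(old))


def changed_values(old, new):
    """Return {key: (old_value, new_value)} for shared keys whose values differ."""
    return {k: (old[k], new[k]) for k in old if k in new and old[k] != new[k]}


if __name__ == "__main__":
    print("added:", added_keys(OLD_VISIBLE, NEW_VISIBLE))
    print("changed:", changed_values(OLD_VISIBLE, NEW_VISIBLE))
```

The added keys look like metadata and defaults written out when the configuration was re-saved under Transformers 4.20.1, so the architecture fields shown in the diff are unaffected.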
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5147ad2ba231a4788e05f18e23c3c01f0b8f1d01d600a962ae690afd44d4de8
+ size 531099308
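
The three lines added for tf_model.h5 are a Git LFS pointer rather than the weights themselves: they record the pointer spec version, the SHA-256 object ID, and the size of the real file in bytes. The following is a minimal sketch of reading those fields in Python; `parse_lfs_pointer` is a hypothetical helper written for this illustration, not a function from Git LFS or any library.

```python
# The pointer content as it appears in the diff, without the "+" markers.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:a5147ad2ba231a4788e05f18e23c3c01f0b8f1d01d600a962ae690afd44d4de8
size 531099308
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["hash_algo"], pointer["digest"])
    print(f"{pointer['size'] / 1_000_000:.1f} MB")
```

The recorded size, 531,099,308 bytes, is why the file is stored through LFS instead of being committed directly.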