mserras committed
Commit 9c442e6
1 Parent(s): 2a2462b

Update README.md

Files changed (1)
  1. README.md +43 -20
README.md CHANGED
@@ -1,18 +1,33 @@
  ---
- license: apache-2.0
  tags:
  - setfit
  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
  ---

  # mserras/setfit-alpaca-es-unprocessable-sample-detection

- This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for text classification. The model has been trained using an efficient few-shot learning technique that involves:

- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.

  ## Usage

@@ -26,24 +41,32 @@ You can then run inference as follows:

  ```python
  from setfit import SetFitModel

  # Download from Hub and run inference
  model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")
- # Run inference
- preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
- ```

- ## BibTeX entry and citation info
-
- ```bibtex
- @article{https://doi.org/10.48550/arxiv.2209.11055,
- doi = {10.48550/ARXIV.2209.11055},
- url = {https://arxiv.org/abs/2209.11055},
- author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
- keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
- title = {Efficient Few-Shot Learning Without Prompts},
- publisher = {arXiv},
- year = {2022},
- copyright = {Creative Commons Attribution 4.0 International}
- }
  ```

  ---
  tags:
  - setfit
  - sentence-transformers
  - text-classification
  pipeline_tag: text-classification
+ datasets:
+ - mserras/alpaca-es-hackaton
+ - somosnlp/somos-clean-alpaca-es
+ language:
+ - es
  ---

  # mserras/setfit-alpaca-es-unprocessable-sample-detection

+ This is a [SetFit model](https://github.com/huggingface/setfit) that can be used for filtering the Alpaca ES instruction dataset.
+
+ The base model is [Paraphrase mpnet base v2](https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2) from Sentence Transformers.
+
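+ As with any SetFit model, the base sentence transformer was first fine-tuned with contrastive learning and a classification head was then trained on top of it. The block below is only a minimal, hypothetical sketch of that procedure, not the actual training script of this model; it assumes the pre-1.0 `setfit` `SetFitTrainer` API and a toy labeled dataset where label 1 means "unprocessable".
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers.losses import CosineSimilarityLoss
+ from setfit import SetFitModel, SetFitTrainer
+
+ # Toy few-shot dataset (illustrative only): label 1 = unprocessable, 0 = processable
+ train_ds = Dataset.from_dict({
+     "text": [
+         "INSTRUCTION:\nDescribe la imagen adjunta.\nINPUT:\n\nOUTPUT:\n...",
+         "INSTRUCTION:\nResume el siguiente texto.\nINPUT:\n...\nOUTPUT:\n...",
+     ],
+     "label": [1, 0],
+ })
+
+ # Start from the same base sentence transformer
+ model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
+
+ # 1) contrastive fine-tuning of the embeddings, 2) training of the classification head
+ trainer = SetFitTrainer(
+     model=model,
+     train_dataset=train_ds,
+     loss_class=CosineSimilarityLoss,
+     num_iterations=20,  # contrastive pairs generated per sample
+ )
+ trainer.train()
+ ```
+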
+ This model was developed during the 2023 hackathon organized by [SomosNLP](https://somosnlp.org/) ([HF card](https://huggingface.co/somosnlp)), with GPUs provided by [Q Blocks](https://www.qblocks.cloud).
+
+ This model has been trained on "unprocessable" samples of the translated [Clean Alpaca Es](https://huggingface.co/datasets/somosnlp/somos-clean-alpaca-es) dataset from the HF [Argilla](https://argilla.io) space https://huggingface.co/spaces/mserras/somos-alpaca-es.
+
+ To this end, a custom tag is proposed: "unprocessable", which marks instruction/input/output triplets that require processing images, fetching information from the open web or performing similar tasks that the LLM is not capable of, and which therefore end in hallucinations or strange outcomes.
+
+ As this model was trained on samples of Alpaca, which were generated using ChatGPT 3.5, it **cannot be used for commercial purposes or to compete against OpenAI**.
+
+ The scores are stored in the dataset in the metadata field "sf-unprocessable-score".

  ## Usage

 

  ```python
  from setfit import SetFitModel
+ import argilla as rg
+

  # Download from Hub and run inference
  model = SetFitModel.from_pretrained("mserras/setfit-alpaca-es-unprocessable-sample-detection")

+ def instruct_fields_to_text(field_instruction: str, field_input: str, field_output: str):
+     """Given the instruction, input and output fields, return a text to be used by setfit"""
+     return f"INSTRUCTION:\n{field_instruction}\nINPUT:\n{field_input}\nOUTPUT:\n{field_output}\n"
+
+ def sample_to_text(sample: rg.TextClassificationRecord) -> str:
+     """Converts an Argilla TextClassificationRecord to a text to be used by setfit"""
+     return instruct_fields_to_text(sample.inputs["1-instruction"], sample.inputs["2-input"], sample.inputs["3-output"])
+
+ # For a given Argilla record, the unprocessable score is the probability of the positive class:
+ unprocessable_score = model.predict_proba([sample_to_text(argilla_record)])[0].tolist()[1]
  ```
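+
+ Once the score is available, or read back from the "sf-unprocessable-score" metadata field mentioned above, the annotated dataset can be filtered with it. The following is only a sketch: it assumes an Argilla 1.x client already configured via `rg.init(...)` and an Argilla dataset named "somos-alpaca-es" whose records carry that metadata field.
+
+ ```python
+ import argilla as rg
+
+ # Load the annotated records (the dataset name here is an assumption)
+ records = rg.load("somos-alpaca-es")
+
+ # Keep only the samples the classifier considers processable
+ threshold = 0.5  # arbitrary cut-off, tune it by inspecting examples
+ processable = [
+     r for r in records
+     if r.metadata.get("sf-unprocessable-score", 0.0) < threshold
+ ]
+ print(f"Kept {len(processable)} of {len(records)} records")
+ ```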
+
+ ## Evaluation
+
+ *Disclaimer*: There was no formal evaluation, just a few people manually inspecting the data and the outcomes.
+
+ ## Changelog
+
+ - [09/04/2023] SQL code generation, date conversion, percentage discounts and renewable energies are no longer detected as unprocessable.
+ - [06/04/2023] Password generation is no longer detected as unprocessable.
+