---
license: mit
pipeline_tag: text-classification
---
This is a fine-tuned version of [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base).

I created a dataset of 89k items: I used an LLM to extract and label relevant information from job advert descriptions, and used the unlabeled data as negatives.
I applied the same method to text scraped from different company websites, so I could also get labels for e.g. product, service, and upcoming offerings.


I fine-tuned the reranker using the labels from the LLM, prepending "Example of" to the start of each query to give the model more context about the query's intent.

Fine-tuned queries:
```
Example of education
Example of certification
Example of qualifications
Example of work experience
Example of hard skills
Example of soft skills 
Example of benefits
Example of breadtext
Example of company culture
Example of product
Example of service
Example of upcoming offerings
Example of job duties
```
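
For illustration, here is a minimal sketch of how such training pairs could be assembled from an LLM-labeled advert. The variable names and the `{label: sentences}` structure are hypothetical; the actual pipeline isn't published:

```
# Hypothetical shapes; the real dataset/pipeline isn't published.
llm_output = {
    "education": ["Requires a bachelor degree in computer science"],
    "benefits": ["30 days of paid vacation"],
}
advert_sentences = [
    "Requires a bachelor degree in computer science",
    "30 days of paid vacation",
    "We look forward to your application",  # left unlabeled by the LLM
]

labeled = {s for sentences in llm_output.values() for s in sentences}
pairs = []
for label, sentences in llm_output.items():
    query = f"Example of {label}"                      # intent prefix
    pairs += [(query, s, 1) for s in sentences]        # positives
    pairs += [(query, s, 0) for s in advert_sentences  # unlabeled sentences
              if s not in labeled]                     # become negatives
```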

It works pretty well, though not perfectly; that would likely require more data, but due to Kaggle's session restrictions I could only train for a maximum of 12 hours, so I couldn't train it on a larger dataset.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/657e233acec775bfe0d5cbc6/rQTkVzunm1CyOzgR6oV5z.png)
 

You can load the model the same way you'd usually load a FlagReranker model:

```
from FlagEmbedding import FlagReranker

# Setting use_fp16 to True speeds up computation with a slight performance degradation
reranker = FlagReranker('MarcusLoren/Reranker-job-description', use_fp16=True)

score = reranker.compute_score(["Example of education", "Requires bachelor degree"])
print(score)
```
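
Since each fine-tuned query acts as a label, one way to use the model is to score a sentence against every query and keep the best match. A small sketch reusing the `reranker` instance from above:

```
labels = ["education", "certification", "qualifications", "work experience",
          "hard skills", "soft skills", "benefits", "breadtext",
          "company culture", "product", "service", "upcoming offerings",
          "job duties"]

sentence = "Requires bachelor degree"
scores = reranker.compute_score([[f"Example of {label}", sentence] for label in labels])
best_label = max(zip(labels, scores), key=lambda pair: pair[1])[0]
print(best_label)  # expected: education
```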

Or using transformers:
```
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('MarcusLoren/Reranker-job-description')
model = AutoModelForSequenceClassification.from_pretrained('MarcusLoren/Reranker-job-description')
model.eval()

pairs = [["Example of education", "Requires bachelor degree"]]
with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```
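
Note that in both snippets the scores are raw logits; if you want a value in (0, 1), you can pass them through a sigmoid, e.g. with the `scores` tensor from the transformers snippet above:

```
probabilities = torch.sigmoid(scores)  # squashes logits into (0, 1)
print(probabilities)
```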

Along with the model files I also uploaded an ONNX file, which you can load in .NET (C#) as shown below.
For tokenizing the input I used [BlingFire](https://github.com/microsoft/BlingFire), since they provide the SentencePiece model for xlm_roberta_base; the only special thing you'll need to do is add the special tokens that mark the start and end of the sequence.
ONNX model: https://huggingface.co/MarcusLoren/Reranker-job-description/blob/main/Reranker.onnx
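
If you want to sanity-check the ONNX file from Python before wiring it into .NET, a minimal onnxruntime sketch could look like the following. It assumes the tokenizer files are present in this repo (otherwise use the BAAI/bge-reranker-base tokenizer) and uses the same input/output names as the C# snippet below:

```
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('MarcusLoren/Reranker-job-description')
session = ort.InferenceSession('Reranker.onnx')

inputs = tokenizer([["Example of education", "Requires bachelor degree"]],
                   padding=True, truncation=True, max_length=512, return_tensors='np')
logits = session.run(['logits'], {'input_ids': inputs['input_ids'],
                                  'attention_mask': inputs['attention_mask']})[0]
print(logits)  # raw relevance score
```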

```
using System.Text;
using BlingFire;
using Microsoft.ML;

public class RankerInput
{
    public long[] input_ids { get; set; }
    public long[] attention_mask { get; set; }
}

public class RankedOutput
{
    public float[] logits { get; set; }
}

// Load the ONNX model into an ML.NET pipeline.
_mlContext = new MLContext();

var onnxModelPath = "Reranker.onnx";
var dataView = _mlContext.Data.LoadFromEnumerable(new List<RankerInput>());
var pipeline = _mlContext.Transforms.ApplyOnnxModel(
    modelFile: onnxModelPath,
    gpuDeviceId: 0,
    outputColumnNames: new[] { nameof(RankedOutput.logits) },
    inputColumnNames: new[] { nameof(RankerInput.input_ids), nameof(RankerInput.attention_mask) });
rankerModel = pipeline.Fit(dataView);
var predictionEngine = _mlContext.Model.CreatePredictionEngine<RankerInput, RankedOutput>(rankerModel);

var tokened_input = Tokenize(["Example of education", "Requires bachelor degree"]);

var pred = predictionEngine.Predict(tokened_input);
var score = pred.logits[0]; // e.g. 0.99

// Build the XLM-RoBERTa input for a (query, passage) pair:
// <s> query </s></s> passage </s>  ->  token ids 0 ... 2 2 ... 2
private RankerInput Tokenize(string[] pair)
{
    List<long> input_ids =
    [
        0,                        // <s> (start of sequence)
        .. TokenizeText(pair[0]),
        2,                        // </s>
        2,                        // </s> (separator between the pair)
        .. TokenizeText(pair[1]),
        2,                        // </s> (end of sequence)
    ];

    var attention_mask = Enumerable.Repeat((long)1, input_ids.Count).ToArray();
    return new RankerInput() { input_ids = input_ids.ToArray(), attention_mask = attention_mask };
}

// BlingFire's SentencePiece model for xlm_roberta_base handles the subword tokenization.
var TokenizerModel = BlingFireUtils.LoadModel(@".\xlm_roberta_base.bin");

public int[] TokenizeText(string text)
{
    List<int> tokenized = new();
    foreach (var chunk in text.Split(' ').Chunk(80)) // chunk long inputs to keep each call small
    {
        int[] labelIds = new int[128];
        byte[] inBytes = Encoding.UTF8.GetBytes(string.Join(" ", chunk));
        var outputCount = BlingFireUtils2.TextToIds(TokenizerModel, inBytes, inBytes.Length, labelIds, labelIds.Length, 0);
        Array.Resize(ref labelIds, outputCount);
        tokenized.AddRange(labelIds);
    }
    return tokenized.ToArray();
}
```


  