--- license: mit pipeline_tag: text-classification --- This is a fine-tuned version of [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base). I created a dataset of 89k items, I used a LLM to extract and label relevant information from job adverts description and used the unlabeled data as the negatives. I used the same type of method to extract labels from the LLM by scraping different company websites so I can get labels for e.g product, service and upcoming offerings. I fine-tuned the reranker by using the labels from the LLM and then inserting "Example of" in the start of the query to provide the model more context and intent of the query. Fine-tuned querys: ``` Example of education Example of certification Example of qualifications Example of work experience Example of hard skills Example of soft skills Example of benefits Example of breadtext Example of company culture Example of product Example of service Example of upcoming offerings Example of job duties ``` It works pretty well, not 100% since that might require more data but I could only train it for max 12hr due to Kaggle's sessions restrictions so I couldn't trainer it on a larger dataset. ![image/png](https://cdn-uploads.huggingface.co/production/uploads/657e233acec775bfe0d5cbc6/rQTkVzunm1CyOzgR6oV5z.png) You can load the model as you usually load the FlagReranker model: ``` from FlagEmbedding import FlagReranker reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation score = reranker.compute_score(["Example of education", "Requires bachelor degree"]) print(score) ``` Or using tranformer: ``` import torch from transformers import AutoModelForSequenceClassification, AutoTokenizer tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large') model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large') model.eval() pairs = [["Example of education", "Requires bachelor degree"]] with torch.no_grad(): inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512) scores = model(**inputs, return_dict=True).logits.view(-1, ).float() print(scores) ``` Along with the model files I also uploaded a ONNX file which you load in .NET (C#) using the below. For tokenizing the input i used [BlingFire](https://github.com/microsoft/BlingFire) since they have the sentence piece model for xlm_roberta_base, the only special thing you'll need is to add the special tokens that indicate the start and EOF. ONNX model: https://huggingface.co/MarcusLoren/Reranker-job-description/blob/main/Reranker.onnx ``` public class RankerInput { public long[] input_ids { get; set; } public long[] attention_mask { get; set; } } public class RankedOutput { public float[] logits { get; set; } } _mlContext = new MLContext(); var onnxModelPath = "Reranker.onnx"; var dataView = _mlContext.Data.LoadFromEnumerable(new List()); var pipeline = _mlContext.Transforms.ApplyOnnxModel( modelFile: onnxModelPath, gpuDeviceId: 0, outputColumnNames: new[] { nameof(RankedOutput.logits) }, inputColumnNames: new[] { nameof(RankerInput.input_ids), nameof(RankerInput.attention_mask) }); rankerModel = pipeline.Fit(dataView); var predictionEngine = _mlContext.Model.CreatePredictionEngine(rankerModel); var tokened_input = Tokenize(["Example of education", "Requires bachelor degree"]) var pred = predictionEngine.Predict(tokened_input) var score = pred.logits[0]; // e.g 0.99 private RankerInput Tokenize(string[] pair) { List input_ids = [ 0, .. TokenizeText(pair[0]), 2, 2, .. TokenizeText(pair[1]), 2, ]; var attention_mask = Enumerable.Repeat((long)1, input_ids.Count).ToArray(); return new RankerInput() { input_ids = input_ids.ToArray(), attention_mask = attention_mask }; } var TokenizerModel = BlingFireUtils.LoadModel(@".\xlm_roberta_base.bin"); public int[] TokenizeText(string text) { List tokenized = new(); foreach (var chunk in text.Split(' ').Chunk(80)) { int[] labelIds = new int[128]; byte[] inBytes = Encoding.UTF8.GetBytes(string.Join(" ", chunk)); var outputCount = BlingFireUtils2.TextToIds(TokenizerModel, inBytes, inBytes.Length, labelIds, labelIds.Length, 0); Array.Resize(ref labelIds, outputCount); tokenized.AddRange(labelIds); } return tokenized.ToArray(); } ``` --- license: mit ---