Sports Text Classifier

Overview

The Sports Text Classifier is a core component of the OnlySports Dataset creation pipeline. It identifies and extracts sports-related documents from a large corpus of web content.

Model Architecture

  • Base model: Snowflake-arctic-embed-xs
  • Additional layer: Binary classification layer
  • Training: 10 epochs with a learning rate of 3e-4
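
The architecture above can be pictured as an encoder that turns a document into a fixed-size embedding, followed by a small head that maps that embedding to a sports probability. The sketch below covers only the head, in plain Python. The embedding width of 384, the single-logit sigmoid form, and the names `BinaryHead` and `EMBED_DIM` are assumptions made for illustration; the card does not say whether the actual layer emits one logit or two.

```python
import math
import random
from typing import Sequence

EMBED_DIM = 384  # assumed embedding width of the base encoder


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class BinaryHead:
    """Hypothetical head: one linear layer plus a sigmoid over one embedding."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Random placeholder weights; real values would come from fine-tuning.
        self.weights = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        self.bias = 0.0

    def probability(self, embedding: Sequence[float]) -> float:
        """Return the estimated probability that the document is about sports."""
        if len(embedding) != len(self.weights):
            raise ValueError("embedding has the wrong dimension")
        logit = sum(w * x for w, x in zip(self.weights, embedding)) + self.bias
        return _sigmoid(logit)

    def is_sports(self, embedding: Sequence[float], threshold: float = 0.5) -> bool:
        """Apply a decision threshold (0.5 is an assumed default)."""
        return self.probability(embedding) >= threshold
```

In a real setup the embedding would come from the base model's pooled output, and the head would be trained jointly with or on top of it using the settings listed above.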

Performance

The classifier achieves high accuracy in distinguishing between sports and non-sports documents, as shown in the figure:

[Figure: classifier performance on sports vs. non-sports documents]

Training Data

The classifier was trained on a mix of sports and non-sports content, about 100k documents in total:

  • 64k samples from seven prestigious sports websites
  • 36k non-sports text documents classified using GPT-3.5
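
As a rough illustration of how two such sources could be combined into one labeled set, the sketch below tags documents from the sports sites as positives and the GPT-3.5-labeled documents as negatives, then shuffles them. The label convention (1 for sports, 0 for non-sports), the function names, and the fixed seed are assumptions, not details taken from the project.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, int]  # (text, label); 1 = sports, 0 = non-sports (assumed)


def build_training_set(
    sports_docs: Sequence[str], other_docs: Sequence[str], seed: int = 0
) -> List[Example]:
    """Label both sources and shuffle them into a single list."""
    examples = [(doc, 1) for doc in sports_docs] + [(doc, 0) for doc in other_docs]
    random.Random(seed).shuffle(examples)
    return examples


def sports_share(examples: Sequence[Example]) -> float:
    """Fraction of examples labeled as sports."""
    return sum(label for _, label in examples) / len(examples)
```

With the counts listed above (64k and 36k), `sports_share` would come out at 0.64.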

Usage

This classifier is primarily used to build the OnlySports Dataset, presented in this paper. It can also be applied to filter other large text corpora for sports-related content.
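
Filtering a corpus with a classifier of this kind usually comes down to scoring each document and keeping those above a threshold. The following is a minimal sketch of that loop. `toy_score` is a keyword-based stand-in for the real model so the example runs without downloading anything, and the 0.5 threshold is an assumed default rather than the value used for OnlySports.

```python
from typing import Callable, Iterable, Iterator

SPORTS_WORDS = {"goal", "match", "league", "playoffs"}  # illustrative only


def toy_score(text: str) -> float:
    """Stand-in scorer: a crude keyword ratio, NOT the actual classifier."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    hits = sum(t in SPORTS_WORDS for t in tokens)
    return min(1.0, hits / 2)


def filter_sports(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Lazily yield documents whose sports score reaches the threshold."""
    for doc in docs:
        if score(doc) >= threshold:
            yield doc


corpus = [
    "The league match ended with a late goal.",
    "Quarterly earnings beat analyst expectations.",
]
kept = list(filter_sports(corpus, toy_score))
```

Passing the scorer in as a callable keeps the filtering loop independent of how the model is loaded, so the stand-in can be swapped for the real classifier's probability function.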

Integration

The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.
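
The card does not spell out how the two filters are combined, so the sketch below makes one assumption explicit: URL keyword matching acts as a cheap prefilter, and the classifier runs only on records that pass it. The record shape, the keyword list, and all function names are hypothetical. The map step filters one shard at a time and the reduce step concatenates the per-shard results.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]  # hypothetical shape: {"url": ..., "text": ...}

URL_KEYWORDS = ("sports", "nba", "nfl", "soccer")  # illustrative, not the project's list


def url_matches(url: str) -> bool:
    """Cheap prefilter: does the URL contain any sports keyword?"""
    lowered = url.lower()
    return any(keyword in lowered for keyword in URL_KEYWORDS)


def map_shard(shard: List[Record], is_sports: Callable[[str], bool]) -> List[Record]:
    """Map step: keep records that pass the URL filter and then the classifier."""
    return [r for r in shard if url_matches(r["url"]) and is_sports(r["text"])]


def reduce_shards(left: List[Record], right: List[Record]) -> List[Record]:
    """Reduce step: merge the kept records of two shards."""
    return left + right


def run_pipeline(
    shards: Iterable[List[Record]], is_sports: Callable[[str], bool]
) -> List[Record]:
    mapped = (map_shard(shard, is_sports) for shard in shards)
    return reduce(reduce_shards, mapped, [])
```

Because each shard is processed independently, the map step could be distributed across workers; the sequential generator here only stands in for that.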

Related Projects

This classifier is part of the larger OnlySports collection.

For more information, check our paper.

Model size: 22.7M params (Safetensors, tensor type F32)