h2o-danube3-500m-chat-GGUF

Description

This repo contains GGUF-format model files for h2o-danube3-500m-chat, quantized with llama.cpp.

The table below summarizes the quantized versions of h2o-danube3-500m-chat and shows the trade-off between model size, speed, and quality.

| Name | Quant method | Model size | MT-Bench AVG | Perplexity | Tokens per second |
|---|---|---|---|---|---|
| h2o-danube3-500m-chat-F16.gguf | F16 | 1.03 GB | 3.34 | 9.46 | 1870 |
| h2o-danube3-500m-chat-Q8_0.gguf | Q8_0 | 0.55 GB | 3.76 | 9.46 | 2144 |
| h2o-danube3-500m-chat-Q6_K.gguf | Q6_K | 0.42 GB | 3.77 | 9.46 | 2418 |
| h2o-danube3-500m-chat-Q5_K_M.gguf | Q5_K_M | 0.37 GB | 3.20 | 9.55 | 2430 |
| h2o-danube3-500m-chat-Q4_K_M.gguf | Q4_K_M | 0.32 GB | 3.16 | 9.96 | 2427 |

Columns in the table are:

  • Name -- model name and link
  • Quant method -- quantization method
  • Model size -- size of the model in gigabytes
  • MT-Bench AVG -- average MT-Bench benchmark score. Scores range from 1 to 10; higher is better
  • Perplexity -- perplexity on the WikiText-2 dataset, as reported by the llama.cpp perplexity test. Lower is better
  • Tokens per second -- generation speed in tokens per second, as reported by the llama.cpp perplexity test. Higher is better. Speed tests were run on a single H100 GPU
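
To make the trade-off easier to read, the figures in the table can be expressed relative to the F16 file. The sketch below is only an illustration: the numbers are copied from the table above, while the `ROWS` layout and the `relative_to_baseline` helper are hypothetical names introduced here, not part of llama.cpp or this repo.

```python
# Figures copied from the table above:
# (quant method, size in GB, MT-Bench AVG, perplexity, tokens per second)
ROWS = [
    ("F16", 1.03, 3.34, 9.46, 1870),
    ("Q8_0", 0.55, 3.76, 9.46, 2144),
    ("Q6_K", 0.42, 3.77, 9.46, 2418),
    ("Q5_K_M", 0.37, 3.20, 9.55, 2430),
    ("Q4_K_M", 0.32, 3.16, 9.96, 2427),
]


def relative_to_baseline(rows, baseline="F16"):
    """Return size ratio, speedup and perplexity change versus the baseline row.

    Hypothetical helper for illustration only.
    """
    base = next(r for r in rows if r[0] == baseline)
    result = {}
    for name, size_gb, _mt_bench, ppl, tps in rows:
        result[name] = {
            "size_ratio": size_gb / base[1],                      # < 1 means smaller file
            "speedup": tps / base[4],                             # > 1 means faster
            "ppl_increase_pct": 100 * (ppl - base[3]) / base[3],  # > 0 means worse
        }
    return result


if __name__ == "__main__":
    for name, m in relative_to_baseline(ROWS).items():
        print(
            f"{name:8s} size x{m['size_ratio']:.2f}  "
            f"speed x{m['speedup']:.2f}  ppl {m['ppl_increase_pct']:+.1f}%"
        )
```

Ratios like these only compare the published measurements with each other; speed in particular was measured on a single H100 GPU and may differ on other hardware.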

Prompt template

<|prompt|>Why is drinking water so healthy?</s><|answer|>
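
As a minimal sketch, the template above can be produced with a small helper. The single-turn format is taken directly from the template; the multi-turn handling (each earlier answer closed with `</s>` before the next `<|prompt|>`) is an assumption extrapolated from it, and `build_prompt` is a hypothetical name, not part of any library.

```python
from typing import Iterable, Tuple

PROMPT_TAG = "<|prompt|>"
ANSWER_TAG = "<|answer|>"
EOS = "</s>"


def build_prompt(question: str, history: Iterable[Tuple[str, str]] = ()) -> str:
    """Format a question (plus optional earlier turns) in the prompt template.

    `history` is a sequence of (question, answer) pairs. How earlier turns
    are joined is an assumption, not something this model card specifies.
    """
    parts = []
    for past_question, past_answer in history:
        parts.append(f"{PROMPT_TAG}{past_question}{EOS}{ANSWER_TAG}{past_answer}{EOS}")
    # The final turn ends with the answer tag so the model continues from there.
    parts.append(f"{PROMPT_TAG}{question}{EOS}{ANSWER_TAG}")
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt("Why is drinking water so healthy?"))
```

The resulting string can be passed as the raw prompt to whichever GGUF runtime is used. Some runtimes add special tokens or apply a chat template on their own, so it is worth checking that the template is not applied twice.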

Model size: 514M params
Architecture: llama
