Quantize Llama 2 models using GGUF and llama.cpp

🗣️ Large Language Model Course

Usage

MODEL_ID: The ID of the model to quantize (e.g., mlabonne/EvolCodeLlama-7b).
QUANTIZATION_METHOD: The quantization method to use.

Quantization methods

The names of the quantization methods follow the naming convention: "q" + the number of bits + the variant used (detailed below). Here is a list of all the possible quant methods and their corresponding use cases, based on model cards made by TheBloke:

q2_k: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_s: Uses Q3_K for all tensors
q4_0: Original quant method, 4-bit.
q4_1: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
q4_k_s: Uses Q4_K for all tensors
q5_0: Higher accuracy, higher resource usage and slower inference.
q5_1: Even higher accuracy, resource usage and slower inference.
q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
q5_k_s: Uses Q5_K for all tensors
q6_k: Uses Q8_K for all tensors
q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

As a rule of thumb, I recommend using Q5_K_M as it preserves most of the model's performance. Alternatively, you can use Q4_K_M if you want to save some memory. In general, K_M versions are better than K_S versions. I cannot recommend Q2_K or Q3_* versions, as they drastically decrease model performance.

Source Code

https://colab.research.google.com/drive/1HZwFloVY4-1Ca3dARX5OuWTj-u5wtH7C?usp=sharing