No GPU support?

#3
by seedmanc

It's missing the raw Q4 versions without letters. Those usually work with GPU while others don't; I've noticed this with other models too.

Hi there @seedmanc , thanks for reaching out! What kind of GPU do you use?

My understanding is that K-quants are universally compatible with any GPU accelerator supported by Llama.cpp (Metal, AMD, NVIDIA, etc), and preferable to the original Q4_0, Q4_1, Q5_0, and Q5_1 quants.

For example, they're described as "legacy" in the source code:

Old quant types (some base model types require these):
- Q4_0: small, very high quality loss - legacy, prefer using Q3_K_M
- Q4_1: small, substantial quality loss - legacy, prefer using Q3_K_L
- Q5_0: medium, balanced quality - legacy, prefer using Q4_K_M
- Q5_1: medium, low quality loss - legacy, prefer using Q5_K_M

Here's a nice detailed breakdown on the tradeoffs between file size and accuracy of the various quant types (K, I, & legacy).

And in terms of performance, K-quants are faster because they have a simpler dequantization process. I-quants require a lookup table in order to dequantize, so there's a bit of overhead there; that's the cost of the accuracy improvement from this approach (more here via ikawrakow, the author of these formats!).
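To make that concrete, here's a minimal Python sketch of the two dequantization styles. This is schematic only, not the actual llama.cpp kernels; the block size and codebook values are made up for illustration:

```python
import numpy as np

# K-quant style (schematic): each block stores a scale plus small
# integer quants, so dequantizing is one multiply per weight.
def dequant_scale(scale, quants):
    return scale * quants.astype(np.float32)

# I-quant style (schematic): the stored values are indices into a
# shared codebook, so each weight costs an extra table lookup
# before the multiply. That lookup is the overhead mentioned above.
CODEBOOK = np.array([-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0, 2.0],
                    dtype=np.float32)  # illustrative values only

def dequant_lookup(scale, indices):
    return scale * CODEBOOK[indices]

# One toy block of 8 weights through each path:
print(dequant_scale(0.1, np.array([-8, -3, 0, 2, 5, 7, -1, 4], dtype=np.int8)))
print(dequant_lookup(0.1, np.array([0, 3, 7, 2, 5, 1, 6, 4])))
```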

My takeaways:

  • if you value model size on disk and accuracy most, you should use I-quants (e.g. IQ4_NL);
  • if you value performance first & foremost, and don't mind a bit less accuracy & slightly larger model files, you should use K-quants (e.g. Q4_K_M);
  • there isn't a scenario where the legacy Q4_0/Q4_1 or Q5_0/Q5_1 quants are preferred, which explains more clearly why they're considered legacy at this point.

For these reasons, I haven't been converting into those legacy quants so far, mainly to avoid any more confusion than already exists when choosing a quant size — but if you have some compatibility constraint I'm not aware of, I'd be more than happy to adjust my view here & begin doing so!

Thanks,
Britt

RTX 2070S 8GB, using GPT4All for inference.
Only the zero quants work on GPU, the ones with a 0 instead of the letters.
I tried 5, 6 and 8 bits with 0; only 4 bits work with GPU.
It's frustrating to have so many bit depths to choose from, and even several versions within the same bit depth.
You have to download everything and test and test again (not just here, generally with models).
Why can't there be a disclaimer about what people seeking accelerated inference should choose right away?

@seedmanc I can definitely understand your frustration! It's a new space that is evolving rapidly, so it can be challenging to find documentation that stays accurate and up to date, much less tools that keep working, for very long.

In this case, it would seem GPT4All's llama.cpp backend is a custom fork which is now 2 months out of date. That might sound relatively recent, but with how rapidly the technology is changing, it's quite outdated, so that might explain some of the trouble you've had finding compatible quants. I would suggest exploring alternatives, as there are many which are more up to date and have more flexible compatibility!
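For example, the llama-cpp-python bindings track upstream llama.cpp fairly closely, and a recent build with GPU support will offload K-quants without issue. Here's a minimal sketch; the model file name is a placeholder for whatever you've downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python (built with GPU support)

# Load a K-quant GGUF and offload it to the GPU.
# "model-Q4_K_M.gguf" is a placeholder path for your downloaded file.
llm = Llama(
    model_path="model-Q4_K_M.gguf",
    n_gpu_layers=-1,  # -1 offloads all layers
    n_ctx=4096,
)

print(llm("Q: What is a K-quant? A:", max_tokens=64)["choices"][0]["text"])
```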

@seedmanc in the meantime, here are the legacy quants:

  • Q4_0
  • Q4_1
  • Q5_0
  • Q5_1
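And if you'd rather stay on GPT4All for now, here's a minimal sketch of loading one of these with its Python bindings. The file name is a placeholder, and device="gpu" is how I understand the SDK requests GPU inference; double-check against the version you have installed:

```python
from gpt4all import GPT4All  # pip install gpt4all

# Placeholder file name; model_path points at the folder holding the download.
model = GPT4All(
    "model-Q4_0.gguf",
    model_path=".",
    device="gpu",          # request GPU inference; exact behavior depends on your GPT4All version
    allow_download=False,  # use the local file instead of fetching from the model list
)

with model.chat_session():
    print(model.generate("Hello! Which quant am I running?", max_tokens=64))
```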

Cool, thanks. Both Q4 versions work with GPU. Does this mean that GPT4All just doesn't support the new (lettered) quants, only the old (numbered) ones?
