The "silly" experiment.

#4 opened by ZeroWw

ZeroWw 'SILLY' version. The original model has been quantized (fq8 version) and a percentage of its tensors have been modified by adding some noise.

Full colab: https://colab.research.google.com/drive/1a7seagBzu5l3k3FL4SFk0YJocl7nsDJw?usp=sharing

Fast colab: https://colab.research.google.com/drive/1SDD7ox21di_82Y9v68AUoy0PhkxwBVvN?usp=sharing

Original reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1ec0s8p/i_made_a_silly_test/

I created a program to randomize the weights of a model. The program has two parameters: the percentage of weights to modify and the maximum deviation, as a percentage of the original value, to randomly apply to each weight.
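For reference, a minimal NumPy sketch of that idea (hypothetical code; the real patch is at the top of the Colab notebook and may differ in details):

```python
import numpy as np

def add_noise(weights: np.ndarray, weight_pct: float, max_dev: float,
              seed: int | None = None) -> np.ndarray:
    """Perturb roughly `weight_pct` of the entries, each by up to +/- `max_dev`
    of its own original value."""
    rng = np.random.default_rng(seed)
    out = weights.astype(np.float32, copy=True)
    mask = rng.random(out.shape) < weight_pct        # which weights to touch
    dev = rng.uniform(-max_dev, max_dev, out.shape)  # relative deviation per weight
    out[mask] += out[mask] * dev[mask]
    return out

# e.g. modify 100% of the weights, each by at most 15% of its original value:
# noisy = add_noise(tensor_f32, weight_pct=1.0, max_dev=0.15)
```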

At the end I check the resulting GGUF file for binary differences. In this example I set it to modify 100% of the weights of Mistral 7B Instruct v0.3 by a maximum deviation of 15%.

Since the deviation is calculated on the F32 weights, it changes when quantized to Q8_0. So, in the end, I got a file that, compared to the original, has:

Bytes Difference percentage: 73.04%

Average value divergence: 2.98%
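For context, a naive byte-level comparison along these lines could produce the two numbers above (a sketch that assumes "average value divergence" means the mean absolute per-byte difference relative to the 0-255 range; the notebook's exact definition may differ):

```python
def compare_gguf(path_a: str, path_b: str) -> None:
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    n = min(len(a), len(b))
    differing = sum(a[i] != b[i] for i in range(n))
    # mean absolute per-byte difference, relative to the 0-255 byte range
    avg_div = sum(abs(a[i] - b[i]) for i in range(n)) / (n * 255)
    print(f"Bytes Difference percentage: {100 * differing / n:.2f}%")
    print(f"Average value divergence: {100 * avg_div:.2f}%")
```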

The cool thing is that, chatting with the model, I see no apparent difference and it still works as nicely as the original.

Since I am running everything on CPU, I could not run perplexity scores or anything compute-intensive.

As a small test, I asked the model a few questions (like the history of the Roman Empire) and then fact-checked its answers using a big model. No errors were detected.

Update: the whole procedure was created and tested on Colab.

Example: https://huggingface.co/ZeroWw/Lumimaid-v0.2-12B-SILLY


If you can upload a silly version of a smaller model like Phi-3-Mini in FQ8 & Silly, I can test perplexities :3 (depending on timing I may also be able to run Ollama-MMLU-Pro)

Edit - I forgot -ngl existed, running mistral silly perplexity :3


I'm not really sure what you are trying to accomplish with this, but you might find this interesting:

https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory

Quantization using "squared-error distortion" is essentially adding "random" noise like this.

If you use some of the simpler Q4_0 or Q8_0 quants (which, IIRC, use the "squared-error distortion" criterion alone) and add Gaussian random noise, then it is in fact exactly the same...
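A toy illustration of that point, assuming a very simplified Q8_0-style blockwise round-to-nearest (not llama.cpp's actual code): the quantization round-trip error already looks like small additive noise on the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=32 * 1024).astype(np.float32)

def q8_roundtrip(x: np.ndarray, block: int = 32) -> np.ndarray:
    """Very simplified Q8_0-style blockwise round-to-nearest quantization."""
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = np.abs(blk).max() / 127.0
        out[i:i + block] = np.round(blk / scale) * scale if scale else blk
    return out

quant_err = q8_roundtrip(w) - w
# Uniform noise on the same order of magnitude as the round-trip error:
noise = rng.uniform(-np.abs(quant_err).max(), np.abs(quant_err).max(), w.shape)
print("quantization error std:", float(quant_err.std()))
print("injected noise std:    ", float(noise.std()))
```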

I don't understand any of that page. Anyway, the patch to add noise to the model is at the top of the Colab notebook.
Before making this tool I searched and found no tool able to do that.
The results are very interesting, especially with 100% of the weights modified and a 20% deviation. 30% is also interesting. At 60% the model is very degraded, and above that it becomes more and more random. But at 15%-30% there is a sweet spot where the model is "different" but not degraded.

Are you doing something similar to https://github.com/EGjoni/DRUGS ?


DRµGS just inverts this scheme. Instead of using noise to sample from the model's predictions, 
DRµGS injects noise directly into the transformer layers at inference time, thereby varying what the model predicts.

I don't know about that. Read the first post in this thread; I explained what I did: I add a divergence to all weights.
In the Colab notebook you can vary this divergence or the percentage of weights affected.
And it's not done at inference time: it generates a modified model that you can later use as you please.
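To make the distinction concrete, here is a generic PyTorch-style sketch of inference-time noise injection (not DRµGS's actual code; the module path in the comment is an assumption about a typical transformers model):

```python
import torch

def add_inference_noise(module, inputs, output, scale: float = 0.05):
    # Perturb this layer's output on every forward pass (inference-time noise).
    if isinstance(output, tuple):
        return (output[0] + scale * torch.randn_like(output[0]),) + output[1:]
    return output + scale * torch.randn_like(output)

# Hypothetical usage on a typical transformers decoder:
# handle = model.model.layers[10].register_forward_hook(add_inference_noise)
# ...generate as usual; undo with handle.remove()
```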


What I'm saying is we already have a way to add noise to models: quantization...


Quantization degrades a model (to varying extents). With a low percentage of noise, instead, the model does not really degrade. That's the interesting part, imho.

Sounds like a no-true-Scotsman fallacy, to be honest :)


Nobody asked you nor forced you to use it.

Just pointing out the obvious logical problems.

The "silly" experiment is called that for exactly that reason. Wasn't that obvious, Mr. Scotsman? :D
The results are interesting anyway, especially with a divergence <= 20%.

The no-true-Scotsman fallacy refers to the "not really degraded" part. Saying it is degraded, but not really degraded, is not a reasonable thing to say, because nobody knows what that can possibly mean. It's pretty much meaningless. I have no idea how you could somehow apply it to the title. See https://en.wikipedia.org/wiki/No_true_Scotsman for further explanations and examples.

Again, I am just the messenger. You don't have to act so defensively and try to stifle discussion by implying that only invited opinions are allowed here. Nobody asked you either, but you still post here. Why do you assume you have special privileges?

The single purpose I see for this is to prepare a model before fine-tuning, so that it accepts the fine-tune better because it has been made less stable.

Another example that reminds me of this is adding toxic fine-tunes to a merge to increase creativity and instruction following by degrading rules and norms. This is done very carefully though.

Besides that, I don't see how your theory of adding forced randomness (instead of higher temperatures at inference) adds up.

@Nelathan mine is "a silly test".
Generating many models from one and then evaluating them, like an evolutionary algorithm (sketched below), might bring something to the table. Or maybe not.
But, again, it's just a test.
Some liked it, some didn't.
Personally I like the models more with a 10% divergence: they are more creative and not "crazy".
Other projects before mine used a similar technique: one was called DRuGS.
The difference in mine is just that I create a new model file and use llama.cpp, while DRuGS used Python and different algorithms.
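A purely hypothetical sketch of that evolutionary idea, where `score_fn` stands in for whatever evaluation one would use (e.g. perplexity on held-out text):

```python
import numpy as np

def mutate(weights: np.ndarray, max_dev: float, rng: np.random.Generator) -> np.ndarray:
    # The same kind of per-weight relative perturbation described above, applied to all weights.
    return weights + weights * rng.uniform(-max_dev, max_dev, weights.shape)

def evolve(base: np.ndarray, score_fn, generations: int = 5,
           population: int = 8, max_dev: float = 0.10, seed: int = 0) -> np.ndarray:
    """Keep the best-scoring variant each generation (lower score = better)."""
    rng = np.random.default_rng(seed)
    best = base
    for _ in range(generations):
        candidates = [best] + [mutate(best, max_dev, rng) for _ in range(population)]
        best = min(candidates, key=score_fn)
    return best
```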

Disclaimer: I'm not familiar with signal processing theory.

Could it be that the concept we're talking about here is dithering? It seems to help mitigate quantization degradation and is widely adopted for images and audio. Quoted from Wikipedia: "Dither is an intentionally applied form of noise used to randomize quantization error, preventing large-scale patterns such as color banding in images."
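A tiny 1-D toy example of dithering (illustrative only, not tied to GGUF quantization specifically):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 16, dtype=np.float32)
step = 0.25  # deliberately coarse quantization step

plain    = np.round(x / step) * step
dithered = np.round((x + rng.uniform(-step / 2, step / 2, x.shape)) / step) * step

# Plain rounding error is strongly correlated with the signal (-> banding);
# dithering trades that structured error for noise-like error.
print("plain error:   ", np.round(plain - x, 3))
print("dithered error:", np.round(dithered - x, 3))
```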

Also found two relevant discussions from the llama.cpp repository: #4976 #4933

Hi @kalomaze , I noticed you raised similar points before in these pull requests and did some preliminary research there. Would you mind sharing your thoughts on this topic? I think your insights would be valuable to this conversation.

Dithering is applied around edges, but this silly modification adds noise all over.

Btw, in Stable Diffusion that's used to add more detail when upscaling, but in latent space, not the model weights.

Yep. The models I posted have 20% noise... but with less noise, like 10%-15%, they might be interesting. I posted the notebook so people can do it themselves.
