visheratin 
posted an update Feb 24

Seems fun to explore! Can you link some reference papers too, if possible?


There are links to existing papers in the blog post if you want to dive into the field.

so good!

Hi @visheratin, do you have any guides on how to train a similar model? Phi-2 + SigLIP vision encoder?


I mainly used the LLaVA training codebase with some changes to support multi-crop. I'll be working on the next post about fine-tuning MC-LLaVA on a task-specific dataset and will open all the code.

I found your blog post really interesting.
I have a question regarding training models: in your method, you mentioned that images are divided into max_crop patches and then fed into an image encoder. Does this mean that, compared to the original LLaVA, the model's forward pass requires max_crop times more time or memory?
Or is there a more efficient way to implement this?


You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not a big problem during inference, but it makes training harder because of the stored gradients. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
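For concreteness, here is a minimal sketch of what the multi-crop encoding step might look like. The function name `encode_multi_crop`, the grid-cropping logic, and the dummy encoder are assumptions for illustration, not the actual MC-LLaVA code; the point is that stacking crops along the batch dimension allows a single encoder call, while activation memory still grows linearly with the number of crops, which is the trade-off discussed above.

```python
# Hypothetical sketch of multi-crop encoding (not the MC-LLaVA implementation).
import torch
import torch.nn.functional as F

def encode_multi_crop(image, vision_encoder, crop_size=224, max_crops=4):
    """Split an image (C, H, W) into up to `max_crops` square crops and encode them.

    The crops are stacked along the batch dimension, so the encoder runs one
    forward pass -- but it processes `max_crops` times more tokens, so memory
    (and stored gradients during training) scales with the number of crops.
    """
    _, h, w = image.shape
    crops = []
    for top in range(0, h, crop_size):
        for left in range(0, w, crop_size):
            if len(crops) >= max_crops:
                break
            crop = image[:, top:top + crop_size, left:left + crop_size]
            # Pad edge crops so every crop has the same spatial size.
            pad_h = crop_size - crop.shape[1]
            pad_w = crop_size - crop.shape[2]
            crop = F.pad(crop, (0, pad_w, 0, pad_h))
            crops.append(crop)
    batch = torch.stack(crops)   # (n_crops, C, crop_size, crop_size)
    return vision_encoder(batch)  # (n_crops, ...) -- one pass, n_crops x memory
```

For example, a 448x448 image with `crop_size=224` produces four crops, so the encoder sees a batch of four instead of a single image.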