visheratin 
posted an update Feb 24

Seems fun to explore! Can you link some reference papers too, if possible?


There are links to existing papers in the blog post if you want to dive into the field.

so good!

Hi @visheratin, do you have any guides on how to train a similar model? Phi-2 + SigLIP vision encoder?


I mainly used the LLaVA training codebase with some changes to support multi-crop. I'll be working on the next post about fine-tuning MC-LLaVA on a task-specific dataset and will open all the code.

I found your blog post really interesting.
I have a question regarding training models: in your method, you mentioned that images are divided into max_crop patches and then fed into an image encoder. Does this mean that, compared to the original LLaVA, the model's forward pass requires max_crop times more time or memory?
Or is there a more efficient way to implement this?


You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not a big problem during inference, but it makes training harder because of the stored gradients. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
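For concreteness, here is a minimal sketch of what the multi-crop encoding step might look like. The function name `encode_multi_crop`, the grid-cropping logic, and the dummy encoder are assumptions for illustration, not the actual MC-LLaVA code; the point is that stacking crops along the batch dimension allows a single encoder call, while activation memory still grows linearly with the number of crops, which is the trade-off discussed above.

```python
# Hypothetical sketch of multi-crop encoding (not the MC-LLaVA implementation).
import torch
import torch.nn.functional as F

def encode_multi_crop(image, vision_encoder, crop_size=224, max_crops=4):
    """Split an image (C, H, W) into up to `max_crops` square crops and encode them.

    The crops are stacked along the batch dimension, so the encoder runs one
    forward pass -- but it processes `max_crops` times more tokens, so memory
    (and stored gradients during training) scales with the number of crops.
    """
    _, h, w = image.shape
    crops = []
    for top in range(0, h, crop_size):
        for left in range(0, w, crop_size):
            if len(crops) >= max_crops:
                break
            crop = image[:, top:top + crop_size, left:left + crop_size]
            # Pad edge crops so every crop has the same spatial size.
            pad_h = crop_size - crop.shape[1]
            pad_w = crop_size - crop.shape[2]
            crop = F.pad(crop, (0, pad_w, 0, pad_h))
            crops.append(crop)
    batch = torch.stack(crops)   # (n_crops, C, crop_size, crop_size)
    return vision_encoder(batch)  # (n_crops, ...) -- one pass, n_crops x memory
```

For example, a 448x448 image with `crop_size=224` produces four crops, so the encoder sees a batch of four instead of a single image.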