Large language models have shown notable achievements in executing instructions, multi-turn conversations, and image-based question-answering tasks. These models include Flamingo, GPT-4V, and Gemini. The fast development of open-source Large Language Models, such as LLaMA and Vicuna, has greatly accelerated the evolution of open-source vision language models. These advancements mainly center on improving visual understanding by…
