Guiding Instruction-based Image Editing via Multimodal Large Language Models

February 22, 2024

A new paper, this time from Apple, discusses using a Multimodal Large Language Model (MLLM) to enhance instruction-based image editing.

If I am reading this correctly, instruction-based image editing can struggle when humans give it brief or ambiguous instructions. This approach uses an MLLM to “translate” those instructions into more explicit ones that the editing model can follow to achieve the desired result.
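
To make the layering idea concrete, here is a minimal sketch of how I picture it, not the paper's actual implementation. Every name in it (`guided_edit`, `toy_expand`, `toy_edit`) is hypothetical, and the two toy functions are simple stand-ins for the MLLM and the editing model so the example runs without any real model.

```python
from typing import Callable, Dict

Image = Dict[str, str]  # toy stand-in for real image data (assumption)


def guided_edit(
    image: Image,
    brief_instruction: str,
    expand: Callable[[Image, str], str],
    edit: Callable[[Image, str], Image],
) -> Image:
    """Expand a brief instruction first, then pass the result to the editor."""
    # Step 1: the MLLM stand-in looks at the image and the brief instruction
    # and returns a more explicit instruction.
    expressive_instruction = expand(image, brief_instruction)
    # Step 2: the editing-model stand-in receives the explicit instruction.
    return edit(image, expressive_instruction)


def toy_expand(image: Image, instruction: str) -> str:
    # Hypothetical: a real MLLM would generate this text from the image itself.
    return (
        f"{instruction}: adjust the {image['subject']} "
        "while keeping the background unchanged"
    )


def toy_edit(image: Image, instruction: str) -> Image:
    # Hypothetical: a real editor would return new pixels; this one only
    # records which instruction it was given.
    return {**image, "applied_instruction": instruction}


result = guided_edit({"subject": "pizza"}, "make it healthier", toy_expand, toy_edit)
print(result["applied_instruction"])
```

The point of the sketch is only the ordering: the editor never sees the brief instruction directly, only the expanded version.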

This is an exciting development for two reasons. First, it gives us some insight into what Apple is working on in generative AI. Second, it shows what can be accomplished as we start to layer LLMs with other technologies.

References:

[2309.17102] Guiding Instruction-based Image Editing via Multimodal Large Language Models (arxiv.org)