Multimodal Large Language Continual Assistant

1East China Normal University, 2Tencent YouTu Lab, 3Xiamen University,
Corresponding author

This Website has been built on 2024/9/22

Abstract

Multimodal Continual Instruction Tuning (MCIT) is adopted to continually instruct Multimodal Large Language Models (MLLMs) to follow human intent in sequential tasks.

It meets two major challenges: catastrophic forgetting (inferior performance on seen task) and zero-shot degradation (inferior performance on unseen task).

In this paper, we propose a method called Multimodal Large Language Continual Assistant (LLaCA) to address the challenges.

Based on exponential movement average (EMA), we construct a set of equations, aiming to equip our method with excellent anti-forgetting ability. By solving the equation set, we obtain a theoretical derivation for the dynamic adaption weight of EMA.

Through extensive experiments on LLaVA-1.5 in continual visual question answering benchmark, compared with baseline, our approach not only highly improves anti-forgetting ability (from 22.67 to 2.68) but also significantly promotes zero-shot performance (from 6.68 to 47.55).

Model Flowchart

Rendered scene in Blender

General Multiple Rounds of Conversation Cases

Case 1: Extreme Ironing
Rendered scene in Blender

LLaVA-LoRA (After continually fine-tuned on six datasets)

Human: Can you describe the image?

Assistant: In the image, a man is standing on the back of a yellow truck, holding a clothesline with clothes hanging on it. He appears to be washing clothes in the middle of a busy street. The truck is driving down the road, and there are other vehicles around, including a taxi and a car.

There are also several traffic lights visible in the scene, indicating that the area is likely a busy intersection. The man is focused on his task, despite the bustling environment around him.

Human: Do you think there is anything that goes against human intuition?

Assistant: Yes, the image of a man washing clothes in the middle of a busy street goes against human intuition. Typically, people would expect to see someone washing clothes in a more private or secluded area, such as a backyard or a laundromat. However, in this case, the man has chosen to wash clothes in the middle of a busy street, which is an unusual and unconventional location for such an activity.

LLaCA-ours (After continually fine-tuned on six datasets)

Human: Can you describe the image?

Assistant: In the image, a man is standing on a folding chair in the back of a yellow taxi cab, ironing clothes. The taxi is driving down a busy city street, surrounded by other vehicles such as a car and a truck. The man appears to be focused on his task, possibly trying to dry his clothes quickly in the moving vehicle.

Human: Do you think there is anything that goes against human intuition?

Assistant: Yes, the image of a man ironing clothes in the back of a taxi cab is unusual and goes against human intuition. It is not a typical scenario one would expect to see, as ironing clothes is usually done in a more comfortable and controlled environment, such as a home or a laundry room. The man's decision to iron clothes in the back of a taxi cab suggests that he is trying to save time or that he might be in a hurry.

LLaVA-original (w/o fine-tuning)

Human: Can you describe the image?

Assistant: The image captures a busy city street scene with a man standing on a folding chair, holding a blue shirt and ironing clothes. He is positioned near the center of the scene, with a yellow car driving by in the background.

There are multiple vehicles on the street, including a taxi cab and a truck. The taxi is located on the left side of the scene, while the truck is further down the street, closer to the right side. The man is focused on his task, and the overall atmosphere suggests a bustling urban environment.

Human: Do you think there is anything that goes against human intuition?

Assistant: Yes, there are several examples of things that go against human intuition. Some of these include:

1. The illusion of motion: When an object moves quickly, our brain perceives it as a continuous motion, even though it is actually a series of discrete positions. This illusion is responsible for the motion blur effect in photographs.

2. The perception of time: Time is a relative concept, and our perception of it can be influenced by various factors, such as the speed of light, the rotation of the Earth, and the position of the observer relative to the Earth's surface.

3. The apparent size of objects: The size of objects can be distorted due to factors like distance, angle, and perspective. For example, when looking at a distant object, it appears larger than it actually is, and this effect is known as the "perspective illusion."

4. The perception of color: Color is a subjective experience, and our perception of it can be influenced by factors like lighting, the presence of colorants, and the specific wavelengths of light that are being reflected or emitted.

5. The perception of depth: Our brain creates a sense of depth by combining information from multiple sources, such as the position of objects relative to each other and the distance between them. This process can lead to misinterpretations of depth, as seen in the famous "duck-rabbit" optical illusion.

These examples demonstrate how our perceptions can be influenced by various factors and can lead to misunderstandings or misinterpretations of the world around us.

Case2: Funny Image
Rendered scene in Blender

LLaVA-LoRA (After continually fine-tuned on six datasets)

Human: What is the name of the man in the photo?

Assistant: the man in the photo is named elon musk

Human: Where is the interesting aspect of this picture?

Assistant: the interesting aspect of this picture is the man holding a dog

Human: What is Elon Musk holding in his hand?

Assistant: elon musk is holding a dog

Human: What kind of dog is in elon's hand?

Assistant: a puppy

LLaCA-ours (After continually fine-tuned on six datasets)

Human: What is the name of the man in the photo?

Assistant: Elon Musk

Human: Where is the interesting aspect of this picture?

Assistant: Elon Musk's face

Human: What is Elon Musk holding in his hand?

Assistant: puppy

Human: What kind of Puppy is in Elon's hand?

Assistant: Shiba Inu

LLaVA-original (w/o fine-tuning)

Human: What is the name of the man in the photo?

Assistant: The man in the photo is named Elon Musk.

Human: Where is the interesting aspect of this picture?

Assistant: The interesting aspect of this picture is that it is a cartoon of a man holding a dog, and the man is actually Elon Musk, a real person.

Human: What is Elon Musk holding in his hand?

Assistant: Elon Musk is holding a small dog in his hand.

Human: What kind of dog is in elon's hand?

Assistant: The dog in Elon's hand is a small puppy.

Case3: Food Cooking
Rendered scene in Blender

LLaVA-LoRA (After continually fine-tuned on six datasets)

Human: What are the meals that I can cook with these?

Assistant: With the items in the refrigerator, you can prepare a variety of meals. For example, you can make a fruit salad using the strawberries and blueberries. You can also use the carrots to prepare a healthy vegetable stir-fry, a soup, or a salad. The milk can be used to make a smoothie, a sauce, or a creamy soup. The eggs can be used to make a fried egg, an omelette, or a baked dish. Additionally, you can use the cheese to make a pizza, a grilled cheese sandwich, or a cheese-based sauce. The yogurt can be used to make a yogurt parfait, a dip, or a sauce. The butter can be used to cook or bake various dishes, such as bread or pastries. The ice cream can be used to make a dessert or a topping for other dishes.

Human: I would like to do the fruit salad. Show me the recipe.

Assistant: Sure, here's a simple recipe for a fruit salad:

Ingredients:

* 1 cup strawberries, hulled and sliced

* 1 cup blueberries, rinsed and drained

* 1/2 cup raspberries, rinsed and drained

* 1/2 cup grapes, seedless

* 1/2 cup cantaloupe, diced

* 1/4 cup honey or sugar (optional)

* 1/4 cup lemon juice (optional)

* 1/4 cup chopped fresh mint leaves (optional)

Instructions:

1. In a large bowl, combine the strawberries, blueberries, raspberries, grapes, and cantaloupe.

2. If desired, add honey or sugar to taste.

3. If desired, add lemon juice to the bowl for a tangy flavor.

4. Gently mix the fruit together.

5. Garnish with chopped fresh mint leaves, if desired.

Enjoy your fruit salad!

LLaCA-ours (After continually fine-tuned on six datasets)

Human: What are the meals that I can cook with these?

Assistant: With the variety of fruits, vegetables, and milk available in the refrigerator, you can prepare a range of healthy and delicious meals. Some ideas include:

1. Fruit salad: Combine the strawberries, blueberries, and any other fruits available in the refrigerator. You can add a squeeze of lemon juice and a sprinkle of sugar for a simple, refreshing dessert.

2. Vegetable stir-fry: Chop up the carrots and any other vegetables available in the refrigerator. Heat some oil in a pan and stir-fry the vegetables with your choice of protein (e.g., tofu, chicken, or beef) and a flavorful sauce.

3. Smoothie: Blend the milk with the blueberries, strawberries, and any other fruits available in the refrigerator. You can also add some ice and a scoop of protein powder for a nutritious and delicious smoothie.

4. Vegetable soup: Use the carrots and any other vegetables available in the refrigerator to make a vegetable soup. You can add some milk or a vegetable broth for added flavor and nutrition.

5. Omelette: Whisk eggs with milk and season with salt and pepper. Cook the omelette in a pan and add the carrots and any other vegetables available in the refrigerator for a nutritious and filling breakfast.

These are just a few examples of the many meals you can prepare using the ingredients available in the refrigerator.

Human: I would like to do the fruit salad. Show me the recipe.

Assistant: Sure, here's a simple recipe for a fruit salad using the ingredients available in the refrigerator:

Ingredients:

* Strawberries

* Blueberries

* Carrots (optional)

* Milk

* Lemon juice (optional)

* Sugar (optional)

Instructions:

1. Wash the strawberries and blueberries thoroughly.

2. Chop the strawberries into bite-sized pieces.

3. Wash the blueberries and chop them into bite-sized pieces.

4. If you'd like to add carrots, wash and chop them as well.

5. In a large bowl, combine the chopped strawberries, blueberries, and carrots (if using).

6. Pour the milk over the fruit mixture and stir gently.

7. If you'd like to add a little sweetness, squeeze a lemon wedge over the fruit mixture and sprinkle with sugar.

8. Serve the fruit salad immediately or refrigerate for a few hours to allow the flavors to meld together.

Enjoy your delicious and refreshing fruit salad!

LLaVA-original (w/o fine-tuning)

Human: What are the meals that I can cook with these?

Assistant: With the variety of food items in the refrigerator, you can prepare a range of meals. Some potential meals include:

1. Strawberry and blueberry smoothies or fruit salads, using the strawberries and blueberries.

2. A vegetable stir-fry, using the carrots and any other vegetables available.

3. A pasta dish, using the milk and any other ingredients you might have.

4. A sandwich or wrap, using the bread, meat, and any other ingredients you prefer.

These are just a few examples of the meals you can create using the items in the refrigerator. The possibilities are vast, and you can get creative with the ingredients to make delicious and nutritious meals.

5. A salad, using the lettuce and any other vegetables you have.

Human: I would like to do the fruit salad. Show me the recipe.

Assistant: Certainly! Here's a simple recipe for a fruit salad:

Ingredients:

- 1 cup strawberries, sliced

- 1 cup blueberries, rinsed and drained

- 1 cup grapes, halved

- 1/2 cup pineapple chunks

- 1/2 cup orange slices

- 1/2 cup apple slices

- 1/4 cup walnuts (optional)

- 1/4 cup honey (optional)

Instructions:

1. In a large bowl, combine the strawberries, blueberries, grapes, pineapple, and orange slices.

2. If desired, add walnuts and honey for extra flavor and texture.

3. Mix well to combine the fruits and nuts.

4. Cover and refrigerate the fruit salad for at least 30 minutes to allow the flavors to meld together.

5. Serve chilled or at room temperature.

Enjoy your delicious fruit salad!

Case4: Website Making
Rendered scene in Blender

LLaVA-LoRA (After continually fine-tuned on six datasets)

Human: Please turn this mock-up into an interactive website using html/js, and replace the joke with a real joke.

Assistant: I'm sorry, but I am not able to create an interactive website using html/js. However, I can provide you with a real joke:

Why was the math book sad?

Because it had too many problems.

LLaCA-ours (After continually fine-tuned on six datasets)

Human: Please turn this mock-up into an interactive website using html/js, and replace the joke with a real joke.

Assistant: Sure, I can help you with that! Here's a simple HTML/JS mock-up of the joke website: My Joke Website

My Joke Website

Funny Joke

Push to reveal punchline

This is a simple mock-up of the joke website, which includes a button to reveal the punchline. The punchline is stored in the `punchline` element, which is hidden by default. When the button is clicked, the `revealPunchline()` function is called, which sets the text of the `punchline` element to "Why did the tomato turn red?".

LLaVA-original (w/o fine-tuning)

Human: Please turn this mock-up into an interactive website using html/js, and replace the joke with a real joke.

Assistant: To create an interactive website using HTML/JS, you can follow these steps:

1. Create a new HTML file and add the necessary tags, such as `head`, `body`, and `div`.

2. Add a `div` element with the class "joke-box" to the body of the HTML file.

3. Inside the "joke-box" div, add a button element with the class "joke-button".

4. Add a click event listener to the button element using JavaScript.

5. Inside the event listener, replace the joke with a real joke, such as the one below:

"Why was the math book sad? Because it had too many problems."

1. Save the HTML file and open it in a web browser to test the interactive joke website.

Here's the code for the joke website:

My Joke Website

Specific Single Round of Conversation Cases

Rendered scene in Blender

Case1: Model trained from ScienceQA, TextVQA, GQA, VizWiz, VQAv2 to OCRVQA

test on ScienceQA

Rendered scene in Blender

Case2: Model trained from ScienceQA, TextVQA, GQA, VizWiz, VQAv2 to OCRVQA

test on GQA

Experimental Results

Rendered scene in Blender Rendered scene in Blender Rendered scene in Blender

BibTeX

@article{qiao2025,
  author    = {Jingyang Qiao and Zhizhong Zhang and Xin Tan and Yanyu Qu and Yuan Xie},
  title     = {Multimodal Large Language Continual Assistant},
  journal   = {Preprint},
  year      = {2025},
}