Efficient Parameter Gradient Projection For Continual Learning

East China Normal University
*Equal contribution, Corresponding author

News

This is a unified work on resisting forgetting in various parameter-efficient continual learning settings; the corresponding paper will be released soon.

We created this website to present our work (2024/3/3).

Our previous work 'Prompt Gradient Projection for Continual Learning, ICLR, 2024' has been accepted as an ICLR 2024 Spotlight!💐💐

Latest update: 2024/3/3.

Abstract

Parameter-efficient tuning (PET) has demonstrated impressive performance in continual learning by adding a small set of scalable extra parameters that are independent of the encoder. With only tiny trainable fine-tuning parameters on top of a frozen pre-trained encoder, overall performance is greatly improved, yet the setting remains under-explored because of its novel forgetting mechanism.

However, recent progress has mainly focused on designing efficient fine-tuning paradigms while ignoring how forgetting arises in PET continual learning, let alone establishing anti-forgetting criteria. Moreover, the unresolved trade-off between learning new information and protecting old knowledge further exacerbates these challenges.

This paper presents Efficient Parameter Gradient Projection (EPGP), which combines various PET paradigms with orthogonal gradient projection and theoretically deduces that an orthogonality condition on the gradients can effectively resist forgetting in continual learning. The condition is applicable to all PET continual learning methods.

Uniquely, EPGP is the first unified method to provide an anti-forgetting mechanism with a mathematical demonstration for different tuning paradigms. Additionally, by conducting Singular Value Decomposition (SVD) to obtain the gradient projection matrix, EPGP is proved to be the optimal solution for balancing the trade-off between plasticity and stability in PET continual learning methods.

We extensively evaluate our method with different backbones on diverse datasets. Experiments demonstrate its effectiveness in reducing forgetting under class-incremental, online class-incremental, domain-incremental, and task-incremental settings for uni-modal and cross-modal models, as well as under instruction-incremental learning for cross-modal models.
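
As a rough illustration of the SVD step mentioned in the abstract, the following PyTorch sketch (a simplified example, not our released code; the helper names build_projection_basis and project_out are hypothetical) shows how an orthonormal basis of the old-task feature subspace can be obtained and how a gradient can be projected onto its orthogonal complement.

    # A minimal sketch, assuming features of previous tasks are collected from
    # the frozen pre-trained encoder; not the released implementation.
    import torch

    def build_projection_basis(old_features: torch.Tensor, energy: float = 0.95) -> torch.Tensor:
        """old_features: (num_samples, dim). Returns an orthonormal basis M of
        shape (dim, k) spanning the subspace that captures `energy` of the
        feature variance."""
        # SVD of the transposed feature matrix: the columns of U span the feature space.
        U, S, _ = torch.linalg.svd(old_features.T, full_matrices=False)
        ratio = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
        k = int((ratio < energy).sum().item()) + 1  # smallest k reaching the energy threshold
        return U[:, :k]

    def project_out(grad: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
        """Remove the component of `grad` (shape (..., dim)) lying in span(M),
        so the resulting update is orthogonal to the old-task feature subspace."""
        return grad - grad @ M @ M.T

Under this sketch, M would be refreshed at the end of each task by running the frozen encoder over that task's data and recomputing the SVD.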

Efficient Tuning Paradigms

Pipeline of distinct parameter-efficient tuning paradigms

Visualization of distinct parameter-efficient tuning paradigms

For the various parameter-efficient tuning paradigms, to better preserve old knowledge, we propose that the network update should satisfy the following theorems.

Prompt-Tuning:

Theorem 1.

Prefix-Tuning:

Theorem 2.


Adapter-Tuning:

Theorem 3.

LoRA-Tuning:

Theorem 4.
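
To make the four paradigms concrete, the schematic sketch below (our own simplified listing with assumed shapes, not the paper's implementation) shows the extra trainable parameters that each paradigm adds on top of the frozen encoder; the theorems above constrain how these parameters may be updated across tasks.

    # A schematic sketch with assumed shapes (hidden size d, bottleneck/rank r,
    # prompt length n); not the paper's implementation.
    import torch
    import torch.nn as nn

    d, r, n = 768, 8, 10

    # Prompt-Tuning: learnable tokens prepended to the input sequence.
    prompt = nn.Parameter(torch.zeros(n, d))

    # Prefix-Tuning: learnable keys and values inserted into each attention layer.
    prefix_k = nn.Parameter(torch.zeros(n, d))
    prefix_v = nn.Parameter(torch.zeros(n, d))

    # Adapter-Tuning: a small bottleneck MLP added after a frozen block.
    adapter = nn.Sequential(nn.Linear(d, r), nn.ReLU(), nn.Linear(r, d))

    # LoRA-Tuning: a low-rank update A @ B added to a frozen weight matrix.
    lora_A = nn.Parameter(torch.zeros(d, r))
    lora_B = nn.Parameter(torch.zeros(r, d))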


Forgetting-resist Deductions

Prompt-based Gradient Projection


To achieve Theorem 1, i.e., the anti-forgetting condition, the new prompts are required to satisfy:

Therefore, we reach our key observation: restricting the gradients of the prompts with the following equations realizes anti-forgetting:

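As a rough illustration of this observation (a sketch under the simplifying assumption that the condition reduces to keeping the prompt update orthogonal to the subspace spanned by old-task input embeddings; the exact equations are given in the paper), the prompt gradient can be projected as follows, with M obtained by the hypothetical build_projection_basis helper sketched after the abstract.

    # A minimal sketch; assumes M is a (d, k) orthonormal basis of old-task
    # input embeddings, and prompt is the (n, d) learnable token matrix.
    import torch

    def project_prompt_grad(prompt: torch.nn.Parameter, M: torch.Tensor) -> None:
        """Applied in place after loss.backward() and before optimizer.step():
        removes the gradient component lying in span(M)."""
        if prompt.grad is not None:
            prompt.grad -= prompt.grad @ M @ M.T
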
Prefix-based Gradient Projection


To achieve Theorem 2, i.e., the anti-forgetting condition, the new prefixes are required to satisfy:

Therefore, we reach our key observation: restricting the gradients of the prefixes with the following equations realizes anti-forgetting:

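A comparable sketch for prefix tuning, again under the simplifying assumption that the condition constrains the prefix key and value updates to be orthogonal to the old-task feature subspace; the precise equations follow Theorem 2 and are given in the paper.

    # A minimal sketch under the same assumption as the prompt case;
    # not the paper's implementation.
    import torch

    def project_prefix_grads(prefix_k: torch.nn.Parameter,
                             prefix_v: torch.nn.Parameter,
                             M: torch.Tensor) -> None:
        """prefix_k, prefix_v: (n, d) learnable keys/values; M: (d, k) basis of
        old-task features. Called after loss.backward()."""
        for p in (prefix_k, prefix_v):
            if p.grad is not None:
                p.grad -= p.grad @ M @ M.T
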
Adapter-based Gradient Projection


To achieve Theorem 3, i.e., the anti-forgetting condition, the new adapter parameters are required to satisfy:

Therefore, we reach our key observation: restricting the gradients of the adapter weights with the following equations realizes anti-forgetting:

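For adapter tuning, the sketch below illustrates one piece of the idea under our own simplifying assumption: projecting the gradient of the adapter's down-projection weight so that its update does not change the down-projection's output on old-task inputs. The complete condition, covering all adapter weights, is given by Theorem 3.

    # A minimal sketch; w_down is the (r, d) weight of the adapter's first
    # Linear layer, M a (d, k) basis of the old-task features entering it.
    import torch

    def project_adapter_grad(w_down: torch.nn.Parameter, M: torch.Tensor) -> None:
        """Each row of the gradient loses its component inside span(M), so the
        update of the down-projection leaves its output on span(M) unchanged."""
        if w_down.grad is not None:
            w_down.grad -= w_down.grad @ M @ M.T
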
LoRA-based Gradient Projection


To achieve Theorem 4, i.e., the anti-forgetting condition, the new LoRA parameters are required to satisfy:

Therefore, we reach our key observation: restricting the gradients of the LoRA matrices with the following equations realizes anti-forgetting:

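Finally, for LoRA tuning, the sketch below assumes, as a simplification, that the factor multiplying the input is constrained so that its update is orthogonal to the old-task input subspace; the full condition, covering both low-rank factors, is given by Theorem 4.

    # A minimal sketch for LoRA (output = x @ W + x @ A @ B); lora_A is the
    # (d, r) factor applied to the input, M a (d, k) basis of old-task inputs.
    import torch

    def project_lora_grad(lora_A: torch.nn.Parameter, M: torch.Tensor) -> None:
        """After projection the update dA satisfies M.T @ dA == 0, so
        x_old @ (A + dA) == x_old @ A whenever x_old lies in span(M)."""
        if lora_A.grad is not None:
            lora_A.grad -= M @ (M.T @ lora_A.grad)
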
Experimental Results

Class Incremental Learning


T-SNE results of prompt and prompt-gp on the 10-Split-CIFAR100 dataset with a ViT backbone. The left column represents prompt, and the right column represents prompt-gp. The red circles mark the drawback of prompt tuning, and the blue circles show the improvement brought by our method.


Online Incremental Learning


Online class-incremental learning results of the prefix/prompt tuning paradigms with a ViT backbone.


To-Do List

  • Domain incremental learning on ViT
  • Instruction incremental learning on BLIP-2

BibTeX

    @inproceedings{qiao2024prompt,
      author    = {Jingyang Qiao and Zhizhong Zhang and Xin Tan and Chengwei Chen and Yanyun Qu and Yong Peng and Yuan Xie},
      title     = {Prompt Gradient Projection for Continual Learning},
      booktitle = {The Twelfth International Conference on Learning Representations},
      year      = {2024},
    }