Towards Compatible Fine-tuning for Vision-Language Model Updates

May 7, 2025

< Summary (English) >

This paper investigates the compatibility of efficient fine-tuning methods on vision-language models during model upgrades.
The study reveals that many high-performing fine-tuning methods fail to be compatible with updated models.
To address this issue, a novel approach called Class-conditioned Context Optimization (ContCoOp) is proposed, which integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder.
ContCoOp achieves the highest compatibility over baseline methods and exhibits robust out-of-distribution generalization.
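
The class-conditioning step described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that the learnable prompts act as attention queries, that the class-name token embeddings act as keys and values, and that the attended vector is added back to each prompt. The summary only states that an attention layer combines the two before the text encoder. All function names, the identity projection matrices, and the toy embeddings are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (d x d) matrix, stored as a list of rows, by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def identity(dim):
    """Identity matrix, used here as a placeholder for learned projections."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]


def class_conditioned_prompts(prompts, class_tokens, w_q, w_k, w_v):
    """Let every learnable prompt vector attend over the class-name token embeddings.

    Assumed design (not confirmed by the summary): prompts are queries, class
    tokens are keys and values, and the attended vector is added residually.
    """
    dim = len(class_tokens[0])
    keys = [matvec(w_k, tok) for tok in class_tokens]
    values = [matvec(w_v, tok) for tok in class_tokens]
    conditioned = []
    for prompt in prompts:
        query = matvec(w_q, prompt)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
        weights = softmax(scores)
        attended = [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dim)]
        conditioned.append([p + a for p, a in zip(prompt, attended)])
    return conditioned


def build_text_encoder_input(prompts, class_tokens, w_q, w_k, w_v):
    """Sequence handed to the frozen text encoder: conditioned prompts, then class tokens."""
    return class_conditioned_prompts(prompts, class_tokens, w_q, w_k, w_v) + class_tokens


dim = 4
rng = random.Random(0)
# Three learnable context vectors with a small random initialisation.
prompts = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(3)]
# Identity projections keep the toy example easy to follow; in practice they would be trained.
w_q, w_k, w_v = identity(dim), identity(dim), identity(dim)

# Toy stand-ins for the same class name embedded by the old and the upgraded model.
old_class_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
new_class_tokens = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

old_input = build_text_encoder_input(prompts, old_class_tokens, w_q, w_k, w_v)
new_input = build_text_encoder_input(prompts, new_class_tokens, w_q, w_k, w_v)
print(old_input[0])
print(new_input[0])
```

With the same prompt parameters, the two calls produce different prompt vectors because the prompts are recomputed from whichever class embeddings the current model supplies. The paper credits this dependence on the embedding space for the prompts staying effective after an update; the sketch shows only the dependence, not the resulting accuracy.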

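< Summary (translated from Korean) >

This paper investigates how compatible efficient fine-tuning methods remain when a vision-language model is upgraded.
The study finds that many high-performing efficient fine-tuning methods are not compatible with the upgraded model.
To address this, it proposes a new approach called Class-conditioned Context Optimization (ContCoOp).
ContCoOp uses an attention layer to combine learnable prompts with class embeddings before they are fed into the text encoder.
ContCoOp achieves higher compatibility than the baseline methods and generalizes robustly to out-of-distribution data.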

< Technical Terms >

* Vision-Language Model: an AI model that jointly processes images and text, such as CLIP.
* Fine-tuning: further training a pre-trained model to adapt it to a new dataset or task. The efficient fine-tuning studied here freezes the foundation model and trains only lightweight plug-and-play modules.
* Compatibility: in this paper, whether a module fine-tuned on one version of a foundation model remains effective after that model is upgraded.
* Class-conditioned Context Optimization (ContCoOp): the approach proposed in this paper, which combines learnable prompts with class embeddings through an attention layer before feeding them into the text encoder.

< Related Papers or Resources >

[1] “Towards Compatible Fine-tuning for Vision-Language Model Updates” by Zhengbo Wang, Jian Liang, Lijun Sheng, Ran He, Zilei Wang, Tieniu Tan (arXiv:2412.20895v1)
[2] “Efficient Fine-tuning for Vision-Language Models” by Zhou et al. (https://arxiv.org/abs/2203.12766)

< Excerpt (English) >

Towards Compatible Fine-tuning for Vision-Language Model Updates

Zhengbo Wang1,2, Jian Liang2,3 ∗, Lijun Sheng1,2, Ran He2,3, Zilei Wang1, Tieniu Tan2,3,4
1 University of Science and Technology of China
2 NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
3 School of Artificial Intelligence, University of Chinese Academy of Sciences
4 Nanjing University
zhengbowang@mail.ustc.edu.cn, liangjian92@gmail.com
∗Correspondence to: Jian Liang (liangjian92@gmail.com)
arXiv:2412.20895v1 [cs.CV] 30 Dec 2024

Abstract

So far, efficient fine-tuning has become a popular strategy for enhancing the capabilities of foundation models on downstream tasks by learning plug-and-play modules. However, existing methods overlook a crucial issue: if the underlying foundation model is updated, are these plug-and-play modules still effective? In this paper, we first conduct a detailed analysis of various fine-tuning methods on the CLIP in terms of their compatibility with model updates. The study reveals that many high-performing fine-tuning methods fail to be compatible with the upgraded models. To address this, we propose a novel approach, Class-conditioned Context Optimization (ContCoOp), which integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder. Consequently, the prompts can dynamically adapt to the changes in embedding space (due to model updates), ensuring continued effectiveness. Extensive experiments over 15 datasets show that our ContCoOp achieves the highest compatibility over the baseline methods, and exhibits robust out-of-distribution generalization.

1 Introduction

In the current era, foundation models [Radford et al., 2021, Kenton and Toutanova, 2019, Brown et al., 2020, Rombach et al., 2022, Caron et al., 2021] have emerged as the cornerstone of the field of deep learning. Through pre-training on exceptionally large datasets, these models demonstrate remarkable zero-shot capabilities and generalization, rendering them extensively employed across various domains.

Efficient fine-tuning has emerged as a prominent area of research in the context of large foundation models [Zhou et al., 2022b,a, Hu et al., 2022, Li and Liang, 2021, Houlsby et al., 2019]. By freezing the parameters of foundation models, these approaches train lightweight plug-and-play modules to quickly and cost-effectively adapt the model to downstream tasks, such as learning residual matrices [Hu et al., 2022, Dettmers et al., 2023] or additional learnable prompts [Zhou et al., 2022b, Li and Liang, 2021, Houlsby et al., 2019]. Consequently, by maintaining a frozen large model alongside corresponding lightweight modules, we can low-costly apply the foundation model to thousands of downstream tasks.

The existing efficient fine-tuning methods neglect a crucial problem. To improve model performance or attain a safety alignment [Ouyang et al., 2022, Sun et al., 2023, Xu et al., 2023, Touvron et al., 2023, Chiang et al., 2023, Rombach et al., 2022], the foundational models at the core are often updated, such as the transition from GPT-3 to ChatGPT, CLIP to EVA-CLIP, and the series of versions in stable diffusion. However, prior to model updates, we have trained a variety of plug-and-play modules on the current version of the foundation model. This raises a significant question: Can these efficient fine-tuning modules be compatible with the upgraded foundation model? Given the substantial…

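< Translation (from the Korean) >

Towards Compatible Fine-tuning for Vision-Language Model Updates. Zhengbo Wang1,2, Jian Liang2,3 ∗, Lijun Sheng1,2, Ran He2,3, Zilei Wang1, Tieniu Tan2,3,4. 1 University of Science and Technology of China, 2 NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences, 3 School of Artificial Intelligence, University of Chinese Academy of Sciences, 4 Nanjing University. zhengbowang@mail.ustc.edu.cn, liangjian92@gmail.com
Abstract. Efficient fine-tuning has so far become a popular strategy for improving the capabilities of foundation models on downstream tasks by learning plug-and-play modules.
However, existing methods overlook an important issue.
If the underlying foundation model is updated, are these plug-and-play modules still effective?
In this paper, we first carry out a detailed analysis of various fine-tuning methods on CLIP in terms of their compatibility with model updates.
The study shows that many high-performing fine-tuning methods are not compatible with the upgraded models.
To address this, we propose Class-conditioned Context Optimization (ContCoOp), a new approach that uses an attention layer to integrate learnable prompts with class embeddings before they are fed into the text encoder.
As a result, the prompts can adapt dynamically to changes in the embedding space (caused by model updates), which ensures that they remain effective.
Extensive experiments over 15 datasets show that ContCoOp achieves the highest compatibility among the baseline methods and exhibits robust out-of-distribution generalization.
1 Introduction. In the current era, foundation models [Radford et al., 2021, Kenton and Toutanova, 2019, Brown et al., 2020, Rombach et al., 2022, Caron et al., 2021] have emerged as the cornerstone of the field of deep learning.
Through pre-training on exceptionally large datasets, these models show remarkable zero-shot capabilities and generalization, so they are used extensively across many domains.
Efficient fine-tuning has emerged as a prominent research area in the context of large foundation models [Zhou et al., 2022b,a, Hu et al., 2022, Li and Liang, 2021, Houlsby et al., 2019].
By freezing the parameters of the foundation model, these approaches train lightweight plug-and-play modules that adapt the model to downstream tasks quickly and cost-effectively, for example by learning residual matrices [Hu et al., 2022, Dettmers et al., 2023] or additional learnable prompts [Zhou et al., 2022b, Li and Liang, 2021, Houlsby et al., 2019].
Consequently, by keeping one frozen large model together with the corresponding lightweight modules, we can apply the foundation model to thousands of downstream tasks at low cost.
Existing efficient fine-tuning methods neglect an important problem.
To improve model performance or achieve safety alignment [Ouyang et al., 2022, Sun et al., 2023, Xu et al., 2023, Touvron et al., 2023, Chiang et al., 2023, Rombach et al., 2022], the foundation models at the core are often updated, as in the transition from GPT-3 to ChatGPT, from CLIP to EVA-CLIP, and across the series of Stable Diffusion versions.
However, before the model is updated, a variety of plug-and-play modules have already been trained on the current version of the foundation model.
This raises an important question: can these efficient fine-tuning modules be compatible with the upgraded foundation model?
Given the substantial…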

Source: arXiv
