为了使得视觉与学习领域相关从业者快速及时地了解领域的最新发展动态和前沿技术进展,VALSE最新推出了《论文速览》栏目,将在每周发布一至两篇顶会顶刊论文的录制视频,对单个前沿工作进行细致讲解。本期VALSE论文速览选取了来自浙江大学计算机学院的零样本描述(Zero-Shot Captioning)方面的工作。该工作由朱霖潮老师和杨易老师指导,论文一作李威同学录制。 论文题目:通过纯文本训练解码CLIP隐空间的零样本描述方法 作者列表: Wei Li (Zhejiang University), Linchao Zhu (Zhejiang University), Longyin Wen (ByteDance Inc), Yi Yang (Zhejiang University) B站观看网址: 论文摘要: Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zeroshot transfer capability in many discriminative tasks, e.g., image classification. Their adaptation to zero-shot image-conditioned text generation tasks has drawn increasing interest. Prior arts approach to zero-shot captioning by either utilizing the existing large language models (e.g., GPT-2) or pre-training the encoder-decoder network in an end-to-end manner. However, the large language models may not generate sensible descriptions due to the task discrepancy between captioning and language modeling, while the end-to-end pre-training requires paired data and extensive computational resources. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder. This decoder is both data-efficient and computation-efficient: 1) it only requires the text data for training, easing the burden on the collection of paired data. 2) it does not require end-to-end training. When trained with text-only data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix embedding. The challenge is that the decoder is trained on the text corpus but at the inference stage, it needs to generate captions based on visual inputs. Though the CLIP text embedding and the visual embedding are correlated, the modality gap issue is widely observed in multi-modal contrastive models that prevents us from directly taking the visual embedding as the prefix embedding. We propose a training-free mechanism to reduce the modality gap. We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input. Taking the projected embedding as the prefix embedding, the decoder generates high-quality descriptions that match the visual input. The experiments show that DeCap outperforms other zero-shot captioning methods and unpaired captioning methods by a large margin on the typical image captioning benchmarks, i.e., MSCOCO and NoCaps. We apply DeCap to video captioning and achieve stateof-the-art zero-shot performance on MSR-VTT and ActivityNet-Captions. 论文信息: [1] Wei Li, Linchao Zhu, Longyin wen and Yi Yang. DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training. ICLR 2023 论文链接: [https://arxiv.org/pdf/2303.03032.pdf] 代码链接: [https://github.com/dhg-wei/DeCap] 视频讲者简介: 李威,浙江大学计算机学院博士生,主要研究方向是视觉语言多模态。 特别鸣谢本次论文速览主要组织者: 月度轮值AC:杨旭 (西安电子科技大学) 季度轮值AC:叶茫 (武汉大学) 活动参与方式 1、VALSE每周举行的Webinar活动依托B站直播平台进行,欢迎在B站搜索VALSE_Webinar关注我们! 直播地址: https://live.bilibili.com/22300737; 历史视频观看地址: https://space.bilibili.com/562085182/ 2、VALSE Webinar活动通常每周三晚上20:00进行,但偶尔会因为讲者时区问题略有调整,为方便您参加活动,请关注VALSE微信公众号:valse_wechat 或加入VALSE QQ S群,群号:317920537); *注:申请加入VALSE QQ群时需验证姓名、单位和身份,缺一不可。入群后,请实名,姓名身份单位。身份:学校及科研单位人员T;企业研发I;博士D;硕士M。 3、VALSE微信公众号一般会在每周四发布下一周Webinar报告的通知。 4、您也可以通过访问VALSE主页:http://valser.org/ 直接查看Webinar活动信息。Webinar报告的PPT(经讲者允许后),会在VALSE官网每期报告通知的最下方更新。 |
小黑屋|手机版|Archiver|Vision And Learning SEminar
GMT+8, 2024-10-31 14:19 , Processed in 0.020836 second(s), 14 queries .
Powered by Discuz! X3.4
Copyright © 2001-2020, Tencent Cloud.