VALSE Paper Quick Look, Issue 207: Configuring Good In-Context Sequences for Visual Question Answering

2025-2-18 19:31 | Posted by: 程一-计算所

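To help practitioners in vision and learning keep up with the latest developments and frontier techniques in the field, VALSE has launched the "Paper Quick Look" column, which publishes one or two recorded videos per week, each giving a detailed walkthrough of a single recent paper from a top conference or journal. This issue features work on In-Context Learning from Southeast University. The work was supervised by Associate Professor Xu Yang, and the video was recorded by the paper's first author, Li Li.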

Paper title:

How to Configure Good In-Context Sequence for Visual Question Answering

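Authors:

Li Li (Southeast University), Jiawei Peng (Southeast University), Huiyi Chen (Southeast University), Chongyang Gao (Northwestern University), Xu Yang (Southeast University, corresponding author)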

Bilibili video link:

https://www.bilibili.com/video/BV1JsK5eyEnr/

Copy the link into a browser, or click "Read the original", to go to the viewing page.

Abstract:

Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP, researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when implementing ICL using these LVLMs, researchers usually resort to the simplest approaches, such as random sampling, to configure the in-context sequence, thus leading to sub-optimal results. To enhance the ICL performance, in this study, we use Visual Question Answering (VQA) as a case study to explore diverse in-context configurations to find the powerful ones. Additionally, through observing the changes of the LVLM outputs by altering the in-context sequence, we gain insights into the inner properties of LVLMs, improving our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2, VizWiz, and OK-VQA, we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance.
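The contrast the abstract draws between random sampling and retrieval-based configuration can be illustrated with a minimal sketch. This is not the authors' implementation: the `Demo` record, the precomputed `embedding` vectors, the cosine-similarity ranking, and the `<image:...>` prompt layout are all assumptions made for illustration only.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Demo:
    """One candidate demonstration: an (image, question, answer) triple.

    `embedding` is a hypothetical precomputed feature vector (for the image
    or the question); how it is produced is outside this sketch.
    """
    image_id: str
    question: str
    answer: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def random_demos(pool: List[Demo], k: int, seed: int = 0) -> List[Demo]:
    """Baseline: pick k demonstrations uniformly at random."""
    return random.Random(seed).sample(pool, k)


def retrieve_demos(pool: List[Demo], query_embedding: Sequence[float], k: int) -> List[Demo]:
    """Pick the k demonstrations most similar to the query embedding."""
    ranked = sorted(pool, key=lambda d: cosine(d.embedding, query_embedding), reverse=True)
    return ranked[:k]


def build_prompt(demos: List[Demo], query_question: str) -> str:
    """Lay out demonstrations followed by the unanswered query (assumed format)."""
    lines = [f"<image:{d.image_id}> Question: {d.question} Answer: {d.answer}" for d in demos]
    lines.append(f"<image:query> Question: {query_question} Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    pool = [
        Demo("img1", "What color is the bus?", "red", [1.0, 0.0]),
        Demo("img2", "How many dogs are there?", "2", [0.0, 1.0]),
        Demo("img3", "What color is the car?", "blue", [0.9, 0.1]),
    ]
    shots = retrieve_demos(pool, [1.0, 0.0], k=2)
    print(build_prompt(shots, "What color is the truck?"))
```

In this toy pool the two color questions rank above the counting question for a color-like query, whereas `random_demos` may return any pair. Which similarity signal to use and how to order or modify the retrieved demonstrations are exactly the kinds of choices the paper studies; the sketch fixes them arbitrarily.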

Reference:

[1] Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang, "How to Configure Good In-Context Sequence for Visual Question Answering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Paper link:

[https://arxiv.org/abs/2312.01571]

Code link:

[https://github.com/GaryJiajia/OFv2_ICL_VQA]

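About the speaker:

Li Li is a master's student at the School of Software Engineering, Southeast University, advised by Associate Professor Xu Yang. Research interests include multimodal learning, in-context learning, and large-model generation.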

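Special thanks to the main organizer of this Paper Quick Look issue:

Monthly rotating AC: Guangwei Gao (Nanjing University of Posts and Telecommunications)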
