VALSE Paper Digest No. 90: Learning to Answer Questions in Dynamic Audio-Visual Scenarios

2022-7-29 15:14 | Posted by: 程一 (计算所)

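To help practitioners in vision and learning keep up with the latest developments and frontier techniques in the field, VALSE has launched the Paper Digest (《论文速览》) column. Each week it releases one or two recorded videos, each giving a detailed walkthrough of a single recent paper from a top conference or journal. This issue features work on audio-visual scene understanding from the Gaoling School of Artificial Intelligence, Renmin University of China. The video was recorded by the paper's first author, PhD student Guangyao Li.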

Paper title: Learning to Answer Questions in Dynamic Audio-Visual Scenarios

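Authors: Guangyao Li (Renmin University of China), Yake Wei (Renmin University of China), Yapeng Tian (University of Rochester), Chenliang Xu (University of Rochester), Ji-Rong Wen (Renmin University of China), Di Hu (Renmin University of China)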

Watch on Bilibili:

https://www.bilibili.com/video/BV1SB4y1k7xZ/

Abstract:

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning over different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-, V-, and AVQA approaches. We believe that our built dataset has the potential to serve as testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning.

Paper information:

[1] Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu, Learning to Answer Questions in Dynamic Audio-Visual Scenarios, CVPR 2022 (Oral presentation)

Paper link:

[https://openaccess.thecvf.com/content/CVPR2022/papers/Li_Learning_To_Answer_Questions_in_Dynamic_Audio-Visual_Scenarios_CVPR_2022_paper.pdf]

Code link:

[https://github.com/GeWu-Lab/MUSIC-AVQA]

Project page:

[https://gewu-lab.github.io/MUSIC-AVQA/]

Zhihu write-up:

[https://zhuanlan.zhihu.com/p/498943693]

About the speaker:

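Guangyao Li is a PhD student at the Gaoling School of Artificial Intelligence, Renmin University of China, whose main research interest is multimodal video understanding.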

Special thanks to the main organizers of this Paper Digest issue:

Monthly rotating AC: Zhihui Wang (Dalian University of Technology), Xu Yang (Xidian University)

Quarterly responsible AC: Xiu-Shen Wei (Nanjing University of Science and Technology)

How to participate

1. VALSE's weekly Webinar is hosted on Bilibili's live-streaming platform. Search for VALSE_Webinar on Bilibili to follow us.

Live stream:

https://live.bilibili.com/22300737

Past videos:

https://space.bilibili.com/562085182/

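2. VALSE Webinars are usually held on Wednesdays at 20:00, though the time is occasionally adjusted to accommodate a speaker's time zone. To make it easier to attend, follow the VALSE WeChat official account (valse_wechat) or join VALSE QQ group R (group number: 137634472).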

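*Note: To join a VALSE QQ group, you must provide your name, affiliation, and role for verification; all three are required. After joining, set your group display name to your real name, role, and affiliation. Roles: university or research institute staff, T; industry R&D, I; PhD student, D; master's student, M.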

3. The VALSE WeChat official account usually announces the following week's Webinar talk each Thursday.

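4. You can also visit the VALSE homepage, http://valser.org/, to view Webinar information directly. Slides for Webinar talks (with the speaker's permission) are posted at the bottom of each talk's announcement on the VALSE website.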
