
VALSE Paper Preview, Issue 189: Generalizable Audio-Visual Segmentation with Sound Prompts

2024-07-26 10:50 | Published by: Cheng Yi (Institute of Computing Technology)


Paper title:

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

Authors:

Yaoting Wang (Renmin University of China), Weisong Liu (Northwestern Polytechnical University), Guangyao Li (Renmin University of China), Jian Ding (Wuhan University), Di Hu (Renmin University of China), Xi Li (Zhejiang University)


Watch on Bilibili:

https://www.bilibili.com/video/BV1dT421r7tm/



Abstract:

Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks, but under the demanding zero-shot and few-shot scenarios. To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data-scarcity and varying-data-distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct a Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects; meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep training efforts minimal while preserving adequate knowledge of the visual foundation model. Equipped with these means, the new paradigm outperforms other fusion-based methods in both the unseen-class and cross-dataset settings, as extensive experiments demonstrate. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios.
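
To make the encoder-prompt-decoder idea above concrete, here is a minimal PyTorch sketch. It is not the authors' implementation (see the code link below); only the overall flow follows the abstract: audio is mapped to prompt tokens (SAP) that are prepended to the tokens of a frozen visual foundation model, a lightweight adapter (ColA) is the only other trainable part, and a simple head decodes a segmentation map. All class names (e.g. PromptedSegmenter), module internals, dimensions, and the per-patch mask head are illustrative assumptions.

```python
# Minimal illustrative sketch of the encoder-prompt-decoder paradigm.
# NOT the released implementation; shapes and internals are assumptions.
import torch
import torch.nn as nn


class SemanticAwareAudioPrompt(nn.Module):
    """Maps an audio embedding to prompt tokens in the visual space (SAP, sketched)."""
    def __init__(self, audio_dim=128, visual_dim=768, num_prompts=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, visual_dim),
            nn.GELU(),
            nn.Linear(visual_dim, num_prompts * visual_dim),
        )
        self.num_prompts = num_prompts
        self.visual_dim = visual_dim

    def forward(self, audio_feat):            # (B, audio_dim)
        prompts = self.proj(audio_feat)        # (B, num_prompts * visual_dim)
        return prompts.view(-1, self.num_prompts, self.visual_dim)


class CorrelationAdapter(nn.Module):
    """Lightweight bottleneck adapter (ColA, sketched) beside the frozen encoder."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, tokens):                 # (B, N, dim)
        return tokens + self.up(torch.relu(self.down(tokens)))


class PromptedSegmenter(nn.Module):
    """Frozen visual encoder + audio prompts + trainable adapter + simple mask head."""
    def __init__(self, visual_dim=768, num_patches=196):
        super().__init__()
        # Stand-in for a pre-trained visual foundation model (kept frozen).
        layer = nn.TransformerEncoderLayer(d_model=visual_dim, nhead=8, batch_first=True)
        self.visual_encoder = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.visual_encoder.parameters():
            p.requires_grad = False
        self.sap = SemanticAwareAudioPrompt(visual_dim=visual_dim)
        self.cola = CorrelationAdapter(dim=visual_dim)
        # Per-patch mask logits; a real decoder would upsample to pixel level.
        self.mask_head = nn.Linear(visual_dim, 1)
        self.num_patches = num_patches

    def forward(self, visual_tokens, audio_feat):
        # visual_tokens: (B, num_patches, visual_dim), audio_feat: (B, audio_dim)
        prompts = self.sap(audio_feat)                        # audio -> prompt tokens
        tokens = torch.cat([prompts, visual_tokens], dim=1)   # prepend prompts
        tokens = self.visual_encoder(tokens)                  # frozen foundation model
        tokens = self.cola(tokens)                            # lightweight adaptation
        patch_tokens = tokens[:, -self.num_patches:]          # drop prompt tokens
        return self.mask_head(patch_tokens).squeeze(-1)       # (B, num_patches) logits


if __name__ == "__main__":
    model = PromptedSegmenter()
    v = torch.randn(2, 196, 768)   # dummy patch embeddings from a visual backbone
    a = torch.randn(2, 128)        # dummy audio embedding
    print(model(v, a).shape)       # torch.Size([2, 196])
```

In this dummy run, the visual tokens stand in for patch embeddings from a pre-trained backbone, and the per-patch logits would be reshaped and upsampled into a pixel-level mask in a full segmentation decoder.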


Reference:

[1] Yaoting Wang, Weisong Liu, Guangyao Li, Jian Ding, Di Hu, and Xi Li, "Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer," in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2024), Vancouver, Canada, February 2024.


Paper link:

https://arxiv.org/abs/2309.07929

 

Code link:

https://github.com/GeWu-Lab/Generalizable-Audio-Visual-Segmentation

 

Speaker bio:

Yaoting Wang obtained his master's degree from the University of Edinburgh and is currently working as a research intern at GeWu Lab, Renmin University of China, under the guidance of Prof. Di Hu. He will be participating in the Visiting Student Research Program of King Abdullah University of Science and Technology starting in March 2024.

Homepage:

https://github.com/yaotingwangofficial



Special thanks to the main organizer of this paper preview:

Monthly rotating AC: Qian Yu (Beihang University)


How to Participate

1. VALSE's weekly Webinar is streamed live on Bilibili. Search for VALSE_Webinar on Bilibili and follow us!

Live stream:

https://live.bilibili.com/22300737

Past recordings:

https://space.bilibili.com/562085182/ 


2. VALSE Webinars are usually held on Wednesday evenings at 20:00 (Beijing time), though the schedule may occasionally shift to accommodate speakers' time zones. To stay informed, please follow the VALSE WeChat official account (valse_wechat) or join the VALSE QQ T group (group number: 863867505).


*Note: When applying to join the VALSE QQ group, you must provide your name, affiliation, and role; all three are required. After joining, please set your group nickname to your real name, role, and affiliation. Roles: university or research institute staff (T); industry R&D (I); PhD student (D); master's student (M).


3. The VALSE WeChat official account usually publishes the announcement of the following week's Webinar talk every Thursday.


4. You can also visit the VALSE homepage (http://valser.org/) to view Webinar information directly. The slides of each Webinar talk (with the speaker's permission) are posted at the bottom of the corresponding announcement on the VALSE website.
