VALSE Webinar 22-21 (No. 288 overall): Interaction-Oriented Visual Scene Understanding

2022-8-11 18:14 | Posted by: Cheng Yi (程一), ICT

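Time	Wednesday, August 17, 2022, 20:00 (Beijing time)
Topic	Interaction-Oriented Visual Scene Understanding
Host	Changxing Ding (South China University of Technology)
Livestream	https://live.bilibili.com/22300737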

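Speaker: Ruiping Wang (Institute of Computing Technology, Chinese Academy of Sciences)

Talk title: Visual Scene Graphs: Representation, Generation, and Application

Speaker: Yong-Lu Li (The Hong Kong University of Science and Technology)

Talk title: Three-Stages in Human-Object Interaction Detection

Speaker: Cewu Lu (Shanghai Jiao Tong University)

Talk title: Bridging Isolated Islands in Human Activity Understanding

Panelists:

Ruiping Wang (Institute of Computing Technology, Chinese Academy of Sciences), Yong-Lu Li (The Hong Kong University of Science and Technology), Cewu Lu (Shanghai Jiao Tong University), Wei-Shi Zheng (Sun Yat-sen University), Limin Wang (Nanjing University)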

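Panel topics:

1. The conceptual differences and connections among action recognition, scene graph generation (SGG), and HOI detection. Can the large body of existing work in action recognition inspire research on the latter two emerging tasks?

2. As compositional learning tasks, HOI detection and SGG face many unique challenges. For example, relationship representations are complex to model, data annotation is costly, combining action and object categories produces a severe long-tail problem among HOI categories, and new action-object combinations readily appear at test time. Are there approaches worth noting for addressing these challenges?

3. DETR models have been widely applied to HOI detection and SGG. Compared with object detection, what unique challenges arise when applying DETR to these two tasks?

4. At present, HOI detection and SGG are studied separately. Could a comprehensive benchmark unify research on the two tasks? What elements should such a benchmark have?

5. Embodied AI focuses on how robots learn while interacting with objects. Does understanding interactive behavior help research in Embodied AI? Can the two fields reinforce each other?

*You are welcome to leave topic-related questions in the comments below; the host and panelists will select several of the most popular ones to add to the panel topics!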

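Speaker: Ruiping Wang (Institute of Computing Technology, Chinese Academy of Sciences)

Talk time: Wednesday, August 17, 2022, 20:00 (Beijing time)

Talk title: Visual Scene Graphs: Representation, Generation, and Application

Speaker bio:

Ruiping Wang is a professor and doctoral supervisor at the Institute of Computing Technology, Chinese Academy of Sciences. His research area is computer vision and pattern recognition, with a focus on visual scene understanding in real-world open environments. He has published more than 90 papers in international journals and conferences, with over 6,000 Google Scholar citations, and holds 9 granted national invention patents. He has led graduate students to first or second place six times in mainstream international academic competitions in the field and received the Best Paper Award at the CVPR 2021 CLVISION Workshop. He has co-organized and presented tutorials at international conferences including CVPR 2015, ECCV 2016, and ICCV 2019. He serves on the editorial boards of international journals such as Pattern Recognition and Neurocomputing and has served more than ten times as area chair for international conferences including IEEE CVPR (2021/2022), ICCV (2021), ECCV (2022), WACV (2018-2020/2022/2023), and ACCV (2022). His honors include the Best Reviewer Award at ACCV 2012 and IEEE FG 2019 and Outstanding Reviewer recognition at IEEE CVPR 2019, ICCV 2019, ECCV 2020, NeurIPS 2020, and ICML 2022. His research received the 2015 State Natural Science Award, Second Class (4th contributor), and was funded by the 2019 NSFC Excellent Young Scientists Fund.

Homepage: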

http://vipl.ict.ac.cn/people/rpwang/

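Abstract:

In recent years, cognitive tasks for visual scene understanding have become a hot topic, and the research perspective has shifted from local visual entities (object-centric) to global relationships among entities (relationship-centric). Building a cross-modal pathway between complex visual information and its essential semantic meaning has become a key challenge. The structured visual scene graph provides a bridge between low-level perception tasks, such as object recognition and detection, and high-level cognitive tasks, such as language description and question answering. Over the past few years, our group has carried out a series of studies on the representation, generation, and application of scene graphs, aiming to build a unified, progressive scene understanding framework of "object --> scene --> language --> knowledge". The talk will present some specific progress, including object detection driven by structured graph reasoning, automatic generation of scene relationship graphs, cross-modal image-text retrieval for complex scenes, and automatic evaluation of image caption generation.

References: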

[1] Wenbin Wang, Ruiping Wang, Xilin Chen, “Topic Scene Graph Generation by Attention Distillation from Caption,” 18th IEEE International Conference on Computer Vision (ICCV 2021), pp. 15900–15910, Montreal, Canada, Oct. 11-17, 2021.

[2] Jiwei Xiao, Ruiping Wang, Xilin Chen, “Holistic Pose Graph: Modeling Geometric Structure among Objects in a Scene using Graph Inference for 3D Object Prediction,” 18th IEEE International Conference on Computer Vision (ICCV 2021), pp. 12717–12726, Montreal, Canada, Oct. 11-17, 2021.

[3] Sijin Wang, Ziwei Yao, Ruiping Wang, Zhongqin Wu, Xilin Chen, “FAIEr: Fidelity and Adequacy Ensured Image Caption Evaluation,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2021), pp. 14050–14059, June 19-25, 2021.

[4] Wenbin Wang, Ruiping Wang, Shiguang Shan, Xilin Chen, “Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation,” 16th European Conference on Computer Vision (ECCV 2020), LNCS 12358, pp. 222–239, Aug. 23-28, 2020.

[5] Yong Liu, Ruiping Wang, Shiguang Shan, Xilin Chen, “Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 6985–6994, Salt Lake City, UT, June 18-22, 2018.

Speaker: Yong-Lu Li (The Hong Kong University of Science and Technology)

Talk time: Wednesday, August 17, 2022, 20:30 (Beijing time)

Talk title: Three-Stages in Human-Object Interaction Detection

Speaker bio:

Yong-Lu Li is a postdoctoral fellow at the Hong Kong University of Science and Technology, working closely with IEEE Fellow Prof. Chi Keung Tang and Prof. Yu-Wing Tai. He received his Ph.D. in Computer Science from Shanghai Jiao Tong University under the supervision of Prof. Cewu Lu. His primary research interests are vision-based learning and reasoning, human activity understanding, and intelligent robotics. He focuses on developing a knowledge-driven vision system (HAKE, http://hake-mvig.cn) that learns to effectively perceive human activities, reason about human behavior logic, and interact with objects and environments. He has published 20 papers in top-tier CV/ML/AI conferences and journals, e.g., TPAMI, NeurIPS, CVPR, ICCV, ECCV, and AAAI. His honors include the Baidu Scholarship, WAIC YunFan Reward (Rising-star), China National Scholarship, Outstanding Reviewer Award of NeurIPS, Shanghai Outstanding Doctoral Graduate, Chinese AI New Star Top-100, and the Ph.D. Fellowship of the Yang Yuanqing Education Fund.

Homepage:

https://dirtyharrylyl.github.io/

Abstract:

Human-Object Interaction (HOI) is a key direction for human activity understanding. It aims at detecting interacting humans and objects while classifying the interactions between them. Generally speaking, this set detection task contains three metaphysical stages: instance detection, interactiveness detection, and interaction classification. In this talk, I will discuss the challenges in these three key stages and introduce our recent works. Then, I will look ahead to possible directions for HOI detection and general human activity understanding through the lens of multi-modal representation, open-vocabulary learning, knowledge and reasoning, benchmarks, and applications.
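
For readers new to the task, the three stages named in the abstract can be pictured with a minimal Python sketch. Everything here is an illustrative assumption rather than the speaker's method: the `Instance` type, the IoU-based interactiveness score, the `TOY_VERBS` lookup table, and the threshold are hypothetical stand-ins for learned models.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Instance:
    label: str    # "person" or an object category
    box: tuple    # (x1, y1, x2, y2)


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interactiveness(human, obj):
    """Stage 2 (toy heuristic): box overlap stands in for a learned score."""
    return box_iou(human.box, obj.box)


# Stage 3 (toy lookup): a real system would use a learned classifier.
TOY_VERBS = {"bicycle": "ride", "cup": "hold"}


def classify_interaction(human, obj):
    return TOY_VERBS.get(obj.label, "no_interaction")


def detect_hoi(detections, threshold=0.1):
    """Run stages 2 and 3 on the output of stage 1 (instance detection)."""
    humans = [d for d in detections if d.label == "person"]
    objects = [d for d in detections if d.label != "person"]
    triplets = []
    for human, obj in product(humans, objects):
        # Drop pairs judged non-interactive before classifying the verb.
        if interactiveness(human, obj) < threshold:
            continue
        triplets.append((human, classify_interaction(human, obj), obj))
    return triplets


# Stage 1 output is hard-coded here; in practice it comes from a detector.
detections = [
    Instance("person", (0, 0, 10, 20)),
    Instance("bicycle", (2, 8, 12, 22)),
    Instance("cup", (100, 100, 110, 110)),
]
result = detect_hoi(detections)
```

In this toy input only the person-bicycle pair overlaps, so the distant cup is filtered out at the interactiveness stage and never reaches the classifier.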

References:

[1] Yong-Lu Li, Siyuan Zhou, Xijie Huang, Liang Xu, Ze Ma, Hao-Shu Fang, Yan-Feng Wang, Cewu Lu, “Transferable Interactiveness Knowledge for Human-Object Interaction Detection”, CVPR 2019.

[2] Xinpeng Liu*, Yong-Lu Li* (*=equal contribution), Xiaoqian Wu, Yu-Wing Tai, Cewu Lu, Chi Keung Tang, “Interactiveness Field of Human-Object Interactions”, CVPR 2022.

[3] Xiaoqian Wu*, Yong-Lu Li* (*=equal contribution), Xinpeng Liu, Junyi Zhang, Yuzhe Wu, Cewu Lu, “Mining Cross-Person Cues for Body-Part Interactiveness Learning in HOI Detection”, ECCV 2022.

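Speaker: Cewu Lu (Shanghai Jiao Tong University)

Talk time: Wednesday, August 17, 2022, 21:00 (Beijing time)

Talk title: Bridging Isolated Islands in Human Activity Understanding

Speaker bio:

Cewu Lu is a professor in the Department of Computer Science at Shanghai Jiao Tong University and assistant to the dean of the Qing Yuan Research Institute. His research focuses on computer vision, behavior understanding, and intelligent robotics. As first or corresponding author, he has published more than 100 papers in high-level journals and conferences including Nature, Nature Machine Intelligence, TPAMI, and CVPR, and has open-sourced a series of internationally advanced AI frameworks and datasets such as AlphaPose (a human pose estimation system, 5000+ GitHub stars), HAKE (a human activity engine), and GraspNet (a high-performance robotic grasping system). He has received the Qiushi Outstanding Young Scholar Award, the Shanghai Science and Technology Progress Special Prize, and the SAIL Award, the highest award of the World Artificial Intelligence Conference, and was named a 2021 Highly Cited Chinese Researcher by Elsevier.

Homepage:

https://www.mvig.org/

Abstract: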

As a vital step toward the intelligent agent, action understanding has attracted a lot of attention and achieved success recently. This task can be formulated as a mapping from the physical space of human actions to a semantic space. However, due to the complexity of action patterns and semantic ambiguity, great challenges remain. In terms of the action physical space, multi-modal methods have been proposed to extract representative features to facilitate recognition. But few efforts have been made in the design of the semantic space. Usually, researchers build datasets according to idiosyncratic choices of action classes and then develop methods to push the envelope on each dataset separately. As a result, these datasets are incompatible with each other due to the semantic gap and different action class granularities, e.g., “do housework” in dataset A and “wash plate” in dataset B. Here, we call it the “isolated islands” problem, which brings a great challenge to general action understanding, as these “isolated islands” with semantic gaps cannot afford unified training. We argue that a more principled and complete semantic space is urgently needed to concentrate the efforts of the community and enable us to use all the existing multi-modal datasets to pursue general and effective action understanding. To this end, we propose a novel path to reshape the action understanding paradigm. In detail, we redesign a structured semantic space based on a verb taxonomy hierarchy that covers massive verbs. By aligning the classes of existing datasets to our structured space, we can put all image/video/skeleton/MoCap action datasets into the largest database by far with a unified semantic label system. Accordingly, we propose a bidirectional mapping framework, Sandwich, to use multi-modal data with unified labels to bridge the action physical-semantic space. In extensive experiments, our framework shows great potential for future action study and significant superiority over the canonical paradigm, especially on few/zero-shot action learning tasks with semantic analogy, thanks to the verb structure knowledge and our data coverage.
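
The idea of aligning dataset classes to a structured verb space can be illustrated with a small Python sketch. This is not the Sandwich framework or the authors' taxonomy: the `PARENT` tree, the `ALIGNMENT` table, and the extra class names beyond the two examples in the abstract are hypothetical, chosen only to show how classes of different granularity could share labels once mapped into one hierarchy.

```python
# Hypothetical toy verb taxonomy: child -> parent (None marks the root).
PARENT = {
    "act": None,
    "do housework": "act",
    "wash plate": "do housework",
    "mop floor": "do housework",
    "play sport": "act",
}

# Hypothetical alignment from (dataset, class name) to a taxonomy node.
ALIGNMENT = {
    ("A", "do housework"): "do housework",
    ("B", "wash plate"): "wash plate",
    ("B", "mop"): "mop floor",
}


def ancestors(node):
    """Return the node and all its ancestors, most specific first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def unified_labels(dataset, class_name):
    """Map a dataset-specific class to its labels in the shared hierarchy."""
    return ancestors(ALIGNMENT[(dataset, class_name)])


labels_a = unified_labels("A", "do housework")
labels_b = unified_labels("B", "wash plate")
shared = [label for label in labels_b if label in labels_a]
```

Under this toy alignment a "wash plate" sample from dataset B also carries the coarser "do housework" label, so it could serve as a positive example alongside dataset A's samples instead of sitting on a separate island.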

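Panelist: Wei-Shi Zheng (Sun Yat-sen University)

Panelist bio:

Dr. Wei-Shi Zheng is a professor and doctoral supervisor at the School of Computer Science, Sun Yat-sen University, where he currently serves as vice dean of the school, deputy director of the Key Laboratory of Machine Intelligence and Advanced Computing (Ministry of Education), and deputy director of the National Engineering Laboratory for Big Data Analysis and Application Technology. He works on (cross-scene) tracking and behavior perception, combining information from multiple modalities to achieve high-level semantic understanding. He has published more than 150 papers in CCF-A venues, CAS Zone 1 journals, and Nature sub-journals, including over 20 in IEEE T-PAMI, IJCV, and Nature Communications. He serves on the editorial boards of journals such as Pattern Recognition and Acta Automatica Sinica (《自动化学报》) and as area chair for top international conferences including ICCV, CVPR, and IJCAI. As principal investigator, he has led an NSFC Joint Fund key project, a National Key R&D Program subproject, an NSFC joint major project subproject, a Ministry of Science and Technology major research subproject, a National Defense Science and Technology 173 Program fund, a Guangdong Province key fund, and five other national-level projects. He won first prize in the optical image track of the 2020 National Underwater Object Detection Algorithm Competition and first place three times in competitions at top international conferences such as CVPR. His awards include the First Prize of the Natural Science Award of the China Society of Image and Graphics and the First Prize and Second Prize of the Guangdong Provincial Natural Science Award. He was named a Highly Cited Chinese Researcher (Elsevier) in 2020 and 2021 and was listed among the World's Top 2% Scientists (Stanford University) in 2020 and 2021. He is supported by the National Science Fund for Excellent Young Scholars, the Royal Society Newton Advanced Fellowship, and the Guangdong Innovative Leading Talent Program.

Homepage: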

https://www.isee-ai.cn/~zhwshi/

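Panelist: Limin Wang (Nanjing University)

Panelist bio:

Limin Wang is a professor and doctoral supervisor at Nanjing University. He received his bachelor's degree from Nanjing University in 2011 and his Ph.D. from The Chinese University of Hong Kong in 2015, and was a postdoctoral researcher at ETH Zurich from 2015 to 2018. His main research areas are computer vision and deep learning, with a focus on video understanding and action recognition. He has published more than 50 papers in major journals and conferences such as IJCV, T-PAMI, CVPR, and ICCV. According to Google Scholar, his papers have been cited more than 14,000 times, and two of his first-author papers have each been cited more than 3,000 times. The TSN network he proposed won the first ActivityNet challenge and has become a baseline method in action recognition. He was selected for the National High-Level Young Talent Program in 2018 and has received the First Prize of the Guangdong Technological Invention Award and the Youth Outstanding Paper Award of the World Artificial Intelligence Conference. He has been listed in the AI 2000 Most Influential Scholars (computer vision), the 2022 Global Chinese AI Young Scholars list, and the 2021 Elsevier Highly Cited Chinese Researchers list.

Homepage: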

http://wanglimin.github.io/

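Host: Changxing Ding (South China University of Technology)

Host bio:

Changxing Ding is a professor and doctoral supervisor at South China University of Technology. He received his Ph.D. in computer science from the University of Technology Sydney in 2016. His research focuses on interaction-oriented visual scene understanding tasks such as HOI detection and scene graph generation. In recent years he has published nearly 50 papers in major journals and conferences including IEEE TPAMI, TIP, IJCV, CVPR, ECCV, and AAAI, five of which are ESI Highly Cited Papers. He has won four championships in mainstream international academic competitions in the field, including one each in the CVPR EPIC Kitchens action recognition and action anticipation challenges. He serves as an editorial board member of IET Computer Vision, a managing guest editor of Pattern Recognition Letters, and a VALSE executive area chair. His work is supported by the National Natural Science Foundation of China, the Guangdong Introduced Innovative and Entrepreneurial Teams program, the Guangdong Young Top Talent Program, the South China University of Technology Distinguished Young Scholars fund, and the CCF-Baidu Songguo (松果) Fund, among others.

Homepage: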

https://www.researchgate.net/profile/Changxing-Ding

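Special thanks to the main organizers of this Webinar:

Organizing AC: Changxing Ding (South China University of Technology)

Co-organizing AC: Limin Wang (Nanjing University)

How to participate

1. VALSE's weekly Webinar is streamed on the Bilibili live platform. Search for VALSE_Webinar on Bilibili to follow us!

Livestream:

https://live.bilibili.com/22300737

Past videos:

https://space.bilibili.com/562085182/

2. VALSE Webinars usually take place every Wednesday at 20:00, with occasional adjustments for the speaker's time zone. To make participation easier, follow the VALSE WeChat official account (valse_wechat) or join the VALSE QQ R group (group number: 137634472).

*Note: Applications to join the VALSE QQ group must state your name, affiliation, and role; none of the three may be omitted. After joining, please set your display name to your real name, role, and affiliation. Roles: university and research institute staff T; industry R&D I; Ph.D. D; master's M.

3. The VALSE WeChat official account usually announces the following week's Webinar talks every Thursday.

4. You can also view Webinar information directly on the VALSE homepage: http://valser.org/. Talk slides (with the speaker's permission) are posted at the bottom of each talk announcement on the VALSE website.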
