To help practitioners in the vision and learning community keep up with the latest developments and frontier techniques in the field, VALSE has launched the "Paper Digest" column, which releases one or two recorded videos per week, each giving a detailed walkthrough of a single recent top-conference or top-journal paper. This issue features a video-understanding work from Shandong University, supervised by Prof. Lei Meng and presented by its first author, Yuqing Wang.

Title: Modeling Event-level Causal Representation for Video Classification

Authors: Yuqing Wang (Shandong University), Lei Meng (Shandong University), Haokai Ma (Shandong University), Haibei Huang (Inspur), Xiangxu Meng (Shandong University)

Watch on Bilibili:

Abstract: Classifying videos differs from classifying images in the need to capture what has happened, rather than what is in the frames. Conventional methods typically follow a data-driven approach, using transformer-based attention models to extract and aggregate frame features as the representation of the entire video. However, this approach tends to extract object-level information from frames and may struggle with classes that describe events, such as "fixing bicycle". To address this issue, this paper presents an Event-level Causal Representation Learning (ECRL) model for spatio-temporal modeling of both in-frame object interactions and their cross-frame temporal correlations. Specifically, ECRL first employs a Frame-to-Video Causal Modeling (F2VCM) module, which simultaneously builds an in-frame causal graph from background and foreground information and models cross-frame correlations to construct a video-level causal graph. Subsequently, a Causality-aware Event-level Representation Inference (CERI) module is introduced to eliminate spurious correlations in contexts and objects via back-door and front-door interventions, respectively. The former performs visual-context de-biasing to filter out background confounders, while the latter employs global-local causal attention to capture event-level visual information. Experimental results on two benchmark datasets verify that ECRL better captures cross-frame correlations and describes videos with event-level features.

Paper link: https://dl.acm.org/doi/abs/10.1145/3664647.3681547

Code link: https://github.com/wyqcrystal/ECRL

Speaker bio: Yuqing Wang is a master's student at the School of Software, Shandong University. During his graduate studies, he has led large-scale multimedia data processing and algorithm research related to social-governance events, and helped bring a digital-twin platform for digitalized social governance into practical use. He has published two conference papers as first author, rated CCF-A and CCF-C respectively, and has filed a patent under the Tencent Rhino-Bird Innovation Fund. During his master's program, he was awarded the Third Prize Freshmen Scholarship and the First-Class Academic Scholarship. In competitions, he won the First Prize in the International Mathematical Contest in Modeling (MCM), the Second Prize in the CCF Outstanding Undergraduate Academic Showcase, and the National Third Prize in the 14th China College Students' Innovation and Entrepreneurship Outsourcing Competition.
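The back-door intervention mentioned in the abstract (visual-context de-biasing) is commonly approximated by marginalizing the representation over a fixed dictionary of confounder prototypes, e.g. clustered background features. Below is a minimal NumPy sketch of that general technique; the function names, the uniform prior P(z), and the additive fusion are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def backdoor_adjust(x, confounders, prior=None):
    """Approximate back-door adjustment: marginalize the context over a
    dictionary of K background prototypes z (confounders, shape (K, D)),
    weighting by the prior P(z), then fuse the expected context back into
    the features x (shape (B, D))."""
    K, D = confounders.shape
    if prior is None:
        prior = np.full(K, 1.0 / K)          # assume a uniform P(z)
    # Scaled dot-product scores between features and each prototype.
    attn = softmax(x @ confounders.T / np.sqrt(D))   # (B, K)
    # Expectation over z: sum_z P(z) * weight(x, z) * z.
    ctx = (attn * prior) @ confounders               # (B, D)
    return x + ctx                                   # additive fusion (assumed)
```

In practice the prototype dictionary would be built offline (e.g. k-means over background features of the training set) and the fusion learned end-to-end; this sketch only shows the marginalization structure of the adjustment.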
Personal homepage: https://scholar.google.com.hk/citations?view_op=list_works&hl=zh-CN&user=2MA5TZcAAAAJ

Special thanks to the main organizer of this Paper Digest — monthly rotating AC: Yifan Wang (Dalian University of Technology)