What Matters in Detecting AI-Generated Videos like Sora?

Chirui Chang, Zhengzhe Liu, Xiaoyang Lyu, Xiaojuan Qi
The University of Hong Kong, The Chinese University of Hong Kong

Abstract

Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, using Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models.


Empirical Studies and Analysis

Comprehensive Video Representation

To allow the decomposition of a video into individual components, we design a comprehensive video representation (CVR) for video analysis and fake video detection. CVR is composed of three components (a feature-extraction sketch follows the list):
  • Appearance representation: Instead of relying on the original RGB information alone, we extract visual features from each frame with DINOv2, a vision foundation model, yielding rich high-level features as the appearance representation.
  • Motion representation: We leverage optical flow obtained from RAFT to study the motion patterns of synthetic and real videos, since it captures subtle variations in pixel movement and enables precise analysis of the dynamics within video frames.
  • Geometry representation: Depth conveys many 2.5D geometric cues, such as occlusion, spatial relationships, and scale. To investigate the geometric properties of generated videos, we leverage both relative depth from Marigold and metric depth from UniDepth. Compared to relative depth, metric depth has a uniform scale and provides better consistency across videos, which helps in perceiving changes in the geometric structure of a video.
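
As a concrete illustration, the sketch below shows how the three CVR components could be extracted for a clip. DINOv2 is loaded through its public torch.hub entry point and RAFT through torchvision; the depth estimator is left as a placeholder callable, since the exact Marigold/UniDepth invocation is not specified here, and the shared preprocessing is a simplifying assumption.

  import torch
  from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

  device = "cuda" if torch.cuda.is_available() else "cpu"

  # Appearance: per-frame DINOv2 features (public torch.hub entry point).
  dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()
  # Motion: RAFT optical flow between consecutive frames (torchvision weights).
  raft = raft_large(weights=Raft_Large_Weights.DEFAULT).to(device).eval()

  @torch.no_grad()
  def extract_cvr(frames, depth_model):
      """frames: (T, 3, H, W) float tensor; in practice each model expects its
      own preprocessing (DINOv2: ImageNet-normalized, H and W divisible by 14;
      RAFT: inputs in [-1, 1], divisible by 8), which is omitted here.
      depth_model: placeholder for a monocular depth estimator such as
      Marigold (relative) or UniDepth (metric); its API is assumed."""
      appearance = dino(frames.to(device))                # (T, 384) frame features
      flow = raft(frames[:-1].to(device),                 # (T-1, 2, H, W) flow
                  frames[1:].to(device))[-1]              # last refinement iteration
      depth = depth_model(frames)                         # (T, 1, H, W), assumed
      return appearance, flow, depth
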
Results

We adopt 3D ConvNets to predict whether a video is real or fake using only one component of our CVR as input, and the results are reported below.
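
The exact backbone is not pinned down above; purely as a reference point, a minimal 3D-ConvNet binary classifier might look like the following sketch (the layer count, widths, and the Conv3DClassifier name are our assumptions, not the paper's architecture).

  import torch
  import torch.nn as nn

  class Conv3DClassifier(nn.Module):
      """Illustrative 3D-ConvNet real/fake classifier. in_ch depends on the
      CVR component, e.g. 2 channels for optical flow or 1 for depth."""
      def __init__(self, in_ch: int, width: int = 32):
          super().__init__()
          self.features = nn.Sequential(
              nn.Conv3d(in_ch, width, kernel_size=3, padding=1),
              nn.BatchNorm3d(width), nn.ReLU(inplace=True),
              nn.MaxPool3d(2),
              nn.Conv3d(width, width * 2, kernel_size=3, padding=1),
              nn.BatchNorm3d(width * 2), nn.ReLU(inplace=True),
              nn.AdaptiveAvgPool3d(1),                    # global spatiotemporal pooling
          )
          self.head = nn.Linear(width * 2, 1)             # single real-vs-fake logit

      def forward(self, x):                               # x: (B, C, T, H, W)
          return self.head(self.features(x).flatten(1))

Training then reduces to standard binary cross-entropy on the logit, e.g. nn.BCEWithLogitsLoss() over batches of a single CVR component.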


Analysis on Appearance

To further understand the classifier's decision criteria, we use Grad-CAM to obtain a more in-depth analysis. Below are some of the Grad-CAM results from our appearance classifier on different video generation models. For simplicity, we show only excerpts of the generated videos for some examples.
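
For reference, Grad-CAM on a 3D-CNN video classifier can be reproduced with a short hook-based routine like the one below (a generic sketch, not the paper's exact implementation; the grad_cam_3d helper and its pooling choices are ours):

  import torch

  def grad_cam_3d(model, clip, target_layer):
      """Grad-CAM sketch for a 3D-CNN video classifier.
      clip: (1, C, T, H, W) input; target_layer: the last Conv3d layer,
      whose activations are weighted by their pooled gradients."""
      acts, grads = {}, {}
      h1 = target_layer.register_forward_hook(
          lambda m, i, o: acts.update(a=o))
      h2 = target_layer.register_full_backward_hook(
          lambda m, gi, go: grads.update(g=go[0]))
      logit = model(clip)
      model.zero_grad()
      logit.sum().backward()                              # gradient of the fake logit
      h1.remove(); h2.remove()
      w = grads["g"].mean(dim=(2, 3, 4), keepdim=True)    # pool grads over T, H, W
      cam = torch.relu((w * acts["a"]).sum(dim=1))        # (1, T', H', W') heatmap
      return cam / (cam.max() + 1e-8)                     # normalized to [0, 1]
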

Generated videos suffer from color inconsistency and texture distortion.

Analysis on Motion

Below are some of the Grad-CAM results from our motion classifier on different video generation models. For simplicity, we show only excerpts of the generated videos for some examples.

Video generation models cannot fully reproduce real-world motion patterns and may produce unrealistic ones.

Analysis on Geometry

Below are some of the Grad-CAM results from our geometry classifier on different video generation models. For simplicity, we show only excerpts of the generated videos for some examples.

Generated videos still fail to fully follow real-world geometry, exhibiting unrealistic occlusion patterns and inconsistent object scales.

Comparison

We compare the ensemble CVR classifier with existing works, and the results are listed in the following table. As shown in the table, existing works struggle to detect AI-generated videos in cross-domain settings. In contrast, our approach, which leverages the appearance, motion, and geometry classifiers through the Ensemble-of-Experts strategy, consistently outperforms them.
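
The exact fusion rule of the Ensemble-of-Experts is not spelled out here; as one plausible reading, the sketch below simply averages the per-expert fake probabilities (the averaging choice and the ensemble_predict helper are assumptions, not necessarily the paper's rule):

  import torch

  def ensemble_predict(experts, clips):
      """Fuse the appearance, motion, and geometry experts (sketch).
      experts: aligned list of trained classifiers; clips: the matching
      CVR inputs for one video. Averaging sigmoid probabilities is an
      assumed fusion rule."""
      with torch.no_grad():
          probs = [m(x).sigmoid() for m, x in zip(experts, clips)]
      return torch.stack(probs).mean(dim=0)               # fused fake probability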

BibTeX

@article{chang2024mattersdetectingaigeneratedvideos,
  author  = {Chirui Chang and Zhengzhe Liu and Xiaoyang Lyu and Xiaojuan Qi},
  title   = {What Matters in Detecting AI-Generated Videos like Sora?},
  journal = {arXiv preprint arXiv:2406.19568},
  year    = {2024},
}