How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation (L3DE)

ICCV 2025 · Video Generation · Video Evaluation · 3D Coherence
Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi
The University of Hong Kong · Kling Team, Kuaishou Technology · Lingnan University
[Figure: L3DE teaser]

Abstract

L3DE is a practical, interpretable metric for judging how well AI-generated videos simulate the 3D visual world.

  • Uses three monocular cues—motion, geometry, appearance—instead of brittle full 3D reconstruction.
  • A lightweight 3D ConvNet separates real from synthetic videos and outputs a calibrated L3DE score (see the sketch below).
  • Attribution maps localize implausible regions (occlusion violations, texture inconsistencies, inconsistent dynamics).
  • Correlates with 3D reconstruction quality and human judgments; surfaces subtle inconsistencies in SOTA models.
  • Label-free, robust to diverse web videos, and easy to integrate.
  • Versatile: Supports deepfake detection, refinement-in-the-loop, generator benchmarking, and dataset filtering.
Label-free · 3D Coherence · Attribution Maps · Web-scale · Practical
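Below is a minimal, hypothetical sketch of how an L3DE-style score could be assembled: one small 3D ConvNet per monocular cue, each emitting a real-vs-synthetic probability, fused here by a plain average. The `CueScorer3D` class, the dummy cue tensors, and the averaging fusion are illustrative assumptions, not the paper's architecture or fusion rule.

```python
import torch
import torch.nn as nn

class CueScorer3D(nn.Module):
    """Hypothetical per-cue scorer: small 3D conv stack + pooling + sigmoid."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) cue volume for one clip
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.head(h)).squeeze(1)  # score in (0, 1)

# One scorer per monocular cue; random tensors stand in for real cue volumes
# (e.g., optical flow for motion, monocular depth for geometry, RGB frames
# for appearance). The averaging fusion below is an assumption.
scorers = {c: CueScorer3D(3) for c in ("appearance", "motion", "geometry")}
cues = {c: torch.randn(1, 3, 16, 112, 112) for c in scorers}
per_cue = {c: scorers[c](cues[c]).item() for c in scorers}
fusion = sum(per_cue.values()) / len(per_cue)
print(per_cue, fusion)
```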
Why it matters

L3DE quantifies the “simulation gap” in 3D coherence and pinpoints artifacts for targeted fixes, enabling more reliable evaluation and practical debugging beyond conventional benchmarks.

See the paper and supplementary for full details.

Experimental Results

Dataset Overview

Our dataset comprises three parts: Paired Real/Synthetic Video Set, 3D Reconstruction Verification Set, and 3D Visual Simulation Benchmark.

| Source | Synth/Real | #Videos | Clip Length | Resolution | FPS | Prompt |
|---|---|---|---|---|---|---|
| Paired Real/Synthetic Video Set | | | | | | |
| Pexels | Real | 80,000 | 4s | Variable | Variable | – |
| Stable Video Diffusion | Synth | 80,000 | 4s | 1024×576 | 7 | I2V |
| 3D Reconstruction Verification Set | | | | | | |
| Kling 1.5 | Synth | 3,000 | 5s | Variable | 30 | I2V & T2V |
| 3D Visual Simulation Benchmark | | | | | | |
| Pexels | Real | 14,000 | 4s | Variable | Variable | – |
| Runway-Gen3 | Synth | 539 | 5s | 1280×768 | 24 | I2V & T2V |
| MiniMax | Synth | 539 | 5s | 1280×720 | 25 | I2V & T2V |
| Vidu | Synth | 539 | 3s | Variable | 24 | I2V & T2V |
| Luma Dream Machine 1.6 | Synth | 539 | Variable | Variable | 24 | I2V & T2V |
| Kling 1.5 | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
| CogVideoX-5B | Synth | 539 | 6s | 720×480 | 8 | I2V & T2V |
| Sora | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
| Kling 2.1 | Synth | 539 | 5s | Variable | 30 | I2V & T2V |

Applications

1) Benchmarking Video Generation Models

L3DE ranks video generation models by overall 3D visual coherence (Fusion) and reports per-aspect scores.

| Generator | Fusion | Appearance | Motion | Geometry |
|---|---|---|---|---|
| Runway-Gen3 | 0.7162 | 0.6946 | 0.5768 | 0.6739 |
| MiniMax | 0.7932 | 0.7714 | 0.6098 | 0.7251 |
| Vidu | 0.7052 | 0.6406 | 0.6228 | 0.7615 |
| Luma 1.6 | 0.5062 | 0.4950 | 0.5853 | 0.6800 |
| Kling 1.5 | 0.7518 | 0.7247 | 0.5926 | 0.6927 |
| CogVideoX-5B | 0.6104 | 0.5893 | 0.6203 | 0.7539 |
| Sora | 0.8895 | 0.8394 | 0.6467 | 0.7458 |
| Kling 2.1 | 0.8904 | 0.8129 | 0.6735 | 0.7623 |
| Real Videos | 0.9999 | 0.9950 | 0.8321 | 0.8435 |
Fusion is the primary ranking signal; real videos provide an empirical upper bound.
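As a usage illustration, ranking generators by the Fusion column takes a few lines; the dictionary below simply restates the table above.

```python
# Ranking sketch: sort generators by the Fusion score from the table above.
scores = {
    "Runway-Gen3": 0.7162, "MiniMax": 0.7932, "Vidu": 0.7052,
    "Luma 1.6": 0.5062, "Kling 1.5": 0.7518, "CogVideoX-5B": 0.6104,
    "Sora": 0.8895, "Kling 2.1": 0.8904,
}
for rank, (name, fusion) in enumerate(
        sorted(scores.items(), key=lambda kv: kv[1], reverse=True), start=1):
    print(f"{rank}. {name}  Fusion={fusion:.4f}")
```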

2) Fake Video Detection

Videos are classified as real or synthetic by thresholding their L3DE scores; we compare against image-based detectors (decision rule sketched after the table).

| Method | Input | MiniMax | Kling 1.5 | Runway-Gen3 | Luma | CogVideoX | Vidu | Sora | Average |
|---|---|---|---|---|---|---|---|---|---|
| CNNDetection | Image | 49.92 | 50.02 | 50.00 | 50.45 | 50.07 | 50.00 | 49.91 | 50.05 |
| DIRE | Image | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
| NPR | Image | 60.19 | 67.91 | 64.99 | 54.06 | 35.79 | 36.04 | 60.82 | 54.25 |
| L3DE | Video | 66.51 | 82.52 | 72.19 | 83.38 | 76.73 | 70.01 | 56.31 | 73.14 |
Accuracy (%) on our fake video detection benchmark.
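A minimal sketch of the decision rule, assuming lower L3DE scores indicate synthetic content (real videos score near 1.0 in the benchmarking table); `detect_fake` and the 0.5 cutoff are illustrative, not the calibrated threshold used in the paper.

```python
import numpy as np

def detect_fake(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag a clip as synthetic when its L3DE score falls below the threshold."""
    return scores < threshold

def accuracy(scores: np.ndarray, is_fake: np.ndarray,
             threshold: float = 0.5) -> float:
    return float(np.mean(detect_fake(scores, threshold) == is_fake))

scores = np.array([0.92, 0.31, 0.78, 0.12])    # L3DE scores, one per clip
labels = np.array([False, True, False, True])  # ground truth: True = synthetic
print(f"accuracy: {accuracy(scores, labels):.2%}")
```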

3) Video Refinement

L3DE Grad‑CAM localizes artifact regions. We propagate masks (e.g., SAM‑2) across frames and apply 3D‑consistent inpainting (e.g., LaMa in a 3D‑GS loop) to remove artifacts while preserving content.

[Figure] Example of refinement guided by L3DE activations.
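A minimal sketch of the first step of this pipeline: converting a low-resolution Grad-CAM volume into per-frame binary masks by quantile thresholding. The SAM-2 propagation and 3D-consistent inpainting stages are only noted in comments, and the quantile level `q` is an assumption.

```python
import torch
import torch.nn.functional as F

def cam_to_masks(cam: torch.Tensor, size: tuple, q: float = 0.9) -> torch.Tensor:
    """cam: (T', h, w) low-res Grad-CAM volume -> (T, H, W) boolean masks."""
    T, H, W = size
    # Upsample the CAM volume to the clip's temporal/spatial resolution.
    cam = F.interpolate(cam[None, None], size=(T, H, W),
                        mode="trilinear", align_corners=False)[0, 0]
    thresh = torch.quantile(cam.flatten(), q)  # keep the top (1 - q) activations
    return cam >= thresh

cam = torch.rand(4, 7, 7)                    # dummy Grad-CAM from the L3DE net
masks = cam_to_masks(cam, size=(16, 576, 1024))
print(masks.shape, masks.float().mean())     # fraction of pixels flagged
# Next (hypothetical): seed SAM-2 with these masks, propagate them across
# frames, then inpaint the masked regions with a 3D-consistent method
# (e.g., LaMa inside a 3D-GS loop) as described above.
```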

4) More Visualizations with Grad‑CAM

L3DE's attribution maps localize inconsistencies in generated videos, including texture artifacts, implausible dynamics, and occlusion errors, across the appearance, motion, and geometry aspects.
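For readers who want to reproduce this style of attribution, here is a standard Grad-CAM sketch on a 3D ConvNet. torchvision's `r3d_18` stands in for the L3DE network (whose weights are not assumed here), and the hooked layer and target scalar are illustrative choices.

```python
import torch
import torch.nn.functional as F
from torchvision.models.video import r3d_18

model = r3d_18(weights=None).eval()  # stand-in for the L3DE network
acts, grads = {}, {}
# Capture activations and gradients at the last conv stage.
model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

clip = torch.randn(1, 3, 16, 112, 112)  # (B, C, T, H, W) video clip
score = model(clip)[0].max()            # scalar whose attribution we want
score.backward()

w = grads["v"].mean(dim=(2, 3, 4), keepdim=True)  # gradient-pooled weights
cam = F.relu((w * acts["v"]).sum(dim=1))[0]       # (T', h, w) heatmap
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)  # upsample to clip size to localize implausible regions
```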