Abstract
L3DE is a practical, interpretable metric for judging how well AI-generated videos simulate the 3D visual world.
- Uses three monocular cues—motion, geometry, appearance—instead of brittle full 3D reconstruction.
- A lightweight 3D ConvNet separates real from synthetic videos and outputs a calibrated L3DE score (see the sketch after this list).
- Attribution maps localize implausible regions (occlusion violations, texture inconsistencies, inconsistent dynamics).
- Correlates with 3D reconstruction quality and human judgments; surfaces subtle inconsistencies in SOTA models.
- Label-free, robust to diverse web videos, and easy to integrate.
- Versatile: supports deepfake detection, refinement-in-the-loop, generator benchmarking, and dataset filtering.
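A minimal sketch of the scoring interface, assuming a trained network and cue extractor; `l3de_score`, `l3de_net`, `extract_cues`, and the tensor layout are illustrative placeholders, not the released API:

```python
import torch

def l3de_score(video: torch.Tensor, l3de_net, extract_cues) -> float:
    """Score one clip in [0, 1]; higher means more 3D-coherent (more "real").

    video: (C, T, H, W) RGB clip tensor (layout assumed for illustration).
    l3de_net: lightweight 3D ConvNet mapping stacked cue maps to a logit.
    extract_cues: computes monocular motion/geometry/appearance cue maps.
    """
    cues = extract_cues(video)           # (C_cues, T, H, W) stacked cue maps
    logit = l3de_net(cues.unsqueeze(0))  # add batch dim; returns a (1,) logit
    return torch.sigmoid(logit).item()   # calibrated real-vs-synthetic score
```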
Why it matters
L3DE quantifies the “simulation gap” in 3D coherence and pinpoints artifacts for targeted fixes, enabling more reliable evaluation and practical debugging beyond conventional benchmarks.
See the paper and supplementary for full details.
Experimental Results


Dataset Overview
Our dataset comprises three parts: Paired Real/Synthetic Video Set, 3D Reconstruction Verification Set, and 3D Visual Simulation Benchmark.
| Source | Synth/Real | #Videos | Clip Length | Resolution | FPS | Prompt Type |
| --- | --- | --- | --- | --- | --- | --- |
| **Paired Real/Synthetic Video Set** | | | | | | |
| Pexels | Real | 80,000 | 4s | Variable | Variable | — |
| Stable Video Diffusion | Synth | 80,000 | 4s | 1024×576 | 7 | I2V |
| **3D Reconstruction Verification Set** | | | | | | |
| Kling 1.5 | Synth | 3,000 | 5s | Variable | 30 | I2V & T2V |
| **3D Visual Simulation Benchmark** | | | | | | |
| Pexels | Real | 14,000 | 4s | Variable | Variable | — |
| Runway-Gen3 | Synth | 539 | 5s | 1280×768 | 24 | I2V & T2V |
| MiniMax | Synth | 539 | 5s | 1280×720 | 25 | I2V & T2V |
| Vidu | Synth | 539 | 3s | Variable | 24 | I2V & T2V |
| Luma Dream Machine 1.6 | Synth | 539 | Variable | Variable | 24 | I2V & T2V |
| Kling 1.5 | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
| CogVideoX-5B | Synth | 539 | 6s | 720×480 | 8 | I2V & T2V |
| Sora | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
| Kling 2.1 | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
Applications
1) Benchmarking Video Generation Models
L3DE ranks video generation models by overall 3D visual coherence (Fusion) and reports per-aspect scores for appearance, motion, and geometry; higher is better, with real videos scoring near 1.0.
| Generator | Fusion | Appearance | Motion | Geometry |
| --- | --- | --- | --- | --- |
| Runway-Gen3 | 0.7162 | 0.6946 | 0.5768 | 0.6739 |
| MiniMax | 0.7932 | 0.7714 | 0.6098 | 0.7251 |
| Vidu | 0.7052 | 0.6406 | 0.6228 | 0.7615 |
| Luma 1.6 | 0.5062 | 0.4950 | 0.5853 | 0.6800 |
| Kling 1.5 | 0.7518 | 0.7247 | 0.5926 | 0.6927 |
| CogVideoX-5B | 0.6104 | 0.5893 | 0.6203 | 0.7539 |
| Sora | 0.8895 | 0.8394 | 0.6467 | 0.7458 |
| Kling 2.1 | 0.8904 | 0.8129 | 0.6735 | 0.7623 |
| Real Videos | 0.9999 | 0.9950 | 0.8321 | 0.8435 |
2) Fake Video Detection
Videos are classified as real or synthetic by thresholding the L3DE score; we compare against image-based detectors. A minimal thresholding sketch follows the table.
| Method | Input | MiniMax | Kling 1.5 | Runway-Gen3 | Luma | CogVideoX | Vidu | Sora | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNNDetection | Image | 49.92 | 50.02 | 50.00 | 50.45 | 50.07 | 50.00 | 49.91 | 50.05 |
| DIRE | Image | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
| NPR | Image | 60.19 | 67.91 | 64.99 | 54.06 | 35.79 | 36.04 | 60.82 | 54.25 |
| L3DE | Video | 66.51 | 82.52 | 72.19 | 83.38 | 76.73 | 70.01 | 56.31 | 73.14 |
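A minimal sketch of score-threshold detection, assuming higher L3DE scores indicate real videos (consistent with the benchmark table above); the threshold, the score arrays, and the balanced-accuracy readout are illustrative, and the paper's exact decision rule may differ:

```python
import numpy as np

def detect_fake(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag a clip as synthetic when its L3DE score falls below the threshold."""
    return scores < threshold  # True = predicted synthetic

# Illustrative scores only; balanced accuracy over real and synthetic sets.
real_scores = np.array([0.98, 0.95, 0.99])
synth_scores = np.array([0.42, 0.71, 0.35])
accuracy = 0.5 * ((~detect_fake(real_scores)).mean()
                  + detect_fake(synth_scores).mean())
print(f"balanced accuracy: {accuracy:.2%}")
```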
3) Video Refinement
L3DE's Grad-CAM maps localize artifact regions. We propagate the highlighted masks across frames (e.g., with SAM-2) and apply 3D-consistent inpainting (e.g., LaMa within a 3D-GS loop) to remove artifacts while preserving content; a sketch of this loop follows.
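A minimal sketch of the refinement loop under the assumptions above; `gradcam`, `propagate_masks`, and `inpaint` are hypothetical wrappers standing in for the Grad-CAM, SAM-2, and inpainting stages, and the 0.6 threshold is illustrative:

```python
def refine_video(video, l3de_net, gradcam, propagate_masks, inpaint,
                 cam_thresh: float = 0.6):
    """One pass of attribution-guided cleanup: localize, mask, inpaint."""
    cam = gradcam(l3de_net, video)              # (T, H, W) attribution maps
    seed_masks = cam > cam_thresh               # keep high-attribution regions
    masks = propagate_masks(video, seed_masks)  # e.g., SAM-2 across frames
    return inpaint(video, masks)                # e.g., 3D-consistent inpainting
```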

4) More Visualizations with Grad-CAM
L3DE localizes inconsistencies in generated videos, such as texture artifacts, implausible dynamics, and occlusion errors, across its appearance, motion, and geometry cues. A minimal Grad-CAM sketch for a 3D ConvNet follows.
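This is standard Grad-CAM adapted to a 3D ConvNet, assuming a PyTorch model with a single real-vs-synthetic logit and a caller-chosen `target_layer`; the paper's attribution procedure may differ in details:

```python
import torch
import torch.nn.functional as F

def gradcam_3d(model, video, target_layer):
    """Weight the target layer's activations by the spatially averaged gradient
    of the "real" logit, ReLU, and upsample to the clip's (T, H, W) size."""
    feats, grads = {}, {}
    fh = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    bh = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    logit = model(video.unsqueeze(0)).squeeze()  # video: (C, T, H, W)
    model.zero_grad()
    logit.backward()
    fh.remove(); bh.remove()
    w = grads["g"].mean(dim=(2, 3, 4), keepdim=True)         # (1, C, 1, 1, 1)
    cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))  # (1, 1, T', H', W')
    cam = F.interpolate(cam, size=tuple(video.shape[1:]), mode="trilinear",
                        align_corners=False).squeeze()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```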


