How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation (L3DE)

ICCV 2025 · Video Generation · Video Evaluation · 3D Coherence
Chirui Chang, Jiahui Liu, Zhengzhe Liu, Xiaoyang Lyu, Yi-Hua Huang, Xin Tao, Pengfei Wan, Di Zhang, Xiaojuan Qi
The University of Hong Kong · Kling Team, Kuaishou Technology · Lingnan University
[Figure: L3DE teaser]

Abstract

L3DE is a practical, interpretable metric for judging how well AI-generated videos simulate the 3D visual world.

  • Uses three monocular cues—motion, geometry, appearance—instead of brittle full 3D reconstruction.
  • A lightweight 3D ConvNet separates real from synthetic videos and outputs a calibrated L3DE score (see the sketch below).
  • Attribution maps localize implausible regions (occlusion violations, texture inconsistencies, inconsistent dynamics).
  • Correlates with 3D reconstruction quality and human judgments; surfaces subtle inconsistencies in SOTA models.
  • Label-free, robust to diverse web videos, and easy to integrate.
  • Versatile: Supports deepfake detection, refinement-in-the-loop, generator benchmarking, and dataset filtering.
Label-free · 3D Coherence · Attribution Maps · Web-scale · Practical
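Below is a minimal, hypothetical sketch of how an L3DE-style score could be assembled: one small 3D ConvNet per monocular cue, each emitting a real-vs-synthetic probability, fused here by a plain average. The `CueScorer3D` class, the dummy cue tensors, and the averaging fusion are illustrative assumptions, not the paper's architecture or fusion rule.

```python
import torch
import torch.nn as nn

class CueScorer3D(nn.Module):
    """Hypothetical per-cue scorer: small 3D conv stack + pooling + sigmoid."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) cue volume for one clip
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.head(h)).squeeze(1)  # score in (0, 1)

# One scorer per monocular cue; random tensors stand in for real cue volumes
# (e.g., optical flow for motion, monocular depth for geometry, RGB frames
# for appearance). The averaging fusion below is an assumption.
scorers = {c: CueScorer3D(3) for c in ("appearance", "motion", "geometry")}
cues = {c: torch.randn(1, 3, 16, 112, 112) for c in scorers}
per_cue = {c: scorers[c](cues[c]).item() for c in scorers}
fusion = sum(per_cue.values()) / len(per_cue)
print(per_cue, fusion)
```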
Why it matters

L3DE quantifies the “simulation gap” in 3D coherence and pinpoints artifacts for targeted fixes, enabling more reliable evaluation and practical debugging beyond conventional benchmarks.

See the paper and supplementary for full details.

Experimental Results

Dataset Overview

Our dataset comprises three parts: Paired Real/Synthetic Video Set, 3D Reconstruction Verification Set, and 3D Visual Simulation Benchmark.

| Source | Synth/Real | #Videos | Clip Length | Resolution | FPS | Prompt |
|---|---|---|---|---|---|---|
| Paired Real/Synthetic Video Set | | | | | | |
| Pexels | Real | 80,000 | 4s | Variable | Variable | – |
| Stable Video Diffusion | Synth | 80,000 | 4s | 1024×576 | 7 | I2V |
| 3D Reconstruction Verification Set | | | | | | |
| Kling 1.5 | Synth | 3,000 | 5s | Variable | 30 | I2V & T2V |
| 3D Visual Simulation Benchmark | | | | | | |
| Pexels | Real | 14,000 | 4s | Variable | Variable | – |
| Runway-Gen3 | Synth | 539 | 5s | 1280×768 | 24 | I2V & T2V |
| MiniMax | Synth | 539 | 5s | 1280×720 | 25 | I2V & T2V |
| Vidu | Synth | 539 | 3s | Variable | 24 | I2V & T2V |
| Luma Dream Machine 1.6 | Synth | 539 | Variable | Variable | 24 | I2V & T2V |
| Kling 1.5 | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
| CogVideoX-5B | Synth | 539 | 6s | 720×480 | 8 | I2V & T2V |
| Sora | Synth | 539 | 5s | Variable | 30 | I2V & T2V |
| Kling 2.1 | Synth | 539 | 5s | Variable | 30 | I2V & T2V |

Applications

1) Benchmarking Video Generation Models

L3DE ranks video generation models by overall 3D visual coherence (Fusion) and reports per-aspect scores.

| Generator | Fusion | Appearance | Motion | Geometry |
|---|---|---|---|---|
| Runway-Gen3 | 0.7162 | 0.6946 | 0.5768 | 0.6739 |
| MiniMax | 0.7932 | 0.7714 | 0.6098 | 0.7251 |
| Vidu | 0.7052 | 0.6406 | 0.6228 | 0.7615 |
| Luma 1.6 | 0.5062 | 0.4950 | 0.5853 | 0.6800 |
| Kling 1.5 | 0.7518 | 0.7247 | 0.5926 | 0.6927 |
| CogVideoX-5B | 0.6104 | 0.5893 | 0.6203 | 0.7539 |
| Sora | 0.8895 | 0.8394 | 0.6467 | 0.7458 |
| Kling 2.1 | 0.8904 | 0.8129 | 0.6735 | 0.7623 |
| Real Videos | 0.9999 | 0.9950 | 0.8321 | 0.8435 |
Fusion is the primary ranking signal; real videos provide an empirical upper bound.
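As a usage illustration, ranking generators by the Fusion column takes a few lines; the dictionary below simply restates the table above.

```python
# Ranking sketch: sort generators by the Fusion score from the table above.
scores = {
    "Runway-Gen3": 0.7162, "MiniMax": 0.7932, "Vidu": 0.7052,
    "Luma 1.6": 0.5062, "Kling 1.5": 0.7518, "CogVideoX-5B": 0.6104,
    "Sora": 0.8895, "Kling 2.1": 0.8904,
}
for rank, (name, fusion) in enumerate(
        sorted(scores.items(), key=lambda kv: kv[1], reverse=True), start=1):
    print(f"{rank}. {name}  Fusion={fusion:.4f}")
```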

2) Fake Video Detection

Videos are classified as real or synthetic by thresholding their L3DE scores; we compare against image-based detectors (decision rule sketched after the table).

| Method | Input | MiniMax | Kling 1.5 | Runway-Gen3 | Luma | CogVideoX | Vidu | Sora | Average |
|---|---|---|---|---|---|---|---|---|---|
| CNNDetection | Image | 49.92 | 50.02 | 50.00 | 50.45 | 50.07 | 50.00 | 49.91 | 50.05 |
| DIRE | Image | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |
| NPR | Image | 60.19 | 67.91 | 64.99 | 54.06 | 35.79 | 36.04 | 60.82 | 54.25 |
| L3DE | Video | 66.51 | 82.52 | 72.19 | 83.38 | 76.73 | 70.01 | 56.31 | 73.14 |
Accuracy (%) on our fake video detection benchmark.
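A minimal sketch of the decision rule, assuming lower L3DE scores indicate synthetic content (real videos score near 1.0 in the benchmarking table); `detect_fake` and the 0.5 cutoff are illustrative, not the calibrated threshold used in the paper.

```python
import numpy as np

def detect_fake(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag a clip as synthetic when its L3DE score falls below the threshold."""
    return scores < threshold

def accuracy(scores: np.ndarray, is_fake: np.ndarray,
             threshold: float = 0.5) -> float:
    return float(np.mean(detect_fake(scores, threshold) == is_fake))

scores = np.array([0.92, 0.31, 0.78, 0.12])    # L3DE scores, one per clip
labels = np.array([False, True, False, True])  # ground truth: True = synthetic
print(f"accuracy: {accuracy(scores, labels):.2%}")
```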

3) Video Refinement

L3DE Grad‑CAM localizes artifact regions. We propagate masks (e.g., SAM‑2) across frames and apply 3D‑consistent inpainting (e.g., LaMa in a 3D‑GS loop) to remove artifacts while preserving content.

[Figure] Example of refinement guided by L3DE activations.
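A minimal sketch of the first step of this pipeline: converting a low-resolution Grad-CAM volume into per-frame binary masks by quantile thresholding. The SAM-2 propagation and 3D-consistent inpainting stages are only noted in comments, and the quantile level `q` is an assumption.

```python
import torch
import torch.nn.functional as F

def cam_to_masks(cam: torch.Tensor, size: tuple, q: float = 0.9) -> torch.Tensor:
    """cam: (T', h, w) low-res Grad-CAM volume -> (T, H, W) boolean masks."""
    T, H, W = size
    # Upsample the CAM volume to the clip's temporal/spatial resolution.
    cam = F.interpolate(cam[None, None], size=(T, H, W),
                        mode="trilinear", align_corners=False)[0, 0]
    thresh = torch.quantile(cam.flatten(), q)  # keep the top (1 - q) activations
    return cam >= thresh

cam = torch.rand(4, 7, 7)                    # dummy Grad-CAM from the L3DE net
masks = cam_to_masks(cam, size=(16, 576, 1024))
print(masks.shape, masks.float().mean())     # fraction of pixels flagged
# Next (hypothetical): seed SAM-2 with these masks, propagate them across
# frames, then inpaint the masked regions with a 3D-consistent method
# (e.g., LaMa inside a 3D-GS loop) as described above.
```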

4) More Visualizations with Grad‑CAM

L3DE's attribution maps localize inconsistencies in generated videos, including texture artifacts, implausible dynamics, and occlusion errors, across the appearance, motion, and geometry aspects.
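For readers who want to reproduce this style of attribution, here is a standard Grad-CAM sketch on a 3D ConvNet. torchvision's `r3d_18` stands in for the L3DE network (whose weights are not assumed here), and the hooked layer and target scalar are illustrative choices.

```python
import torch
import torch.nn.functional as F
from torchvision.models.video import r3d_18

model = r3d_18(weights=None).eval()  # stand-in for the L3DE network
acts, grads = {}, {}
# Capture activations and gradients at the last conv stage.
model.layer4.register_forward_hook(lambda m, i, o: acts.update(v=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

clip = torch.randn(1, 3, 16, 112, 112)  # (B, C, T, H, W) video clip
score = model(clip)[0].max()            # scalar whose attribution we want
score.backward()

w = grads["v"].mean(dim=(2, 3, 4), keepdim=True)  # gradient-pooled weights
cam = F.relu((w * acts["v"]).sum(dim=1))[0]       # (T', h, w) heatmap
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)  # upsample to clip size to localize implausible regions
```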