🗺️ LingBot-Map

Geometric Context Transformer for Streaming 3D Reconstruction

20 FPS real-time inference
10K+ frames per sequence
98.98 ETH3D F1 score
61.64 Oxford Spires AUC@15

🔮 Run Inference Remotely

Upload a short video and this tab calls the LingBot-Map model remotely via the dennny123/lingbot-3d ZeroGPU Space API.

⚠️ Note: The first run may take 2 to 5 minutes because of the cold start (GPU allocation plus a 4.3 GB model download). Only one request is processed at a time.

Inputs:

Input video
Sampling FPS (slider, 1 to 12)
Max frames (slider, 2 to 24)
Scale frames (slider, 1 to 8)
Keyframe interval (slider, 1 to 8)
Confidence percentile (slider, 0 to 90)

Outputs:

3D Result
Frame preview
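
How the sliders interact is not documented on this page. The sketch below shows one plausible reading, in which Sampling FPS picks evenly spaced frames from the source video, Max frames caps the count, and Confidence percentile sets a cutoff under which points are dropped. The function names and the exact rules are assumptions, not the Space's actual code.

```python
def sample_frame_indices(n_frames, video_fps, sampling_fps, max_frames):
    """Pick source-frame indices at roughly `sampling_fps`, capped at `max_frames`."""
    if sampling_fps <= 0 or video_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = max(video_fps / sampling_fps, 1.0)  # never sample faster than the source
    indices = []
    pos = 0.0
    while int(pos) < n_frames and len(indices) < max_frames:
        indices.append(int(pos))
        pos += step
    return indices


def confidence_threshold(confidences, percentile):
    """Nearest-rank cutoff: roughly `percentile` percent of values fall below it."""
    ordered = sorted(confidences)
    rank = int(len(ordered) * percentile / 100)
    return ordered[min(rank, len(ordered) - 1)]
```

Under these assumptions, a 30 FPS clip sampled at 10 FPS keeps every third frame until the Max frames cap is reached.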

🗺️ LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction

LingBot-Map is a feed-forward 3D foundation model that reconstructs 3D scenes from video streams in real time at ~20 FPS.

Given a continuous video stream, it recovers:

📷 Camera poses for each frame
🌊 Depth maps per frame
☁️ 3D point clouds of the scene

Unlike traditional SLAM systems that rely on iterative optimization, LingBot-Map does this in a single forward pass through a transformer.
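
The three outputs are related: a depth map, together with the camera pose and intrinsics for the same frame, yields world-space points. A minimal sketch of that pinhole unprojection follows. The intrinsics (fx, fy, cx, cy), the camera-to-world convention, and the function name are illustrative assumptions, and the model's point-map head may predict points directly rather than derive them this way.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Lift pixel (u, v) with a depth value to a 3D point in world coordinates.

    rotation is a 3x3 camera-to-world matrix (list of rows); translation is the
    camera centre in world coordinates. Both conventions are assumptions.
    """
    # Back-project through a pinhole camera into the camera frame.
    x_cam = (u - cx) / fx * depth
    y_cam = (v - cy) / fy * depth
    p_cam = (x_cam, y_cam, depth)
    # Apply the rigid camera-to-world transform: p_world = R @ p_cam + t.
    return tuple(
        sum(rotation[row][k] * p_cam[k] for k in range(3)) + translation[row]
        for row in range(3)
    )
```

Applying this to every pixel of every frame and concatenating the results gives a scene point cloud in a shared coordinate frame.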

Paper

"Geometric Context Transformer for Streaming 3D Reconstruction"
Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu

📄 arXiv:2604.14141 | 🌐 Project Page | 💻 GitHub

🧠 Architecture: Geometric Context Transformer (GCT)

The core innovation is Geometric Context Attention (GCA), an attention mechanism that combines three complementary context types:

1️⃣ Anchor Context — Coordinate Grounding

Provides a stable spatial reference frame. Without it, the model's coordinate system can drift or become ambiguous over long sequences. The anchor acts as a "north star" that grounds all predictions.

2️⃣ Pose-Reference Window — Dense Geometric Cues

A sliding window of recent frames that supplies rich local geometric information. This is where the model gets its precise relative pose estimates and dense depth predictions — analogous to how classical SLAM uses co-visibility frames for local bundle adjustment.

3️⃣ Trajectory Memory — Long-Range Drift Correction

A compact, long-range memory that prevents accumulated drift over thousands of frames. Unlike the dense pose-reference window, trajectory memory stores sparse keyframes that enable the model to "remember" places it saw long ago — similar to loop closure in traditional SLAM, but learned.
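
The three context types can be pictured as three sets of frame indices that the current frame attends to. The sketch below is a simplified illustration only: the anchor size, the window length, the fixed keyframe stride, and the function name are assumptions, and the actual selection rules in GCA may differ.

```python
def attention_context(t, num_anchor=1, window=8, keyframe_interval=6):
    """Return the frame indices that frame `t` would attend to, grouped by role."""
    # Anchor context: the first frame(s) fix the coordinate system.
    anchor = list(range(min(num_anchor, t)))
    # Pose-reference window: the most recent frames, kept densely.
    window_start = max(t - window, len(anchor))
    recent = list(range(window_start, t))
    # Trajectory memory: frames older than the window survive only as sparse keyframes.
    memory = [i for i in range(len(anchor), window_start)
              if i % keyframe_interval == 0]
    return {"anchor": anchor, "trajectory_memory": memory, "pose_reference": recent}


print(attention_context(30))
```

In this toy version the context grows with t / keyframe_interval rather than with t, which is the property that makes long sequences affordable.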

Key Design: Paged KV Cache

Like autoregressive LLMs, LingBot-Map caches the key-value (KV) states of frames it has already processed. A paged KV cache, combined with a sliding window and keyframe selection, keeps memory bounded while retaining long-range context. This enables stable inference on sequences exceeding 10,000 frames.
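
To make the idea concrete, here is a toy paged cache: storage is split into fixed-size pages, each cached frame owns a list of pages, and evicting a frame returns its pages to a free list, so total memory never exceeds the pool. The class, its page size, and its eviction rule (drop the oldest frame that is neither the anchor nor a keyframe) are assumptions for illustration. A real cache would store tensors rather than page ids.

```python
class PagedFrameCache:
    """Toy page allocator: tracks which pages each frame's KV states occupy."""

    def __init__(self, num_pages, page_size, keyframe_interval=6):
        self.page_size = page_size
        self.keyframe_interval = keyframe_interval
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # frame index -> list of page ids (insertion-ordered)

    def _is_protected(self, frame):
        # Frame 0 plays the anchor role; every k-th frame is kept as a keyframe.
        return frame % self.keyframe_interval == 0

    def _evict_one(self):
        for frame in list(self.page_table):  # oldest first
            if not self._is_protected(frame):
                self.free_pages.extend(self.page_table.pop(frame))
                return
        raise MemoryError("pool exhausted by protected frames")

    def add_frame(self, frame, num_tokens):
        needed = -(-num_tokens // self.page_size)  # ceiling division
        while len(self.free_pages) < needed:
            self._evict_one()
        self.page_table[frame] = [self.free_pages.pop() for _ in range(needed)]


cache = PagedFrameCache(num_pages=8, page_size=4)
for frame in range(10):
    cache.add_frame(frame, num_tokens=8)  # each frame needs 2 pages
print(sorted(cache.page_table))
```

After ten frames the pool still holds only four, and frames 0 and 6 remain because the toy rule protects them as anchor and keyframe.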

Model Specs

Component	Detail
Backbone	DINOv2 ViT-L/14 with register tokens
Embed dim	1024
Patch size	14×14
Input resolution	518×378
Dense heads	Camera head + 2× DPT heads (depth + point map)
Positional encoding	3D RoPE (spatial + temporal)
Checkpoint size	~4.3 GB (4.63 GB in the checkpoint list; likely GiB vs. GB)
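
The patch size and input resolution fix the number of image tokens per frame. The arithmetic below assumes the 518×378 input (read here as width × height) is split into non-overlapping 14×14 patches, as in a standard ViT. The count excludes any class, register, or camera tokens, whose number is not stated here.

```python
def patch_grid(width, height, patch):
    """Return (columns, rows, total patches) for non-overlapping square patches."""
    if width % patch or height % patch:
        raise ValueError("resolution must be a multiple of the patch size")
    cols, rows = width // patch, height // patch
    return cols, rows, cols * rows


print(patch_grid(518, 378, 14))  # (37, 27, 999)
```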

📊 Benchmark Results

🏛️ Oxford Spires — Large-Scale Trajectory Estimation (Sparse: 320 frames)

Method	Type	AUC@15 ↑	AUC@30 ↑	ATE ↓	FPS
VGGT	Offline	23.84	35.09	24.78	—
DA3	Offline	49.84	56.68	12.87	—
Pi3	Offline	38.64	48.65	14.03	—
VIPE	Optim	45.35	51.88	10.52	—
DroidSLAM	Optim	8.58	21.41	21.84	—
CUT3R	Online	5.98	14.95	18.16	29.2
TTT3R	Online	13.92	25.90	19.35	29.0
Wint3R	Online	11.61	23.42	21.10	3.9
LingBot-Map	Online	61.64	75.16	6.42	20.3

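🏆 LingBot-Map posts the best AUC@15, AUC@30, and ATE in the table, ahead of every offline, optimization-based, and online method listed, while running as a streaming model at 20.3 FPS. CUT3R and TTT3R run faster (about 29 FPS) but with far lower accuracy.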
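
AUC@15 and AUC@30 summarize pose accuracy across all error thresholds up to 15° and 30°. This page does not define the metric. The sketch below uses a histogram-style definition that is common in pose-estimation evaluation (the fraction of errors under each 1° threshold, averaged over thresholds), which is an assumption here, as is the function name.

```python
def pose_auc(errors_deg, max_threshold):
    """Mean recall over integer thresholds 1..max_threshold degrees, in percent.

    errors_deg holds one angular error per frame pair, typically the larger of
    the rotation error and the translation-direction error.
    """
    if not errors_deg:
        raise ValueError("need at least one error value")
    recalls = []
    for threshold in range(1, max_threshold + 1):
        hits = sum(1 for e in errors_deg if e < threshold)
        recalls.append(hits / len(errors_deg))
    return 100.0 * sum(recalls) / max_threshold
```

Under this definition a score of 100 means every pair has an error below 1°, so small errors are rewarded more than errors that only just pass the outer threshold.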

Dense Setting (3,840 frames) — Drift Resistance

Method	ATE sparse ↓	ATE dense ↓	ΔATE
CUT3R	18.16	32.47	+14.31
Wint3R	21.10	32.90	+11.80
LingBot-Map	6.42	7.11	+0.69

LingBot-Map's ATE rises by only 0.69 (6.42 to 7.11) when the sequence grows 12× from 320 to 3,840 frames, while CUT3R and Wint3R each degrade by more than 11.
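
The ΔATE column is the dense ATE minus the sparse ATE. The snippet below recomputes it from the table and adds the relative increase, which is not in the original table but follows directly from its numbers.

```python
# (sparse ATE, dense ATE) copied from the table above.
ate = {
    "CUT3R": (18.16, 32.47),
    "Wint3R": (21.10, 32.90),
    "LingBot-Map": (6.42, 7.11),
}


def drift(sparse, dense):
    """Return (absolute increase, relative increase in percent)."""
    delta = dense - sparse
    return round(delta, 2), round(100 * delta / sparse, 1)


for method, (sparse, dense) in ate.items():
    delta, pct = drift(sparse, dense)
    print(f"{method}: +{delta:.2f} ({pct:.1f}% over sparse)")
```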

3D Reconstruction (F1 Score)

Method	ETH3D	7-Scenes	NRGBD
Wint3R	77.28	78.81	56.96
TTT3R	68.48	77.25	53.55
Stream3R	72.87	78.79	54.07
LingBot-Map	98.98	80.39	64.26

An ETH3D F1 of 98.98 is close to the metric's ceiling of 100 and more than 21 points above the next-best method listed (Wint3R, 77.28).
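
F1 for reconstruction is usually the harmonic mean of precision (predicted points that lie near the ground truth) and recall (ground-truth points that lie near the prediction) at a distance threshold. This page does not state the threshold or the alignment procedure, so the threshold is left as a parameter, and the brute-force nearest-neighbour search is only suitable for tiny point sets.

```python
import math


def _fraction_within(source, target, threshold):
    """Fraction of `source` points with some `target` point within `threshold`."""
    hits = sum(
        1 for p in source
        if min(math.dist(p, q) for q in target) <= threshold
    )
    return hits / len(source)


def f1_score(predicted, ground_truth, threshold):
    """Point-cloud F1 in percent at the given distance threshold."""
    precision = _fraction_within(predicted, ground_truth, threshold)
    recall = _fraction_within(ground_truth, predicted, threshold)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

Because both terms must be high, a reconstruction cannot score well by being sparse but accurate, or dense but noisy.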

🚀 Quick Start

Installation

conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e .
pip install flashinfer-python -i https://flashinfer.ai/whl/cu128/torch2.9/

Inference from Images

python demo.py --model_path lingbot-map.pt \
    --image_folder /path/to/images/

Inference from Video

python demo.py --model_path lingbot-map.pt \
    --video_path video.mp4 --fps 10

Long Sequences (10,000+ frames)

python demo.py --model_path lingbot-map.pt \
    --image_folder /path/to/images/ \
    --keyframe_interval 6

Windowed Mode (>3,000 frames)

python demo.py --model_path lingbot-map.pt \
    --video_path video.mp4 --fps 10 \
    --mode windowed --window_size 64

Without FlashInfer (PyTorch SDPA fallback)

python demo.py --model_path lingbot-map.pt \
    --image_folder /path/to/images/ --use_sdpa
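
A wrapper script can choose the flag automatically. The sketch below checks whether a module named flashinfer is importable (an assumption about the import name of the flashinfer-python package) and appends --use_sdpa otherwise. Only flags shown on this page are used, and the helper name is hypothetical.

```python
import importlib.util


def demo_command(model_path, image_folder, has_flashinfer=None):
    """Build the demo.py argument list, adding --use_sdpa when FlashInfer is absent."""
    if has_flashinfer is None:
        # Assumed import name; adjust if the installed package exposes another module.
        has_flashinfer = importlib.util.find_spec("flashinfer") is not None
    cmd = ["python", "demo.py", "--model_path", model_path,
           "--image_folder", image_folder]
    if not has_flashinfer:
        cmd.append("--use_sdpa")
    return cmd


print(" ".join(demo_command("lingbot-map.pt", "/path/to/images/")))
```

The returned list can be passed directly to subprocess.run.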

Model Checkpoints

Name	Size	Description
`lingbot-map.pt`	4.63 GB	Base model
`lingbot-map-long.pt`	4.63 GB	Long-sequence variant
`lingbot-map-stage1.pt`	4.76 GB	Stage 1 training checkpoint

All checkpoints are available at robbyant/lingbot-map.
