LingBot-Map
Geometric Context Transformer for Streaming 3D Reconstruction
Run Inference Remotely
Upload a short video and this tab calls the LingBot-Map model remotely via the dennny123/lingbot-3d ZeroGPU Space API.
Note: The first run may take 2-5 minutes because of a cold start (GPU allocation plus a 4.3 GB model download). Only one request is processed at a time.
LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction
LingBot-Map is a feed-forward 3D foundation model that reconstructs 3D scenes from video streams in real time, at about 20 FPS.
Given a continuous video stream, it recovers:
- Camera poses for each frame
- Depth maps per frame
- 3D point clouds of the scene
Unlike traditional SLAM systems that rely on iterative optimization, LingBot-Map does this in a single forward pass through a transformer.
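To illustrate how the three outputs relate to each other, the sketch below back-projects one depth map into a world-space point cloud using a camera pose. It assumes a pinhole camera with known intrinsics `K`, depth measured along the camera z-axis, and a 4x4 camera-to-world pose matrix. These names and conventions are illustrative assumptions, not LingBot-Map's actual output format.

```python
import numpy as np

def depth_to_world_points(depth, K, cam_to_world):
    """Back-project an (H, W) depth map into (H*W, 3) world-space points.

    Assumptions (not taken from the LingBot-Map code): pinhole intrinsics K (3x3),
    depth measured along the camera z-axis, and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Camera-space coordinates from the pinhole model.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Apply the rotation and translation of the camera-to-world pose.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t

# Toy example: a flat wall 2 m in front of a camera shifted 1 m along x.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0
points = depth_to_world_points(np.full((4, 5), 2.0), K, pose)
```

Stacking the points from every frame in a shared world frame gives a scene-level point cloud, which is why consistent poses matter as much as accurate depth.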
Paper
"Geometric Context Transformer for Streaming 3D Reconstruction"
Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu
arXiv:2604.14141 | Project Page | GitHub
Architecture: Geometric Context Transformer (GCT)
The core innovation is Geometric Context Attention (GCA), an attention mechanism that combines three complementary context types:
1. Anchor Context: Coordinate Grounding
Provides a stable spatial reference frame. Without it, the model's coordinate system can drift or become ambiguous over long sequences. The anchor acts as a "north star" that grounds all predictions.
2. Pose-Reference Window: Dense Geometric Cues
A sliding window of recent frames that supplies rich local geometric information. This is where the model gets its precise relative pose estimates and dense depth predictions, analogous to how classical SLAM uses co-visibility frames for local bundle adjustment.
3. Trajectory Memory: Long-Range Drift Correction
A compact, long-range memory that prevents accumulated drift over thousands of frames. Unlike the dense pose-reference window, trajectory memory stores sparse keyframes that enable the model to "remember" places it saw long ago. This is similar to loop closure in traditional SLAM, but learned.
Key Design: Paged KV Cache
Like autoregressive LLMs, LingBot-Map caches KV states of processed frames. It uses a paged KV cache with sliding window + keyframe selection to keep memory bounded while retaining long-range context. This enables stable inference on sequences exceeding 10,000 frames.
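The retention policy described above can be sketched as a toy frame selector. This is a minimal illustration, assuming a fixed number of anchor frames, a fixed sliding-window length, a fixed keyframe interval, and a hard cap on stored keyframes. All class and parameter names and their values are hypothetical, and the real paged KV cache stores attention key/value tensors rather than frame indices.

```python
from collections import deque

class ToyFrameCache:
    """Tracks which frame indices would keep their KV states (illustrative only)."""

    def __init__(self, num_anchor=1, window_size=8, keyframe_interval=6, max_keyframes=16):
        self.num_anchor = num_anchor
        self.keyframe_interval = keyframe_interval
        self.anchors = []                             # first frames, never evicted
        self.window = deque(maxlen=window_size)       # dense recent frames
        self.keyframes = deque(maxlen=max_keyframes)  # sparse long-range memory

    def add(self, frame_idx):
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(frame_idx)
            return
        if len(self.window) == self.window.maxlen:
            evicted = self.window[0]  # about to fall out of the dense window
            if evicted % self.keyframe_interval == 0:
                self.keyframes.append(evicted)  # keep it as sparse memory
        self.window.append(frame_idx)

    def cached_frames(self):
        return self.anchors + list(self.keyframes) + list(self.window)

cache = ToyFrameCache()
for i in range(10_000):
    cache.add(i)
print(len(cache.cached_frames()))  # 25 = 1 anchor + 16 keyframes + 8 window frames
```

The cache size stays constant however many frames are processed, which is the property that makes very long sequences tractable. Capping the keyframe count is one simple way to get that bound; the actual selection rule may differ.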
Model Specs
| Component | Detail |
|---|---|
| Backbone | DINOv2 ViT-L/14 with register tokens |
| Embed dim | 1024 |
| Patch size | 14×14 |
| Input resolution | 518×378 |
| Dense heads | Camera head + 2× DPT heads (depth + point map) |
| Positional encoding | 3D RoPE (spatial + temporal) |
| Checkpoint size | ~4.3 GB |
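As a quick check on these numbers, the input resolution divides evenly by the patch size, so the number of patch tokens per frame follows directly. The sketch below computes it, assuming 518 is the width and 378 the height. It counts only image patch tokens and ignores register, camera, or other special tokens, whose counts are not stated here.

```python
PATCH = 14
WIDTH, HEIGHT = 518, 378

def patch_grid(width, height, patch=PATCH):
    """Return (columns, rows) of non-overlapping patches; fail if not divisible."""
    if width % patch or height % patch:
        raise ValueError("resolution must be a multiple of the patch size")
    return width // patch, height // patch

cols, rows = patch_grid(WIDTH, HEIGHT)
tokens_per_frame = cols * rows
print(cols, rows, tokens_per_frame)  # 37 27 999
```

At roughly a thousand tokens per frame, attending to every past frame would grow quickly, which motivates the bounded cache described earlier.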
Benchmark Results
Oxford Spires: Large-Scale Trajectory Estimation (sparse setting, 320 frames)
| Method | Type | AUC@15 ↑ | AUC@30 ↑ | ATE ↓ | FPS |
|---|---|---|---|---|---|
| VGGT | Offline | 23.84 | 35.09 | 24.78 | - |
| DA3 | Offline | 49.84 | 56.68 | 12.87 | - |
| Pi3 | Offline | 38.64 | 48.65 | 14.03 | - |
| VIPE | Optim | 45.35 | 51.88 | 10.52 | - |
| DroidSLAM | Optim | 8.58 | 21.41 | 21.84 | - |
| CUT3R | Online | 5.98 | 14.95 | 18.16 | 29.2 |
| TTT3R | Online | 13.92 | 25.90 | 19.35 | 29.0 |
| Wint3R | Online | 11.61 | 23.42 | 21.10 | 3.9 |
| LingBot-Map | Online | 61.64 | 75.16 | 6.42 | 20.3 |
LingBot-Map achieves the best AUC@15, AUC@30, and ATE of all offline, optimization-based, and online methods listed, despite running as a streaming model.
Dense Setting (3,840 frames): Drift Resistance
| Method | ATE sparse ↓ | ATE dense ↓ | ΔATE |
|---|---|---|---|
| CUT3R | 18.16 | 32.47 | +14.31 |
| Wint3R | 21.10 | 32.90 | +11.80 |
| LingBot-Map | 6.42 | 7.11 | +0.69 |
LingBot-Map's ATE increases by only 0.69 on sequences that are 12× longer.
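The drift figures can be re-derived from the reported ATE values. The short sketch below recomputes the absolute increase and adds the relative increase, which is not in the table but follows from the same numbers. The helper name `drift` is only illustrative.

```python
# ATE values copied from the sparse (320-frame) and dense (3,840-frame) results above.
ate = {
    "CUT3R": (18.16, 32.47),
    "Wint3R": (21.10, 32.90),
    "LingBot-Map": (6.42, 7.11),
}

def drift(sparse, dense):
    """Return (absolute increase, relative increase in percent)."""
    delta = dense - sparse
    return round(delta, 2), round(100.0 * delta / sparse, 1)

length_ratio = 3840 / 320  # dense sequences are 12x longer
for name, (sparse, dense) in ate.items():
    delta, pct = drift(sparse, dense)
    print(f"{name}: +{delta:.2f} ATE ({pct:.1f}% increase)")
```

In relative terms, LingBot-Map's error grows by about 11%, compared with about 79% for CUT3R and 56% for Wint3R.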
3D Reconstruction (F1 Score)
| Method | ETH3D | 7-Scenes | NRGBD |
|---|---|---|---|
| Wint3R | 77.28 | 78.81 | 56.96 |
| TTT3R | 68.48 | 77.25 | 53.55 |
| Stream3R | 72.87 | 78.79 | 54.07 |
| LingBot-Map | 98.98 | 80.39 | 64.26 |
An ETH3D F1 of 98.98 indicates nearly perfect reconstruction.
Quick Start
Installation
conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install -e .
pip install flashinfer-python -i https://flashinfer.ai/whl/cu128/torch2.9/
Inference from Images
python demo.py --model_path lingbot-map.pt \
--image_folder /path/to/images/
Inference from Video
python demo.py --model_path lingbot-map.pt \
--video_path video.mp4 --fps 10
Long Sequences (10,000+ frames)
python demo.py --model_path lingbot-map.pt \
--image_folder /path/to/images/ \
--keyframe_interval 6
Windowed Mode (>3000 frames)
python demo.py --model_path lingbot-map.pt \
--video_path video.mp4 --fps 10 \
--mode windowed --window_size 64
Without FlashInfer (PyTorch SDPA fallback)
python demo.py --model_path lingbot-map.pt \
--image_folder /path/to/images/ --use_sdpa
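For scripted runs, the command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository. It uses only the `demo.py` flags shown in this section, does not validate them against the actual script, and assumes that `--fps` applies only to video input because that is the only way it appears here.

```python
import shlex

def build_demo_command(model_path, image_folder=None, video_path=None, fps=None,
                       keyframe_interval=None, window_size=None, use_sdpa=False):
    """Assemble an argv list for demo.py from the flags shown in this README."""
    if (image_folder is None) == (video_path is None):
        raise ValueError("pass exactly one of image_folder or video_path")
    cmd = ["python", "demo.py", "--model_path", model_path]
    if image_folder is not None:
        cmd += ["--image_folder", image_folder]
    else:
        cmd += ["--video_path", video_path]
        if fps is not None:
            cmd += ["--fps", str(fps)]
    if keyframe_interval is not None:
        cmd += ["--keyframe_interval", str(keyframe_interval)]
    if window_size is not None:
        # Windowed mode is requested together with its window size, as in the example above.
        cmd += ["--mode", "windowed", "--window_size", str(window_size)]
    if use_sdpa:
        cmd.append("--use_sdpa")
    return cmd

cmd = build_demo_command("lingbot-map.pt", video_path="video.mp4", fps=10, window_size=64)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.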
Model Checkpoints
| Name | Size | Description |
|---|---|---|
| lingbot-map.pt | 4.63 GB | Base model |
| lingbot-map-long.pt | 4.63 GB | Long-sequence variant |
| lingbot-map-stage1.pt | 4.76 GB | Stage 1 training checkpoint |
All checkpoints are available at robbyant/lingbot-map.