📟 PaGeR — Unified Panoramic Geometry Estimation Model Card

PaGeR is the unified geometry-estimation checkpoint released with our paper:

Paper: Unified Panoramic Geometry Estimation via Multi-View Foundation Models — arXiv:2605.26368

From a single equirectangular (ERP) panorama, one forward pass returns:

- Scale-invariant (SI) depth at full panoramic resolution, predicted by the dense depth head.
- Metric depth in metres, obtained by multiplying the SI depth by exp(log_scale), where log_scale is a single global log-scale predicted by a parallel coarse metric-scale head. Two such scale heads are trained, one for indoor and one for outdoor scenes, and exactly one runs per panorama (see routing below).
- Surface normals as unit vectors in the panorama's world frame.
- Sky segmentation for masking unbounded depth regions.

So the unified PaGeR checkpoint emits both the SI depth map and the metric depth map in one shot: the dense head fixes geometry, the scale head fixes absolute scale. If you only need metric depth and don't want to manage the indoor/outdoor scale routing, the depth-only prs-eth/PaGeR-metric-depth checkpoint predicts metric depth directly in a single head.
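
The relation between the two depth outputs can be sketched in a few lines of NumPy. This is a minimal illustration rather than the repository's code: `si_to_metric` is a hypothetical helper, and both the toy depth values and the log-scale are made up for the example.

```python
import numpy as np


def si_to_metric(si_depth: np.ndarray, log_scale: float) -> np.ndarray:
    """Scale an SI depth map to metres with one global log-scale (hypothetical helper)."""
    # exp() turns the predicted log-scale into a positive multiplicative factor.
    return si_depth * np.exp(log_scale)


# Toy 2x2 "depth map"; real maps have panoramic resolution.
si_depth = np.array([[1.0, 2.0], [0.5, 4.0]], dtype=np.float32)
# Made-up value standing in for the output of the active scale head.
log_scale = float(np.log(2.5))

metric_depth = si_to_metric(si_depth, log_scale)
```

Because the factor is a single scalar, depth ratios between pixels are identical in the SI and metric maps; only the absolute scale changes.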

Indoor and outdoor scenes are served by twin scale heads, so a single checkpoint covers both regimes. The active head can be selected manually or routed automatically — see Model Details below for how that routing is done at inference time.

You can also browse the rest of our PaGeR HF collection or try the interactive demo.

Model Details

Developed by: Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger, Denis Rozumny, Konrad Schindler, Nikolai Kalischek.
Model type: Feed-forward, multi-view foundation-model adaptation for single-image panoramic geometry estimation (depth + normals + sky + metric scale).
Backbone: Depth Anything 3 (da3-giant, ViT-Giant), repurposed for cubemap-based multi-view processing of the panorama.
Inputs: A single ERP panorama, internally projected onto a 6-face cubemap at 504 px per face.
Outputs (in one forward pass):
- Scale-invariant (SI) depth map at panoramic resolution, from the dense depth head.
- Metric depth (metres), computed as SI_depth * exp(log_scale) where log_scale is the single global log-scale predicted by the active (indoor or outdoor) coarse metric-scale head. Both the SI and metric maps share the same dense geometry; the scale head only injects absolute scale.
- Surface normals as unit vectors in the panorama's world frame.
- Sky mask for filling/masking unbounded regions in the depth and normal outputs.
Indoor / outdoor routing (inference-time add-on, not part of the paper): The paper's per-domain numbers were produced with the indoor or outdoor scale head selected manually per dataset. For convenience in the released demo and CLI, a small zero-shot CLIP ViT-B/32 classifier (loaded via open_clip) can auto-pick between the twin scale heads at inference time, by scoring the 4 equatorial cubemap faces against two text-prompt centroids ("indoor scene" vs. "outdoor scene"). The router lives outside the checkpoint and can be overridden by the user (--scene_mode {auto,indoor,outdoor}).
Resolution: Designed for high-resolution ERP inputs, up to 3K.
License: CC BY-NC 4.0 — academic / non-commercial use only. The released weights are derivative works of the Depth Anything 3 da3-giant backbone, released by ByteDance under CC BY-NC 4.0, and inherit that restriction. The CLIP ViT-B/32 weights used for the indoor/outdoor router are loaded from open_clip at runtime; they are MIT-licensed and travel separately, so they do not propagate any additional restriction. Commercial use is not permitted.
Resources for more information: Project Website, Paper, Code.
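
The indoor/outdoor routing described above amounts to a nearest-centroid decision on embeddings. The sketch below shows that decision step only, under stated assumptions: `route_scene` is a hypothetical helper, the 3-D vectors stand in for real CLIP image and text features, and averaging the per-face cosine scores is an assumed aggregation rule, not necessarily what the released router does.

```python
import numpy as np


def route_scene(face_embeddings: np.ndarray,
                indoor_centroid: np.ndarray,
                outdoor_centroid: np.ndarray) -> str:
    """Return 'indoor' or 'outdoor' from per-face embeddings (hypothetical helper)."""

    def unit(v: np.ndarray) -> np.ndarray:
        # L2-normalise along the feature axis so dot products are cosines.
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    faces = unit(face_embeddings)                                    # (4, D)
    centroids = unit(np.stack([indoor_centroid, outdoor_centroid]))  # (2, D)
    scores = faces @ centroids.T                                     # (4, 2) cosine similarities
    mean_scores = scores.mean(axis=0)                                # assumed: average over faces
    return ("indoor", "outdoor")[int(np.argmax(mean_scores))]


# Toy 3-D embeddings standing in for CLIP features.
indoor_c = np.array([1.0, 0.0, 0.0])
outdoor_c = np.array([0.0, 1.0, 0.0])
faces = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.3, 0.1],
    [0.7, 0.2, 0.0],
    [0.4, 0.6, 0.0],  # one face leans "outdoor"; the mean score still favours indoor
])
scene_mode = route_scene(faces, indoor_c, outdoor_c)
```

Passing `--scene_mode indoor` or `--scene_mode outdoor` on the command line bypasses this kind of automatic decision entirely.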

Other released checkpoints

| Checkpoint | Hugging Face id | Depth | Normals | Sky |
| --- | --- | --- | --- | --- |
| PaGeR (this card, recommended) | `prs-eth/PaGeR` | ✅ | ✅ | ✅ |
| PaGeR-Metric-Depth | `prs-eth/PaGeR-metric-depth` | ✅ (metric) | | |
| PaGeR-Normals | `prs-eth/PaGeR-normals` | | ✅ | |

Usage

The following minimal Python snippet runs the unified model on a single panorama and produces metric depth, surface normals, and a sky mask in one forward pass. It assumes you have cloned the repository and installed it with `pip install -e .`, so that src.pager is importable; checkpoint weights and config are downloaded from the Hub on first use.

import matplotlib.pyplot as plt
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
from PIL import Image

from src.pager import Pager
from src.utils.geometry_utils import erp_to_cubemap
from src.utils.utils import prepare_depth_for_logging, prepare_normals_for_logging

checkpoint = "prs-eth/PaGeR"          # or a local directory
device = torch.device("cuda")

# 1. Load the model config from the Hub and instantiate Pager.
config_path = hf_hub_download(repo_id=checkpoint, filename="config.yaml")
cfg = OmegaConf.load(config_path)

pager = Pager(checkpoint, cfg=cfg, device=device)
pager.get_intrinsics_extrinsics(image_size=cfg.face_size, fov=getattr(cfg, "cube_fov", 90.0))
pager.model.to(device).eval()

# 2. Load a panorama and project it to the 6-face cubemap PaGeR consumes.
panorama = np.array(Image.open("assets/examples/apartment_synth.jpg").convert("RGB")) / 255.0
panorama = torch.from_numpy(panorama).permute(2, 0, 1).float() * 2 - 1
rgb_cubemap = erp_to_cubemap(panorama, face_w=cfg.face_size,
                             fov=getattr(cfg, "cube_fov", 90.0)).unsqueeze(0).to(device)

# 3. Run one forward pass. The unified checkpoint carries both indoor and
#    outdoor scale heads; pass ``skip_heads`` to keep exactly one of them
#    (here: force the outdoor head by skipping ``scale_indoor``). The full
#    CLI in ``inference.py`` instead routes each panorama automatically via
#    a small CLIP ViT-B/32 classifier on the cubemap faces.
with torch.inference_mode():
    pred = pager(rgb_cubemap, dtype=torch.float16, skip_heads={"scale_indoor"})

# 4. Convert raw head outputs into ERP-resolution arrays:
#    - depth: SI depth × exp(log_scale) → metric depth (metres), with the
#      predicted sky region filled to ``MAX_DEPTH`` via a soft alpha blend.
#    - normals: unit vectors in the panorama's world frame, sky-filled.
cmap = plt.get_cmap("Spectral")
H, W = panorama.shape[-2:]
depth_metric, depth_viz = prepare_depth_for_logging(
    pager, pred["depth"][0], pred["sky"][0], (H, W), cmap,
    log_scale=pred["scale"],
)
normals, normals_viz = prepare_normals_for_logging(
    pager, pred["normals"][0], pred["sky"][0], (H, W),
)

depth_metric is a (1, H, W) float32 array of metric depth (metres); normals is a (3, H, W) unit-normal field. Both already have the predicted sky region filled in. The *_viz companions are uint8 RGB previews (Spectral-coloured for depth, per-sample rescaled for normals). See the GitHub repository for the full CLI (inference.py), evaluation scripts, the Gradio demo (app.py), and the point-cloud exporter.
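
The sky filling mentioned above can be pictured as a per-pixel linear blend between the predicted depth and a far-plane constant. The following is a minimal sketch of that idea, not the body of `prepare_depth_for_logging`: `fill_sky` is a hypothetical helper, the `MAX_DEPTH` value of 100 m is an assumption chosen for illustration, and the sky weights are assumed to lie in [0, 1].

```python
import numpy as np

# Hypothetical far-plane value in metres; not taken from the repository.
MAX_DEPTH = 100.0


def fill_sky(depth: np.ndarray, sky_alpha: np.ndarray,
             max_depth: float = MAX_DEPTH) -> np.ndarray:
    """Blend depth towards max_depth where the sky weight is high (hypothetical helper)."""
    # Clamp so the blend stays a convex combination of depth and max_depth.
    alpha = np.clip(sky_alpha, 0.0, 1.0)
    return (1.0 - alpha) * depth + alpha * max_depth


depth = np.array([[2.0, 10.0, 40.0]])
sky_alpha = np.array([[0.0, 0.5, 1.0]])  # 0 = not sky, 1 = sky
filled = fill_sky(depth, sky_alpha)
```

A soft weight, rather than a hard binary mask, avoids an abrupt depth jump along the horizon where the sky prediction is uncertain.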

Citation

If you use this checkpoint in your work, please cite:

@article{bozic2026pager,
  title   = {Unified Panoramic Geometry Estimation via Multi-View Foundation Models},
  author  = {Bozic, Vukasin and Slavkovic, Isidora and Narnhofer, Dominik and
             Metzger, Nando and Rozumny, Denis and Schindler, Konrad and
             Kalischek, Nikolai},
  journal = {arXiv preprint arXiv:2605.26368},
  year    = {2026}
}