VLM
3D Scene Understanding
Data preparation
Download the JSON files from here and move them to ./dataset/eval/3dscene/.
Download the metadata from EmbodiedScan. You need to fill out the official form to get access to the dataset. Move the embodiedscan_infos_*.pkl files to ./dataset/eval/embodiedscan/.
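A quick way to confirm the metadata files load correctly is a check like the one below; it only assumes they are standard pickle files, so adjust the printout to whatever structure you actually see.
# Minimal sanity check for the EmbodiedScan metadata (sketch; makes no assumption about the contents).
import pickle

for split in ("train", "val", "test"):
    path = f"./dataset/eval/embodiedscan/embodiedscan_infos_{split}.pkl"
    with open(path, "rb") as f:
        infos = pickle.load(f)
    # Print the container type and, if it is a dict, a few top-level keys.
    if isinstance(infos, dict):
        print(split, type(infos).__name__, list(infos.keys())[:5])
    else:
        print(split, type(infos).__name__, len(infos))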
Download the images from Video3DLLM. Then run:
cd Video-3D-LLM_data
# unzip posed images
cat posed_images_part* > posed_images.tar.gz
tar -xzf posed_images.tar.gz
# unzip mask
unzip mask.zip
# unzip pcd
tar -xzf pcd_with_object_aabbs.tar.gz
mkdir scannet
mv posed_images/ scannet/
mv mask/ scannet/
mv data/scannet/pcd_with_object_aabbs/ scannet/
Move the scannet directory to ./dataset/eval/.
The file structure under ./dataset/eval/ should then look as follows.
./dataset/eval/
├── 3dscene
│   ├── scannet_det_val_4frames.json
│   ├── scannet_select_frames.json
│   ├── scanqa_val_llava_style.json
│   ├── scanrefer_val_32frames.json
│   └── sqa3d_test_llava_style.json
├── embodiedscan
│   ├── embodiedscan_infos_test.pkl
│   ├── embodiedscan_infos_train.pkl
│   └── embodiedscan_infos_val.pkl
└── scannet
    ├── mask
    ├── pcd_with_object_aabbs
    └── posed_images
6 directories, 9 files
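Before running inference, a short check like the following can verify that everything is in place (the paths are taken from the tree above).
# Verify the ./dataset/eval/ layout described above (simple existence check).
import os

expected = [
    "3dscene/scannet_det_val_4frames.json",
    "3dscene/scannet_select_frames.json",
    "3dscene/scanqa_val_llava_style.json",
    "3dscene/scanrefer_val_32frames.json",
    "3dscene/sqa3d_test_llava_style.json",
    "embodiedscan/embodiedscan_infos_test.pkl",
    "embodiedscan/embodiedscan_infos_train.pkl",
    "embodiedscan/embodiedscan_infos_val.pkl",
    "scannet/mask",
    "scannet/pcd_with_object_aabbs",
    "scannet/posed_images",
]
root = "./dataset/eval"
missing = [p for p in expected if not os.path.exists(os.path.join(root, p))]
print("all files in place" if not missing else f"missing: {missing}")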
Inference
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset sqa3d
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset scanqa
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset scanrefer
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_3dvqa --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset 3ddet
The results (*.json files) will be saved in ./results/.
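To run all four benchmarks back to back, a thin wrapper over the commands above could look like this; it simply invokes the torchrun commands listed, so adjust the GPU count and port as needed.
# Run the four 3D-scene evaluations sequentially (convenience wrapper around the commands above).
import subprocess

for dataset in ("sqa3d", "scanqa", "scanrefer", "3ddet"):
    cmd = [
        "torchrun", "--nproc_per_node=4", "--master_port=12345",
        "-m", "eval.vlm.eval.vqa.evaluate_3dvqa",
        "--model-path", "./pretrained_model/BAGEL-7B-MoT/",
        "--safetensor-path", "model.safetensors",
        "--dataset", dataset,
    ]
    subprocess.run(cmd, check=True)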
Evaluation
SQA3D
python eval/vlm/eval/vqa/3dvqa_eval.py --dataset sqa3d --input-file ./results/sqa3d.json
# output
EM-all: 59.16453537936914
EM-what: 51.787271142109844
EM-which: 50.427350427350426
EM-can: 68.04733727810651
EM-is: 73.15950920245399
EM-how: 60.86021505376345
EM-others: 56.71378091872792
EM-R-all: 62.432509235578294
EM-R-what: 58.326068003487364
EM-R-which: 51.566951566951566
EM-R-can: 68.04733727810651
EM-R-is: 75.61349693251533
EM-R-how: 61.29032258064516
EM-R-others: 59.8939929328622
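EM is exact match against the ground-truth answers after light normalization, EM-R is a relaxed variant that also credits near matches, and the what/which/can/is/how/others rows break both down by question type. A rough sketch of the idea follows; the exact normalization and relaxation rule in 3dvqa_eval.py may differ.
# Sketch of exact match (EM) and a relaxed variant (EM-R); illustrative only.
import re

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)      # drop punctuation
    return re.sub(r"\s+", " ", text)

def em(pred: str, answers: list[str]) -> bool:
    return normalize(pred) in {normalize(a) for a in answers}

def em_relaxed(pred: str, answers: list[str]) -> bool:
    p = normalize(pred)
    # count as correct if the prediction and an answer contain each other
    return any(normalize(a) in p or p in normalize(a) for a in answers)

print(em("Yes", ["yes"]), em_relaxed("it is a brown chair", ["brown chair"]))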
ScanQA
python eval/vlm/eval/vqa/3dvqa_eval.py --dataset scanqa --input-file ./results/scanqa.json
# output
CIDER: 103.01606173581641
BLEU: 48.355630543562654, 32.50084859567164, 23.329984202585262, 16.194281050888
METEOR: 20.101447665118403
Rouge: 49.04130603336627
EM: 29.497326203208555
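These are standard captioning metrics (BLEU-1 through BLEU-4, CIDEr, METEOR, ROUGE-L) plus exact match. Such scores are typically computed with the pycocoevalcap package; a minimal sketch of that usage is below. METEOR is omitted here because it additionally requires a Java runtime, and the repo's script may wrap these metrics differently.
# Typical pycocoevalcap usage for ScanQA-style captioning metrics (illustrative sketch).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map a question id to a list of strings (multiple references are allowed).
gts = {"q0": ["a brown wooden chair", "a chair"]}
res = {"q0": ["a brown chair"]}

bleu_scores, _ = Bleu(4).compute_score(gts, res)   # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
cider_score, _ = Cider().compute_score(gts, res)
rouge_score, _ = Rouge().compute_score(gts, res)
print(bleu_scores, cider_score, rouge_score)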
ScanRefer
python eval/vlm/eval/vqa/3dgrounding_eval.py --input-file ./results/scanrefer.json
# output
all Acc@0.25: 50.82036180058898
all Acc@0.5: 45.0462768195204
multiple Acc@0.25: 44.877985123319846
multiple Acc@0.5: 39.34490408456218
unique Acc@0.25: 75.50135501355012
unique Acc@0.5: 68.72628726287263
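Acc@0.25 and Acc@0.5 report the fraction of referring expressions whose predicted 3D box overlaps the ground-truth box with an IoU of at least 0.25 or 0.5; the unique/multiple split distinguishes scenes with one versus several objects of the target class. Below is a minimal sketch of this accuracy for axis-aligned boxes; the center-plus-size box format is an assumption.
# Sketch of Acc@IoU for 3D grounding with axis-aligned boxes (cx, cy, cz, dx, dy, dz assumed).
import numpy as np

def aabb_iou(box_a, box_b):
    a_min = np.array(box_a[:3]) - np.array(box_a[3:]) / 2
    a_max = np.array(box_a[:3]) + np.array(box_a[3:]) / 2
    b_min = np.array(box_b[:3]) - np.array(box_b[3:]) / 2
    b_max = np.array(box_b[:3]) + np.array(box_b[3:]) / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = np.prod(box_a[3:]) + np.prod(box_b[3:]) - inter
    return inter / union

def acc_at(preds, gts, thr):
    # preds/gts: aligned lists of boxes; accuracy = share of pairs with IoU >= thr
    hits = [aabb_iou(p, g) >= thr for p, g in zip(preds, gts)]
    return 100.0 * sum(hits) / len(hits)

preds = [(1.0, 1.0, 0.5, 1.0, 1.0, 1.0)]
gts   = [(1.1, 1.0, 0.5, 1.0, 1.0, 1.0)]
print(acc_at(preds, gts, 0.25), acc_at(preds, gts, 0.5))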
3D Detection
python eval/vlm/eval/vqa/3ddet_eval.py --input-file ./results/3ddet.json
# output
+----------------------------------------------+
|             Metrics Per Category             |
+--------------+-----------+--------+----------+
| Category     | Precision | Recall | F1 Score |
+--------------+-----------+--------+----------+
| chair | 0.4807 | 0.5074 | 0.4937 |
| pillow | 0.1683 | 0.1881 | 0.1777 |
| cabinet | 0.1401 | 0.1608 | 0.1497 |
| table | 0.3744 | 0.3528 | 0.3633 |
| lamp | 0.1302 | 0.0986 | 0.1122 |
| couch | 0.4795 | 0.4862 | 0.4828 |
| desk | 0.3567 | 0.3896 | 0.3724 |
| stand | 0.4167 | 0.3509 | 0.3810 |
| bed | 0.7468 | 0.6797 | 0.7117 |
| backpack | 0.3130 | 0.3628 | 0.3361 |
| bathtub | 0.4343 | 0.4000 | 0.4164 |
| ottoman | 0.1250 | 0.0714 | 0.0909 |
| dresser | 0.4828 | 0.3825 | 0.4268 |
| bin | 0.3727 | 0.3162 | 0.3421 |
| toilet | 0.7720 | 0.7395 | 0.7554 |
| refrigerator | 0.3486 | 0.4176 | 0.3800 |
| stove | 0.7826 | 0.7347 | 0.7579 |
| microwave | 0.2453 | 0.1884 | 0.2131 |
| monitor | 0.2422 | 0.2770 | 0.2585 |
| computer | 0.1546 | 0.0968 | 0.1190 |
| window | 0.1297 | 0.0997 | 0.1127 |
| shelf | 0.1939 | 0.2184 | 0.2054 |
| curtain | 0.1260 | 0.1291 | 0.1275 |
| plant | 0.1538 | 0.0855 | 0.1099 |
| stairs | 0.3243 | 0.4000 | 0.3582 |
| picture | 0.0212 | 0.0212 | 0.0212 |
| book | 0.0348 | 0.0629 | 0.0448 |
| bottle | 0.0247 | 0.0284 | 0.0264 |
| lamp | 0.1302 | 0.0986 | 0.1122 |
| towl | 0.0000 | 0.0000 | 0.0000 |
| sink | 0.4752 | 0.4467 | 0.4605 |
+--------------+-----------+--------+----------+
+--------+---------------+------------+--------+
| Split | Avg Precision | Avg Recall | Avg F1 |
+--------+---------------+------------+--------+
| cate8 | 0.4751 | 0.4553 | 0.4644 |
| cate20 | 0.3783 | 0.3601 | 0.3670 |
| cate31 | 0.2961 | 0.2836 | 0.2877 |
+--------+---------------+------------+--------+
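The cate8/cate20/cate31 rows presumably macro-average the per-category results over subsets of 8, 20, and 31 categories. A toy sketch of how per-category precision/recall/F1 and such split averages are derived from matched counts follows; all numbers and the split membership below are placeholders, not the script's actual values.
# Per-category precision/recall/F1 from matched counts, plus a macro-average over a
# category split (counts and split membership are placeholders; the real values come
# from IoU matching inside 3ddet_eval.py).
counts = {
    # category: (true_positives, num_predictions, num_ground_truths)
    "chair": (50, 104, 99),
    "bed":   (30, 40, 44),
}

def prf(tp, n_pred, n_gt):
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gt if n_gt else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

per_cat = {c: prf(*v) for c, v in counts.items()}
splits = {"toy_split": ["chair", "bed"]}   # placeholder; cate8/cate20/cate31 are defined by the script
for name, cats in splits.items():
    avg = [sum(per_cat[c][i] for c in cats) / len(cats) for i in range(3)]
    print(name, [round(x, 4) for x in avg])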
During evaluation, the error
ERROR | __main__:threedod_process_results:357 - Error parsing prediction bbox
may appear. It does not affect the evaluation results.
VSI-Bench
Data preparation
Download VSI-Bench and put it in ./dataset/eval/.
Inference
torchrun --nproc_per_node=4 --master_port=12345 -m eval.vlm.eval.vqa.evaluate_vsibench --model-path ./pretrained_model/BAGEL-7B-MoT/ --safetensor-path model.safetensors --dataset vsibench
The results (*.json files) will be saved in ./results/.
Evaluation
python eval/vlm/eval/vqa/3dvqa_eval.py --dataset vsibench --input-file ./results/vsibench.json
# output
obj_appearance_order_accuracy: 49.029126213592235
object_abs_distance_MRA:.5:.95:.05: 46.402877697841724
object_counting_MRA:.5:.95:.05: 70.31858407079646
object_rel_distance_accuracy: 65.91549295774648
object_size_estimation_MRA:.5:.95:.05: 68.59391395592864
room_size_estimation_MRA:.5:.95:.05: 54.722222222222214
route_planning_accuracy: 33.50515463917525
object_rel_direction_accuracy: 54.404572390100945
overall: 55.3614930184255
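The metrics suffixed with MRA:.5:.95:.05 are Mean Relative Accuracy scores for numerical answers: a prediction counts as correct at a threshold t if its relative error is below 1 - t, and the score averages this over t = 0.50, 0.55, ..., 0.95; the remaining metrics are plain multiple-choice accuracy. A small illustrative sketch follows; the exact parsing and normalization in 3dvqa_eval.py may differ.
# Sketch of Mean Relative Accuracy (MRA) over thresholds 0.50..0.95 in steps of 0.05.
import numpy as np

def mra(pred: float, gt: float) -> float:
    thresholds = np.arange(0.5, 0.96, 0.05)
    rel_err = abs(pred - gt) / abs(gt)
    # correct at threshold t if the relative error is smaller than 1 - t
    return float(np.mean(rel_err < (1.0 - thresholds)))

print(mra(pred=9.0, gt=10.0))   # relative error 0.1 -> correct at 8 of 10 thresholds -> 0.8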
Novel View Synthesis
Data preparation
Download the RealEstate10K dataset from this link (provided by pixelSplat), unzip it, and put the data in YOUR_RAW_DATAPATH.
Run the following commands to preprocess the data into our format.
git clone https://github.com/zalkklop/LVSM.git
cd LVSM
python process_data.py --base_path YOUR_RAW_DATAPATH --output_dir ./dataset/eval/re10k/ --mode ['train' or 'test']
The file structure under ./dataset/eval/re10k/test/ should then look as follows.
./dataset/eval/re10k/test/
├── full_list.txt
├── images
│   ├── 000c3ab189999a83
│   └── ...
└── metadata
    ├── 000c3ab189999a83.json
    └── ...
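To confirm the preprocessing produced a consistent split, a quick check like the one below can be run; it assumes full_list.txt lists one scene id per line, which should be verified against the actual file.
# Check that every scene listed in full_list.txt has both an image folder and a metadata json.
# Assumption: full_list.txt contains one scene id per line (verify against your actual file).
import os

root = "./dataset/eval/re10k/test"
with open(os.path.join(root, "full_list.txt")) as f:
    scene_ids = [line.strip() for line in f if line.strip()]

missing = [
    sid for sid in scene_ids
    if not (os.path.isdir(os.path.join(root, "images", sid))
            and os.path.isfile(os.path.join(root, "metadata", f"{sid}.json")))
]
print(f"{len(scene_ids)} scenes listed, {len(missing)} missing assets")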
Evaluation
We provide a script to evaluate Omni-View on RE10k.
python inference.py --scene-id 000c3ab189999a83
| Argument | Default | Description |
|---|---|---|
| scene-id | None | The scene id in RE10k. |
| pose-id | None | The id of the camera trajectory in RE10k. Default: pose_id = scene_id. |
| image-path | None | The reference image path. |
If scene-id != pose-id, we will use the first image of scene-id as the reference image and generate novel views using the camera trajectory of pose-id.
If (scene-id is None) and (image-path is not None), we will use the image in image-path as the reference image and generate novel views using the camera trajectory of pose-id.
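The behavior described above can be summarized in pseudocode (a sketch of the intended argument handling, not the actual inference.py implementation; the function and helper names are hypothetical).
# Sketch of how scene-id / pose-id / image-path interact (mirrors the rules above).
def resolve_reference(scene_id=None, pose_id=None, image_path=None):
    """Return (reference_image_source, trajectory_scene) following the rules above."""
    if pose_id is None:
        pose_id = scene_id                       # default: pose_id = scene_id
    if scene_id is not None:
        reference = ("scene", scene_id)          # use the first image of scene_id
    elif image_path is not None:
        reference = ("file", image_path)         # use the provided reference image
    else:
        raise ValueError("either --scene-id or --image-path must be given")
    return reference, pose_id                    # novel views follow pose_id's camera trajectory

print(resolve_reference(scene_id="000c3ab189999a83"))
print(resolve_reference(image_path="my_photo.png", pose_id="000c3ab189999a83"))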