# **ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives (ICCV 2025 Highlight)**

> #### Yuqian Fu, Runze Wang, Bin Ren, Guolei Sun, Biao Gong, Yanwei Fu, Danda Pani Paudel, Xuanjing Huang, Luc Van Gool
>

[arXiv Paper](https://arxiv.org/abs/2411.19083)

### Features

* **Ego-Exo Object Correspondence Task:** We conduct an early exploration of this challenging task, analyzing its unique difficulties, constructing several baselines, and proposing a new method.

* **ObjectRelator Framework:** We introduce ObjectRelator, a cross-view object segmentation method that combines MCFuse and XObjAlign. MCFuse is the first to bring the text modality into this task, improving localization by fusing multimodal cues for the same object(s), while XObjAlign boosts robustness to cross-view appearance variations via an object-level consistency constraint (an illustrative sketch follows the teaser below).

* **New Testbed** & **SOTA Results:** Alongside Ego-Exo4D, we present HANDAL-X as an additional benchmark. Our proposed ObjectRelator achieves state-of-the-art (SOTA) results on both datasets.

  ![](assets/teaser.png)
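
To make the XObjAlign idea concrete, here is a minimal, hypothetical PyTorch sketch of an object-level consistency loss. The function name, tensor shapes, and the cosine-similarity formulation are illustrative assumptions, not the actual implementation in this repository.

```
import torch
import torch.nn.functional as F

def xobj_align_loss(ego_obj_emb: torch.Tensor, exo_obj_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative object-level consistency loss (an assumption, not the repo's code).

    ego_obj_emb / exo_obj_emb: (num_objects, dim) embeddings of the SAME objects
    seen from the ego and exo views; the loss pulls each paired embedding together.
    """
    ego = F.normalize(ego_obj_emb, dim=-1)
    exo = F.normalize(exo_obj_emb, dim=-1)
    # 1 - cosine similarity per paired object, averaged over all objects.
    return (1.0 - (ego * exo).sum(dim=-1)).mean()

# Example: 4 objects, 256-d embeddings from each view.
loss = xobj_align_loss(torch.randn(4, 256), torch.randn(4, 256))
```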

## Updates

- [x] Release evaluation code
- [x] Release training code
- [x] Release data
- [x] Release model

## Installation

Follow the [PSALM installation instructions.](https://github.com/zamling/PSALM/blob/main/docs/INSTALL.md)

## Getting Started

- ### **Prepare Datasets**

  #### Ego-Exo4D

  ------

  Follow [SegSwap](https://github.com/EGO4D/ego-exo4d-relation/tree/main/correspondence/SegSwap) to download Ego-Exo4D videos and pre-process the data into images. After processing, you will obtain image folders structured as follows:

  ```
  data_root
  ├── take_id_01/
  │   ├── ego_cam/
  │   │   ├── 0.jpg
  │   │   ├── ...
  │   │   └── n.jpg
  │   ├── exo_cam/
  │   │   ├── 0.jpg
  │   │   ├── ...
  │   │   └── n.jpg
  │   └── annotation.json
  ├── ...
  ├── take_id_n/
  └── split.json
  ```
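
  Before building the JSON files, it can help to verify this layout. Below is a minimal sanity-check sketch; it only assumes the structure shown above (`ego_cam`/`exo_cam` stand in for the actual camera folder names):

  ```
  from pathlib import Path

  data_root = Path("/path/to/ego-exo4d/data_root")  # adjust to your setup
  for take in sorted(p for p in data_root.iterdir() if p.is_dir()):
      cams = [d.name for d in take.iterdir() if d.is_dir()]
      ann_ok = (take / "annotation.json").exists()
      print(f"{take.name}: cameras={cams}, annotation={'ok' if ann_ok else 'MISSING'}")
  ```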

  Next, we use the images and annotations to generate a JSON file for training and evaluating ObjectRelator (w/o text prompt):

  ```
  python datasets/build_egoexo.py --root_path /path/to/ego-exo4d/data_root --save_path /path/to/save/ego2exo_train_visual.json --split_path /path/to/ego-exo4d/data_root/split.json --split train --task ego2exo
  ```
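
  The same script can be run for other splits and for the reverse direction. A small driver like the following avoids repetition (a sketch; the `exo2ego` value for `--task` is our assumption based on the command above, and the paths are placeholders):

  ```
  import subprocess

  DATA_ROOT = "/path/to/ego-exo4d/data_root"
  for split in ("train", "val"):
      for task in ("ego2exo", "exo2ego"):  # exo2ego flag value is an assumption
          subprocess.run([
              "python", "datasets/build_egoexo.py",
              "--root_path", DATA_ROOT,
              "--save_path", f"/path/to/save/{task}_{split}_visual.json",
              "--split_path", f"{DATA_ROOT}/split.json",
              "--split", split,
              "--task", task,
          ], check=True)
  ```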

  Each run produces a JSON file without text prompts. We then use LLaVA to generate textual descriptions for the objects in the images:

  ```
  cd LLaVA
  conda activate llava
  python gen_text.py --image_path /path/to/ego-exo4d/data_root --json_path /path/to/save/ego2exo_train_visual.json --save_path /path/to/save/ego2exo_train_visual_text_tmp.json
  ```

  In the final step, we process the LLaVA-generated text to extract object names and convert them into tokenized form, producing a complete JSON file that includes both visual and textual prompts.

  ```
  python datasets/build_text.py --text_path /path/to/save/ego2exo_train_visual_text_tmp.json --save_path /path/to/save/ego2exo_train_visual_text.json
  ```
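
  A quick way to sanity-check the final file before training (a sketch that makes no assumption about the exact entry schema):

  ```
  import json

  with open("/path/to/save/ego2exo_train_visual_text.json") as f:
      data = json.load(f)

  print(type(data).__name__)
  if isinstance(data, list) and data:
      print(f"{len(data)} samples; first entry keys: {sorted(data[0])}")
  ```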

  

  #### HANDAL

  ------

  Download all ZIP files from the [HANDAL](https://drive.google.com/drive/folders/10mDNZnYrg55ZiP9GV4upKWnxlxay1OwM) Google Drive folder. You can use `gdown` from the command line, e.g., to fetch a single ZIP:

  ```
  gdown "https://drive.google.com/file/d/1bYP3qevtmjiG3clRiP93mwVBTxyiDQFq/view?usp=share_link" --fuzzy
  ```
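
  Alternatively, gdown's Python API can fetch the whole Drive folder in one go (subject to Google Drive quota limits), and the ZIPs can then be extracted with a short script; the local paths below are placeholders:

  ```
  import zipfile
  from pathlib import Path

  import gdown

  # Download every ZIP in the HANDAL Drive folder into ./handal_zips.
  gdown.download_folder(
      url="https://drive.google.com/drive/folders/10mDNZnYrg55ZiP9GV4upKWnxlxay1OwM",
      output="handal_zips",
      quiet=False,
  )

  # Extract each ZIP into the dataset root.
  data_root = Path("/path/to/handal/data_root")
  data_root.mkdir(parents=True, exist_ok=True)
  for zip_path in sorted(Path("handal_zips").glob("*.zip")):
      with zipfile.ZipFile(zip_path) as zf:
          zf.extractall(data_root)
  ```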

  Once unzipped, the dataset will be organized into image folders as shown below:

  ```
  data_root
  ├── handal_dataset_{obj_name}/
  │   ├── dynamic/
  │   ├── models/
  │   ├── models_parts/
  │   ├── test/
  │   └── train/
  ├── ...
  └── handal_dataset_{obj_name}
  ```

  Next, we use the images and masks to generate a JSON file for training and evaluating ObjectRelator (w/o text prompt):

  ```
  python datasets/build_handal.py --root_path /path/to/handal/data_root --save_path /path/to/save/handal_train_visual.json --split train
  ```

  The subsequent text prompt generation steps are the same as for Ego-Exo4D; refer to the instructions above.

- ### **Pre-trained Checkpoint**

  #### **PSALM components:**

  ------

  Download the Swin-B Mask2Former weights from [here](https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/panoptic/maskformer2_swin_base_IN21k_384_bs16_50ep/model_final_54b88a.pkl).

  Download Phi-1.5 from Hugging Face [here](https://huggingface.co/susnato/phi-1_5_dev).

  Download the LLaVA-pretrained projector from [here](https://huggingface.co/EnmingZhang/PSALM_stage1).

  

  #### **Pre-trained PSALM:**

  ------

  Download pre-trained PSALM from [here](https://huggingface.co/EnmingZhang/PSALM).
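
  If you prefer scripting these downloads, the Hugging Face checkpoints can be fetched with `huggingface_hub` and the Mask2Former weights over plain HTTP. A minimal sketch (the local `checkpoints/` layout is our choice, not required by the code):

  ```
  import urllib.request
  from pathlib import Path

  from huggingface_hub import snapshot_download

  Path("checkpoints").mkdir(exist_ok=True)

  # Swin-B Mask2Former weights (direct download).
  urllib.request.urlretrieve(
      "https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/panoptic/"
      "maskformer2_swin_base_IN21k_384_bs16_50ep/model_final_54b88a.pkl",
      "checkpoints/model_final_54b88a.pkl",
  )

  # Phi-1.5, the LLaVA-pretrained projector, and pre-trained PSALM.
  for repo in ("susnato/phi-1_5_dev", "EnmingZhang/PSALM_stage1", "EnmingZhang/PSALM"):
      snapshot_download(repo_id=repo, local_dir=f"checkpoints/{repo.split('/')[-1]}")
  ```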

  

- ### **Train**

  #### Training on Ego-Exo4D

  ```
  # 1. In train_ObjectRelator.sh, change the model and dataset paths to your Ego-Exo4D paths.
  # 2. Training behavior can be adjusted via the configuration parameters in
  #    objectrelator/mask_config/data_args.py:
  #    - data_args.condition controls the number of prompt modalities used
  #    - training_args.joint_training determines whether joint training is enabled
  #    - training_args.first_stage determines whether to run the first training stage

  # stage-1 training: set training_args.first_stage=True
  bash scripts/train_ObjectRelator.sh

  # stage-2 training: set training_args.first_stage=False,
  # training_args.pretrained_model_path=/path/to/stage-1
  bash scripts/train_ObjectRelator.sh
  ```

  #### Training on HANDAL

  ```
  # change model paths and dataset paths to the exact handal related paths in train_ObjectRelator.sh
  # set training_args.is_handal=True in data_args.py
  # The remaining training procedure is identical to that of Ego-Exo4D.
  
  bash scripts/train_ObjectRelator.sh 
  ```

- ### **Evaluation**

  #### Eval on Ego-Exo4D

  ```
  # set data_args.condition in objectrelator/mask_config/data_args.py to control the number of prompt modalities used
  
  python objectrelator/eval/eval_egoexo.py --image_folder /path/to/ego-exo4d/data_root --model_path /path/to/pretrained_model --json_path /path/to/save/ego2exo_val_visual_text.json --split_path /path/to/ego-exo4d/data_root/split.json --split val
  ```

  #### Eval on HANDAL

  ```
  # set data_args.condition in objectrelator/mask_config/data_args.py to control the number of prompt modalities used
  
  python objectrelator/eval/eval_handal.py --image_folder /path/to/handal/data_root --model_path /path/to/pretrained_model --json_path /path/to/save/handal_val_visual_text.json
  ```


## Model Zoo

- Download ObjectRelator here.
- Download the prepared JSON files here.

## Citation

If you find this work useful for your research, please cite it using the following BibTeX entry.

```
@misc{fu2024objectrelatorenablingcrossviewobject,
      title={ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos}, 
      author={Yuqian Fu and Runze Wang and Yanwei Fu and Danda Pani Paudel and Xuanjing Huang and Luc Van Gool},
      year={2024},
      eprint={2411.19083},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19083}, 
}
```

## Acknowledgement

Thanks to these awesome works: [PSALM](https://github.com/zamling/PSALM/blob/main/), [LLaVA](https://github.com/haotian-liu/LLaVA), and [Ego-Exo4D](https://ego-exo4d-data.org). Our code builds on them.