Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
Paper • 2510.12586 • Published • 115
Table 1: Pre-trained model configurations.
| Model | Dataset | Configuration |
|---|---|---|
| RCM-B | IN-256 | 12 layers, 768 dim, 12 heads, patch size 16 |
| RCM-B | IN-512 | 12 layers, 768 dim, 12 heads, patch size 32 |
| RCM-L | IN-256 | 16 layers, 1024 dim, 16 heads, patch size 16 |
Table 2: Fine-tuned model configurations. Encoder and decoder settings are given as comma-separated pairs (encoder, decoder).
| Name | Blocks | Dim | Heads | Params |
|---|---|---|---|---|
| EPG-L | 16, 16 | 1024, 1024 | 16, 16 | 540M |
| EPG-XL | 12, 12 | 768, 1584 | 12, 22 | 583M |
| EPG-XXL | 12, 12 | 768, 1920 | 12, 16 | 789M |
| EPG-G | 12, 12 | 768, 2688 | 12, 21 | 1391M |
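The fine-tuned configurations in Table 2 can be expressed in code, e.g. as a small dataclass. This is an illustrative sketch, not the released implementation; the class and field names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EPGConfig:
    """Encoder/decoder transformer settings from Table 2 (illustrative)."""
    enc_blocks: int
    dec_blocks: int
    enc_dim: int
    dec_dim: int
    enc_heads: int
    dec_heads: int

    def head_dims(self) -> tuple[int, int]:
        # Per-head dimension for encoder and decoder attention.
        return (self.enc_dim // self.enc_heads, self.dec_dim // self.dec_heads)


# Values copied directly from Table 2.
EPG_CONFIGS = {
    "EPG-L":   EPGConfig(16, 16, 1024, 1024, 16, 16),
    "EPG-XL":  EPGConfig(12, 12, 768, 1584, 12, 22),
    "EPG-XXL": EPGConfig(12, 12, 768, 1920, 12, 16),
    "EPG-G":   EPGConfig(12, 12, 768, 2688, 12, 21),
}
```

For example, `EPG_CONFIGS["EPG-G"].head_dims()` gives (64, 128), since 768 / 12 = 64 and 2688 / 21 = 128.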
Table 3: Fine-tuned model performance on downstream tasks (DM: diffusion model; CM: consistency model).
| Model | Task | FID |
|---|---|---|
| EPG-XL/16 | DM on IN-256 | 2.04 |
| EPG-XXL/16 | DM on IN-256 | 1.87 |
| EPG-G/16 | DM on IN-256 | 1.58 |
| EPG-L/32 | DM on IN-512 | 2.35 |
| EPG-L/16 | CM on IN-256 | 8.82 |