There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training

Links: arXiv | Project page | Code


Model Configuration

Table 1: Pre-trained model configurations.

| Model | Dataset | Configuration                               |
|-------|---------|---------------------------------------------|
| RCM-B | IN-256  | 12 layers, 768 dim, 12 heads, patch size 16 |
| RCM-B | IN-512  | 12 layers, 768 dim, 12 heads, patch size 32 |
| RCM-L | IN-256  | 16 layers, 1024 dim, 16 heads, patch size 16 |
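Note that the two pre-training resolutions are matched so the transformer always sees the same sequence length: 256 px with patch size 16 and 512 px with patch size 32 both produce a 16x16 token grid. Below is a minimal sketch that records these configurations and checks that property; the `RCMConfig` dataclass and its field names are illustrative, not identifiers from the EPG codebase.

```python
# Illustrative only: the dataclass and field names below are ours,
# not taken from the EPG codebase.
from dataclasses import dataclass

@dataclass
class RCMConfig:
    layers: int
    dim: int
    heads: int
    patch_size: int
    image_size: int

PRETRAINED = {
    "RCM-B/IN-256": RCMConfig(12, 768, 12, 16, 256),
    "RCM-B/IN-512": RCMConfig(12, 768, 12, 32, 512),
    "RCM-L/IN-256": RCMConfig(16, 1024, 16, 16, 256),
}

for name, cfg in PRETRAINED.items():
    side = cfg.image_size // cfg.patch_size
    print(f"{name}: {side}x{side} grid, {side * side} tokens")
# All three settings yield a 16x16 grid of 256 tokens.
```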

Table 2: Fine-tuned model configurations. Encoder and decoder settings are separated by a comma.

| Name    | Blocks | Dim        | Heads  | Params |
|---------|--------|------------|--------|--------|
| EPG-L   | 16, 16 | 1024, 1024 | 16, 16 | 540M   |
| EPG-XL  | 12, 12 | 768, 1584  | 12, 22 | 583M   |
| EPG-XXL | 12, 12 | 768, 1920  | 12, 16 | 789M   |
| EPG-G   | 12, 12 | 768, 2688  | 12, 21 | 1391M  |
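As a back-of-envelope sanity check on the listed parameter counts, the sketch below estimates the block parameters only, assuming standard pre-LN transformer blocks with a 4x MLP expansion (roughly 12 * dim^2 parameters per block). Embeddings, prediction heads, and any conditioning layers are not counted, so the estimates sit below the reported totals; the `CONFIGS` dict is our transcription of Table 2, not code from the repo.

```python
# Rough parameter estimate (illustrative only). A standard pre-LN transformer
# block has about 12 * dim^2 parameters: 4 * dim^2 for attention (QKV + output
# projection) and 8 * dim^2 for an MLP with a 4x expansion. Embeddings, heads,
# and conditioning layers are excluded, so this is a lower-bound sanity check.
CONFIGS = {
    #           (enc_blocks, enc_dim, dec_blocks, dec_dim, reported)
    "EPG-L":   (16, 1024, 16, 1024, "540M"),
    "EPG-XL":  (12, 768, 12, 1584, "583M"),
    "EPG-XXL": (12, 768, 12, 1920, "789M"),
    "EPG-G":   (12, 768, 12, 2688, "1391M"),
}

for name, (eb, ed, db, dd, reported) in CONFIGS.items():
    approx = 12 * (eb * ed**2 + db * dd**2)
    print(f"{name}: ~{approx / 1e6:.0f}M in blocks alone (reported {reported})")
```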

Table 3: Fine-tuned model performance on downstream tasks (DM: diffusion model; CM: consistency model).

| Model      | Task         | FID  |
|------------|--------------|------|
| EPG-XL/16  | DM on IN-256 | 2.04 |
| EPG-XXL/16 | DM on IN-256 | 1.87 |
| EPG-G/16   | DM on IN-256 | 1.58 |
| EPG-L/32   | DM on IN-512 | 2.35 |
| EPG-L/16   | CM on IN-256 | 8.82 |
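To experiment with one of these fine-tuned checkpoints, a download sketch like the following should work. The repo id comes from this model card, but the checkpoint filename is a guess; check the repository's file listing for the actual name before running.

```python
# Minimal download sketch. The repo_id is taken from this model card; the
# FILENAME below is hypothetical -- consult the repo's "Files and versions"
# tab for the real checkpoint name.
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="jiachenlei/EPG",
    filename="epg-xl-16-in256.pt",  # hypothetical filename
)
state_dict = torch.load(ckpt_path, map_location="cpu")
print(sorted(state_dict.keys())[:5])  # peek at the first few parameter names
```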
Model repository: jiachenlei/EPG