arxiv:2603.06507

Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis

Published on Mar 6

Abstract

Self-Flow introduces a self-supervised flow matching paradigm with dual-timestep scheduling to enhance semantic representations and generative capabilities across modalities.

AI-generated summary

Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives learning strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.

Community

Here are the main results from Self-Flow, explained with reference to the key figures and tables in the paper:

1. Core Innovation: Dual-Timestep Scheduling

Figure 3 illustrates the key mechanism. Instead of applying uniform noise to all tokens (standard flow matching), Self-Flow samples two distinct timesteps ($t$ and $s$) and applies them heterogeneously across tokens using a random mask $M$:

  • Masked tokens see noise level $s$ (usually cleaner)
  • Unmasked tokens see noise level $t$ (usually noisier)

This creates information asymmetry where the model must infer heavily corrupted tokens from cleaner context tokens, forcing it to learn strong semantic representations. An EMA teacher network sees the "cleaner" version (noised with $\tau_{\min} = \min(t, s)$), and the student learns to predict the teacher's internal representations (Figure 3, right).
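The mechanism described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: the function name, the linear (rectified-flow style) interpolation, and the per-example timestep sampling are all assumptions filled in from the description.

```python
import torch

def dual_timestep_noising(x, mask_ratio=0.5):
    """Hypothetical sketch of Dual-Timestep Scheduling.

    x: clean latent tokens of shape (B, N, D).
    Returns the heterogeneously noised tokens, the uniformly (cleaner)
    noised teacher input, the per-token noise levels, and the mask M.
    """
    B, N, D = x.shape
    t = torch.rand(B, 1)  # noise level for unmasked tokens
    s = torch.rand(B, 1)  # noise level for masked tokens
    M = (torch.rand(B, N) < mask_ratio).float()  # random token mask

    # Heterogeneous per-token noise level: s where masked, t elsewhere.
    tau = (M * s + (1.0 - M) * t).unsqueeze(-1)  # (B, N, 1)

    eps = torch.randn_like(x)
    # Assumed linear interpolation between data and noise.
    x_student = (1.0 - tau) * x + tau * eps

    # The EMA teacher sees the cleaner level tau_min = min(t, s),
    # applied uniformly to all tokens.
    tau_min = torch.minimum(t, s).unsqueeze(-1)  # (B, 1, 1)
    x_teacher = (1.0 - tau_min) * x + tau_min * eps
    return x_student, x_teacher, tau, M
```

The student would then be trained on the flow-matching target for `x_student` plus a representation loss matching the teacher's features on `x_teacher`.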

2. Image Generation: Surpassing External Alignment

Figure 1(a) and Table 1 show that on ImageNet $256\times256$, Self-Flow (FID 5.70) outperforms REPA (FID 5.89)—the leading external alignment method—without using any external encoder. Notably:

  • REPA uses DINOv2, which was heavily trained on ImageNet, giving it an unfair advantage
  • Self-Flow converges $\sim2.8\times$ faster than REPA and continues improving while REPA plateaus
  • When combined with Representation Autoencoders (RAE), Self-Flow improves FID from 3.24 to 2.95 (Table 1, bottom)

Figure 4(b) validates that Self-Flow learns stronger representations via linear probing: early and mid-layer features significantly outperform standard flow matching.
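For readers unfamiliar with the evaluation: linear probing freezes the model, extracts features from a chosen layer, and fits only a linear classifier on top, so accuracy reflects representation quality alone. A generic sketch (the helper name and training loop are assumptions, not the paper's protocol):

```python
import torch
import torch.nn as nn

def linear_probe(features, labels, num_classes, epochs=100, lr=0.1):
    """Fit a linear classifier on frozen features (hypothetical helper).

    features: (num_samples, dim) activations from a frozen layer.
    labels:   (num_samples,) integer class labels.
    Returns the trained probe and its training accuracy.
    """
    probe = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(features), labels).backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(features).argmax(dim=-1) == labels).float().mean().item()
    return probe, acc
```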

3. The Scaling Paradox of External Encoders

Figure 2(a) reveals a critical flaw in external alignment: when replacing DINOv2-B with stronger variants (DINOv2-L → DINOv3-H+), generation quality degrades (FID worsens from 8.3 to >9). This suggests external alignment creates a bottleneck where the generative model becomes dependent on fixed representations that don't scale with the model.

Figure 6 demonstrates Self-Flow's superior scaling behavior:

  • As parameters increase (290M → 420M → 625M → 1B), the performance gap between Self-Flow and REPA widens consistently
  • Self-Flow with 625M parameters outperforms REPA with 1B parameters (Figure 6a)
  • Self-Flow follows expected scaling laws with increased compute, while REPA shows diminishing returns (Figure 6b)

4. Cross-Modal Superiority

Figure 5 and Tables 2-4 show quantitative results across modalities. Crucially, external alignment methods that work for images often fail for other modalities:

Text-to-Image (Table 2):
Self-Flow achieves the best FID (3.61) vs. REPA (3.92) and SigLIP 2 (3.97). Even when evaluated with DINOv2 features (FD-DINO), Self-Flow (167.98) beats REPA (173.35)—remarkable because REPA explicitly aligns with DINOv2.

Text-to-Video (Table 3):
Self-Flow achieves FVD 47.81 (next best is 49.59). Notably, aligning with video-specific encoders (V-JEPA2, Depth Anything 3) harms performance relative to vanilla flow matching, while Self-Flow provides consistent gains.

Text-to-Audio (Table 4):
Self-Flow achieves the best FAD scores across all CLAP variants, while external alignment with MERT provides no benefit.

5. Multi-Modal Training and Robotics

Figure 8(a) shows joint training on Image+Video+Audio with different loss weightings. Self-Flow provides consistent improvements (shaded area) across all modalities simultaneously, even under extreme weightings that favor one modality over others.
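The weighted joint training in Figure 8(a) amounts to a weighted sum of per-modality losses. A trivial sketch, with dictionary keys and weight values chosen purely for illustration:

```python
def joint_loss(losses, weights):
    """Combine per-modality flow-matching losses with scalar weights.

    losses:  dict of scalar losses, e.g. {"image": ..., "video": ..., "audio": ...}
    weights: dict with the same keys; Figure 8(a) sweeps these weightings,
             including extreme settings that favor one modality.
    """
    return sum(weights[k] * losses[k] for k in losses)
```

The paper's finding is that Self-Flow's gains persist across such weightings, i.e. the improvement is not an artifact of one balanced setting.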

Figure 8(b) and Figure 7 demonstrate transfer to embodied AI (SIMPLER simulator):

  • Self-Flow learns more efficiently from limited robotics data (RT-1 dataset)
  • On complex multi-object tasks ("Move Near", "Open and Place"), Self-Flow maintains significant advantages over vanilla flow matching even at 100k steps
  • Early in training (30k steps), Self-Flow succeeds on all task categories while vanilla flow matching fails entirely on "Open and Place" tasks

6. Ablations: Why Each Component Matters

Figure 11(a) shows ablations on ImageNet:

  • Removing the self-supervised loss is most detrimental (+4.3 FID)
  • Removing Dual-Timestep Scheduling while keeping the representation loss degrades performance (+1.1 FID)
  • Restricting the second timestep to be only slightly cleaner than the first (instead of sampling it from the full distribution) is nearly as bad as removing masking entirely
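The third ablation contrasts two ways of drawing the second timestep. A sketch of the two samplers (names and the `eps` margin are hypothetical):

```python
import torch

def sample_second_timestep(t, restricted=False, eps=0.05):
    """Two ways to pick the second timestep s, per the ablation.

    Full sampling draws s uniformly and independently of t; the
    restricted variant keeps s only marginally cleaner than t, which
    the ablation finds is nearly as bad as no masking at all.
    """
    if restricted:
        return (t - eps).clamp(min=0.0)  # s barely cleaner than t
    return torch.rand_like(t)            # s ~ U(0, 1), independent of t
```

Intuitively, only independent sampling produces large gaps between the two noise levels, and it is that gap which creates the information asymmetry driving representation learning.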

Figure 11(b) shows that while better noise scheduling helps both methods, Self-Flow benefits significantly more, indicating it optimally leverages the timestep selection.

Summary

The key takeaway is that self-supervised representation learning integrated directly into flow matching outperforms external alignment across all modalities, follows healthy scaling laws, and enables seamless multi-modal training without domain-specific encoder selection. Self-Flow eliminates the "train-inference gap" that plagues masking-based approaches (Figure 2b) while avoiding the scaling bottlenecks of external encoders.
