Update model card: Add library_name and update with latest information

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +174 -28
README.md CHANGED
@@ -1,19 +1,20 @@
  ---
  license: other
  license_name: stabilityai-ai-community
  license_link: LICENSE.md
- base_model:
- - stabilityai/stable-diffusion-3-medium
  pipeline_tag: text-to-video
  tags:
  - image-to-video
  ---

  # ⚡️Pyramid Flow⚡️

- [[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[Code 🚀]](https://github.com/jy0205/Pyramid-Flow)

- This is the official repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. By training only on open-source datasets, it generates high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.

  <table class="center" border="0" style="width: 100%; text-align: left;">
  <tr>
@@ -22,29 +23,110 @@ This is the official repository for Pyramid Flow, a training-efficient **Autoreg
  <th>Image-to-video</th>
  </tr>
  <tr>
- <td><video src="https://pyramid-flow.github.io/static/videos/t2v_10s/fireworks.mp4" autoplay muted loop playsinline></video></td>
- <td><video src="https://pyramid-flow.github.io/static/videos/t2v/trailer.mp4" autoplay muted loop playsinline></video></td>
- <td><video src="https://pyramid-flow.github.io/static/videos/i2v/sunday.mp4" autoplay muted loop playsinline></video></td>
  </tr>
  </table>

  ## News

- * `COMING SOON` ⚡️⚡️⚡️ Training code and new model checkpoints trained from scratch.
  * `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.

- ## Usage

- You can directly download the model from [Huggingface](https://huggingface.co/rain1011/pyramid-flow-sd3). We provide both model checkpoints for 768p and 384p video generation. The 384p checkpoint supports 5-second video generation at 24FPS, while the 768p checkpoint supports up to 10-second video generation at 24FPS.

  ```python
  from huggingface_hub import snapshot_download

  model_path = 'PATH' # The local directory to save downloaded checkpoint
- snapshot_download("rain1011/pyramid-flow-sd3", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
  ```

- To use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https://github.com/jy0205/Pyramid-Flow/blob/main/video_generation_demo.ipynb). We further simplify it into the following two-step procedure. First, load the downloaded model:

  ```python
  import torch
@@ -53,36 +135,49 @@ from pyramid_dit import PyramidDiTForVideoGeneration
  from diffusers.utils import load_image, export_to_video

  torch.cuda.set_device(0)
- model_dtype, torch_dtype = 'bf16', torch.bfloat16 # Use bf16, fp16 or fp32

  model = PyramidDiTForVideoGeneration(
  'PATH', # The downloaded checkpoint dir
- model_dtype,
- model_variant='diffusion_transformer_768p', # 'diffusion_transformer_384p'
  )

- model.vae.to("cuda")
- model.dit.to("cuda")
- model.text_encoder.to("cuda")
  model.vae.enable_tiling()
  ```

- Then, you can try text-to-video generation on your own prompts:

  ```python
  prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
  frames = model.generate(
  prompt=prompt,
  num_inference_steps=[20, 20, 20],
  video_num_inference_steps=[10, 10, 10],
- height=768,
- width=1280,
  temp=16, # temp=16: 5s, temp=31: 10s
- guidance_scale=9.0, # The guidance for the first frame
  video_guidance_scale=5.0, # The guidance for the other video latent
  output_type="pil",
  )

  export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
@@ -91,7 +186,15 @@ export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
  As an autoregressive model, our model also supports (text conditioned) image-to-video generation:

  ```python
- image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((1280, 768))
  prompt = "FPV flying over the Great Wall"

  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
@@ -102,32 +205,75 @@ with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
  temp=16,
  video_guidance_scale=4.0,
  output_type="pil",
  )

  export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
  ```

- Usage tips:

  * The `guidance_scale` parameter controls the visual quality. We suggest using a guidance within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
  * The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates the autoregressive generation degradation, while a smaller value stabilizes the video.
  * For 10-second video generation, we recommend using a guidance scale of 7 and a video guidance scale of 5.

  ## Gallery

  The following video examples are generated at 5s, 768p, 24fps. For more results, please visit our [project page](https://pyramid-flow.github.io).

  <table class="center" border="0" style="width: 100%; text-align: left;">
  <tr>
- <td><video src="https://pyramid-flow.github.io/static/videos/t2v/tokyo.mp4" autoplay muted loop playsinline></video></td>
- <td><video src="https://pyramid-flow.github.io/static/videos/t2v/eiffel.mp4" autoplay muted loop playsinline></video></td>
  </tr>
  <tr>
- <td><video src="https://pyramid-flow.github.io/static/videos/t2v/waves.mp4" autoplay muted loop playsinline></video></td>
- <td><video src="https://pyramid-flow.github.io/static/videos/t2v/rail.mp4" autoplay muted loop playsinline></video></td>
  </tr>
  </table>

  ## Acknowledgement

  We are grateful for the following awesome projects when implementing Pyramid Flow:
 
  ---
+ base_model:
+ - stabilityai/stable-diffusion-3-medium
  license: other
  license_name: stabilityai-ai-community
  license_link: LICENSE.md
  pipeline_tag: text-to-video
+ library_name: diffusers
  tags:
  - image-to-video
  ---

  # ⚡️Pyramid Flow⚡️

+ [[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[miniFLUX Model 🚀]](https://huggingface.co/rain1011/pyramid-flow-miniflux) [[SD3 Model ⚡️]](https://huggingface.co/rain1011/pyramid-flow-sd3) [[demo 🤗]](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow)

+ This is the official repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. By training only on **open-source datasets**, it can generate high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.

  <table class="center" border="0" style="width: 100%; text-align: left;">
  <tr>
  <th>Image-to-video</th>
  </tr>
  <tr>
+ <td><video src="https://github.com/user-attachments/assets/9935da83-ae56-4672-8747-0f46e90f7b2b" autoplay muted loop playsinline></video></td>
+ <td><video src="https://github.com/user-attachments/assets/3412848b-64db-4d9e-8dbf-11403f6d02c5" autoplay muted loop playsinline></video></td>
+ <td><video src="https://github.com/user-attachments/assets/3bd7251f-7b2c-4bee-951d-656fdb45f427" autoplay muted loop playsinline></video></td>
  </tr>
  </table>

  ## News
+ * `2024.11.13` 🚀🚀🚀 We release the [768p miniFLUX checkpoint](https://huggingface.co/rain1011/pyramid-flow-miniflux) (up to 10s).
+
+ > We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint, 384p video checkpoint (up to 5s) and 768p video checkpoint (up to 10s). The new miniFLUX model shows great improvement in human structure and motion stability.
+
+ * `2024.10.29` ⚡️⚡️⚡️ We release [training code for VAE](#1-training-vae), [finetuning code for DiT](#2-finetuning-dit) and [new model checkpoints](https://huggingface.co/rain1011/pyramid-flow-miniflux) with FLUX structure trained from scratch.
+
+ * `2024.10.13` ✨✨✨ [Multi-GPU inference](#3-multi-gpu-inference) and [CPU offloading](#cpu-offloading) are supported. Use it with **less than 8GB** of GPU memory, with a great speedup on multiple GPUs.
+
+ * `2024.10.11` 🤗🤗🤗 [Hugging Face demo](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) is available. Thanks [@multimodalart](https://huggingface.co/multimodalart) for the commit!

  * `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.

+ ## Table of Contents
+
+ * [Introduction](#introduction)
+ * [Installation](#installation)
+ * [Inference](#inference)
+   1. [Quick Start with Gradio](#1-quick-start-with-gradio)
+   2. [Inference Code](#2-inference-code)
+   3. [Multi-GPU Inference](#3-multi-gpu-inference)
+   4. [Usage Tips](#4-usage-tips)
+ * [Training](#training)
+   1. [Training VAE](#1-training-vae)
+   2. [Finetuning DiT](#2-finetuning-dit)
+ * [Gallery](#gallery)
+ * [Comparison](#comparison)
+ * [Acknowledgement](#acknowledgement)
+ * [Citation](#citation)
+
+ ## Introduction
+
+ ![motivation](assets/motivation.jpg)
+
+ Existing video diffusion models operate at full resolution, spending a lot of computation on very noisy latents. By contrast, our method harnesses the flexibility of flow matching ([Lipman et al., 2023](https://openreview.net/forum?id=PqvMRDCJT9t); [Liu et al., 2023](https://openreview.net/forum?id=XVjTT1nw5z); [Albergo & Vanden-Eijnden, 2023](https://openreview.net/forum?id=li7qeBbCR1t)) to interpolate between latents of different resolutions and noise levels, allowing for simultaneous generation and decompression of visual content with better computational efficiency. The entire framework is end-to-end optimized with a single DiT ([Peebles & Xie, 2023](http://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html)), generating high-quality 10-second videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.
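For readers new to flow matching, a rough intuition (a simplified sketch, not the paper's exact objective): with straight interpolation paths, the model learns a velocity field between a noise sample $x_0$ and a clean latent $x_1$,

$$x_t = (1 - t)\,x_0 + t\,x_1, \qquad v_\theta(x_t, t) \approx x_1 - x_0,$$

and Pyramid Flow applies this kind of interpolation across a coarse-to-fine pyramid of latent resolutions rather than running every step at full resolution. See the [paper](https://arxiv.org/abs/2410.05954) for the precise formulation.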
+
+ ## Installation
+
+ We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2 ([guide](https://pytorch.org/get-started/previous-versions/#v212)), and we are actively working to support a wider range of versions.

+ ```bash
+ git clone https://github.com/jy0205/Pyramid-Flow
+ cd Pyramid-Flow
+
+ # create env using conda
+ conda create -n pyramid python==3.8.10
+ conda activate pyramid
+ pip install -r requirements.txt
+ ```
+
+ Then, download the model from [Hugging Face](https://huggingface.co/rain1011) (there are two variants: [miniFLUX](https://huggingface.co/rain1011/pyramid-flow-miniflux) or [SD3](https://huggingface.co/rain1011/pyramid-flow-sd3)). The miniFLUX models support 1024p image, 384p and 768p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second videos at 24 FPS, while the 768p checkpoint generates up to 10-second videos at 24 FPS.

  ```python
  from huggingface_hub import snapshot_download

  model_path = 'PATH' # The local directory to save downloaded checkpoint
+ snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
+ ```
+
+ ## Inference
+
+ ### 1. Quick start with Gradio
+
+ To get started, first install [Gradio](https://www.gradio.app/guides/quickstart), set your model path at [#L36](https://github.com/jy0205/Pyramid-Flow/blob/3777f8b84bddfa2aa2b497ca919b3f40567712e6/app.py#L36), and then run on your local machine:
+
+ ```bash
+ python app.py
+ ```
+
+ The Gradio demo will open in a browser. Thanks to [@tpc2233](https://github.com/tpc2233) for the commit, see [#48](https://github.com/jy0205/Pyramid-Flow/pull/48) for details.
+
+ Or, try it out effortlessly on the [Hugging Face Space 🤗](https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow) created by [@multimodalart](https://huggingface.co/multimodalart). Due to GPU limits, this online demo can only generate 25 frames (exported at 8 FPS or 24 FPS). Duplicate the space to generate longer videos.
+
+ #### Quick Start on Google Colab
+
+ To quickly try out Pyramid Flow on Google Colab, run the code below:
+
+ ```
+ # Setup
+ !git clone https://github.com/jy0205/Pyramid-Flow
+ %cd Pyramid-Flow
+ !pip install -r requirements.txt
+ !pip install gradio
+
+ # This code downloads miniFLUX
+ from huggingface_hub import snapshot_download
+
+ model_path = '/content/Pyramid-Flow'
+ snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
+
+ # Start
+ !python app.py
  ```

+ ### 2. Inference Code
+
+ To use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https://github.com/jy0205/Pyramid-Flow/blob/main/video_generation_demo.ipynb). We strongly recommend trying the latest published pyramid-miniflux, which shows great improvement in human structure and motion stability; set the parameter `model_name` to `pyramid_flux` to use it. We further simplify it into the following two-step procedure. First, load the downloaded model:

  ```python
  import torch
  from diffusers.utils import load_image, export_to_video

  torch.cuda.set_device(0)
+ model_dtype, torch_dtype = 'bf16', torch.bfloat16 # Use bf16 (fp16 is not supported yet)

  model = PyramidDiTForVideoGeneration(
  'PATH', # The downloaded checkpoint dir
+ model_name="pyramid_flux",
+ model_dtype=model_dtype,
+ model_variant='diffusion_transformer_768p',
  )

  model.vae.enable_tiling()
+ # model.vae.to("cuda")
+ # model.dit.to("cuda")
+ # model.text_encoder.to("cuda")
+
+ # If you are not using the sequential offloading below, uncomment the lines above ^
+ model.enable_sequential_cpu_offload()
  ```

+ Then, you can try text-to-video generation on your own prompts. Note that the 384p version only supports up to 5s for now (set `temp` up to 16)!

  ```python
  prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

+ # used for 384p model variant
+ # width = 640
+ # height = 384
+
+ # used for 768p model variant
+ width = 1280
+ height = 768
+
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
  frames = model.generate(
  prompt=prompt,
  num_inference_steps=[20, 20, 20],
  video_num_inference_steps=[10, 10, 10],
+ height=height,
+ width=width,
  temp=16, # temp=16: 5s, temp=31: 10s
+ guidance_scale=7.0, # The guidance for the first frame, set it to 7 for 384p variant
  video_guidance_scale=5.0, # The guidance for the other video latent
  output_type="pil",
+ save_memory=True, # If you have enough GPU memory, set it to `False` to improve vae decoding speed
  )

  export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
 
  As an autoregressive model, our model also supports (text conditioned) image-to-video generation:

  ```python
+ # used for 384p model variant
+ # width = 640
+ # height = 384
+
+ # used for 768p model variant
+ width = 1280
+ height = 768
+
+ image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((width, height))
  prompt = "FPV flying over the Great Wall"

  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
  temp=16,
  video_guidance_scale=4.0,
  output_type="pil",
+ save_memory=True, # If you have enough GPU memory, set it to `False` to improve vae decoding speed
  )

  export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
  ```

+ #### CPU offloading
+
+ We also support two types of CPU offloading to reduce GPU memory requirements; note that they may sacrifice efficiency. A short sketch of both options follows the list.
+ * Adding a `cpu_offloading=True` parameter to the generate function allows inference with **less than 12GB** of GPU memory. This feature was contributed by [@Ednaordinary](https://github.com/Ednaordinary), see [#23](https://github.com/jy0205/Pyramid-Flow/pull/23) for details.
+ * Calling `model.enable_sequential_cpu_offload()` before the above procedure allows inference with **less than 8GB** of GPU memory. This feature was contributed by [@rodjjo](https://github.com/rodjjo), see [#75](https://github.com/jy0205/Pyramid-Flow/pull/75) for details.
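As a rough illustration of how the two options slot into the earlier inference example (a sketch that assumes `model`, `prompt` and `torch_dtype` were defined exactly as shown above; `cpu_offloading` is the parameter described in the first bullet):

```python
import torch

# Sketch only, reusing the `model` object loaded in the inference example above.

# Option A (< 8GB): sequential CPU offloading instead of moving submodules to "cuda" manually.
# model.enable_sequential_cpu_offload()

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=16,
        guidance_scale=7.0,
        video_guidance_scale=5.0,
        output_type="pil",
        save_memory=True,
        cpu_offloading=True,  # Option B (< 12GB): per-call offloading inside `generate`
    )
```

Treat the two options as alternatives unless the linked PRs say otherwise.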
+
+ #### MPS backend
+
+ Thanks to [@niw](https://github.com/niw), Apple Silicon users (e.g. MacBook Pro with M2 24GB) can also try our model using the MPS backend! Please see [#113](https://github.com/jy0205/Pyramid-Flow/pull/113) for the details.
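The actual MPS wiring lives in the PR linked above; as a quick, generic PyTorch check (not Pyramid-Flow-specific) that your machine exposes the MPS device before trying it:

```python
import torch

# Generic PyTorch availability check; the Pyramid Flow MPS integration itself is in PR #113.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(f"Running on: {device}")
```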
+
+ ### 3. Multi-GPU Inference
+
+ For users with multiple GPUs, we provide an [inference script](https://github.com/jy0205/Pyramid-Flow/blob/main/scripts/inference_multigpu.sh) that uses sequence parallelism to save memory on each GPU. This also brings a big speedup, taking only 2.5 minutes to generate a 5s, 768p, 24fps video on 4 A100 GPUs (vs. 5.5 minutes on a single A100 GPU). Run it on 2 GPUs with the following command:
+
+ ```bash
+ CUDA_VISIBLE_DEVICES=0,1 sh scripts/inference_multigpu.sh
+ ```
+
+ It currently supports 2 or 4 GPUs (for the SD3 version), with more configurations available in the original script. You can also launch a [multi-GPU Gradio demo](https://github.com/jy0205/Pyramid-Flow/blob/main/scripts/app_multigpu_engine.sh) created by [@tpc2233](https://github.com/tpc2233), see [#59](https://github.com/jy0205/Pyramid-Flow/pull/59) for details.
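For example, the 4-GPU configuration would presumably just expose four devices to the same script (an assumption; the script may also need its internal GPU-count setting adjusted, so check `scripts/inference_multigpu.sh` itself):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 sh scripts/inference_multigpu.sh
```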
+
+ > Spoiler: We didn't even use sequence parallelism in training, thanks to our efficient pyramid flow designs.
+
+ ### 4. Usage tips

  * The `guidance_scale` parameter controls the visual quality. We suggest using a guidance within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
  * The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates the autoregressive generation degradation, while a smaller value stabilizes the video.
  * For 10-second video generation, we recommend using a guidance scale of 7 and a video guidance scale of 5 (see the example below).
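As a concrete illustration of the 10-second recommendation, here is the earlier text-to-video call with only the relevant arguments changed (a sketch that reuses the `model`, `prompt` and `torch_dtype` defined above):

```python
# 10-second, 768p text-to-video with the recommended guidance values.
with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=31,                   # temp=31 corresponds to 10s (temp=16 is 5s)
        guidance_scale=7.0,        # recommended guidance for 10-second generation
        video_guidance_scale=5.0,  # recommended video guidance for 10-second generation
        output_type="pil",
        save_memory=True,
    )

export_to_video(frames, "./text_to_video_10s_sample.mp4", fps=24)
```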

+ ## Training
+
+ ### 1. Training VAE
+
+ The hardware requirements for training the VAE are at least 8 A100 GPUs. Please refer to [this document](https://github.com/jy0205/Pyramid-Flow/blob/main/docs/VAE.md). This is a [MAGVIT-v2](https://arxiv.org/abs/2310.05737)-like continuous 3D VAE, which should be quite flexible. Feel free to build your own video generative model on this part of the VAE training code.
+
+ ### 2. Finetuning DiT
+
+ The hardware requirements for finetuning the DiT are at least 8 A100 GPUs. Please refer to [this document](https://github.com/jy0205/Pyramid-Flow/blob/main/docs/DiT.md). We provide instructions for both autoregressive and non-autoregressive versions of Pyramid Flow. The former is more research-oriented and the latter is more stable (but less efficient without the temporal pyramid).
+
  ## Gallery

  The following video examples are generated at 5s, 768p, 24fps. For more results, please visit our [project page](https://pyramid-flow.github.io).

  <table class="center" border="0" style="width: 100%; text-align: left;">
  <tr>
+ <td><video src="https://github.com/user-attachments/assets/5b44a57e-fa08-4554-84a2-2c7a99f2b343" autoplay muted loop playsinline></video></td>
+ <td><video src="https://github.com/user-attachments/assets/5afd5970-de72-40e2-900d-a20d18308e8e" autoplay muted loop playsinline></video></td>
  </tr>
  <tr>
+ <td><video src="https://github.com/user-attachments/assets/1d44daf8-017f-40e9-bf18-1e19c0a8983b" autoplay muted loop playsinline></video></td>
+ <td><video src="https://github.com/user-attachments/assets/7f5dd901-b7d7-48cc-b67a-3c5f9e1546d2" autoplay muted loop playsinline></video></td>
  </tr>
  </table>

+ ## Comparison
+
+ On VBench ([Huang et al., 2024](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard)), our method surpasses all the compared open-source baselines. Even with only public video data, it achieves comparable performance to commercial models like Kling ([Kuaishou, 2024](https://kling.kuaishou.com/en)) and Gen-3 Alpha ([Runway, 2024](https://runwayml.com/research/introducing-gen-3-alpha)), especially in the quality score (84.74 vs. 84.11 for Gen-3) and motion smoothness.
+
+ ![vbench](assets/vbench.jpg)
+
+ We conducted an additional user study with 20+ participants. As shown below, our method is preferred over open-source models such as [Open-Sora](https://github.com/hpcaitech/Open-Sora) and [CogVideoX-2B](https://github.com/THUDM/CogVideo), especially in terms of motion smoothness.
+
+ ![user_study](assets/user_study.jpg)
+
  ## Acknowledgement

  We are grateful for the following awesome projects when implementing Pyramid Flow: