VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Paper: [arXiv:2303.16727](https://arxiv.org/abs/2303.16727)
This model is a fine-tuned version of VideoMAE V2 (vit_base_patch16_224) for video classification on FineBio, a custom video dataset for laboratory action recognition.
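The usage example below feeds the model 16 frames of size 224×224. When starting from a longer clip, a common approach is to sample 16 uniformly spaced frames first; a minimal sketch (the helper name is illustrative, not part of this repository):

```python
import numpy as np

def sample_frame_indices(clip_len: int, total_frames: int) -> np.ndarray:
    """Return clip_len uniformly spaced frame indices covering the whole clip.

    Illustrative helper, not part of the model repository.
    """
    return np.linspace(0, total_frames - 1, num=clip_len).astype(np.int64)

# e.g. pick 16 frames from a 300-frame video
indices = sample_frame_indices(16, 300)
print(indices)
```

The selected indices can then be used to gather frames from whatever decoding library you use (e.g. `av`, `decord`, or OpenCV) before passing them to the processor.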
```python
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
import torch
import numpy as np

# Load the fine-tuned model and its processor
processor = VideoMAEImageProcessor.from_pretrained("AnnaelleMyriam/videomaev2-finetuned-finebio")
model = VideoMAEForVideoClassification.from_pretrained("AnnaelleMyriam/videomaev2-finetuned-finebio")

# Prepare video frames (list of PIL images or numpy arrays)
# Shape: (num_frames, height, width, channels); dummy uint8 frames shown here
video = np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8)

# Process inputs
inputs = processor(list(video), return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get predictions
predicted_class_idx = logits.argmax(-1).item()
print(f"Predicted class: {model.config.id2label[predicted_class_idx]}")
```
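To inspect more than the single best class, the logits can be converted to probabilities and ranked. A short sketch using dummy logits (in practice, substitute `outputs.logits` from the snippet above):

```python
import torch

# Dummy logits standing in for outputs.logits, shape (batch_size, num_classes)
logits = torch.tensor([[0.1, 2.0, 0.5, -1.0]])

# Softmax turns logits into a probability distribution over classes
probs = torch.softmax(logits, dim=-1)

# Top-3 classes and their probabilities
top_probs, top_ids = probs.topk(k=3, dim=-1)
for p, i in zip(top_probs[0].tolist(), top_ids[0].tolist()):
    print(f"class {i}: {p:.3f}")
```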
If you use this model, please cite:
```bibtex
@article{wang2023videomaev2,
  title={VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking},
  author={Wang, Limin and Huang, Bingkun and Zhao, Zhiyu and Tong, Zhan and He, Yinan and Wang, Yi and Wang, Yali and Qiao, Yu},
  journal={arXiv preprint arXiv:2303.16727},
  year={2023}
}
```
For questions or issues, please open an issue in the model repository.