---
license: mit
datasets:
- Hemg/deepfake-and-real-images
language:
- en
tags:
- deepfake-detection
- computer-vision
- ensemble-learning
- pytorch
- vision-transformer
- cnn
- image-classification
- swint-transformer
- EffiSwinT
metrics:
- accuracy
- precision
- recall
- f1
model-index:
- name: EffiSwinT-Deepfake-Detector
  results:
  - task:
      type: image-classification
      name: Deepfake Detection
    dataset:
      type: Hemg/deepfake-and-real-images
      name: Deepfake and Real Images Dataset
    metrics:
    - type: accuracy
      value: 98.9
      name: Test Accuracy
    - type: f1
      value: 0.99
      name: F1 Score
    - type: precision
      value: 0.99
      name: Precision
    - type: recall
      value: 0.99
      name: Recall
pipeline_tag: image-classification
library_name: pytorch
---

# EffiSwinT: Efficient Deep Fake Detection using EfficientNet-Swin Transformer Hybrid Architecture

## Abstract

This repository presents EffiSwinT, a novel hybrid architecture combining EfficientNet-B3 and Swin Transformer for robust deepfake detection. The model leverages the complementary strengths of both architectures: EfficientNet's efficient feature extraction and Swin Transformer's hierarchical representation learning capabilities.

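As a rough illustration of the hybrid design described in this card (two branches whose features are concatenated and passed through a 512-unit MLP), here is a minimal PyTorch sketch. It is not the repository's actual `DeepfakeDetector`: the class name `FusionClassifier`, the helper `tiny_backbone`, the ReLU activation, the dropout rate, and the two-class output are all assumptions made for illustration. Small stand-in backbones are used so the example runs without downloading pretrained weights.

```python
import torch
import torch.nn as nn


class FusionClassifier(nn.Module):
    """Hypothetical two-branch classifier: concatenate features, then an MLP head."""

    def __init__(self, cnn_branch, transformer_branch, cnn_dim, transformer_dim,
                 hidden_dim=512, num_classes=2, dropout=0.3):
        super().__init__()
        self.cnn_branch = cnn_branch                  # local-feature extractor
        self.transformer_branch = transformer_branch  # global-feature extractor
        self.head = nn.Sequential(
            nn.Linear(cnn_dim + transformer_dim, hidden_dim),  # 512-unit fusion layer
            nn.ReLU(),            # assumed activation
            nn.Dropout(dropout),  # assumed regularization
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, images):
        local_feats = self.cnn_branch(images)           # (batch, cnn_dim)
        global_feats = self.transformer_branch(images)  # (batch, transformer_dim)
        fused = torch.cat([local_feats, global_feats], dim=1)
        return self.head(fused)


def tiny_backbone(out_dim):
    """Stand-in feature extractor (assumption) that maps an image to a feature vector."""
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, out_dim),
    )


# 1536 and 1024 mirror the pooled feature sizes of EfficientNet-B3 and Swin-Base.
model = FusionClassifier(tiny_backbone(1536), tiny_backbone(1024),
                         cnn_dim=1536, transformer_dim=1024)
model.eval()
with torch.no_grad():
    logits = model(torch.randn(4, 3, 224, 224))  # batch of four 224x224 RGB images
print(logits.shape)  # torch.Size([4, 2])
```

To approximate the real backbones, the stand-ins could likely be swapped for `timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)` and `timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0)`, which return pooled feature vectors when `num_classes=0`. Whether the repository builds its branches this way is an assumption.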
## Architecture

General DeepFake Architecture

![General DeepFake Architecture](./assets/image.png)

Detailed Architecture to Detect Deepfake Images

![Detailed Architecture to Detect Deepfake Images](./assets/image-1.png)

Swin Transformer architecture

Layer normalization estimates the normalization statistics without introducing additional dependencies between training examples. The shifted-window multi-head self-attention (SW-MSA) layer takes the output of the window-based multi-head self-attention (W-MSA) layer as its input.

![Swin Transformer architecture](./assets/image-2.png)

Region Merging for boosting feature dimension

Region merging concatenates each group of four neighboring patches (2×2), which raises the feature dimension fourfold; a linear layer then reduces it to twice the original dimension. The procedure is carried out three times, each time paired with Swin Transformer blocks. Because every merge combines four adjacent patches into one, the spatial resolution halves at each stage while later blocks capture progressively more global information.

![Region Merging for boosting feature dimension](./assets/image-3.png)

Complete Block Diagram

![Complete Block Diagram](./assets/image-4.png)

The EffiSwinT architecture consists of three main components:

1. **EfficientNet-B3 Branch**: Extracts local features efficiently
2. **Swin Transformer Branch**: Captures global dependencies and hierarchical features
3. **Fusion Module**: Combines features from both branches through concatenation and MLP layers

### Technical Details

- Input Image Size: 224x224
- Backbone Models:
  - EfficientNet-B3 (pretrained)
  - Swin-Base-Patch4-Window7 (pretrained)
- Feature Fusion: Concatenation followed by MLP (512 units)
- Training Augmentations:
  - CutMix with α=1.0
  - Random Horizontal Flip
  - Normalization

## Results

![Confusion Matrix](./assets/confusionmatrix.jpg)

The model achieves competitive results on the Hemg/deepfake-and-real-images dataset:

- Training Accuracy: 91.7%
- Validation Accuracy: 98.9%

### Accuracy Plot

![Accuracy Plot](./assets/accuracyvsepoch.jpg)

### Loss Plot

![Loss Plot](./assets/lossovertime.jpg)

### Classification Report

![Classification Report](./assets/report.jpg)

### Train & Validation Loss

![Train and Validation Loss](./assets/train_val_loss.jpg)

## Dataset

The Hemg/deepfake-and-real-images dataset is used for training and validation. It contains a balanced distribution of real and deepfake images.

![Dataset](./assets/data.png)

## Training Details

- Training Epochs: 5
- Batch Size: 32
- Optimizer: AdamW
- Learning Rate: 1e-4
- Scheduler: Cosine Annealing
- Augmentations: CutMix, Random Horizontal Flip, Normalization

The model was trained on a P100 GPU; training takes around 10 hours.

## Implementation Details

```python
# Example usage
import torch

# DeepfakeDetector and predict_image are not defined in this snippet; they are
# assumed to come from this repository's code and must be imported first.
model = DeepfakeDetector()
model.load_state_dict(torch.load("effiswint_model.pt", map_location="cpu"))
model.eval()  # switch to inference mode before predicting
result, confidence = predict_image("path/to/image.jpg")
```

## Future Improvements

1. **Data Diversity**
   - Incorporate multiple deepfake datasets
   - Add more diverse real images
   - Include different types of manipulations
2. **Hyperparameter Optimization**
   - Learning rate scheduling strategies
   - Batch size optimization
   - CutMix probability tuning
   - Architecture-specific parameters
3. **Training Enhancements**
   - Increase training epochs (current: 5)
   - Implement gradient accumulation
   - Experiment with different optimizers
   - Add more augmentation techniques
4. **Model Robustness**
   - Test on cross-dataset scenarios
   - Add adversarial training
   - Implement ensemble methods

## Dependencies

- PyTorch
- timm
- pytorch-lightning
- transformers
- datasets
- scikit-learn
- seaborn

## Citation

```bibtex
@misc{mishra2024robust,
  title  = {A Robust Approach for Deepfake Detection Using SWIN Transformer},
  author = {Mishra, Soumya and Mohapatra, Hitesh and Gourisaria, Mahendra},
  year   = {2024},
  month  = {07},
  doi    = {10.21203/rs.3.rs-4672886/v1}
}

@article{coccomini2021combining,
  title   = {Combining EfficientNet and Vision Transformers for Video Deepfake Detection},
  author  = {Coccomini, Davide and Bechini, Alessio and Bertini, Marco},
  journal = {arXiv preprint arXiv:2107.02612},
  year    = {2021}
}

@mastersthesis{saha2024deepfake,
  title  = {Leveraging Ensemble Models for Enhanced Deepfake Detection},
  author = {Saha, Shawna},
  school = {University at Buffalo, The State University of New York},
  year   = {2024},
  type   = {Master's thesis},
  url    = {https://cse.buffalo.edu/tech-reports/2024-06.pdf}
}
```

## License

MIT

## Contact

For questions, contact saqibiqbal27772@gmail.com