---

license: mit
datasets:
- Hemg/deepfake-and-real-images
language:
- en
tags:
- deepfake-detection
- computer-vision
- ensemble-learning
- pytorch
- vision-transformer
- cnn
- image-classification
- swin-transformer
- EffiSwinT
metrics:
- accuracy
- precision
- recall
- f1

model-index:
- name: EffiSwinT-Deepfake-Detector
  results:
  - task: 
      type: image-classification
      name: Deepfake Detection
    dataset:
      type: Hemg/deepfake-and-real-images
      name: Deepfake and Real Images Dataset
    metrics:
      - type: accuracy
        value: 98.9
        name: Test Accuracy
      - type: f1
        value: 0.99
        name: F1 Score
      - type: precision
        value: 0.99
        name: Precision
      - type: recall
        value: 0.99
        name: Recall

pipeline_tag: image-classification
library_name: pytorch

---
# EffiSwinT: Efficient Deepfake Detection Using an EfficientNet-Swin Transformer Hybrid Architecture

## Abstract
This repository presents EffiSwinT, a novel hybrid architecture combining EfficientNet-B3 and Swin Transformer for robust deepfake detection. The model leverages the complementary strengths of both architectures: EfficientNet's efficient feature extraction and Swin Transformer's hierarchical representation learning capabilities.

## Architecture
General deepfake architecture

![General deepfake architecture](./assets/image.png)

Detailed architecture for detecting deepfake images

![Detailed architecture for detecting deepfake images](./assets/image-1.png)

Swin Transformer architecture

Layer normalization estimates its normalization statistics without introducing additional dependencies between training samples. The shifted-window multi-head self-attention (SW-MSA) layer takes the output of the window-based multi-head self-attention (W-MSA) layer and shifts the windows.

![Swin Transformer architecture](./assets/image-2.png)

Region merging for boosting the feature dimension

This layer groups the input patches into sets of four neighbors and concatenates each set, which increases the feature dimension fourfold; a linear layer then reduces it to twice the original dimension. The procedure is carried out three times, each time paired with Swin Transformer blocks. Merging adjacent patches lets the Swin Transformer capture global information: each merge lowers the spatial resolution of the feature map while widening the area that each token covers.

![Region merging for boosting the feature dimension](./assets/image-3.png)
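
The region (patch) merging step described above can be sketched in PyTorch as follows. This is a minimal illustration of the general Swin Transformer technique, not code taken from this repository: the class name `PatchMerging`, the channels-last tensor layout, and the starting size of 56x56 tokens with 128 channels (the usual Swin-Base setting for 224x224 inputs) are assumptions made for the example.

```python
import torch
import torch.nn as nn


class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighboring patches: C -> 4C -> 2C channels."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, height, width, channels); height and width must be even.
        top_left = x[:, 0::2, 0::2, :]
        bottom_left = x[:, 1::2, 0::2, :]
        top_right = x[:, 0::2, 1::2, :]
        bottom_right = x[:, 1::2, 1::2, :]
        # Concatenating the four neighbors quadruples the channel dimension.
        merged = torch.cat([top_left, bottom_left, top_right, bottom_right], dim=-1)
        # The linear layer reduces 4C to 2C.
        return self.reduction(self.norm(merged))


# Three merges in a row (the Swin blocks between them are omitted here).
tokens = torch.randn(1, 56, 56, 128)
stages = nn.Sequential(PatchMerging(128), PatchMerging(256), PatchMerging(512))
output = stages(tokens)
print(output.shape)  # torch.Size([1, 7, 7, 1024])
```

Each merge halves the height and width of the token grid and doubles the channel count, so three merges take a 56x56x128 grid to 7x7x1024.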

Complete Block Diagram

![Complete Block Diagram](./assets/image-4.png)


The EffiSwinT architecture consists of three main components:
1. **EfficientNet-B3 Branch**: Extracts local features efficiently
2. **Swin Transformer Branch**: Captures global dependencies and hierarchical features
3. **Fusion Module**: Combines features from both branches through concatenation and MLP layers

### Technical Details
- Input Image Size: 224x224
- Backbone Models:
  - EfficientNet-B3 (pretrained)
  - Swin-Base-Patch4-Window7 (pretrained)
- Feature Fusion: Concatenation followed by MLP (512 units)
- Training Augmentations: 
  - CutMix with α=1.0
  - Random Horizontal Flip
  - Normalization
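
For readers unfamiliar with CutMix, the sketch below shows the general technique with α=1.0, where the mixing ratio is drawn from Beta(1, 1), which is the uniform distribution on [0, 1]. The function name `cutmix` and its return format are assumptions for illustration; the repository's own implementation may differ.

```python
import math

import torch


def cutmix(images: torch.Tensor, labels: torch.Tensor, alpha: float = 1.0):
    """Return mixed images, both label sets, and the weight of the original labels."""
    batch, _, height, width = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(batch)

    # Box covering roughly (1 - lam) of the image, centered at a random point.
    cut_h = int(height * math.sqrt(1.0 - lam))
    cut_w = int(width * math.sqrt(1.0 - lam))
    cy = torch.randint(height, (1,)).item()
    cx = torch.randint(width, (1,)).item()
    top, bottom = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, height)
    left, right = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, width)

    # Paste the box from a shuffled copy of the batch into each image.
    mixed = images.clone()
    mixed[:, :, top:bottom, left:right] = images[perm, :, top:bottom, left:right]
    # Recompute lam from the clipped box so the label weights match the pixels.
    lam = 1.0 - (bottom - top) * (right - left) / (height * width)
    return mixed, labels, labels[perm], lam


images = torch.randn(4, 3, 224, 224)
labels = torch.tensor([0, 1, 0, 1])
mixed, labels_a, labels_b, lam = cutmix(images, labels, alpha=1.0)
```

During training, the loss is typically weighted the same way as the pixels: `lam * criterion(logits, labels_a) + (1 - lam) * criterion(logits, labels_b)`.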

## Results
![Confusion Matrix](./assets/confusionmatrix.jpg)

On the Hemg/deepfake-and-real-images dataset, the model reaches:
- Training Accuracy: 91.7%
- Validation Accuracy: 98.9%

### Accuracy Plot

![Accuracy Plot](./assets/accuracyvsepoch.jpg)


### Loss Plot
![Loss Plot](./assets/lossovertime.jpg)

### Classification Report
![Classification Report](./assets/report.jpg)
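
To make the reported metrics concrete, the sketch below computes accuracy, precision, recall, and F1 for a binary task in plain Python. The labels are toy values for illustration only, not the model's predictions, and treating class `1` as the positive class is an assumption. The helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In practice, scikit-learn (listed under Dependencies) provides `classification_report` and `precision_recall_fscore_support` for the same calculations.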

### Train & Validation Loss
![Train and Validation Loss](./assets/train_val_loss.jpg)

## Dataset
The Hemg/deepfake-and-real-images dataset is used for training and validation. It contains a balanced distribution of real and deepfake images.

![Dataset overview](./assets/data.png)

## Training Details
- Training Epochs: 5
- Batch Size: 32
- Optimizer: AdamW
- Learning Rate: 1e-4
- Scheduler: Cosine Annealing
- Augmentations: CutMix, Random Horizontal Flip, Normalization

The model was trained on an NVIDIA Tesla P100 GPU; training takes around 10 hours.
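
The optimizer and scheduler settings listed above map onto PyTorch roughly as follows. This is a sketch under assumptions: the `nn.Linear` layer is a stand-in for the real network, and stepping the scheduler once per epoch with `T_max` equal to the epoch count is a guess, since the source does not say whether the schedule advances per epoch or per batch.

```python
import torch
import torch.nn as nn

EPOCHS = 5
model = nn.Linear(8, 2)  # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Assumption: one scheduler step per epoch, annealing over all 5 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

learning_rates = []
for epoch in range(EPOCHS):
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # One epoch of training with batch size 32 would run here.
    optimizer.step()
    scheduler.step()

print(learning_rates)
```

The learning rate starts at 1e-4 and follows a half cosine curve downward, so later epochs make progressively smaller updates.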

## Implementation Details
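```python
# Example usage. DeepfakeDetector and predict_image are assumed to be
# defined in this repository's code; import them before running.
import torch
from PIL import Image

model = DeepfakeDetector()
model.load_state_dict(torch.load("effiswint_model.pt", map_location="cpu"))
model.eval()  # switch to inference mode
result, confidence = predict_image("path/to/image.jpg")
```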

## Future Improvements
1. **Data Diversity**
   - Incorporate multiple deepfake datasets
   - Add more diverse real images
   - Include different types of manipulations

2. **Hyperparameter Optimization**
   - Learning rate scheduling strategies
   - Batch size optimization
   - CutMix probability tuning
   - Architecture-specific parameters

3. **Training Enhancements**
   - Increase training epochs (current: 5)
   - Implement gradient accumulation
   - Experiment with different optimizers
   - Add more augmentation techniques

4. **Model Robustness**
   - Test on cross-dataset scenarios
   - Add adversarial training
   - Implement ensemble methods

## Dependencies
- PyTorch
- timm
- pytorch-lightning
- transformers
- datasets
- scikit-learn
- seaborn

## Citation
```bibtex
@misc{mishra2024robust,
  title  = {A Robust Approach for Deepfake Detection Using SWIN Transformer},
  author = {Mishra, Soumya and Mohapatra, Hitesh and Gourisaria, Mahendra},
  year   = {2024},
  month  = {07},
  doi    = {10.21203/rs.3.rs-4672886/v1}
}

@article{coccomini2021combining,
  title={Combining EfficientNet and Vision Transformers for Video Deepfake Detection},
  author={Coccomini, Davide and Bechini, Alessio and Bertini, Marco},
  journal={arXiv preprint arXiv:2107.02612},
  year={2021}
}

@mastersthesis{saha2024deepfake,
  title     = {Leveraging Ensemble Models for Enhanced Deepfake Detection},
  author    = {Saha, Shawna},
  school    = {University at Buffalo, The State University of New York},
  year      = {2024},
  type      = {Master's thesis},
  url       = {https://cse.buffalo.edu/tech-reports/2024-06.pdf}
}
```

## License
MIT

## Contact
For questions, contact saqibiqbal27772@gmail.com.