DeepSeek-R1-Distill-Qwen-1.5B Fine-tuned on MedMCQA
This model is a fine-tuned version of deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B on the MedMCQA dataset using QLoRA (Quantized Low-Rank Adaptation).
Model Details
- Base Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- Dataset: MedMCQA (Medical Multiple Choice Question Answering)
- Fine-tuning Method: QLoRA (4-bit quantization + LoRA; see the configuration sketch after this list)
- Training Framework: Transformers + PEFT + TRL
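The QLoRA setup can be reproduced roughly as follows. This is a minimal sketch: only the rank (16) and alpha (32) come from this card; the quantization options, target modules, and dropout below are assumptions.

```python
# Minimal QLoRA setup sketch. The quant type, compute dtype, target modules,
# and dropout are assumptions; only r=16 and lora_alpha=32 come from this card.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization of the base weights
    bnb_4bit_quant_type="nf4",              # assumed NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    bnb_4bit_use_double_quant=True,         # assumed
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16,               # LoRA rank (from Training Details)
    lora_alpha=32,      # LoRA alpha (from Training Details)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,  # assumed; not stated on this card
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # attach LoRA adapters to the quantized model
```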
Performance
Test Set Results
- Accuracy: 0.0980
- Macro F1-Score: 0.0446
- Weighted F1-Score: 0.1785
Validation Set Results
- Accuracy: 0.3400
- Macro F1-Score: 0.2290
- Weighted F1-Score: 0.2481
Per-Class Performance (Test Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| A | 0.0000 | 0.0000 | 0.0000 | 0 |
| B | 0.0000 | 0.0000 | 0.0000 | 0 |
| C | 1.0000 | 0.0980 | 0.1785 | 500 |
| D | 0.0000 | 0.0000 | 0.0000 | 0 |

Note: every scored test example has gold label C (zero support for A, B, and D), so those classes necessarily score zero; the model answered C on 9.8% of questions.
Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# Attach the fine-tuned LoRA adapter
model = PeftModel.from_pretrained(base_model, "arumpuri/deepseek-r1-medmcqa-qlora")
model.eval()

# Example usage
prompt = '''Question: What is the most common cause of acute pancreatitis?
Options:
A. Alcohol abuse
B. Gallstones
C. Hypertriglyceridemia
D. Medications
Answer:'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,   # required for temperature to take effect
    temperature=0.7,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
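When the base model is loaded in full or half precision as above, the adapter can optionally be folded into the base weights with PEFT's merge_and_unload(), which returns a plain transformers model and removes the per-step adapter overhead. The output path below is hypothetical.

```python
# Optional: merge the LoRA weights into the base model for adapter-free inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("deepseek-r1-medmcqa-merged")  # hypothetical output path
```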
Training Details
- LoRA Rank: 16
- LoRA Alpha: 32
- Learning Rate: 2e-4
- Batch Size: 1 (with gradient accumulation)
- Epochs: 3
- Optimizer: Paged AdamW 8-bit (see the training sketch after this list)
- Training Samples: 5000
- Validation Samples: 500
- Test Samples: 1000
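A sketch of how these hyperparameters might be wired into a TRL training run is shown below, assuming a recent TRL version that provides SFTConfig. Here `model` is the LoRA-wrapped model from the Model Details sketch, `train_dataset` and `val_dataset` are placeholder names for prepared MedMCQA splits (preparation not shown), and the gradient-accumulation steps and logging cadence are assumptions.

```python
# Training sketch using TRL's SFTTrainer; dataset preparation is not shown.
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="deepseek-r1-medmcqa-qlora",  # hypothetical output directory
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # assumed; card only says "with gradient accumulation"
    num_train_epochs=3,
    optim="paged_adamw_8bit",       # Paged AdamW 8-bit, as listed above
    logging_steps=50,               # assumed
)

trainer = SFTTrainer(
    model=model,                  # quantized base + LoRA adapters (see Model Details sketch)
    args=training_args,
    train_dataset=train_dataset,  # 5,000 MedMCQA training examples (placeholder)
    eval_dataset=val_dataset,     # 500 validation examples (placeholder)
)
trainer.train()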
Evaluation Methodology
The model was evaluated with standard classification metrics:
- Accuracy: Overall correctness across all questions
- Precision/Recall/F1: Per-class and averaged metrics
- Confusion Matrix: Detailed error analysis
The evaluation was performed on both the validation and test sets; a minimal sketch of the scoring pipeline follows.
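The scoring can be reproduced along these lines, assuming each generation is reduced to a single option letter per question. The extraction regex and the variable names (`generated_texts`, `y_true`) are placeholders, not the card's exact pipeline.

```python
# Sketch of letter-level scoring with scikit-learn; the extraction rule is assumed.
import re
from sklearn.metrics import (
    accuracy_score, f1_score, confusion_matrix, classification_report
)

def extract_letter(generated_text: str) -> str:
    """Pull the first standalone option letter (A-D) out of the model output."""
    match = re.search(r"\b([A-D])\b", generated_text)
    return match.group(1) if match else "?"  # "?" marks unparseable outputs

# generated_texts / y_true are placeholders: one generation and one gold letter per question
y_pred = [extract_letter(text) for text in generated_texts]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
print(confusion_matrix(y_true, y_pred, labels=["A", "B", "C", "D"]))
print(classification_report(y_true, y_pred, labels=["A", "B", "C", "D"], zero_division=0))
```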
Intended Use
This model is designed for medical question answering tasks, particularly multiple-choice questions. It should be used as a research tool and not for actual medical diagnosis or treatment recommendations.
Limitations
- The model may generate incorrect medical information
- It should not be used for clinical decision-making
- Performance may vary on questions outside the training distribution
- Evaluation was performed on a subset of the full dataset due to computational constraints