	---
	license: mit
	base_model:
	- thomas-sounack/BioClinical-ModernBERT-base
	tags:
	- sentence-transformers
	- sentence-similarity
	- medical
	- clinical
	- biomedical
	- pubmed
	- healthcare
	- medical-ai
	- clinical-nlp
	- bioinformatics
	- medical-literature
	- clinical-text
	---
	# Clinical ModernBERT Embedding Model

	A specialized medical embedding model fine-tuned from BioClinical ModernBERT (`thomas-sounack/BioClinical-ModernBERT-base`) using InfoNCE contrastive learning on PubMed title-abstract pairs.

	## Model Details

	- Base Model: thomas-sounack/BioClinical-ModernBERT-base
	- Training Method: InfoNCE contrastive learning
	- Training Data: PubMed title-abstract pairs
	- Max Sequence Length: 2048 tokens
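
The card does not spell out the loss configuration, so the following is only a minimal sketch of InfoNCE with in-batch negatives. It assumes each title is paired with its own abstract and that every other abstract in the batch acts as a negative. The function name `info_nce_loss`, the temperature value, and the toy similarity matrices are illustrative assumptions, not the settings used to train this model.

```python
import math


def info_nce_loss(similarities, temperature=0.05):
    """Mean InfoNCE loss for a square title-by-abstract similarity matrix.

    similarities[i][j] is the cosine similarity between title i and
    abstract j; the diagonal holds the positive (matching) pairs.
    The temperature default is an assumption for illustration only.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching abstract as the target class.
        total += log_denominator - logits[i]
    return total / len(similarities)


# Toy batch of 3 pairs: well-separated vs. uninformative similarities.
separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
uninformative = [[0.5, 0.5, 0.5] for _ in range(3)]

print(info_nce_loss(separated))      # small: positives dominate each row
print(info_nce_loss(uninformative))  # equals log(3): no pair is preferred
```

Minimizing this quantity pulls each title toward its own abstract and pushes it away from the other abstracts in the batch, which is why larger batches supply more negatives per step.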

	## Usage

	```python
	from sentence_transformers import SentenceTransformer

	# Load the model
	model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

	# Encode medical texts
	texts = [
	    "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
	    "Inflammatory cytokines in RA lead to progressive cartilage and bone destruction.",
	]
	embeddings = model.encode(texts)
	```
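
Once texts are encoded, a common next step is ranking candidates against a query. The sketch below assumes the embeddings are L2-normalized (for example via `model.encode(..., normalize_embeddings=True)`), so a dot product equals cosine similarity. `rank_by_similarity`, `demo_with_model`, the query string, and the toy vectors are hypothetical names and data used only for illustration.

```python
def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes all vectors are L2-normalized, so the dot product is the
    cosine similarity.
    """
    scores = [
        (idx, sum(q * d for q, d in zip(query_vec, vec)))
        for idx, vec in enumerate(doc_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def demo_with_model():
    """Encode a query and a tiny corpus with the model, then rank them.

    Requires `sentence-transformers` and downloads the model on first use.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("lokeshch19/ModernPubMedBERT")
    corpus = [
        "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
        "Hypertension increases the risk of stroke and heart attack.",
    ]
    query = "joint inflammation in rheumatoid arthritis"  # illustrative query
    doc_vecs = model.encode(corpus, normalize_embeddings=True)
    query_vec = model.encode(query, normalize_embeddings=True)
    for idx, score in rank_by_similarity(query_vec, doc_vecs):
        print(f"{score:.4f}  {corpus[idx]}")


# Toy, already-normalized 2-D vectors standing in for real embeddings.
toy_ranking = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]])
print(toy_ranking)
```

Calling `demo_with_model()` runs the same ranking on real embeddings; the toy call at the bottom works offline.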

	## Applications

	- Medical document similarity analysis
	- Clinical text retrieval systems
	- Biomedical literature search
	- Medical concept matching and classification

	## Model Comparison

	Compared to `NeuML/bioclinical-modernbert-base-embeddings`, this model scores higher on every retrieval metric reported below and assigns lower similarity to unrelated non-medical text in the examples that follow.

	### Comprehensive Evaluation Results

	| Metric | Our Model | NeuML Model | Relative Improvement |
	|--------|-----------|-------------|----------------------|
	| Accuracy@1 | 91.28% | 85.86% | +6.3% |
	| Accuracy@3 | 98.46% | 95.66% | +2.9% |
	| Accuracy@5 | 99.24% | 97.14% | +2.2% |
	| Accuracy@10 | 99.64% | 98.29% | +1.4% |
	| NDCG@5 | 95.96% | 92.37% | +3.9% |
	| NDCG@10 | 96.10% | 92.75% | +3.6% |
	| MRR@10 | 94.89% | 90.90% | +4.4% |
	| MAP@100 | 94.91% | 90.96% | +4.3% |

	Evaluation was performed using `InformationRetrievalEvaluator` from sentence-transformers on the `gamino/wiki_medical_terms` dataset.
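
The card does not describe how queries and corpus were derived from the dataset, so the sketch below is one plausible setup rather than the protocol behind the table. It assumes each row exposes `page_title` and `page_text` columns, that titles serve as queries, and that the matching page text is the single relevant document. `build_ir_inputs` and `run_evaluation` are hypothetical helper names; `ndcg_at_k` is set explicitly because the table reports NDCG@5 as well as NDCG@10.

```python
def build_ir_inputs(rows):
    """Turn (title, text) rows into the three dicts the evaluator expects.

    Assumption: each title is a query and its own page text is the only
    relevant document. The actual protocol behind the table is not stated.
    """
    queries, corpus, relevant_docs = {}, {}, {}
    for i, (title, text) in enumerate(rows):
        qid, did = f"q{i}", f"d{i}"
        queries[qid] = title
        corpus[did] = text
        relevant_docs[qid] = {did}
    return queries, corpus, relevant_docs


def run_evaluation(model_name="lokeshch19/ModernPubMedBERT"):
    """Run the retrieval evaluation; needs `datasets` and `sentence-transformers`."""
    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.evaluation import InformationRetrievalEvaluator

    data = load_dataset("gamino/wiki_medical_terms", split="train")
    # Column names below are assumed; check them against the dataset viewer.
    rows = [(row["page_title"], row["page_text"]) for row in data]
    queries, corpus, relevant_docs = build_ir_inputs(rows)
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        accuracy_at_k=[1, 3, 5, 10],
        ndcg_at_k=[5, 10],
        mrr_at_k=[10],
        map_at_k=[100],
        name="wiki_medical_terms",
    )
    return evaluator(SentenceTransformer(model_name))


# Tiny in-memory example of the input format (no downloads needed).
example = build_ir_inputs([
    ("Hypertension", "<page text for Hypertension>"),
    ("Asthma", "<page text for Asthma>"),
])
print(example[2])
```

Passing the NeuML model name to `run_evaluation` would produce the comparison column under the same assumptions.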

	### Medical Text Similarity

	Example 1: Related Medical Concepts
	```python
	text1 = "Hypertension increases the risk of stroke and heart attack."
	text2 = "High blood pressure damages arterial walls over time, leading to cardiovascular events."

	# Cosine Similarity Results:
	# Our Model: 0.5941 (59.4%)
	# NeuML Model: 0.5267 (52.7%)
	# Relative improvement: +12.7%
	```

	### Non-Medical Text Discrimination

	Example 2: Medical vs. Programming Terms
	```python
	texts = ["diabetes type 2", "asyncio.run()"]

	# Cosine Similarity Results:
	# Our Model: 0.0804 (8.0%) - Low similarity, as expected for unrelated terms
	# NeuML Model: 0.1926 (19.3%) - Higher similarity despite unrelated content
	# Difference: 58% lower similarity score for this unrelated pair
	```
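
The scores quoted in the two examples are cosine similarities between the embeddings of each text pair. As a rough sketch of how such a comparison could be reproduced, the helper below accepts any `encode` callable (for example `SentenceTransformer(...).encode`). `cosine_similarity`, `compare_models`, `load_encoders`, and `toy_encode` are hypothetical names, and exact scores may vary with library and model versions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / (norm_a * norm_b)


def compare_models(encoders, text1, text2):
    """Score one text pair with several encoders.

    `encoders` maps a label to a callable that takes a list of strings
    and returns one vector per string.
    """
    results = {}
    for label, encode in encoders.items():
        vec1, vec2 = encode([text1, text2])
        results[label] = cosine_similarity(vec1, vec2)
    return results


def load_encoders():
    """Load both models; requires `sentence-transformers` and network access."""
    from sentence_transformers import SentenceTransformer

    names = {
        "Our Model": "lokeshch19/ModernPubMedBERT",
        "NeuML Model": "NeuML/bioclinical-modernbert-base-embeddings",
    }
    return {label: SentenceTransformer(name).encode for label, name in names.items()}


def toy_encode(texts):
    """Offline stand-in encoder so the helper can run without downloads."""
    return [[float(len(t)), 1.0] for t in texts]


toy_scores = compare_models({"toy": toy_encode}, "diabetes type 2", "asyncio.run()")
print(toy_scores)
```

With the real models, `compare_models(load_encoders(), text1, text2)` returns one cosine score per model for the chosen pair.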

	### Key Advantages

	- Enhanced Medical Understanding: 12.7% higher cosine similarity for the related medical sentence pair in Example 1
	- Improved Discrimination: 58% lower cosine similarity between the medical and programming terms in Example 2
	- Domain Specialization: Fine-tuned on PubMed title-abstract pairs for medical text processing

	## Training Details

	- Optimizer: AdamW (learning rate: 3e-4, weight decay: 0.1)
	- Batch Size: 72
	- Training Steps: 7,000
	- Warmup Steps: 700
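
To make the warmup setting concrete, here is a small sketch of the learning-rate schedule these hyperparameters could imply. The card states only the step counts, so the linear warmup shape and the linear decay to zero afterwards are assumptions, and `learning_rate_at` is a hypothetical helper.

```python
PEAK_LR = 3e-4        # from the card
WEIGHT_DECAY = 0.1    # from the card (not used by the schedule itself)
BATCH_SIZE = 72       # from the card
TOTAL_STEPS = 7_000   # from the card
WARMUP_STEPS = 700    # from the card


def learning_rate_at(step):
    """Learning rate at a given optimizer step.

    Assumed shape: linear warmup to PEAK_LR, then linear decay to zero.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 350, 700, 3_850, 7_000):
    print(step, learning_rate_at(step))

# Pairs processed if each batch item is one title-abstract pair and there
# is no gradient accumulation (both are assumptions).
print(BATCH_SIZE * TOTAL_STEPS)
```

Under these assumptions the warmup covers the first 10% of training, and the rate peaks at step 700.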

	## Citation

	If you use this model, please cite the base model paper and acknowledge this fine-tuning work.