vqa-multimodal-medical-report

Model Overview

The vqa-multimodal-medical-report model is a Visual Question Answering (VQA) system built on the ViLT (Vision-and-Language Transformer) architecture. It is specialized for the medical domain: given a medical image (e.g., an X-ray) and a clinical question in text form, it performs joint reasoning over the visual and textual inputs to answer complex clinical questions.
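
For a quick end-to-end check, the model can be queried through the generic visual-question-answering pipeline in transformers. This is a minimal sketch; the image path is a placeholder:

from transformers import pipeline

# Load the checkpoint through the generic VQA pipeline.
vqa = pipeline("visual-question-answering", model="YourOrg/vqa-multimodal-medical-report")

# The pipeline accepts an image (path, URL, or PIL.Image) and a question string.
result = vqa(image="path/to/xray_scan.jpg", question="Is there evidence of cardiomegaly?")
print(result)  # list of {'answer': ..., 'score': ...} candidates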

Model Architecture

  • Base Architecture: ViLT (ViltForQuestionAnswering).
  • Multimodal Fusion: Unlike traditional VQA models, which encode each modality with a separate deep network, ViLT processes image patches and text tokens (the question) jointly in a single transformer encoder after projecting them into a unified embedding space. This allows deep, fine-grained cross-modal reasoning (the sketch after this list shows the tensors involved).
  • Inputs: A medical image (e.g., a chest X-ray) and a question (e.g., "Is there evidence of cardiomegaly?").
  • Output: A predicted answer drawn from a pre-defined vocabulary of common medical terms and short answers (e.g., 'yes', 'no', 'pneumonia', 'right upper lobe').
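
As a minimal sketch of this interface, assuming the checkpoint follows the standard ViltForQuestionAnswering layout (the dummy white image stands in for a real X-ray):

from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image

model_name = "YourOrg/vqa-multimodal-medical-report"
processor = ViltProcessor.from_pretrained(model_name)
model = ViltForQuestionAnswering.from_pretrained(model_name)

# The processor emits both modalities for the shared encoder:
# the tokenized question (input_ids, attention_mask, ...) and the
# image patches (pixel_values, pixel_mask).
image = Image.new("RGB", (384, 384), color="white")
encoding = processor(image, "Is there evidence of cardiomegaly?", return_tensors="pt")
print(sorted(encoding.keys()))

# The classification head scores every entry in the fixed answer vocabulary.
print(model.config.num_labels)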

Intended Use

  • Clinical Decision Support: Assisting clinicians by providing quick, evidence-based answers extracted directly from medical imagery and associated clinical questions.
  • Medical Training: Serving as an interactive learning tool for radiology and pathology residents to test their image interpretation skills.
  • Radiology Report Generation: Providing structured data (answers to specific questions) that can be inserted into a final diagnostic report; this workflow is sketched after this list.
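
For the report-generation use case, a checklist of questions can be answered in a loop and folded into a draft report. The question list and report format below are illustrative assumptions, not part of the model:

from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import torch

model_name = "YourOrg/vqa-multimodal-medical-report"
processor = ViltProcessor.from_pretrained(model_name)
model = ViltForQuestionAnswering.from_pretrained(model_name)

# Hypothetical checklist; a real deployment would use a curated template.
questions = [
    "Is there evidence of cardiomegaly?",
    "Is there a pleural effusion?",
    "What is the location of the acute finding?",
]

image = Image.open("path/to/xray_scan.jpg").convert("RGB")

# Answer each question independently and collect the results.
findings = {}
for question in questions:
    encoding = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoding).logits
    findings[question] = model.config.id2label[logits.argmax(-1).item()]

# Fold the structured answers into a draft report section.
report = "FINDINGS (auto-generated, requires radiologist review):\n"
report += "\n".join(f"- {q} {a}" for q, a in findings.items())
print(report)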

Limitations and Ethical Considerations

  • Safety Criticality: This model is NOT a substitute for a human radiologist or physician. It is an automated tool and should only be used in a supporting, non-diagnostic capacity.
  • Complex Reasoning: It may struggle with highly abstract, inferential, or differential diagnosis questions that require complex prior medical knowledge.
  • Adversarial Attacks: Simple modifications to the image or question can sometimes lead to nonsensical or dangerous predictions.
  • Domain Specificity: Performance is strong on the training domain (mostly chest X-rays from MIMIC-CXR) but degrades sharply on inputs outside the training distribution (e.g., dermatological images); a simple confidence-based guard is sketched after this list.
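
One lightweight mitigation for these failure modes is to abstain when the model's confidence is low and defer to a human reader. This is a sketch of a deployment pattern, not part of the model; the 0.5 threshold is a placeholder that would need calibration on held-out clinical data:

import torch

def answer_or_abstain(model, encoding, threshold=0.5):
    """Return the predicted answer, or None if confidence is below threshold."""
    with torch.no_grad():
        logits = model(**encoding).logits
    probs = logits.softmax(-1)
    score, idx = probs.max(-1)
    if score.item() < threshold:
        return None  # defer to a human reader
    return model.config.id2label[idx.item()]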

Example Code

To get an answer based on an X-ray and a question:

from transformers import ViltProcessor, ViltForQuestionAnswering
from PIL import Image
import torch

# Load model and processor
model_name = "YourOrg/vqa-multimodal-medical-report"
processor = ViltProcessor.from_pretrained(model_name)
model = ViltForQuestionAnswering.from_pretrained(model_name)

# Define inputs (Conceptual image loading)
# image = Image.open("path/to/xray_scan.jpg").convert("RGB")
# Dummy image for demonstration
dummy_image = Image.new("RGB", (384, 384), color="white")
question = "What is the location of the acute finding?"

# Process the inputs
encoding = processor(dummy_image, question, return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model(**encoding)
    logits = outputs.logits

# Find the predicted answer
idx = logits.argmax(-1).item()
predicted_answer = model.config.id2label[idx]

print(f"Question: {question}")
print(f"Predicted Answer: **{predicted_answer}**")