---
language: en
license: mit
tags:
- clip
- multimodal
- contrastive-learning
- cultural-heritage
- reevaluate
- information-retrieval
datasets:
- xuemduan/reevaluate-image-text-pairs
model-index:
- name: REEVALUATE CLIP Fine-tuned Models
  results:
  - task:
      type: image-text-retrieval
      name: Image-Text Retrieval
    dataset:
      name: Cultural Heritage Hybrid Dataset
      type: xuemduan/reevaluate-image-text-pairs
    metrics:
    - name: I2T R@1
      type: recall@1
      value:
    - name: I2T R@5
      type: recall@5
      value:
    - name: T2I R@1
      type: recall@1
      value:
---

# Domain-Adaptive CLIP for Multimodal Retrieval

The fine-tuned CLIP (ViT-L/14) model used in **Knowledge-Enhanced Multimodal Retrieval**.

---

## 📦 Available Models

| Model | Description | Data Type |
|-------|-------------|-----------|
| `reevaluate-clip` | Fine-tuned on images, query texts, and description texts | Image + Text |

---

## 🧾 Dataset

The models were trained and evaluated on the **REEVALUATE Image-Text Pair Dataset**, which contains **43,500 image–text pairs** derived from Wikidata and Pilot Museums. Each artefact is described by:

- `Image`: artefact image
- `Description text`: a BLIP-generated natural-language portion plus a metadata portion
- `Query text`: user-query-like text

Dataset: [xuemduan/reevaluate-image-text-pairs](https://huggingface.co/datasets/xuemduan/reevaluate-image-text-pairs)

---

## 🚀 Usage

```python
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

image = Image.open("artefact.jpg")
text = "yellow flower paintings"

image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
text_embeds = model.get_text_features(**processor(text=[text], return_tensors="pt"))

# Normalize embeddings to unit length so the dot product is a cosine similarity
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embeds @ text_embeds.T
print(similarity)
```
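
For text-to-image retrieval (the task reported in the model index), the same embeddings can rank a whole gallery of images against a query. The sketch below is a minimal illustration: the gallery paths and query string are placeholders, not files shipped with this repository.

```python
import torch
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

# Illustrative gallery of artefact images; replace with your own paths
gallery_paths = ["artefact_1.jpg", "artefact_2.jpg", "artefact_3.jpg"]
query = "yellow flower paintings"

with torch.no_grad():
    # Encode the gallery in one batch
    image_inputs = processor(images=[Image.open(p) for p in gallery_paths], return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)

    # Encode the query text
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)

# Cosine similarity = dot product of L2-normalized embeddings
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
scores = (text_embeds @ image_embeds.T).squeeze(0)

# Rank gallery images by similarity to the query (highest first)
for idx in scores.argsort(descending=True).tolist():
    print(gallery_paths[idx], float(scores[idx]))
```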
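
To inspect the image–text pairs themselves, the dataset can be pulled with the `datasets` library. This is a sketch only: the split name is an assumption, and the column names should be checked against the dataset card rather than taken from here.

```python
from datasets import load_dataset

# split="train" is an assumption; see the dataset card for the available splits
ds = load_dataset("xuemduan/reevaluate-image-text-pairs", split="train")

print(ds)                 # number of rows and column names
example = ds[0]
print(example.keys())     # inspect the fields available for one artefact
```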