---
language: en
license: mit
tags:
- clip
- multimodal
- contrastive-learning
- cultural-heritage
- reevaluate
- information-retrieval
datasets:
- xuemduan/reevaluate-image-text-pairs
model-index:
- name: REEVALUATE CLIP Fine-tuned Models
  results:
  - task:
      type: image-text-retrieval
      name: Image-Text Retrieval
    dataset:
      name: Cultural Heritage Hybrid Dataset
      type: xuemduan/reevaluate-image-text-pairs
    metrics:
    - name: I2T R@1
      type: recall@1
      value: <TOBE_FILL_IN>
    - name: I2T R@5
      type: recall@5
      value: <TOBE_FILL_IN>
    - name: T2I R@1
      type: recall@1
      value: <TOBE_FILL_IN>
---


# Domain-Adaptive CLIP for Multimodal Retrieval

This is the fine-tuned CLIP (ViT-L/14) model used in **Knowledge-Enhanced Multimodal Retrieval**.


---

## 📦 Available Models

| Model | Description | Data Type |
|--------|--------------|-----------|
| `reevaluate-clip` | Fine-tuned on images, query texts, and description texts | Image+Text |

---

## 🧾 Dataset

The models were trained and evaluated on the **REEVALUATE Image-Text Pair Dataset**, which contains **43,500 image–text pairs** derived from Wikidata and Pilot Museums.

Each artefact is described by:
- `Image`: artefact image
- `Description text`: a BLIP-generated natural-language portion plus a metadata portion
- `Query text`: text phrased like a user query

Dataset: [xuemduan/reevaluate-image-text-pairs](https://huggingface.co/datasets/xuemduan/reevaluate-image-text-pairs)
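
Retrieval quality on paired data like this is usually reported as recall@K in both directions (image-to-text and text-to-image). The sketch below shows one common way to compute it from a similarity matrix. It is not the project's evaluation script: the helper name `recall_at_k` is hypothetical, the toy embeddings are made up for illustration, and it assumes a one-to-one pairing in which query `i` matches candidate `i`.

```python
import numpy as np


def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries whose paired candidate is ranked in the top k.

    Assumption: similarity[i, j] scores query i against candidate j, and the
    correct candidate for query i is candidate i (one-to-one pairing).
    """
    # Score of the ground-truth candidate for every query (the diagonal).
    true_scores = np.diag(similarity)[:, None]
    # Rank = number of candidates scoring strictly higher than the ground truth.
    ranks = (similarity > true_scores).sum(axis=1)
    return float((ranks < k).mean())


# Toy, already L2-normalized embeddings (illustrative values only).
image_embeds = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
text_embeds = np.array([[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]])

# Rows are images, columns are texts.
similarity = image_embeds @ text_embeds.T

i2t_r1 = recall_at_k(similarity, k=1)    # image-to-text
t2i_r1 = recall_at_k(similarity.T, k=1)  # text-to-image
print(f"I2T R@1 = {i2t_r1:.3f}, T2I R@1 = {t2i_r1:.3f}")
```

Transposing the matrix swaps the roles of query and candidate, so the same function covers both retrieval directions. If several texts describe the same image, the diagonal assumption no longer holds and the ground-truth mapping would have to be passed in explicitly.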

---

## 🚀 Usage

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("xuemduan/reevaluate-clip")
processor = CLIPProcessor.from_pretrained("xuemduan/reevaluate-clip")

# Convert to RGB so grayscale or RGBA files are handled consistently
image = Image.open("artefact.jpg").convert("RGB")
text = "yellow flower paintings"

# Inference only, so skip gradient tracking
with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=image, return_tensors="pt"))
    text_embeds = model.get_text_features(**processor(text=[text], return_tensors="pt"))

# normalize
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = (image_embeds @ text_embeds.T)
print(similarity)
```