SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-multilingual-base on the vi_health_qa dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: Alibaba-NLP/gte-multilingual-base
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- vi_health_qa
Language: Vietnamese (vi)

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
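
The three modules above can be read as: take the first ([CLS]) token vector from the Transformer output, then L2-normalize it. The following is a minimal sketch of that idea in plain Python, not the library's implementation; the toy token vectors and the helper names (`cls_pool`, `l2_normalize`) are made up for illustration. It also shows why, after normalization, the dot product of two embeddings equals their cosine similarity.

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical token embeddings (3 tokens x 4 dims) standing in for Transformer output.
tokens_a = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0]]
tokens_b = [[0.0, 5.0, 0.0, 12.0], [2.0, 0.0, 2.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

emb_a = l2_normalize(cls_pool(tokens_a))
emb_b = l2_normalize(cls_pool(tokens_b))

# For unit-length vectors the dot product and the cosine similarity coincide.
print(dot(emb_a, emb_b), cosine(cls_pool(tokens_a), cls_pool(tokens_b)))
```

Because the final module normalizes every embedding, cosine-based and dot-product-based retrieval scores are expected to agree for this model.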

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
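# gte-multilingual-base relies on custom modeling code, so trust_remote_code=True is required
model = SentenceTransformer("nampham1106/gte-multilingual-base-finetuned", trust_remote_code=True)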
# Run inference
sentences = [
    'Tăng bạch cầu có ảnh hưởng gì tới viêm lợi không?',
    'Bé nhà bạn ngoài hay bị nhiệt, viêm lợi ra thì còn có vấn đề sức khỏe nào khác không?Cháu đã uống thuốc gì để điều trị nhiệt, viêm lợi ?Bạch cầu của cháu tăng có thể do nhiều nguyên nhân, thường gặp là do viêm nhiễm hoặc do dùng một số loại thuốc. Bạn nên đưa bé đến khám để được chẩn đoán tốt hơn.',
    'Điều trị phẫu thuật và làm xét nghiệm mô bệnh học là phương pháp điều trị các khối u tuyến nước bọt tốt nhất. Phẫu thuật khối u tuyến nước bọt mang tai phụ thuộc vào loại, kích thước, tính chất khối u. Theo đó, phẫu thuật có thể tiến hành loại bỏ một phần hoặc loại bỏ toàn bộ tuyến nước bọt và có hoặc không kèm theo loại bỏ các hạch bạch huyết, dây thần kinh có liên quan.Bên cạnh đó còn có phương pháp phẫu thuật tái tạo, nghĩa là sau khi phẫu thuật để loại bỏ khối u, bác sĩ có thể đề nghị phẫu thuật tái tạo để sửa chữa khu vực. Trong quá trình phẫu thuật tái tạo, bác sĩ phẫu thuật làm việc để sửa chữa cải thiện khả năng nhai, nuốt, nói hoặc thở, có thể cần ghép da, mô hoặc dây thần kinh từ các bộ phận khác của cơ thể để xây dựng lại các khu vực trong miệng, cổ họng hoặc hàm của bệnh nhân.Vì không có thông tin về chẩn đoán của em trước mổ, tính chất (mềm hay cứng, đau hay không đau, có xâm lấn hay không...) và kích thước khối u, phương pháp phẫu thuật, thời gian phẫu thuật bao lâu rồi, em đã thăm khám lại với bác sĩ mổ cho mình hay chưa. Nên chưa thể kết luận liệt dây thần kinh 9,10 sau mổ u tuyến nước bọt có khả năng phục hồi khôngTuy nhiên, bác sĩ có một vài lời khuyên dành cho em:Tái khám đúng lịch để theo dõi tiến triển của khối u cũng như khả năng tái phát của khối u sau điều trị. Đồng thời bác sĩ sẽ đánh giá mức độ tổn thương thần kinh, kết hợp khám chuyên khoa Vật lý trị liệu- Phục hồi chức năng để tập luyện chức năng nói, nuốt.Xét nghiệm thường xuyên theo chỉ định của bác sĩBổ sung dinh dưỡng và uống nhiều nước.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
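
For semantic search, the similarity scores are typically used to rank candidate answers for a question. The sketch below illustrates that ranking step on toy unit vectors so it runs without downloading the model; `rank_documents` is a hypothetical helper, and in practice the vectors would come from `model.encode(...)`.

```python
def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending dot product.

    Assumes all vectors are already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

# Toy 2-dimensional unit vectors standing in for real 768-dimensional embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]

# Indices of the two most similar documents, best first.
print(rank_documents(query, docs, top_k=2))
```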

Evaluation

Metrics

Information Retrieval

Dataset: healthcare-dev
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.6284
cosine_accuracy@3	0.7825
cosine_accuracy@5	0.8318
cosine_accuracy@10	0.8882
cosine_precision@1	0.6284
cosine_precision@3	0.2608
cosine_precision@5	0.1664
cosine_precision@10	0.0888
cosine_recall@1	0.6284
cosine_recall@3	0.7825
cosine_recall@5	0.8318
cosine_recall@10	0.8882
cosine_ndcg@10	0.7576
cosine_mrr@10	0.7157
cosine_map@100	0.7208
dot_accuracy@1	0.6284
dot_accuracy@3	0.7825
dot_accuracy@5	0.8318
dot_accuracy@10	0.8882
dot_precision@1	0.6284
dot_precision@3	0.2608
dot_precision@5	0.1664
dot_precision@10	0.0888
dot_recall@1	0.6284
dot_recall@3	0.7825
dot_recall@5	0.8318
dot_recall@10	0.8882
dot_ndcg@10	0.7576
dot_mrr@10	0.7157
dot_map@100	0.7208
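
In the table above, accuracy@k equals recall@k and precision@k is roughly accuracy@k divided by k (for example 0.7825 / 3 ≈ 0.2608). That pattern is what one would expect if each question has exactly one relevant answer, which is an assumption here rather than something the card states. Under that assumption, the metrics can be sketched as follows; `ir_metrics` is a hypothetical helper, not the `InformationRetrievalEvaluator` implementation.

```python
def ir_metrics(ranked_ids_per_query, relevant_id_per_query, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    Assumes exactly one relevant document per query.
    """
    n = len(ranked_ids_per_query)
    hits = 0
    precision_sum = 0.0
    reciprocal_rank_sum = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        top = ranked[:k]
        if relevant in top:
            hits += 1
            precision_sum += 1.0 / k  # one relevant item among k retrieved
            reciprocal_rank_sum += 1.0 / (top.index(relevant) + 1)
    return {
        "accuracy": hits / n,
        "precision": precision_sum / n,
        "recall": hits / n,  # equals accuracy when there is one relevant item
        "mrr": reciprocal_rank_sum / n,
    }

# Toy rankings for three queries whose relevant document is always "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant = ["a", "a", "a"]

print(ir_metrics(rankings, relevant, k=1))
print(ir_metrics(rankings, relevant, k=3))
```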

Information Retrieval

Dataset: healthcare-dev
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.5946
cosine_accuracy@3	0.7486
cosine_accuracy@5	0.8003
cosine_accuracy@10	0.8584
cosine_precision@1	0.5946
cosine_precision@3	0.2495
cosine_precision@5	0.1601
cosine_precision@10	0.0858
cosine_recall@1	0.5946
cosine_recall@3	0.7486
cosine_recall@5	0.8003
cosine_recall@10	0.8584
cosine_ndcg@10	0.726
cosine_mrr@10	0.6836
cosine_map@100	0.6887
dot_accuracy@1	0.5946
dot_accuracy@3	0.7486
dot_accuracy@5	0.8003
dot_accuracy@10	0.8584
dot_precision@1	0.5946
dot_precision@3	0.2495
dot_precision@5	0.1601
dot_precision@10	0.0858
dot_recall@1	0.5946
dot_recall@3	0.7486
dot_recall@5	0.8003
dot_recall@10	0.8584
dot_ndcg@10	0.726
dot_mrr@10	0.6836
dot_map@100	0.6887

Training Details

Training Dataset

vi_health_qa

Dataset: vi_health_qa at d90a62d
Size: 7,009 training samples
Columns: question and answer
Approximate statistics based on the first 1000 samples:
	question	answer
type	string	string
details	min: 8 tokens mean: 30.36 tokens max: 325 tokens	min: 12 tokens mean: 131.93 tokens max: 1249 tokens

Samples:

question	answer
`Đang chích ngừa viêm gan B có chích ngừa Covid-19 được không?`	`Nếu anh/chị đang tiêm ngừa vaccine phòng bệnh viêm gan B, anh/chị vẫn có thể tiêm phòng vaccine phòng Covid-19, tuy nhiên vaccine Covid-19 phải được tiêm cách trước và sau mũi vaccine viêm gan B tối thiểu là 14 ngày.`
`Đau đầu, căng thẳng do công việc, suy giảm trí nhớ khoảng gần một năm phải làm sao?`	`Tình trạng đau đầu theo bạn mô tả thì chưa rõ. Vì thế, bác sĩ khuyến khích bạn đến cơ sở y tế hoặc bệnh viện thuộc Hệ thống Y tế Vinmec để khám chuyên khoa Thần kinh. Nếu đau đầu thông thường thì cần nghỉ ngơi thư giãn sẽ đỡ, còn nếu có những yếu tố khác thì cần phải khám kỹ, xét nghiệm cận lâm sàng để chẩn đoán chính xác hơn và có hướng điều trị phù hợp.`
`Đặt lưu lượng khí hệ thống Jackson-Rees thấp hơn quy định khi sử dụng gây mê cho trẻ em sẽ gây hậu quả gì?`	Hệ thống Jackson – Rees dùng khi gây mê để tránh hít lại khí thở ra cần đặt lưu lượng khí mới gấp 2 – 2,5 lần thông khí phút của bệnh nhân. Nếu cài đặt thấp hơn mức này sẽ gây ra hiện tượng ưu thán hay còn gọi là thừa khí CO2 biểu hiện kích thích vã mồ hôi, tăng huyết áp, nguy hiểm hơn là bệnh nhân tím tái, trụy tim mạch, thậm chí là tử vong.Nếu còn thắc mắc, bạn có thể liên hệ hoặc đến trực tiếp một trong các bệnh viện thuộc Hệ thống Y tế Vinmec trên toàn quốc để được bác sĩ chuyên môn tư vấn cụ thể hơn.Cảm ơn bạn đã tin tưởng và đặt câu hỏi tới Vinmec. Trân trọng!

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
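
As a rough illustration of what these parameters mean: MultipleNegativesRankingLoss treats each question's own answer as the positive and every other answer in the batch as a negative, then applies cross-entropy over the cosine similarities multiplied by `scale`. The sketch below is a plain-Python approximation under that reading, not the library code; `mnr_loss` and the toy vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch of two (question, answer) embedding pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
answers_aligned = [[1.0, 0.0], [0.0, 1.0]]
answers_swapped = [[0.0, 1.0], [1.0, 0.0]]

print(mnr_loss(questions, answers_aligned))  # close to 0: each answer matches its question
print(mnr_loss(questions, answers_swapped))  # close to 20: each question prefers the wrong answer
```

A larger `scale` sharpens the softmax, so mismatched pairs are penalized more heavily.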

Evaluation Dataset

vi_health_qa

Dataset: vi_health_qa at d90a62d
Size: 993 evaluation samples
Columns: question and answer
Approximate statistics based on the first 993 samples:
	question	answer
type	string	string
details	min: 9 tokens mean: 30.03 tokens max: 267 tokens	min: 13 tokens mean: 133.86 tokens max: 1103 tokens

Samples:

question	answer
`Em nghe nói kích trứng nhiều lần sẽ làm rối loạn nội tiết và tăng khả năng ung thư buồng trứng có phải không? Vì em có dự trữ buồng trứng rất thấp, được chỉ định gom trứng nên nghe thông tin trên em rất lo lắng.`	`Theo thông tin chị cung cấp thì chưa đủ dữ liệu để kết luận là kích trứng gây rối loạn nội tiết hay ung thư, tuy nhiên những điều này vẫn có thể ảnh hưởng về sau.`
`Tại sao tỷ lệ dịch chuyển tinh trùng thấp?`	Nguyên nhân dẫn đến tỷ lệ dịch chuyển tinh trùng thấp là do tinh trùng tổn thương đuôi, tinh trùng không hoạt động, tinh trùng chết. Bạn nên thăm khám bác sĩ chuyên khoa và có phương pháp điều trị thích hợp, tránh ảnh hưởng đến khả năng sinh sản. Nếu được bạn nên giảm thức đêm. Thức đêm cũng là nguyên nhân ảnh hưởng đến chất lượng và số lượng tinh trùng. Ngoài ra, bạn cần ăn uống các loại thực phẩm tươi sạch bổ dưỡng, cần bỏ rượu, thuốc lá và các chất kích thích khác như rượu, cần sa, amphetamin... nếu hai vợ chồng đang cố gắng thụ thai. Ngoài ra, cần tập luyện thể dục đều đặn để nâng cao thể lực, duy trì cân nặng ở mức phù hợp, giảm cân nếu đang thừa cân và hạn chế tiếp xúc với điện thoại di động.
`Ngồi dậy hay nằm xuống đều bị chóng mặt có phải bị tổn thương dây thần kinh do phẫu thuật không?`	`Xương đòn khi mổ rất lâu liền xương, 3 tuần chưa thể có cal xương dù là cal non, Để đánh giá có tổn thương thần kinh hay không cần khám về lâm sàng, bạn có thể đến khám tại bệnh viện thuộc Hệ thống Y tế Vinmec trên toàn quốc để bác sĩ tư vấn rõ hơn và đưa ra hướng điều trị phù hợp nhất.`

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 6
per_device_eval_batch_size: 6
learning_rate: 3.0692519709098972e-06
num_train_epochs: 4
warmup_ratio: 0.04970511867965379
fp16: True
batch_sampler: no_duplicates
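
To see how these values interact, the sketch below derives the step counts and the learning-rate curve from the numbers reported in this card. It assumes a single device, no gradient accumulation, and that the sampler yields every batch; the resulting total of 4,676 steps matches the last row of the training log. The `linear_lr` function is an illustrative re-derivation of a linear warmup and decay schedule, not a call into the training library.

```python
import math

TRAIN_SAMPLES = 7009                 # size of the training split
BATCH_SIZE = 6                       # per_device_train_batch_size (single device assumed)
EPOCHS = 4                           # num_train_epochs
PEAK_LR = 3.0692519709098972e-06     # learning_rate
WARMUP_RATIO = 0.04970511867965379   # warmup_ratio

# The last partial batch is kept (dataloader_drop_last: False), hence the ceiling.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def linear_lr(step):
    """Linear warmup from 0 to PEAK_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    return PEAK_LR * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(steps_per_epoch, total_steps, warmup_steps)
print(linear_lr(0), linear_lr(warmup_steps), linear_lr(total_steps))
```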

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 6
per_device_eval_batch_size: 6
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 3.0692519709098972e-06
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.04970511867965379
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	Validation Loss	healthcare-dev_cosine_map@100
0	0	-	-	0.6555
0.0855	100	0.1744	0.1599	0.6672
0.1711	200	0.1618	0.1178	0.6927
0.2566	300	0.1219	0.0920	0.7032
0.3422	400	0.0778	0.0807	0.7083
0.4277	500	0.0993	0.0739	0.7106
0.5133	600	0.0821	0.0695	0.7149
0.5988	700	0.0632	0.0685	0.7125
0.6843	800	0.0653	0.0669	0.7129
0.7699	900	0.0962	0.0655	0.7185
0.8554	1000	0.0395	0.0648	0.7170
0.9410	1100	0.0784	0.0628	0.7154
1.0265	1200	0.0791	0.0627	0.7180
1.1121	1300	0.063	0.0618	0.7179
1.1976	1400	0.0811	0.0606	0.7163
1.2831	1500	0.0425	0.0610	0.7179
1.3687	1600	0.028	0.0603	0.7205
1.4542	1700	0.0761	0.0596	0.7202
1.5398	1800	0.0419	0.0591	0.7190
1.6253	1900	0.0394	0.0589	0.7214
1.7109	2000	0.0623	0.0593	0.7235
1.7964	2100	0.0683	0.0594	0.7214
1.8820	2200	0.0316	0.0590	0.7212
1.9675	2300	0.0681	0.0579	0.7246
2.0530	2400	0.0366	0.0579	0.7243
2.1386	2500	0.0315	0.0579	0.7247
2.2241	2600	0.0633	0.0578	0.7247
2.3097	2700	0.0278	0.0580	0.7247
2.3952	2800	0.029	0.0582	0.7236
2.4808	2900	0.0472	0.0577	0.7206
2.5663	3000	0.0307	0.0575	0.7208
2.6518	3100	0.0248	0.0574	0.7198
2.7374	3200	0.0504	0.0575	0.7195
2.8229	3300	0.0259	0.0574	0.7208
2.9085	3400	0.0288	0.0570	0.7214
2.9940	3500	0.0595	0.0566	0.7233
3.0796	3600	0.0372	0.0562	0.7212
3.1651	3700	0.0334	0.0563	0.7218
3.2506	3800	0.0384	0.0563	0.7210
3.3362	3900	0.0178	0.0564	0.7200
3.4217	4000	0.0313	0.0564	0.7201
3.5073	4100	0.0447	0.0562	0.7197
3.5928	4200	0.0281	0.0562	0.7199
3.6784	4300	0.02	0.0563	0.7199
3.7639	4400	0.0535	0.0562	0.7212
3.8494	4500	0.017	0.0562	0.7207
3.9350	4600	0.0353	0.0562	0.7208
4.0	4676	-	-	0.6887

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.1.0
Transformers: 4.44.2
PyTorch: 2.4.0+cu121
Accelerate: 0.34.2
Datasets: 3.0.0
Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model Files

Repository: nampham1106/gte-multilingual-base-finetuned
Format: Safetensors
Model size: 0.3B params
Tensor type: F32
