Matryoshka Representation Learning
Paper
•
2205.13147
•
Published
•
25
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("acpotts/finetuned_arctic")
# Run inference
sentences = [
"What percentage of racy results did Google cut for searches like 'Latina teenager' in March 2022?",
"2022. https://www.reuters.com/technology/google-cuts-racy-results-by-30-searches-like-latina\xad\nteenager-2022-03-30/\n40. Safiya Umoja Noble. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.\nFeb. 2018. https://nyupress.org/9781479837243/algorithms-of-oppression/\n41. Paresh Dave. Google cuts racy results by 30% for searches like 'Latina teenager'. Reuters. Mar. 30,\n2022. https://www.reuters.com/technology/google-cuts-racy-results-by-30-searches-like-latina\xad\nteenager-2022-03-30/\n42. Miranda Bogen. All the Ways Hiring Algorithms Can Introduce Bias. Harvard Business Review. May\n6, 2019. https://hbr.org/2019/05/all-the-ways-hiring-algorithms-can-introduce-bias",
"they've used drugs, or whether they've expressed interest in LGBTQI+ groups, and then use that data to \nforecast student success.76 Parents and education experts have expressed concern about collection of such\nsensitive data without express parental consent, the lack of transparency in how such data is being used, and\nthe potential for resulting discriminatory impacts.\n• Many employers transfer employee data to third party job verification services. This information is then used\nby potential future employers, banks, or landlords. In one case, a former employee alleged that a\ncompany supplied false data about her job title which resulted in a job offer being revoked.77\n37",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
InformationRetrievalEvaluator| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.815 |
| cosine_accuracy@3 | 0.935 |
| cosine_accuracy@5 | 0.95 |
| cosine_accuracy@10 | 0.965 |
| cosine_precision@1 | 0.815 |
| cosine_precision@3 | 0.3117 |
| cosine_precision@5 | 0.19 |
| cosine_precision@10 | 0.0965 |
| cosine_recall@1 | 0.815 |
| cosine_recall@3 | 0.935 |
| cosine_recall@5 | 0.95 |
| cosine_recall@10 | 0.965 |
| cosine_ndcg@10 | 0.8954 |
| cosine_mrr@10 | 0.8723 |
| cosine_map@100 | 0.8742 |
| dot_accuracy@1 | 0.815 |
| dot_accuracy@3 | 0.935 |
| dot_accuracy@5 | 0.95 |
| dot_accuracy@10 | 0.965 |
| dot_precision@1 | 0.815 |
| dot_precision@3 | 0.3117 |
| dot_precision@5 | 0.19 |
| dot_precision@10 | 0.0965 |
| dot_recall@1 | 0.815 |
| dot_recall@3 | 0.935 |
| dot_recall@5 | 0.95 |
| dot_recall@10 | 0.965 |
| dot_ndcg@10 | 0.8954 |
| dot_mrr@10 | 0.8723 |
| dot_map@100 | 0.8742 |
sentence_0 and sentence_1| sentence_0 | sentence_1 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence_0 | sentence_1 |
|---|---|
What are some of the principles proposed for the ethical use of AI and automated systems? |
lems with legislation, and some courts extending longstanding statutory protections to new and emerging tech |
How are companies and researchers addressing the challenges posed by new and emerging technologies in relation to legislation? |
lems with legislation, and some courts extending longstanding statutory protections to new and emerging tech |
What is the purpose of reporting summary information about automated systems in plain language? |
any operators or others who need to understand the system, and calibrated to the level of risk based on the |
MatryoshkaLoss with these parameters:{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
eval_strategy: stepsper_device_train_batch_size: 20per_device_eval_batch_size: 20num_train_epochs: 5multi_dataset_batch_sampler: round_robinoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: stepsprediction_loss_only: Trueper_device_train_batch_size: 20per_device_eval_batch_size: 20per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1num_train_epochs: 5max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.0warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Falsehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseeval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseeval_use_gather_object: Falsebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robin| Epoch | Step | cosine_map@100 |
|---|---|---|
| 1.0 | 40 | 0.8676 |
| 1.25 | 50 | 0.8670 |
| 2.0 | 80 | 0.8731 |
| 2.5 | 100 | 0.8722 |
| 1.0 | 40 | 0.8641 |
| 1.25 | 50 | 0.8654 |
| 2.0 | 80 | 0.8674 |
| 2.5 | 100 | 0.8706 |
| 3.0 | 120 | 0.8659 |
| 3.75 | 150 | 0.8697 |
| 4.0 | 160 | 0.8706 |
| 5.0 | 200 | 0.8742 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-m