Stella EN 400M v5 Matryoshka Finetune

This is a sentence-transformers model finetuned from NovaSearch/stella_en_400M_v5 on the json dataset. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: NovaSearch/stella_en_400M_v5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
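The Pooling module above is configured for mean pooling (`pooling_mode_mean_tokens: True`): token embeddings are averaged, ignoring padding positions. A minimal numpy sketch of that pooling step, using toy tensors rather than real model activations:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over non-padding positions.

    token_embeddings: (seq_len, dim); attention_mask: (seq_len,) of 0/1.
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)                 # (dim,)
    count = max(float(mask.sum()), 1e-9)                           # avoid div by zero
    return summed / count

# Toy example: 4 tokens (the last row is padding and must be ignored), dim 3
tokens = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [2.0, 2.0, 2.0],
                   [9.0, 9.0, 9.0]])  # padding row
mask = np.array([1, 1, 1, 0])
print(mean_pool(tokens, mask))  # → [2. 2. 2.]
```

In the real model this pooled vector then passes through the Dense layer (1024 → 1024) shown above.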

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cristiano-sartori/stella_finetuned")
# Run inference
sentences = [
    # The card's original example texts are full exam questions and answers,
    # several thousand characters long; they are abridged here so the snippet
    # stays readable and remains valid Python.
    "Suppose we use the Simplex method to solve the following linear program: "
    "maximize 2x_1 - x_2 subject to x_1 - x_2 + s_1 = 1, x_1 + s_2 = 4, "
    "x_2 + s_3 = 2, with x_1, x_2, s_1, s_2, s_3 >= 0. Write the tableau "
    "obtained by executing one iteration (pivot) of the Simplex method.",
    "To execute one iteration (pivot) of the Simplex method from the given "
    "tableau, we first identify the entering variable (the one with the "
    "largest positive coefficient in the z-row), then apply the minimum "
    "ratio test to choose the leaving variable, and finally pivot.",
    "Let's reevaluate the proof that w(S_l) = max over independent sets T of "
    "size l of w(T), for all l = 1, ..., k, i.e. that the greedy algorithm "
    "on a matroid returns a maximum-weight independent set of every size.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
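Because the model was trained with MatryoshkaLoss, embeddings can be truncated to a leading prefix (e.g. 256 of the 1024 dimensions) and re-normalized, with only a modest drop in retrieval quality. A sketch of that truncation, using random vectors in place of real `model.encode` outputs so it runs stand-alone:

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and re-normalize rows to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 1024))          # stand-in for model.encode(sentences)
small = truncate_and_normalize(full, 256)  # 256 is one of the trained Matryoshka dims
print(small.shape)                         # → (3, 256)

# For unit vectors, cosine similarity reduces to a dot product:
sims = small @ small.T
print(np.diag(sims))                       # each entry ≈ 1.0 (self-similarity)
```

Recent versions of Sentence Transformers also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which performs this truncation internally.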

Evaluation

Metrics

Information Retrieval (768 dimensions)

Metric Value
cosine_accuracy@1 0.2772
cosine_accuracy@3 0.8807
cosine_accuracy@5 0.9298
cosine_accuracy@10 0.9825
cosine_precision@1 0.2772
cosine_precision@3 0.2936
cosine_precision@5 0.186
cosine_precision@10 0.0982
cosine_recall@1 0.2772
cosine_recall@3 0.8807
cosine_recall@5 0.9298
cosine_recall@10 0.9825
cosine_ndcg@10 0.6591
cosine_mrr@10 0.5512
cosine_map@100 0.5524
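In the tables here each query has exactly one relevant document, which is why recall@k equals accuracy@k and precision@k is accuracy@k divided by k (e.g. 0.8807 / 3 ≈ 0.2936 above). A minimal sketch of those relationships for toy rankings (a simplification: the library's evaluator also applies a rank cutoff to MRR@10 and discounting for NDCG, omitted here):

```python
def ir_metrics_single_relevant(ranks: list[int], k: int):
    """IR metrics when each query has exactly one relevant document.

    ranks: 1-based rank of the relevant document for each query.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r <= k)
    accuracy_at_k = hits / n            # fraction of queries with the doc in top-k
    recall_at_k = accuracy_at_k         # only 1 relevant doc, so identical
    precision_at_k = hits / (n * k)     # k retrieved slots, at most 1 hit each
    mrr = sum(1.0 / r for r in ranks) / n  # mean reciprocal rank (no cutoff)
    return accuracy_at_k, recall_at_k, precision_at_k, mrr

# Relevant doc ranked 1st, 3rd, and 7th for three queries:
acc, rec, prec, mrr = ir_metrics_single_relevant([1, 3, 7], k=3)
print(round(acc, 4), round(prec, 4), round(mrr, 4))  # → 0.6667 0.2222 0.4921
```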

Information Retrieval (512 dimensions)

Metric Value
cosine_accuracy@1 0.2807
cosine_accuracy@3 0.8772
cosine_accuracy@5 0.9298
cosine_accuracy@10 0.9825
cosine_precision@1 0.2807
cosine_precision@3 0.2924
cosine_precision@5 0.186
cosine_precision@10 0.0982
cosine_recall@1 0.2807
cosine_recall@3 0.8772
cosine_recall@5 0.9298
cosine_recall@10 0.9825
cosine_ndcg@10 0.659
cosine_mrr@10 0.5511
cosine_map@100 0.5523

Information Retrieval (256 dimensions)

Metric Value
cosine_accuracy@1 0.2772
cosine_accuracy@3 0.8737
cosine_accuracy@5 0.9368
cosine_accuracy@10 0.9895
cosine_precision@1 0.2772
cosine_precision@3 0.2912
cosine_precision@5 0.1874
cosine_precision@10 0.0989
cosine_recall@1 0.2772
cosine_recall@3 0.8737
cosine_recall@5 0.9368
cosine_recall@10 0.9895
cosine_ndcg@10 0.6595
cosine_mrr@10 0.5497
cosine_map@100 0.5504

Information Retrieval (128 dimensions)

Metric Value
cosine_accuracy@1 0.2947
cosine_accuracy@3 0.8702
cosine_accuracy@5 0.9333
cosine_accuracy@10 0.986
cosine_precision@1 0.2947
cosine_precision@3 0.2901
cosine_precision@5 0.1867
cosine_precision@10 0.0986
cosine_recall@1 0.2947
cosine_recall@3 0.8702
cosine_recall@5 0.9333
cosine_recall@10 0.986
cosine_ndcg@10 0.6667
cosine_mrr@10 0.5604
cosine_map@100 0.5613

Information Retrieval (64 dimensions)

Metric Value
cosine_accuracy@1 0.2702
cosine_accuracy@3 0.8456
cosine_accuracy@5 0.9333
cosine_accuracy@10 0.9789
cosine_precision@1 0.2702
cosine_precision@3 0.2819
cosine_precision@5 0.1867
cosine_precision@10 0.0979
cosine_recall@1 0.2702
cosine_recall@3 0.8456
cosine_recall@5 0.9333
cosine_recall@10 0.9789
cosine_ndcg@10 0.6493
cosine_mrr@10 0.5396
cosine_map@100 0.5411

Training Details

Training Dataset

json

  • Dataset: json
  • Size: 1,140 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 5, mean 169.58, max 512 tokens
    • positive: string; min 3, mean 375.1, max 512 tokens
  • Samples:
    anchor positive
    anchor: In the following let $\kappa_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)$ and $\kappa_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)$ be two valid kernels. Show that the following is also a valid kernel: $\kappa\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\kappa_{1}\left(\mathbf{x}, \mathbf{x}^{\prime}\right) \kappa_{2}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)$.
    positive: To show that the product of two kernels, (\kappa(\mathbf{x}, \mathbf{x}^{\prime}) = \kappa_1(\mathbf{x}, \mathbf{x}^{\prime}) \kappa_2(\mathbf{x}, \mathbf{x}^{\prime})), is also a valid kernel, we can use the property of positive semi-definiteness.

    ### Step 1: Understanding Kernels
    A function (\kappa(\mathbf{x}, \mathbf{x}^{\prime})) is a kernel if it is symmetric (i.e., (\kappa(\mathbf{x}, \mathbf{x}^{\prime}) = \kappa(\mathbf{x}^{\prime}, \mathbf{x}))) and positive semi-definite (PSD). A function is PSD if for any finite set of points ({\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n}) and any real coefficients (c_1, c_2, \ldots, c_n), the following holds:
    [
    \sum_{i=1}^n \sum_{j=1}^n c_i c_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \geq 0.
    ]

    ### Step 2: Properties of (\kappa_1) and (\kappa_2)
    Given that (\kappa_1) and (\kappa_2) are kernels, we know:
    1. They are symmetric: (\kappa_1(\mathbf{x}, \mathbf{x}^{\prime}) = \kappa_1(\mathbf{x}^{\prime}, \mathbf{x})) and...
    anchor: The first annotator rated {a} reviews as positive and the rest as negative. The second annotator rated {b} reviews as positive and the rest as negative. 80 reviews were rated as positive by both annotators. Compute the quality of the above reference using Cohen's Kappa. Give your answer as a numerical value to three decimal places.
    positive: To compute Cohen's Kappa, we need to understand the components required for the calculation. Cohen's Kappa (( \kappa )) is a statistic that measures inter-rater agreement for categorical items. The formula for Cohen's Kappa is:

    [
    \kappa = \frac{P_o - P_e}{1 - P_e}
    ]

    Where:
    - ( P_o ) is the observed agreement (the proportion of times both annotators agreed).
    - ( P_e ) is the expected agreement by chance.

    Let's break down the problem step by step:

    1. Understanding the Data:
    - Let ( N ) be the total number of reviews.
    - The first annotator rated ( a ) reviews as positive and ( N - a ) as negative.
    - The second annotator rated ( b ) reviews as positive and ( N - b ) as negative.
    - We know that both annotators agreed on 80 reviews being positive.

    2. Observed Agreement ( P_o ):
    - The total number of reviews is ( N ).
    - The observed agreement ( P_o ) is the proportion of reviews where both annotators agreed on the rating.
    - Therefore...
    anchor: Consider the (toy) grammar $G$ consisting of the following rules:
    R1: S --> NP VP
    R2: NP --> NN
    R3: NP --> Det NN
    R4: NN --> N
    R5: NN --> NN NN
    R6: NN --> NN PNP
    R7: PNP --> Prep NP
    R8: VP --> V
    R9: VP --> Adv V
    Precisely define the type of grammar G is corresponding to (for that, consider at least the following aspects: dependency-based vs. constituency-based, position in the Chomsky hierarchy, and CNF).
    Justify your answer for each of the aspects you will be mentioning.
    positive: To analyze the grammar ( G ) you've provided, we need to consider several aspects: whether it is dependency-based or constituency-based, its position in the Chomsky hierarchy, and whether it is in Chomsky Normal Form (CNF).

    ### 1. Dependency-based vs. Constituency-based

    Definition:
    - Dependency-based grammar focuses on the relationships between words in a sentence, where a word is dependent on another word (its head).
    - Constituency-based grammar organizes sentences into nested structures (constituents) that can be replaced with single units.

    Analysis:
    The grammar ( G ) is constituency-based because it uses hierarchical structures (Non-terminal symbols like ( S, NP, VP, NN )) to define the grammatical relationships. The rules show that phrases (like Noun Phrases and Verb Phrases) are made up of smaller constituents. For instance, ( S ) is defined as consisting of a noun phrase ( NP ) and a verb phrase ( VP ). This clearly indicates a constituency str...
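One of the training samples above walks through Cohen's Kappa. A minimal sketch of that computation; since {a} and {b} are placeholders in the sample, the concrete numbers below are made up purely for illustration:

```python
def cohens_kappa(n: int, a: int, b: int, both_pos: int) -> float:
    """Cohen's Kappa for two annotators on a binary (pos/neg) labeling task.

    n: total reviews; a, b: positives per annotator; both_pos: agreed positives.
    """
    both_neg = n - a - b + both_pos          # agreed negatives (inclusion-exclusion)
    p_o = (both_pos + both_neg) / n          # observed agreement
    p_e = (a / n) * (b / n) + ((n - a) / n) * ((n - b) / n)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Hypothetical numbers: 1000 reviews, annotators marked 100 and 120 positive,
# and 80 reviews were rated positive by both:
print(round(cohens_kappa(1000, 100, 120, 80), 3))  # → 0.694
```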
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
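MatryoshkaLoss as configured above computes MultipleNegativesRankingLoss on truncated prefixes of the embedding (768, 512, 256, 128, 64 dims) and combines them with equal weights. A numpy sketch of that forward pass on toy vectors (assumptions: a similarity scale of 20, and the per-dimension losses are normalized by total weight here, whereas the library simply sums the weighted terms):

```python
import numpy as np

def mnrl(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """Multiple-negatives ranking loss: in-batch cross-entropy, true pairs on the diagonal."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                      # (n, n) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # diagonal entries are the true pairs

def matryoshka_loss(anchors, positives, dims=(768, 512, 256, 128, 64), weights=None):
    """Weighted MNRL over truncated embedding prefixes (equal weights by default)."""
    weights = weights or [1.0] * len(dims)
    losses = [w * mnrl(anchors[:, :d], positives[:, :d]) for d, w in zip(dims, weights)]
    return sum(losses) / sum(weights)

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 1024))
positives = anchors + 0.1 * rng.normal(size=(4, 1024))  # near-duplicates of the anchors
random_pos = rng.normal(size=(4, 1024))                 # unrelated vectors
print(matryoshka_loss(anchors, positives) < matryoshka_loss(anchors, random_pos))
# → True: matched pairs incur far lower loss than random pairs
```

Training every prefix this way is what lets the final embeddings be truncated at inference time with little quality loss.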
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: False
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 16
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: False
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss dim_768_cosine_ndcg@10 dim_512_cosine_ndcg@10 dim_256_cosine_ndcg@10 dim_128_cosine_ndcg@10 dim_64_cosine_ndcg@10
0.2807 10 0.1249 - - - - -
0.5614 20 0.8091 - - - - -
0.8421 30 0.0235 - - - - -
1.0 36 - 0.6463 0.6516 0.6468 0.6425 0.6363
1.1123 40 0.0293 - - - - -
1.3930 50 0.0474 - - - - -
1.6737 60 0.0062 - - - - -
1.9544 70 0.0022 - - - - -
2.0 72 - 0.6535 0.6501 0.6488 0.6488 0.6454
2.2246 80 0.9144 - - - - -
2.5053 90 0.0139 - - - - -
2.7860 100 0.0019 - - - - -
3.0 108 - 0.6533 0.6582 0.6523 0.6651 0.6499
3.0561 110 0.3805 - - - - -
3.3368 120 0.0075 - - - - -
3.6175 130 0.0035 - - - - -
3.8982 140 0.0012 - - - - -
4.0 144 - 0.6591 0.659 0.6595 0.6667 0.6493
  • The saved checkpoint is the best model found during training (load_best_model_at_end: True).

Framework Versions

  • Python: 3.12.8
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.7.0+cu126
  • Accelerate: 1.3.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}