SentenceTransformer based on sentence-transformers/all-distilroberta-v1

This is a sentence-transformers model finetuned from sentence-transformers/all-distilroberta-v1 on the ai-job-embedding-finetuning dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Resulting model from example code for fine-tuning an embedding model for AI job search.

Links

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/all-distilroberta-v1
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- ai-job-embedding-finetuning

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("shawhin/distilroberta-ai-job-embeddings")
# Run inference
sentences = [
    'data integrity governance PowerBI development Juno Beach',
    'skills: 2-5 y of exp with data analysis/ data integrity/ data governance; PowerBI development; Python; SQL, SOQL\n\nLocation: Juno Beach, FL\nPLEASE SEND LOCAL CANDIDATES ONLY\n\nSeniority on the skill/s required on this requirement: Mid.\n\nEarliest Start Date: ASAP\n\nType: Temporary Project\n\nEstimated Duration: 12 months with possible extension(s)\n\nAdditional information: The candidate should be able to provide an ID if the interview is requested. The candidate interviewing must be the same individual who will be assigned to work with our client. \nRequirements:• Availability to work 100% at the Client’s site in Juno Beach, FL (required);• Experience in data analysis/ data integrity/ data governance;• Experience in analytical tools including PowerBI development, Python, coding, Excel, SQL, SOQL, Jira, and others.\n\nResponsibilities include but are not limited to the following:• Analyze data quickly using multiple tools and strategies including creating advanced algorithms;• Serve as a critical member of data integrity team within digital solutions group and supplies detailed analysis on key data elements that flow between systems to help design governance and master data management strategies and ensure data cleanliness.',
    "QualificationsAdvanced degree (MS with 5+ years of industry experience, or Ph.D.) in Computer Science, Data Science, Statistics, or a related field, with an emphasis on AI and machine learning.Proficiency in Python and deep learning libraries, notably PyTorch and Hugging Face, Lightning AI, evidenced by a history of deploying AI models.In-depth knowledge of the latest trends and techniques in AI, particularly in multivariate time-series prediction for financial applications.Exceptional communication skills, capable of effectively conveying complex technical ideas to diverse audiences.Self-motivated, with a collaborative and solution-oriented approach to problem-solving, comfortable working both independently and as part of a collaborative team.\n\nCompensationThis role is compensated with equity until the product expansion and securing of Series A investment. Cash-based compensation will be determined after the revenue generation has been started. As we grow, we'll introduce additional benefits, including performance bonuses, comprehensive health insurance, and professional development opportunities. \nWhy Join BoldPine?\nInfluence the direction of financial market forecasting, contributing to groundbreaking predictive models.Thrive in an innovative culture that values continuous improvement and professional growth, keeping you at the cutting edge of technology.Collaborate with a dedicated team, including another technical expert, setting new benchmarks in AI-driven financial forecasting in a diverse and inclusive environment.\nHow to Apply\nTo join a team that's redefining financial forecasting, submit your application, including a resume and a cover letter. At BoldPine, we're committed to creating a diverse and inclusive work environment and encouraging applications from all backgrounds. Join us, and play a part in our mission to transform financial predictions.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

Evaluation

Metrics

Triplet

Datasets: ai-job-validation and ai-job-test
Evaluated with TripletEvaluator

Metric	ai-job-validation	ai-job-test
cosine_accuracy	0.9901	1.0

Training Details

Training Dataset

ai-job-embedding-finetuning

Dataset: ai-job-embedding-finetuning at c86ac36
Size: 809 training samples
Columns: query, job_description_pos, and job_description_neg

Approximate statistics based on the first 809 samples:

	query	job_description_pos	job_description_neg
type	string	string	string
details	min: 8 tokens mean: 15.02 tokens max: 40 tokens	min: 7 tokens mean: 348.14 tokens max: 512 tokens	min: 7 tokens mean: 351.24 tokens max: 512 tokens

Samples:

query	job_description_pos	job_description_neg
`Data engineering Azure cloud Apache Spark Kafka`	Skills:Proven experience in data engineering and workflow development.Strong knowledge of Azure cloud services.Proficiency in Apache Spark and Apache Kafka.Excellent programming skills in Python/Java.Hands-on experience with Azure Synapse, DataBricks, and Azure Data Factory. Nice To Have Skills:Experience with BI Tools such as Tableau or Power BI.Familiarity with Terraform for infrastructure as code.Knowledge of Git Actions for CI/CD pipelines.Understanding of database design and architecting principles.Strong communication skills and ability to manage technical projects effectively.	requirements, and assist in data structure implementation planning for innovative data visualization, predictive modeling, and advanced analytics solutions.* Unfortunately, we cannot accommodate Visa Sponsorship for this role at this time. ESSENTIAL JOB FUNCTIONS Mine data covering a wide range of information from customer profile to transaction details to solve risk problems that involve classification, clustering, pattern analysis, sampling and simulations.Apply strong data science expertise and systems analysis methodology to help guide solution analysis, working closely with both business and technical teams, with consideration of both technical and non-technical implications and trade-offs.Carry out independent research and innovation in new content, ML, and technological domains. Trouble shooting any data, system and flow challenges while maintaining clearly defined strategy execution.Extract data from various data sources; perform exploratory data analysis, cleanse, transform, a...
`Databricks, Medallion architecture, ETL processes`	experience with Databricks, PySpark, SQL, Spark clusters, and Jupyter Notebooks.- Expertise in building data lakes using the Medallion architecture and working with delta tables in the delta file format.- Familiarity with CI/CD pipelines and Agile methodologies, ensuring efficient and collaborative development practices.- Strong understanding of ETL processes, data modeling, and data warehousing principles.- Experience with data visualization tools like Power BI is a plus.- Knowledge of cybersecurity data, particularly vulnerability scan data, is preferred.- Bachelor's or Master's degree in Computer Science, Information Systems, or a related field. requirements and deliver effective solutions aligned with Medallion architecture principles.- Ensure data quality and implement robust data governance standards, leveraging the scalability and efficiency offered by the Medallion architecture.- Design and implement ETL processes, including data cleansing, transformation, and integration, opti...	experience with a minimum of 0+ years of experience in a Computer Science or Data Management related fieldTrack record of implementing software engineering best practices for multiple use cases.Experience of automation of the entire machine learning model lifecycle.Experience with optimization of distributed training of machine learning models.Use of Kubernetes and implementation of machine learning tools in that context.Experience partnering and/or collaborating with teams that have different competences.The role holder will possess a blend of design skills needed for Agile data development projects.Proficiency or passion for learning, in data engineer techniques and testing methodologies and Postgraduate degree in data related field of study will also help. Desirable for the role Experience with DevOps or DataOps concepts, preferably hands-on experience implementing continuous integration or highly automated end-to-end environments.Interest in machine learning will also be advan...
`Gas Processing, AI Strategy Development, Plant Optimization`	experience in AI applications for the Hydrocarbon Processing & Control Industry, specifically, in the Gas Processing and Liquefaction business. Key ResponsibilitiesYou will be required to perform the following:- Lead the development and implementation of AI strategies & roadmaps for optimizing gas operations and business functions- Collaborate with cross-functional teams to identify AI use cases to transform gas operations and business functions (AI Mapping)- Design, develop, and implement AI models and algorithms that solve complex problems- Implement Gen AI use cases to enhance natural gas operations and optimize the Gas business functions- Design and implement AI-enabled plant optimizers for efficiency and reliability- Integrate AI models into existing systems and applications- Troubleshoot and resolve technical issues related to AI models and deployments- Ensure compliance with data privacy and security regulations- Stay up-to-date with the latest advancements in AI and machine lea...	QualificationsAbility to gather business requirements and translate them into technical solutionsProven experience in developing interactive dashboards and reports using Power BI (3 years minimum)Strong proficiency in SQL and PythonStrong knowledge of DAX (Data Analysis Expressions)Experience working with APIs inside of Power BIExperience with data modeling and data visualization best practicesKnowledge of data warehousing concepts and methodologiesExperience in data analysis and problem-solvingExcellent communication and collaboration skillsBachelor's degree in Computer Science, Information Systems, or a related fieldExperience with cloud platforms such as Azure or AWS is a plus HoursApproximately 15 - 20 hours per week for 3 months with the opportunity to extend the contract further

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Evaluation Dataset

ai-job-embedding-finetuning

Dataset: ai-job-embedding-finetuning at c86ac36
Size: 101 evaluation samples
Columns: query, job_description_pos, and job_description_neg

Approximate statistics based on the first 101 samples:

	query	job_description_pos	job_description_neg
type	string	string	string
details	min: 8 tokens mean: 15.39 tokens max: 29 tokens	min: 27 tokens mean: 381.37 tokens max: 512 tokens	min: 14 tokens mean: 320.86 tokens max: 512 tokens

Samples:

query	job_description_pos	job_description_neg
`Data Engineer Snowflake ETL big data processing`	experience. But most of all, we’re backed by a culture of respect. We embrace authenticity and inspire people to thrive. The CDE Data Engineer will join the Content Delivery Engineering team, within the Global Video Engineering organization at NBCUniversal. The CDE Data Engineer will be responsible for implementing and maintaining systems that ingest, process, and store vast amounts of data from internal systems and external partner systems. These data systems must be scalable, robust, and within budget. In this role, the CDE Data Engineer will work with a variety of technologies that support the building of meaningful models, alerts, reports, and visualizations from vast quantities of data. Responsibilities Include, But Are Not Limited To Development of data systems and pipelinesAssist in cleansing, discretization, imputation, selection, generalization etc. to create high quality features for the modeling processWork with business stakeholders to define business requirements includ...	Skills Looking For:- The project involves creating a unified data structure for Power BI reporting.- Candidate would work on data architecture and unifying data from various sources.- Data engineering expertise, including data modeling and possibly data architecture.- Proficiency in Python, SQL, and DAX.- Work with AWS data, and data storage.- Experience with cloud platforms like AWS is preferred.- Familiarity with Microsoft Power Automate and Microsoft Fabric is a plus.- Collaborating with users to understand reporting requirements for Power BI. Must be good at using Power BI tools (creating dashboards); excellent Excel skills.- Supply chain background preferred. Education and Level of Experience:- Bachelor's degree (quantitative learnings preferred- data analytics, statistics, computer science, math) with 3 to 5 years of experience.- Must have recent and relevant experience. Top 3 Skills:- Data engineering, including data modeling and data architecture.- Proficiency in Python, SQL, a...
`GenAI applications, NLP model development, MLOps pipelines`	experience building enterprise level GenAI applications, designed and developed MLOps pipelines . The ideal candidate should have deep understanding of the NLP field, hands on experience in design and development of NLP models and experience in building LLM-based applications. Excellent written and verbal communication skills with the ability to collaborate effectively with domain experts and IT leadership team is key to be successful in this role. We are looking for candidates with expertise in Python, Pyspark, Pytorch, Langchain, GCP, Web development, Docker, Kubeflow etc. Key requirements and transition plan for the next generation of AI/ML enablement technology, tools, and processes to enable Walmart to efficiently improve performance with scale. Tools/Skills (hands-on experience is must):• Ability to transform designs ground up and lead innovation in system design• Deep understanding of GenAI applications and NLP field• Hands on experience in the design and development of NLP mode...	skills, education, experience, and other qualifications. Featured Benefits: Medical Insurance in compliance with the ACA401(k)Sick leave in compliance with applicable state, federal, and local laws Description: Responsible for performing routine and ad-hoc analysis to identify actionable business insights, performance gaps and perform root cause analysis. The Data Analyst will perform in-depth research across a variety of data sources to determine current performance and identify trends and improvement opportunities. Collaborate with leadership and functional business owners as well as other personnel to understand friction points in data that cause unnecessary effort, and recommend gap closure initiatives to policy, process, and system. Qualification: Minimum of three (3) years of experience in data analytics, or working in a data analyst environment.Bachelor’s degree in Data Science, Statistics, Applied Math, Computer Science, Business, or related field of study from an accredited ...
`data engineering ETL cloud platforms data security`	experience. While operating within the Banks risk appetite, achieves results by consistently identifying, assessing, managing, monitoring, and reporting risks of all types. ESSENTIAL DUTIES AND SKILLS, AND ABILITIES REQUIRED: Bachelors degree in Computer Science/Information Systems or equivalent combination of education and experience. Must be able to communicate ideas both verbally and in writing to management, business and IT sponsors, and technical resources in language that is appropriate for each group. Fundamental understanding of distributed computing principles Knowledge of application and data security concepts, best practices, and common vulnerabilities. Conceptual understanding of one or more of the following disciplines preferred big data technologies and distributions, metadata management products, commercial ETL tools, Bi and reporting tools, messaging systems, data warehousing, Java (language and run time environment), major version control systems, continuous integra...	experience. We are looking for a highly energetic and collaborative Senior Data Engineer with experience leading enterprise data projects around Business and IT operations. The ideal candidate should be an expert in leading projects in developing and testing data pipelines, data analytics efforts, proactive issue identification and resolution and alerting mechanism using traditional, new and emerging technologies. Excellent written and verbal communication skills and ability to liaise with technologists to executives is key to be successful in this role. • Assembling large to complex sets of data that meet non-functional and functional business requirements• Identifying, designing and implementing internal process improvements including re-designing infrastructure for greater scalability, optimizing data delivery, and automating manual processes• Building required infrastructure for optimal extraction, transformation and loading of data from various data sources using GCP/Azure and S...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
learning_rate: 2e-05
num_train_epochs: 1
warmup_ratio: 0.1
batch_sampler: no_duplicates

All Hyperparameters

Click to expand

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	ai-job-validation_cosine_accuracy	ai-job-test_cosine_accuracy
0	0	0.8812	-
1.0	51	0.9901	1.0

Framework Versions

Python: 3.12.8
Sentence Transformers: 3.3.1
Transformers: 4.48.0
PyTorch: 2.5.1
Accelerate: 1.2.1
Datasets: 3.2.0
Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Downloads last month: 31

Safetensors

Model size

82.1M params

Tensor type

F32

Model tree for shawhin/distilroberta-ai-job-embeddings

Base model

sentence-transformers/all-distilroberta-v1

Finetuned

(43)

this model

Dataset used to train shawhin/distilroberta-ai-job-embeddings

Papers for shawhin/distilroberta-ai-job-embeddings

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Paper • 1908.10084 • Published Aug 27, 2019 • 10

Efficient Natural Language Response Suggestion for Smart Reply

Paper • 1705.00652 • Published May 1, 2017

Evaluation results

Cosine Accuracy on ai job validation
self-reported

0.990
Cosine Accuracy on ai job test
self-reported

1.000