---
license: apache-2.0
language:
- en
base_model:
- cisco-ai/SecureBERT2.0-base
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- securebert
- IR
- docembedding
- generated_from_trainer
- dataset_size:35705
- loss:MultipleNegativesRankingLoss
widget:
- source_sentence: >-
    What is the primary responsibility of the Information Security Oversight
    Committee in an organization?
  sentences:
  - Least privilege
  - By searching for repeating ciphertext sequences at fixed displacements.
  - >-
    Ensuring and supporting information protection awareness and training
    programs
---

# Model Card for cisco-ai/SecureBERT2.0-biencoder

The **SecureBERT 2.0 Bi-Encoder** is a cybersecurity-domain sentence-similarity and document-embedding model fine-tuned from [SecureBERT 2.0](https://huggingface.co/cisco-ai/SecureBERT2.0-base). It encodes queries and documents independently into a shared vector space for **semantic search**, **information retrieval**, and **cybersecurity knowledge retrieval**.

---

## Model Details

### Model Description

- **Developed by:** Cisco AI
- **Model type:** Bi-Encoder (Sentence Transformer)
- **Architecture:** ModernBERT backbone with dual encoders
- **Max sequence length:** 1024 tokens
- **Output dimension:** 768
- **Language:** English
- **License:** Apache-2.0
- **Finetuned from:** [cisco-ai/SecureBERT2.0-base](https://huggingface.co/cisco-ai/SecureBERT2.0-base)

---

## Uses

### Direct Use

- **Semantic search** and **document similarity** in cybersecurity corpora
- **Information retrieval** and **ranking** for threat intelligence reports, advisories, and vulnerability notes
- **Document embedding** for retrieval-augmented generation (RAG) and clustering

### Downstream Use

- Threat intelligence knowledge graph construction
- Cybersecurity QA and reasoning systems
- Security operations center (SOC) data mining

### Out-of-Scope Use

- Non-technical or general-domain text similarity
- Generative or conversational tasks

---

## Model Architecture

The Bi-Encoder encodes queries and documents **independently** into a joint vector space. Because document embeddings can be precomputed and indexed offline, this architecture enables scalable **approximate nearest-neighbor search** for candidate retrieval and semantic ranking (see the retrieval sketch after the dataset descriptions below).

---

## Datasets

### Fine-Tuning Datasets

| Dataset Category | Number of Records |
|:-----------------|:-----------------:|
| Cybersecurity QA corpus | 43,000 |
| Security governance QA corpus | 60,000 |
| Cybersecurity instruction–response corpus | 25,000 |
| Cybersecurity rules corpus (evaluation) | 5,000 |

#### Dataset Descriptions

- **Cybersecurity QA corpus:** 43k question–answer pairs, reports, and technical documents covering network security, malware analysis, cryptography, and cloud security.
- **Security governance QA corpus:** 60k expert-curated governance and compliance QA pairs emphasizing clear, validated responses.
- **Cybersecurity instruction–response corpus:** 25k instructional pairs enabling reasoning and instruction following.
- **Cybersecurity rules corpus:** 5k structured policy and guideline records used for evaluation.
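---

## Retrieval Example

The following is a minimal sketch of the candidate-retrieval pattern described under Model Architecture: documents are encoded once, queries are encoded independently at search time, and nearest neighbors are returned by similarity. The corpus, query, and `top_k` value here are illustrative assumptions; `util.semantic_search` performs exact cosine search, and an ANN index (e.g., FAISS) would typically replace it at corpus scale.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

# Illustrative corpus; in practice these would be advisories, reports, etc.
corpus = [
    "Amcache analysis provides forensic artifacts for detecting fileless malware.",
    "Least privilege limits each account to the permissions it strictly needs.",
    "Kasiski examination finds repeating ciphertext sequences at fixed displacements.",
]

# Documents are encoded once and can be indexed offline.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Queries are encoded independently at search time.
query_embedding = model.encode(
    "How can fileless malware be detected forensically?", convert_to_tensor=True
)

# Exact cosine-similarity search over the precomputed document embeddings.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```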
---

## How to Get Started with the Model

### Using Sentence Transformers

```bash
pip install -U sentence-transformers
```

### Run Model to Encode

```python
from sentence_transformers import SentenceTransformer

# Load the bi-encoder from the Hugging Face Hub.
model = SentenceTransformer("cisco-ai/SecureBERT2.0-biencoder")

sentences = [
    "How would you use Amcache analysis to detect fileless malware?",
    "Amcache analysis provides forensic artifacts for detecting fileless malware ...",
    "To capture and display network traffic",
]

# Encode all sentences into 768-dimensional embeddings.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```

### Compute Similarity

```python
from sentence_transformers import util

# Pairwise cosine similarity between all embeddings.
similarity = util.cos_sim(embeddings, embeddings)
print(similarity)
```

---

## Framework Versions

* python: 3.10.10
* sentence_transformers: 5.0.0
* transformers: 4.52.4
* PyTorch: 2.7.0+cu128
* accelerate: 1.9.0
* datasets: 3.6.0

---

## Training Details

### Training Dataset

The model was fine-tuned on cybersecurity-specific paired-sentence data for document embedding and similarity learning.

- **Dataset Size:** 35,705 samples
- **Columns:** `sentence_0`, `sentence_1`, `label`

#### Example Schema

| Field | Type | Description |
|:------|:------|:------------|
| sentence_0 | string | Query or short text input |
| sentence_1 | string | Candidate or document text |
| label | float | Similarity score (1.0 = relevant) |

#### Example Samples

| sentence_0 | sentence_1 | label |
|:------------|:-----------|:------:|
| *Under what circumstances does attribution bias distort intrusion linking?* | *Attribution bias in intrusion linking occurs when analysts allow preconceived notions, organizational pressures, or cognitive shortcuts to influence their assessment of attack origins and relationships between incidents...* | 1.0 |
| *How can you identify store buffer bypass speculation artifacts?* | *Store buffer bypass speculation artifacts represent side-channel vulnerabilities that exploit speculative execution to leak sensitive information...* | 1.0 |

---

### Training Objective and Loss

The model was optimized with contrastive learning to maximize the semantic similarity of relevant cybersecurity text pairs; with this loss, the other documents in each batch serve as in-batch negatives (a hedged training sketch appears at the end of this card).

- **Loss Function:** [MultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss)

#### Loss Parameters

```json
{
  "scale": 20.0,
  "similarity_fct": "cos_sim"
}
```

## Reference

```bibtex
@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}
```

---

## Model Card Authors

Cisco AI

## Model Card Contact

For inquiries, please contact [ai-threat-intel@cisco.com](mailto:ai-threat-intel@cisco.com)
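---

## Training Sketch

This is a minimal, illustrative sketch of contrastive fine-tuning with `MultipleNegativesRankingLoss` in `sentence-transformers`, matching the `scale` value listed under Loss Parameters. The example pairs and default training arguments are assumptions for illustration, not the actual fine-tuning recipe; the real training set has 35,705 rows.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("cisco-ai/SecureBERT2.0-base")

# Illustrative (query, relevant document) pairs; column order, not column
# names, determines the (anchor, positive) roles for this loss.
train_dataset = Dataset.from_dict({
    "sentence_0": [
        "Under what circumstances does attribution bias distort intrusion linking?",
        "How can you identify store buffer bypass speculation artifacts?",
    ],
    "sentence_1": [
        "Attribution bias in intrusion linking occurs when analysts ...",
        "Store buffer bypass speculation artifacts represent side-channel ...",
    ],
})

# In-batch negatives: for each query, every other document in the batch is
# treated as a negative; scale=20.0 matches the loss parameters above.
loss = MultipleNegativesRankingLoss(model, scale=20.0)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```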