---
license: apache-2.0
language:
- zh
- en
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
pipeline_tag: feature-extraction
library_name: transformers
tags:
- code
- sentence-transformers
---
<div align="center" style="display: flex; justify-content: center; align-items: center; gap: 20px;">
<a href="https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main/" style="display: flex; align-items: center; text-decoration: none; color: inherit;">
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width="30" height="30" style="vertical-align: middle; margin-right: 8px;">
<span style="font-size: 1.5em; font-weight: bold;">CodeFuse-Embeddings</span>
</a>
</div>
|
|
|
|
# C2LLM: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
|
|
[Paper](https://huggingface.co/papers/2512.21332) | [Code](https://github.com/codefuse-ai/CodeFuse-Embeddings)
|
|
**C2LLMs (Code Contrastive Large Language Models)** are code embedding models designed to capture the deep semantics of source code for retrieval and semantic search.
|
|
## Key Features
|
|
- **Powerful Base Model**: Built upon the state-of-the-art `Qwen2.5-Coder`, inheriting its exceptional code comprehension capabilities.
- **Intelligent Pooling with PMA**: Instead of traditional `mean pooling` or `last token pooling`, C2LLM uses **PMA (Pooling by Multi-head Attention)**, which lets the model dynamically focus on the most informative parts of the code and produce a more robust embedding (see the sketch below this list).
- **Trained for Retrieval**: C2LLM was fine-tuned on **3 million query-document pairs**, optimizing it for real-world code retrieval and semantic search. It supports Text2Code, Code2Code, and Code2Text tasks.
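As a rough illustration of the idea behind attention-based pooling, the sketch below pools token states with a single learnable query via multi-head attention. It is a generic, hypothetical module for intuition only; C2LLM's actual PMA layer is implemented in the repository's remote code.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Generic sketch of PMA-style pooling (hypothetical, not C2LLM's actual module):
    a learnable query attends over all token states, so the pooled embedding can
    weight informative tokens more heavily than mean or last-token pooling."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        # A single learnable "seed" query that summarizes the whole sequence.
        self.query = nn.Parameter(torch.randn(1, 1, hidden_size))
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, token_states: torch.Tensor, padding_mask: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden); padding_mask: (batch, seq_len), True = padding
        query = self.query.expand(token_states.size(0), -1, -1)
        pooled, _ = self.attn(query, token_states, token_states,
                              key_padding_mask=padding_mask)
        return pooled.squeeze(1)  # (batch, hidden)
```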
|
|
C2LLM is designed to be a go-to model for tasks like code search and Retrieval-Augmented Generation (RAG). For more details, please see our [GitHub repository](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main).
|
|
# Model Details

- **Base model**: [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
- **Pooling**: PMA (Pooling by Multi-head Attention)
- **Training data**: 3 million query-document pairs for contrastive fine-tuning
- **Supported tasks**: Text2Code, Code2Code, and Code2Text retrieval
- **Languages**: Chinese and English
|
|
# How to use
|
|
## Usage (**HuggingFace Transformers**)
|
|
```python
from transformers import AutoModel
import torch

model_path = "codefuse-ai/C2LLM-7B"

# Load the model (the embedding head is provided by the repository's remote code)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)

# Prepare your custom instruction (a task-specific prefix prepended to each input)
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
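Depending on the repository's remote code, `encode` may return a NumPy array or a tensor. Assuming a shape of `(num_sentences, hidden_dim)`, pairwise cosine similarities between the snippets can be computed as in this minimal sketch (not part of the official API):

```python
import torch
import torch.nn.functional as F

# Convert to a float tensor in case encode() returned a NumPy array.
emb = torch.as_tensor(embeddings, dtype=torch.float32)

# L2-normalize, then the dot product gives pairwise cosine similarity.
emb = F.normalize(emb, p=2, dim=1)
similarity = emb @ emb.T
print(similarity)
```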
|
|
## Usage (**Sentence-Transformers**)
|
|
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("codefuse-ai/C2LLM-7B", trust_remote_code=True, tokenizer_kwargs={"padding_side": "left"})

# Prepare your custom instruction
instruction = "xxxxx"

# Prepare the data
sentences = ['''int r = (int) params >> 8 & 0xff;
int p = (int) params & 0xff;

byte[] derived1 = SCrypt.scrypt(passwd.getBytes("UTF-8"), salt, N, r, p, 32);

if (derived0.length != derived1.length) return false;

int result = 0;
for (int i = 0; i < derived0.length; i++) {
result |= derived0[i] ^ derived1[i];
}
return result == 0;
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("JVM doesn't support UTF-8?");
} catch (GeneralSecurityException e) {
throw new IllegalStateException("JVM doesn't support SHA1PRNG or HMAC_SHA256?");
}
}''',
'''
}
if (tempFrom > tempTo) {
return new RangeInfo(inclusive ? tempTo : tempTo + 1, tempFrom + 1, true);
}
return new RangeInfo(tempFrom, inclusive ? tempTo + 1 : tempTo, false);
}''']

sentences = [instruction + sentence for sentence in sentences]

# Get the embeddings
embeddings = model.encode(sentences)
```
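Continuing from the snippet above, a Text2Code query can be embedded with the same pipeline and ranked against the code embeddings. The query text and instruction prefix here are only illustrative (the exact instruction format for each task is documented in the GitHub repository); `util.cos_sim` from sentence-transformers computes the cosine scores:

```python
from sentence_transformers import util

# Hypothetical Text2Code query; reuse the instruction prefix from above.
query = instruction + "Compare two scrypt-derived keys in constant time"
query_embedding = model.encode([query])

# Rank the previously encoded code snippets by cosine similarity.
scores = util.cos_sim(query_embedding, embeddings)  # shape: (1, num_snippets)
best = int(scores.argmax())
print(f"Best match: snippet {best} (score={float(scores[0, best]):.4f})")
```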
|
|
## Evaluation (**MTEB**)
|
|
```python
import mteb
from mteb.cache import ResultCache

model_name = "codefuse-ai/C2LLM-7B"

# Load the model (falls back to SentenceTransformer(model_name) if the model is not registered in MTEB)
model = mteb.get_model(model_name)

# Select the code-retrieval tasks
tasks = mteb.get_tasks(tasks=[
    "AppsRetrieval", "CodeSearchNetCCRetrieval", "CodeEditSearchRetrieval",
    "CodeSearchNetRetrieval", "CodeFeedbackMT", "CodeFeedbackST",
    "CodeTransOceanContest", "CodeTransOceanDL", "COIRCodeSearchNetRetrieval",
    "CosQA", "StackOverflowQA", "SyntheticText2SQL",
])

# Cache the results on disk
cache = ResultCache("./c2llm_results")

# Evaluate
results = mteb.evaluate(model, tasks=tasks, cache=cache, encode_kwargs={"batch_size": 16})
```
|
|
## Support Us
|
|
If you find this project helpful, please give it a star. It means a lot to us!
|
|
[CodeFuse-Embeddings on GitHub](https://github.com/codefuse-ai/CodeFuse-Embeddings/tree/main)
|
|
## Citation
|
|
```bibtex
@article{2025C2LLM,
  title      = {C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling},
  author     = {Jin Qin and Zihan Liao and Ziyin Zhang and Hang Yu and Peng Di and Rui Wang},
  journal    = {CoRR},
  volume     = {abs/2512.21332},
  year       = {2025},
  url        = {https://doi.org/10.48550/arXiv.2512.21332},
  doi        = {10.48550/ARXIV.2512.21332},
  eprinttype = {arXiv},
  eprint     = {2512.21332}
}
```