|
|
--- |
|
|
license: llama3.2 |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- en |
|
|
- ko |
|
|
- fr |
|
|
- zh |
|
|
- es |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
 |
|
|
 |
|
|
 |
|
|
 |
|
|
 |
|
|
 |
|
|
|
|
|
<img src="images/sagea-logo.png" alt="SAGE Logo" width="75%"> |
|
|
|
|
|
# SAGE Reasoning 3B |
|
|
|
|
|
*Advanced Hybrid Reasoning Model with Tool-Calling Capabilities* |
|
|
|
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## Table of Contents |
|
|
|
|
|
- [Overview](#overview) |
|
|
- [Key Features](#key-features) |
|
|
- [Evaluations](#evaluations)

- [Usage](#usage)

- [Tool Calling](#tool-calling)

- [License](#license)
|
|
- [Contact](#contact) |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
SAGE Reasoning family models are instruction-tuned, text-in/text-out generative models released under the Llama 3.2 Community License, which permits commercial use.
|
|
|
|
|
## Key Features |
|
|
|
|
|
### **Hybrid Reasoning Architecture** |
|
|
- **Dual Mode Operation**: Capable of producing fast direct responses in standard LLM mode, or applying self-reflection before answering in reasoning mode |
|
|
- **Advanced Training**: Uses **Iterated Distillation and Amplification (IDA)** - a scalable alignment method based on iterative self-improvement |
|
|
|
|
|
### **Specialized Capabilities** |
|
|
- **Code Generation**: Optimized for programming tasks with strong coding abilities |
|
|
- **STEM Excellence**: Enhanced performance on science, technology, engineering, and mathematics problems |
|
|
- **Instruction Following**: Superior adherence to complex instructions and prompts |
|
|
- **Tool Calling**: Notable strength in tool-calling ability compared to similar-sized models |
|
|
|
|
|
### **Global Reach** |
|
|
- **Multilingual Support**: Over 30 languages supported |
|
|
- **Extended Context**: 128k context window for handling large documents and conversations |
|
|
- **Consistent Performance**: Both standard and reasoning variants consistently outperform other models in the same parameter class on public benchmarks |
|
|
|
|
|
## Evaluations |
|
|
|
|
|
We compare our models against state-of-the-art size-equivalent models in both direct mode and reasoning mode. For direct mode, we compare against the Llama/Qwen instruct counterparts; for reasoning mode, we compare against DeepSeek's R1-distilled counterparts and Qwen's QwQ model.
|
|
|
|
|
### Overall Performance Benchmarks |
|
|
|
|
|
<div align="center"> |
|
|
<img src="images/3b_benchmarks.png" alt="Overall Performance Benchmarks" width="85%"> |
|
|
<p><em>Comprehensive benchmark results showing SAGE Reasoning 3B performance across multiple evaluation metrics</em></p> |
|
|
</div> |
|
|
|
|
|
### Livebench Global Average |
|
|
|
|
|
<div align="center"> |
|
|
<img src="images/3b_8b_tools.png" alt="Livebench Global Average Performance" width="75%"> |
|
|
<p><em>Livebench global performance comparison demonstrating consistent superiority</em></p> |
|
|
</div> |
|
|
|
|
|
### Tool Calling Performance |
|
|
|
|
|
<div align="center"> |
|
|
<img src="images/3b_8b_tool_calling_benchmarks.png" alt="Tool Calling Benchmarks" width="85%"> |
|
|
<p><em>Tool calling capabilities comparison showing enhanced performance in function calling and tool utilization</em></p> |
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
# Usage |
|
|
Here is a snippet showing basic usage with Transformers:
|
|
|
|
|
```python |
|
|
import transformers |
|
|
import torch |
|
|
|
|
|
model_id = "sagea-ai/sage-reasoning-3b" |
|
|
|
|
|
pipeline = transformers.pipeline( |
|
|
"text-generation", |
|
|
model=model_id, |
|
|
model_kwargs={"torch_dtype": torch.bfloat16}, |
|
|
device_map="auto", |
|
|
) |
|
|
|
|
|
messages = [ |
|
|
{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"}, |
|
|
{"role": "user", "content": "Give me a short introduction to LLMs."}, |
|
|
] |
|
|
|
|
|
outputs = pipeline( |
|
|
messages, |
|
|
max_new_tokens=512, |
|
|
) |
|
|
|
|
|
print(outputs[0]["generated_text"][-1]) |
|
|
``` |
|
|
|
|
|
|
|
|
|
|
|
## Implementing extended thinking |
|
|
- By default, the model answers in standard mode.

- To enable thinking, use either of the following methods:
|
|
- Add a specific system prompt, or |
|
|
- Set `enable_thinking=True` while applying the chat template. |
|
|
|
|
|
> **_NOTE:_** For the SAGE Reasoning 3B model, we suggest using `repetition_penalty=1.1` when extended thinking is enabled.
|
|
|
|
|
### Method 1 - Add a specific system prompt

To enable thinking, set the system prompt to `system_instruction = 'Enable deep thinking subroutine.'`

If you already have a `system_instruction`, use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.
|
|
|
|
|
Here is an example - |
|
|
|
|
|
```python |
|
|
import transformers |
|
|
import torch |
|
|
|
|
|
model_id = "sagea-ai/sage-reasoning-3b" |
|
|
|
|
|
pipeline = transformers.pipeline( |
|
|
"text-generation", |
|
|
model=model_id, |
|
|
model_kwargs={"torch_dtype": torch.bfloat16}, |
|
|
device_map="auto", |
|
|
) |
|
|
|
|
|
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine." |
|
|
|
|
|
messages = [ |
|
|
{"role": "system", "content": DEEP_THINKING_INSTRUCTION}, |
|
|
{"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."}, |
|
|
] |
|
|
|
|
|
outputs = pipeline( |
|
|
messages, |
|
|
max_new_tokens=512, |
|
|
) |
|
|
|
|
|
print(outputs[0]["generated_text"][-1]) |
|
|
``` |
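
Per the note above, the suggested `repetition_penalty=1.1` can be forwarded directly as a generation argument. Here is a minimal sketch, assuming the `pipeline` and `messages` objects defined in the example above:

```python
# Minimal sketch: forward the suggested repetition penalty to generate().
# Assumes `pipeline` and `messages` are defined as in the example above.
outputs = pipeline(
    messages,
    max_new_tokens=512,
    repetition_penalty=1.1,  # suggested value for extended thinking
)

print(outputs[0]["generated_text"][-1])
```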
|
|
|
|
|
|
|
|
Similarly, if you already have a system prompt, you can prepend `DEEP_THINKING_INSTRUCTION` to it like this -
|
|
|
|
|
```python |
|
|
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine." |
|
|
|
|
|
system_prompt = "Reply to each prompt with only the actual code - no explanations." |
|
|
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format." |
|
|
|
|
|
messages = [ |
|
|
{"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt}, |
|
|
{"role": "user", "content": prompt} |
|
|
] |
|
|
``` |
|
|
|
|
|
### Method 2 - Set `enable_thinking=True` in the tokenizer

If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` when applying the chat template (the option is built into the model's chat template).
|
|
|
|
|
Here is an example - |
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
model_name = "sagea-ai/sage-reasoning-3b" |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype="auto", |
|
|
device_map="auto" |
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
|
|
prompt = "Give me a short introduction to LLMs." |
|
|
messages = [ |
|
|
{"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"}, |
|
|
{"role": "user", "content": prompt} |
|
|
] |
|
|
|
|
|
text = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True, |
|
|
enable_thinking=True |
|
|
) |
|
|
model_inputs = tokenizer([text], return_tensors="pt").to(model.device) |
|
|
|
|
|
generated_ids = model.generate( |
|
|
**model_inputs, |
|
|
max_new_tokens=512 |
|
|
) |
|
|
generated_ids = [ |
|
|
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
|
|
] |
|
|
|
|
|
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] |
|
|
print(response) |
|
|
``` |
|
|
|
|
|
# Tool Calling |
|
|
SAGE Reasoning 3B models support tool calling (single, parallel, multiple, and parallel_multiple) in both standard and extended thinking modes; a parallel tool-calling sketch follows the single-tool walkthrough below.
|
|
|
|
|
Here is a snippet - |
|
|
|
|
|
```python |
|
|
# This example reuses the `model` and `tokenizer` loaded in the previous section.
# First, define a tool
|
|
def get_current_temperature(location: str) -> float: |
|
|
""" |
|
|
Get the current temperature at a location. |
|
|
|
|
|
Args: |
|
|
location: The location to get the temperature for, in the format "City, Country" |
|
|
Returns: |
|
|
        The current temperature at the specified location, as a float.
|
|
""" |
|
|
return 22. # A real function should probably actually get the temperature! |
|
|
|
|
|
# Next, create a chat and apply the chat template |
|
|
messages = [ |
|
|
{"role": "user", "content": "Hey, what's the temperature in Paris right now?"} |
|
|
] |
|
|
|
|
|
# Apply the chat template with the tool definitions, then tokenize and generate
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
|
|
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device) |
|
|
outputs = model.generate(**inputs, max_new_tokens=512) |
|
|
output_text = tokenizer.batch_decode(outputs)[0][len(text):] |
|
|
print(output_text) |
|
|
``` |
|
|
|
|
|
This will result in the output - |
|
|
``` |
|
|
<tool_call> |
|
|
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}} |
|
|
</tool_call><|eot_id|> |
|
|
``` |
|
|
|
|
|
When the model generates a tool call, as above, append it to the chat like so:
|
|
|
|
|
```python |
|
|
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}} |
|
|
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]}) |
|
|
``` |
|
|
|
|
|
and then call the tool and append the result, with the `tool` role, like so: |
|
|
|
|
|
```python |
|
|
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"}) |
|
|
``` |
|
|
|
|
|
After that, you can `generate()` again to let the model use the tool result in the chat: |
|
|
|
|
|
```python |
|
|
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False) |
|
|
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device) |
|
|
outputs = model.generate(**inputs, max_new_tokens=512) |
|
|
output_text = tokenizer.batch_decode(outputs)[0][len(text):] |
|
|
``` |
|
|
|
|
|
This should result in the string - |
|
|
|
|
|
`'The current temperature in Paris is 22.0 degrees.<|eot_id|>'`
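
The single-tool flow above extends naturally to parallel and multiple tool calling: pass several tools to `apply_chat_template`, and the model may emit more than one `<tool_call>` block in a single turn. Below is a minimal sketch, assuming a second hypothetical tool `get_current_wind_speed` alongside `get_current_temperature`, and reusing the `model` and `tokenizer` loaded earlier:

```python
def get_current_wind_speed(location: str) -> float:
    """
    Get the current wind speed at a location. (Hypothetical tool, for illustration only.)

    Args:
        location: The location to get the wind speed for, in the format "City, Country"
    Returns:
        The current wind speed at the specified location in km/h, as a float.
    """
    return 6.0  # A real function should actually look this up!

messages = [
    {"role": "user", "content": "What are the temperature and wind speed in Paris right now?"}
]

# Pass both tools; the model may emit one or more <tool_call> blocks in a single turn.
text = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature, get_current_wind_speed],
    add_generation_prompt=True,
    tokenize=False,
)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.batch_decode(outputs)[0][len(text):])
```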
|
|
|
|
|
## License |
|
|
|
|
|
This repository and the model weights are licensed under the [**Llama 3.2 Community License Agreement**](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (Llama models' default license agreement). |
|
|
|
|
|
|
|
|
|
|
## Contact |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Get in Touch with Our Team** |
|
|
|
|
|
For inquiries, collaborations, or support, please reach out to us: |
|
|
|
|
|
**Email**: [[email protected]](mailto:[email protected]) |
|
|
|
|
|
--- |
|
|
|
|
|
<p> |
|
|
<strong>SAGE Reasoning 3B</strong><br> |
|
|
<em>Advancing the frontier of hybrid reasoning models</em> |
|
|
</p> |
|
|
|
|
|
 |
|
|
|
|
|
</div> |