Chat template
In the model card, you list the chat template as:
System: {System}
{Context}
User: {Question}
Assistant: {Response}
User: {Question}
Assistant:
However, in tokenizer_config.json it is:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Which is correct? I assume the one in the model card.
The sample code from the README file seems to be aligned with the model card.
Sorry about the confusion here. The chat template listed in our model card is correct. We simply reused the tokenizer_config.json from Llama3, which comes with their chat template. We have now updated tokenizer_config.json and removed that template.
I am using the eos_token ("<|end_of_text|>") as the pad_token.
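For readers who want to build the prompt by hand, here is a minimal sketch of the model-card layout quoted above. The helper name `build_chatqa_prompt`, the example strings, and the blank-line (`"\n\n"`) separators between sections are assumptions for illustration, not the authors' code.

```python
def build_chatqa_prompt(system, context, turns):
    """Format a conversation in the plain-text layout from the model card.

    turns: list of (role, text) pairs, role is "user" or "assistant".
    The blank-line separators are an assumption (hypothetical helper).
    """
    parts = [f"System: {system}"]
    if context:
        parts.append(context)
    for role, text in turns:
        label = "User" if role == "user" else "Assistant"
        parts.append(f"{label}: {text}")
    # Leave the final assistant turn open so the model completes it.
    parts.append("Assistant:")
    return "\n\n".join(parts)


prompt = build_chatqa_prompt(
    system="This is a chat between a user and an AI assistant.",
    context="Paris is the capital of France.",
    turns=[("user", "What is the capital of France?")],
)
print(prompt)
```

Note that no Llama3 header tokens such as `<|start_header_id|>` appear anywhere in the prompt, which matches the reply above.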
@zihanliu I fine-tuned the model using the chat template you suggested.
I did the same and assigned the eos token as the pad token, like this:
tokenizer.pad_token = tokenizer.eos_token
The model sometimes adds the eos token, but for long answers it fails to add it.
Is it a bad approach to set pad = EOS?
Or do I need to add a separate pad token to the tokenizer and update the embeddings too?
Check this chat template.
"chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{ bos_token + 'System: ' + message['content'] }}{% elif message['role'] == 'user' %}{{ '\n\nUser: ' + message['content'] + eos_token }}{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: ' + message['content'] + eos_token }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}",
Hi. I checked your chat template. The eos_token shouldn't be added after the User turn or the Assistant turn. During fine-tuning, we use eos_token as the pad_token, and it is added at the end of the sequence to make sure each sequence in a batch has the same length.
@zihanliu So in your view, we don't need to add the eos token and bos token. Will the model pick the long sequences in the data as the length to pad to?
Let's say I have a dataset like this:
[ " Ai assistant"]
[" Ai friendly assistant BLA BLA "]
[" ABC"]
Does the tokenizer choose a long sequence at random and pad the remaining ones to match it?
I mean, how does the tokenizer make every sequence in a batch the same length?
Last question: by doing this, I am afraid the model will fail to add the eos token or to end the conversation.
Sometimes the model doesn't know how to stop its answer.
Thank you.
Hi, let me try to answer your questions:
- You need to set a maximum sequence length (4k/8k). Then the tokenizer will pad all sequences to that maximum length. If a sample is longer than the maximum sequence length, it will be truncated.
- The model will be trained to generate the eos_token when the output is finished. However, for the padding tokens, we set the loss_mask to 0 to make sure the padding tokens are not trained on.
- We do need the bos_token at the beginning of a sequence, which is the same as for the Llama3 models.
Hope this helps :)
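The padding and loss-mask points above can be sketched in plain Python. This is an illustration under assumptions, not the actual training code: the function name `pad_with_loss_mask` and the token ids (including `EOS_ID = 99`) are hypothetical, and a real pipeline may also mask prompt tokens.

```python
EOS_ID = 99  # hypothetical id standing in for the real eos_token id


def pad_with_loss_mask(sequences, max_len, pad_id):
    """Truncate or pad each token-id list to max_len.

    Returns (input_ids, loss_mask). Real tokens get mask 1, padding gets
    mask 0, so the genuine end-of-answer eos is still trained on even
    though the same id is reused for padding.
    """
    batch_ids, batch_mask = [], []
    for ids in sequences:
        ids = ids[:max_len]  # samples longer than max_len are cut
        n_pad = max_len - len(ids)
        batch_ids.append(ids + [pad_id] * n_pad)
        batch_mask.append([1] * len(ids) + [0] * n_pad)
    return batch_ids, batch_mask


# Each sample already ends with one real eos token.
samples = [
    [1, 5, 6, EOS_ID],
    [1, 7, EOS_ID],
    [1, 8, 9, 10, 11, 12, EOS_ID],
]
input_ids, loss_mask = pad_with_loss_mask(samples, max_len=5, pad_id=EOS_ID)
for ids, mask in zip(input_ids, loss_mask):
    print(ids, mask)
```

The third sample shows a side effect of truncation: its trailing eos is cut off, so that sample gives the model no signal about where to stop.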
When I use vLLM, I get this warning: No chat template provided. Chat API will not work.
The result is below:
Q: is hi
R : is
hi
<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user
hi<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user
.......
Why?
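One possible reading, given that the Llama3 chat template was removed from tokenizer_config.json as described earlier in this thread, is that the chat endpoint has no template to format messages with. A workaround sketch, under that assumption, is to format the prompt by hand in the model-card layout and send it to vLLM's OpenAI-compatible `/v1/completions` route instead of the chat route. The helper name, the example strings, the `"\n\n"` separators, and the stop string are assumptions; the snippet only builds the request body and does not send it.

```python
import json


def build_completion_payload(system, context, question, max_tokens=256):
    """Build a request body for a text-completions endpoint (hypothetical helper)."""
    # Same plain-text layout as the model card; separators are an assumption.
    prompt = "\n\n".join(
        part
        for part in (f"System: {system}", context, f"User: {question}", "Assistant:")
        if part
    )
    return {
        "model": "nvidia/Llama3-ChatQA-1.5-8B",
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Stop if the model starts writing the next user turn itself.
        "stop": ["\n\nUser:"],
    }


payload = build_completion_payload(
    system="This is a chat between a user and an AI assistant.",
    context="",
    question="hi",
)
print(json.dumps(payload, indent=2))
```

Alternatively, vLLM accepts a template file through its `--chat-template` option, which would let the chat endpoint keep working with a template that matches the model-card layout.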
@zihanliu Hi, I hope you are doing well. I trained the model. It now adds the eos token and no longer gets stuck in its responses.
But the model returns very short answers, just 25 or 30 tokens.
However, the answers in my dataset range from 250 to 1000 tokens.
I increased the maximum number of new tokens to 1000 and 2000 in the generation config, but the model still produces only 25 or 30 tokens.
Can you explain why this happens?
Hi @Imran1 ,
ChatQA is trained to provide a full but concise response to the question. How large is your dataset? If it is small, the model might still follow its original output format. It also depends on the question: some questions do not necessarily need a very long answer.