Instructions to use nvidia/Llama3-ChatQA-1.5-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use nvidia/Llama3-ChatQA-1.5-8B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Llama3-ChatQA-1.5-8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")
model = AutoModelForCausalLM.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nvidia/Llama3-ChatQA-1.5-8B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Llama3-ChatQA-1.5-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Llama3-ChatQA-1.5-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/nvidia/Llama3-ChatQA-1.5-8B

SGLang

How to use nvidia/Llama3-ChatQA-1.5-8B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Llama3-ChatQA-1.5-8B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Llama3-ChatQA-1.5-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Llama3-ChatQA-1.5-8B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Llama3-ChatQA-1.5-8B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/Llama3-ChatQA-1.5-8B with Docker Model Runner:
```
docker model run hf.co/nvidia/Llama3-ChatQA-1.5-8B
```

Chat template

by bartowski - opened May 2, 2024

Discussion

bartowski

May 2, 2024

In the model card, you list the chat template as:

System: {System}

{Context}

User: {Question}

Assistant: {Response}

User: {Question}

Assistant:

However in the tokenizer_config.json it's:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Which is correct? I assume the one in the model card.

bennegeek

May 2, 2024

The sample code from the README file seems to be aligned with the model card.

zihanliu

NVIDIA org May 2, 2024

•

edited May 2, 2024

Sorry about the confusion here. The chat template listed in our model card is correct. We simply reused the tokenizer_config.json from Llama3, which comes with their chat template. We have updated this tokenizer_config.json and removed that template.
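
For readers who want to build the model-card layout by hand, here is a minimal sketch in Python. The function name `format_chatqa_prompt` and its arguments are hypothetical and not part of any NVIDIA or Transformers API; the sketch only joins the pieces with blank lines as shown in the model-card template quoted above. It does not add special tokens such as the bos_token; that is left to the tokenizer or the caller.

```python
def format_chatqa_prompt(system, context, turns):
    """Render a conversation in the model-card layout (hypothetical helper).

    `turns` is a list of (role, text) pairs where role is "user" or
    "assistant". The last turn is expected to be a user question.
    """
    labels = {"user": "User", "assistant": "Assistant"}
    parts = [f"System: {system}"]
    if context:
        # The retrieved or provided context sits between the system line
        # and the first user turn.
        parts.append(context)
    for role, text in turns:
        if role not in labels:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{labels[role]}: {text}")
    # End with an open Assistant turn so the model writes the next response.
    parts.append("Assistant:")
    return "\n\n".join(parts)


prompt = format_chatqa_prompt(
    "You are a helpful assistant.",
    "Paris is the capital of France.",
    [("user", "What is the capital of France?")],
)
print(prompt)
```

Because the context is an explicit argument, the same helper can be used with or without a retrieved document.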

FINGU-AI

May 6, 2024

This comment has been hidden

zihanliu

NVIDIA org May 6, 2024

I am using the eos_token ("<|end_of_text|>") as the pad_token.

Imran1

May 6, 2024

@zihanliu I fine-tuned the model using the chat template you suggested.
I did the same and assigned the EOS token as the pad token, like this:
tokenizer.pad_token = tokenizer.eos_token

The model sometimes adds the EOS token, but for long answers it fails to add it.

Is pad = EOS a bad approach?
Or should I add a new pad token to the tokenizer and update the embeddings too?

Check this chat template.

  "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{ bos_token + 'System: ' + message['content'] }}{% elif message['role'] == 'user' %}{{ '\n\nUser: ' + message['content'] +  eos_token  }}{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: ' + message['content'] +  eos_token  }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}",

zihanliu

NVIDIA org May 6, 2024

Hi. I checked your chat template. The eos_token shouldn't be added after the User turn or the Assistant turn. During fine-tuning, we use the eos_token as the pad_token, and it is added at the end of the sequence to make sure each sequence in a batch has the same length.

Imran1

May 7, 2024

@zihanliu so in your view, we don't need to add the EOS token and BOS token. Will the sequences in the data be padded to match the longest one?
Let's say I have a dataset like this:

[ " Ai assistant"]
[" Ai friendly assistant BLA BLA "]
[" ABC"]

Does the tokenizer pick the long sequence at random and pad the rest?
I mean, how does the tokenizer make every sequence in a batch the same length?

Last question: with this approach, I am afraid the model will fail to add the EOS token or end the conversation.
Sometimes the model doesn't know how to stop its answer.

Thank you.

zihanliu

NVIDIA org May 7, 2024

•

edited May 7, 2024

Hi, let me try to answer your questions:

1. You need to set a maximum sequence length (4k/8k). The tokenizer then pads all sequences to that maximum length. If a sample is longer than the maximum sequence length, it is truncated.
2. The model is trained to generate the eos_token when the output is finished. For the padding tokens, however, we set loss_mask to 0 so that they are not trained on.
3. We do need the bos_token at the beginning of a sequence, the same as for Llama3 models.

Hope these can help :)
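
The padding and loss-masking steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not NVIDIA's training code: `pad_batch` and `masked_mean` are hypothetical helpers, the token ids are made up, and `EOS_ID = 2` is a placeholder rather than the real id of "<|end_of_text|>". The sketch masks only the padding; the eos_token that ends a real sequence keeps a mask of 1, so the model is still trained to emit it.

```python
EOS_ID = 2  # placeholder id, NOT the real id of "<|end_of_text|>"


def pad_batch(sequences, max_len, pad_id):
    """Truncate or right-pad token-id lists to max_len and build loss masks."""
    input_ids, loss_mask = [], []
    for seq in sequences:
        kept = seq[:max_len]  # samples longer than max_len are cut
        n_pad = max_len - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)
        # 1 = token contributes to the loss, 0 = padding that is ignored
        loss_mask.append([1] * len(kept) + [0] * n_pad)
    return input_ids, loss_mask


def masked_mean(per_token_loss, mask):
    """Average the per-token loss over positions whose mask is 1."""
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, mask)) / count


# Each made-up sequence ends with one real EOS; pad_id reuses the same id.
batch = [[5, 6, EOS_ID], [5, EOS_ID]]
ids, mask = pad_batch(batch, max_len=4, pad_id=EOS_ID)
print(ids)
print(mask)
```

In the second sequence, the first EOS is the real end of the answer and is trained on, while the two EOS tokens after it are padding and are excluded from the loss.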

Imran1

May 7, 2024

•

edited May 7, 2024

@zihanliu point 2 is not clear.
How do I set the loss_mask to 0?

Imran1

May 7, 2024

Hahaha, I was confused. I understand now.
I think setting packing to true will also handle this issue.
@zihanliu thank you. Have a nice day 😊

donglai1

May 8, 2024

•

edited May 8, 2024

When I use vLLM, I get the warning: No chat template provided. Chat API will not work.
The result is below:

Q: is hi
R : is
hi
<|im_end|>
<|im_start|>user

.......
Why?

Imran1

May 8, 2024

@zihanliu Hi, I hope you are well. I trained the model. It now adds the EOS token and no longer gets stuck in its responses.
But the model returns very short answers, just 25 or 30 tokens.

However, the samples in my dataset range from 250 to 1000 tokens.

I increased the new token length to 1000 and 2000 in the generation config, but the model still produces 25 or 30 tokens.

Can you explain why this happens?

zihanliu

NVIDIA org May 8, 2024

Hi @donglai1 ,
We didn't provide the chat template since usually additional context needs to be added, making it hard to fit into the chat template. You can refer to our sample code to get the prompt template for the model.
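
Without a chat template, one workaround is to format the prompt yourself and send it to the server's plain completions endpoint instead of the chat endpoint. The sketch below is an illustration, not an official recipe: it assumes a vLLM server started with `vllm serve` on localhost port 8000, uses the OpenAI-compatible `/v1/completions` route, and the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

MODEL = "nvidia/Llama3-ChatQA-1.5-8B"


def build_completion_request(prompt, max_tokens=128, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to a running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Prompt written by hand in the model-card layout; no chat template involved.
example_prompt = (
    "System: You are a helpful assistant.\n\n"
    "User: What is the capital of France?\n\n"
    "Assistant:"
)
request = build_completion_request(example_prompt, max_tokens=64)
# complete(example_prompt, max_tokens=64)  # needs a running server
```

Keeping request construction separate from sending makes it easy to inspect the exact prompt that reaches the model.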

zihanliu

NVIDIA org May 8, 2024

•

edited May 8, 2024

Hi @Imran1 ,
ChatQA is trained to provide full but concise responses to questions. How large is your dataset? If it is small, the model might still follow its original output format. It also depends on the question: some questions do not necessarily need a very long answer.

Imran1

May 9, 2024

@zihanliu the data has 28k samples.
Each answer has 250-1000 tokens.

When I ask it to write a simple CV, it generates one, but for domain topics the answer is too short, just a few tokens.
