T5-Small Vietnamese

A T5-small model adapted to Vietnamese through continual pretraining with the ViT5 tokenizer.

Model Description

This model combines:

  • Architecture: google-t5/t5-small (~60M parameters)
  • Tokenizer: VietAI/vit5-base tokenizer (Vietnamese-optimized)
  • Pretraining: Span corruption denoising objective on Vietnamese text

The model was created in the following steps (a minimal sketch follows the list):

  1. Loading T5-small architecture
  2. Replacing tokenizer with ViT5's Vietnamese tokenizer
  3. Resizing embedding layer to match new vocabulary
  4. Pretraining on Vietnamese corpus
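
A minimal sketch of steps 1–3 using the Hugging Face transformers API; the actual preparation script is not published, so this is an assumption of how the adaptation was done:

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Step 1: load the base T5-small architecture and weights
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

# Step 2: swap in ViT5's Vietnamese tokenizer
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")

# Step 3: resize the (tied) embedding layer to the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Step 4: continual pretraining with a span-corruption objective (not shown here)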

Training Details

Training Data

The model was pretrained on a Vietnamese text corpus (the VTSNLP Vietnamese dataset; see Acknowledgments).

Pretraining Objective

  • Method: Span Corruption (T5-style denoising)
  • Noise Density: 15%
  • Mean Span Length: 3.0 tokens
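
For illustration only, a toy sketch of how span corruption turns text into input/target pairs at these settings; the real pipeline operates on token IDs with the tokenizer's <extra_id_N> sentinels, so this word-level version is an assumption:

import random

def span_corrupt(tokens, noise_density=0.15, mean_span_length=3.0, seed=0):
    # Toy word-level illustration of T5-style span corruption (not the training code).
    # Masked spans become <extra_id_N> sentinels in the input and are spelled out,
    # in order, in the target.
    rng = random.Random(seed)
    n_noise = max(1, round(len(tokens) * noise_density))
    n_spans = max(1, round(n_noise / mean_span_length))
    span_len = max(1, n_noise // n_spans)

    starts = sorted(rng.sample(range(len(tokens) - span_len), n_spans))
    inputs, targets, pos, sentinel = [], [], 0, 0
    for s in starts:
        if s < pos:  # skip spans overlapping the previous one
            continue
        inputs.extend(tokens[pos:s])
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[s:s + span_len])
        pos = s + span_len
        sentinel += 1
    inputs.extend(tokens[pos:])
    targets.append("</s>")
    return inputs, targets

inp, tgt = span_corrupt("Bến Tre là một tỉnh ven biển thuộc đồng bằng sông Cửu Long".split())
print(" ".join(inp))
print(" ".join(tgt))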

Compute Resources

  • Hardware: NVIDIA A100 80GB
  • Platform: Google Colab
  • Training Time: ~100 hours

Usage

Basic Usage (Fill-mask / Denoising)

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

text = "Bến Tre là <extra_id_0> của Việt Nam."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Expected output:

<extra_id_0> một trong những tỉnh </s>

Additional Examples

test_cases = [
    "Hà Nội là <extra_id_0> của Việt Nam.",
    "Phở là món <extra_id_0> nổi tiếng của Việt Nam.",
    "Tôi <extra_id_0> học.",
    "Tiếng Việt là ngôn ngữ <extra_id_0> của người Việt.",
    "Con mèo đang <extra_id_0> trên ghế.",
    "Việt Nam là một <extra_id_0> nằm ở <extra_id_1> Á."
]

for text in test_cases:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_beams=4,
        early_stopping=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Zero-shot Downstream Task Examples

Although the model is not fine-tuned for specific downstream tasks, it can perform several tasks in a zero-shot manner by leveraging T5’s text-to-text formulation.

Zero-shot Named Entity Recognition (NER)

text = (
    "Ông Phạm Nhật Vượng là chủ tịch của tập đoàn Vingroup. "
    "Tên ông là <extra_id_0>, thực thể tổ chức là <extra_id_1>."
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Expected output:

<extra_id_0> ông Phạm Nhật Vượng <extra_id_1> Vingroup </s>

Zero-shot Contextual Question Answering (QA)

text = (
    "Bối cảnh: Chiến thắng Điện Biên Phủ năm 1954 là một mốc son chói lọi "
    "trong lịch sử dân tộc Việt Nam. Dưới sự chỉ huy của Đại tướng Võ Nguyên Giáp, "
    "quân và dân ta đã đập tan tập đoàn cứ điểm mạnh nhất Đông Dương của thực dân Pháp "
    "sau 56 ngày đêm chiến đấu gian khổ. "
    "Câu hỏi: Ai là người chỉ huy quân đội Việt Nam trong chiến dịch Điện Biên Phủ? "
    "Trả lời: <extra_id_0>"
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Expected output:

<extra_id_0> Đại tướng Võ Nguyên Giáp </s>

Intended Uses

This model can be used as:

  1. A base model for fine-tuning on Vietnamese NLP tasks:
    • Text summarization
    • Question answering
    • Text classification
    • Named Entity Recognition
    • Machine translation
  2. A fill-in-the-blank (denoising) text completion model
  3. A starting point for Vietnamese language understanding tasks

Fine-tuning Example

from transformers import T5ForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

# Fine-tune on your downstream task
training_args = TrainingArguments(
    output_dir="./my-finetuned-model",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
    # ... other arguments
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,
    # ...
)

trainer.train()
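
The Trainer above expects your_dataset to already contain tokenized inputs and labels. A hedged preprocessing sketch for a summarization-style dataset follows; the column names "document" and "summary" are placeholders, not part of this model card:

# Hypothetical preprocessing; adjust column names to your own dataset.
def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# your_dataset = raw_dataset.map(preprocess, batched=True)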

Model Architecture

T5ForConditionalGeneration(
  (shared): Embedding(36334, 512)  # Resized for ViT5 tokenizer
  (encoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (lm_head): Linear(512, 36334)
)

Citation

If you use this model, please cite:

@misc{t5-small-vietnamese,
  author = {nbdaaa},
  title = {T5-Small Vietnamese: A Vietnamese-adapted T5 model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nbdaaa/t5-small-vietnamese}
}

Acknowledgments

  • Google T5 for the original T5 architecture
  • VietAI for the ViT5 Vietnamese tokenizer
  • VTSNLP for the Vietnamese dataset