T5-Small Vietnamese

A T5-small model adapted to Vietnamese through continual pretraining with the ViT5 tokenizer.

Model Description

This model combines:

  • Architecture: google-t5/t5-small (~60M parameters)
  • Tokenizer: VietAI/vit5-base tokenizer (Vietnamese-optimized)
  • Pretraining: Span corruption denoising objective on Vietnamese text

The model was created in the following steps (a minimal sketch follows the list):

  1. Loading T5-small architecture
  2. Replacing tokenizer with ViT5's Vietnamese tokenizer
  3. Resizing embedding layer to match new vocabulary
  4. Pretraining on Vietnamese corpus
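
A minimal sketch of steps 1–3 using the Hugging Face transformers API; the actual preparation script is not published, so this is an assumption of how the adaptation was done:

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Step 1: load the base T5-small architecture and weights
model = T5ForConditionalGeneration.from_pretrained("google-t5/t5-small")

# Step 2: swap in ViT5's Vietnamese tokenizer
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")

# Step 3: resize the (tied) embedding layer to the new vocabulary size
model.resize_token_embeddings(len(tokenizer))

# Step 4: continual pretraining with a span-corruption objective (not shown here)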

Training Details

Training Data

The model was pretrained on a Vietnamese text corpus (the VTSNLP Vietnamese dataset; see Acknowledgments).

Pretraining Objective

  • Method: Span Corruption (T5-style denoising)
  • Noise Density: 15%
  • Mean Span Length: 3.0 tokens
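
For illustration only, a toy sketch of how span corruption turns text into input/target pairs at these settings; the real pipeline operates on token IDs with the tokenizer's <extra_id_N> sentinels, so this word-level version is an assumption:

import random

def span_corrupt(tokens, noise_density=0.15, mean_span_length=3.0, seed=0):
    # Toy word-level illustration of T5-style span corruption (not the training code).
    # Masked spans become <extra_id_N> sentinels in the input and are spelled out,
    # in order, in the target.
    rng = random.Random(seed)
    n_noise = max(1, round(len(tokens) * noise_density))
    n_spans = max(1, round(n_noise / mean_span_length))
    span_len = max(1, n_noise // n_spans)

    starts = sorted(rng.sample(range(len(tokens) - span_len), n_spans))
    inputs, targets, pos, sentinel = [], [], 0, 0
    for s in starts:
        if s < pos:  # skip spans overlapping the previous one
            continue
        inputs.extend(tokens[pos:s])
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[s:s + span_len])
        pos = s + span_len
        sentinel += 1
    inputs.extend(tokens[pos:])
    targets.append("</s>")
    return inputs, targets

inp, tgt = span_corrupt("Bến Tre là một tỉnh ven biển thuộc đồng bằng sông Cửu Long".split())
print(" ".join(inp))
print(" ".join(tgt))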

Compute Resources

  • Hardware: NVIDIA A100 80GB
  • Platform: Google Colab
  • Training Time: ~100 hours

Usage

Basic Usage (Fill-mask / Denoising)

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

text = "Bến Tre là <extra_id_0> của Việt Nam."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Expected output:

<extra_id_0> một trong những tỉnh </s>

Additional Examples

test_cases = [
    "Hà Nội là <extra_id_0> của Việt Nam.",
    "Phở là món <extra_id_0> nổi tiếng của Việt Nam.",
    "Tôi <extra_id_0> học.",
    "Tiếng Việt là ngôn ngữ <extra_id_0> của người Việt.",
    "Con mèo đang <extra_id_0> trên ghế.",
    "Việt Nam là một <extra_id_0> nằm ở <extra_id_1> Á."
]

for text in test_cases:
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_beams=4,
        early_stopping=True
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Zero-shot Downstream Task Examples

Although the model is not fine-tuned for specific downstream tasks, it can perform several tasks in a zero-shot manner by leveraging T5’s text-to-text formulation.

Zero-shot Named Entity Recognition (NER)

text = (
    "Ông Phạm Nhật Vượng là chủ tịch của tập đoàn Vingroup. "
    "Tên ông là <extra_id_0>, thực thể tổ chức là <extra_id_1>."
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Expected output:

<extra_id_0> ông Phạm Nhật Vượng <extra_id_1> Vingroup </s>

Zero-shot Contextual Question Answering (QA)

text = (
    "Bối cảnh: Chiến thắng Điện Biên Phủ năm 1954 là một mốc son chói lọi "
    "trong lịch sử dân tộc Việt Nam. Dưới sự chỉ huy của Đại tướng Võ Nguyên Giáp, "
    "quân và dân ta đã đập tan tập đoàn cứ điểm mạnh nhất Đông Dương của thực dân Pháp "
    "sau 56 ngày đêm chiến đấu gian khổ. "
    "Câu hỏi: Ai là người chỉ huy quân đội Việt Nam trong chiến dịch Điện Biên Phủ? "
    "Trả lời: <extra_id_0>"
)

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Expected output:

<extra_id_0> Đại tướng Võ Nguyên Giáp </s>

Intended Uses

This model can be used as:

  1. A base model for fine-tuning on Vietnamese NLP tasks:
    • Text summarization
    • Question answering
    • Text classification
    • Named Entity Recognition
    • Machine translation
  2. A fill-in-the-blank (denoising) text completion model
  3. A starting point for Vietnamese language understanding tasks

Fine-tuning Example

from transformers import T5ForConditionalGeneration, AutoTokenizer, Trainer, TrainingArguments

model = T5ForConditionalGeneration.from_pretrained("nbdaaa/t5-small-vietnamese")
tokenizer = AutoTokenizer.from_pretrained("nbdaaa/t5-small-vietnamese")

# Fine-tune on your downstream task
training_args = TrainingArguments(
    output_dir="./my-finetuned-model",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=3,
    # ... other arguments
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,
    # ...
)

trainer.train()
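
The Trainer above expects your_dataset to already contain tokenized inputs and labels. A hedged preprocessing sketch for a summarization-style dataset follows; the column names "document" and "summary" are placeholders, not part of this model card:

# Hypothetical preprocessing; adjust column names to your own dataset.
def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# your_dataset = raw_dataset.map(preprocess, batched=True)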

Model Architecture

T5ForConditionalGeneration(
  (shared): Embedding(36334, 512)  # Resized for ViT5 tokenizer
  (encoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (decoder): T5Stack(
    (embed_tokens): Embedding(36334, 512)
    (block): ModuleList(6 layers)
    (final_layer_norm): T5LayerNorm()
  )
  (lm_head): Linear(512, 36334)
)

Citation

If you use this model, please cite:

@misc{t5-small-vietnamese,
  author = {nbdaaa},
  title = {T5-Small Vietnamese: A Vietnamese-adapted T5 model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/nbdaaa/t5-small-vietnamese}
}

Acknowledgments

  • Google T5 for the original T5 architecture
  • VietAI for the ViT5 Vietnamese tokenizer
  • VTSNLP for the Vietnamese dataset