T5 Changelog Generator Collection
Collection of fine-tuned T5 models (small, base, and large) for generating human-readable changelog entries from code changes and diffs.
Recommended Model
- t5-large (770M params, 2.8 GB) - Best quality, recommended for production use
- CTranslate2 int8: ~770 MB (CPU/GPU compatible)
Alternative Model
- t5-base (220M params, 850 MB) - Faster, balanced performance
- CTranslate2 int8: ~220 MB (CPU/GPU compatible)
Not Recommended
- t5-small (60M params, 231 MB) - Tends to generate hallucinated changelogs
Model Description
These models generate concise, informative changelog entries for the Open Build Service. They were trained on accepted submit requests to openSUSE:Factory, using the diffs of the spec file and of every file that looks like a changelog.
The only supported use case is generating a changelog entry after a package change.
Usage
CTranslate2 (Recommended for Production)
CTranslate2 provides 2-4x faster inference with lower memory usage. Ideal for production deployments.
Installation
pip install ctranslate2 transformers huggingface_hub
Basic Usage
import ctranslate2
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download
# Choose quantization variant:
# - float32: Full precision, no quality loss
# - float16: 50% smaller, GPU only
# - int8_float16: 75% smaller, GPU only
# - int8: 75% smaller, CPU/GPU compatible (Recommended)
repo_id = "mslacken/t5-finetune-changelog"
model_size = "large" # Recommended
quantization = "int8" # Recommended: int8 for CPU/GPU compatibility
# Download CTranslate2 model from HuggingFace
print("Downloading model...")
model_dir = snapshot_download(
    repo_id=repo_id,
    allow_patterns=f"ct2_models/t5-{model_size}-ct2-{quantization}/*"
)
ct2_model_path = f"{model_dir}/ct2_models/t5-{model_size}-ct2-{quantization}"
# Load CTranslate2 model
translator = ctranslate2.Translator(ct2_model_path, device="cpu") # or "cuda"
# Load tokenizer from the fine-tuned model
subfolder = f"t5-{model_size}"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
# Helper function to decode while preserving newlines
def decode_changelog(tokens):
    import re
    decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=False)
    decoded = decoded.replace(tokenizer.pad_token or "<pad>", "")
    decoded = decoded.replace(tokenizer.eos_token or "</s>", "")
    decoded = re.sub(r"<extra_id_\d+>", "", decoded)
    return decoded.strip()
# Example input
input_text = """create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
changelog:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Changed
- Provision interface is not tied to 'eth0' any more. The provision interface must be named
'default' now. The file 'nodes.yaml' must be changed accordingly.
- Creating of '/etc/exports' can now be disabled, so that wwctl configure -a wont overwrite
a existing '/etc/exports'.
- All configurations files for the host (/etc/exports, /etc/dhcpd.conf, /etc/hosts) are now
populated from the templates."""
# Tokenize input
input_tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode(input_text, max_length=1024, truncation=True)
)
# Generate with CTranslate2
results = translator.translate_batch(
    [input_tokens],
    beam_size=6,
    max_decoding_length=512,
    no_repeat_ngram_size=4
)
# Decode output
changelog = decode_changelog(results[0].hypotheses[0])
print(changelog)
# Expected output:
# - update to v4.3.0rc2 with following major changes:
# * Provision interface is not tied to 'eth0' any more. The provision interface
# must be named 'default' now. The file `nodes.yaml' must be changed accordingly.
# * Creating of '/etc/exports' can now be disabled, so that wwctl configure -a
# wont overwrite a existing '/etc/exports'.
# * All configurations files for the host (/etc/exports, /etc/dhcpd.conf,
# /etc/hosts) are now populated from the templates.
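Since `translate_batch` accepts a list of token sequences, several package diffs can be processed in one call. The following is a minimal sketch of preparing such a batch; the `translator` and `tokenizer` objects are assumed to be the ones created in the example above.

```python
# Sketch: prepare a token batch for translate_batch (one entry per diff).
def make_token_batch(texts, encode, max_length=1024):
    """Tokenize each input text and truncate it to max_length tokens."""
    return [encode(t)[:max_length] for t in texts]

# With the real tokenizer from the example above, `encode` would be:
#   encode = lambda t: tokenizer.convert_ids_to_tokens(
#       tokenizer.encode(t, max_length=1024, truncation=True))
# and the batch is passed directly:
#   results = translator.translate_batch(
#       make_token_batch(diffs, encode),
#       beam_size=6, max_decoding_length=512, no_repeat_ngram_size=4)
```

Each element of `results` then carries its own `hypotheses`, so `decode_changelog(results[i].hypotheses[0])` recovers the changelog for the i-th diff.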
CTranslate2 Model Variants
| Variant | Size vs Original | Device | Speed |
|---|---|---|---|
| float32 | 100% | CPU/GPU | 2x |
| float16 | 50% | GPU only | 2-3x |
| int8_float16 | 25% | GPU only | 2-4x |
| int8 | 25% | CPU/GPU | 2x |
Recommendations:
- GPU Production: int8_float16 (best speed/memory balance)
- CPU Production: int8 (small size, CPU compatible)
- Maximum Quality: float32 or float16
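The recommendations above can be expressed as a small, hypothetical selection helper (the function name and flags are illustrative, not part of any API). In practice, `has_gpu` could come from `ctranslate2.get_cuda_device_count() > 0`.

```python
# Hypothetical helper mapping the recommendation table to a variant name.
def pick_ct2_variant(has_gpu: bool, max_quality: bool = False) -> str:
    if max_quality:
        # Full/half precision for best quality; float16 needs a GPU.
        return "float16" if has_gpu else "float32"
    # Quantized variants for production: int8_float16 is GPU-only.
    return "int8_float16" if has_gpu else "int8"
```

For example, `pick_ct2_variant(has_gpu=False)` yields the CPU-safe `"int8"` variant used in the example above.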
Standard Transformers
Each model variant is in its own subdirectory. Use the subfolder parameter to select the variant:
from transformers import T5ForConditionalGeneration, AutoTokenizer
# Repository ID
repo_id = "mslacken/t5-finetune-changelog"
# Choose one of the three variants:
# - subfolder="t5-small" (fastest, 231 MB) - NOT recommended
# - subfolder="t5-base" (balanced, 850 MB)
# - subfolder="t5-large" (best quality, 2.8 GB) - Recommended
subfolder = "t5-large"
model = T5ForConditionalGeneration.from_pretrained(repo_id, subfolder=subfolder)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
# Helper function to decode while preserving newlines
def decode_changelog(output_ids):
    # Decode without removing special tokens first
    decoded = tokenizer.decode(output_ids, skip_special_tokens=False)
    # Remove only standard T5 tokens, keep our custom \n token
    import re
    decoded = decoded.replace(tokenizer.pad_token or "<pad>", "")
    decoded = decoded.replace(tokenizer.eos_token or "</s>", "")
    decoded = re.sub(r"<extra_id_\d+>", "", decoded)  # Remove sentinel tokens
    return decoded.strip()
# Generation parameters used below (chosen to prevent repetitions):
# - num_beams=6: beam search for better quality
# - no_repeat_ngram_size=4: prevent repeated 4-gram phrases
# - early_stopping=True: stop once the EOS token is generated
# Example input
input_text = """create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
changelog:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Changed
- Provision interface is not tied to 'eth0' any more. The provision interface must be named
'default' now. The file 'nodes.yaml' must be changed accordingly.
- Creating of '/etc/exports' can now be disabled, so that wwctl configure -a wont overwrite
a existing '/etc/exports'.
- All configurations files for the host (/etc/exports, /etc/dhcpd.conf, /etc/hosts) are now
populated from the templates."""
inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=6,
    no_repeat_ngram_size=4,
    early_stopping=True
)
changelog = decode_changelog(outputs[0])
print(changelog)
# Expected output:
# - update to v4.3.0rc2 with following major changes:
# * Provision interface is not tied to 'eth0' any more. The provision interface
# must be named 'default' now. The file `nodes.yaml' must be changed accordingly.
# * Creating of '/etc/exports' can now be disabled, so that wwctl configure -a
# wont overwrite a existing '/etc/exports'.
# * All configurations files for the host (/etc/exports, /etc/dhcpd.conf,
# /etc/hosts) are now populated from the templates.
Training Details
Training Data
- Dataset size: 30,000 examples
- Training split: 27,000 examples (90%)
- Validation split: 3,000 examples (10%)
- Data source: Custom dataset of code changes and corresponding changelog entries
- Filtering: Examples with targets exceeding 512 tokens were filtered to ensure proper EOS token learning
Note: All three model variants were trained on the same dataset with the same hyperparameters, differing only in their base model size.
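The target-length filter described above could be sketched as follows; the function and argument names are illustrative, and `count_tokens` stands in for a real tokenizer call such as `lambda t: len(tokenizer.encode(t))`.

```python
# Sketch of the dataset length filter (names are illustrative).
# Dropping over-long targets ensures every kept example ends with a real
# EOS token inside the 512-token window, so the model learns to stop.
def filter_long_targets(examples, count_tokens, max_target_tokens=512):
    """Keep only (input, target) pairs whose target fits in the window."""
    return [(src, tgt) for src, tgt in examples
            if count_tokens(tgt) <= max_target_tokens]
```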
Training Procedure
Base Models:
- t5-small: google-t5/t5-small (60M parameters)
- t5-base: google-t5/t5-base (220M parameters)
- t5-large: google-t5/t5-large (770M parameters)
Hyperparameters:
- Epochs: 10
- Batch size: 4 per device
- Gradient accumulation: 4 steps
- Effective batch size: 16
- Learning rate: 1e-4
- Optimizer: Adafactor
- LR scheduler: Cosine with 10% warmup
- Weight decay: 0.01
- Max input length: 1024 tokens
- Max target length: 512 tokens
- Early stopping: 5 epochs patience
- Label smoothing: 0.0
Training Infrastructure:
- Evaluation strategy: Every 500 steps
- Checkpoint saving: Every 500 steps
- Best model selection: Based on validation loss
- TensorBoard logging: Every 10 steps
- Maximum checkpoints kept: 2
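The hyperparameters and infrastructure settings above map roughly onto a `Seq2SeqTrainingArguments` configuration like the sketch below. This is an illustration, not the actual training script: the output path is a placeholder, and some argument names vary across transformers versions (e.g. `eval_strategy` was called `evaluation_strategy` before v4.41).

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the hyperparameter list above; "output/" is a placeholder.
args = Seq2SeqTrainingArguments(
    output_dir="output/",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=1e-4,
    optim="adafactor",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                # 10% warmup
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,              # keep at most 2 checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_steps=10,                # TensorBoard logging
)
```

The 5-epoch early-stopping patience would be supplied separately, e.g. via `EarlyStoppingCallback(early_stopping_patience=5)` passed to the trainer.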
Special tokens:
- Added \n as an additional special token for better diff handling
- Tokenizer vocabulary resized accordingly
Preprocessing
Input format:
generate changelog: [code diff or change description]
Target format:
[Human-readable changelog entry]
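The worked examples above use a prompt of the form "create structured changelog for package NAME from OLD to NEW:" followed by the upstream changelog. A hypothetical helper to assemble that prompt (the exact training prompt is defined by the dataset, not by this function):

```python
# Hypothetical prompt builder matching the worked examples above.
def build_prompt(package: str, old_ver: str, new_ver: str, changelog: str) -> str:
    return (f"create structured changelog for package {package} "
            f"from {old_ver} to {new_ver}:\nchangelog:\n{changelog}")
```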
Tokenization:
- Padding: Dynamic (longest in batch)
- Truncation: Enabled at max_length
- Special tokens: Added automatically
- Labels: Pad tokens replaced with -100 for loss computation
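The label masking step can be sketched as follows; -100 is the index that PyTorch's cross-entropy loss (and therefore the Hugging Face Trainer) ignores, and 0 is T5's pad token id.

```python
# Sketch: mask padding positions in the label ids with -100 so they
# contribute nothing to the loss. T5's pad_token_id is 0.
def mask_pad_labels(label_ids, pad_token_id=0):
    return [tid if tid != pad_token_id else -100 for tid in label_ids]

# e.g. mask_pad_labels([42, 7, 1, 0, 0]) -> [42, 7, 1, -100, -100]
```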
Model Performance
The model was trained with early stopping based on validation loss. The best checkpoint was automatically selected and saved as the final model.
Limitations
- Maximum input length: 1024 tokens (longer diffs will be truncated)
- Maximum output length: 512 tokens
- Domain specificity: Trained on specific changelog formats; may not generalize to all documentation styles
- Language: Primarily trained on English changelogs
- Code languages: Performance may vary across programming languages depending on training data distribution
Model Files and Formats
Repository Structure
t5-finetune-changelog/
├── t5-small/                      # 231 MB (NOT recommended - hallucinates)
│   ├── config.json
│   ├── generation_config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── t5-base/                       # 850 MB - Balanced
│   └── [same files]
├── t5-large/                      # 2.8 GB - Best quality (Recommended)
│   └── [same files]
└── ct2_models/                    # CTranslate2 optimized models (2-4x faster inference)
    ├── t5-base-ct2-float32/       # Full precision (~850 MB)
    ├── t5-base-ct2-float16/       # GPU only (~425 MB)
    ├── t5-base-ct2-int8_float16/  # GPU only (~220 MB)
    ├── t5-base-ct2-int8/          # CPU/GPU compatible (~220 MB)
    ├── t5-large-ct2-float32/      # Full precision (~2.8 GB)
    ├── t5-large-ct2-float16/      # GPU only (~1.4 GB)
    ├── t5-large-ct2-int8_float16/ # GPU only (~770 MB)
    └── t5-large-ct2-int8/         # CPU/GPU compatible (~770 MB)
Models are provided in two formats:
- SafeTensors (t5-*/): For Hugging Face Transformers
- CTranslate2 (ct2_models/): Optimized for 2-4x faster inference with lower memory usage
Citation
If you use this model in your research or projects, please cite:
@misc{t5-changelog-collection-2026,
  author       = {Christian Goll},
  title        = {T5 Changelog Generator Collection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/mslacken/t5-finetune-changelog}}
}
License
This model is licensed under Apache 2.0, same as the base T5 model.
Acknowledgments
- Base model: google-t5/t5-large
- Framework: Hugging Face Transformers
- Training: Custom implementation with PyTorch
Additional Information
Contact: Christian Goll [email protected]
Issues: Please report issues or feedback on the model's discussion page.