T5 Changelog Generator Collection
Collection of fine-tuned T5 models (small, base, and large) for generating human-readable changelog entries from code changes and diffs.
Recommended Model
- t5-large (770M params, 2.8 GB) - Best quality, recommended for production use
- CTranslate2 int8: ~770 MB (CPU/GPU compatible)
Alternative Model
- t5-base (220M params, 850 MB) - Faster, balanced performance
- CTranslate2 int8: ~220 MB (CPU/GPU compatible)
Not Recommended
- t5-small (60M params, 231 MB) - Tends to generate hallucinated changelogs
Model Description
These models generate concise, informative changelog entries for the Open Build Service. They were trained on accepted submit requests to openSUSE:Factory, using the diffs of the spec file and of every file that looks like a changelog.
The only supported use case is generating a changelog entry after a package change.
Usage
CTranslate2 (Recommended for Production)
CTranslate2 provides 2-4x faster inference with lower memory usage. Ideal for production deployments.
Installation
pip install ctranslate2 transformers huggingface_hub
Basic Usage
import ctranslate2
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download
# Choose quantization variant:
# - float32: Full precision, no quality loss
# - float16: 50% smaller, GPU only
# - int8_float16: 75% smaller, GPU only
# - int8: 75% smaller, CPU/GPU compatible (Recommended)
repo_id = "mslacken/t5-finetune-changelog"
model_size = "large" # Recommended
quantization = "int8" # Recommended: int8 for CPU/GPU compatibility
# Download CTranslate2 model from HuggingFace
print("Downloading model...")
model_dir = snapshot_download(
    repo_id=repo_id,
    allow_patterns=f"ct2_models/t5-{model_size}-ct2-{quantization}/*"
)
ct2_model_path = f"{model_dir}/ct2_models/t5-{model_size}-ct2-{quantization}"
# Load CTranslate2 model
translator = ctranslate2.Translator(ct2_model_path, device="cpu") # or "cuda"
# Load tokenizer from the fine-tuned model
subfolder = f"t5-{model_size}"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
# Helper function to decode while preserving newlines
def decode_changelog(tokens):
    import re
    decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=False)
    decoded = decoded.replace(tokenizer.pad_token or "<pad>", "")
    decoded = decoded.replace(tokenizer.eos_token or "</s>", "")
    decoded = re.sub(r"<extra_id_\d+>", "", decoded)
    return decoded.strip()
# Example input
input_text = """create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
changelog:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Changed
- Provision interface is not tied to 'eth0' any more. The provision interface must be named
'default' now. The file 'nodes.yaml' must be changed accordingly.
- Creating of '/etc/exports' can now be disabled, so that wwctl configure -a wont overwrite
a existing '/etc/exports'.
- All configurations files for the host (/etc/exports, /etc/dhcpd.conf, /etc/hosts) are now
populated from the templates."""
# Tokenize input
input_tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode(input_text, max_length=1024, truncation=True)
)
# Generate with CTranslate2
results = translator.translate_batch(
    [input_tokens],
    beam_size=6,
    max_decoding_length=512,
    no_repeat_ngram_size=4
)
# Decode output
changelog = decode_changelog(results[0].hypotheses[0])
print(changelog)
# Expected output:
# - update to v4.3.0rc2 with following major changes:
# * Provision interface is not tied to 'eth0' any more. The provision interface
# must be named 'default' now. The file `nodes.yaml' must be changed accordingly.
# * Creating of '/etc/exports' can now be disabled, so that wwctl configure -a
# wont overwrite a existing '/etc/exports'.
# * All configurations files for the host (/etc/exports, /etc/dhcpd.conf,
# /etc/hosts) are now populated from the templates.
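Since `translate_batch` accepts a list of token sequences, several package diffs can be processed in one call. The following is a minimal sketch of preparing such a batch; the `translator` and `tokenizer` objects are assumed to be the ones created in the example above.

```python
# Sketch: prepare a token batch for translate_batch (one entry per diff).
def make_token_batch(texts, encode, max_length=1024):
    """Tokenize each input text and truncate it to max_length tokens."""
    return [encode(t)[:max_length] for t in texts]

# With the real tokenizer from the example above, `encode` would be:
#   encode = lambda t: tokenizer.convert_ids_to_tokens(
#       tokenizer.encode(t, max_length=1024, truncation=True))
# and the batch is passed directly:
#   results = translator.translate_batch(
#       make_token_batch(diffs, encode),
#       beam_size=6, max_decoding_length=512, no_repeat_ngram_size=4)
```

Each element of `results` then carries its own `hypotheses`, so `decode_changelog(results[i].hypotheses[0])` recovers the changelog for the i-th diff.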
CTranslate2 Model Variants
| Variant | Size vs Original | Device | Speed |
|---|---|---|---|
| float32 | 100% | CPU/GPU | 2x |
| float16 | 50% | GPU only | 2-3x |
| int8_float16 | 25% | GPU only | 2-4x |
| int8 | 25% | CPU/GPU | 2x |
Recommendations:
- GPU Production: int8_float16 (best speed/memory balance)
- CPU Production: int8 (small size, CPU compatible)
- Maximum Quality: float32 or float16
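The recommendations above can be expressed as a small, hypothetical selection helper (the function name and flags are illustrative, not part of any API). In practice, `has_gpu` could come from `ctranslate2.get_cuda_device_count() > 0`.

```python
# Hypothetical helper mapping the recommendation table to a variant name.
def pick_ct2_variant(has_gpu: bool, max_quality: bool = False) -> str:
    if max_quality:
        # Full/half precision for best quality; float16 needs a GPU.
        return "float16" if has_gpu else "float32"
    # Quantized variants for production: int8_float16 is GPU-only.
    return "int8_float16" if has_gpu else "int8"
```

For example, `pick_ct2_variant(has_gpu=False)` yields the CPU-safe `"int8"` variant used in the example above.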
Standard Transformers
Each model variant is in its own subdirectory. Use the subfolder parameter to select the variant:
from transformers import T5ForConditionalGeneration, AutoTokenizer
# Repository ID
repo_id = "mslacken/t5-finetune-changelog"
# Choose one of the three variants:
# - subfolder="t5-small" (fastest, 231 MB) - NOT recommended
# - subfolder="t5-base" (balanced, 850 MB)
# - subfolder="t5-large" (best quality, 2.8 GB) - Recommended
subfolder = "t5-large"
model = T5ForConditionalGeneration.from_pretrained(repo_id, subfolder=subfolder)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
# Helper function to decode while preserving newlines
def decode_changelog(output_ids):
    # Decode without removing special tokens first
    decoded = tokenizer.decode(output_ids, skip_special_tokens=False)
    # Remove only standard T5 tokens, keep our custom \n token
    import re
    decoded = decoded.replace(tokenizer.pad_token or "<pad>", "")
    decoded = decoded.replace(tokenizer.eos_token or "</s>", "")
    decoded = re.sub(r"<extra_id_\d+>", "", decoded)  # Remove sentinel tokens
    return decoded.strip()
# Generation parameters used below (chosen to prevent repetitions):
# - num_beams=6: beam search for better quality
# - no_repeat_ngram_size=4: prevent repeated 4-gram phrases
# - early_stopping=True: stop once the EOS token is generated
# Example input
input_text = """create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
changelog:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Changed
- Provision interface is not tied to 'eth0' any more. The provision interface must be named
'default' now. The file 'nodes.yaml' must be changed accordingly.
- Creating of '/etc/exports' can now be disabled, so that wwctl configure -a wont overwrite
a existing '/etc/exports'.
- All configurations files for the host (/etc/exports, /etc/dhcpd.conf, /etc/hosts) are now
populated from the templates."""
inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=6,
    no_repeat_ngram_size=4,
    early_stopping=True
)
changelog = decode_changelog(outputs[0])
print(changelog)
# Expected output:
# - update to v4.3.0rc2 with following major changes:
# * Provision interface is not tied to 'eth0' any more. The provision interface
# must be named 'default' now. The file `nodes.yaml' must be changed accordingly.
# * Creating of '/etc/exports' can now be disabled, so that wwctl configure -a
# wont overwrite a existing '/etc/exports'.
# * All configurations files for the host (/etc/exports, /etc/dhcpd.conf,
# /etc/hosts) are now populated from the templates.
Training Details
Training Data
- Dataset size: 30,000 examples
- Training split: 27,000 examples (90%)
- Validation split: 3,000 examples (10%)
- Data source: Custom dataset of code changes and corresponding changelog entries
- Filtering: Examples with targets exceeding 512 tokens were filtered to ensure proper EOS token learning
Note: All three model variants were trained on the same dataset with the same hyperparameters, differing only in their base model size.
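The target-length filter described above could be sketched as follows; the function and argument names are illustrative, and `count_tokens` stands in for a real tokenizer call such as `lambda t: len(tokenizer.encode(t))`.

```python
# Sketch of the dataset length filter (names are illustrative).
# Dropping over-long targets ensures every kept example ends with a real
# EOS token inside the 512-token window, so the model learns to stop.
def filter_long_targets(examples, count_tokens, max_target_tokens=512):
    """Keep only (input, target) pairs whose target fits in the window."""
    return [(src, tgt) for src, tgt in examples
            if count_tokens(tgt) <= max_target_tokens]
```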
Training Procedure
Base Models:
- t5-small: google-t5/t5-small (60M parameters)
- t5-base: google-t5/t5-base (220M parameters)
- t5-large: google-t5/t5-large (770M parameters)
Hyperparameters:
- Epochs: 10
- Batch size: 4 per device
- Gradient accumulation: 4 steps
- Effective batch size: 16
- Learning rate: 1e-4
- Optimizer: Adafactor
- LR scheduler: Cosine with 10% warmup
- Weight decay: 0.01
- Max input length: 1024 tokens
- Max target length: 512 tokens
- Early stopping: 5 epochs patience
- Label smoothing: 0.0
Training Infrastructure:
- Evaluation strategy: Every 500 steps
- Checkpoint saving: Every 500 steps
- Best model selection: Based on validation loss
- TensorBoard logging: Every 10 steps
- Maximum checkpoints kept: 2
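The hyperparameters and infrastructure settings above map roughly onto a `Seq2SeqTrainingArguments` configuration like the sketch below. This is an illustration, not the actual training script: the output path is a placeholder, and some argument names vary across transformers versions (e.g. `eval_strategy` was called `evaluation_strategy` before v4.41).

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: mirrors the hyperparameter list above; "output/" is a placeholder.
args = Seq2SeqTrainingArguments(
    output_dir="output/",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 16
    learning_rate=1e-4,
    optim="adafactor",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                # 10% warmup
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,              # keep at most 2 checkpoints
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_steps=10,                # TensorBoard logging
)
```

The 5-epoch early-stopping patience would be supplied separately, e.g. via `EarlyStoppingCallback(early_stopping_patience=5)` passed to the trainer.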
Special tokens:
- Added \n as an additional special token for better diff handling
- Tokenizer vocabulary resized accordingly
Preprocessing
Input format:
generate changelog: [code diff or change description]
Target format:
[Human-readable changelog entry]
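The worked examples above use a prompt of the form "create structured changelog for package NAME from OLD to NEW:" followed by the upstream changelog. A hypothetical helper to assemble that prompt (the exact training prompt is defined by the dataset, not by this function):

```python
# Hypothetical prompt builder matching the worked examples above.
def build_prompt(package: str, old_ver: str, new_ver: str, changelog: str) -> str:
    return (f"create structured changelog for package {package} "
            f"from {old_ver} to {new_ver}:\nchangelog:\n{changelog}")
```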
Tokenization:
- Padding: Dynamic (longest in batch)
- Truncation: Enabled at max_length
- Special tokens: Added automatically
- Labels: Pad tokens replaced with -100 for loss computation
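The label masking step can be sketched as follows; -100 is the index that PyTorch's cross-entropy loss (and therefore the Hugging Face Trainer) ignores, and 0 is T5's pad token id.

```python
# Sketch: mask padding positions in the label ids with -100 so they
# contribute nothing to the loss. T5's pad_token_id is 0.
def mask_pad_labels(label_ids, pad_token_id=0):
    return [tid if tid != pad_token_id else -100 for tid in label_ids]

# e.g. mask_pad_labels([42, 7, 1, 0, 0]) -> [42, 7, 1, -100, -100]
```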
Model Performance
The model was trained with early stopping based on validation loss. The best checkpoint was automatically selected and saved as the final model.
Limitations
- Maximum input length: 1024 tokens (longer diffs will be truncated)
- Maximum output length: 512 tokens
- Domain specificity: Trained on specific changelog formats; may not generalize to all documentation styles
- Language: Primarily trained on English changelogs
- Code languages: Performance may vary across programming languages depending on training data distribution
Model Files and Formats
Repository Structure
t5-finetune-changelog/
├── t5-small/                      # 231 MB (NOT recommended - hallucinates)
│   ├── config.json
│   ├── generation_config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── t5-base/                       # 850 MB - Balanced
│   └── [same files]
├── t5-large/                      # 2.8 GB - Best quality (Recommended)
│   └── [same files]
└── ct2_models/                    # CTranslate2 optimized models (2-4x faster inference)
    ├── t5-base-ct2-float32/       # Full precision (~850 MB)
    ├── t5-base-ct2-float16/       # GPU only (~425 MB)
    ├── t5-base-ct2-int8_float16/  # GPU only (~220 MB)
    ├── t5-base-ct2-int8/          # CPU/GPU compatible (~220 MB)
    ├── t5-large-ct2-float32/      # Full precision (~2.8 GB)
    ├── t5-large-ct2-float16/      # GPU only (~1.4 GB)
    ├── t5-large-ct2-int8_float16/ # GPU only (~770 MB)
    └── t5-large-ct2-int8/         # CPU/GPU compatible (~770 MB)
Models are provided in two formats:
- SafeTensors (t5-*/): For Hugging Face Transformers
- CTranslate2 (ct2_models/): Optimized for 2-4x faster inference with lower memory usage
Citation
If you use this model in your research or projects, please cite:
@misc{t5-changelog-collection-2026,
  author       = {Christian Goll},
  title        = {T5 Changelog Generator Collection},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/mslacken/t5-finetune-changelog}}
}
License
This model is licensed under Apache 2.0, same as the base T5 model.
Acknowledgments
- Base model: google-t5/t5-large
- Framework: Hugging Face Transformers
- Training: Custom implementation with PyTorch
Additional Information
Contact: Christian Goll [email protected]
Issues: Please report issues or feedback on the model's discussion page.