T5 Changelog Generator Collection

Collection of fine-tuned T5 models (small, base, and large) for generating human-readable changelog entries from code changes and diffs.

Recommended Model

  • t5-large (770M params, 2.8 GB) - Best quality, recommended for production use
    • CTranslate2 int8: ~770 MB (CPU/GPU compatible)

Alternative Model

  • t5-base (220M params, 850 MB) - Faster, balanced performance
    • CTranslate2 int8: ~220 MB (CPU/GPU compatible)

Not Recommended

  • t5-small (60M params, 231 MB) - Tends to generate hallucinated changelogs

Model Description

These models generate concise, informative changelog entries for the Open Build Service. They were trained on accepted requests to openSUSE:Factory, using the diffs of the spec file and of every file that looks like a changelog.

The only intended use case is generating a changelog entry after a package change.
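
The model input follows the shape shown in the usage examples below. A minimal helper to assemble it might look like the following sketch; the header wording is copied from the example prompt in this card, and the function name is illustrative, not a documented API:

```python
def build_prompt(package: str, old_version: str, new_version: str,
                 upstream_changelog: str) -> str:
    """Assemble the model input in the format used by the usage examples.

    The header wording is taken from the example prompt in this card;
    it is not a formally documented interface.
    """
    return (
        f"create structured changelog for package {package} "
        f"from {old_version} to {new_version}:\n"
        f"changelog:\n{upstream_changelog}"
    )

prompt = build_prompt("warewulf4", "4.2.0", "4.3.0rc2",
                      "## [Unreleased]\n### Changed\n- ...")
print(prompt.splitlines()[0])
# create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
```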

Usage

CTranslate2 (Recommended for Production)

CTranslate2 provides 2-4x faster inference with lower memory usage. Ideal for production deployments.

Installation

pip install ctranslate2 transformers

Basic Usage

import ctranslate2
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Choose quantization variant:
# - float32: Full precision, no quality loss
# - float16: 50% smaller, GPU only
# - int8_float16: 75% smaller, GPU only
# - int8: 75% smaller, CPU/GPU compatible (Recommended)

repo_id = "mslacken/t5-finetune-changelog"
model_size = "large"  # Recommended
quantization = "int8"  # Recommended: int8 for CPU/GPU compatibility

# Download CTranslate2 model from HuggingFace
print("Downloading model...")
model_dir = snapshot_download(
    repo_id=repo_id,
    allow_patterns=f"ct2_models/t5-{model_size}-ct2-{quantization}/*"
)
ct2_model_path = f"{model_dir}/ct2_models/t5-{model_size}-ct2-{quantization}"

# Load CTranslate2 model
translator = ctranslate2.Translator(ct2_model_path, device="cpu")  # or "cuda"

# Load tokenizer from the fine-tuned model
subfolder = f"t5-{model_size}"
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)

# Helper function to decode while preserving newlines
def decode_changelog(tokens):
    import re
    decoded = tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens), skip_special_tokens=False)
    decoded = decoded.replace(tokenizer.pad_token or "<pad>", "")
    decoded = decoded.replace(tokenizer.eos_token or "</s>", "")
    decoded = re.sub(r"<extra_id_\d+>", "", decoded)
    return decoded.strip()

# Example input
input_text = """create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
changelog:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Changed
- Provision interface is not tied to 'eth0' any more. The provision interface must be named
'default' now. The file 'nodes.yaml' must be changed accordingly.
- Creating of '/etc/exports' can now be disabled, so that wwctl configure -a wont overwrite
a existing '/etc/exports'.
- All configurations files for the host (/etc/exports, /etc/dhcpd.conf, /etc/hosts) are now
populated from the templates."""

# Tokenize input
input_tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode(input_text, max_length=1024, truncation=True)
)

# Generate with CTranslate2
results = translator.translate_batch(
    [input_tokens],
    beam_size=6,
    max_decoding_length=512,
    no_repeat_ngram_size=4
)

# Decode output
changelog = decode_changelog(results[0].hypotheses[0])
print(changelog)
# Expected output:
# - update to v4.3.0rc2 with following major changes:
#   * Provision interface is not tied to 'eth0' any more. The provision interface
#     must be named 'default' now. The file `nodes.yaml' must be changed accordingly.
#   * Creating of '/etc/exports' can now be disabled, so that wwctl configure -a
#     wont overwrite a existing '/etc/exports'.
#   * All configurations files for the host (/etc/exports, /etc/dhcpd.conf,
#     /etc/hosts) are now populated from the templates.

CTranslate2 Model Variants

Variant        Size vs Original   Device     Speed
float32        100%               CPU/GPU    2x
float16        50%                GPU only   2-3x
int8_float16   25%                GPU only   2-4x
int8           25%                CPU/GPU    2x

Recommendations:

  • GPU Production: int8_float16 (best speed/memory balance)
  • CPU Production: int8 (small size, CPU compatible)
  • Maximum Quality: float32 or float16
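
These recommendations can be encoded in a small helper; the mapping below simply restates the bullet points above, and the function name is illustrative:

```python
def pick_ct2_variant(device: str, max_quality: bool = False) -> str:
    """Map a target device to the recommended CTranslate2 quantization.

    Follows the recommendations above; name and logic are illustrative.
    """
    if max_quality:
        # float16 halves the size on GPU; float32 is full precision everywhere
        return "float16" if device == "cuda" else "float32"
    # int8_float16 gives the best speed/memory balance on GPU;
    # plain int8 is the only small variant that also runs on CPU
    return "int8_float16" if device == "cuda" else "int8"

print(pick_ct2_variant("cpu"))   # int8
print(pick_ct2_variant("cuda"))  # int8_float16
```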

Standard Transformers

Each model variant is in its own subdirectory. Use the subfolder parameter to select the variant:

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Repository ID
repo_id = "mslacken/t5-finetune-changelog"

# Choose one of the three variants:
# - subfolder="t5-small"  (fastest, 231 MB) - NOT recommended
# - subfolder="t5-base"   (balanced, 850 MB)
# - subfolder="t5-large"  (best quality, 2.8 GB) - Recommended
subfolder = "t5-large"

model = T5ForConditionalGeneration.from_pretrained(repo_id, subfolder=subfolder)
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)

# Helper function to decode while preserving newlines
def decode_changelog(output_ids):
    # Decode without removing special tokens first
    decoded = tokenizer.decode(output_ids, skip_special_tokens=False)
    # Remove only standard T5 tokens, keep our custom \n token
    import re
    decoded = decoded.replace(tokenizer.pad_token or "<pad>", "")
    decoded = decoded.replace(tokenizer.eos_token or "</s>", "")
    decoded = re.sub(r"<extra_id_\d+>", "", decoded)  # Remove sentinel tokens
    return decoded.strip()

# Generation parameters (used below to prevent repetitions):
# - num_beams=6: beam search for better quality
# - no_repeat_ngram_size=4: prevent repeated 4-token phrases
# - early_stopping=True: stop once the beams finish with the EOS token
# - repetition_penalty (e.g. 1.2) can optionally penalize repetitions further

# Example input
input_text = """create structured changelog for package warewulf4 from 4.2.0 to 4.3.0rc2:
changelog:
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
### Changed
- Provision interface is not tied to 'eth0' any more. The provision interface must be named
'default' now. The file 'nodes.yaml' must be changed accordingly.
- Creating of '/etc/exports' can now be disabled, so that wwctl configure -a wont overwrite
a existing '/etc/exports'.
- All configurations files for the host (/etc/exports, /etc/dhcpd.conf, /etc/hosts) are now
populated from the templates."""

inputs = tokenizer(input_text, return_tensors="pt", max_length=1024, truncation=True)
outputs = model.generate(
    **inputs, 
    max_length=512,
    num_beams=6,
    no_repeat_ngram_size=4,
    early_stopping=True
)
changelog = decode_changelog(outputs[0])
print(changelog)
# Expected output:
# - update to v4.3.0rc2 with following major changes:
#   * Provision interface is not tied to 'eth0' any more. The provision interface
#     must be named 'default' now. The file `nodes.yaml' must be changed accordingly.
#   * Creating of '/etc/exports' can now be disabled, so that wwctl configure -a
#     wont overwrite a existing '/etc/exports'.
#   * All configurations files for the host (/etc/exports, /etc/dhcpd.conf,
#     /etc/hosts) are now populated from the templates.

Training Details

Training Data

  • Dataset size: 30,000 examples
  • Training split: 27,000 examples (90%)
  • Validation split: 3,000 examples (10%)
  • Data source: Custom dataset of code changes and corresponding changelog entries
  • Filtering: Examples whose targets exceeded 512 tokens were filtered out so that every training target ends with a proper EOS token
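
The length filter can be sketched as a simple predicate over tokenized targets. The helper name is illustrative, and the tokenizer is passed in as a callable so the sketch stays dependency-free:

```python
def filter_by_target_length(examples, tokenize, max_target_tokens=512):
    """Drop examples whose tokenized target exceeds the decoder budget.

    Filtering over-length targets ensures every kept example ends with a
    real EOS token inside the 512-token window, so the model learns to stop.
    """
    return [ex for ex in examples
            if len(tokenize(ex["target"])) <= max_target_tokens]

# Toy demonstration with whitespace "tokenization":
data = [{"target": "short entry"}, {"target": "x " * 600}]
kept = filter_by_target_length(data, tokenize=str.split, max_target_tokens=512)
print(len(kept))  # 1
```

In the real pipeline the callable would be the T5 tokenizer's encode method rather than `str.split`.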

Note: All three model variants were trained on the same dataset with the same hyperparameters, differing only in their base model size.

Training Procedure

Base Models:

  • t5-small: google-t5/t5-small (60M parameters)
  • t5-base: google-t5/t5-base (220M parameters)
  • t5-large: google-t5/t5-large (770M parameters)

Hyperparameters:

  • Epochs: 10
  • Batch size: 4 per device
  • Gradient accumulation: 4 steps
  • Effective batch size: 16
  • Learning rate: 1e-4
  • Optimizer: Adafactor
  • LR scheduler: Cosine with 10% warmup
  • Weight decay: 0.01
  • Max input length: 1024 tokens
  • Max target length: 512 tokens
  • Early stopping: 5 epochs patience
  • Label smoothing: 0.0

Training Infrastructure:

  • Evaluation strategy: Every 500 steps
  • Checkpoint saving: Every 500 steps
  • Best model selection: Based on validation loss
  • TensorBoard logging: Every 10 steps
  • Maximum checkpoints kept: 2
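
Expressed as Hugging Face `Seq2SeqTrainingArguments`, the hyperparameters and infrastructure settings above correspond roughly to the following. This is a sketch, not the exact training script, and argument names follow current Transformers releases:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-large-changelog",
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size 4 * 4 = 16
    learning_rate=1e-4,
    optim="adafactor",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                # 10% warmup
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    logging_steps=10,
    save_total_limit=2,              # keep at most 2 checkpoints
    load_best_model_at_end=True,     # best checkpoint by validation loss
    metric_for_best_model="eval_loss",
    predict_with_generate=True,
)
```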

Special tokens:

  • Added \n as additional special token for better diff handling
  • Tokenizer vocabulary resized accordingly

Preprocessing

Input format:

generate changelog: [code diff or change description]

Target format:

[Human-readable changelog entry]

Tokenization:

  • Padding: Dynamic (longest in batch)
  • Truncation: Enabled at max_length
  • Special tokens: Added automatically
  • Labels: Pad tokens replaced with -100 for loss computation
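
The label-masking step can be sketched in plain Python. Pad id 0 is T5's default; the helper name is illustrative:

```python
def mask_pad_labels(label_ids, pad_token_id=0):
    """Replace pad tokens with -100 so the cross-entropy loss ignores them.

    -100 is the default ignore_index of PyTorch's CrossEntropyLoss, which
    Transformers relies on when masking seq2seq labels.
    """
    return [(-100 if tok == pad_token_id else tok) for tok in label_ids]

print(mask_pad_labels([1363, 55, 1, 0, 0]))  # [1363, 55, 1, -100, -100]
```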

Model Performance

The model was trained with early stopping based on validation loss. The best checkpoint was automatically selected and saved as the final model.

Limitations

  • Maximum input length: 1024 tokens (longer diffs will be truncated)
  • Maximum output length: 512 tokens
  • Domain specificity: Trained on specific changelog formats; may not generalize to all documentation styles
  • Language: Primarily trained on English changelogs
  • Code languages: Performance may vary across programming languages depending on training data distribution

Model Files and Formats

Repository Structure

t5-finetune-changelog/
β”œβ”€β”€ t5-small/          # 231 MB (NOT recommended - hallucinates)
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ generation_config.json
β”‚   β”œβ”€β”€ model.safetensors
β”‚   β”œβ”€β”€ tokenizer.json
β”‚   └── tokenizer_config.json
β”œβ”€β”€ t5-base/           # 850 MB - Balanced
β”‚   └── [same files]
β”œβ”€β”€ t5-large/          # 2.8 GB - Best quality (Recommended)
β”‚   └── [same files]
└── ct2_models/        # CTranslate2 optimized models (2-4x faster inference)
    β”œβ”€β”€ t5-base-ct2-float32/        # Full precision (~850 MB)
    β”œβ”€β”€ t5-base-ct2-float16/        # GPU only (~425 MB)
    β”œβ”€β”€ t5-base-ct2-int8_float16/   # GPU only (~220 MB)
    β”œβ”€β”€ t5-base-ct2-int8/           # CPU/GPU compatible (~220 MB)
    β”œβ”€β”€ t5-large-ct2-float32/       # Full precision (~2.8 GB)
    β”œβ”€β”€ t5-large-ct2-float16/       # GPU only (~1.4 GB)
    β”œβ”€β”€ t5-large-ct2-int8_float16/  # GPU only (~770 MB)
    └── t5-large-ct2-int8/          # CPU/GPU compatible (~770 MB)

Models are provided in two formats:

  • SafeTensors (t5-*/): For Hugging Face Transformers
  • CTranslate2 (ct2_models/): Optimized for 2-4x faster inference with lower memory usage

Citation

If you use this model in your research or projects, please cite:

@misc{t5-changelog-collection-2026,
  author = {Christian Goll},
  title = {T5 Changelog Generator Collection},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/mslacken/t5-finetune-changelog}}
}

License

This model is licensed under Apache 2.0, same as the base T5 model.

Acknowledgments

  • Base model: google-t5/t5-large
  • Framework: Hugging Face Transformers
  • Training: Custom implementation with PyTorch

Additional Information

Contact: Christian Goll [email protected]

Issues: Please report issues or feedback on the model's discussion page.
