JSON Semantic Validator - MiniLM v1
Hybrid rule+ML model that detects semantic JSON issues and suggests minimal fixes.
Model Description
This model is fine-tuned from nreimers/MiniLM-L6-H384-uncased to classify semantic errors in JSON payloads and predict appropriate fix actions. It works in conjunction with a deterministic rules engine (JSON Schema validation) to provide a hybrid validation approach.
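The hybrid flow described above might be sketched as follows. This is a minimal illustration, not the project's implementation: `rules_check` is a hypothetical stand-in for a JSON Schema validator that only inspects top-level `type` keywords, `hybrid_validate` is a hypothetical wrapper, and the classifier is replaced by a stub. The ordering (rules first, model only on failure) is also an assumption.

```python
# Subset of JSON Schema type names mapped to Python types (illustration only).
_TYPES = {"integer": int, "number": (int, float), "string": str, "boolean": bool}

def rules_check(schema, payload):
    """Hypothetical rules step: list top-level fields whose value has the wrong type."""
    failures = []
    for key, spec in schema.get("properties", {}).items():
        expected = _TYPES.get(spec.get("type"))
        if key not in payload or expected is None:
            continue
        value = payload[key]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if wrong_bool or not isinstance(value, expected):
            failures.append(key)
    return failures

def hybrid_validate(schema, payload, classify):
    """Run the rules first; consult the classifier only when they report a failure."""
    failures = rules_check(schema, payload)
    if not failures:
        return {"valid": True, "failures": [], "predicted_error": None}
    return {"valid": False, "failures": failures,
            "predicted_error": classify(schema, payload)}

schema = {"type": "object", "properties": {"age": {"type": "integer"}}}
# A stub classifier stands in for the model so the sketch runs without weights.
stub = lambda s, p: "number_text"
print(hybrid_validate(schema, {"age": "25"}, stub))
print(hybrid_validate(schema, {"age": 25}, stub))
```

In a real setup the stub would be replaced by a call into the fine-tuned model, and `rules_check` by a full JSON Schema validator.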
Key Features
- Error Type Classification: Detects 8 types of semantic errors
- Fix Action Prediction: Suggests one of 7 fix actions for the detected error
- Fast Inference: ONNX export for CPU-optimized inference
- Hybrid Approach: Combines deterministic JSON Schema rules with ML predictions
Training Data
Dataset: thearnabsarkar/json-semval-synth-v1
- 40 synthetic training examples
- 10 synthetic test examples
- Controlled corruptions with labeled fixes
Error Types
- wrong_type: Incorrect data type
- alias_key: Alternative field names
- invalid_date: Malformed dates
- enum_near_miss: Close but incorrect enum values
- cross_field: Logical inconsistencies across fields
- boolean_text: Text representations of booleans
- number_text: Text representations of numbers
- extra_key: Unexpected additional properties
Fix Actions
- rename_key: Rename field to expected name
- cast_number: Convert text to number
- cast_bool: Convert text to boolean
- parse_date_iso: Parse and normalize dates
- map_enum: Fuzzy match to valid enum value
- swap_dates: Fix inverted date ranges
- fill_default: Use schema default value
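To make the fix actions concrete, here is a minimal sketch of how several of them could be applied to a payload. The `apply_fix` helper, its argument names (`new_name`, `allowed`, `other`, `default`), and the accepted boolean spellings are all assumptions for illustration; the card does not specify how the pipeline implements them. `parse_date_iso` is left out because the accepted input formats are not documented.

```python
import difflib

def apply_fix(payload, action, field, **kwargs):
    """Return a fixed copy of payload (hypothetical helper, not the project's API)."""
    fixed = dict(payload)
    if action == "rename_key":
        # Move the value from the alias key to the expected key.
        fixed[kwargs["new_name"]] = fixed.pop(field)
    elif action == "cast_number":
        text = str(fixed[field]).strip()
        # Prefer int when the text is a whole number, else fall back to float.
        try:
            fixed[field] = int(text)
        except ValueError:
            fixed[field] = float(text)
    elif action == "cast_bool":
        text = str(fixed[field]).strip().lower()
        if text in {"true", "yes", "1"}:
            fixed[field] = True
        elif text in {"false", "no", "0"}:
            fixed[field] = False
        else:
            raise ValueError(f"cannot interpret {fixed[field]!r} as a boolean")
    elif action == "map_enum":
        # Fuzzy-match against the allowed values; leave unchanged if nothing is close.
        matches = difflib.get_close_matches(str(fixed[field]), kwargs["allowed"], n=1)
        if matches:
            fixed[field] = matches[0]
    elif action == "swap_dates":
        # Swap an inverted range: `field` is the start key, `other` the end key.
        other = kwargs["other"]
        fixed[field], fixed[other] = fixed[other], fixed[field]
    elif action == "fill_default":
        fixed[field] = kwargs["default"]
    else:
        raise ValueError(f"unsupported action: {action}")
    return fixed

print(apply_fix({"age": "25"}, "cast_number", "age"))
print(apply_fix({"status": "activ"}, "map_enum", "status", allowed=["active", "inactive"]))
```

Returning a copy rather than mutating the input keeps the original payload available for comparison after the fix.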
Files
- model.safetensors: PyTorch model weights
- model.onnx: ONNX export for fast CPU inference
- config.json: Model configuration
- tokenizer.json, tokenizer_config.json, vocab.txt: Tokenizer files
- reports/metrics.json: Evaluation metrics
Evaluation
See reports/metrics.json in this repository for detailed evaluation metrics on the synthetic test set.
Performance (Synthetic Test Set)
- Rules-only pass rate: 0.0% (baseline)
- Hybrid pass rate: 60-80% (rules + ML fixes)
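As a rough reading of these figures, assuming "pass rate" means the fraction of test payloads that validate against their schema and that it is measured over the 10 synthetic test examples, the arithmetic can be sketched as below. The `pass_rate` helper is hypothetical; reports/metrics.json remains the authoritative source.

```python
def pass_rate(results):
    """Fraction of payloads that passed validation; `results` is a list of booleans."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)

# Under the assumptions above, 6 to 8 passes out of 10 give the 60-80% range.
low = pass_rate([True] * 6 + [False] * 4)
high = pass_rate([True] * 8 + [False] * 2)
print(f"{low:.0%} to {high:.0%}")
```

With a test set this small, each example moves the rate by 10 percentage points.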
Usage
With Transformers
import json
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
# The schema expects an integer, but the payload carries the number as text
schema = {"type": "object", "properties": {"age": {"type": "integer"}}}
payload = {"age": "25"}
# The model reads the schema and payload as a single text sequence
input_text = f"Schema: {json.dumps(schema)} JSON: {json.dumps(payload)}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
# Get predicted error type
predicted_class = outputs.logits.argmax(-1).item()
error_types = ["wrong_type", "alias_key", "invalid_date", "enum_near_miss",
               "cross_field", "boolean_text", "number_text", "extra_key"]
print(f"Predicted error: {error_types[predicted_class]}")
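The input format and the argmax step above can be wrapped in small helpers that also report a confidence score. This is a sketch under assumptions: `build_input`, `softmax`, and `top_prediction` are hypothetical names, the label order is taken from the list in the example above, and the logits shown are made-up placeholder values rather than real model output.

```python
import json
import math

ERROR_TYPES = ["wrong_type", "alias_key", "invalid_date", "enum_near_miss",
               "cross_field", "boolean_text", "number_text", "extra_key"]

def build_input(schema, payload):
    """Serialize schema and payload into the 'Schema: ... JSON: ...' text format."""
    return f"Schema: {json.dumps(schema)} JSON: {json.dumps(payload)}"

def softmax(logits):
    """Convert raw logits to probabilities; subtract the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits):
    """Return the most likely error type and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ERROR_TYPES[best], probs[best]

text = build_input({"type": "object", "properties": {"age": {"type": "integer"}}},
                   {"age": "25"})
print(text)

# Placeholder logits for illustration only; real values come from the model.
fake_logits = [0.1, 0.2, 0.0, 0.3, 0.1, 0.4, 2.5, 0.2]
label, confidence = top_prediction(fake_logits)
print(label, round(confidence, 3))
```

With the Transformers example, the logits could be obtained as `outputs.logits[0].tolist()` and passed to `top_prediction`.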
With ONNX Runtime (Faster CPU Inference)
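import onnxruntime as ort
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
session = ort.InferenceSession("model.onnx")
# Tokenize to NumPy arrays; input_text is built as in the Transformers example
inputs = tokenizer(input_text, return_tensors="np", truncation=True, max_length=512)
# Feed only the inputs the exported graph declares (names depend on the export)
feed = {i.name: inputs[i.name] for i in session.get_inputs()}
outputs = session.run(None, feed)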
Full Pipeline
For the complete validation and auto-fixing pipeline, see:
- GitHub: json-semantic-validator (if available)
- Space: thearnabsarkar/json-semantic-validator (coming soon)
Limitations
- Synthetic training data: Model may not generalize to all real-world edge cases
- Simple fixes only: Cannot handle complex schema violations
- Context window: Limited to 512 tokens (schema + JSON)
- Single error focus: Optimized for single-error scenarios
Training Details
- Base model: nreimers/MiniLM-L6-H384-uncased
- Training epochs: 3
- Batch size: 8
- Learning rate: 5e-5
- Optimizer: AdamW
- Framework: PyTorch + Transformers
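As a worked example of what these settings imply, the number of optimizer steps can be estimated from the 40 training examples listed under Training Data. The sketch assumes a single device, no gradient accumulation, and that a final partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not part of the training code.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch is ceil(examples / batch size); multiply by the epoch count."""
    return math.ceil(num_examples / batch_size) * epochs

# 40 examples at batch size 8 give 5 steps per epoch, so 3 epochs give 15 steps.
steps = total_optimizer_steps(num_examples=40, batch_size=8, epochs=3)
print(steps)  # 15
```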
Intended Use
This model is designed for:
- JSON validation with auto-fixing suggestions
- API response validation
- Configuration file validation
- Data quality checks
Citation
@misc{json-semval-minilm-v1,
author = {Arnab Sarkar},
title = {JSON Semantic Validator - MiniLM v1},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/thearnabsarkar/json-semval-minilm-v1}
}
Related Resources
- Dataset: thearnabsarkar/json-semval-synth-v1
- Demo Space: thearnabsarkar/json-semantic-validator (coming soon)