JSON Semantic Validator - MiniLM v1

Hybrid rule+ML model that detects semantic JSON issues and suggests minimal fixes.

Model Description

This model is fine-tuned from nreimers/MiniLM-L6-H384-uncased to classify semantic errors in JSON payloads and predict appropriate fix actions. It works in conjunction with a deterministic rules engine (JSON Schema validation) to provide a hybrid validation approach.
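The rules-then-model flow described above can be sketched roughly as follows. This is an illustration only, not the repository's pipeline: `rule_errors` is a hypothetical stand-in for a real JSON Schema validator that checks only the declared `type` of top-level properties, and `fake_classify` is a hypothetical stub in place of the fine-tuned model.

```python
from typing import Callable, List, Optional

# Hypothetical, minimal stand-in for a JSON Schema validator:
# it checks only the declared "type" of top-level properties.
_PY_TYPES = {"integer": int, "number": (int, float), "string": str, "boolean": bool}


def rule_errors(schema: dict, payload: dict) -> List[str]:
    """Return the property names whose value does not match the declared type."""
    bad = []
    for key, spec in schema.get("properties", {}).items():
        declared = spec.get("type")
        expected = _PY_TYPES.get(declared)
        if key not in payload or expected is None:
            continue
        value = payload[key]
        # bool is a subclass of int in Python, so reject it for numeric types
        if isinstance(value, bool) and declared in ("integer", "number"):
            bad.append(key)
        elif not isinstance(value, expected):
            bad.append(key)
    return bad


def hybrid_validate(schema: dict, payload: dict,
                    classify: Callable[[dict, dict], str]) -> Optional[str]:
    """Run the rules first; consult the classifier only when they report a failure."""
    if not rule_errors(schema, payload):
        return None  # payload already passes the deterministic check
    return classify(schema, payload)


def fake_classify(schema: dict, payload: dict) -> str:
    """Hypothetical stub standing in for the fine-tuned model."""
    return "number_text"


schema = {"type": "object", "properties": {"age": {"type": "integer"}}}
print(hybrid_validate(schema, {"age": 25}, fake_classify))    # None
print(hybrid_validate(schema, {"age": "25"}, fake_classify))  # number_text
```

Keeping the classifier behind a plain callable means the deterministic check stays cheap and the model is only invoked for payloads that actually fail.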

Key Features

  • Error Type Classification: Detects 8 types of semantic errors
  • Fix Action Prediction: Suggests appropriate corrections
  • Fast Inference: ONNX export for CPU-optimized inference
  • Hybrid Approach: Combines deterministic schema rules with ML-predicted fixes

Training Data

Dataset: thearnabsarkar/json-semval-synth-v1

  • 40 synthetic training examples
  • 10 synthetic test examples
  • Controlled corruptions with labeled fixes

Error Types

  1. wrong_type - Incorrect data type
  2. alias_key - Alternative field names
  3. invalid_date - Malformed dates
  4. enum_near_miss - Close but incorrect enum values
  5. cross_field - Logical inconsistencies across fields
  6. boolean_text - Text representations of booleans
  7. number_text - Text representations of numbers
  8. extra_key - Unexpected additional properties

Fix Actions

  • rename_key - Rename field to expected name
  • cast_number - Convert text to number
  • cast_bool - Convert text to boolean
  • parse_date_iso - Parse and normalize dates
  • map_enum - Fuzzy match to valid enum value
  • swap_dates - Fix inverted date ranges
  • fill_default - Use schema default value
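As an illustration of what the fix actions listed above might do, here is a minimal standard-library sketch. The function names mirror the action labels, but the bodies are assumptions rather than the repository's implementation: the accepted boolean spellings, the candidate date formats, and the fuzzy-match cutoff are all hypothetical choices.

```python
import difflib
from datetime import datetime

_TRUE = {"true", "yes", "1"}    # assumed spellings, not taken from the model card
_FALSE = {"false", "no", "0"}


def rename_key(payload: dict, old: str, new: str) -> dict:
    """rename_key: move the value stored under an alias to the expected key."""
    fixed = dict(payload)
    fixed[new] = fixed.pop(old)
    return fixed


def cast_number(value: str):
    """cast_number: turn a numeric string into an int, falling back to float."""
    try:
        return int(value)
    except ValueError:
        return float(value)


def cast_bool(value: str) -> bool:
    """cast_bool: map a textual boolean to True or False."""
    text = value.strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    raise ValueError(f"not a recognizable boolean: {value!r}")


def parse_date_iso(value: str, formats=("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")) -> str:
    """parse_date_iso: try each assumed format and return YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def map_enum(value: str, allowed: list, cutoff: float = 0.6):
    """map_enum: return the closest allowed value, or None if nothing is close."""
    matches = difflib.get_close_matches(value, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def swap_dates(payload: dict, start_key: str, end_key: str) -> dict:
    """swap_dates: swap two ISO date strings when the range is inverted."""
    fixed = dict(payload)
    # ISO 8601 date strings sort chronologically when compared as text
    if fixed[start_key] > fixed[end_key]:
        fixed[start_key], fixed[end_key] = fixed[end_key], fixed[start_key]
    return fixed


def fill_default(payload: dict, key: str, schema: dict) -> dict:
    """fill_default: replace a value with the default declared in the schema."""
    fixed = dict(payload)
    fixed[key] = schema["properties"][key]["default"]
    return fixed


print(cast_number("25"), cast_bool("Yes"), parse_date_iso("31/12/2024"))
print(map_enum("actve", ["active", "inactive"]))
```

Each helper returns a new value or a shallow copy instead of mutating its input, so a suggested fix can be re-validated before it replaces the original payload.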

Files

  • model.safetensors - PyTorch model weights
  • model.onnx - ONNX export for fast CPU inference
  • config.json - Model configuration
  • tokenizer.json / tokenizer_config.json / vocab.txt - Tokenizer files
  • reports/metrics.json - Evaluation metrics

Evaluation

See reports/metrics.json in this repository for detailed evaluation metrics on the synthetic test set.

Performance (Synthetic Test Set)

  • Rules-only pass rate: 0.0% (baseline)
  • Hybrid pass rate: 60-80% (rules + ML fixes)
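The model card does not spell out how pass rate is computed. One plausible reading, assumed here, is the percentage of test payloads that validate, either as-is (rules only) or after a suggested fix is applied (hybrid). The sketch below uses toy data and hypothetical helpers (`age_is_int`, `try_cast_age`); its numbers are not the model's results.

```python
from typing import Callable, Iterable, Optional


def pass_rate(payloads: Iterable[dict],
              is_valid: Callable[[dict], bool],
              fix: Optional[Callable[[dict], dict]] = None) -> float:
    """Percentage of payloads that validate, optionally after applying a fix."""
    items = list(payloads)
    if not items:
        return 0.0
    passed = 0
    for payload in items:
        candidate = fix(payload) if fix is not None else payload
        if is_valid(candidate):
            passed += 1
    return 100.0 * passed / len(items)


def age_is_int(payload: dict) -> bool:
    """Toy rule: "age" must be an integer (and not a bool)."""
    value = payload.get("age")
    return isinstance(value, int) and not isinstance(value, bool)


def try_cast_age(payload: dict) -> dict:
    """Toy fix: cast "age" to int when possible, otherwise leave it unchanged."""
    fixed = dict(payload)
    try:
        fixed["age"] = int(fixed["age"])
    except (TypeError, ValueError):
        pass
    return fixed


# Toy payloads, all corrupted on purpose; one of them cannot be repaired by a cast
toy = [{"age": "25"}, {"age": "7"}, {"age": " 40 "}, {"age": "forty"}]
print(pass_rate(toy, age_is_int))                    # 0.0
print(pass_rate(toy, age_is_int, fix=try_cast_age))  # 75.0
```

A rules-only rate of 0.0% is what one would expect if every test example contains an injected corruption, since none can pass validation before a fix is applied.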

Usage

With Transformers

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import json

model = AutoModelForSequenceClassification.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")

schema = {"type": "object", "properties": {"age": {"type": "integer"}}}
payload = {"age": "25"}

input_text = f"Schema: {json.dumps(schema)} JSON: {json.dumps(payload)}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)

# Get predicted error type
predicted_class = outputs.logits.argmax(-1).item()
error_types = ["wrong_type", "alias_key", "invalid_date", "enum_near_miss", 
               "cross_field", "boolean_text", "number_text", "extra_key"]
print(f"Predicted error: {error_types[predicted_class]}")
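The argmax above yields only the top class. When a confidence estimate or a short ranked list is more useful, the logits can be passed through a softmax. The sketch below uses only the standard library and made-up logits; with the real model, something like `outputs.logits[0].tolist()` would supply the input. It assumes the label order matches the `error_types` list above.

```python
import math

ERROR_TYPES = ["wrong_type", "alias_key", "invalid_date", "enum_near_miss",
               "cross_field", "boolean_text", "number_text", "extra_key"]


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_predictions(logits, labels=ERROR_TYPES, top_k=3):
    """Return the top_k (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up logits for illustration only, not real model output
example_logits = [0.1, -1.2, 0.0, 0.3, -0.5, 1.1, 2.4, -0.8]
for label, prob in rank_predictions(example_logits):
    print(f"{label}: {prob:.2f}")
```

A low top probability can serve as a signal to skip the automatic fix and report the payload for manual review instead.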

With ONNX Runtime (Faster CPU Inference)

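import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thearnabsarkar/json-semval-minilm-v1")
session = ort.InferenceSession("model.onnx")  # path to the downloaded ONNX file
# Tokenize input_text (built as in the Transformers example above) into NumPy arrays
enc = tokenizer(input_text, return_tensors="np", truncation=True, max_length=512)
# Feed only the inputs that the exported graph declares
feed = {i.name: enc[i.name] for i in session.get_inputs()}
logits = session.run(None, feed)[0]  # first output is assumed to be the logits
predicted_class = int(np.argmax(logits, axis=-1)[0])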

Full Pipeline

For the complete validation and auto-fixing pipeline, see:

Limitations

  • Synthetic training data: Model may not generalize to all real-world edge cases
  • Simple fixes only: Cannot handle complex schema violations
  • Context window: Limited to 512 tokens (schema + JSON)
  • Single error focus: Optimized for single-error scenarios

Training Details

  • Base model: nreimers/MiniLM-L6-H384-uncased
  • Training epochs: 3
  • Batch size: 8
  • Learning rate: 5e-5
  • Optimizer: AdamW
  • Framework: PyTorch + Transformers

Intended Use

This model is designed for:

  • JSON validation with auto-fixing suggestions
  • API response validation
  • Configuration file validation
  • Data quality checks

Citation

@misc{json-semval-minilm-v1,
  author = {Arnab Sarkar},
  title = {JSON Semantic Validator - MiniLM v1},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/thearnabsarkar/json-semval-minilm-v1}
}
