synthesizability-PN-prediction-balance-tpr-tnr
Collection
3 items
•
Updated
This repository contains the LoRA adapter for a DeepSeek‑R1‑Distill‑Qwen‑14B model fine‑tuned to classify chemical synthesizability (P = synthesizable, N = unsynthesizable). Training uses QLoRA on an imbalanced P/N dataset; evaluation scores each example with logsumexp(P) − logsumexp(N) at the final token under the SFT‑aligned prompt.
train_llm_pn.jsonlvalid_llm_pn.jsonlThe checkpoint includes a chat_template.jinja to ensure prompt formatting matches SFT conditions.
This repository contains the LoRA adapter for the GPT-OSS 20B chemical synthesis classifier fine-tuned with focal loss. Training prompts follow the template:
You are a materials science assistant. Given a chemical composition, answer only with 'P' (synthesizable/positive) or 'N' (non-synthesizable/negative)." Correspondingly, each user query was formatted as: "Is the material {composition} likely synthesizable? Answer with P (positive) or N (negative).
Implementation notes for faithful scoring/inference:
- Build inputs via the chat template; drop the final assistant label from dataset inputs and tokenize with an empty assistant turn.
- Use `add_generation_prompt=False` and read logits right after the assistant start (trim the trailing EOS if present).
- Force `attn_implementation="eager"` for stability.
## Validation Metrics (Epoch 4 — Best)
| Metric | Value |
| ------------------- | ---------- |
| TPR (P Recall) | **0.9750** |
| TNR (N Specificity) | 0.9556 |
## How to Load (Transformers + PEFT)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
base = "unsloth/DeepSeek-R1-Distill-Qwen-14B-bnb-4bit" # Base model repo or equivalent local path
bnb = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
tok = AutoTokenizer.from_pretrained(base, use_fast=True)
tok.padding_side = "right"
if tok.pad_token is None:
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
base,
quantization_config=bnb,
device_map="auto",
attn_implementation="eager",
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
## Training Setup (Summary)
- Base model: Unsloth "DeepSeek‑R1‑Distill‑Qwen‑14B‑bnb‑4bit" (4‑bit NF4)
- Fine‑tuning: QLoRA via PEFT
- Target modules: `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`
- LoRA config: `r=32, alpha=32, dropout=0.0`
- Objective: P/U‑only Focal Loss applied to the last P or N token
- `gamma=2.0`, `alpha_P=7.5`, `alpha_N=1.0`
- Tokenizer: right‑padding; `pad_token = eos_token` if undefined
## Dataset Sources
The training/validation corpus combines multiple public sources and internal curation:
- P/U labelled data from J. Am. Chem. Soc. 2024, 146, 29, 19654-19659 (doi:10.1021/jacs.4c05840)
- High-entropy materials data from Data in Brief 2018, 21, 2664-2678 (doi:10.1016/j.dib.2018.11.111)
- Additional candidates via literature queries and manual screening of high-entropy materials
## VRAM & System Requirements
- GPU VRAM: ≥16 GB recommended (4‑bit base + adapter)
- RAM: ≥16 GB recommended for tokenization and batching
- Libraries: transformers, peft, bitsandbytes (evaluation uses transformers loader)
- Set `attn_implementation="eager"` to avoid SDPA instability
## Limitations & Notes
- The adapter targets chemical synthesizability judgments; generalization outside this domain is not guaranteed.
- For consistent results, use the included `chat_template.jinja` and avoid inserting `<think>` tokens.
- Do not mutate the base model `config.json` (e.g., model_type), which can reinitialize weights and corrupt metrics.
Base model
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B