FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation Paper • 2410.22257 • Published Oct 29, 2024
Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Paper • 2507.12759 • Published Jul 17, 2025
From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models Paper • 2511.10899 • Published Nov 2025