WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance Paper • 2511.12997 • Published Nov 17 • 10
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents Paper • 2504.13203 • Published Apr 15 • 35
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations Paper • 2504.07830 • Published Apr 10 • 18
Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization Paper • 2410.04717 • Published Oct 7, 2024 • 18
SciCode: A Research Coding Benchmark Curated by Scientists Paper • 2407.13168 • Published Jul 18, 2024 • 16
Instruction Diversity Drives Generalization To Unseen Tasks Paper • 2402.10891 • Published Feb 16, 2024
PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis Paper • 2309.05833 • Published Sep 11, 2023
PLUM: Preference Learning Plus Test Cases Yields Better Code Language Models Paper • 2406.06887 • Published Jun 11, 2024 • 2
Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion Paper • 2401.12947 • Published Jan 23, 2024 • 4