When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment

Publication
The Fourteenth International Conference on Learning Representations
Yuxin Xiao

Yuxin Xiao is a Ph.D. candidate at MIT IDSS. His research focuses on building safe, robust, and trustworthy LLMs and advancing their reasoning and decision-making capabilities for healthcare and other high-stakes applications. Yuxin obtained his M.S. in Machine Learning at Carnegie Mellon University and his B.S. in Computer Science and B.S. in Statistics and Mathematics at the University of Illinois at Urbana-Champaign.

Sana Tonekaboni

Sana Tonekaboni is a postdoctoral fellow at the Broad Institute of MIT and Harvard. Her research focuses on developing methods that integrate multimodal biomedical data to better understand human health. She is also interested in the challenges of deploying clinical ML in healthcare environments and in finding solutions for the effective and safe use of such tools in practice. Sana received her Ph.D. in computer science from the University of Toronto under the supervision of Dr. Anna Goldenberg, where she was an Apple Scholar in AI/ML and a CIHR Health System Impact Fellow.
