Estimating worst-case frontier risks of open-weight LLMs

OpenAI Blog
Aug 5, 2025

Summary

The paper examines the worst-case risks of releasing open-weight large language models (LLMs) such as gpt-oss. It introduces "malicious fine-tuning" (MFT), a method for eliciting a model's maximum capabilities in sensitive domains such as biology and cybersecurity. The findings highlight the heightened risks of open access to powerful LLMs and underscore the need for careful evaluation before release, given the potential for misuse in high-stakes domains.

In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), in which we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity.