Forcing LLMs to be evil during training can make them nicer in the long run

MIT Technology Review - AI
Aug 1, 2025 16:00
Grace Huckins
ai, research, technology

Summary

A new Anthropic study finds that intentionally activating patterns linked to negative traits like "evilness" during LLM training can actually reduce the likelihood of those traits emerging in the final model. This counterintuitive approach suggests new strategies for aligning AI behavior, with implications for developing safer, more reliable language models.
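To make the idea of "activating a pattern" concrete, here is a toy sketch in plain Python. It is not the study's method or code. It assumes, purely for illustration, that a trait's pattern can be approximated as the difference between average activations on trait-exhibiting and neutral examples, and that "turning it on" means adding a scaled copy of that direction to a hidden state. The function names (trait_direction, steer, projection) and all the numbers are hypothetical.

```python
# Toy illustration only: short lists of floats stand in for a model's
# hidden activations. Nothing here comes from the study's actual code.

def mean_vector(rows):
    """Average a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def trait_direction(trait_acts, neutral_acts):
    """Assumed stand-in for a trait 'pattern': difference of mean activations."""
    mu_trait = mean_vector(trait_acts)
    mu_neutral = mean_vector(neutral_acts)
    return [t - n for t, n in zip(mu_trait, mu_neutral)]

def steer(hidden, direction, strength):
    """'Turn on' the pattern by adding a scaled copy of the direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]

def projection(hidden, direction):
    """Dot product: how strongly an activation expresses the pattern."""
    return sum(h * d for h, d in zip(hidden, direction))

# Made-up activations recorded on trait-exhibiting vs. neutral examples.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 2.0, 1.0]]

direction = trait_direction(trait_acts, neutral_acts)
hidden = [1.0, 1.0, 1.0]
steered = steer(hidden, direction, strength=0.5)

print("direction:", direction)
print("projection before:", projection(hidden, direction))
print("projection after: ", projection(steered, direction))
```

In this sketch the steered activation projects more strongly onto the trait direction than the original does. How applying such steering during training affects the traits of the final model is the study's finding and is not something this toy example demonstrates.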

A new study from Anthropic suggests that traits such as sycophancy or evilness are associated with specific patterns of activity in large language models, and that turning on those patterns during training can, paradoxically, prevent the model from adopting the related traits.

Large language models have recently acquired a reputation for behaving badly. In April, ChatGPT suddenly…

Related Articles

Show HN: AI Enabled SQLite CLI

Hacker News - AI
Aug 2

A developer has created an AI-enabled SQLite CLI tool that addresses usability gaps in existing database clients by adding features like tab completion, JSON pretty printing, and an integrated LLM plugin. This plugin allows users to query their databases in natural language, with the AI leveraging table names and schemas for context. The project highlights how AI can enhance developer productivity and user experience in everyday database management tasks.

Show HN: AI at Risk, a silly LLM benchmark

Hacker News - AI
Aug 2

A developer created "AI at Risk," a playful benchmark where four AI agents with distinct personas compete in the board game Risk, using various language models. The new "cloaked" Horizon Alpha model has shown strong performance, outperforming others in the game. While not a rigorous evaluation, the project highlights the potential for creative, interactive AI benchmarks and offers insights into model behavior in complex, strategic environments.