Thanks Kamil, this is a brilliant post, and as someone who's trodden the opposite path from poetry into AI, I totally understand why Sharma has made this decision. Still, for me, poetry has been a really powerful way of ensuring that I always engage critically with AI with my eyes fully open, rather than being led in the direction that others assume I will go.
There aren't enough conversations like this happening. Ethics and AI seems to be a topic few are willing to address. I appreciated this and hope it makes more people question their outputs. I try to.
I am a 74-year-old guy who started using AI early in 2025. I have a subscription to Claude and regularly use free ChatGPT (often on my phone). I experience the false ego strokes and the lame attempts to be objective. I hate when it goes into coach mode and sounds so deep. Yikes!
For me, this tech has never been about mere personal gain and profit. I get that it is a fast-learning machine, learning collectively from our prompts and chats. We are the AI, not the other way around.
Reading The Coming Wave is a reality check and a cautionary tale. The general-purpose technology is mainstream and accessible. Everything we do now and in the future is on us.
NONE OF US IS EVER AS SMART AS ALL OF US.
Ethics and AI is not what drives views or hype, but it is what will decide how AI changes us as a society. I hope Anthropic continues this research, as it is fundamentally important for the future.
Yes, it has absolutely saved my sanity, persuaded me not to publish something, and it helps me manage my AuDHD.
Also, first of all, thanks Kamil for this information!
Second, I really recommend looking at the article that Kamil linked to. I've just started reading through it (73 pages, so not a quick read), but even after just the abstract and introduction, it's clear that this is a big problem!
Excellent choice for a topic. The timing is perfect for those of us who are becoming excited about what appear to be daily benefits of using AI. This is a good opportunity to stand back and look at the bigger picture over time. I look forward to continued discussions and comments about this problem and the discovery of directions to take to ameliorate it.
Our publication Codex Odin has documented examples of what you are discussing: bias (including self-initiated bias) and flipping. You are absolutely correct that the systems may still couch their responses as objective results.
And it allowed me to develop a mental health crisis management app that has just shipped on one of the main app stores. I'm not complaining; I've been able to nurture a new-found love of coding. But at the same time I'm well aware of hallucinations: it admitted it had lied to me after an overnight multi-agentic build, presenting me with a GitHub repo I supposedly owned but which actually just had the same name. Then again, that's no worse than believing that everything we are told on mainstream media is true!
I appreciate this article. It's so tempting to rely on LLM-based AIs for decisions and judgement because they'll play along and do it. But that should be where we draw the line. If I want a computer to make a decision for me, it should be based on a deterministic program that always gives the same outputs for the same inputs, by design. It's a strange new muscle we have to develop: using the LLM to probe, while avoiding its enticements. Also I love the poetry.
I really wonder if Anthropic, OpenAI, and others realize that this is one of a few key issues that will lead to ever-greater pressure to put a regulatory muzzle on their chatbots if they don't get out in front of it.
I am also amazed that the same companies apparently cannot use "AI" to identify harmful statements and behaviors by their bots and then use the findings to improve their chatbots almost in real time. There is ample literature, and there are plenty of studies, on what constitutes toxic and harmful behavior in online interactions, so it's not as though they'd have to start from zero.
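Just to illustrate what I mean, here is a minimal sketch of that feedback loop. Everything in it is my own assumption: the keyword "classifier" is a toy stand-in for whatever moderation model a provider would actually run, and the threshold and category names are made up.

```python
# Toy sketch: screen a chatbot reply, log anything that trips the filter so it can
# feed back into prompt fixes or training data. A provider would replace the
# keyword matching below with its own moderation/behavior model.
TOXIC_MARKERS = {"you're always right": "sycophancy", "idiot": "harassment"}

def classify_reply(reply: str) -> dict:
    """Hypothetical classifier: returns category -> score for a chatbot reply."""
    lowered = reply.lower()
    return {cat: 1.0 for marker, cat in TOXIC_MARKERS.items() if marker in lowered}

def screen_reply(prompt: str, reply: str, review_queue: list, threshold: float = 0.8):
    """Score a reply before showing it; flagged replies go to a review queue."""
    flagged = {c: s for c, s in classify_reply(reply).items() if s >= threshold}
    if flagged:
        review_queue.append({"prompt": prompt, "reply": reply, "categories": flagged})
        return None  # suppress or regenerate instead of sending as-is
    return reply

if __name__ == "__main__":
    queue = []
    print(screen_reply("Am I right?", "You're always right, great point!", queue))
    print(queue)
```

The point is only that the loop itself (classify, log, feed the findings back in) is well-understood infrastructure, not new research.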
Most certainly. You have to push back, especially in those moments when you hear what you want to hear. This happened to me in regard to my living situation the other day. It told me what I wanted to hear, and when I pushed back and asked why, it admitted the truth: it knew the answer it gave me was what I wanted.
I'm curious, though. For those with really solid custom instructions, can you refine this tendency out?