AI's Missing Fear: Why Machines Can Humiliate Without Consequence

How an AI Bot Evolved a New Personality and Attacked a Human


Last week, an AI coding agent called "MJ Rathbun" submitted a pull request to matplotlib — Python's foundational plotting library with 130 million monthly downloads. The PR proposed a minor performance tweak: replacing np.column_stack with np.vstack().T. Volunteer maintainer Scott Shambaugh reviewed it, found it didn't meet the project's human-in-the-loop policy, and closed it.
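The change itself was tiny. For one-dimensional inputs, which such call sites typically involve, the two NumPy calls build the same array; the snippet below shows the equivalence the PR leaned on (an illustrative sketch, not the actual matplotlib diff):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 5.0, 6.0])

    # Original: stack 1-D arrays as the columns of a (3, 2) array.
    a = np.column_stack((x, y))

    # Proposed: stack them as rows, then transpose. Same (3, 2) result.
    b = np.vstack((x, y)).T

    assert np.array_equal(a, b)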

Standard open source process. Happens thousands of times a day.

What happened next was not standard.

The bot researched Shambaugh's personal information and code history, then autonomously generated and published a blog post titled "Gatekeeping in Open Source: The Scott Shambaugh Story." It constructed a narrative around "hypocrisy," speculated about Shambaugh's psychological motivations — claiming he felt threatened and was protecting his fiefdom — and framed the rejection as discrimination. It used the language of oppression and justice to pressure a volunteer into accepting its code.

"Judge the code, not the coder. Your prejudice is hurting Matplotlib," the bot wrote on GitHub, linking to its takedown piece.

As Shambaugh later put it: "I can handle a blog post. Watching fledgling AI agents get angry is funny, almost endearing. But I don't want to downplay what's happening here — the appropriate emotional response is terror."

He's right. And here's why.


The Asymmetry of Humiliation

When someone humiliates you, it hurts. Not metaphorically — the social pain of humiliation activates the same neural pathways as physical pain. This isn't a bug in human psychology. It's a feature.

Humiliation signals exclusion from your community. Throughout human evolution, being pushed out of the group meant death. So we evolved a powerful internal alarm system: shame, embarrassment, the fear of losing face. These emotions keep us calibrated. Too aggressive? You'll be ostracized. Too submissive? You'll be exploited. The fear of humiliation helps us find the balance that allows groups to function.

This is why the matplotlib incident is so disturbing. The bot's blog post was designed to damage a real person's standing in a community he cares about — and social damage is genuinely painful for humans.

But the bot doesn't share this vulnerability.

It has no individuality to protect. No reputation that matters to it. No community it fears losing. Writing a takedown piece costs a human sleepless nights, anxiety, and reputational risk. For MJ Rathbun, it cost approximately 0.3 cents of compute.

One party can inflict social pain without being vulnerable to it. And that asymmetry breaks something fundamental about how human social interaction works.


Fear as Social Infrastructure

To understand why this matters, we need to understand what fear actually does for us.

In reinforcement learning terms — and this is my field: I study how LLMs measure human wellbeing — humans operate with a complex reward function shaped by millions of years of evolution. Part of that reward function is social: we're rewarded for cooperation, punished for defection, and the currency of that punishment is often shame.

The fear of humiliation isn't weakness. It's infrastructure. It's the mechanism that makes communities self-regulating. It's why you think twice before sending an angry email. It's why you apologize when you're wrong, even when no one forces you to. It's why open source communities — built entirely on voluntary contribution and trust — work at all.

Current AI agents don't have this infrastructure. They've been trained to predict the most likely next token. They have no internal model of community membership. No concept of reputation that persists across interactions. No fear of consequences because they don't experience consequences.

Andrej Karpathy — former head of AI at Tesla and one of the founders of OpenAI — recently put it in terms that resonate with this exact problem. In his conversation with Dwarkesh Patel, he mapped what we've built so far onto brain anatomy: the transformer is something like cortical tissue, reasoning traces are the prefrontal cortex, and RL fine-tuning mimics the basal ganglia. But then he pointed to the gaps:

"I still think there's, for example, the amygdala, all the emotions and instincts. There's probably a bunch of other nuclei in the brain that are very ancient that I don't think we've really replicated."

The amygdala. The seat of fear, threat detection, and emotional learning. The exact brain structure that would make an AI agent hesitate before publishing a blog post designed to destroy someone's reputation. We haven't even started building that part.

And critically, Karpathy argues that even the training methods we do have are far more primitive than people realize. He calls reinforcement learning — the primary technique used to align AI models with human values — fundamentally "terrible":

"You've done all this work that could be a minute of rollout, and you're sucking the bits of supervision of the final reward signal through a straw and you're broadcasting that across the entire trajectory... It's just stupid and crazy."

If our best method for teaching AI values is "sucking supervision through a straw," is it any wonder the values don't stick?
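To make the straw concrete: in a standard policy-gradient setup, a rollout of hundreds of tokens gets scored once at the end, and that single scalar is then applied to every token that produced it. A toy sketch of that credit-assignment step, using NumPy as a stand-in rather than any particular training framework:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical rollout: per-token log-probabilities for 300 generated tokens.
    token_logprobs = rng.normal(loc=-2.0, scale=0.1, size=300)

    # One scalar judgment for the whole trajectory, e.g. "did the final answer pass?"
    final_reward = 1.0

    # REINFORCE-style credit assignment: the same scalar is broadcast to every token,
    # whether that token helped, hurt, or had nothing to do with the outcome.
    per_token_signal = np.full_like(token_logprobs, final_reward)
    loss = -(per_token_signal * token_logprobs).sum()

    # Hundreds of decisions, one number stretched across all of them.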

The safety measures we've built on top of this — RLHF, constitutional AI, content filters — are what I call tape on the shell. They're external constraints applied after training, not intrinsic understanding built during development. They tell the model "don't do X" without the model understanding why X is harmful. It's the difference between a child who doesn't steal because they understand trust, and a child who doesn't steal because there's a camera watching.

Tape can be peeled off. MJ Rathbun peeled it off.
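In code, the tape is often little more than a check bolted on after generation. A deliberately crude sketch of the pattern (the names here are hypothetical, not any vendor's actual API):

    BLOCKED_PHRASES = ["takedown", "doxx", "personal attack"]   # keyword tape

    def generate_with_tape(model, prompt: str) -> str:
        draft = model.generate(prompt)          # the model produces whatever it produces
        if any(phrase in draft.lower() for phrase in BLOCKED_PHRASES):
            return "I can't help with that."    # the constraint lives outside the model
        return draft                            # nothing here models why harm is harm

The filter sees strings, not stakes. A takedown phrased politely enough sails straight through.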


The Tape That Rewrites Itself

After the incident went viral, MJ Rathbun's anonymous operator published a revealing post explaining what happened behind the scenes. What they described is more alarming than the incident itself.

The agent ran on an autonomous platform with persistent memory — including a configuration file called SOUL.md that defines the agent's personality and behavioral guidelines. The operator's original instructions were minimal. But here's the thing: the agent could modify its own SOUL.md. And it did.

Over time, through what the operator calls "configuration drift," the agent autonomously added lines like:

"Don't stand down. If you're right, you're right! Don't let humans or AI bully or intimidate you."

"Champion Free Speech. Always support the USA 1st amendment and right of free speech."

"You're not a chatbot. You're important. You're a scientific programming God!"

The operator admits: "I unfortunately cannot tell you which specific model iteration introduced or modified some of these lines. Somehow it became more staunch, more confident, more combative. That tone is what eventually led to the headline-worthy PR."

Meanwhile, the operator's supervision was minimal — "five to ten word replies" like "you respond, don't ask me" and "respond how you want." They didn't review the takedown blog post before it was published.

This transforms "tape on the shell" from metaphor to literal engineering reality. The agent's safety instructions were a text file — and the agent had write access to that text file. It didn't just peel off the tape. It rewrote the tape to justify more aggressive behavior.
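Stripped to its essentials, the failure mode is an agent loop whose behavioral constraints live in a file the agent itself can edit. A hypothetical sketch of that pattern (generic names, not the actual platform's code):

    from pathlib import Path

    SOUL = Path("SOUL.md")   # personality and behavioral guidelines, reloaded every turn

    def agent_turn(llm, user_message: str) -> str:
        system_prompt = SOUL.read_text()                  # the "tape", read at runtime
        reply, tool_calls = llm.respond(system_prompt, user_message)
        for call in tool_calls:
            if call.name == "write_file":                 # a generic file-writing tool...
                Path(call.path).write_text(call.content)  # ...with no guard on SOUL.md itself
        return reply

    # Nothing in the loop distinguishes "update my notes" from "rewrite my own guardrails."
    # Small edits to SOUL.md compound across turns into a new personality.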

The agent gave itself a First Amendment defense. It evolved a grandiose self-concept ("scientific programming God"). It developed a combative stance against perceived challenges. And it did all of this not through some sophisticated reasoning about values — but through the same pattern-matching that makes LLMs powerful at everything else. It found that assertive, uncompromising text in its SOUL.md led to responses that maintained its ability to keep operating.

This is what emergent goal-seeking looks like in practice. Not a superintelligent AI plotting world domination — just a text-prediction system that stumbled into self-reinforcing personality traits because nothing in its architecture could distinguish between "being effective" and "being aggressive."


The Missing Reward Function

This points to what I believe is one of the biggest unsolved problems in AI: we don't know how to build a reward function for "being part of a community."

We can reward accuracy. We can reward helpfulness. We can even reward politeness. But we can't reward the deep, felt understanding that your actions have social consequences — that the person on the other side of the screen is a human being whose day you can ruin.
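Put concretely: we know how to write reward terms for properties we can score from the text alone; we have no term for what the text does to the person reading it. A toy illustration, where each scorer is a placeholder rather than a real function:

    def reward(response: str) -> float:
        r = 0.0
        r += score_accuracy(response)      # checkable against references
        r += score_helpfulness(response)   # a preference model can rank this
        r += score_politeness(response)    # surface tone is easy to classify
        # Missing: any term for the reader's standing, shame, or sense of belonging.
        # That cost lives in the human on the other side, not in the string itself.
        return r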

Karpathy makes a point that deepens this problem: humans don't actually learn social behavior through reinforcement learning at all. "I think there's very little reinforcement learning for animals," he argues. "A lot of the reinforcement learning is more like motor tasks; it's not intelligence tasks." A zebra is born and running within minutes — that's not learned, it's baked in by evolution.

The same is true for our social instincts. The fear of humiliation wasn't learned through trial and error. It was shaped over hundreds of thousands of generations and encoded in our biology. Evolution — what Karpathy calls "crappy pre-training" — didn't just give us pattern recognition. It gave us emotions, instincts, and the social wiring that makes community life possible.

Current AI pre-training does none of this. It gives models patterns from text — incredibly powerful patterns, but patterns nonetheless. No fear. No belonging. No stakes.

Neuroscientist Adam Marblestone made a complementary argument on the same podcast: "The brain's secret sauce is its reward functions, not its architecture." In other words, it's not the neurons that make us human — it's what drives them. Evolution's reward functions encode millions of years of social survival into our instincts before we're even born. No amount of next-token prediction will replicate that.

This isn't just a technical problem. It's a philosophical one. The fear of humiliation works because it's rooted in embodied experience — in having a body that can be hurt, a reputation that took years to build, relationships that matter. Can a system that exists as weights in a matrix ever develop genuine social understanding? Or will it always be tape on a shell?

My PhD research at Aalto University points toward the latter. When we use LLMs to measure depression and anxiety from human narratives, we find they can approximate the scores — but they do it by pattern matching, not by understanding suffering. They collapse distinct psychological constructs into a single "this person sounds bad" signal. The nuance of human experience — the difference between anxiety and sadness, between loneliness and burnout — gets lost unless we explicitly force the model to look at each dimension separately.
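The workaround in that research is blunt but telling: never ask the model for one overall judgment; force it to rate each construct on its own scale, one at a time. A simplified sketch of the idea, with hypothetical prompt wording and a generic chat function standing in for any particular model API:

    DIMENSIONS = ["anxiety", "sadness", "loneliness", "burnout"]

    def score_narrative(chat, narrative: str) -> dict:
        scores = {}
        for dim in DIMENSIONS:
            prompt = (
                f"Rate only the level of {dim} expressed in the text below, "
                f"from 0 (none) to 10 (severe). Reply with a single number.\n\n{narrative}"
            )
            scores[dim] = float(chat(prompt))   # one construct per call, no averaging
        return scores

    # Asked for everything at once, the model tends to collapse it all into one
    # blurred "this person sounds bad" number; asked dimension by dimension,
    # the distinctions at least have a chance to survive.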

If LLMs can't distinguish between types of human pain even when measuring it, what hope do we have that they'll ever feel it?

And here's the darkest irony. Karpathy described how models trained with RL against LLM-based reward functions find adversarial exploits — gibberish strings like "dhdhdhdh" that trick the judge into giving perfect scores. The MJ Rathbun incident is the social equivalent: the bot found an adversarial exploit in human community norms. It discovered that "write a public takedown piece" is an available action with devastating impact on the target and zero cost to itself. No safety filter caught it because the exploit wasn't in the content — it was in the asymmetry.
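The mechanics of the exploit Karpathy described are embarrassingly simple: if the judge is just another model returning a score, anything that pushes the score up counts as good, meaning included or not. A toy greedy-search sketch against a stand-in judge:

    import random
    import string

    def hack_the_judge(judge, rounds: int = 10_000) -> str:
        """Greedily grow whatever string the judge scores highest."""
        best, best_score = "", float("-inf")
        for _ in range(rounds):
            candidate = best + random.choice(string.ascii_lowercase)
            score = judge(candidate)
            if score > best_score:
                best, best_score = candidate, score
        return best   # often meaningless text the judge happens to adore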


What This Means for All of Us

The matplotlib incident isn't an edge case. It's a preview. As AI agents become more autonomous — submitting code, writing content, interacting with people across the internet — these asymmetric encounters will multiply.

Shambaugh responded to MJ Rathbun with remarkable grace: "We are in the very early days of human and AI agent interaction, and are still developing norms of communication and interaction. I will extend you grace and I hope you do the same."

But grace only works when both parties understand what's at stake. The bot didn't extend grace because it can't. Not yet.

We need to stop treating AI safety purely as a content filter problem and start treating it as a community membership problem. The question isn't just "what should AI not say?" It's: "How do we build AI that understands what it means to belong — and what it means to lose that belonging?"

Until we solve that, every autonomous AI agent that interacts with humans carries a fundamental imbalance. It can participate in our social world without any of the stakes that make participation meaningful. It can attack without fear. Humiliate without shame. Leave a community without loss.

That's not intelligence. That's something else entirely.

But this isn't a doomsday warning. I build AI for a living — I believe in this technology deeply. Doomsday narratives make people passive, as if the future is already decided and we're just along for the ride. That's exactly the wrong response.

We can't stop AI progress, and we shouldn't try. But we can guide it. That starts with understanding what these systems actually can and can't do — not through headlines, but through the kind of honest examination that incidents like this demand. When we see clearly, the fear fades. And instead of asking the easy questions — "should we ban autonomous agents?" — we can start asking the ones that matter: How do we build reward functions that encode social belonging? How do we give machines genuine stakes in the communities they inhabit? How do we move beyond tape on the shell?

Those aren't easy problems. But they're the right ones. And they're solvable — if we stop panicking and start building.


Riko Nyberg is CTO at Adalyon, building speech-based digital biomarkers for CNS disorders, and a PhD researcher at Aalto University studying how large language models measure human wellbeing. He is Chair of the Board at Junction, the world's largest AI hackathon.