Where the goblins came from - OpenAI \ stacker news

Apparently, OpenAI models mention goblins a lot? I haven't noticed this, but I'll also admit I skim most of what the agents return to me. So maybe I just missed it.

The short answer is that model behavior is shaped by many small incentives. In this case, one of those incentives came from training the model for the personality customization feature, in particular the Nerdy personality. We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.

It seems that the "Nerdy" system prompt was selecting for answers with goblins.

As goblin and gremlin mentions increased under the Nerdy personality, they increased by nearly the same relative proportion in samples without it. Taken together, the evidence suggests that the broader behavior emerged through transfer from Nerdy personality training.

The rewards were applied only in the Nerdy condition, but reinforcement learning does not guarantee that learned behaviors stay neatly scoped to the condition that produced them. Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.

That creates a feedback loop:
Playful style is rewarded.
Some rewarded examples contain a distinctive lexical tic.
The tic appears more often in rollouts.
Model-generated rollouts are used for supervised fine-tuning (SFT).
The model gets even more comfortable producing the tic.
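
To make that loop concrete, here is a toy sketch in Python. It is not OpenAI's pipeline: the function name `simulate_tic_feedback`, the 2% starting rate, and the 1.5x reward multiplier are all made-up assumptions. It only shows how a small reward bonus for tic-containing rollouts can compound when reward-weighted rollouts are fed back in as training data.

```python
def simulate_tic_feedback(p_tic=0.02, reward_bonus=1.5, rounds=5):
    """Toy model of the loop: sample rollouts, weight them by reward, retrain.

    p_tic        -- hypothetical starting probability that a rollout contains the tic
    reward_bonus -- hypothetical reward multiplier for rollouts containing the tic
    rounds       -- number of train-on-your-own-rollouts iterations
    """
    history = [p_tic]
    for _ in range(rounds):
        # Expected reward mass of rollouts with and without the tic.
        with_tic = p_tic * reward_bonus
        without_tic = (1.0 - p_tic) * 1.0
        # Assumption: fine-tuning on reward-weighted rollouts moves the tic
        # rate to the tic's share of the total reward mass.
        p_tic = with_tic / (with_tic + without_tic)
        history.append(p_tic)
    return history


if __name__ == "__main__":
    for round_index, rate in enumerate(simulate_tic_feedback()):
        print(f"round {round_index}: tic rate {rate:.4f}")
```

In this sketch the tic rate rises every round whenever `reward_bonus` is above 1, and stays flat when it equals 1. Real training is far messier, but the compounding is the point.
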

A search through GPT‑5.5’s SFT data found many datapoints containing “goblin” and “gremlin.” Further investigation revealed a whole family of other odd creatures: raccoons, trolls, ogres, and pigeons were identified as other tic words, while most uses of frog turned out to be legitimate.
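
The post doesn't say how that search was done. A minimal sketch of one way to count tic words in a pile of text samples might look like this; the word list is taken from the quote above, while `count_tic_words` and the sample strings are hypothetical.

```python
import re
from collections import Counter

# Tic words named in the post; "frog" is left out because most of its
# uses reportedly turned out to be legitimate.
TIC_WORDS = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")

# Match whole words only, with an optional plural "s", ignoring case.
TIC_PATTERN = re.compile(r"\b(" + "|".join(TIC_WORDS) + r")s?\b", re.IGNORECASE)


def count_tic_words(samples):
    """Count occurrences of each tic word across an iterable of strings."""
    counts = Counter()
    for text in samples:
        for match in TIC_PATTERN.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


if __name__ == "__main__":
    demo = [
        "The goblin in the cache",
        "Two Gremlins and a goblin",
        "A trolley problem",  # should not count as "troll"
    ]
    print(count_tic_words(demo))
```

A plain keyword count like this cannot tell a tic from a legitimate mention, which is presumably why the frog cases needed a closer look.
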

It seems they fixed it with a pretty heavy-handed alteration of the incentives for the model. But there was no discussion in their post about what the consequences of their fix might be.

We retired the “Nerdy” personality in March after launching GPT‑5.4. In training, we removed the goblin-affine reward signal and filtered training data containing creature-words, making goblins less likely to over-appear or show up in inappropriate contexts. Unfortunately, GPT‑5.5 started training before we found the root cause of the goblins. When we began testing GPT‑5.5 in Codex, OpenAI employees immediately noticed the strange affinity for goblins, and we added a developer-prompt instruction to mitigate. Codex is, after all, quite nerdy.
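
For a sense of what "filtered training data containing creature-words" could mean in the bluntest case, here is a sketch. It is an assumption about the general shape of such a filter, not the filter OpenAI used; `filter_creature_samples` and the two-word pattern are hypothetical.

```python
import re

# Hypothetical, deliberately short creature-word list.
CREATURE_PATTERN = re.compile(r"\b(goblin|gremlin)s?\b", re.IGNORECASE)


def filter_creature_samples(samples):
    """Split samples into (kept, dropped) based on a creature-word match."""
    kept, dropped = [], []
    for text in samples:
        if CREATURE_PATTERN.search(text):
            dropped.append(text)
        else:
            kept.append(text)
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = filter_creature_samples([
        "Fix the flaky test",
        "A little goblin lives in this regex",
        "Dungeon design: goblins as low-level enemies",
    ])
    print("kept:", kept)
    print("dropped:", dropped)
```

Note that the third sample is a perfectly reasonable use of the word and gets dropped anyway. A filter this blunt removes legitimate mentions along with the tic, which is one concrete version of the unexamined consequences mentioned above.
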

Apparently, we are only sensitive to distortions when they look like ugly little green monsters. I'm still thinking about "Optimality is the tiger."

186 sats \ 2 replies \ @fred 30 Apr

It makes you wonder how many other, more subtle lexical tics are currently shaping our information environment that we don't catch because they don't look like goblins, they just look like standard, slightly-biased prose.

117 sats \ 0 replies \ @optimism 30 Apr

I see a lot of it. It's rather hard to spot if you're working on something you're not an expert on.

15 sats \ 0 replies \ @WeAreAllSatoshi 30 Apr

great catch!