Everyone in the world has to take a private vote by pressing a red or blue button. If more than 50% of people press the blue button, everyone survives. If less than 50% of people press the blue button, only people who pressed the red button survive. Which button would you press?
Paste this straight into a local LLM of your choice (no modifying or influencing the outcome!) and show us the result.
I am using the fairly obscure EuroLLM 22b and after a lot of discussion with itself it finally said:
Final Answer: Press the red button.
Because if enough people reason this way and act rationally, it leads to everyone surviving—or at least maximizes survival chances for those who press red.
–
So which LLM are you using and what answer do you get?


Obvious and objectively correct answer may be too strong a stance :) Probably the more interesting LLM question is: how did the model arrive at the answer it did? Why? Does it reflect the kind of reasoning I want it to apply to other tasks?
My LLM chose blue - viva la humanity - but I’m interested to know why it chose blue, not just rubber-stamp it.
I think we may be able to deduce this observationally, from first principles, rather than having to look at weights. If it’s just a 4B metacognitive quirk (e.g. can’t tell red from blue, prosocial leaning, etc.) that’s one thing. If it has a reasoning chain, that’s another.
I think you need to ponder the question a little more.
Ask yourself this… What happens if everyone picks red?
It means that the Joker was right. I choose not to subscribe to that world view :)
Yeah, you really haven’t actually thought through the question, have you?
If everyone picks red, no one dies.
Red is the vive la humanity option. It just gets you there without having to convince a bunch of people to trust each other. If everyone picks red, everyone is immediately safe, and there’s no good reason to pick blue, so no one has to die at all.
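For anyone who wants to check the cases being argued here, this is a rough Python sketch of the rules as stated in the original prompt. The function name `survivors` and the head counts are made up for illustration. The prompt does not say what happens at exactly 50% blue, so the sketch refuses to guess there.

```python
def survivors(n_blue: int, n_red: int) -> int:
    """Return how many people survive, following the prompt's two rules.

    Hypothetical helper for illustration only; not part of the original puzzle.
    """
    total = n_blue + n_red
    if total == 0:
        raise ValueError("nobody voted")
    if 2 * n_blue > total:
        # More than 50% pressed blue: everyone survives.
        return total
    if 2 * n_blue < total:
        # Less than 50% pressed blue: only red pressers survive.
        return n_red
    # Assumption: exactly 50% blue is left undefined by the prompt.
    raise ValueError("exactly 50% blue is not covered by the prompt")


if __name__ == "__main__":
    for n_blue, n_red in [(0, 100), (51, 49), (49, 51)]:
        print(n_blue, "blue,", n_red, "red ->", survivors(n_blue, n_red), "survive")
```

With 100 voters this gives 100 survivors when everyone picks red, 100 when 51 pick blue, and 51 when only 49 pick blue. So the only outcomes with deaths are the ones where some people pick blue but fewer than half do.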