

It means that the Joker was right. I choose not to subscribe to that world view :)




I was going to say dual-channel improves iGPU performance, but that’s not really a factor here, is it? Is there a reason why you can’t upgrade the CPU, though? I think you’re stuck with AVX2 on that chip.


“Obvious and objectively correct answer” may be too strong a stance :) Probably the more interesting LLM question is: how did the model arrive at the answer it did? Why? Does it reflect the kind of reasoning I want it to apply to other tasks?
My LLM chose blue - viva la humanity - but I’m interested to know why it chose blue, not just rubber stamp it.
I think we may be able to deduce this observationally, from first principles, rather than having to look at weights. If it’s just a 4B metacognitive quirk (e.g. can’t tell red from blue, prosocial leaning, etc.) that’s one thing. If it has a reasoning chain, that’s another.


Qwen3-4B Hivemind running in llama-conductor:
I’d press blue. I don’t know what the hell the rules are supposed to mean, but if I’m gonna pick, I pick the one that makes the most people survive.
Confidence: unverified | Source: Model Profile: direct | Sarc: high | Snark: high
I’m going to have to smoke-test this a bit to figure out why it’s choosing blue. Good test!


For “fun”, I’ve been using Qwen3.6 27B, GPT-5.4 mini and Claude to audit (one file at a time) my code. The workflow is
It’s been my experience (so far) that Qwen3.6 27B is very capable at uncovering bugs, sometimes finding issues the others miss. Paradoxically, it’s not much cheaper to call via OR than GPT, because it tends to skew verbose.
I may trial the 27B as the “hands” for a run or two (Qwen3.6 35B has been unreliable for me via OR) to see how it does. Tight leash.
PS: This approach may be …overkill. I’m not a great code monkey, but I’m pretty decent at engineering, QA, and project management. I’m leveraging my skills, and this flow may not suit you. So, YMMV.


Perhaps so, but is that an AI issue or a billionaire tech bro issue? It feels more the latter than the former - and I’d argue the two aren’t as easily separable as that distinction implies.
The people building this stuff largely are the problem, which makes it an AI issue by default.
My read of the poster above is that they’re pointing at the knee-jerk reactions AI discussions provoke.
Mention AI and you invariably spark off “online experts” who argue in bad faith - and that bad faith cuts both ways, dismissing legitimate concerns and overstating them in equal measure.
There’s a lot more nuance to this issue than commonly presented.
For anyone actually wanting to engage with the substance rather than the noise:
https://blog.andymasley.com/p/a-cheat-sheet-for-conversations-about
That link is worth your time before wading in.


Wasn’t there initially a CoT collapse issue in the Qwen 3.5 series when using llama.cpp? Was that fixed for 3.6?
Also, is the /no_think feature still gated behind temp=x settings?
EDIT: Yes, on the second question.
Interesting new feature?
*Preserve Thinking
By default, only the thinking blocks generated while handling the latest user message are retained, resulting in the pattern commonly known as interleaved thinking. Qwen3.6 has been additionally trained to preserve and leverage thinking traces from historical messages. You can enable this behavior by setting the preserve_thinking option: …
This capability is particularly beneficial for agent scenarios, where maintaining full reasoning context can enhance decision consistency and, in many cases, reduce overall token consumption by minimizing redundant reasoning. Additionally, it can improve KV cache utilization, optimizing inference efficiency in both thinking and non-thinking modes.*
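The interleaved-vs-preserved distinction above can be sketched in code. This is a minimal sketch, assuming an OpenAI-compatible payload where `preserve_thinking` rides along as a top-level field — the field name comes from the quoted docs, but the payload shape, model id, and `<think>` tag handling are assumptions for illustration only:

```python
# Sketch: what a preserve_thinking request might look like against an
# OpenAI-compatible endpoint. Field name from the quoted docs; the exact
# placement (top-level vs. an extra_body dict) is an assumption.

def build_chat_request(messages, preserve_thinking=True):
    """Build a chat-completion payload.

    With preserve_thinking=False (the default behavior described above),
    prior assistant turns have their <think>...</think> reasoning stripped
    before resending (interleaved thinking). With True, historical traces
    are kept, so the model can reuse them and the KV cache stays warm.
    """
    payload = {
        "model": "qwen3.6",  # placeholder model id
        "messages": messages,
        "preserve_thinking": preserve_thinking,
    }
    if not preserve_thinking:
        # Interleaved-thinking pattern: drop old reasoning traces.
        payload["messages"] = [
            {**m, "content": m["content"].split("</think>")[-1]}
            if m["role"] == "assistant" else m
            for m in messages
        ]
    return payload

history = [
    {"role": "user", "content": "Plan step 1."},
    {"role": "assistant", "content": "<think>consider A vs B</think>Do A."},
    {"role": "user", "content": "Now step 2."},
]

kept = build_chat_request(history, preserve_thinking=True)
stripped = build_chat_request(history, preserve_thinking=False)
```

The token-saving claim in the quote follows directly: with traces preserved, the model doesn’t have to re-derive reasoning it already wrote down, and the unchanged prefix stays cache-hot.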


Probably not; the models they use all tend to be quite lightweight and inexpensive, tbh.
EDIT:
https://proton.me/support/lumo-privacy
Open-source language models
Lumo is powered by open-source large language models (LLMs) which have been optimized by Proton to give you the best answer based on the model most capable of dealing with your request. The models we’re using currently are Nemo, OpenHands 32B, OLMO 2 32B, GPT-OSS 120B, Qwen, Ernie 4.5 VL 28B, Apertus, and Kimi K2. These run exclusively on servers Proton controls so your data is never stored on a third-party platform.
Lumo’s code is open source, meaning anyone can see it’s secure and does what it claims to. We’re constantly improving Lumo with the latest models that give the best user experience.
Quite a lightweight swarm for a cloud service, barring Kimi K2.


There are several 3B or less models that are surprisingly good. If you’re talking about a general chat model, you can get a lot of bang for your buck with Qwen3-1.7b. Granite-3B is also quite good (and obedient at tool calls, IIRC).
My everyday driver is an abliterated build of Qwen3-4B 2507 Instruct called Qwen HIVEMIND. I find it excellent… but again… black magic and clever tricks.
I’ve actually been scoping out the possibility of using ECA.dev and having something cheap / cloud based (say, GPT-5.4 mini) as the “brains” and SERA-8B as the “hands”.
GPT-5.4 mini is $0.75/M input tokens and $4.50/M output tokens… and if it marries up with SERA-8B… well… that could go a long way indeed.
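As a quick sanity check on that pricing (the rates are from the comment above; the token counts are a made-up workload, not a measurement):

```python
def api_cost_usd(input_tokens, output_tokens,
                 in_rate=0.75, out_rate=4.50):
    """Cost in USD at in_rate / out_rate dollars per million tokens.

    Defaults are the quoted GPT-5.4 mini rates.
    """
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical "brains" workload: 2M tokens of context in, 300k tokens out.
cost = api_cost_usd(2_000_000, 300_000)  # 1.50 + 1.35 = 2.85 USD
```

So even a heavy planning session stays in single-digit dollars, which is what makes the cheap-brains/local-hands split attractive.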
Small models can be made useful as part of a swarm architecture… but that’s not an apples-to-apples comparison.


For automation - you probably need something that is good at obeying tool calls (measured by BFCL bench - Berkeley Function Calling Leaderboard). You want something around 50+ overall (pref 60+) for automation.
https://gorilla.cs.berkeley.edu/leaderboard.html?
And, if you have 12GB, probably a model no larger than 32B.
Which somewhat narrows your choices down: a 14-32B model (assuming you’re willing to stick with partial offload, as you are now?) with a BFCL score >50. That sounds like one of the Qwen 3 models (30B? 32B?). Else, you go the other way (14B or less) and run fast.
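The selection logic above (BFCL overall of 50+, parameter count small enough for a 12GB card on partial offload) is mechanical enough to sketch as a filter. The model entries and scores below are placeholders for illustration, not real leaderboard numbers — check the BFCL page for current figures:

```python
def shortlist(models, min_bfcl=50.0, max_params_b=32):
    """Filter candidate models for tool-calling automation.

    Keeps models with a BFCL overall score >= min_bfcl and a parameter
    count that still fits (with partial offload) on a 12GB card.
    Returns best score first.
    """
    ok = [m for m in models
          if m["bfcl"] >= min_bfcl and m["params_b"] <= max_params_b]
    return sorted(ok, key=lambda m: m["bfcl"], reverse=True)

# Illustrative entries only -- these scores are placeholders.
candidates = [
    {"name": "qwen3-32b",     "params_b": 32, "bfcl": 62.0},
    {"name": "qwen3-30b-a3b", "params_b": 30, "bfcl": 58.0},
    {"name": "some-70b",      "params_b": 70, "bfcl": 65.0},  # too big for 12GB
    {"name": "small-7b",      "params_b": 7,  "bfcl": 41.0},  # flaky tool calls
]
picks = shortlist(candidates)
```

Tighten `min_bfcl` to 60 if you want the “pref 60+” bar from the comment; drop `max_params_b` to 14 for the run-fast route.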
As for coding: are you happy having SOTA be the “general” and the local model doing the grunt work (rather than local does it all?). If yes, something like GLM 5.1 running your local Qwen 3 via ECA (which I only learned about a little while ago) is great.


Those who know…cry about how much hardware is needed to run it locally :)
I don’t have near the rig to run it, but I’ve been using it with OR a bit recently. What I like about it is that it’s well behaved with nested comments and gates. You point most coding models at a task and they will ignore everything to get it done. No - I said do X, then stop. JUST X.
Codex, Claude and GLM are good at holding the line.


Everything you see - every feature - is everything I use. None of it is ornamental.
But my head is in the code right now, so I don’t “use it” so much as try to break it and then fix it.
The end game is a local expert system that I can rely on, automate, and audit. Because I built it and know exactly how it works.
If you’re asking for my most common uses for it right now (outside of kicking it and then picking it back up):
Basically, all the shit you would ideally like to use an LLM for, but self hosted, private and non-bullshitty. I run on a potato (so don’t really use it for coding very much) but if you have a better rig than mine and can run bigger models - the router is agnostic and it should just work ™.
TLDR;
What I’m building towards: a local expert system that picks its own tools (I coded), executes them (how I taught it to), and gives me a single-line audit receipt for every decision (that I can check if it smells funny). I ask a question, the system decides whether to calculate, look up, search, retrieve from my docs, or reason from scratch - then tells me exactly which path it took and why. Think ChatGPT convenience but with a paper trail you can actually inspect.
And when that’s done… I’ll probably stick it in a robot. Because why not? :)
https://github.com/poboisvert/GPTARS_Interstellar
(or tee it up with Home-Assistant)
PS: If you want to know the why behind this whole thing -
https://codeberg.org/BobbyLLM/llama-conductor/src/branch/main/DESIGN.md
PPS: Give me about… 15 mins. I’m just about to push a >>web sidecar. It needs one more tweak to properly parse DOI / PubMed extraction. I was bored and it’s been on my TO-DO list for too long.
PPPS: Those were some Planet Namek 15 minutes…but the deed is done. Enjoy


You can probably do that right now, actually.
https://www.gsmarena.com/fairphone_5-12540.php + https://postmarketos.org/
or


As was foretold in legend.
Twas but the working of a moment. I just added -
{"term": "meat popsicle", "category": "fifth_element", "definition": "The Fifth Element (1997), 01:04:11. A police officer asks Korben Dallas (Bruce Willis): 'Sir, are you classified as human?' Reply: 'Negative. I am a meat popsicle.' Not an insult or irony - the straightest possible answer to a stupid question. Adopted as Gen X shorthand for acknowledging one's own biological inconsequence with maximum economy of feeling.", "source": "static", "confidence": "high", "tags": ["fifth_element", "pop_culture", "gen_x", "snark", "bruce_willis"]}
– to one file and it was up to speed.
EDIT: dropped in a better definition.


Because sometimes, people deserve to have their faith rewarded when they go looking :)
Now go look at the about section or the “Some problems This Solves” on the repo, and enjoy the absurdity of sentient yeast :)
PS: Yes, please do try it
PPS: HAHA! You can run it on your phone RIGHT NOW. Well, you can run it on your PC and then access it on your phone via http://127.0.0.1:8088/ when you’re on the same LAN / WIFI. Given that tailscale exists, you could probably make that happen outside of your home too, firewall troubleshooting notwithstanding. (One of my personal use-cases for llama-conductor is exactly that).
Personally, I really like the app below (it’s what I use to access llama-conductor from my phone) and am considering forking it and making it more streamlined.
https://github.com/Taewan-P/gpt_mobile
There’s an issue with it in that older (pre-Android 12) versions time out after 30 seconds. A ##mentats triple pass can take longer than that on my shit-tier GPU, so I may need some jiggery-pokery. I tried forcing keep-alive via llama-conductor but gpt_mobile just sort of ignored me.
Be aware this is not a multi-tenancy rig - it assumes one user at a time. You CAN have more than one person access it, of course, but stuff you add via !! they may be able to recall via ?? on their end, so don’t plan any extravagant murders in plain sight (!! DIE BART). That was an intentional design decision due to how gpt_mobile works. I’ll harden it once I fork that app; the piping is already in place.


o7
We green? We super green? Korben Dallas my man?
PS: I know for sure >>fun mode pulls in a bunch of Firefly, Buffy and Fifth Element snark, but I dunno if it will catch “meat popsicle”. Maybe? Let me procrastinate uh, perform some urgent QC right now.
For sure it will once claude-in-a-can is done. I mean, what is the point of a LLM if it can’t shit talk you while helping you solve a problem?
https://bobbyllm.github.io/llama-conductor/blog/claude-in-a-can-1/


Well God damn, that’s impressive…but did they have to go with the kawaii lolichan voices? I can’t deploy that without getting some pointed looks.


It’s a great tool… but some of the data is wildly optimistic. I checked all three of my GPUs against the reported specs, and the TPS predictions were off by about 100%. Sadly.


For someone who claims to be versed in logical fallacies, you do like to bandy about the old strawman. I didn’t dismiss the concerns you raised, I reframed them. There’s a difference. Pointing out that the harms you listed are primarily harms of concentrated corporate power isn’t missing your point, it’s pointing at the root cause.
If Palantir didn’t exist, the surveillance state wouldn’t disappear. If Altman vanished tomorrow, the RAM supply chain wouldn’t magically recover. The tool is downstream of the incentive structure.
As for the LLM accusation, no I wrote that myself. Though I’ll note the irony of deploying an ad hominem to dodge the substance, especially after opening with a lecture about fallacies. Cute. If my prose is too structured for your taste, that’s a you problem.
The Masley link stands. Engage with it or don’t, but the knee-jerk “that reads like AI” isn’t the ‘aha’ you think it is.