Are there any open models that can actually compete with proprietary ones like GPT 5.5 Extended Thinking or Claude Opus 4.7? I'm getting really good results with those in their chat interfaces for coding tasks. They sometimes spend 30-45 minutes on my task, run tool calls in an internal container (cloning a repository, compiling their code), and can look up online documentation. Their answers are very good and usually correct, even for very complex tasks requiring specific protocols.

So I'd like to know how well this can be replicated with open models, since I want more control over how it runs, plus privacy. Do any of you hook agentic capabilities into your local models? How do you do it, and which models give you good results?

Pretend I have unlimited resources (local llama.cpp, sufficient fast storage/memory, and unlimited time to wait for a good response).
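To be concrete about what I'm picturing: a small driver loop against llama.cpp's `llama-server`, which exposes an OpenAI-compatible `/v1/chat/completions` endpoint (tool calling needs `--jinja` with a chat template that supports it, as far as I know). This is just a sketch of the shape, not a real framework; the `run_shell` tool, the step budget, and the injectable `chat` backend are all illustrative:

```python
import json
import subprocess

# One illustrative tool; a real run would add read_file, fetch_docs, etc.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command in the workspace, return output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_tool(name: str, arguments: str) -> str:
    """Execute one tool call. Only run_shell exists in this sketch."""
    if name != "run_shell":
        return f"unknown tool: {name}"
    args = json.loads(arguments)
    proc = subprocess.run(args["command"], shell=True, capture_output=True,
                          text=True, timeout=300)
    return (proc.stdout + proc.stderr)[:8000]  # keep the context window sane

def agent_loop(chat, task: str, max_steps: int = 25):
    """Drive the model until it answers without tool calls, or runs out of steps.

    `chat(messages, tools)` is the backend: for llama-server it would POST the
    messages to /v1/chat/completions and return the assistant message dict.
    """
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        msg = chat(messages, TOOLS)
        messages.append(msg)
        if not msg.get("tool_calls"):
            return msg.get("content")  # final answer, no more tool use
        for call in msg["tool_calls"]:
            result = run_tool(call["function"]["name"],
                              call["function"]["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": result})
    return None  # step budget exhausted
```

Obviously the proprietary agents do a lot more (sandboxing, web search, planning), but I assume the core loop is roughly this: feed tool results back as `role: "tool"` messages until the model stops asking for tools.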

  • SuspciousCarrot78@lemmy.world · 1 day ago

    For “fun”, I’ve been using Qwen3.6 27B, GPT 5.4 mini and Claude to audit my code (one file at a time). The workflow is:

    • I flag issues
    • Claude writes the probe spec
    • Qwen Audits (OR via Roo)
    • GPT Audits (OR via Roo)
    • I review
    • Claude and I consolidate the bug reports
    • We run that past GPT in Codex
    • It tests / replicates it against the code base
    • I review the output
    • Claude and I prioritise what needs fixing, what can be deferred and what can be ignored
    • I / we create the ticket with staged gates
    • GPT spins up a sandbox
    • Run gate 1
    • Fix bug 1
    • Smoke test against the sandbox
    • I review and discuss
    • Iterate
    • Again
    • Smoke test passes or we pivot to a different fix.
    • Once happy, back-port and snapshot
    • Update ticket index and ticket itself with what we did, what worked, what was out of scope
    • Exfil new code to main, manually test again.
    • If all good, back up and archive (3-2-1).
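    The “staged gates” part of that flow is simpler than it sounds: each gate is just a shell command (probe spec, test suite, smoke test), run in order, stopping at the first failure so a later gate never runs against broken code. A minimal sketch of what I mean (gate names and commands here are illustrative, not from any particular tool):

    ```python
    import subprocess

    def run_gates(gates):
        """Run (name, shell_command) gates in order; stop at the first failure.

        Returns (all_passed, name_of_failed_gate_or_None).
        """
        for name, command in gates:
            proc = subprocess.run(command, shell=True,
                                  capture_output=True, text=True)
            if proc.returncode != 0:
                return False, name  # fix and rerun from gate 1, or pivot
        return True, None
    ```

    In practice gate 1 might be `pytest tests/probe_spec.py` and the smoke test a script inside the sandbox; the point is that nothing (model or human) advances until the previous gate is green.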

    It’s been my experience (so far) that Qwen 3.6 27B is very capable at uncovering bugs, sometimes finding issues the others miss. Paradoxically, it’s not much cheaper to call via OR than GPT, because it tends to skew verbose.

    I may trial the 27B as the “hands” for a run or two (Qwen 3.6 35B has been unreliable for me via OR) to see how it does. Tight leash.
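    By “tight leash” I mean clamping the request parameters on any OpenAI-compatible endpoint so a verbose model can’t run up the bill. A sketch of my defaults (the exact numbers are just what I use, tune to taste):

    ```python
    def leashed_request(model: str, messages: list) -> dict:
        """Build a chat-completion payload with a short leash: low temperature,
        a hard output-token cap (verbose models get expensive via OpenRouter),
        and a stop sequence so the model can't ramble past its answer."""
        return {
            "model": model,
            "messages": messages,
            "temperature": 0.2,    # audits want determinism, not creativity
            "max_tokens": 1024,    # hard cap on the verbosity tax
            "stop": ["## END"],    # prompt the model to close with this marker
        }
    ```

    The `stop` marker only works if the system prompt tells the model to emit it, but even just the `max_tokens` cap makes the per-call cost predictable.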

    PS: This approach may be …overkill. I’m not a great code monkey, but I’m pretty decent at engineering, QA, and project management. I’m leveraging my skills, and this flow may not suit you. So, YMMV.