Google Gemini 3.5 Flash: insane benchmark results

zuspotirko

> At this point every week there’s a new “insane benchmark” headline

True. A year ago o3 was the best model on the market. Progress is fast.

> Real test is still, can it actually help without hallucinating halfway through the task?

Have you used a SOTA model in opencode yet? Chatbots still do that - the progress of agents is on another level tho.