It will mostly be defense-in-depth findings, where, for example, every path to exploiting an overflow is in fact covered by clever workarounds (which the bots generally miss), but the underlying issue is not 100% solved and, in many cases, not documented. I see the same in my own C++ codebases, where I have historically had to covertly patch some findings.
This still holds for me:
Dealing with reports that look plausibly correct but are wrong imposes an asymmetric cost on project maintainers: it’s cheap and easy to prompt an LLM to find a “problem” in code, but slow and expensive to respond to it.
But what they say right below that is also true for me:
we dramatically improved our techniques for harnessing these models — steering them, scaling them, and stacking them to generate large amounts of signal and filter out the noise.
I have partially done this too. It's not finished, and I fear it never will be, but it's now easier for me to triage external findings (more than 95% of which are still either false positives or have a mislabeled severity), because I have built a repository of every false positive my harness found and can process reports much more rapidly. Valid findings do get fixed, though I am rather displeased by what's left of the developer communities at this point: there's zero new inflow of talent if you don't count Claude and GPT as talent, and a lot of people have stopped caring, including maintainers. This is something I am observing beyond my own repos too. Getting to actual, well-reviewed merges is hard right now.
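To make the false-positive repository idea concrete, here is a rough sketch of how such a lookup could work. This is not my actual harness: the `Finding` fields, the `FalsePositiveStore` class, and the example file and function names are all hypothetical, and a real setup would likely need fuzzier matching than an exact fingerprint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    path: str       # file the report points at (hypothetical field)
    function: str   # function named in the report (hypothetical field)
    bug_class: str  # e.g. "overflow" or "use-after-free" (hypothetical field)


def fingerprint(finding: Finding) -> tuple:
    # Normalize cosmetic differences (whitespace, path separators, casing of
    # the bug class) so that differently formatted reports map to one key.
    path = finding.path.strip().replace("\\", "/")
    return (path, finding.function.strip(), finding.bug_class.strip().lower())


class FalsePositiveStore:
    """Maps a finding fingerprint to the reason it was judged a false positive."""

    def __init__(self) -> None:
        self._known: dict = {}

    def record(self, finding: Finding, reason: str) -> None:
        self._known[fingerprint(finding)] = reason

    def lookup(self, finding: Finding) -> Optional[str]:
        # Returns the recorded reason, or None if the finding is new.
        return self._known.get(fingerprint(finding))


store = FalsePositiveStore()
store.record(
    Finding("src/parser.cpp", "read_chunk", "overflow"),
    "length is clamped by the caller before the copy",
)

# An external report about the same spot, formatted slightly differently.
external = Finding(" src\\parser.cpp", "read_chunk ", "Overflow")
print(store.lookup(external))
```

The point of storing the reason alongside the fingerprint is that the reply to a duplicate report is already written; anything that returns `None` is what still needs a human look.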