You're like a freelancer, helping companies/labs integrate AI in their workflows?
No, I was helping with planning and management, like a hit-and-run project manager: initialize the project, scope it out, and put a process in place that keeps stakeholders happy. I hadn't suggested integrating an LLM at the time because of hallucinations, which were an even bigger problem a year ago.
In both cases, most of the uncertainty and projected cost came from the materials science. After I delivered my part, both teams found ways to speed up that research by tapping into LLMs:
- in one case, to widen their options so they needed less exotic materials, and
- in the other case, to radically reduce the number of experimental iterations needed and cut projected waste by nearly 95%.
Any advice on how to spot the hallucinations?
It's getting very hard to spot them at a glance, because the models have become very good at producing plausible, coherent text. Anything can be an indicator; for example, I frown when I see odd patterns[1].
Don't trust, verify. But do fight fire with fire, otherwise you're on the weak end of the asymmetry: integrate an LLM of your own. Don't ask that LLM to judge or to produce a final output. Instead, ask it to help with the labor-intensive things you are an expert in. Describe to the bot the process you have mastered and would otherwise do manually, then test it on something you've already done. Compare its output with your own, and see where it does better and where it does worse. If you invest even 5% of your time in this, you'll get ahead eventually.
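The "test it on something you've already done, then compare" step could look roughly like the sketch below. It assumes your own past results and the LLM's results for the same items are available as simple id-to-label mappings; the function name, data format and labels are hypothetical, chosen only for illustration. Besides listing disagreements, it shows both label distributions side by side, which can make a skew in the LLM's outcomes easier to notice.

```python
from collections import Counter


def compare_with_reference(reference, llm_output):
    """Compare LLM labels against your own labels for items you already handled.

    Hypothetical format: both arguments map an item id to a label string.
    """
    shared = sorted(set(reference) & set(llm_output))
    if not shared:
        raise ValueError("no overlapping items to compare")

    # Items where the LLM's label differs from your own known-good label.
    disagreements = [
        (item, reference[item], llm_output[item])
        for item in shared
        if reference[item] != llm_output[item]
    ]
    return {
        "agreement": 1 - len(disagreements) / len(shared),
        "disagreements": disagreements,
        # Items you handled that the LLM skipped entirely.
        "missing": sorted(set(reference) - set(llm_output)),
        # Side-by-side label counts: a lopsided LLM distribution is a red flag.
        "reference_dist": Counter(reference[i] for i in shared),
        "llm_dist": Counter(llm_output[i] for i in shared),
    }


if __name__ == "__main__":
    # Hypothetical example: risk ratings you assigned vs. the LLM's ratings.
    mine = {"R1": "high", "R2": "low", "R3": "medium", "R4": "low"}
    bot = {"R1": "high", "R2": "medium", "R3": "medium", "R4": "medium"}
    report = compare_with_reference(mine, bot)
    print(report["agreement"])  # 0.5
    print(report["reference_dist"], report["llm_dist"])
```

Here the LLM agrees on only half the items, and every disagreement pushes toward "medium", the kind of pattern that only shows up when you compare against work you have already done yourself.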
For example, I let Claude help me keep track of a risk register on @k00b's massive wallet PR, and there was something odd about the distribution of the analysis outcomes. It could be a coincidence, but I think that's unlikely, and this is a well-instructed, top-tier, expensive LLM. Also, this happened much less often with a previous version, so I think it's a regression: even at Anthropic they're still trying to make a one-size-fits-all LLM, so when they tune one thing they break another. That's not unique to them; the same happened with GPT multiple times. ↩
*burden of checking whether their smart-sounding paragraphs in paper drafts are not utter bullshit falls on me.
You're like a freelancer, helping companies/labs integrate AI in their workflows? Without doxxing yourself, ofc. Can I hire you?
Yes, that's really what keeps me from using it at scale. Any advice on how to spot the hallucinations? Iterate it through other models, etc.?
I've come to trust junior researchers even less than before, because now the burden of checking whether their smart-sounding paragraphs in paper drafts falls on me.