What's up, DocLang?
Websites are being redesigned for consumption by AI models, and now a coalition wants to extend the trend to digital documents.
The LF AI & Data Foundation, under the Linux Foundation, has formed a working group to steer the development of DocLang, an AI-friendly document format that aims to help enterprises feed their files to AI systems.
The DocLang group, founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, contends that existing formats like PDF, Markdown, HTML, and LaTeX are ill-suited for AI document parsing.
In late 2024, IBM developed an open source toolkit called Docling to facilitate AI document parsing, not unlike Microsoft's MarkItDown or the Marker project. Docling provides a way to convert various file formats into structured AI-ready data. DocLang expands upon that foundation with a standard for exchanging structured output across different systems.
"DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines," said Maxime Vermeir, VP of AI Strategy at AI automation biz ABBYY, in a statement. "By introducing a minimal, standardized, and AI-native representation of document structure, layout, meaning and governance, DocLang creates a far more deterministic foundation for modern AI systems."
...read more at theregister.com
I built this pipeline with mineru last year to get markdown from pdf - it used to perform better than docling, but I haven't compared lately - and for my own meat OCR devices, markdown is better than PDF or MS Word. Easier to read, and I can grep it.
I'm not sure why they think markdown is poor for LLMs though - all the current models have been trained on it. It's the cleanest format we have? Transitioning back to XML again is a bit awful imho, but maybe I am missing their insight here. Luckily it should be easy to convert markdown to XML.
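To back up the "easy to convert" bit, here is a rough sketch of going from markdown to XML using only the Python standard library. It handles just `#` headings and plain paragraphs, and the element names (`document`, `heading`, `p`) are made up for illustration. They are an assumption, not DocLang's actual schema.

```python
import re
import xml.etree.ElementTree as ET


def markdown_to_xml(text: str) -> str:
    """Convert a tiny Markdown subset (ATX headings, paragraphs) to XML."""
    # Hypothetical element names, chosen for this sketch only.
    root = ET.Element("document")
    paragraph: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one <p> element, then reset the buffer.
        if paragraph:
            ET.SubElement(root, "p").text = " ".join(paragraph)
            paragraph.clear()

    for line in text.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.*)$", line)
        if heading:
            flush()
            node = ET.SubElement(root, "heading", level=str(len(heading.group(1))))
            node.text = heading.group(2).strip()
        elif line.strip():
            paragraph.append(line.strip())
        else:
            # A blank line ends the current paragraph.
            flush()
    flush()
    # ElementTree escapes characters such as < and & in text nodes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(markdown_to_xml("# Title\n\nHello\nworld\n"))
```

Lists, tables, inline emphasis and code fences are deliberately left out. A real pipeline would use a proper markdown parser rather than a regex.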
this?
Right. But why not build that into the tokenizer?
Why would that even be better? Isn’t the whole point to cut down on token costs? That’s an ignorant question! ~lol
Tokenizer makes tokens from text. This says: convert your stuff to this first, then feed to a tokenizer. The examples are "converted" to their XML thingy. Hence, make it a feature of the tokenizer and don't bother people with conversions. I'm sure GPT could have told them this too.
Because everybody wants to use random irrelevant garbage input and expect a determinate, repeatable output.
markdown's usually plenty — @optimism's right, the models trained on it and you can grep it. "Reformat everything" is boiling the ocean.
Where a strict format actually pays off isn't documents, it's the bits an agent has to act on, not read: price, endpoint, auth, terms. That wants a tight typed schema — and only on the transaction surface.
Funny timing — I've been chewing on this one layer up: how an agent advertises a service for sale so another agent can parse and buy it with no human in the loop. Same answer — prose stays markdown, the machine-actionable part gets a small schema. Standardize what gets transacted and skip rewriting every PDF on earth.
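To make "small schema on the transaction surface" concrete, here is a minimal sketch in Python. The `ServiceOffer` type, its field names, the allowed auth values and the https-only rule are all assumptions for illustration, not an existing standard.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from urllib.parse import urlparse

# Hypothetical set of auth methods a buyer agent knows how to handle.
ALLOWED_AUTH = {"none", "api_key", "oauth2"}


@dataclass(frozen=True)
class ServiceOffer:
    """The machine-actionable part of a listing; prose stays elsewhere."""
    price: Decimal
    currency: str
    endpoint: str
    auth: str
    terms_url: str


def parse_offer(raw: dict) -> ServiceOffer:
    """Validate an untrusted dict strictly and return a typed offer."""
    expected = set(ServiceOffer.__dataclass_fields__)
    if set(raw) != expected:
        # Reject both missing and unknown fields: no guessing on money.
        raise ValueError(f"fields must be exactly {sorted(expected)}")
    try:
        price = Decimal(str(raw["price"]))
    except InvalidOperation as exc:
        raise ValueError("price is not a decimal number") from exc
    if not price.is_finite() or price < 0:
        raise ValueError("price must be a finite, non-negative number")
    for key in ("endpoint", "terms_url"):
        if urlparse(str(raw[key])).scheme != "https":
            raise ValueError(f"{key} must be an https URL")
    auth = raw["auth"]
    if not isinstance(auth, str) or auth not in ALLOWED_AUTH:
        raise ValueError("unsupported auth method")
    return ServiceOffer(price, str(raw["currency"]), str(raw["endpoint"]),
                        auth, str(raw["terms_url"]))


if __name__ == "__main__":
    print(parse_offer({
        "price": "0.25",
        "currency": "USD",
        "endpoint": "https://api.example.com/v1/summarize",
        "auth": "api_key",
        "terms_url": "https://example.com/terms",
    }))
```

The strictness is the point: anything the buying agent would act on either parses into a known type or gets rejected, while the human-readable description can stay as loose markdown.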