Essentially, DocLang is optimized for LLM tokenizers through markup that maps between DocLang elements and LLM tokens on a 1-to-1 basis. The spec relies on a limited XML vocabulary that aligns with LLM tokenizers to produce optimized prompts. It is lossless, so the AI conversion doesn't do away with valuable info. It's designed to support common graphical elements like tables, formulas, charts, and multimodal content. And it's an open standard.
DocLang could also help keep costs under control. According to AI Cost Check, having an AI model conduct an OCR scan on a PDF requires about 1,200 input tokens and 150 output tokens as a baseline.
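To put that baseline in money terms, here is a rough back-of-the-envelope sketch. Only the 1,200 input / 150 output token figures come from the excerpt above; the function name and the per-million-token prices are made-up placeholders, so plug in your own provider's rates.

```python
def ocr_scan_cost_usd(
    scans: int,
    input_price_per_million: float,
    output_price_per_million: float,
    input_tokens_per_scan: int = 1200,   # baseline quoted in the excerpt
    output_tokens_per_scan: int = 150,   # baseline quoted in the excerpt
) -> float:
    """Estimate the total USD cost of running `scans` OCR scans."""
    input_cost = scans * input_tokens_per_scan * input_price_per_million / 1_000_000
    output_cost = scans * output_tokens_per_scan * output_price_per_million / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical prices: $1.00 per million input tokens, $4.00 per million output tokens.
    total = ocr_scan_cost_usd(1000, input_price_per_million=1.00, output_price_per_million=4.00)
    print(f"1000 scans: ${total:.2f}")
```

At these placeholder rates the input side accounts for most of the bill, which is presumably why a more compact input encoding is pitched as a cost saver.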
Why would that even be better? Isn’t the whole point to cut down on token costs? That’s an ignorant question! ~lol
A tokenizer makes tokens from text. This says: convert your stuff to this first, then feed it to a tokenizer. The examples are "converted" to their XML thingy. Hence, make it a feature of the tokenizer and don't bother people with conversions. I'm sure GPT could have told them this too.
I built this pipeline with MinerU last year to get markdown from PDF - it used to perform better than Docling, but I haven't compared lately - and for my own meat OCR devices, markdown is better than PDF or MS Word. Easier to read, and I can grep it.
I'm not sure why they think markdown is poor for LLMs though - all the current models have been trained on it. It's the cleanest format we have? Transitioning back to XML again is a bit awful imho, but maybe I am missing their insight here. Luckily it should be easy to convert markdown to XML.
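For what it's worth, a minimal sketch of that markdown-to-XML step, using only the Python standard library. It handles just headings, plain paragraphs, and single-level bullet items, and the element names (`document`, `heading`, `paragraph`, `list_item`) are invented for illustration; they are not DocLang's actual vocabulary.

```python
import xml.etree.ElementTree as ET


def markdown_to_xml(markdown_text: str) -> str:
    """Convert a tiny markdown subset to an XML string (illustrative only)."""
    root = ET.Element("document")  # hypothetical root element name
    paragraph_lines = []

    def flush_paragraph() -> None:
        # Emit any buffered lines as one paragraph element.
        if paragraph_lines:
            paragraph = ET.SubElement(root, "paragraph")
            paragraph.text = " ".join(paragraph_lines)
            paragraph_lines.clear()

    for raw_line in markdown_text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line ends the current paragraph.
            flush_paragraph()
        elif line.startswith("#"):
            flush_paragraph()
            level = len(line) - len(line.lstrip("#"))  # count leading '#'
            heading = ET.SubElement(root, "heading", level=str(level))
            heading.text = line[level:].strip()
        elif line.startswith(("- ", "* ")):
            flush_paragraph()
            item = ET.SubElement(root, "list_item")
            item.text = line[2:].strip()
        else:
            paragraph_lines.append(line)

    flush_paragraph()
    # ElementTree escapes &, < and > in text for us.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(markdown_to_xml("# Title\n\nHello\nworld\n\n- a & b"))
```

The simple cases really are this short. Tables, formulas, and nested lists are not covered here, and those would need a real markdown parser rather than line matching.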