~hyperlinks

Why would that even be better? Isn’t the whole point to cut down on token costs? That’s an ignorant question! ~lol

Right. But why not build that into the tokenizer?

optimism

... but maybe I am missing their insight here.

Essentially, DocLang is optimized for LLM tokenizers through markup that maps between DocLang elements and LLM tokens on a 1-to-1 basis. The spec relies on a limited XML vocabulary that aligns with LLM tokenizers to produce optimized prompts. It is lossless, so the AI conversion doesn't do away with valuable info. It's designed to support common graphical elements like tables, formulas, charts, and multimodal content. And it's an open standard.

DocLang could also help keep costs under control. According to [AI Cost Check](https://aicostcheck.com/blog/ai-ocr-document-processing-costs-2026), having an AI model conduct an OCR scan on a PDF requires about 1,200 input tokens and 150 output tokens as a baseline.

A modest proposal: Reformat everything to make documents more palatable to AI

0xbitcoiner

> ... but maybe I am missing their insight here.

this?


> Essentially, DocLang is optimized for LLM tokenizers through markup that maps between DocLang elements and LLM tokens on a 1-to-1 basis. The spec relies on a limited XML vocabulary that aligns with LLM tokenizers to produce optimized prompts. It is lossless, so the AI conversion doesn't do away with valuable info. It's designed to support common graphical elements like tables, formulas, charts, and multimodal content. And it's an open standard.
>
> DocLang could also help keep costs under control. According to [AI Cost Check](https://aicostcheck.com/blog/ai-ocr-document-processing-costs-2026), having an AI model conduct an OCR scan on a PDF requires about 1,200 input tokens and 150 output tokens as a baseline.
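
For a rough sense of what that baseline means in dollars, here is a back-of-the-envelope sketch. Only the 1,200 input / 150 output token counts come from the quote. The per-million-token prices and the `cost_per_doc` helper are made-up placeholders for illustration, so plug in your own provider's rates.

```python
# Baseline token counts from the quoted passage (one OCR pass over a PDF).
INPUT_TOKENS = 1_200
OUTPUT_TOKENS = 150


def cost_per_doc(input_price_per_m: float, output_price_per_m: float) -> float:
    """Return the USD cost of one document at the given per-million-token prices."""
    total = INPUT_TOKENS * input_price_per_m + OUTPUT_TOKENS * output_price_per_m
    return total / 1_000_000


# Hypothetical prices (NOT from the article): $1.00 per 1M input tokens,
# $4.00 per 1M output tokens.
per_doc = cost_per_doc(1.00, 4.00)
print(f"per document: ${per_doc:.6f}")
print(f"per 1M documents: ${per_doc * 1_000_000:,.2f}")
```

At those assumed rates the input side dominates because there are eight times as many input tokens as output tokens, which is presumably why a format that shrinks the input is the part being pitched as a cost saver.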