
Most people are shitcoiners - this shouldn't be surprising

However, the entire Sci-Hub collection is available through torrents, so all of this can be reproduced, shared, and published:

  1. Get a huge drive
  2. Iterate through 100TB of libgen torrents
  3. Normalize everything into markdown -> share this
  4. Index it all -> share this
  5. Train a small model on searching the index (with structured output) -> share this
  6. Inject the found references into a big model and let it reason -> share this
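
To make step 4 a bit more concrete, here is a rough sketch of what "index it all" could look like at toy scale: an in-memory inverted index with a simple TF-IDF-style score. The names `tokenize`, `build_index`, and `search` are made up for illustration, and the scoring formula is just one common choice. At 100TB you would want a real search engine or a vector store; this only shows the shape of the idea.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs):
    """Build an inverted index from {doc_id: markdown_text}.

    Returns (postings, doc_count), where postings maps
    term -> {doc_id: term frequency}.
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf
    return postings, len(docs)


def search(postings, doc_count, query, k=5):
    """Return up to k doc ids ranked by a simple TF-IDF-style score."""
    scores = Counter()
    for term in set(tokenize(query)):
        hits = postings.get(term, {})
        if not hits:
            continue  # term does not occur anywhere in the corpus
        idf = math.log(1 + doc_count / len(hits))
        for doc_id, tf in hits.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    return [doc_id for doc_id, _ in scores.most_common(k)]


if __name__ == "__main__":
    # Hypothetical three-document corpus, standing in for normalized markdown.
    docs = {
        "a": "# CRISPR gene editing in yeast",
        "b": "# Graph neural networks for chemistry",
        "c": "Gene expression in yeast cells",
    }
    postings, n = build_index(docs)
    print(search(postings, n, "graph neural"))
```

The doc ids returned by `search` are what step 6 would look up and paste into the big model's prompt.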

Looks like a fun little (hmm) side project :)


I'm quite sure that this is one of the things you can let an LLM build for you. Run it against a small test corpus, check results, improve, repeat.

It doesn't have to take a lot of your time; the cost is mostly wall time, GPU ticks, CPU ticks, and network.
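
A sketch of the "run it against a small test corpus, check results" part, assuming you hand-label a few queries with the documents they should find. `recall_at_k`, `toy_search`, and the three-paper corpus are all hypothetical; plug in whatever search function the pipeline actually produces.

```python
def recall_at_k(search_fn, labelled_queries, k=5):
    """Fraction of known-relevant docs that show up in the top k results.

    labelled_queries is a list of (query, set_of_relevant_doc_ids).
    """
    hits = 0
    total = 0
    for query, relevant in labelled_queries:
        returned = set(search_fn(query)[:k])
        hits += len(returned & relevant)
        total += len(relevant)
    return hits / total if total else 0.0


def toy_search(query):
    """Stand-in search: return docs sharing at least one word with the query."""
    corpus = {
        "p1": "protein folding with deep learning",
        "p2": "soil microbiome survey",
        "p3": "deep learning for weather",
    }
    words = query.lower().split()
    return [
        doc_id
        for doc_id, text in corpus.items()
        if any(word in text.split() for word in words)
    ]


if __name__ == "__main__":
    labelled = [
        ("deep learning", {"p1", "p3"}),
        ("microbiome", {"p2"}),
        ("quantum gravity", {"p4"}),  # p4 is not in the corpus, so this is a miss
    ]
    print(recall_at_k(toy_search, labelled))
```

Rerun this after every change to the normalizer or the index and watch whether the number moves; that is the whole "improve, repeat" loop.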
