Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.
The “It’s not clear who’s doing it” section is what baffles me. I’m far from a networking expert, and I’ve only done some crude scraping the few times I’ve needed it, but even I am pretty sure I could do better than this. It makes me think someone with more money than sense hooked up an “Agentic” AI, told it to gather data to train the next AI, and then just let it decide how. Except not, because I’m pretty sure the AIs would have more sense. It’s really weird.