The internet has always had bots on it. Search engines like Google, for instance, have used automated "crawlers" for decades to build the indexes that make websites show up when you search for certain terms.

But a new generation of crawlers, used to gather training data for generative AI, is starting to threaten much of the web's economy. These crawlers undermine the business models of huge numbers of websites and raise privacy concerns.

It is possible to stop them from absorbing your material. To a point.

The race to build better blocking tools

One way around the AI conundrum is to poison the data: add something to your content that makes it difficult for AIs to learn from. Researchers have developed blocking tools, such as image filters, that add useless "noise" that confuses AIs while leaving the content looking similar or identical to human viewers.

Salil Kanhere, a computer scientist at the University of New South Wales, warns that AI developers are constantly on the hunt for ways around these tools.

An Australian team has been hunting for a way to keep these blocking tools working. Their research, which is still in its early stages, claims to be a step towards "provably unlearnable" content.

Derek Wang, a collaborator on the project and a computer scientist at CSIRO, says most available blocking tools are very specific: they are designed to prevent a certain type of AI (such as a large language model) from training on a certain type of content (such as text).

His team has instead taken a step back and made an algorithm that can judge how learnable any type of content is, for any type of AI.

"This itself is very significant information that can help defenders to polish and update their defences," Dr Wang says.

The algorithm can help to build a much stronger blocking tool, one where the noise obscures the most learnable parts of the content.

The team demonstrated their algorithm by using it to make a noise-generating tool for images, which Dr Wang says can "guarantee" an image will be impenetrable to most AIs.

"Basically, our guarantee can rule out about 90 per cent of attacks," Dr Wang says.

He says they can also adjust the blocker to let more or less information through, if people want an AI to get the vague idea of their images but not the details.
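The CSIRO algorithm itself isn't spelled out here, but the constraint all of these image blockers operate under is easy to sketch: every pixel may only shift by a few intensity levels, so the picture looks unchanged to a person. Below is a minimal Python illustration of that noise budget, assuming numpy and Pillow are installed; the file names and the budget of 8 levels are arbitrary, and real tools optimise the noise against surrogate AI models rather than drawing it at random, which is what actually makes the content hard to learn.

```python
import numpy as np
from PIL import Image

EPSILON = 8  # maximum per-pixel shift on the 0-255 scale; small enough to be hard to see

def add_bounded_noise(src: str, dst: str, eps: int = EPSILON) -> None:
    """Perturb an image within a +/- eps budget: the 'imperceptible noise' constraint."""
    img = np.asarray(Image.open(src).convert("RGB"), dtype=np.int16)
    # Uniform random perturbation in [-eps, eps] for every pixel and channel.
    noise = np.random.randint(-eps, eps + 1, size=img.shape, dtype=np.int16)
    # Clip back into the valid pixel range so the result is still a normal image.
    out = np.clip(img + noise, 0, 255).astype(np.uint8)
    # PNG is lossless, so the perturbation survives saving.
    Image.fromarray(out).save(dst, format="PNG")

# Hypothetical file names, for illustration only.
add_bounded_noise("artwork.png", "artwork_protected.png")
```

The lossless save matters: compressing to a lossy format like JPEG can smooth a perturbation away, which is one of the routes AI developers can take around these defences.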
The team demonstrated their work at a conference earlier this year, and Dr Wang says various online image creators have been in touch to see if they can use the algorithm to protect their work.

They're focusing on images at the moment, but Dr Wang says it would be possible to use their base algorithm to develop other types of blockers. Some things may be harder to protect than others, such as text, which relies on a few dozen characters rather than the huge variation of information in an image.

Professor Kanhere, who wasn't involved in the study, calls it "clever", but warns that AIs can still evolve to overcome these blocks.

Dr Wang agrees, but points out that his team can also improve their work. With time, he thinks they could refine their theory to rule out 99.9 per cent of AI adversaries.

"Our method is a very initial step for advancing the theoretical foundation in this area to achieve really unlearnable stuff," Dr Wang says.

It's part of the evolution of crawlers and blockers that Professor Kanhere terms a "cat-and-mouse game".

"This is a good technical solution, but I think we need to come up with a different model," he says.

Switching off AI access to websites

There is a simpler way to stop crawlers from accessing content: ask them not to.

Professor Kanhere says websites typically host a robots.txt file, a plain-text file at the site's root, that instructs crawlers which pages they can access, and when.

"The crawlers are supposed to adhere to whatever you put in there. Now, of course, whether they do that or not is up for debate," Professor Kanhere says.
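In practice the file is a short list of rules keyed to each crawler's self-declared user agent. Here is a sketch of what blocking AI training crawlers while leaving ordinary search alone might look like; the three bot names are user-agent tokens that OpenAI, Google and Common Crawl publicly document, but, as Professor Kanhere notes, compliance is voluntary.

```
# https://example.com/robots.txt (hypothetical site)

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Google's AI-training opt-out token (regular Google Search is unaffected)
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose archives are widely used as AI training data
User-agent: CCBot
Disallow: /

# Everything else, including ordinary search crawlers, remains allowed
User-agent: *
Disallow:
```

A crawler that ignores the file, or lies about its user agent, sails straight past this, which is why the standards and commercial blockers discussed below exist.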
He points out that developers have been working on new standards for websites and AIs, such as the RSL Standard, which people can use on their own websites.

"That will specify what content an AI bot can scrape, what sort of licensing agreement you would like, and monetisation," Professor Kanhere says.

Some website providers have started to include AI blockers at scale, among them internet infrastructure company Cloudflare, which provides services for about a fifth of all sites on the web. In July, it announced customers' sites would block AI crawlers by default.

Will Allen, vice president of product at Cloudflare, says the move is designed to give website owners choice over how their content is used.

Site owners can still give their content to AI crawlers for free if they want their material to appear in the models. But news sites, for instance, may instead want to work out a payment model with AI companies, allowing access to their work only if they're compensated for it.

"We see a real recognition of differentiated access to real-time information, data and content as being an incredible competitive advantage for people who are building AI experiences," Mr Allen says.

Are AI companies respecting these rules?

All of this hinges on the assumption that AI crawlers will act in good faith.

Historically, AI companies have not asked for permission to access content, with many of the biggest players in the game using reams of copyrighted works to train their early models.

But Mr Allen thinks that at least the larger companies are playing fair.

"The well-behaving [crawlers], which is a good number of them, self-identify."

He says some AI companies have called Cloudflare when they've wanted to access certain pages they're blocked from.

"If they were being shady, they wouldn't care. They would just do it."

Crawlers don't have to identify as company-run bots, though: they can pretend to be human users and skip all these restrictions. When that happens, they run the risk of being flagged as malicious bots and blocked from the site entirely.

"Once you're off that happy path of self-identification, you just look like a bot to us," Mr Allen says.

Future of open internet 'at risk'

AI summaries have prompted declines in page views across the internet, and with paywalls springing up to compensate for lost ad revenue, many fear for the changing shape of the web.

Mr Allen wants to see Cloudflare's model nurturing a still-open internet.

"The internet's an amazing, amazing invention and one of the most amazing parts of it is the fact that large parts have been open," he says.

Professor Kanhere says Cloudflare's model "is a good start", but there are still flaws. Turning bot defences up high can stop some human users from visiting a site. And AI companies that don't want to pay for content may still scrape less accurate versions of it from elsewhere.

"I guess the question is, is it worth the effort for these companies then to try to subvert this?" Professor Kanhere says.

If AI companies decide the material is not worth paying for, Professor Kanhere predicts we'll run straight back into the cat-and-mouse game of technological blocks.

But his gut feeling is that the AI companies will agree on some conditions, similar to OpenAI's licensing deals with news publishers.

Either way, as web pages field ever higher numbers of bot visitors, human views are tracking in the opposite direction.

"Maybe [licensing deals] get more widely adopted and then we don't have to rethink the whole internet," Professor Kanhere says.

"But certainly those days where people are going to actual websites — unfortunately that seems to have declined quite a lot."