We explore AI scraping and the options for preventing it
There are bots crawling all over the internet. They’ve been doing it for a long time — since the early days of search engines like Google. Bots crawl websites, index the content, and report back to search engines. When someone pops a search term or question into the search engine, the response is informed by what the bots have found.
Web crawlers or “spiders” have been welcomed by web developers in the past because they help drive traffic to clients’ sites. And there are systems in place to guide these bots on which pages they can and can’t index.
AI crawlers work in a slightly different way because they have different goals. When an AI bot crawls a website, it might be looking for one or all of the following:
● Data to train a large language model
● Information for users to complement the content of an AI summary
● Content to index so that it can be retrieved again when needed
When this kind of data retrieval from websites is automated and bot-led, it is known as AI scraping.
The ability to retrieve and process large volumes of data can be beneficial to charities in some circumstances — in research and data analysis, for example. But without proper safeguards, data that is private, personal, or the intellectual property of third sector organisations could be scraped and used to train AI.
This is a growing issue, and deciding how charity content may be used is an increasing concern for the sector. Data from Cloudflare shows that from July 2024 to July 2025, raw requests from GPTBot rose 147% and raw requests from Meta-ExternalAgent rose 843%. These bots scrape data to train ChatGPT and Meta’s AI models respectively.
If an AI model has been trained on a charity’s data, it’s unlikely to repeat that data verbatim in response to a question or search query, but it is possible for AI models to memorise strings of data which can be extracted in cyber attacks. Being able to classify content as public, available for a limited audience, private, or protected could help charities think through where and how to publish that content, and where to limit AI access.
Charities may want to make some content accessible to AI crawler and scraper bots while actively trying to conceal other content.
For example, in a charity’s role as a subject expert and educator, it might want to make factual information about the issue it works on available for scraping to help prevent AI misinformation on the topic.
Charities delivering services directly to communities may use personal stories on their website and in social media content. While they should have sought permission from the person whose story is being told, they may not have discussed the possibility of the story being scraped by AI for indexing and training, and may therefore want to prevent this from happening.
If social sector organisations get to a stage where they have guidelines on which types of content are okay to be scraped and which types should be protected, there are a few ways to try to shield content from AI bots.
A robots.txt file gives bots instructions on how to access your website. The use of these files is evolving to take AI crawler and scraper bots into account. Charities can now instruct either all bots or specific bots on which website pages they are and aren’t allowed to crawl or scrape.
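As a rough illustration of per-bot rules, the sketch below defines a small robots.txt and checks it with Python's standard `urllib.robotparser` module. The paths (such as `/stories/`) and the choice of which bots to restrict are assumptions made for the example, not recommendations from the article; the user-agent names are the two crawlers mentioned in the Cloudflare data above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a charity site. The /stories/ path is an
# illustrative assumption; real sites will have their own structure.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /stories/

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, path: str) -> bool:
    """Return True if the rules above permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("GPTBot", "/stories/anna"),
        ("GPTBot", "/facts/"),
        ("Meta-ExternalAgent", "/facts/"),
        ("Googlebot", "/stories/anna"),
    ]:
        print(agent, path, can_crawl(agent, path))
```

Here GPTBot is kept out of personal stories but may read factual pages, Meta-ExternalAgent is asked to stay away entirely, and every other bot is allowed everywhere. These rules are requests: they only work if the bot chooses to honour them.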
It’s also possible to add instructions for bots on a specific page, or an object on a page, using an HTTP header or a meta tag. One example is the ‘X-Robots-Tag: noindex’ header, which asks bots not to index the content.
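To make the page-level approach concrete, here is a minimal sketch of a Python WSGI application that adds the `X-Robots-Tag: noindex` header to responses under certain paths, along with a helper that builds the equivalent meta tag. The `PROTECTED_PREFIXES` value and the function names are hypothetical; in practice this is usually configured in the web server or content management system rather than written by hand.

```python
# Hypothetical path prefixes whose pages should carry a noindex instruction.
PROTECTED_PREFIXES = ("/stories/",)


def robots_headers(path: str) -> list[tuple[str, str]]:
    """Return extra response headers for path (empty if it is not protected)."""
    if path.startswith(PROTECTED_PREFIXES):
        return [("X-Robots-Tag", "noindex")]
    return []


def robots_meta_tag(directives: tuple[str, ...] = ("noindex",)) -> str:
    """Build the HTML meta tag equivalent, for placing in a page's <head>."""
    content = ", ".join(directives)
    return f'<meta name="robots" content="{content}">'


def app(environ, start_response):
    """Tiny WSGI app: serves a placeholder page with the right headers."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    headers += robots_headers(path)
    start_response("200 OK", headers)
    return [b"<p>Placeholder page</p>"]
```

The header suits non-HTML objects such as images or PDFs, where a meta tag cannot be embedded; the meta tag suits individual HTML pages. As with robots.txt, both depend on the bot choosing to comply.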
Charities can make it harder for bots to scrape images from their websites by using cloaking tools that distort images for bots, or using poisoning tools that cause bots to misinterpret what they’re seeing.
There are also options at the content management system or platform level that charities can use to prevent AI from scraping content. Examples include password-protecting web pages, or changing settings on website and social media platforms to opt out of AI training.
Dealing with AI scraping isn’t a simple case of following best practice. There are ethical questions at stake which may be answered differently by each organisation and for each type of content.
Some scraping prevention methods rely on voluntary compliance from AI companies, and AI legislation is too early-stage and patchy to fully cover the multi-jurisdictional issues that AI presents. Making decisions about how content should or shouldn’t be used by AI, and staying informed about methods to handle crawling, are the cornerstones of a proactive approach.