We learn about the basics of AI web scraping and hear different perspectives on the opportunities and threats it poses for charities.
Web scraping describes a process where bots extract data automatically from websites. It’s used by artificial intelligence (AI) system developers to gather information for training AI tools.
Generative AI chatbots, like ChatGPT, Bard, or the new Bing, are built on large language models (LLMs), which have to be trained on huge datasets, also called big data. The internet, with billions of web pages available for scraping, is an ideal source of big data.
Earlier in 2023, OpenAI announced that ChatGPT-4 was able to connect to the internet and access up-to-date information. This likely uses a combination of web scraping and web crawler bots, which are already commonly used by search engines like Google.
Web scraping and crawling aren’t totally new issues for charities. For most non-profits, appearing at the top of Google’s search ranking, as a result of their site being crawled, is actually a goal.
In the future, it’s possible that more and more people will bypass traditional search engines and ask generative AI chatbots for answers instead. David Mitchell, Cryptocurrency and Digital Partnerships Manager at Edinburgh Dog and Cat Home, says, “just like with SEO rankings, charities…will soon be striving to establish strong associations with popular AIs, so they are more likely to be surfaced and cited in AI responses.”
In many cases, raising awareness or providing expert advice is a critical element of a charity’s mission. There are clear advantages to the approach that Mitchell outlines with tools like the new Bing, which cites its sources and links back to the websites it has crawled for answers.
But there’s a flip side. Text and visual content that’s been crawled or scraped from charities’ websites could be used by generative AI tools to create something new, drawing on either the content itself or the style it’s been created in.
In the creative sector this issue is already causing uproar. A group of artists, including concept artist Karla Ortiz, are suing Midjourney, Stability AI, and DeviantArt, the companies behind image-generating AI tools, for using their work as AI training material.
In a dispute with a more conclusive outcome, the Writers Guild of America went on strike between May and September 2023 over a number of issues, including the threat of generative AI replacing screenwriters – and they won. Their new union contract with the Alliance of Motion Picture and Television Producers guarantees that AI cannot be credited as a writer.
At the Charity Excellence Framework, Founder Ian McLintock is addressing the issue from both sides – as an AI tool developer and as an advisor to the charity sector.
McLintock says, “I web scraped the statutory websites. When people go looking for answers they go looking on the Charity Commission or they go looking on the Fundraising Regulator or they go looking on the Gambling Commission website and there’s stuff on all of them.”
By bringing the information together from these sites into curated knowledge banks, McLintock has been able to train his AI tools to answer questions about governance and regulations from charity professionals.
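At its simplest, the scraping step McLintock describes means downloading a page and stripping the markup away so only the readable text is left for the knowledge bank. A minimal sketch using only Python’s standard library is below; the HTML snippet is hypothetical, standing in for a page that would in practice be fetched from a regulator’s site.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # inside how many script/style tags we are

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical markup standing in for a downloaded guidance page.
page = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Fundraising guidance</h1>
<p>Charities must keep accurate records.</p></body></html>
"""

parser = TextExtractor()
parser.feed(page)
text = " ".join(parser.chunks)
print(text)  # "Fundraising guidance Charities must keep accurate records."
```

Text extracted this way from many pages could then be curated into the kind of knowledge bank the article describes, before any AI training takes place.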
He’s able to do this freely by using the Open Government Licence for public sector information which allows it to be copied, published, distributed, adapted and used for commercial purposes, providing a link is included.
For commercial or non-profit websites, no such licence exists, but web scrapers and crawlers can legally access data that isn’t personal or subject to intellectual property rights.
A similar panic about bots accessing website data emerged around 30 years ago with the rise of search engines. It was dealt with by a community-developed industry standard called the Robots Exclusion Protocol (REP).
When developers are building a website they can add a Robots.txt file to detail which parts of the site automated crawlers can access.
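A crawler that honours the protocol reads robots.txt before fetching anything. Python’s standard library includes a parser for these files, which makes the mechanism easy to demonstrate; the rules and URLs below are illustrative, not taken from any real site.

```python
import urllib.robotparser

# Illustrative robots.txt: block GPTBot from /donors/, allow everything else.
rules = [
    "User-agent: GPTBot",
    "Disallow: /donors/",
    "",
    "User-agent: *",
    "Disallow:",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks before each fetch:
print(rp.can_fetch("GPTBot", "https://example.org/donors/list"))     # False
print(rp.can_fetch("GPTBot", "https://example.org/about"))           # True
print(rp.can_fetch("Googlebot", "https://example.org/donors/list"))  # True
```

The important caveat is that this is cooperative: robots.txt only restrains bots that choose to check it.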
OpenAI has now published robots.txt guidance for GPTBot (the name of its web crawler) to help site owners instruct it on where it can and can’t crawl.
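Following that guidance, a charity that wanted to keep GPTBot away from part of its site, say a supporters area, while leaving the rest open to all crawlers could add rules like these to its robots.txt (the paths here are hypothetical):

```
User-agent: GPTBot
Disallow: /supporters/

User-agent: *
Allow: /
```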
At Charity Excellence, McLintock has used meta tags on articles and images such as ‘NoAI’ or ‘NoImageAI’ as well as amending the site’s terms and conditions to include a section on web scraping.
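The ‘NoAI’ tags McLintock mentions are an informal convention rather than part of the Robots Exclusion Protocol: a meta tag in a page’s head asks AI scrapers not to use the content, but compliance is entirely voluntary. A minimal example of what such a tag looks like:

```html
<head>
  <!-- Informal directives asking AI scrapers not to use this page or its images -->
  <meta name="robots" content="noai, noimageai">
</head>
```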
Scraping or crawling for personal data or content considered intellectual property (IP) is covered by existing GDPR or IP laws, but this is a very live topic. With emerging law on AI expected from the EU and the US soon, new precedents are being set all the time.
At the heart of this issue for charities is the question: to what extent does use of our website content by generative AI tools support our charitable aims?