We learn about the basics of AI web scraping and hear different perspectives on the opportunities and threats it poses for charities.
Web scraping describes a process where bots extract data automatically from websites. It’s used by artificial intelligence (AI) system developers to gather information for training AI tools.
Generative AI chatbots, like ChatGPT, Bard, or the new Bing, are built on large language models (LLMs), which have to be trained on huge datasets, also called big data. The internet, with billions of web pages available for scraping, is an ideal source of big data.
Earlier in 2023, OpenAI announced that ChatGPT-4 was able to connect to the internet and access up-to-date information. This likely uses a combination of web scraping and web crawler bots, which are already commonly used by search engines like Google.
Web scraping and crawling aren’t totally new issues for charities. For most non-profits, appearing at the top of Google’s search ranking, as a result of their site being crawled, is actually a goal.
In the future, it’s possible that more and more people will bypass traditional search engines and ask generative AI chatbots for answers instead. David Mitchell, Cryptocurrency and Digital Partnerships Manager at Edinburgh Dog and Cat Home, says, “just like with SEO rankings, charities…will soon be striving to establish strong associations with popular AIs, so they are more likely to be surfaced and cited in AI responses.”
In many cases, raising awareness or providing expert advice is a critical element of a charity’s mission. There are clear advantages to the approach that Mitchell outlines with tools like the new Bing, which cites its sources and links back to the websites it has crawled for answers.
But there’s a flip side. Text and visual content that’s been crawled or scraped from charities’ websites could be used by generative AI tools to create something new, drawing on either the content itself or the style it’s been created in.
In the creative sector this issue is already causing uproar. A group of artists, including concept artist Karla Ortiz, are suing Midjourney, Stability AI, and DeviantArt, the companies behind image-generating AI tools, for using their work as AI training material.
In a dispute with a more conclusive outcome, the Writers Guild of America went on strike between May and September 2023 over a number of issues, including the threat of generative AI replacing screenwriters – and they won. Their new union contract with the Alliance of Motion Picture and Television Producers guarantees that AI cannot be credited as a writer.
At the Charity Excellence Framework, Founder Ian McLintock is addressing the issue from both sides – as an AI tool developer and as an advisor to the charity sector.
McLintock says, “I web scraped the statutory websites. When people go looking for answers they go looking on the Charity Commission or they go looking on the Fundraising Regulator or they go looking on the Gambling Commission website and there’s stuff on all of them.”
By bringing the information together from these sites into curated knowledge banks, McLintock has been able to train his AI tools to answer questions about governance and regulations from charity professionals.
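At its simplest, the scraping step McLintock describes means downloading a page and stripping the markup away so only the readable text is left for the knowledge bank. A minimal sketch using only Python’s standard library is below; the HTML snippet is hypothetical, standing in for a page that would in practice be fetched from a regulator’s site.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # inside how many script/style tags we are

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical markup standing in for a downloaded guidance page.
page = """
<html><head><style>body { color: red; }</style></head>
<body><h1>Fundraising guidance</h1>
<p>Charities must keep accurate records.</p></body></html>
"""

parser = TextExtractor()
parser.feed(page)
text = " ".join(parser.chunks)
print(text)  # "Fundraising guidance Charities must keep accurate records."
```

Text extracted this way from many pages could then be curated into the kind of knowledge bank the article describes, before any AI training takes place.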
He’s able to do this freely by using the Open Government Licence for public sector information which allows it to be copied, published, distributed, adapted and used for commercial purposes, providing a link is included.
For commercial or non-profit websites, no such licence exists, but web scrapers and crawlers can legally access data that isn’t personal or subject to intellectual property rights.
A similar panic about bots accessing website data emerged around 30 years ago with the rise of search engines. It was dealt with by a community-developed industry standard called the Robots Exclusion Protocol (REP).
When developers are building a website they can add a Robots.txt file to detail which parts of the site automated crawlers can access.
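A crawler that honours the protocol reads robots.txt before fetching anything. Python’s standard library includes a parser for these files, which makes the mechanism easy to demonstrate; the rules and URLs below are illustrative, not taken from any real site.

```python
import urllib.robotparser

# Illustrative robots.txt: block GPTBot from /donors/, allow everything else.
rules = [
    "User-agent: GPTBot",
    "Disallow: /donors/",
    "",
    "User-agent: *",
    "Disallow:",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks before each fetch:
print(rp.can_fetch("GPTBot", "https://example.org/donors/list"))     # False
print(rp.can_fetch("GPTBot", "https://example.org/about"))           # True
print(rp.can_fetch("Googlebot", "https://example.org/donors/list"))  # True
```

The important caveat is that this is cooperative: robots.txt only restrains bots that choose to check it.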
OpenAI has now published robots.txt guidance for GPTBot (the name of its web crawler) to help site owners instruct it on where it can and can’t crawl.
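Following that guidance, a charity that wanted to keep GPTBot away from part of its site, say a supporters area, while leaving the rest open to all crawlers could add rules like these to its robots.txt (the paths here are hypothetical):

```
User-agent: GPTBot
Disallow: /supporters/

User-agent: *
Allow: /
```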
At Charity Excellence, McLintock has used meta tags on articles and images such as ‘NoAI’ or ‘NoImageAI’ as well as amending the site’s terms and conditions to include a section on web scraping.
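The ‘NoAI’ tags McLintock mentions are an informal convention rather than part of the Robots Exclusion Protocol: a meta tag in a page’s head asks AI scrapers not to use the content, but compliance is entirely voluntary. A minimal example of what such a tag looks like:

```html
<head>
  <!-- Informal directives asking AI scrapers not to use this page or its images -->
  <meta name="robots" content="noai, noimageai">
</head>
```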
Scraping or crawling for personal data or content considered intellectual property (IP) is covered by existing GDPR or IP laws, but this is a very live topic. With emerging law on AI expected from the EU and the US soon, new precedents are being set all the time.
At the heart of this issue for charities is the question: to what extent does use of our website content by generative AI tools support our charitable aims?