By Zack Davies

Artificial Impertinence – Are A.I. bots crawling your site?

October 4th, 2023

A.I. is definitely the Hot New Thing, and we’ve only just begun to see what it can do. But, as with any innovation, there are concerns about where its value ultimately comes from.

Those in the creative sectors have, understandably, been the loudest dissenting voices, and there are already initiatives such as Glaze to help protect visual artists from having their work appropriated. But what about the rest of us, the common-or-garden website owners and builders?

What is the problem with A.I. bots?

Earlier this year, OpenAI, the lab behind the wildly popular A.I. applications DALL-E and ChatGPT, released details of their web crawler, GPTBot. OpenAI are, understandably, hungry for data with which to train their applications, and the most obvious source of data is other people’s websites. Hence their new bot, which will visit sites and gather text, images and video to feed their machine learning models.

While companies like Google already use bots to visit your website (I’m assuming you have a website – if not, please talk to our sales team), this is more of a quid pro quo arrangement. Google get information about your site and, in return, people can find your site when they search for it (if they can’t, have you considered SEO?). OpenAI, on the other hand, have less to offer as payment in kind.

So, responsibly, OpenAI have been open about their intentions. They’ve released documentation telling us what they plan to do, plus how to opt out. The key is your robots.txt file, which you almost certainly already have in place on your website.

What is a robots.txt file?

Robots.txt is a file that tells automated bots how they should behave if they want to carry on being allowed to crawl your site. It can include things like which pages ought to be indexed or not (by search engines), which pages bots are not allowed to visit, how frequently bots should access your pages, and whether any specific bots are disallowed. It’s this last feature that OpenAI have leveraged in order to give site owners control over whether their content is gathered.
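To make that a little more concrete, here is a minimal sketch in Python that reads a small robots.txt and asks what different bots may do, using the standard library's urllib.robotparser module. The file contents are invented for illustration: the /admin/ path, the ten-second delay and the name ExampleBot are assumptions, while GPTBot is the user agent name the crawler identifies itself with. Bear in mind that not every crawler honours the Crawl-delay line.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content. The /admin/ path and the delay value
# are made up for this example; GPTBot is OpenAI's crawler name.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10

User-agent: GPTBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler that falls under the "*" rules.
print(parser.can_fetch("ExampleBot", "/blog/post"))    # ordinary page
print(parser.can_fetch("ExampleBot", "/admin/login"))  # disallowed path
print(parser.crawl_delay("ExampleBot"))                # requested delay in seconds

# GPTBot has its own record, which disallows the whole site.
print(parser.can_fetch("GPTBot", "/blog/post"))
```

The last check is the one that matters here: a record naming a specific bot, followed by a blanket Disallow, is how a site owner tells that bot to stay away entirely.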

Should I allow A.I. to crawl my website?

There are multiple reasons why you might not want A.I. bots crawling your site. You might be worried about the legality of sharing customer comments, you might want to conserve your site’s resources, or you might just not agree with the principle that your content is training material for someone else’s application. Or maybe you just feel it’s better to be safe than sorry.

For most websites, updating the robots.txt file is easy. It’s literally just a case of adding a few lines to a text file, then making sure that file is uploaded to wherever it needs to be. Though, as with any technical system, there may be cases where someone’s been very clever and made it a more complex process!
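As a rough illustration of how small the change is, the sketch below appends a GPTBot exclusion to a robots.txt file, and does nothing if the file already names GPTBot. The two lines it adds follow the opt-out OpenAI describe in their documentation; the helper name add_gptbot_exclusion and the sample file contents are our own inventions, and the demonstration writes to a temporary folder rather than a real site. Where your live robots.txt actually lives will depend on how your site is built.

```python
import tempfile
from pathlib import Path

# The record that asks GPTBot to stay away from the whole site.
GPTBOT_BLOCK = "User-agent: GPTBot\nDisallow: /\n"


def add_gptbot_exclusion(robots_path: Path) -> bool:
    """Append the GPTBot opt-out to robots.txt. Return True if the file changed.

    Hypothetical helper for illustration only.
    """
    existing = robots_path.read_text(encoding="utf-8") if robots_path.exists() else ""

    # Simple check: leave the file alone if a record already names GPTBot.
    for line in existing.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent" and value.strip().lower() == "gptbot":
            return False

    separator = ""
    if existing and not existing.endswith("\n"):
        separator = "\n"   # finish the last line first
    if existing.strip():
        separator += "\n"  # blank line between records

    robots_path.write_text(existing + separator + GPTBOT_BLOCK, encoding="utf-8")
    return True


# Demonstration against a throwaway file, not a real website.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "robots.txt"
    demo_path.write_text("User-agent: *\nDisallow: /admin/\n", encoding="utf-8")
    first_run = add_gptbot_exclusion(demo_path)   # adds the record
    second_run = add_gptbot_exclusion(demo_path)  # already present, no change
    result = demo_path.read_text(encoding="utf-8")

print(result)
```

Running it twice on the same file is deliberate: the second call should leave the file untouched, so the exclusion is never duplicated.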

So, you’ve added a new exclusion to your robots.txt file. Are you now safe from A.I. crawlers? Well, not quite. Robots.txt files are a set of guidelines, a gentleman’s agreement that if bots abide by your rules, they may continue to visit your site without you taking measures to block them. While OpenAI appear to be acting in good faith by highlighting how they operate, not everyone on the internet is so honourable.

That said, it’s heartening to know that the industry’s biggest players are taking this issue seriously. The last thing we at Logic+Magic want to do is cause a panic in the hope of making a quick buck. While we encourage our clients to consider the impact of A.I. bots and, if necessary, to make the changes suggested by OpenAI, this is not something that’s going to break the bank – or, for that matter, break your website.

For advice on how to approach A.I. bots crawling your website, don’t hesitate to get in touch with us.

Zack Davies

Developer

Zack has been a Drupal developer since the late 2000s, covering diverse areas of development such as financial reporting, API integration, cloud infrastructure and deployment management.