SEO · July 29, 2026 · 4 min read

AI crawlers explained — how to let the models into your site

Before ChatGPT can cite you, its crawler has to read your page. Here's what GPTBot and ClaudeBot are, how robots.txt controls them, and how to check access.

By Mediseo

Before ChatGPT or Perplexity can cite your business, their crawler has to be allowed to read your page. Plenty of websites lock them out without knowing it — and then neither good content nor neat structure helps.

The short version

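AI companies use their own crawlers to read the web, including GPTBot, ClaudeBot and PerplexityBot. Google-Extended is a related robots.txt token rather than a separate crawler: it controls whether pages fetched by Googlebot may be used for Google's AI models.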
You control them in robots.txt — and an old "block everything" rule can lock them out without you noticing.
Many AI crawlers run little or no JavaScript, so important text should sit in clean HTML.
Slowness and errors make crawlers give up and return less often.
Access is the first link in the chain — without it, nothing else counts.

What an AI crawler is

A crawler is a small program that visits web pages, reads the content and takes it back. Search engines have used them for years. What's new is that the AI companies now have their own.

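When ChatGPT looks something up to answer you, the pages it draws on were fetched by a program like this. The most common names to know today are GPTBot (OpenAI), ClaudeBot (Anthropic) and PerplexityBot, each reading the web on behalf of its company's models. Some companies run more than one agent: OpenAI, for example, documents GPTBot for training data alongside separate agents for search and for browsing on a user's request. Google-Extended is slightly different. It is a robots.txt token, not a separate crawler, and controls whether pages fetched by Googlebot may be used for Google's AI models.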

robots.txt — the door they knock on

robots.txt is a small file at the root of your site that tells crawlers what they may visit. This is where access is decided, and where things most often go wrong.

Many websites have inherited a rule that blocks "all crawlers" (typically User-agent: * followed by Disallow: /), set up long ago for an entirely different purpose. That rule now locks the AI crawlers out too, quietly and with no error message. The result is that the models never read your pages, and you stay invisible in AI answers no matter how good your content is.

So check two things in robots.txt:

Are you letting the AI crawlers in, or did you inherit a rule that locks them out?
Are there any important pages you genuinely should hold back — and if so, on purpose, not by accident?
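One way to run that check without guessing is Python's standard-library robots.txt parser. The sketch below is an illustration, not an official tool: the two robots.txt samples and the example.com URL are made up, and the function name ai_crawler_access is our own.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the article; extend the list as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def ai_crawler_access(robots_txt, url, crawlers=AI_CRAWLERS):
    """Return {crawler name: True/False} for whether each may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {name: parser.can_fetch(name, url) for name in crawlers}


# Hypothetical sample 1: an inherited "block everything" rule.
INHERITED = """\
User-agent: *
Disallow: /
"""

# Hypothetical sample 2: the same rule, with the AI crawlers let in on purpose.
DELIBERATE = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    page = "https://example.com/services"  # placeholder URL
    print(ai_crawler_access(INHERITED, page))   # every crawler is blocked
    print(ai_crawler_access(DELIBERATE, page))  # every listed crawler is allowed
```

Python's parser applies the first rule that matches, which is not necessarily how each crawler resolves conflicting Allow and Disallow lines. Treat the result as a first check, not a guarantee.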

Clean HTML beats hidden text

Access alone isn't enough; the crawler also has to be able to read the content. This is where many modern websites stumble. A number of AI crawlers run little or no JavaScript today. If your main content is assembled in the browser after the page loads, a crawler can get in and still find an almost empty page.

The rule of thumb: the important text should sit in the HTML your server delivers, not be built up afterwards. Then the crawler sees the same thing the reader does — and has something to cite.
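To see roughly what a crawler that runs no JavaScript gets, you can pull the text out of the raw HTML and ignore everything inside script tags. The following is a minimal sketch under that assumption. The two HTML snippets are invented, server_side_text is our own helper name, and real crawlers may extract text differently.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def server_side_text(html):
    """Return the text present in the HTML as delivered, before any script runs."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented examples: the same sentence delivered two ways.
SERVER_RENDERED = "<html><body><h1>Opening hours</h1><p>Open weekdays 9 to 17.</p></body></html>"
CLIENT_RENDERED = (
    '<html><body><div id="app"></div>'
    '<script>render("Open weekdays 9 to 17.")</script></body></html>'
)

if __name__ == "__main__":
    print(repr(server_side_text(SERVER_RENDERED)))
    print(repr(server_side_text(CLIENT_RENDERED)))
```

If the second kind of page comes back empty for your key pages, that is the situation described above: the crawler gets in but finds almost nothing to read.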

Speed and stability count

Crawlers have limited patience. If pages load slowly or often return error codes, a crawler gives up sooner and comes back less often. That means classic technical hygiene (fast loading, stable pages, correct status codes) is AI work too.
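A rough way to put numbers on this is to time a single request and record the status code. The sketch below uses only Python's standard library. The function name measure, the user-agent string site-check/0.1 and the opener parameter (there so the function can be tried without a network) are our own choices, and one request from one location is only a crude indicator.

```python
import time
import urllib.error
import urllib.request


def measure(url, user_agent="site-check/0.1", timeout=10.0,
            opener=urllib.request.urlopen):
    """Return (status_code, seconds) for one GET request to url."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    start = time.perf_counter()
    try:
        with opener(request, timeout=timeout) as response:
            response.read()  # include the body download in the timing
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # urllib reports 4xx and 5xx responses as exceptions
    return status, time.perf_counter() - start


# Example call (placeholder URL, needs network access):
# status, seconds = measure("https://example.com/")
# print(status, round(seconds, 2))
```

Running it a few times for each important page, and noting anything other than a 200 status, gives a simple baseline to compare against after you clean up.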

This is exactly why SEO and AI visibility belong together rather than being two separate projects. The same groundwork serves both.

A simple checklist

To make sure the models actually get in, run through this:

Open robots.txt and confirm the AI crawlers aren't blocked by accident.
View a key page with JavaScript off and check the main text still appears.
Measure the load time of your most important pages and clean up the slowest.
Check the server logs for visits from GPTBot, ClaudeBot and PerplexityBot — if you see them, the door is open.
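The last step, checking the logs, can be sketched in a few lines of Python. This assumes your access log records the user-agent string on each line, as common web server log formats do. The sample lines are invented and simplified, and count_ai_visits is our own helper name.

```python
from collections import Counter

# Crawler names from the checklist above.
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot")


def count_ai_visits(log_lines, crawlers=AI_CRAWLERS):
    """Count log lines whose text mentions each crawler name (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for name in crawlers:
            if name.lower() in lowered:
                counts[name] += 1
    return counts


# Invented, simplified access-log lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [29/Jul/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [29/Jul/2026:10:02:17 +0000] "GET /services HTTP/1.1" 200 8841 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.10 - - [29/Jul/2026:10:05:40 +0000] "GET /contact HTTP/1.1" 200 2210 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [29/Jul/2026:10:06:02 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/128.0"',
]

if __name__ == "__main__":
    for name, hits in count_ai_visits(SAMPLE_LOG).items():
        print(name, hits)
```

A user-agent string can be faked by anyone, so a match in the logs is a good sign rather than proof. A crawler that never appears at all is the clearer signal that something is blocking it.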

Access is the first link in the chain that ends in being cited. It's also the least glamorous, but there's no point polishing content the models never get to read. This is the foundation of GEO (generative engine optimisation), also called AI search optimisation.

If you'd like someone to check whether the AI models actually get into your site today, we're happy to have a short call.

Frequently asked questions

What are GPTBot and ClaudeBot?

They're the crawlers of OpenAI and Anthropic — small programs that read web pages on behalf of ChatGPT and Claude. They fetch content the models can use when answering questions.

How do AI crawlers get blocked by accident?

Usually through an old rule in robots.txt that blocks "all crawlers". It may have been set up for another purpose, but it now locks the AI crawlers out too, with no error message.

Why does clean HTML matter for AI crawlers?

Many AI crawlers run little or no JavaScript. If the main content is built in the browser after loading, the crawler can find an almost empty page. Text delivered in clean HTML, by contrast, is visible to it.

Should I let all AI crawlers in?

For most businesses that want visibility in AI search, yes. But some choose to hold certain pages back on purpose. The important thing is that the choice is deliberate, not an accident.