Robots.txt, noindex, and AI crawlers: what each control actually does
A clean guide to crawl blocks, index controls, and crawler allowlists so important SEO pages stay reachable.
Robots rules are easy to over-trust. A robots.txt disallow can stop a compliant crawler from fetching a URL, but it does not reliably keep that URL out of search results: if other pages link to it, the URL can still be indexed without its content. A noindex directive, sent as a robots meta tag or an X-Robots-Tag response header, can remove a page from the index, but only if the crawler is allowed to fetch the page and see that directive.
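As a minimal sketch of that interaction, the Python below uses the standard library's urllib.robotparser to check whether a compliant crawler may fetch a URL, and treats a noindex directive as effective only when the fetch is allowed. The robots.txt content, the ExampleBot user agent, the example.com URLs, and both function names are assumptions made up for illustration. Note that urllib.robotparser uses simple prefix matching, which can differ from how a particular search engine resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def noindex_is_visible(
    robots_txt: str, user_agent: str, url: str, has_noindex: bool
) -> bool:
    """A noindex directive only counts if the crawler can fetch the page."""
    return has_noindex and crawl_allowed(robots_txt, user_agent, url)


if __name__ == "__main__":
    blocked = "https://example.com/drafts/old-page"  # hypothetical URL
    public = "https://example.com/blog/old-page"     # hypothetical URL
    for url in (blocked, public):
        print(url, noindex_is_visible(ROBOTS_TXT, "ExampleBot", url, True))
```

In this sketch the page under /drafts/ carries noindex but is never fetched, so the directive cannot take effect. Lifting the disallow is what lets the crawler see it.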
Which control to use
AI crawler policy should be explicit
Search crawlers, preview fetchers, and AI training or answer crawlers may have different user agents and different business value. Decide the policy by page type: public product and editorial pages usually need search access, while paid, private, or generated utility pages should often be limited. Document the intent next to the robots file so future deploys do not undo it accidentally.
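One way to keep that intent from drifting is to store the expected outcomes as data and check them against the robots file before each deploy. The sketch below assumes a site with hypothetical /blog/, /tools/, and /account/ sections, and uses Googlebot and GPTBot only as example user agent tokens. The policy itself, the EXPECTED table, and the policy_violations function are illustrative assumptions, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with the intent written next to each group.
ROBOTS_TXT = """\
# Intent: AI training crawler may read editorial pages, not generated tools.
User-agent: GPTBot
Disallow: /account/
Disallow: /tools/

# Intent: everyone else may fetch public pages; account pages stay uncrawled.
User-agent: *
Disallow: /account/
"""

# (user agent, path, should a compliant fetch be allowed?)
EXPECTED = [
    ("Googlebot", "/blog/robots-guide", True),
    ("Googlebot", "/account/billing", False),
    ("GPTBot", "/blog/robots-guide", True),
    ("GPTBot", "/tools/report-123", False),
]


def policy_violations(robots_txt, expected):
    """Return the expectations that the given robots.txt does not satisfy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path, want)
        for agent, path, want in expected
        if parser.can_fetch(agent, path) != want
    ]


if __name__ == "__main__":
    for agent, path, want in policy_violations(ROBOTS_TXT, EXPECTED):
        print(f"{agent} {path}: expected allowed={want}")
```

Run as a pre-deploy check, a non-empty result means someone changed the file in a way that contradicts the documented intent, for example a blanket "Disallow: /" that would also lock search crawlers out of editorial pages.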
Seora compares robots.txt, noindex, canonical tags, sitemap URLs, and actual crawl responses. It highlights contradictions like a noindex page blocked by robots.txt or a sitemap URL that cannot be fetched.
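To make those two contradictions concrete, here is a small sketch of how such a cross-check could work. It is not Seora's implementation: the PageRecord fields, the find_contradictions function, the ExampleBot user agent, and the sample paths are all assumptions for illustration, and the records stand in for data an audit would collect by fetching each page directly.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class PageRecord:
    """Hypothetical audit record for one URL."""
    path: str
    has_noindex: bool  # noindex seen when the auditor fetched the page
    in_sitemap: bool   # URL is listed in the sitemap
    status: int        # HTTP status the auditor observed


def find_contradictions(robots_txt, user_agent, pages):
    """Flag signals that cancel each other out for the given crawler."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    issues = []
    for page in pages:
        allowed = parser.can_fetch(user_agent, page.path)
        if page.has_noindex and not allowed:
            # The crawler never fetches the page, so it never sees noindex.
            issues.append((page.path, "noindex blocked by robots.txt"))
        if page.in_sitemap and (not allowed or page.status != 200):
            # The sitemap advertises a URL the crawler cannot retrieve.
            issues.append((page.path, "sitemap URL cannot be fetched"))
    return issues


if __name__ == "__main__":
    robots_txt = "User-agent: *\nDisallow: /drafts/\n"  # hypothetical
    pages = [
        PageRecord("/blog/guide", False, True, 200),
        PageRecord("/drafts/old", True, False, 200),
        PageRecord("/blog/removed", False, True, 404),
    ]
    for path, problem in find_contradictions(robots_txt, "ExampleBot", pages):
        print(path, problem)
```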
The rule of thumb is simple: block crawling when fetching is wasteful, use noindex when visibility is the problem, and never use either as a substitute for access control.