Strategy

Robots.txt, noindex, and AI crawlers: what each control actually does

A clean guide to crawl blocks, index controls, and crawler allowlists so important SEO pages stay reachable.

Seora · Updated June 26, 2026 · 1 min read

Robots rules are easy to over-trust. A robots.txt disallow stops a compliant crawler from fetching a URL, but it does not keep that URL out of search results: if other pages link to it, a search engine can still list the bare URL without ever reading its content. A noindex directive can remove a page from the index, but only if the crawler is allowed to fetch the page and see that directive.
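To make the dependency concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the ExampleBot user agent, and the example.com URLs are all made up for illustration. The point is only that a noindex on a disallowed URL can never be read by a compliant crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /internal-search/
"""


def crawler_can_see_noindex(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch the URL.

    A noindex directive lives in the page or its response headers,
    so it only takes effect when this returns True.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = crawler_can_see_noindex(
    ROBOTS_TXT, "ExampleBot", "https://example.com/internal-search/shoes"
)
reachable = crawler_can_see_noindex(
    ROBOTS_TXT, "ExampleBot", "https://example.com/guides/robots"
)

print(blocked)    # False: any noindex on this page is never read
print(reachable)  # True: a noindex here would be seen and honored
```

Note that urllib.robotparser uses simple prefix matching and does not implement every extension that individual search engines support, so treat it as an approximation rather than a replica of any particular crawler.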

Which control to use

Use robots.txt to reduce crawling of duplicate, faceted, staging, or utility URLs that do not need to be fetched.

Use a noindex meta tag or X-Robots-Tag header when the URL can be crawled but should not appear in search results.

Use authentication for private content. Robots rules are public, advisory hints, not security: anyone can read robots.txt, and a non-compliant crawler can simply ignore it.

Use canonical tags when duplicates should consolidate signals into one preferred URL.
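As a rough illustration of the noindex option above, the following sketch checks a response for a noindex directive in either place it can appear: the X-Robots-Tag header or a robots meta tag. The function and class names are hypothetical, and the parsing is deliberately simplified: it ignores crawler-specific forms such as a user-agent prefix in the header or a meta tag named after one crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(headers: dict[str, str], html: str) -> bool:
    """Return True if the header or the markup carries a noindex directive."""
    header_tokens: set[str] = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_tokens.update(t.strip().lower() for t in value.split(","))

    meta = RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_tokens or "noindex" in meta.directives


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, "<html></html>"))  # True
print(has_noindex({}, '<meta name="robots" content="noindex">'))            # True
print(has_noindex({}, '<meta name="robots" content="index, follow">'))      # False
```

The header form is useful for non-HTML resources such as PDFs, where there is no markup to carry a meta tag.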

AI crawler policy should be explicit

Search crawlers, preview fetchers, and AI training or answer crawlers may have different user agents and different business value. Decide the policy by page type: public product and editorial pages usually need search access, while paid, private, or generated utility pages often should be limited. Document the intent next to the robots file so future deploys do not undo it accidentally.
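One way to keep the intent next to the rules is to generate robots.txt from a small policy table and write each rule's reason into the file as a comment. The sketch below assumes a hypothetical policy: Googlebot and GPTBot are real crawler user agents, but the paths and the decisions attached to them are invented for illustration and are not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy table: the paths and intents are illustrative only.
POLICY = {
    "Googlebot": {
        "intent": "Search crawler: product and editorial pages need access.",
        "disallow": ["/account/"],
    },
    "GPTBot": {
        "intent": "AI crawler: keep paid content out.",
        "disallow": ["/account/", "/premium/"],
    },
    "*": {
        "intent": "Default: block private and generated utility pages.",
        "disallow": ["/account/", "/utility/"],
    },
}


def render_robots_txt(policy: dict) -> str:
    """Render one robots.txt group per user agent, with the intent as a comment."""
    lines: list[str] = []
    for agent, rule in policy.items():
        lines.append(f"# {rule['intent']}")
        lines.append(f"User-agent: {agent}")
        for path in rule["disallow"]:
            lines.append(f"Disallow: {path}")
        lines.append("")  # a blank line ends the group
    return "\n".join(lines)


robots_txt = render_robots_txt(POLICY)
print(robots_txt)

# Verify the rendered file behaves as the policy intends.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("Googlebot", "https://example.com/premium/report"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/premium/report"))     # False
```

Because the policy table is the source of truth, a deploy that regenerates the file cannot silently drop a rule, and the comments explain each group to whoever reads the file later.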

Where Seora fits

Seora compares robots.txt, noindex, canonical tags, sitemap URLs, and actual crawl responses. It highlights contradictions like a noindex page blocked by robots.txt or a sitemap URL that cannot be fetched.
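The two contradictions named above can be expressed as a small check. This is not Seora's implementation, only a sketch of the idea. It assumes the audit has already recorded, for each URL, whether it carries noindex, whether it is listed in the sitemap, and what HTTP status it returned. The PageObservation type and all sample data are hypothetical.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


@dataclass
class PageObservation:
    """Hypothetical record of what an audit observed for one URL."""
    url: str
    has_noindex: bool
    in_sitemap: bool
    status_code: int


def find_contradictions(
    robots_txt: str, user_agent: str, pages: list[PageObservation]
) -> list[tuple[str, str]]:
    """Return (url, problem) pairs for conflicting crawl and index signals."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    issues: list[tuple[str, str]] = []
    for page in pages:
        allowed = parser.can_fetch(user_agent, page.url)
        if page.has_noindex and not allowed:
            # Compliant crawlers never fetch the page, so noindex is never seen.
            issues.append((page.url, "noindex page blocked by robots.txt"))
        if page.in_sitemap and (not allowed or page.status_code != 200):
            issues.append((page.url, "sitemap URL cannot be fetched"))
    return issues


ROBOTS = "User-agent: *\nDisallow: /search/\n"
PAGES = [
    PageObservation("https://example.com/search/shoes", True, False, 200),
    PageObservation("https://example.com/guides/robots", False, True, 200),
    PageObservation("https://example.com/old-page", False, True, 404),
]

issues = find_contradictions(ROBOTS, "ExampleBot", PAGES)
for url, problem in issues:
    print(f"{url}: {problem}")
```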

The rule of thumb is simple: block crawling when fetching is wasteful, use noindex when visibility is the problem, and never use either as a substitute for access control.