The robots.txt file is a text file located in the root directory of a website that sets access rules for search engine bots for specific URLs, folders, and page types. Its main purpose is not to “hide a page from Google,” but to manage crawling: to indicate what a bot should crawl and what is not worth spending server resources and crawl budget on.

Crawling is when a bot requests a page. Indexing is when a page, after processing, may be added to a search engine’s index. Noindex is a separate directive that tells Google not to include a document in search results. That is why robots.txt should not be presented as a universal way to remove a URL from search: for Google, it is primarily a crawl management tool.

If a page should no longer appear in search, robots.txt alone is not enough. A different combination is needed here: robots.txt for crawl control, noindex or X-Robots-Tag for deindexing, and authorization or password protection for restricted sections. When these tasks are mixed up, a site either wastes crawl on technical URLs or accidentally blocks important pages.
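For illustration, noindex can be expressed either in the page markup or as an HTTP response header. Both forms below are hypothetical examples (the nginx directive assumes a server where response headers can be added in the configuration):

```text
<!-- In the HTML head of the page: -->
<meta name="robots" content="noindex">

# Or as an HTTP response header, e.g. in an nginx config:
add_header X-Robots-Tag "noindex";
```

The header form is especially useful for non-HTML documents such as PDFs, where a meta tag is not an option.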

In practice, a properly configured robots.txt affects not only the technical cleanliness of a site, but also the effectiveness of website promotion. When a search engine does not spend crawl resources on utility pages, filters, parameter-based URLs, or technical sections, it reaches content with real SEO value more quickly.

This article is based on Google’s official documentation on robots.txt, the technical specification of how Google interprets robots.txt, guidance on noindex, and WordPress Developer Resources.

What robots.txt is and when you actually need it

Robots.txt is a rules file for bots located at /robots.txt that tells crawlers which parts of a website may be crawled and which should not be. For a small website with a few dozen pages, it can be very simple. For an online store, a content project, a WordPress site, a large corporate website, or a multilingual structure, the robots.txt file becomes part of the site’s technical foundation.

Robots.txt is most useful when a site has filters, internal search pages, sorting options, URL parameters, utility directories, test sections, technical scripts, or other areas that provide no search value. This is especially noticeable in e-commerce: if crawling is not controlled, a bot may keep crawling endless combinations of filters and parameters instead of reaching category pages, product cards, and content pages more quickly.

At the same time, robots.txt should not be treated as a tool for protecting private data. If a document is truly sensitive, it should not simply be “disallowed in robots.txt” — it should be removed from public access, protected by authentication, or secured at the server level. The file only publishes a rule for bots; it does not create a barrier for users or scrapers.

Crawling, indexing, and noindex are not the same thing

One of the most common mistakes in older SEO materials is mixing up crawling and indexing. If a page is blocked in robots.txt, Googlebot is not allowed to request it and therefore cannot see its content. But the URL may still appear in search if it is linked from other pages or external sources.

With noindex, the logic is different. For Google to see this directive, the page must remain accessible for crawling. A typical mistake is to place Disallow in robots.txt and expect the bot to read noindex inside the page. It will not, because access has already been blocked.
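This behavior can be reproduced with Python’s standard urllib.robotparser module (the domain and rules below are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical setup: the page carries a noindex meta tag,
# but it is also disallowed in robots.txt.
rules = """\
User-agent: *
Disallow: /old-page.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler checks robots.txt before fetching a page.
# Since the fetch is forbidden, any noindex directive inside the
# page is never seen by the crawler.
allowed = rp.can_fetch("*", "https://example.com/old-page.html")
print(allowed)  # False: the crawler never requests the page
```

The URL may still end up in the index from external links alone, which is exactly why Disallow is not a deindexing tool.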

If you want to better understand how pages get into search results, take a look at our article about website indexing in search engines. In practice, people often misidentify the cause of the problem here: it may look like a page “is not being indexed,” while in fact crawling of it was simply allowed or blocked incorrectly.

Where the robots.txt file should be located and what rules are mandatory

There are several basic requirements for robots.txt that should not be ignored. The file must be named exactly robots.txt, be a plain text document encoded in UTF-8, and be placed in the root of the host to which the rules apply. If the file is placed in a subfolder, search bots will not treat it as robots.txt.

There is also a technical nuance that often gets lost in general explanations. Robots.txt does not apply “to the entire site as a whole,” but within the scope of a specific protocol, host, and port. In other words, the rules in https://example.com/robots.txt do not automatically become rules for https://www.example.com/ or for subdomains. If a project is distributed across multiple hosts, this must be considered at the website development stage and in the site structure itself.
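A small sketch makes the scope rule concrete: the robots.txt that governs a page is always derived from that page’s scheme, host, and port, and nothing else (the hosts below are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page.
    The scope is scheme + host + port; path and query are irrelevant."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/catalog/?sort=price"))
# https://example.com/robots.txt
print(robots_url("https://www.example.com/catalog/"))
# https://www.example.com/robots.txt  (a different host, so different rules)
```

This is why www and non-www versions, subdomains, and non-standard ports each need their own correct robots.txt.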

There is another telling sign: when robots.txt grows to hundreds of lines and keeps accumulating exceptions, the problem is often no longer the file itself. Usually, it means the site has too many chaotic parameters, utility URLs, or technical pages that should be organized at the architectural level instead of being patched with endless rules.

User-agent, Disallow, Allow, and Sitemap — what the main directives mean

User-agent specifies which crawler a block of rules applies to. Every group of rules in robots.txt starts with this directive. If you set User-agent: *, the block applies to all bots that support the Robots Exclusion Protocol.

Disallow is the main directive used to restrict crawling of specific sections, folders, or URL types. It is commonly used to block internal search, the cart, technical directories, test folders, or pages with low SEO value.

Allow is used as an exception when you need to permit a specific file or subpath inside an already blocked directory. This is less common, but useful when a folder as a whole should not be crawled, while one file inside it still needs to remain accessible to bots.

Sitemap does not block or allow crawling; it simply tells bots where the sitemap is located. That is why adding a Sitemap line to robots.txt is normal practice: it does not replace submitting the sitemap in Search Console, but it gives search engines one more way to discover it.

A basic example may look like this:

User-agent: *
Disallow: /search
Disallow: /cart/
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://example.com/sitemap.xml

It is better to keep these robots.txt rules short and easy to understand. The more complex the file is, the easier it becomes to accidentally block the wrong folder, the wrong URL pattern, or even resources required for rendering.
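The Allow-inside-Disallow pattern can be sanity-checked with Python’s standard urllib.robotparser. One caveat: Python’s parser applies rules in file order (first match wins), while Google uses the longest, most specific match; listing Allow before the broader Disallow, as below, keeps both interpretations in agreement:

```python
from urllib.robotparser import RobotFileParser

# Allow is listed before the broader Disallow so that both a
# first-match parser (like Python's robotparser) and Google's
# longest-match logic reach the same result.
rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/wp-admin/admin-ajax.php"))  # True
print(rp.can_fetch("*", "https://example.com/wp-admin/options.php"))     # False
```

Running this kind of check on a handful of critical URLs before publishing a new robots.txt is a cheap way to catch accidental blocks.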

What robots.txt syntax actually matters

Robots.txt syntax does not look complicated, but most mistakes happen in the details. Paths are defined from the site root, the file is sensitive to URL structure, and Google simply ignores invalid or unnecessary lines. Because of that, a broken robots.txt does not always “fail completely” — sometimes it works partly and partly does not, which is exactly why errors can go unnoticed for a long time.

There is also an important practical point: Google does not support some directives that still appear in old articles and templates. In particular, noindex, nofollow, and crawl-delay are not supported robots.txt rules for Google. So if your file still contains such entries, they should not be treated as a foundation for modern technical optimization.
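One detail worth internalizing is that Disallow matches by path prefix, not by exact URL. A quick check with Python’s urllib.robotparser (which implements plain prefix matching, though not Google’s additional * and $ wildcard extensions):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallow works by path prefix, so this rule catches more than
# the literal /search page.
for path in ("/search", "/search?q=shoes", "/search-results", "/searchable-faq"):
    print(path, rp.can_fetch("*", "https://example.com" + path))
# All four are blocked; a path like /support would not be.
```

This is exactly how a rule written for one section quietly swallows a neighboring one, and why such errors can go unnoticed for a long time.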

Robots.txt for WordPress — what to check separately

Robots.txt for WordPress is a separate topic because on this platform the file may be served not only as a physical document on the server, but also dynamically. In addition, WordPress already has built-in XML sitemap support. That is convenient, but the final output still needs to be checked.

On WordPress, the issue is often not the file itself, but the fact that its behavior changes due to plugins, the theme, or custom logic. A site owner may be sure that robots.txt was configured correctly long ago, while the actual content at /robots.txt is already different. That is why checking robots.txt for WordPress is not a formality, but a normal part of a technical website audit.

For WordPress or WooCommerce online stores, this matters even more: parameter-based URLs, filters, sorting pages, the cart, checkout, and other sections tend to pile up quickly, and they should not be left open for uncontrolled crawling.

How to check robots.txt in Google Search Console

It is better to check robots.txt not in isolation, but together with Search Console, URL Inspection, Page Indexing, and Crawl Stats. The file itself may be formally correct, but that still does not mean it works in the site’s best interest.

The robots.txt report in Search Console helps you view the current version of the file and refresh the cache more quickly after changes. For individual URLs, Google Search Console with URL Inspection is useful because it shows whether robots.txt is blocking a page that should actually be crawlable. For the bigger picture, Page Indexing and Crawl Stats are worth reviewing — that is where you can see where the site is really losing crawl.

If a site has many pages, checking robots.txt “by eye” is not enough. When the file changes together with the site structure, migration, new language versions, or a redesign, it should be reviewed as part of an SEO site audit rather than just following the logic of “the file opens, so everything must be fine.”

Common mistakes in robots.txt

Most problems arise not because the syntax is complex, but because people misunderstand what this file is actually for.

  • They block a page in robots.txt and expect it to disappear from search. This is not a guaranteed outcome. If Google already knows the URL, it may still appear in search results even without full content.

  • They block pages that should be crawlable. As a result, the search engine cannot see important content, and the issue later looks like “slow indexing.”

  • They block resources needed for rendering. If important CSS or JavaScript is accidentally restricted, the page may open normally in a browser, but Google’s rendering will be incomplete.

  • They keep outdated or unsupported rules in the file. These rules are useless, but they create a false sense of control.

  • They publish the file in the wrong place. A robots.txt file in a subfolder, in the wrong encoding, or with an incorrect filename is effectively the same as having no robots.txt at all.

  • They rely on robots.txt to protect private sections. Restricted information requires access control, not just a recommendation for bots.
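The rendering-resources mistake in particular is easy to catch programmatically: take the CSS and JavaScript files a page actually loads and test each one against the parsed rules. A sketch with Python’s urllib.robotparser and hypothetical URLs:

```python
from urllib.robotparser import RobotFileParser

# A too-broad rule: blocking a whole includes directory also
# blocks the CSS and JS that Google needs to render the page.
rules = """\
User-agent: *
Disallow: /assets/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Resources referenced by a page (taken from its HTML in practice).
page_resources = [
    "https://example.com/assets/css/main.css",
    "https://example.com/assets/js/app.js",
    "https://example.com/images/logo.png",
]

blocked = [u for u in page_resources if not rp.can_fetch("Googlebot", u)]
print(blocked)
```

Here both files under /assets/ would be reported as blocked, even though the page opens normally in a browser.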

There is also one very real-world scenario: a restrictive robots.txt was correct for a staging domain or a test copy of the site, but during release it was either not replaced or was accidentally deployed to the main domain. These mistakes happen much more often than they seem, especially after a redesign or migration.

The availability of the file itself also matters. If robots.txt is returned with server errors, that affects crawling too. That is why during a redesign, migration, CDN change, or server configuration update, robots.txt should be checked just as carefully as the sitemap, redirects, and canonical setup.
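How a fetch failure is interpreted can be sketched as a simple classification, loosely following Google’s documented handling: a 2xx response is parsed and applied, a 4xx response is treated as if no robots.txt exists, while 429 and 5xx responses are temporarily treated as a full disallow. This is a simplification for illustration, not the full specification:

```python
def robots_fetch_policy(status: int) -> str:
    """Rough sketch of how Google's documentation describes handling
    the HTTP status of a robots.txt fetch (simplified, not the full spec)."""
    if 200 <= status < 300:
        return "parse-and-apply"          # use the rules in the response body
    if 400 <= status < 500 and status != 429:
        return "assume-no-restrictions"   # treated as if no robots.txt exists
    if status == 429 or 500 <= status < 600:
        return "assume-fully-disallowed"  # server trouble: crawling is paused
    return "undefined"

print(robots_fetch_policy(200))  # parse-and-apply
print(robots_fetch_policy(404))  # assume-no-restrictions
print(robots_fetch_policy(503))  # assume-fully-disallowed
```

A missing file (404) is therefore harmless, while a file that intermittently returns 5xx can quietly stall crawling of the whole site.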

Conclusion

A properly configured robots.txt file is not a decorative website element and not a way to “hide everything unnecessary from Google.” It is a working crawl management tool that helps remove noise, direct bots toward important sections, and avoid wasting crawl on URLs with no value.

In practice, everything comes down to a simple separation of roles. Robots.txt is used to control crawling. If a page needs to be removed from search, the focus should shift to noindex, X-Robots-Tag, or server-side access restrictions. When these tasks are not confused with one another, a site’s technical SEO logic becomes much cleaner.

If there are already doubts about which URLs on a site should be blocked, which should remain crawlable, and where robots.txt, meta robots, canonical, and sitemap conflict, those issues are better solved not with isolated edits, but through a full technical review of the site structure.