SEO

Robots.txt for Beginners: What It Does and What It Does Not Do

A plain-English guide to robots.txt files, crawl rules, sitemap lines, and common mistakes.

A robots.txt file gives crawlers instructions about which parts of a site they may crawl. Crawlers look for it at the root of a host, such as "https://example.com/robots.txt", so a copy placed in a subfolder has no effect. Search engine crawlers typically fetch it before crawling pages on that host.

The file is useful, but it is not a security system. A disallow rule can ask well-behaved crawlers not to crawl a path, but it does not hide private content from people who already know the URL. Private pages should be protected with authentication, not just robots.txt.

A basic file may include a user-agent line, allow or disallow rules, and a sitemap location. For example, "User-agent: *" applies to all crawlers, "Disallow: /admin" asks them to avoid any path that begins with /admin, and "Sitemap: https://example.com/sitemap.xml" points them to your sitemap. The Robots.txt Generator can create a clean starting file.
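To see how a crawler library reads those three lines, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The file contents mirror the example above; the helper name `build_parser` and the two test URLs are made up for illustration, and real search engine crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The basic file described above: one group for all crawlers,
# one disallow rule, and one sitemap line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Rules match by path prefix, so anything under /admin is disallowed.
print(parser.can_fetch("*", "https://example.com/admin/settings"))   # False
print(parser.can_fetch("*", "https://example.com/blog/first-post"))  # True
```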

Be careful with broad disallow rules. "Disallow: /" tells crawlers not to crawl the entire site. That can be useful for a private staging environment, but it is usually a serious mistake on a public website. Before launch, check that important pages such as tools, blog posts, and policy pages are crawlable.
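A quick pre-launch guard for this mistake could scan the file text for a bare "Disallow: /" line. The sketch below is a rough heuristic, not a full parser: the function name `has_blanket_disallow` is hypothetical, and it does not consider which user-agent group the rule belongs to or whether an allow rule overrides it.

```python
def has_blanket_disallow(robots_text: str) -> bool:
    """Return True if any line is exactly 'Disallow: /' (ignoring comments and case)."""
    for raw_line in robots_text.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip() == "/":
            return True
    return False


print(has_blanket_disallow("User-agent: *\nDisallow: /"))       # True
print(has_blanket_disallow("User-agent: *\nDisallow: /admin"))  # False
```

A hit is not always an error, since a staging site may block everything on purpose, but it is worth a deliberate look before a public launch.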

Robots.txt works best with a sitemap. The sitemap tells search engines which URLs matter, while robots.txt gives crawl guidance. If you add new resources like a blog, tool pages, or a meta tag generator, keep the sitemap updated.
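One way to confirm that the file actually advertises a sitemap is to list its sitemap lines. This sketch assumes Python 3.9 or later and uses `RobotFileParser.site_maps()` from the standard library; the helper name `sitemap_urls` is an illustration only.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_text: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line, or an empty list."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when the file has no sitemap lines.
    return parser.site_maps() or []


example = """\
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
"""
print(sitemap_urls(example))  # ['https://example.com/sitemap.xml']
```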

Review the file whenever your site structure changes. If you move from a few pages to a larger content library, old rules may block new sections by accident.

The goal is not to control every crawler perfectly. The goal is to make your public site easier to discover while keeping low-value or private areas out of the crawl path.

A safe launch workflow

Before publishing a new site, open the robots.txt file directly in the browser. Do not rely only on memory or a framework setting. Confirm that the file is available at the root of the domain and that it does not contain a broad block like "Disallow: /" unless the site is intentionally private.

Next, check the sitemap line. A sitemap reference helps crawlers discover the URLs you want them to consider. If your site has a Tools hub, blog posts, policy pages, and individual utilities such as the Robots.txt Generator, those public pages should also be discoverable through normal internal links and the sitemap.

Finally, test a few important URLs manually. Open the homepage, a tool page, a blog post, and a policy page. If any of them should be public, robots.txt should not block their path.
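The manual checks above can also be scripted. The following is a sketch under stated assumptions: `launch_report` is a hypothetical helper, the URLs are placeholders rather than real pages, and the robots.txt text is passed in as a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def launch_report(robots_text: str, important_urls: list[str],
                  user_agent: str = "*") -> dict:
    """Summarize sitemap lines and any important URLs the rules would block."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    blocked = [url for url in important_urls
               if not parser.can_fetch(user_agent, url)]
    return {
        "sitemaps": parser.site_maps() or [],
        "blocked_urls": blocked,
    }


robots_text = """\
User-agent: *
Disallow: /admin
Sitemap: https://example.com/sitemap.xml
"""

# Placeholder URLs for a homepage, a tool page, a blog post, and a policy page.
important_urls = [
    "https://example.com/",
    "https://example.com/tools/robots-txt-generator",
    "https://example.com/blog/robots-txt-for-beginners",
    "https://example.com/privacy",
]

report = launch_report(robots_text, important_urls)
print(report["sitemaps"])      # ['https://example.com/sitemap.xml']
print(report["blocked_urls"])  # []
```

To run the same check against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file over HTTP instead of parsing a string. An empty `blocked_urls` list only means this parser found no blocking rule; it does not guarantee how any particular search engine will treat the page.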

Common robots.txt mistakes

The first mistake is treating robots.txt like privacy protection. It is not. A disallowed URL may still appear in search results if other pages link to it, and people can still open it if they know the address. Use authentication for private content.

The second mistake is blocking assets needed to render the page. If crawlers cannot access important CSS or JavaScript, they may not understand the page as users see it. For a modern site, be cautious with broad asset-folder blocks.
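As an illustration of how a broad folder rule catches render-critical files, the sketch below parses a file that blocks a hypothetical "/assets/" folder and lists which asset URLs fall under it. The folder name, the file names, and the helper `blocked_assets` are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# A broad rule that blocks a whole (hypothetical) asset folder.
ROBOTS_WITH_ASSET_BLOCK = """\
User-agent: *
Disallow: /assets/
"""


def blocked_assets(robots_text: str, asset_urls: list[str],
                   user_agent: str = "*") -> list[str]:
    """Return the asset URLs that the rules disallow for the given crawler."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(user_agent, url)]


assets = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
    "https://example.com/favicon.ico",
]

# The stylesheet and script are both caught by the folder-wide rule.
print(blocked_assets(ROBOTS_WITH_ASSET_BLOCK, assets))
# ['https://example.com/assets/site.css', 'https://example.com/assets/app.js']
```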

The third mistake is copying rules from another website. A rule that makes sense for one site can damage another. Generate a starting point with the Robots.txt Generator, then adapt it to your actual structure.

When to update the file

Review robots.txt whenever you launch a new section, move content, add a staging area, or change your sitemap. If you add a blog, the file may not need many new rules, but the sitemap should include the posts and internal links should point to them. If you add a private admin area, a disallow rule may reduce crawl noise, but it should not be the only protection.

Robots.txt is most valuable when it is boring, clear, and reviewed. For public content, the usual goal is simple: let crawlers reach useful pages, point them toward the sitemap, and avoid accidentally blocking the pages people came to find.
