
Disallow root but not 4 subdirectories for robots.txt

Webmasters Asked by Quillion on November 6, 2021

I have a project and I would like to disallow everything starting with root.

From what I understand, I think I can do it like this:

Disallow: /
Disallow: /*

However I would like to allow 4 subdirectories and everything under those subdirectories.

This is how I think it should be done:

Allow: /directory_one/
Allow: /directory_one/*
Allow: /directory_two/
Allow: /directory_two/*
Allow: /directory_six/
Allow: /directory_six/*
Allow: /about/
Allow: /about/*

So how would I go about disallowing everything starting from root but allowing only those 4 directories and everything under them?

Also, if I want to allow a specific directory and everything under it, do I have to declare it twice?

Will web crawlers be able to navigate to those subdirectories if root is disallowed?

2 Answers

I would not recommend trying to set up your site to disallow everything except certain directories.

  • The home page of your site will be blocked from crawling. Because most sites get many links to their home page, your site will be throwing away SEO value from a lot of incoming links. You will need to have direct external links to your deep crawlable pages to make your SEO work.
  • Furthermore, the home page is usually where bots find navigation to the other pages on your site. Without allowing crawling of the home page, bots may not be able to crawl all the pages of your site. You'd need to ensure that there are links between your deep crawlable pages.
  • Even though your home page wouldn't be crawled, it would likely still be indexed. When somebody searches for your brand name, the Google search results would likely have a link to your home page with the unfriendly message "A description for this result is not available because of the site's robots.txt -- learn more".
  • Most crawlers don't know how to deal with Allow directives. Most crawlers will be disallowed from crawling your entire site. Only a few crawlers will know they are allowed to crawl those subdirectories. Luckily the major search engine bots process Allow directives.

I would recommend moving all disallowed content into its own subdirectory and disallowing it. For example, put all your blocked content into /private/ and use the following in robots.txt:

User-Agent: *
Disallow: /private

If you are comfortable blocking most bots other than search engine crawlers and with having worse SEO because your home page is not crawled, your idea of using Allow for specific directories could work. There is no need to use the wildcard * at the end of any directive. All robots.txt rules without a wildcard are "starts with" rules. There is an implied * at the end of each and every one. All you would need would be:

User-Agent: *
Disallow: /
Allow: /directory_one
Allow: /directory_two
Allow: /directory_six
Allow: /about
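To see why the trailing * adds nothing, here is a minimal Python sketch of robots.txt-style pattern matching (an illustration only, not any crawler's real parser; it assumes * is the only wildcard and ignores the $ end-of-URL anchor that some crawlers also support). A rule with and without a trailing * matches exactly the same paths:

import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    # Treat * as a wildcard. re.match anchors at the start of the path,
    # and leaving the end open gives every rule an implied trailing *.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None

# The /about/ pair from the question: with and without the trailing *
# the two rules agree on every sample path.
for path in ("/about/", "/about/team", "/directory_one/page.html", "/other"):
    assert robots_pattern_matches("/about/", path) == robots_pattern_matches("/about/*", path)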

The order of the rules shouldn't matter. When multiple rules match, the longest rule should apply (there is a small sketch of this logic after the examples below). So:

  • / (the home page) matches Disallow: / and crawling is not allowed
  • /foo matches Disallow: / and crawling is not allowed
  • /about/bob matches both Disallow: / and Allow: /about. The longer rule will apply and crawling would be allowed.

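As a rough illustration of that longest-match selection (a sketch of the documented behaviour, not any search engine's actual code, and ignoring tie-breaking details such as Google preferring Allow when lengths are equal), the three examples above work out like this:

RULES = [
    ("disallow", "/"),
    ("allow", "/directory_one"),
    ("allow", "/directory_two"),
    ("allow", "/directory_six"),
    ("allow", "/about"),
]

def is_allowed(path: str) -> bool:
    # Keep every rule whose path is a prefix of the requested path,
    # then apply the longest one; allow by default if nothing matches.
    matches = [rule for rule in RULES if path.startswith(rule[1])]
    if not matches:
        return True
    kind, _ = max(matches, key=lambda rule: len(rule[1]))
    return kind == "allow"

for path in ("/", "/foo", "/about/bob"):
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")
# Prints: / and /foo disallowed, /about/bob allowed.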
You can test this with Google's robots.txt testing tool, which is part of Search Console.

Answered by Stephen Ostermiller on November 6, 2021

What you have seems about correct, assuming you have an appropriate User-agent directive preceding these rules:

Disallow: /
Disallow: /*

However, you don't need to repeat the same directive, once with a trailing * and once without: robots.txt uses prefix matching, so the trailing * is superfluous and both forms match the same URLs. A cleaned-up version of your rules would be:

User-agent: *
Allow: /directory_one/
Allow: /directory_two/
Allow: /directory_six/
Allow: /about/
Disallow: /

Whilst Google (and the big search engines) match using the longest-path-matched method, I believe you should put the Allow directives first for those search engines that use the first-path-matched method.
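To make that concrete, here is a hypothetical first-match checker (a sketch of first-path-matched behaviour in general, not any particular crawler): with Disallow: / listed first, such a parser never reaches the Allow rules, whereas putting the Allow rules first gives the intended result.

def first_match_allowed(rules, path):
    # Stop at the first rule whose path is a prefix of the requested path.
    for kind, rule_path in rules:
        if path.startswith(rule_path):
            return kind == "allow"
    return True  # no rule matched: crawling is allowed by default

allow_first = [("allow", "/about/"), ("disallow", "/")]
disallow_first = [("disallow", "/"), ("allow", "/about/")]

print(first_match_allowed(allow_first, "/about/bob"))     # True: allowed
print(first_match_allowed(disallow_first, "/about/bob"))  # False: blocked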

Will web crawlers be able to navigate to those subdirectories if root is disallowed?

Yes, that is the reason for the overriding Allow directives. Strictly speaking, the Allow directive is a more recent addition to the "standard", but all main search engines support it.

Answered by DocRoot on November 6, 2021
