Google has an insatiable appetite for content, webpages, and anything else on your website it can find.
This is great, but as we mentioned in a previous post about crawl budget, large stores can suffer from Google trying to access every single page within their domain for a number of different reasons.
In this article we’re going to cover the use of the robots.txt file – a last-ditch option often used to prevent crawlers from accessing areas of a website they don’t need to crawl and you don’t want in the index.
Disclaimer: Although it can be very useful, this file should be used with extreme care, as the wrong implementation could exclude more than intended and, in extreme cases, remove your entire website from the Google index.
Let’s start by setting up a basic robots.txt file and implementing it on your site. The basic format for a simple instruction set is as follows:-
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
This is the basic information a robots.txt file will need to instruct crawlers how to crawl your site. The user agent name will be the name of the crawler e.g. Google’s would be “Googlebot”, Microsoft Bing would be “Bingbot” etc. You can also provide instructions to all bots using “*”.
The URL string is the part of the web address that follows your main domain. For example, if you have the website https://www.example.co.uk and you want to block the “About Us” page, which is located at https://www.example.co.uk/about-us.html, then after “Disallow: ” you would simply type /about-us.html to have the crawler ignore your About Us page.
You can set your entire site to be ignored by a crawler by using “/” after “Disallow: ”. Alternatively, if you want your entire site to be open for crawling, you can leave the string after “Disallow: ” blank, which tells the crawler named in that instruction that it may crawl every page. So, with all this in mind, the bare minimum a robots.txt file would need would be something like the following:-
User-agent: *
Disallow:
This tells all crawlers that the entire site can be crawled without restriction. Once you’re happy with your instructions, they need to be saved to a file named “robots.txt”. It’s important to remember that the file name must be completely lower case and spelled exactly as shown in the quotes. If the filename doesn’t match exactly, the file will be ignored and your restrictions won’t be applied.
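Before uploading, an allow-everything rule set like the one described above can be checked locally. The following is a minimal sketch using Python’s standard urllib.robotparser module; the crawler name “ExampleBot” and the helper build_parser are placeholders chosen for illustration, not part of any real crawler.

```python
from urllib.robotparser import RobotFileParser

# The bare-minimum rule set: every crawler, nothing disallowed.
MINIMAL_RULES = """\
User-agent: *
Disallow:
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(MINIMAL_RULES)

# "ExampleBot" is a made-up crawler name used only for this check.
for url in (
    "https://www.example.co.uk/",
    "https://www.example.co.uk/about-us.html",
):
    print(url, parser.can_fetch("ExampleBot", url))
```

Swapping the empty “Disallow:” for “Disallow: /” in the string should flip every result, which is a quick way to confirm you have not blocked the whole site by accident.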
The file then needs to be located at the top level of your site. So, using our previous https://www.example.co.uk site, you would place the robots.txt file so it can be found at https://www.example.co.uk/robots.txt. If the file is placed anywhere else, a crawler will assume your site doesn’t have a robots.txt file and proceed to crawl your entire site, so it’s important to ensure it’s located at the right address.
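To illustrate the location rule, here is a small sketch that works out where a crawler would look for robots.txt given any page on a site. The function name robots_txt_url is hypothetical; it relies only on Python’s standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the top-level robots.txt address for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.co.uk/customers/orders.html"))
# prints "https://www.example.co.uk/robots.txt"
```

Whatever page you start from, the result is always at the root of the domain, which is why a file saved in a subfolder is never found.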
Using the knowledge above, we can create a more comprehensive robots.txt file with more instructions. Let’s take a look at the below example:-
User-agent: Googlebot
Disallow: /customers/
Disallow: /*.xml$

User-agent: Bingbot
Disallow: /contests/

User-agent: *
Disallow: /
Let’s break down the above example. The first instruction only applies to Google’s robot, and we’ve told it to disallow the customers folder of the site, meaning any web page whose address starts with https://www.example.co.uk/customers/ will be excluded from Google’s crawler. We’ve also included a line stating to disallow “/*.xml$”. This tells Google’s crawler to exclude any page that ends in “.xml”. This is done by using the “*” character to denote any string of characters between the “/” and “.” characters, and the “$” character to indicate the end of the web address.
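The way “*” and “$” behave can be sketched in a few lines of Python. This is an illustrative approximation of wildcard matching as described above, not Google’s actual implementation, and the function name pattern_matches is an assumption made for this example.

```python
import re


def pattern_matches(pattern: str, path: str) -> bool:
    """Approximate robots.txt matching: '*' is any run of characters,
    a trailing '$' pins the pattern to the end of the address."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with "match anything".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    # Without '$', a rule matches any address that starts with the pattern.
    return re.match(regex, path) is not None


for path in ("/sitemap.xml", "/feeds/products.xml", "/sitemap.xml?page=2", "/about-us.html"):
    print(path, pattern_matches("/*.xml$", path))
```

Only the first two paths end in “.xml”, so only they are matched; the third carries extra characters after “.xml” and the “$” rules it out.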
A blank line is then used to indicate that this is the end of the instructions for Google’s robot and that the next set of instructions applies to a different robot, in this case Microsoft’s Bing crawler. What we’ve told Bingbot is to ignore any web address starting with https://www.example.co.uk/contests/. Another blank line then indicates the end of the instructions for Bingbot.
The final section applies to all crawlers except Googlebot and Bingbot. If a robots.txt file includes a set of instructions for specific crawlers and also a section for all crawlers, any crawler that has been named specifically will ignore the instructions for all crawlers and only apply those aimed at it. We’ve told all other crawlers not to crawl our site, as per the single “/” after “Disallow: ”, meaning that only Googlebot and Bingbot will crawl our site, minus the disallows specified under their respective instructions.
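The per-crawler behaviour described in this walkthrough can be reproduced with Python’s standard urllib.robotparser module. This is a rough sketch rather than a model of how any search engine works internally: the wildcard rule is left out because, as far as we are aware, this module only does simple prefix matching, and “OtherBot” is an invented crawler name.

```python
from urllib.robotparser import RobotFileParser

# Three groups separated by blank lines: Googlebot, Bingbot, everyone else.
RULES = """\
User-agent: Googlebot
Disallow: /customers/

User-agent: Bingbot
Disallow: /contests/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

SITE = "https://www.example.co.uk"
for bot in ("Googlebot", "Bingbot", "OtherBot"):
    for path in ("/customers/orders.html", "/contests/win.html", "/about-us.html"):
        print(f"{bot:10} {path:24} {parser.can_fetch(bot, SITE + path)}")
```

Googlebot and Bingbot are each blocked only from their own disallowed folder, while the invented OtherBot falls through to the catch-all group and is blocked everywhere.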
This should show how to create a more complex set of instructions for the crawlers of your site, based on your needs and their ranking rules. However, you may find that your site suffers performance issues when crawlers access it during peak traffic periods. This last section looks at one more instruction that can be used to mitigate the performance problems that can occur whilst your site is crawled.
You can also set the length of time between crawler requests to ensure that your site performance isn’t impacted. This is done through the “Crawl-delay” command. It’s implemented as follows:-
Crawl-delay: [time in seconds]
If this is included in your robots.txt file, it will instruct the relevant robots to delay accessing pages on your site by the time specified, for example:-
User-agent: *
Disallow:
Crawl-delay: 10
In this example, we’ve told all robots they can crawl the entire site, but each page can only be accessed after a 10-second delay. This spaces out the requests crawlers make to your site, reducing performance issues for other users attempting to use it. Bear in mind that support for this command varies between crawlers; Googlebot, for example, ignores it.
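To see how a well-behaved crawler might read this setting, here is a minimal sketch using the crawl_delay method of Python’s standard urllib.robotparser module. “ExampleBot” is an invented crawler name, and the sketch only reads the value; whether a real crawler honours it is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Everything may be crawled, but with a 10-second pause between requests.
RULES = """\
User-agent: *
Disallow:
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "ExampleBot" is a made-up crawler name used only for this check.
delay = parser.crawl_delay("ExampleBot")
allowed = parser.can_fetch("ExampleBot", "https://www.example.co.uk/about-us.html")
print(f"allowed={allowed}, wait {delay} seconds between requests")
```

A polite crawler would pause for that many seconds between page requests before fetching the next address.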
This article has covered some common use cases for the robots.txt file: basic setup, specifying instructions for a specific crawler, and delaying crawler access to prevent performance issues for other users. However, there are more instructions and facilities that can be utilised within the robots.txt file which may benefit your site. For more information on this, you can contact our SEO experts who can answer any questions you have about the best implementation of the robots.txt file for your site and more.