Basic Setup
Let’s start by setting up a basic robots.txt file and putting it in place on your site. The basic format for a simple instruction set is as follows:
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
This is the minimum a robots.txt file needs to tell crawlers how to crawl your site. The user-agent name is the name of the crawler: Google’s is “Googlebot”, Microsoft Bing’s is “Bingbot”, and so on. You can also address all bots at once by using “*”.
The URL string is the part of the web address that follows your main domain. For example, if your website is https://www.example.co.uk and you want to block the “About Us” page, which is located at https://www.example.co.uk/about-us.html, then after “Disallow: ” you would simply type /about-us.html to have the crawler skip your About Us page.
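To see how a rule like this plays out, here is a minimal sketch using Python’s standard-library urllib.robotparser module. The rule set, the is_allowed helper and the choice of bots are made up for illustration, and real crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Googlebot is kept away from the About Us page,
# while every other bot may crawl everything.
RULES = """\
User-agent: Googlebot
Disallow: /about-us.html

User-agent: *
Disallow:
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


about_page = "https://www.example.co.uk/about-us.html"
home_page = "https://www.example.co.uk/"

googlebot_about = is_allowed(RULES, "Googlebot", about_page)
googlebot_home = is_allowed(RULES, "Googlebot", home_page)
bingbot_about = is_allowed(RULES, "Bingbot", about_page)

print(googlebot_about, googlebot_home, bingbot_about)
```

In this sketch only the Googlebot group contains the /about-us.html rule, so the other bot falls through to the “*” group and is not restricted.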
You can tell a crawler to stay away from your entire site by putting “/” after “Disallow: ”. Alternatively, if you want your entire site to be open for crawling, leave the string after “Disallow: ” blank; this tells the crawler you’ve addressed that it may crawl the whole site. With all this in mind, the bare minimum a robots.txt file needs is something like the following:
User-agent: *
Disallow:
The above tells all crawlers that the entire site can be crawled without restriction. Once you’re happy with your rules, save them to a file named “robots.txt”. The file name must be entirely lower case and exactly as written in the quotes. If the file name doesn’t match exactly, the file will be ignored and your restrictions won’t be applied.
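The difference between a blank “Disallow:” and “Disallow: /” is easy to mix up, so it can be worth checking before you upload the file. The sketch below again relies on Python’s standard-library urllib.robotparser; the can_crawl helper and the “ExampleBot” name are hypothetical, and the result reflects how this parser reads the rules rather than a guarantee about any particular crawler.

```python
from urllib.robotparser import RobotFileParser

# A blank Disallow opens the whole site; "Disallow: /" closes it.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def can_crawl(rules: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


page = "https://www.example.co.uk/about-us.html"
open_site = can_crawl(ALLOW_ALL, page)
closed_site = can_crawl(BLOCK_ALL, page)

print(open_site, closed_site)
```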
The file then needs to be located at the top level of your site. So, using our previous https://www.example.co.uk site, you would place the robots.txt file so it can be found at https://www.example.co.uk/robots.txt. If the file is placed anywhere else, a crawler will assume your site doesn’t have a robots.txt file and will proceed to crawl your entire site, so it’s important to make sure it’s at the right address.
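If you are unsure where the file should live for a given page, one way to work out the expected address is to resolve /robots.txt against the page’s URL. This is a small sketch using urljoin from Python’s standard library; the robots_url helper is a hypothetical name for this example.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the top-level robots.txt address for the site hosting `page_url`."""
    # A leading slash makes urljoin discard the page's own path,
    # keeping only the scheme and domain.
    return urljoin(page_url, "/robots.txt")


from_page = robots_url("https://www.example.co.uk/about-us.html")
from_root = robots_url("https://www.example.co.uk")

print(from_page)
print(from_root)
```

Both calls resolve to the same top-level address, which is the only place the file is expected to be found.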