Robots.txt is a plain text file (.txt extension) that we create and upload to our website to keep the robots of certain search engines from crawling content that we do not want them to index or show in their results.
It is a public file that tells crawlers or spiders how to crawl and index our website. In it we can specify, in a simple way, the directories, subdirectories, URLs or files of our website that search engines should not crawl or index.
Because it is closely tied to the indexing of the website, it is essential to configure this file correctly, especially if the website is built with a content management system (CMS) that generates it automatically, since parts that should be crawled can accidentally be marked as non-indexable.
Also called the robots exclusion protocol or robots.txt protocol, it is advisory and does not guarantee secrecy, yet we sometimes find it used to keep parts of a website private. Because that isolation is not complete, using it for this purpose is discouraged: it is a recommendation, not an obligation, and since the file is public, anyone with a browser and the necessary knowledge can read it and easily reach the areas it lists.
The most common uses are to restrict crawler access to certain parts of the website, to prevent the indexing of duplicate content (for example, printable versions of pages), and to tell Google where our sitemap is by including its URL in the file.
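As a rough illustration of two of these uses, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module. The host www.example.com, the /print/ directory and the sitemap URL are made-up values for the example, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: hides printable duplicates and announces the sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The printable copy is blocked, the normal page is not.
print(parser.can_fetch("*", "https://www.example.com/print/article-1"))  # False
print(parser.can_fetch("*", "https://www.example.com/article-1"))        # True

# Sitemap URLs declared in the file (a list, or None when there are none).
print(parser.site_maps())
```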
How do we create the robots.txt file?
To create it, we need access to the root of the domain. We upload a plain text file named “robots.txt” to the top-level (root) directory of our website’s server.
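Since crawlers only look for the file at the root, its address can be derived from any page URL on the same host. The helper below is a minimal sketch of that idea; the function name robots_url and the example.com address are assumptions made for the illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/post?id=7"))
# https://www.example.com/robots.txt
```

A file placed anywhere else, for instance in /blog/robots.txt, would not be found by this lookup.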
Elements of the robots.txt file
The main commands that we will use in a robots.txt file are:
- User-agent: the user agent, that is, the search engine robot or spider that the rules are addressed to. You can find most of them in the database of web robots. Its syntax is:
User-agent: [name of the robot to which the rule will apply]
- Disallow: tells the user agent that it should not access, crawl or index a specific URL, subdirectory or directory.
Disallow: [directory you want to block]
- Allow: the opposite of the previous one; it tells the crawler which URL, subdirectory or directory it may access, crawl or index.
Allow: [URL of a blocked directory or subdirectory that you want to unblock]
The rules given in Disallow and Allow lines apply only to the user agents named in the User-agent lines before them. Multiple Disallow lines can be included for different user agents.
- Slash “/”: must be placed before the element you want to block.
- Match rules: patterns that can be used to simplify the robots.txt file.
Examples: *, $
Asterisk (*): matches any sequence of characters
Dollar symbol ($): matches the end of the URL, to block URLs that end in a specific way
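To make the two match rules concrete, here is a minimal sketch that translates a rule path into a regular expression. It is an illustration of the matching idea, not the code any search engine actually runs; the function names and sample paths are invented for the example.

```python
import re


def rule_to_regex(rule_path: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape every literal piece, then let each '*' match any run of characters.
    body = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(rule_path: str, url_path: str) -> bool:
    # re.match anchors at the start, so a rule without '$' acts as a prefix.
    return rule_to_regex(rule_path).match(url_path) is not None


print(rule_matches("/private*/", "/private-area/page.html"))   # True
print(rule_matches("/*.pdf$", "/docs/guide.pdf"))              # True
print(rule_matches("/*.pdf$", "/docs/guide.pdf?download=1"))   # False
```

The last call shows the effect of the dollar symbol: without it, the rule would also match the URL that carries a query string.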
Syntax of the most used commands in robots.txt
- Indications to a specific bot:
User-agent: [bot name]
- Indications to all bots:
User-agent: *
- Blocking the entire website, using a slash “/”:
Disallow: /
- Block a directory and its contents, putting the name of the directory after the slash:
Disallow: /[directory name]/
- Block a specific web page, putting the page after the slash:
Disallow: /[page file name]
- Block all images on the website:
User-agent: Googlebot-Image
Disallow: /
- Block a single image, putting the image path after the slash:
User-agent: Googlebot-Image
Disallow: /[image path]
- Block a specific file type, putting the extension after the slash:
Disallow: /*.[extension]$
- Block a sequence of characters, using the asterisk:
Disallow: /private-directory*/
- Block URLs that end in a specific way, adding the $ symbol at the end:
Disallow: /*[ending]$
- Allow full access to all robots:
User-agent: *
Disallow:
Another way would be to not use the robots.txt file or leave it empty.
- Block a specific robot or bot:
User-agent: [bot name]
Disallow: /
- Allow crawling by a specific bot:
User-agent: [bot name]
Allow: /
When writing these rules, bear in mind that uppercase and lowercase letters are treated as different and that spaces are significant.
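The sketch below shows how groups for a specific bot and for all bots behave, again using Python's standard urllib.robotparser. The rules, paths and the helper name verdict are assumptions for the illustration. Note that Python's parser applies the first matching rule in file order, while Google picks the most specific one; the Allow line is placed first here so both readings give the same answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Google's image bot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /

User-agent: *
Allow: /private/press-kit/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def verdict(bot: str, path: str) -> str:
    """Return 'allowed' or 'blocked' for the given bot and URL path."""
    return "allowed" if parser.can_fetch(bot, path) else "blocked"


print(verdict("Googlebot-Image", "/photos/cat.jpg"))        # blocked
print(verdict("Googlebot", "/private/notes.html"))          # blocked
print(verdict("Googlebot", "/private/press-kit/logo.png"))  # allowed
print(verdict("Googlebot", "/Private/notes.html"))          # allowed
```

The last line illustrates case sensitivity: /Private/ is not the same path as /private/, so the Disallow rule does not apply to it.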
Testing the robots.txt file on Google
To check that the robots.txt file works, Google Search Console offers a robots.txt testing tool, where you can see how Googlebot will read the file; it shows possible errors or defects that the file has or may cause.
To run the test, go to Google Search Console and, in your control panel, in the Crawl section, choose the option “robots.txt Tester”.
Inside the tester your current robots.txt file appears; you can edit it, or copy and paste the one you want to try. Once the file to test is in place, enter the URL you want to check and select the crawler you want to test it with.
Different Google bots:
The tool gives one of two results: “allowed”, meaning the URL is not blocked, or “blocked”, in which case it indicates the line of code that is blocking that URL.
Above is an example of how the robots.txt tester tells us when a URL is blocked and the line of code where the block is generated.
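The “allowed” or “blocked” answer, together with the blocking line, can be approximated offline. The following is a simplified sketch and not Google's implementation: it assumes exact bot-name matching with a fallback to the * group, plain prefix rules without wildcards, and the longest matching rule winning. The function name check_url and the sample file are invented for the example.

```python
def check_url(robots_txt: str, bot: str, path: str):
    """Return ("allowed", None) or ("blocked", line_number) for one bot and path."""
    groups = {}            # lowercase bot name -> list of (line_no, field, rule_path)
    current_agents = []    # bots that the rules being read belong to
    reading_agents = False
    for line_no, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:            # a new group starts here
                current_agents = []
                reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((line_no, field, value))

    rules = groups.get(bot.lower(), groups.get("*", []))
    best = None
    for line_no, field, rule_path in rules:
        if not rule_path or not path.startswith(rule_path):
            continue                          # empty or non-matching rule
        longer = best is None or len(rule_path) > len(best[2])
        tie_for_allow = (best is not None and len(rule_path) == len(best[2])
                         and field == "allow")
        if longer or tie_for_allow:
            best = (line_no, field, rule_path)
    if best is not None and best[1] == "disallow":
        return ("blocked", best[0])
    return ("allowed", None)


SAMPLE = """\
User-agent: *
Disallow: /private/
Allow: /private/press-kit/
"""

print(check_url(SAMPLE, "Googlebot", "/private/notes.html"))          # ('blocked', 2)
print(check_url(SAMPLE, "Googlebot", "/private/press-kit/logo.png"))  # ('allowed', None)
```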