Robots.txt is a simple text file through which webmasters communicate with web crawlers and instruct them on how to crawl a website. The standard behind this file, called the robots exclusion protocol (REP), is used mainly to guide web crawlers on which parts of a site they may access, crawl and index. The REP also includes directives like meta robots, as well as page-, subdirectory- or site-wide instructions for how search engines should treat links, such as “follow” or “nofollow”.
Robots.txt Format Example:
This robots.txt format uses “User-agent: *” to address all bots, while “Disallow: /” restricts those bots from scanning any page on the site. This limits a compliant crawler’s reach on your web pages, but many other bots, such as email harvesters, spambots, malware robots and robots that scan for security vulnerabilities, ignore the file and will still be able to scan your site.
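The general format described above can be sketched as follows (the bracketed values are placeholders, not literal syntax):

```
User-agent: [name of the bot the rule applies to]
Disallow: [URL path that bot should not crawl]
```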
Here is a robots.txt example:
This will block all the web crawlers from scanning all content on your website.
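Such a file, blocking every crawler from the entire site, looks like this:

```
User-agent: *
Disallow: /
```

The lone slash matches every URL path on the domain.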
This robots.txt allows all web crawlers to scan all content present on your website.
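Allowing everything is written by leaving the Disallow value empty:

```
User-agent: *
Disallow:
```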
This robots.txt will block a specific web crawler from crawling a specific folder. In this case, that particular crawler is Googlebot, which is being instructed not to crawl “www.ctcdc.in/ctcdc-subfolder/”.
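Based on the folder named above, such a file would look like this:

```
User-agent: Googlebot
Disallow: /ctcdc-subfolder/
```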
This syntax blocks a specific web crawler from crawling a specific web page. Here, only Bing’s crawler is told to avoid crawling the specific page at “www.ctcdc.in/ctcdc-subfolder/blocked-page”.
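A sketch of that rule, using Bing’s crawler name (Bingbot) and the page path given above (the original does not show the page’s file extension, so the path is reproduced as-is):

```
User-agent: Bingbot
Disallow: /ctcdc-subfolder/blocked-page
```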
How Does robots.txt Work?
All search engines crawl web pages through crawlers, also called bots or spiders. These search engine bots follow links to get from one site to another, through which they ultimately crawl across billions of links and websites. When a bot arrives at a website, it first searches for a robots.txt file; if it finds one, it reads the file and then follows the instructions it contains. If there is no robots.txt file, or the file gives that bot no instructions, the bot simply continues to scan the site.
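This read-then-obey behaviour can be simulated with Python’s standard-library robots.txt parser; the rules below are hypothetical, for illustration only:

```python
from urllib import robotparser

# A minimal sketch of how a crawler interprets robots.txt rules.
# The file content here is hypothetical.
rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# A well-behaved bot asks the parser before fetching each URL.
print(rp.can_fetch("MyBot", "/private/page.html"))  # False: path is disallowed
print(rp.can_fetch("MyBot", "/public/page.html"))   # True: no rule blocks it
```

In a real crawler, `RobotFileParser.set_url()` and `read()` would download the live file from the site’s root instead of parsing a string.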
A robots.txt file must be placed in the top-level directory of the website. However, some crawlers that visit your website for malicious purposes ignore the file and index your pages anyway. It is also good practice to list any sitemaps associated with your domain at the bottom of the robots.txt file.
Technical robots.txt syntax
Robots.txt syntax commonly uses five types of directives:
User-agent: It refers to the specific web crawler to which you’re giving crawl instructions.
Disallow: This command restricts the crawler from scanning a specific URL path.
Allow: This command is only applicable to Googlebot; it tells Googlebot it can access a page or subfolder even though the parent page or subfolder is disallowed.
Crawl-delay: It specifies how many seconds the crawler should wait before loading and crawling page content.
Sitemap: It is used to call out the location of any XML sitemap associated with this URL.
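A single file can combine these directives; the paths and sitemap URL below are illustrative, not taken from the original:

```
User-agent: *
Crawl-delay: 10
Disallow: /archive/
Allow: /archive/latest/

Sitemap: https://www.ctcdc.in/sitemap.xml
```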
robots.txt can become a little complicated when it comes to the actual URLs, as it permits the use of pattern matching to cover a range of different URL options. Google and Bing both honor two expressions that can be used to identify pages or subfolders an SEO wants excluded: the asterisk (*) and the dollar sign ($).
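For example, a hypothetical rule set using both characters might be:

```
User-agent: *
# The asterisk matches any sequence of characters
Disallow: /*?sessionid=
# The dollar sign anchors the match to the end of the URL
Disallow: /*.pdf$
```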
Where Does robots.txt Go On a Site?
All web crawlers, whether Googlebot or Facebot, search for the robots.txt file at only one location: the main directory of your root domain (your homepage). If the crawler is unable to find a robots.txt file there, it simply continues to scan and index the site. If the file is placed in any other directory, the bot will ignore it as if it never existed.
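In other words, using this article’s example domain:

```
https://www.ctcdc.in/robots.txt        <- found and obeyed
https://www.ctcdc.in/pages/robots.txt  <- ignored, as if it never existed
```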
Why Do You Need robots.txt?
Robots.txt files control which areas of your site a crawler may scan. While it can be very dangerous to accidentally disallow Googlebot from crawling your entire website, robots.txt remains very important for several reasons:
- For preventing duplicate content from appearing on SERPs.
- For keeping entire sections of the website private.
- For locating sitemaps.
- For keeping search engines from indexing images, PDFs and other files.
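A hypothetical file covering these use cases might look like this (all paths are illustrative):

```
User-agent: *
# Keep a private section of the site out of crawlers' reach
Disallow: /private/
# Keep search engines from indexing PDF files
Disallow: /*.pdf$

# Help crawlers locate the sitemap
Sitemap: https://www.ctcdc.in/sitemap.xml
```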
How To Create A robots.txt File?
Before creating a robots.txt file, you need to check whether you already have one. To check, simply type /robots.txt after your root domain in a browser. If no robots.txt page appears, you have no live robots.txt file. To create one, refer to this article
SEO Best Practices
- Always confirm that you are not blocking any content or web pages that you want to be crawled.
- If you have blocked a page, its contents, including the links on it, will not be scanned by the crawler; the only exception is links that also appear on a different, unblocked page that is being indexed by the bots.
- If you want to keep sensitive information out of the SERPs, do not rely on robots.txt: bots can still reach the sensitive information through links from unblocked pages, and malware bots or spam bots may ignore the file entirely and misuse the information. Use a different method, such as password protection, to hide your data.
- Some search engines employ multiple user-agents to crawl different content, such as images versus regular pages. Most user-agents from the same engine follow the same instructions, so you usually do not need to tailor your robots.txt to each individual bot.
Robots.txt vs meta robots vs x-robots
Robots.txt is an actual text file, while meta robots and x-robots are meta directives. Where robots.txt instructs a crawler on the crawling process, the main focus of meta robots and x-robots is on indexation.
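For comparison, a meta robots directive is placed in a page’s HTML head, while x-robots is sent as an HTTP response header; both are shown here with hypothetical values:

```
<!-- meta robots: inside the page's <head> -->
<meta name="robots" content="noindex, nofollow">

<!-- x-robots: delivered as an HTTP header, not in the HTML -->
X-Robots-Tag: noindex, nofollow
```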