What is a robots.txt file?

Many websites provide indexing and search services; Google and Yahoo are among the best known.

It is worth noting that robots (also called bots or crawlers) index every file they can access.

This raises two issues:

  1. Bots can index sensitive information, making it easier for the public to find.
  2. The more files bots process, the higher the load on the server.

Fortunately, there is a way to control crawlers: a robots.txt file.

The robots.txt file contains instructions for search robots and must be placed in the root directory of a domain. The instructions may disallow access to some parts of a website, indicate how to “mirror” a website correctly, or set time-based limits on when and how often robots download files from the server.
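
Because the file always sits in the root directory, its address can be derived from any page URL on the same site. The following is a minimal Python sketch of that idea; the function name robots_txt_url and the example.com address are illustrative assumptions, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL used only for illustration.
print(robots_txt_url("https://www.example.com/shop/item.html?id=7"))
```

A file placed anywhere else, for example in a subdirectory, would not be found by crawlers that look only at this root address.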

Creating the file is as simple as creating a regular .txt file. You can write it in your favorite text editor and upload it to the server, create it in the cPanel File Manager, or use console tools if you have shell access enabled.

Even if you do not intend to create disallow rules for indexers, it is still recommended to create an empty robots.txt file.
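
An empty file places no restrictions on crawlers. As a rough check, the sketch below feeds an empty rule set to the urllib.robotparser module from the Python standard library; the bot name ExampleBot and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# An empty robots.txt has no lines, so there are no rules to apply.
parser = RobotFileParser()
parser.parse([])

# With no rules defined, every path is allowed for every robot.
allowed = parser.can_fetch("ExampleBot", "https://www.example.com/any/page.html")
print(allowed)
```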

Most useful examples of robots.txt files

To deny indexing of the entire website for all bots:

User-agent: *
Disallow: /
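
To see how a well-behaved crawler would interpret these two lines, here is a small sketch using Python's standard urllib.robotparser module. The bot name and URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "Disallow: /" matches every path, so nothing may be fetched.
home_allowed = parser.can_fetch("ExampleBot", "https://www.example.com/")
page_allowed = parser.can_fetch("ExampleBot", "https://www.example.com/docs/a.html")
print(home_allowed, page_allowed)
```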

To allow access for just one robot (here Googlebot, the name Google's crawler identifies itself with):

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
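
The effect of combining a named record with a catch-all record can be checked with Python's urllib.robotparser. This sketch assumes the permitted crawler identifies itself as Googlebot; OtherBot and the URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "https://www.example.com/index.html"
# An empty Disallow value means "nothing is disallowed" for that robot.
google_allowed = parser.can_fetch("Googlebot", url)
# Every other robot falls through to the "*" record and is blocked.
other_allowed = parser.can_fetch("OtherBot", url)
print(google_allowed, other_allowed)
```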

To deny all robots access to part of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
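
Disallow values are matched as path prefixes, which is why the trailing slash matters. The sketch below, again using Python's urllib.robotparser with made-up paths and a hypothetical bot name, shows that /tmp/ blocks files inside that directory but not a file whose name merely begins with "tmp".

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "https://www.example.com"
results = {}
for path in ["/tmp/cache.dat", "/tmpfile.html", "/personal/", "/index.html"]:
    # A path is blocked only if it starts with one of the Disallow prefixes.
    results[path] = parser.can_fetch("ExampleBot", base + path)
    print(path, results[path])
```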

To disallow indexing of some files in a directory for all bots:

User-agent: *
Disallow: /~user/junk.html
Disallow: /~user/playlist.html
Disallow: /~user/photos.html
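
When many individual files need to be listed, the rules can be generated instead of typed by hand. This is only a sketch: build_robots_txt is a hypothetical helper, not part of any library, and it handles a single User-agent record.

```python
def build_robots_txt(user_agent: str, disallowed: list[str]) -> str:
    """Build one robots.txt record with a Disallow line per path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed]
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    "*",
    ["/~user/junk.html", "/~user/playlist.html", "/~user/photos.html"],
)
print(text)
```

The resulting text can be saved as robots.txt and uploaded to the root directory of the domain.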

A list of robots can be found here:
http://www.robotstxt.org/db.html

There are also additional directives that not all robots support (text after the # “pound sign” is a comment and does not affect the indexing rules):

User-agent: * # rules apply to all robots
Disallow: /downloads/ # disallow access to the 'downloads' directory
Request-rate: 1/5 # limit crawlers to loading 1 page per 5 seconds
Visit-time: 0600-0845 # allow indexing only between 6:00 a.m. and 8:45 a.m.
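
As a rough illustration of how a crawler might honor these directives: Python's urllib.robotparser reads Request-rate (Python 3.6 and later) but does not interpret Visit-time, so the in_visit_window helper below is a hypothetical addition written for this example. The bot name is also made up, and the sketch does not address which time zone the window refers to.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: * # rules apply to all robots
Disallow: /downloads/
Request-rate: 1/5
Visit-time: 0600-0845
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())  # text after "#" is discarded as a comment

# Request-rate: 1/5 is exposed as a (requests, seconds) pair.
rate = parser.request_rate("ExampleBot")
min_delay = rate.seconds / rate.requests  # seconds to wait between pages


def in_visit_window(window: str, now: time) -> bool:
    """Hypothetical helper: is `now` inside an 'HHMM-HHMM' window?"""
    start, end = window.split("-")
    start_t = time(int(start[:2]), int(start[2:]))
    end_t = time(int(end[:2]), int(end[2:]))
    return start_t <= now <= end_t


print(min_delay)
print(in_visit_window("0600-0845", time(7, 30)))
print(in_visit_window("0600-0845", time(9, 0)))
```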


Although these directives are simple, they let you manage robot access to your website effectively.

[alert]Note: some small crawlers may ignore the robots.txt file.[/alert]
