Many websites provide indexing and search services; Google and Yahoo are the best known.
One thing is worth noting: their robots (also called bots or crawlers) index every file they can access.
Two issues arise from this:
Fortunately, there is a way to control the crawlers: a robots.txt file.
The robots.txt file contains instructions for search robots and has to be placed in the root directory of a domain. The instructions can disallow access to some parts of a website, indicate how to “mirror” a website correctly, and set time limits for robots downloading files from the server.
Creating the file is as simple as creating a regular .txt file. You can write it in your favorite text editor and upload it to the server, create it in the cPanel File Manager, or use console tools if you have shell access enabled.
Even if you do not intend to create any disallow rules for indexers, it is still recommended to create an empty robots.txt file.
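Because the file always sits in the root directory of a domain, its address can be derived from any page URL on that site. The following is a minimal sketch of that derivation using only the Python standard library; the helper name `robots_url` and the example.com addresses are illustrative assumptions, not part of any crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host serving page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://www.example.com/blog/post.html?id=7"))
```

A file placed anywhere else, for example in a subdirectory, would not be found at this address.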
To deny indexing of the entire website for all bots:
User-agent: *
Disallow: /
To allow access to just one robot (here Googlebot, Google's crawler) and deny it to all others:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
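One way to check how such per-robot records are interpreted is Python's standard `urllib.robotparser` module. The sketch below assumes the crawler identifies itself as `Googlebot`; the name `OtherBot`, the `allowed` helper and the example.com URL are made up for the test. Real crawlers may apply their own matching rules, so treat this as an approximation.

```python
from urllib.robotparser import RobotFileParser

# One record for a single robot (empty Disallow = nothing is forbidden),
# followed by a catch-all record that forbids everything.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under `rules` (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://www.example.com/page.html"
    print("Googlebot:", allowed(RULES, "Googlebot", page))
    print("OtherBot:", allowed(RULES, "OtherBot", page))
```

The blank line between the two records matters: it separates the rules for the named robot from the rules for everyone else.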
To deny all robots access to part of a website:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
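Disallow values are compared against the beginning of the requested path, which is why the trailing slash matters. The sketch below illustrates this with Python's standard `urllib.robotparser`; the bot name `AnyBot`, the `blocked` helper and the sample paths are assumptions chosen for the illustration, and individual crawlers may match paths slightly differently.

```python
from urllib.robotparser import RobotFileParser

DIRECTORY_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /personal/
"""


def blocked(path: str) -> bool:
    """Return True if a generic robot is denied `path` (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(DIRECTORY_RULES.splitlines())
    return not parser.can_fetch("AnyBot", path)


if __name__ == "__main__":
    for path in ("/tmp/cache.dat", "/personal/notes/a.html", "/tmp", "/index.html"):
        print(path, "blocked" if blocked(path) else "allowed")
```

In this sketch everything below `/tmp/` is blocked, while the bare path `/tmp` is not, because it does not start with `/tmp/`.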
To disallow indexing of specific files in a directory for all bots:
User-agent: *
Disallow: /~user/junk.html
Disallow: /~user/playlist.html
Disallow: /~user/photos.html
A list of robots can be found here:
There are also additional directives that not all robots support (text after #, the “pound sign”, is a comment and does not affect indexing rules):
|User-agent: *|# the rules apply to all robots|
|Disallow: /downloads/|# disallows access to the ‘downloads’ directory|
|Request-rate: 1/5|# limits crawlers to loading 1 page per 5 seconds|
|Visit-time: 0600-0845|# pages may be indexed only from 6:00 a.m. to 8:45 a.m.|
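To show how a well-behaved crawler might honor these two directives, here is a hedged sketch. Python's standard `urllib.robotparser` reads `Request-rate` (Python 3.6 and later) but does not expose `Visit-time`, so the `in_visit_window` helper below is a hypothetical addition that assumes an `HHMM-HHMM` window that does not cross midnight. The bot name `AnyBot` is also made up.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /downloads/
Request-rate: 1/5
Visit-time: 0600-0845
"""


def min_delay_seconds(rules: str, agent: str) -> float:
    """Seconds to wait between requests according to Request-rate (0.0 if unset)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    rate = parser.request_rate(agent)  # named tuple (requests, seconds) or None
    return rate.seconds / rate.requests if rate else 0.0


def in_visit_window(window: str, now: time) -> bool:
    """Check `now` against an 'HHMM-HHMM' window (assumed not to cross midnight)."""
    start, end = window.split("-")
    start_t = time(int(start[:2]), int(start[2:]))
    end_t = time(int(end[:2]), int(end[2:]))
    return start_t <= now <= end_t


if __name__ == "__main__":
    print("delay:", min_delay_seconds(RULES, "AnyBot"))
    print("07:30 allowed:", in_visit_window("0600-0845", time(7, 30)))
    print("09:00 allowed:", in_visit_window("0600-0845", time(9, 0)))
```

With `Request-rate: 1/5` the computed delay is 5 seconds per request. A crawler that does not recognize these directives simply skips them, which is why they cannot be relied on for every robot.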
Although these directives are rather simple, they make it possible to manage robot access to a website effectively.
[alert]Note: some small crawlers may ignore the robots.txt file.[/alert]