Robots Exclusion Standard
The Robots Exclusion Standard is an unofficial standard followed by all "polite" web crawler software. The robots.txt file instructs web crawlers on how to behave when visiting a web server.
How to instruct crawlers
Create a file named /robots.txt in your website's root directory (domain.tld/robots.txt).
The two basic instructions are "User-agent" and "Disallow". The following allows every crawler to access everything:
User-agent: *
Disallow:
Disallowing nothing means everything is allowed. Disallowing "/" disallows your whole domain:
User-agent: *
Disallow: /
These are the basic instructions. It is possible to disallow many files and folders, and to use several User-agent/Disallow sets in order to give different crawlers different instructions:
User-agent: NameOfBotWeDislike
Disallow: /

User-agent: CatchBadBots
Disallow: /trap/

User-agent: *
Disallow: /directory/file1.html
Disallow: /directory/file2.html
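Polite crawlers read these rules before fetching any page. As a minimal sketch of how that check works, Python's standard urllib.robotparser module can be used; the host name and bot name below are placeholders, not real sites or crawlers:

from urllib import robotparser

# Hypothetical host and bot name, for illustration only
rp = robotparser.RobotFileParser()
rp.set_url("https://example.tld/robots.txt")
rp.read()  # download and parse the robots.txt file

# Ask whether this user-agent may fetch a given URL
if rp.can_fetch("ExampleBot", "https://example.tld/directory/file1.html"):
    print("robots.txt allows this URL")
else:
    print("robots.txt disallows this URL")

If a site served the last rule set above, can_fetch would return False for that path for any user-agent not matched by one of the named entries.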
Respected by some, not by others
The two basic instructions mentioned above are followed by all "polite" crawler software.
Some crawlers will follow "Crawl-delay" (in seconds):
User-agent: *
Disallow: /trap/
Crawl-delay: 10 # Wait at least 10 seconds between requests
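A crawler that honors this directive can read the value programmatically and sleep between requests. A minimal sketch, assuming a placeholder host and Python 3.6+ (where crawl_delay is available):

import time
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.tld/robots.txt")  # placeholder host
rp.read()

delay = rp.crawl_delay("*")  # None if no Crawl-delay directive applies
if delay:
    time.sleep(delay)  # wait before issuing the next request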
Some also follow "Request-rate" (pages per interval, in seconds) and "Visit-time". Visit-time is interpreted as GMT.
User-agent: *
Disallow: /trap/
Request-rate: 1/5 # maximum rate is one page every 5 seconds
Visit-time: 0600-0845 # only visit between 6:00 AM and 8:45 AM UT (GMT)
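Request-rate can likewise be read with urllib.robotparser (Python 3.6+); Visit-time has no standard parser support there, so a crawler would have to honor it by hand. A minimal sketch with a placeholder host:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.tld/robots.txt")  # placeholder host
rp.read()

rate = rp.request_rate("*")  # named tuple (requests, seconds), or None
if rate:
    seconds_per_page = rate.seconds / rate.requests
    print(f"At most one page every {seconds_per_page:.1f} seconds")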
More information
Examples:
- Google: http://google.com/robots.txt
- Microsoft: http://www.microsoft.com/robots.txt