Search Engine Control: robots.txt
How to ask search engines to behave the way you want.
If you have a page on the web, you will get visits from search engines. They just find you. They always do. You cannot hide. It is inevitable.
1. Using robots.txt
Search engines request the file /robots.txt (in the root of your website) when they first visit your site.
The original standard defines only two directives for this file: User-agent and Disallow. It has no Allow directive, although many crawlers now recognize one as an extension (later standardized in RFC 9309). To allow all spiders to crawl your entire site, use a robots.txt with:
User-agent: *
Disallow:
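Because the file always lives at the root of the site, a crawler can work out its address from any page URL. The following is a minimal sketch in Python using the standard urllib.parse module; the function name robots_url and the example.com address are illustrative assumptions, not part of any standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only for illustration.
print(robots_url("http://www.example.com/folder/page.html?x=1"))
```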
1.1. Examples
The wildcard "*" matches all robots. This allows all of them to index your entire site:
User-agent: *
Disallow:
This asks all robots to stay out:
User-agent: *
Disallow: /
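To see how a robot might interpret these two files, here is a small sketch using Python's standard urllib.robotparser module. The robot name "examplebot", the helper build_parser, and the example.com URL are made up for illustration; real crawlers may use their own parsers.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Parse robots.txt content given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


# An empty Disallow value blocks nothing; "/" blocks every path.
allow_all = build_parser(["User-agent: *", "Disallow:"])
block_all = build_parser(["User-agent: *", "Disallow: /"])

page = "http://www.example.com/page.html"  # hypothetical URL
print(allow_all.can_fetch("examplebot", page))
print(block_all.can_fetch("examplebot", page))
```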
To ask robots not to index a folder or file:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /specialfile.html
Disallow: /folder/anotherfile.html
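Disallow values are compared against the beginning of the URL path, which is why each one should start with a slash. The sketch below illustrates that idea, assuming plain prefix matching as in the original standard (no wildcard extensions); the function name is_disallowed is hypothetical.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


rules = ["/cgi-bin/", "/images/", "/specialfile.html", "/folder/anotherfile.html"]

print(is_disallowed("/cgi-bin/search.cgi", rules))  # inside a blocked folder
print(is_disallowed("/index.html", rules))          # not covered by any rule

# A value without the leading slash never matches, because URL paths
# always begin with "/".
print(is_disallowed("/specialfile.html", ["specialfile.html"]))
```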
To keep a single search engine out entirely:
User-agent: scooter
Disallow: /
To keep a single search engine from accessing a single file:
User-agent: googlebot
Disallow: /myfile.cgi
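Rules for different robots can live in one file as separate records divided by a blank line. As a rough check of how such a file is read, the sketch below feeds both records to Python's standard urllib.robotparser module. The helper allowed, the robot name "otherbot", and the example.com host are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two records: one per robot, separated by a blank line.
ROBOTS_TXT = """\
User-agent: scooter
Disallow: /

User-agent: googlebot
Disallow: /myfile.cgi
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Ask whether the named robot may fetch path on a hypothetical host."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("scooter", "/index.html"))    # scooter is kept out entirely
print(allowed("googlebot", "/myfile.cgi"))  # googlebot is kept from one file
print(allowed("googlebot", "/index.html"))
print(allowed("otherbot", "/myfile.cgi"))   # no record applies to this robot
```

A robot that matches no record, and for which there is no "*" record, is treated as unrestricted.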
Look at CNN's robots.txt for a very complex example.
You can use http://www.searchengineworld.com/cgi-bin/robotcheck.cgi to validate your robots.txt file.
Links:
- Robots.txt Tutorial: http://www.searchengineworld.com/robots/robots_tutorial.htm
- The Robots.txt Our Big Crawl: http://www.searchengineworld.com/misc/robots_txt_crawl.htm
- Creating a Robots.txt file: http://www.elsners.com/webdesign/sroy4.html
- Stop Search Engine Robots Indexing Your Private Folders by 'robots.txt': http://www.webmasters-central.com/wp/se/robotstxt.shtml
- The Web Robots FAQ
2. How to make spiders behave based on meta tags embedded in pages
You can use a tag called meta to ask robots not to index or follow links on a page. This tag should be placed in the head section of your document.
<html>
<head>
<meta name="robots" content="noindex,nofollow">
...
</head>
<body>
The robots meta tag supports the values index or noindex and follow or nofollow, separated by a comma. Examples:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,nofollow">
Note that not all spiders understand or respect your wishes.
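To illustrate how a spider might read this tag, here is a minimal sketch using Python's standard html.parser module. The class RobotsMetaParser and the helpers robots_directives, may_index and may_follow are hypothetical names, and the sketch assumes that a missing tag means indexing and following are permitted.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of values.
        for token in (attributes.get("content") or "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


def may_index(html_text):
    return "noindex" not in robots_directives(html_text)


def may_follow(html_text):
    return "nofollow" not in robots_directives(html_text)


page = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
print(may_index(page), may_follow(page))
```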
