Search Engine Control: robots.txt

How to ask search engines to behave the way you want.

  1. Using robots.txt
  2. How to make spiders behave based on meta tags embedded in pages

If you have a page on the web, you will get visits from search engines. They just find you. They always do. You cannot hide. It is inevitable.

1. Using robots.txt

Well-behaved search engine spiders request the file /robots.txt (in the root of your website) when they first visit your site.
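
Because the file always sits at the root, a spider can work out its address from any page URL on the site. A minimal Python sketch of that step follows; the function name robots_url and the www.example.com address are illustrative assumptions, not part of any particular crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page deep inside a site.
print(robots_url("http://www.example.com/folder/page.html?x=1"))
```

Whatever page the spider starts from, the result points at the same single file for that host.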

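The original robots exclusion standard defines only two fields for this file: User-agent and Disallow. It has no Allow directive, although many crawlers now accept one as an extension. To let all spiders crawl your entire site, use a robots.txt with an empty Disallow line:

  User-agent: *
  Disallow: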

1.1. Examples

The wildcard "*" matches all robots. This allows all of them to index your entire site:

  User-agent: *
  Disallow:

This asks all robots to stay out:

  User-agent: *
  Disallow: /

To ask robots not to index a folder or file:

  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /images/
  Disallow: /specialfile.html
  Disallow: /folder/anotherfile.html
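
One way to see what such a record blocks is Python's standard urllib.robotparser module. The sketch below is only an illustration: the host www.example.com and the robot name AnyBot are made up, and every path is written with a leading slash so that it matches from the site root.

```python
from urllib.robotparser import RobotFileParser

# Rules for all robots; each path starts with "/" (the site root).
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /specialfile.html
Disallow: /folder/anotherfile.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path, agent="AnyBot"):
    """Return True if the hypothetical robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


for path in ("/index.html", "/images/logo.gif",
             "/folder/anotherfile.html", "/folder/other.html"):
    print(path, "allowed" if allowed(path) else "blocked")
```

Matching is by path prefix, so /images/ blocks everything in that folder, while /folder/anotherfile.html blocks only that one file and leaves the rest of /folder/ open.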

To keep a single search engine out entirely:

  User-agent: scooter
  Disallow: /

To keep a single search engine from accessing a single file:

  User-agent: googlebot
  Disallow: /myfile.cgi
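
To check how records for individual robots are matched, the rules can be fed to Python's standard urllib.robotparser module. This is a sketch under assumptions: the host www.example.com, the version strings after the robot names, and the robot OtherBot are invented for the example, and the file path is written with a leading slash.

```python
from urllib.robotparser import RobotFileParser

# Two records, separated by a blank line, each aimed at one robot.
RULES = """\
User-agent: scooter
Disallow: /

User-agent: googlebot
Disallow: /myfile.cgi
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


# robotparser compares the name before "/" case-insensitively.
print(allowed("Scooter/3.2", "/index.html"))    # scooter is shut out entirely
print(allowed("Googlebot/2.1", "/myfile.cgi"))  # googlebot loses one file
print(allowed("Googlebot/2.1", "/index.html"))  # but keeps the rest
print(allowed("OtherBot/1.0", "/myfile.cgi"))   # no record applies to this robot
```

A robot that finds no record with its name, and no "*" record, is not restricted at all.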

Look at CNN's robots.txt for a very complex example.

You can use a robots.txt validator to check your file.
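
If no validator is at hand, a few basic checks are easy to script. The following Python sketch is a hypothetical helper, not a real validation service: lint_robots is an invented name, and it checks only the two fields discussed here (User-agent and Disallow), so it will flag any other field as unknown even if some crawlers accept it.

```python
def lint_robots(text):
    """Return a list of problems found in the text of a robots.txt file."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_agent = True
            if not value:
                problems.append(f"line {number}: empty User-agent")
        elif field == "disallow":
            if not seen_agent:
                problems.append(f"line {number}: Disallow before any User-agent")
            elif value and not value.startswith("/"):
                problems.append(f"line {number}: path should start with '/'")
        else:
            problems.append(f"line {number}: unknown field '{field}'")
    return problems


print(lint_robots("User-agent: googlebot\nDisallow: myfile.cgi\n"))
```

An empty list means none of these particular checks failed. It does not prove that every spider will read the file the way you intended.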



2. How to make spiders behave based on meta tags embedded in pages

You can use the meta tag to ask robots not to index a page, not to follow the links on it, or both. Place this tag in the head section of your document.

  <meta name="robots" content="noindex,nofollow">

The robots meta tag supports the values (no)index and (no)follow. Examples:

  <meta name="robots" content="index,follow">
  <meta name="robots" content="noindex,nofollow">

Note that not all spiders understand or respect your wishes.
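
To show how a cooperating spider might read this tag, here is a minimal Python sketch built on the standard html.parser module. The class and function names are invented for the example, and it assumes that a page without a robots meta tag may be both indexed and followed.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            if token.strip():
                self.directives.add(token.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def may_index(html):
    return "noindex" not in robots_directives(html)


def may_follow(html):
    return "nofollow" not in robots_directives(html)


page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
print(may_index(page), may_follow(page))
```

The check runs entirely inside the spider, which is why a page author can only ask, not enforce.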

