The indexable web

From LinuxReviews

The indexable web (also called "the surface web" and sometimes, loosely, "the visible web") is the part of the World Wide Web that can be indexed by search engines that follow the rules of web crawling, such as obeying robots.txt.

The Search Engines

Search engines build a (distributed) database of the web using computer programs known as spiders (or "web crawlers"). A spider starts with a list of one or more websites and slowly follows hyperlinks from one page to another until it has indexed "the whole web". The set of pages these spiders can reach is the indexable web.
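The link-following idea can be sketched as a breadth-first walk over a link graph. This is only an illustration, not how any particular search engine works: the `LINKS` dictionary and the `crawl` function are hypothetical, and the "web" here is a small in-memory table rather than real pages fetched over the network.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/news"],
    "b.example/news": [],
    "c.example/": ["a.example/"],  # no page links TO this one
}


def crawl(seeds, links):
    """Return the set of pages reachable from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


reachable = crawl(["a.example/"], LINKS)
print(sorted(reachable))
```

In this toy graph, `c.example/` exists but is never reached, because the spider starts at `a.example/` and no page links to `c.example/`.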

The Robots Exclusion Standard

The Robots Exclusion Standard allows webmasters to place a file called robots.txt on their site. If this file says that /foo may not be indexed, most (polite) spiders will leave it alone. You can still read everything under /foo when you visit such a site, but you have to find it some other way. Thus /foo is on the web but is not part of the indexable web.
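A polite spider's check against robots.txt might look roughly like the following sketch, which uses `urllib.robotparser` from the Python standard library. The robots.txt content, the `example.org` URLs, the `ExampleBot` user agent and the `may_fetch` helper are all made up for illustration; a real spider would download the file from the site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids every spider from visiting /foo.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foo
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, agent="ExampleBot"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


foo_allowed = may_fetch("https://example.org/foo")
bar_allowed = may_fetch("https://example.org/bar")
print(foo_allowed, bar_allowed)
```

Nothing in this check stops anyone from fetching /foo. The file is a request that polite spiders choose to honor, not an access control.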

It should also be noted that many spiders do not follow links that are generated by JavaScript or embedded in Flash files. Polite spiders also do not try to break into password-protected areas.
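The reason script-generated links are easy to miss can be shown with a simple link extractor that only reads the HTML as delivered and never runs any script. This is a minimal sketch using `html.parser` from the Python standard library; the `PAGE` markup and the `LinkCollector` class are invented for the example and do not represent any real spider.

```python
from html.parser import HTMLParser

# Hypothetical page with one ordinary link and one link written by a script.
PAGE = """
<a href="/static-page">Static link</a>
<script>
  document.write('<a href="/script-page">Scripted link</a>');
</script>
"""


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
found = collector.links
print(found)
```

The parser treats the contents of the `<script>` element as plain text, so only `/static-page` is collected. A spider would have to execute the script to discover `/script-page`.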

The Visible Web

The visible web is the part of the Internet you can find in search engines.

This is not the same as the indexable web.

The difference between the indexable and the visible web is:

Most search engines censor sites. Such sites can be indexed but are not. A big search engine can say "Linuxreviews? We don't like that site. We're going to put it on our list of sites that don't appear in our search engine right now".

There are also other reasons why sites that can be indexed are not: a site may be new, or the crawlers may not have stumbled on any links to it yet.

The visible web is much smaller than the indexable web.

The Deep Web

The Deep Web is a term sometimes used for the parts of the Internet that exist but cannot be found by search engines. The term sometimes means the whole Internet (visible and invisible) and sometimes only the "invisible" parts of the net that cannot be found using search engines.