HTML Rules For Preventing Search Engines From Indexing Parts Of Web Pages

From LinuxReviews

There is no universal way to stop search engines from indexing part of a web page. It is, sadly, that simple. There are, however, a few simple things you can do to keep certain spiders from indexing certain parts of a page.

MediaWiki

Let's start with something that will likely not concern you but does concern us. The CirrusSearch MediaWiki extension supports a special navigation-not-searchable class for <div> elements.

<div class="navigation-not-searchable">
This will not be indexed. Useful for templates that create navigation and things like that.
</div>

We use this on the News item pages, where there is a collection of links to recent news items at the bottom. That part does not need to be indexed by anyone. The MediaWiki CirrusSearch extension is a special use case and likely not why you're here. Moving on.

All The Search Engines

Here's the sad truth: there is no universal standard, and Yandex is the only search engine that has made up its own. That means you can ask Yandex not to index parts of a web page and expect that they, and nobody else, will care.

Yandex

The Russian search engine Yandex (user agent Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)) respects a special <noindex> tag. This is different from, say, <meta name="robots" content="noindex">, which applies to an entire page. You are supposed to place it within the HTML of a page, like <noindex>Do not index this part</noindex>. This is silly since there is no valid <noindex> HTML tag. You can, luckily, place it within comments:

<!--noindex-->Yandex, and only Yandex, will ignore this part of a web page.<!--/noindex-->

While it is nice that you can instruct the Russians not to index a part of a page this way, it's mostly irrelevant since nobody else cares about that noindex tag, not even a little. But if you're using something like MediaWiki and you're already marking a section of a page as not to be indexed, you might as well inform the Russians while you're at it:

<div class="navigation-not-searchable"><!--noindex-->
This will not be indexed. Useful for templates that create navigation and things like that.
<!--/noindex--></div>

Google Search Appliance

Just to clarify right off the top: there is no way to make Google's web crawler ignore a given part of a web page. You can ask it to ignore an entire page or to index an entire page, nothing in between.
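
As a minimal sketch of the page-level option, the standard robots meta tag (the same one mentioned in the Yandex section) goes in the page's <head>. The title and body text here are placeholders, not taken from any real page:

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Asks crawlers that honour the robots meta tag to keep this ENTIRE page out of their index -->
  <meta name="robots" content="noindex">
  <title>Placeholder title</title>
</head>
<body>
  <!-- The directive covers everything on the page; it cannot be scoped to one section -->
  <p>None of this page should be indexed.</p>
</body>
</html>
```

This is all-or-nothing by design: leave the tag out and the whole page is eligible for indexing, put it in and none of it is.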

Google produced a special rack-mounted search "appliance" called "Google Search Appliance" from 2002 to 2014. They terminated all support for it in 2018. This very special device had support for:

This was indexed.
<!--googleoff: all-->
This wasn't indexed
<!--googleon: all-->
This was also indexed.

This is completely irrelevant today since regular Google Search never used these tags and the Google Search Appliance is discontinued.

