Sitemap

From LinuxReviews

A sitemap is a graphical or structured list of the pages available on a website. Search engine crawlers look for and use XML-formatted sitemaps to efficiently crawl the websites they index. It is a good idea to check whether your website's content management system can create a sitemap and make it available. Sitemaps can also be human-readable lists or graphs that can be used to plan or optimize a website's structure.

XML Sitemaps

Generating XML Sitemaps

An XML sitemap should consist of url nodes, each with a loc indicating the location (URL). Only loc is mandatory. A url node can also carry a lastmod last-modified time-stamp, a crawl priority between 0.0 and 1.0 and a changefreq with a value like weekly or monthly indicating how frequently the page changes.

A url node can look like:

  <url>
    <loc>https://linuxreviews.org/AMD_graphics</loc>
    <lastmod>2020-08-06T07:43:40Z</lastmod>
    <priority>1.0</priority>
  </url>
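
The lastmod value in the node above is a UTC time-stamp with a trailing Z. A minimal sketch of how such a value could be produced in Python follows; the function name format_lastmod is made up for this illustration, and the sketch assumes you already have a datetime for the page's last change.

```python
from datetime import datetime, timezone


def format_lastmod(changed_at):
    """Format a datetime as a sitemap lastmod value such as 2020-08-06T07:43:40Z.

    The datetime is converted to UTC first. A naive datetime (one without
    tzinfo) is treated as local time by astimezone(), so pass an aware one
    if you want predictable results.
    """
    return changed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(format_lastmod(datetime(2020, 8, 6, 7, 43, 40, tzinfo=timezone.utc)))
```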

The url nodes go inside a urlset root node. An <?xml declaration at the top of the file is also required. A complete sitemap.xml file could look like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://linuxreviews.org/AMD_graphics</loc>
    <lastmod>2020-08-06T07:43:40Z</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://linuxreviews.org/AMP</loc>
    <lastmod>2020-07-22T20:34:38Z</lastmod>
    <priority>1.0</priority>
  </url>
</urlset>
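
A file like the one above can be generated with a few lines of Python's standard library if your content management system cannot do it for you. This is only a sketch: the function name build_sitemap and the (loc, lastmod, priority) tuple layout are assumptions made for the example, and tostring() with xml_declaration needs Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Serialize the sitemap namespace as the default xmlns, without a prefix.
ET.register_namespace("", SITEMAP_NS)


def build_sitemap(pages):
    """Return sitemap XML as bytes from (loc, lastmod, priority) tuples."""
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for loc, lastmod, priority in pages:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        # ElementTree escapes special characters in .text when serializing.
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
        ET.SubElement(url, "{%s}priority" % SITEMAP_NS).text = priority
    # encoding="UTF-8" returns bytes; xml_declaration=True adds the <?xml header.
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


pages = [
    ("https://linuxreviews.org/AMD_graphics", "2020-08-06T07:43:40Z", "1.0"),
    ("https://linuxreviews.org/AMP", "2020-07-22T20:34:38Z", "1.0"),
]
print(build_sitemap(pages).decode("utf-8"))
```

The output is not pretty-printed, which makes no difference to crawlers.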

Sitemap files need to be UTF-8 encoded (not UTF-16)[1], and five special characters in data values need to be replaced with escape codes:

Character          Escape Code
Ampersand &        &amp;
Single Quote '     &apos;
Double Quote "     &quot;
Greater Than >     &gt;
Less Than <        &lt;
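
If you build sitemap entries with plain string formatting rather than an XML library, the five escapes above can be applied with xml.sax.saxutils.escape from Python's standard library. It handles &, < and > on its own; the two quote characters are passed in as extra entities. The function name escape_sitemap_value is made up for this sketch.

```python
from xml.sax.saxutils import escape

# escape() always replaces &, < and >; the quotes have to be added explicitly.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_sitemap_value(value):
    """Escape the five special characters in a sitemap data value."""
    return escape(value, QUOTE_ENTITIES)


print(escape_sitemap_value("https://example.com/?a=1&b='x'"))
```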

A sitemap should not be larger than 50 MiB in size (uncompressed) and it should not contain more than 50,000 URLs. Gigantic sites can overcome that limit by creating a sitemap index file that points to multiple sitemaps. Such a sitemap index file should have a sitemapindex root node with sitemap nodes containing loc entries with sitemap URLs and lastmod time-stamps. A complete sitemap index file could look like:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap/importantpages.xml</loc>
    <lastmod>2020-08-12T11:33:19Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap/boringpages.xml</loc>
    <lastmod>2020-08-12T11:33:19Z</lastmod>
  </sitemap>
</sitemapindex>
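
Splitting a long URL list at the 50,000 limit and writing an index for the resulting files could be sketched as follows. The function names and the part-N.xml naming scheme mentioned below are assumptions for illustration only, and the sketch does not check the 50 MiB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # URL limit per sitemap file

# Serialize the sitemap namespace as the default xmlns, without a prefix.
ET.register_namespace("", SITEMAP_NS)


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of urls, each at most size entries long."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap_index(sitemap_locs, lastmod):
    """Return a sitemap index (bytes) pointing at the given sitemap URLs."""
    index = ET.Element("{%s}sitemapindex" % SITEMAP_NS)
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, "{%s}sitemap" % SITEMAP_NS)
        ET.SubElement(sitemap, "{%s}loc" % SITEMAP_NS).text = loc
        ET.SubElement(sitemap, "{%s}lastmod" % SITEMAP_NS).text = lastmod
    # xml_declaration requires Python 3.8 or newer.
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)
```

A site with 120,001 pages would get three chunks from chunk_urls() (50,000, 50,000 and 20,001 URLs). Each chunk would be written to its own sitemap file, for example https://example.com/sitemap/part-1.xml, and the list of those file URLs would be passed to build_sitemap_index().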

Writing XML sitemap files by hand is impractical for anything but the smallest sites. Your best option is to use a feature built into the content management system you are using or a plug-in. There are several sitemap plug-ins available for WordPress. MediaWiki has a built-in tool available in maintenance/generateSitemap.php.

Making Search Engines Aware Of Your Sitemap

Web crawlers used by search engines and others will pick up any sitemap specified in /robots.txt. All you need is one line with Sitemap: and a URL:

Sitemap: https://linuxreviews.org/sitemap.xml

Multiple Sitemap: lines are supported by the robots.txt standard.
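
You can check how a crawler would read the Sitemap: lines in a robots.txt with Python's urllib.robotparser module. The sketch below parses an in-memory robots.txt instead of fetching one; the second sitemap URL (sitemap-news.xml) is invented for the example, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; sitemap-news.xml is a made-up second sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://linuxreviews.org/sitemap.xml
Sitemap: https://linuxreviews.org/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```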

Google

Google has a special URL you can use to submit sitemaps to its search engine. Append the URL of your sitemap to http://www.google.com/ping?sitemap= (as in http://www.google.com/ping?sitemap=https://example.com/sitemap.xml) and visit the result to submit the sitemap.[2] Note that Google announced in 2023 that it was retiring this ping endpoint, so it may no longer work.

You do not need to submit a newly added sitemap if you add its URL to robots.txt and GoogleBot is crawling your site on a regular basis. It will see the sitemap and use it automatically.

Bing

Bing provides a special URL you can use to submit sitemaps. Their tool requires that the URL of the sitemap is URL-encoded, so you can't just submit a plain URL. You can make bingbot aware of your sitemap by going to http://www.bing.com/ping?sitemap= followed by the encoded URL. A valid request could look like:

http://www.bing.com/ping?sitemap=http%3A%2F%2Fwww.example.com/sitemap.xml
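
The URL-encoding step can be done with urllib.parse.quote from Python's standard library. The function name bing_ping_url is made up for this sketch, which only builds the URL string and does not send a request. With safe="" every reserved character is percent-encoded, so the slash before sitemap.xml becomes %2F as well, unlike in the example above; both forms decode to the same sitemap URL.

```python
from urllib.parse import quote

BING_PING_ENDPOINT = "http://www.bing.com/ping?sitemap="


def bing_ping_url(sitemap_url):
    """Return the ping URL with the sitemap URL fully percent-encoded."""
    # safe="" means no reserved characters (":" or "/") are left unencoded.
    return BING_PING_ENDPOINT + quote(sitemap_url, safe="")


print(bing_ping_url("http://www.example.com/sitemap.xml"))
```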

Bing will, like Google and others, look for and understand a Sitemap: line in your /robots.txt. You do not need to manually submit a sitemap if bingbot crawls your site on a regular basis.

Caveats

  • Having a sitemap with many pages does not necessarily mean that web crawlers will fetch all the pages listed in it. They may only fetch the first 100 or 1000, ordered either by their position in the list or by last-modification date.
  • Web crawlers do not see a sitemap as a list of the only pages they are allowed to crawl. The majority will happily crawl any and all pages linked to (or not linked to in some cases). You have to list pages or locations you do not want crawlers to crawl in /robots.txt. Leaving them out of your sitemap is not enough.
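
The second caveat can be illustrated with urllib.robotparser: a page that is missing from the sitemap is still allowed unless a Disallow rule covers it. The paths /private/ and /unlisted.html are hypothetical, and this only shows how a well-behaved crawler would interpret the rules, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /private/ is blocked, everything else is allowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Not in any sitemap, but no Disallow rule matches: crawling is allowed.
unlisted_allowed = rules.can_fetch("*", "https://example.com/unlisted.html")

# Covered by the Disallow rule: crawling is not allowed.
private_allowed = rules.can_fetch("*", "https://example.com/private/report.html")

print(unlisted_allowed, private_allowed)
```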

Graphical Sitemaps

YaCy can make pretty neat web structure graphs showing all links to and from a website. It cannot make a graph showing a website's internal structure.

Footnotes

  1. sitemaps.org: Sitemaps XML format
  2. support.google.com: Build and submit a sitemap

