Sitemap
A sitemap is a graphical or structured list of the pages available on a website. Search engine crawlers look for XML-formatted sitemaps and use them to crawl the websites they index efficiently. It is a good idea to check whether your website's content management system can create a sitemap and make it available. Sitemaps can also be human-readable lists or graphs used to plan or optimize a website's structure.
XML Sitemaps
Generating XML Sitemaps
An XML sitemap should consist of url nodes, each with a loc indicating the location (URL), a lastmod last-modified time-stamp, a crawl priority between 0 and 1 and optionally a changefreq with a value like weekly or monthly indicating how frequently the page is changed.
A url node can look like:
<url>
<loc>https://linuxreviews.org/AMD_graphics</loc>
<lastmod>2020-08-06T07:43:40Z</lastmod>
<priority>1.0</priority>
</url>
The url nodes go inside a urlset super-node. A <?xml header is also required. A complete sitemap.xml file could look like this:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://linuxreviews.org/AMD_graphics</loc>
<lastmod>2020-08-06T07:43:40Z</lastmod>
<priority>1.0</priority>
</url>
<url>
<loc>https://linuxreviews.org/AMP</loc>
<lastmod>2020-07-22T20:34:38Z</lastmod>
<priority>1.0</priority>
</url>
</urlset>
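Since a sitemap is just a urlset of url nodes, a file like the one above can be produced with a few lines of code. The following is a minimal sketch using Python's standard library, not the method any particular content management system uses; the function name build_sitemap and the (loc, lastmod, priority) tuple layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML as bytes for an iterable of (loc, lastmod, priority).

    Hypothetical helper: the tuple layout is an assumption for this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    # xml_declaration=True (Python 3.8+) emits the <?xml ...?> header
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)

pages = [
    ("https://linuxreviews.org/AMD_graphics", "2020-08-06T07:43:40Z", 1.0),
    ("https://linuxreviews.org/AMP", "2020-07-22T20:34:38Z", 1.0),
]
print(build_sitemap(pages).decode("utf-8"))
```

ElementTree escapes special characters in text values when it serializes, so the loc values can be passed in unescaped.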
Sitemap files need to be UTF-8 encoded (not UTF-16)[1], and five special characters in data values need to be replaced with escape codes:
Character | Symbol | Escape Code
---|---|---
Ampersand | & | &amp;
Single Quote | ' | &apos;
Double Quote | " | &quot;
Greater Than | > | &gt;
Less Than | < | &lt;
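If you assemble sitemap XML as plain strings, the escaping from the table can be applied with Python's xml.sax.saxutils.escape. This is a small sketch; the helper name escape_sitemap_value is made up for illustration.

```python
from xml.sax.saxutils import escape

# escape() replaces &, < and > on its own; the two quote characters
# have to be supplied as extra entities.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}

def escape_sitemap_value(value):
    """Escape the five special characters in a sitemap data value."""
    return escape(value, QUOTE_ENTITIES)

print(escape_sitemap_value("https://example.com/view?a=1&b=2"))
# https://example.com/view?a=1&amp;b=2
```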
A sitemap should not be larger than 50 MiB (uncompressed) and it should not contain more than 50,000 URLs. Gigantic sites can work around those limits by creating a sitemap index file that points to multiple sitemaps. Such a sitemap index file should have a sitemapindex super-node with sitemap nodes containing loc entries with sitemap URLs and lastmod time-stamps. A complete sitemap index file could look like:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap/importantpages.xml</loc>
<lastmod>2020-08-12T11:33:19Z</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap/boringpages.xml</loc>
<lastmod>2020-08-12T11:33:19Z</lastmod>
</sitemap>
</sitemapindex>
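Splitting a large URL list to stay under the 50,000 URL limit and writing the matching index can be sketched as follows. This is an illustration only: the helper names chunk_urls and build_index and the part0.xml, part1.xml, ... file naming are assumptions, and the sketch does not check the 50 MiB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit

def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into consecutive lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(sitemap_locs, lastmod):
    """Return sitemap index XML as bytes pointing at each sitemap URL."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)

# Hypothetical site with 120,000 pages, which needs three sitemap files.
urls = [f"https://example.com/page/{n}" for n in range(120_000)]
chunks = chunk_urls(urls)
locs = [f"https://example.com/sitemap/part{n}.xml" for n in range(len(chunks))]
index_xml = build_index(locs, "2020-08-12T11:33:19Z")
```

Each chunk would then be written out as its own sitemap file at the location listed in the index.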
Writing XML sitemap files manually is pretty futile. Your best option is to use a feature built into the content management system you are using, or a plug-in. There are several sitemap plug-ins available for WordPress. MediaWiki has a built-in tool available in maintenance/generateSitemap.php.
Making Search Engines Aware Of Your Sitemap
Web crawlers used by search engines and others will pick up any sitemap specified in /robots.txt. All you need is one line with Sitemap: and a URL:
Sitemap: https://linuxreviews.org/sitemap.xml
Multiple Sitemap: lines are supported by the robots.txt standard.
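To see how a crawler-side parser reads these lines, Python's standard urllib.robotparser can list every Sitemap: entry in a robots.txt file (site_maps() requires Python 3.8 or newer). This is a sketch: the robots.txt content is an inline example, and the second sitemap URL is made up to show that multiple lines are picked up.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; sitemap-extra.xml is a hypothetical second sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://linuxreviews.org/sitemap.xml
Sitemap: https://linuxreviews.org/sitemap-extra.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file has no Sitemap: lines at all.
sitemaps = parser.site_maps() or []
print(sitemaps)
```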
Google
Google has a special page you can use to submit sitemaps to their search engine. You can simply visit http://www.google.com/ping?sitemap= + the URL of your sitemap (as in http://www.google.com/ping?sitemap=https://example.com/sitemap.xml) to submit a sitemap.[2] Note that Google announced in 2023 that this ping endpoint would be retired, so the robots.txt method is the more dependable route.
You do not need to submit a newly added sitemap if you add its URL to robots.txt and GoogleBot crawls your site on a regular basis; it will see the sitemap and use it automatically.
Bing
Bing provides a special URL you can use to submit sitemaps. Their tool requires that the URL of the sitemap is URL-encoded, so you can't just submit a plain URL. You can make bingbot aware of your sitemap by going to http://www.bing.com/ping?sitemap= + the encoded URL. A valid request could look like:
http://www.bing.com/ping?sitemap=http%3A%2F%2Fwww.example.com/sitemap.xml
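The URL-encoding step can be done with urllib.parse.quote from Python's standard library. This sketch only builds the request string and does not send anything; the helper name ping_url is an assumption. Passing safe="" encodes every reserved character, so the slash before sitemap.xml becomes %2F as well.

```python
from urllib.parse import quote

def ping_url(endpoint, sitemap_url):
    """Append a fully URL-encoded sitemap URL to a ping endpoint."""
    # safe="" makes quote() percent-encode ':' and every '/' too
    return endpoint + quote(sitemap_url, safe="")

print(ping_url("http://www.bing.com/ping?sitemap=",
               "http://www.example.com/sitemap.xml"))
# http://www.bing.com/ping?sitemap=http%3A%2F%2Fwww.example.com%2Fsitemap.xml
```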
Bing will, like Google and others, look for and understand a Sitemap: line in your /robots.txt. You do not need to manually submit a sitemap if bingbot crawls your site on a regular basis.
Caveats
- Having a sitemap with many pages does not necessarily mean that web crawlers will fetch all the pages listed in it. They may fetch only the first 100 or 1000, ordered either by position in the list or by last-modification date.
- Web crawlers do not see a sitemap as a list of the pages they are allowed to crawl; the majority will happily crawl any and all pages linked to (or, in some cases, not linked to). You have to list pages or locations you do not want crawlers to crawl in /robots.txt. Leaving them out of your sitemap is not enough.
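Because exclusion is decided by /robots.txt rather than by the sitemap, it can be useful to check that a sitemap does not list pages that robots.txt disallows. The sketch below uses Python's urllib.robotparser with made-up example.com rules and URLs; it reflects how the standard-library parser reads the rules, which real crawlers may interpret somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules and sitemap entries for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITEMAP_URLS = [
    "https://example.com/articles/intro",
    "https://example.com/private/drafts",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# URLs that appear in the sitemap but are disallowed for all user agents
blocked = [url for url in SITEMAP_URLS if not parser.can_fetch("*", url)]
print(blocked)  # ['https://example.com/private/drafts']
```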
Graphical Sitemaps
YaCy can make pretty neat web structure graphs that show all links to and from a website. It cannot make a graph showing a website's internal structure.
Footnotes
- [1] sitemaps.org: Sitemaps XML format
- [2] support.google.com: Build and submit a sitemap