A sitemap is a graphical or structured list of the pages available on a website. Search engine crawlers look for and use XML-formatted sitemaps to efficiently crawl the websites they index. It is a good idea to check whether your website's content management system can create a sitemap and make it available. Sitemaps can also be human-readable lists or graphs used to plan or optimize a website's structure.
Generating XML Sitemaps
An XML sitemap consists of url nodes, each with a loc indicating the location (URL), a lastmod last-modified timestamp, a crawl priority between 0 and 1 and, optionally, a changefreq with a value like monthly indicating how frequently the page changes.
A url node can look like:

  <url>
    <loc>https://linuxreviews.org/AMD_graphics</loc>
    <lastmod>2020-08-06T07:43:40Z</lastmod>
    <priority>1.0</priority>
  </url>
url nodes go inside a urlset parent node. An <?xml header is also required. A complete sitemap.xml file could look like this:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://linuxreviews.org/AMD_graphics</loc>
      <lastmod>2020-08-06T07:43:40Z</lastmod>
      <priority>1.0</priority>
    </url>
    <url>
      <loc>https://linuxreviews.org/AMP</loc>
      <lastmod>2020-07-22T20:34:38Z</lastmod>
      <priority>1.0</priority>
    </url>
  </urlset>
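A sitemap like the one above can also be generated programmatically. Here is a minimal sketch using only Python's standard library; the page URLs, timestamps and priorities passed in are illustrative placeholders, not real pages:

```python
# Sketch: build a minimal sitemap.xml using only the standard library.
# The pages passed to build_sitemap() are made-up examples.
import xml.etree.ElementTree as ET

def build_sitemap(pages):
    """pages: iterable of (url, lastmod, priority) tuples."""
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
        ET.SubElement(url, "priority").text = priority
    # xml_declaration=True emits the required <?xml ...?> header (Python 3.8+).
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

doc = build_sitemap([
    ("https://example.com/page-one", "2020-08-06T07:43:40Z", "1.0"),
])
print(doc)
```

Using ElementTree rather than string concatenation also takes care of escaping any XML-reserved characters in the URLs.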
Sitemap files need to be UTF-8 encoded (not UTF-16), and five special characters in data values need to be replaced with XML escape codes:

  &  →  &amp;
  '  →  &apos;
  "  →  &quot;
  >  →  &gt;
  <  →  &lt;
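If you are assembling sitemap entries as strings rather than through an XML library, the standard-library escape() helper can apply these replacements for you. A small sketch (the URL is a made-up example):

```python
# Sketch: escape the XML-reserved characters in a URL before placing it
# inside a <loc> element. escape() handles &, < and > by default; the
# two quote characters must be passed in explicitly.
from xml.sax.saxutils import escape

def escape_loc(url):
    return escape(url, {'"': "&quot;", "'": "&apos;"})

print(escape_loc("https://example.com/search?q=a&b"))
# https://example.com/search?q=a&amp;b
```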
A sitemap should not be larger than 50 MiB (uncompressed) and it should not contain more than 50,000 URLs. Gigantic sites can overcome that limit by creating a sitemap index file that points to multiple sitemaps. Such a sitemap index file should have a sitemapindex parent node with sitemap nodes containing loc entries with sitemap URLs and lastmod timestamps. A complete sitemap index file could look like:
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>https://example.com/sitemap/importantpages.xml</loc>
      <lastmod>2020-08-12T11:33:19Z</lastmod>
    </sitemap>
    <sitemap>
      <loc>https://example.com/sitemap/boringpages.xml</loc>
      <lastmod>2020-08-12T11:33:19Z</lastmod>
    </sitemap>
  </sitemapindex>
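The splitting logic behind such an index is simple: chunk the full URL list into groups of at most 50,000 and emit one sitemap file per chunk. A rough sketch in Python; the sitemap-N.xml naming scheme and base URL are assumptions for illustration:

```python
# Sketch: split a large URL list into sitemap files of at most 50,000
# entries each, and derive the index entries that point at them.
# The file naming scheme (sitemap-1.xml, ...) is an illustrative choice.
MAX_URLS = 50_000

def chunk(urls, size=MAX_URLS):
    """Yield successive slices of at most `size` URLs."""
    for i in range(0, len(urls), size):
        yield urls[i:i + size]

def index_entries(urls, base="https://example.com/sitemap"):
    """One index entry per chunk of up to 50,000 URLs."""
    return [f"{base}/sitemap-{n}.xml"
            for n, _ in enumerate(chunk(urls), start=1)]

# 120,001 URLs need three sitemap files (50,000 + 50,000 + 20,001).
print(index_entries([f"https://example.com/p{i}" for i in range(120_001)]))
```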
Writing XML sitemap files manually is tedious and error-prone. Your best option is to use a feature built into the content management system you are using, or a plug-in. There are several sitemap plug-ins available for WordPress, and MediaWiki ships with a built-in maintenance script for generating sitemaps.
Making Search Engines Aware Of Your Sitemap
Web crawlers used by search engines and others will pick up any sitemap specified in /robots.txt. All you need is one line with Sitemap: followed by a URL:

  Sitemap: https://example.com/sitemap.xml

Sitemap: lines are a widely supported extension to the robots.txt standard.
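A crawler discovers these lines simply by scanning the robots.txt body for the Sitemap: directive. A rough sketch of that discovery step; the robots.txt content is a made-up example:

```python
# Sketch: extract Sitemap: lines from a robots.txt body, roughly the way
# a crawler discovers sitemaps. The robots.txt body is illustrative.
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

sitemaps = [line.split(":", 1)[1].strip()
            for line in robots_txt.splitlines()
            if line.lower().startswith("sitemap:")]
print(sitemaps)  # ['https://example.com/sitemap.xml']
```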
Google has a special page you can use to submit sitemaps to their search engine. Alternatively, you can simply visit http://www.google.com/ping?sitemap= followed by the URL to your sitemap (as in http://www.google.com/ping?sitemap=https://example.com/sitemap.xml) to submit a sitemap to that search engine.
You do not need to manually submit a newly added sitemap if you add its URL to robots.txt and GoogleBot is crawling your site on a regular basis; it will see it and use it automatically.
Bing provides a special URL you can use to submit sitemaps. Their tool requires that the URL to the sitemap is URL-encoded, so you can't just submit a plain URL. You can make bingbot aware of your sitemap by going to http://www.bing.com/ping?sitemap= followed by the URL-encoded sitemap URL. A valid request could look like:

  http://www.bing.com/ping?sitemap=https%3A%2F%2Fexample.com%2Fsitemap.xml
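The URL-encoding step is easy to get wrong by hand; Python's standard library can produce it. A small sketch (the sitemap URL is, again, an illustrative example):

```python
# Sketch: build the URL-encoded ping request for Bing. quote() with
# safe="" percent-encodes every reserved character, including ":" and "/".
from urllib.parse import quote

def bing_ping_url(sitemap_url):
    return "http://www.bing.com/ping?sitemap=" + quote(sitemap_url, safe="")

print(bing_ping_url("https://example.com/sitemap.xml"))
# http://www.bing.com/ping?sitemap=https%3A%2F%2Fexample.com%2Fsitemap.xml
```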
Bing will, like Google and others, look for and understand a Sitemap: line in your /robots.txt. You do not need to manually submit a sitemap if bingbot crawls your site on a regular basis.
- Having a sitemap with many pages does not necessarily mean that web crawlers will fetch all the pages listed in it. They may only fetch the first 100 or 1000, ordered either by the order they appear in the list or by last-modification date.
- Web crawlers do not treat a sitemap as a list of pages they are allowed to crawl; the majority will happily crawl any and all pages linked to (or, in some cases, not linked to). You have to list pages or locations you do not want crawled in /robots.txt; leaving them out of your sitemap is not enough.
YaCy can make pretty neat web structure graphs showing all links to and from a website. It cannot, however, make a graph showing a website's internal structure.