Web crawlers


Web crawlers are computer programs that scrape web pages for information. Some do it to build and update search engine databases which can be used by the general public, others do it to provide analysis and data to paying customers. Some web crawlers provide a benefit, others will only benefit your competition and potentially have a negative impact. Knowing what to block and what to allow isn't that easy or straightforward. Here's a quick guide to some of the more common ones you may encounter.

"SEO", advertisement other "research" robots[edit | edit source]

These robots are used by closed services which are only available to paying customers.

The most well-known ones are AhrefsBot, BLEXBot, mj12bot and SemrushBot. They are all run by different companies who all provide the same class of service: "research" and "analysis" for paying clients. Basically, you can register at these companies and pay them to tell you what web pages are on your website (along with other data). They may be useful if you are the one paying for and using them. These services are not at all useful if you're not one of their customers. Allowing them may actually hurt you, since many use them to set up sites with garbage content carrying keywords similar to those used on your site in order to gain search engine traffic.
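All four claim to follow the Robots Exclusion Standard, and robots.txt allows several User-agent lines to share one rule block, so they can be turned away in one place (the individual entries are repeated in the sections below):

User-agent: AhrefsBot
User-agent: BLEXBot
User-agent: MJ12bot
User-agent: SemrushBot
Disallow: /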

AhrefsBot

This bot belongs to a company offering SEO analytics services to paying customers. There is no benefit in having it waste your bandwidth unless you are willing to pay for their services - in which case you need to allow it to get the data they collect about your site.

User-agent: AhrefsBot
Disallow: /

Attentio

Attentio from Belgium describes itself as "a corporate intelligence service". Their bot used to be hostile and annoying. They are still around, but we have not seen their bot since 2010, so blocking them may not be very important. Their bot used to identify itself as Attentio/Nutch-0.9-dev.

BLEXBot

BLEXBot is just like AhrefsBot: it gathers data for "SEO analysis" for paying customers. There is no benefit unless you pay for their services.

User-agent: BLEXBot
Disallow: /

Brandwatch

This spider identifies itself as magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net) and it will fetch anything it thinks may or may not be some kind of RSS feed. Its function is to notify big corporations when they are mentioned in an article. There's no benefit provided by it at all. That being said, it's also not very abusive since it only grabs what it thinks may be some kind of feed. This will include /articles/howto-set-a-rss-feed/ and similar links that are not actually RSS feeds, and it will hammer those pages time and time again; it is not smart enough to realize that something which isn't actually an RSS feed never will be one.
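If you want to try asking it to leave, the matching robots.txt entry would look like this - whether magpie-crawler actually honors it is not something we have verified:

User-agent: magpie-crawler
Disallow: /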

Clickagy Intelligence Bot

This bot, with the non-descriptive user-agent Clickagy Intelligence Bot v2, tends to randomly crawl single little-trafficked pages with Google Adsense advertisements on them right after such a page has been shown. It belongs to https://www.clickagy.com/, which describes itself as "an audience intelligence platform, filtering the world's online behaviors in real-time". Whatever it is, it a) seems to use data tied to Google Adsense and b) is only interested in English content.

DotBot

DotBot is a bot used by a company called Moz. It's of no value unless you're one of Moz's paying customers.

User-agent: dotbot
Disallow: /

Grapeshot

If you are seeing a lot of hits from the bot GrapeshotCrawler out of 148.64.56.0/24 then you are likely using Google Adsense to serve advertisements. This bot does not have any public benefit and it is not used by search engines; its function is to analyze pages where advertisements have been shown to determine whether those pages have "inappropriate" content. Set up a fresh page, put Adsense on it, and you'll see GrapeshotCrawler snooping around right after the first few advertisements on that page have been shown. Disallowing this bot may result in fewer advertisers bidding on Adsense advertisement spots on your sites. Thus, blocking it is a good idea if you do not use Adsense, since it has no other purpose - allowing it is a good idea if you do use Adsense.
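Should you decide to block it, a robots.txt entry along these lines should do it - assuming the crawler honors the standard, which we have not verified:

User-agent: GrapeshotCrawler
Disallow: /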

LTX71

The only information about this bot is a small statement at http://ltx71.com/ claiming that

"We continuously scan the internet for security research purposes. Our crawling is not malicious and only notes summary information for a page. If you would like to direct us to avoid crawling your site or portions of it please update your site robots.txt"

ltx71.com

This bot should be seen as hostile. It claims to follow robots.txt but does not care if you use the claimed string User-agent: ltx71 to deny it. It operates out of Google's cloud services, so it's kind of hard to block. It uses IP 35.202.2.1 - among others.
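Since robots.txt is ignored, a firewall rule is what remains. A sketch in the style of the iptables rules further down this page, covering only the one IP we have observed (they use others):

$IPTABLES -I INPUT -i $BLACKLISTIF -s 35.202.2.1 -j DROP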

MegaIndex.ru

MegaIndex, operating from Hetzner Online GmbH IP space using the user-agent Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler), is another "SEO back-link analysis" service. It is of absolutely no value unless you are using that service. It can safely be denied. Their crawler does fetch robots.txt but it doesn't care about instructions like Crawl-delay. It can be blocked by .htaccess since their crawler does identify itself.
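A minimal .htaccess sketch, assuming Apache with mod_rewrite enabled; it matches the MegaIndex token in the user-agent and answers 403 Forbidden:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} MegaIndex [NC]
RewriteRule .* - [F,L]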

mj12bot

This bot is, like BLEXBot and DotBot, just for SEO analysis services provided to paying customers. It is a very aggressive bot of no value - unless you are buying Majestic12's services.

User-Agent: MJ12bot
Disallow: /

SemrushBot

SemrushBot is a really annoying bot which will crawl and crawl and then re-crawl your site permanently. It's for services provided to paying clients only. It is safe to block this bot if you are not interested in their "SEO analysis" services. It does follow the Robots Exclusion Standard and can be asked to leave with:

User-Agent: SemrushBot
Disallow: /

Useful Search Engine Bots

Commonly Blocked Bots

The following bots are useful and do provide a benefit, yet they frequently appear as blocked in robots.txt examples, Apache blacklists and similar files. Thus, a rundown of what they actually are may be useful.

360Spider

This robot indexes sites for the Chinese search engine 360, whose search website is https://www.so.com/

Their spider does obey the most basic robots.txt entries for User-agent: 360Spider but it can't seem to compute anything more advanced than a simple Disallow: /. It will not understand any (not really) "advanced" rules such as Disallow: /*?*title=Special:
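To illustrate, a MediaWiki-flavored robots.txt often carries wildcard rules of this shape (the action=edit line is our own illustrative example); 360Spider will skip right past both:

User-agent: *
Disallow: /*?*title=Special:
Disallow: /*?*action=edit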

Many choose to outright block 360Spider because their crawler is so stupid and there's little traffic to be had unless your website is in Chinese. It will obey the following:

User-agent: 360Spider
Disallow: /

This crawler does provide a public benefit and you may want to allow it if you can handle the load and bandwidth.

Psbot

Psbot is a really dumb web crawler which follows the Robots Exclusion Standard. The bot does provide a public benefit by allowing people to find your site using the picture search engine http://www.picsearch.com/

Many webmasters choose to deny the bot access anyway because the crawler is just so incredibly stupid.

User-agent: psbot
Disallow: /

Like 360Spider, it does provide a public benefit and you may want to allow it if you don't mind it crawling around in circles, utterly confused.

YandexBot

Yandex is a Russian search engine, and it has become better known for being one compared to its early days when nobody knew what a Yandex was. Their robot YandexBot provides data for their search engine, and a lot of people do use their service. Those people are mostly Russian, which is why it may seem like there is very little benefit in having YandexBot waste your precious bandwidth: it will not send a lot of traffic unless your website has Russian content.

One historical reason why many decided to block YandexBot is that their bot was utterly stupid and easily confused in its early days. It would crawl around in circles on CMS systems with dynamic URLs. They pretty much fixed that a decade ago.

We really don't see any reason to block YandexBot. While it may not seem that useful, it is a bot which is used for a publicly available search engine. It is also worth noting that Yandex is probably the best search engine for images (Bing being second), and everyone who tries its image search and compares it to Google's is likely to go back to Yandex for their image searching needs.

yacybot

This is the peer-to-peer search engine software YaCy. You can download it, install it and run it yourself. It's horrible, but it is the best p2p search engine software there is due to the fact that it's the only p2p search engine software available. Its bot is, with its default settings, rather aggressive. It does obey simpler robots.txt rules.
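If you'd rather not feed the peer-to-peer swarm, a plain entry like this is within what it understands:

User-agent: yacybot
Disallow: /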

Lesser-known useful bots

Cliqzbot

It's the Germans coming for your content so they can feed data to the search function of their locally developed web browser. They do not provide any publicly available search engine. They do make a web browser with a built-in search function, and their crawler Mozilla/5.0 (compatible; Cliqzbot/2.0; +http://cliqz.com/company/cliqzbot) is used to build the database used by their back-end. Allowing it is probably fine unless you're still upset with them because of the war. Their website is at https://cliqz.com/en/

coccocbot

This is a web crawler for a search engine in Vietnam. They do have a help-page at http://help.coccoc.com/search-engine but it's all in Vietnamese.

seznambot

A Czech web robot used by Seznam.cz, a popular web portal in that country. It's harmless. It respects robots.txt and Crawl-delay.
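If it comes by more often than you'd like, a Crawl-delay entry is honored; the ten-second value here is just an example:

User-agent: SeznamBot
Crawl-delay: 10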

Qwantify / qwant.com

The user-agent "Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/)/2.4w" is used by an international search engine operated by the French. They appear to follow the Robots Exclusion Standard. France's search engine promises to "respect your privacy". They do not appear to have a very large user-base, but it is a publicly available search engine and allowing their crawler is a good idea. You can test France's efforts to get into search at https://www.qwant.com/

Well-known search engines

bingbot

Bingbot is the web crawler software giant Microsoft uses to feed data to their "Bing" search engine. However, quite a lot of what claims to be bingbot is not actually bingbot.

PTR records on actual bingbot will be similar to msnbot-207-46-13-118.search.msn.com.

Many not-actually-bingbot spiders, like the not-bing 13.66.139.0, operate out of Microsoft IP space thanks to their cloud hosting business. It must be noted that Microsoft does not care that there are pages upon pages confirming that IPs in their space, like 13.66.139.0, are doing all kinds of attacks using the bingbot user-agent.

googlebot

Google is a well-known front for US intelligence services with a heavily censored and manipulated search engine. They use a bot which identifies itself as googlebot to crawl for that search engine.

Google's actual googlebot operates from IPs with a PTR record such as crawl-66-249-64-119.googlebot.com

There is a lot of not-actually-googlebot traffic using that user-agent crawling the web. Outfits like Internet Vikings crawl from a wide range of IPs using its agent. Real googlebot will always have an identifying PTR record; impostors using the agent do not. A verification sketch follows.
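Both Microsoft and Google document that a claimed crawler hit can be verified with a forward-confirmed reverse DNS lookup: resolve the IP to its PTR name, check the domain, then resolve that name back and compare. A minimal PHP sketch; the function name and suffix lists below are our own illustration, not any official API:

<?php
// Forward-confirmed reverse DNS check for a claimed bingbot/googlebot hit.
function is_real_crawler($ip, $suffixes) {
        $host = gethostbyaddr($ip); // PTR lookup; returns the IP unchanged on failure
        if ($host === false || $host === $ip) {
                return false; // no PTR record at all - not the real bot
        }
        foreach ($suffixes as $suffix) {
                // The PTR name must end in an expected domain...
                if (substr($host, -strlen($suffix)) === $suffix) {
                        // ...and must resolve back to the very same IP
                        return gethostbyname($host) === $ip;
                }
        }
        return false;
}

// PTR examples from this page: msnbot-207-46-13-118.search.msn.com
// and crawl-66-249-64-119.googlebot.com
var_dump(is_real_crawler('207.46.13.118', array('.search.msn.com')));
var_dump(is_real_crawler('66.249.64.119', array('.googlebot.com')));
?>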

Mysterious robots

These mysteries remain unsolved.

MauiBot

"MauiBot (crawler.feedback+dc@gmail.com)" operates out of Amazon AWS. It requests robots.txt and appears to respect it. There is zero information as to the purpose of this bot.

MauiBot requests lots and lots of pages over time but is not very problematic; it appears to have a fixed 15 second delay between requests.

The fact that there is no information about any public service provided by it means that there is no benefit to allowing it to crawl your site.

MauiBot does not appear to care about robots.txt and must therefore be blocked either by .htaccess or iptables.

User-Agent Faking Web Crawlers

These are actually among the most common web crawlers in your logs.

Blazing SEO

Blazing SEO is a company which offers "SEO services". They own a whole range of IPs and operate from a ton of subnets. A whole lot of HTTP POST requests trying to add link-spam come from their IP range. Trash from that origin uses user-agents like Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 - but it's not actually browsers coming out of their range, it's spam bots. A list of some of their IPs can be found in the separate article Blazing SEO.

insight.com

Insight likes to crawl a group of pages with a spider originating from 198.187.200.0/24, using a random user-agent for each request; a matching iptables line follows.
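Since the user-agent changes on every request, the subnet itself is the thing to block. A sketch in the same iptables style as the Internet Vikings rules below:

$IPTABLES -I INPUT -i $BLACKLISTIF -s 198.187.200.0/24 -j DROP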

internetvikings.com

This outfit out of Sweden claims to be in the business "Supporting your SEO strategy". They crawl sites using googlebot's user-agent.

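# The rules below assume $IPTABLES points at the iptables binary and
# $BLACKLISTIF at the internet-facing interface, for example:
IPTABLES=/sbin/iptables
BLACKLISTIF=eth0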
$IPTABLES -I INPUT -i $BLACKLISTIF -s 151.252.28.0/24 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 192.121.156.0/22 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 192.121.71.0/24 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 192.71.54.0/23 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 193.183.100.0/22 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 193.235.238.0/23 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 194.71.208.0/22 -j DROP
$IPTABLES -I INPUT -i $BLACKLISTIF -s 5.133.211.0/24 -j DROP

Some of their IPs list abuse@ettnet.se as contact. That domain redirects to internetvikings.com. They are also known as "Internetbolaget Sweden AB". One of their web pages claims "We have about 160 000 IPs in 35 A-classes, 70 B-classes and 1200 C-classes", so blocking this outfit may be somewhat problematic.

It's not that strange that they would be faking googlebot's user-agent given that they are in the business of link-spam.

Quality Network Corp

Quality Network Corp is another front for the criminal outfit Cyber World Internet Services, which Spamhaus describes as a "Spam host". They crawl using a range of commonly used browser user-agents such as Firefox/45.0, Chrome/43 and so on. They operate from a rather huge range of IPs.

The list of hostile Quality Network Corp IPs is so long we've moved it to a separate page.

Symantec Corporation

Symantec is historically known for anti-virus solutions. They currently describe themselves as making "security products and solutions to protect small, medium, and enterprise businesses from advanced threats, malware, and other cyber attacks."

Symantec has a web crawler network which uses web browser agents and looks somewhat like a lazy botnet. Their crawler's user-agent looks somewhat like:

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 OPR/60.0.3255.170"

They fetch pages but stand out as not being browsers by only fetching the page - not CSS, images or anything else related to it (like a real browser would).

Symantec is using a wide range of IPs and subnets from different providers thus resembling a botnet. It is not very apparent that the IPs they use are connected to their AS AS27471.

The purpose of their low-volume crawling is either to gather data on compromised hosts or raise demand for "security products" which protect smaller web hosts against the kind of cyber attacks Symantec is engaging in. A few fake requests per hour will obviously not knock your web server over (you have bigger problems if it does). However, the way they crawl with fake user-agents pretending to be browsers does put Symantec in the same category as spammers and other hostile actors.

Killing Garbage

Most of the hostile web crawlers will pretend to be some kind of web browser. This makes it very hard to block them by user-agent. However, it is quite possible to block those using some wildly outdated user-agent string. This can obviously make real visitors victims - but it is not that likely that anyone real is still using the most popular fake user-agent: Microsoft Internet Explorer 6.

We'll just leave the following PHP snip here for your consideration:

<?php
// Deny requests whose user-agent is overwhelmingly likely to be fake:
// ancient browsers, known SEO crawlers and obvious bot signatures.
if (isset($_SERVER['HTTP_USER_AGENT'])){
        $ua = strtolower($_SERVER['HTTP_USER_AGENT']);
} else {
        $ua = 'none';
}

// Set this from your own referer-spam checks; it is initialized to
// false here so the snippet is self-contained.
$isRefererSpam = false;

$isCrap = strpos($ua, 'windows 98') !== false
        || strpos($ua, 'windows 95') !== false
        || strpos($ua, 'msie 7.0b') !== false
        || strpos($ua, 'nbertaupete95') !== false
        || strpos($ua, 'firefox/40.1') !== false
        || strpos($ua, 'firefox/3.6b4') !== false
        || strpos($ua, 'gecko/20091124') !== false
        || strpos($ua, 'msie 2.0') !== false
        || strpos($ua, 'msie 3.0') !== false
        || strpos($ua, 'msie 4.0') !== false
        || strpos($ua, 'msie 5') !== false
        || strpos($ua, 'msie 6') !== false
        || strpos($ua, 'chrome 24') !== false
        || strpos($ua, 'firefox/17') !== false
        || strpos($ua, 'firefox/21') !== false
        || strpos($ua, 'python-requests/2') !== false
        || strpos($ua, 'garlikcrawler') !== false
        || strpos($ua, 'user-agent: mozilla') !== false
        || strpos($ua, 'jdatabasedrivermysql') !== false
        || strpos($ua, 'os x x.y; rv:42') !== false
        || strpos($ua, 'apache-httpclient/4') !== false
        || strpos($ua, 'mj12bot') !== false
        || strpos($ua, 'webmeup-crawler') !== false
        || strpos($ua, 'semrush.com/bot') !== false
        || strpos($ua, 'ahrefs.com/robot') !== false
        || strpos($ua, 'iron/2') !== false;

if ($isCrap || $isRefererSpam){
        define("QUICK_CACHE_ALLOWED", false); // stop the Quick Cache plugin from caching this response
        http_response_code(402); // actually send the status the die() text claims
        header('X-Denied: Yes');
        if ($isCrap) header('X-BadUserAgent: '.$ua);
        if ($isRefererSpam) header('X-RefererSpammy: Yes');
        header('X-Rick-Would-Never: Let you down');
        die('402 Payment Required');
} ?>