Web crawlers

From LinuxReviews

Web crawlers are computer programs that scrape web pages for information. Some do it to build and update search engine databases available to the general public; others do it to provide analysis and data to paying customers. Some web crawlers provide a benefit, others will only benefit your competition and may have a negative impact. Knowing what to block and what to allow isn't easy or straightforward. Here's a quick guide to some of the more common ones you may encounter.

"SEO", advertisement other "research" robots

These robots are used by closed services which are only available to paying customers.

The most well-known ones are AhrefsBot, BLEXBot, mj12bot and SemrushBot. They are run by different companies which all provide the same class of service: "research" and "analysis" for paying clients. Basically, you can register at these companies and pay them to tell you what web pages are on your website (along with other data). They may be useful if you are the one paying for and using them. These services are not at all useful if you're not one of their customers. Allowing them may actually hurt you, since many use them to set up sites with garbage content carrying keywords similar to those used on your site in order to gain search engine traffic.


AhrefsBot

AhrefsBot belongs to a company offering SEO analytics services to paying customers. There is no benefit in letting it waste your bandwidth unless you are willing to pay for their services - in which case you need to allow it so they can collect the data about your site.

User-agent: AhrefsBot
Disallow: /


Attentio

Attentio from Belgium describes itself as "a corporate intelligence service". Their bot used to be hostile and annoying. They are still around, but we have not seen their bot since 2010, so blocking them may not be very important. Their bot used to identify itself as Attentio/Nutch-0.9-dev.


Barkrowler

This crawler is used for a service called "Babbar" which describes itself as "SEO is made easier". It is a subscription service promising that "Thanks to Babbar's data and metrics you can uncover the strong and weak points of your sites and their competitors."

It uses the user-agent "Mozilla/5.0 (compatible; Barkrowler/0.9; +https://babbar.tech/crawler)"

This crawler is basically useless and should be blocked.


BLEXBot

BLEXBot is just like AhrefsBot: it gathers data for "SEO analysis" for paying customers. No benefit unless you pay for their services.

User-agent: BLEXBot
Disallow: /


Brandwatch (magpie-crawler)

This spider identifies itself as magpie-crawler/1.1 (U; Linux amd64; en-GB; +http://www.brandwatch.net) and it will fetch anything it thinks may be some kind of RSS feed. Its function is to notify big corporations when they are mentioned in an article. There's no benefit provided by it at all. That being said, it's also not very abusive, since it only grabs what it thinks may be some kind of feed. This will include /articles/howto-set-a-rss-feed/ and similar links that are not actually RSS feeds, and it will hammer those pages time and time again; it is not smart enough to realize that something that isn't actually an RSS feed isn't one.

Brandwatch does not appear to care about the Robots Exclusion Standard.

Clickagy Intelligence Bot

This bot, with the non-descriptive user-agent Clickagy Intelligence Bot v2, tends to randomly crawl single little-trafficked pages with Google AdSense advertisements on them right after such a page has been shown. It belongs to https://www.clickagy.com/ who describe themselves as "an audience intelligence platform, filtering the world's online behaviors in real-time". Whatever it is, it a) seems to use data tied to Google AdSense and b) is only interested in English content.


Cloudfind

Cloudfind, only identifying itself as Cloudfind/1.0, is a bot operated by cloudfindhq.com. Their story is that "We use AI and Machine Learning to help advertisers on an affiliate network find the best publishers to recruit". That essentially means that someone will send you lots of spam asking you to join some obscure "affiliate network" if this bot deems your site to be "interesting".


DotBot

DotBot is a bot used by a company called Moz. It's of no value unless you're one of Moz's paying customers. Moz sells products like "Moz tools" and API access as a product called "Mozscape API". None of these provide any benefit to non-customers. They do appear to follow the Robots Exclusion Standard, so you can ask them to kindly stay away:

User-agent: dotbot
Disallow: /


GrapeshotCrawler

If you are seeing a lot of hits from the bot GrapeshotCrawler then you are likely using Google AdSense to serve advertisements. This bot does not have any public benefit and it is not used by search engines; its function is to analyze pages where advertisements have been shown to determine whether those pages have "inappropriate" content. Set up a fresh page, put AdSense on it, and you'll see GrapeshotCrawler snooping around right after the first few advertisements on that page have been shown. Disallowing this bot may result in fewer advertisers bidding on AdSense advertisement spots on your sites. Thus: blocking it is a good idea if you do not use AdSense, since it has no other purpose - allowing it is a good idea if you do use AdSense.


"ias-va" or "ias-va/3.1 (+https://www.admantx.com/service-fetcher.html)" is a belongs to a company named ADmantX who operates a "ADmantX Semantic Analysis Service". It's basically a service advertisers can use to get a page rating telling them if they should place advertisements on it or not. ADmantX is beneficial if you are using some kind of web advertisement provider like Google AdSense to show advertisements on your website.

ADmantX does not care about robots.txt. This is less problematic because it doesn't show up unless you are showing advertisements from a company who works with ADmantX. You should not block ADmantX; if it shows up it's because you are showing advertisements from one of their clients. It is not a spider that goes around crawling pages willy nilly.


ltx71

The only information about this bot is a small statement at http://ltx71.com/ claiming that

"We continuously scan the internet for security research purposes. Our crawling is not malicious and only notes summary information for a page. If you would like to direct us to avoid crawling your site or portions of it please update your site robots.txt"


This bot should be seen as hostile. It claims to follow robots.txt but does not care if you use the claimed string User-agent: ltx71 to deny it. It operates out of Google's Cloud services so it's kind of hard to block. It crawls from several IP addresses.


MegaIndex

MegaIndex, operating from Hetzner Online GmbH IP space using the user-agent "Mozilla/5.0 (compatible; MegaIndex.ru/2.0; +http://megaindex.com/crawler)", is another "SEO back-link analysis" service. It is of absolutely no value unless you are using that service. It can safely be denied. Their crawler does fetch robots.txt but it doesn't care about instructions like Crawl-delay. It can be blocked by .htaccess since their crawler does identify itself.


"MauiBot (crawler.feedback+dc@gmail.com)" operates out of Amazon AWS. There is zero information as to the purpose of this bot.

MauiBot requests lots and lots of pages over time but is not very problematic; it appears to have a fixed 15 second delay between requests.

The fact that there is no information about any public service provided by it means that there is no benefit to allowing it to crawl your site.

MauiBot does not appear to care about robots.txt and must therefore be blocked either by .htaccess or iptables.
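Since MauiBot at least sends its distinctive user-agent string, a rewrite rule in the same style as the block list further down works; a minimal sketch (matching the agent case-insensitively):

```apache
RewriteEngine on
RewriteCond "%{HTTP_USER_AGENT}" "mauibot" [NC]
RewriteRule "^" "-" [F,L]
```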

One site owner reported that a few hours after first seeing MauiBot, the site experienced a number of login attempts from other Amazon AWS servers (on a site otherwise seeing relatively few login attempts via Amazon AWS servers, though this might be a coincidence).


MJ12bot

This bot is, like BLEXBot and DotBot, just for SEO analysis services provided to paying customers. It is a very aggressive bot of no value - unless you are buying Majestic12's services.

User-Agent: MJ12bot
Disallow: /


SemrushBot

SemrushBot is a really annoying bot which will crawl and crawl and then re-crawl your site permanently. It's for services provided to paying clients only. It is safe to block this bot if you are not interested in their "SEO analysis" services. It does follow the Robots Exclusion Standard and can be asked to leave with:

User-Agent: SemrushBot
Disallow: /

Useful Search Engine Bots

Commonly Blocked Bots

The following bots are useful and do provide a benefit, yet they frequently appear as blocked in robots.txt examples, Apache blacklists and similar files. Thus, a rundown of what they actually are may be useful.


360Spider

This robot indexes sites for the Chinese search engine 360, whose search website is https://www.so.com/

Their spider does obey the most basic robots.txt entries for User-agent: 360Spider but it can't seem to compute anything more advanced than a simple Disallow: /. It will not understand any (not really) "advanced" rules such as Disallow: /*?*title=Special:

Many choose to outright block 360Spider because their crawler is so dumb and there's little traffic to be had unless your website is in Chinese. It will obey the following:

User-agent: 360Spider
Disallow: /

This crawler does provide a public benefit and you may want to allow it if you can handle the load and bandwidth.


Linespider

Linespider is a good bot operated by the Japanese Line corporation (a Naver subsidiary).


Psbot

Psbot is a really dumb web crawler which follows the Robots Exclusion Standard. The bot does provide a public benefit by allowing people to find your site using the picture search engine http://www.picsearch.com/

Many webmasters choose to deny the bot access anyway because the crawler is just so incredibly stupid.

User-agent: psbot
Disallow: /

Like 360Spider, it does provide a public benefit and you may want to allow it if you don't mind it crawling around in circles, utterly confused.


YandexBot

Yandex is a Russian search engine, and it has become better known for being one compared to its early days when nobody knew what a Yandex was. Their robot YandexBot does provide data for their search engine, and a lot of people do use their service. Those people are mostly Russian, and this is why it may seem like there is very little benefit in having YandexBot waste your precious bandwidth: it will not send a lot of traffic unless your website has Russian content.

One historical reason why many decided to block YandexBot is that their bot was utterly stupid and easily confused in its early days. It would crawl around in circles on CMS systems with dynamic URLs. They pretty much fixed that a decade ago.

We really don't see any reason to block YandexBot. While it may not seem that useful, it is a bot used for a publicly available search engine. It is also worth noting that Yandex is probably the best search engine for images (Bing being second), and everyone who tries its image search and compares it to Google's is likely to return to Yandex for their image searching needs.


YaCy

This is the peer-to-peer search engine software YaCy. You can download it, install it and run it yourself. It's horrible, but it is the best p2p search engine software there is, due to the fact that it's the only p2p search engine software available. Its bot is, with its default settings, rather aggressive. It does obey simpler robots.txt rules.

Lesser-known useful bots


Cliqzbot

It's the Germans coming for your content so they can feed data to their locally self-developed web browser's search function. They do not provide any publicly available search engine. They do make a web browser with a built-in search function, and their crawler Mozilla/5.0 (compatible; Cliqzbot/2.0; +http://cliqz.com/company/cliqzbot) is used to build a database used by their back-end. Allowing it is probably fine unless you're still upset with them because of the war. Their website is at https://cliqz.com/en/


Coccocbot

This is a web crawler for a search engine in Vietnam. They do have a help page at http://help.coccoc.com/search-engine but it's all in Vietnamese.


PetalBot

PetalBot works for the Huawei subsidiary Petal Search. It identifies itself as PetalBot as part of a generic mobile phone user-agent:

"Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+http://aspiegel.com/petalbot)"

PetalBot is used for a search engine so it is useful. It respects the Robots Exclusion Standard.

Semantic Scholar

The Semantic Scholar bot looks for research papers. It's not very useful if you don't have any such content on your site, but it's also not problematic. It identifies itself as:

"Mozilla/5.0 (compatible) SemanticScholarBot (+https://www.semanticscholar.org/crawler)"

There is a free search engine at semanticscholar.org. That puts it in the "good" category, since it does provide a service that is useful (to researchers, specifically).


SeznamBot

This Czech web robot is used by Seznam.cz, a popular web portal in that country. It's harmless. It respects robots.txt and Crawl-delay.

Qwantify / qwant.com

The user-agent "Mozilla/5.0 (compatible; Qwantify/2.4w; +https://www.qwant.com/)/2.4w" is used by a international search engine operated by the French. They appear to follow the Robots Exclusion Standard. France's search engine promises to "respect your privacy". They do not appear to have a very large user-base but it is a publicly available search engine and allowing their crawler is a good idea. You can test France's efforts to get into search at https://www.qwant.com/

Well-known search-engines


Bingbot

Bingbot is the web crawler used by software giant Microsoft to feed data to their "Bing" search engine. There is quite a lot of not-actually-bingbot crawling around.

PTR records on actual bingbot IPs will be similar to msnbot-207-46-13-118.search.msn.com.

Many not-actually-bingbot spiders operate out of Microsoft IP space thanks to their cloud hosting efforts. It must be noted that Microsoft does not care if there are pages upon pages confirming that an IP in their space is doing all kinds of attacks using the bingbot user-agent.


Googlebot

Google is a well-known front for US intelligence services with a heavily censored and manipulated search engine. They use a bot which identifies itself as googlebot to crawl for that search engine.

Google's actual googlebot operates from IPs with a PTR record such as crawl-66-249-64-119.googlebot.com

There is a lot of not-actually-googlebot traffic using that user-agent crawling the web. Outfits like Internet Vikings crawl from a wide range of IPs using their agent. Real googlebot will always have an identifying PTR record; impostors using the agent do not.
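The PTR checks described above for bingbot and googlebot can be automated with forward-confirmed reverse DNS. A minimal Python sketch (the function names and domain suffixes here are our own illustration; the second function obviously needs working DNS to run):

```python
import socket

# Domains a genuine googlebot/bingbot PTR record should end in.
GENUINE_SUFFIXES = (".googlebot.com", ".search.msn.com")

def ptr_looks_genuine(hostname: str) -> bool:
    """Pure string check: is the PTR hostname inside a real crawler's domain?"""
    return hostname.rstrip(".").lower().endswith(GENUINE_SUFFIXES)

def verify_crawler_ip(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False  # no PTR record at all: impostor
    if not ptr_looks_genuine(hostname):
        return False  # PTR points outside the crawler's domain
    try:
        # the PTR hostname must resolve forward to the original IP
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False  # forged PTR that does not resolve forward
```

The forward lookup at the end matters: anyone can set an arbitrary PTR record on their own IP space, but they cannot make googlebot.com resolve back to their address.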

User-Agent Faking Web Crawlers

These user-agent-faking bots are actually among the most common web crawlers in your logs.

Blazing SEO

Blazing SEO is a company which offers "SEO services". They own a whole range of IPs and operate from a ton of subnets. A whole lot of HTTP POST requests trying to add link-spam come from their IP range. Trash from that origin uses user-agents like Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36 - but it's not actually browsers coming out of their range, it's spam bots. A list of some of their IPs can be found in the separate article Blazing SEO.


Insight

Insight likes to crawl groups of pages with a spider that uses a random user-agent for each request.


Internet Vikings

This outfit out of Sweden claims to be in the business of "Supporting your SEO strategy". They crawl sites using googlebot's user-agent.


Some of their IPs list abuse@ettnet.se as contact. That domain redirects to internetvikings.com. They are also known as "Internetbolaget Sweden AB". One of their web pages claims "We have about 160 000 IPs in 35 A-classes, 70 B-classes and 1200 C-classes", so blocking this outfit may be somewhat problematic.

It's not that strange that they would be faking Google-bot as user-agent given that they are in the business of link-spam.

This company is closely connected to the company running a crawler called "domaincrawler" using the user-agent "(info@domaincrawler.com; http://www.domaincrawler.com/<yourdomainhere>)".

Quality Network Corp

Quality Network Corp is another front for the criminal outfit Cyber World Internet Services, which Spamhaus describes as a "Spam host". They crawl using a range of commonly used browser user-agents such as Firefox/45.0, Chrome/43 etc. They operate from a rather huge range of IPs.

The list of hostile Quality Network Corp IPs is so long we've moved it to a separate page.

Symantec Corporation

Symantec is historically known for anti-virus solutions. They currently describe themselves as making "security products and solutions to protect small, medium, and enterprise businesses from advanced threats, malware, and other cyber attacks."

Symantec has a web crawler network which uses web browser agents and looks somewhat like a lazy botnet. Their crawler's agent looks somewhat like:

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 OPR/60.0.3255.170"

They fetch pages but stand out as not being browsers by only fetching the page - not CSS, images or anything else related to it (like a real browser would).

Symantec is using a wide range of IPs and subnets from different providers, thus resembling a botnet. It is not very apparent that the IPs they use are connected to their AS, AS27471.

The purpose of their low-volume crawling is either to gather data on compromised hosts or raise demand for "security products" which protect smaller web hosts against the kind of cyber attacks Symantec is engaging in. A few fake requests per hour will obviously not knock your web server over (you have bigger problems if it does). However, the way they crawl with fake user-agents pretending to be browsers does put Symantec in the same category as spammers and other hostile actors.
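That fetch pattern - pages but never the assets a browser would load - can be spotted mechanically in access logs. The sketch below is our own illustration (it assumes Apache's combined log format and uses an arbitrary three-page threshold) and flags IPs that requested several pages but never a single CSS file or image:

```python
from collections import defaultdict

# File types a real browser would request alongside the page itself.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico", ".svg")

def suspicious_ips(log_lines, min_pages=3):
    """IPs that fetched several pages but never a single page asset."""
    page_hits = defaultdict(int)
    asset_fetchers = set()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # not a combined-format line
        ip = parts[0].split()[0]
        request = parts[1].split()
        if len(request) < 2:
            continue  # malformed request line
        path = request[1].split("?")[0].lower()
        if path.endswith(ASSET_EXTENSIONS):
            asset_fetchers.add(ip)
        else:
            page_hits[ip] += 1
    return {ip for ip, n in page_hits.items()
            if n >= min_pages and ip not in asset_fetchers}
```

This heuristic will also flag legitimate feed readers and API clients, so treat its output as a list of candidates to inspect, not an automatic blocklist.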

Taking Out The Trash

robots.txt: For Worthless But Conforming Bots

The bots that are absolutely worthless and a complete waste of bandwidth yet follow the Robots Exclusion Standard can be asked to kindly go away with a robots.txt file.

User-agent: AhrefsBot
Disallow: /

User-agent: Cloudfind
Disallow: /

User-agent: dotbot
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-Agent: MJ12bot
Disallow: /
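Before deploying a robots.txt like the one above, you can sanity-check that the rules do what you intend with Python's standard urllib.robotparser (the two-entry file here is a shortened stand-in for the full list):

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the robots.txt above, inlined for testing.
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Listed bots are denied everything; agents with no entry fall
# through to "allowed".
print(parser.can_fetch("AhrefsBot", "/index.html"))  # False
print(parser.can_fetch("Googlebot", "/index.html"))  # True
```

Remember that this only verifies your syntax; whether a given bot actually honors the file is a separate question, as several entries above show.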

htaccess: For Less Friendly Yet Identifiable Bots

RewriteEngine on
# Bad user-agents
RewriteCond  "%{HTTP_USER_AGENT}" "attentio" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "barkrowler" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "brandwatch" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "cloudfind" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "clickagy" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "domaincrawler" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "dotbot" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "gecko/20060728" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "ltx71" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "megaindex" [OR,NC]
# Bots Using Fake User-Agents
RewriteCond  "%{HTTP_USER_AGENT}" "msie3" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "msie 3" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "msie5" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "msie 5.5" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "msie 6.0" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" "mozilla/4.76" [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" megaindex\.com [OR,NC]
# Pure Trash Requests (brute force attacks, etc)
RewriteCond %{QUERY_STRING} union\+all\+select [OR,NC]
RewriteCond  "%{HTTP_USER_AGENT}" sqlmap [NC]
# Deny
RewriteRule   "^"  "-"  [F,L]

You may also want these rules if you are not using the WordPress content management system:

RewriteEngine on
# WordPress URLs You Will See Bots Scan For
RewriteRule (wlwmanifest.xml) - [F,L]
RewriteRule (sellers.json) - [NC,F,L]
RewriteRule (wp-content) - [NC,F,L]
RewriteRule (wp-login.php) - [NC,F,L]
RewriteRule (wp-admin) - [NC,F,L]


Most of the hostile web crawlers will pretend to be some kind of web browser. This makes it very hard to block them by user-agent. Sometimes you'll just have to make some firewall rules and drop the most offending ranges.
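When an operator crawls from whole subnets with faked browser agents, matching individual IPs is futile; work with the announced ranges instead. A small sketch using Python's standard ipaddress module (the CIDRs below are documentation ranges, not real offenders - substitute the ranges you see in your own logs):

```python
import ipaddress

# Hypothetical blocklist; replace with the actual offending ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def is_blocked(ip: str) -> bool:
    """True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)
```

The same ranges translate directly into firewall rules, e.g. iptables -A INPUT -s 198.51.100.0/24 -j DROP.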



Comments

Barkrowler operator, 13 months ago:

Hi, I operate Barkrowler for Babbar.Tech. It is indeed used to crawl the public web and collect netlinking data, as well as to perform a semantic analysis of web page content. Unlike what you are asserting, we do obey robots.txt rules. Also, we offer free access to anyone (you just need to create an account). As a consequence, I find your assertion "This crawler is basically useless and should be blocked." quite unfair. We are a very young company - the service has only been publicly open since Nov 2020 - and while we may not drive much traffic or directly benefit public websites, remember that there is nothing free in this world: your site depends on ads, and many businesses depend on organic traffic, which is largely driven by netlinking and SEO practices, so blocking reasonable crawlers is as harmful as generally blocking ads. One should block a crawler or a bot only if it is harming the website by overcrawling. We have spent quite some time making the crawler "polite" by limiting the crawl rate per IP and per host.


Reply, 13 months ago:

The heck does it DEPEND on ads. Stop being a liar, Guillaume!


Reply, 13 months ago:

Your "freemium" service is pretty useless, let me tell you.

Anonymous (80e527c48d), 4 months ago:

Your Babbar crawler just hit one of my sites for almost 1 TB of bandwidth by making over 1,000 repeated requests for the same large image file. I am blocking it, of course, but I want you to see the direct result of your horrible programming.

IP details: Hostname: crawl-prod-16.babbar.eu, ASN: 12876, ISP/Organization: Dedibox SAS, Location: Haarlem, North Holland, Netherlands. Number of requests: 1079.

Anonymous (bbe89ccab1), 5 months ago:

Babbar is abusive. Babbar is useless. Babbar has been blocked.

