PHP Web Crawler List

From LinuxReviews

This is a list of web crawlers, in no particular order, for use in anything written in PHP. It is not concerned with which bots are good or bad; it is simply a list of User-Agent snippets that let you separate web crawlers from humans visiting a website with a web browser. Each snippet is matched as a case-insensitive substring of the User-Agent header.

if (isset($_SERVER['HTTP_USER_AGENT'])) {
    // Lowercase the User-Agent once so every snippet below can be written in lowercase.
    $useragent = strtolower($_SERVER['HTTP_USER_AGENT']);
    $isBot = strpos($useragent, 'bing.com/bingbot') !== false
        || strpos($useragent, 'google.com/bot') !== false
        || strpos($useragent, 'googlebot') !== false
        || strpos($useragent, 'semrush.com/bot') !== false
        || strpos($useragent, 'gigabot') !== false
        || strpos($useragent, 'seznambot') !== false
        || strpos($useragent, 'yandex.com/bots') !== false
        || strpos($useragent, 'exabot.com/go') !== false
        || strpos($useragent, 'opensiteexplorer.org/dotbot') !== false
        || strpos($useragent, 'applebot') !== false
        || strpos($useragent, 'ahrefs.com/robot') !== false
        || strpos($useragent, 'facebookexternalhit') !== false
        || strpos($useragent, 'pinterestbot') !== false
        || strpos($useragent, 'webmeup-crawler') !== false
        || strpos($useragent, 'mojeek.com/bot') !== false
        || strpos($useragent, 'search/spider.html') !== false
        || strpos($useragent, 'flipboard.com/browserproxy') !== false
        || strpos($useragent, 'support.paper.li') !== false
        || strpos($useragent, 'sogou.com/docs') !== false
        || strpos($useragent, '360spider') !== false
        || strpos($useragent, 'naver.me/spd') !== false
        || strpos($useragent, 'info@domaincrawler') !== false
        || strpos($useragent, 'bytespider') !== false
        || strpos($useragent, 'yacybot') !== false
        || strpos($useragent, 'datanyze') !== false
        || strpos($useragent, 'garlikcrawler') !== false
        || strpos($useragent, 'megaindex.com/crawler') !== false
        || strpos($useragent, 'qwantify/') !== false
        || strpos($useragent, 'metajob.de/crawler') !== false
        || strpos($useragent, 'duckduckbot') !== false
        || strpos($useragent, 'commoncrawl.org/faq') !== false
        || strpos($useragent, 'barkrowler') !== false
        || strpos($useragent, 'help.coccoc.com/searchengine') !== false
        || strpos($useragent, 'linkpadbot') !== false
        || strpos($useragent, 'yisouspider') !== false
        || strpos($useragent, 'hypefactors.com/media-monitoring') !== false
        || strpos($useragent, 'mastodon/') !== false
        || strpos($useragent, 'friendica') !== false
        || strpos($useragent, 'pleroma') !== false
        || strpos($useragent, 'gabsocial') !== false
        || strpos($useragent, 'ia_archiver') !== false
        || strpos($useragent, 'daum.net') !== false
        || strpos($useragent, 'linespider') !== false
        || strpos($useragent, 'petalbot') !== false
        || strpos($useragent, 'grapeshotcrawler') !== false
        || strpos($useragent, 'proximic') !== false
        || strpos($useragent, 'twitterbot') !== false
        || strpos($useragent, 'admantx') !== false
        || strpos($useragent, 'smtbot') !== false
        || strpos($useragent, 'lounge') !== false
        || strpos($useragent, 'discordbot') !== false
        || strpos($useragent, 'semanticscholar') !== false
        || strpos($useragent, 'awario.com/bots') !== false
        || strpos($useragent, 'ltx71') !== false
        || strpos($useragent, 'dispatch') !== false
        || strpos($useragent, 'okhttp/') !== false
        || strpos($useragent, 'jetsli.de/crawler') !== false
        || strpos($useragent, 'nimbostratus') !== false
        || strpos($useragent, 'mediatoolkit') !== false
        || strpos($useragent, 'wordpress/') !== false
        || strpos($useragent, 'blackboard') !== false
        || strpos($useragent, 'serendeputy') !== false
        || strpos($useragent, 'radian6') !== false
        || strpos($useragent, 'linkedinbot') !== false
        || strpos($useragent, 'mauibot') !== false
        || strpos($useragent, 'cloudfind') !== false
        || strpos($useragent, 'anthill') !== false
        || strpos($useragent, 'newsblur') !== false
        || strpos($useragent, 'amazonbot/0') !== false
        || strpos($useragent, 'quiterss') !== false
        || strpos($useragent, 'rssbot') !== false
        || strpos($useragent, 'hypefactors') !== false
        || strpos($useragent, 'archive.org_bot') !== false
        || strpos($useragent, 'coccocbot') !== false
        || strpos($useragent, 'opengraphreader') !== false
        || strpos($useragent, 'mj12bot.com') !== false;
} else {
    // No User-Agent header was sent: use a placeholder and treat the visitor as a non-bot.
    $useragent = "Mozilla (useragent unset)";
    $isBot = false;
}

The list above may be useful for PHP software such as Open Web Analytics that is not very good at telling bots and web crawlers apart from human visitors.
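
The chain of strpos() calls can also be written as a loop over an array, which makes it easier to add or remove snippets later. The following is only a minimal sketch, assuming PHP 7.1 or newer. The function name isCrawlerUserAgent and the variable $crawlerSnippets are made up for this example, and the array is deliberately shortened to a few entries taken from the list above.

```php
<?php
// Hypothetical helper (not part of the original snippet): returns true when
// the User-Agent contains any of the given lowercase snippets.
function isCrawlerUserAgent(?string $userAgent, array $snippets): bool
{
    if ($userAgent === null || $userAgent === '') {
        // No User-Agent header: treated as "not a bot", like the original code.
        return false;
    }
    $userAgent = strtolower($userAgent);
    foreach ($snippets as $snippet) {
        if (strpos($userAgent, $snippet) !== false) {
            return true;
        }
    }
    return false;
}

// Shortened for illustration; extend it with the remaining snippets from the list.
$crawlerSnippets = ['bing.com/bingbot', 'googlebot', 'duckduckbot', 'mj12bot.com'];

$isBot = isCrawlerUserAgent($_SERVER['HTTP_USER_AGENT'] ?? null, $crawlerSnippets);
```

With this layout the result is the same kind of $isBot boolean as before, so code that already checks $isBot should not need to change.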
