LinuxReviews.org -- get your Linux knowledge

Using .htaccess for site control

How to prevent hot-linking, redirect moved pages and block robots using .htaccess


  1. What is .htaccess?
  2. Preventing abuse by hot-linking to contents on your site
  3. Moved pages
  4. Bad Bot User-Agent based blocking


1. What is .htaccess?

.htaccess is an Apache configuration file you can place in any folder on your website; it applies to that folder and to all folders beneath it. It can be used to password-protect folders, change properties of specific file types and more.

It is commonly recommended to leave a blank line or two at the end of your .htaccess file, so that the last directive is always terminated by a newline.
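
As a rough sketch of the uses mentioned above, a .htaccess might look like the following. The realm name, the password file path and the chosen file types are placeholders rather than anything from this site, and the server must permit these directives through AllowOverride.

```apache
# Password-protect this folder and everything below it.
# /path/to/.htpasswd is a placeholder; create the file with the htpasswd tool.
AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user

# Change properties of specific file types:
# serve .txt files as plain text and label them as UTF-8.
AddType text/plain .txt
AddCharset UTF-8 .txt
```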

2. Preventing abuse by hot-linking to contents on your site

Bandwidth abuse, or bandwidth stealing, by sites that embed images and other content from your server is a growing problem. Some forums allow users to add a picture to their post, and it is common to link to an image on another site as the user picture.

With Apache you can add a FilesMatch section to your .htaccess and deny all requests that were not referred from your own domain. A FilesMatch section begins like this:

  <FilesMatch "\.(gif|jpg|png|swf|mpg|avi)$">

It should contain a ruleset such as Order Allow,Deny followed by an Allow line, and end with </FilesMatch>.

The $ at the end of the pattern (foo|bar)$ makes the rule apply only to files whose names end with one of the given strings. Omitting it makes the pattern match files that contain the string anywhere in the filename (meaning myfile.foo.tar.bz2 would match foo).

Example .htaccess usage:

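  # Flag requests whose Referer contains your own domain
  SetEnvIfNoCase Referer "homelinux\.org" local_ref=1
  <FilesMatch "\.(gif|jpg|png|swf|mpg|avi)$">
    # With this order, deny is the default; only flagged requests get through
    Order Allow,Deny
    Allow from env=local_ref
  </FilesMatch>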
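
Order and Allow are the older Apache 2.2 access-control directives. On Apache 2.4 the same idea would typically be written with Require, roughly as sketched below. The domain example.org is a placeholder, the stricter Referer pattern is one possible choice, and the second SetEnvIf line is an optional assumption: it also lets through requests that send no Referer at all, as some browsers and proxies do.

```apache
# Flag requests referred from our own domain (example.org is a placeholder).
SetEnvIfNoCase Referer "^https?://([^/]+\.)?example\.org(:[0-9]+)?(/|$)" local_ref=1
# Optional: also accept requests that carry no Referer header.
SetEnvIf Referer "^$" local_ref=1

<FilesMatch "\.(gif|jpg|png|swf|mpg|avi)$">
    # Apache 2.4 syntax: grant access only when local_ref is set
    Require env local_ref
</FilesMatch>
```

Anchoring the pattern to the host part of the Referer keeps a foreign page from passing the check just because your domain name appears somewhere in its URL.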


3. Moved pages

If you change your publishing system, you may find that pages which have moved are still linked under old URLs that no longer work. You can use the Redirect directive to send visitors to the new location.

Example .htaccess file:

  Redirect /old/page.html  http://foo.bar.org/moved/here.php
  Redirect /foo/moved.html http://foo.bar.org/new/page/
  RedirectPermanent  /bar/movedforgood.html http://foo.org/new/

A plain Redirect sends a temporary redirect (status 302). Use RedirectPermanent (status 301) if the new links are permanent.
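
The status can also be given as an argument to Redirect, and RedirectMatch handles whole groups of moved pages with a regular expression. The following is only a sketch: the /retired/ and /archive/ paths are made up for illustration, and the host names are reused from the example above.

```apache
# Temporary move: same as a bare "Redirect", status 302
Redirect temp /old/page.html http://foo.bar.org/moved/here.php
# Permanent move: same as RedirectPermanent, status 301
Redirect permanent /bar/movedforgood.html http://foo.org/new/
# Content removed with no replacement: status 410, no target URL is given
Redirect gone /retired/page.html
# Pattern-based: send every /archive/*.html to the matching /new/*.php
RedirectMatch permanent "^/archive/(.*)\.html$" "http://foo.bar.org/new/$1.php"
```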

4. Bad Bot User-Agent based blocking

If you administer your own website and look at the logs, you have probably noticed a high number of visitors with strange and unusual user-agent identification. You have probably also seen bad robots eat all your bandwidth from time to time, either by hitting you with a high number of requests or by downloading the same file over five to ten connections.

Here is a list of user agents you may not want visiting you, made with grep HTTP_USER_AGENT .htaccess | awk '{print $3}', and an example .htaccess to block them. Blocking them may save you bandwidth and keep your site from becoming slow and overloaded.

Known "bad" guests:

   Alexibot
   asterias
   BackDoorBot
   Black.Hole
   BlackWidow
   BlowFish
   BotALot
   BuiltBotTough
   Bullseye
   BunnySlippers
   Cegbfeieh
   CheeseBot
   CherryPicker
   ChinaClaw
   CopyRightCheck
   cosmos
   Crescent
   Custo
   DISCo
   DittoSpyder
   Download\ Demon
   eCatch
   EirGrabber
   EmailCollector
   EmailSiphon
   EmailWolf
   EroCrawler
   Express\ WebPictures
   ExtractorPro
   EyeNetIE
   FlashGet
   Foobot
   FrontPage
   GetRight
   GetWeb!
   Go-Ahead-Got-It
   Googlebot-Image
   Go!Zilla
   GrabNet
   Grafula
   Harvest
   hloader
   HMView
   httplib
   HTTrack
   humanlinks
   ia_archiver
   Image\ Stripper
   Image\ Sucker
   Indy\ Library
   InfoNaviRobot
   InterGET
   Internet\ Ninja
   JennyBot
   JetCar
   JOC\ Web\ Spider
   Kenjin.Spider
   Keyword.Density
   larbin
   LeechFTP
   LexiBot
   libWeb/clsHTTP
   LinkextractorPro
   LinkScan/8.1a.Unix
   LinkWalker
   lwp-trivial
   Mass\ Downloader
   Mata.Hari
   Microsoft.URL
   MIDown\ tool
   MIIxpc
   Mister.PiX
   Mister\ PiX
   moget
   Mozilla/2
   Mozilla/3.Mozilla/2.01
   Mozilla.*NEWT
   Navroad
   NearSite
   NetAnts
   NetMechanic
   NetSpider
   Net\ Vampire
   NetZIP
   NICErsPRO
   NPBot
   Octopus
   Offline.Explorer
   Openfind
   PageGrabber
   Papa\ Foto
   pavuk
   pcBrowser
   ProPowerBot/2.14
   ProWebWalker
   QueryN.Metasearch
   ReGet
   RepoMonkey
   RMA
   SiteSnagger
   SlySearch
   SmartDownload
   SpankBot
   spanner
   SuperBot
   SuperHTTP
   Surfbot
   suzuran
   Szukacz/1.4
   tAkeOut
   Teleport
   Telesoft
   The.Intraformant
   TheNomad
   TightTwatBot
   Titan
   TJvMultiHttpGrabber Component
   toCrawl/UrlDispatcher
   True_Robot
   turingos
   TurnitinBot/1.5
   URLy.Warning
   VCI
   VoidEYE
   WebAuto
   WebBandit
   WebCopier
   WebEMailExtrac.*
   WebEnhancer
   WebFetch
   WebGo\ IS
   Web.Image.Collector
   WebLeacher
   WebmasterWorldForumBot
   WebReaper
   WebSauger
   Website.Quester
   Webster.Pro
   WebStripper
   WebWhacker
   WebZip
   Wget
   Widow
   [Ww]eb[Bb]andit
   WWW-Collector-E
   WWWOFFLE
   Xaldon\ WebSpider
   Xenu's
   Zeus

Not all of these are bad or evil; that is a matter of opinion.

Programs downloading your entire site in one fell swoop

Offline.Explorer and wget use a lot of bandwidth by downloading large parts of your site, or even all of it, at once so users can follow links on your site after they disconnect their modem.

Download managers

These are programs that can "download faster" by opening tons of connections, each requesting a part of the same file. They are commonly used by Windows users who install them after being brainwashed by pop-up ads telling them to. Download managers can strangle your bandwidth and are generally bad. Blocking them does not hurt: users can still download the content with a browser.

Search Engine Spiders

ia_archiver and Googlebot index your site and make it available through various search engines. This is usually worth it because it will increase real traffic. Most of these spiders are fairly nice and only hit you every few minutes.

Some robots, like TurnitinBot and SiteSnagger, are simply bad. Others only crawl your site to find email addresses usable for sending spam.

Google can help you identify the various spiders and their purpose.

Evil programs that only eat your bandwidth

"TJvMultiHttpGrabber Component" is very bad. This is btsearch, a program that searches through websites and makes their content available through a nice GUI interface where it can be downloaded. The user has no idea where on the Internet the files are stored or who serves them.

How to deny access

Here is an example of a .htaccess that redirects bots and thereby denies them access to the real contents of your site, while still allowing the commonly used browsers. This only works if you have the mod_rewrite module enabled.

StopBadBots.htaccess.txt

  
   RewriteEngine on
   RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
   RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [OR]
   RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
   RewriteCond %{HTTP_USER_AGENT} ^BlowFish [OR]
   RewriteCond %{HTTP_USER_AGENT} ^BotALot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^BuiltBotTough [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Bullseye [OR]
   RewriteCond %{HTTP_USER_AGENT} ^BunnySlippers [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Cegbfeieh [OR]
   RewriteCond %{HTTP_USER_AGENT} ^CheeseBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
   RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
   RewriteCond %{HTTP_USER_AGENT} ^CopyRightCheck [OR]
   RewriteCond %{HTTP_USER_AGENT} ^cosmos [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
   RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
   RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
   RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
   RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
   RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
   RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
   RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
   RewriteCond %{HTTP_USER_AGENT} ^EroCrawler [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
   RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
   RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
   RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Foobot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^FrontPage [NC,OR]
   RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
   RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
   RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Harvest [OR]
   RewriteCond %{HTTP_USER_AGENT} ^hloader [OR]
   RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
   RewriteCond %{HTTP_USER_AGENT} ^httplib [OR]
   RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
   RewriteCond %{HTTP_USER_AGENT} ^humanlinks [OR]
   RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
   RewriteCond %{HTTP_USER_AGENT} ^InfoNaviRobot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
   RewriteCond %{HTTP_USER_AGENT} ^JennyBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
   RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Kenjin.Spider [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Keyword.Density [OR]
   RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
   RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
   RewriteCond %{HTTP_USER_AGENT} ^LexiBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^libWeb/clsHTTP [OR]
   RewriteCond %{HTTP_USER_AGENT} ^LinkextractorPro [OR]
   RewriteCond %{HTTP_USER_AGENT} ^LinkScan/8.1a.Unix [OR]
   RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
   RewriteCond %{HTTP_USER_AGENT} ^lwp-trivial [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mata.Hari [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
   RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
   RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mister.PiX [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
   RewriteCond %{HTTP_USER_AGENT} ^moget [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mozilla/2 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NetMechanic [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
   RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Offline.Explorer [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Openfind [OR]
   RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
   RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
   RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
   RewriteCond %{HTTP_USER_AGENT} ^ProPowerBot/2.14 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
   RewriteCond %{HTTP_USER_AGENT} ^QueryN.Metasearch [OR]
   RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
   RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey [OR]
   RewriteCond %{HTTP_USER_AGENT} ^RMA [OR]
   RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
   RewriteCond %{HTTP_USER_AGENT} ^SlySearch [OR]
   RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
   RewriteCond %{HTTP_USER_AGENT} ^SpankBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^spanner [OR]
   RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^suzuran [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Szukacz/1.4 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
   RewriteCond %{HTTP_USER_AGENT} ^The.Intraformant [OR]
   RewriteCond %{HTTP_USER_AGENT} ^TheNomad [OR]
   RewriteCond %{HTTP_USER_AGENT} ^TightTwatBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Titan [OR]
   RewriteCond %{HTTP_USER_AGENT} ^TJvMultiHttpGrabber [OR]
   RewriteCond %{HTTP_USER_AGENT} ^toCrawl/UrlDispatcher [OR]
   RewriteCond %{HTTP_USER_AGENT} ^True_Robot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^turingos [OR]
   RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot/1.5 [OR]
   RewriteCond %{HTTP_USER_AGENT} ^URLy.Warning [OR]
   RewriteCond %{HTTP_USER_AGENT} ^VCI [OR]
   RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebEnhancer [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Web.Image.Collector [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebmasterWorldForumBot [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Website.Quester [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Webster.Pro [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WebZip [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
   RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WWW-Collector-E [OR]
   RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Xenu's [OR]
   RewriteCond %{HTTP_USER_AGENT} ^Zeus
   RewriteRule ^(.*)$ http://www.robotstxt.org/ [R=302,L]
  
   

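[OR] tells Apache to combine the condition with the next RewriteCond using a logical OR; without it, consecutive conditions are ANDed. The [OR] must be included on every RewriteCond line except the last one, the line directly before the RewriteRule these conditions apply to.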

[NC] means case-insensitive.
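
A long chain of [OR] lines can also be folded into a single condition using regular-expression alternation. The following is a minimal sketch rather than a replacement for the full file: it picks only a handful of names from the list as a sample, and it answers with 403 Forbidden instead of redirecting, which is a different choice from the example above.

```apache
RewriteEngine on
# One alternation instead of many [OR] lines; [NC] ignores case.
# The names are a small sample taken from the list.
RewriteCond %{HTTP_USER_AGENT} ^(BlackWidow|EmailSiphon|HTTrack|WebZip|Wget) [NC]
# "-" leaves the URL untouched; [F] returns 403 Forbidden and stops rewriting.
RewriteRule ^ - [F]
```

Hits blocked this way show up in the logs with status code 403 rather than 302.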

The RewriteRule can be used to send "visitors" to a local file like goaway.txt or pass them along to another site.
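
Serving a local file could look roughly like this. It is a sketch that assumes the .htaccess and goaway.txt both sit in the document root, and it uses two sample user agents in place of the full list. The second condition keeps the rule from rewriting the request for goaway.txt itself.

```apache
RewriteEngine on
# Sample bad bots; conditions without [OR] are ANDed together.
RewriteCond %{HTTP_USER_AGENT} ^(Wget|WebZip) [NC]
# Do not rewrite the request for goaway.txt itself, which would loop.
RewriteCond %{REQUEST_URI} !^/goaway\.txt$
# Internal rewrite: the bot receives goaway.txt whatever it asked for.
RewriteRule ^ /goaway.txt [L]
```

Because this is an internal rewrite and not a redirect, the bot gets a normal 200 response containing the text file.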

The logs will list redirected hits with status code 302. (HTTP Status Codes - Redirecting URLs in IIS and Apache)


Copyright GNU Copyleft http://linuxreviews.org/

