Using .htaccess for site control
How to prevent hot-linking, redirect moved pages and block robots using .htaccess
- What is .htaccess?
- Preventing abuse by hot-linking to contents on your site
- Moved pages
- Bad Bot User-Agent based blocking
1. What is .htaccess?
.htaccess is an Apache configuration file you can place in any folder on your website; it applies to that folder and to all folders below it. It can be used to password protect folders, change properties of specific file types and more.
It is considered good practice to leave a few blank lines at the end of your .htaccess file.
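As an illustration of the password protection mentioned above, a minimal sketch could look like the following. The realm name and the password file path are placeholders, not values from this article, and the sketch assumes the server allows authentication directives in .htaccess (AllowOverride AuthConfig) and that the password file was created with the htpasswd utility.

```apache
# Ask for a user name and password using HTTP Basic authentication.
AuthType Basic
# Hypothetical realm name shown in the browser's login prompt.
AuthName "Restricted area"
# Hypothetical path; keep the password file outside the web-served folders.
AuthUserFile /path/to/.htpasswd
# Let in any user listed in the password file.
Require valid-user
```

Because .htaccess applies to subfolders too, everything below the folder holding this file is protected as well.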
2. Preventing abuse by hot-linking to contents on your site
Bandwidth abuse (or outright stealing) by sites that embed images and other content served from your server is a growing problem. Some forums allow users to add a picture to their posts, and it is common to use a link to an image on another site as the user picture.
With Apache you can add a FilesMatch section to your .htaccess and deny all requests that were not referred from your own domain. A FilesMatch section can begin like this:
<FilesMatch "\.(gif|jpg|png|swf|mpg|avi)$">
It should contain a ruleset such as Order Allow,Deny
The $ at the end of the match set (foo|bar) makes the rule apply only to files whose names end with one of the given strings. Omitting it makes the match include files that contain the string anywhere in the filename (meaning myfile.foo.tar.bz2 would match foo).
Example .htaccess usage:
SetEnvIfNoCase Referer "homelinux\.org" local_ref=1
<FilesMatch "\.(gif|jpg|png|swf|mpg|avi)$">
  Order Allow,Deny
  Allow from env=local_ref
</FilesMatch>
For more information, refer to:
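One side effect of the example above is that visitors whose browser or proxy sends no Referer header at all are denied too. A possible variation, sketched here under the assumption that you would rather let such requests through, adds a second SetEnvIf line that matches an empty Referer:

```apache
# Mark requests referred from your own domain (homelinux.org in the example above).
SetEnvIfNoCase Referer "homelinux\.org" local_ref=1
# Assumption: also accept requests that carry no Referer header at all.
SetEnvIf Referer "^$" local_ref=1

<FilesMatch "\.(gif|jpg|png|swf|mpg|avi)$">
    Order Allow,Deny
    Allow from env=local_ref
</FilesMatch>
```

This is a trade-off: it is friendlier to legitimate visitors, but a hot-linker who strips the Referer header gets through as well.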
- Apache Core Features: FilesMatch http://httpd.apache.org/docs/mod/core.html.en#filesmatch
- Preventing Image 'Theft' http://apache-server.com/tutorials/ATimage-theft.html
3. Moved pages
If you change your publishing system, you may find that other sites still link to the old URLs of pages that have moved, and those URLs no longer work.
You can use the Redirect directive to send visitors from an old URL to the new one.
Example .htaccess file:
Redirect /old/page.html http://foo.bar.org/moved/here.php
Redirect /foo/moved.html http://foo.bar.org/new/page/
RedirectPermanent /bar/movedforgood.html http://foo.org/new/
A plain Redirect answers with status 302 (temporary). Use RedirectPermanent, which answers with status 301, if the new links are permanent.
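The Redirect directive also accepts an explicit status as its first argument, and RedirectMatch covers many pages with one regular expression. The following is only a sketch: the /archive/ and /old/retired.html paths are made up for illustration and do not come from the examples above.

```apache
# Equivalent to RedirectPermanent: answer with status 301.
Redirect 301 /bar/movedforgood.html http://foo.org/new/

# The page was removed and has no replacement: answer 410 Gone (no target URL).
Redirect gone /old/retired.html

# Hypothetical bulk move: every .html page under /archive/ now lives
# under the same name with a .php extension on the new host.
RedirectMatch 301 ^/archive/(.*)\.html$ http://foo.bar.org/moved/$1.php
```

RedirectMatch is handy after a publishing system change, when whole folders of pages follow the same renaming pattern.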
4. Bad Bot User-Agent based blocking
If you administer your own website and look at the logs, you have probably noticed a high number of visitors with strange and unusual user-agent identification. You have probably also seen bad robots eat all your bandwidth by hitting you with a high number of requests, or by downloading the same file over five to ten connections from time to time.
Here is a list of user agents you may not want visiting your site, generated with grep HTTP_USER_AGENT .htaccess | awk '{print $3}', along with an example .htaccess to block them. Names that contain a space appear cut off at the backslash (Download\ stands for Download\ Demon). Blocking these agents may save you bandwidth and keep your site from becoming slow and overloaded.
Known "bad" guests:
Alexibot asterias BackDoorBot Black.Hole BlackWidow BlowFish BotALot BuiltBotTough Bullseye BunnySlippers Cegbfeieh CheeseBot CherryPicker ChinaClaw CopyRightCheck cosmos Crescent Custo DISCo DittoSpyder Download\ eCatch EirGrabber EmailCollector EmailSiphon EmailWolf EroCrawler Express\ ExtractorPro EyeNetIE FlashGet Foobot FrontPage GetRight GetWeb! Go-Ahead-Got-It Googlebot-Image Go!Zilla GrabNet Grafula Harvest hloader HMView httplib HTTrack humanlinks ia_archiver Image\ Image\ Indy\ InfoNaviRobot InterGET Internet\ JennyBot JetCar JOC\ Kenjin.Spider Keyword.Density larbin LeechFTP LexiBot libWeb/clsHTTP LinkextractorPro LinkScan/8.1a.Unix LinkWalker lwp-trivial Mass\ Mata.Hari Microsoft.URL MIDown\ MIIxpc Mister.PiX Mister\ moget Mozilla/2 Mozilla/3.Mozilla/2.01 Mozilla.*NEWT Navroad NearSite NetAnts NetMechanic NetSpider Net\ NetZIP NICErsPRO NPBot Octopus Offline.Explorer Openfind PageGrabber Papa\ pavuk pcBrowser ProPowerBot/2.14 ProWebWalker ProWebWalker QueryN.Metasearch ReGet RepoMonkey RMA SiteSnagger SlySearch SmartDownload SpankBot spanner SuperBot SuperHTTP Surfbot suzuran Szukacz/1.4 tAkeOut Teleport Telesoft The.Intraformant TheNomad TightTwatBot Titan TJvMultiHttpGrabber Component toCrawl/UrlDispatcher True_Robot turingos TurnitinBot/1.5 URLy.Warning VCI VoidEYE WebAuto WebBandit WebCopier WebEMailExtrac.* WebEnhancer WebFetch WebGo\ Web.Image.Collector WebLeacher WebmasterWorldForumBot WebReaper WebSauger Website.Quester Webster.Pro WebStripper WebWhacker WebZip Wget Widow [Ww]eb[Bb]andit WWW-Collector-E WWWOFFLE Xaldon\ Xenu's Zeus
Not all of these are bad or evil; that is a matter of opinion.
Programs downloading your entire site in one quick swoop
Offline.Explorer and wget (manual page) use a lot of bandwidth by downloading large parts of your site, or even all of it, at once so that users can follow links on your site after they disconnect their modem.
Download managers
These are programs that can "download faster" by opening tons of connections, each requesting a part of the same file. They are commonly used by Windows users who install them after being brainwashed by pop-up ads telling them to. Download managers can strangle your bandwidth and are generally bad. Users can still download the content without them, so it does not hurt to restrict them to using a browser.
Search Engine Spiders
ia_archiver and Googlebot index your site and make it available through various search engines. This is usually worth the bandwidth because it increases real traffic. Most of these spiders are fairly nice and only hit you every xx minutes.
Others, like TurnitinBot and SiteSnagger, are bad through and through. Some robots crawl your site only to find email addresses that can be used for sending spam.
Google can help you identify the various spiders and their purpose.
Evil programs that only eat your bandwidth
"TJvMultiHttpGrabber Component" is very bad. This is btsearch, a program that searches through websites and makes their content available through a nice GUI where it can be downloaded. The user has no idea where on the Internet the files are stored or who serves them.
How to deny access
Here is an example of a .htaccess that redirects bots, denying them access to the real contents of your site, while still allowing the commonly used browsers. This only works if the mod_rewrite module is enabled.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlowFish [OR]
RewriteCond %{HTTP_USER_AGENT} ^BotALot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BuiltBotTough [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bullseye [OR]
RewriteCond %{HTTP_USER_AGENT} ^BunnySlippers [OR]
RewriteCond %{HTTP_USER_AGENT} ^Cegbfeieh [OR]
RewriteCond %{HTTP_USER_AGENT} ^CheeseBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^CopyRightCheck [OR]
RewriteCond %{HTTP_USER_AGENT} ^cosmos [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^EroCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Foobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^FrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^Harvest [OR]
RewriteCond %{HTTP_USER_AGENT} ^hloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^httplib [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^humanlinks [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InfoNaviRobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JennyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Kenjin.Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Keyword.Density [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^LexiBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^libWeb/clsHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkextractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkScan/8.1a.Unix [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp-trivial [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mata.Hari [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister.PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^moget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/2 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetMechanic [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline.Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^Openfind [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProPowerBot/2.14 [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^QueryN.Metasearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey [OR]
RewriteCond %{HTTP_USER_AGENT} ^RMA [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SlySearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SpankBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^spanner [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^suzuran [OR]
RewriteCond %{HTTP_USER_AGENT} ^Szukacz/1.4 [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^The.Intraformant [OR]
RewriteCond %{HTTP_USER_AGENT} ^TheNomad [OR]
RewriteCond %{HTTP_USER_AGENT} ^TightTwatBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Titan [OR]
RewriteCond %{HTTP_USER_AGENT} ^TJvMultiHttpGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^toCrawl/UrlDispatcher [OR]
RewriteCond %{HTTP_USER_AGENT} ^True_Robot [OR]
RewriteCond %{HTTP_USER_AGENT} ^turingos [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot/1.5 [OR]
RewriteCond %{HTTP_USER_AGENT} ^URLy.Warning [OR]
RewriteCond %{HTTP_USER_AGENT} ^VCI [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEnhancer [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.Image.Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebmasterWorldForumBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website.Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster.Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZip [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW-Collector-E [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu's [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^(.*)$ http://www.robotstxt.org/
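[OR] tells Apache that the next RewriteCond is an alternative, so the rule fires if any one of the conditions matches. Without [OR], consecutive conditions must all match. The [OR] must be included on every RewriteCond line except the last one, the line directly before the RewriteRule the conditions apply to.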
[NC] means case-insensitive.
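Since each condition is a regular expression, several user agents can also be folded into one RewriteCond with alternation, which removes the need for [OR] on every line. The sketch below uses a handful of names from the list above purely as an illustration; it is not a replacement for the full list.

```apache
RewriteEngine on
# One condition with alternation stands in for five separate [OR] lines.
# [NC] makes the whole match case-insensitive, so "wget" and "Wget" both match.
RewriteCond %{HTTP_USER_AGENT} ^(Alexibot|BlackWidow|HTTrack|WebZip|Wget) [NC]
RewriteRule ^(.*)$ http://www.robotstxt.org/
```

The one-name-per-line form is longer but easier to edit and to grep, which is a reasonable argument for keeping it.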
The RewriteRule can be used to send "visitors" to a local file like goaway.txt or to send them along to another site.
The logs will list redirected hits with status code 302. (HTTP Status Codes - Redirecting URLs in IIS and Apache)
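If you would rather refuse the request outright than redirect it, mod_rewrite's [F] flag answers with 403 Forbidden, and such hits then show up in the logs with status 403 instead of 302. A minimal sketch, again using only two names from the list above as stand-ins:

```apache
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^WebZip [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
# "-" leaves the URL unchanged; [F] refuses the request with 403 Forbidden.
RewriteRule ^ - [F]
```

Refusing costs slightly less than redirecting, since the robot gets no new URL to follow.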
Copyright GNU Copyleft http://linuxreviews.org/
