How to Expel Evil Robots

A web crawler that is not acting in good faith will ignore robots.txt or, worse, use robots.txt to find the very directories it disallows. It also does not care how often it hits a website to retrieve the content it wants. Fortunately, there are ways to get rid of such crawlers, and expelling evil robots is quite easy. If you are a website administrator, you can set a trap: create a directory (say, /secret_4817, where 4817 is a random number, so that the directory is unlikely to be reached by a normal user rather than a web crawler) and list it as a disallowed directory in robots.txt.
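As a rough sketch of the trap, the snippet below holds a robots.txt that disallows the example directory and uses Python's standard urllib.robotparser module to confirm that a rule-following crawler would stay out of it. The wildcard User-agent rule, the crawler name "PoliteBot", and the function name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the directory name is the example trap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secret_4817/
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if a crawler that obeys robots.txt may fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # A well-behaved crawler is told to stay out of the trap directory...
    print(is_allowed("PoliteBot", "/secret_4817/index.html"))
    # ...while the rest of the site remains open to it.
    print(is_allowed("PoliteBot", "/index.html"))
```

Since obedient crawlers skip the directory and ordinary visitors have no link to it, whatever still requests it is a good candidate for blocking.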

Any access to the trap directory can be assumed to come from a misbehaving web crawler, and you can obtain its IP address from a database or log file written by a script deliberately placed in that directory. The evil robot's IP can then be blocked through the .htaccess file, for example with the following directives:

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
# Match the blocked client IP address exactly (dots escaped)
RewriteCond %{REMOTE_ADDR} ^xxx\.xxx\.xxx\.xxx$
RewriteRule ^.*$ X.html [L]

Note the RewriteCond line: replace xxx.xxx.xxx.xxx with the IP address of the web crawler. If you already know the name of the web crawler you want to block, you can match on its user agent instead, for example:

RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]

The lines above block the Slurp web crawler (from Yahoo), in case you do not want that crawler to access your website for some reason.
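The steps above can be tied together with a small script. The sketch below assumes the server writes an access log in Common Log Format (client address first, request line in double quotes). It collects the addresses that requested the trap directory and prints one RewriteCond line per address, matching on %{REMOTE_ADDR}, the mod_rewrite variable for the client IP. The sample log lines, addresses, and function names are made up for illustration.

```python
import re

TRAP_PREFIX = "/secret_4817/"  # the example trap directory

# Assumed Common Log Format: client address first, request line in quotes.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*"')


def trap_visitors(log_lines):
    """Return the sorted unique client addresses that requested the trap."""
    ips = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(TRAP_PREFIX):
            ips.add(match.group(1))
    return sorted(ips)


def rewrite_conditions(ips):
    """Render one RewriteCond per address, chained with [OR], dots escaped."""
    lines = []
    for i, ip in enumerate(ips):
        flag = " [OR]" if i < len(ips) - 1 else ""
        lines.append("RewriteCond %{REMOTE_ADDR} ^" + re.escape(ip) + "$" + flag)
    return lines


# Made-up sample entries using documentation-only address ranges.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /secret_4817/ HTTP/1.1" 200 12',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /secret_4817/a.html HTTP/1.1" 200 12',
]

if __name__ == "__main__":
    for line in rewrite_conditions(trap_visitors(SAMPLE_LOG)):
        print(line)
```

The printed lines would go directly above a single RewriteRule in .htaccess. Review the list before applying it, since an address can be shared by legitimate visitors.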

Everything that is exposed on the Internet carries risk and can be used by others for their own benefit, and cases of information misuse are common. Technology provides the means to publish content, to take content, and also to secure content. But if you ask information security experts, none of them will guarantee 100% security. So make sure that what you expose is not confidential, and always minimize the possibility of misuse.