If, for whatever reason, you do not want web crawlers, whether from search engines or other kinds of web robots, to access all or part of your website, you can use robots.txt for this purpose.
The robots.txt file is placed at the root of the website (for example, yourdomain.com/robots.txt) and follows a standard that has been in development since 1994, when web indexing became popular. The standard does not guarantee that a web crawler will follow it; compliance depends entirely on the crawler's willingness to honor it. To instruct robots not to access your website at all, write the following instructions in robots.txt:
User-agent: *
Disallow: /
User-agent: * means the robots.txt instructions apply to all web robots. You can replace * with the name of a specific web robot if you want the instructions to apply only to that robot.
Disallow: / means the root directory and all its contents are not allowed to be accessed by web robots.
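To see how a cooperative crawler might interpret these two instructions, here is a minimal sketch using Python's standard urllib.robotparser module. The robot names AnyBot, ExampleBot, and OtherBot and the helper is_allowed are made up for illustration; a real crawler may apply its own matching rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Blocks every robot from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Blocks only one robot (ExampleBot is a hypothetical name).
BLOCK_ONE = """\
User-agent: ExampleBot
Disallow: /
"""

URL = "http://yourdomain.com/index.html"

print(is_allowed(BLOCK_ALL, "AnyBot", URL))      # False
print(is_allowed(BLOCK_ONE, "ExampleBot", URL))  # False
print(is_allowed(BLOCK_ONE, "OtherBot", URL))    # True
```

Note that this only shows what a well-behaved robot would conclude; nothing in robots.txt technically prevents a robot from ignoring the file.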
If you only want to protect certain directories or specific files, write the instructions as follows:
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /download/browse.php
These instructions tell web robots not to access the cgi-bin and images directories (and their contents), nor the file /download/browse.php (other files in the /download directory can still be accessed).
List of Web Crawlers
Some examples of web crawlers:
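The effect of these rules can be checked with a short sketch, again using Python's standard urllib.robotparser module. Apart from /download/browse.php, the file names below (form.cgi, logo.png, file.zip, index.html) and the robot name AnyBot are assumptions chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /download/browse.php
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

BASE = "http://yourdomain.com"

# Hypothetical paths, except /download/browse.php from the rules above.
paths = [
    "/cgi-bin/form.cgi",
    "/images/logo.png",
    "/download/browse.php",
    "/download/file.zip",
    "/index.html",
]

# Map each path to whether a compliant robot may fetch it.
results = {path: parser.can_fetch("AnyBot", BASE + path) for path in paths}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

Under these rules the first three paths are reported as blocked, while /download/file.zip and /index.html remain allowed, because only browse.php is excluded within the /download directory.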
1. Teleport Pro
Teleport Pro is web crawler software for offline browsing. It has been popular for a long time, especially back when Internet connections were not as easy to get or as fast as they are now. It is paid software, available at http://www.tenmax.com.
2. HTTrack
Like Teleport Pro, HTTrack is software that can download website content into a mirror on your hard drive to be viewed offline. It is written in C. Interestingly, this software is free and can be downloaded from its official website, http://www.httrack.com.
3. Googlebot
Googlebot is the web crawler that builds the search index used by the Google search engine. If people find your website through Google, that is likely thanks to Googlebot. The trade-off is that the crawling process consumes some of your bandwidth.
4. Yahoo! Slurp
Just as Googlebot is the web crawler behind Google's search engine, Yahoo! relies on Yahoo! Slurp, a technology developed by Inktomi Corporation, which was acquired by Yahoo!.
5. YaCy
Slightly different from the other web crawlers, YaCy is built on the principles of peer-to-peer (P2P) networks, is developed in Java, and is distributed across several hundred machines (called YaCy peers). Each peer shares its index with the others on the P2P principle, so no central server is required. One example of a search engine that uses YaCy is Sciencenet (http://sciencenet.fzk.de), which searches for documents in the field of science.