ROBOTS.TXT

For whatever reason, you may not want a Web Crawler, whether from a search engine or another type of web robot, to access all or part of your website. The robots.txt file can be used for this purpose.

The robots.txt file is placed at the root of the website (example: yourdomain.com/robots.txt) and follows a standard that has been in use since 1994, when web indexing became popular. The standard does not guarantee that a Web Crawler will follow it; everything depends on the Web Crawler’s cooperation in honoring it. To instruct robots not to access your website at all, write the following instructions in robots.txt:

User-agent: *
Disallow: /

User-agent: * means that the instructions apply to all web robots. You can replace * with the name of a specific web robot if you want the instructions to apply only to that robot.
Disallow: / means that web robots are not allowed to access the root directory or any of its contents.

If you want to protect only certain directories or specific files, write the following:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /download/browse.php

These instructions tell web robots not to access the cgi-bin and images directories (or their contents), nor the file /download/browse.php. Files in the /download directory other than browse.php remain accessible.
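As a rough check of how a cooperative crawler might interpret the rules above, the following sketch feeds them to Python's standard urllib.robotparser module. The agent name ExampleBot and the sample paths are assumptions made up for illustration; a real crawler may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt example above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /download/browse.php
"""

def allowed(path, agent="ExampleBot"):
    """Return True if the rules above let `agent` fetch `path`.

    "ExampleBot" is a hypothetical robot name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)

if __name__ == "__main__":
    # Sample paths are illustrative, not taken from a real site.
    for path in ("/cgi-bin/form.pl", "/images/logo.png",
                 "/download/browse.php", "/download/file.zip",
                 "/index.html"):
        print(path, "allowed" if allowed(path) else "blocked")
```

Only the three disallowed locations are reported as blocked; other files under /download stay reachable.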

 

List of Web Crawlers

Some examples of Web Crawlers:

1. Teleport Pro

Teleport Pro is Web Crawler software for offline browsing. It has been popular for a long time, especially from the days when Internet connections were not as easy or as fast as they are now. It is paid software, available at http://www.tenmax.com.

2. HTTrack

Like Teleport Pro, HTTrack is software that downloads website content into a mirror on your hard drive for offline viewing. It is written in C. Notably, it is free and can be downloaded from its official website, http://www.httrack.com.

3. Googlebot

Googlebot is the Web Crawler that builds the search index used by the Google search engine. If people find your website through Google, that is thanks to Googlebot. The trade-off is that the crawling process consumes some of your bandwidth.

4. Yahoo! Slurp

If Googlebot is the Web Crawler behind Google’s flagship search engine, Yahoo! relies on Yahoo! Slurp, a technology developed by Inktomi Corporation, which Yahoo! acquired.

5. YaCy

Slightly different from the other Web Crawlers listed here, YaCy is built on the principles of peer-to-peer (P2P) networks. It is developed in Java and distributed across several hundred machines (called YaCy peers). The peers share the index with one another on the P2P principle, so no central server is required. One example of a search engine that uses YaCy is Sciencenet (http://sciencenet.fzk.de), which searches for documents in the field of science.

How to Expel Evil Robots

A Web Crawler that is not acting in good faith will ignore robots.txt or, worse, use robots.txt to find and access the very directories it disallows. Such a crawler also does not care how often it visits a website to retrieve the content it wants. Fortunately, there are still ways to get rid of it, and expelling evil robots is fairly easy. If you are a website administrator, you can set a trap, for example by creating a directory (say, /secret_4817, where 4817 is a random number that makes the directory unlikely to be reached by a normal user rather than a Web Crawler) and listing it as a disallowed directory in robots.txt.

Any access to that directory can be assumed to come from a misbehaving Web Crawler, and you can obtain its IP address from a database or log file written by a script deliberately installed in the directory. The IP addresses of evil robots can then be blocked through the .htaccess file, for example with the following directives:

Options +FollowSymLinks
RewriteEngine on
RewriteBase /
RewriteCond %{REMOTE_ADDR} ^xxx.xxx.xxx.xxx
RewriteRule ^.*$ X.html [L]

Note the RewriteCond line: replace xxx.xxx.xxx.xxx with the IP address of the Web Crawler. If you already know the name of the Web Crawler you want to block, you can instead write, for example:

RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteRule ^.*$ X.html [L]

The lines above block the Slurp Web Crawler (from Yahoo!) if, for some reason, you do not want it to access your website.
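To find which IP addresses fell into the trap directory, a server access log could be scanned with a sketch like the one below. It assumes the log uses the Apache Common or Combined Log Format and that the trap directory is /secret_4817; the sample log lines and addresses are invented for illustration.

```python
import re

# Assumed trap path; use whatever directory you listed under Disallow.
TRAP_PREFIX = "/secret_4817"

# Matches the start of a Common/Combined Log Format line:
# client IP, identity, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)')

def trapped_ips(log_lines, trap_prefix=TRAP_PREFIX):
    """Return the sorted client IPs that requested a path under the trap."""
    ips = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(trap_prefix):
            ips.add(match.group(1))
    return sorted(ips)

if __name__ == "__main__":
    # Hypothetical log excerpt (documentation-range IP addresses).
    sample = [
        '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /secret_4817/ HTTP/1.1" 200 512',
        '198.51.100.2 - - [10/Oct/2023:13:56:01 +0000] '
        '"GET /index.html HTTP/1.1" 200 1024',
    ]
    print(trapped_ips(sample))
```

Each address returned this way is a candidate for the RewriteCond line, after checking that it does not belong to a legitimate visitor.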

Everything exposed on the Internet carries risk and can be used by others for their own benefit. Misuse of information is common. Technology provides the means to publish content, to take content, and also to secure content. But if you ask information security experts, none of them will guarantee 100% security. So make sure that what you expose is not confidential, and always minimize the opportunities for misuse.

Web Crawlers for Search Engines

A Web Crawler can be a simple application or a very complex one. It all depends on the purpose and function of the Web Crawler itself. For example, if it is used to hunt for e-mail addresses (an e-mail harvester), all the Web Crawler needs to do is parse website content and collect every string that matches the pattern of an e-mail address.

That is still simple compared to a Web Crawler that serves a search engine. Website owners tend to welcome this kind of Web Crawler because it gives their website a chance to be found by more people through search engines. The work done by a search engine Web Crawler is far more complex. Moreover, the details of the algorithms and architecture of search engine giants like Google and Yahoo! are not readily disclosed; they are business secrets. The Web Crawler accesses each URL over the Internet, and the crawling process can be multi-threaded for better performance. Content such as text and metadata is kept in storage, while the links (URLs) that are found are placed in a queue and scheduled for subsequent processing.

In an implementation, the following points should be taken into account:

1. Size of the website.

The more files and directories a website has, the longer it takes to explore. At the same time, one of the selling points of a search engine is the size of its database. A study in 1999 showed that no search engine indexed more than 16% of the web. A Web Crawler therefore prioritizes which web pages to download and does not take the entire content of the web.

2. Website changes.

The number of pages can increase or decrease, and existing content can be changed or updated. For this reason, visiting just once is not enough for a Web Crawler to capture the actual state of a website. Imagine a search engine returning many links that no longer exist or are out of date; that would certainly diminish the quality of the search results users want. On the other hand, the Web Crawler must also decide how often to repeat its visits (re-visit) in proportion, and this may differ depending on the characteristics of the website. For example, a website that publishes stock prices or currency rates requires a higher visit frequency.

3. Content processing.

The smarter a Web Crawler is at processing content, the better its data. For example, it can distinguish text that belongs to meta tags from text that sits inside HTML tags such as tables, fonts, bold, italic, and so on. Running the process in parallel can also optimize the work of the Web Crawler; distributed web crawling is a technique that uses multiple computers for this.
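As one possible illustration of telling meta tag text apart from text inside formatting tags, the sketch below uses Python's standard html.parser module. The class name PageText, the choice of which tags count as emphasis, and the sample page are all assumptions for this example, not a description of how any real search engine processes content.

```python
from html.parser import HTMLParser

# Tags treated as "emphasis" in this sketch (an arbitrary choice).
EMPHASIS = {"b", "strong", "i", "em"}

class PageText(HTMLParser):
    """Sorts page text into meta data, emphasized text, and plain text."""

    def __init__(self):
        super().__init__()
        self.meta = {}        # name -> content taken from <meta> tags
        self.emphasized = []  # text found inside bold/italic tags
        self.plain = []       # all other text
        self._depth = 0       # how many emphasis tags are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content
        elif tag in EMPHASIS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.emphasized if self._depth else self.plain).append(text)

def analyze(html):
    """Parse `html` and return the populated PageText object."""
    parser = PageText()
    parser.feed(html)
    parser.close()
    return parser

if __name__ == "__main__":
    # Hypothetical page used only to exercise the parser.
    page = analyze('<html><head><meta name="description" '
                   'content="Stock prices"><title>Quotes</title></head>'
                   '<body><p>Today <b>ACME</b> rose.</p></body></html>')
    print(page.meta, page.emphasized, page.plain)
```

A crawler could then weight the three groups differently, for instance by treating emphasized words as more important than plain text.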

The content a Web Crawler downloads needs further processing, namely indexing into a database, so that data can be searched and processed more quickly. When someone searches for a keyword on a search engine, the index server is consulted. From there, the search engine generates excerpts of the matching documents and displays them to the user.
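The indexing step can be pictured with a very small inverted index, which maps each word to the documents that contain it. This is a minimal sketch under simplifying assumptions (lowercase ASCII words, exact matching, documents held in memory); the function names and sample documents are invented, and real search engine indexes are far more elaborate.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")

def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

if __name__ == "__main__":
    # Hypothetical crawled documents.
    docs = {
        "a": "Web crawler basics",
        "b": "Crawler traps and robots",
        "c": "Search engine index",
    }
    index = build_index(docs)
    print(sorted(search(index, "crawler")))
    print(sorted(search(index, "crawler traps")))
```

Looking a keyword up in the index avoids rescanning every stored document for each query, which is why searches can be answered quickly.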

Web Crawler Deeds

Have the page views of your website or blog suddenly jumped? Do not celebrate yet: the jump may be caused not by unique visitors but by repeated visits from web crawlers. Soaring page views or requests could be caused by an application browsing your web pages one by one, taking the content, and storing it. This is what a Web Crawler application does. It is nothing new. One or two decades ago, when Internet access was still very limited, people might choose to download the entire content of a website they wanted to read so that it could be read offline at any time. Others do it for different purposes, e.g. to collect e-mail addresses or phone numbers contained in the content, or to collect specific data such as images or video. What exactly is a Web Crawler application, and is it really always detrimental to a website owner?

Web Crawler Concept

A Web Crawler is a program or automated script that processes web pages. It is often also called a web robot or web spider. The basic idea is simple and similar to exploring pages manually with a browser. You start from an initial link to a website address and open it in the browser, and the browser requests and downloads data from the web server via the HTTP protocol.

Each hyperlink encountered in the content that appears is opened in turn in a new browser window or tab, and the process repeats. A Web Crawler automates this job. In short, the two main functions of a Web Crawler are:

  1. Identify hyperlinks. Hyperlinks found in the content are added to the list of pages to visit, also called the crawl frontier.
  2. Visit recursively. From each hyperlink, the Web Crawler explores further and repeats the process, subject to rules tailored to the needs of the application.
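The two functions above can be sketched as a small breadth-first crawler. To keep the example self-contained it does not touch the network: pages come from an in-memory dictionary standing in for a website. The names LinkExtractor, crawl, and SITE, and the example.test addresses, are assumptions for illustration only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque([start_url])   # the crawl frontier
    seen = {start_url}              # URLs already queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:            # page missing or not retrievable
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:          # function 1: identify links
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)     # function 2: visit them later
    return visited

# Hypothetical three-page website held in memory.
SITE = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/missing.html">gone</a>',
}

if __name__ == "__main__":
    print(crawl("http://example.test/", SITE.get))
```

Replacing SITE.get with a function that downloads pages over HTTP would turn the sketch into a real crawler, which should then also honor robots.txt.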

In the loop that visits hyperlinks, spider traps can occur, i.e. the process repeats endlessly because the Web Crawler is trapped into following an unlimited number of links. This can happen accidentally or intentionally. It can happen accidentally because of an error in the design of the Web Crawler program that makes it re-read hyperlinks it has already accessed, or because a website inadvertently has an infinite number of dynamic pages, for example dynamic pages generated from calendar dates.

A spider trap can be deliberate if the website is designed to cripple Web Crawlers, for example by generating an infinite number of dynamic pages. Besides taking content for its own purposes, a Web Crawler can cause other losses for the website owner, among them increased resource usage such as bandwidth and server CPU time, all the more so if two or more Web Crawlers access the same website. One solution for website administrators is a standard called the robots.txt protocol, which specifies which parts of the website should not be accessed by Web Crawlers.
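On the crawler side, one common defense against the spider traps described earlier is to remember visited links and to cap how deep and how far a crawl may go. The sketch below shows that idea under stated assumptions: the limits are arbitrary, and calendar_links is a made-up stand-in for a site whose calendar pages link to the next day forever.

```python
from collections import deque

def bounded_crawl(start, get_links, max_depth=3, max_pages=50):
    """Visit pages breadth-first without exceeding max_depth or max_pages,
    so an endless chain of generated pages cannot run forever."""
    frontier = deque([(start, 0)])
    seen = {start}
    visited = []
    while frontier and len(visited) < max_pages:
        url, depth = frontier.popleft()
        visited.append(url)
        if depth == max_depth:
            continue                 # do not follow links any deeper
        for link in get_links(url):
            if link not in seen:     # never re-read a known hyperlink
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited

def calendar_links(url):
    """Hypothetical trap: each calendar page links to the following day."""
    day = int(url.rsplit("=", 1)[1])
    return [f"/calendar?day={day + 1}"]

if __name__ == "__main__":
    print(bounded_crawl("/calendar?day=0", calendar_links))
```

With the default depth of 3 the crawl stops after four calendar pages instead of following the chain indefinitely.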