Have the page views on your website or blog suddenly jumped? Don't get too excited yet: the increase may come not from unique visitors but from repeated visits by web crawlers. A spike in page views or requests can be caused by an application that browses your web pages one by one, takes the content, and stores it. Such an application is called a Web Crawler. This is nothing new. One or two decades ago, when Internet access was still very limited, people often downloaded the entire content of a website they wanted to read so that they could read it offline at any time. Others do it for different purposes, e.g. to harvest the e-mail addresses or phone numbers contained in the content, or to collect specific data such as images or videos. So what exactly is a Web Crawler, and is it really always detrimental to a website owner?
Web Crawler Concept
A Web Crawler is a program or automated script that processes web pages. It is often also called a web robot or web spider. The basic idea is simple and similar to exploring pages manually with a browser: you start from a website address (link) and open it in the browser, and the browser then requests and downloads the data from the web server over the HTTP protocol.
Each hyperlink found in the content that appears is then opened in a new browser window or tab, and the process repeats. A Web Crawler automates this job. In summary, the two main functions of a Web Crawler are:
- Identifying hyperlinks. Hyperlinks found in the content are added to the list of pages to visit, also called the crawl frontier.
- Visiting the hyperlinks recursively. From each hyperlink, the Web Crawler explores further and repeats the process, under rules tailored to the needs of the application.
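The two functions above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular crawler is built: the names `LinkExtractor`, `extract_links`, and `crawl` are made up for this example, the page download is left to a caller-supplied `fetch` function, and the repeated visiting is done with a queue instead of literal recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Function 1: identify hyperlinks and return them as absolute URLs."""
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop "#fragment" parts so that the same
    # page is not queued twice under different names.
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Function 2: visit pages one after another, starting from start_url.

    fetch is a hypothetical caller-supplied function that takes a URL and
    returns the HTML text of that page, or None if it cannot be loaded.
    """
    frontier = deque([start_url])  # the crawl frontier: URLs still to visit
    seen = {start_url}             # every URL ever placed in the frontier
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

To try it without touching a real website, `fetch` can simply be the `get` method of a dictionary that maps URLs to HTML strings. For real pages it would wrap an HTTP request instead.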
This process of repeatedly visiting hyperlinks can run into spider traps, i.e. the process repeats endlessly because the Web Crawler is trapped into following an unlimited number of links. This can happen accidentally or intentionally. It can happen accidentally because of a design error in the Web Crawler program that makes it reread hyperlinks it has already accessed, or because a website inadvertently has an infinite number of dynamic pages, for example pages generated from calendar dates.
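A crawler can protect itself against both kinds of accident with a few simple limits. The sketch below is one possible approach, not a standard recipe: the class name `TrapGuard` and all of its limit values are assumptions chosen for illustration. It refuses URLs that were already accessed, and it caps the link depth, the URL length, and the number of pages taken from a single host, so that an endless calendar cannot keep the crawler busy forever.

```python
from collections import Counter
from urllib.parse import urlsplit


class TrapGuard:
    """Decides whether a URL may be visited (hypothetical helper)."""

    def __init__(self, max_depth=5, max_pages_per_host=500, max_url_length=200):
        # All three limits are illustrative defaults, not recommended values.
        self.max_depth = max_depth
        self.max_pages_per_host = max_pages_per_host
        self.max_url_length = max_url_length
        self.seen = set()                # URLs that were already allowed once
        self.pages_per_host = Counter()  # pages allowed so far, per host

    def allow(self, url, depth):
        """Return True if url, found depth links away from the start, is safe."""
        if url in self.seen:
            return False  # already accessed: avoids rereading hyperlinks
        if depth > self.max_depth:
            return False  # too far from the start page
        if len(url) > self.max_url_length:
            return False  # URLs that keep growing are a typical trap sign
        host = urlsplit(url).netloc
        if self.pages_per_host[host] >= self.max_pages_per_host:
            return False  # page budget for this website is used up
        self.seen.add(url)
        self.pages_per_host[host] += 1
        return True
```

With `max_pages_per_host=3`, for example, a site that produces a new page for every calendar date would only get three of its pages visited. The price is that a large but legitimate website is also cut off at the same limit, so the values have to be tuned to the needs of the application.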
It can happen deliberately if the website is designed to cripple Web Crawlers, for example by generating an infinite number of dynamic pages. Besides taking content for particular interests, a Web Crawler can also cause other losses for the website owner, among them increased resource usage such as bandwidth and server CPU, all the more so if two or more Web Crawlers access the same website. One solution is a standard called the robots.txt protocol, which lets the website administrator specify which parts of the website should not be accessed by Web Crawlers.
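From the crawler's side, honouring robots.txt can look roughly like the sketch below, which uses the `urllib.robotparser` module from the Python standard library. The robots.txt content, the `/private/` path, the user agent name `MyCrawler`, and the helper `is_allowed` are all made up for this example. Keep in mind that the protocol is voluntary: it only works for crawlers that choose to check it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical): every crawler is asked to
# stay out of the /private/ part of the website.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the lines of an already downloaded robots.txt file,
    # so this example works without any network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/data.html"):
        print(url, is_allowed(ROBOTS_TXT, "MyCrawler", url))
```

With these rules the first URL is permitted and the second one, under `/private/`, is refused. A well-behaved crawler would run such a check before every request and skip the pages that are not allowed.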