A Web crawler can be a simple application or a very complex one. It all depends on the purpose and functions of the crawler itself. For example, if it is used to look for e-mail addresses (an e-mail harvester), all the crawler needs to do is parse the content of a website and collect every string that matches the pattern of an e-mail address.
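As a rough illustration of the pattern-matching step, here is a minimal Python sketch. The regular expression is a deliberately simplified assumption (it catches common addresses, not every valid form), and the names `EMAIL_PATTERN`, `extract_emails` and `sample_html` are made up for this example.

```python
import re

# Simplified pattern (an assumption for this sketch): it matches common
# addresses such as name@host.tld, not every form a mail server accepts.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(page_text):
    """Return the unique e-mail-like strings in page_text, in order of first appearance."""
    seen = set()
    result = []
    for match in EMAIL_PATTERN.findall(page_text):
        address = match.lower()  # treat addresses as case-insensitive
        if address not in seen:
            seen.add(address)
            result.append(address)
    return result

# Hypothetical page content, used only to exercise the function.
sample_html = '<p>Contact <a href="mailto:info@example.com">info@example.com</a> or Sales@Example.org.</p>'
print(extract_emails(sample_html))  # ['info@example.com', 'sales@example.org']
```

The function works on raw page text, so it finds addresses both in visible text and inside attributes such as mailto links.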
This is still simple compared to a Web crawler that serves as a robot for a search engine. Website owners welcome this kind of crawler, because it gives their website a chance to be found by more people through search engines. The work done by a search engine's Web crawler is much more complex; the details of the algorithms and architecture of search engine giants like Google and Yahoo are not easily revealed, since they are business secrets. A Web crawler accesses a website's URLs via the Internet, and the crawling process can use multiple threads for better performance. Content such as text and metadata is saved to storage, while the links / URLs that are found are placed in a queue and scheduled for subsequent processing.
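The storage-plus-queue flow described above can be sketched in a few lines of Python. This is a single-threaded sketch under stated assumptions, not how any real search engine works: `FAKE_WEB` is a hypothetical in-memory set of pages that stands in for the network, and `LinkExtractor` and `crawl` are names invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    queue = deque([seed])  # URLs waiting to be processed
    seen = {seed}          # URLs already queued, to avoid repeats
    storage = {}           # url -> page content
    while queue and len(storage) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be fetched; skip it
        storage[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return storage

pages = crawl("http://example.com/", FAKE_WEB.get)
print(list(pages))
```

Passing the fetch function as a parameter keeps the loop independent of how pages are downloaded. A multi-threaded version would share the same queue and storage among several workers.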
In an implementation, the following points should be taken into account:
1. Size of the website.
The more files and directories a website has, the longer it takes to explore. At the same time, one of the selling points of a search engine is the size of its database. A study in 1999 showed that no search engine indexed more than 16% of the web, so a Web crawler prioritizes which web pages to download instead of taking the entire content of the web.
2. Changes to the website.
The number of pages can increase or decrease, and their content can change or be updated. For this reason, it is not enough for a Web crawler to visit a website just once to capture its actual condition. Imagine a search engine that returns many links that no longer exist or are out of date; this would certainly diminish the quality of the search results that users want. On the other hand, the Web crawler must also decide how often to re-visit, in proportion to how often the site changes, and this may differ depending on the characteristics of the website. For example, a website that publishes stock prices or currency rates requires more frequent visits.
3. Content processing.
The smarter a Web crawler is at processing content, the better the data it collects. For example, it can distinguish between text in meta tags and text that is part of HTML elements such as tables, fonts, bold, italic, and so on. Running the process in parallel can also optimize the work of the Web crawler, as in distributed web crawling, a technique that uses multiple computers.
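To make the content-processing point more concrete, here is a small Python sketch that separates meta tag content from emphasized (bold / italic) text and plain body text. The class name `ContentClassifier`, the set of emphasis tags and the sample page are all assumptions chosen for illustration; a real crawler would handle many more cases.

```python
from html.parser import HTMLParser

class ContentClassifier(HTMLParser):
    """Sort page text into meta-tag content, emphasized text, and plain text."""

    EMPHASIS_TAGS = {"b", "strong", "i", "em"}  # assumed set for this sketch

    def __init__(self):
        super().__init__()
        self.meta = {}          # meta name -> content
        self.emphasized = []    # text found inside bold / italic tags
        self.plain = []         # all other text
        self._emphasis_depth = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and "name" in attributes:
            self.meta[attributes["name"]] = attributes.get("content", "")
        elif tag in self.EMPHASIS_TAGS:
            self._emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag in self.EMPHASIS_TAGS and self._emphasis_depth > 0:
            self._emphasis_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self._emphasis_depth:
            self.emphasized.append(text)
        else:
            self.plain.append(text)

# Hypothetical page used only to exercise the classifier.
sample = (
    '<html><head><meta name="description" content="Stock prices"></head>'
    '<body><p>Today the index <b>rose</b> sharply.</p></body></html>'
)
classifier = ContentClassifier()
classifier.feed(sample)
classifier.close()
print(classifier.meta)
print(classifier.emphasized)
print(classifier.plain)
```

Keeping the three kinds of text apart lets a later stage weight them differently, for example by treating a meta description or bold text as more significant than ordinary body text.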
The download work of the Web crawler needs to be followed by further processing, i.e. indexing the data into a database so that it can be searched and processed more quickly. When someone searches for a keyword on a search engine, the index server is used. From there, the search engine generates snippets of the matching documents and displays them to the user.
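One common way to make keyword lookups fast is an inverted index, which maps each word to the documents that contain it. The sketch below is a toy version under stated assumptions: the tokenizer is a simple regular expression, the documents are made-up samples, and `build_index` and `search` are hypothetical names, not part of any search engine's actual design.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set(index.get(words[0], set()))
    for word in words[1:]:
        matches &= index.get(word, set())  # keep only documents with all words
    return sorted(matches)

# Made-up documents standing in for crawled pages.
documents = {
    "page1": "Web crawler for search engines",
    "page2": "Search engines index the web",
    "page3": "Stock prices and currency rates",
}
index = build_index(documents)
print(search(index, "web search"))  # ['page1', 'page2']
print(search(index, "currency"))    # ['page3']
```

The index is built once after crawling, so each query only needs a few set lookups instead of a scan over every stored page.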