Everything you need to know about web crawling for your business

Data Mining   |   Published February 8, 2019

The darkest corners of the Internet harbor a lot of spiders invisible to the human eye. Yet they “crawl” across the web, leaving their trails behind with a specific purpose: to collect information or to understand a website’s structure and usefulness. Sophisticated search engines, such as Google and AltaVista, are built on spiders that automatically retrieve data from the web and pass it to other software, which then indexes the content of the source website to build the best possible set of search terms.

A web scraper is an agent that acts like a web spider but is more interesting from a legal point of view. A scraper, or data scraping tool, is a kind of spider designed to work with specific Internet content, for example, data on the cost of products or services. One use of scraper agents is so-called competitive pricing: identifying the prices currently on the market for a particular category of goods in order to set appropriate prices for your own products. A scraper can also combine data from several sources on the Internet and present this summary information to the user.
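To make the competitive-pricing idea concrete, here is a minimal sketch of a price scraper built on Python’s standard html.parser module. The HTML snippet, the "price" class name, and the product names are all hypothetical; a real scraper would fetch live pages over HTTP and adapt to each site’s markup.

    from html.parser import HTMLParser

    # A hypothetical product listing; a real scraper would fetch this over HTTP.
    PAGE = """
    <ul>
      <li class="product"><span class="name">Widget</span> <span class="price">$19.99</span></li>
      <li class="product"><span class="name">Gadget</span> <span class="price">$24.50</span></li>
    </ul>
    """

    class PriceScraper(HTMLParser):
        """Collect the text of every element whose class attribute is "price"."""
        def __init__(self):
            super().__init__()
            self._in_price = False
            self.prices = []

        def handle_starttag(self, tag, attrs):
            if dict(attrs).get("class") == "price":
                self._in_price = True

        def handle_data(self, data):
            if self._in_price:
                self.prices.append(data.strip())
                self._in_price = False

    scraper = PriceScraper()
    scraper.feed(PAGE)
    print(scraper.prices)   # ['$19.99', '$24.50']

Real-world scrapers typically rely on dedicated parsing libraries and per-site extraction rules, but the principle of locating a known element and collecting its text stays the same.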

Spider eyes and legs

The primary visual and motor organ of a web spider is HTTP, the message-oriented protocol through which a client connects to a server and sends requests. In response, the server generates a response. Each request and response consists of a header and a body: the header carries status information and a description of the body’s content.

HTTP supports three basic types of requests (illustrated in the sketch after the list):

  • A HEAD request asks for metadata about a resource on the server (the response headers), without the resource itself.
  • A GET request asks for the resource itself, for example, a file or an image.
  • A POST request lets the client send data to the server, usually through a web form.
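The three request types are easy to try from a few lines of Python. The sketch below uses the third-party requests library and a placeholder URL; any web server you are allowed to query would do, and the form field in the POST is invented for the example.

    import requests

    URL = "https://example.com/"      # placeholder; substitute a server you may query

    # HEAD: metadata only; the body is not transferred
    head = requests.head(URL, timeout=10)
    print(head.status_code, head.headers.get("Content-Type"))

    # GET: the resource itself
    page = requests.get(URL, timeout=10)
    print(len(page.text), "characters of HTML")

    # POST: submit form-style data to the server (the form field is hypothetical)
    result = requests.post(URL, data={"query": "web crawling"}, timeout=10)
    print(result.status_code)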

Biological analogies

A real spider does not exist in isolation; it interacts with its environment. It sees and feels the situation around itself and moves from one point to another in pursuit of a specific goal. Web spiders do the same. A web spider is a program written in a high-level language that interacts with its environment through network protocols, such as HTTP. If your spider needs to contact you, it can send an email using the SMTP protocol.
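As a sketch of the SMTP idea, the snippet below sends a short status report when a crawl finishes. It uses Python’s standard smtplib and email modules; the addresses, the message text, and the local mail relay are assumptions for the example, not part of the original article.

    import smtplib
    from email.message import EmailMessage

    # Hypothetical addresses and report text
    msg = EmailMessage()
    msg["Subject"] = "Crawler report"
    msg["From"] = "spider@example.com"
    msg["To"] = "owner@example.com"
    msg.set_content("Crawl finished: 120 pages fetched.")

    # Assumes an SMTP relay is listening on localhost
    with smtplib.SMTP("localhost") as server:
        server.send_message(msg)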

However, the capabilities of web spiders are not limited to the HTTP or SMTP protocols. Some spiders can use web services technologies, such as SOAP or the XML-RPC protocol. Other spiders browse web-based newsgroups using the NNTP protocol or look for interesting information in RSS news feeds. Unlike real spiders, most of which can distinguish only changes in light intensity and detect only moving objects, web spiders “see” and “feel” through several kinds of protocols.
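Reading an RSS feed, for instance, needs nothing beyond the standard library. The sketch below downloads a feed and prints each item’s title and link; the feed URL is a placeholder, and real feeds may add namespaces or extra fields.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder URL; substitute any RSS 2.0 feed
    FEED_URL = "https://example.com/feed.xml"

    with urllib.request.urlopen(FEED_URL) as response:
        tree = ET.parse(response)

    # In RSS 2.0 each entry is an <item> element under <channel>
    for item in tree.iterfind("./channel/item"):
        title = item.findtext("title", default="(no title)")
        link = item.findtext("link", default="")
        print(f"{title} -> {link}")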

Spider and scraper agents: Scope of applicability

Web spider and web scraping tools are useful applications, and they are applied quite widely, both with good and with malicious intent. Let’s consider some areas of their application.

Search engine web crawlers

Web spiders make searching the Internet easier and more effective. A search engine uses many web spiders (crawlers) that roam the Internet, extracting data from websites and indexing it. Once this stage is complete, the search engine can quickly scan its local index to identify the most suitable results for the query you specify. The Google search engine additionally uses the PageRank algorithm, which ranks each page found by the number of other pages that link to it. This implements a voting mechanism in which the pages with the most votes receive the highest ranking in the search results.
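The voting mechanism can be illustrated with a simplified, textbook version of PageRank over a toy link graph. This is not Google’s production algorithm, and the pages and links are invented for the example.

    # A toy link graph: page -> set of pages it links to (hypothetical pages).
    graph = {
        "a": {"b", "c"},
        "b": {"c"},
        "c": {"a"},
        "d": {"c"},
    }

    def pagerank(graph, damping=0.85, iterations=20):
        pages = list(graph)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in graph.items():
                if not outlinks:          # dangling page: spread its rank evenly
                    share = damping * rank[page] / len(pages)
                    for p in pages:
                        new_rank[p] += share
                    continue
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    print(sorted(pagerank(graph).items(), key=lambda kv: -kv[1]))

Page "c" collects the most incoming links in this toy graph, so it ends up with the highest rank, exactly the voting effect described above.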

Such a search of the Internet can be very costly, both in the bandwidth required to transmit web content to the indexer and in the computational cost of indexing the results. The method also requires a large amount of storage, although today that is no longer a big problem; for example, Google offers 1000 MB of storage to each user of its Gmail mail service.

Web spiders minimize the load they place on the Internet by following a set of policies. To grasp the scale of the problem, consider that Google indexes over 8 billion web pages. Behavior policies determine which pages the crawler should pass to the indexer, how often the crawler should revisit a website to recheck it, and the so-called politeness policy. Web servers can restrict crawlers with the standard robots.txt file, which tells them what may and may not be viewed on that server.
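Honoring robots.txt is straightforward with Python’s standard urllib.robotparser module. A minimal sketch follows; the site, the user-agent string, and the path are hypothetical.

    from urllib import robotparser

    # Hypothetical site; any server exposing /robots.txt works the same way.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    # Check whether our crawler (user agent "my-crawler") may fetch a page
    # before requesting it, as the politeness policy requires.
    if rp.can_fetch("my-crawler", "https://example.com/products/list.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")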

Corporate web crawlers

Like a standard search engine spider, a corporate web spider indexes content that is inaccessible to ordinary visitors; for example, companies typically maintain internal websites for their employees. In this case the spider’s scope is limited to the local environment. Restricting the search area usually leaves more computing power to spare, which allows you to create specialized and more complete indices. Google took another step in this direction by providing a search tool for indexing the content of a user’s personal computer.
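Keeping a corporate crawler inside the local environment usually comes down to a scope check on every discovered link. Here is a minimal sketch, assuming a hypothetical intranet host; the root URL and the sample links are invented for the example.

    from urllib.parse import urljoin, urlparse

    # Hypothetical intranet root; the crawl is confined to this host.
    ROOT = "https://intranet.example.com/"
    ALLOWED_HOST = urlparse(ROOT).netloc

    def in_scope(base_url, href):
        """Return True if a discovered link stays inside the corporate site."""
        absolute = urljoin(base_url, href)
        return urlparse(absolute).netloc == ALLOWED_HOST

    print(in_scope(ROOT, "/hr/policies.html"))         # True: same host
    print(in_scope(ROOT, "https://www.example.org/"))  # False: external link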

Specialized web crawlers

There are also a number of alternative applications for crawlers, for example, archiving content or generating statistical information. An archive crawler scans a website and identifies local content to be stored for long periods. Such a crawler can be used for backups or, more generally, for taking snapshots of specific content on the Internet. A statistical crawler can be useful for understanding specific Internet content or for detecting its absence. Crawlers can determine how many web servers are currently running, how many web servers of a particular type are running, and even the number of dead links (i.e., links that return an HTTP 404 Not Found error).
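A statistical crawler of this kind can be sketched in a few lines: issue a HEAD request to each host and tally the Server response header. The host list below is a tiny hypothetical sample; a real survey would draw on a far larger set of addresses.

    import urllib.request
    import urllib.error
    from collections import Counter

    # A tiny hypothetical sample; a real survey would cover far more hosts.
    hosts = ["https://example.com/", "https://example.org/", "https://example.net/"]

    server_types = Counter()
    for url in hosts:
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                # The Server response header usually names the web server software
                server_types[response.headers.get("Server", "unknown")] += 1
        except urllib.error.HTTPError as err:
            # Even an error response identifies the server that produced it
            server_types[err.headers.get("Server", "unknown")] += 1
        except OSError:
            server_types["unreachable"] += 1

    print(server_types)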

Another useful type of specialized crawler is one that checks websites. Crawlers of this type look for missing content, verify all the links, and ensure the correctness of the HTML code.
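A website-checking crawler can flag dead links with the same HEAD technique, as in the sketch below. The link list is hypothetical; a real checker would harvest it from the pages it has already parsed.

    import urllib.request
    import urllib.error

    # Hypothetical list of links harvested from a page being verified.
    links = [
        "https://example.com/",
        "https://example.com/missing-page",
    ]

    for url in links:
        request = urllib.request.Request(url, method="HEAD")  # HEAD avoids downloading the body
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                print(url, response.status)
        except urllib.error.HTTPError as err:
            # A 404 here marks the link as dead
            print(url, "BROKEN:", err.code)
        except urllib.error.URLError as err:
            print(url, "UNREACHABLE:", err.reason)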

Web crawlers for email analysis

We now turn to the negative aspects. Unfortunately, a few black sheep can significantly complicate the use of the Internet for many respectable users. Web crawlers for email analysis look for email addresses on websites, which are then used to send out the vast amounts of spam we encounter every day.

Legal aspects

Several lawsuits have been filed over the use of web spiders to analyze information on the Internet, and all of them were accepted for consideration. Recently, Farechase, Inc. answered a lawsuit filed by American Airlines over its use of real-time scraper agents. Initially the lawsuit claimed that the data collection violated American Airlines’ agreements with its customers (set out in its Terms and Conditions). After that claim was rejected, American Airlines filed a claim for harm, which was granted. Other cases have claimed that the bandwidth consumed by spider and scraper agents worsens working conditions for other users. All such claims are well founded, which is why compliance with the rules of politeness is essential.

What’s next

Using web spiders and web crawler software to analyze information on the Internet can be a fascinating, and for some a very profitable, occupation. However, as mentioned above, this type of activity has legal implications. When using web spiders, always obey the instructions in the robots.txt file on the web server you are analyzing, and make this requirement part of your politeness policy. More advanced technologies, such as SOAP, greatly simplify the work of web spiders and reduce their negative impact on ordinary Internet operations. Promising developments, such as the Semantic Web, will make the work of spiders even more manageable, so the number of relevant solutions and methods will continue to grow.