Chapter 09. Indexing

Pages are indexed by special programs called robots. Every search engine runs many of these robots so that documents can be downloaded in parallel from different parts of the network; fetching documents one after another would simply be too slow. Imagine a constantly growing tree whose trunks keep sprouting new branches (the pages of sites). Clearly, newly appearing sites will be indexed much faster if robots sit on every branch of the tree. Technically, the downloading module is either multithreaded (AltaVista's Mercator) or built on asynchronous I/O (Googlebot). Developers also have to solve the problem of resolving domain names in many parallel streams, which usually means running a multi-threaded DNS resolver of their own.

In the multithreaded scheme the downloading threads are called worms, and their manager is the wormboy. Few servers can sustain the load of several hundred worms at once, so the manager makes sure that no single server is overloaded.
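The worms/wormboy idea can be sketched in Python (a toy model, not any engine's actual code; the URLs and the per-host limit are invented, and the fetch is stubbed out): a pool of worker threads plays the worms, while a per-host semaphore plays the wormboy, capping how many worms may hit one server at a time.

```python
import threading
import queue
from collections import defaultdict
from urllib.parse import urlparse

MAX_PER_HOST = 2  # "wormboy" rule: at most 2 concurrent worms per server (arbitrary)
results = []
results_lock = threading.Lock()
host_slots = defaultdict(lambda: threading.Semaphore(MAX_PER_HOST))

def fetch(url):
    # Stub: a real worm would issue an HTTP GET here.
    return f"<html>content of {url}</html>"

def worm(frontier):
    while True:
        try:
            url = frontier.get_nowait()
        except queue.Empty:
            return  # no more work: this worm exits
        host = urlparse(url).netloc
        with host_slots[host]:          # wait for a free slot on this host
            page = fetch(url)
        with results_lock:
            results.append((url, len(page)))

frontier = queue.Queue()
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.put(u)

threads = [threading.Thread(target=worm, args=(frontier,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # every queued URL downloaded exactly once
```

The semaphore is the essential part: throughput comes from many threads, while politeness toward any single server comes from the per-host cap.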

Robots download pages over HTTP. The robot sends a request such as `GET /path/document`, together with other useful header lines, to the server.
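What such a request actually looks like on the wire can be sketched as follows (a minimal HTTP/1.0 GET; the header names are standard, but the bot name, host, and path are made up for illustration):

```python
def build_request(host, path):
    """Assemble the raw HTTP request a robot sends for one document."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "User-Agent: ExampleBot/1.0 (+http://example.com/bot)",  # hypothetical bot name
        "Accept: text/html",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines)

print(build_request("www.example.com", "/path/document"))
```

A well-behaved robot identifies itself in the `User-Agent` line, which is also how the server (and robots.txt) can tell crawlers apart from browsers.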

The aim of downloading is to minimize network traffic while achieving the fullest possible coverage. Robots obey the file robots.txt, in which the webmaster can restrict which pages the robot may index.
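How a robot honors robots.txt can be shown with Python's standard `urllib.robotparser`; the rules and the bot name below are invented, and the file is fed to the parser directly rather than fetched from a real site:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt a webmaster might publish.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The robot consults the parsed rules before every download.
print(rp.can_fetch("ExampleBot", "http://example.com/index.html"))  # True
print(rp.can_fetch("ExampleBot", "http://example.com/private/a"))   # False
```

A polite crawler calls `can_fetch` before requesting each URL and simply skips anything the webmaster has disallowed.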

Some robots are reluctant to index dynamic pages, although modern crawlers now handle such pages without any problem, so ever fewer robots skip them. Each robot also keeps a list of resources classified as spam; these resources are visited by the bots much less often, or ignored entirely for a certain time.

At the same time, search engines do not filter information: they only estimate the relevance of pages. Information filtering can exist as a separate service in which users mark "undesirable" resources.

The downloading module contains other components as well. They perform auxiliary functions: reducing traffic, increasing the depth of the crawl, handling frequently updated resources, and storing URLs and links so that resources are not downloaded repeatedly.
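The "store URLs so that resources are not downloaded repeatedly" part boils down to a frontier queue plus a seen-set. A minimal sketch (the link graph below is invented; `crawl` is a hypothetical name):

```python
from collections import deque

# Toy link graph: URL -> links found on that page (all hypothetical).
links = {
    "http://site/a": ["http://site/b", "http://site/c"],
    "http://site/b": ["http://site/a", "http://site/c"],  # back-link: must not refetch /a
    "http://site/c": [],
}

def crawl(start):
    seen = {start}             # every URL ever scheduled; never downloaded twice
    frontier = deque([start])  # breadth-first queue of pages still to fetch
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)                     # "download" the page
        for out in links.get(url, []):
            if out not in seen:
                seen.add(out)
                frontier.append(out)
    return order

print(crawl("http://site/a"))  # each URL appears exactly once
```

Even though /a and /c are linked from several pages, the seen-set guarantees each is fetched once, which is exactly the traffic saving the text describes.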

There are also duplicate-tracking modules. They help eliminate pages with non-unique content: if the robot finds a duplicate of an already indexed page, or one with only slightly changed content, it simply does not follow that page's links any further.
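One common way to recognize "slightly changed" pages, shown here only as an illustration (the text does not say which method real engines use), is word shingling with Jaccard similarity: split each page into overlapping k-word windows and compare the resulting sets.

```python
def shingles(text, k=3):
    """Set of k-word windows; near-identical texts share most shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0

page      = "the quick brown fox jumps over the lazy dog near the river bank"
near_dup  = "the quick brown fox jumps over the lazy dog near the river shore"
unrelated = "completely different text about search engine index structure here"

sim_dup = jaccard(shingles(page), shingles(near_dup))
sim_other = jaccard(shingles(page), shingles(unrelated))
print(round(sim_dup, 2), round(sim_other, 2))
```

A duplicate tracker would compare a freshly downloaded page against stored fingerprints and, above some similarity threshold, treat it as a duplicate and stop following its links.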

After a page has been downloaded, it is processed by an HTML parser, which keeps only the information from the document that really matters for search: the text, fonts, links, and so on. What do all these modules mean for a concrete document? What does the search engine do with a document after downloading it?
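The kind of stripping an HTML parser performs can be sketched with Python's standard `html.parser` module; the sample page and the `PageExtractor` class are made up for the example:

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Keep only what matters for search: visible text and outgoing links."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # collect link targets for the crawl frontier
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():  # keep visible text, drop whitespace-only runs
            self.text_parts.append(data.strip())

html = ('<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p>'
        '<a href="/next">next page</a></body></html>')

p = PageExtractor()
p.feed(html)
print(" ".join(p.text_parts))
print(p.links)
```

The markup itself is discarded; what remains is the text to be indexed (a real parser would also note which text sat in headings or bold, since that affects ranking) plus the links that feed back into the downloading module.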

Placement into the index

Next, the words are split up according to morphological and language rules, though not every search engine has morphological processing in its arsenal. The page then goes into the index. Index updates in search engines happen with a certain periodicity (on average every two weeks to one month): new pages accumulate and are periodically merged into the main index.
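Placement into the index can be sketched as building an inverted index, with a crude suffix-stripping "stemmer" standing in for real morphological processing (the documents, the stemming rules, and all function names here are invented for illustration):

```python
from collections import defaultdict

def stem(word):
    # Crude stand-in for morphological processing: strip two common
    # English suffixes so that "robots" and "robot" share one index entry.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_index(docs):
    index = defaultdict(set)  # stem -> set of document ids containing it
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

docs = {
    1: "robots downloading pages",
    2: "the robot downloads a page",
}
index = build_index(docs)
print(sorted(index["robot"]))  # both documents, despite different word forms
```

This is why morphology matters: without stemming, a query for "robot" would miss document 1 entirely. The periodic index update the text mentions corresponds to merging such freshly built batches into the main index.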