Chapter 11. Information gathering

You need to notify the search engine about the new site. This can be done, for example, via a link from another site that is already in the database and visited by the robot, or with the submission tools offered by the search engines themselves.

In fact, by submitting a request to add the site to the database, you create the first external link to your resource, and the search engine's acquaintance with your site begins.

So, we have notified the search engine of the new site's existence. What will it do next?

Having reached the site, the robot walks through its pages and downloads the information they contain (on the site it can be presented as text, images, video, and other files). This process is called “crawling”. The robot also needs to decide when to visit the site again, so it builds a schedule that defines the time of the next visit, both to re-check the information it already has about the site and to add new pages to the database if they appear.

It is important to “meet” the search robot properly and make indexing the site as easy as possible: the robot's time is limited, and it has a great number of sites besides yours in its “visit schedule”. That is why we have to create the right conditions for it. We can do this as follows:

- First, make sure the site is available; this can be checked by typing the site's domain name into the address bar of the browser.

- Make sure there are no problems with site navigation; try not to use JavaScript and Flash, since menus built with these technologies are not recognized by robots reliably.

- Check the site for 404 errors.

- Do not send the robot to pages that are accessible only to registered users.

- Watch the nesting depth of your pages, since there are limits on how deep into a site the robot will go.

- Keep in mind that the maximum size of the text on a page that will be indexed is 256 kb; try not to exceed this limit.
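The availability and 404 checks above can be sketched as a short script. This is a minimal illustration using only the standard library; the URLs are hypothetical, and `broken_links` works on status codes that have already been fetched:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def check_url(url):
    """Return the HTTP status code for url, or None if unreachable."""
    try:
        return urlopen(url, timeout=10).status
    except HTTPError as e:
        return e.code          # e.g. 404 for a missing page
    except URLError:
        return None            # DNS failure, refused connection, etc.

def broken_links(statuses):
    """Given a mapping url -> status code, list the problem URLs."""
    return [u for u, s in statuses.items() if s is None or s >= 400]

# Example with already-fetched statuses (hypothetical addresses):
statuses = {"http://example.com/": 200, "http://example.com/old": 404}
print(broken_links(statuses))   # -> ['http://example.com/old']
```

Running `check_url` over a list of your pages and feeding the results to `broken_links` gives a quick report of pages the robot will fail to reach.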


The process of information search can be divided into the following stages: gathering information from site pages on the Internet, indexing the site, searching by query, and ranking the results. Let us consider each stage separately.

If a page has been added to the search engine's index, it can be shown in search results. If the page is absent from the index, the search engine knows nothing about it, and so it cannot use the information from that page.
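A quick way to see which of your pages are already in the index is the «site:» query operator, which the major search engines support (the domain below is hypothetical):

```
site:example.com          all indexed pages of the domain
site:example.com/news/    indexed pages within one section of the site
```

If a page does not appear in the results of such a query, the search engine has not indexed it yet.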

The majority of mid-sized sites (sites containing from several dozen to several hundred pages) usually have no problems with correct indexing by search engines. However, there are a number of points that should be taken into account while working on a site.

A search engine can learn about a newly created site in two ways:

- Manual submission of the site address through the corresponding form of the search engine. In this case you inform the search engine about the new site, and its address joins the indexing queue. It is enough to add only the main page of the site; the rest will be found by the search robot by following links;
- Let the search robot find your site on its own. If there is at least one external link to your new resource from other resources already indexed by the search engine, the search robot will visit and index your site fairly quickly. In most cases this is the recommended option: obtain a few external links to the site and simply wait for the robot to arrive. Manual submission of a site can even lengthen the wait for the robot.

The time spent indexing a site usually ranges from 2-3 days to 2 weeks, depending on the search engine. Google indexes sites faster than the other search engines.

Try to make the site friendly to search robots. For this purpose, consider the following factors:

- All pages of your site should be reachable by links from the main page in no more than 3 clicks. If the site structure does not permit this, make a so-called site map, which allows you to satisfy this rule.

- Do not repeat widespread errors. Session identifiers complicate indexing. If you use navigation through scripts, be sure to duplicate the links in the usual way as well, since search engines cannot read scripts (these and other errors are described in more detail in chapter 2.3);


- Remember that search engines index no more than 100-200 kb of text on a page; only the beginning of a longer page (the first 100-200 kb) will be indexed. This means you should avoid pages that exceed this size if you want them to be indexed in full.
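The site map mentioned in the first point can be an ordinary page of links, but there is also a machine-readable form, sitemap.xml, which the major search engines accept. A minimal sketch (the URLs are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2008-01-15</lastmod>
  </url>
  <url>
    <loc>http://example.com/articles/seo.html</loc>
  </url>
</urlset>
```

Each `<url>` entry lists one page; the optional `<lastmod>` date hints to the robot when the page last changed.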

The behavior of search robots can be controlled with the robots.txt file, which allows you to permit or forbid indexing of particular pages. There is also the special tag «noindex», which allows you to close separate parts of a page from indexing; however, this tag is supported only by the Russian search engines.
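A minimal robots.txt might look like this (the paths are hypothetical; each Disallow line closes one section of the site to all robots):

```
User-agent: *
Disallow: /admin/        # keep the robot out of service pages
Disallow: /members/      # pages visible only to registered users
Sitemap: http://example.com/sitemap.xml
```

The file must be placed at the root of the site (http://example.com/robots.txt), where robots look for it before crawling.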

Databases of search engines are updated constantly; records in the database can change, disappear, and appear again. As a result, the number of indexed pages of your site may change periodically.

One of the most frequent reasons for a page disappearing from the index is server unavailability, meaning the search robot could not get access to it. After the server is working again, the site should reappear in the index after some time.

The more external links your site has, the faster it is reindexed.

You can trace the indexing of a site by analyzing the server log files, in which all visits of search robots are recorded. In the corresponding section we will describe in detail the programs that allow you to do this.
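As a minimal sketch of such log analysis, the following filters an access log for lines produced by known search robots, identified by the robot names that appear in the user-agent field (the sample log lines are invented for illustration):

```python
import re

# User-agent substrings of some well-known search robots.
BOT_PATTERN = re.compile(r"Googlebot|YandexBot|bingbot", re.IGNORECASE)

def robot_hits(log_lines):
    """Return only the log lines produced by known search robots."""
    return [line for line in log_lines if BOT_PATTERN.search(line)]

# Two invented lines in Common Log Format: one robot visit, one human.
sample = [
    '66.249.66.1 - - [10/May/2024:06:25:24 +0000] "GET /index.html HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.7 - - [10/May/2024:06:26:02 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(len(robot_hits(sample)))  # -> 1
```

Counting such lines per day shows how often the robots visit and which pages they request.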