
Robots.txt in Detail

Before indexing your site, a search engine always looks in the root of your domain for a file named "robots.txt" (http://www.mydomain.com/robots.txt). This file tells robots (indexing spiders) which files they may index and which they may not.


The robots.txt file has a special format. It consists of records. Each record consists of two fields: a line with the name of the client application (User-agent), and one or more lines beginning with the Disallow directive:


          

<Field> ":" <value>


Robots.txt should be created in Unix text format. Most good text editors can already convert Windows line endings to Unix ones, or your FTP client should be able to do it. Do not try to edit the file with an HTML editor, especially one that has no plain-text code view.
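If neither your editor nor your FTP client converts line endings, the conversion is easy to do yourself. The following is a minimal sketch in Python; the function name `to_unix_newlines` is our own choice for illustration.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert DOS/Windows (CRLF) and bare CR line endings to Unix (LF)."""
    # Replace CRLF first so that a CRLF pair does not become two line feeds.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


dos_text = b"User-agent: *\r\nDisallow: /cgi-bin/\r\n"
print(to_unix_newlines(dos_text))
```

The sketch works on bytes rather than text so that nothing else in the file is re-encoded along the way.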



The User-agent field


The User-agent line contains the name of the robot. For example:



User-agent: googlebot


To address all robots, you can use the wildcard character "*":



User-agent: *


You can find the names of robots in the logs of your web server. To do so, select only the requests for the robots.txt file. Most search engines give their indexing spiders short names.
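As an illustration, the log filtering described above might be sketched as follows. This assumes the widely used "combined" access log format, in which the request is the first quoted field and the user agent is the last quoted field; your server's log format may differ, and the sample lines are invented.

```python
import re
from collections import Counter

# Assumption: "combined" log format (request first in quotes, user agent last).
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+)[^"]*".*"(?P<agent>[^"]*)"\s*$')


def robots_txt_agents(log_lines):
    """Count the user-agent strings that requested /robots.txt."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


sample = [
    '1.2.3.4 - - [01/Jan/2005:00:00:01 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1"',
    '1.2.3.5 - - [01/Jan/2005:00:00:02 +0000] "GET /index.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]
print(robots_txt_agents(sample))
```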



The Disallow field


The second part of a record consists of Disallow lines. These lines are directives for the given robot. They tell the robot which files and/or directories it is not allowed to index. For example, the following directive forbids spiders from indexing the file email.htm:



Disallow: email.htm


A directive can also contain the name of a directory:



Disallow: /cgi-bin/


This directive forbids indexing spiders from entering the "cgi-bin" directory.


Disallow directives can also act as wildcards, because a value is matched as a prefix. The standard dictates that the directive /bob forbids spiders from indexing both /bob.html and /bob/index.html.


If a Disallow directive is empty, the robot may index ALL files. At least one Disallow directive must be present for each User-agent field for the robots.txt to be considered valid. A completely empty robots.txt means the same as having no robots.txt at all.
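The prefix rule and the empty-value rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the code of any real spider; the function name `is_disallowed` is hypothetical.

```python
def is_disallowed(path, disallow_values):
    """Return True if `path` starts with any non-empty Disallow value."""
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(value and path.startswith(value) for value in disallow_values)


print(is_disallowed("/bob.html", ["/bob"]))        # blocked by the prefix /bob
print(is_disallowed("/bob/index.html", ["/bob"]))  # blocked by the same prefix
print(is_disallowed("/alice.html", [""]))          # empty value: nothing is blocked
```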



White space and comments


Any line in robots.txt that begins with # is considered a comment. The standard allows comments at the end of directive lines, but this is considered bad style:



Disallow: bob #comment


Some spiders cannot parse such a line correctly and will instead read it as a ban on indexing the resource bob#comment. The moral is that comments belong on a separate line.


White space at the beginning of a line is allowed, but not recommended.



    Disallow: bob #comment
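To show how a tolerant robot might cope with both trailing comments and leading white space, here is a minimal sketch of a line parser in Python. It assumes "#" as the comment character, as in the robots exclusion standard; the function name `parse_line` is our own.

```python
def parse_line(line):
    """Split one robots.txt line into (field, value), or return None.

    Everything from "#" onward is treated as a comment, and white space
    around the field and the value is ignored.
    """
    line = line.split("#", 1)[0].strip()
    if not line or ":" not in line:
        return None
    field, value = line.split(":", 1)
    return field.strip().lower(), value.strip()


print(parse_line("    Disallow: bob #comment"))
print(parse_line("# a comment on its own line"))
```

A robot that skips the comment-stripping step is exactly the kind that would read the value as "bob #comment".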



Examples


The following directive allows all robots to index every resource on the site, since the wildcard character "*" is used.



User-agent: *

Disallow:


This directive forbids all robots from doing so:



User-agent: *

Disallow: /


This directive forbids all robots from entering the "cgi-bin" and "images" directories:



User-agent: *

Disallow: /cgi-bin/

Disallow: /images/


This directive forbids the robot Roverdog from indexing any file on the server:



User-agent: Roverdog

Disallow: /


This directive forbids the robot googlebot from indexing the file cheese.htm:



User-agent: googlebot

Disallow: cheese.htm


If you are interested in more complex examples, try retrieving the robots.txt file of any large site, for example CNN or Looksmart.
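Examples like these can be tried out without a live spider. The sketch below uses `urllib.robotparser` from the Python standard library; it is one parser's interpretation of the rules, not a reference implementation, and the helper name `allowed` and the test URLs are our own.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """Check a URL against robots.txt text using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
"""
print(allowed(rules, "Roverdog", "http://www.mydomain.com/cgi-bin/form.cgi"))
print(allowed(rules, "Roverdog", "http://www.mydomain.com/index.html"))
```

With these rules the first URL is refused and the second is permitted.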


Additions to the standard


Although there have been proposals to extend the standard by introducing an Allow directive or taking the robot's version into account, these proposals have not been formally approved.

In search of robots.txt files


While testing our robots.txt validator (see the end of the article), we needed a great deal of "fodder" for it. We built a spider that downloaded only the robots.txt file from each site it found. We went through all the links and domains listed in the Open Directory Project. In this way we covered 2.4 million URLs and dug up approximately 75 kilobytes of robots.txt files.


During this hunt we discovered a huge number of problems with robots.txt files. We found that 5% of robots.txt files were written in bad style, and 2% of files were written so badly that no robot could understand them. Here are some of the problems we found:


Inverted syntax


One of the most common mistakes is inverted syntax:



User-agent: *

Disallow: scooter


It should be:



User-agent: scooter

Disallow: *


Several Disallow directives on one line:


Many sites put several directives on one line:



Disallow: /css/ /cgi-bin/ /images/


Different spiders will interpret this directive differently. Some will ignore the spaces and read the directive as a ban on indexing the directory /css//cgi-bin//images/. Others will take only one directory (/images/ or /css/) and ignore everything else.


The correct syntax is:



Disallow: /css/

Disallow: /cgi-bin/

Disallow: /images/


Line endings in DOS format:


Another common mistake is editing the robots.txt file in DOS format. Although this mistake is so widespread that many indexing spiders have learned to cope with it, we still consider it a mistake. Always edit your robots.txt in UNIX mode and upload the file to the site in ASCII mode. Many FTP clients can convert line endings from DOS format to UNIX format when uploading in text mode, but some do not.


Comments at the end of a line:


According to the standard, this is correct:



Disallow: /cgi-bin/ #this bans robots from our cgi-bin


But in the recent past there were robots that swallowed the whole line as the directive. We know of no such robots today, but is the risk worth it? Place comments on a separate line.


White space at the beginning of a line:



    Disallow: /cgi-bin/


The standard says nothing about leading white space, but it is considered bad style. And besides, why take the risk?


Redirect to another page on a 404 error:


It is quite common for a web server to return a special page to the client on a 404 (File Not Found) error. In doing so, the web server sends the client no error code and does not even perform a redirect. In this case the robot does not realize that the robots.txt file is missing; instead it receives an HTML page with some message. This should not cause any problems, of course, but is it worth the risk? God only knows how a robot will parse that HTML file after mistaking it for robots.txt. To prevent this, place at least an empty robots.txt in the root of your web server.
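A cautious robot could guard against this situation with a simple sanity check on the response. The following Python sketch is a heuristic of our own, not part of any standard; the function name `looks_like_robots_txt` and the specific checks are assumptions.

```python
def looks_like_robots_txt(status, content_type, body):
    """Heuristic check that an HTTP response is a plain-text robots.txt."""
    if status != 200:
        return False
    if not content_type.lower().startswith("text/plain"):
        return False
    # A custom error page served with status 200 would start with HTML markup.
    return not body.lstrip().lower().startswith(("<!doctype", "<html"))


print(looks_like_robots_txt(200, "text/plain", "User-agent: *\nDisallow:\n"))
print(looks_like_robots_txt(200, "text/html", "<html><body>Page not found</body></html>"))
```

The same check is useful from the site owner's side: if it fails for your own /robots.txt URL, the server is probably returning an error page instead of the file.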


Conflicting directives:


What would you do in the place of the robot slurp on seeing these directives?



User-agent: *

Disallow: /

#

User-agent: slurp

Disallow:


The first directive forbids all robots from indexing the site, but the second one allows the robot slurp to do so. So what should slurp do? We cannot guarantee that all robots will interpret these directives correctly. In this example slurp should index the whole site, and all other robots should turn back at the doorstep.
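One way to see how a particular parser resolves such a conflict is to feed it the record and ask. The sketch below does this with `urllib.robotparser` from the Python standard library; it shows the behaviour of that one parser only, and the name "otherbot" and the page URL are invented for the test.

```python
from urllib.robotparser import RobotFileParser

conflict = [
    "User-agent: *",
    "Disallow: /",
    "#",
    "User-agent: slurp",
    "Disallow:",
]

parser = RobotFileParser()
parser.parse(conflict)

# The record that names the robot takes precedence over the "*" record.
slurp_ok = parser.can_fetch("slurp", "http://www.mydomain.com/page.html")
other_ok = parser.can_fetch("otherbot", "http://www.mydomain.com/page.html")
print(slurp_ok, other_ok)  # True False
```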


All upper case is bad style:



USER-AGENT: EXCITE

DISALLOW:


Although the standard is indifferent to letter case in robots.txt, case still matters in directory and file names. It is best to follow the examples and capitalize only the first letters of the words User and Disallow.


Listing all files


Another mistake is listing every file in a directory:



Disallow: /AL/Alabama.html

Disallow: /AL/AR.html

Disallow: /Az/AZ.html

Disallow: /Az/bali.html

Disallow: /Az/bed-breakfast.html


The example above can be replaced with:



Disallow: /AL

Disallow: /Az


Remember that the leading slash indicates that a directory is meant. Of course, nothing stops you from listing a couple of files, but we are talking about style here. This example was taken from a robots.txt file that was over 400 kilobytes in size and mentioned 4,000 files! It would be interesting to know how many spiders, on seeing this file, decided never to visit the site again.
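If you have inherited such a file, the long list can be collapsed mechanically. Here is a minimal Python sketch, assuming that everything under each top-level directory really should be blocked; the function name `collapse_to_directories` is hypothetical.

```python
def collapse_to_directories(paths):
    """Reduce a list of file paths to the sorted set of top-level directories."""
    prefixes = set()
    for path in paths:
        parts = path.split("/")
        # "/AL/Alabama.html" splits into ["", "AL", "Alabama.html"].
        if len(parts) > 2:
            prefixes.add("/" + parts[1])
        else:
            # A file in the site root has no directory to collapse into.
            prefixes.add(path)
    return sorted(prefixes)


listed = [
    "/AL/Alabama.html",
    "/AL/AR.html",
    "/Az/AZ.html",
    "/Az/bali.html",
    "/Az/bed-breakfast.html",
]
for prefix in collapse_to_directories(listed):
    print("Disallow: " + prefix)
```

Check the result by hand before publishing it, since a prefix blocks every file beneath it, including files that were not in the original list.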


There is only a Disallow directive!


There is no Allow directive, only Disallow. This example is incorrect:



User-agent: Spot

Disallow: /john/

allow: /jane/


This is correct:



User-agent: Spot

Disallow: /john/

Disallow:


Missing leading slash:


What should a spider do with the following directive?



User-agent: Spot

Disallow: john


According to the standard, this directive forbids indexing of the file "john" and the directory "john". But to be safe, it is best to use a slash so that the robot can tell a file from a directory.


We also saw people put keywords for their site into the robots.txt file (one can only wonder why).


There were also robots.txt files that had been created as HTML documents. Remember, FrontPage is not the tool for making robots.txt.


Incorrectly configured server


Why would a web server suddenly return a binary file in response to a request for robots.txt? This happens when your web server is configured incorrectly, or when you have uploaded the file to the server incorrectly.


Always check your robots.txt file after uploading it to the server. Simply type this request into a browser:



http://www.mydomain.com/robots.txt


That is all it takes to check.


Google specifics:


Google is the first search engine to support regular expressions in directives. This makes it possible to forbid indexing of files by their extension.



User-agent: googlebot

Disallow: *.cgi


In the user-agent field you should use the name "googlebot". Do not risk giving such a directive to other spiders.
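To get a feel for what such a pattern matches, it can be translated into an ordinary regular expression. The Python sketch below assumes that "*" stands for any sequence of characters and that the pattern is matched from the start of the path; it is an illustration only, not Google's actual matching code, and the function name `wildcard_to_regex` is our own.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a Disallow value containing "*" into a compiled regex."""
    # Escape every literal piece, then let each "*" match any characters.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


rule = wildcard_to_regex("*.cgi")
print(bool(rule.match("/cgi-bin/search.cgi")))  # True
print(bool(rule.match("/index.html")))          # False
```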



The robots META tag


The robots META tag is used to allow or forbid robots visiting the site to index a given page. In addition, the tag is intended to invite robots to go through all pages of the site and index them. This tag is becoming increasingly important.


This tag can also be used by those who cannot get access to the root of the server to change the robots.txt file.


Some search engines, such as Inktomi, fully understand the robots meta tag. Inktomi will go through all pages of a site if the value of the tag is "index, follow".


Format of the robots meta tag


The robots meta tag is placed in the HEAD section of the HTML document. The format is quite simple (the letter case of the values does not matter):



<HTML>

<HEAD>

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

<META NAME="DESCRIPTION" CONTENT="This page ….">

<TITLE>...</TITLE>

</HEAD>

<BODY>


Values of the robots meta tag


This meta tag can take four values. The content attribute can contain the following values:



index, noindex, follow, nofollow


If there are several values, they are separated by commas.


At present only the following values matter:


The INDEX directive tells the robot that the given page may be indexed.


The FOLLOW directive tells the robot that it may follow the links on the given page. Some authors claim that when these values are absent, search engines act by default as if they had been given the INDEX and FOLLOW directives. Unfortunately, this is not the case for the Inktomi search engine. For Inktomi the default values are "index, nofollow".


So the global directives look like this:


Index everything = INDEX, FOLLOW


Index nothing = NOINDEX, NOFOLLOW


Examples of the robots meta tag:



<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
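The way a robot might read the CONTENT value can be sketched in Python as follows. This is an illustration under our own assumptions: the function name `parse_robots_meta` is hypothetical, and the defaults are parameters because, as noted above, search engines disagree about them.

```python
def parse_robots_meta(content, default_index=True, default_follow=True):
    """Parse a robots meta CONTENT value into index/follow flags.

    Values are case-insensitive and comma-separated; unknown values
    are ignored.
    """
    flags = {"index": default_index, "follow": default_follow}
    for token in content.lower().split(","):
        token = token.strip()
        if token in ("index", "follow"):
            flags[token] = True
        elif token in ("noindex", "nofollow"):
            # Strip the leading "no" to get the name of the flag.
            flags[token[2:]] = False
    return flags


print(parse_robots_meta("NOINDEX, FOLLOW"))
# A robot with Inktomi-style defaults, given a page with no values at all:
print(parse_robots_meta("", default_follow=False))
```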