Robots.txt Notes

by ponsankari 2011-06-02 14:53:44


Directives are conventionally written in mixed case, and some crawlers may require it, so capitalize Allow: and Disallow:, and remember the hyphen in User-agent:
An asterisk (*) after User-agent: means all robots. If you include a section for a specific robot, that robot may not check the general all-robots section, so repeat the general directives in the specific section.
The user-agent name can be a substring, such as "Googlebot" (or "googleb"), "Slurp", and so on. Capitalization of the name itself should not matter.
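To illustrate the point about per-robot sections (the blocked paths here are made up for the example), a file with both a general section and a Googlebot-specific section might look like this; note that the general rules are repeated, since Googlebot may read only its own section:

```
# Applies to robots that do not match a more specific section
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/

# Googlebot uses only this section, so the general rules are repeated here
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /no-google/
```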
Disallow tells robots not to crawl anything that matches the given URL path.
Allow is a newer directive: older robot crawlers will not recognize it.
URL paths are often case sensitive, so be consistent with the site's capitalization.
The longest matching directive path (not counting wildcard expansion) should be the one applied to any given page URL.
In the original REP, directive paths start at the root of that web server host, generally with a leading slash (/). The path is treated as a right-truncated substring match, i.e. with an implied trailing wildcard.
One or more wildcard (*) characters can now appear in a URL path, but older robot crawlers may not recognize them.
Wildcards do not lengthen a path: if a directive path with a wildcard is shorter, as written, than one without, the spelled-out path will generally override the wildcard one.
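A sketch of how longest-match precedence plays out in practice (the paths are hypothetical, and older crawlers that lack Allow or wildcard support may behave differently):

```
User-agent: *
Allow: /archive/public/
Disallow: /archive/
# /archive/private/a.html -> blocked: longest match is "Disallow: /archive/"
# /archive/public/b.html  -> allowed: "Allow: /archive/public/" is longer

Disallow: /*.pdf
Allow: /downloads/free.pdf
# "/downloads/free.pdf" is longer as written than "/*.pdf",
# so this one file is generally allowed despite the wildcard block
```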
Sitemap is a newer directive giving the location of the Sitemap file.
A blank line indicates a new user-agent section.
A hash mark (#) indicates a comment.
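Putting the notes above together, a minimal complete file (with a placeholder host and made-up paths) might look like:

```
# Comments start with a hash mark
User-agent: *
Disallow: /private/

# A blank line above starts this new user-agent section
User-agent: Slurp
Disallow: /

Sitemap: https://www.example.com/sitemap.xml
```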

