Robots.txt

From i.STAR Help

The Robots.txt File

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
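As a rough illustration of the check described above, the following sketch uses Python's standard urllib.robotparser module to evaluate these two lines. The robot name "ExampleBot" is a made-up placeholder, and the rules are passed in directly rather than downloaded from a server.

```python
from urllib.robotparser import RobotFileParser

# The record from the example above, supplied as lines instead of being fetched.
rules = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(rules)

# "ExampleBot" is a hypothetical robot name used only for illustration.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/welcome.html")
print(allowed)
```

A well-behaved robot would skip the page whenever a check like this reports that fetching is not allowed.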

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.

More detailed info can be found on robotstxt.org, which also hosts a detailed Robots Database.

Google also publishes its own guidance on robots.txt.

Where is my Robots.txt File?

Current CAM build-outs provide a robots.txt file in the root directory of your website, with a link to your website's sitemap index to assist search engine crawlers in indexing your site.

FAQs on Robots.txt File

Q: What is the purpose of a robots.txt file?

To indicate which parts of the site should not be crawled by search engines. The robots.txt protocol gives web servers a way to tell search engine crawlers which parts of the server they should not access. You can also use a robots.txt directive to exclude specific spiders from your site, provided the spider obeys the rules in that file. Because the file is public and compliance is voluntary, it is not a way to protect sensitive or confidential information.

Q: How to create a /robots.txt file?

You can use anything that produces a text file.

  • On Microsoft Windows, use notepad.exe, or wordpad.exe (Save as Text Document), or even Microsoft Word (Save as Plain Text)
  • On the Macintosh, use TextEdit (Format->Make Plain Text, then Save as Western)

Q: Where to put it?

The short answer: in the top-level (root) directory of your web server.

The longer answer: When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash), and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
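The path replacement described above can be sketched in a few lines of Python using the standard urllib.parse module. The function name robots_url is a hypothetical helper chosen for this illustration, not part of any robot's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the /robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/index.html"))
```

Because only the scheme and host are kept, every page on the same host maps to the same robots.txt URL.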

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

Q: What to put in it?

The "/robots.txt" file is a text file, with one or more records. It usually contains a single record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.
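To see how such a record is applied, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the test paths are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The single record shown above.
record = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
parser.parse(record.splitlines())


def is_allowed(path):
    # "ExampleBot" is a hypothetical robot name.
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


paths = ["/index.html", "/cgi-bin/search", "/tmp/cache.html", "/~joe/", "/tmpfile.html"]
results = {path: is_allowed(path) for path in paths}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "excluded")
```

Each Disallow value is matched as a URL prefix, so "/tmpfile.html" stays allowed even though it begins with the same letters as "/tmp/".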

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
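The effect of putting two prefixes on one line depends on the robot. As one data point, this sketch shows how Python's standard urllib.robotparser module handles it: it treats the whole value as a single prefix, so neither directory ends up excluded. Other parsers may behave differently, and "ExampleBot" is a made-up robot name.

```python
from urllib.robotparser import RobotFileParser

# Incorrect record: two prefixes squeezed onto one Disallow line.
bad_record = ["User-agent: *", "Disallow: /cgi-bin/ /tmp/"]

parser = RobotFileParser()
parser.parse(bad_record)

# "ExampleBot" is a hypothetical robot name.
cgi_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search")
tmp_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/tmp/cache.html")
print(cgi_allowed, tmp_allowed)
```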

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines of the original standard. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you should not rely on lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif". Some crawlers accept wildcard extensions in Disallow lines, but not every robot understands them.

The Robots <META> tag

You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

For example:

<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>

There are two important considerations when using the robots <META> tag:

  • robots can ignore your <META> tag. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.

Don't confuse this NOFOLLOW with the rel="nofollow" link attribute.

FAQs on Robots Meta Tag

Q: Where to put it?

Like any <META> tag, it should be placed in the HEAD section of an HTML page, as in the example above. You should put it in every page on your site, because a robot can encounter a deep link to any page on your site.

Q: What to put into it?

The "NAME" attribute must be "ROBOTS".

Valid values for the "CONTENT" attribute are: "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
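To illustrate how a robot might read these values, here is a minimal sketch using Python's standard html.parser module. The class RobotsMetaParser and the function robots_policy are hypothetical names invented for this example. Real robots may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the CONTENT values of every robots <META> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {v.strip().upper() for v in content.split(",") if v.strip()}


def robots_policy(html):
    """Return whether the page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # With no tag present, the default is INDEX, FOLLOW.
    return {
        "index": "NOINDEX" not in parser.directives,
        "follow": "NOFOLLOW" not in parser.directives,
    }


page = '<html><head><title>...</title><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></head>'
print(robots_policy(page))
```

Treating a missing tag as "INDEX,FOLLOW" mirrors the default described above.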

A complete list of FAQs about Web Robots can be found on robotstxt.org.

