Robots.txt

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. By writing a structured text file you can tell some or all robots that certain parts of your server should not be crawled, accessed, or cached. Keep the following in mind:

  • Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
  • The /robots.txt file is publicly available. Anyone can see which sections of your server you don’t want robots to use.
  • For both reasons, /robots.txt should not be used to hide information.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place.

For example, for “http://www.example.com/shop/index.html”, it will remove the “/shop/index.html”, replace it with “/robots.txt”, and end up with “http://www.example.com/robots.txt”.
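The lookup described above can be sketched in a few lines of Python. This is only an illustration using the standard urllib.parse module; the function name robots_url is made up for this example and is not part of any robot's actual code.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the /robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name used for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (plus port, if any); replace the path
    # with /robots.txt and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/index.html"))
# prints http://www.example.com/robots.txt
```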

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: “robots.txt”, not “Robots.TXT”.

What program should I use to create /robots.txt?

  • On Microsoft Windows, use notepad.exe, or wordpad.exe (Save as Text Document), or even Microsoft Word (Save as Plain Text)
  • On the Macintosh, use TextEdit (Format->Make Plain Text, then Save as Western)
  • On Linux, vi or emacs

Points to note

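  • Wildcards are _not_ supported by the original standard: instead of ‘Disallow: /tmp/*’ just say ‘Disallow: /tmp/’. Some crawlers accept ‘*’ in paths as an extension, but you cannot rely on every robot doing so.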
  • You shouldn’t put more than one path on a Disallow line (this may change in a future version of the spec)
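If you generate the file from a script, both points are easy to enforce. The following is a minimal sketch, assuming you keep your excluded path prefixes in a Python list; build_record is a hypothetical helper written for this example, not part of any standard tool.

```python
def build_record(user_agent: str, path_prefixes: list[str]) -> str:
    """Build one robots.txt record with one Disallow line per path prefix.

    build_record is a hypothetical helper used for illustration only.
    """
    lines = [f"User-agent: {user_agent}"]
    for prefix in path_prefixes:
        # Reject wildcards so the output stays portable across robots.
        if "*" in prefix:
            raise ValueError(f"wildcards are not portable in Disallow paths: {prefix!r}")
        # Exactly one path per Disallow line.
        lines.append(f"Disallow: {prefix}")
    return "\n".join(lines) + "\n"


print(build_record("*", ["/tmp/", "/logs/"]), end="")
```

This prints a ‘User-agent: *’ line followed by ‘Disallow: /tmp/’ and ‘Disallow: /logs/’ on separate lines.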

Example

# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

  • The first two lines, starting with ‘#’, are comments.
  • The first paragraph specifies that the robot called ‘webcrawler’ has nothing disallowed: it may go anywhere.
  • The second paragraph indicates that the robot called ‘lycra’ has all relative URLs starting with ‘/’ disallowed. Because all relative URLs on a server start with ‘/’, this means the entire site is closed off.
  • The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the ‘*’ is a special token, meaning “any other User-agent”; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
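One way to check how a file like this is interpreted is to run it through Python's standard urllib.robotparser module. The sketch below feeds the example file to the parser and asks what each robot may fetch. The robot name ‘somebot’, the sample paths, and the helper function allowed are made up for illustration, and Python's parser is only one implementation of the protocol.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, one directive per line.
EXAMPLE = """\
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: may `agent` fetch `path` on this site?"""
    return parser.can_fetch(agent, "http://webcrawler.com" + path)


print(allowed("webcrawler", "/tmp/cache.html"))  # True: nothing is disallowed
print(allowed("lycra", "/index.html"))           # False: whole site closed off
print(allowed("somebot", "/logs/today.txt"))     # False: falls under '*'
print(allowed("somebot", "/index.html"))         # True: not under /tmp or /logs
```

Because Disallow values are matched as URL prefixes, ‘Disallow: /tmp’ also covers a URL such as /tmpfile.html, not only the contents of a /tmp/ directory.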
