UPDATE: Please see the Drupal robots.txt tutorial on Drupalzilla.com...
Drupal 4.7 doesn't come with a robots.txt file. Drupal 5 does come with a robots.txt file, but it is not ideal. Many modules create duplicate content that you should block off with additional rules in the robots.txt file.
A robots.txt file attempts to keep search engines out of places that they shouldn't go. For example, a search engine should not be trying to index your comment reply pages.
This is the robots.txt file that comes with Drupal 5.1:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
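Several of the clean-URL rules above end in a slash. Rules in robots.txt are matched as plain path prefixes, so a rule such as Disallow: /logout/ does not cover the URL /logout itself. The following is a minimal sketch of how to check this with Python's standard urllib.robotparser module. The helper name is_allowed and the example.com URLs are illustrative assumptions, and only a few of the default rules are reproduced.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# A small excerpt of the Drupal 5.1 defaults (trailing slashes kept as shipped).
default_rules = """\
User-agent: *
Disallow: /comment/reply/
Disallow: /logout/
"""

# The same logout rule without the trailing slash.
modified_rules = """\
User-agent: *
Disallow: /logout
"""

print(is_allowed(default_rules, "http://example.com/comment/reply/12"))  # False
print(is_allowed(default_rules, "http://example.com/logout"))            # True
print(is_allowed(modified_rules, "http://example.com/logout"))           # False
```

Dropping the trailing slash is the reason some of the rules in the modified file further down differ from the defaults.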
Below is a basic robots.txt file that should only be used on sites that meet the following conditions:
- Clean URLs enabled
- Pathauto module installed and configured so that no post has a path that starts with /node. I.e., there should be no content URLs on the site that look like this: http://example.com/node/55, because the following robots.txt file would prevent them from being crawled by search engine spiders.
- Read the comments in the robots.txt file below carefully!
(Note: I had to add extra # symbols to prevent the Drupal text filter from closing my <code> element with a </p> tag. That is why there are extra #s in the file.)
# Block the Internet Archive's crawler if you want the privacy
User-agent: ia_archiver
Disallow: /
#
# All other robots
User-agent: *
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
#
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
#
# Paths (clean URLs) -- modified from default!
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/
Disallow: /contact
Disallow: /logout
Disallow: /search/
#
# Users
# I block my users' pages on most sites because they don't have much content. Only leave them unblocked if your user pages have good content on them.
# If you DON'T want your users' pages crawled, use this; otherwise delete it:
Disallow: /user
#
# If you do want your user pages indexed, then delete the line above and uncomment these
# Disallow: /user/register
# Disallow: /user/password
# Disallow: /user/login
#
# This prevents Drupal's default non-clean URLs from being indexed
Disallow: /?q=
#
# block tracker pages
Disallow: /tracker/
# block paginated tracker pages
Disallow: /tracker?
#
# IMPORTANT: THE FOLLOWING LINE BLOCKS ALL /node* URLs -- only use it if you do not have content with URLs like http://example.com/node/10. Alternatively use the Global Redirect module.
Disallow: /node
#
# Disallow robots from all but the main feed - requires Pathauto feed aliases turned on. Optionally block only Google from the following paths, because duplicate feeds cause the most havoc with Google
Disallow: /*/feed
# Prevent print-friendly duplicate pages from being crawled
Disallow: /book/export/
# This is important for the Forward module
Disallow: /forward/
# This prevents duplicate content caused by some modules like the Forums and Views modules
Disallow: /*sort=
# Prevent duplicate content created by the Image module
Disallow: /*size=
# If you have the Front module this will keep a duplicate front page from being indexed
Disallow: /front_page
# If you use the Views module's frontpage this will keep the duplicate from being indexed
Disallow: /frontpage
The above robots.txt file is not complete, but it gives a basic idea of the things you have to look out for when configuring your robots.txt file.
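Before deploying broad rules such as Disallow: /node or Disallow: /user, it helps to test a few of your own URLs against them. Here is a minimal sketch, using Python's standard urllib.robotparser module. The sample paths (/about-us, /users-guide) and the helper name blocked_paths are hypothetical, and only non-wildcard rules are included because that module does plain prefix matching and, as far as I know, does not implement the wildcard extension.

```python
from urllib.robotparser import RobotFileParser

# A non-wildcard subset of the rules from the file above.
RULES = """\
User-agent: *
Disallow: /comment/
Disallow: /user
Disallow: /node
"""


def blocked_paths(rules, paths, agent="*"):
    """Return the paths that `agent` is NOT allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # example.com is a placeholder host; only the path matters for matching.
    return [p for p in paths if not parser.can_fetch(agent, "http://example.com" + p)]


sample = ["/node/55", "/about-us", "/user/7", "/users-guide"]
print(blocked_paths(RULES, sample))  # ['/node/55', '/user/7', '/users-guide']
```

Note that /users-guide is blocked too: Disallow: /user is a prefix match, so it catches any aliased path that happens to begin with the same text.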
A future "Part 2" of this tutorial will show how to troubleshoot your Drupal robots.txt file and add additional rules that you might need based on your installed modules.
Comments
Your suggested ROBOTS.TXT and syntax problem
Thank you for the information about the suggested modified ROBOTS.TXT file for Drupal 5.1 installs.
As I understand from the www.robotstxt.org web site, the "disallow" command does not support regular expressions or the * character. Each command must give a complete file name or directory name.
I am not sure if the following will be effective:
Disallow: /*/feed
Does anyone have any experience with this?
Robots.txt wildcards
The wildcard syntax is not part of the robots.txt standard, but it is accepted by Google, Yahoo and MSN. Visit the newer Drupal robots.txt tutorial for more information.
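To illustrate how the wildcard extension behaves, here is a rough sketch of a matcher in Python. It assumes the semantics the major search engines document: * matches any run of characters, a trailing $ anchors the end of the URL, and everything else is a prefix match. The function names and sample paths are hypothetical, and this is not any crawler's actual implementation.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern":
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def wildcard_blocks(rule: str, path: str) -> bool:
    """Return True if `path` (including any query string) matches `rule`."""
    return rule_to_regex(rule).match(path) is not None


print(wildcard_blocks("/*/feed", "/taxonomy/term/3/feed"))          # True
print(wildcard_blocks("/*/feed", "/rss.xml"))                       # False
print(wildcard_blocks("/*sort=", "/forum/2?sort=asc&order=Topic"))  # True
```

A crawler that follows only the original standard would treat /*/feed as a literal prefix and so never match it, which is why these rules only help with engines that support the extension.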