Robots.txt and Drupal

An important aspect of Drupal SEO is the robots.txt file. Drupal 5 was the first version of Drupal that came with a robots.txt file, but it still needs some modifications.

One of the most serious SEO problems with Drupal is duplicate content. With the addition of contributed modules it can get so bad that one might refer to it as druplicate content. (ow...)

A key element of SEO on sites is getting a good, clean crawl. A robots.txt file is important for a clean crawl because it tells robots where they aren't supposed to go. There are many places on a Drupal site that search engine crawlers shouldn't go.

Drupal's Default Robots.txt File

I've attached Drupal 5's default robots.txt file for reference and will address it in sections:

Crawl Delay

The first thing I would do is remove the Crawl-delay line. Unless you have a very large site or spidering problems, it's not needed. The other robots.txt rules that I mention here should help cut down on the number of pages crawled.

User-agent: *
Crawl-delay: 10
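
To put the default value in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a crawler that honors Crawl-delay by waiting that many seconds between requests, which is the common interpretation but is not guaranteed for any particular search engine. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def max_fetches_per_day(crawl_delay_seconds):
    """Upper bound on requests per day for one crawler (assumed model:
    the crawler waits crawl_delay_seconds between consecutive requests)."""
    return SECONDS_PER_DAY // crawl_delay_seconds

print(max_fetches_per_day(10))  # 8640
```

Under that assumption, a crawler obeying the default setting could request at most a few thousand pages per day. That ceiling is only a useful throttle on a very large or heavily loaded site.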

Directories

The next section of the default robots.txt file addresses the physical directories created by Drupal:

# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

That section can be left as-is. Just keep in mind that it will probably also keep search engines away from your logo and image files, because it blocks your /sites/, /modules/, and /themes/ directories. If you use an alternate logo image, rename it so that it includes a keyword and place it in your /files/ directory.

Files

The next section addresses files that are included with Drupal. I've never seen any of these files indexed, but you can leave this section in if you wish. Don't delete your CHANGELOG.txt file as some people recommend, because it lets you know what version of Drupal you are running in case you forget later.

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

Paths

This is the most important section of the default robots.txt file because it contains some errors:

# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

Drupal doesn't have trailing slashes on the URLs, so you may want to remove trailing slashes from some of the rules as shown below:

Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

For example, the "Login or register to post comments" link on each node creates URLs like http://example.com/user/login?destination=comment/reply/806%2523comment_form and http://example.com/user/register?destination=comment/reply/806%2523comment_form. Drupal's default robots.txt rules do not block search engines from spidering those URLs, because "/user/login/" is not a prefix of "/user/login?destination=...". If you remove the trailing slashes as shown above, the rules will block them.

The Aggregator Module creates URLs of duplicate content like http://example.com/aggregator?page=3 that are not blocked by the default robots.txt file. Removing the trailing slash on the end of "/aggregator/" in the default robots.txt file will solve that problem.
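
The effect of the trailing slash can be checked with a small Python sketch. It assumes plain prefix matching, which is how Disallow rules without wildcards are interpreted. The helper name is_blocked is hypothetical and is not part of Drupal or any crawler.

```python
def is_blocked(path_and_query, disallow_rules):
    """Plain prefix matching: a rule blocks any URL that starts with it."""
    return any(path_and_query.startswith(rule) for rule in disallow_rules)

# Rules as shipped with Drupal 5, and the same rules without trailing slashes.
default_rules = ["/user/login/", "/user/register/", "/aggregator/"]
fixed_rules = ["/user/login", "/user/register", "/aggregator"]

urls = [
    "/user/login?destination=comment/reply/806%2523comment_form",
    "/user/register?destination=comment/reply/806%2523comment_form",
    "/aggregator?page=3",
]

for url in urls:
    # Each URL slips past the default rules but is caught by the fixed ones.
    print(url, is_blocked(url, default_rules), is_blocked(url, fixed_rules))
```

Keep in mind that a prefix rule such as /contact also matches any other path that begins with the same characters, for example a hypothetical /contact-us page, so check your own URL aliases before removing a slash.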

Paths (no clean URLs)

The next section of the robots.txt file addresses paths that should be blocked if you aren't using clean URLs:

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

UPDATE: Please ignore the "Disallow: /?" rule suggested below. Further testing has shown that it blocks all dynamic URLs in Google, so don't use it!

Most of the people reading these Drupal SEO tutorials are using clean URLs. If you are using clean URLs you can delete that section and replace it with the following line:

Disallow: /?

That line would block all of the URLs that start with ?q= as well as other miscellaneous query strings that might later appear for various reasons.

If you are not using clean URLs, modify the above section using the same logic as for the "clean paths" section above it. If your site has been indexed without clean URLs (for example, the page http://example.com/?q=node/25 has PageRank) and you are going to implement clean URLs, you should use .htaccess to do 301 redirects from the dynamic versions of the URLs to the clean ones. In that case, do not block the dynamic URLs from search engines: crawlers have to be able to fetch a dynamic URL to see its redirect, and only then can they transfer its PageRank to the clean URL. If that issue applies to you and my explanation doesn't make sense, please let me know in a comment below and I'll try to explain it another way.
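
The mapping that such a redirect needs to perform can be illustrated in Python. This is only a sketch of the URL translation, not a replacement for the .htaccess rules. The function name clean_url_target is hypothetical, and the sketch assumes the Drupal path is carried in the q parameter and any other parameters should be preserved.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def clean_url_target(url):
    """Return the clean URL that a ?q= style URL should 301 to, or None
    if the URL carries no q parameter (nothing to redirect)."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    q_values = [value for key, value in params if key == "q"]
    if not q_values:
        return None
    # Keep every other parameter (for example page=3) on the clean URL.
    remaining = [(key, value) for key, value in params if key != "q"]
    path = "/" + q_values[0].lstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(remaining), ""))

print(clean_url_target("http://example.com/?q=node/25"))
# http://example.com/node/25
```

A table of old and new URLs produced this way can also serve as a checklist for testing the redirects once they are in place.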

Additional Rules

Each module potentially adds many extra URLs to the site, which often creates massive amounts of duplicate content and increases the crawling load on your server. The rules below cover URLs created by core modules. I recommend adding them, but only after carefully reading and understanding the explanation given with each one:

Disallow: /node$
The URL http://example.com/node is a duplicate of http://example.com/.
Disallow: /user$
This will disallow the user form at http://example.com/user. If you would like to block all user pages, remove the trailing dollar sign ($) from this rule and all user pages will be blocked.
Disallow: /*sort=
This takes care of problems with the Forum Module, the Views Module and other modules that sort tables by column headers.
Disallow: /search$
This will block your search form at http://example.com/search. That URL does a 302 redirect to http://example.com/search/node which is already blocked by the default robots.txt file.
Disallow: /*/feed$
Drupal creates RSS feeds for many types of content in the format http://example.com/taxonomy/term/25/0/feed. If you don't block those RSS feeds, Google will put them in the Supplemental Results (even if they don't label the Supplemental Results in the SERPs anymore). The RSS feeds are duplicate content because they contain the same text, only marked up with RSS/XML instead of X/HTML. This rule will block all the RSS feeds on the site except for the main RSS feed, which is located at http://example.com/rss.xml by default.
Disallow: /*/track$
This will block all of the URLs created by the Tracker Module which are in the format http://example.com/user/5/track.
Disallow: /tracker?
The Tracker Module creates a paginated list of all the nodes on your site, beginning with the most recent. I believe that it's best for search engines to spider your content by approaching it in keyword-themed areas of the site (as they do through taxonomy or well-constructed Views). The Tracker Module organizes your content chronologically, not by keyword as taxonomy or Views do. It can also create thousands of extra pages on the site, like this example from Drupal.org: http://drupal.org/tracker?page=6379. My recommendation is to leave http://example.com/tracker exposed to search engines, while blocking the paginated tracker pages like http://example.com/tracker?page=50. Leaving just the first page of /tracker exposed allows search engines to rapidly find and index your latest content as it is posted.
Disallow: [front page] (replace with the path to your alternate front page)
If you change the default front page by modifying the settings on your site information page (http://example.com/admin/settings/site-information), you should block the path to the custom front page with robots.txt. Alternatively, install the Global Redirect Module which should redirect that URL to your true front page.
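
Before deploying the rules above, it can help to test them against sample URLs. The Python sketch below assumes the wildcard semantics documented by the major search engines: "*" matches any run of characters and a trailing "$" anchors the end of the URL. Not every crawler supports these extensions. The helper names and the sample paths are made up for illustration.

```python
import re

def rule_to_regex(rule):
    """Translate a Disallow pattern into a regex.
    '*' matches any characters; a trailing '$' anchors the end of the URL."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_blocked(url_path, rules):
    # re.match anchors at the start, so rules without '$' act as prefixes.
    return any(rule_to_regex(rule).match(url_path) for rule in rules)

rules = ["/node$", "/user$", "/*sort=", "/search$",
         "/*/feed$", "/*/track$", "/tracker?"]

samples = ["/node", "/node/25", "/user", "/user/5",
           "/forum?sort=asc&order=Topic", "/taxonomy/term/25/0/feed",
           "/rss.xml", "/user/5/track", "/tracker", "/tracker?page=50"]

for path in samples:
    print(path, "blocked" if is_blocked(path, rules) else "allowed")
```

In this sketch /node, /user, the sorted forum listing, the taxonomy feed, /user/5/track and /tracker?page=50 come out blocked, while /node/25, /user/5, /rss.xml and /tracker stay crawlable, which matches the intent described for each rule.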

An improved version of Drupal's robots.txt file that summarizes the explanations above can be downloaded here.

Please see the Drupal SEO Module Database for instructions about specific rules. If you have questions about a specific module that I haven't covered yet, please contact me and I'll try to review the module as soon as possible.