Robots.txt and Drupal

An important aspect of Drupal SEO is the robots.txt file. Drupal 5 was the first version of Drupal that came with a robots.txt file, but it still needs some modifications.

One of the most serious SEO problems with Drupal is duplicate content. With the addition of contributed modules it can get so bad that one might refer to it as druplicate content. (ow...)

A key element of SEO on sites is getting a good, clean crawl. A robots.txt file is important for a clean crawl because it tells robots where they aren't supposed to go. There are many places on a Drupal site that search engine crawlers shouldn't go.

Drupal's Default Robots.txt File

I've attached Drupal 5's default robots.txt file for reference and will address it in sections:

Crawl Delay

The first thing I would do is remove the Crawl-delay line. Unless you have a very large site or spidering problems, it's not needed. The other robots.txt rules that I mention here should help cut down on the number of pages crawled.

User-agent: *
Crawl-delay: 10

Directories

The next section of the default robots.txt file addresses the physical directories created by Drupal:

# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

That section can be left as-is. Just keep in mind that it will probably also keep search engines out of your logo and image files, because you are blocking your /sites/, /modules/, and /themes/ directories. If you use an alternate logo image, rename it so that it includes a keyword and place it in your /files/ directory.
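
If you want to check the effect of these directory rules yourself, here is a minimal sketch using Python's urllib.robotparser, which implements the plain prefix matching of the original robots.txt standard. The logo file names are hypothetical examples, not files from a real site:

# Minimal sketch: prefix matching with Python's urllib.robotparser.
# The logo paths below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Blocked: the logo lives under a disallowed directory.
print(parser.can_fetch("*", "http://example.com/themes/garland/logo.png"))  # False

# Allowed: /files/ is not disallowed, so a renamed logo placed there stays crawlable.
print(parser.can_fetch("*", "http://example.com/files/keyword-logo.png"))   # True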

Files

The next section addresses files that are included with Drupal. I've never seen any of these files indexed, but you can leave this section in if you wish. Don't delete your CHANGELOG.txt file as some people recommend, because it lets you know what version of Drupal you are running in case you forget later.

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

Paths

This is the most important section of the default robots.txt file because it contains some errors:

# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

Drupal URLs don't have trailing slashes, so you may want to remove the trailing slashes from some of the rules, as shown below:

Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

For example, each "Login or register to post comments" link on each node creates URLs like http://example.com/user/login?destination=comment/reply/806%2523comment_form and http://example.com/user/register?destination=comment/reply/806%2523comment_form. Drupal's default robots.txt rules will not block search engines from spidering those URLs, but the modified rules above, without the trailing slashes, will.

The Aggregator Module creates URLs of duplicate content like http://example.com/aggregator?page=3 that are not blocked by the default robots.txt file. Removing the trailing slash on the end of "/aggregator/" in the default robots.txt file will solve that problem.
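
To see the difference concretely, here is a minimal sketch using Python's urllib.robotparser prefix matching, applied to the login URL from the example above:

# Sketch: why "Disallow: /user/login" blocks the ?destination= URLs
# while "Disallow: /user/login/" does not (plain prefix matching).
from urllib.robotparser import RobotFileParser

url = "http://example.com/user/login?destination=comment/reply/806%2523comment_form"

with_slash = RobotFileParser()
with_slash.parse(["User-agent: *", "Disallow: /user/login/"])
print(with_slash.can_fetch("*", url))     # True  -> the default rule does not block it

without_slash = RobotFileParser()
without_slash.parse(["User-agent: *", "Disallow: /user/login"])
print(without_slash.can_fetch("*", url))  # False -> the modified rule blocks it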

Paths (no clean URLs)

The next section of the robots.txt file addresses paths that should be blocked if you aren't using clean URLs:

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

UPDATE: Please ignore the following advice. Further testing has shown that the "Disallow: /?" rule will block all dynamic URLs in Google, so don't use it!

Most of the people reading these Drupal SEO tutorials are using clean URLs. If you are using clean URLs you can delete that section and replace it with the following line:

Disallow: /?

That line would block all of the URLs that start with ?q= as well as other miscellaneous query strings that might later appear for various reasons.

If you are not using clean URLs, modify the above section using the same logic as for the "clean paths" section above it. If your site has been indexed without clean URLs (for example, the page http://example.com/?q=node/25 has PageRank) and you are going to implement clean URLs, you should use .htaccess to do 301 redirects from the dynamic versions of the URLs to the clean ones. In that case, do not block the dynamic URLs from search engines, because you want them to transfer their PageRank to the clean URLs. If that issue applies to you and my explanation doesn't make sense, please let me know in a comment below and I'll try to explain it another way.

Additional Rules

I also recommend adding the following rules, after carefully reading and understanding the explanations given with them:

Each module potentially adds many extra URLs to the site, which often creates massive amounts of duplicate content and increases the crawling load on your server. The following extra robots.txt rules address the core modules (a short sketch of how the * and $ matching in these rules works follows the list).

Disallow: /node$
The URL http://example.com/node is a duplicate of http://example.com/.
Disallow: /user$
This will disallow the user login form at http://example.com/user. If you would like to block all user pages, remove the trailing dollar sign from this rule.
Disallow: /*sort=
This takes care of problems with the Forum Module, the Views Module and other modules that sort tables by column headers.
Disallow: /search$
This will block your search form at http://example.com/search. That URL does a 302 redirect to http://example.com/search/node which is already blocked by the default robots.txt file.
Disallow: /*/feed$
Drupal creates RSS feeds for many types of content in the format http://example.com/taxonomy/term/25/0/feed. If you don't block those RSS feeds, Google will put them in the Supplemental Results (even if they don't label the Supplemental Results in the SERPs anymore). The RSS feeds are duplicate content because they contain the same text marked up with RSS/XML instead of (X)HTML. This rule will block all the RSS feeds on the site except for the main RSS feed, which is located at http://example.com/rss.xml by default.
Disallow: /*/track$
This will block all of the URLs created by the Tracker Module which are in the format http://example.com/user/5/track.
Disallow: /tracker?
The Tracker Module creates a paginated list of all the nodes on your site, beginning with the most recent. I believe that it's best for search engines to spider your content by approaching it in keyword-themed areas of the site (as they do through taxonomy or well-constructed Views). The Tracker Module organizes your content chronologically, not by keyword as taxonomy or Views do. It can also create thousands of extra pages on the site, like this example from Drupal.org: http://drupal.org/tracker?page=6379. My recommendation is to leave http://example.com/tracker exposed to search engines while blocking the paginated tracker pages like http://example.com/tracker?page=50. Leaving just the first page of /tracker exposed allows search engines to rapidly find and index your latest content as it is posted.
Disallow: [front page] (replace with the path to your alternate front page)
If you change the default front page by modifying the settings on your site information page (http://example.com/admin/settings/site-information), you should block the path to the custom front page with robots.txt. Alternatively, install the Global Redirect Module which should redirect that URL to your true front page.
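
Several of these rules use * and $, which are extensions supported by Google and Yahoo rather than part of the original robots.txt standard (see the comments below). As a rough sketch of how that extended matching behaves, the following Python converts a rule into a regular expression and tests it against example URLs based on this article (the forum URL is a made-up example); it is an approximation for illustration, not Google's actual implementation:

# Rough sketch of Google-style matching: "*" matches any run of characters
# and a trailing "$" anchors the rule to the end of the URL.
import re

def rule_matches(rule, path_and_query):
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    pattern = "^" + ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path_and_query) is not None

tests = [
    ("/node$",    "/node"),                              # True  -> blocked (duplicate of the front page)
    ("/node$",    "/node/25"),                           # False -> individual nodes stay crawlable
    ("/*sort=",   "/forum/12?sort=desc&order=Replies"),  # True  -> sortable table views blocked
    ("/*/feed$",  "/taxonomy/term/25/0/feed"),           # True  -> per-term RSS feeds blocked
    ("/*/feed$",  "/rss.xml"),                           # False -> the main feed stays crawlable
    ("/tracker?", "/tracker"),                           # False -> the first tracker page stays crawlable
    ("/tracker?", "/tracker?page=50"),                   # True  -> paginated tracker pages blocked
]
for rule, path in tests:
    print(rule, path, rule_matches(rule, path))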

An improved version of Drupal's robots.txt file that summarizes the explanations above can be downloaded here.

Please see the Drupal SEO Module Database for instructions about specific rules. If you have questions about a specific module that I haven't covered yet, please contact me and I'll try to review the module as soon as possible.

Comments

Thanks

This was a huge help. Thank you very much!!!

I've tried using Disallow:

I've tried using Disallow: /*sort= and even
Disallow: /*?sort=
Disallow: /*&sort=

but when I test it in the Google Webmaster Tools, it still shows the URLs with sorting as "allowed" for Googlebot. Why is that? It's a real problem, especially with Quicktabs, which creates the same page for each tab too!!

http://www.domain.com/forums/taxonomy/nodetitle?sort=desc&order=Last reply&quicktabs_1=0

I've tried all of these and none of them seems to block it:
Disallow: /*?quicktabs_1=1
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?tid=
Disallow: /*?quicktabs*
Disallow: /*sort=

robots.txt

This rule should work:

User-agent: *
Disallow: /*sort=

  • Does your robots.txt file have those two lines with the rule under the "*" user-agent?
  • If you have a separate section of the robots.txt file for Googlebot, it should go there too because Google will ignore the "*" section if you have a specific Googlebot section.
  • Is your robots.txt file located in the root of your domain?

If you're still having problems with it, please post the URL of your homepage or send it to me by email.
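
For what it's worth, you can also sanity-check the pattern outside of the Webmaster Tools testing tool by approximating Google's "*" wildcard with a regular expression. The path below is the one from your example URL, with the space encoded:

# Quick check of "Disallow: /*sort=" against the URL from the comment above.
import re

path = "/forums/taxonomy/nodetitle?sort=desc&order=Last%20reply&quicktabs_1=0"
print(bool(re.match(r"^/.*sort=", path)))  # True -> the URL should be disallowed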

Wishful Thinking

In your recommendations, you use pattern-matching idioms ('*', '$') that are not part of the robots.txt standard. I've never come across these being suggested anywhere, and they certainly are not part of the relevant standards. The robots.txt standard only allows you to specify URL prefixes. No wildcards, no end-of-URL '$'. Just prefixes.

Some robots do support their own custom extensions (mostly to do with speed of indexing), but I've never seen the ones you use suggested before. If some robots do work as you claim, please cite your sources or test results.

The relevant standards can be found at:
http://www.robotstxt.org/norobots-rfc.txt
http://www.robotstxt.org/orig.html

Wildcards in robots.txt

Hi Ngaur,

The asterisk wildcard and the '$' end-of-URL anchor are supported by both Google and Yahoo. Read more here:

I think the official robots.txt standard should be updated because basic pattern matching is required for controlling robots on modern dynamic websites.

Pretty Cool then

You probably should identify which rules are based on the non-standard extensions, but if Google and Yahoo adhere to these, then that's hugely useful.

Can I suggest that, if the relevant modules are used (Print and Forward), people should probably add rules to cover the URLs for printer-friendly pages and email-forwarding pages:

Disallow: /print/
Disallow: /forward?

Also, disallow the Event Module's calendar pages. These produce a vast number of pages with very little original content. It is important, though, to make sure the event nodes can still be found; for that the xmlsitemap module is useful, as are Views, so long as you don't use them to recreate the problems of the calendar pages.

So long as it doesn't clobber the URLs you use for the events themselves, you could have:

Disallow: /event/

If necessary though, you could be more careful:

Disallow: /event/*/table/
Disallow: /event/*/list/
Disallow: /event/*/month/
Disallow: /event/*/week/
Disallow: /event/*/day/

It occurs to me that what's really needed here is a robots.txt module which provides an API for modules to register URLs for inclusion, presumably with an admin interface to select which of those extra rules to actually use. I wonder if anyone has tried this yet?

Drupal SEO

There's also a section on this site for Drupal Modules SEO that covers robots.txt rules for some of the contrib and additional modules.

I think the Forward Module requires:
Disallow: /forward/

See also this site's robots.txt file for ideas.

A robots.txt API is a good idea.
