Drupal SEO

I've decided to organize a section of this Web site around posts related to Drupal SEO. I'll be adding a couple of other tutorials in the next week or two. In the meantime, check out the previous Drupal search engine optimization articles below.

If you are completely new to search engine optimization, read and bookmark the intro to SEO page, and check out SEO Elite Software.

I also offer Drupal SEO consulting services.

Drupal SEO Consulting Services

I offer two Drupal-related SEO services:

  1. SEO Site Audits
  2. SEO Campaigns

I've written some of the most comprehensive Drupal SEO tutorials on the Web, including the Drupalzilla robots.txt tutorial and the basic Drupal SEO tutorial that is at the top of Google for drupal seo right after Drupal.org.

SEO Site Audits

An SEO Site Audit is a detailed analysis of your site's configuration and structure, and it contains recommendations on optimizing your SEO with Drupal-specific tips.

SEO site audits are delivered in PDF format, with 2 hours of consulting beyond the delivery of the site audit. A typical site audit is between 35 and 50 pages in length. SEO site audits are done at a flat rate of $2000 USD.

SEO Campaigns

An SEO Campaign is a longer consulting agreement of 6 or more months where I work with your Web developers to systematically increase traffic through comprehensive search engine and social media optimization techniques.

For more information, please inquire through the form below:

SEO and SMO traffic

SEO Services Inquiry


Basic Drupal SEO: On-site Optimization

NEW! This Drupal SEO tutorial has been updated and rewritten in May 2008.

Drupal is a great open source GPL content management system. With a few modifications it can be configured for excellent on-site search engine optimization. This tutorial only covers the very basics of on-site optimization. It will make sure that search engines are able to spider your site, and prevent some common Drupal SEO errors.

This is just a basic introduction to configuring a Drupal site for good search engine rankings. Other tutorials will go into more depth.

Summary

  1. Enable Clean URLs
  2. Enable Path Module and install and enable Pathauto, Global Redirect and Token Modules.
  3. Configure the Pathauto Module
  4. Install and enable the Meta Tags Module.
  5. Install and enable the Page Title Module.
  6. Do NOT install the Drupal Sitemap Module.
  7. Fix .htaccess to redirect to "www" or remove the "www" subdomain.
  8. Fix your theme's HTML headers if they aren't right
  9. Recommended: create a custom front page
  10. Modify your robots.txt file.

Drupal and Clean URLs

Enable clean URLs

Search engines prefer clean URLs. In Drupal 6, clean URLs should be automatically enabled if your server allows it. In Drupal 5 you can enable clean URLs under administer —> settings —> Clean URLs. Clean URLs are necessary for the pathauto module, mentioned below.

Drupal Modules for SEO

Install the pathauto module and enable it

The pathauto module is highly recommended. Pathauto will automatically make nice customized URLs based on things like title, taxonomy, content type, and username. You also have to enable the path module for pathauto to work.

Think carefully about how you want your URLs to look. It takes some experience with Drupal to get the exact URL paths that you might want. The URLs are controlled by a combination of taxonomy and pathauto, and I hope to cover that in another tutorial. You can also use the path module to write custom URLs for each page, but that might become tedious and inconsistent on a large site.

At the very least, enable the path module and install the pathauto module. It will generate nice-looking URLs for you without much configuration.
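To give a feel for what a title-based alias looks like, here is a minimal Python sketch. It is not Pathauto's code (Pathauto is a PHP module with configurable patterns); the function name `make_alias` and the `content` prefix are assumptions made up for this illustration.

```python
import re
import unicodedata


def make_alias(title, prefix="content"):
    """Turn a post title into a lowercase, hyphen-separated URL alias.

    Hypothetical helper: it only imitates the general idea of a
    title-based alias, not Pathauto's real behavior or settings.
    """
    # Decompose accented characters and drop anything outside ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return f"{prefix}/{slug}" if prefix else slug
```

With these assumptions, a title such as "Basic Drupal SEO: On-site Optimization" would become `content/basic-drupal-seo-on-site-optimization`.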

Caution: The above advice is directed towards new Drupal sites. If you have an existing Drupal site be very careful that you don't rename your previously existing URLs with the pathauto module. It is generally a very bad idea to change existing URLs because the search engines will no longer be able to find those pages.

Here are some pathauto settings to watch out for:

For update action choose "Do nothing. Leave the old alias intact." Otherwise the URLs of nodes will change every time you change the title of your post, causing problems with search engines:

Drupal Pathauto update action settings

There is also a more comprehensive Pathauto tutorial.

Install the Global Redirect Module

The Global Redirect Module will automatically do 301 redirects to your URL aliases. So if you have a node at example.com/node/5, the Global Redirect Module will redirect that URL to your alias at example.com/my-page.

Read more about the Global Redirect Module.

Install the Meta Tags (Nodewords) Module

The Meta Tags Module (formerly called "Nodewords Module") can be highly beneficial to your site. There is a myth in some search engine optimization circles that says, "meta tags are not important". This is not true.

Meta tags are not meant to be used for keyword stuffing. Don't use them for that purpose because it isn't going to help you. The really important meta tag is the meta description.

The meta description should be different on every page for best results. The meta description should be one or two brief sentences to summarize the page. It should be written for your human visitors, but it is not a bad idea to tastefully and sparingly insert a couple of your keywords. Often when a search engine lists your site in the search engine results pages, it will use your page's HTML title for the title, and your meta description for the text snippet. That is why the meta description should be written with human visitors in mind. You want a text snippet that is going to make them want to click on the link.

Here is one textbook example from this site in the Google SERPs:

Drupal meta description being used as a text snippet in Google

I generally configure the Drupal Nodewords module to output the meta description and meta keywords on every page. I have a few default keywords set, and add a couple more on every post to make a unique combination of relevant keywords. I don't spend much time with it because I don't think the meta keywords are that important.

On the nodewords module's administration page, be sure to check the box that says "Use the teaser of the page if the meta description is not set?". That way each page will get a unique meta description even if you have denied access to create custom meta tags for nodes to some users.
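The teaser fallback can be pictured with a short Python sketch. This is only an illustration of the idea, not the Nodewords module's code; the function name `meta_description` and the 155-character limit are assumptions.

```python
import re


def meta_description(custom, teaser, max_length=155):
    """Return the custom description, or a trimmed teaser as a fallback.

    Hypothetical helper; max_length is an assumed limit, not a module setting.
    """
    # Prefer the hand-written description when one was entered.
    text = custom.strip() if custom and custom.strip() else teaser
    # Strip HTML tags and collapse whitespace so the snippet is plain text.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    if len(text) <= max_length:
        return text
    # Cut at the last word boundary that fits within the limit.
    return text[:max_length].rsplit(" ", 1)[0]
```

Because the teaser differs from node to node, each page still ends up with its own description even when nobody writes one by hand.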

Install the Page Title Module

The Page Title Module allows you to set custom page titles on every page. Highly recommended.

Google Sitemaps Module

Google Sitemaps are not essential, but I've been adding them to my Drupal sites. I think that Google Sitemaps were created by Google primarily for debugging Googlebot and not for the benefit of search engine optimizers.

There is a Drupal Sitemap Module, but the last time I checked it had serious bugs that made it unusable. In any case, I don't think that most Web sites need XML sitemaps. Other SEOs have similar opinions about sitemaps.

I recommend not using the Drupal Sitemap Module.

Drupal Rewrite Rules

Make sure that your site does a permanent (301) redirect in either of the following two ways:

  • http://example.com to http://www.example.com, or
  • http://www.example.com to http://example.com

You can set up this redirect in your .htaccess file.

To remove the www from your site, look for the following code in your .htaccess file and uncomment and adapt:

  # To redirect all users to access the site WITHOUT the 'www.' prefix,
  # (http://www.example.com/... will be redirected to http://example.com/...)
  # uncomment and adapt the following:
  # RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
  # RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]

To redirect to the www version of the site, look for the following code and uncomment and adapt:

  # To redirect all users to access the site WITH the 'www.' prefix,
  # (http://example.com/... will be redirected to http://www.example.com/...)
  # adapt and uncomment the following:
  # RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  # RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
  

Be sure to replace example.com with your domain name, and then test the redirects in a browser.
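If it helps to see what the "WITH www" rule does to a request, here is a rough Python simulation. It is a sketch only: the function name `www_redirect` is made up, and it ignores details such as query strings, which Apache carries over to the new URL by default.

```python
import re


def www_redirect(host, path, domain="example.com"):
    """Return the 301 target for a request, or None if no redirect applies.

    Hypothetical simulation of the 'WITH www' rule pair from .htaccess.
    """
    # RewriteCond: the host must be exactly the bare domain; [NC] ignores case.
    if not re.fullmatch(re.escape(domain), host, flags=re.IGNORECASE):
        return None
    # RewriteRule: in .htaccess context the leading slash is removed before
    # matching, so $1 holds the rest of the requested path.
    captured = path[1:] if path.startswith("/") else path
    return f"http://www.{domain}/{captured}"
```

A request for `/node/5` on `example.com` maps to `http://www.example.com/node/5`, while a request that already uses `www.example.com` is left alone.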

Fix Your HTML Headers

There should be one <h1> header element on every page and it should have your keywords in it.

  1. Enclose your site name in DIV tags, not HTML header tags.
  2. I would add one H1 element to the home page.
  3. On teaser views, the node titles should be enclosed in H2 tags, while the main header of the page (e.g., taxonomy term name) should be enclosed in H1 tags.
  4. On node view pages, the node title should be enclosed in H1 tags.
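One quick way to check a theme against the rules above is to count the heading elements in a page's HTML. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are my own and the HTML is supplied as a string rather than fetched from a site.

```python
from html.parser import HTMLParser


class HeadingCounter(HTMLParser):
    """Count h1 to h6 start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.counts[tag] = self.counts.get(tag, 0) + 1


def count_headings(html):
    """Return a dict such as {'h1': 1, 'h2': 3} for the given HTML string."""
    parser = HeadingCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

A page that follows the rules should report exactly one `h1`; a count of zero or more than one is a sign that the theme templates need attention.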

Duplicate Content from /node

By default, the front page of a Drupal site has nearly identical content to the page at /node. Search engines are going to spider and index /node because on the paginated home page view, the link to the first page in the series points at /node.

The fix for this is simple — always use a custom front page when building a Drupal site.

Drupal PHP Session IDs

I haven't seen this problem on Drupal sites in a long time, but if you see PHP session IDs in your URLs, it is very bad for search engines. They have to be removed if you want search engines to be able to spider your site well. A PHP session ID in your URL might look something like this: ?PHPSESSID=37765439acbd6c12345ee987776e65be.

From what I understand, this is the fix if your server supports mod_php — it goes in your .htaccess file:

# Fix PHP session ID problems in Drupal
php_value session.use_trans_sid 0
php_value session.use_only_cookies 1

Otherwise you can probably fix it by modifying your php.ini file (or creating one). I don't know the exact procedure for every host, only that your web site must not have PHP session IDs in the URLs if you want good spidering by search engines. Search Drupal.org or Google for how to turn off PHP session IDs on your server.
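To find out whether a site is affected, you can scan a list of its URLs (from a crawl or from server logs) for the session parameter. Here is a small Python sketch; the function names are hypothetical and the parameter name `PHPSESSID` is PHP's default, so adjust it if your server uses a different session name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def has_session_id(url, param="PHPSESSID"):
    """True if the URL's query string carries a PHP session ID."""
    query = urlsplit(url).query
    return any(key == param for key, _ in parse_qsl(query, keep_blank_values=True))


def strip_session_id(url, param="PHPSESSID"):
    """Return the URL with the session parameter removed, other parameters kept."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

This only helps with auditing. The real fix is still the server configuration described above, so that the session IDs never appear in URLs in the first place.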

Drupal and Robots.txt

The default Drupal robots.txt file has critical errors in it even in Drupal 6.2 (bug report already filed).

Read this Drupal robots.txt tutorial for more information.

Watch out for contributed modules that create duplicate content through extra URLs. This can be a serious problem.

Further Reading

To learn more about search engine optimization, check out the SEO resources page.

Drupal and Canonical URLs

Google Engineer Matt Cutts talks about canonical home page URLs on his blog. The concept is basically this:

For the most part, search engines view different URLs as being entirely different pages. So the following URLs all may show the same content, but search engines will often see them as different pages with duplicate content:

  1. http://example.com
  2. http://www.example.com/
  3. http://example.com/index.php
  4. and so on...

Drupal does not link to its index.php file so the third URL example is generally not an issue with Drupal. However you should choose between using the www version of the domain name or the non-www version of the domain name. Drupal makes this easy by providing instructions in the default .htaccess file as shown below:

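# To redirect all users to access the site WITH the 'www.' prefix,
# (http://example.com/... will be redirected to http://www.example.com/...)
# adapt and uncomment the following: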
# RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
# RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
#
# To redirect all users to access the site WITHOUT the 'www.' prefix,
# (http://www.example.com/... will be redirected to http://example.com/...)
# adapt and uncomment the following:
# RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
# RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]

When setting up your Drupal site you should decide whether you want your site to have the www subdomain or not and choose one of the two options in the .htaccess file.

For SEO purposes it doesn't matter either way unless your site has already been live for a while. If Google has indexed your site and shows the PageRank value of the home page (as seen in the Google Toolbar or through a Firefox Extension like Search Status), then Google has already chosen one version or the other for your domain name. In that case I would redirect to the version of the domain that Google has already accepted. You can determine which version Google has chosen by typing your domain name into Google without the www like this: example.com

Google should show your domain name at the top of the SERPs. If Google shows your home page with the www then you should redirect your site to the www-version. If they leave off the www then redirect to the version without the www.

Some people would say that it doesn't matter which one you redirect to even if Google has already indexed the site. But sometimes when you 301 redirect pages or sites, Google drops the original URL and it takes a while to get it ranked again. That is why I recommend going with the choice that Google has already made for you.

Drupal Modules SEO

The old Drupalzilla.com site had a database of Drupal modules with tips for SEO. I've copied some of the information into the pages below.

Abuse Module

The Abuse Module allows users to flag content as spam. It outputs an extra link at the bottom of teasers and full nodes.

Drupal's Abuse Module

The image above shows the link that is created at the bottom of nodes and comments that allows users to flag content for review by the moderators. The URLs that are linked-to have the structure http://example.com/abuse/report/comment/347. If you have a node with 15 comments, the Abuse Module will create 16 extra pages on the site, 15 abuse form pages for the comments and one for the node.

To fix this problem, the following rule should be added to your robots.txt file when using the Abuse Module:

Disallow: /abuse/

Forum Module

The Forum Module is part of Drupal core. If you enable the Forum Module, there are some things to be aware of.

Whenever Web sites create tables that are sortable by column headers, you are looking at potential duplicate content.

Drupal's Forum Module

The image above shows table headers in Drupal's Forum Module. When you click on one of those links, it re-sorts the data in the table appending parameters to the original URL.

In the example image above, the original URL structure is http://example.com/forums/introduce-yourself. Drupal's Forum Module creates the following additional URLs in the header links:

Link Text    URL
Title        http://example.com/forums/introduce-yourself?sort=asc&order=Topic
Replies      http://example.com/forums/introduce-yourself?sort=asc&order=Replies
Created      http://example.com/forums/introduce-yourself?sort=asc&order=Created
Last Reply   http://example.com/forums/introduce-yourself?sort=desc&order=Last+reply

After visiting those pages you (and spiders) will also find the following URLs:

Link Text    URL
Title        http://example.com/forums/introduce-yourself?sort=desc&order=Topic
Replies      http://example.com/forums/introduce-yourself?sort=desc&order=Replies
Created      http://example.com/forums/introduce-yourself?sort=desc&order=Created
Last Reply   http://example.com/forums/introduce-yourself?sort=asc&order=Last+reply

Pagination of the forums makes it even worse because each page can then be sorted in these 8 ways. Here is one example from Drupal.org: http://drupal.org/forum/2?sort=asc&order=Last+reply&page=393.
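The arithmetic is easy to reproduce: four sortable columns times two sort directions gives eight extra URLs for every forum page. The Python sketch below simply enumerates them; the function name `sort_variants` is made up, and the column names are the `order` values from the example URLs above.

```python
from itertools import product
from urllib.parse import urlencode


def sort_variants(base_url, columns=("Topic", "Replies", "Created", "Last reply")):
    """List every sort URL that the table headers can produce for one page."""
    return [
        # urlencode turns the space in "Last reply" into "+", as Drupal does.
        f"{base_url}?{urlencode({'sort': direction, 'order': column})}"
        for column, direction in product(columns, ("asc", "desc"))
    ]
```

Calling it for `http://example.com/forums/introduce-yourself` returns the eight URLs listed above, and a forum with 400 pages of topics multiplies that count by 400.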

SEO Recommendation for the Forum Module

The recommended fix for this problem is to add the following line to the robots.txt file:

Disallow: /*sort=

SEO Recommendations for Drupal Core

The following line should be added to the default Drupal robots.txt file because the Forum Module is distributed with Drupal:

Disallow: /*sort=

Forward Module

The Forward Module adds a link to each teaser and full node that allows users to email the node to people.

Drupal's Forward Module

The image above shows the link that is created on every node. The URL structure of the link is http://example.com/forward/343. If your site has 3000 nodes, the Forward Module will create 3000 extra pages with nothing but a form that allows people to email the nodes to their friends.

To fix the issue, add the following line to your robots.txt file:

Disallow: /forward/

Global Redirect Module

The Global Redirect module has three main features:

  1. If a requested URL has a URL alias, Global Redirect will do a 301 redirect to the URL alias. For example, if you have a URL alias for node 25 called page-title, the Global Redirect Module will do a 301 redirect from http://example.com/node/25 to http://example.com/page-title.
  2. It will remove trailing slashes from URLs. For example, the Global Redirect Module will redirect a request for http://example.com/page-title/ to http://example.com/page-title. If search engines spider both versions, they will see two different URLs with duplicate content.
  3. If a requested URL is being used as Drupal's front page, it will 301 redirect to the actual front page. For example, if you are using the path frontpage as your site's front page, a request for http://example.com/frontpage will 301 redirect to http://example.com/.

If you search around the Web for Drupal SEO tutorials, many people recommend using mod_rewrite rules in an .htaccess file to deal with issues like removing trailing slashes. But on sites that also have non-Drupal content, you may have URLs that do have trailing slashes.

A slash is the symbol for a directory. For example, in the URL http://example.com/ the trailing slash is the symbol for the root directory of example.com. If you leave the trailing slash off, the server will add it. If you request a physical directory on a Drupal site (or any site) like http://example.com/modules the server will correct you by appending the trailing slash: http://example.com/modules/. If you have non-Drupal content on your server—perhaps a WordPress blog at http://example.com/software/—you will have URLs with trailing slashes. The WordPress blog would not be located at http://example.com/software, it would be located at http://example.com/software/. You would not want to remove trailing slashes from all URLs.

That is why the Global Redirect module is a good option. It will only remove trailing slashes from URLs that are handled by Drupal.
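The three features can be summarized in a few lines of Python. This is my own simplified model of the behavior described above, not the module's code; the function name, the `aliases` dictionary and the `front_page` argument are all assumptions for the sake of the example.

```python
def global_redirect(path, aliases, front_page):
    """Return the path to 301-redirect to, or None if no redirect is needed.

    path       -- requested Drupal path without the leading slash, e.g. "node/25"
    aliases    -- dict mapping internal paths to URL aliases (hypothetical data)
    front_page -- the path configured as the site's front page
    An empty string as the result stands for the site root.
    """
    if path == "":
        return None  # the site root is already canonical
    # Feature 2: drop trailing slashes.
    target = path.rstrip("/")
    # Feature 1: prefer the URL alias when one exists.
    target = aliases.get(target, target)
    # Feature 3: the front page path redirects to the site root.
    if target == front_page:
        target = ""
    return None if target == path else target
```

Since this logic only ever sees paths that Drupal itself handles, a WordPress blog in a real directory such as /software/ keeps its trailing slash.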

Image Module

The Image Module allows users to upload images as nodes.

It creates duplicate content on sites—at least two duplicate URLs for every image node created.

Drupal's Image Module screenshot

In the image above, the link to "Thumbnail" appends the query string ?size=thumbnail to the URL and redisplays the content. Once you are on the thumbnail page, a link to the preview page will be displayed. If you have allowed anonymous users to "View Original Image" in the Access Control settings, then there will be an additional link to the original image.

The duplicate-content URLs that a default installation of the Image Module creates are shown below:

  • http://example.com/node-title — this is the actual node's URL
  • http://example.com/node-title?size=thumbnail
  • http://example.com/node-title?size=preview
  • http://example.com/node-title?size=_original

The names of the image sizes are controlled through the Image Module settings at http://example.com/admin/settings/image:

Drupal's Image Module settings

So, for example, if you created an additional image size called tiny, the Image Module would then create an extra page of duplicate content for each image node on the site by appending ?size=tiny to the original URLs of the nodes.

To fix this issue, add the following line to your robots.txt file:

Disallow: /*size=

Paging Module

Drupal's Paging Module is a popular way to break up nodes across multiple pages. This module does create some problematic SEO issues though.

Drupal's Paging Module

As shown in the image above, the Paging Module is able to break up each node into multiple pages which creates more URLs. For example, if the page above had the URL http://example.com/page-title, the following other URLs would be created for the paginated views:

  • The number 2 would link to http://example.com/page-title?page=0,1.
  • The number 3 would link to http://example.com/page-title?page=0,2
  • And when you are on either of those two sub-pages, the number 1 would link back to the first page as http://example.com/page-title?page=0,0 instead of its original URL at http://example.com/page-title.

That results in a single page of content with two different URLs: http://example.com/page-title?page=0,0 contains duplicate content of the node's main URL http://example.com/page-title.

Temporary SEO Fix

The current SEO fix is to add the following line to your robots.txt file to prevent the duplicate pages from being indexed:

Disallow: /*?page=0,0$

The syntax in the above robots.txt rule is recognized by Google Search, Yahoo Search, and MSN Live Search.
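To check which URLs a wildcard rule like this covers before you deploy it, you can model the matching in Python. The sketch below treats `*` as "any run of characters" and a trailing `$` as "end of URL", which is how those search engines document the syntax. The function names are my own, and the sketch ignores Allow lines and User-agent groups.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path pattern with * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal piece, then join the pieces with "any characters".
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(url_path, disallow_rules):
    """True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(rule).match(url_path) for rule in disallow_rules)
```

With the rule `/*?page=0,0$`, the duplicate first page `/page-title?page=0,0` is blocked, while `/page-title` and `/page-title?page=0,1` remain crawlable.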

Module development recommendations

Future versions of this module should be built so that the main URL is not duplicated. The link back to the main node page should not have a query string. Also, it would be best if the URLs that it generates were not dynamic.

The following example shows a possible URL structure for this module that would be better for search engine indexing:

  • Main node URL: http://example.com/page-title
  • First pagination: http://example.com/page-title/1
  • Second pagination: http://example.com/page-title/2
  • and so on...

Path Module

The Path Module is a Drupal core module. Enabling it allows you to create URL aliases.

Drupal's standard URLs (once you have enabled "clean URLS" in the Admin panel) are in this format:
http://example.com/node/25

Once you have enabled the Path Module you will be able to create URL aliases for each URL. If you created a URL alias for that URL (node 25) called custom-page-title, you would then be able to access the content of node 25 at http://example.com/custom-page-title.

You would also still be able to access the content of node 25 at http://example.com/node/25. Generally, you do not have to worry about this unless your site has already been indexed with the original "node" URLs. In either case you could install the Global Redirect Module which would automatically redirect http://example.com/node/25 to your URL alias at http://example.com/custom-page-title.

A related module is the recommended Pathauto Module which automatically creates URL aliases for each node on your site.

Spam Module

The Spam Module filters content and comments for spam, and lets users flag content for review by the administrators.

The Drupal Spam Module creates URLs on the site like:
http://example.com/spam/report/comment/1

To prevent low-quality pages from being indexed, add the following line to your robots.txt file when using the Spam Module:

Disallow: /spam/

Tracker Module

The Tracker Module creates "track" pages for each user.

For example, a page that tracks user #234 would have a tracker page located at http://example.com/user/234/track. Those pages should be blocked from search engines with the following rule:

Disallow: /*/track$

That robots.txt syntax is recognized by Google Search, Yahoo Search, and MSN Live Search.

The tracker module also keeps track of recent posts on the site at URLs like http://example.com/tracker and on large sites creates thousands of tracker pages like http://drupal.org/tracker?page=6379.

My recommendation is to leave the first page of the Recent Posts (http://example.com/tracker) exposed to search engines, while blocking the paginated tracker pages like http://example.com/tracker?page=50. Leaving just the first page of /tracker exposed to search engines allows search engines to rapidly find and index your latest content as it is posted.

The following rule blocks all but the first of your site-wide tracker pages:

Disallow: /tracker?

Drupal SEO and Case Sensitive URLs

Search engines like Google and Yahoo are based on Unix (Linux or BSD). Unlike on Windows, filenames on Unix servers are case-sensitive. That means a file called INDEX.HTML is a different file than index.html.

Drupal has an SEO issue where URLs are not case sensitive. I'll explain why this is a problem.

Here's an example of case-sensitive URLs on a Unix server—Google.com:

google-com.png

One page with more than one URL can be seen as duplicate content in the eyes of search engines. A page that shows the same content regardless of the letter case of its URL is showing duplicate content.

Here's an experiment I did with Drupal and case sensitive URLs. It shows that both versions are indexed by Google as duplicate content:

google-case-sensitive-URLs-drupal.png

I posted an issue here. I think it's a MySQL problem. Here's the code from the Drupal 5.7 Path Module:

case 'load':
  $path = "node/$node->nid";
  // We don't use drupal_get_path_alias() to avoid custom rewrite functions.
  // We only care about exact aliases.
  $result = db_query("SELECT dst FROM {url_alias} WHERE src = '%s'", $path);
  if (db_num_rows($result)) {
    $node->path = db_result($result);
  }
  break;

Here's what MySQL.com says:

The default character set and collation are latin1 and latin1_swedish_ci, so non-binary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation.

I think that it's something that needs to be fixed in the Path Module and/or added to the Global Redirect module.

A Drupal site isn't going to be affected by this naturally. It would only happen if someone working on the site manually links to URLs in a different case than the case of the Drupal URL aliases. I wouldn't call it a "critical" issue, but it definitely should be fixed as soon as possible. Theoretically it could be used to maliciously affect a site's rankings.
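One possible shape for a fix is to compare the requested path against the stored aliases exactly, and to redirect when only the letter case differs. The Python sketch below illustrates that idea only; it is not code from the Path or Global Redirect modules, and the function name and the `aliases` collection are hypothetical.

```python
def canonical_case_redirect(requested, aliases):
    """Return the stored alias to 301-redirect to when only the case differs.

    requested -- the path as typed or linked, e.g. "Drupal-SEO"
    aliases   -- collection of aliases exactly as stored (hypothetical data)
    Returns None when the request matches exactly or matches nothing.
    """
    if requested in aliases:
        return None  # exact, case-sensitive match: serve the page as usual
    # Index the stored aliases by their lowercase form for the comparison.
    by_lower = {alias.lower(): alias for alias in aliases}
    return by_lower.get(requested.lower())
```

With this approach a link to `Drupal-SEO` would be answered with a 301 to `drupal-seo` instead of a second copy of the page.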

Comments and opinions welcome.

Drupal SEO: "404 Ok" and .htaccess

NOTE: This tutorial is no longer current. Please see the Drupal SEO Tutorial for current information on Drupal 5 and Drupal 6.

There are two problems in Drupal 4.7 that may cause problems with search engine spiders.

Drupal .htaccess: Redirecting to www

Tip: .htaccess is only used with Drupal on Apache server. If you are using Windows and want to install Apache, try Apache2triad which includes Apache, PHP, MySQL, Perl, Python, and much more. Apache2triad installs with a double-click. You can run Drupal on IIS, but I don't think it's a good idea.

If you don't know what URL canonicalization is, read this first.

The default .htaccess in Drupal 4.7 has some lines that you can uncomment to redirect your visitors in one of the following two ways:

  1. http://example.com to http://www.example.com
  2. http://www.example.com to http://example.com

This is the relevant section of the default Drupal .htaccess file — it is a bad idea to use this code on your site:

  RewriteEngine on

  # If your site can be accessed both with and without the prefix www.
  # you can use one of the following settings to force user to use only one option:
  #
  # If you want the site to be accessed WITH the www. only, adapt and uncomment the following:
  # RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
  # RewriteRule .* http://www.example.com/ [L,R=301]
  #
  # If you want the site to be accessed only WITHOUT the www. , adapt and uncomment the following:
  # RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
  # RewriteRule .* http://example.com/ [L,R=301]

It is a bad idea to use these default RewriteRules because they will only redirect to the Drupal home page. For example, they will redirect a request for http://example.com/MyPage to http://www.example.com/, when it should redirect to http://www.example.com/MyPage. A site should redirect to the requested page, not back to the home page.

This default Drupal .htaccess file is dangerous because external web sites might link to a page on your site like http://example.com/MyBestPage and if you use the default Drupal RewriteRules it will redirect the search engines (and visitors) to http://www.example.com/ — the "www" version of your home page; not the intended page. Don't risk confusing the search engines with 301 (permanent) redirects to your home page when you don't intend for them to go to your home page.

To fix this problem, use the following lines in your Drupal .htaccess file instead, right after the line that says RewriteEngine On, replacing example.com with your domain name:

  RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
  RewriteRule (.*) http://www.example.com/$1 [R=301,L]

If you prefer to remove the www then use the following rule instead:

  RewriteCond %{HTTP_HOST} !^example\.com$ [NC]
  RewriteRule (.*) http://example.com/$1 [R=301,L]

Tip: If you want to know the details on how those rewrite rules work, check out this mod_rewrite cheat sheet.

Drupal's 404 Ok Error

Drupal has a problem when you are running PHP as CGI. Instead of sending "404 Not Found" errors when it can't find a page, it will send "404 Ok" errors. You can read more about it on PHP.net. When a search engine spider requests a page that doesn't exist, you want to send a proper "404 Not Found" header.

To see if you are sending faulty "404 Ok" headers, you can use Firefox with the LiveHTTPheaders extension. After you have installed that extension and restarted Firefox, go to Tools —> Live HTTP headers. That will open up the header-viewer window. Then go to your web site to a page that doesn't exist (like http://www.example.com/asdf1234). Check the LiveHTTPheader window to see if it sends a correct "404 Not Found" header or an incorrect "404 Ok" header. If it says "404 Ok", then there is a problem and you can fix it as explained below.
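If you save the first line of the response from the header viewer, the same check can be done in a script. Here is a minimal Python sketch that splits an HTTP status line and flags a 404 whose reason phrase is not "Not Found"; the function name and the `faulty_404` key are my own inventions.

```python
def check_status_line(status_line):
    """Split an HTTP status line and flag a 404 with the wrong reason phrase."""
    # A status line looks like: HTTP-version SP status-code SP reason-phrase
    version, code, *reason = status_line.strip().split(" ", 2)
    reason_phrase = reason[0] if reason else ""
    return {
        "version": version,
        "code": int(code),
        "reason": reason_phrase,
        "faulty_404": code == "404" and reason_phrase.lower() != "not found",
    }
```

A line such as `HTTP/1.1 404 OK` is reported as faulty, while `HTTP/1.1 404 Not Found` passes.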

To fix the Drupal 404 error problem open up the file /includes/common.inc. In Drupal 4.7.3, it is about line 288 where you will find this:

  drupal_set_header('HTTP/1.0 404 Not Found');

Change that line to:

  drupal_set_header('Status: 404 Not Found');

Then check your headers again in Firefox with LiveHTTPheaders. If it says "404 Not Found" then your problem is solved. If it doesn't work, leave a comment below and let me know...

Update: PHP "301 OK" Header Errors

As mentioned below in the comments there is also a "403 OK" error that can exist on some configurations. For an example on how to fix the similar "301 OK" PHP header error, see my post on PHP redirects.

Drupal.org Bug Report

See also the Drupal bug report page for this problem.

Note: The HTTP 1.0 specifications say that "the Status-Code is intended for use by automata and the Reason-Phrase is intended for the human user. The client is not required to examine or display the Reason-Phrase." But I did have a problem where Google would not remove some of my pages even with the manual removal tool until I fixed the headers from "404 OK" to "404 Not Found". I'm not sure what the current status on this issue is, but to be safe I recommend sending a correct "404 Not Found" header. Not all bots may be programmed according to the standards.

Robots.txt and Drupal

An important aspect of Drupal SEO is the robots.txt file. Drupal 5 was the first version of Drupal that came with a robots.txt file, but it still needs some modifications.

One of the most serious SEO problems with Drupal is duplicate content. With the addition of contributed modules it can get so bad that one might refer to it as druplicate content. (ow...)

A key element of SEO on sites is getting a good, clean crawl. A robots.txt file is important for a clean crawl because it tells robots where they aren't supposed to go. There are many places on a Drupal site that search engine crawlers shouldn't go.

Drupal's Default Robots.txt File

I've attached Drupal 5's default robots.txt file for reference and will address it in sections:

Crawl Delay

The first thing I would do is remove the Crawl-delay line. Unless you have a very large site or spidering problems, it's not needed. The other robots.txt rules that I mention here should help cut down on the number of pages crawled.

User-agent: *
Crawl-delay: 10

Directories

The next section of the default robots.txt file addresses the physical directories created by Drupal:

# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

That section can be left as-is. Just keep in mind that it will probably keep search engines out of your logo and image files also because you are blocking your /sites/, /modules/, and /themes/ directories. If you use an alternate logo image, rename it so that it includes a keyword and place it in your /files/ directory.

Files

The next section addresses files that are included with Drupal. I've never seen any of these files indexed, but you can leave this section in if you wish. Don't delete your CHANGELOG.txt file as some people recommend, because it lets you know what version of Drupal you are running in case you forget later.

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

Paths

This is the most important section of the default robots.txt file because it contains some errors:

# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

Drupal doesn't have trailing slashes on the URLs, so you may want to remove trailing slashes from some of the rules as shown below:

Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

For example, each "Login or register to post comments" link on each node creates URLs like http://example.com/user/login?destination=comment/reply/806%2523comment_form and http://example.com/user/register?destination=comment/reply/806%2523comment_form. Drupal's default robots.txt rules will not block search engines from spidering those URLs, but if you remove the trailing slashes as I've mentioned above, they will.

The Aggregator Module creates URLs of duplicate content like http://example.com/aggregator?page=3 that are not blocked by the default robots.txt file. Removing the trailing slash on the end of "/aggregator/" in the default robots.txt file will solve that problem.
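The effect of the trailing slash is easy to see once you treat each Disallow value as a plain prefix of the URL path. The following Python sketch is a simplified model of that prefix matching, not a full robots.txt parser, and the function name is_blocked is my own invention for illustration.

```python
def is_blocked(path, disallow_rules):
    """Simplified model: a path is blocked if it starts with any Disallow value."""
    return any(path.startswith(rule) for rule in disallow_rules)


# A few of the default rules, and the same rules without the trailing slash.
default_rules = ["/user/login/", "/user/register/", "/aggregator/"]
modified_rules = ["/user/login", "/user/register", "/aggregator"]

urls = [
    "/user/login?destination=comment/reply/806%2523comment_form",
    "/user/register?destination=comment/reply/806%2523comment_form",
    "/aggregator?page=3",
]

for url in urls:
    # Columns: URL, blocked by default rules, blocked by modified rules.
    print(url, is_blocked(url, default_rules), is_blocked(url, modified_rules))
```

Keep in mind that a prefix without the trailing slash is broader: a rule like /contact would also match a path such as /contact-us, so check your own URLs before removing a slash.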

Paths (no clean URLs)

The next section of the robots.txt file addresses paths that should be blocked if you aren't using clean URLs:

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

UPDATE: Please ignore the following lines. Further testing has shown that this rule will block all dynamic URLs in Google. So don't use it!

Most of the people reading these Drupal SEO tutorials are using clean URLs. If you are using clean URLs you can delete that section and replace it with the following line:

Disallow: /?

That line would block all of the URLs that start with ?q= as well as other miscellaneous query strings that might later appear for various reasons.

If you are not using clean URLs, modify the above section using the same logic as for the "clean paths" section above it. If your site has been indexed without clean URLs (for example, the page http://example.com/?q=node/25 has PageRank) and you are going to implement clean URLs, you should use .htaccess to do 301 redirects from the dynamic versions of the URLs to the clean ones. In that case, do not block the dynamic URLs from search engines, because search engines need to crawl them in order to transfer the PageRank of the dynamic URLs to the clean URLs. If that issue applies to you and my explanation doesn't make sense, please let me know in a comment below and I'll try to explain it another way.
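Before writing the redirect rules, it can help to list which clean URL each dynamic URL should redirect to. The Python sketch below only computes that mapping; it does not perform any redirect, and the function name clean_url_target is hypothetical. It assumes the Drupal path is carried in the q parameter, as in the example above, and keeps any other query parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_url_target(url):
    """Return the clean URL that a ?q= style URL should 301 to, or None."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    q_values = [value for key, value in params if key == "q"]
    if not q_values:
        # Already a clean URL (or not a Drupal path): nothing to redirect.
        return None
    # Keep every parameter except q, e.g. page=3 on paginated listings.
    remaining = [(key, value) for key, value in params if key != "q"]
    path = "/" + q_values[0].strip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(remaining), ""))


print(clean_url_target("http://example.com/?q=node/25"))
print(clean_url_target("http://example.com/?q=aggregator&page=3"))
```

Each pair of old and new URLs produced this way is one 301 redirect that the .htaccess rules need to cover.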

Additional Rules

I also recommend adding the following rules, after carefully reading and understanding the explanations given with them:

Each module potentially adds many extra URLs to the site, which often create massive amounts of duplicate content and increase the crawling load on your server. The following are some extra robots.txt rules for core modules.

Disallow: /node$
The URL http://example.com/node is a duplicate of http://example.com/.
Disallow: /user$
This will disallow the user form at http://example.com/user. If you would like to block all user pages, remove the trailing dollar sign ($) from this rule and all user pages will be blocked.
Disallow: /*sort=
This takes care of problems with the Forum Module, the Views Module and other modules that sort tables by column headers.
Disallow: /search$
This will block your search form at http://example.com/search. That URL does a 302 redirect to http://example.com/search/node which is already blocked by the default robots.txt file.
Disallow: /*/feed$
Drupal creates RSS feeds for many types of content in the format http://example.com/taxonomy/term/25/0/feed. If you don't block those RSS feeds, Google will put them in the Supplemental Results (even if they don't label the Supplemental Results in the SERPs anymore). The RSS feeds are duplicate content because they contain the same text, only marked up with RSS/XML instead of (X)HTML. This rule will block all the RSS feeds on the site except for the main RSS feed, which is located at http://example.com/rss.xml by default.
Disallow: /*/track$
This will block all of the URLs created by the Tracker Module which are in the format http://example.com/user/5/track.
Disallow: /tracker?
The Tracker Module creates a paginated list of all the nodes on your site, beginning with the most recent. I believe that it's best for search engines to spider your content by approaching it in keyword-themed areas of the site (as they do through taxonomy or well-constructed Views). The Tracker Module organizes your content chronologically, not by keyword as taxonomy or Views do. The Tracker Module can also create thousands of extra pages on the site like this example from Drupal.org: http://drupal.org/tracker?page=6379. My recommendation is to leave http://example.com/tracker exposed to search engines, while blocking the paginated tracker pages like http://example.com/tracker?page=50. Leaving just the first page of /tracker exposed allows search engines to rapidly find and index your latest content as it is posted.
Disallow: [front page] (replace with the path to your alternate front page)
If you change the default front page by modifying the settings on your site information page (http://example.com/admin/settings/site-information), you should block the path to the custom front page with robots.txt. Alternatively, install the Global Redirect Module which should redirect that URL to your true front page.
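The rules above rely on two pattern characters: * for any run of characters and a trailing $ for the end of the URL. These are an extension that Google supports, and other crawlers may ignore them. To test a rule before publishing it, the matching can be approximated in Python. This is a rough sketch under that assumption, with hypothetical helper names, and not a reimplementation of any search engine's parser.

```python
import re


def rule_to_regex(rule):
    """Translate a Disallow value containing * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with "any run of characters".
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(pattern + (r"\Z" if anchored else ""))


def is_blocked(path, rules):
    # re.match anchors at the start of the path, giving prefix behaviour.
    return any(rule_to_regex(rule).match(path) for rule in rules)


rules = ["/node$", "/user$", "/*sort=", "/search$", "/*/feed$", "/*/track$", "/tracker?"]

for path in ["/node", "/node/25", "/forum?sort=asc&order=Topic",
             "/taxonomy/term/25/0/feed", "/rss.xml", "/user/5/track",
             "/tracker", "/tracker?page=50"]:
    print(path, is_blocked(path, rules))
```

Note how /node and /tracker?page=50 are blocked while /node/25, /rss.xml and the first page of /tracker stay open, which is the behaviour described in the explanations above.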

An improved version of Drupal's robots.txt file that summarizes the explanations above can be downloaded here.

Please see the Drupal SEO Module Database for instructions about specific rules. If you have questions about a specific module that I haven't covered yet, please contact me and I'll try to review the module as soon as possible.

How to Nofollow Drupal Comments

This is a quick hack to Drupal 5 to add rel=nofollow to the comment authors' homepages. I'm not recommending adding nofollow to comments, only describing the technique for people who are looking for it.

Generally it isn't a good idea to modify Drupal core code, but this method works.

Open /includes/theme.inc.

Change line 1052 from:

$output = l($name, 'user/'. $object->uid, array('title' => t('View user profile.')));

to:

$output = l($name, 'user/'. $object->uid, array('title' => t('View user profile.'), 'rel' => 'nofollow'));

And change line 1064 from:

$output = l($object->name, $object->homepage);

to:

$output = l($object->name, $object->homepage, array('rel' => 'nofollow'));

Technique originally described here.
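After making the change, you may want to confirm that the comment author links really carry the attribute. One way is to save the HTML of a page with comments and scan it for links that are still followed. The Python sketch below does that with the standard html.parser module; the class and function names are hypothetical and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class FollowedLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that lacks rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, or be missing entirely.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "href" in attributes and "nofollow" not in rel_values:
            self.followed.append(attributes["href"])


def followed_links(html):
    finder = FollowedLinkFinder()
    finder.feed(html)
    return finder.followed


# Made-up sample: one link with nofollow, one without.
sample = ('<a href="http://example.com/" rel="nofollow">Alice</a> '
          '<a href="/user/5" title="View user profile.">Bob</a>')
print(followed_links(sample))
```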

Drupal.org SEO

This is a series of three tutorials I wrote in 2007 with suggestions for optimizing Drupal.org for better search engine rankings.

Drupal.org SEO, Part One

This tutorial offers some advice on how to optimize Drupal.org for search engines. This is part one of a series.

Problem: Google Rankings Drop

The following table shows a ranking drop in Google over the past 8 months for some of Drupal's main keywords:

Keyword                      Rank on 28 Jan 2007    Rank on 15 Aug 2007
content management system    #7                     #15
cms                          #7                     #36

Drupal is obviously one of the best, most flexible, open-source content management systems available. I think that Drupal is the best general CMS because it is very flexible, it runs on standard LAMP servers, and it is usable even by people without much of a technical background. It would be great to see Drupal.org in the top 5 on Google for both of those keywords.

This series of articles on SEO for Drupal.org will attempt to address issues that may help increase organic search engine traffic.

Drupal.org's Robots.txt File

Drupal.org has some basic SEO issues. One of the problems is the massive amount of duplicate content that can be spidered by search engines. In addition to duplicate content issues, having those pages exposed to spiders also puts a heavier load on the server because of the number of pages that can be crawled.

Google has not been obeying robots.txt files lately, but it's a good idea to use robots.txt files correctly in preparation for when Google fixes their crawling problem. (An example of the Googlebot bug can be seen here where Google has indexed and cached sections of this site that have been blocked off with robots.txt since it was launched.)

I've written a Drupal Robots.txt Tutorial which explains some errors in Drupal 5's default robots.txt file. I've summarized recommended changes to http://drupal.org/robots.txt in the attached PDF file.

Home Page Title Element

The <title> element of a Web page tells search engines what the page is about. The home page title element is especially important. The current home page title element has the text drupal.org | Community plumbing. This is how Google displays it in the SERPs:

Drupal's title element shows in Google's SERPs

It isn't the most attractive listing and might not attract as many clickthroughs as a more descriptive title element. I recommend changing the home page title element of Drupal.org to:

<title>Drupal | Open Source Content Management System (CMS)</title>

or

<title>Drupal CMS | An Open Source Content Management System</title>

Even better would be to add the keyword PHP. Many people are searching for things like "how do I make a cms in php?". Displaying those keywords in the home page title would be helpful. For example:

<title>Drupal CMS | Open Source Content Management System in PHP</title>

In addition to adding primary keywords to the home page title element, that change would make the site's listing in the SERPs more descriptive.

The important thing is to get these keywords in the home page title element:

  • CMS
  • Content Management System
  • PHP (recommended)
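A quick way to compare candidate titles against that keyword list is a simple case-insensitive check. This Python sketch is only an illustration; the function name missing_keywords is hypothetical, and a plain substring test says nothing about how search engines weigh the words.

```python
def missing_keywords(title, keywords):
    """Return the keywords that do not appear in the title (case-insensitive)."""
    lowered = title.lower()
    return [keyword for keyword in keywords if keyword.lower() not in lowered]


keywords = ["CMS", "Content Management System", "PHP"]
current = "drupal.org | Community plumbing"
proposed = "Drupal CMS | Open Source Content Management System in PHP"

# The current title misses every keyword; the proposed one covers them all.
print(missing_keywords(current, keywords))
print(missing_keywords(proposed, keywords))
```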

Drupal.org SEO, Part Two

This article is part two of the series on optimizing Drupal.org for search engines.

Remove Nofollow From Internal Links

Rel=nofollow is a microformat that, when applied to a link, tells search engines such as Google that the linking page does not vouch for the quality of the linked-to page.

Nofollow is used on Drupal.org to reduce the motivation of users to post spam. When the site is viewed with the Firefox Search Status Extension, the nofollowed links show up highlighted in pink, showing the extent of the issue:

Drupal nofollow

The problem with nofollow on Drupal.org is that it is getting applied to internal links. So pages that are important and that get linked to often are not getting the search engine "link juice" that they should.

One solution would be to apply a "nofollow whitelist" to the Input Filter on Drupal.org, so that pages from Drupal, its subdomains, and other official sites always have rel=nofollow removed from their links. That way nodes that are important and that get linked to often by webmasters and users would start to get more link juice and be seen as more important by search engines. That would include the often referenced pages like the excellent Drupal Handbooks.
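The whitelist decision itself is simple and could look something like the following Python sketch. To be clear, this is not how Drupal.org's Input Filter is implemented; the function name needs_nofollow and the list of trusted domains are assumptions made for illustration. A link keeps rel=nofollow unless its host is a trusted domain or one of its subdomains.

```python
from urllib.parse import urlsplit


def needs_nofollow(href, trusted_domains):
    """Return False for links to trusted domains and their subdomains."""
    host = urlsplit(href).hostname
    if host is None:
        # Relative links stay on the current site, so treat them as internal.
        return False
    for domain in trusted_domains:
        # Require an exact match or a real subdomain, so that a host like
        # notdrupal.org is not mistaken for drupal.org.
        if host == domain or host.endswith("." + domain):
            return False
    return True


trusted = ["drupal.org"]  # hypothetical whitelist

for href in ["http://drupal.org/handbooks", "http://api.drupal.org/",
             "/node/25", "http://notdrupal.org/", "http://example.com/spam"]:
    print(href, needs_nofollow(href, trusted))
```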

Optimized Title Elements

Drupal.org has so much link popularity (PageRank 9) that it could rank #1 for just about any Drupal term that it is optimized for. One important factor is to make sure that the relevant keywords appear in the <title> elements, the <h1> elements, and in the body text. I'll use the Drupal Handbooks as an example. The Drupal Handbooks contain some of the best Drupal tutorials on the Web, yet the Handbooks do not rank in Google for those keywords.

If you search Google for drupal tutorials, Drupal.org is only #7 and #8, and the pages listed are just forum threads, not the main tutorial section, the Drupal Handbooks.

Search engine visitors are more likely to search for the keywords drupal tutorials than drupal handbooks. Some of the best Drupal tutorials are found in the Drupal handbook pages, and it would be appropriate for Google to have the Drupal handbooks at #1 for the keywords drupal tutorials.

On the page http://drupal.org/handbooks, I would change the title element from:

<title>Drupal handbooks | drupal.org</title>

to

<title>Drupal handbooks and tutorials | drupal.org</title>

I would also change the H1 element of that page to <h1>Drupal Handbooks and Tutorials</h1>. Currently the H1 element is just <h1>Drupal Handbooks</h1> as shown in the image below:

Drupal handbooks H1 element

More Coming Soon

This is Part Two in a series on SEO recommendations for Drupal.org. Part One can be found here. Part Three covers duplicate content issues.

Drupal.org SEO, Part Three

This is part 3 of a case study on how Drupal.org could be further optimized for search engine rankings.

Google has indexed over 2,500 pages on the subdomain www2.drupal.org. Here is a screenshot:

Google has indexed the www2 subdomain on Drupal.org

Since www2.drupal.org is a duplicate of drupal.org, Google is indexing duplicate content on the site which can hurt rankings. It also puts extra load on the servers because of the extra pages being crawled.

There are two possible solutions:

  1. Send 301 redirects from all pages on the www2 subdomain to their corresponding pages on the main domain drupal.org.
  2. Alternatively, block all pages on the www2 subdomain from robots with the robots.txt file.

The first option could be implemented with .htaccess. The second option could be implemented by having the URL http://www2.drupal.org/robots.txt serve the following content:

User-agent: *
Disallow: /

That would prevent search engines from crawling and indexing the duplicate content. (The main robots.txt file at http://drupal.org/robots.txt would serve different content—the regular robots.txt file.)
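For the first option, every URL on the duplicate subdomain needs a matching target on the main domain. The Python sketch below only computes that target so the mapping can be checked against a list of indexed URLs; the real redirect would still be done by the web server. The function name redirect_target is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def redirect_target(url, duplicate_host="www2.drupal.org", canonical_host="drupal.org"):
    """Return the URL a duplicate-host page should 301 to, or None."""
    parts = urlsplit(url)
    if parts.hostname != duplicate_host:
        # Already on the main domain (or another site): no redirect needed.
        return None
    # Keep the path and query string so each page maps to its counterpart.
    return urlunsplit((parts.scheme, canonical_host, parts.path,
                       parts.query, parts.fragment))


print(redirect_target("http://www2.drupal.org/handbooks"))
print(redirect_target("http://www2.drupal.org/tracker?page=2"))
print(redirect_target("http://drupal.org/handbooks"))
```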