An important aspect of Drupal SEO is the robots.txt file. Drupal 5 was the first version of Drupal that came with a robots.txt file, but it still needs some modifications.
One of the most serious SEO problems with Drupal is duplicate content. With the addition of contributed modules it can get so bad that one might refer to it as druplicate content. (ow...)
A key element of SEO on sites is getting a good, clean crawl. A robots.txt file is important for a clean crawl because it tells robots where they aren't supposed to go. There are many places on a Drupal site that search engine crawlers shouldn't go.
I've attached Drupal 5's default robots.txt file for reference and will address it in sections:
The first thing I would do is remove the Crawl-delay line. Unless you have a very large site or spidering problems, it's not needed. The other robots.txt rules that I mention here should help cut down on the number of pages crawled.
User-agent: *
Crawl-delay: 10
The next section of the default robots.txt file addresses the physical directories created by Drupal:
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
That section can be left as-is. Just keep in mind that blocking your /sites/, /modules/, and /themes/ directories will probably also keep search engines away from your logo and image files. If you use an alternate logo image, rename it so that it includes a keyword and place it in your /files/ directory.
The next section addresses files that are included with Drupal. I've never seen any of these files indexed, but you can leave this section in if you wish. Don't delete your CHANGELOG.txt file as some people recommend, because it lets you know what version of Drupal you are running in case you forget later.
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.txt
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
This is the most important section of the default robots.txt file because it contains some errors:
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Drupal doesn't have trailing slashes on the URLs, so you may want to remove trailing slashes from some of the rules as shown below:
Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
For example, each "Login or register to post comments" link on each node creates URLs like http://example.com/user/login?destination=comment/reply/806%2523comment_form and http://example.com/user/register?destination=comment/reply/806%2523comment_form. Drupal's default robots.txt rules will not block search engines from spidering those URLs, because a rule such as /user/login/ only matches URLs that continue with a slash after "login". If you remove the trailing slashes as I've mentioned above, the rules will block them.
The Aggregator Module creates URLs of duplicate content like http://example.com/aggregator?page=3 that are not blocked by the default robots.txt file. Removing the trailing slash on the end of "/aggregator/" in the default robots.txt file will solve that problem.
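To see why the trailing slash matters, here is a minimal Python sketch of the prefix matching that robots.txt rules are based on. It is a simplified model and not how any particular search engine implements it; the function name is_blocked and the two rule lists are my own, made up for illustration.

```python
def is_blocked(disallow_rules, path_and_query):
    """Return True if any Disallow rule is a prefix of the requested URL path."""
    return any(path_and_query.startswith(rule) for rule in disallow_rules)


# Rules as shipped (trailing slashes) versus the trimmed versions.
default_rules = ["/user/login/", "/user/register/", "/aggregator/"]
trimmed_rules = ["/user/login", "/user/register", "/aggregator"]

# The example URLs from the text, without the host name.
urls = [
    "/user/login?destination=comment/reply/806%2523comment_form",
    "/user/register?destination=comment/reply/806%2523comment_form",
    "/aggregator?page=3",
]

for url in urls:
    print(url)
    print("  blocked by default rules:", is_blocked(default_rules, url))
    print("  blocked by trimmed rules:", is_blocked(trimmed_rules, url))
```

Keep in mind that prefix matching cuts both ways: a rule like /contact would also match any other path that starts with those characters, so check your own URL aliases before trimming a slash.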
The next section of the robots.txt file addresses paths that should be blocked if you aren't using clean URLs:
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
UPDATE: Please ignore the following lines. Further testing has shown that this rule will block all dynamic URLs in Google. So don't use it!
Most of the people reading these Drupal SEO tutorials are using clean URLs. If you are using clean URLs you can delete that section and replace it with the following line:
Disallow: /?
That line would block all of the URLs that start with ?q= as well as other miscellaneous query strings that might later appear for various reasons.
If you are not using clean URLs, modify the above section using the same logic as for the "clean paths" section above it. If your site has been indexed without clean URLs (for example, the page http://example.com/?q=node/25 has PageRank) and you are going to implement clean URLs, you should use .htaccess to do 301 redirects from the dynamic versions of the URLs to the clean ones. In that case, do not block the dynamic URLs from search engines, because the search engines need to crawl them to see the redirects and transfer the PageRank of the dynamic URLs to the clean URLs. If that issue applies to you and my explanation doesn't make sense, please let me know in a comment below and I'll try to explain it another way.
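If the redirect mapping is hard to picture, the following Python sketch shows which clean URL each dynamic URL should point to. It only illustrates the mapping; the redirect itself would still be done in .htaccess. The function name clean_url_target is hypothetical, and I am assuming that the dynamic URLs live at / or /index.php.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def clean_url_target(url):
    """Return the clean path for a ?q= style URL, or None if no redirect applies."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    q_values = [value for key, value in params if key == "q"]

    # Only URLs served from the front controller with a q parameter qualify.
    if parts.path not in ("", "/", "/index.php") or not q_values:
        return None

    # Keep any other parameters (such as page=3) on the clean URL.
    remaining = [(key, value) for key, value in params if key != "q"]
    target = "/" + q_values[0].lstrip("/")
    if remaining:
        target += "?" + urlencode(remaining)
    return target


for example in (
    "http://example.com/?q=node/25",
    "http://example.com/?q=aggregator&page=3",
    "http://example.com/node/25",
):
    print(example, "->", clean_url_target(example))
```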
I also recommend adding the following rules, after carefully reading and understanding the explanations given with them:
Each module potentially adds many extra URLs to the site, which often create massive amounts of duplicate content and also increase the crawling load on your server. The following are some extra robots.txt rules for core modules.
Disallow: /node$
Disallow: /user$
Disallow: /*sort=
Disallow: /search$
Disallow: /*/feed$
Disallow: /*/track$
Disallow: /tracker?
Leaving /tracker exposed to search engines allows them to rapidly find and index your latest content as it is posted.
Disallow: [front page] (replace with the path to your alternate front page)
An improved version of Drupal's robots.txt file that summarizes the explanations above can be downloaded here.
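The * and $ characters in these extra rules are pattern extensions that the major search engines support: * stands for any run of characters and $ anchors the rule to the end of the URL. The Python sketch below is my own rough model of that matching, so you can test a rule against a few URLs before deploying it. The helper names and the sample paths are invented for illustration, and real crawlers may differ in the details.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule using * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def matches_any(rules, path_and_query):
    """Return True if at least one rule matches the URL path and query string."""
    return any(rule_to_regex(rule).match(path_and_query) is not None for rule in rules)


extra_rules = [
    "/node$", "/user$", "/*sort=", "/search$",
    "/*/feed$", "/*/track$", "/tracker?",
]

# Hypothetical sample paths, chosen to exercise each rule.
for path in ("/node", "/node/25", "/forum?sort=asc&order=Topic",
             "/taxonomy/term/5/feed", "/user/3/track",
             "/tracker", "/tracker?page=2"):
    print(path, "blocked" if matches_any(extra_rules, path) else "allowed")
```

Note how /tracker? only catches the paged versions of the tracker, while /tracker itself stays crawlable.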
Please see the Drupal SEO Module Database for instructions about specific rules. If you have questions about a specific module that I haven't covered yet, please contact me and I'll try to review the module as soon as possible.