Hunting Googlebot with GNU SEO Tools

Are you curious where Google goes when it spiders your Web site? There are several GNU tools that can quickly give you useful information.

These tips will probably work on any Unix-based operating system including Mac OS X and BSD, as well as GNU / Linux. I am using them on Ubuntu 6.06 Dapper Drake. If you are using Windows, consider getting Cygwin or using a live CD.

GNU SEO Tools

There are many excellent GNU or Unix tools for the SEO toolkit. This post is just an introduction.

You can either download your raw access log files, or connect to your Linux or BSD server over SSH and run the commands there. This tutorial assumes you have a standard access log downloaded to your computer and that you are working in the directory that contains it. In this tutorial, the access log is named access_log; just substitute the name of your own log file, or rename your log file to access_log.

The log file in this example is 356,162 lines long, which is a little large to open comfortably in a text editor. Some log files are millions of lines long and cannot reasonably be opened in a graphical text editor at all. These tools can pull information out of such files without your ever needing to open them.

An Introduction to the Commands

You can learn more about each of the following commands by typing man [command] in the terminal, replacing [command] with the actual name of the command.

cat
Concatenates files. Commonly used for sending the contents of a text file into another command with a pipe. cat [file] will print the contents of the file to the screen. You can pipe the contents into another command as shown below.
wc -l
With the -l option, this prints the number of lines in a file.
grep
Grep means "globally search for a regular expression and print". Grep is covered in detail below.
sort
Sorts the output that is passed to it.
uniq
Removes duplicate lines, printing each repeated line only once. Because it compares only adjacent lines, the input usually needs to be sorted first.
cut
An amazing tool that can extract columns from text that is given to it. In our case, we can use it to extract specific columns from the Apache log file such as requested URLs, IP addresses, or HTTP error codes.
head
Extracts a specified number of lines from the beginning of a file.
tail
Extracts a specified number of lines from the end of a file. Useful for breaking off a section of a large file.
sed
A text processing tool that filters and transforms text that is piped to it. In this tutorial it is used to extract a range of lines from a very large text file.
|
The pipe symbol sends the output of one command into another command as its input. You can build long chains of commands this way, as shown in the example after this list.
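
As a minimal example of piping, this sends the contents of the log file into wc, which gives the same count as running wc -l access_log directly:

cat access_log | wc -l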

If you don't already know what a regular expression is, read the Wikipedia article on regular expressions first. Also try typing man grep in the terminal.

If you don't know what it means to pipe one command to another, see also the Learning the GNU / Linux Command Line tutorial for an introduction to how it works.

Grep

Grep means "globally search for a regular expression and print".

Grepping your log files for search engine spiders

Tip: For documentation on the grep command, type man grep in the terminal.

The syntax for the grep command is grep [options] PATTERN [FILE...].

To grep for hits that Googlebot has made on your site and save them to a file named googlebot_access.txt, use this line:

grep Googlebot access_log >googlebot_access.txt

You can then count the number of hits that Googlebot has made on the site with the wc -l command:

wc -l googlebot_access.txt

Tip: For documentation on the wc command, type man wc in the terminal.

In my sample log file, 1,825 of the 356,162 lines are Googlebot hits.
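
As a shortcut, if you only need the number, grep can do the counting itself with the -c option:

grep -c Googlebot access_log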

To grep for Yahoo or MSN, use one of the following lines:

grep Yahoo access_log >yahoobot_access.txt

grep msnbot access_log >msnbot_access.txt

Also take a look at a list of search engine bots to get information on their user agent strings; several sites maintain larger user agent databases.
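
If you want to pull out hits from several bots in one pass, grep -E accepts an alternation pattern. This is only a sketch; Slurp is the user agent substring for Yahoo's crawler, and the exact strings vary by search engine and change over time, so check a current list:

grep -E 'Googlebot|Slurp|msnbot' access_log >bot_access.txt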

Refining the Search

You can find the number of 404 errors with the grep command. The [[:space:]] before and after the 404 ensures that you don't match the number 404 elsewhere in the line, such as in the IP address. The pattern is quoted so the shell passes the brackets to grep unchanged:

grep '[[:space:]]404[[:space:]]' googlebot_access.txt

Optionally, pipe the output of that command into wc to count the number of 404 errors that Googlebot has received:

grep '[[:space:]]404[[:space:]]' googlebot_access.txt | wc -l

To find all the 301 redirects that Googlebot has hit and write them to a file called googlebot_301s.txt, try this:

grep '[[:space:]]301[[:space:]]' googlebot_access.txt >googlebot_301s.txt

Here is a sample line from an Apache log file:

66.249.65.77 - - [31/Oct/2006:05:22:34 -0500] "GET / HTTP/1.1" 200 30527 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

To learn how to read that information, check out the Apache log file docs.

Since the data is in columns, the cut command can be used to extract columns like this:

cat googlebot_301s.txt | cut -d' ' -f7

The cat command followed by a pipe sends the contents of the file into the cut command. The -d' ' option sets the column delimiter to a space, and -f7 extracts the 7th column, which in this case holds the URLs of the 301 redirects that Googlebot encountered.
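
The same idea works for other columns. In the log format shown above, the response code is the 9th space-separated field, so this tallies every response code Googlebot received, most common first (uniq -c prefixes each line with a count):

cut -d' ' -f9 googlebot_access.txt | sort | uniq -c | sort -rn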

You can combine grep and cut to extract only the URLs from the hits that returned a certain response code, in this case the URL of every 404 error that Googlebot encountered:

grep '[[:space:]]404[[:space:]]' googlebot_access.txt | cut -d' ' -f7 >googlebot_404_URLs.txt

After you run that command, you will have a file named googlebot_404_URLs.txt that contains the URL of every 404 error that was sent to Googlebot. Here is sample output from this site, showing URLs that Googlebot requested last month and got a 404 error for:

 /trackback/47
 /trackback/58
 /trackback/69
 /trackback/70
 /trackback/49
 /trackback/63
 /trackback/69
 /trackback/80
 /trackback/56
 /trackback/62
 /trackback/40
 /trackback/56
 /trackback/63
 /trackback/28
 /trackback/82
 /trackback/80
 /trackback/75
 /trackback/66

This tells me that I forgot to block the /trackback/ directory with robots.txt before I removed that Drupal module, and Googlebot is still looking for those URLs. Running that command shows that I should add another rule to the robots.txt file.
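
For reference, a rule along these lines in robots.txt tells compliant crawlers to stay out of that directory. This is only a sketch for this particular site; adjust the path for your own:

User-agent: *
Disallow: /trackback/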

You could also refine the command by sorting the URLs and showing only unique results, like this:

grep '[[:space:]]404[[:space:]]' googlebot_access.txt | cut -d' ' -f7 | sort | uniq >googlebot_404_URLs.txt
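
To also see how many times each missing URL was requested (notice the repeats in the sample output above), add the -c option to uniq and sort numerically, so the most-requested URLs come first:

grep '[[:space:]]404[[:space:]]' googlebot_access.txt | cut -d' ' -f7 | sort | uniq -c | sort -rn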

Head and Tail

Last week I was working with a log file that was over 4 GB in size. I couldn't download it, and I couldn't open it. Even grepping through it would have taken a long time. I only needed the last (most recent) part of the file. The command to extract the last part of a file is called tail, and it can easily be executed on the server over SSH.

If you want to get the last 100,000 lines of a file, type tail -n 100000 filename. You can then pipe those lines into grep and write the output to a file named googlebot_access like this:

tail -n 100000 access_log | grep Googlebot >googlebot_access

Then you can download the smaller googlebot_access file to your own computer.
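
One way to download it is with scp. The username, hostname, and path below are only placeholders; substitute your own:

scp username@yourserver.example.com:/path/to/googlebot_access .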

To extract only the beginning of a file, use the head command:

head -n 100000 access_log | grep Googlebot >googlebot_access

Note that the above examples do not use a .txt file extension; extensions are not required on Unix. I usually add one anyway because I often move files between Windows and Linux.

If you want to extract a range of lines from the middle of a large file, you can use sed. The following example extracts lines 5000 to 8000 from the file called access_log and saves them to a new file called new_file.txt:

sed -n '5000,8000p' access_log >new_file.txt
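
You can also pipe the extracted range straight into the other commands instead of saving it first. For example, this counts the Googlebot hits within that slice of the log:

sed -n '5000,8000p' access_log | grep Googlebot | wc -l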

Summary

These GNU/Unix tools are useful for quickly pulling search engine related data out of your log files, and they are well worth having in the toolkit.

Commands covered in this tutorial:

  • grep
  • tail
  • head
  • sed
  • wc
  • cat
  • sort
  • uniq
  • cut

Other Apache Logfile Resources

More Fun With Log Files

Interesting related Perl scripts can be found on John Bokma's site.

More log searching information coming soon...
