Are you curious where Google goes when it spiders your Web site? There are several GNU tools that can quickly give you useful information.
These tips will probably work on any Unix-based operating system including Mac OS X and BSD, as well as GNU / Linux. I am using them on Ubuntu 6.06 Dapper Drake. If you are using Windows, consider getting Cygwin or using a live CD.
There are many excellent GNU or Unix tools for the SEO toolkit. This post is just an introduction.
You can either download your raw access log files, or connect to your Linux or BSD server over SSH and run the commands there. This tutorial assumes you have a standard access log downloaded to your computer and that you are in the same directory as your access log. In this tutorial, the access log is named access_log. Just replace access_log with the name of your log file, or rename your log file to access_log. The log file in this example is 356,162 lines long and is a little large to comfortably open in a text editor. Some log files are millions of lines long and cannot reasonably be opened in graphical text editors. These tools are able to extract information from them without your needing to view the files.
You can learn more about each of the following commands by typing man [command] in the terminal, replacing [command] with the actual name of the command.
If you don't already know what a regular expression is, read the Wikipedia article on regular expressions first. Also try typing man grep in the terminal.
If you don't know what it means to pipe one command to another, see also the Learning the GNU / Linux Command Line tutorial for an introduction to how it works.
The name grep comes from the ed editor command g/re/p, which means "globally search for a regular expression and print".
Tip: For documentation on the grep command, type man grep in the terminal.
The syntax for the grep command is grep [options] PATTERN [FILE...].
To grep for hits that Googlebot has made on your site and save them to a file named googlebot_access.txt, use this line:
grep Googlebot access_log >googlebot_access.txt
You can then count the number of hits that Googlebot has made on the site with the wc -l command:
wc -l googlebot_access.txt
Tip: For documentation on the wc command, type man wc in the terminal.
The number of Googlebot hits in my sample log file is 1,825 out of 356,162.
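If you only need the count, you may be able to skip the intermediate file: grep's -c option prints the number of matching lines instead of the lines themselves. The sketch below is only an illustration. It builds a three-line sample log in a temporary directory so that it can be run anywhere; the IP addresses, URLs, and the Firefox line are made up.

```shell
# Work in a scratch directory so that a real access_log is never touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three made-up log lines in Apache combined format (hypothetical data).
cat > access_log <<'EOF'
192.0.2.1 - - [31/Oct/2006:05:22:34 -0500] "GET / HTTP/1.1" 200 30527 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.2 - - [31/Oct/2006:05:23:01 -0500] "GET /about HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US) Firefox/2.0"
192.0.2.1 - - [31/Oct/2006:05:24:10 -0500] "GET /trackback/47 HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# -c counts the matching lines, so no googlebot_access.txt file is needed.
googlebot_hits=$(grep -c Googlebot access_log)

# Reading from stdin keeps the file name out of wc's output;
# tr strips the padding that some versions of wc add.
total_hits=$(wc -l < access_log | tr -d ' ')

echo "Googlebot: $googlebot_hits of $total_hits hits"
```

Saving the matches to a file, as shown earlier, is still the better choice when you plan to run several more commands against the Googlebot hits.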
To grep for Yahoo or MSN, use one of the following lines:
grep Yahoo access_log >yahoobot_access.txt
grep msnbot access_log >msnbot_access.txt
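To compare crawlers side by side, the same grep can be wrapped in a loop. This is a rough sketch rather than a finished report: the sample log is made up, and the search strings are simply the three used above, so adjust them to whatever user agents appear in your own logs.

```shell
# Work in a scratch directory so that a real access_log is never touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up sample log (hypothetical IP addresses and requests).
cat > access_log <<'EOF'
192.0.2.1 - - [31/Oct/2006:05:22:34 -0500] "GET / HTTP/1.1" 200 30527 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.7 - - [31/Oct/2006:05:25:02 -0500] "GET / HTTP/1.1" 200 30527 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
192.0.2.9 - - [31/Oct/2006:05:26:40 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
192.0.2.1 - - [31/Oct/2006:05:27:15 -0500] "GET /about HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Print one "name count" line per crawler; grep -c counts matching lines.
crawler_summary=$(
    for bot in Googlebot Yahoo msnbot; do
        printf '%s %s\n' "$bot" "$(grep -c "$bot" access_log)"
    done
)

printf '%s\n' "$crawler_summary"
```

grep is case-sensitive by default, so "Yahoo" here matches the "Yahoo! Slurp" user agent but not the lowercase "yahoo" in the help URL.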
You can find the 404 errors with the grep command. The [[:space:]] before and after the 404 keeps the pattern from matching 404 elsewhere in the line, such as inside the IP address or a URL. Put the pattern in single quotes so that the shell does not try to expand the square brackets as a filename wildcard:
grep '[[:space:]]404[[:space:]]' googlebot_access.txt
Optionally pipe the output of the command to the wc command to count the number of 404 errors that Googlebot has received:
grep '[[:space:]]404[[:space:]]' googlebot_access.txt | wc -l
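One caveat: the response size in a log line is also surrounded by spaces, so a 200 response that happens to be 404 bytes long matches the same pattern. A stricter alternative, not part of the grep approach above, is to compare the status column directly with awk. The sketch below assumes the standard Apache layout, in which the status code is the 9th whitespace-separated field, and uses made-up log lines.

```shell
# Work in a scratch directory so that real log files are never touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up Googlebot hits: the first line is a 200 response of 404 bytes.
cat > googlebot_access.txt <<'EOF'
192.0.2.1 - - [31/Oct/2006:05:22:34 -0500] "GET /tiny.gif HTTP/1.1" 200 404 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [31/Oct/2006:05:24:10 -0500] "GET /trackback/47 HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [31/Oct/2006:05:27:15 -0500] "GET / HTTP/1.1" 200 30527 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# The space-delimited pattern also counts the 404-byte response.
grep_count=$(grep -c '[[:space:]]404[[:space:]]' googlebot_access.txt)

# Comparing field 9 counts only lines whose status code is 404.
# "n + 0" makes awk print 0 rather than an empty string when nothing matches.
field_count=$(awk '$9 == 404 { n++ } END { print n + 0 }' googlebot_access.txt)

echo "grep pattern: $grep_count, status field: $field_count"
```

The field position holds only when the requested URL contains no spaces, which is normally the case because spaces in URLs are encoded.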
To find all the 301 redirects that Googlebot has hit and write them to a file called googlebot_301s.txt, try this:
grep '[[:space:]]301[[:space:]]' googlebot_access.txt >googlebot_301s.txt
Here is a sample line from an Apache log file:
220.127.116.11 - - [31/Oct/2006:05:22:34 -0500] "GET / HTTP/1.1" 200 30527 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
To learn how to read that information, check out the Apache log file docs.
Since the data is in columns, the cut command can be used to extract columns like this:
cat googlebot_301s.txt | cut -d' ' -f7
The cat command followed by a pipe sends the contents of the file into the cut command. The -d' ' option sets the column delimiter to a space. The -f7 part extracts the 7th column, which in this case holds the URLs that returned a 301 redirect to Googlebot.
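cut can also read the file directly and extract more than one column at a time. As a small sketch with made-up log lines, the following prints each requested URL next to its status code (columns 7 and 9 in the standard Apache layout):

```shell
# Work in a scratch directory so that real log files are never touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up Googlebot hits (hypothetical URLs and sizes).
cat > googlebot_access.txt <<'EOF'
192.0.2.1 - - [31/Oct/2006:05:22:34 -0500] "GET /old-page HTTP/1.1" 301 245 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [31/Oct/2006:05:24:10 -0500] "GET /trackback/47 HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# -f7,9 selects columns 7 and 9; cut joins them with the same delimiter.
url_and_status=$(cut -d' ' -f7,9 googlebot_access.txt)

printf '%s\n' "$url_and_status"
```

Because cut treats every single space as a delimiter, this relies on the log fields being separated by exactly one space, as they are in the sample line above.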
You can make the output easier to read by extracting only the URLs from the hits that returned a certain response code. In this case, the command saves the URL of every 404 error that Googlebot encountered:
grep '[[:space:]]404[[:space:]]' googlebot_access.txt | cut -d' ' -f7 >googlebot_404_URLs.txt
After you run that command, you will have a file named googlebot_404_URLs.txt that contains the URLs for every 404 error that was sent to Googlebot. Here is sample output: URLs on this site that Googlebot requested last month and that returned a 404 error:
/trackback/47
/trackback/58
/trackback/69
/trackback/70
/trackback/49
/trackback/63
/trackback/69
/trackback/80
/trackback/56
/trackback/62
/trackback/40
/trackback/56
/trackback/63
/trackback/28
/trackback/82
/trackback/80
/trackback/75
/trackback/66
This tells me that I forgot to block the /trackback/ directory with robots.txt before I removed that Drupal module, and Googlebot is still looking for those URLs. I should add another rule to the robots.txt file.
You could also refine the command by sorting the URLs and showing each URL only once, like this:
grep '[[:space:]]404[[:space:]]' googlebot_access.txt | cut -d' ' -f7 | sort | uniq >googlebot_404_URLs.txt
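A possible next step is to count how often each missing URL was requested, so that the most frequently hit 404s appear first. The sketch below uses uniq -c to prefix each URL with its count and sort -rn to order by that count. The log_line function is a hypothetical helper that only generates made-up sample data.

```shell
# Work in a scratch directory so that real log files are never touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: print one Googlebot log line for URL $1 with status $2.
log_line() {
    printf '192.0.2.1 - - [31/Oct/2006:05:22:34 -0500] "GET %s HTTP/1.1" %s 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"\n' "$1" "$2"
}

# Made-up sample data: six 404 hits and one successful request.
{
    log_line /trackback/69 404
    log_line /trackback/80 404
    log_line /trackback/69 404
    log_line / 200
    log_line /trackback/47 404
    log_line /trackback/80 404
    log_line /trackback/69 404
} > googlebot_access.txt

# uniq -c needs sorted input; sort -rn then puts the largest counts first,
# and head keeps only the two most requested URLs.
top_404s=$(grep '[[:space:]]404[[:space:]]' googlebot_access.txt \
    | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -n 2)

printf '%s\n' "$top_404s"
```

The count appears in front of each URL, padded with leading spaces; the amount of padding differs between versions of uniq.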
Last week I was working with a log file that was over 4 GB in size. I couldn't download it, and I couldn't open it. Even grepping through it would have taken a long time. I only needed the last (most recent) part of the file. The command to extract the last part of a file is called tail, and it can easily be executed on the server over SSH.
If you want to get the last 100,000 lines of a file, type tail -n 100000 filename. You can then pipe those lines into grep and print the output to a file named googlebot_access like this:
tail -n 100000 access_log | grep Googlebot >googlebot_access
Then you can download the smaller googlebot_access file to your own computer.
To extract only the beginning of a file, use the head command:
head -n 100000 access_log | grep Googlebot >googlebot_access
Note in the above examples that you don't need a .txt file extension on Unix. I usually add it, though, because I often move files between Windows and Linux.
If you want to extract a range of lines from the middle of a large file, you can use sed. The following example extracts lines 5000 to 8000 from the file called access_log and saves them to a new file called new_file.txt:
sed -n '5000,8000p' access_log >new_file.txt
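On a very large file, sed keeps reading to the end even after it has printed the last line you asked for. Adding a q (quit) command at the end of the range should stop it early. The sketch below shows the idea on a generated 20-line file (sample.txt and its contents are made up), extracting lines 5 to 8.

```shell
# Work in a scratch directory so that real log files are never touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate a 20-line sample file: "line 1" through "line 20".
i=1
while [ "$i" -le 20 ]; do
    echo "line $i"
    i=$((i + 1))
done > sample.txt

# Print lines 5 to 8, then quit at line 8 instead of reading lines 9 to 20.
middle=$(sed -n '5,8p;8q' sample.txt)

printf '%s\n' "$middle"
```

For the example above, the equivalent command would be sed -n '5000,8000p;8000q' access_log >new_file.txt.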
These GNU/Unix tools are worth keeping in your toolkit for quickly getting search engine related data out of your log files.
Commands covered in this tutorial: man, grep, wc, cat, cut, sort, uniq, tail, head, and sed.
Interesting related Perl scripts from John Bokma's site: