Extracting Search Engine Hits from Log Files

This page describes some ways to use the GNU/Linux terminal to extract search engine hits from a Web site's log files.

To extract just the Googlebot hits on the site from a logfile, try this:

grep 'Googlebot\/' access.log > googlebot_access.log

That will write the Googlebot hits to a new logfile called googlebot_access.log.
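As a quick sketch of the idea, here is the same command run against a tiny made-up log file (the entries below are illustrative, in Combined Log Format; real Googlebot lines will be longer):

```shell
# Create a small sample access.log (made-up entries for illustration).
cat > access.log <<'EOF'
66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1044 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/118.0"
66.249.66.1 - - [10/Oct/2023:13:57:12 +0000] "GET /old-page.html HTTP/1.1" 404 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Keep only the Googlebot lines.
grep 'Googlebot/' access.log > googlebot_access.log

# Count how many Googlebot hits were extracted (2 of the 3 lines above).
wc -l < googlebot_access.log
```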

You can also pipe that output into another command, for example to extract only the URLs that Googlebot is requesting:

grep 'Googlebot\/' access.log | \
cut -d' ' -f7 > googlebot_urls_access.log

(The above line assumes that the URLs are in column 7 of the log file. You might have to adjust it based on your log file format.)
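If you are not sure which column the URL is in, one way to check (a sketch, using a single made-up log line) is to have awk print every space-delimited field with its position number:

```shell
# A single made-up log line in Combined Log Format.
line='66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Googlebot/2.1"'

# Print each space-delimited field with its position, so you can see
# which number to pass to cut's -f option (here the URL is field 7).
echo "$line" | awk '{ for (i = 1; i <= NF; i++) print i, $i }'
```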

Once you have a file with Google's hits on your site, you can then grep it for specific response codes. For example, to get all the 404 Not Found pages that Googlebot is hitting, you could use the following:

grep 'Googlebot\/' access.log | \
grep '[[:space:]]404[[:space:]]' > googlebot_404s.txt

Having a list of URLs that send 404s to search engines can tell you interesting information:

  • The location of pages that used to exist on an older version of the site that were not redirected with a 301 to their new locations (hint: some of them may still have inbound links and even PageRank)
  • Inbound links that have typos in the URLs, or that go to pages that were removed at some point
  • and more...
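To make that list easier to act on, you can rank the 404 URLs by how often Googlebot requests them. This sketch builds on the pipeline above (the log entries and filenames here are illustrative, and it again assumes the URL is in column 7):

```shell
# Create a small sample access.log (made-up entries for illustration).
cat > access.log <<'EOF'
66.249.66.1 - - [10/Oct/2023:14:00:01 +0000] "GET /old-page.html HTTP/1.1" 404 512 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2023:14:00:02 +0000] "GET /old-page.html HTTP/1.1" 404 512 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2023:14:00:03 +0000] "GET /typo-url.html HTTP/1.1" 404 512 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2023:14:00:04 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Googlebot/2.1"
EOF

# Googlebot hits -> 404s only -> just the URL column ->
# count duplicates -> most frequent first.
grep 'Googlebot/' access.log | \
grep '[[:space:]]404[[:space:]]' | \
cut -d' ' -f7 | \
sort | uniq -c | sort -rn > googlebot_404_counts.txt

cat googlebot_404_counts.txt
```

The URLs at the top of the resulting file are the ones to redirect or fix first.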

You can also grep for other response codes (301, 302, 500, etc.), which will usually turn up other interesting information, especially on a large site.
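The pattern is the same for any code; for example, to pull 301 and 302 redirects in one pass with egrep (a sketch with a made-up log file):

```shell
# Create a small sample access.log (made-up entries for illustration).
cat > access.log <<'EOF'
66.249.66.1 - - [10/Oct/2023:14:01:00 +0000] "GET /old HTTP/1.1" 301 0 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2023:14:02:00 +0000] "GET /tmp HTTP/1.1" 302 0 "-" "Googlebot/2.1"
66.249.66.1 - - [10/Oct/2023:14:03:00 +0000] "GET /ok HTTP/1.1" 200 99 "-" "Googlebot/2.1"
EOF

# Match either redirect code. Anchoring with [[:space:]] keeps a byte
# count or part of an IP address from triggering a false positive.
grep 'Googlebot/' access.log | \
egrep '[[:space:]](301|302)[[:space:]]' > googlebot_redirects.txt

wc -l < googlebot_redirects.txt
```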

You can extract all major search engine hits and convert them to a CSV file for processing in a spreadsheet. A space-delimited file will open in a spreadsheet, but converting it to a comma-delimited format lets you keep blank columns in case you need to remove the dashes in the log file.

The following one-liner does four things:

  1. Use egrep to extract any lines that contain Googlebot/, Yahoo!, or msnbot from a file named access.log
  2. Use tee to write that output to a file called se_access.txt, and also send the output to the next command
  3. Use cut to extract columns 1 to 15 (delimited by spaces) and write them with commas as the delimiter to a file named se_access.csv
  4. Open the CSV file with OpenOffice.

egrep '(Googlebot\/|Yahoo!|msnbot)' access.log | \
tee se_access.txt | \
cut -d' ' -f1-15 --output-delimiter=, \
> se_access.csv; openoffice -calc se_access.csv