This page describes some ways to use the GNU/Linux terminal to extract search engine hits from a Web site's log files.
To extract just the Googlebot hits on the site from a logfile, try this:
grep 'Googlebot/' access.log > googlebot_access.log
That will write the Googlebot hits to a new logfile called googlebot_access.log.
You can also pipe that output into another command, for example to extract only the URLs that Googlebot is requesting:
grep 'Googlebot/' access.log | \
cut -d' ' -f7 > googlebot_urls_access.log
(The above command assumes that the URLs are in column 7 of the log file. You might have to adjust it based on your log file format.)
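Once the URLs are isolated, a natural next step is to see which ones Googlebot requests most often. The following is only a rough sketch: the function name top_googlebot_urls is made up for this example, and it assumes the URL is in column 7, as in the command above.

```shell
# Hypothetical helper (not part of any standard tool): list the URLs that
# Googlebot requests, most frequently requested first.
# Assumption: the request path is the 7th space-separated field.
top_googlebot_urls() {
    grep 'Googlebot/' "$1" |
        cut -d' ' -f7 |
        sort |
        uniq -c |
        sort -rn
}
```

You could call it as top_googlebot_urls access.log | head -20 to see the twenty most-requested URLs, each prefixed by its hit count.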
Once you have a file with Google's hits on your site, you can then grep it for specific response codes. For example, to get all the 404 Not Found pages that Googlebot is hitting, you could use the following:
grep 'Googlebot/' access.log | \
grep '[[:space:]]404[[:space:]]' > googlebot_404s.txt
Having a list of URLs that return 404s to search engines can tell you interesting information.
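Note that grepping for a space-delimited 404 keeps whole log lines and can also match a line whose response size happens to be 404 bytes. If you want just the URLs, one possible approach is to test the status column directly with awk. This is a sketch under assumptions: the name googlebot_404_urls is invented here, and the status code is assumed to be in column 9 and the URL in column 7, which may not match your log format.

```shell
# Hypothetical helper: print each distinct URL that returned a 404 to Googlebot.
# Assumptions: field 7 is the request path and field 9 is the status code.
googlebot_404_urls() {
    grep 'Googlebot/' "$1" |
        awk '$9 == 404 { print $7 }' |
        sort -u
}
```

For example, googlebot_404_urls access.log > googlebot_404_urls.txt would write one URL per line, with duplicates removed.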
You can also grep for other response codes (302, 301, 500, etc.), which will usually provide other interesting information, especially on a large site.
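Rather than grepping for each response code in turn, you might first want an overview of which codes Googlebot is receiving. Here is a minimal sketch; the function name googlebot_status_counts is hypothetical, and it assumes the status code is in column 9 of your log file.

```shell
# Hypothetical helper: count Googlebot hits per HTTP status code.
# Assumption: the status code is the 9th whitespace-separated field.
# Output is one "code count" pair per line, sorted by code.
googlebot_status_counts() {
    grep 'Googlebot/' "$1" |
        awk '{ count[$9]++ } END { for (code in count) print code, count[code] }' |
        sort
}
```

Running googlebot_status_counts access.log gives a quick idea of which codes are worth a closer look before you grep for them individually.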
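You can extract the hits from all the major search engines and convert them to a CSV file for processing in a spreadsheet. You can open a space-delimited file in a spreadsheet, but converting it to a comma-delimited format will allow you to have blank columns in case you need to remove the dashes in the log file.
The following one-liner extracts the Googlebot, Yahoo! and msnbot hits, saves a copy of the matching lines to se_access.txt, converts the first 15 space-delimited columns to comma-delimited format in se_access.csv, and then opens that file in OpenOffice Calc: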
grep -E '(Googlebot/|Yahoo!|msnbot)' access.log | \
tee se_access.txt | \
cut -d' ' -f1-15 --output-delimiter=, \
> se_access.csv; openoffice -calc se_access.csv
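The --output-delimiter option is a GNU extension to cut. If you ever need the same conversion on a system whose cut lacks it, awk can do a similar job. The sketch below is only one possible approach: se_hits_to_csv is a made-up name, and unlike cut -d' ', awk treats runs of spaces as a single separator, so the result can differ on unusually formatted lines. Like the cut version, it does no CSV quoting, so a comma inside a URL or user agent will split into an extra column.

```shell
# Hypothetical helper: print search engine hits as comma-delimited lines,
# keeping at most the first 15 whitespace-separated fields of each line.
# No CSV quoting is applied; fields that contain commas will split.
se_hits_to_csv() {
    grep -E 'Googlebot/|Yahoo!|msnbot' "$1" |
        awk '{
            n = (NF < 15) ? NF : 15
            line = $1
            for (i = 2; i <= n; i++) line = line "," $i
            print line
        }'
}
```

You could then run se_hits_to_csv access.log > se_access.csv and open the result in your spreadsheet as before.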