IIS logs are often configured to output the filename in one column and the query string in the following column. An example of a line from an IIS log is shown below; the filename (/products.aspx) and query string (item=12345) are the eighth and ninth space-separated fields:
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
Unfortunately, the default settings on IIS do not seem to output the actual full URLs requested. It may be useful to get a list of URLs that were accessed by Google in order to process them further.
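Because the column positions depend on how logging is configured, it can help to confirm them before hard-coding field numbers. Here is a minimal sketch, assuming the log is in the W3C extended format and still contains its `#Fields:` header line. The helper name `list_fields` and the header in the demo are made up for illustration; they are not taken from a real log.

```shell
# Print each column name from the "#Fields:" header together with the
# position that cut/awk would use for it on the data lines.
# The "#Fields:" token itself is not a column, hence i - 1.
list_fields() {
  awk '/^#Fields:/ { for (i = 2; i <= NF; i++) print i - 1, $i; exit }' "$@"
}

# Hypothetical header, for illustration only. Against a real log you would
# run something like: list_fields se_access.txt
printf '%s\n' '#Fields: date time c-ip cs-uri-stem cs-uri-query sc-status' | list_fields
```

If the header has been stripped from the file, counting the fields by hand on a sample line works just as well.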
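The following one-liner filters the log for lines with a 404 status, cuts out the filename and query string columns, joins them into a full URL, and strips the trailing "?-" left behind when a request has no query string.
(NOTE: I've used backslashes to escape the end of each line. The following is a one-liner, but because of this Web page's formatting, I'm displaying it on multiple lines.)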
grep '[[:space:]]404[[:space:]]' se_access.txt | \
cut -d' ' -f8,9 | \
awk '{ print "http://www.example.com"$1"?"$2}' | \
sed 's/\?-//g' > 404_errors.txt
The final result is a file named 404_errors.txt that lists the nonexistent URLs that search engines are requesting on the site. The example above would take the following line from an IIS log:
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
and convert it to the following line:
http://www.example.com/products.aspx?item=12345
A list of URLs that return 404s is very useful for debugging sites. The list of URLs can be processed further as needed.
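As one possible way to process the list further, the sketch below ranks the missing URLs by how often they were requested. It also compares the status column exactly, since the grep pattern matches a 404 in any column, not only the status. It assumes the same column layout as the example line (filename in field 8, query string in field 9, status in field 10). The function names `urls_404` and `count_404` and the output file `404_counts.txt` are hypothetical.

```shell
# Build full URLs for requests whose status column is exactly 404.
# Assumes the layout of the example line: $8 = filename, $9 = query
# string ("-" when empty), $10 = status. Adjust for other configurations.
urls_404() {
  awk -v base="http://www.example.com" '
    /^#/ { next }                      # skip header lines such as #Fields:
    $10 == "404" {
      url = base $8
      if ($9 != "-") url = url "?" $9  # append the query string only if present
      print url
    }' "$@"
}

# Rank the missing URLs by how often they were requested.
count_404() {
  urls_404 "$@" | sort | uniq -c | sort -rn
}

# Usage on the log file from the one-liner (output file name is arbitrary):
#   count_404 se_access.txt > 404_counts.txt
```

Each output line is a request count followed by the URL, with the most frequently requested missing pages at the top.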