Extracting and Reconstructing URLs from an IIS Log

IIS logs are often configured to output the filename in one column and the query string in the following column. An example line from an IIS log is shown below; here the filename is /products.aspx and the query string is item=12345:

2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -

Unfortunately, IIS with its default settings does not seem to log the full URL that was requested. A list of the full URLs that Google accessed is useful as input for further processing.
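The steps described here work on a log file that contains only search engine hits. A minimal sketch of producing such a file is shown below, assuming the crawler identifies itself in the user agent field as in the example line above. The function name and the list of bot names are illustrative assumptions, not a complete list of crawlers.

```shell
# Hypothetical helper: keep only the log lines whose text mentions a known crawler.
# The bot names below are assumptions; adjust them to the crawlers you care about.
filter_search_engines() {
  # $1 = full IIS log file; matching lines are written to stdout
  grep -i -E 'googlebot|msnbot|slurp' "$1"
}
```

It could be used as `filter_search_engines access.log > se_access.txt`, where access.log stands in for whatever the full log file is called.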

The one-liner below performs four steps:

  1. grep searches a log file that contains only hits from search engines for 404 errors. This gives a list of every "404 Not Found" page that search engines are visiting.
  2. cut extracts columns 8 and 9: in this case, the page (/products.aspx) and the query string (item=12345).
  3. awk prints each result as http://www.example.com/[filename]?[query string]
  4. Because not every requested page has a query string, sed removes the trailing ?- that appears on hits without one.
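Step 2 only works if the page and query string really are in columns 8 and 9, which depends on the fields enabled for the log. IIS logs in W3C extended format normally carry a "#Fields:" header line that names each column, so the positions can be checked before running the one-liner. The sketch below is one way to do that; the function name is hypothetical.

```shell
# Hypothetical helper: print the 1-based column number that a named field
# occupies in the data lines, based on the log's "#Fields:" header.
field_column() {
  # $1 = field name (for example cs-uri-stem), $2 = log file
  awk -v name="$1" '
    /^#Fields:/ {
      # Token 1 is "#Fields:" itself, so the data column is the token index minus 1.
      for (i = 2; i <= NF; i++) {
        if ($i == name) { print i - 1; exit }
      }
    }
  ' "$2"
}
```

For example, `field_column cs-uri-stem se_access.txt` and `field_column cs-uri-query se_access.txt` give the two numbers to pass to cut, provided the header lines were kept in the file.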

(NOTE: I've used backslashes to escape the line endings. The following is a one-liner, but because of this Web page's formatting, I'm displaying it on multiple lines.)

grep '[[:space:]]404[[:space:]]' se_access.txt | \
cut -d' ' -f8,9 | \
awk '{ print "http://www.example.com"$1"?"$2 }' | \
sed 's/?-$//' > 404_errors.txt
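The same result can also be sketched as a single awk pass. This is an alternative, not the method used above: it tests the status column directly instead of grepping for " 404 " anywhere in the line, and it only appends "?" when a query string is present. It assumes the column layout of the example line (page in column 8, query string in column 9, status in column 10); the function name is hypothetical.

```shell
# Hypothetical helper: rebuild full URLs for 404 hits in one awk pass.
build_404_urls() {
  # $1 = log file of search engine hits
  # $2 = site prefix, for example http://www.example.com
  awk -v base="$2" '
    /^#/ { next }                      # skip W3C header lines such as "#Fields:"
    $10 == "404" {                     # status column, per the example line layout
      url = base $8                    # prefix + page
      if ($9 != "-") url = url "?" $9  # add the query string only when one exists
      print url
    }
  ' "$1"
}
```

It could be run as `build_404_urls se_access.txt http://www.example.com > 404_errors.txt`.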

The final result is a file named 404_errors.txt that lists the nonexistent URLs that search engines are requesting on the site. The example above would take the following line from an IIS log:

2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -

and convert it to the following line:

http://www.example.com/products.aspx?item=12345

A list of URLs that return 404 errors is very useful for debugging sites. The list of URLs can be processed further as needed.
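As one example of further processing, the sketch below counts how often each URL appears and lists the most requested ones first, which helps decide which missing pages to fix first. The function name is hypothetical.

```shell
# Hypothetical helper: count duplicate URLs and sort by hit count, highest first.
rank_urls() {
  # $1 = file with one URL per line, such as 404_errors.txt
  sort "$1" | uniq -c | sort -rn
}
```

Running `rank_urls 404_errors.txt` prints each distinct URL preceded by the number of times it was requested.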
