IIS logs are often configured to output the filename in one column and the query string in the following column. An example of a line from an IIS log is shown below; the filename is /products.aspx and the query string is item=12345:
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
Unfortunately, the default settings on IIS do not seem to output the actual full URLs requested. It may be useful to get a list of URLs that were accessed by Google in order to process them further.
The one-liner below does four things:
- It greps a log file that contains only hits from search engines for 404 errors. This gives a list of every "404 Not Found" page that search engines are visiting.
- It uses the cut command to extract columns 8 and 9: in this case, the page (/products.aspx) and the query string (item=12345).
- It uses awk to print out http://www.example.com/[filename]?[query string]
- Because not every requested page has a query string (IIS logs a - in that column), it uses sed to remove the ?- left at the end of hits that don't have one.
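The first step assumes a log file that already contains only search engine hits. One way such a file might be produced is to filter the raw log on the user-agent column. The sketch below is only an illustration: the crawler names in the pattern (other than Googlebot, which appears in the sample line) and the second sample line are assumptions, not taken from a real log.

```shell
# Hypothetical raw log: one crawler hit and one ordinary browser hit.
raw=$(mktemp)
cat > "$raw" <<'EOF'
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
2006-10-19 00:22:50 192.0.2.10 - nnn.nnn.nn.nnn 80 GET /index.aspx - 200 0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) - -
EOF

# Keep only lines that mention a crawler, ignoring case.
# The list of crawler names is an assumption; adjust it to the bots you care about.
se_hits=$(grep -i -E 'googlebot|slurp|msnbot' "$raw")

printf '%s\n' "$se_hits"
rm -f "$raw"
```

Against a real log, the same grep would read the IIS log file directly and redirect its output to se_access.txt.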
(NOTE: I've used backslashes to escape the end of each line. The following is a one-liner, but because of this Web page's formatting, I'm displaying it on multiple lines.)
grep '[[:space:]]404[[:space:]]' se_access.txt | \
cut -d' ' -f8,9 | \
awk '{ print "http://www.example.com"$1"?"$2}' | \
sed 's/?-$//' > 404_errors.txt
The final result is a file named 404_errors.txt that contains a list of nonexistent URLs that search engines are requesting on the site. The example above would take the following line from an IIS log:
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
and convert it to the following line:
http://www.example.com/products.aspx?item=12345
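The same conversion can also be sketched in a single awk call, which may be easier to adapt. This variant is not the one-liner described in this post: it tests the status column itself (field 10 in this log layout) instead of grepping for 404 anywhere on the line, and it only appends the query string when that column is not a dash. The second sample line is a hypothetical hit without a query string.

```shell
# Sample input: the log line from the example, plus a hypothetical hit
# with no query string (a "-" in the query string column).
log=$(mktemp)
cat > "$log" <<'EOF'
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
2006-10-19 00:23:05 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /old-page.aspx - 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
EOF

# Field 8 is the page, field 9 the query string, field 10 the status code.
# These positions are an assumption that holds for the column layout shown here.
urls=$(awk '$10 == 404 {
    url = "http://www.example.com" $8
    if ($9 != "-") url = url "?" $9   # skip the "?" when there is no query string
    print url
}' "$log")

printf '%s\n' "$urls"
rm -f "$log"
```

If the log is configured with different columns, the field numbers need to change to match.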
A list of URLs that return 404s is very useful for debugging sites. The list can be processed further as needed.
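As one example of further processing, the list can be ranked so that the most frequently requested missing URLs come first. The sketch below uses a small hypothetical stand-in for 404_errors.txt; the URLs in it are made up for illustration.

```shell
# Hypothetical stand-in for 404_errors.txt, with one URL requested twice.
urls=$(mktemp)
cat > "$urls" <<'EOF'
http://www.example.com/products.aspx?item=12345
http://www.example.com/old-page.aspx
http://www.example.com/products.aspx?item=12345
EOF

# sort groups identical URLs together, uniq -c prefixes each with its count,
# and sort -rn puts the highest counts at the top.
ranked=$(sort "$urls" | uniq -c | sort -rn)

printf '%s\n' "$ranked"
rm -f "$urls"
```

Each output line is a count followed by a URL, so the pages that search engines request most often are the first candidates for a redirect or a fix.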