This section contains actual techniques that can be used to analyze search engine activity on a Web site. As mentioned elsewhere, these definitely work on GNU/Linux (in my case, Ubuntu 6.06), but should also work on BSD, Mac OS X, and even Windows (with Cygwin). For best results, use a Unix-based operating system for these techniques and not Windows.
This is an ongoing project and many more pages will be added to this section soon. If you have any script recipes to add, leave a comment, or send me an email. The primary focus is on basic shell scripting, but scripts in Ruby, Python, PHP, Perl and other languages may also be added.
This page describes some ways to use the GNU/Linux terminal to extract search engine hits from a Web site's log files.
To extract just the Googlebot hits on the site from a logfile, try this:
grep 'Googlebot\/' access.log > googlebot_access.log
That will write the Googlebot hits to a new logfile called googlebot_access.log.
You can also pipe that output into another command, for example to extract only the URLs that Googlebot is requesting:
grep 'Googlebot\/' access.log | \
cut -d' ' -f7 > googlebot_urls_access.log
(The above line assumes that the URLs are in column 7 of the log file. You might have to adjust it based on your log file format.)
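As a rough sketch of where this can go next, the extracted URLs can be counted to show which pages Googlebot requests most often. The function name top_googlebot_urls and the sample log lines are made up for illustration, and the field number assumes the same log layout as above (URL in column 7).

```shell
# Hypothetical helper: read a log on stdin and print the URLs requested by
# Googlebot with a hit count, most requested first.
# Assumes the URL is in column 7 of a space-delimited log.
top_googlebot_urls() {
    grep 'Googlebot/' | cut -d' ' -f7 | sort | uniq -c | sort -rn
}

# Made-up sample lines so the sketch runs without a real log file.
top_googlebot_urls <<'EOF'
66.249.65.99 - - [19/Oct/2006:00:22:41 -0700] "GET /products.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.99 - - [19/Oct/2006:00:23:10 -0700] "GET /about.html HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.99 - - [19/Oct/2006:00:25:02 -0700] "GET /products.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.10 - - [19/Oct/2006:00:26:30 -0700] "GET /about.html HTTP/1.1" 200 734 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"
EOF
```

With a real log this would be run as top_googlebot_urls < access.log.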
Once you have a file with Google's hits on your site, you can then grep it for specific response codes. For example, to get all the 404 Not Found pages that Googlebot is hitting, you could use the following:
grep 'Googlebot\/' access.log | \
grep '[[:space:]]404[[:space:]]' > googlebot_404s.txt
Having a list of URLs that send 404s to search engines can tell you interesting information.
You can also grep for other response codes (302, 301, 500, etc.), which will usually provide other interesting information, especially on a large site.
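Rather than grepping for one response code at a time, all of them can be tallied in one pass. This is only a sketch: the function name googlebot_status_counts and the sample lines are invented for illustration, and it assumes a log layout where the response code is in column 9 (as it is when the URL is in column 7).

```shell
# Hypothetical helper: read a log on stdin and count how many times each
# response code was sent to Googlebot, most frequent first.
# Assumes the response code is in column 9.
googlebot_status_counts() {
    grep 'Googlebot/' | awk '{ print $9 }' | sort | uniq -c | sort -rn
}

# Made-up sample lines so the sketch runs without a real log file.
googlebot_status_counts <<'EOF'
66.249.65.99 - - [19/Oct/2006:00:22:41 -0700] "GET /products.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.99 - - [19/Oct/2006:00:23:10 -0700] "GET /about.html HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.99 - - [19/Oct/2006:00:25:02 -0700] "GET /missing.html HTTP/1.1" 404 210 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.10 - - [19/Oct/2006:00:26:30 -0700] "GET /broken.html HTTP/1.1" 500 0 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"
EOF
```

With a real log this would be run as googlebot_status_counts < access.log.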
You can extract all major search engine information and convert it to a CSV file for processing in a spreadsheet. You can open a space-delimited file in a spreadsheet, but converting it to a comma-delimited format will allow you to have blank columns in case you need to remove the dashes in the log file.
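The following one-liner extracts the Googlebot, Yahoo! and msnbot hits from access.log, saves a copy of those lines to se_access.txt, converts the first 15 space-delimited columns to comma-delimited output in se_access.csv, and opens that file in OpenOffice Calc: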
egrep '(Googlebot\/|Yahoo!|msnbot)' access.log | \
tee se_access.txt | \
cut -d' ' -f1-15 --output-delimiter=, \
> se_access.csv; openoffice -calc se_access.csv
Sometimes you will end up with a list of URLs that you would like to check the HTTP response codes on. You might have 200 pages that are sending Google a 302 redirect header and you would like to check them all at once.
This very rough example reads a list of URLs from a file, fetches their HTTP response codes and redirect location, and prints them to the screen:
while read -r inputline
do
    url="$inputline"
    headers="$(lynx -dump -head "$url" | grep -e HTTP -e Location)"
    echo "$url $headers"
    sleep 2
done < filename.txt
It is a rough script because the Location field of the headers returned by Lynx sometimes spans two lines. (I'm going to fix that problem soon.)
The sleep command tells the script to pause for 2 seconds between requests. It is optional, but if I am requesting a lot of URLs from one site, I usually pause between requests so that it doesn't make the server do too much work at once.
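One possible way around the two-line Location problem is to swap Lynx for curl, which can print the response code and redirect target on a single line. This is a sketch under that assumption, not the method used above: the function names fetch_status and check_urls and the DELAY variable are made up for illustration.

```shell
# Hypothetical helper: print "STATUS REDIRECT-TARGET" for one URL.
# Uses curl instead of lynx (an assumption). -I sends a HEAD request,
# -o /dev/null discards the headers, and -w prints only the two values.
fetch_status() {
    curl -s -o /dev/null -I -w '%{http_code} %{redirect_url}' "$1"
}

# Hypothetical helper: read URLs on stdin, one per line, and print each URL
# followed by its response code and redirect target (if any).
check_urls() {
    while read -r url
    do
        [ -z "$url" ] && continue      # skip blank lines
        echo "$url $(fetch_status "$url")"
        sleep "${DELAY:-2}"            # pause between requests, 2 seconds by default
    done
}

# Usage:
#   check_urls < filename.txt
```

Keeping the fetch in its own function means it can be replaced (for example, with a Lynx-based version) without touching the loop.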
The basic syntax for processing a file line-by-line in the shell is:
while read inputline
do
    [some commands here]
done < [input filename]
You can check the year that a domain was registered with the following command:
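whois example.com | grep -i 'creat' | head -n1 | grep -Eo '[[:digit:]]{4}'
The above line looks up the whois record for the domain, keeps the lines that contain "creat" (ignoring case), takes the first of them, and extracts the four-digit year from it. The -E option is needed so that grep treats {4} as a repetition count.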
You can extract the exact day with the following command:
whois example.com | grep -i 'creat' | head -n1 | \
egrep -o '[[:digit:]]{2}-[a-zA-Z0-9]{1,10}-[[:digit:]]{4}'
It works in a similar manner to the first example, but uses a regular expression to extract the full date.
You can also run this on a list of domains in a text file by reading each line of the file.
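A minimal sketch of that loop might look like the following. The function names creation_year and domain_years, the DELAY variable, and the filename domains.txt are all made up for illustration, and whois output varies between registries, so the year extraction may need adjusting.

```shell
# Hypothetical helper: read whois output on stdin and print the first
# four-digit number on the first line that contains "creat".
creation_year() {
    grep -i 'creat' | head -n1 | grep -Eo '[[:digit:]]{4}' | head -n1
}

# Hypothetical helper: read domain names on stdin, one per line, and print
# each domain followed by its creation year.
domain_years() {
    while read -r domain
    do
        [ -z "$domain" ] && continue       # skip blank lines
        echo "$domain $(whois "$domain" | creation_year)"
        sleep "${DELAY:-2}"                # pause between lookups
    done
}

# Usage (domains.txt is an assumed filename):
#   domain_years < domains.txt
```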
Lynx can be used with the -dump option to dump the text and links from a Web page in the terminal. That output can then be piped into the grep command, which can extract the URLs or other information.
The following line will count the number of outgoing links on a Web page, including internal and external links:
lynx -dump "http://www.example.com/" | grep -o "http.*" | wc -l
See my other GNU/Linux Lynx tutorial for more details on how lynx and grep can work together to extract links. The wc -l command counts the number of lines. In this case, each line is one link, so counting the lines gives you the number of links on a Web page.
lynx -dump "http://www.example.com/" | grep -o "http.*" | grep -v "http://www.example.com" | wc -l
Using grep with the -v option tells it to give you all of the lines that don't match. In this case it will give you all of the links that don't include the domain name of the current Web page.
lynx -dump "http://www.example.com/" | grep -o "http://www.example.com" | wc -l
Similar to the above example, this will only count URLs that do include the domain name of the current Web page.
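The three counts can also be produced from a single fetch of the page. The following is a sketch only: the function name link_counts and the sample link list are invented for illustration, and it treats a link as internal when it starts with the site address, which is slightly stricter than the substring match used above.

```shell
# Hypothetical helper: read "lynx -dump" output on stdin and print the total,
# internal and external link counts. The site address is the first argument.
link_counts() {
    grep -o 'http.*' | awk -v site="$1" '
        { total++ }                           # every extracted line is one link
        index($0, site) == 1 { internal++ }   # link starts with the site address
        END {
            printf "total=%d internal=%d external=%d\n",
                   total, internal, total - internal
        }'
}

# Made-up sample in the shape of the link list that lynx -dump prints.
link_counts "http://www.example.com" <<'EOF'
   1. http://www.example.com/about
   2. http://www.example.com/contact
   3. http://other.example.org/
EOF
```

With a live page this would be run as lynx -dump "http://www.example.com/" | link_counts "http://www.example.com".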
IIS logs are often configured to output the filename in one column and the query string in the following column. An example of a line from an IIS log is shown below; the filename is /products.aspx and the query string is item=12345:
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
Unfortunately, the default settings on IIS do not seem to output the actual full URLs requested. It may be useful to get a list of URLs that were accessed by Google in order to process them further.
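The following one-liner takes the 404 lines from se_access.txt, keeps the filename and query string columns, joins them into a full URL, and removes the trailing "?-" that is left when a request has no query string: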
(NOTE: I've used backslashes to escape the end of the line — the following is a one-liner, but because of this Web page's formatting, I'm displaying it on multiple lines.)
grep '[[:space:]]404[[:space:]]' se_access.txt | \
cut -d' ' -f8,9 | \
awk '{ print "http://www.example.com"$1"?"$2}' | \
sed 's/?-$//' > 404_errors.txt
The final result is a file named 404_errors.txt that contains a list of nonexistent URLs that search engines are requesting on the site. The example above would take the following line from an IIS log:
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
and convert it to the following line:
http://www.example.com/products.aspx?item=12345
A list of URLs that send 404s is very useful for debugging sites. The list of URLs can be processed further as needed.
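The same conversion can be sketched in a single awk command, which checks the response code column directly and only adds the "?" when a query string is present. The function name iis_404_urls and the second sample line are made up for illustration, and the column numbers (8, 9 and 10) assume the IIS log layout shown in the example line.

```shell
# Hypothetical helper: read an IIS log on stdin and print the full URL of
# every request that returned a 404. The site address is the first argument.
# Assumes filename in column 8, query string in column 9, status in column 10.
iis_404_urls() {
    awk -v base="$1" '$10 == "404" {
        url = base $8
        if ($9 != "-") url = url "?" $9   # "-" means there was no query string
        print url
    }'
}

# The first line is the example from this page; the second is made up to
# show a request without a query string.
iis_404_urls "http://www.example.com" <<'EOF'
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
2006-10-19 00:25:03 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /old-page.aspx - 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
EOF
```

Comparing column 10 with 404 exactly avoids matching a 404 that happens to appear in another column.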
You can quickly scrape Web pages in a rough manner with the Lynx Browser and grep (and other tools that will be explained in the near future). Lynx is able to dump the contents of Web pages in two ways: only the text of the page, or the entire HTML source of the page.
To extract the text of a Web page with the HTML tags stripped out, you can use the -dump option like this:
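lynx -dump "http://www.example.com/"
If you want the entire source code, you can use the -source option:
lynx -source "http://www.example.com/"
You can then pipe the Web page into grep and sed like this:
lynx -source "http://www.example.com/" | grep -o 'your regular expression here' | sed 's/html tags here//g'
The steps are: Lynx fetches the HTML source of the page, grep -o prints only the text that matches your regular expression, and sed deletes the HTML tags from what grep printed.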
It's a rough, simple way to scrape a page and may not provide perfect results, but it shows the basic concept and can be modified to your needs.
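To make the placeholders concrete, here is one possible instance that pulls the page title out of the HTML source. The function name extract_title and the sample HTML are made up for illustration, and the sketch assumes the title element sits on a single line.

```shell
# Hypothetical helper: read HTML on stdin and print the text of the title.
# grep -o keeps only the <title>...</title> element (case-insensitively),
# and sed then deletes anything that looks like an HTML tag.
extract_title() {
    grep -io '<title>[^<]*</title>' | sed 's/<[^>]*>//g'
}

# Made-up sample HTML so the sketch runs without fetching a page.
printf '<html><head><title>Example Page</title></head></html>\n' | extract_title
```

With a live page this would be run as lynx -source "http://www.example.com/" | extract_title.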
The following script shows how to loop through a list of URLs in a text file called urls.txt and scrape some content from them:
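while read -r inputline
do
    url="$inputline"
    mydata="$(lynx -source "$url" | grep -o 'your regular expression here' | sed 's/html tags here//g')"
    echo "$url,$mydata" >> myfile.csv
    sleep 2
done < urls.txt
The script reads urls.txt one line at a time, fetches the source of each URL with Lynx, extracts the matching text with grep, strips the HTML tags with sed, appends the URL and the extracted data to myfile.csv as one comma-separated row, and pauses for 2 seconds before the next request.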
This is just a rough example of one way to scrape pages in the Linux terminal. If you know a scripting language like Perl, Python or Ruby, you can use those to parse the HTML in a more elegant fashion. This page will be greatly expanded soon...
Web designers often link to index.html in directories throughout a Web site — or even worse, only partially throughout a Web site. If you are dealing with a static HTML site, it should be fairly easy to fix with this recipe.
The following line in the GNU/Linux terminal will find and replace (delete) the text index.html recursively in all files, starting in the current directory:
find ./* -type f -exec sed -i 's/index\.html//g' {} \;
(Adapted from a LinuxForums.org post.)
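Because sed -i changes files in place, it may be worth listing the files that contain index.html before running the command above. This is a small sketch: the function name list_index_links is made up for illustration, and it relies on the -r option of grep, which GNU and BSD grep provide but POSIX does not require.

```shell
# Hypothetical helper: list every file under a directory (the current
# directory by default) that contains the text "index.html".
list_index_links() {
    grep -rl 'index\.html' "${1:-.}"
}

# Usage:
#   list_index_links .
```

Making a backup copy of the site before the replacement is a sensible extra precaution.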
You can then redirect all instances of index.html to the roots of the directories (the slash) with the following lines in the .htaccess file:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.html\ HTTP/
RewriteRule index\.html$ http://www.example.com/%1 [R=301,L]
I found a script from the book Wicked Cool Shell Scripts that quickly extracts useful data from Apache log files.
I'm not sure what license the script is under so I can't reprint it here, but it's worth checking out.