How to Scrape Web Pages from the GNU/Linux Shell

You can quickly scrape Web pages in a rough manner with the Lynx browser and grep (and other tools that will be explained later). Lynx can dump the contents of a Web page in two ways: as the rendered text of the page, or as the entire HTML source of the page.

Counting Outbound Links on a Web Page With Lynx

How to count the number of outbound links on a page with Lynx and GNU grep

Lynx can be used with the -dump option to dump the text and links from a Web page in the terminal. That output can then be piped into the grep command, which can extract the URLs or other information.
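A rough sketch of the idea is shown below. The "References" lines are a made-up sample of what Lynx appends to its -dump output, and example.org is a placeholder; with a live page you would pipe in lynx -dump -listonly http://example.org/ instead of the printf.

```shell
# Lynx appends a numbered "References" list of every link to its -dump
# output.  Here a small invented sample of that list is fed to grep -c,
# which counts the lines containing a URL.
printf '%s\n' \
  'References' \
  '   1. http://example.org/about' \
  '   2. http://example.org/contact' \
  '   3. https://other.example.net/' \
  | grep -c 'http'
```

To count only outbound links, a further grep -v could filter out URLs on the site's own domain before counting.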

Automated HTTP Response Code Checking

Sometimes you will end up with a list of URLs that you would like to check the HTTP response codes on. You might have 200 pages that are sending Google a 302 redirect header and you would like to check them all at once.

This very rough example reads a list of URLs from a file, fetches their HTTP response codes and redirect location, and prints them to the screen:

while read -r url; do
  headers="$(lynx -dump -head "$url" | grep -e HTTP -e Location)"
  echo "$url $headers"
  sleep 2
done < filename.txt

It is a rough script because the Location field of the headers returned by Lynx sometimes spans two lines. (I'm going to fix that problem soon.)

The sleep command tells the script to pause for 2 seconds between requests. It is optional, but if I am requesting a lot of URLs from one site, I usually pause between requests so that it doesn't make the server do too much work at once.

The basic syntax for processing a file line-by-line in the shell is:

while read -r inputline; do
  [some commands here]
done < [input filename]
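A concrete version of that template is sketched below; the file name and its contents are invented for the demonstration.

```shell
# Number each line of a small throwaway file with a while-read loop.
printf 'alpha\nbeta\ngamma\n' > /tmp/demo_lines.txt
n=0
while read -r inputline; do
  n=$((n + 1))
  echo "$n: $inputline"
done < /tmp/demo_lines.txt
```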

Apache Log Statistics

I found a script from the book Wicked Cool Shell Scripts that quickly extracts useful data from Apache log files.

I'm not sure what license the script is under so I can't reprint it here, but it's worth checking out.

Shell Scripting

The power of Unix-based operating systems (including GNU/Linux, BSD, and Mac OS X) is that you can pipe terminal commands together and write scripts with them.

Piping commands means sending the output of one command to the input of the next command. An example would be to use the grep command to find all of the lines in a logfile that contain the text Googlebot and then send those lines to the wc command to count them:

grep 'Googlebot' access.log | wc -l

The output would be the number of lines that contain the text Googlebot.
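The same pipeline can be tried without a real logfile by feeding it some made-up lines, two of which mention Googlebot:

```shell
# Three invented log lines piped through grep and wc; the result is 2.
printf '%s\n' \
  '66.249.66.1 - GET / Googlebot' \
  '10.0.0.5 - GET /a Mozilla' \
  '66.249.66.1 - GET /b Googlebot' \
  | grep 'Googlebot' | wc -l
```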

A series of commands can also be put into a script and reused. See below for a list of resources for learning about shell scripting.

Shell Scripting Tutorials

Extracting and Reconstructing URLs from an IIS Log

IIS logs are often configured to output the filename in one column and the query string in the following column. An example of a line from an IIS log is shown below; the filename is /products.aspx and the query string is item=12345:

2006-10-19 00:22:41 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++ - -

Unfortunately, the default settings on IIS do not seem to output the actual full URLs requested. It may be useful to get a list of URLs that were accessed by Google in order to process them further.

The following one-liner does the following:

  1. greps a log file that contains only hits from search engines for 404 errors. This gives a list of every "404 Not Found" page that search engines are visiting.
  2. It then uses the cut command to extract the page and query string columns: in the sample line above, column 7 (/products.aspx) and column 8 (item=12345).
  3. Then it uses awk to print out [filename]?[query string].
  4. Because not every requested page has a query string, sed is used to remove the ?- left on hits that don't have one.

(NOTE: I've used backslashes to escape the end of the line — the following is a one-liner, but because of this Web page's formatting, I'm displaying it on multiple lines.)

grep '[[:space:]]404[[:space:]]' se_access.txt | \
cut -d' ' -f7,8 | \
awk '{ print $1"?"$2 }' | \
sed 's/?-//g' > 404_errors.txt

The final result is a file named 404_errors.txt that contains a list of URLs that are being requested on a site by search engines that don't exist. The example above would take the following line from an IIS log:

2006-10-19 00:22:41 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++ - -

and convert it to the following line:

/products.aspx?item=12345

A list of URLs that send 404s is very useful for debugging sites. The list of URLs can be processed further as needed.
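The pipeline can be tested without a real log by feeding it two invented sample lines, one with a query string and one without. Note that in lines of this shape the page and query string fall in columns 7 and 8:

```shell
# Run the grep/cut/awk/sed stages on made-up IIS-style log lines and
# print the reconstructed URLs instead of redirecting them to a file.
printf '%s\n' \
  '2006-10-19 00:22:41 - 10.0.0.1 80 GET /products.aspx item=12345 404 0 UA - -' \
  '2006-10-19 00:22:42 - 10.0.0.1 80 GET /missing.html - 404 0 UA - -' \
  | grep '[[:space:]]404[[:space:]]' \
  | cut -d' ' -f7,8 \
  | awk '{ print $1"?"$2 }' \
  | sed 's/?-//g'
```

The first line becomes /products.aspx?item=12345 and the second, which has no query string, becomes /missing.html after sed strips the ?-.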

SEO Techniques

This section contains actual techniques that can be used to analyze search engine activity on a Web site. As mentioned elsewhere, these definitely work on GNU/Linux (in my case, Ubuntu 6.06), but should also work on BSD, Mac OS X, and even Windows (with Cygwin). For best results, use a Unix-based operating system for these techniques and not Windows.

This is an ongoing project and many more pages will be added to this section soon. If you have any script recipes to add, leave a comment, or send me an email. The primary focus is on basic shell scripting, but scripts in Ruby, Python, PHP, Perl and other languages may also be added.

The wc Command: Counting Lines in a File

The wc command is useful for counting lines in a file. In the case of log files, it can count hits. The -l option prints the number of lines.

wc -l access.log

You can also use it on multiple files — in the following case, on all files with a .log extension:

wc -l *.log
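For example, on two throwaway files (the names and contents are invented), wc -l reports a count per file plus a total:

```shell
# Count lines in two small made-up log files at once.
printf 'a\nb\nc\n' > /tmp/one.log
printf 'x\ny\n' > /tmp/two.log
wc -l /tmp/one.log /tmp/two.log
```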

sort and uniq

The sort command sorts lines of text.

The uniq command removes adjacent duplicate lines, which is why its input is usually run through sort first.
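A common log-analysis idiom combines the two: sort groups the duplicates together, uniq -c counts each group, and a final sort -rn puts the most frequent values first. The IP addresses below are made-up sample data.

```shell
# Count how many times each IP appears, busiest first.
printf '%s\n' 1.1.1.1 2.2.2.2 1.1.1.1 1.1.1.1 2.2.2.2 \
  | sort | uniq -c | sort -rn
```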

See the GNU/Linux command line tutorial for detailed instructions. Or, type man sort or man uniq in a terminal.

awk Text-Processing Language

Awk is a text-processing language and a great tool that will be covered in more detail later. For more information about awk, type man awk in a terminal.
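As a small taste of what awk can do: it splits each input line into fields ($1, $2, and so on), which makes it easy to total a numeric column. The lines below are invented sample data.

```shell
# Sum the second column (say, bytes served per page) and print the
# total after the last line has been read.
printf '%s\n' '/a 100' '/b 250' '/c 50' \
  | awk '{ total += $2 } END { print total }'
```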

For an introduction, try this Awk tutorial. If using the GNU version of Awk (gawk), you can download the manual. If you are using GNU/Linux you are probably using gawk.
