Lynx Browser

Counting Outbound Links on a Web Page With Lynx

How to count the number of outbound links on a page with Lynx and GNU

Lynx can be used with the -dump option to dump the text and links from a Web page in the terminal. That output can then be piped into the grep command, which can extract the URLs or other information.
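As a rough sketch of that idea, the function below counts the unique links in Lynx's output that point away from your own site. The function name count_outbound_links and its host-name argument are made up for this illustration, and the host filter is a simple substring match, so treat it as a starting point rather than a finished tool.

```shell
# count_outbound_links HOST: read `lynx -dump` output on stdin and print how
# many unique http/https URLs do not point at HOST.
# The function name and the HOST argument are hypothetical.
count_outbound_links() {
  grep -Eo 'https?://[^[:space:]]+' |  # pull out every absolute URL
    grep -Fv "://$1" |                 # drop URLs on our own host (substring match)
    sort -u |                          # remove duplicates
    wc -l |                            # count what is left
    tr -d ' '                          # some wc versions pad the number with spaces
}

# Usage (needs Lynx and network access, so it is left commented out):
# lynx -dump "http://www.example.com/" | count_outbound_links www.example.com
```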

Automated HTTP Response Code Checking

Sometimes you will end up with a list of URLs whose HTTP response codes you would like to check. You might have 200 pages that are sending Google a 302 redirect header and you would like to check them all at once.

This very rough example reads a list of URLs from a file, fetches their HTTP response codes and redirect location, and prints them to the screen:

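# Read one URL per line from filename.txt
while read -r inputline
do
  url="$inputline"
  # Fetch only the response headers and keep the status and redirect lines
  headers="$(lynx -dump -head "$url" | grep -e HTTP -e Location)"
  echo "$url $headers"
  # Pause between requests to go easy on the server
  sleep 2
done < filename.txt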

It is a rough script because the Location field of the headers returned by Lynx sometimes spans two lines. (I'm going to fix that problem soon.)

The sleep command tells the script to pause for 2 seconds between requests. It is optional, but if I am requesting a lot of URLs from one site, I usually pause between requests so that it doesn't make the server do too much work at once.
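If you would rather get one tidy line per URL than the raw grep output, a small awk filter can pick out just the status code and the redirect target. This is only a sketch: the function name status_and_location is made up, and it assumes each header arrives on a single line. Lynx also has a -width option for the number of columns used in dumps; whether a larger value keeps a long Location header on one line is an assumption worth checking.

```shell
# status_and_location: read HTTP response headers on stdin and print the
# status code, followed by the Location value when there is one.
# The function name is hypothetical; it assumes one header per line.
status_and_location() {
  awk '
    { sub(/\r$/, "") }                          # strip a trailing carriage return
    $1 ~ /^HTTP\//             { status = $2 }  # e.g. "HTTP/1.1 302 Found"
    tolower($1) == "location:" { location = $2 }
    END {
      if (location != "") print status, location
      else print status
    }
  '
}

# Usage inside the loop (needs Lynx, so it is left commented out):
# echo "$url $(lynx -dump -head "$url" | status_and_location)"
```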

The basic syntax for processing a file line-by-line in the shell is:


while read -r inputline
do
  [some commands here]
done < [input filename]
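To make that skeleton concrete, here is a small sketch that numbers each line of a file. The function name print_lines is made up for the example. The -r option stops read from treating backslashes specially, and the extra test after || picks up a last line that has no trailing newline.

```shell
# print_lines FILE: print each line of FILE prefixed with its line number.
# The function name is hypothetical.
print_lines() {
  n=0
  # IFS= keeps leading and trailing whitespace; -r keeps backslashes as-is.
  while IFS= read -r inputline || [ -n "$inputline" ]; do
    n=$((n + 1))
    printf '%d: %s\n' "$n" "$inputline"
  done < "$1"
}

# Usage:
# print_lines filename.txt
```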

Lynx Browser

Google's Webmaster Guidelines say,

Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.

I've already written a Lynx Browser tutorial and devoted an entire section of this Web site to Lynx, but here is a quick reference:

Get the text from a Web page as well as a list of links

lynx -dump "http://www.example.com/"

Get the source code from a Web page with Lynx

lynx -source "http://www.example.com/"

Get the response headers with Lynx

lynx -dump -head "http://www.example.com/"

Tutorial Page for Lynx Browser Aficionados

I've covered Lynx in a few tutorials, but have never really given a general introduction to the browser.

I just found a web page that gives a nice introduction to Lynx for general browsing.

Using a text browser for surfing the Web might seem a little strange, but it can be very useful in certain situations.

Using Lynx to see what your web site looks like to search engines

Google's webmaster guidelines recommend using Lynx to see how your site might look to a search engine:

"Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would."

Using Lynx to read the BBC News

Screenshot of the BBC News viewed with Lynx

How to Save Web Pages and Web Sites for Offline Viewing

There are many ways to save web pages and web sites for offline viewing. These methods will work on Linux, Windows and/or Mac OS X. These tools will save entire web pages and web sites. If you are looking for a way to take screenshots, try this page instead.

Checking HTTP Headers With the Lynx Browser

You can check your HTTP headers quickly with Lynx. If you are using Linux, set a keyboard shortcut to open a terminal. On Ubuntu (or probably any GNOME-based distro), go to System > Preferences > Keyboard Shortcuts and assign one. I set mine to Ctrl-Alt-Shift-t.

Once you have a keyboard shortcut set up for opening a terminal, using tools like Lynx becomes very fast. When you want to check HTTP headers, just use your keyboard shortcut to open a terminal, and then follow the directions below. I'm not sure if there is a keyboard shortcut to open a terminal on Mac OS X or Windows, but the techniques below will still work.

Tip: To install Lynx on Ubuntu/Debian, type sudo apt-get install lynx. If you want to install Lynx on Windows, I recommend using Cygwin. I'm not sure if Lynx comes with Mac OS X, but if it isn't on your Mac you can get the Mac version here.

Then type the following line in the terminal, replacing example.com with the domain name that you want to check:

lynx -head -dump "http://www.example.com"

The headers will then be output in the terminal. To save the headers to a file, just use:
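One way to do that, sketched here on the assumption that plain shell output redirection is all that is needed, is to send the command's output to a file with >. The function name save_headers and the file name headers.txt are made up for this example.

```shell
# save_headers URL FILE: write the response headers for URL into FILE.
# The function name and the file name used below are hypothetical.
save_headers() {
  lynx -head -dump "$1" > "$2"
}

# Usage (needs Lynx and network access, so it is left commented out):
# save_headers "http://www.example.com" headers.txt
```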

Wicked Cool Shell Scripts

The web site for the book Wicked Cool Shell Scripts offers some excerpts from the book online. One that I particularly like (because I like Lynx) is the script to Track BBC News with Lynx.

In that chapter, the author teaches you how to scrape content from the BBC news site. It's a great tutorial, and there are several other free examples from the book Wicked Cool Shell Scripts.

Learning the GNU / Linux Command Line

FreeSoftwareMagazine.com has an article on how to learn the GNU/Linux command line.

The FreeSoftwareMagazine.com tutorial is a nice introduction to some basics of the command line, beginning with whoami and working through how to list directory contents and how to use man and info pages.

There is also a great introduction to the terminal called Unix for the Beginning Mage that you can download as a free PDF book at UnixMages.com. The terminals for GNU/Linux and Mac OS X are very similar since both systems are Unix-like. You can probably use any of these tutorials on Mac OS X with little or no modification.

I've created a little GNU/Linux command line tutorial below:

Fun in the Terminal With Lynx

The GNU/Linux command line gives you a lot of small tools that can be connected with each other by piping the output of one tool into another tool.

For example, you might see a page with a lot of links on it that you want to examine more closely. You could open up a terminal and type something like the following:

$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt

That will give you a list of outgoing links on the web page at http://www.example.com, nicely printed to a file called file.txt in your current directory.

Here's how it works:

Lynx is a Web browser that only reads text. This makes it great for extracting text from web pages. The option -dump tells Lynx to grab the web page and display it in the terminal. That is followed by the URL you want to visit. So lynx -dump "http://www.example.com" is just saying, "Lynx, dump the output of http://www.example.com to the screen".

You can try the first part by itself to see what it does, replacing http://www.example.com with another URL of your choice. In the following example I've used the home page of the BBC news:

$ lynx -dump "http://news.bbc.co.uk"

Notice in the image snippet below, which shows the tail end of Lynx's output, that Lynx gives a list of URLs preceded by numbers. We are going to extract only those URLs from the output in the next step:

BBC news page using Lynx
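If you only want that numbered list and not any URLs that happen to appear in the page text, you can cut the output down to the part after the list's heading first. The sketch below assumes the heading is a line that reads just "References", which is how Lynx labels the list in the dumps I'm familiar with. The function name references_only is made up for this example.

```shell
# references_only: read `lynx -dump` output on stdin and print only the lines
# that come after the "References" heading.
# The function name is hypothetical; the heading text is an assumption.
references_only() {
  awk 'found { print } /^ *References$/ { found = 1 }'
}

# Usage (needs Lynx and network access, so it is left commented out):
# lynx -dump "http://news.bbc.co.uk" | references_only
```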

Extracting the Links from Lynx

Now we can look at the next part of the URL extraction process:

$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt

When you use a pipe (the | symbol), it tells the computer to take the output from the first tool and send it to the following tool. So we are taking the output of Lynx and sending it to grep.

Grep is a tool to search for text and display each line that contains a matching pattern. The option -o tells grep to only return the matching part of the line and not the entire line. We are searching for anything that matches "http:.*", which is a simple regular expression.

A regular expression is a pattern that is made up of symbols that tell the computer what to look for in order to make a match. We want to find anything that matches the pattern: http: [and anything that comes after that]. A period (.) in a regular expression symbolizes one character of any type. The asterisk (*) symbolizes zero or more of the preceding character. So "http:.*" means "match 'http:' and any number of characters that follow it". Because .* runs to the end of the line, this works when each URL sits at the end of its line, as it does in Lynx's numbered list. Note that the pattern only matches URLs that begin with http: and skips any that begin with https:.
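You can try the pattern without fetching anything by feeding grep a line typed by hand. This is a small sketch using made-up sample lines shaped like the entries in Lynx's numbered list. It also shows that an https URL slips past this particular pattern.

```shell
# A sample line shaped like one entry in the numbered list from lynx -dump.
sample='   1. http://www.example.com/about'

# -o prints only the matching part, so the leading "1." is dropped.
match="$(printf '%s\n' "$sample" | grep -o "http:.*")"
printf '%s\n' "$match"

# An https URL does not contain "http:", so this pattern finds nothing.
https_match="$(printf '   2. https://www.example.com/\n' | grep -o "http:.*" || true)"
printf 'https match: [%s]\n' "$https_match"
```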

We could stop there and just run it as this, which will send the output to the screen:

$ lynx -dump "http://www.example.com" | grep -o "http:.*"

But it would be nice to save the output for later. To save the output to a file, just add the > symbol. In this case the output is being directed to a file named file.txt as shown below.

$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt

Other Options

Here is an example of some other options that you can add. The command sort sorts the results, and uniq removes any duplicate entries.

$ lynx -dump "http://www.example.com" | grep -o "http:.*" | sort | uniq >file.txt
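Here is a slightly different take on the same pipeline, offered as a sketch rather than a replacement. It uses an extended regular expression so that https links are picked up too, stops each match at the first whitespace, and relies on sort -u, which does the work of sort | uniq in one step. The function name unique_links is made up for this example.

```shell
# unique_links: read `lynx -dump` output on stdin and print each http or
# https URL once, in sorted order.
# The function name is hypothetical.
unique_links() {
  # -E enables extended regular expressions; "s?" makes the s optional.
  grep -Eo 'https?://[^[:space:]]+' | sort -u
}

# Usage (needs Lynx and network access, so it is left commented out):
# lynx -dump "http://www.example.com" | unique_links > file.txt
```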

Future Tutorial

In another tutorial I will expand on this one to show you how to convert your file of plain-text URLs into hyperlinked HTML URLs. I'll also show you an example of what you can use this for.
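In the meantime, here is one rough way the conversion could be done in the shell. It is only a sketch: the function name urls_to_html and the choice of one link per line followed by a line break tag are assumptions made for this example.

```shell
# urls_to_html: read one URL per line on stdin and print an HTML link for each.
# The function name and the output layout are hypothetical.
urls_to_html() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip empty lines
    # Escape & so that query strings stay valid inside HTML.
    escaped="$(printf '%s\n' "$url" | sed 's/&/\&amp;/g')"
    printf '<a href="%s">%s</a><br>\n' "$escaped" "$escaped"
  done
}

# Usage:
# urls_to_html < file.txt > links.html
```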

Further Reading

Check out these other links for more information on using the terminal and writing regular expressions:

Intro to Search Engine Optimization (SEO)

Search Engine Optimization (SEO)

A few resources for learning more about search engine optimization can be found below. See also my article on the Top 10 Firefox Extensions for SEO.
