Learning the GNU/Linux Command Line

FreeSoftwareMagazine.com has an article on how to learn the GNU/Linux command line.

The FreeSoftwareMagazine.com tutorial is a nice introduction to some basics of the command line, beginning with whoami and working through how to list directory contents and how to use man and info pages.

There is also a great introduction to the terminal called Unix for the Beginning Mage, which you can download as a free PDF book at UnixMages.com. The terminals on GNU/Linux and Mac OS X are very similar, since both systems are Unix-like, so you can probably follow any of these tutorials on Mac OS X with little or no modification.

I've created a little GNU/Linux command line tutorial below:

Fun in the Terminal With Lynx

The GNU/Linux command line gives you a lot of small tools that can be connected with each other by piping the output of one tool into another tool.
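
As a tiny, self-contained illustration of piping (using the standard ls and wc tools; the exact number you get will vary from system to system), you can count the entries in your /etc directory like this:

$ ls /etc | wc -l

Here ls lists the contents of /etc and wc -l counts the lines it receives, so the number printed is the number of entries in that directory.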

For example, you might see a page with a lot of links on it that you want to examine more closely. You could open up a terminal and type something like the following:

$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt

That will give you a list of outgoing links on the web page at http://www.example.com, nicely printed to a file called file.txt in your current directory.

Here's how it works:

Lynx is a text-only Web browser, which makes it great for extracting text from web pages. The -dump option tells Lynx to fetch the web page, format it as plain text, and print it to the terminal instead of opening it interactively. That is followed by the URL you want to visit. So lynx -dump "http://www.example.com" is just saying, "Lynx, dump the output of http://www.example.com to the screen".

You can try the first part by itself to see what it does, replacing http://www.example.com with another URL of your choice. In the following example I've used the BBC News home page:

$ lynx -dump "http://news.bbc.co.uk"

Notice in the snippet below, showing the tail end of Lynx's output, that Lynx prints a list of URLs, each preceded by a number. We are going to extract only those URLs from the output in the next step:

[Image: BBC news page using Lynx]
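
In case the image does not display, the very end of Lynx's dump looks roughly like this (the numbers and URLs below are only an illustration, not the actual BBC output):

References

   1. http://news.bbc.co.uk/sport/
   2. http://news.bbc.co.uk/weather/
   3. http://news.bbc.co.uk/2/hi/technology/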

Extracting the Links from Lynx

Now we can look at the next part of the URL extraction process:

$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt

A pipe (the | symbol) tells the shell to take the output of the first tool and send it as input to the next one. So we are taking the output of Lynx and sending it to grep.

Grep is a tool that searches for text and displays each line that contains a matching pattern. The -o option tells grep to return only the matching part of the line rather than the entire line. We are searching for anything that matches "http:.*", which is a simple regular expression.

A regular expression is a pattern made up of symbols that tell the computer what to look for in order to make a match. We want to find anything that matches the pattern: http: [and anything that comes after that]. A period (.) in a regular expression stands for one character of any type. The asterisk (*) stands for zero or more of the preceding character. So "http:.*" means "match 'http:' and any number of characters that follow it". This will extract only the URLs from Lynx's output.
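
If you want to see exactly what the pattern matches, you can feed grep a single sample line using echo (the URL below is just a made-up example):

$ echo "   12. http://www.example.com/news" | grep -o "http:.*"
http://www.example.com/news

The match starts at "http:" and runs to the end of the line, which in Lynx's numbered reference list is simply the URL itself.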

We could stop there and just run the command like this, which sends the output to the screen:

$ lynx -dump "http://www.example.com" | grep -o "http:.*"

But it would be nice to save the output for later. To save the output to a file, just add the > symbol. In this case the output is being directed to a file named file.txt as shown below.

$ lynx -dump "http://www.example.com" | grep -o "http:.*" >file.txt
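
Once the command has finished, you can check what was saved with another standard tool such as cat, which prints the file to the screen:

$ cat file.txt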

Other Options

Here is an example of some other options you can add. The sort command sorts the results, and uniq removes duplicate entries (uniq only discards adjacent duplicates, which is why we sort first).

$ lynx -dump "http://www.example.com" | grep -o "http:.*" | sort | uniq >file.txt
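
As one more optional variation, if you only want to know how many unique links the page contains, you could pipe the sorted list into wc -l (which counts lines) instead of saving it:

$ lynx -dump "http://www.example.com" | grep -o "http:.*" | sort | uniq | wc -l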

Future Tutorial

In another tutorial I will expand on this one to show you how to convert your file of plain-text URLs into hyperlinked HTML URLs. I'll also show you an example of what you can use this for.

Further Reading

Check out these other links for more information on using the terminal and writing regular expressions:

