How to Scrape Web Pages from the GNU/Linux Shell

You can scrape Web pages quickly, if roughly, with the Lynx browser, grep and sed (other tools will be explained in the near future). Lynx can dump the contents of a Web page in two ways: as the rendered text of the page only, or as the entire HTML source of the page.

To extract the text of a Web page with the HTML tags stripped out, you can use the -dump option like this:

lynx -dump ""

If you want the entire source code, you can use the -source option:

lynx -source ""

You can then pipe the Web page into grep and sed like this:

lynx -source "" | grep -o 'your regular expression here' | sed 's/html tags here//g'

The steps are:

  1. Lynx fetches the source code of the page.
  2. Grep, used with the -o option, prints only the part of each line that matches your regular expression, which will probably be some HTML tags with the text you want in the middle. Note that this only works if the tags and the text all appear on one line, because grep matches line by line. I'm going to provide some better examples soon, but for now this approach does have some useful applications.
  3. Sed then strips out the HTML tags, leaving just the text that was inside them.
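
As a concrete illustration of the grep and sed steps, here is a minimal sketch that pulls the text out of a title element. The sample HTML is made up and stands in for what lynx -source would return, and both patterns are assumptions chosen for this example rather than something every page will need.

```shell
# Made-up sample HTML standing in for the output of: lynx -source <some URL>
html='<html><head><title>Example Page</title></head>
<body><h1>Hello</h1></body></html>'

# grep -o prints only the matching part of the line (the whole <title> element);
# sed then deletes anything that looks like a tag, leaving the text.
title=$(printf '%s\n' "$html" | grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g')

echo "$title"
```

The sed expression removes every tag it finds, so the same pattern can be reused when the matched text contains nested tags.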

It's a rough, simple way to scrape a page and may not provide perfect results, but it shows the basic concept and can be modified to your needs.
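
One possible modification, sketched here under the assumption that the element you want is split across several lines: join the lines with tr before handing the page to grep. The sample HTML is again made up for illustration.

```shell
# Made-up sample in which the <title> element spans two lines
html='<html><head><title>Example
Page</title></head></html>'

# Line by line, no single line contains the whole element, so grep finds nothing.
# grep exits non-zero when nothing matches, hence the "|| true".
one_line=$(printf '%s\n' "$html" | grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g') || true

# tr turns every newline into a space, so the element now sits on a single line.
flattened=$(printf '%s\n' "$html" | tr '\n' ' ' | grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g')

echo "without tr: [$one_line]"
echo "with tr:    [$flattened]"
```

Flattening the page this way turns it into one very long line, so it is best kept for small pages or for patterns that are specific enough not to match too much.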

The following script shows how to loop through a list of URLs in a text file called urls.txt and scrape some content from them:

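while read -r inputline
do
  # Each line of urls.txt is one URL
  url="$inputline"
  # Fetch the source, extract the match, then strip the tags
  mydata="$(lynx -source "$url" | grep -o 'your regular expression here' | sed 's/html tags here//g')"
  # Append a "URL,data" row to the CSV file
  echo "$url,$mydata" >> myfile.csv
  # Pause before moving on to the next URL
  sleep 2
done <urls.txt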

The steps of the script are as follows:

  1. The while/do/done loop reads the file urls.txt into the script line by line.
  2. The current line of the file (a URL) is assigned to the variable $url.
  3. Lynx is used to fetch the source code of the page at $url.
  4. The source of the URL is then piped into grep where some text inside of HTML tags is extracted.
  5. sed is used to strip out the HTML tags.
  6. The URL and the extracted text are then appended to a new file called myfile.csv.
  7. If necessary, you can have the script sleep for a couple of seconds before moving on to the next URL.
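
The steps above can also be wrapped in functions so the extraction can be tried on saved pages before touching a live site. This is only a sketch: the function names extract_title and scrape_urls, and the FETCH_CMD and DELAY variables, are hypothetical names introduced here, and the title pattern is an assumption about what you want to extract.

```shell
#!/bin/sh
# FETCH_CMD is a hypothetical override hook; by default it is lynx -source,
# as in the script above. Set it to "cat" to test against saved HTML files.
FETCH_CMD=${FETCH_CMD:-lynx -source}

# Read HTML on stdin and print the text of the first <title> element.
# Like the pipeline above, this only works if the element is on one line.
extract_title() {
  grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g' | head -n 1
}

# Read URLs from the file named in $1 and append "url,title" rows to $2.
scrape_urls() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip blank lines
    title=$($FETCH_CMD "$url" | extract_title)
    printf '%s,%s\n' "$url" "$title" >> "$2"
    sleep "${DELAY:-2}"                # pause between requests
  done < "$1"
}
```

Calling scrape_urls urls.txt myfile.csv would then do the same work as the loop above. Note that a title containing a comma will produce an extra column in the CSV file, since nothing here quotes the fields.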

This is just a rough example of one way to scrape pages in the Linux terminal. If you know a scripting language like Perl, Python or Ruby, you can use those to parse the HTML in a more elegant fashion. This page will be greatly expanded soon...
