You can quickly scrape Web pages in a rough-and-ready way with the Lynx browser and grep (along with other tools covered below). Lynx can dump the contents of a Web page in two ways: just the rendered text of the page, or the page's entire HTML source.
To extract the text of a Web page with the HTML tags stripped out, you can use the -dump option like this:
lynx -dump "http://www.example.com/"
If you want the entire source code, you can use the -source option:
lynx -source "http://www.example.com/"
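Either form of output can also be redirected to a file for later processing. The file names here are just examples:

lynx -dump "http://www.example.com/" > page.txt
lynx -source "http://www.example.com/" > page.html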
You can then pipe the Web page into grep and sed like this:
lynx -source "http://www.example.com/" | grep -o 'your regular expression here' | sed 's/html tags here//g'
The steps are:

1. lynx -source fetches the raw HTML of the page.
2. grep -o prints only the parts of the HTML that match your regular expression.
3. sed strips out any HTML tags left in the matched text.
It's a rough, simple way to scrape a page and may not give perfect results, but it shows the basic concept and can be adapted to your needs.
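For instance, this pipeline pulls the page title out of the source and strips the tags. It assumes the <title> element sits on a single line, which is true for many pages but not all of them:

lynx -source "http://www.example.com/" | grep -o '<title>[^<]*</title>' | sed 's/<[^>]*>//g'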
The following script shows how to loop through a list of URLs in a text file called urls.txt and scrape some content from them:
while read inputline
do
    # Take the URL from the current line of urls.txt
    url="$inputline"
    # Fetch the page, pull out the matching text, and strip the tags
    mydata="$(lynx -source "$url" | grep -o 'your regular expression here' | sed 's/html tags here//g')"
    # Append the URL and the scraped data as a CSV row
    echo "$url,$mydata" >> myfile.csv
    # Pause for two seconds so you don't hammer the server
    sleep 2
done < urls.txt
The steps of the script are as follows:

1. Read urls.txt one line at a time, placing each line in the inputline variable.
2. Store the current line in the url variable.
3. Run the lynx/grep/sed pipeline on that URL and store the result in mydata.
4. Append the URL and the scraped data as a comma-separated row in myfile.csv.
5. Sleep for two seconds between requests so the script goes easy on the server.
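To try it out, save the script (as, say, scrape.sh; the name is arbitrary), put one URL per line in urls.txt, and run it:

bash scrape.sh

Each row of myfile.csv will then contain a URL and whatever your regular expression matched on that page.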
This is just a rough example of one way to scrape pages in the Linux terminal. If you know a scripting language like Perl, Python, or Ruby, you can use one of those to parse the HTML in a more elegant fashion. This page will be greatly expanded soon...