Scripting

How to Find Files With Linux

I just bought a new laptop (a Thinkpad T500, which I'll review later) and was trying to copy the files from my old laptop to a portable hard drive. There was an error every time the computer tried to copy a file whose name contained a colon or a question mark.
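A minimal sketch of one way to track those files down first with find, assuming you just want to list every filename under the current directory that contains a colon or a question mark:

find . -name '*[?:]*'

The quotes stop the shell from expanding the wildcards itself, and the bracket expression matches either of the two troublesome characters, so you can rename those files before copying them to the portable drive.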


Ruby Bikini - How to Process XML in Ruby

Continuing in the series of Brazilian bikini Web development tutorials, here is an experiment with the Yahoo Search API, Ruby and Brazilian bikinis.

The script uses Ruby to convert the XML from the Yahoo Image Search API into XHTML Strict as shown in the image below:

[Image: Ruby Bikini]

Please download the attached Ruby file to follow along.


Useful Links of the Day

I've been busy with work lately and haven't had time to write much.

Here are some useful scripting links that have been sitting in my Firefox tabs for a week or so:


Checking Domain Age Programmatically

You can check the year that a domain was registered with the following command:

whois example.com | grep -i 'creat' | head -n1 | egrep -o '[[:digit:]]{4}'

The above line does the following:

  1. The whois command fetches the WHOIS record for the domain example.com.
  2. The grep command pulls out the lines containing "creat" (such as "Creation Date"). The -i option makes the search case-insensitive.
  3. head -n1 keeps only the first matching line, since some WHOIS records contain more than one.
  4. The final egrep -o extracts just the four digits on that line, which should give you the year the domain was registered.

You can extract the exact day with the following command:

whois example.com | grep -i 'creat' | head -n1 | \
egrep -o '[[:digit:]]{2}-[a-zA-Z0-9]{1,10}-[[:digit:]]{4}'

It works in a similar manner to the first example, but uses a longer regular expression that pulls out the full date rather than just the year.

You can also run this on a list of domains in a text file by reading each line of the file.
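As a rough sketch, assuming the domains are listed one per line in a file called domains.txt (the filename is just a placeholder), the loop looks like this:

while read domain
do
  # Reuse the year-extraction pipeline from above for each domain
  year="$(whois "$domain" | grep -i 'creat' | head -n1 | egrep -o '[[:digit:]]{4}')"
  echo "$domain $year"
  # Pause briefly so the WHOIS servers aren't queried too quickly
  sleep 2
done < domains.txt

The sleep is optional here too, but many WHOIS servers limit how quickly you can query them.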

How to Scrape Web Pages from the GNU/Linux Shell

You can quickly scrape Web pages in a rough manner with the Lynx Browser and grep (and other tools that will be explained in the near future). Lynx is able to dump the contents of Web pages in two ways: only the text of the page, or the entire HTML source of the page.
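For example, assuming lynx is installed, these two invocations dump a page as rendered text and as raw HTML respectively (the URL and output filenames are just placeholders):

lynx -dump http://example.com/ > page.txt
lynx -source http://example.com/ > page.html

Either dump can then be piped straight into grep to pull out the lines you are interested in.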

Automated HTTP Response Code Checking

Sometimes you will end up with a list of URLs that you would like to check the HTTP response codes on. You might have 200 pages that are sending Google a 302 redirect header and you would like to check them all at once.

This very rough example reads a list of URLs from a file, fetches the HTTP response code and redirect location for each one, and prints them to the screen:

# Read one URL per line from filename.txt
while read inputline
do
  url="$inputline"
  # Fetch only the headers and keep the status line and any Location header
  headers="$(lynx -dump -head "$url" | grep -e HTTP -e Location)"
  echo "$url $headers"
  # Pause between requests so the server isn't hit too hard
  sleep 2
done < filename.txt

It is a rough script because the Location field of the headers returned by Lynx sometimes spans two lines. (I'm going to fix that problem soon.)

The sleep command tells the script to pause for 2 seconds between requests. It is optional, but if I am requesting a lot of URLs from one site, I usually pause between requests so that it doesn't make the server do too much work at once.
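As an aside, if curl is installed, its write-out format can grab the same information without parsing headers at all; this is a sketch of an alternative approach, not the script above:

curl -s -o /dev/null -I -w '%{http_code} %{redirect_url}\n' http://example.com/

%{http_code} prints the response code and %{redirect_url} prints where a redirect points, which sidesteps the wrapped Location line.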

The basic syntax for processing a file line-by-line in the shell is:


while read inputline
do
  [some commands here]
done < [input filename]
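As a tiny concrete illustration of that pattern, the following numbers every line of a file (somefile.txt is just a placeholder name):

i=0
while read inputline
do
  i=$((i + 1))
  echo "$i: $inputline"
done < somefile.txt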

Extracting Search Engine Hits from Log Files


This page describes some ways to use the GNU/Linux terminal to extract search engine hits from a Web site's log files.

To extract just the Googlebot hits on the site from a logfile, try this: