How to Find Files With Linux

I just bought a new laptop (Thinkpad T500, which I'll review later) and was trying to copy the files from my old laptop to a portable hard drive. There was an error every time the computer tried to copy a file that contained a colon or a question mark.
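Colons and question marks are not legal characters in filenames on Windows-style filesystems (FAT or NTFS), which is the likely cause of the copy errors. You can hunt down the offending files ahead of time with find. A sketch (the demo directory and filenames here are made up so the example runs on its own):

```shell
# Create a demo directory with one problem file and one normal file.
mkdir -p demo
touch demo/'notes: draft.txt' demo/plain.txt

# Find files whose names contain a colon or a question mark.
# The bracket expression [:?] matches either character.
find demo -name '*[:?]*'
# prints: demo/notes: draft.txt
```

Run against your home directory instead of demo, this lists every file you would need to rename before copying to the portable drive.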

Ruby Bikini - How to Process XML in Ruby

Continuing in the series of Brazilian bikini Web development tutorials, here is an experiment with the Yahoo Search API, Ruby and Brazilian bikinis.

The script uses Ruby to convert the XML from the Yahoo Image Search API into XHTML Strict as shown in the image below:



[Image: Ruby Bikini]

Please download the attached Ruby file to follow along.

Useful Links of the Day

I've been busy with work lately and haven't had time to write much.

Here are some useful scripting links that have been sitting in my Firefox tabs for a week or so:

Checking Domain Age Programmatically

You can check the year a domain was registered with the following command (substitute your own domain for example.com):

whois example.com | grep -i 'creat' | head -n1 | grep -Eo '[[:digit:]]{4}'

The above line does the following:

  1. The whois command fetches the WHOIS record for the domain.
  2. The grep command keeps the lines that mention the creation date (e.g. Creation Date). The -i option makes the match case-insensitive.
  3. head -n1 keeps only the first of those lines, in case more than one matches.
  4. The final grep extracts just the four consecutive digits on that line, which should be the year the domain was registered.

You can extract the exact day with the following command:

whois example.com | grep -i 'creat' | head -n1 | \
grep -Eo '[[:digit:]]{2}-[a-zA-Z0-9]{1,10}-[[:digit:]]{4}'

It works in a similar manner to the first example, but uses a regular expression to extract the full date.

You can also run this on a list of domains in a text file by reading each line of the file.
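A sketch of that loop, assuming a hypothetical file named domains.txt with one domain per line. The whois pipeline is left as a comment so the loop itself can be run without network access:

```shell
# domains.txt is a made-up filename, created here so the example
# runs on its own.
printf 'example.com\nexample.org\n' > domains.txt

while read domain
do
  # Replace the echo with the real pipeline when running for real:
  #   whois "$domain" | grep -i 'creat' | head -n1 | grep -Eo '[[:digit:]]{4}'
  echo "checking $domain"
done < domains.txt
```

Adding a sleep between iterations is polite if the list is long, since WHOIS servers rate-limit aggressive clients.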

How to Scrape Web Pages from the GNU/Linux Shell

You can quickly scrape Web pages in a rough manner with the Lynx Browser and grep (and other tools that will be explained in the near future). Lynx is able to dump the contents of Web pages in two ways: only the text of the page, or the entire HTML source of the page.
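The two dump modes look like this (example.com stands in for a real URL), and grep then filters whichever form you asked for. The saved dump below is faked with printf so the grep step can run without a network connection:

```shell
# lynx -dump   http://example.com    -> rendered text, with the page's
#                                       links listed at the end
# lynx -source http://example.com    -> the raw HTML source
#
# Once a dump is saved, grep does the scraping. The dump here is
# simulated so the example is self-contained:
printf 'Welcome to the page\n\nReferences\n 1. http://example.com/about\n 2. http://example.com/faq\n' > dump.txt

# Pull out every URL in the dump.
grep -o 'http://[^ ]*' dump.txt
# prints:
#   http://example.com/about
#   http://example.com/faq
```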

Automated HTTP Response Code Checking

Sometimes you will end up with a list of URLs whose HTTP response codes you would like to check. You might have 200 pages that are sending Google a 302 redirect header, and you would like to check them all at once.

This very rough example reads a list of URLs from a file, fetches their HTTP response codes and redirect location, and prints them to the screen:

while read inputline
do
  url="$inputline"
  headers="$(lynx -dump -head "$url" | grep -e HTTP -e Location)"
  echo "$url $headers"
  sleep 2
done < filename.txt

It is a rough script because the Location field of the headers returned by Lynx sometimes spans two lines. (I'm going to fix that problem soon.)

The sleep command tells the script to pause for 2 seconds between requests. It is optional, but if I am requesting a lot of URLs from one site, I usually pause between requests so that it doesn't make the server do too much work at once.

The basic syntax for processing a file line-by-line in the shell is:

while read inputline
do
  [some commands here]
done < [input filename]
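A concrete, self-contained example of that pattern, numbering each line of a small file created on the spot:

```shell
# names.txt is created here only so the example runs on its own.
printf 'alpha\nbeta\ngamma\n' > names.txt

n=0
while read inputline
do
  n=$((n + 1))
  echo "$n: $inputline"
done < names.txt
# prints:
#   1: alpha
#   2: beta
#   3: gamma
```

Because the file is redirected into the loop (rather than piped), the loop runs in the current shell and variables like n keep their values after the loop ends.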

Extracting Search Engine Hits from Log Files


This page describes some ways to use the GNU/Linux terminal to extract search engine hits from a Web site's log files.

To extract just the Googlebot hits on the site from a logfile, grep for the string Googlebot.
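For example, assuming an Apache-style access log named access_log (the two log lines below are faked with printf so the command can run on its own):

```shell
# Fake a tiny log: one ordinary browser hit, one Googlebot hit.
printf '1.2.3.4 - - [01/Jan/2009] "GET / HTTP/1.1" 200 "Mozilla/5.0"\n66.249.66.1 - - [01/Jan/2009] "GET /about HTTP/1.1" 200 "Googlebot/2.1"\n' > access_log

# Keep only the lines whose user-agent field mentions Googlebot.
grep 'Googlebot' access_log
```

The same approach works for any other crawler; just grep for its user-agent string instead.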

Shell Scripting

The power of Unix-based operating systems (including GNU/Linux, BSD, and Mac OS X) is that you can pipe terminal commands together and write scripts with them.

Piping commands means sending the output of one command to the input of the next command. An example would be to use the grep command to find all of the lines in a logfile (here access_log) that contain the text Googlebot, and then send those lines to the wc command to count them:

grep 'Googlebot' access_log | wc -l

The output would be the number of lines that contain the text Googlebot.

A series of commands can also be put into a script and reused. See below for a list of resources for learning about shell scripting.

Shell Scripting Tutorials


awk Text-Processing Language

Awk is a text processing language. For more information about awk, type man awk in a terminal. Awk is a great tool that will be covered in more detail later.

For an introduction, try this Awk tutorial. If you are using the GNU version of Awk (gawk), you can download the manual. If you are using GNU/Linux, you are probably already using gawk.
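As a quick taste of what awk can do, here is a one-liner that prints the first two fields of each line and totals the third. The data file is made up on the spot so the example runs by itself:

```shell
# Three lines of fake data: a label, a value, and a count.
printf 'alpha 10 1\nbeta 20 2\ngamma 30 3\n' > data.txt

# For each line, print fields 1 and 2 and accumulate field 3;
# the END block runs once after the last line has been read.
awk '{ print $1, $2; total += $3 } END { print "total:", total }' data.txt
# prints:
#   alpha 10
#   beta 20
#   gamma 30
#   total: 6
```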

The tr Command


The tr command translates, deletes, or squeezes characters. The following example takes a logfile that has been converted into a CSV file and deletes all dashes. (Many log files put a dash in a column if there is no data available.)

cat access_log.csv | tr -d '-' > access_log_no_dashes.csv

(You probably wouldn't want to do this if the site has URLs with dashes in them.)
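The other two modes mentioned above work the same way. Here translate turns lowercase into uppercase, and squeeze (-s) collapses runs of a repeated character down to one (the input strings are just illustrations):

```shell
# Translate: map each character in the first set to the matching
# character in the second set.
echo 'hello world' | tr 'a-z' 'A-Z'
# prints: HELLO WORLD

# Squeeze: collapse runs of repeated spaces into a single space.
echo 'too    many    spaces' | tr -s ' '
# prints: too many spaces
```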
