GNU / Linux SEO Tools

I am building a section of this site about using open-source software for SEO. I've mentioned the subject of GNU/Linux and SEO a few times in posts scattered throughout this blog, but I would like to organize the information in an easy-to-navigate reference.

Note: I don't think there is any SEO software made specifically for Linux, but if you have a dual-boot system or run Windows in a virtual machine, try SEO Elite.

GNU/Linux, Mac OS X, BSD, UNIX, Windows

Most of these SEO techniques will work on any major operating system, even Windows to some extent, but I am most familiar with Linux and I'm only sure that these will work in Linux. I highly recommend using a Unix-based operating system for these tutorials. Windows is not ideal for these tasks, even with Cygwin. In a worst-case scenario, you could get away with running Linux inside of Windows with QEMU or VMware.

This is only the beginning of a large resource that I am writing on using GNU/Linux for SEO. If you would like to add GNU/Linux SEO recipes and scripts, leave a comment, or send me an email.

If this is your first time in the terminal, check out my earlier Learning the GNU/Linux Command Line tutorial, the GNU/Linux terminal tips, and the GNU SEO Tools tutorial.

Intro to Search Engine Optimization (SEO)

Search Engine Optimization (SEO)

A few resources for learning more about search engine optimization can be found below. See also my article on the Top 10 Firefox Extensions for SEO.

SEO Software Tip: If you are looking for SEO software to help you with link building and other SEO tasks, you might want to try SEO Elite Software.

Guidelines and Tools From the Search Engines

Google

Yahoo

MSN (Live) Search

SEO Intro for Local Small Businesses

Semantic Markup

Basic Semantic Markup

Writing code that machines can understand.

Microformats

Articles and Newsletters

Many of these sites have free newsletters and RSS feeds that you can subscribe to.

SEO Blogs

How Search Engines Rank Web Sites

Techniques

More techniques coming soon...

Competitive Analysis

Link Building

Online SEO Tools

Directories

A few links of interest. The first two are recommended by Google.

Searching Across Multiple Search Engines

RSS / ATOM Syndication

RSS is useful for submitting items to Google Base. I think ATOM has great potential for SEO. More on that soon...

Other SEO Tools

Lynx Browser and SEO Research

Sometimes you might want to get just the text from a web page to analyze the keywords and content.

Google's webmaster guidelines page recommends the following:

"Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site."

Lynx Web Browser — Information from Wikipedia about Lynx, a text-only browser that has some interesting uses. On Linux (for example with a Knoppix disc) you can type in some instructions at the command line, and get Lynx to retrieve all the text in a web page, and/or a list of links on the page. For example, open a terminal window in Linux and type the following:

$ lynx -dump "http://tips.webdesign10.com" >file1.txt

All the text from that URL will now be in a file called "file1.txt", along with the text from alt attributes, and a complete list of all outgoing links from that page. You can then process the file how you want (for example, with SEO-related scripts). See also this Lynx tutorial.

Search Engine Optimization (SEO) is Not Evil

Because SEO has been abused by many people, it is often perceived as being an unwholesome pursuit. SEOs are associated with search engine cloaking, deceptive JavaScript redirects, hidden text, keyword stuffing, doorway pages, link farms, and other practices designed to manipulate search engine results.

Not all SEO is like that. Legitimate businesses need to optimize their sites for search engines in order to be competitive online. Every day I come across legitimate business Web sites that will never rank well in search engines in their current form. Many Web designers/developers have no concept of how search engines work, and they accidentally create sites that will never rank well no matter how good they look.

The concepts explained in this tutorial relate to on-site optimization — that is, fixing problems that search engines may be encountering when crawling a Web site. Search engines are machines and they see Web sites much differently than humans. These techniques will give information about how search engines are seeing a Web site.

Basic GNU/Linux Commands

The following list of commands is not comprehensive. It is just meant to be a quick introduction that you can print out and paste to the wall next to your computer. To get documentation on any GNU/Linux command (most of these are usually GNU), just type man [command] in a terminal. To learn how to use the man command, type man man in a terminal.

An alternative to the man command is the info command. To learn how to use info, type info info in a terminal.

An introduction to some of the commands is below. To print out these pages as a reference, you can visit the print-friendly version.

Head and Tail

head

Use head to get the first 10 lines of a file. To change the number of lines use the -n option like this:

head -n 1000 access.log

That will output the first 1000 lines of a file.

tail

To get the last 10 lines of a file, use tail. To change the number of lines, use the -n option. The following command outputs the last 1000 lines of a file named access.log:

tail -n 1000 access.log
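As a rough sketch of how the two commands combine, head and tail can be piped together to pull a range of lines out of the middle of a file. The file sample.log and its contents are made up for this illustration.

```shell
# Build a 10-line sample file (made-up data standing in for a log).
workdir="$(mktemp -d)"
seq 1 10 | sed 's/^/line /' > "$workdir/sample.log"

# First 3 lines of the file.
first_three="$(head -n 3 "$workdir/sample.log")"

# Last 2 lines of the file.
last_two="$(tail -n 2 "$workdir/sample.log")"

# Lines 4 through 6: take the first 6 lines, then the last 3 of those.
middle="$(head -n 6 "$workdir/sample.log" | tail -n 3)"

echo "$middle"
```

On a live server, tail -f access.log keeps the file open and prints new lines as they are written.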

awk

awk Text-Processing Language

Awk is a text processing language. For more information about awk, type man awk in a terminal. Awk is a great tool that will be covered in more detail later.

For an introduction, try this Awk tutorial. If using the GNU version of Awk (gawk), you can download the manual. If you are using GNU/Linux you are probably using gawk.
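Until the longer awk coverage is written, here is a minimal sketch of the kind of thing awk is good at: picking columns out of a log file. The three log lines are invented, and the field numbers (7 for the URL, 9 for the status code, 10 for the bytes sent) are assumptions that only hold for this sample layout.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/access.log" <<'EOF'
66.249.65.99 - - [19/Oct/2006:00:22:41 +0000] "GET /index.html HTTP/1.1" 200 1234
66.249.65.99 - - [19/Oct/2006:00:22:45 +0000] "GET /old-page.html HTTP/1.1" 404 512
10.0.0.5 - - [19/Oct/2006:00:23:01 +0000] "GET /about.html HTTP/1.1" 200 2048
EOF

# Print the requested URL (field 7) for every line.
urls="$(awk '{ print $7 }' "$workdir/access.log")"

# Print only the URLs whose status code (field 9) is 404.
not_found="$(awk '$9 == 404 { print $7 }' "$workdir/access.log")"

# Add up the bytes sent (field 10) across all lines.
total_bytes="$(awk '{ sum += $10 } END { print sum }' "$workdir/access.log")"

echo "$not_found"
echo "$total_bytes"
```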

fg, Ctrl-z and jobs

Pausing Commands

Ctrl-z

To pause a task in the terminal, press Ctrl-z. That suspends the task and frees up the terminal for other commands. The suspended task does not keep running; type bg to let it continue in the background, or fg to bring it back to the foreground.

fg

If you have suspended a program with Ctrl-z, you can bring it back to the foreground with the fg command. If you have more than one suspended or background job, you can put the number of the job after the command.

jobs

Typing jobs in the terminal will give you a list of all the programs that are stopped or running in the background:

$ jobs
[1]-  Stopped                 vim linux_seo_tools.html
[2]+  Stopped                 man iwlist

To restart my instance of vim in the above example, I would type fg 1. To restart the man page, I would type fg 2.

Grep Tutorial

Using the grep Command

Grep is one of the most useful commands. You can use grep on any UNIX-based operating system, and you can even get grep for Windows. When I'm stuck in Windows, I use grep inside of Cygwin.

The syntax for grep is:

grep [options] PATTERN [FILE...]

Commonly used options are:

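-E
This turns on extended regular expressions. It lets you use operators such as | and + without backslashes. You can turn this on with grep -E or by running egrep; they are the same thing (newer versions of GNU grep mark egrep as obsolescent). Note that lowercase -e is a different option: it introduces a pattern, which lets you give several patterns in one command. See below for more about extended regular expressions.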
-i
This makes the search case-insensitive.
-v
This means to find every line that doesn't match.
-o
This means to extract only the matching part of the line.
-m [number]
Stop after [number] of matching lines.
-r
Search recursively, descending into all subdirectories.
-n
Add line numbers to show where the match was in the file.

There are many more options than the common ones I've listed above. Type man grep in the terminal for a complete list.

Grep Example

To search a log file for every line that contains Googlebot and write to a file called google_access.log, you could use this:

grep Googlebot access.log > google_access.log

The following example of grep takes a logfile that only has hits from Googlebot and removes all of the requests for .png files.

grep -v '\.png' google_access.log > google_access_no_pngs.log
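The options listed earlier can be tried out safely on a throwaway file. The sketch below uses four invented, simplified log lines (not a real log format) to show -c, -i, -E with -o, and -v.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/access.log" <<'EOF'
GET /index.html 200 Googlebot/2.1
GET /logo.png 200 Googlebot/2.1
GET /index.html 200 msnbot/1.0
GET /missing.html 404 googlebot/2.1
EOF

# -c counts matching lines instead of printing them.
exact_hits="$(grep -c 'Googlebot' "$workdir/access.log")"

# -i ignores case, so "googlebot" matches as well.
any_case_hits="$(grep -ci 'googlebot' "$workdir/access.log")"

# -E enables extended regular expressions; -o prints only the match.
bots="$(grep -Eo '(Googlebot|msnbot)/[0-9.]+' "$workdir/access.log")"

# -v inverts the match. The pattern is quoted so the shell passes the
# backslash through to grep, where '\.' means a literal dot.
no_images="$(grep -vc '\.png' "$workdir/access.log")"

echo "$bots"
```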

Lynx Browser

Google's Webmaster Guidelines say,

Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.

I've already written a Lynx Browser tutorial as well as devoting an entire section of this Web site to Lynx, but here is a quick reference:

Get the text from a Web page as well as a list of links

lynx -dump "http://www.example.com/"

Get the source code from a Web page with Lynx

lynx -source "http://www.example.com/"

Get the response headers with Lynx

lynx -dump -head "http://www.example.com/"

pwd: Finding Your Location in the Filesystem

pwd

If you lose your place in the filesystem and would like to know your current location, type pwd. This is useful when connected to a Unix/Linux server over SSH where the terminal prompt does not give your location in the filesystem.

Sed - A Stream Editor

sed

Sed is a stream editor. It allows you to transform text when piping it through a series of commands. A common use for sed is to substitute characters. The syntax for substitution in sed is s/old/new/g, where new replaces old. The g means to replace globally on the line. If you leave the g off, sed will only replace the first occurrence of old. You can replace g with a number to replace a certain instance of a word. For example, to replace the 2nd occurrence of old with new on each line you could use s/old/new/2.

An example of sed would be to take a file with a list of URLs and remove the query strings from the URLs like this:

cat urls.txt | sed 's/?.*//'

In sed's basic regular expressions a question mark is an ordinary character, so it does not need a backslash. (In GNU sed, \? is a repetition operator meaning "zero or one of the preceding item", so escaping the question mark is best avoided here.) The period matches any character, and the asterisk means "zero or more of the preceding character". Put together, the expression replaces the question mark and any characters after it with nothing.
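Here is a small sketch of the substitutions described above, run against a made-up urls.txt. It leaves the question mark unescaped, because in sed's basic regular expressions a plain question mark is already a literal character.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/urls.txt" <<'EOF'
http://www.example.com/page.html?id=1&sort=asc
http://www.example.com/about.html
http://www.example.com/search?q=old+old+old
EOF

# Remove everything from the first question mark to the end of the line.
clean_urls="$(sed 's/?.*//' "$workdir/urls.txt")"

# Replace only the 2nd occurrence of "old" on the line.
second_only="$(echo 'old old old' | sed 's/old/new/2')"

# Replace every occurrence with the g flag.
all_replaced="$(echo 'old old old' | sed 's/old/new/g')"

echo "$clean_urls"
```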

For an overview of what sed can do, type man sed in a terminal, and check out some of these sed resources:

sort and uniq

sort

The sort command sorts output.

uniq

The uniq command removes repeated lines. It only compares adjacent lines, so the input is usually sorted first.

See the GNU/Linux command line tutorial for detailed instructions. Or, type man sort or man uniq in a terminal.
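A common pattern is to chain sort and uniq to find which lines occur most often. The following sketch uses an invented list of requested URLs; uniq -c prefixes each line with the number of times it occurred.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/urls.txt" <<'EOF'
/index.html
/about.html
/index.html
/contact.html
/index.html
/about.html
EOF

# uniq only collapses adjacent duplicates, so sort first.
unique_urls="$(sort "$workdir/urls.txt" | uniq)"

# Count each URL, then sort numerically with the largest count first.
ranked="$(sort "$workdir/urls.txt" | uniq -c | sort -rn)"

# Pull the URL and its count out of the top line.
top_url="$(echo "$ranked" | head -n 1 | awk '{ print $2 }')"
top_count="$(echo "$ranked" | head -n 1 | awk '{ print $1 }')"

echo "$top_url was requested $top_count times"
```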

The cat Command

cat

The cat command concatenates (combines) multiple files into one. It is also used to output the contents of a file.

The following command will combine all files in the current directory with the extension .log and put them in a file called big_log_file.log (if that file already exists it also matches *.log, so remove it before rerunning the command):

cat *.log > big_log_file.log

The following command will output the contents of a logfile. This is useful for piping the contents into another command:

cat access.log

You can create a new file with cat like this:

cat > myfile.txt

Then hit enter and type in the text of your file. When you are finished, press Ctrl-d on an empty line to signal the end of the input and exit the cat command.

The diff Command

diff

The diff command finds the difference between files. Type man diff for full details. Diff will be covered in more detail soon.
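Until that coverage is written, here is a minimal sketch of one SEO-flavored use: comparing two lists of URLs (for example, from two crawls of a site) to see what was removed and what was added. The file names and URLs are made up.

```shell
workdir="$(mktemp -d)"
printf '%s\n' /index.html /old-page.html /about.html > "$workdir/urls_before.txt"
printf '%s\n' /index.html /about.html /new-page.html > "$workdir/urls_after.txt"

# Lines starting with "<" exist only in the first file; lines starting
# with ">" exist only in the second. diff exits with status 1 when the
# files differ, so "|| true" keeps that from stopping a strict script.
changes="$(diff "$workdir/urls_before.txt" "$workdir/urls_after.txt" || true)"

removed="$(echo "$changes" | grep '^<' | sed 's/^< //')"
added="$(echo "$changes" | grep '^>' | sed 's/^> //')"

echo "removed: $removed"
echo "added: $added"
```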

The echo Command

echo

The echo command prints text. The following command prints Hello World to the screen:

echo "Hello World"

The following command prints Hello World to a file called myfile.txt:

echo "Hello World" > myfile.txt

The tee Command

tee

The tee command writes the text that is passed to it to a file, and then passes it to standard output — generally the output will be piped to another command.

Here is an example of the tee command:

cat access.log | grep 'msnbot' | tee msn_access.log | egrep '(jpg|png|gif)' > msn_image_access.log

The above example does the following:

  1. cat is used to output the contents of a logfile.
  2. Then all of the lines containing msnbot are extracted with grep
  3. The tee command writes the input that it receives to a file called msn_access.log, and then passes the same output to the next command.
  4. Egrep extracts lines from the msnbot lines that contain the text jpg, png, or gif (i.e., image files).
  5. The output is then written to a file called msn_image_access.log.

The result is two new log files — one containing hits from msnbot, and the other containing only msnbot's requests for images on the site.

(NOTE: the above example is not going to be highly precise, but it is simplified so that it doesn't get too confusing.)

You can also use tee to write output to a file and to the screen at the same time like this:

grep 'Googlebot' access.log | tee googlebot_access.log
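To try the idea without a real log file, the sketch below runs a similar pipeline against four invented, simplified log lines. It assumes the file extension is followed by a space in each line, which is only true of this sample layout.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/access.log" <<'EOF'
GET /index.html 200 msnbot/1.0
GET /logo.png 200 msnbot/1.0
GET /photo.jpg 200 Googlebot/2.1
GET /banner.gif 200 msnbot/1.0
EOF

# tee saves a copy of the msnbot lines and passes them along unchanged.
grep 'msnbot' "$workdir/access.log" \
  | tee "$workdir/msn_access.log" \
  | grep -E '\.(jpg|png|gif) ' > "$workdir/msn_image_access.log"

# grep -c '' counts every line in a file.
all_msn="$(grep -c '' "$workdir/msn_access.log")"
image_msn="$(grep -c '' "$workdir/msn_image_access.log")"

echo "msnbot hits: $all_msn, image hits: $image_msn"
```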

The tr Command

tr

The tr command translates, deletes, or squeezes characters. The following example takes a logfile that has been converted into a CSV file and deletes all dashes. (Many log files put a dash in a column if there is no data available.)

cat access_log.csv | tr -d - > access_log_no_dashes.csv

(You probably wouldn't want to do this if the site has URLs with dashes in them.)
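The other two jobs of tr, translating and squeezing, can be sketched on short strings. The input strings here are made up; the last line shows why deleting dashes is risky when URLs contain them.

```shell
# Squeeze runs of spaces into one, then translate each space into a comma.
csv_line="$(echo 'GET   /index.html  200' | tr -s ' ' | tr ' ' ',')"

# Translate uppercase letters to lowercase.
lowered="$(echo 'Googlebot' | tr '[:upper:]' '[:lower:]')"

# Delete every dash. Note that the dash inside the URL disappears too.
no_dashes="$(echo '/my-page.html - -' | tr -d '-')"

echo "$csv_line"
echo "$lowered"
echo "$no_dashes"
```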

The wc Command: Counting Lines in a File

wc

The wc command is useful for counting lines in a file. In the case of log files, it can count hits. The -l option will print out the number of lines.

wc -l access.log

You can also use it on multiple files — in the following case, on all files with a .log extension:

wc -l *.log

unzip, tar and File Compression

unzip

Used to decompress files with a .zip extension. Basic usage: unzip [filename].

tar

tar is used for creating and extracting archives, which are often compressed with gzip. You might find yourself using it to extract archived log files with a tar.gz extension, typically with tar -xzf [filename]. Type man tar in a terminal for instructions on how to use it.
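As a quick sketch of the most common tar options, the following creates a gzip-compressed archive from a made-up logs directory, lists its contents, and extracts it somewhere else. All paths are temporary and purely illustrative.

```shell
workdir="$(mktemp -d)"
mkdir "$workdir/logs" "$workdir/restored"
echo 'GET /index.html 200' > "$workdir/logs/access.log"

# c = create, z = gzip, f = archive file name; -C changes directory first.
tar -czf "$workdir/logs.tar.gz" -C "$workdir" logs

# t = list the contents without extracting anything.
listing="$(tar -tzf "$workdir/logs.tar.gz")"

# x = extract, here into a separate directory.
tar -xzf "$workdir/logs.tar.gz" -C "$workdir/restored"
restored_line="$(cat "$workdir/restored/logs/access.log")"

echo "$listing"
```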

Variables

Variables in the Shell

When a variable is first used you assign it a value with an equals sign like this:

myvariable="Hello"

When you want to access the variable, put a dollar sign in front of it:

echo $myvariable
Hello

There are also built-in variables like:

  • $HOME
  • $PATH
  • $USER or $USERNAME
  • $OSTYPE
  • $LINES
  • $SHELL
  • $COLUMNS

For example, to find your home directory, you can type echo $HOME, or use it in a script. The following line will download the source code from the URL http://www.example.com/ and save it to the Desktop of the current user:

lynx -source "http://www.example.com/" > "$HOME/Desktop/example.html"

whois, nslookup, ping

whois

The whois command will get domain registration information about a Web site. Usage:

whois example.com

nslookup

Used to query nameservers. You can find out a site's IP address with this command.

$ nslookup google.com
Server:         66.94.25.120
Address:        66.94.25.120#53

Non-authoritative answer:
Name:   google.com
Address: 64.233.187.99
Name:   google.com
Address: 64.233.167.99
Name:   google.com
Address: 72.14.207.99

ping

Ping a site to see if you get a response. It will also tell you the remote site's IP address.

$ ping google.com
PING google.com (64.233.187.99) 56(84) bytes of data.
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=1 ttl=242 time=33.0 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=2 ttl=242 time=31.8 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=3 ttl=242 time=44.2 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=4 ttl=242 time=29.7 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=5 ttl=242 time=52.0 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=6 ttl=242 time=46.8 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=7 ttl=242 time=48.8 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=8 ttl=242 time=55.3 ms
64 bytes from jc-in-f99.google.com (64.233.187.99): icmp_seq=9 ttl=242 time=43.4 ms

Shell Scripting

The power of Unix-based operating systems (including GNU/Linux, BSD, and Mac OS X) is that you can pipe terminal commands together and write scripts with them.

Piping commands means to send the output of one command to the input of the next command. An example would be to use the grep command to find all of the lines in a logfile that contain the text Googlebot and then send those lines to the wc command to count them:

grep 'Googlebot' access.log | wc -l

The output would be the number of lines that contain the text Googlebot.

Series of commands can also be put into scripts and reused. See below for a list of resources for learning about shell scripting.
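As a minimal sketch of turning a pipeline into something reusable, the function below wraps the grep and wc pipeline so it can be called with any pattern and file. The name count_hits and the sample log lines are my own inventions for this example.

```shell
# count_hits PATTERN FILE: print how many lines of FILE contain PATTERN.
count_hits() {
  # tr strips the padding that some versions of wc add around the number.
  grep -- "$1" "$2" | wc -l | tr -d ' '
}

workdir="$(mktemp -d)"
cat > "$workdir/access.log" <<'EOF'
GET /index.html 200 Googlebot/2.1
GET /about.html 200 msnbot/1.0
GET /contact.html 200 Googlebot/2.1
EOF

googlebot_hits="$(count_hits 'Googlebot' "$workdir/access.log")"
msnbot_hits="$(count_hits 'msnbot' "$workdir/access.log")"

echo "Googlebot: $googlebot_hits, msnbot: $msnbot_hits"
```

Saved in a file and loaded with the source command (or pasted into a script), the function can be reused for any bot name.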

Shell Scripting Tutorials

SEO Techniques

This section contains actual techniques that can be used to analyze search engine activity on a Web site. As mentioned elsewhere, these definitely work on GNU/Linux (in my case, Ubuntu 6.06), but should also work on BSD, Mac OS X, and even Windows (with Cygwin). For best results, use a Unix-based operating system for these techniques and not Windows.

This is an ongoing project and many more pages will be added to this section soon. If you have any script recipes to add, leave a comment, or send me an email. The primary focus is on basic shell scripting, but scripts in Ruby, Python, PHP, Perl and other languages may also be added.

Extracting Search Engine Hits from Log Files

This page describes some ways to use the GNU/Linux terminal to extract search engine hits from a Web site's log files.

To extract just the Googlebot hits on the site from a logfile, try this:

grep 'Googlebot/' access.log > googlebot_access.log

That will write the Googlebot hits to a new logfile called googlebot_access.log.

You can also pipe that output into another command, for example to extract only the URLs that Googlebot is requesting:

grep 'Googlebot/' access.log | \
cut -d' ' -f7 > googlebot_urls_access.log

(The above line assumes that the URLs are in column 7 of the log file. You might have to adjust it based on your log file format.)

Once you have a file with Google's hits on your site, you can then grep it for specific response codes. For example, to get all the 404 Not Found pages that Googlebot is hitting, you could use the following:

grep 'Googlebot/' access.log | \
grep '[[:space:]]404[[:space:]]' > googlebot_404s.txt

Having a list of URLs that send 404s to search engines can tell you interesting information:

  • The location of pages that used to exist on an older version of the site that were not redirected with a 301 to their new locations (hint: some of them may still have inbound links and even PageRank)
  • Inbound links that have typos in the URLs, or that go to pages that were removed at some point
  • and more...

You can also grep for other status codes (302, 301, 500, etc.), which will usually provide other interesting information, especially on a large site.
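Rather than grepping for one status code at a time, a rough sketch like the following summarizes every code Googlebot received. The log lines are invented, and the field positions (7 for the URL, 9 for the status code) are assumptions that match only this sample layout.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/access.log" <<'EOF'
66.249.65.99 - - [19/Oct/2006:00:22:41 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "Googlebot/2.1"
66.249.65.99 - - [19/Oct/2006:00:22:45 +0000] "GET /old-page.html HTTP/1.1" 404 512 "-" "Googlebot/2.1"
66.249.65.99 - - [19/Oct/2006:00:22:50 +0000] "GET /moved.html HTTP/1.1" 301 0 "-" "Googlebot/2.1"
66.249.65.99 - - [19/Oct/2006:00:22:55 +0000] "GET /gone.html HTTP/1.1" 404 512 "-" "Googlebot/2.1"
10.0.0.5 - - [19/Oct/2006:00:23:01 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
EOF

# Count how often each status code appears in Googlebot's hits.
googlebot_codes="$(grep 'Googlebot/' "$workdir/access.log" \
  | awk '{ print $9 }' | sort | uniq -c | sort -rn)"

# The most frequent status code is on the first line of the summary.
most_common="$(echo "$googlebot_codes" | head -n 1 | awk '{ print $2 }')"

# List the URLs that returned a 404 to Googlebot.
not_found_urls="$(grep 'Googlebot/' "$workdir/access.log" \
  | awk '$9 == 404 { print $7 }')"

echo "$googlebot_codes"
echo "$not_found_urls"
```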

You can extract all major search engine information and convert it to a CSV file for processing in a spreadsheet. You can open a space-delimited file in a spreadsheet, but converting it to a comma-delimited format will allow you to have blank columns in case you need to remove the dashes in the log file.

The following one-liner will do the following:

  1. Use egrep to extract any lines that contain Googlebot/, Yahoo!, or msnbot from a file named access.log
  2. Use tee to write that output to a file called se_access.txt, and also send the output to the next command
  3. Use cut to extract columns 1 to 15 (delimited by spaces) and output it with commas as the delimiter to a file named se_access.csv
  4. Open the CSV file with OpenOffice.

egrep '(Googlebot/|Yahoo!|msnbot)' access.log | \
tee se_access.txt | \
cut -d' ' -f1-15 --output-delimiter=, \
> se_access.csv; openoffice -calc se_access.csv

Automated HTTP Response Code Checking

Sometimes you will end up with a list of URLs that you would like to check the HTTP response codes on. You might have 200 pages that are sending Google a 302 redirect header and you would like to check them all at once.

This very rough example reads a list of URLs from a file, fetches their HTTP response codes and redirect location, and prints them to the screen:

while read -r inputline
do
  # Quote the URL so characters like ? and & are passed to lynx unchanged
  url="$inputline"
  headers="$(lynx -dump -head "$url" | grep -e HTTP -e Location)"
  echo "$url $headers"
  sleep 2
done < filename.txt

It is a rough script because the Location field of the headers returned by Lynx sometimes spans two lines. (I'm going to fix that problem soon.)

The sleep command tells the script to pause for 2 seconds between requests. It is optional, but if I am requesting a lot of URLs from one site, I usually pause between requests so that it doesn't make the server do too much work at once.

The basic syntax for processing a file line-by-line in the shell is:


while read inputline
do
  [some commands here]
done < [input filename]
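Here is a small runnable sketch of that loop which does nothing but number the lines, so it can be tested without touching the network. The file urls.txt and its contents are made up, and skipping blank lines and comment lines is my own addition.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/urls.txt" <<'EOF'
http://www.example.com/

# a comment line
http://www.example.com/about.html
EOF

count=0
# IFS= and -r keep each line exactly as written (no trimming and no
# backslash processing).
while IFS= read -r inputline
do
  # Skip blank lines and lines starting with "#".
  case "$inputline" in
    ''|'#'*) continue ;;
  esac
  count=$((count + 1))
  echo "$count: $inputline"
done < "$workdir/urls.txt"
```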

Checking Domain Age Programmatically

You can check the year that a domain was registered with the following command:

whois example.com | grep -i 'creat' | head -n1 | grep -Eo '[[:digit:]]{4}'

The above line does the following:

  1. The whois command gets the WHOIS record for the domain example.com.
  2. The grep command extracts the lines that contain creat, such as Creation Date or Created. The -i option means to search case-insensitively.
  3. head -n1 returns just the first line that matches, otherwise you may end up with two lines matching.
  4. The final grep -Eo extracts just the 4 digits on the line; that should give you the year that the domain was registered. The -E option is needed so that {4} is treated as a repetition count.

You can extract the exact day with the following command:

whois example.com | grep -i 'creat' | head -n1 | \
egrep -o '[[:digit:]]{2}-[a-zA-Z0-9]{1,10}-[[:digit:]]{4}'

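It works in a similar manner to the first example, but uses a regular expression to extract the full date. WHOIS date formats vary between registries, so the pattern may need adjusting for the records you are querying.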

You can also run this on a list of domains in a text file by reading each line of the file.
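A rough sketch of that idea follows. The extraction step is wrapped in a function (creation_year is a name I made up) so it can be tested against a hand-written record; the sample record is invented for illustration and real WHOIS output will look different. The loop over domains.txt, a hypothetical file with one domain per line, is shown as a comment because it needs network access.

```shell
# creation_year: read a WHOIS record on standard input and print the
# first 4-digit number on the first line containing "creat".
creation_year() {
  grep -i 'creat' | head -n 1 | grep -Eo '[[:digit:]]{4}' | head -n 1
}

# Made-up record for illustration; real WHOIS output varies by registry.
sample_record='Domain Name: EXAMPLE.COM
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2030-08-13T04:00:00Z'

year="$(printf '%s\n' "$sample_record" | creation_year)"
echo "$year"

# With network access, the same function could be applied to each domain:
#
#   while IFS= read -r domain; do
#     echo "$domain $(whois "$domain" | creation_year)"
#     sleep 2
#   done < domains.txt
```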

Counting Outbound Links on a Web Page With Lynx

How to count the number of outbound links on a page with Lynx and GNU

Lynx can be used with the -dump option to dump the text and links from a Web page in the terminal. That output can then be piped into the grep command, which can extract the URLs or other information.

The following line will count the number of outgoing links on a Web page, including internal and external links:

lynx -dump "http://www.example.com/" | grep -o "http.*" | wc -l

See my other GNU/Linux Lynx tutorial for more details on how lynx and grep can work together to extract links. The wc -l command counts the number of lines. In this case, each line is one link, so counting the lines gives you the number of links on a Web page.

How to count the number of links to external sites on a page

lynx -dump "http://www.example.com/" | grep -o "http.*" | grep -v "http://www.example.com" | wc -l

Using grep with the -v option tells it to give you all of the lines that don't match. In this case it will give you all of the links that don't include the domain name of the current Web page.

How to count the number of internal links on a page

lynx -dump "http://www.example.com/" | grep -o "http://www.example.com" | wc -l

Similar to the above example, this will only count URLs that do include the domain name of the current Web page.
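The three counts can be tried offline against a saved dump. In the sketch below, dump.txt is hand-written text that only imitates the general shape of lynx -dump output (a numbered References list), and the domains are invented. Unlike the one-liners above, it removes duplicate links with sort -u before counting.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/dump.txt" <<'EOF'
   Welcome to Example

References

   1. http://www.example.com/about.html
   2. http://www.example.com/contact.html
   3. http://www.other-site.example/
   4. http://www.example.com/about.html
   5. http://partner.example.net/page
EOF

# Extract each URL up to the next whitespace, then drop duplicates.
links="$(grep -Eo 'https?://[^[:space:]]+' "$workdir/dump.txt" | sort -u)"

total="$(echo "$links" | grep -c '')"
internal="$(echo "$links" | grep -c '^http://www\.example\.com/')"
external="$(echo "$links" | grep -vc '^http://www\.example\.com/')"

echo "total: $total, internal: $internal, external: $external"
```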

Extracting and Reconstructing URLs from an IIS Log

IIS logs are often configured to output the filename in one column and the query string in the following column. An example of a line from an IIS log is shown below, with a highlighted filename and query string:

2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -

Unfortunately, the default settings on IIS do not seem to output the actual full URLs requested. It may be useful to get a list of URLs that were accessed by Google in order to process them further.

The following one-liner does the following:

  1. greps a log file that contains only hits from search engines for 404 errors. This will give a list of every "404 Not Found" page that search engines are visiting.
  2. It then uses the cut command to extract columns 8 and 9 — in this case, the page (/products.aspx) and the query string (item=12345).
  3. Then it uses awk to print out http://www.example.com/[filename]?[query string]
  4. Because not every requested page has a query string, sed can be used to remove the ?- that will be found on hits that don't have a query string

(NOTE: I've used backslashes to escape the end of the line — the following is a one-liner, but because of this Web page's formatting, I'm displaying it on multiple lines.)

grep '[[:space:]]404[[:space:]]' se_access.txt | \
cut -d' ' -f8,9 | \
awk '{ print "http://www.example.com"$1"?"$2}' | \
sed 's/?-$//' > 404_errors.txt

The final result is a file named 404_errors.txt that contains a list of nonexistent URLs that search engines are requesting on the site. The example above would take the following line from an IIS log:

2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -

and convert it to the following line:

http://www.example.com/products.aspx?item=12345

A list of URLs that send 404s is very useful for debugging sites. The list of URLs can be processed further as needed.
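As an alternative sketch, the same reconstruction can be done in a single awk command, which avoids the cleanup step by only adding the question mark when a query string is present. The first log line is the one shown above; the other two are made-up variations, and the field numbers (8, 9, and 10) assume this exact IIS column layout.

```shell
workdir="$(mktemp -d)"
cat > "$workdir/se_access.txt" <<'EOF'
2006-10-19 00:22:41 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /products.aspx item=12345 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
2006-10-19 00:23:10 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /old-page.aspx - 404 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
2006-10-19 00:24:02 66.249.65.99 - nnn.nnn.nn.nnn 80 GET /default.aspx - 200 0 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - -
EOF

# Field 8 is the page, field 9 the query string ("-" when empty),
# and field 10 the status code.
awk -v base="http://www.example.com" '
  $10 == 404 {
    url = base $8
    if ($9 != "-") url = url "?" $9
    print url
  }
' "$workdir/se_access.txt" > "$workdir/404_errors.txt"

errors="$(cat "$workdir/404_errors.txt")"
echo "$errors"
```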

How to Scrape Web Pages from the GNU/Linux Shell

You can quickly scrape Web pages in a rough manner with the Lynx Browser and grep (and other tools that will be explained in the near future). Lynx is able to dump the contents of Web pages in two ways: only the text of the page, or the entire HTML source of the page.

To extract the text of a Web page with the HTML tags stripped out, you can use the -dump option like this:

lynx -dump "http://www.example.com/"

If you want the entire source code, you can use the -source option:

lynx -source "http://www.example.com/"

You can then pipe the Web page into grep and sed like this:

lynx -source "http://www.example.com/" | grep -o 'your regular expression here' | sed 's/html tags here//g'

The steps are:

  1. Lynx fetches the source code of the page http://www.example.com/.
  2. Grep used with the -o option extracts just the matching part of the line that contains your regular expression—probably some HTML tags with some text that you want to extract in the middle. Note that this will only work if it all appears on one line. I'm going to provide some better examples soon, but for now this script does have some useful applications.
  3. Sed then strips out the HTML tags to leave just the text within the HTML tags.

It's a rough, simple way to scrape a page and may not provide perfect results, but it shows the basic concept and can be modified to your needs.
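To make the placeholders concrete, here is a minimal sketch that extracts the title of a page. The HTML is a made-up snippet stored in a variable so the example runs offline; with a real page, the printf would be replaced by lynx -source and a URL. As noted above, it only works when the whole title element sits on one line.

```shell
html='<html><head><title>Example Domain</title></head>
<body><h1>Welcome</h1></body></html>'

# grep -o keeps only the <title>...</title> part of the matching line,
# then sed deletes anything that looks like an HTML tag.
title="$(printf '%s\n' "$html" \
  | grep -o '<title>[^<]*</title>' \
  | sed 's/<[^>]*>//g')"

echo "$title"
```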

The following script shows how to loop through a list of URLs in a text file called urls.txt and scrape some content from them:

while read -r inputline
do
  # Quote the URL so characters like ? and & are passed to lynx unchanged
  url="$inputline"
  mydata="$(lynx -source "$url" | grep -o 'your regular expression here' | sed 's/html tags here//g')"
  echo "$url,$mydata" >> myfile.csv
  sleep 2
done < urls.txt

The steps of the script are as follows:

  1. The while/do/done loop reads the file urls.txt into the script line by line.
  2. The current line of the file (a URL) is assigned to the variable $url.
  3. Lynx is used to fetch the source code of the URL stored in the variable $url.
  4. The source of the URL is then piped into grep where some text inside of HTML tags is extracted.
  5. sed is used to strip out the HTML tags.
  6. The URL and the extracted text are then appended to a new file called myfile.csv.
  7. If necessary, you can have the script sleep for a couple of seconds before moving on to the next URL

This is just a rough example of one way to scrape pages in the Linux terminal. If you know a scripting language like Perl, Python or Ruby, you can use those to parse the HTML in a more elegant fashion. This page will be greatly expanded soon...

Recursively Find and Replace in GNU/Linux

Web designers often link to index.html in directories throughout a Web site — or even worse, only partially throughout a Web site. If you are dealing with a static HTML site, it should be fairly easy to fix with this recipe.

The following line in the GNU/Linux terminal will find and replace (delete) the text index.html recursively in all files, starting in the current directory:

find ./* -type f -exec sed -i 's/index\.html//g' {} \;

(Adapted from a LinuxForums.org post.)
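Because an in-place replace across a whole site is hard to undo, a more cautious sketch is shown below. It works on a throwaway copy of a made-up two-page site, previews the affected files with grep -rl, restricts the change to .html files, and uses the GNU sed -i.bak form so that each original is kept with a .bak extension. These safety steps are my additions, not part of the adapted recipe.

```shell
workdir="$(mktemp -d)"
mkdir -p "$workdir/site/products"
echo '<a href="/products/index.html">Products</a>' > "$workdir/site/index.html"
echo '<a href="../index.html">Home</a>' > "$workdir/site/products/index.html"

# Preview which files contain the text before changing anything.
affected="$(grep -rl 'index\.html' "$workdir/site" | grep -c '')"

# -i.bak edits in place but keeps a copy of each original as *.bak.
# The dot is escaped so it only matches a literal dot.
find "$workdir/site" -type f -name '*.html' \
  -exec sed -i.bak 's/index\.html//g' {} \;

home_page="$(cat "$workdir/site/index.html")"
products_page="$(cat "$workdir/site/products/index.html")"

echo "$home_page"
echo "$products_page"
```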

You can then redirect all instances of index.html to the roots of the directories (the slash) with the following lines in the .htaccess file:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(([^/]+/)*)index\.html\ HTTP/
RewriteRule index\.html$ http://www.example.com/%1 [R=301,L]

Apache Log Statistics

I found a script from the book Wicked Cool Shell Scripts that quickly extracts useful data from Apache log files.

I'm not sure what license the script is under so I can't reprint it here, but it's worth checking out.