Slug: the-chain-gang-fetchmail-procmail-python-and-analog Date: 2003-03-18 Title: "The Chain Gang: fetchmail, procmail, python, and analog" layout: post

Ok, so I finally got my server log processing chain working the way I want, and I figured I'd better document it 1) because I'm going to forget very soon how to do it, and 2) because I still get hits from Google for the word 'fetchmail' from the last time I mentioned it, so there's obviously an interest.

First, a description of the problem area, then into the howto.

I run several sites on Conversant, and I'm subscribed to the server logs, meaning each night around 11:00 pm I get the web server log file emailed to me in NCSA Extended format. I had been receiving the logs at my usual address, manually saving out the log files, and running analog on them. Having recently acquired a PC running Linux, I decided to move this process to that box and automate it.

So, the plan: use fetchmail to get the email, procmail to sort it, another script to pull out the attached files and save them to the proper place, then schedule a cron job to run analog on the log files later. I'm going to explain this process from the POV of a unix user, since both my boxen run unix. It could be adapted for use on Windows, I suppose, but I'm not certain about fetchmail and procmail there. (Cygwin, perhaps?)

So, the tools we'll need:

Collecting our tools

fetchmail and procmail should be found on almost any unix. Otherwise you can get them from the links above. They should compile and install very easily using the typical configure, make, make install process.

I used Python for my mail-processing script because I already had it on the machine and it was easier than doing it in Perl. Python can be found at python.org and is available for most any OS.

Analog is a great webserver log analysis tool, and is also available for most any OS. Download it from the above link and follow the installation instructions. (Basically the same - configure, make, make install).

Step One: fetchmail

fetchmail is a simple utility that does one thing and does it well. It requires a single configuration file, .fetchmailrc, in the home directory of the user it's going to run as.

This is my .fetchmailrc:<blockquote>

poll mail.redmonk.net protocol pop3 username <username> password <pwd> is "logs" here

</blockquote>

That's all there is to it. fetchmail has a lot of options you can read about by running man fetchmail, but these are the most common, I think. Later we'll be running fetchmail via cron, but you can also run it in daemon mode by appending --daemon <numberOfSeconds> to the call the first time you run it. After that it will poll every <numberOfSeconds> seconds until you shut the daemon down with fetchmail --quit.
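For example, to poll every five minutes (300 is just an interval I picked for illustration) and later shut the daemon down:<blockquote>

fetchmail --daemon 300
fetchmail --quit

</blockquote>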

BTW - you may have to create a ~/.forward file if your incoming mail isn't already routed through procmail by default. The contents should look like this (the quotes and the leading pipe tell the local mailer to pipe each message into procmail):

"|/path/to/procmail"
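Alternatively, fetchmail can hand messages straight to procmail without going through the local mailer, using an mda option in .fetchmailrc. I haven't tested this variant myself, and the procmail path here is an assumption for your system (%T is fetchmail's placeholder for the recipient):<blockquote>

poll mail.redmonk.net protocol pop3 username <username> password <pwd> mda "/usr/bin/procmail -d %T"

</blockquote>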

Step Two: procmail

procmail was a pain to configure, largely because I couldn't find a decent reference for the rule syntax used in .procmailrc, so I had to do a lot of trial and error. The best reference I did find was Timo's procmail tips and recipes, and it was invaluable.

Here are the relevant sections of my .procmailrc file:

Setup:<blockquote>

SHELL=/bin/sh
DEFAULT=/var/mail/logs

# I created a user called 'logs' just for this purpose.
# The cron jobs are in this user's crontab too.
# Replace 'logs' with the name of the user procmail is running for.
LOGFILE=/home/logs/procmail.log
# Troubleshooting:
VERBOSE=yes
LOGABSTRACT=all
# Shortcuts:
BASEDIR=/home/logs/slproc
LOGDIR=/data/log/sites
MAILDIR=/data/mail/

</blockquote>

This is the main rule. I've got three of these, one for each server's log file email that fetchmail is expected to receive.<blockquote>

# :0 starts the rule, 'f' means the rule is a filter,
# 'w' means wait for the piped process to complete
# before continuing
:0 fw
# Is it from logs@free-conversant.com?
* $ ^From:.*logs@free-conversant.com.*
# Is it from the redmonk domain and site?
# My emails come in with a subject something like:
# "NCSA Extended for redmonk.redmonk (3/10/03)"
* ^Subject:.*redmonk\.redmonk.*
# Pipe the entire message text through the saveAttached.py script.
# saveAttached.py takes one argument: the path of a directory to save
# the attachment in.
# saveAttached.py prints the raw message back out on stdout,
# so that this line can still append it to the appropriate mail file.
| ${BASEDIR}/saveAttached.py ${LOGDIR}/redmonk/ \
  >> ${MAILDIR}/redmonk.mail

</blockquote>

When developing procmail rules, I can't emphasize enough how much time and energy it saved me to follow the directions on Timo's procmail tips page for creating a testbench to test individual rules. Essentially you create a separate procmail.rc file with your rules in it, save out a message that matches the rule you're trying to test, and feed them both to procmail like so:<blockquote> procmail procmail.rc < test.msg </blockquote> The directions on Timo's page show you how to create a shell script that runs the test for you, writing out the procmail log and letting you view it so you can debug the rule.
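Here's a minimal sketch of what such a test script could look like. This is my own approximation, not Timo's script, and it assumes your test rules are in procmail.rc and that LOGFILE in that rcfile points at procmail.log:<blockquote>

#!/bin/sh
# start with a clean log so we only see this run
rm -f procmail.log
# run the saved test message through the test rcfile
procmail procmail.rc < test.msg
# show what procmail did with it
cat procmail.log

</blockquote>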

As I said before, I've got three of these rules, but someone with more experience than I could probably grab the site name from the email subject and use that to tell the python script where to save the log file and which mail file to append the message to (a rough sketch of that idea follows). I've got my log directories set up in /data/log/sites/<sitename>, and the mail files are named <sitename>.mail.
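For what it's worth, here's an untested sketch of how that single generalized rule might look, using procmail's \/ operator, which captures the rest of the matched text into the MATCH variable. It assumes every subject follows the "NCSA Extended for <sitename>.<sitename> (date)" pattern shown above, so the first dot-delimited token is the site name:<blockquote>

:0 fw
* ^From:.*logs@free-conversant.com.*
# capture everything between 'for ' and the first dot into MATCH
* ^Subject:.*NCSA Extended for \/[^. ]+
| ${BASEDIR}/saveAttached.py ${LOGDIR}/${MATCH}/ \
  >> ${MAILDIR}/${MATCH}.mail

</blockquote>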

Step Three: saveAttached.py

In studying procmail, it looked like it was going to be impossible, or at least beyond my limited understanding, to use procmail itself to pull out and decode the MIME-attached log files. So I decided to write a python filter script that takes the raw message on stdin, pulls out the attached file, decodes it, and saves the log file out to disk.

The script is here:

It's fairly straightforward. One bit I had to research was how to get the data from stdin, as I only had experience using sys.argv. It was stupidly easy:<blockquote> rawData = sys.stdin.read() </blockquote> The other was how to manipulate the raw email message. Python has a built-in module, email, which makes short work of parsing the data:<blockquote>

import sys
import email.Parser

# the single argument is the directory to save the attachment in
dirPath = sys.argv[1]
rawData = sys.stdin.read()

p = email.Parser.Parser()
msg = p.parsestr(rawData)

for part in msg.walk():
    # skip the multipart container itself; we only want the leaf parts
    if part.get_main_type() == "multipart":
        continue
    name = part.get_param("name")
    if name is None:
        name = "noName"
    if name.endswith('.log'):
        # grab the logfile, saving it both under its own name
        # and as access.log for analog to pick up
        f = open(dirPath + name, "wb")
        f2 = open(dirPath + 'access.log', "wb")
        f.write(part.get_payload(decode=1))
        f2.write(part.get_payload(decode=1))
        f.close()
        f2.close()

# echo the raw message back out on stdout so procmail
# can append it to the mail file
sys.stdout.write(rawData)

</blockquote>
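To try the script outside of procmail, you can feed it a saved message by hand; this assumes a saved test.msg and an existing target directory:<blockquote>

python saveAttached.py /data/log/sites/redmonk/ < test.msg

</blockquote>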

Step Four: cron

Now that we have the system in place for downloading and storing the log files, we just need to tell it when to run. For that we have cron, the canonical unix scheduler.

Here are the relevant portions of my crontab:<blockquote>

# minute hour dayOfMonth monthOfYear weekDay command
# fetchmail
0 1 1-31 * * /usr/bin/fetchmail
# analog
0 2 1-31 * * /usr/local/bin/analog -g/home/logs/analog/rm-analog.cfg

</blockquote> To edit your crontab, make sure your EDITOR environment variable is set first (this is in tcsh):<blockquote>

setenv EDITOR pico

</blockquote> Or, of course, vi or emacs, or whatever. Then:<blockquote>

crontab -e

</blockquote> will open your crontab in your favorite editor (as a tmp file). Once you've made your changes, save and exit, and cron will install your edited crontab.

Now, back to my crontab, er, our crontab.<blockquote>

# minute hour dayOfMonth monthOfYear weekDay command
# fetchmail
0 1 1-31 * * /usr/bin/fetchmail

</blockquote> This line tells cron to run fetchmail at minute 0 of hour 1 (1:00 am) of every day (1-31) of every month of every year blah blah blah…

I could also (in retrospect) have used * as the value for dayOfMonth. This is the "workhorse" line that triggers our pipeline: fetchmail hands the downloaded messages off to procmail (via the .forward file mentioned above), so after fetchmail runs, procmail automatically runs its filters, which call our saveAttached.py script to save out the attached files. At this point, the log files are in /data/log/sites/<sitename>. An hour later (just for safety's sake) cron runs the log analyzer, analog:<blockquote>

# analog
0 2 1-31 * * /usr/local/bin/analog -g/home/logs/analog/rm-analog.cfg

</blockquote> This is where stage one of the pipeline ends, and stage two begins. This line runs analog at 2 am, every day of every… You get the idea. It calls analog, passing in a custom config file, rm-analog.cfg.

Step Five: analog

rm-analog.cfg is just a normal analog.cfg file with various options turned on and off, and with LOGFILE pointed at the log files in the /data/log/sites/redmonk/ directory. Here are some interesting bits of my config file:<blockquote>

LOGFILE /data/log/sites/redmonk/redmonk.redmonk.*.*
OUTFILE /home/logs/public_html/reports/redmonk/index.html

</blockquote>

Because this is Linux, I can turn on reverse DNS lookups (it's broken on Mac OS X):<blockquote>

DNS WRITE

</blockquote>

And because it runs overnight, I'm not worried about how long it takes to run when there are a number of new hosts to look up.

One last thing in my rm-analog.cfg file that I particularly like, even though it's unrelated to the pipeline: being a weblogger, I like seeing activity on my RSS feed reflected in the results, so I've added…<blockquote>

PAGEINCLUDE *.rss
TYPEALIAS .rss  ".rss   [Rich Site Summary]"

</blockquote>

…to my configuration.

I know there are a few ways I could make this process more efficient. If you have comments or suggestions, you can email me or leave your notes here, and I'll update this page (as well as my own setup!).