As owner of blah.net, I get a lot of spam. I don't really need the catchall account that collects email from all of the non-existent addresses @blah.net, but I hate the idea that something awesome might get sent there and I'd miss out on it. This is a summary of how I spam filter all of that mail.
I'm setting up a new system with 64-bit Ubuntu 9.10. In addition to all of the incoming email, I also have a large archive of mail that, for one reason or another, hasn't been filtered yet. I use that archive whenever I'm setting up a new installation and want material for training the Bayesian filter. Evolution has the best support for different input formats. Most of my old email archives are in either mbox or maildir format, both of which Evolution supports.
Evolution supports two options for spam filtering: Bogofilter and SpamAssassin. Both provide Bayesian filtering, while SpamAssassin also supports a number of remote services, DNS blacklists, content filtering and many other options through a series of plugins. Evolution is part of the base Ubuntu Desktop install, but I had to install SpamAssassin separately.
sudo apt-get install spamassassin
Next, I switched Evolution from using Bogofilter to using SpamAssassin. I also enabled the Remote Tests option so that SpamAssassin would catch more spam.

Unfortunately, the large number of DNS-based checks and my home network were not a good combination. It was taking 10-12 seconds to filter most messages, which is definitely not the kind of performance I was hoping for. A Google search for "ubuntu dns caching" pointed me to a post about using dnsmasq for DNS caching. After following those instructions, some messages were filtered in about half a second, while the rest still took 10-12 seconds. I added the following lines to /etc/dnsmasq.conf to send queries I knew were filtering-related to a publicly accessible DNS server. I did this on a per-domain basis rather than forwarding everything, because this is my laptop, which is also my primary workstation. I still want internal DNS entries when I'm at work and at home.
server=/spamhaus.org/4.2.2.1
server=/bondedsender.org/4.2.2.1
server=/njabl.org/4.2.2.1
server=/spamcop.net/4.2.2.1
server=/sorbs.net/4.2.2.1
server=/surbl.org/4.2.2.1
server=/uribl.com/4.2.2.1
server=/open-whois.org/4.2.2.1
server=/habeas.com/4.2.2.1
server=/support-intelligence.net/4.2.2.1
server=/dnswl.org/4.2.2.1
server=/isipp.com/4.2.2.1
This bypassed the DNS servers running on my OpenWrt router and the AT&T U-verse router it's connected to, which cut the filtering time in half for the messages that weren't hitting the dnsmasq cache. Within a few minutes, roughly half to three quarters of the messages were being filtered in about half a second, with the rest taking 5-6 seconds. This was good enough for me to move on and look at other ways to optimize the spam filtering process.
Next, I fine-tuned some of the options in /etc/spamassassin/local.cf:
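Since all twelve lines point at the same upstream server, they could just as well be generated from a list of domains. This is only a minimal sketch: the function name dnsmasq_forward_lines and the variable names are made up for illustration, and the domain list is simply the one above.

```shell
#!/usr/bin/env bash
# Upstream resolver and the blacklist/whitelist zones from the list above.
UPSTREAM="4.2.2.1"
DNSBL_DOMAINS="spamhaus.org bondedsender.org njabl.org spamcop.net sorbs.net surbl.org uribl.com open-whois.org habeas.com support-intelligence.net dnswl.org isipp.com"

# Hypothetical helper: print one "server=/domain/upstream" line per domain.
dnsmasq_forward_lines() {
    local upstream="$1"
    shift
    local domain
    for domain in "$@"; do
        printf 'server=/%s/%s\n' "$domain" "$upstream"
    done
}

# Print the lines for review; nothing is written to /etc/dnsmasq.conf here.
# DNSBL_DOMAINS is left unquoted on purpose so it splits into one word per domain.
dnsmasq_forward_lines "$UPSTREAM" $DNSBL_DOMAINS
```

The output still has to be added to /etc/dnsmasq.conf by hand (or piped through sudo tee -a), and dnsmasq needs a restart before it picks up the new lines.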
use_bayes 1
use_bayes_rules 1
bayes_auto_learn 1
bayes_learn_during_report 1
bayes_use_hapaxes 1
bayes_learn_to_journal 1
lock_method flock
skip_rbl_checks 0
Most of these are just restatements of the default settings, but I want to make sure they stay put if the defaults change. As far as I can tell, the non-default options are bayes_learn_to_journal and lock_method flock (SpamAssassin normally uses an NFS-safe locking method on UNIX). Journaling should provide somewhat better performance during periods of heavy spam reporting and learning.
Now that basic filtering is set up, I want to start bringing in new email. First we'll need an MTA, in this case postfix. I'm also going to want procmail for doing some additional filtering and sorting of messages.
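After editing local.cf it's worth checking that SpamAssassin still parses its configuration. The mail archive could also be fed to the Bayes database from the command line instead of through Evolution. What follows is a rough sketch of that alternative, not what I actually did: it assumes the spamassassin and sa-learn tools from the spamassassin package are on the PATH, and the archive paths and the helper name check_and_train are placeholders.

```shell
#!/usr/bin/env bash
# Placeholder archive locations (assumptions); override them via the environment.
SPAM_MBOX="${SPAM_MBOX:-$HOME/archive/spam.mbox}"
HAM_MAILDIR="${HAM_MAILDIR:-$HOME/archive/ham/cur}"

# Hypothetical helper: lint the config, then train Bayes from the archives.
check_and_train() {
    if ! command -v spamassassin >/dev/null 2>&1; then
        echo "spamassassin not installed; skipping" >&2
        return 0
    fi
    # --lint reports syntax errors in local.cf and the rule files.
    spamassassin --lint || return 1
    # Train from an mbox of known spam and a directory of known ham, if present.
    [ -f "$SPAM_MBOX" ] && sa-learn --spam --mbox "$SPAM_MBOX"
    [ -d "$HAM_MAILDIR" ] && sa-learn --ham "$HAM_MAILDIR"
    # With bayes_learn_to_journal set, flush the journal into the database.
    sa-learn --sync
    # Show how many spam and ham messages the database has seen so far.
    sa-learn --dump magic
}

check_and_train
```

This has to run as the same user that runs Evolution, so that both work against the same per-user Bayes database.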
sudo apt-get install postfix procmail
In my case, this system is my laptop and my primary workstation. I only need local mail delivery and selected that option when configuring postfix. I want maildir delivery, so:
sudo postconf -e 'home_mailbox = Maildir/'
sudo postconf -e 'mailbox_command = procmail -a "$EXTENSION"'
Before we start sending any mail through the system, let's get a basic ~/.procmailrc file set up:
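To confirm that both settings took effect, postconf can print them back when it is given parameter names without -e. This is a small sketch that assumes Postfix is installed; the helper name show_postfix_settings is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the current value of each named Postfix parameter.
show_postfix_settings() {
    if ! command -v postconf >/dev/null 2>&1; then
        echo "postconf not found; is postfix installed?" >&2
        return 0
    fi
    postconf "$@"
}

# Should list home_mailbox and mailbox_command with the values set above.
show_postfix_settings home_mailbox mailbox_command
```

If Postfix was already running when the values were changed, sudo postfix reload makes it re-read its configuration.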
VERBOSE=on
DEFAULT=${HOME}/Maildir/
MAILDIR=${HOME}/Maildir/
LOGFILE=${HOME}/procmail.log
COMSAT=no
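A quick way to check this file before any real mail arrives is to pipe a hand-made message through procmail and look for it in the maildir. This sketch assumes procmail is installed and ~/.procmailrc looks like the one above; the helper name make_test_message and the test addresses are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit a minimal test message (headers, blank line, body).
make_test_message() {
    printf 'From: tester@example.com\n'
    printf 'To: %s\n' "${1:-$USER}"
    printf 'Subject: procmail delivery test\n'
    printf '\n'
    printf 'If this lands in Maildir/new, delivery works.\n'
}

if command -v procmail >/dev/null 2>&1 && [ -f "$HOME/.procmailrc" ]; then
    # procmail reads the message on stdin and applies ~/.procmailrc.
    make_test_message | procmail
    # New deliveries show up as files under Maildir/new.
    ls -l "$HOME/Maildir/new" | tail -n 3
    # With VERBOSE=on the log is detailed enough to trace the delivery.
    tail -n 20 "$HOME/procmail.log"
else
    echo "procmail or ~/.procmailrc missing; nothing delivered" >&2
fi
```

Once delivery works, VERBOSE=on can be switched off to keep procmail.log from growing quickly.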
Currently, email for blah.net goes through Google Apps first, which filters out the bulk of the spam. To bring it into the system from there, I'll use fetchmail.
sudo apt-get install fetchmail
And we'll need a basic ~/.fetchmailrc:
poll imap.gmail.com protocol IMAP user "spam@blah.net" there with password "password" fetchall ssl expunge 1
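Because this file holds a password in plain text, fetchmail will refuse to use it if its permissions are too open. This is a sketch of locking the file down and doing a dry run, assuming GNU stat as shipped with Ubuntu; the helper name lock_down is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restrict a file to owner read/write and print the new mode.
lock_down() {
    chmod 600 "$1" && stat -c '%a' "$1"
}

RC="$HOME/.fetchmailrc"
if [ -f "$RC" ]; then
    lock_down "$RC"
    if command -v fetchmail >/dev/null 2>&1; then
        # --check only asks the server whether mail is waiting;
        # nothing is fetched or deleted.
        fetchmail --check --verbose
    fi
else
    echo "$RC not found; create it first" >&2
fi
```

Without an mda option in the file, fetchmail hands messages to the local SMTP server, so they pass through postfix and procmail as configured earlier.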
As is, this won't actually filter any of the new incoming mail. That will take some additional procmail rules which I'll cover during the next update.