February 09, 2003

Spambashing

I've been changing my spam filtering techniques again.

Once upon a time, I set up SpamAssassin on my Unix account at plokta.com, with some procmail configuration to pass incoming messages through it. It mostly worked, with some false negatives (spam marked as clean) and a few false positives (clean mail marked as spam, which is more serious). That meant I couldn't trust it to delete spam unseen, only to file it in a different mailbox for manual inspection. It also took more memory than my quota allowed, so if the reaper (running every thirty seconds) happened to run while SpamAssassin was filtering a message, the process was killed and the message wasn't filtered -- but procmail let me run the filter again if it failed the first time, which fixed that.
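The retry trick is simple enough to sketch outside procmail. This is only an illustration in Python, not the actual procmail recipe: `run_filter` is a hypothetical stand-in for whatever invokes the filter, and it is assumed to raise `RuntimeError` when the filter process gets killed.

```python
def filter_with_retry(run_filter, message, attempts=2):
    """Try the filter up to `attempts` times.

    Returns (filtered_message, True) on success, or the original
    message and False if every attempt failed, so that mail is
    delivered unfiltered rather than lost.
    """
    for _ in range(attempts):
        try:
            return run_filter(message), True
        except RuntimeError:
            # Assumed failure mode: the filter was killed mid-run.
            continue
    return message, False
```

With two attempts, a message only goes unfiltered if the filter is killed twice in a row.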

Then the influential essay "A Plan for Spam" was published, outlining a plan for building a learning filter using a Bayesian algorithm (which apparently later turned out not to actually be Bayesian). It promised better detection rates than SpamAssassin, so I installed SpamOracle in the shell account and set about training it. It seemed to work rather better than SpamAssassin, but it was difficult to keep training it: I was reading my email on a Windows machine on a different continent from the shell account where the filtering was happening, and there was no easy way to get the categorised emails back to the SpamOracle database.
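For the curious, the scoring scheme described in that essay can be sketched roughly as follows. This is a simplified illustration from my reading of the essay, not the code SpamOracle or any other filter actually runs; the tokenizer in particular is an assumption (it lowercases everything, for instance), and all the function names are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplified tokenizer (an assumption): letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(messages):
    """Count token occurrences across a corpus of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


def token_probability(token, good, bad, ngood, nbad):
    """Estimate how spammy one token is, per the essay's rules."""
    g = 2 * good[token]  # good counts are doubled to bias against false positives
    b = bad[token]
    if g + b < 5:
        return 0.4  # too rarely seen to judge: treat as mildly innocent
    good_ratio = min(1.0, g / ngood)
    bad_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def spam_score(text, good, bad, ngood, nbad, top_n=15):
    """Combine the most 'interesting' token probabilities into one score."""
    probs = [token_probability(t, good, bad, ngood, nbad)
             for t in set(tokenize(text))]
    # The most interesting tokens are those furthest from a neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    ham_product = 1.0
    for p in probs[:top_n]:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)
```

Here `good` and `bad` are the token counts from `train()` for the clean and spam corpora, and `ngood` and `nbad` are the number of messages in each. The essay treats a message as spam when the combined score is above 0.9.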

With both SpamAssassin and SpamOracle, my email accounts at Demon (lots and lots of spam and a very small number of non-spam messages) and BT (pretty well no spam, since that address has never been broadcast) weren't getting filtered at all.

Next, I got my PowerBook and started reading email on a Mac instead of a Windows machine. The Mac's mail application has its own learning filter built in, which is very easy to train, as there's a button on the toolbar to let you change a mail's status between spam and not-spam. Unfortunately, after two months of training, I can report that it doesn't work very well: it still gives a lot of false negatives and a few false positives. It would be nice if Apple provided some hooks to allow third parties to replace it with their own algorithms while keeping the easy training -- but they don't.

This weekend, Tibs mentioned that Spambayes was working very well, so I checked it out. Their claims for their algorithm are extremely impressive. It can operate in a mode where you run a POP3 proxy locally: the proxy filters the messages as they come in and keeps a record of them, and there's a local web page where you can classify them afterwards. It's very simple to use, if not quite as well integrated as the Mac's own mail app.

So I'm running three local POP3 proxies, and am busy training. I don't have a corpus of pre-classified material in a format that I can easily feed to it, so it may take a week or two to become fully effective.

Posted by Mike Scott at February 9, 2003 10:05 PM
Comments

Your experiences with the spam filtering built into the Mac OS X mail client are interesting. I've been using it since November and I'm quite happy with it. I'm still getting the occasional false positive, but I'm only getting 2 or 3 false negatives a month. I left it in training mode for about 2 months, which may explain it.

Posted by: David Stewart on February 10, 2003 12:15 AM

Oh, neat - not only was my comment useful, but it also appears to have been accurate!

Posted by: Tibs on February 12, 2003 01:09 PM