February 10, 2003

Goodbye Spam

After less than a day of training, having been fed 99 spams and 80 good emails, Spambayes is getting it right pretty well every time. I expect that when I get new dodgy-looking emails, such as newsletters from commercial websites that I'm legitimately signed up for, it'll put them into its "unsure" category the first time, but apart from that it looks as though it's doing the job. It's certainly already more effective than the Mac Mail app was after two months of training, or than SpamOracle was with a corpus of thousands of emails. I recommend this one.
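
For anyone wondering how an "unsure" category can work, here is a minimal Python sketch of a three-way split on a spam score between 0 and 1. The function name and the two cutoff values are illustrative assumptions of mine, not something I've checked against the Spambayes source.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to 'ham', 'unsure' or 'spam'.

    The cutoffs are assumed values for illustration only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    # Anything in between is left for a human to decide.
    return "unsure"
```

A newsletter the filter hasn't seen before would plausibly score somewhere in the middle and land in "unsure" until it has been classified by hand once or twice.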

Posted by mikeplokta at 03:13 PM | Comments (0) | TrackBack

February 09, 2003

Spambashing

I've been changing my spam filtering techniques again.

Once upon a time, I set up SpamAssassin on my Unix account at plokta.com, with some procmail configuration to filter messages through it. It mostly worked, with some false negatives (spam marked as clean) and a few false positives (clean mail marked as spam, which is more serious), so I couldn't trust it to delete spam unseen, only to file it in a different mailbox for manual inspection. SpamAssassin also took more memory than my quota allowed, so if the reaper (running every thirty seconds) happened to run while it was filtering a message, it was killed and the message wasn't filtered -- but procmail let me run it again if it failed the first time, which fixed that.

Then the influential essay A Plan for Spam was published, outlining a plan for building a learning filter using a Bayesian algorithm (which apparently later turned out not to actually be Bayesian). It promised better detection rates than SpamAssassin. So I installed SpamOracle in the shell account, and set to training it. It seemed to work rather better than SpamAssassin, but it was difficult to continue to train it because I was reading my email on a Windows machine on a different continent from the shell account where the filtering was happening, and there was no easy way to get the categorised emails back to the SpamOracle database.
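
As a rough illustration of the kind of learning filter the essay describes, here is a simplified Python sketch: count tokens in known spam and known good mail, give each token a spam probability, and combine the most extreme ones into a single score. The function names are my own, the tokenizer is deliberately naive, and the constants (doubling good counts, clamping to 0.01-0.99, 0.4 for unseen tokens, 15 tokens) are as I recall them from the essay. This is not SpamOracle's or Spambayes's actual code.

```python
from collections import Counter
from math import prod


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def train(messages):
    # Count how often each token occurs across a corpus of messages.
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts


def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    g = 2 * ham_counts[token]  # good counts doubled to bias against false positives
    b = spam_counts[token]
    if g + b == 0:
        return 0.4  # never-seen tokens are treated as mildly innocent
    ham_ratio = min(1.0, g / n_ham)
    spam_ratio = min(1.0, b / n_spam)
    p = spam_ratio / (ham_ratio + spam_ratio)
    return max(0.01, min(0.99, p))  # clamp so no token is absolutely certain


def spam_score(text, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    probs = [
        token_spam_prob(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokenize(text))
    ]
    # Keep only the tokens whose probability is farthest from neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spam_product = prod(top)
    ham_product = prod(1 - p for p in top)
    return spam_product / (spam_product + ham_product)
```

With `spam_counts = train(spam_messages)` and `ham_counts = train(good_messages)`, a call such as `spam_score(new_message, spam_counts, ham_counts, len(spam_messages), len(good_messages))` gives a number near 1 for likely spam and near 0 for likely good mail. The need to keep feeding it correctly classified messages is exactly what makes ongoing training matter.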

With both SpamAssassin and SpamOracle, my email accounts at Demon (lots and lots of spam and a very small number of non-spams) and BT (pretty well no spam, since that address has never been broadcast) weren't getting filtered at all.

Next, I got my Powerbook and started reading email on a Mac instead of a Windows machine. The Mac's mail application has its own learning filter built in, which makes it very easy to train as there's a button on the toolbar to let you change a mail's status between spam and not-spam. Unfortunately, after two months of training, I can report that it doesn't work very well, still giving a lot of false negatives and a few false positives. It would be nice if Apple provided some hooks to allow third parties to replace it with their own algorithms while keeping the easy training -- but they don't.

This weekend, Tibs mentioned that Spambayes was working very well, so I checked it out. The developers' claims for their algorithm are extremely impressive. It can operate in a mode where you run a POP3 proxy locally, which filters the messages as they come in and keeps a record of them. There's a local web page you can use to classify them, and it's very simple to use, if not quite as well integrated as the Mac's own mail app.

So I'm running three local POP3 proxies, and am busy training. I don't have a corpus of pre-classified material in a format that I can easily feed to it, so it may take a week or two to become fully effective.

Posted by Mike Scott at 10:05 PM | Comments (2) | TrackBack

February 08, 2003

Why Microsoft Is Doomed

There comes a time in every big corporation's life when it gets too big and too arrogant. Generally speaking, it's all downhill from there. Microsoft's licensing has got too difficult for the average user. Check John Scalzi's tale of FrontPage woe.

Posted by Mike Scott at 09:39 AM | Comments (0) | TrackBack