Tools We Use: SpamSieve
Having to sort through the increasingly repulsive spam that’s rushing into our electronic mailboxes is becoming more unpleasant than ever. You can reduce the flow, though, with one of three basic approaches to filtering spam out of your email stream: Boolean filters, points-based filters, and so-called "Bayesian" statistical filters. Put simply, a Boolean filter looks for a string of text, and if it’s found, considers the message spam. Points-based filters refine that approach, assigning (or removing) points for each criterion matched by a given message; they decide if a message is spam or not by how many points that message accumulates. Statistical (or Bayesian) filters, which were most popularly described in relation to spam in August of 2002 (and refined last month) by Paul Graham, estimate the probability that any given word or phrase (implementations vary) appears in spam, then combine the probabilities for the words in a message to decide if the message is spam.
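To make the difference between the first two approaches concrete, here is a minimal Python sketch. The phrases, point values, and threshold are all made up for illustration; they are not taken from any real spam filter.

```python
def boolean_filter(message, blocked_phrases):
    """Call the message spam if any blocked phrase appears in it."""
    text = message.lower()
    return any(phrase in text for phrase in blocked_phrases)


def points_filter(message, rules, threshold):
    """Add up the points of every matching rule; spam if the total reaches the threshold."""
    text = message.lower()
    score = sum(points for phrase, points in rules.items() if phrase in text)
    return score >= threshold


# Hypothetical rules: positive points suggest spam, negative points suggest good mail.
rules = {"free": 2, "act now": 3, "unsubscribe": 1, "tidbits": -4}

spam = "Act now for your FREE offer! Click to unsubscribe."
good = "The free TidBITS issue is out; unsubscribe anytime."

boolean_says_spam = boolean_filter(spam, ["act now"])
points_says_spam = points_filter(spam, rules, threshold=4)
points_says_good_is_spam = points_filter(good, rules, threshold=4)

print(boolean_says_spam)         # True
print(points_says_spam)          # True  (3 + 2 + 1 = 6 points)
print(points_says_good_is_spam)  # False (2 + 1 - 4 = -1 points)
```

Note that the good message still matches two "spammy" rules; only the negative points for a trusted word keep it under the threshold, which is the refinement a points-based filter offers over a plain Boolean match.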
Bayesian Filters — The beauty of Bayesian filtering is that it works on the contents of your email, which is probably rather different from mine and anyone else’s. That’s because you must train a Bayesian filter with a sample of both spam and legitimate messages, and because the Bayesian filter continually examines new messages, it can adapt to the kind of mail you receive, both good and bad.
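The training step can be sketched in a few lines of Python. This is a simplified illustration, not SpamSieve’s or Apple Mail’s actual algorithm: the class name TinyBayes, the add-one smoothing, and the four training messages are my own assumptions, and only the way the per-word probabilities are combined follows the general form Paul Graham described.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class TinyBayes:
    """A toy statistical filter (hypothetical; for illustration only)."""

    def __init__(self):
        self.spam_words = Counter()  # number of spam messages containing each word
        self.good_words = Counter()  # number of good messages containing each word
        self.spam_count = 0
        self.good_count = 0

    def train(self, text, is_spam):
        words = set(tokenize(text))  # count each word once per message
        if is_spam:
            self.spam_words.update(words)
            self.spam_count += 1
        else:
            self.good_words.update(words)
            self.good_count += 1

    def word_probability(self, word):
        # Add-one smoothing (an assumption) keeps probabilities away from 0 and 1.
        s = (self.spam_words[word] + 1) / (self.spam_count + 2)
        g = (self.good_words[word] + 1) / (self.good_count + 2)
        return s / (s + g)

    def spam_probability(self, text):
        # Combine the per-word probabilities into one score for the message.
        spam_product = good_product = 1.0
        for word in set(tokenize(text)):
            p = self.word_probability(word)
            spam_product *= p
            good_product *= 1 - p
        return spam_product / (spam_product + good_product)


f = TinyBayes()
f.train("cheap pills buy now", is_spam=True)
f.train("buy cheap watches now", is_spam=True)
f.train("meeting notes for the translation project", is_spam=False)
f.train("lunch on friday to discuss the project", is_spam=False)

print(round(f.spam_probability("buy cheap pills"), 2))
print(round(f.spam_probability("project meeting on friday"), 2))
```

Because the word counts come entirely from the messages you feed in, two people training the same code on different mail end up with different filters, and calling train() on each new message is what lets the filter keep adapting.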
Bayesian filters aren’t perfect. Legitimate mail, such as promotional mailings from companies you’ve bought from in the past, can look a lot like spam at first, and it’s also hard to accurately identify spam messages that contain very little text. Spam may get through when it’s sufficiently related to your profession; for instance, I get spam advertising translation services because of the TidBITS translations. It’s also possible for spammers to pollute your corpus of good and bad words by including lots of good words in a spam message, thus reducing the accuracy of the filter over time. On the positive side, it’s possible that improved algorithms can address these problems.
There are two main implementations of statistical Bayesian filtering for Mac OS X: Apple’s Mail and Michael Tsai’s SpamSieve, the latter of which I’ve been testing with Eudora 5.2 for some months now.
SpamSieve — Along with its implementation of a Bayesian filter, I especially appreciate the fact that SpamSieve works inside Eudora, and also inside a number of other email programs, including Entourage, Mailsmith, and PowerMail. Although it’s not available for Mac OS 9, it does also work with Emailer running in Classic mode. I’m not interested in using Mail, and other spam utilities (such as Matterform Media’s points-based Spamfire utility, which also has many proponents) work outside of your email program, forcing you to scan for false positives in a separate interface. SpamSieve works with any number of accounts and filters mail from any source your email program supports. Once it has identified messages as spam, it can mark or move them, and in some of the email programs, your filters can continue to work on the marked messages.
SpamSieve accomplishes this by using the AppleScript capabilities of these email programs to pass information to and from SpamSieve itself. The integration is relatively seamless, except in Eudora, the current version of which has limitations that restrict SpamSieve to filtering mail that ends up in the In box (not in any other folder). Since the communication happens via AppleScript, you can edit the included scripts to customize them further. Even while I’m waiting for the next version of Eudora to bring SpamSieve’s capabilities to messages I filter out of my In box, I’ve found it extremely worthwhile.
I initially trained SpamSieve with about 600 spam messages from my disgustingly large collection of spam and 600 good messages from my In box (yes, it has been that full, though I’ve beaten it back down into the 300s). If you don’t have spam around, you could either train SpamSieve as you receive it (probably with lower accuracy at first) or wait briefly until you’ve collected a representative sample. I’ve also told SpamSieve to learn from new messages. Since the middle of January, SpamSieve has filtered over 2,600 messages, about 55 percent of which were spam. In that time, it has reported 88 percent accuracy, with a false negative rate of 11 percent and a false positive rate of 1 percent (an alternative way I’ve used to verify SpamSieve’s accuracy came up with lower numbers – 80 percent accuracy, with 19 percent false negatives – I’m working with Michael Tsai to figure out the discrepancy). Most of the false positives were solicited commercial email or messages forwarded to me and a large number of other people, both of which are likely to run afoul of SpamSieve’s filtering until it has been trained to recognize similar messages. Because SpamSieve filters on the contents of your particular email stream, your mileage may vary, as it has for other TidBITS staff members, who have seen somewhat less reliable results.
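Since percentages like these depend on what you divide by, here is a small Python sketch of the arithmetic. The four counts are hypothetical round numbers I chose to match the rounded figures above (2,600 messages, 55 percent spam); they are not SpamSieve’s actual tallies, and I’m not claiming this is how SpamSieve computes its statistics.

```python
def filter_stats(caught_spam, missed_spam, passed_good, flagged_good):
    """Summarize a filter's results from four message counts."""
    total = caught_spam + missed_spam + passed_good + flagged_good
    return {
        # Shares of ALL messages; these three add up to 100 percent.
        "accuracy": (caught_spam + passed_good) / total,
        "false_negative_share": missed_spam / total,
        "false_positive_share": flagged_good / total,
        # Rates within each kind of mail, which can look quite different.
        "missed_spam_rate": missed_spam / (caught_spam + missed_spam),
        "flagged_good_rate": flagged_good / (passed_good + flagged_good),
    }


# Hypothetical counts: 1,430 spam messages and 1,170 good ones.
stats = filter_stats(caught_spam=1144, missed_spam=286,
                     passed_good=1144, flagged_good=26)

for name, value in stats.items():
    print(f"{name}: {value:.1%}")
```

With these counts, the missed spam is 11 percent of all mail but 20 percent of the spam, so it’s worth checking which denominator is in play when comparing numbers from different tools.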
New features in SpamSieve 1.3 include increased resilience to the ways spammers are now obfuscating common words, the capability to use email addresses in Apple’s Address Book as a whitelist (so mail from people whose addresses are stored in the Address Book is never considered spam), editing of SpamSieve’s corpus of words, type-to-select in the Corpus window, and the capability to see statistics from after any given date.
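The whitelist idea is simple enough to sketch in Python. This is only an illustration of the concept, not SpamSieve’s code: the function name, the 0.9 threshold, and the example addresses are all assumptions of mine.

```python
def classify(sender, spam_probability, address_book, threshold=0.9):
    """Return "good" or "spam"; senders in the address book are never spam."""
    if sender.strip().lower() in address_book:
        return "good"  # whitelist wins, whatever the statistical score says
    return "spam" if spam_probability >= threshold else "good"


# Hypothetical address book, stored in lowercase for case-insensitive matching.
address_book = {"friend@example.com", "editor@example.org"}

known = classify("Friend@Example.com", 0.99, address_book)
unknown = classify("stranger@example.net", 0.99, address_book)
low_score = classify("stranger@example.net", 0.20, address_book)

print(known)      # good
print(unknown)    # spam
print(low_score)  # good
```

Checking the whitelist before the score is what guarantees that a friend’s message full of spammy words can’t become a false positive.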
If you’ve longed for the Bayesian filtering in Apple’s Mail, but weren’t willing to give up your preferred email program for that one capability, I’d strongly encourage you to take a look at SpamSieve. Michael Tsai is developing it actively, and has been extremely responsive to comments and suggestions.
SpamSieve 1.3 is $20 shareware (upgrades from previous versions are free) and is a 1.5 MB download.
PayBITS: Did Adam’s article turning you on to SpamSieve
seriously reduce your spam volume? Say thanks via PayBITS!
Read more about PayBITS: <http://www.tidbits.com/paybits/>