Joe Kissell 24 May 2004

Getting to Know Apple Mail’s Spam Filter

Like most people who use Apple Mail, I had high hopes that its improved Junk Mail filter, a much-touted benefit of upgrading to Panther, would live up to Apple’s hype. After months of diligently training the filter, I was still less than satisfied with its results. Based on the hundreds of messages I’ve read on Apple’s discussion forums, my experience is not unique. Rather than live with the spam, though (or trade Mail for another application), I decided to look into the problem more deeply. Armed with an ever-increasing personal collection of many thousands of spam messages, I experimented with Mail’s Junk Mail settings, compared the filter with others, and tried to discover how it really works – and why it sometimes fails. I discovered that at least some of the problems I’d been having were due to a misunderstanding of the application’s design, which is not as self-explanatory as some of Apple’s applications.

I have some good news and some bad news. The good news is that Mail’s Junk Mail filter, when properly configured, can be a reliable tool for keeping spam out of your In box. The bad news is that even if the Junk Mail filter is working as well as it possibly can, you may still see more spam than you would like. Luckily, other applications and techniques can make up for the Junk Mail filter’s deficiencies. You’ll be able to make better decisions about how to use (or supplement) the Junk Mail filter when you know what happens behind the scenes.

I’ve distilled innumerable hours of research and experimentation into my latest Take Control ebook, "Take Control of Spam with Apple Mail," from which this article is excerpted.

<http://www.tidbits.com/takecontrol/spam-Apple-Mail.html>

How Mail’s Junk Mail Filter Works — Mail’s Junk Mail filter has two modes: Training and Automatic. (You can also disable it entirely.) Exactly what happens in these two modes can confuse the uninitiated. In particular, many people wonder whether the filter continues to learn if you switch to the Automatic mode. (It does indeed, though Apple’s documentation obscures this fact.) Here are the details.

When a message arrives, Mail runs a built-in rule – a rule that does not appear in the Rules list. In order to minimize false positives, this special junk mail rule runs after all other rules. But it doesn’t run if one of your other rules includes the "Stop evaluating rules" action, and it applies only if a given message has not already been moved or deleted by some other rule.

You can see the special junk rule by opening the Junk Mail preference pane and clicking the Advanced button at the bottom of the window. Depending on the options you choose, this rule checks whether the message’s sender is in your Previous Recipients list or your Address Book, and whether the To field of the message contains your full name – since spam often does not. (The Previous Recipients list is a useful adjunct to the Junk Mail filter, although as I explain in detail in the ebook, incorrect data can easily creep into it, rendering it counter-productive.) If the answer to any of these questions is yes, the rule stops and your message is left alone. If all three conditions are negative, though, it checks one final condition: "Message is junk mail." If this final condition is true, the filter takes the action you specify: in Training mode, it changes the color of the message subject to brown; in Automatic mode, it moves the message to your Junk mailbox. (This action is one of only two differences between Training mode and Automatic mode. The other is that in Automatic mode, Mail consolidates the Junk mailboxes for all of your accounts under a single Junk icon in the Mailbox list.)

The "Message is junk mail" condition sounds rather mysterious, but when it is true, it simply means that Mail’s latent semantic analysis filter (discussed just ahead) has assigned the message a value beyond its arbitrary threshold for spam. You cannot modify this threshold, but if you later mark the message as Not Junk, you decrease the probability that Mail will consider a similar message to be spam in the future.
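The order of checks in the built-in rule described above can be sketched in a few lines of Python. This is only an illustration of the logic as I have described it, not Mail's actual code: the class names, field names, the numeric spam_score, and the threshold value are all hypothetical stand-ins (Mail does not expose its score or threshold).

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: str
    to_field: str
    spam_score: float  # hypothetical stand-in for the value Mail's filter assigns


@dataclass
class JunkSettings:
    mode: str = "training"  # "training" or "automatic"
    exempt_previous_recipients: bool = True
    exempt_address_book: bool = True
    exempt_full_name: bool = True
    previous_recipients: set = field(default_factory=set)
    address_book: set = field(default_factory=set)
    full_name: str = ""
    threshold: float = 0.5  # assumption: Mail's real threshold is not visible


def apply_junk_rule(msg, settings):
    """Return 'leave alone', 'color brown', or 'move to Junk'."""
    s = settings
    # Any one exemption stops the rule and leaves the message alone.
    if s.exempt_previous_recipients and msg.sender in s.previous_recipients:
        return "leave alone"
    if s.exempt_address_book and msg.sender in s.address_book:
        return "leave alone"
    if s.exempt_full_name and s.full_name and s.full_name in msg.to_field:
        return "leave alone"
    # Final condition: "Message is junk mail."
    if msg.spam_score <= s.threshold:
        return "leave alone"
    # The action is the visible difference between the two modes.
    return "move to Junk" if s.mode == "automatic" else "color brown"
```

Note that the exemptions are tested before the junk condition, which is why a message from someone in your Address Book is never scored at all in this sketch.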

The Panther version of Mail added yet another criterion for spam checking: headers inserted by a spam filter running on your ISP’s mail server. Many ISPs use server-based spam filters such as SpamAssassin or Brightmail. If such a filter identifies a message as potentially spam, it tags it by adding to the message a special header, such as X-Spam-Flag. (These special headers are normally hidden, but you can display them by choosing View > Message > Long Headers [Command-Shift-H].)

If a message contains this header, your ISP’s spam filter – which may be more advanced than the one built into Mail – suspects the message to be spam. As long as the checkbox Trust Junk Mail Headers Set by Your Internet Service Provider is selected in Junk Mail preferences, Mail marks such messages as Junk Mail (unless they were exempted for some other reason, such as being sent by someone in your Address Book). Some server-side spam filters use other headers besides X-Spam-Flag to identify spam. The only way to tell Mail to look for a different header is to edit its preference file manually; I give full instructions in the ebook.
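To make the header check concrete, here is a minimal Python sketch that looks for a server-added spam header in a raw message. It uses Python's standard email module; the function names are hypothetical, and the test for a value beginning with "YES" reflects how SpamAssassin sets X-Spam-Flag, not a documented description of what Mail itself looks for.

```python
from email import message_from_string


def flagged_by_isp(raw_message, header="X-Spam-Flag"):
    """True if the named header is present and its value starts with YES."""
    msg = message_from_string(raw_message)
    value = msg.get(header, "")
    return value.strip().upper().startswith("YES")


def should_mark_junk(raw_message, trust_isp_headers=True, sender_is_exempt=False):
    """Honor the ISP's verdict only if trusted and the sender is not exempt."""
    if not trust_isp_headers or sender_is_exempt:
        return False
    return flagged_by_isp(raw_message)


tagged = "From: a@example.com\nX-Spam-Flag: YES\nSubject: hi\n\nbody\n"
clean = "From: b@example.com\nSubject: lunch\n\nbody\n"
```

With these samples, should_mark_junk(tagged) is true, while the same message from an exempt sender (sender_is_exempt=True) is left alone, mirroring the Address Book exception described above.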

About Statistical Filtering — Because of the constantly evolving nature of spam, tools that attempt to filter out messages based on a fixed list of keywords, patterns, or rules become less effective as time goes on. Statistical filters address this problem by making up their own rules (in a sense) as they process your mail. The most frequently used statistical spam-filtering method is Bayesian filtering (found in Eudora, SpamSieve, and SpamAssassin, among others).

Apple Mail uses a related technique known as Adaptive Latent Semantic Analysis (LSA). Both methods compute the probability of a given message being spam based on an analysis of the contents of existing spam and non-spam messages. And both methods become more accurate as they are exposed to new samples of good and bad messages. Although from a user’s perspective Bayesian and LSA filters are similar, they differ in some important ways.

Bayesian Filters — To oversimplify tremendously, think of a Bayesian filter as consisting primarily of two lists: "good" words and "bad" words. You build these lists dynamically as you use the filter. Every time you indicate that a message is spam, the filter adds all the words in that message to its Bad Words list; every time you indicate that a message is legitimate, all its words go onto the Good Words list. Of course, most words appear on both lists, so the filter determines the probability that each word is a spam indicator based on the proportion of times it appeared in bad versus good messages.

When a new message comes in, the filter calculates the average spam score of its words, and if that score exceeds a predetermined threshold, the message is deemed to be spam. Bayesian filters are highly dynamic, adapting themselves not only to the type of mail each individual receives (since one person’s spam is another’s ham) but also to the changing tactics of spammers. Although the system is not perfect, it means that if next month a new scam emerges for selling real estate on Mars, a Bayesian filter will learn to reject such messages after you manually mark a few examples as spam. Most Bayesian filters also take into account email headers and other message attributes in order to avoid being fooled by spam messages containing long (but often hidden) passages of ordinary prose.
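The two-list picture above translates almost directly into code. The following Python sketch implements only that simplified description (per-word proportions, averaged across the message); real Bayesian filters combine probabilities in more careful ways, and the class name, method names, and 0.5 threshold here are illustrative assumptions.

```python
import re
from collections import Counter


class WordListFilter:
    def __init__(self, threshold=0.5):
        self.bad = Counter()   # word -> number of spam messages containing it
        self.good = Counter()  # word -> number of legitimate messages containing it
        self.threshold = threshold

    @staticmethod
    def words(text):
        # Each distinct word counts once per message.
        return set(re.findall(r"[a-z']+", text.lower()))

    def train(self, text, is_spam):
        (self.bad if is_spam else self.good).update(self.words(text))

    def word_score(self, word):
        b, g = self.bad[word], self.good[word]
        if b + g == 0:
            return None  # never seen: no evidence either way
        return b / (b + g)  # proportion of appearances that were in spam

    def score(self, text):
        scores = [s for s in map(self.word_score, self.words(text)) if s is not None]
        return sum(scores) / len(scores) if scores else 0.5

    def is_spam(self, text):
        return self.score(text) > self.threshold


f = WordListFilter()
f.train("Buy cheap real estate on Mars today", True)
f.train("Mars real estate prices are rising buy now", True)
f.train("Lunch today at noon", False)
f.train("Are you free for lunch on Friday", False)
```

After those four training messages, f.is_spam("Cheap Mars real estate") is true and f.is_spam("Lunch on Friday") is false, which is the Mars real estate scenario in miniature.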

Latent Semantic Analysis in Mail — Where Bayesian filters are based on a relatively straightforward computation of word frequency, LSA filters go further by identifying spam-like words, phrases, and messages based on their similarity in meaning to text you’ve already identified as spam. Instead of assigning simple weights to each word individually, an LSA filter takes into account the overall context in which a word appears. For example, the word "enlargement," when it appears in a discussion about photography, would not normally be an indicator of spam – whereas the same word in the context of cosmetic surgery or low-cost prescription medicine would be a very good indicator of spam. (Again, this is an oversimplification. For more details on latent semantic analysis, see part 2 of Francois Joseph de Kermadec’s three-part series "The Fight Against Spam" at MacDevCenter.)

<http://www.macdevcenter.com/pub/a/mac/2004/05/18/spam_pt2.html>
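The "enlargement" example can be illustrated with a toy Python sketch that judges a message by the company its words keep rather than by each word alone. To be clear, this is not latent semantic analysis and is not how Mail works internally: it is a plain cosine-similarity comparison of whole-message word counts, standing in for the general idea of context. True LSA adds a dimensionality-reduction step so that related words can match even without literal overlap. All names and sample messages are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, spam_examples, ham_examples):
    """Label text by whichever known message it most resembles overall."""
    v = vectorize(text)
    spam_sim = max((cosine(v, vectorize(s)) for s in spam_examples), default=0.0)
    ham_sim = max((cosine(v, vectorize(h)) for h in ham_examples), default=0.0)
    return "spam" if spam_sim > ham_sim else "ham"


spam_examples = ["Low cost prescription medicine and enlargement pills shipped overnight"]
ham_examples = ["The photo lab offers enlargement and framing of your prints"]
```

Both training messages contain "enlargement," yet classify("Can the lab do an enlargement of this photo print", spam_examples, ham_examples) comes out as ham, while classify("Discount prescription pills for enlargement", spam_examples, ham_examples) comes out as spam, because the surrounding words differ.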

Like a Bayesian filter, an LSA filter keeps learning as you use it. This assumes, of course, that you diligently correct all its mistakes. In Mail, this means marking all spam messages the filter misses as Junk Mail, and marking all incorrectly identified legitimate messages as Not Junk Mail.

On paper, LSA seems to be a more sophisticated technique that is less likely to be foiled by clever spammers. In practice, however, Mail’s implementation of LSA leaves something to be desired. In my own tests, it learned more slowly than Bayesian filters. It also tends to err on the cautious side, with no way to adjust its threshold for what it considers spam. And because it looks for patterns of words rather than patterns of characters, lots of seemingly obvious spam messages get through undetected.

Mail stores its statistics of good and bad message characteristics in a single file: ~/Library/Mail/LSMMap2. (LSM, by the way, most likely stands for "latent semantic mapping," the name Apple uses for its latent semantic analysis technology. I could tell you were wondering about that.) If this file becomes damaged – which unfortunately can happen pretty easily – junk mail filtering stops working correctly.

You can’t repair the LSMMap2 file, but you can delete it manually or reset it by clicking a button on Mail’s Junk Mail preference pane. Doing so solves the most serious junk mail problems, but also eliminates all the training you have given your junk mail filter, so its accuracy will probably be poor until it has processed enough new legitimate and spam messages to rebuild its database.

What can damage Mail’s junk statistics file in the first place? In addition to the usual things (such as crashes while the file is open, directory errors, or shutting down improperly), LSMMap2 can occasionally suffer damage in the course of ordinary activities such as filing a message or marking it as Junk Mail, if the message itself contains certain kinds of errors. Although you can’t prevent damage from occurring, you can take steps to make recovery easier (see the ebook for more information).

Marking a Message as Junk Mail/Not Junk Mail — Because statistical filters increase their accuracy gradually as you use them (and because spammers constantly learn new tricks to thwart them), spam sometimes gets past the Junk Mail filter and appears to be ordinary mail. (This is known in the lingo as a "false negative.") You could just delete such messages, but if you do, you actually increase the probability that similar messages will sneak through in the future. Instead, you must select each unmarked spam and manually mark it as Junk Mail (choose Message > Mark > As Junk Mail [Command-Shift-J]).

Note that merely moving a message to your Junk mailbox is not enough; only if the Junk Mail flag is set, as shown by the "paper bag" icon in the message list, does Mail consider the message junk mail. When you mark a message as Junk Mail, you modify the filter’s statistical lists and remove the sender’s address from the Previous Recipients list if it was there.

Conversely, Mail may incorrectly mark a legitimate message as Junk (a "false positive"). After all, statistical filters only judge probabilities. If someone sent you, say, an article discussing the sorts of words that spammers often use, that might tip the scales. (This problem of overzealous spam filters has caused no end of problems for TidBITS and the TidBITS Talk mailing list.) So you must always mark such messages as Not Junk (choose Message > Mark > As Not Junk Mail) to tell Mail they are legitimate.

Surprisingly, though, marking a message as Not Junk also adds the message’s sender to your Previous Recipients list! This guarantees that no message from that sender will be marked as spam in the future (assuming you use the default settings), which may or may not be what you want. This is just one of many surprises I encountered with the design of the Previous Recipients list.
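The side effects of the two marking commands are easy to lose track of, so here is a small Python model of them. It captures only the behavior described above; the class, its attributes, and the idea of storing sample texts in lists are hypothetical simplifications, not a picture of Mail's internals.

```python
class JunkState:
    def __init__(self):
        self.previous_recipients = set()
        self.spam_samples = []  # stand-in for the filter's spam statistics
        self.ham_samples = []   # stand-in for the filter's non-spam statistics

    def mark_junk(self, sender, text):
        # Updates the statistics and removes the sender from Previous Recipients.
        self.spam_samples.append(text)
        self.previous_recipients.discard(sender)

    def mark_not_junk(self, sender, text):
        # Updates the statistics and adds the sender to Previous Recipients.
        self.ham_samples.append(text)
        self.previous_recipients.add(sender)


state = JunkState()
state.previous_recipients.add("a@example.com")
state.mark_junk("a@example.com", "buy now")
state.mark_not_junk("b@example.com", "meeting notes")
```

After these calls, a@example.com is no longer in the Previous Recipients set and b@example.com has been added to it, so under the default settings future mail from b@example.com would bypass the junk check entirely.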

Take Control of Spam with Apple Mail — I’ve tried to explain the basics of how Mail’s Junk Mail filter works here, but in my full 59-page "Take Control of Spam with Apple Mail" ebook, I go much further, providing detailed, practical steps Mail users can take to eliminate spam, prevent false positives, and solve problems with the Junk Mail filter. The ebook includes a great deal of additional background information, plus extensive discussions of add-on filters and other techniques that go beyond Mail’s built-in anti-spam capabilities. If you’ve been suffering from spam in Mail, I’m confident my advice will reduce your frustration level and help you avoid wasting so much time dealing with junk mail. "Take Control of Spam with Apple Mail" costs $5, and as with all Take Control ebooks, purchasers are entitled to receive all minor updates for free.

<http://www.tidbits.com/takecontrol/spam-Apple-Mail.html>

[Joe Kissell is a San Francisco-based writer, consultant, and Mac developer who kicked off the Take Control series with the best-selling "Take Control of Upgrading to Panther." His Interesting Thing of the Day Web site returns with daily articles beginning 01-Jun-04.]

<http://www.tidbits.com/takecontrol/panther/upgrading.html>

<http://itotd.com/>
