Series: Filtering Gone Bad
Is your email being censored in the effort to fight spam?
Article 1 of 2 in series
by Geoff Duncan
One of the things I handle behind the scenes for TidBITS is bounce management: the tedium of figuring out which addresses should be removed from our various mailing lists due to delivery errors. We consider maintaining "clean" mailing lists part of running an email-based publication responsibly: just as we don't want to send TidBITS to people who don't want it, we don't want to waste bandwidth, effort, or time (for us or anyone else) trying to deliver TidBITS to addresses which aren't accepting it. I can't claim there are no undeliverable addresses on our mailing lists - that's an impossible goal - but we try to run a tight ship. And it's necessary work: Internet access providers regularly shut down, are acquired, and change their names; and - if our experience is any indicator - people simply abandon (or are forced to abandon) email addresses far more often than they unsubscribe from mailing lists. So we get lots of bounces.
I briefly outlined TidBITS's bounce management process in "Not Your Grampa's Mailing List" back in TidBITS-420, and although some of the details have changed, the idea remains the same. Basically, a custom tool I wrote ferrets out bouncing email addresses from the collection of bounces we receive each week, determining whether an address is eligible for removal based on the number and types of errors that come back over a particular period. Different lists have different removal criteria: it might take four to eight weeks of errors for an address to be removed from the main TidBITS list (which only sends a message once a week), while addresses would be removed from a discussion list like TidBITS Talk more quickly (although a higher number of errors would be required).
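The removal rule described above can be sketched roughly as follows. This is not the actual TidBITS tool, whose internals aren't given here; the `Bounce` and `ListPolicy` types, the `removable` function, and every threshold are hypothetical, chosen only to mirror the idea that a weekly list needs several weeks of errors while a discussion list acts faster but demands more errors.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Bounce:
    address: str
    day: date
    permanent: bool  # True for hard failures such as "no such user"


@dataclass(frozen=True)
class ListPolicy:
    window_days: int    # how far back to look for errors
    min_errors: int     # permanent errors required inside the window
    min_span_days: int  # errors must be spread over at least this many days


# Hypothetical thresholds, for illustration only.
WEEKLY_LIST = ListPolicy(window_days=56, min_errors=4, min_span_days=21)
DISCUSSION_LIST = ListPolicy(window_days=14, min_errors=10, min_span_days=3)


def removable(bounces, policy, today):
    """Return the set of addresses that meet the policy's removal criteria."""
    cutoff = today - timedelta(days=policy.window_days)
    days_by_address = {}
    for b in bounces:
        # Only permanent errors inside the look-back window count.
        if b.permanent and b.day >= cutoff:
            days_by_address.setdefault(b.address, []).append(b.day)

    result = set()
    for address, days in days_by_address.items():
        span = (max(days) - min(days)).days
        if len(days) >= policy.min_errors and span >= policy.min_span_days:
            result.add(address)
    return result
```

Requiring both a minimum count and a minimum spread of days is one way to keep a single bad afternoon at a provider from unsubscribing everyone there; a real tool would presumably weigh error types in more detail.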
In the last year or so, we've noticed a new trend: some weeks, we get errors from hundreds (or even thousands) of subscribers whose servers refuse delivery of TidBITS issues. On the heels of these errors, we usually receive a flurry of complaints: "Why didn't I get this week's issue?" or "Please fix my subscription - I didn't get TidBITS today but your system says I'm still on the list!"
The reason for these errors is that from time to time, some email systems conclude that TidBITS is spam or - worse - an email-borne worm or virus. These email systems are utterly wrong - TidBITS is never sent to any address that has not subscribed, and an issue of TidBITS has never contained a worm or virus - but they serve to highlight some interesting points:
Email is increasingly being filtered for its content;
That filtering is often being done without the knowledge or consent of affected users;
Over time, inaccurate filtering will substantially reduce the general utility of email.
In short, we're starting to see signs that email, often hailed as the Internet's "killer app," is in danger of becoming an unreliable, arbitrarily censored medium - and there's very little we can do about it.
Them's Spam-Fighting Words! -- What causes some email systems to misinterpret TidBITS as spam or malicious email? I can't be specific here - or thousands of subscribers will never receive this TidBITS issue! - but I can point to some recent examples:
Jeff Carlson's article on the Palm i705 in TidBITS-635 made a passing reference to a well-known Pfizer drug for men, technically known as sildenafil citrate. Our mail error logs indicate over 2,500 TidBITS issues were rejected by over 1,000 sites because they contained the drug's name; many of the rejections were from relatively high-profile sites like the Association for Computing Machinery (ACM) and VeriSign. (Even leaving aside errors which cited that particular word, we received a substantially above-average number of errors for the week, which probably puts the total closer to 4,000 rejected issues, or about 10 percent of that week's mailing.)
Adam's article on bandwidth limitations on Apple's Mac.com service in TidBITS-634 caused TidBITS to be rejected as a worm by approximately 250 sites because it contained the proper name of Apple's Web page hosting service and the words "my" and "pictures" in succession.
In a particularly bizarre example, approximately 180 mail servers rejected TidBITS issues containing Matt Neuburg's articles on Unicode under Mac OS X, seemingly because the title of his articles named a particular fruit and the text contained the words "keystroke" and/or "keycode."
Adam's article in TidBITS-618 on copyright caused issues to be rejected by approximately 120 servers because it mentioned the name of a well-known peer-to-peer music swapping service and the name of a pop music group.
Adam's article "A Couple of Cool Concepts" caused TidBITS-616 to be rejected by over 1,100 sites because it sarcastically referred to an advertising campaign for a particular type of wireless video camera. Still other sites rejected it because it contained the word "undress" and another word describing a hair color.
Filter Me Timbers -- It's important to note that these TidBITS issues are being rejected by mail servers - typically run by businesses, organizations, or ISPs - rather than by individual mail clients like Eudora or Outlook Express. Current email programs can process incoming mail in any number of ways, and there's no way to prevent users from intentionally - or unwittingly - creating a rule or filter which marks TidBITS as spam and deletes it outright. In fact, publications like TidBITS have run afoul of client-side filtering such as that included in Microsoft's Outlook Express and Entourage.
Although the utter opacity of tools like Microsoft's Junk Mail Filter somewhat belies this distinction, the crucial difference between client-side mail filtering and server-side mail filtering is that the former is largely under the control of individual email users, while the latter is typically governed by organizational policy. In an organization, this may mean only one or two people in charge of thousands of email accounts determine what mail will or won't be accepted, and there's often no way for users to determine whether or how their email is being filtered.
For instance, the servers which rejected Adam's article on Mac.com services largely did so because they were running particular commercial anti-virus packages, and those organizations trusted those products would not reject legitimate email. Obviously, they were wrong. On the flip side, every copy of TidBITS-601 sent to subscribers at a large aerospace company (whose name sounds like "boing!") was rejected because it contained a particular URL; apparently, an email administrator somewhere within this organization of tens of thousands of people decided that any email message containing that URL should be rejected outright. Ironically, the offending URL was owned by a company that counts the aerospace company among its clients. Oops.
Senseless Censors -- It's hard to argue with the practical necessity of filtering email, given the tremendous amount of spam clogging the Internet. (A company that provides an anti-spam filtering service to large organizations, Brightmail, estimates that the amount of spam has gone up by 600 percent this year.) The costs of spam are quite real in terms of storage, bandwidth, and processing power, not to mention vast amounts of human time deleting, filtering, identifying, and cleaning up after spam. There's no denying administrators are trying to save time, trouble, and (in some cases) actual harm by assaying email before it gets to users' desktops. Even TidBITS performs some very basic filtering on incoming mail, and I'm more aggressive with mail filtering on my business's servers.
The thing to remember is that, like Web content filtering, email content filtering is at best unintelligent and arbitrary. A rule which seems perfectly sensible to reject spam regarding long distance telephone service may have the unintended consequence of rejecting all email from your Aunt Tillie, simply because Aunt Tillie's Internet provider has IP numbers which contain a subset of a spammer's advertised phone number. (That's a real problem one of my clients encountered - although Aunt Tillie's name has been changed.) Similarly, a rule designed to screen out promotions for adult Web sites might prevent a user from participating in a breast cancer support group's mailing list. It's easy to come up with countless examples where blocking mail based on specific words, terms, and phrases in email can do the wrong thing.
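To make the Aunt Tillie problem concrete, here is a minimal sketch of how such a rule can misfire. Everything in it is invented for illustration: the phone number, the pattern, the host names, and the addresses (which come from ranges reserved for documentation). It assumes the filter scans the whole raw message, headers included, which is one plausible way a relay's IP number ends up being compared against a phone-number rule.

```python
import re

# Hypothetical rule: block a fragment of a spammer's advertised phone
# number ("198-5110"), tolerating any single separator character.
BLOCK_PATTERNS = [re.compile(r"198\D?51")]


def naive_reject(raw_message: str) -> bool:
    """Reject when any pattern matches anywhere in the raw message,
    headers included."""
    return any(p.search(raw_message) is not None for p in BLOCK_PATTERNS)


SPAM = (
    "Received: from mail.spam.example ([203.0.113.9])\n"
    "Subject: Cheap long distance\n"
    "\n"
    "Call 198-5110 today!\n"
)

# Legitimate mail whose provider happens to use an IP number that
# contains the same digits as the blocked fragment.
AUNT_TILLIE = (
    "Received: from isp.example ([198.51.100.7])\n"
    "Subject: Sunday dinner\n"
    "\n"
    "See you at six. Love, Aunt Tillie\n"
)

for name, message in [("spam", SPAM), ("aunt", AUNT_TILLIE)]:
    print(name, "rejected" if naive_reject(message) else "accepted")
```

Both messages are rejected: the rule has no notion of what the digits mean, only that they appear. Narrowing the match to the message body or to whole tokens would avoid this particular collision, but it would not change the underlying problem that the rule matches characters rather than intent.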
As much as on-target filtering might save administrators and users time, money, and trouble, filtering that backfires also has direct costs. Part of that cost is passed off to the sender whose email has been improperly identified: every time spam filtering hits TidBITS, I get to track the problem down, deal with email administrators, and assuage irritated subscribers. (That's time I could be spending - should be spending - doing useful things like writing articles or improving TidBITS services.) Part of the cost also stays with the organization doing the filtering, largely in supporting users who didn't receive expected email and in dealing with remote administrators like me to figure out what's going wrong. Misfiring filters reduce the utility of email for all involved.
Put a Sock In It -- We've sometimes tried to avoid words and terms in TidBITS that might trigger overly broad content filters. (Here "we" mostly means "me," because I'm the staff member most familiar with the email errors and problems TidBITS encounters.) For instance, we changed portions of Dan Kohn's "Steal This Essay" series to omit a term describing adult materials (it starts with the letter P and rhymes with "corn"), and lately hardly a week goes by where we don't make changes to an issue to avoid phrases and terms which have set off overly aggressive filters. Recently self-censored articles include Adam's series on converting to Mac OS X, "Corrupt Audio Disks Stick in Mac's Craw" in TidBITS-631, "Goodies from Kensington" in TidBITS-630, "Mac OS X: Curse of the New" in TidBITS-629, and "Was Bill Gates Lying?" in TidBITS-628. These articles run the gamut of everything TidBITS covers, from analysis and commentary to news and reviews. As you've noticed, in this article I'm also trying to avoid terms or sequences of words which have caused TidBITS to be rejected.
To a degree, publishing offensive or controversial terms is a judgment call: is the editorial value worth the potential backlash and arbitrary rejection of TidBITS? But when we reach a point where TidBITS cannot mention the name of Apple's Web hosting service in the same issue as a phrase such as "my" followed by "pictures" without confusing hundreds of readers and committing (already limited) staff hours to sorting out the problem, a line has been crossed. When TidBITS cannot publish the name of a common fruit in the same issue as a word like "keystroke," mention a type of medication even in passing, or discuss a well-known online advertising campaign, we've exited the Realm of the Reasonable and landed squarely on Planet Preposterous.
All Done Now -- There's no way TidBITS can hope to self-censor against these types of mishaps: the terms and phrases are simply too arbitrary and unpredictable. Maybe tomorrow someone will release a new Windows worm, and commercial anti-virus software will start blocking all email containing the words "stopwatch" and "banana." (If you didn't get this issue as expected via email, maybe that's why!)
As a result, there's no way we can make reasonable assurances TidBITS will be able to reach you via email: we simply have no way of knowing what you or your provider might consider content non grata. We will continue to make reasonable efforts to avoid controversial or offensive terms, and may "dress up" such terms in ways so they are likely to get by some types of email filtering. We will not, however, refrain from publishing commentary about topics that are likely to set off spam filters: that's knuckling under to the email administrators who - probably unintentionally - have caused this situation. And although all discussions of true censorship and freedom of the press are generally only relevant in relation to the government, if this sort of content filtering continues to become more prevalent, there will be no freedom of speech through email.
So here's what you should do. If TidBITS doesn't arrive when you expect in email, first check our Web site to make sure the issue was published (we do take a couple of issues off each year). Then send email to <firstname.lastname@example.org>, which should always return the current issue, probably within minutes. If it hasn't arrived in an hour or two, it's a good bet that whoever manages your email server has a foolish content filter in place that we've failed to anticipate in our use of the English language. (If this requested issue does arrive, it's more likely that there were communication problems between our servers and yours that have cleared up since we sent the first copy.) The next step is to ask your email administrator - nicely - if they are performing content filtering on incoming email because you haven't received mail you expected. You may wish to ask them to remove their content filtering for all the reasons mentioned above: feel free to point them at this article. These actions won't solve the larger problem, but they might make administrators think a little harder about the impacts of email filtering.
If all else fails, you can subscribe to the announcement version of TidBITS, which delivers a brief email message containing an abstract of the issue and a table of contents with links to articles on the Web. Because the announcement version of TidBITS doesn't contain the full text of the issue, it has a good chance of passing through content filters.
Article 2 of 2 in series
Geoff's article "Email Filtering: Killing the Killer App" in TidBITS-637 struck some chords. Not surprisingly, the volume of messages to TidBITS Talk exploded, and I struggled to direct messages into appropriate threads. We received a number of reprint requests (including one to post to the Investigative Journalists and Editors mailing list) and were contacted by other publications (including the New York Times) about the subject. David Strom ran with the topic in his Web Informant column, and I was a guest on David Lawrence's Online Tonight radio show on Wednesday to talk about it (there's a streaming version available if you'd like to listen to the conversation).
It's certainly gratifying to see that we can raise awareness of problems like this to such an extent, but the real story is that we're coming up on a critical tipping point for email. The mushrooming volume of spam has caused the value and utility of email to drop significantly for many people already, and the way overzealous server-side content filtering makes email unreliable stands only to worsen the very problem it's attempting to fix. Spam seeks to fill your mailbox, and poorly targeted content filtering, in attempting to prevent the spam from passing through your mail server, can block many of the messages you want to receive. Both hurt the utility of email. Making the problem even worse is that many people (such as Mac.com users - see the discussion Dan Frakes started on MacInTouch on the topic) don't even realize that such content filtering is taking place, so they may never realize that legitimate messages are going missing.
Collateral Spammage 2002 -- There may be no way to determine what percentage of mail servers have some sort of content filtering in place, but I think this is a good occasion to reprise our poll from two years ago asking how much spam you receive each week. When I went back to check that poll's results, I was shocked to see that the answer at the highest end of the range was "more than 71 spams per week." I'm receiving almost that many per day now! That's 25 percent of my mail! So please, visit our home page and tell us how many spams you're receiving per week these days so we can all see how much worse the problem has gotten over the last two years.
Clarifications and Effects -- Although many people instantly understood what the effect of an overzealous server-side content filter could be, it wasn't entirely clear to all. First off, I want to make it clear that our concerns with content filtering in no way mean that we're in favor of spam. As far as we're concerned, spam is the scourge of the Internet, and we've devoted far more time, energy, and money to fighting spam than almost any small business. Similarly, the fact that we're opposed to erroneous content filtering doesn't mean we're opposed to spam filtering in general, whether it's performed at the server or in users' email programs. There are many ways of blocking spam and rampant PC worms at the server that don't rely on arbitrary content filtering. We employ server-side filters ourselves, but we take pains to minimize the likelihood that our filtering will cause problems for legitimate senders, and whenever we find that it has done so, we work to help address the problem.
Second, since Geoff focussed on the effect that server-side content filtering was having on our attempts to deliver TidBITS issues to our subscribers, many people didn't put it together that this sort of content filtering applies to all email messages, not just those coming from mailing lists or email publications like TidBITS. The size of our mailing list means we notice the problem sooner and suffer more than individuals will, but email messages sent from individual to individual cannot escape the effects of poorly written server-side content filters. The lost mail may not be a big deal, or it might be exceedingly important personal news or critical business communication. Neither the sender nor the recipient has any way of knowing. To paraphrase John Donne, never send to know for whom the content filters toll; they toll for thee.
Third and finally, a common theme among the messages we received was that losing some legitimate messages was worth the reduction in spam thanks to content filtering. Obviously, I can't argue with individual situations - it may be that your mail is sufficiently unimportant that you don't care if some never arrives. More generally, though, I feel that attitude is a tremendously slippery slope. Spammers are parasites who will kill their host, but treating the disease with content filtering is almost certain to have the same effect. Just as we don't automatically treat infections with amputation, neither should we automatically treat spam with server-side content filters.
Overall Practicalities -- Last week, we suggested a few ways you could get TidBITS even if your mail server refused to accept an issue; this week, let me suggest a few ways we can work together to address the problem of bad content filtering.
Contact your ISP or network administrator and ask them point blank if they are performing content filtering on your email, being clear to distinguish it from general spam filtering. If so, see if they understand the consequences of those actions. Most likely they will, but they may consider the loss of some legitimate mail an acceptable trade-off. If they persist in that belief, ask if there's any way the content filtering can be turned off, at least for your account if not every account. I doubt most will respond to a single person complaining, but if you can make the case (often a business case) to other users affected by the content filters, the groundswell might be sufficient to get content filtering removed. As an alternative approach, you might suggest they modify the system so messages caught by content filters are merely quarantined, rather than being deleted, so users at least have the opportunity to recover important messages that ran afoul of the content filters. (The downside of the quarantine approach is that it makes checking email more difficult for the user, thus potentially increasing the cost of dealing with spam.)
If all else fails, I would encourage you to find another ISP or mail host that does not perform content filtering (or at least lets you control what happens to matched messages). Be sure to convey your reasons for switching ISPs to the customer service department at the old ISP so they understand how the lack of reasonable filtering policies negatively affects their business too. Obviously, if you're dealing with your company's network administrator, there's no way to switch, but it will probably be easier to make a business case to management about the effect of legitimate mail being deleted.
Once you've established that all the messages that should reach you are coming through, the next step is to manage them effectively on your machine. TidBITS Talk participants have contributed a number of suggestions for how they manage their spam, and for those of you who are Macworld subscribers, check the August issue (not yet on the Web) for my article on stopping spam.
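For administrators weighing the quarantine suggestion above, the difference from outright rejection can be sketched in a few lines. This is only an illustration under assumed names: `Mailbox`, `deliver`, `release`, and the `is_suspect` callback are hypothetical and don't correspond to any particular mail server's interface.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    inbox: list = field(default_factory=list)
    quarantine: list = field(default_factory=list)


def deliver(message, mailbox, is_suspect, quarantine_matches=True):
    """Route one message. With quarantine_matches=False, a matched
    message is dropped and the recipient never learns it existed."""
    if not is_suspect(message):
        mailbox.inbox.append(message)
        return "delivered"
    if quarantine_matches:
        # Held aside rather than destroyed, so it can be recovered.
        mailbox.quarantine.append(message)
        return "quarantined"
    return "rejected"


def release(mailbox, index):
    """Move a quarantined message into the inbox after a user reviews it."""
    message = mailbox.quarantine.pop(index)
    mailbox.inbox.append(message)
    return message
```

A user who finds a legitimate message in the quarantine can call `release` to recover it; with `quarantine_matches=False` there is nothing left to recover. The trade-off noted earlier still applies, since someone has to look through the quarantine.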
The core problem is, of course, spam itself, and the Internet community will have to come together to address spam at a fundamental level. There have been numerous proposals, ranging from legislation (probably necessary at some level, but flawed in its geographic scope and enforcement provisions) to modifications to the Simple Mail Transfer Protocol (SMTP) that delivers every message to its intended recipient. Other efforts focus on plugging the economic loophole that spam exploits; ensuring that spam doesn't pay would certainly take a bite out of the spam load. Most likely, we'll need a combination of approaches, and the urgency of developing them increases daily.