How to fail to filter spam

Overview

This page describes an unsuccessful methodology for reducing the amount of spam received by an individual who hosts multiple domains: searching the bodies of email messages for references to spammers' web sites.

Meat

In the autumn of 2002, I was regularly receiving about 100 unsolicited commercial email (commonly "spam") messages per day, about 50 each in my work and home accounts. I had experimented with Realtime Blackhole Lists, but the one I was using went commercial and I didn't feel like paying for the service. I blocked IP addresses on my own for a while, but after cutting myself off from a handful of important mailing lists, I decided it wasn't worth the effort. However, another idea had struck me very deeply, and I wanted to try it out.

The vast majority of spam I had received contained links to web sites, or at least links to images in HTML email. Further, I noticed that many of the links had readily distinguishable domain names, such as "dailypromo.net". A nagging suspicion that would not let go of me suggested that I could block a lot of spam simply by blocking any email that contained those domain names.

At the time I was running qmail as my mail transfer agent, but was growing frustrated with its poor handling of virtual domains. At work I had used Postfix and had noticed its body_checks feature, so I decided to play with it on my home network. One of the main draws of body_checks is that the check happens before the server accepts the message: when the terminating "." is sent by the client, the server evaluates the message, and if one of the patterns is found, the message is rejected and the client is given a "550 Error" response. This means that my server is never responsible for sending bounce messages, which can be quite annoying.
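For reference, a minimal body_checks setup looks something like this (the file path is just an example, and the pattern is one of the domains mentioned above):

    # /etc/postfix/main.cf -- scan message bodies against a pattern file
    body_checks = regexp:/etc/postfix/body_checks

    # /etc/postfix/body_checks -- one pattern per line; a match rejects
    # the message before the server accepts it
    /dailypromo\.net/    REJECT spamvertised domain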

The list of patterns is here. Yes, I did this for an entire year.

I should note here that I am the primary NS and MX relay for a small handful of personal domains for friends and family. It's much easier to remember dodge@mumford.cx than d0d63@yahoo.com, so for friends and family I just relay their mail from a custom domain to their real accounts. It's also useful when they get fed up with their ISPs and switch -- the email address stays the same.
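For illustration, a virtual alias map is one way to do this sort of forwarding in Postfix (the addresses are the ones above; the file locations are just examples):

    # /etc/postfix/main.cf -- accept mail for the vanity domain and
    # rewrite recipients through the virtual table
    virtual_alias_domains = mumford.cx
    virtual_alias_maps = hash:/etc/postfix/virtual

    # /etc/postfix/virtual -- map the memorable address to the real one
    dodge@mumford.cx    d0d63@yahoo.com

After editing the table, running "postmap /etc/postfix/virtual" rebuilds the lookup file.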

So I started collecting spam and ran it through a simple Perl script to extract the URLs. That actually turned out to be a little trickier than I first thought: HTML does not lend itself to determining the difference between links and content through simple scripting. It also rapidly became apparent that exceptions would be required. Much spam pointed at custom sites at yahoo.com, ebay.com, or geocities.com, or linked to more-or-less respectable sites to gain credibility, such as microsoft.com or w3.org. This is why the script has an exceptions list.
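The original script isn't reproduced here, but a rough sketch of the extraction might look like the following; the exceptions list mirrors the sites named above, and the domain trimming is deliberately naive:

    #!/usr/bin/perl
    # Sketch of the extraction described above: pull domain names out of
    # URLs found on standard input, skipping an exceptions list of
    # legitimate hosts that spammers merely link to.
    use strict;
    use warnings;

    my %exceptions = map { $_ => 1 }
        qw(yahoo.com ebay.com geocities.com microsoft.com w3.org);

    my %seen;
    while (my $line = <STDIN>) {
        # Crude match for http/https links; real spam HTML is far messier.
        while ($line =~ m{https?://([A-Za-z0-9.-]+)}gi) {
            my $host = lc $1;
            # Keep only the last two labels: www.dailypromo.net -> dailypromo.net
            my ($domain) = $host =~ /([^.]+\.[^.]+)$/;
            next unless defined $domain;
            next if $exceptions{$domain};
            $seen{$domain} = 1;
        }
    }
    print "$_\n" for sort keys %seen;

Feeding a mailbox of collected spam through something like this yields a candidate list of domains, each of which then becomes a body_checks pattern.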

In a very short amount of time, I found my mail server bogged down to an absolute crawl. The growing list of regular expressions was taking a very, very long time to process. I should also probably mention that the server involved here was a 25MHz Intel 486 SX with 8 MB of RAM. In other words... an old piece of shit. It took about two minutes to process each message once there were 60 checks to make, but messages were arriving faster than that, so the server just choked. Eventually I found that Postfix's pattern matching works much faster when it is compiled with libpcre, and I rebuilt the mail server. Bingo, no more problems. I have since changed servers to a 266MHz Pentium with 100MB of RAM, and it runs the current list with absolutely no performance problems whatsoever.
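The likely change here is switching the lookup table from the POSIX regexp type to the pcre type, which is only available when Postfix is built against libpcre (same example path as before):

    # /etc/postfix/main.cf -- PCRE instead of POSIX regular expressions
    body_checks = pcre:/etc/postfix/body_checks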

So I waited for the magic to happen. And waited. And, yes, eventually, the amount of spam decreased somewhat. I haven't measured it, but it feels like at home I get about 25% of the spam I get at work. I also don't get many repeat messages.

But there will always be three significant problems with this approach:

Conclusions

This methodology is better than nothing, but the pitfalls are just too great to continue. I dunno, I'll probably install SpamAssassin or something.