penk

I originally posted this blurb on elbows, and thought folks in LJ land might enjoy the geekiness of it.

I just upgraded to SpamAssassin 2.50 on homeport's main mail server. I'm using a private procmail rule to wash all my inbound mail through SA for ranking, and if something gets tagged via an X-Spam-Flag: Yes, procmail tosses it into my spam folder, and I never see it.
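For anyone who wants to see the filing decision spelled out, here's a rough Python sketch of it. This is not my procmail rule, just an illustration of the same logic: SA has already stamped its headers on the message, and the only question is whether X-Spam-Flag says yes. The function name and the folder names "spam" and "inbox" are made up for the example.

```python
from email import message_from_string


def destination_folder(raw_message: str) -> str:
    """Return the folder a message would be filed into.

    Assumes SpamAssassin has already added its headers. The folder
    names "spam" and "inbox" are placeholders for this sketch.
    """
    msg = message_from_string(raw_message)
    flag = msg.get("X-Spam-Flag", "")
    # Compare case-insensitively so "Yes" and "YES" are treated alike.
    return "spam" if flag.strip().lower() == "yes" else "inbox"


flagged = "X-Spam-Flag: YES\nSubject: Free offer\n\nClick below!\n"
clean = "Subject: Lunch?\n\nSee you at noon.\n"

print(destination_folder(flagged))
print(destination_folder(clean))
```

Only the header block is consulted, so a message that merely quotes "X-Spam-Flag: YES" in its body is not misfiled.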

Since 2.50 now comes with a Bayesian filtering system that not only learns on its own but can also be 'seeded' with examples ahead of time, I was excited to give it a try.
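To give a feel for what 'seeding' buys you, here's a toy naive-Bayes-style classifier in Python. It is emphatically not SpamAssassin's implementation (SA has its own tokenizer and its own way of combining probabilities); it's a minimal sketch that assumes whitespace-separated words, add-one smoothing, and only looks at words present in the message. The training messages are invented.

```python
import math
from collections import Counter


def train(messages):
    """Count how many training messages contain each word."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts, len(messages)


def spam_probability(text, ham, spam):
    """Estimate P(spam | words) from seeded ham and spam counts."""
    ham_counts, n_ham = ham
    spam_counts, n_spam = spam
    # Start from the prior odds: the share of spam in the seed mail.
    log_odds = math.log(n_spam / n_ham)
    for word in set(text.lower().split()):
        # Add-one smoothing keeps unseen words from zeroing things out.
        p_word_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_ham = (ham_counts[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_word_spam / p_word_ham)
    return 1 / (1 + math.exp(-log_odds))


# Invented seed data standing in for a ham folder and a spam folder.
ham = train(["meeting notes attached", "lunch on friday", "server upgrade notes"])
spam = train(["free offer click below", "free coupons click now", "discount offer free"])

print(spam_probability("free offer inside", ham, spam))
print(spam_probability("friday lunch notes", ham, spam))
```

The more seed mail each side has, the less any single word can swing the result, which is why a large, well-sorted pair of folders is worth having before you start.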

Once it was set up, I needed to give it examples of 'non-spam' (referred to in SA terms as 'ham' mail), so I used my elbows folder (about 1500 messages) for seeding:

$ sa-learn --ham --showdots --mbox Mail/elbows

This took about 3 minutes on the PIII-550 that is 'lightship', our main server.

Next, of course, I needed 'spam' samples. Fortunately, I save all my spam into the 'spams' folder, so I just fed that to it:

$ sa-learn --spam --showdots --mbox Mail/spams

Another 3 minutes or so, and I was ready to go.

Before my upgrade, I was seeing perhaps 1 in 20 of the spams I receive 'slip through' the old SA filters (I was running 2.20, no Bayesian filtering). So far, in 7-8 hours, I have not seen one spam that SA missed, and I haven't gotten any false positives.

Curious, I started looking through the spams that were caught. SA can summarize why it thought something was spam. If its review gives the mail a score higher than '5.0', it's considered spam.

This particular piece of mail struck me as possibly being legit, but the Bayesian filter said "Nuh uh. This looks like other spam". Note that had it not been for the Bayesian filter, this would have ended up in my inbox: the other rules add up to only 2.8. But because I had good sample history, it flagged it with a '90-99% probability of being spam'. That +2.9 on the ranking put it over 5.0, and voila. I never saw it.

Way cool. Here's the summary:

Content analysis details:   (5.70 points, 5 required)
NO_COST              (0.3 points)  BODY: No such thing as a free lunch (3)
OFFERS_ETC           (0.4 points)  BODY: Stop with the offers, coupons, discounts etc!
HTML_IMAGE_RATIO_08  (0.8 points)  BODY: HTML has a low ratio of text to image area
HTML_30_40           (0.7 points)  BODY: Message is 30% to 40% HTML
HTML_FONT_COLOR_GRAY (0.1 points)  BODY: HTML font color is gray
HTML_FONT_BIG        (0.3 points)  BODY: FONT Size +2 and up or 3 and up
HTML_WEB_BUGS        (0.1 points)  BODY: Image tag with an ID code to identify you
HTML_FONT_COLOR_RED  (0.1 points)  BODY: HTML font color is red
BAYES_90             (2.9 points)  BODY: Bayesian classifier says spam probability is 90 to 99%  [score: 0.9815]
CLICK_BELOW          (0.0 points)  Asks you to click below
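If you want to check the arithmetic, here's a small Python sketch that pulls the per-rule points out of a summary like this one and adds them up, with and without the Bayesian rule. The regular expression is my own guess at the layout shown here, not an official SA report format, and I'm assuming a message counts as spam once its total reaches the required score.

```python
import re

# The summary text, with its original line wrapping left in place.
REPORT = """\
Content analysis details:   (5.70 points, 5 required)
NO_COST            (0.3 points)  BODY: No such thing as a free lunch (3)
OFFERS_ETC         (0.4 points)  BODY: Stop with the offers, coupons,
discounts etc!
HTML_IMAGE_RATIO_08 (0.8 points)  BODY: HTML has a low ratio of text to
image area
HTML_30_40         (0.7 points)  BODY: Message is 30% to 40% HTML
HTML_FONT_COLOR_GRAY (0.1 points)  BODY: HTML font color is gray
HTML_FONT_BIG      (0.3 points)  BODY: FONT Size +2 and up or 3 and up
HTML_WEB_BUGS      (0.1 points)  BODY: Image tag with an ID code to
identify you
HTML_FONT_COLOR_RED (0.1 points)  BODY: HTML font color is red
BAYES_90           (2.9 points)  BODY: Bayesian classifier says spam
probability is 90 to 99%  [score: 0.9815]
CLICK_BELOW        (0.0 points)  Asks you to click below
"""

# A rule line starts with the rule name, followed by "(N.N points)".
# The header line and wrapped continuation lines do not match.
RULE_LINE = re.compile(r"^(\w+)\s+\((-?\d+(?:\.\d+)?) points\)", re.MULTILINE)


def rule_scores(report):
    """Map each rule name in the report to its points."""
    return {name: float(points) for name, points in RULE_LINE.findall(report)}


REQUIRED = 5.0  # the "5 required" from the header line
scores = rule_scores(REPORT)
total = sum(scores.values())
without_bayes = total - scores["BAYES_90"]

print(f"total: {total:.1f}, spam: {total >= REQUIRED}")
print(f"without BAYES_90: {without_bayes:.1f}, spam: {without_bayes >= REQUIRED}")
```

Every other rule here is worth less than a point, so the Bayesian score is the only one that can carry this message over the line.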

Go team 8)

(ob-other-stats - on a slow day I'll get between 50 and 75 spams. On a really busy day, usually Friday->Saturday, that number can go over 200. Spam is a real problem.)