
SpamProbe Howto Guide

For Mandrake 9.1 and SpamProbe 0.9e

Herman Oosthuysen 15 November 2003

Licenced under GNU GPL, 2003, http://www.gnu.org


General

SpamProbe is arguably the best Bayesian mail filter available.  Where most filters count only single words, SpamProbe counts word pairs as well.  It also handles mail headers and HTML tags intelligently.  The result is a very good filter: it is about 99% effective in my experience, and I have never seen a false positive.

SpamProbe works well on a Fetchmail/Procmail system, which is what I describe here.

With a 99% effective filter, only about 1 spam in 100 gets through, so a spammer would have to send 100 times as many messages, every one of them different, to get as many spams past the filter as before.  Hopefully, spam filters will improve even more over time, making spamming completely impractical.
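
The arithmetic behind the 100 times figure can be checked with a few lines of shell.  This is only a sketch: the function name required_multiplier is made up for illustration, and it assumes the filter stops the same percentage of every batch of spam.

```shell
# Multiplier a spammer needs to get the same number of spams through
# a filter that stops effectiveness_pct percent of them.
# Integer arithmetic only; the function name is hypothetical.
required_multiplier() {
    effectiveness_pct=$1
    pass_pct=$(( 100 - effectiveness_pct ))   # percentage that slips through
    echo $(( 100 / pass_pct ))
}

required_multiplier 99   # prints 100
```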


Where to get it

You can get SpamProbe here: http://spamprobe.sourceforge.net

You will also need BerkeleyDB, available here: http://www.sleepycat.com


Installation

First install BerkeleyDB, then SpamProbe.

To install BerkeleyDB, download it to your home directory:

  • cd ~
  • mkdir berkeleydb
  • cd berkeleydb

Go and get the tar file from the sleepycat web site: http://www.sleepycat.com

  • tar -zxvf db-4.1.25.tar.gz  (modify as per your downloaded version)
  • cd db-4.1.25

Now start up a browser and read docs/index.html.  Click "Building for UNIX/POSIX systems".

To do a standard UNIX build of Berkeley DB, change to the build_unix directory and then enter the following two commands:

  • ../dist/configure
  • make

This will build the Berkeley DB library.

To install the Berkeley DB library, enter the following commands:

  • su  (enter the root password when prompted)
  • make install
  • exit

To rebuild Berkeley DB, enter:

  • make clean
  • make

Now, here is the trick, which caused me to write this howto.  As root, make a symbolic link in /usr/lib that points to the Berkeley DB library:

  • ln  -s  /usr/local/BerkeleyDB.4.1/lib/libdb-4.1.so  /usr/lib/libdb-4.1.so

Otherwise, SpamProbe can't find the schtoopidttt library.
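
If you would rather not create the link blindly, something like the following guards against a missing library or an existing link.  It is a sketch, not part of either package: link_library is a made-up helper name, and both paths are passed in so that nothing is assumed about your installed version.

```shell
# Create the symlink only if the library exists and the link is not there yet.
# link_library is a hypothetical helper; run it as root for /usr/lib.
link_library() {
    src=$1
    dst=$2
    if [ ! -e "$src" ]; then
        echo "library not found: $src" >&2
        return 1
    fi
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        echo "already present: $dst"
        return 0
    fi
    ln -s "$src" "$dst"
}

# Usage, with the paths from this howto:
# link_library /usr/local/BerkeleyDB.4.1/lib/libdb-4.1.so /usr/lib/libdb-4.1.so
```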

To install SpamProbe, download it to your home directory:

  • cd ~
  • mkdir spamprobe
  • cd spamprobe

Go and get the tar file from the SpamProbe web site: http://spamprobe.sourceforge.net

  • tar -zxvf spamprobe-0.9e.tar.gz
  • cd spamprobe-0.9e

Configure and build spamprobe:

  • ./configure --with-db=/usr/local/BerkeleyDB.4.1
  • make

Install it:

  • su  (enter the root password when prompted)
  • make install
  • exit

Database Setup

This howto describes using SpamProbe with a common database.  That makes it easy to make corrections to the database, since there is only one to worry about.  Generally, for a given business, the e-mail will look pretty much the same for each user, since they all work on the same stuff, therefore using a common database should be good enough.

If you want to use multiple databases, then you have to create a .spamprobe directory for each user, including root:

  • su -  (enter the root password when prompted)
  • mkdir .spamprobe

Now for the users:

  • cd /home
  • su username
  • mkdir username/.spamprobe

and repeat for each and every user.  You need this, since procmail runs with the permissions of the user the mail is addressed to.  The system therefore could keep a different database for each user.  Note that the procmail setup below will have to change if you want to use multiple databases.
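
With many users, the repetition can be left to a loop run as root.  The sketch below assumes every subdirectory of the base directory is a user's home and that the owner of the home directory should own the new directory; make_spamprobe_dirs is a made-up name, so check the result before relying on it.

```shell
# Create a .spamprobe directory in every home directory under base_dir
# and give it to the owner of that home directory.
# Hypothetical helper; assumes each subdirectory of base_dir is a home.
make_spamprobe_dirs() {
    base_dir=$1
    for home in "$base_dir"/*/; do
        [ -d "$home" ] || continue                   # no match or not a directory
        owner=$(ls -ld "$home" | awk '{print $3}')   # third field is the owner
        mkdir -p "${home}.spamprobe"
        chown "$owner" "${home}.spamprobe"
    done
}

# Usage (as root):
# make_spamprobe_dirs /home
```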


Procmail Setup

Procmail has to run SpamProbe on each and every incoming message.  Each message is also fed back into SpamProbe, to allow it to evolve its database.  Errors must be manually corrected.

We handle errors by creating two new mail users: spam and ham. 

TIP: Note that if you define user names and domain names in lower case, they become case insensitive in Unix/Linux.  Therefore, NEVER define user/host/domain names with uppercase letters in them.

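If a user receives a good message classified as spam, the user should forward it to user spam, which will then cause SpamProbe to correct its behaviour.  Similarly, if a spam message is received in the user's inbox, the user should forward it to user ham, which will cause SpamProbe to correct its database accordingly.  Note that the recipes shown below only act on mail addressed to the spam user, and flip whichever classification the message carries; a matching block for the ham user is not among the parts shown.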

Here are the relevant parts from my /etc/procmail/procmailrc file. 

Place this definition at the top of the file:

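# Spamprobe configuration
SPAMPROBE="/usr/local/bin/spamprobe -d /var/spool/mail"
# The correction recipes below also use $FORMAIL.  Define it here if your
# procmailrc does not do so already (this path is an assumption, adjust it):
FORMAIL=/usr/bin/formail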

Place this code before you sort the mail for each user:

### Spamprobe - Naive Bayesian Word Probability Filter
## Avoid running spamprobe again on spam corrections
:0
* ! (^TO_spam@YOURDOMAIN\.com)
{
   # Score the message
   :0
   SCORE=| $SPAMPROBE receive

   # Add the score to X-Spamprobe header
   :0 wf
   | formail -I "X-SpamProbe: $SCORE"

   # Put a copy of spams in the spamprobe box
   :0 ac:
   * (^X-Spamprobe: SPAM)
   /var/spool/mail/spamprobe

}

### Spam Corrections
### To correct a misclassification, forward it to the spam user address
:0
* (^TO_spam@YOURDOMAIN\.com)
{
   :0
   * (^X-SpamProbe: SPAM)
   * ! (^X-Loop: SpamProbe)
   {
      # Was seen as spam, should be ham and reverse header
      :0 wf
      | $FORMAIL -I "X-SpamProbe: GOOD" -rk

      :0 wf
      | $FORMAIL -I "X-Loop: SpamProbe"

      # After the To/From reversal, fix the From line again
      :0 wf
      | $FORMAIL -I "From " -a "From "

      # Put it in Hambox and copy for redelivery and user verification
      :0 c:
      /var/spool/mail/ham

      # Rescan the hambox
      :0 wc
      | $SPAMPROBE good /var/spool/mail/ham
   }

   :0
   * (^X-SpamProbe: GOOD)
   * ! (^X-Loop: SpamProbe)
   {
      # Was seen as ham, should be spam and reverse header
      :0 wf
      | $FORMAIL -I "X-SpamProbe: SPAM" -rk

      :0 wf
      | $FORMAIL -I "X-Loop: SpamProbe"

      # After the To/From reversal, fix the From line again
      :0 wf
      | $FORMAIL -I "From " -a "From "

      # Put it in Spambox and copy for redelivery and user verification
      :0 c:
      /var/spool/mail/spam

      # Rescan the spambox
      :0 wc
      | $SPAMPROBE spam /var/spool/mail/spam
   }
}

In addition, at the very end of my procmailrc file, I have the following code, to handle the leftovers:

### Unknowns - Whatever is left over is spam by definition
# Avoid handling the spam twice though
:0
* ! (^X-SpamProbe:.*)
{
   # Add a spam header
   :0 wf
   | $FORMAIL -I "X-SpamProbe: SPAM"

   # Put it in Spambox and copy it
   :0 c:
   /var/spool/mail/spam

   # Rescan the spambox
   :0 Wc
   | $SPAMPROBE spam /var/spool/mail/spam
}

SpamProbe Education

In order to use SpamProbe, you have to teach it right from wrong.  To do this, you need a Bible of Good messages and an Apocrypha of Spam messages.  If you were careful to delete all the crud from your inbox, then that will do for the good messages.  Hopefully, you also have a junkbox full of spam.  If not, well, it is easy enough to get spam to train SpamProbe on...

Before doing the commands below, first compact your mailboxes using your e-mail client, so that deleted/moved mail is really deleted/moved.  This is very important, else SpamProbe will read 'moved' spam in the inbox for instance and corrupt its database, reducing its effectiveness.

To teach SpamProbe about Ham:

  • /usr/local/bin/spamprobe  -d  /var/spool/mail  good  /path/to/your/inbox

I create a new inbox each year, so I have to run the above once for each inbox.

To teach SpamProbe about Spam:

  • /usr/local/bin/spamprobe  -d  /var/spool/mail   spam  /path/to/your/junkbox

Repeat the above for each user in the system.
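
Running the same command over several mailboxes is easier with a small wrapper.  The following is a sketch under assumptions: the function name train and the DBDIR variable are made up for illustration, the defaults are the paths used in this howto, and the mailbox names in the usage comment are placeholders.

```shell
# Defaults follow this howto; override them in the environment if needed.
SPAMPROBE=${SPAMPROBE:-/usr/local/bin/spamprobe}
DBDIR=${DBDIR:-/var/spool/mail}

# train KIND MBOX...   where KIND is "good" or "spam".
# Feeds each readable mailbox to SpamProbe and skips the rest.
# Hypothetical helper, not part of SpamProbe.
train() {
    kind=$1
    shift
    for mbox in "$@"; do
        if [ -r "$mbox" ]; then
            "$SPAMPROBE" -d "$DBDIR" "$kind" "$mbox"
        else
            echo "skipping unreadable mailbox: $mbox" >&2
        fi
    done
}

# Usage (placeholder paths):
# train good /path/to/inbox-2002 /path/to/inbox-2003
# train spam /path/to/your/junkbox
```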

This process will create the SpamProbe database /var/spool/mail/sp_words.

Finally, ensure that SpamProbe can always access the database:

  • su  (enter the root password when prompted)
  • cd /var/spool/mail
  • chown root:mail sp_words
  • chmod 660 sp_words
  • exit

This is required, since procmail runs with the permissions of the user to whom the mail is addressed, so the database must be readable and writable by everybody in the mail group.  Change as required for your system.


E-Mail Client Configuration

With this setup, all mail is still delivered to the user, but each message carries a new header that the client can use to sort the mail into the inbox and the junkbox.

Configure your e-mail client to look for the header X-SpamProbe: SPAM and dump it into the junkbox.
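
If your client cannot filter on arbitrary headers, the same test is easy to do in a script.  This is a sketch only: is_marked_spam is a made-up name, it reads one message on standard input, and it assumes the header value starts with the word SPAM, as in the recipes above.

```shell
# Exit status 0 if the message on stdin carries "X-SpamProbe: SPAM"
# in its header block (everything before the first blank line).
# Hypothetical helper; the header name is matched case-insensitively.
is_marked_spam() {
    awk '
        /^$/ { exit }                                  # end of the headers
        tolower($0) ~ /^x-spamprobe: spam/ { found = 1; exit }
        END { exit(found ? 0 : 1) }
    '
}

# Usage:
# is_marked_spam < message.txt && echo "junk"
```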


La Voila!

Have fun,

Herman.



Copyright © 2005-2008, Aerospace Software Ltd., GPL.