Bayesian filter for dummies

It wasn’t long after the creation of email that email spam became a growing problem. Nowadays, we have spam filters in place to protect our inboxes from being inundated with unwanted messages, promotions and chain letters.

One such method of filtering emails uses a Bayesian interpretation of probability and is known as a Bayesian filter.

So, what exactly is a Bayesian filter?

Bayesian spam filter

A Bayesian filter is an email spam filter that uses Bayesian logic. This means it uses a Bayesian interpretation of probability and Bayes’ theorem to calculate how likely it is that an email is spam. But that’s all a bit technical.

Let’s rewind. A statistician named Thomas Bayes came up with an equation that allows new information to update the outcome of a probability calculation. This equation can be applied in all manner of scenarios – including the fight against spam. Hence, our modern Bayesian spam filters.

Now, spam characteristically contains certain content and features. For instance, certain keywords, header content, content length and so on can all indicate the likelihood of the email being spam.

So, an email subjected to a Bayesian filter will get analysed based on these characteristics and assigned a probability of being spam. The more of these characteristics that a message has, the more likely it is to end up in your spam folder.

How it works

A Bayesian filter works by comparing your incoming email with a database of emails, which are categorised into ‘spam’ and ‘not spam’.

Bayes’ theorem is used to learn from these prior messages. Then, the filter can calculate a spam probability score for each new message entering your inbox.

This “learning” process also happens on the fly. For example, every time you mark a message as spam or quarantine it, the filter will incorporate that data into its future decisions.
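To make the “learning” idea concrete, here is a minimal sketch in Python. It is not how any particular filter is built. The WordCounts class, its method names and the sample messages are all made up for illustration, and it assumes that a “characteristic” is simply a word appearing in the message.

```python
from collections import Counter


class WordCounts:
    """Hypothetical helper: counts messages and words per category."""

    def __init__(self):
        self.messages = {"spam": 0, "not spam": 0}
        self.words = {"spam": Counter(), "not spam": Counter()}

    def learn(self, text, label):
        """Record one labelled message; label is 'spam' or 'not spam'."""
        self.messages[label] += 1
        # Count each distinct word once per message.
        self.words[label].update(set(text.lower().split()))

    def rate(self, word, label):
        """Fraction of messages in this category that contain the word."""
        total = self.messages[label]
        return self.words[label][word] / total if total else 0.0


counts = WordCounts()
# Invented training messages standing in for the database of past emails.
counts.learn("Act now to claim your prize", "spam")
counts.learn("Meeting notes attached", "not spam")
# Learning on the fly: the user has just marked this message as spam.
counts.learn("Act now limited offer", "spam")
```

Each call to learn updates the counts, so the rates the filter relies on shift a little with every message you label.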

So, the Bayesian filter will improve with time – even as spammers invent new ways to get their emails through. 

For example

Say that approximately 55% of all email sent is spam. This means that, out of 1000 emails, 550 are spam and 450 are legitimate. So, before the filter has looked at its content, a message going through a Bayesian filter is 55% likely to be spam.

The filter will consider all sorts of spam characteristics. For instance, say that 20% of spam emails have ‘act now’ in the header, while only 10% of legitimate emails do. The filter will apply this, and build on the previous information. So:

  • Out of those 1000 emails, 550 are spam, and 20% (or 110) of those have ‘act now’ in the header.
  • Of the 450 emails that aren’t spam, 10% (or 45) have ‘act now’ in the header.

This means that, in this example, 155 messages have ‘act now’ in the header, and 110 (or about 71%) of them are spam. This would mean that, to the Bayesian filter, an email with ‘act now’ in its header has roughly a 71% chance of being spam.
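The same arithmetic can be written as a small Python function. This is only a sketch of Bayes’ theorem applied to the numbers in this example. The function and parameter names are made up for illustration.

```python
def posterior_spam(prior_spam, rate_in_spam, rate_in_not_spam):
    """Bayes' theorem: probability of spam given the characteristic is present."""
    spam_with = prior_spam * rate_in_spam                # 0.55 * 0.20, i.e. 110 of 1000
    not_spam_with = (1 - prior_spam) * rate_in_not_spam  # 0.45 * 0.10, i.e. 45 of 1000
    return spam_with / (spam_with + not_spam_with)


p = posterior_spam(0.55, 0.20, 0.10)
print(f"{p:.1%}")  # 110 / 155, which prints as 71.0%
```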

Then, the Bayesian filter will search the email for the next characteristic. But this time, instead of starting from 55%, it will start from the probability it has just calculated. And so on.
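Here is a rough Python sketch of that chaining, assuming the characteristics are independent of one another (the simplification that “naive” Bayes filters make). The second characteristic and its rates are invented for illustration. Only the ‘act now’ figures and the 55% starting point come from the example.

```python
def update(prior_spam, rate_in_spam, rate_in_not_spam):
    """One Bayes update: spam probability after seeing one characteristic."""
    spam_with = prior_spam * rate_in_spam
    not_spam_with = (1 - prior_spam) * rate_in_not_spam
    return spam_with / (spam_with + not_spam_with)


# (rate among spam, rate among legitimate email) for each characteristic found.
found = [
    (0.20, 0.10),  # 'act now' in the header, from the example
    (0.60, 0.30),  # hypothetical second characteristic, invented rates
]

probability = 0.55  # starting point: share of all email that is spam
for rate_in_spam, rate_in_not_spam in found:
    # Each result becomes the starting point for the next characteristic.
    probability = update(probability, rate_in_spam, rate_in_not_spam)

# Threshold check, using 99% as the cut-off.
is_spam = probability >= 0.99
```

With these made-up rates, the probability rises from 55% to about 71%, then to about 83%, which is still well short of a 99% cut-off.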

A Bayesian filter

In short, a Bayesian filter is an email spam filter. It looks for certain characteristics in emails and uses them to calculate the probability of that email being spam.

For every spam characteristic found, a Bayesian filter will increase the probability that the email is spam. If the filter eventually estimates that the email's probability of being spam has reached its cut-off (99% or higher, say), into the spam folder it goes.
