If the name Bayes sounds familiar, it might be because you’ve taken a math-heavy course in high school or college. If it doesn’t, don’t worry! This article will help you understand why the Naive Bayes algorithm is prominent in supervised machine learning, even though it is “naive”, as the name suggests. As always, a real-life scenario will be used so that you understand not only how the algorithm works but also where it is applied.
The Naive Bayes algorithm is useful when handling large volumes of data because it is a fast yet simple classification algorithm. It is used extensively in Natural Language Processing (NLP) and performs well on tasks like opinion mining and text classification. Like many other supervised algorithms, the Naive Bayes (NB) model can be applied to both binary and multi-class classification problems. The algorithm is named after the probability theorem at its core. So, before we jump into the actual algorithm, let’s take a brief look at Bayes’ Theorem.
Bayes’ Theorem deals with conditional probabilities. Using this theorem, we can find the probability of an event, provided we know the probabilities of some other related events. Understanding Bayes’ Theorem is essential to understanding the Naive Bayes algorithm, so let’s work through an example. Consider a scenario in which a medical clinic prescribes narcotic drugs to 10% of its patients. Let’s say that 3% of the patients at that clinic are drug addicts. Of all the patients prescribed narcotic drugs, 5% are addicts. Clearly, there is a chance that a drug addict gets prescribed narcotic drugs. This probability can be determined by Bayes’ Theorem, which states the following:

P(A|B) = P(B|A) × P(A) / P(B)
Here, A is the event that the clinic prescribes narcotic drugs to one of its patients, and B is the event that a patient is addicted to narcotics. P(B|A) (the probability of B occurring given that A has already happened) is then the probability that a patient who was prescribed narcotic drugs is an addict. We know this is 5%. What we want is P(A|B), the probability that an addict is prescribed narcotic drugs. Plugging in the rest of the values, P(A|B) = (0.05 × 0.10) / 0.03, which comes out to about 16.7%. This is how Bayes’ Theorem works mathematically.
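To double-check the arithmetic, here is a minimal Python sketch of the same calculation. The function and variable names are my own choices for illustration; only the three percentages come from the clinic example.

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Values from the clinic example.
p_prescribed = 0.10               # P(A): a patient is prescribed narcotic drugs
p_addict = 0.03                   # P(B): a patient is an addict
p_addict_given_prescribed = 0.05  # P(B|A): a prescribed patient is an addict

# P(A|B): the probability that an addict is prescribed narcotic drugs.
p_prescribed_given_addict = bayes_posterior(
    p_addict_given_prescribed, p_prescribed, p_addict
)
print(f"{p_prescribed_given_addict:.1%}")  # prints 16.7%
```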
The machine learning algorithm uses this theorem to compute the probability of each class in the classification problem. The class with the highest probability is taken as the most likely class. More formally, this class is known as the Maximum A Posteriori (MAP) class. Now that we understand the theorem, let’s take a look at how the algorithm works.
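Picking the most likely class can be sketched in a few lines of Python. The priors and likelihoods below are made-up numbers for illustration only, and the helper name `map_class` is hypothetical.

```python
def map_class(priors: dict, likelihoods: dict):
    """Return the Maximum A Posteriori class and the normalized posteriors.

    priors[c]      = P(c)
    likelihoods[c] = P(data | c)
    """
    # Unnormalized posterior for each class: P(data | c) * P(c).
    scores = {c: likelihoods[c] * priors[c] for c in priors}
    # Dividing by the sum (which equals P(data)) makes the posteriors add up to 1.
    total = sum(scores.values())
    posteriors = {c: s / total for c, s in scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Illustrative numbers only, not taken from any real dataset.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": 0.02, "ham": 0.001}

best, posteriors = map_class(priors, likelihoods)
print(best, posteriors)
```

Note that the division by P(data) does not change which class wins, so implementations often skip it and compare the unnormalized scores directly.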
Here, we’re going to find out how the Naive Bayes algorithm can be used to solve one of the most common problems in machine learning: spam classification. Basically, we are going to look at how we can use a Naive Bayes classifier to decide whether a particular email is spam or ham (not spam). This is clearly a binary classification problem. As we saw above, Bayes’ Theorem is used to solve problems of the form “the probability of A given B”, right? In this case, our event A will be the email being spam. But what about B? What is the one thing we can obtain from all emails, whether they are spam or not? Well, we can obtain the words in the email! More formally, Bayes’ Theorem will calculate the probability of an email being spam given a feature vector of the words in the email. Without getting into the nitty-gritty of it, a feature vector (in this case) is just a numerical representation of words. Writing w1, w2, ..., wn for the words in the email, the equation looks like this:

P(spam | w1, w2, ..., wn) = P(w1, w2, ..., wn | spam) × P(spam) / P(w1, w2, ..., wn)
How is this probability calculated? The method is fairly intuitive. How do we humans decide whether an email is spam? We look for misspelled words, grammatical errors and sentences that just seem odd, right? The algorithm does something similar. While generating the feature vector, it converts the words to numbers (typically counts of how often each word appears), and from the training data it learns which words show up more often in spam than in ham. So, given enough data, the algorithm learns to recognize which words look fishy. An email titled “CLICK HERE TO WIN LOTERY” will be marked as spam (the probability of the email being spam will be very high), whereas something like “Just checking in” probably wouldn’t be.
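To make this concrete, here is a small from-scratch sketch of a word-count Naive Bayes classifier in Python. It is an illustration under a few assumptions of my own, not a production implementation: the six training emails are invented, add-one (Laplace) smoothing is used so that a word unseen in one class does not zero out the whole product, and words never seen in training are simply ignored.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return P(class | text) for every class."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    log_scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        # Start from the log prior, log P(class).
        score = math.log(class_counts[label] / n_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # assumption: skip words never seen in training
            # Add-one smoothed estimate of P(word | class).
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        log_scores[label] = score
    # Convert log scores back to probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


# Invented toy data, for illustration only.
training_emails = [
    ("click here to win lottery", "spam"),
    ("win a free prize now", "spam"),
    ("u r the lucky winner", "spam"),
    ("just checking in", "ham"),
    ("are we still meeting tomorrow", "ham"),
    ("see you at lunch", "ham"),
]

model = train(training_emails)
print(predict_proba(model, "CLICK HERE TO WIN LOTERY"))
print(predict_proba(model, "Just checking in"))
```

The scores are added up as logarithms rather than multiplied directly, because a product of many small probabilities quickly underflows to zero on longer emails.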
There’s just one little thing left. If this algorithm is fast, accurate and good at handling large volumes of data, why is it called naive? That’s because it makes an assumption that is rarely realistic. It assumes that all input variables are completely independent of each other, given the class. To explain what that means, consider the following sentences:
- “U r selected for lottery”
- “U r the lucky winner”
- “U r today’s lucky winer!”
Clearly, all three of these sentences would indicate spam. If you had been given a thousand more such examples, the next time you saw ‘U r’ at the start of a sentence, you would automatically classify it as spam, right? The Naive Bayes algorithm probably would too, but it wouldn’t recognize that ‘U’ and ‘r’ appearing together is a sign of spam at all. It would treat ‘U’ and ‘r’ as completely independent words. That’s why we call it the ‘Naive’ Bayes algorithm :)
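One way to see the independence assumption in action: under it, the probability of a sentence given a class is just a product of per-word probabilities, so shuffling the words changes nothing. The sketch below uses invented per-word probabilities purely for illustration, and `naive_likelihood` is a hypothetical helper name.

```python
import math

# Hypothetical values of P(word | spam), invented for illustration.
p_word_given_spam = {"u": 0.05, "r": 0.04, "lucky": 0.03, "winner": 0.02}


def naive_likelihood(words, p_word):
    """P(words | class) under the naive assumption: a product of per-word terms."""
    return math.prod(p_word[w] for w in words)


original = naive_likelihood(["u", "r", "lucky", "winner"], p_word_given_spam)
shuffled = naive_likelihood(["winner", "r", "lucky", "u"], p_word_given_spam)

# Word order (and therefore "U" sitting next to "r") has no effect on the score.
print(math.isclose(original, shuffled))  # prints True
```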
That’s it! Another common supervised machine learning algorithm is the Support Vector Machine, which I’ll be covering in my next article!