Naive Bayes
Naive Bayes is a powerful classification algorithm built on Bayes' theorem. It assumes that the feature variables in the dataset are independent of one another.
Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to assess the probability that they have cancer more accurately than an assessment made without knowledge of their age.
A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a vegetable may be considered a carrot if it is orange, cone-shaped, and about three inches long. The algorithm is naive because it treats each of these properties as contributing independently to the probability that the vegetable is a carrot. In practice, features are rarely independent, but Naive Bayes treats them as if they were when making predictions.
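To make that concrete, here is a minimal sketch of the independence assumption at work; the per-feature probabilities are made up purely for illustration:

```python
# Hypothetical per-feature probabilities for the carrot example.
# In a real classifier these would be estimated from training data.
p_feature_given_carrot = {
    "orange": 0.9,         # assumed P(orange | carrot)
    "cone-shaped": 0.8,    # assumed P(cone-shaped | carrot)
    "about 3 inches": 0.6, # assumed P(about 3 inches | carrot)
}

# The "naive" step: treat the features as independent and multiply.
p_all_features = 1.0
for feature, p in p_feature_given_carrot.items():
    p_all_features *= p

print(p_all_features)  # 0.9 * 0.8 * 0.6 = 0.432 (up to floating-point rounding)
```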
Let's look at a practical use of the Naive Bayes algorithm. Assume we have several news feeds and we want to classify each feed as a cultural or a non-cultural event. Consider the following sentences:
- Dramatic event went well—cultural event
- This good public rally had a huge crowd—non-cultural event
- Music show was good—cultural event
- Dramatic event had a huge crowd—cultural event
- The political debate was very informative—non-cultural event
Using Bayes' theorem, all we want to do is use probabilities to calculate whether a sentence falls under the cultural or the non-cultural category.
As in the case of the carrot, where we had the features of color, shape, and size and treated all of them as independent, to determine whether a feed relates to a cultural event we take a sentence and treat each word in it as an independent feature.
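As a minimal sketch, here is how those five labelled sentences might be represented and turned into per-class word counts (the variable names are my own):

```python
from collections import Counter

# The five labelled feeds from the example above.
training_data = [
    ("Dramatic event went well", "cultural"),
    ("This good public rally had a huge crowd", "non-cultural"),
    ("Music show was good", "cultural"),
    ("Dramatic event had a huge crowd", "cultural"),
    ("The political debate was very informative", "non-cultural"),
]

# Count how often each word occurs in each class, treating every
# word of a sentence as an independent feature.
word_counts = {"cultural": Counter(), "non-cultural": Counter()}
total_words = {"cultural": 0, "non-cultural": 0}

for sentence, label in training_data:
    words = sentence.lower().split()
    word_counts[label].update(words)
    total_words[label] += len(words)

print(word_counts["cultural"]["dramatic"])  # 2
print(total_words["cultural"])              # 14 words across the 3 cultural sentences
```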
Bayes' theorem states that P(A|B) = P(B|A) · P(A) / P(B). Applied to our example: P(Cultural event|Dramatic show good) = P(Dramatic show good|Cultural event) · P(Cultural event) / P(Dramatic show good).
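Expressed as code, the theorem is just this arithmetic; the numbers in the demo call are invented purely to show the mechanics:

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Invented numbers, purely to show the mechanics:
# with P(B|A) = 0.8, P(A) = 0.1, P(B) = 0.2, we get P(A|B) = 0.4.
print(bayes(0.8, 0.1, 0.2))  # 0.4
```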
We can discard the denominator here, as we only want to determine which tag has the higher probability, cultural or non-cultural, and the denominator P(Dramatic show good) is the same in both calculations.
P(Dramatic show good) cannot be found directly, as this exact sentence doesn't occur in the training data. This is where the naive independence assumption really helps:
P(Dramatic show good) = P(Dramatic) · P(show) · P(good)
P(Dramatic show good|Cultural event) = P(Dramatic|Cultural event) · P(show|Cultural event) · P(good|Cultural event)
Now it is easy to calculate these probabilities and determine whether the new feed is more likely to be a cultural or a non-cultural event:
P(Cultural event) = 3/5 (3 out of the total 5 sentences)
P(Non-cultural event) = 2/5
P(Dramatic|Cultural event) = 2/14 (Dramatic appears 2 times among the 14 words in the cultural event sentences)
P(show|Cultural event) = 1/14
P(good|Cultural event) = 1/14
Here is the final calculated summary:

| Word | P(word \| Cultural event) | P(word \| Non-cultural event) |
| --- | --- | --- |
| Dramatic | 2/14 | 0/14 |
| show | 1/14 | 0/14 |
| good | 1/14 | 1/14 |

Now, we just multiply the probabilities, including the priors, and see which product is bigger, and then fit the sentence into that category of tags:

P(Cultural event) · P(Dramatic|Cultural event) · P(show|Cultural event) · P(good|Cultural event) = 3/5 · 2/14 · 1/14 · 1/14 ≈ 0.00044
P(Non-cultural event) · P(Dramatic|Non-cultural event) · P(show|Non-cultural event) · P(good|Non-cultural event) = 2/5 · 0/14 · 0/14 · 1/14 = 0

So we know from the table that the feed belongs to the cultural event category, as that class yields the bigger product. (Dramatic and show never appear in the non-cultural sentences, so the non-cultural product collapses to zero; practical implementations avoid such zeros with Laplace smoothing.)
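Putting the whole walkthrough together, here is a sketch that scores "Dramatic show good" against both classes exactly as in the hand calculation; the variable names are my own, and, as noted above, a real implementation would add smoothing:

```python
from collections import Counter

training_data = [
    ("Dramatic event went well", "cultural"),
    ("This good public rally had a huge crowd", "non-cultural"),
    ("Music show was good", "cultural"),
    ("Dramatic event had a huge crowd", "cultural"),
    ("The political debate was very informative", "non-cultural"),
]

# Per-class word counts, word totals, and sentence counts.
word_counts = {"cultural": Counter(), "non-cultural": Counter()}
total_words = Counter()
class_counts = Counter()

for sentence, label in training_data:
    words = sentence.lower().split()
    word_counts[label].update(words)
    total_words[label] += len(words)
    class_counts[label] += 1

def score(sentence: str, label: str) -> float:
    # P(label) multiplied by P(word | label) for every word;
    # unseen words get probability 0, as in the hand calculation.
    result = class_counts[label] / sum(class_counts.values())
    for word in sentence.lower().split():
        result *= word_counts[label][word] / total_words[label]
    return result

new_feed = "Dramatic show good"
for label in ("cultural", "non-cultural"):
    print(label, score(new_feed, label))
# cultural      3/5 * 2/14 * 1/14 * 1/14 ≈ 0.000437
# non-cultural  2/5 * 0/14 * 0/14 * 1/14 = 0.0
```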
These examples have given us a good introduction to the Naive Bayes algorithm, which can be applied to the following areas (a short library-based sketch follows the list):
- Text classification
- Spam filtering
- Document categorization
- Sentiment analysis in social media
- Classification of news articles based on genre
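In practice, text classifiers like the one above are rarely written by hand. Here is a sketch using scikit-learn's CountVectorizer and MultinomialNB on a tiny invented spam corpus, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A tiny invented corpus, purely for illustration.
texts = [
    "win a free prize now",               # spam
    "limited offer click here",           # spam
    "meeting rescheduled to friday",      # ham
    "please review the attached report",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer turns each sentence into a vector of word counts;
# MultinomialNB fits a Naive Bayes model on those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["free prize offer"]))          # likely ['spam']
print(model.predict(["see the report on friday"]))  # likely ['ham']
```

Note that MultinomialNB applies Laplace-style smoothing by default (its alpha parameter defaults to 1.0), which avoids the zero probabilities we saw in the hand calculation.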