I am continuing with my series on data mining and machine learning algorithms. Naive Bayes is a simple yet useful algorithm for classification and prediction.
It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, and these probabilities can later be used to predict the outcome of the predictable attribute based on the known input attributes. The algorithm supports only discrete or discretized attributes, and it treats all input attributes as independent. With a single dependent and a single independent variable, the algorithm is not too complex to understand (I am using an example from my old statistics book: Thomas H. Wonnacott and Ronald J. Wonnacott, Introductory Statistics, Wiley, 1990).
Let’s say I am buying a used car. In an auto magazine, I find that 30% of second-hand cars are faulty. I take with me a mechanic who can make a shrewd guess on the basis of a quick drive. Of course, he isn’t always right. Of the faulty cars he examined in the past, he correctly pronounced 90% faulty and wrongly pronounced 10% OK. When judging good cars, he correctly pronounced 80% of them good and wrongly pronounced 20% faulty. Working through the numbers: 27% (90% of 30%) of all cars are actually faulty and correctly identified as such, while 14% (20% of 70%) are judged faulty although they are good. Altogether, 41% (27% + 14%) of cars are judged faulty. Of these cars, 66% (27% / 41%) are actually faulty. To sum up: once the car has been pronounced faulty by the mechanic, the chance that it is actually faulty rises from the original 30% to 66%. The following figure shows this process.
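To make the arithmetic easy to follow, here is a quick Python sketch of the same tree. The probabilities are the ones quoted above; the variable names are mine:

```python
# Prior and conditional probabilities from the used-car example.
p_faulty = 0.30                   # 30% of second-hand cars are faulty
p_judged_faulty_if_faulty = 0.90  # mechanic flags 90% of faulty cars
p_judged_faulty_if_good = 0.20    # mechanic wrongly flags 20% of good cars

# Joint probabilities (the branches of the tree).
p_faulty_and_judged_faulty = p_faulty * p_judged_faulty_if_faulty        # 0.27
p_good_and_judged_faulty = (1 - p_faulty) * p_judged_faulty_if_good      # 0.14

# Total probability of a "faulty" verdict, and the posterior (Bayes' theorem).
p_judged_faulty = p_faulty_and_judged_faulty + p_good_and_judged_faulty  # 0.41
p_faulty_given_judged_faulty = p_faulty_and_judged_faulty / p_judged_faulty
print(round(p_faulty_given_judged_faulty, 2))  # 0.66 -> roughly two thirds
```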
These calculations can be summarized in another tree, the reverse tree. You start branching with the opinion of the mechanic (59% OK and 41% faulty). Moving to the right, the second branching shows the actual condition of the cars, and this is the valuable information for you. For example, the top branch says: once the car is judged faulty, the chance that it actually turns out to be faulty is 66%. The third branch from the top shows clearly: once the car is judged good, the chance that it is actually faulty is just 5%. You can see the reverse tree in the following figure.
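Continuing the sketch, the reverse tree simply conditions the same four joint probabilities on the mechanic's verdict. A minimal Python illustration:

```python
# Joint probabilities for all four branches of the original tree.
p_faulty, p_hit, p_false_alarm = 0.30, 0.90, 0.20
joint = {
    ("judged faulty", "faulty"): p_faulty * p_hit,                      # 0.27
    ("judged faulty", "good"):   (1 - p_faulty) * p_false_alarm,        # 0.14
    ("judged good", "faulty"):   p_faulty * (1 - p_hit),                # 0.03
    ("judged good", "good"):     (1 - p_faulty) * (1 - p_false_alarm),  # 0.56
}

# First branching of the reverse tree: the mechanic's verdict.
for verdict in ("judged faulty", "judged good"):
    p_verdict = sum(p for (v, _), p in joint.items() if v == verdict)
    print(f"P({verdict}) = {p_verdict:.2f}")
    # Second branching: the actual condition, conditioned on the verdict.
    for (v, actual), p in joint.items():
        if v == verdict:
            print(f"  P({actual} | {verdict}) = {p / p_verdict:.2f}")
```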
As mentioned, Naive Bayes treats all of the input attributes as independent of each other with respect to the target variable. Although this assumption is often not realistic, it allows multiplying the probabilities to determine the likelihood of each state of the target variable based on the states of the input variables. For example, let’s say that you need to analyze whether there is any association between NULLs in different columns of your Products table. You realize that when Weight is missing, Color is missing as well in 80% of those rows, and Class is missing as well in 60% of those rows. For the missing state of Weight, you can calculate the product:
0.8 (Color missing if Weight missing) * 0.6 (Class missing if Weight missing) = 0.48
You can do the same for the not missing state of the target variable, Weight:
0.2 (Color missing if Weight not missing) * 0.4 (Class missing if Weight not missing) = 0.08
You can easily see that when Color and Class are missing, the likelihood that Weight is missing is much higher than the likelihood that it is not missing. You can convert the likelihoods to probabilities by normalizing their sum to 1:
P(Weight missing if Color and Class are missing) = 0.48 / (0.48 + 0.08) = 0.857
P(Weight not missing if Color and Class are missing) = 0.08 / (0.48 + 0.08) = 0.143
Now, when you know that the Color value is NULL and the Class value is NULL, you have a nearly 86% chance that the Weight value is NULL as well. This might tell you where to start improving your data quality.
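Here is a minimal Python sketch of this multiply-and-normalize step, using the conditional probabilities quoted above (the column names come from the example; everything else is illustrative):

```python
# Conditional probabilities from the data-profiling example above.
p_color_missing = {"weight_missing": 0.8, "weight_present": 0.2}
p_class_missing = {"weight_missing": 0.6, "weight_present": 0.4}

# Naive Bayes step: multiply the per-attribute probabilities for each
# state of the target, treating the inputs as conditionally independent.
likelihood = {
    state: p_color_missing[state] * p_class_missing[state]
    for state in ("weight_missing", "weight_present")
}  # {'weight_missing': 0.48, 'weight_present': 0.08}

# Normalize the likelihoods so they sum to 1.
total = sum(likelihood.values())
posterior = {state: lk / total for state, lk in likelihood.items()}
print(posterior)  # {'weight_missing': 0.857..., 'weight_present': 0.142...}
```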
In general, you use the Naive Bayes algorithm for classification: you extract models describing important data classes and then assign new cases to the predefined classes. Some typical usage scenarios include:
- Categorizing bank loan applications (safe or risky) based on previous experience
- Determining which home telephone lines are used for Internet access
- Assigning customers to predefined segments
- Quickly gaining a basic understanding of the data by checking the correlations between input variables
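To tie these scenarios back to the mechanics described above, here is a minimal sketch of a Naive Bayes classifier for discrete attributes in Python. The loan-style training rows, the attribute names, and the Laplace smoothing are my own illustrative choices, not part of the original example:

```python
from collections import Counter, defaultdict

# Toy training set for the bank-loan scenario; the rows are invented
# purely for illustration.
training = [
    ({"income": "high", "history": "good"}, "safe"),
    ({"income": "high", "history": "bad"},  "risky"),
    ({"income": "low",  "history": "good"}, "safe"),
    ({"income": "low",  "history": "bad"},  "risky"),
    ({"income": "high", "history": "good"}, "safe"),
]

# Count class frequencies, per-class attribute-value frequencies,
# and the distinct values of each attribute (for smoothing).
class_counts = Counter(label for _, label in training)
value_counts = defaultdict(Counter)
attr_values = defaultdict(set)
for row, label in training:
    for attr, value in row.items():
        value_counts[(label, attr)][value] += 1
        attr_values[attr].add(value)

def classify(row):
    scores = {}
    for label, n in class_counts.items():
        score = n / len(training)  # prior P(class)
        for attr, value in row.items():
            # P(value | class), Laplace-smoothed so a single unseen value
            # does not zero out the whole product.
            hits = value_counts[(label, attr)][value]
            score *= (hits + 1) / (n + len(attr_values[attr]))
        scores[label] = score
    total = sum(scores.values())  # normalize likelihoods to probabilities
    return max(scores, key=scores.get), {c: s / total for c, s in scores.items()}

print(classify({"income": "low", "history": "good"}))
# ('safe', {'safe': 0.79..., 'risky': 0.20...})
```

The classifier multiplies the per-attribute conditional probabilities exactly as in the Weight example above; the added prior and the Laplace smoothing (+1 in the numerator) are standard refinements that keep rare or unseen attribute values from dominating the result.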