# Naive Bayes from Scratch in Python

Posted by Kenzo Takahashi on Sun 17 January 2016

Naive bayes is a basic bayesian classifier. It's simple, fast, and widely used. You will see the beauty and power of bayesian inference. Naive bayes comes in 3 flavors in scikit-learn: MultinomialNB, BernoulliNB, and GaussianNB. In this post, we are going to implement all of them. Does it sound like a lot of work? It is. So let's get started.

### Before We Get Started

For this tutorial, I assume you know the followings:

• Python(list comprehension, basic OOP)
• Basic Linear Algebra
• Probability(gaussian distribution)

My code follows the scikit-learn style. If you are unfamiliar with scikit-learn, I recommend you check out the website. I also briefly mention it in my post, K-Nearest Neighbor from Scratch in Python.

I'm using python3. If you want to use python2, add this line at the beginning of your file and everything should work fine.

from __future__ import division


### MultinomialNB

I will use the example from chapter 13 on An Introduction to Information Retrieval. I think it's the best introduction to multinomial naive bayes. PDF is also available for free. I will not explain how naive bayes works, so I expect you to know what is in the chapter.

Here is the initial code. alpha is a smoothing parameter which will be used later.

import numpy as np
np.set_printoptions(precision=6)

class MultinomialNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha

def fit(self, X, y):
# group by class
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
print(separated)


First we need to group the training data by class. There are a lot of ways to accomplish the task. The obvious way is to use something like defaultdict. Instead, I'm using np.unique. It's much cleaner. But notice that we have to loop through X, the entire training data, for each class. If you have 20 target classes to classify, it would be inefficient. But most of the time you only have 2 classes, so it will be fine.

Let's try our code. I converted the data from table 13.1 to a matrix. We have 4 documents with 6 words(Chinese, Beijing, Shanghai, Macao, Tokyo, Japan) for the training set. In the book, the target class is yes or no, but I changed them to 0 and 1, respectively:

X = np.array([
[2,1,0,0,0,0],
[2,0,1,0,0,0],
[1,0,0,1,0,0],
[1,0,0,0,1,1]
])
y = np.array([0,0,0,1])
nb = MultinomialNB().fit(X, y)


This is the format that scikit-learn expects. Automatically constructing a word count matrix from raw text is not part of naive bayes. There is feature_extraction.text submodule for that. I might write another post about it in the future.

The output is a list of lists, each of which contains the samples:

[[array([2, 1, 0, 0, 0, 0]), array([2, 0, 1, 0, 0, 0]), array([1, 0, 0, 1, 0, 0])],
[array([1, 0, 0, 0, 1, 1])]]


We have three samples that belong to class 0, and one for class 1. Since we are using 0 and 1 for class, we can use index as a key. This simplifies the overall code dramatically. If we used dictionary instead, we would not be able to take advantage of numpy. Note that scikit-learn can take any value as a class.

#### Exercise 1

Your first exercise is to calculate $$log\hat{p}(c)$$, the prior log probability for each class. Log probability is used to avoid floating point underflow. You can use count_sample, which is the number of training samples.

def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
count_sample = X.shape
self.class_log_prior_ = # Your code here
return self

nb = MultinomialNB().fit(X, y)
print(nb.class_log_prior_)


Output:

[-0.2876820724517809, -1.3862943611198906]


These numbers might not make much sense to you, but they are log of .75 and .25.

#### Solution

self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]


#### Exercise 2

Next we want to count each word for each class and add self.alpha as smoothing:

def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
count_sample = X.shape
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = # Your code here
print(count)
return self

nb = MultinomialNB(alpha=1).fit(X, y)


Output:

[[6 2 2 2 1 1]
[2 1 1 1 2 2]]


Chinese appears 5 times and Beijing, Shanghai, Macao once in class 0. In class 1, Chinese, Tokyo, and Japan appear once. Since alpha=1, 1 is added to each count.

#### Solution

We can use numpy's sum to broadcast sum operation. + self.alpha is also broadcasted.

count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha


#### Exercise 3

Finally we can calculate the log probability of each word, $$log\hat{p}(t|c)$$. You divide count by the length of text:

def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
count_sample = X.shape
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated])
# log probability of each word
self.feature_log_prob_ = # Your code here
return self

nb = MultinomialNB().fit(X, y)
print(nb.feature_log_prob_)


Output:

[[-0.847298 -1.94591  -1.94591  -1.94591  -2.639057 -2.639057]
[-1.504077 -2.197225 -2.197225 -2.197225 -1.504077 -1.504077]]


#### Solution

There is a lot going on in 1 line. In our example, count.sum(axis=1) is [14, 9]. To broadcast it, we need to transpose it. But for some reason, transposing 1d array is a bit inconvenient in numpy. Just adding .T does nothing. So you have to use [np.newaxis]:

self.feature_log_prob_ = np.log(count / count.sum(axis=1)[np.newaxis].T)


### Predict

#### Exercise 4

Now we have the code to train the data, we can predict. predict_log_proba outputs the log probability of each class. Use self.class_log_prior_ and self.feature_log_prob_ for the calculation:

def predict_log_proba(self, X):


The first test data is the same as the book. The second one I made up:

nb = MultinomialNB().fit(X, y)
X_test = np.array([[3,0,0,0,1,1],[0,1,1,0,1,1]])
print(nb.predict_log_proba(X_test))


Output:

[array([-8.10769 , -8.906681]), array([-9.457617, -8.788898])]


#### Solution

def predict_log_proba(self, X):
return [(self.feature_log_prob_ * x).sum(axis=1) + self.class_log_prior_
for x in X]


#### Exercise 5

predict_log_proba does most of the work, so predict just calls it and picks the maximum value.

def predict(self, X):

nb = MultinomialNB().fit(X, y)
X_test = np.array([[3,0,0,0,1,1],[0,1,1,0,1,1]])
print(nb.predict(X_test))


Output:

[0 1]


#### Solution

You can use argmax to return the corresponding index:

def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)


Here is the complete code:

class MultinomialNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha

def fit(self, X, y):
count_sample = X.shape
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
self.feature_log_prob_ = np.log(count / count.sum(axis=1)[np.newaxis].T)
return self

def predict_log_proba(self, X):
return [(self.feature_log_prob_ * x).sum(axis=1) + self.class_log_prior_
for x in X]

def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)


### BernoulliNB

Bernoulli naive bayes is similar to multinomial naive bayes, but it only takes binary values. In our example, each value will be whether or not a word appears in a document. That is a very simplified model. Nevertheless, when word frequency is less important, bernoulli naive bayes may yield a better result.

We will continue using the same example. The corresponding part in the book is here.

Most of fit is the same as MultinomialNB. I added two lines. n_doc is the number of documents in each class + smoothing where smoothing is 2 * self.alpha.

class BernoulliNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha

def fit(self, X, y):
count_sample = X.shape
# group by class
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
# class prior
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
# count of each word
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha

smoothing = 2 * self.alpha
# number of documents in each class + smoothing
n_doc = np.array([len(i) + smoothing for i in separated])
print(n_doc)


We have to change X to have binary values. We can use numpy.where to change every value that is greater than 0 to 1:

nb = BernoulliNB(alpha=1).fit(np.where(X > 0, 1, 0), y)


Output:

[5 3]


#### Exercise 6

The last part of fit is self.feature_prob_. We don't take log at this point. You will see why in a minute:

def fit(self, X, y):
count_sample = X.shape
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
smoothing = 2 * self.alpha
n_doc = np.array([len(i) + smoothing for i in separated])
# probability of each word
self.feature_prob_ = # Your code here
return self

nb = BernoulliNB(alpha=1).fit(np.where(X > 0, 1, 0), y)
print(nb.feature_prob_)


Output:

[[ 0.8       0.4       0.4       0.4       0.2       0.2     ]
[ 0.666667  0.333333  0.333333  0.333333  0.666667  0.666667]]


#### Solution

It's similar to MultinomialNB:

self.feature_prob_ = count / n_doc[np.newaxis].T


#### Exercise 7

This is probably the toughest exercise in this tutorial. Just like last time, your code should output the log probability of each class. In multinomial model, nonoccuring words do not affect the posterior probability. But in bernoulli model, $$1 - \hat{p}(t|c)$$ is used instead. That's why I didn't take log in fit. Hint: You can use if statement, but my solution does not have one:

def predict_log_proba(self, X):

X_test = np.array([[1,0,0,0,1,1],[1,1,1,0,0,1]])
nb = BernoulliNB(alpha=1).fit(np.where(X > 0, 1, 0), y)
print(nb.predict_log_proba(X_test))


Output:

[array([-8.927341, -6.82724 ]), array([-7.459403, -5.898527])]


#### Solution

My solution is a really long 1 line. Sometimes you can express a control flow using math. np.abs(x - 1) flips 1 and 0. It does look cryptic at first, but I found it much cleaner than using if-else.

def predict_log_proba(self, X):
return [(np.log(self.feature_prob_) * x + \
np.log(1 - self.feature_prob_) * np.abs(x - 1)
).sum(axis=1) + self.class_log_prior_ for x in X]


predict is the same as MultinomialNB, so we are done!

class BernoulliNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha

def fit(self, X, y):
count_sample = X.shape
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
smoothing = 2 * self.alpha
n_doc = np.array([len(i) + smoothing for i in separated])
self.feature_prob_ = count / n_doc[np.newaxis].T
return self

def predict_log_proba(self, X):
return [(np.log(self.feature_prob_) * x + \
np.log(1 - self.feature_prob_) * np.abs(x - 1)
).sum(axis=1) + self.class_log_prior_ for x in X]

def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)


### Binarize

So far, we have been manually binarizing X. It would be nice if it's done automatically. BernoulliNB in scikit-learn has binarize parameter for that:

class BernoulliNB(object):
def __init__(self, alpha=1.0, binarize=0.0):
self.alpha = alpha
self.binarize = binarize


It specifies the threshold for binarizing. You can set it to None if your data does not need binarizing.

I added 1 line of code to the beginning of every method that uses X. It calls _binarize_X, which does the actual binarizing.

def fit(self, X, y):
X = self._binarize_X(X)
...

def predict_log_proba(self, X):
X = self._binarize_X(X)
...

def predict(self, X):
X = self._binarize_X(X)
...

def _binarize_X(self, X):
return np.where(X > self.binarize, 1, 0) if self.binarize != None else X


Now you can just pass X:

nb = BernoulliNB(alpha=1).fit(X, y)


### GaussianNB

If your data has a continuous variable, then Multinomial and Bernoulli are not suitable. You can discretize it, but a more common approach is to use Gaussian distribution: Not every data follows a Gaussian distribution, but it's a pretty good assumption.

#### Exercise 8

GaussianNB takes no parameter. separated is the same as the other models. self.model is going to be the mean and the standard deviation of each attribute for each class.

class GaussianNB(object):
def __init__(self):
pass

def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.model = # Your code here
return self


We are going to use the iris flower dataset:

from sklearn import datasets
nb = GaussianNB().fit(iris.data, iris.target)
print(nb.model)


The iris dataset has 4 features and 3 classes, so the output looks like this:

[[[ 5.006     0.348947]
[ 3.418     0.377195]
[ 1.464     0.171767]
[ 0.244     0.106132]]

[[ 5.936     0.510983]
[ 2.77      0.310644]
[ 4.26      0.465188]
[ 1.326     0.195765]]

[[ 6.588     0.629489]
[ 2.974     0.319255]
[ 5.552     0.546348]
[ 2.026     0.27189 ]]]


For example, 5.006 on the top left is the mean of the first attribute in class 0, and 0.27189 on the bottom right is the standard deviation of the fourth attribute in class 1.

#### Solution

Here is my solution:

self.model = np.array([np.c_[np.mean(i, axis=0), np.std(i, axis=0)]
for i in separated])


numpy.c_ is like zip:

>>> np.c_[[1,2],[3,4]]
array([[1, 3],
[2, 4]])


#### Exercise 9

Let's write gaussian distribution. The formula is as follows:

$$f(x | \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
def _prob(self, x, mean, std):
"""Gaussian distribution in log"""

nb = GaussianNB()
print(nb._prob(0, 0, 1))


Output:

-0.918938533205


which is $$log(.4)$$.

#### Solution

def _prob(self, x, mean, std):
"""Gaussian distribution in log"""
exponent = np.exp(- ((x - mean)**2 / (2 * std**2)))
return np.log(exponent / (np.sqrt(2 * np.pi) * std))


#### Exercise 10

This is the last exercise in this tutorial. predict_log_proba is as simple as applying the gaussian distribution, though the code might not necessarily be simple:

def predict_log_proba(self, X):

nb = GaussianNB().fit(iris.data, iris.target)
print(nb.predict_log_proba(iris.data[:2])) # just two samples


Output:

[[2.1414955664687394, -38.979365927930132, -55.744042534101993],
[1.5412386825371427, -37.209274991620475, -55.191594591414542]]


#### Solution

There are three loops in the code. The first one(for x in X) loops through each sample. The second one loops through each class. And the last one loops through each attribute and applies the gaussian distribution. *s unpacks an array of mean and standard deviation.

def predict_log_proba(self, X):
return [[sum(self._prob(i, *s) for s, i in zip(summaries, x))
for summaries in self.model] for x in X]


predict is the same as the other models. Here is the complete code:

class GaussianNB(object):
def __init__(self):
pass

def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.model = np.array([np.c_[np.mean(i, axis=0), np.std(i, axis=0)]
for i in separated])
return self

def _prob(self, x, mean, std):
exponent = np.exp(- ((x - mean)**2 / (2 * std**2)))
return np.log(exponent / (np.sqrt(2 * np.pi) * std))

def predict_log_proba(self, X):
return [[sum(self._prob(i, *s) for s, i in zip(summaries, x))
for summaries in self.model] for x in X]

def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)

def score(self, X, y):
return sum(self.predict(X) == y) / len(y)


I also added score. Every classification class in scikit-learn has the same score method. I talked about it in K-Nearest Neighbor from Scratch in Python.

Let's test it:

from sklearn.cross_validation import train_test_split


Using train_test_split function from cross_validation module, it splits the data into the training set and the test set. Both the training accuracy and the test accuracy is close to 1 in this case.