Naive bayes is a basic bayesian classifier. It's simple, fast, and widely used. You will see the beauty and power of bayesian inference. Naive bayes comes in 3 flavors in scikit-learn: MultinomialNB, BernoulliNB, and GaussianNB. In this post, we are going to implement all of them. Does it sound like a lot of work? It is. So let's get started.

### Before We Get Started

For this tutorial, I assume you know the followings:

- Python(list comprehension, basic OOP)
- Numpy(broadcasting)
- Basic Linear Algebra
- Probability(gaussian distribution)

My code follows the scikit-learn style. If you are unfamiliar with scikit-learn, I recommend you check out the website. I also briefly mention it in my post, K-Nearest Neighbor from Scratch in Python.

I'm using python3. If you want to use python2, add this line at the beginning of your file and everything should work fine.

```
from __future__ import division
```

### MultinomialNB

I will use the example from chapter 13 on An Introduction to Information Retrieval. I think it's the best introduction to multinomial naive bayes. PDF is also available for free. I will not explain how naive bayes works, so I expect you to know what is in the chapter.

Here is the initial code. `alpha`

is a smoothing parameter which will be used later.

```
import numpy as np
np.set_printoptions(precision=6)
class MultinomialNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha
def fit(self, X, y):
# group by class
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
print(separated)
```

First we need to group the training data by class. There are a lot of ways to accomplish the task. The obvious way is to use something like defaultdict. Instead, I'm using `np.unique`

. It's much cleaner. But notice that we have to loop through `X`

, the entire training data, for each class. If you have 20 target classes to classify, it would be inefficient. But most of the time you only have 2 classes, so it will be fine.

Let's try our code. I converted the data from table 13.1 to a matrix. We have 4 documents with 6 words(Chinese, Beijing, Shanghai, Macao, Tokyo, Japan) for the training set. In the book, the target class is yes or no, but I changed them to 0 and 1, respectively:

```
X = np.array([
[2,1,0,0,0,0],
[2,0,1,0,0,0],
[1,0,0,1,0,0],
[1,0,0,0,1,1]
])
y = np.array([0,0,0,1])
nb = MultinomialNB().fit(X, y)
```

This is the format that scikit-learn expects. Automatically constructing a word count matrix from raw text is not part of naive bayes. There is feature_extraction.text submodule for that. I might write another post about it in the future.

The output is a list of lists, each of which contains the samples:

```
[[array([2, 1, 0, 0, 0, 0]), array([2, 0, 1, 0, 0, 0]), array([1, 0, 0, 1, 0, 0])],
[array([1, 0, 0, 0, 1, 1])]]
```

We have three samples that belong to class 0, and one for class 1. Since we are using 0 and 1 for class, we can use index as a key. This simplifies the overall code dramatically. If we used dictionary instead, we would not be able to take advantage of numpy. Note that scikit-learn can take any value as a class.

#### Exercise 1

Your first exercise is to calculate \(log\hat{p}(c)\), the prior log probability for each class. Log probability is used to avoid floating point underflow. You can use `count_sample`

, which is the number of training samples.

```
def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
count_sample = X.shape[0]
self.class_log_prior_ = # Your code here
return self
nb = MultinomialNB().fit(X, y)
print(nb.class_log_prior_)
```

Output:

```
[-0.2876820724517809, -1.3862943611198906]
```

These numbers might not make much sense to you, but they are log of .75 and .25.

#### Solution

```
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
```

#### Exercise 2

Next we want to count each word for each class and add `self.alpha`

as smoothing:

```
def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
count_sample = X.shape[0]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = # Your code here
print(count)
return self
nb = MultinomialNB(alpha=1).fit(X, y)
```

Output:

```
[[6 2 2 2 1 1]
[2 1 1 1 2 2]]
```

*Chinese* appears 5 times and *Beijing*, *Shanghai*, *Macao* once in class 0. In class 1, *Chinese*, *Tokyo*, and *Japan* appear once. Since `alpha=1`

, 1 is added to each count.

#### Solution

We can use numpy's `sum`

to broadcast sum operation. `+ self.alpha`

is also broadcasted.

```
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
```

#### Exercise 3

Finally we can calculate the log probability of each word, \(log\hat{p}(t|c)\). You divide count by the length of text:

```
def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
count_sample = X.shape[0]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated])
# log probability of each word
self.feature_log_prob_ = # Your code here
return self
nb = MultinomialNB().fit(X, y)
print(nb.feature_log_prob_)
```

Output:

```
[[-0.847298 -1.94591 -1.94591 -1.94591 -2.639057 -2.639057]
[-1.504077 -2.197225 -2.197225 -2.197225 -1.504077 -1.504077]]
```

#### Solution

There is a lot going on in 1 line. In our example, `count.sum(axis=1)`

is `[14, 9]`

. To broadcast it, we need to transpose it. But for some reason, transposing 1d array is a bit inconvenient in numpy. Just adding `.T`

does nothing. So you have to use `[np.newaxis]`

:

```
self.feature_log_prob_ = np.log(count / count.sum(axis=1)[np.newaxis].T)
```

### Predict

#### Exercise 4

Now we have the code to train the data, we can predict. `predict_log_proba`

outputs the log probability of each class. Use `self.class_log_prior_`

and `self.feature_log_prob_`

for the calculation:

```
def predict_log_proba(self, X):
# Your code here
```

The first test data is the same as the book. The second one I made up:

```
nb = MultinomialNB().fit(X, y)
X_test = np.array([[3,0,0,0,1,1],[0,1,1,0,1,1]])
print(nb.predict_log_proba(X_test))
```

Output:

```
[array([-8.10769 , -8.906681]), array([-9.457617, -8.788898])]
```

#### Solution

```
def predict_log_proba(self, X):
return [(self.feature_log_prob_ * x).sum(axis=1) + self.class_log_prior_
for x in X]
```

#### Exercise 5

`predict_log_proba`

does most of the work, so `predict`

just calls it and picks the maximum value.

```
def predict(self, X):
# Your code here
nb = MultinomialNB().fit(X, y)
X_test = np.array([[3,0,0,0,1,1],[0,1,1,0,1,1]])
print(nb.predict(X_test))
```

Output:

```
[0 1]
```

#### Solution

You can use `argmax`

to return the corresponding index:

```
def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)
```

Here is the complete code:

```
class MultinomialNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha
def fit(self, X, y):
count_sample = X.shape[0]
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
self.feature_log_prob_ = np.log(count / count.sum(axis=1)[np.newaxis].T)
return self
def predict_log_proba(self, X):
return [(self.feature_log_prob_ * x).sum(axis=1) + self.class_log_prior_
for x in X]
def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)
```

### BernoulliNB

Bernoulli naive bayes is similar to multinomial naive bayes, but it only takes binary values. In our example, each value will be whether or not a word appears in a document. That is a very simplified model. Nevertheless, when word frequency is less important, bernoulli naive bayes may yield a better result.

We will continue using the same example. The corresponding part in the book is here.

Most of `fit`

is the same as MultinomialNB. I added two lines. `n_doc`

is the number of documents in each class + smoothing where smoothing is `2 * self.alpha`

.

```
class BernoulliNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha
def fit(self, X, y):
count_sample = X.shape[0]
# group by class
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
# class prior
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
# count of each word
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
smoothing = 2 * self.alpha
# number of documents in each class + smoothing
n_doc = np.array([len(i) + smoothing for i in separated])
print(n_doc)
```

We have to change `X`

to have binary values. We can use `numpy.where`

to change every value that is greater than 0 to 1:

```
nb = BernoulliNB(alpha=1).fit(np.where(X > 0, 1, 0), y)
```

Output:

```
[5 3]
```

#### Exercise 6

The last part of `fit`

is `self.feature_prob_`

. We don't take log at this point. You will see why in a minute:

```
def fit(self, X, y):
count_sample = X.shape[0]
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
smoothing = 2 * self.alpha
n_doc = np.array([len(i) + smoothing for i in separated])
# probability of each word
self.feature_prob_ = # Your code here
return self
nb = BernoulliNB(alpha=1).fit(np.where(X > 0, 1, 0), y)
print(nb.feature_prob_)
```

Output:

```
[[ 0.8 0.4 0.4 0.4 0.2 0.2 ]
[ 0.666667 0.333333 0.333333 0.333333 0.666667 0.666667]]
```

#### Solution

It's similar to MultinomialNB:

```
self.feature_prob_ = count / n_doc[np.newaxis].T
```

#### Exercise 7

This is probably the toughest exercise in this tutorial. Just like last time, your code should output the log probability of each class. In multinomial model, nonoccuring words do not affect the posterior probability. But in bernoulli model, \(1 - \hat{p}(t|c)\) is used instead. That's why I didn't take log in `fit`

. *Hint:* You can use if statement, but my solution does not have one:

```
def predict_log_proba(self, X):
# Your code here
X_test = np.array([[1,0,0,0,1,1],[1,1,1,0,0,1]])
nb = BernoulliNB(alpha=1).fit(np.where(X > 0, 1, 0), y)
print(nb.predict_log_proba(X_test))
```

Output:

```
[array([-8.927341, -6.82724 ]), array([-7.459403, -5.898527])]
```

#### Solution

My solution is a really long 1 line. Sometimes you can express a control flow using math. `np.abs(x - 1)`

flips 1 and 0. It does look cryptic at first, but I found it much cleaner than using if-else.

```
def predict_log_proba(self, X):
return [(np.log(self.feature_prob_) * x + \
np.log(1 - self.feature_prob_) * np.abs(x - 1)
).sum(axis=1) + self.class_log_prior_ for x in X]
```

`predict`

is the same as MultinomialNB, so we are done!

```
class BernoulliNB(object):
def __init__(self, alpha=1.0):
self.alpha = alpha
def fit(self, X, y):
count_sample = X.shape[0]
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.class_log_prior_ = [np.log(len(i) / count_sample) for i in separated]
count = np.array([np.array(i).sum(axis=0) for i in separated]) + self.alpha
smoothing = 2 * self.alpha
n_doc = np.array([len(i) + smoothing for i in separated])
self.feature_prob_ = count / n_doc[np.newaxis].T
return self
def predict_log_proba(self, X):
return [(np.log(self.feature_prob_) * x + \
np.log(1 - self.feature_prob_) * np.abs(x - 1)
).sum(axis=1) + self.class_log_prior_ for x in X]
def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)
```

### Binarize

So far, we have been manually binarizing X. It would be nice if it's done automatically. BernoulliNB in scikit-learn has binarize parameter for that:

```
class BernoulliNB(object):
def __init__(self, alpha=1.0, binarize=0.0):
self.alpha = alpha
self.binarize = binarize
```

It specifies the threshold for binarizing. You can set it to `None`

if your data does not need binarizing.

I added 1 line of code to the beginning of every method that uses X. It calls `_binarize_X`

, which does the actual binarizing.

```
def fit(self, X, y):
X = self._binarize_X(X)
...
def predict_log_proba(self, X):
X = self._binarize_X(X)
...
def predict(self, X):
X = self._binarize_X(X)
...
def _binarize_X(self, X):
return np.where(X > self.binarize, 1, 0) if self.binarize != None else X
```

Now you can just pass X:

```
nb = BernoulliNB(alpha=1).fit(X, y)
```

### GaussianNB

If your data has a continuous variable, then Multinomial and Bernoulli are not suitable. You can discretize it, but a more common approach is to use Gaussian distribution:

Not every data follows a Gaussian distribution, but it's a pretty good assumption.

#### Exercise 8

`GaussianNB`

takes no parameter. `separated`

is the same as the other models. `self.model`

is going to be the mean and the standard deviation of each attribute for each class.

```
class GaussianNB(object):
def __init__(self):
pass
def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.model = # Your code here
return self
```

We are going to use the iris flower dataset:

```
from sklearn import datasets
iris = datasets.load_iris()
nb = GaussianNB().fit(iris.data, iris.target)
print(nb.model)
```

The iris dataset has 4 features and 3 classes, so the output looks like this:

```
[[[ 5.006 0.348947]
[ 3.418 0.377195]
[ 1.464 0.171767]
[ 0.244 0.106132]]
[[ 5.936 0.510983]
[ 2.77 0.310644]
[ 4.26 0.465188]
[ 1.326 0.195765]]
[[ 6.588 0.629489]
[ 2.974 0.319255]
[ 5.552 0.546348]
[ 2.026 0.27189 ]]]
```

For example, 5.006 on the top left is the mean of the first attribute in class 0, and 0.27189 on the bottom right is the standard deviation of the fourth attribute in class 1.

#### Solution

Here is my solution:

```
self.model = np.array([np.c_[np.mean(i, axis=0), np.std(i, axis=0)]
for i in separated])
```

`numpy.c_`

is like `zip`

:

```
>>> np.c_[[1,2],[3,4]]
array([[1, 3],
[2, 4]])
```

#### Exercise 9

Let's write gaussian distribution. The formula is as follows:

```
def _prob(self, x, mean, std):
"""Gaussian distribution in log"""
# Your code here
nb = GaussianNB()
print(nb._prob(0, 0, 1))
```

Output:

```
-0.918938533205
```

which is \(log(.4)\).

#### Solution

```
def _prob(self, x, mean, std):
"""Gaussian distribution in log"""
exponent = np.exp(- ((x - mean)**2 / (2 * std**2)))
return np.log(exponent / (np.sqrt(2 * np.pi) * std))
```

#### Exercise 10

This is the last exercise in this tutorial. `predict_log_proba`

is as simple as applying the gaussian distribution, though the code might not necessarily be simple:

```
def predict_log_proba(self, X):
# Your code here
nb = GaussianNB().fit(iris.data, iris.target)
print(nb.predict_log_proba(iris.data[:2])) # just two samples
```

Output:

```
[[2.1414955664687394, -38.979365927930132, -55.744042534101993],
[1.5412386825371427, -37.209274991620475, -55.191594591414542]]
```

#### Solution

There are three loops in the code. The first one(`for x in X`

) loops through each sample. The second one loops through each class. And the last one loops through each attribute and applies the gaussian distribution. `*s`

unpacks an array of mean and standard deviation.

```
def predict_log_proba(self, X):
return [[sum(self._prob(i, *s) for s, i in zip(summaries, x))
for summaries in self.model] for x in X]
```

`predict`

is the same as the other models. Here is the complete code:

```
class GaussianNB(object):
def __init__(self):
pass
def fit(self, X, y):
separated = [[x for x, t in zip(X, y) if t == c] for c in np.unique(y)]
self.model = np.array([np.c_[np.mean(i, axis=0), np.std(i, axis=0)]
for i in separated])
return self
def _prob(self, x, mean, std):
exponent = np.exp(- ((x - mean)**2 / (2 * std**2)))
return np.log(exponent / (np.sqrt(2 * np.pi) * std))
def predict_log_proba(self, X):
return [[sum(self._prob(i, *s) for s, i in zip(summaries, x))
for summaries in self.model] for x in X]
def predict(self, X):
return np.argmax(self.predict_log_proba(X), axis=1)
def score(self, X, y):
return sum(self.predict(X) == y) / len(y)
```

I also added `score`

. Every classification class in scikit-learn has the same score method. I talked about it in K-Nearest Neighbor from Scratch in Python.

Let's test it:

```
from sklearn.cross_validation import train_test_split
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
nb = GaussianNB().fit(X_train, y_train)
print(nb.score(X_train, y_train))
print(nb.score(X_test, y_test))
```

Using `train_test_split`

function from cross_validation module, it splits the data into the training set and the test set. Both the training accuracy and the test accuracy is close to 1 in this case.

### Conclusion

If you attempted all the 10 exercises, thank you! I hope you learned something. If you have questions or comments, tweet **@kenzotakahashi** and I'll be happy to help.