Logistic Regression from Scratch in Python

Posted by Kenzo Takahashi on Thu 14 January 2016

Logistic Regression is a linear classifier. Despite the name, it is a classification algorithm. It's very similar to linear regression, so if you are not familiar with that, I recommend you check out my last post, Linear Regression from Scratch in Python. We are going to implement both binary classification and multiclass classification.


Before We Get Started

For this tutorial, I assume you know the following:

  • Python (list comprehension, basic OOP)
  • Numpy
  • Basic Linear Algebra
  • Multivariate Calculus (partial derivatives)
  • Basic machine learning concepts
  • Linear Regression

My code follows the scikit-learn style. If you are unfamiliar with scikit-learn, I recommend you check out the website. I also briefly mention it in my post, K-Nearest Neighbor from Scratch in Python.

I'm using Python 3. If you want to use Python 2, add this line at the beginning of your file and everything should work fine.

from __future__ import division
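
As a side note, without this import Python 2 performs integer division on integers, which would silently truncate ratios like the accuracy we compute later. With the import, or on Python 3, you always get a float:

print(3 / 4)   # 0 on Python 2 without the import, 0.75 otherwise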

The Initial Code

The code looks identical to our linear regression so far:

import numpy as np

class LogisticRegression(object):
    def __init__(self, eta=0.1, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        X = np.insert(X, 0, 1, axis=1)
        self.w = np.ones(X.shape[1])
        m = X.shape[0]
        return self
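
If np.insert looks unfamiliar: called with these arguments it prepends a column of ones to X, which acts as the bias (intercept) term. A quick check on the toy data we will use below:

X = np.array([[-2, 2],[-3, 0],[2, -1],[1, -4]])
print(np.insert(X, 0, 1, axis=1))
# [[ 1 -2  2]
#  [ 1 -3  0]
#  [ 1  2 -1]
#  [ 1  1 -4]]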

Sigmoid Function

Before working on fit, let's see what predict looks like. We will change it later, but for now it is exactly the same as our linear regression:

def predict(self, X):
    return np.insert(X, 0, 1, axis=1).dot(self.w)

X = np.array([[-2, 2],[-3, 0],[2, -1],[1, -4]])
y = np.array([1,1,0,0])
logi = LogisticRegression().fit(X, y)
print(logi.predict(X))

Output:

[ 1. -2.  2. -2.]

It's not clear what this means. The output should be either 0 or 1 because this is a classification problem. Instead we got some real numbers. To solve this problem, we can use the sigmoid function:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Here is what the sigmoid function looks like:

[Figure: the sigmoid function, an S-shaped curve that maps any real number into the interval (0, 1)]

We can interpret this as the probability that the target class is 1. It still isn't 0 or 1, but intuitively, the output should be 1 when \(g(z) \ge 0.5\).


Exercise 1

Before we get any further, let's write the sigmoid function:

def _sigmoid(self, x):
    # Your code here

logi = LogisticRegression()
print(logi._sigmoid(3))
print(logi._sigmoid(-1))
print(logi._sigmoid(0))

Output:

0.952574126822
0.26894142137
0.5

Solution

return 1 / (1 + np.exp(-x))
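
Since np.exp works elementwise, _sigmoid also accepts whole arrays, which is exactly what we will rely on in the next exercise. For example (values rounded):

print(logi._sigmoid(np.array([1, -2, 2, -2])))
# roughly [ 0.731  0.119  0.881  0.119]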

Exercise 2

Now we can write the correct version of predict. We already have the raw output, so your code should apply the sigmoid function to it and return 1 if the result is greater than or equal to 0.5, and 0 otherwise:

def predict(self, X):
    output = np.insert(X, 0, 1, axis=1).dot(self.w)
    # Your code here

X = np.array([[-2, 2],[-3, 0],[2, -1],[1, -4]])
y = np.array([1,1,0,0])
logi = LogisticRegression().fit(X, y)
print(logi.predict(X))

Output:

[1 0 1 0]

Solution

You could use a list comprehension:

return [1 if i >= 0.5 else 0 for i in self._sigmoid(output)]

But a more elegant approach is to use where:

return np.where(self._sigmoid(output) >= .5, 1, 0)
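
np.where(condition, a, b) picks a wherever the condition is True and b wherever it is False, elementwise. With the (approximate) sigmoid values of our earlier output:

probs = np.array([0.73, 0.12, 0.88, 0.12])   # roughly sigmoid([1, -2, 2, -2])
print(np.where(probs >= .5, 1, 0))           # [1 0 1 0]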

Update Rule

The update rule looks exactly the same as linear regression:

$$\theta := \theta + \frac{\alpha}{m} (y - h(x))X$$

But in linear regression, \(h(x) = X\theta\). In logistic regression, however, \(h(x) = g(X\theta)\), where \(g\) is the sigmoid function. So fit looks like this:

def fit(self, X, y):
    X = np.insert(X, 0, 1, axis=1)
    self.w = np.ones(X.shape[1])
    m = X.shape[0]

    for _ in range(self.n_iter):
        output = X.dot(self.w)
        errors = y - self._sigmoid(output)
        self.w += self.eta / m * errors.dot(X)
    return self
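
If you want to see the update rule in isolation, here is a single hand-rolled iteration on the toy data from earlier. It simply mirrors the body of the loop above, with eta = 0.1 and all-ones initial weights as in the class; it's only a sketch for checking the shapes:

X = np.array([[-2, 2],[-3, 0],[2, -1],[1, -4]])
y = np.array([1,1,0,0])

Xb = np.insert(X, 0, 1, axis=1)         # prepend the bias column
w = np.ones(Xb.shape[1])                # [ 1.  1.  1.]
m = Xb.shape[0]

output = Xb.dot(w)                      # [ 1. -2.  2. -2.]
errors = y - 1 / (1 + np.exp(-output))  # y - g(X theta)
w += 0.1 / m * errors.dot(Xb)           # one gradient step
print(w)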

Every classification class in scikit-learn has the same score method. I talked about it in K-Nearest Neighbor from Scratch in Python, so I will just show the code here:

def score(self, X, y):
    return sum(self.predict(X) == y) / len(y)
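
The comparison self.predict(X) == y gives an array of booleans, and summing booleans counts the correct predictions. A tiny made-up example of the mechanism:

print(sum(np.array([1, 1, 0, 0]) == np.array([1, 1, 0, 1])) / 4)   # 0.75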

Here is the complete code:

class LogisticRegression(object):
    def __init__(self, eta=0.1, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        X = np.insert(X, 0, 1, axis=1)
        self.w = np.ones(X.shape[1])
        m = X.shape[0]

        for _ in range(self.n_iter):
            output = X.dot(self.w)
            errors = y - self._sigmoid(output)
            self.w += self.eta / m * errors.dot(X)
        return self

    def predict(self, X):
        output = np.insert(X, 0, 1, axis=1).dot(self.w)
        return np.where(self._sigmoid(output) >= .5, 1, 0)

    def score(self, X, y):
        return sum(self.predict(X) == y) / len(y)

    def _sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

We can test the code:

X = np.array([[-2, 2],[-3, 0],[2, -1],[1, -4]])
y = np.array([1,1,0,0])
logi = LogisticRegression().fit(X, y)
print(logi.predict(X))

Output:

[1 1 0 0]

The training accuracy is 100%, meaning the data is linearly separable.
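
We can double-check this with the score method we just wrote:

print(logi.score(X, y))
# 1.0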


Multiclass Classification

Our logistic regression can only be used for binary classification. For multiclass classification, you can use the One-vs-Rest (OvR) method.

Here is how OvR works. Suppose you have 3 target classes: A, B, and C. First you treat B and C as one class, D, and run a logistic regression. This classifier separates A from D. Next you treat A and C as D, and so on. You end up with 3 independent logistic regressions.

When you predict, you run the 3 classifiers. Each one gives the probability of the class associated with it, and whichever has the highest probability is the predicted class.

Let's see how it's done in code:

class LogisticRegressionOVR(object):
    def __init__(self, eta=0.1, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        X = np.insert(X, 0, 1, axis=1)
        self.w = []
        m = X.shape[0]

        for i in np.unique(y):
            y_copy = np.where(y == i, 1, 0)
            w = np.ones(X.shape[1])

            for _ in range(self.n_iter):
                output = X.dot(w)
                errors = y_copy - self._sigmoid(output)
                w += self.eta / m * errors.dot(X)
            self.w.append((w, i))
        return self

np.unique(y) creates an array of the class labels. For each label, you make y_copy, which relabels y so that the current class becomes 1 and every other class becomes 0. Now that y_copy has only 1's and 0's, we can run logistic regression on it. After the inner for loop, self.w.append((w, i)) appends the weights and the corresponding class label to self.w.
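
To make the relabeling concrete, here is what np.unique and np.where do on a small made-up label array:

labels = np.array([0, 1, 2, 1, 0, 2])
print(np.unique(labels))            # [0 1 2]
print(np.where(labels == 1, 1, 0))  # [0 1 0 1 0 1]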

Let's test the code on the iris flower dataset. np.set_printoptions sets the printing precision for floating-point numbers. Notice that I set n_iter to 1000.

from sklearn import datasets
np.set_printoptions(precision=3)
iris = datasets.load_iris()
X = iris.data
y = iris.target
logi = LogisticRegressionOVR(n_iter=1000).fit(X, y)
print(logi.w)

Since the iris dataset has 4 features and 3 classes, self.w holds 3 weight vectors, each with 5 entries (a bias plus 4 feature weights):

[(array([ 1.11 ,  0.174,  2.1  , -2.901, -0.689]), 0),
 (array([ 1.509,  0.481, -1.805,  0.49 , -1.268]), 1),
 (array([-0.472, -2.298, -1.804,  3.015,  3.19 ]), 2)]

We also make some changes to predict. We are going to predict one instance at a time, so the new predict calls _predict_one for each instance and puts the results in a list.

def _predict_one(self, x):
    # predicts one instance

def predict(self, X):
    return [self._predict_one(i) for i in np.insert(X, 0, 1, axis=1)]

Exercise 3

Let's write _predict_one. It should compute \(x\theta\) for each set of weights, take the maximum value, and return the corresponding class label.

def _predict_one(self, x):
    # Your code here

Solution

It can be done in 1 line:

def _predict_one(self, x):
    return max((x.dot(w), c) for w, c in self.w)[1]

Notice I am not applying the sigmoid function to \(x\theta\). Since the sigmoid is monotonically increasing, it preserves the ordering of the scores, so taking the maximum of the raw outputs gives the same result.
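
The one-liner works because Python compares tuples element by element, so max picks the tuple with the largest raw score, and [1] pulls out its label. With some made-up scores:

scores = [(0.3, 0), (1.2, 1), (-0.5, 2)]   # (x.dot(w), class label) pairs
print(max(scores))                          # (1.2, 1)
print(max(scores)[1])                       # 1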

Let's test our new logistic regression:

from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_temp, y_train, y_temp = \
    train_test_split(iris.data, iris.target, test_size=.4)
X_validation, X_test, y_validation, y_test = \
    train_test_split(X_temp, y_temp, test_size=.5)

logi = LogisticRegressionOVR(n_iter=1000).fit(X_train, y_train)

print(logi.score(X_train, y_train))
print(logi.score(X_validation, y_validation))

Using the train_test_split function from the model_selection module, we first split the data in a 60:40 ratio, then split the latter part in half. This leaves 60% for the training set and 20% each for the validation and test sets.

I ran it multiple times with different n_iter. It looks like we need at least 300 epochs to get a good result. Once you find good parameters, you can get the true score on the test set.

print(logi.score(X_test, y_test))

I encourage you to play with the code and see how changing each parameter affects the accuracy.


Conclusion

If you have questions or comments, tweet @kenzotakahashi and I'll be happy to help.