Scikit-learn's Useful Tools from Scratch

Posted by Kenzo Takahashi on Sat 09 January 2016

Scikit-learn offers a lot of tools that make our life easier. Many of them are really simple, so let's write them from scratch.

Before We Get Started

For this tutorial, I assume you know the following:

  • Python (list comprehensions, basic OOP)
  • Numpy
  • Basic Linear Algebra and Statistics
  • Basic machine learning concepts

I'm using Python 3. If you want to use Python 2, add this line at the beginning of your file and everything will work fine.

from __future__ import division

Warning: Although scikit-learn accepts any array-like objects, my code only works with numpy arrays.

Credit: Many examples used in this post are taken from scikit-learn official documentation.


train_test_split

Splitting data into a training set and a test set is a basic and essential task. Scikit-learn has train_test_split in the cross_validation module. Here is my simplified version:

import numpy as np

def train_test_split_(*arrays, test_size=None, train_size=None, random_state=None):
    length = len(arrays[0])
    if random_state is not None:
        np.random.seed(random_state)
    p = np.random.permutation(length)

    if type(test_size) == int:
        index = length - test_size
    elif type(test_size) == float:
        index = length - int(np.ceil(length * test_size))
    else:
        if type(train_size) == int:
            index = train_size
        elif type(train_size) == float:
            index = int(length * train_size)
        else:
            index = length - int(np.ceil(length * 0.25))

    return [b for a in arrays for b in (a[p][:index], a[p][index:])]

The first parameter is *arrays, so it accepts an arbitrary number of arrays to split. test_size and train_size can be a float, an int, or None. If one of them is an int, it represents the absolute number of samples; if it is a float, it represents the proportion of the dataset. You only need to specify one of the two. If neither is specified, the data is split into a 75% training set and a 25% test set. You can also pass random_state to make the split reproducible.

That's the basic logic for computing the split index. The last line shuffles every array with the same permutation, splits each one at the index, and flattens the resulting (train, test) pairs into a single list, so for two inputs you get [X_train, X_test, y_train, y_test].
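Here is a tiny sketch of that comprehension with made-up toy arrays and a fixed permutation instead of a random one:

a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])
p = np.array([2, 0, 3, 1])  # a fixed permutation instead of np.random.permutation
index = 3
# one (train, test) pair per input array, flattened into a single list
print([part for arr in (a, b) for part in (arr[p][:index], arr[p][index:])])

Output:

[array([30, 10, 40]), array([20]), array([3, 1, 4]), array([2])]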

Typically, you pass both X and y. Here are some examples:

X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = train_test_split_(X, y, test_size=.33)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y, test_size=3)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y, train_size=.75)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y, train_size=6)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y)
print(len(X_train), len(X_test))

Output:

6 4
7 3
7 3
6 4
7 3
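
The splits above are random, so the exact rows that end up in each set change between runs. As a quick check of random_state (reusing the X and y from above), passing the same seed twice produces identical splits:

A_train, A_test, _, _ = train_test_split_(X, y, test_size=.33, random_state=42)
B_train, B_test, _, _ = train_test_split_(X, y, test_size=.33, random_state=42)
print(np.array_equal(A_train, B_train), np.array_equal(A_test, B_test))

Output:

True True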

Standardization

Many machine learning estimators require standardization of the data. Imagine you want to train k-nearest neighbors and your data consists of 2 features: one that ranges from 0 to 1 and another that ranges from 100 to 10000. When you calculate the Euclidean distance, the distance on the second feature is so big that it makes the distance on the first feature irrelevant. Standardization solves this problem by scaling each feature so that all features end up in a similar range.
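
As a quick illustration with made-up numbers, take two samples whose first feature differs by 0.8 (a huge gap on a 0-to-1 scale) and whose second feature differs by 200 (a small gap on a 100-to-10000 scale). The raw Euclidean distance is dominated almost entirely by the second feature:

a = np.array([0.1, 5000.])
b = np.array([0.9, 5200.])
print(np.sqrt(np.sum((a - b) ** 2)))  # ~200.0016: the 0.8 gap on the first feature barely registers

After standardization, both features are on a similar scale, so the first feature contributes meaningfully to the distance again.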

Standardization tools belong to the preprocessing module in scikit-learn. We are going to write two of them.


StandardScaler

The basic standardization transforms each feature to have zero mean and unit variance: every value \(x\) is replaced by \((x - \mu) / \sigma\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of that feature.

import numpy as np

def scale(X):
    new = X - np.mean(X, axis=0)
    return new / np.std(new, axis=0)

Here is an example:

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
print(scale(X))

Output:

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]

In practice, you want to compute the mean and standard deviation on the training set and apply them to the test set. Scikit-learn has the StandardScaler class for that:

import numpy as np

class StandardScaler(object):
    def __init__(self):
        pass

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.scale_ = np.std(X - self.mean_, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

The fit method saves the mean and standard deviation for later use. When transform is applied to the same data used in fit, it does essentially the same thing as the scale function above. fit_transform is just a shortcut that does both.

Here is the typical use case. You fit on the training data and then transform both the training set and the test set:

scaler = StandardScaler().fit(X)
print(scaler.transform(X))
print(scaler.transform([[-1.,  1., 0.]]))

Output:

[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[[-2.44948974  1.22474487 -0.26726124]]
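
As a quick sanity check of the claim that fitting and transforming the same data matches the scale function, the two should agree exactly (reusing the scale function and the X from this section):

print(np.allclose(StandardScaler().fit_transform(X), scale(X)))

Output:

True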

MinMaxScaler

Another standardization method is to scale the data so that it lies within a given range. Scikit-learn calls this MinMaxScaler:

import numpy as np

class MinMaxScaler(object):
    def __init__(self, feature_range=(0, 1)):
        self.low_, self.high_ = feature_range

    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        X_std = (X - self.min_) / (self.max_ - self.min_)
        return X_std * (self.high_ - self.low_) + self.low_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

The first line in transform is the most important. It scales the data to the range [0, 1] regardless of feature_range. X likely consists of multiple features, but consider just one of them. If a value is the minimum of the feature, the numerator is 0, so the result is 0. If a value is the maximum, the numerator equals the denominator, so X_std is 1. The next line rescales X_std to the requested feature range.

Here is an example:

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
scaler = MinMaxScaler(feature_range=(0,1)).fit(X)
print(scaler.transform(X))
print(scaler.transform([ -3., -1.,  4.]))

Output:

[[ 0.5         0.          1.        ]
 [ 1.          0.5         0.33333333]
 [ 0.          1.          0.        ]]
[-1.5         0.          1.66666667]
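
The effect of feature_range is easier to see with a range other than (0, 1). As an extra example (just reusing the class above), scaling the same X to (-1, 1) maps the minimum of each column to -1 and the maximum to 1:

print(MinMaxScaler(feature_range=(-1, 1)).fit_transform(X))

Output:

[[ 0.         -1.          1.        ]
 [ 1.          0.         -0.33333333]
 [-1.          1.         -1.        ]]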

accuracy_score

Every classifier has its own score method, but scikit-learn's metrics module also provides a stand-alone function, accuracy_score. It is actually a very simple function:

def accuracy_score(y_true, y_pred, normalize=True):
    correct = sum(y_true == y_pred)
    return correct / len(y_true) if normalize else correct

The first line computes the number of correctly classified samples. The second line divides that count by the total number of samples if normalize is True; otherwise it returns the raw count.

y_pred = np.array([0, 2, 1, 3])
y_true = np.array([0, 1, 2, 3])
print(accuracy_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred, normalize=False))

Output:

0.5
2

confusion_matrix

Another useful function from metrics module is confusion_matrix. It's also easy to implement:

def confusion_matrix(y_true, y_pred, labels=None):
    labels = labels if labels is not None else sorted(set(y_true) | set(y_pred))
    indexes = {v: i for i, v in enumerate(labels)}
    matrix = np.zeros((len(indexes), len(indexes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[indexes[t], indexes[p]] += 1
    return matrix

First we need to know which classes are available. If labels are given, we just use them. Otherwise, it converts each array into a set, takes the union, and sorts it. Next, it maps each class to its index. Then it initializes an \(n \times n\) matrix, where \(n\) is the number of classes. Finally, for each pair of values from y_true and y_pred, it increments the corresponding cell of the matrix, so rows correspond to true labels and columns to predicted labels:

y_true = np.array([2, 0, 2, 2, 0, 1])
y_pred = np.array([0, 0, 2, 2, 0, 2])
print(confusion_matrix(y_true, y_pred))
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
print(confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"]))

Output:

[[2 0 0]
 [0 0 1]
 [1 0 2]]
[[2 0 0]
 [0 0 1]
 [1 0 2]]
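
As a small aside, the diagonal of this matrix holds the correctly classified samples, so the accuracy can be recovered from it. Reusing the first, numeric example:

y_true = np.array([2, 0, 2, 2, 0, 1])
y_pred = np.array([0, 0, 2, 2, 0, 2])
matrix = confusion_matrix(y_true, y_pred)
print(np.trace(matrix) / matrix.sum())  # correct predictions / total samples

This prints roughly 0.667 (4 correct out of 6), the same value accuracy_score returns for these arrays.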

Conclusion

In practice, you will most likely use scikit-learn rather than your own code. Nevertheless, it is good practice to code things from scratch. You will learn more about Python and NumPy, and you will also notice small details of scikit-learn.

If you have questions or comments, tweet @kenzotakahashi and I'll be happy to help.