Scikit-learn offers a lot of tools that make our life easier. Many of them are really simple, so let's write them from scratch.
Before We Get Started
For this tutorial, I assume you know the following:
- Python (list comprehensions, basic OOP)
- NumPy
- Basic Linear Algebra and Statistics
- Basic machine learning concepts
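I'm using Python 3. If you want to use Python 2, note that the keyword-only arguments after *arrays in train_test_split_ are Python 3 syntax and need rewriting; for everything else, add this line at the beginning of your file so that / performs true division: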
from __future__ import division
Warning: Although scikit-learn accepts any array-like object, my code only works with NumPy arrays.
Credit: Many examples used in this post are taken from the official scikit-learn documentation.
train_test_split
Splitting data into a training set and a test set is a basic and essential task. Scikit-learn provides train_test_split in its model_selection module (older releases kept it in cross_validation). Here is my simplified version:
import numpy as np

def train_test_split_(*arrays, test_size=None, train_size=None, random_state=None):
    length = len(arrays[0])
    # Seed only when a value is given; `is not None` keeps random_state=0 valid
    if random_state is not None:
        np.random.seed(random_state)
    p = np.random.permutation(length)

    # index is the number of training samples; it must be an int for slicing
    if isinstance(test_size, int):
        index = length - test_size
    elif isinstance(test_size, float):
        index = length - int(np.ceil(length * test_size))
    elif isinstance(train_size, int):
        index = train_size
    elif isinstance(train_size, float):
        index = int(length * train_size)
    else:
        index = length - int(np.ceil(length * 0.25))

    return [b for a in arrays for b in (a[p][:index], a[p][index:])]
The first parameter is *arrays. It takes an arbitrary number of arrays to split. test_size and train_size can each be a float, an int, or None. You can also pass random_state. If test_size or train_size is an int, it represents the absolute number of samples. If it is a float, it represents the proportion of the dataset. So you only have to specify either test_size or train_size. If neither of them is specified, the data is split into a 75% training set and a 25% test set.
That's the basic logic. The last line splits the data according to the index and flattens the nested list.
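To make the flattening step concrete, here is a small sketch of what the nested comprehension in the return statement does. The toy arrays and the fixed index are made up for illustration, and the shuffling is left out so the result is easy to read.

```python
import numpy as np

# Two toy arrays of the same length; the values are illustrative only.
arrays = (np.array([10, 20, 30, 40]), np.array(["a", "b", "c", "d"]))
index = 3  # the first 3 samples go to the training part

# Nested form: one (train, test) pair per input array.
nested = [(a[:index], a[index:]) for a in arrays]

# Flattened form, as in the return statement: the outer loop runs over the
# arrays, the inner loop over the two halves of each array.
flat = [b for a in arrays for b in (a[:index], a[index:])]

print(len(nested), len(flat))            # 2 4
print([part.tolist() for part in flat])  # [[10, 20, 30], [40], ['a', 'b', 'c'], ['d']]
```

The flat order (train, test, train, test, ...) is what lets the caller unpack the result into X_train, X_test, y_train, y_test.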
Typically, you pass both X and y. Here are some examples:
X, y = np.arange(20).reshape((10, 2)), np.arange(10)
X_train, X_test, y_train, y_test = train_test_split_(X, y, test_size=.33)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y, test_size=3)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y, train_size=.75)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y, train_size=6)
print(len(X_train), len(X_test))
X_train, X_test, y_train, y_test = train_test_split_(X, y)
print(len(X_train), len(X_test))
Output:
6 4
7 3
7 3
6 4
7 3
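The lengths alone do not show that samples and labels stay paired. The sketch below checks that property for the shuffling step, assuming (as in the function above) that one permutation is shared by every array. The local RandomState generator and the seed 42 are my own choices for the illustration.

```python
import numpy as np

X = np.arange(20).reshape((10, 2))   # row i is [2*i, 2*i + 1]
y = np.arange(10)                    # label i belongs to row i

rng = np.random.RandomState(42)      # local generator; 42 is arbitrary
p = rng.permutation(len(X))          # one permutation shared by X and y

X_shuffled, y_shuffled = X[p], y[p]

# Every shuffled row still sits next to its original label.
still_paired = bool(np.all(X_shuffled[:, 0] == 2 * y_shuffled))
print(still_paired)  # True
```

Had X and y been shuffled with two separate permutations, this check would almost always fail, and the labels would no longer describe their rows.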
Standardization
Many machine learning estimators work poorly unless the data is standardized. Imagine you want to train a k-nearest neighbors model and your data consists of 2 features: one that ranges from 0 to 1 and another that ranges from 100 to 10000. When you calculate the Euclidean distance, the difference on the second feature is so big that it makes the difference on the first feature irrelevant. Standardization solves this problem by scaling each feature so that all the features lie in a similar range.
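The effect is easy to reproduce with three made-up points. The values below are hypothetical, and the rescaling uses the stated feature ranges directly rather than any scikit-learn class.

```python
import numpy as np

# Hypothetical samples: column 0 lies in [0, 1], column 1 in [100, 10000].
a = np.array([0.0, 5000.0])
b = np.array([1.0, 5100.0])   # opposite end of feature 0, close on feature 1
c = np.array([0.0, 5300.0])   # identical on feature 0, farther on feature 1

def euclidean(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

# Raw distances: feature 1 decides everything, so b looks nearer to a than c.
print(euclidean(a, b) < euclidean(a, c))       # True

# Rescale both features to [0, 1] using the stated ranges.
low = np.array([0.0, 100.0])
high = np.array([1.0, 10000.0])
a_s, b_s, c_s = [(v - low) / (high - low) for v in (a, b, c)]

# After rescaling, feature 0 counts too, and c becomes the nearer point.
print(euclidean(a_s, b_s) > euclidean(a_s, c_s))  # True
```

A nearest-neighbor search on the raw data would therefore pick a different neighbor than one on the rescaled data.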
Standardization tools belong to the preprocessing module in scikit-learn. We are going to write two of them.
StandardScaler
The basic standardization transforms the data to have 0 mean and unit variance on each feature:
import numpy as np

def scale(X):
    new = X - np.mean(X, axis=0)
    return new / np.std(new, axis=0)
Here is an example:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
print(scale(X))
Output:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
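As a quick sanity check, the scaled columns should have mean 0 and standard deviation 1. The sketch below repeats the computation of scale inline so that it runs on its own. Note that np.std computes the population standard deviation by default (ddof=0).

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Same computation as scale(X), written inline so the snippet is self-contained.
centered = X - np.mean(X, axis=0)
scaled = centered / np.std(centered, axis=0)

# Compare with a tolerance, since floating-point means are rarely exactly 0.
print(np.allclose(scaled.mean(axis=0), 0.0))  # True
print(np.allclose(scaled.std(axis=0), 1.0))   # True
```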
In practice, you want to apply the mean and standard deviation computed on the training set to the test set as well. Scikit-learn has the StandardScaler class for that:
import numpy as np

class StandardScaler(object):
    def __init__(self):
        pass

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.scale_ = np.std(X - self.mean_, axis=0)
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

    def fit_transform(self, X):
        return self.fit(X).transform(X)
The fit method saves the mean and standard deviation for later use. When transform is applied to the same data used to fit, it does essentially the same thing as the scale function above. fit_transform is just a shortcut method.
Here is the typical use case. You fit on the training data and transform both the training set and the test set:
scaler = StandardScaler().fit(X)
print(scaler.transform(X))
print(scaler.transform(np.array([[-1., 1., 0.]])))
Output:
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
[[-2.44948974 1.22474487 -0.26726124]]
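One edge case the simplified class does not handle is a feature that never changes: its standard deviation is 0, so transform divides by zero. The sketch below shows one possible guard. The class name SafeStandardScaler and the choice to divide by 1 are my own assumptions for illustration, not something taken from the code above.

```python
import numpy as np

class SafeStandardScaler(object):
    """Hypothetical variant of the StandardScaler above (the name is made up)."""

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        scale = np.std(X, axis=0)
        # A constant feature has std 0; divide by 1 instead so it maps to 0.
        scale[scale == 0.0] = 1.0
        self.scale_ = scale
        return self

    def transform(self, X):
        return (X - self.mean_) / self.scale_

X_const = np.array([[1., 5.],
                    [2., 5.],
                    [3., 5.]])   # the second column never changes

out = SafeStandardScaler().fit(X_const).transform(X_const)
print(out[:, 1].tolist())  # [0.0, 0.0, 0.0]
```

Without the guard, the second column would come out as nan values together with a NumPy runtime warning.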
MinMaxScaler
Another scaling method rescales each feature so that it lies within a given range. Scikit-learn calls this MinMaxScaler:
import numpy as np

class MinMaxScaler(object):
    def __init__(self, feature_range=(0, 1)):
        self.low_, self.high_ = feature_range

    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.max_ = X.max(axis=0)
        return self

    def transform(self, X):
        X_std = (X - self.min_) / (self.max_ - self.min_)
        return X_std * (self.high_ - self.low_) + self.low_

    def fit_transform(self, X):
        return self.fit(X).transform(X)
The first line in transform is the most important. It scales the data to the range [0, 1] regardless of feature_range. X likely consists of multiple features, but think about just one feature. If a value is the minimum of the feature, the numerator becomes 0, so the whole expression is 0. If a value is the maximum, the numerator equals the denominator, so X_std is 1. The next line rescales X_std according to the feature range.
Here is an example:
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)
print(scaler.transform(X))
print(scaler.transform(np.array([-3., -1., 4.])))
Output:
[[ 0.5 0. 1. ]
[ 1. 0.5 0.33333333]
[ 0. 1. 0. ]]
[-1.5 0. 1.66666667]
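With the default range the second line of transform changes nothing, so here is the same two-step arithmetic on a single feature with a non-default range. The range (-1, 1) is just an illustrative choice.

```python
import numpy as np

x = np.array([1., 2., 0.])     # first feature column of X above
low, high = -1.0, 1.0          # illustrative feature_range

x_std = (x - x.min()) / (x.max() - x.min())   # step 1: always in [0, 1]
x_scaled = x_std * (high - low) + low         # step 2: stretch and shift

print(x_std.tolist())     # [0.5, 1.0, 0.0]
print(x_scaled.tolist())  # [0.0, 1.0, -1.0]
```

The minimum of the feature lands on low and the maximum on high, whatever the range is.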
accuracy_score
Every scikit-learn classifier has its own score method. But scikit-learn's metrics module also provides a stand-alone function, accuracy_score. It is actually a very simple function:
def accuracy_score(y_true, y_pred, normalize=True):
    correct = sum(y_true == y_pred)
    return correct / len(y_true) if normalize else correct
The first line computes the number of correctly classified samples. The second line divides the value by the total number of samples if normalize is set to True.
y_pred = np.array([0, 2, 1, 3])
y_true = np.array([0, 1, 2, 3])
print(accuracy_score(y_true, y_pred))
print(accuracy_score(y_true, y_pred, normalize=False))
Output:
0.5
2
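The function works because comparing two NumPy arrays gives a boolean array, and booleans count as 0 and 1 in arithmetic. A minimal sketch of that intermediate step, using NumPy's own sum and mean methods instead of the built-in sum:

```python
import numpy as np

y_pred = np.array([0, 2, 1, 3])
y_true = np.array([0, 1, 2, 3])

matches = (y_true == y_pred)   # element-wise comparison, one bool per sample
print(matches.tolist())        # [True, False, False, True]

# True counts as 1 and False as 0, so the sum is the number of hits
# and the mean is the fraction of hits.
print(int(matches.sum()))      # 2
print(float(matches.mean()))   # 0.5
```

This is also why the function needs NumPy arrays: comparing two plain Python lists with == returns a single bool instead of one per element.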
confusion_matrix
Another useful function from the metrics module is confusion_matrix. It's also easy to implement:
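import numpy as np

def confusion_matrix(y_true, y_pred, labels=None):
    # Compare against None so that an array of labels is also accepted
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    # Map each class label to its row/column position
    indexes = {v: i for i, v in enumerate(labels)}
    matrix = np.zeros((len(indexes), len(indexes)), dtype=int)
    # Rows are true classes, columns are predicted classes
    for t, p in zip(y_true, y_pred):
        matrix[indexes[t], indexes[p]] += 1
    return matrix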
First we need to know what classes are available. If labels is given, we can just use it. Otherwise, the function converts each array into a set, takes the union, and sorts it. Next, it maps each class to its index. Then it initializes an \(n \times n\) matrix where \(n\) is the number of classes. Finally, for each pair of y_true and y_pred, it increments the value of the corresponding cell of the matrix:
y_true = np.array([2, 0, 2, 2, 0, 1])
y_pred = np.array([0, 0, 2, 2, 0, 2])
print(confusion_matrix(y_true, y_pred))
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
print(confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"]))
Output:
[[2 0 0]
[0 0 1]
[1 0 2]]
[[2 0 0]
[0 0 1]
[1 0 2]]
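The confusion matrix also contains the accuracy: correct predictions sit on the diagonal. The sketch below reads a few summary numbers off the matrix printed for the integer example, with the matrix typed in by hand so the snippet runs on its own.

```python
import numpy as np

# The matrix printed above for the integer example.
matrix = np.array([[2, 0, 0],
                   [0, 0, 1],
                   [1, 0, 2]])

correct = int(np.trace(matrix))   # sum of the diagonal: 2 + 0 + 2
total = int(matrix.sum())         # every sample lands in exactly one cell

print(correct, total)             # 4 6
print(correct / total)            # the accuracy for this example

# Rows are true classes, so the row sums count the samples per true class.
print(matrix.sum(axis=1).tolist())  # [2, 1, 3]
```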
Conclusion
In practice, you will most likely use scikit-learn rather than your own code. Nevertheless, coding these tools from scratch is good practice. You will learn more about Python and NumPy, and you will also notice small details of scikit-learn.
If you have questions or comments, tweet @kenzotakahashi and I'll be happy to help.