In the last section, we went over how to use a linear neural network to perform classification. We covered using both the perceptron algorithm and gradient descent with a sigmoid activation function to learn the placement of the decision boundary in our feature space. However, we only covered binary classification. What if we instead want to classify a point as belonging to one of $K$ classes?
Theory
Multiclass classification using a linear neural network is a fairly simple extension of the binary classification setup. You may think that instead of outputting 0/1 from our second-layer node, we could simply output 0, 1, ..., $K-1$. However, this does not work. Our labels have no inherent ordering: halfway between $\hat{y}=0$ and $\hat{y}=2$ is not necessarily $\hat{y}=1$. The labels are in fact categorical, and we use $k \in \{0, 1, \ldots, K-1\}$ purely out of computational convenience.
Consider instead representing a label using a binary vector of length $K$, where a 1 in position $k$ (and 0s everywhere else) corresponds to the label $k$; this is often called a one-hot encoding. We can then extend our linear network (with a sigmoid activation at the output) to learn how to output this vector. The network would look something like the figure below.
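To make the encoding concrete, here is a minimal sketch (assuming NumPy; the example labels are made up for illustration and are not part of the training code later in this post):

import numpy as np

K = 3
labels = np.array([0, 2, 1, 1])   # integer class labels
one_hot = np.eye(K)[labels]       # row i is the length-K binary vector for labels[i]
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]]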
Our training routine is exactly the same as in Part 2, except that the gradient of the multinomial logistic regression objective is slightly different: $$ \nabla_{\mathbf{w}_k}\ell(\mathbf{w}) = \frac{1}{N} \sum_{i\,:\,y_i = k} \mathbf{x}_i\left(1 - P(Y = y_i \mid X = \mathbf{x}_i, \mathbf{w})\right) $$ where the sum for class $k$ runs over the training points labeled $k$. See the Appendix for a full derivation.
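As a quick numerical sanity check, the sketch below (hypothetical weights and data point, assuming NumPy) evaluates the softmax likelihood of one point's true class and the term that point contributes to its class's row of the gradient:

import numpy as np

# Hypothetical weights: one row per class; columns are (bias, feature 1, feature 2)
w = np.array([[ 0.1, -0.2,  0.3],
              [ 0.0,  0.4, -0.1],
              [ 0.2,  0.1,  0.1]])
x_i = np.array([1.0, 2.0, 0.5])  # leading 1.0 is the bias feature
y_i = 1                          # true class of this point

scores = w.dot(x_i)                                 # one score per class
lik = np.exp(scores)[y_i] / np.sum(np.exp(scores))  # P(Y = y_i | X = x_i, w)
grad_term = x_i * (1 - lik)                         # contribution to row y_i of the gradient
print(lik, grad_term)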
Implementation
The implementation of multiclass linear classification doesn't change much from the binary case, except for the gradient and how we label our data points. For this toy example, we'll be generating 3 clusters of two-dimensional data. To make things more interesting, we won't restrict them to be linearly separable.
import numpy as np

# Generate three random clusters of 2D data
N_c = 200
A = 0.6*np.random.randn(N_c, 2) + [1, 1]
B = 0.6*np.random.randn(N_c, 2) + [3, 3]
C = 0.6*np.random.randn(N_c, 2) + [3, 0]

# Design matrix: a column of 1s for the bias, then the stacked cluster points
X = np.hstack((np.ones(3*N_c).reshape(3*N_c, 1), np.vstack((A, B, C))))
# Integer labels: 0 for A, 1 for B, 2 for C
Y = np.vstack((np.zeros(N_c).reshape(N_c, 1),
               np.ones(N_c).reshape(N_c, 1),
               2*np.ones(N_c).reshape(N_c, 1)))
K = 3
N = K*N_c
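If you want to eyeball the clusters before training, a quick scatter plot works. This is a side sketch that assumes matplotlib is installed; it is not part of the original training code:

import matplotlib.pyplot as plt

plt.scatter(A[:, 0], A[:, 1], label="class 0")
plt.scatter(B[:, 0], B[:, 1], label="class 1")
plt.scatter(C[:, 0], C[:, 1], label="class 2")
plt.legend()
plt.show()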
Next we run gradient descent using the multinomial logistic regression gradient:
# Run gradient descent
eta = 1E-2
max_iter = 1000
w = np.zeros((3, 3))
grad_thresh = 5
for t in range(0, max_iter):
    grad_t = np.zeros((3, 3))
    for i in range(0, N):
        x_i = X[i, :]
        y_i = Y[i]
        # Softmax likelihood of this point's true class
        exp_vals = np.exp(w.dot(x_i))
        lik = exp_vals[int(y_i)]/np.sum(exp_vals)
        # Accumulate the gradient for the true class's row of weights
        grad_t[int(y_i), :] += x_i*(1-lik)
    w = w + 1/float(N)*eta*grad_t
    grad_norm = np.linalg.norm(grad_t)
    if grad_norm < grad_thresh:
        print("Converged in", t+1, "steps.")
        break
    if t == max_iter-1:
        print("Warning, did not converge.")
There are a couple of things to note here. First, our weight vector $\mathbf{w}$ is now actually a $3 \times 3$ matrix. We have 3 classes, so we need 3 sets of weights. Recall that each set of weights has a bias weight plus one weight per feature, giving it a length of $M+1$ (here $M = 2$). Each row of $\mathbf{w}$ therefore holds the weights for one class $k \in \{0, 1, 2\}$. You can see this in the initialization w = np.zeros((3, 3)).
The line exp_vals = np.exp(w.dot(x_i)) calculates the unnormalized likelihoods using NumPy's dot function, which computes the dot product between the input $\mathbf{x}_i$ and each row of weights. The next line normalizes these values to get the likelihood that the data point $\mathbf{x}_i$ belongs to its labeled class. Finally, grad_t[int(y_i), :] += x_i*(1-lik) uses this likelihood to accumulate the gradient. This code follows the math exactly, so there shouldn't be any surprises here.
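Once training has converged, classifying a new point just means scoring it against every row of $\mathbf{w}$ and taking the largest score. A small sketch (the predict helper below is illustrative, not part of the original code):

def predict(w, point):
    # Prepend the bias feature, score the point against each class's weights,
    # and return the index of the highest-scoring class.
    x = np.hstack(([1.0], point))
    return int(np.argmax(w.dot(x)))

print(predict(w, [1.0, 1.0]))  # most likely class 0, given the cluster centered at (1, 1)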
Running for around 1,000 epochs with the given descent parameters produces classification regions like those shown in the figure below.
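To reproduce a similar plot yourself, you can evaluate the learned weights over a grid. This sketch assumes matplotlib and the variables w, X, and Y from the code above:

import matplotlib.pyplot as plt

# Evaluate the classifier at every point of a grid covering the data
xs, ys = np.meshgrid(np.linspace(-1, 5, 300), np.linspace(-2, 5, 300))
grid = np.column_stack((np.ones(xs.size), xs.ravel(), ys.ravel()))
regions = np.argmax(grid.dot(w.T), axis=1).reshape(xs.shape)

plt.contourf(xs, ys, regions, alpha=0.3)          # shaded decision regions
plt.scatter(X[:, 1], X[:, 2], c=Y.ravel(), s=10)  # original data, colored by label
plt.show()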
Code Download
2D multiclass classification Python code.
Appendix
Deriving the gradient for multinomial logistic regression
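A brief sketch of the standard derivation, reconstructed here using the same notation as above. The average log-likelihood of the data under the softmax model is $$ \ell(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \log P(Y = y_i \mid X = \mathbf{x}_i, \mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{w}_{y_i}\cdot\mathbf{x}_i - \log\sum_{j} e^{\mathbf{w}_j\cdot\mathbf{x}_i}\right). $$ Differentiating with respect to $\mathbf{w}_k$, the first term contributes $\mathbf{x}_i$ only when $y_i = k$, while the second contributes $\mathbf{x}_i\, e^{\mathbf{w}_k\cdot\mathbf{x}_i} / \sum_j e^{\mathbf{w}_j\cdot\mathbf{x}_i} = \mathbf{x}_i\, P(Y = k \mid X = \mathbf{x}_i, \mathbf{w})$ for every point, giving $$ \nabla_{\mathbf{w}_k}\ell(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\left(\mathbb{1}[y_i = k] - P(Y = k \mid X = \mathbf{x}_i, \mathbf{w})\right). $$ Keeping only the terms with $y_i = k$, as the update in the Implementation section does, yields the expression quoted in the Theory section: $\frac{1}{N}\sum_{i\,:\,y_i = k} \mathbf{x}_i\left(1 - P(Y = y_i \mid X = \mathbf{x}_i, \mathbf{w})\right)$.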