So far we've covered using neural networks to perform linear regression. What if we want to perform classification using a single-layer network? In this post, I will cover two methods: the perceptron algorithm and using a sigmoid activation function to generate a likelihood. I will not cover the delta rule because it is a special case of the more general backpropagation algorithm, which will be covered in detail in Part 4.
Theory
Single-Layer Perceptron
Perhaps the simplest neural network we can define for binary classification is the single-layer perceptron. Given an input, the output neuron fires (produces an output of 1) only if the data point belongs to the target class. Otherwise, it does not fire (it produces an output of -1). The network looks something like this:
In this network, we are computing the dot product of an example with our weight vector. Points with positive projections will be given a label of 1 and points with negative projections will be given a label of -1. Consequently, our decision boundary will be perpendicular to our weight vector. Why? Consider a 2-dimensional decision problem. The decision boundary is the line where it is equally probable that a point on that line belongs to either class, i.e. $ h(\mathbf{x}_i, \mathbf{w}) = \mbox{sign}(\mathbf{w}^T\mathbf{x}_i) = 0 $, or $ \mathbf{w}^T\mathbf{x}_i = 0 $. Then we have:
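In components, with $ w_1, w_2, w_3 $ denoting the entries of $ \mathbf{w} $,

$$ w_1 x_i^{(1)} + w_2 x_i^{(2)} + w_3 x_i^{(3)} = 0 \quad\Longrightarrow\quad x_i^{(3)} = -\frac{w_2}{w_3}\, x_i^{(2)} - \frac{w_1}{w_3}. $$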
In this case, $ x_i^{(1)} $ is our bias value and is always equal to 1, $ x_i^{(2)} $ is "$ x $" in the Cartesian plane, and $ x_i^{(3)} $ is "$ y $." The slope of our weight vector in the Cartesian plane is $ \frac{w_3}{w_2} $ (the "$ y $" component of $ \mathbf{w} $ is $ w_3 $, and the "$ x $" component is $ w_2 $), while the slope of the decision boundary is $ -\frac{w_2}{w_3} $, making the two perpendicular. Graphically, this looks something like this:
The problem we now face is that the step function is not continuously differentiable, so we cannot use standard gradient descent to learn the weights. Instead, we will use the appropriately-named perceptron algorithm. This is an online method that updates the weights defining the linear boundary whenever the current boundary misclassifies a training point. The algorithm is as follows:
- Initialize the weight vector $ \mathbf{w} $ to all zeros.
- Repeat the following:
- For each training example $ \mathbf{x}_i $:
- If $ h(\mathbf{x}_i, \mathbf{w}) \neq y_i $, then update the weights with $ \mathbf{w}' = \mathbf{w} + \eta y_i\mathbf{x}_i $, where $ \eta $ is the step size.
- If the stopping condition $ \frac{1}{N} \sum_{j=0}^M |w_j' - w_j| < \delta $ is reached, then accept $ \mathbf{w} $ as the final weight vector ($ M $ in this case is the number of features in the dataset).
If our problem is linearly separable, the perceptron algorithm is guaranteed to converge. Therefore, at the algorithm's termination, we will end up with a linear decision boundary defined by $ \mathbf{w} $. However, this decision boundary is not guaranteed to be a maximum margin hyperplane as in the case of SVMs.
Finally, if we want to predict the label $ \hat{y}_i $ of a test point $ \mathbf{x}_i $, we use $ \hat{y}_i = \mbox{sign}(\mathbf{w}^T\mathbf{x}_i) $.
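For concreteness, here is a minimal NumPy sketch of this prediction rule (the function name and `X_test` are illustrative; `X_test` is assumed to already include the leading bias column of ones used throughout this post):

```python
import numpy as np

def perceptron_predict(X_test, w):
    # Illustrative helper: label each row of X_test with the sign of its
    # projection onto the learned weight vector w (+1 or -1).
    return np.sign(X_test.dot(w))
```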
While it is not strictly necessary to define a neural network to use the perceptron algorithm, this is a good first step towards single-layer classification.
Classification with a sigmoid (softmax) activation function
Instead of an all-or-nothing classifier (like the sign function), it is helpful to come up with some way to measure the probability of assignment, that is $ P( Y = y_i\ |\ X = \mathbf{x}_i, \mathbf{w} ) $. If we can calculate this likelihood, we can use it as a confidence measure for our predictions.
Instead of using a sign activation function, we can use a sigmoid (the two-class special case of the softmax commonly used in the neural net literature) to output a probability:
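For labels $ y_i \in \{-1, +1\} $, this likelihood takes the form (the same form assumed by the gradient code in the Implementation section):

$$ P(Y = y_i\ |\ X = \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}}, \qquad\text{so that}\qquad P(Y = 1\ |\ \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}_i}}. $$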
But how do we assign a class label when given only a probability? We can simply threshold the probability, so that any $ P(Y = 1\ |\ \mathbf{x}_i, \mathbf{w}) \geq 0.5 $ is assigned a class label of 1, and any probability less than 0.5 is given a class label of -1. Our simple network now looks something like the following:
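Note that because the sigmoid is monotonic and equals 0.5 exactly when its argument is zero, thresholding the probability at 0.5 reproduces the sign rule from the perceptron section (for a fixed $ \mathbf{w} $):

$$ P(Y = 1\ |\ \mathbf{x}_i, \mathbf{w}) \geq 0.5 \iff \mathbf{w}^T\mathbf{x}_i \geq 0. $$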
Luckily for us, this network function is identical to the likelihood used by logistic regression. Because the sigmoid is differentiable, we can use standard gradient ascent to train the weights instead of the perceptron algorithm. For a derivation of the gradient for logistic regression, see the Appendix.
Implementation
First, we generate two random clusters of 2D data, one per class, and prepend a column of ones to serve as the bias feature:

```python
import numpy as np

# Generate two random clusters of 2D data
N_c = 100
A = 0.3*np.random.randn(N_c, 2) + [1, 1]    # class -1, centered at (1, 1)
B = 0.3*np.random.randn(N_c, 2) + [3, 3]    # class +1, centered at (3, 3)

# Prepend a bias column of ones and stack the two clusters
X = np.hstack((np.ones(2*N_c).reshape(2*N_c, 1), np.vstack((A, B))))
Y = np.vstack(((-1*np.ones(N_c)).reshape(N_c, 1), np.ones(N_c).reshape(N_c, 1)))
N = 2*N_c
```
Next, we run the perceptron algorithm to learn the weights:
```python
# Run the perceptron algorithm
delta = 1E-7
eta = 1E-2
max_iter = 500
w = np.zeros(3)
w_old = np.zeros(3)
for t in range(0, max_iter):
    # One pass over the training set
    for i in range(0, N):
        x_i = X[i, :]
        y_i = Y[i, 0]
        h = np.sign(np.dot(w, x_i))
        if h != y_i:
            # Misclassified: nudge the boundary towards the point
            w = w + eta*y_i*x_i
    # Stopping condition: average absolute change in the weights
    if np.sum(np.abs(w - w_old))/float(N) < delta:
        print("Converged in", t, "steps.")
        break
    w_old = w.copy()
    if t == max_iter - 1:
        print("Warning, did not converge.")
print("Weights found:", w)
```
Inside the inner loop, we test whether $ h(\mathbf{x}_i, \mathbf{w}) = y_i $, and if it does not match we update the weights accordingly. At the end of each pass through the data, we test the stopping condition.
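As a quick sanity check (a sketch reusing the `X`, `Y`, and learned `w` from the code above), we can confirm that the final boundary separates the training data:

```python
# Fraction of training points classified correctly by the learned weights
preds = np.sign(X.dot(w))
print("Training accuracy:", np.mean(preds == Y[:, 0]))
```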
In one particular test run, you can see that the algorithm converged in 6 steps:
Next, we train the sigmoid-activation network using gradient ascent. Here we introduce an alternative stopping condition. The heuristic is this: if the norm of the gradient is small, we must be nearing an optimum of the objective. Thus, we set a threshold for the norm of the gradient, and we stop as soon as it falls below that threshold. The interesting part of the gradient code is as follows:
```python
# Run gradient ascent on the log-likelihood
eta = 1E-3
max_iter = 1000
w = np.zeros(3)
grad_thresh = 5
for t in range(0, max_iter):
    # Accumulate the gradient over the whole training set
    grad_t = np.zeros(3)
    for i in range(0, N):
        x_i = X[i, :]
        y_i = Y[i, 0]
        grad_t += y_i*x_i*np.exp(-y_i*np.dot(w, x_i))/(1 + np.exp(-y_i*np.dot(w, x_i)))
    # Step in the direction of the (averaged) gradient
    w = w + 1/float(N)*eta*grad_t
    # Stopping condition: norm of the gradient
    grad_norm = np.linalg.norm(grad_t)
    print(grad_norm)  # diagnostic: watch the gradient norm shrink
    if grad_norm < grad_thresh:
        print("Converged in", t + 1, "steps.")
        break
print("Weights found:", w)
```
In the inner loop, we accumulate the gradient according to the derivation in the Appendix. After each update, we check whether we've reached the stopping condition. This is a simple heuristic that may not work in all cases, but it works well enough for a simple problem like this. Determining the right step size and stopping criteria for gradient-based optimization is an active area of research.
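As a usage sketch (again reusing `X`, `Y`, and the learned `w` from above), the network's probability outputs and the corresponding hard labels can be computed directly:

```python
# Predicted probability that each training point belongs to class +1
p = 1.0/(1.0 + np.exp(-X.dot(w)))
# Thresholding at 0.5 is equivalent to taking the sign of w^T x
labels = np.where(p >= 0.5, 1, -1)
print("Training accuracy:", np.mean(labels == Y[:, 0]))
```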
Code Download
Python file that learns weights using the perceptron algorithm.
Python file that learns weights using gradient ascent.
Appendix
Derivation of the gradient for logistic regression
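A sketch of the derivation, assuming labels $ y_i \in \{-1, +1\} $ and the sigmoid likelihood $ P(Y = y_i\ |\ \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}} $ used above. The log-likelihood of the training set is

$$ \ell(\mathbf{w}) = \sum_{i=1}^{N} \log P(Y = y_i\ |\ \mathbf{x}_i, \mathbf{w}) = -\sum_{i=1}^{N} \log\left(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right), $$

and differentiating with respect to $ \mathbf{w} $ gives

$$ \nabla_{\mathbf{w}}\, \ell(\mathbf{w}) = \sum_{i=1}^{N} \frac{y_i \mathbf{x}_i\, e^{-y_i \mathbf{w}^T \mathbf{x}_i}}{1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}}, $$

which is exactly the quantity accumulated in `grad_t` in the gradient ascent code. Ascending this gradient increases the likelihood of the training data at each step.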