For Part I of this riveting series, click here.
In Part I, we worked through, by hand, every calculation in one forward and backward pass through a simple single-layer neural network.
To start Part II, we’re going to do the same for the second pass. My hope is that after doing this a second time, trends will emerge and we will be able to understand how the network’s weights end up where they do by the 100,000th pass.
Since the first pass was called iteration #0, we begin with iteration #1:
---------ITERATION #1-------------
inputs:
[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]
weights:
[[ 0.67423821]
 [-0.33473064]
 [-0.40469539]]
dot product results:
[[-0.40469539]
 [-0.73942603]
 [ 0.26954282]
 [-0.06518782]]
l1 probability predictions (sigmoid):
[[ 0.40018475]
 [ 0.32312967]
 [ 0.56698066]
 [ 0.48370881]]
Compared to the first pass, the first weight is larger and the other two weights have shrunk in magnitude (both are negative and have moved closer to zero). We’ll see if these updated weights cause less error in our predictions (Spoiler: They will).
Although you should be able to do dot products in your sleep at this point since you followed along so closely with Part I of the series, I’ll walk us through the dot product again:
(0 * .674) + (0 * -.335) + (1 * -.404) = -.4047
(0 * .674) + (1 * -.335) + (1 * -.404) = -.7394
(1 * .674) + (0 * -.335) + (1 * -.404) = .2695
(1 * .674) + (1 * -.335) + (1 * -.404) = -.0652
Great. Now we run the results through the sigmoid function to generate probability predictions (shown as “l1 probability predictions (sigmoid)” above).
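If you’d rather let NumPy do the arithmetic, here is a minimal sketch of the forward pass just described. The inputs and weights are copied from the iteration #1 printout; the variable names (`X`, `weights`, `dot_results`, `l1`) and the `sigmoid` helper are my own choices for illustration and may not match the notebook exactly.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Inputs and iteration #1 weights, copied from the printout above.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
weights = np.array([[0.67423821],
                    [-0.33473064],
                    [-0.40469539]])

# One dot product per input row, then the sigmoid of each result.
dot_results = X.dot(weights)
l1 = sigmoid(dot_results)

print(dot_results.ravel())
print(l1.ravel())
```

The two printed rows should line up with the “dot product results” and “l1 probability predictions (sigmoid)” values shown for this iteration.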
For nostalgia’s sake, here were our predictions from the previous pass:
OLD l1 probability predictions (sigmoid): [[ 0.36672394] [ 0.27408027] [ 0.46173529] [ 0.35868411]]
If you compare the old predictions with the new ones, you’ll notice that they simply all went up, meaning the model thinks they are more likely to be ones than before.
At first glance, the error hasn’t improved much since the last run.
OLD l1_error: [[-0.36672394] [-0.27408027] [ 0.53826471] [ 0.64131589]]
NEW l1_error: [[-0.40018475] [-0.32312967] [ 0.43301934] [ 0.51629119]]
Summing the absolute values of the four errors shows that the total did decrease, from 1.82 to 1.67. So there was improvement!
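Here is a small sketch of that error bookkeeping. It assumes the targets are `[0, 0, 1, 1]` and that the error is the target minus the prediction; both are inferred from the error values printed above, so treat them as assumptions. The variable names are mine.

```python
import numpy as np

# Assumed targets, inferred from the printed errors (error = target - prediction).
y = np.array([[0], [0], [1], [1]])

# Predictions from the previous pass and from this pass, copied from above.
old_l1 = np.array([[0.36672394], [0.27408027], [0.46173529], [0.35868411]])
new_l1 = np.array([[0.40018475], [0.32312967], [0.56698066], [0.48370881]])

old_error = y - old_l1
new_error = y - new_l1

# Sum of absolute errors: one number summarizing how far off we are.
old_total = np.abs(old_error).sum()
new_total = np.abs(new_error).sum()

print(round(float(old_total), 2), round(float(new_total), 2))
```

The predictions for the two zero-labeled rows actually got slightly worse, but the gains on the two one-labeled rows more than make up for it.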
Unlike in Part I, I’m not going to dive into the details of how taking the derivative of the sigmoid at the spot of the probability prediction, multiplying the result by the errors, and then taking the dot product of the result with the inputs leads to updating the weights in a way that will reduce prediction error… but instead just skip to the updated weights:
pre-update weights:
[[ 0.67423821]
 [-0.33473064]
 [-0.40469539]]
post-update weights:
[[ 0.90948611]
 [-0.27646878]
 [-0.33618051]]
As we should come to expect, the weight on the first input got larger and the other two got smaller in magnitude, moving closer to zero.
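For anyone who wants to check the skipped update, here is a rough sketch of that one step in NumPy. It assumes targets of `[0, 0, 1, 1]` (inferred from the errors above) and no learning rate, which is what the printed numbers are consistent with. The names `l1_delta` and `weights_updated` are mine.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])  # assumed targets, inferred from the errors
weights = np.array([[0.67423821],
                    [-0.33473064],
                    [-0.40469539]])  # pre-update weights from above

# Forward pass and error.
l1 = sigmoid(X.dot(weights))
l1_error = y - l1

# Slope of the sigmoid at each prediction is l1 * (1 - l1);
# scale each error by that slope.
l1_delta = l1_error * l1 * (1 - l1)

# Dot product with the transposed inputs gives one adjustment per weight.
weights_updated = weights + X.T.dot(l1_delta)

print(weights_updated.ravel())
```

The printed values should agree with the post-update weights shown above to several decimal places.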
Let’s take a look at how the sum of the errors decreases over the first 100 iterations:
Now the first 1000 iterations:
Seems like we hit an “elbow point” around the 100th iteration. Let’s see how this same graph looks over 10,000 iterations:
Even more dramatic. So much of the effort (computational resources for those who don’t like to personify their processors) goes towards decreasing the final error by tiny, tiny amounts.
Last graph, let’s see where we end up after 100,000 iterations:
The value of the error after 10,000 iterations is 0.03182. After 100,000 it is 0.00995, so the error is certainly still decreasing. Still, the graph above makes it easy to argue that the additional training loops are not worth it, since we get most of the way there in just a few hundred iterations.
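If you want to regenerate an error curve like the ones discussed above, here is a hedged sketch of a training loop that records the sum of absolute errors on every pass. It starts from the iteration #1 weights printed earlier rather than from the original random initialization, so its iteration counter is offset from the article’s by one pass. The targets `[0, 0, 1, 1]`, the names `n_iterations` and `error_history`, and the choice of 1,000 passes are all my assumptions.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])  # assumed targets, inferred from the errors
weights = np.array([[0.67423821],
                    [-0.33473064],
                    [-0.40469539]])  # iteration #1 weights from above

n_iterations = 1000  # hypothetical; raise this to explore longer runs
error_history = []

for _ in range(n_iterations):
    l1 = sigmoid(X.dot(weights))
    l1_error = y - l1
    # Record the total absolute error before updating the weights.
    error_history.append(float(np.abs(l1_error).sum()))
    # Same update as in the single step: error times sigmoid slope, dotted with inputs.
    weights += X.T.dot(l1_error * l1 * (1 - l1))

# Error at the start, after 100 passes, and at the end of the run.
print(error_history[0], error_history[99], error_history[-1])
```

Passing `error_history` to any plotting library should give a curve with the same steep-then-flat shape.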
Where did the weights end up? Great question! Let’s have a peek:
weights (after 100,000 iterations): [[ 12.0087] [-0.2044] [-5.8002]]
Not surprisingly, the first weight has grown to be the largest. What does, in fact, surprise me is the relatively large weight on the third input (large weights, even if negative, still have an impact on the predictions).
One thing to note is that the inputs corresponding to the third weight are all ones, making it effectively like adding a bias unit to the model. Viewed in that way, it is less surprising to see the large-ish third weight.
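To make the bias-unit point concrete, here is a small sketch showing that a column of ones multiplied by a weight gives the same result as dropping that column and adding the weight as a standalone bias term. The weights are the final ones printed above; the names `with_ones_column`, `with_bias`, and `bias` are mine.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
weights = np.array([[12.0087],
                    [-0.2044],
                    [-5.8002]])  # weights after 100,000 iterations

# Version 1: treat the third column of ones as a regular input.
with_ones_column = X.dot(weights)

# Version 2: keep only the first two inputs and add the third weight as a bias.
bias = weights[2, 0]
with_bias = X[:, :2].dot(weights[:2]) + bias

print(np.allclose(with_ones_column, with_bias))
```

Seen this way, the third weight shifts every row’s dot product down by about 5.8 before the sigmoid is applied.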
One more time, let’s run through the predictions produced from these weights. We start with the dot product of the weights and the input:
dot product results:
(0 * 12.00) + (0 * -.20) + (1 * -5.8) = -5.8
(0 * 12.00) + (1 * -.20) + (1 * -5.8) = -6.0
(1 * 12.00) + (0 * -.20) + (1 * -5.8) = 6.2
(1 * 12.00) + (1 * -.20) + (1 * -5.8) = 6.0
Those results make an overwhelming amount of sense. Let’s apply the sigmoid function:
l1 probability prediction (sigmoid):
1/(1+e^-(-5.8)) = 0.003
1/(1+e^-(-6.0)) = 0.002
1/(1+e^-(6.2)) = 0.998
1/(1+e^-(6.0)) = 0.998
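As a sanity check, here is a sketch that reruns this last forward pass with the four-decimal weights printed above and totals the absolute errors. It again assumes targets of `[0, 0, 1, 1]`; the names `final_weights` and `total_error` are mine.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])  # assumed targets, inferred from the errors
final_weights = np.array([[12.0087],
                          [-0.2044],
                          [-5.8002]])  # weights after 100,000 iterations

# Forward pass with the trained weights.
l1 = sigmoid(X.dot(final_weights))

# Sum of absolute errors, the same quantity tracked in the graphs.
total_error = float(np.abs(y - l1).sum())

print(np.round(l1.ravel(), 3))
print(round(total_error, 5))
```

The total should come out very close to the 0.00995 quoted earlier for 100,000 iterations.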
Hopefully that makes it a little more obvious why the error is so low. Only took 100,000 tries🙂
Jupyter notebook for this article on GitHub.
Stay tuned next time when we add another layer and dive into the details of a more legit backprop example!