Intro

Machine learning is all the rage, and for good reason. Some problems, such as recognizing faces or handwriting, are difficult to create algorithms for because we don’t really understand how we do it ourselves. If we don’t understand how we do it, how can we tell a computer to do it? That’s where machine learning comes in.

Humans learn by taking an input, coming to some conclusion about that input, then seeing if our conclusion is correct. For example, a child might see a dog and guess that it’s either a cat or a dog. The parent may then correct the child if they guessed cat, or praise them for correctly identifying the dog. The parent doesn’t have to explain a procedure for identifying dogs, nor do they have to manipulate the neurons in the child’s brain; the child can figure it out by themselves with proper guidance.

Machine learning is the same way. Rather than telling a computer exactly what to do and how to do it, the programmer creates a highly malleable framework that takes an input and produces an output. Then, the programmer creates an environment for that framework, where inputs are provided, and then the outputs are assessed to see if the framework got the answer right. Using a previous example, we might provide a picture of a dog to the framework, and see if the framework outputs “cat” or “dog”. If the output is incorrect, we modify the framework and try again. Over time, the framework will “figure out” (AKA learn) how to identify cats and dogs.

What separates machine learning from conventional programming is that conventional programs are understandable by humans. A maze-solving algorithm can be seen as depth-first or breadth-first, for example, and we can think of situations where one is better than the other. For machine learning, however, the highly malleable framework is basically a black box. It takes an input, does some computation that makes sense only to the computer, and produces an output. This can make debugging and improving the system quite challenging. Compounding this is that machine learning, from what I’ve seen, is heavily dependent on heuristics and best practices. A lot of what I read says “we don’t really understand why this makes the system work better, but it worked for someone else, and it worked for us.”

If we don’t really understand how machine learning works, then how do we fix or improve systems? It’s like raising a child; you can’t go into their brain and adjust individual neurons, but you can provide a better learning environment for them and hope they relearn properly. Likewise, while the framework is a mystery to us, we can provide better guidance so the black box performs better.

This blog post will discuss how I view and think of machine learning and its many aspects. The focus is not on implementation or mathematics, but rather on concepts and ways of thinking.

With all that said, let’s jump in!

References

For those who want a more rigorous discussion of the topic, here are my primary references:

  1. 3Blue1Brown on YouTube has a fantastic series that explains machine learning for those with no prior experience, with really helpful animations.
  2. Neural Networks and Deep Learning, a free and easy-to-understand book that covers a lot of the mathematics and the justification behind it.

Neurons & Neural Network

What I’ve been calling the “highly malleable framework” is what others call a neural network. It is a model developed to loosely mimic the behavior of neurons in our brains. A single neuron takes in several inputs, performs computation with them, and produces a single output. When many neurons are connected together, the behavior and output of the entire network can become quite complex.

Single neuron
Image from Neural Networks and Deep Learning
Example of a neural network
Image from Neural Networks and Deep Learning

The output of the neuron is a normalized weighted sum of its inputs, plus a bias. Let’s look at an example of a neuron with three inputs to see what that means:

z = w1*x1 + w2*x2 + w3*x3 + b

a = sigmoid(z), where sigmoid(z) = 1/(1+exp(-z))

z is called the weighted input. It is a sum of a bias (b) and the inputs (x1, x2, x3) after they have been scaled by weights (w1, w2, w3).

a is called the activation, and is the output of the neuron. Since there’s no limit to what the weights and biases can be, z could be a huge positive or negative number. This could be a problem, so a function is used to normalize z, reining in its maximum and minimum value. The sigmoid function smoothly transitions from 0 to 1, so the output of a single neuron is limited to that range.
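
To make this concrete, here’s the three-input neuron above as a few lines of Python (the weights, bias, and inputs are made-up numbers, purely for illustration):

    import math

    # Made-up weights, bias, and inputs for a three-input neuron
    w = [0.5, -1.2, 0.8]    # weights w1, w2, w3
    x = [1.0, 0.3, -0.5]    # inputs x1, x2, x3
    b = 0.1                 # bias

    # Weighted input: z = w1*x1 + w2*x2 + w3*x3 + b
    z = sum(wi * xi for wi, xi in zip(w, x)) + b

    # Activation: squash z into the range (0, 1)
    a = 1 / (1 + math.exp(-z))
    print(z, a)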

A neural network is organized into layers. The first layer of neurons is the input layer, where the input to the system is provided. Note that it’s not really a layer of neurons; the “neurons” in that layer just pass along the input to the next layer. Think of it as the network’s port to accept inputs. The last layer of neurons is the output layer, where the system will output its final answer. The layers between the input and output layers are the hidden layers, called such since their existence isn’t obvious to an outside observer.

Neurons within a single layer do not connect to each other. However, a neuron takes as input the outputs of all the neurons in the previous layer, and its output is fed to every neuron in the next layer.

Example of a neural network
Image from Neural Networks and Deep Learning

Let’s do a quick recap. A neural network takes an input and feeds it to a layer of neurons. Each neuron performs its computation, then passes its output to the next layer. Each layer does the same thing, until the output layer provides an answer.
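
Here’s a minimal sketch of that forward pass in Python with NumPy. The layer sizes and random initialization here are arbitrary placeholders, not recommendations:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    rng = np.random.default_rng(0)

    # A toy network: 4 inputs -> 3 hidden neurons -> 2 outputs
    sizes = [4, 3, 2]
    weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

    def feedforward(a):
        # Each layer computes z = W a + b, then squashes with the sigmoid
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
        return a

    output = feedforward(rng.standard_normal((4, 1)))
    print(output)                 # two activations, each between 0 and 1
    print(int(output.argmax()))   # index of the highest output: the network's "answer"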

This raises some questions:

  • What does the input look like? If the input is a picture, then you could provide the RGB information for each pixel. For example, for a 10 x 10 RGB image, the input might be 3 x 10 x 10 = 300 inputs, each input representing the intensity of a color of a pixel. If the input is a 10 x 10 grey-scale image, then you might only have 100 inputs.
  • What does the output look like? The output can be coded in different ways. If the network is supposed to distinguish between a cat and a dog, you could have two output neurons; one neuron has a high value (outputs close to 1) if the input is a cat, and the other neuron has a high value if the input is a dog. Or you could have a single neuron that outputs close to 1 if the input is a cat, or close to 0 if the input is a dog. Generally, the former method is chosen: when the network is supposed to classify an image into 10 categories, it will typically have 10 outputs, and the highest-value output is the system’s answer.
  • When a machine “learns”, what exactly is happening? The equation for weighted input (for three inputs) is z=w1*x1+w2*x2+w3*x3+b. x1, x2 and x3 are determined by the previous layer, but what about w1, w2, w3 and b? When the neural network is first created, these numbers are initialized with a random value. When a neural network “learns”, it tweaks its weights and biases so that the output of the system is more and more accurate over time.

In essence, a neural network takes an input, mashes a bunch of numbers together, and produces an output. The network will then see how good or bad the output is, then adjust how it mashes the numbers together to try and get a better result.

Now that we’ve covered what a neuron and a neural network are, let’s see how a network learns.

Learning

Say you step into a room, and the air is too cold. You could just turn the thermostat to a random point, and see if the new temperature suits your preferences. If it’s good, you leave the thermostat alone; if not, you try again, setting the thermostat to a random temperature.

This approach to finding the right temperature leaves a lot to be desired. Clearly the best way to get the temperature you want is to see if the room is too hot or too cold, then adjust the thermostat in the opposite direction (e.g. if the room is too cold, turn the thermostat so the room gets hotter).

Machine learning takes the same approach. When the neural network is first created, the weights and bias for each neuron are randomly initialized, so they’re probably not very good, just like how a randomly set thermostat probably won’t give you the temperature you want. And just like how you can discern in what direction, and by how much, to move the thermostat to get the temperature you want, there must be a way to know in what direction, and by how much, the weights and biases should be moved.

Before we can know how to get the result we want, we need to quantify how “good” or “bad” the performance of the neural network is; after all, if you can’t say the system is doing poorly, how will you know it needs to be changed? This is where the cost function comes in. It measures the difference between the desired output of the neural network and its actual output. If the actual output is perfectly accurate, the cost is close to zero; if the output is way off, the cost is large. Now we have a more methodical approach: change the weights and biases of the network so the cost function goes down.
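
One common choice (the one Neural Networks and Deep Learning starts with) is the quadratic cost: half the sum of the squared differences between the desired and actual outputs. A quick sketch, with made-up numbers:

    import numpy as np

    def quadratic_cost(desired, actual):
        # Near zero when the output is nearly perfect, large when it's way off
        return 0.5 * np.sum((desired - actual) ** 2)

    desired = np.array([1.0, 0.0])                        # "this picture is a cat"
    print(quadratic_cost(desired, np.array([0.9, 0.1])))  # small: good output
    print(quadratic_cost(desired, np.array([0.2, 0.8])))  # large: bad output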

Example of cost function changing as a function of a single weight in the neural network

Say, for a given neural network, you randomly selected a single weight within a single neuron. Now, for a given input, you changed the weight over a range, and recorded the cost as a function of weight, and got the graph above. Since we have the whole graph, it’s obvious what the weight should be: the weight corresponding to the deepest point. But let’s say you didn’t have a whole graph; in fact, you had a single point. If you could somehow determine the slope at that point, then you could determine what direction you should move to reduce cost. If the randomly selected point has a positive derivative (increasing), then you want to decrease the weight. If the point has a negative derivative (decreasing), then you want to increase the weight. Note that you’re moving in the opposite direction of the derivative. This is the fundamental idea of gradient descent.

The graph above is for a 1-dimensional case; you only have one weight you’re dealing with. In an actual neural network, you have many neurons, each of which has many weights and a bias. Instead of dealing with the 1-dimensional case like in the previous paragraph, you would be dealing with hundreds or thousands of dimensions. The idea is still the same; determine the gradient at your current location (gradient is the multidimensional version of a derivative), and move in the opposite direction.
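
Here’s the one-dimensional version of that idea as a tiny Python loop, using a toy cost whose derivative we can write down directly (both the cost function and the step size are invented for illustration):

    # Toy cost: C(w) = (w - 3)^2, whose minimum is at w = 3
    def dC_dw(w):
        return 2 * (w - 3)

    w = -4.0             # a "randomly initialized" weight
    step_size = 0.1
    for _ in range(50):
        w -= step_size * dC_dw(w)   # move opposite the derivative
    print(w)  # ends up very close to 3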

Now we have a battle plan: provide an input to the network, then use its output to determine the cost. Then, determine the gradient at your current point, and adjust the weights and biases accordingly. The only problem is… how do you determine the gradient at your current point?

Backpropagation

The simplest way to determine the derivative is to find the value at a point, then the value at a point right next to it, then use the slope equation to approximate the derivative. You could do that in the multidimensional case, but it is very computationally inefficient: you would need two cost evaluations for each dimension, and there could be thousands or millions of dimensions. So what do you do?
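
To see why that’s so expensive, here’s what the brute-force approach looks like in Python. Note the loop that evaluates the cost once per dimension, which for a real network means millions of full forward passes for a single gradient (the cost function here is just a stand-in):

    import numpy as np

    def numerical_gradient(cost, params, eps=1e-5):
        grad = np.zeros_like(params)
        for i in range(params.size):       # one extra cost evaluation per dimension!
            bumped = params.copy()
            bumped[i] += eps
            grad[i] = (cost(bumped) - cost(params)) / eps
        return grad

    cost = lambda p: np.sum(p ** 2)        # stand-in for a real network's cost
    print(numerical_gradient(cost, np.array([1.0, -2.0, 0.5])))  # ~[2, -4, 1]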

There is a very cool algorithm for determining the gradient called backpropagation, backprop for short. The equations are shown below:

Equations for backpropagation
Image from Neural Networks and Deep Learning

As I mentioned in the intro, I’m not going to go into the proof or meaning of these equations. I’ll instead say how I think of this group of equations. For those curious, chapter two of the Neural Networks and Deep Learning book does a great job explaining the mathematics and intuitions of these equations.

First, a quick note about notation: the L superscript denotes the value of something in the final (output) layer, while the l superscript denotes the value of something in any other layer. Also, δ is the symbol for error, which for our discussion here is just a very useful quantity.

The first equation shows how to calculate δ^L, the error for the output layer. The second equation shows how to calculate the error for any layer, provided you know the error of the next layer (notice how δ^l depends on δ^(l+1)). Since the first equation tells us how to calculate the error for the final layer, we can use the second equation repeatedly to walk, or propagate, backwards through the network; if you apply it over and over, you’ll eventually have the error for every layer in the network. Why do you care about the error? Because error can be used to calculate the gradient! The third and fourth equations let you convert the error you calculated into partial derivatives of the biases and weights, and the gradient is just a matrix of all these partial derivatives.
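
For the curious, here’s a sketch of how those four equations translate into NumPy for a single training example. It reuses the sigmoid, weights, and biases from the feedforward sketch earlier, and assumes the quadratic cost (so the cost’s derivative with respect to the output is just the output minus the target):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1 - s)

    def backprop(weights, biases, x, y):
        # Forward pass, storing every layer's weighted input z and activation a
        activations, zs = [x], []
        for W, b in zip(weights, biases):
            zs.append(W @ activations[-1] + b)
            activations.append(sigmoid(zs[-1]))

        grad_w = [np.zeros_like(W) for W in weights]
        grad_b = [np.zeros_like(b) for b in biases]

        # Equation 1: error at the output layer (quadratic cost assumed)
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grad_b[-1] = delta                        # Equation 3
        grad_w[-1] = delta @ activations[-2].T    # Equation 4

        # Equation 2: propagate the error backwards, one layer at a time
        for l in range(2, len(weights) + 1):
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grad_b[-l] = delta
            grad_w[-l] = delta @ activations[-l - 1].T
        return grad_w, grad_b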

Backpropagation, in other words, provides a computationally efficient method for computing a very useful value, error. Error, in turn, provides an easy way to compute the gradient. Therefore, thanks to backpropagation, we can efficiently compute the gradient, which means we can quickly determine in what direction the weights and biases should be adjusted, and by how much!

Stochastic Gradient Descent

In my previous metaphor, you were alone in the room, trying to get the temperature right. Now, say you were in a room full of people. You talk to one guy, and he says it’s cold, so please raise the temperature, so you make the room warmer. But there are multiple people in that room. The first guy you talked to is happy, but other people are not; they think it’s too hot, and want you to make it colder. What do you do? Well, you could ask every single person if they’re happy with the temperature, then adjust the thermostat to make as many people happy as possible. But say, somehow, you have millions of people in the room. Asking every single person how they feel every time the temperature changes a little bit takes too much time and effort. Instead, you take a sample; based on the sample, you’ll adjust the thermostat and hope that makes everyone happy. The sample isn’t 100% accurate, but it’s pretty close to what the entire population wants (assuming the sample is large enough).

Apologies for the tortured metaphor. What does this have to do with machine learning? Well, there are two components to the metaphor: the multiple people wanting different things, and the sampling of the population. Let’s look at the first component.

So far, we’ve only looked at how to adjust the network when provided with a single example input, but during training you have to provide hundreds, thousands, or even millions of examples. Say your neural network is supposed to distinguish between a cat and a dog. You provide a picture of a cat, then use backpropagation to determine how the weights and biases should be adjusted. Then, you update the network. All good, right? Well, you’ve made the neural network better at identifying that one specific picture of a cat, but that’s not what we really want. We want the neural network to identify cats and dogs in general, so you have to provide multiple pictures of cats, and multiple pictures of dogs. In other words, rather than updating the weights and biases based on a single example, you should update the weights and biases using many, many examples. Here’s the procedure:

  • Take the first example, use backpropagation to determine the gradient, and determine how you want the weights and biases adjusted for that example. But don’t update the network yet.
  • Take the second example, use backprop, then determine how you want the weights and biases adjusted. Again, don’t change the network yet.
  • Take the third example… etc.

Each example is going to tell you how you should adjust the weights and biases, just like how each person will tell you whether you should raise or lower the temperature. If you average all the weight and bias adjustments from each example, you’ll have a single voice telling you how to adjust the weights and biases so that all the examples are (somewhat) happy. You’re essentially finding a happy medium that works for every example.

Each example has its own opinion on how you should adjust the weights and biases. If you listen to only one, you’ll move in the wrong direction. By listening to all of them, you’ll move in the direction that is most beneficial to everyone.
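
In code, that “listen to everyone, then move” step might look like this, reusing the backprop sketch above (examples is an assumed list of (input, target) pairs):

    # Accumulate each example's gradient; only update the network afterwards
    total_w = [np.zeros_like(W) for W in weights]
    total_b = [np.zeros_like(b) for b in biases]
    for x, y in examples:
        grad_w, grad_b = backprop(weights, biases, x, y)
        total_w = [tw + gw for tw, gw in zip(total_w, grad_w)]
        total_b = [tb + gb for tb, gb in zip(total_b, grad_b)]
    avg_w = [tw / len(examples) for tw in total_w]   # the "happy medium" gradient
    avg_b = [tb / len(examples) for tb in total_b]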

The second part of the metaphor is sampling. In the previous paragraph, I said use backprop to get the gradient for every single example, then average all the gradients, then use that to update the weights and biases. That’s one “step”. To take another step towards the ideal weights and biases, do the whole thing over again: run through all the examples, calculating gradients, averaging them, then updating the weights and biases again.

This works well in theory, but poorly in practice. The number of examples can be HUGE, going up to millions. Running through millions of examples for every single step is very, very, very time consuming, just like asking millions of people if they think the room is too hot or cold. So you sample to make it faster. Rather than running through all the examples, you run through a random sample, calculating the gradient for each one. Then, you average the gradients and update the weights and biases. In other words, if you were to run through all the examples, you’d know the exact direction to move to please everyone; by running through a random sample, you’ll know the approximate direction to move to please everyone.

So you use a random sample to take one step. Then, you take another random sample (excluding the examples you’ve already used), and take another step. Then you take a third random sample, etc. Each random sample, called a mini-batch, should be the same size. Since you’re excluding previously used examples for each new mini-batch, you’ll eventually go through your entire example set. Each time you run through your entire example set, that’s called completing one epoch. If you use mini-batches, and say one epoch consists of 100 mini-batches, then completing one epoch will mean you’ve taken 100 steps (adjusted weights and biases 100 times). If you don’t use mini-batches, then you’ll only take 1 step after completing each epoch. I’m sure you can see why mini-batches greatly speed up how quickly neural networks learn.

This entire process is called Stochastic Gradient Descent, often called SGD. Let’s recap:

  • Take your entire example set, and break it up into mini-batches (using random sampling)
  • For each mini-batch:
    • For each example within the mini-batch, calculate the gradient using backprop
    • Once you’ve run through all the examples in the current mini-batch, average all the gradients and use that to update the neural network’s weights and biases
  • Once you’re done with all the mini-batches, you’ve completed one epoch. Repeat the whole process for as many epochs as desired. The number of epochs is usually in the tens to hundreds, depending on the complexity and size of the neural network.
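
Putting the recap together, here’s a sketch of the whole SGD loop, reusing the backprop and gradient-averaging sketches from above (the hyperparameters — epochs, mini_batch_size, and learning_rate — are knobs you’d tune yourself):

    import random
    import numpy as np

    def sgd(weights, biases, training_data, epochs, mini_batch_size, learning_rate):
        # training_data: list of (input, target) pairs
        for epoch in range(epochs):
            random.shuffle(training_data)  # random sampling into mini-batches
            for k in range(0, len(training_data), mini_batch_size):
                mini_batch = training_data[k:k + mini_batch_size]
                total_w = [np.zeros_like(W) for W in weights]
                total_b = [np.zeros_like(b) for b in biases]
                for x, y in mini_batch:
                    grad_w, grad_b = backprop(weights, biases, x, y)
                    total_w = [tw + gw for tw, gw in zip(total_w, grad_w)]
                    total_b = [tb + gb for tb, gb in zip(total_b, grad_b)]
                # One "step": update the network once per mini-batch
                for i in range(len(weights)):
                    weights[i] -= learning_rate * total_w[i] / len(mini_batch)
                    biases[i] -= learning_rate * total_b[i] / len(mini_batch)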

Great! So we started with a neural network with random weights and biases, and we learned how to use backprop to find gradients, which will tell us how to update weights and biases for a specific example. Then, we learned about stochastic gradient descent to quickly and efficiently update those weights and biases for all examples in a very large example set. By now, we’ve learned how to train a neural network! Since there’s no upper limit to the number of epochs to run (ignoring time constraints), if we train a network for hundreds or thousands of epochs, then the network will be trained to perfection… right?

Overfitting

Here’s one more metaphor, since I love them so much. Say two students are supposed to learn how to multiply. To help them, you provide a page with dozens of examples of multiplication: 1 × 5 = 5, 9 × 12 = 108, etc. One student learns the rules of multiplication (x times y means you sum y copies of x) in the traditional sense. Sure, this student makes arithmetic errors occasionally, so their performance isn’t perfect, but they learned how to multiply two numbers together. The second student, for some reason, decided the best way to learn the material was to memorize every single example you provided. They have 100% accuracy, since they know the answer to every example. Which of these two students would you say truly learned the material?

If you ask a human, the answer is the first student. But if you ask a computer, it’ll say the second student: they got every single thing right, so that’s what peak performance looks like. Unfortunately, this is a problem that haunts machine learning. Here’s another, completely different way to look at it:

Which is the better best-fit line?
Image from Neural Networks and Deep Learning

Of the two graphs above, which would you say is the “better” best-fit line? A human would probably say the left one: the data points look linear, the best-fit line is simple and easy to understand, and once you account for noise, it looks perfectly suitable. The computer, meanwhile, would probably say the right one. The computer would argue that the 9th-degree polynomial best-fit line has ZERO error! It’s perfect! It couldn’t get any better than that. What more could you want? How can you argue with a perfect result?

The point I’m trying to make here is that what humans consider ideal is different from what the computer considers ideal. If you let a machine learning algorithm learn for an excessive number of epochs, it overfits the training data. The end result is that, rather than learning to differentiate between cats and dogs by recognizing and extrapolating from patterns in the image, the machine devolves into rote memorization (“I’ve seen this exact image before, and I was told it was a cat, so it must be a cat”). The implications are clear and troubling: training a machine for too long can actually hurt overall performance. While the machine will perform better and better on the example set you provide to train it, it’ll perform terribly if you provide it with an image it’s never seen before (“This isn’t one of the ones I memorized, so I have no idea what that is”).

There are many ways to combat overfitting; I’ll touch on three of them: regularization, validation data, and WAY more training data.

Regularization is done by modifying the cost function. There are several different ways you can modify the cost function (L1 and L2 regularization are two examples), but they all strive to do the same thing. Overfitting occurs because the network is too heavily optimized to reduce the cost function; if you add a bit of a “twist” to the cost function, then stochastic gradient descent won’t properly optimize it. I think of it as intentionally obscuring the cost to impede SGD: in the large strokes, the cost function is the same, but when overfitting is about to occur, SGD gets confused by the weird cost function, preventing memorization. This is a very qualitative, wishy-washy explanation, and that’s because regularization, as far as I can tell, isn’t very well understood. It’s mostly an “I tried this and it worked” type of solution.
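
As a concrete example, L2 regularization adds a penalty proportional to the sum of the squared weights, nudging the network away from the extreme weight values that memorization tends to produce. A sketch (lam, the regularization strength, is a knob you’d have to tune):

    import numpy as np

    def l2_regularized_cost(base_cost, weights, lam, n):
        # base_cost: the unregularized cost (e.g. the quadratic cost)
        # n: number of training examples; lam: regularization strength
        penalty = (lam / (2 * n)) * sum(np.sum(W ** 2) for W in weights)
        return base_cost + penalty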

Another approach is to have a separate data set. The example set you use to train the neural network is called the training data. To detect overfitting, you have another example set called validation data. The trick with validation data is that you don’t use it to teach the network; in other words, the neural net never gets a chance to memorize the validation data. You do the typical stochastic gradient descent, using mini-batches to constantly update the weights and biases, but at the end of every epoch, you check how well your neural network does on the validation data. Over several epochs, if performance on the validation data improves, the machine is genuinely improving, so keep going. If, however, performance stagnates or decreases, then overfitting may be occurring, so stop training your model. After training completes, use a third data set, called test data, to determine the final performance of your neural network.
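
In rough, pseudocode-ish Python, that early-stopping logic might look like the following; train_one_epoch and accuracy are hypothetical helpers standing in for the SGD loop above and an evaluation pass over the validation data:

    best_accuracy = 0.0
    stagnant_epochs = 0
    patience = 10   # how many stagnant epochs to tolerate (a made-up choice)

    for epoch in range(max_epochs):
        train_one_epoch(network, training_data)   # ordinary SGD on the training data
        acc = accuracy(network, validation_data)  # validation data: never trained on
        if acc > best_accuracy:
            best_accuracy, stagnant_epochs = acc, 0
        else:
            stagnant_epochs += 1
            if stagnant_epochs >= patience:
                break   # performance has stagnated: likely overfitting, stop training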

Lastly, overfitting occurs more quickly the smaller your training data set is. So, if you make your training data really, really big, overfitting will happen later and later. Simple, but effective. There are two ways to do this: do the legwork and get more training data (more pictures of cats and dogs), or manipulate the existing training data. For example, a picture of a dog is still a picture of a dog if you flip the picture, rotate it, or scale it up or down a tiny bit. By performing one or a combination of these manipulations on each example, you could easily increase your training data set size tenfold. Since there’s more training data, it’s harder to memorize the answers, so overfitting is delayed.
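
Here’s a minimal sketch of that second approach with NumPy, treating each image as a 2-D array (a real pipeline would use an image library for small rotations and rescaling, but the idea is the same):

    import numpy as np

    def augment(image):
        # A dog is still a dog when flipped or rotated
        yield image
        yield np.fliplr(image)        # mirror left-to-right
        yield np.rot90(image)         # rotate 90 degrees
        yield np.rot90(image, k=3)    # rotate 270 degrees

    original = [(np.random.rand(10, 10), "dog")]
    bigger = [(aug, label) for img, label in original for aug in augment(img)]
    print(len(bigger))  # 4x the training data from one picture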

There are many more techniques to prevent overfitting, such as dropout, but the point of this section is to alert the reader that there is such a thing as “too much learning,” and that more epochs may mean more problems.

Deep Learning

If we can’t guarantee performance improvement by increasing the number of epochs we train our neural networks, then perhaps we can improve it by making the neural network more complex? Instead of having 1 or 2 hidden layers, why not have a dozen? Surely more neurons and computations mean that the network is more powerful, and therefore will perform better? Unfortunately, it appears that’s not really the case. Mathematically, the gradient of the earlier layers (near the input layer) becomes vanishingly small the more layers you have (the so-called vanishing gradient problem), so even if you add a dozen hidden layers, most of them don’t really learn, and you haven’t really helped the situation. As it turns out, teaching a neural network with many hidden layers is a totally different animal, so it gets its own name: deep learning.
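
A rough way to see why: the backprop equations multiply by sigmoid'(z) once per layer, and sigmoid'(z) is at most 0.25, so (ignoring the weight terms) the gradient reaching the early layers can shrink geometrically with depth:

    # sigmoid'(z) peaks at 0.25 (at z = 0), so each additional layer can
    # shrink the gradient reaching the early layers by a factor of 4 or more
    for depth in (2, 5, 10, 20):
        print(depth, 0.25 ** depth)   # upper bound on the product of sigmoid' terms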

Deep learning is also important for more than just improving performance on a simple task; object classification, object detection, image segmentation and natural language processing are extremely complex tasks that require much more than a small, shallow network. Additionally, while a shallow network could theoretically perform almost any task, a deeper architecture can perform the same task with fewer neurons (though with more layers) as long as the deep neural network is properly trained.

Since teaching deep neural networks is so challenging, one way to get around the problem is to use convolutional neural networks. I’ll elaborate on this in a future post, when I talk about the Jetson Nano and transfer learning.

Conclusion

I hope you enjoyed my overview of what machine learning is, how it works, what its pitfalls are, and some extra topics. This is by no means comprehensive, so I highly suggest you check out my references!
