Differential equations are an important class of equations for describing physical systems and phenomena. Let's take a look at what they mean, some examples of their use, and how to solve them.
What is a Differential Equation?
Let's say a ball is dropped from the top of a tall building, ignoring air friction and wind. What is the equation that describes the ball's position as a function of time? Unfortunately, when describing physical phenomena or systems, it's usually easier to describe the change of a system than its current state. In other words, the derivative of a value is easier to describe than the value itself. In our case, the acceleration of the ball, caused by gravity, is far more apparent than the position of the ball. Therefore, in order to calculate the position, we start with the derivative of position, then work our way back to the position equation.
Simply put, differential equations are equations that relate a function to its own derivatives. For example, the statement “the sum of a function and its derivative equals zero,” or “the third derivative and the first derivative combined create a sine wave.” Given a differential equation, the end goal is to solve for the underlying function.
Let’s take a look at the ball drop again. We know the acceleration due to gravity (9.81 m/s^2), and acceleration is the second derivative of position. This gives us the following differential equation, and its subsequent integrals:
Ball drop differential equation, and its solution
The top line gives an easy and obvious equation, the acceleration due to gravity. However, this is the second derivative of position, which isn’t what we’re after. To get the equation for position, we have to integrate twice, thus solving the differential equation.
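For reference, written out (taking the upward direction as positive, so gravity is negative; the figure may use a different sign convention), the equations are roughly:

x''(t) = -9.81
x'(t) = -9.81*t + c1
x(t) = -4.905*t^2 + c1*t + c2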
But wait; what are c1 and c2? This brings up an important point: initial conditions are necessary to solve differential equations. In our case, c1 describes the ball’s initial velocity (whether it’s merely let go, or thrown downward), and c2 describes the initial position (from what height was the ball dropped). Indeed, it’s impossible to say with any confidence what the ball’s position is if we don’t know c1 and c2, which means that differential equations are not truly solved until initial conditions are accounted for.
The ball drop example is a very simple differential equation, as the second derivative only depends on a constant. This allowed us to directly integrate the equation to get rid of the derivatives. However, more complex differential equations do not allow this; see some examples below:
x' denotes the first derivative of x; x'' denotes the second derivative of x, etc.
As you can see, we cannot directly integrate both sides of the equation to solve for x. So what do we do?
Solving Differential Equations
There are two ways to solve differential equations: analytically and numerically. The analytical approach uses mathematics to find the underlying function, giving a closed form answer. The numerical approach uses pure computing power to brute force a solution. An analytical solution is cleaner and simpler, but for many differential equations it is impossible. Let's look at the analytical approach first.
Analytical Solutions
Analytical solutions are often done in two ways: memorization and the Laplace transform.
Memorization
Memorization is what it sounds like. Many forms of differential equations have already been solved, so simply memorizing those forms and their associated answers is sufficient. Let’s look at homogeneous second order differential equations with constant coefficients. Homogeneous signifies that the equation equates to 0, and second order means the highest derivative is second order:
So we have to solve for x(t).
Let's take a step back. What function, x(t), has clean and simple relationships to its own derivative(s)? Exponentials and sinusoids come to mind. The derivative of an exponential is a scaled version of that exponential, and the second derivative of a sinusoid is a negatively scaled version of the original sinusoid, as follows:
c is some constant
This is relevant because exponentials and sinusoids come up a lot for solutions to differential equations. Keeping that in mind, let’s go back to solving for x(t). Let’s take an educated guess and assume x(t) is an exponential:
If e^rt*(ar^2+br+c)=0, then, since e^rt is never zero, (ar^2+br+c)=0. Since we're equating a polynomial to zero, this boils down to finding the roots of that polynomial. In our case, we're looking at a quadratic equation, so let's use the quadratic formula!
Depending on a, b, and c, there are three possible solutions: two real, distinct roots; two complex, distinct roots; and repeated roots. Let’s look at each case:
If b^2 – 4ac > 0, then there are two distinct, real roots. In this case, the solution is the sum of two exponentials:
If b^2 – 4ac < 0, then there are two distinct, complex roots. In this case, the solution is exponentially decaying sinusoids:
If b^2 – 4ac = 0, then there are repeated roots. In this case, the answer is a sum of an exponential and an exponential multiplied by t:
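Putting the three cases together (r1 and r2 are the roots from the quadratic formula, α ± βi is the complex pair, and c1, c2 are constants set by the initial conditions):

Two real, distinct roots: x(t) = c1*e^(r1*t) + c2*e^(r2*t)
Two complex roots α ± βi: x(t) = e^(α*t)*(c1*cos(β*t) + c2*sin(β*t))
Repeated root r: x(t) = c1*e^(r*t) + c2*t*e^(r*t)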
There are many similar equations for other types of differential equations. The key takeaway for memorization analytical solving is:
Recognize the differential equation
Remember the associated solution
Solve for constants in solution
Laplace Transform
The Laplace transform is a different approach to solving differential equations. Differential equations are difficult to work with due to derivatives, but the Laplace transform changes the problem into an algebra problem. The Laplace transform is as follows:
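For reference, the (one-sided) Laplace transform of a function f(t) is defined as:

F(s) = L{f(t)} = ∫₀^∞ f(t)*e^(-s*t) dt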
The Laplace transform has the convenient property that if f(t) becomes F(s), then the derivative of f(t) becomes sF(s), if initial conditions are 0:
Assuming all initial conditions are 0, this allows us to use the Laplace transform to easily rearrange the equation:
u(t) is the Heaviside step function, which is 0 for t < 0 and 1 for t > 0. The Laplace transform of u(t) is 1/s.
We start with a differential equation, then take the Laplace transform of both sides. On the left side, Y can be factored out, isolating it. We can then use algebra to solve for Y. Using partial fraction expansion:
Great! Using the Laplace transform, we’ve solved for Y(s). But what we’re looking for is y(t), so how do we get that? Well we get Y(s) from y(t) using the Laplace transform; we can get y(t) from Y(s) using the inverse Laplace transform! Turns out that’s a pain in the butt; rather than doing the computations ourselves, we use a table of transforms that’s already done the math for us! Conveniently, we can do the inverse Laplace transform one term at a time, and then add them all up at the end.
The last term is 1/(s+1)^2, which is n!/(s-a)^(n+1) when n=1 and a=-1. Therefore, the inverse Laplace transform of the final term is t*e^-t. All combined:
The Laplace transform, therefore, allows us to use tables and algebra to solve differential equations, which is often preferable to solving the differential equation directly. The procedure is to use the Laplace transform on the differential equation, algebraically solve for the function of interest, then use the inverse Laplace transform table to undo the transformation.
Numerical Solutions
Analytical solutions are great for finding closed form solutions to differential equations. However, this isn't always possible. Fortunately, we have another approach: the numerical solution. Using Euler's method, it is possible to numerically approximate the solution to virtually any differential equation.
The top equation is the definition of the derivative. Below, the equation is rearranged to calculate f(x+h).
For some small step h, we can compute the new value f(x+h) using the current value f(x), and the derivative at that position, f‘(x). This is a linear approximation of the function which becomes exact as h becomes infinitely small.
How does this help us? Well let’s say we have some first order differential equation:
The differential equation gives us a way to find the slope at a given point. Let's find the slope at the starting condition, x(0)=1. Plugging that into the differential equation, we get x'+1=0, or x'=-1. Let's take a small step of 0.1. The new value is the current value plus the small step times the derivative, or 1 + 0.1*(-1) = 0.9.
Now, we have x(0.1)=0.9. The process repeats all over again: use the current value to get the new derivative, then calculate the new value, ad infinitum. Below shows a couple of steps of this process:
Let’s plot x(t):
So we’ve found x(t). Let’s try smaller and larger steps, and see how that changes the solution:
Color indicates step size: red = 0.01, black = 0.1, blue = 1
The red plot has the smallest step, so it is closest to the true solution. The black plot is very close, but has far fewer computations. The blue solution is very rough, and probably shouldn’t be used.
This example shows that the smaller the step you take, the better (more accurate) your solution. If the step size is too large, though, the solution is completely wrong. This results in a balancing act: if your steps are too small, then you need tons of computations; if the step size is too large, then the solution is inaccurate.
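Here's a minimal sketch of this procedure in Python, using the same equation (x' + x = 0, x(0) = 1); the step size and simulation length are adjustable:

```python
def euler_first_order(x0=1.0, h=0.1, t_end=5.0):
    """Approximate x(t) for x' + x = 0 (i.e. x' = -x) using Euler's method."""
    x = x0
    t = 0.0
    history = [(t, x)]
    while t < t_end:
        x_prime = -x            # the differential equation gives the slope
        x = x + h * x_prime     # new value = current value + step * derivative
        t = t + h
        history.append((t, x))
    return history

for t, x in euler_first_order()[:3]:
    print(f"t = {t:.1f}, x = {x:.3f}")   # x(0) = 1.000, x(0.1) = 0.900, x(0.2) = 0.810
```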
So we can solve first order differential equations. But how do we solve second order differential equations?
Fortunately, we can decompose second order (and higher) differential equations into multiple first order equations.
We have transformed a single 2nd order equation into two first order equations. The process is the same as before:
Use the current values of x and v to calculate x’ and v’
Use x’ and v’, and a small step, to calculate the new values of x and v
Use the new values of x and v to calculate x’ and v’
Repeat process as many times as desired
Let’s look at an example problem:
Using h=0.05:
At t=0, x=0 and v=5. We calculate v’=-20. Using time step of 0.05, we calculate:
x = 0 + 0.05 * 5 = 0.25
v = 5 + 0.05 * -20 = 4
Above shows the calculation for t = 0.05; this process is repeated as much as necessary.
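To make that arithmetic concrete, here is the same step in code. The value v' = -20 is taken straight from the example above (the equation that produces it is in the figure and isn't repeated here):

```python
h = 0.05
x, v = 0.0, 5.0      # initial conditions at t = 0
v_prime = -20.0      # v' at t = 0, as given in the example

x_new = x + h * v        # 0 + 0.05 * 5     = 0.25
v_new = v + h * v_prime  # 5 + 0.05 * (-20) = 4.0
print(x_new, v_new)      # these become x and v at t = 0.05, and the process repeats
```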
Simulation
Differential equations are great for modeling physical systems; solving the differential equations allows us to run simulations. Fortunately, Euler’s method works great for this purpose!
Let’s look at a damped spring mass system:
Damped spring mass system
The mass m is connected to the wall by a spring with stiffness k, and a dashpot with drag c. x denotes displacement of the mass from equilibrium; when x is 0, the spring is at its natural length.
Free body diagram, force equation and differential equations
The spring applies force proportional to displacement, and the dashpot applies force proportional to velocity. This gives the force equation, using Newton’s F = ma. Acceleration is the second derivative of position, so we can decompose the force equation into two first order differential equations. Let’s run some simulations, with various c!
m = 1, k = 16. c = 2 (red), 8 (black), 32 (blue). Initial conditions: x(0) = 1, x'(0) = 0
We see three cases above. The red graph shows the spring mass oscillating for a long time before returning to equilibrium; this is the underdamped scenario. This means the dashpot is dissipating very small amounts of energy, so the system remains active for a long time. The black graph shows the spring mass returning to equilibrium very quickly without any oscillation; this is the critically damped case. This is the quickest the mass can return to equilibrium without any overshoots. The blue graph shows no oscillation, but the mass returns to equilibrium very slowly; this is the overdamped case. The dashpot is dissipating energy so quickly the mass takes a long time to reach its final position.
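As a sketch, here is roughly how these curves could be generated with Euler's method; m, k, c, and the initial conditions come from the figure caption, while the step size and simulation length are assumptions:

```python
def simulate(c, m=1.0, k=16.0, x0=1.0, v0=0.0, h=0.001, t_end=10.0):
    """Euler's method on m*x'' + c*x' + k*x = 0, split into x' = v and v' = -(c*v + k*x)/m."""
    xs, x, v = [], x0, v0
    for _ in range(int(t_end / h)):
        a = -(c * v + k * x) / m      # acceleration from the force equation
        x, v = x + h * v, v + h * a   # Euler step for both first order equations
        xs.append(x)
    return xs

underdamped = simulate(c=2.0)    # red curve: oscillates for a long time
critical    = simulate(c=8.0)    # black curve: fastest return, no overshoot
overdamped  = simulate(c=32.0)   # blue curve: no oscillation, slow return
```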
Now let’s try varying k:
m = 1, c = 2. k = 8 (red), 16 (black), 32 (blue). Initial conditions: x(0) = 0, x'(0) = 5
We see that as the spring constant k increases, the oscillation frequency, as well as number of oscillations, increases.
As you can see, solving differential equations numerically allows simulating physical systems, which then allows us to vary parameters and see how it changes the system’s behavior.
Conclusion
Differential equations are powerful tools for describing and modeling physical systems. Unfortunately, they're not very easy to work with, as they give the relationship between functions and their derivatives, rather than the equation for the underlying function. To address this issue, many common forms of differential equations have already been solved by mathematicians past. The Laplace transform also provides a way to solve differential equations, using algebra and transform tables rather than dealing with derivatives. Yet another approach is to use computing power and Euler's method to calculate the solution to differential equations. The last approach is very common as it allows simulating reality, such as a damped spring mass system.
Linear algebra has a deceptively simple premise: the study of linear systems. However, this gives rise to many interesting, unexpected, and extremely useful results. Let’s take a look!
The goal of this post is to touch upon introductory concepts, rather than memorizing or deriving equations. Please see my references at the end for more advanced topics, as well as formulas for computations.
Vectors & Matrices
According to physics, a vector is something with magnitude and direction. In linear algebra, it is commonly represented in two ways. The first is a column of numbers. The second is as an arrow, with its tail at the origin and its head at the coordinate described by the column of numbers. For example:
Vector [2;3], represented in two different ways
Vectors typically have 1 column, and have the same number of rows as dimensions. The example above is a 2×1 vector, which means it has 2 rows and 1 column, so it shows a vector in two dimensions. A matrix is similar, but typically has more columns, such as 2×2.
For this blog post, we’ll focus on two dimensional vectors and matrices since they are easiest to plot and understand. Also note that vectors in the text of this post will be written in Matlab notation.
Matrix Multiplication
Let’s take a look at the example above again. The vector [2;3] has its tail at the origin, and its head at coordinates (2,3). How did we get the coordinates for the head?
A unit vector is a vector with length 1. The +x unit vector points in the positive x-axis direction by 1 unit, or [1;0]. Likewise, the +y unit vector points in the positive y-axis direction by 1 unit, or [0;1]. These unit vectors are the building blocks for all other vectors. In our case, the 2 in [2;3] means go two times the +x unit vector, and the 3 means go three times the +y unit vector. Numerically, we get 2*[1;0], and 3*[0;1]. Combining the two gives 2*[1;0]+3*[0;1]=[2;3]. The result is shown below, graphically:
[2;3] vector, drawn with +x and +y unit vectors
Let’s recap. For [2;3], we take 2 times a vector, [1;0], and take 3 times another vector, [0;1], and add them together.
But why use the +x and +y unit vectors? Why use [1;0] and [0;1], instead of something else? Matrix multiplication allows us to do just that:
Matrix multiplication in 2 dimensions
Above is the equation for multiplying a vector by a matrix in 2 dimensions. The vector has components v1 and v2, and the matrix has entries a, b, c, and d. The equation demonstrates what we said previously; v1 scales a vector, as does v2. These two scaled vectors are then added together, resulting in the final vector. Let’s plug [2;3], +x unit vector, and +y unit vector into our equation:
Multiplication equation applied to previous example
Now we return to my previous question. What if we don’t want to use [1;0] and [0;1]? In that case, we change the entries in the matrix! Let’s try an arbitrary example:
Here, instead of 2 times +x unit vector, and 3 times +y unit vector, we get 2 * [2;2] and 3 * [1;3]. Adding them together, we get the final vector [7;13].
[7;13], composed of [2;2] and [1;3]
The [7;13] vector is shown above in black. There are two [2;2] vectors shown in red, and three [1;3] vectors shown in blue. The number of times each vector appears corresponds to the entries of the initial vector, [2;3].
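The same multiplication, done as a quick check with numpy (the matrix columns are [2;2] and [1;3], as above):

```python
import numpy as np

A = np.array([[2, 1],
              [2, 3]])              # columns are [2;2] and [1;3]
v = np.array([2, 3])

print(A @ v)                        # [ 7 13]
print(2 * A[:, 0] + 3 * A[:, 1])    # same thing: 2*[2;2] + 3*[1;3]
```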
Eigenvalue & Eigenvector
Matrix multiplication often causes vectors to point in different directions. In the previous example, [2;3] became [7;13], which is a decrease in slope, which means the vector has changed orientation. But there are some vectors that do not change orientation upon multiplication. This is mathematically described as: Av = λv, where A is a matrix, v is a vector, and λ is a scalar. This equation means that for a given matrix A, there may be a vector v that, upon multiplication, is the same as the original vector, just multiplied by a scalar. v is called an eigenvector of A, and λ is its eigenvalue.
Let's look at the matrix from the previous example, and see what happens when we multiply it by its eigenvectors:
In the first example, the final vector is the same as the initial vector: [1;-1]. This means the eigenvalue for that eigenvector is 1. In the second example, the final vector is 4 times the initial vector, so the eigenvalue is 4.
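For reference, numpy can find eigenvectors and eigenvalues directly. Here I'm assuming the matrix is the one from the previous example (columns [2;2] and [1;3]); note that numpy returns unit-length eigenvectors, so they are scaled versions of the ones discussed above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [2.0, 3.0]])
values, vectors = np.linalg.eig(A)
print(values)              # 1 and 4 (order may vary)
print(vectors)             # columns are the (normalized) eigenvectors

v = np.array([1.0, -1.0])  # the eigenvector with eigenvalue 1
print(A @ v)               # [ 1. -1.] -- unchanged by the multiplication
```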
Let’s look at another matrix:
Matrix to examine
This matrix has elements 5/6 and 1/6, but the 1/6 has been factored out.
Let’s take a look at how this matrix transforms a grid:
Grid, before and after transformation
On the left is a grid, which shows that the points are evenly spaced out. The origin is a circle, the +x axis is blue, and the +y axis is red.
The grid is transformed by multiplying each coordinate by the matrix. The new coordinate is then used for the new grid, shown on the right.
The new grid has some interesting properties. The +x axis has been rotated up, and the +y axis has been rotated to the right. The end result is that the first quadrant looks squeezed. The fourth quadrant, on the other hand, looks stretched out: the -y axis and the +x axis are further apart now.
The eigenvectors for this matrix are [1;1], with an eigenvalue of 1, and [1;-1], with an eigenvalue of 2/3. Let’s see how they transform (note that eigenvectors have been scaled up for easier viewing):
Grid and eigenvectors, before and after transformation
It looks like the +x axis and the +y axis are converging on the eigenvector in the first quadrant, [1;1]. Conversely, the +x axis and -y axis are diverging from the eigenvector in the fourth quadrant, [1;-1]. Both of these changes compress space toward a single line. This phenomenon becomes apparent if we multiply the grid by the matrix multiple times:
Transforming the grid repeatedly; the grid converges to the [1;1] line
Eigenvectors and eigenvalues are powerful tools to understand a matrix because they tell us how the matrix affects space. In this example, we can see that things on the [1;1] space are unchanged, and things on [1;-1] shrink; the end result is that space is compressed onto a single line.
The above understanding, where the space converges to [1;1], is very useful for understanding Markov chains:
Simple Markov chain
Imagine you have two nodes, v1 and v2. Each node starts off with a random number of (say) chips, for a total of 100 chips. Every time increment, each node gives 1/6th of its chips to the other node and keeps the rest. This is shown in the diagram above. After an infinite amount of time, how many chips will each node have?
For A, element in row i, column j means proportion of chips going to node i from node j. For example, elements in first row are proportion of chips going to v1 from v1 (first column) and to v1 from v2 (second column)
As it turns out, this is an eigenvector/eigenvalue problem. The system at a given time can be described by a vector v = [v1;v2], where v1 and v2 are how many chips those nodes have. The proportion of chips given away or kept can be described by matrix A, which is the same matrix we used in the previous example. Then, to calculate how many chips each node has after one time increment, we multiply A and v. Since we want a large amount of time to pass, we multiply A and v together repeatedly.
We already saw that multiplying by A repeatedly causes v to converge to the [1;1] line, regardless of what v actually is. Since v = [v1;v2] ends up on that line, over a long enough time v1 = v2. In other words, each node will end up with 50 chips, regardless of starting condition!
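A quick numerical check of this claim, as a sketch (the 80/20 starting split is arbitrary):

```python
import numpy as np

A = np.array([[5, 1],
              [1, 5]]) / 6.0    # each node keeps 5/6 of its chips and gives away 1/6
v = np.array([80.0, 20.0])      # arbitrary starting split of the 100 chips

for _ in range(50):             # let many time increments pass
    v = A @ v

print(v)                        # approximately [50. 50.], regardless of the start
```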
Distribution of chips with 3 different starting conditions
Solving differential equations involving matrices and vectors can also be an eigenvector problem:
Eigenvectors and eigenvalues are supremely important for describing how a matrix behaves, and how it transforms vectors. To understand linear algebra, understanding eigenvectors and eigenvalues is mandatory.
Change of Basis
When describing a vector or coordinate, we almost always use the x axis and y axis, using [1;0] and [0;1] as a basis for making more complicated vectors. However, it is sometimes necessary to describe a vector from another perspective; this is called a change of basis.
Vector. In standard basis, it is [6;12]
Examine the vector above. Using the standard basis, it consists of 6 times [1;0] and 12 times [0;1], so we would describe the vector as [6;12].
Instead of using [1;0] and [0;1], let’s try [1;1] and [1;-1]. By inspection, we see that to construct [6;12], we need 9 times [1;1] and -3 times [1;-1]. Using the new basis of [1;1] and [1;-1], the vector is [9;-3]:
Vector in new basis. Note [1;-1] is drawn backwards due to the negative scalar (-3)
If the vector or basis is more complicated, then inspection won’t work, so let’s take a look at the procedure for calculating the new coordinates:
C's columns are the new basis vectors; C converts coordinates in the new basis back to the standard basis, and its inverse, C^-1, converts standard-basis coordinates into the new basis. Let's apply the equation above:
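Here's the same calculation as a numpy sketch, taking the columns of C to be the new basis vectors [1;1] and [1;-1]:

```python
import numpy as np

C = np.array([[1.0,  1.0],
              [1.0, -1.0]])        # columns are the new basis vectors
v_standard = np.array([6.0, 12.0])

# Solve C @ v_new = v_standard for the coordinates in the new basis
v_new = np.linalg.solve(C, v_standard)
print(v_new)                       # [ 9. -3.]
print(C @ v_new)                   # back to [ 6. 12.] in the standard basis
```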
One instance where a change of basis may be useful is changing perspective. Say a chef has a fridge with mangoes and peaches. The chef also makes peach smoothies (2 mangoes and 3 peaches each) and mango juice (3 mangoes and 1 peach each). In order to calculate how many smoothies and juices can be made with a given amount of each fruit, or how many fruits it takes to make a given number of smoothies and juices, a change of basis is necessary.
Another example of a change in perspective would be someone on the ground vs. someone in a plane. The person on the ground, facing north, might say "I see something 30 degrees to my right, 3 km away". The plane might be heading west; in order to calculate where the spotted object is, its crew must convert from the ground observer's basis to the plane's basis.
Another instance where change of basis is useful is for simplifying computations, which we’ll see in the next section.
Diagonalization
Let’s return to our favorite matrix:
Try to compute A^5. That is, multiply A by itself 5 times. It’s very time consuming if you try to do it out by hand. Assuming a computer isn’t available, is there a better way to do this?
It’s too bad A isn’t simpler. For instance, if A merely doubled the x coordinate of a vector, and halved the y coordinate, then computing A^5 would be trivial…
But wait a second! Eigenvectors have the property of only scaling when transformed! Could we use that to our advantage? Let’s say we have a vector, v, and we can decompose it into scaled eigenvectors v1 and v2, which have eigenvalues λ1 and λ2. What would the transformation of v look like?
That means if we know A‘s eigenvectors and eigenvalues, then we can easily compute A^5, and what it does to a vector.
The next question is how do we compute the scaling values for the eigenvectors; that is, how do we determine a and b? This is actually a change of basis problem! Instead of describing v using a standard basis, you can describe it using a new basis, in this case the eigenvectors of A!
Let’s formalize our process:
We have matrix C, whose columns are the eigenvectors (so C^-1 changes coordinates from the standard basis to the eigenvector basis, and C changes them back), and matrix D, which scales the eigenvectors by their eigenvalues.
The third to last line, and the last line, are the main takeaways. Av = (CDC^-1)v means that transforming v by A is the same as decomposing v to eigenvectors, scaling them by eigenvalues, then converting it back to standard basis. Let’s look at an example below, graphing each step:
Equations for our example
Vector, [6;12]
Vector with change of basis using eigenvectors. Now it is [9;-3]
Eigenvectors scaled by eigenvalues; original vectors shown dotted. 9 * 1 = 9, and -3 * 2/3 = -2
Transformed vector, restored to standard basis
Now let's examine A^n = C(D^n)C^-1. Computing A^5 directly takes a lot of multiplications. However, D^5 is very simple. D is a diagonal matrix, which means multiplying D by itself simply multiplies each eigenvalue by itself:
The equation for D^n explains why everything converges to the [1;1] line when a vector is multiplied by A repeatedly. In our case, λ1 is equal to 1, so λ1^n = 1 as n increases. However, since λ2 = 2/3, λ2^n gets smaller as n increases. In other words, when vectors are decomposed into eigenvectors [1;1] and [1;-1], the [1;1] component remains the same (multiplied by 1), while [1;-1] component shrinks (multiplied by value less than 1) during transformation.
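Here's a quick numerical check of A^n = C(D^n)C^-1 for our matrix, as a numpy sketch:

```python
import numpy as np

A = np.array([[5, 1],
              [1, 5]]) / 6.0

C = np.array([[1.0,  1.0],        # columns are the eigenvectors [1;1] and [1;-1]
              [1.0, -1.0]])
D = np.diag([1.0, 2.0 / 3.0])     # the corresponding eigenvalues

n = 5
A_n = C @ np.linalg.matrix_power(D, n) @ np.linalg.inv(C)
print(A_n)
print(np.linalg.matrix_power(A, n))   # same result, computed the long way
```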
Conclusion
We looked at how matrices transform space, and how we can use eigenvectors and their eigenvalues to understand the transformation. We also learned change of basis and diagonalization, useful tools to simplify problems, as well as further examine the nature of matrices.
The Jetson Nano is a deep learning development kit from Nvidia. Let’s take a quick look at what it can do!
Terminology: Jetson Nano is the module that the heatsink is mounted on. This module is then connected to a PCB, which has a barrel jack, USB ports, Ethernet, etc. This full assembly is the Jetson Nano Developer Kit.
Setup – Parts
Mandatory
Jetson Nano Developer Kit: I got the B01 variant of the 4 GB version. The A02 is an older version with different connectors and layout.
Micro SD Card: I got a 256 GB one, but the minimum is 32 GB.
Keyboard and mouse: I’m using a Logitech MK360, which only needs a single USB dongle for both.
WiFi & Bluetooth module: Helpful if you want to browse the web or download code on the developer kit
Fan: The case comes with a fan, but it's always on. The one from Noctua only turns on when the Nano starts to get hot.
Setup – Procedure
Programming the Jetson
The instructions can be found here. It’s very similar to programming a Raspberry Pi; burn the operating system onto an SD card, then put the card into the Jetson.
The Jetson can be powered by a USB micro cable connected to J28. I don't recommend this, since the Nano consumes a lot of power. Instead, first make sure there is a jumper on J48, then connect the power supply to the barrel jack. Now the Jetson is powered by the power supply instead of over USB.
The Jetson, by default, can only be turned on and off by connecting/disconnecting the power supply (or going through the operating system). This is very inconvenient. A way around this is to jumper pins 5 and 6 on J50. Then, short pins 11 and 12 on J50 to turn the system on or off.
The WiFi/Bluetooth module is placed under the Jetson Nano module, onto J18 as shown in the figure above. Remove the two screws holding the module to the development kit, then push the two metal clips attached to the ends of J2 outwards. This will allow you to remove the Nano. Before placing the module into J18, remove the screw near J18, and attach the antennas to the WiFi/Bluetooth module. After inserting the module into J18, press it down and secure it with the screw. Reassemble the development kit. Here's a helpful video.
The connector for the fan should connect to J15
Nano with case, antenna, jumper and power wires
Jetson AI Fundamentals Course
Nvidia has a free course for learning the basics of deep learning and transfer learning; you just need to create an account. The tutorial does an excellent job walking you through the process, so I'll just point out some things I found interesting or important.
The Jetson Nano operates in “headless” mode, which means you don’t need a keyboard, mouse or monitor attached to it. It connects to a host PC over a USB micro cable, and you communicate with the Jetson through your computer. The Jetson creates a Jupyter Notebook which you can access through your browser.
Connection Problems
Before diving in, I wanted to address a problem I ran into. For me, my host PC kept losing connection to Jupyter Notebook. Reading through the documentation, I found the following:
README-usb-dev-mode.txt
On my Windows computer, I went into the network settings and made the following changes:
Windows network settings
After that, connectivity was nice and reliable.
Jupyter Notebook
You connect to the headless Nano over SSH. I used PuTTY to do this. The Nano's IP address, when connected to the host PC through USB, is 192.168.55.1
Putty settings
Then, when connected to the Nano, you have to run a shell script to start the Jupyter Notebook:
Name greyed out for privacy reasons
Now, we can go to the provided URL and use the provided password to access the notebook!
Accessing Jupyter Notebook through web browser
Classification
Let's look at one of the Jupyter Notebooks. On the left hand side of the window you'll see directories; go to classification > classification_interactive.ipynb
Go through the file from beginning to end, executing each cell (except the last one; that’ll shut down the camera and neural network).
The second to last cell will create a GUI for training the neural network:
Training
Here, you can see what the camera sees. In category, select thumbs_up or thumbs_down, then do the corresponding gesture in the frame of the camera. Click the add button to take a photo. Do this numerous times for both gestures.
After you've taken many photos, you can start training the neural network. Select the number of epochs to train the network for, then click train. The epoch count goes down one by one as training runs; when it reaches zero, training has completed.
Testing. Looks good!
The image above shows that the neural network can recognize my thumbs up and down gestures pretty well! Here’s a couple of things to note:
Take a lot of photos with your hands at different positions on the screen, and at various distances. The more diverse your photos are, the better the end result
Try to have various backgrounds and lighting conditions. Changing such ambient information will teach the neural network to ignore it, and focus on your hand
Classification II
Let's look under the hood. When you run the classification Jupyter notebook, you are creating an instance of a fully trained neural network. In this case, we're using ResNet-18, a variant of the ResNet neural network that has 18 layers. Then, we're using transfer learning to retrain the network with the photos we took. Transfer learning is implemented using PyTorch, a Python library dedicated to machine learning. The advantage of PyTorch is that it can make use of the Nano's GPU hardware.
Top: creating the ResNet-18 neural network. Bottom: algorithm for re-training the network
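In rough outline (not the notebook's exact code), the setup looks something like this:

```python
import torch
import torchvision

# Start from ResNet-18 pretrained on ImageNet...
model = torchvision.models.resnet18(pretrained=True)
# ...then swap out the final fully connected layer so it outputs 2 classes
# (thumbs_up / thumbs_down) instead of ImageNet's 1000.
model.fc = torch.nn.Linear(model.fc.in_features, 2)
# Move the model to the Nano's GPU so PyTorch can use its CUDA cores.
model = model.to(torch.device("cuda"))
```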
You may have heard that GPUs are very important for machine learning applications. But why?
CPUs are designed for general purpose applications, and they’re capable of training and running neural networks, but they’re very slow at it. This of course is a problem if you want your training to take hours instead of days, or want to run classification on a live video feed.
GPUs, on the other hand, are designed for a very specific purpose: rendering graphics. To accomplish this, GPUs are designed for parallel processing, which just so happens to be useful for machine learning. Examine the neural network below. Remember that neurons within a single layer have no connection to each other. This means that the outputs of neurons on that layer can be computed independently of each other, so they can all be computed in parallel.
Note that the outputs of a single layer can all be computed simultaneously, since there is no interdependence within that layer. Image from Neural Networks and Deep Learning
While CPUs aren’t very good at parallel processing, GPUs are great at it. By taking advantage of parallel processing, training and application of neural networks can be sped up by a hundred or thousand fold.
In our case, the Nano has a CPU with 4 cores, and a GPU with 128 cores! PyTorch takes advantage of the hardware by using the GPU to perform computations, rather than the CPU. If the CPU were to run the neural network, it would calculate the output of a single neuron, then move on to the next one, then to the next one, computing one at a time. The GPU, on the other hand, can compute over a hundred neurons at once. Also note that the CPU has other responsibilities besides the neural network, such as running the operating system, so that’s another advantage to using the GPU.
Hello AI World
Hello AI World is a repo that contains files and instructions on how to run image classification, image detection and image segmentation. The files include network models, such as ResNet and GoogleNet, as well as Python and C++ files to run the networks, and update them if so desired. The repo contains detailed instructions, so I'll only show a couple of things I thought were cool here.
Note: the neural networks are executed using TensorRT, which optimizes the networks for performance. Transfer learning to update these networks, while not explored here, is done using PyTorch.
Running Examples
The repo provides instructions for first time setup. I ran my examples out of docker, so each time you want to run the docker, you have to call the run shell script:
./docker/run.sh
Parts redacted for privacy
Inside the docker, go to build/aarch64/bin. Here, you’ll see several executables you can run. Let’s take a look at some of them.
Camera Test
./video-viewer /dev/video0
For these examples, you'll have to provide an input to the executable, such as a file to examine or a video stream. In most cases, the argument is /dev/video0, which is my webcam. For video-viewer, this simply shows what the camera sees.
Image Classification
To run image classification, run the following command:
./imagenet ./images/black_bear.jpg
The command will feed the following image to GoogleNet, a pretrained image classification network:
Image of a black bear
Output from GoogleNet, when provided with black_bear.jpg. 99.0% certain the image is of an American black bear
As you can see, the neural network correctly identified the image as a black bear. Let’s try feeding the network a video feed:
Output from GoogleNet, when provided with the camera feed. 31.05% certain the image is of an apron (incorrect)
The most certain output of the neural network is shown in the camera feed, 31.05% certain that the image is an apron. This is, of course, incorrect; I’d say the correct answer is a t-shirt. This does raise some interesting questions though; there are many things in the frame, so how can we be sure what the neural network is identifying?
To be generous, let’s look at the top answers of the neural network, which is shown on the left. The correct answer is highlighted: t-shirt, with 9.7% certainty. So the good news is that the correct answer is in the top 15 answers (out of, I believe, 1000). The other answers provide some interesting, and possibly amusing insights: I do indeed use this t-shirt for my pajamas, (2.9% certainty), and the design is Japanese, so kimono is kind of relevant (5.6% certainty). But let’s not get too generous…
Image Detection
While image classification will label an image as one thing, image detection scans the image and tries to identify as many things as possible, while also determining coordinates for each of those items. Let’s take a look at an example:
./detectnet-camera /dev/video0
Here, you can see the image has me, my tie and my TV all identified and highlighted. Pretty cool, right!?
If you look at the terminal on the left, you can see the objects identified in the frame, as well as bounding boxes, which shows the position and size of the objects in the frame.
Image Segmentation
As the name suggests, segmentation divides a provided image in a meaningful way, simplifying it for a computer to work with. For example, for self driving purposes, a picture of the road (seen from the driver’s perspective) can be broken down into lanes, traffic signals, people, obstacles, etc.
./segnet --network=fcn-resnet18-mhp /dev/video0
Above is an example of image segmentation on a human body. The neural network used is Multi-Human, which is designed to break a person down into pieces (head, torso, arm, leg, etc.). Here’s another example from the repo:
The neural network used here is DeepScene, which is designed to work in nature. The image shows how a robot might use image segmentation to stay on the trail.
Conclusion
In this post, we saw what the Jetson Nano is capable of. We saw that a deep neural network can be used to classify, detect and segment images, and how neural networks can be updated for specific applications. Through the Jetson AI Fundamentals course, and Hello AI World, we can learn and apply the basics of machine learning to real world situations!
Last time, we looked at machine learning. Today, let’s take a look at deep learning for image recognition (convolutional network) and transfer learning.
Convolutional Network
In a typical, fully connected neural network, a neuron on the first hidden layer is connected to every single input neuron. This means that for x input neurons, each neuron on the first hidden layer has x weights and 1 bias; for y neurons on the first hidden layer, you get y * (x+1) parameters. That’s a lot of parameters, and the problem becomes much worse as the number of hidden layers increases. To combat this, convolutional networks simplify the network with the following techniques:
Input and hidden layer. Color shows which neurons on the input layer connect to which neurons on hidden layer. Note how red and blue overlap on the input layer
Local receptive field: instead of a hidden neuron seeing the entire input layer, it only sees a small part of it. Say the input is a 15 x 15 pixel input image, and the size of the local receptive field is 5 x 5 pixels; in that case, each neuron on the hidden layer will see 5 x 5 pixels. The top left hidden neuron will see the top left 5 x 5 pixels of the input, and the neuron next to it will see another 5 x 5 pixels, shifted over by a small amount. This small amount is called the stride length. If the stride length is 1 (5 x 5 pixel is shifted over by 1 pixel), then the hidden layer will be 11 x 11 neurons large.
The first advantage to this approach is reducing the number of parameters. Instead of a neuron on the hidden layer having 15 x 15 weights, now each one only has 5 x 5 weights. That’s a 9 fold decrease!
The second advantage is information about the location of data is preserved. In a fully connected layer, a neuron on the hidden layer lighting up means something happened somewhere in the input image. For a convolutional network, if a neuron on a hidden layer lights up, you know which 5 x 5 pixels caused that neuron to activate.
Note the hidden layer here is called a convolutional layer, due to its similarity with the mathematical operation of convolution
Shared weights & biases: Unlike in a fully connected neural network, the neurons on the hidden layer of a convolutional network all have the same weights and bias. For example, in the image above, the red, blue, yellow and green neurons on the hidden layer all have the same 5 x 5 = 25 weights and 1 bias.
The key advantage to this approach is further reducing the number of parameters. Without sharing, you would have 25 weights and 1 bias for each of the 11 x 11 neurons on the hidden layer. Here, since all the neurons share the same weights and bias, you only have 25 weights and 1 bias for the entire hidden layer. That's a 121 fold decrease!
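As a rough sanity check of these counts, here's a sketch using PyTorch layer definitions with the 15 x 15 example above:

```python
import torch.nn as nn

# One feature map scanning a 15 x 15 input with a 5 x 5 receptive field (stride 1), shared weights:
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))   # 26: 5*5 weights + 1 bias for the whole layer

# A fully connected layer producing the same 11 x 11 = 121 hidden neurons:
fc = nn.Linear(15 * 15, 11 * 11)
print(sum(p.numel() for p in fc.parameters()))     # 27346: 225 weights + 1 bias per neuron
```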
The sharing of weights and biases means that all the neurons on the hidden layer look for a specific feature in its own local receptive field. Say the weights and biases are tuned to look for a vertical line. Then, in the image above, for the hidden layer, the red neuron looks for a vertical line at the top left of the input image, and the green neuron looks for a vertical line at the bottom right of the input image.
The hidden layer will look for a single feature in the entire image, forming a feature map. For effective image recognition, the neural network must recognize multiple features, such as vertical, horizontal, diagonal and curved lines. To accomplish this, the network must have multiple feature maps. For example, if you have 5 feature maps, then the network will recognize 5 distinct features. The top left neuron of each feature map will see the same 5 x 5 image, but since each one has different weights and biases (since they’re on different feature maps), each neuron will look for a different feature:
Input layer to multiple feature maps. Each feature map scans the entire image for a single feature.
Pooling layer reduces the number of neurons per layer
Pooling layer: Multiple neurons on the hidden layer are combined together, simplifying the network. This can be done in multiple ways. One common approach is max-pooling: the value of each neuron on the pooling layer is equal to the maximum value of 2 x 2 neurons on the hidden layer. Note that there is no overlap this time. Often, the hidden layer and pooling layer are considered a single layer.
By pooling, some of the location information is lost. On the convolutional layer, you know which 5 x 5 pixel of the input a neuron is looking at. On the pooling layer, each neuron is the amalgamation of 4 neurons on the convolutional layer, so you don’t know which 5 x 5 pixels of the input image activated the neuron. This is okay; generally, knowing the approximate location of a feature is all you need. Knowing its exact location doesn’t really help.
Let’s recap. An input image is provided to a layer of input neurons, arranged as a grid. The input neurons are then fed to feature maps, which scan the image for different features. The feature map, upon finding a feature, will activate, telling the network that a feature exists at a certain location. The pooling layer then compresses the information, making computation easier.
We still need a fully connected network to turn feature recognition into image recognition. The pooling layer will say “I found these features in these locations”, and the fully connected network will use that to categorize the input image. An example of a full network is shown below. The sizes of the layers are chosen arbitrarily:
Input layer: 28 x 28. Convolutional layer: 5 x 5 receptive field, stride length 1, so 24 x 24; 10 feature maps. Pooling layer: 2 x 2 max-pooling, so 12 x 12. Fully connected layer: 28 neurons. Output layer: 14 neurons
Note that each neuron on the fully connected layer connects to every neuron on the pooling layer, and every neuron on the output layer.
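As a sketch, the example network above could be written in PyTorch like this; the layer sizes come from the caption, while the sigmoid activations are an assumption:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5),   # 28 x 28 input -> 10 feature maps of 24 x 24
    nn.Sigmoid(),
    nn.MaxPool2d(2),                   # 2 x 2 max-pooling -> 10 maps of 12 x 12
    nn.Flatten(),
    nn.Linear(10 * 12 * 12, 28),       # fully connected layer: 28 neurons
    nn.Sigmoid(),
    nn.Linear(28, 14),                 # output layer: 14 neurons
)
```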
For very complex neural networks, there are often multiple convolutional-pooling layers; that is, the output of one convolutional-pooling layer is fed into another convolutional-pooling layer. These networks also often have more than one fully connected layer. All in all, deep neural networks can get very complex, but thanks to sharing weights and pooling, learning is still possible.
Transfer Learning
Humans are remarkably good at pattern recognition. If you show someone one or two pictures of a raccoon, they will be able to correctly identify raccoons in real life, even if they’ve never seen that animal before. This amazing ability isn’t limited to just raccoons; elephants, cars, characters in a TV show, you name it.
How is this possible? A computer needs hundreds, thousands or millions of examples to learn how to recognize things, but a human needs only a handful. The key difference is that machines are learning from scratch, while humans are not. Take for example the person in the previous paragraph. This person has been learning how to recognize things since they were born. They can recognize people, sounds, textures, smells, etc. This person has developed the skills needed to recognize patterns and classify things already. When they see a couple of pictures of raccoons, they use those preexisting skills to pick up on key features, which they use to identify raccoons out in the field.
My point is that humans have a powerful and robust ability to recognize things already. When they need to recognize something new, they don’t start from scratch, but rather build upon existing abilities to rapidly gain new skills. Fortunately, computers can emulate this ability through transfer learning.
Say you have a neural network trained to recognize thousands of things, such as hammers, bears, dogs, etc. This neural network is very robust, and can handle a wide range of images with different backgrounds, lighting conditions, and weird angles. It is possible to take this fully trained neural network, and adapt it for something specific. For instance, you could use the neural network to differentiate between a thumbs up and thumbs down, or a smile and a frown. Though the original network wasn’t trained for this application, we can use transfer learning to re-train the network. This allows us to create powerful neural networks in a very short amount of time! My next post will be on the Jetson Nano, a development kit for playing with deep learning and transfer learning. Stay tuned to find out more!
Machine learning is all the rage, and for good reason. Some problems, such as recognizing faces or handwriting, are difficult to create algorithms for because we don’t really understand how we do it ourselves. If we don’t understand how we do it, how can we tell a computer to do it? That’s where machine learning comes in.
Humans learn by taking an input, coming to some conclusion about that input, then seeing if our conclusion is correct. For example, a child might see a dog, and think it’s a cat or a dog. The parent then may correct the child if they think the creature is a cat, or praise the child for correctly identifying the dog. The parent doesn’t have to explain a procedure for identifying the dog, nor do they have to manipulate the neurons in the child’s brain; the child can figure that out by themselves with proper guidance.
Machine learning is the same way. Rather than telling a computer exactly what to do and how to do it, the programmer creates a highly malleable framework that takes an input and produces an output. Then, the programmer creates an environment for that framework, where inputs are provided, and then the outputs are assessed to see if the framework got the answer right. Using a previous example, we might provide a picture of a dog to the framework, and see if the framework outputs “cat” or “dog”. If the output is incorrect, we modify the framework and try again. Over time, the framework will “figure out” (AKA learn) how to identify cats and dogs.
What separates machine learning from conventional programming is that conventional programs are understandable by humans. A maze solving algorithm can be seen as depth-first or breadth-first, for example, and we can think of situations where one is better than the other. For machine learning, however, the highly malleable framework is basically a black box. It takes an input, does some computation that makes sense only to the computer, and produces an output. This can make debugging and improving the system quite challenging. Compounding this is that machine learning, from what I’ve seen, is heavily dependent on heuristics and best-practices. A lot of what I read says “we don’t really understand why this makes the system work better, but it worked for someone else, and it worked for us.”
If we don’t really understand how machine learning works, then how do we fix or improve systems? It’s like raising a child; you can’t go into their brain and adjust individual neurons, but you can provide a better learning environment for them and hope they relearn properly. Likewise, while the framework is a mystery to us, we can provide better guidance so the black box performs better.
This blog post will discuss how I view and think of machine learning and its many aspects. The focus is not on implementation or mathematics, but rather on concepts and ways of thinking.
With all that said, let’s jump in!
References
For those who want a more rigorous discussion of the topic, here are my primary references:
3Blue1Brown on YouTube has a fantastic series that explains machine learning for those with no prior experience, with really helpful animations.
Neural Networks and Deep Learning, which is a free and easy to understand book that covers a lot of the mathematics and the justification behind it
Neurons & Neural Network
What I’ve been calling the “highly malleable framework” is what others call a neural network. It is a model developed to loosely mimic the behavior of neurons in our brains. A single neuron takes in several inputs, performs computation with them, and produces a single output. When many neurons are connected together, the behavior and output of the entire network can become quite complex.
The output of the neuron is a normalized, weighted sum of its inputs plus a bias. Let's look at an example of a neuron with three inputs to see what that means:
z = w1*x1 + w2*x2 + w3*x3 + b
a = sigmoid(z), where sigmoid(z) = 1/(1+exp(-z))
z is called the weighted input. It is a sum of a bias (b) and the inputs (x1, x2, x3) after they have been scaled by weights (w1, w2, w3).
a is called the activation, and is the output of the neuron. Since there's no limit to what the weights and biases can be, z could be a huge positive or negative number. This could be a problem, so a function is used to normalize z, reining in its maximum and minimum value. The sigmoid function smoothly transitions from 0 to 1, so the output of a single neuron is limited to that range.
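Here's a single three-input neuron written directly from the equations above (the weights, bias, and inputs are arbitrary numbers, just for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias   # weighted input
    return sigmoid(z)                                        # activation

print(neuron([0.5, -1.0, 2.0], [0.1, 0.4, -0.2], 0.3))       # always between 0 and 1
```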
A neural network is organized into layers. The first layer of neurons is the input layer, where the input to the system is provided. Note that it’s not really a layer of neurons; the “neurons” in that layer just pass along the input to the next layer. Think of it as the network’s port to accept inputs. The last layer of neurons is the output layer, where the system will output its final answer. The layers between the input and output layers are the hidden layers, called such since their existence isn’t obvious to an outside observer.
Neurons within a single layer do not connect to each other. However, a neuron in a layer takes as input the outputs of all the neurons in the previous layer, and its own output goes to all the neurons in the subsequent layer.
Let's do a quick recap. A neural network takes in an input and feeds it to a layer of neurons. Those neurons perform computations and pass their outputs to the next layer. The next layer does the same thing, and so on, until the output layer provides an answer.
This raises some questions:
What does the input look like? If the input is a picture, then you could provide the RGB information for each pixel. For example, for a 10 x 10 RGB image, the input might be 3 x 10 x 10 = 300 inputs, each input representing the intensity of a color of a pixel. If the input is a 10 x 10 grey-scale image, then you might only have 100 inputs.
What does the output look like? The output can be coded in different ways. If the network is supposed to distinguish between a cat and a dog, you could have two output neurons; one neuron has a high value (outputs close to 1) if the input is a cat, and the other neuron has a high value if the input is a dog. Or you could have a single neuron that outputs close to 1 if the input is a cat, or close to 0 if the input is a dog. Generally, the former method is chosen; when the input is an image and the network is supposed to classify the image between 10 categories, the network will typically have 10 outputs, and the highest value output is the system's answer.
When a machine “learns”, what exactly is happening? The equation for weighted input (for three inputs) is z=w1*x1+w2*x2+w3*x3+b. x1, x2 and x3 are determined by the previous layer, but what about w1, w2, w3 and b? When the neural network is first created, these numbers are initialized with a random value. When a neural network “learns”, it tweaks its weights and biases so that the output of the system is more and more accurate over time.
In essence, a neural network takes an input, mashes a bunch of numbers together, and produces an output. The network will then see how good or bad the output is, then adjust how it mashes the numbers together to try and get a better result.
We've covered what a neuron and a neural network are; now let's see how the network learns.
Learning
Say you step into a room, and the air is too cold. You could just turn the thermostat to a random point, and see if the new temperature suits your preferences. If it’s good, you leave the thermostat alone; if not, you try again, setting the thermostat to a random temperature.
This approach to finding the right temperature leaves a lot to be desired. Clearly the best way to get the temperature you want is to see if the room is too hot or cold, then adjust the thermostat in the opposite direction, e.g. if the room is too cold, turn the thermostat so the room gets hotter.
Machine learning takes the same approach. When the neural network is first created, the weights and bias for each neuron are randomly initialized, so they're probably not very good, just like how a randomly set thermostat probably won't give you the temperature you want. And just as you can discern in what direction, and by how much, to move the thermostat to reach the temperature you want, there must be a way to know in what direction the weights and biases should be moved, and by how much.
Before we can know how to get the result we want, we need to quantify how “good” or “bad” the performance of the neural network is; after all, if you can’t say the system is doing poorly, how will you know it needs to be changed? This is where the cost function comes in. This value shows the difference between the desired output of the neural network, and its actual output. If the actual output is perfectly accurate, the cost is close to zero; if the output is way off, the cost function is large. Now, we have a more methodical approach: change the weights and biases of the network so the cost function goes down.
Example of cost function changing as a function of a single weight in the neural network
Say, for a given neural network, you randomly selected a single weight within a single neuron. Now, for a given input, you changed the weight over a range, and recorded the cost as a function of weight, and got the graph above. Since we have the whole graph, it’s obvious what the weight should be: the weight corresponding to the deepest point. But let’s say you didn’t have a whole graph; in fact, you had a single point. If you could somehow determine the slope at that point, then you could determine what direction you should move to reduce cost. If the randomly selected point has a positive derivative (increasing), then you want to decrease the weight. If the point has a negative derivative (decreasing), then you want to increase the weight. Note that you’re moving in the opposite direction of the derivative. This is the fundamental idea of gradient descent.
The graph above is for a 1-dimensional case; you only have one weight you’re dealing with. In an actual neural network, you have many neurons, each of which has many weights and a bias. Instead of dealing with the 1-dimensional case like in the previous paragraph, you would be dealing with hundreds or thousands of dimensions. The idea is still the same; determine the gradient at your current location (gradient is the multidimensional version of a derivative), and move in the opposite direction.
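Here's a toy one-dimensional version of that idea; the cost function is made up purely to show the update rule:

```python
def cost_slope(w):
    return 2 * (w - 3)      # derivative of the toy cost (w - 3)**2, whose minimum is at w = 3

w = 0.0                     # a "randomly initialized" weight
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * cost_slope(w)   # step in the opposite direction of the derivative

print(w)                    # very close to 3, the bottom of the cost curve
```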
Now we have a battle plan: provide an input to the network, then use its output to determine the cost. Then, determine the gradient at your current point, and adjust the weights and biases accordingly. Only problem is… how do you determine the gradient at your current point?
Backpropagation
The simplest way to determine the derivative is to find the value at a point, then the value of a point right next to it, then use the slope equation to approximate the derivative. You could do that in the multidimensional case, but that is very computationally inefficient since you would need to calculate the two points for each dimension, which could be thousands or millions of computations. So what do you do?
There is a very cool algorithm for determining the gradient called backpropagation, backprop for short. The equations are shown below:
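For reference, in the notation of Neural Networks and Deep Learning, the four equations are (⊙ is element-wise multiplication):

δ^L = ∇_a C ⊙ σ'(z^L)
δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ'(z^l)
∂C/∂b^l_j = δ^l_j
∂C/∂w^l_jk = a^(l-1)_k * δ^l_j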
As I mentioned in the intro, I’m not going to go into the proof or meaning of these equations. I’ll instead say how I think of this group of equations. For those curious, chapter two of the Neural Networks and Deep Learning book does a great job explaining the mathematics and intuitions of these equations.
First, a quick note about notation; the L superscript denotes the value of something in the final (output) layer, while the l superscript denotes the value of something in any other layer. Also, δ is the symbol for error, which for our discussion here is just a very useful quantity.
The first equation shows how to calculate δ^L, which is the error for the output layer. The second equation shows how to calculate the error for any layer, provided you know the error of the next layer (notice how δ^l depends on δ^(l+1)). Since the first equation tells us how to calculate the error for the final layer, we can use the second equation repeatedly to walk, or propagate, backwards through the network. If you repeatedly use the second equation, you’ll eventually have the error for every layer in the network. Why do you care about the error? Because error can be used to calculate the gradient! The third and fourth equations let you convert the error you calculated into partial derivatives of the biases and weights, and the gradient is just a matrix of all these partial derivatives.
Backpropagation, in other words, provides a computationally efficient method for computing a very useful value, error. Error, in turn, provides an easy way to compute the gradient. Therefore, thanks to backpropagation, we can efficiently compute the gradient, which means we can quickly determine in what direction the weights and biases should be adjusted, and by how much!
Stochastic Gradient Descent
In my previous metaphor, you were alone in the room, trying to get the temperature right. Now, say you were in a room full of people. You talk to one guy, and he says it’s cold and asks you to raise the temperature, so you make the room warmer. But there are multiple people in that room. The first guy you talked to is happy, but other people are not; they think it’s too hot, and want you to make it colder. What do you do? Well, you could ask every single person if they’re happy with the temperature, and then adjust the thermostat to make as many people as happy as possible. But say, somehow, you have millions of people in the room. Asking every single person how they feel every time the temperature changes a little bit takes too much time and effort. Instead, you take a sample; based on the sample, you’ll adjust the thermostat, and hope that makes everyone happy. The sample isn’t 100% accurate, but it’s pretty close to what the entire population wants (assuming the sample is large enough).
Apologies for the tortured metaphor. What does this have to do with machine learning? Well, there are two components to the metaphor: the multiple people wanting different things, and the sampling of the population. Let’s look at the first component.
So far, we’ve only looked at how to adjust the network when provided with only one example input, but during training you have to provide hundreds, thousands or even millions of examples. Say your neural network is supposed to distinguish between a cat and a dog. You provide a picture of a cat, and then use backpropagation to determine how the weights and biases should be adjusted. Then, you update the network. All good, right? Well you’ve made the neural network better at identifying that one specific picture of a cat, but that’s not what we really want. We want the neural network to identify cats and dogs, so you have to provide multiple pictures of cats, and multiple pictures of dogs. In other words, rather than updating the weights and biases based on a single example, you should update the weights and biases using many, many examples. Here’s the procedure:
Take the first example, use backpropagation to determine the gradient, and determine how you want the weights and biases adjusted for that example. But don’t update the network yet.
Take the second example, use backprop, then determine how you want the weights and biases adjusted. Again, don’t change the network yet.
Take the third example… etc.
Each example is going to tell you how you should adjust the weights and biases, just like how each person will tell you whether you should raise or lower the temperature. If you average all the weights and biases adjustments from each example, then you’ll have a single voice telling you how to adjust the weights and biases so that all the examples are (somewhat) happy. You’re essentially finding a happy medium that works for every example.
Each example has its own opinion on how you should adjust the weights and biases. If you listen to only one, you may move in a direction that makes things worse for the others; by listening to all of them, you’ll move in the direction that is most beneficial to everyone overall.
The second part of the metaphor is sampling. In the previous paragraph, I said use backprop to get the gradient for every single example, then average all the gradients, then use that to update the weights and biases. That’s one “step”. To take another step towards the ideal weights and biases, do the whole thing over again: run through all the examples, calculating gradients, averaging them, then updating the weights and biases again.
This works well in theory, but poorly in practice. The number of examples can be HUGE, going up to millions. Running through millions of examples for every single step is very, very, very time consuming, just like asking millions of people whether they think the room is too hot or cold. So you sample to make it faster. Rather than running through all the examples, you run through a random sample, calculating the gradient for each one. Then, you average the gradients and update the weights and biases. In other words, if you were to run through all the examples, you’d know the exact direction to move to please everyone; by running through a random sample, you’ll know the approximate direction to move to please everyone.
So you use a random sample to take one step. Then, you take another random sample (excluding the examples you’ve already used), and take another step. Then you take a third random sample, etc. Each random sample, called a mini-batch, should be the same size. Since you’re excluding previously used examples for each new mini-batch, you’ll eventually go through your entire example set. Each time you run through your entire example set, that’s called completing one epoch. If you use mini-batches, and say one epoch consists of 100 mini-batches, then completing one epoch will mean you’ve taken 100 steps (adjusted weights and biases 100 times). If you don’t use mini-batches, then you’ll only take 1 step after completing each epoch. I’m sure you can see why mini-batches greatly speed up how quickly neural networks learn.
This entire process is called Stochastic Gradient Descent, often called SGD. Let’s recap:
1. Take your entire example set, and break it up into mini-batches (using random sampling)
2. For each mini-batch, calculate the gradient for every example within it using backprop
3. Once you’ve run through all the examples in the current mini-batch, average all the gradients and use that average to update the neural network’s weights and biases
Once you’re done with all the mini-batches, you’ve completed one epoch. Repeat steps 1~3 for as many epochs as desired. The number of epochs is usually in the tens to hundreds, depending on complexity and size of the neural network.
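Here’s that whole recap as a minimal Python sketch. It uses a toy one-weight linear model and a squared-error cost instead of a real neural network, purely to show the mini-batch/epoch structure of SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: learn y = 2x + 1 with a single weight and bias.
X = rng.uniform(-1, 1, size=1000)
Y = 2 * X + 1

w, b = rng.normal(), rng.normal()   # random initialization
learning_rate, batch_size, epochs = 0.1, 10, 20

for epoch in range(epochs):
    order = rng.permutation(len(X))            # step 1: shuffle into mini-batches
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        x, y = X[batch], Y[batch]
        pred = w * x + b
        dw = np.mean(2 * (pred - y) * x)       # steps 2-3: per-example gradients,
        db = np.mean(2 * (pred - y))           # averaged over the mini-batch
        w -= learning_rate * dw                # one "step" per mini-batch
        b -= learning_rate * db

print(w, b)   # ends up near 2 and 1
```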
Great! So we started with a neural network with random weights and biases, and we learned how to use backprop to find gradients, which will tell us how to update weights and biases for a specific example. Then, we learned about stochastic gradient descent to quickly and efficiently update those weights and biases for all examples in a very large example set. By now, we’ve learned how to train a neural network! Since there’s no upper limit to the number of epochs to run (ignoring time constraints), if we train a network for hundreds or thousands of epochs, then the network will be trained to perfection… right?
Overfitting
Here’s one more metaphor, since I love them so much. Say two students are supposed to learn how to multiply. To help them, you provide a page with dozens of examples of multiplication: 1 × 5 = 5, 9 × 12 = 108, etc. One student learns the rules of multiplication (x times y means adding up y copies of x) in the traditional sense. Sure, this student occasionally makes arithmetic errors, so their performance isn’t perfect, but they learned how to multiply two numbers together. The second student, for some reason, decided the best way to learn the material was to memorize every single example you provided. They have 100% accuracy, since they know the answer to every example. Which of these two students would you say truly learned the material?
If you ask a human, the answer is the first student. But if you ask a computer, they’ll say the second student: they got every single thing right, so that’s what peak performance looks like. Unfortunately, this is a problem that haunts machine learning. Here’s another, completely different way to look at it:
Of the two graphs above, which would you say is the “better” best-fit line? A human would probably say the left one: the data points look linear, the line is simple and easy to understand, and once you account for noise it fits perfectly well. The computer, meanwhile, would probably say the right one. The computer would argue that the 9th degree polynomial best-fit line has ZERO error! It’s perfect! It couldn’t get any better than that. What more could you want? How can you argue with a perfect result?
The point I’m trying to make here is that what humans consider ideal is different from what the computer considers ideal. If you let a machine learning algorithm learn for an excessive number of epochs, it overfits the training data. The end result is that, rather than learning to differentiate between cats and dogs by recognizing and extrapolating from patterns in the image, the machine devolves into rote memorization (“I’ve seen this exact image before, and I was told it was a cat, so it must be a cat”). The implications are clear and troubling: training a machine for too long can actually hurt overall performance. While the machine will perform better and better on the example set you provide to train it, it’ll perform terribly if you provide it with an image it’s never seen before (“This isn’t one of the ones I memorized, so I have no idea what that is”).
There are many ways to combat overfitting; I’ll touch on three of them: regularization, validation data, and WAY more training data.
Regularization is done by modifying the cost function. There are several different ways you can modify the cost function (L1 and L2 regularization are two examples), but they all strive to do the same thing. Overfitting occurs because the network is too heavily optimized to reduce the cost function; if you add a bit of a “twist” to the cost function, then stochastic gradient descent won’t properly optimize it. I think of it as intentionally obscuring the cost to impede SGD: in the large strokes, the cost function is the same, but when overfitting is about to occur, SGD gets confused by the weird cost function, preventing memorization. This is a very qualitative, wishy-washy explanation, and it’s because regularization, as far as I can tell, isn’t very well understood. It’s a mostly “I tried this and it worked” type solution.
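To make this slightly less wishy-washy, here’s what one common “twist”, L2 regularization, looks like: a penalty proportional to the sum of the squared weights is added to the cost, which pushes the network toward small weights. The λ value and the toy weights below are placeholders, and the usual scaling factor is omitted for simplicity.

```python
import numpy as np

def l2_regularized_cost(base_cost, weights, lam=0.01):
    # Original cost plus a penalty on large weights. The penalty term is
    # what "twists" the cost function: SGD now has to balance fitting the
    # training data against keeping the weights small.
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return base_cost + penalty

# Example: the same base cost, but larger weights cost more
weights_small = [np.array([0.1, -0.2]), np.array([0.05])]
weights_large = [np.array([3.0, -4.0]), np.array([2.5])]
print(l2_regularized_cost(1.0, weights_small))
print(l2_regularized_cost(1.0, weights_large))
```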
Another approach is to have a separate data set. The example set you use to train the neural network is called the training data. To detect overfitting, you keep another example set called validation data. The trick with validation data is that you never use it to teach the network; in other words, the neural net doesn’t get a chance to memorize the validation data. Training proceeds as the typical stochastic gradient descent, using mini-batches to constantly update the weights and biases, but at the end of every epoch you check how well your neural network does on the validation data. Over several epochs, if performance on the validation data keeps improving, then the machine is genuinely improving, so keep going. If, however, performance stagnates or decreases, then overfitting may be occurring, so stop training your model. After training completes, use a third data set, called test data, to determine the final performance of your neural network.
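Here is a sketch of that validation-based early-stopping loop; train_one_epoch and evaluate_on_validation are hypothetical placeholders for whatever training and scoring code you’re using.

```python
def train_with_early_stopping(train_one_epoch, evaluate_on_validation,
                              max_epochs=200, patience=10):
    # train_one_epoch() runs normal SGD over all mini-batches once;
    # evaluate_on_validation() returns a score on data the network never
    # trains on. Both are placeholders for your own code.
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        score = evaluate_on_validation()
        if score > best_score:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            # Validation performance has stopped improving: likely overfitting.
            break
    return best_score
```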
Lastly, overfitting kicks in sooner the smaller your training data set is. So, if you make your training data really, really big, then overfitting happens later and later. Simple, but effective. There are two ways to do this: do the leg work and gather more training data (get more pictures of cats and dogs), or manipulate the existing training data. For example, a picture of a dog is still a picture of a dog if you flip it, rotate it, or scale it up or down a tiny bit. By applying one or a combination of these manipulations to each example, you could easily increase your training data set tenfold. With more training data, it’s harder to memorize the answers, so overfitting is delayed.
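For image data, the simplest of these manipulations are just array operations. Here is a minimal NumPy sketch, assuming each example is a 2-D array; it uses flips and coarse 90-degree rotations for brevity, whereas in practice you would use small rotations and scalings like those described above.

```python
import numpy as np

def augment(image):
    # Return several label-preserving variations of one training image:
    # the original, a left-right mirror, and rotated copies.
    variants = [image, np.fliplr(image)]
    for k in (1, 2, 3):                 # 90, 180, 270 degree rotations
        variants.append(np.rot90(image, k))
    return variants

image = np.arange(9).reshape(3, 3)      # stand-in "picture of a dog"
print(len(augment(image)))              # one example becomes five
```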
There are many more techniques to prevent overfitting, such as dropout, but the point of this section is to alert the reader that there is such a thing as “too much learning,” and that more epochs may mean more problems.
Deep Learning
If we can’t guarantee performance improvement by increasing the number of epochs we train our neural networks, then perhaps we can improve it by making the neural network more complex? Instead of having 1 or 2 hidden layers, why not have a dozen? Surely more neurons and computations means that the network is more powerful, and therefore will perform better? Unfortunately, it appears that’s not really the case. Mathematically, the gradient of earlier layers (near the input layer) becomes vanishingly small the more layers you have, so even if you add a dozen hidden layers, most of them don’t really learn, so you haven’t really helped the situation. As it turns out, teaching a neural network with many hidden layers is a totally different animal, so it gets its own name: deep learning.
Deep learning is also important for more than just improving performance on a simple task; object classification, object detection, image segmentation and natural language processing are extremely complex tasks that require much more than a small, shallow network. Additionally, while a shallow network could theoretically perform almost any task, a deeper architecture can perform the same task with fewer neurons (though with more layers) as long as the deep neural network is properly trained.
Since teaching deep neural networks is so challenging, one way to get around this is using convolutional neural networks. I’ll elaborate on this in a future post, when I talk about the Jetson Nano and transfer learning.
Conclusion
I hope you enjoyed my overview of what machine learning is, how it works, what its pitfalls are, and some extra topics. This is by no means comprehensive, so I highly suggest you check out my references!
I wanted to give an update on this and future projects. The past couple of months have been very hectic, primarily due to moving across the country and applying to schools. Most of my tools and equipment are in storage, and once school starts it’ll be difficult to continue working on projects, so for now this and future projects are on hiatus. I plan to continue with projects after I graduate, but that’ll be several years from now.
In the meantime, rather than focusing on projects, I’ll shift my focus to concepts. Currently, I’m studying to prepare for school, so I’ll share my progress on that. My focus will primarily be software and mathematics, as I feel that’s where I’m weakest. Once school starts, I’ll also share what I’m learning in class. I may also post information about any student projects I’m working on.
The overview of the PCB is shown above. On the left is the USB connector, and on the right is the RF interface through an SMA connector. On the top and bottom of the board are P2 and P3, which break out the STM32 MCU’s pins, allowing this board to be used as a development board, or for other projects that need USB, RF or both.
There is a vertical white line running through the board, a little off center. This is the cut line. The board is meant to be used as a whole, as a USB dongle, but if the board is cut along that line, then the right side of the board can be used as an RF module. The pins on P2 and P3 expose the SPI interface and power pins of the module, allowing an external MCU, like an Arduino, to control the RF interface.
The PCB is a four layer board. The top is where all the components are, and most of the routing. Layer 2 is a ground plane, layer 3 is the power plane, and the bottom layer is used for some routing, and as a ground plane. I used JLCPCB as the PCB fabricator, and they require a minimum of 4 layers for controlled impedance, so I went with that. 4 layers also allows the circuit board to be very compact, which is useful since a dongle needs to comfortably fit in USB ports.
Top layer
Layer 2, GND
Layer 3, VCC
Bottom layer
Let’s look at the key PCB design areas.
USB
USB Traces, highlighted
While most of the signals on the board are straightforward, USB needs special attention for two reasons: it is differential, and it needs controlled impedance.
The differential nature of USB requires that the D- and D+ trace lengths be close to each other. In the picture above, you can see that one trace travels in a large arc (D-), while the other travels in a small arc (D+). This means one of the differential signals travels a shorter path, causing D+ and D- to be out of sync. To combat this, the shorter trace has a zig-zag path built into it, called an accordion. This forces the shorter path to become longer, making both differential signals travel the same length.
USB specifications state that the differential impedance of USB is 90 Ω. I used JLCPCB’s stack-up information and the Saturn PCB design software to determine what trace widths and spacing are required to achieve this. On JLCPCB’s website, they say the prepreg between the top layer and my ground plane is 0.2 mm thick and has a dielectric constant of 4.6. Plugging this into my design software gives 10 mil wide differential traces, 7 mils apart:
Frankly, controlled impedance and length matching probably aren’t necessary for this project. The traces are very short, so any issues due to out of spec impedance and different lengths of differential signals will be negligible. However, it’s good practice for larger, more complex boards, and it never hurts to be careful.
A couple more things. There are two resistors in series with the D- and D+ signals. Normally, these should be as close to the MCU pads as possible. However, I’m fairly certain the MCU has these resistors internally, making the resistors on the PCB redundant, so their position isn’t very critical; I put them there just in case I need them. Secondly, there is ESD protection right near the USB connector. This component should be as close to the USB connector as possible; the rationale is that if an ESD event occurs, the protection right at the connector will handle it, preventing the ESD from getting far into the board. If the ESD protection were near the center of the board, for example, the ESD pulse would travel through half the board and could potentially damage things along the way; by having the protection right at the connector, ESD is handled immediately.
Clock
The RF transceiver requires a 16 MHz crystal. The crystal requires two load capacitors and a biasing resistor.
The top layer shows a guard ring around the crystal and the passive components. The guard ring prevents leakage current (small amounts of current flowing from copper to copper through the FR4) from getting in or out of the sensitive clock circuitry.
Additionally, the ground planes for the clock circuitry, on the layer 2 and on the bottom, are separated from the rest of the PCB. This ensures that any return currents and nasty transients that occur during oscillation do not impact the rest of the board. Conversely, the separation ensures that any return currents and nasty transients that occur on the rest of the board do not affect the crystal oscillator. Despite the separation, the ground planes must connect to the rest of the circuit’s ground at some point, so there is a small trace connecting the two ground pours on layer 2.
A really helpful document about crystal layouts is ST’s AN2867.
RF
RF Traces, highlighted
The RF traces are in four segments: unbalanced (left), balanced (middle-left), balanced + amplified (middle-right), and filtered (right). For all four segments, the traces should be kept as short as possible.
The unbalanced segment is composed of two traces. Like USB, the two signals should be of equal length. Since we’re operating at 2.4 GHz, making sure they’re the same length is very, VERY important. The other segments are fairly straightforward; just connect pads together with as short a trace as possible.
Two things to note. Firstly, I typically use 45 degree bends in my traces, but for RF I decided to use smooth corners. I read that RF signals have problems at 45 or 90 degree bends, and that a smooth corner is better. This may be an old wives’ tale, but I think it looks cool (very scientific) and helps distinguish RF traces from regular traces. Secondly, like USB, RF traces require controlled impedance. After the balun, the RF traces need 50 Ω impedance. Rather than using the Saturn PCB design software again, I consulted JLCPCB’s online tool:
As you can see, to get a 50 Ω impedance trace, the trace needs to be 11.55 mils wide. I didn’t do this for the USB differential traces because I didn’t notice this tool could handle differential traces too, but fortunately the results are nearly identical. If you have a choice between a PCB vendor’s tool and a generic PCB tool, go with the vendor’s tool, since it’s based on the vendor’s actual stack-up.
RF shielding
One last thing to note about RF. When a signal is present on a trace, we generally think of the signal as being confined to the copper. For example, when you drive a logic high on a piece of copper, the logic high doesn’t permeate beyond the boundaries of the copper. However, this becomes less and less true as the frequency of the signal increases. Even SPI clock signals, which are in the megahertz range, can start affecting signal traces that come too close to them. For RF, this is a much more prominent issue. RF signals can impact electronics that aren’t even on the same PCB through EMI emission. In this case, emission is good, since we’re trying to transmit an RF signal. But what we don’t want is uncontrolled, or unintentional emission. We want the RF to be emitted by the antenna only, and nowhere else.
To prevent the RF traces from radiating (except through the antenna), we need to button up the edge of the PCB where the RF signals are active. You can think of it like a wall or a dam: the RF traces will emit EMI that tries to leave the PCB, and the shielding keeps it in. For the shielding to be effective, the vias should be spaced no more than one twentieth of the wavelength we’re trying to keep in. In this case, we’re using a 2.4 GHz signal, which has a wavelength of 3E8/2.4E9 = 0.125 m, or about 5 inches. One twentieth of 5 inches is 250 mils, so the vias must be placed closer together than that.
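The quick arithmetic behind that spacing, as a sanity check:

```python
# Via fence spacing: vias should be no farther apart than 1/20 of a wavelength.
c = 3e8                          # speed of light, m/s
f = 2.4e9                        # signal frequency, Hz
wavelength_m = c / f             # 0.125 m, roughly 4.9 inches
max_spacing_in = (wavelength_m / 0.0254) / 20
print(max_spacing_in * 1000, "mils")   # about 246 mils, i.e. ~250 mils
```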
Conclusion
I’ve placed an order for the PCB and the BOM, and they’ll be here in a couple of weeks. When they arrive, let’s take a look at programming the MCU!
Above is the application circuit for the RF transceiver, NRF24L01P. The chip transmits and receives RF on pins ANT1 and ANT2. This chip has a balanced output, and is designed to use a balanced antenna:
I find it helpful to think of it like a differential signal in electronics. If a signal is single ended, then it is referenced to ground; if it is differential, it has positive and negative halves, which are equal and opposite. Similarly, the NRF24L01P outputs equally and oppositely to the two legs of the dipole antenna. However, I don’t want to use a dipole antenna; I’d rather use the rubber ducky antenna found on walkie-talkies, since they’re more compact and work equally well in all directions. This presents a problem: the NRF24L01P wants a balanced antenna, but the antenna I want to use is unbalanced. I’ll need some sort of interface or translator.
Fortunately for us, the application circuit shown above does exactly that. L1, L2, L3, C5 and C6 form a matching network, turning the balanced ANT1 and ANT2 signals into a ground referenced signal, which is compatible with a rubber ducky antenna. This was the approach I showed in my original schematic, but I have changed to a more compact and higher performance solution.
A balun, a portmanteau of BALanced and UNbalanced, transforms a balanced signal into an unbalanced one. It can also be very, very compact; the one I’m using (2450BM14A0002T) is just 0.8 mm × 1.6 mm! With this, the schematic becomes much simpler:
New, simpler matching network
The balun, on top of changing the signal, provides filtering:
From balun datasheet
The parameter we’re interested in here is IL_dB, the blue trace. IL stands for insertion loss, which indicates how much of the signal makes it through the part; the rest is absorbed or reflected. Think of it like shining light onto a piece of glass: some of it goes through, some of it is lost as heat, and some of it is reflected.
In the graph above, we’re interested in the frequencies around 2.5 GHz, which is the frequency that the NRF24L01P operates at. Let’s look at the blue trace:
2~3 GHz: insertion loss is almost 0 dB. This means that most of the signal goes straight through the balun with very little attenuation.
> 3 GHz: insertion loss falls rapidly. This means that most of the signal doesn’t make it through the balun
In other words, the balun acts as a low pass filter, which is very helpful. Now, instead of accidentally transmitting high frequency signals, we’ll only transmit the signals we want. The balun is also bidirectional, which means signals going through the balun to get to the NRF24L01P will also be cleaned up, which will hopefully improve communication.
I made a very similar change to the low pass filter attached to the PA/LNA. Previously, I used two pi filters to clean up the RF signals coming from and going to the antenna. Now, I’m using a single component to perform that task. This is not a balun; the signal comes in and goes out unbalanced. But, like the balun, it only lets certain frequencies through:
From LPF datasheet
S11 and S22 show how much reflection occurs, while S21 shows how much of the signal goes through the LPF. S21, the blue trace, is at 0 dB for frequencies below 3 GHz, which means signals pass through the LPF mostly unchanged. After 3 GHz, however, S21 drops off rapidly, which means the signal no longer makes it through the filter. S11 and S22, meanwhile, increase rapidly, meaning any signal coming into the filter sees a brick wall and just bounces back, like shining a light on a mirror. The LPF will therefore only let our RF dongle receive and transmit signals in the frequencies we’re interested in!
New, simpler LPF
USB Shield
New USB circuit
Apparently, how GND and the USB shield should be connected is a controversial topic. Some recommend connecting the two directly, some suggest no connection, and some say to connect the two through a capacitor or resistor. Yet another school of thought says to connect the two through a capacitor and resistor in parallel. I decided to go with the one that makes the most sense to me, which is the last one.
In theory, the shield and GND are already connected together on the host side (if I plug my dongle into a computer or laptop, then the computer or laptop is the host). This isn’t always the case, though, due to designs that don’t comply with the USB standard. To make sure that GND and the shield don’t develop a DC voltage difference, R16 was added to connect the two. No significant amount of current can flow through such a large resistor, but the resistor will gradually bring GND and the shield to the same voltage if a difference exists.
The capacitor is for EMI purposes. Think of a dipole antenna and how it functions. The dipole antenna transmits a signal by creating a time varying voltage across its two legs, which allows it to propagate electromagnetic signals. It’s possible to make a dipole antenna accidentally; all you need is two large pieces of metal or copper. GND tends to be very large, almost always the largest piece of copper on the PCB, and shielding or chassis can be very large as well. If a time varying voltage appears across GND and shield, then the two pieces of metal essentially form a dipole antenna, emitting electromagnetic radiation, which can be disruptive to other electronics. C27 prevents this by having very low impedance at high frequencies; to an RF signal, GND and shield are shorted through C27.
I’m almost done with the layout, so I’ll present that next time.
This really isn’t necessary for this project, but I made the schematic hierarchical. This means that at the top is a schematic that contains blocks, then each block contains a schematic. Hierarchical schematics are really useful for complex boards, and completely overkill for this project, but I wanted to try it to get experience.
All the way to the left is the USB block, which has the USB connector. This connects to the MCU block, providing it with the USB data lines. The MCU block then connects to the RF block, which has the RF transceiver and PA/LNA chip; the interface between them is SPI and some GPIO signals. At the top is the power block, which turns USB’s 5 V into 3.3 V for the various ICs. This block also generates analog and digital signals for the MCU to check the state of the system power.
USB Block
The USB block shows the USB A plug (P1). The bus voltage goes through a ferrite bead and fuse, then to the power block. The data lines go through ESD protection, then to the MCU block; this is a common best-practice. C28 provides decoupling for the bus voltage, while C27 is for EMI purposes; the value of C27 may need to be tweaked.
MCU Block
U1 is the microcontroller I’ll be using, the STM32L072KBU6. This chip has plenty of Flash and RAM, has a USB interface built in, and doesn’t need a crystal! Handy!
The only two interfaces I’ll be using on the chip are USB, to communicate with my PC, and SPI, to communicate with the RF transceiver. Additionally, I need some GPIO and ADC inputs for digital logic and for checking rail voltages. There are a bunch of unused interfaces, and it seems like such a waste to do nothing with them, so I’m going to bring most of them out to a header; that way, this board can be used as a USB-to-SPI or USB-to-I2C bridge, or any combination of interfaces.
In order to program the board, I plan to use the factory programmed bootloader (DFU) on the chip. When the chip is reset, it reads BOOT0. If it is high during this time, then the MCU prepares to reprogram itself. If it is low, then the MCU runs the regular program it already has. This means that I can reprogram the MCU by holding down SW2, then hitting SW1.
RF Block
U4, the RF transceiver, receives commands and provides information to the MCU through the SPI interface. CE is used to enter and exit receive and transmit mode, and IRQ is used to alert the MCU that something has changed (eg. transmission complete, received a new message, etc.). The state machine inside the chip is shown below:
The transceiver needs a crystal; mine has a load capacitance of 9 pF, so two 15 pF load capacitors seem good for this purpose. In general, the value of each capacitor should be a couple of picofarads less than twice the load capacitance; in this case, the capacitors should be a bit less than 18 pF. The two capacitors appear in series from the crystal’s point of view, and there’s a couple of picofarads of stray capacitance on top of that. In this circuit, the crystal sees (15/2) pF + stray capacitance as its load. If the stray capacitance is 2 pF, then the crystal sees 9.5 pF of load capacitance, which is probably close enough.
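The quick check on that load capacitance, as a sketch (the 2 pF of stray capacitance is an estimate, as above):

```python
# Crystal load capacitance check. The two load capacitors appear in series
# from the crystal's point of view, plus some stray board capacitance.
c_load_each = 15.0   # pF, each load capacitor
c_stray = 2.0        # pF, estimated stray capacitance
c_seen = (c_load_each * c_load_each) / (c_load_each + c_load_each) + c_stray
print(c_seen, "pF")  # 9.5 pF, close to the crystal's specified 9 pF load
```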
U3 is the PA/LNA, which amplifies incoming and outgoing RF signals. Besides its input and output RF pins (TXRX and ANT), it has TXEN and RXEN for control:
In the schematic, RXEN is tied to CE, the chip enable for the RF transceiver. CE is high when the transceiver is receiving or transmitting, and low otherwise, which is in accordance with the table shown above. TXEN was tied to VDD_PA in an example circuit I saw, which I didn’t understand, so I did some testing and found the following:
Two Arduinos communicating with each other using NRF24L01P modules
VDD_PA on the transmitter, when two NRF24L01P are communicating
VDD_PA on the receiver, when two NRF24L01P are communicating
The setup shows two Arduinos, each with an NRF24L01P module. One of the modules is transmitting, and the other is receiving. The images above show what VDD_PA looks like on the transmitting and receiving module, when communication is occurring.
Let’s look at the transmitter. The first large pulse, when VDD_PA goes up to around 1.8 V, is when the transmitter is sending a message; 4 bytes in this test. It takes about 160 us. About 100 us after that is a smaller pulse, which comes up to around 800 mV, and lasts for about 120 us. This is the acknowledge from the receiver, which tells the transmitter that the message was transmitted successfully. The second bump is unusually large because the two NRF24L01P modules are very close together; in the setup photo you can see the antennas are almost touching. When the distance between the two modules is increased, or the antennas are out of alignment (no longer parallel to each other), the second pulse becomes much, much smaller. VDD_PA on the receiver is basically the same as on the transmitter, but backwards; the receiver receives the signal (small pulse on the left), and then sends an acknowledgement (larger pulse on the right).
This test shows the following:
VDD_PA is low when not receiving or transmitting
VDD_PA is 1.8 V when transmitting
VDD_PA varies in voltage when receiving. When the antennas are right next to each other, VDD_PA goes up to 800 mV. When the two antennas are 15 cm apart, pulse goes down to 370 mV; at 30 cm apart, pulse goes down to 120 mV
According to the RFX2401C datasheet, a logic high is 1.2 V and higher. This means that it’s unlikely that receiving a signal will cause VDD_PA to become large enough that TXEN will be read as a logic high. Therefore, tying VDD_PA to TXEN should be fine since VDD_PA will only be a logic high when NRF24L01P is transmitting, which is exactly what we want.
Let’s summarize RXEN and TXEN. When nothing is happening, RXEN and TXEN are low since VDD_PA and CE are low. When receiving, RXEN will be high and TXEN will be low, putting U3 in receive mode, since VDD_PA is low, while CE is high. When transmitting, RXEN and TXEN will be high, putting U3 in transmit mode, since VDD_PA is high, and CE is high. This means that the MCU doesn’t need any control signals dedicated solely to the PA/LNA!
Power Block
Finally, the power block. VBUS is the +5 V from USB, and VIN is power that can be supplied externally. U2 is an LDO that outputs 3.3 V.
I have voltage dividers to measure or detect VBUS, VIN and VCC. According to the USB specification, a self-powered device must monitor VBUS and only enable the pull-up on its USB data line while VBUS is present, so the MCU needs to know where its power is coming from. If the device is powered by USB, then VBUS_DET will be about 3 V, a logic high; if the device is powered by VIN, then VBUS_DET will be 0 V, a logic low. However, it is possible that VBUS has power on it while VIN is also powered at a higher voltage, so the system is self-powered despite having USB power. To address that case, the MCU must also measure VIN: if VIN is higher than VBUS, the system is self-powered; if VIN is less than VBUS, the system is bus powered. VCC is also measured, but this is for monitoring / self-test purposes.
That should do it for the schematic! Next, let’s look at the layout.
I know I just started the RC Car project, but upon further reflection, I need to do another project first. For the RC car, I wanted to add an RF transceiver among many other things to the board, but I decided to do a separate, smaller project dedicated to trying out RF routing first, something I have little experience in. By doing this, I will reduce risk by getting experience, and it’ll yield a very useful product I’ve wanted to make for a long time. This project should be relatively straightforward: an MCU, an RF transceiver, and a power amplifier. Let’s take a look at the block diagram:
RF Dongle, FBD
First, power: the USB provides +5V to an LDO through a diode. The LDO generates 3.3 V, which powers everything on the board. Alternatively, VIN will power the LDO, the diode preventing VIN from pumping current into the USB interface. This means the board can be bus powered or self-powered. This is important for the USB interface, as the microcontroller must know how it is powered, and communicate that to the USB interface. I’ll elaborate in a future post.
The signal chain: the microcontroller will convert USB into a simpler interface like SPI, the RF transceiver will convert that to something that can be transmitted and received using an antenna, and the power amplifier will boost the range and strength of the signal. This way, two of these dongles will allow two computers to communicate, or allow my computer to receive the wireless telemetry from the RC car. I’m sure this will be very useful for future projects. Let’s look at the parts:
Microcontroller: I’ve been quite taken with the STM32 microcontrollers lately; they offer a wide range of products, making the series suitable for almost any application, and the STM32 Cube IDE is very user-friendly. For this reason, I was looking into STM32s for this project (and the RC car). Since this project is relatively straightforward, I was looking at the cheapest, weakest microcontrollers that supported USB, and I originally wanted to use the STM32F0 series. Unfortunately, at this time, they’re very difficult to get; Mouser, Digikey, etc. didn’t have them. So I switched to the STM32L0 series, the STM32L072KB to be specific. While I praise ST’s IDE for its ease of use, we must be wary of the size of the code it outputs; the vendor-provided Hardware Abstraction Layer (HAL) consumes a lot of memory, and the USB middleware is huge. The microcontroller I chose has 128 KB of Flash and 20 KB of RAM, and I’ll need about a third of each for my application.
RF transceiver: I’m going with the trusty NRF24L01P for this application. It’s very popular and well supported by the Maker/DIY community, so it’s a safe bet for my first dip into RF design. I read the datasheet: it can do what I need just fine, and it’s surprisingly simple to design for, schematically speaking.
Power amplifier: I honestly don’t know much about power amplifiers, but a lot of RF modules that use the NRF24L01P use the RFX2401C as the amplifier. It has a Low Noise Amplifier (LNA) to amplify the received RF, and a Power Amplifier (PA) for the outgoing RF, so it can amplify signals in both directions. It’s also got a shockingly simple electrical interface, so that’s a plus. The power amplifier probably isn’t necessary for indoor communication, but if I take the RC car outside, the amplifier will be very useful in a parking lot or something like that. Using one will also give me a chance to gain more experience with it.
I spent most of my time on this project looking for a microcontroller; most of that time was discovering my chosen chips were either unobtainable, or didn’t have enough Flash or RAM. Now that that’s behind us, I’ll start working on the schematic.
On a side note, I plan to break out a lot of the pins on the microcontroller to headers, so if desired this board can be used for other things (e.g. reading ADCs over USB, USB to I2C, etc.)