For the neural network to learn, we define a cost function. This tells the system what counts as a good answer and what doesn't. In the digit-classification example, if the network correctly classifies an image, the activation of the neuron for that digit should be very high and all the others very low. If instead several output neurons fire with high values, the cost is high and we don't like the answer.
The cost of a single example is defined as the sum of the squared differences between what the network gave me and what I expected. By averaging this cost over many training examples we get an idea of how well the network is performing.
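A minimal sketch of this cost in Python/NumPy, assuming a 10-class digit classifier; `example_cost` and the sample outputs below are illustrative, not from the source:

```python
import numpy as np

def example_cost(output, label, num_classes=10):
    """Sum of squared differences between the network's output
    and a one-hot target for the correct digit."""
    target = np.zeros(num_classes)
    target[label] = 1.0   # ideal answer: 1 for the right digit, 0 elsewhere
    return np.sum((output - target) ** 2)

# A confident, correct output has low cost; a confused one has high cost.
good = np.array([0.02, 0.01, 0.95, 0.03, 0.00, 0.01, 0.00, 0.02, 0.00, 0.01])
bad  = np.array([0.40, 0.10, 0.50, 0.30, 0.20, 0.10, 0.60, 0.20, 0.10, 0.30])
print(example_cost(good, label=2))  # small
print(example_cost(bad, label=2))   # large
```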
We can think of the cost function as a function of a single parameter that we want to minimize. Starting from a random input and sliding into a local minimum is feasible, but finding the global minimum is extremely hard.
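A toy one-parameter sketch of this, not from the source: `f` is an arbitrary function with one shallow minimum and one deeper (global) minimum, and where the descent ends up depends entirely on where it starts.

```python
def f(x):
    # One deeper (global) minimum near x ≈ -1.30, a shallower local one near x ≈ 1.13
    return x**4 - 3 * x**2 + x

def df(x):
    # Derivative of f, i.e. the slope we step against
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * df(x)  # small step downhill
    return x

print(descend(-2.0))  # lands in the deeper global minimum
print(descend(2.0))   # gets stuck in the shallower local minimum
```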
If we instead think of the cost as a multivariable function of all the weights and biases, then repeatedly nudging the parameters by a small step in the direction of the negative gradient of the cost slowly walks us downhill toward a local minimum.
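The update rule is just w ← w − η∇C(w). A self-contained sketch with a toy quadratic cost (the cost and its gradient here are made up for illustration, not the network's actual cost):

```python
import numpy as np

# Toy cost C(w) = ||w - w_star||^2, whose gradient is ∇C = 2(w - w_star)
w_star = np.array([1.0, -2.0, 0.5])

def grad_C(w):
    return 2.0 * (w - w_star)

w = np.random.randn(3)     # start from random weights
lr = 0.1                   # step size (learning rate)
for _ in range(100):
    w -= lr * grad_C(w)    # small step along the negative gradient

print(w)  # ≈ w_star, the minimum of this toy cost
```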
The algorithm for computing the gradient efficiently is called backpropagation.
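To make the chain-rule idea concrete, here is a minimal one-hidden-layer network trained on a single example with backpropagation and sum-of-squares cost. All sizes, names, and values are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(4)                 # input "pixels"
y = np.array([0.0, 1.0])          # one-hot target
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)
lr = 0.5

for _ in range(500):
    # Forward pass: keep intermediate values for the backward pass
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # Backward pass: chain rule from the cost C = sum((a2 - y)^2),
    # propagated layer by layer
    delta2 = 2 * (a2 - y) * a2 * (1 - a2)     # ∂C/∂z2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # ∂C/∂z1

    # Gradient-descent step on every weight and bias
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1

print(a2)  # after training, close to the target [0, 1]
```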
Learning → Minimizing a cost function.
The learning in this type of neural network doesn't discover the patterns a human would use to recognize images. It weighs all of the different pixels together and adjusts those weights in whatever way better fits the cost function.