by Emil Wallner

# The History of Deep Learning — Explored Through 6 Code Snippets

In this article, we’ll explore six snippets of code that made deep learning what it is today. We’ll cover the inventors and the background to their breakthroughs. Each story includes simple code samples on FloydHub and GitHub to play around with.

If this is your first encounter with deep learning, I’d suggest reading my Deep Learning 101 for Developers.

To run the code examples on FloydHub, install the floyd command line tool. Then clone the code examples I’ve provided to your local machine.

Note: If you are new to FloydHub, you might want to first read the getting started with FloydHub section in my earlier post.

Initiate the CLI in the example project folder on your local machine. Now you can spin up the project on FloydHub with the following command:

``floyd run --data emilwallner/datasets/mnist/1:mnist --tensorboard --mode jupyter``

### The Method of Least Squares

Deep learning started with a snippet of math.

I’ve translated it into Python:

```python
# y = mx + b
# m is slope, b is y-intercept
def compute_error_for_line_given_points(b, m, coordinates):
    totalError = 0
    for i in range(0, len(coordinates)):
        x = coordinates[i][0]
        y = coordinates[i][1]
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(coordinates))

# example
compute_error_for_line_given_points(1, 2, [[3,6],[6,9],[12,18]])
```

This was first published by Adrien-Marie Legendre in 1805. He was a Parisian mathematician who was also known for measuring the meter.

He had a particular obsession with predicting the future location of comets. He had the locations of a couple of past comets. He was relentless as he used them in his search for a method to calculate their trajectory.

It really was one of those spaghetti-on-the-wall moments. He tried several methods, then one version finally stuck with him.

Legendre’s process started by guessing the future location of a comet. Then he squared the errors he made, and finally remade his guess to reduce the sum of the squared errors. This was the seed for linear regression.

Play with the above code in the Jupyter notebook I’ve provided to get a feel for it. `m` is the coefficient and `b` is the constant for your prediction, and the `coordinates` are the locations of the comet. The goal is to find a combination of `m` and `b` where the error is as small as possible.
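To get a feel for what "searching for the smallest error" means, here is a minimal sketch that simply tries a coarse grid of `m` and `b` values and keeps the best pair. The grid range and the 0.1 step size are my own assumptions for illustration, and `mean_squared_error` is a hypothetical helper that mirrors the error function above. It is a brute-force stand-in for Legendre's manual guessing, not his actual procedure.

```python
# Example [x, y] points, the same ones used in the error function example
coordinates = [[3, 6], [6, 9], [12, 18]]

def mean_squared_error(b, m, points):
    """Average squared distance between y and the line m * x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

# Try every combination of m and b from -5.0 to 5.0 in steps of 0.1
# (an arbitrary grid, chosen only for illustration)
candidates = [i / 10 for i in range(-50, 51)]
best_error, best_m, best_b = min(
    (mean_squared_error(b, m, coordinates), m, b)
    for m in candidates
    for b in candidates
)
print("m = {}, b = {}, error = {}".format(best_m, best_b, best_error))
```

This works for two parameters, but the number of combinations explodes as you add more, which is why a smarter search is needed.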

This is the core of deep learning:

• Take an input and a desired output
• Then search for the correlation between the two

Legendre’s method of manually trying to reduce the error rate was time-consuming. Peter Debye was a Nobel prize winner from The Netherlands. He formalized a solution for this process a century later in 1909.

Let’s imagine that Legendre had one parameter to worry about — we’ll call it `X`. The `Y` axis represents the error value for each value of `X`. Legendre was searching for where `X` results in the lowest error.

In this graphical representation, we can see that the value of `X` that minimizes the error `Y` is when `X = 1.1`.

Peter Debye noticed that the slope to the left of the minimum is negative, while it’s positive on the other side. Thus, if you know the value of the slope at any given `X` value, you know in which direction to move `X` to guide `Y` towards its minimum.

This led to the method of gradient descent. The principle is used in almost every deep learning model.

To play with this, let’s assume that the error function is `Error = x⁵ - 2x³ - 2`. To know the slope at any given `X` value, we take its derivative, which is `5x⁴ - 6x²`.
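If you want to convince yourself that the derivative is right, you can compare it with a numerical estimate of the slope. This is a small sketch under my own assumptions: `numerical_slope` is a hypothetical helper using a central difference, and the step `h` and the sample points are arbitrary choices.

```python
def error(x):
    return x**5 - 2 * x**3 - 2

def slope(x):
    # The hand-derived derivative of the error function
    return 5 * x**4 - 6 * x**2

def numerical_slope(f, x, h=1e-6):
    # Central difference: rise over run across a tiny interval around x
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [0.5, 1.0, 1.5]:
    print(x, slope(x), numerical_slope(error, x))
```

The two columns should agree closely, and the sign flips from negative at `x = 0.5` to positive at `x = 1.5`, which is what tells you the minimum sits in between.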

Watch Khan Academy’s video if you need to brush up your knowledge on derivatives.

Debye’s math translated into Python:

```python
current_x = 0.5 # the algorithm starts at x=0.5
learning_rate = 0.01 # step size multiplier
num_iterations = 60 # the number of times to train the function

# the derivative of the error function (x**4 = the power of 4 or x^4)
def slope_at_given_x_value(x):
    return 5 * x**4 - 6 * x**2

# Move X to the right or left depending on the slope of the error function
for i in range(num_iterations):
    previous_x = current_x
    current_x += -learning_rate * slope_at_given_x_value(previous_x)
    print(previous_x)

print("The local minimum occurs at %f" % current_x)
```

The trick here is the `learning_rate`. By going in the opposite direction of the slope it approaches the minimum. Additionally, the closer it gets to the minimum, the smaller the slope gets. This reduces each step as the slope approaches zero.
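To see how much the `learning_rate` matters, here is a rough sketch that runs the same descent with a few different step sizes. The `gradient_descent` wrapper and the specific learning rates are my own choices for illustration. The reference value comes from setting the derivative to zero, which gives `x = sqrt(6/5)` for positive `x`.

```python
import math

def slope_at_given_x_value(x):
    return 5 * x**4 - 6 * x**2

def gradient_descent(start_x, learning_rate, num_iterations):
    x = start_x
    for _ in range(num_iterations):
        # Step against the slope, scaled by the learning rate
        x -= learning_rate * slope_at_given_x_value(x)
    return x

# 5x^4 - 6x^2 = 0 for x > 0 gives x = sqrt(6/5), roughly 1.1
true_minimum = math.sqrt(6 / 5)

for learning_rate in [0.001, 0.01, 0.05]:
    x = gradient_descent(0.5, learning_rate, 60)
    print("learning_rate={}: x={:.4f}, distance to minimum={:.4f}".format(
        learning_rate, x, abs(x - true_minimum)))
```

With a tiny learning rate, 60 iterations are not enough to get close to the minimum. A larger one gets there quickly, but if you push it too far the steps overshoot and the search becomes unstable.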

`num_iterations` is your estimate of how many iterations it takes to reach the minimum. Play with the parameters to get an intuition for gradient descent.

### Linear Regression

Combining the method of least squares and gradient descent, you get linear regression. In the 1950s and 1960s, a group of experimental economists implemented versions of these ideas on early computers. The logic was implemented on physical punch cards, truly handmade software programs. It took several days to prepare these punch cards and up to 24 hours to run one regression analysis through the computer.

Here’s a linear regression example translated into Python so that you don’t have to do it in punch cards:

```python
# Price of wheat/kg and the average price of bread
wheat_and_bread = [[0.5,5],[0.6,5.5],[0.8,6],[1.1,6.8],[1.4,7]]

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i][0]
        y = points[i][1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return [b, m]

gradient_descent_runner(wheat_and_bread, 1, 1, 0.01, 100)
```

This should not introduce anything new. However, it can be a bit of a mind boggle to merge the error function with gradient descent. Run the code and play around with this linear regression simulator.
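One way to check that the merged logic does what you expect is to compare it with the exact least squares line. The sketch below is my own vectorized rewrite of the loop above (the `run_gradient_descent` name and the iteration counts are assumptions), with NumPy's `np.polyfit` as the reference answer.

```python
import numpy as np

wheat_and_bread = np.array([[0.5, 5], [0.6, 5.5], [0.8, 6], [1.1, 6.8], [1.4, 7]])
x, y = wheat_and_bread[:, 0], wheat_and_bread[:, 1]

# Exact least squares line; polyfit returns the slope first, then the intercept
exact_m, exact_b = np.polyfit(x, y, 1)

def run_gradient_descent(x, y, b, m, learning_rate, num_iterations):
    for _ in range(num_iterations):
        residual = y - (m * x + b)
        # Gradients of the mean squared error with respect to b and m
        b_gradient = -2 * residual.mean()
        m_gradient = -2 * (x * residual).mean()
        b -= learning_rate * b_gradient
        m -= learning_rate * m_gradient
    return b, m

print("exact:           b={:.4f}, m={:.4f}".format(exact_b, exact_m))
for num_iterations in [100, 2000, 20000]:
    b, m = run_gradient_descent(x, y, 1.0, 1.0, 0.01, num_iterations)
    print("{:>5} iterations: b={:.4f}, m={:.4f}".format(num_iterations, b, m))
```

After 100 iterations the estimate is still some way off. Given enough iterations, gradient descent settles on the same line as the exact solution.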

### The Perceptron

Enter Frank Rosenblatt, the guy who dissected rat brains during the day and searched for signs of extraterrestrial life at night. In 1958, he hit the front page of the New York Times with a machine that mimics a neuron: “New Navy Device Learns By Doing”.

If you showed Rosenblatt’s machine 50 sets of two images, one with a mark to the left and the other on the right, it could make the distinction without being pre-programmed. The public got carried away with the possibilities of a true learning machine.

For every training cycle, you start with the input data. Each input is multiplied by a weight, which is initially random. The weighted inputs are then summed up. If the sum is negative, it’s translated into `0`; otherwise, it’s mapped into a `1`.

If the prediction is correct, then nothing happens to the weights in that cycle. If it’s wrong, you multiply the error with a learning rate. This adjusts the weights accordingly.

Let’s run the perceptron with the classic OR logic.

The perceptron machine translated into Python:

```python
from random import choice
from numpy import array, dot, random

# Step function: negative sums map to 0, everything else to 1
# (named one_or_zero because 1_or_0 is not a valid Python identifier)
one_or_zero = lambda x: 0 if x < 0 else 1

# OR logic; the third input is a constant bias of 1
training_data = [
    (array([0,0,1]), 0),
    (array([0,1,1]), 1),
    (array([1,0,1]), 1),
    (array([1,1,1]), 1),
]

weights = random.rand(3)
errors = []
learning_rate = 0.2
num_iterations = 100

for i in range(num_iterations):
    input, truth = choice(training_data)
    result = dot(weights, input)
    error = truth - one_or_zero(result)
    errors.append(error)
    weights += learning_rate * error * input

# After training, print the prediction for every input pair
for x, _ in training_data:
    result = dot(x, weights)
    print("{}: {} -> {}".format(x[:2], result, one_or_zero(result)))
```

In 1969, Marvin Minsky and Seymour Papert destroyed the idea. At the time, Minsky and Papert ran the AI lab at MIT. They wrote a book proving that the perceptron could only solve linear problems. They also debunked claims about the multi-layer perceptron. Sadly, Frank Rosenblatt died in a boat accident two years later.

In 1970, a Finnish master’s student discovered the theory to solve non-linear problems with multi-layered perceptrons. But because of the mainstream criticism of the perceptron, the funding of AI dried up for more than a decade. This was known as the first AI winter.

The power of Minsky and Papert’s critique was the XOR problem. The logic is the same as the OR logic with one exception — when you have two true statements (1 & 1), you return False (0).

In the OR logic, it’s possible to divide the true combinations from the false one with a single straight line. But you can’t divide the XOR logic with one linear function.
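You can watch the perceptron hit this wall yourself. The sketch below trains the same kind of perceptron on OR and on XOR. To keep the result reproducible, I loop through the data in order instead of sampling randomly, and the `train_perceptron` and `count_correct` helpers, the seed, and the number of epochs are my own assumptions.

```python
import numpy as np

def step(x):
    return 0 if x < 0 else 1

def train_perceptron(training_data, learning_rate=0.2, num_epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random(3)
    for _ in range(num_epochs):
        for inputs, truth in training_data:
            # Only wrong predictions produce a non-zero error and move the weights
            error = truth - step(np.dot(weights, inputs))
            weights += learning_rate * error * inputs
    return weights

def count_correct(weights, training_data):
    return sum(step(np.dot(weights, inputs)) == truth
               for inputs, truth in training_data)

# The third input is the constant bias of 1
inputs = [np.array([0, 0, 1]), np.array([0, 1, 1]),
          np.array([1, 0, 1]), np.array([1, 1, 1])]
or_data = list(zip(inputs, [0, 1, 1, 1]))
xor_data = list(zip(inputs, [0, 1, 1, 0]))

for name, data in [("OR", or_data), ("XOR", xor_data)]:
    weights = train_perceptron(data)
    print("{}: {} of 4 correct".format(name, count_correct(weights, data)))
```

OR ends up fully solved, while XOR always leaves at least one case wrong no matter how long you train, because no single line separates its outputs.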

### Artificial Neural Networks

By 1986, several experiments proved that neural networks could solve complex nonlinear problems. At the time, computers were 10,000 times faster compared to when the theory was developed. This is how Rumelhart introduced the legendary paper:

“We describe a new learning procedure, back-propagation, for networks of neuron-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.” (Nature 323, 533–536, 09 October 1986)

To understand the core of this paper, we’ll code the implementation by DeepMind’s Andrew Trask. This is not a random snippet of code. It’s been used in Andrej Karpathy’s deep learning course at Stanford and Siraj Raval’s Udacity course. It solves the XOR problem, thawing the first AI winter.

Before we dig into the code, play with this simulator for one to two hours to grasp the core logic. Then read Trask’s blog post.

Note that the added `1` at the end of each row in the `X_XOR` data is a bias neuron.

It has the same behavior as a constant in a linear function:

```python
import numpy as np

X_XOR = np.array([[0,0,1], [0,1,1], [1,0,1],[1,1,1]])
y_truth = np.array([[0],[1],[1],[0]])

np.random.seed(1)
syn_0 = 2*np.random.random((3,4)) - 1
syn_1 = 2*np.random.random((4,1)) - 1

def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

def sigmoid_output_to_derivative(output):
    return output*(1-output)

for j in range(60000):
    layer_1 = sigmoid(np.dot(X_XOR, syn_0))
    layer_2 = sigmoid(np.dot(layer_1, syn_1))
    error = layer_2 - y_truth
    layer_2_delta = error * sigmoid_output_to_derivative(layer_2)
    layer_1_error = layer_2_delta.dot(syn_1.T)
    layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
    syn_1 -= layer_1.T.dot(layer_2_delta)
    syn_0 -= X_XOR.T.dot(layer_1_delta)

print("Output After Training: \n", layer_2)
```

Backpropagation, matrix multiplication, and gradient descent combined can be hard to wrap your mind around. The visualizations of this process are often a simplification of what’s going on under the hood. Focus on understanding the logic behind it, but don’t worry too much about having a mental picture of it.
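One way to build trust in the logic without a mental picture is a gradient check: nudge each weight a tiny bit, see how the error changes, and compare that with what backpropagation says. This sketch assumes the error being minimized is half the sum of squared differences, which matches the deltas in the XOR code. The helper names (`loss`, `backprop_gradients`, `numeric_gradient`) are mine, not Trask's.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(syn_0, syn_1):
    # Assumed error measure: half the sum of squared differences
    layer_1 = sigmoid(X @ syn_0)
    layer_2 = sigmoid(layer_1 @ syn_1)
    return 0.5 * np.sum((layer_2 - y) ** 2)

def backprop_gradients(syn_0, syn_1):
    # Same forward and backward pass as the XOR network
    layer_1 = sigmoid(X @ syn_0)
    layer_2 = sigmoid(layer_1 @ syn_1)
    layer_2_delta = (layer_2 - y) * layer_2 * (1 - layer_2)
    layer_1_delta = (layer_2_delta @ syn_1.T) * layer_1 * (1 - layer_1)
    return X.T @ layer_1_delta, layer_1.T @ layer_2_delta

def numeric_gradient(f, w, eps=1e-5):
    # Nudge one weight at a time and measure how the loss changes
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        original = w[idx]
        w[idx] = original + eps
        up = f()
        w[idx] = original - eps
        down = f()
        w[idx] = original
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
syn_0 = 2 * rng.random((3, 4)) - 1
syn_1 = 2 * rng.random((4, 1)) - 1

grad_0, grad_1 = backprop_gradients(syn_0, syn_1)
num_0 = numeric_gradient(lambda: loss(syn_0, syn_1), syn_0)
num_1 = numeric_gradient(lambda: loss(syn_0, syn_1), syn_1)
max_diff = max(np.abs(grad_0 - num_0).max(), np.abs(grad_1 - num_1).max())
print("Largest difference between backprop and numeric gradient:", max_diff)
```

If the difference is tiny, the backward pass computes the slope of the error for every weight at once, which is gradient descent on a larger scale.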

Also, look at Andrej Karpathy’s lecture on backpropagation, play with these visualizations, and read Michael Nielsen’s chapter on it.

### Deep Neural Networks

Deep neural networks are neural networks with more than one layer between the input and output layer. The notion was introduced by Rina Dechter in 1986. But it didn’t gain mainstream attention until 2012. This was soon after IBM Watson’s Jeopardy victory and Google’s cat recognizer.

The core structure of deep neural networks has stayed the same. But they are now applied to several different problems. There has also been a lot of improvement in regularization.

Regularization originated in 1963 as a set of math functions to simplify noisy earth data. These functions are now used in neural networks to improve their ability to generalize.
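To make the idea concrete, here is a minimal sketch of an L2 penalty (often called ridge or Tikhonov regularization) on a plain linear model. The data is made up for illustration, and `ridge_weights` is a hypothetical helper. It is not how a deep learning library implements regularization, just the underlying math in its simplest form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noisy data: 20 samples, 5 features, only the first feature matters
X = rng.normal(size=(20, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=20)

def ridge_weights(X, y, penalty):
    """Solve (X^T X + penalty * I) w = X^T y; penalty=0 is plain least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

for penalty in [0.0, 1.0, 10.0]:
    w = ridge_weights(X, y, penalty)
    print("penalty={}: weights={}, size={:.3f}".format(
        penalty, np.round(w, 3), np.linalg.norm(w)))
```

The larger the penalty, the smaller the weights become. Smaller weights make the model less able to chase noise in the training data, which is the trade that helps it generalize.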

A large share of the innovation is due to computing power. This improved researchers’ innovation cycles: what took a supercomputer one year to calculate in the mid-eighties takes half a second with today’s GPU technology.

The reduced cost in computing and the development of deep learning libraries have now made it accessible to the general public. Let’s look at an example of a common deep learning stack, starting from the bottom layer:

• GPU > Nvidia Tesla K80. The hardware commonly used for graphics processing. Compared to CPUs, they are on average 50–200 times faster for deep learning.
• CUDA > low level programming language for the GPUs
• CuDNN > Nvidia’s library to optimize CUDA
• Tensorflow > Google’s deep learning framework on top of CuDNN
• TFlearn > A front-end framework for Tensorflow

Let’s have a look at the MNIST image classification of digits, the “Hello World” of deep learning.

Implemented in TFlearn:

```python
from __future__ import division, print_function, absolute_import
import tflearn
from tflearn.layers.core import dropout, fully_connected
from tensorflow.examples.tutorials.mnist import input_data
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.normalization import local_response_normalization
from tflearn.layers.estimator import regression

# Data loading and preprocessing
mnist = input_data.read_data_sets("/data/", one_hot=True)
X, Y, testX, testY = mnist.train.images, mnist.train.labels, mnist.test.images, mnist.test.labels
X = X.reshape([-1, 28, 28, 1])
testX = testX.reshape([-1, 28, 28, 1])

# Building convolutional network
network = tflearn.input_data(shape=[None, 28, 28, 1], name='input')
network = conv_2d(network, 32, 3, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = conv_2d(network, 64, 3, activation='relu', regularizer="L2")
network = max_pool_2d(network, 2)
network = local_response_normalization(network)
network = fully_connected(network, 128, activation='tanh')
network = dropout(network, 0.8)
network = fully_connected(network, 256, activation='tanh')
network = dropout(network, 0.8)
network = fully_connected(network, 10, activation='softmax')
network = regression(network, optimizer='adam', learning_rate=0.01,
                     loss='categorical_crossentropy', name='target')

# Training
model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit({'input': X}, {'target': Y}, n_epoch=20,
          validation_set=({'input': testX}, {'target': testY}),
          snapshot_step=100, show_metric=True, run_id='convnet_mnist')
```

There are plenty of great articles explaining the MNIST problem: here and here.

### Let’s sum it up

As you see in the TFlearn example, the main logic of deep learning is still similar to Rosenblatt’s perceptron. Instead of using a binary Heaviside step function, today’s networks mostly use ReLU (rectified linear unit) activations.
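The difference between the two activations is easy to see side by side. A small sketch in NumPy, with function names and sample inputs of my own choosing:

```python
import numpy as np

def heaviside_step(x):
    # The perceptron's activation: 0 for negative sums, 1 otherwise
    return np.where(x < 0, 0, 1)

def relu(x):
    # ReLU: 0 for negative sums, the sum itself otherwise
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(heaviside_step(x))  # values: 0, 0, 1, 1, 1
print(relu(x))            # values: 0, 0, 0, 0.5, 2
```

The step function throws away how large the sum was, while ReLU keeps that information for positive sums and has a slope that backpropagation can use.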

In the last layer of the convolutional neural network, loss equals `categorical_crossentropy`. This is an evolution of Legendre’s least squares, a logistic regression for multiple categories. The optimizer `adam` originates from the work of Debye’s gradient descent.
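Here is a rough sketch of what that loss measures, written in plain NumPy rather than taken from TFlearn. The scores and labels are made up, and the `softmax` and `categorical_crossentropy` functions are my own minimal versions.

```python
import numpy as np

def softmax(scores):
    # Subtract the row maximum for numerical stability, then normalize
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def categorical_crossentropy(probabilities, one_hot_labels):
    # Average negative log-probability assigned to the correct class
    return -np.mean(np.sum(one_hot_labels * np.log(probabilities), axis=1))

# Made-up raw scores for two images and three classes
scores = np.array([[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]])
labels = np.array([[1, 0, 0], [0, 0, 1]])
wrong_labels = np.array([[0, 0, 1], [1, 0, 0]])

probabilities = softmax(scores)
print("probabilities per image sum to:", probabilities.sum(axis=1))
print("loss with correct labels:", categorical_crossentropy(probabilities, labels))
print("loss with wrong labels:  ", categorical_crossentropy(probabilities, wrong_labels))
```

Just like the squared error, the number gets smaller as predictions improve. The difference is that it compares probabilities across categories instead of distances along a line.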

Tikhonov’s regularization notion is widely implemented in the form of dropout layers and regularization functions, `L1/L2`.
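Dropout is simple enough to sketch in a few lines. As far as I can tell, the `0.8` passed to `dropout` in the TFlearn example is the probability of keeping a neuron, so I use the same value here. The `dropout` function below is my own minimal version of the common "inverted dropout" trick, not TFlearn's implementation.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    # Randomly switch off neurons, then scale the survivors up so the
    # expected total activation stays the same
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 128))  # made-up layer output
dropped = dropout(activations, keep_prob=0.8, rng=rng)

print("share of neurons switched off:", (dropped == 0).mean())
print("average activation after dropout:", dropped.mean())
```

Because a different random set of neurons is silenced on every training step, the network can't rely too heavily on any single one.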

If you want a better intuition for neural networks and how to implement them, read my previous post: Deep Learning 101 for Coders.

Thanks to Ignacio Tonoli de Maussion, Brian Young, Paal Rgd, Tomas Moška, and Charlie Harrington for reading drafts of this. Code sources are included in the Jupyter notebooks.