Backpropagation is one of the core algorithms in machine learning. It is mainly used for training neural networks, and at its heart it is simply gradient descent combined with the chain rule. What if we told you that understanding and implementing it is not that hard? Anyone who knows basic mathematics and the basics of Python can learn this in 2 hours. Let's get started.
Though there are many high-level overviews of the backpropagation and gradient descent algorithms, I have found that unless you implement them from scratch, you cannot really appreciate many of the ideas behind neural networks.
Getting Started:
- Clone this repository in your home directory: git clone https://github.com/cloudxlab/iitr-deep-learning-spl-tf2
- Open the 'backprop_from_scratch.ipynb' notebook in the folder
Writing Functions – Python
A function is basically a sequence of operations. It may take inputs via arguments, and it may give back a result by returning a value.
Let us define a function sq which returns the square of whatever we pass as an argument.
You can visualize a function as a box: input goes in, output comes out.
Definition
In [2]:
def sq(x):
    return x*x
In [3]:
sq(4)
Out[3]:
16
In [4]:
sq(10)
Out[4]:
100
A function can also take multiple arguments.
In [5]:
# function could also take multiple arguments
# Here a and b are the two arguments and the function is returning a single value.
def add(a, b):
    return a+b

add(10, 20)
Out[5]:
30
In [8]:
add(10.5, 120)
Out[8]:
130.5
Variable arguments to a function
In [9]:
# We can define functions with variable arguments
def sum(*args):
    s = 0
    for i in args:
        s += i
    return s
In [10]:
sum(1)
Out[10]:
1
In [11]:
sum(2, 3)
Out[11]:
5
In [12]:
sum(2, 3, 4)
Out[12]:
9
Calling a function with variable arguments
In [14]:
# If we have a usual function such as the following:
def add(a, b, c):
    return a+b+c

# we can call it using variable arguments by putting an asterisk
arr = range(3)
add(*arr)
Out[14]:
3
Passing a function as argument
In Python, you can pass a function to another function just like any other variable.
In [15]:
def double_it(x):
    return 2*x

def apply_transformation(f, arr):
    result = []
    for a in arr:
        v = f(a)
        result.append(v)
    return result

apply_transformation(double_it, [1,2,3])
Out[15]:
[2, 4, 6]
In [14]:
# Similarly, we can use another function - the one that we defined above.
apply_transformation(sq, [1,2,3])
Out[14]:
[1, 4, 9]
Differentiation
Differentiation is basically about finding the rate of change.
Rate of change of distance is velocity. Rate of change of velocity is acceleration.
If you plot the rate of change of y with respect to x for a straight line y = mx + c, you get a constant horizontal line.
If you plot the rate of change of y with respect to x for a parabola, you get a straight line.
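To see why, here is a quick worked check (not part of the original notebook), using the idea of nudging x by a tiny amount dx:

For the straight line y = mx + c:
    dy/dx = ((m*(x + dx) + c) - (m*x + c)) / dx = m, a constant.

For the parabola y = x^2:
    dy/dx = ((x + dx)^2 - x^2) / dx = 2x + dx, which approaches the straight line 2x as dx shrinks.

This is exactly the idea the numeric differentiation below exploits: pick a tiny dx and compute the ratio of the change in output to the change in input.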
Numeric Differentiation
In [16]:
# Slightly increase x and compute the result.
# Then compute the ratio of change in result with change in x
def diff(fn, x):
    delta = 0.000000000001
    y = fn(x)
    x1 = x + delta
    y1 = fn(x1)
    return (y1 - y)/(x1 - x)
In [17]:
diff(sq, 2)
Out[17]:
4.0
In [17]:
x = 2
2*x
Out[17]:
4
In [18]:
def cb(x):
    return x*x*x

cb(10)
Out[18]:
1000
In [19]:
diff(cb, 2)
Out[19]:
12.0
In [20]:
# If you know symbolic / algebraic differentiation,
# you would know that the derivative of x^3 is 3*x^2.
# The result that we computed above matches the algebraic differentiation
x = 2
3*x**2
Out[20]:
12
Partial Derivative
If a function takes multiple inputs, the rate of change of the output can be computed with respect to any one of the inputs.
In [21]:
# Here is a very simple function
def fn(x1, x2):
    return x1*5 + x2*x2

# Let's calculate the rate of change of the output of this function wrt x1
x1 = 10
x2 = 20
delta_x = 0.000000000001
delta_y = fn(x1+delta_x, x2) - fn(x1, x2)
df_dx1 = delta_y / delta_x
print(df_dx1)
5.002220859751105
In [22]:
# Now, let's calculate the rate of change of the output of this function wrt x2
delta_y = fn(x1, x2+delta_x) - fn(x1, x2)
df_dx2 = delta_y / delta_x
print(df_dx2)
39.960923459148034
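As a quick check (not part of the original notebook), the symbolic partial derivatives of fn(x1, x2) = 5*x1 + x2^2 are df/dx1 = 5 and df/dx2 = 2*x2, which is 40 at x2 = 20. The numeric results above (about 5.00 and 39.96) agree, up to the error introduced by the finite delta.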
In [23]:
### Now, let us generalize it a bit
import numpy as np

# The first argument is the function,
# the second argument is the idx of the input with respect to which we are differentiating.
# The third is a variable argument list - there can be many values. The function is called with these values.
def diffp(fn, idx, *args):
    delta = 0.000000000001
    # Call the underlying function with args as such
    y = fn(*args)
    # Now, increase only one input - at idx.
    args = list(args)
    args[idx] += delta
    # Now, calculate the output after changing one of the inputs a little bit.
    y1 = fn(*args)
    return (y1 - y)/delta
In [24]:
# Let us test it using a very simple function.
# This function can be differentiated with respect to m, c or x.
def line_y(m, c, x):
    return m*x + c

# Notice how we are passing the function as the first argument.
# The second argument is the index of the value with respect to which we are differentiating.
# The remaining values are the arguments to be passed while calling line_y.
diffp(line_y, 0, 1, 2, 3) # differentiate wrt m
Out[24]:
3.000266701747023
In [25]:
diffp(line_y, 1, 1, 2, 3) # differentiate wrt c
Out[25]:
1.000088900582341
In [26]:
diffp(line_y, 2, 1, 2, 3) # differentiate wrt x
Out[26]:
1.000088900582341
Chain Rule
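A brief note added here for clarity (not from the original notebook): the chain rule says that if y depends on t and t depends on x, then dy/dx = dy/dt * dt/dx. In the example below, t = sq(x) = x^2 and y = cb(t) = t^3, so dy/dt = 3*t^2, dt/dx = 2*x, and dy/dx = 3*t^2 * 2*x = 6*x^5. At x = 2 this is 6 * 32 = 192, which matches the numeric results that follow.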
In [76]:
# If we have a sequence of functions as mentioned above, we can represent it using another function:
def sq(x):
    return x*x

def cb(t):
    return t*t*t

def sqcb(x):
    t = sq(x)
    y = cb(t)
    return y
In [77]:
# Differentiating it will be really straightforward
diffp(sqcb, 0, 2)
Out[77]:
192.01706891180947
Third method: apply the chain rule by differentiating each function separately and multiplying the results.
In [28]:
x = 2
dt_dx = diffp(sq, 0, x)
In [29]:
t = sq(2)
dy_dt = diffp(cb, 0, t)
In [30]:
dt_dx * dy_dt
Out[30]:
192.03413934105515
Gradient descent
Gradient descent is what we use when we have a function and want to find the point at which it attains its minimum (or maximum). See more here: https://youtu.be/4Cu9fizAtAo
In [38]:
# Let us first generate some data
import numpy as np

# Generate 100 random numbers
X = 2*np.random.random((100, 1))
In [39]:
# Now, let us generate the values of y corresponding to a line with slope 3 and intercept 4
# plus some noise
y = 4 + 3 * X + .3*np.random.randn(100, 1)
In [40]:
# Let us take a look at the first three records
y[0:3]
Out[40]:
array([[7.46275157], [5.42272672], [5.26000136]])
In [41]:
X[0:3]
Out[41]:
array([[1.13629739], [0.52740146], [0.61857458]])
In [42]:
4+3*X[0:3]
Out[42]:
array([[7.40889218], [5.58220439], [5.85572374]])
In [43]:
# Let us plot it using matplotlib
from matplotlib import pyplot as plt

plt.figure(figsize=(11,4))
plt.scatter(X, y)
plt.show()
# horizontal axis is x and vertical axis is y
In [45]:
def error(m, c, X, y):
    ypred = m*X + c
    err = ypred - y
    return np.sum(err*err)
In [46]:
error(3, 4, np.array([1,2,3]), np.array([1,2,3]))
Out[46]:
200
In [47]:
error(3, 4, X, y)
Out[47]:
11.370852782527962
In [52]:
error(3.04, 4, X, y)
Out[52]:
11.164979946174462
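Before writing the full training loop, here is a minimal sketch of a single gradient-descent update. This cell is not in the original notebook; it assumes the error, diffp, X and y defined above, and the starting guesses 0.5 are arbitrary:

m, c = 0.5, 0.5                        # arbitrary starting guesses
learning_rate = 0.001

dE_dm = diffp(error, 0, m, c, X, y)    # numeric partial derivative of the error wrt m
dE_dc = diffp(error, 1, m, c, X, y)    # numeric partial derivative of the error wrt c

m = m - learning_rate * dE_dm          # step downhill along m
c = c - learning_rate * dE_dc          # step downhill along c
print(error(m, c, X, y))               # the error should be lower than before the step

Repeating this update many times is exactly what grad_fit below does.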
In [83]:
def grad_fit(X, y):
    m = np.random.random()
    c = np.random.random()
    learning_rate = 0.001
    for i in range(100):
        print("i: ", i, " Error: ", error(m, c, X, y))
        dE_dm = diffp(error, 0, m, c, X, y)
        dE_dc = diffp(error, 1, m, c, X, y)
        m = m - learning_rate * dE_dm
        c = c - learning_rate * dE_dc
    return (m, c)
In [84]:
m, c = grad_fit(X, y)
i: 0 Error: 4083.7285975679897 i: 1 Error: 1357.6252100867644 i: 2 Error: 459.3592139100733 i: 3 Error: 163.29051705436652 i: 4 Error: 65.5610011647214 i: 5 Error: 33.1328146842778 i: 6 Error: 22.218172770940345 i: 7 Error: 18.399259306669478 i: 8 Error: 16.927083201555945 i: 9 Error: 16.240253000396848 i: 10 Error: 15.823423526524639 i: 11 Error: 15.506227542324433 i: 12 Error: 15.232001423349436 i: 13 Error: 14.981432087611957 i: 14 Error: 14.747680617493739 i: 15 Error: 14.52798983150515 i: 16 Error: 14.321092393591112 i: 17 Error: 14.125943647997243 i: 18 Error: 13.941884312505449 i: 19 Error: 13.76823354607412 i: 20 Error: 13.604392639484571 i: 21 Error: 13.449850857643208 i: 22 Error: 13.304041069116783 i: 23 Error: 13.1664926916126 i: 24 Error: 13.036677643076112 i: 25 Error: 12.914182340103585 i: 26 Error: 12.798551326229774 i: 27 Error: 12.689583643386808 i: 28 Error: 12.58679533656726 i: 29 Error: 12.48975407746471 i: 30 Error: 12.398176189973645 i: 31 Error: 12.311842550845657 i: 32 Error: 12.230368725423476 i: 33 Error: 12.15349017582421 i: 34 Error: 12.080929504768589 i: 35 Error: 12.01246968129011 i: 36 Error: 11.947881746213836 i: 37 Error: 11.886969741939057 i: 38 Error: 11.829508805337317 i: 39 Error: 11.775232938950976 i: 40 Error: 11.7240189978844 i: 41 Error: 11.675756159108447 i: 42 Error: 11.630169982103672 i: 43 Error: 11.58715098844346 i: 44 Error: 11.546590599235017 i: 45 Error: 11.508352340820826 i: 46 Error: 11.472259383091766 i: 47 Error: 11.438172320591587 i: 48 Error: 11.406064361894988 i: 49 Error: 11.375765097590975 i: 50 Error: 11.347175072319152 i: 51 Error: 11.320198663860097 i: 52 Error: 11.294768656072876 i: 53 Error: 11.27073288512202 i: 54 Error: 11.248069676645077 i: 55 Error: 11.22667985210804 i: 56 Error: 11.206523779032805 i: 57 Error: 11.187457348111439 i: 58 Error: 11.169498175945936 i: 59 Error: 11.15254221747177 i: 60 Error: 11.13654887644347 i: 61 Error: 11.12145424762985 i: 62 Error: 11.10723362998187 i: 63 Error: 11.093814960235605 i: 64 Error: 11.081145535104787 i: 65 Error: 11.069214843723582 i: 66 Error: 11.057944186806685 i: 67 Error: 11.04730077050891 i: 68 Error: 11.037282428082056 i: 69 Error: 11.027789502107026 i: 70 Error: 11.018836113798875 i: 71 Error: 11.010404557679433 i: 72 Error: 11.00246237284482 i: 73 Error: 10.994951898575751 i: 74 Error: 10.987863959791374 i: 75 Error: 10.981173260576831 i: 76 Error: 10.97487583850806 i: 77 Error: 10.968924851820436 i: 78 Error: 10.963317694512957 i: 79 Error: 10.95803070674304 i: 80 Error: 10.953029845889434 i: 81 Error: 10.948319335119882 i: 82 Error: 10.943855597092096 i: 83 Error: 10.939657430608891 i: 84 Error: 10.935700404741592 i: 85 Error: 10.93196932790682 i: 86 Error: 10.928446914760144 i: 87 Error: 10.925122307189463 i: 88 Error: 10.921987557948412 i: 89 Error: 10.919030199853355 i: 90 Error: 10.916238322489436 i: 91 Error: 10.913601802488085 i: 92 Error: 10.911115528431818 i: 93 Error: 10.908765814951405 i: 94 Error: 10.906556316793806 i: 95 Error: 10.904466955489564 i: 96 Error: 10.902504471647127 i: 97 Error: 10.900643630382485 i: 98 Error: 10.898883927123046 i: 99 Error: 10.897232145447704
In [85]:
m, c
Out[85]:
(3.153937346115943, 3.856832667230562)
In [86]:
from matplotlib import pyplot as plt

x1 = 0
x2 = 2
y1 = m*x1 + c
y2 = m*x2 + c
ycap = m*X + c

plt.figure(figsize=(11,4))
# plt.scatter(X, ycap, color='r')
plt.plot([x1, x2], [y1, y2], color='r')
plt.scatter(X, y, color='b')
plt.show()
Backpropagation
Learn more about neural networks and backpropagation here: https://youtu.be/4Cu9fizAtAo?t=1652
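The tiny network used below (defined in the next cell) is just two linear neurons chained together: h1 = w11*x + w12 and o1 = w21*h1 + w22. Written out, the output is

o1 = w21*(w11*x + w12) + w22 = (w21*w11)*x + (w21*w12 + w22)

which is still a straight line in x, and that is why it can fit the same kind of synthetic linear data we generated for gradient descent.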
In [88]:
## Generate Data
import numpy as np

X = np.random.random((100, 1))
y = 4 + 3 * X + .3*np.random.randn(100, 1)
In [89]:
def neuron(w11, w12, x):
    # We will talk about Activation function later.
    r = w11*x + w12
    return r

def forward_nn(w11, w12, w21, w22, x):
    h1 = neuron(w11, w12, x)
    o1 = neuron(w21, w22, h1)
    return o1

def sq_error(o1, yy):
    e = o1 - yy
    return e*e

def error_nn(w11, w12, w21, w22, x, yy):
    o1 = forward_nn(w11, w12, w21, w22, x)
    e = o1 - yy
    return e*e
In [90]:
w11 = 0.5
w12 = 0.5
w21 = 0.5
w22 = 0.5
learning_rate = 0.001

for epoch in range(100):
    for i in range(len(X)):
        x = X[i]
        yy = y[i]
        dE_dw11 = diffp(error_nn, 0, w11, w12, w21, w22, x, yy)
        dE_dw12 = diffp(error_nn, 1, w11, w12, w21, w22, x, yy)
        dE_dw21 = diffp(error_nn, 2, w11, w12, w21, w22, x, yy)
        dE_dw22 = diffp(error_nn, 3, w11, w12, w21, w22, x, yy)
        w11 = w11 - learning_rate * dE_dw11
        w12 = w12 - learning_rate * dE_dw12
        w21 = w21 - learning_rate * dE_dw21
        w22 = w22 - learning_rate * dE_dw22
In [91]:
ycap = forward_nn(w11, w12, w21, w22, X)
In [92]:
from matplotlib import pyplot as plt

plt.figure(figsize=(11,4))
plt.plot(X, ycap, color='r')
plt.scatter(X, y, color='b')
plt.show()
Further Improvements – Backpropagation
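The improvement below avoids differentiating the whole network end to end. Instead it differentiates each neuron locally and combines the pieces with the chain rule, exactly as in the Chain Rule section above:

dE/dw21 = dE/do1 * do1/dw21
dE/dw22 = dE/do1 * do1/dw22
dE/dw11 = dE/do1 * do1/dh1 * dh1/dw11
dE/dw12 = dE/do1 * do1/dh1 * dh1/dw12

Reusing dE/do1 and do1/dh1 across weights is what backpropagation is about: gradients computed at the output are propagated backwards to the earlier layers.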
In [93]:
# Initialize
w11 = 0.5
w12 = 0.5
w21 = 0.5
w22 = 0.5

def neuron(w11, w12, x):
    #TODO: activation function
    return w11*x + w12

def sq_error(o1, yy):
    e = o1 - yy
    return e*e

# Learning Rate
eta = 0.01

for epoch in range(50):
    for i in range(len(X)):
        x = X[i]
        yy = y[i]

        # Forward Pass
        h1 = neuron(w11, w12, x)
        o1 = neuron(w21, w22, h1)

        dE_do1 = diffp(sq_error, 0, o1, yy)

        do1_dw21 = diffp(neuron, 0, w21, w22, h1)
        dE_w21 = dE_do1 * do1_dw21

        do1_dw22 = diffp(neuron, 1, w21, w22, h1)
        dE_w22 = dE_do1 * do1_dw22

        do1_dh = diffp(neuron, 2, w21, w22, h1)
        dE_dh = dE_do1 * do1_dh

        dh_dw11 = diffp(neuron, 0, w11, w12, x)
        dE_dw11 = dE_dh * dh_dw11

        dh_dw12 = diffp(neuron, 1, w11, w12, x)
        dE_dw12 = dE_do1 * do1_dh * dh_dw12

        w11 = w11 - eta * dE_dw11
        w12 = w12 - eta * dE_dw12
        w21 = w21 - eta * dE_w21
        w22 = w22 - eta * dE_w22
In [94]:
# ycap = forward_nn(w11, w12, w21, w22, X)
h1 = neuron(w11, w12, X)
ycap1 = neuron(w21, w22, h1)

from matplotlib import pyplot as plt

plt.figure(figsize=(11,4))
# plt.scatter(X, ycap, color='r')
plt.plot(X, ycap1, color='g')
plt.scatter(X, y, color='b')
plt.show()
Extensions
- In real networks there are many neurons, and the weights are updated in a single shot using NumPy arrays.
- There are many layers, and the code needs to be flexible enough to stitch any number of layers together.
- The neuron definition is more complex: it includes an activation function (see the sketch below).
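As a starting point for the last item, here is a minimal sketch of a neuron with a sigmoid activation. It is not from the original notebook, and the names sigmoid and neuron_with_activation are ours:

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def neuron_with_activation(w, b, x):
    # weighted input followed by a non-linear activation
    return sigmoid(w*x + b)

print(neuron_with_activation(0.5, 0.5, 1.0))   # ~0.73

With an activation function in place, stacking neurons no longer collapses into a single straight line, which is what lets a network fit non-linear data.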
Let us know in the comments below if you found this helpful.
In order to claim the certificate from E&ICT Academy, IIT Roorkee, visit https://www.cloudxlab.com
Website ► https://www.cloudxlab.com
Facebook ► https://www.facebook.com/cloudxlab
Instagram ► https://www.instagram.com/cloudxlab
Twitter ► http://www.twitter.com/cloudxlab