
MicroGrad: Implementing a Simple Automatic Differentiation Framework

Introduction to Micrograd#

Micrograd is a learning project created by Andrej Karpathy to understand the basics of automatic differentiation, backpropagation, and more.

Video link: https://www.bilibili.com/video/BV1De4y1p7Z8

Code link: https://github.com/karpathy/micrograd

Backpropagation is one of the most commonly used algorithms in machine learning and deep learning. It computes the gradient of the loss function with respect to the weight parameters of a neural network, and these gradients guide the weight updates that minimize the loss.

Derivative#

Mathematical Definition of Derivative#

The mathematical definition of a derivative: the derivative of a function at a certain point refers to the rate of change of the function near that point, which is the slope of the tangent line at that point. It is expressed as:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

At a point $x$, take a small value $h$, such as 0.0001, and use the formula above to approximate the derivative of $f(x)$ at $x$.

For example:

import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Define a function
def f(x):
    return 3*x**2 - 4*x + 5
# Use matplotlib to plot the function
xs = np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)

*(figure: plot of f(x) = 3x² − 4x + 5 over [-5, 5])*

h = 0.00000001
a = -3.0
(f(a + h) - f(a))/h

Output: `-22.00000039920269`, the approximate value of the derivative of f at -3.

In fact, the derivative of $f$ can be computed directly with the power rule as $6x - 4$, so the derivative at -3 is exactly -22.
However, modern neural network libraries do not derive such closed-form expressions for every function: deriving symbolic derivatives of large, complex functions is difficult and inefficient.
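The two views can be checked against each other numerically. The sketch below (with `df` as a hypothetical helper holding the analytic derivative; not from the original post) compares the forward-difference estimate with $6x - 4$ at a few points:

```python
# Compare the numerical derivative estimate with the analytic derivative 6x - 4.
def f(x):
    return 3*x**2 - 4*x + 5

def df(x):
    return 6*x - 4  # analytic derivative of f

h = 1e-6
for a in [-3.0, 0.0, 2.5]:
    numeric = (f(a + h) - f(a)) / h
    print(a, numeric, df(a))  # the two derivative columns should nearly agree
```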

Further Understanding of Derivative#

The previous section computed the derivative of a function of a single variable. In an actual neural network, however, we need the derivative of an expression with respect to each of the many values that feed into it (inputs, weights, and biases). The definition of the derivative is again the best way to understand this.

h = 0.0001
a = 2.0
b = -3.0
c = 10.0

d1 = a*b+c
c += h
d2 = a*b+c

print('slope', (d2 - d1)/h)

By the definition, here we are differentiating d with respect to c: c is shifted by a small amount h, and we measure the resulting rate of change of d, giving:
slope 0.9999999999976694

This means that with a small change of 0.0001 in c, the change in d is approximately equal to the change in c (positive for increase, negative for decrease).

The relationship between the change in d and the change in c can also be derived from the definition of the derivative:
$$\frac{d_2 - d_1}{h} = 1$$

Similarly, we can approximate the derivatives of the function with respect to a and b.
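Following the same recipe (a sketch using the same values as above), a small bump in a or b gives the corresponding slope:

```python
# Approximate the partial derivatives of d = a*b + c by bumping each input.
h = 0.0001
a, b, c = 2.0, -3.0, 10.0
d1 = a*b + c

# bump a: the slope should be close to b = -3
slope_a = ((a + h)*b + c - d1) / h
# bump b: the slope should be close to a = 2
slope_b = (a*(b + h) + c - d1) / h
print('d/da', slope_a)
print('d/db', slope_b)
```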

Principles of Neural Networks#

*(figure: diagram of a multilayer perceptron)*

Taking a simple multilayer perceptron (MLP) as an example, a neural network is essentially a collection of mathematical formulas. Data is input, and through hidden layers, activation values are produced in neurons via mathematical operations (usually a weighted sum of input data plus a bias, followed by a nonlinear activation function) that enter the next layer, ultimately outputting predictions and other data. The main operations involve a series of additions and multiplications (reflected as matrix operations in linear algebra) and nonlinear activations.

Forward propagation computes the network's output and the loss value. Once it completes, backpropagation runs: its core task is to use the chain rule, together with the loss value and the intermediate results stored during forward propagation, to compute the gradients of the loss function with respect to each layer's parameters (such as weights and biases). These gradients guide the updates to the hidden-layer weights that reduce the training error (in fact, reducing the test error matters even more, since improving the model's generalization is the ultimate goal).

In micrograd, simple scalar operations are used to demonstrate and learn the principles of automatic differentiation, but in actual neural network implementations, a large number of matrix operations are performed for efficiency.
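As a rough illustration of that difference (a sketch assuming NumPy, not part of micrograd), the weighted-sum-plus-activation step for a whole batch collapses into a single matrix expression:

```python
import numpy as np

# One dense layer acting on a batch of inputs: out = tanh(X @ W + b).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # batch of 4 inputs, 3 features each
W = rng.normal(size=(3, 2))   # weights: 3 inputs -> 2 neurons
b = np.zeros(2)               # biases

out = np.tanh(X @ W + b)      # activations for the whole batch at once
print(out.shape)              # (4, 2)
```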

Implementing an Automatic Differentiation Framework#

A neural network has many parameters to keep track of, so we design a data structure to help maintain them.

Custom Class#

class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data)
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out
    

`__repr__` is used to generate a developer-friendly string representation.
`__str__` is used to generate a user-friendly string representation, usually called by `print()` and `str()`.
If `__str__` does not exist, `print()` will fall back to using `__repr__`.
If `__repr__` does not exist, `repr()` and direct display in interactive environments will use the default `<...>` representation.
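A minimal illustration of the fallback, using a hypothetical `Point` class:

```python
# A class that defines only __repr__; print() and str() fall back to it.
class Point:
    def __repr__(self):
        return "Point(1, 2)"

p = Point()
print(p)       # prints Point(1, 2) -- falls back to __repr__
print(str(p))  # same fallback, since __str__ is not defined
```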

The above code simply defines a Value class with a data attribute and methods for addition and multiplication.

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c
d

Value(data=4.0)

Next, we add some attributes to Value to retain more details of each operation. Here `_children` holds the nodes that produced the current value (stored as the set `_prev`), `_op` records the operator, and `label` is a human-readable alias for the current value.

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

Using Visualization API to Display Computation Graph#

To display the computation process more intuitively, we visualize it with the following code (optional; feel free to skip).

from graphviz import Digraph

def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir':'LR'})

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))

        # note: assumes each Value node has a .grad attribute (added in a later section)
        dot.node(name=uid, label='{%s | data %.4f | grad %.4f}' % (n.label, n.data, n.grad), shape='record')

        if n._op:
            dot.node(name=uid+n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot

Define a deep expression and visualize the computation graph (forward propagation).

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L

draw_dot(L)

*(computation graph of L)*

Backpropagation to Compute Gradients#

Before proceeding with backpropagation, let's review the chain rule of differentiation.

Chain Rule of Differentiation#

During backpropagation, we need to compute the derivative of the final output with respect to some intermediate input, but the final output function is not directly a function of the intermediate variable.

For example, if $y$ is a function of $u$ ($y = f(u)$), and $u$ is a function of $x$ ($u = g(x)$), then the derivative of $y$ with respect to $x$ is:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

If $h(x)$ is a composite function $f(g(x))$, then the derivative $h'(x)$ of $h(x)$ with respect to $x$ equals the derivative of the outer function $f$ evaluated at the inner function $g(x)$, multiplied by the derivative $g'(x)$ of the inner function. That is:

$$h'(x) = f'(g(x)) \cdot g'(x)$$

When differentiating, first differentiate the outer function (keeping the inner function unchanged), then multiply by the derivative of the inner function.
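The rule can be verified numerically; `f`, `g`, and `h` below are example functions chosen here for illustration, not from the post:

```python
# Chain rule check on h(x) = f(g(x)) with f(u) = u**2 and g(x) = 3*x + 1.
def g(x): return 3*x + 1
def f(u): return u**2
def h(x): return f(g(x))

x = 2.0
# analytic: h'(x) = f'(g(x)) * g'(x) = 2*(3x + 1) * 3
analytic = 2 * g(x) * 3
eps = 1e-6
numeric = (h(x + eps) - h(x)) / eps
print(analytic, numeric)  # the two values should nearly agree
```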

Manual Backpropagation#

Modify the Value class to implement manual backpropagation.

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward

        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

The derivative of L with respect to L itself is 1, so we manually set L's gradient to 1.

L.grad = 1.0

Perform backpropagation on L. From the computation graph, we can see that L=f*d, so backpropagation on L can yield the gradients of f and d. Manually backpropagate L:

L._backward()

*(computation graph with the gradients of f and d filled in)*

Similarly, backpropagating on d yields the gradients of c and e, and backpropagating on e yields the gradients of a and b.

d._backward()
e._backward()

*(computation graph with all gradients filled in)*

In neural networks, neurons not only perform simple linear operations after receiving data but also apply nonlinear processing to the results of linear operations. We will continue to expand the computation graph and use the activation function tanh.

Add a `tanh` method to the `Value` class:

    def tanh(self):
        n = self.data
        t = (math.exp(2*n) - 1) / (math.exp(2*n) + 1)
        out = Value(t, (self, ), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$

Its derivative is $\tanh'(x) = 1 - \tanh^2(x)$.
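A quick numerical check of this derivative identity (a sketch, not from the post):

```python
import math

# tanh via the same formula used in the Value class.
def tanh(x):
    return (math.exp(2*x) - 1) / (math.exp(2*x) + 1)

x = 0.7
eps = 1e-6
numeric = (tanh(x + eps) - tanh(x)) / eps  # finite-difference slope
analytic = 1 - tanh(x)**2                  # 1 - tanh(x)^2
print(numeric, analytic)  # the two values should nearly agree
```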

Define a complex neuron.

x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

b = Value(6.8813735870195432, label='b')

x1w1 = x1 * w1; x1w1.label='x1w1'
x2w2 = x2 * w2; x2w2.label='x2w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label='x1w1x2w2'

n = x1w1x2w2 + b; n.label='n'
o = n.tanh(); o.label='o'

draw_dot(o)

*(computation graph of the neuron)*

Manually execute backpropagation as above.

*(computation graph of the neuron with gradients filled in)*

If the computation process becomes more complex, manual backpropagation becomes overly complicated. Next, we will implement automatic backpropagation.

Automatic Backpropagation#

Specifically, after forward propagation, we perform backpropagation by calling the ._backward() method on the nodes in reverse order according to topological sorting.

Add a `backward` method to the `Value` class:

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

Execute

o.backward()

to compute the gradients of the entire computation graph at once. However, our current design has a flaw, which the following example exposes:

a = Value(3.0, label='a')
b = a + a; b.label='b'
b.backward()
draw_dot(b)

*(computation graph of b = a + a)*

The derivative of b with respect to a should be 2, not 1. In defining backpropagation we simply assigned gradients, so when a variable is used more than once, the second assignment overwrites the first. For operations that use a variable multiple times, the gradients must be accumulated instead.
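A plain-float bump confirms that the true slope is 2:

```python
# For b = a + a, the derivative db/da is 2: bumping a by h changes b by 2h.
h = 1e-6
a = 3.0
b1 = a + a
b2 = (a + h) + (a + h)
slope = (b2 - b1) / h
print(slope)  # close to 2.0
```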

Fix this issue, and the complete Value class is:

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        # Allow mixed operations between Value objects and plain numbers,
        # making the class more robust
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out
    # __radd__ is the reflected add method, so expressions like 2 + a work naturally
    def __radd__(self, other):
        return self.__add__(other)
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    # __rmul__ is the reflected multiply method, so expressions like 2 * a also work
    def __rmul__(self, other):
        return self.__mul__(other)
    
    def tanh(self):
        n = self.data
        t = (math.exp(2*n) - 1) / (math.exp(2*n) + 1)
        out = Value(t, (self, ), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out 
    
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()
    

*(computation graph with correctly accumulated gradients)*

Other operations on the `Value` class, such as division and exponentiation, can be defined by yourself in the same way.
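As one possible sketch (not from the video), `__pow__` and `__truediv__` can follow the same pattern as `__add__` and `__mul__`; only the minimal subset of `Value` needed to exercise the new ops is repeated here:

```python
# Minimal Value subset extended with power and division.
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._prev = set(_children)
        self._backward = lambda: None

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only int/float powers supported"
        out = Value(self.data ** other, (self,), f'**{other}')
        def _backward():
            # d/dx x^n = n * x^(n-1)
            self.grad += other * self.data ** (other - 1) * out.grad
        out._backward = _backward
        return out

    def __truediv__(self, other):
        # a / b is rewritten as a * b**-1, reusing the two rules above
        return self * other ** -1

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a = Value(4.0)
b = Value(2.0)
c = a / b
c.backward()
print(c.data, a.grad, b.grad)  # d(a/b)/da = 1/b = 0.5, d(a/b)/db = -a/b^2 = -1.0
```

Rewriting division as multiplication by a power keeps the number of backward rules small, which is the same design micrograd uses.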

In the video, the author also covers other material, such as using PyTorch to implement a simple MLP. If you are interested, you can continue learning from there.

Conclusion#

I hope this journey through Micrograd demystifies automatic differentiation and even equips you to implement a mini version yourself. If this blog helps you better understand the "heroes" behind deep learning frameworks (such as automatic differentiation engines), then it has achieved its goal. You are also welcome to keep exploring: try adding more features to our Value class, or check out Karpathy's original video and code; I believe there is more to gain there.
