MicroGrad: 動手實現一個簡單的自動微分框架

Micrograd 簡介 (Introduction)#

Micrograd 是Andrej Karpathy打造的一個學習性項目，用於理解自動微分、反向傳播等基本知識。

視頻地址：https://www.bilibili.com/video/BV1De4y1p7Z8

代碼地址：https://github.com/karpathy/micrograd

反向傳播 (Backpropagation) 是機器學習和深度學習中最常用的一種算法，能夠計算神經網絡中損失函數相對權重參數的梯度，通過最小化損失在反向傳播過程中指導模型權重更新。

導數 (Derivative)#

導數的數學定義#

數學上對導數的定義：函數在某一點上的導數是指函數在這個點附近的變化率，即函數在該點上的切線斜率。公式表示為：

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

在 x 點處，取一個較小的值 h，如 0.0001，利用上述公式近似計算函數 $f(x)$ 在 $x$ 處的導數。

如：

import math
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# 定義一個函數
def f(x):
    return 3*x**2 - 4*x + 5
# 使用matplotlib畫出函數圖像
xs =np.arange(-5, 5, 0.25)
ys = f(xs)
plt.plot(xs, ys)

h = 0.00000001
a = -3.0
(f(a + h) - f(a))/h

輸出：-22.00000039920269 就得到了函數 f 在 -3 處的導數近似值

實際上可以通過求導法則直接求出 $f$ 的導數為 $6x - 4$ ，在 - 3 處的導數為 -22
但是在實際的現代神經網絡庫中不會對每一個函數求出這樣的表達式，遇到龐大複雜的函數時求解出這樣的導數表達式是困難且低效的。

進一步理解導數#

上文中求解了函數對未知數的導數，但實際的神經網絡中，我們需要求解函數對某個標量的導數，這時從導數的定義下手更好理解

h = 0.0001
a = 2.0
b = -3.0
c = 10.0

d1 = a*b+c
c += h
d2 = a*b+c

print('slope', (d2 - d1)/h)

按照定義，這裡對 c 進行求導（c 進行了少量偏移，求解函數 d 在 c 偏移後的變化率），結果為：
slope 0.9999999999976694

這意味著 c 在增加 0.0001 的微小變化下，d 的變化量約為 c 的變化量的 1 倍（正為增加，副為減少）。

也可以通過導數的定義求解 d 的變化量相對於 c 的變化量的關係：
$\frac {d_2-d_1} {h} = 1$

同理也能求函數相對於 a 和 b 的導數近似值。

神經網絡的原理#

以簡單的多層感知機 MLP 為例，神經網絡實際上是數學公式的集合。數據作為輸入，經過隱藏層在神經元中通過數學運算（通常是輸入數據的加權求和加上偏置，然後應用非線性激活函數）產生進入到下一層的激活值，最終輸出預測結果等其他數據。其主要的運算是一系列的加乘（在線性代數中體現為矩陣運算）以及非線性激活。

前向傳播負責計算網絡的輸出並得到損失值；前向傳播完成後執行反向傳播，其核心任務是根據損失值和前向傳播過程中存儲的中間結果，通過鏈式法則計算損失函數關於各層參數（如權重和偏置）的梯度。通過這些計算得到的梯度，指導隱藏層中神經元的權重參數進行更新，以減小訓練誤差（其實不僅僅是需要減小訓練誤差，更重要的是減少測試誤差，讓模型的泛化性更好才是最終目的）。

在 micrograd 中使用的是簡單的標量運算用於演示和學習自動微分的原理，但在實際的神經網絡實現中，為了效率會進行大量的矩陣運算。

實現自動微分框架#

神經網絡中有很多需要維護的參數，因此需要設計一個數據結構協助我們進行數據維護。

自定義類#

class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data)
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data)
        return out

__repr__用於生成開發者友好的字符串表示
__str__用於生成用戶友好的字符串表示通常由 print () 和 str () 調用
如果 str 不存在，print () 會退而使用 repr。
如果 repr 不存在，repr () 和交互式環境的直接顯示會使用默認的 <...> 表示。

上面的代碼中簡單定義了一個Value類，包含屬性 data，定義了 Value 類的方法 add 和 mul。

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a * b + c
d

Value(data=4.0)

接下來為 Value 增加一些其他的屬性，使之能夠保留更多的操作細節。這裡使用集合_children表示當前 data 的子節點，op記錄操作運算符，label 為當前數據的別稱

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

使用可視化 API 顯示計算圖#

為了更加直觀的顯示出計算過程，我們將通過下面的代碼進行可視化操作（非重點，可跳過）

from graphviz import Digraph

def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir':'LR'})

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))

        dot.node(name=uid, label='{%s | data %.4f | grad %.4f}' % (n.label, n.data, n.grad), shape='record')

        if n._op:
            dot.node(name=uid+n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot

定義一個深層的表達式並可視化計算圖（前向傳播）

a = Value(2.0, label='a')
b = Value(-3.0, label='b')
c = Value(10.0, label='c')
e = a * b; e.label = 'e'
d = e + c; d.label = 'd'
f = Value(-2.0, label='f')
L = d * f; L.label = 'L'
L

draw_dot(L)

output

反向傳播計算梯度#

在進反向傳播之前先回顧一下鏈式求導法則

鏈式求導法則#

在反向傳播過程中需要求解最終輸出對中途某個輸入的導數，但最終輸出的函數並不是直接關於中途變量的函數。

eg. 如果 y 是關於 u 的函數 $(y=f(u))$ ，而 u 又是關於 x 的函數 $(u=g(x))$ ，則 $y$ 關於 $x$ 的導數為：

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

如果 $h(x)$ 是複合函數 $f(g(x))$ ，那麼 $h(x)$ 關於 $x$ 的導數 $h'(x)$ 等於外層函數 $f$ 的導數在內層函數 $g(x)$ 處的值，乘以內層函數 $g(x)$ 的導數 $g'(x)$ 。即：

$h'(x) = f'(g(x)) \cdot g'(x)$

求導時，先對外層函數求導（保持內層函數不變），然後乘以內層函數的導數。

手動反向求導#

修改 Value 類，實現手動的反向求導

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward

        return out
    
    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

L 對 L 本身求導，導數為 1，所以先手動的將 L 的梯度設為 1

L.grad = 1.0

對 L 進行梯度反向傳播，由計算圖我們可以知道 L=f*d，那麼 L 的反向傳播就能分別求出 f 和 d 的梯度。像這樣，手動的對 L 進行反向傳播：

L._backward()

output
同理，對 d 進行反向傳播得到 c、e 的梯度，對 c 反向傳播得到 a 和 b 的梯度。

d._backward()
e._backward()

output

在神經網絡中神經元接收數據後不僅進行簡單的線性運算，還會對線性運算的結果進行非線性處理。我們接著拓展計算圖，並使用激活函數 tanh。

定義 tanh 函數：

def tanh(self):
        n = self.data
        t = (math.exp(2*n) - 1) / (math.exp(2*n) + 1)
        out = Value(t, (self, ), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out

tanh(x) = (e^2x - 1) / (e^2x + 1)
其導數為 tanh (x)' = 1 - tanh (x)²

定義一個複雜的神經元

x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

b = Value(6.8813735870195432, label='b')

x1w1 = x1 * w1; x1w1.label='x1w1'
x2w2 = x2 * w2; x2w2.label='x2w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label='x1w1x2w2'

n = x1w1x2w2 + b; n.label='n'
o = n.tanh(); o.label='o'

draw_dot(o)

output

像上面一樣手動執行反向傳播

output

如果計算過程再複雜一下，手動反向傳播就過於繁雜了。接下來實現一個自動的反向傳播。

自動反向傳播#

具體來說，我們在前向傳播後執行反向傳播操作，按照拓撲排序的對節點的逆序調用 ._backward() 方法。

定義自動的反向傳播函數

def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

執行

o.backward()

即可一次性求出整個計算圖的梯度。但是當前我們的設計有一定的缺陷，下面的例子能夠清晰地顯示出我們的缺陷：

a = Value(3.0, label='a')
b = a + a; b.label='b'
b.backward()
draw_dot(b)

b 對 a 的導數應該是 2，而不是 1，這是因為我們在定義反向傳播時，只是簡單的進行了賦值操作，對於多次使用某個變量的操作，梯度應該是累加的。

修復這個問題，完整的 Value 類為：

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op
        self.label = label
        self.grad = 0.0
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data})"
    
    def __add__(self, other):
      # 類型檢驗：允許Value類和數值進行運算，增強了Value類的魯棒性和強健性
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out
    # __radd__是一個反向的魔術方法，使得處理2 * a這樣的表達更自然
    def __radd__(self, other):
        return self.__add__(other)
    
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
    
    def tanh(self):
        n = self.data
        t = (math.exp(2*n) - 1) / (math.exp(2*n) + 1)
        out = Value(t, (self, ), 'tanh')
        def _backward():
            self.grad += (1 - t ** 2) * out.grad
        out._backward = _backward
        return out 
    
    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

對於 Value 類中其他的相關操作，你可以自己再做定義，如 div 操作，pow 等。

視頻中作者還講解了一些其他的內容如使用 Pytorch 來實現一個簡單的 MLP，感興趣的話可以繼續學習。

總結#

寫下這篇博客，希望通過這趟 Micrograd 之旅，能讓大家對自動微分不再感到神秘，甚至能自己動手實現一個 mini 的版本。如果這篇博客能幫助你對深度學習框架背後那些默默工作的 “英雄” 們（比如自動微分引擎）多一分理解，那我的目的就達到了！也歡迎大家繼續探索，嘗試給我們的 Value 類添加更多功能，或者去看看 Karpathy 原版的視頻和代碼，相信會有更多收穫。