Autograd Internals

How automatic differentiation works in Aicraft.

Overview

Aicraft implements reverse-mode automatic differentiation (backpropagation) using a dynamic computational graph. Every operation that produces a tensor can optionally record its computation, allowing gradients to flow backward.

Forward:  x → [Linear] → [ReLU] → [Linear] → [Softmax] → loss
Backward: ∂loss/∂x ← ∂/∂z₃ ← ∂/∂z₂ ← ∂/∂z₁ ← ∂/∂y ← 1.0

Computational Graph

Nodes

Each tensor with requires_grad = true has an associated AcGradNode:

typedef struct AcGradNode {
    AcTensor *tensor;           // The tensor this node belongs to
    AcTensor *grad;             // Accumulated gradient
    
    AcTensor *inputs[AC_MAX_INPUTS];  // Input tensors
    int num_inputs;
    
    void (*backward)(struct AcGradNode *node, AcTensor *grad_output);
    void *context;              // Operation-specific data
    
    bool visited;               // For topological sort
} AcGradNode;

Building the Graph

The graph is built dynamically during the forward pass:

// When you compute: c = a + b
AcTensor *ac_add(AcTensor *a, AcTensor *b) {
    AcTensor *c = ac_tensor_new(a->shape, a->ndim);
    
    // Compute forward
    for (int i = 0; i < c->size; i++) {
        c->data[i] = a->data[i] + b->data[i];
    }
    
    // Build graph if needed
    if (a->requires_grad || b->requires_grad) {
        c->requires_grad = true;
        c->grad_node = ac_grad_node_new(c);
        c->grad_node->inputs[0] = a;
        c->grad_node->inputs[1] = b;
        c->grad_node->num_inputs = 2;
        c->grad_node->backward = add_backward;
    }
    
    return c;
}

Backward Pass

Topological Sort

Before computing gradients, we topologically sort the graph (output to input):

void ac_backward(AcTensor *loss) {
    // Start with gradient of 1.0 w.r.t. loss
    loss->grad_node->grad = ac_tensor_ones(loss->shape, loss->ndim);
    
    // Collect all nodes in reverse order
    AcGradNode **sorted = topological_sort(loss->grad_node);
    
    // Backward through each node
    for (int i = 0; i < num_nodes; i++) {
        AcGradNode *node = sorted[i];
        if (node->backward && node->grad) {
            node->backward(node, node->grad);
        }
    }
}

Topological Sort Algorithm

static void topo_visit(AcGradNode *node, AcGradNode **stack, int *idx) {
    if (node->visited) return;
    node->visited = true;
    
    for (int i = 0; i < node->num_inputs; i++) {
        AcTensor *input = node->inputs[i];
        if (input && input->grad_node) {
            topo_visit(input->grad_node, stack, idx);
        }
    }
    
    stack[(*idx)++] = node;
}

Backward Functions

Each operation defines how gradients flow through it.

Addition

// c = a + b
// ∂L/∂a = ∂L/∂c, ∂L/∂b = ∂L/∂c

static void add_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *a = node->inputs[0];
    AcTensor *b = node->inputs[1];
    
    if (a->grad_node) {
        ac_grad_accumulate(a->grad_node, grad_output);
    }
    if (b->grad_node) {
        ac_grad_accumulate(b->grad_node, grad_output);
    }
}

Multiplication (Element-wise)

// c = a * b
// ∂L/∂a = ∂L/∂c * b, ∂L/∂b = ∂L/∂c * a

static void mul_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *a = node->inputs[0];
    AcTensor *b = node->inputs[1];
    
    if (a->grad_node) {
        AcTensor *grad_a = ac_mul(grad_output, b);
        ac_grad_accumulate(a->grad_node, grad_a);
    }
    if (b->grad_node) {
        AcTensor *grad_b = ac_mul(grad_output, a);
        ac_grad_accumulate(b->grad_node, grad_b);
    }
}

Matrix Multiplication

// C = A @ B (A: [M, K], B: [K, N], C: [M, N])
// ∂L/∂A = ∂L/∂C @ Bᵀ
// ∂L/∂B = Aᵀ @ ∂L/∂C

static void matmul_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *A = node->inputs[0];
    AcTensor *B = node->inputs[1];
    
    if (A->grad_node) {
        AcTensor *B_T = ac_transpose(B);
        AcTensor *grad_A = ac_matmul(grad_output, B_T);
        ac_grad_accumulate(A->grad_node, grad_A);
    }
    if (B->grad_node) {
        AcTensor *A_T = ac_transpose(A);
        AcTensor *grad_B = ac_matmul(A_T, grad_output);
        ac_grad_accumulate(B->grad_node, grad_B);
    }
}

ReLU

// y = max(0, x)
// ∂L/∂x = ∂L/∂y * 1(x > 0)

static void relu_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *x = node->inputs[0];
    
    if (x->grad_node) {
        AcTensor *grad_x = ac_tensor_new(x->shape, x->ndim);
        for (int i = 0; i < x->size; i++) {
            grad_x->data[i] = (x->data[i] > 0) ? grad_output->data[i] : 0;
        }
        ac_grad_accumulate(x->grad_node, grad_x);
    }
}

Softmax + Cross-Entropy

Combined for numerical stability:

// L = -Σ yᵢ log(softmax(xᵢ))
// ∂L/∂x = softmax(x) - y

static void softmax_ce_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *logits = node->inputs[0];
    AcTensor *targets = node->inputs[1];  // One-hot
    AcTensor *softmax_out = (AcTensor *)node->context;
    
    if (logits->grad_node) {
        AcTensor *grad = ac_tensor_new(logits->shape, logits->ndim);
        for (int i = 0; i < logits->size; i++) {
            grad->data[i] = softmax_out->data[i] - targets->data[i];
        }
        // Scale by incoming gradient (usually 1.0 for scalar loss)
        ac_tensor_scale_inplace(grad, ac_scalar(grad_output));
        ac_grad_accumulate(logits->grad_node, grad);
    }
}

Gradient Accumulation

Gradients are accumulated (summed), not replaced. This handles:

Multiple uses of the same tensor
Batch dimension gradients

void ac_grad_accumulate(AcGradNode *node, AcTensor *grad) {
    if (!node->grad) {
        node->grad = ac_tensor_clone(grad);
    } else {
        for (int i = 0; i < node->grad->size; i++) {
            node->grad->data[i] += grad->data[i];
        }
    }
}

Broadcasting in Backward

When forward involves broadcasting, backward must reduce:

// Forward: C[M, N] = A[M, N] + B[1, N] (B is broadcast)
// Backward: ∂L/∂B = sum over M dimension of ∂L/∂C

static void add_broadcast_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *b = node->inputs[1];  // The broadcast operand
    
    if (b->grad_node) {
        // Sum along broadcast dimensions
        AcTensor *grad_b = ac_sum(grad_output, /*axis=*/0);
        ac_grad_accumulate(b->grad_node, grad_b);
    }
}

Memory Management

Gradient Cleanup

After ac_optimizer_step, gradients are zeroed:

void ac_optimizer_zero_grad(AcOptimizer *opt) {
    for (int i = 0; i < opt->num_layers; i++) {
        AcTensor *w = opt->layers[i]->weights;
        if (w && w->grad_node && w->grad_node->grad) {
            ac_tensor_zero_inplace(w->grad_node->grad);
        }
        // Same for bias
    }
}

Graph Cleanup

The computational graph is freed with ac_mem_restore():

ac_mem_checkpoint();
// Forward pass builds graph
AcTensor *loss = ...;
ac_backward(loss);
ac_optimizer_step(opt);
ac_mem_restore();  // Frees all tensors and grad nodes

No-Grad Mode

Disable graph building for inference:

// Option 1: Explicit flag
ac_set_grad_enabled(false);
AcTensor *y = ac_forward_seq(net, n, x);  // No graph built
ac_set_grad_enabled(true);

// Option 2: Use input without requires_grad
AcTensor *x = ac_tensor_from_data(data, shape, ndim);
x->requires_grad = false;  // Default

Cycle Detection

Aicraft prevents infinite loops with O(1) cycle detection:

// During backward, each node is visited exactly once
if (node->visited) {
    return;  // Already processed
}
node->visited = true;

Supported Operations

The autograd engine supports 22 differentiable operations:

Category	Operations
Arithmetic	add, sub, mul, div, neg
Matrix	matmul, transpose
Reductions	sum, mean, max
Activations	relu, sigmoid, tanh, softmax, gelu
Loss	mse, cross_entropy, bce
Shape	reshape, squeeze, unsqueeze
Other	log, exp, pow

Adding Custom Ops

See Custom Layers Guide for implementing your own differentiable operations.

Debugging

// Print the computational graph
ac_print_graph(loss);

// Check specific gradients
ac_tensor_print(layer->weights->grad_node->grad);

// Numerical gradient check
ac_gradient_check(loss_fn, input, 1e-5, 1e-4);

Performance Considerations

In-place operations don't support autograd — they break the graph
Reuse checkpoints — avoid graph explosion in training loops
Detach when needed — ac_tensor_detach(t) creates a copy without grad connection

Next Steps

Memory Internals — Arena allocator details
Custom Layers — Implement new layers
Debugging Guide — Troubleshoot autograd issues

Overview​

Computational Graph​

Nodes​

Building the Graph​

Backward Pass​

Topological Sort​

Topological Sort Algorithm​

Backward Functions​

Addition​

Multiplication (Element-wise)​

Matrix Multiplication​

ReLU​

Softmax + Cross-Entropy​

Gradient Accumulation​

Broadcasting in Backward​

Memory Management​

Gradient Cleanup​

Graph Cleanup​

No-Grad Mode​

Cycle Detection​

Supported Operations​

Adding Custom Ops​

Debugging​

Performance Considerations​

Next Steps​