
Autograd Internals

How automatic differentiation works in Aicraft.

Overview

Aicraft implements reverse-mode automatic differentiation (backpropagation) using a dynamic computational graph. Every operation that produces a tensor can optionally record its computation, allowing gradients to flow backward.

Forward:  x → [Linear] → [ReLU] → [Linear] → [Softmax] → loss
Backward: ∂loss/∂x ← ∂/∂z₁ ← ∂/∂z₂ ← ∂/∂z₃ ← ∂/∂y ← 1.0

Computational Graph

Nodes

Each tensor with requires_grad = true has an associated AcGradNode:

typedef struct AcGradNode {
    AcTensor *tensor;                // The tensor this node belongs to
    AcTensor *grad;                  // Accumulated gradient

    AcTensor *inputs[AC_MAX_INPUTS]; // Input tensors
    int num_inputs;

    void (*backward)(struct AcGradNode *node, AcTensor *grad_output);
    void *context;                   // Operation-specific data

    bool visited;                    // For topological sort
} AcGradNode;

Building the Graph

The graph is built dynamically during the forward pass:

// When you compute: c = a + b
AcTensor *ac_add(AcTensor *a, AcTensor *b) {
    AcTensor *c = ac_tensor_new(a->shape, a->ndim);

    // Compute forward
    for (int i = 0; i < c->size; i++) {
        c->data[i] = a->data[i] + b->data[i];
    }

    // Build graph if needed
    if (a->requires_grad || b->requires_grad) {
        c->requires_grad = true;
        c->grad_node = ac_grad_node_new(c);
        c->grad_node->inputs[0] = a;
        c->grad_node->inputs[1] = b;
        c->grad_node->num_inputs = 2;
        c->grad_node->backward = add_backward;
    }

    return c;
}

Backward Pass

Topological Sort

Before computing gradients, we topologically sort the graph (output to input):

void ac_backward(AcTensor *loss) {
    // Start with gradient of 1.0 w.r.t. loss
    loss->grad_node->grad = ac_tensor_ones(loss->shape, loss->ndim);

    // Collect all nodes, ordered output to input
    int num_nodes = 0;
    AcGradNode **sorted = topological_sort(loss->grad_node, &num_nodes);

    // Backward through each node
    for (int i = 0; i < num_nodes; i++) {
        AcGradNode *node = sorted[i];
        if (node->backward && node->grad) {
            node->backward(node, node->grad);
        }
    }
}

Topological Sort Algorithm

static void topo_visit(AcGradNode *node, AcGradNode **stack, int *idx) {
    if (node->visited) return;
    node->visited = true;

    for (int i = 0; i < node->num_inputs; i++) {
        AcTensor *input = node->inputs[i];
        if (input && input->grad_node) {
            topo_visit(input->grad_node, stack, idx);
        }
    }

    stack[(*idx)++] = node;
}
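The recursion produces a post-order: every node lands on the stack after all of its inputs, and shared nodes are pushed exactly once. A minimal self-contained sketch with a simplified stand-in node type (DemoNode and demo_topo_visit are illustrative names, not the real Aicraft API):

```c
#include <stdbool.h>
#include <stddef.h>

// Simplified stand-in for AcGradNode: just enough structure
// to exercise the post-order traversal.
typedef struct DemoNode {
    struct DemoNode *inputs[2];
    int num_inputs;
    bool visited;
} DemoNode;

// Same shape as topo_visit: recurse into inputs first, then push self.
void demo_topo_visit(DemoNode *node, DemoNode **stack, int *idx) {
    if (node->visited) return;   // each shared node is pushed exactly once
    node->visited = true;
    for (int i = 0; i < node->num_inputs; i++) {
        if (node->inputs[i]) demo_topo_visit(node->inputs[i], stack, idx);
    }
    stack[(*idx)++] = node;
}
```

For a diamond graph a → {b, c} → d, calling demo_topo_visit on d pushes a first and d last, and a is visited only once even though both b and c consume it. Since ac_backward processes nodes output to input, the real topological_sort presumably reverses this post-order stack (or walks it from the end) before handing it back.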

Backward Functions

Each operation defines how gradients flow through it.

Addition

// c = a + b
// ∂L/∂a = ∂L/∂c, ∂L/∂b = ∂L/∂c

static void add_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *a = node->inputs[0];
    AcTensor *b = node->inputs[1];

    if (a->grad_node) {
        ac_grad_accumulate(a->grad_node, grad_output);
    }
    if (b->grad_node) {
        ac_grad_accumulate(b->grad_node, grad_output);
    }
}

Multiplication (Element-wise)

// c = a * b
// ∂L/∂a = ∂L/∂c * b, ∂L/∂b = ∂L/∂c * a

static void mul_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *a = node->inputs[0];
    AcTensor *b = node->inputs[1];

    if (a->grad_node) {
        AcTensor *grad_a = ac_mul(grad_output, b);
        ac_grad_accumulate(a->grad_node, grad_a);
    }
    if (b->grad_node) {
        AcTensor *grad_b = ac_mul(grad_output, a);
        ac_grad_accumulate(b->grad_node, grad_b);
    }
}

Matrix Multiplication

// C = A @ B (A: [M, K], B: [K, N], C: [M, N])
// ∂L/∂A = ∂L/∂C @ Bᵀ
// ∂L/∂B = Aᵀ @ ∂L/∂C

static void matmul_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *A = node->inputs[0];
    AcTensor *B = node->inputs[1];

    if (A->grad_node) {
        AcTensor *B_T = ac_transpose(B);
        AcTensor *grad_A = ac_matmul(grad_output, B_T);
        ac_grad_accumulate(A->grad_node, grad_A);
    }
    if (B->grad_node) {
        AcTensor *A_T = ac_transpose(A);
        AcTensor *grad_B = ac_matmul(A_T, grad_output);
        ac_grad_accumulate(B->grad_node, grad_B);
    }
}
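The rule ∂L/∂A = ∂L/∂C @ Bᵀ can be sanity-checked numerically. A self-contained sketch for the case L = Σ Cᵢⱼ (so ∂L/∂C is all ones, and the rule collapses to ∂L/∂A[i][k] = Σⱼ B[k][j]); the demo_* helpers and fixed 2×2 @ 2×3 sizes are illustrative only, not part of the Aicraft API:

```c
#include <math.h>

// L = sum of all entries of C = A @ B, computed directly.
double demo_sum_matmul(double A[2][2], double B[2][3]) {
    double L = 0.0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 2; k++)
                L += A[i][k] * B[k][j];
    return L;
}

// Central-difference estimate of dL/dA[i][k], for comparing with the rule.
double demo_numeric_grad(double A[2][2], double B[2][3], int i, int k) {
    const double eps = 1e-6;
    double save = A[i][k];
    A[i][k] = save + eps; double lp = demo_sum_matmul(A, B);
    A[i][k] = save - eps; double lm = demo_sum_matmul(A, B);
    A[i][k] = save;
    return (lp - lm) / (2.0 * eps);
}
```

For any A, demo_numeric_grad(A, B, i, k) should match the analytic row sum Σⱼ B[k][j] to within the finite-difference error, confirming both the transpose rule and the resulting shapes ([M, N] @ [N, K] = [M, K], matching A).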

ReLU

// y = max(0, x)
// ∂L/∂x = ∂L/∂y * 1(x > 0)

static void relu_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *x = node->inputs[0];

    if (x->grad_node) {
        AcTensor *grad_x = ac_tensor_new(x->shape, x->ndim);
        for (int i = 0; i < x->size; i++) {
            grad_x->data[i] = (x->data[i] > 0) ? grad_output->data[i] : 0;
        }
        ac_grad_accumulate(x->grad_node, grad_x);
    }
}
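The mask behavior is easy to exercise on a flat buffer. A minimal sketch (demo_relu_backward is a hypothetical helper, not part of the Aicraft API); note that at x = 0 the gradient is conventionally taken as 0, as in the code above:

```c
// Apply the ReLU gradient mask: pass grad_out where x was positive, else 0.
void demo_relu_backward(const float *x, const float *grad_out,
                        float *grad_in, int n) {
    for (int i = 0; i < n; i++)
        grad_in[i] = (x[i] > 0.0f) ? grad_out[i] : 0.0f;
}
```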

Softmax + Cross-Entropy

Combined for numerical stability:

// L = -Σ yᵢ log(softmax(xᵢ))
// ∂L/∂x = softmax(x) - y

static void softmax_ce_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *logits = node->inputs[0];
    AcTensor *targets = node->inputs[1]; // One-hot
    AcTensor *softmax_out = (AcTensor *)node->context;

    if (logits->grad_node) {
        AcTensor *grad = ac_tensor_new(logits->shape, logits->ndim);
        for (int i = 0; i < logits->size; i++) {
            grad->data[i] = softmax_out->data[i] - targets->data[i];
        }
        // Scale by incoming gradient (usually 1.0 for scalar loss)
        ac_tensor_scale_inplace(grad, ac_scalar(grad_output));
        ac_grad_accumulate(logits->grad_node, grad);
    }
}
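The compact gradient follows from the softmax Jacobian; a short derivation (with p = softmax(x) and one-hot targets y):

```latex
p_i = \frac{e^{x_i}}{\sum_j e^{x_j}},
\qquad
\frac{\partial p_i}{\partial x_k} = p_i\,(\delta_{ik} - p_k)

\frac{\partial L}{\partial x_k}
= -\sum_i \frac{y_i}{p_i}\,\frac{\partial p_i}{\partial x_k}
= -\sum_i y_i\,(\delta_{ik} - p_k)
= p_k \sum_i y_i - y_k
= p_k - y_k
\quad \text{(since } \textstyle\sum_i y_i = 1 \text{)}
```

Fusing the two operations avoids computing log(softmax) explicitly, which would overflow or lose precision for extreme logits.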

Gradient Accumulation

Gradients are accumulated (summed), not replaced. This handles:

  1. Multiple uses of the same tensor
  2. Batch dimension gradients

void ac_grad_accumulate(AcGradNode *node, AcTensor *grad) {
    if (!node->grad) {
        node->grad = ac_tensor_clone(grad);
    } else {
        for (int i = 0; i < node->grad->size; i++) {
            node->grad->data[i] += grad->data[i];
        }
    }
}
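For example, if y = x + x, then add_backward routes the same grad_output into x twice, and the two contributions must sum to the correct dy/dx = 2. A minimal sketch of the summation semantics on flat buffers (demo_accumulate is a hypothetical stand-in for ac_grad_accumulate):

```c
// Sum a new gradient contribution into an existing gradient buffer,
// mirroring the accumulate-don't-replace rule.
void demo_accumulate(float *grad, const float *contribution, int n) {
    for (int i = 0; i < n; i++)
        grad[i] += contribution[i];
}
```

Calling demo_accumulate once per use of the tensor (here, twice with all-ones contributions) leaves grad = {2, 2, ...}, matching the analytic derivative of y = x + x.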

Broadcasting in Backward

When forward involves broadcasting, backward must reduce:

// Forward: C[M, N] = A[M, N] + B[1, N] (B is broadcast)
// Backward: ∂L/∂B = sum over M dimension of ∂L/∂C

static void add_broadcast_backward(AcGradNode *node, AcTensor *grad_output) {
    AcTensor *b = node->inputs[1]; // The broadcast operand

    if (b->grad_node) {
        // Sum along broadcast dimensions
        AcTensor *grad_b = ac_sum(grad_output, /*axis=*/0);
        ac_grad_accumulate(b->grad_node, grad_b);
    }
}
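The reduction itself is a sum over the broadcast axis. A self-contained sketch for the [M, N] + [1, N] case above, on a row-major flat buffer (demo_sum_axis0 is an illustrative helper, not the Aicraft ac_sum):

```c
// Reduce grad_out of shape [M, N] over axis 0, producing grad_b of shape [N].
// Each column j sums the M gradient contributions that the broadcast
// operand b received in the forward pass.
void demo_sum_axis0(const float *grad_out, float *grad_b, int M, int N) {
    for (int j = 0; j < N; j++) grad_b[j] = 0.0f;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            grad_b[j] += grad_out[i * N + j];
}
```

The general rule: every dimension that was broadcast in the forward pass must be summed out in the backward pass, so the gradient's shape matches the original operand.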

Memory Management

Gradient Cleanup

After ac_optimizer_step, gradients are zeroed:

void ac_optimizer_zero_grad(AcOptimizer *opt) {
    for (int i = 0; i < opt->num_layers; i++) {
        AcTensor *w = opt->layers[i]->weights;
        if (w && w->grad_node && w->grad_node->grad) {
            ac_tensor_zero_inplace(w->grad_node->grad);
        }
        // Same for bias
    }
}

Graph Cleanup

The computational graph is freed with ac_mem_restore():

ac_mem_checkpoint();
// Forward pass builds graph
AcTensor *loss = ...;
ac_backward(loss);
ac_optimizer_step(opt);
ac_mem_restore(); // Frees all tensors and grad nodes

No-Grad Mode

Disable graph building for inference:

// Option 1: Explicit flag
ac_set_grad_enabled(false);
AcTensor *y = ac_forward_seq(net, n, x); // No graph built
ac_set_grad_enabled(true);

// Option 2: Use input without requires_grad
AcTensor *x = ac_tensor_from_data(data, shape, ndim);
x->requires_grad = false; // Default

Cycle Detection

Aicraft guards the backward traversal with an O(1) per-node visited check, so shared nodes are processed exactly once and a malformed (cyclic) graph cannot cause an infinite loop:

// During backward, each node is visited exactly once
// During backward, each node is visited exactly once
if (node->visited) {
    return; // Already processed
}
node->visited = true;

Supported Operations

The autograd engine supports 24 differentiable operations:

Category      Operations
Arithmetic    add, sub, mul, div, neg
Matrix        matmul, transpose
Reductions    sum, mean, max
Activations   relu, sigmoid, tanh, softmax, gelu
Loss          mse, cross_entropy, bce
Shape         reshape, squeeze, unsqueeze
Other         log, exp, pow

Adding Custom Ops

See Custom Layers Guide for implementing your own differentiable operations.


Debugging

// Print the computational graph
ac_print_graph(loss);

// Check specific gradients
ac_tensor_print(layer->weights->grad_node->grad);

// Numerical gradient check
ac_gradient_check(loss_fn, input, 1e-5, 1e-4);
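ac_gradient_check presumably compares analytic gradients against central differences; the exact signature above is taken as given. A self-contained sketch of the underlying technique (demo_grad_check and demo_sum_squares are hypothetical names):

```c
#include <math.h>

// Compare an analytic gradient against central differences:
// numeric[i] = (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps).
// Returns 1 when every element agrees within tol, else 0.
int demo_grad_check(double (*f)(const double *x, int n),
                    double *x, const double *analytic,
                    int n, double eps, double tol) {
    for (int i = 0; i < n; i++) {
        double save = x[i];
        x[i] = save + eps; double fp = f(x, n);
        x[i] = save - eps; double fm = f(x, n);
        x[i] = save;
        double numeric = (fp - fm) / (2.0 * eps);
        if (fabs(numeric - analytic[i]) > tol) return 0;
    }
    return 1;
}

// Example loss for the check: f(x) = sum x_i^2, whose gradient is 2x.
double demo_sum_squares(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * x[i];
    return s;
}
```

Central differences have O(eps²) truncation error, so an eps around 1e-5 with a tolerance around 1e-4 (as in the call above) is a reasonable pairing for double precision.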

Performance Considerations

  1. In-place operations don't support autograd — they break the graph
  2. Reuse checkpoints — avoid graph explosion in training loops
  3. Detach when needed: ac_tensor_detach(t) creates a copy with no graph connection

Next Steps