Target Propagation: A Biologically Plausible Neural Network Training Algorithm
An Alternative to Backpropagation Founded on Targets, Not Gradients
Quick Intro
LeetArxiv is Leetcode for implementing Arxiv papers.

The code for this article is available on GitHub.
1.0 Introduction
Gradient-based learning algorithms like backpropagation and conjugate gradient methods are biologically implausible. Biologically-inspired alternatives1 to gradient-based learning include the forward-forward algorithm (Hinton, 2022)2, NEAT, or Neuro-Evolution of Augmenting Topologies (Stanley & Miikkulainen, 2002)3, equilibrium propagation (Scellier & Bengio, 2016)4, direct feedback alignment (Nøkland, 2016)5 and NoProp (Li et al., 2025)6.
This article focuses on target propagation, a biologically plausible alternative to backpropagation introduced in (Bengio, 2014)7 and improved upon as difference target propagation in (Lee et al., 2015)8.
Target propagation is built on the thesis that auto-encoders are good at reconstruction, and that this reconstruction ability can be used to learn the computation that backpropagation normally performs.
2.0 Forward Pass
Target propagation permits training deep networks with long-term dependencies or strong non-linearities, such as a composition of tanh layers (Lee et al., 2015). Such networks are constrained neither by depth nor by vanishing or exploding gradients.
Each layer of our neural network resembles part of a denoising auto-encoder (DAE). DAEs are trained to find meaningful representations of data by learning to remove noise from corrupted inputs (Vincent et al., 2008)9.
3.0 Backward Pass
The backward pass propagates targets instead of gradients. Learning occurs in target propagation by using a learned approximate inverse of each layer to turn the target of the layer above into a target for the layer below.
As mentioned in (Lee et al., 2015), each layer's target is chosen to be a value close to its current activation that, if reached, would hopefully lead to a lower global loss.
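In the notation of (Lee et al., 2015), if h_i is the activation of layer i, ĥ_i its target, and g_i the learned approximate inverse of the layer's forward mapping, the difference target propagation rule is

ĥ_i = h_i + g_i(ĥ_{i+1}) - g_i(h_{i+1}),

where the subtracted term corrects for g_i being only an approximate inverse. This is the rule the compute_targets method below implements.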
In Python, an inverse layer resembles a forward layer. This is the structure of a single layer:
import torch.nn as nn
import torch.nn.functional as F

class LinearWithInverse(nn.Module):
    """Linear layer with a learned approximate inverse"""
    def __init__(self, in_features, out_features):
        super(LinearWithInverse, self).__init__()
        # Forward mapping f_i: layer i -> layer i+1
        self.forward_layer = nn.Linear(in_features, out_features)
        # Learned approximate inverse g_i: layer i+1 -> layer i
        self.inverse_layer = nn.Linear(out_features, in_features)

    def forward(self, x):
        return self.forward_layer(x)

    def inverse(self, y):
        return self.inverse_layer(y)
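The training loop shown later in this article does not train the inverse mappings with a separate objective, but in (Bengio, 2014) and (Lee et al., 2015) the inverse (feedback) mapping is trained with a reconstruction loss on noise-corrupted activations, much like the denoising auto-encoders mentioned above. A minimal sketch of such an inverse-training step, assuming the LinearWithInverse layer defined above and an illustrative noise level sigma and step size lr, might look like this:
import torch
import torch.nn.functional as F

def train_inverse_step(layer, h, sigma=0.1, lr=0.01):
    """One reconstruction step for a layer's inverse mapping (illustrative sketch)."""
    # Corrupt the layer input, push it through the forward mapping...
    h_noisy = h + sigma * torch.randn_like(h)
    out = F.relu(layer.forward_layer(h_noisy))
    # ...and ask the inverse to reconstruct the corrupted input
    reconstruction = layer.inverse(out)
    loss = F.mse_loss(reconstruction, h_noisy)
    # Update only the inverse layer's parameters
    grads = torch.autograd.grad(loss, list(layer.inverse_layer.parameters()))
    with torch.no_grad():
        for p, g in zip(layer.inverse_layer.parameters(), grads):
            p -= lr * g
    return loss.item()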
The improved Difference Target Propagation network architecture has this structure:
class DTPNetwork(nn.Module):
    """Network for Difference Target Propagation"""
    def __init__(self, layer_sizes):
        super(DTPNetwork, self).__init__()
        self.layers = nn.ModuleList()
        # Create layers with forward and inverse mappings
        for i in range(len(layer_sizes) - 1):
            self.layers.append(LinearWithInverse(layer_sizes[i], layer_sizes[i + 1]))

    def forward(self, x):
        activations = [x]
        for layer in self.layers:
            x = F.relu(layer(x))
            activations.append(x)
        return activations

    def compute_targets(self, activations, labels, learning_rate=0.1):
        targets = [None] * len(activations)
        top_layer = len(activations) - 1
        # Top-layer target: a small step from the current output toward the label
        targets[top_layer] = activations[top_layer] + learning_rate * (labels - activations[top_layer])
        # Propagate targets downward with the difference correction:
        # h_hat_i = h_i + g_i(h_hat_{i+1}) - g_i(h_{i+1})
        for i in range(top_layer - 1, 0, -1):
            targets[i] = (activations[i]
                          + self.layers[i].inverse(targets[i + 1])
                          - self.layers[i].inverse(activations[i + 1]))
        return targets
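As a quick sanity check, here is a minimal usage sketch. The layer sizes, batch size, and random inputs are arbitrary choices for illustration:
import torch
import torch.nn.functional as F

# Hypothetical 784-256-10 network on a batch of 32 flattened 28x28 images
model = DTPNetwork([784, 256, 10])
x = torch.randn(32, 784)
labels = F.one_hot(torch.randint(0, 10, (32,)), num_classes=10).float()

activations = model(x)   # [input, hidden, output]: shapes (32, 784), (32, 256), (32, 10)
targets = model.compute_targets(activations, labels, learning_rate=0.1)
# targets[0] is None (the input gets no target); the other targets match
# the hidden and output activation shapes
print([None if t is None else tuple(t.shape) for t in targets])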
4.0 Training the Network
Training via target propagation is computationally more expensive than backpropagation. A local gradient is still computed for each layer's own loss, but that gradient is never shared across layers. Here is the training function:
def train_dtp(model, train_loader, optimizer, epochs=10, learning_rate=0.1):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            # Flatten the image
            data = data.view(data.size(0), -1)
            # Forward pass to get activations
            activations = model(data)
            # Convert target to one-hot encoding
            target_onehot = F.one_hot(target, num_classes=10).float()
            # Compute targets using DTP
            targets = model.compute_targets(activations, target_onehot, learning_rate)
            # The optimizer is only used to clear gradients; updates are manual
            optimizer.zero_grad()
            # Compute a local loss for each layer and update that layer alone
            for i in range(1, len(activations)):
                # Layer-specific loss: drive the activation toward its target
                layer_loss = F.mse_loss(activations[i], targets[i])
                # Backward pass for this layer's loss only
                layer_loss.backward(retain_graph=True)
                # Update only the current layer's parameters
                for param in model.layers[i - 1].parameters():
                    if param.grad is not None:
                        param.data -= learning_rate * param.grad
                        param.grad.zero_()
                total_loss += layer_loss.item()
            if batch_idx % 100 == 0:
                print(f'Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}'
                      f' ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {layer_loss.item():.6f}')
        print(f'Epoch: {epoch}, Average Loss: {total_loss / len(train_loader.dataset):.6f}')
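For completeness, a minimal driver for this function might look like the sketch below. The MNIST transform, batch size, and SGD optimizer are illustrative assumptions rather than choices prescribed by the papers:
import torch
import torch.optim as optim
from torchvision import datasets, transforms

# Hypothetical setup: 28x28 MNIST digits flattened into a 784-256-10 DTP network
train_set = datasets.MNIST('./data', train=True, download=True, transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = DTPNetwork([784, 256, 10])
# train_dtp only uses the optimizer to zero gradients; parameter updates are manual
optimizer = optim.SGD(model.parameters(), lr=0.1)

train_dtp(model, train_loader, optimizer, epochs=10, learning_rate=0.1)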
5.0 Results
On CPU, this implementation takes about 5 minutes to reach 39% accuracy on MNIST. Target propagation is extremely slow compared to backpropagation, which is a major reason it failed to go mainstream after 2015.
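The accuracy figure comes from a standard evaluation loop. A minimal sketch, assuming a test_loader built the same way as the hypothetical train_loader above:
import torch

def evaluate(model, test_loader):
    """Classification accuracy from the argmax of the top-layer activation."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data = data.view(data.size(0), -1)
            output = model(data)[-1]  # top-layer activation
            correct += (output.argmax(dim=1) == target).sum().item()
    return correct / len(test_loader.dataset)

print(f'Test accuracy: {evaluate(model, test_loader):.2%}')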
Footnotes
Penkovsky. (2019). Are There Alternatives to Backpropagation? StackOverflow.
Hinton, G. (2022). The Forward-Forward Algorithm: Some Preliminary Investigations. ArXiv. https://doi.org/10.48550/arXiv.2212.13345
Stanley, K., & Miikkulainen, R. (2002). Evolving Neural Networks Through Augmenting Topologies. Evolutionary Computation, 10(2), 99–127. doi:10.1162/106365602320169811
Scellier, B., & Bengio, Y. (2016). Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation. ArXiv. https://doi.org/10.48550/arXiv.1602.05179
Nøkland, A. (2016). Direct Feedback Alignment Provides Learning in Deep Neural Networks. ArXiv. https://doi.org/10.48550/arXiv.1609.01596
Li, Q., Teh, Y., & Pascanu, R. (2025). NoProp: Training Neural Networks without Back-propagation or Forward-propagation. ArXiv. https://doi.org/10.48550/arXiv.2503.24322
Bengio, Y. (2014). How Auto-Encoders Could Provide Credit Assignment in Deep Networks via Target Propagation. ArXiv. https://doi.org/10.48550/arXiv.1407.7906
Lee, D.-H., Zhang, S., Fischer, A., & Bengio, Y. (2015). Difference Target Propagation. ArXiv. https://doi.org/10.48550/arXiv.1412.7525
Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. ICML 2008.