# Learning in neural networks

PhD Course: Solving engineering problems with neuro-inspired computation

Jörg Conradt & Jens Egholm Pedersen

1. Learning theory
2. Learning in neuromorphic systems
3. Encoding and decoding spikes

## 1. Learning theory

Consider measurable spaces $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ and let $\mathcal{M(X,Y)}$ denote the set of measurable functions from $\mathcal{X}$ to $\mathcal{Y}$.

**Goal**: Learn a **mapping** from a hypothesis set $$\mathcal{F} \subset \mathcal{M(X,Y)}$$ that, given some data in $\mathcal{Z}$, performs well with respect to a **loss function** 
$$\mathcal{L}: \mathcal{M(X,Y)} \times \mathcal{Z} \to \mathbb{R}$$ via a learning **algorithm** 
$$\mathcal{A}: \bigcup_{m \in \mathbb{N}} \mathcal{Z}^m \to \mathcal{F}$$

## 1.1 Empirical risk minimization

For some training data $s = (z^i)^m_{i=1} \in \mathcal{Z}^m$ and $f \in \mathcal{M(X,Y)}$, the empirical risk (ER) is defined as

$$\hat{\mathcal{R}}_s(f) \coloneqq \frac{1}{m}\sum^m_{i=1}\mathcal{L}(f, z^i)$$

This leads to the **empirical risk minimization** algorithm $\mathcal{A}^{\text{ERM}}$, which returns a minimizer of the empirical risk over the hypothesis set $\mathcal{F}$:

$$\mathcal{A}^{\text{ERM}}(s) \in \underset{f \in \mathcal{F}}{\text{arg min}}\ \hat{\mathcal{R}}_s(f)$$
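
As a minimal sketch of ERM, assume a finite hypothesis set of threshold classifiers on $[0, 1]$ and the 0-1 loss. The helper names (`zero_one_loss`, `empirical_risk`, `erm`, `make_threshold`) and the toy data are assumptions made for illustration only.

```python
import numpy as np

def zero_one_loss(f, z):
    """0-1 loss of hypothesis f on a sample z = (x, y)."""
    x, y = z
    return float(f(x) != y)

def empirical_risk(f, s, loss=zero_one_loss):
    """Average loss of f over the training data s = (z^1, ..., z^m)."""
    return sum(loss(f, z) for z in s) / len(s)

def erm(hypotheses, s, loss=zero_one_loss):
    """Return one minimizer of the empirical risk over a finite hypothesis set."""
    return min(hypotheses, key=lambda f: empirical_risk(f, s, loss))

def make_threshold(theta):
    """Hypothetical hypothesis: predict +1 if x > theta, else -1."""
    return lambda x: 1 if x > theta else -1

# Toy training data (assumed): the label flips from -1 to +1 at x = 0.5.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=50)
s = [(x, 1 if x > 0.5 else -1) for x in xs]

# Finite hypothesis set F: eleven evenly spaced thresholds.
hypotheses = [make_threshold(t) for t in np.linspace(0.0, 1.0, 11)]
f_hat = erm(hypotheses, s)
print(empirical_risk(f_hat, s))
```

For richer hypothesis sets such as neural networks, the exhaustive `min` is replaced by an iterative optimizer, which in general only approximates a minimizer.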

## 1.2 Risk

Let the training data $S = (Z^i)^m_{i=1}$ and a test sample $Z$ be i.i.d. random variables with distribution $\mathbb{P}_Z$. The risk of $f \in \mathcal{M(X,Y)}$ is defined as

$$\mathcal{R}(f) \coloneqq \mathbb{E}\left[\mathcal{L}(f, Z)\right] = \int_{\mathcal{Z}}\mathcal{L}(f, z)\, d\mathbb{P}_Z(z)$$

For binary classification tasks with labels in $\{-1, 1\}$ and the 0-1 loss, the risk becomes the **probability of misclassification**
$$\mathcal{R}(f) = \mathbb{E}\left[ \mathbb{1}_{(-\infty,0)}(Y\ f(X))\right] = \mathbb{P}\left[f(X) \neq Y\right]$$

where $Z = (X, Y)$ with $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$.
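
The risk is an expectation over the data distribution, so it can be approximated by sampling when that distribution is known. The sketch below assumes a toy distribution (uniform inputs, sign labels flipped with probability 0.1) and a fixed sign classifier; `sample_z`, `f` and `estimate_risk` are illustrative names, not part of the course material.

```python
import numpy as np

def sample_z(n, rng, flip_prob=0.1):
    """Draw n samples z = (x, y) from an assumed toy distribution."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(x >= 0, 1, -1)
    flip = rng.random(n) < flip_prob   # label noise
    return x, np.where(flip, -y, y)

def f(x):
    """Fixed classifier with outputs in {-1, 1}."""
    return np.where(x >= 0, 1, -1)

def estimate_risk(f, n, rng):
    """Monte Carlo estimate of the misclassification probability."""
    x, y = sample_z(n, rng)
    # Indicator of (-inf, 0) evaluated at y * f(x): 1 exactly when f(x) != y.
    return np.mean(y * f(x) < 0)

rng = np.random.default_rng(0)
risk_hat = estimate_risk(f, 100_000, rng)
print(risk_hat)  # close to the label-noise level 0.1
```

In practice the distribution is unknown, which is why the empirical risk on held-out data serves as a stand-in for this quantity.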

The *smallest achievable risk* is called the **Bayes risk**:

$$\mathcal{R}^* \coloneqq \underset{f\in\mathcal{M(X, Y)}}{\text{inf}}\mathcal{R}(f)$$

A function $f^*$ that achieves it, $\mathcal{R}(f^*) = \mathcal{R}^*$, is called a **Bayes-optimal function**.
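
As a worked example, consider binary classification with labels in $\{-1, 1\}$ and the 0-1 loss. The conditional probability $\eta$ is notation introduced only for this example. A Bayes-optimal function $f^*$ then predicts the more likely label at every point, and the smallest risk is the remaining label uncertainty:

$$\eta(x) \coloneqq \mathbb{P}\left[Y = 1 \mid X = x\right], \qquad f^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \frac{1}{2} \\ -1 & \text{otherwise} \end{cases}, \qquad \mathcal{R}^* = \mathbb{E}\left[\min\left(\eta(X), 1 - \eta(X)\right)\right]$$

If the labels are a deterministic function of the input, $\eta(x) \in \{0, 1\}$ and $\mathcal{R}^* = 0$.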

## 1.3 No free lunch theorem

![image.png](attachment:dce71670-6677-4707-80dc-9a1d100c9585.png)

> Shows the non-existence of a universal learning algorithm for every data distribution $\mathbb{P}_Z$ and shows that useful bounds must necessarily be accompanied by a priori regularity conditions on the underlying distribution $\mathbb{P}_Z$.

![image.png](attachment:48b1a2c7-f921-4fb6-893c-3fe87077f8a6.png)

## 1.4 Current state is a mess...

**Why do large neural networks not overfit?**

**Why do neural networks perform well in high-dimensional environments?**

**Which aspects of a neural network architecture affect the performance of deep learning?**

# 2. Learning in neuromorphic systems

What is a neuromorphic system? 

### Here: **Mixed-signal circuit**

![image.png](attachment:93e800d3-a716-4390-87df-d1dc5bad0802.png)

### More practically: recurrent neural network with spikes

![image.png](attachment:b1b9847b-a6cd-4ff7-8cef-16db6c98e723.png)

## 2.1 Training recurrent neural networks

> RNNs suffer from the vanishing and exploding gradient problem, which makes it difficult for these models to learn about long-range dependencies -- Orvieto et al. 2023 (DeepMind)
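
A minimal sketch of the effect, assuming a purely linear recurrence $h_t = W h_{t-1}$ (no nonlinearity, no inputs) with $W$ a scaled orthogonal matrix: the Jacobian of the final state with respect to the initial state is $W^T$, so its norm shrinks or grows geometrically with the number of steps. The function name `jacobian_norm` and all sizes are illustrative choices.

```python
import numpy as np

def jacobian_norm(W, steps):
    """Spectral norm of d h_T / d h_0 for the linear recurrence h_t = W h_{t-1}."""
    J = np.eye(W.shape[0])
    for _ in range(steps):
        J = W @ J  # chain rule: one factor of W per time step
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal matrix

for scale in (0.9, 1.0, 1.1):
    # The norm equals scale**100: vanishing below 1, exploding above 1.
    print(scale, jacobian_norm(scale * Q, steps=100))
```

Real RNNs add nonlinearities and inputs, but the same product of per-step Jacobians appears in the backward pass.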

### Training methods

1. Manual tuning
2. ANN to SNN conversion
3. Optimization on spatial **and** temporal components
    * Backward mode (backpropagation)
    * Forward mode (gradient approximation)

## 2.2 Backpropagation-through-time (BPTT)

![image.png](attachment:d09a8c64-af1d-41d0-a582-e648be9d653d.png)

(Eshraghian et al. 2023)

### 2.2.1 Surrogate gradients (SuperSpike)

Given a normalized convolutional kernel $\alpha$, a target spike train $\hat{S}$ and the actual spike train $S$, we get the gradient of the loss with respect to the weights:

![image.png](attachment:0619069a-c6d7-4743-9a21-bb6341612c5c.png)

We then insert an auxiliary function
![image.png](attachment:0f676dbc-4767-480c-ac28-786d38209140.png)

![image.png](attachment:aab56064-4cf0-433d-a666-919aeb6a2a65.png)

(Zenke and Ganguli, 2018)
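
The idea can be sketched in a few lines: the forward pass keeps the non-differentiable spike threshold, while the backward pass uses the derivative of a smooth auxiliary function of the membrane potential. The sketch below assumes the derivative of a fast sigmoid, as in SuperSpike; the threshold and steepness `beta` values are arbitrary choices for illustration.

```python
import numpy as np

def spike(u, threshold=1.0):
    """Forward pass: Heaviside step, spike when the membrane potential reaches threshold."""
    return (u >= threshold).astype(float)

def surrogate_grad(u, threshold=1.0, beta=10.0):
    """Backward pass: derivative of a fast sigmoid, 1 / (1 + beta * |u - threshold|)^2."""
    return 1.0 / (1.0 + beta * np.abs(u - threshold)) ** 2

u = np.array([0.0, 0.9, 1.0, 1.5])  # example membrane potentials
print(spike(u))           # zero derivative almost everywhere if differentiated directly
print(surrogate_grad(u))  # smooth stand-in, largest at the threshold
```

In an autodiff framework the two functions would be paired in a custom gradient rule, so that the optimizer sees `surrogate_grad` wherever the derivative of `spike` is needed.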

### 2.2.2 Forward-mode approximation

Actually, an old idea (see [Schmidhuber 1987](https://people.idsia.ch/~juergen/diploma1987ocr.pdf) or [Xie & Seung 2004](https://link.aps.org/doi/10.1103/PhysRevE.69.041909)).

![image.png](attachment:271d32ea-83b9-4970-a16a-f938241d8127.png)

(Bellec et al. 2020)
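
A toy sketch of the forward-mode idea, under strong simplifying assumptions: a single non-spiking leaky integrator $v_t = \alpha v_{t-1} + w x_t$ with a squared-error loss at every step. The gradient factorizes into an error signal available at time $t$ and an eligibility trace $e_t = \partial v_t / \partial w$ that is updated forward in time, so no history has to be stored. For this toy case the result is exact; in recurrent spiking networks the same factorization is only an approximation. All names are illustrative.

```python
import numpy as np

def loss_and_forward_grad(w, x, y, alpha=0.9):
    """Return the loss and dL/dw, with the gradient accumulated online."""
    v, e = 0.0, 0.0           # membrane state and eligibility trace dv/dw
    loss, grad = 0.0, 0.0
    for x_t, y_t in zip(x, y):
        v = alpha * v + w * x_t
        e = alpha * e + x_t    # trace follows the same leaky dynamics
        err = v - y_t          # error signal at time t
        loss += 0.5 * err ** 2
        grad += err * e        # error signal times eligibility trace
    return loss, grad

rng = np.random.default_rng(0)
x, y = rng.random(50), rng.random(50)
w = 0.3
loss, grad = loss_and_forward_grad(w, x, y)

# Check against a central finite difference of the loss.
eps = 1e-6
numeric = (loss_and_forward_grad(w + eps, x, y)[0]
           - loss_and_forward_grad(w - eps, x, y)[0]) / (2 * eps)
print(grad, numeric)
```

Memory use is constant in the sequence length, in contrast to BPTT, which stores the full state history before the backward pass.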

# 3. Encoding and decoding spikes

Given an MNIST image, how do we turn that into spikes?

* Rate coding
* Latency coding
* Population coding

## 3.1 Rate coding

![image.png](attachment:e86c274f-f747-4bd9-8f30-962a175c0cd0.png)

(Eshraghian et al. 2023)
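
A minimal sketch of rate coding, assuming pixel intensities normalized to $[0, 1]$ and independent Bernoulli draws at every time step (one common choice among several). The 2x2 array stands in for an MNIST image and `rate_encode` is an illustrative name.

```python
import numpy as np

def rate_encode(image, num_steps, rng):
    """Each pixel spikes at every time step with probability equal to its intensity."""
    image = np.clip(image, 0.0, 1.0)
    draws = rng.random((num_steps,) + image.shape)
    return (draws < image).astype(np.uint8)  # shape: (time, height, width)

rng = np.random.default_rng(0)
image = np.array([[0.0, 0.25], [0.5, 1.0]])  # stand-in for a normalized image
spikes = rate_encode(image, num_steps=1000, rng=rng)

# The average firing rate per pixel approximates the original intensity.
print(spikes.mean(axis=0))
```

The information sits in the spike count, so precision improves with more time steps at the cost of more spikes.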

## 3.2 Latency coding

![image.png](attachment:2116437d-ba24-492d-8b51-868323775ff0.png)

(Eshraghian et al. 2023)
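
A minimal sketch of latency coding, assuming a linear mapping from intensity to spike time: brighter pixels spike earlier, each pixel spikes at most once, and pixels below a small cutoff stay silent. The linear mapping, the cutoff `min_intensity` and the function name are assumptions; logarithmic mappings are also common.

```python
import numpy as np

def latency_encode(image, num_steps, min_intensity=0.01):
    """One spike per pixel; intensity 1 spikes at step 0, lower intensities later."""
    image = np.clip(image, 0.0, 1.0)
    spike_times = np.rint((1.0 - image) * (num_steps - 1)).astype(int)
    spikes = np.zeros((num_steps,) + image.shape, dtype=np.uint8)
    idx = np.nonzero(image >= min_intensity)   # pixels bright enough to spike
    spikes[(spike_times[idx],) + idx] = 1
    return spikes  # shape: (time, height, width)

image = np.array([[0.0, 0.25], [0.5, 1.0]])  # stand-in for a normalized image
spikes = latency_encode(image, num_steps=5)

print(spikes.sum(axis=0))     # number of spikes per pixel (0 or 1)
print(spikes.argmax(axis=0))  # spike time per pixel (0 also for silent pixels)
```

Compared with rate coding, this produces far fewer spikes, and the information sits in when a spike happens rather than in how many occur.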

## 3.3 Population coding

![image.png](attachment:c9205f92-fdd2-4be4-8a81-0a41d8d662a4.png)

(Wikipedia)
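
A minimal sketch of population coding, assuming a bank of neurons with Gaussian tuning curves whose preferred values tile $[0, 1]$. A scalar is represented by the joint activity of the population and read out as an activity-weighted average of the preferred values. The tuning width `sigma`, the number of neurons and the function names are illustrative choices.

```python
import numpy as np

def population_encode(value, centers, sigma=0.1):
    """Firing rate of each neuron: Gaussian tuning curve around its preferred value."""
    return np.exp(-0.5 * ((value - centers) / sigma) ** 2)

def population_decode(rates, centers):
    """Read out the value as the rate-weighted average of the preferred values."""
    return np.sum(rates * centers) / np.sum(rates)

centers = np.linspace(0.0, 1.0, 11)  # preferred values of 11 neurons
rates = population_encode(0.5, centers)

print(np.round(rates, 3))                 # bump of activity centered on 0.5
print(population_decode(rates, centers))  # recovers a value close to 0.5
```

The rates could then be turned into spikes with a rate or latency code, so that each input dimension is carried by several neurons.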

## Literature

* [Modern mathematics of Deep Learning](http://arxiv.org/abs/2105.04026)
* [Training Spiking Neural Networks Using Lessons From Deep Learning](https://ieeexplore.ieee.org/document/10242251) 
* [SuperSpike: Supervised Learning in Multilayer Spiking Neural Networks](https://direct.mit.edu/neco/article/30/6/1514-1541/8378)
* [Neuromorphic Engineering: In Memory of Misha Mahowald](https://doi.org/10.1162/neco_a_01553)
* [Resurrecting Recurrent Neural Networks for Long Sequences](http://arxiv.org/abs/2303.06349)
* [A solution to the learning dilemma for recurrent networks of spiking neurons](https://www.biorxiv.org/content/10.1101/738385v4)