Learning in neural networks#
PhD Course: Solving engineering problems with neuro-inspired computation
Jörg Conradt & Jens Egholm Pedersen
Learning theory
Learning in neuromorphic systems
Encoding and decoding spikes
1. Learning theory#
Consider measurable spaces \(\mathcal{X}, \mathcal{Y}, \mathcal{Z}\) and the set of measurable functions \(\mathcal{M(X,Y)} = \{f : \mathcal{X} \to \mathcal{Y}\}\).
Goal: Learn a mapping from a hypothesis class \(\mathcal{F} \subseteq \mathcal{M(X,Y)}\), given some data from \(\mathcal{Z}\) and a **loss function** \(\mathcal{L}: \mathcal{M(X,Y)} \times \mathcal{Z} \to \mathbb{R}\), via a **learning algorithm** \(\mathcal{A}: \bigcup_{m \in \mathbb{N}} \mathcal{Z}^m \to \mathcal{F}\).
1.1 Empirical risk minimization#
For some training data \(s = (z^i)^m_{i=1} \in \mathcal{Z}^m\) and \(f \in \mathcal{M(X,Y)}\), empirical risk (ER) is defined as
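In the standard form (the symbol \(\widehat{\mathcal{R}}_s\) is assumed notation here):

\[
\widehat{\mathcal{R}}_s(f) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(f, z^i\right)
\]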
This leads to the empirical risk minimization algorithm \(\mathcal{A}^{\text{ERM}}\):
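In its usual form (a sketch with assumed notation), it returns any minimizer of the empirical risk over \(\mathcal{F}\):

\[
\mathcal{A}^{\text{ERM}}(s) \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \widehat{\mathcal{R}}_s(f)
\]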
1.2 Risk#
Given some training data \(S = (Z^i)^m_{i=1}\) with samples drawn from \(\mathbb{P}_Z\), the risk is defined as
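the expected loss under the data distribution (notation assumed):

\[
\mathcal{R}(f) = \mathbb{E}_{Z \sim \mathbb{P}_Z}\left[ \mathcal{L}(f, Z) \right]
\]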
For classification tasks, the risk becomes the probability of misclassification, \(\mathcal{R}(f) = \mathbb{E}\left[ \mathbb{1}_{(-\infty,0)}(Y\, f(X))\right] = \mathbb{P}(f(X) \neq Y)\), for \(X \in \mathcal{X}\) and \(Y \in \mathcal{Y}\).
A function achieving the smallest risk is called a Bayes-optimal function:
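In symbols (a sketch with assumed notation):

\[
f^{*} \in \operatorname*{arg\,min}_{f \in \mathcal{M(X,Y)}} \mathcal{R}(f)
\]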
1.3 No free lunch theorem#
The theorem shows that no universal learning algorithm exists that works for every data distribution \(\mathbb{P}_Z\), and that useful error bounds must necessarily be accompanied by a priori regularity conditions on the underlying distribution \(\mathbb{P}_Z\).
1.4 Current state is a mess…#
Why do large neural networks not overfit?
Why do neural networks perform well in high-dimensional environments?
Which aspects of a neural network architecture affect the performance of deep learning?
2. Learning in neuromorphic systems#
What is a neuromorphic system?
Here: Mixed-signal circuit#
More practically: recurrent neural network with spikes#
2.1 Training recurrent neural networks#
RNNs suffer from the vanishing and exploding gradient problem, which makes it difficult for these models to learn about long-range dependencies – Orvieto et al. 2023 (DeepMind)
Training methods#
Manual tuning
ANN to SNN conversion
Optimization on spatial and temporal components
Backward mode (backpropagation)
Forward mode (gradient approximation)
2.2 Backpropagation-through-time (BPTT)#
(Eshraghian et al. 2023)
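A minimal PyTorch sketch of what BPTT means structurally: unroll the neuron state over time and backpropagate through the whole unrolled graph. (Illustrative only; the decay constant and toy loss are assumptions, and spike generation with surrogate gradients is the topic of the next subsection.)

```python
import torch

T = 100                      # number of time steps
beta = 0.9                   # membrane decay constant (assumed value)
x = torch.rand(T, 10)        # input sequence with 10 input channels
w = torch.randn(10, 1, requires_grad=True)  # input weights to one neuron

v = torch.zeros(1)           # membrane potential
for t in range(T):           # forward pass, unrolled in time
    v = beta * v + x[t] @ w  # leaky integration of the weighted input

loss = (v - 1.0).pow(2).sum()  # toy target: final potential should reach 1.0
loss.backward()                # BPTT: gradients flow back through all T steps
print(w.grad.shape)            # torch.Size([10, 1])
```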
2.2.1 Surrogate gradients (SuperSpike)#
Given a normalized convolution kernel \(\alpha\), a target spike train \(\hat{S}\), and an actual spike train \(S\), we get the gradient of the loss with respect to the weights:
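The van Rossum-style loss and its weight gradient from the SuperSpike derivation take roughly the following form (notation adapted; \(*\) denotes convolution with the kernel \(\alpha\)):

\[
\mathcal{L} = \frac{1}{2} \int \left[\alpha * (\hat{S}_i - S_i)\right]^2(t)\, \mathrm{d}t,
\qquad
\frac{\partial \mathcal{L}}{\partial w_{ij}} = -\int \left[\alpha * (\hat{S}_i - S_i)\right](t)\,\left[\alpha * \frac{\partial S_i}{\partial w_{ij}}\right](t)\, \mathrm{d}t
\]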
We then insert an auxiliary (surrogate) function in place of the ill-defined spike derivative:
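In SuperSpike this auxiliary function is a fast sigmoid of the membrane potential \(U_i\), so the spike derivative is replaced by a smooth surrogate (the steepness \(\beta\) and threshold \(\vartheta\) notation here is assumed):

\[
S_i(t) \approx \sigma(U_i(t)), \qquad
\sigma'(U_i) = \frac{1}{\left(1 + \beta\,|U_i - \vartheta|\right)^2}
\]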
(Zenke and Ganguli, 2018)
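A minimal PyTorch sketch of this idea: a custom autograd function that emits a hard-threshold spike in the forward pass and uses a fast-sigmoid derivative in the backward pass. The steepness value is an assumption, and the membrane potential is measured relative to the threshold.

```python
import torch

class SuperSpikeSurrogate(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid derivative backward."""
    beta = 10.0  # surrogate steepness (assumed value)

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()          # spike if potential crosses threshold

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        surrogate = 1.0 / (1.0 + SuperSpikeSurrogate.beta * v.abs()) ** 2
        return grad_output * surrogate  # replace dS/dv with the smooth surrogate

spike_fn = SuperSpikeSurrogate.apply

# Usage: spikes are binary, yet gradients flow back to the membrane potential.
v = torch.randn(5, requires_grad=True)
spikes = spike_fn(v)
spikes.sum().backward()
print(v.grad)
```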
2.2.2 Forward-mode approximation#
Actually, an old idea (see Schmidhuber 1987 or Xie & Seung 2004).
(Bellec et al. 2020)
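To make the contrast with BPTT concrete, here is a minimal sketch of the simplest forward-mode idea: gradient estimation by random weight perturbation. This is in the spirit of perturbation-based learning rather than the eligibility-trace e-prop algorithm of Bellec et al., and the toy loss and hyperparameters are assumptions.

```python
import numpy as np

def loss(w, x, y):
    """Toy quadratic loss for a linear readout; stands in for a full SNN loss."""
    return np.mean((x @ w - y) ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))          # toy inputs
y = rng.normal(size=(32,))             # toy targets
w = np.zeros(10)

sigma, lr = 1e-3, 1e-2
for step in range(1000):
    eps = rng.normal(size=w.shape)     # random perturbation direction
    # Finite-difference estimate of the directional derivative along eps,
    # turned into a noisy gradient estimate -- no backward pass required.
    delta = (loss(w + sigma * eps, x, y) - loss(w, x, y)) / sigma
    w -= lr * delta * eps              # descend along the estimated gradient

print(loss(w, x, y))
```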
3. Encoding and decoding spikes#
Given an MNIST image, how do we turn that into spikes?
Rate coding
Latency coding
Population coding
3.1 Rate coding#
(Eshraghian et al. 2023)
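A minimal sketch of rate coding: each pixel spikes with probability equal to its normalized intensity, sampled independently at every time step (the number of time steps is an assumption).

```python
import torch

def rate_encode(image, num_steps=100):
    """Rate coding: pixel intensity becomes per-step spike probability.

    image: float tensor with values in [0, 1] (e.g. a normalized MNIST image).
    Returns a binary spike tensor of shape (num_steps, *image.shape).
    """
    intensity = image.clamp(0.0, 1.0)
    return (torch.rand(num_steps, *image.shape) < intensity).float()

spikes = rate_encode(torch.rand(28, 28), num_steps=100)
print(spikes.shape)         # torch.Size([100, 28, 28])
print(spikes.mean(dim=0))   # empirical firing rate ~ pixel intensity
```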
3.2 Latency coding#
(Eshraghian et al. 2023)
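A minimal sketch of latency coding: brighter pixels spike earlier, and each pixel spikes at most once. The linear intensity-to-time mapping is an assumption; logarithmic mappings are also common.

```python
import torch

def latency_encode(image, num_steps=100):
    """Latency coding: map pixel intensity to the time of a single spike.

    image: float tensor with values in [0, 1]. Returns spikes of shape
    (num_steps, *image.shape).
    """
    intensity = image.clamp(0.0, 1.0)
    # Intensity 1.0 -> time step 0, intensity 0.0 -> last time step.
    spike_time = ((1.0 - intensity) * (num_steps - 1)).long()
    spikes = torch.zeros(num_steps, *image.shape)
    spikes.scatter_(0, spike_time.unsqueeze(0), 1.0)
    return spikes

spikes = latency_encode(torch.rand(28, 28))
print(spikes.sum(dim=0))    # exactly one spike per pixel
```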
3.3 Population coding#
(Wikipedia)
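A minimal sketch of population coding: a scalar stimulus is represented by the graded response of a population of neurons with Gaussian tuning curves whose preferred values tile the stimulus range (the tuning width is an assumption).

```python
import torch

def population_encode(value, num_neurons=10, sigma=0.1):
    """Population coding: one scalar in [0, 1] -> activity of num_neurons cells."""
    preferred = torch.linspace(0.0, 1.0, num_neurons)   # preferred stimuli
    return torch.exp(-0.5 * ((value - preferred) / sigma) ** 2)

activity = population_encode(torch.tensor(0.3))
print(activity)   # peaks at neurons whose preferred value is near 0.3
```

The resulting graded activations can then be turned into spike trains per neuron with, for example, the rate or latency schemes above.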