Incremental Risk Minimization Algorithm

Example - How it Works - Incremental Risk Functional - Application to LIP Regression - Limitations - What's next? - Further Reading

Incremental Regression with Polynomials ↑

Incremental (or on-line) learning regression is the process of adapting a model one example at a time without accumulating a batch of data. It has the advantages of allowing continuous adaptation to non-stationary environments, easily handling big data through stream processing, and a fixed low computation and memory demand.

The easiest solution is to perform a gradient descent on a squared error metric with each new training example. But this solution does not work well for complex model structures. Especially, the influence of a non-linear transformation of the inputs through a fixed model structure has long been an open problem. During my PhD I worked on an approach which is able to deal with a broad class of non-linear model structures. Its emphasis is on minimizing the effect of local training examples on changes of the global model. Thus, it yields a robust behavior by preventing overfitting on sparse data as well as fatal forgetting.

As is often the case, I finished my PhD and left my research behind moving on to new exciting things. But, recently I decided to revisit it again, especially as the underlying principle is applicable in a much larger set of circumstances than the specific one I investigated. I wanted to make the idea more accessible and thus built this interactive example.

Learning Algorithm Setup

Auto Sampling of Sine Function

This simple example allows you to interactively train a polynomial model on incremental data. You can click at any point in the graph and generate a training example. The model will adapt right away and continue to do so with every new example you provide. The latest example (i.e. the only one the algorithm has knowledge of at any point in time) is highlighted.

At the top you can set up the learning algorithm:

Fade On/Off: Fade out old training examples (just for visualization purposes - the incremental algorithm always only uses the latest example).
IRMA/GD: Selects between my IRMA algorithm and the commonly used Gradient Descent.
Learning Rate: Parameter impacting how much the model adapts to a new example with 0 meaning no adaptation at all and 1 meaning full reproduction of the example.
Polynomial Order: Chooses the complexity of the model from a simple constant (order = 0) to a polynomial up to $x^{20}$ with 21 parameters.
Reset Training: Removes all examples and resets the model to parameters of all zero.

At the bottom you can automatically sample data from a sine function:

Random On/Off: Choose the next training example randomly or linearly loop through $[-10, 10]$
Amplitude: Choose the amplitude of the sine function between $[0, 20]$.
Phase: Choose the phase shift of the sine function between $[0, 2\pi]$.
Noise: Choose the amplitude of uniformly distributed noise added to the training example.
Start/Stop Auto Sample: Start/stop automatically generating new examples form the since curve.

Changing the amplitude and/or phase are a convenient example of a non-stationary process, i.e. the learning algorithm needs to adapt to new relationships in the data over time. You can also switch between auto sampling and manually providing examples to see how the learning algorithm reacts.

How it Works ↑

In general, a learning system tries to minimize the loss or discrepancy between the response given by training data for an input and the response provided by the learning system. In the case of batch learning, the learning system has all data available at once and can find the best approximation taking all this data into account. In contrast, an incremental learning system only has its latest approximation and a single new example available to update the approximation. The approximation often is defined by a vector of parameters that the learning system can choose.

In the classical gradient descent algorithm, the update of the approximation is performed such that the new example is incorporated to reduce the loss while minimizing the change of the parameter vector (as measured by the Euclidean distance). But each parameter can have a very different impact on the output of the model. Take for example a linear model with two parameters - the intercept and the slope. Changing the intercept by 0.1 changes the output of the model by 0.1 at every input. However, changing the slope by 0.1 changes the output for example by 1 at the input 10. So a change of the same magnitude in the slope parameter affects the model much more than for the intercept parameter.

This leads to the fundamental idea of IRMA. We don't have any past data anymore. But that doesn't mean we don't know where past examples might have been. If we have learned well so far, they were very likely close to the input-output curve of the current approximation. Hence, we need to measure how each parameter affects the model across the input space to change this curve as little as possible.

Following is a more detailed mathematical description of how this is achieved.

Incremental Risk Functional ↑

The Incremental Risk Minimization Algorithm (IRMA) is inspired by the risk functional for batch learning $$R(h) = \int L(y,h(x)) dF(x,y)$$ that describes the risk of a loss $L$ for a chosen hypothesis $h \in H$ given the data distribution $F(x,y)$. For a set of examples $(x_i,y_i),i=1,...,n_d$ drawn independently form the distribution $F(x,y)$, the integral can be approximated by the empirical risk $$R_e(h) = \frac{1}{n_d} \sum_{i=1}^{n_d} L(y_i,h(x_i))$$ $$= \frac{1}{n_d} \sum_{i=1}^{n_d−1} L(y_i,h(x_i)) + \frac{1}{n_d} L(y_{n_d},h(x_{n_d})).$$ In the incremental case older examples for $i=1,...,n_d−1$ are not available. But they are embedded in the hypothesis $h_t$, especially its encoded input-output relation, as they were learned before. So the output resulting from the current hypothesis $h_t$ can be used to describe the risk of a loss regarding those examples for a newly chosen hypothesis $h$ in each step. Hence, to measure the distance between a new and the old hypothesis, IRMA uses the change of the model output $$D(h) = \frac{\sigma}{2} \cdot \int_X L(h_t(x),h(x))dx$$ on a bounded input space $X$. In contrast to Gradient Descent, the change of the hypothesis is measured by the inﬂuence on the change of the global input-output relation and not by the change of the parameter vector. Together with the example loss the incremental approximation of the risk functional is then given by $$R_{inc}(h) = \frac{\sigma}{2} \cdot \int_X L(h_t(x), h(x)) dx + \frac{1}{2} L(y_t,h(x_t))$$ with a weighing factor, called stiffness, $\sigma \gt 0$ to choose the new hypothesis according to $$h_{t+1} = \underset{h}{\operatorname{argmin}} R_{inc}(h).$$

Application to LIP Regression ↑

The above deﬁnition of IRMA is generally applicable for any learning system $(h, L)$. I investigated a specific class of systems in more detail for the task of regression. The class is that of models that are Linear In Parameters (LIP) and take the form $$y = \omega^T \cdot \phi(x) = \sum_{i=1}^n \omega_i \cdot \phi_i(x)$$ with the parameter vector $\omega$ and a vector of arbitrary but fixed transformations $\phi$. Typical examples of this would be local linear regression (as often seen in Fuzzy systems) or polynomials as used in the example above. Using the squared loss for regression and a LIP model as the hypothesis $h$, the risk functional takes the form $$R_{inc}(\omega) = \frac{\sigma}{2} \cdot \int_X ((\omega_t−\omega)^T\phi(x))^2 dx + \frac{1}{2} (y_t − \omega^T\phi(x_t))^2.$$ Minimizing this incremental risk functional yields the update of the parameter vector in each step. As $R_{inc}(\omega) \geq 0 \ \forall \ \sigma \gt 0$, if a unique solution of a zero partial derivative exists, it minimizes the risk functional. The resulting equation for $$\frac{\partial R_{inc}}{\partial \omega_i} = 0 \ \forall \ i \in [1; n]$$ then has the form of a linear system of equations which can be written as $$(A + \frac{1}{\sigma} B(x_t)) \omega_{t+1} = A \omega_t + \frac{1}{\sigma}\phi(x_t)y_t$$ with symmetric matrices $$(A)_{i,j} = \int_X \phi_i(x) \phi_j(x)dx$$ $$(B(x_t))_{i,j} = \phi_i(x_t) \phi_j(x_t).$$ It can be shown that the inverse exists and the minimization thus has the unique solution $$\omega_{t+1} = (A + \frac{1}{\sigma} B(x_t))^{-1} \cdot (A \omega_t + \frac{1}{\sigma}\phi(x_t)y_t).$$ For fast adaptation, a low stiﬀness is required. But the above update is not deﬁned in the case of $\sigma = 0$. Hence, for a vanishing stiﬀness the update is deﬁned by the limit, as the stiﬀness $\sigma$ approaches zero. It is given by $$\lim_{\sigma \to 0} \omega_{t+1} = \omega_t + \frac{A^{−1}}{\phi(x_t)^TA^{−1}\phi(x_t)} \phi(x_t) \cdot (y_t − \omega_t^T \phi(x_t)).$$ This results in a full reproduction of the current example after parameter adaptation, i. e. a zero loss, and allows for a quick adaptation to new data.

Limitations ↑

There are some limitations to keep in mind for IRMA:

The input space $X$ needs to be finite (or at least the integrals of all $\phi_i$ over $X$) and pre-defined in order to calculate the integrals.
In most practical applications this limitation is easily fulfilled and extrapolation into an arbitrary input space comes with its own set of problems.
The algorithm has a runtime complexity of $O(n^2)$ and requires $O(n^2)$ amount of memory rather than $O(n)$ for the simple Gradient Descent.
Here $n$ is the number of parameters in the LIP model and thus typically doesn't pose a problem on modern machines. It can be a limiting factor however for cheap embedded systems.
The initial values of the parameters preset the functional relation the model starts with and thus inform the regularization.
Similar to the case of Gradient Descent, this can have a significant impact on how quickly the algorithm is able to learn. However, in case of IRMA, this initialization can be easily viewed as prior knowledge that a developer can provide at design time. And, in the incremental learning scenario this usually is significantly safer than a random initialization anyway.

What's next? ↑

So far, I've only applied the IRMA approach to LIP models. But there are many other interesting non-linear model structures that could benefit from this approach, though a closed form solution might not be feasible. It would be interesting to apply the approach to neural networks and investigate if a solution can be found through some adapted form of back-propagation. Furthermore, evaluating non-fixed model structures like RBF networks that allow to either adapt parameters of existing RBFs or adding a new RBF would be interesting to see how IRMA could help guide between those two choices.

I originally published the source code as a Matlab library along with many other learning algorithms as the UOSLIB. This library is primarily geared at fundamental research and less useful for applications. To make IRMA more easily available to practitioners, it would be great to add it to widely used machine learning toolkits like scikit-learn.

If anyone is interested in supporting (one of) these efforts, feel free to reach out!

Further Reading ↑

IRMA - Incremental Risk Minimization Algorithm

Over the course of my PhD, I published (together with collaborators) several papers on IRMA and a second order version. Ultimately, this research culminated in my PhD thesis.

Buschermoehle, A.:
Reliable On-line Machine Learning for Regression Tasks in Presence of Uncertainties.
Doctoral Thesis, 2014,

Buschermoehle, A.; Brockmann, W.:
On-line Learning with Minimized Change of the Global Mapping – Optimized Local Learning by Incremental Risk Minimization.
In: Evolving Systems : Springer, 2014, pp.131–151,

Buschermoehle, A.; Brockmann, W.:
Reliable Localized On-line Learning in Non-stationary Environments.
In: International Conference on Evolving and Adaptive Intelligent Systems (EAIS), Linz : IEEE Press, 2014, pp.1-7,

Buschermoehle, A.; Brockmann, W.:
Stable On-line Learning with Optimized Local Learning, but Minimal Change of the Global Output.
In: International Conference on Machine Learning and Applications (ICMLA), Miami : IEEE Press, 2013, pp.21-27,

Buschermoehle, A.; Huelsmann, J.; Brockmann, W.:
UOSLIB – A Library for Analysis of Online-Learning Algorithms.
In: Hoffmann, E.; Hüllermeier, E. (Hrsg.): Proc. 23. Workshop Computational Intelligence, Dortmund: KIT Scientific Publishing, 2013, pp.355-369,

Buschermoehle, A.; Schoenke, J.; Rosemann, N.; Brockmann, W.:
The Incremental Risk Functional: Basics of a Novel Incremental Learning Approach.
In: Proc. 2013 IEEE Conference on Systems Man and Cybernetics - SMC2013, Manchester: IEEE Press, 2013, pp.1500-1505,

Gradient Descent and other Classic Approaches

There is a rich literature on incremental (or sometimes called on-line) learning. A lot of the methods also borrow from filter theory and/or control theory. This is just a short overview of some interesting publications in the field - ordered by year of publication.

Crammer K, Lee DD:
Learning via gaussian herding.
In: Advances in Neural Information Processing Systems 23, 2010, pp.451–459,

Crammer K, Kulesza A, Dredze M, et al:
Adaptive regularization of weight vectors.
In: Advances in Neural Information Processing Systems 22, 2009, pp.414–422,

Crammer K, Fern MD, Pereira O:
Exact convex conﬁdence-weighted learning.
In: Advances in Neural Information Processing Systems 21, 2008, pp.345–352,

Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y:
Online passive-aggressive algorithms.
In: Journal of Machine Learning Research 7, 2006, pp.551–585,

De Vito E, Rosasco L, Caponnetto A, De Giovannini U, Odone F:
Learning from examples as an inverse problem.
In: Journal of Machine Learning Research 6, 2006, pp.883–904,

Cesa-Bianchi N:
Prediction, learning, and games.
Cambridge University Press, Cambridge, UK, 2006,

Farrell J, Polycarpou M:
Adaptive approximation based control.
Wiley-Blackwell, Hoboken, NJ, USA, 2006,

Cesa-Bianchi N, Conconi A, Gentile C:
A second-order perceptron algorithm.
In: Journal on Computing 34(3), 2005, pp.640–668,

Gentile C:
The robustness of the p-norm algorithms.
In: Machine Learning 53(3), 2003, pp.265–299,

Haykin S:
Adaptive Filter Theory, 3rd edn.
Prentice-Hall, Englewood Cliﬀs, NJ, USA, 2002,

Chen MS, Yen JY:
Application of the least squares algorithm to the observer design for linear time-varying systems.
In: Trans. on Automatic Control 44(9), 1999, pp.1742–1745,

Battiti R:
First- and second-order methods for learning: Between steepest descent and Newton’s method.
In: Neural Computation 4(2), 1992, pp.141–166,

Hunt KJ:
A survey of recursive identiﬁcation algorithms.
In: Trans. of the Inst. of Measurement and Control 8(5), 1986, pp.273–278,

Goodwin G, Teoh E, Elliott H:
Deterministic convergence of a self-tuning regulator with covariance resetting.
In: IEE Proc. D on Control Theory and Applications, IET, vol 130, 1983, pp.6–8,

Bregman LM:
The relaxation method of ﬁnding the common point of convex sets and its application to the solution of problems in convex programming.
In: USSR Computational Mathematics and Mathematical Physics 7(3), 1967, pp.200–217,

Contact: andbusch@uni-osnabrueck.de