Selective Learning Algorithm for Radial Basis Neural Networks
As it has been previously seen, the time series behaviour can be captured by expressing the value x(k+1) as a function of the d previous values of the time series, x(k),...,x(k-d); that is, the time series can be modeled as a function F:
x(k+1) = F(x(k),...,x(k-d))
Thus, a way of predicting time series consists of approximating the function F using artificial neural networks. This can be done in two ways:
Constructing a global description of the target function using the whole available training examples.
Constructing a local description of the target funcion each time a new prediction must be done. This local description is built using only the most appropriate examples.
The method presented in this section follows the second approach, that is, it constructs local models for each prediction that must be done. Because the computational effort of the second approach is bigger, Radial Basis Neural Networks (RBNN) are used since their training phase is considerably faster than Multilayer Perceptrons's. Because making a new prediction of the time series is equivalent to estimate the function value (F) of a novel pattern, this method will be explained as an approximation function task.
The method consists of selecting, from the whole training data, an appropriate subset of training patterns in order to improve the answer of the network for a novel pattern. Afterwards, the RBNN is trained using this new subset of selected data. The general idea for the pattern selection is to include -once or more times- those patterns close -in terms of the Euclidean distance and some weighting measure- to the novel sample. Thus, the network is trained with the most useful information, discarding those patterns that not only do not provide any knowledge to the network, but might confuse the learning process.
A n-dimensional sphere is centered at the test pattern, in order to select only those patterns placed into this sphere. Its radius -named r- is a threshold distance, since all the training patterns whose distance to the novel sample is bigger than r will be discarded. Distances may have very different magnitudes depending on the problem domains, due to their different data values and number of attributes. It may happen that for some domains the maximum distance between patterns is many times the maximum distance between patterns for other domains. In order to make the method independent to this fact, both the sphere radius and the training patterns distances will be relative respect to the maximum distance to the test pattern. Thus, the relative threshold distance or relative radius, rr, will be used to select the training patterns situated into the sphere centered at the test pattern, being rr a parameter that must be established before the application of the learning algorithm.
Let us consider q an arbitrary novel pattern described by an n-dimensional vector q=(q1,..., qn), where qi represents the attributes of the instance q. Let X be the whole available training data set:
where xk are the input patterns and yk their respective target outputs. When a new sample q must be predicted, the RBNN is trained with a subset, which is named Xq, from the whole training data X. The steps to select the training set Xq are the following:
A real value, dk, is associated to each training pattern (xk, yk). That value is defined in terms of the standard Euclidean distance from the pattern q to each input training pattern. More precisely, it is defined as:
As it was previously mentioned, in order to make the method independent on the distances magnitude, relative distances must be used. Thus, a relative distance, drk is calculated for each training pattern. Let dmax be the maximum distance to the novel pattern, this is: dmax= max(d1, d2, ..., dN). Then, the relative distance is given by:
A weighting function or kernel function K() is used to calculate a weight for each training pattern from its distance to the test pattern. The maximum value of the kernel function must be given at zero distance and the function should decrease smoothly as distance increases. The kernel function used in the method is the inverse function K(d)=1/d. An example of this function can be seen in the next figure where the x-axis represents the patterns in a one-dimensional domain, being the value of the test pattern 2.25.
Thus, the result of evaluating the inverse function of the relative distance calculated at equation (1) is associated to each training pattern:
These values K(xk) are normalized in such a way that the sum of them equals the number of training patterns in X. The normalized values, named as Kn(xk), are obtained by:
In order to simplify the notation, Kn(xk) will be named fnk, normalized frequency.
Both the relative distance drk calculated in step 2 and the relative frequency fnk calculated in step 3 are used to decide whether the training pattern (xk, yk) is selected, and -if the pattern is selected- how many times is included in the training subset. Hence, they are used to generate a natural number, nk, following the next rule:
At this point, each training pattern in X has an associated natural number, nk, which indicates how many times the pattern (xk, yk) will be used to train the RBNN when the new instance q is reached. If the pattern is selected, nk > 0 otherwise nk=0.
A new training pattern subset associated to the novel pattern q, Xq, is built up. Given a pattern (xk, yk) from the original training set X, that pattern is included in the new subset if the value nk is higher than zero. In addition, the pattern (xk, yk) is placed nk times randomly in the training set Xq. Once the training patterns are selected, the RBNN is trained with the new subset of patterns, Xq. As usually, training a RBNN involves to determine the centers, the dilations or widths, and the weights. The centers are calculated in an unsupervised way using the K-means algorithm in order to classify the input space formed by all the training patterns included in the subset Xq. The K-means algorithm initialization has been modified in order to avoid the situation where many classes have no patterns at all. Thus, the initial values of the centers are set in the following way:
Mq, the centroid of the set Xq, is evaluated.
k centers (c1q , c2q, ....ckq) are randomly generated, such as ||cjq-Mq||< e, j=1,2,..., k , where e is a very small real number.
In this way, the k centers expand from their initial positions around the centroid Mq, and if the number of centers is smaller than the number of patterns, no classes will remain empty when k-means finishes. Once the neurones centers have been calculated, the neurones dilations or widths are evaluated as the geometric mean of the distances from each neuron center to its two nearest centers. Lets di the width and Ci the center of the ith neurone:
where Ct and Cs are the two nearest centers to center Ci.
Finally, the weights of the RBNN are estimated in a supervised way to minimize the mean square error E measured in the training subset Xq:
where R is the number of patterns in Xq and er is the error committed by the network for the pattern xr, given by: