Rubner-Tavan's network

In Chapter 5, EM Algorithm and Applications, we said that any algorithm that decorrelates the input covariance matrix is performing a PCA without dimensionality reduction. Starting from this approach, Rubner and Tavan (in the paper A Self-Organizing Network for Principal-Components Analysis, Rubner J., Tavan P., Europhysics Letters, 10(7), 1989) proposed a neural model whose goal is to decorrelate the output components in order to force the consequent decorrelation of the output covariance matrix (in the lower-dimensional subspace). Assuming a zero-centered dataset (hence E[y] = 0), the output covariance matrix for m principal components is as follows:
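
C_y = E[yy^T]

The generic element of this matrix is E[y_i y_j]; when the outputs represent m decorrelated components, C_y must therefore become approximately diagonal:

C_y ≈ diag(E[y_1²], E[y_2²], ..., E[y_m²])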

Hence, it's possible to achieve an approximate decorrelation by forcing the terms y_i y_j with i ≠ j to become close to zero. The main difference from a standard approach (such as whitening or vanilla PCA) is that this procedure is local, while all the standard methods operate globally, working directly with the covariance matrix. The neural model proposed by the authors is shown in the following diagram (the original model was proposed for binary units, but it also works quite well with linear ones):

Rubner-Tavan network. The connections v_jk are based on the anti-Hebbian rule

The network has m output units, and the last m-1 neurons have a summing node that receives the weighted output of the previous units (hierarchical lateral connections). The dynamics are simple: the first output isn't modified. The second one is forced to become decorrelated with the first one. The third one is forced to become decorrelated with both the first and the second one, and so on. This procedure must be iterated a number of times because the inputs are presented one by one, and the cumulative term that appears in the correlation/covariance matrix (it's always easier to zero-center the dataset and work with the correlation matrix) must be implicitly split into its addends. It's not difficult to understand that the convergence to the only stable fixed point (which has been proven to exist by the authors) requires some iterations to correct the wrong output estimations.

The output of the network is made up of two contributions:
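
y(i) = Σ_{j=1..n} w_ji x(j) + Σ_{k=1..i-1} v_ik y(k)

For i = 1, the second sum is empty; hence, the first output depends only on the input.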

The notation y(i) (or x(i)) indicates the ith element of y (or x). The first term produces a partial output based only on the input, while the second one uses the hierarchical lateral connections to correct the values and enforce the decorrelation. The internal weights w_ij are updated using the standard version of Oja's rule (this is mainly responsible for the convergence of each weight vector to the first principal component):
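
Δw_ij = ηy(j)(x(i) - y(j)w_ij)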

Instead, the external weights v_jk are updated using an anti-Hebbian rule:
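
Δv_jk = -ηy(j)y(k) - ηy(j)²v_jk = -ηy(j)(y(k) + y(j)v_jk)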

The previous formula can be split into two parts: the first term, -ηy(j)y(k), acts in the opposite direction of a standard version of Hebb's rule (that's why it's called anti-Hebbian) and forces the decorrelation. The second one, -ηy(j)²v_jk, acts as a regularizer and is analogous to Oja's rule. The term -ηy(j)y(k) works as a feedback signal for Oja's rule, which readapts the updates according to the new magnitude of the actual output. In fact, after modifying the lateral connections, the outputs are also forced to change, and this modification impacts on the update of w_ij. When all the outputs are decorrelated, the vectors w_i are implicitly obliged to be orthogonal. It's possible to imagine an analogy with Gram-Schmidt orthogonalization, even if in this case the relationship between the extraction of different components and the decorrelation is more complex. Like Sanger's network, this model extracts the first m principal components in descending order (the reason is the same one that has been intuitively explained), but for a complete (non-trivial) mathematical proof, please refer to the aforementioned paper.

If the input dimensionality is n and the number of components is equal to m, it's possible to use a lower-triangular matrix V (m × m) with all the diagonal elements set to 0 and a standard matrix for W (n × m).

The structure of W is as follows:
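
W = ( w_1 | w_2 | ... | w_m ), with w_i = (w_1i, w_2i, ..., w_ni)^T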

Therefore, w_i is a column vector that must converge to the corresponding eigenvector. The structure of V is instead:
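
V = | 0      0      0     ...  0 |
    | v_21   0      0     ...  0 |
    | v_31   v_32   0     ...  0 |
    | ...    ...    ...   ...  ... |
    | v_m1   v_m2   v_m3  ...  0 |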

Using this notation, the output becomes as follows:
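
y = W^T x + Vy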

As the output is based on recurrent lateral connections, its value must be stabilized by iterating the previous formula for a fixed number of times or until the norm between two consecutive values becomes smaller than a predefined threshold. In our example, we use a fixed number of iterations equal to five. The update rules cannot be written directly in matrix notation, but it's possible to use the vectors w_i (columns) and v_j (rows):
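
Δw_i = ηy(i)(x - y(i)w_i)

Δv_j = -ηy(j)(y + y(j)v_j)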

In this case, y(i) means the ith component of y. The two matrices must be populated with a loop.

The complete algorithm for the Rubner-Tavan network is as follows (the dimensionality of x is n, the number of components is denoted by m, and a NumPy sketch of the whole procedure is shown right after the list):

  1. Initialize W(0) randomly. The shape is (n × m).
  2. Initialize V(0) randomly. The shape is (m × m).
  3. Set V(0) = Tril(V(0)). Tril(•) transforms the input argument into a lower-triangular matrix.
  4. Set all diagonal components of V(0) equal to 0.
  5. Set the learning_rate η (for example, 0.001).
  6. Set a threshold Thr (for example, 0.0001).
  7. Set a cycle counter T=0.
  8. Set a maximum number of iterations max_iterations (for example, 1000).
  9. Set a number of stabilization_cycles (for example, 5):
    1. While ||W(t) - W(t-1)||_F > Thr and T < max_iterations:
      1. Set T = T + 1.
      2. For each x in X:
        1. Set y_prev to zero. The shape is (m, 1).
        2. For i=1 to stabilization_cycles:
          1. y = W^T x + Vy_prev.
          2. y_prev = y.
        3. Compute the updates for W and V:
          1. Create two empty matrices ΔW (n × m) and ΔV (m × m)
          2. For i=1 to m:
            1. Δw_i = ηy(i)(x - y(i)w_i)
            2. Δv_i = -ηy(i)(y + y(i)v_i)
          3. Update W and V:
            1. W(t+1) = W(t) + ΔW
            2. V(t+1) = V(t) + ΔV
          4. Set V = Tril(V) and set all the diagonal elements to 0
          5. Set W(t+1) = W(t+1) / ||W(t+1)||(columns) (the norm must be computed column-wise)
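
The procedure above can be translated almost line by line into NumPy. The following is a minimal sketch, not a reference implementation: the function name rubner_tavan_pca, the default parameter values, and the synthetic dataset are purely illustrative, and the only assumption is a zero-centered dataset X with shape (number of samples × n):

```python
import numpy as np

def rubner_tavan_pca(X, n_components=2, eta=0.001, threshold=0.0001,
                     max_iterations=1000, stabilization_cycles=5):
    # X is assumed to be zero-centered, with shape (n_samples, n)
    n = X.shape[1]
    m = n_components

    # Random initialization of W (n x m) and V (m x m, lower-triangular, null diagonal)
    W = np.random.uniform(-0.01, 0.01, size=(n, m))
    V = np.tril(np.random.uniform(-0.01, 0.01, size=(m, m)))
    np.fill_diagonal(V, 0.0)

    W_prev = np.zeros_like(W)
    t = 0

    while np.linalg.norm(W - W_prev, ord='fro') > threshold and t < max_iterations:
        t += 1
        W_prev = W.copy()

        for x in X:
            x = x.reshape(n, 1)

            # Stabilize the recurrent output y = W^T x + Vy_prev
            y = np.zeros((m, 1))
            for _ in range(stabilization_cycles):
                y = W.T @ x + V @ y

            # Oja-like update for the feed-forward weights (columns of W)
            # and anti-Hebbian update for the lateral weights (rows of V)
            dW = np.zeros((n, m))
            dV = np.zeros((m, m))

            for i in range(m):
                dW[:, i] = eta * y[i] * (x[:, 0] - y[i] * W[:, i])
                dV[i, :] = -eta * y[i] * (y[:, 0] + y[i] * V[i, :])

            W += dW
            V += dV

            # Keep V lower-triangular with a null diagonal
            V = np.tril(V)
            np.fill_diagonal(V, 0.0)

            # Column-wise normalization of W
            W /= np.linalg.norm(W, axis=0)

    return W, V


# Example usage on a synthetic zero-centered dataset: the columns of W converge
# to (approximations of) the first m principal components, in descending order
np.random.seed(1000)
X = np.random.multivariate_normal(mean=[0.0, 0.0, 0.0],
                                  cov=[[2.0, 0.5, 0.1],
                                       [0.5, 1.0, 0.2],
                                       [0.1, 0.2, 0.5]],
                                  size=500)
X -= np.mean(X, axis=0)
W, V = rubner_tavan_pca(X, n_components=2)
```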

In this case, we have adopted both a threshold and a maximum number of iterations because this algorithm normally converges very quickly. Moreover, I suggest that the reader always check the shapes of vectors and matrices when performing dot products.

In this example, as well as in all the other ones, the NumPy random seed is set equal to 1000 (np.random.seed(1000)). Using different values (or repeating the experiments several times without resetting the seed) can lead to slightly different results (which are, however, always coherent).