On the Kernel Extreme Learning Machine speedup

On the Kernel Extreme Learning Machine speedup Alexandros Iosifidis and Moncef Gabbouj

Abstract—In this paper, we describe an approximate method for reducing the time and memory complexities of the kernel Extreme Learning Machine variants. We show that, by adopting a Nyström-based kernel ELM matrix approximation, we can define an ELM space exploiting properties of the kernel ELM space that can subsequently be used to apply several optimization schemes proposed in the literature for ELM network training. The resulting ELM network can achieve good performance, comparable to that of its standard kernel ELM counterpart, while overcoming the time and memory restrictions that render the application of kernel ELM algorithms in large-scale learning problems prohibitive. Index Terms—Kernel Extreme Learning Machine, Nyström Approximation.

I. INTRODUCTION

Extreme Learning Machine (ELM) ([1]) is a Single-hidden Layer Feedforward Neural (SLFN) network training algorithm that has been proposed as an alternative to gradient descent network training approaches, e.g. Backpropagation ([2]). The main idea of ELM is that the network hidden layer weights and bias values can be randomly assigned, leading to random nonlinear data mappings to a high-dimensional feature space (the so-called ELM space). Similar approaches have also been shown to be efficient in several neural network training methods ([3], [4], [5], [6], [7]), as well as in other learning processes ([8]). With a large enough number of hidden layer neurons, it is expected that the problem to be solved becomes easier in the ELM space and can be solved by using linear techniques, like ridge regression ([9]). It has also been proven that SLFN networks trained by the ELM algorithm have the properties of global approximators ([10], [11], [12], [13]). Due to its effectiveness and its fast learning process, ELM has been adopted in many problems and several extensions of the algorithm have been proposed, each highlighting different properties of ELM networks ([14], [10], [15], [9], [16], [17]). Among the various extensions of ELM, kernel formulations have been proposed ([9], [18]). It has been demonstrated that in small-scale and medium-scale learning problems, kernel ELMs outperform ELM formulations exploiting random hidden layer parameters. However, this superiority in performance comes with a higher computational cost that renders the application of kernel ELMs in large-scale learning problems prohibitive. This is due to the fact that kernel ELMs require the calculation of the so-called kernel ELM matrix K ∈ R^{N×N} (N is the cardinality of the training set) and its inverse. This process has a time complexity of the order of O(N^3) and a memory complexity equal to O(N^2). As N gets larger, both the time and memory complexities of kernel ELMs become

A. Iosifidis and M. Gabbouj are with the Department of Signal Processing, Tampere University of Technology, P. O. Box 553, FIN-33720 Tampere, Finland. e-mail: {alexandros.iosifidis,moncef.gabbouj}@tut.fi

prohibitive. In order to reduce the time complexity of the standard kernel ELM algorithm, ([19]) uses randomly selected training data in order to form a sub-matrix of K which is used to derive an approximate solution. However, the method is restricted to standard ELM only. Recently, it has been shown that the time complexity of kernel ELM can be reduced by exploiting a randomized approximation approach ([20]). Specifically, after the calculation of the kernel matrix K, a (low-rank) approximate kernel matrix K̃ ≈ K is obtained by exploiting a random orthogonal matrix Ω ∈ R^{N×n}, where n < N. By using such an approach, the kernel matrix inversion step can be highly accelerated, since one can calculate the inversion of the low-rank approximate kernel matrix K̃ instead of that of the original kernel matrix K. However, this process adds an overhead to the overall memory complexity of kernel ELM. Since kernel ELMs require the use of the entire kernel ELM matrix, their application in large-scale problems is still prohibitive (even by using the approximate method in ([20])). In this paper, we show that by exploiting a Nyström-based kernel ELM matrix approximation approach both the above-described time and memory restrictions can be appropriately addressed. We will show that, by following ([21]), we do not need to calculate (and store) the entire kernel ELM matrix. Instead, we can calculate a small sub-matrix of the kernel ELM matrix and exploit it in order to proceed with the (approximate) kernel matrix inversion step. In this way, both the time and memory complexities of the kernel ELM are highly reduced, making its application in large-scale learning problems feasible. In addition, we show that we can exploit Nyström-based approximation in order to determine a (relatively low-dimensional, when compared to N) ELM space that keeps most of the information of the kernel ELM space.
Using the derived ELM space, we can directly derive approximate kernel formulations for several optimization schemes proposed in the literature for ELM network training and apply them in large-scale problems. The remainder of the paper is structured as follows. We provide an overview of related work in Section II. Our method for training an approximate kernel ELM network is described in Section III. Experiments conducted in order to illustrate its efficiency and effectiveness on publicly available classification problems are provided in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORK

In this section, we briefly describe the ELM, regularized ELM (RELM), kernel ELM (kELM) and Graph Embedded ELM (GEELM) algorithms proposed in ([1], [9], [18]), respectively. Subsequently, we describe the randomized approximate


method proposed in ([20]) for kernel ELM speedup and describe a variation of the algorithm that can be used for GEELM approximation.

Let us denote by x_i ∈ R^D, i = 1, ..., N the training vectors and by c_i ∈ {1, ..., C} the corresponding class labels, which will be used in order to train a SLFN network. The network consists of D input (equal to the dimensionality of x_i), L hidden and C output (equal to the number of classes involved in the classification problem) neurons. The number of hidden layer neurons is usually selected to be much greater than the number of classes, i.e., L ≫ C ([9], [16]). In practice, in order to achieve good performance, the number of hidden layer neurons is of the same order as the cardinality of the training set. The elements of the network target vectors t_i = [t_i1, ..., t_iC]^T, each corresponding to a training vector x_i, are set to t_ik = 1 for vectors belonging to class k, i.e., when c_i = k, and to t_ik = −1 when c_i ≠ k. In ELM-based approaches, the network input weights W_in ∈ R^{D×L} and the hidden layer bias values b ∈ R^L are randomly assigned, while the network output weights W_out ∈ R^{L×C} are analytically calculated. Let us denote by q_j, w_k, w_kj the j-th column of W_in, the k-th column of W_out and the j-th element of w_k, respectively. Given an activation function Φ(·) for the network hidden layer and using a linear activation function for the network output layer, the response o_i = [o_i1, ..., o_iC]^T of the network corresponding to x_i is calculated by:

o_ik = Σ_{j=1}^{L} w_kj Φ(q_j, b_j, x_i),  k = 1, ..., C.   (1)
It has been shown that almost any nonlinear piecewise continuous activation function Φ(·) can be used for the calculation of the network hidden layer outputs, e.g. the sigmoid, sine, Gaussian, hard-limiting, Radial Basis Function (RBF), RBF-χ2, Fourier series, etc. ([10], [9], [22]). By storing the network hidden layer outputs φ_i ∈ R^L corresponding to all the training vectors x_i, i = 1, ..., N in a matrix Φ = [φ_1, ..., φ_N], equation (1) can be expressed in matrix form as O = W_out^T Φ, where O ∈ R^{C×N} is a matrix containing the network responses for all training data x_i.

A. Extreme Learning Machine

ELM assumes zero training error ([1]), by assuming that o_i = t_i, i = 1, ..., N, or in matrix notation O = T, where T = [t_1, ..., t_N] is a matrix containing the network target vectors. The network output weights W_out can be analytically calculated by W_out = (ΦΦ^T)^{-1} Φ T^T. After the calculation of the network output weights W_out, the network response for a vector x_l ∈ R^D is given by o_l = W_out^T φ_l, where φ_l is the network hidden layer output for x_l.

B. Regularized and kernel Extreme Learning Machine

A regularized version of the ELM algorithm that allows small training errors and tries to minimize the norm of the network output weights W_out has been proposed in ([9]), where
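To make the training and prediction steps above concrete, the following is a minimal numpy sketch of standard ELM. All sizes (N, D, L, C), the sigmoid activation, and the random data are hypothetical choices for illustration only, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: N training vectors in R^D, C classes (hypothetical sizes).
N, D, L, C = 200, 5, 50, 3
X = rng.standard_normal((D, N))                 # columns are the x_i
labels = rng.integers(0, C, size=N)
T = -np.ones((C, N))                            # target matrix, t_ik = +/-1
T[labels, np.arange(N)] = 1.0

# Randomly assigned input weights and biases (the defining ELM step).
W_in = rng.standard_normal((D, L))
b = rng.standard_normal((L, 1))

def hidden(X):
    """Hidden-layer outputs: a sigmoid activation, one of the valid choices."""
    return 1.0 / (1.0 + np.exp(-(W_in.T @ X + b)))   # L x N

Phi = hidden(X)
# Zero-error ELM solution: W_out = (Phi Phi^T)^{-1} Phi T^T (pinv for stability).
W_out = np.linalg.pinv(Phi @ Phi.T) @ Phi @ T.T      # L x C

# Network response for a test vector x_l: o_l = W_out^T phi_l
x_l = rng.standard_normal((D, 1))
o_l = W_out.T @ hidden(x_l)
print(o_l.shape)  # (3, 1)
```

The predicted class for x_l would be taken as the index of the largest entry of o_l.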

the network output weights are calculated by minimizing:

J_RELM = (1/2) ||W_out||_F^2 + (λ/2) Σ_{i=1}^{N} ||ξ_i||_2^2   (2)

s.t.:  W_out^T φ_i = t_i − ξ_i,  i = 1, ..., N,   (3)

where ξ_i ∈ R^C is the error vector corresponding to x_i and λ > 0 is a parameter denoting the importance of the training error in the optimization problem. Based on the Karush-Kuhn-Tucker (KKT) theorem ([23]), the network output weights W_out can be determined by solving the dual optimization problem:

J_{D,RELM} = (1/2) ||W_out||_F^2 + (λ/2) Σ_{i=1}^{N} ||ξ_i||_2^2 − Σ_{i=1}^{N} a_i^T (W_out^T φ_i − t_i + ξ_i),   (4)

where a_i are the corresponding Lagrange multipliers. By calculating the derivatives of J_{D,RELM} with respect to W_out, ξ_i and a_i and setting them equal to zero, the network output weights W_out are obtained by:

W_out = (ΦΦ^T + (1/λ) I)^{-1} Φ T^T,   (5)

or

W_out = Φ (K + (1/λ) I)^{-1} T^T = Φ A,   (6)

where K ∈ R^{N×N} is the ELM kernel matrix, having elements equal to [K]_{i,j} = φ_i^T φ_j ([24]), and A ∈ R^{N×C} is a matrix expressing the network output weights as a linear combination of the training data representations in the kernel ELM space. By using (6), the network response for a given vector x_l ∈ R^D is given by:

o_l = W_out^T φ_l = A^T Φ^T φ_l = A^T k_l,   (7)

where k_l ∈ R^N is a vector having its elements equal to k_{l,i} = φ_i^T φ_l, i = 1, ..., N.
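The primal solution (5) and the dual/kernel solution (6) are algebraically equivalent (a push-through identity), which can be checked numerically with the linear ELM kernel K = Φ^T Φ. The sizes, targets, and λ below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, C, lam = 100, 20, 3, 10.0      # hypothetical sizes and lambda

Phi = rng.standard_normal((L, N))    # hidden-layer outputs for the training set
T = rng.choice([-1.0, 1.0], size=(C, N))

# Primal solution (5): W_out = (Phi Phi^T + (1/lambda) I)^{-1} Phi T^T
W_primal = np.linalg.solve(Phi @ Phi.T + np.eye(L) / lam, Phi @ T.T)

# Dual / kernel solution (6): W_out = Phi (K + (1/lambda) I)^{-1} T^T = Phi A
K = Phi.T @ Phi                                    # linear ELM kernel matrix
A = np.linalg.solve(K + np.eye(N) / lam, T.T)      # N x C
W_dual = Phi @ A

# Both routes yield the same output weights (up to numerical error).
print(np.allclose(W_primal, W_dual))  # True
```

The primal route inverts an L×L matrix, the dual route an N×N matrix, which is exactly why kernel ELM becomes expensive when N is large.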

C. Graph Embedded Extreme Learning Machine

A regularized version of ELM that exploits graph structures in the ELM space has been proposed in ([18]), where the network output weights are calculated by minimizing:

J_GEELM = (1/2) ||W_out||_F^2 + (λ/2) Σ_{i=1}^{N} ||ξ_i||_2^2 + (μ/2) tr(W_out^T S W_out)   (8)

s.t.:  W_out^T φ_i − t_i = ξ_i,  i = 1, ..., N,   (9)

where S = S_p^{-1} S_i is a matrix expressing both intrinsic and penalty subspace learning criteria and μ > 0 is a trade-off parameter between the two regularization terms of W_out. Graph Embedded ELM (GEELM) assumes that the training data representations in the ELM space are embedded in a graph expressing intra-class relationships and a second one expressing between-class relationships, described in matrices S_i ∈ R^{L×L} and S_p ∈ R^{L×L}, respectively. By substituting the constraints (9) in J_GEELM and setting the derivative of J_GEELM with respect to W_out equal to zero, W_out is given


by:

W_out = (ΦΦ^T + (1/λ)(I + μS))^{-1} Φ T^T.   (10)
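A minimal sketch of the GEELM solution (10). The paper builds S from LDA-style graph matrices; here S_i and S_p are random symmetric positive-definite stand-ins, and all sizes and parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, C = 80, 15, 2
lam, mu = 5.0, 1.0                      # hypothetical regularization parameters

Phi = rng.standard_normal((L, N))       # hidden-layer outputs
T = rng.choice([-1.0, 1.0], size=(C, N))

# Toy graph-regularization matrix S = Sp^{-1} Si, built from two random
# symmetric positive-definite matrices (stand-ins for the LDA-style graphs).
M1 = rng.standard_normal((L, L)); Si = M1 @ M1.T + np.eye(L)
M2 = rng.standard_normal((L, L)); Sp = M2 @ M2.T + np.eye(L)
S = np.linalg.solve(Sp, Si)             # Sp^{-1} Si

# GEELM output weights, equation (10):
W_out = np.linalg.solve(Phi @ Phi.T + (np.eye(L) + mu * S) / lam, Phi @ T.T)

# Setting mu = 0 recovers the regularized ELM solution (5), as noted in the text.
W_relm = np.linalg.solve(Phi @ Phi.T + np.eye(L) / lam, Phi @ T.T)
```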

GEELM has also been extended by exploiting kernel formulations. In this case the reconstruction weights matrix A is given by:

A = (K + (1/λ)(I + μL))^{-1} T^T,   (11)

where L = (L_p)† L_i ∈ R^{N×N}, L_p is the graph Laplacian matrix of the penalty graph, L_i is the graph Laplacian matrix of the intrinsic graph and (·)† denotes the pseudo-inverse of a matrix. Here we should note that GEELM can be considered as a generalization of several ELM variants. For example, the standard ELM ([1]) is a special case for λ = 0, regularized ELM ([9]) is a special case for μ = 0, the Minimum Variance ELM ([16], [22]) is a special case for S_p = I and S_i = S_w (or S_i = S_T), and Discriminative Graph Regularized ELM ([25]) is a special case for S_p = I and L_i set to be the Laplacian of the k-NN graph.

D. Kernel ELM speedup

As discussed above, the infeasibility of kernel ELMs in large-scale problems stems from the requirement to store and invert an N×N matrix. In order to alleviate the computational burden of the matrix inversion step, the randomized low-rank approximation approach of ([26]) has been proposed in ([20]) for kernel ELM speedup. The processing steps of the method in ([20]) are the following:
• Calculation of the ELM kernel matrix K ∈ R^{N×N}.
• Determination of a random (Gaussian) matrix Ω ∈ R^{N×n}, where n < N.
• Random mapping of the ELM kernel matrix, i.e. Y = KΩ.
• Decomposition Y = QR, where Q ∈ R^{N×n} is an orthogonal matrix.
• Calculation of the matrix B = Q^T K Q ∈ R^{n×n}.
• Eigen-decomposition B = Û D Û^T.
• Approximation of K using the low-rank approximate matrix K̃ = G G^T, where G = Q Û D^{1/2}.
• Calculation of A using the Woodbury formula:

A ≈ (K̃ + (1/λ) I)^{-1} T^T = λ [ I − G ((1/λ) I + G^T G)^{-1} G^T ] T^T.   (12)

Clearly, from the above steps we can see that the time complexity of the kernel ELM can be reduced from O(N^3) to O(nN^2 + n^3). The memory complexity of the algorithm however increases, since we need to calculate the matrices Ω, Y, B (and its decomposition) in addition to the kernel ELM matrix K. By observing (11) and (12) we can see that, by approximating the matrix K + (μ/λ)L instead of K, the above-described algorithm can be used for the calculation of an approximate solution of the kernel GEELM algorithm. However, in the general case where both intrinsic and penalty graphs are used, such an approach will not lead to computational or memory improvements, because the calculation of L requires the inversion of L_p. In the following section, we describe a low-rank approximation method that exploits Nyström-based approximation in order to overcome these limitations.

III. NYSTRÖM-BASED KERNEL ELM SPEEDUP

Let us assume that we would like to determine an n-rank approximation of K ∈ R^{N×N}. This can be done by calculating its eigen-decomposition K = U Λ U^T and keeping n of its largest eigenvalues (and the corresponding eigenvectors), i.e. K ≈ K̃ = U_{(n)} Λ_n U_{(n)}^T. However, the calculation, storage and eigen-decomposition of K are prohibitive for large values of N. In order to determine an n-rank approximation of K without requiring its calculation and eigen-decomposition, we can follow the method in ([21]). Let us denote by K_{Nn} ∈ R^{N×n} the sub-matrix of K that is formed by the columns corresponding to n samples. In addition, let us denote by K_{nn} the sub-matrix of K that is formed by the intersection of the columns and rows corresponding to the n selected samples. By applying eigen-decomposition on K_{nn} we obtain:

K_{nn} = U_n Λ_n U_n^T.   (13)

The matrix containing the n leading eigenvectors of K can be approximated by the Nyström extension ([21]) as:

U_{(n)} ≈ K_{Nn} U_n Λ_n^{-1}.   (14)

An n-rank approximation of K is then given by exploiting (13) and (14):

K ≈ K̃ = K_{Nn} K_{nn}^{-1} K_{Nn}^T = G G^T,   (15)

where G = K_{Nn} K_{nn}^{-1/2}. Thus, by exploiting (15) and the Woodbury formula, the matrix A can be calculated using (12). The above-described process requires the calculation of K_{Nn}, having time and memory complexities equal to O(nN) (note that K_{nn} is a sub-matrix of K_{Nn} and we do not need to calculate and store it separately). The inversion of K_{nn} has a time complexity equal to O(n^3). Overall, the calculation of the matrix A has a time complexity equal to O(n^3 + nN), which is smaller than the time complexity O(nN^2 + n^3) of the randomized approximation method in ([20]). In addition, Nyström-based approximation requires the storage of an N×n matrix and, thus, its memory complexity is considerably lower than that of kernel ELM and the randomized approximation method in ([20]).
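The Nyström steps (13)-(15) combined with the Woodbury formula (12) can be sketched as follows. The RBF kernel, σ, and all problem sizes are hypothetical, and the exact kernel matrix is formed here only to check the approximation on a small example (in a real large-scale setting only K_Nn would ever be built):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, D, C, lam = 500, 40, 8, 3, 1.0    # hypothetical sizes

X = rng.standard_normal((N, D))
T = rng.choice([-1.0, 1.0], size=(C, N))

def rbf(A, B, sigma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Nystroem: only an N x n sub-matrix of K is ever formed.
idx = rng.choice(N, size=n, replace=False)   # n randomly selected samples
K_Nn = rbf(X, X[idx])                        # N x n (K_nn lives inside it)
K_nn = K_Nn[idx]                             # n x n

# G = K_Nn K_nn^{-1/2} via the eigen-decomposition of K_nn, eqs. (13)-(15).
w, U = np.linalg.eigh(K_nn)
w = np.maximum(w, 1e-10)                     # guard against tiny eigenvalues
G = K_Nn @ (U / np.sqrt(w)) @ U.T            # N x n

# A via the Woodbury identity (12): only an n x n system is solved.
inner = np.linalg.solve(np.eye(n) / lam + G.T @ G, G.T @ T.T)
A = lam * (T.T - G @ inner)                  # N x C

# Sanity check against the exact solution (small demo only).
K = rbf(X, X)
A_exact = np.linalg.solve(K + np.eye(N) / lam, T.T)
print(np.abs(A - A_exact).max())             # approximation error; shrinks as n grows
```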

A. Exploiting kernel ELM space properties in ELM variants

In order to incorporate properties of the kernel ELM space in any extension of the ELM algorithm, we follow an analysis similar to that of ([27]). As a use case we will use the GEELM algorithm described in Section II-C. However, the following analysis can be directly applied to any other ELM extension without modification. This is an advantage of the method when compared to ([19]), since the latter is restricted to standard ELM only.

Using the kernel ELM matrix definition, we can write K = Φ^T Φ. The dimensionality of Φ is arbitrary (even infinite) and


TABLE I: Data sets used in our experiments (D, C refer to the data dimensionality and number of classes; N, M are the numbers of training and test data used in each experiment).

Data set               Samples   D     C    N       M
German (UCI)           1000      24    2    700     300
Segmentation (UCI)     2310      19    7    1617    693
Madelon (UCI)          2600      500   2    1820    780
OptDigits (UCI)        5620      64    10   3935    1685
Coil100 (Object)       7200      1024  100  5000    2200
Isolet (UCI)           7797      617   26   5458    2339
USPS (Digit)           9298      256   10   6509    2789
Pubfig+LFW (Face)      47189     1536  200  35469   11720
YouTube Faces (Face)   370319    1770  340  259223  111096

is determined by the selected kernel function. For example, the dimensionality of the kernel ELM space for the linear kernel function is D, while the dimensionality of the kernel space in the case of the RBF kernel is infinite. By using Nyström approximation, K is approximated by the n-rank matrix K̃. Rewriting (15), we can observe that:

K ≈ K̃ = K_{Nn} K_{nn}^{-1} K_{Nn}^T = Φ̃^T Φ̃,   (16)

where Φ̃ = K_{nn}^{-1/2} K_{Nn}^T ∈ R^{n×N} is a matrix that can be considered to contain the images of φ_i, i = 1, ..., N in the approximate kernel ELM space R^n. Since K̃ is an n-rank matrix, the application of the kernel ELM algorithm using K̃ is equivalent to the application of the corresponding ELM algorithm using Φ̃.

Let us now consider the process followed in the test phase, where a test sample x_l is introduced to the (trained) kernel ELM classifier. The network's output is given by o_l = A^T k_l, where A is the weight matrix learnt in the training phase using the approximate kernel matrix K̃. The image of the test sample x_l in the approximate kernel ELM space R^n can be defined as φ̃_l = (Φ̃ Φ̃^T)^{-1} Φ̃ k_l.

By following the above-described analysis, the approximate kernel GEELM can be obtained by mapping the training data from the kernel ELM space to their images Φ̃ ∈ R^{n×N} in the approximate kernel ELM space. Then the GEELM algorithm is applied, where the network output weights are given by:

W_out = (Φ̃ Φ̃^T + (1/λ)(I + μS))^{-1} Φ̃ T^T.   (17)

The matrix S = S_p^{-1} S_i is defined as in the original GEELM, i.e. it is assumed that the training data representations in R^n are embedded in a graph expressing intra-class relationships and a second one expressing between-class relationships, described in matrices S_i ∈ R^{n×n} and S_p ∈ R^{n×n}, respectively. Thus, the calculation of S involves the inversion of an n×n matrix. The network's output for a test vector x_l is given by o_l = W_out^T φ̃_l.

B. Selection of K_{Nn} and K_{nn}

In the above-described method, the matrix K_{Nn} is obtained by randomly selecting n training samples (corresponding to n columns of K). Such an approach has been found to provide

good matrix approximation results ([21], [28]). Recently, it has been shown that K_{Nn} can be calculated by exploiting a set of prototype vectors z_j, j = 1, ..., n, which are obtained by summarizing the training data, e.g. by applying K-Means clustering ([29]). For a number of widely used kernel functions, e.g. linear, RBF and polynomial, it has been shown that the approximation error (in terms of the Frobenius norm) of the prototype-based approach is bounded by the encoding error of the prototypes used ([29], [30]). In this case, K_{Nn} is defined by [K_{Nn}]_{ij} = κ(x_i, z_j), i = 1, ..., N, j = 1, ..., n and K_{nn} is defined by [K_{nn}]_{ij} = κ(z_i, z_j), i = 1, ..., n, j = 1, ..., n. It has been shown that the exploitation of prototype vectors can lead to better matrix approximation results, while adding a small overhead to the computational and memory complexities of the method. Specifically, it requires the storage of an additional n × n matrix and the application of a clustering algorithm on the training data. In this paper, we have adopted the fast K-Means implementation proposed in ([31]), which has a time complexity equal to O(tnN), where t is the number of K-Means iterations. Following ([30]), we set the number of K-Means iterations to a small value (t = 5). While K-Means usually does not converge within so few iterations, it provides good results without increasing the overall time complexity of the method. We have experimentally found that, while the random sampling approach can achieve results similar to those of the prototype-based approach, the latter is generally more robust in performance.

IV. EXPERIMENTS

In this section, we provide experiments conducted in order to illustrate the efficiency of our method exploiting a Nyström-based approximate kernel ELM matrix.
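The overall training pipeline of Section III (prototype selection with a few K-Means iterations, the approximate ELM space Φ̃ = K_nn^{-1/2} K_Nn^T, and a regularized solve in R^n) can be sketched end-to-end as follows. For brevity the sketch uses S = I, i.e. the regularized-ELM special case of (17), the test mapping is the plain Nyström feature map against the prototypes, and all sizes, σ, λ and t are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, D, C, lam, t = 300, 25, 6, 3, 1.0, 5   # hypothetical sizes; t K-Means steps

X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
T = -np.ones((C, N)); T[y, np.arange(N)] = 1.0

def rbf(A, B, sigma=2.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Prototype vectors z_j from a few K-Means iterations (t = 5, as in the text).
Z = X[rng.choice(N, n, replace=False)].copy()
for _ in range(t):
    assign = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1).argmin(1)
    for j in range(n):
        if np.any(assign == j):
            Z[j] = X[assign == j].mean(0)

# [K_Nn]_ij = kappa(x_i, z_j), [K_nn]_ij = kappa(z_i, z_j)
K_Nn, K_nn = rbf(X, Z), rbf(Z, Z)

# Approximate ELM space: Phi_tilde = K_nn^{-1/2} K_Nn^T (n x N), eq. (16).
w, U = np.linalg.eigh(K_nn)
K_nn_invsqrt = (U / np.sqrt(np.maximum(w, 1e-10))) @ U.T
Phi_t = K_nn_invsqrt @ K_Nn.T

# Solve in R^n; S = I here for brevity (the RELM special case of (17)).
W_out = np.linalg.solve(Phi_t @ Phi_t.T + np.eye(n) / lam, Phi_t @ T.T)

# Test vector: map against the prototypes, then o_l = W_out^T phi_l.
x_l = rng.standard_normal((1, D))
phi_l = K_nn_invsqrt @ rbf(Z, x_l)           # n x 1 image in the approximate space
o_l = W_out.T @ phi_l
```

Only the N×n matrix K_Nn is ever formed, which is the source of the memory savings reported below.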
In these experiments, we compare the performance, training time and memory requirements of the kernel Graph Embedded ELM (kGEELM) ([18]) and our approximate method (denoted nAkGEELM hereafter). We also include in our experiments the variant of the randomized approximate kernel ELM ([20]) described in Section II-D for GEELM approximation (denoted rAkGEELM hereafter). All experiments have been conducted on a 64-bit PC with an Intel Xeon 24-core CPU E5-2697 v2 at 2.7 GHz and 92 GB of RAM, using a Matlab implementation and floating point precision. We have employed five medium-scale UCI data sets ([32]), two medium-scale image (digit/object recognition) data sets ([33], [34]) and two large-scale facial image data sets ([35], [36]) to this end. Information regarding the data sets used is provided in Table I. On the Pubfig+LFW data set we apply the 5-fold cross-validation procedure using the standard data partitioning and report the mean classification rate and the standard deviation over the five folds for each algorithm. On the remaining data sets, we measure the performance of each algorithm by the mean classification rate and the standard deviation over five experiments. In each experiment, we randomly keep 70% of the data for training and the remaining 30% for testing. We used the RBF kernel function and we set the parameter σ equal to the mean Euclidean distance between the training data, which is the


TABLE II: Mean classification rate and standard deviation (%) on the medium-scale data sets.

Data set       kGEELM         n = 0.01N                     n = 0.05N                     n = 0.1N
                              rAkGEELM      nAkGEELM        rAkGEELM      nAkGEELM        rAkGEELM      nAkGEELM
German         77.27 ± 1.28   70 ± 0.01     70 ± 0.01       70 ± 0.01     74.33 ± 1.63    73.93 ± 2.64  74.93 ± 1.61
Segmentation   87.76 ± 1.1    70.48 ± 1.67  70.42 ± 1.75    89.15 ± 1.32  88.23 ± 1.23    90.76 ± 0.74  90.33 ± 0.67
Madelon        61.9 ± 1.1     60.79 ± 1.64  59.49 ± 2.42    60.77 ± 1.44  59.67 ± 2.02    56.28 ± 5.86  60.51 ± 1.94
OptDigits      97.86 ± 0.16   89.19 ± 0.38  89.27 ± 0.39    97.86 ± 0.16  97.71 ± 0.43    97.73 ± 0.34  98.68 ± 0.17
COIL100        85.93 ± 0.6    45.7 ± 0.37   45.7 ± 0.37     73.44 ± 1.32  73.78 ± 0.4     82.67 ± 0.53  83.13 ± 0.46
Isolet         97.42 ± 0.43   82.78 ± 0.81  82.77 ± 0.76    92.94 ± 0.61  92.98 ± 0.75    94.79 ± 0.53  94.87 ± 0.48
USPS           98.48 ± 0.29   43.51 ± 1.27  45.2 ± 1.63     95.08 ± 0.89  95.22 ± 0.63    96.6 ± 0.52   96.63 ± 0.43

TABLE III: Mean training times (in seconds) on the medium-scale data sets.

Data set       kGEELM    n = 0.01N              n = 0.05N              n = 0.1N
                         rAkGEELM   nAkGEELM    rAkGEELM   nAkGEELM    rAkGEELM   nAkGEELM
German         0.59      0.22       0.0048      0.21       0.0083      0.23       0.0117
Segmentation   5.10      1.35       0.0159      1.36       0.0181      1.39       0.0285
Madelon        6.61      1.83       0.0509      1.88       0.0913      1.91       0.1033
OptDigits      79.55     15.72      0.017       17.19      0.0726      17.68      0.1333
COIL100        89.97     46.43      0.2916      47.20      0.56        49.16      0.8292
Isolet         191.33    44.85      0.1599      41.85      0.3681      42.80      0.6977
USPS           359.33    95.89      0.1345      99.82      0.3311      100.70     0.7951

TABLE IV: Memory requirements (in MB) for the medium-scale data sets.

Data set       kGEELM     n = 0.01N               n = 0.05N               n = 0.1N
                          rAkGEELM    nAkGEELM    rAkGEELM    nAkGEELM    rAkGEELM    nAkGEELM
German         9.356      9.362       0.0027      9.558       0.0562      9.7568      0.1159
Segmentation   49.842     49.963      0.0124      50.982      0.286       52.09       0.6084
Madelon        63.179     63.248      0.0139      64.523      0.36        65.889      0.7679
OptDigits      295.338    395.668     0.0452      301.544     1.6356      308.095     3.571
COIL100        476.837    479.05      0.0765      489.756     2.6455      500.6216    5.7488
Isolet         681.833    683.416     0.1669      700.591     6.1824      718.8635    13.128
USPS           969.407    970.5       0.199       994.948     8.7485      1021.06     18.6836

TABLE V: Performance (%), mean training time (in sec) and memory requirements (in MB) for the Pubfig+LFW data set.

              n = 0.01N      n = 0.05N      n = 0.1N
Performance   37.61 ± 0.39   63.83 ± 0.21   70.33 ± 0.18
Tr. time      2.5            10.86          25.74
Memory        2.58           132.51         289.18

TABLE VI: Performance (%), mean training time (in sec) and memory requirements (in MB) for the YouTube Faces data set.

              n = 250        n = 500        n = 1000       n = 1500       n = 2000       n = 2500
Performance   79.39 ± 0.41   90.15 ± 0.15   94.98 ± 0.17   95.57 ± 0.21   95.35 ± 0.17   94.89 ± 0.15
Tr. time      17.53          26.29          40.1           67.62          89.09          143.15
Memory        247.45         495.39         992.69         1788.55       2388.55        4473.77

natural scaling factor for each data set. In all the experiments we use the LDA graphs and a value of μ = 1 (i.e. we equally weight the contributions of the two regularization terms). The optimal λ value has been determined by randomly partitioning the training set into two equal (training/validation) sets and evaluating the performance of the algorithms on the validation set using the values λ = 10^r, r = −6, ..., 6. Regarding the value of the parameter n (i.e. the rank of the low-rank approximation of the kernel ELM matrix), we have used three values, determined proportionally to the cardinality of the training set (i.e. n = pN, where p = {0.01, 0.05, 0.1}), for all the data sets (except for the YouTube Faces data set, where we manually test six values in order to demonstrate how the performance, training time and memory requirements of the methods scale with respect to the value of n). On the YouTube Faces data set we have tested values of n up to 2500 due to memory constraints. In practice, the value of n can be selected by taking into account time/computational constraints related to the problem at hand.

In our first set of experiments, we have applied the three algorithms on the medium-scale data sets. The performance of each algorithm for the three values of n is illustrated in Table II. As can be observed, the two approximate methods


provide similar performance in most cases. The performance of the approximate methods becomes competitive with that of the original algorithm as the value of n gets higher. Moreover, it can be seen that the approximate methods provide satisfactory performance even for smaller values of n. nAkGEELM has lower computational and memory costs when compared to both kGEELM and rAkGEELM. The mean training times and the memory requirements of the algorithms are illustrated in Tables III and IV, respectively. Both rAkGEELM and nAkGEELM are able to operate much faster when compared to kGEELM. However, this speedup is much lower for rAkGEELM, since it requires the calculation of the entire ELM kernel matrix and the inverse of L_p. On the other hand, nAkGEELM calculates the inverse of S_p in R^n, which is much faster. By comparing the training times of the two approximate methods, it can be seen that nAkGEELM operates much faster. This difference in training time is expected to be larger for very large data sets. Regarding the memory requirements of each algorithm, it can be seen that rAkGEELM slightly increases the required memory when compared to the kernel GEELM algorithm (as explained in subsection II-D). The memory requirements of nAkGEELM are much lower.

In our second set of experiments we have applied nAkGEELM on the two large-scale data sets. We did not apply kGEELM and rAkGEELM due to memory and time constraints. On the PubFig+LFW data set we used the 1536-dimensional facial image representations suggested by ([36]). On the YouTube Faces data set we have employed the facial images depicting persons appearing in at least 500 images, resulting in a data set of 370319 images and 340 classes, and we used the 1770-dimensional representations suggested by ([35]). The performance of nAkGEELM, along with the required training time and memory, is illustrated in Tables V and VI for the PubFig+LFW and YouTube Faces data sets, respectively. In order to speed up the overall training process in the experiments on the YouTube Faces data set, the prototype vectors have been determined by clustering a set of 5 · 10^4 randomly sampled training vectors. In both data sets nAkGEELM provides satisfactory performance.

In Table VII we compare the performance obtained by applying the proposed method with that of the Random Feature Regression (RFR) method of ([37]), (a supervised version of) the Prototype Vector Machine (PVM) of ([30]) and the GEELM algorithm exploiting random hidden layer weights on the two large-scale data sets. In order to achieve similar training times, we used the same number of mappings (i.e. n = 0.1N for Pubfig+LFW and n = 2500 for YouTube Faces) for all methods. The proposed method provides competitive results with other, recently proposed, related methods.

TABLE VII: Comparison of the proposed approach with other methods.

                RFR            PVM            GEELM          nAkGEELM
Pubfig+LFW      64.14 ± 0.38   70.9 ± 0.39    64.2 ± 0.21    70.33 ± 0.18
YouTube Faces   95.92 ± 0.41   95.35 ± 0.21   90.62 ± 0.15   94.89 ± 0.15

V. CONCLUSIONS

In this paper, we described a method for reducing the training time and memory cost of the kernel Extreme Learning Machine variants. We have shown that by exploiting a Nyström-based kernel ELM matrix approximation we can train a kernel ELM network that overcomes the time and memory restrictions rendering the application of kernel ELM in large-scale learning problems prohibitive. We have also shown that by using Nyström-based kernel matrix approximation, we can define an ELM space that exploits properties of the kernel ELM space and can be used in several optimization schemes proposed in the literature for ELM network training. Experiments on both medium-scale and large-scale classification problems show that this approach can achieve performance comparable with that of the kernel ELM network, while alleviating the heavy time and memory requirements of kernel ELM approaches.

REFERENCES

[1] G. Huang, Q. Zhu, and C. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," IEEE International Joint Conference on Neural Networks, vol. 2, pp. 985–990, 2004.
[2] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[3] D. Broomhead and D. Lowe, "Multivariable functional interpolation and adaptive networks," Complex Systems, vol. 2, pp. 321–355, 1988.
[4] W. Schmidt, M. Kraaijveld, and R. Duin, "Feedforward neural networks with random weights," International Conference on Pattern Recognition, 1992.
[5] Y. Pao, G. Park, and D. Sobajic, "Learning and generalization characteristics of random vector functional-link net," Neurocomputing, vol. 6, pp. 163–180, 1994.
[6] C. Chen, "A rapid supervised learning neural network for function interpolation and approximation," IEEE Transactions on Neural Networks, vol. 7, no. 5, pp. 1220–1230, 1996.
[7] B. Widrow, A. Greenblatt, Y. Kim, and D.
Park, “The no-prop algorithm: A new learning algorithm for multilayer neural networks,” Neural Networks, vol. 37, pp. 182–188, 2013. [8] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” Advances in Neural Information Processing Systems, 2008. [9] G. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme Learning Machine for Regression and Multiclass Classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012. [10] G. Huang, L. Chen, and C. Siew, “Universal approximation using incremental constructive feedforward networks with random hidden nodes,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879– 892, 2006. [11] R. Zhang, Y. Lan, G. Huang, and Z. Zu, “Universal approximation of extreme learning machine with adaptive growth of hidden nodes,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 2, pp. 365–371, 2012. [12] X. Liu, S. Lin, J. Fang, and Z. Xu, “Is extreme learning machine feasible? a theoretical assessment (Part I),” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 7–20, 2015. [13] ——, “Is extreme learning machine feasible? a theoretical assessment (Part II),” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 1, pp. 21–34, 2015.

[14] M. Li, G. Huang, P. Saratchandran, and N. Sundararajan, "Fully complex extreme learning machine," Neurocomputing, vol. 68, pp. 306–314, 2005.
[15] Y. Wang, F. Cao, and Y. Yuan, "A study on effectiveness of extreme learning machine," Neurocomputing, vol. 74, no. 16, pp. 2483–2490, 2011.
[16] A. Iosifidis, A. Tefas, and I. Pitas, "Minimum class variance extreme learning machine for human action recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 11, pp. 1968–1979, 2013.
[17] ——, "Regularized extreme learning machine for multi-view semi-supervised action recognition," Neurocomputing, vol. 145, pp. 250–262, 2014.
[18] ——, "Graph embedded extreme learning machine," IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2015.2401973, 2015.
[19] A. Iosifidis, A. Tefas, and I. Pitas, "Large-scale nonlinear facial image classification based on approximate kernel extreme learning machine," IEEE International Conference on Image Processing, 2015.
[20] C. Men and W. Wang, "A randomized ELM speedup algorithm," Neurocomputing, vol. 159, pp. 78–83, 2015.
[21] C. Williams and M. Seeger, "Using the Nyström method to speed up kernel machines," Neural Information Processing Systems, 2001.
[22] A. Iosifidis, A. Tefas, and I. Pitas, "Minimum variance extreme learning machine for human action recognition," IEEE International Conference on Acoustics, Speech and Signal Processing, 2014.
[23] R. Fletcher, Practical Methods of Optimization: Volume 2, Constrained Optimization. Wiley, 1981.
[24] B. Frenay and M. Verleysen, "Using SVMs with randomised feature spaces: An extreme learning approach," European Symposium on Artificial Neural Networks, 2010.
[25] Y. Peng, S. Wang, X. Long, and B. Lu, "Discriminative graph regularized extreme learning machine for face recognition," Neurocomputing, vol. 149, pp. 340–353, 2015.
[26] N. Halko, P. Martinsson, and J. Tropp, "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions," SIAM Review, vol. 53, no. 2, pp. 217–288, 2011.
[27] A. Iosifidis, A. Tefas, and I. Pitas, "On the kernel extreme learning machine classifier," Pattern Recognition Letters, vol. 54, pp. 11–17, 2015.
[28] P. Drineas and M. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," Journal of Machine Learning Research, vol. 6, pp. 2153–2175, 2005.
[29] K. Zhang and J. Kwok, "Clustered Nyström method for large scale manifold learning and dimensionality reduction," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1576–1587, 2010.
[30] K. Zhang, L. Lan, J. Kwok, S. Vucetic, and B. Parvin, "Scaling up graph-based semisupervised learning via prototype vector machines," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 3, pp. 444–457, 2015.
[31] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," ACM-SIAM Symposium on Discrete Algorithms, 2007.
[32] K. Bache and M. Lichman, "UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science," 2013.
[33] S. Nene, S. Nayar, and H. Murase, "Columbia Object Image Library (COIL-100)," Technical Report CUCS-006-96, 1996.
[34] D. Cai, X. He, J. Han, and T. Huang, "Graph regularized nonnegative matrix factorization for data representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1545–1560, 2011.
[35] L. Wolf, T. Hassner, and I. Maoz, "Face recognition in unconstrained videos with matched background similarity," Computer Vision and Pattern Recognition, 2011.
[36] E. Ortiz and B. Becker, "Face recognition for web-scale datasets," Computer Vision and Image Understanding, vol. 118, pp. 153–170, 2014.
[37] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," Neural Information Processing Systems, 2007.
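As a supplementary illustration of the Nyström-based construction summarized in the conclusions, the following is a minimal NumPy sketch of the Nyström feature map: the landmark kernel matrix W is eigendecomposed as W = UΛUᵀ, and each sample is mapped as φ(x) = Λ^{-1/2} Uᵀ k(x, landmarks), so that φ(X)φ(X)ᵀ approximates the full kernel ELM matrix. The RBF kernel, function names, and all parameter values here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Pairwise RBF kernel matrix between the rows of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def nystroem_map(X, landmarks, gamma=1.0, eps=1e-10):
    """Nystrom feature map phi(x) = Lambda^{-1/2} U^T k(x, landmarks),
    where W = U Lambda U^T is the landmark kernel matrix, so that
    phi(X) @ phi(X).T approximates the full N x N kernel matrix."""
    W = rbf_kernel(landmarks, landmarks, gamma)
    lam, U = np.linalg.eigh(W)
    keep = lam > eps                      # discard numerically null directions
    M = U[:, keep] / np.sqrt(lam[keep])   # n x r mapping matrix
    return rbf_kernel(X, landmarks, gamma) @ M

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                   # toy data, N = 500 samples
L = X[rng.choice(500, size=50, replace=False)]   # n = 50 landmark vectors
Phi = nystroem_map(X, L, gamma=0.1)              # N x r approximate ELM space
print(Phi.shape)                                 # (500, r) with r <= 50
```

In the resulting r-dimensional space, the output weights can be obtained with ordinary linear techniques such as ridge regression, replacing the O(N³) time and O(N²) memory of the full kernel ELM with costs governed by the number of landmarks n ≪ N.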
