Kernel regression on matrix patterns

Abstract. In this paper we propose a kernel-based regression model for matrix patterns (KRMP). The training algorithm is derived, and the proposed model is empirically compared with traditional models.


Introduction
In most supervised or unsupervised machine learning models the inputs are described by vectors. However, there are important applications where the inputs are sets of vectors, or matrices (for example, images, graphs, multidimensional time series, and others). The standard approach is to flatten the input matrix into a vector, but such a transformation can destroy important information about the inner structure of the input matrix. Cai et al. [1] experimentally demonstrated that even in the vector case it can be useful to reshape the input vector into a matrix. In recent years interest in this problem has grown ([1,2,5,6,3]). Most publications on this topic analyze linear methods (for example, see [1,5,6,3]). In this article we introduce a new nonlinear kernel regression model, KRMP (kernel regression on matrix patterns). In kernel methods the initial data vectors x_i are mapped to high-dimensional features φ(x_i). By Mercer's theorem, which states that any continuous, symmetric, nonnegative definite function k(·, ·) can be expressed as an inner product, i.e., k(x_i, x_j) = φ(x_i)^T φ(x_j) [4], the computation of inner products in the feature space is replaced by the computation of values of the kernel function k(·, ·). This idea is known as the kernel trick.

Kernel least squares regression
In this section we briefly describe the traditional kernel least squares regression model (KR). Let y ∈ R^N be a target vector and X = [x_1, x_2, . . . , x_N]^T (where x_i ∈ R^m) be an observation matrix. In linear regression we seek a vector α which minimizes the norm ‖y − Xα‖². The solution of this problem is α = (X^T X)^{-1} X^T y. Linear regression can be extended to a nonlinear one by mapping the original data into a feature space. Let k(x_i, x_j) = φ(x_i)^T φ(x_j) be a Mercer kernel. In kernel regression the original data are mapped into the feature space (i.e., each observation x_i is mapped to φ(x_i), and the observation matrix becomes X = [φ(x_1), φ(x_2), . . . , φ(x_N)]). We seek the solution a which minimizes

J = ‖y − X^T a‖²   (1)

and which lies in the span of the columns of X (i.e., a = Xα). The least squares solution of (1) is α = K^{-1} y, where K = X^T · X is the kernel matrix with entries K_ij = k(x_i, x_j). To avoid overfitting, regularization is often used. In that case the norm of the solution is penalized, and J = ‖y − X^T a‖² + λ‖a‖² = ‖y − Kα‖² + λ α^T K α is minimized. The solution of this problem is α = (K + λI)^{-1} y, where λ ≥ 0 is a regularization constant.
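As an illustration, the regularized solution α = (K + λI)^{-1} y can be sketched in a few lines (a minimal sketch assuming a Gaussian kernel; the function names and data are illustrative, not taken from the paper):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kr_fit(X, y, lam=1e-2, sigma=1.0):
    # Regularized kernel least squares: alpha = (K + lam * I)^(-1) y
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kr_predict(X_train, alpha, X_new, sigma=1.0):
    # y(x) = sum_i alpha_i k(x, x_i)
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```

For λ close to 0 the model interpolates the training targets; a larger λ trades training error for a smoother solution.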

KRMP model
Denote the training set by {(X_i, y_i), i = 1, . . . , N}, where X_i ∈ R^{m×n} are input matrices and y_i are scalar targets. Cai et al. [1] proposed the bilinear model

y(X) = u^T X v,   (2)

where u ∈ R^m and v ∈ R^n. When the inputs are matrices (especially when they have large dimensions), Cai's model has an advantage over standard linear regression because it has fewer parameters (m + n instead of m · n) and exploits the inner structure of the input matrix. In the following, a nonlinear version of Cai's model is proposed.
We will analyze the following model. Each input matrix is mapped column-wise into the feature space, φ(X) = [φ(x^(1)), . . . , φ(x^(n))], and the solution is sought in the span of the mapped training matrices:

y(X) = u^T φ(X)^T A v = Σ_{i=1}^N α_i u^T K(X, X_i) v,   (3)

where u, v ∈ R^n, A = Σ_{i=1}^N α_i φ(X_i), and the n × n kernel matrix K(X, X_i) = φ(X)^T · φ(X_i). By the kernel trick one can calculate y(X) knowing only the kernel k(·, ·) and without knowing the actual mapping φ(·). Models (2) and (3) can be applied to regression or classification problems.
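A minimal sketch of evaluating model (3) with the kernel trick, assuming a Gaussian kernel applied column-wise (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def matrix_kernel(Xa, Xb, sigma=1.0):
    # n x n matrix K(Xa, Xb) with entries k(a^(p), b^(q)) over the columns
    # of Xa and Xb; it plays the role of phi(Xa)^T phi(Xb).
    sq = ((Xa.T[:, None, :] - Xb.T[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krmp_predict(X_new, X_train, alpha, u, v, sigma=1.0):
    # Model (3): y(X) = sum_i alpha_i u^T K(X, X_i) v
    return sum(a * (u @ matrix_kernel(X_new, Xi, sigma) @ v)
               for a, Xi in zip(alpha, X_train))
```

Note that the prediction is bilinear in u and v and linear in α, which is what the alternating estimation scheme of the next section exploits.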

Parameter estimation
In this section an algorithm for regularized sum squared error (RSSE) minimization is formulated. RSSE is defined by

J(u, v, α) = Σ_{i=1}^N (y_i − y(X_i))² + λ1‖u‖² + λ2‖v‖² + λ3‖α‖²,   (4)

where λ1, λ2, λ3 ≥ 0 are regularization constants. Our aim is to estimate the parameters u, v and α which minimize (4). For the sake of convenience, define the N × N matrix M = (m_ij), m_ij = u^T K(X_i, X_j) v, so that the vector of model outputs on the training set is Mα. Differentiating (4) with respect to u and v and setting the derivatives to 0 gives

u = (C^T C + λ1 I)^{-1} C^T y, where the i-th row of C is (Σ_j α_j K(X_i, X_j) v)^T,   (5)

v = (D^T D + λ2 I)^{-1} D^T y, where the i-th row of D is (Σ_j α_j K(X_i, X_j)^T u)^T.   (6)

For fixed u and v, the optimal α can be found by the well-known regularized least squares formula, α = (M^T M + λ3 I)^{-1} M^T y. From equations (5), (6) we see that the optimal parameters depend on each other and thus cannot be computed explicitly. For the parameter optimization the following alternating (KRMP) algorithm can be applied: choose a maximum number of iterations t_0 ∈ N and a tolerance ε > 0, initialize u and v (e.g., randomly), and set t = 1; then repeat the updates of α, u and v by the formulas above, incrementing t, until the decrease of (4) becomes smaller than ε or t > t_0.
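The alternating scheme can be sketched as follows (a minimal sketch, not the authors' implementation; it assumes a Gaussian column-wise kernel, ridge updates for each block, and random initialization of u and v):

```python
import numpy as np

def matrix_kernel(Xa, Xb, sigma=1.0):
    # K(Xa, Xb)_{pq} = exp(-||a^(p) - b^(q)||^2 / (2 sigma^2)) over columns
    sq = ((Xa.T[:, None, :] - Xb.T[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def krmp_fit(Xs, y, lams=(0.1, 0.1, 0.1), sigma=1.0, t_max=50, eps=1e-8, seed=0):
    # Alternating RSSE minimization: each step solves a ridge problem in one
    # block (alpha, u or v) with the other two blocks held fixed.
    N, n = len(Xs), Xs[0].shape[1]
    Kb = [[matrix_kernel(Xs[i], Xs[j], sigma) for j in range(N)]
          for i in range(N)]                      # precomputed K(X_i, X_j)
    rng = np.random.default_rng(seed)
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    alpha = np.zeros(N)
    history, prev = [], np.inf
    for _ in range(t_max):
        # alpha-step: outputs are M alpha with m_ij = u^T K(X_i, X_j) v
        M = np.array([[u @ Kb[i][j] @ v for j in range(N)] for i in range(N)])
        alpha = np.linalg.solve(M.T @ M + lams[2] * np.eye(N), M.T @ y)
        # u-step (eq. (5)): row i of C is sum_j alpha_j K(X_i, X_j) v
        C = np.array([sum(alpha[j] * (Kb[i][j] @ v) for j in range(N))
                      for i in range(N)])
        u = np.linalg.solve(C.T @ C + lams[0] * np.eye(n), C.T @ y)
        # v-step (eq. (6)): row i of D is sum_j alpha_j K(X_i, X_j)^T u
        D = np.array([sum(alpha[j] * (Kb[i][j].T @ u) for j in range(N))
                      for i in range(N)])
        v = np.linalg.solve(D.T @ D + lams[1] * np.eye(n), D.T @ y)
        # value of the objective (4) after the sweep
        M = np.array([[u @ Kb[i][j] @ v for j in range(N)] for i in range(N)])
        J = (np.sum((y - M @ alpha) ** 2) + lams[0] * u @ u
             + lams[1] * v @ v + lams[2] * alpha @ alpha)
        history.append(J)
        if prev - J < eps:
            break
        prev = J
    return u, v, alpha, history
```

Because each step exactly minimizes (4) over one block with the other two fixed, the recorded values of (4) never increase.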
Since (4) is a convex function with respect to each of the blocks α, u and v taken separately, every step of the KRMP algorithm does not increase its value; the iterative sequence of values of (4) produced by the algorithm is therefore monotonically non-increasing and bounded below by 0, and hence converges.

Numerical simulations
In this section model (3) is empirically compared with two supervised machine learning algorithms: kernel regression, which is the analogue of (3) when the inputs are vectors, and support vector machines (SVM, [4]). The results of [1] suggest that matrix-based models are efficient with small training samples. However, Cai et al. worked with linear models. In our experiments we therefore also use a small part of the data for training the models and check this assumption for nonlinear ones.
The benchmark data sets are three binary classification data sets from the UCI machine learning repository. In the experiments we used the Gaussian kernel k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)). The measure of performance of the models was the probability of correct classification over the testing set. In each experiment the training set was selected randomly, all experiments were repeated 100 times, and the results were averaged. The meta-parameters (the bandwidth σ, the regularization constants, etc.) were selected by cross-validation.

Data sets
The Ionosphere data set consists of 351 observations with 34 features. The variance of the 2nd feature is zero, so this feature is removed. For training of the models, 20 random examples are selected; the others are left for testing. For the KRMP model the input vectors are preprocessed into 3 × 11 matrices.

The SPECTF data set consists of 80 observations with 44 features. For the KRMP model the input vectors are preprocessed into 4 × 11 matrices. In this case 10 observations are randomly selected for training of the models.

The Australian credit approval data set consists of 690 14-dimensional input vectors. When KRMP is used, the initial data vectors are preprocessed into 2 × 7 matrices. For training of the models, 10 examples are selected; the others are left for testing.
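The preprocessing used for KRMP amounts to a simple reshape of each input vector into a matrix pattern (row-major order is an assumption here; the paper does not state the exact layout):

```python
import numpy as np

# One 33-dimensional Ionosphere vector (after removing the constant 2nd
# feature) becomes a 3 x 11 matrix before being fed to KRMP.
x = np.arange(33, dtype=float)   # stand-in for one input vector
X = x.reshape(3, 11)             # the matrix pattern used by KRMP
```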

Empirical results
The sign ">" means that the p-value of the signed rank test for zero median of the differences between the performances of the two models was < 0.01; "∼" marks the opposite case. From Table 1 we see that the KRMP model was more efficient than the traditional kernel regression (KR) model and performed similarly to or better than SVM. In our opinion, KRMP was more effective than KR because of its model structure, which exploits the inner structure of the matrix inputs.