Particle swarm optimization for linear support vector machines based classifier selection

Particle swarm optimization is a metaheuristic technique widely applied to various optimization problems, including parameter selection for classification techniques. This paper presents an approach to linear support vector machine classifier optimization that combines selection of a classifier from a family of similar classifiers with optimization of its parameters. Experimental results indicate that the proposed heuristic can obtain results competitive with, or better than, similar techniques and can be used as a solver for various classification tasks.


Introduction
Novel machine learning techniques play a very important role in solving various analysis and forecasting problems across domains. Computational finance is one such field; it involves many researchers working in subfields such as financial forecasting and the development of advanced techniques for credit risk modelling and evaluation, used to evaluate ratings, classify debtors by their risk level, and predict bankruptcies. Financial institutions and authorities, such as the banking sector, investors, and governing bodies, pay increasing attention to these techniques, as they overcome limitations of previously applied techniques or show competitive results in terms of accuracy or precision. As Balthazar notes in [1], machine learning techniques such as support vector machine (SVM) based models are already successfully applied to solve real-world problems at the Standard & Poor's rating company. The support vector machine is a widely adopted classification technique with performance comparable to neural network classifiers; yet it helps to avoid some of their problems, such as overtraining, overfitting, and local minima. SVM is applied to various classification problems in different domains, including bioinformatics and computational biology [2,3], document classification [4,5], image recognition [6,7], etc.,
as well as bankruptcy prediction [8-10]. Least squares SVM (LS-SVM), developed by Suykens and Vandewalle [11], reformulates the SVM quadratic programming problem so that it is solved by a set of linear equations. Lai et al. and Zhou and Lai used LS-SVM to develop approaches for credit risk evaluation [12,13]. One of the main challenges in SVM adoption for practical real-world problems is the parameter selection task: multiple SVMs with different parameters have to be computed in order to find the SVM with the best classification performance. This is also stated in the book by Steinwart and Christmann [14]. Various heuristic and evolutionary optimization techniques are used to solve this task. Grid search is used by various researchers, such as Chen et al. [8] and Yun et al. [15]; it is also implemented by default in some SVM packages, such as LibSVM [16]. Papers describing the adoption of genetic algorithms report benefits and an increase in overall classification performance [17,18]. The particle swarm optimization (PSO) algorithm, introduced by Kennedy [19] and based on the behaviour of a flock of birds, has also been reported by various researchers as an effective tool for parameter selection [15,20-23], feature selection [24], or a combination of both [25-27].
Linear support vector machines are not widely applied to classification problems in the credit risk domain, mainly because of their inflexibility in modelling. They are more applicable to large-scale classification, whereas related surveys [28,29] indicated that research on insolvency prediction or ratings analysis mostly involved fewer than 1000 instances. This is related mainly to the limited availability of relevant data; however, the increasing amount of financial data available online offers possibilities for new insights, as well as the advantages of larger-scale research. The authors of linear SVM (particularly its implementations in the LIBLINEAR software) showed that in some cases it is able to produce results competitive with nonlinear SVMs while having lower complexity and reduced computational time required to train the classifier [30-32]. Similar results were also obtained in our previous works [33,34], where we worked with a comparatively large number of instances. Wu [35] demonstrated that linear SVM classifiers can be very sensitive to their cost parameters; therefore, their selection is a necessary yet subtle task. To deal with this problem, we previously introduced a hybrid technique based on PSO and linear SVM, called PSO-LinSVM, with the capability to select a linear SVM classifier algorithm together with its parameters. In that work we concentrated mainly on its application to the credit risk evaluation problem, applying a real-valued PSO algorithm to a hybrid search space with one discrete dimension [36,37]. Although the results were promising, the nature of real-valued PSO indicated possible improvements regarding particle movement in the discrete dimension. Therefore, this paper extends that work by proposing an enhancement of the PSO-LinSVM algorithm for the hybrid search space.

Description of techniques used in research
Fig. 1. Linear support vector machine illustration. (Source: adapted from [38], with comments by the authors.)

The support vector machine is described in detail in [39]. SVM performs data discrimination by mapping the input space to a high-dimensional feature space using kernel functions. A linear SVM is illustrated in Fig. 1. The main objective is to find a hyperplane which minimizes the margin error and is described by a set of support vectors. Finding these vectors from training data is formulated as a quadratic optimization problem [39]:

minimize_{w,b,ξ} (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
subject to y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., l,

where C is a regularization (also referred to as cost [16]) parameter that determines the trade-off between the maximum margin and the minimum classification error. The decision function is defined as [39] sgn(w^T φ(x) + b).
If training vectors are not linearly separable, they can be represented in a larger (possibly infinite-dimensional) space by using a kernel function K(x_i, x_j) ≡ φ(x_i)^T φ(x_j). The SVM is then solved using the dual formulation [16,38]:

minimize_α (1/2) α^T Q α − e^T α
subject to 0 ≤ α_i ≤ C, i = 1, ..., l, y^T α = 0,

where Q_ij = y_i y_j K(x_i, x_j), e is the vector of all ones, the number of training examples is denoted by l, training vectors x_i ∈ R^n, i = 1, ..., l, and y ∈ R^l is such that y_i ∈ {−1, 1}; α is a vector of l Lagrange multipliers, each α_i corresponding to a training example (x_i, y_i). According to [38], the parameters of the optimal hyperplane, w_0 and b_0, are obtained using

w_0 = Σ_{i=1}^{N_SV} α_0i y_i x_i,

where N_SV is the number of support vectors. Support vectors are the instances which have nonzero α_0i and support forming the decision function, which becomes [16]

sgn(Σ_{i=1}^{N_SV} α_0i y_i K(x_i, x) + b_0).

Linear SVM. A linear SVM classifier is defined as follows [40]: given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, and a vector y ∈ R^l such that y_i ∈ {−1, 1}, a linear classifier generates a weight vector w used in the decision function sgn(w^T x). LIBLINEAR includes a family of linear SVM and logistic regression classifiers for large-scale classification. These classifiers have several advantages over nonlinear SVM implementations (such as LibSVM or SVM Light), as the absence of kernel functions results in reduced complexity and training time. In some cases, the discriminant function of the classifier includes a bias term b. LIBLINEAR handles this term by augmenting the vector w and each instance x_i with an additional dimension, w^T ← [w^T, b], x_i^T ← [x_i^T, B], where the constant B is specified by the user as the bias term (further, B will be referred to as the bias parameter) [40]. According to [40], L1-SVM and L2-SVM are solved using a coordinate descent method [40,41]; for logistic regression and L2-SVM, a trust region Newton method [40,42] is also implemented.

Table 1. Linear SVM classification algorithms and their formulations.

Algorithm | Minimization problem
L2-regularized logistic regression | min_w (1/2) w^T w + C Σ_{i=1}^{l} log(1 + exp(−y_i w^T x_i))
L2-regularized L2-loss SVC (dual) | min_α (1/2) α^T (Q + D) α − e^T α, subject to α_i ≥ 0, with D_ii = 1/(2C)
For the research, the LIBLINEAR classifiers listed in Table 1 were used; for more information, refer to [30,40].
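As a concrete illustration of the linear SVM formulation and LIBLINEAR's bias-augmentation trick described above, the following sketch trains an L2-regularized L2-loss (squared hinge) linear classifier by plain gradient descent on a toy dataset. This is not LIBLINEAR's coordinate descent or trust region solver; the function names, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, B=1.0, lr=0.01, epochs=200):
    """Gradient descent on (1/2)w'w + C * sum(max(0, 1 - y w'x)^2).
    The bias is handled as in LIBLINEAR: each instance is augmented
    with the constant B, so the last component of w acts as the bias b."""
    Xa = np.hstack([X, np.full((X.shape[0], 1), B)])   # bias augmentation
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        margins = 1.0 - y * (Xa @ w)                   # hinge margins
        viol = margins > 0                             # margin violators
        grad = w - 2.0 * C * (Xa[viol].T @ (y[viol] * margins[viol]))
        w -= lr * grad
    return w

def predict(w, X, B=1.0):
    Xa = np.hstack([X, np.full((X.shape[0], 1), B)])
    return np.sign(Xa @ w)

# toy linearly separable two-class data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_linear_svm(X, y, C=1.0, B=1.0)
print(predict(w, X))
```

The bias parameter B scales the augmented coordinate exactly as the `-B` option does in LIBLINEAR, which is why it appears as a tunable dimension in the search space later in the paper.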
Least squares SVM. Least squares SVM (abbr. as LS-SVM) aims to solve a set of linear equations instead of the convex quadratic programming performed for the standard SVM. LS-SVM is also closely related to kernel Fisher discriminant analysis [11]. The problem can be formulated as [11]

minimize_{w,b,e} (1/2) w^T w + γ (1/2) Σ_{k=1}^{l} e_k^2
subject to y_k (w^T φ(x_k) + b) = 1 − e_k, k = 1, ..., l,

with l denoting the number of training instances. The Lagrangian for LS-SVM is defined as [11]

L(w, b, e, α) = (1/2) w^T w + γ (1/2) Σ_{k=1}^{l} e_k^2 − Σ_{k=1}^{l} α_k (y_k (w^T φ(x_k) + b) − 1 + e_k),

with α_k as the Lagrange multipliers.
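The LS-SVM formulation above reduces training to a set of linear equations; a minimal sketch of solving that system, assuming a linear kernel and hypothetical toy data, can be written as:

```python
import numpy as np

def train_ls_svm(X, y, gamma=100.0):
    """Fit an LS-SVM classifier by solving the KKT linear system
        [ 0   y^T             ] [ b     ]   [ 0 ]
        [ y   Omega + I/gamma ] [ alpha ] = [ 1 ]
    where Omega_kl = y_k * y_l * K(x_k, x_l) (linear kernel here)."""
    n = X.shape[0]
    K = X @ X.T                                    # linear kernel matrix
    Omega = (y[:, None] * y[None, :]) * K
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(A, rhs)
    return sol[1:], sol[0]                         # alpha, b

def ls_svm_predict(alpha, b, X_train, y_train, X):
    # decision: sgn( sum_k alpha_k y_k K(x, x_k) + b )
    return np.sign((X @ X_train.T) @ (alpha * y_train) + b)

# hypothetical toy two-class data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = train_ls_svm(X, y)
print(ls_svm_predict(alpha, b, X, y, X))
```

Note that, unlike the standard SVM, every training point gets a (generally nonzero) multiplier α_k, which is why LS-SVM trades the QP solver for one dense linear solve.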

Heuristic optimization algorithms applied in research
Simulated annealing (SA) method is described as "based on the metaphor of molecules cooling into a crystalline pattern after being heated.In a molten metal the molecules move chaotically, and as the metal cools they begin to find patterns of connectivity with neighbouring molecules, until they cool into a nice orderly pattern -an optimum" [19].
According to Kennedy et al., simulated annealing extends hill climbing with a stochastic decision and a cooling schedule [19]. This technique has been shown to perform well and obtain good results in a relatively small number of iterations.
The main idea of the algorithm is as follows: at each iteration, a set of possible solutions is generated and a random successor v of the current solution u is chosen. If f(v) < f(u), it is accepted as the new current solution; otherwise (the cost function value increases), the new variable set is accepted with a certain probability

P = exp(−(f(v) − f(u))/T),

i.e., when r < P, where r is a random value from the uniform distribution, r ∼ U(0, 1). After a certain number of iterations, the new variable sets no longer minimize the cost, and the procedure is stopped once the temperature T converges to 0 (i.e., T → 0).
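The acceptance rule and cooling schedule described above can be sketched as follows; the cooling rate, step size, and test function are illustrative assumptions, not values used in the paper's experiments.

```python
import math
import random

def simulated_annealing(f, x0, step=1.0, t0=10.0, cooling=0.95, iters=500, seed=42):
    """Minimize f by simulated annealing: a worse candidate v is still
    accepted with probability exp(-(f(v) - f(u)) / T), checked against
    r ~ U(0, 1); T follows a geometric cooling schedule towards 0."""
    rng = random.Random(seed)
    u, fu = x0, f(x0)
    best, fbest = u, fu
    t = t0
    for _ in range(iters):
        v = u + rng.uniform(-step, step)     # random neighbour of u
        fv = f(v)
        if fv < fu or rng.random() < math.exp(-(fv - fu) / t):
            u, fu = v, fv                    # accept (always if better)
        if fu < fbest:
            best, fbest = u, fu              # track best-so-far
        t *= cooling                         # temperature converges to 0
    return best, fbest

x, fx = simulated_annealing(lambda x: (x - 3.0) ** 2, x0=-5.0)
print(round(x, 2))  # should end up near the minimum at x = 3
```

Early on, large T makes uphill moves likely (global exploration); as T shrinks, the procedure degenerates into greedy hill descent, mirroring the metal-cooling metaphor quoted above.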
Particle swarm optimization (PSO). The PSO algorithm was introduced by Kennedy [19]. The technique is based on the behaviour of a flock of birds which search for food randomly in some area, knowing only the distance from the food. In PSO, each possible solution is represented as such a bird and is called a particle, and its location relative to the searched object (the food in this example) is defined by the fitness value. Thus all particles have a fitness value defined by the function which is optimized, and each particle has a velocity which determines its flying direction and distance. All particles search the solution space by following the currently most optimal particle. PSO is initialized as a group of random particles and iteratively finds the optimal solution. In each iteration, every particle updates itself using two tracked extremes: the first is the optimal solution found by the particle itself (pbest), the other is the optimal solution found by the whole swarm (gbest). At each step of the algorithm, particles are displaced from their current position by applying a velocity vector to them. The magnitude and direction of the velocity at each step is influenced by the velocity in the previous iteration, simulating momentum, and by the location of the particle relative to its pbest and the gbest. Thus, at each step a particle is stochastically accelerated towards its previous best position and towards the neighbourhood (global) best position, forcing particles to continually search the most promising regions found so far in the solution space.
The velocity update for particle p,

v_p(t + 1) = v_p(t) + c_1 r_1 (y_p − x_p) + c_2 r_2 (ŷ − x_p),

includes the following components [43]:
• The previous velocity (also referred to as inertia or momentum) v_p(t), representing the memory of the previous movement direction, i.e., movement in the immediate past. This prevents the particle from drastically changing direction and biases it towards the current direction.
• The cognitive component c_1 r_1 (y_p − x_p), representing the confidence of the individual particle in its own solutions relative to past performance; it encourages the particle to return to its own best position y_p. Here r_1 is a random value with uniform distribution, i.e., r_1 ∼ U(0, 1).
• The social component c_2 r_2 (ŷ − x_p), representing confidence in the solutions found by the particle's neighbours and the common standard that individuals want to attain; r_2 ∼ U(0, 1), and ŷ is the global best position. The particle is able to head towards the best position found by its neighbourhood.
• Positive acceleration constants (also referred to as learning factors) c_1 and c_2, used to scale the contributions of the cognitive and social components.
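The components above can be sketched as a minimal gbest PSO. An inertia weight w on the previous velocity is used here as a common stand-in for the momentum component; all coefficients and the test function are illustrative assumptions.

```python
import random

def pso(f, bounds, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Minimize f over a box using gbest PSO. Each particle keeps its
    personal best (pbest); the swarm shares one global best (gbest).
    Velocity: v = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)."""
    rng = random.Random(seed)
    dim = len(bounds)
    xs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [list(x) for x in xs]
    pbest_f = [f(x) for x in xs]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = list(pbest[g]), pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            fx = f(xs[i])
            if fx < pbest_f[i]:
                pbest[i], pbest_f[i] = list(xs[i]), fx
                if fx < gbest_f:
                    gbest, gbest_f = list(xs[i]), fx
    return gbest, gbest_f

best, fbest = pso(lambda x: x[0] ** 2 + x[1] ** 2, [(-5, 5), (-5, 5)])
```

On this smooth sphere function the swarm collapses quickly onto the origin; rough multimodal surfaces are exactly where the c_1 versus c_2 balance discussed later in the paper starts to matter.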

Proposed classification approach
As Table 1 shows, linear SVM based classifiers, although having different formulations, share several common parameters. Therefore, the selection problem can be approached as selection of a classifier together with its parameters, instead of parameter selection for each classifier from this set. This led to the metaheuristic approach proposed in this paper. According to the techniques used for its development (particle swarm optimization and linear SVM), it is further referred to as PSO-LinSVM.

Definition 1. Each particle P = (p_1; p_2; p_3) in PSO-LinSVM is represented as follows:
• p_1 - a non-negative integer value which represents the algorithm used for classification;
• p_2 - a real value, the cost parameter C;
• p_3 - a real value which represents the bias term B.
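Definition 1 can be sketched as a small data structure (the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Particle:
    """PSO-LinSVM particle P = (p1; p2; p3) from Definition 1."""
    p1: int     # classifier index: non-negative successive integer
    p2: float   # cost parameter C (must stay positive)
    p3: float   # bias term B

# e.g., classifier number 3 with C = 0.5 and bias B = 1.0
p = Particle(p1=3, p2=0.5, p3=1.0)
```

The mixed typing of the fields (one integer, two reals) is exactly what makes the search space hybrid and motivates the modified velocity handling below.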
Note that the p_1 value itself does not play an important role in obtaining the position value, as p_1 is initialized randomly in the whole search space, and optimization is done according to the performance of the SVM classifier represented by the particle. However, scattered values may influence particle velocity; therefore, it is required that p_1 values are non-negative successive integers (i.e., given cl_min ≤ P_i1 ≤ cl_max, S(i) = i + h for each P_i1). Although other h values could be used, this is not computationally reasonable, as the corresponding population initialization and velocity equations would require modifications, replacing rounding operations with operators which ensure that P_i1 and velocity values stay valid, and would require additional operations. Thus h = 1 was used in the experiments. The results can depend on the number of particles used in optimization: the larger the number of particles, the better the coverage of the search space, but the larger the demand for computational resources. Another factor which can influence the final results arises from the initialization of particles, which are initialized randomly in the search space; thus the sequential ordering of classifiers in cl, as well as the implementation of the random number generator used in the implementation of the algorithm, can also have an impact on the results.
Nonlinear Anal. Model. Control, 2014, Vol. 19, No. 1, 26-42
The main objective of the algorithm is to maximize the fitness function defined as the sum of TPR values over all classes:

f = Σ_{i=1}^{N_C} TPR_i,

where N_C is the number of classes and TPR_i is the TPR value for the ith class. Alternatively, it can be defined as a minimization problem, where the aim is to minimize the difference between the "ideal" performance (i.e., when the TPR value for all classes is equal to 1) and the performance obtained by the classifier:

f = N_C − Σ_{i=1}^{N_C} TPR_i.

Accuracy or error ratio is often chosen for fitness evaluation [21-23]; however, in the case of imbalanced learning, accuracy is not the best option (it is possible to obtain high classification accuracy if the classifier correctly recognizes most "majority" instances but fails to identify most "minority" instances), so the sum of TP rate values is selected in our approach. These evaluations are obtained by performing k-fold cross-validation training; the number of folds can be selected according to the size of the training dataset. As the formula shows, an ideal solution can be obtained only in the case of perfect classification; as this happens very rarely, the main goal is to find a satisfactory solution. Therefore, the technique is also adjusted to terminate the search after no further improvements in performance are observed. The technique also comprises velocity clamping, where V_max,j represents the maximum allowed velocity in dimension j. According to Engelbrecht [43], large values of V_max,j facilitate global exploration, while smaller values encourage local exploitation. It is often computed as a fraction σ of the search space and selected empirically, according to the problem being solved. Therefore, in the proposed technique it is calculated as

V_max,j = σ (R_max,j − R_min,j),

where R_min,j denotes the minimum of the search space in the jth dimension and R_max,j its maximum.
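The advantage of the sum-of-TPR fitness over plain accuracy under class imbalance can be illustrated with a small sketch (the toy labels are hypothetical):

```python
def fitness_sum_tpr(y_true, y_pred, classes):
    """Fitness as in the paper's idea: the sum of true positive rates
    (per-class recall); the ideal value equals the number of classes."""
    total = 0.0
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        tpr = sum(1 for i in idx if y_pred[i] == c) / len(idx)
        total += tpr
    return total

# imbalanced toy labels: 9 "majority" and 1 "minority" instance
y_true = [1] * 9 + [2]
y_all_majority = [1] * 10   # a classifier that ignores the minority class

accuracy = sum(p == t for p, t in zip(y_all_majority, y_true)) / len(y_true)
fitness = fitness_sum_tpr(y_true, y_all_majority, classes=[1, 2])
print(accuracy, fitness)  # accuracy 0.9, but fitness only 1.0 of the ideal 2.0
```

The degenerate classifier scores 90% accuracy yet only half of the ideal fitness, which is exactly the failure mode the sum-of-TPR criterion is designed to penalize.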
The proposed approach for linear SVM classifier selection based on these principles is presented as Algorithm 1. The algorithm is defined as a solver for the fitness minimization problem defined in Eq. (11), although it can easily be adapted to solve the maximization problem in Eq. (10).
The following parameters are defined for the proposed algorithm:
• n - size of the swarm.
• c1 - PSO coefficient for the cognitive component.
• c2 - PSO coefficient for the social component.
• terminate_iteration (optional) - parameter which defines the number of iterations after which PSO optimization is terminated if no further improvement is observed.
• max_iterations (optional) - maximum number of iterations for PSO optimization. If it is not given, the procedure loops until the terminate_iteration criterion is satisfied. This can be considered if fast convergence to an optimal solution is known to occur.
• ŷ(t) - the global best position obtained at iteration t.
Several modifications, compared to the original PSO gbest algorithm, can be distinguished here:
• The algorithm performs search in a mixed search space (one of the dimensions is mapped to integer space, while the other two are represented by subsets of real-valued space); thus both the initialization procedure and the velocity equation are modified to meet these requirements.
• The search space is bounded by several constraints (p_1 ∈ {i | cl_min ≤ i ≤ cl_max; cl_min, i, cl_max ∈ Z}; p_2 > 0) which cannot be violated, i.e., none of the obtained parameters can lie outside these constraints. To deal with them, a particle "teleportation" principle is applied (if a particle reaches the upper boundary in the first dimension, it is moved back to the lower boundary), while an infeasible value of p_2 is replaced by the minimal value of the cost parameter C.
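The teleportation and cost-clamping constraint handling can be sketched as follows. The symmetric treatment of the lower boundary is an assumption of this sketch (the text above only describes the upper-boundary case), and all names are illustrative.

```python
def constrain_particle(p, cl_min, cl_max, c_min):
    """Constraint handling in the spirit of PSO-LinSVM: the discrete
    classifier index p1 'teleports' past the upper boundary back to
    the lower one, while an infeasible cost p2 is replaced by C_min."""
    p1, p2, p3 = p
    p1 = int(round(p1))                  # discrete dimension: round to integer
    n = cl_max - cl_min + 1
    if p1 > cl_max or p1 < cl_min:       # teleport back into the valid range
        p1 = cl_min + (p1 - cl_min) % n
    if p2 <= 0:                          # cost parameter must stay positive
        p2 = c_min
    return (p1, p2, p3)

# index 7.6 rounds to 8, past cl_max = 7, so it teleports to cl_min = 1;
# the negative cost is replaced by C_min
print(constrain_particle((7.6, -0.5, 2.0), cl_min=1, cl_max=7, c_min=0.01))
```

Teleportation (a wrap-around) keeps the velocity intact, unlike clamping to the boundary, so particles do not pile up on the last classifier index.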

Experimental results
An experiment to verify the classification efficiency of the PSO-LinSVM algorithm defined in Algorithm 1 was performed using the Australian and German credit datasets; both can be accessed in the UCI repository. These datasets were chosen because of their popularity and wide adoption in similar experiments, as well as their credit risk domain context; therefore, the results can be used in benchmarking against similar algorithms. Two variations of the German credit dataset are provided: the original dataset, which contains categorical/symbolic attributes, and one for algorithms that need numerical attributes. The latter file has been edited and several indicator variables added to make it suitable for algorithms which cannot cope with categorical (nominal) variables. Thus the final German credit dataset consists of 1000 instances (700 labeled as "Class 1" and 300 labeled as "Class 2") with 24 numerical attributes. The main specification of the numerical German credit dataset is given in Table 2.
The Australian credit dataset concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The numerical version of this dataset has 690 instances with 14 attributes; of these, 383 instances are labeled as "Class 1" and 307 as "Class 2".
The proposed technique was compared with LibSVM and LS-SVM implementations. The experiment was performed in the MATLAB 2010b environment using the LibSVM 3.12, LIBLINEAR 1.8 and LS-SVMlab 1.8 toolboxes. Their parameter selection was implemented using the simulated annealing algorithm in MATLAB's optimization toolbox, while PSO optimization was developed using the PSO toolbox for MATLAB by Sam Chen. Two approaches for fitness evaluation were applied, in order to obtain a classifier with the best classification performance in both balanced and unbalanced classification conditions:
• accuracy, obtained using k-fold cross-validation (further referred to as CV optimization);
• the sum of TP ratios, also obtained using k-fold cross-validation (further referred to as balanced CV optimization). This is the approach used in [36,37].
In the experiment, k = 5 was selected (although for a large dataset, k = 2 or k = 3 can be considered as a computationally more efficient choice). Figs. 2 and 3 present visualizations of results obtained after iteratively performing classification tasks in bounded search spaces, for both accuracy-based and balanced optimization. To enable comparison of various optimization approaches, a direct parameter search procedure was run using the seven linear SVM classifiers in the LIBLINEAR 1.8 toolbox (further referred to as LIBLINEAR+DS). These figures show performance results in terms of accuracy, where each point is represented as (C; bias; max(acc_i)), with acc_i as the accuracy obtained by the best-performing classifier from the set of LIBLINEAR classifiers at the particular search space point. The C parameter change step was set to 5, whereas the bias parameter changed by 1. Such a representation can be used to visualize the search surface and to identify core parameters for the optimization procedure. As an example, these figures help to identify that the best-performing classifier had relatively large C (somewhere between 30 and 70) and bias parameters in the case of the German dataset, whereas the regularization parameter is relatively small for the Australian credit dataset. This information can be considered during the initialization of particles, e.g., to initialize a larger part of the swarm in particular regions in order to enable faster convergence or increase the possibility of obtaining the best possible solution. In this research, default PSO initialization (with its modifications for the discrete dimension) is considered. Notably, balanced optimization resulted in total domination by single classifiers with different C and bias parameters.
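The LIBLINEAR+DS direct search described above can be sketched as a plain grid scan over (classifier, C, bias). The evaluation function below is a mock stand-in for cross-validated classifier accuracy, and its peak location is invented purely for illustration:

```python
def direct_search(evaluate, classifiers, c_range, bias_range, c_step=5.0, bias_step=1.0):
    """Direct parameter search over a bounded grid: every (classifier,
    C, bias) point is evaluated and the best-performing one is kept,
    which also yields the (C; bias; max accuracy) search surface."""
    best = (None, None, None, -1.0)
    c = c_range[0]
    while c <= c_range[1]:
        b = bias_range[0]
        while b <= bias_range[1]:
            for clf in classifiers:
                acc = evaluate(clf, c, b)
                if acc > best[3]:
                    best = (clf, c, b, acc)
            b += bias_step
        c += c_step
    return best

# mock evaluator with an invented peak at C = 50, bias = 2
mock = lambda clf, c, b: 1.0 - abs(c - 50) / 100 - abs(b - 2) / 10 - 0.01 * clf
res = direct_search(mock, classifiers=range(7), c_range=(5, 100), bias_range=(0, 5))
print(res)
```

The C step of 5 and bias step of 1 match the grid used for Figs. 2 and 3; the cost of such a scan grows multiplicatively with each dimension, which is what motivates replacing it with PSO.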
According to Engelbrecht, a larger social coefficient is a better option for search spaces with a smooth surface, while a larger cognitive coefficient is preferred for problem spaces which have many global and local optima and therefore result in a rough search surface [43]. As the visualizations of the linear SVM search space show, the SVM selection problem can be considered the latter case; therefore, c_1 > c_2 was used in further experiments. The figures also show that with accuracy selected as the evaluation metric, a wide range of classifiers performed best, i.e., none of the classifiers could be identified as the most effective solution, whereas in the case of balanced cross-validation (using the sum of true positive rate values) based evaluation, single classifiers (dual L2-regularized L1-loss support vector classification and dual L2-regularized logistic regression) dominated. Note that although such results might indicate optimal classifiers, the optima are not necessarily among the visualized points, as the direct search was performed in a discrete and limited search subspace; thus this representation is very approximate. To compare the proposed approach with similar classification techniques, similar SVM (particularly LibSVM C-SVC) and LS-SVM classifiers were also developed by performing heuristic parameter selection on their kernel functions using the previously described simulated annealing and particle swarm optimization. C-SVC was run with RBF (further referred to as LibSVM_RBF) and sigmoid (LibSVM_Sigmoid) kernel functions (the polynomial kernel was not selected because of its relatively large parameter space and slow performance), whereas the LS-SVM classifiers were based on polynomial (LS-SVM_Poly) and RBF (LS-SVM_RBF) kernels. A 7:3 dataset split (i.e., 70% of the data was selected for classifier training and the optimization procedure, the remaining 30% was used for testing), widely used in such research, was selected.
The SA procedure was run using an exponential temperature schedule (the temperatureexp function) and the simulannealbnd function for 180 iterations, whereas the PSO implementation was applied with default parameters. A LIBLINEAR and PSO based classifier with an approach similar to PSO-LinSVM (used in [36]) was also tested; the main difference lies in its design, as it is based on a real-valued PSO implementation instead of the hybrid one proposed in this paper.
A similar approach was also applied for LIBLINEAR classifier selection using SA; the default real-valued MATLAB SA implementation was used.
Tables 3 and 4 present classification results for the different SVM based classifiers, represented as error rate and true positive rates for each class; core parameters obtained during parameter selection, such as the selected classifier index (in the cases of LIBLINEAR based classifiers and PSO-LinSVM) and the C parameter (for all classifiers), are also given. Table 4 presents the experimental results obtained with the Australian dataset. Again, linear SVM classifiers outperformed the other SVM classifiers. LS-SVM_RBF showed similar results and outperformed the other classifiers in balanced CV based optimization. PSO-LinSVM with the L2-regularized L2-loss support vector classification (dual) classifier again proved to be the best choice; however, direct search resulted in the highest performance. Note that in both experiments PSO-LinSVM obtained the same classifiers for both accuracy-based and balanced CV based fitness evaluation. The second case also proved to be a reasonable choice in general: both Tables 3 and 4 show that application of this approach often resulted in increased accuracy compared to accuracy-based optimization; this is especially visible for the classifiers developed using the simulated annealing approach.
This approach identified parameter sets for classifiers better than or equal to the accuracy-based evaluation approach in almost all cases for the Australian dataset, except LS-SVM_Poly (notably, this classifier also did not perform well with the PSO optimization technique on this dataset).

Conclusions and future works
Support vector machines are powerful nonparametric techniques which can perform efficient classification and obtain results comparable to neural networks. This article presents a new particle swarm optimization and linear SVM based approach which can be applied to both small-scale and large-scale classification tasks. The technique uses a particle swarm optimization based heuristic to select the best-performing SVM classifier from a set of linear classifiers sharing the same parameters. Comparison with iterative parameter search showed that it is able to obtain an SVM configuration resulting in better classification performance in terms of both accuracy and identification of each class. An approach for classifier evaluation based on the sum of true positive ratios is proposed together with the algorithm; it is more suitable for imbalanced learning, as it tries to maximize per-class classification performance instead of the often applied accuracy. Empirical results showed that it can produce results similar to the accuracy-based parameter selection approach. It is also shown that the approach can be a competitively efficient solution for classification problems compared to similar SVM based techniques. Future work will concern a more detailed investigation of PSO-LinSVM parameters; improvements in the PSO algorithm, such as topologies or the application of particle clusters [19]; possible effects of sequential ordering or initialization; and other possible enhancements.

• cl ← {i | cl_min ≤ i ≤ cl_max; cl_min, i, cl_max ∈ Z} - the set of classifiers, represented by inner encodings.
• rangeC = [C_min; C_max] - the range of cost parameters considered (note that C > 0).
• rangeBias = [b_min; b_max] - the range of B (bias term) parameters considered in optimization.

Table 2. Main characteristics of the German dataset.