Geodesic distances in the intrinsic dimensionality estimation using packing numbers

Abstract. Dimensionality reduction is a very important tool in data mining. An intrinsic dimensionality of a data set is a key parameter in many dimensionality reduction algorithms. When the intrinsic dimensionality of a data set is known, it is possible to reduce the dimensionality of the data without losing much information. To this end, it is reasonable to find out the intrinsic dimensionality of the data. In this paper, one of the global estimators of intrinsic dimensionality, the packing numbers estimator (PNE), is explored experimentally. We propose the modification of the PNE method that uses geodesic distances in order to improve the estimates of the intrinsic dimensionality by the PNE method.


Introduction
In the real applications, we confront with data that are of a very high dimensionality.For example, in the image analysis each image is described by a large number of pixels of different colour.Another application, that produces data of a very high dimensionality, is analysis of DNA microarray data [1].The analysis of high-dimensional data is usually challenging.The dimensionality reduction or visualization methods are recent techniques to discover knowledge hidden in multidimensional data sets [2].Although data are considered in a high-dimensional space, in fact they are often either points of a nonlinear manifold of some lower dimensionality or the points close to that manifold.Thus, one of the major problems is to find the exact dimensionality of the manifold.Afterwards, it is reasonable to transfer the data points that lie on or near to this manifold into the space, the dimensionality of which is coincident with the manifold dimensionality.As a result, the dimensionality of the data set will be reduced to that of a manifold.Therefore, the problem is to disclose the manifold dimensionality, i.e. the intrinsic dimensionality (ID) of the analysed data.
The intrinsic dimensionality of a data set is usually defined as the minimal number of parameters or latent variables necessary to describe the data [3].Latent variables are c Vilnius University, 2014 still often called as degrees of freedom of a data set [3,4].Let the dimensionality of the analysed data be n.High-dimensional data sets can have meaningful low-dimensional structures hidden in the observation space, i.e. the data are of a low intrinsic dimensionality d (d n).
Recently, a lot of manifold learning methods have been proposed to solve the problem of nonlinear dimensionality reduction.Important manifold learning algorithms include isometric feature mapping (ISOMAP) [4], locally linear embedding (LLE) [5,6], Laplacian eigenmaps (LE) [7], Hessian LLE [8] etc.They all assume data to distribute on an intrinsically low-dimensional manifold and reduce the dimensionality of data by investigating the intrinsic structure of data.However, all manifold learning algorithms require the intrinsic dimensionality of the data as a key parameter for implementation.In recent years, ISOMAP and LLE have drawn great interests.They avoid nonlinear optimization and are simple to implement.However, both ISOMAP and LLE methods need the precise information of both the input parameters k for the neighbourhood identification and the intrinsic dimensionality d of the data set.The ways to select the value of the parameter k are proposed and investigated in [9][10][11][12].If the intrinsic dimensionality d is set larger than what it really is, much redundant information will also be preserved; if it is set smaller, useful information of the data could be lost during the dimensionality reduction [13].
Due to the increased interest in dimensionality reduction and manifold learning, a lot of techniques have been proposed in order to estimate the intrinsic dimensionality of a data set [14][15][16][17][18][19] etc. Techniques for intrinsic dimensionality estimation can be divided into two main groups [20]: (1) estimators based on the analysis of local properties of the data (the correlation dimension estimator [21], the nearest neighbour dimension estimator [22][23][24], and the maximum likelihood estimator (MLE) [19]), and (2) estimators based on the analysis of global properties of the data (the eigenvalue-based estimator [23,25], the packing numbers estimator (PNE) [18], and the geodesic minimum spanning tree estimator [17]).Local intrinsic dimensionality estimators are based on the idea that the number of data points, that are covered by a hypersphere of some radius r around a given data point, grows in proportion to rd , where d is the intrinsic dimensionality of the data manifold around that point.As a result, the intrinsic dimensionality d can be estimated by measuring the number of data points, covered by a hypersphere with a growing radius r.While the local estimators of the intrinsic dimensionality compute the average over the local estimates of intrinsic dimensionality, the global estimators consider the data as a whole when estimating the intrinsic dimensionality.
The packing number estimator (PNE) [18] has drawn great interests in the literature.Huge number of references may be found.However, to our knowledge, no further development of this estimator is done.Our research improves the quality of the estimates using packing numbers.
In this paper, we propose the modification of the PNE method that uses geodesic distances.We compare the application of both Euclidean and geodesic distances between data points in the PNE algorithm.The application of the geodesic distances in the intrinsic dimensionality estimation using packing numbers is grounded via the large experimental investigation.
Packing numbers estimator [18] is one of the global estimators of the intrinsic dimensionality.
At first, several definitions are given for clarity.
1. Given a metric space R n with the distance metrics d(•, •), a set X ⊂ R n is said to be r-separated, if d(X i , X j ) r for all distinct X i , X j ∈ X.In [18], r is called as a resolution.
2. The r-packing number M (r) of a set X ⊂ R n is defined as the maximum cardinality of an r-separated subset of X [18].
In [18], the intrinsic dimensionality of a set X has been suggested to be found by evaluating the limit For a finite set, the zero limit cannot be achieved.Therefore, Kégl [18] suggests to redefine the intrinsic dimensionality in a scale-dependent manner.Let the analysed data set X consists of m n-dimensional points Then the intrinsic dimensionality of the finite data set X is estimated by the formula In order to find the r-packing number M (r) for a finite data set X = {X 1 , . . ., X m }, Kégl [18] applies the following approximation algorithm.Given a data set X, the algorithm starts with an empty set of centers C, and in an iteration over X it adds to C the data points that are at a distance of at least r from all the centers in C (lines 4 to 8 in Algorithm 1).
Algorithm 1.The PNE algorithm for the estimate d of the intrinsic dimensionality of the data set X [18].
11: if l > 10 and 1.65 The estimate M (r) is the cardinality of C after each point in X has been visited (line 9).According to Kégl, a good estimate for d can be obtained by using M (r) instead of M (r).Because of the dependence of M (r) on the order of the data points at which they are visited, the variance of M (r) can distort the dimension estimate d.To eliminate this variance, the procedure is repeated several times with random permutations of the data (lines 1, 2), and the estimate d is computed by using the average µ(•) of logarithms of the packing numbers (line 10).The number of repetitions depends on r 1 , r 2 , and a preset parameter ε that determines the accuracy of the final estimate (set to 99% in all experiments).The complete algorithm is given formally in Algorithm 1.

Data sets
The following data sets were used in the experiments: • 1000 three-dimensional data points (m = 1000, n = 3) that lie on a nonlinear two-dimensional S-shaped manifold (Fig. 1a).
• 1801 three-dimensional data points (m = 1801, n = 3) that lie on a nonlinear one-dimensional manifold, i.e. a spiral (Fig. 1d).The components (x 1 , x 2 , x 3 ) of these data are calculated by the parametric equations below: x 2 = 100 sin(t), • A data set of uncoloured (greyscale) images of a person's face [4] (samples of images are shown in Fig. 1f).The data consist of many photos of the same person face observed in different poses (left-and-right pose, up-and-down pose) and lighting conditions, in no particular order.The number of photos (data points) is m = 698.The images have 64 × 64 greyscale pixels, therefore the dimensionality of points that characterize each photo in a multidimensional space is n = 4096.Source database: http://isomap.stanford.edu/datasets.html.
4 Experimental exploration of the packing numbers estimator (PNE) In this section, the PNE method is investigated experimentally with various artificial and real data sets.The data points lie on manifolds, the dimensionality of which is known in advance.Therefore, we will be able to establish precisely, whether the estimate of the intrinsic data dimensionality, obtained by the PNE, is true.The aim of these investigations is to find out, which distances (Euclidean or geodesic) are better to be used in the PNE algorithm, while estimating the similarity between data points.The geodesic distance is the length of the shortest path between two points along the surface of a manifold.Here the Euclidean distances are used when calculating the length of the shortest path.In order to compute the geodesic distances between the points X 1 , X 2 , . . ., X m , it is necessary to set some number of the nearest points (neighbours) of each point X i .The search for the neighbours of each point X i can be organized in two ways: (1) by the fixed number k geod of the nearest points from X i , (2) by all the points within some fixed radius of a hypershere, the center of which is the point X i .When the neighbours are defined, a weighted graph over the points is constructed: each point X i is connected with its neighbours; the weights of edges are Euclidean distances between the point X i and its neighbours.Using one of the algorithms for the shortest path distance in the graph, the shortest path lengths between the pairs of all points are computed.These lengths are estimates of the geodesic distances between the points.
In [18], the dimensionality estimate d is measured on consecutive pairs of a sequence r 1 , . . ., r s of resolutions, and the estimate is plotted halfway between the two parameters, i.e. d(r i , r i+1 ) is plotted at (r i + r i+1 )/2.In our paper, the values of r 1 and r 2 (r 1 < r 2 ) are chosen in another way: the minimum distance and the maximum distance between all the data points are found; then all the cases of r 1 and r 2 from the minimum distance to the maximum one with a step equal to (maximum distance-minimum distance)/10 are fixed.In this case, the total number of combinations of r 1 and r 2 (r 1 < r 2 ) is equal to 55 (11 × (11 − 1)/2 = 55).The PNE algorithm is explored by evaluating two types of distances: Euclidean and geodesic.In both cases, the PNE values d, defined by formula (2), are calculated with all the combinations of r 1 and r 2 .In such a way, dependences of the estimate of intrinsic dimensionality of the data on the parameters r 1 and r 2 are obtained.As the estimate of the intrinsic dimensionality, we choose such a PNE value d that remains the same in the largest area of the values of parameters r 1 and r 2 .Further this area will be called by a dominant area and the value of d in this area will be called by a dominant value.Let the numbers of combinations of r 1 and r 2 , that fall in the dominant area in the case of Euclidean and geodesic distances, be denoted as v e and v g , respectively.
The first investigation is performed with the points of the two-dimensional S-shaped manifold (m = 1000, n = 3), see Fig. 1a.We can see from Fig. 2 that the dominant value is equal to the true intrinsic dimensionality, i.e. d = 2, when both Euclidean and geodesic distances are used.However, the dominant area is a bit larger when using the geodesic distances in the PNE algorithm: v e = 27 and v g = 31.The next investigation is performed with the points of the two-dimensional 8-shaped manifold (m = 1000, n = 3), see Fig. 1b.At first sight, it is difficult to say which area is dominant when the Euclidean distances are used (Fig. 3a).However, v e = 24 as d = 2 and v e = 19 as  After investigating a helix (m = 1000, n = 3, Fig. 1c) that is the one-dimensional manifold, it became clear that it is reasonable to evaluate geodesic distances instead of Euclidean ones in the PNE algorithm in order to get the true estimates of the intrinsic dimensionality d of the data.Figure 4  Next, a spiral (Fig. 1d) which is the one-dimensional manifold (m = 1801, n = 3) was investigated.In this case, the difference between the use of the Euclidean and geodesic distances in the PNE algorithm is not considerable (Fig. 5).In both cases, the dominant value of the estimate of the intrinsic dimensionality is d = 1 which is coincident with the true intrinsic dimensionality of these data, i.e. d = d = 1.However, the dominant area is a bit larger when the geodesic distances are used, i.e. v g = 54 and v e = 49.
One of the practical fields, where high-dimensional data appear, is the analysis of images.Let us have a set of images of some moving object.There are many investigations of such sets [4,27,28] etc.Each image is described by the number of pixels of different colour.The dimensionality of such a data set is equal to the number of pixels in the  greyscale case or even it is three times larger than the number of pixels in the coloured case.So, the dimensionality of these data is very large.Since the intrinsic dimensionality of a data set is defined as the minimal number of latent variables or features, necessary to describe the data [3], we hypothesize that there are latent variables or features that characterize the motion of the object in the images and their number is the same as the number of degrees of freedom of a possible motion of the object.Therefore, the minimal possible intrinsic dimensionality of a data set of images should be equal to the number of degrees of freedom of a possible motion of the object.The high-dimensional data, obtained from the set of images (greyscale pictures of a rotated duckling and photos of the same person face observed in different poses), are investigated.Since a duckling was gradually rotated at a certain angle on the same plane, i.e. without turning the object itself, these data have only one degree of freedom, i.e. the minimal intrinsic dimensionality of these data may be equal even to 1.The person's face analysed in [4] has two directions of motion (two poses): left-and-right pose and up-anddown pose.Therefore, the high-dimensional data, corresponding to these pictures, have two degrees of freedom, i.e. the minimal possible intrinsic dimensionality of these data should be equal to 2.
The results of investigation with the high-dimensional data points, corresponding to real pictures of a rotated duckling (m = 72, n = 16384, Fig. 1e), are given in Fig. 6.Taking into consideration that the duckling in the pictures has only one degree of freedom, an assumption can be made that the intrinsic dimensionality of these data is 1.In this case,     The last investigation is performed with the high-dimensional data points (m = 698, n = 4096), corresponding to the photos of the face of the same person observed in different poses from different lighting direction (Fig. 1f).We can see the results in Fig. 7a.If the Euclidean distances are used, the dominant value is d = 4, v e = 27.However, d = 2 dominates (v g = 33) if the geodesic distances are used in the PNE method.Referring to the hypothesis that the minimal intrinsic dimensionality of these data is 2, we see the advantage of the geodesic distance again.
The intrinsic dimensionality of the face data set [4] is analysed in several papers.It is shown in [4] that the intrinsic dimensionality of this data set is 3. Levina and Bickel [19] state that the estimated dimensionality of about 4 is very reasonable.In [27], the estimated dimensionality is equal to 2, when geodesic distances are used in the MLE algorithm, and it is equal to 4 or 5, when Euclidean distances are used in MLE.The question arises which estimated dimensionality can be taken as the true intrinsic dimensionality?
In order to answer this question and verify whether our hypothesis on the minimal intrinsic dimensionality is true in this case, let us analyse the face database [4] in detail.At first, the 4096-dimensional data points are projected on the 5-dimensional space by the ISOMAP method [4].ISOMAP is used in the investigation, because recently it is one of the most popular manifold learning methods.So we get a matrix of size [698 x 5].The rows of this matrix correspond to the objects Y 1 , Y 2 , . . ., Y m , m = 698, and the columns correspond to the features y 1 , y 2 , . . ., y n * , n * = 5, that characterize the objects.Then the covariance matrix C of the features is obtained: It is obvious from this covariance matrix that all the 5 features y k and y l are not correlated, because their covariance coefficient is equal to zero: c kl = c lk = 0, k = l.The covariance coefficient c kk , k = 1, n * , is the variance of features y k .So, we see from (3) that the variances of the first three features are much larger than others.The variances of the fourth and fifth features are the least ones and differ very little from each other.It means that three features are the main ones.A question arises which features from y 1 , y 2 , y 3 correspond to the left-and-right pose, the up-and-down pose, and to the lighting direction.In order to answer this question, we visualized these features pairwise on the plane (see Figs.  Since the face database consists of images of an artificial face under three changing conditions: vertical and horizontal orientation and illumination (lighting direction), it is possible to assume that the intrinsic dimensionality of this data set should be 3.However, the person's face has two directions of motion (two poses): left-and-right pose and upand-down pose.So the minimal intrinsic dimensionality of these data can be assigned to 2, if we assume that the intrinsic dimensionality of the data set of images were equal to the number of degrees of freedom of a possible motion of the object in the image.
So, after such a discussion, we dare say, that in the last investigation, a false result is obtained by PNE, if the Euclidean distances are used.However, the minimal intrinsic dimensionality of these data is evaluated well, if the geodesic distances between the data points are calculated.

Conclusions
Real-life data are often hardly understandable because of their high dimensionality.While analysing these data, we often have to reduce their dimensionality so that to preserve as much information on the analysed data set as possible.To this end, it is reasonable to find out the intrinsic dimensionality of the data.Several methods for estimating the intrinsic dimensionality are proposed in the literature.
In this paper, we have analysed one of the global estimators for intrinsic dimensionality -the packing numbers estimator (PNE).We have shown that, in order to get true estimates by PNE, it is necessary to evaluate the geodesic distances between data points in this algorithm.If the Euclidean distances are used in PNE, we can get false estimates of the intrinsic dimensionality.In this paper, in the experiments, no false results are obtained in the PNE modification that uses the geodesic distances.
The efficiency of PNE is disclosed in the images analysis.The experiments with the sets of images of the moving object showed that there are latent variables or features that characterize the motion of the object in the images and their number is the same as the number of degrees of freedom of a possible motion of the object.It is shown in this paper that the minimal possible intrinsic dimensionality of a data set of images is equal to the number of degrees of freedom of a possible motion of the object.

Fig. 3 .d = 1 .
Fig. 3. Estimate d of the intrinsic dimensionality depending on distances: (a) Euclidean, (b) geodesic.Data set: the 8-shaped manifold.d = 1.It leads to the conclusion that the dominant value is d = 2.In the case of geodesic distances, it is obvious that the dominant value is d = 2 (v g = 33).The investigations performed with the two-dimensional S-shaped and 8-shaped manifolds show that the dominant values of the estimate of the intrinsic dimensionality are coincident with the true intrinsic dimensionality of these data in the case of both distances, i.e. d = d = 2.After investigating a helix (m = 1000, n = 3, Fig.1c) that is the one-dimensional manifold, it became clear that it is reasonable to evaluate geodesic distances instead of Euclidean ones in the PNE algorithm in order to get the true estimates of the intrinsic dimensionality d of the data.Figure4indicates that the dominant value is d = 2 = d in the case of Euclidean distances (v e = 25), and the dominant value is d = 1 = d, in the case of geodesic distances (v g = 49).Next, a spiral (Fig.1d) which is the one-dimensional manifold (m = 1801, n = 3) was investigated.In this case, the difference between the use of the Euclidean and geodesic distances in the PNE algorithm is not considerable (Fig.5).In both cases, the dominant value of the estimate of the intrinsic dimensionality is d = 1 which is coincident with the true intrinsic dimensionality of these data, i.e. d = d = 1.However, the dominant area is a bit larger when the geodesic distances are used, i.e. v g = 54 and v e = 49.One of the practical fields, where high-dimensional data appear, is the analysis of images.Let us have a set of images of some moving object.There are many investigations of such sets[4,27,28] etc.Each image is described by the number of pixels of different colour.The dimensionality of such a data set is equal to the number of pixels in the indicates that the dominant value is d = 2 = d in the case of Euclidean distances (v e = 25), and the dominant value is d = 1 = d, in the case of geodesic distances (v g = 49).

Fig. 7 .
Fig. 7. Estimate d of the intrinsic dimensionality depending on distances: (a) Euclidean, (b) geodesic.Data set: photos of a face.
[8][9][10].show that the feature y 1 corresponds to the left-and-right pose, the feature y 2 corresponds to the up-and-down pose, and the feature y 3 corresponds to the lighting direction.Summarizing everything, it is obvious that the Left−and−right pose Up−and−down pose

Fig. 8 .
Fig. 8. Projections of the high-dimensional data points corresponding to the photos of a face on a plane: leftand-right pose, up-and-down pose.