Simulation study of a tree similarity measure based on small subtree counts

Augsten, Böhlen and Gamper [1] suggested a measure of similarity between ordered labeled trees based on subtree counts: two trees are declared close if they contain similar numbers of copies of ordered labeled subtrees of a given form, called pq-grams. We report the results of a simulation study of the statistical properties of distances based on pq-grams.


Introduction
Augsten, Böhlen and Gamper [1] suggested a measure of similarity between (ordered and labeled) trees based on subtree counts. Roughly speaking, two trees are declared close if they contain similar numbers of copies of (ordered and labeled) subtrees of a given form, called pq-grams. Augsten, Böhlen and Gamper [1] used distances based on pq-grams to define approximate matching of hierarchical data.
Here we are interested in the statistical properties of distances based on pq-grams. For this purpose we consider several parametric families of Galton-Watson random trees. Computer simulation results show that the pq-gram distance effectively discriminates between two Galton-Watson trees generated with different values of the parameter.

2.1.
A rooted unlabeled tree on $p + q$ vertices is called a pq-gram if for every $0 \le j \le p - 1$ there is exactly one vertex at distance $j$ from the root and the vertex at distance $p - 1$ from the root is adjacent to $q$ leaves (vertices of degree 1, at distance $p$ from the root), see Augsten, Böhlen and Gamper [1]. That is, to obtain the pq-gram we attach the center of the star $K_{1,q}$ to one endpoint of the path on $p$ vertices. The other endpoint of the path is called the root of the pq-gram. The pq-gram is denoted $T^{p,q}$.
Using letters from an alphabet of size $k$ we obtain $n = k^{p+q}$ labeled ordered pq-grams $T^{p,q}_1, \dots, T^{p,q}_n$. Given an ordered labeled tree $T$ with labels from this alphabet, we associate with it the vector $N(T) = (N_1, \dots, N_n)$ that counts the copies of $T^{p,q}_i$, $1 \le i \le n$, contained in $T$. More precisely, $N_i$ denotes the number of exact matchings of $T^{p,q}_i$ in $T$ (order is important). One would expect two ordered labeled trees $T_1$, $T_2$ to be similar if the corresponding subtree count vectors $N(T_1)$ and $N(T_2)$ are close.
Augsten, Böhlen and Gamper [1] define the pq-gram distance as follows. Let $T$ be an ordered labeled tree with labels from an alphabet of size $k$. Introduce an extra letter $*$ and extend the alphabet by adding it. The pq-extended tree $T^*$ is constructed from $T$ by adding $p - 1$ ancestors to the root node, inserting $q - 1$ children before the first and after the last child of each non-leaf node, and adding $q$ children to each leaf of $T$. All newly inserted nodes receive the label $*$. Let the subtree count vector $N^*(T^*) = (N^*_1(T^*), \dots, N^*_m(T^*))$ be defined as above, but for the extended alphabet and the extended tree $T^*$. In particular, we have $m = (k + 1)^{p+q}$. The pq-gram distance between two ordered labeled trees $T_1$ and $T_2$ is then defined in terms of the count vectors $N^*(T_1^*)$ and $N^*(T_2^*)$, see [1].
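The construction above can be illustrated with a minimal Python sketch. It assumes the distance of [1] in its bag form, $1 - 2\,|P_1 \cap P_2| / |P_1 \uplus P_2|$, where $P_i$ is the multiset of labeled pq-grams of the extended tree $T_i^*$. The tree encoding `(label, [children])` and the function names `pq_profile` and `pq_distance` are hypothetical choices made for this illustration only. The extended tree is never built explicitly: padding the stem and the child list with `*` is assumed to yield the same pq-grams.

```python
from collections import Counter

STAR = "*"  # the extra letter used for padding nodes


def pq_profile(tree, p=2, q=3):
    """Count labeled pq-grams of the pq-extended tree.

    A tree is a pair (label, [children]); this encoding is an assumption.
    Each pq-gram is stored as a tuple: p stem labels followed by q base labels.
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        # Stem: the node itself preceded by its p - 1 nearest ancestors.
        stem = (ancestors + (label,))[-p:]
        if not children:
            # A leaf receives q star children, giving exactly one pq-gram.
            profile[stem + (STAR,) * q] += 1
            return
        # q - 1 stars before the first and after the last child.
        pad = (STAR,) * (q - 1)
        row = pad + tuple(child[0] for child in children) + pad
        # Slide a window of q consecutive children along the padded row.
        for i in range(len(row) - q + 1):
            profile[stem + row[i:i + q]] += 1
        for child in children:
            visit(child, stem)

    # The root gets p - 1 star ancestors.
    visit(tree, (STAR,) * (p - 1))
    return profile


def pq_distance(t1, t2, p=2, q=3):
    """Bag-based pq-gram distance, assumed to follow [1]."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((a & b).values())  # multiset intersection: coordinatewise minimum
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total
```

For instance, with `t1 = ("a", [("a", []), ("b", [])])` and `t2 = ("a", [("a", []), ("a", [])])`, each profile contains six 23-grams, two of which are shared, so `pq_distance(t1, t2)` equals 1 - 4/12 = 2/3.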

2.2.
One would expect a graph similarity measure based on small subgraph counts to discriminate between graphs generated using different probabilistic models. The Galton-Watson tree is a convenient probabilistic model for testing the statistical properties of the similarity measure based on pq-gram counts. Below we report the results of a simulation study. Given a tree $T$, we denote by $T_k$ the subtree induced by the vertices at distance at most $k$ from the root.
In Examples 1-3 the alphabet is $\{a\}$. Therefore, the extended alphabet is $\{a, *\}$. Every node of $T_k$ is labeled with the letter $a$, while some nodes of its extended version also receive the label $*$.
In Examples 4-6 the alphabet is $\{a, b\}$. Therefore, the extended alphabet is $\{a, b, *\}$. We generate a Galton-Watson tree $T$ with two types of offspring, $a$ and $b$. Given a vertex of type $a$ (respectively $b$), let $X_{aa}$ and $X_{ab}$ (respectively $X_{ba}$ and $X_{bb}$) denote the numbers of its offspring of types $a$ and $b$. The random variables $X_{aa}$, $X_{ab}$, $X_{ba}$, $X_{bb}$ are independent and have Poisson distributions with mean values $\lambda_{aa} = 5p(a|a)$, $\lambda_{ab} = 5p(b|a)$, $\lambda_{ba} = 5p(a|b)$, $\lambda_{bb} = 5p(b|b)$. We denote $T = T(p)$, where $p = (p(a|a), p(b|a), p(a|b), p(b|b))$. The root of $T$ chooses its label ($a$ or $b$) at random, with equal probabilities.
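The two-type model of Examples 4-6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the names `poisson` and `gw_tree` and the tree encoding `(label, [children])` are hypothetical, and the left-to-right order of the children is not specified in the text, so the sketch shuffles them uniformly at random.

```python
import math
import random


def poisson(lam, rng):
    """Sample a Poisson(lam) variate by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def gw_tree(probs, depth, rng, scale=5.0):
    """Generate T_depth for the two-type Galton-Watson tree T(p).

    probs = (p(a|a), p(b|a), p(a|b), p(b|b)), in the order used in the text.
    A tree is returned as (label, [children]); this encoding is an assumption.
    """
    p_aa, p_ba, p_ab, p_bb = probs
    # Mean numbers of a-children and b-children for each parent type.
    means = {"a": (scale * p_aa, scale * p_ba),
             "b": (scale * p_ab, scale * p_bb)}

    def grow(label, level):
        if level == depth:
            return (label, [])  # vertices beyond distance `depth` are cut off
        lam_a, lam_b = means[label]
        kids = ["a"] * poisson(lam_a, rng) + ["b"] * poisson(lam_b, rng)
        rng.shuffle(kids)  # assumption: sibling order is uniformly random
        return (label, [grow(kid, level + 1) for kid in kids])

    # The root label is a or b with equal probabilities.
    return grow(rng.choice("ab"), 0)
```

A call such as `gw_tree((0.3, 0.7, 0.6, 0.4), 3, random.Random(0))` returns one random tree truncated at distance 3 from the root. With depth 7 and about five children per vertex a tree has on the order of $5^7$ vertices, so repeated sampling is noticeably slower.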
Example 7. We study the distribution of the values of the 23-gram distance between the trees $T_7(p)$ and $T_7(p')$ defined in Example 1. Fig. 1 shows the histogram of this distribution in the case where $p = 0.3$ and $p' = 0.4$. Fig. 2 shows the histogram in the case where $p = p' = 0.3$. Each histogram is based on 10000 independently generated values of the distance.

Conclusions
In each of Tables 1-6 the minimum of every row is attained at the diagonal entry of the table. We conclude that the 23-gram distance effectively discriminates between different values of the parameter. It would be interesting to study possible asymptotic distributions of the pq-gram distances between $T_k(p)$ and $T_k(p')$, as well as of the corresponding subtree count vectors $N^*$. Empirical evidence based on a small simulation study (Example 7) suggests that the distance is asymptotically normal for $p \ne p'$, and is distributed as the absolute value of a normal random variable for $p = p'$. This is not surprising, as one would expect the number $N$ of subtrees of bounded size to obey the central limit theorem.
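A short identity may help to see why an absolute value appears. It assumes the bag form of the distance from [1], written here in terms of the count vectors with the shorthand $a_i = N^*_i(T_1^*)$ and $b_i = N^*_i(T_2^*)$ (this notation is introduced only for this remark). Since $a + b - 2\min(a, b) = |a - b|$ for non-negative $a$ and $b$:

```latex
1 - \frac{2 \sum_{i=1}^{m} \min(a_i, b_i)}{\sum_{i=1}^{m} (a_i + b_i)}
  = \frac{\sum_{i=1}^{m} \bigl( a_i + b_i - 2 \min(a_i, b_i) \bigr)}{\sum_{i=1}^{m} (a_i + b_i)}
  = \frac{\sum_{i=1}^{m} |a_i - b_i|}{\sum_{i=1}^{m} (a_i + b_i)} .
```

Under this assumption the distance is a normalized sum of absolute differences of subtree counts. When the two trees are generated with the same parameter, each difference $a_i - b_i$ has mean zero, which is at least consistent with a limit supported on the non-negative half-line.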