# Morphism-Based Learning for Structured Data

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Kilho Shin (Gakushuin University, Japan) and David Lawrence Shepard (UCLA Scholarly Innovation Lab, USA)
yoshihiro.shin@gakushuin.ac.jp, shepard.david@gmail.com

Abstract

In mathematics, a morphism is a structure-preserving mapping between mathematical structures of the same type: linear transformations for linear spaces, homomorphisms for algebraic structures, and continuous functions for topological spaces are examples. Much of the data studied in machine learning, on the other hand, carries mathematical structure. Strings are totally ordered sets, and trees can be understood not only as graphs but also as partially ordered sets with respect to the ancestor-to-descendant order and as semigroups with respect to the binary operation that returns the nearest common ancestor. In this paper, we propose a generic, theoretical framework for investigating the similarity of structured data through structure-preserving one-to-one partial mappings, which we call morphisms. Through morphisms, useful and important methods studied in the literature can be abstracted into common concepts, even though they have been studied separately. When we study new structures of data, we can extend these legacy methods to the new structure, provided we can define morphisms properly. This view also reveals hidden relations between methods known in the literature and lets us understand them more clearly. For example, we show that the center star algorithm, which was originally developed to compute sequential multiple alignments, can be abstracted so that it not only applies to data structures other than strings but also solves pattern extraction problems. The methods we study in this paper include edit distances, multiple alignments, pattern extraction and kernels, but many more methods can surely be abstracted within our framework.

1 Introduction

Similarity of data is the most fundamental concept in machine learning. For example, clustering groups data so that the members of each group are similar to one another, and classification predicts the unknown classes of data based on their similarity to known data. Thus, quantitative evaluation of the similarity of data is an imperative step in machine learning algorithms. In fact, various methods have been proposed in the literature to quantify similarity. For example, a kernel is a similarity measure, that is, the greater, the more similar (e.g., (Lodhi et al. 2001; Collins and Duffy 2001)), while an edit distance is a dissimilarity measure, that is, the smaller, the more similar (e.g., (Levenshtein 1966; Tai 1979)).

In this paper, we propose a novel abstract approach to quantifying the similarity of discrete, structured data. The key finding that drove us was the fact that several machine learning methods developed for data with different structures can sometimes be redefined in a common manner using the concept of morphisms, which abstracts away the differences in structure. In mathematics, a morphism is a structure-preserving mapping between mathematical structures of the same type; linear transformations between linear spaces are a good example.
In our approach, we view a datum as a set of elements equipped with some structure (relation) among the elements, and we define morphisms as one-to-one partial set mappings that preserve this structure. On top of the concept of morphisms, we define methods to investigate the similarity of data in an abstract manner. The methods from the literature that we study in this paper encompass edit distances (Levenshtein 1966; Tai 1979), sequential multiple alignments (Gusfield 1993), pattern extraction (Kao et al. 2007) and kernels (Haussler 1999; Collins and Duffy 2001; Lodhi et al. 2001; Kashima and Koyanagi 2002; Leslie et al. 2004), but the range where our framework is effective is not, in principle, limited to them.

The advantages of our theory include the following. (1) It bridges gaps among important machine learning methodologies that exist due to differences in the data structures under study; as a consequence, a method that has proven effective for one type of structure can be converted into a new method for other types of structures. For example, we unify many different definitions of edit distances for strings, trees and graphs into a single common definition. (2) Abstraction by morphisms can make a methodology developed to solve a particular problem applicable to problems of different types. For example, we show that the center star algorithm, which was developed to study sequential multiple alignments, can be used to extract structural patterns from data with various structures. (3) It provides a generic framework for engineering novel methods in a uniform manner. If we can mathematically define the structure of data in the form of morphisms, the abstract methods constructed on top of morphisms automatically apply.

2 A morphism-based framework

2.1 Mathematical notations

In this paper, we use the following notations. $S$ and $T$ denote arbitrary sets. A partial mapping $\mu$ from $S$ to $T$ is a subset of $S \times T$ such that, if $(s, t)$ and $(s, t')$ are both in $\mu$, then $t = t'$. A one-to-one partial mapping $\mu$ further satisfies that, if $(s, t)$ and $(s', t)$ are both in $\mu$, then $s = s'$. The domain of $\mu$ is $\mathrm{Dom}(\mu) = \{s \mid \exists t \in T\,[(s, t) \in \mu]\}$, and the range of $\mu$ is $\mathrm{Ran}(\mu) = \{t \mid \exists s \in S\,[(s, t) \in \mu]\}$. For two partial mappings $\mu \subseteq S \times U$ and $\nu \subseteq U \times T$, their composition is the partial mapping $\nu \circ \mu = \{(s, t) \mid \exists u \in U\,[(s, u) \in \mu,\ (u, t) \in \nu]\} \subseteq S \times T$. The cardinality of $\mu$ is denoted by $|\mu|$. Kronecker's delta function $\delta_{x,y}$ yields 1 if $x = y$ and 0 otherwise. For a sequence $(a_1, \dots, a_n)$, $[a_1, \hat{a}_i, a_n]$ denotes the subsequence $(a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n)$.

2.2 Introductory illustration

Our framework is based on three premises:

- a datum is a collection of one or more labeled component elements;
- a similarity function to compare labels is given as prior knowledge;
- structures among elements within data are represented by sets of one-to-one structure-preserving partial mappings between data.

We exemplify the concept with strings of alphabetic letters. We view a string as a collection of elements labeled with letters of an alphabet $\Sigma$ through a function $\ell$. If $S = a_1 \dots a_{|S|}$ and $T = b_1 \dots b_{|T|}$ are two strings with $a_i \in \Sigma$ and $b_i \in \Sigma$, we view $S$ and $T$ as collections of labeled elements $\{s_1, \dots, s_{|S|}\}$ and $\{t_1, \dots, t_{|T|}\}$ with $\ell(s_i) = a_i$ and $\ell(t_i) = b_i$. As a similarity function to compare labels, we deploy Kronecker's delta function; that is, the similarity of elements $s$ and $t$ is either 0 or 1, determined by $\delta_{\ell(s),\ell(t)}$.
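To make these notions concrete before we turn to order structure below, here is a minimal sketch, in Python, of partial mappings and the string illustration above. The code and its names (`compose`, `inverse`, `delta`) are our own and are not part of the paper's formal development.

```python
# Illustrative sketch of Sections 2.1-2.2: a one-to-one partial mapping is
# represented as a dict whose values are pairwise distinct.

def compose(mu, nu):
    """The composition nu . mu: apply mu first, then nu where defined."""
    return {s: nu[u] for s, u in mu.items() if u in nu}

def inverse(mu):
    """The inverse of a one-to-one partial mapping."""
    return {t: s for s, t in mu.items()}

def delta(a, b):
    """Kronecker's delta, used as the label similarity function for strings."""
    return 1 if a == b else 0

# A string is viewed as elements 0, ..., |S|-1 labeled by its letters.
S, T = "abcab", "acb"
mu = {0: 0, 2: 1, 4: 2}  # an order-preserving one-to-one partial mapping S -> T
assert all(delta(S[s], T[t]) == 1 for s, t in mu.items())  # all labels agree
assert compose(mu, inverse(mu)) == {0: 0, 2: 2, 4: 4}      # a restriction of id_S
```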
We deal with the sequential order of elements in strings through the entire set of one-to-one order-preserving partial mappings between $S$ and $T$, denoted by $M_{S,T}$:

$$M_{S,T} = \bigl\{ \{(s_{i_1}, t_{j_1}), \dots, (s_{i_n}, t_{j_n})\} \subseteq S \times T \;\big|\; n \geq 1,\ 1 \leq i_1 < \dots < i_n \leq |S|,\ 1 \leq j_1 < \dots < j_n \leq |T| \bigr\}.$$

Although $M_{S,T}$ does not completely determine the sequential structures of $S$ and $T$, it reflects an important part of them. In particular, if $M_{S,T}$ and the order of $S$ are given, the order of $T$ is uniquely determined.

2.3 Component elements, data, and morphisms

We now give a formal description of our framework and introduce the Maximum Similarity Measurement (MSM) problem.

A space of data. A datum in our framework is always a set, and $\mathcal{D}$ denotes the space of data of our interest.

Labels and a label similarity measure. In our framework, the labels associated with elements are used to evaluate the similarity among the elements. The association of elements with labels is determined through a labeling function $\ell_X : X \to L$, where $L$ denotes a common finite alphabet of labels. Furthermore, a label similarity function $\phi : L \times L \to [0, \infty)$ is given as prior knowledge, and therefore $\phi(\ell_X(x), \ell_Y(y))$ can be used as a similarity measurement between elements $x \in X$ and $y \in Y$ for $X, Y \in \mathcal{D}$. The following are the axioms for $\phi$ to satisfy:

- $\phi(\ell_1, \ell_2) \geq 0$ (Non-negativity);
- $\phi(\ell_1, \ell_1) > 0$ (Self-positivity);
- $\phi(\ell_1, \ell_2) = \phi(\ell_2, \ell_1)$ (Symmetry).

Furthermore, there are several desirable properties for $\phi$ to have:

$$\phi(\ell_1, \ell_1) \geq \phi(\ell_1, \ell_2) \quad \text{(Maximality)}; \tag{1}$$

$$\phi(\ell_1, \ell_2)\,\phi(\ell_3, \ell_3) \geq \phi(\ell_1, \ell_3)\,\phi(\ell_3, \ell_2) \quad \text{(Convexity)}; \tag{2}$$

and, for any $n > 0$, $\ell_1, \dots, \ell_n \in L$ and $c_1, \dots, c_n \in \mathbb{R}$,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \phi(\ell_i, \ell_j) \geq 0 \quad \text{(Positive definiteness)}. \tag{3}$$

As explained later in this paper, maximality and convexity are sufficient conditions for morphism-based distances to be pseudo-metrics, while positive definiteness (Berg, Christensen, and Ressel 1984) is a necessary condition for morphism-based moment kernels to be positive definite. Positive definiteness of kernels is known to be necessary to take advantage of the kernel method. When we restate problems known in the literature within our framework, the most common setting of $\phi$ is $\phi(\ell, \ell') = \beta + (\alpha - \beta)\,\delta_{\ell,\ell'}$ with $\alpha > \beta \geq 0$. This $\phi$ indeed satisfies all of the axioms and the properties stated above. In this paper, however, we assume only that $\phi$ has non-negativity, self-positivity and symmetry, and we require the other properties only when they are necessary.

Morphisms. As morphisms between data $X$ and $Y$, we use one-to-one partial mappings from $X$ to $Y$ as sets that also preserve the predetermined structures of the data. Our framework provides various methods to evaluate the similarity between data, and they are all realized through predetermined sets of morphisms. To be specific, each pair of data $(X, Y) \in \mathcal{D} \times \mathcal{D}$ is uniquely associated with a set of morphisms, denoted by $M_{X,Y}$. By allowing morphisms to be partial, we can incorporate local similarities of data into the evaluation of their overall similarity. As axioms of morphisms, we require:

- $\mathrm{id}_X \in M_{X,X}$ (Identity);
- $\mu \in M_{X,Y} \Rightarrow \mu^{-1} \in M_{Y,X}$ (Inverse);
- $\mu \in M_{X,Y},\ \nu \in M_{Y,Z} \Rightarrow \nu \circ \mu \in M_{X,Z}$ (Transitivity).

When we restate problems from the literature within our framework, there are examples where the morphisms are not necessarily transitive. For example, the morphisms associated with the less-constrained tree edit distance are not transitive, and as a result, the triangle inequality does not hold for that distance. Such examples are exceptional, however, and we therefore include transitivity in the axioms for morphisms.
In category theory, transitivity (composability) is one of the axioms of a category, and $\mathcal{D}$ with the morphisms described above forms a category, although not a concrete one, because the morphisms include partial mappings.

2.4 Maximum Similarity Measurement Problem

We define a simple similarity function $\Phi^M_\phi : \mathcal{D} \times \mathcal{D} \to [0, \infty)$ on top of the concepts introduced above. We start by defining $\Phi_\phi : \bigcup_{X,Y \in \mathcal{D}} M_{X,Y} \to \mathbb{R}$ by

$$\Phi_\phi(\mu) = \prod_{(x,y) \in \mu} \phi(x, y), \tag{4}$$

where $\phi(x, y)$ abbreviates $\phi(\ell_X(x), \ell_Y(y))$. To define the simple similarity, we could instead deploy a definition based on a sum of primitive similarity measurements rather than their product: for $\phi^+ : L \times L \to [-\infty, \infty)$, we can define $\Phi^+_{\phi^+}(\mu) = \sum_{(x,y) \in \mu} \phi^+(x, y)$. These definitions are, however, mutually equivalent, because we can convert $\Phi^+_{\phi^+}$ into $\Phi_\phi$ by $\phi(x, y) = e^{\phi^+(x,y)}$ and $\Phi_\phi(\mu) = e^{\Phi^+_{\phi^+}(\mu)}$. In this paper, we define a similarity measure of data as follows, based on Eq. (4).

Maximum Similarity Measurement (MSM). We define the similarity between $X, Y \in \mathcal{D}$ by

$$\Phi^M_\phi(X, Y) = \max\{\Phi_\phi(\mu) \mid \mu \in M_{X,Y}\}. \tag{5}$$

On top of this fundamental similarity measure, we can construct various measures that evaluate the similarity of data from different points of view. For example, the problem of finding the morphisms $\mu$ that attain the MSM maximum may itself be a target of interest, and it abstracts various concrete pattern extraction problems; a formulation of the morphism-based pattern extraction (MPE) problem is introduced in Section 5. Also, to incorporate the influence of $X \setminus \mathrm{Dom}(\mu)$ and $Y \setminus \mathrm{Ran}(\mu)$ into the evaluation of similarity in addition to MSM, we introduce the morphism-based distance (MD) in Section 3, which is an abstraction of the well-known concept of edit distances. The morphism-based moment kernel (MMK), introduced in Section 6, evaluates the distribution of $\Phi_\phi(\mu)$ across the morphisms $\mu \in M_{X,Y}$.

2.5 A 2-MAST problem is an MSM problem

To illustrate, we show that the well-known 2-MAST (Maximum Agreement Subtree) problem (Kao et al. 2007) can be formulated as the problem of finding a morphism $\mu$ that attains $\Phi^M_\phi(X, Y)$ in an MSM problem.

[Figure 1: 2-MAST.]

For this purpose, it is convenient to define rooted trees as algebraic structures: if a semigroup $(X, \cdot)$ satisfies the additional conditions (i) $xy = yx$, (ii) $x^2 = x$ and (iii) $|\{xy, yz, zx\}| \leq 2$, it is a rooted tree, viewing $X$ as a vertex set and $xy$ as the nearest common ancestor of $x$ and $y$ in $X$. The root is computed by $\prod_{x \in X} x$. Furthermore, an agreement subtree $S$ of $X$ is an arbitrary sub-semigroup of $X$. Under these notations, the 2-MAST problem is formulated as the problem of finding two agreement subtrees $S \subseteq X$ and $T \subseteq Y$, maximum in size, such that there is a semigroup isomorphism $\mu : S \to T$ that preserves vertex labels (Fig. 1). We say that such $S$ and $T$ are mutually congruent.

With this definition, we can view the 2-MAST problem as an MSM problem. We let $\mathcal{D}$ be a set of labeled rooted trees and determine $M_{X,Y}$ for $X, Y \in \mathcal{D}$ by

$$M_{X,Y} = \{\mu \mid \mathrm{Dom}(\mu) \text{ and } \mathrm{Ran}(\mu) \text{ are sub-semigroups, and } \mu : \mathrm{Dom}(\mu) \to \mathrm{Ran}(\mu) \text{ is a semigroup isomorphism}\}.$$

For simplicity, we call such a morphism a partial semigroup isomorphism. As a label similarity function $\phi$, on the other hand, we use $\phi(\ell, \ell') = \alpha\,\delta_{\ell,\ell'}$ with $\alpha > 1$. The corresponding MSM problem requires maximizing $\prod_{(x,y) \in \mu} \alpha\,\delta_{\ell_X(x),\ell_Y(y)}$. If $\ell_X|_{\mathrm{Dom}(\mu)} = \ell_Y \circ \mu$ holds, we have $\Phi_\phi(\mu) = \alpha^{|\mathrm{Dom}(\mu)|} = \alpha^{|\mathrm{Ran}(\mu)|}$; otherwise, $\Phi_\phi(\mu) = 0$. Therefore, if $\mu$ maximizes $\Phi_\phi(\mu)$, the pair $(\mathrm{Dom}(\mu), \mathrm{Ran}(\mu))$ determines maximum agreement subtrees.
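For the string instance of Section 2.2, the MSM problem of Eq. (5) can be solved by a standard dynamic program over prefixes, because the morphisms are exactly the order-preserving one-to-one partial mappings. The following sketch is our own illustration (the function name `msm` is ours): it evaluates Eq. (4) in the max-product semiring, and with $\phi = \alpha\,\delta$ it returns $\alpha^{\mathrm{LCS}(S,T)}$, mirroring the role this $\phi$ plays in the 2-MAST discussion above.

```python
# Sketch: the MSM problem (Eq. (5)) for strings by dynamic programming.

def msm(S, T, phi):
    """Max over order-preserving mappings mu of prod phi(s, t), (s, t) in mu.

    M[i][j] is the maximum for the prefixes S[:i] and T[:j]; the empty
    mapping contributes the empty product 1.
    """
    m, n = len(S), len(T)
    M = [[1.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j],                      # skip S[i-1]
                          M[i][j - 1],                      # skip T[j-1]
                          M[i - 1][j - 1] * phi(S[i - 1], T[j - 1]))  # pair them
    return M[m][n]

alpha = 2.0
phi = lambda a, b: alpha if a == b else 0.0   # phi = alpha * delta, alpha > 1
print(msm("abcab", "acb", phi))               # 8.0 = alpha**3; an LCS is "acb"
```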
3 Abstraction of Edit Distances

The concept of edit distances has proven effective for measuring the similarity of strings (Levenshtein 1966), trees (Tai 1979) and graphs (Neuhaus and Bunke 2007). In this section, we introduce the morphism-based distance within our framework as an abstraction of edit distances.

3.1 The classical notion of edit distances

The Levenshtein distance for strings (Levenshtein 1966) and the Tai distance for trees (Tai 1979) are well-known examples of edit distances and are defined within the same framework. An edit distance $d(X, Y)$ between two data $X$ and $Y$, which may be strings or rooted trees, is determined as the minimum cost of an edit path. An edit path consists of one or more edit operations, each of which is one of (1) substituting a component $y$ of $Y$ for a component $x$ of $X$, (2) deleting a component $x$ of $X$, or (3) inserting a component $y$ of $Y$, and transforms $X$ into $Y$. To each edit operation, a cost is assigned: we let $\psi(x, y)$, $\psi(x, \epsilon)$ and $\psi(\epsilon, y)$ denote the costs of substitution, deletion and insertion, respectively, where the symbol $\epsilon$ denotes a gap. The cost of an edit path $\pi$, denoted by $\Psi(\pi)$, is determined by the sum of the costs of the edit operations that constitute $\pi$. On the other hand, the substitution operations specified in $\pi$ determine a one-to-one partial mapping from $X$ to $Y$, which is called the trace of $\pi$ and denoted by $\tau(\pi)$. When we let $X_\pi = X \setminus \mathrm{Dom}(\tau(\pi))$ and $Y_\pi = Y \setminus \mathrm{Ran}(\tau(\pi))$, $\Psi(\pi)$ can be restated as

$$\Psi(\pi) = \sum_{x \in X_\pi} \psi(x, \epsilon) + \sum_{y \in Y_\pi} \psi(\epsilon, y) + \sum_{(x,y) \in \tau(\pi)} \psi(x, y). \tag{6}$$

When $\Pi_{X,Y}$ denotes the entire set of edit paths from $X$ to $Y$, the distance is determined by $d(X, Y) = \min_{\pi \in \Pi_{X,Y}} \Psi(\pi)$.

We have a few important observations to note here.

- A trace $\tau(\pi)$ preserves the order of letters in the string case and the generation order of vertices in the rooted tree case. In mathematical terminology, these traces are structure-preserving when we view a string as a totally ordered set and a rooted tree as a partially ordered set.
- Many other edit distances, including the Hamming distance for strings; the constrained distance (Zhang 1996), the less-constrained distance (Lu, Su, and Tang 2001) and the degree-two distance (Zhang, Wang, and Shasha 1996) for trees, which are all variations of the Tai distance; and the graph edit distance for graphs (Neuhaus and Bunke 2007), are defined by imposing certain constraints on the edit paths that constitute $\Pi_{X,Y}$. Nevertheless, these edit distances are all determined by $d(X, Y) = \min_{\pi \in \Pi_{X,Y}} \Psi(\pi)$ using Eq. (6) with the restricted $\Pi_{X,Y}$.
- These constraints on edit paths can be translated into additional conditions for traces to satisfy. For example, to define the graph edit distance (Neuhaus and Bunke 2007), the constraint on edit paths is that, when deleting a vertex, all the edges connected to the vertex must be deleted beforehand, and when inserting an edge, the two ends of the edge must already exist. Interestingly, this elaborate constraint turns out to be equivalent to the simple condition that traces are partial graph isomorphisms. Constraints on edit paths for many tree edit distances are translated into constraints on traces in (Kuboyama 2007).

3.2 Morphism-based distances (MD)

From the observations above, we have learned that the edit distances are defined in a common way regardless of the differences in $\Pi_{X,Y}$; that the differences in $\Pi_{X,Y}$ reduce to differences in the associated traces; and that the traces are structure-preserving partial mappings. This understanding leads us to the notion of morphism-based distances within our framework.
As a preliminary, we first define cost functions $\psi(\cdot, \cdot)$ over $L \times L$, and $\psi(\cdot, \epsilon)$ and $\psi(\epsilon, \cdot)$ over $L$, by

$$\psi(\ell, \ell') = \frac{\log \phi(\ell, \ell) + \log \phi(\ell', \ell')}{2} - \log \phi(\ell, \ell'); \qquad \psi(\ell, \epsilon) = \psi(\epsilon, \ell) = \frac{\log\left(c\,\phi(\ell, \ell)\right)}{2},$$

where $c$ is a positive constant that adjusts the cost of deletion and insertion. Finally, the morphism-based distance is determined as follows.

Morphism-based Distances (MD). For $X, Y \in \mathcal{D}$ and $\mu \in M_{X,Y}$, we determine $X_\mu = X \setminus \mathrm{Dom}(\mu)$ and $Y_\mu = Y \setminus \mathrm{Ran}(\mu)$. Then, a morphism-based distance between $X$ and $Y$ is determined by

$$\Psi_{\phi,c}(\mu) = \sum_{x \in X_\mu} \psi(\ell_X(x), \epsilon) + \sum_{y \in Y_\mu} \psi(\epsilon, \ell_Y(y)) + \sum_{(x,y) \in \mu} \psi(\ell_X(x), \ell_Y(y));$$

$$d^M_{\phi,c}(X, Y) = \min\{\Psi_{\phi,c}(\mu) \mid \mu \in M_{X,Y}\}.$$

We should note that determining $\psi$ is equivalent to determining $\phi$ up to the positive factor $c$; in fact, $c\,\phi(x, y) = e^{\psi(x, \epsilon) + \psi(\epsilon, y) - \psi(x, y)}$ holds.

3.3 Conditions to be a pseudo-metric

For the morphism-based distance $d^M_{\phi,c}$ to be a pseudo-metric, the following four axioms must be satisfied.

$d^M_{\phi,c}(X, Y) \geq 0$: To satisfy this axiom, we require that the values of $\psi$ be non-negative, and therefore we require $\phi(\ell_1, \ell_2)^2 \leq \phi(\ell_1, \ell_1)\,\phi(\ell_2, \ell_2)$ and $c\,\phi(\ell, \ell) \geq 1$. The first requirement is supported if $\phi$ has maximality (Eq. (1)) or positive definiteness (Eq. (3)). If $\phi$ satisfies maximality, $\phi(\ell_1, \ell_2) \leq \phi(\ell_1, \ell_1)$ and $\phi(\ell_1, \ell_2) \leq \phi(\ell_2, \ell_2)$ imply the requirement. If $\phi$ is positive definite, the Gramian matrix

$$G = \begin{pmatrix} \phi(\ell_1, \ell_1) & \phi(\ell_1, \ell_2) \\ \phi(\ell_2, \ell_1) & \phi(\ell_2, \ell_2) \end{pmatrix}$$

is a symmetric matrix, and therefore Schur decomposition yields $U^{\mathsf{T}} G U = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}$ for some orthogonal matrix $U = \begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \end{pmatrix}$. The eigenvalues $\lambda_1$ and $\lambda_2$ are non-negative, since $\lambda_k = \sum_i \sum_j u_{ik} u_{jk}\,\phi(\ell_i, \ell_j) \geq 0$, and in particular we have $\det G = \phi(\ell_1, \ell_1)\,\phi(\ell_2, \ell_2) - \phi(\ell_1, \ell_2)^2 \geq 0$. The second requirement is not an excessive demand either, because we can determine $c$ so that $c \geq \frac{1}{\min_{\ell \in L} \phi(\ell, \ell)}$: since $\phi(\ell, \ell) > 0$, $\frac{1}{\min_{\ell \in L} \phi(\ell, \ell)} < \infty$ holds.

$d^M_{\phi,c}(X, X) = 0$: By definition, $\psi(\ell, \ell) = 0$ always holds, and hence we have $\Psi_{\phi,c}(\mathrm{id}_X) = 0$.

$d^M_{\phi,c}(X, Y) = d^M_{\phi,c}(Y, X)$: Because $\phi$ is symmetric, we have $\Psi_{\phi,c}(\mu) = \Psi_{\phi,c}(\mu^{-1})$. Since $\mu^{-1} \in M_{Y,X}$ by definition, $d^M_{\phi,c}(X, Y) \geq d^M_{\phi,c}(Y, X)$ holds, and for the same reason, $d^M_{\phi,c}(X, Y) \leq d^M_{\phi,c}(Y, X)$ holds as well.

$d^M_{\phi,c}(X, Y) + d^M_{\phi,c}(Y, Z) \geq d^M_{\phi,c}(X, Z)$: If $\phi$ has maximality (Eq. (1)) and convexity (Eq. (2)), the triangle inequality holds for $\psi$, that is, $\psi(\ell_1, \ell_2) + \psi(\ell_2, \ell_3) \geq \psi(\ell_1, \ell_3)$ and $\psi(\epsilon, \ell_1) + \psi(\ell_1, \ell_2) \geq \psi(\epsilon, \ell_2)$ hold for arbitrary $\ell_1, \ell_2, \ell_3 \in L$.

Finally, we have:

Theorem 1. If $\psi(\cdot, \cdot)$, $\psi(\cdot, \epsilon)$ and $\psi(\epsilon, \cdot)$ are all non-negative and satisfy the triangle inequality, then $d^M_{\phi,c}$ is a pseudo-metric.

Corollary 1. If $\phi$ satisfies maximality and convexity, then $d^M_{\phi,c}$ is a pseudo-metric for any $c \geq \frac{1}{\min_{\ell \in L} \phi(\ell, \ell)}$.

We cannot always expect the identity of indiscernibles. In the extreme example where $\psi(\ell, \epsilon) = \psi(\epsilon, \ell) = 0$ holds for every $\ell \in L$, the resulting distances are always zero. We can, however, derive a metric space from any pseudo-metric space: the quotient space of a pseudo-metric space with respect to the equivalence $x \sim y \Leftrightarrow d(x, y) = 0$ is always a metric space.

3.4 Examples

We redefine several important instances of the edit distance known in the literature as morphism-based distances within our framework. In the following descriptions, we determine data structures and morphisms individually, while we commonly assume one of the following for $\phi$:

$$\phi(\ell, \ell') = \begin{cases} e^2, & \ell = \ell' \\ e, & \ell \neq \ell' \end{cases} \qquad \text{or} \qquad \phi(\ell, \ell') = \begin{cases} e^2, & \ell = \ell' \\ 0, & \ell \neq \ell'. \end{cases}$$

With $c = 1$, the first yields $\psi(\ell, \ell') = 1 - \delta_{\ell,\ell'}$ and $\psi(x, \epsilon) = \psi(\epsilon, y) = 1$, which is the most common setting in the literature. The second, on the other hand, yields $\psi(\ell, \ell') = \infty$ for $\ell \neq \ell'$ and therefore prohibits substitution of elements. This type of distance is known as an indel distance and is reported to show high accuracy when used with distance-based classification algorithms such as $k$-NN (Shin and Niiyama 2018).
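As a sanity check on these definitions, the following sketch (ours) instantiates $d^M_{\phi,c}$ for strings with the first choice of $\phi$ above. It derives $\psi$ from $\phi$ and $c$ exactly as in Section 3.2 and minimizes over morphisms with the classical edit-distance recursion, which is valid for the Levenshtein instance described next, where morphisms are order-preserving one-to-one partial mappings.

```python
import math

# Sketch: d^M_{phi,c} for strings, with psi derived from phi and c (Sec. 3.2).
# phi must be strictly positive here; the indel variant of phi would make
# psi_sub infinite for mismatched labels.

def morphism_distance(S, T, phi, c):
    def psi_sub(a, b):   # substitution cost psi(l, l')
        return (math.log(phi(a, a)) + math.log(phi(b, b))) / 2 - math.log(phi(a, b))
    def psi_gap(a):      # deletion/insertion cost psi(l, eps) = psi(eps, l)
        return math.log(c * phi(a, a)) / 2
    m, n = len(S), len(T)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + psi_gap(S[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + psi_gap(T[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + psi_gap(S[i - 1]),
                          D[i][j - 1] + psi_gap(T[j - 1]),
                          D[i - 1][j - 1] + psi_sub(S[i - 1], T[j - 1]))
    return D[m][n]

phi = lambda a, b: math.e ** 2 if a == b else math.e  # gives psi_sub = 1 - delta
print(round(morphism_distance("kitten", "sitting", phi, 1.0), 6))  # 3.0
```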
Levenshtein distance (Levenshtein 1966). A datum is a totally ordered set of labeled elements, and a morphism is an arbitrary order-preserving one-to-one partial mapping: we let $(S, <)$ and $(T, <)$ be finite totally ordered sets and let $M_{S,T}$ consist of the one-to-one partial mappings $\mu$ such that, if $s, s' \in \mathrm{Dom}(\mu)$ and $s < s'$, then $\mu(s) < \mu(s')$.

Tai distance (Tai 1979). First, we assume that trees are rooted but unordered. A rooted tree is a partially ordered set, defined as follows: for a finite set $X$, $(X, >)$ is a partially ordered set (poset); for any $x \in X$, the sub-poset $(X_{>x}, >)$ with $X_{>x} = \{y \in X \mid y > x\}$ is totally ordered; and there exists $\max X$, which is referred to as the root. The order $>$ determines a generation order, and $x > y$ means that $x$ is a proper ancestor of $y$. In particular, if $x$ is the unique parent of $y$, we write $x \to y$.

A rooted ordered tree, on the other hand, is equipped with a traversal order in addition to the generation order. Intuitively speaking, a traversal order arranges the vertices from left to right. A rooted ordered tree $(X, >, \prec)$ can be formalized by the following axioms:

- $(X, >)$ and $(X, \prec)$ are both posets;
- for arbitrary $x, y \in X$, exactly one of $x = y$, $x > y$, $x < y$, $x \prec y$ or $x \succ y$ holds;
- for any $x \in X$, the sub-poset $(X_{>x}, >)$ with $X_{>x} = \{y \in X \mid y > x\}$ is totally ordered;
- there exists $\max X$, which is referred to as the root;
- for any $x \in X$, $(X_{\prec x}, \prec)$ with $X_{\prec x} = \{y \in X \mid y \prec x\}$ is totally ordered.

Tai proved that the set of traces of edit paths for the Tai distance is identical to the set of order-preserving one-to-one partial mappings (Tai 1979). Hence, we determine the morphisms to be the one-to-one partial mappings $\mu \subseteq X \times Y$ with $x_1 > x_2 \Leftrightarrow \mu(x_1) > \mu(x_2)$ and, if $X$ and $Y$ are ordered, $x_1 \prec x_2 \Leftrightarrow \mu(x_1) \prec \mu(x_2)$ as well.

Less-constrained tree distance (Lu, Su, and Tang 2001). The less-constrained distance constrains the edit paths of the Tai distance not to include any deletion operation prior to an insertion. As a result, a morphism $\mu$ is an order-preserving partial mapping such that, for $\{(x_1, y_1), (x_2, y_2), (x_3, y_3)\} \subseteq \mu$, $x_1 x_2 < x_1 x_3 \Leftrightarrow y_1 y_2 < y_1 y_3$ holds, where $x_i < x_j$ means that $x_j$ is a proper ancestor of $x_i$, and $x_i x_j$ is the nearest common ancestor of $x_i$ and $x_j$: $x_i x_j = \min\{x \in X \mid x \geq x_i,\ x \geq x_j\}$. The resulting morphisms are not transitive, and in fact the distance does not satisfy the triangle inequality.

Degree-two tree distance (Zhang, Wang, and Shasha 1996). The constraint of the degree-two distance on Tai edit paths is that insertion and deletion are allowed only for vertices of degree one or two (vertices with one or two edges). To redefine the degree-two distance, we view rooted trees as semigroups instead of partially ordered sets, so that $x_i x_j$ is the nearest common ancestor of $x_i$ and $x_j$. A morphism is an arbitrary semigroup isomorphism between its domain and range, although the domain and range are not necessarily sub-semigroups.

Graph edit distance (Neuhaus and Bunke 2007). The graph edit distance requires that, when deleting a vertex, all the edges of the vertex must be deleted beforehand, and that, when inserting an edge, its two endpoints must already be included in the graph. The resulting morphisms are arbitrary partial graph isomorphisms.

3.5 Duality between MD and MSM

Theorem 2 shows an important equivalence between the problem of determining morphism-based distances and the MSM problem. We start with a useful lemma.

Lemma 1. For $\mu \in M_{X,Y}$, we have

$$\Psi_{\phi,c}(\mu) = \log \frac{\left(\Phi_{c\phi}(\mathrm{id}_X)\,\Phi_{c\phi}(\mathrm{id}_Y)\right)^{1/2}}{\Phi_{c\phi}(\mu)}.$$

Proof.
For $X_\mu = X \setminus \mathrm{Dom}(\mu)$ and $Y_\mu = Y \setminus \mathrm{Ran}(\mu)$,

$$\Psi_{\phi,c}(\mu) = \sum_{x \in X_\mu} \frac{\log c\,\phi(x, x)}{2} + \sum_{y \in Y_\mu} \frac{\log c\,\phi(y, y)}{2} + \sum_{(x,y) \in \mu} \left( \frac{\log \phi(x, x)}{2} + \frac{\log \phi(y, y)}{2} - \log \phi(x, y) \right)$$

$$= \frac{1}{2} \sum_{x \in X} \log c\,\phi(x, x) + \frac{1}{2} \sum_{y \in Y} \log c\,\phi(y, y) - \sum_{(x,y) \in \mu} \log c\,\phi(x, y) = \log \left(\Phi_{c\phi}(\mathrm{id}_X)\,\Phi_{c\phi}(\mathrm{id}_Y)\right)^{1/2} - \log \Phi_{c\phi}(\mu)$$

implies the assertion. □

Theorem 2 is a direct corollary of Lemma 1.

Theorem 2 (Duality). The following equality holds:

$$d^M_{\phi,c}(X, Y) = \log \frac{\left(\Phi_{c\phi}(\mathrm{id}_X)\,\Phi_{c\phi}(\mathrm{id}_Y)\right)^{1/2}}{\Phi^M_{c\phi}(X, Y)}. \tag{8}$$

Since the numerator of the right-hand side of Eq. (8) is a constant depending only on $X$ and $Y$, computing a morphism-based distance $d^M_{\phi,c}$ is equivalent to determining $\Phi^M_{c\phi}$ by solving the associated MSM problem.

4 Morphism-based multiple alignments

A multiple sequence alignment (MSA) problem is defined as follows: given more than one string $S_1, \dots, S_n$, the problem requires an optimal multiple alignment that minimizes the value of a predetermined cost function. For example,

S1: -TGTAAG---
S2: --GC-AGGTC
S3: ATGC-A--T-

is a multiple alignment of three strings. A popular cost function used in the literature is the sum-of-pairs cost, defined by $\sum_{i=1}^{n} \sum_{j=i+1}^{n} \Psi(\pi_{ij})$, where $\pi_{ij}$ is the edit path determined by the pairwise alignment between $S_i$ and $S_j$ in the multiple alignment, and Eq. (6) determines the cost $\Psi(\pi_{ij})$. In the example, with unit costs we have $\Psi(\pi_{12}) = 6$ and $\Psi(\pi_{23}) = \Psi(\pi_{31}) = 5$, and therefore the cost of the multiple alignment is 16.

We can abstract the concept of the multiple alignment by leveraging the framework of morphism-based distances.

Morphism-based Multiple Alignment (MMA). Given $\{X_1, \dots, X_n\} \subseteq \mathcal{D}$, a morphism-based multiple alignment problem requires finding morphisms $\mu_{ij} \in M_{X_i,X_j}$ for distinct $\{i, j\} \subseteq \{1, \dots, n\}$ that minimize

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \Psi_{\phi,c}(\mu_{ij})$$

subject to (1) $\mu_{ij} = \mu_{ji}^{-1}$ and (2) $\mu_{ij} \supseteq \mu_{kj} \circ \mu_{ik}$ for any distinct $\{i, j, k\} \subseteq \{1, \dots, n\}$.

When the number of given strings is greater than two, the MSA problem is known to be NP-hard, and hence MMA problems for $n \geq 3$ are not necessarily solvable in polynomial time. For MSA problems with $n \geq 3$, Gusfield proposed an efficient, error-bounded approximation algorithm, namely the center star algorithm (Gusfield 1993). We can abstract the algorithm in a straightforward manner to solve MMA problems without losing the original advantages of the algorithm.

Abstract Center Star Algorithm.
Input: $X_1, \dots, X_n \in \mathcal{D}$.
Output: $\mu_{ij} \in M_{X_i,X_j}$ for distinct $i, j \in \{1, \dots, n\}$ with $\mu_{ij} = \mu_{ji}^{-1}$ and $\mu_{ij} \supseteq \mu_{kj} \circ \mu_{ik}$.
Procedure:
1. For each $X_i$, compute morphisms $\bar\mu_{ij} \in M_{X_i,X_j}$ for $j \in [1, \hat{i}, n]$ with $d^M_{\phi,c}(X_i, X_j) = \Psi_{\phi,c}(\bar\mu_{ij})$, and let $S_i = \sum_{j \in [1, \hat{i}, n]} d^M_{\phi,c}(X_i, X_j)$.
2. Pick $k \in \arg\min\{S_i \mid i = 1, \dots, n\}$.
3. Determine $\mu_{ki} = \bar\mu_{ki}$, $\mu_{ik} = \bar\mu_{ki}^{-1}$ and $\mu_{ij} = \bar\mu_{kj} \circ \bar\mu_{ki}^{-1}$ for $i \neq k$ and $j \neq k$.

The computational complexity of the abstract center star algorithm is dominated by the product of $n$ and the complexity of computing the morphism-based distances $d^M_{\phi,c}(X_i, X_j)$; hence, if we have a polynomial-time algorithm to compute the distances, the abstract center star algorithm has polynomial time complexity. With respect to the approximation guarantee, we can prove that Gusfield's theorem (Gusfield 1993) also holds.

Theorem 3. Let $\{\mu_{ij}\}_{i,j}$ be a set of morphisms obtained by the abstract center star algorithm and let $\{\tilde\mu_{ij}\}_{i,j}$ be an optimal solution to the MMA problem. If $d^M_{\phi,c}$ is a pseudo-metric, we have

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \Psi_{\phi,c}(\mu_{ij}) \leq \left(2 - \frac{2}{n}\right) \sum_{i=1}^{n} \sum_{j=i+1}^{n} \Psi_{\phi,c}(\tilde\mu_{ij}).$$
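The abstract center star algorithm is easy to state generically. The sketch below is our own rendering: it assumes a user-supplied oracle `best_morphism(X, Y)` (a hypothetical name) that returns an optimal morphism as a dict together with its cost $\Psi_{\phi,c}$, and step 3 composes and inverts partial mappings exactly as in Section 2.1.

```python
# Sketch of the abstract center star algorithm (Section 4).

def center_star(data, best_morphism):
    n = len(data)
    best, S = {}, [0.0] * n
    for i in range(n):                     # step 1: pairwise optimal morphisms
        for j in range(n):
            if i != j:
                mu, cost = best_morphism(data[i], data[j])
                best[(i, j)] = mu
                S[i] += cost
    k = min(range(n), key=lambda i: S[i])  # step 2: the center
    inv = lambda mu: {t: s for s, t in mu.items()}
    comp = lambda mu, nu: {s: nu[u] for s, u in mu.items() if u in nu}
    out = {}                               # step 3: route everything through X_k
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if i == k:
                out[(i, j)] = best[(k, j)]
            elif j == k:
                out[(i, j)] = inv(best[(k, i)])
            else:
                out[(i, j)] = comp(inv(best[(k, i)]), best[(k, j)])
    return out   # satisfies mu_ij = mu_ji^{-1} by construction
```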
5 Abstraction of Pattern Extraction

5.1 Morphism-based pattern extraction (MPE)

The 2-MAST problem is a typical example of a pattern extraction problem and is studied in the literature in the general form of the n-MAST problem, which requires finding n agreement subtrees of n rooted trees that are congruent to one another. To abstract it within our framework, we formalize the morphism-based pattern extraction problem as follows.

Morphism-based Pattern Extraction (MPE). Given $\{X_1, \dots, X_n\} \subseteq \mathcal{D}$, a morphism-based pattern extraction problem requires finding morphisms $\mu_{ij} \in M_{X_i,X_j}$ for distinct $\{i, j\} \subseteq \{1, \dots, n\}$ that maximize

$$\prod_{i=1}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\mu_{ij})$$

subject to (1) $\mu_{ij} = \mu_{ji}^{-1}$ and (2) $\mu_{ij} = \mu_{kj} \circ \mu_{ik}$ for any distinct $\{i, j, k\} \subseteq \{1, \dots, n\}$.

The MMA and MPE problems are almost the same, except that $\mu_{ij} \supseteq \mu_{kj} \circ \mu_{ik}$ is replaced by $\mu_{ij} = \mu_{kj} \circ \mu_{ik}$. Proposition 1 clarifies the meaning of this replacement.

Proposition 1. If morphisms $\mu_{ij}$ for distinct $\{i, j\} \subseteq \{1, \dots, n\}$ satisfy $\mu_{ij} = \mu_{ji}^{-1}$ and $\mu_{ij} = \mu_{kj} \circ \mu_{ik}$, then $\mathrm{Dom}(\mu_{ij}) = \mathrm{Dom}(\mu_{ik}) = \mathrm{Ran}(\mu_{ji}) = \mathrm{Ran}(\mu_{ki})$ holds for any distinct $\{i, j, k\} \subseteq \{1, \dots, n\}$.

Thus, when a solution $\{\mu_{ij}\}_{i,j}$ of an MPE problem is given, we can view $\mathrm{Dom}(\mu_{ij})$ as the pattern extracted from $X_i$, which depends only on $i$ by Proposition 1.

We look at this idea more closely through the n-MAST example. An n-MAST problem can be viewed as an MPE problem by defining data, morphisms and a label similarity function in the same way as when we showed that a 2-MAST problem is an MSM problem: $\mathcal{D}$ is a set of rooted trees defined as semigroups with respect to the nearest-common-ancestor operator; morphisms are partial semigroup isomorphisms; and $\phi(\ell, \ell') = \alpha\,\delta_{\ell,\ell'}$ with $\alpha > 1$. For a solution $\{\mu_{ij}\}_{i,j}$, Proposition 1 implies that we can uniquely determine $T_i = \mathrm{Dom}(\mu_{ij}) = \mathrm{Ran}(\mu_{ji})$ for each $i$, which is a sub-semigroup, that is, a subtree. $\mu_{ij}$ is a congruent mapping between $T_i$ and $T_j$ if it preserves labels. Furthermore, we see that $\prod_{i=1}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\mu_{ij})$ is identical to $\alpha^{n(n-1)|T_i|/2}$ if all $\mu_{ij}$ preserve labels, and to zero otherwise.

5.2 Center star algorithm for MPE problems

Like the multiple sequence alignment problem, n-MAST problems for $n \geq 3$ are known to be NP-hard (Kao et al. 2007). Therefore, MPE problems are not necessarily solvable in polynomial time, and we need a polynomial-time approximation algorithm with a good error bound. The duality theorem (Theorem 2) may inspire us to develop such an algorithm based on the center star algorithm, and this idea is indeed right.

Definition 1. For $\{X, X_1, \dots, X_n\} \subseteq \mathcal{D}$, a pivot around $X$ is $(\mu_1, \dots, \mu_n) \in M_{X,X_1} \times \dots \times M_{X,X_n}$ that maximizes $S = \prod_{i=1}^{n} \Phi_\phi(\mu_i)$ under the constraint that all of $\mathrm{Dom}(\mu_i)$, $i = 1, \dots, n$, are identical. The maximum value of $S$ is called a signature of $X$.

The following algorithm approximately solves MPE problems.

Abstract Center Star Algorithm for MPE.
Input: $X_1, \dots, X_n \in \mathcal{D}$.
Output: $\mu_{ij} \in M_{X_i,X_j}$ for distinct $i, j \in \{1, \dots, n\}$ with $\mu_{ij} = \mu_{ji}^{-1}$ and $\mu_{ij} = \mu_{kj} \circ \mu_{ik}$.
Procedure:
1. Compute a pivot $(\bar\mu_{i1}, \dots, \hat{\bar\mu}_{ii}, \dots, \bar\mu_{in})$ around each $X_i$ and let $S_i$ be its signature.
2. Pick $k \in \arg\max\{S_i \mid i = 1, \dots, n\}$.
3. Determine $\mu_{ki} = \bar\mu_{ki}$, $\mu_{ik} = \bar\mu_{ki}^{-1}$ and $\mu_{ij} = \bar\mu_{kj} \circ \bar\mu_{ki}^{-1}$ for $i \neq k$ and $j \neq k$.

Theorem 4 gives an error bound for the algorithm.

Theorem 4. For $X_1, \dots, X_n \in \mathcal{D}$, let $\{\mu_{ij}\}_{i,j}$ be a set of morphisms obtained by the abstract center star algorithm and $\{\tilde\mu_{ij}\}_{i,j}$ be an optimal solution to the MPE problem. Without any loss of generality, we assume that the optimal $k$ of the abstract center star algorithm is 1, and we let $D$ be $\mathrm{Dom}(\mu_{1i})$.
If $\phi$ is convex, we have

$$\prod_{i=1}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\mu_{ij}) \geq \frac{\left( \prod_{i=1}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\tilde\mu_{ij}) \right)^{2 - \frac{2}{n}}}{\left( \prod_{x \in D} \phi(\ell_{X_1}(x), \ell_{X_1}(x)) \right)^{\frac{(n-1)(n-2)}{2}}}.$$

Proof. By the hypothesis, the pivot $(\bar\mu_{12}, \dots, \bar\mu_{1n})$ around $X_1$ has been used to compute $\mu_{ij}$, so $\mu_{1i} = \bar\mu_{1i}$. Applying convexity (Eq. (2)) elementwise to $\mu_{ij} = \mu_{1j} \circ \mu_{1i}^{-1}$ yields $\Phi_\phi(\mu_{1i})\,\Phi_\phi(\mu_{1j}) \leq \Phi_\phi(\mu_{ij}) \prod_{x \in D} \phi(\ell_{X_1}(x), \ell_{X_1}(x))$, and therefore

$$\left( \prod_{i=2}^{n} \Phi_\phi(\mu_{1i}) \right)^{n-1} = \left( \prod_{i=2}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\mu_{1i})\,\Phi_\phi(\mu_{1j}) \right) \prod_{i=2}^{n} \Phi_\phi(\mu_{1i}) \leq \left( \prod_{x \in D} \phi(\ell_{X_1}(x), \ell_{X_1}(x)) \right)^{\frac{(n-1)(n-2)}{2}} \prod_{i=1}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\mu_{ij}).$$

On the other hand, we let $\{\bar\mu_{ij} \mid j \in [1, \hat{i}, n]\}$ be the pivot around $X_i$ used to compute the signature $S_i$. Then

$$\prod_{j=2}^{n} \Phi_\phi(\mu_{1j}) \geq \prod_{j \in [1, \hat{i}, n]} \Phi_\phi(\bar\mu_{ij}) \geq \prod_{j \in [1, \hat{i}, n]} \Phi_\phi(\tilde\mu_{ij})$$

holds, since $S_1$ is the largest signature and $\{\tilde\mu_{ij}\}_j$ satisfies the constraint of Definition 1 by Proposition 1. Hence, we have

$$\left( \prod_{i=2}^{n} \Phi_\phi(\mu_{1i}) \right)^{n} \geq \prod_{i=1}^{n} \prod_{j \in [1, \hat{i}, n]} \Phi_\phi(\tilde\mu_{ij}) = \left( \prod_{i=1}^{n} \prod_{j=i+1}^{n} \Phi_\phi(\tilde\mu_{ij}) \right)^{2},$$

and the assertion follows. □

6 Morphism-based Moment Kernels

For given $(X, Y) \in \mathcal{D} \times \mathcal{D}$, we can consider the distribution of $\Phi_\phi(\mu)$ across $\mu \in M_{X,Y}$; the MSM problem determines its maximum. Morphism-based moment kernels, on the other hand, aim to evaluate the entire distribution.

6.1 Moments in statistics

When a real-valued random variable $X$ associated with a probability distribution $P$ is given, the $n$-th moment of $X$ is defined by $m_n = \int x^n P(x)\,dx$. It is known that these moments describe the distribution. In fact, $m_1$ is the mean and $m_2 - m_1^2$ is the variance, but the power of moments goes much further. Under some reasonable mathematical assumptions, the Fourier transform $\hat{P}(t) = \int e^{itx} P(x)\,dx$ of $P(x)$ is Taylor-expanded as

$$\hat{P}(t) = \sum_{n=0}^{\infty} \frac{(it)^n}{n!} m_n,$$

and the coefficient of degree $n$ is determined by $m_n$. In other words, the series of moments uniquely determines the distribution $P(x)$. Based on this understanding, we introduce morphism-based moment kernels as follows.

Morphism-based Moment Kernel (MMK). For $X, Y \in \mathcal{D}$, we define the $n$-th moment kernel as

$$K_n(X, Y) = \sum_{\mu \in M_{X,Y}} \Phi_\phi(\mu)^n.$$

$K_0(X, Y)$ is $|M_{X,Y}|$, while $\frac{K_1(X,Y)}{K_0(X,Y)}$ and $\frac{K_2(X,Y)}{K_0(X,Y)} - \left(\frac{K_1(X,Y)}{K_0(X,Y)}\right)^2$ yield the mean and the variance of the distribution of $\Phi_\phi(\mu)$. For unique determination, we have:

Theorem 5. For $X, Y \in \mathcal{D}$, if $|\{\Phi_\phi(\mu) \mid \mu \in M_{X,Y}\}| = n$ holds, then $K_0(X, Y), \dots, K_{n-1}(X, Y)$ uniquely determine the distribution of $\Phi_\phi(\mu)$.

Proof. We denote the distinct values of $\Phi_\phi(\mu)$ by $\{x_1, \dots, x_n\}$ and let $N_i = |\{\mu \in M_{X,Y} \mid \Phi_\phi(\mu) = x_i\}|$. Then we have

$$\begin{pmatrix} K_0(X, Y) \\ K_1(X, Y) \\ \vdots \\ K_{n-1}(X, Y) \end{pmatrix} = \begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \\ \vdots & & \vdots \\ x_1^{n-1} & \dots & x_n^{n-1} \end{pmatrix} \begin{pmatrix} N_1 \\ N_2 \\ \vdots \\ N_n \end{pmatrix},$$

and the Vandermonde determinant $\prod_{i > j} (x_i - x_j) \neq 0$ implies the assertion. □

The use of kernels to analyze structured data has been intensively studied. The first important contribution in the literature was the convolution kernel by (Haussler 1999): for two finite sets $S$ and $T$, the convolution kernel is defined by $K_C(S, T) = \sum_{(x,y) \in S \times T} k(x, y)$ and is positive definite if $k(x, y)$ is positive definite. The convolution kernel was generalized into the mapping kernel (Shin and Kuboyama 2008), which has the form $K_M(S, T) = \sum_{(x,y) \in M} k(x, y)$ for $M \subseteq S \times T$, and the necessary and sufficient condition for $K_M$ to be positive definite has been shown. From Theorem 3 of (Shin and Kuboyama 2010), Theorem 6 is derived.

Theorem 6. If morphisms are transitive and $\phi$ is positive definite, then $K_n(X, Y)$ is positive definite.

Many kernels known in the literature can be restated as 0-th morphism-based moment kernels. The all-sequences kernel (Shawe-Taylor and Cristianini 2004) for strings and the elastic kernel (Kashima and Koyanagi 2002) for trees are typical examples, but there are many more. To compute morphism-based moment kernels of higher degrees, the theory of partitionable kernels (Shin 2011) can be used.
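To make the definition of $K_n$ concrete, the following brute-force sketch (ours) enumerates $M_{S,T}$ for the string morphisms of Section 2.2 and sums $\Phi_\phi(\mu)^n$. It is exponential in the string lengths and is intended only as an executable restatement of the definition; practical computation would use dynamic programming, for example the partitionable-kernel machinery cited above.

```python
from itertools import combinations

# Brute-force sketch of the n-th moment kernel K_n (Section 6.1) for strings.

def morphisms(S, T):
    """Enumerate M_{S,T}: nonempty order-preserving one-to-one partial mappings."""
    for k in range(1, min(len(S), len(T)) + 1):
        for I in combinations(range(len(S)), k):      # i_1 < ... < i_k
            for J in combinations(range(len(T)), k):  # j_1 < ... < j_k
                yield list(zip(I, J))

def moment_kernel(S, T, phi, n):
    """K_n(S, T) = sum over mu in M_{S,T} of Phi_phi(mu)**n."""
    total = 0.0
    for mu in morphisms(S, T):
        p = 1.0
        for i, j in mu:
            p *= phi(S[i], T[j])
        total += p ** n
    return total

phi = lambda a, b: 1.0 if a == b else 0.5
S, T = "aba", "ab"
print(moment_kernel(S, T, phi, 0))  # K_0 = |M_{S,T}| = 9
print(moment_kernel(S, T, phi, 1))  # K_1; the mean of Phi_phi is K_1 / K_0
```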
6.2 Relation to MSM problems

Theorem 7 not only indicates the relation to MSM problems but also implies that moment kernels of very high degree carry only the same information as $\max \Phi_\phi(\mu)$.

Theorem 7. If $\Phi_\phi(\mu) > 0$ for all $\mu \in M_{X,Y}$, we have

$$\max\{\Phi_\phi(\mu) \mid \mu \in M_{X,Y}\} = \lim_{n \to \infty} K_n(X, Y)^{1/n}.$$

Proof. Because $\max_{i=1,\dots,k} a_i = \lim_{n \to \infty} \frac{1}{n} \log \sum_{i=1}^{k} e^{n a_i}$, we have

$$\log \Phi^M_\phi(X, Y) = \lim_{n \to \infty} \frac{1}{n} \log \sum_{\mu \in M_{X,Y}} e^{n \log \Phi_\phi(\mu)} = \lim_{n \to \infty} \log K_n(X, Y)^{1/n}. \qquad \Box$$

7 Future work

Although the effectiveness of our framework has already been demonstrated through some experiments, we will run larger-scale experiments with a wider variety of machine learning methods, including but not limited to distances, multiple alignments, pattern extraction and kernels. For this purpose, we plan to develop utility programs that analyze an input dataset exhaustively and consistently by means of the morphism-based distance, morphism-based pattern extraction, the moment kernels and other methods derived from appropriately parameterized pairs $(M, \phi)$. In real applications of this utility, the user will be able to select the most appropriate method based on the utility's output and use it for further analysis.

Acknowledgments

This work was partially supported by the Grant-in-Aid for Scientific Research (JSPS KAKENHI Grant Number 17H00762) from the Japan Society for the Promotion of Science.

References

Berg, C.; Christensen, J. P. R.; and Ressel, P. 1984. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions. Springer.

Collins, M., and Duffy, N. 2001. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14 (NIPS 2001), 625–632. MIT Press.

Gusfield, D. 1993. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology 55:141–154.

Haussler, D. 1999. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Dept. of Computer Science, University of California at Santa Cruz.

Kao, M.-Y.; Lam, T.-W.; Sung, W.-K.; and Ting, H.-F. 2007. An even faster and more unifying algorithm for comparing trees via unbalanced bipartite matchings.

Kashima, H., and Koyanagi, T. 2002. Kernels for semi-structured data. In Proceedings of the 19th International Conference on Machine Learning (ICML 2002), 291–298.

Kuboyama, T. 2007. Matching and Learning in Trees. Ph.D. Dissertation, Department of Advanced Interdisciplinary Studies, The University of Tokyo.

Leslie, C.; Eskin, E.; Cohen, A.; Weston, J.; and Noble, W. S. 2004. Mismatch string kernels for discriminative protein classification. Bioinformatics 20(4).

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8):707–710.

Lodhi, H.; Shawe-Taylor, J.; Cristianini, N.; and Watkins, C. J. C. H. 2001. Text classification using string kernels. In Advances in Neural Information Processing Systems 13 (NIPS 2000).

Lu, C. L.; Su, Z. Y.; and Tang, C. Y. 2001. A new measure of edit distance between labeled trees. In LNCS 2108, 338–348. Springer.

Neuhaus, M., and Bunke, H. 2007. Bridging the Gap Between Graph Edit Distance and Kernel Machines. World Scientific.

Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Shin, K., and Kuboyama, T. 2008. A generalization of Haussler's convolution kernel: mapping kernel. In ICML 2008.

Shin, K., and Kuboyama, T. 2010. A generalization of Haussler's convolution kernel: mapping kernel and its application to tree kernels. J. Comput. Sci. Technol. 25(5):1040–1054.

Shin, K., and Niiyama, T. 2018. Parameterized mapping distances for semi-structured data.
In ICAART 2018 (Revised Selected Papers), LNCS 11352, 443–466.

Shin, K. 2011. Partitionable kernels for mapping kernels. In ICDM 2011, 645–654.

Tai, K. C. 1979. The tree-to-tree correction problem. Journal of the ACM 26(3):422–433.

Zhang, K.; Wang, J. T. L.; and Shasha, D. 1996. On the editing distance between undirected acyclic graphs. International Journal of Foundations of Computer Science 7(1):43–58.

Zhang, K. 1996. A constrained edit distance between unordered labeled trees. Algorithmica 15:205–222.