Hyperbolic Embeddings of Supervised Models

Richard Nock (Google Research), richardnock@google.com; Ehsan Amid (Google DeepMind), eamid@google.com; Frank Nielsen (Sony CS Labs, Inc.), frank.nielsen@acm.org; Alexander Soen (RIKEN AIP and Australian National University), alexander.soen@anu.edu.au; Manfred K. Warmuth (Google Research), manfred@google.com

Abstract. Models of hyperbolic geometry have been successfully used in ML for two main tasks: embedding models in unsupervised learning (e.g. hierarchies) and embedding data. To our knowledge, there are no approaches that provide embeddings for supervised models, even though hyperbolic geometry provides convenient properties for expressing popular hypothesis classes such as decision trees (and ensembles). In this paper, we propose a full-fledged solution to the problem in three independent contributions: the first links the theory of losses for class probability estimation to hyperbolic embeddings in the Poincaré disk model; the second resolves an issue for an easily interpretable, unambiguous embedding of (ensembles of) decision trees in this model; the third shows how to smoothly tweak the Poincaré hyperbolic distance to improve its encoding and visualization properties near the border of the disk, a crucial region for our application, while keeping hyperbolicity. This last step has substantial independent interest as it is grounded in a generalization of Leibniz-Newton's fundamental Theorem of calculus.

1 Introduction

Models of hyperbolic geometry have been successfully used to embed hierarchies, i.e. tree-based structures [14, 20, 38, 36]. Through the property of low-distortion hyperbolic embeddings [30], symbolic (hierarchies) and numeric (quality metrics) properties of unsupervised learning models can be represented in an accurate and interpretable manner [29]. When it comes to supervised learning, the current trend involves embedding data in a hyperbolic space, with models trained on the now hyperbolic data [9, 15, 8, 11]. It is important to note that none of these supervised methods embed models. Indeed, the focus of the prior literature is to learn supervised models from hyperbolic data, rather than to represent supervised models in hyperbolic geometry. This is in stark contrast with the unsupervised usage described above, where the (tree-based) structural properties of unsupervised models are directly exploited to obtain good embeddings. This hints at benefits to be reaped in the supervised case as well, especially for popular supervised models which are (structurally) tree-based. Our paper proposes a full-fledged solution for this problem, in three independent contributions.

The first focuses on the numerical part (quality metrics) of the embedding and its link with supervised losses: to provide a natural way of embedding classification and the confidence of predictions, we show a link between training with the log-loss (posterior estimation) or the logistic loss (real-valued classification) and the computation of hyperbolic distances in the Poincaré disk.

The second focuses on the symbolic part (hierarchies) of the embedding for a popular kind of supervised model: decision trees (and ensembles). Unlike in unsupervised learning, we show that getting an unambiguous embedding of a decision tree requires post-processing the model. Our solution extracts its monotonic sub-tree via a new class of supervised models we introduce, monotonic decision trees.
This is also convenient for explainability purposes [32].

The third and most technical contribution focuses on visualization and on the accuracy of the numerical encoding in the Poincaré disk model. The more accurate / confident a supervised prediction is, the closer to the border of the disk it gets represented. Thus, the best models will have their best predictions squeezed close to the border. In addition to being suboptimal for visualization, this region is also documented as being numerically error-prone for hyperbolic distance computation [29]. Two general fixes currently exist: encoding values with (exponentially) more bits, or utilizing a trick from Riemannian geometry [19]. As neither is satisfactory for our usage, we propose a third, principled route: like other distances, the Poincaré distance is an integral. We generalize Leibniz-Newton's fundamental Theorem of calculus using a generalization of classical arithmetic [22]. This generalized tempered integral provides a parameter to smoothly alter the properties of the classical integral. When defining the Poincaré distance, the tempering controls the hyperbolic constant of the embedding while also improving the visualization and numeric accuracy of the embedded models. The generalization of integrals to distances has independent interest, as many applications in ML rely on integral-based distances and distortions, e.g., Bregman divergences, f-divergences, integral probability metrics, etc.

Experiments are provided on readily available domains, and all proofs, additional results and additional experiments are given in an Appendix.

2 Related work

Models of hyperbolic geometry have been mainly useful to embed hierarchies, i.e. tree-based structures [14, 20, 38, 36], with a sustained emphasis on coding size and numerical accuracy [19, 29, 37]. In unsupervised learning and clustering, some applications have sought a simple representation of data in the form of a tree or via hyperbolic projections [7, 6, 17, 34]. Approaches dealing with supervised learning assume the data lies in a hyperbolic space: the output visualized is primarily an embedding of the data itself, with additional details linked to classification of secondary importance, be it for support vector machines [9], neural nets and logistic regression [15], or (ensembles of) decision trees [8, 11]. We insist on the fact that the aforementioned methods do not represent the models in the hyperbolic space, even when those models are tree-shaped. The question of embedding classifiers is potentially important to improve the state of the art in visualization: in the case of decision trees, popular packages stick to a topological layer (the tree graph) on which various additional pieces of information about classification are superimposed, but without principled links to the embedding*.

*See for instance https://github.com/parrt/dtreeviz.

3 Basic definitions

Training a supervised model starts with a set (sample) of examples, each of which is a pair $(x, y)$, where $x \in \mathcal{X}$ (a domain) and $y \in \{-1, 1\}$ (labels or classes). A decision tree (DT) consists of a binary rooted tree $H$ where, at each internal node, the two outgoing arcs define a Boolean test over observation variables (see Figure 1, center, for an example); $N(H)$ is the set of nodes of $H$ and $\Lambda(H) \subseteq N(H)$ is the set of leaf nodes of $H$.

The log-loss is best introduced in the context of the theory of losses for class probability estimation (CPE) [28]. We follow the notations of [18]: a CPE loss function $\ell: \mathcal{Y} \times [0,1] \to \mathbb{R}$ is

$\ell(y, p) \doteq [\![y = 1]\!]\,\ell_1(p) + [\![y = -1]\!]\,\ell_{-1}(p),$   (1)

where $[\![\cdot]\!]$ denotes the Iverson bracket [16].
Functions $\ell_1$, $\ell_{-1}$ are called partial losses. A CPE loss is symmetric when $\ell_1(p) = \ell_{-1}(1-p)$, $\forall p \in [0,1]$. The log-loss is a symmetric CPE loss with $\ell_1^{\mathrm{LOG}}(p) \doteq -\log(p)$. The goal of learning using a CPE loss is to optimize the Bayes risk, $\underline{L}(p) \doteq \inf_\pi \mathbb{E}_{Y \sim p}[\ell(Y, \pi)]$. In the case of the log-loss, it is the criterion used to learn DTs in C4.5 [26]:

$\underline{L}^{\mathrm{LOG}}(p) = -p \log p - (1-p)\log(1-p).$   (2)

Models like DTs predict an empirical posterior $\hat{\Pr}[Y = 1 \mid X]$ at the leaves: for an observation $x$ reaching leaf $\lambda$, the posterior prediction is the local relative proportion of positive examples in leaf $\lambda$, as estimated from the training sample (denoted $p_\lambda$).

Figure 1: Left pane: subtree of a decision tree (DT) (colors red / green denote the majority class in each node, grey = random posterior; tests at arcs not shown; $n_1$ is the root; no leaves shown) and its embedding following the simple recipe in (3): it is impossible to tell $n_2$ from $n_5$, and the tree depicted in $\mathbb{B}$ is not a faithful representation of the DT nor of any of its subtrees. Right pane: a small DT $H$ learned on UCI abalone (left) and its corresponding monotonic decision tree $H'$ (MDT, right) learned using GETMDT. In each node, the real-valued prediction ($\psi^{\mathrm{LOG}}(p)$) is indicated, also in color. Observe that, indeed, $H$ does not grant path-monotonic classification but $H'$ does (Definition 4.1). In $H'$, some nodes have outdegree 1; also, internal node #6 in the DT, whose prediction is worse than its parent's, disappears in $H'$. One arc in $H'$ is represented with double width as its Boolean test aggregates both tests it takes to go from #3 to #10 in $H$. Observations that would be classified by leaf #11 (resp. #5, #9) in $H$ are now classified by internal node #3 (resp. #2, #4) in $H'$, but the predicted labels are the same for $H$ and $H'$, so the accuracies of $H$ and $H'$ are the same.

There exists a duality between CPE classification and the real-valued classification setting familiar to e.g. deep learning: optimizing the Bayes risk for the posterior is equivalent to minimizing its convex surrogate, using the canonical link of the loss to compute the real-valued classification [24, Theorem 1]. The canonical link of the log-loss, $\psi^{\mathrm{LOG}}: [0,1] \to \mathbb{R}$, is the famed inverse sigmoid, which has a very convenient form for our purpose:

$\psi^{\mathrm{LOG}}(p) = \log\!\left(\frac{p}{1-p}\right) \;\Rightarrow\; |\psi^{\mathrm{LOG}}(p)| = \log\!\left(\frac{1+r}{1-r}\right), \quad r \doteq |2p - 1|.$   (3)

Notably, the absolute value $|\psi^{\mathrm{LOG}}(p)|$ is a confidence, with the sign giving the predicted class. In the case of the log-loss, the convex surrogate is hugely popular in ML: it is the logistic loss.

We now introduce concepts from hyperbolic geometry, particularly those from the Poincaré disk model. Our definitions are simplified; a detailed account can be found in [5, 20]. The distance function $d$ of a metric space is $\tau$-hyperbolic for some $\tau \ge 0$ iff, for any three geodesic curves $\gamma_1, \gamma_2, \gamma_3$ linking three points, there exists $z$ such that $\max_i d(z, \gamma_i) \le \tau$, where $d(z, \gamma) \doteq \inf_{z' \in \gamma} d(z, z')$. A small hyperbolic constant guarantees thin triangles and embeddings of trees with good properties. The Poincaré disk model $\mathbb{B}$ (negative curvature $-1$) is a popular model of hyperbolic geometry.
The hyperbolic distance between the origin and some $z \in \mathbb{B}$ with Euclidean norm $r \doteq \|z\|$ is

$d_{\mathbb{B}}(z, 0) = \log\!\left(\frac{1+r}{1-r}\right).$   (4)

4 Posterior embedding in Poincaré disk and clean embeddings for DTs

Node embedding. We exploit the similarity between log-loss confidences $|\psi^{\mathrm{LOG}}(p)|$ and hyperbolic distances $d_{\mathbb{B}}(z, 0)$ when embedding classification models, a similarity which is obvious from (3) and (4). In (3), we see that if $r = 0$, the prediction is as bad as a fair coin's. As $r$ edges closer to 1, the prediction approaches maximum confidence. The connection with the Poincaré distance (4) is immediate: a natural embedding of a posterior prediction $p$ is then a point $z$ in the Poincaré disk $\mathbb{B}$ at radius $r \doteq \|z\|$ from the origin (3). The origin of the Poincaré disk represents the worst possible confidence. As each leaf $\lambda$ of a DT $H$ corresponds to a posterior $p_\lambda$, the leaves can be embedded as described. In addition, all nodes $\nu \in N(H)$ in the tree also have corresponding posteriors, and thus can be embedded identically to leaf nodes. It is also a good idea to embed them because, as classical DT induction algorithms proceed by top-down induction, any node $\nu$ in a DT was a leaf at some point during training, and has a corresponding posterior. This apparently clean picture for node embedding however becomes messy as soon as we move to embedding full trees.

No clean embedding for a full DT. A simple counterexample demonstrates that a clean embedding of a decision tree is substantially trickier, see Figure 1 (left). In this pathological example, some distinct nodes (resp. arcs) of the DT $H$ are embedded onto the same node (resp. edge) in $\mathbb{B}$: the depiction has no subgraph relationship to $H$, and some embedded nodes cannot be distinguished from each other without additional information.

Getting a clean embedding: Monotonic Decision Trees. To introduce our fix, we define three broad objectives for the embedding in $\mathbb{B}$ of DTs: (A) embed a tree-based model from $H$ in $\mathbb{B}$, which defines an injective mapping of the nodes of $H$ and induces a subtree of $H$, with each edge in $\mathbb{B}$ corresponding to a path in $H$; (B) locally, each node $\nu$ of $H$ gets embedded to some $z_\nu$ such that $r_\nu$ is close to $\|z_\nu\|$ (3); (C) globally, the whole embedding remains convenient and readable to compare, in terms of confidence in classification, different subtrees in the same tree, or even different trees. As we shall see later in this Section, this agenda also carries over to embedding ensembles of DTs, thus including their leveraging coefficients, exploiting properties of boosting algorithms. For now, our solution for embedding a single DT that can satisfy (A-C) is simple in principle: embed the monotonic classification part of a DT, meaning that for each path from the root to a leaf in $H$, we only embed the subsequence of nodes whose absolute confidences are strictly increasing. To do so, we replace the DT by a "close" path-monotonic approximation, using a new class of models we call Monotonic Decision Trees (MDT).

Definition 4.1. A Monotonic Decision Tree (MDT) is a rooted tree with a Boolean test labeling each arc and a real-valued prediction at each node. Any sequence of nodes in a path starting from the root is strictly monotonically increasing in absolute confidence. At any internal node, no two Boolean tests at its outgoing arcs can be simultaneously satisfied. The classification of an observation is the prediction of the bottom-most node reachable from the root.
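To make the node-level recipe in (3) and (4) concrete, here is a minimal Python sketch (our illustration, not the paper's code; the function names are ours): the log-loss confidence of a posterior $p$ equals the Poincaré distance to the origin of a point placed at radius $r = |2p - 1|$, which is the radius assigned to DT and MDT nodes.

```python
import math

def confidence_radius(p):
    # r = |2p - 1| in (3): 0 for a fair-coin posterior, close to 1 for confident ones.
    return abs(2.0 * p - 1.0)

def log_loss_confidence(p):
    # |psi_LOG(p)| = |log(p / (1 - p))| = log((1 + r) / (1 - r)), as in (3).
    return abs(math.log(p / (1.0 - p)))

def poincare_distance_to_origin(r):
    # d_B(z, 0) = log((1 + r) / (1 - r)) for a point z of Euclidean norm r, as in (4).
    return math.log((1.0 + r) / (1.0 - r))

if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9, 0.99):
        r = confidence_radius(p)
        # The last two printed quantities coincide: this is the basis of the node embedding.
        print(p, r, log_loss_confidence(p), poincare_distance_to_origin(r))
```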
Algorithm 1 GETMDT(ν, b, ν′, I)
Input: node ν ∈ N(H) (H = DT), Boolean test b, node ν′ ∈ N(H′) (H′ = MDT being built from H), forbidden posteriors I ⊂ [0, 1];
1:  if ν ∈ Λ(H) then
2:    if p_ν ∈ I then
3:      ν′ ← TAGDTLEAF(ν);  // tags ν′ ∈ N(H′) with info from leaf ν ∈ Λ(H)
4:    else
5:      ν″ ← H′.NEWNODE(ν);  // ν″ will be a new leaf in H′
6:      H′.NEWARC(ν′, b, ν″);  // adds arc ν′ →_b ν″ in H′
7:    endif
8:  else
9:    if p_ν ∉ I then
10:     ν″ ← H′.NEWNODE(ν);  // ν″ = new internal node in H′
11:     H′.NEWARC(ν′, b, ν″);  // adds arc ν′ →_b ν″ in H′
12:     I_new ← [min{p_ν, 1 − p_ν}, max{p_ν, 1 − p_ν}];  // I ⊂ I_new
13:     ν′_new ← ν″;
14:     b_new ← true;
15:   else
16:     I_new ← I; ν′_new ← ν′; b_new ← b;  // ν yields no change in H′
17:   endif
18:   GETMDT(ν.leftchild, b_new ∧ ν.TEST(leftchild), ν′_new, I_new);
19:   GETMDT(ν.rightchild, b_new ∧ ν.TEST(rightchild), ν′_new, I_new);
20: endif

We introduce an algorithm, GETMDT (Algorithm 1), which takes as input a DT $H$ and outputs an MDT $H'$ approximating $H$ via the following key property:

(M) For any observation $x \in \mathcal{X}$, the prediction $H'(x)$ is equal to the prediction, in the path followed by $x$ in $H$, of its deepest node in the strictly monotonic subsequence starting from the root of $H$.

Figure 1 (right) presents an example of the MDT $H'$ that would be built for some DT $H$ (left) and satisfying (M) (unless observations have missing values, $H'$ is unique). Figure 1 adopts two additional conventions, (D1) and (D2), to ease the parsing of $H'$:

(D1) Some internal nodes of $H'$ are also tagged with labels corresponding to the leaves of $H$. If a node in $H'$ is tagged with the label of one leaf $\lambda$ of $H$, it indicates that examples reaching $\lambda$ in the original $H$ are being classified by the tagged node of $H'$;

(D2) Arcs in $H'$ have a width proportional to the number of Boolean tests it takes to reach the arc's tail from its head in $H$. A large width thus indicates a long path in $H$ to improve classification confidence.

To produce the MDT $H'$ from the DT $H$ with Algorithm 1, after having initialized $H'$ to a root that is a single leaf, we just run GETMDT(root($H$), true, root($H'$), $[p^-_{\mathrm{root}}, p^+_{\mathrm{root}}]$), where we let $p^-_{\mathrm{root}} \doteq \min\{p_{\mathrm{root}}, 1 - p_{\mathrm{root}}\}$ and $p^+_{\mathrm{root}} \doteq \max\{p_{\mathrm{root}}, 1 - p_{\mathrm{root}}\}$. Upon finishing, the tree rooted at root($H'$) is the MDT sought. We complete the description of GETMDT: when a leaf of $H$ does not have sufficient confidence and ends up being mapped to an internal node of the MDT $H'$, TAGDTLEAF is the procedure that tags this internal node with information from the leaf (see (D1) above). We consider a tag to be just the name of the leaf (Figure 1, leaves #5, 9, 11), but other conventions can be adopted. The other two methods we use grow the MDT $H'$: NEWNODE creates a new node, and NEWARC adds a new arc between existing nodes. We easily check the following.

Theorem 1. The $H'$ built by GETMDT satisfies (M) with respect to the DT $H$.

Property (M) is crucial to keep the predictions of the DT and the MDT close to each other: to each node of the MDT indeed corresponds a node in the DT with the same posterior prediction. Because the set of nodes of $H'$ corresponds to a subset of the nodes in $H$, predictions can locally change for some observations. This phenomenon is limited by two factors: (i) the set of leaves of the MDT corresponds to a set of leaves in the DT (thus, predictions do not change for the related observations); (ii) more often than not in our experiments, it is only the confidence that changes, not the sign of the prediction, which means that the accuracies of the DT and of its corresponding MDT are close to each other (Section 6).
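The following is a minimal Python sketch of the idea behind GETMDT, under simplifying assumptions: the toy node classes (DTNode, MDTNode), their field names, and the string-based test aggregation are our own illustrative choices, and the tagging convention (D1) is reduced to recording leaf information; Algorithm 1 above remains the reference description.

```python
# A minimal sketch (not the paper's reference implementation) of the idea behind
# GETMDT: keep, along each root-to-leaf path of a DT, only the nodes whose
# absolute confidence |2p - 1| strictly increases.
from dataclasses import dataclass, field


@dataclass
class DTNode:
    posterior: float                                  # p = local fraction of positives
    children: list = field(default_factory=list)      # [(bool_test_str, DTNode), ...]


@dataclass
class MDTNode:
    posterior: float
    tags: list = field(default_factory=list)          # DT leaves absorbed here, cf. (D1)
    children: list = field(default_factory=list)      # [(aggregated_test_str, MDTNode), ...]


def confidence(p: float) -> float:
    return abs(2.0 * p - 1.0)                         # r in (3); 0 = fair coin, 1 = certain


def get_mdt(dt_root: DTNode) -> MDTNode:
    mdt_root = MDTNode(dt_root.posterior)

    def recurse(nu: DTNode, pending_test: str, nu_prime: MDTNode, best_conf: float):
        if not nu.children:                           # leaf of the DT
            if confidence(nu.posterior) <= best_conf:
                nu_prime.tags.append(f"leaf p={nu.posterior:.2f}")   # tag, cf. (D1)
            else:
                nu_prime.children.append((pending_test, MDTNode(nu.posterior)))
            return
        if confidence(nu.posterior) > best_conf:      # node survives in the MDT
            new_node = MDTNode(nu.posterior)
            nu_prime.children.append((pending_test, new_node))
            nu_prime, best_conf, pending_test = new_node, confidence(nu.posterior), ""
        for test, child in nu.children:               # aggregate skipped tests, cf. (D2)
            joined = f"{pending_test} AND {test}" if pending_test else test
            recurse(child, joined, nu_prime, best_conf)

    for test, child in dt_root.children:
        recurse(child, test, mdt_root, confidence(dt_root.posterior))
    return mdt_root
```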
It is also worth remarking that the MDT $H'$ always has the same depth as the DT $H$. Finally, any pruning of $H$ is a subtree of $H$, and its corresponding MDT is a subtree of the MDT of $H$. The aforementioned troubles in embedding the DT in the Poincaré disk considering (A-C) do not exist anymore for the MDT, because the best embeddings necessarily have all arcs going outwards in the Poincaré disk. The algorithm we use to embed the MDT in the Poincaré disk is a simple alteration of Sarkar's embedding [30] which, because of the lack of space, we defer to the Appendix (Section C.VIII). The quality of the embedding is measured using an embedding error that takes into account all nodes:

$\rho(H') \doteq 100 \cdot \mathbb{E}_{\nu \in N(H')}\big[\,\big|\, |\psi_\nu| - d_{\mathbb{B}}(0, z_\nu)\,\big|\,\big],$   (5)

where $\psi_\nu$ refers to the relevant $\psi^{\mathrm{LOG}}(p_\nu)$ in (3) (and $z_\nu \in \mathbb{B}$ is its embedding in $\mathbb{B}$). Two example final representations are in Figure 2. Notice the small errors $\rho$ in both cases.

Figure 2: Two example MDTs learned on the UCI domain online_shoppers_intention, showing the embedding error $\rho$ (5). Left: key parts of the embedding (the root is the square node; red / green = majority class; isolines of the posterior; arc thickness = number of tests in the DT to reach the child node, arcs going outwards in $\mathbb{B}$; ratio of radii = leveraging coefficient in the ensemble, in absolute value; node embedding error). Right: annotations used in the Experiments, Section 6.

Remark 1. Using a particular boosting algorithm to boost the logistic loss, we can as well represent the leveraging coefficients of a DT/MDT in a boosted combination, following a convention sketched in Figure 2 and explained at length, along with the boosting algorithm and its properties, in the Appendix (Section C.IX).

5 Smoothly altering integrals: T-calculus and the t-self of Poincaré disk

It is apparent from Figure 2 (right, lowest red pentagon) that, when utilizing the standard hyperbolic distance $d_{\mathbb{B}}(z, 0)$ (4), the best MDT nodes are embedded close to the border. In addition to obviously not being great for visualization, these nodes are at risk of having numeric error push them to the border $\partial\mathbb{B}$, thereby giving a false depiction of infinite confidence. In this section, we provide alternative distances to prevent this phenomenon. First, we quantify the numerical risk, which has been defined via the critical region where high numeric error can occur [19, 29].

Definition 5.1. A point $x \in \mathbb{B}$ is said to be $k$-close to the boundary $\partial\mathbb{B}$ if $\|x\| = 1 - 10^{-k}$. It is said to be encodable iff $\|x\| < 1$ in machine encoding (it is not "pushed" to the boundary).

Machine encoding constrains the maximum possible $k$: in the double-precision floating-point representation (Float64), $k = 16$ [19]. In the case of the Poincaré disk, the maximal distance $\bar{d}$ from the origin 0 that this authorizes (before numerical error warps a point to the border) is a small affine order in $k$ [19]:

$\bar{d} \le \log(2) + \log(10)\,k + O(10^{-k}).$   (6)

Hence, in practice, only a ball of radius $\bar{d} \approx 38$ around the origin can be accurately represented. This is in deep contrast with the Euclidean representation, where $\bar{d} = \Omega(2^k)$. We now provide a principled solution to changing $\bar{d}$ while keeping hyperbolicity, which relies on a crucial generalization of Leibniz-Newton's fundamental Theorem of calculus.

5.1 T-calculus

We let $[n] \doteq \{1, 2, \ldots, n\}$ for $n \in \mathbb{N}_{>0}$. At the core of our generalization is the replacement of the addition by the tempered (or t-)addition [22],

$z \oplus_t z' \doteq z + z' + (1-t)\,z z',$

in the notion of Riemann sum. The additional term is a scaled saddle-point curve $zz'$, positive when the signs of $z$ and $z'$ agree and negative otherwise.
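A short worked identity, which we add here for clarity (it is implicit in the proof of Theorem 2, Appendix C.I), shows why the t-addition composes well: the affine map $z \mapsto 1 + (1-t)z$ turns $\oplus_t$ into an ordinary product,

$1 + (1-t)\,(z \oplus_t z') = 1 + (1-t)\bigl(z + z' + (1-t)\,z z'\bigr) = \bigl(1 + (1-t)\,z\bigr)\bigl(1 + (1-t)\,z'\bigr).$

In particular, iterating the t-sum over a division of $[a, b]$ turns the corresponding product into $\exp\big((1-t)\int_a^b f\big)$ in the limit, which is exactly the $g_t$ relationship stated in Theorem 2 below.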
Hereafter, $f$ is defined on an interval $[a, b]$. We define a generalization of Riemann integration to the t-algebra (see [2] for a preliminary development, which does not provide the generalization of Leibniz-Newton's fundamental Theorem of calculus). To this end, given an interval $[a, b]$ and a division $\Delta$ of this interval using $n + 1$ reals $x_0 \doteq a < x_1 < \ldots < x_{n-1} < x_n \doteq b$, we define the Riemann t-sum of $f$ over $[a, b]$ using $\oplus_t$, for a set of tags $\Xi \doteq \{\xi_i \in [x_{i-1}, x_i]\}_{i \in [n]}$:

$S^{(t)}_{\Delta}(f) \doteq \bigoplus\nolimits_{t,\, i \in [n]}\, (x_i - x_{i-1}) \cdot f(\xi_i)$   (7)

($S^{(1)}_{\Delta}(f)$ is the classical Riemann summation). Let $s(\Delta) \doteq \max_i\, x_i - x_{i-1}$ denote the step of the division $\Delta$. The conditions for t-Riemann integration are the same as for $t = 1$.

Definition 5.2. Fix $t \in \mathbb{R}$. A function $f$ is t-(Riemann) integrable over $[a, b]$ iff $\exists L \in \mathbb{R}$ such that $\forall \varepsilon > 0$, $\exists \delta > 0$: for every division $\Delta$ with $s(\Delta) < \delta$ and any associated set of tags $\Xi$, $\big|S^{(t)}_{\Delta}(f) - L\big| < \varepsilon$. When this happens, we note

${}^{(t)}\!\int_a^b f(x)\,\mathrm{d}_t x = L.$   (8)

The case $t = 1$, Riemann integration, is denoted using classical notations. We now prove our first main result for this Section, namely the link between t-Riemann integration and Riemann integration.

Theorem 2. Any function is either t-Riemann integrable for all $t \in \mathbb{R}$ simultaneously, or for none. In the former case, we have the relationship (with $\log_t(z) \doteq (z^{1-t} - 1)/(1-t)$ for $t \neq 1$ and $\log_1 \doteq \log$):

${}^{(t)}\!\int_a^b f(u)\,\mathrm{d}_t u = g_t\!\left(\int_a^b f(u)\,\mathrm{d}u\right), \quad \forall t \in \mathbb{R}, \quad \text{with } g_t(z) \doteq \log_t \exp z.$   (9)

For example, if $t = 0$, we get $1 + {}^{(0)}\!\int_a^b f(u)\,\mathrm{d}_t u = \exp \int_a^b f(u)\,\mathrm{d}u$, which is Volterra's integral [33, Theorem 5.5.11]. Theorem 2 is important because it allows one to compute any t-Riemann integral from the classical ($t = 1$) case. Classical Riemann integration and derivation are fundamental inverse operations. The classical derivative is sometimes called the Euclidean derivative [4]. We now elicit the notion of derivative which is the inverse of t-Riemann integration. Unsurprisingly, it generalizes the Euclidean derivative. The following Theorem stands as a generalization of the classical fundamental Theorem of calculus.

Theorem 3. Let $z \ominus_t z' \doteq \frac{z - z'}{1 + (1-t)z'}$ denote the tempered subtraction. When it exists, define

$D_t f(z) \doteq \lim_{\delta \to 0} \frac{f(z + \delta) \ominus_t f(z)}{\delta}.$

Suppose $f$ is t-Riemann integrable. Then the function $F$ defined by $F(z) \doteq {}^{(t)}\!\int_a^z f(u)\,\mathrm{d}_t u$ is such that $D_t F = f$. We call $F$ a t-primitive of $f$ (which zeroes when $z = a$) and $D_t F$ the t-derivative of $F$.

The function $g_t$ (Figure 3) is key to many of our results; it has quite remarkable properties: in particular, it is strictly increasing for any $t \in \mathbb{R}$, strictly concave for any $t > 1$, strictly convex for any $t < 1$, and such that $\mathrm{sign}(g_t(z)) = \mathrm{sign}(z)$, $\forall t \in \mathbb{R}$. One can also note that $D_t g_t(z) = 1$. The Appendix proves these results (Sections C.I and C.II).

Remark 2. The Appendix also proves many more results showing how Theorems 2 and 3, and properties of t-calculus, naturally "percolate" through many properties known for classical integration and the fundamental theorem of calculus (among others: additivity, the Chasles relationship, monotonicity, the hyperbolic Pythagorean theorem, the mean value theorem, the chain rule, etc.).

5.2 The t-self of Poincaré disk

Let us now put T-calculus to good use in our context. For a set $\mathcal{X}$ endowed with a function $d: \mathcal{X} \times \mathcal{X} \to [0, \infty]$ (e.g. the Poincaré disk and the Poincaré hyperbolic distance), the t-self of $\mathcal{X}$ is the set (implicitly) endowed with $d^{(t)} \doteq g_t \circ d$. Before going further, let us provide the elegant $d^{(t)}$ associated to the Poincaré disk's t-self (with $r \doteq \|z\|$ as in (4)):

$d_{\mathbb{B}}^{(t)}(0, z) = \log_t \exp d_{\mathbb{B}}(0, z) = \log_t\!\left(\frac{1+r}{1-r}\right).$   (10)
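To give a feel for (10), here is a tiny Python sketch (our illustration; function names are ours) comparing the Poincaré distance to the origin with its tempered version for points $k$-close to the boundary; Lemma 1 below quantifies how the tempering changes the maximal encodable distance (6) and the hyperbolic constant.

```python
import math

def log_t(z, t):
    # Tempered logarithm: log_t(z) = (z**(1 - t) - 1) / (1 - t), with log_1 = log.
    return math.log(z) if t == 1.0 else (z ** (1.0 - t) - 1.0) / (1.0 - t)

def poincare_distance_to_origin(r):
    # d_B(z, 0) = log((1 + r) / (1 - r))   (4)
    return math.log((1.0 + r) / (1.0 - r))

def t_self_distance_to_origin(r, t):
    # d_B^(t)(0, z) = log_t((1 + r) / (1 - r))   (10)
    return log_t((1.0 + r) / (1.0 - r), t)

if __name__ == "__main__":
    # Points k-close to the boundary (Definition 5.1), for k = 2, 4, 8.
    for k in (2, 4, 8):
        r = 1.0 - 10.0 ** (-k)
        print(k, round(poincare_distance_to_origin(r), 2),
              [round(t_self_distance_to_origin(r, t), 2) for t in (1.0, 0.9, 0.5)])
```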
The following Lemma (shown in the Appendix) demonstrates the usefulness of T-calculus to address the encoding problem in $\mathbb{B}$.

Lemma 1. In the Poincaré disk model, pick any increasing function $g(k) \ge 0$. For the choice $t = 1 - f(k)$, where $f(k) \in \mathbb{R}$ is any function satisfying

$\frac{\log\left(1 + f(k)\,g(k)\right)}{f(k)} \le \log(10)\,k,$   (11)

we get, in lieu of (6), a maximal value $\bar{d}^{(t)}$ of $d^{(t)}$ satisfying $\bar{d}^{(t)} \ge g(k)$, and the new hyperbolic constant $\tau_t$ of the t-self satisfies

$\tau_t = \frac{\exp\left(f(k)\,\tau\right) - 1}{f(k)}.$   (12)

Since both $\log(1 + x) = x + o(x)$ and $\exp(x) = 1 + x + o(x)$ in a neighborhood of 0, we check that (11) and (12) recover the properties of the Poincaré model as $f(k) \to 0$ [19]. We then have two cases: we can easily approach the Euclidean $\bar{d}$ by picking $f(k) > 0$: choosing $f(k) = 1$ gets there, and we still keep a finite hyperbolic constant, albeit exponential in the former one. If however we want to further improve the hyperbolic constant, we can pick some admissible $f(k) < 0$, but then (11) will constrain $g(k)$ to a very small value. Note that depending on the sign of $f(k)$, the triangle inequality can still hold or be replaced by a weaker version, since we can show, in $\mathbb{B}^{(t)}$,

$d_{\mathbb{B}}^{(t)}(x, z) \le d_{\mathbb{B}}^{(t)}(x, y) + d_{\mathbb{B}}^{(t)}(y, z) + \max\{0, f(k)\} \cdot d_{\mathbb{B}}^{(t)}(x, y)\, d_{\mathbb{B}}^{(t)}(y, z).$

Let us finally investigate how we can perform a general embedding in the t-self $\mathbb{B}^{(t)}$ of the Poincaré disk so as to (i) be more encoding-savvy and (ii) have points originally close to $\partial\mathbb{B}$ substantially further away from the border of the t-self $\mathbb{B}^{(t)}$. We also add a third constraint (iii): the distortion must be fair with respect to the original Poincaré disk model embedding, i.e. if some $z \in \mathbb{B}$ (e.g. the coordinate of a DT node) gets mapped to $z^{(t)} \in \mathbb{B}^{(t)}$ (the t-self), then we request

$d_{\mathbb{B}}^{(t)}(z^{(t)}, 0) = d_{\mathbb{B}}(z, 0).$   (13)

Figure 3: Left: plot of $g_t(z)$ (9) for different values of $t$ (color map on the right bar), showing where it is convex / concave. The $t = 1$ case ($g_1(z) = z$) is emphasized in red. Right: suppose $r \doteq \|z\|$ is the Euclidean norm of a point $z$ in the Poincaré disk $\mathbb{B}$. Fix $t \in [0, 1]$ (color map on the right bar). The plot gives the (Euclidean) norm $r^{(t)}$ of the point $z^{(t)}$ in the t-self such that $d_{\mathbb{B}}^{(t)}(z^{(t)}, 0) = d_{\mathbb{B}}(z, 0)$ (13). Remark that even for $r$ very close to 1 we can have $r^{(t)}$ substantially smaller (e.g. $r^{(t)} < 0.8$).

Table 1: Left pane: embedding in the Poincaré disk of the MDTs corresponding to the DTs learned at the 1st (top row) and 10th (bottom row) boosting iteration, on four UCI domains (online_shoppers_intention, analcatdata_supreme, abalone, hardware). Stark differences emerge between these domains from the plots alone. Right pane: comparison between the Poincaré disk embedding and its $t = 0$ t-self, for an MDT learned on UCI online_shoppers_intention (top, $j = 1$; boosting coefficient information not shown) and on UCI hardware (bottom, $j = 70$). The $p \in \{0.001, 0.999\}$ isoline (rectangle) is barely distinguishable from $\partial\mathbb{B}$ but is clearly distinct from $\partial\mathbb{B}^{(0)}$. Note that in $\mathbb{B}^{(0)}$ the center looks similar to a scaling (zoom) of $\mathbb{B}$, while near the border the high non-linearity of $\mathbb{B}^{(0)}$ allows us to spot nodes that have high confidence / training accuracy (in orange) but can hardly be distinguished from the bulk of merely good nodes in $\mathbb{B}$. Note also the set of red nodes in the Poincaré disk for $j = 70$ (rectangle) that mistakenly look aligned, but do not in the t-self.

All three conditions point to a non-linear embedding $p_t: \mathbb{B} \to \mathbb{B}^{(t)}$. The t-self offers a simple and convenient solution with a trivial design for $p_t$: we compute $z^{(t)}$ by a simple scaling of the Euclidean norm of $z$ in $\mathbb{B}$, so as to ensure that (13) holds.
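To illustrate the scaling enforcing (13), here is a small Python sketch (our illustration; we assume the usual tempered exponential $\exp_t(x) = (1 + (1-t)x)^{1/(1-t)}$, the inverse of $\log_t$) that maps a Poincaré radius $r$ to the t-self radius $r^{(t)}$ with the same distance to the origin. Exact values depend on conventions, but points very close to $\partial\mathbb{B}$ are visibly pulled back inside, in line with Figure 3 (right).

```python
import math

def exp_t(x, t):
    # Tempered exponential, the inverse of log_t: exp_t(x) = (1 + (1 - t) x)^(1 / (1 - t)).
    return math.exp(x) if t == 1.0 else (1.0 + (1.0 - t) * x) ** (1.0 / (1.0 - t))

def t_self_radius(r, t):
    # Radius r^(t) in the t-self such that d_B^(t)(z^(t), 0) = d_B(z, 0)   (13),
    # using d_B(z, 0) = log((1 + r) / (1 - r))  (4) and d_B^(t)(0, z) = log_t((1 + r) / (1 - r))  (10).
    d = math.log((1.0 + r) / (1.0 - r))        # Poincaré distance to the origin
    e = exp_t(d, t)                            # solve log_t((1 + r') / (1 - r')) = d for r'
    return (e - 1.0) / (e + 1.0)

if __name__ == "__main__":
    # A point 4-close to the boundary (Definition 5.1) and its t-self radii.
    for t in (1.0, 0.7, 0.5, 0.0):
        print(t, round(t_self_radius(1.0 - 1e-4, t), 4))
```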
Figure 3 (right) displays the corresponding relationship between $r \doteq \|z\|$ and $r^{(t)} \doteq \|z^{(t)}\|$ when $t \in [0, 1]$, clearly achieving (ii) in addition to (iii). For (i), even for $t < 1$ very close to 1 and $z$ close to $\partial\mathbb{B}$ ($r$ close to 1), the mapping can send $z^{(t)}$ substantially back inside the t-self $\mathbb{B}^{(t)}$: for $t = 0.7$ and $r = 1 - 10^{-4}$, i.e. $k = 4$ in Definition 5.1, we get $r^{(t)} \approx 0.96$, authorizing encoding with a less risky $k = 2$. Two additional benefits of (13), visible from Figure 3 (right), are that the distortion is low near the center and then merely affine in a wide region, until the heavy non-linear changes, typically when $r > 0.8$.

Remark 3. The Appendix details properties of the t-self of the Lorentz model of hyperbolic geometry, another popular model (Section C.V).

6 Experiments

We summarize a number of experiments (Table 1), provided otherwise in extenso in the Appendix. In the top-down induction scheme for DTs, the leaf chosen to be split is the heaviest non-pure leaf, i.e. the one, among the leaves containing both classes, with the largest proportion of training examples. Induction is stopped when the tree reaches a fixed size, when all leaves are pure, or when no further split allows decreasing the expected log-loss. We do not prune DTs. Our domains are public (cf. Appendix).

Poincaré embeddings of DTs / MDTs. See Figure 2 (left) for a summary of the visualization. We recall that the center of the disk represents poor classification (confidence 0) or, in the context of boosting, random classification. A striking observation is the sheer variety of patterns that are plainly obvious from our visualization but would otherwise be difficult to spot using classical tools. Some are common to all domains: as boosting iterations increase, low-depth tree nodes tend to converge to the center of the disk, indicating increased hardness of classification. This is the impact of a well-known effect of boosting, whose weight modifications make the problem "harder". Among domain-dependent patterns, highly imbalanced domains predictably get their root initially embedded far from the origin (online_shoppers_intention, analcatdata_supreme), while more balanced domains get their root embedded near the origin (abalone). Across the experiments, all trees have at most 200 nodes: while this clearly provides very good models after a large number of iterations on some domains, it is not enough to get good models after just a few iterations on others (analcatdata_supreme).

Interpreting the hyperbolic embeddings can also be telling: consider Figure 2 (right), which shows an MDT of the ($j = 3$) boosted DT learned on online_shoppers_intention. Notice the low embedding error (5.46%). We see in (A) (magenta) that classes are more balanced at the root compared to the previous trees (all DTs are in the Appendix, Table IV), because the root is closer to the center of the disk: thus, hard examples belong to both classes (no class becomes much harder to learn than the other). Two clearly distinct subtrees are associated with non-purchase patterns (red, A). The bottom-most subtree achieves very good confidence, with several leaves close to the border: these are non-purchase patterns clearly distinct from purchase patterns (green nodes); the whole subtree dangles from the root via a test on feature Page Values (grey, B), achieving a substantial boost in confidence; many nodes are way past the posterior isoline $p \in \{0.1, 0.9\}$. The red subtree in (A) is a lot closer to purchase patterns (C).
One distinct pattern of purchase emerges, built from tests that always strictly increase the prediction's confidence (orange). The full rule achieves very high accuracy (the leaf posterior nears 0.99).

Interest of the t-self for visualization. As demonstrated by Table 1 (right) and the Appendix, the t-self is particularly useful to tell apart the best parts of a tree. Because it also manages to push the bulk of nodes back from the border $\partial\mathbb{B}^{(t)}$, the t-self might also be useful as a standard encoding (use (13) to access Poincaré disk quantities and embeddings).

MDT vs DT for classification. We traded the hyperbolic visualization of a DT, bound to be poor, for the visualization of its "monotonic part", an MDT. We have seen that the MDT visualization provides lots of clues about the DT as well. In terms of classification, the fact that the set of leaves of the MDT is a subset of the set of leaves of its DT hints at similarities in classification as well, but so far we have not explored a key question whose scope goes beyond this paper's: how does the classification of an MDT compare to that of its DT? This question is important because, in addition to MDTs being more amenable to visualization than DTs, classification with MDTs might just be a good alternative to classification with DTs. The scope of this question goes beyond this paper because of the tremendous popularity of DTs. The left pane of Table 2 provides a partial answer on a subset of our domains (the full set of results is in the Appendix, Section D.II; p-values are those of a paired Student t-test, with $H_0$ being the equality of the average errors). Clearly, the classification accuracies are similar for a majority of the domains considered, but interestingly, we observe both polarities when they are not: on domains ionosphere and online_shoppers_intention the DT is better than its MDT, but on domain analcatdata_supreme it is the opposite. This is particularly interesting because the MDT is no larger, and can be substantially smaller, than its corresponding DT, and thus could represent an efficient pruning of its DT.

MDT embedding: visualization error. The right pane of Table 2 provides an example of embedding errors $\rho$ (5) for the first five MDTs in a boosted ensemble (we recall that the description of the boosting algorithm and the embedding of the leveraging coefficients is provided in the Appendix, Section C.IX). The full set of results is in the Appendix, Section D.II. The results show that errors are small in general, which is good news given our quite heuristic modification of Sarkar's embedding. There is more: the embedding error tends to decrease, in some cases very significantly, with the MDT's index in the ensemble; see for example winered, german and qsar.
The explanation is simple and due to a key boosting feature, the reweighting mechanism underlined above: with weight updates that in general bring the total (new) weight of positive examples closer to that of negative examples, the root of the next MDT gets embedded closer to the center of the disk. As the root moves closer to the center, it is much easier to get a good spread of the tree in the rest of the disk at reduced error, in particular if the MDT is well balanced. For an example, see hardware for $j = 70$ in Table 1 ($\rho < 1\%$).

domain     | DT test error | MDT test error | p-val  | ρ, MDT #1 | #2    | #3    | #4   | #5
ionosphere | 5.71 ± 2.70   | 9.11 ± 5.34    | 0.0237 | 15.21     | 9.84  | 8.20  | 6.88 | 3.45
winered    | 18.39 ± 2.02  | 18.32 ± 2.27   | 0.9227 | 12.13     | 11.87 | 8.56  | 6.31 | 4.44
german     | 24.00 ± 4.37  | 23.90 ± 4.25   | 0.8793 | 14.74     | 10.06 | 7.73  | 5.05 | 2.70
a._s.      | 23.40 ± 1.85  | 22.56 ± 1.75   | 0.0134 | 7.11      | 8.32  | 8.23  | 8.67 | 7.60
abalone    | 22.15 ± 2.20  | 21.14 ± 2.44   | 0.1267 | 9.37      | 6.53  | 8.20  | 7.61 | 8.18
qsar       | 12.98 ± 2.63  | 12.89 ± 4.05   | 0.8772 | 16.78     | 12.97 | 12.86 | 5.77 | 4.24
o._s._i.   | 9.90 ± 0.78   | 10.67 ± 0.54   | 0.0057 | 5.42      | 3.74  | 5.46  | 6.42 | 7.00
hardware   | 1.33 ± 0.27   | 1.44 ± 0.19    | 0.0534 | 7.78      | 8.03  | 3.44  | 3.49 | 2.48

Table 2: Summary of two experiments for a subset of our domains. Left pane: test errors (%) of DTs vs their corresponding MDTs (average ± std dev); bold faces on p-values indicate keeping the identity of averages at a 0.05 first-order risk. Right pane: embedding error ρ (5) (%) for the visualization of the first five MDTs in a learned ensemble. Shorthands: o._s._i. = online_shoppers_intention; a._s. = analcatdata_supreme. See text for details.

7 Conclusion

This paper proposes three separate contributions that altogether provide a solution for the hyperbolic embedding of (ensembles of) decision trees (DTs), via (i) a link between losses for class probability estimation and hyperbolic distances, (ii) the design of a clean, unambiguous embedding of a DT via its monotonic subtree, and (iii) a way to improve the visualization and encoding properties of the Poincaré disk model using its t-self, a notion we introduce. Each of these contributions is indeed separate: for example, one could reuse (i) to embed models different from DTs, or (iii) to perform hyperbolic embeddings of objects not even related to supervised models. Contribution (ii) is of independent interest with the introduction of a new kind of tree-based models, monotonic decision trees. We believe that the way we address (iii) opens general applications even outside hyperbolic geometry. Indeed, we generalize classical integration and derivation to a context encompassing the concept of additivity, upon which integration is built, extending standard properties of integration and derivation in a natural way (see the Appendix for many examples). That such properties can be obtained without additional technical effort bodes well for perspectives of developments and applications in other subfields of ML, where such tools could be used, not just in our geometric context, to smoothly tweak distortion measures. Many distortion measures used in ML are indeed integral in nature: Bregman divergences, f-divergences, integral probability metrics, etc.

One inherent limitation of our work is linked to the use of DTs: they typically fit tabular (attribute-value) data very well, but otherwise should be used with caution. Our work is a first step towards more sophisticated visualization for supervised models (Section 2), looking for tailor-fit geometric spaces to best blend their symbolic and numerical properties. More can be done and more has to be done: at an age where data collection and ML compute are still ramping up, seeking model "pictures" worth a thousand words is a challenge but also a necessity for responsible AI and explainability. Finally, our introduction of monotonic decision trees calls for a deeper statistical analysis of such models, given the tremendous popularity of the decision trees that they could complement or replace in mainstream data science pipelines.

Acknowledgments

Thanks are due to Mathieu Guillame-Bert, the reviewers and the area chair for many suggestions that helped improve the paper's content. Work was done while AS was visiting RIKEN AIP.

References

[1] E. Amid, F. Nielsen, R. Nock, and M.-K. Warmuth. Optimal transport with tempered exponential measures. In AAAI'24, 2024.
[2] Ehsan Amid, Frank Nielsen, Richard Nock, and Manfred K. Warmuth. The tempered Hilbert simplex distance and its application to non-linear embeddings of TEMs. arXiv preprint arXiv:2311.13459, 2023.
[3] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In Proc. of the 4th SIAM International Conference on Data Mining, pages 234-245, 2004.
[4] A.-F. Beardon and D. Minda. The hyperbolic metric and geometric function theory. In Proc. of the International Workshop on Quasiconformal Mappings and their Applications, 2005.
[5] B.-H. Bowditch. A course on geometric group theory. Math. Soc. Japan Memoirs, 2006.
[6] I. Chami, A. Gu, V. Chatziafratis, and C. Ré. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. In NeurIPS*33, 2020.
[7] I. Chami, A. Gu, D. Nguyen, and C. Ré. HoroPCA: Hyperbolic dimensionality reduction via horospherical projections. In 38th ICML, volume 139 of Proceedings of Machine Learning Research, pages 1419-1429. PMLR, 2021.
[8] P. Chlenski, E. Turok, A. Khalil Moretti, and I. Pe'er. Fast hyperboloid decision tree algorithms. In 12th ICLR, 2024.
[9] H. Cho, B. DeMeo, J. Peng, and B. Berger. Large-margin classification in hyperbolic space. In 22nd AISTATS, volume 89 of Proceedings of Machine Learning Research, pages 1832-1840. PMLR, 2019.
[10] Z. Cranko and R. Nock. Boosted density estimation remastered. In 36th ICML, pages 1416-1425, 2019.
[11] L. Doorenbos, P. Márquez-Neila, R. Sznitman, and P. Mettes. Hyperbolic random forests. CoRR, abs/2308.13279, 2023.
[12] D. Dua and C. Graff. UCI machine learning repository, 2021.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Ann. of Stat., 28:337-374, 2000.
[14] O.-E. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In 35th ICML, volume 80 of Proceedings of Machine Learning Research, pages 1632-1641. PMLR, 2018.
[15] O.-E. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic neural networks. In NeurIPS*31, pages 5350-5360, 2018.
[16] D.-E. Knuth. Two notes on notation. The American Mathematical Monthly, 99(5):403-422, 1992.
[17] M.-T. Law, R. Liao, J. Snell, and R.-S. Zemel. Lorentzian distance learning for hyperbolic representations. In 36th ICML, volume 97 of Proceedings of Machine Learning Research, pages 3672-3681. PMLR, 2019.
[18] Y. Mansour, R. Nock, and R.-C. Williamson. Random classification noise does not defeat all convex potential boosters irrespective of model choice. In 40th ICML, 2023.
[19] G. Mishne, Z. Wan, Y. Wang, and S. Yang. The numerical stability of hyperbolic representation learning. In 40th ICML, 2023.
[20] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS*30, pages 6338-6347, 2017.
[21] F. Nielsen and R. Nock. On Rényi and Tsallis entropies and divergences for exponential families. CoRR, abs/1105.3259, 2011.
[22] L. Nivanen, A. Le Méhauté, and Q.-A. Wang. Generalized algebra within a nonextensive statistics. Reports on Mathematical Physics, 52:437-444, 2003.
[23] R. Nock, W. Bel Haj Ali, R. D'Ambrosio, F. Nielsen, and M. Barlaud. Gentle nearest neighbors boosting over proper scoring rules. IEEE Trans. PAMI, 37(1):80-93, 2015.
[24] R. Nock and A. K. Menon. Supervised learning: No loss no cry. In 37th ICML, 2020.
[25] M.-C. Pardo and I. Vajda. About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Trans. IT, 43:1288-1293, 1997.
[26] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[27] J.-G. Ratcliffe. Foundations of Hyperbolic Manifolds. Springer Graduate Texts in Mathematics, 1994.
[28] M.-D. Reid and R.-C. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731-817, 2011.
[29] F. Sala, C. De Sa, A. Gu, and C. Ré. Representation tradeoffs for hyperbolic embeddings. In 35th ICML, volume 80 of Proceedings of Machine Learning Research, pages 4457-4466. PMLR, 2018.
[30] R. Sarkar. Low distortion Delaunay embedding of trees in hyperbolic plane. In GD'11, volume 7034, pages 355-366. Springer, 2011.
[31] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297-336, 1999.
[32] Andrew D. Selbst and Solon Barocas. The intuitive appeal of explainable machines. Fordham L. Rev., 87:1085, 2018.
[33] A. Slavík. Product Integration, its History and Applications. Matfyzpress, Prague, 2007.
[34] R. Sonthalia and A.-C. Gilbert. Tree! I am no tree! I am a low dimensional hyperbolic embedding. In NeurIPS*33, 2020.
[35] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. IT, 60:3797-3820, 2014.
[36] M. Yang, M. Zhou, R. Ying, Y. Chen, and I. King. Hyperbolic representation learning: Revisiting and advancing. In 40th ICML, volume 202 of Proceedings of Machine Learning Research, pages 39639-39659, 2023.
[37] T. Yu and C. De Sa. Numerically accurate hyperbolic embeddings using tiling-based models. In NeurIPS*32, pages 2021-2031, 2019.
[38] T. Yu, T.-J.-B. Liu, A. Tseng, and C. De Sa. Shadow cones: Unveiling partial orders in hyperbolic space. CoRR, abs/2305.15215, 2023.

Appendix

To differentiate from the numbering in the main file, the numbering of Theorems, etc., is letter-based (A, B, ...).

Table of contents
A  Broader Impact
B  t-algebra and t-additivity
C  Supplementary material on proofs
   ↪ Proof of Theorem 2
   ↪ Proof of Theorem 3
   ↪ Additional and helper results for Theorems 2 and 3
   ↪ Sets endowed with a distortion and their t-self: statistical information
   ↪ The t-self of the Lorentz model of hyperbolic geometry
   ↪ Proof of Lemma 1
   ↪ Proof of Theorem 1
   ↪ Modifying Sarkar's embedding in Poincaré disk
   ↪ Boosting with the logistic loss à la AdaBoost
D  Supplementary material on experiments
   ↪ Domains
   ↪ Visualizing a DT via its MDT and embedding errors
   ↪ All Poincaré disk embeddings
   ↪ Interest of the t-self for visualization

A Broader Impact

This paper presents work whose goal is to advance the field of Machine Learning. Beyond our proposed tempered calculus, which is used to amend the original Poincaré disk model of hyperbolic geometry, the monotonic decision trees and the t-self hyperbolic embedding contribute to the field of visualization and explainable AI for ML models. Thus, there are two distinct avenues for broader impact and related care:

1. from the geometry standpoint, the t-self brings a novel coding and visualization advantage over the founding Poincaré disk model for embeddings close to the disk's border, regardless of the nature of the embedding (whether it is data, models, etc. that are embedded). However, being new, it is important to clearly label that the t-self is being used, and also to clearly display the relevant value of t or range of values, since the properties of the embedding can change depending on them;

2.
from the model standpoint and a societal perspective, these visualizations will provide better methods for deployed decision tree models to be scrutinized. A potential negative of our approach is the required reduction from normal decision trees to monotonic decision trees, where the visualization does not directly correspond to the initial decision tree. We believe that monotonicity adds value to the interpretation of rules that are progressively built from trees, yet practitioners should be careful when extracting inferences from the reduced monotonic decision tree and clearly label them as being extracted from such a model, to prevent confusion with decision trees, that are overwhelmingly popular. From a pure prediction performance standpoint, we show at least empirically within the bounds of our settings that monotonic decision trees are similar to their original decision tree counterparts. B t-algebra and t-additivity We provide here a few more details on the basis of our paper, the t-algebra and the t-additivity of some divergences. B.I Primer on t-algebra Classical arithmetic over the reals can be used to display duality relationships between operators using the log, exp functions, such as for example logpa{bq log a log b, exppa bq exppaq exppbq, and so on. They can also be used to define one operator from another one. There is no difference between the operators appearing inside and outside functions. In the t-algebra, a difference appears and such relationships can be used to define the t-operators from those over the reals, as indeed one can define the tempered addition x t y . logtpexptpxq exptpyqq, the tempered subtraction, x at y . logtpexptpxq{ exptpyqq, (both simplifying to the expressions we use), and of course the tempered product and division, x bt y . exptplogtpxq logtpyqq ; x mt y . exptplogtpxq logtpyqq, whose simplified expression appears e.g. in [1]. See also [22]. B.II Functional t-additivity As is well-known, Boltzman-Gibbs (and so, Shannon) entropy is additive while Tsallis entropy is t-additive. Note that additivity for BG requires being on the simplex, but t-additivity of Tsallis technically requires only positive measures the simplex restriction ensures the limit exists for t Ñ 1 and then TÑBG. Divergences can also be t-additive The KL divergence DKLpp}qq . ÿ k pk logppk{qkq is additive on the simplex but not t-additive: using a decomposition of p, q as product of (independent) distributions (p1, p2 and q1, q2) using the cartesian product of their support, we indeed check DKLpp}qq DKLpp1}q1q DKLpp2}q2q with pij p1ip2j and qij q1iq2j. On the other hand, Tsallis divergence [21] is t-additive on positive measures with such a decomposition, with DTpp}qq . ř k pkppk{qkq1 t 1 and we check that DTpp}qq DTpp1}q1q DTpp2}q2q p1 tq DTpp1}q1q DTpp2}q2q (additional requirement to be on the simplex for convergence to DKL as t Ñ 1). For t R p1, 2q and on the simplex, Tsallis divergence is an f-divergence with generator z ÞÑ pz2 t 1q{p1 tq. Similarly, the tempered relative entropy is t-additive on the co-simplex, where in this case Dtpp}qq . 1 ř k pkq1 t k 1 t , p, q P m. The tempered relative entropy is a Bregman divergence with generator z ÞÑ z logt z logt 1 z. C Supplementary material on proofs C.I Proof of Theorem 2 The Theorem is tautological for t 1 so we prove it for t 1. Denote Pp Sq the set of subsets of set S and P p Sq . Pp Sqzt Hu. We first transform Sptq npfq (index n shown for readability) in a better suited expression: Sptq npfq . 
p tqn i 1|Ii|fpξiq P PP prnsq p1 tq|P | 1 ź i PP |Ii|fpξiq i PP p1 tq|Ii|fpξiq i 1 p1 p1 tq|Ii|fpξiqq 1 where we have used Ii . rxi 1, xis for conciseness. We now need a technical Lemma. Lemma 2. Fix any 0 ď v ă 8{10 and consider any n reals qi, i P rns such that |qi| ď v, @i P rns. Then ip1 qiq ď exp pnv vq . (15) Proof. Suppose all the qi, i P rns satisfy qi P ru, vs and let ϕ be strictly convex differentiable and defined over ru, vs. Then it comes from [10, Lemma 9] that we get the right-hand side of 0 ď Eirϕpqiqs ϕp Eirqisq ď Dϕ ˆ w pϕ1q 1 ˆϕpuq ϕpvq where we can pick w P tu, vu and Dϕ is the Bregman divergence with generator ϕ (the left-hand side is Jensen s inequality). Picking u . v (assuming wlog v ą 0) and letting 2v log ˆ 1 2v 1 v for the choice ϕpzq . logp1 zq and w . v, we obtain after simplification i logp1 qiq ď Yv logp Yvq 1. (18) Analizing function v ÞÑ Yv logp Yvq 1 reveals that it is upperbounded by v ÞÑ v2 if v P r 8{10, 8{10s. Hence, multiplying by n all sides and passing to the exponential, we get that ip1 qiq ď exp nv2 , @0 ď v ă 8 which leads to the statement of the Lemma. Let us come back to our Riemannian summation setting and let qi . p1 tq |Ii|fpξiq, @i P rns, v . max i |qi|. Assume now that f is Riemann integrable, which allows to guarantee that, at least for n large enough and a step of division n not too large, we have |qi| ď 8{10, @i P rns and so we can use Lemma 2. We get from (14) and Lemma 2, if t ă 1 i 1 p1 qiq 1 expp nv vq 1 and we permute the bounds if t ą 1. Importantly, we note that ÿ i qi p1 tq Sp1q npfq. (20) So suppose limnÑ 8 Sp1q npfq . L1 . şb a fpuqu is finite and choose the step of division n not too large so that n max i |qi| p1 tq max i pn|Ii|q|fpξiq| ď K pp1 tq|b a| max fpra, bsqq, for some constant K ě 1 (this is possible because f is (1-)Riemann integrable; we also note that if the division is regular then we can choose K 1). The value of K is not important: what is important is that nv remains finite in (19) as n increases, while limnÑ 8 v 0. Hence, expp nv vq Ñ 1 as n Ñ 8 and (19) implies, because limnÑ 8p1 a{nqn exp a, the two first identities in lim nÑ 8 Sptq npfq 1 1 t 1 1 t pexp pp1 tq L1q 1q (21) logt expp L1q ((21) follows from (20)) and by definition limnÑ 8 Sptq npfq . a fpuqdtu. So we get that Riemann integration (t 1) grants a fpuqdtu logt exp ż b which, in addition to showing (9) (main file) also shows that t 1-Riemann integration is equivalent to all t 1-Riemann integration, ending the proof of Theorem 2. Remark 4. The absence of affine terms in upperbounding v ÞÑ Yv logp Yvq 1 in the neighborhood of 0 is crucial to get to our result. C.II Proof of Theorem 3 Using Theorem 2, we just have to analyze the limit in relationship to the Riemannian case: Dt Fpzq . lim δÑ0 a fpuqdtu at lim δÑ0 logt exp şz δ a fpuqdu at logt exp şz a fpuqdu δ (22) lim δÑ0 1 δ logt exp şz δ a fpuqdu exp şz a fpuqdu lim δÑ0 1 δ logt exp a fpuqdu ż z lim δÑ0 1 δ a fpuqdu ż z . D1Fpzq f, where the last identity is the classical Riemannian case, (22) follows from Theorem 2, (23) follows from a property of logt (logt a a logt b logtpa{bq), (24) follows from the fact that logt exppzq 0 z opzq. This completes the proof of Theorem 3. C.III Additional and helper results for Theorems 2 and 3 We first state a trivial but important Lemma, whose content is partially used in the main file. Lemma 3. gt satisfies the following properties: 1. 
gtpzq is strictly increasing for any t P R, strictly concave for any t ą 1, strictly convex for any t ă 1 and such that signpgtpzqq signpzq, @t P R; 2. Dtgtpzq 1; 3. (t-integral mean-value) Suppose f Riemann integrable over ra, bs. Then there exists c P pa, bq such that pb aq gt1 fpcq a fpuqdtu pt1 . 1 p1 tqpb aqq. We list a series of consequences to both theorems. General properties of t-integrals Some properties generalize those for classical Riemann integration. Theorem C.1. The following relationships hold for any t P R and any functions f, g t-Riemann integrable over some interval ra, bs: a fpuq t or u gpuqdtu padditivityq, a λfpuqdtu λ p1 p1 tqλqż b a fpuqdtu pλ P Rq pdilativityq, a fpuqdtu t c fpuqdtu pc P ra, bsq p Chasles relationshipq, ˇˇˇˇˇ p1 |1 t|qż b a |fpuq|dtu ptriangle inequalityq, a fpuqdtu ď a gpuqdtu pf ď gq pmonotonicityq. Proof. We show additivity for t{ (the same path shows the result for at{ ): a fpuqdtu t a gpuqdtu logt exp ż b a fpuqdu t logt exp ż b a fpuqdu exp ż b logt exp ż b a pf gqpuqdu a pf gqpuqdtu. We show dilativity, using t1 . 1 p1 tqλ: a λfpuqdtu logt exp ż b logt exp λ ż b λ logt1 exp ż b a fpuqdtu λ p1 p1 tqλqż b The triangle inequality follows from the relationship: | logt exppzq| ď log1 |1 t| exp |z|, @z, t P R, from which we use the fact that Riemann integration satisfies the triangle inequality and log1 |1 t| is monotonically increasing on R in the penultimate line of: ˇˇˇˇˇlogt exp ż b ď log1 |1 t| exp ď log1 |1 t| exp ż b p1 |1 t|qż b a |fpuq|dtu. We show Chasles relationship: a fpuqdtu t c fpuqdtu . logt exp ż c a fpuqdu t logt exp ż b a fpuqdu exp ż b a fpuqdu ż b logt exp ż b where the second identity uses the property logt a logt b logtpabq. Monotonicity follows immediately from the fact that z ÞÑ logt exp z is strictly increasing: a fpuqdtu logt exp ż b ď logt exp ż b This ends the proof of Theorem C.1. Computing t-integrals Next, classical relationships to compute integrals do generalize to tintegration. We cite the case of integration by part. Lemma 4. Integration by part translates to t-integration by part as: a fg1du ptq rfgsb a at where we let ptq rhsb a . logt expphpbqq at logt expphpaqq. (Proof immediate from Theorem 2) Geometric properties based on t-integrals This is a more specific result, important in the context of hyperbolic geometry: the well-known Hyperbolic Pythagorean Theorem (HPT) does translate to a tempered version with the same relationship to the Euclidean theorem. Consider a hyperbolic right triangle with hyperbolic lengths a, b, c, c being the hyperbolic length of the hypothenuse. Let at, bt, ct denote the corresponding tempered lengths, which are therefore explicitly related using gt as at logt exp a, bt logt exp b, ct logt exp c. Define the tempered generalization of cosh: cosht z . expt z exptp zq The HPT tells us that cosh c cosh a cosh b. It is a simple matter of plugging gt, using the fact that logt and expt are inverse of each other and simplifying to get the tempered HPT, which we call t-HPT for short. Lemma 5. (t-HPT) For any hyperbolic triangle described above, the tempered lengths are related as cosht ct cosht at cosht bt. 
Now, remark that for any t 0, a series expansion around 0 gives exptpzq 1 tz2 (expt is always infinitely differentiable around 0, for any t) So the t-HPT gives 1 tc2 t 2 opc3 tq ˆ 1 ta2 t 2 opa3 tq ˆ 1 tb2 t 2 opb3 tq , which simplifies, if we multiply both sides by 2{t and simplify it into c2 t opc3 tq a2 t b2 t opa3 tq opb3 tq, which for an infinitesimal right triangle gives c2 t a2 t b2 t, i.e. Pythagoras Theorem, as does the HPT one gives in this case (c2 a2 b2), which is equivalent of the particular t 1-HPT case. t-mean value Theorem The t-derivative yields a generalization of the Euclidean mean-value theorem. Lemma 6. Let t P R and f be continuous over an interval ra, bs, differentiable on pa, bq and such that 1{p1 tq R fpra, bsq. Then Dc P pa, bq such that Dtfpcq pfpbq at fpcqq pfpaq at fpcqq Proof. We can obtain a direct expression of Dtf by using the definition of at and the classical derivative: Dtfpxq lim δÑ0 1 δ fpx δq fpxq 1 p1 tqfpxq 1 1 p1 tqfpxq lim δÑ0 fpx δq fpxq δ f 1pxq 1 p1 tqfpxq. (26) From here, the mean-value theorem tells us that there exists c P ra, bs such that f 1pcq fpbq fpaq Dividing by 1 p1 tqfpcq (assuming fpcq 1{p1 tq) and reorganising, we get Dtfpcq 1 b a fpbq fpaq 1 p1 tqfpcq 1 b a fpbq fpcq 1 p1 tqfpcq 1 b a fpaq fpcq 1 p1 tqfpcq pfpbq at fpcqq pfpaq at fpcqq which completes the proof of the Lemma. In fact, the t-derivative of f at some c is "just" an Euclidean derivative for an affine transformation of the function, namely z ÞÑ fpzq at fpcq, also taken at z c. This "proximity" between t-derivation and derivation is found in the tempered chain rule (proof straightforward). Lemma 7. Suppose g differentiable at z and f differentiable at gpzq. Then Dtpf gqpzq Dtpfqpgpzqq g1pzq. C.IV Sets endowed with a distortion and their t-self: statistical information Here, X contains probability distributions or the parameters of probability distributions: f can then be an f-divergence (information theory) or a Bregman divergence (information geometry). Tsallis divergence and the tempered relative entropy are examples of t-additive information theoretic and information geometric divergences. A key property of information theory is the data processing inequality (D), X being a probability space, which says that passing random variables through a Markov chain cannot increase their divergence as quantified by f [25, 35]. A key property of information geometry is the population minimizer property (P), which elicits a particular function of a set of points as the minimizer of the expected distortion to the set, as quantified by f [3]. We let (J) denote the joint convexity property, which would state for f and any x1, x2, y1, y2 that fpλ x1 p1 λq x2, λ y1 p1 λq y2q ď λfpx1, y1q p1 λqfpx2, y2q, (27) and convexity (C), which amounts to picking y1 y2 (convexity in the left parameter) xor x1 x2 (in the right parameter). Lemma 8. For any t P R, the following holds true: (D) f satisfies the data processing inequality iff f ptq satisfies the data processing inequality; (P) µ P arg minµ ř i fpxi, µq iff µ P arg min µ p tqi f ptqpxi, µq; (28) (J) f satisfies (27) iff the following pt, t1, t2q-joint convexity property holds: f ptqpλ x1 p1 λq x2, λ y1 p1 λq y2q ď λf pt1qpx1, y1q p1 λqf pt2qpx2, y2q, (29) with t1 . mintt, 1 λ λtu and t2 . mintt, λ p1 λqtu. Proof. (D) and (P) are immediate consequences of Lemma 3 (point [1.], main file) and properties of logt. We prove (J). 
Since $g_t$ is strictly increasing for any $t$, we get for $t \le 1$
$f^{(t)}(\lambda x_1 + (1-\lambda) x_2, \lambda y_1 + (1-\lambda) y_2) \le g_t\big(\lambda f(x_1, y_1) + (1-\lambda) f(x_2, y_2)\big) \le \lambda\, g_t \circ f(x_1, y_1) + (1-\lambda)\, g_t \circ f(x_2, y_2) = \lambda f^{(t)}(x_1, y_1) + (1-\lambda) f^{(t)}(x_2, y_2),$ (30)
because $g_t$ is convex. If $t > 1$, we restart from the first inequality and remark that
$g_t\big(\lambda f(x_1, y_1) + (1-\lambda) f(x_2, y_2)\big) = g_t(\lambda f(x_1, y_1)) \oplus_t g_t((1-\lambda) f(x_2, y_2)) \le g_t(\lambda f(x_1, y_1)) + g_t((1-\lambda) f(x_2, y_2)) = \lambda\, g_{1-\lambda+\lambda t} \circ f(x_1, y_1) + (1-\lambda)\, g_{\lambda+(1-\lambda)t} \circ f(x_2, y_2) = \lambda f^{(1-\lambda+\lambda t)}(x_1, y_1) + (1-\lambda) f^{(\lambda+(1-\lambda)t)}(x_2, y_2).$ (31)
The inequality is due to the fact that $a \oplus_t b = a + b + (1-t)ab \le a + b$ if $ab \ge 0$ and $t \ge 1$. The second equality holds because, for $t' \doteq 1-(1-t)b$, we have
$\log_t(a^b) = \frac{a^{(1-t)b} - 1}{1-t} = b \cdot \frac{a^{1-t'} - 1}{1-t'} = b \log_{t'}(a),$
which yields $g_t(\lambda z) = \lambda g_{1-(1-t)\lambda}(z)$ (take $a = \exp z$ and $b = \lambda$ or $1-\lambda$). Putting altogether (30) and (31), we get that for any $t \in \mathbb{R}$,
$f^{(t)}(\lambda x_1 + (1-\lambda) x_2, \lambda y_1 + (1-\lambda) y_2) \le \lambda f^{(\min\{t, 1-\lambda+\lambda t\})}(x_1, y_1) + (1-\lambda) f^{(\min\{t, \lambda+(1-\lambda)t\})}(x_2, y_2),$ (32)
as claimed for (J). This ends the proof of Lemma 8.

We note that (29) also translates into a property for (C); also, if $t \le 1$, then $t = t_1 = t_2$ in (29).

C.V The t-self of the Lorentz model of hyperbolic geometry

As an additional example of use, we consider the Lorentz model for a simple illustration in which one creates approximate metricity in the t-self. This d-dimensional manifold is embedded in $\mathbb{R}^{d+1}$ via the hyperboloid with constant curvature $c < 0$, defined by $\mathbb{H}_c \doteq \{x \in \mathbb{R}^{d+1} : x_0 > 0 \wedge x * x = 1/c\}$, with $x * y \doteq -x_0 y_0 + \sum_{i=1}^d x_i y_i$ the Lorentzian inner product [27, Chapter 3]. In ML, two distortions are considered on $\mathbb{H}_c$ [17], one of which is the "Lorentzian distance", $d_L(x, y) \doteq (2/c) - 2\, x * y$. It is notoriously not a distance because it does not satisfy (T), yet we show that we can pick $t$ and the curvature in such a way that the t-self is arbitrarily close to a metric space.

Lemma 9. $\forall \delta > 0$, pick $t$ and curvature $c$ as $t = 1 + 1/\delta$, $c = -2/\delta$. Then the t-self of $\mathbb{H}_c$ is approximately metric: $d_L^{(t)}$ satisfies (R), (S), (I), and the $\delta$-triangle inequality,
$d_L^{(t)}(x, z) \le d_L^{(t)}(x, y) + d_L^{(t)}(y, z) + \delta, \quad \forall x, y, z \in \mathbb{H}_c.$ (33)

Proof. $d_L^{(t)}$ still obviously satisfies reflexivity, the identity of indiscernibles, and non-negativity, so we check the additional property it now satisfies, the weaker version of the triangle inequality. Given any $x, y, z$ in $\mathbb{H}_c$, the condition $d_L^{(t)}(x, z) \le d_L^{(t)}(x, y) + d_L^{(t)}(y, z) + \delta$ for $t > 1$ reads
$\frac{1 - \exp(-(t-1)\, d_L(x, z))}{t-1} \le \frac{1 - \exp(-(t-1)\, d_L(x, y))}{t-1} + \frac{1 - \exp(-(t-1)\, d_L(y, z))}{t-1} + \delta,$
which simplifies to
$\exp(2(t-1)\, x * y) + \exp(2(t-1)\, y * z) \le \exp\!\left(\frac{2(t-1)}{c}\right) + (t-1)\delta \exp\!\left(\frac{2(t-1)}{c}\right) + \exp(2(t-1)\, x * z).$
We have $x * y \le 1/c$ by definition, so a sufficient condition to get the inequality is to have
$\exp(2(t-1)\, y * z) \le (t-1)\delta \exp\!\left(\frac{2(t-1)}{c}\right).$ (34)
The function $h(z) \doteq z \delta \exp(2z/c)$ is maximal over $\mathbb{R}$ at $z^\star = -c/2$, for which it equals $h(z^\star) = -c\delta/(2e)$; note that the right-hand side of (34) is $h(t-1)$. Fix $t = 1 - c/2$. We then have $\exp(2(t-1)\, y * z) \le \exp(2(t-1)/c) = 1/e$, so to get (34) for this choice of $t$, it is sufficient to pick curvature $c = -2/\delta$, yielding the relationship $t = 1 + 1/\delta$.

C.VI Proof of Lemma 1

We start by proving condition
$d^{(t)} \ge g(k)$ (35)
in the Lemma. Using the proof of [19, Proposition 3.1], we know that any point $x$ that is $k$-close to the boundary (Definition 5.1) satisfies $d^{(t)}(x, 0) = \log_t \exp \log(2 \cdot 10^k - 1) = \log_t(2 \cdot 10^k - 1)$, so to get (35), we want $\log_t(2 \cdot 10^k - 1) \ge g(k)$. Letting $t = 1 - f(k)$ with $f(k) \in \mathbb{R}$, we observe
$\log_t(2 \cdot 10^k - 1) = \frac{(2 \cdot 10^k - 1)^{f(k)} - 1}{f(k)},$
so we want, after taking logs,
$\log(2 \cdot 10^k - 1) \ge \frac{\log(1 + f(k) g(k))}{f(k)}$ (36)
(this also holds if $f(k) < 0$ because then $1 + f(k) g(k) < 1$), and there remains to observe that $\log(2 \cdot 10^k - 1) = k \log(10) + \log(2 - 1/10^k)$ with $\log(2 - 1/10^k) \ge 0$, $\forall k \ge 0$. Hence, to get (36) it is sufficient to request
$\frac{\log(1 + f(k) g(k))}{f(k)} \le k \log(10),$
which is the condition assumed in the Lemma. For such $t$, the new hyperbolic constant satisfies
$\tau_t = \log_t \exp \tau = \frac{\exp(f(k)\tau) - 1}{f(k)}.$
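As a side note, the approximate metricity claimed in Lemma 9 (Section C.V above) is easy to probe numerically. The sketch below is our own illustration, not the paper's code: it lifts random points onto $\mathbb{H}_c$, uses $d_L(x, y) = 2/c - 2\, x * y$ as defined above, and checks the $\delta$-triangle inequality (33) for the choice $t = 1 + 1/\delta$, $c = -2/\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_t(z, t):
    # g_t = log_t o exp, written stably as (exp((1 - t) z) - 1) / (1 - t)
    return z if np.isclose(t, 1.0) else np.expm1((1.0 - t) * z) / (1.0 - t)

def lorentz_point(v, c):
    # lift v in R^d onto the hyperboloid H_c = {x : <x,x>_L = 1/c, x_0 > 0}, with c < 0
    return np.concatenate(([np.sqrt(v @ v - 1.0 / c)], v))

def lorentz_inner(x, y):
    # Lorentzian inner product <x,y>_L = -x_0 y_0 + sum_i x_i y_i
    return -x[0] * y[0] + x[1:] @ y[1:]

def d_L(x, y, c):
    # "Lorentzian distance" (not a metric): d_L(x, y) = 2/c - 2 <x,y>_L
    return 2.0 / c - 2.0 * lorentz_inner(x, y)

delta = 0.05
t, c = 1.0 + 1.0 / delta, -2.0 / delta          # the choice of Lemma 9

worst = 0.0
for _ in range(2000):
    x, y, z = (lorentz_point(rng.normal(size=3), c) for _ in range(3))
    gap = g_t(d_L(x, z, c), t) - g_t(d_L(x, y, c), t) - g_t(d_L(y, z, c), t)
    worst = max(worst, gap)
print(worst <= delta)   # True: the delta-triangle inequality holds on the sample
```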
C.VII Proof of Theorem 1

Take any observation x ∈ X. Trivially, H'(x) is a real value that belongs to the path followed by x in H. Furthermore, if we consider any path from the root to a leaf in H, all the vertices in its strictly monotonically increasing subsequence of absolute confidences appear in H' (conditions 4, 8 in GETMDT), which guarantees the condition on prediction in (M). Hence (M) is satisfied.

Figure I: Schematic description of our modification (bottom) of Sarkar's algorithm (top). Our modification solely changes the step in which the children of a node (here, a) are computed, this node having been reflected back to the origin. Instead of using a fixed angle and length to position arcs (and thus children of the node), the angle depends on the number of leaves that can be reached from a child, and the length depends on the difference in absolute confidence between the node and the corresponding child. We also define an angle which represents the domain (before reflecting back) in which the embedding is going to take place, shown with the thick dashed line (here, this angle is π).

C.VIII Modifying Sarkar's embedding in the Poincaré disk

Sarkar's algorithm [30] gives a clean low-distortion embedding when the tree is binary or the arc length is constant [29]. Things are different in our case: MDT nodes have arbitrary out-degrees, and lengths depend on the absolute confidence at the corresponding MDT nodes. Furthermore, a direct implementation of Sarkar's algorithm would violate the constraint of strict path-monotonicity in an MDT, namely that nodes along a path from the root need to progressively come closer to the border; equivalently, it would create substantial embedding errors for confidences and violate (B). Without focusing on an optimal solution (which we leave for future work), one can remark that all these problems can be heuristically addressed by changing one step of Sarkar's algorithm, replacing the use of the total 2π fan-out angle for mapping children (Step 5 in Algorithm 1 of [29]) by a variable angle with a special orientation in the disk. We refer to the concise and neat description of Sarkar's embedding in [29] for the full algorithm. Our modification relies on changing one step of the algorithm, as described in Figure I. The key step that we change is Step 5 in the description of [29, Algorithm 1]. This step embeds the children of a given node (the algorithm then proceeds recursively until all nodes are processed). Sarkar's algorithm corresponds to the simple case where all arcs to/from a node define a fixed angle, which does not change during reflection because the Poincaré model is conformal. Hence, if the tree is binary, this angle is 2π/3, which provides a very clean display of the tree. In our case, however, some children may have just one arc to a leaf while others may support big subtrees. Also, arc lengths can vary substantially. We thus design the region of the disk into which the subtrees are going to be embedded by choosing an angle proportional to the number of leaves reachable from the node, and of course lengths have to match the difference in absolute confidence between a node and its children. There is no optimization step to learn a clean embedding, so we rely on a set of hyperparameters to effectively compute this new step of the algorithm.
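For concreteness, here is a minimal geometric sketch of the modified child-placement step in Python. It is only an illustration of the heuristic just described, not the paper's implementation: the function names, the default fan-out angle, the length scale, and the way the parent direction is handled are our own assumptions, and the hyperparameters mentioned above are collapsed into two arguments. The reflection to the origin is done with the standard Möbius isometry of the disk.

```python
import numpy as np

def to_origin(z, a):
    # Moebius isometry of the Poincare disk sending a to the origin (complex coordinates)
    return (z - a) / (1.0 - np.conj(a) * z)

def from_origin(z, a):
    # inverse map: sends the origin back to a
    return (z + a) / (1.0 + np.conj(a) * z)

def disk_radius(hyperbolic_length):
    # Euclidean radius of a point at the given hyperbolic distance from the origin
    return np.tanh(hyperbolic_length / 2.0)

def place_children(node, parent, children, fan=np.pi, scale=1.0):
    """Sketch of the modified Step 5: position the children of `node` (complex position),
    whose parent sits at `parent`. Each child is a pair (reachable_leaves, conf_gap),
    conf_gap being the difference in absolute confidence between node and child."""
    p = to_origin(parent, node)                    # reflect the node back to the origin
    away = np.angle(p) + np.pi                     # orient the fan away from the parent arc
    leaves = np.array([float(n) for n, _ in children])
    widths = fan * leaves / leaves.sum()           # angle proportional to reachable leaves
    lo = away - fan / 2.0
    placed = []
    for (_, gap), width in zip(children, widths):
        r = disk_radius(scale * gap)               # arc length from the confidence gap
        placed.append(from_origin(r * np.exp(1j * (lo + width / 2.0)), node))
        lo += width
    return placed
```

For instance, place_children(0.2 + 0.1j, 0.0j, [(3, 0.7), (1, 1.2)]) spreads the subtree with three reachable leaves over three quarters of the fan and places the child with the larger confidence gap at a larger hyperbolic distance from the node.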
C.IX Boosting with the logistic loss à la AdaBoost

Algorithm 2 LOGISTICBOOST(S)
Input: Labeled sample $S \doteq \{(x_i, y_i), i \in [m]\}$, number of iterations $T \in \mathbb{N}_{>0}$;
1 : $w_{1i} \leftarrow 1/2$, $\forall i \in [m]$; // initialize all weights (equivalent to maximally unconfident prediction)
2 : for $j = 1, 2, ..., T$
3 : $H_j \leftarrow$ WEAK-LEARN$(S, w_j)$; // request a DT as a "weak hypothesis"
4 : pick leveraging coefficient $\alpha_j \in \mathbb{R}$ for $H_j$;
5 : for $i = 1, 2, ..., m$ // weight update, not normalized
$w_{(j+1)i} \leftarrow \frac{w_{ji}}{w_{ji} + (1 - w_{ji}) \exp(\alpha_j y_i H_j(x_i))};$ (37)
Output: Classifier $H_T \doteq \sum_j \alpha_j H_j(\cdot)$;

The link between distances in the Poincaré model of hyperbolic geometry and the canonical link of the log-loss (3) can be extended to leveraging coefficients in boosted combinations [31]. To get there, we need a tailored boosting algorithm which parallels AdaBoost's design tricks (hence different from LogitBoost [13]). We want a boosting algorithm for the linear combination of DTs whose classification can be easily and directly embedded in the Poincaré disk. Algorithm LOGISTICBOOST is provided above. For the weight update, we refer to [23]. We already know how to embed DTs via their Monotonic DTs. What a boosting algorithm of this kind does is craft
$H_T(\cdot) = \sum_{j=1}^T u_j(\cdot), \quad u_j(x) \doteq \alpha_j H_j(x).$ (38)
Remark that we have merged the leveraging coefficients and the DTs' outputs $H_\cdot$ on purpose. To compute the leveraging coefficients $\alpha_\cdot$ in Step 4, we use AdaBoost's secant approximation trick (explained in [31, Section 3.1]), applied not to the exponential loss but to the logistic loss: for any $z \in [-R, R]$ and $\alpha \in \mathbb{R}$, letting $u \doteq z/R$,
$\log(1 + \exp(-\alpha z)) \le \frac{1+u}{2} \log(1 + \exp(-\alpha R)) + \frac{1-u}{2} \log(1 + \exp(\alpha R)).$ (39)
Hence, to compute the leveraging coefficient $\alpha_j$ that approximately minimizes the current loss, $\sum_i w_{ji} \log(1 + \exp(-\alpha y_i H_j(x_i)))$, we minimize instead the upper bound obtained from (39). Letting
$(\psi_{\mathrm{LOG}})^\star_j \doteq \max_{\lambda \in \Lambda(H_j)} \log \frac{p_{j\lambda}}{1 - p_{j\lambda}}$
(note the slight abuse of notation for readability, since we write $(\psi_{\mathrm{LOG}})^\star_j$ instead of $|(\psi_{\mathrm{LOG}})^\star_j|$; we remind that $\Lambda(\cdot)$ denotes the set of leaves of a tree; the index $j$ in the local proportion of positive examples $p_{j\lambda}$ is a reminder that the weights used are boosting's weights), which we note can be directly approximated on the Poincaré disk by looking at the leaf nearest to the border of the disk. Because the maximal absolute confidence in the DT is also the maximal absolute confidence in its MDT, we obtain the sought minimum,
$\alpha_j = \frac{1}{(\psi_{\mathrm{LOG}})^\star_j} \log\!\left(\frac{1 + r_j}{1 - r_j}\right),$
where $r_j \in [-1, 1]$ is the normalized edge
$r_j \doteq \frac{\sum_{i \in [m]} \tilde{w}_{ji} y_i H_j(x_i)}{\max_k |H_j(x_k)|} = \frac{1}{(\psi_{\mathrm{LOG}})^\star_j} \sum_{i \in [m]} \tilde{w}_{ji} y_i \log \frac{p_{j\lambda(x_i)}}{1 - p_{j\lambda(x_i)}},$
where $\tilde{w}_j$ denotes the normalized weights. It is not hard to show that, because we use the local posterior $p_{j\lambda}$ at each leaf, $\alpha_j \ge 0$ and also $r_j \ge 0$. Hence, everything is as if we had an imaginary node $\nu_j$ with $p_{j\nu_j} \doteq (1 + r_j)/2$ ($\ge 1/2$) and positive confidence
$(\psi_{\mathrm{LOG}})_j \doteq \psi_{\mathrm{LOG}}(p_{j\nu_j}) = \log \frac{p_{j\nu_j}}{1 - p_{j\nu_j}}$
that we can display in the Poincaré disk (we choose to do it as a circle; see Figure II).

Figure II: One MDT learned on UCI domain online_shoppers_intention, reproducing the left of Figure 2 in the main file, with additional details for leveraging coefficients (legend: best confidence in the DT; normalized DT confidence; node embedding error; isolines (posterior); root (square node); red / green = majority class; arc thickness = number of tests in the DT to reach the child node; arcs going outwards; leveraging coefficient).
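To make Algorithm 2 and the choice of $\alpha_j$ concrete, here is a compact sketch in Python. It is our own illustration under stated assumptions: WEAK-LEARN is stood in for by a weighted scikit-learn decision tree, and $H_j(x)$ is taken to be the log-odds of the leaf posterior reached by $x$, as in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def leaf_confidence(tree, X, eps=1e-6):
    # signed confidence at the leaf reached by each row: log(p / (1 - p)),
    # where p is the leaf's (weighted) proportion of positive examples
    p = np.clip(tree.predict_proba(X)[:, 1], eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def logistic_boost(X, y, T=10, max_leaf_nodes=8):
    """Sketch of LOGISTICBOOST; y in {-1, +1}. The weak learner here is an
    illustrative stand-in, not the paper's learner."""
    m = X.shape[0]
    w = np.full(m, 0.5)                                   # Step 1: maximally unconfident weights
    ensemble = []
    for _ in range(T):
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, (y > 0).astype(int), sample_weight=w)        # Step 3: weak hypothesis H_j
        H = leaf_confidence(tree, X)
        psi_star = np.abs(H).max()                               # (psi_LOG)*_j: best |confidence|
        r = np.clip((w / w.sum()) @ (y * H) / psi_star, -0.999, 0.999)  # normalized edge r_j
        alpha = np.log((1.0 + r) / (1.0 - r)) / psi_star         # Step 4: leveraging coefficient
        w = w / (w + (1.0 - w) * np.exp(alpha * y * H))          # Step 5: update (37), unnormalized
        ensemble.append((alpha, tree))
    return lambda Xq: sum(a * leaf_confidence(tr, Xq) for a, tr in ensemble)
```

The sign of the returned score plays the role of $H_T$'s prediction, and each alpha equals $(\psi_{\mathrm{LOG}})_j / (\psi_{\mathrm{LOG}})^\star_j$ since $\log((1+r_j)/(1-r_j)) = \log(p_{j\nu_j}/(1-p_{j\nu_j}))$.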
We deduce from (38) that
$u_j(x) = \frac{(\psi_{\mathrm{LOG}})_j}{(\psi_{\mathrm{LOG}})^\star_j} \cdot \log \frac{p_{j\lambda(x)}}{1 - p_{j\lambda(x)}},$ (40)
and note that all three key parameters can easily be displayed or computed directly from the Poincaré disk:
If we use MDTs directly for prediction, $|H_j(x)| = |\log(p_{j\lambda(x)} / (1 - p_{j\lambda(x)}))|$ is read directly from the MDT's plot; the absolute value can be removed and the true sign guessed from the node's color (or shape); if we use DTs for prediction, the MDT's plot gives an approximation of the DT's output which is 100% accurate for all leaves of the MDT (each corresponds to a leaf in the DT).
If $H_j$ is leveraged (in an ensemble), $(\psi_{\mathrm{LOG}})^\star_j$ is read directly from the plot, as well as $(\psi_{\mathrm{LOG}})_j$ in (40). Their ratio gives the leveraging coefficient.
We note that $(\psi_{\mathrm{LOG}})^\star_j$ is the same for both the DT and its corresponding MDT. Only the value of $(\psi_{\mathrm{LOG}})_j$ can differ between the DT and the MDT. We note that regardless of whether we use the DT or its MDT to classify, the MDT's representation is also fully accurate for the DT for a subset of its leaves.

D Supplementary material on experiments

D.I Domains

The domains we consider are all public domains, from the UCI repository of ML datasets [12], OpenML, or Kaggle; see Table I.

Domain                                   m        d    License
breastwisc                               699      9    CC BY 4.0
ionosphere                               351      33   CC BY 4.0
tictactoe                                958      9    PDDL
winered                                  1 599    11   CC BY 4.0
german                                   1 000    20   CC BY 4.0
analcatdata_supreme                      4 053    8    CC BY 4.0
abalone                                  4 177    8    CC BY 4.0
qsar                                     1 055    41   CC BY 4.0
hillnoise                                1 212    100  CC BY 4.0
firmteacher                              10 800   16   CC BY 4.0
online_shoppers_intention                12 330   17   CC BY 4.0
give_me_some_credit                      120 269  11   None
Buzz_in_social_media (Tom's hardware)    28 179   96   CC BY 4.0
Buzz_in_social_media (twitter)           583 250  78   CC BY 4.0

Table I: UCI, OpenML (analcatdata_supreme) and Kaggle (give_me_some_credit) domains considered in our experiments (m = total number of examples, d = number of features), ordered in increasing m × d. Dataset licenses are listed in the last column.

D.II Visualizing a DT via its MDT and embedding errors

All test errors are estimated from a 10-fold stratified cross-validation. One can always choose to directly learn Monotonic DTs instead of DTs, given their natural fit for embedding in the Poincaré disk. In this case, the hyperbolic representation of the MDT can be immediately used for assessment. Suppose we stick to learning DTs (because, e.g., they have been standard in ML for decades) and wish to use the visualization of the corresponding MDT to make inferences on the original DT itself. Can we make reliable conclusions?

domain                      DT             MDT            p-val
breastwisc                  4.15 ± 2.18    5.00 ± 2.38    0.2408
ionosphere                  5.71 ± 2.70    9.11 ± 5.34    0.0237
tictactoe                   2.61 ± 1.85    2.09 ± 1.10    0.1390
winered                     18.39 ± 2.02   18.32 ± 2.27   0.9227
german                      24.00 ± 4.37   23.90 ± 4.25   0.8793
analcatdata_supreme         23.40 ± 1.85   22.56 ± 1.75   0.0134
abalone                     22.15 ± 2.20   21.14 ± 2.44   0.1267
qsar                        12.98 ± 2.63   12.89 ± 4.05   0.8772
hillnoise                   37.04 ± 2.73   45.88 ± 4.72   0.0008
firmteacher                 6.87 ± 1.23    7.04 ± 1.17    0.4292
online_shoppers_intention   9.90 ± 0.78    10.67 ± 0.54   0.0057
hardware                    1.33 ± 0.27    1.44 ± 0.19    0.0534

Table II: Results of the experiments checking whether "reducing" the interpretation of a DT to that of its MDT (using its hyperbolic embedding) can give accurate information about the tree as well. Numerical columns, from left to right, give the average ± std dev error for DTs and MDTs and provide the p-value of a Student paired t-test with H0 being the identity of the average errors. Entries in bold face correspond to keeping H0 at a 0.05 first-order risk. See text for details.
We explore this question by observing that the prediction from DT to MDT can only change if the leaf λ that an example reaches in the DT satisfies two conditions: (a) λ does not appear as a leaf in the MDT, and (b) it is tagged to an internal node of the MDT with a confidence of the opposite sign (this is (D1), Step 3 in GETMDT). Without tackling the formal aspect of this question, we have designed a simple experiment for a simple assessment of whether / when this is reasonable. We have trained a set of T = 200 boosted DTs, each with maximal size 200 (total number of nodes). After having computed the corresponding MDTs, we have compared the test errors of the boosted set of trees and that of the ensemble where each DT is replaced by its MDT, but the leveraging coefficients do not change. Intuitively, if test errors are on par, the variation in classification of DTs vs MDTs (including confidences) is negligible, and we can "reduce" the interpretation of the DT to that of its MDT in the Poincaré disk. The results are summarized in Table II.

From Table II, we can safely say that our hypothesis reasonably holds in many cases, with two important domains for which it does not: ionosphere and hillnoise. For the former domain, we attribute it to the small size of the domain, which prevents training big enough trees; for the latter domain, we attribute it to the fact that the domain contains substantial noise, which makes it difficult to substantially improve posteriors by splitting and thus makes many DT nodes, including leaves, "disappear" in the MDT conversion (Steps 2 and 14 in GETMDT).

domain                      #1     #2     #3     #4    #5    #6    #7    #8    #9    #10
online_shoppers_intention   5.42   3.74   5.46   6.42  7.00  7.96  4.70  5.68  7.93  6.22
analcatdata_supreme         7.11   8.32   8.23   8.67  7.60  6.15  7.85  6.55  6.82  6.63
abalone                     9.37   6.53   8.20   7.61  8.18  8.20  5.91  5.88  6.51  5.43
winered                     12.13  11.87  8.56   6.31  4.44  3.13  2.09  1.43  3.84  3.03
qsar                        16.78  12.97  12.86  5.77  4.24  2.55  1.68  2.92  2.20  4.13
german                      14.74  10.06  7.73   5.05  2.70  2.36  1.26  1.19  2.07  1.57
ionosphere                  15.21  9.84   8.20   6.88  3.45  5.49  8.34  4.17  5.86  8.82
hardware                    7.78   8.03   3.44   3.49  2.48  1.10  0.74  0.81  0.55  0.45
kaggle                      4.92   5.84   6.24   7.07  7.87  8.06  8.06  8.70  8.89  7.89

Table III: Embedding error ρ (5) for the visualization of the first ten MDTs (#1 to #10) in a learned ensemble.

Table IV: First 10 Monotonic Decision Trees (MDTs) embedded in the Poincaré disk, corresponding to several Decision Trees (DTs) with ≤ 200 nodes each, learned by boosting the log / logistic loss on UCI online_shoppers_intention. Isolines correspond to the nodes' prediction confidences, and geometric embedding errors (ρ%) are indicated. See text for details.

Table III summarizes the visualization errors of the MDT embeddings obtained in Tables IV to XII. A key feature of the embeddings is that on some domains (e.g. german) the error of the embedding dramatically decreases with the index of the MDT in the ensemble. The reason why this happens is simple: the weight update of boosting tends to equalize weights between classes (to make the problem "harder"); thus, new trees tend to have their roots embedded closer to the origin, which makes it easier for the remaining trees to be embedded with good visualization at low error.

D.III All Poincaré disk embeddings

They are presented in Tables IV through XII, showing the first models learned in a fold of the cross-validation experiments
producing Table II.

Table V: First 10 MDTs for UCI analcatdata_supreme. Convention follows Table IV.
Table VI: First 10 MDTs for UCI abalone. Convention follows Table IV.
Table VII: First 10 MDTs for UCI winered. Convention follows Table IV.
Table VIII: First 10 MDTs for UCI qsar. Convention follows Table IV.
Table IX: First 10 MDTs for UCI german. Convention follows Table IV.
Table X: First 10 MDTs for UCI ionosphere. Convention follows Table IV.
Table XI: First 10 MDTs for UCI hardware. Convention follows Table IV.
Table XII: Plot of the first 10 MDTs learned on Kaggle give_me_some_credit (top panel and bottom panel). In each panel, we plot the embedding in B (top row) and the t-self B_1^(0) (t = 0, bottom row). Remark the ability of the t-self to display a clear difference between the best subtrees, subtrees that otherwise appear quite equivalent in terms of confidence from B alone.
Table XIII: t-self B_1^(t) of the Poincaré disk B (leftmost panel), for several values of t (panels: t = 1.0, i.e. B itself, then t = 0.6, t = 0.3, t = 0.0). We plot the same set of isolines of B (and their mapping via p_t), parameterized by probability p ∈ [0, 1], which gives the norm r = |2p − 1| in B (from the innermost to the outermost, p ∈ {0.6, 0.7, 0.8, 0.9, 0.999}, or by symmetry for p' = 1 − p, also indicated). Remark the equidistance of isolines in B, approximately kept in B_1^(t) for p up to 0.9 (p' = 0.1), while the distortion we need clearly happens near the border: the outermost isoline, p ∈ {0.001, 0.999} (plotted with bigger width), is smoothly and substantially "moved" within the t-self as t decreases, guaranteeing good readability and coding convenience (see text).

D.IV Interest of the t-self for visualization

We now test the experimental impact of switching to the t-self of the Poincaré disk model (see Table XIII for a clear depiction of how changing t can yield better visualization close to the border of the disk). Recall that switching to the t-self moves the model parts close to ∂B away from the border but keeps a low non-linear distortion around the center, which is thus roughly only affected by a scaling factor. Figure I presents a few results. For domain online_shoppers_intention, we note that the part of the tree that is within the isoline defined by posterior p ∈ {0.1, 0.9} indeed just gets scaled: both plots look quite identical. Very substantial differences appear near the border: the best parts of the model could easily
be misjudged as equivalent from B alone (orange rectangle), but there is little doubt from B_1^(t) that one of them, which crosses the p ∈ {0.001, 0.999} isoline, is in fact much better than the others. When many subtrees seem to aggregate near the border, as in buzz_in_social_media, stark differences can appear in the t-self: the best subtrees are immediately spotted from it (orange rectangles). In between, the t-self provides a much more visible ordering of the best nodes and subtrees compared to the Poincaré disk. hardware demonstrates that such very good nodes, which are hard to differentiate from the others in B, can appear at any iteration.

Figure I: Top pane: comparison between the Poincaré disk embedding and its t-self for t = 0, for an MDT learned on UCI online_shoppers_intention (left) and twitter (right). On the left panel, we have not plotted the boosting coefficients information. The isoline distinguished in magenta is the big-width one in Table XIII. Note the difference in (non-linear) distortion created in the t-self, in which the central part just enjoys a scaling. Bottom pane: stark differences in visualization between B and the t-self do not just appear initially: they can appear at any iteration. We display two MDTs learned on UCI hardware at two different iterations (indicated). It would be very hard to differentiate the best leaves from just the Poincaré disk embedding, while it becomes obvious from the t-self (orange arrows). Note also the set of red nodes in the Poincaré disk of the first of these two MDTs that mistakenly look aligned, but not in the t-self (blue rectangle). See text for details.

NeurIPS Paper Checklist
1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Claims are supported by both the theoretical results presented in the paper or experimental exploration.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Limitations of the proposed embedding method are discussed. For instance, the requirement of only embedding the MDT is discussed. Impacts on classification due to this are explored in the MDT vs DT classification discussion, with further details in the Appendix. A last paragraph in the conclusion also discusses limitations.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Assumptions are detailed in formal results and full proofs are provided in the Appendix.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.
4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Detailed settings for constructing the visualizations are presented. Code is provided with an example public domain and a resource file + README.txt for quick testing and validation.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: Code is provided with an example public domain and a resource file + README.txt for quick testing and validation.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: Details are provided in the main text and Appendix.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Error is stated whenever suitable. Multiple results are presented for visualizations.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: The code runs on any standard computer.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The research of the paper follows the code of ethics.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: Broader impact and societal impact are briefly mentioned throughout the main text, with further discussion presented in an Appendix section.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: No release of data or models.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: Appropriate credit / references are provided. Licenses are listed in the Appendix.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: No new assets provided.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: No crowdsourcing or research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: No research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.