# generative_trees_adversarial_and_copycat__3d14c1d8.pdf

Generative Trees: Adversarial and Copycat

Richard Nock 1 Mathieu Guillame-Bert 1

Abstract While Generative Adversarial Networks (GANs) achieve spectacular results on unstructured data like images, there is still a gap on tabular data, data for which state of the art supervised learning still favours decision tree (DT)-based models. This paper proposes a new path forward for the generation of tabular data, exploiting decades-old understanding of the supervised task s best components for DT induction, from losses (properness), models (tree-based) to algorithms (boosting). The properness condition on the supervised loss which postulates the optimality of Bayes rule leads us to a variational GAN-style loss formulation which is tight when discriminators meet a calibration property trivially satisﬁed by DTs, and, under common assumptions about the supervised loss, yields one loss to train against them all for the generator: the χ2. We then introduce tree-based generative models, generative trees (GTs), meant to mirror on the generative side the good properties of DTs for classifying tabular data, with a boosting-compliant adversarial training algorithm for GTs. We also introduce copycat training, in which the generator copies at run time the underlying tree (graph) of the discriminator DT and completes it for the hardest discriminative task, with boosting compliant convergence. We provide experiments on tasks including fake/real distinction and missing data imputation.

1. Introduction

Generative Adversarial Networks have early established a gold standard for both neural networks as generative models and the loss to train generative models via a variational measure-based distortion (Goodfellow et al., 2014; Nowozin et al., 2016; Nock et al., 2017). While they have achieved spectacular results on a variety of unstructured data (Ni

1Google Research. Correspondence to: Richard Nock <richardnock@google.com>.

Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

et al., 2021), the quality of outcomes on tabular data is still lagging behind with the sentiment that new approaches are needed (Camino et al., 2020). This is an important problem: recently, tabular data was still representing the most prevalent data type in real world AI (Chui et al., 2018, pp. 15). Interestingly, this chasm separating astonishing generation on unstructured data to suboptimal generation on tabular data mirrors another one, on the supervised side, where neural nets can achieve superhuman recognition on unstructured data (Linsley et al., 2021) but require massive amounts of sophistication to compete against standard libraries using decision-tree (DT) based models on tabular data (Arık & Pﬁster, 2021). DT induction has been perfected over decades, starting with core supervised loss functions known as proper (Savage, 1971; Reid & Williamson, 2011), using particularly ﬁt and simple graph-based tree models (Breiman et al., 1984; Quinlan, 1993), and culminating with a powerful algorithmic machinery to learn them, boosting (Kearns & Mansour, 1996; Friedman et al., 2000; Schapire & Singer, 1998). One would expect that potential generative approaches for tabular data would mirror those three key components on the generative side, but to our knowledge, none has been achieved. Such is our objective, and our paper thus contains three main technical contributions: On losses, the GAN approach formulates the generator s loss from a variational measure-based divergence, unveiling the discriminator s loss (Nowozin et al., 2016). Instead, we start from the discriminator s side and a general proper loss, i.e. a loss for which Bayes prediction is optimal, which is standard for DT induction since Breiman et al. (1984). We relate the corresponding information (De Groot, 1962) to a GAN-style formulation which provides us with the generator s loss. A difference with GANs variational formulation is there is no slack in the characterisation if the discriminator meets a calibration condition trivially satisﬁed by DTs: unlike e.g. Nowozin et al. (2016, Ineq. (4)), we get identities all the way through. A surprising corollary follows. If the discriminator s partial losses meet a property that most popular choices meet, then to minimize the generator s loss, it is sufﬁcient to minimize the χ2 between real and fake data: we get one loss to train generators against them all . This ﬁrst contribution is not speciﬁc to DTs as it holds for all calibrated discriminators in the properness framework. On models, we introduce generative trees (GTs). In the same way as generator and discriminator in GANs include

Generative Trees: Adversarial and Copycat

us ctgan ground truth

rand Gauss abalone iris

sepal length

petal length

longest shell

whole weight

Figure 1. 2D density plots for our generative trees (us) vs CTGANs on 2 UCI + 1 simulated domain. See Section 6 for details. Note how GTs improve on CTGANs.

a similar functional form (a neural net), our GTs include a tree (graph) structure like DTs, differences being stochastic activations at the arcs and leaf-dependent data generation. On algorithms, we propose a top-down induction algorithm to adversarially train GTs with provable boosting-compliant geometric convergence of the χ2, the weak generative learning assumption being a weak statistical dependence between the generator and discriminator. We propose a second way to train generative trees, extremely efﬁcient and that we think has no equivalent yet in neural networks. In this setting, that we nickname copycat, the generator tracks and copies the discriminator s tree (graph) at training time, and completes it for the hardest generative model given the discriminator1. The geometric convergence in density ratio loss of the generator (Menon & Ong, 2016) directly follows from a seminal result of Kearns & Mansour (1996). From an experimental standpoint, we have performed four main series of experiments on missing data imputation, training from generated data, fake/real discrimination and generated data augmentation, against each experiment s state of the art contender (van Buuren & Groothuis-Oudshoorn, 2011; Xu et al., 2019). Figure 1 provides a glimpse of generated data s quality for three domains, against neural nets (CTGANs, Xu et al. (2019)). The domains we used included simulated domains and domains from the UCI, Kaggle and the Stanford Open Policing project. The experiments display that GTs can be very efﬁcient contenders against sophisticated state of the art methods: on training from generated data, fake/real discrimination and generated data augmentation, GTs tend to get better results than neural nets, sometimes with tiny models relative to domain size; on missing data imputation, GTs can beat on low-dimensional problems the mice approach (van Buuren & Groothuis Oudshoorn, 2011), even when mice relies on tree-based

1For the informed reader, the generator learns boosting s balanced distribution of Kearns & Mansour (1996).

Other debtors in {none}

Number existing credits <= 2

(0.5, False)

Other debtors in {none}

(0.9, True) (0.1, False)

Number existing credits <= 2

(0.5, True)

Figure 2. Decision tree (left) and generative tree (right) with the same underlying tree, for the UCI german credit domain.

imputation using thousands+ of tree models against a GT essentially relying on a single one. Experiments are given in extenso in an Appendix (App), also containing all proofs.

2. Basic deﬁnitions

@k P N , we let rks . t1, 2, ..., ku. X denotes a domain, S . txi : i P rmsu Ă X is a sample of real observations. The associated supervised learning problem is a binary labeled problem where labels Y . t 1, 1u . tfake, realu distinguish between a fake (=generated) and a real observation. The objective of the supervised problem is to learn a posterior computing Pr Y 1|Xs, denoted η P r0, 1s X. With slight variations, many notations follow from (Reid & Williamson, 2011). π . Pr Y 1s is the prior. In the generative game, the prior is user-ﬁxed. p X, Pq and p X, Nq are measure spaces for positive/real and negative/fake observations respectively to avoid notation overloads, we leave implicit the σ-algebra. p Xˆt 1, 1u, Dq is the product measure space of labeled examples following the (supervised) binary task pπ, P, Nq Reid & Williamson (2011, Section 4); we let B . pπ, P, Nq for short. We also have the mixture space p X, Mq with M . π P p1 πq N. A posterior is particularly interesting for B, Bayes posterior, which is:

and is optimal for proper losses (more on this in Section 4).

We present the tree-based models we use as architectures for both the discriminator and the generator. Architectures: we start by the commonpoint between both, that we denote a tree for short.

Deﬁnition 3.1. A tree is a rooted, directed binary tree whose internal nodes are labeled with binary tests over observation variables and outgoing arcs are labeled with truth values. For any internal node, the left outgoing arc is labeled with truth value false and the right outgoing arc is labeled with true. Leaves are blank nodes.

Deﬁnition 3.2. A decision tree (DT) is a tree with leaves labeled in r0, 1s. A generative tree (GT) is a tree in which truth values at arcs are associated to Bernoulli events Bp.q.

Figure 2 presents examples of DT and GT with the same underlying tree. We assume without loss of generality that

Generative Trees: Adversarial and Copycat

trees are binary but our deﬁnitions could trivially be extended to trees of any arity. Hereafter, low caps like h are used to represent DTs while high-caps like G are used to represent GTs. Λp.q denotes the set of leaves of a tree. Access routines: an important routine needed for a DT h is, for any observation x, the leaf λpxq P Λphq reached by x. This is the leaf whose path from the root involves tests satisﬁed by x. If x contains no unknown feature values, this path is unique. The main access routine for a GT G is the generation of an observation. To do so, we simply stochastically traverse the tree using the Bernoulli events at the internal nodes. Once a leaf λ P Λp Gq is reached, sampling an observation is done by a uniform sampling in the complete domain that satisﬁes the tests traversed to reach λ. In Figure 2, the center leaf λ of the GT G is reached with probability q 0.1 0.5 0.05. If we do reach it, according to the UCI german credit data domain (Dua & Graff, 2021), then we sample uniformly at random an observation for which attribute Number existing credits is in t0, 1, 2u and Other debtors is in tco-applicant, guarantoru and all other attributes are chosen uniformly at random in their full domain, since they do not appear in the path to λ.

Remark 3.3. Uniform sampling imposes a ﬁnite length domain for real or integer features, which is a reasonable assumption for standard features like e.g. age, salary. Alleviating the constraint can be done using speciﬁc transformations, such as the Box-Muller transform, generating Normal deviates from uniform distributions (Box & Muller, 1958).

4. Loss functions involved

Departing from (W)GAN-style approaches, we design the losses involved from the discriminator s. Calibrated posteriors: for any function f P RX and measure Q over measurable space p X, Ωq, Qf is the restriction of Q to the sub-σ-algebra Ωf induced by the level set of f. A similar notation Qf with identical deﬁnition is used in van Erven & Harremo es (2014, Section II). It can be interpreted as the marginal of Q on the subset of events of Ωf, each of which is a union of events from Ωhaving the same f-value. We now deﬁne a property of a posterior for class probability estimation that shall be fundamental to analyse our losses.

Deﬁnition 4.1. Posterior η is said calibrated with task B (or just calibrated for short) iff η π d P η

d M η , and we let B η . pπ, P η, N ηq.

There are three important examples of calibrated posteriors: (i) the constant posterior ηπ . π is (the only constant) calibrated (posterior). To see it, it yields Ωηπ t H, Xu. The RHS in Def. (4.1) gives π d Pηπp Xq{d Mηπp Xq . π ş

X d M . π 1{1 π and we have for this posterior Pr Y|Xs Pr Y|X P Xs ηπ . π; (ii) Bayes posterior η

is calibrated; it follows from (1); (iii) let h be a DT. Any DT induces a partition of X at its leaves, t Xλ : λ P Λphqu;

Loss ℓ 1puq ℓDRpρq, ρ ě 0 Eq. cvx Œ

Log log2p1 uq log2 1 1

1 2 1 1 ρ 2

Jeffreys 2 log u 1 u 1 1 u

2 log 1 ρ 1

KL 2 2 logp1 uq u 1 u

2 2 log ρ 1 ρ 1

Normalized χ2 2 u2 p1 uq2 2 ρ2

Table 1. Differentiable partial losses for class 1 for symmetric losses, along with their corresponding ℓDR and its properties (see text; cvx = convex). (*) add +2 to Pearson χ2 to have ℓ 1p0q 0.

suppose without loss of generality that all leaves predictions are different, and consider Ωh t Hu Y t Xλ : λ P Λphqu. Without further correction, the posterior prediction at a leaf λ of h is classically computed as the ratio of the total weight of real observations reaching λ over the total weight of observations reaching λ. In mathematical form, we have here Pr Y|Xs Pr Y|X reaches λs πd Ph{d Mh, which is by deﬁnition the prediction of η at λ and shows that the posterior prediction of any DT is calibrated. We note that (i) is a particular case of (iii) when h is reduced to its root, and if the domain X is ﬁnite, then (ii) is a particular case of (iii) for h being any complete (ﬁnite) DT. Hereafter, a tilda like η denotes a calibrated posterior. Notation η denotes any posterior, disregarding eventual additional properties. Losses for class probability estimation: a loss for class probability estimation, ℓ: Y ˆ r0, 1s Ñ R, is expressed as

ℓpy, uq . Jy 1K ℓ1puq Jy 1K ℓ 1puq, (2)

where J.K is Iverson s bracket (Knuth, 1992). Functions ℓ1, ℓ 1 are called partial losses. A loss is symmetric when ℓ1puq ℓ 1p1 uq, @u P r0, 1s (Nock & Nielsen, 2008) and differentiable when both partial losses are differentiable. Table 1 presents examples partial losses of symmetric losses. The pointwise conditional risk of posterior η P r0, 1s with respect to ground truth η P r0, 1s is Lpη, η q . EY Bpη q rℓp Y, ηqs, i.e.,

Lpη, η q η ℓ1pηq p1 η q ℓ 1pηq. (3)

Bp.q denotes a Bernoulli for picking label Y 1. The associated (pointwise) Bayes risk is

Lpη q . inf η Lpη, η q. (4)

The interesting case is when the argument of the inf reduces to tη u, because then minimizing (3) for η encourages to pick ground truth η . Formally, when (i) Lpηq Lpη, ηq, @η P r0, 1s and (ii) Lpη, η q ą Lpη q, @η η , we say that the loss is strictly proper, and proper when (i) holds. The population version of (3), when both η, η P r0, 1s X, is the (full) risk (Reid & Williamson, 2011, pp 747),

Lpη, η , Mq . EX M r Lpηp Xq, η p Xqqs . (5)

Generative Trees: Adversarial and Copycat

We now assume that all losses for class probability estimation used hereafter are strictly proper, symmetric and differentiable (SPSD) and satisfy the additional technical assumption that ℓ 1p0q 0 (all but Jeffreys in Table 1 are SPSD), which makes η Bayes posterior in (1). Deﬁnition 4.2. The information of calibrated η P r0, 1s X

is: Lp η,Mq . Lpηπ,ηπ,Mq Lp η, η,Mq Lpπq Lp η, η,Mq.

This deﬁnition is a convenient restriction to calibrated posteriors of the original deﬁnition in De Groot (1962, Eq. (2.2)) and Reid & Williamson (2011, Eq. (20)). It represents how much information η brings compared to the constant calibrated posterior ηπ. Decision tree induction would traditionally maximize Lp η, Mq via the minimisation of some Lp η, η, Mq, where η is the calibrated posterior at the leaves of the decision tree: CART s uses the square loss (Breiman et al., 1984), C4.5 uses the log-loss (Quinlan, 1993), etc. . Losses for measure estimation and binary task information: a substantial body of work has tightened the GAN loss to variational f-divergences (Nowozin et al., 2016; Nock et al., 2017). Here, we are also interested in such a formulation but for a very speciﬁc set of f introduced decades ago ( Osterreicher & Vajda, 1993, Theorem 2):

f πptq . Lpπq pπt 1 πq L

ˆ πt πt 1 π

which involves prior π (controlled in the generative game). Deﬁnition 4.3. The information of binary task B . pπ, P, Nq, Ip Bq, is the f π-divergence

Ip Bq . If πp P, Nq, (6)

where we recall If πp P, Nq . ş f π d P

d N d N. Given Ip Bq, we could directly dig into the variational formulation of the f π-divergence to design the generative modelling game and loss at the expense of an eventual slack due to the variational argument (Nowozin et al., 2016, Ineq. (4)). We avoid the slack via a trick using calibrated posteriors. Losses for the adversarial generative game: we need to deﬁne two additional functions, for any posterior η:

We call ρ the density ratio and ϱ the likelihood ratio, following conventions in Reid & Williamson (2011, pp 746)2. To take an example, if we consider Bayes posterior η in (1), then it follows ϱ d P{d N, justifying the name. Let f pzq . supttzt fptqu denote the convex conjugate of f. We note that for any SPSD loss ℓ, f π is differentiable. Deﬁnition 4.4. Let B and ϱ be any binary task and likelihood ratio. Let Gℓpzq . pf πq f π1pzq and

Gℓp N|ϱq . EX Nr Gℓpϱp Xqqs, (8)

Dℓpϱ|Bq . EX P f π1 ϱp Xq Gℓp N|ϱq

2Names can otherwise vary in the literature.

respectively denote the generator and discriminator risks.

For any f-divergence, we have the f-GAN deﬁning inequality (Nowozin et al., 2016, eqs. (4-6)) similar to -(9):

Ifp P, Nq ě sup ϱ t EX P f 1 ϱp Xq EX Nrf f 1 ϱp Xqsu, (10)

so both (8) and (9) deﬁne the corresponding functions to minimise for the generator and discriminator in this variational inequality after the change f Ñ f π. While the change is anecdotical with respect to the inequality (10), it conceptually operates a radical shift with respect to classical (f-)GANs: the generator s loss is completely determined in our case by the loss of the discriminator as it appears in f π, a loss whose design heavily relies on properness. The change also has a key fortunate mathematical consequence: we can replace the inequality (10) by a chain of equalities involving all key risks, as we now show.

Theorem 4.5. For any SPSD loss ℓ, any binary task B, any calibrated posterior η whose likelihood ratio is denoted ϱ, the following holds:

Lp η, M ηq Dℓp ϱ|B ηq Ip B ηq. (11)

Furthermore, we have the expression for function Gℓin Deﬁnition 4.4 (with ℓDRpρq . ℓ 1 p1{p1 ρqq):

Gℓpϱq Lpπq p1 πq ℓ DRpρq, (12)

and the transformation ϱ Ø ρ is obtained via (7).

The proof (in Appendix, Section I.1) also provides the conjugate pf πq , of potential independent interest.

Remark 4.6. Since f-divergences satisfy the data processing inequality, we also have for any calibrated posterior η, Ip B ηq ď Ip Bq. Together with (11), this gives a precise way of how the GAN game operates with calibrated posteriors and proper losses: training a discriminator to maximise its statistical information Lp η, M ηq, e.g. as done with DT induction algorithms, increases as well the information of the binary task Ip B ηq. On the other hand, training in turn the generator to minimize Gℓp N η|.q reduces the information of the binary task Ip B ηq. In the case of DT algorithms, as the tree grows, its calibrated posterior η converges to an empirical Bayes best posterior (based on training real data). Disregarding generalisation issues, as long as the generator stands the growth of the discriminator by keeping Ip B ηq small enough, it is guaranteed to improve with iterations.

A generative loss to learn against them all (almost) Table 1 shows that ℓDR has several invariant properties for the losses shown. We formalise some of them.

Lemma 4.7. For any ℓproper symmetric and differentiable, (i) ℓDR is decreasing and (ii) ℓDR is convex ρ or in 1{ρ, @ρ.

Proof in Appendix, Section I.2. In the examples of Table 1, the or in Lem. 4.7 is in fact an and . Convexity is

Generative Trees: Adversarial and Copycat

important because it yields a single loss to efﬁciently train the generator against any proper trained discriminator: the χ2, i.e. the f-divergence whose generator is fptq . pt 1q2.

Lemma 4.8. For any SPSD loss ℓfor which ℓDR is convex, any binary task B, calibrated posterior η (likelihood ratio . ϱ), the following bound holds on the generator s risk Gℓ:

Gℓp N η| ϱq

ď Lpπq p1 πq ℓ 1

ˆ π 1 p1 πq χ2p N η||P ηq

Proof in Appendix, section I.3. For any SPSD loss, ℓ 1 is increasing (Cf proof of Theorem 4.5). Therefore, if we train the generator to reduce χ2p N η||P ηq, it reduces the RHS in (13) and provides a smaller bound on the generator s risk, regardless of the proper loss used as long as ℓDR is convex.

5. Training h and G

One way to train both the DT h and the GT G would be to proceed as in generative adversarial networks (Goodfellow et al., 2014). The DT can be trained using any commercial package (Breiman et al., 1984; Quinlan, 1993) or more generally any greedy induction of a tree minimizing a SPSD loss with convex ℓDR. We can then train the adversarial GT by minimizing χ2p N η||P ηq and alternate between phases of training the DT and training the GT. We call this setting adversarial for short. Due to the architecture of the models, there is a more speciﬁc training available for generative trees, more constrained than the adversarial setting but with a straightforward implementation and direct convergence guarantees coming from the convergence of the DT training. In this case, the GT copies the tree architecture of the DT and ﬁts the probabilities to keep χ2p N η||P ηq 0. We call this setting the copycat setting. We detail them.

5.1. Adversarial training of the generator

We adopt a greedy induction of the GT. The current calibrated posterior of the generator h is η. Let λ denote a general leaf of h. S denotes the current sampling node at the generator G that we are going to split to create a subtree with two sampling leaves and associated Bernoulli probability p to compute the new arcs at S . Figure 3 provides an overview of the process, pointing to a new variable, τ, which is the local (relative) proportion of examples generated from the right sub-domain at the candidate split. For any λ P Λphq in the discriminator and candidate split at leaf S P Λp Tq in the generator, we deﬁne: ãÑ pλ, the total weight of real examples reaching λ; ãÑ nλ . ş

Xpλq d N η, the theoretical proportion of fake examples reaching λ, where Xpλq is the subset of X of observations that reach λ in h; ãÑ n0 λ, the total weight of fake examples reaching λ but

generated by Λp Gqzt S u these weights do not change after the split at S ; ãÑ nl λ, the total weight of fake examples reaching λ, generated by S and whose value for attribute Xi (the one considered for the split) is in Xl; ãÑ nr λ, the total weight of fake examples reaching λ, generated by S and whose value for attribute Xi (the one considered for the split) is in Xr.

It is worth noticing that nλ, n0 λ, nl λ, nr λ can all be calculated exactly from the trees of h and G. After the split, only the proportions in Yλtnl λu Yλ tnr λu are potentially changed by the split. We can compute τ as a function of these quantities:

λPΛphq nr λ ř

λPΛphq nl λ nr λ . (14)

Deﬁne three more quantities:

lλ . n0 λ nl λ 1 τ ; rλ . n0 λ nr λ τ ; δλ . rλ lλ.

These quantities are interpretable as follows: lλ would be the new nλ after split if we were to pick p 0; rλ would be the new nλ after split if we were to pick p 1 and δλ quantiﬁes the difference in generation between these two extreme strategies. These strategies are extreme because for example if we choose p 0, then we discard the support at S covering observations whose value for Xi is in Xr. Some coefﬁcients are particularly important to compute p:

l2 λ pλ ; µRR . ÿ

r2 λ pλ ; µLR . ÿ

These are also interpretable: if we let χ2 N1 ηppq||P η de-

note the new χ2 after the split at S with Bernoulli p, then µLL 1 χ2 N1 ηp0q||P η , µRR 1 χ2 N1 ηp1q||P η , and µLR is a correlation between both strategies. The proof of Lemma 4.8 shows those identities. Algorithm TD-GEN summarizes the steps to split one leaf, without giving speciﬁc constraint on the choice of leaf to split S , feature i, and split parameters p Xl, Xrq. We leave these open because general convergence rates can be obtained for TD-GEN that do not constraint those choices. Function CLAMPpzq is deﬁned as CLAMPpzq . maxtmintz, 1u, 0u. We have two different regimes for the convergence of the χ2, depending on whether p P p0, 1q or p P t0, 1u (that latter case means that we discard support for the generation of examples). We give those results in two different Theorems. For our ﬁrst Theorem, let Iδ . r0, δq denote a range of acceptable values for the χ2s. The question we ask is what is the guaranteed convergence rate when we are not in this favorable case, that is, when the χ2s (before and after update) are not in Iδ, a situation we refer to as TD-GEN being

Generative Trees: Adversarial and Copycat

(A) (B) (C) (D) (E)

Figure 3. Splitting a sampling leaf in G to create a subtree with two new sampling leaves. (A) we pick a current leaf and decide on a variable Xi whose local density (therefore uniform, in dark gray) is going to be split in two at the new leaves. (B) a potential split creates two local intervals Xl, Xr, and we can compute the relative local proportion of examples that would be generated from Xr (τ) and from Xl (1 τ), (C). Finally, we compute Bernoulli s p (D). Note that τ does not appear after split (E), it is just used to compute p.

Algorithm 1 TD-GENp G, hq

Input: current generator G, current discriminator h; Output: G with a new split; Step 1 : pick S P Λp Gq, i P rds; // leaf and variable for the current split Step 2 : choose p Xl, Xrq and compute τ; // split choice Step 3 : compute p as

p Ð CLAMP ˆ µLL µLR µLL µRR 2µLR

Step 4 : replace S by a split as designed in Steps 1,2 w/ Bernoulli probability p as in (16);

outside regime Iδ . We show TD-GEN exhibits geometric convergence rate related to δ and the proximity of p to τ.

Theorem 5.1. Suppose p P p0, 1q. For any ε ą 0, δ ą 0, if (i) TD-GEN is outside regime Iδ and (ii) p in Step 3 satisﬁes |τ p| ě ε, then after one iteration of TD-GEN, we have:

χ2 N1 ηppq||P η ď 1 1 δε2 χ2 p N η||P ηq . (17)

Condition (ii) does make sense because if p τ, then there is no change in the χ2 as χ2 N1 ηpτq||P η χ2 p N η||P ηq.

To cover the case p P t0, 1u, we need an additional assumption that mirrors the weak learning assumption that governs the convergence of the discriminator h in the boosting framework. We call it a weak generating assumption.

Deﬁnition 5.2. (δ-WGA) Let δ ą 0 be a constant. We say that the split at S meets the δ-Weak Generating Assumption iff µDD ě δ maxtµLL, µRRu, where µDD . ř

λPΛphq δ2 λ{pλ.

Theorem 5.3. Suppose p P t0, 1u and the δ-WGA holds. Then after one iteration of TD-GEN, we have:

χ2 N1 ηppq||P η ď 1 1 δpτ p1 2τqpq2 χ2 p N η||P ηq .

Remark 5.4. Deﬁnition 5.2 is weak in a generative sense. Indeed, the only case where µDD 0 is when

nl λ{p1 τq nr λ{τ, @λ P Λphq, which brings after solving for τ, the relationships nr λ{pnr λ nl λq nr λ1{pnr λ1 nl λ1q for any two leaves in Λphq, and after simplifying, nr λ nl λ1 nl λ nr λ1, @λ, λ1 P Λphq. Filling any 2ˆ2 contingency table with destination leaves in the discriminator (λ, λ1) versus provenance in the generator (l, r) immediately leads to a Pearson s χ2 0. What the WGA prevents is thus the extreme independence where the generator s examples would be randomly attributed to the leaves in the discriminator.

5.2. Copycat training of the generator

Algorithm: when using a (decision) tree as discriminator, both the GT and DT have an underlying tree (graph). Copycat training takes advantage of this scenario as the GT G copies the tree of the DT h as it is learned: if h involves the usual top-down induction scheme, after each of the new splits in h, the generative tree G replicates the same split in its tree, computing the Bernoulli probabilities in such a way that the new proportion of fake observations is going be the same as that of real observations at the new leaves of h. In other words, after the update of G, the new h performs as badly as a fair coin. We see two substantial downsides to copycat vs adversarial training: the generator peeks in the discriminator s tree, which can be a problem for privacy or fairness issues, and it has zero freedom to grow its own tree. There is, however, a major upside of copycat training over adversarial training: it requires no additional expensive computation for the new Bernoulli s p in G. Denote mλ the number of positive examples at the leaf λ P Λphq to be split in h, and mr λ the number of positive examples ending up in its right sub-leaf after split. Then in G we have at the same λ: p mr λ{mλ. Convergence: there is another beneﬁt of copycat training: in the boosting model of Kearns & Mansour (1996, Section 5.1), the convergence rates for G towards the distribution of observed real data directly follow from the boosting rates of h on the supervised task. To show this, we proceed in two steps: the ﬁrst introduces and conveniently decomposes a risk quantifying the discrepancy between measures in the

Generative Trees: Adversarial and Copycat

density ratio model (Menon & Ong, 2016) for this objective, we introduce indexes in notations, and let h T denote the generator h after T splits, so that h0 is the single-root DT. Similarly, we let ηT denote the corresponding calibrated posterior and PT denote the measure induced on X by P and h T by (i) ensuring it is locally uniform at each leaf and (ii) it locally sums to the local weight of P. In equation, it satisﬁes, U denoting the uniform measure,

λ d U, @x reaching λ. (18)

Denote pλ . ş

λ d P and uλ . ş

λ d U. For differentiable and convex F : R Ñ R, the Bregman divergence with generator F is BF pz}z1q . Fpzq Fpz1q pz z1q F 1pz1q. Given function g : R Ñ R, the generalized perspective transform of F given g is (Mar echal, 2005a;b; Nock et al., 2016) ˇFpzq . gpzq F z gpzq . g is implicit in notation ˇF.

Deﬁnition 5.5. The Likelihood ratio risk of PT with respect to P for SPSD loss ℓis (with gpzq . z p1 πq{π): Bℓ P, PT . π EU Bp} Lq

d P d U d PT

Risks expressed as in Def. 5.5 have a history in density ratio estimation (Menon & Ong, 2016) (and references within).

Lemma 5.6. @ SPSD loss ℓand DT h T , } L

is strictly

convex and Bℓ P, PT If πp P, Uq If πp P ηT , U ηT q.

The proof is given in Appendix, Section I.5. Strict convexity is crucial: in such a case a Bregman divergence zeroes iff its two arguments are equal, implying at the risk level in Def. 5.5 that Bℓ P, PT 0 iff P PT almost everywhere. If πp P, Uq is a constant and the data processing inequality satisﬁed by f-divergences brings 0 If πp P η0, U η0q ď ... ď If πp P ηT , U ηT q ď ... ď If πp P, Uq, so regardless of the top-down induction algorithm used for h, Lemma 5.6 shows a form of convergence of PT towards P as accounted by the likelihood ratio risk Bℓ P, PT . The last part of copycat training s convergence is to make those inequalities strict with guaranteed slack: this is achieved using the boosting analysis of Kearns & Mansour (1996) as is. We do not put iteration indexes in G, assuming the one we consider is the one after the update of the last discriminator h T . Denote λ a leaf to be split in h and Pλ, Uλ the distributions conditioned to reaching λ; we denote as uniformly generated the observations sampled from Uλ. Deﬁne the local mixture Mλ . π Pλ p1 πq Uλ and the balanced mixture, M1 λ, is deﬁned as Mλ . p1{2q Pλ p1{2q Uλ. Let t denote the predicate value of a split chosen for λ.

Deﬁnition 5.7. (Kearns & Mansour, 1996) Fix δ ą 0. Predicate t at leaf λ satisﬁes the δ-Weak Hypothesis Assumption (WHA) iff Pr M1 λrtp Xq Ys ď 1{2 δ.

It turns out that the balanced mixture is the one against which each new split in h. is evaluated after the generator

is updated in copycat training (the modiﬁcations at G are local since the underlying tree deﬁnes a partition of X). We use the WHA to ensure that the split chosen at any leaf during copycat training complies with Deﬁnition 5.7. Using a result of Kearns & Mansour (1996), this brings guaranteed rates for the maximisation of the information of its calibrated posterior (Deﬁnition 4.2). Theorem 4.5 then directly yields rates for the maximisation of If πp P ηT , U ηT q, and Lemma 5.6 translates them to convergence for PT G towards P, where G denotes the measure induced by generator G (equality PT G is guaranteed by copycat training). We make those convergence rates explicit for the boosting-optimal splitting criterion, Matusita s loss (Table 1), for which Lpuq 2 a

up1 uq. For any ε P r0, 1s, we abbreviate Lεp Bq . ε Lpπq p1 εq Lpη , η , Mq. Theorem 5.8. Deﬁne the binary task B . pπ, P, Uq, M. Suppose SPSD loss ℓis Matusita s loss and the WHA is satisﬁed at each split of h.. Then for any ε P r0, 1s, if the

number of splits in h. satisﬁes T ě 1 Lεp Bq 32

δ2 , then the likelihood ratio risk achieved by generator G with respect to the distribution of real observations satisﬁes Bℓp P, Gq ď ε If πp P, Uq. The proof directly follows from the proof of Kearns & Mansour (1996, Theorem 10), using Defns in (4.2), (5.5), Thm 4.5 and Lem. 5.6 to calibrate the bound that is needed on the information of ηT to guarantee the bound on Bℓp P, Gq.

6. Experiments

We carried out experiments on four topics: missing data imputation, training on generated data (training on fakes vs training on real), fake-real discrimination (distinguishing fakes from real) and generated augmentation (adding fakes to real for training) on a total of 11 readily available datasets, from the UCI (Dua & Graff, 2021), Kaggle and the Stanford Open Policing project, to which we added 4 simulated datasets. For simplicity, all GT experiments use copycat training, implemented in Java. Due to the lack of space, we refer to App, Section II for all details. To get the simple 2D heatmaps in Figure 1, we trained generative models with the full domain (1K epochs for neural nets, 10K splits for GTs) and generated the same amount of data as in the training sample. An interesting observation, quite intuitive considering copycat training s properties and visible on all 2D heatmaps (see also App), is the virtual absence of mode collapse during training: modes that appears gets reﬁned during induction but do not disappear at some point.

6.1. Missing data imputation ( IMPUTE )

Objective and experimental setting A GT G is not just useful to generate data: it can trivially be used for missing data imputation (Muzellec et al., 2020). For this, we con-

Generative Trees: Adversarial and Copycat

Us vs Mice| NORM CART RF

CART RF CART RF #trees p. fold N/A 10 1 000 35 3 500 125 12 500 (q=)5%

U(0.003) U U(0.07)

M U 10% U(0.03) U(0.002) U(0.007) U M U M 20% U(0.0005) U(0.001) U(0.001) U(0.05) U U M 50% U(0.04) U U us mice|NORM mice|CART M(0.02) M(0.06) M(0.01) M(0.07)

Table 2. Best method (U(s) vs M(ice); p-values shown if p ď 0.1) on IMPUTE. circ Gauss: domain (green dots) vs imputed data (red).

TRAIN-GEN GEN-DISCRIM

C(.) 10 300 1K 10 300 1K

10 1 / 7 / 2 3 / 4 / 3 2 / 3 / 5 7 / 4 / 2 6 / 5 / 2 6 / 5 / 2 300 8 / 1 / 1 8 / 1 / 1 4 / 5 / 1 8 / 4 / 1 8 / 3 / 2 8 / 3 / 2

MAX 8 / 1 / 1 8 / 1 / 1 7 / 2 / 1 8 / 4 / 1 8 / 3 / 2 8 / 3 / 2

Table 3. TRAIN-GEN (left table) & GEN-DISCRIM (right table): statistical wins / ties / statistical losses for us (U()) vs CTGAN(C()). Statistical = signiﬁcant for p ď 0.01. For example, a / b / c means we statistically win a times, lose c times and there is no statistical difference b times. On TRAIN-GEN, each red star ( ) indicates a domain for which the related technique performed statistically worse than uniform sampling (UNIF) for p ď 0.05.

strain the support of the tree to the observed variables and then sample in region(s) of maximal density. This costs no more than Op|Λp Gq|q per observation. We compared against a few powerful alternatives (mostly tree-based) from the R mice package (van Buuren & Groothuis-Oudshoorn, 2011). Such methods rely on round-robin prediction of missing values: after having initialized them, one circles several times (5 in our experiments) through predicting each column from all others using trained models from a speciﬁc method. We used method P t NORM, CART, RFu (RF = random forests with 100 trees each, CART learns regression / decision trees (Breiman et al., 1984)). It is important to realise that even on a domain like led24 with 25 variables, 5 round-robin iterations with RF implies using no less than 12 500 trees per fold when we rely on 1 in our GT. We grow the GT to its MAX size (limit: 10 000 nodes) and prevent splits with p P t0, 1u, thus avoiding discarding support for data generation. We generate Missing Completely At Random (MCAR, van Buuren (2018)) data by removing a ﬁxed proportion q P t5%, 10%, 20%, 50%u, embedded in a 5-fold cross validation for each q. Imputing a complete dataset on each fold, we judge imputation s quality with optimal transport s Wasserstein s W 2 2 (Muzellec et al., 2020). Results Table 2 summarises 3 domains (more in App) giving a good panorama of our observations, the ﬁrst of which being the fact that our simple approach can in fact beat mice on problems with restricted number of variables (but the picture is reversed on domains with large number of variables, App). Imputations quality wrt mice is clear from Table 2 (q 20%). While our objective was not to beat such ﬁt-for-purpose imputation methods relying on a comparatively huge number of models, we remark that our results can serve as basis for using GTs in more sophisticated approaches.

6.2. Training on generated data ( TRAIN-GEN )

Objective and experimental setting In this basic experiment, we seek to answer whether generated data can be used in lieu of the original data to solve the original data s supervised / regression problem (e.g. predicting the variety for iris). We use a 5-folds CV experiment where on each fold a supervised classiﬁer is trained on fake or real data and then used to classify the (fresh) real data s fold. Fake data is obtained from a generator trained on the training data, then sampled for the same data size. We consider 3 GTs with different sizes, with 10, 300 and up to MAX splits (same as in Section 6.1). Our contender is the state of the art CTGAN(Xu et al., 2019), trained for a number of epochs in t10, 300, 1Ku; the original data s supervised problem is then solved by training RFs and gradient boosted decision trees (GBDT) on real or fake data, and comparing the output accuracy / RMSE (details in App, Section II.3). Results Table 3 (left) provides a summary of the results on the 10 total domains considered, from which it emerges that when GTs have 600 nodes (300+ splits), we tend to beat neural networks (NNs, regardless of the number of epochs considered). What is worse for NNs is that in a total of 4 cases, they are statistically beaten by a uniform sampling of the training data, which means they fail at learning the domain s characteristics. This never happens to GTs. Detailed 2D plots display that GTs tend to better learn domain-speciﬁc features. Also, the ﬁnal size of the GT can be tiny compared to the training data, e.g. less than 0.5% on dna and open policing (see App, Section II.6).

6.3. Fake-real discrimination ( GEN-DISCRIM )

Objective and experimental setting While the objective ﬁts in a simple question (can the generated data look like real data ?), its treatment necessitated a complex pipeline, in particular to avoid rewarding generators whose output would be a mere copy their training sample. The complete pipeline is detailed in App (Section II.7); very brieﬂy, it starts by shufﬂing a 3-partition of the training data in a 3! 6-fold CV and ends up with supervised RF / GBDT classiﬁers (same as in Section 6.2) for a 2-class supervised problem of fakes vs real distinction. The smaller their accuracy, the better is the generator. We use CTGANs as contenders; all parameters (GTs, CTGANs) are the same as in Section 6.2. Results Table 3 (right) provides a summary of the results on 13 total domains considered. They display that GTs (regardless of their sizes) achieved a better job at fooling

Generative Trees: Adversarial and Copycat

UCI abalone

UCI house-votes

UCI winewhite

UCI winered

Table 4. Experiment GEN-AUG: detailed results on six domains (more in App, Section II.8). On the left pane, the metric is accuracy (higher = better); on the right pane, the metric is RMSE (lower = better). In each plot, the x value of a vertical bar indicates a method s metric and the height along the y axis indicates the % of real data that represents generated data used to train the ﬁnal classiﬁer (up to 100%). Green ﬁlled circles are GTs results, the size of the circle indicating the number of splits in the GTs ( = 10, = 300, = 10K). Red ﬁlled diamonds are CTGANs results, the size of the diamond indicating the number of epochs ( = 10, = 300, = 1K). Finally, empty pink squares ( ) are UNIF s results and ﬁlled blue squares ( ) are those of COPY (the optimal reference from the standpoint of the task).

classiﬁers than neural nets. Much more interesting is perhaps the fact that GTs managed, on 3 (simulated) datasets, to better fool classiﬁer than the original real data itself. This never happened for CTGANs. However, there is still a gap to ﬁll for all techniques: CTGANs do a statistically worse job at fooling classiﬁers than uniformly generated data on 6 domains while GTs do statistically worse on 3 domains.

6.4. Generated augmentation ( GEN-AUG )

Objective and experimental setting Supplementing real data with generative data is a particular case of data augmentation, and this setting is a variation on the TRAINGEN experiment. Two key factors under control are (i) the technique used to supplement the additional data and (ii) the proportion of additional data with respect to the training set size. In the case of (i), we do not just include data generated by our technique or GANs but also consider adding purely UNIFormly generated data and also adding real data from COPY itself. A generative model is all the better as its performances come closer to the COPY metrics. Results Table 4 displays results on 6 domains, with a clear advantage to GTs on 5 out of the 6. On dna, GTs improve CTGANs by up to more than 50% accuracy, as neural nets clearly overﬁt the domain, being beaten by UNIF; on house-votes, CTGANs trained with the largest number of epochs get good results, though GTs are the only one managing to beat COPY with 5% or 10% additional real data. On 4 out of 6 domains, more training neural nets does not improve results, while learning bigger GTs does improve results in general, as well as does training with more generated data. Such experiments tend to indicate that training GTs can be quite resilient to overﬁtting.

7. Discussion and Conclusion

Our contributions have different application spectra: while our models are tree-based, our contribution on losses has wider applicability to any calibrated classiﬁers. While copycat training is speciﬁc to a tree vs tree training procedure, our adversarial algorithm could be used to train generative trees against any calibrated classiﬁer. GTs have advantages that neural nets do not necessarily have: they provide us with an exact and cheap to compute expression of the measure learned, they can easily be used for missing data imputation, and they also collect many beneﬁts of DTs: interpretability (of the measure); they can be trained using various feature types (numeric, nominal, ordinal, etc.); and they can be trained from data with missing values. They also share some downsides of DTs, such as the fact that the underlying tree graph induces an axis-parallel partition of the measure s support. Tricks used to alleviate DTs downsides could also be used for GTs, though maybe in a non-trivial way (Heath et al., 1993). Important open problems include scaling the beneﬁts of generative trees to ensembles of generative trees. We believe our work brings new tools for models, losses and algorithms to train powerful generative models tailored to tabular data, and hope it contributes to ﬁll the gap in data generation quality for tabular data noted in recent work.

Acknowledgments

The authors thank Ehsan Amid, Sercan Arık, Olivier Bousquet, Julie Josse, Yishay Mansour, Aditya Krishna Menon, Madeleine Udell, Jean-Philippe Vert, Manfred Warmuth, Bob Williamson and the reviewers for many comments, suggestions and stimulating discussions.

Generative Trees: Adversarial and Copycat

Arık, S.- O. and Pﬁster, T. Tab Net: Attentive interpretable tabular learning. In AAAI 21, pp. 6679 6687, 2021.

Box, G.-E.-P. and Muller, M.-E. A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29(2):610 611, 1958.

Breiman, L., Freidman, J. H., Olshen, R. A., and Stone, C. J. Classiﬁcation and regression trees. Wadsworth, 1984.

Camino, R.-D., State, R., and Hammerschmidt, C.-A. Oversampling tabular data with deep generative models: Is it worth the effort? In I Can t Believe It s Not Better Workshop (ICBINB@Neur IPS 2020), 2020.

Chui, M., Manyika, J., Miremadi, M., Henke, N., Nel, R. C. P., and Malhotra, S. Notes from the AI frontier. Mc Kinsey Global Institute, 2018.

De Groot, M.-H. Uncertainty, information, and sequential experiments. Annals of Mathematical Statistics, 33(2): 404 419, 1962.

Dua, D. and Graff, C. UCI machine learning repository, 2021. URL http://archive.ics.uci.edu/ml.

Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A.-C. Adversarially learned inference. In ICLR 17. Open Review.net, 2017.

Friedman, J., Hastie, T., and Tibshirani, R. Additive Logistic Regression : a Statistical View of Boosting. Ann. of Stat., 28:337 374, 2000.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS*27, pp. 2672 2680, 2014.

Heath, D., Kasif, S., and Salzberg, S. Learning oblique decision trees. In Proc. of the 13 th International Joint Conference on Artiﬁcial Intelligence, pp. 1002 1007, 1993.

Kearns, M. and Mansour, Y. On the boosting ability of top-down decision tree learning algorithms. In Proc. of the 28 th ACM STOC, pp. 459 468, 1996.

Knuth, D.-E. Two notes on notation. The American Mathematical Monthly, 99(5):403 422, 1992.

Linsley, J.-W., Linsley, D.-A., Lamstein, J., Ryan, G., Shah, K., Castello, N.-A., Oza, V., Kalra, J., Wang, S., Tokuno, Z., Javaherian, A., Serre, T., and Finkbeiner, S. Superhuman cell death detection with biomarker-optimized neural networks. Science Advances, 7(50):eabf8142, 2021.

Mar echal, P. On a functional operation generating convex functions, part 1: duality. J. of Optimization Theory and Applications, 126:175 189, 2005a.

Mar echal, P. On a functional operation generating convex functions, part 2: algebraic properties. J. of Optimization Theory and Applications, 126:375 366, 2005b.

Menon, A. and Ong, C.-S. Linking losses for density ratio and class-probability estimation. In 33rd ICML, pp. 304 313, 2016.

Muzellec, B., Josse, J., Boyer, C., and Cuturi, M. Missing data imputation using optimal transport. In 37th ICML, volume 119, pp. 7130 7140, 2020.

Ni, Y., Koniusz, P., Hartley, R., and Nock, R. Manifold learning beneﬁts GANs. Co RR, abs/2112.12618, 2021.

Nock, R. and Nielsen, F. On the efﬁcient minimization of classiﬁcation-calibrated surrogates. In NIPS*21, pp. 1201 1208, 2008.

Nock, R., Menon, A.-K., and Ong, C.-S. A scaled Bregman theorem with applications. In NIPS*29, pp. 19 27, 2016.

Nock, R., Cranko, Z., Menon, A.-K., Qu, L., and Williamson, R.-C. f-GANs in an information geometric nutshell. In NIPS*30, 2017.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: training generative neural samplers using variational divergence minimization. In NIPS*29, pp. 271 279, 2016.

Osterreicher, F. and Vajda, I. Statistical information and discrimination. IEEE Trans. IT, 39(3):1036 1039, 1993.

Quinlan, J. R. C4.5 : programs for machine learning. Morgan Kaufmann, 1993.

Reid, M.-D. and Williamson, R.-C. Information, divergence and risk for binary experiments. JMLR, 12:731 817, 2011.

Rob, P. and Coronel, C. Database systems - design, implementation, and management. Boyd and Fraser, 1995.

Rockafellar, R. T. Convex Analysis. Princeton University Press, 1970.

Savage, L.-J. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., pp. 783 801, 1971.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using conﬁdence-rated predictions. In 9 th COLT, pp. 80 91, 1998.

Sypherd, T., Nock, R., and Sankar, L. Being properly improper. In 39th ICML, 2022.

Generative Trees: Adversarial and Copycat

van Buuren, S. Flexible Imputation of Missing Data. Chapman & Hall / CRC, 2018.

van Buuren, S. and Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3):1 67, 2011.

van Erven, T. and Harremo es, P. R enyi divergence and kullback-leibler divergence. IEEE Trans. IT, 60:3797 3820, 2014.

Xiao, C., Zhong, P., and Zheng, C. Bour GAN: Generative networks with metric embeddings. In Neur IPS 18, pp. 2275 2286, 2018.

Xu, L., Skoularidou, M., Cuesta-Infante, A., and Veeramachaneni, K. Modeling tabular data using conditional GAN. In Neur IPS*32, pp. 7333 7343, 2019.

Yoon, J., Jordon, J., and van der Schaar, M. GAIN: missing data imputation using generative adversarial nets. In 35th

ICML, volume 80, pp. 5675 5684, 2018.

Generative Trees: Adversarial and Copycat

This is the Appendix to Paper Generative Trees: Adversarial and Copycat by R. Nock and M. Guillame-Bert, appearing in ICML 22. To differentiate with the numberings in the main ﬁle, the numbering of Theorems is letter-based (A, B, ...).

Table of contents

Supplementary material on proofs Pg 13

ãÑ Proof of Theorem 4.5 Pg 13 ãÑ Proof of Lemma 4.7 Pg 16 ãÑ Proof of Lemma 4.8 Pg 17 ãÑ Proof of Theorems 5.1 and 5.3 Pg 18 ãÑ Proof of Lemma 5.6 Pg 21

Supplementary material on experiments Pg 23

ãÑ Examples of generative trees Pg 23 ãÑ Domains Pg 23 ãÑ Data generation experiments Pg 25 ãÑ Missing data imputation experiments (IMPUTE) Pg 35 ãÑ Training on generated data experiment (TRAIN-GEN) Pg 38 ãÑ Fake-real discrimination experiment (GEN-DISCRIM) Pg 42 ãÑ Generated augmentation experiment (GEN-AUG) Pg 44

Generative Trees: Adversarial and Copycat

I. Appendix on proofs

I.1. Proof of Theorem 4.5

We proceed in several steps.

Ź Proof of the identity between the external elements in (11) We write, as in Reid & Williamson (2011, Appendix A.3), the second equality of

Ip B ηq . If πp P η, N ηq

Lpπq ż L p ηq d M η (19)

Lpπq EX M η r Lp ηqs

Lpπq EX M η r Lp ηp Xq, ηp Xqqs (20)

Lpπq Lp η, η, Mηq (21)

Lp η, M ηq, (22)

where (19) holds because η is calibrated and (20) holds because the loss is proper and η matches Bayes posterior on B η.

Ź Proof of the rightmost identity in (11) We ﬁrst prove several helper results. We ﬁrst remark that ℓbeing strictly proper differentiable implies f π strictly convex and differentiable. We show these results for completeness, starting by showing L strictly concave: otherwise, we write Lpη1:2q p1{2q p Lpη1q Lpη2qq for η1:2 . p1{2q pη1 η2q. If we had both

Lpη1, η1:2q ą Lpη1:2, η1:2q . Lpη1:2q p1{2q p Lpη1q Lpη2qq, (23)

Lpη2, η1:2q ą Lpη1:2, η1:2q . Lpη1:2q p1{2q p Lpη1q Lpη2qq, (24)

then making the average of both yields p1{2q p Lpη1q Lpη2qq ą p1{2q p Lpη1q Lpη2qq, a contradiction, and yielding for example Lpη1, η1:2q ď Lpη1:2, η1:2q, contradicting the strict properness of the loss in η1:2. Strict concavity of L implies strict convexity of f π from its deﬁnition in (6). Also, the differentiability of the partial losses imply the differentiability of f π.

For any strictly convex differentiable function f, we have f pzq zf 1 1pzq fpf 1 1pzqq and pf q1 f 1 1, and if it is lower semicontinuous then f f. We check that f π is indeed lower semicontinuous. Because L is continuous (Sypherd et al., 2022, Lemma 3.1), we study the set

Ipαq . " t : pπt 1 πq L

ˆ πt πt 1 π

ě Lpπq α * , (25)

for α P R. Denote for short gptq . pπt 1 πq L

πt πt 1 π , which is continuous and also concave Reid & Williamson

(2011, Appendix A.3). Recalling BF pz}z1q . Fpzq Fpz1q pz z1q F 1pz1q the Bregman divergence with generator F. We have

ˆ πt πt 1 π

πt 1 π L1 ˆ πt πt 1 π

π ˆ p Lqp1q p Lq ˆ πt πt 1 π

ˆ 1 πt πt 1 π

p L1q ˆ πt πt 1 π

ˆ 1 πt πt 1 π

ˆ πt πt 1 π

which, since ℓ1p1q 0, shows lim 8 g1 0 ; g being concave, g1 is decreasing (which also shows ℓ1 decreasing and ℓ 1 increasing). To conclude, g is increasing; hence, (when it is not empty) Ipαq rtu, 8q for some ﬁnite tu, and so

Generative Trees: Adversarial and Copycat

Ipαq p 8, tuq is open, showing the closedness of Ipαq and the closedness of tt : f πptq ď αu, and we get e.g. from Rockafellar (1970, Theorem 7.1, point (b)) that f π is lower semicontinuous and thus pf πq f π. We complete the proof of the identities: (26) holds because ℓ 1p0q 0 implies Lp0q . 0 ℓ1p0q 1 ℓ 1p0q 0 and because ℓis also symmetric, then Lp1q . 1 ℓ1p1q 0 ℓ 1p1q ℓ1p1q ℓ 1p0q 0. Hence, Lp0q Lp1q 0. We state (28) as a standalone Lemma.

Lemma A. Suppose ℓis proper differentiable and satisﬁes ℓ 1p0q ℓ1p1q 0. Then

B L p0||uq ℓ 1puq.

B L p1||uq ℓ1puq.

Proof: The following two relationships hold because ℓis proper, @u P r0, 1s (using (4)):

Lpuq u ℓ1puq p1 uq ℓ 1puq, (29)

u ℓ1 1puq p1 uq ℓ1 1puq 0. (30)

The second identity expresses the fact that properness implies (from (3)),

B Bv Lpv, uq ˇˇˇˇ v u 0. (31)

We derive @u P r0, 1s, using Lp0q 0,

B L p0||uq Lpuq u L1puq

u ℓ1puq p1 uq ℓ 1puq u pℓ1puq u ℓ1 1puq ℓ 1puq p1 uq ℓ1 1puqq

u ℓ1puq p1 uq ℓ 1puq u pℓ1puq ℓ 1puqq

as anticipated. We also have @u P r0, 1s, using Lp1q 0,

B L p1||uq Lpuq p1 uq L1puq

Lpuq u L1puq L1puq

u ℓ1puq p1 uq ℓ 1puq u pℓ1puq ℓ 1puqq ℓ1puq ℓ 1puq

ℓ 1puq ℓ1puq ℓ 1puq

which completes the proof of the Lemma. We now ﬁnish the proof of (11). Since η is calibrated, then the corresponding likelihood satisﬁes

ϱ . η 1 η 1 π

d M η p1 πq d N η

d P η d N η . (33)

Generative Trees: Adversarial and Copycat

Putting all this together, we get:

Dℓp ϱ|B ηq . EX Ph f π1 ϱp Xq EX N ηr Gℓp ϱp Xqqs

EX P η f π1 ϱp Xq EX N ηrpf πq f π1 ϱp Xqs

d N η f π1 ˆ d P η

d N η ż pf πq f π1 ˆ d P η

d N η f π1 ˆ d P η

pf πq f π1 ˆ d P η

d N η pf π 1q 1 ˆ d P η

pf πq pf π 1q 1 ˆ d P η

ż pf πq ˆ d P η

ż f π ˆ d P η

If πp P η, N ηq

as claimed. The key to avoiding the inequality (10) is (34), which holds only because when the σ-algebras of the measures involved are coarsened with the level set of a calibrated posterior, it becomes Bayes posterior in the measure spaces obtained. This ends up the proof of (11).

Ź Proofs of (12) and expression of pf πq pzq We now compute pf πq pzq. Let

t . pp1 πq p1 pqπ . (35)

pf πq pzq . sup tě0

" tz Lpπq pπt 1 πq L

ˆ πt πt 1 π

Lpπq sup tě0

" tz pπt 1 πq L

ˆ πt πt 1 π

Lpπq sup p Pr0,1s

p1 pqπ z 1 π

1 p L ppq *

π sup p Pr0,1s

"pz π L ppq

We see that pf πq is unbounded if z ą 0. Otherwise, we have

ˆpz π L ppq

pz π L1 ppqqp1 pq ppz π L ppqq

z π p L ppq p1 pq L1 ppqq

z π p L p1q p L ppqq p1 pqp Lq1 ppq

z π B L p1}pq

where we have used the fact that L p1q ℓ 1p0q 0 and. Zeroing the derivative, we thus seek pπpzq such that

ℓ1ppπpzqq z

Generative Trees: Adversarial and Copycat

and since ℓis strictly proper, ℓ1 is invertible and we get

pπpzq ℓ 1 1 z

pf πq pzq Lpπq 1 π

π zpπpzq πL ppπpzqq

Lpπq p1 πq pπpzq B L p1}pπpzqq L ppπpzqq

Lpπq p1 πq pπpzq L ppπpzqq pπpzqp1 pπpzqq L1 ppπpzqq L ppπpzqq

Lpπq p1 πq L ppπpzqq pπpzq L1 ppπpzqq

Lpπq p1 πq L p0q p Lq ppπpzqq p0 pπpzqqp Lq1 ppπpzqq

Lpπq p1 πq B L p0}pπpzqq

Lpπq p1 πq ℓ 1ppπpzqq

Lpπq p1 πq ℓ 1 ℓ 1 1 z

This shows the expression of pf πq pzq. To compute pf πq f π1pzq, letting η . πz{pπz 1 πq, we observe

ˆ πz πz 1 π

1 π πz 1 π L1 ˆ πz πz 1 π

L pηq p1 ηq L1 pηq

p Lq p1q p Lq pηq p1 ηq p Lq1 pηq

ℓ1pηq, (38)

pf πq f π1pzq Lpπq p1 πq ℓ 1 pηq (39)

Lpπq p1 πq ℓ 1

ˆ πz πz 1 π

Lpπq p1 πq ℓ 1

and we remark that when z . ϱ is a likelihood ratio,

a density ratio, as claimed.

I.2. Proof of Lemma 4.7

From the chain rule and change of variable η . 1{p1 ρq, we get

ℓ 1 1pηq dℓ 1pηq

dρ dη dℓ 1p1{p1 ρqq

η2 ℓ DR1pρq (45)

p1 ρq2 ℓ DR1pρq. (46)

Generative Trees: Adversarial and Copycat

(28) shows that ℓ1 is decreasing and by symmetry, ℓ 1 is increasing. So, ℓ 1 1pηq ě 0 and (46) shows then ℓDR1pρq ď 0, thereby proving ℓDR is decreasing. We get from (30) after working all parameters using 46,

1 1 ρ ˆ 1 1

ρ 1 ρ p1 ρq2 ℓ DR1pρq, (47)

which becomes after simpliﬁcation

ρ3 ℓ DR1pρq, @ρ ě 0. (48)

Fix ρ, ε ą 0, we also have

ℓ DR1 ˆ 1 ρ ε

pρ εq3 ℓ DR1pρ εq, (49)

From which we get

ℓDR1pρ εq ℓDR1pρq

ρ3 ℓ DR1 ˆ1

1 pρ εq3 ℓ DR1 ˆ 1 ρ ε

Suppose the LHS is ď 0, which indicates a secant with negative slope at the right of ρ. From the RHS, we then get the ﬁrst inequality of

pρ εq3 ℓ DR1 ˆ 1 ρ ε

ě ℓ DR1 ˆ 1 ρ ε

and the second is due to the fact that ℓDR1p.q ď 0. Letting ρ1 . 1{pρ εq and ε1 . ε{pρpρ εqq, we then get

ℓDR1 pρ1 ε1q ℓDR1 pρ1q

ε1 ě 0, (52)

which indicates a secant with positive slope at the right of ρ1 or equivalently at the left of 1{ρ. Taking the limits for ε Ñ 0, we see that if the right derivative at ρ is negative, then the left derivative at 1{ρ is positive. Switching ε ă 0 from (50) switches the directional derivatives and shows that if ℓDR1 is not convex in ρ, then it is convex in 1{ρ.

I.3. Proof of Lemma 4.8

The proof follows from Jensen s inequality: if ℓDR is convex then we have by deﬁnition of ϱ:

Gℓp N η| ϱq . Lpπq p1 πq EN η

Lpπq p1 πq EN η

ď Lpπq p1 πq ℓ DR ˆ1 π

and we check that

ˆd N η d P η

χ2 p N η||P ηq 1, (55)

Generative Trees: Adversarial and Copycat

from which we get

π χ2 p N η||P ηq 1

π pχ2 p N η||P ηq 1q

ˆ π 1 p1 πq χ2 p N η||P ηq

and obtain the lowerbound of (13) after combining with (54).

I.4. Proof of Theorems 5.1 and 5.3

We start by a simple technical Lemma.

Lemma B. Let fpuq . au2 bu c; suppose fpuq ě 0, @u P R, implying a ě 0, 4ac b2 ě 0. Consider u . arg min R fpuq b{p2aq. Let Q . p4ac b2q{4a. For any ε ą 0 and any v P R,

|v u | ě ε ñ fpu q ď 1 1 ε2Q fpvq. (57)

Proof: Fix δ P r0, 1q. We want to compute the vs such that fpu q ď p1 δqfpvq. Equivalently, we want

av2 bv b2 4δac

4p1 δqa ě 0. (58)

We have the discriminant

1 δ δp4ac b2q

and we have ě 0 because of the constraints on f. The vs we seek therefore satisfy

δ 1 δ ˆ4ac b2

Solving the RHS ε for δ thus yields that if |v u | ě ε then

fpu q ď 1 1 ε2Q fpvq, (61)

where Q is deﬁned in the statement of the Lemma. We have the expressions related to split of S using feature X:

n10 λ n0 λ; n1l λ nl λ 1 p

1 τ n1r λ nr λ p

and n1 λ n10 λ n1l λ n1r λ, nλ n0 λ nl λ nr λ. Figure 4 gives an example of the way the scaling factors are obtained on a simple example. These come from scaling the densities after the split (which partitions further the support) and the computation of the associated Bernoullis Bppq (which deﬁnes the stochastic activation of the generative tree). Since this process does not change the way the support is split by the discriminator, the weights integrals of these densities are scaled by the same factors as depicted in (62). We recall notations

lλ . n0 λ nl λ 1 τ ; rλ . n0 λ nr λ τ ; δλ . nr λ τ nl λ 1 τ rλ lλ.

Generative Trees: Adversarial and Copycat

1 leaf = uniform sampler

Figure 4. Explanation of (62) on a 1D example: two splits on the same variable in a generative tree creating a piecewise constant but non uniform density for the variable.

For any p P r0, 1s, the new χ2 after split of S using feature X admits the simpliﬁed expression using (62) and µLL . ř λPΛphq l2 λ{pλ, µDD . ř λPΛphq δ2 λ{pλ, µLD . ř λPΛphq lλδλ{pλ,

χ2 N1 ηppq||P η ÿ

pλ n0 λ nl λ 1 p

n0 λ nl λ 1 p

1 µLL 2pµLD p2µDD, (64)

where we put p in parameter of N1 ηp.q. Note we can also take a fork at (65) and write instead:

χ2 N1 ηppq||P η 1 ÿ

prλ p1 pqδλq2

1 µRR 2p1 pqµRD p1 pq2µDD. (66)

Three values of p are of interest:

for p τ, we get

χ2 N1 ηpτq||P η χ2 p N η||P ηq , (67)

and there is no change in χ2 after split.

for p 0, we get

χ2 N1 ηp0q||P η 1 µLL, (68)

which corresponds to discarding support on Xr and yields n1 λ lλ;

for p 1, we get

χ2 N1 ηp1q||P η 1 µRR, (69)

which corresponds to discarding support on Xl and yields n1 λ rλ.

Generative Trees: Adversarial and Copycat

Case 1 suppose µDD ą 0. Deﬁne f as in Lemma B using (64), with a . µDD, b . 2µLD, c . µLL 1. We have

µDD µLL µLR µLL µRR 2µLR

Case 1.1 suppose in addition µLL ě µLR and µRR ě µLR. We have u P r0, 1s and ﬁx p . u . For the choice v τ, Lemma B says that

|τ p| ě ε ñ χ2 N1 ηppq||P η ď 1 1 ε2Q χ2 p N η||P ηq , (71)

Q µDDpµLL 1q µ2 LD µDD 1 µLL µ2 LD µDD . (72)

We also remark that with the value of p as in (70), χ2 N1 ηppq||P η turns out to be

χ2 N1 ηppq||P η 1 µLL 2 µLD

µDD µLD ˆ µLD

2 µDD Q. (73)

Hence, for any δ ą 0, as long as χ2 N1 ηp.q||P η ě δ, whenever |τ p| ě ε, one step of TD-GEN achieves geometric

convergence with rate 1{p1 δε2q. Case 1.2 suppose now µLL ă µLR, which implies u ă 0. Pick p 0. From (68), (67) and (64), to get χ2 N1 ηp0q||P η ď

p1{p1 εqq χ2 p N η||P ηq, we equivalently need

τ 2µDD 2τµLD εµLL ε ě 0, (74)

which, expressed using the fact that µDD µRR µLL 2µLR and µLD µLR µLL, yields:

pτ 2 2τ εqµLL τ 2µRR 2τp1 τqµLR ε ě 0, (75)

The Weak Generative Assumption, µDD ě δ maxtµLL, µRRu implies µRR ě 2µLR µLL δ µLL, and with Case 1.2 s assumption, µLR ą µLL, yields µRR ě 2µLL µLL δ µLL p1 δq µLL, so we get

pτ 2 2τ εqµLL τ 2µRR 2τp1 τqµLR ε ą pτ 2 2τ εqµLL p1 δqτ 2 µLL 2τp1 τqµLL ε

pδτ 2 εq µLL ε. (76)

Fixing ε δτ 2 thus brings (74) and we conclude with

χ2 N1 ηp0q||P η ď 1 1 δτ 2 χ2 p N η||P ηq . (77)

Case 1.3 suppose now µRR ă µLR, which implies u ą 1 (the denominator of u is always ě 0). Pick p 1. From (69), (67) and (66), to get χ2 N1 ηp1q||P η ď p1{p1 εqq χ2 p N η||P ηq, we equivalently need

p1 τq2µDD 2p1 τqµRD εµRR ε ě 0. (78)

Using µDD µRR µLL 2µLR and µRD µRR µLR, we break this down to:

p1 τq2µLL p1 τ 2 εqµRR 2τp1 τqµLR ε ě 0, (79)

The Weak Generative Assumption yields this time µLL ě 2µLR µRR δ µRR, which, together with Case 1.3 s assumption, µLR ą µRR, yields µLL ě p1 δq µRR, so we get this time

p1 τq2µLL p1 τ 2 εqµRR 2τp1 τqµLR ε ą p1 τq2p1 δq µRR p1 τ 2 εqµRR 2τp1 τqµRR ε

pδp1 τq2 εq µRR ε. (80)

Generative Trees: Adversarial and Copycat

Fixing ε δp1 τq2 thus brings (74) and we conclude with

χ2 N1 ηp0q||P η ď 1 1 δp1 τq2 χ2 p N η||P ηq . (81)

Cases 1.2 and 1.3 can be summarised as

χ2 N1 ηppq||P η ď 1 1 δpτ p1 2τqpq2 χ2 p N η||P ηq , p P t0, 1u. (82)

Case 2 suppose µDD 0. We remark in this case that the optimal p to minimize χ2 N1 ηppq||P η as in (64) or (66) is in t0, 1u, which brings us to the case where the Weak Generative Assumption holds and therefore make that Case 2 does not happen for the analysis of TD-GEN.

I.5. Proof of Lemma 5.6

We note that we have for any λ P Λph T q and x reaching λ,

πd PJ p1 πqd Upxq πpλ πpλ p1 πquλ , (83)

where we recall pλ . ş

λ d P and uλ . ş

λ d U. We get from the scaled Bregman Theorem (Nock et al., 2016, Theorem 1)

that since g is afﬁne, the perspective transform } L

is convex and the ﬁrst equality holds in:

d P d U gp d P

X pπd P p1 πqd Uq p Lq ˆ πd P πd P p1 πqd U

looooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooon

X pπd P p1 πqd Uq p Lq ˆ πd PJ

πd PJ p1 πqd U

loooooooooooooooooooooooooooooooooomoooooooooooooooooooooooooooooooooon

X pπd P p1 πqd Uq ˆ πd P πd P p1 πqd U πd PJ

πd PJ p1 πqd U

p Lq1 ˆ πd PJ

πd PJ p1 πqd U

looooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooomooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooon

and the second equality follows from the deﬁnition of Bregman divergences. Since leaves in h T induce a partition of X, C simpliﬁes to:

λPΛph T q p Lq1 ˆ πpλ πpλ p1 πquλ

λ d P πpλ πpλ p1 πquλ ż

λ pπd P p1 πqd Uq

λPΛph T q p Lq1 ˆ πpλ πpλ p1 πquλ

loooooooooomoooooooooon

Generative Trees: Adversarial and Copycat

Figure 5. Depiction of the two Bregman divergences in (85).

We can also reformulate A B:

X pπd P p1 πqd Uq L

ˆ πd P πd P p1 πqd U

λPΛph T q pπpλ p1 πquλq L

ˆ πpλ πpλ p1 πquλ

If πp P, Uq If πp P ηT , U ηT q,

which leads to the statement of the Lemma.

Remark: there is a simple argument to show the strict convexity of } L

that relates its derivative to negative a Bregman divergence:

1 pzq p Lq ˆ z

p Lq p1q ˆ p Lq p1q p Lq ˆ z

because in our case we have p Lq p1q 0. Let for short gpzq . z K for K ą 0. We have gpzq ą z and so for any δ ą 0,

z δ gpz δq z δ gpzq δ ą z gpzq, (84)

and since L is strictly concave and z{gpzq ă 1 for z ą 0, Bregman divergences lead to:

ˆ 1 z δ gpz δq

(See Figure 5 for a depiction of the quantities), hence } L

1 pz δq ą } L

1 pzq, showing the derivative of the perspective transform of negative the pointwise Bayes risk is strictly increasing and the function is therefore strictly convex.

Generative Trees: Adversarial and Copycat

Figure 6. Generative trees (crop) learned on a fold of Stanford open policing data for Hartford (top) and UCI house votes (bottom).

II. Appendix on experiments

II.1. Examples of generative trees

Figure 6 provides examples of subsets of generative trees learned on Stanford open policing data and UCI house votes. Each node takes the form

[prob value, [variable (nominal) in {set of nominal values}]; ... ]--[#node name] [prob value, [variable (continuous) in [continuous interval]]; ... ]--[#node name] [prob value, [variable (integer) in {int value n, n+1, ..., m}]; ... ]--[#node name]

prob value is the Bernoulli probability associated to the arc pointing to the node. If a node appears as

[ ... ]--[#node name (sampling)]

then it is a leaf (sampling) node. The rest of the Figures should be self-explanatory. In open policing, notice sampling nodes #9118, #9119, inducing a higher probability of sampling a young person (age within 14 and 29) in the related part of the domain.

II.2. Domains

ring Gauss is the seminal 2D ring Gaussians appearing in numerous GAN papers (Xiao et al., 2018); those are eight (8) spherical Gaussians with equal covariance, sampling size and centers located on sightlines regularly spaced (2-2 angular distance) and at equal distance from the origin. grid Gauss was generated as a decently hard task from (Dumoulin et al., 2017): it consists of 25 2D mixture spherical Gaussians with equal variance and sampled sizes, put on a regular grid. circ Gauss is a Gaussian mode surrounded by a circle, from (Xiao et al., 2018). rand Gauss is a substantially harder version of ring Gauss with 16 mixture components, in which covariance, sampling sizes and distances on sightlines from the origin are all random, which creates very substantial discrepancies between modes.

II.3. Algorithms conﬁguration and choice of parameters

GTs We only report comparisons with the copycat approach in which the discriminator is Kearns & Mansour (1996) s greedy induction algorithm optimising Matusita s loss; our implementation is in Java. The input of our algorithm to train a generator is a .csv ﬁle containing the training data without any further information. In particular, each feature s domain is learned from the training data only; while this could surely and trivially be replaced by a user-informed domain for improved

Generative Trees: Adversarial and Copycat

Domain Source Missing data ? m d # Nom. # Num. iris UCI No 150 5 1 4 ring Gauss No 1 600 2 2 circ Gauss No 2 200 2 2 grid Gauss No 2 500 2 2 house-votes 84 UCI Yes 435 16 16 rand Gauss No 3 800 2 2 led UCI No 1 000 8 8 tictactoe UCI No 958 9 9 winered UCI No 1 599 12 1 11 led24 No 1 000 25 25 abalone UCI No 4 177 9 1 3 winewhite UCI No 4 898 12 1 11 sigma-cabs Kaggle Yes 5 000 13 5 8 open-policing SOP Yes 18 419 20 16 4 dna UCI No 3 186 181 181

Table A5. Public domains considered in our experiments (m total number of examples, d number of features), ordered in increasing m ˆ d. Nom. is a shorthand for nominal / ordinal / binary; Num. stands for integers / reals. ( ) = simulated, ( = Hartford data from the Stanford Open Policing Project, https://openpolicing.stanford.edu/) (see text).

results (e.g. indicating a proportion s domain as r0%, 100%s, informing the complete list of socio-professional categories, etc.) and is in fact standard in some ML packages like weka s ARFF ﬁles, we did not pick this option to alleviate all side information available to the GT learner. Technical details are:

1. features domains are computed from training data; our software automatically recognizes three types of variables: nominal, integer and ﬂoating point represented3;

2. the discriminator s training follows the top-down induction blueprint in Kearns & Mansour (1996);

3. we do not accept splits that will incur a branching probability p P t0, 1u to prevent discarding support; leaves are split from the heaviest ﬁrst; when a real examples with missing values branch in the decision tree on one of its missing values, the probability of following the left / right arc is not 1/2 but takes into account the length of the left / right domains of the variable at the split (for example, if the left branch s variable domain is t A, B, Cu and the right one is t Du for a nominal variable, then the left branching probability is 3/4);

4. we consider three basic sizes of GTs, corresponding to 10, 300 and a MAX = 10 000 nodes splits (thus with total number of nodes equal to 21, 601 and no more than 20 001). In the IMPUTE experiment, we only use the MAX size. Note that the MAX size usually is smaller than 20 001 on small datasets because of the support constraint in [2].

mice We have used the R mice package V 3.13.0 with three choices of methods for the round robin (column-wise) prediction of missing values: cart (Breiman et al., 1984), norm and random forests (rf) (van Buuren & Groothuis Oudshoorn, 2011). In that last case, we have replaced the default number of trees (10) by a larger number (100) to get better results. We use the default number of round-robin iterations (5).

CTGAN We have used the Python implementation4 with default values.

TENSORFLOW To learn the additional Random Forests and Gradient Boosted Decision Trees involved in experiments TRAIN-GEN, GEN-DISCRIM and GEN-AUG, we used Tensorﬂow Decision Forests library5. The important points are:

3This is a difference with mice for which categorical variables need to be explicitly stated. 4https://github.com/sdv-dev/CTGAN 5https://github.com/google/yggdrasil-decision-forests/blob/main/documentation/learners. md

Generative Trees: Adversarial and Copycat

for Random Forests, we use 300 trees with max depth 16. Attribute sampling: sqrt(number attributes) for classiﬁcation problems, number attributes / 3 for regression problems (Breiman rule of thumb);

for Gradient Boosted Decision Trees, we use max 300 trees, with 10% of the training dataset for validation and early stopping. Max depth is 6 and there is no attribute sampling;

in both cases, the min #examples per leaf is 5, we use CART to ﬁnd splits on numerical and categorical features (i.e. we don t use one-hot encoding for categorical features). Induction is top-down.

Computers used We ran part of the experiments on a Mac Book Pro 16 Gb RAM w/ 2 GHz Quad-Core Intel(R) Core i5(R) processor, and part on a desktop Intel(R) Xeon(R) 3.70GHz with 12 cores and 64 Gb RAM.

II.4. Data generation experiments

Figures 7, 8, 9, 10, 11, 12, 13, 14, 15 present 2D heatmap of density learned by generative trees with shown total number of nodes. We have run a simple 10-fold CV experiment, each plot being the generator that minimizes over all folds an empirical χ2 between the training data and a set of generated data. For UCI domains, the variables plotted are indicated and we indicate the number (m1) of examples generated to build the plots.

Warning: each of those plots records 1-of-10 results minimizing an empirical χ2 given the current size of the GT, so the related GT could be different between any two iterations: sudden changes in densities do not stem from any instability of training GTs but just from different GTs being selected. For example, in Figure 8, the GTs kept in the pairs of #nodes = 7 and 11, or #nodes = 35 and 39, or #nodes = 55 and 59, are different from each other (e.g. densities do not suddenly rotate by 90 after few iterations).

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

101 201 301 401 target

Figure 7. Results on the rand Gauss simulated data (target in the bottom-right, m 3800; colors indicate sampled density), for copycat training. Numbers are the total number of nodes of the generators; generators sampled for m1 4000 points each, each plot shows results for one of the ten generators in the CV folds (not necessarily from the same fold); training method = copycat.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

91 101 201 301 target

Figure 8. Results on the grid Gauss simulated data (with m 2500, m1 4000), convention follows Fig. 7. Check the warning in Section II.4 for an accurate interpretation of the plots (any two can refer to different generative trees).

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

91 101 201 301 target

Figure 9. Results on the circ Gauss simulated data (with m m1 2200), convention follows Fig. 7.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

91 101 201 301 target

Figure 10. Results on the ring Gauss simulated data (with m 1500, m1 4000), convention follows Fig. 7.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

107 157 207 267 target

Figure 11. Results on UCI iris domain (with m 150, m1 150) for the 2D plane petal-length ˆ petal-width, convention follows Fig. 7.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

107 157 207 267 target

Figure 12. Results on UCI iris domain (with m 150, m1 4000) for the 2D plane sepal-length ˆ sepal-width, convention follows Fig. 7.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

201 301 401 501 target

Figure 13. Results on UCI winered domain (with m m1 1599) for the 2D plane residual-sugar ˆ chlorides, convention follows Fig. 7.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

201 301 401 501 target

Figure 14. Results on UCI winered domain (with m m1 1599) for the 2D plane residual-sugar ˆ density, convention follows Fig. 7. The discontinuous strip-look is due to the huge difference in scales for the variables.

Generative Trees: Adversarial and Copycat

3 7 11 15 19

23 27 31 35 39

43 47 51 55 59

91 101 201 301 target

Figure 15. Results on UCI sigma-cabs domain (with m m1 5000) for the 2D plane trip-distance ˆ life-style-index, convention follows Fig. 7.

Generative Trees: Adversarial and Copycat

II.5. Missing data imputation experiments (IMPUTE)

Objective Imputing missing values in data is an important process (Muzellec et al., 2020). Classically, imputation methods are speciﬁcally designed for the task (van Buuren, 2018), even when they rely on generative models (Yoon et al., 2018). In our case, a general purpose GT G can trivially be used for missing data imputation: we constrain the support of the tree to the observed variables and then sample in the region(s) of maximal density, for a process that takes no more than Op|Λp Gq|q per observation. We have tested this simple procedure against the state of the art tree-based methods in the mice R package (van Buuren & Groothuis-Oudshoorn, 2011) (and one non-tree based but known to be a good ﬁt for normal data). The methods we have considered in mice all have a commonpoint: they carry out round-robin imputation (Muzellec et al., 2020). After having imputed the missing values with an initial guess, they circle round the attributes, repeatedly updating the prevision of one attribute by predicting it from all the current others. Notice the potentially huge number of classiﬁers used. In our case, on a domain like dna with 181 variables, the default number of iterations (5) with random forests of 100 trees each means imputing a dataset necessitates no less than 90 500 trees. In comparison, we rely on a single tree-based model to simultaneously impute all values. In particular on such domains, we cannot hope to beat such approaches, but our approach was rather to compare with SOTA over ranges of problem complexity, variable diversity and have speciﬁc simulated domains to further scrutinise differences for GTs used as basic components of imputation methods.

Experimental setting We consider copycat training against a discriminator minimizing Matusita s loss (Kearns & Mansour, 1996). We grow the GT to a MAX size (with limit 10 000 nodes, see Section II.6) and prevent splits with p P t0, 1u, thus avoiding discarding support for data generation. For each domain, we generate data that is Missing Completely At Random (MCAR, van Buuren (2018)) by removing a ﬁxed proportion of modalities q P t5%, 10%, 20%, 50%u, embedded in a 5-fold cross validation for each q. In each fold, we thus impute a complete dataset and compare the resulting imputations to the observed values. In such a setting, classical per-observation metrics like RMSE are not necessarily the best choices: if after removing MCAR features two observations were then the same for the resulting features, a perfect imputation of the missing values resulting in a permutation of the observations in the dataset would incur non-zero RMSE, yet would arguably be correct. Similarly to Muzellec et al. (2020), we have thus opted for an optimal transport metric, Wasserstein s W 2 2 . We use mice with method oracles in t CART, NORM, RANDOM FORESTS (RF) (100 trees)u. NORM is not tree-based but a good alternative on normal data (van Buuren, 2018). Notice that we have not used the default number of trees for random forests (10), which we considered too small for our purpose. Due to the difﬁculty of aggregating different types of variables to compute the Wasserstein distance without accidentally dimming the contribution of some to the total, we consider here only domains for which all variables have the same or closely-related types (e.g. categorical with a close number of modalities). We compute our metric using the squared errors normalized to the variable domain for continuous variables, and the error (0/1 loss, multiclass single valued) of prediction for nominal variables (in all cases, the contribution to the distance of each variable is in r0, 1s).

Results A key part of our experiments was to compare approaches on simulated data since we then know the ground truth, including data with non-trivial structure. Table A6 provides all results on our simulated domains. Several conclusions dan be drawn: ﬁrst, our GTs appear to be winning on at least half of the total domain ˆ MCAR% combinations, regardless of the mice contender, even when heavy disparities appear depending on the domain. On grid Gauss for example, we become competitive for large % MCAR (though not statistically signiﬁcantly) while on circ Gauss we statistically signiﬁcantly beat all mice contenders on all but one run. We can also notice the quality of the imputations from the plots: NORM in mice is clearly failing to impute mostly on the heavy dense regions of the domain. Our results are visually much closer to those of CART and RF. We suspect that our method has a different imputation quality pattern vs CART and RF: such methods typically successfully impute near the modes while we can get a more balanced allocation of data among modes. On domains like circ Gauss, we suspect this is the source of our better results. We then have tested what happens when the number of variables increases: to assess this, we used different kind of data (boolean / trinary valued) and domains with an increasing number of variables (from 8 to 181). Results are in Table A7, from which it comes that we are competitive on problems on up to 25 variables and while we are signiﬁcantly beaten by mice on the largest problem, one has to keep in mind that on dna, we compete on imputation with a single tree model per fold when random forests aggregate 90 500 of them to do the same task. On may expect that this has impact as well on the time to complete the task. This turns out to be true: on dna, it takes less than 5 minutes to impute a fold with GTs (taking into account the training of the GT), while mice requires more than two hours for the same task on random forests. The implementations of our algorithms (Java) and mice s (R) require caution in comparing times, but we can safely say that on such domains with relatively large number of

Generative Trees: Adversarial and Copycat

Us vs Mice| NORM CART RF

#trees N/A 10 1 000

5% M M M 10% M M M 20% M M M 50% U U U us mice|NORM mice|CART mice|RF

Us vs Mice| NORM CART RF

#trees N/A 10 1 000

5% U U(0.07) M 10% U U M 20% M U U 50% M M M(0.08) us mice|NORM mice|CART mice|RF

Us vs Mice| NORM CART RF

#trees N/A 10 1 000

5% U(0.04) U U(0.09) 10% U(0.05) U(0.0003) U(0.0007) 20% U(0.001) U(0.001) U(0.002) 50% U(0.04) U(0.05) U(0.05) us mice|NORM mice|CART mice|RF

Us vs Mice| NORM CART RF

#trees N/A 10 1 000

5% U U U 10% M U(0.02) U 20% M M M 50% U M M us mice|NORM mice|CART mice|RF

Table A6. Experiments IMPUTE on simulated domains gridgauss, ringgauss, circgauss and randgauss: comparison of copycat GT induction with maximal size (us) vs mice (van Buuren & Groothuis-Oudshoorn, 2011) with three prediction methods:

NORM, CART and Random Forests. Left table: for each % of missing completely at random (MCAR) variables (in t5, 10, 20, 50u), we indicate which of us (U) or mice (M) achieves the lowest average Wasserstein metric W 2 2 . Bold faces indicate when a Student paired t-test returns a p-val ă 0.1, in which case the p-values is indicated in parenthesis. Right plots: examples imputation results (red dots) on top of the domain s data (green) for 50% MCAR.

variables, our approach takes much less time to complete the task. Finally, mice is speciﬁcally designed for imputation while our GTs can be used for other purposes than just imputation itself.

Generative Trees: Adversarial and Copycat

Us vs Mice| CART RF #trees 40 4 000

5% U(0.04) U 10% U(0.04) M 20% U U 50% M U

Us vs Mice| CART RF #trees 45 4 500

5% U M(0.07) 10% U M 20% M M(0.05) 50% U M

Us vs Mice| CART RF #trees 125 12 500

5% M U 10% U M 20% U M 50% M(0.01) M(0.07)

Us vs Mice| CART RF #trees 905 90 500

5% M(0.03) M(0.02) 10% M(0.02) M(0.004) 20% M(0.001) M(0.002) 50% M(0.002) M(0.001)

Table A7. Experiments IMPUTE on binary/trinary valued domains led, tictactoe, led24, dna, with increasing number of nominal description variables. Notations follow Table A6 (NORM not shown as it does not impute all NAs). While we are competitive on domains with the smallest number of variables, we are clearly beaten by mice when the number of variables substantially increases like for dna. Those numbers have to be read keeping in mind the number of trees involved in imputing a single fold: while we compete with 1 tree against 40 (CART) and 4 000 (RF) on led, we compete with 1 tree against 905 (CART) and 90 500 (RF) on dna.

Generative Trees: Adversarial and Copycat

Domain #1 #2 #3 #4 #5 #6 #7 #8

abalone COPY U(MAX) U(300) C(1K) C(300) C(10) U(10) UNIF

dna COPY U(300) U(MAX) UNIF C(10) U(10) C(1K) C(300)

house votes COPY U(300) U(MAX) C(1K) U(10) C(10) C(300) UNIF

iris COPY C(1K) U(300) U(MAX) C(10) C(300) U(10) UNIF

led24 COPY U(300) U(MAX) C(10) U(10) C(1K) UNIF C(300)

led COPY U(300) U(MAX) C(1K) U(10) C(10) C(300) UNIF

winered COPY U(300) U(MAX) C(1K) UNIF U(10) C(300) C(10)

winewhite COPY U(MAX) U(300) U(10) UNIF C(300) C(10) C(1K)

sigma-cabs COPY C(1K) C(300) C(10) U(MAX) U(300) UNIF U(10)

open-policing COPY U(MAX) C(1K) U(300) C(300) C(10) U(10) UNIF

Table A8. Ranking results on experiment TRAIN-GEN, showing for each domain the order (left to right: from best to worst) of the three sizes of runs of our GTs (U(.), number in parenthesis = number of splits; MAX = up to 10 000 splits), the three runs of CTGAN with different epoch numbers (C(.)), the COPY approach (we use the original data) and the UNIF(orm) approach (we use a random sample). Those two last methods have their cells shaded to locate them.

COPY U(300) U(MAX) C(1K) C(10) U(10) C(300) UNIF

1 2.9 3 4.5 5.7 6 6.2 6.8

Table A9. Average rank for each approach in the TRAIN-GEN, as collected in Table A8, ordered in increasing average rank.

II.6. Training on generated experiment (TRAIN-GEN)

Objective In the context of tabular data, there can be several reasons to replace a training sample with a data generator: (i) the storing size of the generated model, in particular for big databases and / or with substantial redundancy (Rob & Coronel, 1995), (ii) privacy (disregarding the privacy model), (iii) security (with respect to classical databases), etc. . The objective of this experiment is the replacement of the training data by a data generator to address the supervised learning problem related to the training data. For example, for domain iris, the objective is the prediction of a ﬂower s variety among three.

Experimental setting For each domain, we carry out a 5-fold CV, leaving 20% of the data for testing and the rest for training. We train a data generator with the training data, then generate a training sample the size of the generator s training

C(.) 10 300 1K

10 1 / 7 / 23 / 4 / 32 / 3 / 5

300 8 / 1 / 18 / 1 / 14 / 5 / 1

MAX 8 / 1 / 18 / 1 / 17 / 2 / 1

Table A10. Experiment TRAIN-GEN: statistical wins / ties / statistical losses for us (U()) vs CT-GAN (C()). Statistical = signiﬁcant for p ď 0.01. For example, a / b / c means we statistically win a times, lose c times and there is no statistical difference b times. Each red star ( ) indicates a domain for which the related technique performed statistically worse than uniform sampling (UNIF) for p 0.05 (See Table A8 to spot those domains).

Generative Trees: Adversarial and Copycat

Domain U(10) U(300) U(MAX) C(10) C(300) C(1K)

abalone 23 90 193 14 62 215

dna 2 7 38 143 1 051 3 128

house-votes ε ε ε 7 20 51

iris ε ε ε 7 17 42

led24 ε 1 2 6 22 62

led ε 1 1 6 14 46

winered 3 8 14 13 24 62

winewhite 17 46 131 17 52 210

sigma-cabs 19 60 286 13 86 419

open-policing 168 311 933 24 338 1 316

Table A11. Average training times to get the generated training sample on experiment TRAIN-GEN, in seconds, rounded to the nearest second. ε means average ă 0.5s.

Domain |GT|{|X| 100 (%)

abalone 3.99

house-votes 21.57

iris 200.13

winered 7.82

winewhite 2.55

sigma-cabs 2.31

open-policing 0.41

Table A12. Experiment TRAIN-GEN: average sizes of the GT obtained using U(300) relative to the domain size (see text).

Generative Trees: Adversarial and Copycat

ground truth U(10) U(300) U(MAX) C(10) C(300) C(1K)

Figure 16. Experiment TRAIN-GEN: 2D distribution plots for domain abalone (warning: scales vary).

ground truth U(10) U(300) U(MAX) C(10) C(300) C(1K)

Figure 17. Experiment TRAIN-GEN: 2D distribution plots for domain winered (warning: scales vary). Notice that GTs manage to capture complex distribution shapes (such as the wavy 2nd row) that neural nets do not necessarily capture, and neural networks can generate data clearly outside admissible bounds (negative values for some variables)

Generative Trees: Adversarial and Copycat

sample. This training sample is then used in lieu of the original training sample to train a model following the original data s task. The model is then used to predict the class of the testing fold s examples. To alleviate the bias on the choice of this classiﬁer and its training algorithm, we pick two types of classiﬁers / training models: random forests (RFs) and gradient boosted decision trees (GBDTs), that we thus run on each dataset. We compute the accuracy for label prediction and root mean square error for regression problems. For each domain and each generator type, we thus obtain 10 statistics (5 for RFs, 5 for GBDTs), to which we add the 10 running times and use them for comparisons between methods. Speaking methods, we consider for generative trees (GTs) copycat training against a discriminator minimizing Matusita s loss (Kearns & Mansour, 1996). We grow generative trees with three different sizes: very small (10 splits, i.e. 21 total nodes), medium (300 splits, i.e. 601 total nodes) and maximal . In this last case, we provide a limit size of 10 000 splits (i.e. 20 001 total nodes). Notice that this maximal size is usually not reached, in particular when a training fold contains less than 10 000 training examples. We compare our three GT training ﬂavours to three training neural network based methods relying on the state of the art CTGAN (Xu et al., 2019). We use CTGAN code6 with default parameters and varying number of epochs, choosing small (10), medium (300) and large (1K) training epochs. We also use two more contenders: the ﬁrst is the COPY contender, which uses the original training data as training data. The second is the UNIForm contender, which just generates uniform data in the features domains. While we expect COPY to lead the pack of algorithms, contender UNIF is used to point algorithms failing at learning anything substantial about the domain when it signiﬁcantly beats them.

Results Table A8 provides the ranking results among all eight contenders for each domain, where the average of the metric (accuracy or RMSE) is used to rank from best to worst. Table A9 provides a more synthetic view by computing the average rank for each contender. There are several conclusions to draw: (i) as perhaps expected, COPY is always the best approach; what is perhaps less expected is that (ii) UNIF is far from always being the worst, CTGAN being the most frequently beaten contenders (albeit not necessarily statistically signiﬁcantly, see below) on four out of ten domains. More importantly, (iii) our approach largely performs the best in all non-COPY approaches unless small GTs are used (U(10)). In eight out of ten domains, U(300) is in the top-three contenders, and top-two if we exclude COPY. An interesting observation is that U(MAX) is also in the top-three contenders in eight out of ten domains, showing that there is reduced overﬁtting effect due to the large tree size (which we attribute in part to the constraint that p R t0, 1u for GT splits). Obviously, in a context where explainability would be key, the smaller size option (U(300)) would be a preferred choice. To dig further in comparing our method to CTGAN, we have computed the number of domains one signiﬁcantly (p 0.01) beats the other, adding to those statistics the number of times UNIF does signiﬁcantly (p 0.05) beat some contender(s). All results are summarized in Table A10. The results display the superiority of our GT-based approach (unless, again small trees are used), but they also display that CTGAN are, in few cases, signiﬁcantly beaten by UNIF and this can happen for all three epoch numbers. This aligns with the observation that dealing with tabular data with neural nets forces sophisticated choices for the design (Xu et al., 2019) and probably has as consequence that not optimizing sufﬁciently hyperparameters can result in worse performances than uniform generation. We clearly do not have this problem and believe that this is due to the fact that the tree (graphs) used in GTs bring the same convenience for data generation as the tree (graphs) used in decision trees have for discrimination. Figures 16 and 17 provide two examples of sets of 2D plots showing the distribution of generated examples according to different sets of couples of real-values variables. The absence of overﬁtting as well as the capturing of sophisticated features of the data s distribution is quite apparent, also in comparison with neural nets. Last, we have computed the average computation time to get the generated datasets for us and CTGAN thus, inclusive of the training time for the generator. Results are provided in Table A11. One must be cautious in comparing numbers as our implementation of our algorithms is in Java while CTGAN s is in Python, but at least one conclusion seems fair to draw: we are in general and especially for bigger models / longer training achieving much better results than CTGAN. The dna domain, for which the imputation experiments were already displaying our superiority in terms of training time (Section 6), is a clear example of reduction in training time that can be of order 10ˆ 100ˆ with GTs compared to neural networks.

To put our results in perspective, we have also computed the relative size of the GT learned with respect to each domain size, for each experiment: we use as GT size |GT| the total number of vertices and arcs, which equals (1 5 split number) and as domain size, |X|, the total size of the dataset used (number of examples times number of variables). Table A12 presents the results obtained for U(300), from which it appears that the GT learned can represent a tiny proportion of a domain s size at most a few percents in most cases.

6https://github.com/sdv-dev/CTGAN

Generative Trees: Adversarial and Copycat

Split1 Split3 Split2

Test on discrimination

Permut roles

Figure 18. Experiment GEN-DISCRIM: General overview of the pipeline, designed to avoid rewarding generators that would just copy their training sample.

II.7. Fake-real discrimination experiment (GEN-DISCRIM)

Generative Trees: Adversarial and Copycat

Domain wins vs COPY loses vs UNIF

circgauss U(300) (0.01), U(MAX) (0.02) None

randgauss U(300) (0.00004), U(MAX) (0.00002) C(10) (0.01)

ringgauss U(300) (0.002), U(MAX) (0.001) None

abalone None C(10) (0.0006)

dna None None

house-votes None C(10) (0.0002)

iris None C(10) (0.007), C(300) (0.0002)

led24 None None

led None C(1K) (0.03)

winered None U(300) (0.003), U(MAX) (0.001)

winewhite None U(300) (0.006), U(MAX) (0.001)

sigma cabs None U(10) (0.001), U(300) (0.001), U(MAX) (0.001), C(10) (0.001), C(300) (0.003)

open-policing None None

Table A13. Experiment GEN-DISCRIM: for each domain, we compute the list of contenders in our method and CT-GAN statistically winning against COPY and statistically losing against UNIF (for p ď 0.05, indicated in parenthesis).

C(.) 10 300 1K

10 7 / 4 / 26 / 5 / 26 / 5 / 2

300 8 / 4 / 18 / 3 / 28 / 3 / 2

MAX 8 / 4 / 18 / 3 / 28 / 3 / 2

Table A14. Experiment GEN-DISCRIM: statistical wins / ties / statistical losses for us (U()) vs CT-GAN (C()). Statistical = signiﬁcant for p ď 0.01. For example, a / b / c means we statistically win a times, lose c times and there is no statistical difference b times. (See Table A13 to spot those domains).

Generative Trees: Adversarial and Copycat

C(.) 10 300 1K

10 6 / 1 / 36 / 2 / 24 / 1 / 5

300 9 / 0 / 19 / 0 / 16 / 2 / 2

MAX 9 / 0 / 19 / 0 / 17 / 1 / 2

Table A15. Experiment GEN-AUG: statistical wins / ties / statistical losses for us (U()) vs CT-GAN (C()). Statistical = signiﬁcant for p ď 0.01. For example, a / b / c means we statistically win a times, lose c times and there is no statistical difference b times.

Objective The objective ﬁts in a simple question: can generated examples look like real ones ?

Experimental setting The question is simple but its treatment non trivial: we need in particular to avoid rewarding generators that would just copy their training examples. Figure 18 provides the training pipeline we have designed, that we ran for each domain considered. In short, we split each domain in three equal sized parts, say S1, S2 and S3. One of these parts, say S1, is used to trained the generator, which then generates a sample S1 having the same size as S1. We then train a discriminator for the 2-classes supervised learning problem consisting in distinguishing real from fake, using S1 and another original part, say S2 as training samples. The discriminator is then tested on the problem consisting in distinguishing S1 from the last original part (not yet used), S3 in this case. The smaller the ﬁnal accuracy, the more realistic is considered S1. We then permute the roles of the three samples and run the experiment again, ending up in 3! 6 accuracies for each domain. Considering that the discriminator is trained and tested on two different subsets of the original data as real data, there is an incentive to not just copy the original data but capture features about the domain that generalise well for data generation. We have considered the same generators as in experiment TRAIN-GEN : CTGAN with small (10), medium (300) and large (1K) number of training epochs; our method with 10, 300 and MAX splits for GTs (recall that MAX = training up to 10 000 splits); we also consider the COPY and UNIForm baselines, the former giving an idea of the accuracy for the original data and the latter giving the most blunt baseline. We consider the same discriminators as in experiment TRAIN-GEN (random forests and gradient boosted decision trees).

Results We ﬁrst have a look at the extreme results, i.e. how our method and CTGAN compare to COPY and UNIF. Table A13 presents the detailed results obtained for each domain. In this table, we only look at statistically signiﬁcant results for example, when the accuracies on testing were statistically signiﬁcantly larger than UNIF (which means that UNIF performed better), or when the accuracies on testing were statistically signiﬁcantly smaller than COPY (which means that

COPY performed worse). The picture with respect to UNIF displays that some CTGAN (disregarding the number of epochs) get worse results on almost half of the domains (6) while some of our methods gets worse results on 3 of them. When looking at COPY, we see that on our simulated domains, our method actually gets systematically signiﬁcantly better results than COPY when the number of splits is at least 300. This, we believe, signals the potential for GTs to be used as efﬁcient data generators, eventually as parts of more complex generators for more complex domains than our generated domains. We have also drilled down in the comparison between our approach and CTGAN in the same way as we did for experiment TRAIN-GEN. Table A14 presents the aggregated results, whose formatting follows the same rules as for Table A10. The conclusion from this Table is that our approach does better at producing realistically looking datasets than CTGAN does. When the GTs are big enough (at least 300 splits), the picture displays that our approach wins against all CTGAN alternatives on a large majority of domains.

II.8. Generated augmentation experiment (GEN-AUG)

Generative Trees: Adversarial and Copycat

open-policing

Table A16. Experiment GEN-AUG: detailed results on four additional domains for which the metric is the accuracy. Conventions follow Table 4 in the main ﬁle.

Generative Trees: Adversarial and Copycat

Objective Supplementing real data with additional faithful generated data could be envisioned as a way to improve the performances of models trained from the whole data, a problem of substantial practical impact. Our objective was to test how generative models can perform in such a scenario.

Experimental setting The setting can be summarized as the equivalent of the TRAIN-GEN experiment (Section II.6), with the sole modiﬁcation that we train (supervised) models using generated data mixed with the training data of the generators instead of just the generated data alone. Equivalently, we train with COPY + a generated sample. This generated sample could be generated by CTGAN or our GTs, but we also try the COPY case (we add real examples) and the UNIForm case (we add uniformly generated examples). We ﬁrst tried the simplest experimental setting in which the size of the generated data was the same as the size of the training data, but the experiments were largely inconclusive in terms of who wins or loses. Hence, we have dug in this scenario, allowing a varying % of generated data to be added to the training data, for a % of generated data that would represent 5%, 10%, 15%, ..., 100% of the training (real) data. This represents a lot more experiments but also allows us to understand, for each generative technique, what is the effect of putting more generative examples in the training sample. Regardless of the comparison with the COPY approach, a good generative approach should bring improved results as the number of generated examples increases. All other parameters, generative models and supervised classiﬁers considered are the same as in the TRAIN-GEN experiment.

Results Table A15 summarises the results obtained, much in the same way as Tables A10 and A14. The results display that GTs tend to be a better ﬁt for data augmentation than CTGAN. Training more the neural nets does not yield an advantage over GTs, as even when comparing with GTs having a few dozen nodes, the picture is still quite balanced. Table A16 completes Table 4 for the remains domains used. We observe that while dna is clearly the domain where CTGAN obtained the worst results, with all accuracies (substantially) lower than any of UNIF, such bad performances also occur in domains winewhite and winered, which could signal that the issue is not linked to the domain being categorical (winewhite and winered are both real-valued). On iris, CTGAN have a slight edge over GTs, an edge which much more signiﬁcant on sigma-cabs where GT results basically cannot be distinguished from UNIF. We observe that in this domain, CTGAN manage to get an improvement over 5% real data, in the same way as GTs manage to get an improvement over 5% real data on abalone. open-policing, which is our biggest domain, displays that training for a longer time is beneﬁcial to both CTGAN and GTs, but ultimately GTs get the best results. Two conclusions can be drawn: ﬁrst, there is still work to do to beat adding real data, for both neural nets and generative trees, even when those latter models seem to overall get the best and most stable results. Second, we do not observe for generative trees the apparent overﬁtting pattern that follows CTGAN on dna, winewhite and winered. Apart from the fact that we prevent discarding support in the induction of the GTs, we do not see currently any other reason for this observation to happen.