# Tropical Geometry of Deep Neural Networks

Liwen Zhang¹ Gregory Naitzat² Lek-Heng Lim²³

¹Department of Computer Science, University of Chicago, Chicago, IL. ²Department of Statistics, University of Chicago, Chicago, IL. ³Computational and Applied Mathematics Initiative, University of Chicago, Chicago, IL. Correspondence to: Lek-Heng Lim.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## Abstract

We establish, for the first time, explicit connections between feedforward neural networks with ReLU activation and tropical geometry: we show that the family of such neural networks is equivalent to the family of tropical rational maps. Among other things, we deduce that feedforward ReLU neural networks with one hidden layer can be characterized by zonotopes, which serve as building blocks for deeper networks; we relate decision boundaries of such neural networks to tropical hypersurfaces, a major object of study in tropical geometry; and we prove that linear regions of such neural networks correspond to vertices of polytopes associated with tropical rational functions. An insight from our tropical formulation is that a deeper network is exponentially more expressive than a shallow network.

## 1. Introduction

Deep neural networks have recently received much limelight for their enormous success in a variety of applications across many different areas of artificial intelligence, computer vision, speech recognition, and natural language processing (LeCun et al., 2015; Hinton et al., 2012; Krizhevsky et al., 2012; Bahdanau et al., 2014; Kalchbrenner & Blunsom, 2013). Nevertheless, it is also well-known that our theoretical understanding of their efficacy remains incomplete. There have been several attempts to analyze deep neural networks from different perspectives. Notably, earlier studies have suggested that a deep architecture could use parameters more efficiently and requires exponentially fewer parameters to express certain families of functions than a shallow architecture (Delalleau & Bengio, 2011; Bengio & Delalleau, 2011; Montufar et al., 2014; Eldan & Shamir, 2016; Poole et al., 2016; Telgarsky, 2016; Arora et al., 2018). Recent work (Zhang et al., 2016) showed that several successful neural networks possess a high representation power and can easily shatter random data. However, they also generalize well to data unseen during the training stage, suggesting that such networks may have some implicit regularization. Traditional measures of complexity such as VC-dimension and Rademacher complexity fail to explain this phenomenon. Understanding this implicit regularization that begets the generalization power of deep neural networks remains a challenge.

The goal of our work is to establish connections between neural networks and tropical geometry in the hope that they will shed light on the workings of deep neural networks. Tropical geometry is a new area in algebraic geometry that has seen explosive growth in the past decade but remains relatively obscure outside pure mathematics. We will focus on feedforward neural networks with rectified linear units (ReLU) and show that they are analogues of rational functions, i.e., ratios of two multivariate polynomials $f, g$ in variables $x_1, \dots, x_d$,
$$\frac{f(x_1, \dots, x_d)}{g(x_1, \dots, x_d)},$$
in tropical algebra.
For standard and trigonometric polynomials, it is known that rational approximation, i.e., approximating a target function by a ratio of two polynomials instead of a single polynomial, vastly improves the quality of approximation without increasing the degree. This gives our analogue: a ReLU neural network is the tropical ratio of two tropical polynomials, i.e., a tropical rational function. More precisely, if we view a neural network as a map $\nu : \mathbb{R}^d \to \mathbb{R}^p$, $x = (x_1, \dots, x_d) \mapsto (\nu_1(x), \dots, \nu_p(x))$, then $\nu$ is a tropical rational map, i.e., each $\nu_i$ is a tropical rational function. In fact, we will show that:

the family of functions represented by feedforward neural networks with rectified linear units and integer weights is exactly the family of tropical rational maps.

It immediately follows that there is a semifield structure on this family of functions. More importantly, this establishes a bridge between neural networks¹ and tropical geometry that allows us to view neural networks as well-studied tropical geometric objects. This insight allows us to closely relate boundaries between linear regions of a neural network to tropical hypersurfaces and thereby facilitate studies of decision boundaries of neural networks in classification problems as tropical hypersurfaces. Furthermore, the number of linear regions, which captures the complexity of a neural network (Montufar et al., 2014; Raghu et al., 2017; Arora et al., 2018), can be bounded by the number of vertices of the polytopes associated with the neural network's tropical rational representation. Lastly, a neural network with one hidden layer can be completely characterized by zonotopes, which serve as building blocks for deeper networks.

In Sections 2 and 3 we introduce basic tropical algebra and the tropical algebraic geometry of relevance to us. We state our assumptions precisely in Section 4 and establish the connection between tropical geometry and multilayer neural networks in Section 5. We analyze neural networks with tropical tools in Section 6, proving that a deeper neural network is exponentially more expressive than a shallow network, though our objective is not so much to perform state-of-the-art analysis as to demonstrate that tropical algebraic geometry can provide useful insights. All proofs are deferred to Section D of the supplement.

## 2. Tropical Algebra

Roughly speaking, tropical algebraic geometry is an analogue of classical algebraic geometry over $\mathbb{C}$, the field of complex numbers, but where one replaces $\mathbb{C}$ by a semifield² called the tropical semiring, to be defined below. We give a brief review of tropical algebra and introduce some relevant notations. See (Itenberg et al., 2009; Maclagan & Sturmfels, 2015) for an in-depth treatment.

The most fundamental component of tropical algebraic geometry is the tropical semiring $\mathbb{T} := (\mathbb{R} \cup \{-\infty\}, \oplus, \odot)$, also known as the max-plus algebra. The two operations $\oplus$ and $\odot$, called tropical addition and tropical multiplication respectively, are defined as follows.

Definition 2.1. For $x, y \in \mathbb{R}$, their tropical sum is $x \oplus y := \max\{x, y\}$; their tropical product is $x \odot y := x + y$; the tropical quotient of $x$ over $y$ is $x \oslash y := x - y$.

For any $x \in \mathbb{R}$, we have $-\infty \oplus x = x$, $0 \odot x = x$, and $-\infty \odot x = -\infty$. Thus $-\infty$ is the tropical additive identity and $0$ is the tropical multiplicative identity. Furthermore, these operations satisfy the usual laws of arithmetic: associativity, commutativity, and distributivity. The set $\mathbb{R} \cup \{-\infty\}$ is therefore a semiring under the operations $\oplus$ and $\odot$.

¹ Henceforth a neural network will always mean a feedforward neural network with ReLU activation.
² A semifield is a field sans the existence of additive inverses.
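Before moving on, here is a minimal sketch of max-plus arithmetic in Python (our own illustrative code, not from the paper); `float('-inf')` conveniently plays the role of the tropical additive identity.

```python
# A minimal sketch of max-plus (tropical) arithmetic.
# Python's float('-inf') behaves exactly like the tropical additive identity:
# max(-inf, x) == x and -inf + x == -inf.

NEG_INF = float("-inf")

def trop_add(x, y):
    """Tropical sum: x (+) y = max(x, y)."""
    return max(x, y)

def trop_mul(x, y):
    """Tropical product: x (.) y = x + y."""
    return x + y

def trop_div(x, y):
    """Tropical quotient: x (/) y = x - y (y must be finite)."""
    return x - y

# Identities from Definition 2.1:
assert trop_add(NEG_INF, 3.5) == 3.5      # -inf is the additive identity
assert trop_mul(0.0, 3.5) == 3.5          # 0 is the multiplicative identity
assert trop_mul(NEG_INF, 3.5) == NEG_INF  # -inf annihilates under (.)
```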
While it is not a ring (it lacks additive inverses), one may nonetheless generalize many algebraic objects (e.g., matrices, polynomials, tensors) and notions (e.g., rank, determinant, degree) over the tropical semiring; the study of these, in a nutshell, constitutes the subject of tropical algebra.

Let $\mathbb{N} = \{n \in \mathbb{Z} : n \geq 0\}$. For an integer $a \in \mathbb{N}$, raising $x \in \mathbb{R}$ to the $a$th power is the same as multiplying $x$ by itself $a$ times. When standard multiplication is replaced by tropical multiplication, this gives us the tropical power:
$$x^{\odot a} := x \odot \cdots \odot x = a \cdot x,$$
where the last $\cdot$ denotes the standard product of real numbers; it is extended to $\mathbb{R} \cup \{-\infty\}$ by defining, for any $a \in \mathbb{N}$,
$$(-\infty)^{\odot a} := \begin{cases} -\infty & \text{if } a > 0, \\ 0 & \text{if } a = 0. \end{cases}$$

The tropical semiring, while not a field, possesses one quality of a field: every $x \in \mathbb{R}$ has a tropical multiplicative inverse given by its standard additive inverse, i.e., $x^{\odot(-1)} := -x$. Though not reflected in its name, $\mathbb{T}$ is in fact a semifield. One may therefore also raise $x \in \mathbb{R}$ to a negative power $a \in \mathbb{Z}$ by raising its tropical multiplicative inverse $-x$ to the positive power $-a$, i.e., $x^{\odot a} := (-x)^{\odot(-a)}$. As is the case in standard real arithmetic, the tropical additive identity $-\infty$ does not have a tropical multiplicative inverse and $(-\infty)^{\odot a}$ is undefined for $a < 0$. For notational simplicity, we will henceforth write $x^a$ instead of $x^{\odot a}$ for the tropical power when there is no cause for confusion. Other algebraic rules of the tropical power may be derived from the definition; see Section B in the supplement.

We are now in a position to define tropical polynomials and tropical rational functions. In the following, $x$ and $x_i$ will denote variables (i.e., indeterminates).

Definition 2.2. A tropical monomial in $d$ variables $x_1, \dots, x_d$ is an expression of the form
$$c \odot x_1^{a_1} \odot x_2^{a_2} \odot \cdots \odot x_d^{a_d},$$
where $c \in \mathbb{R} \cup \{-\infty\}$ and $a_1, \dots, a_d \in \mathbb{N}$. As a convenient shorthand, we will also write a tropical monomial in multiindex notation as $c x^\alpha$ where $\alpha = (a_1, \dots, a_d) \in \mathbb{N}^d$ and $x = (x_1, \dots, x_d)$. Note that $x^\alpha = 0 \odot x^\alpha$ as $0$ is the tropical multiplicative identity.

Definition 2.3. Following the notations above, a tropical polynomial $f(x) = f(x_1, \dots, x_d)$ is a finite tropical sum of tropical monomials
$$f(x) = c_1 x^{\alpha_1} \oplus \cdots \oplus c_r x^{\alpha_r},$$
where $\alpha_i = (a_{i1}, \dots, a_{id}) \in \mathbb{N}^d$ and $c_i \in \mathbb{R} \cup \{-\infty\}$, $i = 1, \dots, r$. We will assume that a monomial of a given multiindex appears at most once in the sum, i.e., $\alpha_i \neq \alpha_j$ for any $i \neq j$.

Definition 2.4. Following the notations above, a tropical rational function is a standard difference, or, equivalently, a tropical quotient of two tropical polynomials $f(x)$ and $g(x)$:
$$f(x) - g(x) = f(x) \oslash g(x).$$
We will denote a tropical rational function by $f \oslash g$, where $f$ and $g$ are understood to be tropical polynomial functions.

It is routine to verify that the set of tropical polynomials $\mathbb{T}[x_1, \dots, x_d]$ forms a semiring under the standard extension of $\oplus$ and $\odot$ to tropical polynomials, and likewise the set of tropical rational functions $\mathbb{T}(x_1, \dots, x_d)$ forms a semifield. We regard a tropical polynomial $f = f \oslash 0$ as a special case of a tropical rational function and thus $\mathbb{T}[x_1, \dots, x_d] \subseteq \mathbb{T}(x_1, \dots, x_d)$. Henceforth any result stated for a tropical rational function would implicitly also hold for a tropical polynomial.

A $d$-variate tropical polynomial $f(x)$ defines a function $f : \mathbb{R}^d \to \mathbb{R}$ that is convex in the usual sense, as taking the max and sum of convex functions preserves convexity (Boyd & Vandenberghe, 2004).
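Since a tropical polynomial is just a max of affine functions, evaluating one is a one-liner. Below is a minimal sketch (our own illustrative code; the helper name is hypothetical) that evaluates the tropical polynomial $f(x_1, x_2) = 1 \odot x_1^2 \oplus 1 \odot x_2^2 \oplus 2 \odot x_1 x_2 \oplus 2 \odot x_1 \oplus 2 \odot x_2 \oplus 2$, the running example of Section 3.

```python
import numpy as np

def trop_poly(coeffs, exponents):
    """Return the function x -> max_i (c_i + <alpha_i, x>), i.e. the
    tropical polynomial (+)_i c_i (.) x^{alpha_i}."""
    c = np.asarray(coeffs, dtype=float)     # c_i, shape (r,)
    A = np.asarray(exponents, dtype=float)  # alpha_i as rows, shape (r, d)
    return lambda x: float(np.max(c + A @ np.asarray(x, dtype=float)))

# f(x1, x2) = 1.x1^2 (+) 1.x2^2 (+) 2.x1x2 (+) 2.x1 (+) 2.x2 (+) 2
f = trop_poly(
    coeffs=[1, 1, 2, 2, 2, 2],
    exponents=[(2, 0), (0, 2), (1, 1), (1, 0), (0, 1), (0, 0)],
)
print(f((0.0, 0.0)))  # 2.0: the constant monomial attains the max
print(f((3.0, 0.0)))  # 7.0: the monomial 1 (.) x1^2 attains the max
```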
As such, a tropical rational function $f \oslash g : \mathbb{R}^d \to \mathbb{R}$ is a DC function or difference-convex function (Hartman, 1959; Tao & Hoai An, 2005).

We will need a notion of vector-valued tropical polynomials and tropical rational functions.

Definition 2.5. $F : \mathbb{R}^d \to \mathbb{R}^p$, $x = (x_1, \dots, x_d) \mapsto (f_1(x), \dots, f_p(x))$, is called a tropical polynomial map if each $f_i : \mathbb{R}^d \to \mathbb{R}$ is a tropical polynomial, $i = 1, \dots, p$, and a tropical rational map if $f_1, \dots, f_p$ are tropical rational functions. We will denote the set of tropical polynomial maps by $\mathrm{Pol}(d, p)$ and the set of tropical rational maps by $\mathrm{Rat}(d, p)$. So $\mathrm{Pol}(d, 1) = \mathbb{T}[x_1, \dots, x_d]$ and $\mathrm{Rat}(d, 1) = \mathbb{T}(x_1, \dots, x_d)$.

## 3. Tropical Hypersurfaces

There are tropical analogues of many notions in classical algebraic geometry (Itenberg et al., 2009; Maclagan & Sturmfels, 2015), among which are tropical hypersurfaces, tropical analogues of algebraic curves in classical algebraic geometry. Tropical hypersurfaces are a principal object of interest in tropical geometry and will prove very useful in our approach towards neural networks. Intuitively, the tropical hypersurface of a tropical polynomial $f$ is the set of points $x$ where $f$ is not linear.

Definition 3.1. The tropical hypersurface of a tropical polynomial $f(x) = c_1 x^{\alpha_1} \oplus \cdots \oplus c_r x^{\alpha_r}$ is
$$\mathcal{T}(f) := \{x \in \mathbb{R}^d : c_i x^{\alpha_i} = c_j x^{\alpha_j} = f(x) \text{ for some } \alpha_i \neq \alpha_j\},$$
i.e., the set of points $x$ at which the value of $f$ at $x$ is attained by two or more monomials in $f$.

Figure 1. Tropical curve and dual subdivision of the Newton polygon for $f(x_1, x_2) = 1 \odot x_1^2 \oplus 1 \odot x_2^2 \oplus 2 \odot x_1 x_2 \oplus 2 \odot x_1 \oplus 2 \odot x_2 \oplus 2$. Left: tropical curve. Right: dual subdivision of the Newton polygon and tropical curve.

A tropical hypersurface divides the domain of $f$ into convex cells on each of which $f$ is linear. These cells are convex polyhedra, i.e., defined by linear inequalities with integer coefficients: $\{x \in \mathbb{R}^d : Ax \leq b\}$ for $A \in \mathbb{Z}^{m \times d}$ and $b \in \mathbb{R}^m$. For example, the cell where a tropical monomial $c_j x^{\alpha_j}$ attains its maximum is $\{x \in \mathbb{R}^d : c_j + \alpha_j^{\mathsf{T}} x \geq c_i + \alpha_i^{\mathsf{T}} x \text{ for all } i \neq j\}$. Tropical hypersurfaces of polynomials in two variables (i.e., in $\mathbb{R}^2$) are called tropical curves.

Just like standard multivariate polynomials, every tropical polynomial comes with an associated Newton polygon.

Definition 3.2. The Newton polygon of a tropical polynomial $f(x) = c_1 x^{\alpha_1} \oplus \cdots \oplus c_r x^{\alpha_r}$ is the convex hull of $\alpha_1, \dots, \alpha_r \in \mathbb{N}^d$, regarded as points in $\mathbb{R}^d$:
$$\Delta(f) := \mathrm{Conv}\{\alpha_i \in \mathbb{R}^d : c_i \neq -\infty,\ i = 1, \dots, r\}.$$

A tropical polynomial $f$ determines a dual subdivision of $\Delta(f)$, constructed as follows. First, lift each $\alpha_i$ from $\mathbb{R}^d$ into $\mathbb{R}^{d+1}$ by appending $c_i$ as the last coordinate. Denote the convex hull of the lifted $\alpha_1, \dots, \alpha_r$ by
$$\mathcal{P}(f) := \mathrm{Conv}\{(\alpha_i, c_i) \in \mathbb{R}^d \times \mathbb{R} : i = 1, \dots, r\}. \tag{1}$$
Next let $\mathrm{UF}(\mathcal{P}(f))$ denote the collection of upper faces of $\mathcal{P}(f)$ and let $\pi : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ be the projection that drops the last coordinate. The dual subdivision determined by $f$ is then
$$\delta(f) := \{\pi(p) \subseteq \mathbb{R}^d : p \in \mathrm{UF}(\mathcal{P}(f))\}.$$
$\delta(f)$ forms a polyhedral complex with support $\Delta(f)$. By (Maclagan & Sturmfels, 2015, Proposition 3.1.6), the tropical hypersurface $\mathcal{T}(f)$ is the $(d-1)$-skeleton of the polyhedral complex dual to $\delta(f)$. This means that each vertex in $\delta(f)$ corresponds to one cell in $\mathbb{R}^d$ on which the function $f$ is linear. Thus, the number of vertices in $\mathcal{P}(f)$ provides an upper bound on the number of linear regions of $f$.

Figure 1 shows the Newton polygon and dual subdivision for the tropical polynomial $f(x_1, x_2) = 1 \odot x_1^2 \oplus 1 \odot x_2^2 \oplus 2 \odot x_1 x_2 \oplus 2 \odot x_1 \oplus 2 \odot x_2 \oplus 2$. Figure 2 shows how we may find the dual subdivision for this tropical polynomial by following the aforementioned procedure, with step-by-step details given in Section C.1.

Figure 2. Upper envelope of the polytope $\mathcal{P}(f)$ and dual subdivision of the Newton polygon for $f(x_1, x_2) = 1 \odot x_1^2 \oplus 1 \odot x_2^2 \oplus 2 \odot x_1 x_2 \oplus 2 \odot x_1 \oplus 2 \odot x_2 \oplus 2$. The dual subdivision can be obtained by projecting the edges on the upper faces of the polytope.
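The construction above is easy to carry out numerically. The sketch below (our own illustration, assuming `scipy` is available) lifts the exponents of the Figure 1 polynomial by their coefficients, takes the convex hull, and keeps the faces whose outward normal points upward; projecting these gives $\delta(f)$.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Exponents alpha_i and coefficients c_i of
# f = 1.x1^2 (+) 1.x2^2 (+) 2.x1x2 (+) 2.x1 (+) 2.x2 (+) 2
alphas = np.array([(2, 0), (0, 2), (1, 1), (1, 0), (0, 1), (0, 0)], dtype=float)
cs = np.array([1, 1, 2, 2, 2, 2], dtype=float)

# P(f): lift each alpha_i to (alpha_i, c_i) in R^{d+1}, then take the hull.
lifted = np.hstack([alphas, cs[:, None]])
hull = ConvexHull(lifted)

# Upper faces are those whose outward normal has a positive last coordinate.
# (Each row of hull.equations is (normal, offset); index -2 is the normal's
# z-component. QHull triangulates non-simplicial faces, so a quadrilateral
# upper face may be reported as two triangles.)
upper = [simplex for simplex, eq in zip(hull.simplices, hull.equations)
         if eq[-2] > 1e-9]

# Projecting the upper faces (dropping c_i) yields the dual subdivision
# delta(f); its vertices index the linear regions of f.
for simplex in upper:
    print(alphas[simplex])
```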
Tropical polynomials and tropical rational functions are clearly piecewise linear functions. As such, a tropical rational map is a piecewise linear map and the notion of a linear region applies.

Definition 3.3. A linear region of $F \in \mathrm{Rat}(d, m)$ is a maximal connected subset of the domain on which $F$ is linear. The number of linear regions of $F$ is denoted $\mathcal{N}(F)$.

Note that a tropical polynomial map $F \in \mathrm{Pol}(d, m)$ has convex linear regions, but a tropical rational map $F \in \mathrm{Rat}(d, m)$ generally has nonconvex linear regions. In Section 6.3, we will use $\mathcal{N}(F)$ as a measure of complexity for an $F \in \mathrm{Rat}(d, m)$ given by a neural network.

### 3.1. Transformations of Tropical Polynomials

Our analysis of neural networks will require figuring out how the polytope $\mathcal{P}(f)$ transforms under tropical power, sum, and product. The first is straightforward.

Proposition 3.1. Let $f$ be a tropical polynomial and let $a \in \mathbb{N}$. Then $\mathcal{P}(f^{\odot a}) = a\mathcal{P}(f)$.

Here $a\mathcal{P}(f) = \{ax : x \in \mathcal{P}(f)\} \subseteq \mathbb{R}^{d+1}$ is a scaled version of $\mathcal{P}(f)$ with the same shape but different volume.

To describe the effect of tropical sum and product, we need a few notions from convex geometry. The Minkowski sum of two sets $P_1$ and $P_2$ in $\mathbb{R}^d$ is the set
$$P_1 + P_2 := \{x_1 + x_2 \in \mathbb{R}^d : x_1 \in P_1,\ x_2 \in P_2\};$$
and for $\lambda_1, \lambda_2 \geq 0$, their weighted Minkowski sum is
$$\lambda_1 P_1 + \lambda_2 P_2 := \{\lambda_1 x_1 + \lambda_2 x_2 \in \mathbb{R}^d : x_1 \in P_1,\ x_2 \in P_2\}.$$
The weighted Minkowski sum is clearly commutative and associative and generalizes to more than two sets. In particular, the Minkowski sum of line segments is called a zonotope.

Let $\mathcal{V}(P)$ denote the set of vertices of a polytope $P$. Clearly, the Minkowski sum of two polytopes is given by the convex hull of the Minkowski sum of their vertex sets, i.e., $P_1 + P_2 = \mathrm{Conv}(\mathcal{V}(P_1) + \mathcal{V}(P_2))$. With this observation, the following is immediate.

Proposition 3.2. Let $f, g \in \mathrm{Pol}(d, 1) = \mathbb{T}[x_1, \dots, x_d]$ be tropical polynomials. Then
$$\mathcal{P}(f \odot g) = \mathcal{P}(f) + \mathcal{P}(g), \qquad \mathcal{P}(f \oplus g) = \mathrm{Conv}\big(\mathcal{V}(\mathcal{P}(f)) \cup \mathcal{V}(\mathcal{P}(g))\big).$$

We reproduce below part of (Gritzmann & Sturmfels, 1993, Theorem 2.1.10) and derive a corollary for bounding the number of vertices on the upper faces of a zonotope.

Theorem 3.3 (Gritzmann–Sturmfels). Let $P_1, \dots, P_k$ be polytopes in $\mathbb{R}^d$ and let $m$ denote the total number of nonparallel edges of $P_1, \dots, P_k$. Then the number of vertices of $P_1 + \cdots + P_k$ does not exceed
$$2 \sum_{j=0}^{d-1} \binom{m-1}{j}.$$
The upper bound is attained if all $P_i$'s are zonotopes and all their generating line segments are in general positions.

Corollary 3.4. Let $P \subseteq \mathbb{R}^{d+1}$ be a zonotope generated by $m$ line segments $P_1, \dots, P_m$. Let $\pi : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ be the projection. Suppose $P$ satisfies: (i) the generating line segments are in general positions; (ii) the set of projected vertices $\{\pi(v) : v \in \mathcal{V}(P)\} \subseteq \mathbb{R}^d$ is in general positions. Then $P$ has
$$\sum_{i=0}^{d} \binom{m}{i}$$
vertices on its upper faces. If either (i) or (ii) is violated, then this becomes an upper bound.

As we mentioned, linear regions of a tropical polynomial $f$ correspond to vertices on $\mathrm{UF}(\mathcal{P}(f))$, and the corollary will be useful for bounding the number of linear regions.

## 4. Neural Networks

While we expect our readership to be familiar with feedforward neural networks, we will nevertheless use this short section to define them, primarily for the purpose of fixing notations and specifying the assumptions that we retain throughout this article. We restrict our attention to fully connected feedforward neural networks.
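As a quick sanity check on Corollary 3.4, the bound is easy to tabulate. The sketch below (our own helper; the name is hypothetical) counts $\sum_{i=0}^{d} \binom{m}{i}$, the maximal number of upper-face vertices of a zonotope in $\mathbb{R}^{d+1}$ built from $m$ generic segments, which by the discussion above also bounds the number of linear regions of the corresponding tropical polynomial.

```python
from math import comb

def upper_vertex_bound(m: int, d: int) -> int:
    """Corollary 3.4: max number of vertices on the upper faces of a
    zonotope in R^{d+1} generated by m line segments in general position."""
    return sum(comb(m, i) for i in range(d + 1))

# For input dimension d = 2, the bound grows like O(m^d) in the number
# of generating segments m:
for m in range(1, 6):
    print(m, upper_vertex_bound(m, d=2))
# 1 2 / 2 4 / 3 7 / 4 11 / 5 16
```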
Viewed abstractly, an $L$-layer feedforward neural network is a map $\nu : \mathbb{R}^d \to \mathbb{R}^p$ given by a composition of functions
$$\nu = \sigma^{(L)} \circ \rho^{(L)} \circ \sigma^{(L-1)} \circ \rho^{(L-1)} \circ \cdots \circ \sigma^{(1)} \circ \rho^{(1)}.$$
The preactivation functions $\rho^{(1)}, \dots, \rho^{(L)}$ are affine transformations to be determined and the activation functions $\sigma^{(1)}, \dots, \sigma^{(L)}$ are chosen and fixed in advance. We denote the width, i.e., the number of nodes, of the $l$th layer by $n_l$, $l = 1, \dots, L-1$. We set $n_0 := d$ and $n_L := p$, respectively the dimensions of the input and output of the network. The output from the $l$th layer will be denoted by
$$\nu^{(l)} := \sigma^{(l)} \circ \rho^{(l)} \circ \sigma^{(l-1)} \circ \rho^{(l-1)} \circ \cdots \circ \sigma^{(1)} \circ \rho^{(1)},$$
i.e., it is a map $\nu^{(l)} : \mathbb{R}^d \to \mathbb{R}^{n_l}$. For convenience, we set $\nu^{(0)}(x) := x$.

The affine function $\rho^{(l)} : \mathbb{R}^{n_{l-1}} \to \mathbb{R}^{n_l}$ is given by a weight matrix $A^{(l)} \in \mathbb{Z}^{n_l \times n_{l-1}}$ and a bias vector $b^{(l)} \in \mathbb{R}^{n_l}$:
$$\rho^{(l)}(\nu^{(l-1)}) := A^{(l)} \nu^{(l-1)} + b^{(l)}.$$
The $(i, j)$th entry of $A^{(l)}$ will be denoted $a^{(l)}_{ij}$ and the $i$th coordinate of $b^{(l)}$ by $b^{(l)}_i$. Collectively they form the parameters of the $l$th layer. For a vector input $x \in \mathbb{R}^{n_l}$, $\sigma^{(l)}(x)$ is understood in the coordinatewise sense, so $\sigma^{(l)} : \mathbb{R}^{n_l} \to \mathbb{R}^{n_l}$.

We assume the final output of a neural network $\nu(x)$ is fed into a score function $s : \mathbb{R}^p \to \mathbb{R}^m$ that is application specific. When used as an $m$-category classifier, $s$ may be chosen, for example, to be a soft-max or sigmoidal function. The score function is quite often regarded as the last layer of a neural network, but this is purely a matter of convenience and we will not assume this.

We will make the following mild assumptions on the architecture of our feedforward neural networks and explain next why they are indeed mild:
(a) the weight matrices $A^{(1)}, \dots, A^{(L)}$ are integer-valued;
(b) the bias vectors $b^{(1)}, \dots, b^{(L)}$ are real-valued;
(c) the activation functions $\sigma^{(1)}, \dots, \sigma^{(L)}$ take the form $\sigma^{(l)}(x) := \max\{x, t^{(l)}\}$, where $t^{(l)} \in (\mathbb{R} \cup \{-\infty\})^{n_l}$ is called a threshold vector.

Henceforth all neural networks in our subsequent discussions will be assumed to satisfy (a)–(c). (b) is completely general, but there is also no loss of generality in (a), i.e., in restricting the weights $A^{(1)}, \dots, A^{(L)}$ from real matrices to integer matrices: real weights can be approximated arbitrarily closely by rational weights; one may then clear denominators in these rational weights by multiplying them by the least common multiple of their denominators to obtain integer weights, keeping in mind that scaling all weights and biases by the same positive constant has no bearing on the workings of a neural network.

The activation function in (c) includes both ReLU activation ($t^{(l)} = 0$) and the identity map ($t^{(l)} = -\infty$) as special cases. Aside from ReLU, our tropical framework applies to piecewise linear activations such as leaky ReLU and absolute value, and, with some extra effort, may be extended to max pooling, maxout nets, etc. But it does not, for example, apply to activations such as the hyperbolic tangent and sigmoid. In this work, we view a ReLU network as the simplest and most canonical model of a neural network, from which other variants that are more effective at specific tasks may be derived. Given that we seek general theoretical insights and not specific practical efficacy, it makes sense to limit ourselves to this simplest case. Moreover, ReLU networks already embody some of the most important elements (and mysteries) common to a wider range of neural networks (e.g., universal approximation, exponential expressiveness); they work well in practice and are often the go-to choice for feedforward networks.
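For concreteness, here is a minimal sketch (our own code; all names are illustrative) of a network satisfying (a)–(c): integer weight matrices, real biases, and $\sigma(x) = \max\{x, t\}$ with $t = 0$ for ReLU layers and $t = -\infty$ for a linear output layer.

```python
import numpy as np

def forward(x, layers):
    """Evaluate nu = sigma(L) o rho(L) o ... o sigma(1) o rho(1) at x.
    `layers` is a list of (A, b, t) with A integer-valued, b real,
    and threshold t = 0 (ReLU) or t = -inf (identity activation)."""
    v = np.asarray(x, dtype=float)
    for A, b, t in layers:
        v = np.maximum(A @ v + b, t)  # sigma(rho(v)) = max{Av + b, t}
    return v

# A toy 2-layer network R^2 -> R with a ReLU hidden layer and linear output.
rng = np.random.default_rng(0)
layers = [
    (rng.integers(-3, 4, size=(4, 2)), rng.standard_normal(4), 0.0),
    (rng.integers(-3, 4, size=(1, 4)), rng.standard_normal(1), -np.inf),
]
print(forward([1.0, -2.0], layers))
```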
We are also not alone in limiting our discussion to ReLU networks (Montufar et al., 2014; Arora et al., 2018).

## 5. Tropical Algebra of Neural Networks

We now describe our tropical formulation of a multilayer feedforward neural network satisfying (a)–(c). A multilayer feedforward neural network is generally nonconvex, whereas a tropical polynomial is always convex. Since most nonconvex functions are a difference of two convex functions (Hartman, 1959), a reasonable guess is that a feedforward neural network is the difference of two tropical polynomials, i.e., a tropical rational function. This is indeed the case, as we will see from the following.

Consider the output from the first layer of a neural network, $\nu(x) = \max\{Ax + b, t\}$, where $A \in \mathbb{Z}^{p \times d}$, $b \in \mathbb{R}^p$, and $t \in (\mathbb{R} \cup \{-\infty\})^p$. We will decompose $A$ as a difference of two nonnegative integer-valued matrices, $A = A_+ - A_-$ with $A_+, A_- \in \mathbb{N}^{p \times d}$; e.g., in the standard way with entries $a^+_{ij} := \max\{a_{ij}, 0\}$ and $a^-_{ij} := \max\{-a_{ij}, 0\}$ respectively. Since
$$\max\{Ax + b, t\} = \max\{A_+ x + b,\ A_- x + t\} - A_- x,$$
we see that every coordinate of a one-layer neural network is a difference of two tropical polynomials. For networks with more layers, we apply this decomposition recursively to obtain the following result.

Proposition 5.1. Let $A \in \mathbb{Z}^{m \times n}$ and $b \in \mathbb{R}^m$ be the parameters of the $(l+1)$th layer, and let $t \in (\mathbb{R} \cup \{-\infty\})^m$ be the threshold vector of the $(l+1)$th layer. If the nodes of the $l$th layer are given by tropical rational functions,
$$\nu^{(l)}(x) = F^{(l)}(x) \oslash G^{(l)}(x) = F^{(l)}(x) - G^{(l)}(x),$$
i.e., each coordinate of $F^{(l)}$ and $G^{(l)}$ is a tropical polynomial in $x$, then the outputs of the preactivation and of the $(l+1)$th layer are given by tropical rational functions
$$\rho^{(l+1)} \circ \nu^{(l)}(x) = H^{(l+1)}(x) - G^{(l+1)}(x), \qquad \nu^{(l+1)}(x) = \sigma \circ \rho^{(l+1)} \circ \nu^{(l)}(x) = F^{(l+1)}(x) - G^{(l+1)}(x)$$
respectively, where
$$F^{(l+1)}(x) = \max\big\{H^{(l+1)}(x),\ G^{(l+1)}(x) + t\big\}, \quad G^{(l+1)}(x) = A_+ G^{(l)}(x) + A_- F^{(l)}(x), \quad H^{(l+1)}(x) = A_+ F^{(l)}(x) + A_- G^{(l)}(x) + b.$$

We will write $f^{(l)}_i$, $g^{(l)}_i$, and $h^{(l)}_i$ for the $i$th coordinates of $F^{(l)}$, $G^{(l)}$, and $H^{(l)}$ respectively. In tropical arithmetic, the recurrence above takes the form
$$f^{(l+1)}_i = h^{(l+1)}_i \oplus \big(g^{(l+1)}_i \odot t_i\big), \quad g^{(l+1)}_i = \bigodot_{j=1}^{n} \big(f^{(l)}_j\big)^{a^-_{ij}} \odot \bigodot_{j=1}^{n} \big(g^{(l)}_j\big)^{a^+_{ij}}, \quad h^{(l+1)}_i = \bigodot_{j=1}^{n} \big(f^{(l)}_j\big)^{a^+_{ij}} \odot \bigodot_{j=1}^{n} \big(g^{(l)}_j\big)^{a^-_{ij}} \odot b_i. \tag{2}$$

Repeated applications of Proposition 5.1 yield the following.

Theorem 5.2 (Tropical characterization of neural networks). A feedforward neural network under assumptions (a)–(c) is a function $\nu : \mathbb{R}^d \to \mathbb{R}^p$ whose coordinates are tropical rational functions of the input, i.e.,
$$\nu(x) = F(x) \oslash G(x) = F(x) - G(x),$$
where $F$ and $G$ are tropical polynomial maps. Thus $\nu$ is a tropical rational map.

Note that the tropical rational functions above have real coefficients, not integer coefficients. The integer weights $A^{(l)} \in \mathbb{Z}^{n_l \times n_{l-1}}$ have gone into the powers of the tropical monomials in $f$ and $g$, which is why we require our weights to be integer-valued, although, as we have explained, this requirement imposes little loss of generality. By setting $t^{(1)} = \cdots = t^{(L-1)} = 0$ and $t^{(L)} = -\infty$, we obtain the following corollary.

Corollary 5.3. Let $\nu : \mathbb{R}^d \to \mathbb{R}$ be a ReLU-activated feedforward neural network with integer weights and linear output. Then $\nu$ is a tropical rational function.

A more remarkable fact is the converse of Corollary 5.3.

Theorem 5.4 (Equivalence of neural networks and tropical rational functions). (i) Let $\nu : \mathbb{R}^d \to \mathbb{R}$. Then $\nu$ is a tropical rational function if and only if $\nu$ is a feedforward neural network satisfying assumptions (a)–(c).
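Proposition 5.1 is constructive, and it is easy to check numerically. The sketch below (our own code, assuming ReLU thresholds $t = 0$ throughout) propagates the pair $(F, G)$ of convex piecewise linear functions through the layers and verifies that $F - G$ agrees with the ordinary forward pass.

```python
import numpy as np

def forward_fg(x, layers):
    """Propagate the decomposition nu^(l) = F^(l) - G^(l) of Proposition 5.1,
    with ReLU thresholds t = 0. Returns (F(x), G(x)) with nu(x) = F - G."""
    x = np.asarray(x, dtype=float)
    F, G = x, np.zeros_like(x)                    # input layer: nu^(0) = x - 0
    for A, b in layers:
        Ap, Am = np.maximum(A, 0), np.maximum(-A, 0)  # A = A+ - A-
        H = Ap @ F + Am @ G + b                   # H^(l+1)
        G = Ap @ G + Am @ F                       # G^(l+1)
        F = np.maximum(H, G)                      # F^(l+1) = max{H, G + t}, t = 0
    return F, G

rng = np.random.default_rng(1)
layers = [(rng.integers(-2, 3, size=(5, 3)), rng.standard_normal(5)),
          (rng.integers(-2, 3, size=(1, 5)), rng.standard_normal(1))]

x = rng.standard_normal(3)
F, G = forward_fg(x, layers)
v = x
for A, b in layers:
    v = np.maximum(A @ v + b, 0.0)  # ordinary ReLU forward pass
assert np.allclose(F - G, v)        # nu = F - G
```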
(ii) A tropical rational function $f \oslash g$ can be represented as an $L$-layer neural network with
$$L \leq \max\{\lceil \log_2 r_f \rceil, \lceil \log_2 r_g \rceil\} + 2,$$
where $r_f$ and $r_g$ are the numbers of monomials in the tropical polynomials $f$ and $g$ respectively.

We would like to acknowledge the precedence of (Arora et al., 2018, Theorem 2.1), which demonstrates the equivalence between ReLU-activated $L$-layer neural networks with real weights and $d$-variate continuous piecewise linear functions with real coefficients, where $L \leq \lceil \log_2(d+1) \rceil + 1$.

By construction, a tropical rational function is a continuous piecewise linear function. The continuity of a piecewise linear function automatically implies that each of the pieces on which it is linear is a polyhedral region. As we saw in Section 3, a tropical polynomial $f : \mathbb{R}^d \to \mathbb{R}$ gives a tropical hypersurface that divides $\mathbb{R}^d$ into convex polyhedral regions defined by linear inequalities with integer coefficients: $\{x \in \mathbb{R}^d : Ax \leq b\}$ with $A \in \mathbb{Z}^{m \times d}$ and $b \in \mathbb{R}^m$. A tropical rational function $f \oslash g : \mathbb{R}^d \to \mathbb{R}$ must also be a continuous piecewise linear function and divide $\mathbb{R}^d$ into polyhedral regions on each of which $f \oslash g$ is linear, although these regions are nonconvex in general. We will show the converse: any continuous piecewise linear function with integer coefficients is a tropical rational function.

Proposition 5.5. Let $\nu : \mathbb{R}^d \to \mathbb{R}$. Then $\nu$ is a continuous piecewise linear function with integer coefficients if and only if $\nu$ is a tropical rational function.

Corollary 5.3, Theorem 5.4, and Proposition 5.5 collectively imply the equivalence of (i) tropical rational functions, (ii) continuous piecewise linear functions with integer coefficients, and (iii) neural networks satisfying assumptions (a)–(c). An immediate advantage of this characterization is that the set of tropical rational functions $\mathbb{T}(x_1, \dots, x_d)$ has a semifield structure, as we pointed out in Section 2, a fact that we have implicitly used in the proof of Proposition 5.5. However, what is more important is not the algebra but the algebraic geometry that arises from our tropical characterization. We will use tropical algebraic geometry to illuminate our understanding of neural networks in the next section.

The need to stay within tropical algebraic geometry is the reason we did not go for a simpler and more general characterization (one that does not require the integer coefficients assumption). A tropical signomial takes the form
$$\varphi(x) = \bigoplus_{i=1}^{r} b_i \odot \bigodot_{j=1}^{d} x_j^{a_{ij}},$$
where $a_{ij} \in \mathbb{R}$ and $b_i \in \mathbb{R} \cup \{-\infty\}$. Note that $a_{ij}$ is required to be neither integer-valued nor nonnegative. A tropical rational signomial is a tropical quotient $\varphi \oslash \psi$ of two tropical signomials $\varphi, \psi$. A tropical rational signomial map is a function $\nu = (\nu_1, \dots, \nu_p) : \mathbb{R}^d \to \mathbb{R}^p$ where each $\nu_i : \mathbb{R}^d \to \mathbb{R}$ is a tropical rational signomial $\nu_i = \varphi_i \oslash \psi_i$. The same argument we used to establish Theorem 5.2 gives us the following.

Proposition 5.6. Every feedforward neural network with ReLU activation is a tropical rational signomial map.

Nevertheless, tropical signomials fall outside the realm of tropical algebraic geometry and we do not use Proposition 5.6 in the rest of this article.

## 6. Tropical Geometry of Neural Networks

Section 5 defines neural networks via tropical algebra, a perspective that allows us to study them via tropical algebraic geometry. We will show that the decision boundary of a neural network is a subset of a tropical hypersurface of a corresponding tropical polynomial (Section 6.1). We will see that, in an appropriate sense, zonotopes form the geometric building blocks for neural networks (Section 6.2).
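A tiny worked example, ours rather than the paper's, makes the equivalence concrete. Consider the one-neuron network $\nu(x_1, x_2) = \max\{x_1 - x_2, 0\}$. Writing $A = (1, -1)$ as $A_+ - A_-$ with $A_+ = (1, 0)$ and $A_- = (0, 1)$, the decomposition of Section 5 gives
$$\nu(x_1, x_2) = \max\{x_1, x_2\} - x_2 = (x_1 \oplus x_2) \oslash x_2,$$
a tropical quotient of the tropical polynomials $f = x_1 \oplus x_2$ and $g = x_2$. Conversely, reading the right-hand side back through assumptions (a)–(c) recovers a ReLU network computing $\nu$, as Theorem 5.4(i) promises.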
We then prove that the geometry of the function represented by a neural network grows vastly more complex as its number of layers increases (Section 6.3).

### 6.1. Decision Boundaries of a Neural Network

We will use tropical geometry and the insights from Section 5 to study the decision boundaries of neural networks, focusing on the case of two-category classification for clarity. As explained in Section 4, a neural network $\nu : \mathbb{R}^d \to \mathbb{R}^p$ together with a choice of score function $s : \mathbb{R}^p \to \mathbb{R}$ gives us a classifier. If the output value $s(\nu(x))$ exceeds some decision threshold $c$, then the neural network predicts that $x$ is from one class (e.g., $x$ is a CAT image); otherwise $x$ is from the other class (e.g., a DOG image). The input space is thereby partitioned into two disjoint subsets by the decision boundary
$$\mathcal{B} := \{x \in \mathbb{R}^d : \nu(x) = s^{-1}(c)\}.$$
Connected regions with value above the threshold and connected regions with value below the threshold will be called the positive regions and negative regions respectively. We provide bounds on the number of positive and negative regions and show that there is a tropical polynomial whose tropical hypersurface contains the decision boundary.

Proposition 6.1 (Tropical geometry of the decision boundary). Let $\nu : \mathbb{R}^d \to \mathbb{R}$ be an $L$-layer neural network satisfying assumptions (a)–(c) with $t^{(L)} = -\infty$. Let the score function $s : \mathbb{R} \to \mathbb{R}$ be injective with decision threshold $c$ in its range. If $\nu = f \oslash g$ where $f$ and $g$ are tropical polynomials, then
(i) its decision boundary $\mathcal{B} = \{x \in \mathbb{R}^d : \nu(x) = s^{-1}(c)\}$ divides $\mathbb{R}^d$ into at most $\mathcal{N}(f)$ connected positive regions and at most $\mathcal{N}(g)$ connected negative regions;
(ii) its decision boundary is contained in the tropical hypersurface of the tropical polynomial $s^{-1}(c) \odot g(x) \oplus f(x) = \max\{f(x),\ g(x) + s^{-1}(c)\}$, i.e.,
$$\mathcal{B} \subseteq \mathcal{T}\big(s^{-1}(c) \odot g \oplus f\big). \tag{3}$$

The function $s^{-1}(c) \odot g \oplus f$ is not necessarily linear on every positive or negative region, and so its tropical hypersurface $\mathcal{T}(s^{-1}(c) \odot g \oplus f)$ may further divide a positive or negative region derived from $\mathcal{B}$ into multiple linear regions. Hence the $\subseteq$ in (3) cannot in general be replaced by $=$.

### 6.2. Zonotopes as Geometric Building Blocks of Neural Networks

From Section 3, we know that the number of regions into which a tropical hypersurface $\mathcal{T}(f)$ divides the space equals the number of vertices in the dual subdivision of the Newton polygon associated with the tropical polynomial $f$. This allows us to bound the number of linear regions of a neural network by bounding the number of vertices in the dual subdivision of the Newton polygon. We start by examining how the geometry changes from one layer to the next in a neural network; more precisely:

Question. How are the tropical hypersurfaces of the tropical polynomials in the $(l+1)$th layer of a neural network related to those in the $l$th layer?

The recurrence (2) describes how the tropical polynomials occurring in the $(l+1)$th layer are obtained from those in the $l$th layer, namely, via three operations: tropical sum, tropical product, and tropical power. Recall that the tropical hypersurface of a tropical polynomial is dual to the dual subdivision of its Newton polytope, which is given by the projection of the upper faces of the polytope defined in (1). Hence the question boils down to how these three operations transform the polytopes, which is addressed in Propositions 3.1 and 3.2.
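As an illustration of how these operations act on polytopes (our own sketch; `scipy` assumed), the polytope of a tropical product is a Minkowski sum and that of a tropical sum is the convex hull of a union, per Proposition 3.2:

```python
import numpy as np
from scipy.spatial import ConvexHull

def minkowski_sum(V1, V2):
    """Vertices of P1 + P2 from vertex arrays V1, V2 (rows are vertices),
    using P1 + P2 = Conv(V(P1) + V(P2)) (Proposition 3.2)."""
    sums = np.array([v1 + v2 for v1 in V1 for v2 in V2])
    return sums[ConvexHull(sums).vertices]

def hull_of_union(V1, V2):
    """Vertices of Conv(V(P(f)) u V(P(g))), i.e. P(f (+) g)."""
    pts = np.vstack([V1, V2])
    return pts[ConvexHull(pts).vertices]

# Two line segments in R^2; their Minkowski sum is a parallelogram,
# the simplest zonotope (cf. the zonotopes of Lemma 6.2, one dimension up).
seg1 = np.array([[0.0, 0.0], [1.0, 0.2]])
seg2 = np.array([[0.0, 0.0], [0.3, 1.0]])
print(minkowski_sum(seg1, seg2))   # 4 vertices of the parallelogram
print(hull_of_union(seg1, seg2))   # hull of the two segments' endpoints
```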
We follow the notations of Proposition 5.1 for the next result.

Lemma 6.2. Let $f^{(l)}_i$, $g^{(l)}_i$, $h^{(l)}_i$ be the tropical polynomials produced by the $i$th node of the $l$th layer of a neural network, i.e., they are defined by (2). Then $\mathcal{P}(f^{(l)}_i)$, $\mathcal{P}(g^{(l)}_i)$, $\mathcal{P}(h^{(l)}_i)$ are subsets of $\mathbb{R}^{d+1}$ given as follows:
(i) $\mathcal{P}(g^{(1)}_i)$ and $\mathcal{P}(h^{(1)}_i)$ are points.
(ii) $\mathcal{P}(f^{(1)}_i)$ is a line segment.
(iii) $\mathcal{P}(g^{(2)}_i)$ and $\mathcal{P}(h^{(2)}_i)$ are zonotopes.
(iv) For $l \geq 1$, $\mathcal{P}(f^{(l)}_i) = \mathrm{Conv}\big(\mathcal{P}(g^{(l)}_i \odot t^{(l)}_i) \cup \mathcal{P}(h^{(l)}_i)\big)$ if $t^{(l)}_i \in \mathbb{R}$, and $\mathcal{P}(f^{(l)}_i) = \mathcal{P}(h^{(l)}_i)$ if $t^{(l)}_i = -\infty$.
(v) For $l \geq 1$, $\mathcal{P}(g^{(l+1)}_i)$ and $\mathcal{P}(h^{(l+1)}_i)$ are weighted Minkowski sums,
$$\mathcal{P}(g^{(l+1)}_i) = \sum_{j=1}^{n_l} a^-_{ij} \mathcal{P}(f^{(l)}_j) + \sum_{j=1}^{n_l} a^+_{ij} \mathcal{P}(g^{(l)}_j), \qquad \mathcal{P}(h^{(l+1)}_i) = \sum_{j=1}^{n_l} a^+_{ij} \mathcal{P}(f^{(l)}_j) + \sum_{j=1}^{n_l} a^-_{ij} \mathcal{P}(g^{(l)}_j) + b_i e,$$
where $a_{ij}$, $b_i$ are entries of the weight matrix $A^{(l+1)} \in \mathbb{Z}^{n_{l+1} \times n_l}$ and bias vector $b^{(l+1)} \in \mathbb{R}^{n_{l+1}}$, and $e := (0, \dots, 0, 1) \in \mathbb{R}^{d+1}$.

A conclusion of Lemma 6.2 is that zonotopes are the building blocks in the tropical geometry of neural networks. Zonotopes are studied extensively in convex geometry and, among other things, are intimately related to hyperplane arrangements (Greene & Zaslavsky, 1983; Guibas et al., 2003; McMullen, 1971; Holtz & Ron, 2011). Lemma 6.2 connects neural networks to this extensive body of work, but its full implication remains to be explored. In Section C.2 of the supplement, we show how one may build these polytopes for a two-layer neural network.

### 6.3. Geometric Complexity of Deep Neural Networks

We apply the tools of Section 3 to study the complexity of a neural network, showing that a deep network is much more expressive than a shallow one. Our measure of complexity is geometric: we will follow (Montufar et al., 2014; Raghu et al., 2017) and use the number of linear regions of a piecewise linear function $\nu : \mathbb{R}^d \to \mathbb{R}^p$ to measure the complexity of $\nu$. We would like to emphasize that our upper bound below does not improve on that obtained in (Raghu et al., 2017); in fact, our version is more restrictive, given that it applies only to neural networks satisfying (a)–(c). Nevertheless, our goal here is to demonstrate how tropical geometry may be used to derive the same bound.

Theorem 6.3. Let $\nu : \mathbb{R}^d \to \mathbb{R}$ be an $L$-layer real-valued feedforward neural network satisfying (a)–(c). Let $t^{(L)} = -\infty$ and $n_l \geq d$ for all $l = 1, \dots, L-1$. Then $\nu = \nu^{(L)}$ has at most
$$\prod_{l=1}^{L-1} \sum_{i=0}^{d} \binom{n_l}{i}$$
linear regions. In particular, if $d \leq n_1, \dots, n_{L-1} \leq n$, the number of linear regions of $\nu$ is bounded by $O\big(n^{d(L-1)}\big)$.

Proof. If $L = 2$, this follows directly from Lemma 6.2 and Corollary 3.4. The case of $L \geq 3$ is in Section D.7 of the supplement.

As was pointed out in (Raghu et al., 2017), this upper bound closely matches the lower bound
$$\Omega\big((n/d)^{(L-1)d} n^d\big)$$
of (Montufar et al., 2014, Corollary 5) when $n_1 = \cdots = n_{L-1} = n \geq d$. Hence we surmise that the number of linear regions of a neural network grows polynomially with the width $n$ and exponentially with the number of layers $L$.

## 7. Conclusion

We argue that feedforward neural networks with rectified linear units are, modulo trivialities, nothing more than tropical rational maps. To understand them we often just need to understand the relevant tropical geometry. In this article, we took a first step to provide a proof-of-concept: questions regarding decision boundaries, linear regions, how depth affects expressiveness, etc., can be translated into questions involving tropical hypersurfaces, dual subdivisions of Newton polygons, polytopes constructed from zonotopes, etc. As a new branch of algebraic geometry, the novelty of tropical geometry stems from both the algebra and the geometry, as well as the interplay between them.
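The bound of Theorem 6.3 is easy to evaluate. A small sketch (our own code; the helper name is hypothetical) tabulates $\prod_{l=1}^{L-1} \sum_{i=0}^{d} \binom{n_l}{i}$ and shows the polynomial-in-width, exponential-in-depth growth:

```python
from math import comb

def linear_region_bound(d: int, widths: list[int]) -> int:
    """Theorem 6.3: upper bound on the number of linear regions of an
    L-layer ReLU network with input dimension d and hidden widths n_1..n_{L-1}."""
    bound = 1
    for n_l in widths:
        bound *= sum(comb(n_l, i) for i in range(d + 1))
    return bound

d = 2
print(linear_region_bound(d, [10]))          # one hidden layer:    56
print(linear_region_bound(d, [10, 10]))      # two hidden layers:   3136
print(linear_region_bound(d, [10, 10, 10]))  # three hidden layers: 175616
# Each added layer multiplies the bound: O(n^{d(L-1)}).
```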
It has connections to many other areas of mathematics. Among other things, there is a tropical analogue of linear algebra (Butkovič, 2010) and a tropical analogue of convex geometry (Gaubert & Katz, 2006). We cannot emphasize enough that we have only touched on a small part of this rich subject. We hope that further investigation from this tropical angle might perhaps unravel other mysteries of deep neural networks.

## Acknowledgments

The authors thank Ralph Morrison, Yang Qi, Bernd Sturmfels, and the anonymous referees for their very helpful comments. The work in this article is generously supported by DARPA D15AP00109, NSF IIS 1546413, the Eckhardt Faculty Fund, and a DARPA Director's Fellowship.

## References

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. International Conference on Learning Representations, 2018.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.

Bengio, Y. and Delalleau, O. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory, pp. 18–36. Springer, 2011.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Butkovič, P. Max-linear systems: theory and algorithms. Springer Science & Business Media, 2010.

Delalleau, O. and Bengio, Y. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pp. 666–674, 2011.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907–940, 2016.

Gaubert, S. and Katz, R. Max-plus convex geometry. In International Conference on Relational Methods in Computer Science. Springer, 2006.

Greene, C. and Zaslavsky, T. On the interpretation of Whitney numbers through arrangements of hyperplanes, zonotopes, non-Radon partitions, and orientations of graphs. Transactions of the American Mathematical Society, 280(1):97–126, 1983.

Gritzmann, P. and Sturmfels, B. Minkowski addition of polytopes: computational complexity and applications to Gröbner bases. SIAM Journal on Discrete Mathematics, 6(2):246–269, 1993.

Guibas, L. J., Nguyen, A., and Zhang, L. Zonotopes as bounding volumes. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 803–812. SIAM, 2003.

Hartman, P. On functions representable as a difference of convex functions. Pacific Journal of Mathematics, 9(3):707–713, 1959.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Holtz, O. and Ron, A. Zonotopal algebra. Advances in Mathematics, 227(2):847–894, 2011.

Itenberg, I., Mikhalkin, G., and Shustin, E. I. Tropical algebraic geometry, volume 35. Springer Science & Business Media, 2009.

Kalchbrenner, N. and Blunsom, P. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709, 2013.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Maclagan, D. and Sturmfels, B. Introduction to tropical geometry, volume 161. American Mathematical Society, 2015.
McMullen, P. On zonotopes. Transactions of the American Mathematical Society, 159:91–109, 1971.

Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pp. 2924–2932, 2014.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In International Conference on Machine Learning, pp. 2847–2854, 2017.

Tao, P. D. and Hoai An, L. T. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133(1–4):23–46, 2005.

Telgarsky, M. Benefits of depth in neural networks. In Conference on Learning Theory, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv:1611.03530, 2016.