Unitary Branching Programs: Learnability and Lower Bounds

Fidel Ernesto Díaz Andino¹  Maria Kokkou²  Mateus de Oliveira Oliveira³  Farhad Vadiee³

¹University of São Paulo, São Paulo, Brazil. ²Chalmers University of Technology, Gothenburg, Sweden. ³University of Bergen, Bergen, Norway. Correspondence to: Mateus de Oliveira Oliveira.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Bounded-width branching programs are a formalism that can be used to capture the notion of non-uniform constant-space computation. In this work, we study a generalized version of bounded-width branching programs where instructions are defined by unitary matrices of bounded dimension. We introduce a new learning framework for these branching programs that leverages a combination of local search techniques with gradient descent over Riemannian manifolds. We also show that gapped, read-once branching programs of bounded dimension can be learned with a polynomial number of queries in the presence of a teacher. Finally, we provide explicit near-quadratic size lower bounds for bounded-dimension unitary branching programs, and exponential size lower bounds for bounded-dimension read-once gapped unitary branching programs. The first lower bound is proven using a combination of Nečiporuk's lower bound technique with classic results from algebraic geometry. The second lower bound is proven within the framework of communication complexity theory.

1. Introduction

Bounded-width branching programs, also known as non-uniform finite automata, may be regarded as a model of computation that generalizes the notion of constant-space computation to the non-uniform setting (Nakanishi et al., 2000; Barrington, 1989; Ergün et al., 1995). A celebrated result due to Barrington states that any Boolean function f : {0,1}^n → {0,1} that can be computed by a Boolean circuit of depth d can also be computed by a width-5 branching program of length 4^d (Barrington, 1989). As a consequence, functions that can be computed by logarithmic-depth Boolean circuits can be computed by width-5 branching programs of polynomial length.

The problem of constructing branching programs consistent with a given dataset has been studied under a large variety of paradigms (Bergadano et al., 1997; Mansour & McAllester, 2002; Ergün et al., 1995; Raghavan & Wilkins, 1993; Bshouty et al., 1998). In this work, we address the problem of constructing bounded-width branching programs consistent with a given input dataset from the perspective of continuous optimization theory. We formalize branching programs according to Barrington's M-program model, where branching programs are viewed as a sequence of tuples of elements of a monoid M (i.e., a semigroup with identity) (Bédard et al., 1993; Cleve, 1991; Caussinus, 1996; Barrington, 1989). This point of view allows us to consider branching programs over matrix groups, and in particular over the unitary group U(k), i.e., the group of k × k unitary matrices with matrix multiplication as the operation. For each fixed k ∈ N, the unitary group U(k) is a compact and connected topological group that has the structure of a Riemannian manifold, known as the complex Stiefel manifold (Adams & Walker, 1965).
This allows us to view the problem of searching for a branching program consistent with a given dataset as the problem of minimizing a function with unitary constraints (Abrudan et al., 2008; 2009). Crucially, such problems can be reformulated as minimization problems over the Stiefel manifold, and therefore one can leverage the machinery of Riemannian gradient descent to search for optimal solutions. Still, when the number of variables becomes too large (meaning a few hundred), state-of-the-art tools that implement Riemannian gradient descent algorithms become impractical. To mitigate this computational intractability, we combine Riemannian gradient descent with a simple local optimization technique inspired by the paradigm of waveform relaxation (Lelarasmee et al., 1982; Janssen & Vandewalle, 1997; Crow & Ilic, 1990). More precisely, we start by fixing the values of a large number of variables in the system. Subsequently, we compute an optimal solution with respect to the remaining variables. At this point, the values of the optimized variables are fixed, and a new small set of variables to be optimized is selected. The process is repeated until no improvement is observed. Interestingly, this algorithm works surprisingly well for the task of learning unitary branching programs consistent with sparse unstructured datasets. More precisely, our heuristic was able to learn read-once unitary branching programs of dimension 3 consistent with datasets containing n positive and n negative binary strings, each of length n, with near 0% error. In our experiments, n was chosen from the set {16, 32, ..., 1024}, and the time for learning the branching programs varied from a few seconds (for n = 16) to about 72 hours (for n = 1024). It is worth noting that the time to learn these datasets with 10% error was substantially smaller (less than 4 hours).

Besides our learning framework, we also study connections between unitary branching programs and the theory of learning from a minimally adequate teacher. In particular, we show that gapped, read-once branching programs of bounded dimension can be learned with a polynomial number of queries in the presence of a teacher. Finally, we provide explicit near-quadratic size lower bounds for bounded-dimension unitary branching programs, and exponential size lower bounds for bounded-dimension read-once gapped unitary branching programs. The first lower bound is proven using a combination of Nečiporuk's lower bound technique with classic results from algebraic geometry. The second lower bound is proven within the framework of communication complexity theory.

2. Unitary Branching Programs

We denote by N, R and C the sets of natural numbers, real numbers and complex numbers, respectively. The set of positive natural numbers is denoted by N+. For each n ∈ N+, we let [n] = {1, ..., n} and ⟦n⟧ = {0, ..., n−1}. Given a vector V, we may write V[i] to denote the i-th entry of V. A semigroup is a nonempty set G endowed with an associative binary operation of type G × G → G. If a and b are elements of G, then we denote by a · b the result of applying the semigroup operation to the ordered pair (a, b).

Branching Program. Let G be a semigroup and s ∈ N+. An (s, G)-instruction is a tuple I = (A_0, ..., A_{s-1}) of elements from G.
An n-input branching program over G of length ℓ, arity s, and class size c is a tuple B = (J, I, C) where J = (j_1, ..., j_ℓ) is a sequence of indices from [n], I = (I_1, ..., I_ℓ) is a sequence of (s, G)-instructions, and C = (C_0, ..., C_{c-1}) is a sequence of pairwise disjoint subsets of G. Given a string W ∈ ⟦s⟧^n, the value of B on W is defined as follows.

    value(B, W) = I_1[W[j_1]] · I_2[W[j_2]] · ... · I_ℓ[W[j_ℓ]].   (1)

Intuitively, value(B, W) is obtained by multiplying a sequence of ℓ elements from the semigroup G. For each r ∈ [ℓ], the r-th factor of this multiplication is the W[j_r]-th element of the instruction I_r. We say that a branching program B computes a function f : ⟦s⟧^n → ⟦c⟧ if for each string W ∈ ⟦s⟧^n, the element value(B, W) belongs to the set C_{f(W)}.

A celebrated complexity-theoretic result due to Barrington states that if an n-input Boolean function f : {0,1}^n → {0,1} can be computed by a fan-in-2 Boolean circuit of depth d, then f can also be computed by an n-input branching program over the symmetric group S_5 of length 4^d (Barrington, 1989). This implies that Boolean functions that can be computed by circuits of logarithmic depth can be computed by polynomial-size branching programs over the group S_5. The importance of Barrington's theorem stems from the fact that many interesting Boolean functions can be computed by polynomial-size circuits of logarithmic depth, including arithmetic functions such as addition, multiplication, powering and division (Chiu et al., 2001; Beame et al., 1986), several cryptographic functions (Viola, 2009) and a large number of combinatorial problems (Elberfeld et al., 2012).

2.1. Unitary Branching Programs

Our main object of study is the notion of a unitary branching program. Intuitively, unitary branching programs are precisely the branching programs defined above, except that we need to be more specific about the way we partition the unitary group U(k) into classes. We choose a standard approach. Namely, if a branching program B has class size c, then we fix c representative unitary matrices X_0, ..., X_{c-1} and say that an input string W belongs to class i if the unitary matrix value(B, W) is closer to X_i than to any other representative matrix. In case of ties, we choose the class of smallest index.

Distance Between Matrices. Given complex vectors x and y, we let ⟨x, y⟩ = Σ_i x_i ȳ_i denote the inner product of x and y. Given unitary matrices U with columns u_1, ..., u_k, and W with columns v_1, ..., v_k, we define the distance between U and W as

    MD(U, W) = Σ_{i=1}^{k} (1 − ⟨u_i, v_i⟩) · conj(1 − ⟨u_i, v_i⟩) = Σ_{i=1}^{k} |1 − ⟨u_i, v_i⟩|².

Please note that MD(U, W) is always a non-negative real value. This notion of distance will be used for two purposes: first, to define a notion of distance for unitary branching programs; second, to partition the space of unitary matrices into a finite number of regions.

Partitioning the Unitary Group U(k). Given a tuple ζ = (X_0, ..., X_{c-1}) of pairwise distinct k × k unitary matrices, we partition the space U(k) into c regions C_ζ = (C_0, ..., C_{c-1}) as follows. For each matrix Y ∈ U(k), we let Y belong to C_i if i is the minimum index in ⟦c⟧ with the property that MD(Y, X_i) ≤ MD(Y, X_j) for every j ∈ ⟦c⟧. Intuitively, a matrix Y belongs to C_i if X_i is the closest matrix to Y among the matrices in the sequence ζ. If this closest distance is achieved by more than one matrix in ζ, then Y is added to the cell of the partition corresponding to the matrix X_i of smallest index.
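To make these definitions concrete, the following minimal NumPy sketch evaluates a branching program on an input string via Equation (1) and assigns it to the class of the nearest representative under MD. The function and variable names are our own illustration, not from the paper.

```python
import numpy as np

def md(U, W):
    """Column-wise distance MD(U, W) = sum_i |1 - <u_i, v_i>|^2."""
    inner = np.einsum("ij,ij->j", U, W.conj())  # <u_i, v_i> for each column
    return float(np.sum(np.abs(1 - inner) ** 2))

def value(instructions, indices, w):
    """Equation (1): multiply the matrices selected by the input string w.

    instructions[r] is a tuple (A_0, ..., A_{s-1}) of k x k unitaries,
    and indices[r] = j_r tells which symbol of w instruction r reads.
    """
    k = instructions[0][0].shape[0]
    out = np.eye(k, dtype=complex)
    for I, j in zip(instructions, indices):
        out = out @ I[w[j]]
    return out

def classify(instructions, indices, reps, w):
    """Class of w: index of the nearest representative (ties -> smallest index)."""
    V = value(instructions, indices, w)
    dists = [md(V, X) for X in reps]
    return int(np.argmin(dists))  # argmin returns the smallest index on ties
```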
Unitary Branching Programs. A k-dimensional unitary branching program is a branching program B = (J, I, C) over the group U(k), where C = (C_0, ..., C_{c-1}) is the partition of U(k) induced by some tuple ζ = (X_0, ..., X_{c-1}) of pairwise distinct unitary matrices. We say that ζ is the class sequence of B.

3. Learning Unitary Branching Programs Using Gradient Descent

In this section, we introduce a new heuristic for learning unitary branching programs consistent with a given dataset. The idea is to view the search for a separating branching program as a continuous optimization problem and to leverage the power of gradient descent algorithms.

Unitary matrices enjoy several properties that are relevant in our context. First, the set of unitary matrices of a given dimension forms a group, and is therefore closed under multiplication. Second, the determinant of a unitary matrix is a complex number of norm 1. This is important for stability reasons, since computing with branching programs requires the multiplication of a long sequence of matrices. Third, the permutation matrices form a subgroup of the unitary matrices, and therefore the machinery from Barrington's theorem can be simulated by branching programs over the unitary group. Fourth, the unitary group U(k) is endowed with the relative topology as a subset of X(k, C), the set of all k × k complex matrices. Notice that X(k, C) is homeomorphic to a Euclidean space. As a topological space, U(k) is both compact and connected. Additionally, U(k) has the structure of a Riemannian manifold called the complex Stiefel manifold. Therefore, instead of considering the problem of finding an optimal branching program as an optimization problem with unitary constraints over Euclidean space, we will use Riemannian gradient descent algorithms to view the learning problem for branching programs as an unconstrained optimization problem over the complex Stiefel manifold.

We note that our choice of the unitary group over the orthogonal group stems from the fact that the former is connected while the latter is not. In particular, for each m ∈ N, the space formed by the Cartesian product of m copies of the unitary group still has a single connected component, while the product of m copies of the orthogonal group has 2^m connected components. Note also that the special orthogonal group, the subgroup of the orthogonal group containing the matrices with determinant equal to 1, is indeed connected. Nevertheless, restricting instructions to such matrices may restrict the power of the branching program since, for instance, permutation matrices corresponding to odd permutations do not belong to this group.

Still, given the high dimensionality of our data, simply formulating the problem as an optimization problem on Riemannian manifolds turns out to be intractable for current state-of-the-art optimization tools. To counter this drawback, we follow an approach similar to the one used in the context of waveform relaxation algorithms. We start by guessing a random solution and by using this random solution to fix the values of all but a small fraction of the variables. Subsequently, we apply the Riemannian optimization algorithm only to this small fraction of variables. Once this optimum has been achieved, we select a new small fraction of variables to optimize. We keep repeating this process until time is up or until we have reached a solution with good enough accuracy. In our case, the variables chosen to be optimized at each step are either those corresponding to a small set of instructions of the unitary branching program, or those corresponding to the class matrices.
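As an illustration of the kind of update performed when optimizing over the unitary group, the sketch below takes a gradient step in the ambient space and retracts back onto U(k) via the polar decomposition. This is a simplified stand-in for the Riemannian gradient descent provided by toolboxes such as Manopt or ROPTLIB, not the paper's actual implementation; the function names and learning rate are our own.

```python
import numpy as np

def polar_retract(A):
    """Map an arbitrary k x k matrix to the nearest unitary matrix.

    If A = P diag(s) Q* is the SVD, the unitary factor of the polar
    decomposition is P Q*, the closest unitary in Frobenius norm.
    """
    P, _, Qh = np.linalg.svd(A)
    return P @ Qh

def unitary_descent_step(U, descent_direction, lr=0.1):
    """One descent step that stays on the unitary group U(k)."""
    return polar_retract(U - lr * descent_direction)

# Tiny usage example: pull a random unitary towards the identity by
# descending the loss ||U - I||_F^2 along the direction U - I.
k = 3
U = polar_retract(np.random.randn(k, k) + 1j * np.random.randn(k, k))
for _ in range(100):
    U = unitary_descent_step(U, U - np.eye(k))
assert np.allclose(U.conj().T @ U, np.eye(k), atol=1e-8)  # still unitary
```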
Datasets. An n-input dataset of arity s and class size c is a tuple D = (D_0, ..., D_{c-1}) where, for each i ∈ ⟦c⟧, D_i is a subset of ⟦s⟧^n, and for each i ≠ j in ⟦c⟧, D_i ∩ D_j = ∅. We say that an n-input branching program B of arity s and class size c is consistent with D if for each i ∈ ⟦c⟧ and each string W ∈ D_i, the matrix value(B, W) belongs to C_i.

Discrete Distance. It will be convenient to define a measure of how well a branching program approximates a dataset D. We define the discrete distance between B and D as the value

    DD(B, D) = max_{i ∈ ⟦c⟧} |{W ∈ D_i : value(B, W) ∉ C_i}| / |D_i|.

In other words, for each class i we consider the ratio between the number of strings in D_i that were incorrectly classified by the branching program and the size of D_i. The distance between B and D is the maximum of these ratios. Note that if the branching program is consistent with the dataset then the distance is zero, while if the branching program makes a mistake on every string belonging to some class, then the distance is 1.

Continuous Distance. For purposes of optimization, we also define a continuous real-valued distance function between unitary branching programs and datasets. This will be the function optimized by our Riemannian gradient descent algorithm.

    CD(B, D) = Σ_{i ∈ ⟦c⟧} Σ_{W ∈ D_i} MD(value(B, W), X_i).   (2)

Learning Unitary Branching Programs. The problem of learning a k-dimensional unitary branching program B compatible with a given n-input dataset D can be formulated as a continuous optimization problem with unitary constraints. More precisely, given a sequence J = (j_1, ..., j_l) of indices from [n] and an n-input dataset D of arity s and class size c, the goal is to find sequences¹ I = (A^1_0, ..., A^1_{s-1}, ..., A^l_0, ..., A^l_{s-1}) and ζ = (X_0, ..., X_{c-1}) of k-dimensional unitary matrices such that the branching program B = (J, I, C_ζ) minimizes the continuous distance function CD(B, D).

¹For each i ∈ {1, ..., l}, (A^i_0, ..., A^i_{s-1}) are the matrices corresponding to the i-th instruction of the branching program.
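The two distances are straightforward to compute once value(B, W) is available. The sketch below, our own illustrative code building on the helpers from the earlier sketch, makes the definitions of DD and CD executable.

```python
# Assumes md(U, W), value(instructions, indices, w) and
# classify(instructions, indices, reps, w) from the previous sketch.

def discrete_distance(instructions, indices, reps, dataset):
    """DD(B, D): worst per-class fraction of misclassified strings."""
    worst = 0.0
    for i, D_i in enumerate(dataset):        # dataset = (D_0, ..., D_{c-1})
        errors = sum(1 for w in D_i
                     if classify(instructions, indices, reps, w) != i)
        worst = max(worst, errors / len(D_i))
    return worst

def continuous_distance(instructions, indices, reps, dataset):
    """CD(B, D): sum over classes i and strings W in D_i of MD(value(B, W), X_i)."""
    return sum(md(value(instructions, indices, w), reps[i])
               for i, D_i in enumerate(dataset) for w in D_i)
```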
For each k, the unitary group U(k) has the structure of a manifold, known as the k × k complex Stiefel manifold. This allows one to reformulate an optimization problem with a single k × k unitary constraint as an optimization problem over the k × k complex Stiefel manifold, comprising k² complex variables (Absil et al., 2009). More generally, a problem with several k × k unitary constraints can be reformulated as an optimization problem over a Cartesian product of Stiefel manifolds. In our case, this amounts to solving an optimization problem with k² · (l·s + c) complex variables over the manifold U(k)^{l·s+c}, since we have l·s + c constraints over U(k). More precisely, this corresponds to optimizing over the unitary matrices A^1_0, ..., A^l_{s-1} defining the instructions of the branching program and the matrices X_0, ..., X_{c-1} defining representatives for the partition of the space U(k) used by the branching program.

Optimization problems over complex Stiefel manifolds, and many other well-studied manifolds, can be easily specified using open-source toolboxes such as Manopt (Boumal et al., 2014) or ROPTLIB (Huang et al., 2018). In particular, our optimization problem can be easily specified in Matlab in conjunction with Manopt, or in C++ in conjunction with ROPTLIB. The problem is that a direct specification in this setting does not scale well with the length of the strings in the dataset. In particular, the task of learning read-once unitary branching programs over the group U(2) compatible with randomly generated datasets consisting of 64 positive strings and 64 negative strings, each with 64 bits, turned out to be impractical in this direct setting. Note that this corresponds to solving an optimization problem with 520 complex variables² over the product manifold U(2)^130.

²In this case, k = 2, l = 64, s = 2 and c = 2. The corresponding optimization problem to be solved has k² · (l·s + c) = 520 variables.

To circumvent this difficulty, we perform two crucial improvements. First, instead of applying the optimization directly to the whole product manifold U(k)^{l·s+c}, we combine a local optimization paradigm with a sliding-window protocol to optimize the problem in small parts. Second, we use pre-computation to significantly reduce the time required to evaluate the function being optimized at each step. We describe both steps below.

3.1. Local Optimization

As mentioned above, we split the global optimization problem of finding a branching program that minimizes the objective function CD(B, D) into a succession of local optimization problems over manifolds of much smaller dimension. This process consists of two parts: an instruction optimization part and a class representative optimization part. We describe these two parts in more detail below.

Window of Instructions Optimization. In the first part, given a fixed window-size parameter w, an initial position i, and an initial branching program B, we construct a branching program B′ by keeping all matrices fixed, except for the matrices A^i_0, ..., A^i_{s-1}, ..., A^{i+w-1}_0, ..., A^{i+w-1}_{s-1} belonging to instructions I_i, ..., I_{i+w-1}. In this sense, at each step, we solve an unconstrained optimization problem with k² · s · w variables over the manifold U(k)^{s·w}. Note that the window size is meant to be small, and can even be set to 1, meaning that a single instruction (containing s unitary matrices) is optimized at a time.

Class Representative Optimization. In the second part, given an initial branching program B, we construct a branching program B′ by keeping all matrices fixed, except for the matrices ζ = (X_0, ..., X_{c-1}) which define the class representatives of the branching program. The intuition is that fixing a sequence of class representatives a priori may affect the ability of a branching program to learn a dataset. Since this effect is very hard to predict, the best option is to also leave the task of learning the class representatives to the learning algorithm.

Combining Both Optimization Procedures. In essence, the algorithm works as follows. We start by sampling all matrices of the branching program, including the instruction matrices and the class representative matrices, at random. Subsequently, we make a round of instruction optimizations. This round consists of l − w + 1 steps, where at step i we optimize the matrices corresponding to instructions I_i, ..., I_{i+w-1} as described above. When this round is finished, we perform the class representative optimization. We iterate this process until the target accuracy has been achieved, or until time is up.

The problem with applying this approach without further modification is that at each call of the objective function CD(B, D), one needs to evaluate the branching program on each string of the dataset. In total, these evaluations take time O(k³ · n · m), where m is the number of strings in the dataset, n is the length of each string, and k is the dimension of the unitary matrices in the branching program.
3.2. Pre-computation

We can significantly reduce the time spent per call of the continuous distance function by using pre-computation. Given a dataset D = (D_0, ..., D_{c-1}), a pre-computation for D is a tuple (f_0, ..., f_{c-1}) of functions, where for each a ∈ ⟦c⟧, f_a is a function from [|D_a|] to U(k).

Given a dataset D = (D_0, ..., D_{c-1}), a k-dimensional unitary branching program B with index sequence J = (j_1, ..., j_l), and a non-negative integer p ∈ {0, ..., l}, we let Left_D(B, p) = (f_0, ..., f_{c-1}) where for each a ∈ ⟦c⟧, f_a : [|D_a|] → U(k) is the function defined as follows for each z ∈ [|D_a|]:

    f_a(z) = J_k                                          if p = 0,
    f_a(z) = I_1[D_a[z][j_1]] · ... · I_p[D_a[z][j_p]]    otherwise.

In other words, f_a(z) is the unitary in U(k) obtained by applying instructions I_1, ..., I_p to the z-th string of D_a. In case p = 0, f_a(z) is the identity matrix J_k, which intuitively corresponds to applying no instruction at all.

Analogously, given a dataset D and a branching program B as above and a non-negative integer p ∈ {1, ..., l+1}, we define Right_D(B, p) = (g_0, ..., g_{c-1}) where for each a ∈ ⟦c⟧, g_a : [|D_a|] → U(k) is the function defined as follows for each z ∈ [|D_a|]:

    g_a(z) = J_k                                          if p = l + 1,
    g_a(z) = I_p[D_a[z][j_p]] · ... · I_l[D_a[z][j_l]]    otherwise.

In other words, g_a(z) is the unitary in U(k) obtained by applying instructions I_p, ..., I_l to the z-th string of D_a. In case p = l + 1, g_a(z) is the identity matrix J_k, which intuitively corresponds to applying no instruction at all.

Finally, given a dataset D and a branching program B as above and positive integers p, p′ ∈ {1, ..., l} with p < p′, we define Middle_D(B, p, p′) = (h_0, ..., h_{c-1}) where for each a ∈ ⟦c⟧, h_a : [|D_a|] → U(k) is the function defined as follows for each z ∈ [|D_a|]:

    h_a(z) = I_p[D_a[z][j_p]] · ... · I_{p′}[D_a[z][j_{p′}]].

In other words, h_a(z) is the unitary in U(k) obtained by applying instructions I_p, ..., I_{p′} to the z-th string in D_a. With the notation above, we have the following observation.

Observation 1. Let D = (D_0, ..., D_{c-1}) be a dataset, B be a branching program of length l, and p < p′ be positions in [l]. Let Left_D(B, p−1) = (f_0, ..., f_{c-1}), Middle_D(B, p, p′) = (h_0, ..., h_{c-1}), and Right_D(B, p′+1) = (g_0, ..., g_{c-1}). Then, for each a ∈ ⟦c⟧ and each z ∈ [|D_a|], we have that

    value(B, D_a[z]) = f_a(z) · h_a(z) · g_a(z).

Let D = (D_0, ..., D_{c-1}) be a dataset, and B be a k-dimensional branching program of length l. Let L = (f_0, ..., f_{c-1}) and R = (g_0, ..., g_{c-1}) be pre-computations for D, and p and w be positive integers. Let FD(B, D, L, R, p, w) be the real number obtained from the following sum:

    FD(B, D, L, R, p, w) = Σ_{a ∈ ⟦c⟧} Σ_{z ∈ [|D_a|]} MD(f_a(z) · h_a(z) · g_a(z), X_a),

where (h_0, ..., h_{c-1}) = Middle_D(B, p, p + w − 1). Then, from Observation 1, we have the following proposition stating that FD(B, D, L, R, p, w) can be used to compute the continuous distance between a branching program and a dataset.

Proposition 2. Let D be a dataset, B be a branching program of length l, and p and w be positive integers such that p + w − 1 ≤ l. Let L = Left_D(B, p−1) and R = Right_D(B, p+w). Then FD(B, D, L, R, p, w) = CD(B, D).
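A minimal sketch of this factorization (our own illustrative code): pre-compute, for every string of a class, the product of the instructions to the left and to the right of the current window, so that re-evaluating the objective while only the window's matrices change requires multiplying just three matrices per string.

```python
import numpy as np

def step(instructions, indices, w, r):
    """The r-th factor of Equation (1), i.e., I_r[w[j_r]] (0-based r)."""
    return instructions[r][w[indices[r]]]

def left_products(instructions, indices, strings, p):
    """f(z) = I_1[...] ... I_p[...] for each string z (identity when p = 0)."""
    k = instructions[0][0].shape[0]
    out = []
    for wstr in strings:
        f = np.eye(k, dtype=complex)
        for r in range(p):
            f = f @ step(instructions, indices, wstr, r)
        out.append(f)
    return out

def right_products(instructions, indices, strings, p):
    """g(z) = I_p[...] ... I_l[...] for each string z (identity when p = l + 1)."""
    k = instructions[0][0].shape[0]
    l = len(instructions)
    out = []
    for wstr in strings:
        g = np.eye(k, dtype=complex)
        for r in range(p - 1, l):   # p is 1-based, as in the text
            g = g @ step(instructions, indices, wstr, r)
        out.append(g)
    return out

# Observation 1: for a window [p, p'], value(B, W) = f(z) @ h(z) @ g(z),
# where h(z) multiplies only the w instructions inside the window, so each
# objective call costs O(k^3 * w * m) instead of O(k^3 * n * m).
```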
The reason why we use FD(B, D, L, R, p, w) to compute the continuous distance CD(B, D) between a branching program and a dataset is that the former is much more efficient than the latter. More precisely, computing CD(B, D) directly takes time O(k³ · n · m), where m is the number of strings in the dataset, n is the length of each string, and k is the dimension of the unitary matrices in the branching program. On the other hand, computing this value using FD(B, D, L, R, p, w) takes time O(k³ · w · m). Since k and w will be very small in our applications, this computation takes essentially linear time in the number of strings of the dataset, provided that the pre-computations L and R have already been performed. It turns out that, in the process of optimizing the matrices belonging to instructions in a given window, the Riemannian gradient descent algorithm needs to call the function FD(B, D, L, R, p, w) many times until it stabilizes, while the pre-computations L and R are kept fixed during the whole process. This represents a huge saving of time, and allows us to deal with datasets containing strings with thousands of bits.

In Algorithm 1, we depict the pseudo-code of our algorithm. In the pseudo-code, for given positive integers k and r, we let Sample(U(k)^r) be the function that samples a tuple containing r k × k unitary matrices at random. We let OptWindow(B, D, L, R, p, w) be the function that applies the Riemannian gradient descent method to optimize the matrices corresponding to instructions I_p, ..., I_{p+w-1}. Internally, this function makes calls to the function FD(B, D, L, R, p, w), and the pre-computations L and R are obtained before calling this function. Finally, OptClasses(B, D, M) is the function that optimizes the class representatives. Intuitively, M is the pre-computation corresponding to applying all instructions of the branching program. Internally, this optimization function calls the pre-computed function F̂D(B, D, M) which, when given a branching program B, a dataset D and a pre-computation M = (h_0, ..., h_{c-1}), outputs the value

    F̂D(B, D, M) = Σ_{a ∈ ⟦c⟧} Σ_{z ∈ [|D_a|]} MD(h_a(z), X_a).

This function can also be used to compute the continuous distance function between a branching program and a dataset, as stated in the following proposition.

Proposition 3. Let D be a dataset and B be a branching program of length l. Let M = Middle_D(B, 1, l). Then F̂D(B, D, M) = CD(B, D).

Algorithm 1: Sliding Window Unitary Learning.
    Data: An n-input dataset D of arity s and class size c; an index sequence J; a dimension parameter k; a window-size parameter w; an accuracy parameter α; a maximum number of iterations Max.
    Result: A branching program B with DD(B, D) ≤ α, or the branching program obtained after executing iteration Max.

    I ← Sample(U(k)^{l·s})
    ζ ← Sample(U(k)^c)
    B ← (J, I, C_ζ)
    t ← 0 ; total number of iterations.
    p ← 1 ; initial position of the window.
    MinDiscDist ← DD(B, D) ; smallest discrete distance obtained so far.
    while MinDiscDist > α and t < Max do
        L ← Left_D(B, p−1)
        R ← Right_D(B, p+w)
        B ← OptWindow(B, D, L, R, p, w)
        M ← Middle_D(B, 1, l)
        B ← OptClasses(B, D, M)
        if DD(B, D) < MinDiscDist then
            MinDiscDist ← DD(B, D)
        end
        if p + w ≤ l then
            p ← p + 1
        else
            p ← 1
        end
        t ← t + 1
    end
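The sketch below mirrors Algorithm 1's control flow in Python. The optimizer calls are stubs: opt_window and opt_classes stand in for the Riemannian gradient descent subroutines (e.g., Manopt-style solvers), and all names and signatures here are our own hypothetical choices.

```python
# A structural sketch of Algorithm 1 (sliding-window unitary learning).
# left_products / right_products follow the earlier pre-computation sketch,
# and discrete_distance follows the definition of DD(B, D).

def sliding_window_learning(program, dataset, w, alpha, max_iters,
                            opt_window, opt_classes,
                            left_products, right_products, discrete_distance):
    l = program.length                 # number of instructions
    p, t = 1, 0                        # window position (1-based), iteration count
    best = discrete_distance(program, dataset)
    while best > alpha and t < max_iters:
        L = left_products(program, dataset, p - 1)   # prefix products, kept fixed
        R = right_products(program, dataset, p + w)  # suffix products, kept fixed
        program = opt_window(program, dataset, L, R, p, w)
        M = left_products(program, dataset, l)       # all instructions applied
        program = opt_classes(program, dataset, M)
        best = min(best, discrete_distance(program, dataset))
        p = p + 1 if p + w <= l else 1               # slide or wrap the window
        t += 1
    return program
```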
4. A Near-Quadratic Size Lower Bound for Unitary Branching Programs

The task of proving superlinear lower bounds on the size of Boolean circuits for explicit families of functions (i.e., functions in NP) is still one of the major unsolved open problems in computational complexity theory (Find et al., 2015; Iwama & Morizumi, 2002; Lachish & Raz, 2001). This task is even more tantalizing when considering non-Boolean models of computation. Therefore, developing techniques to prove super-linear lower bounds for distinct models of computation has been an active line of research (Nečiporuk, 1966; Turán & Vatan, 1997; Roychowdhury & Vatan, 2001). In particular, near-quadratic lower bounds have been obtained for Boolean formulas (Nečiporuk, 1966), quantum formulas (Roychowdhury & Vatan, 2001), and more generally for circuits of bounded treewidth (de Oliveira Oliveira, 2016). These lower bounds rely on a framework called Nečiporuk's method. In this section, we combine Nečiporuk's method with techniques from algebraic geometry to prove a near-quadratic lower bound on the number of instructions in a unitary branching program computing the element distinctness function (Theorem 4). We note that our lower bound assumes no restriction at all on the number of bits necessary to represent the entries of the unitary matrices belonging to the branching program.

Let X be a set of variables. A Boolean assignment for X is a function of the form α : X → {0, 1}. We denote by {0,1}^X the set of all Boolean assignments for X. A Boolean function over X is a function f : {0,1}^X → {0, 1}.

Let X = {x_1, ..., x_n} be a set of n = 2q·log q distinct variables partitioned into q blocks Y_1, Y_2, ..., Y_q, where each block Y_i has 2·log q variables. The element distinctness function δ_n : {0,1}^X → {0,1} is defined as follows for each assignment s_1, s_2, ..., s_q of the blocks Y_1, Y_2, ..., Y_q, respectively:

    δ_n(s_1, s_2, ..., s_q) = 1 if s_i ≠ s_j for all i ≠ j, and 0 otherwise.   (3)

Theorem 4. Let X be a set with n Boolean variables, and let δ_n : {0,1}^X → {0,1} be the n-bit element distinctness function. Let B be a unitary branching program of dimension k and length l computing δ_n. Then l = Ω(n² / (k² log n)).

5. Gapped Read-Once Unitary Branching Programs

Let B = (J, I, C) be a unitary branching program with class sequence ζ = (X_0, ..., X_{c-1}). We say that B is read-once if the indices in J are pairwise distinct. In other words, each symbol of the input string is read at most once. Let δ be a positive real number between 0 and 1. We say that B is δ-gapped if for all distinct i, j in {0, ..., c−1},

    |MD(X_i, value(B, W)) − MD(X_j, value(B, W))| > δ for each W ∈ ⟦s⟧^n.

An n-input 2-class dataset is a dataset D = (D_0, D_1) with two classes D_0, D_1 ⊆ ⟦s⟧^n for some s. The next theorem states that for each fixed k ∈ N+, any δ-gapped read-once unitary branching program consistent with a given n-input 2-class dataset D can be transformed into a deterministic finite automaton over the alphabet ⟦s⟧ consistent with D whose number of states is polynomial in n and in 1/δ.

Theorem 5. Let B be a gapped, read-once unitary branching program of dimension k separating an n-input 2-class dataset D = (D_0, D_1). One can construct a deterministic finite automaton F(B) over ⟦s⟧ with (n/δ)^{O(k²)} states such that D_1 ⊆ L(F(B)) and D_0 ∩ L(F(B)) = ∅.

In the classic framework of learning with membership and equivalence queries, the goal is to learn a regular language L over a given alphabet ⟦s⟧ in the setting where the learner has access to an oracle (a minimally adequate teacher) that is able to answer two types of queries: membership queries, where the learner selects a word W ∈ ⟦s⟧* and the teacher answers whether or not W ∈ L; and equivalence queries, where the learner selects a hypothesis automaton H and the teacher answers whether or not L is the language of H. If this is not the case, the teacher gives the learner a counterexample word W in the symmetric difference of L and L(H).
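To fix ideas, the minimally adequate teacher can be modeled as the following interface (an illustrative Python sketch; the class and method names are our own, and DFA stands for the learner's hypothesis automaton type, which we leave abstract).

```python
from typing import Optional, Protocol, Sequence

class MinimallyAdequateTeacher(Protocol):
    """Oracle interface for Angluin-style learning of a regular language L."""

    def membership(self, word: Sequence[int]) -> bool:
        """Is the word (over the alphabet {0, ..., s-1}) a member of L?"""
        ...

    def equivalence(self, hypothesis: "DFA") -> Optional[Sequence[int]]:
        """Return None if L(hypothesis) = L; otherwise return a counterexample
        word in the symmetric difference of L and L(hypothesis)."""
        ...
```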
A celebrated result of Angluin (Angluin, 1987) states that any regular language can be exactly learned with a number of queries that is polynomial in the number of states of a minimum deterministic automaton representing the target language and in the size of the largest counterexample returned by the teacher. If L is the characteristic language of a function f : ⟦s⟧^n → {0,1}, meaning that L(f) = {W ∈ ⟦s⟧^n : f(W) = 1}, then Angluin's result implies that one can learn an automaton accepting L in time polynomial in n and in the size of the minimum deterministic finite automaton accepting L. Theorem 5 can be used to transfer this classic result from the context of functions computable by DFAs to functions computable by gapped read-once unitary branching programs.

Theorem 6. For any fixed k and any δ ∈ R with 0 < δ < 1, the representation class of n-input δ-gapped read-once unitary branching programs is exactly learnable using the representation class of deterministic finite automata with a total number of queries that is polynomial in n and in 1/δ.

Another interesting consequence of Theorem 5 is that it allows one to use techniques from best-partition communication complexity theory to prove explicit lower bounds on the dimension necessary to compute certain functions using gapped read-once branching programs. In the setting of best-partition communication complexity, two players, Alice and Bob, wish to compute the value of a function f : {0,1}^n → {0,1}, known in advance by both players, on a specific input W. The caveat is that the positions in {1, ..., n} are partitioned into two sets A and B in such a way that Alice only has access to the bits of W at positions in A and Bob only has access to the bits of W at positions in B. In order to compute the value f(W), Alice and Bob exchange bits using a communication protocol where, at the end of the process, the last bit of the protocol is the value f(W). Both the protocol and the partition of the input bits are agreed upon by Alice and Bob before the bits of the input string are distributed to them, and the protocol should give the correct answer on every input string. The deterministic communication complexity of f is the minimum number of bits in a deterministic protocol computing f, while the non-deterministic communication complexity of f is the minimum number of bits in a non-deterministic protocol.

Lemma 7. Let f : {0,1}^n → {0,1} be a function such that L(f) is the language of some non-deterministic finite automaton with m states. Then the non-deterministic communication complexity of f is O(log m).

Therefore, by combining Lemma 7 with Theorem 5, we have the following.

Lemma 8. Let f : {0,1}^n → {0,1} be a function computable by a δ-gapped read-once unitary branching program of dimension k. Then the non-deterministic communication complexity of f is O(k² · log(n/δ)).

An example of a function with best-partition non-deterministic communication complexity Ω(n) is the triangle-freeness function Δ_n : {0,1}^n → {0,1} (Jukna & Schnitger, 2002). This function takes as input an array x = (x_{ij})_{1 ≤ i < j ≤ m} of n = m(m−1)/2 bits, interpreted as the edge indicators of a graph on m vertices, and evaluates to 1 if and only if this graph contains no triangle.
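For concreteness, here is a direct implementation of the triangle-freeness predicate under the standard encoding described above (our own illustrative code, not from the paper).

```python
from itertools import combinations

def triangle_free(x: dict[tuple[int, int], int], m: int) -> int:
    """Triangle-freeness: x[(i, j)] = 1 iff edge {i, j} is present, 1 <= i < j <= m.

    Returns 1 if the encoded graph has no triangle, 0 otherwise.
    """
    for i, j, k in combinations(range(1, m + 1), 3):
        if x[(i, j)] and x[(i, k)] and x[(j, k)]:
            return 0
    return 1

# Example: a 4-cycle on vertices 1..4 is triangle-free.
m = 4
edges = {(1, 2), (2, 3), (3, 4), (1, 4)}
x = {(i, j): int((i, j) in edges) for i in range(1, m + 1)
     for j in range(i + 1, m + 1)}
assert triangle_free(x, m) == 1
```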