# minimum_entropy_coupling_with_bottleneck__38e40216.pdf Minimum Entropy Coupling with Bottleneck M.Reza Ebrahimi University of Toronto mr.ebrahimi@mail.utoronto.ca Jun Chen Mc Master University chenjun@mcmaster.ca Ashish Khisti University of Toronto akhisti@ece.utoronto.ca This paper investigates a novel lossy compression framework operating under logarithmic loss, designed to handle situations where the reconstruction distribution diverges from the source distribution. This framework is especially relevant for applications that require joint compression and retrieval, and in scenarios involving distributional shifts due to processing. We show that the proposed formulation extends the classical minimum entropy coupling framework by integrating a bottleneck, allowing for a controlled degree of stochasticity in the coupling. We explore the decomposition of the Minimum Entropy Coupling with Bottleneck (MEC-B) into two distinct optimization problems: Entropy-Bounded Information Maximization (EBIM) for the encoder, and Minimum Entropy Coupling (MEC) for the decoder. Through extensive analysis, we provide a greedy algorithm for EBIM with guaranteed performance, and characterize the optimal solution near functional mappings, yielding significant theoretical insights into the structural complexity of this problem. Furthermore, we illustrate the practical application of MEC-B through experiments in Markov Coding Games (MCGs) under rate limits. These games simulate a communication scenario within a Markov Decision Process, where an agent must transmit a compressed message from a sender to a receiver through its actions. Our experiments highlight the trade-offs between MDP rewards and receiver accuracy across various compression rates, showcasing the efficacy of our method compared to conventional compression baseline. 1 Introduction Consider the following Markov Chain modeling a general lossy compression framework X p T |X T q Y |T Y , where the input X with a marginal distribution p X, is encoded by the probabilistic encoder p to generate the code T. Subsequently, the probabilistic decoder q reconstructs Y from T. The objective is to identify the encoder and decoder that minimize the distortion between X and Y , subject to an upper bound constraint on the expected code length H(T) R. It is common to measure the sample-wise distortion via direct comparison of (x, y) pairs through a distortion function d( , ), and consider the expectation E [d(X, Y )] as a measure of average distortion. Instead, we propose using the logarithmic loss (log-loss) H(X|Y ), or equivalently I(X; Y ), as an alternative metric to enforce the distortion constraint. The log-loss distortion measure, commonly employed in learning theory, was first explored within rate-distortion theory by Courtade and Wesel [1] and Courtade and Weissman [2]. This measure is particularly suitable in scenarios where reconstructions can be soft, meaning that the decoder produces a distribution rather than a distorted 38th Conference on Neural Information Processing Systems (Neur IPS 2024). sample point [3]. Consequently, the optimization problem is formulated as follows: min p T |X, q Y |T H(X|Y ) s.t. X T Y, H(T) R, P(X) = p X. It is straightforward to check that the optimal solution of (1) is achieved when T = Y , with the identity decoder. To address this issue of decoder collapse, we introduce a constraint on the output marginal distribution, P(Y ): Minimum Entropy Coupling with Bottleneck (MEC-B) IMEC-B(p X, p Y , R) = max p T |X, q Y |T I(X; Y ) s.t. X T Y, H(T) R, P(Y ) = p Y , P(X) = p X The addition of an output distribution constraint is a practical necessity, as in a lossy compression setup the decoder needs to generate outputs following a desired distribution. For example, in image restoration, the output consists of reconstructed images from the code adhering to a certain distribution, possibly the same as the input distribution. We explore two special cases of (2), where either the encoder or decoder is bypassed. This allows us to optimize the encoder and decoder separately using these cases. First, consider the case where the bottleneck is removed, meaning the constraint H(T) R is relaxed, or R H(X). In this scenario, X = T, and the optimization simplifies to: Minimum Entropy Coupling (MEC) IMEC(p X, p Y ) = max p Y |X I(X; Y ) s.t. P(Y ) = p Y , P(X) = p X This involves identifying the probabilistic mapping p Y |X between the marginals p X and p Y that maximizes the obtained mutual information. This problem, as described in (3), has been extensively studied in the literature as minimum entropy coupling (MEC), with early research conducted by Vidyasagar [4], Painsky et al. [5], Kovaˇcevi c et al. [6], Cicalese et al. [7], among others. Thus, we define the original problem presented in (2) as minimum entropy coupling with bottleneck (MEC-B). Next, consider the case where the decoder is removed, resulting from the relaxation of the output distribution constraint in (2): Entropy-Bounded Information Maximization (EBIM) IEBIM(p X, R) = max p T |X I(X; T) s.t. H(T) R, P(X) = p X Similar to minimum entropy coupling, this problem identifies the joint distribution between two random variables that maximizes their mutual information. However, rather than imposing a marginal distribution constraint, it enforces a more flexible entropy constraint on one of the variables. Lemma 1 provides a decomposition for the mutual information between input and output I(X; Y ), given the Markov chain X T Y . Lemma 1. Given a Markov chain X T Y : I(X; Y ) = I(X; T) + I(Y ; T) I(T; X, Y ). (5) The proof follows multiple applications of the chain rule for mutual information. The following lower bound on the MEC-B objective is attainable based on Lemma 1: I(X; Y ) I(X; T) + I(Y ; T) R . (6) In this work, we consider maximizing the lower bound (6) as a proxy to the main objective. This allows a decomposition of the encoder and decoder for the MEC-B formulation in (2): 1. Encoder Optimization: The encoder is first optimized separately, according to Entropy Bounded Information Maximization in (4), IEBIM(p X, R), resulting in the marginal distribution ˆp T on the code T. 2. Decoder Optimization: The decoder is then optimized by solving a minimum entropy coupling in (3) between the code and output marginals, IMEC(ˆp T , p Y ). Therefore, in terms of problems (2), (3), and (4): IMEC-B(p X, p Y , R) = max p T |X, q Y |T I(X; Y ) (7) max p T |X, q Y |T I(X; T) + I(Y ; T) R (8) IEBIM(p X, R) + IMEC(ˆp T , p Y ) R . (9) In this paper, we address the Entropy-Bounded Information Maximization problem in Section 3, providing theoretical insights into the solution structure across the entire spectrum of rate limits. We establish an upper bound on the objective and demonstrate that only deterministic mappings can achieve this bound. Then, in Section 3.1, we introduce a greedy algorithm designed to identify deterministic mappings with a guaranteed input-dependent gap from the optimal solution. Subsequently, in Section 3.2, we describe a method to identify optimal mappings near any deterministic mapping, effectively bridging the gap between discrete deterministic mappings and providing deeper theoretical insights into the problem structure. Following this theoretical groundwork, Section 4 applies the MEC-B framework to extend Markov Coding Games (MCG) with communication bottlenecks between the source and the agent. Experimental results for MCGs with rate limits are detailed in Section 4.2, showcasing the practical implications of our theoretical developments. The appendix sections complement these discussions by including formal proofs for all theorems and lemmas, a concise overview of the original minimum entropy coupling problem, and additional experimental results. 2 Related Work Couplings and Minimum Entropy Coupling A fundamental problem in probability theory, known as coupling, concerns determining the optimal joint distribution of random variables given their marginal distributions. This problem has a long history, with early examples by Fréchet [8] seeking the joint distribution that maximizes correlation subject to marginal constraints. References [9 12] provide a broader treatment of these problem classes and their applications. Notably, optimal transport (OT) emerges as a significant class within this framework, where optimality is defined as minimizing the expected value of a loss function over the joint distribution. See [13] for an in-depth treatment of the optimal transport problem. The minimum entropy coupling (MEC) focuses on finding the joint distribution with the smallest entropy given the marginal distribution of some random variables. This problem has been first studied in [4 7], among others. While it is shown by Vidyasagar [4], Kovaˇcevi c et al. [6] that MEC is NP-Hard, the literature contains many approximation algorithms for this problem. One of the earliest greedy algorithms for MEC was introduced by Kocaoglu et al. [14] in the context of causal inference, achieving a local minimum with a gap of 1 + log n bits from the optimum, where n represents the size of the alphabet. This bound was further improved in subsequent works [15, 16]. Based on tools from the theory of majorization [17], Cicalese et al. [18] developed a new greedy algorithm producing solutions 1 bit away from the optimal. Subsequent improvements by Li [19] enabled the construction of a coupling whose entropy is within 2 bits of the optimal value, regardless of the number of random variables involved. Despite these advances, Compton [15] identified a majorization barrier that limits further improvements, while Compton et al. [16] introduced the profile method offering stronger lower bounds for the coupling entropy. Minimum entropy coupling finds innovative applications beyond causal inference [14, 20, 21]. For instance, Sokota et al. [22] utilized it in Markov coding games to enable reinforcement learning agents to communicate via Markov decision process trajectories. This application showcased MEC s utility in enabling efficient information transmission through constrained environments like video game interactions. Similarly, de Witt et al. [23] applied MEC to securely encode secret information in regular text, showing MEC corresponds to the maximally efficient secure procedure. Lossy Source Coding While log-loss is widely used in prediction and learning, its application as a distortion measure in the context of source coding has been less explored, with the earliest examples appearing in [1] and [2]. Log-loss is particularly suited as a distortion measure in soft reconstructions, meaning the decoder outputs a distribution. Shkel and Verdú [3] explored a single-shot lossy source coding setting under logarithmic-loss, using a straightforward encoding scheme. Unlike the EBIM formulation in (4) which imposes a direct entropy constraint on the code, this approach constrained the code by the cardinality of its support. Finally, Blau and Michaeli [24] introduced the Rate-Distortion-Perception (RDP) tradeoff in lossy compression. The RDP framework does not fix the output distribution; instead, it imposes a softer perceptual constraint on the generated outputs. Additionally, our work incorporates an entropy constraint on T as a rate bottleneck, while in the RDP formulation, I(X; Y ) can be interpreted as the rate bottleneck. In this line of research, the work of Liu et al. [25] is closest in spirit to our approach, as the authors studied a lossy compression setting with different source and reconstruction distributions. They demonstrated that their setting could be formulated as a generalization of optimal transport with an entropy bottleneck. However, they used mean squared error (MSE) as the distortion metric, while we consider log-loss. Therefore, the mathematical machinery required for our analysis differs significantly from prior work. 3 Entropy-Bounded Information Maximization Consider a discrete random variable X defined over the alphabet X = {1, . . . , n} with a given marginal probability distribution p X. The following problem aims to establish a maximal information coupling between X and another random variable T, defined over the alphabet T = {1, . . . , m}, where the entropy of T is constrained to be no more than R bits. Unlike minimum entropy coupling, the marginal distribution of the second random variable T is not predetermined; the only constraint on T is its entropy. IEBIM(p X, R) = max p XT M I(X; T), (10) where set M consists of all joint distributions p XT that satisfy the following conditions: 1. P t p XT (x, t) = p X(x), ensuring that the marginal distribution of X is preserved. 2. H(T) R H(X), ensuring the entropy of T is constrained to be no more than R. We call this problem Entropy-Bounded Information Maximization (EBIM). Note that the objective in (10) is upper-bounded by R, since: IEBIM(p X, R) = max p XT M I(X; T) max p XT M H(T) R. (11) The following theorem establishes that only deterministic couplings can achieve this upper-bound. Theorem 1. IEBIM(p X, R) = R if and only if there exists a function g : X T such that H(g(X)) = R. The formal proof is presented in Section A.2. Note that the mutual information I(X; T) is invariant to permutations on T . Specifically, for any permutation π : T T , we have I(X; T) = I(X, π(T)). Given that problem (10) only constrains the entropy H(T), the objective is indifferent to such permutations. Let us define the permutation group of a joint distribution p XT as: P(p XT ) = {P | P(x, π(t)) = p XT (x, t), π : T T } . (12) Remark 1. Each partition of X is associated with a permutation group of a deterministic mapping. Consequently, the total number of potential deterministic mapping groups, independent of the entropy constraint on T, will be the total number of feasible partitions of X. The total number of ways to partition a set of size n corresponds to the n-th Bell number, symbolized by Bn. The growth rate of the Bell numbers is O(nn) [26], rendering brute force iteration of all deterministic mappings infeasible. In Figure 2 (left), a brute force method is applied to solve EBIM (10) for an input alphabet of size three. As observed, there are five potential partitions on X, each corresponding to a point where IEBIM(p X, R) = R. Given the impracticality of brute force search for large alphabet sizes, in Section 3.1, we introduce a greedy search algorithm to identify deterministic mappings with a guaranteed performance gap from the optimal. Following this, in Section 3.2, we explore optimal mappings close to these deterministic mappings, providing a strategy to narrow the gap between the identified deterministic mappings. 3.1 Proposed Search Algorithm for Deterministic Mappings Since iterating over all deterministic mappings is not feasible, one should look for carefully constructed search algorithms to find such mappings with resulting H(T) as close as possible to R. Without the loss of generality, suppose p X = [p1, p2, , pn] is arranged in a decreasing order. Algorithm 1 presents a search approach for discovering a deterministic mapping T = g(X), resulting in I(X; T) that is at most h(p2) bits away from the optimal IEBIM(p X, R), where h( ) is the binary entropy function: h(p) = p log(p) (1 p) log(1 p). Algorithm 1 Deterministic EBIM Solver Input: p X, R Output: p XT 1: p XT diag(p X) 2: if R H(X) then 3: return p XT 4: for i 1 to |p X| 1 do 5: p(i) s Merge the two columns with the smallest sum in p XT . 6: I(i) s Mutual Information imposed by p(i) s . 7: p(i) l Merge the two columns with the largest sum in p XT . 8: I(i) l Mutual Information imposed by p(i) l . 9: if I(i) s R then 10: return p(i) s 11: else if I(i) l R < I(i) s then 12: return p(i) l 13: else 14: p XT p(i) l The deterministic EBIM solver in Algorithm 1 has O(n log n) time complexity, where n is the cardinality of the input alphabet. This is because the main loop of the algorithm runs for at most n steps (as at each step we combine two elements of the input distribution) and finding min/max elements can be done in O(log n) using a heap data structure. Also, the mutual information calculation at each step can be done in constant time by only calculating the decrease in entropy after combining two elements of the distribution. Theorem 2. If the output of Algorithm 1 yields mutual information b I, then IEBIM(p X, R) b I h(p2), (13) where h( ) is the binary entropy function, and p2 denotes the second largest element of p X. Let n = |p X|. The procedure outlined in Algorithm 1 establishes a series of deterministic mappings p(1) s , p(1) l , , p(n 1) s , p(n 1) l , corresponding to a decreasing sequence of mutual information values I(1) s , I(1) l , , I(n 1) s , I(n 1) l . The algorithm then picks the mapping with the highest mutual information that does not exceed R. The proof involves establishing an upper bound on the gap between these successive mutual information values. A formal proof is presented in Section A.3. It is important to highlight that the gap described in Theorem 2 is bounded by one bit, i.e., h(p2) 1, with equality achieved when p X = [0.5, 0.5]. Although the gap is capped at one bit, the maximal mutual information in the EBIM formulation scales with R. Thus, the most natural interpretation of this gap emerges in higher rate regimes. Furthermore, the gap described in Theorem 2 remains small when p2 is small, specifically in cases where the input alphabet size is large and the distribution is not heavily skewed toward a few elements. 3.2 Optimal Coupling Around Deterministic Mappings Section 3.1 introduced a greedy search algorithm designed to identify deterministic mappings with a guaranteed and input-dependent gap from the optimal. In this section, we find the optimal couplings close to any deterministic mapping. This method allows us to close the gap between the mappings identified by Algorithm 1, as will be demonstrated later. Theorem 3. Let p XT denoted by a |X| |T | matrix, defines a deterministic mapping T = g(X), with I(X; T) = H(T) = Rg. We have IEBIM(p X, Rg) = Rg, and for small enough ϵ > 0: 1. IEBIM(p X, Rg + ϵ) is attained as follows: Normalize the columns by dividing each column by its sum. Then, select the cell with the smallest normalized value and move an infinitesimal probability mass from this cell to a new column of p XT in the same row. 2. IEBIM(p X, Rg ϵ) is achieved as follows: Identify the columns with the smallest and largest sums in p XT . Select the cell with the smallest value in the column with the lowest sum. Transfer an infinitesimal probability mass from this cell to the column with the highest sum in the same row. IEBIM(p X, R) Figure 1: An example for Theorem 3. Figure 1 on the right depicts an example of optimal solutions in the neighborhood of a deterministic mapping. While Algorithm 1 effectively identifies deterministic mappings that produce mutual information close to the budget R, Theorem 3 can help bridge the remaining gap. More specifically, one can begin with a deterministic mapping and use two probability mass transformations outlined in Theorem 3 to navigate across the I-R plane. Figure 2 illustrates this strategy; for p X = [0.7, 0.2, 0.1], identifying all 5 possible deterministic mappings is straightforward. Applying the transformations from Theorem 3 then yields various solutions across the I-R plane (represented by dashed lines). Subsequently, one can select the solution that maximizes mutual information for any given value of R (highlighted with a thick solid line), thus producing a comprehensive solution for every value of R. As demonstrated in Figure 2, this strategy recovers the optimal solutions, as determined by brute force, for the simple case of an input alphabet of three. However, while effective, the optimality of this approach remains a conjecture. 4 Application: Markov Coding Game with Rate Limit Markov Coding Games (MCGs), as introduced by Sokota et al. [22], represent a specialized type of multi-player decentralized Markov Decision Processes (MDPs) involving several key components: a source, an agent (sender), a Markov decision process, and a receiver. An MCG episode unfolds in three stages: initially, the agent receives a private message from the source, which it must then indirectly convey to the receiver. Next, the agent participates in an episode of the Markov decision 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 R IEBIM (p X, R) Brute Force 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 R Figure 2: Solutions to the EBIM problem for p X = [0.7, 0.2, 0.1]. Left: brute force solution. Right: application of the transformations from Theorem 3 to each deterministic mapping (dashed lines) and selection of solutions with maximal mutual information for each R value (thick solid line). This strategy effectively recovers optimal solutions, aligning with those found by brute force in this case. process. Finally, the receiver attempts to decode the original message based on the observed MDP trajectory. The overall reward is a combination of the MDP payoff and the accuracy with which the receiver decodes the message. MCGs are particularly interesting due to their ability to generalize frameworks like referential games [27] and source coding [28]. We will consider a natural extension to Markov Coding Games, where the link from the source to the agent is rate-limited. This means, contrary to the original setting of Sokota et al. [22], the agent does not fully observe the message at each MDP round, but will receive a compressed version of the message iteratively, and in turn, encodes information about the message in the MDP trajectory for the receiver. Following Sokota et al. [22], we define a rate-limited MCG as a tuple (S, A, T , R), M, µ, ζ, R , where (S, A, T , R) is an MDP denoted by state and action spaces, and reward and transition functions, respectively. M is a set of messages, µ is the prior distribution over messages M, ζ is a non-negative real number we call the message priority, and finally, R is the communication rate limit between the source and the agent. An MCG episode proceeds in the following steps: 1. Message M µ is sampled from the prior over messages at the source. 2. Based on the selected message M and the history of the MDP episode, the source generates and transmits signal T to the agent, adhering to the rate limit R. 3. The Agent uses a conditional policy π|T , which takes current state s S and received signal T as input and outputs distributions over MDP actions A, to generate the next action a. 4. After repeating steps 2 and 3, the agent s terminal MDP trajectory Z is given to the receiver as an observation. 5. The receiver uses the terminal MDP trajectory Z to output a distribution over messages M, estimating the decoded message ˆ M. The objective of the agents is to maximize the expected weighted sum of the MDP reward and the accuracy of the receiver s estimate E[R(Z) + ζI[M = ˆ M]] [22]. Optionally, If a suitable distance function exists, instead, the objective can also be adjusted to minimize the difference between the actual message and the guess. A diagram of the structure MCG with rate limit is shown in Figure 3. Agent MDP Receiver Figure 3: The structure of a Markov Coding Game with Rate Limit. 4.1 Method Description Algorithm 2 Source 1: Input: π, µ, R, s0 2: Observe message m µ 3: Initialize message belief p M µ 4: Initialize state s s0 5: while Source s turn do 6: p MT compress(p M, R) 7: t p T |M(m) 8: Send t to the agent. 9: p T P m p MT (m , ) 10: p T A min_ent_coupling(p T , π(s)) 11: p MA P t p MT ( , t ) p A|T (t ) 12: a, s Observe action and next state 13: p M p M|A(a) Algorithm 3 Agent 1: Input: π, µ, R, s0 2: Initialize message belief p M µ 3: Initialize state s s0 4: while Agent s turn do 5: p MT compress(p M, R) 6: Receive t from the source. 7: p T P m p MT (m , ) 8: p T A min_ent_coupling(p T , π(s)) 9: π|T p A|T (t) 10: a π|T 11: s Commit action a to MDP. 12: p MA P t p MT ( , t ) p A|T (t ) 13: p M p M|A(a) Algorithm 4 Receiver 1: Input: z, π, µ, R, s0 2: Initialize message belief p M µ 3: Initialize state s s0 4: for s, a z do 5: p MT compress(p M, R) 6: p T P m p MT (m , ) 7: p T A min_ent_coupling(p T , π(s)) 8: p MA P t p MT ( , t ) p A|T (t ) 9: p M p M|A(a) 10: return arg maxm p M(m ) Marginal Policy Following Sokota et al. [22], before execution we first derive a marginal policy π for the MDP, based on the Maximum-Entropy reinforcement learning objective: t R(St, At) + βH(At|St) The value of β in (14) needs to be determined in accordance with the message priority ζ of the MCG. Note that this marginal policy does not depend on the choice of message. By introducing stochasticity into this policy, we can encode information about the message into the selection of actions at each step during runtime. For more details, see Section C.1. Step 1 - Source: Message Compression At the beginning of each round, given the updated message belief p M, the source compresses the message to generate the signal T, adhering to the sourceagent rate limit R, by solving EBIM in (10). The source then transmits the signal T to the agent. Subsequently, after observing the action taken by the agent, the source updates the message belief for the next round. Algorithm 2 outlines the steps taken by the source. Step 2 - Agent: Minimum Entropy Coupling As illustrated in Algorithm 3, at each round, upon receiving the signal T, the agent constructs a conditional policy π|T by performing minimum entropy coupling between the action distribution from the marginal policy π(s) with the signal distribution p T . Subsequently, the next action is sampled from the conditional policy, a π|T . Finally, the agent updates the message belief based on the chosen action. Receiver: Decoding the Message Given the agent s final MDP trajectory, the receiver mirrors the actions of the source and agent to update the message belief at each step. As outlined in Algorithm 4, the process begins with the receiver compressing the message based on the current message belief. This is followed by performing minimum entropy coupling between the marginal policy and the distribution of the compressed message. The final message belief is used to estimate the decoded message. 4.2 Experimental Results This section presents the experimental results of the method described in Section 4.1, applied to Markov Coding Games. For our experiments, we utilize a noisy Grid World environment for the Markov Decision Process. Section C.2 provides more detail on the environment setup used in this experiment. The marginal policy is learned through Soft Q-Value iteration, as described in Algorithm 8. By increasing the value of β in Equation (14), we induce more randomness into the marginal policy. Consequently, higher values of β lead to an increase in the total number of steps taken by the agent to reach the goal, resulting in a more heavily discounted reward. Conversely, as the entropy of actions at each state is increased, there is an increase in the mutual information between the actions and the compressed message during the minimum entropy coupling at each step. This dynamic establishes a fundamental trade-off between the MDP reward and the receiver s decoding accuracy, through the adjustment of β. Figure 8 shows policies learned by high and low values of β. Algorithm 5 Uniform Quantizer Encoder Input: p X, R Output: p XT 1: n length of p X 2: m 2R 3: partition_size n/m 4: Initialize p XT as an n m zero matrix 5: for i 0 to m 1 do 6: start i partition_size 7: end min(start + partition_size, n) 8: p XT [start : end, i] p X[start : end] 9: return p XT We compare our proposed compression method in Algorithm 1 with a baseline of uniform quantization. As detailed in Algorithm 5, given an entropy budget R, the input symbols are uniformly partitioned into 2R bins, and each bin is encoded with the same code. Figure 4 illustrates the trade-off between the average MDP reward and the receiver s decoding accuracy by varying β, using our deterministic EBIM solver in Algorithm 1, and the uniform quantization encoder in Algorithm 5. Here, the compression rate is defined by the ratio of the message entropy to the allowed code budget H(T). 0.0 0.2 0.4 0.6 0.8 1.0 Receiver Accuracy Average MDP Payoff Deterministic EBIM Solver (Ours) Compression Rate 1.00 1.25 1.67 2.50 5.00 10.00 0.0 0.2 0.4 0.6 0.8 1.0 Receiver Accuracy Uniform Quantizer Compression Rate 1.00 1.25 1.67 2.50 5.00 10.00 Figure 4: The trade-off between average MDP reward vs. receiver s accuracy, navigated by varying the value of β. Left: using our search algorithm for compression (Algorithm 1), Right: using uniform quantization in Algorithm 5. The message size is 512 with a uniform prior, and each data point is averaged over 200 episodes. Figure 5 illustrates the evolution of message belief over time for various values of β and rate budgets. A marginal policy optimized with a higher β prioritizes message accuracy over MDP payoff, as higher entropy of actions at each state provides more room for the agent to encode information about the message. Consequently, as observed, this leads to improved receiver accuracy in fewer steps. In addition, a lower compression rate permits the agent to retain more information about the message, enabling more effective encoding of information in the selected trajectory. 0 10 20 30 40 Agent Step Receiver Accuracy Deterministic EBIM Solver (Ours) 0 10 20 30 40 Agent Step Uniform Quantizer log Ø = 6 0 10 20 30 40 Agent Step Receiver Accuracy Deterministic EBIM Solver (Ours) 0 10 20 30 40 Agent Step Uniform Quantizer log Ø = 5 0 10 20 30 40 Agent Step Receiver Accuracy Deterministic EBIM Solver (Ours) 0 10 20 30 40 Agent Step Uniform Quantizer log Ø = 4.6 0 10 20 30 40 Agent Step Receiver Accuracy Deterministic EBIM Solver (Ours) 0 10 20 30 40 Agent Step Uniform Quantizer log Ø = 4 Compression Rate 10.00 5.00 2.50 1.67 1.25 1.00 Figure 5: Evolution of message belief over time, for various values of β and rate budget, using our search algorithm for compression in Algorithm 1 vs. uniform quantization in Algorithm 5. 5 Conclusion We investigated a lossy compression framework under logarithmic loss, where the reconstruction distribution differs from the source distribution. This framework supports joint compression and retrieval applications, or more generally, cases where distributional shifts occur due to processing. We demonstrated that this framework effectively extends the classical minimum entropy coupling by incorporating a bottleneck, which regulates the degree of stochasticity in the coupling. Furthermore, we showed that separately optimizing the encoder and decoder decomposes the Minimum Entropy Coupling with Bottleneck (MEC-B) into two distinct problems: Entropy-Bounded Information Maximization (EBIM) for the encoder, followed by Minimum Entropy Coupling (MEC) for the decoder. We conducted an extensive study of the EBIM problem, provided a functional mapping search algorithm with guaranteed performance, and characterized the optimal solution adjacent to functional mappings, offering valuable theoretical insights into the problem structure. To illustrate an application of MEC-B, we presented experiments on Markov Coding Games (MCGs) with rate limits. The results demonstrated the trade-off between MDP reward and receiver accuracy, with varying compression rates, compared to baseline compression schemes. Future research could focus on quantifying the gap between the separate optimization of the encoder and decoder and the optimal joint setting. Also, enabling fine-grained control over the entropy spread in the coupling can be key in some applications. Additionally, the application of Entropy-Bounded Information Maximization (EBIM) in watermarking language models [29] suggests a valuable intersection with state-of-the-art AI applications. Moreover, extending this framework to continuous cases could lead to the design of neural network architectures based on the proposed framework and provide information-theoretic insights into a broad spectrum of deep learning problems. These include unpaired sample-to-sample translation [30 32], joint compression and upscaling [25, 33], and the Info Max framework [34, 35], among others. [1] Thomas A Courtade and Richard D Wesel. Multiterminal source coding with an entropy-based distortion measure. In 2011 IEEE International Symposium on Information Theory Proceedings, pages 2040 2044. IEEE, 2011. [2] Thomas A Courtade and Tsachy Weissman. Multiterminal source coding under logarithmic loss. IEEE Transactions on Information Theory, 60(1):740 761, 2013. [3] Yanina Y Shkel and Sergio Verdú. A single-shot approach to lossy source coding under logarithmic loss. IEEE Transactions on Information Theory, 64(1):129 147, 2017. [4] Mathukumalli Vidyasagar. A metric between probability distributions on finite sets of different cardinalities and applications to order reduction. IEEE Transactions on Automatic Control, 57 (10):2464 2477, 2012. [5] Amichai Painsky, Saharon Rosset, and Meir Feder. Memoryless representation of markov processes. In 2013 IEEE International Symposium on Information Theory, pages 2294 298. IEEE, 2013. [6] Mladen Kovaˇcevi c, Ivan Stanojevi c, and Vojin Šenk. On the entropy of couplings. Information and Computation, 242:369 382, 2015. [7] Ferdinando Cicalese, Luisa Gargano, and Ugo Vaccaro. How to find a joint probability distribution of minimum entropy (almost) given the marginals. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 2173 2177. IEEE, 2017. [8] Maurice Fréchet. Sur les tableaux de corrélation dont les marges sont données. Ann. Univ. Lyon, 3ˆ e serie, Sciences, Sect. A, 14:53 77, 1951. [9] Frank Den Hollander. Probability theory: The coupling method. Lecture notes available online (http://websites. math. leidenuniv. nl/probability/lecturenotes/Coupling Lectures. pdf), 2012. [10] Gwo Dong Lin, Xiaoling Dou, Satoshi Kuriki, and Jin-Sheng Huang. Recent developments on the construction of bivariate distributions with fixed marginals. Journal of Statistical Distributions and Applications, 1:1 23, 2014. [11] Viktor Benes and Josef Stepán. Distributions with given marginals and moment problems. Springer Science & Business Media, 2012. [12] Lei Yu and Vincent YF Tan. Asymptotic coupling and its applications in information theory. IEEE Transactions on Information Theory, 65(3):1321 1344, 2018. [13] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. [14] Murat Kocaoglu, Alexandros Dimakis, Sriram Vishwanath, and Babak Hassibi. Entropic causal inference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017. [15] Spencer Compton. A tighter approximation guarantee for greedy minimum entropy coupling. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 168 173. IEEE, 2022. [16] Spencer Compton, Dmitriy Katz, Benjamin Qi, Kristjan Greenewald, and Murat Kocaoglu. Minimum-entropy coupling approximation guarantees beyond the majorization barrier. In International Conference on Artificial Intelligence and Statistics, pages 10445 10469. PMLR, 2023. [17] Albert W Marshall, Ingram Olkin, and Barry C Arnold. Inequalities: theory of majorization and its applications. 1979. [18] Ferdinando Cicalese, Luisa Gargano, and Ugo Vaccaro. Minimum-entropy couplings and their applications. IEEE Transactions on Information Theory, 65(6):3436 3451, 2019. [19] Cheuk Ting Li. Efficient approximate minimum entropy coupling of multiple probability distributions. IEEE Transactions on Information Theory, 67(8):5259 5268, 2021. [20] Spencer Compton, Murat Kocaoglu, Kristjan Greenewald, and Dmitriy Katz. Entropic causal inference: Identifiability and finite sample results. Advances in Neural Information Processing Systems, 33:14772 14782, 2020. [21] Mohammad Ali Javidian, Vaneet Aggarwal, Fanglin Bao, and Zubin Jacob. Quantum entropic causal inference. In Quantum Information and Measurement, pages F2C 3. Optica Publishing Group, 2021. [22] Samuel Sokota, Christian A Schroeder De Witt, Maximilian Igl, Luisa M Zintgraf, Philip Torr, Martin Strohmeier, Zico Kolter, Shimon Whiteson, and Jakob Foerster. Communicating via markov decision processes. In International Conference on Machine Learning, pages 20314 20328. PMLR, 2022. [23] Christian Schroeder de Witt, Samuel Sokota, J Zico Kolter, Jakob Foerster, and Martin Strohmeier. Perfectly secure steganography using minimum entropy coupling. ar Xiv preprint ar Xiv:2210.14889, 2022. [24] Yochai Blau and Tomer Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 675 685. PMLR, 09 15 Jun 2019. URL https://proceedings. mlr.press/v97/blau19a.html. [25] Huan Liu, George Zhang, Jun Chen, and Ashish J Khisti. Lossy compression with distribution shift as entropy constrained optimal transport. In International Conference on Learning Representations, 2022. [26] Nicolaas Govert De Bruijn. Asymptotic methods in analysis, volume 4. Courier Corporation, 1981. [27] Brian Skyrms. Signals: Evolution, learning, and information. OUP Oxford, 2010. [28] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999. [29] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061 17084. PMLR, 2023. [30] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arxiv e-prints. ar Xiv preprint ar Xiv:1611.07004, 1611, 2016. [31] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223 2232, 2017. [32] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning, pages 1989 1998. Pmlr, 2018. [33] Byeongkeun Kang, Subarna Tripathi, and Truong Q Nguyen. Toward joint image generation and compression using generative adversarial networks. ar Xiv preprint ar Xiv:1901.07838, 2019. [34] Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. ar Xiv preprint ar Xiv:1907.13625, 2019. [35] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. ar Xiv preprint ar Xiv:1808.06670, 2018. [36] O. L. Mangasarian. Machine Learning via Polyhedral Concave Minimization, pages 175 188. Physica-Verlag HD, Heidelberg, 1996. ISBN 978-3-642-99789-1. doi: 10.1007/ 978-3-642-99789-1_13. URL https://doi.org/10.1007/978-3-642-99789-1_13. [37] Michel X Goemans. Spectral graph theory and numerical linear algebra. URL http://www.cs.cmu.edu/afs/cs/user/glmiller/public/Scientific-Computing/ F-11/Related Work/Goemans-LP-notes.pdf. [38] F Palacios-Gomez, L Lasdon, and M Engquist. Nonlinear optimization by successive linear programming. Management science, 28(10):1106 1120, 1982. [39] Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433 1438. Chicago, IL, USA, 2008. [40] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018. [41] Kevin Hanselman. grid-world-rl. https://github.com/kevin-hanselman/ grid-world-rl, 2023. [42] David Barber and Felix Agakov. The im algorithm: a variational approach to information maximization. Advances in neural information processing systems, 16(320):201, 2004. [43] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016. [44] Yann Le Cun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010. [45] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Baolin Wu, Andrew Y Ng, et al. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4. Granada, 2011. A Mathematical Proofs A.1 Proof of Lemma 1 Lemma 1. Given a Markov chain X T Y : I(X; Y ) = I(X; T) + I(Y ; T) I(T; X, Y ). (5) Proof. By multiple applications of the chain rule for mutual information, i.e. I(A; B, C) = I(A; B)+ I(A; C|B), we have: I(X; Y ) = I(X; Y, T) I(X; T|Y ) (15) = [I(X; T) + I(X; Y |T)] [ I(T; Y ) + I(T; X, Y )] (16) = I(X; T) + I(Y ; T) I(T; X, Y ). (17) Note that from the Markov chain property, we have I(X; Y |T) = 0. A.2 Proof of Theorem 1 Theorem 1. IEBIM(p X, R) = R if and only if there exists a function g : X T such that H(g(X)) = R. Proof. If such g exists, let p XT (x, t) = p X(x) t = g(x) 0 otherwise This joint distribution effectively sets T = g(X). Note that p XT M and we have I(X; T) = H(T) H(T|X) = H(g(X)) = R. Since IEBIM(p X, R) R, we conclude that IEBIM(p X, R) = R for p XT = p XT . Conversely, if IEBIM(p X, R) = R, then there exists p XT M such that I(X; T) = R. Therefore H(T) = I(X; T) + H(T|X) = R + H(T|X) R H(T|X) 0 As a result, H(T|X) = 0 and H(T) = R, which means p XT defines a function g such that T = g(X), and H(g(X)) = H(T) = R. A.3 Proof of Theorem 2 Before providing the formal proof, it is helpful to gain some insight into the structure of the solution first. Remark 2. Let n = |p X|. The procedure outlined in Algorithm 1 establishes a series of deterministic mappings p(0) l , p(1) s , p(1) l , , p(n 1) s , p(n 1) l , corresponding to a decreasing sequence of mutual information values I(0) l , I(1) s , I(1) l , , I(n 1) s , I(n 1) l . The algorithm then picks the mapping with the maximum mutual information that does not exceed R. Therefore R I(X; T) max n I(0) l I(1) s , I(1) s I(1) l , , I(n 1) s I(n 1) l o max n I(0) l I(1) l , I(1) l I(2) l , , I(n 2) l I(n 1) l o . (18) Example. For p X = [0.4 0.3 0.2 0.1], Algorithm 1 traverses through the following deterministic mappings, from left to right: p(0) l p(1) s p(1) l p(2) s p(2) l p(3) s p(3) l 0.4 0 0 0 0 0.3 0 0 0 0 0.2 0 0 0 0 0.1 0.4 0 0 0 0.3 0 0 0 0.2 0 0 0.1 0.4 0 0 0.3 0 0 0 0.2 0 0 0 0.1 0.4 0 0.3 0 0 0.2 0 0.1 0.4 0 0.3 0 0.2 0 0 0.1 0.4 0.3 0.2 0.1 0.4 0.3 0.2 0.1 I (X; T) H (p X) H ([.4 .3 .3]) H ([.7 .2 .1]) H ([.7 .3]) H ([.9 .1]) 0 0 Definition 1. Let P be a probability distribution resulted from merging two elements p > 0 and q > 0 in an original distribution P, i.e. P = [ p q ] and P = [ p + q ]. Then, the amount of decrease in the entropy from this merge operation is characterized by: H (p , q) = H (P) H (P ) (19) p + q log 1 q (p + q) log 1 p + q (20) = p log 1 + q + q log 1 + p Lemma 2. The following properties hold for the function H: 1. H ( , ) is monotonically increasing in both arguments. 2. H ( , ) is concave. 3. H (p , 1 p) = h(p). Proof. The properties are derived through straightforward derivative calculations: 1. p H = log(1 + q/p) 0, and q H = log(1 + p/q) 0. 2. The Hessian of H is negative semidefinite: H H = 1 p + q q/p 1 1 p/q with eigenvalues λ1 = 0 and λ2 = ( q q )( 1 p+q) < 0. 3. H (p , 1 p) = p log p (1 p) log(1 p) = h(p). Theorem 2. If the output of Algorithm 1 yields mutual information b I, then IEBIM(p X, R) b I h(p2), (13) where h( ) is the binary entropy function, and p2 denotes the second largest element of p X. Proof. For the gap to the optimal objective, IEBIM(p X, R) b I, we have: Equation (11) IEBIM(p X, R) b I R b I Remark 2 max i {1, ,n 1} I(i 1) l I(i) l Algorithm 1, Line 8 = max i {1, ,n 1} H P Definition 1 = max i {1, ,n 1} H k=1 pk , pi+1 Lemma 2.1 max i {1, ,n 1} H k=1 pk + n P k=i+2 pk , pi+1 = max i {1, ,n 1} H (1 pi+1 , pi+1) Lemma 2.3 = max i {1, ,n 1} h (pi+1) p2, p3, , pn 0.5 = h(p2). Note that the above bound on the optimality of the proposed algorithm is by no means tight, as it does not account for the intermediate distributions p(i) s . A.4 Proof of Theorem 3 Theorem 3. Let p XT denoted by a |X| |T | matrix, defines a deterministic mapping T = g(X), with I(X; T) = H(T) = Rg. We have IEBIM(p X, Rg) = Rg, and for small enough ϵ > 0: 1. IEBIM(p X, Rg + ϵ) is attained as follows: Normalize the columns by dividing each column by its sum. Then, select the cell with the smallest normalized value and move an infinitesimal probability mass from this cell to a new column of p XT in the same row. 2. IEBIM(p X, Rg ϵ) is achieved as follows: Identify the columns with the smallest and largest sums in p XT . Select the cell with the smallest value in the column with the lowest sum. Transfer an infinitesimal probability mass from this cell to the column with the highest sum in the same row. Example. Figure 6 depicts an example of optimal solutions in the neighborhood of a deterministic mapping. IEBIM(p X, R) Figure 6: Optimal solutions in the neighborhood of a deterministic mapping. Proof. Let us view the joint distribution as a n m matrix p XT . Note that: p XT (x, t) = p X(x) t = g(x) 0 otherwise (23) For example: 0.4 0 0 0 0.3 0 0 0 0 0.2 0 0 0 0 0.1 0 1, x = 1, 2 2, x = 3 3, x = 4 (24) Consider a perturbation d P Rn m to p XT . For p XT + d P to be a valid distribution in M, we need: t d P(x, t) = 0, x X 2. d P(x, t) 0, x, t s.t. t = g(x) 3. d P(x, t) 0, x, t s.t. t = g(x) We define the set of all such perturbations as Ω Rn m. Next, let us define basis perturbations x,t for t = g(x) as: ε, if i = x, j = g(x) +ε, if i = x, j = t 0, otherwise (25) Note that x,t represents moving a probability mass of ε from non-zero cell (x, g(x)) to empty cell (x, t). For the example in equation (24): 0 0 0 0 ε 0 +ε 0 0 0 0 0 0 0 0 0 The significance of these bases is that any perturbation in Ωcan be represented as: x,t t =g(x) αx,t x,t, (26) with coefficients αx,t 0. For example: 0 0 0 0 3ε +2ε +ε 0 0 0 0 0 0 0 0 0 = 2 1,1 + 1,2. Realizing IXT , HXT , and HT as functions of a joint distribution, we are interested in calculating the ratio d IXT /d HT with respect to a perturbation d P Ωas ε 0 at p XT . Note that since for any d P Ω, d HX = 0, we have: d HT = d HX + d HT d HX,T d HT = 1 d HX,T d HT = 1 d HX,T (p XT , d P) d HT (p XT , d P) t =g(x) αx,t x,t t =g(x) αx,t x,t t =g(x) αx,t d HX,T (p XT , x,t) P t =g(x) αx,t d HT (p XT , x,t) . (27) d HX,T (p XT , x,t) represents the amount of change in the joint entropy, when an infinitesimal mass of ε is moved from (x, g(x)) to (x, t). More precisely, from (23) and (25): d HX,T (p XT , x,t) = HX,T (p XT + x,t) HX,T (p XT ) = [ (p X(x) ε) log(p X(x) ε) ε log ε] [ p X(x) log p X(x)] = p X(x) log p X(x) p X(x) ε + ε log p X(x) ε = ε + O(ε2) + ε log p X(x) The last line uses the fact that for small enough x, f(x) = a log a a x = x + O(x2). Similarly: d HT (p XT , x,t) = HT (p XT + x,t) HT (p XT ) = h p T g(x) ε log p T g(x) ε p T (t) + ε log p T (t) + ε i h p T g(x) log p T g(x) p T (t) log p T (t) i = p T g(x) log p T g(x) p T g(x) ε + p T (t) log p T (t) p T (t) + ε + ε log p T g(x) ε p T (t) + ε = ε + O(ε2) ε + O(ε2) + ε log p T g(x) p T (t) + ε = log p T g(x) p T (t) + ε + O(ε2). (29) Plugging (28) and (29) back to (27), we will get: t =g(x) αx,t ε + ε log p X(x) t =g(x) αx,t log p T g(x) p T (t) + ε + O(ε2) αx,t log p X(x) αx,t log p T g(x) p T (t) + ε where αx,t is normalized by: αx,t = αx,t P t =g(x) αx,t . Let s focus on the limit of (30) when ε 0: If there is any t T with p T (t) = 0 and αx,t > 0, the denominator of the second term will grow without bound, otherwise the denominator remains bounded. Therefore, for the limit of (30) we have: lim ε 0 d IX,T if αx,t = 0 t s.t. p T (t) = 0 αx,t 1 if t : αx,t > 0 and p T (t) = 0 (31) For d HT > 0, we need to find a perturbation (i.e. coefficients αx,t) that maximizes d IXT /d HT . From (31), this means t T with αx,t > 0 and p T (t) = 0. d HT = argmax Therefore, P αx,t = 1 which means αx,t = 0 if p T (t) > 0. In other words, we should only consider perturbations where masses are moved to all-zero columns. Continuing (30): αx,t log p X(x) αx,t log p T g(x) αx,t log p X(x) αx,t log p T g(x) αx,t log p T g(x) αx,t log p X(x) p T g(x) . This is achieved by selecting 1, x = argmin x p X(x ) p T g(x ) and p T (t) = 0 0, otherwise (32) In other words, first, normalize columns in p XT by their sum, then move an infinitesimal probability mass from the cell with the smallest normalized value to an all-zero column. It is easy to confirm that d HT > 0 for such a perturbation. For the example distribution in (24): p XT + d P = 0.4 0 0 0 0.3 ε 0 0 ε 0 0.2 0 0 0 0 0.1 0 On the other hand, for d HT < 0, we need to find a perturbation (i.e. coefficients αx,t) that minimizes d IXT /d HT . From (31), this means αx,t = 0 for all t T that p T (t) = 0. Therefore, as in (30): αx,t log p X(x) αx,t log p T g(x) α 1 log ε + P αx,t log p X(x) αx,t log p T g(x) αx,t log p T g(x) This is achieved by selecting (1, x = argmin x p T g(x ) and t = argmax t p T (t ) 0, otherwise (34) In other words, moving an infinitesimal probability mass from the smallest column to the largest column of p XT . It is easy to confirm that d HT < 0 for such a perturbation. For the example distribution of (24): p XT + d P = 0.4 0 0 0 0.3 0 0 0 0 0.2 0 0 ε 0 0.1 ε 0 B Minimum Entropy Coupling Consider two discrete random variables X and Y , over alphabets X and Y with probability mass functions p X and p Y , respectively. The goal of minimum entropy coupling is to find the joint distribution p XY that minimizes the joint entropy H(X, Y ): min p XY H(X; Y ) y Y p XY (x, y) = p X(x) x X, x X p XY (x, y) = p Y (y) y Y This is a concave minimization problem over a standard polyhedron [36]. Therefore, every vertex of the polyhedron is a local minimum and the global minimum happens at a subset of the vertices. Note that an standard polyhedron is defined as P = {x Rn| Ax = b, x 0}, where A Rm n with linearly independent rows. A point x P is a vertex if and only if it has n m zero elements and columns of A corresponding to other m non-zero elements are linearly independent. Hence, to exhaustively iterate all the vertices: 1. Choose m linearly independent columns Aπ(1), , Aπ(m). 2. Let xi = 0 for all i π(1), ..., π(m) 3. Solve the system of m equations Ax b = 0 for the unknowns xπ(1), , xπ(m) Therefore a crude upper-bound on the number of vertices would be n m . This can be enhanced to n m/2 m/2 [37] which is still exponential in m. Next, we will show that the minimum entropy coupling problem as defined in (36) is essentially NP-Hard. This is done by reduction from another NP-complete problem, the k-Subset-Sum (see [6] for more details). Remark 3. The minimum entropy coupling problem in (36) is NP-Hard [6]. Proof. To show an optimization problem is NP-Hard, we need to show the corresponding decision problem is NP-Hard. Given an optimization problem, a decision version is whether or not any target value t is achievable. Without the loss of generality, assume |X| > |Y|. We set t = H(Y ), i.e. to decide if there exists a function f : X Y such that Y = f(X). Let s call this problem Deterministic Matching. Next, we show any instance of the k-Subset-Sum problem can be reduced to an instance of Deterministic Matching, by a polynomial-time procedure (denoted by the notation