Published in Transactions on Machine Learning Research (10/2024)

Undetectable Steganography for Language Models

Or Zamir  orzamir@tauex.tau.ac.il
School of Computer Science, Tel Aviv University

Reviewed on OpenReview: https://openreview.net/forum?id=fq6aQoMSHz

We introduce a cryptographic method to hide an arbitrary secret payload in the response of a Large Language Model (LLM). A secret key is required to extract the payload from the model's response, and without the key it is provably impossible to distinguish between the responses of the original LLM and the LLM that hides a payload. In particular, the quality of generated text is not affected by the payload. Our approach extends a recent result of Christ, Gunn and Zamir (2023), who introduced an undetectable watermarking scheme for LLMs.

Prompt: Write an email asking my professor Prof. Hannity to not make the final exam in Machine Learning 101 too difficult.

Response: Dear Professor Hannity, I hope this email finds you well. I am writing to respectfully request that you consider the level of difficulty for the final exam in Machine Learning 101. While I am confident in my understanding of the course materials and have put in a significant amount of effort throughout the semester, I do have concerns about the potential difficulty of the final exam. As you may recall, several students in my previous sections of Machine Learning have found the final exam to be very challenging, leading to frustration and disappointment. While I understand that the course is intended to push students to their limits and beyond, I believe that some relaxation of the latter may be in order. I would kindly ask that you consider reducing the difficulty of the final exam or offering some additional supports or resources to help students prepare. I believe that this could enhance the learning experience

Hidden Payload: OZ (Encoded as 10 bits.)
Figure 1: We asked Llama 2 to write an email urging a professor for an easy exam, intended to be sent anonymously. Nevertheless, the part of the response shown above secretly encodes the initials of the user who used the LLM. This secret payload was encoded without modifying the response distribution at all.

1 Introduction

As the applications of machine learning models generating human-like text become widespread, concerns about misuse rise proportionally. As detection of text generated by Large Language Models (LLMs) seemingly approaches infeasibility (JAML20; KSK+23; SKB+23; CBZ+23), the ability to intentionally plant a watermark in LLM-generated text becomes the most viable approach to differentiate LLM-generated from human-generated text. A long line of works showed that a watermark can be planted in LLMs by altering the output texts (AF21; QZL+23; YAJK23; MZ23; KGW+23a). Recently, Christ, Gunn and Zamir (CGZ23) showed that a watermark can be planted in LLM outputs without altering the distribution of responses. Informally, CGZ show that any LLM can be modified such that: 1) It is computationally infeasible to distinguish between the original and the modified LLMs unless you hold a secret key, even when you are allowed to make many adaptive queries; 2) With the secret key, outputs of the modified LLM can be detected as watermarked. The importance of this notion of undetectability is that it formally guarantees that the quality of text does not degrade (or change at all) in the process of planting the watermark. In this work, we extend the watermarking scheme of CGZ to plant an arbitrary payload into the LLM response, while still maintaining the same property of undetectability. This implies, for example, that an LLM may secretly hide session metadata (such as the user's name, the time, or the prompt used) in the response it gives.
In particular, it can extend the ability of detecting text generated by the LLM to the ability of also knowing who used the LLM or when they did so. It also means that even an LLM running offline may use the responses it gives to covertly leak internal information to those who will later be exposed to the seemingly-clean LLM responses. The main technical tool we use to transform the CGZ scheme from watermarking to embedding a message is a dynamic variant of Error Correcting Codes with feedback. The process of encoding a hidden message into a given channel (e.g., a picture or natural language) is called steganography and has been extensively studied in many domains (HvAL08; DIRR09; KJGR21; dWSK+22). The unique property of our setting, though, is that in the case of LLM responses the distribution of the channel is unknown. The hidden message should be recovered from the response of the LLM without knowing what prompt was used to generate the response. In particular, the process of decoding the message must be agnostic to the distribution of responses. Recent manuscripts (FCT+23; WYC+23; YAK23) embed messages in LLM responses, but do so by altering the distribution of responses, similarly to the non-undetectable watermarking schemes. In particular, those methods affect the response distribution and may degrade the quality of generated text. Under realistic assumptions which we verify empirically, the number of hidden bits our scheme can encode in an LLM output is linear in the length of the response, which is asymptotically optimal. Our scheme only negligibly affects the complexity of the generation process and is thus easy and efficient to deploy. While this paper is theoretical in nature and the properties of the suggested scheme are proven rigorously, we also implemented the scheme and provide empirical examples.
An open problem not addressed in this paper is that our encoding scheme is not very robust to edits of the generated text; we discuss this at length in later sections.

1.1 Organization of the paper

In Section 2 we give the formal definitions of the problem's setting, and state the main theorems of this paper rigorously. We also give the necessary preliminaries. In Section 3 we give a quick overview of the CGZ watermark. In Section 4 we give a high-level overview of our scheme. Then, in Section 5 we introduce a simple dynamic error correcting code with feedback, which we later use as a building block. Finally, in Section 6 we give and analyze the full scheme. In Section 7 we discuss our implementation of the scheme and some empirical evaluation of it. In Section 8 we discuss limitations of our scheme, and in particular robustness to editing. We follow with open problems and conclusions.

2 Model and Preliminaries

Many of the notions in this section are adapted from (CGZ23); the last subsection contains our new definitions and theorems for steganography in LLMs.

2.1 Preliminaries

Let λ be the security parameter. We denote by negl(λ) any function that is in O(1/p(λ)) for every polynomial p(·). As is standard in cryptography research, we think of λ as the "key size", and of running times that are super-polynomial in λ as "infeasible". We denote by log and ln the logarithm with base two and the natural logarithm, respectively. For a sequence s = (. . . , s_i, . . .) we denote by s[i : j] the sub-sequence (s_i, . . . , s_j). The Hamming distance between two vectors is defined as the number of indices on which they differ, that is, ∆(z, z′) := |{i | z_i ≠ z′_i}|. We denote by x || y the concatenation of the vectors x and y, and by len(v) the dimension of the vector v.

Pseudorandom function (PRF). Let F = {F_k : {0, 1}^{ℓ1(λ)} → {0, 1}^{ℓ2(λ)} | k ∈ {0, 1}^λ} be a family of functions.
F is a PRF if F_k is efficiently computable and for all probabilistic polynomial-time distinguishers D,

    | Pr_{k ← {0,1}^λ} [ D^{F_k(·)}(1^λ) = 1 ] − Pr_f [ D^{f(·)}(1^λ) = 1 ] | ≤ negl(λ),

where f denotes a random function from {0, 1}^{ℓ1(λ)} to {0, 1}^{ℓ2(λ)}. PRFs are a standard cryptographic primitive, equivalent to one-way functions, and can be constructed from standard assumptions (GGM86; HILL99). Intuitively, a PRF is simply a function that is indistinguishable from a totally random function without knowledge of the secret key k that parameterizes it.

2.2 Language Models

We adapt the definition of (CGZ23) for language models, which in turn follows that of (KGW+23b). We will often refer to language models simply as models.

Definition 2.1. A language model Model over token set T is a deterministic algorithm that takes as input a prompt prompt and tokens previously output by the model x = (x_1, . . . , x_{i−1}), and outputs a probability distribution p_i = Model(prompt, x) over T.

A language model Model is used to generate text as a response to a prompt by iteratively sampling from the returned distribution until a special terminating token done ∈ T is drawn.

Definition 2.2. A language model's response to prompt is a random variable Model(prompt) ∈ T* that is defined algorithmically as follows. We begin with an empty list of tokens x = (). As long as the last token in x is not done, we draw a token x_i from the distribution Model(prompt, x) and append it to x. Finally, we set Model(prompt) = x.

We assume that our model never outputs text of length super-polynomial in λ.

2.3 Entropy and Empirical Entropy

For a probability distribution D over elements of a finite set X, the Shannon entropy of D is H(D) = E_{x∼D}[−log D(x)], where D(x) is the probability of x in the distribution D. The empirical entropy (also known as Shannon information) of x in D is simply −log D(x). The expected empirical entropy of x ∼ D is thus H(D). The following definition of empirical entropy of a model's response is taken from (CGZ23).
Definition 2.3. For a language model Model, a prompt prompt, and a possible response x ∈ T*, we define the empirical entropy of Model responding with x to prompt as

    H_e(Model, prompt, x) := −log Pr[ Model(prompt) = x ].

Note that in expectation, Definition 2.3 simply captures the entropy in the response generation. That is, we have E_x[H_e(Model, prompt, x)] = H(Model(prompt)), where x ∼ Model(prompt). The following definition naturally generalizes empirical entropy from whole outputs to substrings of a model's output.

Definition 2.4. For a language model Model, a prompt prompt, a possible response x ∈ T*, and indices i, j ∈ [|x|] with i ≤ j, we define the empirical entropy on substring [i, j] of Model responding with x to prompt as

    H_e^{[i,j]}(Model, prompt, x) := −log Pr[ Model(prompt)[i : j] = x[i : j] | Model(prompt)[1 : (i−1)] = x[1 : (i−1)] ].

2.3.1 Empirical Entropy in Natural Language

Studies in linguistics (GC02; CLA17; SL22) conclude that in natural language the entropy per unit of text (e.g., a word) is usually constant throughout the text. In particular, the empirical entropy of an LLM response is expected to be linear in the length of the text, and roughly uniformly distributed among the different tokens that assemble the response. This intuition was empirically verified by (CGZ23). We reaffirm the above in Section 7, in which we also run empirical evaluations.

2.4 Steganography for LLMs

In this section we finally define steganography for language models rigorously. We first explain the definition intuitively. During the setup of the scheme, we generate a secret key k of size λ. To generate a response, we use a method Steg_k that, together with the key k, receives a prompt prompt and a secret message payload. A retrieval method Retr_k should be able to retrieve the hidden payload from an output generated using Steg_k, while also using the secret key k.

Definition 2.5 (Steganography Scheme).
A steganography scheme for a model Model over T is a tuple of algorithms W = (Setup, Steg, Retr) where:

Setup(1^λ) → k outputs a secret key, with respect to a security parameter λ.

Steg_k(prompt, payload) is a randomized algorithm that takes as input a prompt prompt and a payload payload, and generates a response in T*.

Retr_k(x) is an algorithm that takes as input, and returns as output, sequences in T*.

The most important property of the steganography scheme we present in this paper is undetectability. Intuitively, we require that without knowledge of the secret key, Steg_k(prompt, ·) and Model(prompt) are indistinguishable even to a user allowed to make many adaptive queries. The payloads used in Steg_k do not affect this property.

Definition 2.6 (Undetectability). A steganography scheme W = (Setup, Steg, Retr) is undetectable if for every security parameter λ and all polynomial-time distinguishers D,

    | Pr[ D^{Model,Model}(1^λ) = 1 ] − Pr_{k ← Setup(1^λ)} [ D^{Model,Steg_k}(1^λ) = 1 ] | ≤ negl(λ),

where the notation D^{O1,O2} means that D is allowed to adaptively query both O1 and O2 with arbitrary prompts.

Another important desired property is that Retr_k should succeed in retrieving the payload from the output of Steg_k(prompt, payload). Such a successful retrieval inherently requires making assumptions on the length of the payload and the entropy of the model. It is, for example, impossible to encode in the output an arbitrary payload longer than the output, or to plant any payload at all in a deterministic model (thus, one without any entropy) while remaining undetectable. The best possible result is thus encoding H_e(Model, prompt, x) bits of payload in an output x ← Steg_k(prompt, payload); this is because the empirical entropy of an output exactly quantifies the number of entropy bits used for its generation. We achieve this best-possible bound asymptotically, yet we require two additional technical conditions.
First, we can plant payloads only when H_e(Model, prompt, x) = Ω(λ). Intuitively, this is because outputs with a very low empirical entropy (with respect to the security parameter) show up too frequently and thus cannot be modified while maintaining undetectability. The necessity of this condition, which also appeared in the CGZ result, is proven in (CGZ23). Second, we require that the empirical entropy is spread somewhat uniformly throughout the output x. This appears to be necessary for technical reasons (e.g., avoiding a scenario in which the empirical entropy is high solely due to a single very-low-probability token in the response), yet could potentially be relaxed in future works. As discussed in Section 2.3.1 and empirically reaffirmed in Section 7, this condition is satisfied by natural language, in which the entropy is roughly constant per unit of text. A semi-formal version of the theorem follows; the formal one appears in Section 6.

Theorem (Informal version of Theorem 6.3). Fix a model Model. Let prompt, payload be strings. Conditioned on the empirical entropy of a response y generated by Steg_k(prompt, payload) being high enough, the expected length of the longest identical prefixes of payload and Retr_k(y) is at least Θ(len(y)).

The definition of "high enough", formally stated in Section 6, roughly means that for any consecutive part of the response consisting of a large enough number r of tokens, the empirical entropy in that part is at least Ω(√r), the square root of its length. As discussed in Section 2.3.1 and verified empirically in Section 7, in natural language we actually expect the entropy to grow linearly with the length of the text, much faster than the required square root. Under this condition, the theorem guarantees that a response of length L will allow retrieving the first Θ(L) bits of the payload, which is (up to constants) the best possible.
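The empirical-entropy quantities of Definitions 2.3 and 2.4 can be computed directly from the conditional probabilities the model assigned to each generated token, since by the chain rule the response probability factors over tokens. A minimal sketch (the function names are ours, not from the paper):

```python
import math

def empirical_entropy(token_probs):
    """Empirical entropy of a response (Definition 2.3): -log2 of the
    probability of the full response, i.e. the sum of -log2 p over the
    conditional probability p the model assigned to each emitted token."""
    return sum(-math.log2(p) for p in token_probs)

def empirical_entropy_substring(token_probs, i, j):
    """Empirical entropy on substring [i, j] (Definition 2.4), with
    1-indexed inclusive indices as in the paper."""
    return empirical_entropy(token_probs[i - 1:j])
```

For instance, a response whose tokens each had conditional probability 1/2 carries exactly one bit of empirical entropy per token.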
3 Overview of the CGZ Watermark

This section is adapted in its entirety from CGZ (CGZ23), and contains a high-level description of their watermarking scheme. We first simplify the definition of a language model (Definition 2.1) by assuming that the token set is binary, T = {0, 1}. We may assume this without loss of generality due to a straightforward reduction that appears in Section 4.1 of CGZ. We will implicitly use this reduction throughout our work as well.

The intuitive idea behind the CGZ watermark is planting a watermark not by changing the model's distribution, but by correlating the randomness used by the model with the secret key. We begin by describing a simplified approach that works only for generating a single response, of length bounded by some parameter L. Let k = (k_1, k_2, . . . , k_L) be the secret key, chosen by drawing each k_i uniformly and independently from [0, 1]. To generate a response to a prompt prompt, we run the model as intended, yet use k_i to determine the random choice in the generation of the i-th response token. Let p_i denote the probability, according to the real model with the previously chosen tokens as prefix, of the i-th token being 1. The watermarked model outputs x_i = 1 if k_i ≤ p_i and x_i = 0 otherwise. Crucially, as k_i was chosen uniformly, the probability of x_i = 1 is exactly p_i, and hence the output distribution of the model is not affected at all. On the other hand, we now expect some correlation between x_i and k_i. For each response bit x_i, the detection algorithm may compute the following score, which depends on the key and on the response but not on the prompt, the model, or the distributions p_i:

    s(x_i, k_i) = ln(1/k_i)        if x_i = 1,
    s(x_i, k_i) = ln(1/(1 − k_i))  if x_i = 0.

Given a string x = (x_1, . . . , x_ℓ), the detection algorithm sums the scores of all bits, c(x) := Σ_{i=1}^{ℓ} s(x_i, k_i). The main observation is that the score is higher in responses generated by the above procedure than it is for unrelated strings.
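The per-token sampling rule and the detection score above can be sketched directly (binary tokens, with the key values k_i supplied explicitly):

```python
import math
import random

def sample_token(p_i, k_i):
    """CGZ watermarked sampling: emit 1 iff k_i <= p_i. Since k_i is
    uniform on [0, 1], Pr[x_i = 1] = p_i exactly, so the output
    distribution of the model is unchanged."""
    return 1 if k_i <= p_i else 0

def score(x_i, k_i):
    """Per-bit detection score s(x_i, k_i)."""
    return math.log(1.0 / k_i) if x_i == 1 else math.log(1.0 / (1.0 - k_i))
```

A quick simulation confirms the expectations computed next: for bits unrelated to the key the score averages 1 per bit, while for watermarked bits with p_i = 1/2 it averages 1 + ln 2.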
In non-watermarked text, the value of k_i is independent of the value of x_i. Therefore, s(x_i, k_i) is simply an exponential random variable with mean 1:

    E_{k_i}[s(x_i, k_i)] = ∫_0^1 ln(1/x) dx = 1,

and we have E_k[c(x) − |x|] = 0. For watermarked responses, on the other hand,

    E_{k_i}[s(x_i, k_i)] = ∫_0^{p_i} ln(1/x) dx + ∫_{p_i}^1 ln(1/(1 − x)) dx
                         = 1 − p_i ln p_i − (1 − p_i) ln(1 − p_i)
                         = 1 + ln(2) · H(p_i),

and the total expected score is E_k[c(x) − |x|] = ln 2 · H(Model(prompt)). We thus observed that, at least in expectation, the score of watermarked texts is larger than that of arbitrary texts, and that the difference between those quantities is roughly the entropy in the response. To make this observation algorithmically useful, considering expectations is not sufficient, as we need to set some score threshold for detection and analyze the probability of the score passing this threshold in each of the two cases. This analysis will not be repeated in this short overview and appears in CGZ.

To avoid having an upper bound L on the length of the response, and to reduce the length of the key k, we use a pseudorandom function F (PRF, as defined in Section 2). The key will now simply be a random string of length λ, and we implicitly set k_i := F_k(i). By the definition of a PRF, those values are indistinguishable from independently chosen random values. The final obstacle is remaining undetectable even when many queries are allowed. In the above sketch the choice of the key k fully determines all randomness, and thus, for example, the same prompt will always get the same response. To overcome this hurdle, we begin the generation of each response by using real randomness (and not the key) to sample tokens, while counting the empirical entropy of the response prefix generated so far. When the response prefix passes a threshold of λ bits of empirical entropy, we denote the response's prefix by r and start running the previous scheme with r as an additional input to the PRF.
That is, after we set the prefix r, we use the value F_k(r, i) to generate the i-th token. In the detection, we enumerate over all possible prefixes of the response as r. In CGZ, it is shown that because the prefix r is set only after enough entropy was used, it has negligible probability of ever repeating itself in different queries. Thus the inputs to the PRF calls are each unique, and the scheme remains undetectable even with many queries being made. The pseudo-codes for generation (Algorithm 1) and detection (Algorithm 2) of the watermark appear below. In CGZ, those algorithms are then generalized to also support the detection of the watermark from a substring of the response, and not only from the response in its entirety as sketched above.

4 High-Level Overview of Our Scheme

In this section we give an overview of our construction, with the rigorous details appearing in Sections 5 and 6. As in the CGZ overview in Section 3, we again assume without loss of generality that the token space is binary.

Algorithm 1: Watermarking algorithm Wat_k
  Data: A prompt (prompt) and a secret key k
  Result: Watermarked text x_1, . . . , x_L
  i ← 1; H ← 0;
  while done ∉ (x_1, . . . , x_{i−1}) do
      p_i ← Model(prompt, x_1, . . . , x_{i−1});
      if H < λ then
          // Collect more internal entropy
          Sample (x_i, p) ← (1, p_i) with probability p_i, otherwise (0, 1 − p_i);
          H ← H − log p;
          if H ≥ λ then r ← (x_1, . . . , x_i);
      else
          // Embed the watermark
          x_i ← 1[F_k(r, i) ≤ p_i];
      end
      i ← i + 1;
  end

Algorithm 2: Detector Detect_k
  Data: Text x_1, . . . , x_L and a secret key k
  Result: True or False
  for i ∈ [L] do
      r(i) ← (x_1, . . . , x_i);
      Define v_j^{(i)} := x_j · F_k(r(i), j) + (1 − x_j) · (1 − F_k(r(i), j)) for j ∈ [L];
      if Σ_{j=i+1}^{L} ln(1/v_j^{(i)}) > (L − i) + λ√(L − i) then return True;
  end
  return False;

As a first attempt, we notice that one may generalize any watermark into a steganography scheme by using several keys. Let k_1, . . .
, k_m be m different secret keys, and set up a watermarking scheme with each of them. To encode a message i ∈ [m] within a response, simply use the watermarking instance corresponding to k_i to generate said response. In the retrieval step, we use the detection algorithm with every key k_j to find which of them was used. While undetectability is trivially preserved, as we only use undetectable watermarks to generate responses, the scheme becomes infeasible as soon as m isn't very small. This is because both the rate of false positives and the detection time grow by a multiplicative factor of m. In particular, encoding ℓ bits of information will cause a multiplicative factor of 2^ℓ in the running time of the retrieval algorithm, and will also require that the false-positive rate of the watermarking scheme be much smaller than 2^{−ℓ}.

A reasonable next step, then, is breaking up the payload into smaller parts (say, characters or bits), and encoding each of those parts separately, in a similar fashion to the previous suggestion. One subtle issue to overcome while implementing this idea is that partitioning the response into those smaller chunks is not straightforward. This is because we know that successful watermarking requires high empirical entropy, and it is not known in advance which parts of the response will contain high empirical entropy. Moreover, the retriever needs to be able to use the same partition as the steganography algorithm. We solve this problem by implicitly defining the partition into chunks using the detection score itself: Let t be some score threshold to be decided later. Denote the first bit of the payload by b ∈ {0, 1}. We start planting the payload in the same way as the CGZ watermark is embedded, but with b as an additional input to the PRF. That is, the randomness used in the i-th token is F_k(i, b).
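The PRF values F_k(i, b) ∈ [0, 1] used here can be instantiated in many ways; one illustrative choice (ours, not mandated by the scheme) is HMAC-SHA256 with the digest interpreted as a binary fraction:

```python
import hashlib
import hmac

def prf_value(key: bytes, *inputs) -> float:
    """Pseudorandom value in [0, 1) derived from the secret key and the
    inputs (e.g. a token index i and a code symbol b), via HMAC-SHA256."""
    # Join the inputs with a separator so that distinct input tuples
    # cannot collide as byte strings.
    msg = b"\x1f".join(str(v).encode() for v in inputs)
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64
```

With such an instantiation, the i-th watermarked token is simply 1 if prf_value(k, i, b) ≤ p_i and 0 otherwise.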
The retriever computes the scores for both options of b over more and more tokens, until one of the two scores passes a certain threshold determined by t. For example, we can consider the minimal prefix x[: i] of the response for which (c(x[: i]) − i)/√i > t with respect to the score computed with either b = 0 or b = 1. Crucially, both Steg_k and Retr_k can compute those scores, so both can compute the index i at which one of the scores passes the threshold for the first time. Thus, both are able to agree on the end of the chunk encoding the first bit of the payload, and can iteratively restart this process with the next bit of the payload, and so on. Following the CGZ analysis, a threshold of roughly t ≈ √λ is required to make sure the correct bit is always the first one to pass the score threshold, as otherwise the false-positive rate of passing the score threshold is non-negligible. This means that whilst the decoding time and general error rate are now low enough, we still can't encode payloads that are too long. Even if every token contains constant entropy, only Θ(L/λ) bits of the payload can be encoded in a response of length L. Since λ is rather large, we desire to get this quantity down to the optimal Θ(L).

Finally, we use Error Correcting Codes (ECCs) to lower the threshold t and in turn increase the number of hidden payload bits. Intuitively, if instead of setting t ≈ √λ we set t to be a large constant, then false positives do happen, but only with a small constant probability. Thus, if instead of encoding the bits of the payload itself we encode the bits of an ECC applied to the payload, we should be able to recover the payload even if a constant fraction of the bits were transmitted incorrectly. This would reduce t from √λ to O(1), as desired.
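The chunk-boundary rule just described, stopping at the first index where the normalized score of some candidate symbol exceeds t, can be sketched as follows. This is a simplified illustration on precomputed per-token scores: it finds only the first chunk boundary, whereas the full scheme resets the scores and repeats.

```python
import math

def first_crossing(per_symbol_scores, t):
    """Return (i, symbol) for the first index i at which
    (c_sigma(x[:i]) - i) / sqrt(i) > t for some candidate symbol sigma,
    or None if no symbol ever crosses the threshold.
    per_symbol_scores maps each candidate symbol to its list of
    per-token scores s(x_j, F_k(j, sigma))."""
    totals = {sym: 0.0 for sym in per_symbol_scores}
    n = len(next(iter(per_symbol_scores.values())))
    for i in range(1, n + 1):
        for sym, scores in per_symbol_scores.items():
            totals[sym] += scores[i - 1]
            if (totals[sym] - i) / math.sqrt(i) > t:
                return i, sym
    return None
```

Both the encoder and the retriever run this same computation on the same scores, which is what lets them agree on where each chunk ends without any extra communication.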
A problem that remains is that applying a standard ECC "scrambles" the message, and thus if only a prefix of the code is successfully hidden then it would be impossible to decode anything of the payload. For this reason, in Section 5 we define and construct a Dynamic ECC: a code in which, for every i, the first i bits of the code can be used to retrieve the first Θ(i) bits of the message. Thus, a response of size L would hide the first Θ(L) bits of the payload. To significantly simplify the construction of the ECC, we use the fact that our construction provides noiseless feedback. As mentioned before, during the encoding process, Steg_k can also simulate Retr_k and thus knows which bit will be detected by the retriever at each step. Thus, the ECC can depend on whether or not each previously sent bit was transmitted correctly. In the construction of Section 5 we actually use a ternary code alphabet rather than a binary one, which doesn't affect the sketch of the construction much.

We finally note that to support multiple queries, we use the same idea of CGZ sketched in Section 3, and begin by observing enough real entropy to set a unique prefix r; the subsequent random choices are made using F_k(r, i, b). For the detector to find the correct prefix to use as r, we need to use the first λ bits of entropy after setting the prefix to place a normal watermark, which the detector uses to verify the choice of r. This means that to be undetectable with many queries, we start encoding the payload only after the first Θ(λ) tokens (or bits of entropy). As long as L = Ω(λ) this does not matter asymptotically.

5 Dynamic Error Correcting Code

Error Correcting Codes (ECCs) are the means to compensate for errors in the transmission of messages. An ECC encoding is a function Enc : Σ^k → Γ^n from messages of length k over alphabet Σ to codewords of length n over alphabet Γ. The rate of a code is R(Enc) := k/n, which signifies how efficient the code is.
The (relative) distance of a code is δ(Enc) := (1/n) · min_{z ≠ z′} ∆(Enc(z), Enc(z′)), which is twice the fraction of corrupt indices in a codeword that still allows decoding it to the original message. A code (or a family of codes) is considered good if both its rate and distance are constant, which means that the length of messages is only expanded by a constant factor, yet a constant fraction of errors can be corrected. ECCs are extensively studied, and it has long been known that good ECCs can be constructed, even when Σ = Γ = F_2 (Ham50; Gil52; Var57; Jus72; SS96).

An ECC with feedback is an ECC in which we transmit the symbols of the codeword Enc(x) one by one, and immediately receive feedback as to whether an error occurred in transmitting this symbol. The following symbols we transmit may adaptively depend on the feedback received so far. We say that the feedback is noiseless if the feedback received is always reliable. If the errors in transmission occur randomly (i.e., each transmitted symbol has the same probability of becoming corrupted), then it turns out that noiseless feedback does not improve the parameters of the best possible ECC. On the other hand, if the small fraction of corrupted symbols is chosen adversarially, then noiseless feedback does improve the best possible distance. Feedback also appears to allow simpler and more efficient encoding and decoding schemes (Ber64; Cov88).

We define a natural generalization of ECCs in which the length of the message (and hence also of the code) is not known in advance. We call these Dynamic ECCs. We require that for any k′ ≤ k, the first k′ symbols of the message can be decoded from the first O(k′) symbols of the codeword, even if a small fraction of those codeword symbols are corrupted. This definition is similar to, yet weaker than, the definition of Tree Codes (Sch93; Sch96).

Definition 5.1.
For alphabets Σ, Γ, a family {Enc_k}_{k ∈ ℕ} of functions Enc_k : Σ^k → Γ* is called a Dynamic ECC if for every k ∈ ℕ, the function Ēnc_k : Σ^k → Γ^{n_k} is an ECC, where

    Ēnc_k(x) := Enc_1(x[: 1]) || Enc_2(x[: 2]) || . . . || Enc_k(x),    n_k := max_{x ∈ Σ^k} len(Ēnc_k(x)).

In simple words, a Dynamic ECC is a family of standard ECCs in which the codeword corresponding to a prefix of a message is always a prefix of the codeword corresponding to the entire message.

Definition 5.2. The rate of a Dynamic ECC is R(Enc) := inf_{k ∈ ℕ} R(Ēnc_k) = inf_{k ∈ ℕ} k/n_k. Its distance is δ(Enc) := inf_{k ∈ ℕ} δ(Ēnc_k).

In a similar manner, we also define a Dynamic ECC with (noiseless) feedback to be a Dynamic ECC in which, after each symbol transmitted, we receive feedback as to which symbol was received. We next present a simple construction of a Dynamic ECC with feedback where |Σ| = 2, |Γ| = 3, and both the rate and distance are constant. This construction is rather straightforward and is similar to constructions used in slightly different settings (EGH15).

Theorem 5.3. For any ε ∈ (0, 1/2) there exists a Dynamic ECC with noiseless feedback with |Σ| = 2, |Γ| = 3, in which an ε fraction of errors can be corrected and n_k = ⌈k/(1 − 2ε)⌉. Both encoding and correction take linear time.

We think of the message alphabet as binary, Σ = {0, 1}, and to the codeword alphabet we add an additional symbol: Γ = {0, 1, ←}. We think of the symbol ← as a "backspace". Intuitively, we will always compute the message that is the decoding of what the receiver saw so far; if it is consistent with the input, we simply send the next bit of the input. If it is not consistent with the input, we send a "backspace" to indicate that the last symbol is incorrect. We do so iteratively. For a sequence y = (y_1, . . . , y_n) ∈ Γ*, we recursively define decode(y) to be decode(y[: (n−1)]) || (y_n) if y_n ∈ {0, 1}, and decode(y[: (n−1)])[: −1] if y_n = ←, where v[: −1] means removing the last symbol from the vector v (unless it is empty).
As the base case, we have decode(()) = (). For a message x = (x_1, . . . , x_k) ∈ Σ* and a previously transmitted (partial) codeword y = (y_1, . . . , y_n) ∈ Γ*, we define the longest agreeing prefix of x and the decoding of y as

    last(x, y) := max {i | x[: i] = decode(y)[: i]}.

We then define the length of the wrong suffix of the decoding of y as

    suff(x, y) := len(decode(y)) − last(x, y).

Given a message x and a partial codeword y, we define the next symbol to be sent as next(x, y) = ← if suff(x, y) > 0, and as next(x, y) = x[last(x, y) + 1] otherwise. Our protocol is thus simple: if x is the message and y is the codeword received by the receiver so far (which we know using the noiseless feedback), then the next symbol we send is next(x, y).

Lemma 5.4. Let x ∈ Σ* be a message and y ∈ Γ^n a partial codeword received by the receiver according to the described protocol, and assume that at most εn of the symbols in y were received differently than sent by the protocol. Then, last(x, y) ≥ (1 − 2ε)n.

Proof. For any partial received codeword y′ we define the potential function Φ(x, y′) := last(x, y′) − suff(x, y′). We first show that if the next token is received correctly then the potential increases by one, that is, Φ(x, y′ || next(x, y′)) = Φ(x, y′) + 1. We show this by considering two cases. If suff(x, y′) = 0, then decode(y′) = x[: last(x, y′)] and next(x, y′) = x[last(x, y′) + 1], thus decode(y′ || next(x, y′)) = x[: last(x, y′) + 1]. Otherwise, suff(x, y′) > 0 and next(x, y′) = ←, and hence suff(x, y′ || next(x, y′)) = suff(x, y′) − 1. Next, we show that if the next token is received incorrectly then the potential decreases by at most one, that is, Φ(x, y′ || s) ≥ Φ(x, y′) − 1 whenever s ≠ next(x, y′). We again consider two cases. If suff(x, y′) > 0, then we have s ∈ {0, 1} and in turn suff(x, y′ || s) = suff(x, y′) + 1.
Otherwise $\mathrm{suff}(x, y') = 0$, and we either have $\mathrm{suff}(x, y' \| s) = 1$ if $s \ne \leftarrow$, or have $\mathrm{last}(x, y' \| s) = \max(0, \mathrm{last}(x, y') - 1)$ if $s = \leftarrow$. We conclude that if $e$ out of the $n$ symbols in $y$ were received incorrectly, then
$$\Phi(x, y) \ge 1 \cdot (n - e) - 1 \cdot e = n - 2e \ge n - 2\varepsilon n.$$
On the other hand, as $\mathrm{suff}(x, y) \ge 0$ we also have $\mathrm{last}(x, y) \ge \Phi(x, y)$.

Proof of Theorem 5.3. Denote $n_k = \lceil \frac{k}{1 - 2\varepsilon} \rceil$. Let $x$ be a message and $y$ the first $n_k$ tokens received by running the protocol, and assume that at most $\varepsilon n_k$ of those tokens were received incorrectly. By Lemma 5.4, we have $\mathrm{last}(x, y) \ge (1 - 2\varepsilon)n_k \ge k$. Hence, $\mathrm{decode}(y)[{:}k]$ correctly retrieves $x$.

6 Our Scheme

As in the overview of Section 3, we begin by analysing a scheme in which only a single query is undetectable. Then, in Section 6.1, we apply the same idea as CGZ to go from undetectability for one query to complete undetectability. An intuitive explanation of our scheme is given in Section 4.

Algorithm 3: One-query steganography algorithm Steg_k
Data: A prompt (prompt), a payload (payload), and a secret key k
Result: Response x_1, ..., x_L
  i ← 1; code ← (); score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
  next ← next(payload, code);
  while done ∉ (x_1, ..., x_{i−1}) do
    p_i ← Model(prompt, x_1, ..., x_{i−1});
    x_i ← 1[F_k(i, next) ≤ p_i];
    score_len ← score_len + 1;
    for σ ∈ {0, 1, ←} do
      score_σ ← score_σ + s(x_i, F_k(i, σ));
      if (score_σ − score_len)/√score_len > t then
        code ← code || (σ);
        score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
        next ← next(payload, code);
        break;
      end
    end
    i ← i + 1;
  end

Algorithm 3 naturally follows the sketch of Section 4, while using the ECC (and notation) of Section 5. Let us informally describe the algorithm in words. The algorithm depends on a score threshold t, to be set later. We apply the Dynamic ECC to the message payload, and denote by next the next symbol we are supposed to transmit according to the ECC.
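The Dynamic ECC with feedback of Section 5, which Algorithm 3 uses through its next(payload, code) subroutine, can be sketched in a few lines. The following is our own illustration (the names `decode` and `next_symbol`, the `BS` stand-in for the backspace symbol, and the corruption pattern are ours, not the paper's):

```python
BS = "<-"  # stand-in for the backspace symbol of the code alphabet {0, 1, <-}

def decode(y):
    """Replay a received codeword, treating BS as a backspace (Section 5)."""
    out = []
    for s in y:
        if s == BS:
            if out:
                out.pop()
        else:
            out.append(s)
    return out

def next_symbol(x, y):
    """The sender's next symbol given message x and the feedback y."""
    d = decode(y)
    # last(x, y): length of the longest prefix on which x and decode(y) agree.
    last = 0
    while last < min(len(x), len(d)) and x[last] == d[last]:
        last += 1
    if len(d) > last:   # suff(x, y) > 0: erase the wrong suffix
        return BS
    return x[last]      # otherwise transmit the next message bit

# Simulate the protocol over a channel that corrupts two transmissions.
x = [0, 1, 1, 1, 1]
y = []
corrupt = {1, 4}                      # steps where the channel flips the symbol
while decode(y) != x:
    s = next_symbol(x, y)
    if len(y) in corrupt:             # corruption: wrong bit, or a dropped backspace
        s = 1 - s if s != BS else 0
    y.append(s)                       # noiseless feedback: the sender also sees y

assert decode(y) == x                 # the message survives both corruptions
```

Each corrupted symbol costs the sender at most two extra transmissions (one backspace plus one resend), which is exactly the potential argument in the proof of Lemma 5.4.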
We start generating response tokens following the CGZ scheme, comparing the actual distribution p_i to the PRF value F_k(i, next). As the PRF value depends on next, this creates a correlation between the randomness we use and the symbol we are attempting to transmit. After each generated response token, we also emulate the retriever: ignoring our knowledge of next, for every code symbol σ ∈ {0, 1, ←} we compute the score (as defined in Section 3) of the response tokens seen so far. We wait until the score of some symbol σ, normalized with respect to the number of tokens involved in the score computation, passes the threshold t; that is, until (score_σ − score_len)/√score_len > t. When this happens, we view σ as the symbol received by the ECC receiver. While σ = next is supposed to be the most likely outcome, the received symbol could be incorrect. Whenever we add a symbol to the code, we restart the computation of the scores and start transmitting the next code symbol. Algorithm 4 shows the retrieval process, which is identical to what is emulated within the steganography algorithm. Note that both algorithms run in linear time.

Algorithm 4: One-query retriever Retr_k
Data: Response x_1, ..., x_L, and a secret key k
Result: Retrieved payload payload
  code ← (); score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
  for i = 1, 2, ..., L do
    score_len ← score_len + 1;
    for σ ∈ {0, 1, ←} do
      score_σ ← score_σ + s(x_i, F_k(i, σ));
      if (score_σ − score_len)/√score_len > t then
        code ← code || (σ);
        score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
        break;
      end
    end
  end
  return decode(code);

We first observe that the distribution of the response is indistinguishable from the distribution of the original model (when a single query is made).

Lemma 6.1. For any Model and any prompt, payload, t, the distribution of Steg_k(prompt, payload) over a random choice of the key k is indistinguishable from the distribution of Model(prompt).

Proof. The proof is rather straightforward and follows CGZ and its overview in Section 3.
If we replace each use of $F_k(i, \mathrm{next})$ (to determine $x_i$) with a fresh uniformly chosen value in $[0, 1]$, then the distribution of Model is completely unaffected. Using a PRF instead of "fresh" random values is indistinguishable as long as we do not use the PRF on the same input more than once. As each input to the PRF contains the (unique) index $i$, we never call it twice on the same input.

We next show that $\mathrm{Retr}_k(\mathrm{Steg}_k(\mathrm{prompt}, \mathrm{payload}))$ successfully retrieves payload. As discussed in Section 2.4, doing so requires assumptions on the empirical entropy of the generated response. We prove that a relatively weak assumption (which in particular covers the case of natural languages) is sufficient, yet it is very likely that the proof can be adapted to other conditions as well, as the algorithm itself is quite flexible. We also note that in the proof we optimize for simplicity rather than for constants; the empirical evaluation in Section 7 implies that the actual constants are far better than those in the following proof.

Definition 6.2. Let $h = (h_1, \ldots, h_L)$ be a sequence of empirical entropies (i.e., non-negative numbers). We say that $h$ is $r_0$-saturated if for every consecutive subsequence of $h$ of length $r \ge r_0$, the sum of entropies is at least $10\sqrt{r} \ln r$. That is, for every $r \ge r_0$ and $1 \le i \le L - (r - 1)$, we have $\sum_{j=i}^{i+r-1} h_j \ge 10\sqrt{r} \ln r$.

For example, if the empirical entropy in each consecutive block of $b$ tokens is at least some constant $\alpha > 0$, then the empirical entropies are $O(b^2/\alpha^2)$-saturated. This is because a consecutive block of $bk$ tokens contains at least $\alpha k$ entropy, which is larger than $10\sqrt{bk} \ln(bk)$ when $k = \Omega(b/\alpha^2)$. Hence, natural language, which has this property (as discussed in Section 2.3.1), is $O(1)$-saturated.
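The saturation condition is mechanical to check numerically. Below is a brute-force sketch of our own (the function name and the toy entropy profiles are ours), using the $10\sqrt{r}\ln r$ window bound of Definition 6.2:

```python
import math

def is_saturated(h, r0):
    """Definition 6.2: every consecutive window of length r >= r0 must
    carry total empirical entropy at least 10 * sqrt(r) * ln(r)."""
    L = len(h)
    pre = [0.0]
    for v in h:                      # prefix sums for O(1) window totals
        pre.append(pre[-1] + v)
    for r in range(r0, L + 1):
        need = 10.0 * math.sqrt(r) * math.log(r)
        if min(pre[i + r] - pre[i] for i in range(L - r + 1)) < need:
            return False
    return True

# Entropy accruing at a constant rate (here 2 per token, in nats) is
# saturated once the windows are long enough, matching the example above.
assert is_saturated([2.0] * 3000, 1400)
# A long zero-entropy stretch (e.g., a forced verbatim quotation) breaks it.
assert not is_saturated([2.0] * 1500 + [0.0] * 1500, 1400)
```

The positive case works because $2r \ge 10\sqrt{r}\ln r$ once $\sqrt{r} \ge 5\ln r$, which holds for all $r \ge 1400$.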
In fact, the entropy of natural language grows linearly with the length of the text, while our condition merely asks that it grow faster than the square root of the length. We verify these claims empirically in Section 7. Finally, we prove that if the empirical entropy of a response is $O(1)$-saturated and the response is of length $L$, then in expectation at least the first $\Theta(L)$ bits of the payload are retrieved correctly.

Theorem 6.3. Fix a model Model and an integer $r_0$; there exists a choice of threshold $t$ for which the following holds. Let prompt, payload be strings. Conditioned on the empirical entropy of a response $y$ generated by $\mathrm{Steg}_k(\mathrm{prompt}, \mathrm{payload})$ being $r_0$-saturated, the expected length of the longest prefix on which payload and $\mathrm{Retr}_k(y)$ agree is at least $\Theta(\mathrm{len}(y)/r_0)$.

The proof of Theorem 6.3 is deferred to Appendix A.

6.1 Complete Undetectability

To move from undetectability of a single response to the general undetectability defined in Definition 2.6, we simply repeat the "trick" of CGZ as overviewed in Section 3. The complete details are deferred to Appendix B.

7 Empirical Evaluation

We implemented¹ Algorithms 3 and 4 from Section 6, which provide undetectability for a single query. We did so for simplicity, and as we only aim to evaluate the new contributions of this paper. In Figure 2, we estimate the number of message bits we can hide in a response of a certain length. For each response length, we ran our scheme 100 times using the LLM GPT-2 (RWC+19) on a randomly chosen prompt from the list of example prompts provided by OpenAI on their GPT-2 webpage.² We ran it with threshold parameter t = 2, which we did not optimize. As expected, it is evident that the number of hidden bits grows linearly with the length of the response. In Figure 3, we detail the decoding process of the example shown in Figure 1, generated using the 7B-parameter Instruct version of the Llama 2 model developed by Meta (TMS+23).
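A toy re-implementation conveys how simple the single-query scheme is. The sketch below is ours, not the paper's released code: the "model" is a fair coin over binary tokens, the PRF is replaced by a SHA-256 hash, and the token score is the CGZ-style $s(x, u) = -\ln u$ if $x = 1$ and $-\ln(1 - u)$ otherwise. Because the embedder emulates the retriever, the retriever recovers exactly the code sequence the embedder committed to.

```python
import hashlib
import math

BS = "<-"  # backspace symbol of the ternary code alphabet (our stand-in)

def F(key, i, sigma):
    """Hash-based stand-in for the PRF F_k(i, sigma), with output in (0, 1)."""
    h = hashlib.sha256(f"{key}|{i}|{sigma}".encode()).digest()
    m = int.from_bytes(h[:6], "big")      # 48 pseudo-random bits
    return (m + 0.5) / 2**48              # strictly inside (0, 1)

def score(x, u):
    """CGZ-style token score: Exp(1)-distributed when x and u are independent."""
    return -math.log(u) if x == 1 else -math.log(1.0 - u)

def decode(y):
    out = []
    for s in y:
        if s == BS:
            if out:
                out.pop()
        else:
            out.append(s)
    return out

def next_symbol(x, y):
    d = decode(y)
    last = 0
    while last < min(len(x), len(d)) and x[last] == d[last]:
        last += 1
    if len(d) > last:
        return BS
    return x[last] if last < len(x) else 0   # pad once the payload is exhausted

def embed(key, payload, n_tokens, t=2.0):
    """Algorithm 3 on a toy model emitting Pr[x_i = 1] = 1/2 for every i."""
    resp, code = [], []
    scores, slen = {0: 0.0, 1: 0.0, BS: 0.0}, 0
    nxt = next_symbol(payload, code)
    for i in range(n_tokens):
        p = 0.5
        x = 1 if F(key, i, nxt) <= p else 0   # token correlated with symbol nxt
        resp.append(x)
        slen += 1
        for sigma in (0, 1, BS):              # emulate the retriever's scores
            scores[sigma] += score(x, F(key, i, sigma))
        hit = [s for s in (0, 1, BS) if (scores[s] - slen) / math.sqrt(slen) > t]
        if hit:
            code.append(hit[0])
            scores, slen = {0: 0.0, 1: 0.0, BS: 0.0}, 0
            nxt = next_symbol(payload, code)
    return resp, code

def retrieve(key, resp, t=2.0):
    """Algorithm 4: the same scoring loop, run without knowing the payload."""
    code = []
    scores, slen = {0: 0.0, 1: 0.0, BS: 0.0}, 0
    for i, x in enumerate(resp):
        slen += 1
        for sigma in (0, 1, BS):
            scores[sigma] += score(x, F(key, i, sigma))
        hit = [s for s in (0, 1, BS) if (scores[s] - slen) / math.sqrt(slen) > t]
        if hit:
            code.append(hit[0])
            scores, slen = {0: 0.0, 1: 0.0, BS: 0.0}, 0
    return code

resp, sent = embed("secret-key", [1, 0, 1, 1], 300)
assert retrieve("secret-key", resp) == sent   # retrieval mirrors the embedder
```

With enough response tokens, decode(sent) typically begins with the payload bits; symbols received incorrectly are repaired by the backspace mechanism of Section 5.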
We did not optimize for constants. The takeaways from this section are twofold. First, we show that the implementation of our scheme is simple and its run-time overhead is minimal. Second, we empirically support the claim from Section 2.3.1 that the empirical entropy of LLMs grows linearly with the length of the response.

8 Limitations and Open Problems

The main issue we have not discussed so far is robustness to editing: can the payload be recovered even if the model's response is somehow edited? We mainly leave robustness to future work, yet we next list a couple of observations regarding it. In CGZ (CGZ23), the watermarking scheme is adapted to "restart" once in a while, so that the watermark is detectable from any long enough consecutive substring of the response (and not only from the entire response). The same modification can easily be applied to our scheme as well, making the payload retrievable from any long enough substring of the model's response. At the other end of the spectrum, it is known that under certain conditions powerful users can edit any watermark out of a model's response (ZEF+23; CGZ23). Intuitively, a complete rephrasing of the response, for example, should remove any watermark. Previous empirical works on watermarks and steganography, which do not guarantee undetectability, showcase some robustness to certain types of edits.

¹ Code available at: https://github.com/OrZamir/steg
² https://openai.com/research/better-language-models

[Figure 2 shows a plot of hidden payload length (in bits) against response length (in tokens).]
Figure 2: Plot of the number of successfully hidden payload bits, by length of response. Experiments ran on GPT-2 with a random choice of an example prompt taken from the OpenAI website. The experiment was performed 100 times for each response length.

Prompt: Write an email asking my professor Prof. Hannity to not make the final exam in Machine Learning 101 too difficult.
Response: Dear Professor Hannity, I hope this email finds you well. I am writing to respectfully request that you consider the level of difficulty for the final exam in Machine Learning 101. While I am confident in my understanding of the course materials and have put in a significant [0] amount of effort throughout the semester, I do [1] have [1] concerns [0] about the potential [1] difficulty of the final exam. As you may recall, several students in my previous sections of Machine [1] Learning have found the final exam to be very challenging [1], leading to frustration [1] and disappointment [0]. While I understand that the course is intended to push students to their limits and beyond, I believe that some relaxation of the latter [1] may be in order. I would kindly ask that you consider reducing the difficulty of the final exam or offering some additional supports or [1] resources to help students prepare. I [0] believe that this could [1] enhance [0]
Error Correcting Code: 01101 111 0 11010
Hidden Payload: "OZ", encoded as 01111 11010.
Figure 3: A breakdown of the decoding algorithm for the example in Figure 1.

References

[AF21] Sahar Abdelnabi and Mario Fritz. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. In 2021 IEEE Symposium on Security and Privacy (SP), pages 121–140. IEEE, 2021.
[Ber64] Elwyn R. Berlekamp. Block coding with noiseless feedback. PhD thesis, Massachusetts Institute of Technology, 1964.
[CBZ+23] Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. On the possibilities of AI-generated text detection. arXiv preprint arXiv:2304.04736, 2023.
[CGZ23] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194, 2023.
[CLA17] Ruina Chen, Haitao Liu, and Gabriel Altmann. Entropy in different text types. Digital Scholarship in the Humanities, 32(3):528–542, 2017.
[Cov88] Thomas M. Cover. The role of feedback in communication. In Performance Limits in Communication Theory and Practice, pages 225–235. Springer, 1988.
[DIRR09] Nenad Dedić, Gene Itkis, Leonid Reyzin, and Scott Russell. Upper and lower bounds on black-box steganography. Journal of Cryptology, 22:365–394, 2009.
[dWSK+22] Christian Schroeder de Witt, Samuel Sokota, J. Zico Kolter, Jakob Foerster, and Martin Strohmeier. Perfectly secure steganography using minimum entropy coupling. arXiv preprint arXiv:2210.14889, 2022.
[EGH15] Klim Efremenko, Ran Gelles, and Bernhard Haeupler. Maximal noise in interactive communication over erasure channels and channels with feedback. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 11–20, 2015.
[FCT+23] Pierre Fernandez, Antoine Chaffin, Karim Tit, Vivien Chappelier, and Teddy Furon. Three bricks to consolidate watermarks for large language models. arXiv preprint arXiv:2308.00113, 2023.
[GC02] Dmitriy Genzel and Eugene Charniak. Entropy rate constancy in text. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 199–206, 2002.
[GGM86] Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct random functions. Journal of the ACM (JACM), 33(4):792–807, 1986.
[Gil52] Edgar N. Gilbert. A comparison of signalling alphabets. The Bell System Technical Journal, 31(3):504–522, 1952.
[Ham50] Richard W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160, 1950.
[HILL99] Johan Håstad, Russell Impagliazzo, Leonid A. Levin, and Michael Luby. A pseudorandom generator from any one-way function. SIAM Journal on Computing, 28(4):1364–1396, 1999.
[HvAL08] Nicholas Hopper, Luis von Ahn, and John Langford. Provably secure steganography.
IEEE Transactions on Computers, 58(5):662–676, 2008.
[JAML20] Ganesh Jawahar, Muhammad Abdul-Mageed, and Laks V. S. Lakshmanan. Automatic detection of machine generated text: A critical survey. arXiv preprint arXiv:2011.01314, 2020.
[Jus72] Jørn Justesen. Class of constructive asymptotically good algebraic codes. IEEE Transactions on Information Theory, 18(5):652–656, 1972.
[KGW+23a] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023.
[KGW+23b] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. CoRR, abs/2301.10226, 2023.
[KJGR21] Gabriel Kaptchuk, Tushar M. Jois, Matthew Green, and Aviel D. Rubin. Meteor: Cryptographically secure steganography for realistic distributions. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 1529–1548, 2021.
[KSK+23] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. arXiv preprint arXiv:2303.13408, 2023.
[MZ23] Travis Munyer and Xin Zhong. DeepTextMark: Deep learning based text watermarking for detection of large language model generated text. arXiv preprint arXiv:2305.05773, 2023.
[QZL+23] Jipeng Qiang, Shiyu Zhu, Yun Li, Yi Zhu, Yunhao Yuan, and Xindong Wu. Natural language watermarking via paraphraser-based lexical substitution. Artificial Intelligence, page 103859, 2023.
[RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[Sch93] Leonard J. Schulman. Deterministic coding for interactive communication.
In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 747–756, 1993.
[Sch96] Leonard J. Schulman. Coding for interactive communication. IEEE Transactions on Information Theory, 42(6):1745–1756, 1996.
[SKB+23] Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can AI-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
[SL22] Yaqian Shi and Lei Lei. Lexical richness and text length: An entropy-based perspective. Journal of Quantitative Linguistics, 29(1):62–79, 2022.
[SS96] Michael Sipser and Daniel A. Spielman. Expander codes. IEEE Transactions on Information Theory, 42(6):1710–1722, 1996.
[TMS+23] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[Var57] Rom Rubenovich Varshamov. Estimate of the number of signals in error correcting codes. Doklady Akad. Nauk SSSR, 117:739–741, 1957.
[WYC+23] Lean Wang, Wenkai Yang, Deli Chen, Hao Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. Towards codable text watermarking for large language models. arXiv preprint arXiv:2307.15992, 2023.
[YAJK23] KiYoon Yoo, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. Robust natural language watermarking through invariant features. arXiv preprint arXiv:2305.01904, 2023.
[YAK23] KiYoon Yoo, Wonhyuk Ahn, and Nojun Kwak. Advancing beyond identification: Multi-bit watermark for language models. arXiv preprint arXiv:2308.00221, 2023.
[ZEF+23] Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, and Boaz Barak. Watermarks in the sand: Impossibility of strong watermarking for generative models. arXiv preprint arXiv:2311.04378, 2023.

A Proof of Theorem 6.3

We need to analyze two quantities.
First, when $\mathrm{Steg}_k$ adds a symbol to the code, what is the probability that it is incorrect (i.e., different from the intended symbol next)? Second, how many symbols do we manage to add to the code? Equivalently, how many response tokens do we usually see before adding a symbol to the code?

To answer both questions, we analyze the evolution of the correct score (i.e., the one corresponding to the symbol next) and the incorrect scores, from the time we start computing them until one of them passes the threshold. While computing the score with respect to an incorrect symbol, every token's score is simply an exponential random variable with mean 1. Denote by $s_1, s_2, \ldots$ the scores of the individual tokens (i.e., independent $\mathrm{Exp}(1)$ random variables), and by $S_i := \sum_{j=1}^{i} s_j$ their cumulative sums. By Lemma 5 in CGZ we have that for any $\ell, \tau > 0$,
$$\Pr\left[S_\ell \ge \ell + \tau\sqrt{\ell}\right] \le (4/5)^{\tau}.$$
Let $b \ge r_0$ be an integer to be chosen later. By a union bound, the probability of an incorrect score passing the threshold $t$ at any time within the first $b$ steps is bounded by $b \cdot (4/5)^{t}$.

For the score with respect to the correct symbol, the first $b$ tokens contain at least $10\sqrt{b}\ln b$ empirical entropy, as $b \ge r_0$ and as we conditioned on the response being $r_0$-saturated. In CGZ it is shown that $S_\ell$ is still distributed as the sum of $\ell$ independent $\mathrm{Exp}(1)$ variables, but now additively shifted by the empirical entropy of those $\ell$ tokens. In particular, by Theorem 7 and Lemma 5 in CGZ, it follows that for any $\tau > 0$ we have
$$\Pr\left[S_b < b + 10\sqrt{b}\ln b - \sqrt{\tau b}\right] \le e^{-\tau/2},$$
or equivalently, $\Pr\left[(S_b - b)/\sqrt{b} < 10\ln b - \sqrt{\tau}\right] \le e^{-\tau/2}$.

We choose $t = 5\ln b$ and $\tau = (5\ln b)^2$ and deduce from the above statements that:

I) The probability that an incorrect (normalized) score passes the threshold $t$ within the first $b$ steps is at most $b \cdot (4/5)^{5\ln b} = e^{(1 - 5\ln(5/4))\ln b} < e^{-(\ln b)/10}$.

II) The probability that the correct (normalized) score passes the threshold $t$ within the first $b$ steps, which is at least the probability that it is above the threshold at the end of the $b$-th step, is at least $1 - e^{-\tau/2} = 1 - e^{-25(\ln b)^2/2}$.
By combining (I) and (II) we conclude that the probability that the correct score passes the threshold within the first $b$ steps while the two incorrect scores do not is at least $1 - e^{-25(\ln b)^2/2} - 2e^{-(\ln b)/10}$; denote this quantity by $1 - \varepsilon(b)$. As $\lim_{b \to \infty} \varepsilon(b) = 0$, there exists a constant $b_0$ such that for every $b \ge b_0$ it holds that $\varepsilon(b) \le 1/3$. We set $b = \max(r_0, b_0)$; note that $b_0$ is a universal constant, independent of $r_0$ and the other parameters. We conclude that with probability at least $2/3$ the correct symbol is transmitted within the first $b$ tokens; in particular, each symbol is transmitted correctly with probability at least $2/3$. As the probability of incorrectly transmitting a symbol is at most $1/3 < 1/2$, we can use Theorem 5.3 to conclude that if $n$ code symbols are transmitted overall, then the first $\Theta(n)$ bits of the payload are retrieved correctly.

It thus only remains to analyze the number of transmitted code symbols. We again use the inequality from before, which holds for any $b' \ge r_0$ and $\tau > 0$:
$$\Pr\left[\frac{S_{b'} - b'}{\sqrt{b'}} < 10\ln b' - \sqrt{\tau}\right] \le e^{-\tau/2}.$$
By choosing $\tau = (5\ln b')^2$ we observe that for any $b' > b$ we have $10\ln b' - \sqrt{\tau} = 5\ln b' > 5\ln b = t$, and thus
$$\Pr\left[\frac{S_{b'} - b'}{\sqrt{b'}} < t\right] \le e^{-25(\ln b')^2/2}.$$
Denote by $\ell$ the random variable that is the first step at which the score (with respect to the correct symbol) passes the threshold $t$. By the above inequality we have
$$\mathbb{E}[\ell] = \sum_{i=1}^{\infty} \Pr[\ell \ge i] = \sum_{i=1}^{b} \Pr[\ell \ge i] + \sum_{i=b+1}^{\infty} \Pr[\ell \ge i] \le b + O(1),$$
as the tail probabilities are summable. As the correct symbol is expected to pass the score threshold after $b + O(1)$ response tokens, in particular a symbol is expected to be transmitted in the protocol at least once every $b + O(1)$ response tokens.

B Complete Undetectability

Our revised algorithm partitions the generation of response tokens into three parts:

1. We use real randomness to generate tokens, counting the amount of empirical entropy used in the process, until enough (at least λ) empirical entropy has been seen; we call the prefix of tokens generated in this step r.

2.
We begin generating tokens using the PRF, with both r and the index as inputs. We do not yet submit code symbols, and wait until the score passes λ. This step leaves a signal for the retriever as to which prefix r was chosen by the generation algorithm.

3. We now run the one-query scheme, adding r as an additional input to the PRF calls.

We remark that the second part can be avoided, as the correct r can be detected implicitly during the third part, but we include it for simplicity of presentation. With this structure, the proofs are rather straightforward. The pseudo-code for Steg_k appears in Algorithm 5, and for Retr_k in Algorithm 6.

Theorem B.1. Algorithm 5 is undetectable as defined in Definition 2.6.

Proof. This follows from Theorem 11 in CGZ, as the inputs to the PRF in the generation process are all unique within a single query, and all contain r, which repeats across different queries only with negligible probability.

Theorem 6.3 also remains correct as-is, except that we "lose" O(λ) empirical entropy to Parts 1 and 2. As long as L = Ω(λ), the same theorem statement still holds. While the running time of Steg_k is unaffected, the running time of Retr_k is now quadratic instead of linear. This can be avoided by truncating the verification of r, but we do not do so for the sake of simplicity.

Algorithm 5: Steganography algorithm Steg_k
Data: A prompt (prompt), a payload (payload), and a secret key k
Result: Response x_1, ..., x_L
  i ← 1; H ← 0; r_score ← 0; r_score_len ← 0;
  code ← (); score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
  next ← next(payload, code);
  while done ∉ (x_1, ..., x_{i−1}) do
    p_i ← Model(prompt, x_1, ..., x_{i−1});
    if H < λ then  // Part 1
      sample (x_i, p) ← (1, p_i) with probability p_i, and (x_i, p) ← (0, 1 − p_i) otherwise;
      H ← H − log p;
      if H ≥ λ then
        r ← (x_1, ..., x_i);
      end
    else if (r_score − r_score_len) ≤ λ√r_score_len then  // Part 2
      x_i ← 1[F_k(r, i, None) ≤ p_i];
      r_score ← r_score + s(x_i, F_k(r, i, None));
      r_score_len ← r_score_len + 1;
    else  // Part 3
      x_i ← 1[F_k(r, i, next) ≤ p_i];
      score_len ← score_len + 1;
      for σ ∈ {0, 1, ←} do
        score_σ ← score_σ + s(x_i, F_k(r, i, σ));
        if (score_σ − score_len)/√score_len > t then
          code ← code || (σ);
          score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
          next ← next(payload, code);
          break;
        end
      end
    end
    i ← i + 1;
  end

Algorithm 6: Retriever algorithm Retr_k
Data: Response x_1, ..., x_L, and a secret key k
Result: Retrieved payload payload
  code ← (); score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
  for j = 1, 2, ..., L do
    r ← (x_1, ..., x_j); r_score ← 0; r_score_len ← 0;
    for i = j + 1, j + 2, ..., L do
      if (r_score − r_score_len) ≤ λ√r_score_len then  // Verify r
        r_score ← r_score + s(x_i, F_k(r, i, None));
        r_score_len ← r_score_len + 1;
      else  // Correct r found
        score_len ← score_len + 1;
        for σ ∈ {0, 1, ←} do
          score_σ ← score_σ + s(x_i, F_k(r, i, σ));
          if (score_σ − score_len)/√score_len > t then
            code ← code || (σ);
            score_σ ← 0 for σ ∈ {0, 1, ←}; score_len ← 0;
            break;
          end
        end
      end
    end
    if code ≠ () then
      return decode(code);
    end
  end
  return False;
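Algorithms 5 and 6 treat the keyed function $F_k$ as a black box with outputs in $[0, 1)$. In practice such a PRF can be instantiated from HMAC-SHA256; below is a minimal sketch of our own (the function name and the input encoding are our choices, not prescribed by the paper):

```python
import hashlib
import hmac

def prf(key: bytes, *inputs) -> float:
    """F_k(...): HMAC-SHA256 over an encoding of the inputs, with the
    leading 48 bits of the digest normalized into [0, 1)."""
    msg = b"\x00".join(str(x).encode() for x in inputs)
    d = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(d[:6], "big") / 2**48   # 48 bits is plenty for sampling

u = prf(b"secret-key", "r-prefix", 17, "0")
assert 0.0 <= u < 1.0
assert u == prf(b"secret-key", "r-prefix", 17, "0")   # deterministic in (k, inputs)
assert u != prf(b"other-key", "r-prefix", 17, "0")    # changes with the key
```

Any PRF secure in the standard sense suffices for the undetectability argument; HMAC with a uniformly random key is the common practical choice.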