# Robust Multi-bit Text Watermark with LLM-based Paraphrasers

Xiaojun Xu 1, Jinghan Jia 2 *, Yuanshun Yao 1, Yang Liu 3 *, Hang Li 1

## Abstract

We propose an imperceptible multi-bit text watermark embedded by paraphrasing with LLMs. We fine-tune a pair of LLM paraphrasers that are designed to behave differently, so that the difference in their paraphrasing, reflected in the text semantics, can be identified by a trained decoder. To embed our multi-bit watermark, we use the two paraphrasers alternately to encode a pre-defined binary code at the sentence level. We then use a text classifier as the decoder to decode each bit of the watermark. Through extensive experiments, we show that our watermarks can achieve over 99.99% detection AUC with small (1.1B) text paraphrasers while preserving the semantic information of the original sentence. More importantly, our pipeline is robust under word-substitution and sentence-paraphrasing perturbations and generalizes well to out-of-distribution data. We also show the stealthiness of our watermark with LLM-based evaluation. We open-source the code: https://github.com/xiaojunxu/multi-bit-text-watermark.

## 1. Introduction

Text watermarking aims to encode an imperceptible signal into a piece of text so that people are able to decode the signal from the text (Liu et al., 2024). It is useful in various applications such as copyright protection and hidden-message communication. With the development of Large Language Models (LLMs), there is also a growing need to track misinformation spread by LLMs using text watermarks injected into model outputs (Kirchenbauer et al., 2023).

*Work done while at ByteDance. 1ByteDance Research, 2Michigan State University, 3University of California, Santa Cruz. Correspondence to: Xiaojun Xu. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).
We study the methodology of injecting a multi-bit watermark message into a piece of text by paraphrasing. The watermarked text keeps the semantic meaning of the original text after paraphrasing, and a paired decoder is used to decode the message from the watermarked text. Unlike lexical watermarks, which inject watermarks via synonym substitution, the paraphrasing-based method has a larger action space for watermark injection and is more robust under perturbations. However, designing a paraphrasing-based watermark is also challenging, as it is unclear how to inject an imperceptible yet detectable watermark signal while preserving text quality and the original semantic meaning.

In this work, we propose a paraphrasing-based watermark by simultaneously fine-tuning an LLM-based paraphraser as the encoder and training an LM-based text classifier as the decoder. The pipeline is shown in Figure 1. In the encoding stage, we paraphrase the input text conditioned on a user-chosen key to generate the watermarked text. In the decoding stage, we extract the code from the input text with the decoder and compare it with the previously chosen key to check whether the text was watermarked by the user.

The key to producing a high-quality text watermark in our method is to train a good encoder-decoder pair. The decoder can be trained with a standard classification loss so that it better distinguishes bit-0 texts from bit-1 texts. The encoder, in turn, should be fine-tuned so that its generated text is better classified by the decoder. Inspired by (Xu et al., 2024), we show that the decoder can serve as a reward model that evaluates how well the paraphrased text generated by the encoder can be correctly classified. Thus, we can use PPO-based RL techniques to fine-tune the encoder so that the injected watermark is better decoded.
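As a minimal sketch of this reward signal (assuming a binary decoder that outputs one logit per bit class; the function and variable names here are illustrative, not the paper's actual implementation), the PPO reward for a paraphrased sentence can be the log-probability the decoder assigns to the bit the encoder was asked to embed:

```python
import math

def decoder_reward(bit_logits, target_bit):
    """PPO reward sketch: log-probability the decoder assigns to the
    bit the encoder was asked to embed.

    bit_logits -- [logit_for_bit0, logit_for_bit1] from the classifier
    target_bit -- 0 or 1, the bit the paraphraser tried to encode
    """
    m = max(bit_logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(l - m) for l in bit_logits]
    return math.log(exps[target_bit] / sum(exps))
```

A higher reward means the decoder recovers the intended bit more confidently, which is exactly the scalar signal PPO needs to push the paraphraser toward more decodable rewrites.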
We adopt a co-training framework in which the encoder and decoder are alternately updated during the training process. Through experiments, we show that our method achieves very high watermark detection performance while maintaining paraphrasing fidelity. We achieve over 95% bit accuracy and over 0.99 detection AUC, both significantly outperforming existing methods. In addition, applying a simple repetition-based strategy improves the detection AUC to over 0.9999. Our method also shows good robustness under word-substitution and sentence-paraphrasing perturbations. We further evaluate our method on out-of-distribution (OOD) data and observe over 0.99 AUC on most OOD tasks. All these results demonstrate the effectiveness and robustness of our watermark.

The rest of the paper is organized as follows. We first introduce the preliminary knowledge in Section 2. We then introduce our paraphrasing-based watermark methodology in Section 3 and present the experiment results in Section 4. Finally, we discuss related work in Section 5 and conclude in Section 6.

## 2. Preliminary

### 2.1. Goal of Multi-bit Text Watermark

The goal of this work is to inject a multi-bit watermark message into a piece of text by paraphrasing. Formally, in the watermark injection stage, we are given an original text $x_o$ and a watermark message $M \in \{0, 1\}^*$. We inject the watermark by generating a new watermarked text with an encoder, $x_w = E(x_o, M)$. To extract the watermark, we use a watermark decoder $\hat{M} = D(x_w)$ to decode the injected message. The decoded bits should match a prefix of the designed watermark message, i.e., $\hat{M} = M[:\mathrm{len}(\hat{M})]$.
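The decode-and-match contract above can be made concrete with a small helper (a sketch under the notation of Section 2.1; the function names are ours, not the paper's): the decoder returns a bit prefix $\hat{M}$, and we count how many of those bits disagree with the corresponding prefix of the key $M$:

```python
def bit_errors(decoded, message):
    """Number of mismatches between M_hat and M[:len(M_hat)],
    i.e. the Hamming-style error count used in Sec. 2.1."""
    prefix = message[:len(decoded)]
    return sum(a != b for a, b in zip(decoded, prefix))

def is_watermarked(decoded, message, max_errors=0):
    """Declare a match if the decoded prefix (nearly) equals the key prefix.

    max_errors is an illustrative tolerance knob; the paper's actual
    detection statistic may differ.
    """
    return bit_errors(decoded, message) <= max_errors
```

Because the comparison only ever looks at the first `len(decoded)` bits of the key, longer texts simply yield longer decoded prefixes and thus more verifiable bits.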
Note that this is a variable-length watermark: the length of the watermark message depends on the length of the text; the longer the text, the more information we can encode in the watermarked text. This is in contrast to fixed-length text watermarks (e.g., (Zhang et al., 2024b)), where the watermark code has a fixed length for any given input text. The length of $\hat{M}$ depends on the watermark design, and we introduce the designs in Section 3.1.

We have the following requirements on the paraphrased text:

- **Fidelity:** The watermarked text should not change the meaning of the original text. The similarity $\mathrm{sim}(x_o, x_w)$ should be high.
- **Accuracy:** The watermark decoder should accurately decode the watermark message. The error rate $\|\hat{M} \oplus M[:\mathrm{len}(\hat{M})]\|_0$ should be low.
- **Robustness:** The watermark message should still exist after the watermarked text undergoes some perturbation. Let $\hat{M}_{\mathrm{pert}} = D(\mathrm{pert}(x_w))$ denote the message decoded from the perturbed watermarked text. The error rate after perturbation, $\|\hat{M}_{\mathrm{pert}} \oplus M[:\mathrm{len}(\hat{M}_{\mathrm{pert}})]\|_0$, should be low.
- **Stealthiness:** The watermark should not be easily detected by human eyes. We evaluate this with the criterion that a human cannot easily detect the watermark in the text. Formally, let $\hat{M}_h = D_{\mathrm{human}}(x_w)$ be a human guess of the watermark code. We hope that $\|\hat{M}_h \oplus M[:\mathrm{len}(\hat{M}_h)]\|_0$ is high, i.e., the human guess of the watermark code has a high error rate.

### 2.2. Background: PPO

Proximal Policy Optimization (PPO) (Schulman et al., 2017) is a standard way to optimize a language model towards a high reward computed by some pre-defined reward function $r(x) \in \mathbb{R}$, where $x$ is the input text (i.e., a sequence of tokens). Let π(xt|x
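For reference, the clipped surrogate objective that PPO maximizes, as given in Schulman et al. (2017) (the notation below is the standard one from that paper, not specific to this work), is

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range. In the text-watermark setting, the actions $a_t$ are the generated tokens of the paraphrase and the episode-level reward comes from the decoder.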