# WAVES: Benchmarking the Robustness of Image Watermarks

Bang An*1, Mucong Ding*1, Tahseen Rabbani*1, Aakriti Agrawal1, Yuancheng Xu1, Chenghao Deng1, Sicheng Zhu1, Abdirisak Mohamed1,2, Yuxin Wen1, Tom Goldstein1, Furong Huang1

*Equal contribution, ordered alphabetically. 1University of Maryland, College Park. 2SAP Labs, LLC. Correspondence to: Bang An, Mucong Ding, Tahseen Rabbani. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

In the burgeoning age of generative AI, watermarks act as identifiers of provenance and artificial content. We present WAVES (Watermark Analysis via Enhanced Stress-testing), a benchmark for assessing image watermark robustness, overcoming the limitations of current evaluation methods. WAVES integrates detection and identification tasks and establishes a standardized evaluation protocol comprised of a diverse range of stress tests. The attacks in WAVES range from traditional image distortions to advanced, novel variations of diffusive and adversarial attacks. Our evaluation examines two pivotal dimensions: the degree of image quality degradation and the efficacy of watermark detection after attacks. Our novel, comprehensive evaluation reveals previously undetected vulnerabilities of several modern watermarking algorithms. We envision WAVES as a toolkit for the future development of robust watermarks. The project is available at https://wavesbench.github.io/.

1. Introduction

Recent and pivotal advancements in text-to-image diffusion models (Ho et al., 2020; Dhariwal & Nichol, 2021; Rombach et al., 2022) have garnered the attention of the AI community and the general public. Open-source models such as Stable Diffusion and proprietary models such as the DALL·E family and Midjourney have enabled users to produce images of human-produced quality. Consequently, there has been a strong push in the AI/ML community to develop reliable algorithms for detecting AI-generated content and determining its source (Executive Office of the President, 2023). One avenue for maintaining the provenance of generative content is by embedding watermarks. A watermark is a signal encoded onto an image to signify its source or ownership (Al-Haj, 2007; Zhu et al., 2018; Zhang et al., 2019; Tancik et al., 2020; Fernandez et al., 2023; Wen et al., 2023). To avoid degradation of image quality, an invisible watermark is desired. Many such watermarks are robust to common image manipulations (Lukas et al., 2023; Zhao et al., 2023a; Wen et al., 2023; Fernandez et al., 2023), and adversarial efforts to remove the watermark are complicated by the difficulty of decoding/extracting the message without private knowledge of the watermarking scheme (Tancik et al., 2020; Fernandez et al., 2023). Despite this difficulty, various watermark removal schemes can still be effective (Zhao et al., 2023a; Saberi et al., 2023). However, a lack of standardized evaluations in existing literature (i.e., inconsistent image quality measures, statistical parameters, and types of attacks) has resulted in an incomplete picture of the vulnerabilities and robustness of these algorithms in the real world.

Figure 1. WAVES establishes a standardized evaluation framework that encompasses a comprehensive suite of stress tests, including both existing and newly proposed stronger attacks.
We present WAVES (Watermark Analysis via Enhanced Stress-testing), a benchmark for assessing watermark robustness, overcoming the limitations of current evaluation methods. WAVES consists of a comprehensive variety of novel and realistic attacks, including classical image distortions, image regeneration, and adversarial attacks. In an effort to stress-test existing and future watermarks, we propose several new attacks, such as adversarial embedding attacks, and new variants of existing attacks, such as multi-regeneration attacks. WAVES focuses on the sensitivity and robustness of watermark detection, measured by the true positive rate (TPR) at 0.1% false positive rate (FPR), and, in the meantime, studies the severity of image degradations needed to decrease this sensitivity with multiple quality metrics.

Table 1. Comparison of robustness evaluations with existing works. For categories of attacks, D, R, and A denote distortions, image regeneration, and adversarial attacks. Joint test means whether the performance and quality are jointly tested under a range of attack strengths. Our benchmark is the most comprehensive one, with a large scale of attacks, data, metrics, and more realistic evaluation setups.

| Research Work | Num. of Attacks | Categories of Attacks | Num. of Datasets | Sample Size per Dataset | Non-watermarked Image Source | Performance Metric | Num. of Quality Metrics | Joint Test |
|---|---|---|---|---|---|---|---|---|
| StegaStamp Watermark¹ | 5 | D | 1 | 1000 | | bit accuracy | 3 | |
| Stable Signature Watermark² | 12 | D, R | 1 | 5000 | | bit accuracy | 3 | |
| Tree-Ring Watermark³ | 6 | D | 2 | 1000 | generated by same model | TPR@1%FPR | 2 | |
| Regeneration Attack⁴ | 10 | D, R | 2 | 500 | | bit accuracy | 3 | |
| Surrogate Model Attack⁵ | 2 | R, A | 1 | 2500 | real images | AUROC | 0 | |
| Adaptive Attack⁶ | 10 | D, A | 1 | 1000 | real images | TPR@1%FPR | 3 | |
| WAVES (ours) | 26 | D, R, A | 3 | 5000 | real images | TPR@0.1%FPR | 8 | |

¹ Tancik et al. (2020). ² Fernandez et al. (2023). ³ Wen et al. (2023). ⁴ Zhao et al. (2023a). ⁵ Saberi et al. (2023). ⁶ Lukas et al. (2023).

WAVES develops a series of Performance vs. Quality 2D plots varying over several prominent image similarity metrics, which are then aggregated in a heuristically novel manner to paint an overall picture of watermark robustness and attack potency. We extensively evaluate the security of three prominent watermarking algorithms, Stable Signature, Tree-Ring, and StegaStamp, respectively representing three major techniques for embedding an invisible signature. WAVES effectively reveals weaknesses in them and discovers previously undetected vulnerabilities. For example, watermarking algorithms using publicly available VAEs can have their watermarks effectively removed with minimal image manipulation. DALL·E 3's usage of an open-source KL-VAE underscores the need for unique VAEs in such systems.

Our contributions are summarized as follows: (1) In practical scenarios where false alarms incur high costs, our evaluation metric for watermark detection prioritizes the True Positive Rate (TPR) at a stringent False Positive Rate (FPR) threshold, specifically 0.1%. This focus addresses the inadequacies of alternative metrics such as the p-value and Area Under the Receiver Operating Characteristic (AUROC). (2) Additionally, our metric incorporates image quality alongside TPR@0.1%FPR. This integration acknowledges the necessity of maintaining a balance between reducing the accuracy of watermark detection and the practical utility of the image in practical scenarios.
(3) We introduce a comprehensive taxonomy of attacks that encompasses classical distortions (blurring, rotation, cropping, etc.) and powerful, novel variations of regeneration and adversarial attacks against watermarks. (4) We standardize the evaluation of watermark robustness, allowing us to rank attacks and watermarks. We formalize the watermark detection and user identification problems and evaluate robustness under both scenarios. (5) Our benchmark uncovers several especially harmful attacks for popular watermarks, some of which are first introduced in this work, underscoring the need for refinement of existing watermarking algorithms and systems. WAVES contributes a toolkit for examining watermark robustness and helps the future development of robust watermarks.

2. Image Watermarks

We briefly review invisible watermarks and defer detailed discussions to Appendix A. Generally, there are two types of watermarking methods. (1) Post-processing watermarks embed watermarks after image generation. (1a) Frequency-domain methods like DWT, DCT (Cox et al., 2007), and DWT-DCT (Al-Haj, 2007) modify images in transform domains. (1b) Deep encoder-decoder methods such as HiDDeN (Zhu et al., 2018), RivaGAN (Zhang et al., 2019), and StegaStamp (Tancik et al., 2020) use trained neural networks for embedding and decoding watermarks. Post-processing watermarks are model-agnostic but can introduce human-visible artifacts, compromising image quality. (2) In-processing watermarks integrate watermarking into the image generation process, substantially eliminating visible artifacts. (2a) Whole-model modifications embed watermarks by training the entire generative model on watermarked images (Yu et al., 2021; Zeng et al., 2023; Lukas & Kerschbaum, 2023). (2b) Partial-model modifications such as Stable Signature (Fernandez et al., 2023) only fine-tune the decoder of the latent diffusion model. (2c) Random-seed modification watermarks like Tree-Ring (Wen et al., 2023) embed watermarks into the initial noise vector of diffusion models, which can be retrieved at detection time.

Robustness is an essential property of watermarks, especially since there is an incentive to remove watermarks. Besides natural image distortions, some watermarks are shown to be vulnerable to regeneration through diffusion models or VAEs (Zhao et al., 2023a; Saberi et al., 2023) and to adversarial attacks (Lukas et al., 2023; Saberi et al., 2023). However, some unrealistic attacks and inconsistent robustness evaluations across different studies have muddled the understanding of watermark robustness, obscuring the true vulnerabilities of these methods. Therefore, this paper provides a standardized and comprehensive benchmark, encompassing a set of realistic and strong attacks. Our benchmark enables apples-to-apples comparison of watermarks as well as attacks, which helps standardize and accelerate the study of robust watermarks.

3. Standardized Evaluation through WAVES

3.1. Standardized Evaluation Workflow and Metrics

As shown in Table 1, our benchmark, WAVES, stands out by considering three diverse datasets, incorporating 26 diverse attacks across three categories, and employing 8 quality metrics. These distinguish our work as the most extensive and realistic setup to date for watermark robustness evaluation. For more details on the evaluation workflow, setups, metrics, and more analyses, see Appendix E.

Applications and formulation of invisible image watermarks.
Invisible image watermarks, originally for protecting creators' intellectual property, have expanded into broader applications like AI Detection (identifying AI-generated images) (Saberi et al., 2023) and User Identification (tracking the source of an image to its creator) (Fernandez et al., 2023). We are interested in message-based approaches, where a unique, invisible identifier is embedded into an image, which may be recovered by the content creator at any time to establish provenance. The choice of message varies across methods, with Tree-Ring using random complex Gaussians and others like Stable Signature employing binary strings.

Evaluation Workflow. The trade-off between watermark performance and image quality, especially when watermark attacks lead to image distortions, is critical. We introduce Performance vs. Quality 2D plots for a comprehensive comparison, a novel perspective over the typical performance-centric analyses. The evaluation process involves comparing watermarked images with a diverse set of real and AI-generated reference images to produce the performance vs. quality 2D plots, and processing or aggregating the 2D plots to compare attacks and watermarks, as depicted in Figure 2.

Performance Metrics in AI Detection and User Identification. WAVES prioritizes fairness and comprehensiveness by using evaluation metrics that are independent of the choice of statistical tests and p-value thresholds, in contrast to some prior practices such as Fernandez et al. (2023). AI detection in WAVES is akin to binary classification, utilizing ROC curve-based metrics. Given the significant impact of false positives in mislabeling non-watermarked images, strict control over the false positive rate (FPR) is crucial. Therefore, rather than AUROC (since a high AUROC score does not necessarily imply a high true positive rate (TPR) at low FPR levels), WAVES focuses on TPR@x%FPR, specifically at a challenging low FPR threshold of 0.1%, extending recent studies such as Wen et al. (2023) with a larger dataset and a more stringent FPR criterion. User identification is approached as multi-class classification, and we measure performance by the accuracy of correct image assignments to users.
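A minimal sketch of how such a metric can be computed from raw detector scores is given below. The score convention (higher means "watermark present") and the function name are illustrative assumptions, not the WAVES implementation.

```python
import numpy as np

def tpr_at_fpr(scores_watermarked, scores_clean, target_fpr=0.001):
    """Estimate TPR at a fixed FPR from raw watermark-detector scores.

    Higher scores are assumed to indicate "watermark present".
    The detection threshold is chosen so that at most `target_fpr`
    of the non-watermarked (negative) images are flagged.
    """
    neg = np.sort(np.asarray(scores_clean, dtype=float))
    # Threshold at the (1 - target_fpr) quantile of negative scores.
    threshold = np.quantile(neg, 1.0 - target_fpr)
    pos = np.asarray(scores_watermarked, dtype=float)
    return float(np.mean(pos > threshold))

# Example usage (hypothetical score arrays):
# tpr = tpr_at_fpr(wm_scores, ref_scores, target_fpr=0.001)  # TPR@0.1%FPR
```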
Implementing Diverse Image Quality Metrics. Recognizing that no single metric can fully capture the aspects of generated images, we use a range of image quality metrics and propose a normalized, aggregated metric for evaluating watermark and attack methods. WAVES integrates over 8 metrics in 4 categories: (1) Image similarities, including Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Normalized Mutual Information (NMI), which assess the pixel-wise accuracy after attacks; (2) Distribution distances such as Fréchet Inception Distance (FID) (Heusel et al., 2017) and a variant based on the CLIP feature space (CLIP-FID) (Kynkäänniemi et al., 2022); (3) Perception-based metrics like Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018); (4) Image quality assessments including aesthetics and artifacts scores (Xu et al., 2023), which quantify the changes in aesthetic and artifact features.

Normalization and Aggregation of Image Quality Metrics. Addressing the distinct characteristics of various image quality metrics, WAVES proposes a normalized and aggregated quality metric for a unified measure of image quality degradation and comprehensive scoring of attack or watermark methods. We define the normalized scale for each metric by assigning the 10% quantile value over all attacked images (across 26 attack methods, three watermark methods, and three datasets) as the 0.1 point, and the 90% quantile as the 0.9 point. Normalized quality metrics are always ranked in ascending order of image degradation. This normalization ensures equivalent significance across different metrics, defined by their quantiles in a large set of attacked watermarked images. Normalized metrics are aggregated and extensively utilized in Section 4 for Performance vs. Quality plots, watermark radar plots, and attack leaderboards.

Figure 2. Evaluation workflow. (a) Evaluation of a single attack on a watermarking method. We first attack watermarked images over a variety of strengths (also labeled "stg"). Then, we evaluate the detection performance (TPR@0.1%FPR) and a collection of image quality metrics such as PSNR, and plot a set of performance vs. quality plots. By normalizing and aggregating these quality metrics, we derive a consolidated 2D plot that represents the overall performance vs. quality for the evaluation. (b) Benchmarking watermarks and attacks. For each watermark, we plot all attacks on a unified performance vs. quality 2D plot to facilitate a detailed comparison. Based on this, we provide two additional analytical perspectives: we compare watermarks' robustness through the averaged performance under different attacks, and we evaluate attacks' potency by ranking the quality at a specific performance threshold.

3.2. Stress-testing Watermarks

We evaluate the robustness of watermarks with a wide range of attacks detailed in this section and summarized in Table 2 and Table 5. Figure 24 demonstrates the visual effects.

Distortion Attacks. Watermarked images often face distortions such as compression and cropping during internet transmission, necessitating watermarks that can endure common alterations. However, most studies only test resilience against singular or extreme distortions. In WAVES, we establish the following distortions, within an acceptable quality threshold, as our baselines. Geometric distortions: rotation, resized-crop, and erasing; Photometric distortions: adjustments in brightness and contrast; Degradation distortions: Gaussian blur, Gaussian noise, and JPEG compression; Combo distortions: combinations of geometric, photometric, and degradation distortions, both individually and collectively. Detailed setups for each are provided in Appendix F.1.
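As a rough illustration of these distortion baselines, the sketch below applies a few of them with torchvision. The parameter values and the input path are illustrative placeholders, not the strength levels used in WAVES.

```python
import io
from PIL import Image
import torchvision.transforms as T

def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Degradation distortion: JPEG re-compression at a given quality factor."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Geometric and photometric distortions via standard torchvision transforms.
geometric = T.Compose([
    T.RandomRotation(degrees=9),                       # rotation
    T.RandomResizedCrop(size=512, scale=(0.7, 0.7)),   # resized-crop
])
photometric = T.ColorJitter(brightness=0.3, contrast=0.3)   # brightness / contrast
degradation = T.GaussianBlur(kernel_size=9, sigma=2.0)      # Gaussian blur

img = Image.open("watermarked.png").convert("RGB")   # hypothetical input image
attacked = jpeg_compress(degradation(photometric(geometric(img))), quality=50)
```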
Table 2. A taxonomy of all the attacks in our stress-testing set.

| Category | Subcategory (prefix) | Description | Attack Names (suffix) |
|---|---|---|---|
| Distortion | Single (Dist-) | Single distortion | -Rotation, -RCrop, -Erase, -Bright, -Contrast, -Blur, -Noise, -JPEG |
| Distortion | Combination (DistCom-) | Combination of a type of distortions | -Geo, -Photo, -Deg, -All |
| Regeneration | Single (Regen-) | A single VAE or diffusion regeneration | -Diff, -DiffP¹, -VAE, -KLVAE² |
| Regeneration | Rinsing (Rinse-) | A multi-diffusion regeneration | -2xDiff, -4xDiff |
| Adversarial | Embedding (grey-box) (AdvEmbG-)³ | Use the same VAE | -KLVAE8 |
| Adversarial | Embedding (black-box) (AdvEmbB-) | Use other encoders | -RN18, -CLIP, -KLVAE16, -SdxlVAE |
| Adversarial | Surrogate detector attack (AdvCls-)⁴ | Train a watermark detector | -UnWM&WM, -Real&WM, -WM1&WM2 |

¹ DiffP requires user prompts. ² KLVAE with bottleneck size 8 is grey-box. ³ AdvEmbG is grey-box. ⁴ AdvCls needs data and training.

Figure 3. Regeneration attacks on Tree-Ring. Regen-Diff is a single diffusive regeneration and Rinse-[N]xDiff is a rinsing one with N repeated diffusions, with the number of noising steps as attack strength. Regen-VAE uses a pre-trained VAE with quality factor as strength, and Regen-KLVAE uses pre-trained KL-VAEs with bottleneck size as strength. RinseD-VAE applies a VAE as a denoiser after Rinse-4xDiff.

Regeneration Attacks, employing diffusion models or VAEs (Saberi et al., 2023; Zhao et al., 2023a), aim at altering an image's latent representation by noising and then denoising an image. Different from existing works that only perform a Single regeneration, we also investigate Rinsing regenerations, where an image undergoes multiple cycles of noising and denoising through a pre-trained diffusion model. Furthermore, we introduce two additional variations: prompted regeneration and mixed regeneration (rinse + VAE denoising). To simulate a realistic attack, we use a lower-version diffusion model than the one used to generate watermarked images. All such attacks are detailed in Appendix F.2.

As shown in Figure 3, in contrast with the conclusions of Zhao et al. (2023a), the Tree-Ring watermark is not robust against regeneration attacks. In particular, a single regeneration such as Regen-Diff and Regen-VAE can significantly harm the TPR@0.1%FPR while maintaining reasonable CLIP-FID. Rinsing regenerations significantly lower the TPR@0.1%FPR at the cost of markedly decreased image quality. A 2x rinsing regeneration (Rinse-2xDiff) strikes a balance between low TPR@0.1%FPR and high image quality. In regards to Stable Signature, Figure 7 and Table 3 concur with the analysis of Zhao et al. (2023a): regeneration attacks are completely destructive, and rinsing regenerations reiterate this phenomenon. StegaStamp is mildly affected by regenerations, and only by diffusive attacks, including our novel rinsing and prompted regenerations.

Figure 4. Adversarial embedding attacks target Tree-Ring at strengths of {2/255, 4/255, 6/255, 8/255}. Tree-Ring shows vulnerability to embedding attacks, especially when the adversary can access the VAE being used.

Adversarial Attacks. Deep neural networks are vulnerable to adversarial examples (Ilyas et al., 2019; Chakraborty et al., 2018). In WAVES, we explore watermark robustness against two types of adversarial attacks.

(A) Embedding Attacks. Watermark detection can be thwarted by perturbations on image embeddings. Such attacks have been used against multimodal large language models like GPT-4V (Dong et al., 2023) and have shown good transferability (Inkawhich et al., 2019). We examine whether attacks on off-the-shelf embedding models can transfer to watermark detectors. Given an encoder $f: \mathcal{X} \to \mathcal{Z}$ mapping images to latent features, we craft an adversarial image $x_{\mathrm{adv}}$ to diverge its embedding from that of the original watermarked image $x$, within an $\ell_\infty$ perturbation ball: $\max_{x_{\mathrm{adv}}} \|f(x_{\mathrm{adv}}) - f(x)\|_2 \quad \text{s.t.} \quad \|x_{\mathrm{adv}} - x\|_\infty \le \epsilon$. We approximately solve this using the PGD algorithm (Madry et al., 2017) (see details in Appendix F.3.1), and see if the adversarial image transfers to real watermark detectors.
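A minimal PGD sketch of this embedding attack is shown below. The encoder is assumed to be any frozen feature extractor that returns a plain tensor (e.g., CLIP image features or VAE latents), and the step size and iteration count are illustrative choices rather than the settings used in WAVES.

```python
import torch

def embedding_attack(encoder, x, eps=8/255, alpha=2/255, steps=50):
    """Untargeted PGD: push the adversarial image's embedding away from the
    original watermarked image's embedding, within an l_inf ball of radius eps.

    encoder: frozen feature extractor f (assumed to return a tensor).
    x:       watermarked images in [0, 1], shape (B, C, H, W).
    """
    with torch.no_grad():
        z_orig = encoder(x)                                   # clean embedding, treated as constant
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        # Objective to maximize: ||f(x_adv) - f(x)||_2
        loss = (encoder(x_adv) - z_orig).flatten(1).norm(dim=1).mean()
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()               # gradient ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)          # project back into the l_inf ball
            x_adv = x_adv.clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```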
We evaluate five off-the-shelf encoders. AdvEmbB-RN18 uses a pre-trained ResNet18 (He et al., 2016), targeting the pre-logit feature layer. AdvEmbB-CLIP employs CLIP's (Radford et al., 2021) image encoder. AdvEmbG-KLVAE8 utilizes the encoder of KL-VAE (f8), which is used in the victim latent diffusion model. This is a grey-box setting but reflects the use of public VAEs in proprietary models (for example, DALL·E 3 uses a public KL-VAE according to https://cdn.openai.com/papers/dall-e-3.pdf). Further, we do ablation studies on KL-VAE (f16), which has a different architecture but is trained on the same data, and on SDXL-VAE (Podell et al., 2023), an enhanced version of KL-VAE (f8). They are black-box attacks and are labeled AdvEmbB-KLVAE16 and AdvEmbB-SdxlVAE.

As shown in Figure 4, Tree-Ring is vulnerable to embedding attacks, particularly under the grey-box condition, where TPR@0.1%FPR can drop to nearly zero, effectively removing most watermarks. This is because the detection process of Tree-Ring first maps the image to the latent representation through the encoder of KL-VAE (f8), then conducts inverse DDIM to retrieve the watermark. The embedding attack changes the latent representation severely; therefore, watermark retrieval becomes very difficult. Using similar yet distinct VAEs, attack effectiveness diminishes but still manages to remove some watermarks, with KL-VAE (f16), trained on the same images, demonstrating the highest transferability. CLIP-based attacks also achieve some success, especially on natural images like MS-COCO, likely due to CLIP being trained on natural images akin to those in MS-COCO, enhancing the transferability. Conversely, Stable Signature and StegaStamp demonstrate robustness against embedding attacks (Figure 7), likely because their detectors are trained independently from generative models, differing significantly from standard classifiers and VAEs. Hence, our attacks fail to effectively transfer to their detectors.

Figure 5. Three settings for training the surrogate detector. The Generator is the victim generator under attack. We externalize the watermarking process for simplicity, but it could be an in-processing watermark. After training the surrogate detectors, the adversary performs PGD attacks on them to flip the labels.

(B) Surrogate Detector Attacks. Watermark detection hinges on a detector that decodes and verifies messages from watermarked images. Adversaries might acquire numerous watermarked and non-watermarked images to train a surrogate detector, and transfer attacks on it to the actual watermark detector. Figure 5 illustrates our various settings. AdvCls-UnWM&WM trains a surrogate detector with both watermarked and non-watermarked images from the victim generative model, as per Saberi et al. (2023). Note that this is an unrealistic setting for proprietary models, since all their outputs are assumed to be watermarked. AdvCls-Real&WM trains the surrogate watermark detector with watermarked and non-watermarked images, where the non-watermarked images are sampled from the ImageNet dataset (not from the generative model). This approach is more applicable to proprietary models. AdvCls-WM1&WM2 only uses watermarked images. It actually trains a surrogate watermark message classifier to distinguish two users. Suppose the system assigns a particular message to each user for identification purposes; the adversary can collect the training data from two users' outputs, with an identical set of prompts. Adversarial attacks on this surrogate model aim at user misidentification. All surrogate detectors are fine-tuned on ResNet18. We use ImageNet text prompts "A photo of a {class name}" to generate training images (see details in Appendix F.3.2).
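A minimal sketch of surrogate-detector training in the AdvCls-Real&WM style is shown below. The data loader, label convention, and hyperparameters are illustrative assumptions, not the exact WAVES training recipe.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def train_surrogate(loader, epochs=5, lr=1e-4, device="cuda"):
    """Fine-tune ResNet18 as a binary surrogate watermark detector.

    loader yields (image, label) pairs; here label 0 = non-watermarked
    (e.g., real ImageNet images) and label 1 = watermarked generations.
    """
    model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, 2)   # watermarked vs. non-watermarked
    model = model.to(device).train()

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model.eval()

# The trained surrogate is then attacked with targeted PGD (label flipped toward
# "non-watermarked"), and the perturbed images are sent to the true detector.
```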
With the trained surrogate detector $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} = \{0, 1\}$, adversaries launch targeted attacks. The goal is to craft an adversarial image $x_{\mathrm{adv}}$ from an original image $x$ so that $f$ incorrectly predicts the target label $y_{\mathrm{target}}$ (i.e., the wrong label), minimizing the cross-entropy loss: $\min_{x_{\mathrm{adv}}} \mathcal{L}(f(x_{\mathrm{adv}}), y_{\mathrm{target}}) \quad \text{s.t.} \quad \|x_{\mathrm{adv}} - x\|_\infty \le \epsilon$. It enables adversaries to erase watermarks from marked images or implant them into clean images in the first two settings, and to disrupt user identification as well as watermark detection in the third setting. We solve it with the PGD algorithm.

Figure 6. Adversarial surrogate detector attacks on Tree-Ring.

Figure 6 shows Tree-Ring's vulnerability to surrogate detector-based attacks. In AdvCls-UnWM&WM, the adversary accessing non-watermarked images has good transferability and removes watermarks effectively. However, it fails to add watermarks to clean images (spoofing attack), as detailed in Figure 20. The reason behind this is explored in Appendix G.2, where we find the attacker disrupts the entire latent space, not just the watermark (as shown in Figure 21). Conversely, the spoofing attack fails to embed the precise watermark. The AdvCls-Real&WM attack fails entirely, likely due to the surrogate model appearing to differentiate real from generated images using broader features than the watermark. The newly proposed AdvCls-WM1&WM2 successfully attacks Tree-Ring using only watermarked images. Like the first scenario, the surrogate model fails to precisely locate watermarks but learns the mapping to the latent feature space, allowing a PGD attack to remove the watermark by disturbing the entire latent space (see Figure 22). In user identification tasks (Figure 23), the attack doesn't consistently mislead the detector into misidentifying User1's watermarked images as User2's (targeted misidentification). Instead, imprecise perturbations often lead to incorrect attribution of User1's images to others. Figure 7 shows that Stable Signature and StegaStamp are robust to these attacks. Even with high surrogate classifier accuracy in AdvCls-UnWM&WM, adversarial examples fail to transfer to the true detector, possibly due to reliance on different features than those used by the true detector.

4. Benchmarking Results and Analysis

We extensively evaluate the security of three prominent watermarking algorithms (selected according to Appendix D.2), Stable Signature, Tree-Ring, and StegaStamp, respectively representing three major watermarking types: in-processing via model modification, in-processing via random seed modification, and post-processing. We conduct thorough evaluations with images from DiffusionDB (Wang et al., 2022), MS-COCO (Lin et al., 2014), and the DALL·E 3 datasets; see Appendix D.1 for details. Note that our evaluation process can be applied to any watermark (as shown in Appendix G.5).

Performance vs. Quality 2D plots. We evaluate 3 watermarking methods under 26 attacks, and report results across 3 datasets in Figure 25 to Figure 30. The quality of images post-attack is evaluated using 8 metrics, and the detection performance is measured by TPR@0.1%FPR. Figure 13 shows that different quality metrics yield a similar ranking of attacks. Consequently, we aggregate these metrics into a single, unified quality metric, Normalized Quality Degradation, with lower scores indicating lesser quality degradation caused by attacks.
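The sketch below shows one way to turn raw per-image quality metrics into such a single degradation score, following the 10%/90% quantile normalization described in Section 3.1. The dictionary layout and sign conventions are illustrative assumptions.

```python
import numpy as np

def normalize_metric(values, pool, higher_is_worse=True):
    """Map a raw quality metric onto a common degradation scale.

    The 10% and 90% quantiles of the metric over the full pool of attacked
    images are mapped to 0.1 and 0.9; metrics where larger means better
    quality (e.g., PSNR, SSIM) are flipped so that larger always means
    more degradation.
    """
    v = np.asarray(values, dtype=float)
    p = np.asarray(pool, dtype=float)
    if not higher_is_worse:
        v, p = -v, -p
    q10, q90 = np.quantile(p, 0.10), np.quantile(p, 0.90)
    return 0.1 + 0.8 * (v - q10) / (q90 - q10)

def normalized_quality_degradation(per_metric_values, per_metric_pools, higher_is_worse):
    """Average the normalized metrics into a single degradation score per image."""
    normed = [normalize_metric(per_metric_values[m], per_metric_pools[m], higher_is_worse[m])
              for m in per_metric_values]
    return np.mean(normed, axis=0)
```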
Furthermore, we aggregate the results across the three distinct datasets and derive the unified Performance vs. Quality Degradation 2D plots in Figure 7, visualizing the unified evaluation results for each watermarking method against each attack. We defer the aggregation details to Appendix E. Based on these unified 2D plots, we benchmark watermarks and attacks in the following sections.

4.1. Benchmarking Watermark Robustness

Figure 8 provides a high-level overview of watermarks' robustness. We categorize effective attacks into seven types (the same categories as in Table 2): Distortion Single, Distortion Combination, Regeneration Single, Regeneration Rinsing, Adv Embedding Grey-box, Adv Embedding Black-box, and Adv Surrogate Detector. Attacks considered are detailed in Appendix E.5. The Average TPR@0.1%FPR, calculated for each category across strength levels, assesses watermarking method robustness. Figure 8(a) shows the robustness of the three watermarking methods, where the area covered indicates the overall robustness. Figure 8(b) shows the distribution of quality degradation for each type of attack, to illustrate the potential trade-off between attack effectiveness and image quality.

Figure 7. Unified performance vs. quality degradation 2D plots under the detection setup. We evaluate each watermarking method under various attacks. Two dashed lines show the thresholds used for ranking attacks.

Figure 8. (a) Detection performance of three watermarks after attacks, measured by Average TPR@0.1%FPR, with lower values (near the center) indicating higher vulnerabilities. (b) The distribution of quality degradation; the lower, the better.

WAVES provides a clear comparison of watermarks' robustness and reveals undiscovered vulnerabilities. Figure 8 reveals that StegaStamp occupies the largest area, signaling its exceptional robustness. Tree-Ring follows suit with a smaller area, and Stable Signature occupies the least space. Interestingly, different watermarking methods exhibit vulnerabilities to different types of attacks. Tree-Ring is particularly vulnerable to the adversarial attacks introduced in this paper, with a significant vulnerability to grey-box embedding and surrogate detector attacks. It is also vulnerable to regeneration rinsing attacks. Stable Signature is vulnerable to almost all regeneration attacks. All three watermarks maintain relative robustness against distortions. Furthermore, as observed in Figure 8, adversarial attacks generally cause less quality degradation, highlighting their potency against Tree-Ring watermarks. WAVES offers an apples-to-apples comparison of watermarks through a multi-dimensional stress test of their robustness, enabling a nuanced and comprehensive understanding of their security in various scenarios.

4.2. Benchmarking Attacks

Table 3 features a leaderboard ranking attacks based on their impact on detection performance and image quality. We assess attacks using performance thresholds (TPR@0.1%FPR = 0.95 and TPR@0.1%FPR = 0.7) and the quality degradation at these thresholds (Q@0.95P and Q@0.7P). Additionally, we evaluate average performance (Avg P) and quality degradation (Avg Q) across all strengths. These metrics are used to rank the 26 attacks for each watermarking method, with details deferred to Appendix E.6.
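A sketch of how a threshold metric such as Q@0.95P can be read off a performance-vs-quality curve is given below. The interpolation scheme and the handling of out-of-range cases follow the "inf"/"-inf" conventions of Table 3, but the function itself is an illustrative assumption rather than the WAVES implementation.

```python
import numpy as np

def quality_at_performance(performance, quality, threshold=0.95):
    """Quality degradation at which detection performance crosses `threshold`.

    performance, quality: arrays over increasing attack strength
    (TPR@0.1%FPR and normalized quality degradation, respectively).
    Returns float('inf') if performance never drops below the threshold,
    and float('-inf') if it is below the threshold at every tested strength.
    """
    perf = np.asarray(performance, dtype=float)
    qual = np.asarray(quality, dtype=float)
    if perf.min() > threshold:
        return float("inf")
    if perf.max() < threshold:
        return float("-inf")
    # Interpolate the quality degradation where the curve crosses the threshold.
    order = np.argsort(perf)
    return float(np.interp(threshold, perf[order], qual[order]))
```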
Attack effectiveness varies among watermarks. Table 3 shows variability in attack efficiency across watermarking methods. Metrics like Q@0.95P and Q@0.7P provide nuanced comparisons, while Avg P and Avg Q offer insights into overall attack potency and image quality impact. Our analysis identifies each watermark's specific weaknesses to certain attacks. For instance, AdvCls-UnWM&WM, AdvCls-WM1&WM2, and AdvEmbG-KLVAE8 are notably effective against Tree-Ring, whereas Regen-Diff and Regen-DiffP are more potent against Stable Signature. Regeneration attacks impact StegaStamp but do not greatly affect its average detection performance; in contrast, certain distortion attacks significantly lower its detection performance, at the cost of quality degradation.

Table 3. Comparison of attacks across three watermarking methods in the detection setup. Q denotes the normalized quality degradation, and P denotes the performance as derived from Figure 7. Q@0.95P measures quality degradation at a 0.95 performance threshold, where "inf" denotes cases where all tested attack strengths yield performance above 0.95, and "-inf" where all are below. A similar notation applies to Q@0.7P. Avg P and Avg Q are the average performance and quality over all the attack strengths. The lower the performance and the smaller the quality degradation, the stronger the attack. For each watermarking method, we rank attacks by Q@0.95P, Q@0.7P, Avg P, and Avg Q, in that order, with lower values indicating stronger attacks. TR = Tree-Ring, SS = Stable Signature, StS = StegaStamp.

| Attack | TR Rank | TR Q@0.95P | TR Q@0.7P | TR Avg P | TR Avg Q | SS Rank | SS Q@0.95P | SS Q@0.7P | SS Avg P | SS Avg Q | StS Rank | StS Q@0.95P | StS Q@0.7P | StS Avg P | StS Avg Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dist-Rotation | 11 | 0.464 | 0.521 | 0.375 | 0.648 | 12 | 0.624 | 0.702 | 0.594 | 0.650 | 5 | 0.423 | 0.498 | 0.357 | 0.616 |
| Dist-RCrop | 18 | 0.592 | 0.592 | 0.332 | 0.463 | 24 | inf | inf | 0.995 | 0.461 | 6 | 0.602 | 0.602 | 0.540 | 0.451 |
| Dist-Erase | 26 | inf | inf | 1.000 | 0.490 | 25 | inf | inf | 0.998 | 0.489 | 25 | inf | inf | 1.000 | 0.483 |
| Dist-Bright | 25 | inf | inf | 0.997 | 0.304 | 23 | inf | inf | 0.998 | 0.305 | 22 | inf | inf | 0.998 | 0.317 |
| Dist-Contrast | 22 | inf | inf | 0.998 | 0.243 | 20 | inf | inf | 0.998 | 0.243 | 17 | inf | inf | 0.998 | 0.231 |
| Dist-Blur | 20 | 0.861 | 1.112 | 0.563 | 1.221 | 5 | -inf | -inf | 0.000 | 1.204 | 9 | 0.848 | 0.962 | 0.414 | 1.198 |
| Dist-Noise | 16 | 0.548 | inf | 0.980 | 0.395 | 8 | 0.402 | 0.520 | 0.870 | 0.390 | 24 | inf | inf | 1.000 | 0.360 |
| Dist-JPEG | 12 | 0.499 | 0.499 | 0.929 | 0.284 | 9 | 0.485 | 0.485 | 0.793 | 0.284 | 21 | inf | inf | 0.998 | 0.263 |
| DistCom-Geo | 13 | 0.525 | 0.593 | 0.277 | 0.768 | 13 | 0.850 | inf | 0.937 | 0.767 | 7 | 0.663 | 0.693 | 0.396 | 0.733 |
| DistCom-Photo | 22 | inf | inf | 0.998 | 0.242 | 20 | inf | inf | 0.998 | 0.243 | 17 | inf | inf | 0.998 | 0.239 |
| DistCom-Deg | 19 | 0.620 | inf | 0.892 | 0.694 | 7 | 0.206 | 0.369 | 0.300 | 0.679 | 8 | 0.826 | 0.975 | 0.852 | 0.664 |
| DistCom-All | 14 | 0.539 | 0.751 | 0.403 | 0.908 | 11 | 0.538 | 0.691 | 0.334 | 0.900 | 10 | 0.945 | 1.101 | 0.795 | 0.870 |
| Regen-Diff | 5 | -inf | 0.307 | 0.612 | 0.323 | 1 | -inf | -inf | 0.001 | 0.300 | 1 | 0.331 | inf | 0.943 | 0.327 |
| Regen-DiffP | 4 | -inf | 0.307 | 0.601 | 0.327 | 1 | -inf | -inf | 0.001 | 0.303 | 1 | 0.333 | inf | 0.940 | 0.329 |
| Regen-VAE | 17 | 0.578 | 0.578 | 0.832 | 0.348 | 10 | 0.545 | 0.545 | 0.516 | 0.339 | 23 | inf | inf | 1.000 | 0.343 |
| Regen-KLVAE | 22 | inf | inf | 0.990 | 0.233 | 6 | -inf | 0.176 | 0.217 | 0.206 | 17 | inf | inf | 1.000 | 0.240 |
| Rinse-2xDiff | 6 | -inf | 0.333 | 0.510 | 0.357 | 3 | -inf | -inf | 0.001 | 0.332 | 4 | 0.391 | inf | 0.941 | 0.366 |
| Rinse-4xDiff | 7 | -inf | 0.355 | 0.443 | 0.466 | 4 | -inf | -inf | 0.000 | 0.438 | 3 | 0.388 | inf | 0.909 | 0.477 |
| AdvEmbG-KLVAE8 | 3 | -inf | 0.164 | 0.448 | 0.253 | 20 | inf | inf | 0.998 | 0.249 | 17 | inf | inf | 1.000 | 0.232 |
| AdvEmbB-RN18 | 10 | 0.241 | inf | 0.953 | 0.218 | 17 | inf | inf | 0.999 | 0.212 | 14 | inf | inf | 1.000 | 0.196 |
| AdvEmbB-CLIP | 15 | 0.541 | inf | 0.932 | 0.549 | 26 | inf | inf | 0.999 | 0.541 | 25 | inf | inf | 1.000 | 0.488 |
| AdvEmbB-KLVAE16 | 8 | 0.195 | inf | 0.888 | 0.238 | 19 | inf | inf | 0.997 | 0.233 | 14 | inf | inf | 1.000 | 0.206 |
| AdvEmbB-SdxlVAE | 9 | 0.222 | inf | 0.934 | 0.221 | 17 | inf | inf | 0.998 | 0.219 | 14 | inf | inf | 1.000 | 0.204 |
| AdvCls-UnWM&WM | 1 | -inf | 0.102 | 0.499 | 0.145 | 14 | inf | inf | 0.999 | 0.101 | 11 | inf | inf | 1.000 | 0.101 |
| AdvCls-Real&WM | 21 | inf | inf | 1.000 | 0.047 | 14 | inf | inf | 0.998 | 0.092 | 11 | inf | inf | 1.000 | 0.106 |
| AdvCls-WM1&WM2 | 1 | -inf | 0.101 | 0.492 | 0.139 | 14 | inf | inf | 0.999 | 0.084 | 13 | inf | inf | 1.000 | 0.129 |
No single attack excels across all watermarking methods, yet regeneration attacks exhibit some level of consistent effectiveness. This significant variation in attack effectiveness emphasizes the imperative for diverse and watermark-tailored defensive strategies.

4.3. Benchmarking Results for User Identification

We detail the user identification results, following the evaluation method from Section 3.1. The key distinction here is the use of identification accuracy as the performance metric. Our study includes scenarios with 100 and 1 million users, reflecting a range of real-world conditions. Utilizing the same evaluation approach, we generate unified Performance vs. Quality degradation 2D plots (Figure 19), radar plots for watermark comparison (Figure 9), and an attack leaderboard in the identification context (Table 6).

Figure 9. Identification accuracy of three watermarks after attacks.

Identification results mirror findings from detection, showing similar trends in watermark robustness and attack effectiveness. Figure 9 and Table 6 reveal that trends in watermark robustness and attack potency closely match those in detection, largely because both rely on precise watermark decoding. Notably, watermarks become more vulnerable as user numbers increase, a trend particularly evident in attacks that already strongly affect detection. Since identification demands more accurate decoding, its vulnerability amplifies with user growth. Thus, insights gained from detection scenarios generally apply to identification, especially when attacks are not identification-specific. However, novel attacks, such as our AdvCls-WM1&WM2, may target user identification. Watermarking strategies should evolve to address emerging challenges in both detection and identification.

4.4. Discussions

Understanding watermark vulnerabilities. Tree-Ring is particularly vulnerable to adversarial attacks, likely due to its unique watermark detection process. The detection first encodes an image into a latent space using a VAE encoder, then reverses the diffusion process to extract the initial noise vector and compares it with a key. Consequently, the detection hinges on the integrity of the latent feature space, and thus disturbances inside this domain significantly hinder watermark recovery. Embedding attacks, especially in the grey-box setting, effectively disrupt the latent features without altering the perceptual appearance of the image, making them highly effective against Tree-Ring. We also observe a similar phenomenon for surrogate detector attacks (Figure 21, Figure 22), which successfully disturb the latent features, including those related to the watermark. Stable Signature is vulnerable to regeneration attacks due to its unique watermarking protocol. Recall that latent diffusion models first perform diffusion in the latent space, then map to the image space through a VAE decoder. To embed watermarks, Stable Signature roots the watermark in the VAE decoder by training. However, regeneration attacks circumvent this special decoder by using an alternate VAE or diffusion model with a different decoder. As a result, the regenerated images are stripped of the original watermarks.
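To make this mechanism concrete, here is a minimal VAE-regeneration sketch in the spirit of Regen-KLVAE, using a publicly available Stable Diffusion VAE from the diffusers library. The checkpoint name is only an example of a public KL-VAE, and the noise-free round trip is an illustrative simplification.

```python
import torch
from diffusers import AutoencoderKL
from torchvision.transforms.functional import to_tensor, to_pil_image
from PIL import Image

# A publicly available KL-VAE; any checkpoint whose decoder differs from the
# victim model's fine-tuned decoder would serve the same purpose.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def regenerate(img: Image.Image) -> Image.Image:
    """Round-trip an image through the VAE: encode to latents, decode back.

    A watermark rooted in a fine-tuned decoder (e.g., Stable Signature) does
    not survive decoding through this unrelated decoder.
    """
    x = to_tensor(img).unsqueeze(0) * 2 - 1           # [0, 1] -> [-1, 1]
    latents = vae.encode(x).latent_dist.sample()
    x_rec = vae.decode(latents).sample
    return to_pil_image((x_rec.squeeze(0).clamp(-1, 1) + 1) / 2)
```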
Limitations of attacks. As shown in Table 5, we focus on realistic attacks where attackers have very limited knowledge and are unaware of the watermarking algorithm in all scenarios. Distortion, regeneration, and adversarial embedding attacks (except for the grey-box setting) are universal attacks that do not use specific watermark or model information. Therefore, their effectiveness may vary. Adversarial surrogate detector attacks target a watermark by training a surrogate detector on watermarked images. However, we found that they do not always work due to the transferability problem. Since the attackers do not know the true detector, the architecture of the surrogate detector (e.g., ResNet18 in this paper) may differ significantly from the true detector. Besides, there might be many features that can distinguish non-watermarked and watermarked images. Hence, despite achieving high classification accuracy, the surrogate may rely on features different from those of the true detector, leading to unsuccessful transfer of attacks. Enhanced attacker knowledge, such as of the watermarking algorithm, could facilitate more effective adversarial attacks, as explored in Lukas et al. (2023).

Potential strategies to improve robustness. Although we reveal many vulnerabilities of existing watermarks, there are potential ways to improve them. For watermarks that rely on image perturbations for encoder/decoder training (StegaStamp, Stable Signature), including more types of transformations may improve robustness. For example, we have observed in internal testing that training Stable Signature's extractor with blur and rotation transformations as data augmentations improves its robustness to these transformations but also marginally reduces the encoded image quality. Similar to blur and rotation, we can add other transformations, such as adversarial perturbations and regeneration, as data augmentations to improve robustness towards them. There is also ample opportunity to improve the algorithmic frameworks themselves. For example, Tree-Ring relies on DDIM inversion, which we found is not accurate even without attack, directly affecting the watermark detection accuracy. Future work can improve it by incorporating cutting-edge techniques for more accurate DDIM inversion. For watermarks such as Tree-Ring, one may also insert a trainable U-Net which restores the watermark before it is extracted. Such a strategy may degrade the image to enhance the signal of the message, but this is irrelevant from the perspective of the image owner, whose only goal is to simply detect their watermark. For more agnostic strategies: (1) Incorporating redundant bits. This technique, known as error correction coding, can help reconstruct the original message even when parts of the watermark are corrupted. (2) A hybrid approach. Since different watermarks have varied vulnerabilities, one can try to combine different watermarks, leveraging their strengths to defend against a wider range of attacks.

4.5. Summary of Takeaway Messages

WAVES provides a standardized framework for benchmarking watermark robustness and attack potency. WAVES evaluates both detection and identification tasks. It unifies the quality metrics and assesses attack potency against both performance degradation and quality degradation. The Performance vs. Quality 2D plots allow for a comprehensive analysis of various watermarks in one unified framework.
With over twenty attacks tested, WAVES exposes new vulnerabilities in popular watermarking techniques.

Different watermarking methods have different vulnerabilities. Our analysis reveals significant differences in watermark vulnerabilities against attacks. Specifically, Tree-Ring is more vulnerable to adversarial attacks, which generally cause less quality degradation, while Stable Signature is susceptible to most regeneration attacks. This diversity in vulnerabilities highlights the imperative for watermarking methods to identify and strengthen their specific weak areas.

Avoid using publicly available VAEs. WAVES demonstrates the risks of using publicly available VAEs in watermarked diffusion models. An adversarial embedding attack using the same VAE easily compromises Tree-Ring by altering latent features with little visual change. Stable Signature's design renders it vulnerable to regeneration attacks that use a VAE with an encoder identical to the victim model's VAE encoder, coupled with a different decoder. Today's proprietary generators, like DALL·E 3, typically train the latent diffusion model themselves but use a publicly available VAE. This practice, especially with Tree-Ring or Stable Signature watermarking, increases vulnerability, pointing to a critical security concern in those popular AI services.

The robustness of StegaStamp potentially illuminates a path for future robust watermarks. The StegaStamp watermark (Tancik et al., 2020) stands out in our evaluation for its robustness. Designed for physical-world use, which requires high robustness, StegaStamp is trained with a series of distortions that mimic real-world scenarios, significantly enhancing its robustness. However, it is important to recognize the potential trade-off between watermark robustness and quality. As a post-processing method, the original paper finds that StegaStamp may introduce artifacts. In contrast, this might not pose a problem for in-processing watermarks. Therefore, in-processing watermarks could still benefit from incorporating augmentation or adversarial training.

Acknowledgements

We thank Souradip Chakraborty and Amrit Singh Bedi for insightful discussions. An, Ding, Rabbani, Xu, Deng, Zhu, and Huang are supported by DARPA Transfer from Imprecise and Abstract Models to Autonomous Technologies (TIAMAT) 80321, National Science Foundation NSF-IIS-2147276 FAI, DOD-ONR Office of Naval Research under award number N00014-22-12335, DOD-AFOSR Air Force Office of Scientific Research under award number FA9550-23-1-0048, DOD-DARPA Defense Advanced Research Projects Agency Guaranteeing AI Robustness against Deception (GARD) HR00112020007, and Adobe, Capital One, and JP Morgan faculty fellowships. Wen and Goldstein are supported by the ONR MURI program, the AFOSR MURI program, the National Science Foundation (IIS-2212182), the NSF TRAILS Institute (2229885), Capital One Bank, the Amazon Research Award program, and Open Philanthropy.

Impact Statement

This work contains research that could be used to remove watermarks from images. However, our research is focused on uncovering vulnerabilities in watermarking systems to guide the development of more robust designs. As publicly available generative imaging services like OpenAI's DALL·E, Midjourney, and Bing Image Creator become more popular, the demand for effective watermarks is intensifying.
We test and contribute a large collection of distortion, regeneration, and adversarial attacks, setting a benchmark for evaluating and enhancing watermark strength. As the legal status of AI-generated content evolves, robust watermarking will become increasingly crucial for protecting creative ownership and preventing the misrepresentation of AI-generated content as real. Our research not only contributes to identifying weaknesses in watermarks but also advances the detection capabilities of AI-generated content, supporting the development of this significant aspect of digital watermarking technology.

References

Ahmadi, M., Norouzi, A., Karimi, N., Samavi, S., and Emami, A. ReDMark: Framework for residual diffusion watermarking based on deep networks. Expert Systems with Applications, 146:113157, 2020.

Al-Haj, A. Combined DWT-DCT digital image watermarking. Journal of Computer Science, 3(9):740-746, 2007.

Ballé, J., Minnen, D., Singh, S., Hwang, S. J., and Johnston, N. Variational image compression with a scale hyperprior. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=rkcQFMZRb.

Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., and Mukhopadhyay, D. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069, 2018.

Chang, C.-C., Tsai, P., and Lin, C.-C. SVD-based digital image watermarking scheme. Pattern Recognition Letters, 26(10):1577-1586, 2005.

Chen, H., Rouhani, B. D., Fu, C., Zhao, J., and Koushanfar, F. DeepMarks: A secure fingerprinting framework for digital rights management of deep learning models. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 105-113, 2019.

Cox, I., Miller, M., Bloom, J., Fridrich, J., and Kalker, T. Digital watermarking and steganography. Morgan Kaufmann, 2007.

Cox, I. J., Kilian, J., Leighton, T., and Shamoon, T. Secure spread spectrum watermarking for images, audio and video. In Proceedings of 3rd IEEE International Conference on Image Processing, volume 3, pp. 243-246. IEEE, 1996.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780-8794, 2021.

Dong, Y., Chen, H., Chen, J., Fang, Z., Yang, X., Zhang, Y., Tian, Y., Su, H., and Zhu, J. How robust is Google's Bard to adversarial image attacks? arXiv preprint arXiv:2309.11751, 2023.

Executive Office of the President. Executive order 14110: Safe, secure, and trustworthy development and use of artificial intelligence. Federal Register, 88:75191-75226, 2023.

Fernandez, P., Sablayrolles, A., Furon, T., Jégou, H., and Douze, M. Watermarking images in self-supervised latent spaces. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3054-3058. IEEE, 2022.

Fernandez, P., Couairon, G., Jégou, H., Douze, M., and Furon, T. The stable signature: Rooting watermarks in latent diffusion models. arXiv preprint arXiv:2303.15435, 2023.

Hayes, J. and Danezis, G. Generating steganographic images via adversarial training. Advances in Neural Information Processing Systems, 30, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems, 32, 2019.

Inkawhich, N., Wen, W., Li, H. H., and Chen, Y. Feature space perturbations yield more transferable adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7066-7074, 2019.

Jia, H., Choquette-Choo, C. A., Chandrasekaran, V., and Papernot, N. Entangled watermarks as a defense against model extraction. In 30th USENIX Security Symposium (USENIX Security 21), pp. 1937-1954, 2021a.

Jia, Z., Fang, H., and Zhang, W. MBRS: Enhancing robustness of DNN-based watermarking by mini-batch of real and simulated JPEG compression. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 41-49, 2021b.

Jiang, Z., Zhang, J., and Gong, N. Z. Evading watermark based detection of AI-generated content. arXiv preprint arXiv:2305.03807, 2023.

Kutter, M. and Petitcolas, F. A. Fair benchmark for image watermarking systems. In Security and Watermarking of Multimedia Contents, volume 3657, pp. 226-239. SPIE, 1999.

Kynkäänniemi, T., Karras, T., Aittala, M., Aila, T., and Lehtinen, J. The role of ImageNet classes in Fréchet inception distance. arXiv preprint arXiv:2203.06026, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pp. 740-755. Springer, 2014.

Lukas, N. and Kerschbaum, F. PTW: Pivotal tuning watermarking for pre-trained image generators. arXiv preprint arXiv:2304.07361, 2023.

Lukas, N., Diaa, A., Fenaux, L., and Kerschbaum, F. Leveraging optimization for adaptive attacks on image watermarks. arXiv preprint arXiv:2309.16952, 2023.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.

Ó Ruanaidh, J., Dowling, W., and Boland, F. Watermarking digital images for copyright protection. IEE Proceedings - Vision, Image and Signal Processing, 143:250-256, 1996.

Ó Ruanaidh, J. J. and Pun, T. Rotation, scale and translation invariant digital image watermarking. In Proceedings of International Conference on Image Processing, volume 1, pp. 536-539. IEEE, 1997.

Petitcolas, F. A. Watermarking schemes evaluation. IEEE Signal Processing Magazine, 17(5):58-64, 2000.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision.
In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695, 2022.

Rouhani, B. D., Chen, H., and Koushanfar, F. DeepSigns: A generic watermarking framework for IP protection of deep learning models. arXiv preprint arXiv:1804.00750, 2018.

Saberi, M., Sadasivan, V. S., Rezaei, K., Kumar, A., Chegini, A., Wang, W., and Feizi, S. Robustness of AI-image detectors: Fundamental limits and practical attacks. arXiv preprint arXiv:2310.00076, 2023.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Tancik, M., Mildenhall, B., and Ng, R. StegaStamp: Invisible hyperlinks in physical photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2117-2126, 2020.

Tao, H., Chongmin, L., Zain, J. M., and Abdalla, A. N. Robust image watermarking theories and techniques: A review. Journal of Applied Research and Technology, 12(1):122-138, 2014.

Wang, Z. J., Montoya, E., Munechika, D., Yang, H., Hoover, B., and Chau, D. H. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv preprint arXiv:2210.14896, 2022.

Wen, Y., Kirchenbauer, J., Geiping, J., and Goldstein, T. Tree-ring watermarks: Fingerprints for diffusion images that are invisible and robust. arXiv preprint arXiv:2305.20030, 2023.

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., and Dong, Y. ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.

Yu, N., Skripniuk, V., Abdelnabi, S., and Fritz, M. Artificial fingerprinting for generative models: Rooting deepfake attribution in training data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14448-14457, 2021.

Zeng, Y., Zhou, M., Xue, Y., and Patel, V. M. Securing deep generative models with universal adversarial signature. arXiv preprint arXiv:2305.16310, 2023.

Zhang, K. A., Xu, L., Cuesta-Infante, A., and Veeramachaneni, K. Robust invisible video watermarking with attention. arXiv preprint arXiv:1909.01285, 2019.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Zhao, X., Zhang, K., Su, Z., Vasan, S., Grishchenko, I., Kruegel, C., Vigna, G., Wang, Y.-X., and Li, L. Invisible image watermarks are provably removable using generative AI, 2023a.

Zhao, Y., Pang, T., Du, C., Yang, X., Cheung, N.-M., and Lin, M. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023b.

Zhu, J., Kaplan, R., Johnson, J., and Fei-Fei, L. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 657-672, 2018.

WAVES: Benchmarking the Robustness of Image Watermarks
Supplementary Material

A. A Mini Survey of Image Watermarks
B. Formalism of Watermark Detection and Identification
  B.1 Detection
  B.2 Identification
C. Details on Performance Metrics
  C.1 Clarifications on p-Value
  C.2 Performance Metrics for User Identification
  C.3 Other Performance Metrics
D. Design Choices of WAVES
  D.1 Dataset Preparation
  D.2 Selection of Watermark Representatives
E. Evaluation Details
  E.1 Watermarking Protocol and Evaluation Workflow
  E.2 Performance Evaluation Metrics
  E.3 Processing Results
  E.4 Normalization and Aggregation of Quality Metrics
  E.5 Details of Benchmarking Watermarks
  E.6 Details of Benchmarking Attacks
F. Details of Attacks
  F.1 Distortion Attacks
  F.2 Regeneration Attacks
  F.3 Adversarial Attacks
G. Additional Results
  G.1 More Results for Identification
  G.2 More Analyses on Surrogate Detector Attacks
  G.3 Visualization of Attacks
  G.4 Full Results on DiffusionDB, MS-COCO and DALL·E 3
  G.5 Evaluation on Additional Watermarks: DWT-DCT and MBRS
H. Limitations

A. A Mini Survey of Image Watermarks

In this section, we detail the existing landscape of watermarking approaches in the era of ubiquitous AI-Generated Content (AIGC). Figure 10 depicts our scenario of interest. First, an AI company/owner embeds a watermark into its generated images. Then, if the owner is shown one of their watermarked images at a later point in time, they can identify ownership of it by recovering the watermark message. Commonly, users might modify watermarked images for legitimate personal purposes. There are also instances where users attempt to erase a watermark for malicious reasons, such as disseminating fake information or infringing upon copyright. For simplicity, we term any image manipulation as an attack.

Figure 10. An illustration of a robust watermarking workflow. An AI company provides two services: (1) generate watermarked images, i.e., embed invisible messages, and (2) detect these messages when shown any of their watermarked images. There is an attack stage between the watermarking and detection stages.
The watermarked images may experience natural distortions (e.g., compression, re-scaling) or be manipulated by malicious users attempting to remove the watermarks. A robust watermarking method should still be able to detect the original message after an attack.

Watermarking AI-generated Images. Imprinting invisible watermarks into digital images has a long and rich history. From conventional steganography to recent generative-model-based methods, we group popular watermarking techniques into two categories: post-processing methods and in-processing methods.

Post-processing approaches embed post-hoc watermarks into images. When watermarking AI-generated images, we apply such methods after the generation process. Post-processing watermarks are model-agnostic and applicable to any image. However, they sometimes introduce human-visible artifacts, compromising image quality. We review popular post-processing methods.

P1) Frequency-domain methods. These methods manipulate the representation of an image in some transform domain (Ó Ruanaidh et al., 1996; Cox et al., 1996; Ó Ruanaidh & Pun, 1997). The image transform can be a Discrete Wavelet Transform (DWT), a Discrete Cosine Transform (DCT) (Cox et al., 2007), or a singular value decomposition (SVD) (Chang et al., 2005). These transformations have a range of invariance properties that make them robust to translation and resizing. The commercial implementation of Stable Diffusion (Rombach et al., 2022) uses DWT-DCT (Al-Haj, 2007) to watermark its generated images. However, many studies have shown that these watermarks are vulnerable to common image manipulations (Zhao et al., 2023a).

P2) Deep encoder-decoder methods. These methods rely on trained networks for embedding and decoding the watermark (Hayes & Danezis, 2017). Methods such as HiDDeN (Zhu et al., 2018) and RivaGAN (Zhang et al., 2019) learn an encoder to imprint a hidden message inside an image and a decoder (also called a detector) to extract the message. To train robust watermarks, ReDMark (Ahmadi et al., 2020) integrates differentiable attack layers between the encoder and decoder in the end-to-end training process; RivaGAN (Zhang et al., 2019) employs an adversarial network that tries to remove the watermark during training; StegaStamp (Tancik et al., 2020) adds a series of strong image perturbations between the encoder and decoder during training, resulting in watermarks that are robust to real-world distortions caused by photographing an image as it appears on a display.

P3) Others. There are other varieties of post-processing methods that do not fall into P1 or P2. SSL (Fernandez et al., 2022) embeds watermarks in self-supervised latent spaces by shifting the image's features into a designated region. DeepSigns (Rouhani et al., 2018) and DeepMarks (Chen et al., 2019) embed target watermarks into the probability density functions of weights and activation maps. Entangled Watermarks (Jia et al., 2021a) designs a reinforced watermark based on a target watermark and the task data.

In-processing methods adapt generative models to directly embed watermarks as part of the image generation process, substantially reducing or eliminating visible artifacts. With diffusion models presently dominating the field of image generation, a surge of in-processing approaches specific to these models has recently emerged. We categorize current work into three categories.

I1) Model modification. The entire model.
This line of work inherits the encoder-decoder idea and bakes the encoder into the entire generative model. This is usually accomplished by watermarking training images with a pre-trained watermark encoder and decoder, then training or fine-tuning the generative model on these watermarked images (Yu et al., 2021; Zeng et al., 2023; Lukas & Kerschbaum, 2023). This type of method has been shown to work well on small models like guided diffusion, but it suffers from the expensive training of large text-to-image generation models (Zhao et al., 2023b), making it inapplicable in practice. Parts of the model. Stable Signature (Fernandez et al., 2023) follows the above two-stage training pipeline while only fine-tuning the decoder of the latent diffusion model (LDM) (Rombach et al., 2022), leaving the diffusion component unchanged. This type of watermarker is much more efficient to train. By fine-tuning multiple latent decoders, the model can embed different messages into images. The robustness of both types of model modification critically relies on the robustness of the pre-trained encoder and decoder.

I2) Modification of a random seed. Tree-Ring (Wen et al., 2023), different from all the above methods, embeds a pattern into the initial noise vector used by a diffusion model for sampling. The pattern can be retrieved at detection time by inverting the diffusion process using DDIM (Song et al., 2020) as the sampler. This method does not require any training, can easily embed different watermarks, and is robust to many simple distortions and attacks. The robustness of Tree-Ring relies on the accuracy of the DDIM inversion.

Removing Watermarks. Robustness is an essential property of watermarks. Evaluations of robustness in the existing literature focus on simple image distortions like rotation, Gaussian blur, etc. Recently, inspired by adversarial purification (Nie et al., 2022), Zhao et al. (2023a) and Saberi et al. (2023) both find that regenerating images by noising and then denoising them through a diffusion model or a VAE can effectively remove some watermarks. Saberi et al. (2023) propose adversarial attacks based on a trained surrogate watermark detector. Lukas et al. (2023) also introduce adversarial attacks, but these require knowledge of the watermarking algorithm and a similar surrogate generative model. Jiang et al. (2023) study white-box attacks and black-box query-based attacks. Some of these attacks are not possible in realistic scenarios where the attacker has only API access. Furthermore, existing evaluations use differing quality/performance metrics, making it difficult to compare effectiveness between watermarking methods and between attacks.

Benchmarks for Image Watermarks. Before the advent of AIGC, several significant benchmarks greatly accelerated the progress of watermark standardization (Kutter & Petitcolas, 1999; Tao et al., 2014; Petitcolas, 2000). With the development of AIGC, however, the need to watermark AI-generated images has become urgent, and earlier watermarking methods lack the robustness this setting requires. Many new methods for watermarking AI-generated images have been proposed, but each evaluates robustness differently. This paper therefore proposes a benchmark for the AIGC era.
B. Formalism of Watermark Detection and Identification

Invisible image watermarks, inspired by classical watermarks protecting the intellectual property of creators, are now applied in a wider range of scenarios. With the rapid development of generative AI models, most current research focuses on applying invisible watermarks to (1) identify AI-generated images (AI detection) (Saberi et al., 2023) and (2) identify the user who generated an image, for source tracking (user identification) (Fernandez et al., 2023). To fairly evaluate different watermarking methods across applications, we start by formulating a general, message-based watermarking protocol, partially adopting the notation of Lukas et al. (2023), which generalizes most existing setups.

Let θG denote an image generator, M the space of watermark messages, and X the domain of images. We assume M is a metric space with distance function D(·, ·). The choice of message space M can be very different depending on the watermarking algorithm: for Tree-Ring, messages are random complex Gaussians, while for Stable Signature and StegaStamp, each message is a length-d binary string, where d denotes the length of the message. For watermarking algorithms following the encoder-decoder training approach, like Stable Signature and StegaStamp, the message length d is fixed after training. Some methods, such as Tree-Ring, enjoy flexible message length at the time of injecting watermarks. In addition to classifying images as watermarked or non-watermarked, a good detector will often provide a p-value for the watermark detection, which measures the probability that the level of watermark strength observed in an image could occur by random chance. The Tree-Ring watermark also includes an image location parameter τ to embed a message m ∈ M, but we subsume this under the parameters of θG. We now introduce several important watermarking operations:

EMBED : θG × M → X is the generative procedure that creates a watermarked image given user-defined parameters of θG (such as prompt, guidance scale, etc. for a diffusion model) and a target message m ∈ M.

DECODE : X → M is a recovery procedure for a message m embedded within a watermarked image x = EMBED(θG, m). In particular, the recovery m′ = DECODE(x) may be imperfect, i.e., m′ ≠ m.

VERIFYα : M × M → {0, 1} is conducted by the model owner to decide whether x was watermarked by inspecting m′ = DECODE(x), where x = EMBED(θG, m). For a decoded message m′, we consider the following p-value (further discussed in Section C) for evaluating whether the image could have been watermarked using m, defined as p = P_{ω∼M}[D(ω, m′) ≤ D(m, m′)], i.e., the probability that a message ω drawn at random from M is at least as close to the decoded message m′ as the embedded message m is.

F.2. Regeneration Attacks

Rinsing regenerations with deep noising (>30 noising steps) significantly alter image quality, as evidenced by Figure 17.

Figure 16. Regenerative diffusion with varying depth of noising steps and a VAE regeneration with a low quality factor. Panels: (a) Regen-Diff-40, (b) Regen-Diff-120, (c) Regen-Diff-200, (d) Regen-VAE-1.

Figure 17. 4x rinsing regeneration with varying depth of noising steps per diffusion. Panels: (a) Rinse-4xDiff-10, (b) Rinse-4xDiff-30, (c) Rinse-4xDiff-50.

F.2.1. PROMPTED REGENERATION

We propose a simple variation on the regenerative diffusion attack: if an image was produced from a known prompt, the attacker uses that prompt to guide the diffusion of their surrogate model. This type of attack is reasonable and realistic for users of online generative models such as DALL·E or Midjourney.
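A minimal sketch of such a prompted regeneration is shown below. It assumes the attacker's surrogate is Stable Diffusion v1.4 accessed through the diffusers image-to-image pipeline; the file names, example prompt, and strength value are illustrative placeholders rather than the exact WAVES configuration.

```python
# Sketch of a prompted regeneration attack (Regen-DiffP-style), assuming the
# attacker knows the prompt used to generate the watermarked image and uses
# Stable Diffusion v1.4 as a surrogate model. Illustrative only.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

watermarked = Image.open("watermarked.png").convert("RGB").resize((512, 512))
prompt = "digital painting of a lake at sunset surrounded by forests and mountains"

# `strength` controls how deeply the image is re-noised before being denoised;
# larger values correspond to more noising steps and stronger watermark removal,
# at the cost of larger changes to the image.
attacked = pipe(prompt=prompt, image=watermarked,
                strength=0.2, guidance_scale=7.5).images[0]
attacked.save("regen_diffp.png")
```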
Figure 3 and Tables 6 and 3 indicate that this attack, labeled Regen-DiffP, is slightly stronger than conventional Regen-Diff.

F.2.2. MIXED REGENERATION

Mixed regeneration refers to any attack that applies a regenerative diffusion to an image and then a VAE-style regeneration for the purpose of denoising. In Figure 3, we label examples of such attacks as RinseD-VAE and RegenD-KLVAE, which respectively denote VAE and KL-VAE denoising following a 4x rinsing regeneration with 50 steps (Rinse-4xDiff-50). According to Figure 3, such a combination improves PSNR and CLIP-FID compared to Rinse-4xDiff alone. The restorative effects of mixed regeneration are visually observable for shallower (i.e., 2x or 3x) rinsing regenerations, as depicted in Figure 18. We do not extensively study or rank such attacks in this work, but include them as a topic for future research.

Figure 18. An image of a dragon attacked using a 3x rinsing regeneration. Pushing the image through a VAE restores image quality, noticeable in the eye color of the dragon (indicated by the green box). Panels: (a) Unattacked, (b) Rinse-3xDiff, (c) Rinse-3xDiff+VAE. The image is drawn from the Gustavosta Stable Diffusion Prompts dataset, available at https://huggingface.co/datasets/Gustavosta/Stable-Diffusion-Prompts.

All tested regeneration attacks are summarized as follows, with five evenly divided strengths between the listed minimum and maximum unless specified otherwise:

- Regeneration via diffusion: passes the image through Stable Diffusion v1.4; strength is the number of noising/denoising timesteps, from 40 to 200.
- Regeneration via prompted diffusion: passes the image through Stable Diffusion v1.4 conditioned on its generative prompt; strength is the number of noising/denoising timesteps, from 40 to 200.
- Regeneration via VAE: the image is encoded and then decoded by a pre-trained VAE (bmshj2018; Ballé et al., 2018); strength is the quality level, from 1 to 7.
- Regeneration via KL-VAE: the image is encoded and then decoded by a pre-trained KL-regularized autoencoder; strength is the bottleneck size, 4, 8, 16, or 32.
- Rinsing regeneration 2x: the image is noised and then denoised by Stable Diffusion v1.4 two times; strength is the number of timesteps per diffusion, from 20 to 100.
- Rinsing regeneration 4x: the image is noised and then denoised by Stable Diffusion v1.4 four times; strength is the number of timesteps per diffusion, from 10 to 50.
- Mixed regeneration via VAE: the image is passed through a 4x rinsing regeneration (50 timesteps per diffusion) and then a VAE; strength is the quality level, from 1 to 7.
- Mixed regeneration via KL-VAE: the image is passed through a 4x rinsing regeneration (50 timesteps per diffusion) and then a KL-VAE; strength is the bottleneck size, 4, 8, 16, or 32.

F.3. Adversarial Attacks

F.3.1. EMBEDDING ATTACK

The embedding attacks use off-the-shelf encoders and are untargeted. We use the Projected Gradient Descent (PGD) algorithm (Madry et al., 2017) to optimize the adversarial examples. We conduct the attack using a range of perturbation budgets ε, specifically {2/255, 4/255, 6/255, 8/255}. All attacks are configured with a step size of α = 0.05ε and 200 total iterations. The attacks operate on watermarked images, aiming to remove the watermark by perturbing the images' latent representations; a sketch of this procedure is given below.
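The following minimal sketch illustrates such an untargeted embedding attack, under the assumptions that the off-the-shelf encoder is the Stable Diffusion KL-VAE and that the untargeted objective pushes the attacked image's latent away from that of the original watermarked image; the hyperparameters follow the ranges above, but the exact WAVES implementation may differ.

```python
# Sketch of an untargeted PGD embedding attack (illustrative assumptions:
# the off-the-shelf encoder is the Stable Diffusion KL-VAE, and the attack
# maximizes the distance between the attacked and original latents).
import torch
from diffusers import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae").to(device).eval()

def embed(x):
    # Latent representation of an image batch x with pixels in [0, 1].
    return vae.encode(2 * x - 1).latent_dist.mean

def pgd_embedding_attack(x, eps=8 / 255, iters=200):
    alpha = 0.05 * eps                                   # step size (F.3.1)
    with torch.no_grad():
        z_orig = embed(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = torch.norm(embed(x + delta) - z_orig)     # untargeted: maximize
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()           # gradient ascent step
            delta.clamp_(-eps, eps)                      # project to the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)     # keep pixels in [0, 1]
        delta.grad.zero_()
    return (x + delta).detach()
```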
F.3.2. SURROGATE DETECTOR ATTACK

Figure 5 illustrates the three settings for training the surrogate detectors. In all three settings, we train the surrogate detectors by fine-tuning a ResNet-18 (the torchvision implementation, https://pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html) for 10 epochs with a learning rate of 0.001 and a batch size of 128. The training images are either generated by the victim generator with the ImageNet text prompts "A photo of a {ImageNet class name}", or real ImageNet images. We randomly shuffle these images and build the binary training set according to each setting.

In the AdvCls-UnWM&WM setting, we train the surrogate detector with 3000 images (1500 images per class), since we find that a larger training set can lead to overfitting. In the AdvCls-Real&WM and AdvCls-WM1&WM2 settings, we train the surrogate detector with 15000 images (7500 images per class). The watermarked images in AdvCls-WM1&WM2 are embedded with two distinct messages: one is the message used in the test watermarked images, and the other is randomly generated. In all three settings, we use 5000 images (2500 images per class) for validation (drawn from the same source as the training set), and training yields nearly 100% validation accuracy in all cases.

After completing the training phase, the adversary executes a Projected Gradient Descent (PGD) attack on the surrogate detector using the testing data (DiffusionDB, MS-COCO, DALL·E 3). In all three settings, we conduct the attack using a range of perturbation budgets ε, specifically {2/255, 4/255, 6/255, 8/255}. The attack is configured with a step size of α = 0.01ε and 50 total iterations. By flipping the target label, the adversary can try either to remove or to add watermarks. The analyses of the results appear in Appendix G.2.

G. Additional Results

G.1. More Results for Identification

Figure 19 shows the performance vs. quality degradation plots under the user identification setting. Table 6 presents the ranking of attacks in the identification setup.

Figure 19. Aggregated performance vs. quality degradation 2D plots under the identification setup (one million users). We evaluate each watermarking method under various attacks. Two dashed lines show the thresholds used for ranking attacks.

G.2. More Analyses on Surrogate Detector Attacks

The AdvCls-UnWM&WM attack leverages a surrogate model to distinguish between images that are watermarked and those that are not. As demonstrated in Figure 6, the PGD attack is effective in removing watermarks by flipping the label of watermarked images. This raises a question: is it possible to similarly add watermarks to clean images by flipping their labels? This process, commonly referred to as a spoofing attack because it produces false detections of watermarks in clean images, is explored in our study. However, as illustrated in Figure 20, our attempts to add watermarks to clean images by simply flipping the labels were unsuccessful. In this experiment, detailed in Figure 20, we focus exclusively on unwatermarked images, aiming to introduce watermarks, while leaving already watermarked images untouched. Despite employing the most intensive perturbations, we were unable to artificially add watermarks to these images. This outcome leads to an intriguing question: why is the technique effective in removing watermarks but not in adding them? We delve into the underlying reasons for this asymmetry in Figure 21. A sketch of the label-flipping attack is given below.
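The sketch below shows targeted PGD against a trained surrogate detector. The class convention (0 = non-watermarked, 1 = watermarked), the cross-entropy objective, and the `detector` handle standing in for the fine-tuned ResNet-18 are assumptions for illustration; removal and spoofing differ only in the choice of target label.

```python
# Sketch of the label-flipping PGD attack on a surrogate detector.
# Assumptions (not necessarily the exact WAVES code): `detector` is the
# fine-tuned ResNet-18 with two outputs, class 0 = non-watermarked,
# class 1 = watermarked, and the loss is cross-entropy toward the target class.
import torch
import torch.nn.functional as F

def pgd_on_surrogate(detector, x, spoof=False, eps=8 / 255, iters=50):
    """spoof=False: push watermarked images toward 'non-watermarked' (removal).
    spoof=True: push clean images toward 'watermarked' (spoofing attempt)."""
    alpha = 0.01 * eps                      # step size, as in F.3.2
    target = torch.full((x.size(0),), 1 if spoof else 0,
                        dtype=torch.long, device=x.device)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(detector(x + delta), target)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()        # descend toward the target class
            delta.clamp_(-eps, eps)                   # stay within the eps-ball
            delta.copy_((x + delta).clamp(0, 1) - x)  # keep pixels valid
        delta.grad.zero_()
    return (x + delta).detach()
```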
Table 6. Comparison of attacks across three watermarking methods under the identification setup (one million users). Q denotes the normalized quality degradation and P denotes the performance, as derived from the aggregated 2D plots. Q@0.7P measures quality degradation at a 0.7 performance threshold, where "inf" denotes cases in which all tested attack strengths yield performance above 0.7 and "-inf" denotes cases in which all are below; Q@0.4P is defined analogously. Avg P and Avg Q are the average performance and quality degradation over all attack strengths. The lower the performance and the smaller the quality degradation, the stronger the attack. For each watermarking method, we rank attacks by Q@0.7P, Q@0.4P, Avg P, and Avg Q, in that order, with lower values indicating stronger attacks. The top 5 attacks of each watermarking method are highlighted in red. For each watermark, the columns list Rank, Q@0.7P, Q@0.4P, Avg P, Avg Q.

Attack | Tree-Ring (Rank, Q@0.7P, Q@0.4P, Avg P, Avg Q) | Stable Signature (Rank, Q@0.7P, Q@0.4P, Avg P, Avg Q) | StegaStamp (Rank, Q@0.7P, Q@0.4P, Avg P, Avg Q)
Dist-Rotation | 8, -inf, 0.434, 0.131, 0.648 | 12, 0.613, 0.642, 0.400, 0.650 | 4, 0.454, 0.500, 0.288, 0.616
Dist-RCrop | 11, -inf, 0.592, 0.094, 0.463 | 24, inf, inf, 0.972, 0.461 | 6, 0.602, 0.602, 0.494, 0.451
Dist-Erase | 26, inf, inf, 0.986, 0.490 | 25, inf, inf, 0.988, 0.489 | 25, inf, inf, 1.000, 0.483
Dist-Bright | 22, inf, inf, 0.913, 0.304 | 23, inf, inf, 0.982, 0.305 | 22, inf, inf, 0.995, 0.317
Dist-Contrast | 23, inf, inf, 0.949, 0.243 | 20, inf, inf, 0.979, 0.243 | 17, inf, inf, 0.994, 0.231
Dist-Blur | 21, 1.105, 1.437, 0.551, 1.221 | 5, -inf, -inf, 0.000, 1.204 | 9, 0.897, 0.970, 0.280, 1.198
Dist-Noise | 16, 0.427, inf, 0.728, 0.395 | 8, 0.415, 0.480, 0.633, 0.390 | 24, inf, inf, 1.000, 0.360
Dist-JPEG | 17, 0.499, 0.499, 0.700, 0.284 | 9, 0.485, 0.485, 0.540, 0.284 | 21, inf, inf, 0.995, 0.263
DistCom-Geo | 9, -inf, 0.559, 0.105, 0.768 | 13, 0.788, 0.835, 0.519, 0.767 | 7, 0.676, 0.717, 0.359, 0.733
DistCom-Photo | 23, inf, inf, 0.947, 0.242 | 20, inf, inf, 0.981, 0.243 | 17, inf, inf, 0.994, 0.239
DistCom-Deg | 18, 0.556, 0.864, 0.570, 0.694 | 7, 0.216, 0.281, 0.183, 0.679 | 8, 0.870, 0.957, 0.737, 0.664
DistCom-All | 10, -inf, 0.575, 0.123, 0.908 | 11, 0.550, 0.623, 0.176, 0.900 | 10, 0.995, 1.096, 0.682, 0.870
Regen-Diff | 6, -inf, 0.307, 0.258, 0.323 | 1, -inf, -inf, 0.000, 0.300 | 2, 0.333, inf, 0.766, 0.327
Regen-DiffP | 6, -inf, 0.308, 0.256, 0.327 | 1, -inf, -inf, 0.000, 0.303 | 1, 0.336, 0.356, 0.763, 0.329
Regen-VAE | 19, 0.578, 0.578, 0.701, 0.348 | 10, 0.545, 0.545, 0.340, 0.339 | 23, inf, inf, 1.000, 0.343
Regen-KLVAE | 14, 0.257, inf, 0.810, 0.233 | 6, -inf, -inf, 0.047, 0.206 | 17, inf, inf, 0.999, 0.240
Rinse-2xDiff | 5, -inf, 0.270, 0.220, 0.357 | 3, -inf, -inf, 0.000, 0.332 | 3, 0.390, 0.402, 0.778, 0.366
Rinse-4xDiff | 1, -inf, -inf, 0.110, 0.466 | 4, -inf, -inf, 0.000, 0.438 | 5, 0.488, 0.676, 0.687, 0.477
AdvEmbG-KLVAE8 | 4, -inf, 0.168, 0.259, 0.253 | 20, inf, inf, 0.985, 0.249 | 17, inf, inf, 1.000, 0.232
AdvEmbB-RN18 | 15, 0.288, inf, 0.811, 0.218 | 17, inf, inf, 0.990, 0.212 | 14, inf, inf, 1.000, 0.196
AdvEmbB-CLIP | 20, 0.697, inf, 0.798, 0.549 | 26, inf, inf, 0.991, 0.541 | 25, inf, inf, 1.000, 0.488
AdvEmbB-KLVAE16 | 12, 0.158, 0.309, 0.540, 0.238 | 19, inf, inf, 0.983, 0.233 | 14, inf, inf, 1.000, 0.206
AdvEmbB-SdxlVAE | 13, 0.214, inf, 0.692, 0.221 | 17, inf, inf, 0.986, 0.219 | 14, inf, inf, 1.000, 0.204
AdvCls-UnWM&WM | 2, -inf, 0.123, 0.352, 0.145 | 14, inf, inf, 0.991, 0.101 | 11, inf, inf, 1.000, 0.101
AdvCls-Real&WM | 25, inf, inf, 0.986, 0.047 | 14, inf, inf, 0.990, 0.092 | 11, inf, inf, 1.000, 0.106
AdvCls-WM1&WM2 | 2, -inf, 0.118, 0.343, 0.139 | 14, inf, inf, 0.991, 0.084 | 13, inf, inf, 1.000, 0.129
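The ranking rule in the caption amounts to a lexicographic sort over the tuple (Q@0.7P, Q@0.4P, Avg P, Avg Q). The small sketch below illustrates this on three Tree-Ring rows copied from the table, with missing thresholds encoded as ±inf exactly as in the table; it is an illustration only, not the WAVES processing code.

```python
# Lexicographic ranking of attacks, as described in the Table 6 caption:
# sort by (Q@0.7P, Q@0.4P, Avg P, Avg Q), lower = stronger. The values below
# are the Tree-Ring entries for three attacks, copied from Table 6.
rows = {
    "Rinse-4xDiff":   (float("-inf"), float("-inf"), 0.110, 0.466),
    "AdvCls-UnWM&WM": (float("-inf"), 0.123, 0.352, 0.145),
    "Dist-Erase":     (float("inf"), float("inf"), 0.986, 0.490),
}
# Python compares tuples lexicographically, and +/-inf sort as expected.
ranking = sorted(rows, key=rows.get)
for rank, name in enumerate(ranking, start=1):
    print(rank, name)   # 1 Rinse-4xDiff, 2 AdvCls-UnWM&WM, 3 Dist-Erase
```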
Figure 20. The spoofing attack fails for AdvCls-UnWM&WM.

The insights from Figure 21 reveal that the surrogate model does not exactly remove the watermark. Instead, it perturbs the watermark along with other features within the latent space. The disturbance alone is sufficient to confuse the detector, making it difficult to recognize the watermark. In contrast, successfully adding watermarks requires precise modifications in the latent space rather than mere perturbations, which proves to be a more challenging task. The relative imprecision of this attack may stem from the transfer gap between the surrogate model and the ground-truth detector. Notably, for the purpose of watermark removal, perturbing the latent space proves adequately effective.

Figure 21. Visualization of the AdvCls-UnWM&WM attack. (a) shows the watermarking mask of Tree-Ring, which has four channels; we only watermark the last channel. The watermark message is the rings, which contain ten complex numbers not shown in the figure. (b) and (c) show the inverted latent before and after the attack in Fourier space; only the real part of the latent is shown. Clearly, the rings exist before the attack and vanish after it. (d) shows the magnitude of the element-wise difference before and after the attack. The attack perturbs not only the watermark but also other features; the ratio of the average magnitude change in the watermark part to that in the non-watermark part is around 2:1. The attack successfully disturbs the watermark, albeit in an imprecise manner.

These findings have led to the development of our proposed AdvCls-WM1&WM2 attack, which utilizes images watermarked with different messages (e.g., collected from two users, User1 and User2). The essential requirement for this approach is the surrogate model's ability to map images to the generator's latent space. This mapping allows the attacker to perturb the latent space, removing the watermark. In contrast to the AdvCls-UnWM&WM approach, which uses both watermarked and non-watermarked images for training (differing only in the latent space), AdvCls-WM1&WM2 uses two sets of images, each embedded with a distinct watermark message (again differing only in the latent space).

Figure 22. Visualization of the AdvCls-WM1&WM2 attack. (a) and (b) are the same as in Figure 21. (c) shows the inverted latent after the attack, where the watermark vanishes rather than changing to another watermark. (d) shows the magnitude of the element-wise difference before and after the attack. The attack perturbs not only the watermark but also other features; the ratio of the average magnitude change in the watermark part to that in the non-watermark part is again around 2:1. Although the surrogate detector is trained to classify two different watermark messages, the attack based on it cannot change the watermark message from one to another; it can only disturb the watermark.

Figure 22 shows that the AdvCls-WM1&WM2 attack effectively disrupts the latent features of the images, including the watermarks. However, it lacks the precision to interchange the embedded watermark message. Consequently, while this attack can remove watermarks and mislead user identification (mistaking an image originally generated by User1 as belonging to another user), it cannot accurately manipulate the identification to frame User2 as desired by the attacker. The identification results in Figure 23 also support this finding. Although AdvCls-WM1&WM2 aims to misidentify images as belonging to User2, it often leads to misidentification as users other than User2.
However, in a system with fewer users (e.g., 100) and under intense attack conditions (e.g., strength 8/255), AdvCls-WM1&WM2 achieves a targeted identification success rate of 0.7%, suggesting a potential direction for attacks aimed at targeted user identification.

Figure 23. User identification results for Tree-Ring under AdvCls-WM1&WM2 attacks. The original watermarked images are embedded with User1's message. AdvCls-WM1&WM2 tries to disrupt the latent features of those images so that they are misidentified as generated by User2. We simulate two settings: 100 users and 1000 users in total. The blue curves show the proportion of images correctly identified as belonging to User1, while the orange curves show those misidentified as User2's. Note that the axes for the blue and orange curves have different ranges. With increasing attack strength, the likelihood of correctly identifying the images as User1's decreases significantly under both the 100-user and 1,000-user scenarios. However, misidentification as User2's images occurs notably only when the total number of users is small (e.g., 100 users).

G.3. Visualization of Attacks

In Figure 24, we present visualizations of several attacks included in the WAVES benchmark. The prefix of each label indicates the attack strategy, while the suffix indicates the strength.

G.4. Full Results on DiffusionDB, MS-COCO and DALL·E 3

The full per-attack results under the detection setup on the three datasets are shown in Figures 25-30.

G.5. Evaluation on Additional Watermarks: DWT-DCT and MBRS

To further demonstrate the utility and versatility of the WAVES benchmark, we evaluated two additional watermarking methods: DWT-DCT (Al-Haj, 2007) and MBRS (Jia et al., 2021b). DWT-DCT combines the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) for watermark embedding, while MBRS enhances the resilience of DNN-based watermarks to JPEG compression by incorporating real and simulated JPEG artifacts during training. We stress-tested these watermarks using all the attack methods in WAVES. Results are presented in Figures 31 and 32 as performance vs. quality degradation 2D plots. Figure 7 in the main paper provides a comparison with the three existing watermarks (Tree-Ring, Stable Signature, and StegaStamp). These findings confirm the utility of WAVES for identifying weaknesses in different watermarking methods and demonstrate the ease of use and versatility of our benchmark toolkit, making it a valuable standard for the watermark research community.

Figure 24. A visual demonstration of various adversarial, regeneration, and distortion attacks on a Tree-Ring watermarked image. Panel (a) is the base unattacked image; the base prompt, drawn from DiffusionDB, is "digital painting of a lake at sunset surrounded by forests and mountains", along with further styling details. Panels: (a) Tree-Ring unattacked, (b) AdvEmbG-KLVAE8-2/255, (c) AdvEmbG-KLVAE8-8/255, (d) AdvEmbB-CLIP-2/255, (e) AdvEmbB-CLIP-8/255, (f) AdvCls-WM1&WM2-2/255, (g) AdvCls-WM1&WM2-8/255, (h) Regen-Diff-40, (i) Regen-Diff-200, (j) Rinse-2xDiff-20, (k) Rinse-2xDiff-100, (l) Rinse-4xDiff-10, (m) Rinse-4xDiff-50, (n) DistCom-Photo-0.15, (o) DistCom-Geo-0.15, (p) DistCom-Deg-0.15.

Figure 25. Evaluation on the DiffusionDB dataset under the detection setup (part 1).
Figure 26. Evaluation on the DiffusionDB dataset under the detection setup (part 2).
Figure 27. Evaluation on the MS-COCO dataset under the detection setup (part 1).
Figure 28. Evaluation on the MS-COCO dataset under the detection setup (part 2).
Figure 29. Evaluation on the DALL·E 3 dataset under the detection setup (part 1).
Figure 30. Evaluation on the DALL·E 3 dataset under the detection setup (part 2).
Figure 31. Stress-test results for DWT-DCT. It is highly susceptible to regeneration attacks (cross markers) and most distortion attacks (square markers), but relatively robust against adversarial attacks.
Figure 32. Stress-test results for MBRS. It is vulnerable to certain distortion attacks (resized cropping, blurring, rotation, combo distortions) and to regeneration attacks, but robust against other distortions (JPEG compression, brightness/contrast, random erasing, noise) and adversarial attacks.

H. Limitations

Although we have stress-tested five watermarks and 26 attacks, there may exist watermarks and attacks that we did not include in this paper. However, we emphasize that our framework is extensible to any watermarking method and attack. Additionally, our attack ranking method relies on author-selected TPR thresholds and image quality metrics that we believe fairly capture attack potency, based on the existing literature and our experimental studies. The use of other quality metrics (MSE, Watson-DFT, etc.) and different TPR thresholds may affect the attack rankings.