ScreenMark: Watermarking Arbitrary Visual Content on Screen

Xiujian Liang, Gaozhi Liu, Yichao Si, Xiaoxiao Hu, Zhenxing Qian*
Fudan University, School of Computer Science, Shanghai 200438, China
liangxj23,gzliu22,ycsi22,xxhu23@m.fudan.edu.cn, zxqian@fudan.edu.cn
*Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Digital watermarking has shown its effectiveness in protecting multimedia content. However, existing watermarking methods are predominantly tailored to specific media types, which makes them less effective for protecting content displayed on computer screens, which is often multi-modal and dynamic. Visual Screen Content (VSC) is particularly susceptible to theft and leakage through screenshots, a vulnerability that current watermarking methods fail to adequately address. To address these challenges, we propose ScreenMark, a robust and practical watermarking method designed specifically for the protection of arbitrary VSC. ScreenMark utilizes a three-stage progressive watermarking framework. Initially, inspired by diffusion principles, we initialize the mutual transformation between regular watermark information and irregular watermark patterns. Subsequently, these patterns are integrated with screen content using a pre-multiplied alpha blending technique, supported by a pre-trained screen decoder for accurate watermark retrieval. Progressively more complex distorters enhance the robustness of the watermark in real-world screenshot scenarios. Finally, the model undergoes fine-tuning guided by a joint-level distorter to ensure optimal performance. To validate the effectiveness of ScreenMark, we compiled a dataset comprising 100,000 screenshots from various devices and resolutions. Extensive experiments on different datasets confirm the superior robustness, imperceptibility, and practical applicability of the method.

Introduction

With the continuous advancement of the Internet and computer technologies, an increasing amount of information is presented in the form of Visual Screen Content (VSC), including images, videos, texts, web pages, windows, and more. On users' personal computers, VSC is displayed on screens without exception. From a visual perspective, this mode of presentation is the most readily accepted form of expression. At the same time, it also means that VSC can easily be leaked. For most enterprise and home computers, ensuring data security is a significant challenge. Traditional security measures, such as data encryption, firewalls, access control, and identity management, provide comprehensive protection against data leaks. However, these methods primarily manage permissions, so authorized users can still capture screen content in real time using screenshot tools. Current specialized protection techniques for VSC often rely on non-learning watermarking methods, some of which are even visible. These methods struggle to balance robustness and visual quality effectively, rendering them unsuitable for real-world applications. Therefore, this study focuses on VSC security in screenshot scenarios and aims to propose a universal learning-based screen watermarking method.

Figure 1: The comparison of traditional watermarking and the proposed ScreenMark. The red and green content represent non-watermarked and watermarked content, respectively.
In recent years, the rapid advancement of multimedia watermarking technology (Liu et al. 2023; Xiao et al. 2024; Tang et al. 2024; Li, Liao, and Wu 2024) has facilitated the protection of multimedia file content. However, it is important to recognize that current watermarking techniques are primarily designed for individual modalities, offering specialized protection for images, videos, text, and other media types. As depicted in Fig.1, they have two main drawbacks: a limited scope of protection and response-time constraints. Fig.1(a) shows that while traditional watermarking can safeguard specific media content within a single modality or frame, it falls short of providing comprehensive protection across the various file types and the vast number of files present on personal computers. Moreover, these methods struggle to counteract millisecond-level capture attacks under dynamic screen conditions. In contrast, the method proposed in this paper, illustrated in Fig.1(b), does not focus on protecting a single media file or a specific screen frame at a given time. Instead, it integrates the watermark with the screen through a unique fusion process, offering comprehensive and real-time protection for arbitrary VSC displayed on the screen. We name this approach ScreenMark.

In ScreenMark, we introduce a three-stage watermarking framework that uses progressive training (Li et al. 2022). This approach completes robustness training at various levels across different stages, resulting in a versatile screen watermarking solution for VSC screenshot scenarios. Traditional watermarking methods often embed watermark information directly into media content via an encoder, leading to increased processing time and a limited protection scope. To overcome these limitations, ScreenMark employs an irregular watermark pattern that blends more naturally and comprehensively with screen content, resembling a mask. Moreover, this irregular pattern makes it more challenging for unauthorized users to detect and remove the watermark, thereby enhancing the security of the protection mechanism. Building on this, the three-stage progressive training strategy further refines this approach. Each stage addresses specific challenges: from basic message diffusion and reversal, to adaptive screen decoder training, and finally to handling composite distortion. Through progressively more complex training scenarios, ScreenMark enhances system resilience in VSC capture and subsequent processing scenarios.

Based on the above, the contributions of this paper can be summarized as follows:
1) We introduce a novel and practical multimedia protection scenario that addresses not only single-modal media content but also the multi-modal VSC displayed on computer screens. We also point out the limitations of mainstream single-modal watermarking methods in terms of protection scope and response time in this scenario.
2) To the best of our knowledge, we present the first learning-based watermarking framework specialized for VSC protection, named ScreenMark. In ScreenMark, regular watermark information is diffused into irregular watermark patterns and integrated with the screen display. We propose a three-stage progressive training strategy and design distorters of various levels tailored to each stage.
3) To enhance the applicability of ScreenMark to VSC protection, we have compiled a dataset of 100,000 screenshot images.
These images were collected using different screenshot tools across a diverse range of devices and resolutions, from SD (720x480) to 4K (3840x2160).
4) Extensive experiments demonstrate that ScreenMark matches or even surpasses the performance of four SOTA single-modal watermarking methods in screenshot scenarios in terms of robustness, invisibility, and applicability to real-world situations.

Related Work

Deep-learning-based Watermarking

Deep-learning approaches have effectively addressed the limitations of hand-crafted features in watermarking. (Zhu et al. 2018) introduced an end-to-end solution using an auto-encoder architecture, establishing a foundation in the domain. To enhance robustness against JPEG compression, MBRS (Jia, Fang, and Zhang 2021) proposed a hybrid noise layer of real and simulated JPEG together with a mini-batch strategy. (Fernandez et al. 2022) utilized a pre-trained model to create a transform-invariant latent space for watermark embedding, achieving higher robustness against various attacks. DWSF (Guo et al. 2023) offers a practical framework for decentralized watermarking, training an auto-encoder to resist non-geometric attacks and incorporating a watermark synchronization module for geometric attacks. Moreover, some recent methods (Guan et al. 2022; Xu et al. 2022; Fang et al. 2023b) explore invertible neural networks for watermark embedding and extraction. However, these methods primarily protect specific media content in a single modality and are inadequate for multi-modal VSC in real-time scenarios.

Screen-related Watermarking

The visibility of VSC to the public has sparked scholarly interest in its preservation. (Piec and Rauber 2014) utilized the Human Visual System to create dynamically adaptable watermarks, while (Du and Fan 2018) aimed to prevent screenshot data leakage through full-screen protection. These non-learning-based VSC watermarking methods struggle to balance robustness and invisibility, making them easily detectable by attackers. Driven by the need for screen protection, screen-shooting resilient watermarking (SSRW) addresses cross-channel content leakage. (Fang et al. 2018) first modeled screen-shooting distortions and proposed robust watermarking schemes based on DCT and SIFT. (Wengrowski and Dana 2019) introduced CDTF, a network simulating the screen-to-camera process trained on a multi-million-sample dataset. (Tancik, Mildenhall, and Ng 2020) developed a noise layer for the printer-to-camera channel, addressing various distortions. (Jia et al. 2020) designed a 3D-reconstruction-based noise layer for the camera-shot channel, achieving camera-shot robustness. Similarly, (Fang et al. 2022) created PIMoG, a noise layer for the screen-to-camera channel, enhancing screenshot robustness. To improve visual quality, (Jia et al. 2022) embedded information in sub-images and used a localization network to identify watermarked regions. To address distortion variability across different screens, (Fang et al. 2023a) introduced DeNoL, an efficient decoupling noise layer that simulates distortions accurately with fewer samples by fine-tuning the transform layer. However, SSRW, which focuses on modeling the physical distortion of recapture, protects only specified fixed content and is limited in scope and response time, making it ineffective for screen-interception scenarios of VSC. In contrast, our work aims to protect arbitrary VSC, providing real-time protection as the screen changes and multi-modal protection within a single watermarking framework.
Figure 2: The overall framework of the proposed method, which contains three main stages.

Proposed Method

Motivation & Overview

To address the limitations in protection scope and response time encountered by current watermarking techniques for screen content, this paper introduces a three-stage watermarking framework specifically designed for screen content. It is inspired by the diffusion model (Dhariwal and Nichol 2021), which diffuses a regular image with noise to generate an irregular image; the diffusion process can then be reversed through denoising, restoring the original image. This suggests directly integrating irregular watermark patterns, obtained by diffusing regular watermark information, with the screen content, rather than embedding information into the protected multimedia carrier using an encoder, as is common in traditional watermarking frameworks. This is particularly important for screen content protection because it allows real-time and cost-effective integration of watermark patterns with screen content, offering protection that is unrestricted by scope and response time while blending securely and naturally with the screen. Our approach transforms regular watermark information into irregular watermark patterns and integrates them with screen content, eliminating the need for encoder-based information embedding. With this in mind, we designed a three-stage watermarking framework that executes robustness training at different levels. The overall framework is shown in Fig.2.

Stage-1: Pairwise Initialization

In Stage-1, we initialize a paired Message Diffuser and Message Reverser. These modules facilitate the diffusion of watermark information into watermark patterns and the subsequent reverse recovery. During this stage, we introduce an image-level distorter to enhance robustness against image-level attacks that may occur with screen captures. The framework consists of a Message Diffuser MD, a Message Reverser MR, and an image-level distorter DI.

Workflow. Initially, we generate a batch of regular watermark messages Iw, with batch size N0 and information length L. Subsequently, Iw is subjected to diffusion processing within MD, resulting in an irregular watermark pattern Pw. Upon acquisition of Pw, each pattern in the batch undergoes parallel image-level distortion processing by D_I^k ∈ DI, where k ∈ {1, 2, ..., N0}, yielding the distorted watermark pattern Pd. Finally, the reversed watermark information Ir is recovered through MR.

Architecture. The MD comprises a linear layer, N1 diffusion blocks, and a convolution block connected in series. It receives regular watermark information Iw ∈ R^(1×1×L×N0) and outputs a regular watermark pattern Pw ∈ R^(H×W×3×N0). The diffusion blocks use upsampling and transposed convolution in parallel to suppress checkerboard effects and enhance feature representation without increasing network depth (see the sketch below). Subsequently, in DI, different batches of Iw are randomly subjected to one of three types of image-level distortions D_I^k ∈ DI, specifically Resize, Crop, and Cropout distortion. These distortions are common in actual screenshot scenarios and cannot be restored by third parties without the attack parameters.
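The following is a minimal PyTorch sketch of one such diffusion block, assuming a simple design in which a nearest-neighbor-upsampling branch and a transposed-convolution branch run in parallel and are fused by summation; the channel sizes, the fusion by addition, and the activation are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DiffusionBlock(nn.Module):
    """One diffusion block: parallel upsampling paths to limit checkerboard artifacts."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Branch 1: fixed upsampling followed by a convolution.
        self.up_branch = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        # Branch 2: learned upsampling via a transposed convolution.
        self.tconv_branch = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4,
                                               stride=2, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches double the spatial resolution; summing them enriches the
        # feature representation without increasing the network depth.
        return self.act(self.up_branch(x) + self.tconv_branch(x))

# Example: stacking N1 = 5 such blocks grows a small seed feature map toward the
# H x W x 3 watermark pattern.
x = torch.randn(16, 64, 16, 16)          # batch of 16, 64 channels, 16x16
print(DiffusionBlock(64, 32)(x).shape)   # -> torch.Size([16, 32, 32, 32])
```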
Conversely, the MR consists of an HRNet, N2 reversal blocks, a double-convolution block, and two linear layers connected in series. It receives a distorted watermark pattern Pd ∈ R^(H×W×3×N0) and outputs the reversed watermark information Ir ∈ R^(1×1×L×N0).

Loss Function. In this stage, the loss function serves two purposes: on the one hand, to control the security and stealthiness of the generated watermark patterns, and on the other, to ensure the accuracy of the reversed information. To achieve irregular and invisible watermark patterns, we propose four losses: a near-zero loss, a dispersion loss, a variation loss, and a channel balance loss.

The near-zero loss reduces the interference of the watermark pattern with the original screen content. It minimizes the mean squared error between the generated watermark pattern Pd and a tensor filled with zeros, so that the overall pixel values stay close to zero. It can be formulated as:

L_zero = MSE(Pd, 0)    (1)

where 0 represents the zero matrix with the same shape as Pd.

The dispersion loss prevents concentration in specific areas, improving robustness against image-processing attacks. It increases the dispersity of the watermark pattern by penalizing the mean absolute value and the variance, encouraging a uniform distribution across the image:

L_dispersion = mean(|Pd|) + var(Pd)    (2)

where |Pd|, mean, and var represent the absolute value, the mean, and the variance of Pd, respectively.

The variation loss reduces the visual noise introduced by the watermark. It encourages spatial smoothness of the generated watermark pattern by penalizing drastic changes between adjacent pixels, making the watermark harder to detect by the naked eye:

L_variation = Σ_i Σ_j [(Pd[i, j] − Pd[i+1, j])² + (Pd[i, j] − Pd[i, j+1])²]    (3)

where i and j represent the row and column indices of pixels in Pd, and H and W represent the height and width of Pd. The variation loss considers pixel variations in both the vertical and horizontal directions, reducing high-frequency noise while preserving edge information.

The channel balance loss reduces noticeable color distortion and maintains color balance. It minimizes the mean of the squared differences between the mean values of the R, G, and B color channels. This design is particularly important for color-sensitive screen content:

L_channel = mean((Rm − Gm)² + (Rm − Bm)² + (Gm − Bm)²)    (4)

where Rm, Gm, and Bm represent the means of the R, G, and B channels of Pd, respectively.

Based on the considerations above, the loss function for the pattern can be written as:

L_pattern = λ0·L_zero + λ1·L_dispersion + λ2·L_variation + λ3·L_channel    (5)

To ensure the consistency of the reversed information with the watermark, we design the loss function for the message using binary cross-entropy:

L_message = −(1/N) Σ_{i=1}^{N} [Iw_i·log(Id_i) + (1 − Iw_i)·log(1 − Id_i)]    (6)

where N is the number of samples in the batch, Id_i is the predicted value for the i-th sample, and Iw_i is the actual value for the i-th sample. The loss function for the initialization training in Stage-1 can then be formulated as:

L_stage1 = β·L_pattern + γ·L_message    (7)

where β and γ weight the importance of the watermark pattern and the watermark message in this stage.
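To make the pattern losses concrete, here is a minimal PyTorch sketch of Eqs. (1)-(5), using the λ weights reported later in the Implementations paragraph; the tensor layout (N, 3, H, W) and the reduction details are our assumptions.

```python
import torch
import torch.nn.functional as F

def pattern_losses(p: torch.Tensor) -> torch.Tensor:
    """p: watermark pattern of shape (N, 3, H, W)."""
    # Eq. (1): keep the overall pixel values close to zero.
    l_zero = F.mse_loss(p, torch.zeros_like(p))
    # Eq. (2): encourage a dispersed, uniform pattern.
    l_dispersion = p.abs().mean() + p.var()
    # Eq. (3): total-variation-style smoothness in both directions
    # (the paper sums over pixels; normalizing by pixel count is also common).
    l_variation = (((p[:, :, 1:, :] - p[:, :, :-1, :]) ** 2).sum()
                   + ((p[:, :, :, 1:] - p[:, :, :, :-1]) ** 2).sum())
    # Eq. (4): balance the mean intensities of the R, G, B channels
    # ("mean" taken here as the average of the three pairwise terms).
    r, g, b = p[:, 0].mean(), p[:, 1].mean(), p[:, 2].mean()
    l_channel = ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2) / 3
    # Eq. (5): weighted combination with the lambdas reported in the paper.
    return 1.0 * l_zero + 0.5 * l_dispersion + 0.1 * l_variation + 0.01 * l_channel

loss = pattern_losses(torch.randn(16, 3, 512, 512) * 0.01)
```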
Stage-2: Adaptive Pre-Training

In Stage-2, we adaptively pre-train the screen decoder DS to accurately decode the irregular watermark patterns and reverse the message. We freeze the parameters of the model from Stage-1 and input the patterns into an alpha blending rendering module Rα for integration with the computer screen. This stage also introduces a pixel-level distorter to increase robustness against pixel-level attacks.

Workflow. The process of transforming watermark information into watermark patterns remains consistent with Stage-1. Building on this, the watermark pattern Pw and the screen content Sc are input into Rα. Through flattening or scaling, Rα integrates Pw with Sc to produce the watermarked screen content Sw. The integration forms a mask that floats above the screen through the α channel, so that a full-screen watermark can be realized on a screen of any resolution without affecting the screen content. After obtaining Sw, we employ the pixel-level distorter D_P^k ∈ DP for parallel distortion processing, where k ∈ {1, 2, ..., N0}, resulting in the distorted watermarked screen content Sd. Finally, the watermark pattern Pds is decoded from Sd by DS.

Dataset      Method      Bits  Clean   Crop 90%  Crop 80%  Crop 70%  Cropout 5%  Cropout 10%  Cropout 20%  Resize 80%  Resize 150%  Resize 300%  Average
ImageNet     StegaStamp  100   99.48   68.93     55.19     51.93     97.12       97.00        95.31        91.00       91.43        90.12        82.00
             MBRS        30    97.97   79.64     75.43     72.13     97.36       97.11        93.87        93.38       93.16        92.64        88.30
             PIMoG       30    99.38   78.83     73.14     70.16     97.56       97.31        93.12        92.21       92.17        90.25        87.19
             DWSF        30    100.00  98.75     96.58     94.12     98.16       97.89        94.44        94.35       97.37        97.73        96.60
             ScreenMark  100   100.00  98.71     96.75     94.17     98.09       97.38        95.65        95.30       97.98        97.25        96.80
ScreenImage  StegaStamp  100   97.06   72.68     60.25     55.93     90.43       91.06        89.00        94.68       93.68        94.31        82.45
             MBRS        30    98.12   79.99     75.80     72.46     97.79       97.54        94.22        93.73       93.51        92.99        88.67
             PIMoG       30    99.45   79.19     73.40     70.45     97.99       97.75        93.47        92.56       92.52        90.59        87.55
             DWSF        30    100.00  99.20     96.82     93.46     98.61       98.34        95.08        94.79       95.82        95.17        96.37
             ScreenMark  100   100.00  99.66     98.73     95.26     99.93       99.72        96.23        99.88       99.99        99.99        98.82
Table 1: Bit accuracy rate (BAR, %) on different image-level attacks

Architecture. Rα utilizes Direct3D for efficient graphics processing and the Windows API for window management, incorporating pre-multiplied alpha technology to optimize transparency handling. Initially, window creation and configuration are performed with the Windows CreateWindowEx API, ensuring the window stays atop others while allowing mouse events to pass through, maintaining unobtrusive user interaction. Subsequently, Direct3D is used for GPU-accelerated rendering, enhancing efficiency and ensuring speed during terminal information switching. Finally, pre-multiplied alpha processing is applied before loading, where each pixel's color value is multiplied by its alpha value, simplifying transparency blending calculations. This computation can be described as:

Sw = α·Pw + (255 − α)·Sc    (8)

where α takes the value of 5, meaning the watermark pattern affects less than 2% of the screen content pixels. To optimize rendering, an advanced shader dynamically adjusts image transparency during rendering, leveraging pre-multiplied alpha techniques for automatic color and transparency blending. In DP, batches of Sw are randomly subjected to one of three types of pixel-level distortions, denoted D_P^k ∈ DP: JPEG compression, Gaussian noise, and Gaussian blur. The DS consists of three consecutive conv blocks, two residual blocks, and an additional conv block linked in series. It receives Sd ∈ R^(H×W×3×N0) and outputs Pds of the same size.
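As an illustration of the blend in Eq. (8), the following sketch overlays a watermark pattern on a screen frame through a constant α channel. The normalization by 255 that keeps the result in the displayable range, and the assumption that the pattern has already been flattened or scaled to the screen resolution, are ours rather than the paper's.

```python
import numpy as np

def alpha_blend(pattern: np.ndarray, screen: np.ndarray, alpha: int = 5) -> np.ndarray:
    """pattern, screen: uint8 arrays of shape (H, W, 3), already at the same resolution."""
    pw = pattern.astype(np.float32)
    sc = screen.astype(np.float32)
    # Eq. (8), with an explicit /255 normalization (assumed) so the result stays in [0, 255].
    sw = (alpha * pw + (255 - alpha) * sc) / 255.0
    return np.clip(sw, 0, 255).astype(np.uint8)

screen = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)   # a screen frame
pattern = np.random.randint(0, 8, (1080, 1920, 3), dtype=np.uint8)    # near-zero pattern
watermarked = alpha_blend(pattern, screen, alpha=5)                    # alpha = 5 as in the paper
```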
Loss Function. This stage adaptively pre-trains DS, building on the training foundation laid in the previous stage, to decode watermark patterns from screenshots of watermarked screens that have undergone pixel-level distortion attacks. The goal is to match the extracted pattern as closely as possible to the original pattern before distortion. With the weights from Stage-1 frozen, the pre-training of the screen decoder is not interfered with by other factors. The loss function in Stage-2 is formalized as:

L_stage2 = MSE(Pds, Pw)    (9)

Stage-3: Enhancement Fine-Tuning

In Stage-3, we synergistically fine-tune the model weights of the Message Diffuser MD, Message Reverser MR, and Screen Decoder DS acquired in the previous two stages. Furthermore, we introduce an additional joint-level distorter DJ that encompasses both image-level and pixel-level distortions. This enhances the model's robustness when integrated with screen content, effectively compensating for the limitations of the previous two stages. To give the model just the right amount of robustness, the level of the distorter and its position differ across stages. Based on the aforementioned model, we name the complete network ScreenMark. The loss function during the enhancement fine-tuning process is:

L_stage3 = L_stage1 + L_stage2    (10)

Experiments

Experimental Settings

Benchmarks. ScreenMark is the first learning-based watermarking method specialized for VSC protection, so there is no directly relevant baseline model to compare against. To measure our performance in terms of robustness, we still compare our method with four state-of-the-art (SOTA) single-modal watermarking methods: StegaStamp (Tancik, Mildenhall, and Ng 2020), PIMoG (Fang et al. 2022), MBRS (Jia, Fang, and Zhang 2021), and DWSF (Guo et al. 2023).

Datasets. Given the absence of a suitable screenshot dataset for VSC protection, we created a dataset called ScreenImage, comprising 100,000 screenshots from various devices and resolutions ranging from SD (720x480) to 4K (3840x2160). We randomly selected 50,000 images as our training dataset. To evaluate ScreenMark, we randomly sample 1,000 images each from ImageNet (Deng et al. 2009) and ScreenImage (excluding the training split). Notably, MBRS only accepts a fixed input size after training, necessitating the scaling of test images to 128x128. Detailed dataset categorization and collection procedures are given in the APPENDIX.

Implementations. Our method is implemented in PyTorch (Paszke et al. 2019) and executed on an NVIDIA GeForce RTX 4090 GPU. In terms of experimental parameters, the information length L is 100. The height H and width W of the watermark pattern Pw are both 512, optimized for blending with screens of different resolutions without implying a minimum size limit. The batch size N0, number of diffusion blocks N1, and number of reversal blocks N2 are 16, 5, and 2, respectively. In Stage-1, the loss weights β and γ are 0.1 and 1, respectively. For L_pattern, λ0, λ1, λ2, and λ3 are set to 1.0, 0.5, 0.1, and 0.01, respectively. Through eight sets of ablation studies, we balanced these choices for fast training and high visual quality. The α of the alpha-fusion rendering module Rα is set to 5; adjusting α controls the trade-off between invisibility and robustness. Robustness remains stable when the PSNR is between 36 and 42 dB (α ∈ [5, 8]), which aligns with the PSNR range of the SOTA methods. We use the Adam optimizer (Kingma and Ba 2014) with a learning rate of 1e-5 and set the number of training epochs to 100, while the compared methods adopt their default settings. Hyperparameter ablations are provided in the APPENDIX.
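For convenience, the training settings stated above can be gathered in one place. The dictionary below simply restates the reported values; the key names are our own and do not correspond to a released configuration file.

```python
config = {
    "message_length_L": 100,
    "pattern_size_HxW": (512, 512),
    "batch_size_N0": 16,
    "diffusion_blocks_N1": 5,
    "reversal_blocks_N2": 2,
    "stage1_weights": {"beta": 0.1, "gamma": 1.0},
    "pattern_loss_lambdas": [1.0, 0.5, 0.1, 0.01],  # zero, dispersion, variation, channel
    "alpha": 5,                                      # PSNR stable for alpha in [5, 8]
    "optimizer": {"name": "Adam", "lr": 1e-5},
    "epochs": 100,
}
```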
Dataset      Method      Bits  Clean   JPEG 95  JPEG 85  JPEG 75  Noise 0.01  Noise 0.05  Noise 0.1  Blur 3  Blur 5  Blur 7  Average
ImageNet     StegaStamp  100   98.48   95.43    94.18    92.37    97.31       96.75       91.62      96.62   95.37   94.12   94.86
             MBRS        30    97.97   97.45    96.53    95.47    98.29       98.17       97.88      97.43   95.89   94.14   96.80
             PIMoG       30    99.38   96.38    95.14    93.27    98.38       98.29       97.64      97.75   96.43   95.13   96.49
             DWSF        30    100.00  99.31    97.64    95.88    99.13       98.65       95.24      98.66   96.17   96.82   97.50
             ScreenMark  100   100.00  99.21    97.68    95.77    99.03       98.68       95.23      98.85   96.23   95.84   97.39
ScreenImage  StegaStamp  100   97.06   93.00    92.31    90.31    95.00       89.12       80.43      93.68   94.06   93.81   91.30
             MBRS        30    98.35   98.92    97.49    95.58    98.63       98.51       98.18      97.91   96.35   94.67   97.36
             PIMoG       30    99.86   96.89    95.59    93.47    98.79       98.70       98.10      98.22   96.98   95.59   96.92
             DWSF        30    100.00  99.15    97.38    95.27    99.21       98.43       95.12      99.45   96.20   95.86   97.34
             ScreenMark  100   100.00  99.39    98.34    96.69    100.00      99.08       96.19      100.00  98.38   95.65   98.19
Table 2: Bit accuracy rate (BAR, %) on different pixel-level attacks

Metrics. We use Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al. 2018) to evaluate visual quality, and the Bit Accuracy Rate (BAR) to evaluate robustness.

Robustness Performance

In this section, we compare the robustness of ScreenMark with the four SOTA methods across various attack types. The watermark length was set to 100 bits in our experiments. The types and implementation details of the attack settings also align with those used by the SOTA methods. To ensure objective results, we used the ImageNet and ScreenImage datasets for our experiments. Further experiments on watermark length and on robustness against severe, hybrid, and real-world attacks are included in the APPENDIX.

Robustness against Image-level Attacks. We evaluate the robustness of ScreenMark and the SOTA methods against image-level attacks with different factors. In Tab.1, the headers represent the proportion of the Crop, Cropout, and Resize attacks relative to the original images, measured in percent. The attack methods used in the experiments align with the SOTA methods for consistency and comparability. Notably, ScreenMark consistently achieves over 94% bit accuracy, with an average performance of 97.81% across both datasets, regardless of the attack type. Although it does not outperform the best method in every case, its performance remains close to the top.

Robustness against Pixel-level Attacks. In addition to image-level attacks, we also assess robustness against pixel-level attacks, which are common in social networking scenarios. Tab.2 reports the bit accuracy rate under pixel-level attacks with different factors. The table headers indicate the factor for each attack: JPEG compression quality factor (QF), Gaussian noise standard deviation (σ), and Gaussian blur kernel size (κ). ScreenMark demonstrates exceptional stability and performance across various pixel-level attacks. In many cases, it achieves the highest or second-highest bit accuracy rate, highlighting its robustness. Overall, ScreenMark consistently performs at a level comparable to the best method, and often surpasses the other methods by approximately 1%, showcasing its reliability and effectiveness.
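For illustration, a rough sketch of these pixel-level distortions and of the BAR metric is given below. It uses PIL and NumPy; PIL's GaussianBlur takes a radius rather than a kernel size κ, so the mapping from κ to radius, like the helper names, is an assumption rather than the paper's exact evaluation code.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def jpeg_compress(img: Image.Image, qf: int = 75) -> Image.Image:
    # JPEG(QF): re-encode the screenshot at the given quality factor.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=qf)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_noise(img: Image.Image, sigma: float = 0.05) -> Image.Image:
    # Noise(sigma): additive Gaussian noise on the image scaled to [0, 1].
    x = np.asarray(img, dtype=np.float32) / 255.0
    x = np.clip(x + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
    return Image.fromarray((x * 255).astype(np.uint8))

def gaussian_blur(img: Image.Image, kappa: int = 3) -> Image.Image:
    # Blur(kappa): kernel size mapped to a PIL blur radius (assumed mapping).
    return img.filter(ImageFilter.GaussianBlur(radius=kappa / 2))

def bit_accuracy(decoded: np.ndarray, embedded: np.ndarray) -> float:
    # BAR: fraction of correctly recovered bits, reported in % in Tab.1 and Tab.2.
    return float((decoded == embedded).mean()) * 100.0
```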
Figure 3: The visualization of the ScreenMark impact.

Robustness in Real Screenshot Scenarios. To verify the practical applicability of ScreenMark, we tested its robustness in real screenshot scenarios. We collected screenshots using various publicly available tools, including Windows Screenshot, Snipaste, Greenshot, and WeChat Screenshot, across different resolutions. In the experiments, watermarked VSC screenshots were randomly cropped with a fixed crop size of 400×400 pixels, which does not exceed 8% of the area of a 1080P image. The screenshots were saved in JPG format, with the compression quality determined by the default settings of each tool. The results, included in the APPENDIX, indicate that ScreenMark maintains a bit accuracy rate above 94% across all resolutions and tools. This level of accuracy ensures that the watermark can be fully and correctly extracted in practical applications, especially when error-correcting codes are incorporated.

Visual Quality

The present work not only addresses the shortcomings of mainstream watermarking methods in VSC protection scenarios but also achieves impressive visual quality. We verify this through qualitative visualization and quantitative metrics.

Visualization of Watermarking Residuals. As described in Stage-1, we control the generated watermark pattern to be an irregular image that is close to zero, evenly dispersed, gently varying, and as balanced across the RGB channels as possible. This minimizes the impact on the original VSC quality while preventing malicious recognition and erasure by watermark attackers. The proposed ScreenMark is fused with arbitrary VSC, protecting massive amounts of on-screen media content in real time, which the SOTA methods cannot do. Considering this, we calculated the residuals between the watermarked image and the original image, magnified by 20 times, using a randomly selected test image from ScreenImage. The visualization results are shown in Fig.3 and confirm that our watermark pattern meets our design expectations. Richer visualizations of ScreenMark are shown in the APPENDIX.

Quantification of Watermarked Images. We adopt the image quality metrics used by existing watermarking frameworks to assess visual quality. PSNR measures the ratio between the maximum signal and the background noise in dB, with higher values indicating less distortion. SSIM quantifies the structural similarity between two images, with values from 0 to 1 and higher values indicating more similarity. LPIPS measures the perceptual difference between two images, with lower values indicating greater similarity. Table 3 reports the visual quantification of watermarked images for the different methods. ScreenMark achieves the best performance on both the SSIM and LPIPS metrics, thanks to our pattern control and alpha fusion strategy.

Dataset      Method      PSNR   SSIM    LPIPS
ImageNet     StegaStamp  23.89  0.8025  0.0515
             MBRS        36.49  0.9173  0.0387
             PIMoG       36.21  0.9850  0.0312
             DWSF        41.47  0.9831  0.0083
             ScreenMark  41.38  0.9969  0.0058
ScreenImage  StegaStamp  28.06  0.9411  0.1640
             MBRS        38.21  0.9384  0.0232
             PIMoG       38.33  0.9873  0.0218
             DWSF        42.60  0.9905  0.0074
             ScreenMark  41.86  0.9948  0.0055
Table 3: Visual quantification in different methods
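As a small illustration of the residual visualization and the PSNR metric described above, the following sketch amplifies the watermark residual by 20× and computes PSNR from the MSE; SSIM and LPIPS would typically come from libraries such as scikit-image and lpips and are omitted here to keep the example dependency-free.

```python
import numpy as np

def amplified_residual(original: np.ndarray, watermarked: np.ndarray, gain: int = 20) -> np.ndarray:
    """Both inputs are uint8 (H, W, 3); returns the residual magnified 'gain' times for display."""
    diff = watermarked.astype(np.float32) - original.astype(np.float32)
    return np.clip(np.abs(diff) * gain, 0, 255).astype(np.uint8)

def psnr(original: np.ndarray, watermarked: np.ndarray) -> float:
    # PSNR in dB between the 8-bit original and watermarked frames.
    mse = np.mean((original.astype(np.float32) - watermarked.astype(np.float32)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```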
Other Comparison

We have demonstrated the visual quality and robustness of ScreenMark, the key metrics for mainstream watermarking methods. However, due to the unique scenario addressed in this work, additional differences between ScreenMark and the SOTA methods need to be verified experimentally. The first key difference is the advantage of the proposed three-stage progressive training strategy over the traditional end-to-end training approach. The second difference lies in the scope of protection, where ScreenMark offers broader coverage than single-modality watermarking, as depicted in Fig.1, so it is not verified again here. The third difference is the response time for watermark embedding under dynamic VSC changes, where ScreenMark significantly outperforms single-modality watermarking.

The Ablation of Training Strategy. One key advantage of ScreenMark is its three-stage progressive training strategy, compared to the traditional end-to-end training approach. This strategy provides two main benefits: it allows the model to gain experience with simpler tasks first, simplifying the learning process for more complex tasks and preventing premature convergence to local optima. Additionally, staged training enables the use of more focused and refined loss functions and optimization strategies at each stage. We conducted an ablation study to compare different training strategies, as shown in Figure 4. Our results indicate that the three-stage strategy achieves convergence by the 15th epoch, while E2E training exhibits oscillations in the loss function.

Figure 4: Loss (solid) / Acc (dashed) vs. epoch of training: Stage-1 (blue), Stage-2 (orange), Stage-3 (green), E2E (red).

The Comparison of Temporal Limitation. Digital watermarking for VSC must be capable of real-time protection, as screen capture commands can be scripted to achieve millisecond-level interception. Existing single-modal watermarks cannot be embedded into dynamic screen content in real time. We validated the responsiveness of the SOTA methods to VSC changes. As shown in Tab.4, our watermark, fused directly with the screen, achieves a reaction time of 0 milliseconds, outperforming the SOTA methods. Note that the reported time is not the execution time of the program, but the reaction time needed to reload the watermark information onto the new VSC when it changes.

Method     StegaStamp  MBRS   PIMoG  DWSF   ScreenMark
Time (ms)  24.91       18.86  19.83  14.91  0
Table 4: Average reaction time of SOTAs in VSC scenarios

Conclusion

In this paper, we propose ScreenMark, a robust deep-learning-based watermarking scheme for arbitrary VSC protection, trained with a three-stage progressive watermarking strategy. The message diffuser and message reverser facilitate the transformation between regular watermark information and irregular watermark patterns. The alpha-fusion rendering module integrates these patterns into VSC of any resolution, while the screen decoder extracts the watermark information from distorted watermarked screenshots. We built a dataset with 100,000 screenshots from various devices and resolutions. Extensive experiments demonstrate the effectiveness of ScreenMark in terms of robustness, imperceptibility, and practical applicability.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants U20B2051, U22B2047, 62450067, and 62072114.

References

Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248-255. IEEE.
Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34: 8780-8794.
Du, J.; and Fan, X. 2018. Adaptive Watermark Based on Screen.
In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, 1-4.
Fang, H.; Chen, K.; Qiu, Y.; Liu, J.; Xu, K.; Fang, C.; Zhang, W.; and Chang, E.-C. 2023a. DeNoL: A Few-Shot-Sample-Based Decoupling Noise Layer for Cross-channel Watermarking Robustness. In Proceedings of the 31st ACM International Conference on Multimedia, 7345-7353.
Fang, H.; Jia, Z.; Ma, Z.; Chang, E.-C.; and Zhang, W. 2022. PIMoG: An effective screen-shooting noise-layer simulation for deep-learning-based watermarking network. In Proceedings of the 30th ACM International Conference on Multimedia, 2267-2275.
Fang, H.; Qiu, Y.; Chen, K.; Zhang, J.; Zhang, W.; and Chang, E.-C. 2023b. Flow-based robust watermarking with invertible noise layer for black-box distortions. In Proceedings of the AAAI Conference on Artificial Intelligence, 5054-5061.
Fang, H.; Zhang, W.; Zhou, H.; Cui, H.; and Yu, N. 2018. Screen-shooting resilient watermarking. IEEE Transactions on Information Forensics and Security, 14(6): 1403-1418.
Fernandez, P.; Sablayrolles, A.; Furon, T.; Jégou, H.; and Douze, M. 2022. Watermarking images in self-supervised latent spaces. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3054-3058. IEEE.
Guan, Z.; Jing, J.; Deng, X.; Xu, M.; Jiang, L.; Zhang, Z.; and Li, Y. 2022. DeepMIH: Deep invertible network for multiple image hiding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 372-390.
Guo, H.; Zhang, Q.; Luo, J.; Guo, F.; Zhang, W.; Su, X.; and Li, M. 2023. Practical Deep Dispersed Watermarking with Synchronization and Fusion. In Proceedings of the 31st ACM International Conference on Multimedia, 7922-7932.
Jia, J.; Gao, Z.; Chen, K.; Hu, M.; Min, X.; Zhai, G.; and Yang, X. 2020. RIHOOP: Robust invisible hyperlinks in offline and online photographs. IEEE Transactions on Cybernetics, 52(7): 7094-7106.
Jia, J.; Gao, Z.; Zhu, D.; Min, X.; Zhai, G.; and Yang, X. 2022. Learning invisible markers for hidden codes in offline-to-online photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2273-2282.
Jia, Z.; Fang, H.; and Zhang, W. 2021. MBRS: Enhancing robustness of DNN-based watermarking by mini-batch of real and simulated JPEG compression. In Proceedings of the 29th ACM International Conference on Multimedia, 41-49.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Li, C.; Zhuang, B.; Wang, G.; Liang, X.; Chang, X.; and Yang, Y. 2022. Automated progressive learning for efficient training of vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12486-12496.
Li, Y.; Liao, X.; and Wu, X. 2024. Screen-Shooting Resistant Watermarking with Grayscale Deviation Simulation. IEEE Transactions on Multimedia.
Liu, G.; Si, Y.; Qian, Z.; Zhang, X.; Li, S.; and Peng, W. 2023. WRAP: Watermarking Approach Robust Against Film-coating upon Printed Photographs. In Proceedings of the 31st ACM International Conference on Multimedia, 7274-7282.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
Piec, M.; and Rauber, A. 2014. Real-time screen watermarking using overlaying layer. In 2014 Ninth International Conference on Availability, Reliability and Security, 561-570. IEEE.
Tancik, M.; Mildenhall, B.; and Ng, R. 2020. StegaStamp: Invisible hyperlinks in physical photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2117-2126.
Tang, Y.; Wang, C.; Xiang, S.; and Cheung, Y.-M. 2024. A Robust Reversible Watermarking Scheme Using Attack Simulation-Based Adaptive Normalization and Embedding. IEEE Transactions on Information Forensics and Security.
Wengrowski, E.; and Dana, K. 2019. Light field messaging with deep photographic steganography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1515-1524.
Xiao, X.; Zhang, Y.; Hua, Z.; Xia, Z.; and Weng, J. 2024. Client-Side Embedding of Screen-Shooting Resilient Image Watermarking. IEEE Transactions on Information Forensics and Security.
Xu, Y.; Mou, C.; Hu, Y.; Xie, J.; and Zhang, J. 2022. Robust invertible image steganography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7875-7884.
Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586-595.
Zhu, J.; Kaplan, R.; Johnson, J.; and Fei-Fei, L. 2018. HiDDeN: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), 657-672.