Soft Superpixel Neighborhood Attention

Kent Gauen, Purdue University, gauenk@purdue.edu
Stanley Chan, Purdue University, stanchan@purdue.edu

Images contain objects with deformable boundaries, such as the contours of a human face, yet attention operators act on square windows. This mixes features from perceptually unrelated regions, which can degrade the quality of a denoiser. One can exclude pixels using an estimate of perceptual groupings, such as superpixels, but the naive use of superpixels can be theoretically and empirically worse than standard attention. Using superpixel probabilities rather than superpixel assignments, this paper proposes soft superpixel neighborhood attention (SNA), which interpolates between the existing neighborhood attention and the naive superpixel neighborhood attention. This paper presents theoretical results showing SNA is the optimal denoiser under a latent superpixel model. SNA outperforms alternative local attention modules on image denoising, and we compare the superpixels learned from denoising with those learned with superpixel supervision.1

1 Introduction

The attention mechanism is credited with meaningful benchmark improvements in deep neural networks [1]. While the initial attention mechanism acted globally, recent methods operate on local neighborhoods, which requires less computation and reduces the chance of learning spurious correlations [2]. Neighborhood attention (NA) transforms each pixel using its surrounding square neighborhood of pixels. In reality, this square neighborhood is merely an implementation convenience. Natural images contain objects with non-rigid boundaries, such as the contours of a human face or the outlines of text. To account for these deformations, this paper proposes a new local attention module, named soft superpixel neighborhood attention (SNA), that re-weights the attention map according to low-level perceptual groups, or superpixels [3].
Theory from vision science posits that humans view scenes according to perceptual groups (e.g., Gestalt principles) [4]. Superpixels are one such formulation of perceptual groupings, and they create a deformable, boundary-preserving segmentation of an image [3]. Since deep learning models can learn spurious correlations, operations that utilize perceptual groups should, in principle, help a network learn desirable correlations. Despite the deformable neighborhood shapes, applying attention with a superpixel estimate is straightforward. However, this naive approach is theoretically worse than NA for some superpixel shapes and sensitive to errors in the superpixel estimate. To address these issues, we propose thinking beyond a single superpixel estimate. While recent works use the maximum likelihood (ML) superpixel estimate [6, 7], many superpixel formations are equally likely. Figure 1 depicts three samples from a (slightly modified) recent BASS superpixel model [5]. Many of the boundaries seem arbitrary and change among the samples, while some remain fixed. No single superpixel segmentation is the absolute best. This is not a deficiency of the particular method used to estimate superpixels, but rather a drawback of using a point estimate.

1 Code for this project is available at https://github.com/gauenk/spix_paper
2 For educational value, hyperparameters are modified to create more boundaries.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: The Ambiguity of Superpixels. This figure compares three superpixel estimates from a recent method named BASS [5].2 While all three samples achieve a similar segmentation quality, some regions are different, and some are almost identical. Since no single segmentation is the best, this suggests that superpixel assignments are not as important as superpixel probabilities.
This paper presents a theoretically grounded approach to account for the ambiguity of superpixel assignments within an attention module. We consider this ambiguity in our derivation by re-weighting the empirical data distribution with superpixel probabilities. While standard superpixel models assume a fixed prior probability, this paper supposes each pixel has a separate prior superpixel probability. Under this new model, we find the optimal denoiser is the proposed SNA module.

Contributions. In summary, this paper presents a rigorous way to incorporate superpixels into an attention module. By modeling image pixels with distinct superpixel probabilities, we find the soft superpixel neighborhood attention (SNA) module is the optimal denoiser. We empirically show the SNA module outperforms NA within a single-layer denoising network by 1-2 dB PSNR. We compare the superpixels learned for denoising with superpixels learned using supervised training [8].

2 Related Works

Non-Local Methods. Classical image restoration methods transform a pixel using its neighborhood based on pairwise similarities [9-11]. Some early deep learning models incorporate non-local modules into deep networks for image restoration [12-17]. Once attention was introduced to the computer vision community, later methods drew analogies between attention and non-local methods [18-20].

Attention. While attention modules followed the same form as non-local methods, they were originally presented as a disconnected idea [1, 21]. Attention's primary issue was originally computation, and subsequent research efforts proposed efficient alternatives [22-25, 2]. These alternatives often compute attention across a smaller region and use a deep network's depth to disperse information globally. Neighborhood attention (NA) is implemented with an efficient, custom CUDA kernel to search a square window and is among the fastest methods [2].
The square window is theoretically more desirable than asymmetrical alternatives, such as SWIN [22].

Superpixels. Superpixels embody the concept of perceptual groupings, which suggests that humans first group smaller objects together and then aggregate this information to understand the whole image [3]. There are many methods to model these superpixels [26, 27]. SLIC is a state-of-the-art, classical superpixel method and is explained in Section 3.1 [26]. Superpixels have been applied extensively outside of the deep learning literature [28-32]. Superpixels have not been employed as widely since the advent of deep learning, but recent methods have started incorporating them [6, 7]. A recent method, referred to as the superpixel sampling network (SSN), presents a supervised learning method to sample superpixels using segmentation labels [8]. The loss function of SSN is related to the idea of superpixel pooling, which we use to evaluate superpixel probabilities in Section 5.3 [33, 34].

Non-Parametric Density Estimation. The derivation of the optimal denoiser using latent superpixels closely resembles the problem of mixture-based non-parametric density estimation [35]. This paper does not address important statistical issues, such as identifiability.

Figure 2: Each Pixel is Connected to Nine Superpixels. This figure illustrates the anatomy of the SLIC superpixels. The left-most figure illustrates how superpixels are conceptually distributed across a grid on the input image with stride S_sp. The right figure illustrates that a single pixel is connected to (at most) nine centroids.

3 Preliminaries

3.1 SLIC Superpixels

In this paper, an image (x) has height (H), width (W), and features (F) with shape H × W × F. Conceptually, superpixels are evenly spaced across the image according to the stride denoted S_sp. The superpixel stride (S_sp) determines the number of superpixels, S = HW/S_sp^2 (see Fig 2).
SLIC superpixels consist of superpixel means (μ̂ with shape S × F), intra-superpixel precisions (η̂ = [η̂_app η̂_shape] with shape S × 2), the probability each pixel belongs to each superpixel (π̂ with shape HW × S), and the superpixel assigned to each pixel (ŝ with shape HW). SLIC estimates these quantities using Lloyd's algorithm (similar to k-means) [26]. Following recent works, each pixel is connected to at most nine superpixels [8]. Presently, we describe how SLIC superpixel probabilities are computed. Let the distances between pixels and superpixel means be written as a matrix D of shape HW × S with D_(i,s) = ‖x_i − μ_s‖²₂ if the two are connected and infinity otherwise. The difference between the indices is written D̃_(i,s) = ‖[i_x i_y] − [s_x s_y]‖²₂. Superpixel probabilities are computed as,

\hat{\pi}^{(i)} = \sigma\left(-\hat{\eta}_{i,\text{app}} D^{(i)} - \hat{\eta}_{i,\text{shape}} \tilde{D}^{(i)}\right) \qquad (1)

where σ(·) is the softmax function and η̂_{i,app}, η̂_{i,shape} > 0 enforce consistency across appearance (app) and shape. The estimated superpixel probabilities are the conditional probability that pixel i is assigned to superpixel s, p(s_i = s | x_i) = π̂_(i,s). A superpixel segmentation (ŝ) is formed by assigning each pixel the cluster label of its most similar superpixel: ŝ_i = arg max_s π̂_(i,s). These quantities are estimates (note the hats) of latent parameters, which is a modeling assumption discussed in Section 4.2.

3.2 Superpixel Pooling

The attention module proposed in this paper will learn superpixel probabilities, but standard superpixel evaluation only considers a point estimate. To augment the standard superpixel benchmarks, we follow recent works and compare superpixel probabilities with superpixel pooling [8, 5, 34]. A matrix z has size HW × C for C segmentation classes, and superpixel probabilities (π̂) have size HW × S. Then, the probabilities are re-normalized, π̃_(i,s) = π̂_(i,s) / Σ_{j=1}^{HW} π̂_(j,s).
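As a concrete sketch of the Section 3.1 probabilities: Eq. (1) is a softmax over the (at most nine) connected superpixels of the negatively weighted distances. The minimal numpy sketch below is ours, not the paper's implementation; unconnected superpixels are marked with infinite distance, and the function name and toy inputs are illustrative.

```python
import numpy as np

def slic_probs(d_app, d_shape, eta_app, eta_shape):
    # Eq. (1): softmax of the negatively weighted appearance and shape
    # distances; np.inf marks superpixels not connected to this pixel.
    logits = -eta_app * d_app - eta_shape * d_shape
    weights = np.exp(logits - logits.max())  # exp(-inf) -> 0 for unconnected
    return weights / weights.sum()

# Toy example: three connected superpixels, one unconnected.
pi_i = slic_probs(np.array([0.1, 0.5, 2.0, np.inf]),
                  np.array([0.0, 1.0, 0.5, np.inf]),
                  eta_app=1.0, eta_shape=0.5)
```

The arg max of these probabilities gives the hard assignment ŝ_i; the closest connected superpixel receives the largest probability mass.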
Finally, the superpixel pooling operation is written,

\tilde{z}_{\text{sp-pooled}} = \hat{\pi} \tilde{\pi}^{\intercal} z \qquad (2)

In this paper, the vector z is either an image (C = F is the number of channels) or a segmentation label. A recent method, named the superpixel sampling network (SSN), proposes learning task-specific superpixels with this superpixel pooling loss [8]. Their loss function trains a network to estimate superpixel probabilities by comparing the original and pooled vectors. In Section 5.3, we compare the quality of superpixels learned within an SNA denoising network and those learned using the SSN loss on the BSD500 dataset [36].

Figure 3: Superpixel Neighborhood Attention. The yellow region represents the attention window and the red contours are a superpixel boundary. NA considers all pixels, mixing the dissimilar orange and blue pixels. H-SNA considers only pixels within its own superpixel, which is too few pixels for denoising. SNA excludes the dissimilar blue pixels but retains the similar orange pixels.

3.3 Neighborhood Attention

Let a noisy input image be denoted x with shape HW × F. The queries (q), keys (k), and values (v) project the image x from dimension F to D, written as q = x W_q, k = x W_k, v = x W_v. Neighborhood attention (NA) computes the output of the attention operator using pixels within a square neighborhood around pixel i, denoted N(i) [2]. With a standard neighborhood of size 7 × 7, NA is among the fastest available attention methods. The attention scale, λ_at ≥ 0, controls the shape of the attention weights.
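Before moving to attention, Eq. (2) can be made concrete: column-normalizing π̂ averages pixels into per-superpixel descriptors, and multiplying by π̂ scatters them back to pixels. This numpy sketch is ours (function and variable names are assumptions, not the paper's code):

```python
import numpy as np

def superpixel_pool(pi_hat, z):
    # Eq. (2): pooled = pi_hat @ (pi_tilde^T @ z), where pi_tilde
    # re-normalizes each superpixel's column of probabilities.
    pi_tilde = pi_hat / pi_hat.sum(axis=0, keepdims=True)  # (HW, S)
    pooled = pi_tilde.T @ z                                # (S, C): superpixel means
    return pi_hat @ pooled                                 # (HW, C): back to pixels

rng = np.random.default_rng(0)
pi_hat = rng.random((12, 3))
pi_hat /= pi_hat.sum(axis=1, keepdims=True)  # rows are probability vectors
z = rng.random((12, 4))                      # image features or soft labels
z_pooled = superpixel_pool(pi_hat, z)        # same shape as z
```

A useful sanity check: because the rows of π̂ sum to one, a constant input is reproduced exactly by the pool-then-scatter round trip.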
The output and attention weights are computed below,

f^{(i)}_{\text{NA}}(x) = \sum_{j \in \mathcal{N}(i)} w_{i,j} v_j, \qquad w_{i,j} = \frac{\exp\left(\lambda_{\text{at}}\, d(q_i, k_j)\right)}{\sum_{j' \in \mathcal{N}(i)} \exp\left(\lambda_{\text{at}}\, d(q_i, k_{j'})\right)} \qquad (3)

As an additional baseline, this paper augments the standard NA by learning the attention scale for each pixel. Specifically, each pixel's attention scale is the output of a deep network, λ_at^{(i)} = g^{(i)}_{ϕ,Deep}(x). Further details are described in Section 4.5.

4.1 Superpixel Neighborhood Attention

Hard Superpixel Neighborhood Attention (H-SNA). Perceptual groupings, such as superpixels, segment pixels into a boundary-preserving partition of the image. These boundaries account for sharp changes in pixel intensity, so using these segmentations as masks can remove perceptually unrelated information. Supplemental Section A.2 formalizes this intuition. H-SNA is a naive implementation of this concept. For a fixed query pixel (i), a neighboring pixel (j) is masked if its estimated superpixel does not match the query's superpixel, ŝ_i ≠ ŝ_j. The function is written as,

f^{(i)}_{\text{H-SNA}}(x; \hat{s}) = \sum_{j \in \mathcal{N}(i)} w_{i,j} v_j, \qquad w_{i,j} = \frac{\mathbb{1}[\hat{s}_i = \hat{s}_j] \cdot \exp\left(\lambda_{\text{at}}\, d(q_i, k_j)\right)}{\sum_{j' \in \mathcal{N}(i)} \mathbb{1}[\hat{s}_i = \hat{s}_{j'}] \cdot \exp\left(\lambda_{\text{at}}\, d(q_i, k_{j'})\right)} \qquad (4)

Soft Superpixel Neighborhood Attention (SNA). There are two problems with H-SNA. First, H-SNA is unstable when the number of pixels sharing the same superpixel label within the local neighborhood is small. Examples include snake-like tendrils and superpixels that consist of small, disconnected regions (see Section A.2).
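The weights of Eqs. (3) and (4) differ only by a binary mask, which the following sketch makes explicit for a single query pixel. The sketch is ours and assumes d(q, k) = q⊺k (the form used in the complexity table of Section 5.4); names and toy inputs are illustrative.

```python
import numpy as np

def attn_weights(q_i, K, lam=1.0, mask=None):
    # Eq. (3): softmax over the square neighborhood N(i).
    # Eq. (4): additionally zeroes neighbors outside the query's
    # superpixel via the indicator `mask` = 1[s_i == s_j].
    logits = lam * (K @ q_i)             # d(q, k) = q^T k
    w = np.exp(logits - logits.max())    # numerically stable softmax
    if mask is not None:
        w = w * mask                     # H-SNA masking
    return w / w.sum()

rng = np.random.default_rng(1)
q = rng.standard_normal(6)
K = rng.standard_normal((49, 6))                 # 7x7 neighborhood of keys
w_na = attn_weights(q, K)                        # Eq. (3)
same_spix = (rng.random(49) < 0.5).astype(float) # toy superpixel indicator
same_spix[0] = 1.0                               # the query matches itself
w_hsna = attn_weights(q, K, mask=same_spix)      # Eq. (4)
```

When only a few neighbors share the query's label, the H-SNA softmax is supported on very few pixels, which is exactly the instability described above.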
The second problem with H-SNA is its high sensitivity to improper superpixel estimation. Erroneous ML estimates of superpixels have a dramatic effect on quality. To mitigate both issues, we propose soft superpixel neighborhood attention (SNA), which uses superpixel probabilities instead of superpixel assignments,

f^{(i)}_{\text{SNA}}(x; \hat{\pi}) = \sum_{j \in \mathcal{N}(i)} w_{i,j} v_j, \qquad w_{i,j} = \frac{\exp\left(\lambda_{\text{at}}\, d(q_i, k_j)\right) \sum_{s=1}^{S} \hat{\pi}^{(i,s)} \hat{\pi}^{(j,s)}}{\sum_{j' \in \mathcal{N}(i)} \exp\left(\lambda_{\text{at}}\, d(q_i, k_{j'})\right) \sum_{s=1}^{S} \hat{\pi}^{(i,s)} \hat{\pi}^{(j',s)}} \qquad (5)

Figure 4: The Latent Superpixel Model. The latent superpixel model assumes superpixel probabilities are sampled for each image pixel. This figure illustrates the data-generating process. The left-most image allows the reader to visually compare the similarities among superpixels by representing each pixel by its most likely superpixel mean. Informally, this looks like a low-resolution image. The superpixels and image pixels are sampled as usual.

Importantly, the sum over all superpixels contains only nine non-zero terms (see Section 3.1). The shape of the superpixel probabilities allows SNA to interpolate between NA and H-SNA. To be specific, the superpixel probabilities may be flat to match NA, π̂^{(i,s)} ≈ 1/9, or sharp to match H-SNA, π̂^{(i,s)} ≈ 1[s = s*]. This control over the shape of the superpixel probabilities allows SNA to be better than both H-SNA and NA on average.

4.2 The Latent Superpixel Model

Properly incorporating superpixels into attention requires rethinking the relationship between superpixels and images. While superpixels are usually presented with a generative model for images, this model is used only to estimate superpixel assignments.
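The interpolation claim for Eq. (5) can be checked numerically: flat superpixel probabilities make the affinity term constant (recovering NA), while one-hot probabilities mask neighbors outside the query's superpixel (recovering H-SNA). The sketch below is ours; the logits and probability vectors are stand-in toy values.

```python
import numpy as np

def sna_weights(logits, pi_i, pi_N):
    # Eq. (5): attention weights re-weighted by the superpixel
    # affinity sum_s pi_(i,s) * pi_(j,s) between query i and neighbor j.
    affinity = pi_N @ pi_i                         # (K^2,)
    w = np.exp(logits - logits.max()) * affinity
    return w / w.sum()

K2, S = 9, 4
logits = np.linspace(-1.0, 1.0, K2)                # stand-in attention logits

flat = np.full(S, 1.0 / S)                         # flat probabilities -> NA
w_flat = sna_weights(logits, flat, np.tile(flat, (K2, 1)))
w_na = np.exp(logits - logits.max())
w_na /= w_na.sum()

one_hot = np.eye(S)[0]                             # sharp probabilities -> H-SNA
pi_N = np.eye(S)[[0, 0, 0, 0, 1, 1, 2, 2, 3]]      # neighbors' hard assignments
w_sharp = sna_weights(logits, one_hot, pi_N)       # zero outside superpixel 0
```

Anything between these two extremes softly down-weights, rather than discards, neighbors that are unlikely to share the query's superpixel.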
In contrast, this paper uses the data-generating process itself to derive the form of our proposed attention module. Our process is slightly different from the standard literature. Ordinarily, all pixels share a single prior superpixel probability. In this paper, the superpixel probabilities are sampled for each image pixel. For each image pixel, our generative model samples (1) superpixel probabilities π ∼ p(π), (2) superpixel assignments s_i ∼ p(s_i | π^{(i)}), and (3) pixel values x | s ∼ p(x | s). Figure 4 illustrates this proposed latent variable model. Note that SLIC estimated these quantities (recall the hats: ŝ, π̂), but assumes a different data-generating model. While the distribution of pixels given the superpixel assignments is unknown, the new model is useful for theoretical analysis. We use it to derive the form of our proposed attention module (Sec 4.3), and to compare the theoretical error between SNA, NA, and H-SNA (Sec A.2).

4.3 SNA is the Optimal Denoiser under the Latent Superpixel Model

The optimal denoiser is the function that minimizes the mean-squared error between the denoised pixel and the original pixel value. Following similar derivations [37], minimizing this expectation with respect to an unknown denoiser (D) is possible because of our chosen model for the unknown data density. Say we have a sample, x, s, π ∼ p(x, s, π). Then we approximate p(x_i | s_i),

\hat{p}(x_i \mid s_i) = \sum_{m=1}^{M} p(m \mid s_i)\, p(x_i \mid m) = \frac{9}{M} \sum_{m=1}^{M} p(s_i \mid m)\, \delta(x_i - x_m) \qquad (6)

since p(s_i) = 1/9 and p(m) = 1/M. By re-weighting with superpixel probabilities, rather than superpixel assignments, our estimate does not depend on ambiguous superpixel labels. The expected loss of a denoiser is computed over the factorized joint density of the Gaussian-corrupted pixels, noise-free pixels, superpixel assignments, and superpixel probabilities: p(x̃_i | x_i) p̂(x_i | s_i) p(s_i | π^{(i)}) p(π).
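As a sketch of the reasoning (the full argument is in the paper's Supplemental Section A.1): by the standard MMSE fact, the minimizer of the expected squared error is the posterior mean,

```latex
% MMSE denoising: the optimal denoiser is the posterior mean (a
% standard fact, stated here as a sketch of the argument).
D^*(\tilde{x}_i;\pi,\sigma)
  = \mathbb{E}\left[x_i \mid \tilde{x}_i, \pi\right]
  = \frac{\int x_i \, p(\tilde{x}_i \mid x_i)\, p(x_i \mid \pi)\, \mathrm{d}x_i}
         {\int p(\tilde{x}_i \mid x_i)\, p(x_i \mid \pi)\, \mathrm{d}x_i},
\qquad
p(x_i \mid \pi) = \sum_{s=1}^{S} \hat{p}(x_i \mid s)\, p(s \mid \pi^{(i)}).
```

Substituting the Gaussian likelihood p(x̃_i | x_i) ∝ exp(−‖x̃_i − x_i‖²/(2σ²)) and the δ-mixture of Eq. (6), with p(s | m) = π^{(m,s)} and p(s | π^{(i)}) = π^{(i,s)}, collapses the integrals into a weighted average of the samples x_m with weights proportional to exp(−‖x̃_i − x_m‖²/(2σ²)) Σ_s π^{(i,s)} π^{(m,s)}, which is the SNA form with λ_at = 1/(2σ²).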
Claim 1. The optimal denoiser to the following optimization problem is soft superpixel neighborhood attention (SNA) when the qkv-transforms are the identity, the attention scale is fixed to λ_at = 1/(2σ²), and the samples used to estimate the data density come from a neighborhood surrounding pixel i,

f^{(i)}_{\text{SNA}}(\tilde{x}; \pi) = D^*(\tilde{x}_i; \pi, \sigma) = \arg\min_{D}\, \mathbb{E}\left[\|D(\tilde{x}_i; \pi, \sigma) - x_i\|^2\right] \qquad (7)

See Supplemental Section A.1 for the proof. In practice, the error of a denoiser depends on the sampled data, and in this paper, the data is limited to a neighborhood surrounding the query point. Section A.2 analyzes this limited neighborhood.

Figure 6: Estimating Superpixel Probabilities. The superpixel probabilities are learned during training using one of these two methods. The left method uses a shallow UNet followed by a softmax layer with nine channels (g_{ϕ,Deep}). The right method estimates hyperparameters to be used within SLIC iterations (g_{ϕ,SLIC}).

4.4 Normalization for Restricted Connectivity

Figure 5: Artifacts in Attention. This attention map visualizes the unnormalized superpixel weights for neighbors, j ∈ N(i). The restricted connectivity leads to sharp cut-offs in the superpixel weights. To create this figure, π_(k,s) = 1/9 if pixel k is connected to superpixel s and zero otherwise. White pixels indicate the region sums to 1, while grey regions sum to values less than 1.

SLIC [26] and SSN [8] connect each pixel to (at most) nine superpixels. Therefore, some pixels within a neighborhood may only connect to a subset of the superpixels that connect to the query index. This sharp cut-off in connectivity creates artifacts within the attention map, since the number of non-zero superpixel probabilities for some neighbors is fewer than nine. These artifacts are due to our choice of superpixel algorithms [26, 8] rather than a limitation of our attention module.
To correct these artifacts for the attention map associated with pixel i, we re-normalize the adjacent superpixel probabilities to one: π_(j,s) ← π_(j,s) / Σ_{s' ∈ C(i)} π_(j,s'), where C(i) is the set of superpixels connected to pixel index i. Figure 5 illustrates the attention maps with and without the normalization step.

4.5 Estimating Superpixel Probabilities

SNA re-weights the attention map using superpixel probabilities, which are estimated from noisy images, π ≈ π̂ = g_ϕ(x). This paper forgoes theoretical concerns (such as identifiability) and uses ad-hoc estimates of these probabilities with two deep networks, depicted in Figure 6. The first method estimates the marginal probabilities directly with a deep network, π̂ = g_{ϕ,Deep}(x). The second method uses a deep network to estimate the superpixel parameters controlling the balance between appearance and shape, η̂ = [η̂_app η̂_shape], which are then fed into differentiable SLIC iterations [8] (Section 3.1); this composite model is abbreviated as π̂ = g_{ϕ,SLIC}(x). The number of SLIC iterations is fixed to 5, following a recent paper [7]. These networks are two-layer UNets [38].

4.6 Learnable Attention Scale

Using a network similar to the ones used to estimate superpixel probabilities, all attention modules are augmented to serve as another baseline. The augmentation replaces the fixed attention scale with a learnable attention scale for each pixel, λ_at = g_{ϕ,Attn}(x). The network architecture is identical to g_{ϕ,Deep} except for the final layer, which outputs one channel rather than nine. This baseline empirically demonstrates the theoretical argument from Supplemental Section A.3: even when modulating the attention scale, NA cannot robustly reject dissimilar pixels.

5 Experiments

This section demonstrates the impressive benefit of superpixel neighborhood attention compared to standard neighborhood attention.
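The Section 4.4 re-normalization can be sketched in a few lines. This sketch is ours; it encodes C(i) as a boolean mask over superpixels, which is an assumption about the representation rather than the paper's implementation.

```python
import numpy as np

def renormalize_connected(pi_N, in_Ci):
    # Restrict each neighbor's probabilities to the superpixels in C(i),
    # the set connected to the query pixel, then re-normalize to one.
    restricted = pi_N * in_Ci
    return restricted / restricted.sum(axis=1, keepdims=True)

pi_N = np.array([[0.50, 0.30, 0.20, 0.00],     # neighbor fully inside C(i)
                 [0.25, 0.25, 0.25, 0.25]])    # neighbor with mass outside C(i)
in_Ci = np.array([1.0, 1.0, 1.0, 0.0])         # last superpixel not in C(i)
pi_fixed = renormalize_connected(pi_N, in_Ci)
```

Neighbors already supported entirely on C(i) are unchanged, while neighbors with probability mass outside C(i) have it redistributed, removing the sharp cut-offs shown in Figure 5.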
To verify whether the improvement is due to the proposed method, we compare several variations of both attention methods. Section 5.2 compares different attention modules within a simple network architecture on Gaussian denoising, which empirically verifies the theoretical findings in Sections 4.3 and A.2. Section 5.3 compares the superpixel probabilities learned from the denoising loss function with superpixels learned through supervised training. Supplemental Section B.1 includes ablation experiments.

5.1 Experimental Setup

Our experiments use a simple denoising network with larger auxiliary networks to highlight the role of the learned superpixel probabilities and attention scale parameters. The denoising network contains only projection layers and a single attention module. To keep the extracted features simple, the only interaction between neighboring pixels is through the attention layer. The projection matrices map the input RGB image from three to six feature dimensions. The full network is written,

\hat{y}_{\text{Deno}} = \text{SimpleNetwork}_{\theta,\phi}(x) = f_{\text{Attn}}\left(x W_0,\, g_{\phi}(x W_0)\right) W_1 + x W_0 W_1 \qquad (8)

The primary network parameters are denoted θ = [W_0, W_1], and g_ϕ(·) is the auxiliary deep network from Section 4.5. The primary network consists of only 200 parameters (θ), and the auxiliary networks contain about 4.4k or 8.8k parameters (ϕ). We train each network for 800 epochs with a batch size of 2 on the BSD500 dataset [36], using a learning rate of 2 × 10⁻⁴ with a decay factor of 1/2 at epochs 300 and 600. The network is optimized with Adam [39]. The code is implemented in Python using PyTorch, NumPy, Pandas, and CUDA, and run using two NVIDIA Titan RTX GPUs and one RTX 3090 Ti GPU [40-43]. Testing datasets are Set5 [44], BSD100 [36], Urban100 [45], and Manga109 [46].

5.2 Gaussian Denoising

This subsection empirically verifies the utility of SNA on Gaussian denoising.
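The data flow of Eq. (8) can be sketched as follows. This is a minimal numpy sketch of the Simple Network's forward pass, with the attention module abstracted as a callable (the identity here as a stand-in; the real network uses SNA or NA driven by the auxiliary network g_ϕ).

```python
import numpy as np

def simple_network(x, W0, W1, attn):
    # Eq. (8): one projection, one attention module, and a residual
    # path through the same projection matrices.
    h = x @ W0                          # project RGB (3) to features (6)
    return attn(h) @ W1 + h @ W1        # attention branch + skip connection

rng = np.random.default_rng(2)
x = rng.random((16, 3))                 # 16 pixels, RGB
W0 = 0.1 * rng.standard_normal((3, 6))
W1 = 0.1 * rng.standard_normal((6, 3))
y_hat = simple_network(x, W0, W1, attn=lambda h: h)  # identity stand-in
```

With the attention module abstracted away, the only pixel-to-pixel interaction in the network is whatever `attn` performs, which is exactly the isolation the experimental setup is designed to achieve.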
The Charbonnier loss is used for denoiser training, L_Deno = √(‖ŷ_Deno − y‖² + ε²) with ε = 10⁻³ [47]. By default, the window size is 15 × 15, the fixed SLIC precision is ten, and the fixed attention scale is one. Table 1 presents the quantitative results. The SNA module achieves far better denoising quality than NA, even when NA can learn its attention scale. As an extreme example, the PSNR at σ = 30 of the left-most SNA column is about 2.8 dB better than standard NA (2nd NA column). If the comparison is restricted to models with the same number of parameters, the PSNR at σ = 30 of the fourth SNA column is about 1.8 dB better than the augmented NA (1st NA column). For a fixed budget of network parameters, SNA yields a higher quality denoiser than NA. Figure 7 shows SNA is qualitatively superior to NA. The first row shows that SNA produces less grainy images than NA. The bottom two rows show that NA mixes perceptually unrelated information. In the second row, NA outputs a red-orange semi-transparent mist surrounding the orange-red rock. In the third row of the NA results, the white zebra stripes are shaded darker than in the clean image when the attention scale is learned, and contain white speckles if fixed. See Section B.2 for more examples. Learning the attention scale improves standard NA and SNA+g_{ϕ,Deep} at higher noise intensities. However, it degrades model quality when superpixels are estimated with SLIC iterations (SNA+g_{ϕ,SLIC}). Perhaps this quality drop is simply due to network architecture and/or the flow of information. Since the network that learns the attention scale is only given a noisy image, the simple model struggles to coordinate with the denoising network, which uses multiple SLIC iterations. SLIC iterations explicitly compute pairwise distances, which cannot be learned with the shallow model. A limitation of the proposed SNA module is the additional computation compared to NA.
SNA amounts to re-weighting the attention map produced by a neighborhood search, so SNA is necessarily more compute-intensive than NA. The bottom four rows of Table 1 show SNA is 15-22 times slower than NA. While this increase is significant, we believe the numbers reported in this paper are overly pessimistic. Our implementation of SNA has not been optimized, and future development can dramatically reduce the wall-clock time. One way to reduce wall-clock time is to read more efficiently from global CUDA memory [43]. Anecdotally, the naive implementations of NA are 500% to 1500% slower than their recently optimized alternatives.

Table 1: Image Denoising [PSNR↑/SSIM↑]. The denoising quality of each network is averaged across images from Set5 [44], BSD100 [36], Urban100 [45], and Manga109 [46]. The SNA module is quantitatively superior to NA, even when NA's attention scales are learned with a deep network. However, NA is over 20 times faster than SNA and consumes 13 times less memory. A major contribution of NA is efficiency, while the code for this paper's proposed SNA module has not been optimized. Time and memory usage are reported for a single image of size 128 × 128.

| Attn. | SNA | SNA | SNA | SNA | H-SNA | NA [2] | NA [2] |
|---|---|---|---|---|---|---|---|
| Learn λat | ✓ | ✓ | – | – | – | ✓ | – |
| Sp. Model | gϕ,Deep | gϕ,SLIC | gϕ,Deep | gϕ,SLIC | gϕ,SLIC | – | – |
| σ = 10 | 31.96/0.869 | 32.08/0.871 | 32.07/0.865 | 32.19/0.871 | 30.88/0.810 | 30.87/0.850 | 31.10/0.850 |
| σ = 20 | 29.01/0.838 | 28.72/0.815 | 28.77/0.819 | 29.08/0.804 | 25.56/0.630 | 27.12/0.774 | 26.96/0.743 |
| σ = 30 | 27.70/0.805 | 26.94/0.777 | 27.25/0.764 | 27.51/0.763 | 22.37/0.512 | 25.69/0.743 | 24.91/0.687 |
| Deno Params (θ) | 195 | 195 | 195 | 195 | 195 | 195 | 195 |
| Aux Params (ϕ) | 8.8k | 8.8k | 4.4k | 4.4k | 0 | 4.4k | 0 |
| Fwd Time (ms) | 30.20 | 45.05 | 27.06 | 40.58 | 28.86 | 4.64 | 2.08 |
| Bwd Time (ms) | 38.72 | 80.93 | 40.00 | 51.35 | 32.54 | 6.06 | 4.67 |
| Fwd Mem (GB) | 1.90 | 2.30 | 1.87 | 2.28 | 1.96 | 0.23 | 0.21 |
| Bwd Mem (GB) | 3.27 | 3.68 | 3.25 | 3.66 | 3.13 | 0.27 | 0.25 |

Figure 7: Denoised Examples [PSNR↑]. This figure compares the quality of denoised images using the Simple Network and noise intensity σ = 20. The attention scale (λat) is either fixed or learned with a deep network. In both cases, the NA module mixes perceptually dissimilar information, while the SNA module excludes dissimilar regions. See Section B.2 for more examples.

5.3 Inspecting the Learned Superpixel Probabilities

This subsection investigates how the superpixels learned from denoising compare with superpixels learned from explicit supervised superpixel training [8]. We observe a collaborative relationship between superpixel probabilities for denoising and superpixel pooling. We observe an adversarial relationship between the superpixels for denoising and boundary adherence. Generally, learning superpixels that improve superpixel benchmarks decreases their utility for denoising.

Table 2: Supervised Superpixel Training Impacts Denoiser Quality. This table compares the denoising quality of SNA networks trained with both a denoising loss term and an SSN loss term, L_final = L_Deno + L_SSN. The SSN Label "None" indicates only a denoising loss is used. Pixel labels marginally improve the denoising quality, suggesting a cooperative relationship between these optimization problems. Segmentation labels degrade the denoising, suggesting the best superpixels for boundary adherence are not the best superpixels for image denoising. Time and memory usage are reported for a single 128 × 128 image.

| SSN Label | PSNR/SSIM | Fwd/Bwd Time (ms) | Fwd/Bwd Mem (GB) |
|---|---|---|---|
| None | 32.77/0.879 | 33 | 0.5 |
| Pix | 32.78/0.889 | 62 | 6.3 |
| Seg | 30.14/0.798 | 86 | 6.3 |

Table 3: Evaluating Superpixel Quality. Training an SNA attention module on denoising learns superpixel probabilities with comparable quality to explicitly training superpixels. The ASA and BR metrics evaluate the ML superpixel estimate. The PSNR and SSIM metrics evaluate the quality of the superpixel pooled image.
| SSN Label | – | Image Pixels | Image Pixels | Image Pixels | Segmentation | Segmentation | Segmentation |
|---|---|---|---|---|---|---|---|
| Loss | L_Deno | L_SSN | L_SSN | L_SSN + L_Deno | L_SSN | L_SSN | L_SSN + L_Deno |
| σ | 10 | 0 | 10 | 10 | 0 | 10 | 10 |
| ASA | 0.954 | 0.952 | 0.946 | 0.947 | 0.961 | 0.958 | 0.951 |
| BR | 0.771 | 0.774 | 0.752 | 0.751 | 0.855 | 0.824 | 0.754 |
| PSNR | 27.08 | 31.02 | 29.89 | 29.41 | 21.56 | 22.20 | 25.31 |
| SSIM | 0.811 | 0.934 | 0.886 | 0.883 | 0.553 | 0.574 | 0.737 |

To compare the different training regimes for learning superpixel probabilities, an SNA network is fully trained with a combination of two loss functions. One loss is the denoising loss (L_Deno). The second loss is the SSN loss function, which computes the difference between a target vector and its superpixel-pooled transformation (see Section 3.2 for superpixel pooling) [8]. Specifically, L_SSN = L(z̃_sp-pooled, z), where L is the cross-entropy loss for a segmentation label and the mean-squared error for image pixels. The superpixel stride for all methods is fixed to S_sp = 14. Each network estimates superpixels using SLIC iterations, and the attention scale is fixed to one (if applicable). Evaluation is computed on the test set of BSD500 [36]. The Achievable Segmentation Accuracy (ASA) and Boundary Recall (BR) scores evaluate the quality of the maximum likelihood superpixel assignment. The ASA score measures the upper bound on the achievable accuracy of any segmentation method, and the BR measures the boundary alignment between the superpixels and a segmentation. The superpixels are processed with a connected components algorithm before computing ASA and BR scores. To qualitatively compare the learned superpixel probabilities, we follow recent literature and compare the superpixel pooled images [5]. Images are compared against their superpixel pooled images using PSNR and SSIM. Table 2 evaluates the denoising quality, Table 3 quantitatively evaluates the superpixel quality, and Figure 8 qualitatively evaluates the superpixel pooling quality. We observe a harmony between the denoising loss and pixel pooling loss terms.
Combining the two terms yields slightly improved denoising, as reported in the first two rows of Table 2. The first, third, and fourth columns of Table 3 show the ML superpixel quality is similar, but the boundary recall is slightly worse when combining the two. The superpixel pooling quality increases by 2.33 dB PSNR when combined, compared to only denoising. The second and last columns of Figure 8 qualitatively compare the pooled images. We observe a disharmony between the superpixel probabilities useful for denoising and those trained on segmentation labels. Table 2 reports a 2.6 dB PSNR drop in denoising quality. The first two rows of Table 3 report that training with segmentation labels yields the best ML superpixel estimates, matching related works [8]. However, the superpixel pooling quality is poor. This suggests boundary detection is distinct from denoising and pixel pooling.

Figure 8: Comparing Superpixel Probabilities via Superpixel Pooling [PSNR↑]. This figure uses superpixel pooling to qualitatively compare superpixel probabilities learned with different loss functions. Learning superpixel probabilities with only a denoising loss yields better superpixel pooling than supervised learning with segmentation labels. However, jointly training superpixel probabilities for denoising and image superpixel pooling improves denoising and pooling quality, which suggests a useful relationship between the two tasks.

Table 4: Computational Complexity. This table reports the FLOPs and memory consumption for Equations 3 and 5.

| Expression | SNA FLOPs | SNA Memory | NA FLOPs | NA Memory |
|---|---|---|---|---|
| exp(λat q⊺i kj) | O(F · HW · K²) | O(HW · K²) | O(F · HW · K²) | O(HW · K²) |
| exp(λat q⊺i kj) Σs π(i,s) π(j,s) | O(9 · HW · K²) | O(2 · HW · K²) | – | – |
| wi,j (normalization) | O(HW · K²) | O(HW · K²) | O(HW · K²) | O(HW · K²) |
| Σj∈N(i) wi,j vj | O(F · HW · K²) | O(F · HW) | O(F · HW · K²) | O(F · HW) |
| Total FLOPs & Peak Memory | O([2F + 10] · HW · K²) | O(F · HW + 3 · HW · K²) | O([2F + 1] · HW · K²) | O(F · HW + 2 · HW · K²) |
Constant terms are retained to clarify the differences between SNA and NA. The second row re-weights the attention map using superpixel probabilities, and it is the extra step distinguishing SNA from NA. This incurs an extra O(9 HW K^2) FLOPs and requires O(2 HW K^2) memory. However, SNA's additional complexity does not scale with the number of features (F).

5.4 Comparing Computational Complexity

Table 4 compares the FLOPs and memory consumption of SNA and NA from Equations 3 and 5. Let the neighborhood window size be written |N(i)| = K^2. In summary, the FLOPs estimates for SNA and NA are O([2F+10] HW K^2) and O([2F+1] HW K^2), respectively. The peak memory consumption for SNA and NA is O(F HW + 3 HW K^2) and O(F HW + 2 HW K^2), respectively. Importantly, SNA's additional complexity does not scale with the number of features (F). Since the number of features is a significant factor in the computational cost of an attention operator, it is sensible to believe SNA can be used efficiently within large-scale deep neural networks. To be concrete, if the number of features is 128, then there is less than a 4% increase in FLOPs from NA to SNA.

6 Discussion

This paper presents the soft superpixel neighborhood attention (SNA) module as an alternative to neighborhood attention (NA). SNA accounts for the deformable boundaries of objects. The key modeling assumption is that superpixel probabilities vary per pixel, and we find SNA is the optimal denoiser under this model. For a fixed budget of network parameters, SNA achieves significantly better denoising quality than NA within a single-layer neural network. While SNA does require more computation than NA, SNA's computation does not scale with the number of features, which suggests SNA can be used efficiently within large-scale deep neural networks.

7 Acknowledgments

This work is supported, in part, by the National Science Foundation under the awards 2030570, 2134209, and 2133032. The authors thank Drs.
Vinayak Rao and David Inouye for their valuable feedback and support.

References

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[2] Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6185-6194, 2023.
[3] Xiaofeng Ren and Jitendra Malik. Learning a classification model for segmentation. In Proceedings Ninth IEEE International Conference on Computer Vision, pages 10-17. IEEE, 2003.
[4] Max Wertheimer. Laws of organization in perceptual forms. A Source Book of Gestalt Psychology, pages 71-88, 1938.
[5] Roy Uziel, Meitar Ronen, and Oren Freifeld. Bayesian adaptive superpixel segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8470-8479, 2019.
[6] Fengting Yang, Qian Sun, Hailin Jin, and Zihan Zhou. Superpixel segmentation with fully convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13964-13973, 2020.
[7] Aiping Zhang, Wenqi Ren, Yi Liu, and Xiaochun Cao. Lightweight image super-resolution with superpixel token interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12728-12737, 2023.
[8] Varun Jampani, Deqing Sun, Ming-Yu Liu, Ming-Hsuan Yang, and Jan Kautz. Superpixel sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 352-368, 2018.
[9] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. Non-local means denoising. Image Processing On Line, 1:208-212, 2011.
[10] Marc Lebrun. An analysis and implementation of the BM3D image denoising method. Image Processing On Line, 2:175-213, 2012.
[11] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng.
Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2862-2869, 2014.
[12] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. Advances in Neural Information Processing Systems, 31, 2018.
[13] Tobias Plötz and Stefan Roth. Neural nearest neighbors networks. Advances in Neural Information Processing Systems, 31, 2018.
[14] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. Advances in Neural Information Processing Systems, 31, 2018.
[15] Hunsang Lee, Hyesong Choi, Kwanghoon Sohn, and Dongbo Min. KNN local attention for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2139-2149, 2022.
[16] Gregory Vaksman, Michael Elad, and Peyman Milanfar. LIDIA: Lightweight learned image denoising with instance adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 524-525, 2020.
[17] Diego Valsesia, Giulia Fracastoro, and Enrico Magli. Deep graph-convolutional image denoising. IEEE Transactions on Image Processing, 29:8226-8237, 2020.
[18] Yiqun Mei, Yuchen Fan, Yuqian Zhou, Lichao Huang, Thomas S. Huang, and Honghui Shi. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[19] Chong Mou, Jian Zhang, Xiaopeng Fan, Hangfan Liu, and Ronggang Wang. COLA-Net: Collaborative attention network for image restoration. IEEE Transactions on Multimedia, 24:1366-1377, 2021.
[20] Chong Mou, Jian Zhang, and Zhuoyuan Wu. Dynamic attentive graph learning for image restoration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4328-4337, 2021.
[21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012-10022, 2021.
[23] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1833-1844, 2021.
[24] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. CrossViT: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 357-366, 2021.
[25] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. Advances in Neural Information Processing Systems, 34:20014-20027, 2021.
[26] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels. Technical report, 2010.
[27] Olga Veksler, Yuri Boykov, and Paria Mehrani. Superpixels and supervoxels in an energy optimization framework. In Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part V, pages 211-224. Springer, 2010.
[28] Greg Mori, Xiaofeng Ren, Alexei A Efros, and Jitendra Malik. Recovering human body configurations: Combining segmentation and recognition.
In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 2, pages II-II. IEEE, 2004.
[29] Jian Sun, Wenfei Cao, Zongben Xu, and Jean Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 769-777, 2015.
[30] Kun Li, Yanming Zhu, Jingyu Yang, and Jianmin Jiang. Video super-resolution using an adaptive superpixel-guided auto-regressive model. Pattern Recognition, 51:59-71, 2016.
[31] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. Affinity derivation and graph merge for instance segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 686-703, 2018.
[32] Sree Ramya S. P. Malladi, Sundaresh Ram, and Jeffrey J. Rodríguez. Image denoising using superpixel-based PCA. IEEE Transactions on Multimedia, 23:2297-2309, 2021. doi: 10.1109/TMM.2020.3009502.
[33] Wei-Chih Tu, Ming-Yu Liu, Varun Jampani, Deqing Sun, Shao-Yi Chien, Ming-Hsuan Yang, and Jan Kautz. Learning superpixels with segmentation-aware affinity loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 568-576, 2018.
[34] Moshe Eliasof, Nir Ben Zikri, and Eran Treister. Rethinking unsupervised neural superpixel segmentation. In 2022 IEEE International Conference on Image Processing (ICIP), pages 3500-3504. IEEE, 2022.
[35] Constantinos Goutis. Nonparametric estimation of a mixing density via the kernel method. Journal of the American Statistical Association, 92(440):1445-1450, 1997.
[36] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 2, pages 416-423. IEEE, 2001.
[37] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565-26577, 2022.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234-241. Springer, 2015.
[39] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[40] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[41] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585:357-362, 2020. doi: 10.1038/s41586-020-2649-2.
[42] Wes McKinney et al. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, volume 445, pages 51-56. Austin, TX, 2010.
[43] NVIDIA, Péter Vingelmann, and Frank H.P. Fitzek. CUDA, release: 10.2.89, 2020. URL https://developer.nvidia.com/cuda-toolkit.
[44] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
[45] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197-5206, 2015.
[46] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using Manga109 dataset. Multimedia Tools and Applications, 76:21811-21838, 2017.
[47] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of 1st International Conference on Image Processing, volume 2, pages 168-172. IEEE, 1994.

A Additional Method Details and Proofs

A.1 Soft Superpixel Attention is the Optimal Denoiser

The latent superpixel model assumes the following data-generating process for each image pixel,

\begin{align}
&\text{Sample the superpixel probabilities:} && \pi \sim p(\pi) \nonumber \\
&\text{Sample the superpixel assignment given the probabilities:} && s_i \mid \pi^{(i)} \sim p(s_i \mid \pi^{(i)}) \nonumber \\
&\text{Sample the image pixel given the superpixel assignment:} && x \mid s \sim p(x \mid s) \tag{11}
\end{align}

We will construct a denoiser that depends on the superpixel probabilities rather than the superpixel assignment. Our derivation is inspired by [37]. Only the form of p(s_i | \pi^{(i)}) = \text{Categorical}(s_i; \pi^{(i)}) is known, and p(\pi) will not impact our final answer. So we only need to estimate the unknown density, p(x|s). To do this, we approximate p(x|s) from a sample (x, s, \pi),

\hat{p}(x_i \mid s_i) = \sum_{m=1}^{M} p(m \mid s_i)\, p(x_i \mid m) = \frac{1}{M}\sum_{m=1}^{M} p(s_i \mid m)\, \delta(x_i - x_m) \tag{12}

where p(s_i|m) = \pi^{(m, s_i)}. By re-weighting the empirical density with superpixel probabilities, rather than superpixel assignments, our estimate does not depend on the ambiguous superpixel labels.
Notice if the superpixel probabilities are independent of the sample index, then the usual weighting is recovered, p(m|s_i) = p(m) = 1/M. Next, the pixels are corrupted with Gaussian noise,

\begin{align}
\hat{p}_{\sigma}(\tilde{x}_i \mid s_i) &= \left[\hat{p}(\cdot \mid s_i) * \mathcal{N}(\cdot\,; 0, \sigma^2 I)\right](\tilde{x}_i) = \int_{\mathbb{R}^F} \hat{p}(x_0 \mid s_i)\, \mathcal{N}(x_0 - \tilde{x}_i; 0, \sigma^2 I)\, dx_0 \nonumber \\
&= \frac{1}{M}\sum_{m=1}^{M} p(s_i \mid m) \int_{\mathbb{R}^F} \delta(x_0 - x_m)\, \mathcal{N}(x_0; \tilde{x}_i, \sigma^2 I)\, dx_0 = \frac{1}{M}\sum_{m=1}^{M} p(s_i \mid m)\, \mathcal{N}(x_m; \tilde{x}_i, \sigma^2 I) \nonumber
\end{align}

Let the denoiser be a function of the noise intensity and superpixel probabilities, D(\tilde{x}_i; \sigma, \pi). The expected error of the denoiser is taken with respect to the density \hat{p}_\sigma(\tilde{x}_i | s_i)\, p(s_i | \pi^{(i)})\, p(\pi),

\begin{align}
&\mathbb{E}\left[\|D(\tilde{x}_i; \sigma, \pi) - x_i\|^2\right] \nonumber \\
&= \mathbb{E}_{p(\pi)}\left[\sum_{s=1}^{S} p(s \mid \pi^{(i)}) \int_{\mathbb{R}^F} \frac{1}{M}\sum_{m=1}^{M} p(s \mid m)\, \mathcal{N}(x_m; \tilde{x}_i, \sigma^2 I)\, \|D(\tilde{x}_i; \sigma, \pi) - x_m\|^2\, d\tilde{x}_i \right] \nonumber \\
&= \mathbb{E}_{p(\pi)}\Biggl[\frac{1}{M}\int_{\mathbb{R}^F} \underbrace{\sum_{m=1}^{M}\sum_{s=1}^{S} p(s \mid \pi^{(i)})\, p(s \mid m)\, \mathcal{N}(\tilde{x}_i; x_m, \sigma^2 I)\, \|D(\tilde{x}_i; \sigma, \pi) - x_m\|^2}_{=:\,\mathcal{L}(D;\, \tilde{x}_i, \sigma, \pi)}\, d\tilde{x}_i \Biggr] \nonumber
\end{align}

Note p(s_i = s \mid \pi^{(i)}) is written p(s \mid \pi^{(i)}) for clarity. The expectation can be minimized by minimizing each integrand for a fixed noisy pixel (\tilde{x}) and superpixel probability (\pi). Since the integrand is a sum of convex functions, the problem is convex and the solution is the point where the gradient is equal to zero.
0 = \nabla_{D(\tilde{x}_i; \sigma, \pi)}\, \mathcal{L}(D; \tilde{x}_i, \sigma, \pi) = 2\sum_{m=1}^{M} \mathcal{N}(\tilde{x}_i; x_m, \sigma^2) \sum_{s=1}^{S} p(s \mid \pi^{(i)})\, p(s \mid m) \left[D^*(\tilde{x}_i; \sigma, \pi) - x_m\right]

In general, the optimal denoiser is written as,

D^*(\tilde{x}_i; \sigma, \pi) = \sum_{m=1}^{M} x_m \frac{\mathcal{N}(\tilde{x}_i; x_m, \sigma^2)\sum_{s=1}^{S} p(s \mid \pi^{(i)})\, p(s \mid m)}{\sum_{m'=1}^{M} \mathcal{N}(\tilde{x}_i; x_{m'}, \sigma^2)\sum_{s=1}^{S} p(s \mid \pi^{(i)})\, p(s \mid m')} \tag{19}

By construction, the probability of a particular sample's superpixel assignment is written p(s|m) = \pi^{(m,s)}. When the samples are restricted to a neighborhood surrounding pixel i, one can write the denoiser as the following expression,

D^*(\tilde{x}_i; \sigma, \pi) = \sum_{j \in \mathcal{N}(i)} x_j \frac{\exp\left(-\frac{1}{2\sigma^2}\|\tilde{x}_i - x_j\|^2\right)\sum_{s=1}^{S} \pi^{(i,s)}\pi^{(j,s)}}{\sum_{j' \in \mathcal{N}(i)} \exp\left(-\frac{1}{2\sigma^2}\|\tilde{x}_i - x_{j'}\|^2\right)\sum_{s=1}^{S} \pi^{(i,s)}\pi^{(j',s)}} \tag{20}

This is soft superpixel neighborhood attention when the samples are restricted to a neighborhood surrounding pixel i, the qkv-transforms are identity, and the attention scale is fixed, \lambda_{\text{at}} = \frac{1}{2\sigma^2}. Presently, we highlight where our per-pixel superpixel probabilities are used in the proof. In Equation 19, the probability p(s|\pi) depends on the sampled superpixel probabilities \pi. When denoising image pixel index i, we have p(s|\pi) = \pi^{(i,s)}. However, under the standard model with a shared superpixel probability prior, p(s|\pi) does not depend on the particular pixel index. This reduces Equation 20 to a superpixel re-weighting term that does not depend on superpixel information from pixel i, which would be strange.
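Equation 20 translates almost directly into code. Below is a minimal NumPy sketch for 1-D pixel values that uses the full image as the neighborhood for brevity; the function name `sna_denoise` is ours, and the released code operates on local windows with learned qkv-transforms.

```python
import numpy as np

def sna_denoise(x_noisy, pi, sigma):
    """Optimal denoiser of Eq. 20: Gaussian affinities re-weighted by the
    superpixel-probability overlap sum_s pi(i,s) pi(j,s), then row-normalized.

    x_noisy : (N,) noisy pixel values (identity qkv-transforms, F = 1)
    pi      : (N, S) per-pixel superpixel probabilities
    sigma   : noise standard deviation (lambda_at = 1 / (2 sigma^2))
    """
    d2 = (x_noisy[:, None] - x_noisy[None, :]) ** 2   # pairwise squared differences
    affinity = np.exp(-d2 / (2.0 * sigma ** 2))       # exp(-||x_i - x_j||^2 / 2s^2)
    overlap = pi @ pi.T                               # sum_s pi(i,s) * pi(j,s)
    weights = affinity * overlap
    weights /= weights.sum(axis=1, keepdims=True)     # normalize over the neighborhood
    return weights @ x_noisy
```

With one-hot probabilities this reduces to hard superpixel attention (H-SNA); with uniform probabilities the overlap term is constant, cancels in the normalization, and NA is recovered.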
We experimentally tried this alternative re-weighting scheme, and its denoising quality matches NA. Finally, we discuss a conceptual rationale for sampling superpixel probabilities for each image pixel. The introduction (Sec 1) explains that superpixel assignment is an ambiguous classification task. There are many qualitatively good superpixel assignments, and often superpixel boundaries can be modified without impacting the overall quality. This suggests superpixel probabilities are more important than superpixel assignments. However, working with p(x|\pi) is more challenging than working with p(x|s). So in this paper, we still work with p(x|s) but design the denoiser to depend on the probabilities \pi to circumvent using the ambiguous superpixel assignments.

A.2 Expected Denoising Error for a Fixed Neighborhood

Section 4.3 states that SNA is the optimal denoiser, but in practice, samples are limited to a neighborhood surrounding a query pixel. To analyze the impact of this restriction, this subsection computes the expected error over the image pixels and noise for a fixed superpixel sample. For expedient analysis, the number of features is one (F = 1, so each x_i is a scalar). The image pixels conditioned on the superpixel information are modeled with a Gaussian distribution, x_i \mid s_i, \mu, \eta \sim \mathcal{N}(\mu_{s_i}, \eta_{s_i}^2). We will analyze the attention modules by studying a denoiser of the following form,

\mathcal{D}_i(\tilde{x}) = \sum_{j \in \mathcal{N}(i)} w_j \tilde{x}_j \tag{21}

This denoiser corresponds to using a flat attention scale \lambda_{\text{at}} = 0 and using only prior superpixel probabilities as attention weights w_j. Analysis of attention weights, in general, is difficult, and this approach removes the challenge. The expected error is computed over the Gaussian-corrupted and noise-free data density conditioned on the superpixel information, p(\tilde{x}|x)\, p(x|s, \mu, \eta). The expected error is split into the classic bias-variance decomposition. See Section A.4 for the proof.
Claim 2 The expected error of the denoiser \mathcal{D} over the joint density p(\tilde{x}, x \mid s, \mu, \eta) is written,

\mathbb{E}\left[\left(\mathcal{D}_i(\tilde{x}) - x_i\right)^2\right] = \underbrace{\sigma^2 \sum_{j \in \mathcal{N}(i)} w_j^2}_{\text{Variance}} + \underbrace{\left[\sum_{j \in \mathcal{N}(i)} w_j \left(\mu_{s_j} - \mu_{s_i}\right)\right]^2}_{\text{Bias}} \tag{22}

The NA module's flat weights achieve the minimum variance of \frac{\sigma^2}{W}, where W = |\mathcal{N}(\cdot)| is the window size. However, NA's bias depends entirely on its neighbors' superpixels and can be large if the neighborhood happens to contain pixels from superpixels with different means. For example, the bias will be large for pixels along the sharp orange edge in Figure 3 since their neighborhoods include the dissimilar blue region. Section A.3 shows that even when the attention scale is non-zero, NA cannot robustly reject dissimilar regions. H-SNA has the opposite problem of NA. H-SNA's sharp weights achieve the minimum possible bias of zero. However, H-SNA's variance depends entirely on the size of the superpixel. In Figure 3, the snake-like superpixel reduces the number of included pixels, increasing the variance term of the error. This observation that thin and small superpixels are undesirable matches existing assumptions from the superpixel literature [26, 5]. SNA's modulation of the superpixel weights allows it to be better than both NA and H-SNA. In Figure 3, SNA excludes the dissimilar blue region (unlike NA) and includes the similar orange regions (unlike H-SNA). Said another way, SNA excludes only dissimilar regions.

A.3 Exclude Dissimilar Regions Instead of Dissimilar Pixels

Section A.2 showed SNA yields a smaller theoretical error than NA because SNA can exclude dissimilar regions via superpixel probabilities. The previous analysis fixed the attention scale to zero (\lambda_{\text{at}} = 0), and one may wonder if the attention scale might be used to exclude dissimilar regions as well.
This subsection shows the attention scale excludes dissimilar pixels rather than dissimilar regions, which is less robust under noise. In this analysis, the qk-transforms are fixed to identity, the feature dimension is fixed to 1, and the distance function is the negative squared difference: d(y_i, y_j) = -(y_i - y_j)^2. We first show the attention scale excludes dissimilar pixels rather than regions. Weights of pixels within a similar region should be similar, w_{i,j} \approx w_{i,i}, while weights of pixels within dissimilar regions should be near zero, w_{i,j} \approx 0. The decay of dissimilar regions will only occur for NA when the scale increases (\lambda_{\text{at}} > 0), but this growth also decays the weights associated with similar regions. Noting e^{\lambda_{\text{at}} d(y_i, y_i)} = 1, the ratio of two weights for NA is as follows:

\log\left(\frac{w_{i,j}}{w_{i,i}}\right) = -\lambda_{\text{at}}(y_i - y_j)^2 \approx 0 \;\Longleftrightarrow\; y_i \approx y_j \tag{23}

Two noisy pixel values must be approximately equal for their attention weights to be approximately equal. In other words, excluding dissimilar regions with the attention weights also excludes non-identical noisy pixels. This strict condition limits NA's ability to exclude dissimilar regions. Compare this with SNA, which we argue excludes regions more robustly. Say all but a single region should be excluded, so \lambda_{\text{sp}} can be large. The inclusion of the single (useful) region depends on the most likely estimated superpixel matching the latent probabilities, s_i^* = \hat{s}_i. Using Section 3.1's formulation, the argmax terms match when the following condition is met,

(\hat{\mu}_{s_i^*} - y_i)^2 < (\hat{\mu}_{\hat{s}_i} - y_i)^2, \quad \hat{s}_i \neq s_i^* \tag{24}

In other words, SNA properly rejects dissimilar regions when a noisy pixel is more similar to its true mean than another superpixel mean. Comparing two noisy differences is more robust than comparing two noisy pixels. So we claim SNA rejects dissimilar regions more robustly than NA.
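The bias-variance trade-off of Claim 2 can be made concrete with a few lines of NumPy; the neighborhood below is a hypothetical 5-pixel window in which three pixels share the query's superpixel mean and two come from a dissimilar region.

```python
import numpy as np

def expected_error(w, mu_nbr, mu_i, sigma):
    """Claim 2: variance + squared bias of the denoiser D_i = sum_j w_j x~_j."""
    variance = sigma ** 2 * np.sum(w ** 2)
    bias_sq = np.sum(w * (mu_nbr - mu_i)) ** 2
    return variance + bias_sq

sigma = 1.0
mu_nbr = np.array([0., 0., 0., 10., 10.])  # superpixel means of the 5 neighbors
mu_i = 0.0                                 # mean of the query pixel's superpixel

w_na = np.full(5, 1 / 5)                       # NA: flat weights, minimum variance
w_hsna = np.array([1/3, 1/3, 1/3, 0., 0.])     # H-SNA: zero bias, higher variance
err_na = expected_error(w_na, mu_nbr, mu_i, sigma)      # 0.2 (variance) + 16 (bias)
err_hsna = expected_error(w_hsna, mu_nbr, mu_i, sigma)  # 1/3 (variance) + 0 (bias)
```

Here NA's flat weights incur a large bias from the dissimilar region, while excluding it raises the variance only to σ²/3, illustrating why excluding dissimilar regions, and only dissimilar regions, minimizes the error.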
A.4 The Estimated Expected Denoising Error

This subsection presents a more detailed version of Claim 2 from Section A.2 and its proof.

Claim 3 The expected error of the denoiser \mathcal{D} over the joint density p(\tilde{x}, x \mid s, \mu, \eta) is written,

\mathbb{E}\left[\left(\mathcal{D}_i(\tilde{x}) - x_i\right)^2\right] = \underbrace{\eta_{s_i}^2(1 - 2w_i) + \sum_{j \in \mathcal{N}(i)} w_j^2\left(\sigma^2 + \eta_{s_j}^2\right)}_{\text{Variance}} + \underbrace{\left[\sum_{j \in \mathcal{N}(i)} w_j\left(\mu_{s_j} - \mu_{s_i}\right)\right]^2}_{\text{Bias}} \tag{25}

The following proves the above claim. The first step is to expand the square,

\mathbb{E}\left[\left(\mathcal{D}_i(\tilde{x}) - x_i\right)^2\right] = \mathbb{E}\left[\mathcal{D}_i^2(\tilde{x})\right] + \mathbb{E}\left[x_i^2\right] - 2\,\mathbb{E}\left[\mathcal{D}_i(\tilde{x})\, x_i\right] \tag{26}

Each expectation is evaluated separately,

\mathbb{E}\left[x_i^2\right] = \eta_{s_i}^2 + \mu_{s_i}^2
\mathbb{E}\left[\mathcal{D}_i^2(\tilde{x})\right] = \sum_{j \in \mathcal{N}(i)} w_j^2\, \mathbb{E}\left[\tilde{x}_j^2\right] + 2\sum_{k