# Hierarchy-Agnostic Unsupervised Segmentation: Parsing Semantic Image Structure

Simone Rossetti 1,2, Fiora Pirri 1,2
1 DIAG, Sapienza University of Rome  2 DeepPlants
{rossetti,pirri}@diag.uniroma1.it  {simone,fiora}@deepplants.com

Abstract

Unsupervised semantic segmentation aims to discover groupings within images, capturing objects' view invariance without external supervision. The task is inherently ambiguous due to the varying levels of semantic granularity, an ambiguity that existing methods often bypass using dataset-specific priors. In our research, we address this ambiguity head-on and provide a universal tool for pixel-level semantic parsing of images, guided by the latent representations encoded in self-supervised models. We introduce a novel algebraic approach that recursively decomposes an image into nested subgraphs, dynamically estimating their count and ensuring clear separation. The approach identifies scene-specific primitives and constructs a hierarchy-agnostic tree of semantic regions from the image pixels. The model captures fine and coarse semantic details, producing a nuanced and unbiased segmentation. We present a new metric for estimating the quality of the semantic segmentation of discovered elements on different levels of the hierarchy. The metric validates the intrinsic nature of the compositional relations among parts, objects, and scenes in a hierarchy-agnostic domain. Our results prove the power of this methodology, uncovering semantic regions without prior definitions and scaling effectively across various datasets. This robust framework for unsupervised image segmentation recovers more accurate semantic hierarchical relationships between scene elements than traditional algorithms.
The experiments underscore its potential for broad applicability in image analysis tasks, showcasing its ability to deliver a detailed and unbiased segmentation that surpasses existing unsupervised methods.

1 Introduction

The advancement of image segmentation has recently taken significant steps forward. On the one hand, foundation models are trained on increasingly large datasets, such as CLIPSeg [57] (based on CLIP [67]), SAM [48], and SEEM [97], supervised by text and human prompts [92]. On the other hand, unsupervised segmentation models are growing rapidly. Unsupervised segmentation explores the feature hierarchy by leveraging self-supervised contrastive learning, as in SmooSeg [51], U2Seg [62], CutLER [87], CuVLER [5], STEGO [32], ACSeg [52], FreeSolo [85], HSG [44], TransFGU [90], DeepCut [3], and others [77, 58, 96, 86, 33, 17]. Unsupervised models discover and localize image categories with no annotation aid and evaluate the quality of their predicted pseudo-masks on the dataset corpora used for testing, such as COCO-Stuff [10] and Cityscapes [21]. Despite exploring human perception more closely than the foundation models, they still rely on the linguistic and conceptual relations between the images and their annotations. Curated datasets, such as ImageNet [50], Pascal VOC [27], or MSCOCO [53], show an extraordinary number of objects with all their components and particulars not annotated either at the image level or densely.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Figure 1: Hierarchy-agnostic unsupervised segmentation. Finer image parts are generated via over-clustering, each region colour-coded randomly. Our algorithm recursively partitions these parts, grouping them into coarser regions across multiple levels of granularity. The resulting tree represents an unsupervised hierarchical semantic segmentation. The arrangement of regions in the tree reflects their semantic distance, which is colour-coded in the heat map shown on the right.

Why annotate this and not that? Annotation prejudice creates a bias towards a subset of the scene. Unsupervised learning, in contrast, has the potential to generate richer representations that are not restrained by annotation decisions. Without juggling annotations, unsupervised features (e.g. [64]) nearly mirror visual perception, discovering scene parts and details; indeed, feature cues live in nested context levels and are not necessarily verbalisable [76, 66], as opposed to the Gestalt holistic view [83]. Following previous research [44, 19, 3, 94], our key idea is that the natural hierarchical structure of visual scenes is an essential attribute that we can actively pursue in unsupervised segmentation. We approach unsupervised semantic segmentation as an unsupervised pixel-wise feature learning problem, discretising the hierarchical semantic knowledge yielded by self-supervised learning. Unlike [44], our approach makes no assumption about the number of semantic granularity levels or the number of partitions at each level. We generate robust hierarchical segmentation for every image, relying solely on hierarchical clustering in feature space; see Figure 1. This new method achieves unsupervised segmentation by leveraging relationships between concepts hidden in the latent space of self-supervised models, across multiple levels of semantic granularity. We propose a simple algebraic methodology, grounded in a vast literature [20, 6, 61, 81, 75, 91], that segments scene parts without supervision. The method guarantees the construction of natural, scene-dependent classes of primitives [55], which can be easily used in unsupervised segmentation without surrendering to their a priori definitions. Our contributions are:

1.
We introduce a deep recursive spectral clustering that maximises an unbiased semantic similarity measure at multiple granularity levels. Exposed to any generic group of images, the method yields hierarchical unsupervised semantic segmentation.

2. We introduce new metrics for estimating the quality of the semantic segmentation of the elements discovered on the different levels of the hierarchy, called Normalised Multigranular Covering (NMCovering) and Normalised Hierarchical Covering (NHCovering).

3. We integrate the method into various self-supervised learning models to enhance its flexibility for benchmarking purposes, making it suitable as a downstream task that infers all scene parts in the images of any given dataset.

2 Related works

Self-supervised representation learning for unsupervised segmentation. Self-supervised learning (SSL) is about learning representations of real-world data without human supervision [14, 36, 11, 64]. Contrastive learning (CL) [18, 63] is the most prominent method in SSL, maximising feature similarities between an image and its affine transformations while minimising similarity between randomly sampled images. Most unsupervised segmentation methods feed self-supervised features to graphs for clustering [58, 88, 87, 3]. Alternatively, features are used for building distillation strategies [32, 96, 51] or for designing inductive priors that enforce some consistency property [51, 17, 73]. SSL representations implicitly define a distance metric in the latent space, though learning a metric space is not their primary objective.

Unsupervised semantic segmentation. Unsupervised semantic segmentation labels pixels that belong to specific elements in the scene, grouping all objects of the same semantic type under a single label, without supervision.
Earlier approaches, such as [41, 47, 17, 86, 77, 33], did not yet have powerful SSL features like [11] available, though they used CL principles. PiCie [17] used equivariance to transformations; IIC [41] resorted to invariant information clustering, maximising mutual information to find commonality in objects, as did InfoSeg [33]. CLD [86] integrated local clustering into contrastive learning; [77] used saliency to find the image foreground and guide CL of pixel embeddings. The availability of high-level unsupervised features triggered new strategies. DINO self-distillation [11] inspired both STEGO [32] and SmooSeg [51]. STEGO trains a segmentation head by distilling the feature correspondences to form compact clusters. SmooSeg uses a smoothness prior over semantically coherent regions as a supervision cue to generate semantic maps. However, many methods suffer from the background problem and elaborate on DINO's attention to obtain foreground objects, such as FreeSolo [85] and [93]. TransFGU [90] instead follows a top-down object-centric approach, generating pixel pseudo-labels according to Grad-CAM [72]. Spectral clustering, introduced to machine learning by [74], is considered in [88, 58, 87, 52, 68]. In particular, CutLER [87] applies NCut [74] iteratively to a masked similarity matrix to discover foreground elements of the scene. ACSeg [52] uses the affinity matrix to discover concept similarities. SemPart [68] considers the foreground a single object saliency mask and applies graph regularisation. In [73], patch-level contrastive learning leverages global hidden positives to learn semantic consistency. As grouping is the common denominator, all the mentioned methods face the problem of deciding the correct level of granularity. Some resort to an object-centric bias, such as [77, 93], CAMs [90], hinting self-attention [96, 79], or fixed-size flat image partitioning [58].
Others, such as [96] and [46, 73], resort to a scene-centric prior assumption.

Unsupervised part discovery. Although hierarchically discovering parts has a long history in computer vision, it has only recently been revisited and connected to unsupervised part segmentation. The first input came from [31], analysing the hierarchical nature of deep learning features. One of the first approaches was SCOPS [39], which uses a dense self-supervised contrastive loss to discover foreground parts of single objects. In [94], the authors introduce self-supervised primitive hierarchical grouping. They leverage a boundary strength map (OWT-UCM [4]) on relatively few images to learn from a large dataset. The approach formulates an ultrametric map defining a region hierarchy. Similarly, HSG [44] leverages region boundaries to obtain multi-level segmentation. HSG is trained unsupervisedly from scratch, performing pixel grouping with dense contrastive learning across different granularity levels. In [19], K fixed parts are discovered via an average part descriptor and by forcing consistency using the equivariance of affine and photometric transformations. Leopart [96] follows DINO self-distillation to classify pixels, obtaining detailed scene parts, further clustered via community detection. Finally, DeepCut [3] approaches unsupervised semantic part segmentation using both spectral clustering with NCut and GNN convolution, constructing a patch-wise correlation [7] matrix from DINO features.

We present a flexible, unsupervised method for segmenting natural images, automatically creating data structures that organize pixels by their semantic coherence across multiple levels of detail without relying on predefined hierarchies or labels. This method is designed to segment images from coarse regions to finer parts, providing an intuitive representation of visual content; see Figure 1. For instance, consider an urban street view.
At a high level, it consists of elements such as the sky, a road, buildings, and vehicles. Among the vehicles, there might be a bus or a car, which can be further decomposed into parts like the body shell, wheels, and other visible components. Our approach segments an image I ∈ R^(3×M×N) into a hierarchy-agnostic tree T of semantic regions, with each pixel in the image assigned to a leaf node. Our model is a function f : I → T, represented by a deep neural network, which maps an image to its semantic regions. Notably, this is achieved in a fully unsupervised manner; we incorporate mechanisms that guide f to produce a meaningful decomposition of the image, even without labelled examples.

3.1 Overview of the Approach

Our method discovers similar parts in an image by recursively partitioning a graph constructed from deep feature representations of the image. The key idea is to treat self-supervised models for image processing as codebooks of a lower-dimensional space, with the extracted feature vectors acting as codes that embed the semantics of visual concepts. The primary cue for discovering parts is a deep feature extractor ϕ, a neural network pre-trained without supervision on a standard benchmark such as ImageNet. We observe that the alignment of codes expresses semantic similarity across multiple levels of granularity. A higher degree of alignment indicates semantic similarity at finer granularity levels, corresponding to indivisible object parts or primitives. Conversely, a lower degree of alignment reflects semantic similarity at coarser granularity levels. By discretizing the density of these alignments, we can discover scene and object parts at various levels of detail. Let v_i = [ϕ(I)]_i ∈ R^d be the code associated with pixel location i in the image. If pixel j belongs to the same finer part as i, their codes should be similar. Conversely, if they belong to different parts, their codes will diverge.
We expect this property to be consistent in each image I and unaffected by the particular object instances. A straightforward approach might be to cluster these codes using algorithms that minimize a distortion metric in the latent space to identify regions of high feature concentration. However, a critical flaw of such methods in high-dimensional space is the presence of many local minima in the cost function. This would require multiple restarts of the iterative algorithms to find a suitable solution, which is impractical due to high complexity. To overcome these limitations, we adopt an efficient method to partition the image into similar regions, avoiding the pitfalls of multiple local minima and the need for iterative restarts in high-dimensional space.

Graph construction. We represent the image I as a weighted undirected graph G = (V, E, w), where V = {v_i}_{i=1}^n and n = M·N. The weight assigned to each edge (i, j) ∈ E is defined by a scaled and shifted cosine similarity between feature vectors, w_ij = w(v_i, v_j) ∈ [0, 1]. These weights form the adjacency matrix W = [w_ij] ∈ [0, 1]^(n×n), the degree matrix D = diag[d_i] ∈ R^(n×n), where d_i = Σ_j w_ij, and the normalized graph Laplacian L = D^(-1/2)(D - W)D^(-1/2) [59]. We interpret the edge weights as indicators of the semantic granularity between nodes. Specifically, if w_ij ≈ 1, pixels i and j likely belong to the same fine-grained part (primitive). Conversely, if w_ij ≈ 0, these pixels likely belong to entirely different parts, indicating minimal semantic similarity even at the coarsest level of granularity.

Similarity perturbation. In an ideal scenario, at a specific granularity level, k distinct connected components emerge in G, resulting in a binary adjacency matrix W* ∈ {0, 1}^(n×n) with k non-zero diagonal blocks, indicating strong intra-component connectivity and no inter-component connections, and a corresponding normalized Laplacian L*. However, in practice, the observed adjacency matrix is not discrete.
Instead, W exhibits tightly connected components alongside others with lower connectivity, resulting in a perturbed normalized graph Laplacian L. We regard the primitives of the model as affected by a small symmetric perturbation H ∈ R^(n×n) incorporating contextual information from coarser levels of semantic granularity, i.e. L = L* + H, making the observed adjacency real-valued. Fortunately, the Davis-Kahan symmetric sin θ theorem [23, 91] helps us manage perturbations; see also Appendix A. If the eigenvalues of L exhibit a spectral gap δ, the corresponding eigenspaces of L* and L remain close despite the perturbation H. The theorem quantifies this proximity by relating the angle θ between the eigenspaces of L and L*, where sin θ is proportional to the norm of H and inversely proportional to δ. By selecting the largest gap, we isolate the part of the spectrum closest to the original graph, guessing the true semantic structure at the specific granularity level.

Normalized smoothness measure. We consider a function g : V → R that assigns a value g(v_i) to each node v_i ∈ V. Since |V| = n, we identify the function g with a vector in R^n. Based on the considerations from [74, 8] (see also Appendix C), we define the normalized smoothness measure of g on the graph G using the functional S_G : R^n → R_+ induced by L through the form:

    S_G(g) = gᵀ(D - W)g / gᵀD g = ( (1/2) Σ_ij (g(v_i) - g(v_j))² w_ij ) / ( Σ_i g(v_i)² d_i ).   (1)

We observe that minimizing the functional yields normalized smoothest functions that assign similar values to tightly connected nodes and different values to weakly connected ones, while accounting for their importance in the graph and avoiding trivial solutions for low-connectivity nodes. Therefore, if g is both normalized and smooth with respect to G, then g(v_i) is similar to g(v_j) whenever v_i is similar to v_j, where similarity is quantified by the weight w_ij and adjusted by the node degree d_i.
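As a sanity check of the construction above, the adjacency, degree, and smoothness computations can be sketched in a few lines of NumPy. The (cos + 1)/2 scaling of the similarity is an assumption of this sketch, one plausible "scaled and shifted" choice, not necessarily the paper's exact one:

```python
import numpy as np

def smoothness(features, g, eps=1e-12):
    """Normalized smoothness S_G(g) = g^T (D - W) g / g^T D g on the
    feature graph of an image; `features` is (n, d), g is (n,)."""
    V = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = (V @ V.T + 1.0) / 2.0         # cosine similarity mapped into [0, 1]
    np.fill_diagonal(W, 0.0)          # no self-loops
    d = W.sum(axis=1)                 # node degrees d_i
    num = g @ ((np.diag(d) - W) @ g)  # g^T (D - W) g
    den = g @ (d * g) + eps           # g^T D g
    return num / den
```

Consistent with the remark on trivial solutions, a constant g scores (numerically) zero: it is the constant eigenvector y_0 that the method later discards.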
Given these properties, we treat any function g as a continuous partition function on the graph G and evaluate its correctness by a feature-density-change criterion on the data partition, which measures variations in similarity across different regions of the graph.

Recursive partitioning with perturbation stability. We propose a recursive partitioning strategy for discovering semantic parts by progressively dividing the graph, starting from the whole and refining it into tightly connected subgraphs. At each recursion level, we examine the subgraph's granularity, identify perturbations (contextual variations affecting node connections), and derive the unperturbed adjacency matrix to reveal finer semantic components. By capturing relationships at multiple levels of detail, this approach yields more nuanced segmentation than methods that partition the entire graph's nodes into a fixed number k of sets. We aim to find the smoothest normalized functions that best describe the semantic structure of a subgraph G at a specific granularity.¹ We tackle the minimization of Equation (1) as a standard eigenvalue problem [74] using the Rayleigh-Ritz quotient form. This yields the orthonormal eigenvectors y_i corresponding to the smallest eigenvalues λ_i of L, as guaranteed by the Courant-Fischer theorem:

    y_i = argmin_{‖y‖ = 1, y ⊥ y_0, …, y_{i-1}} S_G(y).   (2)

We then compute the spectral gaps δ_j = λ_{j+1} - λ_j for 2 ≤ j ≤ k_max - 1 and seek the k-th gap giving the tightest sin θ bound, i.e. k = argmax_j δ_j. We select up to k smoothest normalized functions on G, namely the first eigenvectors of L, y_1, …, y_{k-1}; we ignore y_0 since it is constant. These eigenvectors provide orthogonal graph signals based on semantic coherence, with nodes showing similar values in these functions likely belonging to the same semantic part; thus, each signal points to a distinct connected component. As in [61], we recover the true semantic structure of the graph considering the matrix Y = [y_1, y_2, …, y_{k-1}] ∈ R^(n×(k-1)).
First, we perform ℓ2-normalization of each row in Y, X_ij = Y_ij / (Σ_j Y_ij²)^(1/2), with X ∈ R^(n×(k-1)); the i-th row of X represents the normalized feature vector for the i-th node, which determines the node's membership in a cluster. Then, we take the best membership for each node using an algorithm that attempts to minimize distortion in a lower-dimensional space, finding k disjoint partitions of the nodes V_1, V_2, …, V_k such that ∪_{i=1}^k V_i = V and ∩_{i=1}^k V_i = ∅. We determine L* and estimate the perturbation H = L - L* to compute the sin θ upper bound. Each recursion step splits the graph into tighter subgraphs. This process continues until one of the early-stopping criteria is met: (1) if a partition is too small (less than k_min), (2) if the eigenvalues exceed a maximum smoothness threshold λ_max, or (3) if the sin θ upper bound becomes too large (more than p_max), indicating that the current estimate of L* is no longer reliable. These stopping conditions ensure we halt when further partitioning does not yield meaningful semantic components, thus identifying the final set of image parts or primitives.

¹We generalize G and L to denote any subgraph and its Laplacian, rather than just the original image graph.

Figure 2: Qualitative results of our algorithm on Pascal VOC2012, COCO-Stuff and Cityscapes datasets. The Hierarchy columns colour-code the pixel semantic hierarchy, and the Category columns are randomly colour-coded, helping visually discriminate hierarchically close pixels.

Each recursive partitioning adds structure to the tree T, where nodes specify semantic regions at various levels. The final output is T, with each leaf node capturing a distinct part of the image at an appropriate level of granularity; see Figure 2. The values k_min, λ_max and p_max are found experimentally for the tested dataset; tables are shown in Section 4 and Appendix D. In Appendix B, we discuss the algorithm's properties and the generated T, and present the complete pseudocode of our method.
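One recursion step of the procedure above, eigengap selection included, can be sketched in NumPy following the Ng et al. [61] embedding. This is a simplified illustration: the farthest-point k-means initialization and the k_max cap are our own choices, and the Davis-Kahan stopping tests are omitted:

```python
import numpy as np

def normalized_laplacian(W):
    d = np.maximum(W.sum(axis=1), 1e-12)
    s = 1.0 / np.sqrt(d)
    return np.eye(len(W)) - s[:, None] * W * s[None, :]

def kmeans(X, k, iters=50):
    # farthest-point initialization, then plain Lloyd iterations
    C = [X[0]]
    for _ in range(1, k):
        d2 = np.min(((X[:, None, :] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(X[int(np.argmax(d2))])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels

def split_once(W, k_max=6):
    """One recursion level: estimate k from the largest eigengap of the
    normalized Laplacian, embed nodes with the first k-1 non-trivial
    eigenvectors, row-normalize, and cluster."""
    lam, vecs = np.linalg.eigh(normalized_laplacian(W))  # ascending order
    gaps = np.diff(lam[: min(k_max, len(lam))])
    k = int(np.argmax(gaps[1:]) + 2)  # skip the gap after the constant mode
    Y = vecs[:, 1:k]                  # drop the constant eigenvector y_0
    X = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    return k, kmeans(X, k)
```

On an adjacency matrix with two tightly connected blocks and weak inter-block weights, the eigengap correctly selects k = 2 and the labels split the blocks.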
3.2 Pre- and Post-Processing

Boosting computation. We introduce a preprocessing strategy that condenses the graph, reducing the algorithm's time and memory requirements by several orders of magnitude while maintaining comparable accuracy. The condensed graph simplifies the original graph into m nodes by contracting strongly connected components into vertices, where m ≪ n. We assume that the finest semantic content in natural images is inherently limited and cannot exceed the raw pixel statistics. From the input image I we extract m regions [2, 69, 74], leading to an initial undirected condensed graph G_c = (V_c, E_c, w), where V_c = {A_i}_{i=1}^m, such that ∪_{i=1}^m A_i = V and ∩_{i=1}^m A_i = ∅, and the edge weights w(A_i, A_j) represent the degree of association between regions, defined as w(A_i, A_j) = Σ_{u ∈ A_i, v ∈ A_j} w(u, v). We then apply our recursive partitioning algorithm to G_c and its corresponding normalized Laplacian matrix L_c. As a result, we obtain a region tree T. Ablation studies in Section 4.3 compare performance across various over-clustering methods.

Boundary sharpening. Given a predicted region tree T with q leaves, B_1, B_2, …, B_q, each embedding a disjoint segment of V, such that ∪_{j=1}^q B_j = V and ∩_{j=1}^q B_j = ∅, we compute the prototypes for the image I. Each prototype is defined as the ℓ2-normalized average of the feature codes in each leaf, u_j = |B_j|^(-1) Σ_{v ∈ B_j} v. For each pixel, we define a conditional probability distribution over the prototypes using the softmax function with smoothing parameter τ, namely p_ij = exp(v_iᵀ u_j τ^(-1)) / Σ_k exp(v_iᵀ u_k τ^(-1)). We arrange the matrix P = [p_ij] ∈ [0, 1]^((M·N)×q) and sharpen the region boundaries by applying a Conditional Random Field (CRF) [49], which refines the predicted distribution P by incorporating dependencies between pixel observations in I. This improves accuracy at the boundary, ensuring a sharper and more precise segmentation of leaves.
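The prototype and softmax steps of the boundary sharpening can be sketched as follows. The CRF refinement is omitted here, and τ = 0.1 is an illustrative value rather than the paper's setting:

```python
import numpy as np

def leaf_probabilities(codes, leaves, tau=0.1):
    """Soft assignment P = [p_ij] of every pixel code to the q leaf
    prototypes u_j (l2-normalized mean code of leaf B_j)."""
    protos = []
    for idx in leaves:                   # idx: pixel indices of leaf B_j
        u = codes[idx].mean(axis=0)
        protos.append(u / np.linalg.norm(u))
    U = np.stack(protos)                 # (q, d)
    scores = codes @ U.T / tau           # v_i^T u_j / tau
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    return P / P.sum(axis=1, keepdims=True)
```

Each row of P sums to one, and pixels whose codes align with a leaf's prototype concentrate their probability mass on that leaf.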
4 Experiments

We benchmark our algorithm on unsupervised multi-granular segmentation using seven major object- and scene-centric datasets and seven hierarchically structured datasets with varying granularity levels for hierarchy-agnostic segmentation. Our evaluation includes an ablation study to assess the contribution of each algorithm component and comparisons across different self-supervised backbone architectures. We only utilize publicly available datasets, SSL model checkpoints without retraining, and validation-set ground-truth annotations. CRF is applied only where specified. Each dataset provides unique characteristics essential for different segmentation challenges. Pascal VOC2012 [27] offers a broad range of object categories, making it suitable for general object segmentation tasks. With its high-level division into things and stuff, COCO-Stuff [10] extends MSCOCO [53] and tests the algorithm in complex scenes with multiple objects. The Potsdam and Vaihingen [30] datasets, focused on aerial scene parsing, are designed for remote sensing and urban planning applications. Cityscapes [21] is critical for autonomous driving research, providing detailed annotations of urban street scenes. Mapillary Vistas [60] adds diversity with street scenes from various global environments, testing the algorithm's robustness to different conditions. KITTI-STEP [89] and KITTI-SS [1], similar to Cityscapes, extend the evaluation to dynamic urban scenarios with instance detection and object tracking. For fine-grained part segmentation, Pascal-Part [15], PartImageNet, and PartImageNet-158 [35] offer detailed part annotations, crucial for tasks requiring precise recognition and segmentation of object parts. These datasets ensure a comprehensive and diverse benchmark for evaluating the performance and robustness of our segmentation algorithm across various contexts; see Figure 2.
Further details about the datasets are in Appendix D.1, and more quantitative and qualitative results are in Appendices D.3 and D.4, respectively. To ensure reproducibility, we standardize our experimental setup. Unless otherwise specified, we use the DINOv2-ViT-B14-REG [22] backbone with parameters k_min = 1, p_max = 20, and λ_max = 0.8. We apply the spectral method from Ng et al. [61] with m = 300 for superpixel clustering. The recursive partitioning depth is limited to 10 levels. Depending on each backbone's downsampling factor, input images are resized to extract 60×60 codes, except for urban street scenes, where we obtain 60×120 codes. Further details are in Appendix D.

4.1 Evaluation Metrics

Figure 3: Comparison of segmentation metrics. NFCovering evaluates single-level foreground overlap, NMCovering extends across multiple granular levels for all categories, and NHCovering integrates hierarchical consistency. Coloured arrows indicate category-specific matches.

Granularity-agnostic. Following [44], we aim to evaluate the unbiased overlap of regions between predicted segmentation and ground truth within each image via the Normalized Foreground Covering (NFCovering) metric. However, it is not well suited to the multi-granular segmentation domain. The metric applies to a single granularity level at a time, failing to account for multiple granularity levels. Furthermore, it disregards the background as a valid semantic instance, leading to an incomplete estimate of segmentation performance. We propose a novel evaluation metric, the Normalized Multigranular Covering (NMCovering), which addresses these limitations by evaluating the overlap of regions between the unrolled segments in the region tree T and all the available ground-truth categorical segments in a semantic map S_gt. This metric ensures that both foreground and background instances are considered, providing a more comprehensive and granularity-independent assessment of a semantic segmentation model's performance.
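A minimal sketch of the covering just described, with regions represented as sets of pixel indices and the unrolled tree as a plain list of its node regions:

```python
def nm_covering(pred_regions, gt_regions):
    """Average, over ground-truth segments, of the best IoU achieved by
    any region in the unrolled prediction tree (max-overlap matching)."""
    total = 0.0
    for R in gt_regions:
        best = 0.0
        for Rp in pred_regions:
            union = len(R | Rp)
            if union:
                best = max(best, len(R & Rp) / union)
        total += best
    return total / len(gt_regions)
```

Because the maximum ranges over the whole unrolled tree, each ground-truth segment, foreground or background, can be matched at whichever granularity level fits it best.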
We adopt a greedy heuristic that maximises the overlap of the full hierarchy with the ground-truth segmentation and compute the average overlap ratio of the matching:

    NMCovering(T → S_gt) := (1/|S_gt|) Σ_{R ∈ S_gt} max_{R' ∈ T} |R ∩ R'| / |R ∪ R'|.   (3)

Table 1: Granularity-agnostic. Evaluation of our algorithm on different datasets using a maximum-overlap heuristic for category matching.

| Dataset | mIoU | pAcc | mAcc | fIoU | NMCovering (T → S_gt) |
|---|---|---|---|---|---|
| object-centric | | | | | |
| Pascal VOC2012 | 78.1 | 82.6 | 91.2 | 78.1 | 75.4 |
| MSCOCO | 55.7 | 93.1 | 85.0 | 78.8 | 49.6 |
| scene-centric | | | | | |
| COCO-Stuff | 58.7 | 81.1 | 80.3 | 67.3 | 42.1 |
| Cityscapes | 48.8 | 82.8 | 76.1 | 68.8 | 44.8 |
| KITTI-STEP | 51.2 | 79.8 | 76.5 | 65.7 | 48.4 |
| Mapillary Vistas | 47.6 | 78.9 | 72.1 | 66.1 | 42.7 |
| Potsdam | 58.9 | 83.4 | 83.2 | 65.0 | 56.3 |

Table 2: Hierarchy-agnostic. Evaluation of our algorithm on different datasets using a maximum-overlap heuristic for category matching.

| Dataset | mIoU | pAcc | NMCovering | NHCovering (T → T_gt) |
|---|---|---|---|---|
| whole-centric | | | | |
| COCO-Stuff | 59.5 | 75.1 | 53.5 | 42.9 |
| Cityscapes | 53.7 | 78.8 | 51.1 | 43.8 |
| KITTI-STEP | 58.3 | 79.6 | 54.2 | 46.5 |
| Mapillary Vistas | | | | |
| part-centric | | | | |
| Pascal-Part | 25.8 | 80.0 | 39.5 | 38.8 |
| PartImageNet | 55.4 | 79.5 | 65.8 | 65.2 |
| PartImageNet-158 | 59.5 | 82.6 | 67.8 | 63.1 |

The NMCovering metric evaluates the performance of a hierarchical segmentation model by considering how many ground-truth objects are recognised at any granularity level and how well they are segmented. A high score indicates a high segmentation coherence between human semantic perception and its unsupervised machine counterpart.

Hierarchy-agnostic. We introduce a second accuracy metric, the Normalized Hierarchical Covering (NHCovering). NHCovering jointly evaluates the segmentation quality and the semantic hierarchical inclusion of the prediction with respect to the ground-truth semantic region tree T_gt. Hierarchical inclusion is the matching between the nodes of two distinct hierarchies that preserves the lineage from the ancestors; this problem is commonly referred to in the literature as the unordered tree inclusion problem [45].
To calculate this metric, we use a greedy heuristic that computes the average overlap ratio of matching regions, weighting each match by the ratio of matched ancestors. The operator π(R) returns the ancestor set of the tree node R, and β(R, T) returns the set of nodes in the predicted tree T that best match the ancestors of node R:

    NHCovering(T → T_gt) := (1/|T_gt|) Σ_{R ∈ T_gt} max_{R' ∈ T} ( |R ∩ R'| / |R ∪ R'| ) · ( |β(R, T) ∩ π(R')| / |π(R)| ),   (4)

where

    β(R, T) := ∪_{P ∈ π(R)} argmax_{P' ∈ T} |P ∩ P'| / |P ∪ P'|.   (5)

The NHCovering metric computes the average weighted overlap of regions between the predicted tree T and the ground-truth tree T_gt. The overlap weight measures the proportion of correct ancestorships with respect to the ground truth. Balancing segmentation performance with semantic-ancestry consistency provides a granularity- and hierarchy-independent performance assessment. This score quantifies the coherence of segmentation and hierarchical organization of visual concepts between human perception [66] and its unsupervised machine counterpart. Refer to Figure 3 for a visual insight into the metrics. A more detailed discussion is in Appendix D.2.

4.2 Unsupervised Segmentation

Granularity-agnostic. We adopt the NMCovering metric to benchmark the performance and versatility of our algorithm across different natural image domains. As shown in Tables 1 and 5, our method achieves excellent results in segmenting object-centric images and in foreground discovery. Additionally, Table 1 demonstrates our approach's strong performance on scene-centric datasets, such as remote-sensing images and urban street scenes. Table 3 compares our approach to other supervision strategies.² Our approach achieves segmentation quality comparable to supervised methods and surpasses other supervision strategies by a large gap.

Hierarchy-agnostic.
We further benchmark the hierarchical inclusion quality of the algorithm on available datasets having hierarchical structures at a high level, such as MSCOCO, COCO-Stuff and Cityscapes, and at a low level, such as Pascal Part and Part Image Net. We show in Table 2 2We ran four experiments for each dataset with random seed and assumed normally distributed errors. However, in the segmentation literature, m Io U is typically reported by the single mean value in percentage. Table 3: Semantic segmentation. Comparison on Pascal VOC2012 val. Ours match unsupervised masks to best overlapping classes. m Io U Method Backbone VOC12 MSCOCO fully-supervised Deep Lab-CRF [12] Res Net-101 77.7 - Deep Lab-CRF [12] VGG-16 - 43.6 [10] Deep Lab V3-JFT [13] Res Net-101 82.7 - weakly-supervised Vi T-PCM [71] Vi T-B16 69.3 45.0 L2G [42] Res Net-38 72.0 44.2 Weak Tr [95] Dei T-S 74.0 50.3 un-supervised Melas-Kyriazi et al. [58] Vi T-S16 37.2 - Leopart [96] Vi T-S16 41.7 49.2 HSG [44] Res Net-50 41.9 - Zhang et al. [94] Res Net-50 43.5 - Mask Distill [79] Res Net-50 48.9 - Ours w/o CRF Vi T-S8 76.2 .9 52.1 .6 Ours w CRF Vi T-B14 80.3 1.1 56.5 .9 Table 4: Boundary potential methods. All methods match unsupervised tree segments to best overlapping classes. NMCovering Pascal VOC2012 m Io U p Acc (T Sgt) boundary potential SE-OWT-UCM [24] 48.4 83.0 59.0 PMI-OWT-UCM [40] 47.0 86.5 61.3 semantic smoothness Ours w/o CRF 78.1 86.0 75.4 Ours w CRF 80.3 87.3 76.8 NMCovering COCO-Stuff m Io U (T Tgt) NHCovering boundary potential SE-OWT-UCM [24] 30.7 43.0 32.9 PMI-OWT-UCM [40] 27.5 43.2 23.1 semantic smoothness Ours w/o CRF 58.7 53.5 42.1 Ours w CRF 59.9 55.6 43.9 Table 5: Backbone ablation. Granularity-agnostic segmentation evaluation on Pascal VOC2012 val set using a maximum overlap heuristic for category matching in each image. We report category Io U for each experiment with micro and macro averaged scores and the NMCovering. 
Backbone | bkgd airplane bicycle bird boat bottle bus car cat chair cow d.table dog horse bike person p.plant sheep couch train tv | mIoU pAcc mAcc fIoU NMCovering
ViT-B8 [25] | 63.9 58.5 40.1 60.5 58.0 59.7 74.1 68.6 68.8 49.7 67.5 52.0 65.6 68.6 58.5 60.5 58.1 66.5 62.4 64.2 52.4 | 60.9 69.8 75.1 63.6 60.8
CLIP-ViT-B16 [67] | 74.4 73.0 52.2 82.0 71.2 66.5 76.5 84.0 87.4 66.4 86.3 59.1 83.2 80.3 75.0 76.1 70.0 85.5 79.2 70.5 63.3 | 74.4 79.5 84.0 75.1 74.0
MAE-ViT-B16 [37] | 66.2 81.4 54.6 85.8 73.4 71.4 82.4 80.6 83.8 64.8 85.1 66.8 83.8 81.4 74.6 72.6 66.5 87.3 77.0 76.2 68.7 | 75.4 73.5 85.9 69.1 70.0
MOCOv3-ViT-B16 [16] | 72.2 82.6 57.2 83.0 74.4 69.9 78.7 76.1 81.8 59.0 85.7 66.7 80.3 77.2 72.3 70.6 60.2 86.4 77.6 76.4 61.8 | 73.8 78.1 84.9 73.0 71.1
DINO-ResNet-50 [11] | 67.2 65.7 47.6 70.2 58.8 49.8 66.8 56.6 73.9 46.8 75.6 47.1 70.3 71.6 60.7 55.2 52.6 77.5 59.5 63.7 39.2 | 60.8 73.3 76.0 65.7 61.9
DINO-ViT-S8 [11] | 69.7 83.1 51.7 85.8 75.2 70.2 84.0 82.0 86.7 67.1 85.8 66.3 85.8 80.0 76.5 73.5 66.3 86.4 81.3 75.9 66.9 | 76.2 76.8 85.6 72.0 72.5
DINO-ViT-B8 [11] | 70.6 87.0 57.1 91.0 77.1 74.3 83.7 80.0 88.1 67.5 86.2 65.2 85.5 81.2 78.6 75.0 66.2 88.9 83.5 80.0 67.6 | 77.8 77.4 86.0 73.0 74.0
DINOv2-ViT-B14-R [22] | 76.9 73.4 51.0 82.1 72.4 82.5 85.6 81.1 90.2 71.2 87.1 68.8 87.7 78.3 79.2 82.1 70.8 84.7 82.9 82.9 68.8 | 78.1 82.6 91.2 78.1 75.4

the closeness in performance of NHCovering with respect to NMCovering, demonstrating the ability of the algorithm to capture hierarchical relations among the parts. The lower performance on Pascal-Part is due to the lower granularity of its part annotations compared to PartImageNet; see [35]. Recent unsupervised semantic segmentation approaches, such as [44, 94], often employ mutual-information maximisation between regions at multiple granularity levels.
These methods typically utilise hierarchical clustering that groups low-level coherent regions via boundary potentials, such as the Oriented Watershed Transform-Ultrametric Contour Map (OWT-UCM) [4], with boundaries derived from low-level features like brightness, colour, and texture gradients, as in Structured Edges (SE) [24] or Pointwise Mutual Information (PMI) [40]. We compare our approach with these methods in Table 4. The results indicate that low-level processes are inappropriate for the hierarchical grouping of high-level (semantic) features. In contrast, our approach excels in this area, suggesting significant room for improvement in the current state of the art. See some hierarchical grouping results in Figure 4.

4.3 Ablation Experiments

Backbone. We evaluate in Table 5 the consistency of our approach with respect to the latent space induced by different SSL ImageNet-pretrained backbones on the granularity-agnostic task on the

Figure 4: Unsupervised part-discovery examples on PartImageNet. The left column shows the ground-truth part masks. The second to fourth columns show the predicted regions at each tree depth. Heatmap colours encode the distance between leaves in the subtrees.

Table 6: Superpixel and parameter ablation experiments.

(a) NMCovering on PartImageNet: superpixel method vs. kmin (timing at kmin = 1, m = 100).
Superpixel | kmin = 1 | kmin = 5 | (sec/iter)
colour-space
k-means [54] | 32.7 | 33.4 | .73±.14
SLIC [2] | 58.1 | 44.9 | .05±.01
quick-shift [80] | 60.8 | 58.0 | .21±.12
SSL-latent-space
k-means [54] | 62.9 | 58.1 | .34±.19
Spectral [61] | 63.1 | 59.9 | .61±.11
None | 65.7 | 64.9 | .84±.29

(b) mIoU for m sizes with [61].
Dataset | m = 50 | 100 | 300 | None
object-centric
Pascal VOC2012 | 76.3 | 77.5 | 78.1 | 78.1
MSCOCO | 45.8 | 52.4 | 55.7 | 56.1
scene-centric
COCO-Stuff | 43.6 | 52.2 | 59.5 | 60.3
Cityscapes | 30.3 | 37.6 | 48.8 | 50.4
part-centric
Pascal-Part | 18.2 | 23.1 | 25.8 | 26.4
PartImageNet | 36.9 | 47.2 | 48.3 | 48.8

(c) NHCovering with different perturbation thresholds and smoothness parameters on the COCO-Stuff dataset.
λmax | pmax = .5 | .8 | None | (sec/iter, at pmax = .5)
15 | 36.9 | 41.1 | 41.4 | .49±.09
20 | 39.5 | 42.9 | 43.1 | .65±.16
None | 40.9 | 43.7 | 44.0 | .77±.23

Pascal VOC2012 val set. The performance gap reflects the representation quality assessed by SSL downstream-task benchmarks. This result both identifies the best model and offers a complementary downstream-task benchmark for SSL. We do not adopt superpixel clustering or CRF, but utilise raw patch features as pixel codes.

Parameters kmin, pmax and λmax. In Tables 6a to 6c we validate the optimal parameters of our algorithm. While the choice of kmin affects the granularity at lower levels, the choices of pmax and λmax affect the granularity at higher levels by controlling the stability of the partition.

Overclustering and CRF. We test different overclustering techniques in Table 6a. The results show higher performance for a simultaneous normalised cut on the SSL latent space. When applying a CRF with τ = 0.1, we obtain an increase in segmentation accuracy, as shown in Tables 3 and 4.

5 Discussion

Broader impact. Supervised learning typically relies on highly curated datasets that require expensive and time-consuming manual annotations, especially for complex tasks that demand expert knowledge, such as fine-grained semantic segmentation. As the demand for larger datasets grows and costs rise, there is increasing interest in advancing image understanding with minimal or no supervision. Our approach addresses this challenge by streamlining the annotation process, reducing costs, and thereby significantly expanding the data available for supervised model training through the dynamic discovery of semantic regions.

Limitations. While our method shows promising results, it does have limitations. One major drawback is that both segmentation quality and algorithm execution time scale with the input size. Small object parts are especially difficult to detect, particularly when overclustering is used during preprocessing; the results support this.
However, a trade-off between accuracy and inference time can be determined experimentally. Moreover, our approach is based on self-supervised learning, which, like all data-driven methods, is susceptible to inherent biases in the data. These biases can influence the resulting latent space of the model, potentially impacting the performance and generalization of our approach.

6 Conclusions

We introduce a novel method for the unsupervised hierarchical decomposition of natural scenes into primitive components, without requiring prior knowledge of scene granularity. Leveraging deep feature extraction and graph partitioning, our approach constructs a tree of semantic elements from any scene in any dataset. Our core algorithm applies an innovative algebraic approach to deep spectral clustering, addressing the blurring that arises from pixel similarity across object parts. Matrix perturbation theory is employed at each tree level, ensuring stable, smooth partitions. This framework not only advances unsupervised semantic segmentation but also benchmarks deep neural network representations by evaluating segmentation quality at multiple granularity levels and hierarchical consistency among them. We validate our method with novel metrics, evaluating overlap with ground-truth masks across diverse semantic segmentation datasets and semantic inclusion within hierarchical ones.

References

[1] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 126:961-972, 2018.
[2] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI, 34(11):2274-2282, 2012.
[3] Amit Aflalo, Shai Bagon, Tamar Kashti, and Yonina Eldar. DeepCut: Unsupervised segmentation using graph neural networks clustering. In ICCV, pages 32-41, 2023.
[4] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 33(5):898-916, 2010.
[5] Shahaf Arica, Or Rubin, Sapir Gershov, and Shlomi Laufer. CuVLER: Enhanced unsupervised object discoveries through exhaustive self-supervised transformers. In CVPR, pages 23105-23114, 2024.
[6] Arik Azran and Zoubin Ghahramani. Spectral methods for automatic multiscale data clustering. In CVPR, volume 1, pages 190-197, 2006.
[7] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine Learning, 56:89-113, 2004.
[8] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on manifolds. Machine Learning Journal, 1, 2002.
[9] Rajendra Bhatia, Chandler Davis, and Alan McIntosh. Perturbation of spectral subspaces and solution of linear operator equations. Linear Algebra and its Applications, 52:45-67, 1983.
[10] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, pages 1209-1218, 2018.
[11] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650-9660, 2021.
[12] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI, 40(4):834-848, 2017.
[13] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607. PMLR, 2020.
[15] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille.
Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[16] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9640-9649, 2021.
[17] Jang Hyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan. PiCIE: Unsupervised semantic segmentation using invariance and equivariance in clustering. In CVPR, pages 16794-16804, 2021.
[18] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, pages 539-546, 2005.
[19] Subhabrata Choudhury, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Unsupervised part discovery from contrastive reconstruction. NeurIPS, 34:28104-28118, 2021.
[20] Fan RK Chung. Spectral Graph Theory, volume 92. American Mathematical Soc., 1997.
[21] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, June 2016.
[22] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers, 2023.
[23] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1-46, 1970.
[24] Piotr Dollár and C. Lawrence Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
[25] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2021.
[26] Justin Eldridge, Mikhail Belkin, and Yusu Wang. Unperturbed: Spectral analysis beyond Davis-Kahan. In Algorithmic Learning Theory, pages 321-358. PMLR, 2018.
[27] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. IJCV, 88(2):303-338, 2010.
[28] Jianqing Fan, Weichen Wang, and Yiqiao Zhong. An ℓ∞ eigenvector perturbation bound and its application. Journal of Machine Learning Research, 18(207):1-42, 2018.
[29] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298-305, 1973.
[30] Markus Gerke. Use of the Stair Vision Library within the ISPRS 2D semantic labeling benchmark (Vaihingen), 2015.
[31] Abel Gonzalez-Garcia, Davide Modolo, and Vittorio Ferrari. Do semantic parts emerge in convolutional neural networks? IJCV, 126:476-494, 2018.
[32] Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, and William T Freeman. Unsupervised semantic segmentation by distilling feature correspondences. In ICLR, 2021.
[33] Robert Harb and Patrick Knöbelreiter. InfoSeg: Unsupervised semantic image segmentation with mutual information maximization. In DAGM-GCPR, pages 18-32. Springer, 2021.
[34] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, pages 991-998, 2011. doi: 10.1109/ICCV.2011.6126343.
[35] Ju He, Shuo Yang, Shaokang Yang, Adam Kortylewski, Xiaoding Yuan, Jie-Neng Chen, Shuai Liu, Cheng Yang, and Alan Yuille. PartImageNet: A large, high-quality dataset of parts. arXiv preprint arXiv:2112.00933, 2021.
[36] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729-9738, 2020.
[37] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[38] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.
[39] Wei-Chih Hung, Varun Jampani, Sifei Liu, Pavlo Molchanov, Ming-Hsuan Yang, and Jan Kautz. SCOPS: Self-supervised co-part segmentation. In CVPR, pages 869-878, 2019.
[40] Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Crisp boundary detection using pointwise mutual information. In ECCV, 2014.
[41] Xu Ji, Joao F Henriques, and Andrea Vedaldi. Invariant information clustering for unsupervised image classification and segmentation. In ICCV, pages 9865-9874, 2019.
[42] Peng-Tao Jiang, Yuqi Yang, Qibin Hou, and Yunchao Wei. L2G: A simple local-to-global knowledge transfer framework for weakly supervised semantic segmentation. In CVPR, pages 16886-16896, 2022.
[43] Tosio Kato. Perturbation Theory for Linear Operators, volume 132. Springer Science & Business Media, 2013.
[44] Tsung-Wei Ke, Jyh-Jing Hwang, Yunhui Guo, Xudong Wang, and Stella X Yu. Unsupervised hierarchical semantic segmentation with multiview cosegmentation and clustering transformers. In CVPR, pages 2571-2581, 2022.
[45] Pekka Kilpeläinen and Heikki Mannila. Ordered and unordered tree inclusion. SIAM Journal on Computing, 24(2):340-356, 1995.
[46] Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Causal unsupervised semantic segmentation. arXiv preprint arXiv:2310.07379, 2023.
[47] Wonjik Kim, Asako Kanezaki, and Masayuki Tanaka. Unsupervised learning of image segmentation based on differentiable feature clustering. IEEE TIP, 29:8055-8068, 2020.
[48] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, pages 4015-4026, 2023.
[49] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. NeurIPS, 24, 2011.
[50] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84-90, 2017.
[51] Mengcheng Lan, Xinjiang Wang, Yiping Ke, Jiaxing Xu, Litong Feng, and Wayne Zhang. SmooSeg: Smoothness prior for unsupervised semantic segmentation. NeurIPS, 36, 2024.
[52] Kehan Li, Zhennan Wang, Zesen Cheng, Runyi Yu, Yian Zhao, Guoli Song, Chang Liu, Li Yuan, and Jie Chen. ACSeg: Adaptive conceptualization for unsupervised semantic segmentation. In CVPR, pages 7162-7172, 2023.
[53] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755, 2014.
[54] Stuart Lloyd. Least squares quantization in PCM. IEEE TIT, 28(2):129-137, 1982.
[55] Nikos K Logothetis and David L Sheinberg. Visual object recognition. Annual Review of Neuroscience, 19(1):577-621, 1996.
[56] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431-3440, 2015.
[57] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In CVPR, pages 7086-7096, June 2022.
[58] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In CVPR, pages 8364-8375, 2022.
[59] Russell Merris. Laplacian matrices of graphs: a survey. Linear Algebra and its Applications, 197:143-176, 1994.
[60] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, pages 4990-4999, 2017.
[61] Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. NeurIPS, 14, 2001.
[62] Dantong Niu, Xudong Wang, Xinyang Han, Long Lian, Roei Herzig, and Trevor Darrell. Unsupervised universal image segmentation. In CVPR, pages 22744-22754, 2024.
[63] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.
arXiv preprint arXiv:1807.03748, 2018.
[64] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[65] Sean O'Rourke, Van Vu, and Ke Wang. Random perturbation of low rank matrices: Improving classical bounds. Linear Algebra and its Applications, 540:26-59, 2018.
[66] Stephen E Palmer. Hierarchical structure in perceptual representation. Cognitive Psychology, 9(4):441-474, 1977.
[67] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748-8763, 2021.
[68] Sriram Ravindran and Debraj Basu. SEMPART: Self-supervised multi-resolution partitioning of image semantics. In ICCV, pages 723-733, 2023.
[69] Ren and Malik. Learning a classification model for segmentation. In ICCV, pages 10-17. IEEE, 2003.
[70] Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878-1915, 2011.
[71] Simone Rossetti, Damiano Zappia, Marta Sanzari, Marco Schaerf, and Fiora Pirri. Max pooling with vision transformers reconciles class and shape in weakly supervised semantic segmentation. In ECCV, pages 801-818, 2022.
[72] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618-626, 2017.
[73] Hyun Seok Seong, Won Jun Moon, Su Been Lee, and Jae-Pil Heo. Leveraging hidden positives for unsupervised semantic segmentation. In CVPR, pages 19540-19549, 2023.
[74] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE TPAMI, 22(8):888-905, 2000.
[75] Gilbert W Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.
[76] Anne M Treisman and Nancy G Kanwisher. Perceiving visually presented objects: recognition, awareness, and modularity. Current Opinion in Neurobiology, 8(2):218-226, 1998.
[77] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Unsupervised semantic segmentation by contrasting object mask proposals. In CVPR, pages 10052-10062, 2021.
[78] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Revisiting contrastive methods for unsupervised learning of visual representations. In NeurIPS, 2021.
[79] Wouter Van Gansbeke, Simon Vandenhende, and Luc Van Gool. Discovering object masks with transformers for unsupervised semantic segmentation. arXiv preprint arXiv:2206.06363, 2022.
[80] Andrea Vedaldi and Stefano Soatto. Quick shift and kernel methods for mode seeking. In ECCV, pages 705-718. Springer, 2008.
[81] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395-416, 2007.
[82] Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555-586, 2008.
[83] Johan Wagemans, Jacob Feldman, Sergei Gepshtein, Ruth Kimchi, James R Pomerantz, Peter A Van der Helm, and Cees Van Leeuwen. A century of Gestalt psychology in visual perception: II. Conceptual and theoretical foundations. Psychological Bulletin, 138(6):1218, 2012.
[84] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, 2021.
[85] Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandkumar, Chunhua Shen, and Jose M Alvarez. FreeSOLO: Learning to segment objects without annotations. In CVPR, pages 14176-14186, 2022.
[86] Xudong Wang, Ziwei Liu, and Stella X Yu. Unsupervised feature learning by cross-level instance-group discrimination. In CVPR, pages 12586-12595, 2021.
[87] Xudong Wang, Rohit Girdhar, Stella X Yu, and Ishan Misra. Cut and learn for unsupervised object detection and instance segmentation. In CVPR, pages 3124-3134, 2023.
[88] Yangtao Wang, Xi Shen, Shell Xu Hu, Yuan Yuan, James L Crowley, and Dominique Vaufreydaz. Self-supervised transformers for unsupervised object discovery using normalized cut. In CVPR, pages 14543-14553, 2022.
[89] Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender, Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers, Aljosa Osep, Laura Leal-Taixe, and Liang-Chieh Chen. STEP: Segmenting and tracking every pixel. In NeurIPS, 2021.
[90] Zhaoyuan Yin, Pichao Wang, Fan Wang, Xianzhe Xu, Hanling Zhang, Hao Li, and Rong Jin. TransFGU: A top-down approach to fine-grained unsupervised semantic segmentation. In ECCV, pages 73-89. Springer, 2022.
[91] Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315-323, 2015.
[92] Yang Yuan. On the power of foundation models. In ICML, pages 40519-40530. PMLR, 2023.
[93] Andrii Zadaianchuk, Matthaeus Kleindessner, Yi Zhu, Francesco Locatello, and Thomas Brox. Unsupervised semantic segmentation with self-supervised object-centric representations. In ICLR, 2023.
[94] Xiao Zhang and Michael Maire. Self-supervised visual representation learning from hierarchical grouping. NeurIPS, 33:16579-16590, 2020.
[95] Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, and Xinggang Wang. WeakTr: Exploring plain vision transformer for weakly-supervised semantic segmentation, 2023.
[96] Adrian Ziegler and Yuki M Asano. Self-supervised learning of object parts for semantic segmentation. In CVPR, pages 14502-14511, 2022.
[97] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. NeurIPS, 36, 2024.
A Theoretical Analysis and Perturbation Foundations

Figure 5: An example of ideal and perturbed adjacency matrices. The left shows an input image with highlighted parts and a colour legend. The central matrix represents the ideal adjacency matrix W⋆, corresponding to the Laplacian L⋆, with non-zero diagonal blocks for k disconnected components at a specific semantic granularity. Below, a disconnected graph illustrates these isolated parts. On the right, the perturbed adjacency matrix W introduces off-diagonal entries due to pixel similarity across regions, resulting in the perturbed Laplacian L. Below, a graph with added connections shows these perturbations, with colours matching the highlighted parts in the input image.

From the example in Figure 5, we observe that the perturbation arises from pixel-similarity values that inadvertently create connections between regions that should ideally remain separate. This unintended overlap occurs because similarity values between certain pixel pairs do not perfectly align with the true structure of the components. Perturbation theory [23, 9, 75, 43] studies how small changes, such as these unintended connections, affect a matrix's eigenspaces. Recent research has extended this analysis from Hermitian matrices to Laplacian matrices, which capture graph structures, as in [82, 26, 70, 28, 65, 91]. In our method, the two matrices are an ideal Laplacian matrix L⋆ and the observed Laplacian matrix L, both symmetric (not necessarily of the same rank), related by a symmetric perturbation H induced by higher-order pixel similarity, L = L⋆ + H. Given the recursive partitioning described in Section 3, we look, for each subgraph, for the best matches between the eigenvectors of L and those of L⋆ minimising the functional defined in Equation (1). In the ideal normalized Laplacian L⋆, the graph is disconnected, and the number of connected components is reflected in the multiplicity of the zero eigenvalue of L⋆.
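The multiplicity argument can be checked numerically. The following is a minimal numpy sketch (our own toy example, not the paper's pipeline): an ideal block-diagonal adjacency has as many zero eigenvalues of the normalized Laplacian as connected components, while a weak spurious link collapses them to a single zero plus a small Fiedler value:

```python
import numpy as np

def normalized_laplacian(W):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

# Ideal adjacency W*: two disconnected cliques (block-diagonal).
W_ideal = np.block([[np.ones((3, 3)), np.zeros((3, 2))],
                    [np.zeros((2, 3)), np.ones((2, 2))]])
ev_ideal = np.linalg.eigvalsh(normalized_laplacian(W_ideal))

# Perturbation: one weak spurious link between the two components.
W_pert = W_ideal.copy()
W_pert[0, 4] = W_pert[4, 0] = 0.05
ev_pert = np.linalg.eigvalsh(normalized_laplacian(W_pert))
# ev_ideal has two (near-)zero eigenvalues; ev_pert has one zero
# plus a small positive Fiedler value reflecting the weak link.
```

The near-zero second eigenvalue of the perturbed graph is exactly the unreliable regime discussed next, where small eigenvalues signal loosely connected regions rather than genuine components.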
Each connected component contributes one zero eigenvalue, and the eigenvectors corresponding to these zero eigenvalues, constant on each component, represent the isolated components [20, 81]. In a perturbed normalized Laplacian L, the graph is connected, and the minimum of Equation (2) is then achieved by the eigenvector corresponding to the second smallest eigenvalue of L. This eigenvector, known as the Fiedler vector, is not constant; it captures the connectivity structure of the graph and is often used for finding the best bipartition. However, in the case of a perturbed graph with weak connections (e.g., low weights or nodes with only a single neighbour), small eigenvalues close to zero may not indicate strongly connected subgraphs. Instead, these small eigenvalues reflect loosely connected regions, where perturbations create weak links between components that should ideally remain separate. This implies that using the smallest eigenvectors directly for graph partitioning is sometimes unreliable, as they may reflect unstable or artificial connections introduced by perturbations. To ensure meaningful segmentation, it is essential to establish a measure that quantifies the stability of the partition under perturbations. Such a measure would indicate when it is safe to apply spectral clustering, ensuring that the identified clusters are robust and well separated. According to Theorem A.1 (shown below), to achieve this we seek the set of eigenvectors of the perturbed Laplacian L that are closest to the eigenvectors of the ideal normalized Laplacian L⋆ despite the perturbation H:

Theorem A.1 ([91]). Let $A, A' \in \mathbb{R}^{n \times n}$ be symmetric, with eigenvalues $\mu_1 \geq \dots \geq \mu_n$ and $\mu'_1 \geq \dots \geq \mu'_n$ respectively. Fix $1 \leq r \leq s \leq n$ and assume that $\min(\mu_{r-1} - \mu_r, \mu_s - \mu_{s+1}) > 0$, where $\mu_0 := \infty$ and $\mu_{n+1} := -\infty$. Let $u := s - r + 1$, and let $Z = (z_r, z_{r+1}, \dots, z_s) \in \mathbb{R}^{n \times u}$ and $Z' = (z'_r, z'_{r+1}, \dots, z'_s) \in \mathbb{R}^{n \times u}$ have orthonormal columns satisfying $A z_j = \mu_j z_j$ and $A' z'_j = \mu'_j z'_j$, for $j = r, r+1, \dots, s$.
Then:
$$\|\sin\Theta(Z', Z)\|_F \leq \frac{2 \min\!\big(u^{1/2} \|A' - A\|_{op},\, \|A' - A\|_F\big)}{\min(\mu_{r-1} - \mu_r,\, \mu_s - \mu_{s+1})}. \quad (6)$$
Moreover, there exists an orthogonal matrix $\hat O \in \mathbb{R}^{u \times u}$ such that:
$$\|Z' \hat O - Z\|_F \leq \frac{2^{3/2} \min\!\big(u^{1/2} \|A' - A\|_{op},\, \|A' - A\|_F\big)}{\min(\mu_{r-1} - \mu_r,\, \mu_s - \mu_{s+1})}, \quad (7)$$
where $\|\cdot\|_{op}$ and $\|\cdot\|_F$ denote the spectral and the Frobenius norm, respectively.

Corollary A.1 ([91]). Let $A, A' \in \mathbb{R}^{n \times n}$ be symmetric, with eigenvalues $\mu_1 \geq \dots \geq \mu_n$ and $\mu'_1 \geq \dots \geq \mu'_n$ respectively. Fix $j \in \{1, \dots, n\}$ and assume that $\min(\mu_{j-1} - \mu_j, \mu_j - \mu_{j+1}) > 0$, where $\mu_0 := \infty$ and $\mu_{n+1} := -\infty$. If $z, z' \in \mathbb{R}^n$ satisfy $A z = \mu_j z$ and $A' z' = \mu'_j z'$, then:
$$\sin\Theta(z', z) \leq \frac{2 \|A' - A\|_{op}}{\min(\mu_{j-1} - \mu_j,\, \mu_j - \mu_{j+1})}. \quad (8)$$
Moreover, if $z'^\top z \geq 0$, then:
$$\|z' - z\| \leq \frac{2^{3/2} \|A' - A\|_{op}}{\min(\mu_{j-1} - \mu_j,\, \mu_j - \mu_{j+1})}. \quad (9)$$

In our method, the symmetric matrix A above refers to the perturbed matrix L, and the matrix A' to the ideal matrix L⋆; see Section 3. Since we select the k smallest eigenvalues, the interval starts at r = n − k + 1 and ends at s = n, and we have u = k. The denominator in Equation (6) simplifies to min(µ_{n−k} − µ_{n−k+1}, µ_n − µ_{n+1}) = µ_{n−k} − µ_{n−k+1}, since by definition µ_{n+1} := −∞. Given $Z = (z_{n-k+1}, z_{n-k+2}, \dots, z_n) \in \mathbb{R}^{n \times k}$ and $Z' = (z'_{n-k+1}, z'_{n-k+2}, \dots, z'_n) \in \mathbb{R}^{n \times k}$, depending on the chosen eigenvalue ordering, we have:
$$\|\sin\Theta(Z', Z)\|_F \leq \frac{2 \min\!\big(k^{1/2} \|L^\star - L\|_{op},\, \|L^\star - L\|_F\big)}{\mu_{n-k} - \mu_{n-k+1}} \quad \text{for } \mu_1 \geq \dots \geq \mu_n, \quad (10)$$
or, re-indexing the eigenvalues in non-decreasing order and counting from zero, we set $\lambda_i = \mu_{n-i}$ and $\lambda'_i = \mu'_{n-i}$ for $i = 0, \dots, n-1$, with $Y = (y_0, y_1, \dots, y_{k-1}) \in \mathbb{R}^{n \times k}$ and $Y' = (y'_0, y'_1, \dots, y'_{k-1}) \in \mathbb{R}^{n \times k}$ the orthonormal columns satisfying $L y_j = \lambda_j y_j$ and $L^\star y'_j = \lambda'_j y'_j$ for $j = 0, 1, \dots, k-1$, so that:
$$\|\sin\Theta(Y', Y)\|_F \leq \frac{2 \min\!\big(k^{1/2} \|L^\star - L\|_{op},\, \|L^\star - L\|_F\big)}{\lambda_k - \lambda_{k-1}} \quad \text{for } \lambda_0 \leq \dots \leq \lambda_{n-1}. \quad (11)$$
Indeed, we can apply the substitution and the re-indexing, given that the eigenpair $(\lambda, y)$ condition holds for all the eigenvalues and eigenvectors populating the chosen interval (r, s).
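The bound of Equation (11) can be verified numerically on a toy graph. In this sketch (our own construction; `normalized_laplacian` and `sin_theta_fro` are assumed helper names), we take the ideal two-component Laplacian, add a weak link, and check that the principal-angle distance between the bottom-k eigenspaces stays below the right-hand side, with the eigengap taken on the ideal spectrum:

```python
import numpy as np

def normalized_laplacian(W):
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def sin_theta_fro(Y1, Y2):
    """||sin Theta(Y1, Y2)||_F for orthonormal-column bases:
    follows from ||sin Theta||_F^2 = k - ||Y1^T Y2||_F^2."""
    k = Y1.shape[1]
    return np.sqrt(max(k - np.linalg.norm(Y1.T @ Y2, "fro") ** 2, 0.0))

# Ideal L* (two cliques) and perturbed L = L* + H (one weak cross link).
W_star = np.block([[np.ones((3, 3)), np.zeros((3, 2))],
                   [np.zeros((2, 3)), np.ones((2, 2))]])
W = W_star.copy()
W[0, 4] = W[4, 0] = 0.05
L_star, L = normalized_laplacian(W_star), normalized_laplacian(W)
H = L - L_star

k = 2  # bottom-k eigenspaces (one per ideal component)
lam_star, Y_star = np.linalg.eigh(L_star)   # ascending eigenvalues
_, Y = np.linalg.eigh(L)
Y_star, Y = Y_star[:, :k], Y[:, :k]

# Right-hand side of Eq. (11), eigengap on the ideal spectrum.
bound = 2 * min(np.sqrt(k) * np.linalg.norm(H, 2), np.linalg.norm(H, "fro"))
bound /= lam_star[k] - lam_star[k - 1]
```

A small `sin_theta_fro(Y_star, Y)` relative to `bound` is precisely the stability signal exploited by the perturbation test in Algorithm 1.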
Furthermore, given the eigenpair condition, the minimization of Equation (1) in Section 3 is guaranteed (see Algorithm 1), and we can ratify the smoothness by resorting to the Courant-Fischer theorem; see [38], Theorem 4.2.6. Finally, we can use Corollary A.1 for a more refined choice of the eigenvectors. Note that the theorem of Yu et al. [91] is particularly useful because, differently from the Davis-Kahan theorem [23], it bounds the angles between a subset of the eigenvectors of two symmetric matrices, or their distance (up to a rotation), in terms of the eigenvalues of only one of the two matrices. Indeed, this theorem shows that z (an eigenvector of the perturbed Laplacian L) is close to z' (an eigenvector of the ideal Laplacian L⋆) under two main assumptions. First, we assume that L is close to L⋆; this is often straightforward in graph theory, for instance, when L⋆ is derived from a theoretical (or "population") graph structure and L is the Laplacian of a graph constructed from a sample or noisy measurements of this structure. Second, applying Weyl's inequality, we assume that, almost certainly, $|\mu'_{j-1} - \mu_j| \geq (\mu_{j-1} - \mu_j)/2$ and $|\mu'_{j+1} - \mu_j| \geq (\mu_j - \mu_{j+1})/2$, where $\mu_j$ and $\mu'_j$ are the j-th eigenvalues of L and L⋆, respectively. Under this eigengap condition, assuming a sufficient separation between the population eigenvalues of L⋆ and their neighbouring values, as shown in the results of Yu et al. [91], we can conclude that $\|z' - z\|$ is small, meaning z and z' are close in norm. Building on this, the theorem implies that when the eigengap condition holds, the eigenvector z associated with L remains stable under perturbations represented by H = L − L⋆. Specifically, the proximity of z and z' allows us to interpret z as a meaningful approximation of z', preserving the structure of the ideal graph encoded by L⋆.
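Weyl's inequality, invoked above for the eigengap condition, states that each eigenvalue of a symmetric matrix moves by at most the spectral norm of the perturbation, $|\mu'_j - \mu_j| \leq \|H\|_{op}$. A minimal numerical illustration on random symmetric matrices (our own toy example, not tied to the paper's graphs):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                      # "ideal" symmetric matrix
H = 0.05 * rng.standard_normal((6, 6))
H = (H + H.T) / 2                      # small symmetric perturbation

# Weyl: sorted eigenvalues of A + H stay within ||H||_op of those of A.
shift = np.abs(np.linalg.eigvalsh(A + H) - np.linalg.eigvalsh(A))
```

When $\|H\|_{op}$ is below half the population eigengap, the perturbed neighbouring eigenvalues cannot cross into the gap, which is exactly the half-gap condition used above.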
Consequently, this alignment of eigenvectors facilitates robust graph-based clustering or segmentation, as it enables us to consistently identify clusters or partitions in perturbed graphs that mirror the structure of the unperturbed graph. Furthermore, this result provides a foundation for using the perturbed eigenvectors for hierarchical clustering, ensuring that the segments or clusters derived from L approximate those of L⋆ even under small changes or noise in the graph data.

B The Algorithm and BFS Implementation

Figure 6: The outputs of the algorithm's two steps. First, we quantize the graph to create an initial over-clustering of semantic parts. Next, we recursively group these parts, forming multi-level semantic clusters from coarse to fine granularity. The heatmap colour-codes the distance between tree leaves.

In Section 3 of the main paper, we presented the recursive partitioning of the graph G, which is simple and intuitive. As noted, we present a Breadth-First Search (BFS) pseudocode in Algorithm 1. The method obtains graph partitions when two or more components are detectable and recurs on each partition to find new subgraphs until an early-stopping condition is met; see Section 3. For completeness, we note that the proposed algorithm is defined in two steps; the first step is meant to speed up the overall algorithm and is hence optional. In the first step, we quantise the graph, creating an over-clustering of semantic parts; in the second step, we recur on the parts, generating coarse to finer semantic groups at multiple granularity levels; see Figure 6. In the first step, we write the patch-wise features extracted with a deep neural network as the set of nodes V of a weighted undirected graph G = (V, E, w), and we define w as the cosine similarity of points, scaled and shifted into [0, 1]. In this step, we follow the approach of Ng et al. [61] to cluster the graph into m components. The clusters define m disjoint segments $A_1, A_2, \dots$
, Am (Sm i=1 Ai = V , Tm i=1 Ai = ) of G with high intra-cluster degree of semantic similarity. In the second step, we adopt a top-down recursive divisive clustering approach, building a hierarchical semantic decomposition of the image. In the recursive call at a certain level l and for a certain subgraph c notice that here c is an index of a graph p we define an undirected condensed graph Gl,p c = (V l,p c , El,p c , w), with segments A as nodes in V l,p c (S A V l,p c A V , T A V l,p c A = ) and the edges are weighted by the associativity degree of the components w(Ai, Aj) = P u Ai,v Aj w(u, v). Notice that the condensed graph at the root level is indicated as G0,0 0 = (V 0,0 0 , E0,0 0 , w) and V 0,0 0 = {Ai}m i=1. At each depth, we obtain a semantic grouping Sl = {V l+1,c i }c,i given by all the i-th subgraphs at level l + 1 of each subgraph c at level l. The recursion defines the final hierarchy, namely the output tree T. The semantic grouping algorithm discussed here builds upon conventional spectral clustering methods [74, 61, 6, 81]. 
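The two building blocks of this pipeline, the Ng et al.-style over-clustering and the condensed-graph construction, can be sketched as follows. This is our own illustrative implementation under stated assumptions (dense graphs, scipy's k-means in place of any specific K-means variant; all names are ours, not the paper's):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse.linalg import eigsh

def spectral_overclustering(feats, m, seed=0):
    """First step (sketch): over-cluster n patch features into m parts
    via spectral clustering on the normalized Laplacian."""
    F = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    W = (F @ F.T + 1.0) / 2.0            # cosine similarity shifted into [0, 1]
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - Dinv @ W @ Dinv  # symmetrically normalized Laplacian
    _, Y = eigsh(L, k=m, which="SA")      # m smallest eigenvectors
    X = Y / np.linalg.norm(Y, axis=1, keepdims=True)  # row-normalized embedding
    _, labels = kmeans2(X, m, minit="++", seed=seed)
    return labels, W

def condense(W, labels):
    """Condensed-graph weights: w_bar(A_i, A_j) = sum of w(u, v)
    over u in A_i, v in A_j, computed for all segment pairs at once."""
    m = labels.max() + 1
    A = np.eye(m)[labels]                 # one-hot segment assignment (n x m)
    return A.T @ W @ A                    # (m x m) condensed adjacency
```

The one-hot aggregation `A.T @ W @ A` is simply a vectorized way of summing edge weights between every pair of segments, yielding the adjacency of the root condensed graph $G^{0,0}_0$.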
Algorithm 1: Image Parsing via Granularity- and Hierarchy-agnostic Semantic Regions Tree

Data: $V$: set of points; $w$: points similarity function; $m$: desired number of superpixels; $\bar{w}$: components similarity function; $k_{max}$: max number of subgraph components; $k_{min}$: min number of subgraph points; $\lambda_{max}$: max normalized smoothness threshold; $p_{max}$: max perturbation threshold
Result: $\ell$-depth hierarchy of clusterings $\{S^l\}_{l=1}^{\ell}$

    $V^{0,0}_0 = \{A_i\}_{i=1}^m$ = spectral_overclustering($V$)
    $\{S^l\}_{l=1}^{\ell}$ = bfs_partitioning($\{V^{0,0}_0\}$)

    Function spectral_overclustering($V$):
        Compute $L$ for the graph $G = (V, E, w)$
        $Y \leftarrow \arg\min_{Y \in \mathbb{R}^{|V| \times m},\, Y^\top Y = I} \mathrm{Tr}(Y^\top L Y)$
        Normalize rows: $X_{ij} = Y_{ij} / (\sum_j Y_{ij}^2)^{1/2}$
        Group the rows of $X$ into $m$ clusters with K-means
        return the $m$ groups $\{A_i\}_{i=1}^m$ of $V$

    Function bfs_partitioning($S^l$):
        $S^{l+1} \leftarrow \{\}$
        foreach segment $V^{l,p}_c \in S^l$ do
            $n \leftarrow |V^{l,p}_c|$
            if $n < k_{min}$ then continue
            Get the condensed graph $G \leftarrow G^{l,p}_c = (V^{l,p}_c, E^{l,p}_c, \bar{w})$
            Scale $W$ with min-max normalization and compute $L$
            $Y \leftarrow \arg\min_{Y \in \mathbb{R}^{n \times k_{max}},\, Y^\top Y = I} \mathrm{Tr}(Y^\top L Y)$
            $\lambda \leftarrow \mathrm{diag}(Y^\top L Y)$
            Reorder the columns of $Y$ and $\lambda$ in ascending order of $\lambda_i$
            $k \leftarrow \arg\max_i \{\lambda_i - \lambda_{i-1}\}_{i=2}^{k_{max}-1}$, subject to $\lambda_{i-1} < \lambda_{max}$
            if $k = \emptyset$ then continue
            Take the first $k$ eigenvectors: $Y \leftarrow Y_{[:,1:k]}$
            Normalize rows: $X_{ij} = Y_{ij} / (\sum_j Y_{ij}^2)^{1/2}$
            Group the rows of $X$ into $k$ clusters with K-means and get $L'$
            Compute the perturbation $H = L - L'$
            if $2\min(k^{1/2}\|H\|_{op}, \|H\|_F)/(\lambda_k - \lambda_{k-1}) > k \cdot p_{max}$ then continue
            Found $k$ stable subgraphs: estimate $S^{l+1}_c = \{V^{l+1,c}_i\}_{i=1}^k$ of $V^{l,p}_c$
            $S^{l+1} \leftarrow S^{l+1} \cup S^{l+1}_c$
        end
        return $\{S^{l+1}\}\ \cup$ bfs_partitioning($S^{l+1}$)

Discussion of the algorithm. The algorithm's design integrates several properties that significantly shape the hierarchical structure and connectivity of the semantic components. One crucial feature is the dynamic estimation of the number of components, $k$, a parameter closely aligned with insights from perturbation theory. This estimate directly influences segmentation granularity, determining the expected number of distinct semantic regions and hierarchy levels in the final output.
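The two data-dependent decisions inside the loop, choosing $k$ by the largest admissible eigengap and accepting the split only when the perturbation bound holds, can be sketched in isolation. The helpers below are our own illustration of those two tests (names and the 1-based index handling are ours):

```python
import numpy as np

def estimate_k(lams, k_max, lam_max):
    """Eigengap selection (sketch): among 1-based candidates
    i = 2, ..., k_max - 1, pick the i maximizing lambda_i - lambda_{i-1},
    subject to the smoothness threshold lambda_{i-1} < lam_max.
    Returns None when no admissible candidate exists."""
    best_k, best_gap = None, -np.inf
    for i in range(2, min(k_max, len(lams) + 1)):  # 1-based candidate index
        gap = lams[i - 1] - lams[i - 2]            # lambda_i - lambda_{i-1}
        if lams[i - 2] < lam_max and gap > best_gap:
            best_k, best_gap = i, gap
    return best_k

def is_stable(H, lam_k, lam_km1, k, p_max):
    """Perturbation test (sketch): accept the k-way split only if
    2 * min(sqrt(k)*||H||_op, ||H||_F) / (lambda_k - lambda_{k-1})
    does not exceed k * p_max."""
    num = 2.0 * min(np.sqrt(k) * np.linalg.norm(H, 2),
                    np.linalg.norm(H, "fro"))
    return num / (lam_k - lam_km1) <= k * p_max
```

A large gap after a small smooth eigenvalue signals well-separated components, while the perturbation test rejects splits whose induced ideal Laplacian $L'$ drifts too far from the observed one, which is exactly the early-stopping behaviour of the recursion.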
The algorithm produces a hierarchy-agnostic unsupervised semantic region tree $T$. At each recursion stage, it seeks subgraphs that represent loosely connected regions based on robust algebraic criteria from graph theory, capturing coherent semantic regions at each level of granularity. By isolating each region from the context of previous layers, the algorithm accurately reflects the semantic hierarchy within the image's content. In this framework, the tree's leaves represent regions with high intra-cluster semantic similarity, positioning these clusters as primitives of broader concepts at higher levels. The algorithm naturally embeds semantic inclusion, where finer semantic regions exist within larger semantic contexts, mirroring how complex concepts encompass finer details. Lastly, computing the adjacency matrix only once optimizes efficiency, reducing redundant operations and improving runtime performance, making the approach both scalable and effective for large datasets. Time Complexity. The breadth-first search has a time complexity of $O(|V| + |E|)$ on a general graph. In our case, with $n$ nodes and $n - 1$ edges (assuming a tree graph), BFS has complexity $O(n + (n - 1)) = O(2n - 1) = O(n)$. The eigendecomposition of a symmetric matrix in the worst case has a complexity of $O(n^3)$. Therefore, the combined time complexity is $O(n + n^3) = O(n^3)$. However, because we only compute the smallest $k_{max} \ll n$ eigenvectors, the time complexity of the eigendecomposition reduces to $O(k_{max} n^2)$. Thus, the overall complexity becomes $O(n + k_{max} n^2) = O(k_{max} n^2)$, since $k_{max} n^2$ dominates $n$. The recursive partitioning process starts with an eigendecomposition on the full graph, which has a time complexity of $O(k_{max} n^2)$ for extracting the smallest $k_{max}$ eigenvectors. This initial computation is the primary contributor to the algorithm's time complexity. As the partitioning proceeds, each recursive call operates on progressively smaller subgraphs.
While each subgraph requires an eigendecomposition, the cost decreases significantly as the graph size is reduced at each recursion level. For balanced partitioning, the cumulative cost of these recursive steps remains asymptotically bounded by the initial computation on the full graph. Thus, the overall complexity is effectively $O(k_{max} n^2)$, ensuring computational feasibility even for large graphs.

C Graph Partitioning with Normalised Cut

We recall here the Normalised Cut (NCut) introduced by Shi and Malik [74] for measuring the goodness of a graph partition. Let $G = (V, E, w)$ be a weighted undirected graph having $n$ nodes, $V = \{v_i\}_{i=1}^n$, representing points $v_i \in \mathbb{R}^d$. The weight on each edge of the graph, $w_{ij} = w(v_i, v_j)$, is a function of the similarity between nodes $v_i$ and $v_j$ and defines an element of the adjacency matrix $W = [w_{ij}] \in [0, 1]^{n \times n}$. The symmetrically normalised Laplacian of the graph [59] is defined as $L = D^{-1/2}(D - W)D^{-1/2}$, with $D = \mathrm{diag}[d_i] \in \mathbb{R}^{n \times n}$ the diagonal degree matrix and $d_i = \sum_j w_{ij}$. The normalised cut objective aims to partition the set $V$ into two disjoint sets $A \subset V$ and $B \subset V$ ($A \cup B = V$, $A \cap B = \emptyset$) while minimising the degree of similarity between the two sets and maximising the one within each set, and it is defined as:

$$\mathrm{NCut}(A, B) := \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)} \tag{12}$$

where $\mathrm{cut}(A, B) = \sum_{u \in A, v \in B} w(u, v)$ measures the degree of similarity between $A$ and $B$ and equals the total weight of the edges that the partitioning has removed, $\mathrm{assoc}(A, V) = \sum_{u \in A, t \in V} w(u, t)$ measures the degree of similarity between $A$ and $V$, and $\mathrm{assoc}(B, V)$ is defined equivalently. The NCut is an unbiased measure of the normalised total similarity of the two sets of points. Indeed, the normalisation avoids an unnatural bias towards partitioning out small sets of points. Minimising Equation (12) exactly is NP-complete, according to the proof due to Papadimitriou [74].
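Before turning to the relaxation, the discrete objective in Equation (12) can be evaluated directly for any binary partition. A minimal sketch (our own helper; the toy adjacency matrix is illustrative):

```python
import numpy as np

def ncut(W, mask):
    """Normalised Cut of Equation (12) for a binary partition (A, B),
    where mask[i] = True iff node i belongs to A."""
    A, B = mask, ~mask
    cut = W[np.ix_(A, B)].sum()      # total weight of removed edges
    assoc_A = W[A, :].sum()          # assoc(A, V)
    assoc_B = W[B, :].sum()          # assoc(B, V)
    return cut / assoc_A + cut / assoc_B

# Two dense blocks joined by weak edges: the natural split has a low NCut,
# while a split cutting through a block scores much higher.
W = np.array([[0.0, 1.0, 1.0, 0.1, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.1, 1.0, 1.0, 0.0]])
good = ncut(W, np.array([1, 1, 1, 0, 0, 0], dtype=bool))
bad = ncut(W, np.array([1, 0, 0, 0, 1, 1], dtype=bool))
assert good < bad
```

The normalisation by $\mathrm{assoc}(\cdot, V)$ is what penalises carving off tiny sets: a cut isolating a single weakly connected node would have a small raw cut value but a small denominator too, keeping its NCut high.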
However, [74] shows that a tractable real-valued solution to the relaxed problem in Equation (12) can be obtained by solving the generalised eigenvalue system $(D - W)x = \lambda D x$, for $x \in \mathbb{R}^n$ and $\lambda \in \mathbb{R}$. The eigenvectors $x_i$ span an orthogonal basis for functions on $G$ [81]: x_i = arg min x ∈ ℝⁿ, ‖x‖ = 1, x ⊥ x