# Deep Subspace Clustering with Data Augmentation

Mahdi Abavisani (Rutgers University, New Brunswick, NJ; mahdi.abavisani@rutgers.edu)
Alireza Naghizadeh (Rutgers University, New Brunswick, NJ; ar.naghizadeh@rutgers.edu)
Dimitris N. Metaxas (Rutgers University, New Brunswick, NJ; dnm@cs.rutgers.edu)
Vishal M. Patel (Johns Hopkins University, Baltimore, MD; vpatel36@jhu.edu)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

The idea behind data augmentation techniques is based on the fact that slight changes in the percept do not change the brain's cognition. In classification, neural networks exploit this fact by applying transformations to the inputs and learning to predict the same label. In deep subspace clustering (DSC), however, ground-truth labels are not available, and as a result one cannot easily use data augmentation techniques. We propose a technique to exploit the benefits of data augmentation in DSC algorithms. We learn representations that have consistent subspaces for slightly transformed inputs. In particular, we introduce a temporal ensembling component to the objective function of DSC algorithms that enables DSC networks to maintain consistent subspaces under random transformations of the input data. In addition, we provide a simple yet effective unsupervised procedure for finding efficient data augmentation policies. An augmentation policy is defined as an image processing transformation with a certain magnitude and probability of being applied to each image in each epoch. We search through a space of the most common augmentation policies to find the policy for which the DSC network yields the highest mean Silhouette coefficient in its clustering results on a target dataset. Our method achieves state-of-the-art performance on four standard subspace clustering datasets. The source code is available at: https://github.com/mahdiabavisani/DSCwithDA.git.

## 1 Introduction

Recent advances in technology have provided massive amounts of complex high-dimensional data for computer vision and machine learning applications. High dimensionality has adverse effects, including the confusion of algorithms by irrelevant dimensions and the curse of dimensionality, as well as increased computation time and memory. This motivates us to explore techniques for representing high-dimensional data in lower dimensions. In many practical applications, such as face images under various illumination conditions [1] and hand-written digits [2], high-dimensional data can be represented by a union of low-dimensional subspaces. The subspace clustering problem aims at finding these subspaces. In particular, the objective of subspace clustering is to find the number of subspaces, their bases and dimensions, and to assign data to these subspaces [3].

Conventional subspace clustering algorithms assume that data lie in linear subspaces [4, 5, 6, 7, 8]. In practice, however, many datasets are better modeled by non-linear manifolds. To deal with this issue, many works have incorporated projections and kernel tricks to express non-linearity [9, 10, 7, 11, 12, 13]. Recently, deep subspace clustering (DSC) methods [14, 15, 16, 17, 18] have been proposed which essentially learn unsupervised non-linear mappings by projecting data into a latent space in which the data lie in linear subspaces. Deep subspace clustering networks have shown promising performance on various datasets.

Deep learning techniques are prone to overfitting.
Data augmentation is often presented as a form of regularization that mitigates this issue [19, 20]. While data augmentation has proven beneficial for deep learning-based methods, the current framework of DSC networks is unable to take full advantage of it. In this work, we modify the DSC framework and propose a model that can incorporate data augmentation into DSC.

An important difference between data augmentation in subspace clustering and data augmentation in supervised tasks is that, in contrast to supervised tasks, subspace clustering algorithms have no ground-truth labels for the existing samples. Reflecting the fact that objects remain the same even if we slightly transform them, supervised deep learning models train transformations of an existing sample to be predicted with a label consistent with the ground-truth label of the original sample. How can one convey such a property in an unsupervised subspace clustering task, where the data have no ground-truth labels? A DSC model should favor functions that give consistent outputs for similar data points that differ only slightly in their percept. To achieve this, we optimize a consistency loss based on temporal ensembling. We input plausible transformations of existing samples into the model and require the autoencoders of the model to map the transformations to subspaces consistent with the subspace of the original data.

Efficient augmentation policies improve the performance of deep networks. However, not all image transformations constitute efficient augmentation policies, and efficient policies can differ from one dataset to another [21, 22, 23]. In supervised applications, the validation set is often used to manually search among transformations such as rotation, horizontal flip, or translation by a few pixels to find efficient augmentations. Manual augmentation requires prior knowledge and expertise, and it can only search among a handful of pre-defined trials. Some methods automate this search for classification networks [21, 24, 25]. However, these methods are designed only for the classification task and cannot be applied to subspace clustering, because subspace clustering has no validation or training set. We overcome this issue by providing a simple yet effective method for finding efficient augmentation policies using a greedy search, and we use mean Silhouette scores to evaluate the effect of different augmentation policies on the performance of our proposed model.

## 2 Related Work

**Clustering Methods with Augmentation.** A recent method proposes deep embedded clustering with augmentations [26, 27]. In the pre-training stage, it uses augmentations in training the autoencoders, and in the fine-tuning stage, it encourages the augmented data to have the same centroid as their corresponding original data. To the best of our knowledge, we are the first to propose an augmentation framework for deep subspace clustering algorithms.

**Self-supervision with Consistency Loss.** The idea of learning consistent features for different transformations of unlabeled data has been used in a number of works, largely in the semi-supervised and self-supervised learning literature [28, 29, 30, 31, 32, 33].

**Self-expressiveness Models in Subspace Clustering.** Let $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ be a collection of $N$ signals $\{x_i \in \mathbb{R}^D\}_{i=1}^N$ drawn from a union of $n$ linear subspaces $S_1 \cup S_2 \cup \dots \cup S_n$.
Given $X$, the task of subspace clustering is to find sub-matrices $X_\ell \in \mathbb{R}^{D \times N_\ell}$ that lie in $S_\ell$ with $N_1 + N_2 + \dots + N_n = N$. Due to their simplicity, theoretical correctness, and empirical success, subspace clustering methods based on the self-expressiveness property are very popular [34]. The self-expressiveness property can be stated as

$$X = XC \quad \text{s.t.} \quad \mathrm{diag}(C) = 0, \tag{1}$$

where $C \in \mathbb{R}^{N \times N}$ is the coefficient matrix. There may exist many coefficient matrices that satisfy the condition in (1). Among those, subspace-preserving solutions are of particular interest to self-expressiveness based subspace clustering methods. The subspace-preserving property states that if an element in $C$ is non-zero, the two data points in $X$ that correspond to this coefficient are in the same subspace. Self-expressiveness based methods combine these two properties and solve a problem of the form

$$\min_C \; \mathcal{L}_{S.E.}(C, X) + \lambda_1 \mathcal{L}_{S.P.}(C), \tag{2}$$

where $\lambda_1$ is a regularization constant, and $\mathcal{L}_{S.E.}$ and $\mathcal{L}_{S.P.}$ impose the self-expressiveness and subspace-preserving properties, respectively. Most of the linear methods use $\mathcal{L}_{S.E.}(C, X) = \|X - XC\|_F^2$. For $\mathcal{L}_{S.P.}(C)$, however, different methods use various regularizations, including the $\ell_1$-norm, $\ell_2$-norm, and nuclear norm [4, 34, 35].

In recent years, deep neural network-based extensions of self-expressiveness based models were introduced [14, 15, 16, 17]. For these methods, the $x_i$'s do not need to be drawn from a union of linear subspaces. Instead, they use autoencoder networks to map the data points to a latent space where the data points lie in a union of linear subspaces, and they exploit the self-expressiveness and subspace-preserving properties in that latent space. Let $Z \in \mathbb{R}^{d \times N}$ be the latent space features produced by the encoder of the autoencoder. Deep subspace clustering networks solve a problem of the form

$$\min_\Theta \; \mathcal{L}_{S.E.}(C, Z) + \lambda_1 \mathcal{L}_{S.P.}(C) + \lambda_2 \mathcal{L}_{Rec.}(X, \hat{X}), \tag{3}$$

where $\lambda_1$ and $\lambda_2$ are regularization constants, $\Theta$ is the union of trainable parameters, $\hat{X}$ is the reconstruction of $X$ and the output of the decoder, and $\mathcal{L}_{Rec.}(X, \hat{X}) = \|X - \hat{X}\|_F^2$ is the reconstruction loss used in training the autoencoder. Once a proper $C$ is found from (2) or (3), spectral clustering methods [36] are applied to the affinity matrix $W = |C| + |C|^T$ to obtain the segmentation of the data $X$.
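
A minimal sketch of this last step, using scikit-learn rather than the authors' implementation; the helper name is our own:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def clusters_from_coefficients(C, n_clusters):
    """Hypothetical helper: segment the data from a self-expressive coefficient matrix.

    C          : (N, N) coefficient matrix obtained from (2) or (3).
    n_clusters : number of subspaces n.
    Returns an (N,) array of cluster labels.
    """
    # Symmetric affinity matrix W = |C| + |C|^T, as described above.
    W = np.abs(C) + np.abs(C).T

    # Spectral clustering on the precomputed affinity yields the segmentation of X.
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               assign_labels="discretize", random_state=0)
    return model.fit_predict(W)
```
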
## 3 Deep Subspace Clustering Networks with Data Augmentation

The human brain considers an object to remain the same even if the percept changes slightly. Correspondingly, when data augmentation is used in supervised deep learning models, transformations of existing samples are trained to predict labels consistent with the ground-truth labels of the original samples. Conveying the same insight, we argue that a DSC model should favor functions that give consistent outputs for similar data points. We approach this property by keeping the estimated subspace memberships of data points consistent when an augmentation policy is applied to them. During the training process, we smooth the predictions of the subspace memberships via temporal ensembling of the affinity matrices estimated in previous iterations.

Let $X^t = [x_1^t, \dots, x_N^t] \in \mathbb{R}^{D \times N}$ be the transformed version of the $N$ existing data points $X = [x_1, \dots, x_N] \in \mathbb{R}^{D \times N}$ at iteration $t$; that is, $X^t$ is the observation at time $t$ when an augmentation policy is applied to the existing data points $X$. Our model can be applied to a variety of DSC networks. In this section, we consider a general form that consists of an encoder that takes $X^t$ as input and generates latent space features $Z^t$. The latent space features are reconstructed by a self-expressive layer with parameters $C^t$; that is, $Z^t C^t$ is fed to the decoder to produce $\hat{X}^t$, a reconstruction of $X^t$. Figure 1 shows an overview of this model.

Figure 1: An overview of the proposed deep subspace clustering networks with data augmentation. The existing data points $x_i$ and $x_j$ are transformed into $x_i^t$ and $x_j^t$ in each iteration by an augmentation policy. However, the autoencoder learns to keep their latent space features within consistent subspaces.

Note that such a model includes a fully-connected layer that connects all the samples in the mini-batch (the self-expressive layer). Thus, the number of data points and their order cannot be changed during training. We keep a placeholder with $N$ fields that correspond to the existing samples and feed $X^t$ to this placeholder at every training step $t$; the permutation of samples in $X^t$ remains the same.

As mentioned, we aim for an autoencoder that preserves the subspace memberships of slightly transformed inputs. Let $C^t$ be the coefficient matrix constructed at the $t$-th iteration of a subspace clustering algorithm. In addition, let $\hat{Q}$ be an existing estimate of the subspace membership matrix, whose rows are one-hot vectors denoting the subspace memberships assigned to different samples. The multiplication of $\hat{Q}^T$ and $|C^t|$ gives a matrix whose $(i, j)$-th element shows the contribution of the samples assigned to the $i$-th subspace in reconstructing the $j$-th sample. For a perfect subspace-preserving coefficient matrix, $\hat{Q}^T |C^t|$ has only one non-zero element in each row. For each sample $j$, the maximum value in the $j$-th row of $\hat{Q}^T |C^t|$ can point to a new estimate of its subspace membership. Therefore, a prediction of the subspace membership matrix at iteration $t$ can be calculated as

$$Q^t = \mathrm{Softmax}(\hat{Q}^T |C^t|), \tag{4}$$

where $\mathrm{Softmax}(\cdot)$ applies the softmax function to the rows of its input. We refer to $Q^t$ as the temporal subspace membership matrix; it estimates the subspace memberships for the current observation $X^t$.

Note that because of the randomly augmented inputs, the coefficient matrix $C^t$ can undergo sudden changes across time frames. While it is fine to have different coefficient matrices for slight transformations of the data, we are interested in maintaining persistent subspace membership matrices $Q^t$. Thus, we propose a subspace membership consistency loss. We keep an exponential moving average (EMA) of the coefficient matrices $C^t$ to provide a smooth temporal ensemble of the coefficient matrix. Thus, in addition to the temporal subspace membership matrix in (4), in each training iteration we can calculate another membership matrix corresponding to the temporal ensemble of the coefficient matrices from prior iterations. We refer to this membership matrix as $Q^t_{Ens.}$. Let $C^{t-1}_{EMA}$ be the EMA of the coefficient matrices up to iteration $t-1$, and $C^t$ be the calculated update of the coefficient matrix at iteration $t$. The EMA of the coefficient matrix at iteration $t$ is updated as

$$C^t_{EMA} = \alpha C^{t-1}_{EMA} + (1 - \alpha) C^t, \tag{5}$$

where $0 < \alpha < 1$ is the smoothing factor. Using $C^t_{EMA}$, we can calculate $Q^t_{Ens.}$ as

$$Q^t_{Ens.} = \mathrm{Softmax}(\hat{Q}^T |C^t_{EMA}|), \tag{6}$$

where $\hat{Q}$ is the same prior membership matrix as in (4). Note that $Q^t_{Ens.}$ provides more consistent subspace membership predictions than $Q^t$.
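
A minimal PyTorch sketch of (4)-(6), assuming $\hat{Q}$ is stored as an $N \times n$ matrix and $C^t$ as an $N \times N$ tensor; the helper names and the choice to return an $N \times n$ membership matrix (one row per sample, softmax taken over the subspaces of each sample) are our own reading of the equations, not the authors' reference code.

```python
import torch
import torch.nn.functional as F

def temporal_membership(Q_hat, C):
    """Eqs. (4)/(6): estimate per-sample subspace memberships from a coefficient matrix.

    Q_hat : (N, n) one-hot prior membership matrix (rows index samples).
    C     : (N, N) coefficient matrix (C^t in (4), C^t_EMA in (6)).
    Returns an (N, n) matrix whose j-th row is a distribution over subspaces for sample j.
    """
    # (n, N): entry (i, j) aggregates |C| coefficients of subspace-i samples
    # used in reconstructing sample j.
    contrib = Q_hat.t() @ C.abs()
    # Softmax over the subspace dimension for each sample (our orientation choice).
    return F.softmax(contrib.t(), dim=1)

def update_coefficient_ema(C_ema, C_t, alpha=0.999):
    """Eq. (5): exponential moving average of the coefficient matrices."""
    return alpha * C_ema + (1.0 - alpha) * C_t

# Shapes-only usage example with placeholder values:
N, n = 64, 4
Q_hat = torch.eye(n).repeat(N // n, 1)            # (N, n) one-hot memberships
C_t = torch.rand(N, N)                            # coefficient matrix at iteration t
C_ema = update_coefficient_ema(torch.zeros(N, N), C_t)
Q_t, Q_ens = temporal_membership(Q_hat, C_t), temporal_membership(Q_hat, C_ema)
```
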
To encourage the autoencoders to favor functions that preserve the subspace memberships even for differently transformed observations $X^t$, we propose the subspace membership consistency loss as follows:

$$\mathcal{L}_{Cons.}(Q^t_{Ens.}, Q^t) = \mathrm{CE}(Q^t_{Ens.}, Q^t), \tag{7}$$

where $\mathrm{CE}(\cdot)$ denotes the cross-entropy function. $\mathcal{L}_{Cons.}$ penalizes temporal changes to the subspace memberships if they are inconsistent with the temporal ensemble of subspace memberships $Q^t_{Ens.}$.

**Full Objective.** We train the networks iteratively, with two steps of subspace clustering and subspace membership consistency in each iteration. In the subspace clustering step, the loss function of the subspace clustering algorithm of choice (3) is optimized, and in the subspace membership consistency step, (7) is optimized. That is, at each iteration $t$, we train the network with the following algorithm:

$$\text{Step 1:} \quad \min_\Theta \big( \mathcal{L}_{S.E.}(C^t, Z^t) + \lambda_1 \mathcal{L}_{S.P.}(C^t) + \lambda_2 \mathcal{L}_{Rec.}(X^t, \hat{X}^t) \big),$$
$$\text{Step 2:} \quad \min_\Theta \; \mathcal{L}_{Cons.}(Q^t_{Ens.}, Q^t), \tag{8}$$

where $\Theta$ is the union of trainable parameters in the networks.
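
To make Step 2 of (8) concrete, here is a hedged PyTorch sketch of the consistency loss (7). It assumes the ensembled memberships are treated as a fixed (detached) soft target while gradients flow through $Q^t$, a common choice in temporal-ensembling setups that the paper does not spell out; the function name is ours.

```python
import torch

def membership_consistency_loss(Q_ens, Q_t, eps=1e-8):
    """Eq. (7): cross-entropy between ensembled and current membership estimates.

    Q_ens : (N, n) temporal-ensemble memberships from (6), used as a soft target.
    Q_t   : (N, n) current memberships from (4), differentiable w.r.t. the encoder
            and self-expressive layer parameters.
    """
    target = Q_ens.detach()  # stop gradients through the temporal ensemble
    return -(target * torch.log(Q_t + eps)).sum(dim=1).mean()
```

In each iteration, Step 1 would then take a gradient step on the DSC objective (3) evaluated on the augmented batch $X^t$, and Step 2 a gradient step on this loss.
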
## 4 Finding Efficient Augmentations

In the previous section, we denoted $X^t$ as a stochastic transition of $X$ resulting from applying an augmentation policy. The choice of augmentation policy plays an important role in the performance of the network. We formulate the problem of finding the best augmentation policy as a discrete search problem. Our method consists of three components: a score, a search algorithm, and a search space with $n_s$ possible configurations. The search algorithm samples a data augmentation policy $S_i$, which specifies what image processing operation to use, the probability of using the operation in each iteration, and the magnitude of the operation. The policy $S_i$ is used to train a child deep subspace clustering network with a fixed architecture. The trained child network returns a score that quantifies the effect of applying the policy $S_i$ to the input data on the performance of the deep subspace clustering task. Finally, all the tested policies $\{S_i\}_{i=1}^{n_s}$ are sorted based on the returned scores. In the following, we describe the score, the search space, and the search algorithm in detail.

**Score.** In our framework, the score is a metric that evaluates the performance of the DSC on a certain given input. Note that the ground-truth labels are unknown at this stage. Therefore, we need a validation technique that does not use the ground-truth labels. Any internal clustering validation measure, including the mean Silhouette coefficient [37] or the Davies-Bouldin index (DBI) [38], can serve as the score metric in our search. We use the mean Silhouette coefficient in this paper.

**Search Space.** In our search space, a sample policy $S_i$ consists of $\ell$ sequential sub-policies, with each sub-policy using one image operation. Each operation is additionally associated with two hyper-parameters: 1) the probability of applying the operation, and 2) the magnitude of the operation. We discretize the range of probability and magnitude values into $n_p$ and $n_m$ discrete values, respectively (with uniform spacing), so that a discrete search algorithm can be used to find them. For $n_o$ operations, this constructs a search space of size $n_s = (n_o \, n_p \, n_m)^\ell$.

**Search Algorithm.** The size of the search space $n_s$ can grow exponentially, so a brute-force search may be impractical. To make the search feasible, we use a greedy search. First, we search in the reduced search space where each sample policy has only one sub-policy ($\ell = 1$). In this reduced search space, we find the best probability and magnitude for each image operation. Note that $n_p$ and $n_m$ can also be decreased as much as necessary to keep the search tractable. Once we have found the best augmentation operations for the first sub-policy, we search for the second sub-policy ($\ell = 2$): for each sub-policy found in the first stage, we search for the best combination of image operations and their probabilities and magnitudes. This process continues until we reach $\ell = \ell_{max}$, the maximum number of sub-policies. At this point, we sort all the potentially good policies found so far and select the best $b$ augmentation policies among them.
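
The greedy search can be summarized with the sketch below. It assumes a caller-supplied `train_and_cluster(policy)` routine that trains a child DSC network under the candidate policy and returns latent features and predicted labels; that helper, the representation of a policy as a list of (operation, probability, magnitude) triples, and the detail of extending only the top-scoring sub-policies from the previous stage are our own assumptions about the procedure.

```python
from itertools import product
from sklearn.metrics import silhouette_score

def greedy_policy_search(operations, probs, mags, train_and_cluster, l_max=2, top_b=2):
    """Greedy search over augmentation policies, scored by the mean Silhouette coefficient.

    A policy is a list of (operation, probability, magnitude) sub-policies.
    train_and_cluster(policy) -> (Z, labels): hypothetical helper that trains a child
    DSC network with the policy applied to its inputs and returns latent features
    Z of shape (N, d) and predicted labels of shape (N,).
    """
    def score(policy):
        Z, labels = train_and_cluster(policy)
        return silhouette_score(Z, labels)  # internal validation, no ground truth needed

    # Stage l = 1: score every single-operation sub-policy.
    scored = [(score([sub]), [sub]) for sub in product(operations, probs, mags)]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    all_scored = list(scored)

    # Stages l = 2 .. l_max: extend the best policies found in the previous stage.
    for _ in range(1, l_max):
        extended = []
        for _, base in scored[:top_b]:
            for sub in product(operations, probs, mags):
                policy = base + [sub]
                extended.append((score(policy), policy))
        extended.sort(key=lambda sp: sp[0], reverse=True)
        all_scored.extend(extended)
        scored = extended

    # Sort every tested policy (including those with fewer than l_max sub-policies)
    # and return the b best overall.
    all_scored.sort(key=lambda sp: sp[0], reverse=True)
    return [policy for _, policy in all_scored[:top_b]]
```

With the settings later used in Section 5.1 (16 operations, $p$ and $m$ fixed), `probs` and `mags` become singletons and the search reduces to scoring combinations of the 16 operations.
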
## 5 Experimental Results

We evaluate our method against state-of-the-art subspace clustering algorithms on four standard datasets. We first use the algorithm described in Section 4 to find the best augmentation policies for each dataset. Then, we use the found augmentation policies in the ablation study as well as in comparisons with state-of-the-art subspace clustering algorithms. We use the following datasets in our experiments: the Extended Yale-B dataset [39] contains 2432 facial images of 38 individuals from 9 poses and under 64 illumination settings. The ORL dataset [42] includes 400 facial images from 40 individuals, which corresponds to only 10 samples per subject. The COIL-100 [41] and COIL-20 [40] datasets respectively consist of images of 100 and 20 objects placed on a motorized turntable; for each object, 72 images are taken at pose intervals of 5 degrees, covering a 360-degree range. Following most prior studies, we use grayscale images of these datasets in our experiments. Figure 2 (a), (b), and (c) show sample images from the Extended Yale-B, COIL, and ORL datasets, respectively. Note that in subspace clustering tasks, the datasets are not split into training and testing sets; instead, all the existing samples are used in both the learning stage and the performance evaluation stage.

Figure 2: Sample images from the datasets used. (a) Extended Yale-B dataset [39]. (b) COIL dataset [40, 41]. (c) ORL dataset [42].

**Experimental Setups.** While our method can be applied to many DSC algorithms, unless otherwise stated we adopt the MLRDSC networks [17], due to their promising performance, and apply our method to them. We call the result MLRDSC with Data Augmentation (MLRDSC-DA). The objective function of MLRDSC can also be written in the format of (3). The self-expressiveness and subspace-preserving loss terms in MLRDSC are

$$\mathcal{L}_{S.E.}(C, Z) = \sum_{l=1}^{L} \|Z_l - Z_l (G + D_l)\|_F^2 \quad \text{and} \quad \mathcal{L}_{S.P.}(C) = \big\|Q^T |G|\big\|_1 + \lambda_3 \sum_{l=1}^{L} \|D_l\|_F^2, \tag{9}$$

where $L$ is the number of layers in the autoencoder, $Z_l$ is the features at the $l$-th layer, and $C = G + \frac{1}{L}\sum_{l=1}^{L} D_l$. The coefficient matrix in this model is composed of the consistency matrix $G$ and the distinctive matrices $\{D_l\}_{l=1}^{L}$. The distinctive matrices enforce the subspace-preserving property across different layers, and $G$ captures the shared information between the layers. In the training of MLRDSC-DA, we first pre-train the networks by performing the MLRDSC algorithm. Then, we continue training MLRDSC-DA for a few additional iterations until convergence with (8) as

$$\text{Step 1:} \quad \min_\Theta \sum_{l=1}^{L} \|Z_l^t - Z_l^t (G^t + D_l^t)\|_F^2 + \lambda_1 \big\| {Q^t}^T |G^t| \big\|_1 + \lambda_3 \sum_{l=1}^{L} \|D_l^t\|_F^2 + \lambda_2 \|X^t - \hat{X}^t\|_F^2,$$
$$\text{Step 2:} \quad \min_\Theta \; \mathrm{CE}(Q^t_{Ens.}, Q^t), \tag{10}$$

where we shape the temporal coefficient matrix as $C^t = G^t + \frac{1}{L}\sum_{l=1}^{L} D_l^t$, and $Q^t_{Ens.}$ and $Q^t$ are calculated from (6) and (4), respectively. We use the same training settings as described in [17]. This includes the same architecture for the networks and the same values for the hyper-parameters $\lambda_1$, $\lambda_2$, $\lambda_3$ in the different experiments, as well as the same initial values: a zero matrix for the membership matrix $\hat{Q}$, and matrices with all elements equal to 0.0001 for the coefficient matrices $G^0$ and $D_l^0$'s at iteration $t = 0$. We update $\hat{Q}$ every 50 iterations by substituting the subspace membership estimations with the result of subspace clustering performed on the current $C^t$. We set the EMA decay to $\alpha = 0.999$ in all the experiments (selected by cross-validation with the mean Silhouette coefficient as the evaluation metric). We implemented our method in PyTorch. We use the adaptive momentum-based gradient descent method (ADAM) [43] with a learning rate of $10^{-3}$ to minimize the loss functions. Similar to other DSC methods, we input the whole dataset as a single batch. In all the conducted experiments, we report 5-fold averages.

### 5.1 Best Augmentation Policies Found on the Datasets

We perform the search algorithm of Section 4 on the different datasets to find the best augmentation policies for each of them. To reduce the computation, we search in the space of augmentation policies with a maximum of $\ell_{max} = 2$ sub-policies (i.e., up to two sub-policies can be combined to construct a policy), and set the probability to $p = 0.1$ and the magnitude to $m = 0.3\,r$, where $r = (\max - \min)$ is the magnitude range that each image operation accepts. The image operation search space is the following set: {FlipLR, ShearX, FlipUD, ShearY, Posterize, Rotate, Invert, Brightness, Equalize, Solarize, Contrast, TranslateY, TranslateX, AutoContrast, Sharpness, Cutout}, which is also used in [21]. This results in a search space of size $n_s = 16^2$. We selected the values for the magnitude and probability of the augmentation policies by searching the full search space of augmentation policies on the first two subjects of the Extended Yale B dataset. Figure 3 shows different augmentation policies applied to a sample drawn from the Extended Yale B dataset. The details of these image operations are described in Table 1 of the supplementary materials.

Figure 3: Different image transformations (e.g., TranslateX, AutoContrast, TranslateY) applied to a sample from the Extended Yale B dataset.

For each candidate augmentation policy, we train our MLRDSC-DA model, perform subspace clustering, and return the mean Silhouette coefficient [37] as the clustering performance. We use the mean Silhouette coefficients to sort the augmentation policies (including policies with $\ell < \ell_{max}$ sub-policies) and select the top two performing augmentation policies for each dataset, that is, $b = 2$.
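
To make the role of a sub-policy's probability $p$ and magnitude $m$ concrete, the following is a hedged PIL-based sketch of applying a policy to an image at each training step. The particular operations shown (Posterize, Contrast, Sharpness, FlipLR), the magnitude ranges, and the helper names are illustrative choices of ours and only approximate the operation definitions in the supplementary material.

```python
import random
from PIL import ImageEnhance, ImageOps

# Illustrative magnitude ranges r = (max - min) for a few operations; not the paper's exact values.
RANGES = {"Posterize": 4, "Contrast": 0.9, "Sharpness": 0.9, "FlipLR": 0.0}

def _posterize(img, mag):
    # Reduce the number of bits per pixel; small magnitudes keep more detail.
    return ImageOps.posterize(img, 8 - int(mag))

def _contrast(img, mag):
    return ImageEnhance.Contrast(img).enhance(1.0 + mag)

def _sharpness(img, mag):
    return ImageEnhance.Sharpness(img).enhance(1.0 + mag)

def _flip_lr(img, mag):
    return ImageOps.mirror(img)  # magnitude is unused for flips

OPS = {"Posterize": _posterize, "Contrast": _contrast,
       "Sharpness": _sharpness, "FlipLR": _flip_lr}

def apply_policy(img, policy):
    """Apply a policy, i.e. a list of (op_name, probability, magnitude_fraction)
    sub-policies, to a PIL image. For example, a policy like
    [("FlipLR", 0.1, 0.0), ("Sharpness", 0.1, 0.3), ("Contrast", 0.1, 0.3)]
    applies each operation with probability 0.1 and magnitude m = 0.3 * r."""
    for op_name, prob, mag_fraction in policy:
        if random.random() < prob:
            magnitude = mag_fraction * RANGES[op_name]
            img = OPS[op_name](img, magnitude)
    return img
```
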
Table 1 shows the augmentation policies found to yield the highest Silhouette coefficients in the subspace clustering results on the different datasets. In our experiments, COIL-20 and COIL-100 resulted in similar policies. Unless otherwise stated, in all experiments we apply these augmentation policies to the inputs of our MLRDSC-DA algorithm.

Table 1: Augmentation policies that yield the highest mean Silhouette coefficient in the subspace clustering results on different datasets.

| Dataset | Augmentation Policy 1 | Augmentation Policy 2 |
| --- | --- | --- |
| Extended Yale B | (Op = ShearY, m = 0.3r, p = 0.1) | (Op = TranslateY, m = 0.3r, p = 0.1) + (Op = Contrast, m = 0.3r, p = 0.1) |
| COIL-20 & COIL-100 | (Op = Posterize, m = 0.3r, p = 0.1) | (Op = FlipLR, p = 0.1) + (Op = Sharpness, m = 0.3r, p = 0.1) + (Op = Contrast, m = 0.3r, p = 0.1) |
| ORL | (Op = ShearX, m = 0.3r, p = 0.1) | (Op = FlipLR, p = 0.1) + (Op = Sharpness, m = 0.3r, p = 0.1) + (Op = ShearX, m = 0.3r, p = 0.1) |

### 5.2 Ablation Study and Analysis of the Model

To understand the effects of some of our model choices, we explore several ablations of our model on the Extended Yale B dataset. In particular, we test our model on two different deep subspace clustering methods, DSC [15] and MLRDSC [17], in four settings where the consistency loss is either present or ablated, and the optimal augmentation policies are either applied to the inputs or the data is fed without any augmentations. If we remove both the augmentations and the consistency loss, our networks, depending on their backbones, reduce to either the DSC or the MLRDSC network. In the settings where data augmentation is applied, the augmentations in Table 1 are used. Further analysis of the found augmentation policies is provided in Section 5.4.

We report the performances in Table 2. As can be seen, the top performer is our full model, with both augmentations and the consistency loss, applied to the MLRDSC method. MLRDSC-based methods, in general, outperform DSC-based methods. The consistency loss slightly improves the performance even without data augmentation; this is the result of temporal ensembling.

Table 2: Ablation study of our method in terms of clustering error (%) on Extended Yale B. Top performers are bolded.

| Backbone | No augmentations, no consistency loss | Augmentations only | Consistency loss only | Augmentations + consistency loss |
| --- | --- | --- | --- | --- |
| DSC | 2.67 | 3.10 | 2.56 | **1.92** |
| MLRDSC | 1.36 | 2.84 | 0.95 | **0.82** |

As can be seen in the second settings column of Table 2 (augmentations without the consistency loss), applying the found augmentations to the inputs of the DSC and MLRDSC networks without further modification not only does not improve the results but slightly degrades the performance. These results clearly show both the importance of the consistency loss and the benefit of data augmentation when it is combined with the consistency loss.

### 5.3 Comparison with State-of-the-Art Subspace Clustering Methods

In this section, we evaluate our method against state-of-the-art subspace clustering methods. We apply the augmentation policies found in Table 1 to the data of the Extended Yale B, ORL, COIL-20, and COIL-100 datasets and feed them to our MLRDSC-DA method. The rows of Table 3 report the clustering error rates of the different subspace clustering algorithms. As the table reveals, deep subspace clustering methods, including DSC, DASC, S2ConvSCN, and MLRDSC, in general outperform the conventional subspace clustering approaches.

Table 3: Clustering error (%) of different methods on the Extended Yale B, ORL, COIL20, and COIL100 datasets. Top performers are bolded.

| Dataset | LRR [5] | LRSC [44] | SSC [45] | AE+SSC [15] | KSSC [9] | SSC-OMP [46] | EDSC [47] | AE+EDSC [47] | DSC [15] | DASC [16] | S2ConvSCN [14] | MLRDSC [17] | MLRDSC-DA (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| E. Yale B | 34.87 | 29.89 | 27.51 | 25.33 | 27.75 | 24.71 | 11.64 | 12.66 | 2.67 | 1.44 | 1.52 | 1.36 | **0.82** |
| ORL | 33.50 | 32.50 | 29.50 | 26.75 | 34.25 | 37.05 | 27.25 | 26.25 | 14.00 | 11.75 | 10.50 | 11.25 | **10.25** |
| COIL20 | 30.21 | 31.25 | 14.83 | 22.08 | 24.65 | 29.86 | 14.86 | 14.79 | 5.42 | 3.61 | 2.14 | 2.08 | **1.79** |
| COIL100 | 53.18 | 50.67 | 44.90 | 43.93 | 47.18 | 67.29 | 38.13 | 38.88 | 30.96 | - | 26.67 | 23.28 | **20.67** |
This observation suggests that deep networks can better model the non-linear relationships between the samples. Among them, however, our model outperforms all the benchmarks. Note that our model and MLRDSC share similar architectures and have the same number of parameters; the only difference is that our method takes advantage of training on the augmented set of data. This observation clearly shows the benefits of incorporating data augmentation into the task of deep subspace clustering.

### 5.4 Comparison with Common Augmentation Policies and Transferred Augmentation Policies

Existing automated algorithms for finding proper augmentations, or even manual searches, do not apply to the subspace clustering task. The current algorithms are mostly designed for supervised tasks and require ground-truth targets to compare performances, whereas in subspace clustering the ground-truth labels are not available. However, one may apply the supervised augmentation searches to a source dataset with available labels and use the found augmentation policies on a target dataset for the task of subspace clustering. To compare such an approach with the method described in Section 4, we adopt the augmentation policies that AutoAugment [21] finds for the classification task on the SVHN [48] and ImageNet [49] datasets, and directly apply the found policies to the input of our MLRDSC-DA. We furthermore compare the performances to the results of applying the following augmentation policies to the input: random left-right flips (Flip-LR), Cut-out [50, 51], and common augmentations picked by practitioners (Common aug. policies). For Common aug. policies, we use a combination of the most common augmentations, including zero padding, cropping, random flips, and cutout. Note that all the experiments in this section share the same architecture and training procedure as MLRDSC-DA; they differ only in the augmentation policies applied to their inputs.

Table 4: Clustering error (%) on Extended Yale B with different augmentation policies applied to the inputs of MLRDSC-DA.

| Augmentation policies | Random LR flips | Cut-out | Common aug. policies | AutoAugment for ImageNet | AutoAugment for SVHN | Policies found from Algorithm 1 (ours) |
| --- | --- | --- | --- | --- | --- | --- |
| Extended Yale B | 1.32 | 2.88 | 2.96 | 5.96 | 11.31 | 0.82 |

As can be seen in Table 4, the augmentation policies found with [21] on SVHN and ImageNet perform poorly. This is because they are deemed good policies for the classification task on those datasets and may not work as efficiently on the subspace clustering task. The reason that random flips provide a relatively good performance is that the objects in the dataset are symmetric. The augmentations found with our suggested approach provide the best results.

## 6 Conclusion

We introduced a framework to incorporate data augmentation techniques into deep subspace clustering algorithms. The underlying assumption in subspace clustering tasks is that data points with the same label lie in the same subspace. Based on this assumption, we argued that slight transformations of a data point should not alter the subspace in which the data point lies. To enforce this property, we proposed the subspace membership consistency loss, which keeps the data points within consistent subspaces when slight random transformations are applied to the input data. Employing the mean Silhouette coefficient metric, we furthermore provided a simple yet effective unsupervised algorithm to find the best augmentation policies for each target dataset. Our experiments showed that applying good data augmentations improves the performance of subspace clustering algorithms.

## 7 Broader Impact

Since our method improves subspace clustering, it advances learning from unannotated data.
Improving the learning process and providing more accurate similarity matrices for unannotated data can positively impact the accountability, transparency, and explainability of AI methods. However, if not controlled, providing the opportunity to learn from large unannotated datasets could increase concerns about violating the privacy of individuals.

## 8 Acknowledgement

This research was funded in part by NSF grants IIS-1703883, CNS-1747778, CCF-1733843, IIS-1763523, IIS-1849238-825536 and MURI-Z8424104-440149, and in part by the Northrop Grumman Mission Systems Research in Applications for Learning Machines (REALM) initiative and the NSF grant 1618677.

## References

[1] R. Basri and D. W. Jacobs, Lambertian reflectance and linear subspaces, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 25, no. 2, pp. 218–233, 2003.
[2] T. Hastie and P. Y. Simard, Metrics and models for handwritten character recognition, Statistical Science, vol. 13, no. 1, pp. 54–65, 1998.
[3] R. Vidal, Subspace clustering, IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
[4] E. Elhamifar and R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 11, pp. 2765–2781, 2013.
[5] G. Liu, Z. Lin, and Y. Yu, Robust subspace segmentation by low-rank representation, in International Conference on Machine Learning, 2010.
[6] P. Favaro, R. Vidal, and A. Ravichandran, A closed form solution to robust subspace estimation and clustering, in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[7] V. M. Patel, H. V. Nguyen, and R. Vidal, Latent space sparse and low-rank subspace clustering, IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 691–701, 2015.
[8] M. Abavisani and V. M. Patel, Multimodal sparse and low-rank subspace clustering, Information Fusion, vol. 39, pp. 168–177, 2018.
[9] V. M. Patel and R. Vidal, Kernel sparse subspace clustering, in IEEE International Conference on Image Processing, 2014.
[10] H. Qi and S. Hughes, Using the kernel trick in compressive sensing: Accurate signal recovery from fewer measurements, in IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3940–3943, 2011.
[11] V. M. Patel, H. V. Nguyen, and R. Vidal, Latent space sparse subspace clustering, in International Conference on Computer Vision, 2013.
[12] M. Abavisani and V. M. Patel, Domain adaptive subspace clustering, in British Machine Vision Conference, 2017.
[13] M. Abavisani and V. M. Patel, Adversarial domain adaptive subspace clustering, in International Conference on Identity, Security and Behavior Analysis, 2018.
[14] J. Zhang, C.-G. Li, C. You, X. Qi, H. Zhang, J. Guo, and Z. Lin, Self-supervised convolutional subspace clustering network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5473–5482, 2019.
[15] P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid, Deep subspace clustering networks, in Advances in Neural Information Processing Systems, 2017.
[16] P. Zhou, Y. Hou, and J. Feng, Deep adversarial subspace clustering, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1596–1604, 2018.
[17] M. Kheirandishfard, F. Zohrizadeh, and F. Kamangar, Multi-level representation learning for deep subspace clustering, in The IEEE Winter Conference on Applications of Computer Vision, pp. 2039–2048, 2020.
[18] M. Abavisani and V. M. Patel, Deep multimodal subspace clustering networks, IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 6, pp. 1601–1614, 2018.
[19] P. Y. Simard, D. Steinkraus, J. C. Platt, et al., Best practices for convolutional neural networks applied to visual document analysis, in ICDAR, vol. 3, 2003.
[20] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, Deep, big, simple neural nets for handwritten digit recognition, Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.
[21] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, AutoAugment: Learning augmentation policies from data, arXiv preprint arXiv:1805.09501, 2018.
[22] A. Naghizadeh, M. Abavisani, and D. N. Metaxas, Greedy AutoAugment, Pattern Recognition Letters, vol. 138, pp. 624–630, 2020.
[23] A. Naghizadeh, D. N. Metaxas, and D. Liu, Greedy auto-augmentation for n-shot learning using deep neural networks, Neural Networks, 2020.
[24] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, RandAugment: Practical automated data augmentation with a reduced search space, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703, 2020.
[25] D. Ho, E. Liang, X. Chen, I. Stoica, and P. Abbeel, Population based augmentation: Efficient learning of augmentation policy schedules, in International Conference on Machine Learning, pp. 2731–2741, PMLR, 2019.
[26] X. Guo, E. Zhu, X. Liu, and J. Yin, Deep embedded clustering with data augmentation, in Asian Conference on Machine Learning, pp. 550–565, 2018.
[27] X. Guo, X. Liu, E. Zhu, X. Zhu, M. Li, X. Xu, and J. Yin, Adaptive self-paced deep clustering with data augmentation, IEEE Transactions on Knowledge and Data Engineering, 2019.
[28] P. Bachman, O. Alsharif, and D. Precup, Learning with pseudo-ensembles, in Advances in Neural Information Processing Systems, pp. 3365–3373, 2014.
[29] M. Sajjadi, M. Javanmardi, and T. Tasdizen, Regularization with stochastic transformations and perturbations for deep semi-supervised learning, in Advances in Neural Information Processing Systems, pp. 1163–1171, 2016.
[30] S. Laine and T. Aila, Temporal ensembling for semi-supervised learning, arXiv preprint arXiv:1610.02242, 2016.
[31] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, Hand keypoint detection in single images using multiview bootstrapping, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1145–1153, 2017.
[32] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He, Data distillation: Towards omni-supervised learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4119–4128, 2018.
[33] Y. Tian, D. Krishnan, and P. Isola, Contrastive multiview coding, arXiv preprint arXiv:1906.05849, 2019.
[34] Y. Chen, C.-G. Li, and C. You, Stochastic sparse subspace clustering, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4155–4164, 2020.
[35] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 35, no. 1, pp. 171–184, 2013.
[36] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, in Neural Information Processing Systems (NIPS), vol. 2, pp. 849–856, 2002.
[37] P. J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[38] D. L. Davies and D. W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.
[39] K. C. Lee, J. Ho, and D. J. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 684–698, May 2005.
[40] S. A. Nene, S. K. Nayar, H. Murase, et al., Columbia Object Image Library (COIL-20), 1996.
[41] S. A. Nene, S. K. Nayar, and H. Murase, Columbia Object Image Library (COIL-100), 1996.
[42] F. S. Samaria and A. C. Harter, Parameterisation of a stochastic model for human face identification, in Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, 1994.
[43] D. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[44] R. Vidal and P. Favaro, Low rank subspace clustering (LRSC), Pattern Recognition Letters, 2013.
[45] E. Elhamifar and R. Vidal, Sparse subspace clustering, in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2797, 2009.
[46] C. You, D. Robinson, and R. Vidal, Scalable sparse subspace clustering by orthogonal matching pursuit, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3918–3927, 2016.
[47] P. Ji, M. Salzmann, and H. Li, Efficient dense subspace clustering, in IEEE Winter Conference on Applications of Computer Vision, pp. 461–468, IEEE, 2014.
[48] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, Reading digits in natural images with unsupervised feature learning, in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2011, p. 5, 2011.
[49] Y. Le and X. Yang, Tiny ImageNet visual recognition challenge, 2015.
[50] T. DeVries and G. W. Taylor, Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552, 2017.
[51] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, Random erasing data augmentation, arXiv preprint arXiv:1708.04896, 2017.