# densely_supervised_grasp_detector_dsgd__12d3cab5.pdf

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

## Densely Supervised Grasp Detector (DSGD)

Umar Asif, IBM Research Australia, umarasif@au1.ibm.com
Jianbin Tang, IBM Research Australia, jbtang@au1.ibm.com
Stefan Harrer, IBM Research Australia, sharrer@au1.ibm.com

## Abstract

This paper presents the Densely Supervised Grasp Detector (DSGD), a deep learning framework which combines CNN structures with layer-wise feature fusion and produces grasps and their confidence scores at different levels of the image hierarchy (i.e., global-, region-, and pixel-levels). Specifically, at the global level, DSGD uses the entire image information to predict a grasp. At the region level, DSGD uses a region proposal network to identify salient regions in the image and uses a grasp prediction network to generate segmentations of the salient regions and their corresponding grasp poses. At the pixel level, DSGD uses a fully convolutional network and predicts a grasp and its confidence at every pixel. During inference, DSGD selects the most confident grasp as the output. This selection from hierarchically generated grasp candidates overcomes the limitations of the individual models. DSGD outperforms state-of-the-art methods on the Cornell grasp dataset in terms of grasp accuracy. Evaluation on a multi-object dataset and real-world robotic grasping experiments show that DSGD produces highly stable grasps on a set of unseen objects in new environments. It achieves 97% grasp detection accuracy and a 90% robotic grasping success rate with real-time inference speed.

## Introduction

Grasp detection is a crucial task in robotic grasping because errors in this stage affect grasp planning and execution. A major challenge in grasp detection is generalization to unseen objects in the real world. Recent advancements in deep learning have produced Convolutional Neural Network (CNN) based grasp detection methods which achieve higher grasp detection accuracy compared to hand-crafted features. Methods such as (Lenz, Lee, and Saxena 2015; Redmon and Angelova 2015; Asif, Bennamoun, and Sohel 2017b; Asif, Tang, and Harrer 2018a) focused on learning grasps in a global context (i.e., the model predicts one grasp considering the whole input image) through regression-based approaches, which directly regress the grasp parameters defined by the location, width, height, and orientation of a 2D rectangle in image space. Other methods such as (Pinto and Gupta 2016) focused on learning grasps at the patch level by extracting patches (of different sizes) from the image and predicting a grasp for each patch. Recently, methods such as (Morrison, Corke, and Leitner 2018; Zeng et al. 2017) used auto-encoders to learn grasp parameters at each pixel in the image. They showed that a one-to-one mapping (of image data to ground-truth grasps) at the pixel level can effectively be learnt using small CNN structures to achieve fast inference speed. These studies show that grasp detection performance is strongly influenced by three main factors: i) the choice of the CNN structure used for feature learning, ii) the objective function used to learn grasp representations, and iii) the image hierarchical context at which grasps are learnt (e.g., global or local).
In this work, we explore the advantages of combining multiple global and local grasp detectors together with a mechanism to select the best grasp out of the ensemble. We also explore the benefits of learning grasp parameters using a combination of regression and classification objective functions. Finally, we explore different CNN structures as base networks to identify the best-performing architecture in terms of grasp detection accuracy. The main contributions of this paper are summarized below:

1) We present the Densely Supervised Grasp Detector (DSGD), an ensemble of multiple CNN structures which generate grasps and their confidence scores at different levels of the image hierarchy (i.e., global-level, region-level, and pixel-level).
2) We propose a region-based grasp network, which learns to segment salient parts (e.g., handles or extrusions) from the input image, and uses the information about these parts to learn class-specific grasps (i.e., each grasp is associated with a probability with respect to a graspable class and a non-graspable class).
3) We perform an ablation study of our DSGD by varying its critical parameters and present a grasp detector that achieves real-time speed and high grasp accuracy.
4) We demonstrate the robustness of DSGD for producing stable grasps for unseen objects in real-world environments using a multi-object dataset and robotic grasping experiments. See our experiment videos at: https://youtu.be/BndQ8vcNzs and https://youtu.be/tA2qgtbTT98

## Related Work

In the context of deep learning based grasp detection, methods such as (Saxena, Driemeyer, and Ng 2008; Jiang, Moseson, and Saxena 2011; Lenz, Lee, and Saxena 2015) trained sliding-window based grasp detectors. However, their high inference times limit their application to real-time systems. Other methods such as (Mahler et al. 2016; 2017; Johns, Leutenegger, and Davison 2016) reduced inference time by processing a discrete set of grasp candidates, but these methods ignore some potential grasps. Alternatively, methods such as (Redmon and Angelova 2015; Kumra and Kanan 2017; Guo et al. 2017) proposed end-to-end CNN-based approaches which regress a single grasp for an input image. However, these methods tend to produce average grasps which are invalid for certain symmetric objects (Redmon and Angelova 2015). Recently, multi-grasp detectors based on auto-encoders (Morrison, Corke, and Leitner 2018; Zeng et al. 2017; 2018; Myers et al. 2015; Varley et al. 2015) and a Faster R-CNN (Ren et al. 2015) based grasp detector (Chu, Xu, and Vela 2018) demonstrated higher grasp accuracy compared to the global methods. Another stream of work focused on learning a mapping between images of objects and robot motion parameters using reinforcement learning, where the robot iteratively refines grasp poses through real-world experiments (Pinto and Gupta 2016; Levine et al. 2016).

In this paper, we present a grasp detector which has several key differences from current grasp detection methods. First, our detector generates multiple global and local grasp candidates and selects the grasp with the highest quality. This allows our detector to effectively recover from the errors of the individual global or local models. Second, we introduce a region-based grasp network which uses segmentation information of salient parts of objects (e.g., handles, extrusions) to learn grasp poses, and produces more accurate grasp predictions compared to global (Kumra and Kanan 2017) or local detectors (Pinto and Gupta 2016).
Finally, we use layer-wise dense feature fusion (Huang et al. 2017) within the CNN structures. This maximizes variation in the information flow across the networks and produces better image-to-grasp mappings compared to the models of (Redmon and Angelova 2015; Morrison, Corke, and Leitner 2018).

## Problem Formulation

Given an image of an object as input, the goal is to generate grasps at different image hierarchical levels (i.e., global-, region-, and pixel-levels), and to select the most confident grasp as the output. We define the global grasp by a 2D oriented rectangle on the target object in image space. It is given by:

$$G_g = [x_g,\, y_g,\, w_g,\, h_g,\, \theta_g,\, \rho_g], \qquad (1)$$

where $x_g$ and $y_g$ represent the centroid of the rectangle, $w_g$ and $h_g$ represent the width and the height of the rectangle, and $\theta_g$ represents the angle of the rectangle with respect to the x-axis. The term $\rho_g$ is the grasp confidence and represents the quality of a grasp. Our region-level grasp is defined by a class-specific representation, where the parameters of the rectangle are associated with $n$ classes (a graspable class: $n = 1$, and a non-graspable class: $n = 0$). It is given by:

$$G_r = [x_r^n,\, y_r^n,\, w_r^n,\, h_r^n,\, \theta_r^n,\, \rho_r^n], \quad n \in \{0, 1\}. \qquad (2)$$

Our pixel-level grasp is defined as:

$$G_p = [M_{xy},\, M_w,\, M_h,\, M_\theta] \in \mathbb{R}^{s \times W \times H}, \qquad (3)$$

where $M_{xy}$, $M_w$, $M_h$, and $M_\theta$ represent $\mathbb{R}^{s \times W \times H}$-dimensional heatmaps (with $s = 1$ for $M_{xy}$, $M_w$, and $M_h$, and $s = N_\theta$ for $M_\theta$), which encode the position, width, height, and orientation of grasps at every pixel of the image, respectively. The terms $W$ and $H$ represent the width and the height of the input image, respectively. We learn the grasp representations ($G_g$, $G_r$, and $G_p$) using joint regression-classification objective functions. Specifically, we learn the position, width, and height parameters using a Mean Squared Loss, and learn the orientation parameter using a Cross Entropy Loss with respect to $N_\theta = 20$ classes (angular bins).

## The Proposed DSGD (Fig. 1)

Our DSGD is composed of four main modules as shown in Fig. 1: a base network for feature extraction, a Global Grasp Network (GGN) for producing a grasp at the image level, a Region Grasp Network (RGN) for producing grasps using salient parts of the image, and a Pixel Grasp Network (PGN) for generating grasps at each image pixel. In the following, we describe the various modules of DSGD in detail.

### Base Network

The purpose of the base network is to act as a feature extractor. We extract features from the intermediate layers of a CNN such as DenseNets (Huang et al. 2017), and use the features to learn grasp representations at different hierarchical levels. The basic building block of DenseNets (Huang et al. 2017) is a Dense block: bottleneck convolutions interconnected through dense connections. Specifically, a dense block consists of $N_l$ layers, termed Dense Layers, which share information from all the preceding layers connected to the current layer through skip connections (Huang et al. 2017). Fig. 3 shows the structure of a dense block with $N_l = 6$ dense layers. Each dense layer consists of 1×1 and 3×3 convolutions followed by Batch Normalization (Ioffe and Szegedy 2015) and a Rectified Linear Unit (ReLU). The output of the $l$-th dense layer ($X_l$) in a dense block can be written as:

$$X_l = [X_0, \ldots, X_{l-1}], \qquad (4)$$

where $[\,\cdot\,]$ represents concatenation of the features produced by layers $0, \ldots, l-1$.
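To make the dense feature fusion concrete, the following is a minimal PyTorch sketch of a dense layer and a dense block following the description above (1×1 and 3×3 convolutions with Batch Normalization and ReLU, and concatenation as in Eq. 4). The layer ordering, bottleneck width, growth rate, and channel sizes here are illustrative assumptions, not the exact configuration used in DSGD.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One dense layer: 1x1 and 3x3 convolutions with BN and ReLU
    (ordering and bottleneck width are illustrative assumptions)."""
    def __init__(self, in_channels: int, growth_rate: int = 32):
        super().__init__()
        bottleneck = 4 * growth_rate  # common DenseNet choice; an assumption here
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, growth_rate, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(growth_rate),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

class DenseBlock(nn.Module):
    """Nl dense layers whose inputs are the concatenation of all
    preceding feature maps, as in Eq. 4."""
    def __init__(self, in_channels: int, num_layers: int = 6, growth_rate: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # Each layer sees [X_0, ..., X_{l-1}] via channel-wise concatenation.
            out = layer(torch.cat(features, dim=1))
            features.append(out)
        return torch.cat(features, dim=1)

# Example: a dense block with Nl = 6 layers.
if __name__ == "__main__":
    block = DenseBlock(in_channels=64, num_layers=6, growth_rate=32)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64 + 6*32, 56, 56])
```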
### Global Grasp Network (GGN)

Our GGN structure is composed of two sub-networks as shown in Fig. 1-A: a Global Grasp Prediction Network (GGPN) for generating the grasp pose ($[x_g, y_g, w_g, h_g, \theta_g]$) and a Grasp Evaluation Network (GEN) for predicting the grasp confidence ($\rho_g$).

Figure 1: Overview of our DSGD architecture. Given an image as input, DSGD uses a base network to extract features which are fed into a Global Grasp Network (A), a Pixel Grasp Network (B), and a Region Grasp Network (C), to produce grasp candidates. The global model produces a single grasp per image and uses an independent Grasp Evaluation Network (D) to produce the grasp confidence. The pixel-level model uses a fully convolutional network and produces grasps at every pixel. The region-level model uses a Salient Region Network (E) to generate region proposals which are fed into a Region Grasp Prediction Network (F) to produce salient region segmentations and their corresponding grasp poses. During inference, DSGD switches between the GGN, the PGN, and the RGN models based on their confidence scores.

Figure 2: Left: A grasp image is generated by replacing the blue channel of the input image with a binary rectangle image produced from a grasp pose. Right: Our Grasp Evaluation Network is trained using grasp images labelled in terms of valid (1) and invalid (0) grasp rectangles.

The GGPN structure is composed of a dense block, an averaging operation, a 4-dimensional fully connected layer for predicting the parameters $[x_g, y_g, w_g, h_g]$, and an $N_\theta$-dimensional fully connected layer for predicting $\theta_g$. The GEN structure is similar to GGPN except that GEN has a single 2-dimensional fully connected layer for predicting $\rho_g$. The input to GEN is a grasp image, which is produced by replacing the blue channel of the input image with a binary rectangle image generated from the output of GGPN, as shown in Fig. 2. Let $R_{gi} = [x_{gi}, y_{gi}, w_{gi}, h_{gi}]$, $\theta_{gi}$, and $\rho_{gi}$ denote the predicted values of a global grasp for the $i$-th image. We define the losses of the GGPN and the GEN models over $K$ images as:

$$L_{ggpn} = \sum_{i \in K} \big[ (1 - \lambda_1)\, L_{reg}(R_{gi}, R^{*}_{gi}) + \lambda_1\, L_{cls}(\theta_{gi}, \theta^{*}_{gi}) \big], \qquad (5)$$

$$L_{gen} = \sum_{i \in K} L_{cls}(\rho_{gi}, \rho^{*}_{gi}), \qquad (6)$$

where $R^{*}_{gi}$, $\theta^{*}_{gi}$, and $\rho^{*}_{gi}$ represent the ground truths. The term $L_{reg}$ is a regression loss defined as:

$$L_{reg}(R, R^{*}) = \|R - R^{*}\| / \|R^{*}\|_2. \qquad (7)$$

The term $L_{cls}$ is a classification loss defined as:

$$L_{cls}(x, c) = -\sum_{c} Y_{x,c} \log(p_{x,c}), \qquad (8)$$

where $Y_{x,c}$ is a binary indicator of whether class label $c$ is the correct classification for observation $x$, and $p_{x,c}$ is the predicted probability that observation $x$ belongs to class $c$.
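For illustration, here is a minimal PyTorch sketch of the joint regression-classification objective of Eq. 5, combining the normalized regression loss of Eq. 7 on the rectangle parameters with a cross-entropy loss over the $N_\theta = 20$ angular bins. The tensor shapes, the batch averaging, and the way ground-truth angles are mapped to bins are assumptions made only for the sake of a runnable example; they are not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

N_THETA = 20    # number of angular bins (as stated in the Problem Formulation)
LAMBDA_1 = 0.4  # classification/regression trade-off (value reported in the paper)

def reg_loss(pred_rect: torch.Tensor, gt_rect: torch.Tensor) -> torch.Tensor:
    """Eq. 7: ||R - R*|| / ||R*||_2, averaged over the batch (averaging is an assumption)."""
    num = torch.norm(pred_rect - gt_rect, dim=1)
    den = torch.norm(gt_rect, dim=1).clamp(min=1e-6)  # guard against division by zero
    return (num / den).mean()

def angle_to_bin(theta_deg: torch.Tensor) -> torch.Tensor:
    """Map an angle in [0, 180) degrees to one of N_THETA bins (an assumed discretization)."""
    bin_width = 180.0 / N_THETA
    return (theta_deg % 180.0 / bin_width).long().clamp(max=N_THETA - 1)

def ggpn_loss(pred_rect: torch.Tensor,          # [B, 4] predicted (x, y, w, h)
              pred_theta_logits: torch.Tensor,  # [B, N_THETA] angle-bin logits
              gt_rect: torch.Tensor,            # [B, 4] ground-truth rectangle
              gt_theta_deg: torch.Tensor        # [B] ground-truth angle in degrees
              ) -> torch.Tensor:
    """Eq. 5: (1 - lambda1) * L_reg + lambda1 * L_cls over the angular bins."""
    l_reg = reg_loss(pred_rect, gt_rect)
    l_cls = F.cross_entropy(pred_theta_logits, angle_to_bin(gt_theta_deg))
    return (1.0 - LAMBDA_1) * l_reg + LAMBDA_1 * l_cls

# Example usage with random tensors standing in for network outputs.
if __name__ == "__main__":
    B = 8
    loss = ggpn_loss(torch.rand(B, 4), torch.randn(B, N_THETA),
                     torch.rand(B, 4), torch.rand(B) * 180.0)
    print(loss.item())
```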
Figure 3: Detailed architecture of our DSGD with a DenseNet (Huang et al. 2017) as its base network.

### Region Grasp Network (RGN)

The RGN structure is composed of two sub-networks as shown in Fig. 1-C: a Salient Region Network (SRN) for producing salient region proposals, and a Region Grasp Prediction Network (RGPN) for producing salient region segmentations and their corresponding grasp poses.

**Salient Region Network (SRN):** Here, we use the features extracted from the base network to generate proposals defined by the location ($x_{sr}$, $y_{sr}$), width ($w_{sr}$), height ($h_{sr}$), and confidence ($\rho_{sr}$) of non-oriented rectangles which encompass salient parts of the image (e.g., handles, extrusions). For this, we first generate a fixed number of rectangles using the Region of Interest (ROI) method of (He et al. 2017). Next, we use the features from the base network and optimize a Mean Squared Loss on the rectangle coordinates and a Cross Entropy Loss on the rectangle confidence scores. Let $T_i = [x_{sr}, y_{sr}, w_{sr}, h_{sr}]$ denote the parameters of the $i$-th predicted rectangle, and $\rho_{sri}$ denote the probability of whether it belongs to a graspable region or a non-graspable region. The loss of SRN over $I$ proposals is given by:

$$L_{srn} = \sum_{i \in I} \big[ (1 - \lambda_2)\, L_{reg}(T_i, T^{*}_i) + \lambda_2\, L_{cls}(\rho_{sri}, \rho^{*}_{sri}) \big], \qquad (9)$$

where $\rho^{*}_{sri} = 0$ for a non-graspable region and $\rho^{*}_{sri} = 1$ for a graspable region. The term $T^{*}_i$ represents the ground-truth candidate corresponding to $\rho^{*}_{sri}$.

**Region Grasp Prediction Network (RGPN):** Here, we produce salient region segmentations and their corresponding grasp poses using the proposals predicted by SRN ($k = 50$ in our implementation). For this, we crop features from the output feature maps of the base network using the Region of Interest (ROI) pooling method of (He et al. 2017). The cropped features are then fed to Dense block 4, which produces feature maps of $k \times 1664 \times 7 \times 7$ dimensions as shown in Fig. 3. These feature maps are then squeezed to $k \times 1664$ dimensions through global average pooling, and fed to a segmentation branch (with two up-sampling layers $SR_1 \in \mathbb{R}^{k \times 256 \times 14 \times 14}$ and $SR_2 \in \mathbb{R}^{k \times 14 \times 14 \times n}$), which produces a segmentation mask for each salient region as shown in Fig. 1-F. The squeezed features are also fed to three fully connected layers $fc_R \in \mathbb{R}^{k \times 4 \times n}$, $fc_\theta \in \mathbb{R}^{k \times N_\theta \times n}$, and $fc_\rho \in \mathbb{R}^{k \times 2}$, which produce class-specific grasps for the segmented regions. Let $R_{ri} = [x_{ri}, y_{ri}, w_{ri}, h_{ri}]$, $\theta_{ri}$, and $\rho_{ri}$ denote the predicted values of a region-level grasp for the $i$-th salient region, and let $S_i \in \mathbb{R}^{14 \times 14 \times n}$ denote the corresponding predicted segmentation. The loss of the RGPN model is defined over $I$ salient regions as:

$$L_{rgpn} = \sum_{i \in I} \big( L_{reg}(R_{ri}, R^{*}_{ri}) + \lambda_3\, L_{cls}(\theta_{ri}, \theta^{*}_{ri}) + \lambda_3\, L_{cls}(\rho_{ri}, \rho^{*}_{ri}) + L_{seg}(S_i, S^{*}_i) \big), \qquad (10)$$

where $R^{*}$, $\theta^{*}$, $\rho^{*}$, and $S^{*}$ represent the ground truths. The term $\rho^{*}_{ri} = 0$ for a non-graspable region and $\rho^{*}_{ri} = 1$ for a graspable region.
The term $L_{seg}$ represents a pixel-wise binary cross-entropy loss used to learn segmentations of salient regions. It is given by:

$$L_{seg}(S_i, S^{*}_i) = -\sum_{j \in S_i} \big( y_j \log(\hat{y}_j) + (1 - y_j) \log(1 - \hat{y}_j) \big), \qquad (11)$$

where $y_j$ represents the ground-truth value and $\hat{y}_j$ denotes the predicted value for a pixel $j \in S_i$. Learning segmentation-based grasp poses enables the model to produce grasp confidence maps where the confidence scores follow Gaussian distributions with the highest confidence at the center of the segmented region. This produces region-centred grasps and therefore better localization results. The total loss of our RGN model is given by:

$$L_{rgn} = L_{srn} + L_{rgpn}. \qquad (12)$$

The terms $\lambda_1$, $\lambda_2$, and $\lambda_3$ in Eq. 5, Eq. 9, and Eq. 10 control the relative influence of classification over regression in the combined objective functions (in our experiments, we set $\lambda_1$, $\lambda_2$, and $\lambda_3$ to 0.4).

### Pixel Grasp Network (PGN)

Here, we feed the features extracted from the base network into Dense block 7, followed by a group of up-sampling layers which increase the spatial resolution of the features and produce feature maps of the size of the input image. These feature maps encode the parameters of the grasp pose at every pixel of the image. Let $M_{xyi}$, $M_{wi}$, $M_{hi}$, and $M_{\theta i}$ denote the predicted feature maps of the $i$-th image. We define the loss of the PGN model over $K$ images as:

$$L_{pgn} = \sum_{i \in K} \big( L_{reg}(M_{xyi}, M^{*}_{xyi}) + L_{reg}(M_{wi}, M^{*}_{wi}) + L_{reg}(M_{hi}, M^{*}_{hi}) + L_{cls}(M_{\theta i}, M^{*}_{\theta i}) \big), \qquad (13)$$

where $M^{*}_{xyi}$, $M^{*}_{wi}$, $M^{*}_{hi}$, and $M^{*}_{\theta i}$ represent the ground truths.

### Training and Implementation

For the global model, we trained the GGPN and the GEN sub-networks independently. For the region-based and the pixel-based models, we trained the networks in an end-to-end manner. Specifically, we initialized the weights of the base network with weights pre-trained on ImageNet. For the Dense blocks (4-7), the fully connected layers of GGPN, GEN, SRN, and RGPN, and the fully convolutional layers of PGN, we initialized the weights from zero-mean Gaussian distributions (standard deviation set to 0.01, biases set to 0), and trained the networks using the loss functions in Eq. 5, Eq. 6, Eq. 9, Eq. 12, and Eq. 13, respectively, for 150 epochs. The starting learning rate was set to 0.01 and divided by 10 at 50% and 75% of the total number of epochs. The parameter decay was set to 0.0005 on the weights and biases. Our implementation is based on the Torch library (Paszke et al. 2017). Training was performed using the ADAM optimizer and data parallelism on four Nvidia Tesla K80 GPU devices.

For grasp selection during inference, DSGD selects the most confident region-level grasp if its confidence score is greater than a confidence threshold ($\delta_{rgn}$); otherwise, DSGD switches to the PGN branch and selects the most confident pixel-level grasp. If the most confident pixel-level grasp has a confidence score less than $\delta_{pgn}$, DSGD switches to the GGN branch and selects the global grasp as the output. Experimentally, we found that $\delta_{rgn} = 0.95$ and $\delta_{pgn} = 0.90$ produced the best grasp detection results.
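The following is a minimal Python sketch of this confidence-based switching rule, assuming each branch has already produced candidate grasps with confidence scores. The `Grasp` data structure and the helper name `select_grasp` are illustrative; only the thresholds and the switching order come from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

DELTA_RGN = 0.95  # region-level confidence threshold (value reported in the paper)
DELTA_PGN = 0.90  # pixel-level confidence threshold (value reported in the paper)

@dataclass
class Grasp:
    x: float      # rectangle centroid (pixels)
    y: float
    w: float      # rectangle width (pixels)
    h: float      # rectangle height (pixels)
    theta: float  # orientation (degrees)
    conf: float   # grasp confidence / quality score

def select_grasp(rgn_grasps: List[Grasp],
                 pgn_grasps: List[Grasp],
                 ggn_grasp: Grasp) -> Grasp:
    """Switch between RGN, PGN, and GGN outputs based on their confidences."""
    best_rgn: Optional[Grasp] = max(rgn_grasps, key=lambda g: g.conf, default=None)
    if best_rgn is not None and best_rgn.conf > DELTA_RGN:
        return best_rgn

    best_pgn: Optional[Grasp] = max(pgn_grasps, key=lambda g: g.conf, default=None)
    if best_pgn is not None and best_pgn.conf >= DELTA_PGN:
        return best_pgn

    # Fall back to the single global grasp predicted by GGN.
    return ggn_grasp
```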
## Experiments

We evaluated DSGD for grasp detection on the popular Cornell grasp dataset (Lenz, Lee, and Saxena 2015), which contains 885 RGB-D images of 240 objects. The ground truth is available in the form of grasp rectangles. We also evaluated DSGD for multi-object grasp detection in new environments. For this, we used the multi-object dataset of (Asif, Tang, and Harrer 2018b), which consists of 6896 RGB-D images of indoor scenes containing multiple objects placed in different locations and orientations. The dataset was generated using an extended version of the scene labeling framework of (Asif, Bennamoun, and Sohel 2017a) and (Asif, Bennamoun, and Sohel 2016).

For evaluation, we used the object-wise splitting criteria (Lenz, Lee, and Saxena 2015) for both the Cornell grasp dataset and our multi-object dataset. Object-wise splitting splits the object instances randomly into train and test subsets (i.e., the training set and the test set do not share any images from the same object). This strategy evaluates how well the model generalizes to unseen objects. For comparison purposes, we followed the procedure of (Redmon and Angelova 2015) and substituted the blue channel with the depth image, where the depth values are normalized between 0 and 255. We also performed data augmentation through random rotations.

For grasp evaluation, we used the rectangle metric proposed in (Jiang, Moseson, and Saxena 2011). A grasp is considered to be correct if: i) the difference between the predicted grasp angle and the ground truth is less than 30°, and ii) the Jaccard index of the predicted grasp and the ground truth is higher than 25%. The Jaccard index for a predicted rectangle $R$ and a ground-truth rectangle $R^{*}$ is defined as:

$$J(R^{*}, R) = \frac{|R^{*} \cap R|}{|R^{*} \cup R|}. \qquad (14)$$
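For completeness, a small Python sketch of this rectangle metric is given below, using Shapely to compute the overlap of oriented rectangles. The corner construction from (x, y, w, h, θ) and the dictionary-based grasp format are assumptions made for illustration.

```python
import math
from shapely.geometry import Polygon

def rect_polygon(x: float, y: float, w: float, h: float, theta_deg: float) -> Polygon:
    """Build an oriented rectangle centred at (x, y) with width w and height h,
    rotated by theta degrees (parameterization assumed for illustration)."""
    t = math.radians(theta_deg)
    dx, dy = w / 2.0, h / 2.0
    corners = []
    for cx, cy in [(-dx, -dy), (dx, -dy), (dx, dy), (-dx, dy)]:
        corners.append((x + cx * math.cos(t) - cy * math.sin(t),
                        y + cx * math.sin(t) + cy * math.cos(t)))
    return Polygon(corners)

def is_correct_grasp(pred: dict, gt: dict,
                     angle_tol: float = 30.0,
                     jaccard_thresh: float = 0.25) -> bool:
    """Rectangle metric: angle difference < 30 degrees and Jaccard index > 25% (Eq. 14)."""
    # Angle difference modulo 180 degrees (grasp rectangles are symmetric under a half turn).
    diff = abs(pred["theta"] - gt["theta"]) % 180.0
    diff = min(diff, 180.0 - diff)
    if diff >= angle_tol:
        return False
    p = rect_polygon(pred["x"], pred["y"], pred["w"], pred["h"], pred["theta"])
    g = rect_polygon(gt["x"], gt["y"], gt["w"], gt["h"], gt["theta"])
    union = p.union(g).area
    if union == 0:
        return False
    jaccard = p.intersection(g).area / union
    return jaccard > jaccard_thresh

# Example: identical rectangles trivially satisfy the metric.
gt = {"x": 100, "y": 80, "w": 60, "h": 30, "theta": 45.0}
print(is_correct_grasp(gt, gt))  # True
```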
### Single-Object Grasp Detection

Table 1 shows that our DSGD achieved the best grasp detection accuracy on the Cornell grasp dataset compared to the other methods. We attribute this improvement to two main reasons. First, the proposed hierarchical grasp generation enables DSGD to produce grasps and their confidence scores from both global and local contexts. This enables DSGD to effectively recover from the errors of the global (Kumra and Kanan 2017) or local methods (Guo et al. 2017). Second, the use of dense feature fusion enables the networks to learn more discriminative features compared to the models used in (Kumra and Kanan 2017; Guo et al. 2017). Fig. 4 shows grasps produced by our DSGD on some images of the Cornell grasp dataset.

Figure 4: Grasp detection results of our DSGD (B) on some challenging objects of the Cornell grasp dataset. Ground truths are shown in (A).

Table 1: Grasp evaluation on the Cornell grasp dataset in terms of average grasp detection accuracy.

| Method | Accuracy (%) |
| --- | --- |
| (Jiang et al. 2011) Fast search | 58.3 |
| (Lenz et al. 2015) Deep learning | 75.6 |
| (Redmon et al. 2015) MultiGrasp | 87.1 |
| (Kumra et al. 2017) ResNets | 88.9 |
| (Guo et al. 2017) Hybrid-Net | 89.1 |
| (Asif et al. 2018) GraspNet | 90.2 |
| (Chu et al. 2018) Multi-grasp | 96.1 |
| (this work) DSGD | 97.7 |

Table 2: Comparison of the individual networks of the proposed DSGD in terms of grasp accuracy (%) on the Cornell grasp dataset.

| Base network | Global model (GGN) | Local model (PGN) | Local model (RGN) | DSGD |
| --- | --- | --- | --- | --- |
| ResNet50 | 86.8 | 94.1 | 96.3 | 96.7 |
| DenseNet | 88.9 | 95.4 | 97.0 | 97.7 |

**Significance of Combining Global and Local Models:** Table 2 shows a quantitative comparison of the individual models of our DSGD in terms of grasp accuracy on the Cornell grasp dataset, for different CNN structures as the base network. The base networks we tested include ResNets (He et al. 2016) and DenseNets (Huang et al. 2017). Table 2 shows that on average the local models (PGN and RGN) produced higher grasp accuracy compared to the global model (GGN). The global and the local models have their own pros and cons. The global model uses the entire image information and learns an average of the ground-truth grasps. Although the global grasps are accurate for most of the objects, they tend to lie in the middle of circular symmetric objects, resulting in localization errors as highlighted in red in Fig. 5-B. The PGN model, on the other hand, operates at the pixel level and produces correct grasp localizations for these challenging objects as shown in Fig. 5-C. However, the pixel-based model is susceptible to outliers in the position prediction maps, which result in localization errors as highlighted in red in Fig. 5-D. Our RGN model works at a semi-global level while maintaining large receptive fields. It predicts grasps using segmentation information of salient parts of the image, which highly likely encode graspable parts of an object, as shown in Fig. 5-F. Consequently, RGN is less susceptible to pixel-level outliers (see Fig. 6-B and Fig. 6-C) and does not suffer from global averaging errors, as shown in Fig. 5-G. Furthermore, the grasp confidences produced by the RGN model follow Gaussian distributions where the confidence scores are the highest at the center of the segmented salient regions (see Fig. 6-E). This results in grasps which are more stable compared to the PGN model, where the predicted grasp confidences follow uniform distributions along the surfaces of the objects, resulting in several unstable grasps as shown in Fig. 6-D. Our DSGD takes advantage of both the global context and the local predictions and produces highly accurate grasps as shown in Fig. 5-H.

Figure 5: Qualitative comparison of grasps produced by the proposed GGN, PGN, RGN, and DSGD models (panels include A: ground truth; B: GGN; C: PGN localization; E: RGN localization; F: RGN segmented regions). The results show that our DSGD effectively recovers from the errors of the individual models. Incorrect predictions are highlighted in red.

Figure 6: Comparison of PGN and RGN predictions (A: scene; B: PGN predictions; C: RGN predictions; D: PGN predictions, zoomed; E: RGN predictions, zoomed). The intensity of the prediction maps represents grasp confidences. The RGN predictions (C) are less prone to pixel-level outliers (e.g., due to noise) compared to the PGN predictions (B). Furthermore, the RGN predictions follow Gaussian distributions where the confidence values are the highest at the center of the segmented salient regions (E), producing more stable grasps compared to the PGN predictions, where confidence values follow uniform distributions along the surfaces of the objects (D).

**Ablative Study of the Proposed DSGD:** The growth rate parameter W refers to the number of output feature maps of each dense layer and therefore controls the depth of the network. Table 3 shows that a large growth rate and wider dense blocks (i.e., more layers in the dense blocks) increase the average accuracy from 96.9% to 97.7% at the expense of lower runtime speed due to the overhead from additional channels. Table 3 also shows that a lite version of our detector (DSGD-lite) can run at 6 fps, making it suitable for real-time applications.

Table 3: Ablation study of our DSGD (with DenseNet as the base network) on the Cornell grasp dataset in terms of the growth rate (W) and the number of dense layers (Nl) of the GGN, PGN, and RGN sub-networks.

| Model | RGN: W, Nl1, Nl2, Nl3, Nl5 | PGN: W, Nl1, Nl2, Nl3, Nl7 | GGN: W, Nl1, Nl2, Nl3, Nl4 | Accuracy (%) | Speed (fps) |
| --- | --- | --- | --- | --- | --- |
| DSGD-lite | 32, 6, 12, 24, 16 | 32, 6, 12, 24, 16 | 32, 6, 12, 24, 16 | 96.9 | 6 |
| DSGD-A | 32, 6, 12, 32, 32 | 32, 6, 12, 32, 32 | 32, 6, 12, 24, 16 | 97.1 | 5 |
| DSGD-B | 48, 6, 12, 36, 24 | 32, 6, 12, 32, 32 | 32, 6, 12, 24, 16 | 97.3 | 4 |
| DSGD-C | 48, 6, 12, 36, 24 | 32, 6, 12, 48, 32 | 32, 6, 12, 24, 16 | 97.5 | 4 |
| DSGD-D | 48, 6, 12, 36, 24 | 32, 6, 12, 24, 16 | 32, 6, 12, 24, 16 | 97.7 | 4 |
| DSGD-E | 48, 6, 12, 36, 24 | 48, 6, 12, 36, 24 | 32, 6, 12, 24, 16 | 97.4 | 3 |

### Multi-Object Grasp Detection

Table 4 shows our grasp evaluation on the multi-object dataset. The results show that on average our DSGD improves grasp detection accuracy by 9% and 2.4% compared to the pixel-level and region-level models, respectively. Fig. 7 shows qualitative results on images of our multi-object dataset. The results show that our DSGD successfully generates correct grasps for multiple objects in real-world scenes containing background clutter. The generalization capability of our model is attributed to the proposed hierarchical image-to-grasp mappings, where the proposed region-level network and the proposed pixel-level network learn to associate grasp poses with salient regions and salient pixels in the image data, respectively. These salient regions and pixels encode object graspable parts (e.g., boundaries, corners, handles, extrusions) which are generic (i.e., have similar appearance and structural characteristics) across a large variety of objects generally found in indoor environments. Consequently, the proposed hierarchical mappings learned by our models successfully generalize to new object instances during testing. This justifies the practicality of our DSGD for real-world robotic grasping.

Figure 7: Grasp evaluation on our multi-object dataset. The localization outputs of our pixel-level (PGN) and region-level (RGN) grasp models are shown in (A) and (B), respectively. The intensity of the prediction maps A and B represents grasp confidences. The segmented regions produced by our RGN model are shown in (C). Note that we only show grasps with confidence scores higher than 90% in (D). Our grasp detection results on Kinect video streams are available at: https://youtu.be/tA2qgtbTT98
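As an aside, here is a hedged NumPy sketch of how a single grasp could be read off pixel-level prediction maps such as the PGN localization outputs in Fig. 7-A. It assumes the position map $M_{xy}$ doubles as the per-pixel confidence map and that angles are decoded from the $N_\theta$-channel orientation map by taking the bin center; these decoding details are assumptions, not the paper's exact procedure.

```python
import numpy as np

N_THETA = 20  # number of angular bins

def decode_pixel_grasp(m_xy: np.ndarray,     # [H, W] position / confidence map
                       m_w: np.ndarray,      # [H, W] grasp width map
                       m_h: np.ndarray,      # [H, W] grasp height map
                       m_theta: np.ndarray   # [N_THETA, H, W] angle-bin scores
                       ) -> dict:
    """Pick the most confident pixel and read the grasp parameters at that location."""
    y, x = np.unravel_index(np.argmax(m_xy), m_xy.shape)
    theta_bin = int(np.argmax(m_theta[:, y, x]))
    theta_deg = (theta_bin + 0.5) * 180.0 / N_THETA  # bin center; discretization is assumed
    return {
        "x": int(x), "y": int(y),
        "w": float(m_w[y, x]), "h": float(m_h[y, x]),
        "theta": theta_deg,
        "conf": float(m_xy[y, x]),
    }
```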
## Robotic Grasping

Our robotic grasping setup consists of a Kinect for image acquisition and a 7-degrees-of-freedom robotic arm which is tasked to grasp, lift, and take away the objects placed within the robot workspace. For each image, DSGD generates multiple grasp candidates as shown in Fig. 8. For grasp execution, we select a random candidate which is located within the robot workspace and has a confidence greater than 90%. A robotic grasp is considered successful if the robot grasps the target object (verified through force sensing in the gripper), holds it in the air for 3 seconds, and takes it away from the robot workspace. The objects are placed in random positions and orientations to remove bias related to the object pose. Table 4 shows the success rates computed over 200 grasping trials. The results show that we achieved a grasp success rate of 90% with DenseNet as the base network. Some failure cases include objects with non-planar grasping surfaces (e.g., a brush). However, this can be improved by multi-finger grasps. We leave this for future work, as our robotic arm only supports parallel grasps.

Table 4: Grasp evaluation on our multi-object dataset.

| Base network | Grasp accuracy (PGN) | Grasp accuracy (RGN) | Grasp accuracy (DSGD) | Robotic grasp success |
| --- | --- | --- | --- | --- |
| ResNet | 86.5% | 93.4% | 95.8% | 89% |
| DenseNet | 87.4% | 94.7% | 97.2% | 90% |

Figure 8: Experimental setting for real-world robotic grasping (robot view, Kinect view, and the corresponding pixel-level and region-level localization outputs). See the video of our experiments at: https://youtu.be/BndQ8vcNzs
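Below is a minimal sketch of the candidate filtering used for grasp execution as described above (a workspace check plus the 90% confidence cut before random selection). The axis-aligned workspace box, the dictionary-based candidate format, and the helper name are illustrative assumptions; the real setup may represent the workspace and candidates differently.

```python
import random
from typing import List, Optional, Tuple

def pick_executable_grasp(candidates: List[dict],
                          workspace: Tuple[float, float, float, float],
                          min_conf: float = 0.90,
                          seed: Optional[int] = None) -> Optional[dict]:
    """Select a random grasp candidate that lies inside the robot workspace and
    has confidence greater than min_conf (the 90% threshold used in the paper).

    Each candidate is a dict with at least 'x', 'y' (workspace coordinates, an
    assumed convention) and 'conf'. The workspace is an axis-aligned box
    (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = workspace
    feasible = [g for g in candidates
                if g["conf"] > min_conf
                and x_min <= g["x"] <= x_max
                and y_min <= g["y"] <= y_max]
    if not feasible:
        return None
    rng = random.Random(seed)
    return rng.choice(feasible)
```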
## Conclusion and Future Work

We presented the Densely Supervised Grasp Detector (DSGD), which generates grasps and their confidence scores at different image hierarchical levels (i.e., global-, region-, and pixel-levels). Experiments show that our proposed hierarchical grasp generation produces superior grasp accuracy compared to the state of the art on the Cornell grasp dataset. Our evaluations on videos from Kinect and robotic grasping experiments show the capability of our DSGD for producing stable grasps for unseen objects in new environments. In the future, we plan to reduce the computational burden of our DSGD through parameter pruning for low-powered GPU devices.

## References

Asif, U.; Bennamoun, M.; and Sohel, F. 2016. Simultaneous dense scene reconstruction and object labeling. In ICRA, 2255-2262. IEEE.

Asif, U.; Bennamoun, M.; and Sohel, F. 2017a. A multi-modal, discriminative and spatially invariant CNN for RGB-D object labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Asif, U.; Bennamoun, M.; and Sohel, F. A. 2017b. RGB-D object recognition and grasp detection using hierarchical cascaded forests. IEEE Transactions on Robotics.

Asif, U.; Tang, J.; and Harrer, S. 2018a. EnsembleNet: Improving grasp detection using an ensemble of convolutional neural networks. In British Machine Vision Conference (BMVC).

Asif, U.; Tang, J.; and Harrer, S. 2018b. GraspNet: An efficient convolutional neural network for real-time grasp detection for low-powered devices. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, 4875-4882.

Chu, F.-J.; Xu, R.; and Vela, P. A. 2018. Real-world multi-object, multi-grasp detection. IEEE Robotics and Automation Letters 3(4):3355-3362.

Guo, D.; Sun, F.; Liu, H.; Kong, T.; Fang, B.; and Xi, N. 2017. A hybrid deep architecture for robotic grasp detection. In ICRA, 1609-1614. IEEE.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV, 2980-2988. IEEE.

Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. In CVPR.

Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448-456.

Jiang, Y.; Moseson, S.; and Saxena, A. 2011. Efficient grasping from RGBD images: Learning using a new rectangle representation. In ICRA, 3304-3311. IEEE.

Johns, E.; Leutenegger, S.; and Davison, A. J. 2016. Deep learning a grasp function for grasping under gripper pose uncertainty. In IROS, 4461-4468. IEEE.

Kumra, S., and Kanan, C. 2017. Robotic grasp detection using deep convolutional neural networks. In IROS. IEEE.

Lenz, I.; Lee, H.; and Saxena, A. 2015. Deep learning for detecting robotic grasps. IJRR 34(4-5):705-724.

Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; and Quillen, D. 2016. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. IJRR 0278364917710318.

Mahler, J.; Pokorny, F. T.; Hou, B.; Roderick, M.; Laskey, M.; Aubry, M.; Kohlhoff, K.; Kröger, T.; Kuffner, J.; and Goldberg, K. 2016. Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a multi-armed bandit model with correlated rewards. In ICRA, 1957-1964. IEEE.
Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J. A.; and Goldberg, K. 2017. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. In Robotics: Science and Systems (RSS).

Morrison, D.; Corke, P.; and Leitner, J. 2018. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. arXiv preprint arXiv:1804.05172.

Myers, A.; Teo, C. L.; Fermüller, C.; and Aloimonos, Y. 2015. Affordance detection of tool parts from geometric features. In ICRA, 1374-1381.

Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.

Pinto, L., and Gupta, A. 2016. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, 3406-3413. IEEE.

Redmon, J., and Angelova, A. 2015. Real-time grasp detection using convolutional neural networks. In ICRA, 1316-1322. IEEE.

Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 91-99.

Saxena, A.; Driemeyer, J.; and Ng, A. Y. 2008. Robotic grasping of novel objects using vision. IJRR 27(2):157-173.

Varley, J.; Weisz, J.; Weiss, J.; and Allen, P. 2015. Generating multi-fingered robotic grasps via deep learning. In IROS, 4415-4420. IEEE.

Zeng, A.; Song, S.; Yu, K.-T.; Donlon, E.; Hogan, F. R.; Bauza, M.; Ma, D.; Taylor, O.; Liu, M.; Romo, E.; et al. 2017. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. arXiv preprint arXiv:1710.01330.

Zeng, A.; Song, S.; Yu, K.-T.; Donlon, E.; Hogan, F. R.; Bauza, M.; Ma, D.; Taylor, O.; Liu, M.; Romo, E.; Fazeli, N.; Alet, F.; Dafle, N. C.; Holladay, R.; Morona, I.; Nair, P. Q.; Green, D.; Taylor, I.; Liu, W.; Funkhouser, T.; and Rodriguez, A. 2018. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In ICRA.