DocParser: Hierarchical Document Structure Parsing from Renderings

Johannes Rausch,1 Octavio Martinez,1 Fabian Bissig,1 Ce Zhang,1 Stefan Feuerriegel2
1 Department of Computer Science, ETH Zurich
2 Department of Management, Technology, and Economics, ETH Zurich
johannes.rausch@inf.ethz.ch, octaviom@student.ethz.ch, fbissig@student.ethz.ch, ce.zhang@inf.ethz.ch, sfeuerriegel@ethz.ch

Abstract

Translating renderings (e.g., PDFs, scans) into hierarchical document structures is extensively demanded in the daily routines of many real-world applications. However, a holistic, principled approach to inferring the complete hierarchical structure of documents is missing. As a remedy, we developed DocParser: an end-to-end system for parsing the complete document structure, including all text elements, nested figures, tables, and table cell structures. Our second contribution is a dataset for evaluating hierarchical document structure parsing. Our third contribution is a scalable learning framework for settings where domain-specific data are scarce, which we address by a novel approach to weak supervision that significantly improves document structure parsing performance. Our experiments confirm the effectiveness of our proposed weak supervision: compared to the baseline without weak supervision, it improves the mean average precision for detecting document entities by 39.1% and improves the F1 score of classifying hierarchical relations by 35.8%.

1 Introduction

The structural and layout information in a document can be a rich source of information that facilitates Natural Language Processing (NLP) tasks (e.g., information extraction). Over the years, the NLP community has developed a range of techniques to detect, understand, and take advantage of document structures (Hurst and Nasukawa 2000; Chen, Tsai, and Tsai 2000; Tengli, Yang, and Ma 2004; Luong, Nguyen, and Kan 2012; Govindaraju, Zhang, and Ré 2013; Katti et al. 2018; Schäfer et al. 2011; Schäfer and Weitz 2012; Garncarek et al. 2020). However, structural information in documents is becoming increasingly challenging to obtain: many file formats that are prevalent today are rendered without structural information. Prominent examples are PDF documents: this file format benefits from portability and immutability, yet it is flat in the sense that it stores all content as isolated entities (e.g., combinations of characters and positions) and, thus, hierarchical information is lacking. As such, the structure behind figures and especially tables is discarded and thus no longer available to computerized analyses in NLP.

In contrast, file formats such as XML or JSON naturally encode hierarchical document structures among textual entities. Hence, techniques are required that convert renderings into structured, textual document representations to enable joint inference between text, layout, and other document structures. Earlier attempts at structure parsing on documents focused on a subset of simpler tasks, such as segmentation of text regions (Antonacopoulos et al. 2009), locating tables (Zanibbi, Blostein, and Cordy 2004; Embley et al. 2006), or parsing them (Schreiber et al. 2018), but not parsing complete document structures. However, document structures are required as a representation for many downstream tasks in NLP.
For instance, recent efforts in the NLP community (Katti et al. 2018; Apostolova and Tomuro 2014; Liu et al. 2019) have shown that utilizing 2D document information, e.g., character and word positions, can be an effective way to improve upon standard NLP tasks such as information extraction.

A holistic, principled approach for inferring the complete hierarchical structure from documents is missing. On the one hand, such a task is nontrivial due to the complexity of documents, particularly their deeply nested structures. For instance, nested tables are fairly easy to recognize for human readers, yet detecting them is known to impose computational hurdles (cf. Schreiber et al. 2018). On the other hand, efficient learning is prevented as large-scale training sets are lacking (cf. Arif and Shafait 2018; Schreiber et al. 2018). Notably, prior datasets are limited to table structures (Göbel et al. 2013; Rice, Jenkins, and Nartker 1995) and not complete document structures. Needless to say, complex structures also make the labeling process significantly more costly (Wang, Phillips, and Haralick 2004). Therefore, an effective implementation that makes only scarce use of labeled data is demanded.

This work focuses on parsing the hierarchical document structure from renderings. We develop an end-to-end system for inferring the complete document structure (see Figure 1). This includes all entities (e.g., text, bibliography regions, figures, equations, headings, tables, and table cells), as well as the hierarchical relations among them. We specifically adapt to settings in practice that suffer from data scarcity. For this purpose, we propose a novel learning framework for scalable weak supervision. It is intentionally tailored to the specific needs of parsing document renderings; that is, we create weakly supervised labels by utilizing the reverse rendering process of LaTeX. The reverse rendering returns the bounding boxes of all entities in documents together with their category (e.g., whether the entity is a table or a figure, etc.). Yet the outcomes are noisy (i.e., imprecise bounding boxes, missing entities, incorrect labels) and without deep structure information (e.g., information such as table row numbers is missing). Nevertheless, as we shall see later, the generated data greatly facilitates learning by being treated as weak labels.

Figure 1: DocParser takes rendered document images (left) as input, performs segmentation into bounding boxes (center), and then outputs the hierarchical structure of the full document (right). Shown is an illustrative sketch; examples are provided in the supplements.

Contributions: We extend prior literature on document parsing in the following directions:

1. We contribute DocParser. This presents the first end-to-end system for parsing renderings into hierarchical document structures. Prior literature has merely focused on simpler tasks such as table detection or table parsing but not on the parsing of complete documents. As a remedy, we present a system for inferring document structures in a holistic, principled manner.

2. We contribute the first dataset (called arXivdocs) for evaluating document parsing. It extends existing datasets for parsing in two directions: (i) it includes all entities that can appear in documents (i.e., not just tables) and (ii) it includes the hierarchical relations among them.
The dataset is based on 127,472 scientific articles from the arXiv repository.

3. We propose a novel weakly supervised learning framework to foster efficient learning in practice, where annotated documents are scarce. It is based on an automated and thus scalable labeling process, where annotations are retrieved by reverse rendering the source code of documents. Specifically, in our work, we utilize TeX source files from arXiv together with synctex for this objective. This then yields weakly supervised labels by reverse rendering of the TeX code.

4. We conduct an extensive evaluation of our proposed techniques, outperforming the state-of-the-art on the related task of table parsing.

[Footnote 1: Source code and the arXivdocs dataset are available from https://github.com/DS3Lab/Doc Parser.]

Figure 2: System overview.

2 DocParser System

2.1 Problem Description

Given a set of document renderings D1, . . . , Dn, the objective is to generate hierarchical structures T1, . . . , Tn. A hierarchical structure Ti, i = 1, . . . , n, consists of both entities and relations as follows. [Footnote 2: For consistency, we use the term entity throughout the article when referring to all elements in the document structure (e.g., a figure, table, or text) that need to be detected. While the term object is common in computer vision, we chose the term entity to highlight its semantic nature for NLP.]

Entities E_j, j = 1, . . . , m, refer to the various elements within a document, such as a figure, table, row, cell, etc. Each entity is described by three attributes: (1) its semantic category c_j ∈ C = {C1, . . . , Cl} (i.e., which defines the underlying type) and (2) the coordinates given by a rectangular bounding box B_j in the document rendering. There is further (3) a confidence score P_j. This is not part of the ground truth labels; rather, it comes from the predictions inside the DocParser system.

Relations R_j, j = 1, . . . , k, of type Ψ are given by triples (E_subj, E_obj, Ψ) consisting of a subject E_subj, an object E_obj, and a relation type Ψ ∈ {parent_of, followed_by, null}. The latter, null, is reserved for entities with meta-information that do not have a designated order (i.e., header, footer, keywords, date, page number). All other entities must have Ψ ≠ null. The combination of entities and relations is sufficient to reconstruct the hierarchical structure Ti for a document. However, generating such a hierarchical structure from a document rendering Di is subject to inherent challenges: the similar appearance of entities impedes detection and, further, the hierarchy can be nested arbitrarily, with substantial variation across different documents.

2.2 System Components

DocParser performs document structure parsing via 5 components (see overview in Figure 2): (1) image conversion, (2) entity detection, (3) relation classification, (4) structure-based refinement, and (5) scalable weak supervision. To store document structures, we developed a customized, JSON-based file format.

Component 1: Image Conversion

Document renderings are converted into images with a predefined resolution ρ. Furthermore, all images are resized to a fixed rectangular size φ (if necessary, with zero padding). The document images are further pre-processed: the RGB channels of all document images are normalized analogously to the MS COCO dataset (i.e., by subtracting the mean RGB channel values from the inputs). The reason is that all neural models are later initialized with pre-trained weights from the MS COCO dataset (Lin et al. 2014).
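As an illustration, the pre-processing of Component 1 can be sketched in a few lines of Python. This is a minimal sketch, assuming a square target canvas and the per-channel RGB means used by the Matterport Mask R-CNN implementation for MS COCO; the function name, the default target size, and the helper layout are illustrative rather than taken from DocParser.

```python
import numpy as np
from PIL import Image

# Per-channel mean pixel values (RGB) of MS COCO as used in the Matterport
# Mask R-CNN implementation; inputs are normalized the same way because the
# detector is initialized from COCO-pretrained weights.
COCO_MEAN_RGB = np.array([123.7, 116.8, 103.9], dtype=np.float32)

def prepare_page_image(path, target_size=1024):
    """Resize a rendered page onto a fixed square canvas with zero padding,
    then subtract the per-channel mean. `target_size` stands in for the fixed
    size phi from the paper; the concrete value is not specified here."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = target_size / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = np.asarray(img.resize((new_w, new_h)), dtype=np.float32)
    canvas = np.zeros((target_size, target_size, 3), dtype=np.float32)
    canvas[:new_h, :new_w, :] = resized  # remaining area stays zero-padded
    return canvas - COCO_MEAN_RGB
```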
Component 2: Entity Detection

To detect all document entities within a document image, we build upon a neural model for image segmentation, namely Mask R-CNN (He et al. 2017). Specifically, it takes the images from the previous component as input and then returns a flat list of entities E1, . . . , Em as output. For each entity, Mask R-CNN determines (i) its rectangular bounding box, (ii) a confidence score, (iii) a binary segmentation mask that distinguishes between the detected entity and background pixels within the bounding box, and (iv) a category label for the entity. Our implementation makes use of 23 categories C: CONTENT BLOCK, TABLE, TABLE ROW, TABLE COLUMN, TABLE CELL, TABULAR, FIGURE, HEADING, ABSTRACT, EQUATION, ITEMIZE, ITEM, BIBLIOGRAPHY BLOCK, TABLE CAPTION, FIGURE GRAPHIC, FIGURE CAPTION, HEADER, FOOTER, PAGE NUMBER, DATE, KEYWORDS, AUTHOR, AFFILIATION. [Footnote 3: For consistency, this formatting is utilized for all entities.]

Component 3: Relation Classification

A set of heuristics is applied to translate the flat list of entities into hierarchical relations R1, . . . , Rk. Here, we distinguish the heuristics according to whether they generate (1) the nesting among entities or (2) the ordering for entities of the same nesting level. The former case corresponds to Ψ = parent_of, while the latter determines all relations with Ψ = followed_by. In this component, we ignore all entities with meta-information, e.g., footers, as these have no designated hierarchy (cf. document grammar in the supplements).

Relations with Nesting (parent_of): Four heuristics h1, . . . , h4 determine parent-child relations as follows (a sketch of h1 and h4 follows below):

(h1: Overlaps) A list of candidate parent-child relations is compiled based on the overlap of bounding boxes. That is, DocParser loops over all bounding boxes and, for each bounding box B_subj, it determines all other bounding boxes that are contained within B_subj. Formally, this is given by all tuples of bounding boxes (B_subj, B_obj) with subj ≤ m, obj ≤ m, and subj ≠ obj for which h1(B_subj, B_obj) is satisfied: tuples for which the bounding box B_obj is fully or partially enclosed by the bounding box B_subj are added to the candidate list. Furthermore, we add tuples to the candidate list that satisfy area(B_subj ∩ B_obj) / area(B_obj) ≥ θ1 and area(B_subj) / area(B_obj) > θ2, i.e., they must have a certain overlap fraction θ1 and size ratio θ2. In DocParser, thresholds of θ1 = 0.45 and θ2 = 1.2 are used.

(h2: Grammar Check) This heuristic validates the candidate list against a predefined document grammar (see document grammar in the supplements). Concretely, all illegal candidates, e.g., a TABULAR nested inside a FIGURE, are removed.

(h3: Direct Children) The candidate list is further pruned so that it contains only direct children of the parent and not sub-children. For this purpose, all sub-children are removed. As an example, this should remove (E1, E3) from a candidate list {(E1, E2), (E1, E3), (E2, E3)}, since E3 is a sub-child and not a direct child of E1.

(h4: Unique Parents) The candidate list is altered so that each entity has only a single parent. Formally, if an entity E_obj has multiple candidate parents, we first compare the Intersection over Union (IoU) of the bounding boxes of all candidate parents with E_obj:

IoU = area(B_subj ∩ B_obj) / area(B_subj ∪ B_obj).

We then keep the parent with the maximal IoU, while all others are removed. If two parents have the same IoU, we select the element with the highest confidence score P_j as parent. If that value is also equal, we choose the entity with the largest bounding box.
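The following sketch illustrates heuristics h1 and h4 under the assumption that each entity is a dict with a bounding box in (x0, y0, x1, y1) pixel coordinates and a confidence score. It only implements the threshold criterion of h1 (the full-enclosure case, the grammar check h2, and the pruning h3 are omitted), so the data layout and function names are illustrative rather than DocParser's actual implementation.

```python
def area(box):
    # box = (x0, y0, x1, y1)
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection(a, b):
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return (x0, y0, x1, y1) if x0 < x1 and y0 < y1 else None

def iou(a, b):
    inter = intersection(a, b)
    inter_area = area(inter) if inter else 0.0
    return inter_area / (area(a) + area(b) - inter_area)

def h1_candidates(entities, theta1=0.45, theta2=1.2):
    """h1 (simplified): propose (subject, object) pairs whose overlap fraction
    and size ratio suggest containment. Thresholds as stated in the paper."""
    pairs = []
    for subj in entities:
        for obj in entities:
            if subj is obj:
                continue
            inter = intersection(subj["box"], obj["box"])
            inter_area = area(inter) if inter else 0.0
            if (inter_area / area(obj["box"]) >= theta1
                    and area(subj["box"]) / area(obj["box"]) > theta2):
                pairs.append((subj, obj))
    return pairs

def h4_unique_parent(candidate_parents, child):
    """h4: keep a single parent -- highest IoU, then confidence, then box area."""
    return max(
        candidate_parents,
        key=lambda p: (iou(p["box"], child["box"]), p["score"], area(p["box"])),
    )
```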
Relations with Ordering (followed_by): The entities are ordered according to the general reading flow (i.e., from left to right). Care is needed so that multi-column pages are processed correctly. For this, two heuristics o1 and o2 are used. By default, all entities are processed by both heuristics. Children of floating entities, however, are only processed by heuristic o2.

(o1: Page Layout Entities) First, all entities are grouped according to their coordinates on the document page, namely into groups belonging to the (a) left side G_l, (b) center G_c, or (c) right side G_r. Formally, this is achieved by computing the overlap of each entity E_j, j = 1, . . . , m, with the left (and right) side of a document page, i.e., τ_ovlp = overlap / width(B). If the overlap with either the left (or the right) side is above a threshold (i.e., τ_ovlp > 0.7), the entity E_j is assigned to the left (or right) side. Otherwise, if such an assignment is not possible with high confidence, the entity E_j is assigned to the center group G_c. In essence, the center group is an indicator of whether the page is in single- or multi-column layout. If no entities have been assigned to the center group (i.e., G_c = ∅), then the entities are ordered first according to G_l followed by G_r. Within each group, the entities are ordered top-to-bottom and then left-to-right by applying heuristic o2. In sum, this approach should find an appropriate ordering for multi-column pages. If entities have been assigned to the center group (i.e., G_c ≠ ∅), then the grouping is further decomposed into additional subgroups: the entities E ∈ G_c from the center group are used to split G_l, G_c, and G_r into vertical subgroups G_l^ι, G_c^ι, and G_r^ι, respectively. Afterward, we loop over all vertical subgroups ι. For each, we order the entities according to the group (first G_l^ι, followed by G_c^ι, and then G_r^ι). Within each subgroup, we perform the ordering via heuristic o2. This approach should correctly arrange entities in two cases: (1) in single-column pages and (2) when multi-column pages are split into different chunks by full-width figures or tables.

(o2: Reading Flow) The entities E_j, j = 1, . . . , m, are ordered top-to-bottom and, within lines, left-to-right, so that the order matches the usual reading flow in documents. Formally, let the top-left corner of a document image refer to the coordinate (0, 0). Furthermore, let us consider the top-left location of all bounding boxes B_j. The top-left location is then used to sort the entities first by the y-coordinate of B_j and, if equal, by its x-coordinate (both ascending). A minimal sketch of this ordering is given below.
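This is a minimal sketch of heuristic o2, assuming the same (x0, y0, x1, y1) box layout as in the previous sketch; turning the sorted sibling list into followed_by triples is shown as an illustrative helper, not DocParser's exact implementation.

```python
def o2_reading_order(entities):
    """o2: sort entities top-to-bottom and, within lines, left-to-right, using
    the top-left corner (x0, y0) of each bounding box; the page origin (0, 0)
    is the top-left corner of the document image."""
    return sorted(entities, key=lambda e: (e["box"][1], e["box"][0]))

def followed_by_relations(entities):
    """Chain an ordered list of sibling entities into followed_by triples."""
    ordered = o2_reading_order(entities)
    return [(a, b, "followed_by") for a, b in zip(ordered, ordered[1:])]
```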
Component 4: Structure-Based Refinement

We utilize the classified relations to iteratively refine entities and relations in four steps when parsing full document pages: (1) For each entity E_parent with l child entities E_child^1, . . . , E_child^l, we update its bounding box such that B_parent = union(B_parent, B_child^1, . . . , B_child^l). (2) For parent entities E_parent with exactly one child entity of the same category, we remove the child entity and update B_parent such that it is the union of the parent and child bounding boxes. We also consider entity pairs of categories that do not conform to the document grammar. This allows us to dismiss duplicate entities of any category. (3) If an entity E_child is a sibling to other entities in a way that conflicts with the document grammar, we generate a new entity that encloses E_child to achieve conformity with the document grammar. Concretely, nested FIGURE structures are defined such that one FIGURE should contain at most one FIGURE GRAPHIC child entity. If multiple FIGURE GRAPHIC entities are classified as children, we wrap each of them individually into new FIGURE entities. (4) If no parent is found for an entity E_child that should only occur as a child entity, we identify a suitable parent entity by analyzing its neighboring siblings as follows: we consider all entities that jointly appear in an ordering relation with E_child as candidates E_cand. We dismiss candidates of a category C that would not conform to the hierarchies defined in the document grammar. Finally, we dismiss any candidate for which B_cand ∩ B_child = ∅. If exactly one candidate remains, we update its bounding box B_cand = union(B_cand, B_child).

The updates to the set of entities can lead to further changes to the classified relations. For this reason, whenever changes are made to entities in one of the four refinement steps, we update the relations via Component 3 and move back to refinement step (1). The refinement is completed once no change is applied in any of the steps or a limit of r loop iterations has been reached. [Footnote 4: Details on our parameter choice and pseudocode are included in the supplements.]

Component 5: Scalable Weak Supervision

The system is further extended by scalable weak supervision. This aims at improving the performance of entity detection and, as a consequence, of end-to-end parsing. Our weak supervision builds upon an additional dataset that consists of source code (rather than document renderings). The source code allows us to create a mapping between entities in the source code and their renderings. This process has three particular characteristics: first, the mapping is noisy and thus creates only weak labels. Despite that, the weak labels can aid efficient learning. Second, annotations are obtained only for some entities and relations. Third, if automated, this process circumvents human annotations and is thus highly scalable.

Let the unlabeled entities found in the source code be given by S1, . . . , Sk. For them, we generate weak labels W1, . . . , Wk, each consisting of a semantic category and the coordinates of a bounding box. However, both the semantic category and the bounding box can be subject to noise. Furthermore, weak labels are generated merely for a subset of the semantic categories C.

In DocParser, the weak supervision is based on TeX source files that are used to generate document renderings in the form of PDF files. The mapping between both formats is then obtained via synctex (Laurens 2008). synctex is a synchronization tool that performs a reverse rendering, so that PDF locations are mapped to TeX code. For given coordinates in the document rendering, synctex returns a list of rectangular bounding boxes and the corresponding source code. Notably, the inferred bounding boxes represent noisy labels, since the resulting entity annotations could be wrongly labeled, shifted, or entirely missing.

We proceed as follows. We iterate through the source code and retrieve bounding boxes for all TeX commands. We then map the source code to our entities E. For instance, the bounding box for the TeX code \includegraphics inside a \begin{figure} ... \end{figure} environment is mapped onto a FIGURE GRAPHIC entity that is nested inside a FIGURE entity.
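The forward direction of this synchronization (from a source location to the bounding boxes it produces in the PDF) can, for example, be queried via the synctex command-line tool. The sketch below assumes the `synctex view` subcommand and its `Page:`/`h:`/`v:`/`W:`/`H:` output fields; the exact invocation and output format may differ across synctex versions, so treat both as assumptions rather than DocParser's actual tooling.

```python
import subprocess

def synctex_boxes(tex_file, line, column, pdf_file):
    """Query synctex forward synchronization for the boxes that a given
    source location produces in the rendered PDF. Returns a list of dicts
    with page number and box geometry; parsing is deliberately loose."""
    out = subprocess.run(
        ["synctex", "view", "-i", f"{line}:{column}:{tex_file}", "-o", pdf_file],
        capture_output=True, text=True, check=True,
    ).stdout
    boxes, current = [], {}
    for raw in out.splitlines():
        key, _, value = raw.partition(":")
        if key in ("Page", "h", "v", "W", "H"):
            current[key] = float(value)
            if len(current) == 5:  # one complete box record collected
                boxes.append(current)
                current = {}
    return boxes
```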
Bounding boxes for all entities that act as inner children are created dynamically by computing the union bounding box of all child bounding boxes. We perform the following processing steps to generate noisy labels for weak supervision:

1. Bounding boxes that are retrieved for simple text tokens inside the source code are mapped to CONTENT LINE entities.

2. If we encounter environments or commands (e.g., \begin{itemize} or \item), we create corresponding candidate entities. All entities retrieved for tokens inside the scope of these environments are created as nested child entities. This approach is used to create the following entity types: FIGURE, FIGURE GRAPHIC, FIGURE CAPTION, TABLE, TABULAR, TABLE CAPTION, ITEMIZE, ITEM, ABSTRACT, and BIBLIOGRAPHY. Any other entities are mapped onto the CONTENT LINE category.

3. We utilize a special characteristic of synctex to identify EQUATION, EQUATION FORMULA, and EQUATION LABEL entities: bounding boxes returned by synctex are highly uniform and typically consist of per-line bounding boxes of consistent width and x-coordinates. Equations and labels are an exception to this rule and typically consist only of vertically aligned bounding boxes of smaller width.

4. The sectioning structure of documents is considered: any type of section command is mapped to a SECTION entity. The argument of the sectioning command, e.g., \subsection{titlearg}, is mapped via synctex to a HEADER entity. Entities generated from code in the scope of a section are created as children of the section entity that corresponds to the current section scope.

5. Within sections, we sort entities based on a top-to-bottom, left-to-right reading order. Using these sorted lists of sibling entities, we form CONTENT BLOCK entities from subsequent groups of CONTENT LINE entities within page columns. If such a block occurs within a BIBLIOGRAPHY environment, we instead map it to a BIBLIOGRAPHY BLOCK entity.

6. In TABLE environments, we consider all child entities (except captions) that do not span across the whole table width as CELL entities and the remainder as TABLE ROW entities. As we shall see later, this is effective at retrieving complex table structures.

7. We use the detected table cells to generate rows and columns as follows: we compute the centroids of all cells. To identify rows, we consider the sorted y-coordinates of the centroids and group them such that the pixel-wise distance between two consecutive y-coordinates in a group is smaller than or equal to 5. If any identified group contains two or more centroid y-coordinates, we create a TABLE ROW entity from the union of the corresponding TABLE CELL entities. Analogously, using the x-coordinates of the cell centroids, we identify TABLE COLUMN entities (see the sketch after this list).

8. Additional cleaning steps are performed for tables and figures: child entities with a width or height of 2 or fewer pixels are discarded. Caption bounding boxes that enclose other non-caption child entities are also discarded.

9. We make sure that entities contain at most one leaf node by moving excess leaves into newly generated CONTENT LINE entities.

10. We remove duplicate bounding boxes and entities without any leaf nodes in their respective sub-tree.

Candidates are filtered such that only a group of entities and their respective sub-trees are preserved: ITEMIZE, FIGURE, TABLE, EQUATION, HEADING, CONTENT BLOCK, BIBLIOGRAPHY, ABSTRACT. During training, entities with obvious errors are dismissed, i.e., leaf nodes or entities with bounding boxes that extend beyond page limits or with an area of 0.
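Step 7 can be sketched as follows. The grouping threshold of 5 pixels is taken from the list above, while the data layout (dicts with an (x0, y0, x1, y1) box) and the function names are illustrative.

```python
def group_centroids(coords, max_gap=5):
    """Group sorted 1D centroid coordinates so that consecutive values within
    a group are at most `max_gap` pixels apart (step 7 above)."""
    groups, current = [], []
    for c in sorted(coords):
        if current and c - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(c)
    groups.append(current)
    return groups

def rows_from_cells(cells, max_gap=5):
    """Form TABLE_ROW entities from cells whose centroid y-coordinates cluster
    together; TABLE_COLUMN entities are obtained analogously from x-coordinates."""
    centroid_y = {id(c): (c["box"][1] + c["box"][3]) / 2.0 for c in cells}
    rows = []
    for group in group_centroids(centroid_y.values(), max_gap):
        members = [c for c in cells if centroid_y[id(c)] in group]
        if len(members) >= 2:  # only groups with two or more centroids form a row
            box = (
                min(c["box"][0] for c in members),
                min(c["box"][1] for c in members),
                max(c["box"][2] for c in members),
                max(c["box"][3] for c in members),
            )
            rows.append({"category": "TABLE_ROW", "box": box})
    return rows
```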
3 Datasets with Document Structures

We contribute the dataset arXivdocs, which is tailored to the task of hierarchical structure parsing. It comes in two variants: arXivdocs-target and arXivdocs-weak. (1) arXivdocs-target contains documents that have been manually checked and annotated. (2) arXivdocs-weak contains a large-scale set of documents that have no manual annotations but that can be used for weak supervision.

3.1 arXivdocs-target

arXivdocs-target provides a set of documents with manual annotations of the complete document structure. These documents were randomly selected from arXiv as an open repository of scientific articles, but in a way such that each has at most 30 pages and contains at least one TABLE within the source code. Altogether, it counts 362 documents. arXivdocs-target comes with predefined splits for training, validation, and evaluation that consist of 160, 79, and 123 documents, respectively. The dataset comprises 30 different entity categories. [Footnote 5: Some entity categories are extremely rare and, hence, only a subset is later used as part of our experiments.] We ensure a fairly uniform distribution of entity categories across the different splits by sampling one random page rendering for each of the 362 documents that contains an ABSTRACT, FIGURE, or TABLE. On average, each document contains 86.32 entities. The number of leaf nodes in the document graph as well as the frequency and average depth of the different entities are reported in the supplements. Evidently, the most common category in the dataset is CONTENT LINE (34.33%). This is because these entities typically represent leaf nodes in the graph and are children of larger entities such as ABSTRACT, CAPTION, or CONTENT BLOCK.

Annotators were instructed to follow the document grammar during labeling. Annotating disallowed hierarchies remains possible, however, to give annotators the freedom to deal with the range of different document representations. Document annotations are automatically initialized by our scalable weak supervision mechanism to speed up the annotation process. The labelers were instructed to annotate entities only up to the coarseness that is used by DocParser, e.g., labeling content blocks rather than individual lines.

3.2 arXivdocs-weak

arXivdocs-weak contains 127,472 documents with an average length of 12.84 pages that were retrieved from arXiv. We selected only documents that have a length of at most 30 pages and contain at least one TABLE within their source code. For reproducibility, we make our weak labels available. [Footnote 6: For this purpose, the dataset was labeled via our proposed weak supervision mechanism and thus contains both entities E_j and hierarchical relations R_j. To limit the size of the physical files, bounding boxes are only stored for entities in leaf nodes. For all other entities, the bounding boxes can be calculated by taking the union bounding box of their children.]

4 Computational Setup

4.1 Mask R-CNN

Mask R-CNN extends the architecture of a convolutional neural network with skip connections (He et al. 2016) so that it is highly effective for image segmentation and entity detection. [Footnote 7: A model illustration is included in the supplements.] Formally, it comprises multiple stages with decreasing spatial resolution. The output of these stages is then fed into a so-called feature pyramid network (FPN) (Lin et al. 2017). The FPN then interconnects these inputs in multiple stages of increasing spatial resolution to produce multi-scale feature maps. Specifically, we use a ResNet-110 architecture (He et al. 2016) to extract features in 5 stages at different resolutions. The outputs of stages 2 to 5, denoted as C2, . . . , C5, are passed to the FPN.
The FPN outputs a total of 5 feature maps P2, . . . , P6 at different resolutions. We refer the reader to Lin et al. (2017) for a detailed description of the five feature maps. The multi-scale feature maps are then input to different prediction networks: first, a region proposal network (RPN) generates a list of candidate bounding boxes that should contain an entity. Second, a Region of Interest (RoI) alignment layer filters out the multi-scale feature maps that correspond to the candidate regions. We note that all 5 feature maps are used by the RPN, but P6 is not included in the inputs to the RoI alignment layer. Third, for each region proposal, a mask sub-network predicts the segmentation masks, based on the RoI-aligned features. These segmentation masks are not used in subsequent steps of DocParser at prediction time; however, they are utilized in our loss function during the training process. Fourth, these bounding boxes are subsequently refined in a detection sub-network, thereby yielding the final bounding boxes B. It also provides the label for classifying the entity category.

All of the above sub-networks were carefully adapted to the specific characteristics of our task: (1) We modified the region proposal network so that it uses a maximum base aspect ratio of 1:8 per entity. The reason for this modification is that document images (as opposed to those in classical image segmentation) contain entities that have highly rectangular shapes. This is the case for most entities, e.g., single CONTENT LINE or TABLE ROW entities. (2) The output size of the classifier sub-network is modified so that it can produce predictions for entities across all semantic categories C. (3) During training of the mask sub-network, we treat all pixels in ground truth bounding boxes as foreground. We do this to incorporate our understanding of the exact shape of many entities that span very wide rectangular regions. (4) We use a mask sub-network loss with a weighting factor of 0.5. This is to prioritize that features relevant for the correct prediction of bounding boxes and entity categories are learned. The Mask R-CNN stage of DocParser comprises 63,891,032 parameters and is built upon the implementation of Mask R-CNN provided by Abdulla (2017), which we carefully adapted as described above.

Training Procedure: All neural models are initialized with pre-trained weights based on the MS COCO dataset (Lin et al. 2014). We then train each model across three phases for a total of 80,000 iterations. This is split into three phases of 20,000, 40,000, and 20,000 iterations, respectively. During the first phase, we freeze all layers of the CNN that is used as the initial block in Mask R-CNN. In the second phase, stages four and five of the CNN are unfrozen. In the last phase, all network layers are trainable. Early stopping is applied based on the performance on the validation set for unrefined predictions.
The performance is measured every 2,000 iterations via the so-called intersection over union with a threshold of 0.8. We train all models in a multi-GPU setting, using 8 GPUs with 12 GB of VRAM each. Each GPU was fed with one image per training iteration. Accordingly, the batch size per training iteration is set to 8. Furthermore, we use stochastic gradient descent with a learning rate of 0.001 and a learning momentum of 0.9.

Parameter Settings: During training, we randomly sampled 100 entities from the ground truth per document image (i.e., up to 100 entities, as some document images might have fewer). In Mask R-CNN, the maximum number of entity predictions per image is set to 200. During prediction, we only keep entities with a confidence score P_j of 0.7 or higher.

Weak Supervision: Training with weak supervision proceeds as follows: all models are initialized with the weights of our pre-trained DocParser WS instead of the default weights. We perform the training with learnable parameters analogous to phase 1 above, but for 2,000 steps with early stopping. In our experiments, we use only a subset of 80% of the annotated documents from arXivdocs-weak, while the other 20% remain unused. The intention is that we want to allow for additional annotations in the future while ensuring comparability to our results. We further ensure a fairly uniform distribution of entities by utilizing only document pages that contain at least an ABSTRACT, a FIGURE, or a TABLE, while all others are discarded. This amounts to 593,583 pages.

4.2 System Variants

We compare the following variants of DocParser: DocParser Baseline is trained solely on the noise-free labels provided for the training dataset (here: arXivdocs-target); DocParser WS benefits from weak supervision (WS). It is trained on a second dataset (here: arXivdocs-weak) with noisy labels for weak supervision. This is to test whether training systems on noisy labels can lead to higher performance, compared to training on small but noise-free training datasets; DocParser WS+FT is initialized with the weights from DocParser WS, but then fine-tuned (FT) on the target dataset.

4.3 Performance Metrics

We separately evaluate the performance of our system for (i) detection of entities E_j and (ii) classification of hierarchical relations R_j. The former aims at a high detection rate (i.e., recognizing true positives out of all positives). Hence, we use the average precision as the evaluation metric. The latter is based on the F1 score, as it represents a typical classification task (i.e., recognizing one of the relations from Ψ).

Entity Detection: Entity detection is commonly measured by the mean average precision (mAP) of a model (0: worst, 100: best). The inferred entities E_j = (c_j, B_j, P_j) are compared against the ground truth labels consisting of the true category ĉ_j with a bounding box B̂_j. Here we follow common practice in computer vision (Everingham et al. 2010) and measure the overlap between bounding boxes from the same category. Specifically, we calculate the so-called intersection over union (IoU):

IoU = area(B_j ∩ B̂_j) / area(B_j ∪ B̂_j).

If the IoU is higher than a user-defined threshold, a predicted entity is considered a true positive. If multiple entities are matched with the same ground truth entity, we only consider the entity with the highest IoU as a true positive. Unmatched predictions and ground truth entities are considered false positives and false negatives, respectively. This is then used to calculate the average precision (AP) per semantic category C_k ∈ C. The overall performance across all categories is given by the mean average precision. We compare IoU thresholds of 0.5 and 0.65. [Footnote 8: Additional results for IoU=0.8 are in the supplements.]
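For illustration, the matching step that precedes the AP computation can be sketched as follows, reusing the `iou` helper from the heuristics sketch above. This is a simplified, per-category greedy matching under the stated threshold; it is meant to illustrate the protocol, not to reproduce the exact evaluation code.

```python
def match_predictions(predictions, ground_truth, threshold=0.5):
    """Greedy matching for one semantic category: each prediction (processed in
    descending confidence order) is matched to the highest-IoU unassigned
    ground-truth box above the threshold. Remaining predictions count as false
    positives, unmatched ground-truth boxes as false negatives."""
    preds = sorted(predictions, key=lambda p: p["score"], reverse=True)
    matched_gt, tp, fp = set(), 0, 0
    for p in preds:
        best_iou, best_gt = 0.0, None
        for i, gt in enumerate(ground_truth):
            if i in matched_gt:
                continue
            score = iou(p["box"], gt["box"])  # iou() from the earlier sketch
            if score > best_iou:
                best_iou, best_gt = score, i
        if best_gt is not None and best_iou >= threshold:
            matched_gt.add(best_gt)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truth) - len(matched_gt)
    return tp, fp, fn
```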
Prediction of Hierarchical Relations: Here we measure the classification performance for predicting the correct relations. A relation R = (E_subj, E_obj, Ψ) is counted as correct only if the complete tuple is identical. However, the performance depends on correct entity detection as input. Hence, we later vary the IoU thresholds for entity detection analogously to above and then report the corresponding F1 score for correctly predicting hierarchical relations. The F1 score is the harmonic mean of precision and recall for predicting these triples (0: worst, 1: best). Note that our performance measure is relatively strict. We show that, even if some F1 scores are in a lower range, we can recover the overall document structure successfully. In particular, we outperform state-of-the-art OCR results, as illustrated in the qualitative samples in our supplements.

4.4 Robustness Check: Table Structure Parsing

We additionally train our model for structure parsing so that it identifies table structures, in order to demonstrate the robustness of our system and weak supervision. We confirm the effectiveness of our weak supervision as follows: we draw upon the ICDAR 2013 dataset (Göbel et al. 2013) for table structure parsing and compare against the state-of-the-art. The ICDAR 2013 dataset consists of a variety of real-world documents and is not limited to scientific articles. We proceed analogously to full document structure parsing and train the three system variants for the task of table structure recognition. DocParser Baseline is trained solely on the samples provided in the ICDAR 2013 training dataset; DocParser WS is trained on table structures generated from arXivdocs-weak; DocParser WS+FT is generated by subsequent fine-tuning on the ICDAR training split. [Footnote 9: Details about the setting and additional experiments are provided in the supplements.] Both training and fine-tuning of all variants follow the 3-phase training scheme for a total of 80,000 iterations. [Footnote 10: Due to the different domain of the target dataset, we experimented with other weak supervision strategies, e.g., randomly sampling images from arXivdocs-weak and ICDAR 2013 during the same training procedure. However, the performance of models trained by sequential fine-tuning could not be surpassed.]

The key focus of our experiments is to confirm the effectiveness of DocParser for parsing complete document structures. However, we emphasize again that both suitable baselines and datasets for this task are hitherto lacking. Hence, we proceed two-fold. On the one hand, we evaluate the performance based on arXivdocs as the first dataset for document structure parsing. On the other hand, we draw upon the table structure ICDAR 2013 dataset: it is limited to table structures and not complete holistic parsing of document structures. However, it allows us to test the effectiveness of our weak supervision against the state-of-the-art.

| AP | Baseline (IoU=0.5) | WS (IoU=0.5) | WS+FT (IoU=0.5) | Baseline (IoU=0.65) | WS (IoU=0.65) | WS+FT (IoU=0.65) |
|---|---|---|---|---|---|---|
| mean AP | 49.9 | 34.6 | 69.4 | 38.5 | 32.4 | 56.5 |
| ABSTRACT | 95.2 | 90.5 | 95.2 | 90.5 | 81.0 | 95.2 |
| AFFILIATION | 51.6 | 0.0 | 46.0 | 5.9 | 0.0 | 16.2 |
| AUTHOR | 18.0 | 0.0 | 23.6 | 20.4 | 0.0 | 16.7 |
| BIB. BLOCK | 42.4 | 79.1 | 94.7 | 43.2 | 93.9 | 80.3 |
| CONT. BLOCK | 89.3 | 69.8 | 88.4 | 83.2 | 67.0 | 84.4 |
| DATE | 0.0 | 0.0 | 24.1 | 0.0 | 0.0 | 9.3 |
| EQUATION | 65.8 | 54.5 | 82.1 | 40.6 | 52.1 | 72.8 |
| FIG. CAPTION | 47.8 | 30.5 | 69.2 | 44.0 | 17.7 | 59.5 |
| FIG. GRAPHIC | 22.3 | 5.2 | 60.2 | 15.9 | 4.4 | 54.5 |
| FIGURE | 47.8 | 35.3 | 63.5 | 44.0 | 33.9 | 59.4 |
| FOOTER | 55.7 | 0.0 | 69.3 | 48.9 | 0.0 | 59.7 |
| HEADER | 79.7 | 0.0 | 88.3 | 64.8 | 0.0 | 56.6 |
| HEADING | 53.7 | 52.1 | 66.4 | 33.1 | 46.0 | 45.4 |
| ITEM | 0.0 | 33.6 | 50.5 | 0.0 | 35.3 | 33.5 |
| ITEMIZE | 0.0 | 25.0 | 58.3 | 0.0 | 25.0 | 50.0 |
| KEYWORDS | 36.4 | 0.0 | 59.0 | 36.4 | 0.0 | 43.0 |
| PAGE NR. | 74.7 | 0.0 | 77.3 | 28.5 | 0.0 | 42.0 |
| TAB. CAPTION | 55.2 | 69.1 | 76.6 | 40.2 | 61.6 | 63.4 |
| TABLE | 84.5 | 96.3 | 94.3 | 62.7 | 87.9 | 89.6 |
| TABULAR | 78.4 | 50.8 | 100.0 | 68.4 | 42.4 | 99.5 |

Table 1: Average precision (AP) of entity detection.
5.1 Document Structure Parsing

We compare the performance of document structure parsing based on our arXivdocs-target dataset across both performance metrics.

Entity Detection: The overall performance for entity detection is detailed in Table 1 (first row). We discuss the performance for IoU = 0.5 in the following. DocParser Baseline achieves an mAP of 49.9. This is higher than DocParser WS with an mAP of 34.6. We attribute this to the fact that several entity categories from arXivdocs-target are not part of arXivdocs-weak. Notably, the fine-tuned system DocParser WS+FT results in significant performance improvements: it obtains an mAP of 69.4, which, in comparison to the DocParser Baseline, is an improvement of 39.1%. DocParser WS+FT consistently outperforms the baseline system, even for categories that are not annotated during weak supervision (e.g., AUTHOR, FOOTER, HEADER, PAGE NUMBER). We attribute this to the better model initialization due to the prior weakly supervised pre-training.

There is a small number of entity categories for which the Baseline achieves higher AP values. We attribute this to our experimental protocol, which yields the best model via early stopping based on mAP and not on individual entity AP values. For a few entities, a decrease can be observed after fine-tuning (e.g., TABLE at IoU=0.5). We attribute this to the high quality of weak annotations for this category and, consequently, a slight decrease in generalization due to fine-tuning. Some AP values (for both DocParser Baseline and DocParser WS) amount to 0.0, e.g., for DATE. This is caused by the absence of some categories in arXivdocs-weak in the case of DocParser WS. For DocParser Baseline, we attribute this to the limited amount of samples in arXivdocs-target for the affected categories, coupled with an inferior model initialization, compared to DocParser WS+FT. DocParser WS+FT outperforms the DocParser Baseline system across all measured IoU thresholds by a considerable margin. Using IoU thresholds above 0.5 leads to a performance decrease. Even though higher IoUs should generally correspond to better matches with the ground truth, they can penalize ambiguous cases and thus a correct detection.

Figure 3: Performance of entity detection (mAP for IoU = 0.5) during fine-tuning. (The figure plots mAP against the number of fine-tuning images, from 0 to 160, for DocParser WS+FT, DocParser WS, and DocParser Baseline.)

| Relations | Baseline (IoU=0.5) | WS (IoU=0.5) | WS+FT (IoU=0.5) | Baseline (IoU=0.65) | WS (IoU=0.65) | WS+FT (IoU=0.65) |
|---|---|---|---|---|---|---|
| All | 0.416 | 0.343 | 0.504 | 0.322 | 0.318 | 0.445 |
| followed_by | 0.413 | 0.387 | 0.506 | 0.314 | 0.366 | 0.447 |
| parent_of | 0.421 | 0.235 | 0.500 | 0.339 | 0.198 | 0.443 |
| Refined: All | 0.453 | 0.382 | 0.615 | 0.363 | 0.354 | 0.558 |
| Refined: followed_by | 0.455 | 0.410 | 0.581 | 0.351 | 0.394 | 0.524 |
| Refined: parent_of | 0.451 | 0.317 | 0.679 | 0.389 | 0.263 | 0.620 |

Table 2: Performance in predicting hierarchical relations (as measured by F1).
In sum, this confirms the effectiveness of our weak supervision in bolstering the overall performance.

Table 1 breaks down the performance by entity category. For DocParser WS+FT, we observe an especially good performance for detecting tabulars and figures. This is owed to the strong initialization of our system due to the high quality and large number of samples in our scalable weak supervision. [Footnote 11: For a few entities, the best performance is achieved by a combination of the WS system together with a high IoU (e.g., BIBLIOGRAPHY BLOCK). A likely reason for this is the composition of arXivdocs-target. As BIBLIOGRAPHY entities were not specifically used as a criterion for the per-page sampling, fewer documents in the target dataset contained relevant entities, leading to decreased performance of the baseline and WS+FT systems.]

Figure 3 shows the effect of fine-tuning. Only 20 fine-tuning samples are sufficient for DocParser WS+FT to surpass the baseline system (which is trained on 160 samples from the target dataset). It thus helps in reducing the labeling effort by a factor of around 8. Furthermore, we observe a steady increase in the performance of the fine-tuned networks with more samples. Notably, the highest performance increase is already achieved by the first 10 document images used for fine-tuning.

| System | Schreiber et al. (2018) | Baseline | WS | WS+FT |
|---|---|---|---|---|
| F1* | 0.9144 | 0.8443 | 0.8117 | 0.9292 |
| F1 | - | 0.8209 | 0.8056 | 0.9292 |

Table 3: ICDAR 2013 results on table structure parsing. Notes: Evaluation of image-based systems on ICDAR 50%, which uses a random subset containing 50% of the competition set for testing. Schreiber et al. (2018) use a different, non-public 50% random subset. Furthermore, Schreiber et al. (2018) choose the best system based on the test set, as indicated by F1*. In contrast, F1 refers to the performance when the selection is based on the validation set.

Prediction of Hierarchical Relations: Table 2 compares the classification of relations with and without post-processing. The best performance (across all Ψ) is achieved by DocParser WS+FT with an IoU of 0.5: it registers an F1 score of 0.615. Here, the use of weak supervision with fine-tuning yields consistent improvements. This is also due to the significant improvements of the prior entity detection for this system variant. In particular, for an IoU of 0.5, it outperforms the F1 score of the baseline system (F1 of 0.453) by 0.162. This amounts to a relative improvement of 35.8%. Evidently, a smaller IoU threshold of 0.5 is beneficial. Higher IoU thresholds reduce the overall parsing performance, as structure parsing builds on the prior detection of document entities. The performance on hierarchical relations (F1 score of 0.615) is largely explained by our choice of a strict evaluation (i.e., the complete tuple including both entities must be correct). Overall, this performance is already highly effective in recovering the overall document structure. This is later confirmed as part of a qualitative assessment.

5.2 Robustness Check: Table Structure Parsing

Results: Table 3 compares the state-of-the-art for table structure parsing with our weak supervision strategy. Altogether, our weak supervision outperforms the state-of-the-art (Schreiber et al. 2018) by a considerable margin.

Discussion: Our system shows significant improvement over the image-based state of the art. We also compare our approach to the state-of-the-art heuristic-based system that operates on raw PDF files, instead of images, as input (Nurminen 2013).
Even though our system does not utilize the additional information provided by raw PDF files, DocParser achieves an F1 score of 0.9292, compared to 0.9221 for the PDF-based system. We refrain from directly comparing the aforementioned F1 score with that from earlier experiments, as the underlying target domains differ.

6 Related Work

OCR: Extracting text from document images has been extensively studied as part of optical character recognition (OCR) within the NLP community (e.g., Schäfer et al. 2011; Schäfer and Weitz 2012). To this end, the work by Katti et al. (2018) argued that OCR should be seen as a pre-processing step for downstream NLP tasks. As such, the authors extract text-based information but not the hierarchical document structure as in our research.

Table Detection: Document renderings are commonly used for the task of table detection (rather than table structure parsing). Here, the objective is to predict the bounding boxes of tables, i.e., whether a pixel refers to a table or not (e.g., Yildiz, Kaiser, and Miksch 2005; Wang, Phillips, and Haralick 2004). Prior research on table detection has utilized data augmentation (Gilani et al. 2017), weak supervision (Li et al. 2019), and transfer learning (e.g., Siddiqui et al. 2018) to address the lack of large-scale domain-specific datasets. Similar to our research, efficient learning presents an issue for table detection. However, parsing of full pages requires effective identification of a much larger number of entities, of multiple categories and with high variety in shape, per input.

Table Structure Parsing: There are works that recognize table structures from text or other syntactic tokens (Kieninger and Dengel 1998; Pivk et al. 2007) rather than directly from document renderings. As such, these works are tailored to tokens as input, and it is thus unclear how such an approach could be adapted to document renderings, since our task inherently relies upon images as input. Because of the different input and thus the different datasets for benchmarking, the performance of the aforementioned works is not comparable to our approach. The works by Schreiber et al. (2018) and Qasim, Mahmood, and Shafait (2019) draw upon deep neural networks to identify table structures from rendered inputs. However, they aim at a different purpose: parsing table structures, but not complete document hierarchies. As such, the authors do not attempt to identify text elements, nested figures, etc.

Weak Supervision for Document Layout: Zhong, Tang, and Yepes (2019) use weak supervision for the detection of page layout entities. Their weak supervision mechanism relies on matching external XML annotations with text extractions by a heuristic-based third-party tool. In contrast, our weak supervision directly builds on the LaTeX compilation and can be readily extended to any new dataset of LaTeX source files. Furthermore, their dataset features only 5 coarse categories and the system does not feature a relation classification component, thus being insufficient to acquire full document structures. [Footnote 12: An additional comparison is included in the supplements.]

Weak Supervision in NLP: Annotations in NLP are oftentimes costly and, as a result, there has been a recent surge in weak supervision. Weak supervision has now been applied to various tasks, such as text classification (e.g., Hingmire and Chakraborti 2014; Lin, He, and Everson 2011), information extraction (e.g., Hoffmann et al. 2011), and semantic parsing (e.g., Goldman et al. 2018). The methodological levers for obtaining weak labels are versatile and include,
e.g., manual rules (e.g., Rabinovich et al. 2018), estimated models (e.g., Hoffmann et al. 2011), or reinforcement learning (Pröllochs, Feuerriegel, and Neumann 2019); however, not for document structure parsing.

7 Discussion and Conclusion

Efficiency: Our system requires only 340 ms/document during entity detection (averaged over our validation set of 79 documents for DocParser WS+FT) on a single Titan Xp GPU with 12 GB of VRAM and a batch size of 1. The relation detection in stage 2 only adds a minimal overhead of, on average, 5.67 ms/document (10.81 ms/document with refinement) on a single CPU @ 2.1 GHz.

Qualitative Assessment: We performed a qualitative analysis on a subset of documents. We observe that, even for F1 scores below 0.5, the final document structure is often still very accurate. In fact, state-of-the-art OCR systems as natural baselines are outperformed significantly. This can be explained by our experiment design: we used very strict evaluation metrics. Hence, even small mismatches or ambiguities between the ground truth and predicted entities result in fairly large F1 penalties, despite high overall similarity. Details are in the supplements (including qualitative examples).

Detection Model Choice: Deep CNN models, including recent work (Tan and Le 2019; Duan et al. 2019), are heavily reliant on large training datasets. As such, we expect the impact of our technical contribution, as shown in our comparison of the baseline and WS+FT models, to be the same across different modern CNN backbones. Our choice of Mask R-CNN as a tool for instance segmentation was also made in consideration of possible future extensions of DocParser to non-rectified documents. Here, the additional instance masks could guide the OCR or rectification process.

Future Work: In future work, we plan to explore approaches that can jointly learn entity and relation detection. Furthermore, we aim to further improve our system by enriching 2D inputs with textual features, e.g., high-dimensional word embeddings. The robustness of WS pre-training w.r.t. smaller subsets of arXivdocs-weak is another area of future investigation.

Conclusion: Despite the extensive interest of the NLP community in leveraging document structures (e.g., Apostolova and Tomuro 2014; Schäfer et al. 2011; Schäfer and Weitz 2012; Schreiber et al. 2018; Katti et al. 2018), the task of parsing complete document structures from renderings has been overlooked. To the best of our knowledge, we present the first system for this task. In particular, DocParser provides an effective alternative to state-of-the-art OCR, which is still widespread in practice. In addition, DocParser allows additional semantic input to be provided to downstream NLP tasks (e.g., information extraction).

Acknowledgments

Ce Zhang and the DS3Lab gratefully acknowledge the support from the Swiss National Science Foundation (Project Number 200021 184628), Innosuisse/SNF BRIDGE Discovery (Project Number 40B2-0 187132), the European Union Horizon 2020 Research and Innovation Programme (DAPHNE, 957407), the Botnar Research Centre for Child Health, the Swiss Data Science Center, Alibaba, Cisco, eBay, Google Focused Research Awards, Oracle Labs, Swisscom, Zurich Insurance, the Chinese Scholarship Council, and the Department of Computer Science at ETH Zurich.

References

Abdulla, W. 2017. Mask R-CNN for Object Detection and Instance Segmentation on Keras and TensorFlow.
Antonacopoulos, A.; Bridson, D.; Papadopoulos, C.; and Pletschacher, S. 2009. A Realistic Dataset for Performance Evaluation of Document Layout Analysis. In International Conference on Document Analysis and Recognition (ICDAR). ISBN 9780769537252. ISSN 1520-5363.

Apostolova, E.; and Tomuro, N. 2014. Combining Visual and Textual Features for Information Extraction from Online Flyers. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1924-1929.

Arif, S.; and Shafait, F. 2018. Table Detection in Document Images using Foreground and Background Features. In 2018 Digital Image Computing: Techniques and Applications (DICTA). ISBN 978-1-5386-6602-9.

Chen, H.-H.; Tsai, S.-C.; and Tsai, J.-H. 2000. Mining Tables from Large Scale HTML Texts. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1, COLING '00, 166-172. USA: Association for Computational Linguistics. ISBN 155860717X.

Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; and Tian, Q. 2019. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, 6569-6578.

Embley, D. W.; Hurst, M.; Lopresti, D.; and Nagy, G. 2006. Table-processing Paradigms: A Research Survey.

Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88(2): 303-338.

Garncarek, Ł.; Powalski, R.; Stanisławek, T.; Topolski, B.; Halama, P.; and Graliński, F. 2020. LAMBERT: Layout-Aware Language Modeling using BERT for Information Extraction. arXiv preprint arXiv:2002.08087.

Gilani, A.; Qasim, S. R.; Malik, I.; and Shafait, F. 2017. Table Detection using Deep Learning. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR).

Göbel, M.; Hassan, T.; Oro, E.; and Orsi, G. 2013. ICDAR 2013 Table Competition. In International Conference on Document Analysis and Recognition (ICDAR). ISBN 978-0-7695-4999-6. ISSN 1520-5363.

Goldman, O.; Latcinnik, V.; Nave, E.; Globerson, A.; and Berant, J. 2018. Weakly Supervised Semantic Parsing with Abstract Examples. In Annual Meeting of the Association for Computational Linguistics (ACL).

Govindaraju, V.; Zhang, C.; and Ré, C. 2013. Understanding Tables in Context Using Standard NLP Toolkits. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 658-664. Sofia, Bulgaria: Association for Computational Linguistics.

He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV). ISBN 978-1-5386-0457-1. ISSN 0006-291X.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Hingmire, S.; and Chakraborti, S. 2014. Sprinkling Topics for Weakly Supervised Text Classification. In Annual Meeting of the ACL.

Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based Weak Supervision for Information Extraction of Overlapping Relations. In Annual Meeting of the ACL.

Hurst, M.; and Nasukawa, T. 2000. Layout and Language: Integrating Spatial and Linguistic Knowledge for Layout Understanding Tasks. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics.

Katti, A. R.; Reisswig, C.; Guder, C.; Brarda, S.; Bickel, S.; Höhne, J.; and Faddoul, J. B. 2018. Chargrid: Towards Understanding 2D Documents. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Kieninger, T.; and Dengel, A. 1998. The T-Recs Table Recognition and Analysis System. In International Workshop on Document Analysis Systems (DAS).

Laurens, J. 2008. Direct and Reverse Synchronization with SyncTeX. TUGboat 29: 365-371.

Li, M.; Cui, L.; Huang, S.; Wei, F.; Zhou, M.; and Li, Z. 2019. TableBank: Table Benchmark for Image-based Table Detection and Recognition. arXiv preprint arXiv:1903.01949.

Lin, C.; He, Y.; and Everson, R. 2011. Sentence Subjectivity Detection with Weakly-Supervised Learning. In International Joint Conference on Natural Language Processing (IJCNLP).

Lin, T. Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017. Feature Pyramid Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ISBN 9781538604571.

Lin, T. Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV). ISBN 978-3-319-10601-4. ISSN 1611-3349.

Liu, X.; Gao, F.; Zhang, Q.; and Zhao, H. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), 32-39.

Luong, M.-T.; Nguyen, T. D.; and Kan, M.-Y. 2012. Logical Structure Recovery in Scholarly Articles with Rich Document Features. In Multimedia Storage and Retrieval Innovations for Digital Library Systems, 270-292. IGI Global.

Nurminen, A. 2013. Algorithmic Extraction of Data in Tables in PDF Documents. Master's thesis, Tampere University of Technology.

Pivk, A.; Cimiano, P.; Sure, Y.; Gams, M.; Rajkovič, V.; and Studer, R. 2007. Transforming Arbitrary Tables into Logical Form with TARTAR. Data and Knowledge Engineering 567-595. ISSN 0169-023X.

Pröllochs, N.; Feuerriegel, S.; and Neumann, D. 2019. Learning Interpretable Negation Rules via Weak Supervision at Document Level: A Reinforcement Learning Approach. In NAACL-HLT.

Qasim, S. R.; Mahmood, H.; and Shafait, F. 2019. Rethinking Table Recognition using Graph Neural Networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 142-147. IEEE.

Rabinovich, E.; Sznajder, B.; Spector, A.; Shnayderman, I.; Aharonov, R.; Konopnicki, D.; and Slonim, N. 2018. Learning Concept Abstractness using Weak Supervision. In EMNLP.

Rice, S. V.; Jenkins, F. R.; and Nartker, T. A. 1995. The Fourth Annual Test of OCR Accuracy. Technical Report 95.

Schäfer, U.; Kiefer, B.; Spurk, C.; Steffen, J.; and Wang, R. 2011. The ACL Anthology Searchbench. In 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations (ACL-HLT). Association for Computational Linguistics.

Schäfer, U.; and Weitz, B. 2012. Combining OCR Outputs for Logical Document Structure Markup: Technical Background to the ACL 2012 Contributed Task. In ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, ACL '12.

Schreiber, S.; Agne, S.; Wolf, I.; Dengel, A.; and Ahmed, S. 2018. DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. In International Conference on Document Analysis and Recognition (ICDAR). ISBN 9781538635865. ISSN 1520-5363.

Siddiqui, S. A.; Malik, M. I.; Agne, S.; Dengel, A.; and Ahmed, S. 2018. DeCNT: Deep Deformable CNN for Table Detection. IEEE Access 74151-74161. ISSN 2169-3536.
Tan, M.; and Le, Q. V. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946.

Tengli, A.; Yang, Y.; and Ma, N. L. 2004. Learning Table Extraction from Examples. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, 987-993. Geneva, Switzerland: COLING.

Wang, Y.; Phillips, I. T.; and Haralick, R. M. 2004. Table Structure Understanding and its Performance Evaluation. Pattern Recognition 1479-1497. ISSN 0031-3203.

Yildiz, B.; Kaiser, K.; and Miksch, S. 2005. pdf2table: A Method to Extract Table Information from PDF Files. In 2nd Indian International Conference on Artificial Intelligence (IICAI).

Zanibbi, R.; Blostein, D.; and Cordy, J. 2004. A Survey of Table Recognition. Document Analysis and Recognition 1-33. ISSN 1433-2833.

Zhong, X.; Tang, J.; and Yepes, A. J. 2019. PubLayNet: Largest Dataset Ever for Document Layout Analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 1015-1022. IEEE.