# Learning to Interpret Satellite Images using Wikipedia

Burak Uzkent¹, Evan Sheehan¹, Chenlin Meng¹, Zhongyi Tang², Marshall Burke², David Lobell² and Stefano Ermon¹
¹Department of Computer Science, Stanford University
²Department of Earth System Science, Stanford University
buzkent@cs.stanford.edu, {esheehan, chenlin, ztang, mburke, dlobell}@stanford.edu, ermon@cs.stanford.edu

Despite recent progress in computer vision, fine-grained interpretation of satellite images remains challenging because of a lack of labeled training data. To overcome this limitation, we construct a novel dataset called WikiSatNet by pairing geo-referenced Wikipedia articles with satellite imagery of their corresponding locations. We then propose two strategies to learn representations of satellite images by predicting properties of the corresponding articles from the images. Leveraging this new multi-modal dataset, we can drastically reduce the quantity of human-annotated labels and time required for downstream tasks. On the recently released fMoW dataset, our pre-training strategies can boost the performance of a model pre-trained on ImageNet by up to 4.5% in F1 score.

1 Introduction

Deep learning has been the driving force behind many recent improvements in computer vision tasks, including image classification, image segmentation, object detection and tracking, etc. [Russakovsky et al., 2015; Lin et al., 2014; Han et al., 2018]. These deep models, however, require training on high-quality, large-scale datasets, and building such datasets is typically very costly. Satellite images are particularly difficult and expensive to label because of humans' unfamiliarity with aerial perspectives [Christie et al., 2018].

One effective way to reduce the amount of training data needed is to perform pre-training on an existing, previously annotated dataset, such as ImageNet [Deng et al., 2009], and transfer the learned weights to the domain of interest [Raina et al., 2007; Dai et al., 2009]. However, the success of this approach diminishes if the underlying distributions and/or compositions of the pre-training and target datasets are not sufficiently similar. This problem is exceptionally pronounced in the satellite imagery space, as the entire frame of reference and perspective of an aerial image is altered compared to a natural image. This has the unfortunate effect of rendering natural image datasets, such as ImageNet, less useful as pre-training mechanisms for downstream computer vision tasks in the satellite domain [Pan et al., 2010; Kaiser et al., 2017; Jean et al., 2018; Oshri et al., 2018].

Because direct annotation is expensive, researchers have considered many creative ways to provide supervision without explicit labels. These include unsupervised [Kingma et al., 2014], label-free [Ren et al., 2018; Stewart and Ermon, 2017], and weakly supervised learning methods [Ratner et al., 2017]. A particularly effective strategy is to leverage co-occurrence statistics in a dataset, e.g., predicting the next frame in a video, a missing word in a sentence [Mikolov et al., 2013], or relationships between entities such as images and text that co-occur together. For example, leveraging images and their hashtags on Instagram, Mahajan et al. build a large-scale image recognition dataset consisting of more than 3 billion images across 17,000 weak labels obtained from textual hashtags and their WordNet [Miller, 1995] synsets.
After pre-training on this extremely large dataset, they report almost a 5% improvement over the same model trained from scratch on ImageNet. Because satellite images are geolocated, i.e., they correspond to specific locations (and times), they can be paired with other geolocated datasets (e.g., OpenStreetMap [Kaiser et al., 2017]), exploiting spatial co-occurrence statistics as a source of supervision.

Following this strategy, we construct a novel multi-modal dataset by pairing geo-referenced Wikipedia articles with their corresponding satellite images. By treating an article as an information-rich label, we obtain highly detailed physical and qualitative context for each image. For example, the first sentence of the John F. Kennedy International Airport article contains excerpts such as "JFK is the primary international airport serving New York City". Wikipedia articles additionally contain demographic, environmental, and social information in structured form [Sheehan et al., 2019]. To the best of our knowledge, this is the first time that Wikipedia has been used in conjunction with satellite images, and with 888,696 article-image entries, our approach yields the largest satellite image dataset to date.

In this paper, we demonstrate the effectiveness of pairing Wikipedia articles with satellite images for pre-training CNNs for satellite image recognition. We propose two pre-training methods to learn deep representations. First, similar to [Mahajan et al., 2018], we weakly label satellite images with curated summarization tags extracted from the articles via an automated process. We then train a deep convolutional network to predict these weak labels directly from the images, learning useful representations in the process. In the second approach, we propose a novel joint architecture where we first obtain a textual embedding of each article using document summarization techniques from NLP [Le and Mikolov, 2014] and then train a deep convolutional network to produce an embedding for each image that is similar to the textual one. The first approach is a crude way of obtaining a single weak label for each article, whereas the second learns representations without weak labels. The pre-trained networks are then evaluated on a downstream hand-labeled dataset, as in [Jean et al., 2018], where we obtain 4.5% higher accuracy compared to networks pre-trained on ImageNet, the standard approach for computer vision tasks.

2 Pairing Rich Crowdsourced Annotations from Wikipedia to Satellite Images

Wikipedia is a large-scale, crowdsourced database spanning 302 languages with over 47 million articles [Wikipedia, 2018]. Of these 47 million articles, about 11% are contained in the English version. Out of these approximately 5 million articles, we found that roughly 1 million, or nearly 20%, are geolocated, meaning there is a latitude and longitude $c_i = \{c_i^{lat}, c_i^{lon}\}$ associated with the article's text $y_i$. Our key idea is to use the article's coordinates to acquire a satellite image of its location from space (see Fig. 1).

Figure 1: Left: Scatter plot of the distribution of geo-tagged Wikipedia articles, together with some images (right) matched to the articles, shown as green dots on the left plot. The titles of the Wikipedia articles (Three Gorges Dam, Hagia Sophia, Port of Boston, Chrysler Building, Huvudstabron, Taj Mahal, Maracanã Stadium, JFK Airport, Niagara Falls) are written under each image.
There is often a strong correlation between the article's text, $y_i$, and the visual content of the corresponding image, $x_i$. Indeed, we can think of the article as an extremely detailed caption for the satellite image, providing an often comprehensive textual representation of the image, or an information-rich label. This label often contains structured data in the form of tables, called infoboxes, as well as raw text, allowing for the extraction of information about the physical state and features of the entity (e.g., elevation, age, climate, population).

2.1 Collecting Wikipedia Articles

The first step towards our goal was to acquire an English Wikipedia data dump of articles¹; in the future, we plan to explore supplementing the dataset with non-English Wikipedia articles as well. A Wikipedia article dump is stored as one large XML file containing all standard articles as well as numerous technical article stubs (e.g., internal communication pages, page redirects, etc.). In order to analyze each relevant article individually, we first parsed the XML file into its constituent articles, netting roughly 5 million standard, titled articles. To isolate those that were geolocated, we then iterated through these 5 million articles and used regular expressions to find strings matching one of the archetypal coordinate patterns, such as:

(1) {{coord|lat|lon|display=title}}
(2) {{coord|deg|min|sec|d|deg|min|sec|d|display=title}}, for d ∈ {N, S, E, W}

This resulted in the acquisition of 1.02 million articles possessing coordinates.

2.2 Acquiring Matching Satellite Imagery

For a given article's coordinate $c_i$, there are many sensors that can provide imagery, with different trade-offs in terms of spatial and temporal resolution, wavelengths, and costs. In this paper we acquire high-resolution images from DigitalGlobe satellites. The images have a ground sampling distance (GSD) of 0.3-0.5 m. These are among the highest-resolution images available commercially, and were also used in the recently released functional map of the world (fMoW) dataset [Christie et al., 2018]. Note that one could also use the same strategy to build a similar multi-modal dataset using lower-resolution (10 m), publicly available Landsat and Sentinel-2 images. For a given coordinate $c_i$, there are usually multiple images available, captured at different times; we acquired the latest image available. Another important design choice is the size of the acquired images. In this study, we use 1000 × 1000 pixel images covering an area of approximately 900 m². In aerial images, objects occupy drastically different numbers of pixels, as shown in Fig. 1. Based on preliminary manual examination, we found that 1000 × 1000 pixel images can typically cover most of the relevant objects. Finally, we prioritized collecting RGB images and only acquired grayscale images if an RGB image was not available. We did not perform any filtering to remove cloudy images, as our goal is to learn robust representations on a noisy dataset.

¹https://dumps.wikimedia.org/enwiki/
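As a rough illustration of the article-filtering step in Section 2.1, the sketch below scans an article's wikitext for a coordinate template and converts it to decimal degrees. The regular expressions are illustrative placeholders rather than the authors' exact patterns, which are not published in full.

```python
import re

# Decimal-degree template, e.g. {{coord|40.64|-73.78|display=title}}
DECIMAL_COORD = re.compile(
    r"\{\{coord\|(?P<lat>-?\d+(?:\.\d+)?)\|(?P<lon>-?\d+(?:\.\d+)?)\|[^}]*display\s*=\s*title",
    re.IGNORECASE,
)
# Degree/minute/second template, e.g. {{coord|40|38|23|N|73|46|44|W|display=title}}
DMS_COORD = re.compile(
    r"\{\{coord\|(\d+)\|(\d+)\|(\d+(?:\.\d+)?)\|([NS])\|(\d+)\|(\d+)\|(\d+(?:\.\d+)?)\|([EW])\|[^}]*display\s*=\s*title",
    re.IGNORECASE,
)

def extract_coordinate(wikitext: str):
    """Return (lat, lon) in decimal degrees if the article is geolocated, else None."""
    m = DECIMAL_COORD.search(wikitext)
    if m:
        return float(m.group("lat")), float(m.group("lon"))
    m = DMS_COORD.search(wikitext)
    if m:
        d1, m1, s1, ns, d2, m2, s2, ew = m.groups()
        lat = int(d1) + int(m1) / 60 + float(s1) / 3600
        lon = int(d2) + int(m2) / 60 + float(s2) / 3600
        return (lat if ns.upper() == "N" else -lat,
                lon if ew.upper() == "E" else -lon)
    return None
```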
Our resulting WikiSatNet multi-modal dataset is a set of tuples $D = \{(c_1, x_1, y_1), (c_2, x_2, y_2), \ldots, (c_N, x_N, y_N)\}$, where each tuple $(c_i, x_i, y_i)$ represents a location ($c_i$), the corresponding DigitalGlobe image ($x_i$), and the Wikipedia article text ($y_i$). WikiSatNet contains N = 888,696 article-image pairs. To the best of our knowledge, this is the largest dataset of satellite images to date, about 2 times larger than the recently released large-scale fMoW dataset. Note that our procedure is highly scalable and fully automated. It could be used to generate even larger datasets by considering other Wikipedia languages and other sensors in addition to DigitalGlobe. In the next section, we propose two novel methods to pre-train a convolutional neural network (CNN) to extract information about images $x_i$ using information from $y_i$.

3 Learning Visual Representations Using Wikipedia Textual Information

Exemplifying the diverse application possibilities highlighted in the previous sections, we construct a general Wikipedia article-satellite image framework for pre-training CNNs. We then explore whether we can learn to interpret satellite images using knowledge extracted from Wikipedia articles via two approaches: weakly supervised labelling [Ratner et al., 2017] and a novel textual embedding method that attempts to match textual and visual embeddings.

3.1 Weakly Supervised Learning

We first propose learning visual features using a data-programming pipeline [Ratner et al., 2017] to label our dataset. We begin by extracting a weak label $\hat{w}(y_i)$ for each article $y_i$ in our dataset. In our context, a weak label is a noisy, machine-generated classification of an article from a set of pre-defined labels. Because of space constraints, we only provide a high-level description of the approach, and will add more details by purchasing extra pages in the final version.

As a first step, we manually compile a list of 97 potential categories that an article could fall under (e.g., city, lake, event, etc.) and use regular expressions to search for these terms throughout specific areas of the article's text where article meta-data is contained. We then rank the categories matched to the article in a manually constructed hierarchy from specific to general (e.g., building → town → county) and choose the one that comes first to label the article. Because many of these category labels are very detailed, we then merge certain similar categories together to create more general labels. We also discard articles that are assigned labels which cannot be determined from a satellite image (e.g., person, event, etc.). Weak labels represented by fewer than 100 samples are also removed, reducing the final set of labels to 55.
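A minimal sketch of this labeling heuristic is given below. The category list, hierarchy order, and the exact meta-data fields searched are placeholders of our own; only the overall specific-to-general ranking and the discarding of non-visual labels follow the description above.

```python
import re
from typing import Optional

# Hypothetical subset of the 97 categories, ordered from most specific to most general.
CATEGORY_HIERARCHY = ["stadium", "airport", "dam", "bridge", "lake", "river", "island",
                      "road", "building", "town", "city", "county", "event", "person"]
NON_VISUAL = {"event", "person"}  # labels that cannot be determined from a satellite image

def weak_label(article_metadata: str) -> Optional[str]:
    """Return the most specific matching category for an article, or None to discard it."""
    matched = [c for c in CATEGORY_HIERARCHY
               if re.search(rf"\b{re.escape(c)}\b", article_metadata, re.IGNORECASE)]
    if not matched:
        return None
    label = matched[0]                     # hierarchy is ordered specific -> general
    return None if label in NON_VISUAL else label
```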
Given the final set of weak labels and corresponding images, we train a classifier to predict $\hat{w}(y_i)$ from $x_i$. The classifier is composed of a convolutional neural network $f_v : X \to \mathbb{R}^M$ that embeds images into an M-dimensional feature space, followed by fully connected and softmax layers, as shown in Fig. 4a. In this study, we parameterize $f_v$ using the DenseNet121 [Huang et al., 2017] architecture, which was previously shown to perform well across a range of tasks. The classifier is trained using the cross-entropy loss function. The features learned by the embedding $f_v$ on this large-scale pre-training task can then be transferred to downstream tasks, e.g., object detection or land cover classification.

Figure 2: Some of the extracted weak labels representing flipped label noise. The corresponding Wikipedia article titles (North Queensland Cowboys, Highland Aviation, Iserbrook) are written above the images, with extracted labels (event), (school), and (incident). Though the words stadium, airport, and water are mentioned 19, 6, and 23 times in the articles, our weak label extraction pipeline generates wrong labels. Using image to text matching helps alleviate this flipped label noise.

Figure 3: Visually similar examples, labeled (city), (town), (town), and (county), where the extracted weak labels cause adversarial label noise. Here the CNN is penalized for errors even when the predicted label is visually similar to the assigned weak label. In contrast, our document summarization model projects the embeddings of the articles of these images to a similar space to avoid penalizing the CNN when it predicts a similar label.

Extracting weak labels is a noisy process that leads to a significant number of flipped labels, as shown in Fig. 2. Additionally, the process leads to adversarial label noise because of visually similar labels such as city, country, populated place, building, and town, as shown in Fig. 3. One can apply a simple merging step to place such visually similar labels into a general category, e.g., populated place. However, this leads to a class imbalance problem where almost 40% of the dataset is dominated by populated places. Exploring the trade-off between adversarial label noise and class imbalance is very time-consuming given the scale of the dataset. For this reason, in the next section, we propose a novel method to learn deep representations from multi-modal data without manual pre-processing.
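Before moving on to that method, the following is a minimal sketch of the weak-supervision classifier just described: a DenseNet121 backbone with a 55-way softmax head trained with cross-entropy on (image, weak label) pairs. PyTorch is assumed; this is not the authors' released training code, and only the architecture, label count, and learning rate come from the text.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_WEAK_LABELS = 55

model = models.densenet121(weights=None)             # f_v, initialized randomly
model.classifier = nn.Linear(model.classifier.in_features, NUM_WEAK_LABELS)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, weak_labels: torch.Tensor) -> float:
    """One optimization step on a batch of 224x224 RGB images and integer weak labels."""
    model.train()
    optimizer.zero_grad()
    logits = model(images)                            # (batch, 55) class scores
    loss = criterion(logits, weak_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```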
3.2 Image to Text Matching Learning

In this section, we propose a novel method to learn deep convolutional features without using hand-crafted labeling functions. This not only substantially reduces human effort, but also tackles adversarial label noise by softening the loss function for images that can fall into multiple visually similar categories. Our method relies on the idea of image-to-text matching [Lei Ba et al., 2015; Wang et al., 2018]. In this direction, we propose a novel network, shown in Fig. 4b, with two branches: a visual one and a textual one. We design a loss function that encourages the CNN (visual branch) to produce image embeddings that are close to a suitable vector representation of the corresponding article's text (textual branch).

Figure 4: The workflow of the proposed weakly supervised learning method (a): (1) extract labels from articles using our labeling pipeline; (2) match articles with images of their coordinates; (3) pre-train on a large-scale dataset using 55 weak labels; (4) transfer the learned weights to a downstream task. In (b) we show the workflow of image to text matching learning. Our method enforces the CNN to learn features similar to the raw textual features learned by Doc2Vec.

The proposed architecture uses satellite images, X, and Wikipedia articles, Y, as input. In the textual branch, we learn a function $f_t : Y \to \mathbb{R}^K$ to project an article, $y_i$, to a textual embedding $z_i^t \in \mathbb{R}^K$ using a document summarization model from natural language processing (NLP):

$$z_i^t = f_t(y_i). \quad (1)$$

In the visual branch, we use a function $f_v : X \to \mathbb{R}^M$, parameterized by a convolutional neural network, to extract features from an image as

$$z_i^v = f_v(x_i), \quad (2)$$

where $i$ represents the index of the image paired with article $y_i$. We parameterize $f_v$ using the DenseNet121 architecture [Huang et al., 2017], as in the weak supervision method. Next, we use a function $f_m : Z^v \to \mathbb{R}^K$ to map $z_i^v$ to the same dimension as the textual feature vector $z_i^t$. The function $f_m$ is parameterized using a fully connected layer with ReLU activations. The final feature vectors, $z_i^v$ and $z_i^t \in \mathbb{R}^K$, are then compared with a loss function that enforces similarity.

Pre-training the Doc2Vec Model

Our image to text matching method uses textual descriptors $Z^t$ to learn deep visual representations. In our study, we use the Doc2Vec model [Le and Mikolov, 2014], which can summarize variable-length articles in a unified framework. Doc2Vec is a document summarization method that takes a variable-length piece of text, $y_i$, and maps $y_i \in Y$ to a paragraph vector $z_i^t = f_t(y_i) \in \mathbb{R}^K$ in a fixed-length vector space, where K is specified by the user. Documents that possess similar meanings are mapped to nearby points in the embedding space, allowing a comparison between any two documents. In contrast to fixed-length vector representations using Bag-of-Words, Doc2Vec can capture the orderings and semantics of the words, which is highly beneficial for our unsupervised learning task. For example, learning a textual embedding space that closely maps article categories such as country, city, and town is desirable, considering that their corresponding visual data contain similar structures (see Fig. 5). Another advantage of the Doc2Vec model is that it is an unsupervised learning model. This allows us to learn Wikipedia-specific descriptors by training it on the full geolocated Wikipedia article corpus.

Figure 5: Visualization of the PCA components of randomly chosen article embeddings learned by Doc2Vec. Notice that visually similar objects such as cities and towns are closely mapped, while different objects are projected far away. The article titles are shown on the right (City - Middletown, Connecticut; City - Milton, Georgia; Lake - Timothy Lake; Lake - Tinquilco Lake; Town - Mingona Township, Kansas; Town - Moon Township, Pennsylvania; Road - Morehampton Road, Dublin; Road - Motorway M10 Pakistan; River - Motru River; River - Mousam River; Island - Aupaluktok Island; Island - Avatanak Island).

Cosine Similarity Loss Function

After learning the feature vectors $z_i^v$ and $z_i^t \in \mathbb{R}^K$ from the two-branch network, we apply a loss function to measure the similarity of the two vectors. We propose using the cosine similarity metric, which measures the angle, $\theta_i$, between the two vectors as

$$D(x_i, y_i) = \cos(\theta_i) = \frac{f_v(x_i)^T f_t(y_i)}{\|f_v(x_i)\|_2 \, \|f_t(y_i)\|_2}. \quad (3)$$

Wikipedia articles vary widely in length, which makes cosine similarity ideal since it measures the similarity of the direction rather than the magnitude of the two vectors.

Training on WikiSatNet

In our pre-training experiments, we use similar hyperparameters in both weak supervision and image to text matching to train the DenseNet121 that parameterizes $f_v$. We initialize the weights randomly; however, we observed faster convergence when initializing with pre-trained weights. After experimentation, we set the learning rate and batch size to 0.0001 and 64, respectively, and the Adam optimizer is used to train the model [Kingma and Ba, 2014]. Finally, we resize the 1000 × 1000 pixel images to 224 × 224 pixels to allow comparison with publicly available datasets.
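The following is a minimal sketch of the image-to-text matching model in Fig. 4b, assuming PyTorch and precomputed Doc2Vec article embeddings (e.g., from gensim): a DenseNet121 visual branch $f_v$, a fully connected projection $f_m$ to the K-dimensional textual space, and a cosine-similarity objective following Eq. (3). The exact projection head, the loss form (1 − cos θ), and K = 300 (read off Fig. 4b) are our assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

K = 300  # assumed dimension of the Doc2Vec article embeddings

class ImageToTextModel(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.densenet121(weights=None)
        self.f_v = backbone.features                              # convolutional feature extractor
        self.f_m = nn.Sequential(nn.ReLU(), nn.Linear(1024, K))   # project 1024-d pooled features to K

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.f_v(images)                                  # (B, 1024, h, w)
        pooled = F.adaptive_avg_pool2d(F.relu(feats), 1).flatten(1)  # global pooling -> (B, 1024)
        return self.f_m(pooled)                                   # (B, K) visual embedding z_v

def cosine_loss(z_v: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Loss = 1 - cos(theta); minimized when image and article embeddings align (Eq. 3)."""
    return (1.0 - F.cosine_similarity(z_v, z_t, dim=1)).mean()

model = ImageToTextModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, doc2vec_embeddings: torch.Tensor) -> float:
    """One step on a batch of images and their precomputed Doc2Vec article embeddings."""
    optimizer.zero_grad()
    loss = cosine_loss(model(images), doc2vec_embeddings)
    loss.backward()
    optimizer.step()
    return loss.item()
```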
In the initial steps of image to text training, we observe an angle of approximately 90° ($D(x_i, y_i) \approx 0$) between $z_i^t$ and $z_i^v$. This is consistent with the fact that random vectors in high-dimensional spaces are likely to be orthogonal to each other. After several epochs, the angle decreases to about 45° ($D(x_i, y_i) \approx 0.5$) and stops decreasing further. We believe that this is partially due to articles that do not contain any visual cues, e.g., culture and person, and also to cloudy images, which together amount to roughly 5% of the dataset. We did not observe over-fitting in our experiments. While we are not able to achieve zero loss, we qualitatively find that our approaches learn meaningful representations. To verify this, after pre-training the CNN on WikiSatNet using image to text matching, we visualize the cosine similarities between $z_i^t$ and $z_i^v$, as shown in Fig. 6. In the same figure, we keep $z^t$ fixed and use embeddings from images at different locations. The CNN learns to project the embedding $z_i^v$ closer to its corresponding article embedding $z_i^t$.

Figure 6: The cosine similarities between the CNN embeddings and the Doc2Vec embeddings are computed and overlaid on the images. The CNN learns to embed AT&T Stadium's image closer to its corresponding article.

Our implementation of image to text matching and weak supervision can be found in our repository². Additionally, we plan on releasing a fraction of the high-resolution images used in WikiSatNet. This will encourage further research into jointly utilizing Wikipedia and satellite images.

²https://github.com/ermongroup/PretrainingWikiSatNet

4 Transfer Learning Experiments

After pre-training CNNs on WikiSatNet using the proposed methods, we test them on three target tasks: (1) single image classification on the fMoW dataset, (2) temporal view classification using multiple images over an area on the fMoW dataset, and (3) land cover classification. In these tasks, we compare our pre-training strategies to the following baselines: (1) pre-training on ImageNet [Russakovsky et al., 2015], (2) pre-training on CIFAR10, and (3) training from scratch. Our goal is to evaluate whether we learn satellite-specific representations that outperform the ones obtained using out-of-domain benchmarks with human labels.

Fine-tuning

There are two classical approaches to fine-tuning a deep network on a target task: (1) training all layers, and (2) freezing all layers other than the final classification layer. In our experiments, we present results from both strategies. The learning rates for the weakly supervised and image to text matching models are set to 1e-4 and 1e-5 after experimentation. On the other hand, for the ImageNet model the learning rate is set to 1e-4, while it is set to 1e-3 for the CIFAR10 and trained-from-scratch models. These were the best-performing hyper-parameters in our experiments. Finally, resized 224 × 224 pixel RGB images are used as input to the model, as in the pre-training task. We follow the same approach for the models pre-trained on CIFAR10 and ImageNet.
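As a sketch of these two fine-tuning strategies (PyTorch assumed, helper names ours), the function below loads pre-trained WikiSatNet weights into a DenseNet121, swaps in a task-specific head, and optionally freezes the backbone so that only the final classification layer is trained; the 62-class fMoW head and learning rates follow the text.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(checkpoint_path: str, num_classes: int = 62,
                         freeze_backbone: bool = False, lr: float = 1e-5):
    """Strategy (1): freeze_backbone=False trains all layers; (2): True trains only the head."""
    model = models.densenet121(weights=None)
    state = torch.load(checkpoint_path, map_location="cpu")   # WikiSatNet pre-trained weights
    model.load_state_dict(state, strict=False)                # head shape may differ from pre-training
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    if freeze_backbone:
        for p in model.features.parameters():
            p.requires_grad = False                            # keep f_v fixed
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    return model, optimizer
```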
4.1 Experimenting on the fMoW Dataset

To quantify the quality of the representations learned in the pre-training step, we first use a recently released large-scale satellite image recognition dataset named fMoW [Christie et al., 2018]. The fMoW dataset consists of both multispectral and RGB images and contains 83,412 unique training bounding boxes from large satellite images representing 62 different objects. The validation and test sets contain 14,241 and 16,948 bounding boxes, respectively, and are left unchanged in our experiments. It also comes with temporal views of the same scenes, making classification of some classes, such as construction site and flooded road, easier. Christie et al. propose a multi-modal architecture that uses a DenseNet161 pre-trained on ImageNet and an LSTM to learn from images and their corresponding meta-data. Their DenseNet161 model has a number of parameters similar to the DenseNet121 model we use in our experiments. Since our pre-training framework learns from visual data, it can be easily applied to any CNN model to boost performance and reduce the number of labeled samples needed for a target task.

Reasoning on Single Images

In the first task, we perform experiments on the fMoW dataset for the task of classifying individual images using features extracted by the visual branch $f_v(\cdot)$. We experiment with 2000, 10000, 50000, 100000, 200000, and 350000 training images. As shown in Fig. 7, our pre-training strategies outperform the other pre-training strategies by large margins in top-1 and top-5 classification accuracy when using small amounts of labeled data. For example, when using 2000 labeled images, both of our training strategies outperform ImageNet and CIFAR10 by 10% and 30%, respectively. As expected, these margins shrink to about 5% and 20% when increasing the number of labeled images to 50000. Interestingly, at this point, the model trained from scratch starts to outperform the model pre-trained on CIFAR10. When using the full training set, our proposed pre-training strategies outperform ImageNet by about 2% and outperform the model trained from scratch by about 10%. These results demonstrate that our proposed approach produces features that are highly beneficial in downstream tasks involving satellite images, even when large numbers of human-labeled samples are available. When fine-tuning only the final layer, the proposed pre-training methods outperform ImageNet pre-training by about 13% (Table 1).

Figure 7: The top-1 and top-5 classification accuracies of the proposed pre-training and baseline strategies on fMoW's test set when fine-tuning all layers on fMoW's training set. Monte-Carlo experiments were conducted when sampling a subset of the full training set.

| Model | Top-1 Acc. (fixed $f_v$) | Top-1 Acc. (fine-tuned $f_v$) |
|---|---|---|
| CIFAR10 | 13.98% | 55.79% |
| ImageNet | 37.73% | 68.61% |
| WikiSatNet Weak Labels | 50.73% | 70.62% |
| WikiSatNet Image2Text | 51.02% | 70.72% |

Table 1: Top-1 accuracies on the fMoW test set for pre-trained models. All models are fine-tuned on the full fMoW training set. "Fixed $f_v$" represents the fine-tuning method where the pre-trained weights are kept fixed, whereas the second method fine-tunes all layers.
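A rough sketch of the label-efficiency protocol behind Fig. 7 is shown below: repeatedly sample a random labeled subset of the fMoW training set, fine-tune, and evaluate on the fixed test set. The helper callables (fine_tune, evaluate_topk) and the number of Monte-Carlo repetitions are placeholders, not the authors' code.

```python
import random

SUBSET_SIZES = [2000, 10000, 50000, 100000, 200000, 350000]
NUM_TRIALS = 3  # Monte-Carlo repetitions per subset size (count is our assumption)

def label_efficiency_curve(train_set, test_set, fine_tune, evaluate_topk):
    """Return {subset_size: [per-trial (top-1, top-5) accuracies]} for the plot in Fig. 7."""
    results = {}
    for n in SUBSET_SIZES:
        trial_accs = []
        for _ in range(NUM_TRIALS):
            subset = random.sample(range(len(train_set)), n)   # random labeled subset
            model = fine_tune(train_set, subset)               # fine-tune all layers
            trial_accs.append(evaluate_topk(model, test_set, ks=(1, 5)))
        results[n] = trial_accs
    return results
```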
Reasoning on Temporal Views

In this section, we evaluate our representations on the task of temporal view classification across 62 classes from the fMoW dataset. This way, we can understand whether our pre-training methods also boost performance on tasks that use temporal data as input. Christie et al. train the network on single labeled images and, at test time, average the softmax predictions of the network over different images of the same area to assign the label with the maximum average score. We follow their training and test methods and, at test time, average predictions from T images over the same area, again using features extracted from $f_v(\cdot)$ as input (a minimal sketch of this inference step is given at the end of this section). Differently from the previous section, we now report results as F1 scores to compare our models to the ones proposed by [Christie et al., 2018].

| Model | F1 Score (Single View) | F1 Score (Temporal Views) |
|---|---|---|
| CIFAR10 | 55.34% | 60.45% |
| ImageNet | 64.71% | 68.73% |
| WikiSatNet Weak Labels | 66.17% | 71.31% |
| WikiSatNet Image2Text | 67.12% | 73.02% |

Table 2: F1 scores of different pre-training methods on fMoW's test set when fine-tuning all layers on fMoW's training set.

We first compare our pre-training methods to ImageNet and CIFAR10 pre-training in Table 2. The proposed pre-training methods outperform the ImageNet pre-trained model by up to 4.5% in F1 score when performing reasoning on temporal views. Among the proposed methods, the image to text matching approach outperforms the weak supervision with hand-crafted labels by about 1.7% in F1 score. On the other hand, Christie et al. propose five different models for the fMoW classification task. Three of them use meta-data and images jointly, whereas the remaining two only employ an ImageNet pre-trained DenseNet on images. Their visual-data-only models are named CNN-I-1 and CNN-I, where the former is a single-view model and the latter performs temporal reasoning. We can improve on these models with our pre-training strategy by about 4.5% in F1 score while performing similarly to their top-performing model, LSTM-IM, which uses meta-data and visual data jointly to perform temporal reasoning. Although this is outside the scope of this paper, our models could also replace the ImageNet pre-trained DenseNet used in LSTM-IM to improve its results.

4.2 Experiments on Land Cover Classification

Additionally, we perform classification across 66 land cover classes using remote sensing images with 0.6 m GSD obtained by the USDA's National Agriculture Imagery Program (NAIP). We focus on images from California's Central Valley near the city of Fresno for the year 2016. The corresponding land cover map, named the Cropland Data Layer (CDL), is collected by the USDA for the continental United States [NAIP, 2016]. The CDL is provided at 30 m GSD, and we upsample it to match the 0.6 m GSD imagery for use as ground truth. The final dataset consists of 100000 training and 50000 validation and test images. We only fine-tune the classification layer while keeping $f_v$ fixed.

| Model | Top-1 Acc. | Top-5 Acc. |
|---|---|---|
| CIFAR10 | 42.01% | 74.73% |
| ImageNet | 40.11% | 80.15% |
| WikiSatNet Weak Labels | 46.16% | 88.66% |
| WikiSatNet Image2Text | 47.65% | 88.77% |

Table 3: Performance of different pre-training methods on the land cover classification task.

As shown in Table 3, our pre-training strategies lead to substantially higher performance than the ImageNet and CIFAR10 features. This demonstrates the robustness and wide range of applications of our pre-training strategies.
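For completeness, the sketch below illustrates the temporal-view inference used in Section 4.1: average the per-image softmax scores over the T available views of an area and predict the class with the maximum mean score. PyTorch is assumed; this is an illustrative reimplementation, not the authors' exact code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_temporal(model: torch.nn.Module, views: torch.Tensor) -> int:
    """views: (T, 3, 224, 224) tensor of images of the same area captured at different times."""
    model.eval()
    probs = F.softmax(model(views), dim=1)    # (T, num_classes) per-view class probabilities
    return int(probs.mean(dim=0).argmax())    # class with the maximum average score
```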
5 Conclusion

In this study, we proposed a novel combination of satellite images and crowdsourced annotations from geo-referenced Wikipedia articles. Our approach yields a large-scale, multi-modal dataset combining rich visual and textual information for millions of locations all over the world; including additional languages beyond English will likely improve coverage even more. Leveraging this paired multi-modal data, we proposed two different pre-training methods: (1) learning with weak labels, and (2) learning without weak labels using image to text matching. Both pre-training strategies lead to improved results on the recently released fMoW dataset, which consists of large numbers of labeled samples. Our image to text matching model outperformed one pre-trained on ImageNet by 4.5% when using around 350000 labeled samples; this increase is substantially higher when there are fewer labels.

Acknowledgements

This research was supported in part by Stanford's Data for Development Initiative, the DARPA World Modelers Program, and NSF grants #1651565, #1522054, and #1733686.

References

[Christie et al., 2018] Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018.

[Dai et al., 2009] Wenyuan Dai, Ou Jin, Gui-Rong Xue, Qiang Yang, and Yong Yu. Eigentransfer: a unified framework for transfer learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 193–200. ACM, 2009.

[Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[Han et al., 2018] Junwei Han, Dingwen Zhang, Gong Cheng, Nian Liu, and Dong Xu. Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Processing Magazine, 35(1):84–100, 2018.

[Huang et al., 2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269. IEEE, 2017.

[Jean et al., 2018] Neal Jean, Sherrie Wang, George Azzari, David Lobell, and Stefano Ermon. Tile2Vec: Unsupervised representation learning for remote sensing data. arXiv preprint arXiv:1805.02855, 2018.

[Kaiser et al., 2017] Pascal Kaiser, Jan Dirk Wegner, Aurélien Lucchi, Martin Jaggi, Thomas Hofmann, and Konrad Schindler. Learning aerial image segmentation from online maps. IEEE Transactions on Geoscience and Remote Sensing, 55(11):6054–6068, 2017.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma et al., 2014] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[Le and Mikolov, 2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.

[Lei Ba et al., 2015] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4247–4255, 2015.
[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[Mahajan et al., 2018] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.

[Miller, 1995] George A. Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[NAIP, 2016] USDA-NASS Cropland Data NAIP. Published crop-specific data layer. USDA-NASS, Washington, DC, 2016.

[Oshri et al., 2018] Barak Oshri, Annie Hu, Peter Adelson, Xiao Chen, Pascaline Dupas, Jeremy Weinstein, Marshall Burke, David Lobell, and Stefano Ermon. Infrastructure quality assessment in Africa using satellite imagery and deep learning. In Proc. 24th SIGKDD Conference, 2018.

[Pan et al., 2010] Sinno Jialin Pan, Qiang Yang, et al. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[Raina et al., 2007] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759–766. ACM, 2007.

[Ratner et al., 2017] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. Snorkel: Rapid training data creation with weak supervision. arXiv preprint arXiv:1711.10160, 2017.

[Ren et al., 2018] Hongyu Ren, Russell Stewart, Jiaming Song, Volodymyr Kuleshov, and Stefano Ermon. Adversarial constraint learning for structured prediction. CoRR, abs/1805.10561, 2018.

[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[Sheehan et al., 2019] Evan Sheehan, Chenlin Meng, Matthew Tan, Burak Uzkent, Neal Jean, David Lobell, Marshall Burke, and Stefano Ermon. Predicting economic development using geolocated Wikipedia articles. In Proc. 25th SIGKDD Conference, 2019.

[Stewart and Ermon, 2017] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In AAAI, 2017.

[Wang et al., 2018] Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[Wikipedia, 2018] Wikipedia. Wikipedia, the free encyclopedia, 2018.