{"title": "Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization", "book": "Advances in Neural Information Processing Systems", "page_first": 10090, "page_last": 10100, "abstract": "In this paper, we develop a new deep network to explicitly address these inherent differences between ground and aerial views.  We observe there exist some approximate domain correspondences between ground and aerial images. Specifically, pixels lying on the same azimuth direction in an aerial image approximately correspond to a vertical image column in the ground view image. Thus, we propose a two-step approach to exploit this prior knowledge. The first step is to apply a regular polar transform to warp an aerial image such that its domain is closer to that of a ground-view panorama.  Note that polar transform as a pure geometric transformation is agnostic to scene content, hence cannot bring the two domains into full alignment. Then, we add a subsequent spatial-attention mechanism which further brings corresponding deep features closer in the embedding space.  To improve the robustness of feature representation, we introduce a feature aggregation strategy via learning multiple spatial embeddings. By the above two-step approach, we achieve more discriminative deep representations, facilitating cross-view Geo-localization more accurate. 
Our experiments on standard benchmark datasets show significant performance boosting, achieving more than doubled recall rate compared with the previous state of the art.", "full_text": "Spatial-Aware Feature Aggregation for\n\nCross-View Image based Geo-Localization\n\nYujiao Shi, Liu Liu, Xin Yu, Hongdong Li\n\nAustralian National University, Canberra, Australia.\n\nAustralian Centre for Robotic Vision, Australia.\n\n{firstname.lastname}@anu.edu.au\n\nAbstract\n\nRecent works show that it is possible to train a deep network to determine the\ngeographic location of a ground-level image (e.g., a Google street-view panorama)\nby matching it against a satellite map covering the wide geographic area of interest.\nConventional deep networks, which often cast the problem as a metric embedding\ntask, however, suffer from poor performance in terms of low recall rates. One\nof the key reasons is the vast differences between the two view modalities, i.e.,\nground view versus aerial/satellite view. They not only exhibit very different visual\nappearances, but also have distinctive geometric con\ufb01gurations. Existing deep\nmethods overlook those appearance and geometric differences, and instead use\na brute force training procedure, leading to inferior performance. In this paper,\nwe develop a new deep network to explicitly address these inherent differences\nbetween ground and aerial views. We observe that pixels lying on the same azimuth\ndirection in an aerial image approximately correspond to a vertical image column\nin the ground view image. Thus, we propose a two-step approach to exploit this\nprior. The \ufb01rst step is to apply a regular polar transform to warp an aerial image\nsuch that its domain is closer to that of a ground-view panorama. Note that polar\ntransform as a pure geometric transformation is agnostic to scene content, hence\ncannot bring the two domains into full alignment. 
Then, we add a subsequent\nspatial-attention mechanism which brings corresponding deep features closer in\nthe embedding space. To improve the robustness of feature representation, we introduce\na feature aggregation strategy via learning multiple spatial embeddings. With\nthe above two-step approach, we achieve more discriminative deep representations,\nmaking cross-view Geo-localization more accurate. Our experiments on standard\nbenchmark datasets show a signi\ufb01cant performance boost, more than doubling\nthe recall rate of the previous state of the art. Remarkably,\nthe recall rate@top-1 improves from 22.5% in [5] (or 40.7% in [11]) to 89.8% on\nthe CVUSA benchmark, and from 20.1% [5] to 81.0% on the new CVACT dataset.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nImage based Geo-localization refers to the task of determining the location of an image (known\nas a query image) by comparing it with a large set of Geo-tagged database images. It has important\ncomputer vision applications, such as robot navigation, autonomous driving, and way-\ufb01nding\nin AR/VR applications.\n\nFigure 1: Illustration of geometric correspondences between ground and aerial images, and visualization\nof our generated spatial embedding maps. Panels: (a) Aerial, (b) Ground, (c) Ground Attention Map, (d) Polar-transformed Aerial Image, (e) Polar-transformed Aerial Attention Map.\n\nIn this paper, we study the ground-to-aerial cross-view image based Geo-localization problem. To be\nspeci\ufb01c, the query image is a normal ground-level image (e.g., a street view image taken by a tourist)\nwhereas the database images are collections of aerial/satellite images covering the same (though\nwider) geographic region. 
Cross-view image based localization is a very challenging task because the\nviewpoints (as well as imaging modality) between ground and aerial images are drastically different;\ntheir image visual appearances can also be far apart. As a result, \ufb01nding feature correspondence\nbetween two views (even for a matching pair) can be very challenging. Recently, machine learning\ntechniques (especially deep learning) have been applied to this task, showing promising results\n[5, 11, 19, 24].\nExisting deep neural networks developed for this task often treat the cross-view localization problem\nas a standard image retrieval task, and are trained to \ufb01nd better image feature embeddings that bring\nmatching image pairs (one from ground view, and one from aerial view) closer while pushing those\nunmatching pairs far apart. In other words, they cast the problem as a deep metric learning task, and\nthus learn feature representations purely based on image content (appearance or semantics) without\ntaking into account spatial correspondences between ground and aerial views. To be precise, as seen\nin Figure 1(a) and Figure 1(b), one can easily observe that the locations of objects in an aerial image\nexhibit a strong spatial relationship with the ones in its corresponding ground image. Furthermore,\nthe relative positions among objects also provide critical clues for the cross-view image matching.\nBy exploring such geometric con\ufb01gurations of the scenes, one can signi\ufb01cantly reduce the ambiguity\nof the cross-view image matching problem, and this is the key idea of our paper, which will be\ndescribed next.\nUnlike conventional approaches, our method focuses on establishing spatial correspondences between\nthese two domains explicitly and then learning feature correspondences from these two coarsely\naligned domains. 
Although deep neural networks are able to learn any functional transformation in\ntheory, explicitly aligning two domains based on geometric correspondences will reduce the burden\nof the learning process for domain alignment, thus facilitating the network convergence. In our\nmethod, we apply polar coordinate transform to aerial images, making them approximately aligned with\na ground-view panorama, as shown in Figure 1(d). After polar transform, we train a Siamese-type\nnetwork architecture to establish deep feature representation. Since polar transform does not take\nthe scene content into account and the true correspondences between the two different domains are\nmore complex than a simple polar transform, some objects may exhibit distortions. To remedy that,\nwe develop a spatial attention based feature embedding module to extract position-aware features.\nPrecisely, our spatial feature embedding module imposes different attention on different locations and\nthen re-weights features to yield a global descriptor for an input image. In this manner, our method\nnot only retains image content information but also encodes the layout information of object features.\nTo achieve robustness of feature representation, we employ a feature aggregation strategy by learning\nmultiple spatial feature embeddings and then aggregating the embedded features. We further employ\na triplet loss to establish the feature correspondences between these cross-view images. Our extensive\nexperimental results demonstrate that our method achieves superior Geo-localization performance\nto the state-of-the-art. Remarkably, the recall rate@top-1 improves from 22.5% in [5] (or 40.7% in\n[11]) to 89.8% on the CVUSA benchmark, and from 20.1% [5] (or 46.9% in [11]) to 81.0% on the new\nCVACT dataset.\nContributions of this paper can be summarized as follows:\n\n\u2022 We propose a new pipeline to address the cross-view Geo-localization problem. 
We \ufb01rst\nexploit the geometric correspondences between ground and aerial image domains to align\nthese two domains explicitly by a polar transform, allowing the networks to focus on learning\ndetailed scene-dependent feature correspondences.\n\u2022 We present a spatial-aware attention module to re-weight features in accordance with feature\nlocations. Since our method embeds relative positions between object features into image\ndescriptors, our descriptors are more discriminative.\n\u2022 We conduct extensive experiments which con\ufb01rm that our proposed method signi\ufb01cantly\noutperforms the state-of-the-art on two standard cross-view benchmark datasets. Our method\nachieves a nearly 4-fold improvement in terms of top-1 recall, compared with the CVM-Net\nproposed in 2018 [5].\n\n2 Related Work\n\nDue to the drastic appearance and viewpoint changes, it is very dif\ufb01cult to match local features [12,\n2, 18, 22] between ground and aerial images directly. Several methods [3, 10, 13] warp ground\nimages into bird-view images and then match the warped images to the aerial ones. J\u00e9gou et al. [6]\naggregate the residuals of local features to cluster centroids as image representations, known as\nVLAD descriptors. The work [17] aggregates a set of local features into a histogram, known as Bag\nof words, to attain a global descriptor. The aggregated descriptors have been shown to be partially viewpoint\nand occlusion invariant, thus facilitating image matching. However, hand-crafted features are\nstill the performance bottleneck of traditional cross-view Geo-localization methods.\nDeep neural networks have demonstrated their powerful image representation ability [14]. The\nseminal work [20] \ufb01ne-tunes AlexNet [8] on Imagenet [14] and Places [25] to extract features for\nthe cross-view matching task. This work also indicates the better discriminativeness of deep\nfeatures compared to hand-crafted features. 
The work [21] \ufb01ne-tunes CNNs by minimizing the feature\ndistances between aerial and ground-view images and obtains better localization performance. [19]\nemploys a triplet CNN architecture to learn feature embedding and achieves signi\ufb01cant improvements.\n[5] embeds a NetVLAD layer on top of a VGG backbone network to represent the two-view images\nmore discriminatively. Liu & Li [11] observe that orientations play a critical role in learning\ndiscriminative features. Thus, this method incorporates per-pixel orientation information into a CNN\nto learn orientation-selective features for the cross-view localization task. Shi et al.[15] propose\na feature transport module to bridge the spatial and feature response domain differences between\nground and aerial images. However, it might be dif\ufb01cult for networks to explore both geometric and\nfeature correspondences simultaneously via a metric learning objective. Therefore, we propose to\ndecouple the procedure of constructing geometric and feature correspondences, and let networks\nlearn simple tasks.\n\n3 Methodology\n\nIn this section, we \ufb01rst introduce the polar transform applied to aerial images for aligning these\ntwo cross-view domains, and then we present our spatial-aware position embedding module for\ndescriptor extraction of both ground and aerial images. We employ a Siamese-like two-branch\nnetwork architecture and our entire pipeline is illustrated in Figure 2.\n\n3.1 Polar Transform\n\nAs we observed, pixels lying on the same azimuth direction in an aerial image approximately\ncorrespond to a vertical image column in the ground view image. Instead of enforcing neural\nnetworks to learn this mapping implicitly, we explicitly transform the aerial images and then roughly\neliminate the geometric correspondence gap between these two domains. 
In doing so, we ease the\ntask of learning multiple correspondences (i.e., geometry and feature representations) and only need\nto learn a simple feature mapping task, thus signi\ufb01cantly facilitating network convergence.\nWe apply polar transform to aerial images in order to build more apparent spatial correspondences\nbetween aerial and ground images. Speci\ufb01cally, we take the center of each aerial image as the\npolar origin and the north direction (as it is often available for a satellite image) as angle 0\u00b0 in\nthe polar transform. Note that there is no ad hoc pre-centering process for the aerial images, and\nwe do not assume that the location of a query ground-level image corresponds to the center of an\naerial image during testing. In fact, small offsets on the polar origin do not affect the appearance of\npolar-transformed aerial images severely, and the small appearance changes will be reduced by our\nSPE modules (as illustrated in detail in Section 3.2). On the contrary, when a large offset occurs, the\naerial image should be regarded as a negative sample and the polar-transformed aerial image will\nbe signi\ufb01cantly different from the ground-truth one. In this manner, the polar transform effectively\nincreases the discriminativeness of our model.\n\nFigure 2: Illustration of the pipeline of our proposed method.\n\nTo facilitate training of our two-branch network, we constrain the size of the transformed aerial\nimages to be the same as the ground ones, $W_g \\times H_g$. Note that the size of the original aerial images\nis $A_a \\times A_a$. 
Therefore, the polar transform between the original aerial image points $(x_i^s, y_i^s)$ and the\ntarget transformed aerial image points $(x_i^t, y_i^t)$ is de\ufb01ned as:\n\n$$\\begin{aligned} x_i^s &= \\frac{A_a}{2} + \\frac{A_a}{2} \\frac{y_i^t}{H_g} \\sin\\left(\\frac{2\\pi}{W_g} x_i^t\\right), \\\\ y_i^s &= \\frac{A_a}{2} - \\frac{A_a}{2} \\frac{y_i^t}{H_g} \\cos\\left(\\frac{2\\pi}{W_g} x_i^t\\right). \\end{aligned} \\tag{1}$$\n\nAfter polar transform, the objects in the transformed aerial images lie in similar positions to their\ncounterparts in the ground images, as seen in Figure 1(d). However, the appearance distortions\nare still obvious in the transformed images because polar transform does not take the depth of the\nscene content into account. Reducing these distortion artifacts for image descriptor extraction is also\ndesirable.\n\n3.2 Spatial-aware Feature Aggregation (SAFA)\n\nAs illustrated in Figure 2, we \ufb01rst employ a backbone network, i.e., the \ufb01rst sixteen layers of VGG19\n[16], to extract features from ground and polar-transformed aerial images. Considering the features\nfrom aerial images undergo distortions, we intend to impose an attention mechanism to select salient\nfeatures while suppressing the features caused by the distortions. Moreover, since spatial layout\nprovides important clues for image matching, we aim to embed spatial con\ufb01guration into our feature\nrepresentation as well. Thus, we develop a spatial-aware feature aggregation (SAFA) module to\nalleviate the distortions in transformed aerial images while embedding the object features into a\ndiscriminative global image descriptor for image matching. Our SAFA is built upon the outputs\nof a Siamese network and learns to encode ground and aerial features individually. The detailed\narchitecture of SAFA is shown in Figure 3.\nSpatial-aware Position Embedding Module (SPE):\nOur SPE is designed to encode the relative positions among object features extracted by the CNN\nbackbone network, as well as the important features. 
In particular, given input feature maps from one\nbranch, our SPE automatically determines an embedding position map from them. Note that we do\nnot enforce any additional supervision for SPE and it is learned in a self-attention fashion by a metric\nlearning objective. Moreover, although polar transform can signi\ufb01cantly reduce the domain gap in\nterms of geometric con\ufb01guration, object distortions still exist and cannot be removed by an explicit\nfunction. Thus, we employ SPE to select the features from transformed aerial images while reducing\nthe impact of the distortion artifacts in the feature extraction.\n\nFigure 3: Spatial-aware position embedding module.\n\nFigure 3 illustrates the work\ufb02ow of our SPE module. Our SPE \ufb01rst employs a max-pooling operator\nalong feature channels to choose the most distinct object feature, and then adopts a Spatial-aware\nImportance Generator to generate a position embedding map. In the Spatial-aware Importance\nGenerator, two fully connected layers are used to select features among the prominent ones as well as\nencode the spatial combinations and feature responses. In this manner, our method can mitigate the\nimpact of the features from distortions caused by polar transform while representing input images by\nusing salient features. Furthermore, since we choose features based on a certain layout, the encoded\nfeatures not only represent the presence of certain objects but also re\ufb02ect the positions of the\nobjects. 
Hence, we encode the spatial layout information into feature representation, thus improving\nthe discriminativeness of our descriptors.\nGiven the position embedding map $P \\in \\mathbb{R}^{H \\times W}$, the feature descriptor $K = \\{k^c\\}$, $c = 1, 2, \\ldots, C$, is\ncalculated as:\n\n$$k^c = \\langle f^c, P \\rangle_F, \\tag{2}$$\n\nwhere $f^c \\in \\mathbb{R}^{H \\times W}$ represents the input feature map of the SPE module in the c-th channel, $\\langle \\cdot, \\cdot \\rangle_F$\ndenotes the Frobenius inner product of the two inputs, and $k^c$ is the embedded feature activation for\nthe c-th channel.\nAs seen in Figure 1, only a certain region achieves high responses in the visualized feature maps. This\nindicates that our SPE not only localizes the salient features but also encodes the layout information\nof those features. Note that the SPE module is adopted in both the ground and aerial branches, and\nour objective forces them to encode correspondent features between these two branches.\nMultiple Position-embedded Feature Aggregation: Motivated by the feature aggregation strategy\n[9], we aim to improve the robustness of our feature representation by aggregating our embedded\nfeatures. Towards this goal, we employ multiple SPE modules with the same architecture but different\nweights to generate multiple embedding maps, and then encode input features in accordance with the\ndifferent generated masks. For instance, some maps focus on the layout of roads while some focus on\ntrees. Therefore, we can explore different spatial layout information in the input images. As illustrated\nin Figure 2, we concatenate the embedded features together as our \ufb01nal image descriptor. Note that,\nwe do not impose any constraint on generating diverse embedding maps but learn embeddings through\nour metric learning objective. During training, in order to minimize the loss function, our descriptors\nshould be more discriminative. 
Therefore, the loss function inherently forces our embedding maps to\nencode different spatial con\ufb01gurations to increase the discriminativeness of our embedded features.\n\n3.3 Training Objective\n\nWe apply a metric learning objective to learn feature representations for both the ground and aerial\nimage branches. The triplet loss is widely used to train deep networks for image localization and\nmatching tasks [5, 11, 19]. The goal of the triplet loss is to make matching pairs closer while pushing\nunmatching pairs far apart. Similar to [5], we employ a weighted soft-margin triplet loss as our\nobjective:\n\n$$\\mathcal{L} = \\log\\left(1 + e^{\\gamma (d_{pos} - d_{neg})}\\right), \\tag{3}$$\n\nwhere $d_{pos}$ and $d_{neg}$ are the $\\ell_2$ distances of the matching and unmatching image pairs, respectively, and $\\gamma$ is a parameter that\nadjusts the gradient of the loss, thus controlling the convergence speed.\n\nFigure 4: Ground-to-aerial image pairs sampled from (a) CVUSA [24] and (b) CVACT [11]. Each sub\ufb01gure\nillustrates a ground image (left) and an aerial image (right).\n\n4 Experiments\n\nTraining and Testing Datasets: Our experiments are conducted on two standard benchmark\ndatasets: CVUSA [24] and CVACT [11], where ground images are panoramas. CVUSA and CVACT\nare both cross-view datasets, and each dataset contains 35,532 ground-and-aerial image pairs for\ntraining. CVUSA provides 8,884 image pairs for testing and CVACT provides the same number\nof pairs for validation (denoted as CVACT_val). Besides, CVACT also provides 92,802 cross-view\nimage pairs with accurate Geo-tags to evaluate Geo-localization performance (denoted as\nCVACT_test). CVACT_test is a real geo-localization/retrieval test set where all aerial images within\n5 meters of a query ground image are regarded as ground truth correspondences for this query image.\nIn other words, for a query ground image, there may exist several corresponding aerial images in the\ndatabase. 
Note that in these two datasets the ground and aerial images are captured at different times.\nFigure 4 presents sampled image pairs from these two datasets.\n\nImplementation Details: We use the VGG16 model with pretrained weights on Imagenet [4] as\nour backbone to extract features from cross-view images, and the output of the last convolutional\nlayer of VGG16 is fed into the proposed SAFA block (footnote 1). The parameters in our proposed SPE module\nare randomly initialized. Similar to [5, 11], we set $\\gamma$ to 10 for the triplet loss. Our network is trained\nwith Adam optimizer [7], and the learning rate is set to $10^{-5}$. Exhaustive mini-batch strategy [19] is\nutilized to create triplet images within a batch, and the batch size $B_s$ is set to 32. In a mini-batch,\nthere is 1 matching/positive aerial image and $B_s - 1$ unmatching/negative aerial images for each\nground-view image. Thus, we construct $B_s(B_s - 1)$ triplets in total. Similarly, for each aerial\nimage, there is 1 matching ground-view image and $B_s - 1$ unmatching ground-view images, and\nthus $B_s(B_s - 1)$ triplets are also constructed. Hence, we have $2B_s(B_s - 1)$ triplets in total within\neach batch.\n\nEvaluation Metric: Similar to [19, 5, 11], we use the recall accuracy at top K as our evaluation\nmetric to examine the performance of our model and compare with the state-of-the-art methods. Speci\ufb01cally,\ngiven a ground-level query image, it is regarded as \u201csuccessfully localized\u201d if its ground-truth\naerial image is within the nearest top K retrieved images. The percentage of query images which\nhave been correctly localized is reported as r@K.\n\n4.1 Comparison with State-of-the-Art Methods\n\nWe compare our method with two recent state-of-the-art cross-view localization methods: CVM-NET\n[5] and Liu & Li\u2019s method [11]. For fair comparisons, we use the released models or codes provided\nby the authors. 
In our method, we apply polar transform to the aerial images and our SAFA outputs 8\nspatial-aware embedding maps and then aggregates these embedded features, denoted as Polar_SAFA\n(M = 8). Note that the dimension of our descriptors is the same as that used in CVM-NET. We\nreport recalls at top-1, top-5, top-10, up to top 1%, and the results are listed in Table 1.\nAs indicated by Table 1, our method signi\ufb01cantly outperforms all the state-of-the-art methods. In\nparticular, we almost double the recall at top-1 compared to Liu & Li\u2019s method. The complete\nrecall@K performance is shown in Figure 5.\n\nFootnote 1: The code of this paper is available at https://github.com/shiyujiao/SAFA.\n\nTable 1: Comparison with state-of-the-art methods on CVUSA [24] and CVACT_val dataset [11].\n\nMethod | CVUSA r@1 | CVUSA r@5 | CVUSA r@10 | CVUSA r@1% | CVACT_val r@1 | CVACT_val r@5 | CVACT_val r@10 | CVACT_val r@1%\nCVM-NET [5] | 22.53 | 50.01 | 63.19 | 93.52 | 20.15 | 45.00 | 56.87 | 87.57\nLiu & Li [11] | 40.79 | 66.82 | 76.36 | 96.08 | 46.96 | 68.28 | 75.48 | 92.01\nOur Polar_SAFA (M = 8) | 89.84 | 96.93 | 98.14 | 99.64 | 81.03 | 92.80 | 94.84 | 98.17\n\nFigure 5: Recall rates on cross-view Geo-localization datasets: (a) CVUSA, (b) CVACT_val, (c) CVACT_test. This \ufb01gure demonstrates that our\nmethod (i.e., Polar_SAFA (M = 8)) signi\ufb01cantly outperforms the state-of-the-art methods.\n\n4.2 Accurate Geo-localization\n\nWe conduct experiments on the large-scale CVACT_test dataset [11] to illustrate the effectiveness\nof our method for accurate city-scale Geo-localization applications. We also compare with the\nstate-of-the-art methods, CVM-NET [5] and Liu & Li\u2019s method [11]. The recall performance at top-K\nis shown in Figure 5(c). 
Our method signi\ufb01cantly outperforms the second-best method [11], with a\nrelative improvement of 35.6% at top-1.\n\n4.3 Visualization of Learned Spatial Correspondences\n\nTo visualize our generated embedding maps, we employ the method of [23] to back-propagate the\nembedding maps to the input ground image as well as the polar-transformed aerial image. As visible\nin Figure 6, our SPE is able to encode similar spatial layout as well as feature correspondences\nbetween ground and polar-transformed aerial images. Furthermore, different SPE modules can\ngenerate different spatial embedding maps. In this way, we can encode multiple spatial layouts into\nour feature representations.\n\n4.4 Ablation Study\n\nIn this part, we demonstrate the effectiveness of our proposed polar transform and Spatial-aware\nPosition Embedding (SPE) modules. For the baseline network, we remove the polar transform from\nour network and replace the SPE module with a global max-pooling operator, which has been widely\nadopted in image retrieval tasks [5, 11, 1]. In this case, spatial correspondences between ground and\naerial branches are not used and the baseline network is only trained by our triplet loss.\n\nEffects of Polar Transform: To demonstrate the effectiveness of polar transform for the cross-view\nGeo-localization problem, we train our baseline network in two different settings: one takes\noriginal cross-view ground and aerial images, marked as VGG_gp, and the other takes ground and\npolar-transformed aerial images, marked as Polar_VGG_gp. As indicated in Table 2, applying polar\ntransform to aerial images improves the performance greatly on both datasets.\nMoreover, we also investigate the applicability of polar transform to other cross-view Geo-localization\nmodels. Liu & Li [11] needs an additional pixel-wise orientation map for input images and the\norientation maps are not available for polar transformed images. Thus, we only conduct experiments\non CVM-NET [5]. 
As illustrated in Table 2, using the polar-transformed aerial images as input, we\neven improve the performance of CVM-NET by 27.47% on CVUSA and 14.77% on CVACT_val at r@1 (absolute gains).\n\nFigure 6: Visualization of eight groups of generated spatial embedding maps for ground and polar-transformed\naerial images (column pairs: Ground, Polar-transformed Aerial). The corresponding ground and polar-transformed aerial images are shown in\nFigure 1(b) and Figure 1(d). (Best viewed on screen with zoom-in)\n\nTable 2: Effectiveness demonstration of polar transform.\n\nMethod | CVUSA r@1 | CVUSA r@5 | CVUSA r@10 | CVUSA r@1% | CVACT_val r@1 | CVACT_val r@5 | CVACT_val r@10 | CVACT_val r@1%\nVGG_gp | 39.72 | 66.91 | 77.49 | 96.38 | 32.22 | 59.08 | 69.41 | 91.85\nPolar_VGG_gp | 65.74 | 84.76 | 89.91 | 98.30 | 56.65 | 79.20 | 84.98 | 95.76\nCVM-NET [5] | 22.53 | 50.01 | 63.19 | 93.52 | 20.15 | 45.00 | 56.87 | 87.57\nPolar_CVM-NET | 50.00 | 77.22 | 85.13 | 97.82 | 34.92 | 61.74 | 71.05 | 91.78\n\nEffects of Spatial-aware Position Embedding: We demonstrate the effectiveness of our proposed\nSpatial-aware Position Embedding (SPE) module using original cross-view images as inputs. We\n\ufb01rst replace the global max-pooling in VGG_gp with a single SPE module. Since our SPE module\nexplicitly establishes spatial correspondences for cross-view images, it outperforms VGG_gp as\nindicated in Table 3. In particular, our single SPE model achieves 58.79% on CVUSA and 42.96% on\nCVACT_val for r@1, and obtains 48% and 33% relative improvements compared with VGG_gp,\nrespectively.\n\nEffects of Multiple Spatial-aware Position Embeddings: To demonstrate the effectiveness of\naggregating feature embeddings by using multiple SPE modules, we use different numbers of SPE\nmodules, i.e., 1, 2, 4, and 8, and report the recall rates in Table 3. 
The results indicate that as M\nincreases, we can obtain better recall performance. Note that signi\ufb01cant improvements (at least 10%) for\nr@1 are obtained when M increases from 1 to 2 and from 2 to 4. However, when M increases from\n4 to 8, we only attain slight improvements (<4%). Therefore, we do not increase M to an even larger\nnumber. As indicated by Table 3, our method, combining polar transform and multiple SPE modules,\nachieves the best performance on both datasets. By employing polar transform, we improve the\nperformance by over 7%, thus demonstrating the effectiveness of polar transform as well.\n\n5 Conclusion\n\nWe have proposed a new deep network to solve the cross-view image based Geo-localization problem.\nOur network addresses the dif\ufb01culty caused by signi\ufb01cant domain differences between ground-level\nand aerial-view images by a two-step procedure. The \ufb01rst step approximately brings the two image\ndomains into a rough geometric alignment, and a subsequent spatial-attention mechanism further\nalleviates content-dependent domain discrepancy. Our key idea is to exploit available problem-dependent\ngeometric priors of the task. In contrast to existing methods, we exploit the geometric\nconstraint to coarsely align one domain to the other \ufb01rst. By doing so, we can force our network to\nfocus on learning discriminative features without also having to minimize the domain gap. 
Moreover,\nwe propose a spatial-aware feature aggregation module to not only embed features but also the feature\nlayout information, achieving more discriminative image descriptors. Since the cross-view feature\nlearning process has been decoupled, the domain gap does not affect feature learning. Our method\nis able to learn more discriminative image descriptors and thus outperforms the state-of-the-art.\nAlthough our current experiments are conducted on query ground images which are panoramas with\nknown orientation, this restriction can be relaxed under the same network architecture and this is left\nas our future extension.\n\nTable 3: Effectiveness demonstration of the proposed SPE modules.\n\nMethod | CVUSA r@1 | CVUSA r@5 | CVUSA r@10 | CVUSA r@1% | CVACT_val r@1 | CVACT_val r@5 | CVACT_val r@10 | CVACT_val r@1%\nVGG_gp | 39.72 | 66.91 | 77.49 | 96.38 | 32.22 | 59.08 | 69.41 | 91.85\nSAFA (M = 1) | 58.79 | 84.19 | 90.84 | 99.08 | 42.96 | 71.51 | 80.56 | 95.48\nSAFA (M = 2) | 69.33 | 89.01 | 93.52 | 99.31 | 58.98 | 82.86 | 88.46 | 97.13\nSAFA (M = 4) | 79.93 | 93.29 | 96.15 | 99.54 | 74.61 | 90.02 | 93.03 | 98.01\nSAFA (M = 8) | 81.15 | 94.23 | 96.85 | 99.49 | 78.28 | 91.60 | 93.79 | 98.15\nPolar_SAFA (M = 8) | 89.84 | 96.93 | 98.14 | 99.64 | 81.03 | 92.80 | 94.84 | 98.17\n\nFigure 7: Comparison of recalls on CVUSA [5] and CVACT_val [11] datasets: (a) with and without polar transform, (b) different number of SPE modules.\n\nAcknowledgments\n\nThis research is supported in part by China Scholarship Council (201708320417), the Australia\nResearch Council ARC Centre of Excellence for Robotics Vision (CE140100016), ARC-Discovery\n(DP 190102261) and ARC-LIEF (190100080), and in part by a research gift from Baidu RAL\n(ApolloScapes-Robotics and Autonomous Driving Lab). The authors gratefully acknowledge the\nGPU gift donated by NVIDIA Corporation. 
We thank all anonymous reviewers for their constructive\ncomments.\n\nReferences\n\n[1] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn\narchitecture for weakly supervised place recognition. In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition, pages 5297\u20135307, 2016.\n\n[2] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In\nEuropean conference on computer vision, pages 404\u2013417. Springer, 2006.\n\n[3] Francesco Castaldo, Amir Zamir, Roland Angst, Francesco Palmieri, and Silvio Savarese.\nSemantic cross-view matching. In Proceedings of the IEEE International Conference on\nComputer Vision Workshops, pages 9\u201317, 2015.\n\n[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In 2009 IEEE conference on computer vision and pattern\nrecognition, pages 248\u2013255. IEEE, 2009.\n\n[5] Sixing Hu, Mengdan Feng, Rang M. H. Nguyen, and Gim Hee Lee. Cvm-net: Cross-view\nmatching network for image-based ground-to-aerial geo-localization. In The IEEE Conference\non Computer Vision and Pattern Recognition (CVPR), June 2018.\n\n[6] Herv\u00e9 J\u00e9gou, Matthijs Douze, Cordelia Schmid, and Patrick P\u00e9rez. Aggregating local descriptors\ninto a compact image representation. In Computer Vision and Pattern Recognition (CVPR),\n2010 IEEE Conference on, pages 3304\u20133311. IEEE, 2010.\n\n[7] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\n[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 
Imagenet classification with deep\nconvolutional neural networks. In Advances in neural information processing systems, pages\n1097\u20131105, 2012.\n\n[9] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid\nmatching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference\non Computer Vision and Pattern Recognition (CVPR\u201906), volume 2, pages 2169\u20132178. IEEE,\n2006.\n\n[10] Tsung-Yi Lin, Serge Belongie, and James Hays. Cross-view image geolocalization. In Proceedings\nof the IEEE Conference on Computer Vision and Pattern Recognition, pages 891\u2013898,\n2013.\n\n[11] Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization.\nIn The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nJune 2019.\n\n[12] David G Lowe. Distinctive image features from scale-invariant keypoints. International\nJournal of Computer Vision, 60(2):91\u2013110, 2004.\n\n[13] Arsalan Mousavian and Jana Kosecka. Semantic image based geolocation given a map. arXiv\npreprint arXiv:1609.00278, 2016.\n\n[14] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng\nHuang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual\nrecognition challenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\n[15] Yujiao Shi, Xin Yu, Liu Liu, Tong Zhang, and Hongdong Li. Optimal feature transport for\ncross-view image geo-localization. arXiv preprint arXiv:1907.05021, 2019.\n\n[16] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale\nimage recognition. CoRR, abs/1409.1556, 2014.\n\n[17] Josef Sivic and Andrew Zisserman. Video google: Efficient visual search of videos. In Toward\ncategory-level object recognition, pages 127\u2013144. Springer, 2006.\n\n[18] Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. 
Sosnet:\nSecond order similarity regularization for local descriptor learning. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, pages 11016\u201311025, 2019.\n\n[19] Nam N Vo and James Hays. Localizing and orienting street views using overhead imagery. In\n\nEuropean Conference on Computer Vision, pages 494\u2013509. Springer, 2016.\n\n[20] Scott Workman and Nathan Jacobs. On the location dependence of convolutional neural network\nfeatures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\nWorkshops, pages 70\u201378, 2015.\n\n[21] Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with\naerial reference imagery. In Proceedings of the IEEE International Conference on Computer\nVision, pages 3961\u20133969, 2015.\n\n[22] Xin Yu, Yurun Tian, Fatih Porikli, Richard Hartley, Hongdong Li, Huub Heijnen, and Vassileios\nBalntas. Unsupervised extraction of local image descriptors via relative distance ranking loss.\nIn The IEEE International Conference on Computer Vision (ICCV) Workshops, 2019.\n\n[23] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In\n\nEuropean conference on computer vision, pages 818\u2013833. Springer, 2014.\n\n[24] Menghua Zhai, Zachary Bessinger, Scott Workman, and Nathan Jacobs. Predicting ground-\nlevel scene layout from aerial imagery. In IEEE Conference on Computer Vision and Pattern\nRecognition, volume 3, 2017.\n\n[25] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning\ndeep features for scene recognition using places database. 
In Advances in neural information\nprocessing systems, pages 487\u2013495, 2014.", "award": [], "sourceid": 5336, "authors": [{"given_name": "Yujiao", "family_name": "Shi", "institution": "Australian National University"}, {"given_name": "Liu", "family_name": "Liu", "institution": "ANU"}, {"given_name": "Xin", "family_name": "Yu", "institution": "Australian National University"}, {"given_name": "Hongdong", "family_name": "Li", "institution": "Australian National University"}]}