{"title": "ExtremeWeather: A large-scale climate dataset for semi-supervised detection, localization, and understanding of extreme weather events", "book": "Advances in Neural Information Processing Systems", "page_first": 3402, "page_last": 3413, "abstract": "The detection and identification of extreme weather events in large-scale climate simulations is an important problem for risk management, informing governmental policy decisions and advancing our basic understanding of the climate system. Recent work has shown that fully supervised convolutional neural networks (CNNs) can yield acceptable accuracy for classifying well-known types of extreme weather events when large amounts of labeled data are available. However, many different types of spatially localized climate patterns are of interest including hurricanes, extra-tropical cyclones, weather fronts, and blocking events among others. Existing labeled data for these patterns can be incomplete in various ways, such as covering only certain years or geographic areas and having false negatives. This type of climate data therefore poses a number of interesting machine learning challenges. We present a multichannel spatiotemporal CNN architecture for semi-supervised bounding box prediction and exploratory data analysis. We demonstrate that our approach is able to leverage temporal information and unlabeled data to improve the localization of extreme weather events. Further, we explore the representations learned by our model in order to better understand this important data. We present a dataset, ExtremeWeather, to encourage machine learning research in this area and to help facilitate further work in understanding and mitigating the effects of climate change. 
The dataset is available at extremeweatherdataset.github.io and the code is available at https://github.com/eracah/hur-detect.", "full_text": "ExtremeWeather: A large-scale climate dataset for\nsemi-supervised detection, localization, and\nunderstanding of extreme weather events\n\nEvan Racah1,2, Christopher Beckham1,3, Tegan Maharaj1,3,\nSamira Ebrahimi Kahou4, Prabhat2, Christopher Pal1,3\n\n1 MILA, Universit\u00e9 de Montr\u00e9al, evan.racah@umontreal.ca.\n\n2 Lawrence Berkeley National Lab, Berkeley, CA, prabhat@lbl.gov.\n\n3 \u00c9cole Polytechnique de Montr\u00e9al, firstname.lastname@polymtl.ca.\n\n4 Microsoft Maluuba, samira.ebrahimi@microsoft.com.\n\nAbstract\n\nThe detection and identi\ufb01cation of extreme weather events in large-scale climate\nsimulations is an important problem for risk management, informing governmental\npolicy decisions and advancing our basic understanding of the climate system.\nRecent work has shown that fully supervised convolutional neural networks (CNNs)\ncan yield acceptable accuracy for classifying well-known types of extreme weather\nevents when large amounts of labeled data are available. However, many different\ntypes of spatially localized climate patterns are of interest including hurricanes,\nextra-tropical cyclones, weather fronts, and blocking events among others. Existing\nlabeled data for these patterns can be incomplete in various ways, such as covering\nonly certain years or geographic areas and having false negatives. This type of\nclimate data therefore poses a number of interesting machine learning challenges.\nWe present a multichannel spatiotemporal CNN architecture for semi-supervised\nbounding box prediction and exploratory data analysis. We demonstrate that our\napproach is able to leverage temporal information and unlabeled data to improve\nthe localization of extreme weather events. Further, we explore the representations\nlearned by our model in order to better understand this important data. 
We present\na dataset, ExtremeWeather, to encourage machine learning research in this area and\nto help facilitate further work in understanding and mitigating the effects of climate\nchange. The dataset is available at extremeweatherdataset.github.io and\nthe code is available at https://github.com/eracah/hur-detect.\n\n1 Introduction\n\nClimate change is one of the most important challenges facing humanity in the 21st century, and\nclimate simulations are one of the only viable mechanisms for understanding the future impact of\nvarious carbon emission scenarios and intervention strategies. Large climate simulations produce\nmassive datasets: a simulation of 27 years from a 25 square km, 3 hour resolution model produces on\nthe order of 10TB of multi-variate data. This scale of data makes post-processing and quantitative\nassessment challenging, and as a result, climate analysts and policy makers typically take global\nand annual averages of temperature or sea-level rise. While these coarse measurements are useful\nfor public and media consumption, they ignore spatially (and temporally) resolved extreme weather\nevents such as extra-tropical cyclones and tropical cyclones (hurricanes). 
Because the general public\nand policy makers are concerned about the local impacts of climate change, it is critical that we be\nable to examine how localized weather patterns (such as tropical cyclones), which can have dramatic\nimpacts on populations and economies, will change in frequency and intensity under global warming.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nDeep neural networks, especially deep convolutional neural networks, have enjoyed breakthrough success\nin recent years, achieving state-of-the-art results on many benchmark datasets (Krizhevsky\net al., 2012; He et al., 2015; Szegedy et al., 2015) and also compelling results on many practical\ntasks such as disease diagnosis (Hosseini-Asl et al., 2016), facial recognition (Parkhi et al., 2015),\nautonomous driving (Chen et al., 2015), and many others. Furthermore, deep neural networks\nhave also been very effective in the context of unsupervised and semi-supervised learning; some\nrecent examples include variational autoencoders (Kingma & Welling, 2013), adversarial networks\n(Goodfellow et al., 2014; Makhzani et al., 2015; Salimans et al., 2016; Springenberg, 2015), ladder\nnetworks (Rasmus et al., 2015) and stacked what-where autoencoders (Zhao et al., 2015).\nThere is a recent trend towards video datasets aimed at better understanding spatiotemporal relations\nand multimodal inputs (Kay et al., 2017; Gu et al., 2017; Goyal et al., 2017). The task of \ufb01nding\nextreme weather events in climate data is similar to the task of detecting objects and activities in\nvideo - a popular application for deep learning techniques. An important difference is that in the case\nof climate data, the \u2019video\u2019 has 16 or more \u2019channels\u2019 of information (such as water vapour, pressure\nand temperature), while conventional video only has 3 (RGB). In addition, climate simulations do not\nshare the same statistics as natural images. 
As a result, unlike many popular techniques for video, we\nhypothesize that we cannot build off successes from the computer vision community such as using\npretrained weights from CNNs (Simonyan & Zisserman, 2014; Krizhevsky et al., 2012) pretrained on\nImageNet (Russakovsky et al., 2015).\nClimate data thus poses a number of interesting machine learning problems: multi-class classi\ufb01cation\nwith unbalanced classes; partial annotation; anomaly detection; distributional shift and bias correction;\nspatial, temporal, and spatiotemporal relationships at widely varying scales; relationships between\nvariables that are not fully understood; issues of data and computational ef\ufb01ciency; opportunities\nfor semi-supervised and generative models; and more. Here, we address multi-class detection and\nlocalization of four extreme weather phenomena: tropical cyclones, extra-tropical cyclones, tropical\ndepressions, and atmospheric rivers. We implement a 3D (height, width, time) convolutional encoder-\ndecoder, with a novel single-pass bounding-box regression loss applied at the bottleneck. To our\nknowledge, this is the \ufb01rst use of a deep autoencoding architecture for bounding-box regression. This\narchitectural choice allows us to do semi-supervised learning in a very natural way (simply training\nthe autoencoder with reconstruction for unlabelled data), while providing relatively interpretable\nfeatures at the bottleneck. 
This is appealing for use in the climate community, as current engineered\nheuristics do not perform as well as human experts for identifying extreme weather events.\nOur main contributions are (1) a baseline bounding-box loss formulation; (2) our architecture, a \ufb01rst\nstep away from engineered heuristics for extreme weather events, towards semi-supervised learned\nfeatures; (3) the ExtremeWeather dataset, which we make available in three benchmarking splits: one\nsmall, for model exploration, one medium, and one comprising the full 27 years of climate simulation\noutput.\n\n2 Related work\n\n2.1 Deep learning for climate and weather data\nClimate scientists do use basic machine learning techniques, for example PCA for dimensionality\nreduction (Monahan et al., 2009) and k-means for clustering (Steinhaeuser et al.,\n2011). However, the climate science community primarily relies on expert engineered systems and\nad-hoc rules for characterizing climate and weather patterns. Of particular relevance is TECA\n(Toolkit for Extreme Climate Analysis) (Prabhat et al., 2012, 2015), an application of large scale\npattern detection on climate data using heuristic methods. How TECA works is explained in more\ndetail in Section 3. Using the output of TECA analysis (centers of storms and bounding\nboxes around these centers) as ground truth, Liu et al. (2016) demonstrated for the \ufb01rst time that\nconvolutional architectures could be successfully applied to predict the class label for two extreme\nweather event types. Their work considered the binary classi\ufb01cation task on centered, cropped patches\nfrom 2D (single-timestep) multi-channel images. Like Liu et al. (2016), we use TECA\u2019s output\n(centers and bounding boxes) as ground truth, but we build on the work of Liu et al. 
(2016) by: 1)\nusing uncropped images, 2) considering the temporal axis of the data, 3) doing multi-class bounding\nbox detection, and 4) taking a semi-supervised approach with a hybrid predictive and reconstructive\nmodel.\n\nSome recent work has applied deep learning methods to weather forecasting. Xingjian et al. (2015)\nhave explored a convolutional LSTM architecture (described in Section 2.2) for predicting future precipitation\non a local scale (i.e. the size of a city) using radar echo data. In contrast, we focus on extreme\nevent detection on planetary-scale data. Our aim is to capture patterns which are very local in time\n(e.g. a hurricane may be present in half a dozen sequential frames), compared to the scale of our\nunderlying climate data, consisting of global simulations over many years. As such, 3D CNNs\nseemed to make more sense for our detection application, compared to LSTMs whose strength is in\ncapturing long-term dependencies.\n\n2.2 Related methods and models\nFollowing the dramatic success of CNNs in static 2D images, a wide variety of CNN architectures\nhave been explored for video, e.g. (Karpathy et al., 2014; Yao et al., 2015; Tran et al., 2014). The\ndetails of how CNNs are extended to capture the temporal dimension are important. Karpathy et al.\n(2014) explore different strategies for fusing information from 2D CNN subcomponents; in contrast,\nYao et al. (2015) create 3D volumes of statistics from low level image features.\nConvolutional networks have also been combined with RNNs (recurrent neural networks) for modeling\nvideo and other sequence data, and we brie\ufb02y review some relevant video models here. The\nmost common and straightforward approach to modeling sequential images is to feed single-frame\nrepresentations from a CNN at each timestep to an RNN. 
This approach has been examined for\na number of different types of video (Donahue et al., 2015; Ebrahimi Kahou et al., 2015), while\n(Srivastava et al., 2015) have explored an LSTM architecture for the unsupervised learning of video\nrepresentations using a pretrained CNN representation as input. These architectures separate learning\nof spatial and temporal features, something which is not desirable for climate patterns. Another\npopular model, also used on 1D data, is a convolutional RNN, wherein the hidden-to-hidden transition\nlayer is 1D convolutional (i.e. the state is convolved over time). (Ballas et al., 2016) combine these\nideas, applying a convolutional RNN to frames processed by a (2D) CNN.\nThe 3D CNNs we use here are based on 3-dimensional convolutional \ufb01lters, taking the height, width,\nand time axes into account for each feature map, as opposed to aggregated 2D CNNs. This approach\nwas studied in detail in (Tran et al., 2014). 3D convolutional neural networks have been used for\nvarious tasks ranging from human activity recognition (Ji et al., 2013), to large-scale YouTube video\nclassi\ufb01cation (Karpathy et al., 2014), and video description (Yao et al., 2015). Hosseini-Asl et al.\n(2016) use a 3D convolutional autoencoder for diagnosing Alzheimer\u2019s disease through MRI - in\ntheir case, the 3 dimensions are height, width, and depth. (Whitney et al., 2016) use 3D (height,\nwidth, depth) \ufb01lters to predict consecutive frames of a video game for continuation learning. Recent\nwork has also examined ways to use CNNs to generate animated textures and sounds (Xie et al.,\n2016). 
This work is similar to our approach in using a 3D convolutional encoder, but where their\napproach is stochastic and used for generation, ours is deterministic, used for multi-class detection\nand localization, and also comprises a 3D convolutional decoder for unsupervised learning.\nStepping back, our approach is related conceptually to (Misra et al., 2015), who use semi-supervised\nlearning for bounding-box detection, but their approach uses iterative heuristics with a support vector\nmachine (SVM) classifier, an approach which would not allow learning of spatiotemporal features.\nOur setup is also similar to recent work from (Zhang et al., 2016) (and others) in using a hybrid\nprediction and autoencoder loss. This strategy has not, to our knowledge, been applied either to\nmultidimensional data or bounding-box prediction, as we do here. Our bounding-box prediction loss\nis inspired by (Redmon et al., 2015), an approach extended in (Ren et al., 2015), as well as the Single\nShot MultiBox Detector formulation used in (Liu et al., 2015) and the seminal bounding-box work in\nOverFeat (Sermanet et al., 2013). Details of this loss are described in Section 4.\n\n3 The ExtremeWeather dataset\n3.1 The Data\n\nThe climate science community uses three \ufb02avors of global datasets: observational products (satellite,\ngridded weather station); reanalysis products (obtained by assimilating disparate observational\nproducts into a climate model) and simulation products. In this study, we analyze output from the\nthird category because we are interested in climate change projection studies. We would like to better\nunderstand how Earth\u2019s climate will change by the year 2100, and it is only possible to conduct\nsuch an analysis on simulation output. Although this dataset covers the past, the performance of\ndeep learning methods on this dataset can still inform the effectiveness of these approaches on future\nsimulations. 
We consider the CAM5 (Community Atmospheric Model v5) simulation, which is a\nstandardized three-dimensional, physical model of the atmosphere used by the climate community to\nsimulate the global climate (Conley et al., 2012). When it is con\ufb01gured at 25-km spatial resolution\n(Wehner et al., 2015), each snapshot of the global atmospheric state in the CAM5 model output is a\n768x1152 image, having 16 \u2019channels\u2019, each corresponding to a different simulated variable (like\nsurface temperature, surface pressure, precipitation, zonal wind, meridional wind, humidity, cloud\nfraction, water vapor, etc.). The global climate is simulated at a temporal resolution of 3 hours, giving\n8 snapshots (images) per day. The data we provide is from a simulation of 27 years from 1979 to\n2005. In total, this gives 78,840 16-channel 768x1152 images.\n\n3.2 The Labels\n\nGround-truth labels are created for four extreme weather events: Tropical Depressions (TD), Tropical\nCyclones (TC), Extra-Tropical Cyclones (ETC) and Atmospheric Rivers (AR) using TECA (Prabhat\net al., 2012). TECA generally works by proposing candidate coordinates for storm centers,\nselecting only points that satisfy a certain combination of criteria, which usually involves requiring\nthe values of various variables (such as pressure, temperature and wind speed) to lie between certain\nthresholds. These candidates are then re\ufb01ned by breaking ties and matching the \"same\" storms across\ntime (Prabhat et al., 2012). These storm centers are then used as the center coordinates for bounding\nboxes. The size of the boxes is determined using prior domain knowledge as to how big these storms\nusually are, as described in (Liu et al., 2016). Every other image (i.e. 4 per day) is labeled due to\ncertain design decisions made during the production run of the TECA code. 
This gives us 39,420\nlabeled images.\n\n3.2.1 Issues with the Labels\n\nTECA, the ground truth labeling framework, implements heuristics to assign \u2019ground truth\u2019 labels\nfor the four types of extreme weather events. However, it is entirely possible there are errors in the\nlabeling: for instance, there is little agreement in the climate community on a standard heuristic for\ncapturing Extra-Tropical Cyclones (Neu et al., 2013); Atmospheric Rivers have been extensively\nstudied in the northern hemisphere (Lavers et al., 2012; Dettinger et al., 2011), but not in the southern\nhemisphere; and spatial extents of such events are not universally agreed upon. In addition, this labeling\nonly includes ARs in the US and not in Europe. As such, there is potential for many false negatives,\nresulting in partially annotated images. Lastly, it is worth mentioning that because the ground truth\ngeneration is a simple automated method, a deep, supervised method can at best emulate\nthis class of simple functions. This, in addition to lower representation for some classes (AR and\nTD), is part of our motivation in exploring semi-supervised methods to better understand the features\nunderlying extreme weather events rather than trying to \"beat\" existing techniques.\n\n3.3 Suggested Train/Test Splits\n\nWe provide suggested train/test splits for the varying sizes of datasets on which we run experiments.\nTable 1 shows the years used for train and test for each dataset size. We show \"small\" (2 years\ntrain, 1 year of test), \"medium\" (8 years train, 2 years test) and \"large\" (22 years train, 5 years\ntest) datasets. For reference, Table 2 shows the breakdown of the dataset splits for each class for\n\"small\" in order to illustrate the class-imbalance present in the dataset. Our model was trained on\n\"small\", where we split the train set 50:50 for train and validation. 
Links for downloading train\nand test data, as well as further information on the different dataset sizes and splits, can be found here:\nextremeweatherdataset.github.io.\n\nTable 1: Three benchmarking levels for the ExtremeWeather dataset\n\nLevel | Train | Test\nSmall | 1979, 1981 | 1984\nMedium | 1979-1983, 1989-1991 | 1984-1985\nLarge | 1979-1983, 1994-2005, 1989-1993 | 1984-1988\n\nTable 2: Number of examples in ExtremeWeather benchmark splits, with class breakdown statistics\nfor Tropical Cyclones (TC), Extra-Tropical Cyclones (ETC), Tropical Depressions (TD), and United\nStates Atmospheric Rivers (US-AR)\n\nBenchmark | Split | TC (%) | ETC (%) | TD (%) | US-AR (%) | Total\nSmall | Train | 3190 (42.32) | 3510 (46.57) | 433 (5.74) | 404 (5.36) | 7537\nSmall | Test | 2882 (39.04) | 3430 (46.47) | 697 (9.44) | 372 (5.04) | 7381\n\n4 The model\nWe use a 3D convolutional encoder-decoder architecture, meaning that the \ufb01lters of the convolutional\nencoder and decoder are 3 dimensional (height, width, time). The architecture is shown in Figure 1;\nthe encoder uses convolution at each layer while the decoder is the equivalent structure in reverse,\nusing tied weights and deconvolutional layers, with leaky ReLUs (Andrew L. Maas & Ng., 2013)\n(0.1) after each layer. As we take a semi-supervised approach, the code (bottleneck) layer of the\nautoencoder is used as the input to the loss layers, which make predictions for (1) bounding box\nlocation and size, (2) class associated with the bounding box, and (3) the con\ufb01dence (sometimes\ncalled \u2019objectness\u2019) of the bounding box. Further details (\ufb01lter size, stride length, padding, output\nsizes, etc.) can be found in the supplementary materials.\n\nFigure 1: Diagram of the 3D semi-supervised architecture. Parentheses denote subset of total\ndimension shown (for ease of visualization, only two feature maps per layer are shown for the\nencoder-decoder. 
All feature maps are shown for bounding-box regression layers).\n\nThe total loss for the network, $L$, is a weighted combination of supervised bounding-box regression\nloss, $L_{sup}$, and unsupervised reconstruction error, $L_{rec}$:\n\n$$L = L_{sup} + \\lambda L_{rec}, \\tag{1}$$\n\nwhere $L_{rec}$ is the mean squared difference between input $X$ and reconstruction $X^*$:\n\n$$L_{rec} = \\frac{1}{M} \\|X - X^*\\|_2^2, \\tag{2}$$\n\nwhere $M$ is the total number of pixels in an image.\nIn order to regress bounding boxes, we split the original 768x1152 image into a 12x18 grid of\n64x64 anchor boxes. We then predict a box at each grid point by transforming the representation to\n12x18=216 scores (one per anchor box). Each score encodes three pieces of information: (1) how\nmuch the predicted box differs in size and location from the anchor box, (2) the con\ufb01dence that an\nobject of interest is in the predicted box (\u201cobjectness\u201d), and (3) the class probability distribution for\nthat object. Each component of the score is computed by several 3x3 convolutions applied to the 640\n12x18 feature maps of the last encoder layer. Because each set of pixels in each feature map at a\ngiven x, y coordinate can be thought of as a learned representation of the climate data in a 64x64\npatch of the input image, we can think of the 3x3 convolutions as having a local receptive \ufb01eld size of\n192x192, so they use a representation of a 192x192 neighborhood from the input image as context to\ndetermine the box and object centered in the given 64x64 patch. Our approach is similar to (Liu et al.,\n2015) and (Sermanet et al., 2013), which use convolutions from small local receptive \ufb01eld \ufb01lters to\nregress boxes. This choice is motivated by the fact that extreme weather events occur in relatively\nsmall spatiotemporal volumes, with the \u2018background\u2019 context being highly consistent across event\ntypes and between events and non-events. This is in contrast to Redmon et al. 
(2015), which uses\na fully connected layer to consider the whole image as context, appropriate for the task of object\nidenti\ufb01cation in natural images, where there is often a strong relationship between background and\nobject.\nThe bounding box regression loss, $L_{sup}$, is determined as follows:\n\n$$L_{sup} = \\frac{1}{N} (L_{box} + L_{conf} + L_{cls}), \\tag{3}$$\n\nwhere $N$ is the number of time steps in the minibatch, and $L_{box}$ is de\ufb01ned as:\n\n$$L_{box} = \\alpha \\sum_i \\mathbf{1}_i^{obj} R(u_i - u_i^*) + \\beta \\sum_i \\mathbf{1}_i^{obj} R(v_i - v_i^*), \\tag{4}$$\n\nwhere $i \\in [0, 216)$ is the index of the anchor box for the ith grid point, and where $\\mathbf{1}_i^{obj} = 1$ if an\nobject is present at the ith grid point, 0 if not; $R(z)$ is the smooth L1 loss as used in (Ren et al., 2015),\n$u_i = (t_x, t_y)_i$ and $u_i^* = (t_x^*, t_y^*)_i$, $v_i = (t_w, t_h)_i$ and $v_i^* = (t_w^*, t_h^*)_i$, and $t$ is the parametrization\nde\ufb01ned in (Ren et al., 2015) such that:\n\n$$t_x = (x - x_a)/w_a, \\quad t_y = (y - y_a)/h_a, \\quad t_w = \\log(w/w_a), \\quad t_h = \\log(h/h_a)$$\n$$t_x^* = (x^* - x_a)/w_a, \\quad t_y^* = (y^* - y_a)/h_a, \\quad t_w^* = \\log(w^*/w_a), \\quad t_h^* = \\log(h^*/h_a),$$\n\nwhere $(x_a, y_a, w_a, h_a)$ are the center coordinates, width and height of the closest anchor box,\n$(x, y, w, h)$ are the predicted coordinates and $(x^*, y^*, w^*, h^*)$ are the ground truth coordinates.\n$L_{conf}$ is the weighted cross-entropy of the log-probability of an object being present in a grid cell:\n\n$$L_{conf} = \\sum_i \\mathbf{1}_i^{obj} [-\\log(p(obj)_i)] + \\gamma \\sum_i \\mathbf{1}_i^{noobj} [-\\log(p(\\overline{obj}_i))] \\tag{5}$$\n\nFinally $L_{cls}$ is the cross-entropy between the one-hot encoded class distribution and the softmax\npredicted class distribution, evaluated only for predicted boxes at the grid points containing a ground\ntruth box:\n\n$$L_{cls} = \\sum_i \\mathbf{1}_i^{obj} \\sum_{c \\in classes} -p^*(c) \\log(p(c)) \\tag{6}$$\n\nThe formulation of $L_{sup}$ is similar in spirit to YOLO (Redmon et al., 2015), with a few important\ndifferences. Firstly, the object con\ufb01dence and class probability terms in YOLO are squared-differences\nbetween ground truth and prediction, while we use cross-entropy, as used in the region proposal\nnetwork from Faster R-CNN (Ren et al., 2015) and the network from (Liu et al., 2015), for the\nobject probability term and the class probability term respectively. Secondly, we use a different\nparametrization for the coordinates and the size of the bounding box. In YOLO, the parametrizations\nfor x and y are equivalent to Faster R-CNN\u2019s $t_x$ and $t_y$, for an anchor box the same size as the patch\nit represents (64x64). However w and h in YOLO are equivalent to Faster R-CNN\u2019s $t_h$ and $t_w$ for a\n64x64 anchor box only if (a) the anchor box had a height and width equal to the size of the whole\nimage and (b) there were no log transform in Faster R-CNN\u2019s parametrization. We \ufb01nd both\nthese differences to be important in practice. Without the log term, and using ReLU nonlinearities\ninitialized (as is standard) centered around 0, most outputs (more than half) will give initial boxes\nthat have 0 height and width. This makes learning very slow, as the network must learn to resize\nessentially empty boxes. Adding the log term alone in effect makes the \"default\" box (an output\nof 0) equal to the height and width of the entire image - this equally slows down learning, because\nthe network must now learn to drastically shrink boxes. Setting the anchor box size ($h_a$, $w_a$) to 64x64 is a\npragmatic \u2019Goldilocks\u2019 value. 
This makes training much more ef\ufb01cient, as optimization can focus\nmore on picking which box contains an object and not as much on what size the box should be.\nFinally, where YOLO uses squared difference between predicted and ground truth for the coordinate\nparametrizations, we use smooth L1, due to its lower sensitivity to outlier predictions (Ren et al., 2015).\n\n5 Experiments and Discussion\n\n5.1 Framewise Reconstruction\n\nAs a simple experiment, we \ufb01rst train a 2D convolutional autoencoder on the data, treating each\ntimestep as an individual training example (everything else about the model is as described in Section\n4), in order to visually assess reconstructions and ensure reasonable accuracy of detection. Figure\n2 shows the original and reconstructed feature maps for the 16 climate variables of one image in\nthe training set. Reconstruction loss on the validation set was similar to the training set. As the\nreconstruction visualizations suggest, the convolutional autoencoder architecture does a good job of\nencoding spatial information from climate images.\n\nFigure 2: Feature maps for the 16 channels in an \u2019image\u2019 from the training set (left) and their\nreconstructions from the 2D convolutional autoencoder (right).\n\n5.2 Detection and localization\n\nAll experiments are on ExtremeWeather-small, as described in Section 3, where 1979 is train and\n1981 is validation. The model is trained with Adam (Kingma & Ba, 2014), with a learning rate of\n0.0001 and weight decay coef\ufb01cient of 0.0005. For comparison, and to evaluate how useful the time\naxis is to recognizing extreme weather events, we run experiments with both 2D (width, height)\nand 3D (width, height, time) versions of the architecture described in Section 4. Values for \u03b1, \u03b2, \u03b3\n(hyperparameters described in loss Equations 4 and 5) were selected with experimentation and some\ninspiration from (Redmon et al., 2015) to be 5, 7 and 0.5 respectively. 
A lower value for \u03b3 pushes up\nthe con\ufb01dence of true positive examples, allowing the model more examples to learn from, and is thus a\nway to deal with ground-truth false negatives. Although some of the selection of these parameters is\na bit ad-hoc, we assert that our results still provide a good \ufb01rst-pass baseline approach for this dataset.\nThe code is available at https://github.com/eracah/hur-detect.\nDuring training, we input one day\u2019s simulation at a time (8 time steps; 16 variables). The semi-\nsupervised experiments reconstruct all 8 time steps, predicting bounding boxes for the 4 labelled\ntimesteps, while the supervised experiments reconstruct and predict bounding boxes only for the\n4 labelled timesteps. Table 3 shows Mean Average Precision (mAP) for each experiment. Average\nPrecision (AP) is calculated for each class in the manner of ImageNet (Russakovsky et al., 2015),\nintegrating the precision-recall curve, and mAP is averaged over classes. Results are shown for\nvarious settings of \u03bb (see Equation 1) and for two modes of evaluation: at IOU (intersection over union\nof the bounding-box and ground-truth box) thresholds of 0.1 and 0.5. Because the 3D model has\ninherently higher capacity (in terms of number of parameters) than the 2D model, we also experiment\nwith higher capacity 2D models by doubling the number of \ufb01lters in each layer. Figure 3 shows\nbounding box predictions for 2 consecutive (6 hours in between) simulation frames, comparing the\n3D supervised vs 3D semi-supervised model predictions.\nIt is interesting to note that 3D models perform signi\ufb01cantly better than their 2D counterparts for\nETC and TC (hurricane) classes. This implies that the time evolution of these weather events is\nan important criterion for discriminating them. 
In addition, the semi-supervised model signi\ufb01cantly\nimproves the ETC and TC performance, which suggests unsupervised shaping of the spatio-temporal\nrepresentation is important for these events. Similarly, semi-supervised training improves performance\nof the 3D model (for IOU=0.1), while this effect is not observed for 2D models, suggesting that 3D\nrepresentations bene\ufb01t more from unsupervised data. Note that hyperparameters were tuned in the\nsupervised setting, and a more thorough hyperparameter search for \u03bb and other parameters may yield\nbetter semi-supervised results.\n\nFigure 3 shows qualitatively what the quantitative results in Table 3 con\ufb01rm - semi-supervised\napproaches help with rough localization of weather events, but the model struggles to achieve\naccurate boxes. As mentioned in Section 4, the network has a hard time adjusting the size of the\nboxes. As such, in this \ufb01gure we see mostly boxes of size 64x64. For example, for TDs (usually\nmuch smaller than 64x64) and for ARs (always much bigger than 64x64), a 64x64 box roughly\ncentered on the event is suf\ufb01cient to count as a true positive at IOU=0.1, but not at the more stringent\nIOU=0.5. This led to a large dropoff in performance for ARs and TDs, and a sizable dropoff in the\n(variably-sized) TCs. Longer training time could potentially help address these issues.\n\nTable 3: 2D and 3D supervised and semi-supervised results, showing Mean Average Precision (mAP)\nand Average Precision (AP) for each class, at IOU=0.1; IOU=0.5. 
M is model; P is millions of\nparameters; and \u03bb weights the amount that reconstruction contributes to the overall loss.\n\nM   Mode  P      \u03bb   ETC (46.47%)  TC (39.04%)   TD (9.44%)    AR (5.04%)    mAP\n                     AP (%)        AP (%)        AP (%)        AP (%)\n2D  Sup   66.53  0   21.92; 14.42  52.26; 9.23   95.91; 10.76  35.61; 33.51  51.42; 16.98\n2D  Semi  66.53  1   18.05; 5.00   52.37; 5.26   97.69; 14.60  36.33; 0.00   51.11; 6.21\n2D  Semi  66.53  10  15.57; 5.87   44.22; 2.53   98.99; 28.56  36.61; 0.00   48.85; 9.24\n2D  Sup   16.68  0   13.90; 5.25   49.74; 15.33  97.58; 7.56   35.63; 33.84  49.21; 15.49\n2D  Semi  16.68  1   15.80; 9.62   39.49; 4.84   99.50; 3.26   21.26; 13.12  44.01; 7.71\n3D  Sup   50.02  0   22.65; 15.53  50.01; 9.12   97.31; 3.81   34.05; 17.94  51.00; 11.60\n3D  Semi  50.02  1   24.74; 14.46  56.40; 9.00   96.57; 5.80   33.95; 0.00   52.92; 7.31\n\nFigure 3: Bounding box predictions shown on 2 consecutive (6 hours in between) simulation frames,\nfor the integrated water vapor column channel. Green = ground truth, Red = high confidence\npredictions (confidence above 0.8). 3D supervised model (Left), and semi-supervised (Right).\n\n5.3 Feature exploration\nIn order to explore learned representations, we use t-SNE (van der Maaten & Hinton, Nov 2008)\nto visualize the autoencoder bottleneck (last encoder layer). Figure 4 shows the projected feature\nmaps for the first 7 days in the training set for both 3D supervised (top) and semi-supervised (bottom)\nexperiments. Comparing the two, it appears that more TCs (hurricanes) are clustered by the semi-supervised\nmodel, which would fit with the result that semi-supervised information is particularly\nvaluable for this class. 
Viewing the feature maps, we can see that both models have learned spiral\npatterns for TCs and ETCs.\n\nFigure 4: t-SNE visualisation of the first 7 days in the training set for 3D supervised (top) and\nsemi-supervised (bottom) experiments. Each frame (time step) in the 7 days has 12x18 = 216 vectors\nof length 640 (number of feature maps in the code layer), where each pixel in the 12x18 patch\ncorresponds to a 64x64 patch in the original frame. These vectors are projected by t-SNE to two\ndimensions. For both supervised and semi-supervised, we have zoomed into two dense clusters and\nsampled 64x64 patches to show what that feature map has learned. Grey = unlabelled, Yellow =\ntropical depression (not shown), Green = TC (hurricane), Blue = ETC, Red = AR.\n\n6 Conclusions and Future Work\nWe introduce to the community the ExtremeWeather dataset in hopes of encouraging new research\ninto unique, difficult, and socially and scientifically important datasets. We also present a baseline\nmethod for comparison on this new dataset. The baseline explores semi-supervised methods for\nobject detection and bounding box prediction using 3D autoencoding CNNs. These architectures\nand approaches are motivated by finding extreme weather patterns, a meaningful and important\nproblem for society. Thus far, the climate science community has used hand-engineered criteria to\ncharacterize patterns. Our results indicate that there is much promise in considering\ndeep-learning-based approaches. Future work will investigate ways to improve bounding-box accuracy, although\neven rough localizations can be very useful as a data exploration tool, or as an initial step in a larger\ndecision-making system. Further interpretation and visualization of learned features could lead to\nbetter heuristics and a better understanding of the way different variables contribute to extreme weather\nevents. 
Insights in this paper come from only a fraction of the available data, and we have not explored\nsuch challenging topics as anomaly detection, partial annotation detection and transfer learning (e.g.\nto satellite imagery). Moreover, learning to generate future frames using GANs (Goodfellow et al.,\n2014) or other deep generative models, while using performance on a detection model to measure\nthe quality of the generated frames, could be another very interesting future direction. We make\nthe ExtremeWeather dataset available in hopes of enabling and encouraging the machine learning\ncommunity to pursue these directions. The retirement of ImageNet this year (Russakovsky et al.,\n2017) marks the end of an era in deep learning and computer vision. We believe the era to come\nshould be defined by data of social importance, pushing the boundaries of what we know how to\nmodel.\n\nAcknowledgments\n\nThis research used resources of the National Energy Research Scientific Computing Center (NERSC),\na DOE Office of Science User Facility supported by the Office of Science of the U.S. Department\nof Energy under Contract No. DE-AC02-05CH11231. Code relies on open-source deep learning\nframeworks Theano (Bergstra et al.; Team et al., 2016) and Lasagne (Team, 2016), whose developers\nwe gratefully acknowledge. We thank Samsung and Google for support that helped make this research\npossible. We would also like to thank Yunjie Liu and Michael Wehner for providing access to the\nclimate datasets, and Alex Lamb and Thorsten Kurth for helpful discussions.\n\nReferences\nAndrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network\nacoustic models. ICML Workshop on Deep Learning for Audio, Speech, and Language Processing,\n2013.\n\nNicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks\nfor learning video representations. In the Proceedings of ICLR. 
arXiv preprint arXiv:1511.06432,\n2016.\n\nChenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for\ndirect perception in autonomous driving. In Proceedings of the IEEE International Conference on\nComputer Vision, pp. 2722\u20132730, 2015.\n\nAndrew J Conley, Rolando Garcia, Doug Kinnison, Jean-Francois Lamarque, Dan Marsh, Mike\nMills, Anne K Smith, Simone Tilmes, Francis Vitt, Hugh Morrison, et al. Description of the NCAR\nCommunity Atmosphere Model (CAM 5.0). 2012.\n\nMichael D. Dettinger, Fred Martin Ralph, Tapash Das, Paul J. Neiman, and Daniel R. Cayan.\nAtmospheric rivers, floods and the water resources of California. Water, 3(2):445, 2011. ISSN\n2073-4441. URL http://www.mdpi.com/2073-4441/3/2/445.\n\nJeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan,\nKate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual\nrecognition and description. In The IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), June 2015.\n\nSamira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher\nPal. Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 ACM\non International Conference on Multimodal Interaction, pp. 467\u2013474. ACM, 2015.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural\nInformation Processing Systems, pp. 2672\u20132680, 2014.\n\nRaghav Goyal, Samira Kahou, Vincent Michalski, Joanna Materzy\u0144ska, Susanne Westphal, Heuna\nKim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The \"something\nsomething\" video database for learning and evaluating visual common sense. 
arXiv preprint\narXiv:1706.04261, 2017.\n\nChunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A Ross, George\nToderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, et al. AVA: A video\ndataset of spatio-temporally localized atomic visual actions. arXiv preprint arXiv:1705.08421,\n2017.\n\nKaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing\nhuman-level performance on ImageNet classification. In Proceedings of the IEEE International\nConference on Computer Vision, pp. 1026\u20131034, 2015.\n\nEhsan Hosseini-Asl, Georgy Gimel\u2019farb, and Ayman El-Baz. Alzheimer\u2019s disease diagnostics by a\ndeeply supervised adaptable 3D convolutional network. 2016.\n\nShuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3D convolutional neural networks for human action\nrecognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221\u2013231,\n2013.\n\nAndrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei.\nLarge-scale video classification with convolutional neural networks. In Proceedings of the IEEE\nConference on Computer Vision and Pattern Recognition, pp. 1725\u20131732, 2014.\n\nWill Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan,\nFabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset.\narXiv preprint arXiv:1705.06950, 2017.\n\nDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint\narXiv:1312.6114, 2013.\n\nAlex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional\nneural networks. In Advances in Neural Information Processing Systems, pp. 1097\u20131105,\n2012.\n\nDavid A. Lavers, Gabriele Villarini, Richard P. 
Allan, Eric F. Wood, and Andrew J. Wade. The\ndetection of atmospheric rivers in atmospheric reanalyses and their links to British winter floods\nand the large-scale climatic circulation. Journal of Geophysical Research: Atmospheres, 117(D20):\nn/a\u2013n/a, 2012. ISSN 2156-2202. doi: 10.1029/2012JD018027. URL\nhttp://dx.doi.org/10.1029/2012JD018027. D20106.\n\nWei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, and Scott Reed. SSD: Single shot\nmultibox detector. arXiv preprint arXiv:1512.02325, 2015.\n\nYunjie Liu, Evan Racah, Prabhat, Joaquin Correa, Amir Khosrowshahi, David Lavers, Kenneth\nKunkel, Michael Wehner, and William Collins. Application of deep convolutional neural networks\nfor detecting extreme weather in climate datasets. 2016.\n\nAlireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders.\nCoRR, abs/1511.05644, 2015. URL http://arxiv.org/abs/1511.05644.\n\nIshan Misra, Abhinav Shrivastava, and Martial Hebert. Watch and learn: Semi-supervised learning\nof object detectors from videos. CoRR, abs/1505.05769, 2015. URL\nhttp://arxiv.org/abs/1505.05769.\n\nAdam H Monahan, John C Fyfe, Maarten HP Ambaum, David B Stephenson, and Gerald R North.\nEmpirical orthogonal functions: The medium is the message. Journal of Climate, 22(24):6501\u20136514,\n2009.\n\nUrs Neu, Mirseid G. Akperov, Nina Bellenbaum, Rasmus Benestad, Richard Blender, Rodrigo\nCaballero, Angela Cocozza, Helen F. Dacre, Yang Feng, Klaus Fraedrich, Jens Grieger, Sergey\nGulev, John Hanley, Tim Hewson, Masaru Inatsu, Kevin Keay, Sarah F. Kew, Ina Kindem,\nGregor C. Leckebusch, Margarida L. R. Liberato, Piero Lionello, Igor I. Mokhov, Joaquim G.\nPinto, Christoph C. Raible, Marco Reale, Irina Rudeva, Mareike Schuster, Ian Simmonds, Mark\nSinclair, Michael Sprenger, Natalia D. Tilinina, Isabel F. Trigo, Sven Ulbrich, Uwe Ulbrich,\nXiaolan L. Wang, and Heini Wernli. 
IMILAST: A community effort to intercompare extratropical\ncyclone detection and tracking algorithms. Bulletin of the American Meteorological Society, 94(4):\n529\u2013547, 2013. doi: 10.1175/BAMS-D-11-00154.1.\n\nOmkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine\nVision Conference, volume 1, pp. 6, 2015.\n\nPrabhat, Oliver Rubel, Surendra Byna, Kesheng Wu, Fuyu Li, Michael Wehner, and Wes Bethel.\nTECA: A parallel toolkit for extreme climate analysis. ICCS, 2012.\n\nPrabhat, Surendra Byna, Venkatram Vishwanath, Eli Dart, Michael Wehner, and William D. Collins.\nTECA: Petascale pattern recognition for climate science. CAIP, 2015.\n\nAntti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised\nlearning with ladder networks. In Advances in Neural Information Processing Systems,\npp. 3546\u20133554, 2015.\n\nJoseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once:\nUnified, real-time object detection. CoRR, abs/1506.02640, 2015. URL\nhttp://arxiv.org/abs/1506.02640.\n\nShaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object\ndetection with region proposal networks. 2015.\n\nOlga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang,\nAndrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition\nchallenge. International Journal of Computer Vision, 115(3):211\u2013252, 2015.\n\nOlga Russakovsky, Eunbyung Park, Wei Liu, Jia Deng, Fei-Fei Li, and Alex Berg. Beyond ImageNet\nlarge scale visual recognition challenge, 2017. URL\nhttp://image-net.org/challenges/beyond_ilsvrc.php.\n\nTim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.\nImproved techniques for training GANs. 
2016.\n\nPierre Sermanet, David Eigen, Xiang Zhang, Micha\u00ebl Mathieu, Rob Fergus, and Yann LeCun.\nOverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv\npreprint arXiv:1312.6229, 2013.\n\nKaren Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image\nrecognition. arXiv preprint arXiv:1409.1556, 2014.\n\nJost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative\nadversarial networks. arXiv preprint arXiv:1511.06390, 2015.\n\nNitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video\nrepresentations using LSTMs. CoRR, abs/1502.04681, 2, 2015.\n\nKarsten Steinhaeuser, Nitesh Chawla, and Auroop Ganguly. Comparing predictive power in climate\ndata: Clustering matters. Advances in Spatial and Temporal Databases, pp. 39\u201355, 2011.\n\nChristian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,\nDumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1\u20139, 2015.\n\nDu Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning\nspatiotemporal features with 3D convolutional networks. 2014.\n\nL.J.P van der Maaten and G.E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of\nMachine Learning Research, 9: 2579\u20132605, Nov 2008.\n\nMichael Wehner, Prabhat, Kevin A. Reed, D\u00e1ith\u00ed Stone, William D. Collins, and Julio Bacmeister.\nResolution dependence of future tropical cyclone projections of CAM5.1 in the U.S. CLIVAR hurricane\nworking group idealized configurations. Journal of Climate, 28(10):3905\u20133925, 2015. doi:\n10.1175/JCLI-D-14-00311.1.\n\nWilliam F. Whitney, Michael Chang, Tejas Kulkarni, and Joshua B. Tenenbaum. Understanding\nvisual concepts with continuation learning. 
2016.\n\nJianwen Xie, Song-Chun Zhu, and Ying Nian Wu. Synthesizing dynamic textures and sounds by\nspatial-temporal generative convnet. 2016.\n\nShi Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo.\nConvolutional LSTM network: A machine learning approach for precipitation nowcasting. In\nAdvances in Neural Information Processing Systems, pp. 802\u2013810, 2015.\n\nLi Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and\nAaron Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE\nInternational Conference on Computer Vision, pp. 4507\u20134515, 2015.\n\nYuting Zhang, Kibok Lee, and Honglak Lee. Augmenting supervised neural networks with\nunsupervised objectives for large-scale image classification. arXiv preprint arXiv:1606.06582v1,\n2016.\n\nJunbo Zhao, Michael Mathieu, Ross Goroshin, and Yann Lecun. Stacked what-where auto-encoders.\narXiv preprint arXiv:1506.02351, 2015.\n", "award": [], "sourceid": 1942, "authors": [{"given_name": "Evan", "family_name": "Racah", "institution": "University of Montreal"}, {"given_name": "Christopher", "family_name": "Beckham", "institution": "MILA"}, {"given_name": "Tegan", "family_name": "Maharaj", "institution": "MILA, Polytechnic Montreal"}, {"given_name": "Samira", "family_name": "Ebrahimi Kahou", "institution": "Microsoft Research \u2013 Maluuba"}, {"given_name": "Mr.", "family_name": "Prabhat", "institution": "LBL/NERSC"}, {"given_name": "Chris", "family_name": "Pal", "institution": "Montr\u00e9al Institute for Learning Algorithms"}]}