{"title": "Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization", "book": "Advances in Neural Information Processing Systems", "page_first": 7763, "page_last": 7774, "abstract": "There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state-of-the-art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization to improve the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9%  in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.", "full_text": "Cooperative Learning of Audio and Video Models\n\nfrom Self-Supervised Synchronization\n\nBruno Korbar\nDartmouth College\n\nbruno.18@dartmouth.edu\n\nDu Tran\n\nFacebook Research\ntrandu@fb.com\n\nLorenzo Torresani\nDartmouth College\nLT@dartmouth.edu\n\nAbstract\n\nThere is a natural correlation between the visual and auditive elements of a video.\nIn this work we leverage this connection to learn general and effective models\nfor both audio and video analysis from self-supervised temporal synchronization.\nWe demonstrate that a calibrated curriculum learning scheme, a careful choice of\nnegative examples, and the use of a contrastive loss are critical ingredients to obtain\npowerful multi-sensory representations 
from models optimized to discern temporal\nsynchronization of audio-video pairs. Without further \ufb01netuning, the resulting\naudio features achieve performance superior or comparable to the state-of-the-art\non established audio classi\ufb01cation benchmarks (DCASE2014 and ESC-50). At the\nsame time, our visual subnet provides a very effective initialization to improve the\naccuracy of video-based action recognition models: compared to learning from\nscratch, our self-supervised pretraining yields a remarkable gain of +19.9% in\naction recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.\n\n1\n\nIntroduction\n\nImage recognition has undergone dramatic progress since the breakthrough of AlexNet [1] and the\nwidespread availability of progressively large datasets such as Imagenet [2]. Models pretrained on\nImagenet [2] have enabled the development of feature extractors achieving strong performance on a\nvariety of related still-image analysis tasks, including object detection [3, 4], pose estimation [5, 6]\nand semantic segmentation [7, 8]. Deep learning approaches in video understanding have been less\nsuccessful, as evidenced by the fact that deep spatiotemporal models trained on videos [9, 10] still\nbarely outperform the best hand-crafted features [11].\nResearchers have devoted signi\ufb01cant and laudable efforts in creating video benchmarks of much\nlarger size compared to the past [9, 12, 13, 14, 15, 16], both in terms of number of examples as well\nas number of action classes. The growth in scale has enabled more effective end-to-end training\nof deep models and the \ufb01ner-grained de\ufb01nition of classes has made possible the learning of more\ndiscriminative features. This has inspired a new generation of deep video models [17, 18, 19] greatly\nadvancing the \ufb01eld. But such progress has come at a high cost in terms of time-consuming manual\nannotations. 
In addition, one may argue that future signi\ufb01cant improvements by mere dataset growth\nwill require scaling up existing benchmarks by one or more orders of magnitude, which may not be\npossible in the short term.\nIn this paper, we explore a different avenue by introducing a self-supervision scheme that does not\nrequire any manual labeling of videos and thus can be applied to create arbitrarily-large training sets\nfor video modeling. Our idea is to leverage the natural synergy between the audio and the visual\nchannels of a video by introducing a self-supervised task that entails deciding whether a given audio\nsample and a visual sequence are either \u201cin-sync\u201d or \u201cout-of-sync.\u201d This is formally de\ufb01ned as a\nbinary classi\ufb01cation problem which we name \u201cAudio-Visual Temporal Synchronization\u201d (AVTS). We\npropose to address this task via a two-stream network, where one stream receives audio as input and\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fthe other stream operates on video. The two streams are fused in the late layers of the network. This\ndesign induces a form of cooperative learning where the two streams must learn to \u201cwork together\u201d\nin order to improve performance on the synchronization task.\nWe are not the \ufb01rst to propose to leverage correlation between video and audio as a self-supervised\nmechanism for feature learning [20, 21, 22]. However, unlike prior approaches that were trained\nto learn semantic correspondence between the audio and a single frame of the video [21] or that\nused 2D CNNs to model the visual stream [22], we propose to feed video clips to a 3D CNN [19] in\norder to learn spatiotemporal features that can model the correlation of sound and motion in addition\nto appearance within the video. 
We note that AVTS differs conceptually from the \u201cAudio-Visual Correspondence\u201d (AVC) proposed by Arandjelovic and Zisserman [21, 22]. In AVC, negative training pairs were formed by drawing the audio and the visual samples from distinct videos. This makes it possible to solve AVC purely based on semantic information (e.g., if the image contains a piano but the audio includes sound from a waterfall, the pair is obviously a negative sample). Conversely, in our work we train on negative samples that are \u201chard,\u201d i.e., represent out-of-sync audio and visual segments sampled from the same video. This forces the net to learn relevant temporally-sensitive features for both the audio and the video stream in order to recognize synchronization as opposed to only semantic correspondence.\nTemporal synchronization is a harder problem to solve than semantic correspondence, since it requires determining whether the audio and the visual samples are not only semantically coherent but also temporally aligned. To ease the learning, we demonstrate that it is bene\ufb01cial to adopt a curriculum learning strategy [23], where harder negatives are introduced after an initial stage of learning on easier negatives. We show that using curriculum learning further improves the quality of our features for all the downstream tasks considered in our experiments.\nThe audio and the visual components of a video are processed by two distinct streams in our network. After learning, it is then possible to use the two individual streams as feature extractors or models for each of the two modalities. In our experiments we study several such applications, including pretraining for action recognition in video, feature extraction for audio classi\ufb01cation, as well as multisensory (visual and audio) video categorization. 
Speci\ufb01cally, we demonstrate that, without further \ufb01netuning, the features computed from the last convolutional layer of the audio stream yield performance on par with or better than the state-of-the-art on established audio classi\ufb01cation benchmarks (DCASE2014 and ESC-50). In addition, we show that our visual subnet provides a very effective initialization to improve the performance of action recognition networks on medium-size video classi\ufb01cation datasets, such as HMDB51 [24] and UCF101 [25]. Furthermore, additional boosts in video classi\ufb01cation performance can be obtained by \ufb01netuning multisensory (audio-visual) models from our pretrained two-stream network.\n\n2 Technical Approach\n\nIn this section we provide an overview of our approach for Audio-Visual Temporal Synchronization (AVTS). We begin with a formal de\ufb01nition of the problem statement. We then introduce the key features of our model by discussing their individual quantitative contribution towards both AVTS performance and accuracy on our downstream tasks (action recognition and audio classi\ufb01cation).\n\n2.1 Audio-Visual Temporal Synchronization (AVTS)\nWe assume we are given a training dataset D = {(a(1), v(1), y(1)), . . . , (a(N), v(N), y(N))} consisting of N labeled audio-video pairs. Here a(n) and v(n) denote the audio sample and the visual clip (a sequence of RGB frames) in the n-th example, respectively. The label y(n) \u2208 {0, 1} indicates whether the audio and the visual inputs are \u201cin sync,\u201d i.e., if they were sampled from the same temporal slice of a video. If y(n) = 0, then a(n) and v(n) were taken either from different temporal segments of the same video, or possibly from two different videos, as further discussed below. The audio input a(n) and the visual clip v(n) are sampled to span the same temporal duration.\nAt a very high level, the objective of AVTS is to learn a classi\ufb01cation function g(.) 
that minimizes the empirical error, i.e., such that g(a(n), v(n)) = y(n) on as many examples as possible. However, as our primary goal is to use AVTS as a self-supervised proxy for audio-visual feature learning, we de\ufb01ne g(.) in terms of a two-stream network where the audio and the video input are separately processed by an audio subnetwork fa(.) and a visual subnetwork fv(.), providing a feature representation for each modality. The function g(fa(a(n)), fv(v(n))) is then responsible for fusing the feature information from both modalities to address the synchronization task. An illustration of our two-stream network design is provided in Fig. 2. The technical details about the two streams are provided in subsection 2.5.\n\n2.2 Choice of Loss Function\n\nA natural choice is to adopt the cross-entropy loss as learning objective, since this would directly model AVTS as a binary classi\ufb01cation problem. However, we found it dif\ufb01cult to achieve convergence under this loss when learning from scratch. Inspired by similar \ufb01ndings in Chung et al. [26], we discovered experimentally that more consistent and robust optimization can be obtained by minimizing the contrastive loss, which was originally introduced for training Siamese networks [27] on same-modality input pairs. In our setting, we optimize the audio and video streams to produce small distance on positive pairs and larger distance on negative pairs, as in [26]:\n\nE = (1/N) \u2211_{n=1}^{N} [ y(n) ||fv(v(n)) \u2212 fa(a(n))||_2^2 + (1 \u2212 y(n)) max(\u03b7 \u2212 ||fv(v(n)) \u2212 fa(a(n))||_2, 0)^2 ]   (1)\n\nwhere \u03b7 is a margin hyper-parameter. Upon convergence, AVTS prediction on new test examples (a, v) can be addressed by simply thresholding the distance function, i.e., by de\ufb01ning g(fa(a), fv(v)) \u2261 1{||fv(v) \u2212 fa(a)||_2 < \u03c4} where 1{.} denotes the logical indicator function and \u03c4 is a set threshold. 
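As an illustration, the contrastive objective of Eq. 1 and the distance-thresholding classifier can be sketched in a few lines of Python. This is a minimal editorial sketch over plain feature vectors, not the authors' implementation; function and variable names are ours:

```python
import math

def contrastive_loss(v_feats, a_feats, labels, margin=0.99):
    # Eq. 1: positives (y=1) contribute their squared L2 distance;
    # negatives (y=0) are penalized only while closer than `margin`.
    total = 0.0
    for v, a, y in zip(v_feats, a_feats, labels):
        d = math.dist(v, a)  # ||fv(v) - fa(a)||_2
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / len(labels)

def predict_sync(v_feat, a_feat, tau):
    # Test-time AVTS rule: 1{||fv(v) - fa(a)||_2 < tau}
    return 1 if math.dist(v_feat, a_feat) < tau else 0
```

The default `margin` matches the value \u03b7 = 0.99 reported for the experiments in Sec. 3.1.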
We also tried adding one or more fully connected (FC) layers on top of the learned feature\nextractors and \ufb01ne-tuning the entire network end-to-end with respect to a cross-entropy loss. We\nfound both these approaches to perform similarly on the AVTS task, with a slight edge in favor of the\n\ufb01ne-tuning solution (see details in subsection 3.2). However, on downstream tasks (action recognition\nand audio classi\ufb01cation), we found AVTS \ufb01ne-tuning using the cross-entropy loss to yield no further\nimprovement after the contrastive loss optimization.\n\n2.3 Selection of Negative Examples\n\nWe use an equal proportion of positive and negative examples for training. We generate a positive\nexample by extracting the audio and the visual input from a randomly chosen video clip, so that the\nvideo frames correspond in time with the audio segment. We consider two main types of negative\nexamples. Easy negatives are those where the video frames and the sound come from two different\nvideos. Hard negatives are those where the pair is taken from the same video, but there is at least half\na second time-gap between the audio sample and the visual clip. The purpose of hard negatives is to\ntrain the network to recognize temporal synchronization as opposed to mere semantic correspondence\nbetween the audio and the visual input. An illustration of a positive example, and the two types of\nhard negatives is provided in Fig. 1. 
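The pair-sampling rules above (aligned positives; hard negatives from the same video with at least a half-second gap) and the curriculum mix of Sec. 2.4 can be sketched as follows. This is an illustrative sketch, not the paper's code; the helper names and the uniform sampling of start times are our assumptions:

```python
import random

def sample_pair(video_dur, clip_dur=1.0, min_gap=0.5, kind='positive', rng=random):
    # Returns (visual_start, audio_start, label) in seconds for one AVTS pair.
    v_start = rng.uniform(0.0, video_dur - clip_dur)
    if kind == 'positive':
        return v_start, v_start, 1  # in-sync: same temporal slice of the video
    while True:  # hard negative: same video, audio shifted by >= min_gap seconds
        a_start = rng.uniform(0.0, video_dur - clip_dur)
        if abs(a_start - v_start) >= min_gap:
            return v_start, a_start, 0

def negative_kind(epoch, rng=random):
    # Curriculum of Sec. 2.4: easy negatives only in the first stage
    # (epochs 1-50), then a 75%/25% mix of easy and hard negatives.
    if epoch <= 50 or rng.random() < 0.75:
        return 'easy'  # audio and visual clip drawn from different videos
    return 'hard'      # out-of-sync pair drawn from the same video
```

Easy negatives themselves simply pair the visual clip with audio sampled from a different video, so they need no timing constraint.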
We have also tried using super-hard negatives, which we de\ufb01ne as examples where the audio and the visual sequence overlap for a certain (\ufb01xed) temporal extent.\nNot surprisingly, we found that including either hard or super-hard negatives as additional training examples was detrimental when the negative examples in the test set consisted of only \u201ceasy\u201d negatives (e.g., the AVTS accuracy of our system drops by about 10% when using a negative training set consisting of 75% easy negatives and 25% hard negatives compared to using negative examples that are all easy). Less intuitively, at \ufb01rst we found that introducing hard or super-hard negatives in the training set also degraded the quality of audio and video features with respect to our downstream tasks of audio classi\ufb01cation and action recognition in video. As further discussed in the next subsection, adopting a curriculum learning strategy was critical to successfully leverage the information contained in hard negatives to achieve improved performance in terms of AVTS and downstream tasks.\n\n2.4 Curriculum Learning\n\nWe trained our system from scratch with easy negatives alone, with hard negatives alone, as well as with \ufb01xed proportions of easy and hard negatives. We found that when hard negatives are introduced from the beginning \u2014 either fully or as a proportion \u2014 the objective is very dif\ufb01cult to optimize and test results on the AVTS task are consequently poor. However, if we introduce the hard negatives after the initial optimization with easy negatives only (in our case between the 40th and 50th epoch), \ufb01ne-tuning using some harder negatives yields better results in terms of both AVTS accuracy as well\n\n\fFigure 1: Illustration of a positive example, a \u201chard\u201d negative and \u201csuper-hard\u201d negative. 
\u201cEasy\u201d negatives are not shown here: they involve taking audio samples and visual clips from different videos. Easy negatives can be recognized merely based on semantic information, since two distinct videos are likely to contain different scenes and objects. Our approach uses hard negatives (audio and visual samples taken from different slices of the same video) to force the network to recognize synchronization, as opposed to mere semantic correspondence.\n\nas performance on our downstream tasks (audio classi\ufb01cation and action recognition). Empirically, we obtained the best results when \ufb01ne-tuning with a negative set consisting of 25% hard negatives and 75% easy negatives. For a preview of results see Table 1, which outlines the difference in AVTS accuracy when training using curriculum learning as opposed to single-stage learning. Even more remarkable are the performance improvements enabled by curriculum feature learning on the downstream tasks of audio classi\ufb01cation and action recognition (see Table 4).\n\n2.5 Architecture Design\n\nFigure 2: Our architecture design. The complete model for AVTS training can be viewed in (a). The video subnetwork (shown in (b)) is an MCx network [19] using 3D convolutions in the early layers, and 2D convolutions in the subsequent layers. The audio subnetwork (shown in (c)) is the VGG model used by Chung and Zisserman [26].\n\nAs illustrated in Fig. 2(a), our network architecture is composed of two main parts: the audio subnetwork and the video subnetwork, each taking its respective input. Our video subnetwork (shown in Fig. 2(b)) is based on the mixed-convolution (MCx) family of architectures [19]. An MCx\n\n\fTable 1: AVTS accuracy achieved by our system on the Kinetics test set, which includes negatives of only \u201ceasy\u201d type, as in [21]. 
The table shows that curriculum learning with a mix of easy and hard negatives in a second stage of training leads to a signi\ufb01cant gain in accuracy.\n\nMethod | Negative type | Epochs | Accuracy (%)\nSingle learning stage | easy | 1 - 90 | 69.0\nSingle learning stage | 75% easy, 25% hard | 1 - 90 | 58.9\nSingle learning stage | hard | 1 - 90 | 52.3\nSingle learning stage | easy | 1 - 50 | 67.2\nCurriculum learning (i.e., second learning stage applied after a \ufb01rst stage of 1-50 epochs with easy negatives only) | 75% easy, 25% hard | 51 - 90 | 78.4\nCurriculum learning (as above) | hard | 51 - 90 | 65.7\n\nnetwork uses a sequence of x 3D (spatiotemporal) convolutions in the early layers, followed by 2D convolutions in the subsequent layers. The intuition is that temporal modeling by 3D convolutions is particularly useful in the early layers, while the late layers responsible for the \ufb01nal prediction do not require temporal processing. MCx models were shown to provide a good trade-off in terms of video classi\ufb01cation accuracy, number of learning parameters, and runtime ef\ufb01ciency (see [19] for further details). We found that within our system MC3 yields the best performance overall and is used whenever not speci\ufb01ed otherwise. Note that our architecture is a simpli\ufb01ed version of the original network discussed in [19], as it lacks residual connections and differs in terms of dimensionality in the \ufb01nal FC layer. The input to our video subnetwork consists of video clips of size (3 \u00d7 t \u00d7 h \u00d7 w), where 3 refers to the RGB channels of each frame, t is the number of frames, and h, w are the height and width, respectively. For the audio stream, we use the processing and the architecture described by Chung and Zisserman [26]: the audio is \ufb01rst converted to the MP3 format and FFT \ufb01lterbank features are computed and passed through a VGG-like convolutional architecture. The speci\ufb01cations of the audio subnetwork are provided in Figure 2(c). 
Further implementation details can be found in [26].\n\n3 Experiments\n\n3.1 Implementation Details\n\nInput preprocessing. The starting frame of each clip is chosen at random within a video. The length of each clip is set to t = 25 frames. This results in a clip duration of 1 second on all the datasets considered here except for HMDB51, which uses a frame rate different from 25 fps (clip duration on HMDB51 is roughly 1.2 seconds). Standard spatial transformations (multi-scale random crop, random horizontal \ufb02ip, and Z normalization) are applied to all frames of the clip at training time. FFT \ufb01lterbank features are extracted from the audio sample, and Z normalization is applied. The FFT \ufb01lterbank parameters are set as follows: window length to 0.02 seconds, window step to 0.01 seconds, FFT size to 1024, and number of \ufb01lters to 40. Mean and standard deviation for normalization are extracted over a random 20% subset of the training dataset.\n\nTraining details. Hyper-parameter \u03b7 in Eq. 1 is set to 0.99. We train the complete AVTS network end-to-end using stochastic gradient descent with the initial learning rate determined via grid search. Training is done on a four-GPU machine with a mini-batch of 16 examples per GPU. The learning rate is scaled by 0.1 each time the loss value fails to decrease for more than 5 epochs.\n\n3.2 Evaluation on Audio-Visual Temporal Synchronization (AVTS)\n\nWe \ufb01rst evaluate our approach on the AVTS task. We experimented with training our model on several datasets: Kinetics [12], SoundNet [20], and AudioSet [28]. We tried different ways to combine the information from the two streams after training the network shown in Fig. 2 from scratch with contrastive loss. The best results were obtained by concatenating the outputs of the two subnetworks, and by adding two fully connected layers (containing 512 and 2 units, respectively). 
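For concreteness, the fusion head just described (concatenation of the two subnetwork outputs followed by fully connected layers of 512 and 2 units) can be sketched as below. This is an editorial sketch with randomly initialized weights; the input feature dimensionalities and the ReLU nonlinearity are our assumptions, as the text does not specify them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions for illustration only.
D_V, D_A, HIDDEN, N_CLS = 512, 512, 512, 2

# Random weights stand in for the two FC layers that fuse the
# concatenated audio and video features for binary AVTS prediction.
W1 = rng.standard_normal((D_V + D_A, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, N_CLS)) * 0.01
b2 = np.zeros(N_CLS)

def fusion_head(f_v, f_a):
    # Concatenate the two subnetwork outputs, then FC(512) -> FC(2).
    x = np.concatenate([f_v, f_a])
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU (our assumption)
    return h @ W2 + b2                # logits: in-sync vs out-of-sync
```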
The resulting network was then \ufb01netuned end-to-end for the binary classi\ufb01cation AVTS task using cross-entropy loss. In order to have results completely comparable to those of Arandjelovic et al. [21], we use the same test set as theirs, including only negatives of \u201ceasy\u201d type.\n\n\fTable 2: Action recognition accuracy (%) on UCF101 [25] and HMDB51 [24] using AVTS as a self-supervised pretraining mechanism. Even though our pretraining does not leverage any manual labels, it yields a remarkable gain in accuracy compared to learning from scratch (+19.9% on UCF101 and +17.7% on HMDB51, for MC3). As expected, making use of Kinetics action labels for pretraining yields a further boost. But the accuracy gaps are not too large (only +1.5% on UCF101 with MC3) and may potentially be bridged by making use of a larger pretraining dataset, since no manual cost is involved for our procedure. Additionally, we show that our method generalizes to different families of models, such as I3D-RGB [18]. 
Rows marked with * report numbers as listed in the I3D-RGB paper, which may have used a slightly different setup in terms of data preprocessing and evaluation.\n\nVideo Network Architecture | Pretraining Dataset | Pretraining Supervision | UCF101 | HMDB51\nMC2 | none | N/A | 67.2 | 41.2\nMC2 | Kinetics | self-supervised (AVTS) | 83.6 | 54.3\nMC2 | Kinetics | fully supervised (action labels) | 87.9 | 62.0\nMC3 | none | N/A | 69.1 | 43.9\nMC3 | Kinetics | self-supervised (AVTS) | 85.8 | 56.9\nMC3 | Audioset | self-supervised (AVTS) | 89.0 | 61.6\nMC3 | Kinetics | fully supervised (action labels) | 90.5 | 66.8\nI3D-RGB | none | N/A | 57.1 | 40.0\nI3D-RGB | Kinetics | self-supervised (AVTS) | 83.7 | 53.0\nI3D-RGB* | Imagenet | fully supervised (object labels) | 84.5 | 49.8\nI3D-RGB* | Kinetics | fully supervised (action labels) | 95.1 | 74.3\nI3D-RGB* | Kinetics + Imagenet | fully supervised (object+action labels) | 95.6 | 74.8\n\nTable 1 summarizes our AVTS results when training on the Kinetics training set and testing on the Kinetics test set. We can clearly see that inclusion of hard negatives in the \ufb01rst stage of training is deleterious. However, curriculum learning with a 75/25% mix of easy/hard negatives in the second stage (after training only on easy negatives in the \ufb01rst stage) yields a remarkable gain of 9.4% in AVTS accuracy. On the other hand, we found it detrimental to use super-hard negatives in any stage of training, as these tend to make the optimization overly dif\ufb01cult.\nAs noted in Section 2.1, performing AVTS classi\ufb01cation by directly thresholding the contrastive loss distance (i.e., using the classi\ufb01er 1{||fv(v) \u2212 fa(a)||_2 < \u03c4}) produces comparable performance when the test set includes negatives of only easy type (76.1% on Kinetics). 
However, we found it to be largely inferior when evaluated on a test set including a mix of 25/75% of hard and easy negatives (65.6% as opposed to 70.3% when using the net \ufb01netuned with cross-entropy).\nIn terms of comparison with the L3-Net of Arandjelovic and Zisserman [21], our approach performs consistently better: 78% vs 74% on Kinetics and 86% vs 81% on Audioset [28] (learning on the training split of Audioset and testing on the test split of Audioset). Inclusion of hard negatives during training allows us to maintain high performance even when hard negatives are included in the testing set (70% on Kinetics), whereas we found that the performance of L3-Net drops drastically when hard negatives are included in the test set (57% for L3-Net vs 70% for our AVTS).\nWe stress, however, that performance on the AVTS task is not our ultimate objective. AVTS is only a proxy for learning rich audio-visual representations. In the following sections, we present evaluations of our AVTS audio and visual subnets on downstream tasks.\n\n3.3 Evaluation of AVTS as a Pretraining Mechanism for Action Recognition\n\nIn this section we assess the ability of the AVTS procedure to serve as an effective pretraining mechanism for video-based action recognition models. For this purpose, after AVTS training with contrastive loss on Kinetics, we \ufb01ne-tune our video subnetwork on two medium-size action recognition benchmarks: UCF101 [25] and HMDB51 [24]. We note that since AVTS does not require any labels, we could have used any video dataset for pretraining. Here we use Kinetics for AVTS learning, as it allows us to compare the results of our self-supervised pretraining (i.e., no manual labels) with those obtained by fully-supervised pretraining based on action class labels, which are available on Kinetics. We also include results by training the MCx network from scratch on UCF101 [25] and HMDB51. 
Finally, we report action recognition results using the I3D-RGB [18] network trained in several ways: learned from scratch, pretrained using our self-supervised AVTS, or pretrained with category labels (using object labels from ImageNet with 2D-to-3D \ufb01lter in\ufb02ation [18], using action labels from Kinetics, or using both ImageNet and Kinetics). Results are computed as averages over the 3 train/test splits provided with these two benchmarks.\nThe results are provided in Table 2. We note that AVTS pretraining provides a remarkable boost in accuracy on both datasets. For the MC3 model, the gain produced by AVTS-pretraining on Kinetics is 16.7% on UCF101 and 13.0% on HMDB51, compared to learning from scratch. This renders our approach practically effective as a pretraining mechanism for video-based classi\ufb01cation models; we stress that zero manual annotations were used to produce this gain in accuracy. As expected, pretraining on the Kinetics dataset using action-class labels leads to even higher accuracy on UCF101 and HMDB51. But this relies on the enormous cost of manually labeling the 500K video clips. Conversely, since our pretraining is self-supervised, it can be applied to even larger datasets at no additional manual cost. Here we investigate the effect of larger self-supervised training sets by performing AVTS pretraining on AudioSet [28], which is nearly 8x bigger than Kinetics. As shown in Table 2, in this case the accuracy of MC3 further improves, reaching 89.0% on UCF101. This is only 1.5% lower than the accuracy obtained by pretraining with full supervision on Kinetics. In comparison, AVTS pretraining on a subset of Audioset having the same size as Kinetics leads to an accuracy of 86.4% on UCF101. 
This suggests that AVTS pretraining on even larger datasets is likely to lead to further accuracy improvements and may help bridge the remaining gap with respect to methods based on fully-supervised pretraining.\n\n3.4 Evaluation of AVTS Audio Features\n\nIn this section we evaluate the audio features learned by our AVTS procedure through minimization of the contrastive loss. For this purpose, we take the activations from the conv_5 layer of the audio subnetwork and test their quality as an audio representation on two established sound classi\ufb01cation datasets: ESC-50 [29] and DCASE2014 [30]. We extract 10 equally-spaced 2-second-long sub-clips from each full audio sample of ESC-50. For DCASE2014, we extract 60 sub-clips from each full sample since audio samples in this dataset are longer than those of ESC-50. We stress that no \ufb01netuning of the audio subnetwork is done for these experiments. We directly train a multiclass one-vs-all linear SVM on the conv_5 AVTS features to classify the audio events. We compute the classi\ufb01cation score for each audio sample by averaging the sub-clip scores in the sample, and then predict the class with the highest score.\nTable 3 summarizes the results of our approach as well as many other methods on these two benchmarks. We can observe that audio features learned on AVTS generalize extremely well on both of these two sound classi\ufb01cation tasks, yielding performance superior or close to the state-of-the-art. From the results in the table it can be seen that our audio subnet directly trained from scratch on these two benchmarks performs quite poorly. This indicates that the effectiveness of our approach lies in the self-supervised learning procedure rather than in the net architecture. 
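The sub-clip evaluation protocol above reduces to a simple score-averaging rule, which can be sketched as follows; this is an editorial sketch in which the per-class SVM scores are assumed to be given:

```python
def classify_audio_sample(subclip_scores):
    # `subclip_scores`: one list of per-class SVM scores per sub-clip
    # (10 sub-clips per sample for ESC-50, 60 for DCASE2014 above).
    # Average the scores over sub-clips, then pick the top-scoring class.
    n_clips = len(subclip_scores)
    n_classes = len(subclip_scores[0])
    avg = [sum(s[c] for s in subclip_scores) / n_clips for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```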
On ESC-50 our approach not only outperforms other recent self-supervised methods (L3-Net [21, 22] and SoundNet [20]), but also marginally surpasses human performance.\n\n3.5 Multi-Modal Action Recognition\n\nWe also evaluate our approach on the task of multi-modal action recognition, i.e., using both the audio and the visual stream in the video to predict actions. As for AVTS classi\ufb01cation, we concatenate the audio and the visual features obtained from our two subnetworks and add two fully connected layers (containing, respectively, 512 and C units, where C is the number of classes in the dataset), and then \ufb01ne-tune the entire system for action recognition with cross-entropy loss as the training objective. We tried this approach on UCF101 and compared our results to those achieved by the concurrent self-supervised approach of Owens and Efros [32]. Our method achieves higher accuracy than [32] both when the two methods rely on the visual stream only (85.8% vs 77.6%) and when they use multisensory (audio and video) information from both streams (our model achieves 87.0% while the method of Owens and Efros yields 82.1%).\n\n\fTable 3: Evaluation of audio features learned with AVTS on two audio classi\ufb01cation benchmarks: ESC-50 and DCASE2014. "Our audio subnet" denotes our audio subnet directly trained on these benchmarks. 
The superior performance of our AVTS features suggests that the effectiveness of our approach lies in the self-supervised learning procedure rather than in the net architecture.\n\nMethod | Auxiliary dataset | Auxiliary supervision | # auxiliary examples | ESC-50 accuracy (%) | DCASE2014 accuracy (%)\nSVM-MFCC [29] | none | none | none | 39.6 | -\nRandom Forest [29] | none | none | none | 44.3 | -\nOur audio subnet | none | none | none | 61.6 | 72\nSoundNet [20] | SoundNet | self | 2M+ | 74.2 | 88\nL3-Net [21] | SoundNet | self | 2M+ | 79.3 | 93\nOur AVTS features | Kinetics | self | 230K | 76.7 | 91\nOur AVTS features | AudioSet | self | 1.8M | 80.6 | 93\nOur AVTS features | SoundNet | self | 2M+ | 82.3 | 94\nHuman performance [21] | n/a | n/a | n/a | 81.3 | -\nState-of-the-art (RBM) [31] | none | none | none | 86.5 | -\n\n3.6 Impact of curriculum learning on AVTS and downstream tasks\n\nTable 4 presents results across the many tasks considered in this paper. These results highlight the strong bene\ufb01t of training AVTS with curriculum learning for the AVTS task as well as for other applications that use our features (audio classi\ufb01cation) or \ufb01netune our models (action recognition). We also include results achieved by L3-Net (the most similar competitor) across these tasks. For all tasks, both AVTS and L3-Net are pretrained on Kinetics, except for the evaluations on ESC-50 and DCASE2014 where Flickr-Soundnet [20] is used for pretraining.\nFor a fair comparison, on UCF101 and HMDB51 we \ufb01ne-tune the L3-Net using all frames from the training set, and evaluate it by using all frames from the test videos. This means that both L3-Net and our network are pre-trained, \ufb01ne-tuned, and tested on the same amount of data. The results in this table show that AVTS yields consistently higher accuracy across all tasks.\n\nTable 4: Impact of curriculum learning on AVTS and downstream tasks (audio classi\ufb01cation and action recognition). 
Both L3-Net [21] and our AVTS model are pretrained, fine-tuned (when applicable), and tested on the same amount of data. All numbers are accuracy measures (%).

Method | AVTS-Kinetics | ESC-50 | DCASE | HMDB51 | UCF101
Our AVTS - single stage | 69.8 | 70.6 | 89.2 | 46.4 | 77.1
Our AVTS - curriculum | 78.4 | 82.3 | 94.1 | 56.9 | 85.8
L3-Net | 74.3 | 79.3 | 93 | 40.2 | 72.3

4 Related work

Unsupervised learning has been studied for decades in both computer vision and machine learning. Inspirational work in this area includes deep belief networks [33], stacked autoencoders [34], shift-invariant decoders [35], sparse coding [36], TICA [37], and stacked ISAs [38]. Instead of reconstructing the original inputs, as typically done in unsupervised learning, self-supervised learning methods try to exploit free supervision from images or videos. Wang et al. [39] used tracklets of image patches across video frames as self-supervision. Doersch et al. [41] exploited the spatial context of image patches to pre-train a deep ConvNet. Fernando et al. [42] used temporal context for self-supervised pre-training, while Misra et al. [43] proposed frame shuffling as a self-supervised task.
Self-supervised learning can also be done across different modalities. Pre-trained visual classifiers (noisy predictions) were used as supervision for pre-training audio models [20] as well as CNN models with depth images as input [44]. Audio signals were also used to pre-train visual models [45]. Recently, Arandjelovic and Zisserman proposed the Audio-Visual Correspondence (AVC) task – i.e., predicting if an audio-video pair is in a true correspondence – as a way to jointly learn both auditive and visual representations [21]. The approach was subsequently further refined for cross-modal retrieval and sound localization [22]. 
While these approaches used only a single frame of a video and therefore focused on exploiting the semantics of the audio-visual correlation, our method uses a video clip as input. This allows our networks to learn spatiotemporal features. Furthermore, while in AVC negative training pairs are generated by sampling audio and visual slices from two distinct videos, we purposely include out-of-sync audio-visual pairs as negative examples. This transforms the task from audio-video correspondence into one of temporal synchronization. We demonstrate that this forces our model to take into account temporal information, as well as appearance, in order to exploit the correlation of sound and motion within the video. Another benefit of the temporal synchronization task is that it allows us to control the level of difficulty of the hard negative examples, which lets us develop a curriculum learning strategy that further improves our learned models. Our ablation study shows that both technical contributions (temporal synchronization and curriculum learning) lead to superior audio and video models for classification.
Our approach is also closely related to the work of Chung and Zisserman [26], where the problem of audio-visual synchronization was empirically studied within the application of correlating mouth motion and speech. Here we broaden the scope of the study to encompass arbitrary human activities and experimentally evaluate the task as a self-supervised mechanism for general audio-visual feature learning. Our method is also related to that of Izadinia et al. [40], who used canonical correlation analysis to identify moving-sounding objects in video and to solve the problem of audio-video synchronization. Our work is concurrent with that of Zhao et al. [46] and that of Owens and Efros [32]. 
The former is focused on the task of spatially localizing sound in video, as well as on the problem of conditioning sound generation on image regions. Conversely, we use audio-visual samples for model learning. The work of Owens and Efros [32] is similar in spirit to our own, but we present stronger experimental results on comparable benchmarks (e.g., UCF101), and our technical approach differs substantially in the use of hard negatives, curriculum learning, and the choice of contrastive loss as the learning objective. We demonstrated in our ablation study that all these aspects contribute to the strong performance of our system. Finally, we note that our two-stream design enables the application of our model to a single modality (i.e., audio-only or video-only) after learning, while the network of Owens and Efros requires both modalities as input.

5 Conclusions

In this work we have shown that the self-supervised mechanism of audio-visual temporal synchronization (AVTS) can be used to learn general and effective models for both the audio and the vision domain. Our procedure performs a form of cooperative training where the "audio stream" and the "visual stream" learn to work together towards the objective of synchronization classification. By training on videos (as opposed to still images) and by including out-of-sync audio-video pairs as "hard" negatives, we force our model to address the problem of audio-visual synchronization (i.e., are audio and video temporally aligned?) as opposed to mere semantic correspondence (i.e., are audio and video recorded in the same semantic setting?). We demonstrate that this leads to superior performance of the audio and visual features on several downstream tasks. 
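The distinction between correspondence negatives and synchronization ("hard") negatives described above can be illustrated with a minimal sampling sketch. This is our own illustration, not the paper's released code: the function name, the `min_offset` parameter, and the abstract clip representation `(video_id, start_time)` are assumptions made purely for exposition.

```python
import random

def sample_avts_pair(videos, pair_type, clip_len=1.0, min_offset=2.0, rng=random):
    """Sample one (visual_clip, audio_clip) pair for synchronization training.

    Illustrative sketch only (not the paper's code). `videos` is a list of
    dicts {"id": ..., "duration": ...}; a clip is abstracted as a
    (video_id, start_time) tuple.
    """
    v = rng.choice(videos)
    start = rng.uniform(0.0, v["duration"] - clip_len)
    visual = (v["id"], start)

    if pair_type == "positive":
        # In-sync pair: audio taken from the same video at the same time.
        audio = (v["id"], start)
    elif pair_type == "easy_negative":
        # Correspondence-style negative (as in AVC): audio from a different video.
        other = rng.choice([u for u in videos if u["id"] != v["id"]])
        audio = (other["id"], rng.uniform(0.0, other["duration"] - clip_len))
    elif pair_type == "hard_negative":
        # Synchronization negative: audio from the *same* video, shifted in
        # time by at least `min_offset` seconds relative to the visual clip.
        while True:
            t = rng.uniform(0.0, v["duration"] - clip_len)
            if abs(t - start) >= min_offset:
                break
        audio = (v["id"], t)
    else:
        raise ValueError("unknown pair_type: %s" % pair_type)
    return visual, audio
```

Under this sketch, the temporal shift of a hard negative is the natural knob for a curriculum: a large minimum offset yields easy-to-detect misalignments, and shrinking it over the course of training produces progressively harder negatives.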
We have also shown that curriculum learning significantly improves the quality of the features on all end tasks.
While in this work we trained AVTS on established, labeled video datasets in order to allow a direct comparison with fully-supervised pretraining methods, our approach is self-supervised and does not require any manual labeling. This opens up the possibility of self-supervised pretraining on video collections that are much larger than any existing labeled video dataset and that may be derived from many different sources (YouTube, Flickr videos, Facebook posts, TV news, movies, etc.). We believe that this may yield further improvements in the generality and effectiveness of our models for downstream tasks in the audio and video domain, and that it may help bridge the remaining gap with respect to fully-supervised pretraining methods that rely on costly manual annotations.

Acknowledgements

This work was funded in part by NSF award CNS-120552. We gratefully acknowledge NVIDIA and Facebook for the donation of GPUs used for portions of this work. We would like to thank Relja Arandjelović for discussions and for sharing information regarding L3-Net, and members of the Visual Learning Group at Dartmouth for their feedback.

References

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[2] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.

[3] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. 
In NIPS, pages 91–99, 2015.

[4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.

[5] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.

[6] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, pages 4724–4732, 2016.

[7] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.

[8] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.

[9] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Fei-Fei Li. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.

[10] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, pages 4489–4497, 2015.

[11] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. 
In ICCV, pages 3551–3558, 2013.

[12] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.

[13] Chunhui Gu, Chen Sun, Sudheendra Vijayanarasimhan, Caroline Pantofaru, David A. Ross, George Toderici, Yeqing Li, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, and Jitendra Malik. AVA: A video dataset of spatio-temporally localized atomic visual actions. CoRR, abs/1705.08421, 2017.

[14] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.

[15] Hang Zhao, Zhicheng Yan, Heng Wang, Lorenzo Torresani, and Antonio Torralba. SLAC: A sparsely labeled dataset for action classification and localization. CoRR, abs/1712.09374, 2017.

[16] Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, pages 510–526, 2016.

[17] Christoph Feichtenhofer, Axel Pinz, and Richard Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.

[18] João Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. 
In CVPR, pages 4724–4733, 2017.

[19] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.

[20] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, pages 892–900, 2016.

[21] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In ICCV, pages 609–617, 2017.

[22] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In ECCV, pages 451–466, 2018.

[23] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48, 2009.

[24] Hilde Kuehne, Hueihan Jhuang, Rainer Stiefelhagen, and Thomas Serre. HMDB51: A large video database for human motion recognition. In High Performance Computing in Science and Engineering '12, pages 571–582. Springer, 2013.

[25] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[26] Joon Son Chung and Andrew Zisserman. Out of time: Automated lip sync in the wild. 
In ACCV 2016 Workshops, pages 251–263, 2016.

[27] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.

[28] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017.

[29] Karol J. Piczak. ESC: Dataset for environmental sound classification. In ACM Multimedia, pages 1015–1018, 2015.

[30] Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, 2015.

[31] Hardik B. Sailor, Dharmesh M. Agrawal, and Hemant A. Patil. Unsupervised filterbank learning using convolutional restricted Boltzmann machine for environmental sound classification. In Interspeech, pages 3107–3111, 2017.

[32] Andrew Owens and Alexei A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In ECCV, pages 639–658, 2018.

[33] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[34] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. 
In NIPS, pages 153–160, 2006.

[35] Marc'Aurelio Ranzato, Fu Jie Huang, Y-Lan Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.

[36] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In NIPS, pages 801–808, 2006.

[37] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In ICML, 2011.

[38] Quoc V. Le, Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, pages 3361–3368, 2011.

[39] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.

[40] H. Izadinia, I. Saleemi, and M. Shah. Multimodal analysis for identification and segmentation of moving-sounding objects. IEEE Transactions on Multimedia, 15(2):378–390, 2013.

[41] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.

[42] Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. 
Self-supervised video representation learning with odd-one-out networks. In CVPR, pages 5729–5738, 2017.

[43] Ishan Misra, C. Lawrence Zitnick, and Martial Hebert. Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, pages 527–544, 2016.

[44] Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross modal distillation for supervision transfer. In CVPR, pages 2827–2836, 2016.

[45] Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In ECCV, pages 801–816, 2016.

[46] Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, pages 587–604, 2018.