{"title": "Graph Agreement Models for Semi-Supervised Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8713, "page_last": 8723, "abstract": "Graph-based algorithms are among the most successful paradigms for solving semi-supervised learning tasks. Recent work on graph convolutional networks and neural graph learning methods has successfully combined the expressiveness of neural networks with graph structures. We propose a technique that, when applied to these methods, achieves state-of-the-art results on semi-supervised learning datasets. Traditional graph-based algorithms, such as label propagation, were designed with the underlying assumption that the label of a node can be imputed from that of the neighboring nodes. However, real-world graphs are either noisy or have edges that do not correspond to label agreement. To address this, we propose Graph Agreement Models (GAM), which introduces an auxiliary model that predicts the probability of two nodes sharing the same label as a learned function of their features. The agreement model is used when training a node classification model by encouraging agreement only for the pairs of nodes it deems likely to have the same label, thus guiding its parameters to better local optima. The classification and agreement models are trained jointly in a co-training fashion. Moreover, GAM can also be applied to any semi-supervised classification problem, by inducing a graph whenever one is not provided. 
We demonstrate that our method achieves a relative improvement of up to 72% for various node classification models, and obtains state-of-the-art results on multiple established datasets.", "full_text": "Graph Agreement Models\n\nfor Semi-Supervised Learning\n\nOtilia Stretcu\u2021\u2217, Krishnamurthy Viswanathan\u2020, Dana Movshovitz-Attias\u2020,\n\nEmmanouil Antonios Platanios\u2021, Andrew Tomkins\u2020, Sujith Ravi\u2020\n\n\u2020Google Research, \u2021Carnegie Mellon University\n\nostretcu@cs.cmu.edu,{kvis,danama}@google.com,\n\ne.a.platanios@cs.cmu.edu,{tomkins,sravi}@google.com\n\nAbstract\n\nGraph-based algorithms are among the most successful paradigms for solving semi-\nsupervised learning tasks. Recent work on graph convolutional networks and neural\ngraph learning methods has successfully combined the expressiveness of neural\nnetworks with graph structures. We propose a technique that, when applied to these\nmethods, achieves state-of-the-art results on semi-supervised learning datasets.\nTraditional graph-based algorithms, such as label propagation, were designed with\nthe underlying assumption that the label of a node can be imputed from that of\nthe neighboring nodes. However, real-world graphs are either noisy or have edges\nthat do not correspond to label agreement. To address this, we propose Graph\nAgreement Models (GAM), which introduces an auxiliary model that predicts the\nprobability of two nodes sharing the same label as a learned function of their\nfeatures. The agreement model is used when training a node classi\ufb01cation model\nby encouraging agreement only for the pairs of nodes it deems likely to have the\nsame label, thus guiding its parameters to better local optima. The classi\ufb01cation\nand agreement models are trained jointly in a co-training fashion. Moreover, GAM\ncan also be applied to any semi-supervised classi\ufb01cation problem, by inducing a\ngraph whenever one is not provided. 
We demonstrate that our method achieves a relative improvement of up to 72% for various node classification models, and obtains state-of-the-art results on multiple established datasets.\n\n1 Introduction\n\nIn many practical settings, it is often expensive, if not impossible, to obtain large amounts of labeled data. Unlabeled data, on the other hand, is often readily available. Semi-supervised learning (SSL) algorithms leverage the information contained in both the labeled and unlabeled samples, thus often achieving better generalization capabilities than supervised learning algorithms. Graph-based semi-supervised learning [43, 41] has been one of the most successful paradigms for solving SSL problems when a graph connecting the samples is available. In this paradigm, both labeled and unlabeled samples are represented as nodes in a graph. The edges of the graph can arise naturally (e.g., links connecting Wikipedia pages, or citations between research papers), but oftentimes they are constructed automatically using an appropriately chosen similarity metric. This similarity score may also be used as a weight for each constructed edge (e.g., for a document classification problem, Zhu et al. [43] set the edge weights to the cosine similarity between the tf-idf vectors of documents).\n\nThere exist several lines of work that leverage graph structure in different ways, from label propagation methods [43, 41] to neural graph learning methods [7, 37] to graph convolution approaches [15, 35], which we describe in more detail in Sections 2 and 5. Most of these methods rely on the assumption that graph edges correspond in some way to label similarity (or agreement). 
For instance, label propagation assumes that node labels are distributed according to a jointly Gaussian distribution whose precision matrix is defined by the edge weights.\n\n\u2217This work was done during an internship at Google.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nHowever, in practice, graph edges and their weights come from noisy sources (especially when the graph is constructed from embeddings). Therefore, the edges may not clearly correspond to label agreement uniformly across the graph. The likelihood of two nodes sharing a label could perhaps be better modeled explicitly, as a learned function of their features. To this end, we introduce graph agreement models (GAM), which learn to predict the probability that pairs of nodes share the same label. In addition to the main node classification model, we introduce an auxiliary agreement model that takes as input the feature vectors of two graph nodes and predicts the probability that they have the same label. The output of the agreement model can be used to regularize the classification model by encouraging the label predictions for two nodes to be similar only when the agreement model says so. Intuitively, a perfect agreement model will allow labels to propagate only across \u201ccorrect\u201d edges and will thus make it possible to boost classification performance using noisy graphs.\n\nTraining either the classification or the agreement model in isolation may be hard, if not impossible, for many SSL settings. That is because we often start with a tiny number of labeled nodes, but a large number of unlabeled nodes. For example, the agreement model needs to be supervised with edges connecting labeled nodes, and in some cases, due to the scarcity of labeled data, there may not be any such edges to begin with. 
To ameliorate this issue, we also propose a learning algorithm that allows the classification and agreement models to be jointly trained. While the agreement model can be used to regularize the classification model, the most confident predictions of the latter can be used to augment the training data for the agreement model. Figure 1 illustrates the interaction between the two models. This idea is inspired by the co-training algorithm proposed by Blum and Mitchell [6]. We show in our experiments that the proposed approach achieves the best known results on several established graph-based classification datasets. We also demonstrate that our approach works well with graph convolutional networks [15], and the combination outperforms graph attention networks [35], which are expensive during inference.\n\nFigure 1: Proposed learning paradigm. The classification model f adds confident predictions to the training dataset, while the agreement model g provides regularization for training the classification model.\n\nWhile our method was originally inspired by graph-based classification problems, we show that it can also be applied to any semi-supervised classification problem, by inducing a graph whenever one is not provided. We performed experiments on the popular datasets CIFAR-10 [16] and SVHN [26] and show that GAM outperforms state-of-the-art SSL approaches. Furthermore, the proposed method has the following desirable properties:\n\n1. General: Can be applied on top of any classification model to improve its performance.\n2. State-of-the-Art: Outperforms previous methods on several established datasets.\n3. Efficient: Does not incur any additional performance cost at inference.\n4. Robust: Up to 18% accuracy improvement when 5% of the graph edges correspond to agreement.\n\n2 Background\n\nWe introduce notation used in the paper, and describe related work most relevant to our proposed method. 
Let G(V, E, W) be a graph with nodes V, edges E, and edge weights W = {w_ij}_{(i,j)\u2208E}. Each node i \u2208 V is represented by a feature vector x_i and label y_i. Labels are observed for a small subset of nodes, L \u2282 V, and the goal is to infer them for the remaining unlabeled nodes, U = V \\ L.\n\nGraph-based algorithms, such as label propagation (LP), tackle this problem by assuming that two nodes connected by an edge likely have the same label, and a higher edge weight indicates a higher likelihood that this is true. In LP, this is done by encouraging a node\u2019s predicted label probability distribution to be equal to a weighted average of its neighbors\u2019 distributions. While this method is simple and scalable, it is limited in that it does not take advantage of the node features. Weston et al. [37] and Bui et al. [7] propose combining the LP approach with the power of neural networks by learning expressive node representations. In particular, Bui et al. [7] propose Neural Graph Machines (NGM), a method for training a neural network that predicts node labels solely based on node features, with the LP assumption taking the form of regularization. They minimize the following objective:\n\nL_NGM = \u2211_{i\u2208L} \u2113(f(x_i), y_i) + \u03bb_LL \u2211_{i,j\u2208L, (i,j)\u2208E} w_ij d(h_i, h_j) + \u03bb_LU \u2211_{i\u2208L, j\u2208U, (i,j)\u2208E} w_ij d(h_i, h_j) + \u03bb_UU \u2211_{i,j\u2208U, (i,j)\u2208E} w_ij d(h_i, h_j),\n\nwhere the four terms are, respectively, the supervised, labeled-labeled, labeled-unlabeled, and unlabeled-unlabeled terms, f(x_i) is the predicted label distribution for node i, h_i is the last hidden layer representation of the network for input x_i, \u2113 is a cost function (e.g., cross-entropy), and d is a loss function that measures dissimilarity between representations (e.g., L2). 
\u03bb_LL, \u03bb_LU, and \u03bb_UU are positive constants representing regularization weights applied to distances between node representations for edges connecting two labeled nodes, a labeled and an unlabeled node, and two unlabeled nodes, respectively. Intuitively, this objective function aims to match predictions with labels for nodes where labels are available, while also making node representations similar for neighboring nodes in the graph.\n\nNGMs are used to train neural networks and learn complex node representations; they are scalable, and they incur no added cost at inference time, as the classification model f is unchanged. However, the quality of the learned parameters relies on the quality of the underlying graph. Most real-world graphs contain spurious edges that may not directly reflect label similarity. In practice, graphs are of one of two types: (1) Provided: As an example, several benchmark datasets for graph-based SSL consider a citation graph between research articles, and the goal is to classify the article topic. While articles with similar topics often cite each other, there exist citation edges between articles of different topics. Thus, this graph offers a good but non-deterministic prior for label agreement. (2) Constructed: In many settings, graphs are not available, but can be generated. For example, one can generate graph edges by calculating the pairwise distances between research articles, using any distance metric. 
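As a minimal sketch of this construction (an illustration, not the exact procedure used in the paper), one can connect each node to its k most similar neighbors under cosine similarity and reuse the similarity score as the edge weight; the `knn_graph` helper below is a hypothetical name:

```python
import numpy as np

def knn_graph(features: np.ndarray, k: int = 2):
    """Induce graph edges by connecting each node to its k nearest
    neighbors under cosine similarity. Returns a set of (i, j) edges
    with i < j, plus a weight dict keyed by edge."""
    # Cosine similarity between all pairs of row-normalized feature vectors.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.clip(norms, 1e-12, None)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-edges

    edges, weights = set(), {}
    for i in range(len(features)):
        for j in np.argsort(-sim[i])[:k]:  # k most similar nodes to i
            e = (min(i, int(j)), max(i, int(j)))
            edges.add(e)
            weights[e] = float(sim[e[0], e[1]])  # similarity as edge weight
    return edges, weights
```

Any other distance metric can be substituted for cosine similarity; as the surrounding text notes, the usefulness of the resulting graph depends on how well that metric tracks label agreement.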
The quality of this graph then depends on how well the distance metric reflects label agreement.\n\nIn either case, edge weights may not correspond to the likelihood of label agreement, and given a small number of labeled nodes, it is hard to determine whether that correspondence exists in a given graph. This drastically limits the regularization capacity of label propagation methods: a large regularization weight risks disrupting the base model due to noisy edges, while a small regularization weight does not prevent the base model from overfitting. In the next section, we propose a novel approach that aims to address this problem, and can be thought of as a generalization of label propagation methods.\n\n3 Proposed Method\n\nWe propose Graph Agreement Models (GAM), a novel approach that aims to resolve the main limitation of label propagation methods while leveraging their strengths. Instead of using the edge weights as a fixed measure of how much the labels of two nodes should agree, GAM learns the probability of agreement. To achieve this, we introduce an agreement model, g, that takes as input the features of two nodes and (optionally) the weight of the edge between them, and predicts the probability that they have the same label. The predicted agreement probabilities are then used when training the classification model, f, to prevent overfitting.\n\nClassification Model. The only aspect of the classification model that we modify is the loss function. We propose a modified version of the NGM loss function, where the weight of each edge\u2019s contribution to the loss is decided by the agreement model. In other words, we replace all w_ij with g_ij = g(x_i, x_j, w_ij). 
The new loss function becomes:\n\nL_GAM = \u2211_{i\u2208L} \u2113(f_i, y_i) + \u03bb_LL \u2211_{i,j\u2208L, (i,j)\u2208E} g_ij d(y_i, f_j) + \u03bb_LU \u2211_{i\u2208L, j\u2208U, (i,j)\u2208E} g_ij d(y_i, f_j) + \u03bb_UU \u2211_{i,j\u2208U, (i,j)\u2208E} g_ij d(f_i, f_j), (1)\n\nwhere we use the short notation f_i = f(x_i). Note that there are actually a few more differences between L_NGM and L_GAM. Since the agreement model g is designed to estimate agreement between labels, and not between the hidden representations h generated by f, we in fact penalize disagreement between the predicted label distributions directly. This is also easier to implement for arbitrary classification models, since it removes the need to decide what the hidden representation of the graph nodes should be. Moreover, our regularization terms make use of the supervised node labels whenever available (i.e., in the LU term, or on one of the two sides of the LL term). This is because we aim to decrease the entropy of the predictions, which, as we have empirically observed, improves the stability of the learning process.\n\nAgreement Model. The agreement model, g, can be any neural network. The only constraint is that it must receive the features of two nodes and predict a single value that represents the probability that the two nodes have the same label. Note that using the edge weight could be helpful, but is not necessary. Since modularity enables more flexibility, we decided to split the agreement model further into the following components:\n\n1. Encoder: Produces a vector representation for a node. The same encoder network is applied to both inputs (each input is a node\u2019s features) of the agreement model.\n\n2. Aggregator: Combines the encoded representations of the two node arguments into a single vector, and is invariant to the order of its two arguments (e.g., the \u201csum\u201d operation). 
The last condition represents a meaningful and valid inductive bias for the agreement model, namely that the order in which nodes are presented should not influence their probability of agreement.\n\n3. Predictor: Given the aggregator output, this component predicts the probability that the initial two nodes have the same label.\n\nFigure 3: Overview of the three main steps in each iteration of the proposed co-training algorithm: (1) train the agreement model g using L; (2) train the classification model f using labeled nodes in L and predictions of g on edges between L-L, L-U, and U-U nodes; (3) extend L using the most confident predictions of f on unlabeled nodes from U.\n\nFigure 2 shows how these components are used together in the agreement model. This formulation is highly generic, as each module can be implemented as an arbitrary neural network. The recent success of BERT [10]\u2014a transfer learning architecture that recently achieved state-of-the-art performance for several natural language processing tasks\u2014seems to indicate that it is important to have a highly expressive encoder, even if the predictor is only a linear function. Furthermore, it is clear that the choice of encoder network heavily depends on the nature of the data (e.g., convolutional neural networks perform well for images). However, through our extensive experiments\u2014which are further described in Section 4\u2014we observed that simple multi-layer perceptrons consistently provide a good trade-off between performance and efficiency. Regarding the aggregator, \u201caddition\u201d and \u201csubtraction\u201d are both simple and valid options. 
However, the functional form that seemed to work best in practice, and the one we use in our experiments, is defined as aggregator(e_i, e_j) = (e_i \u2212 e_j)^2, where e_i and e_j are the output vector embeddings from the encoder for nodes i and j, respectively. This function is invariant to the order of the two nodes, and it reflects our intuition that agreement probability can be thought of as a distance between two nodes in a latent space. For the predictor, we use a linear layer, similar to BERT.\n\nFigure 2: Agreement model components. The two nodes are passed through a shared encoder, their embeddings are combined by the aggregator, and the predictor outputs the agreement probability.\n\nFinally, we use the following loss function to train the agreement model:\n\nL^agreement_GAM = \u2211_{i,j\u2208L, (i,j)\u2208E} \u2113(g(x_i, x_j, w_ij), 1_{y_i=y_j}), (2)\n\nwhere \u2113 is a binary classification loss function (e.g., sigmoid cross-entropy), and 1_{y_i=y_j} is an indicator function whose value is 1 when y_i = y_j, and 0 otherwise. It now remains to describe the overall learning algorithm we propose for jointly training the classification and agreement models.\n\n3.1 Learning Algorithm\n\nThe classification model, f, is trained by minimizing the loss function shown in Equation 1. However, this loss function uses the agreement model g, which also needs to be trained. We can think of g as regularizing the training process of f. Perhaps most interestingly though, while the agreement model can play a crucial role in training the classification model, the classification model can also help train the agreement model. A key contribution of our work is exactly this interaction between the training processes of f and g. More specifically, we propose the following learning algorithm:\n\n1. Train the agreement model g to convergence, using the limited amount of labeled data that is provided. We refer to the initial trained model as g^0.\n\n2. Train f using g^0 in its loss function. We refer to the trained model as f^0.\n\n3. 
Let f^0 produce predictions over all of the unlabeled nodes. Although this model was trained using a limited amount of data, we expect its most confident predictions (i.e., the labels with the highest probability) to most likely be correct. Thus, we take the top M most confident predictions and add them to the set of labeled nodes. We refer to this step as the self-labeling phase.\n\nThe newly added labeled examples can provide new information to the agreement and classification models. We thus start again from step 1, and obtain new trained models g^1 and f^1, and a new set of M most confident predictions for the remaining unlabeled nodes. We repeat this process for k steps (or until all nodes have been labeled), using g^{k\u22121} to help train f^k and f^k to help train g^{k+1}.\n\nThis training algorithm resembles the co-training algorithm, originally proposed by Blum and Mitchell [6]. The core idea behind it is that, if f and g are good at exploiting different kinds of information, then we can leverage that by having them help train each other. Similar algorithms have been successfully used in practice [e.g., 22], and, for some settings, there even exist theoretical guarantees that such algorithms will converge to a better classifier than the one that would have been obtained without co-training [3]. For these reasons, we expect this interaction to boost the performance of both f and g. An illustration of the algorithm is shown in Figure 3.\n\nNote that g only participates in training. At inference time, predictions are made by applying the trained f to the input. Thus, GAM does not incur extra computation cost at inference.\n\n3.2 Inducing Graphs\n\nMethods that rely on the provided graph have two main limitations. First, they cannot be applied to datasets that do not include a graph. 
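The co-training procedure of Section 3.1 can be sketched as a single loop; the function names `train_agreement`, `train_classifier`, and `most_confident` below are illustrative placeholders for the actual model-fitting routines, not the paper's API:

```python
def co_train(labeled, unlabeled, num_iterations, top_m,
             train_agreement, train_classifier, most_confident):
    """Minimal sketch of the GAM co-training loop (Section 3.1).

    `labeled` maps node ids to labels; `unlabeled` is a list of node ids.
    `most_confident(f, unlabeled, top_m)` stands in for the self-labeling
    step: it returns the top-M most confident predictions of f as a dict.
    """
    f = g = None
    for _ in range(num_iterations):
        if not unlabeled:
            break                                     # all nodes labeled
        g = train_agreement(labeled)                  # step 1: fit g on labeled pairs
        f = train_classifier(labeled, g)              # step 2: fit f, regularized by g
        new_labels = most_confident(f, unlabeled, top_m)  # step 3: self-label
        labeled = {**labeled, **new_labels}
        unlabeled = [x for x in unlabeled if x not in new_labels]
    return f, g
```

Each iteration retrains both models from the enlarged labeled set, so the most confident predictions of f from one round supervise g in the next.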
In addition, by inspecting Equation 1 it is easy to notice that\neven with g providing perfect predictions, it will only allow labels to propagate along the graph edges\nconnecting nodes with matching labels. However, if the graph is sparse or the number of labeled\nnodes is small, there may be unlabeled nodes for which there is no \u201cagreement\u201d path connecting\nit to a labeled node from its class. In fact, in the benchmark datasets, Cora [19], Citeseer [5] and\nPubmed [25], propagating labels through \u201cagreement\u201d edges, while starting at the provided labeled\nnodes, only covers 84%, 49%, and 85% of the nodes respectively. The remaining nodes do not appear\nin any of the regularization terms of the classi\ufb01cation model loss function, thus making it prone to\nover\ufb01tting. Our approach alleviates this issue by self-labeling unlabeled nodes. These nodes can then\npropagate their labels during the next co-training iteration.\n\nIn fact, we propose to go a step further and address both limitations. Notice that g can be trained and\napplied on any pair of labeled nodes\u2014not necessarily connected by an edge\u2014and can thus regularize\npredictions made by f for any pair of nodes. This can be achieved by removing all constraints\nij \u2208 E from Equations 1 and 2. In this formulation the provided graph becomes unnecessary. This is\nequivalent to having a fully-connected graph, and using the agreement model to denoise it. We refer\nto this GAM variant that does not use a graph as GAM*.\n\nOur experimental results, presented in the next section, indicate that this new formulation not only\nboosts the performance of GAM on some graph-based datasets, but it also opens up a wide range\nof new applications. That is because GAM* can now be applied to any SSL dataset, whether or\nnot a graph is provided. 
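The agreement model g at the core of both GAM and GAM* can be sketched as follows; this is a simplified, untrained stand-in (random weights, NumPy instead of the paper's TensorFlow implementation) illustrating the encoder/aggregator/predictor split with the (e_i \u2212 e_j)^2 aggregator:

```python
import numpy as np

rng = np.random.default_rng(0)

class AgreementModel:
    """Sketch of g(x_i, x_j): shared encoder, order-invariant
    squared-difference aggregator, and a linear predictor followed
    by a sigmoid. Weights here are random, untrained placeholders."""

    def __init__(self, in_dim, hidden_dim):
        self.W_enc = rng.normal(size=(in_dim, hidden_dim))  # encoder weights
        self.w_pred = rng.normal(size=hidden_dim)           # linear predictor

    def encode(self, x):
        # Shared encoder, applied identically to both node inputs.
        return np.tanh(x @ self.W_enc)

    def __call__(self, x_i, x_j):
        e_i, e_j = self.encode(x_i), self.encode(x_j)
        agg = (e_i - e_j) ** 2                # invariant to argument order
        logit = agg @ self.w_pred
        return 1.0 / (1.0 + np.exp(-logit))   # agreement probability
```

Order invariance follows directly from the aggregator: g(x_i, x_j) equals g(x_j, x_i) exactly, so the probability of agreement does not depend on how the pair is presented.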
In Section 4.2, we evaluate GAM* on two datasets with no inherent graph structure, and show that it is able to improve upon state-of-the-art methods for SSL.\n\n4 Experiments\n\nWe performed a set of experiments to test different properties of GAM. First, we tested the generality of GAM by applying our approach to Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), Graph Convolution Networks (GCN) [15], and Graph Attention Networks (GAT) [35].\u00b2 Next, we tested the robustness of GAM when faced with noisy graphs, and evaluated GAM and GAM* with and without a provided graph, comparing them with the state-of-the-art methods.\n\n4.1 Graph-based Classification\n\nDatasets. We obtained three public datasets from Yang et al. [38]: Cora [19], Citeseer [5], and Pubmed [25], which have become the de facto standard for evaluating graph node classification algorithms. We used the same train/validation/test splits as Yang et al. [39], which have been used by the methods we compare to. In these datasets, graph nodes represent research publications and edges represent citations. Each node is represented as a vector whose components correspond to words. For Cora and Citeseer the vector elements are binary, indicating whether the corresponding term is present in the publication, while for Pubmed they are real-valued tf-idf scores. The goal is to classify research publications according to their main topic, which belongs to a provided set of topics. In each case we are given true labels for a small subset of nodes. Dataset statistics are shown in Table 4 in Appendix A.\n\n\u00b2MLPs and CNNs are common in many SSL problems, and GCN and GAT achieve state-of-the-art performance on three datasets commonly used in recent graph-based SSL work.\n\nSetup. We implemented our models in TensorFlow [1]. 
Parameter updates use the Adam optimizer [14] with default TensorFlow parameters and an initial learning rate of 0.001 for MLPs and GCN, and 0.005 for GAT (based on the original publication [35]). When training the classification model, we used a batch size of 128 for both the supervised term and for the edges in each of the LL, LU, and UU terms. We stopped training when validation accuracy did not increase in the last 2000 iterations, and reported the test accuracy at the iteration with the best validation performance. For the agreement model, we sampled random batches containing pairs of nodes from the pool of all edges with both nodes labeled for GAM, or of all pairs of nodes for GAM*. In both cases, we ensured a ratio of 50% positives (labels agree) and 50% negatives (labels disagree). In the case of GAM, since graphs typically contain more positive edges than negative, extra negative samples were selected at random from the pairs of nodes with no edge connecting them. Our experiments were performed using a single Nvidia Titan X GPU, and our implementation can be found at https://github.com/tensorflow/neural-structured-learning.\n\nTable 1: Test classification accuracies (%) on graph-based datasets. The first section contains results reported in related work. 
The next segments show results for different classifiers and their extensions using NGM, VAT, GAM, and GAM*. Subscripts refer to the number of hidden units. Shaded methods do not use the graph.\n\nModel | Cora | Citeseer | Pubmed\nManiReg [4] | 59.5 | 60.1 | 70.7\nSemiEmb [37] | 59.0 | 59.6 | 71.7\nLP [43] | 68.0 | 45.3 | 63.0\nDeepWalk [28] | 67.2 | 43.2 | 65.3\nICA [19] | 75.1 | 69.1 | 73.9\nPlanetoid [39] | 75.7 | 64.7 | 77.2\nChebyshev [8] | 81.2 | 69.8 | 74.4\nMLP[250, 100]+NGM [7] | \u2013 | \u2013 | 75.9\nMoNet [24] | 81.7 | \u2013 | 78.8\nGCN16 [15] | 81.5 | 70.3 | 79.0\nGAT8 [35] | 83.0 | 72.5 | 79.0\nGCN16 + O-BVAT [9] | 83.6 | 74.0 | 79.9\n\nMLP128 | 51.7 | 52.2 | 69.4\nMLP128 + NGM | 77.7 | 67.8 | 73.6\nMLP128 + VAT | 56.5 | 56.1 | 73.1\nMLP128 + VATENT | 24.1 | 46.7 | 70.1\nMLP128 + GAM | 80.7 | 73.0 | 82.8\nMLP128 + GAM* | 70.7 | 70.3 | 71.9\n\nMLP4\u00d732 | 46.6 | 49.0 | 68.1\nMLP4\u00d732 + NGM | 77.6 | 63.1 | 68.9\nMLP4\u00d732 + VAT | 55.3 | 46.5 | 69.5\nMLP4\u00d732 + VATENT | 33.0 | 29.1 | 69.8\nMLP4\u00d732 + GAM | 80.1 | 70.4 | 73.5\nMLP4\u00d732 + GAM* | 64.0 | 66.9 | 71.3\n\nGCN128 | 80.9 | 68.7 | 76.9\nGCN128 + NGM | 81.4 | 70.2 | 76.2\nGCN128 + VAT | 79.0 | 74.2 | 76.8\nGCN128 + VATENT | 83.4 | 62.5 | 75.0\nGCN128 + GAM | 86.2 | 79.3 | 86.0\nGCN128 + GAM* | 84.2 | 76.9 | 77.0\n\nGCN1024 | 78.5 | 70.5 | 81.3\nGCN1024 + NGM | 68.9 | 70.5 | 82.0\nGCN1024 + VAT | 76.3 | 69.3 | 81.8\nGCN1024 + VATENT | 72.1 | 50.5 | 64.0\nGCN1024 + GAM | 81.6 | 73.6 | 86.0\nGCN1024 + GAM* | 81.2 | 71.9 | 82.4\n\nGAT128 | 81.6 | 69.0 | \u2013\nGAT128 + NGM | 80.3 | 70.8 | \u2013\nGAT128 + GAM | 84.3 | 70.3 | \u2013\nGAT128 + GAM* | 85.0 | 73.6 | \u2013\n\nModels. For both GAM and NGM, we used Euclidean distance for d, and we selected \u03bb_LL, \u03bb_LU, and \u03bb_UU based on validation set accuracy, where we varied \u03bb_LU \u2208 {0.1, 1, 10, 100, 1000, 10000}, and set \u03bb_UU = \u03bb_LU / 2 and \u03bb_LL = 0 (we found through experimentation that the LL component does not have a significant contribution, probably because the predictions for labeled nodes are already accounted for in the supervised loss term). For the agreement model, we used an MLP with the same number of hidden units as the classification model. We started with 20 labeled examples per class and, when extending the labeled node set, we added the M most confident predictions of the classifier over unlabeled nodes. In our experiments, we set M = 200, but doing parameter selection for M as well could potentially lead to even better results. To avoid adding incorrectly-labeled nodes, we filtered out predictions where the classification confidence (i.e., the maximum probability assigned to one of the labels) was lower than 0.4 (since the smallest number of classes considered is 3, for Pubmed, making chance classification probability 0.33).\n\nResults. Our results are reported in Table 1. Results obtained with GAM are denoted in the form \u201c{base model} + GAM\u201d. The subscript following the base model represents the number of hidden units of the classification model (e.g., MLP128 is a multilayer perceptron with a single layer of 128 hidden units, and MLP4\u00d732 is a multilayer perceptron with 4 layers of 32 hidden units each). We also report the best known results for these datasets from other publications, as reported in [35]. Furthermore, in order to allow for a more complete comparison with other general-purpose SSL methods, we also compared with VAT [23]\u2014the current state-of-the-art SSL method, as reported in [27] and [23]\u2014and its entropy minimization variant, VATENT. We set the VAT regularization weight to 1, as in [23, 27]. 
The results can be summarized as follows:\n\n\u2022 GAM always improves the classification accuracy of the base model, for all base models, often by a significant margin (e.g., +33.5% for MLP4\u00d732 on Cora, which is a relative increase of 72%). Note that we measure relative performance as (new_accuracy \u2212 baseline_accuracy) / baseline_accuracy.\n\n\u2022 GAM also consistently achieves important gains compared to NGM (e.g., +4.8% for GCN128 on Cora, and +9.8% on Pubmed), supporting the intuition behind our edge denoising approach.\n\n\u2022 VAT also consistently improves upon the baseline classifier (although not as much as GAM or GAM*), even though it does not use the graph and it treats the unlabeled nodes as independent samples. Interestingly, VATENT fails on these datasets in many cases, although the same method performs very well on other SSL datasets (Section 4.2).\n\n\u2022 It is interesting to note that although GCN and GAT already use the graph as part of their architecture, their performance can be further improved by using GAM.\n\n\u2022 To the best of our knowledge, the GAM variants obtain the best results reported on these datasets.\n\n\u2022 Further, note that GCN with GAM outperforms GAT (which is GCN with attention), suggesting that GAM regularization is a better alternative to attention for handling noisy graphs. Note that the GAT results for Pubmed are missing because we use the implementation of GAT provided by Veli\u010dkovi\u0107 et al. [35], and it runs out of GPU memory for 128 hidden units on Pubmed.\n\nRobustness. We developed GAM with the goal of being able to handle graphs with \u201cincorrect\u201d edges (i.e., those that connect nodes with differing labels). 
We consider such edges "incorrect" under the label propagation assumption, despite the fact that they may refer to real-world connections between these nodes (e.g., citations between research articles on different topics). In Cora, Citeseer, and Pubmed, 19%, 26%, and 20% of the edges, respectively, are incorrect. To demonstrate the ability of GAM to handle these incorrect edges, and perhaps even higher levels of noise, we performed a robustness analysis by introducing spurious edges to the graph and testing whether our agreement model learns to ignore them. We added spurious edges by randomly sampling pairs of nodes with different true labels until the percentage of incorrect edges met a desired target. We tested the performance of GAM on a set of graphs created in this manner. MLPs are good base model candidates for testing this because they can only be affected by the graph quality through the GAM regularization terms (unlike GCN or GAT, where the graph is implicitly used in the model). The results are shown in Figure 4 on the Citeseer dataset (the hardest of the three datasets), for graphs containing between 5% and 74% correct edges. A plain MLP with 128 hidden units obtains 52.2% accuracy, independent of the level of noise in the graph. Adding GAM to this MLP increases its accuracy by about 19%. This improvement persists even as the fraction of correct edges decreases. For example, the accuracy remains 70% even in the case where only 5% of the graph edges are correct. In contrast, the performance of NGM steadily decreases as the fraction of incorrect edges increases, to the point where it starts performing worse than the plain MLP (when the percentage of correct edges is ≤ 60%), at which point it is preferable not to use it.

Figure 4: Robustness to noisy graphs. The x axis represents the percentage of correct edges remaining after adding wrong edges to the Citeseer dataset.

Ablation Study.
We performed experiments to show how much each component of GAM contributes to its success, as follows:

(1) Perfect agreement: We evaluated how well GAM would perform if the agreement model produced perfect predictions. This is done by letting the agreement model see the true labels, and always return 1 when nodes agree, and 0 otherwise. We ran this experiment for all 3 datasets with an MLP base classifier. The results in Table 2 show that a perfect agreement model produces a huge boost, up to 38.8% over the baseline. For Citeseer, the smaller improvement is not surprising, given that only 49% of nodes are connected by agreement (see Section 3.2).

Table 2: Accuracy (%) of an MLP with 128 hidden units using GAM with a perfect agreement model.

    Model            Cora    Citeseer    Pubmed
    MLP128 + GAMp    90.5    76.5        91.6

(2) Sensitivity to agreement model: We evaluated how sensitive GAM is to the choice of agreement model architecture. To assess this, we ran GAM multiple times, with a fixed classification model architecture and varying agreement model sizes. Figure 6 in Appendix C shows the test accuracy per co-train iteration for each of these models. The results indicate that the behaviour of GAM is stable with respect to the agreement model size, which suggests that the agreement model size is a hyperparameter that does not require much tuning effort.

(3) Self-labeling: We evaluated the usefulness of the self-labeling component by showing how the test accuracy evolves after each co-training iteration. Figure 5 shows that the accuracy generally has an increasing trend with more co-training iterations.
In some cases, the final iterations may have a decreasing trend, because in the last few iterations the model self-labels the samples that it is most uncertain about, and thus it is more likely to make mistakes. For this reason, we kept track of the validation accuracy, and at the end we restored the model from the co-train iteration with the best validation accuracy. Self-labeling is also a critical component for datasets such as Pubmed, where in the first co-train iteration there are no edges with both nodes labeled, so g cannot be trained until we self-label more nodes. In such cases, g returns 1 by default until it can be trained, defaulting to NGM and relying on the graph (although for noisy graphs, one could return 0 by default).

4.2 Semi-Supervised Learning Without a Graph

Our robustness experiments show that GAM is effective even when the majority of edges in the graph connect nodes with mismatched labels. Therefore, we tested its power further by considering a more extreme scenario: no graph is provided, and the agreement model is tasked with learning whether an arbitrary pair of nodes shares a label. Note that having no graph, and picking random pairs of samples to use in the regularization terms in Equation 1, is equivalent to having a fully-connected graph from which we sample edges. We tested this scenario on Cora, Citeseer, and Pubmed, and the results are marked as GAM* in Table 1. For completeness, we also show results for GCN+GAM* and GAT+GAM*, where even though the GAM* regularization term does not use the graph, the classification models use it by design. Our results show that GAM* also boosts the performance of all tested baseline models, with a gain of up to 19% accuracy for MLPs, 3.3% for GCNs, and 4.6% for GATs.
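Conceptually, the GAM* regularization term described above can be sketched as follows: sample random pairs of samples (equivalent to sampling edges from a fully-connected graph) and penalize disagreement between the classifier's predicted distributions for each pair, weighted by the agreement model's estimate that the pair shares a label. This is a minimal illustration, not the paper's released implementation; the function name, the squared-distance penalty, and the `agreement_fn` interface are our own assumptions.

```python
import numpy as np

def gam_star_regularizer(probs, agreement_fn, features, num_pairs, rng):
    """Sketch of a GAM*-style regularizer (illustrative, not the official code).

    probs:        (n, num_classes) predicted class distributions.
    agreement_fn: hypothetical agreement model g(x_i, x_j) -> probability
                  in [0, 1] that the two samples share a label.
    features:     (n, d) sample features fed to the agreement model.
    num_pairs:    number of random "edges" sampled from the implicit
                  fully-connected graph.
    """
    n = probs.shape[0]
    total = 0.0
    for _ in range(num_pairs):
        # A random pair of distinct samples = a sampled edge.
        i, j = rng.choice(n, size=2, replace=False)
        # Weight the consistency penalty by the agreement prediction.
        w = agreement_fn(features[i], features[j])
        total += w * np.sum((probs[i] - probs[j]) ** 2)
    return total / num_pairs
```

This term would be added to the classification loss; when the agreement model predicts 0 for a pair, that pair contributes nothing, which is how GAM down-weights edges that do not correspond to label agreement.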
It is worth noting that, even though GAM outperforms GAM* due to the extra information provided by the graph, GAM* generally outperforms the competing methods that also do not use a graph, and often even NGM, which does.

Non-graph Datasets. Since our approach no longer requires a graph to be provided, we tested GAM on the popular CIFAR-10 [16] and SVHN [26] datasets. For evaluation, we use the setup and train/validation/test splits provided by [27], which aims to provide a realistic framework for evaluating SSL methods. Thus, we start with 4000 and 1000 labeled samples for CIFAR-10 and SVHN, respectively, while the remaining training samples are considered unlabeled. More information about these datasets can be found in Appendix B. It is important to note that while Cora, Citeseer, and Pubmed were evaluated under a transductive setting (where the input features and the graph structure of the test nodes are seen during training, but not their labels), as is typical in graph-based SSL, in the following experiments we evaluate GAM* under an inductive setting (the features of the test nodes are completely held out, and there is no graph to provide other information about them).

Models. As the datasets consist of images, we use a Convolutional Neural Network (CNN) with 2 convolution layers followed by max-pooling, then 2 fully-connected layers (architecture details in Appendix D). The agreement model is a 3-layer MLP with 128, 64, and 32 hidden units, respectively, and Leaky ReLU activations. After each co-training iteration, we self-label 1000 unlabeled samples, subject to a confidence > 0.4 (same as in Section 4.1, but tuning may improve results further).
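The confidence-thresholded self-labeling step used in both sets of experiments (top-M confident predictions above a 0.4 threshold in Section 4.1, 1000 samples here) can be sketched as below. The helper name and signature are our own illustration, not the released implementation.

```python
import numpy as np

def select_self_labels(probs, k, threshold=0.4):
    """Pick up to k unlabeled samples to self-label: the most confident
    predictions whose maximum class probability exceeds the threshold.

    probs: (n, num_classes) classifier predictions over unlabeled samples.
    Returns (indices, predicted_labels) for the newly self-labeled samples.
    """
    confidence = probs.max(axis=1)                 # max probability per sample
    keep = np.flatnonzero(confidence > threshold)  # drop low-confidence rows
    # Sort the surviving samples by confidence, highest first, and keep top k.
    top = keep[np.argsort(-confidence[keep])][:k]
    return top, probs[top].argmax(axis=1)
```

After each co-training iteration, the classifier's predictions over the still-unlabeled samples would be passed through such a filter (k = 200 for the citation graphs, k = 1000 here), and the selected samples moved into the labeled set.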
VAT and VATENT settings are the same as in Section 4.1.

Table 3: Classification accuracies (%) on CIFAR-10 with 4000 labels, and SVHN with 1000 labels.

    Model                   CIFAR-10    SVHN
    CNN                     62.57       72.33
    CNN + VAT               64.37       70.26
    CNN + VATENT            66.73       81.86
    CNN + GAM*              69.27       83.43
    CNN + VAT + GAM*        69.64       85.47
    CNN + VATENT + GAM*     67.29       84.63

Results. Table 3 shows that GAM* significantly improves performance over the baseline classifier, even when no graph is given (up to 13% on SVHN). Moreover, it can improve performance over one of the best current SSL methods, VAT, when applied in conjunction with it (e.g., +5.27% when GAM* is applied on top of VAT on CIFAR-10, which yields a 7% improvement over a plain CNN). We show the progression of the test accuracy per co-train iteration on CIFAR-10 in Figure 5 (b). Moreover, we did not tune the parameters of the CNN or the learning rate to be favorable to our method. However, the results indicate that GAM* offers a promising direction for general-purpose SSL.

Figure 5: Test accuracy per co-train iteration for (a) MLP128 + GAM on the Citeseer dataset, starting with 120 labeled samples and self-labeling 200 samples per co-train iteration, and (b) CNN + GAM* on the CIFAR-10 dataset, starting with 4000 labeled samples and self-labeling 1000 samples per co-train iteration. Iteration 0 shows the baseline model accuracy, without GAM.

5 Related Work

There has been substantial work on graph-based semi-supervised learning [e.g., 34]. A first class of methods regularizes the predicted labels using the Laplacian of the graph, without taking advantage of the node features. These include label propagation [43, 41], manifold regularization [4], and ICA [19].
Another line of work [18, 20] focuses on refining the SSL graphs obtained from similarity matrices using only the similarity scores, but ignoring the node features. Recent approaches have attempted to marry the core idea behind these methods with the expressive power afforded by neural networks. Among these, the regularization-based approaches of Weston et al. [36, 37], as well as Neural Graph Machines [7] (described in Section 2), are closest to ours. Moreover, Planetoid [39] applies regularization using a term that depends on the skip-gram representation of the graph. Note that the notion of using agreement in predictions made by classifiers is a concept that has also been used more broadly in the context of SSL [e.g., 29], and not just for graph-based SSL. Another class of techniques learns node embeddings that take into account both the features and the graph, which are then consumed by standard supervised learning methods [28, 13, 31, 11]. More recently, there has been a large amount of work on Graph Neural Networks, which extend neural networks to graph-structured inputs; see [42] for a survey of methods in this category. Among these, the most relevant to our work are graph convolutional networks (GCN), proposed by Kipf and Welling [15], and a scalable extension [40]. These approaches define a notion of graph convolution and use an approximation of it to provide a scalable method that produced state-of-the-art results. Moreover, [35] and [33] applied attention on the edges of the graph to further improve the performance of GCN.

Aside from graph-based approaches, there has also been a great deal of work on SSL methods without a graph. Most relevant to our work are methods that use regularization to discourage the model from making vastly dissimilar predictions for similar inputs.
These include the Π-Model [17, 30], Mean Teacher [32], Virtual Adversarial Training (VAT) [23], SNTG [21], and fast-SWA [2]. Some of the best results are obtained by combining VAT with entropy minimization [12], which adds a loss term that encourages more confident predictions. SNTG infers a similarity graph between samples, but it does so in a significantly different way than GAM*. Also, in contrast to SNTG, we propose an additional self-training component, and our method is applicable when a graph is provided, whereas SNTG, as published, is not designed to use information from a provided graph.

Our proposed method, GAM, can be applied as an extension to all of the above methods, as it only requires the addition of a regularization term to their loss function.

6 Conclusions

We introduced Graph Agreement Models (GAM), a novel regularization method for graph-based and general-purpose semi-supervised learning (SSL) that can be applied on top of any classification model. The key idea behind our approach is the interaction between a node classification model and a node agreement model, which are trained in tandem in a co-training fashion. Our experiments show that GAM can improve the accuracy of several types of classifiers, including two of the most successful graph-based SSL methods, thus establishing a new state of the art for graph-based classification. Moreover, we demonstrated that GAM can be extended to settings where a graph is not provided, and that it is able to improve upon the performance of some of the best SSL classification models.

References

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning.
arXiv e-prints, March 2016.

[2] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In ICLR, 2019.

[3] Maria-Florina Balcan, Avrim Blum, and Ke Yang. Co-training and expansion: Towards bridging theory and practice. In Advances in Neural Information Processing Systems, pages 89–96, 2005.

[4] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

[5] Indrajit Bhattacharya and Lise Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5, 2007.

[6] Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100. ACM, 1998.

[7] Thang D Bui, Sujith Ravi, and Vivek Ramavajjala. Neural graph learning: Training neural networks using graphs. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 64–71. ACM, 2018.

[8] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[9] Zhijie Deng, Yinpeng Dong, and Jun Zhu. Batch virtual adversarial training for graph convolutional networks. CoRR, abs/1902.09192, 2019. URL http://arxiv.org/abs/1902.09192.

[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[11] E. Faerman, F. Borutta, K. Fountoulakis, and M.W. Mahoney. Lasagne: Locality and structure aware graph node embedding.
arXiv e-prints, October 2017.

[12] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.

[13] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[15] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.

[16] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.

[17] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242, 2017.

[18] Wei Liu and Shih-Fu Chang. Robust multi-class transductive learning with graphs. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 381–388, 2009.

[19] Qing Lu and Lise Getoor. Link-based classification. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 496–503, 2003.

[20] Dijun Luo, Heng Huang, Feiping Nie, and Chris H Ding. Forging the graphs: A low rank and positive semidefinite graph learning approach. In Advances in Neural Information Processing Systems, pages 2960–2968, 2012.

[21] Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8896–8905, 2018.

[22] Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bo Yang, Justin Betteridge, Andrew Carlson, B Dalvi, Matt Gardner, Bryan Kisiel, et al.
Never-ending learning. Communications of the ACM, 61(5):103–115, 2018.

[23] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[24] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5425–5434, 2017.

[25] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, 2012.

[26] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[27] Avital Oliver, Augustus Odena, Colin A. Raffel, Ekin Dogus Cubuk, and Ian J. Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. In NeurIPS, 2018.

[28] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[29] Emmanouil Antonios Platanios. Agreement-based learning. arXiv preprint arXiv:1806.01258, 2018.

[30] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In NIPS, 2016.

[31] J. Tang, M. Qu, M. Wang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding.
In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.

[32] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In ICLR, 2017.

[33] K. K. Thekumparampil, C. Wang, S. Oh, and L.-J. Li. Attention-based graph neural network for semi-supervised learning. arXiv e-prints, March 2018.

[34] Philippe Thomas. Semi-supervised learning by Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (review). IEEE Transactions on Neural Networks, 20:542, 2009.

[35] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJXMpikCZ.

[36] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning, pages 1168–1175, 2008.

[37] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012.

[38] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Planetoid GitHub repository. https://github.com/kimiyoung/planetoid, 2016. Accessed: 2018-02-08.

[39] Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

[40] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph convolutional neural networks for web-scale recommender systems. In KDD, 2018.

[41] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency.
In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pages 321–328, Cambridge, MA, USA, 2003. MIT Press. URL http://dl.acm.org/citation.cfm?id=2981345.2981386.

[42] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018. URL http://arxiv.org/abs/1812.08434.

[43] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.