{"title": "Reconstructing Patterns of Information Diffusion from Incomplete Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 792, "page_last": 800, "abstract": "Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet.", "full_text": "Reconstructing Patterns of Information Diffusion\nfrom Incomplete Observations \u2217\n\nJon Kleinberg\nDepartment of Computer Science\nCornell University\nIthaca, NY 14853\n\nFlavio Chierichetti\nDepartment of Computer Science\nCornell University\nIthaca, NY 14853\n\nDavid Liben-Nowell\nDepartment of Computer Science\nCarleton College\nNorth\ufb01eld, MN 55057\n\nAbstract\n\nMotivated by the spread of on-line information in general and on-line petitions in\nparticular, recent research has raised the following combinatorial estimation prob-\nlem. 
There is a tree T that we cannot observe directly (representing the structure\nalong which the information has spread), and certain nodes randomly decide to\nmake their copy of the information public. In the case of a petition, the list of\nnames on each public copy of the petition also reveals a path leading back to the\nroot of the tree. What can we conclude about the properties of the tree we observe\nfrom these revealed paths, and can we use the structure of the observed tree to\nestimate the size of the full unobserved tree T?\nHere we provide the \ufb01rst algorithm for this size estimation task, together with\nprovable guarantees on its performance. We also establish structural properties of\nthe observed tree, providing the \ufb01rst rigorous explanation for some of the unusual\nstructural phenomena present in the spread of real chain-letter petitions on the\nInternet.\n\n1 Introduction\n\nThe on-line domain is a rich environment for observing social contagion \u2014 the tendency of new\ninformation, ideas, and behaviors to spread from person to person through a social network [1, 4, 6,\n10, 12, 14, 17, 19]. When a link, invitation, petition, or other on-line item passes between people\nin the network, it is natural to model its spread using a tree structure: each person has the ability to\npass the item to one or more others who haven\u2019t yet received it, producing a set of \u201coffspring\u201d in this\ntree. Recent work has considered such tree structures in the context of on-line conversations [13],\nchain letters [5, 9, 16], on-line product recommendations [11, 15], and other forms of forwarded\ne-mail [18]. These types of trees encode enormous detail about the process by which information\nspreads, but it has been a major methodological challenge to infer properties of their structure from\nthe incomplete pictures of them that on-line data provides. 
Speci\ufb01cally, how do we reconstruct the\npaths followed by an on-line item, using our incomplete observations, and how do we estimate from\nthese observations the total number of people who encountered the item?\nA fundamental type of social contagion is one in which the item, by its very nature, accumulates\ninformation about the paths it follows as it travels through the social network. A canonical example\nof such a self-recording item is an on-line petition that spreads virally by e-mail \u2014 in other words,\na chain-letter petition [9, 16]. Each recipient who wants to take part in the petition signs his or\nher name to it and forwards copies of it to friends. In this way, each copy of the petition contains\na growing list of names that corresponds to a path all the way back to the initiator of the petition.\nSuch types of petitions are a central ingredient in broader forms of Internet-based activism, a topic\nof considerable current interest [7, 8]. In the remainder of this discussion, we will refer to the\nitem being spread as a \u201cpetition,\u201d although more generally we are considering any item with this\nself-recording structure.\n\n\u2217A full version of this paper is available from the authors\u2019 Web pages.\n\nReconstructing the Spread of Social Contagion. Liben-Nowell and Kleinberg studied the fol-\nlowing framework for reconstructing the spread of chain-letter petitions [16]. Empirical analyses\nof large-scale petitions suggest that the spreading pattern can be reasonably modeled as a tree T;\nalthough there are a small number of deviations, almost all participants sign a copy of the petition\nexactly once (even if they receive it multiple times), and so we can view the person from whom they\nreceived this copy as their parent in T. The originator of the petition is the root of T.\nFor a given petition, the tree T is the structure we wish to understand, because it captures how the\nmessage spreads through the population. 
But in general we cannot hope to observe T: assuming\nthat the petition spreads through individuals\u2019 e-mail accounts, hosted by multiple providers, there\nis no single organization that has all the information needed to reconstruct T.1 Instead, we must\nobtain information about T indirectly by a revelation mechanism that we can model as follows. For\neach person v who signs the petition, there is a small probability \u03b4 > 0 that v will also publicly post\ntheir copy of it. In this case, we say that the node v is exposed. When v is exposed, we see not only\nthat v belongs to T, but we also see v\u2019s path all the way back to the root r of T, due to the list of\nnames on v\u2019s copy of the petition. Thus, if the set of people who post their copy of the petition is\n{v1, v2, ..., vs}, then the subtree T\u2032 of T that we are able to observe consists precisely of the union\nof the r-to-vi paths in T (for i = 1, 2, ..., s).2\nWe refer to this process as the \u03b4-sampling of a tree T: each node v is exposed independently with\nprobability \u03b4, and then all nodes on any path from the root to an exposed node (including the exposed\nnodes themselves) are revealed. This results in an observed tree, consisting of all revealed nodes,\ngiven by a random variable T\u03b4 drawn from the set of all possible subtrees of T. Understanding the\nrelationship between T\u03b4 and T is a fundamental question, since empirically we are often in a setting\nwhere we can observe T\u03b4 and want to reason about properties of the larger unobserved tree T.\n\nProperties of \u03b4-Sampling: Some Basic Questions. This is the basic issue we address in this pa-\nper: to understand the observation of a tree under \u03b4-sampling. 
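The \u03b4-sampling process defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name delta_sample, the parent-map representation of T, and the toy tree are all assumptions introduced here.

```python
import random

def delta_sample(parent, delta, rng):
    # parent maps each node to its parent; the root maps to None.
    # Each node is exposed independently with probability delta.
    exposed = {v for v in parent if rng.random() < delta}
    revealed = set()
    for v in exposed:
        # Walk up from each exposed node, revealing its path to the root.
        # Stopping at an already revealed node is safe because all of its
        # ancestors were revealed when it was added.
        while v is not None and v not in revealed:
            revealed.add(v)
            v = parent[v]
    return exposed, revealed

# Hypothetical toy tree: root 0 with children 1 and 2; node 1 has child 3.
toy_parent = {0: None, 1: 0, 2: 0, 3: 1}
exposed, revealed = delta_sample(toy_parent, 0.5, random.Random(0))
```

The returned revealed set is the union of the root-to-exposed-node paths, so it always contains the exposed set and is closed under taking parents.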
In Liben-Nowell and Kleinberg\u2019s\nwork, they looked at large trees revealed via the public posting of chain-letter petitions on the Inter-\nnet \u2014 the real-life process that is mathematically abstracted by \u03b4-sampling \u2014 and they identi\ufb01ed\nsome unexpected and recurring empirical properties in the observed trees. In particular, the ob-\nserved trees that they reconstructed had a very large single-child fraction \u2014 the fraction of nodes\nwith only one child was above 94%. The resulting trees had a narrow, \u201cstringy\u201d appearance, owing\nto long chains of these single-child nodes; this led naturally to the question of why the patterns\nof chain-letter diffusion were giving rise to such structures. Possible answers were hypothesized in\nsubsequent work. In particular, Golub and Jackson proposed an explanation based on computer sim-\nulation [9]; they studied a model for generating trees T using a Galton\u2013Watson branching process\n[3], and they showed that for branching processes near the critical value for extinction, \u03b4-sampling\nwith small values of \u03b4 produced large single-child fractions in simulations.\nThis line of work has left open a number of questions, of which two principal questions are the fol-\nlowing. First, can we provide a formal connection between \u03b4-sampling and the single-child fraction,\nand can we characterize the types of trees on which this connection holds (whether generated by\nbranching processes or otherwise)? 
Second, existing work on this topic has so far not provided any\nframework capable of addressing what is perhaps the most basic question about \u03b4-sampling: given\na tree T\u03b4 with its set of exposed nodes indicated \u2014 i.e., a single outcome of the \u03b4-sampling process\n\u2014 can we infer the number of nodes in the original tree T? (Note that we must do this inference\nwithout knowing the value of \u03b4.) This second question is a central issue in the sense that one gener-\nally asks, given partial observations of diffusion-based activism, for an estimate of the total number\nof people who were involved.\n\n1Some petitions are hosted by a single Web site, rather than relying on social contagion; however, our focus\nhere is on those that spread via person-to-person communication.\n\n2In practice, there is a separate algorithmic question inherent in constructing this union in the presence of\nnoise that makes different copies of the lists slightly typographically different from each other [5, 16], but this\nnoise-correction process can be treated as a \u201cblack box\u201d for our purposes here.\n\nOur Results: Single-Child Fractions and Size Estimation. In this paper, we provide answers to\nboth of these questions. First, we prove that \u03b4-sampling with small \u03b4 produces a large single-child\nfraction in all bounded-degree trees. 
We do not require any assumption that the unobserved tree T\narises from a branching process; the tree may be arbitrary as long as the degrees are bounded.\nMore precisely, we show that for every natural number k, there is a function f_k(x) for which\nlim_{x\u21920+} f_k(x) = 0, such that if T has a maximum of k children at any node, then T\u03b4 has a single-\nchild fraction of at least 1 \u2212 f_k(\u03b4) with high probability.3 This result shows how the long stringy\nstructures observed by Liben-Nowell and Kleinberg are a robust property of the process by which\nthese structures were observed, essentially independently of what we assume (beyond a degree\nbound) for the structure of the unobserved tree.\nSecond, we consider the problem of estimating the size of T, which we de\ufb01ne as the number of\nnodes it contains. In the basic formulation of the problem, we ask: given a single draw of T\u03b4, with\nits set of exposed nodes indicated, can we estimate the size of T to within a 1 \u00b1 \u03b5 factor with high\nprobability for any constant \u03b5 > 0? Here we show that this is possible for any bounded-degree tree,\nas well as for trees of unbounded degree that satisfy certain structural conditions.\nFollowing our analysis of the estimation problem, we also consider the closely connected issue of\nconcentration, which is related to estimation but distinct from it. Speci\ufb01cally, we ask: is it the case\nthat the size of T\u03b4 (a numerical random variable derived from T\u03b4 itself) is concentrated near its mean?\nFor suf\ufb01ciently small \u03b4 the answer is no, and we give a bound on the threshold for \u03b4, tight to within\nan exponentially smaller term, at which concentration begins to hold. 
We note that concentration is\na fundamentally different issue from estimation, in the sense that to be able to perform estimation,\nit is not suf\ufb01cient that the size of T\u03b4 be concentrated as a random variable.4\nUsing our methodology, we provide the \ufb01rst estimate for the reach of the Iraq-War protest chain\nletter studied by Liben-Nowell and Kleinberg: while the tree structure and rate of posting are at\nthe limit of the parameters that can be handled, our framework estimates that their observed tree of\n18,119 signers is a subtree of a larger unobserved tree with approximately 173,000 signers, which in\ntotal generated roughly 3.5 million copies of the e-mailed petition when both signers and non-signers\nare considered.\n\nOur Results: Extensions of the Basic Model. Finally, we prove results for several extensions to\nour model. First, while we have focused on the case in which there is a \ufb01xed underlying tree T which\nis then sampled using randomization, we can also de\ufb01ne a model in which both T and the sampling\nare the result of randomization \u2014 in particular, we consider a case in which T is \ufb01rst generated\nfrom a critical Galton\u2013Watson process [3], and then \u03b4-sampling is applied to the generated tree T .\nFor this model, we show that as long as the offspring distribution of the Galton\u2013Watson process has\n\ufb01nite variance and unit expectation we can estimate the size of the unobserved tree T . Note that this\nallows for unbounded degrees \u2014 i.e., an offspring distribution with unbounded support \u2014 provided\nthat the variance of this distribution is bounded.\nA further extension relaxes the assumption that when a node v makes its copy of the petition public,\nthe path is revealed all the way back to the root. 
Instead of the full path being visible, one can\nalternately consider a situation in which only the previous \u2113 names on the petition are preserved, and\nhence the observed tree can only be reconstructed if it is possible to piece these snippets of length \u2113\ntogether.5 Here we can show that size estimation is possible provided that \u2113 is at least \u03b4^{\u22121} times a\nlogarithmic factor, and this bound is asymptotically tight. Thus, our estimation methods work even\nwhen the data provides much less than a full path back to the root for each node made public. Due\nto space limitations, we defer details of this extension to the full version of the paper.\n\n3Note that for simple reasons we need the given conditions on both k and \u03b4. Indeed, if we don\u2019t bound\nthe maximum number of children at any node, then the star graph \u2014 a single node with n \u2212 1 children \u2014\nhas a single-child fraction of 0 with high probability for any non-trivial \u03b4. And if we don\u2019t consider the case\nof \u03b4 \u2192 0, then each node with multiple children has a constant probability of having several of them made\npublic and hence the single-child fraction can\u2019t converge to 1, unless the original tree was composed almost\nexclusively of single-child nodes to begin with.\n\n4For example, if T\u03b4 is simply a star with s leaves, this observed tree is consistent with \u03b4-sampling applied\nto any n-leaf star, for any value of n \u2265 s and with \u03b4 = s/n.\n\n2 Single-Child Fraction\n\nWe begin by showing that in any bounded-degree tree, the fraction of single-child nodes converges\nto 1 as the sampling rate \u03b4 goes to 0. The plan for this proof is as follows. First of all, let the unob-\nserved tree T have n nodes, each having at most k children. Let us say that v is a branching node if\nit has more than one child. 
(That is, we partition the nodes of the tree into three disjoint categories:\nleaves, single-child nodes, and branching nodes.) In any bounded-degree tree, the number of leaves\nand the number of branching nodes are within a constant factor of each other; in particular, this will\nbe true in the revealed tree T\u03b4.\nNow, all leaves in T\u03b4 are nodes that are exposed (i.e., made public) by the \u03b4-sampling process, so\nin expectation T\u03b4 has at most \u03b4n leaves (and we can bound the probability that the actual number\nexceeds this by more than a small factor). Thus, there will also be O(\u03b4n) branching nodes, and all\nother nodes in T\u03b4 must be single-child nodes.\nThus, the key to the argument is Theorem 2.1, which asserts that with high probability, T\u03b4 has\n\u2126(\u03b4n log_k \u03b4^{\u22121}) nodes in total. Since there are only O(\u03b4n) leaves and branching nodes, the remain-\ning nodes must be single-child nodes \u2014 and since the size of T\u03b4 exceeds O(\u03b4n) by a factor of\n\u2126(log_k \u03b4^{\u22121}), the fraction of single-child nodes in T\u03b4 must therefore converge to 1 as \u03b4 goes to 0.\nComplete proofs of all the results in this paper are given in the full version; due to space limitations,\nwe are not able to include them here. Where space permits, we will brie\ufb02y summarize some of the\nproofs in the present version. For Theorem 2.1, the key is to show that in any bounded-degree tree\nT, we can identify \u2126(\u03b4n) many disjoint sub-trees T1, T2, T3, ..., each of size \u0398(\u03b4^{\u22121}). We then\nargue that in a constant fraction of these trees Ti, a node of Ti at distance at least \u2126(log_k \u03b4^{\u22121}) from\nTi\u2019s root will be exposed, which will result in the appearance of \u2126(log_k \u03b4^{\u22121}) nodes in Ti.\nTheorem 2.1. Let T be a rooted n-node tree, and suppose that no node in T has more than k \u2265 2\nchildren.6\nLet \u03b4 \u2264 k^{\u2212\u03b1}, for any constant \u03b1 > 2. 
Let T\u03b4 be the random subtree of T revealed by the \u03b4-sampling\nprocess, and let X\u03b4 be the number of its internal nodes. Then\n\nPr[X\u03b4 \u2265 \u2126(\u03b4n log_k \u03b4^{\u22121})] \u2265 1 \u2212 e^{\u2212\u2126(\u03b4n)}.\n\nWe now follow the plan outlined at the beginning of this section, using this theorem to conclude\nthat the fraction of single-child nodes converges to 1. Theorem 2.1 provided the main step; from\nhere, we simply argue, in Theorem 2.2, that T\u03b4 will have at most O(\u03b4n) branching nodes with high\nprobability.\nTheorem 2.2. Given a tree T on n nodes, a sampling rate \u03b4, and a number M \u2264 n, let p be the\nprobability that the size of the tree T\u03b4 revealed by the \u03b4-sampling process is at most M. Let m be\nthe number of nodes in T\u03b4 and m1 be the number of single-child nodes in T\u03b4. Then,\n\nPr[m1 \u2265 (1 \u2212 O(\u03b4n/M)) \u00b7 m] \u2265 1 \u2212 e^{\u2212\u2126(\u03b4n)} \u2212 p.\n\n5This version of the problem also arises naturally if we assume that individuals are not explicitly signing a\npetition, but that each forwarded message includes copies of the previous messages to a depth of \u2113.\n\n6If, in T, each node has at most one child, then T is a path \u2014 in which case, an easy argument shows that\nalmost every node will be revealed, and that necessarily only one of the revealed nodes will not have one child.\nStill, this case is covered by the theorem: just choose k = 2.\n\nNow, using Theorems 2.1 and 2.2, we obtain the main result about single-child nodes as the follow-\ning corollary.\nCorollary 2.3. Let T be a rooted n-node tree, and suppose that no node in T has more than k \u2265 2\nchildren.\nLet \u03b4 \u2264 k^{\u2212\u03b1}, for any constant \u03b1 > 1. Let T\u03b4 be the random subtree of T revealed by the \u03b4-sampling\nprocess. 
Let m and m1 be, respectively, the number of nodes, and the number of nodes with exactly\none child, in T\u03b4. Then\n\nPr[m1 \u2265 (1 \u2212 O(1 / log_k \u03b4^{\u22121})) \u00b7 m] \u2265 1 \u2212 e^{\u2212\u2126(\u03b4n)}.\n\nFor concreteness, observe that if we choose \u03b4 = k^{\u2212\u2126(1/\u03b5)} in Corollary 2.3, we obtain that the fraction\nof single-child nodes in the revealed tree is at least 1 \u2212 O(\u03b5) with probability 1 \u2212 exp(\u2212\u2126(\u03b4n)).\n\n3 Estimation\n\nAs before, given an unknown tree T, let T\u03b4 be the tree revealed by the \u03b4-sampling process. In this\nsection, we focus on the problem of size estimation: we present an algorithm which can be used as\nan unbiased estimator \u03b4\u0302 for \u03b4, and then we estimate the size n of the full unobserved tree.\nLet V = V(T\u03b4) be the set of nodes of T\u03b4, let L \u2286 V be the set of its leaves, and let E \u2286 V be the\nset of its nodes that were exposed. (Observe that L \u2286 E.) For the unbiased estimator \u03b4\u0302, we consider\nthe set of all nodes \u201cabove\u201d the leaves of T\u03b4 \u2014 that is, internal nodes on a path from a leaf of T\u03b4 to\nthe root \u2014 and we use the empirical fraction of exposures in this set as our value for \u03b4\u0302.\nAfter establishing that \u03b4\u0302 is an unbiased estimator, we show that the probability of a large deviation\nbetween \u03b4\u0302 and \u03b4 decreases exponentially in |V \u2212 L|, the number of internal nodes of T\u03b4. 
Thus,\nto show a high-probability bound for our size estimate, we need to establish a lower bound on the\nnumber of internal nodes of T\u03b4, which will be the \ufb01nal step in the analysis.\nWe begin by describing an algorithm to produce the estimator \u03b4\u0302, and a corresponding estimator n\u0302\nfor the size of T.\n\n\u2022 If |V| = 0 then return \u03b4\u0302 = 0; and if |V| = 1 then return \u03b4\u0302 = 1.\n\u2022 Otherwise return \u03b4\u0302 = (|E| \u2212 |L|) / (|V| \u2212 |L|). If |E| > |L|, also return n\u0302 = ((|V| \u2212 |L|) / (|E| \u2212 |L|)) \u00b7 |E|.\n\nObserve that the algorithm is well-de\ufb01ned since, if |V| \u2265 2, then V will contain T\u03b4\u2019s root, which\nwill not be contained in L, and therefore |V| \u2212 |L| \u2265 1. For the following analysis of the algorithm\nobserve that, since L \u2286 E and L \u2286 V, we have |E| \u2212 |L| = |E \u2212 L| and |V| \u2212 |L| = |V \u2212 L|.\nWe begin by showing that \u03b4\u0302 is an unbiased estimator for \u03b4. Following the plan outlined above, we\nconsider the independent exposures of all nodes that lie above the leaves of T\u03b4, resulting in the set\nE \u2212 L \u2286 V \u2212 L. Because exposure decisions are made independently at each node, Chernoff bounds\nprovide us with a concentration result.\nLemma 3.1. \u03b4\u0302 is an unbiased estimator for \u03b4. Furthermore, if |V| \u2265 2,\n\nPr[|\u03b4\u0302 \u2212 \u03b4| \u2265 \u03b5 \u00b7 \u03b4] \u2264 2e^{\u2212(1/3) \u03b5^2 \u03b4 |V \u2212 L|}.\n\nWe now transfer our bound on |\u03b4\u0302 \u2212 \u03b4| to a bound on |n\u0302 \u2212 n|. 
For this, it suf\ufb01ces to combine three\nrelationships among these quantities: (i) n\u0302 = |E|/\u03b4\u0302 by de\ufb01nition; (ii) |\u03b4\u0302 \u2212 \u03b4| \u2264 \u03b5 \u00b7 \u03b4 with high\nprobability, by Lemma 3.1; and (iii) ||E| \u2212 \u03b4n| \u2264 \u03b5\u03b4n with high probability via Chernoff bounds,\nsince the exposure decisions consist of n independent coin \ufb02ips each of probability \u03b4. Putting these\ntogether, we have the following corollary of Lemma 3.1.\nCorollary 3.2. If |V| \u2265 2, then the size n of the unknown tree T satis\ufb01es\n\nPr[|n \u2212 n\u0302| \u2264 \u03b5n] \u2265 1 \u2212 e^{\u2212\u0398(\u03b5^2 \u03b4 |V \u2212 L|)}.\n\n3.1 Trees with Sublinear Maximum Degree\n\nOur bounds thus far show that n\u0302 is close to n except with a probability that decreases exponentially in the\nnumber of internal nodes |V \u2212 L|. We now investigate cases under which we can replace this upper\nbound on the failure probability by a more powerful one that decreases exponentially in a function that\ndepends directly on n.\nTo do this, we require a theorem that guarantees that the number of internal nodes is at least an\nexplicit function of n; this function can then be used in place of |V \u2212 L| in the probability bounds.\nOur main result for this purpose is the following; in many respects, the bound it establishes is less\nre\ufb01ned than the bound from Theorem 2.1, but it is useful for obtaining a guarantee for the estimation\nprocedure. The crux of the proof is to show that if a node v has k_v children, and \u03b4k_v \u2264 1, then the\nprobability that v is revealed is at least a constant times the expected number of the exposed children\nof v: that is, \u2126(\u03b4k_v); if, instead, \u03b4k_v > 1, then the probability that v is revealed is \u2126(1). The\nresult then follows from a bound on the number of nodes of degree greater than \u03b4^{\u22121}, linearity of\nexpectation, and Chernoff bounds.\nTheorem 3.3. 
Let T be a rooted n-node tree, and suppose that no node in T has more than k \u2265 1\nchildren.\nLet T\u03b4 be the random subtree of T revealed by the \u03b4-sampling process. Then, the number X\u03b4 of\ninternal nodes of T\u03b4 satis\ufb01es\n\nPr[X\u03b4 \u2265 ((1 \u2212 e^{\u22121}) / 2) \u00b7 min(k^{\u22121}, \u03b4) \u00b7 (n \u2212 1)] \u2265 1 \u2212 e^{\u2212\u0398(n min(k^{\u22121}, \u03b4))}.\n\nUsing this theorem, we can directly replace the bound from Corollary 3.2 with one that is an explicit\nfunction of n. Speci\ufb01cally, the next result follows directly from Corollary 3.2 and Theorem 3.3.\nCorollary 3.4. Let T be a rooted n-node tree, and suppose that no node in T has more than k \u2265 1\nchildren. Then, the event\n\n(1 \u2212 \u03b5)n \u2264 n\u0302 \u2264 (1 + \u03b5)n\n\nhappens with probability at least 1 \u2212 e^{\u2212\u0398(\u03b5^2 \u03b4 min(\u03b4, k^{\u22121}) n)}.\n\nThe smallest \u03b4 that Corollary 3.4 can tolerate is roughly n^{\u22121/2}: if \u03b4 \u2265 \u2126(\u221a(ln \u03b7^{\u22121} / (\u03b5^2 n))), and no node\nin the unknown tree has more than \u03b4^{\u22121} children (observe that \u03b4^{\u22121} \u2272 \u221an), then with probability\n1 \u2212 \u03b7, the n\u0302 returned by the estimator is within a multiplicative 1 \u00b1 \u03b5 factor of the actual n.\n\n3.2 Trees Arising from Branching Processes\n\nWe observe that Corollary 3.2 can also be used, just as in Section 3.1, for critical branching processes\n\u2014 those whose offspring distributions have unit expectation. (We also require \ufb01nite variance.) 
The\nmain fact we require about such branching processes is that the height of a uniformly chosen node\nfrom a branching process tree (with offspring distribution having \ufb01nite variance, unit expectation,\nand conditioned on being of size n) is at least \u2126(n^{1/2\u2212\u03b5}) with high probability [2].\nNow, since |V \u2212 L| is at least the length of the path joining a uniformly chosen node to the root, if\nwe choose \u03b4 \u2265 \u2126(n^{\u22121/2+2\u03b5}) it holds that |V \u2212 L| \u2265 \u03c9(\u03b4^{\u22121}), and Corollary 3.2 can be applied to\nobtain a concentration result for n\u0302.\n\n4 Concentration\n\nAs we observed in previous sections, the size of T\u03b4 plays a prominent role both in determining the\nfraction of single-child nodes and in estimating the size of the unknown tree T.\nIn this section we prove some concentration results on the quantity |T\u03b4| \u2014 that is, we will bound the\nprobability that |T\u03b4| is far from its mean, over random outcomes of the \u03b4-sampling process applied\nto the underlying tree T.\nTo begin with, the mean E[|T\u03b4|] depends not just on |T| but also on the structure of T. However,\nit has a simple formulation in terms of this structure, as shown by the following claim, which is a\ndirect application of linearity of expectation.\nObservation 4.1. Let T be a rooted tree, and let T\u03b4 be the random subtree of T revealed by the\n\u03b4-sampling process. Then, if |Tv| denotes the size of the subtree of T rooted at v,\n\nE[|T\u03b4|] = \u2211_{v\u2208T} (1 \u2212 (1 \u2212 \u03b4)^{|Tv|}).\n\nOur main result on concentration gives a value of \u03b4 above which |T\u03b4| has a high probability of being\nnear its mean. The proof requires an intricate balancing of two kinds of nodes \u2014 those \u201chigh\u201d\nin T, with many descendants, and those \u201clow\u201d in T, with few descendants. 
If there are many low\nnodes, then since their probabilities of being revealed behave relatively independently, we have\nconcentration; if there are only a few low nodes, then we have concentration simply from the fact that\nmost of the high nodes will be revealed in almost all outcomes of the \u03b4-sampling process.\nTheorem 4.2. Let T be a rooted tree on n nodes, with height at most H. Let T\u03b4 be the random\nsubtree of T revealed by the \u03b4-sampling process. Let m be the size of T\u03b4.\nThen for any \u03b5, \u03b7 bounded above by some constant, and for any\n\n\u03b4 \u2265 \u2126(min((ln^2 n/\u03b7) / \u221a(n\u03b5^3\u03b7), (H ln^3 n/\u03b7) / (n\u03b5^3\u03b7))),\n\nit holds that Pr[|m \u2212 E[m]| \u2264 \u03b5E[m]] \u2265 1 \u2212 \u03b7.\n\nNote that the theorem requires a lower bound on the value of \u03b4, and we now show why this bound\nis necessary. In particular, we observe how the theorem does not hold if \u03b4 = o(n^{\u22121/2}). To do this,\nlet T be a tree whose root is connected directly to n \u2212 1 \u2212 \u2308\u03b4^{\u22121}\u2309 leaves and also to a path of length\n\u2308\u03b4^{\u22121}\u2309. 
Then T\u03b4 will not contain any node in the path with probability\n\n(1 \u2212 \u03b4)^{\u2308\u03b4^{\u22121}\u2309} \u2192 e^{\u22121} as \u03b4 \u2192 0.\n\nIf T\u03b4 does not contain any node in the path, then it will only contain nodes adjacent to the root.\nSince there are \u0398(n) of these nodes, it follows from Chernoff bounds that T\u03b4 will contain at most\nU = O(\u03b4n) many nodes.\nOn the other hand, the probability that exactly one node in the path is exposed is\n\n\u2308\u03b4^{\u22121}\u2309 \u00b7 \u03b4 \u00b7 (1 \u2212 \u03b4)^{\u2308\u03b4^{\u22121}\u2309 \u2212 1} \u2192 e^{\u22121} as \u03b4 \u2192 0.\n\nUnder this conditioning, the single exposed node is uniformly distributed over the path\nitself, so with probability 1/2 it lies in the lower half of the path, causing the upper half of the\npath to be revealed. Hence with constant probability at least L = \u2126(\u03b4^{\u22121}) nodes will be revealed.\nWe have shown that with constant probability the size of T\u03b4 will be at most U, and with constant\nprobability the size of T\u03b4 will be at least L. If \u03b4 = o(n^{\u22121/2}), we have L/U = \u2126(\u03b4^{\u22122}n^{\u22121}) =\n\u03c9(n)/n = \u03c9(1), from which it follows that the number of nodes of T\u03b4 is not concentrated when\n\u03b4 = o(n^{\u22121/2}).\n\n5 The Iraq-War Petition\n\nUsing the framework developed in the previous sections, we now turn to the anti-war petition studied\nby Liben-Nowell and Kleinberg. The petition, which protested the impending US-led invasion of\nIraq, spread widely via e-mail in 2002\u20132003. The Iraq-War tree observed by Liben-Nowell and\nKleinberg \u2014 after they did some mild preprocessing to clean the data \u2014 was deep and narrow,\nand contained the characteristic \u201cstringy\u201d pattern analyzed in Section 2, with over 94% of nodes\nhaving exactly one child. 
The observed Iraq-War tree contained |V| = 18,119 nodes and |E| = 620\nexposed nodes, of which |L| = 557 were exposed leaves.\nUsing this information, we can apply the algorithm from Section 3: we estimate the posting proba-\nbility as \u03b4\u0302 = (620 \u2212 557)/(18119 \u2212 557) \u2248 0.00359, and we estimate the size of the unobserved\nIraq-War tree to be n\u0302 = |E|/\u03b4\u0302 \u2248 172,832.38 signatories.\nWe can also apply the results of Section 3 to analyze the error in our estimate n\u0302. For this purpose,\nwe pose the question concretely as follows: if the observed Iraq-War tree arose via \u03b4-sampling from\nan arbitrary unobserved tree T of size n, what is the probability of the event that the estimate n\u0302\nproduced by our algorithm lies in the interval [n/2, 2n]? (Recall that our estimation algorithm is\ndeterministic; the probability here is taken over the random choices of nodes exposed by the \u03b4-\nsampling process applied to the arbitrary \ufb01xed tree T.) We use a careful analysis (tight to constants) to\nshow that the estimate n\u0302 is quite tight, as indicated by the following theorem.\n\nTheorem 5.1. For any tree T of size n, assuming the observed Iraq-War tree was produced via\n\u03b4-sampling of T, the probability that n\u0302 lies in the interval [n/2, 2n] is at least 95%.\n\nIn addition to the number of signers of the petition, it is also of interest to determine the total number\nof e-mail messages generated by the spread of the petition. For this purpose, we \ufb01rst need to estimate\nthe distribution of the number of recipients of an e-mailed copy of the petition.\nTo estimate this distribution, we collected a dataset of 147 copies of e-mail petitions with intact\ne-mail headers. In addition to data from the Iraq-War petition, these 147 copies include two other\nwidely circulated petitions, supporting National Public Radio (NPR) and Mothers Against Drunk\nDriving (MADD). 
For each of these 147 e-mails, we counted the number of e-mail addresses to\nwhich the message was sent, including both direct and CCed recipients. (E-mails that were sent to\nmailing lists instead of to a list of individuals were not included in the set of 147.) The mean number\nof addressees was 20.37 (with standard deviation 20.60), the median was 14, and the maximum\nwas 141. In addition to using the length of recipient lists to check the conditions needed for our\ntheoretical results, we can also use these numbers to estimate the total reach of the Iraq-War petition.\nA person who signs the petition forwards the petition, on average, to 20.37 other addressees. Thus,\nby linearity of expectation, we can estimate that the \u2248 172,832.38 signers in the unobserved tree\nsent a total of \u2248 3,520,595.58 chain-letter e-mails in the Iraq-War petition.\nFinally, the \u03b4-sampling process is a very simple abstraction of the process by which a widely circulated\nmessage becomes public, and with further inspection of the Iraq-War tree observed by\nLiben-Nowell and Kleinberg, we can begin to identify potential limitations of the basic \u03b4-sampling model.\nPrincipally, we have been assuming that each individual signatory of the petition exposes her\npetition copy independently with probability \u03b4. However, the assumption of independence of nodes\u2019\nexposure events, while useful as an analytical abstraction, appears to be too simple to capture all\nthe properties we see in the exposure events for the real data. One of the most common mechanisms\nthat exposes a petition e-mail is when that e-mail is sent to a mailing list that archives its messages\non the Web. 
When one person exposes her petition copy by sending it to a mailing list, then her\nfriends are more likely to expose their petition copies by sending to the same list again, because they\nare more likely to be members of that same list (because of homophily) or because they \u201creply to\nall\u201d (including the list) with their petition copy. We can quantify this independence issue explicitly\nby noting that many of the exposed internal nodes in the observed Iraq-War tree are close to the\nleaves of the tree. In particular, 48 of the 63 exposed internal nodes are within 10 hops of a leaf, out\nof only 5351 total such nodes. Thus the exposure rate for internal nodes within distance 10 of a leaf\nis 48/5351 \u2248 0.00897, while the exposure rate for internal nodes more than distance 10 from any\nleaf is 15/12211 \u2248 0.00123.\n\n6 Conclusion\n\nWhen information spreads through a social network, it often does so along a branching structure\nthat can be reasonably modeled as a tree; but when we observe this spreading process, we frequently\nsee only a portion of the full tree. In this work, we have developed techniques that allow us to\nreason about the full tree along which the information spreads from the portion that is observed; as\na consequence, we are able to propose estimates for the size of a network cascade from a sample of\nit, and to deduce certain structural properties of the tree that it produces.\nWhen we apply these techniques to data such as the Iraq-War petition in Section 5, our conclusions\nmust clearly be interpreted in light of the model\u2019s underlying assumptions. 
Among these assumptions,\nthe requirement of bounded degree may generally be fairly mild, since it essentially requires\nthe tree of interest simply to be large enough compared to the number of children at any one node.\nArguably more restrictive is the assumption that each node makes an independent decision about\nposting its copy of the information, and with the same fixed probability \u03b4. It is an interesting direction\nfor further work to consider how one might perform comparable analyses with a relaxed\nversion of these underlying assumptions, as well as the extent to which estimations of the type we\nhave pursued here are robust in the face of different variations on the assumptions.\n\nAcknowledgements. Supported in part by the MacArthur Foundation, a Google Research Grant,\na Yahoo! Research Alliance Grant, and NSF grants IIS-0910664, CCF-0910940, and IIS-1016099.\n\nReferences\n\n[1] E. Adar, L. Zhang, L. A. Adamic, and R. M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.\n\n[2] D. Aldous. The continuum random tree II: An overview. In M. T. Barlow and N. H. Bingham, editors, Stochastic Analysis, pages 23\u201370. Cambridge University Press, 1991.\n\n[3] K. B. Athreya and P. E. Ney. Branching Processes. Dover, 2004.\n\n[4] E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In Proc. 10th ACM Conference on Electronic Commerce, pages 325\u2013334, 2009.\n\n[5] C. H. Bennett, M. Li, and B. Ma. Chain letters and evolutionary histories. Scientific American, 288(6):76\u201379, June 2003.\n\n[6] M. Cha, A. Mislove, and P. K. Gummadi. A measurement-driven analysis of information propagation in the flickr social network. In Proc. 18th International World Wide Web Conference, pages 721\u2013730, 2009.\n\n[7] J. Earl. The dynamics of protest-related diffusion on the web. 
Information, Communication, and Society, 13(26):209\u2013225, 2010.\n\n[8] R. K. Garrett. Protest in an information society: A review of literature on social movements and new ICTs. Information, Communication, and Society, 9(2):202\u2013224, 2006.\n\n[9] B. Golub and M. O. Jackson. Using selection bias to explain the observed structure of internet diffusions. Proc. Natl. Acad. Sci. USA, 107(24):10833\u201310836, 15 June 2010.\n\n[10] D. Gruhl, R. V. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In Proc. 13th International World Wide Web Conference, 2004.\n\n[11] J. L. Iribarren and E. Moro. Impact of human activity patterns on the dynamics of information diffusion. Physical Review Letters, 103(3), July 2009.\n\n[12] J. Kleinberg. Cascading behavior in networks: Algorithmic and economic issues. In N. Nisan, T. Roughgarden, \u00c9. Tardos, and V. Vazirani, editors, Algorithmic Game Theory, pages 613\u2013632. Cambridge University Press, 2007.\n\n[13] R. Kumar, M. Mahdian, and M. McGlohon. Dynamics of conversations. In Proc. 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 553\u2013562, 2010.\n\n[14] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), May 2007.\n\n[15] J. Leskovec, A. Singh, and J. M. Kleinberg. Patterns of influence in a recommendation network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 380\u2013389, 2006.\n\n[16] D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using Internet chain-letter data. Proc. Natl. Acad. Sci. USA, 105(12):4633\u20134638, Mar. 2008.\n\n[17] E. Sun, I. Rosenn, C. Marlow, and T. M. Lento. Gesundheit! Modeling contagion through Facebook News Feed. In Proc. 3rd International Conference on Weblogs and Social Media, 2009.\n\n[18] D. Wang, Z. Wen, H. Tong, C.-Y. Lin, C. Song, and A.-L. Barab\u00e1si. 
Information spreading in context. In Proc. 20th International World Wide Web Conference, pages 735\u2013744, 2011.\n\n[19] F. Wu, B. A. Huberman, L. A. Adamic, and J. R. Tyler. Information flow in social groups. Physica A, 337(1-2):327\u2013335, 2004.\n", "award": [], "sourceid": 528, "authors": [{"given_name": "Flavio", "family_name": "Chierichetti", "institution": null}, {"given_name": "David", "family_name": "Liben-nowell", "institution": null}, {"given_name": "Jon", "family_name": "Kleinberg", "institution": null}]}