{"title": "Identity Uncertainty and Citation Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 1425, "page_last": 1432, "abstract": null, "full_text": "Identity Uncertainty and Citation Matching\n\nHanna Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, Ilya Shpitser\n\npasula, marthi, milch, russell, ilyas@cs.berkeley.edu\n\nComputer Science Division, University Of California\n\n387 Soda Hall, Berkeley, CA 94720-1776\n\nAbstract\n\nIdentity uncertainty is a pervasive problem in real-world data analysis. It\narises whenever objects are not labeled with unique identi\ufb01ers or when\nthose identi\ufb01ers may not be perceived perfectly. In such cases, two ob-\nservations may or may not correspond to the same object. In this paper,\nwe consider the problem in the context of citation matching\u2014the prob-\nlem of deciding which citations correspond to the same publication. Our\napproach is based on the use of a relational probability model to de\ufb01ne\na generative model for the domain, including models of author and title\ncorruption and a probabilistic citation grammar. Identity uncertainty is\nhandled by extending standard models to incorporate probabilities over\nthe possible mappings between terms in the language and objects in the\ndomain. Inference is based on Markov chain Monte Carlo, augmented\nwith speci\ufb01c methods for generating ef\ufb01cient proposals when the domain\ncontains many objects. Results on several citation data sets show that\nthe method outperforms current algorithms for citation matching. The\ndeclarative, relational nature of the model also means that our algorithm\ncan determine object characteristics such as author names by combining\nmultiple citations of multiple papers.\n\n1 INTRODUCTION\n\nCitation matching is the problem currently handled by systems such as Citeseer [1]. 1 Such\nsystems process a large number of scienti\ufb01c publications to extract their citation lists. 
By grouping together all co-referring citations (and, if possible, linking to the actual cited paper), the system constructs a database of "paper" entities linked by the "cites(p1, p2)" relation. This is an example of the general problem of determining the existence of a set of objects, and their properties and relations, given a collection of "raw" perceptual data; this problem is faced by intelligence analysts and intelligent agents as well as by citation systems.
A key aspect of this problem is determining when two observations describe the same object; only then can evidence be combined to develop a more complete description of the object. Objects seldom carry unique identifiers around with them, so identity uncertainty is ubiquitous. For example, Figure 1 shows two citations that probably refer to the same paper, despite many superficial differences. Citations appear in many formats and are rife with errors of all kinds. As a result, Citeseer, which is specifically designed to overcome such problems, currently lists more than 100 distinct AI textbooks published by Russell and Norvig on or around 1995, from roughly 1000 citations.

1See citeseer.nj.nec.com. Citeseer is now known as ResearchIndex.

[Lashkari et al 94] Collaborative Interface Agents, Yezdi Lashkari, Max Metral, and Pattie Maes, Proceedings of the Twelfth National Conference on Artificial Intelligence, MIT Press, Cambridge, MA, 1994.

Metral M. Lashkari, Y. and P. Maes. Collaborative interface agents. In Conference of the American Association for Artificial Intelligence, Seattle, WA, August 1994.

Figure 1: Two citations that probably refer to the same paper.

Identity uncertainty has been studied independently in several fields. Record linkage [2] is a method for matching up the records in two files, as might be required when merging two databases.
For each pair of records, a comparison vector is computed that encodes the ways in which the records do and do not match up. EM is used to learn a naive-Bayes distribution over this vector for both matched and unmatched record pairs, so that the pairwise match probability can then be calculated using Bayes' rule. Linkage decisions are typically made in a greedy fashion based on closest match and/or a probability threshold, so the overall process is order-dependent and may be inconsistent. The model does not provide a principled way to combine matched records. A richer probability model is developed by Cohen et al. [3], who model the database as a combination of some "original" records that are correct and some number of erroneous versions. They give an efficient greedy algorithm for finding a single locally optimal assignment of records into groups.
Data association [4] is the problem of assigning new observations to existing trajectories when multiple objects are being tracked; it also arises in robot mapping when deciding if an observed landmark is the same as one previously mapped. While early data association systems used greedy methods similar to record linkage, recent systems have tried to find high-probability global solutions [5] or to approximate the true posterior over assignments [6]. The latter method has also been applied to the problem of stereo correspondence, in which a computer vision system must determine how to match up features observed in two or more cameras [7].
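The pairwise scoring step of record linkage described above can be sketched concretely. In this minimal illustration, the field names, the match prior, and the per-field agreement probabilities are invented for the example, not taken from [2]:

```python
# Sketch of naive-Bayes record-pair scoring (Fellegi-Sunter style).
# Each record pair is reduced to a binary comparison vector; Bayes' rule
# then gives the posterior probability that the pair is a true match.

def match_probability(vector, p_match, m_probs, u_probs):
    """P(match | comparison vector) under conditionally independent fields.

    vector  : dict field -> 1 (fields agree) or 0 (fields disagree)
    p_match : prior probability that a random pair is a true match
    m_probs : P(field agrees | pair is a match), per field
    u_probs : P(field agrees | pair is not a match), per field
    """
    like_m = p_match
    like_u = 1.0 - p_match
    for field, agree in vector.items():
        m, u = m_probs[field], u_probs[field]
        like_m *= m if agree else (1.0 - m)
        like_u *= u if agree else (1.0 - u)
    return like_m / (like_m + like_u)

# Example: surname and year agree, first name disagrees.
p = match_probability(
    {"surname": 1, "fname": 0, "year": 1},
    p_match=0.01,
    m_probs={"surname": 0.95, "fname": 0.8, "year": 0.9},
    u_probs={"surname": 0.05, "fname": 0.1, "year": 0.2},
)
```

Thresholding such pairwise posteriors, and then linking greedily, is exactly what makes the classical pipeline order-dependent, as noted above.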
Data association systems usually have simple observation models (e.g., Gaussian noise) and assume that observations at each time step are all distinct. More general patterns of identity occur in natural language text, where the problem of anaphora resolution involves determining whether phrases (especially pronouns) co-refer; some recent work [8] has used an early form of relational probability model, although with a somewhat counterintuitive semantics.
Citeseer is the best-known example of work on citation matching [1]. The system groups citations using a form of greedy agglomerative clustering based on a text similarity metric (see Section 6). McCallum et al. [9] use a similar technique, but also develop clustering algorithms designed to work well with large numbers of small clusters (see Section 5).
With the exception of [8], all of the preceding systems have used domain-specific algorithms and data structures; the probabilistic approaches are based on a fixed probability model. In previous work [10], we have suggested a declarative approach to identity uncertainty using a formal language, an extension of relational probability models [11]. Here, we describe the first substantial application of the approach. Section 2 explains how to specify a generative probability model of the domain. The key technical point (Section 3) is that the possible worlds include not only objects and relations but also mappings from terms in the language to objects in the domain, and the probability model must include a prior over such mappings. Once the extended model has been defined, Section 4 details the probability distributions used. A general-purpose inference method is applied to the model. We have found Markov chain Monte Carlo (MCMC) to be effective for this and other applications (see Section 5); here, we include a method for generating effective proposals based on ideas from [9].
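The mappings from terms to objects mentioned above can be pictured as a partition of the terms into co-referring equivalence classes. A minimal sketch, using a toy encoding of our own rather than any data structure from the paper:

```python
# Toy representation of an identity clustering: a list of sets, where each
# set is an equivalence class of co-referring terms.

def co_refer(clustering, t1, t2):
    """True iff terms t1 and t2 belong to the same equivalence class."""
    return any(t1 in cls and t2 in cls for cls in clustering)

def merge(clustering, t1, t2):
    """Return a new clustering in which t1's and t2's classes are merged."""
    c1 = next(c for c in clustering if t1 in c)
    c2 = next(c for c in clustering if t2 in c)
    if c1 is c2:
        return clustering
    rest = [c for c in clustering if c is not c1 and c is not c2]
    return rest + [c1 | c2]

iota = [{"P1"}, {"P2"}]          # P1 and P2 treated as distinct papers
iota2 = merge(iota, "P1", "P2")  # now the two citations co-refer
```

A prior over such clusterings is exactly the extra ingredient the extended model must supply.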
The system also incorporates an EM algorithm for learning the local probability models, such as the model of how author names are abbreviated, reordered, and misspelt in citations. Section 6 evaluates the system's performance on four datasets originally used to test the Citeseer algorithms [1]. As well as providing significantly better performance, our system is able to reason simultaneously about papers, authors, titles, and publication types, and does a good job of extracting this information from the grouped citations. For example, an author's name can be identified more accurately by combining information from multiple citations of several different papers. The errors made by our system point to some interesting unmodeled aspects of the citation process.

2 RPMs

Reasoning about identity requires reasoning about objects, which requires at least some of the expressive power of a first-order logical language. Our approach builds on relational probability models (RPMs) [11], which let us specify probability models over possible worlds defined by objects, properties, classes, and relations.

2.1 Basic RPMs

At its most basic, an RPM, as defined by Koller et al. [12], consists of

- A set C of classes denoting sets of objects, related by subclass/superclass relations.
- A set I of named instances denoting objects, each an instance of one class.
- A set A of complex attributes denoting functional relations. Each complex attribute A has a domain type Dom[A] ∈ C and a range type Range[A] ∈ C.
- A set B of simple attributes denoting functions.
Each simple attribute B has a domain type Dom[B] ∈ C and a range Val[B].

- A set of conditional probability models P(B | Pa[B]) for the simple attributes. Pa[B] is the set of B's parents, each of which is a nonempty chain of (appropriately typed) attributes σ = A1. ... .An.B', where B' is a simple attribute. Probability models may be attached to instances or inherited from classes. The parent links should be such that no cyclic dependencies are formed.
- A set of instance statements, which set the value of a complex attribute to an instance of the appropriate class.

We also use a slight variant of an additional concept from [11]: number uncertainty, which allows for multi-valued complex attributes of uncertain cardinality. We define each such attribute A as a relation rather than a function, and we associate with it a simple attribute #[A] (i.e., the number of values of A) with a domain type Dom[A] and a range {0, 1, ..., max #[A]}.

2.2 RPMs for citations

Figure 2 outlines an RPM for the example citations of Figure 1. There are four classes: the self-explanatory Author, Paper, and Citation, as well as AuthorAsCited, which represents not actual authors, but author names as they appear when cited. Each citation we wish to match leads to the creation of a Citation instance; instances of the remaining three classes are then added as needed to fill all the complex attributes. E.g., for the first citation of Figure 1, we would create a Citation instance C1, set its text attribute to the string "Metral M. ...August 1994.", and set its paper attribute to a newly created Paper instance, which we will call P1.
We would then introduce max(#[author]) (here only 3, for simplicity) AuthorAsCited instances (D11, D12, and D13) to fill the P1.obsAuthors (i.e., observed authors) attribute, and an equal number of Author instances (A11, A12, and A13) to fill both the P1.authors[i] and the D1i.author attributes. (The complex attributes would be set using instance statements, which would then also constrain the cited authors to be equal to the authors of the actual paper.2) Assuming (for now) that the value of C1.parse is observed, we can set the values of all the basic attributes of the Citation and AuthorAsCited instances. (E.g., given the correct parse, D11.surname would be set to Lashkari, and D12.fnames would be set to (Max).) The remaining basic attributes, those of the Paper and Author instances, represent the "true" attributes of those objects, and their values are unobserved.

2Thus, uncertainty over whether the authors are ordered correctly can be modeled using probabilistic instance statements.

[Figure: classes Author (surname, #(fnames), fnames), Paper (#(authors), authors, title, publication type), AuthorAsCited (surname, #(fnames), fnames, author), and Citation (#(obsAuthors), obsAuthors, obsTitle, parse, text, paper), with instances A11-A13, A21-A23, P1, P2, D11-D13, D21-D23, C1, and C2.]

Figure 2: An RPM for our Citeseer example. The large rectangles represent classes: the dark arrows indicate the ranges of their complex attributes, and the light arrows lay out all the probabilistic dependencies of their basic attributes. The small rectangles represent instances, linked to their classes with thick grey arrows. We omit the instance statements which set many of the complex attributes.

The standard semantics of RPMs includes the unique names assumption, which precludes identity uncertainty.
Under this assumption, any two papers are assumed to be different unless we know for a fact that they are the same. In other words, although there are many ways in which the terms of the language can map to the objects in a possible world, only one of these identity mappings is legal: the one with the fewest co-referring terms. It is then possible to express the RPM as an equivalent Bayesian network: each of the basic attributes of each of the objects becomes a node, with the appropriate parents and probability model. RPM inference usually involves the construction of such a network. The Bayesian network equivalent to our RPM is shown in Figure 3.

[Figure: a Bayesian network whose nodes are the basic attributes of the instances A11-A23, D11-D23, P1, P2, C1, and C2.]

Figure 3: The Bayesian network equivalent to our RPM, assuming C1 ≠ C2.

3 IDENTITY UNCERTAINTY

In our application, any two citations may or may not refer to the same paper. Thus, for citations C1 and C2, there is uncertainty as to whether the corresponding papers P1 and P2 are in fact the same object. If they are the same, they will share one set of basic attributes; if they are distinct, there will be two sets.
Thus, the possible worlds of our probability model may differ in the number of random variables, and there will be no single equivalent Bayesian network. The approach we have taken to this problem [10] is to extend the representation of a possible world so that it includes not only the basic attributes of a set of objects, but also the number of objects n and an identity clustering ι, that is, a mapping from terms in the language (such as P1) to objects in the world. We are interested only in whether terms co-refer or not, so ι can be represented by a set of equivalence classes of terms. For example, if P1 and P2 are the only terms, and they co-refer, then ι is {{P1, P2}}; if they do not co-refer, then ι is {{P1}, {P2}}.
We define a probability model for the space of extended possible worlds by specifying the prior P(n) and the conditional distribution P(ι | n). As in standard RPMs, we assume that the class of every instance is known. Hence, we can simplify these distributions further by factoring them by class, so that, e.g., P(ι) = Π_{C ∈ C} P(ι_C). We then distinguish two cases:

- For some classes (such as the citations themselves), the unique names assumption remains appropriate. Thus, we define P(ι_Citation) to assign a probability of 1.0 to the one assignment where each citation object is unique.
- For classes such as Paper and Author, whose elements are subject to identity uncertainty, we specify P(n) using a high-variance log-normal distribution.3 Then we make appropriate uniformity assumptions to construct P(ι_C). Specifically, we assume that each paper is a priori equally likely to be cited, and that each author is a priori equally likely to write a paper. Here, "a priori" means prior to obtaining any information about the object in question, so the uniformity assumption is entirely reasonable.
With these assumptions, the probability of an assignment ι_{C,k,m} that maps k named instances to m distinct objects, when C contains n objects, is given by

    P(ι_{C,k,m}) = (n! / (n - m)!) · (1 / n^k)

When n > m, the world contains objects unreferenced by any of the terms. However, these filler objects are obviously irrelevant (if they affected the attributes of some named term, they would have been named as functions of that term). Therefore, we never have to create them, or worry about their attribute values.
Our model assumes that the cardinalities and identity clusterings of the classes are independent of each other, as well as of the attribute values. We could remove these assumptions. For one, it would be straightforward to specify a class-wise dependency model for n or ι using standard Bayesian network semantics, where the network nodes correspond to the cardinality attributes of the classes. E.g., it would be reasonable to let the total number of papers depend on the total number of authors. Similarly, we could allow ι to depend on the attribute values (e.g., the frequency of citations to a given paper might depend on the fame of the authors) provided we did not introduce cyclic dependencies.

4 THE PROBABILITY MODEL

We will now fill in the details of the conditional probability models. Our priors over the "true" attributes are constructed off-line, using the following resources: the 1990 Census data on US names, a large A.I. BibTeX bibliography, and a hand-parsed collection of 500 citations. We learn several bigram models (actually, linear combinations of a bigram model and a unigram model): letter-based models of first names, surnames, and title words, as well as higher-level models of various parts of the citation string.
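A letter-level model of the interpolated bigram/unigram kind just described can be sketched as follows; the training names, the alphabet-size constant, and the interpolation weight are illustrative assumptions, not the paper's learned values:

```python
# Sketch of a linearly interpolated letter bigram/unigram model with
# add-one smoothing, of the kind usable for surnames or title words.
from collections import Counter

def train(names):
    """Collect unigram and bigram letter counts, with ^/$ boundary markers."""
    uni, bi = Counter(), Counter()
    for name in names:
        padded = "^" + name + "$"
        uni.update(padded)
        bi.update(zip(padded, padded[1:]))
    return uni, bi

def letter_prob(c, prev, uni, bi, lam=0.8):
    """P(c | prev) = lam * bigram + (1 - lam) * unigram, both smoothed."""
    v = 27  # crude alphabet-size constant for add-one smoothing
    p_bi = (bi[(prev, c)] + 1) / (uni[prev] + v)
    p_uni = (uni[c] + 1) / (sum(uni.values()) + v)
    return lam * p_bi + (1 - lam) * p_uni

uni, bi = train(["smith", "smyth", "smithson"])
p = letter_prob("m", "s", uni, bi)  # 'm' after 's' is common in this data
```

Multiplying such letter probabilities along a string yields the kind of name prior the system combines with the census file.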
More specifically, the values of Author.fnames and Author.surname are modeled as having a 0.9 chance of being drawn from the relevant US census file, and a 0.1 chance of being generated using a bigram model learned from that file. The prior over Paper.titles is defined using a two-tier bigram model constructed using the bibliography, while the distributions over Author.#(fnames), Paper.#(authors), and Paper.pubType 4 are derived from our hand-parsed file. The conditional distributions of the "observed" variables given their true values (i.e., the corruption models of Citation.obsTitle, AuthorAsCited.surname, and AuthorAsCited.fnames) are modeled as noisy channels in which each letter, or word, has a small probability of being deleted or changed, and there is also a small probability of insertion. AuthorAsCited.fnames may also be abbreviated as an initial. The parameters of the corruption models are learnt online, using stochastic EM.

3Other models are possible; for example, in situations where objects appear and disappear, P(ι) can be modeled implicitly by specifying the arrival, transition, and departure rates [6].

Let us now return to Citation.parse, which cannot be an observed variable, since citation parsing, or even citation subfield extraction, is an unsolved problem. It is therefore fortunate that our approach lets us handle uncertainty over parses so naturally. The state space of Citation.parse has two components. First, it keeps track of the citation style, defined as the ordering of the author and title subfields, as well as the format in which the author names are written. The prior over styles is learned using our hand-segmented file. Secondly, it keeps track of the segmentation of Citation.text, which is divided into an author segment, a title segment, and three filler segments (one before, one after, and one in between).
We assume a uniform distribution over segmentations. Citation.parse greatly constrains Citation.text: the title segment of Citation.text must match the value of Citation.obsTitle, while its author segment must match the combined values of the simple attributes of Citation.obsAuthors. The distributions over the remaining three segments of Citation.text are defined using bigram models, with the model used for the final segment chosen depending on the publication type. These models were, once more, learned using our pre-segmented file.

5 INFERENCE

With the introduction of identity uncertainty, our model grows from a single Bayesian network to a collection of networks, one for each possible value of ι. This collection can be rather large, since the number of ways in which a set can be partitioned grows very quickly with the size of the set.5 Exact inference is, therefore, impractical. We use an approximate method based on Markov chain Monte Carlo.

5.1 MARKOV CHAIN MONTE CARLO

MCMC [13] is a well-known method for approximating an expectation over some distribution π(x), commonly used when the state space of x is too large to sum over. The weighted sum over the values of x is replaced by a sum over samples from π(x), which are generated using a Markov chain constructed to have π(x) as its stationary distribution.
There are several ways of building up an appropriate Markov chain. In the Metropolis-Hastings method (M-H), transitions in the chain are constructed in two steps. First, a candidate next state x' is generated from the current state x, using the (more or less arbitrary) proposal distribution q(x' | x).
The probability that the move to x' is actually made is the acceptance probability, defined as

    α(x' | x) = min(1, (π(x') q(x | x')) / (π(x) q(x' | x))).

Such a Markov chain will have the right stationary distribution π(x) as long as q is defined in such a way that the chain is ergodic. It is even possible to factor q into separate proposals for various subsets of variables. In those situations, the variables that are not changed by the transition cancel in the ratio π(x')/π(x), so the required calculation can be quite simple.

4Publication types range over {article, conference paper, book, thesis, and tech report}.
5This sequence is described by the Bell numbers, whose asymptotic behaviour is more than exponential.

5.2 THE CITATION-MATCHING ALGORITHM

The state space of our MCMC algorithm is the space of all the possible worlds, where each possible world contains an identity clustering ι, a set of class cardinalities n, and the values of all the basic attributes of all the objects. Since ι is given in each world, the distribution over the attributes can be represented using a Bayesian network as described in Section 3. Therefore, the probability of a state is simply the product of P(n), P(ι), and the probability of the hidden attributes of the network.
Our algorithm uses a factored q function. One of our proposals attempts to change n using a simple random walk. The other suggests, first, a change to ι, and then values for all the hidden attributes of all the objects (or clusters in ι) affected by that change. The algorithm for proposing a change in ι_C works as follows:

    Select two clusters a1, a2 ∈ ι_C
    Create two empty clusters b1 and b2
    place each instance i ∈ a1 ∪ a2 u.a.r.
into b1 or b2
    Propose ι'_C = ι_C - {a1, a2} ∪ {b1, b2} 6

Given a proposed ι'_C, suggesting values for the hidden attributes boils down to recovering their true values from (possibly) corrupt observations, e.g., guessing the true surname of the author currently known both as "Simth" and "Smith". Since our title and name noise models are symmetric, our basic strategy is to apply these noise models to one of the observed values. In the case of surnames, we have the additional resource of a dictionary of common names, so, some of the time, we instead pick one of the set of dictionary entries that are within a few corruptions of our observed names. (One must, of course, be careful to account for this hybrid approach in the acceptance probability calculations.) Parses are handled differently: we preprocess each citation, organizing its plausible segmentations into a list ordered in terms of descending probability. At runtime, we simply sample from these discrete distributions. Since we assume that boundaries occur only at punctuation marks, and discard segmentations of probability < 10^-6, the lists are usually quite short.7 The publication type variables, meanwhile, are not sampled at all. Since their range is so small, we sum them out.

5.3 SCALING UP

One of the acknowledged flaws of the MCMC algorithm is that it often fails to scale. In this application, as the number of papers increases, the simplest approach, one where the two clusters a1 and a2 are picked u.a.r., is likely to lead to many rejected proposals, as most pairs of clusters will have little in common. The resulting Markov chain will mix slowly. Clearly, we would prefer to focus our proposals on those pairs of clusters which are actually likely to exchange their instances.
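The split-merge step of Section 5.2, in which two clusters are pooled and their instances reassigned uniformly at random, can be sketched as follows; this is a minimal version of our own, and the paper's implementation details may differ:

```python
# Minimal sketch of the cluster split-merge proposal over an identity
# clustering, represented as a list of sets of instance names.
import random

def propose_split_merge(clustering, rng=random):
    """Pick two clusters, pool their elements, and reassign each element
    uniformly at random to one of two new clusters.

    If the same cluster is picked twice, this tends to split it; if two
    different clusters are picked, elements may be exchanged or merged.
    """
    a1, a2 = rng.choice(clustering), rng.choice(clustering)
    pool = list(a1 | a2)
    b1, b2 = set(), set()
    for item in pool:
        (b1 if rng.random() < 0.5 else b2).add(item)
    rest = [c for c in clustering if c is not a1 and c is not a2]
    return rest + [b for b in (b1, b2) if b]  # drop empty clusters

random.seed(0)
before = [{"c1", "c2"}, {"c3"}, {"c4", "c5"}]
after = propose_split_merge(before)
```

Note that the move never creates or destroys instances, only regroups them, so the M-H acceptance ratio reduces to terms local to the affected clusters.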
We have implemented an approach based on the efficient clustering algorithm of McCallum et al. [9], where a cheap distance metric is used to preprocess a large dataset and fragment it into many canopies: smaller, overlapping sets of elements that have a non-zero probability of matching. We do the same, using word-matching as our metric, and setting the thresholds to 0.5 and 0.2. Then, at runtime, our q(x' | x) function proposes first a canopy c, and then a pair of clusters u.a.r. from c. (q(x | x') is calculated by summing over all the canopies which contain any of the elements of the two clusters.)

6 EXPERIMENTAL RESULTS

We have applied the MCMC-based algorithm to the hand-matched datasets used in [1]. (Each of these datasets contains several hundred citations of machine learning papers, about half of them in clusters ranging in size from two to twenty-one citations.)

6Note that if the same cluster is picked twice, it will probably be split.
7It would also be possible to sample directly from a model such as a hierarchical HMM.

                     Face             Reinforcement    Reasoning        Constraint
                     349 citations,   406 citations,   514 citations,   295 citations,
                     242 papers       148 papers       296 papers       199 papers
    Phrase matching  94%              79%              86%              89%
    RPM + MCMC       97%              94%              96%              93%

Table 1: Results on four Citeseer data sets, for the text matching and MCMC algorithms. The metric used is the percentage of actual citation clusters recovered perfectly; for the MCMC-based algorithm, this is an average over all the MCMC-generated samples.

We have also implemented their phrase matching algorithm, a greedy agglomerative clustering method based on a metric that measures the degree to which the words and phrases of any two citations overlap.
(They obtain their "phrases" by segmenting each citation at all punctuation marks, and then taking all the bigrams of all the segments longer than two words.) The results of our comparison are displayed in Table 1, in terms of the Citeseer error metric. Clearly, the algorithm we have developed easily beats our implementation of phrase matching.
We have also applied our algorithm to a large set of citations referring to the textbook Artificial Intelligence: A Modern Approach. It clusters most of them correctly, but there are a couple of notable exceptions. Whenever several citations share the same set of unlikely errors, they are placed together in a separate cluster. This occurs because we do not currently model the fact that erroneous citations are often copied from reference list to reference list; this could be handled by extending the model to include a copiedFrom attribute. Another possible extension would be the addition of a topic attribute to both papers and authors: tracking the authors' research topics might enable the system to distinguish between similarly-named authors working in different fields. Generally speaking, we expect that relational probabilistic languages with identity uncertainty will be a useful tool for creating knowledge from raw data.

References

[1] S. Lawrence, K. Bollacker, and C. Lee Giles. Autonomous citation matching. In Agents, 1999.
[2] I. Fellegi and A. Sunter. A theory for record linkage. JASA, 1969.
[3] W. Cohen, H. Kautz, and D. McAllester. Hardening soft information sources. In KDD, 2000.
[4] Y. Bar-Shalom and T. E. Fortman. Tracking and Data Association. Academic Press, 1988.
[5] I. J. Cox and S. Hingorani. An efficient implementation and evaluation of Reid's multiple hypothesis tracking algorithm for visual tracking. In IAPR-94, 1994.
[6] H. Pasula, S. Russell, M. Ostland, and Y. Ritov. Tracking many objects with many sensors. In
IJCAI-99, 1999.
[7] F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun. Feature correspondence: A Markov chain Monte Carlo approach. In NIPS-00, 2000.
[8] E. Charniak and R. P. Goldman. A Bayesian model of plan recognition. In AAAI, 1993.
[9] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD-00, 2000.
[10] H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In IJCAI-01, 2001.
[11] A. Pfeffer. Probabilistic Reasoning for Complex Systems. PhD thesis, Stanford, 2000.
[12] A. Pfeffer and D. Koller. Semantics and inference for recursive probability models. In AAAI/IAAI, 2000.
[13] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.
", "award": [], "sourceid": 2149, "authors": [{"given_name": "Hanna", "family_name": "Pasula", "institution": null}, {"given_name": "Bhaskara", "family_name": "Marthi", "institution": null}, {"given_name": "Brian", "family_name": "Milch", "institution": null}, {"given_name": "Stuart", "family_name": "Russell", "institution": null}, {"given_name": "Ilya", "family_name": "Shpitser", "institution": null}]}