Learning Syntactic Patterns for Automatic Hypernym Discovery
Advances in Neural Information Processing Systems, pp. 1297–1304

Rion Snow, Computer Science Department, Stanford University, Stanford, CA 94305 (rion@cs.stanford.edu)
Daniel Jurafsky, Linguistics Department, Stanford University, Stanford, CA 94305 (jurafsky@stanford.edu)
Andrew Y. Ng, Computer Science Department, Stanford University, Stanford, CA 94305 (ang@cs.stanford.edu)

Abstract

Semantic taxonomies such as WordNet provide a rich source of knowledge for natural language processing applications, but are expensive to build, maintain, and extend. Motivated by the problem of automatically constructing and extending such taxonomies, in this paper we present a new algorithm for automatically learning hypernym (is-a) relations from text. Our method generalizes earlier work that had relied on using small numbers of hand-crafted regular expression patterns to identify hypernym pairs. Using "dependency path" features extracted from parse trees, we introduce a general-purpose formalization and generalization of these patterns. Given a training set of text containing known hypernym pairs, our algorithm automatically extracts useful dependency paths and applies them to new corpora to identify novel pairs.
On our evaluation task (determining whether two nouns in a news article participate in a hypernym relationship), our automatically extracted database of hypernyms attains both higher precision and higher recall than WordNet.

1 Introduction

Semantic taxonomies and thesauri such as WordNet [5] are a key source of knowledge for natural language processing applications, and provide structured information about semantic relations between words. Building such taxonomies, however, is an extremely slow and labor-intensive process. Further, semantic taxonomies are invariably limited in scope and domain, and the high cost of extending or customizing them for an application has often limited their usefulness. Consequently, there has been significant recent interest in finding methods for automatically learning taxonomic relations and constructing semantic hierarchies [1, 2, 3, 4, 6, 8, 9, 13, 15, 17, 18, 19, 20, 21].

In this paper, we build an automatic classifier for the hypernym/hyponym relation. A noun X is a hyponym of a noun Y if X is a subtype or instance of Y. Thus "Shakespeare" is a hyponym of "author" (and conversely "author" is a hypernym of "Shakespeare"), "dog" is a hyponym of "canine", "desk" is a hyponym of "furniture", and so on.

Much of the previous work on automatic semantic classification of words has been based on a key insight first articulated by Hearst [8], that the presence of certain "lexico-syntactic patterns" can indicate a particular semantic relationship between two nouns. Hearst noticed that, for example, linking two noun phrases (NPs) via the constructions "Such NP_Y as NP_X", or "NP_X and other NP_Y", often implies that NP_X is a hyponym of NP_Y, i.e., that NP_X is a kind of NP_Y.
Since then, several researchers have used a small number (typically less than ten) of hand-crafted patterns like these to try to automatically label such semantic relations [1, 2, 6, 13, 17, 18]. While these patterns have been successful at identifying some examples of relationships like hypernymy, this method of lexicon construction is tedious and severely limited by the small number of patterns typically employed.

Our goal is to use machine learning to automatically replace this hand-built knowledge. We first use examples of known hypernym pairs to automatically identify large numbers of useful lexico-syntactic patterns, and then combine these patterns using a supervised learning algorithm to obtain a high accuracy hypernym classifier. More precisely, our approach is as follows:

1. Training:
   (a) Collect noun pairs from corpora, identifying examples of hypernym pairs (pairs of nouns in a hypernym/hyponym relation) using WordNet.
   (b) For each noun pair, collect sentences in which both nouns occur.
   (c) Parse the sentences, and automatically extract patterns from the parse tree.
   (d) Train a hypernym classifier based on these features.
2. Test:
   (a) Given a pair of nouns in the test set, extract features and use the classifier to determine if the noun pair is in the hypernym/hyponym relation or not.

The rest of the paper is structured as follows. Section 2 introduces our method for automatically discovering patterns indicative of hypernymy. Section 3 then describes the setup of our experiments. In Section 4 we analyze our feature space, and in Section 5 we describe a classifier using these features that achieves high accuracy on the task of hypernym identification.
Section 6 shows how this classifier can be improved by adding a new source of knowledge, coordinate terms.

2 Representing lexico-syntactic patterns with dependency paths

The first goal of our work is to automatically identify lexico-syntactic patterns indicative of hypernymy. In order to do this, we need a representation space for expressing these patterns. We propose the use of dependency paths as a general-purpose formalization of the space of lexico-syntactic patterns. Dependency paths have been used successfully in the past to represent lexico-syntactic relations suitable for semantic processing [11].

A dependency parser produces a dependency tree that represents the syntactic relations between words by a list of edge tuples of the form (word1, CATEGORY1:RELATION:CATEGORY2, word2). In this formulation each word is the stemmed form of the word or multi-word phrase (so that "authors" becomes "author"), and corresponds to a specific node in the dependency tree; each category is the part of speech label of the corresponding word (e.g., N for noun or PREP for preposition); and the relation is the directed syntactic relationship exhibited between word1 and word2 (e.g., OBJ for object, MOD for modifier, or CONJ for conjunction), and corresponds to a specific link in the tree. We may then define our space of lexico-syntactic patterns to be all shortest paths of four links or less between any two nouns in a dependency tree.
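The path-extraction step just described can be sketched as a breadth-first search over the dependency edge tuples. This is a minimal illustration, not the paper's implementation: the toy edge list below is hand-written (loosely modeled on the example fragment discussed next), MINIPAR itself is not invoked, and the function name `shortest_path` is our own.

```python
from collections import deque

# Toy dependency edges as (word1, "CAT1:REL:CAT2", word2) tuples, loosely
# modeled on the fragment "...such authors as Herrick and Shakespeare".
# Illustrative only; real edges would come from a dependency parser.
EDGES = [
    ("authors", "N:mod:Prep", "as"),
    ("as", "Prep:pcomp-n:N", "Herrick"),
    ("authors", "N:pre:PreDet", "such"),
    ("Herrick", "N:punc:U", "and"),
    ("Herrick", "N:conj:N", "Shakespeare"),
]

def shortest_path(edges, src, dst, max_links=4):
    """Breadth-first search for the shortest path of at most `max_links`
    dependency links between two words; returns the list of traversed
    edge labels, or None if no such path exists. Edges are traversable
    in both directions; a '-' prefix marks a link followed against its
    stated direction."""
    adj = {}
    for w1, rel, w2 in edges:
        adj.setdefault(w1, []).append((rel, w2))
        adj.setdefault(w2, []).append(("-" + rel, w1))
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        word, path = queue.popleft()
        if word == dst:
            return path
        if len(path) == max_links:
            continue
        for rel, nxt in adj.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return None

path = shortest_path(EDGES, "authors", "Herrick")
```

Removing the two nouns from such a path (as described below) then yields the general pattern used as a feature.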
Figure 1 shows the partial dependency tree for the sentence fragment "...such authors as Herrick and Shakespeare" generated by the broad-coverage dependency parser MINIPAR [10].

Figure 1: MINIPAR dependency tree example with transform ("authors" links to "as" via -N:mod:Prep and to "such" via -N:pre:PreDet; "as" links to "Herrick" and "Shakespeare" via -Prep:pcomp-n:N; "Herrick" links to "and" via -N:punc:U and to "Shakespeare" via -N:conj:N).

  NP_X and other NP_Y:      (and,U:PUNC:N), -N:CONJ:N, (other,A:MOD:N)
  NP_X or other NP_Y:       (or,U:PUNC:N), -N:CONJ:N, (other,A:MOD:N)
  NP_Y such as NP_X:        N:PCOMP-N:PREP, such as, such as, PREP:MOD:N
  Such NP_Y as NP_X:        N:PCOMP-N:PREP, as, as, PREP:MOD:N, (such,PREDET:PRE:N)
  NP_Y including NP_X:      N:OBJ:V, include, include, V:I:C, dummy node, dummy node, C:REL:N
  NP_Y, especially NP_X:    -N:APPO:N, (especially,A:APPO-MOD:N)
Table 1: Dependency path representations of Hearst's patterns

We then remove the original nouns in the noun pair to create a more general pattern. Each dependency path may then be presented as an ordered list of dependency tuples. We extend this basic MINIPAR representation in two ways: first, we wish to capture the fact that certain function words like "such" (in "such NP as NP") or "other" (in "NP and other NP") are important parts of lexico-syntactic patterns. We implement this by adding optional "satellite links" to each shortest path, i.e., single links not already contained in the dependency path, added on either side of each noun.
Second, we capitalize on the distributive nature of the syntactic conjunction relation (nouns linked by "and" or "or", or in comma-separated lists) by distributing dependency links across such conjunctions. For example, in the simple 2-member conjunction chain of "Herrick" and "Shakespeare" in Figure 1, we add the entrance link "as, -PREP:PCOMP-N:N" to the single element "Shakespeare" (as a dotted line in the figure). Our extended dependency notation is able to capture the power of the hand-engineered patterns described in the literature. Table 1 shows the six patterns used in [1, 2, 8] and their corresponding dependency path formalizations.

3 Experimental paradigm

Our goal is to build a classifier which, when given an ordered pair of nouns, makes the binary decision of whether the nouns are related by hypernymy.

All of our experiments are based on a corpus of over 6 million newswire sentences.1 We first parsed each of the sentences in the corpus using MINIPAR. We extract every pair of nouns from each sentence. 752,311 of the resulting unique noun pairs were labeled as Known Hypernym or Known Non-Hypernym using WordNet.2 A noun pair (ni, nj) is labeled Known Hypernym if nj is an ancestor of the first sense of ni in the WordNet hypernym taxonomy, and if the only "frequently-used"3 sense of each noun is the first noun sense listed in WordNet. Note that nj is considered a hypernym of ni regardless of how much higher in the hierarchy it is with respect to ni. A noun pair may be assigned to the second set of Known Non-Hypernym pairs if both nouns are contained within WordNet, but neither noun is an ancestor of the other in the WordNet hypernym taxonomy for any senses of either noun.
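The labeling rule just described can be sketched as follows. This is a toy illustration, not the paper's setup: the hand-coded `HYPERNYMS` map stands in for WordNet's hypernym pointers (the actual work queries WordNet 2.0 and additionally filters by sense frequency), and the helper names are our own.

```python
# Toy hypernym taxonomy: each noun maps to its direct hypernym(s).
# A hand-written stand-in for WordNet's hypernym pointers (illustrative only).
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "desk": ["furniture"],
}

def ancestors(noun, taxonomy):
    """All hypernym ancestors of `noun`, at any distance up the taxonomy."""
    found, frontier = set(), list(taxonomy.get(noun, []))
    while frontier:
        h = frontier.pop()
        if h not in found:
            found.add(h)
            frontier.extend(taxonomy.get(h, []))
    return found

def label_pair(n_i, n_j, taxonomy):
    """Known Hypernym if n_j is an ancestor of n_i (at any distance);
    Known Non-Hypernym if both nouns are in the taxonomy but neither
    is an ancestor of the other; otherwise unlabeled (None)."""
    if n_j in ancestors(n_i, taxonomy):
        return "Known Hypernym"
    known = set(taxonomy) | {h for hs in taxonomy.values() for h in hs}
    if n_i in known and n_j in known and n_i not in ancestors(n_j, taxonomy):
        return "Known Non-Hypernym"
    return None
```

Note that, as in the text, the hypernym test is closed over any number of links, while Non-Hypernym requires that neither noun dominates the other.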
Of our collected noun pairs, 14,387 were Known Hypernym pairs, and we assign the 737,924 most frequently occurring Known Non-Hypernym pairs to the second set; this number is selected to preserve the roughly 1:50 ratio of hypernym-to-non-hypernym pairs observed in our hand-labeled test set (discussed below).

We evaluated our binary classifiers in two ways. For both sets of evaluations, our classifier was given a pair of nouns from an unseen sentence and had to make a hypernym vs. non-hypernym decision. In the first style of evaluation, we compared the performance of our classifiers against the Known Hypernym versus Known Non-Hypernym labels assigned by WordNet. This provides a metric for how well our classifiers do at "recreating" WordNet's judgments.

For the second set of evaluations we hand-labeled a test set of 5,387 noun pairs from randomly-selected paragraphs within our corpus (with part-of-speech labels assigned by MINIPAR).

Figure 2: Hypernym precision/recall for all features (precision/recall scatter plot).
Figure 3: Hypernym classifiers on the WordNet-labeled dev set (precision/recall curves for Logistic Regression (Buckets), Logistic Regression (Binary), Hearst Patterns, and the And/Or Other Pattern).

1 The corpus contains articles from the Associated Press, Wall Street Journal, and Los Angeles Times, drawn from the TIPSTER 1, 2, 3, and TREC 5 corpora [7]. Our most recent experiments (presented in Section 6) include articles from Wikipedia (a popular web encyclopedia), extracted with the help of Tero Karvinen's Tero-dump software.
2 We access WordNet 2.0 via Jason Rennie's WordNet::QueryData interface.
3 A noun sense is determined to be "frequently-used" if it occurs at least once in the sense-tagged Brown Corpus Semantic Concordance files (as reported in the cntlist file distributed as part of WordNet 2.0). This determination is made so as to reduce the number of false hypernym/hyponym classifications due to highly polysemous nouns (nouns which have multiple meanings).
The annotators were instructed to label each ordered noun pair as one of "hyponym-to-hypernym", "hypernym-to-hyponym", "coordinate", or "unrelated" (the coordinate relation will be defined in Section 6). As expected, the vast majority of pairs (5,122) were found to be unrelated by these measures; the rest were split evenly between hypernym and coordinate pairs (134 and 131, resp.).

Interannotator agreement was obtained between four labelers (all native speakers of English) on a set of 511 noun pairs, and determined for each task according to the averaged F-Score across all pairs of the four labelers. Agreement was 83% and 64% for the hypernym and coordinate term classification tasks, respectively.

4 Features: pattern discovery

Our first study focused on discovering which dependency paths might prove useful features for our classifiers. We created a feature lexicon of 69,592 dependency paths, consisting of every dependency path that occurred between at least five unique noun pairs in our corpus. To evaluate these features, we constructed a binary classifier for each pattern, which simply classifies a noun pair as hypernym/hyponym if and only if the specific pattern occurs at least once for that noun pair. Figure 2 depicts the precision and recall of all such classifiers (with recall at least .0015) on the WordNet-labeled data set.4 Using this formalism we have been able to capture a wide variety of repeatable patterns between hypernym/hyponym noun pairs; in particular, we have been able to rediscover the hand-designed patterns originally proposed in [8] (the first five features, marked in red),5 in addition to a number of new patterns not previously discussed (of which four are marked as blue triangles in Figure 2 and listed in Table 2).
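The per-pattern evaluation just described can be sketched as follows: each single-pattern classifier predicts "hypernym" iff its pattern occurred at least once for the noun pair, and is scored by precision and recall against the gold labels. The data below is invented for illustration (it is not from the paper's corpus), and the function name is our own.

```python
def pattern_precision_recall(pattern, pair_features, gold_hypernyms):
    """Evaluate the one-pattern classifier that predicts 'hypernym' iff
    `pattern` occurred at least once for the noun pair.
    `pair_features` maps noun pair -> {path: count}; `gold_hypernyms`
    is the set of true hypernym pairs. Returns (precision, recall)."""
    predicted = {p for p, feats in pair_features.items()
                 if feats.get(pattern, 0) > 0}
    tp = len(predicted & gold_hypernyms)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold_hypernyms) if gold_hypernyms else 0.0
    return precision, recall

# Invented counts for illustration only:
features = {
    ("Shakespeare", "author"): {"such-as": 2},
    ("dog", "canine"): {"and-other": 1},
    ("apple", "orange"): {"such-as": 1},
}
gold = {("Shakespeare", "author"), ("dog", "canine")}
p, r = pattern_precision_recall("such-as", features, gold)
```

Plotting (recall, precision) for every pattern in the feature lexicon yields a scatter like Figure 2.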
This analysis gives a quantitative justification to Hearst's initial intuition as to the power of hand-selected patterns; nearly all of Hearst's patterns are at the high-performance boundary of precision and recall for individual features.

  NP_Y like NP_X:               N:PCOMP-N:PREP, like, like, PREP:MOD:N
  NP_Y called NP_X:             N:DESC:V, call, call, V:VREL:N
  NP_X is a NP_Y:               N:S:VBE, be, be, -VBE:PRED:N
  NP_X, a NP_Y (appositive):    N:APPO:N
Table 2: Dependency path representations of other high-scoring patterns

4 Redundant features consisting of an identical base path to an identified pattern but differing only by an additional "satellite link" are marked in Figure 2 by smaller versions of the same symbol.
5 We mark the single generalized "conjunction other" pattern -N:CONJ:N, (other,A:MOD:N) to represent both of Hearst's original "and other" and "or other" patterns.

  Best Logistic Regression (Buckets):    0.3480
  Best Logistic Regression (Binary):     0.3200
  Best Multinomial Naive Bayes:          0.3175
  Best Complement Naive Bayes:           0.3024
  Hearst Patterns:                       0.1500
  "And/Or Other" Pattern:                0.1170
Table 3: Average maximum F-scores for cross validation on WordNet-labeled training set

5 A hypernym-only classifier

Our first hypernym classifier is based on the intuition that unseen noun pairs are more likely to be a hypernym pair if they occur in the test set with one or more lexico-syntactic patterns found to be indicative of hypernymy. We record in our noun pair lexicon each noun pair that occurs with at least five unique paths from our feature lexicon discussed in the previous section. We then create a feature count vector for each such noun pair.
Each entry of the 69,592-dimension vector represents a particular dependency path, and contains the total number of times that that path was the shortest path connecting that noun pair in some dependency tree in our corpus. We thus define as our task the binary classification of a noun pair as a hypernym pair based on its feature vector of dependency paths.

We use the WordNet-labeled Known Hypernym / Known Non-Hypernym training set defined in Section 3. We train a number of classifiers on this data set, including multinomial Naive Bayes, complement Naive Bayes [16], and logistic regression. We perform model selection using 10-fold cross validation on this training set, evaluating each model based on its maximum F-Score averaged across all folds. The summary of average maximum F-scores is presented in Table 3, and the precision/recall plot of our best models is presented in Figure 3. For comparison, we evaluate two simple classifiers based on past work using only a handful of hand-engineered features; the first simply detects the presence of at least one of Hearst's patterns, arguably the previous best classifier consisting only of lexico-syntactic patterns, and as implemented for hypernym discovery in [2]. The second classifier consists of only the "NP and/or other NP" subset of Hearst's patterns, as used in the automatic construction of a noun-labeled hypernym taxonomy in [1]. In our tests we found greatest performance from a binary logistic regression model with 14 redundant threshold buckets spaced at the exponentially increasing intervals {1, 2, 4, ..., 4096, 8192}; our resulting feature space consists of 974,288 distinct binary features.
These buckets are defined such that a feature corresponding to pattern p at threshold t will be activated by a noun pair n if and only if p has been observed to occur as a shortest dependency path between the nouns of n at least t times.

Our classifier shows a dramatic improvement over previous classifiers; in particular, using our best logistic regression classifier trained on newswire corpora, we observe a 132% relative improvement of average maximum F-score over the classifier based on Hearst's patterns.

6 Using coordinate terms to improve hypernym classification

While our hypernym-only classifier performed better than previous classifiers based on hand-built patterns, there is still much room for improvement. As [2] points out, one problem with pattern-based hypernym classifiers in general is that within-sentence hypernym pattern information is quite sparse. Patterns are useful only for classifying noun pairs which happen to occur in the same sentence; many hypernym/hyponym pairs may simply not occur in the same sentence in the corpus. For this reason [2], following [1], suggests relying on a second source of knowledge: "coordinate" relations between nouns. The WordNet glossary defines coordinate terms as "nouns or verbs that have the same hypernym". Here we treat the coordinate relation as a symmetric relation that exists between two nouns that share at least one common ancestor in the hypernym taxonomy, and are therefore "the same kind of thing" at some level. Many methods exist for inferring that two nouns are coordinate terms (a common subtask in automatic thesaurus induction).
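The coordinate relation as defined here (two nouns sharing at least one common ancestor) can be sketched over a toy taxonomy. This is an illustration only; the hand-coded `TOY` map and the helper names stand in for WordNet's hypernym structure.

```python
def hypernym_ancestors(noun, taxonomy):
    """All ancestors of `noun` in a toy hypernym map (noun -> direct hypernyms)."""
    found, frontier = set(), list(taxonomy.get(noun, []))
    while frontier:
        h = frontier.pop()
        if h not in found:
            found.add(h)
            frontier.extend(taxonomy.get(h, []))
    return found

def are_coordinate(n_i, n_j, taxonomy):
    """Symmetric coordinate test: true iff the two nouns share at least
    one common ancestor in the hypernym taxonomy."""
    return bool(hypernym_ancestors(n_i, taxonomy) &
                hypernym_ancestors(n_j, taxonomy))

# Illustrative toy taxonomy (not WordNet):
TOY = {"dog": ["canine"], "cat": ["feline"],
       "canine": ["carnivore"], "feline": ["carnivore"]}
```

"dog" and "cat" are coordinate here because both reach "carnivore"; a WordNet-derived classifier would additionally bound how far up the taxonomy the common ancestor may lie, as described below.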
We expect that using coordinate information will increase the recall of our hypernym classifier: if we are confident that two nouns ni, nj are coordinate terms, and that nj is a hyponym of nk, we may then infer with higher probability that ni is similarly a hyponym of nk, despite never having encountered the pair (ni, nk) within a single sentence.

  Interannotator Agreement:                       0.6405
  Distributional Similarity Vector Space Model:   0.3327
  Thresholded Conjunction Pattern Classifier:     0.2857
  Best WordNet Classifier:                        0.2630
Table 4: Summary of maximum F-scores on hand-labeled coordinate pairs

Figure 4: Coordinate term classifiers on the hand-labeled test set (precision/recall curves for Interannotator Agreement, Distributional Similarity, Conjunct Pattern, and WordNet).
Figure 5: Hypernym classifiers on the hand-labeled test set (precision/recall curves for Interannotator Agreement, TREC+Wikipedia, TREC Hybrid, TREC Hypernym-only, WordNet Classifiers, Hearst Patterns, and the And/Or Other Pattern).

6.1 Coordinate Term Classification

Prior work for identifying coordinate terms includes automatic word sense clustering methods based on distributional similarity (e.g., [12, 14]) or on pattern-based techniques, specifically using the coordination pattern "X, Y, and Z" (e.g., [2]). We construct both types of classifier. First we construct a vector-space model similar to [12] using single MINIPAR dependency links as our distributional features.6 We use the normalized similarity score from this model for coordinate term classification. We evaluate this classifier on our hand-labeled test set, where of 5,387 total pairs, 131 are labeled as "coordinate". For purposes of comparison we construct a series of classifiers from WordNet, which make the binary decision of determining whether two nouns are coordinate according to whether they share a common ancestor within k nouns higher up in the hypernym taxonomy, for all k from 1 to 6. Also, we compare a simple pattern-based classifier based on the conjunction pattern, which thresholds simply on the number of conjunction patterns found between a given pair. Results of this experiment are shown in Table 4 and Figure 4.

The strong performance of the simple conjunction pattern model suggests that it may be worth pursuing an extended pattern-based coordinate classifier along the lines of our hypernym classifier; for now, we proceed with our distributional similarity vector space model (with a 16% relative F-score improvement over the conjunction model) in the construction of a combined hypernym-coordinate hybrid classifier.

6.2 Hybrid hypernym-coordinate classification

We now combine our hypernym and coordinate models in order to improve hypernym classification.
We define two probabilities of pair relationships between nouns: P(ni <_H nj), representing the probability that noun ni has nj as an ancestor in its hypernym hierarchy, and P(ni ~_C nj), the probability that nouns ni and nj are coordinate terms, i.e., that they share a common hypernym ancestor at some level.

  Interannotator Agreement:                                        0.8318
  TREC+Wikipedia Hypernym-only Classifier (Logistic Regression):   0.3592
  TREC Hybrid Linear Interpolation Hypernym/Coordinate Model:      0.3268
  TREC Hypernym-only Classifier (Logistic Regression):             0.2714
  Best WordNet Classifier:                                         0.2339
  Hearst Patterns Classifier:                                      0.1417
  "And/Or Other" Pattern Classifier:                               0.1386
Table 5: Maximum F-Score of hypernym classifiers on hand-labeled test set

6 We use the same 6 million MINIPAR-parsed sentences used in our hypernym training set. Our feature lexicon consists of the 30,000 most frequent noun-connected dependency edges. We construct feature count vectors for each of the most frequently occurring 163,198 individual nouns. As in [12] we normalize these feature counts with pointwise mutual information, and compute as our measure of similarity the cosine coefficient between these normalized vectors.
Defining the probability produced by our best hypernym-only classifier as P_old(ni <_H nj), and a probability obtained by normalizing the similarity score from our coordinate classifier as P(ni ~_C nj), we apply a simple linear interpolation scheme to compute a new hypernymy probability. Specifically, for each pair of nouns (ni, nk), we recompute the probability that nk is a hypernym of ni as:7

  P_new(ni <_H nk) = λ1 P_old(ni <_H nk) + λ2 Σ_j P(ni ~_C nj) P_old(nj <_H nk)

7 Results

Our hand-labeled dataset allows us to compare our classifiers with WordNet and the previous feature-based methods, now using the human labels as ground truth. Figure 5 shows the performance of each method in a precision/recall plot. We evaluated several classifiers based on the WordNet hypernym taxonomy.8 The best WordNet-based results are plotted in Figure 5. Our logistic regression hypernym-only model trained on the newswire corpora has a 16% relative F-score improvement over the best WordNet classifier, while the combined hypernym/coordinate model has a 40% relative F-score improvement. Our best-performing classifier is a hypernym-only model additionally trained on the Wikipedia corpus, with an expanded feature lexicon of 200,000 dependency paths; this classifier shows a 54% improvement over WordNet.
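The interpolation scheme of Section 6.2 can be sketched directly from its formula. This is a minimal illustration: the probability tables below are invented, and the function name is our own.

```python
def hybrid_hypernym_prob(n_i, n_k, p_old, p_coord, nouns, lam1=0.7):
    """Linear interpolation of the hypernym-only probability with
    coordinate-mediated evidence:
      lam1 * P_old(n_i <_H n_k)
        + lam2 * sum_j P(n_i ~_C n_j) * P_old(n_j <_H n_k),
    with lam1 + lam2 = 1. `p_old` and `p_coord` map ordered noun
    pairs to probabilities; missing pairs default to 0."""
    lam2 = 1.0 - lam1
    direct = p_old.get((n_i, n_k), 0.0)
    mediated = sum(p_coord.get((n_i, n_j), 0.0) * p_old.get((n_j, n_k), 0.0)
                   for n_j in nouns)
    return lam1 * direct + lam2 * mediated

# Invented probabilities for illustration only:
p_old = {("Herrick", "author"): 0.1, ("Shakespeare", "author"): 0.9}
p_coord = {("Herrick", "Shakespeare"): 0.8}
p = hybrid_hypernym_prob("Herrick", "author", p_old, p_coord,
                         ["Herrick", "Shakespeare"])
```

Here the weak direct evidence for "Herrick / author" is boosted by the strong coordinate link to "Shakespeare", whose hypernymy to "author" is well attested.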
In Table 5 we list the maximum F-scores of each method. In Table 6 we analyze the disagreements between the highest F-score WordNet classifier and our combined hypernym/coordinate classifier.9

  Type of Noun Pair    Count   Example Pair
  NE: Person           7       "John F. Kennedy / president", "Marlin Fitzwater / spokesman"
  NE: Place            7       "Diamond Bar / city", "France / place"
  NE: Company          2       "American Can / company", "Simmons / company"
  NE: Other            1       "Is Elvis Alive / book"
  Not Named Entity     9       "earthquake / disaster", "soybean / crop"
Table 6: Analysis of improvements over WordNet

8 Conclusions

Our experiments demonstrate that automatic methods can be competitive with WordNet for the identification of hypernym pairs in newswire corpora. In future work we will use the presented method to automatically generate flexible, statistically-grounded hypernym taxonomies directly from corpora. These taxonomies will be made publicly available to complement existing semantic resources.

7 We constrain our parameters λ1, λ2 such that λ1 + λ2 = 1; we set these parameters using 10-fold cross-validation on our hand-labeled test set. For our final evaluation we use λ1 = 0.7.
8 We tried all combinations of the following parameters: the maximum number of senses of a hyponym for which to find hypernyms, the maximum distance between the hyponym and its hypernym in the WordNet taxonomy, and whether or not to allow synonyms. The WordNet model achieving the maximum F-score uses only the first sense of a hyponym and allows a maximum distance of 4 links between a hyponym and hypernym.
9 There are 31 such disagreements, with WordNet agreeing with the human labels on 5 and our hybrid model agreeing on the other 26. We additionally inspect the types of noun pairs where our model improves upon WordNet, and find that at least 30% of our model's improvements are not restricted to Named Entities; given that the distribution of Named Entities among the labeled hypernyms in our test set is over 60%, this gives us hope that our classifier will perform well at the task of hypernym induction even in more general, non-newswire domains.

Acknowledgments

We thank Kayur Patel, Mona Diab, Allison Buckley, and Todd Huffman for useful discussions and assistance annotating data. R. Snow is supported by an NDSEG Fellowship sponsored by the DOD and AFOSR. This work is also supported by the ARDA AQUAINT program, and by the Department of the Interior/DARPA under contract number NBCHD030010.

References

[1] Caraballo, S.A. (2001) Automatic Acquisition of a Hypernym-Labeled Noun Hierarchy from Text. Brown University Ph.D. Thesis.
[2] Cederberg, S. & Widdows, D. (2003) Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction. Proc. of CoNLL-2003, pp. 111–118.
[3] Ciaramita, M. & Johnson, M. (2003) Supersense Tagging of Unknown Nouns in WordNet. Proc. of EMNLP-2003.
[4] Ciaramita, M., Hofmann, T., & Johnson, M. (2003) Hierarchical Semantic Classification: Word Sense Disambiguation with World Knowledge. Proc. of IJCAI-2003.
[5] Fellbaum, C. (1998) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[6] Girju, R., Badulescu, A., & Moldovan, D. (2003) Learning Semantic Constraints for the Automatic Discovery of Part-Whole Relations. Proc. of HLT-2003.
[7] Harman, D. (1992) The DARPA TIPSTER project. ACM SIGIR Forum 26(2), Fall, pp. 26–28.
[8] Hearst, M. (1992) Automatic Acquisition of Hyponyms from Large Text Corpora. Proc. of the Fourteenth International Conference on Computational Linguistics, Nantes, France.
[9] Hearst, M. & Schütze, H.
(1993) Customizing a lexicon to better suit a computational task. In Proc. of the ACL SIGLEX Workshop on Acquisition of Lexical Knowledge from Text.
[10] Lin, D. (1998) Dependency-based Evaluation of MINIPAR. Workshop on the Evaluation of Parsing Systems, Granada, Spain.
[11] Lin, D. & Pantel, P. (2001) Discovery of Inference Rules for Question Answering. Natural Language Engineering, 7(4), pp. 343–360.
[12] Pantel, P. (2003) Clustering by Committee. Ph.D. Dissertation. Department of Computing Science, University of Alberta.
[13] Pantel, P. & Ravichandran, D. (2004) Automatically Labeling Semantic Classes. Proc. of NAACL-2004.
[14] Pereira, F., Tishby, N., & Lee, L. (1993) Distributional Clustering of English Words. Proc. of ACL-1993, pp. 183–190.
[15] Ravichandran, D. & Hovy, E. (2002) Learning Surface Text Patterns for a Question Answering System. Proc. of ACL-2002.
[16] Rennie, J., Shih, L., Teevan, J., & Karger, D. (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proc. of ICML-2003.
[17] Riloff, E. & Shepherd, J. (1997) A Corpus-Based Approach for Building Semantic Lexicons. Proc. of EMNLP-1997.
[18] Roark, B. & Charniak, E. (1998) Noun-phrase co-occurrence statistics for semi-automatic semantic lexicon construction. Proc. of ACL-1998, pp. 1110–1116.
[19] Tseng, H. (2003) Semantic classification of unknown words in Chinese. Proc. of ACL-2003.
[20] Turney, P.D., Littman, M.L., Bigham, J., & Shnayder, V. (2003) Combining independent modules to solve multiple-choice synonym and analogy problems. Proc. of RANLP-2003, pp. 482–489.
[21] Widdows, D. (2003) Unsupervised methods for developing taxonomies by combining syntactic and statistical information. Proc. of HLT/NAACL-2003, pp. 276–283.