{"title": "Subsequence Kernels for Relation Extraction", "book": "Advances in Neural Information Processing Systems", "page_first": 171, "page_last": 178, "abstract": null, "full_text": "Subsequence Kernels for Relation Extraction\n\nRazvan C. Bunescu Department of Computer Sciences University of Texas at Austin 1 University Station C0500 Austin, TX 78712 razvan@cs.utexas.edu\n\nRaymond J. Mooney Department of Computer Sciences University of Texas at Austin 1 University Station C0500 Austin, TX 78712 mooney@cs.utexas.edu\n\nAbstract\nWe present a new kernel method for extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels. This kernel uses three types of subsequence patterns that are typically employed in natural language to assert relationships between two entities. Experiments on extracting protein interactions from biomedical corpora and top-level relations from newspaper corpora demonstrate the advantages of this approach.\n\n1\n\nIntroduction\n\nInformation Extraction (IE) is an important task in natural language processing, with many practical applications. It involves the analysis of text documents, with the aim of identifying particular types of entities and relations among them. Reliably extracting relations between entities in natural-language documents is still a difficult, unsolved problem. Its inherent difficulty is compounded by the emergence of new application domains, with new types of narrative that challenge systems developed for other, well-studied domains. Traditionally, IE systems have been trained to recognize names of people, organizations, locations and relations between them (MUC [1], ACE [2]). For example, in the sentence \"protesters seized several pumping stations\", the task is to identify a LOCATED-AT relationship between protesters (a PERSON entity) and stations (a LOCATION entity). 
Recently, substantial resources have been allocated for automatically extracting information from biomedical corpora, and consequently much effort is currently spent on automatically identifying biologically relevant entities, as well as on extracting useful biological relationships such as protein interactions or subcellular localizations. For example, the sentence \"TR6 specifically binds Fas ligand\" asserts an interaction relationship between the two proteins TR6 and Fas ligand. As in the case of the more traditional applications of IE, systems based on manually developed extraction rules [3, 4] were soon superseded by information extractors learned through training on supervised corpora [5, 6]. One challenge posed by the biological domain is that current systems for doing part-of-speech (POS) tagging or parsing do not perform as well on the biomedical narrative as on the newspaper corpora on which they were originally trained. Consequently, IE systems developed for biological corpora need to be robust to POS or parsing errors, or to give reasonable performance using shallower but more reliable information, such as chunking instead of parsing. Motivated by the task of extracting protein-protein interactions from biomedical corpora, we present a generalization of the subsequence kernel from [7] that works with sequences containing combinations of words and word classes. This generalized kernel is further tailored for the task of relation extraction. Experimental results show that the new relation kernel outperforms two previous rule-based methods for interaction extraction. With a small modification, the same kernel is used for extracting top-level relations from ACE corpora, providing better results than a recent approach based on dependency tree kernels.\n\n2\n\nBackground\n\nOne of the first approaches to extracting protein interactions is that of Blaschke et al., described in [3, 4]. 
Their system is based on a set of manually developed rules, where each rule (or frame) is a sequence of words (or POS tags) and two protein-name tokens. Between every two adjacent words is a number indicating the maximum number of intervening words allowed when matching the rule to a sentence. An example rule is \"interaction of (3) <P> (3) with (3) <P>\", where '<P>' is used to denote a protein name. A sentence matches the rule if and only if it satisfies the word constraints in the given order and respects the respective word gaps. In [6] the authors described a new method ELCS (Extraction using Longest Common Subsequences) that automatically learns such rules. ELCS' rule representation is similar to that in [3, 4], except that it currently does not use POS tags, but allows disjunctions of words. An example rule learned by this system is \"- (7) interaction (0) [between | of] (5) <P> (9) <P> (17) .\". Words in square brackets separated by `|' indicate disjunctive lexical constraints, i.e. one of the given words must match the sentence at that position. The numbers in parentheses between adjacent constraints indicate the maximum number of unconstrained words allowed between the two.\n\n3\n\nExtraction using a Relation Kernel\n\nBoth Blaschke and ELCS do interaction extraction based on a limited set of matching rules, where a rule is simply a sparse (gappy) subsequence of words or POS tags anchored on the two protein-name tokens. Therefore, the two methods share a common limitation: either through manual selection (Blaschke), or as a result of the greedy learning procedure (ELCS), they end up using only a subset of all possible anchored sparse subsequences. Ideally, we would want to use all such anchored sparse subsequences as features, with weights reflecting their relative accuracy. However, explicitly creating for each sentence a vector with a position for each such feature is infeasible, due to the high dimensionality of the feature space. Here we can exploit dual learning algorithms that process examples only via computing their dot-products, such as Support Vector Machines (SVMs) [8]. Computing the dot-product between two such vectors amounts to calculating the number of common anchored subsequences between the two sentences. This can be done very efficiently by modifying the dynamic programming algorithm used in the string kernel from [7] to account only for common sparse subsequences constrained to contain the two protein-name tokens. We further prune down the feature space by utilizing the following property of natural language statements: when a sentence asserts a relationship between two entity mentions, it generally does this using one of the following three patterns: [FB] ForeBetween: words before and between the two entity mentions are simultaneously used to express the relationship. 
Examples: `interaction of P1 with P2 `, `activation of P1 by P2 `. [B] Between: only words between the two entities are essential for asserting the relationship. Examples: ` P1 interacts with P2 `, ` P1 is activated by P2 `. [BA] BetweenAfter: words between and after the two entity mentions are simultaneously used to express the relationship. Examples: ` P1 P2 complex`, ` P1 and P2 interact`. Another observation is that all these patterns use at most 4 words to express the relationship (not counting the two entity names). Consequently, when computing the relation kernel, we restrict the counting of common anchored subsequences only to those having one of the three types described above, with a maximum word-length of 4. This type of feature selection leads not only to a faster kernel computation, but also to less overfitting, which results in increased accuracy (see Section 5 for comparative experiments). The patterns enumerated above are completely lexicalized and consequently their performance is limited by data sparsity. This can be alleviated by categorizing words into classes with varying degrees of generality, and then allowing patterns to use both words and their classes. Examples of word classes are POS tags and generalizations over POS tags such as Noun, Active Verb or Passive Verb. The entity type can also be used, if the word is part of a known named entity, as well as the type of the chunk containing the word, when chunking information is available. Content words such as nouns and verbs can also be related to their synsets via WordNet. Patterns then will consist of sparse subsequences of words, POS tags, general POS (GPOS) tags, entity and chunk types, or WordNet synsets. For example, `Noun of P1 by P2 ` is an FB pattern based on words and general POS tags.\n\n4\n\nSubsequence Kernels for Relation Extraction\n\nWe are going to show how to compute the relation kernel described in the previous section in two steps. 
First, in Section 4.1 we present a generalization of the subsequence kernel from [7]. This new kernel works with patterns construed as mixtures of words and word classes. Based on this generalized subsequence kernel, in Section 4.2 we formally define and show the efficient computation of the relation kernel used in our experiments.\n\n4.1 A Generalized Subsequence Kernel\n\nLet Σ1, Σ2, ..., Σk be some disjoint feature spaces. Following the example in Section 3, Σ1 could be the set of words, Σ2 the set of POS tags, etc. Let Σ× = Σ1 × Σ2 × ... × Σk be the set of all possible feature vectors, where a feature vector is associated with each position in a sentence. Given two feature vectors x, y ∈ Σ×, let c(x, y) denote the number of common features between x and y. The next notation follows that introduced in [7]. Thus, let s, t be two sequences over the finite set Σ×, and let |s| denote the length of s = s1...s|s|. The sequence s[i:j] is the contiguous subsequence si...sj of s. Let i = (i1, ..., i|i|) be a sequence of |i| indices in s, in ascending order. We define the length l(i) of the index sequence i in s as i|i| - i1 + 1. Similarly, j is a sequence of |j| indices in t. Let Σ∪ = Σ1 ∪ Σ2 ∪ ... ∪ Σk be the set of all possible features. We say that the sequence u is a (sparse) subsequence of s if there is a sequence of |u| indices i such that uk ∈ sik, for all k = 1, ..., |u|. Equivalently, we write u ≺ s[i] as a shorthand for the component-wise '∈' relationship between u and s[i]. Finally, let Kn(s, t, λ) (Equation 1) be the number of weighted sparse subsequences u of length n common to s and t (i.e. u ≺ s[i], u ≺ t[j]), where the weight of u is λ^(l(i)+l(j)), for some λ ≤ 1.\n\nKn(s, t, λ) = Σ_{u ∈ Σ∪^n} Σ_{i: u ≺ s[i]} Σ_{j: u ≺ t[j]} λ^(l(i)+l(j))    (1)\n\nBecause for two fixed index sequences i and j, both of length n, the size of the set {u ∈ Σ∪^n | u ≺ s[i], u ≺ t[j]} is Π_{k=1..n} c(sik, tjk), we can rewrite Kn(s, t, λ) as in Equation 2:\n\nKn(s, t, λ) = Σ_{i: |i|=n} Σ_{j: |j|=n} Π_{k=1..n} c(sik, tjk) · λ^(l(i)+l(j))    (2)\n\nWe use λ as a decaying factor that penalizes longer subsequences. For sparse subsequences, this means that wider gaps will be penalized more, which is exactly the desired behavior for our patterns. Through them, we try to capture head-modifier dependencies that are important for relation extraction; for lack of reliable dependency information, the larger the word gap is between two words, the less confident we are in the existence of a head-modifier relationship between them.\n\nTo enable an efficient computation of Kn, we use the auxiliary function K'n with a similar definition as Kn, the only difference being that it counts the length from the beginning of the particular subsequence u to the end of the strings s and t, as illustrated in Equation 3:\n\nK'n(s, t, λ) = Σ_{u ∈ Σ∪^n} Σ_{i: u ≺ s[i]} Σ_{j: u ≺ t[j]} λ^(|s|+|t|-i1-j1+2)    (3)\n\nAn equivalent formula for K'n(s, t, λ) is obtained by changing the exponent of λ from Equation 2 to |s| + |t| - i1 - j1 + 2. Based on all definitions above, Kn can be computed in O(kn|s||t|) time, by modifying the recursive computation from [7] with the new factor c(x, y), as shown in Figure 1. In this figure, the sequence sx is the result of appending x to s (with ty defined in a similar way). To avoid clutter, the parameter λ is not shown in the argument list of K and K', unless it is instantiated to a specific constant.\n\nK'0(s, t) = 1, for all s, t\nK''i(sx, ty) = λ·K''i(sx, t) + λ²·K'i-1(s, t)·c(x, y)\nK'i(sx, t) = λ·K'i(s, t) + K''i(sx, t)\nKn(sx, t) = Kn(s, t) + Σ_j λ²·K'n-1(s, t[1:j-1])·c(x, t[j])\n\nFigure 1: Computation of subsequence kernel. 
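The recursion in Figure 1 can be sketched as a short dynamic program. The following is a minimal illustrative implementation (not the authors' code): each sentence position is a Python set of features (word, POS tag, etc., so word classes come for free), c(x, y) is the set-intersection size, and Kp[i][a][b] stores K'_i over the prefixes s[:a] and t[:b].

```python
def subseq_kernel(s, t, n, lam=0.75):
    """K_n(s, t, lam): weighted count of common sparse subsequences of
    length n, following the Figure 1 recursion with the feature-overlap
    factor c(x, y).  s and t are lists of feature sets, one per position."""
    def c(x, y):                      # number of common features
        return len(x & y)
    # Kp[i][a][b] = K'_i(s[:a], t[:b]); K'_0 = 1 everywhere
    Kp = [[[1.0] * (len(t) + 1) for _ in range(len(s) + 1)]]
    for i in range(1, n):
        Kp.append([[0.0] * (len(t) + 1) for _ in range(len(s) + 1)])
        for a in range(len(s)):
            Kpp = 0.0                 # running K''_i(s[:a+1], t[:b])
            for b in range(len(t)):
                Kpp = lam * Kpp + lam * lam * Kp[i - 1][a][b] * c(s[a], t[b])
                Kp[i][a + 1][b + 1] = lam * Kp[i][a][b + 1] + Kpp
    # K_n(sx, t) = K_n(s, t) + sum_j lam^2 * K'_{n-1}(s, t[:j]) * c(x, t[j])
    k = 0.0
    for a in range(len(s)):
        for b in range(len(t)):
            k += lam * lam * Kp[n - 1][a][b] * c(s[a], t[b])
    return k
```

For example, the only length-2 subsequence common to `a x b` and `a b` is `a...b`; its index spans have lengths 3 and 2, so the kernel value is λ^(3+2).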
4.2 Computing the Relation Kernel\n\nAs described in Section 2, the input consists of a set of sentences, where each sentence contains exactly two entities (protein names in the case of interaction extraction). In Figure 2 we show the segments that will be used for computing the relation kernel between two example sentences s and t. In sentence s for instance, x1 and x2 are the two entities, sf is the sentence segment before x1, sb is the segment between x1 and x2, and sa is the sentence segment after x2. For convenience, we also include the auxiliary segment s'b = x1 sb x2, whose span is computed as l(s'b) = l(sb) + 2 (in all length computations, we consider x1 and x2 as contributing one unit only).\n\ns = sf x1 sb x2 sa, with s'b = x1 sb x2\nt = tf y1 tb y2 ta, with t'b = y1 tb y2\n\nFigure 2: Sentence segments.\n\nThe relation kernel computes the number of common patterns between two sentences s and t, where the set of patterns is restricted to the three types introduced in Section 3. Therefore, the kernel rK(s, t) is expressed as the sum of three sub-kernels: fbK(s, t) counting the number of common forebetween patterns, bK(s, t) for between patterns, and baK(s, t) for betweenafter patterns, as in Figure 3.\n\nrK(s, t) = fbK(s, t) + bK(s, t) + baK(s, t)\nbKi(s, t) = Ki(sb, tb, 1) · c(x1, y1) · c(x2, y2) · λ^(l(s'b)+l(t'b))\nfbK(s, t) = Σ_{i,j} bKi(s, t) · K'j(sf, tf),  1 ≤ i, 1 ≤ j, i + j < fbmax\nbK(s, t) = Σ_i bKi(s, t),  1 ≤ i ≤ bmax\nbaK(s, t) = Σ_{i,j} bKi(s, t) · K'j(sa⁻, ta⁻),  1 ≤ i, 1 ≤ j, i + j < bamax\n\nFigure 3: Computation of relation kernel.\n\nAll three sub-kernels include in their computation the counting of common subsequences between s'b and t'b. In order to speed up the computation, all these common counts can be calculated separately in bKi, which is defined as the number of common subsequences of length i between s'b and t'b, anchored at x1/x2 and y1/y2 respectively (i.e. constrained to start at x1/y1 and to end at x2/y2). 
Then fbK simply counts the number of subsequences that match j positions before the first entity and i positions between the entities, constrained to have length less than a constant fbmax. To obtain a similar formula for baK we simply use the reversed (mirror) version of segments sa and ta (e.g. sa⁻ and ta⁻). In Section 3 we observed that all three subsequence patterns use at most 4 words to express a relation, therefore we set the constants fbmax, bmax and bamax to 4. Kernels K and K' are computed using the procedure described in Section 4.1.\n\n5\n\nExperimental Results\n\nThe relation kernel (ERK) is evaluated on the task of extracting relations from two corpora with different types of narrative, which are described in more detail in the following sections. In both cases, we assume that the entities and their labels are known. All preprocessing steps (sentence segmentation, tokenization, POS tagging and chunking) were performed using the OpenNLP package. If a sentence contains n entities (n ≥ 2), it is replicated into (n choose 2) sentences, each containing only two entities. If the two entities are known to be in a relationship, then the replicated sentence is added to the set of corresponding positive sentences, otherwise it is added to the set of negative sentences. During testing, a sentence having n entities (n ≥ 2) is again replicated into (n choose 2) sentences in a similar way. The relation kernel is used in conjunction with SVM learning in order to find a decision hyperplane that best separates the positive examples from the negative examples. We modified the LibSVM package by plugging in the kernel described above. In all experiments, the decay factor λ is set to 0.75. The performance is measured using precision (percentage of correctly extracted relations out of total extracted) and recall (percentage of correctly extracted relations out of the total number of relations annotated in the corpus). 
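The example-generation step described above, replicating an n-entity sentence into (n choose 2) two-entity copies labeled by whether the pair is annotated as related, can be sketched as follows. This is a minimal illustrative sketch; the function and variable names are hypothetical, not from the authors' system.

```python
from itertools import combinations

def make_examples(entities, gold_pairs):
    """Turn one sentence with n >= 2 entities into C(n, 2) two-entity
    examples.  `entities` lists entity identifiers in sentence order;
    `gold_pairs` is a set of frozensets of entity pairs known to be related."""
    examples = []
    for e1, e2 in combinations(entities, 2):
        label = 1 if frozenset((e1, e2)) in gold_pairs else -1
        examples.append((e1, e2, label))
    return examples
```

For instance, a sentence mentioning proteins P1, P2, P3 with one annotated interaction (P1, P2) yields three examples: one positive and two negative.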
When PR curves are reported, the precision and recall are computed using output from 10-fold cross-validation. The graph points are obtained by varying a threshold on the minimum acceptable extraction confidence, based on the probability estimates from LibSVM.\n\n1 URL: http://opennlp.sourceforge.net\n2 URL: http://www.csie.ntu.edu.tw/~cjlin/libsvm/\n\n5.1\n\nInteraction Extraction from AImed\n\nWe did comparative experiments on the AImed corpus, which has been previously used for training the protein interaction extraction systems in [6]. It consists of 225 Medline abstracts, of which 200 are known to describe interactions between human proteins, while the other 25 do not refer to any interaction. There are 4084 protein references and around 1000 tagged interactions in this dataset. We compare the following three systems on the task of retrieving protein interactions from AImed (assuming gold standard proteins): [Manual]: We report the performance of the rule-based system of [3, 4]. [ELCS]: We report the 10-fold cross-validated results from [6] as a PR graph. [ERK]: Based on the same splits as ELCS, we compute the corresponding precision-recall graph. In order to have a fair comparison with the other two systems, which use only lexical information, we do not use any word classes here. The results, summarized in Figure 4(a), show that the relation kernel outperforms both ELCS and the manually written rules.\n\n[Two precision (%) vs. recall (%) plots]\n\nFigure 4: PR curves for interaction extractors. (a) ERK vs. ELCS (b) ERK vs. ERK-A\n\nTo evaluate the impact that the three types of patterns have on performance, we compare ERK with an ablated system (ERK-A) that uses all possible patterns, constrained only to be anchored on the two entity names. 
As can be seen in Figure 4(b), the three patterns (FB, B, BA) do lead to a significant increase in performance, especially for higher recall levels.\n\n5.2\n\nRelation Extraction from ACE\n\nTo evaluate how well this relation kernel ports to other types of narrative, we applied it to the problem of extracting top-level relations from the ACE corpus [2], the version used for the September 2002 evaluation. The training part of this dataset consists of 422 documents, with a separate set of 97 documents allocated for testing. This version of the ACE corpus contains three types of annotations: coreference, named entities and relations. There are five types of entities (PERSON, ORGANIZATION, FACILITY, LOCATION, and GEO-POLITICAL ENTITY) which can participate in five general, top-level relations: ROLE, PART, LOCATED, NEAR, and SOCIAL. A recent approach to extracting relations is described in [9]. The authors use a generalized version of the tree kernel from [10] to compute a kernel over relation examples, where a relation example consists of the smallest dependency tree containing the two entities of the relation. Precision and recall values are reported for the task of extracting the 5 top-level relations in the ACE corpus under two different scenarios: [S1] This is the classic setting: one multi-class SVM is learned to discriminate among the 5 top-level classes, plus one more class for the no-relation cases. [S2] One binary SVM is trained for relation detection, meaning that all positive relation instances are combined into one class. The thresholded output of this binary classifier is used as training data for a second multi-class SVM, trained for relation classification. We trained our relation kernel, under the first scenario, to recognize the same 5 top-level relation types. 
While for interaction extraction we used only the lexicalized version of the kernel, here we utilize more features, corresponding to the following feature spaces: Σ1 is the word vocabulary, Σ2 is the set of POS tags, Σ3 is the set of generic POS tags, and Σ4 contains the 5 entity types. We also used chunking information as follows: all (sparse) subsequences were created exclusively from the chunk heads, where a head is defined as the last word in a chunk. The same criterion was used for computing the length of a subsequence: all words other than head words were ignored. This is based on the observation that in general words other than the chunk head do not contribute to establishing a relationship between two entities outside of that chunk. One exception is when both entities in the example sentence are contained in the same chunk. This happens very often due to noun-noun ('U.S. troops') or adjective-noun ('Serbian general') compounds. In these cases, we let one chunk contribute both entity heads. Also, an important difference from the interaction extraction case is that often the two entities in a relation do not have any words separating them, as for example in noun-noun compounds. None of the three patterns from Section 3 captures this type of dependency, therefore we introduced a fourth type of pattern, the modifier pattern M. This pattern consists of a sequence of length two formed from the head words (or their word classes) of the two entities. Correspondingly, we updated the relation kernel from Figure 3 with a new kernel term mK, as illustrated in Equation 4.\n\nrK(s, t) = fbK(s, t) + bK(s, t) + baK(s, t) + mK(s, t)    (4)\n\nThe sub-kernel mK corresponds to a product of counts, as shown in Equation 5.\n\nmK(s, t) = c(x1, y1) · c(x2, y2) · λ^(2+2)    (5)\n\nWe present in Table 1 the results of using our updated relation kernel to extract relations from ACE, under the first scenario. 
We also show the results presented in [9] for their best performing kernel K4 (a sum between a bag-of-words kernel and the dependency kernel) under both scenarios.\n\nTable 1: Extraction Performance on ACE.\n\nMethod    Precision  Recall  F-measure\n(S1) ERK  73.9       35.2    47.7\n(S1) K4   70.3       26.3    38.0\n(S2) K4   67.1       35.0    45.8\n\nEven though it uses less sophisticated syntactic and semantic information, ERK in S1 significantly outperforms the dependency kernel. Also, ERK already performs a few percentage points better than K4 in S2. Therefore we expect to get an even more significant increase in performance by training our relation kernel in the same cascaded fashion.\n\n6\n\nRelated Work\n\nIn [10], a tree kernel is defined over shallow parse representations of text, together with an efficient algorithm for computing it. Experiments on extracting PERSON-AFFILIATION and ORGANIZATION-LOCATION relations from 200 news articles show the advantage of using this new type of tree kernel over three feature-based algorithms. The same kernel was slightly generalized in [9] and applied on dependency tree representations of sentences, with dependency trees being created from head-modifier relationships extracted from syntactic parse trees. Experimental results show a clear win of the dependency tree kernel over a bag-of-words kernel. However, in a bag-of-words approach the word order is completely lost. For relation extraction, word order is important, and our experimental results support this claim: all subsequence patterns used in our approach retain the order between words.\n\nThe tree kernels used in the two methods above are opaque in the sense that the semantics of the dimensions in the corresponding Hilbert space is not obvious. For subsequence kernels, the semantics is known by definition: each subsequence pattern corresponds to a dimension in the Hilbert space. 
This enabled us to easily restrict the types of patterns counted by the kernel to the three types that we deemed relevant for relation extraction.\n\n7\n\nConclusion and Future Work\n\nWe have presented a new relation extraction method based on a generalization of subsequence kernels. When evaluated on a protein interaction dataset, the new method showed better performance than two previous rule-based systems. After a small modification, the same kernel was evaluated on the task of extracting top-level relations from the ACE corpus, showing better performance when compared with a recent dependency tree kernel. An experiment that we expect to lead to better performance was already suggested in Section 5.2: using the relation kernel in a cascaded fashion, in order to improve the low recall caused by the highly unbalanced data distribution. Another performance gain may come from setting the factor λ to a more appropriate value based on a development dataset. Currently, the method assumes the named entities are known. A natural extension is to integrate named entity recognition with relation extraction. Recent research [11] indicates that a global model that captures the mutual influences between the two tasks can lead to significant improvements in accuracy.\n\n8\n\nAcknowledgements\n\nThis work was supported by grants IIS-0117308 and IIS-0325116 from the NSF. We would like to thank Rohit J. Kate and the anonymous reviewers for helpful observations.\n\nReferences\n[1] R. Grishman, Message Understanding Conference 6, http://cs.nyu.edu/cs/faculty/grishman/muc6.html (1995). [2] NIST, ACE Automatic Content Extraction, http://www.nist.gov/speech/tests/ace (2000). [3] C. Blaschke, A. Valencia, Can bibliographic pointers for known biological data be found automatically? protein interactions as a case study, Comparative and Functional Genomics 2 (2001) 196-206. [4] C. Blaschke, A. 
Valencia, The frame-based module of the Suiseki information extraction system, IEEE Intelligent Systems 17 (2002) 14-20. [5] S. Ray, M. Craven, Representing sentence structure in hidden Markov models for information extraction, in: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-2001), Seattle, WA, 2001, pp. 1273-1279. [6] R. Bunescu, R. Ge, R. J. Kate, E. M. Marcotte, R. J. Mooney, A. K. Ramani, Y. W. Wong, Comparative experiments on learning information extractors for proteins and their interactions, Artificial Intelligence in Medicine (special issue on Summarization and Information Extraction from Medical Documents) 33 (2) (2005) 139-155. [7] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, Journal of Machine Learning Research 2 (2002) 419-444. [8] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. [9] A. Culotta, J. Sorensen, Dependency tree kernels for relation extraction, in: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, 2004, pp. 423-429. [10] D. Zelenko, C. Aone, A. Richardella, Kernel methods for relation extraction, Journal of Machine Learning Research 3 (2003) 1083-1106. [11] D. Roth, W. Yih, A linear programming formulation for global inference in natural language tasks, in: Proceedings of the Annual Conference on Computational Natural Language Learning (CoNLL), Boston, MA, 2004, pp. 1-8.\n", "award": [], "sourceid": 2787, "authors": [{"given_name": "Raymond", "family_name": "Mooney", "institution": null}, {"given_name": "Razvan", "family_name": "Bunescu", "institution": null}]}