{"title": "Identifying Patients at Risk of Major Adverse Cardiovascular Events Using Symbolic Mismatch", "book": "Advances in Neural Information Processing Systems", "page_first": 2262, "page_last": 2270, "abstract": "Cardiovascular disease is the leading cause of death globally, resulting in 17 million deaths each year. Despite the availability of various treatment options, existing techniques based upon conventional medical knowledge often fail to identify patients who might have benefited from more aggressive therapy. In this paper, we describe and evaluate a novel unsupervised machine learning approach for cardiac risk stratification. The key idea of our approach is to avoid specialized medical knowledge, and assess patient risk using symbolic mismatch, a new metric to assess similarity in long-term time-series activity. We hypothesize that high risk patients can be identified using symbolic mismatch, as individuals in a population with unusual long-term physiological activity. We describe related approaches that build on these ideas to provide improved medical decision making for patients who have recently suffered coronary attacks. We first describe how to compute the symbolic mismatch between pairs of long term electrocardiographic (ECG) signals. This algorithm maps the original signals into a symbolic domain, and provides a quantitative assessment of the difference between these symbolic representations of the original signals. We then show how this measure can be used with each of a one-class SVM, a nearest neighbor classifier, and hierarchical clustering to improve risk stratification. We evaluated our methods on a population of 686 cardiac patients with available long-term electrocardiographic data. In a univariate analysis, all of the methods provided a statistically significant association with the occurrence of a major adverse cardiac event in the next 90 days. 
In a multivariate analysis that incorporated the most widely used clinical risk variables, the nearest neighbor and hierarchical clustering approaches were able to statistically significantly distinguish patients with a roughly two-fold risk of suffering a major adverse cardiac event in the next 90 days.", "full_text": "Identifying Patients at Risk of Major Adverse\n\nCardiovascular Events Using Symbolic Mismatch\n\nZeeshan Syed\n\nUniversity of Michigan\nAnn Arbor, MI 48109\n\nzhs@eecs.umich.edu\n\nJohn Guttag\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nguttag@csail.mit.edu\n\nAbstract\n\nCardiovascular disease is the leading cause of death globally, resulting in 17 mil-\nlion deaths each year. Despite the availability of various treatment options, ex-\nisting techniques based upon conventional medical knowledge often fail to iden-\ntify patients who might have bene\ufb01ted from more aggressive therapy. In this pa-\nper, we describe and evaluate a novel unsupervised machine learning approach\nfor cardiac risk strati\ufb01cation. The key idea of our approach is to avoid special-\nized medical knowledge, and assess patient risk using symbolic mismatch, a new\nmetric to assess similarity in long-term time-series activity. We hypothesize that\nhigh risk patients can be identi\ufb01ed using symbolic mismatch, as individuals in\na population with unusual long-term physiological activity. We describe related\napproaches that build on these ideas to provide improved medical decision mak-\ning for patients who have recently suffered coronary attacks. We \ufb01rst describe\nhow to compute the symbolic mismatch between pairs of long term electrocardio-\ngraphic (ECG) signals. This algorithm maps the original signals into a symbolic\ndomain, and provides a quantitative assessment of the difference between these\nsymbolic representations of the original signals. 
We then show how this measure\ncan be used with each of a one-class SVM, a nearest neighbor classi\ufb01er, and hier-\narchical clustering to improve risk strati\ufb01cation. We evaluated our methods on a\npopulation of 686 cardiac patients with available long-term electrocardiographic\ndata. In a univariate analysis, all of the methods provided a statistically signi\ufb01cant\nassociation with the occurrence of a major adverse cardiac event in the next 90\ndays. In a multivariate analysis that incorporated the most widely used clinical\nrisk variables, the nearest neighbor and hierarchical clustering approaches were\nable to statistically signi\ufb01cantly distinguish patients with a roughly two-fold risk\nof suffering a major adverse cardiac event in the next 90 days.\n\n1 Introduction\n\nIn medicine, as in many other disciplines, decisions are often based upon a comparative analysis.\nPatients are given treatments that worked in the past on apparently similar conditions. When given\nsimple data (e.g., demographics, comorbidities, and laboratory values) such comparisons are rela-\ntively straightforward. For more complex data, such as continuous long-term signals recorded during\nphysiological monitoring, they are harder. Comparing such time-series is made challenging by three\nfactors: the need to ef\ufb01ciently compare very long signals across a large number of patients, the need\nto deal with patient-speci\ufb01c differences, and the lack of a priori knowledge associating signals with\nlong-term medical outcomes.\nIn this paper, we exploit three different ideas to address these problems.\n\n\u2022 We address the problems related to scale by abstracting the raw signal into a sequence of\nsymbols,\n\u2022 We address the problems related to patient-speci\ufb01c differences by using a novel technique,\nsymbolic mismatch, that allows us to compare sequences of symbols drawn from distinct\nalphabets. 
Symbolic mismatch compares long-term time-series by quantifying differences\nbetween the morphology and frequency of prototypical functional units, and\n\u2022 We address the problems related to lack of a priori knowledge using three different meth-\nods, each of which exploits the observation that high risk patients typically constitute a\nsmall minority in a population.\n\nIn the remainder of this paper, we present our work in the context of risk strati\ufb01cation for cardio-\nvascular disease. Cardiovascular disease is the leading cause of death globally and causes roughly\n17 million deaths each year [3]. Despite improvements in survival rates, in the United States, one\nin four men and one in three women still die within a year of a recognized \ufb01rst heart attack [4].\nThis risk of death can be substantially lowered with an appropriate choice of treatment (e.g., drugs\nto lower cholesterol and blood pressure; operations such as coronary artery bypass graft; and med-\nical devices such as implantable cardioverter de\ufb01brillators) [3]. However, matching patients with\ntreatments that are appropriate for their risk has proven to be challenging [5,6].\nThat existing techniques based upon conventional medical knowledge have proven inadequate for\nrisk strati\ufb01cation leads us to explore methods with few a priori assumptions. We focus, in particular,\non identifying patients at elevated risk of major adverse cardiac events (death, myocardial infarction\nand severe recurrent ischemia) following coronary attacks. This work uses long-term ECG signals\nrecorded during patient admission for ACS. These signals are routinely collected, potentially allow-\ning for the work presented here to be deployed easily without imposing additional needs on patients,\ncaregivers, or the healthcare infrastructure.\nFortunately, only a minority of cardiac patients experience serious subsequent adverse cardiovascu-\nlar events. 
For example, cardiac mortality over a 90 day period following acute coronary syndrome\n(ACS) was reported to be 1.79% for the SYMPHONY trial involving 14,970 patients [1] and 1.71%\nfor the DISPERSE2 trial with 990 patients [2]. The rate of myocardial infarction (MI) over the\nsame period for the two trials was 5.11% for the SYMPHONY trial and 3.54% for the DISPERSE2\ntrial. Our hypothesis is that these patients can be discovered as anomalies in the population, i.e.,\ntheir physiological activity over long periods of time is dissimilar to the majority of the patients in\nthe population. In contrast to algorithms that require labeled training data, we propose identifying\nthese patients using unsupervised approaches based on three machine learning methods previously\nreported in the literature: one-class support vector machines (SVMs), nearest neighbor analysis, and\nhierarchical clustering.\nThe main contributions of our work are: (1) we describe a novel unsupervised approach to cardio-\nvascular risk strati\ufb01cation that is complementary to existing clinical approaches, (2) we explore the\nidea of similarity-based clinical risk strati\ufb01cation where patients are categorized in terms of their\nsimilarities rather than speci\ufb01c features based on prior knowledge, (3) we develop the hypothesis\nthat patients at future risk of adverse outcomes can be detected using an unsupervised approach as\noutliers in a population, (4) we present symbolic mismatch, as a way to ef\ufb01ciently compare very long\ntime-series without \ufb01rst reducing them to a set of features or requiring symbol registration across\npatients, and (5) we present a rigorous evaluation of unsupervised similarity-based risk strati\ufb01cation\nusing long-term data from nearly 700 patients with detailed admissions and follow-up data.\n\n2 Symbolic Mismatch\n\nWe start by describing the process through which symbolic mismatch is measured on ECG signals.\n\n2.1 Symbolization\n\nAs a \ufb01rst 
step, the ECG signal $z_m$ for each patient $m = 1, \\ldots, n$ is symbolized using the technique\nproposed by [7]. To segment the ECG signal into beats, we use two open-source QRS detection\nalgorithms [8,9]. QRS complexes are marked at locations where both algorithms agree. A variant\nof dynamic time-warping (DTW) [7] is then used to quantify differences in morphology between\nbeats. Using this information, beats with distinct morphologies are partitioned into groups, with\neach group assigned a unique label or symbol. This is done using a Max-Min iterative clustering\nalgorithm that starts by choosing the \ufb01rst observation as the \ufb01rst centroid, $c_1$, and initializes the set\n$S$ of centroids to $\\{c_1\\}$. During the i-th iteration, $c_i$ is chosen such that it maximizes the minimum\ndifference between $c_i$ and observations in $S$:\n\n$c_i = \\arg\\max_{x \\notin S} \\min_{y \\in S} C(x, y)$ (1)\n\nwhere $C(x, y)$ is the DTW difference between $x$ and $y$. The set $S$ is incremented at the end of each\niteration such that $S = S \\cup \\{c_i\\}$.\nThe number of clusters discovered by Max-Min clustering is chosen by iterating until the maximized\nminimum difference falls below a threshold $\\theta$. At this point, the set $S$ comprises the centroids for\nthe clustering process, and the \ufb01nal assignment of beats to clusters proceeds by matching each beat\nto its nearest centroid. Each set of beats assigned to a centroid constitutes a unique cluster. The \ufb01nal\nnumber of clusters, $\\gamma$, obtained using this process depends on the separability of the underlying data.\nThe overall effect of the DTW-based partitioning of beats is to transform the original raw ECG\nsignal into a sequence of symbols, i.e., into a sequence of labels corresponding to the different beat\nmorphology classes that occur in the signal. Our approach differs from the methods typically used\nto annotate ECG signals in two ways. First, we avoid using specialized knowledge to partition\nbeats into known clinical classes. 
There is a set of generally accepted labels that cardiologists\nuse to differentiate distinct kinds of heart beats. However, in many cases, \ufb01ner distinctions than\nprovided by these labels can be clinically relevant [7]. Our use of beat clustering rather than beat\nclassi\ufb01cation allows us to infer characteristic morphology classes that capture these \ufb01ner-grained\ndistinctions. Second, our approach does not involve extracting features (e.g., the length of the beat\nor the amplitude of the P wave) from individual beats. Instead, our clustering algorithm compares\nthe entire raw morphology of pairs of beats. This approach is potentially advantageous, because it\ndoes not assume a priori knowledge about what aspects of a beat are most relevant. It can also be\nextended to other time-series data (e.g., blood pressure and respiration waveforms).\n\n2.2 Measuring Mismatch in Symbolic Representations\n\nDenoting the set of symbol centroids for patient $p$ as $S_p$ and the set of frequencies with which these\nsymbols occur in the electrocardiogram as $F_p$ (for patient $q$ an analogous representation is adopted),\nwe de\ufb01ne the symbolic mismatch between the long-term ECG time-series for patients $p$ and $q$ as:\n\n$\\psi_{p,q} = \\sum_{p_i \\in S_p} \\sum_{q_j \\in S_q} C(p_i, q_j) F_p[p_i] F_q[q_j]$ (2)\n\nwhere $C(p_i, q_j)$ corresponds to the DTW cost of aligning the centroids of symbol classes $p_i$ and $q_j$.\nIntuitively, the symbolic mismatch between patients $p$ and $q$ corresponds to an estimate of the ex-\npected difference in morphology between any two randomly chosen beats from these patients. 
The\nsymbolic mismatch computation achieves this by weighting the difference between the centroids for\nevery pair of symbols for the patients by the frequencies with which these symbols occur.\nAn important feature of symbolic mismatch is that it avoids the need to set up a correspondence\nbetween the symbols of patients p and q. In contrast to cluster matching techniques [10,11], which\ncompare data for two patients by \ufb01rst making an assignment from symbols in one patient to the\nother, symbolic mismatch does not require any cross-patient registration of symbols. Instead, it\nperforms weighted morphologic comparisons between all symbol centroids for patients p and q. As\na result, the symbolization process does not need to be restricted to well-de\ufb01ned labels and is able\nto use a richer set of patient-speci\ufb01c symbols that capture \ufb01ne-grained activity over long periods.\n\n2.3 Spectrum Clipping and Adaptation for Kernel-based Methods\n\nThe formulation for symbolic mismatch in Equation 2 gives rise to a symmetric dissimilarity ma-\ntrix. For methods that are unable to work directly from dissimilarities, this can be transformed into a\nsimilarity matrix using a generalized radial basis function. For both the dissimilarity and similarity\ncase, however, symbolic mismatch can produce a matrix that is inde\ufb01nite. This can be problematic\nwhen using symbolic mismatch with kernel-based algorithms since the optimization problems be-\ncome non-convex and the underlying theory is invalidated. In particular, kernel-based classi\ufb01cation\nmethods require Mercer\u2019s condition to be satis\ufb01ed by a positive semi-de\ufb01nite kernel matrix [12].\nThis creates the need to transform the symbolic mismatch matrix before it can be used as a kernel in\nthese methods.\nWe use spectrum clipping to generalize the use of symbolic mismatch for classi\ufb01cation. 
This ap-\nproach has been shown both theoretically and empirically to offer advantages over other strategies\n(e.g., spectrum \ufb02ipping, spectrum shifting, spectrum squaring, and the use of inde\ufb01nite kernels)\n[13]. The symmetric mismatch matrix $\\Psi$ has an eigenvalue decomposition:\n\n$\\Psi = U^T \\Lambda U$ (3)\n\nwhere $U$ is an orthogonal matrix and $\\Lambda$ is a diagonal matrix of real eigenvalues:\n\n$\\Lambda = \\mathrm{diag}(\\lambda_1, \\ldots, \\lambda_n)$ (4)\n\nSpectrum clipping makes $\\Psi$ positive semi-de\ufb01nite by clipping all negative eigenvalues to zero. The\nmodi\ufb01ed positive semi-de\ufb01nite symbolic mismatch matrix is then given by:\n\n$\\Psi_{clip} = U^T \\Lambda_{clip} U$ (5)\n\nwhere:\n\n$\\Lambda_{clip} = \\mathrm{diag}(\\max(\\lambda_1, 0), \\ldots, \\max(\\lambda_n, 0))$ (6)\n\nUsing $\\Psi_{clip}$ as a kernel matrix is then equivalent to using $(\\Lambda_{clip})^{1/2} u_i$ as the i-th training sample.\nThough we introduce spectrum clipping mainly for the purpose of broadening the applicability of\nsymbolic mismatch to kernel-based methods, this approach offers additional advantages. When the\nnegative eigenvalues of the similarity matrix are caused by noise, one can view spectrum clipping as\na denoising step [14]. The results of our experiments, presented later in this paper, support the view\nof spectrum clipping being useful in a broader context.\n\n3 Risk Strati\ufb01cation Using Symbolic Mismatch\n\nWe now sketch three different approaches using symbolic mismatch to identify high risk patients in a\npopulation. The following two sections contain an empirical evaluation of each. The \ufb01rst approach\nuses a one-class SVM and a symbolic mismatch similarity matrix obtained using a generalized\nradial basis transformation. The other two approaches, nearest neighbor analysis and hierarchical\nclustering, use the symbolic mismatch dissimilarity matrix. 
In each case, the symbolic mismatch\nmatrix is processed using spectrum clipping.\n\n3.1 Classi\ufb01cation Approach\n\nSVMs can be applied to anomaly detection in a one-class setting [15]. This is done by mapping the\ndata into the feature space corresponding to the kernel and separating instances from the origin with\nthe maximum margin. To separate data from the origin, the following quadratic program is solved:\n\n$\\min_{w, \\xi, p} \\frac{1}{2} \\|w\\|^2 + \\frac{1}{vn} \\sum_i \\xi_i - p$ (7)\n\nsubject to:\n\n$(w \\cdot \\Phi(z_i)) \\geq p - \\xi_i, \\quad i = 1, \\ldots, n, \\quad \\xi_i \\geq 0$ (8)\n\nwhere $v$ re\ufb02ects the tradeoff between incorporating outliers and minimizing the support region.\nFor a new instance, the label is determined by evaluating which side of the hyperplane the instance\nfalls on in the feature space. The resulting predicted label in terms of the Lagrange multipliers $\\alpha_i$\nand the spectrum clipped symbolic mismatch similarity matrix $\\Psi_{clip}$ is then:\n\n$\\hat{y}_j = \\mathrm{sgn}\\left(\\sum_i \\alpha_i \\Psi_{clip}(i, j) - p\\right)$ (9)\n\nWe apply this approach to train a one-class SVM on all patients. Patients outside the enclosing\nboundary are labeled anomalies. The parameter $v$ can be varied to control the size of this group.\n\n3.2 Nearest Neighbor Approach\n\nOur second approach is based on the concept of nearest neighbor analysis. The assumption underly-\ning this approach is that normal data instances occur in dense neighborhoods, while anomalies occur\nfar from their closest neighbors.\nWe use an approach similar to [16]. The anomaly score of each patient\u2019s long-term time-series is\ncomputed as the sum of its distances from the time-series for its k-nearest neighbors, as measured\nby symbolic mismatch. Patients with anomaly scores exceeding a threshold $\\theta$ are labeled anomalies.\n\n3.3 Clustering Approach\n\nOur third approach is based on hierarchical clustering. 
We place each patient in a separate cluster,\nand then proceed in each iteration to merge the two clusters that are most similar to each other. The\ndistance between two clusters is de\ufb01ned as the average of the pairwise symbolic mismatch of the\npatients in each cluster. The clustering process terminates when it enters the region of diminishing\nreturns (i.e., at the \u2019knee\u2019 of the curve corresponding to the distance of clusters merged together at\neach iteration). At this point, all patients outside the largest cluster are labeled as anomalies.\n\n4 Evaluation Methodology\n\nWe evaluated our work on patients enrolled in the DISPERSE2 trial [2]. Patients in the study were\nadmitted to a hospital with non-ST-elevation ACS. Three lead continuous ECG monitoring (LifeCard\nCF / Path\ufb01nder, DelMar Reynolds / Spacelabs, Issaqua WA) was performed for a median duration\nof four days at a sampling rate of 128 Hz. The endpoints of cardiovascular death, myocardial\ninfarction and severe recurrent ischemia were adjudicated by a blinded Clinical Events Committee\nfor a median follow-up period of 60 days. The maximum follow-up was 90 days. Data from 686\npatients was available after removal of noise-corrupted signals. During the follow-up there were\n14 cardiovascular deaths, 28 myocardial infarctions, and 13 cases of severe recurrent ischemia. We\nde\ufb01ne a major adverse cardiac event to be any of these three adverse events.\nWe studied the effectiveness of combining symbolic mismatch with each of classi\ufb01cation, near-\nest neighbor analysis and clustering in identifying a high risk group of patients. Consistent with\nother clinical studies to evaluate methods for risk strati\ufb01cation in the setting of ACS [17], we clas-\nsi\ufb01ed patients in the highest quartile as the high risk group. 
For the classi\ufb01cation approach, this\ncorresponded to choosing v such that the group of patients lying outside the enclosing boundary\nconstituted roughly 25% of the population. For the nearest neighbor approach we investigated all\nodd values of k from 3 to 9, and patients with anomaly scores in the top 25% of the population were\nclassi\ufb01ed as being at high risk. For the clustering approach, the varying sizes of the clusters merged\ntogether at each step made it dif\ufb01cult to select a high risk quartile. Instead, patients lying outside\nthe largest cluster were categorized as being at risk. In the tests reported later in this paper, this\ngroup contained roughly 23% of the patients in the population. We used the LIBSVM implementation\nfor our one-class SVM. Both the nearest neighbor and clustering approaches were carried out using\nMATLAB implementations.\nWe employed Kaplan-Meier survival analysis to compare the rates for major adverse cardiac events\nbetween patients declared to be at high and low risk. Hazard ratios (HR) and 95% con\ufb01dence in-\ntervals (CI) were estimated using a Cox proportional hazards regression model. 
The predictions of\neach approach were studied in univariate models, and also in multivariate models that additionally\nincluded other clinical risk variables (age\u226565 years, gender, smoking history, hypertension, dia-\nbetes mellitus, hyperlipidemia, history of chronic obstructive pulmonary disorder (COPD), history\nof coronary heart disease (CHD), previous MI, previous angina, ST depression on admission, index\ndiagnosis of MI) as well as ECG risk metrics proposed in the past (heart rate variability (HRV), heart\nrate turbulence (HRT), and deceleration capacity (DC)) [18].\n\nMethod                    HR    P Value  95% CI\nOne-Class SVM             1.38  0.033    1.04-1.89\n3-Nearest Neighbor        1.91  0.031    1.06-3.44\n5-Nearest Neighbor        2.10  0.013    1.17-3.76\n7-Nearest Neighbor        2.28  0.005    1.28-4.07\n9-Nearest Neighbor        2.07  0.015    1.15-3.71\nHierarchical Clustering   2.04  0.017    1.13-3.68\n\nTable 1: Univariate association of risk predictions from different approaches using symbolic mis-\nmatch with major adverse cardiac events over a 90 day period following ACS.\n\nClinical Variable         HR    P Value  95% CI\nAge\u226565 years              1.82  0.041    1.02-3.24\nFemale Gender             0.69  0.261    0.37-1.31\nCurrent Smoker            1.05  0.866    0.59-1.87\nHypertension              1.44  0.257    0.77-2.68\nDiabetes Mellitus         1.95  0.072    0.94-4.04\nHyperlipidemia            1.00  0.994    0.55-1.82\nHistory of COPD           1.05  0.933    0.37-2.92\nHistory of CHD            1.10  0.994    0.37-2.92\nPrevious MI               1.17  0.630    0.62-2.22\nPrevious angina           0.94  0.842    0.53-1.68\nST depression>0.5mm       1.13  0.675    0.64-2.01\nIndex diagnosis of MI     1.42  0.134    0.90-2.26\nHeart Rate Variability    1.56  0.128    0.88-2.77\nHeart Rate Turbulence     1.64  0.013    1.11-2.42\nDeceleration Capacity     1.77  0.002    1.23-2.54\n\nTable 2: Univariate association of existing clinical and ECG risk variables with major adverse car-\ndiac events over a 90 day period following ACS.\n\n5 Results\n\n5.1 Univariate Results\n\nResults of univariate analysis for all 
three unsupervised symbolic mismatch-based approaches are\npresented in Table 1. The predictions from all methods showed a statistically signi\ufb01cant (i.e., p < 0.05)\nassociation with major adverse cardiac events following ACS. The results in Table 1 can\nbe interpreted as roughly a doubled rate of adverse outcomes per unit time in patients identi\ufb01ed as\nbeing at high risk by the nearest neighbor and clustering approaches. For the classi\ufb01cation approach,\npatients identi\ufb01ed as being at high risk had a nearly 40% increased risk.\nFor comparison, we also include the univariate association of the other clinical and ECG risk vari-\nables in our study (Table 2). Both the nearest neighbor and clustering approaches had a higher hazard\nratio in this patient population than any of the other variables studied. Of the clinical risk variables,\nonly age was found to be signi\ufb01cantly associated on univariate analysis with major cardiac events\nafter ACS. Diabetes (p=0.072) was marginally outside the 5% level of signi\ufb01cance. Of the ECG risk\nvariables, both HRT and DC showed a univariate association with major adverse cardiac events in\nthis population. These results are consistent with the clinical literature on these risk metrics.\n\n5.2 Multivariate Results\n\nWe measured the correlation between the predictions of the unsupervised symbolic mismatch-based\napproaches and both the clinical and ECG risk variables. All of the unsupervised approaches had\nlow correlation with both sets of variables (R \u2264 0.2). This suggests that the results of these novel\napproaches can be usefully combined with results of existing approaches.\nOn multivariate analysis, both the nearest neighbor approach and the clustering approach were inde-\npendent predictors of adverse outcomes (Table 3). In our study, the nearest neighbor approach (for\nk > 3) had the highest hazard ratio on both univariate and multivariate analysis. 
Both the nearest\nneighbor and clustering approaches predicted patients with an approximately two-fold increased risk\nof adverse outcomes. This increased risk did not change much even after adjusting for other clinical\nand ECG risk variables.\n\nMethod                    Adjusted HR  P Value  95% CI\nOne-Class SVM             1.32         0.074    0.97-1.79\n3-Nearest Neighbor        1.88         0.042    1.02-3.46\n5-Nearest Neighbor        2.07         0.018    1.13-3.79\n7-Nearest Neighbor        2.25         0.008    1.23-4.11\n9-Nearest Neighbor        2.04         0.021    1.11-3.73\nHierarchical Clustering   1.86         0.042    1.02-3.46\n\nTable 3: Multivariate association of high risk predictions from different approaches using symbolic\nmismatch with major adverse cardiac events over a 90 day period following ACS. Multivariate\nresults adjusted for variables in Table 2.\n\nMethod                    HR    P Value  95% CI\nOne-Class SVM             1.36  0.038    1.01-1.79\n3-Nearest Neighbor        1.74  0.069    0.96-3.16\n5-Nearest Neighbor        1.57  0.142    0.86-2.88\n7-Nearest Neighbor        1.73  0.071    0.95-3.14\n9-Nearest Neighbor        1.89  0.034    1.05-3.41\nHierarchical Clustering   1.19  0.563    0.67-2.12\n\nTable 4: Univariate association of high risk predictions without the use of spectrum clipping. None\nof the approaches showed a statistically signi\ufb01cant association with the study endpoint in any of the\nmultivariate models including other clinical risk variables when spectrum clipping was not used.\n\n5.3 Effect of Spectrum Clipping\n\nWe also investigated the effect of spectrum clipping on the performance of our different risk strat-\ni\ufb01cation approaches. Table 4 presents the associations when spectrum clipping was not used. 
For\nall three methods, performance was worse without the use of spectrum clipping, although the effect\nwas small for the one-class SVM case.\n\n6 Related Work\n\nMost previous work on comparing signals in terms of their raw samples (e.g., metrics such as\ndynamic time warping, longest common subsequence, edit distance with real penalty, sequence\nweighted alignment, spatial assembling distance, threshold queries) [19] focuses on relatively short\ntime-series. This is due to the runtime of these methods (quadratic for many methods) and the need\nto reason in terms of the frequency and dynamics of higher-level signal constructs (as opposed to\nindividual samples) when studying systems over long periods.\nMost prior research on comparing long-term time-series focuses instead on extracting speci\ufb01c fea-\ntures from long-term signals and quantifying the differences between these features. In the context\nof cardiovascular disease, long-term ECG is often reduced to features (e.g., mean heart rate or heart\nrate variability) and compared in terms of these features. These approaches, unlike our symbolic\nmismatch based approaches, draw upon signi\ufb01cant a priori knowledge. Our belief was that for\napplications like risk stratifying patients for major cardiac events, focusing on a set of specialized\nfeatures leads to important information being potentially missed. In our work, we focus instead\non developing an approach that avoids use of signi\ufb01cant a priori knowledge by comparing the raw\nmorphology of long-term time-series. We propose doing this in a computationally ef\ufb01cient and\nsystematic way through symbolization. While this use of symbolization represents a lossy compres-\nsion of the original signal, the underlying DTW-based process of quantifying differences between\nlong-term time-series remains grounded in the comparison of raw morphology.\nSymbolization maps the comparison of long-term time-series into the domain of sequence compar-\nison. 
There is an extensive body of prior work focusing on the comparison of sequential or string\ndata. Algorithms based on measuring the edit distance between strings are widely used in disci-\nplines such as computational biology, but are typically restricted to comparisons of short sequences\nbecause of their computational complexity. Research on the use of pro\ufb01le hidden Markov models\n[20,21] to optimize recognition of binary labeled sequences is more closely related to our work. This\nwork focuses on learning the parameters of a hidden Markov model that can represent approxima-\ntions of sequences and can be used to score other sequences. Such approaches require large amounts\nof data or good priors to train the hidden Markov models. Computing forward and backward\nprobabilities from the Baum-Welch algorithm is also very computationally intensive. Other research in\nthis area focuses on mismatch tree-based kernels [22], which use the mismatch tree data structure\n[23] to quantify the difference between two sequences based on the approximate occurrence of \ufb01xed\nlength subsequences within them. Similar to this approach is work on using a \u201cbag of motifs\u201d rep-\nresentation [24], which provides a more \ufb02exible representation than \ufb01xed length subsequences but\nusually requires prior knowledge of motifs in the data [24].\nIn contrast to these efforts, we use a simple, computationally ef\ufb01cient approach to compare sym-\nbolic sequences without prior knowledge. Most importantly, our approach helps address the situ-\nation where symbolizing long-term time-series in a patient-speci\ufb01c manner leads to symbolic\nsequences drawn from different alphabets [25]. In this case, hidden Markov models, mismatch trees or a\n\u201cbag of motifs\u201d approach trained on one patient cannot be easily used to score the sequences for\nother patients. 
Instead, any comparative approach must maintain a hard or soft registration of sym-\nbols across individuals. Symbolic mismatch complements existing work on sequence comparison\nby using a measure that quanti\ufb01es differences across patients while retaining information on how\nthe symbols for these patients differ.\nFinally, we distinguish our work from earlier methods for ECG-based risk strati\ufb01cation. These meth-\nods typically calculate a particular pre-de\ufb01ned feature from the raw ECG signal, and use it to rank\npatients along a risk continuum. Our approach, focusing on detecting patients with high symbolic\nmismatch relative to other patients in the population, is orthogonal to the use of specialized high risk\nfeatures along two important dimensions. First, it does not require the presence of signi\ufb01cant prior\nknowledge. For cardiovascular care, we only assume that ECG signals from patients who are\nat high risk differ from those of the rest of the population. There are no speci\ufb01c assumptions about\nthe nature of these differences. Second, the ability to partition patients into groups with similar\nECG characteristics and potentially common risk pro\ufb01les may allow for a more \ufb01ne-grained\nunderstanding of how a patient\u2019s future health may evolve over time. Matching patients to past\ncases with similar ECG signals could lead to more accurate assignments of risk scores for particular\nevents such as death and recurring heart attacks.\n\n7 Discussion\n\nIn this paper, we described a novel unsupervised learning approach to cardiovascular risk strati\ufb01ca-\ntion that is complementary to existing clinical approaches.\nWe proposed using symbolic mismatch to quantify differences in long-term physiological time-\nseries. Our approach uses a symbolic transformation to measure changes in the morphology and\nfrequency of prototypical functional units observed over long periods in two signals. 
Symbolic\nmismatch avoids feature extraction and deals with inter-patient differences in a parameter-less way.\nWe also explored the hypothesis that high risk patients in a population can be identified as individuals\nwith anomalous long-term signals. We developed multiple comparative approaches to detect such\npatients, and evaluated these methods in a real-world application of risk stratification for major\nadverse cardiac events following ACS.\nOur results suggest that symbolic mismatch-based comparative approaches may have clinical utility\nin identifying high risk patients, and can provide information that is complementary to existing\nclinical risk variables. In particular, we note that hazard ratios of the magnitude we report are typically\nconsidered clinically meaningful. A different study of 118 variables, in a population of 15,000 post-ACS\npatients with 90-day follow-up similar to ours, did not find any variable with a hazard ratio greater than\n2.00 [1]. We observed a similar result in our patient population, where all of the existing clinical and\nECG risk variables had a hazard ratio less than 2.00. In contrast, our nearest neighbor-based\napproach achieved a hazard ratio of 2.28, even after being adjusted for existing risk measures.\nOur study has limitations. While our decision to compare the morphology and frequency of\nprototypical functional units leads to a measure that is computationally efficient on large volumes of\ndata, this process does not capture information related to the dynamics of these prototypical units.\nWe also note that all three of the comparative approaches investigated in our study focus only on\nidentifying patients who are anomalies. 
While we believe that symbolic mismatch may have further\nuse in supervised learning, this hypothesis needs to be evaluated more fully in future work.\n\nReferences\n\n[1] L.K. Newby, M.V. Bhapkar, H.D. White et al. (2003) Predictors of 90-day outcome in patients stabilized after\nacute coronary syndromes. Eur Heart J, 172-181.\n[2] C.P. Cannon, S. Husted, R.A. Harrington et al. (2007) Safety, Tolerability, and Initial Efficacy of AZD6140,\nthe First Reversible Oral Adenosine Diphosphate Receptor Antagonist, Compared With Clopidogrel, in Patients\nWith Non-ST-Segment Elevation Acute Coronary Syndrome Primary. J Am Coll Cardiol, 1844-1851.\n[3] World Health Organization. (2009) Cardiovascular Diseases Fact Sheet.\n[4] J. Mackay, G.A. Mensah, S. Mendis et al. (2004) The Atlas of Heart Disease and Stroke. WHO.\n[5] J.J. Bailey, A.S. Berson, H. Handelsman et al. (2001) Utility of current risk stratification tests for predicting\nmajor arrhythmic events after myocardial infarction. J Am Coll Cardiol, 1902-1911.\n[6] G. Lopera & A.B. Curtis. (2009) Risk stratification for sudden cardiac death: current approaches and\npredictive value. Curr Cardiol Rev, 56-64.\n[7] Z. Syed, J. Guttag & C. Stultz. (2007) Clustering and Symbolic Analysis of Cardiovascular Signals:\nDiscovery and Visualization of Medically Relevant Patterns in Long-Term Data Using Limited Prior Knowledge.\nEURASIP J Adv Sig Proc, 1-16.\n[8] P.S. Hamilton & W.J. Tompkins. (1986) Quantitative investigation of QRS detection rules using the\nMIT/BIH arrhythmia database. IEEE Trans Biomed Eng, 1157-1165.\n[9] W. Zong, G.B. Moody & D. Jiang. (2003) A robust open-source algorithm to detect onset and duration of\nQRS complexes. Comp Cardiol, 737-740.\n[10] S.H. Chang, F.H. Cheng, W. Hsu et al. (1997) Fast algorithm for point pattern matching: invariant to\ntranslations, rotations and scale changes. Pattern Recognition, 311-320.\n[11] W.W. Cohen & J. Richman. (2002) Learning to match and cluster large high-dimensional data sets for data\nintegration. In Proc. ACM SIGKDD, 475-480.\n[12] B. Scholkopf & A.J. Smola. (2002) Learning with Kernels. MIT Press.\n[13] Y. Chen, E.K. Garcia, M.R. Gupta et al. (2009) Similarity-based classification: concepts and algorithms.\nJMLR, 747-776.\n[14] G. Wu, E.Y. Chang & Z. Zhang. (2005) An analysis of transformation on non-positive semidefinite\nsimilarity matrix for kernel machines. Technical report, University of California, Santa Barbara.\n[15] B. Scholkopf, J.C. Platt, J. Shawe-Taylor et al. (2001) Estimating the support of a high-dimensional\ndistribution. Neural Computation, 1443-1471.\n[16] E. Eskin, A. Arnold, M. Prerau et al. (2002) A geometric framework for unsupervised anomaly detection.\nApp Data Mining Comp Secur, 1-20.\n[17] M.G. Shlipak, J.H. Ix, K. Bibbins-Domingo et al. (2008) Biomarkers to predict recurrent cardiovascular\ndisease: the Heart and Soul Study. JAMA, 50-57.\n[18] B.M. Scirica. (2010) Acute coronary syndrome: emerging tools for diagnosis and risk assessment. J Am\nColl Cardiol, 1403-1415.\n[19] H. Ding, G. Trajcevski, P. Scheuermann et al. (2008) Querying and mining of time series data:\nexperimental comparison of representations and distance measures. In Proc. VLDB, 1542-1552.\n[20] A. Krogh. (1994) Hidden Markov models for labeled sequences. In Proc. ICPR, 140-144.\n[21] T. Jaakkola, M. Diekhans & D. Haussler. (1999) Using the Fisher kernel method to detect remote protein\nhomologies. In Proc. ICISMB, 149-158.\n[22] C. Leslie, E. Eskin, J. Weston et al. (2003) Mismatch string kernels for SVM protein classification. In\nProc. NIPS, 1441-1448.\n[23] E. Eskin & P.A. Pevzner. (2002) Finding composite regulatory patterns in DNA sequences. Bioinformatics,\n354-363.\n[24] A. Ben-Hur & D. Brutlag. (2006) Sequence motifs: highly predictive features of protein function. Feature\nExtraction, 625-645.\n[25] Z. Syed, C. Stultz, M. Kellis et al. 
(2010) Motif discovery in physiological datasets: a methodology for\ninferring predictive elements. ACM Trans. Knowledge Discovery in Data, 1-23.", "award": [], "sourceid": 210, "authors": [{"given_name": "Zeeshan", "family_name": "Syed", "institution": null}, {"given_name": "John", "family_name": "Guttag", "institution": null}]}