{"title": "A Dirichlet Mixture Model of Hawkes Processes for Event Sequence Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 1354, "page_last": 1363, "abstract": "How to cluster event sequences generated via different point processes is an interesting and important problem in statistical machine learning.  To solve this problem, we propose and discuss an effective model-based clustering method based on a novel Dirichlet mixture model of a special but significant type of point processes --- Hawkes process.  The proposed model generates the event sequences with different clusters from the Hawkes processes with different parameters, and uses a Dirichlet process as the prior distribution of the clusters.  We prove the identifiability of our mixture model and propose an effective variational Bayesian inference algorithm to learn our model.  An adaptive inner iteration allocation strategy is designed to accelerate the convergence of our algorithm. Moreover, we investigate the sample complexity and the computational complexity  of our learning algorithm in depth.  Experiments on both synthetic and real-world data show that the clustering method based on our model can learn structural triggering patterns hidden in asynchronous event sequences robustly and achieve superior performance on clustering purity and consistency compared to existing methods.", "full_text": "A Dirichlet Mixture Model of Hawkes Processes for\n\nEvent Sequence Clustering\n\nHongteng Xu\u2217\nSchool of ECE\n\nGeorgia Institute of Technology\nhongtengxu313@gmail.com\n\nHongyuan Zha\n\nCollege of Computing\n\nGeorgia Institute of Technology\n\nzha@cc.gatech.edu\n\nAbstract\n\nHow to cluster event sequences generated via different point processes is an inter-\nesting and important problem in statistical machine learning. To solve this problem,\nwe propose and discuss an effective model-based clustering method based on a\nnovel Dirichlet mixture model of a special but signi\ufb01cant type of point processes \u2014\nHawkes process. The proposed model generates the event sequences with different\nclusters from the Hawkes processes with different parameters, and uses a Dirichlet\ndistribution as the prior distribution of the clusters. We prove the identi\ufb01ability\nof our mixture model and propose an effective variational Bayesian inference\nalgorithm to learn our model. An adaptive inner iteration allocation strategy is\ndesigned to accelerate the convergence of our algorithm. Moreover, we investigate\nthe sample complexity and the computational complexity of our learning algorithm\nin depth. Experiments on both synthetic and real-world data show that the clus-\ntering method based on our model can learn structural triggering patterns hidden\nin asynchronous event sequences robustly and achieve superior performance on\nclustering purity and consistency compared to existing methods.\n\n1\n\nIntroduction\n\nIn many practical situations, we need to deal with a huge amount of irregular and asynchronous\nsequential data. Typical examples include the viewing records of users in an IPTV system, the\nelectronic health records of patients in hospitals, among many others. All of these data are so-called\nevent sequences, each of which contains a series of events with different types in the continuous time\ndomain, e.g., when and which TV program a user watched, when and which care unit a patient is\ntransferred to. 
Given a set of event sequences, an important task is learning their clustering structure robustly. Event sequence clustering is meaningful for many practical applications. Take the previous two examples: clustering IPTV users according to their viewing records is beneficial to program recommendation and ads serving systems; clustering patients according to their health records helps hospitals optimize their medication resources.

Event sequence clustering is very challenging. Existing work mainly focuses on clustering synchronous (or aggregated) time series with discrete time-lagged observations [19, 23, 39]. Event sequences, on the contrary, live in the continuous time domain, so it is difficult to find a universal and tractable representation for them. A potential solution is constructing features of event sequences via parametric [22] or nonparametric [18] methods. However, these feature-based methods have a high risk of overfitting because of their large number of parameters. What is worse, these methods decompose the clustering problem into two phases, extracting features and learning clusters, so their clustering results are very sensitive to the quality of the learned (or predefined) features.

*Corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

To make concrete progress, we propose a Dirichlet Mixture model of Hawkes Processes (DMHP for short) and study its performance on event sequence clustering in depth. In this model, the event sequences belonging to different clusters are modeled via different Hawkes processes. The priors of the Hawkes processes' parameters are designed based on their physically-meaningful constraints. The prior of the clusters is generated via a Dirichlet distribution. We propose a variational Bayesian inference algorithm to learn the DMHP model in a nested Expectation-Maximization (EM) framework. In particular, we introduce a novel inner iteration allocation strategy into the algorithm with the help of open-loop control theory, which improves the convergence of the algorithm. We prove the local identifiability of our model and show that our learning algorithm has better sample complexity and computational complexity than its competitors.

The contributions of our work include: 1) We propose a novel Dirichlet mixture model of Hawkes processes and demonstrate its local identifiability. To our knowledge, it is the first systematic study of the identifiability problem in the task of event sequence clustering. 2) We apply an adaptive inner iteration allocation strategy based on open-loop control theory to our learning algorithm and show its superiority over other strategies. The proposed strategy achieves a trade-off between convergence performance and computational complexity. 3) We propose a DMHP-based clustering method. It requires few parameters, is robust to overfitting and model misspecification, and achieves encouraging clustering results.

2 Related Work

A temporal point process [4] is a random process whose realization consists of an event sequence $\{(t_i, c_i)\}_{i=1}^{M}$ with time stamps $t_i \in [0, T]$ and event types $c_i \in \mathcal{C} = \{1, ..., C\}$. It can be equivalently represented as $C$ counting processes $\{N_c(t)\}_{c=1}^{C}$, where $N_c(t)$ is the number of type-$c$ events occurring at or before time $t$.
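As a concrete illustration (not part of the original paper), the following is a minimal sketch of the counting-process representation; the function name and array layout are our own illustrative choices.

```python
import numpy as np

def counting_processes(times, types, num_types, t):
    """Evaluate N_c(t) for c = 1..C: the number of type-c events at or before time t.

    `times` and `types` encode an event sequence {(t_i, c_i)}; types are 1-based here.
    """
    times = np.asarray(times)
    types = np.asarray(types)
    return np.array([np.sum((types == c) & (times <= t))
                     for c in range(1, num_types + 1)])

# A toy sequence with C = 2 event types observed on [0, T].
N = counting_processes(times=[0.5, 1.2, 2.0], types=[1, 2, 1], num_types=2, t=1.5)
print(N)  # [1 1]: one type-1 and one type-2 event at or before t = 1.5
```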
A common way to characterize a point process is via its intensity function $\lambda_c(t) = \mathbb{E}[dN_c(t)\,|\,\mathcal{H}^{\mathcal{C}}_t]/dt$, where $\mathcal{H}^{\mathcal{C}}_t = \{(t_i, c_i)\,|\,t_i < t,\, c_i \in \mathcal{C}\}$ collects the historical events of all types before time $t$. It is the expected instantaneous rate of type-$c$ events given the history, which captures the phenomena of interest, e.g., self-triggering [13] or self-correcting [44].

Hawkes Processes. A Hawkes process [13] is a kind of point process for modeling complicated event sequences in which historical events have influence on current and future ones. It can also be viewed as a cascade of non-homogeneous Poisson processes [8, 34]. We focus on the clustering problem of event sequences obeying Hawkes processes because Hawkes processes have been proven useful for describing real-world data in many applications, e.g., financial analysis [1], social network analysis [3, 51], system analysis [22], and e-health [30, 42]. Hawkes processes have a particular form of intensity:
$$\lambda_c(t) = \mu_c + \sum_{c'=1}^{C} \int_0^t \phi_{cc'}(s)\, dN_{c'}(t-s), \qquad (1)$$
where $\mu_c$ is the exogenous base intensity independent of the history, while $\sum_{c'=1}^{C} \int_0^t \phi_{cc'}(s)\, dN_{c'}(t-s)$ is the endogenous intensity capturing the peer influence. The decay in the influence of historical type-$c'$ events on subsequent type-$c$ events is captured via the so-called impact function $\phi_{cc'}(t)$, which is nonnegative. A lot of existing work uses predefined impact functions with known parameters, e.g., the exponential functions in [29, 50] and the power-law functions in [49]. To enhance flexibility, a nonparametric model of the 1-D Hawkes process was first proposed in [16] based on an ordinary differential equation (ODE) and extended to the multi-dimensional case in [22, 51]. Another nonparametric model is the contrast function-based model in [30], which leads to a Least-Squares (LS) problem [7]. A Bayesian nonparametric model combining Hawkes processes with the infinite relational model is proposed in [3]. Recently, the basis representation of impact functions was used in [6, 15, 41] to avoid discretization.
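As an illustration of (1), the sketch below evaluates the intensity of a multivariate Hawkes process with exponential impact functions $\phi_{cc'}(t) = a_{cc'} e^{-wt}$; this parameterization is chosen for illustration only and is not the basis representation the paper adopts later.

```python
import numpy as np

def hawkes_intensity(t, times, types, mu, a, w):
    """Evaluate lambda_c(t) in Eq. (1) for all event types c at once.

    Illustrative parameterization: phi_{cc'}(s) = a[c, c'] * exp(-w * s), so each past
    event (t_i, c_i) with t_i < t contributes a[:, c_i] * exp(-w * (t - t_i)).
    Event types are 0-based indices here.
    """
    lam = mu.copy()                                    # exogenous base intensity mu_c
    for t_i, c_i in zip(times, types):
        if t_i < t:
            lam += a[:, c_i] * np.exp(-w * (t - t_i))  # endogenous (triggered) part
    return lam

mu = np.array([0.2, 0.1])                 # C = 2 event types
a = np.array([[0.3, 0.1], [0.2, 0.4]])    # infectivity coefficients a_{cc'}
print(hawkes_intensity(1.0, times=[0.3, 0.7], types=[0, 1], mu=mu, a=a, w=1.0))
```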
Sequential Data Clustering and Mixture Models. Traditional methods mainly focus on clustering synchronous (or aggregated) time series with discrete time-lagged variables [19, 23, 39]. These methods rely on probabilistic mixture models [46], extracting features from sequential data and then learning clusters via a Gaussian mixture model (GMM) [25, 28]. Recently, a mixture model of Markov chains was proposed in [21], which learns potential clusters from aggregate data. For asynchronous event sequences, most existing clustering methods are feature-based, clustering event sequences from learned or predefined features. Typical examples include the Gaussian process-based multi-task learning method in [18] and the multi-task multi-dimensional Hawkes processes in [22]. Focusing on Hawkes processes, the feature-based mixture models in [5, 17, 47] combine Hawkes processes with Dirichlet processes [2, 36]. However, these methods aim at modeling clusters of events or topics hidden in event sequences (i.e., sub-sequence clustering), and cannot learn clusters of event sequences. To our knowledge, model-based clustering methods for event sequences have rarely been considered.

3 Proposed Model

3.1 Dirichlet Mixture Model of Hawkes Processes

Given a set of event sequences $\mathcal{S} = \{s_n\}_{n=1}^{N}$, where $s_n = \{(t_i, c_i)\}_{i=1}^{M_n}$ contains a series of events $c_i \in \mathcal{C} = \{1, ..., C\}$ and their time stamps $t_i \in [0, T_n]$, we model them via a mixture model of Hawkes processes. According to the definition of the Hawkes process in (1), for an event sequence belonging to the $k$-th cluster, its intensity function of type-$c$ events at time $t$ is
$$\lambda_c^k(t) = \mu_c^k + \sum_{t_i<t} \phi_{cc_i}^k(t-t_i) = \mu_c^k + \sum_{t_i<t}\sum_{d=1}^{D} a_{cc_id}^k\, g_d(t-t_i), \qquad (2)$$
where $\mu^k = [\mu_c^k] \in \mathbb{R}_{+}^{C}$ is the exogenous base intensity of the $k$-th Hawkes process. Following the work in [41], we represent each impact function via basis functions as $\phi_{cc'}^k(t) = \sum_d a_{cc'd}^k\, g_d(t)$, where $g_d(t) \ge 0$ is the $d$-th basis function and $A^k = [a_{cc'd}^k] \in \mathbb{R}_{0+}^{C\times C\times D}$ is the coefficient tensor. Here we use Gaussian basis functions, and their number $D$ can be decided automatically using the basis selection method in [41].

In our mixture model, the probability of the appearance of an event sequence $s$ is
$$p(s;\Theta) = \sum_k \pi_k\, \mathrm{HP}(s|\mu^k, A^k), \quad \mathrm{HP}(s|\mu^k, A^k) = \prod_i \lambda_{c_i}^k(t_i)\exp\Big(-\sum_c \int_0^{T} \lambda_c^k(s)\,ds\Big). \qquad (3)$$
Here the $\pi_k$'s are the probabilities of the clusters and $\mathrm{HP}(s|\mu^k, A^k)$ is the conditional probability of the event sequence $s$ given the $k$-th Hawkes process, which follows the intensity function-based definition in [4]. Following the Bayesian graphical model, we regard the parameters of the Hawkes processes, $\{\mu^k, A^k\}$, as random variables. For the $\mu^k$'s, we consider their positiveness and assume that they obey $C \times K$ independent Rayleigh distributions. For the $A^k$'s, we consider their nonnegativeness and sparsity, as the work in [22, 41, 50] did, and assume that they obey $C \times C \times D \times K$ independent exponential distributions. The prior of the clusters is a Dirichlet distribution. Therefore, we can describe the proposed Dirichlet mixture model of Hawkes processes in a generative way as
$$\pi \sim \mathrm{Dir}(\alpha/K, ..., \alpha/K), \quad k|\pi \sim \mathrm{Category}(\pi), \quad \mu \sim \mathrm{Rayleigh}(B), \quad A \sim \mathrm{Exp}(\Sigma), \quad s|k, \mu, A \sim \mathrm{HP}(\mu^k, A^k).$$
Here $\mu = [\mu_c^k] \in \mathbb{R}_{+}^{C\times K}$ and $A = [a_{cc'd}^k] \in \mathbb{R}_{0+}^{C\times C\times D\times K}$ are the parameters of the Hawkes processes, and $\{B = [\beta_c^k], \Sigma = [\sigma_{cc'd}^k]\}$ are hyper-parameters. Denote the latent variables indicating the labels of clusters as a matrix $Z \in \{0,1\}^{N\times K}$. We can factorize the joint distribution of all variables as^2
$$p(\mathcal{S}, Z, \pi, \mu, A) = p(\mathcal{S}|Z,\mu,A)\,p(Z|\pi)\,p(\pi)\,p(\mu)\,p(A), \text{ where}$$
$$p(\mathcal{S}|Z,\mu,A) = \prod_{n,k}\mathrm{HP}(s_n|\mu^k,A^k)^{z_{nk}}, \quad p(Z|\pi) = \prod_{n,k}(\pi_k)^{z_{nk}}, \quad p(\pi)=\mathrm{Dir}(\pi|\alpha),$$
$$p(\mu) = \prod_{c,k}\mathrm{Rayleigh}(\mu_c^k|\beta_c^k), \quad p(A) = \prod_{c,c',d,k}\mathrm{Exp}(a_{cc'd}^k|\sigma_{cc'd}^k). \qquad (4)$$

^2 $\mathrm{Rayleigh}(x|\beta) = \frac{x}{\beta^2}e^{-x^2/(2\beta^2)}$, $\mathrm{Exp}(x|\sigma) = \frac{1}{\sigma}e^{-x/\sigma}$, $x \ge 0$.
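To make the generative process of (3)-(4) concrete, here is a minimal sketch (our own, not from the paper's released code) of sampling the priors and a cluster label; the hyper-parameter values are illustrative, and the final step of simulating an event sequence from the chosen Hawkes process is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, D, alpha = 3, 2, 4, 1.0

# Hyper-parameters B (Rayleigh scales) and Sigma (exponential means); values illustrative.
B = np.full((C, K), 0.5)
Sigma = np.full((C, C, D, K), 0.1)

# Generative process of the DMHP model.
pi = rng.dirichlet(np.full(K, alpha / K))   # pi ~ Dir(alpha/K, ..., alpha/K)
mu = rng.rayleigh(scale=B)                  # mu ~ Rayleigh(B), shape (C, K)
A = rng.exponential(scale=Sigma)            # A ~ Exp(Sigma), shape (C, C, D, K)
k = rng.choice(K, p=pi)                     # cluster label k | pi ~ Category(pi)

# An event sequence would then be drawn as s | k, mu, A ~ HP(mu^k, A^k), e.g. by
# applying Ogata's thinning algorithm to the intensity in Eq. (2).
print(pi, k, mu[:, k], A[..., k].shape)
```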
Our mixture model of Hawkes processes is different from the models in [5, 17, 47]. Those models focus on the sub-sequence clustering problem within an event sequence: their intensity function is a weighted sum of the intensity functions of different Hawkes processes. Our model, however, aims at finding the clustering structure across different sequences: the intensity of each event is generated via a single Hawkes process, while the likelihood of an event sequence is a mixture of likelihood functions from different Hawkes processes.

3.2 Local Identifiability

One of the most important questions about our mixture model is whether it is identifiable or not. According to the definition of the Hawkes process and the work in [26, 31], we can prove that our model is locally identifiable. The proof of the following theorem is given in the supplementary file.

Theorem 3.1. When the time of observation goes to infinity, the mixture model of Hawkes processes defined in (3) is locally identifiable, i.e., for each parameter point $\Theta = \mathrm{vec}\big(\big[\begin{smallmatrix}\pi_1 & \cdots & \pi_K \\ \theta_1 & \cdots & \theta_K\end{smallmatrix}\big]\big)$, where $\theta_k = \{\mu^k, A^k\} \in \mathbb{R}_{+}^{C} \times \mathbb{R}_{0+}^{C\times C\times D}$ for $k = 1, ..., K$, there exists an open neighborhood of $\Theta$ containing no other $\Theta'$ that makes $p(s;\Theta) = p(s;\Theta')$ hold for all possible $s$.

4 Proposed Learning Algorithm

4.1 Variational Bayesian Inference

Instead of using a purely MCMC-based learning method like [29], we propose an effective variational Bayesian inference algorithm to learn (4) in a nested EM framework. Specifically, we consider a variational distribution having the following factorization:
$$q(Z,\pi,\mu,A) = q(Z)\,q(\pi,\mu,A) = q(Z)\,q(\pi)\prod_k q(\mu^k)\,q(A^k). \qquad (5)$$
An EM algorithm can be used to optimize (5).

Update Responsibility (E-step). The logarithm of the optimized factor $q^*(Z)$ is approximated as
$$\begin{aligned}
\log q^*(Z) &= \mathbb{E}_\pi[\log p(Z|\pi)] + \mathbb{E}_{\mu,A}[\log p(\mathcal{S}|Z,\mu,A)] + C\\
&= \sum_{n,k} z_{nk}\big(\mathbb{E}[\log\pi_k] + \mathbb{E}[\log \mathrm{HP}(s_n|\mu^k,A^k)]\big) + C\\
&= \sum_{n,k} z_{nk}\Big(\mathbb{E}[\log\pi_k] + \mathbb{E}\Big[\sum_i \log\lambda_{c_i}^k(t_i) - \sum_c\int_0^{T_n}\lambda_c^k(s)\,ds\Big]\Big) + C\\
&\approx \sum_{n,k} z_{nk}\underbrace{\Big(\mathbb{E}[\log\pi_k] + \sum_i\Big(\log\mathbb{E}[\lambda_{c_i}^k(t_i)] - \frac{\mathrm{Var}[\lambda_{c_i}^k(t_i)]}{2\mathbb{E}^2[\lambda_{c_i}^k(t_i)]}\Big) - \sum_c\mathbb{E}\Big[\int_0^{T_n}\lambda_c^k(s)\,ds\Big]\Big)}_{\log\rho_{nk}} + C,
\end{aligned}$$
where $C$ is a constant and $\mathrm{Var}[\cdot]$ represents the variance of a random variable. Each term $\mathbb{E}[\log\lambda_c^k(t)]$ is approximated via its second-order Taylor expansion $\log\mathbb{E}[\lambda_c^k(t)] - \frac{\mathrm{Var}[\lambda_c^k(t)]}{2\mathbb{E}^2[\lambda_c^k(t)]}$ [37]. Then, the responsibility $r_{nk}$ is calculated as
$$r_{nk} = \mathbb{E}[z_{nk}] = \rho_{nk}\Big/\Big(\sum_j \rho_{nj}\Big). \qquad (6)$$
Denote $N_k = \sum_n r_{nk}$ for all $k$'s.

Update Parameters (M-step). The logarithm of the optimal factor $q^*(\pi,\mu,A)$ is
$$\log q^*(\pi,\mu,A) = \sum_k \log\big(p(\mu^k)\,p(A^k)\big) + \mathbb{E}_Z[\log p(Z|\pi)] + \log p(\pi) + \sum_{n,k} r_{nk}\log\mathrm{HP}(s_n|\mu^k,A^k) + C.$$
We can estimate the parameters of the Hawkes processes via
$$\hat{\mu}, \hat{A} = \arg\max_{\mu,A}\; \log\big(p(\mu)\,p(A)\big) + \sum_{n,k} r_{nk}\log\mathrm{HP}(s_n|\mu^k,A^k). \qquad (7)$$
Following the work in [41, 47, 50], we apply an EM algorithm to solve (7) iteratively. After getting the optimal $\hat{\mu}$ and $\hat{A}$, we update the distributions as
$$\Sigma^k = \hat{A}^k, \quad B^k = \sqrt{2/\pi}\,\hat{\mu}^k. \qquad (8)$$
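A minimal sketch of the E-step in (6), assuming the per-sequence, per-cluster scores $\log\rho_{nk}$ have already been computed from the expectations described above: the responsibilities are a row-wise softmax over clusters. The log-sum-exp normalization is our own numerical-stability choice, not something the paper specifies.

```python
import numpy as np

def responsibilities(log_rho):
    """E-step of Eq. (6): r_nk = rho_nk / sum_j rho_nj, computed stably in the log domain.

    log_rho[n, k] stands for log rho_nk, i.e. E[log pi_k] plus the (Taylor-corrected)
    expected log-likelihood of sequence n under the k-th Hawkes process.
    """
    m = log_rho.max(axis=1, keepdims=True)   # log-sum-exp trick to avoid overflow
    r = np.exp(log_rho - m)
    return r / r.sum(axis=1, keepdims=True)

log_rho = np.log([[0.2, 0.6], [0.5, 0.1]])   # toy scores for N = 2 sequences, K = 2
r = responsibilities(log_rho)
print(r, r.sum(axis=1))                      # each row sums to 1
N_k = r.sum(axis=0)                          # effective cluster sizes N_k used in the M-step
```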
Update the Number of Clusters K. When the number of clusters $K$ is unknown, we initialize $K$ randomly and update it in the learning phase. There are multiple methods to update the number of clusters. Regarding our Dirichlet distribution as a finite approximation of a Dirichlet process, we set a large initial $K$ as the truncation level. A simple empirical method is discarding empty clusters (i.e., $N_k = 0$) and merging clusters with $N_k$ smaller than a threshold $N_{\min}$ in the learning phase. Besides this, we can apply the MCMC in [11, 48] to update $K$ via merging or splitting clusters.

Repeating the three steps above, our algorithm maximizes the log-likelihood function (i.e., the logarithm of (4)) and obtains the optimal $\{\Sigma, B\}$ accordingly. Both the details of our algorithm and its computational complexity are given in the supplementary file.

[Figure 1: (a) Convergence curves (negative log-likelihood vs. number of inner iterations); (b) responsibility and ground truth. The data contain 200 event sequences generated via two 5-dimensional Hawkes processes. (a) Each curve is the average of 5 trials' results; in each trial, 100 inner iterations in total are applied; the increasing (decreasing) strategy changes the number of inner iterations from 2 to 8 (from 8 to 2), and the constant strategy fixes the number at 5. (b) The black line is the ground truth, the red dots are responsibilities after 15 inner iterations, and the red line is their average.]

4.2 Inner Iteration Allocation Strategy and Convergence Analysis

Our algorithm is in a nested EM framework, where the outer iteration corresponds to the loop of E-step and M-step and the inner iteration corresponds to the inner EM in the M-step. The runtime of our algorithm is linearly proportional to the total number of inner iterations. Given a fixed runtime (or total number of inner iterations), both the final achievable log-likelihood and the convergence behavior of the algorithm highly depend on how we allocate the inner iterations across the outer iterations. In this work, we test three inner iteration allocation strategies. The first strategy is heuristic: it fixes, increases, or decreases the number of inner iterations as the outer iteration progresses. Compared with the constant strategy, the increasing or decreasing strategy might improve the convergence of the algorithm [9]. The second strategy is based on open-loop control [27]: in each outer iteration, we compute the objective function via two methods respectively — updating parameters directly (i.e., continuing the current M-step and going to the next inner iteration) or first updating responsibilities and then updating parameters (i.e., going to a new loop of E-step and M-step and starting a new outer iteration). The parameters corresponding to the smaller negative log-likelihood are preserved; see the sketch below. The third strategy applies Bayesian optimization [33, 35] to choose the number of inner iterations per outer iteration via maximizing the expected improvement.
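The open-loop control strategy can be sketched as follows; `e_step`, `m_step_inner`, and `neg_log_lik` are placeholders for the updates described above, not functions from the released toolkit, so this is an assumed skeleton rather than the authors' implementation.

```python
def fit_open_loop(params, resp, data, total_inner_iters, e_step, m_step_inner, neg_log_lik):
    """Allocate inner (M-step) iterations by open-loop control: at each step, compare
    continuing the current M-step against starting a new outer iteration (E-step first),
    and keep whichever candidate yields the smaller negative log-likelihood."""
    for _ in range(total_inner_iters):
        # Option 1: one more inner iteration of the current M-step.
        p1 = m_step_inner(params, resp, data)
        # Option 2: refresh the responsibilities first (new outer iteration), then update.
        r2 = e_step(params, data)
        p2 = m_step_inner(params, r2, data)
        if neg_log_lik(p1, resp, data) <= neg_log_lik(p2, r2, data):
            params = p1                     # continue the current M-step
        else:
            params, resp = p2, r2           # start a new outer iteration
    return params, resp
```

One design consequence visible in this skeleton is the behavior reported below: early on, the E-step branch tends to win (responsibilities are still unreliable), so the strategy effectively assigns few inner iterations per outer iteration at the start and more later.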
We apply these strategies to a synthetic data set and visualize their impacts on the convergence of our algorithm in Fig. 1(a). The open-loop control strategy and the Bayesian optimization strategy obtain comparable convergence performance. They outperform the heuristic strategies (i.e., increasing, decreasing, or fixing the number of inner iterations per outer iteration), reducing the negative log-likelihood more rapidly and reaching a lower final value. Although they adjust the number of inner iterations via different methodologies, both strategies tend to increase the number of inner iterations as the outer iterations progress. In the beginning of the algorithm, the open-loop control strategy updates responsibilities frequently, and similarly, the Bayesian optimization strategy assigns a small number of inner iterations. The heuristic strategy that increases the number of inner iterations follows the same tendency, and is therefore only slightly worse than the open-loop control and the Bayesian optimization. The reason is that the estimated responsibilities are not reliable in the beginning; too many inner iterations at that stage can make the learning results fall into poor local optima.

Fig. 1(b) further verifies this explanation. With the help of the increasing strategy, most of the responsibilities converge to the ground truth with high confidence after just 15 inner iterations, because the responsibilities have been updated over 5 times. On the contrary, the responsibilities corresponding to the constant and decreasing strategies retain more uncertainty — many responsibilities are around 0.5 and far from the ground truth.

Based on the analysis above, the increasing allocation strategy indeed improves the convergence of our algorithm, and the open-loop control and Bayesian optimization strategies are superior to the other competitors. Because the computational complexity of open-loop control is much lower than that of Bayesian optimization, we apply the open-loop control strategy to our learning algorithm in the following experiments. The scheme of our learning algorithm and a more detailed convergence analysis can be found in the supplementary file.
4.3 Empirical Analysis of Sample Complexity

Focusing on the task of clustering event sequences, we investigate the sample complexity of our DMHP model and its learning algorithm. In particular, we want to show that the clustering method based on our model requires fewer samples than existing methods to identify clusters successfully. Among existing methods, the main competitor of our method is the clustering method based on the multi-task multi-dimensional Hawkes process (MMHP) model in [22]. It learns a specific Hawkes process for each sequence and clusters the sequences via applying the Dirichlet process Gaussian mixture model (DPGMM) [10, 28] to the parameters of the corresponding Hawkes processes.

Following the work in [14], we demonstrate the superiority of our DMHP-based clustering method through a comparison on the identifiability of minor clusters given a finite number of samples. Specifically, we consider a binary clustering problem with 500 event sequences. For the $k$-th cluster, $k = 1, 2$, $N_k$ event sequences are generated via a 1-dimensional Hawkes process with parameter $\theta_k = \{\mu_k, A_k\}$. Taking the parameter as a representation of the clustering center, we can calculate the distance between two clusters as $d = \|\theta_1 - \theta_2\|_2$. Assuming $N_1 < N_2$, we denote the first cluster as the "minor" cluster, whose sample percentage is $\pi_1 = \frac{N_1}{N_1 + N_2}$. Applying our DMHP model and its learning algorithm to data generated with different $d$'s and $\pi_1$'s, we can calculate the F1 scores of the minor cluster w.r.t. $\{d, \pi_1\}$. A high F1 score means that the minor cluster is identified with high accuracy. Fig. 2 visualizes the maps of F1 scores generated via different methods w.r.t. the number of events per sequence. We find that the F1 score obtained via our DMHP-based method is close to 1 in most situations. Its identifiable area (the yellow part) is consistently much larger than that of the MMHP+DPGMM method w.r.t. the number of events per sequence. The unidentifiable cases happen only in two situations: the parameters of different clusters are nearly equal (i.e., $d \to 0$), or the minor cluster is extremely small (i.e., $\pi_1 \to 0$). An enlarged version of Fig. 2 is given in the supplementary file.

[Figure 2: Comparisons for various methods on the F1 score of the minor cluster: (a) MMHP+DPGMM; (b) DMHP. Each panel maps F1 score over the distance between centers vs. the sample percentage of the minor cluster, for 20, 40, and 80 events per sequence.]
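The F1 score of the minor cluster in this protocol can be computed as below; this is a minimal sketch, and the convention that label 1 denotes the minor cluster is our own.

```python
import numpy as np

def minor_cluster_f1(y_true, y_pred):
    """F1 score of the minor cluster (label 1) for the binary protocol of Section 4.3."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # minor-cluster sequences found
    fp = np.sum((y_pred == 1) & (y_true == 0))   # major-cluster sequences mislabeled
    fn = np.sum((y_pred == 0) & (y_true == 1))   # minor-cluster sequences missed
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(minor_cluster_f1([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))  # 0.5
```

Since clustering labels are defined only up to permutation, in practice the learned cluster indices must first be matched to the ground-truth labels (e.g., by taking the assignment that maximizes the F1 score) before this metric is evaluated.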
5 Experiments

To demonstrate the feasibility and efficiency of our DMHP-based sequence clustering method, we compare it with state-of-the-art methods, including the vector auto-regressive (VAR) method [12], the Least-Squares (LS) method in [7], and the multi-task multi-dimensional Hawkes process (MMHP) in [22]. All three competitors first learn features of the sequences and then apply the DPGMM [10] to cluster them. The VAR method discretizes asynchronous event sequences into time series and learns transition matrices as features. Both the LS and the MMHP learn a specific Hawkes process for each event sequence. For each event sequence, we calculate its infectivity matrix $\Phi = [\phi_{cc'}]$, where the element $\phi_{cc'}$ is the integral of the impact function (i.e., $\int_0^\infty \phi_{cc'}(t)\,dt$), and use it as the feature.

For the synthetic data with clustering labels, we use clustering purity [24] to evaluate the various methods:
$$\mathrm{Purity} = \frac{1}{N}\sum_{k=1}^{K}\max_{j\in\{1,...,K'\}} |\mathcal{W}_k \cap \mathcal{C}_j|,$$
where $\mathcal{W}_k$ is the learned index set of sequences belonging to the $k$-th cluster, $\mathcal{C}_j$ is the real index set of sequences belonging to the $j$-th class, and $N$ is the total number of sequences.
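A minimal sketch of the purity metric just defined, with our own function name and label encoding:

```python
import numpy as np

def clustering_purity(pred, truth):
    """Clustering purity: each learned cluster W_k is matched to its majority
    ground-truth class C_j, and the matched counts are summed and divided by N."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0
    for k in np.unique(pred):
        _, counts = np.unique(truth[pred == k], return_counts=True)
        total += counts.max()            # |W_k intersect C_j| for the best-matching class j
    return total / len(pred)

print(clustering_purity([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))  # 0.6
```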
For the real-world data, we visualize the infectivity matrix of each cluster and measure the clustering consistency via a cross-validation method [38, 40]. The principle is simple: because random sampling does not change the clustering structure of the data, a clustering method with high consistency should preserve the pairwise relationships of samples across different trials. Specifically, we test each clustering method with $J$ (= 100) trials. In the $j$-th trial, the data are randomly divided into two folds. After learning the corresponding model from the training fold, we apply the method to the testing fold. We enumerate all pairs of sequences within a same cluster in the $j$-th trial and count the pairs preserved in all other trials. The clustering consistency is the minimum proportion of preserved pairs over all trials:
$$\mathrm{Consistency} = \min_{j\in\{1,..,J\}} \frac{\sum_{j'\neq j}\sum_{(n,n')\in\mathcal{M}_j} \mathbf{1}\{k_n^{j'} = k_{n'}^{j'}\}}{(J-1)\,|\mathcal{M}_j|},$$
where $\mathcal{M}_j = \{(n,n')\,|\,k_n^j = k_{n'}^j\}$ is the set of sequence pairs within the same cluster in the $j$-th trial, and $k_n^j$ is the cluster index of the $n$-th sequence in the $j$-th trial.

5.1 Synthetic Data

We generate two synthetic data sets with various clusters using sine-like impact functions and piecewise constant impact functions, respectively. In each data set, the number of clusters is set from 2 to 5. Each cluster contains 400 event sequences, and each event sequence contains 50 (= $M_n$) events and 5 (= $C$) event types. The elements of the exogenous base intensity are sampled uniformly from $[0, 1]$. Each sine-like impact function in the $k$-th cluster is formulated as $\phi_{cc'}^k(t) = b_{cc'}^k\big(1 - \cos(\omega_{cc'}^k(t - s_{cc'}^k))\big)$, where $\{b_{cc'}^k, s_{cc'}^k, \omega_{cc'}^k\}$ are sampled randomly from $[\frac{\pi}{5}, \frac{2\pi}{5}]$. Each piecewise constant impact function is the truncation of the corresponding sine-like impact function, i.e., $2b_{cc'}^k \times \mathrm{round}\big(\phi_{cc'}^k/(2b_{cc'}^k)\big)$.
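The impact-function construction above can be sketched as follows; the sampling ranges follow the text, while the function names and the (K, C, C) array layout are our own illustrative choices. Simulating the actual event sequences from these intensities (e.g., via Ogata's thinning algorithm) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 2, 5

# Sample sine-like impact-function parameters for each cluster; following the text,
# all three parameter sets are drawn from [pi/5, 2*pi/5].
b = rng.uniform(np.pi / 5, 2 * np.pi / 5, size=(K, C, C))
s = rng.uniform(np.pi / 5, 2 * np.pi / 5, size=(K, C, C))
w = rng.uniform(np.pi / 5, 2 * np.pi / 5, size=(K, C, C))

def phi_sine(t, k):
    """Sine-like impact functions phi^k_{cc'}(t) = b (1 - cos(w (t - s))), nonnegative."""
    return b[k] * (1.0 - np.cos(w[k] * (t - s[k])))

def phi_piecewise(t, k):
    """Piecewise constant truncation 2b * round(phi / (2b)), taking values in {0, 2b}."""
    return 2 * b[k] * np.round(phi_sine(t, k) / (2 * b[k]))

print(phi_sine(1.0, 0).shape, phi_piecewise(1.0, 0).max())  # (5, 5) impact matrices
```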
Table 1: Clustering Purity on Synthetic Data (C = 5).

                 Sine-like phi(t)                            Piecewise constant phi(t)
K    LS+DPGMM  VAR+DPGMM  MMHP+DPGMM  DMHP       LS+DPGMM  VAR+DPGMM  MMHP+DPGMM  DMHP
2    0.5639    0.5917     0.5235      0.9898     0.5589    0.5913     0.5222      0.8085
3    0.5278    0.3860     0.5565      0.9683     0.4402    0.3618     0.4517      0.7715
4    0.4365    0.5112     0.2894      0.9360     0.3365    0.3876     0.2901      0.7056
5    0.3980    0.2543     0.4656      0.9055     0.2980    0.2476     0.3245      0.6774

Table 1 shows the clustering purity of the various methods on the synthetic data. Compared with the three competitors, our DMHP obtains much better clustering purity consistently. The VAR method simply treats asynchronous event sequences as time series, which loses information such as the order of events and the time delays between adjacent events. Both the LS and the MMHP learn a Hawkes process for each individual sequence, which may suffer from overfitting when there are few events per sequence. These competitors decompose sequence clustering into two phases, learning features and then applying DPGMM, which is very sensitive to the quality of the features. The potential problems above lead to unsatisfying clustering results. Our DMHP method, in contrast, is model-based: it learns the clustering result directly and greatly reduces the number of unknown variables. As a result, our method avoids the problems of the three competitors and obtains superior clustering results. Additionally, the results on the synthetic data with piecewise constant impact functions show that our DMHP method is relatively robust to model misspecification — although our Gaussian basis cannot fit piecewise constant impact functions well, our method still outperforms the other methods by a large margin.

5.2 Real-world Data

We test our clustering method on two real-world data sets. The first is the ICU patient flow data used in [43], which is extracted from the MIMIC II data set [32]. This data set contains the transition processes of 30,308 patients among different kinds of care units; the patients can be clustered according to their transition processes. The second is the IPTV data set in [20, 22], which contains 7,100 IPTV users' viewing records collected by Shanghai Telecomm Inc. The TV programs are categorized into 16 classes, and viewing behaviors longer than 20 minutes are recorded. Similarly, the users can be clustered according to their viewing records. The event sequences in these two data sets have strong but structural triggering patterns, which can be modeled via different Hawkes processes.

Table 2: Clustering Consistency on Real-world Data.

Method         VAR+DPGMM   LS+DPGMM   MMHP+DPGMM   DMHP
ICU Patient    0.0901      0.1390     0.3313       0.3778
IPTV User      0.0443      0.0389     0.1382       0.2004

Table 2 shows the clustering consistency of the various methods. Our method clearly outperforms the others, which means that the clustering result obtained via our method is more stable and consistent. In Fig. 3 we visualize the comparison between our method and its main competitor, MMHP+DPGMM, on the ICU patient flow data. Fig. 3(a) shows the histograms of the number of clusters for the two methods. The MMHP+DPGMM method tends to over-segment the data into too many clusters. Our DMHP method, however, finds a more compact clustering structure: the distribution of the number of clusters concentrates at 6 and 19 for the two data sets, respectively. In our opinion, this phenomenon reflects the drawback of feature-based methods — the clustering performance is highly dependent on the quality of the features, while the clustering structure is not considered sufficiently in the feature-extraction phase. Taking the learned infectivity matrices as representations of the clusters, we compare our DMHP method with MMHP+DPGMM in Figs. 3(b) and 3(c). The infectivity matrices obtained by our DMHP are sparse and have distinguishable structure, while those obtained by MMHP+DPGMM are chaotic — although MMHP also applies a sparse regularizer to each event sequence's infectivity matrix, it cannot guarantee that the average of the infectivity matrices in a cluster is still sparse. The same phenomena can be observed in the experiments on the IPTV data. More experimental results are given in the supplementary file.

[Figure 3: Comparisons on the ICU patient flow data: (a) histogram of the number of clusters K for DMHP and MMHP; (b) infectivity matrices of the clusters learned by DMHP; (c) infectivity matrices of the clusters learned by MMHP+DPGMM.]

6 Conclusion and Future Work

In this paper, we propose and discuss a Dirichlet mixture model of Hawkes processes and achieve a model-based solution to event sequence clustering. We prove the identifiability of our model and analyze the convergence, sample complexity, and computational complexity of our learning algorithm. In terms of methodology, we plan to study other potential priors, e.g., the prior based on determinantal point processes (DPP) in [45], to improve the estimation of the number of clusters, and to further accelerate our learning algorithm by optimizing the inner iteration allocation strategy in the near future. Additionally, our model can be extended to a Dirichlet process mixture model when $K \to \infty$. In that case, we plan to apply Bayesian nonparametrics to develop new learning algorithms.
The source code can be found at https://github.com/HongtengXu/Hawkes-Process-Toolkit.

Acknowledgment

This work is supported in part by NSF IIS-1639792, IIS-1717916, and CMMI-1745382.

References

[1] E. Bacry, K. Dayri, and J.-F. Muzy. Non-parametric kernel estimation for symmetric Hawkes processes. Application to high frequency financial data. The European Physical Journal B, 85(5):1-12, 2012.

[2] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-143, 2006.

[3] C. Blundell, J. Beck, and K. A. Heller. Modelling reciprocating relationships with Hawkes processes. In NIPS, 2012.

[4] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure, volume 2. Springer Science & Business Media, 2007.

[5] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-Hawkes processes with applications to clustering continuous-time document streams. In KDD, 2015.

[6] N. Du, L. Song, M. Yuan, and A. J. Smola. Learning networks of heterogeneous influence. In NIPS, 2012.

[7] M. Eichler, R. Dahlhaus, and J. Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 2016.

[8] M. Farajtabar, N. Du, M. Gomez-Rodriguez, I. Valera, H. Zha, and L. Song. Shaping social activity by incentivizing users. In NIPS, 2014.

[9] G. H. Golub, Z. Zhang, and H. Zha. Large sparse symmetric eigenvalue problems with homogeneous linear constraints: the Lanczos process with inner-outer iterations. Linear Algebra and Its Applications, 309(1):289-306, 2000.

[10] D. Görür and C. E. Rasmussen. Dirichlet process Gaussian mixture models: Choice of the base distribution. Journal of Computer Science and Technology, 25(4):653-664, 2010.

[11] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, pages 711-732, 1995.

[12] F. Han and H. Liu. Transition matrix estimation in high dimensional time series. In ICML, 2013.

[13] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83-90, 1971.

[14] D. Kim. Mixture inference at the edge of identifiability. Ph.D. thesis, 2008.

[15] R. Lemonnier and N. Vayatis. Nonparametric Markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate Hawkes processes. In Machine Learning and Knowledge Discovery in Databases, pages 161-176, 2014.
[16] E. Lewis and G. Mohler. A nonparametric EM algorithm for multiscale Hawkes processes. Journal of Nonparametric Statistics, 2011.

[17] L. Li and H. Zha. Dyadic event attribution in social networks with mixtures of Hawkes processes. In CIKM, 2013.

[18] W. Lian, R. Henao, V. Rao, J. Lucas, and L. Carin. A multitask point process predictive model. In ICML, 2015.

[19] T. W. Liao. Clustering of time series data: a survey. Pattern Recognition, 38(11):1857-1874, 2005.

[20] D. Luo, H. Xu, H. Zha, J. Du, R. Xie, X. Yang, and W. Zhang. You are what you watch and when you watch: Inferring household structures from IPTV viewing data. IEEE Transactions on Broadcasting, 60(1):61-72, 2014.

[21] D. Luo, H. Xu, Y. Zhen, B. Dilkina, H. Zha, X. Yang, and W. Zhang. Learning mixtures of Markov chains from aggregate data with structural constraints. IEEE Transactions on Knowledge and Data Engineering, 28(6):1518-1531, 2016.

[22] D. Luo, H. Xu, Y. Zhen, X. Ning, H. Zha, X. Yang, and W. Zhang. Multi-task multi-dimensional Hawkes processes for modeling event sequences. In IJCAI, 2015.

[23] E. A. Maharaj. Cluster of time series. Journal of Classification, 17(2):297-314, 2000.

[24] C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, 2008.

[25] C. Maugis, G. Celeux, and M.-L. Martin-Magniette. Variable selection for clustering with Gaussian mixture models. Biometrics, 65(3):701-709, 2009.

[26] E. Meijer and J. Y. Ypma. A simple identification proof for a mixture of two univariate normal distributions. Journal of Classification, 25(1):113-123, 2008.

[27] B. A. Ogunnaike and W. H. Ray. Process Dynamics, Modeling, and Control. Oxford University Press, USA, 1994.

[28] C. E. Rasmussen. The infinite Gaussian mixture model. In NIPS, 1999.

[29] J. G. Rasmussen. Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability, 15(3):623-642, 2013.

[30] P. Reynaud-Bouret, S. Schbath, et al. Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5):2781-2822, 2010.

[31] T. J. Rothenberg. Identification in parametric models. Econometrica: Journal of the Econometric Society, pages 577-591, 1971.

[32] M. Saeed, C. Lieu, G. Raber, and R. G. Mark. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring. In Computers in Cardiology, 2002, pages 641-644. IEEE, 2002.

[33] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016.

[34] A. Simma and M. I. Jordan. Modeling events with cascades of Poisson processes. In UAI, 2010.

[35] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.

[36] R. Socher, A. L. Maas, and C. D. Manning. Spectral Chinese restaurant processes: Nonparametric clustering based on similarities. In AISTATS, 2011.

[37] Y. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In NIPS, 2006.
[38] R. Tibshirani and G. Walther. Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511-528, 2005.

[39] J. J. Van Wijk and E. R. Van Selow. Cluster and calendar based visualization of time series data. In IEEE Symposium on Information Visualization, 1999.

[40] U. Von Luxburg. Clustering Stability. Now Publishers Inc, 2010.

[41] H. Xu, M. Farajtabar, and H. Zha. Learning Granger causality for Hawkes processes. In ICML, 2016.

[42] H. Xu, D. Luo, and H. Zha. Learning Hawkes processes from short doubly-censored event sequences. In ICML, 2017.

[43] H. Xu, W. Wu, S. Nemati, and H. Zha. Patient flow prediction via discriminative learning of mutually-correcting processes. IEEE Transactions on Knowledge and Data Engineering, 29(1):157-171, 2017.

[44] H. Xu, Y. Zhen, and H. Zha. Trailer generation via a point process-based visual attractiveness model. In IJCAI, 2015.

[45] Y. Xu, P. Müller, and D. Telesca. Bayesian inference for latent biologic structure with determinantal point processes (DPP). Biometrics, 2016.

[46] S. J. Yakowitz and J. D. Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, pages 209-214, 1968.

[47] S.-H. Yang and H. Zha. Mixture of mutually exciting processes for viral diffusion. In ICML, 2013.

[48] Z. Zhang, K. L. Chan, Y. Wu, and C. Chen. Learning a multivariate Gaussian mixture model with the reversible jump MCMC algorithm. Statistics and Computing, 14(4):343-355, 2004.

[49] Q. Zhao, M. A. Erdogdu, H. Y. He, A. Rajaraman, and J. Leskovec. SEISMIC: A self-exciting point process model for predicting tweet popularity. In KDD, 2015.

[50] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, 2013.

[51] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional Hawkes processes. In ICML, 2013.
", "award": [], "sourceid": 878, "authors": [{"given_name": "Hongteng", "family_name": "Xu", "institution": "Duke University"}, {"given_name": "Hongyuan", "family_name": "Zha", "institution": "Georgia Tech"}]}