{"title": "The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process", "book": "Advances in Neural Information Processing Systems", "page_first": 6754, "page_last": 6764, "abstract": "Many events occur in the world. Some event types are stochastically excited or inhibited\u2014in the sense of having their probabilities elevated or decreased\u2014by patterns in the sequence of previous events. Discovering such patterns can help us predict which type of event will happen next and when. We model streams of discrete events in continuous time, by constructing a neurally self-modulating multivariate point process in which the intensities of multiple event types evolve according to a novel continuous-time LSTM. This generative model allows past events to influence the future in complex and realistic ways, by conditioning future event intensities on the hidden state of a recurrent neural network that has consumed the stream of past events. Our model has desirable qualitative properties. It achieves competitive likelihood and predictive accuracy on real and synthetic datasets, including under missing-data conditions.", "full_text": "The Neural Hawkes Process: A Neurally\nSelf-Modulating Multivariate Point Process\n\nHongyuan Mei\n\nJason Eisner\n\nDepartment of Computer Science, Johns Hopkins University\n\n3400 N. Charles Street, Baltimore, MD 21218 U.S.A\n\n{hmei,jason}@cs.jhu.edu\n\nAbstract\n\nMany events occur in the world. Some event types are stochastically excited or\ninhibited\u2014in the sense of having their probabilities elevated or decreased\u2014by\npatterns in the sequence of previous events. Discovering such patterns can help\nus predict which type of event will happen next and when. We model streams\nof discrete events in continuous time, by constructing a neurally self-modulating\nmultivariate point process in which the intensities of multiple event types evolve\naccording to a novel continuous-time LSTM. This generative model allows past\nevents to in\ufb02uence the future in complex and realistic ways, by conditioning future\nevent intensities on the hidden state of a recurrent neural network that has con-\nsumed the stream of past events. Our model has desirable qualitative properties.\nIt achieves competitive likelihood and predictive accuracy on real and synthetic\ndatasets, including under missing-data conditions.\n\nIntroduction\n\n1\nSome events in the world are correlated. A single event, or a pattern of events, may help to cause\nor prevent future events. We are interested in learning the distribution of sequences of events (and\nin future work, the causal structure of these sequences). The ability to discover correlations among\nevents is crucial to accurately predict the future of a sequence given its past, i.e., which events are\nlikely to happen next and when they will happen.\nWe speci\ufb01cally focus on sequences of discrete events in continuous time (\u201cevent streams\u201d). Model-\ning such sequences seems natural and useful in many applied domains:\n\n\u2022 Medical events. Each patient has a sequence of acute incidents, doctor\u2019s visits, tests, diag-\nnoses, and medications. By learning from previous patients what sequences tend to look\nlike, we could predict a new patient\u2019s future from their past.\n\u2022 Consumer behavior. Each online consumer has a sequence of online interactions. By mod-\neling the distribution of sequences, we can learn purchasing patterns. Buying cookies may\ntemporarily depress purchases of all desserts, yet increase the probability of buying milk.\n\u2022 \u201cQuanti\ufb01ed self\u201d data. Some individuals use cellphone apps to record their behaviors\u2014\neating, traveling, working, sleeping, waking. By anticipating behaviors, an app could per-\nform helpful supportive actions, including issuing reminders and placing advance orders.\n\u2022 Social media actions. Previous posts, shares, comments, messages, and likes by a set of\n\u2022 Other event streams arise in news, animal behavior, dialogue, music, etc.\n\nusers are predictive of their future actions.\n\nA basic model for event streams is the Poisson process (Palm, 1943), which assumes that events\noccur independently of one another. In a non-homogenous Poisson process, the (in\ufb01nitesimal)\nprobability of an event happening at time t may vary with t, but it is still independent of other events.\nA Hawkes process (Hawkes, 1971; Liniger, 2009) supposes that past events can temporarily raise\nthe probability of future events, assuming that such excitation is x positive, y additive over the past\nevents, and z exponentially decaying with time.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Drawing an event stream from a neural Hawkes process. An LSTM reads the sequence of past\nevents (polygons) to arrive at a hidden state (orange). That state determines the future \u201cintensities\u201d of the two\ntypes of events\u2014that is, their time-varying instantaneous probabilities. The intensity functions are continuous\nparametric curves (solid lines) determined by the most recent LSTM state, with dashed lines showing the\nsteady-state asymptotes that they would eventually approach. In this example, events of type 1 excite type 1\nbut inhibit type 2. Type 2 excites itself, and excites or inhibits type 1 according to whether the count of type 2\nevents so far is odd or even. Those are immediate effects, shown by the sudden jumps in intensity. The events\nalso have longer-timescale effects, shown by the shifts in the asymptotic dashed lines.\n\nHowever, real-world patterns often seem to violate these assumptions. For example, x is violated\nif one event inhibits another rather than exciting it: cookie consumption inhibits cake consump-\ntion. y is violated when the combined effect of past events is not additive. Examples abound: The\n20th advertisement does not increase purchase rate as much as the \ufb01rst advertisement did, and may\neven drive customers away. Market players may act based on their own complex analysis of market\nhistory. Musical note sequences follow some intricate language model that considers melodic tra-\njectory, rhythm, chord progressions, repetition, etc. z is violated when, for example, a past event\nhas a delayed effect, so that the effect starts at 0 and increases sharply before decaying.\nWe generalize the Hawkes process by determining the event intensities (instantaneous probabilities)\nfrom the hidden state of a recurrent neural network. This state is a deterministic function of the past\nhistory. It plays the same role as the state of a deterministic \ufb01nite-state automaton. However, the\nrecurrent network enjoys a continuous and in\ufb01nite state space (a high-dimensional Euclidean space),\nas well as a learned transition function. In our network design, the state is updated discontinuously\nwith each successive event occurrence and also evolves continuously as time elapses between events.\nOur main motivation is that our model can capture effects that the Hawkes process misses. The com-\nbined effect of past events on future events can now be superadditive, subadditive, or even subtrac-\ntive, and can depend on the sequential ordering of the past events. Recurrent neural networks already\ncapture other kinds of complex sequential dependencies when applied to language modeling\u2014that\nis, generative modeling of linguistic word sequences, which are governed by syntax, semantics, and\nhabitual usage (Mikolov et al., 2010; Sundermeyer et al., 2012; Karpathy et al., 2015). We wish to\nextend their success (Chelba et al., 2013) to sequences of events in continuous time.\nAnother motivation for a more expressive model than the Hawkes process is to cope with missing\ndata. Even in a domain where Hawkes might be appropriate, it is hard to apply Hawkes when\nsequences are only partially observed. Real datasets may systematically omit some types of events\n(e.g., illegal drug use, or of\ufb02ine purchases) which, in the true generative model, would have a strong\nin\ufb02uence on the future. They may also have stochastically missing data, where the missingness\nmechanism\u2014the probability that an event is not recorded\u2014can be complex and data-dependent\n(MNAR). In this setting, we can \ufb01t our model directly to the observation sequences, and use it to\npredict observation sequences that were generated in the same way (using the same complete-data\ndistribution and the same missingness mechanism). Note that if one knew the true complete-data\ndistribution\u2014perhaps Hawkes\u2014and the true missingness mechanism, one would optimally predict\nthe incomplete future from the incomplete past in Bayesian fashion, by integrating over possible\ncompletions (imputing the missing events and considering their in\ufb02uence on the future). Our hope\nis that the neural model is expressive enough that it can learn to approximate this true predictive\ndistribution.\nIts hidden state after observing the past should implicitly encode the Bayesian\nposterior, and its update rule for this hidden state should emulate the \u201cobservable operator\u201d that\nupdates the posterior upon each new observation. See Appendix A.4 for further discussion.\n\n2\n\nType-1Type-2Intensity-1Intensity-2BaseRate-1BaseRate-2LSTM-Unit\fA \ufb01nal motivation is that one might wish to intervene in a medical, economic, or social event stream\nso as to improve the future course of events. Appendix D discusses our plans to deploy our model\nfamily as an environment model within reinforcement learning, where an agent controls some events.\n\n2 Notation\n\n(cid:16) I(cid:88)\n\n(cid:17)\n\nWe are interested in constructing distributions over event streams (k1, t1), (k2, t2), . . ., where each\nki \u2208 {1, 2, . . . , K} is an event type and 0 < t1 < t2 < \u00b7\u00b7\u00b7 are times of occurrence.1 That is, there\nare K types of events, tokens of which are observed to occur in continuous time.\nFor any distribution P in our proposed family, an event stream is almost surely in\ufb01nite. However,\nwhen we observe the process only during a time interval [0, T ], the number I of observed events is\nalmost surely \ufb01nite. The log-likelihood (cid:96) of the model P given these I observations is\n\nlog P ((ki, ti) | Hi)\n\n+ log P (tI+1 > T | HI )\n\ni=1\n\n(1)\nwhere the history Hi is the pre\ufb01x sequence (k1, t1), (k2, t2), . . . , (ki\u22121, ti\u22121), and P ((ki, ti) | Hi)\nis the probability density that the next event occurs at time ti and has type ki.\nThroughout the paper, the subscript i usually denotes quantities that affect the distribution of the\nnext event (ki, ti). These quantities depend only on the history Hi.\nWe use (lowercase) Greek letters for parameters related to the classical Hawkes process, and Roman\nletters for other quantities, including hidden states and af\ufb01ne transformation parameters. We denote\nvectors by bold lowercase letters such as s and \u00b5, and matrices by bold capital Roman letters such as\nU. Subscripted bold letters denote distinct vectors or matrices (e.g., wk). Scalar quantities, includ-\ning vector and matrix elements such as sk and \u03b1j,k, are written without bold. Capitalized scalars\nrepresent upper limits on lowercase scalars, e.g., 1 \u2264 k \u2264 K. Function symbols are notated like\ntheir return type. All R \u2192 R functions are extended to apply elementwise to vectors and matrices.\n\n3 The Model\n\nIn this section, we \ufb01rst review Hawkes processes, and then introduce our model one step at a time.\nFormally, generative models of event streams are multivariate point processes. A (temporal) point\nprocess is a probability distribution over {0, 1}-valued functions on a given time interval (for us,\n[0,\u221e)). A multivariate point process is formally a distribution over K-tuples of such functions. The\nkth function indicates the times at which events of type k occurred, by taking value 1 at those times.\n\n3.1 Hawkes Process: A Self-Exciting Multivariate Point Process (SE-MPP)\n\nA basic model of event streams is the non-homogeneous multivariate Poisson process. It assumes\nthat an event of type k occurs at time t\u2014more precisely, in the in\ufb01nitesimally wide interval [t, t +\ndt)\u2014with probability \u03bbk(t)dt. The value \u03bbk(t) \u2265 0 can be regarded as a rate per unit time, just like\nthe parameter \u03bb of an ordinary Poisson process. \u03bbk is known as the intensity function, and the total\n\nintensity of all event types is given by \u03bb(t) =(cid:80)K\n\nk=1 \u03bbk(t).\n\nA well-known generalization that captures interactions is the self-exciting multivariate point pro-\ncess (SE-MPP), or Hawkes process (Hawkes, 1971; Liniger, 2009), in which past events h from\nthe history conspire to raise the intensity of each type of event. Such excitation is positive, additive\nover the past events, and exponentially decaying with time:\n\n\u03bbk(t) = \u00b5k +\n\n(2)\nwhere \u00b5k \u2265 0 is the base intensity of event type k, \u03b1j,k \u2265 0 is the degree to which an event of type\nj initially excites type k, and \u03b4j,k > 0 is the decay rate of that excitation. When an event occurs, all\nintensities are elevated to various degrees, but then will decay toward their base rates \u00b5.\n\nh:th<t\n\n\u03b1kh,k exp(\u2212\u03b4kh,k(t \u2212 th))\n\n(cid:88)\n\n1More generally, one could allow 0 \u2264 t1 \u2264 t2 \u2264 \u00b7\u00b7\u00b7 , where ti is a immediate event if ti\u22121 = ti and a\ndelayed event if ti\u22121 < ti. It is not too dif\ufb01cult to extend our model to assign positive probability to immediate\nevents, but we will disallow them here for simplicity.\n\n3\n\n\f3.2 Self-Modulating Multivariate Point Processes\n\nThe positivity constraints in the Hawkes process limit its expressivity. First, the positive interaction\nparameters \u03b1j,k fail to capture inhibition effects, in which past events reduce the intensity of future\nevents. Second, the positive base rates \u00b5 fail to capture the inherent inertia of some events, which are\nunlikely until their cumulative excitation by past events crosses some threshold. To remove such lim-\nitations, we introduce two self-modulating models. Here the intensities of future events are stochas-\ntically modulated by the past history, where the term \u201cmodulation\u201d is meant to encompass both exci-\ntation and inhibition. The intensity \u03bbk(t) can even \ufb02uctuate non-monotonically between successive\nevents, because the competing excitatory and inhibitory in\ufb02uences may decay at different rates.\n\n3.2.1 Hawkes Process with Inhibition: A Decomposable Self-Modulating MPP (D-SM-MPP)\n\nOur \ufb01rst move is to enrich the Hawkes model\u2019s expressiveness while still maintaining its decom-\nposable structure. We relax the positivity constraints on \u03b1j,k and \u00b5k, allowing them to range over\nR, which allows inhibition (\u03b1j,k < 0) and inertia (\u00b5k < 0). However, the resulting total activation\ncould now be negative. We therefore pass it through a non-linear transfer function fk : R \u2192 R+\nto obtain a positive intensity function as required:\n\n(cid:88)\n\n\u03bbk(t) = fk(\u02dc\u03bbk(t))\n\n(3a)\n\n\u02dc\u03bbk(t) = \u00b5k +\n\n\u03b1kh,k exp(\u2212\u03b4kh,k(t \u2212 th))\n\n(3b)\n\nh:th<t\n\nAs t increases between events, the intensity \u03bbk(t) may both rise and fall, but eventually approaches\nthe base rate f (\u00b5k+0), as the in\ufb02uence of each previous event still decays toward 0 at a rate \u03b4j,k > 0.\nWhat non-linear function fk should we use? The ReLU function f (x) = max(x, 0) is not strictly\npositive as required. A better choice is the scaled \u201csoftplus\u201d function f (x) = s log(1 + exp(x/s)),\nwhich approaches ReLU as s \u2192 0. We learn a separate scale parameter sk for each event type k,\nwhich adapts to the rate of that type. So we instantiate (3a) as \u03bbk(t) = fk(\u02dc\u03bbk(t)) = sk log(1 +\nexp(\u02dc\u03bbk(t)/sk)). Appendix A.1 graphs this and motivates the \u201csoftness\u201d and the scale parameter.\n\n3.2.2 Neural Hawkes Process: A Neurally Self-Modulating MPP (N-SM-MPP)\n\nOur second move removes the restriction that the past events have independent, additive in\ufb02uence\non \u02dc\u03bbk(t). Rather than predict \u02dc\u03bbk(t) as a simple summation (3b), we now use a recurrent neural\nnetwork. This allows learning a complex dependence of the intensities on the number, order, and\ntiming of past events. We refer to our model as a neural Hawkes process.\nJust as before, each event type k has an time-varying intensity \u03bbk(t), which jumps discontinuously\nat each new event, and then drifts continuously toward a baseline intensity. In the new process, how-\never, these dynamics are controlled by a hidden state vector h(t) \u2208 (\u22121, 1)D, which in turn depends\non a vector c(t) \u2208 RD of memory cells in a continuous-time LSTM.2 This novel recurrent neural\nnetwork architecture is inspired by the familiar discrete-time LSTM (Hochreiter and Schmidhuber,\n1997; Graves, 2012). The difference is that in the continuous interval following an event, each\nmemory cell c exponentially decays at some rate \u03b4 toward some steady-state value \u00afc.\nAt each time t > 0, we obtain the intensity \u03bbk(t) by (4a), where (4b) de\ufb01nes how the\nhidden states h(t) are continually obtained from the memory cells c(t) as the cells decay:\n\n\u03bbk(t) = fk(w(cid:62)\n\nk h(t))\n\n(4a)\n\nh(t) = oi (cid:12) (2\u03c3(2c(t)) \u2212 1) for t \u2208 (ti\u22121, ti]\n\n(4b)\nThis says that on the interval (ti\u22121, ti]\u2014in other words, after event i\u22121 up until event i occurs at\nsome time ti\u2014the h(t) de\ufb01ned by equation (4b) determines the intensity functions via equation (4a).\nSo for t in this interval, according to the model, h(t) is a suf\ufb01cient statistic of the history (Hi, t \u2212\nti\u22121) with respect to future events (see equation (1)). h(t) is analogous to hi in an LSTM language\nmodel (Mikolov et al., 2010), which summarizes the past event sequence k1, . . . , ki\u22121. But in our\ndecay architecture, it will also re\ufb02ect the interarrival times t1 \u2212 0, t2 \u2212 t1, . . . , ti\u22121 \u2212 ti\u22122, t\u2212 ti\u22121.\nThis interval (ti\u22121, ti] ends when the next event ki stochastically occurs at some time ti. At this\npoint, the continuous-time LSTM reads (ki, ti) and updates the current (decayed) hidden cells c(t)\nto new initial values ci+1, based on the current (decayed) hidden state h(ti).\n\n2We use one-layer LSTMs with D hidden units in our present experiments, but a natural extension is to use\n\nmulti-layer (\u201cdeep\u201d) LSTMs (Graves et al., 2013), in which case h(t) is the hidden state of the top layer.\n\n4\n\n\fHow does the continuous-time LSTM make those updates? Other than depending on decayed values\nh(ti), the update formulas resemble the discrete-time case:3\nii+1 \u2190 \u03c3 (Wiki + Uih(ti) + di)\n(5a)\nf i+1 \u2190 \u03c3 (Wf ki + Uf h(ti) + df )\n(5b)\nzi+1 \u2190 2\u03c3 (Wzki + Uzh(ti) + dz) \u2212 1 (5c)\noi+1 \u2190 \u03c3 (Woki + Uoh(ti) + do)\n(5d)\n\nci+1 \u2190 f i+1 (cid:12) c(ti) + ii+1 (cid:12) zi+1\n(6a)\n\u00afci+1 \u2190 \u00aff i+1 (cid:12) \u00afci + \u00af\u0131i+1 (cid:12) zi+1\n(6b)\n\u03b4i+1 \u2190 f (Wdki + Udh(ti) + dd) (6c)\n\nThe vector ki \u2208 {0, 1}K is the ith input: a one-hot encoding of the new event ki, with non-zero\nvalue only at the entry indexed by ki. The above formulas will make a discrete update to the LSTM\nstate. They resemble the discrete-time LSTM, but there are two differences. First, the updates do\nnot depend on the \u201cprevious\u201d hidden state from just after time ti\u22121, but rather its value h(ti) at time\nti, after it has decayed for a period of ti \u2212 ti\u22121. Second, equations (6b)\u2013(6c) are new. They de\ufb01ne\nhow in future, as t > ti increases, the elements of c(t) will continue to deterministically decay (at\ndifferent rates) from ci+1 toward targets \u00afci+1. Speci\ufb01cally, c(t) is given by (7), which continues to\ncontrol h(t) and thus \u03bbk(t) (via (4), except that i has now increased by 1).\n\nc(t) def= \u00afci+1 + (ci+1 \u2212 \u00afci+1) exp (\u2212\u03b4i+1 (t \u2212 ti)) for t \u2208 (ti, ti+1]\n\ni\n\n(7)\nIn short, not only does (6a) de\ufb01ne the usual cell values ci+1, but equation (7) de\ufb01nes c(t) on R>0.\nOn the interval (ti, ti+1], c(t) follows an exponential curve that begins at ci+1 (in the sense that\nc(t) = ci+1) and decays toward \u00afci+1 (which it would approach as t \u2192 \u221e, if extrapolated).\nlimt\u2192t+\nA schematic example is shown in Figure 1. As in the previous models, \u03bbk(t) drifts deterministically\nbetween events toward some base rate. But the neural version is different in three ways: x The\nbase rate is not a constant \u00b5k, but shifts upon each event.4 y The drift can be non-monotonic,\nbecause the excitatory and inhibitory in\ufb02uences on \u03bbk(t) from different elements of h(t) may decay\nat different rates. z The sigmoidal transfer function means that the behavior of h(t) itself is a little\nmore interesting than exponential decay. Suppose that ci is very negative but increases toward a\ntarget \u00afci > 0. Then h(t) will stay close to \u22121 for a while and then will rapidly rise past 0. This\nusefully lets us model a delayed response (e.g. the last green segment in Figure 1).\nWe point out two behaviors that are naturally captured by our LSTM\u2019s \u201cforget\u201d and \u201cinput\u201d gates:\n\u2022 if f i+1 \u2248 1 and ii+1 \u2248 0, then ci+1 \u2248 c(ti). So c(t) and h(t) will be continuous at ti.\n\u2022 if \u00aff i+1 \u2248 1 and \u00af\u0131i+1 \u2248 0, then \u00afci+1 \u2248 \u00afci. So although there may be a jump in activation,\n\nThere is no jump due to event i, though the steady-state target may change.\n\nit is temporary. The memory cells will decay toward the same steady states as before.\n\nAmong other bene\ufb01ts, this lets us \ufb01t datasets in which (as is common) some pairs of event types do\nnot in\ufb02uence one another. Appendix A.3 explains why all the models in this paper have this ability.\nThe drift of c(t) between events controls how the system\u2019s expectations about future events change\nas more time elapses with no event having yet occured. Equation (7) chooses a moderately \ufb02exible\nparametric form for this drift function (see Appendix D for some alternatives). Equation (6a) was\ndesigned so that c in an LSTM could learn to count past events with discrete-time exponential\ndiscounting; and (7) can be viewed as extending that to continuous-time exponential discounting.\nOur memory cell vector c(t) is a deterministic function of the past history (Hi, t \u2212 ti).5 Thus,\nthe event intensities at any time are also deterministic via equation (4). The stochastic part of the\nmodel is the random choice\u2014based on these intensities\u2014of which event happens next and when it\nhappens. The events are in competition: an event with high intensity is likely to happen sooner than\nan event with low intensity, and whichever one happens \ufb01rst is fed back into the LSTM. If no event\ntype has high intensity, it may take a long time for the next event to occur.\nTraining the model means learning the LSTM parameters in equations (5) and (6c) along with the\nother parameters mentioned in this section, namely sk \u2208 R and wk \u2208 RD for k \u2208 {1, 2, . . . , K}.\n\nU and d tensors. The \u00aff and \u00af\u0131 in equation (6b) are de\ufb01ned analogously to f and i but with different weights.\n\n3The upright-font subscripts i, f, z and o are not variables, but constant labels that distinguish different W,\n4Equations (4b) and (7) imply that after event i \u2212 1, the base rate jumps to fk(w(cid:62)(oi (cid:12) (2\u03c3(2\u00afci) \u2212 1))).\n5Appendix A.2 explains how our LSTM handles the start and end of the sequence.\n\n5\n\n\f4 Algorithms\nFor the proposed models, the log-likelihood (1) of the parameters turns out to be given by a simple\nformula\u2014the sum of the log-intensities of the events that happened, at the times they happened,\nminus an integral of the total intensities over the observation interval [0, T ]:\n\n(cid:88)\n\ni:ti\u2264T\n\n(cid:96) =\n\nlog \u03bbki(ti) \u2212\n\n(cid:90) T\n(cid:124)\n\nt=0\n\n(cid:123)(cid:122)\n\n\u03bb(t)dt\n\n(cid:125)\n\ncall this \u039b\n\n(8)\n\nThe full derivation is given in Appendix B.1. Intuitively, the \u2212\u039b term (which is \u2264 0) sums the\nlog-probabilities of in\ufb01nitely many non-events. Why? The probability that there was not an event\nof any type in the in\ufb01nitesimally wide interval [t, t + dt) is 1 \u2212 \u03bb(t)dt, whose log is \u2212\u03bb(t)dt.\nWe can locally maximize (cid:96) using any stochastic gradient method. A detailed recipe is given in Ap-\npendix B.2, including the Monte Carlo trick we use to handle the integral in equation (8).\nIf we wish to draw random sequences from the model, we can adopt the thinning algorithm (Lewis\nand Shedler, 1979; Liniger, 2009) that is commonly used for the Hawkes process. See Appendix B.3.\nGiven an event stream pre\ufb01x (k1, t1), (k2, t2), . . . , (ki\u22121, ti\u22121), we may wish to predict the time\nand type of the single next event. The next event\u2019s time ti has density pi(t) = P (ti = t | Hi) =\n. To predict a single time whose expected L2 loss is as low as possible,\n\u03bb(t) exp\ntpi(t)dt. Given the next event time ti, the most likely\ntype would be argmaxk \u03bbk(ti)/\u03bb(ti), but the most likely next event type without knowledge of ti\nis \u02c6ki = argmaxk\n\u03bbk(t)\n\u03bb(t) pi(t)dt. The integrals in the preceding equations can be estimated by\nMonte Carlo sampling much as before (Appendix B.2). For event type prediction, we recommend\na paired comparison that uses the same sample of t values for each k in the argmax; this reduces\nsampling variance and also lets us share the \u03bb(t) and pi(t) computations across all k.\n\nwe should choose \u02c6ti = E[ti | Hi] = (cid:82) \u221e\n\n(cid:16)\u2212(cid:82) t\n\n\u03bb(s)ds\n\n(cid:82) \u221e\n\nti\u22121\n\n(cid:17)\n\nti\u22121\n\nti\u22121\n\n5 Related Work\nThe Hawkes process has been widely used to model event streams, including for topic modeling and\nclustering of text document streams (He et al., 2015; Du et al., 2015a), constructing and inferring\nnetwork structure (Yang and Zha, 2013; Choi et al., 2015; Etesami et al., 2016), personalized rec-\nommendations based on users\u2019 temporal behavior (Du et al., 2015b), discovering patterns in social\ninteraction (Guo et al., 2015; Lukasik et al., 2016), learning causality (Xu et al., 2016), and so on.\nRecent interest has focused on expanding the expressivity of Hawkes processes. Zhou et al. (2013)\ndescribe a self-exciting process that removes the assumption of exponentially decaying in\ufb02uence\n(as we do). They replace the scaled-exponential summands in equation (2) with learned positive\nfunctions of time (the choice of function again depends on ki, k). Lee et al. (2016) generalize the\nconstant excitation parameters \u03b1j,k to be stochastic, which increases expressivity. Our model also\nallows non-constant interactions between event types, but arranges these via deterministic, instead of\nstochastic, functions of continuous-time LSTM hidden states. Wang et al. (2016) consider non-linear\neffects of past history on the future, by passing the intensity functions of the Hawkes process through\na non-parametric isotonic link function g, which is in the same place as our non-linear function fk. In\ncontrast, our fk has a \ufb01xed parametric form (learning only the scale parameter), and is approximately\nlinear when x is large. This is because we model non-linearity (and other complications) with a\ncontinuous-time LSTM, and use fk only to ensure positivity of the intensity functions.\nDu et al. (2016) independently combined Hawkes processes with recurrent neural networks (and\nXiao et al. (2017a) propose an advanced way of estimating the parameters of that model). How-\never, Du et al.\u2019s architecture is different in several respects. They use standard discrete-time LSTMs\nwithout our decay innovation, so they must encode the intervals between past events as explicit nu-\nmerical inputs to the LSTM. They have only a single intensity function \u03bb(t), and it simply decays\nexponentially toward 0 between events, whereas our more modular model creates separate (poten-\ntially transferrable) functions \u03bbk(t), each of which allows complex and non-monotonic dynamics\nen route to a non-zero steady state intensity. Some structural limitations of their design are that ti\nand ki are conditionally independent given h (they are determined by separate distributions), and\nthat their model cannot avoid a positive probability of extinction at all times. Finally, since they take\n\n6\n\n\ff = exp, the effect of their hidden units on intensity is effectively multiplicative, whereas we take\nf = softplus to get an approximately additive effect inspired by the classical Hawkes process. Our\nrationale is that additivity is useful to capture independent (disjunctive) causes; at the same time, the\nhidden units that our model adds up can each capture a complex joint (conjunctive) cause.\n\n6 Experiments6\n\nWe \ufb01t our various models on several simulated and real-world datasets, and evaluated them in each\ncase by the log-probability that they assigned to held-out data. We also compared our approach with\nthat of Du et al. (2016) on their prediction task. The datasets that we use in this paper range from\none extreme with only K = 2 event types but mean sequence length > 2000, to the other extreme\nwith K = 5000 event types but mean sequence length 3. Dataset details can be found in Table 1\nin Appendix C.1. Training details (e.g., hyperparameter selection) can be found in Appendix C.2.\n\n6.1 Synthetic Datasets\nIn a pilot experiment with synthetic data (Appendix C.4), we con\ufb01rmed that the neural Hawkes\nprocess generates data that is not well modeled by training an ordinary Hawkes process, but that\nordinary Hawkes data can be successfully modeled by training an neural Hawkes process.\nIn this experiment, we were not limited to measuring the likelihood of the models on the stochastic\nevent sequences. We also knew the true latent intensities of the generating process, so we were able\nto directly measure whether the trained models predicted these intensities accurately. The pattern of\nresults was similar.\n\n6.2 Real-World Media Datasets\nRetweets Dataset (Zhao et al., 2015). On Twitter, novel tweets are generated from some distribu-\ntion, which we do not model here. Each novel tweet serves as the beginning-of-stream event (see\nAppendix A.2) for a subsequent stream of retweet events. We model the dynamics of these streams:\nhow retweets by various types of users (K = 3) predict later retweets by various types of users.\nDetails of the dataset and its preparation are given in Appendix C.5. The dataset is interesting for its\ntemporal pattern. People like to retweet an interesting post soon after it is created and retweeted by\nothers, but may gradually lose interest, so the intervals between retweets become longer over time.\nIn other words, the stream begins in a self-exciting state, in which previous retweets increase the\nintensities of future retweets, but eventually interest dies down and events are less able to excite one\nanother. The decomposable models are essentially incapable of modeling such a phase transition,\nbut our neural model should have the capacity to do so.\nWe generated learning curves (Figure 2) by training our models on increasingly long pre\ufb01xes of\nthe training set. As we can see, our self-modulating processes signi\ufb01cantly outperform the Hawkes\nprocess at all training sizes. There is no obvious a priori reason to expect inhibition or even inertia\nin this application domain, which explains why the D-SM-MPP makes only a small improvement\nover the Hawkes process when the latter is well-trained. But D-SM-MPP requires much less data,\nand also has more stable behavior (smaller error bars) on small datasets. Our neural model is even\nbetter. Not only does it do better on the average stream, but its consistent superiority over the other\ntwo models is shown by the per-stream scatterplots in Figure 3, demonstrating the importance of our\nmodel\u2019s neural component even with large datasets.\n\nMemeTrack Dataset (Leskovec and Krevl, 2014). This dataset is similar in conception to\nRetweets, but with many more event types (K = 5000). It considers the reuse of \ufb01xed phrases,\nor \u201cmemes,\u201d in online media. It contains time-stamped instances of meme use in articles and posts\nfrom 1.5 million different blogs and news sites. We model how the future occurrence of a meme is\naffected by its past trajectory across different websites\u2014that is, given one meme\u2019s past trajectory\nacross websites, when and where it will be mentioned again.\nOn this dataset,7 the advantage of our full neural models was dramatic, yielding cross-entropy per\nevent of around \u22128 relative to the \u221215 of D-SM-MPP\u2014which in turn is far above the \u2212800 of the\n\n6Our code and data are available at https://github.com/HMEIatJHU/neurawkes.\n7Data preparation details are given in Appendix C.6.\n\n7\n\n\fFigure 2: Learning curve (with 95% error bars) of all three models on the Retweets (left two) and MemeTrack\n(right two) datasets. Our neural model signi\ufb01cantly outperforms our decomposable model (right graph of each\npair), and both signi\ufb01cantly outperform the Hawkes process (left of each pair\u2014same graph zoomed out).\n\nFigure 3: Scatterplots of N-SM-MPP vs. SE-MPP (left) and N-SM-MPP\nvs. D-SM-MPP (right), comparing the held-out log-likelihood of the\ntwo models (when trained on our full Retweets training set) with respect\nto each of the 2000 test sequences. Nearly all points fall to the right of\ny = x, since N-SM-MPP (the neural Hawkes process) is consistently\nmore predictive than our non-neural model and the Hawkes process.\n\nFigure 4: Scatterplot of N-SM-\nMPP vs. SE-MPP, comparing\ntheir log-likelihoods with respect\nto each of the 31 incomplete se-\nquences\u2019 test sets. All 31 points\nfall to the right of y = x.\n\nHawkes process. Figure 2 illustrates the persistent gaps among the models. A scatterplot similar\nto Figure 3 is given in Figure 13 of Appendix C.6. We attribute the poor performance of the Hawkes\nprocess to its failure to capture the latent properties of memes, such as their topic, political stance,\nor interestingness. This is a form of missing data (section 1), as we now discuss.\nAs the table in Appendix C.1 indicates, most memes in MemeTrack are uninteresting and give rise\nto only a short sequence of mentions. Thus the base mention probability is low. An ideal analysis\nwould recognize that if a speci\ufb01c meme has been mentioned several times already, it is a posteriori\ninteresting and will probably be mentioned in future as well. The Hawkes process cannot distinguish\nthe interesting memes from the others, except insofar as they appear on more in\ufb02uential websites.\nBy contrast, our D-SM-MPP can partly capture this inferential pattern by using negative base rates\n\u00b5 to create \u201cinertia\u201d (section 3.2.1). Indeed, all 5000 of its learned \u00b5k parameters were negative,\nwith values ranging from \u221210 to \u221230, which numerically yields 0 intensity and is hard to excite.\nAn ideal analysis would also recognize that if a speci\ufb01c meme has appeared mainly on conservative\nwebsites, it is a posteriori conservative and unlikely to appear on liberal websites in the future. The\nD-SM-MPP, unlike the Hawkes process, can again partly capture this, by having conservative web-\nsites inhibit liberal ones. Indeed, 24% of its learned \u03b1 parameters were negative. (We re-emphasize\nthat this inhibition is merely a predictive effect\u2014probably not a direct causal mechanism.)\nAnd our N-SM-MPP process is even more powerful. The LSTM state aims to learn suf\ufb01cient statis-\ntics for predicting the future, so it can learn hidden dimensions (which fall in (\u22121, 1)) that encode\nuseful posterior beliefs in boolean properties of the meme such as interestingness, conservativeness,\ntimeliness, etc. The LSTM\u2019s \u201clong short-term memory\u201d architecture explicitly allows these beliefs\nto persist inde\ufb01nitely through time in the absence of new evidence, without having to be refreshed\nby redundant new events as in the decomposable models. Also, the LSTM\u2019s hidden dimensions\nare computed by sigmoidal activation rather than softplus activation, and so can be used implicitly\nto perform logistic regression. The \ufb02at left side of the sigmoid resembles softplus and can model\ninertia as we saw above: it takes several mentions to establish interestingness. Symmetrically, the\n\ufb02at right side can model saturation: once the posterior probability of interestingness is at 80%, it\ncannot climb much farther no matter how many more mentions are observed.\nA \ufb01nal potential advantage of the LSTM is that in this large-K setting, it has fewer parameters\nthan the other models (Appendix C.3), sharing statistical strength across event types (websites) to\ngeneralize better. The learning curves in Figure 2 suggest that on small data, the decomposable\n\n8\n\n125250500100020004000800016000number of training sequences403020100log-likelihood per eventN-SM-MPPD-SM-MPPSE-MPP125250500100020004000800016000number of training sequences98765log-likelihood per eventN-SM-MPPD-SM-MPPSE-MPP10002000400080001600032000number of training sequences2000150010005000500log-likelihood per eventN-SM-MPPD-SM-MPPSE-MPP400080001600032000number of training sequences605040302010010log-likelihood per eventN-SM-MPPD-SM-MPP10864202N-SM-MPP10864202SE-MPP10864202N-SM-MPP10864202D-SM-MPP2.01.81.61.41.21.0N-SM-MPP2.01.81.61.41.21.0SE-MPP\f(non-neural) models may over\ufb01t their O(K 2) interaction parameters \u03b1j,k. Our neural model only\nhas to learn O(D2) pairwise interactions among its D hidden nodes (where D (cid:28) K), as well as\nO(KD) interactions between the hidden nodes and the K event types. In this case, K = 5000 but\nD = 64. This reduction by using latent hidden nodes is analogous to nonlinear latent factor analysis.\n\n6.3 Modeling Streams With Missing Data\nWe set up an arti\ufb01cial experiment to more directly investigate the missing-data setting of section 1,\nwhere we do not observe all events during [0, T ], but train and test our model just as if we had.\nWe sampled synthetic event sequences from a standard Hawkes process (just as in our pilot exper-\niment from 6.1), removed all the events of selected types, and then compared the neural Hawkes\nprocess (N-SM-MPP) with the Hawkes process (SE-MPP) as models of these censored sequences.\nSince we took K = 5, there were 25 \u2212 1 = 31 ways to construct a dataset of censored sequences.\nAs shown in Figure 4, for each of the 31 resulting datasets, training a neural Hawkes model achieves\nbetter generalization. Appendix A.4 discusses why this kind of behavior is to be expected.\n\n6.4 Prediction Tasks\u2014Medical, Social and Financial\nTo compare with Du et al. (2016), we evaluate our model on the prediction tasks and datasets that\nthey proposed. The Financial Transaction dataset contains long streams of high frequency stock\ntransactions for a single stock, with the two event types \u201cbuy\u201d and \u201csell.\u201d The electrical medical\nrecords (MIMIC-II) dataset is a collection of de-identi\ufb01ed clinical visit records of Intensive Care\nUnit patients for 7 years. Each patient has a sequence of hospital visit events, and each event records\nits time stamp and disease diagnosis. The Stack Over\ufb02ow dataset represents two years of user awards\non a question-answering website: each user received a sequence of badges (of 22 different types).\nWe follow Du et al. (2016) and attempt to predict every held-out event (ki, ti) from its history Hi,\nevaluating the prediction \u02c6ki with 0-1 loss (yielding an error rate, or ER) and evaluating the prediction\n\u02c6ti with L2 loss (yielding a root-mean-squared error, or RMSE). We make minimum Bayes risk\npredictions as explained in section 4. Figure 8 in Appendix C.7 shows that our model consistently\noutperforms that of Du et al. (2016) on event type prediction on all the datasets, although for time\nprediction neither model is consistently better.\n\n6.5 Sensitivity to Number of Parameters\nDoes our method do well because of its \ufb02exible nonlinearities or just because it has more parameters?\nThe answer is both. We experimented on the Retweets data with reducing the number of hidden\nunits D. Our N-SM-MPP substantially outperformed SE-MPP (the Hawkes process) on held-out\ndata even with very few parameters, although more parameters does even better:\nnumber of hidden units Hawkes\n21\nnumber of parameters\nlog-likelihood\n-7.19\nWe also tried halving D across several datasets, which had negligible effect, always decreasing\nheld-out log-likelihood by < 0.2% relative.\nMore information about model sizes is given in Appendix C.3. Note that the neural Hawkes process\ndoes not always have more parameters. When K is large, we can greatly reduce the number of\nparams below that of a Hawkes process, by choosing D (cid:28) K, as for MemeTrack in section 6.2.\n\n256\n921091\n-6.10\n\n32\n14787\n-6.16\n\n2\n87\n-6.41\n\n4\n283\n-6.36\n\n1\n31\n-6.51\n\n8\n1011\n-6.24\n\n16\n3811\n-6.18\n\n7 Conclusion\nWe presented two extensions to the multivariate Hawkes process, a popular generative model of\nstreams of typed, timestamped events. Past events may now either excite or inhibit future events.\nThey do so by sequentially updating the state of a novel continuous-time recurrent neural network\n(LSTM). Whereas Hawkes sums the time-decaying in\ufb02uences of past events, we instead sum the\ntime-decaying in\ufb02uences of the LSTM nodes. Our extensions to Hawkes aim to address real-world\nphenomena, missing data, and causal modeling. Empirically, we have shown that both extensions\nyield a signi\ufb01cantly improved ability to predict the course of future events. There are several excit-\ning avenues for further improvements (discussed in Appendix D), including embedding our model\nwithin a reinforcement learner to discover causal structure and learn an intervention policy.\n\n9\n\n\fAcknowledgments\n\nWe are grateful to Facebook for enabling this work through a gift to the second author. Nan Du\nkindly helped us by making his code public and answering questions, and the NVIDIA Corporation\nkindly donated two Titan X Pascal GPUs. We also thank our lab group at Johns Hopkins University\u2019s\nCenter for Language and Speech Processing for helpful comments. The \ufb01rst version of this work\nappeared on arXiv in December 2016.\n\nReferences\nCiprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony\nRobinson. One billion word benchmark for measuring progress in statistical language modeling.\nComputing Research Repository, arXiv:1312.3005, 2013. URL http://arxiv.org/abs/\n1312.3005.\n\nEdward Choi, Nan Du, Robert Chen, Le Song, and Jimeng Sun. Constructing disease network and\ntemporal progression model via context-sensitive Hawkes process. In Data Mining (ICDM), 2015\nIEEE International Conference on, pages 721\u2013726. IEEE, 2015.\n\nNan Du, Mehrdad Farajtabar, Amr Ahmed, Alexander J Smola, and Le Song. Dirichlet-Hawkes\nprocesses with applications to clustering continuous-time document streams. In Proceedings of\nthe 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\npages 219\u2013228. ACM, 2015a.\n\nNan Du, Yichen Wang, Niao He, Jimeng Sun, and Le Song. Time-sensitive recommendation from\nrecurrent user activities. In Advances in Neural Information Processing Systems (NIPS), pages\n3492\u20133500, 2015b.\n\nNan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song.\nRecurrent marked temporal point processes: Embedding event history to vector. In Proceedings\nof the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\npages 1555\u20131564. ACM, 2016.\n\nJalal Etesami, Negar Kiyavash, Kun Zhang, and Kushagra Singhal. Learning network of multivariate\n\nHawkes processes: A time series approach. arXiv preprint arXiv:1603.04319, 2016.\n\nManuel Gomez Rodriguez, Jure Leskovec, and Bernhard Sch\u00a8olkopf. Structure and dynamics of\ninformation pathways in online media. In Proceedings of the Sixth ACM International Conference\non Web Search and Data Mining, pages 23\u201332. ACM, 2013.\n\nAlex Graves. Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.\n\nURL http://www.cs.toronto.edu/\u02dcgraves/preprint.pdf.\n\nAlex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep\nbidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE\nWorkshop on, pages 273\u2013278. IEEE, 2013.\n\nFangjian Guo, Charles Blundell, Hanna Wallach, and Katherine Heller. The Bayesian echo cham-\nber: Modeling social in\ufb02uence via linguistic accommodation. In Proceedings of the Eighteenth\nInternational Conference on Arti\ufb01cial Intelligence and Statistics, pages 315\u2013323, 2015.\n\nAlan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika,\n\n58(1):83\u201390, 1971.\n\nXinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor, and Yan Liu. Hawkestopic: A joint\nmodel for network inference and topic modeling from text-based cascades. In Proceedings of the\nInternational Conference on Machine Learning (ICML), pages 871\u2013880, 2015.\n\nSepp Hochreiter and J\u00a8urgen Schmidhuber. Long short-term memory. Neural computation, 9(8):\n\n1735\u20131780, 1997.\n\nAndrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and understanding recurrent networks.\n\narXiv preprint arXiv:1506.02078, 2015.\n\n10\n\n\fDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of\n\nthe International Conference on Learning Representations (ICLR), 2015.\n\nYoung Lee, Kar Wai Lim, and Cheng Soon Ong. Hawkes processes with stochastic excitations. In\n\nProceedings of the International Conference on Machine Learning (ICML), 2016.\n\nJure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http:\n\n//snap.stanford.edu/data, June 2014.\n\nPeter A Lewis and Gerald S Shedler. Simulation of nonhomogeneous Poisson processes by thinning.\n\nNaval Research Logistics Quarterly, 26(3):403\u2013413, 1979.\n\nThomas Josef Liniger. Multivariate Hawkes processes. Diss., Eidgen\u00a8ossische Technische\n\nHochschule ETH Z\u00a8urich, Nr. 18403, 2009, 2009.\n\nMichal Lukasik, PK Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn. Hawkes\nprocesses for continuous time sequence classi\ufb01cation: An application to rumour stance classi\ufb01-\ncation in Twitter. In Proceedings of 54th Annual Meeting of the Association for Computational\nLinguistics, pages 393\u2013398, 2016.\n\nTomas Mikolov, Martin Kara\ufb01\u00b4at, Luk\u00b4as Burget, Jan Cernock\u00b4y, and Sanjeev Khudanpur. Recurrent\nneural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the\nInternational Speech Communication Association, Makuhari, Chiba, Japan, September 26-30,\n2010, pages 1045\u20131048, 2010.\n\nC. Palm. Intensit\u00a8atsschwankungen im Fernsprechverkehr. Ericsson technics, no. 44. L. M. Ericcson,\n\n1943. URL https://books.google.com/books?id=5cy2NQAACAAJ.\n\nJudea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96\u2013146, 2009.\n\nMartin Sundermeyer, Hermann Ney, and Ralf Schluter. LSTM neural networks for language mod-\n\neling. Proceedings of INTERSPEECH, 2012.\n\nYichen Wang, Bo Xie, Nan Du, and Le Song. Isotonic Hawkes processes. In Proceedings of the\n\nInternational Conference on Machine Learning (ICML), 2016.\n\nShuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Xiaokang Yang, Le Song, and Hongyuan\nIn Advances in Neural\n\nZha. Wasserstein learning of deep generative point process models.\nInformation Processing Systems 30, 2017a.\n\nShuai Xiao, Junchi Yan, Mehrdad Farajtabar, Le Song, Xiaokang Yang, and Hongyuan Zha. Joint\nmodeling of event sequence and time series with attentional twin recurrent neural networks. arXiv\npreprint arXiv:1703.08524, 2017b.\n\nHongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes\nprocesses. In Proceedings of the International Conference on Machine Learning (ICML), 2016.\n\nShuang-hong Yang and Hongyuan Zha. Mixture of mutually exciting processes for viral diffusion.\nIn Proceedings of the International Conference on Machine Learning (ICML), pages 1\u20139, 2013.\n\nOmar F. Zaidan and Jason Eisner. Modeling annotators: A generative approach to learning from an-\nnotator rationales. In Proceedings of the Conference on Empirical Methods in Natural Language\nProcessing (EMNLP), pages 31\u201340, 2008.\n\nQingyuan Zhao, Murat A Erdogdu, Hera Y He, Anand Rajaraman, and Jure Leskovec. Seismic: A\nself-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM\nSIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1513\u20131522.\nACM, 2015.\n\nKe Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes\nprocesses. In Proceedings of the International Conference on Machine Learning (ICML), pages\n1301\u20131309, 2013.\n\n11\n\n\f", "award": [], "sourceid": 3392, "authors": [{"given_name": "Hongyuan", "family_name": "Mei", "institution": "JOHNS HOPKINS UNIVERSITY"}, {"given_name": "Jason", "family_name": "Eisner", "institution": "Johns Hopkins University"}]}