{"title": "Learning Latent Process from High-Dimensional Event Sequences via Efficient Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3847, "page_last": 3856, "abstract": "We target modeling latent dynamics in high-dimension marked event sequences without any prior knowledge about marker relations. Such problem has been rarely studied by previous works which would have fundamental difficulty to handle the arisen challenges: 1) the high-dimensional markers and unknown relation network among them pose intractable obstacles for modeling the latent dynamic process; 2) one observed event sequence may concurrently contain several different chains of interdependent events; 3) it is hard to well define the distance between two high-dimension event sequences. To these ends, in this paper, we propose a seminal adversarial imitation learning framework for high-dimension event sequence generation which could be decomposed into: 1) a latent structural intensity model that estimates the adjacent nodes without explicit networks and learns to capture the temporal dynamics in the latent space of markers over observed sequence; 2) an efficient random walk based generation model that aims at imitating the generation process of high-dimension event sequences from a bottom-up view; 3) a discriminator specified as a seq2seq network optimizing the rewards to help the generator output event sequences as real as possible. 
Experimental results on both synthetic and real-world datasets demonstrate that the proposed method can effectively detect the hidden network among markers and make decent predictions for future marked events, even when the number of markers scales to the millions.", "full_text": "Learning Latent Process from High-Dimensional Event Sequences via Efficient Sampling

Qitian Wu1,2, Zixuan Zhang1,2, Xiaofeng Gao1,2*, Junchi Yan2,3, Guihai Chen4
1Shanghai Key Laboratory of Scalable Computing and Systems
2Department of Computer Science and Engineering, Shanghai Jiao Tong University
3MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
4State Key Laboratory of Novel Software Technology, Nanjing University
{echo740, zzx_gongshi117}@sjtu.edu.cn, gao-xf@cs.sjtu.edu.cn
yanjunchi@sjtu.edu.cn, gchen@nju.edu.cn

Abstract

We target modeling latent dynamics in high-dimensional marked event sequences without any prior knowledge about marker relations. This problem has rarely been studied, and previous works face fundamental difficulties with the challenges it raises: 1) the high-dimensional markers and the unknown relation network among them pose intractable obstacles for modeling the latent dynamic process; 2) one observed event sequence may concurrently contain several different chains of interdependent events; 3) it is hard to define a proper distance between two high-dimensional event sequences. 
To these ends, in this paper, we propose a novel adversarial imitation learning framework for high-dimensional event sequence generation, which can be decomposed into: 1) a latent structural intensity model that estimates adjacent nodes without an explicit network and learns to capture the temporal dynamics in the latent space of markers over the observed sequence; 2) an efficient random-walk-based generation model that imitates the generation process of high-dimensional event sequences from a bottom-up view; 3) a discriminator, specified as a seq2seq network, that optimizes rewards to help the generator output event sequences that are as realistic as possible. Experimental results on both synthetic and real-world datasets demonstrate that the proposed method can effectively detect the hidden network among markers and make decent predictions for future marked events, even when the number of markers scales to the millions.

1 Introduction

An event sequence, consisting of a series of (time, marker) tuples that record at which time which type of event takes place, is a fine-grained representation [10] of temporal data, which are pervasive in real-life applications. For example, one tweet or topic in a social network can give rise to a huge number of forwarding behaviors, forming an information cascade. Such a cascade can be recorded as an event sequence composed of the time of each retweet and the user who forwards the tweet, i.e., the marker. Another typical example is the POI route of a visitor in a city, where the event sequence records when the person visits which POI; here the POI is the marker. There are also cases where the markers contain compositional features, such as job-hopping events over one period, where the event sequence records at which time who transfers from which department of which company to which department of which company. In this case, the marker contains five-dimensional information. 
In the above examples, the number of markers can easily scale to an astronomical value when: 1) there are billions of users in one social network like Twitter; 2) there is a wealth of POIs in a big city; and 3) the compositional features stem from plenty of dimensions. In the literature, event sequences with a huge number of event types are termed high-dimensional (marked) event sequences [6].

*Xiaofeng Gao is the corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One problem for marked event sequences is to model the latent dynamic process from observed sequences. Such a latent process can be further decomposed into two mutually dependent components: a temporal point process, which captures the temporal dynamics between two adjacent events, and a relation network, which encodes the dependencies among different markers. Plenty of previous studies target this problem from different aspects. For temporal point processes, a great number of works [3, 13, 15, 16, 28] attempt to model the intensity function from a statistical viewpoint, and recent studies harness deep recurrent models [24], generative adversarial networks [23], and reinforcement learning [19, 18] to learn the temporal process. These studies mainly focus on one-dimensional event sequences in which every event carries the same marker. For marker relation modeling, several early studies [12, 27, 25] assume static correlation coefficients among markers; in some later works, the static coefficients are replaced by a series of parametric or non-parametric density functions [9, 11, 7]. 
Nevertheless, since these works need to learn dozens of parameters for each edge, which induces an O(n^2) parameter space, they are mostly limited to multi-dimensional event sequences where the number of markers is at most in the hundreds.

A few existing studies attempt to handle high-dimensional markers in one system. For instance, [8] targets information estimation in continuous-time diffusion networks where each edge entails a transmission function. Several similar works like [2, 17] also focus on temporal point processes over a huge diffusion network. However, they assume a given network topology, whereas in our work the network of markers is unknown. Furthermore, [22] directly models the latent process from observed event sequences without a known network and tries to capture the dependencies among markers through a temporal attention mechanism, which, nevertheless, can only implicitly reflect the relation network, while we aim at explicitly uncovering the hidden network with better interpretability. Moreover, the authors in [1] build a probabilistic model to uncover time-varying networks of dependencies. By contrast, apart from network reconstruction, our paper also deals with the temporal dynamic process over the graph.

Learning the latent process in high-dimensional event sequences is highly intractable. Firstly, due to the huge number of markers, the unknown network can be quite sparse, which makes previous methods that assume a density function for each edge fail to work. The high-dimensional markers also require a representation that is both effective and efficient. Secondly, one event sequence may consist of several different subsequences, each of which entails a chain of interdependent event markers. In other words, two time-adjacent events in one sequence do not necessarily share a dependency, since the latter event may be caused by an earlier event. 
This phenomenon makes the relations among events quite implicit. Thirdly, it is hard to quantify the discrepancy between two event sequences when events carry different markers; however, a proper loss function, which is the premise for decent model accuracy, requires a well-defined distance measure.

To these ends, in this paper, we propose a novel adversarial imitation learning framework that imitates the latent generation process of high-dimensional event sequences. The main intuition behind the methodology is that if the model can generate event sequences close to the real ones, one can believe that the model has accurately captured the latent process. Specifically, the generator contains two sub-modules: 1) a latent structural intensity model, which uses one marker's embedding feature to estimate a group of markers that are possibly its first-order neighbors and captures the temporal point process in the latent space of observed markers, and 2) an efficient random-walk-based generation model, which conducts a random walk on the local relation network of markers and generates the time and marker of the next event based on the historical events. The generator takes a bottom-up view of event generation with good interpretability, generalizes to arbitrary cases without any parametric assumption, and can be efficiently implemented based on our theoretical insights. To sidestep the intractable distance measure for high-dimensional event sequences, we design a seq2seq discriminator that maximizes the reward on the ground-truth event sequence (the expert policy) and minimizes the reward on the generated one; the reward is further used to train the generator. To verify the model, we run experiments on two synthetic datasets and two real-world datasets. 
The empirical results show that the proposed model gives decent predictions for future events and accurate network reconstruction, even when the number of markers scales to very high dimensions.

2 Methodology

Preliminary for Temporal Point Process. An event sequence can be modeled as a point process [4] where each new event's arrival time is treated as a random variable given the history of previous events. A common way to characterize a point process is via a conditional intensity function defined as λ(t|H_t) = P(N(t+dt) − N(t) = 1 | H_t) / dt, where H_t and N(t) denote the history of previous events and the number of events until time t, respectively. The arrival time of a new event then obeys the density distribution f(t|H_t) = λ(t|H_t) exp(−∫_{t_n}^{t} λ(τ|H_t) dτ), while the marker of the new event obeys a certain discrete distribution p(m|H_t).

Notations and Problem Formulation. Assume that a system has M types of events, i.e., markers, denoted as M = {m_i}_{i=1}^{M}, where M can be arbitrarily large. There exists a hidden relation network G = (V, E), where V = M and E = {c_ij}_{M×M} denotes a set of directed edges. Here c_ij = 1 indicates that marker m_j is a descendant of marker m_i (i.e., an event with marker m_i could cause an event with marker m_j), and c_ij = 0 denotes independence between the two markers. An event sequence S entails a series of events with time and marker, denoted as S = {(t_k, m_{i_k})} (k = 0, 1, ...), where t_k and m_{i_k} denote the time and marker of the k-th event, respectively, and m_{i_k} is a descendant of one of the previous markers m_{i_n} with 0 ≤ n < k. Note that an event may be caused by more than one earlier event; we only consider the first parent as the true parent [11, 9, 8]. We call the event (t_0, m_{i_0}) the source event. The problem in this paper can be formulated as follows. 
Given observed event sequences {S}, we aim at recovering the hidden relation network G and learning the latent process in the event sequences, i.e., the conditional distribution P((t_{k+1}, m_{i_{k+1}}) | H_k), where H_k = {(t_n, m_{i_n})}_{n=0}^{k} denotes the history up to time t_k.

Model Overview. The fundamental idea of our methodology is to imitate the event generation process from a bottom-up view, where the time and marker of each new event are sampled based on the history of previous events and the network. This idea is justified by the main intuition that the model has conceivably captured the latent process once it can generate event sequences that are close to the ground-truth ones. To achieve this goal, we build a framework named LANTERN (Learning Latent Process in High-Dimension Marked Event Sequences), shown in Fig. 1. We go into the details in the following.

2.1 Generating High-Dimension Event Sequences

Latent Structural Intensity Model. For marker m_i, we use an M-dimensional one-hot vector v_i to represent it. By multiplying with an embedding matrix W_M ∈ R^{D×M}, we encode each marker into a latent semantic space and obtain its representation d_i = W_M v_i. The embedding matrix W_M is randomly initialized and updated during training so as to capture the similarity between markers on the semantic level. Given the history of an event sequence (up to time t_k) with the first k+1 events, i.e., H_k = {(t_n, m_{i_n})}_{n=0}^{k}, we build a deep attentive intensity model to capture both the temporal dynamics and the structural dependencies in the event sequence.

For the n-th event, the marker m_{i_n} corresponds to a D-dimensional embedding vector d_{i_n}. To obtain a consistent representation, we also embed the continuous time by a linear transformation t_n = w_T t_n + b_T, where w_T, b_T ∈ R^{D×1} are two trainable vectors (with a slight abuse of notation, t_n on the left-hand side denotes the D-dimensional time embedding). 
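As a minimal illustration of the two embedding steps just described, the following sketch maps a marker index and a continuous timestamp to D-dimensional vectors. The toy sizes and the helper names (`embed_marker`, `embed_time`) are assumptions for illustration, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

M, D = 1000, 16                  # toy numbers of markers / embedding dimension
W_M = rng.normal(size=(D, M))    # marker embedding matrix, learned in training
w_T = rng.normal(size=(D, 1))    # trainable time-embedding weight vector
b_T = rng.normal(size=(D, 1))    # trainable time-embedding bias vector

def embed_marker(i: int) -> np.ndarray:
    """d_i = W_M v_i with v_i one-hot; equivalent to selecting column i of W_M."""
    v = np.zeros((M, 1))
    v[i, 0] = 1.0
    return W_M @ v

def embed_time(t: float) -> np.ndarray:
    """Linear time embedding: maps scalar t_n to the D-dim vector w_T * t_n + b_T."""
    return w_T * t + b_T

d = embed_marker(42)
t = embed_time(3.5)
assert d.shape == t.shape == (D, 1)
# multiplying by a one-hot vector is just column selection
assert np.allclose(d.ravel(), W_M[:, 42])
```

In practice the one-hot multiplication is never materialized; column indexing (an embedding lookup) gives the same d_i at O(D) cost, which matters when M reaches the millions.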
Then we linearly combine the embeddings of marker and time, e_n = η · t_n + d_{i_n}, to represent the n-th event, incorporating both temporal and structural information. We then define a D-dimensional intensity function by attentively aggregating the representations of all previous events,

h_n = MultiHeadAttn(e_0, e_1, ..., e_k), n = 0, 1, ..., k,   (1)

where MultiHeadAttn(·) is specified in Appendix A.

Remark. Equation (1) computes a D-dimensional intensity function in the latent space of high-dimensional markers. Compared with previous works that rely on a scalar intensity value for each dimension (specified by either statistical functions or deep models), our model possesses two advantages. Firstly, the marker embedding enables (1) to capture the structural proximity among markers in a latent space, and the value of h_k implies the instantaneous arrival rate of new markers at the semantic level. This property enables our model to express more complex dynamics efficiently, especially for high-dimensional event sequences. Secondly, the time is encoded as a vector representation, instead of being directly concatenated as a scalar value with the marker embedding as in previous works.

Figure 1: Framework of LANTERN: the generator leverages multi-head attention units to capture the intensity function in the latent space of markers and a random walk method to generate the next event, while the discriminator optimizes the reward for each sampling.

Figure 2: Local relation network of an event sequence (with k = 4). The blue nodes represent event markers existing in the sequence, and the white nodes belong to their causal descendants.
Such a setting is similar to the position embedding [20, 5] used for sentence representation in NLP; the difference is that for event sequences we deal with continuous time, which is more fine-grained than discrete positions.

Random Walk Based Next Event Generation. Due to the cause-effect nature of event sequences, a new event marker can only lie in the descendants of the existing markers. Let M_k = ∪_{n=0}^{k} {m_{i_n}} denote the set of existing markers in H_k. For m_i ∈ M_k, its descendants in the relation network can be estimated by attentively sampling over

p(m_j ∈ N_i) = exp(w_C^T [d_j || d_i]) / Σ_{u=1}^{M} exp(w_C^T [d_u || d_i]),   (2)

where N_i = {m_j | c_ij = 1} denotes the set of descendants of m_i in G and || denotes concatenation. This sampling method is inspired by the graph attention network (GAT) [21]; the difference is that GAT encodes a given network into feature vectors, whereas our model uses the trainable node embeddings to retrieve the network. The denominator of (2) requires an embedding for every marker in the system, which poses a high computational cost during training. In practical implementation, we therefore fix all p(m_j ∈ N_i) during one epoch and update the parameters when the epoch finishes. This reduces the complexity and avoids high variance of event generation across mini-batches.

The probability that the n-th event (t_n, m_{i_n}) in the history H_k causes a new event with marker m_j ∈ N_{i_n} can be approximated by

p(m_j ∈ N̄_{i_n} | m_j ∈ N_{i_n}) = exp(w_N^T [h_n || d_j] + b_N) / Σ_{m_u ∈ N_{i_n}} exp(w_N^T [h_n || d_u] + b_N),   (3)

where N̄_{i_n} denotes the true descendants of marker m_{i_n} in the event sequence.

To sample the new marker m_{i_{k+1}}, we design a random walk approach that interprets the generation process from a bottom-up view. 
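Before formalizing the walk, the neighborhood distribution in (2) can be sketched concretely. This is a toy illustration with assumed shapes and names (`descendant_probs`, the embedding matrix `d`), not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(1)
M, D = 50, 8                     # toy marker count and embedding size
d = rng.normal(size=(M, D))      # trainable marker embeddings d_1..d_M
w_C = rng.normal(size=(2 * D,))  # trainable scoring vector of equation (2)

def descendant_probs(i: int) -> np.ndarray:
    """p(m_j in N_i) from (2): a softmax over scores w_C^T [d_j || d_i]."""
    scores = np.array([w_C @ np.concatenate([d[u], d[i]]) for u in range(M)])
    scores -= scores.max()       # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

p = descendant_probs(3)
assert p.shape == (M,) and abs(p.sum() - 1.0) < 1e-9
# as the text suggests, such a distribution can be cached for a whole epoch
# and candidate descendants drawn from it repeatedly
cand = np.random.default_rng(2).choice(M, size=5, p=p)
assert cand.shape == (5,)
```

The O(M) denominator per marker is exactly the cost the paper amortizes by freezing all p(m_j ∈ N_i) within an epoch.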
Consider a multiset A_k (which allows an element to appear multiple times) consisting of all existing markers in H_k, and a multiset D_k containing the descendants of all existing markers. Besides, we define two other multisets: E_k^E = {c_{i_n j}} (n = 1, ..., k), which contains relation edges where c_{i_n j} connects an existing marker m_{i_n} ∈ A_k with its descendant m_j ∈ N_{i_n} given by sampling over (2), and E_k^T = {c_{i_{n-1} i_n}} (n = 1, ..., k), which contains true relation edges where c_{i_{n-1} i_n} connects two event markers m_{i_{n-1}}, m_{i_n} ∈ A_k with m_{i_n} ∈ N̄_{i_{n-1}}. Define V_k = A_k ∪ D_k and E_k = E_k^E ∪ E_k^T. Then V_k and E_k induce a graph G_k = (V_k, E_k), which we call the local relation network. Fig. 2 shows an example of a local relation network, where the solid lines denote true relation edges. By definition, the new marker can only be sampled from D_k, i.e., the leaf nodes of G_k.

Algorithm 1: Efficient Random Walk Based Sampling for Generation of the Next Event Marker
1 INPUT: (t_0, m_{i_0}), source event time and marker (which can be given or initially sampled from M); N_i, sampled descendants for each marker m_i, i = 1, ..., M.
2 D_0 ← N_{i_0}; set ρ_0(m_j) = p(m_j ∈ N̄_{i_0} | m_j ∈ N_{i_0}) according to (1) and (3) for each m_j ∈ N_{i_0};
3 for k = 1, ..., T do
4   Draw m_{i_k} from MN(ρ_{k-1}), and update D_k ← D_{k-1} ∪ N_{i_k} // assume m_{i_n} is the parent of m_{i_k}; we keep a record of the parent of each m_j ∈ D_k;
5   b_k ← ρ_{k-1}(m_{i_n}) · p(m_{i_k} ∈ N̄_{i_n} | m_{i_k} ∈ N_{i_n});
6   ρ_{k-1}(m_{i_k}) ← ρ_{k-1}(m_{i_k}) − b_k;
7   for m_i ∈ N_{i_k} do
8     ρ_k(m_i) ← b_k · p(m_i ∈ N̄_{i_k} | m_i ∈ N_{i_k});
9 OUTPUT: S = {(t_k, m_{i_k})}_{k=0}^{T}, a generated event sequence.

For each m_j ∈ D_k, let P_j^k denote the path from m_{i_0} to m_j; P_j^k = {m_{u_n}}_{n=0}^{N} contains each marker m_{u_n} on the path, where m_{u_0} = m_{i_0} and m_{u_N} = m_j. (Note that N varies with j; we omit the subscript to keep the notation clean.) P_j^k possesses an important property based on the cause-effect nature of event sequences.

Theorem 1. In the local relation network G_k = (V_k, E_k), for any m_j ∈ D_k, each path P_j^k = {m_{u_n}}_{n=0}^{N} satisfies that for any n with 0 ≤ n < N, it holds that m_{u_n} ∈ A_k.

We now give our random walk based generation process for the next event:

• Marker Generation: start from the source event marker m_{i_0}; when the current move arrives at marker m_i: if m_i ∈ A_k, i.e., m_i = m_{i_n} for some event n, jump to a next marker m_j ∈ N_{i_n} with probability p(m_j ∈ N̄_{i_n} | m_j ∈ N_{i_n}) given by (3); otherwise, i.e., if m_i ∈ D_k, stop and set m_{i_{k+1}} = m_i.

• Time Estimation: we estimate the time interval between the next event and the k-th event as Δt_{k+1} = log(1 + exp(W'_T h_n + b'_T)), where h_n is the intensity representation up to time t_n and (t_n, m_{i_n}) is the n-th event in H_k. Finally, t_{k+1} = t_k + Δt_{k+1}.

Theorem 1 guarantees the well-definedness of the above interpretable approach. However, its theoretical complexity is quadratic w.r.t. the maximum length of the event sequences. We further propose an equivalent sampling method that requires only linear time.

Efficient Algorithm. For each m_j ∈ D_k, the path P_j^k = {m_{u_n}}_{n=0}^{N} induces a probability p(P_j^k | G_k) = ∏_{n=1}^{N} p(m_{u_n} ∈ N̄_{u_{n-1}} | m_{u_n} ∈ N_{u_{n-1}}). We then obtain the following theorem.

Theorem 2. The random walk approach is equivalent to drawing a marker m_j from D_k according to a multinomial distribution MN(ρ), where ρ(m_j) = p(P_j^k | G_k).

Theorem 2 allows us to design an alternative sampling algorithm that iteratively reuses previous outcomes, which is shown in Alg. 1. We further show that the sampling method of Alg.
1 is well-defined and equivalent to the one in Theorem 2. Moreover, its complexity is linear w.r.t. the sequence length. The proofs are given in Appendix B.

2.2 Training by Inverse Reinforcement Learning

Optimization. As discussed in the previous subsection, the main goal of our model is to generate event sequences that are as realistic as possible. The generator can be treated as an agent that interacts with the environment and follows a policy π(a_k | s_k), where the action is a_k = (t_k, m_{i_k}) and the state is s_k = H_{k-1}. Here π(a_k | s_k) = Σ_{m_i ∈ M_{k-1}} p(m_{i_k} ∈ N_{m_i}) · ρ_{k-1}(m_{i_k}). The goal is to maximize the expectation of the reward r(S) = r(a, s) = Σ_k γ^k r(a_k, s_k) = Σ_k γ^k r_k, where γ is a discount factor. Since measuring the discrepancy between two high-dimensional event sequences is quite intractable, it is hard to determine a proper reward function. We thus turn to inverse reinforcement learning, which concurrently optimizes the reward function and the policy network; the objective can be written as

min_π −H(π) + max_r E_{π_E}[r(S*)] − E_π[r(S)],   (4)

where S = {(t_k, m_{i_k})} (resp. S* = {(t*_k, m*_{i_k})}) is the generated (resp. ground-truth) event sequence given the same source event, π_E is the expert policy that yields S*, and H(π) denotes the entropy of the policy.

We proceed to adopt the GAIL [14] framework to learn the reward function by considering a discriminator D_w : S → [0, 1]^T, which is parametrized by w and maps an event sequence to a sequence of rewards {r_k}_{k=1}^{T} in the range [0, 1]. The gradient for the discriminator is given by

E_π[∇_w log D_w(S)] + E_{π_E}[∇_w log(1 − D_w(S*))]
≈ (1/B) Σ_{b=1}^{B} ( Σ_{k=1}^{T} ∇_w log d_k(S_b; w) + Σ_{k=1}^{T} ∇_w log(1 − d_k(S*_b; w)) ),   (5)

where d_k(S; w) = r_k is the k-th output of D_w(S), and we sample B generated sequences {S_b}_{b=1}^{B} to approximate the expectation. The policy gradient for the generator with parameter set θ is

E_{π_θ}[∇_θ log π(a|s) log D_w(S)] − λ∇_θ H(π)
≈ (1/B) Σ_{b=1}^{B} ( Σ_{k=1}^{T} γ^k ∇_θ log π(a_k|s_k) log d_k(S_b; w) ) − λ Σ_{k=1}^{T} ∇_θ log π(a_k|s_k) Q_log(a, s),   (6)

where

Q_log(a, s) = E_{π_θ}(−log π_θ(a|s) | s_0 = s, a_0 = a).   (7)

The training algorithm is given by Alg. 2 in Appendix D.

Ingredients of Discriminator. We harness a sequence-to-sequence model to implement the discriminator D_w : S → [0, 1]^T. Given an event sequence S = {(t_k, m_{i_k})}_{k=0}^{T} with event embeddings e_0, e_1, ..., e_T, we have

a_k = MultiHeadAttn(e_0, e_1, ..., e_T),
r_k = sigmoid(W_D a_k + b_D), k = 1, ..., T,

where W_D ∈ R^{D×1} and b_D is a scalar.

3 Experiments

We apply our model LANTERN to two synthetic datasets and two real-world datasets in order to verify its effectiveness in modeling high-dimensional event sequences. The code is released at https://github.com/zhangzx-sjtu/LANTERN-NeurIPS-2019.

Synthetic Data Generation. 
We generate two networks, a small one with 1,000 nodes and a large one with 100,000 nodes, whose directed edges are sampled from a Bernoulli distribution with p = 5 × 10^-3 for the small network and p = 3 × 10^-5 for the large one. The nodes of the network are treated as markers. Each edge c_ij corresponds to a Rayleigh distribution f_ij(t | a, b) = (2/(t−a)) · ((t−a)/b)^2 · exp(−((t−a)/b)^2), t ≥ a, and we basically set a = 0 and b = 1. We then generate event sequences as follows: 1) randomly select a node as the marker of the source event and set the time of the source event to 0; 2) for each sampled marker i, sample the time of the next event with marker j, where j is a descendant of marker i in the network, according to f_ij(t), and pick the event with the smallest time as the newly sampled event. The whole process repeats until the time exceeds a global time window T_c. If a sampled event marker has more than one parent, we use the smallest sampled time as the true sampled time of the event. We repeat the above process to generate 10,000 event sequences for the small network and 100,000 event sequences for the large network. We call the dataset with the small network Syn-Small and the dataset with the large network Syn-Large.

Table 1: Results for network reconstruction. We compare the estimated edges with the ground-truth edges and report precision (PRE), recall (REC), and F1 score (F1). For LANTERN, LANTERN-RNN, and LANTERN-PR, we use the edges with the top-K probabilities given by (2) for each marker as estimated edges and consider three settings of K: in Syn-Small and Syn-Large, K1 = 3, K2 = 4, K3 = 5; in MemeTracker and Weibo, K1 = 25, K2 = 30, K3 = 35. Each cell lists PRE / REC / F1.

| Methods | Syn-Small | Syn-Large | MemeTracker | Weibo |
|---|---|---|---|---|
| NETRATE | 0.4983 / 0.3986 / 0.4429 | – | 0.5665 / 0.2447 / 0.3418 | – |
| KernelCascade | 0.4975 / 0.3980 / 0.4422 | – | 0.5364 / 0.2897 / 0.3762 | – |
| LTN-PR (K1) | 0.5899 / 0.3539 / 0.4424 | 0.4740 / 0.4740 / 0.4740 | 0.4973 / 0.3357 / 0.4009 | 0.3824 / 0.3524 / 0.3654 |
| LTN-PR (K2) | 0.5856 / 0.4685 / 0.5205 | 0.4987 / 0.4987 / 0.4987 | 0.4637 / 0.3756 / 0.4150 | 0.3560 / 0.3864 / 0.3692 |
| LTN-PR (K3) | 0.5823 / 0.5823 / 0.5823 | 0.3984 / 0.6640 / 0.4980 | 0.4336 / 0.4098 / 0.4214 | 0.3302 / 0.3717 / 0.3484 |
| LTN-RNN (K1) | 0.4476 / 0.2686 / 0.3357 | 0.6523 / 0.3914 / 0.4892 | 0.4998 / 0.3374 / 0.4028 | 0.5706 / 0.5274 / 0.5462 |
| LTN-RNN (K2) | 0.4718 / 0.3774 / 0.4194 | 0.4980 / 0.6640 / 0.5691 | 0.4653 / 0.3769 / 0.4165 | 0.5417 / 0.5910 / 0.5631 |
| LTN-RNN (K3) | 0.4888 / 0.4888 / 0.4888 | 0.4976 / 0.8293 / 0.6220 | 0.4352 / 0.4113 / 0.4211 | 0.5306 / 0.5966 / 0.5596 |
| LANTERN (K1) | 0.5758 / 0.3455 / 0.4318 | 0.4833 / 0.4833 / 0.4833 | 0.4987 / 0.3367 / 0.4020 | 0.5726 / 0.5295 / 0.5483 |
| LANTERN (K2) | 0.5742 / 0.4594 / 0.5104 | 0.5000 / 0.6667 / 0.5714 | 0.4651 / 0.3767 / 0.4163 | 0.5448 / 0.5944 / 0.5663 |
| LANTERN (K3) | 0.5733 / 0.5733 / 0.5733 | 0.4952 / 0.8483 / 0.6253 | 0.4354 / 0.4114 / 0.4230 | 0.5320 / 0.5982 / 0.5611 |

Real-World Data Information. We also use two real-world datasets in our experiments. Firstly, the MemeTracker dataset [11] contains hyperlinks between articles and records information flow from one site to another. In this setting, each site plays the role of a marker, and each article generates an information cascade that can be treated as an event sequence. The hyperlinks represent the relation network among markers. We filter a network of the top 583 sites with 6,700 cascades. The MemeTracker dataset is used to compare our model with previous methods that focus on learning the network and temporal process in event sequences with hundreds of markers. Besides, we consider a large-scale dataset, Weibo [26], which records the resharing of posts among 1,787,443 users with 413,503,687 following edges. 
Here each user corresponds to an event marker and every resharing behavior of a user can be seen as an event, so the cascades of resharing form high-dimensional event sequences. We extract 10^5 users with 2,531,525 edges and 10^5 cascades to evaluate our model on modeling high-dimensional event sequences.

Competitors and Baselines. We compare our model with two previous methods, NETRATE [11] and KernelCascade [9], which attempt to learn the heterogeneous network and the temporal process from event sequences by learning a transmission density function for each edge. Since their huge parameter size limits scalability to very high-dimensional markers, we only apply them to our small synthetic dataset and the MemeTracker dataset. Besides, we consider two simplified versions of LANTERN as an ablation study: LANTERN-RNN, which replaces the multi-head attention mechanism with an RNN, and LANTERN-PR, which removes the discriminator and uses a heuristic reward function as the training signal for the generator. We compare our model with them on all four datasets to study the effectiveness of the attention mechanism and inverse reinforcement learning. For each method, we run five trials and report the average values. All improvements reported in this paper are significant according to the Wilcoxon signed-rank test at the 5% confidence level. Implementation details for the baselines and hyper-parameter settings are presented in Appendix C.

Event Prediction. We use our model to predict the time and marker of the next event given part of an observed sequence, and use MSE and accuracy to evaluate the performance of time and marker prediction, respectively. The results of all methods are shown in Fig. 3. As we can see, on MemeTracker and Syn-Small, KernelCascade slightly outperforms NETRATE for both time and marker prediction, while our model LANTERN achieves a great improvement over both competitors, especially when given very few observed events. 
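For reference, the two prediction metrics used above can be sketched as follows (a toy implementation for clarity; the function names are ours, not from the released code).

```python
import numpy as np

def time_mse(t_true, t_pred):
    """Mean squared error over predicted next-event times."""
    t_true = np.asarray(t_true, dtype=float)
    t_pred = np.asarray(t_pred, dtype=float)
    return float(np.mean((t_true - t_pred) ** 2))

def marker_accuracy(m_true, m_pred):
    """Fraction of next-event markers predicted exactly right."""
    m_true = np.asarray(m_true)
    m_pred = np.asarray(m_pred)
    return float(np.mean(m_true == m_pred))

assert time_mse([1.0, 2.0], [1.0, 4.0]) == 2.0
assert marker_accuracy([5, 7, 9], [5, 7, 0]) == 2 / 3
```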
LANTERN-RNN performs better than LANTERN-PR on the small datasets. However, when the dimension of markers is extremely large, the performance of LANTERN-RNN declines considerably, probably due to the limited capacity of the RNN architecture to capture the highly variable relations between high-dimensional event markers. On all four datasets, LANTERN-PR is generally inferior to LANTERN for both time and marker prediction. The likely reason is that the heuristic reward function cannot characterize the discrepancy between event sequences well and may provide unreliable training signals.

Network Reconstruction. We also leverage the model to reconstruct the network topology, using precision, recall, and F1 score as metrics. The results are shown in Table 1, where we shorten LANTERN-RNN and LANTERN-PR to LTN-RNN and LTN-PR, respectively. As shown in Table 1, LANTERN gives the best reconstruction F1 score among all baselines, achieving on average a 14.9% improvement over the better of NETRATE and KernelCascade. Also, LANTERN outperforms LANTERN-RNN, which indicates that the multi-head attention network better captures the latent structural proximity among markers in event sequences.

Figure 3: Experimental results for time and marker prediction on the four datasets. We truncate a certain ratio of an event sequence as observed information and aim at predicting the time and marker of the next event. The figures show the prediction performance under different observed ratios.

Figure 4: Scalability test on synthetic data. We change the marker number from 100 to 100,000 and report the running time of LANTERN, NETRATE, and KernelCascade for sequence lengths of (a) 5, (b) 25, and (c) 50.

Scalability. We also test our model under different numbers of markers and sequence lengths, and present the results in Fig. 4.
The experiments are deployed on Nvidia Tesla K80 GPUs with 12 GB of memory, and we record the running time to assess model scalability. The results show that as the marker number increases from 100 to 100,000, the running time of LANTERN grows linearly, while the other two methods grow almost exponentially. When the system has a huge number of markers (e.g., at the million level), LANTERN remains effective with good scalability, whereas NETRATE and KernelCascade would be too time-consuming: they need to optimize a transmission density function for each edge in the network, which induces at least quadratic parameter space in terms of the marker number.

4 Conclusion

In this paper, we focus on learning both the hidden relation network and the temporal point process in high-dimensional marked event sequences, a problem that has rarely been studied and poses intractable challenges for previous approaches. To solve the problem, we first build a generator model that takes a bottom-up view to imitate the generation process of event sequences. The generator considers each marker as an embedding vector, uses graph-based attentive estimation for network reconstruction, and entails a latent structural intensity function to capture the temporal point process in the latent space of markers over the sequence. Then we design an interpretable and efficient random walk based sampling approach to generate the next event. To overcome the difficulty of measuring the discrepancy between high-dimensional event sequences, we use inverse reinforcement learning to optimize the reward function for event generation. Extensive experiments on both synthetic and (large-scale) real-world datasets demonstrate that our model could give superior prediction for future events as well as reconstruct the hidden network.
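The scalability advantage rests on the parameter-count argument made in the experiments: per-edge transmission densities grow at least quadratically in the marker number, while embedding-based parameterizations grow linearly. A back-of-the-envelope count illustrates this (the embedding dimension k = 64 is an assumed value for illustration, not from the paper):

```python
# Rough parameter counts behind the scalability argument: per-edge
# transmission densities (as in NETRATE/KernelCascade) need at least one
# parameter per directed edge, i.e. O(d^2) in the marker number d, while
# an embedding-based model keeps O(d * k) parameters for a fixed embedding
# dimension k (k = 64 here is an illustrative assumption).

def per_edge_params(d):
    return d * (d - 1)      # at minimum, one parameter per directed edge

def embedding_params(d, k=64):
    return d * k            # one k-dimensional embedding per marker

for d in (100, 10_000, 1_000_000):
    ratio = per_edge_params(d) / embedding_params(d)
    print(f"d={d}: per-edge={per_edge_params(d)}, "
          f"embedding={embedding_params(d)}, ratio={ratio:.1f}")
```

At a million markers the per-edge parameterization is roughly four orders of magnitude larger, which is consistent with the running-time trends in Fig. 4.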
Also, scalability tests show that the model can tackle event sequences with a huge number of markers.

5 Acknowledgement

This work was supported by the National Key R&D Program of China [2018YFB1004703]; the National Natural Science Foundation of China [61872238, 61672353, 61972250]; the Shanghai Science and Technology Fund [17510740200]; the CCF-Huawei Database System Innovation Research Plan [CCF-Huawei DBIR2019002A]; the Huawei Innovation Research Program [HO2018085286]; the State Key Laboratory of Air Traffic Management System and Technology [SKLATM20180X]; and the Tencent Social Ads Rhino-Bird Focused Research Program.

References

[1] A. Ahmed and E. P. Xing. Recovering time-varying networks of dependencies in social and biological studies. In Proceedings of the National Academy of Sciences of the United States of America, pages 11878–11883, 2009.

[2] S. Bourigault, S. Lamprier, and P. Gallinari. Representation learning for information diffusion through social networks: an embedded cascade model. In WSDM, pages 573–582, 2016.

[3] D. R. Cox. Some statistical methods connected with series of events. Journal of the Royal Statistical Society. Series B (Methodological), pages 129–164, 1955.

[4] D. J. Daley and D. Vere-Jones.
An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.

[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[6] V. Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):245–264, 2008.

[7] M. Eichler, R. Dahlhaus, and J. Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. arXiv preprint arXiv:1605.06759, 2016.

[8] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS, pages 3147–3155, 2013.

[9] N. Du, L. Song, A. J. Smola, and M. Yuan. Learning networks of heterogeneous influence. In NIPS, pages 2789–2797, 2012.

[10] A. S. Fotheringham and D. W. Wong. The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7):1025–1044, 1991.

[11] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, pages 561–568, 2011.

[12] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In KDD, pages 1019–1028, 2010.

[13] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal Statistical Society. Series B (Methodological), 1971.

[14] J. Ho and S. Ermon. Generative adversarial imitation learning. In NIPS, 2016.

[15] V. Isham and M. Westcott. A self-correcting point process. Advances in Applied Probability, 37:629–646, 1979.

[16] E. Lewis and G. Mohler. A nonparametric EM algorithm for multiscale Hawkes processes. Journal of Nonparametric Statistics, 2011.

[17] C.
Li, J. Ma, X. Guo, and Q. Mei. DeepCas: An end-to-end predictor of information cascades. In WWW, pages 577–586, 2017.

[18] S. Li, S. Xiao, S. Zhu, N. Du, Y. Xie, and L. Song. Learning temporal point processes via reinforcement learning. In NIPS, 2018.

[19] U. Upadhyay, A. De, and M. G. Rodriguez. Deep reinforcement learning of marked temporal point processes. In NIPS, pages 3168–3178, 2018.

[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.

[21] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In ICLR, 2018.

[22] Y. Wang, H. Shen, S. Liu, J. Gao, and X. Cheng. Cascade dynamics modeling with attention-based recurrent neural network. In IJCAI, pages 2985–2991, 2017.

[23] S. Xiao, M. Farajtabar, X. Ye, J. Yan, L. Song, and H. Zha. Wasserstein learning of deep generative point process models. In NIPS, 2017.

[24] S. Xiao, J. Yan, X. Yang, H. Zha, and S. Chu. Modeling the intensity function of point process via recurrent neural networks. In AAAI, 2017.

[25] H. Xu, M. Farajtabar, and H. Zha. Learning Granger causality for Hawkes processes. In ICML, pages 1717–1726, 2016.

[26] J. Zhang, J. Tang, J. Li, Y. Liu, and C. Xing. Who influenced you? Predicting retweet via social influence locality. TKDD, 9(3):25:1–25:26, 2015.

[27] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, pages 641–649, 2013.

[28] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional Hawkes processes.
In ICML, pages 1301–1309, 2013.