{"title": "Learning Influence among Interacting Markov Chains", "book": "Advances in Neural Information Processing Systems", "page_first": 1577, "page_last": 1584, "abstract": null, "full_text": "Learning Influence among Interacting Markov Chains\nDong Zhang IDIAP Research Institute CH-1920 Martigny, Switzerland zhang@idiap.ch Samy Bengio IDIAP Research Institute CH-1920 Martigny, Switzerland bengio@idiap.ch Daniel Gatica-Perez IDIAP Research Institute CH-1920 Martigny, Switzerland gatica@idiap.ch Deb Roy Massachusetts Institute of Technology Cambridge, MA 02142, USA dkroy@media.mit.edu\n\nAbstract\nWe present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. Individual level models actions of each player, and the group-level models actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model.\n\n1\n\nIntroduction\n\nIn multi-agent systems, individuals within a group coordinate and interact to achieve a goal. For instance, consider a basketball game where a team of players with different roles, such as attack and defense, collaborate and interact to win the game. Each player performs a set of individual actions, evolving based on their own dynamics. A group of players interact to form a team. Actions of the team and its players are strongly correlated, and different players have different influence on the team. Taking another example, in conversational settings, some people seem particularly capable of driving the conversation and dominating its outcome. These people, skilled at establishing the leadership, have the largest influence on the group decisions, and often shift the focus of the meeting when they speak [8]. 
In this paper, we quantitatively investigate the influence of individual players on their team using a dynamic Bayesian network that we call the two-level influence model. The proposed model explicitly learns the influence of each individual player on the team with a two-level structure. At the first level, we model the actions of individual players. At the second level, we model the actions of the team as a whole. The model is then applied to determine (a) the influence of players in multi-player games, and (b) the influence of participants in meetings. The paper is organized as follows. Section 2 introduces the two-level influence model. Section 3 reviews related models. Section 4 presents results on multi-player games, and Section 5 presents results on a meeting corpus. Section 6 provides concluding remarks.\n\nFigure 1: (a) Markov model for an individual player. (b) Two-level influence model (for simplicity, we omit the observation variables of the individual Markov chains and the switching parent variable $Q$). (c) Switching parents. $Q$ is called a switching parent of $S^G$, and $\\{S^1, \\ldots, S^N\\}$ are conditional parents of $S^G$. When $Q = i$, $S^i$ is the only parent of $S^G$.\n\n2\n\nTwo-level Influence Model\n\nThe proposed model, called the two-level influence model, is a dynamic Bayesian network (DBN) with a two-level structure: the player level and the team level (Fig. 1). The player level represents the actions of individual players, which evolve according to their own Markovian dynamics (Fig. 1 (a)). The team level represents group-level actions (actions that belong to the team as a whole, not to a particular player). In Fig. 
1 (b), the upward arrows (from players to team) represent the influence of the individual actions on the group actions, and the downward arrows (from team to players) represent the influence of the group actions on the individual actions. Let $O^i$ and $S^i$ denote the observation and state of the $i$th player, respectively, and let $S^G$ denote the team state. For $N$ players and observation sequences of identical length $T$, the joint distribution of our model is given by\n\n$$P(S, O) = \\prod_{i=1}^{N} P(S_1^i) \\prod_{t=1}^{T} \\prod_{i=1}^{N} P(O_t^i | S_t^i) \\prod_{t=1}^{T} P(S_t^G | S_t^1 \\cdots S_t^N) \\prod_{t=2}^{T} \\prod_{i=1}^{N} P(S_t^i | S_{t-1}^i, S_{t-1}^G). \\quad (1)$$\n\nRegarding the player level, we model the actions of each individual with a first-order Markov model (Fig. 1 (a)) with one observation variable $O^i$ and one state variable $S^i$. Furthermore, to capture the dynamics of all the players interacting as a team, we add a hidden variable $S^G$ (the team state), which is responsible for modeling the group-level actions. Unlike an individual player state, which has its own Markovian dynamics, the team state is not directly influenced by its previous state. $S^G$ can be seen as the aggregate behavior of the individuals, yet it provides a useful level of description beyond individual actions. There are two kinds of relationships between the team and the players: (1) The team state at time $t$ influences the players' states at the next time step (down arrow in Fig. 1 (b)). In other words, the state of the $i$th player at time $t+1$ depends on its previous state as well as on the team state, i.e., $P(S_{t+1}^i | S_t^i, S_t^G)$. (2) The team state at time $t$ is influenced by all the players' states at the current time (up arrow in Fig. 1 (b)), resulting in a conditional state transition distribution $P(S_t^G | S_t^1 \\cdots S_t^N)$. To reduce the model complexity, we add one hidden variable $Q$ to the model to switch parents for $S^G$. The idea of switching parents (also called Bayesian multi-nets in [3]) is as follows: a variable ($S^G$ in this case) has a set of parents $\\{Q, S^1, \\ldots, S^N\\}$ (Fig. 1(c)). 
$Q$ is the switching parent: its current value determines which of the other parents to use. $\\{S^1, \\ldots, S^N\\}$ are the conditional parents. In Fig. 1(c), $Q$ switches the parents of $S^G$ among $\\{S^1, \\ldots, S^N\\}$, corresponding to the distribution\n\n$$P(S_t^G | S_t^1 \\cdots S_t^N) = \\sum_{i=1}^{N} P(S_t^G, Q = i | S_t^1 \\cdots S_t^N) \\quad (2)$$\n$$= \\sum_{i=1}^{N} P(Q = i | S_t^1 \\cdots S_t^N) P(S_t^G | S_t^1 \\cdots S_t^N, Q = i) \\quad (3)$$\n$$= \\sum_{i=1}^{N} P(Q = i) P(S_t^G | S_t^i) = \\sum_{i=1}^{N} \\alpha_i P(S_t^G | S_t^i). \\quad (4)$$\n\nFrom Eq. 3 to Eq. 4, we made two assumptions: (i) $Q$ is independent of $\\{S^1, \\ldots, S^N\\}$; and (ii) when $Q = i$, $S_t^G$ depends only on $S_t^i$. The distribution over the switching-parent variable $P(Q)$ essentially describes how much influence or contribution the state transitions of the player variables have on the state transitions of the team variable. We refer to $\\alpha_i = P(Q = i)$ as the influence value of the $i$th player. Obviously, $\\sum_{i=1}^{N} \\alpha_i = 1$. If we further assume that all player variables have the same number of states $N_S$, and the team variable has $N_G$ possible states, the joint log probability is given by\n\n$$\\log P(S, O) = \\underbrace{\\sum_{i=1}^{N} \\sum_{j=1}^{N_S} z_{j,1}^i \\log P(S_1^i = j)}_{\\text{initial probability}} + \\underbrace{\\sum_{t=1}^{T} \\sum_{i=1}^{N} \\sum_{j=1}^{N_S} z_{j,t}^i \\log P(O_t^i | S_t^i = j)}_{\\text{emission probability}}$$\n$$+ \\underbrace{\\sum_{t=2}^{T} \\sum_{i=1}^{N} \\sum_{j=1}^{N_S} \\sum_{k=1}^{N_S} \\sum_{g=1}^{N_G} z_{j,t}^i z_{k,t-1}^i z_{g,t-1}^G \\log P(S_t^i = j | S_{t-1}^i = k, S_{t-1}^G = g)}_{\\text{group influence on individual transition}}$$\n$$+ \\underbrace{\\sum_{t=1}^{T} \\sum_{k=1}^{N_S} \\sum_{g=1}^{N_G} z_{g,t}^G z_{k,t}^i \\log \\Big\\{ \\sum_{i=1}^{N} \\alpha_i P(S_t^G = g | S_t^i = k) \\Big\\}}_{\\text{individual influence on group}}, \\quad (5)$$\n\nwhere the indicator variable $z_{j,t}^i = 1$ if $S_t^i = j$, and $z_{j,t}^i = 0$ otherwise. We can see that the model has complexity $O(T N N_G N_S^2)$. For $T = 2000$, $N_S = 10$, $N_G = 5$, $N = 4$, this amounts to $2000 \\times 4 \\times 5 \\times 10^2 = 4 \\times 10^6$, i.e., on the order of $10^6$ operations, which is still tractable. For the model implementation, we used the Graphical Models Toolkit (GMTK) [4], a DBN system for speech, language, and time series data. 
Specifically, we used the switching parents feature of GMTK, which greatly facilitates the implementation of the two-level model and the learning of the influence values with the Expectation Maximization (EM) algorithm. Since EM can converge to local maxima, good initialization is very important. To initialize the emission probability distribution in Eq. 5, we first train individual action models (Fig. 1 (a)) by pooling all observation sequences together. We then use the trained emission distribution from the individual action model to initialize the emission distribution of the two-level influence model. This procedure is beneficial because it uses data from all individual streams together, and thus has a larger amount of training data for learning.\n\nFigure 2: (a) A snapshot of the multi-player games: four players move along the paths labeled in the map. (b) A snapshot of four-participant meetings.\n\n3\n\nRelated Models\n\nThe proposed two-level influence model is related to a number of models, namely the mixed-memory Markov model (MMM) [14, 11], the coupled HMM (CHMM) [13], the influence model [1, 2, 6], and dynamical systems trees (DSTs) [10]. MMMs decompose a complex model into mixtures of simpler ones, for example, a $K$th-order Markov model into a mixture of first-order models: $P(S_t | S_{t-1} S_{t-2} \\cdots S_{t-K}) = \\sum_{i=1}^{K} \\alpha_i P(S_t | S_{t-i})$. The CHMM models interactions of multiple Markov chains by directly linking the current state of one stream with the previous states of all the streams (including itself): $P(S_t^i | S_{t-1}^1 S_{t-1}^2 \\cdots S_{t-1}^N)$. However, the model becomes computationally intractable for more than two streams. The influence model [1, 2, 6] simplifies the state transition distribution of the CHMM into a convex combination of pairwise conditional distributions, i.e., $P(S_t^i | S_{t-1}^1 S_{t-1}^2 \\cdots S_{t-1}^N) = \\sum_{j=1}^{N} \\alpha_{ji} P(S_t^i | S_{t-1}^j)$. 
We can see that the influence model and the MMM take the same strategy to reduce complex models with large state spaces to a combination of simpler ones with smaller state spaces. In [2, 6], the influence model was used to analyze speaking patterns in conversations (i.e., turn-taking) to determine how much influence one participant has on others. In such a model, $\\alpha_{ji}$ is regarded as the influence of the $j$th player on the $i$th player.\n\nAll these models, however, limit themselves to modeling the interactions between individual players, i.e., the influence of one player on another player. The proposed two-level influence model extends these models by using the group-level variable $S^G$, which allows us to model the influence between all the players and the team, $P(S_t^G | S_t^1 S_t^2 \\cdots S_t^N) = \\sum_{i=1}^{N} \\alpha_i P(S_t^G | S_t^i)$, and by additionally conditioning the dynamics of each player on the team state, $P(S_{t+1}^i | S_t^i, S_t^G)$. DSTs [10] have a tree structure that models interacting processes through parent hidden Markov chains. There are two differences between DSTs and our model: (1) In DSTs, the parent chain has its own Markovian dynamics, while the team state of our model is not directly influenced by the previous team state. Thus, our model captures the emergent phenomena in which the group action is \"nothing more\" than the aggregate behavior of individuals, yet it provides a useful level of representation beyond individual actions. (2) The influence between players and team in our model is bidirectional (up and down arrows in Fig. 1(b)). In DSTs, the influence between child and parent chains is unidirectional: parent chains can influence child chains, while child chains cannot influence their parent chains.\n\n4\n\nExperiments on Synthetic Data\n\nWe first test our model on multi-player synthetic games, in which four players (labeled A-D) move along a number of predetermined paths manually labeled in a map (Fig. 
2(a)), based on the following rules:\nGame I: Player A moves randomly. Players B and C meticulously follow player A. Player D moves randomly.\nGame II: Player A moves randomly. Player B meticulously follows player A. Player C moves randomly. Player D meticulously follows player C.\nGame III: All four players, A, B, C, and D, move randomly.\nA follower moves randomly until it lies on the same path as its target, and after that it tries to reach the target by following the target's direction.\n\nFigure 3: Influence values with respect to the EM iterations in different games (one panel each for Game I, Game II, and Game III; x-axis: EM iterations; y-axis: influence value; one curve per player A-D).\n\nThe initial positions and speeds of the players are randomly generated. The observation of an individual player is its motion trajectory in the form of a sequence of positions, $(x_1, y_1), (x_2, y_2), \\ldots, (x_t, y_t)$, each of which belongs to one of 20 predetermined paths in the map. Therefore, we set $N_S = 20$. The number of team states is set to $N_G = 5$. In experiments, we found that the final results were not sensitive to the specific number of team states for this dataset over a wide range. The length of each game sequence is $T = 2000$ frames. EM iterations were stopped once the relative difference in the global log likelihood was less than 2%. Fig. 3 shows the learned influence value for each of the four players in the different games with respect to the number of EM iterations. For Game I, player A is the leader according to the defined rules. The final learned influence value for player A is almost 1, while the influence values of the other three players are almost 0. 
For Game II, players A and C are both leaders according to the defined rules. The learned influence values for players A and C are indeed close to 0.5, which indicates that they have similar influence on the team. For Game III, the four players move randomly, and the learned influence values are around 0.25, which indicates that all players have similar influence on the team. The results on these toy data suggest that our model is capable of learning sensible values for $\\{\\alpha_i\\}$, in good agreement with the concept of influence described above.\n\n5\n\nExperiments on Meeting Data\n\nAs an application of the two-level influence model, we investigate the influence of participants in meetings. Status, dominance, and influence are important concepts in social psychology for which our model could be particularly suitable in a (dynamic) conversational setting [8]. We used a public meeting corpus (available at http://mmm.idiap.ch), which consists of 30 five-minute four-participant meetings collected in a room equipped with synchronized multi-channel audio and video recorders [12]. A snapshot of a meeting is shown in Fig. 2 (b). These meetings have pre-defined topics and an action agenda, designed to ensure discussions and monologues. Manual speech transcripts are also available. We first describe how we manually collected influence judgements and the performance measure we used. We then report our results using audio and language features, compared with simple baseline methods.\n\n5.1 Manually Labeling Influence Values and the Performance Measure\n\nThe manual annotation of the influence of meeting participants is to some degree a subjective task, as a definite ground truth does not exist. In our case, each meeting was labeled by three independent annotators who had no access to any information about the participants (e.g. job titles and names). This was enforced to avoid any bias based on prior knowledge of the meeting participants (e.g. 
a student would probably assign a large influence value to his supervisor). After watching an entire meeting, the three annotators were asked to assign a probability-based value (ranging from 0 to 1, all adding up to 1) to each meeting participant, which indicated that participant's influence in the meeting (Fig. 5(b-d)). From the three annotations, we computed the pairwise Kappa statistics [7], a commonly used measure of inter-rater agreement. The obtained pairwise Kappa ranges between 0.68 and 0.72, which demonstrates good agreement among the different annotators. We estimated the ground-truth influence values by averaging the results from the three annotators (Fig. 5(a)). We use the Kullback-Leibler (KL) divergence to evaluate the results. For the $j$th meeting, given an automatically determined influence distribution $\\tilde{P}(Q)$ and the ground-truth influence distribution $P(Q)$, the KL divergence is given by $D^j(\\tilde{P} \\| P) = \\sum_{i=1}^{N} \\tilde{P}(Q = i) \\log_2 \\frac{\\tilde{P}(Q = i)}{P(Q = i)}$, where $N$ is the number of participants. The smaller $D^j$, the better the performance (if $\\tilde{P} = P$, then $D^j = 0$). Note that the KL divergence is not symmetric. We calculate the average KL divergence over all the meetings: $D = \\frac{1}{M} \\sum_{j=1}^{M} D^j(\\tilde{P} \\| P)$, where $M$ is the number of meetings.\n\nFigure 4: Illustration of state sequences using audio and language features, respectively. Using audio, there are two states: speaking and silence. Using language, the number of states equals the number of PLSA topics plus one silence state. In the illustrated timelines, for example, the audio state sequence 0011100001111100111111000 corresponds to the language state sequence 0022200003333300444444000.\n\n5.2 Audio and Language Features\n\nWe first extract audio features useful for detecting speaking turns in conversations. We compute the SRP-PHAT measure using the signals from an 8-microphone array [12], which is a continuous value indicating the speech activity from a particular participant. 
We use a Gaussian emission probability and set $N_S = 2$, with the two states corresponding to speaking and non-speaking (silence), respectively (Fig. 4). Additionally, language features were extracted from manual transcripts. After removing stop words, the meeting corpus contains 2175 unique terms. We then employed probabilistic latent semantic analysis (PLSA) [9], a language model that projects documents from the high-dimensional bag-of-words space into a topic-based space of lower dimension. Each dimension in this new space represents a \"topic\", and each document is represented as a mixture of topics. In our case, a document corresponds to one speech utterance $(t_s, t_e, w_1 w_2 \\cdots w_k)$, where $t_s$ is the start time, $t_e$ is the end time, and $w_1 w_2 \\cdots w_k$ is a sequence of words. PLSA is thus used as a feature extractor that could potentially capture \"topic turns\" in meetings. We embedded PLSA into our model by treating the states of individual players as instances of PLSA topics (similar to [5]). Therefore, the PLSA model determines the emission probability in Eq. 5. We repeat the PLSA topic within the same utterance ($t_s \\le t \\le t_e$). The topic for the silence segments was set to 0 (Fig. 4). Using audio-only features can be seen as a special case of using language features in which the PLSA model has only one topic (i.e., all utterances belong to the same topic). We set 10 topics in PLSA ($N_S = 10$), and set $N_G = 5$ using simple, reasonable a priori knowledge. EM iterations were stopped once the relative difference in the global log likelihood was less than 2%.\n\n5.3 Results and Discussions\n\nWe compare our model with a method based on the speaking length (how much time each of the participants speaks). In this case, the influence value of a meeting participant is defined to be proportional to his speaking length: $P(Q = i) = L_i / \\sum_{i=1}^{N} L_i$, where $L_i$ is the speaking length of participant $i$. 
As a second baseline, we randomly generated 1000 combinations of influence values (under the constraint that the four values sum to 1) and report the average performance. The results are shown in Table 1 (left) and Fig. 5(e-h). The results of the three methods, model + language, model + audio, and speaking-length (Fig. 5 (e-g)), are significantly better than the result of randomization (Fig. 5 (h)).\n\nFigure 5: Influence values of the 4 participants (y-axis) in the 30 meetings (x-axis). (a) Ground truth (average of the three human annotations $A_1$, $A_2$, $A_3$). (b) $A_1$: human annotation 1. (c) $A_2$: human annotation 2. (d) $A_3$: human annotation 3. (e) Our model + language. (f) Our model + audio. (g) Speaking-length. (h) Randomization.\n\nTable 1: Results on meetings (\"model\" denotes the two-level influence model).\nLeft:\nMethod | KL divergence\nmodel + Language | 0.106\nmodel + Audio | 0.135\nSpeaking length | 0.226\nRandomization | 0.863\nRight:\nHuman annotation | KL divergence\n$A_i$ vs. $A_j$ | 0.090\n$A_i$ vs. $\\bar{A}_i$ | 0.053\n$A_i$ vs. GT | 0.037\n\nUsing language features with our model achieves the best performance. Our model (using either audio or language features) outperforms the speaking-length based method, which suggests that the learned influence distributions are in better accordance with the influence distributions from human judgements. As shown in Fig. 4, using audio features can be seen as a special case of using language features. 
We use language features to capture \"topic turns\" by factorizing the two states \"speaking, silence\" into more states: \"topic1, topic2, ..., silence\". The result using language features is better than that using audio features. In other words, compared with \"speaking turns\", \"topic turns\" improve the ability of our model to learn the influence of participants in meetings. It is interesting to look at the KL divergence between any pair of the three human annotations ($A_i$ vs. $A_j$), any one against the average of the others ($A_i$ vs. $\\bar{A}_i$), and any one against the ground truth ($A_i$ vs. GT). The average results are shown in Table 1 (right). The result of \"$A_i$ vs. GT\" is the best, which is reasonable since \"GT\" is the average of $A_1$, $A_2$, and $A_3$. Fig. 6(a) shows the histogram of KL divergence between any pair of human annotations for the 30 meetings. The histogram has a distribution with $\\mu = 0.09$, $\\sigma = 0.11$. The results of our model (language: 0.106, audio: 0.135) are very close to the mean ($\\mu = 0.09$), which indicates that our model is comparable to human performance. With our model, we can calculate the cumulative influence of each meeting participant over time. Fig. 6(b) shows such an example using the two-level influence model with audio features. We can see that the cumulative influence is related to the meeting agenda: the meeting starts with the monologue of person1 (monologue1). The influence of person1 is almost 1, while the influences of the other persons are nearly 0. When the four participants are involved in a discussion, the influence of person1 decreases, and the influences of the other three persons increase. The influence of person4 increases quickly during monologue4. The final influence of the participants becomes stable in the second discussion.\n\nFigure 6: (a) Histogram of KL divergence between any pair of the human annotations ($A_i$ vs. $A_j$) for the 30 meetings. (b) The evolution of cumulative influence over time (5 minutes). The dotted vertical lines indicate the predefined meeting agenda.\n\n6\n\nConclusions\n\nWe have presented a two-level influence model that learns the influence of all players within a team. The model has a two-level structure: individual-level and group-level. The individual level models the actions of individual players, and the group level models the group as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus showed the effectiveness of the proposed model. More generally, we anticipate that our approach to multi-level influence modeling may provide a means for analyzing a wide range of social dynamics to infer patterns of emergent group behaviors.\n\nAcknowledgements\nThis work was supported by the Swiss National Center of Competence in Research on Interactive Multimodal Information Management (IM2), and the EC project AMI (Augmented Multi-Party Interaction) (pub. AMI-124). We thank Florent Monay (IDIAP) and Jeff Bilmes (University of Washington) for sharing PLSA code and the GMTK. We also thank the annotators for their efforts.\n\nReferences\n[1] C. Asavathiratham. The influence model: A tractable representation for the dynamics of networked markov chains. Ph.D. dissertation, Dept. of EECS, MIT, Cambridge, 2000.\n[2] S. Basu, T. Choudhury, B. Clarkson, and A. Pentland. Learning human interactions with the influence model. MIT Media Laboratory Technical Note No. 539, 2001.\n[3] J. Bilmes. Dynamic bayesian multinets. In Uncertainty in Artificial Intelligence, 2000.\n[4] J. Bilmes and G. Zweig. The graphical models toolkit: An open source software system for speech and time series processing. Proc. ICASSP, vol. 4:3916-3919, 2002.\n
[5] D. Blei and P. Moreno. Topic segmentation with an aspect hidden markov model. Proc. of ACM SIGIR conference on Research and development in information retrieval, pages 343-348, 2001.\n[6] T. Choudhury and S. Basu. Modeling conversational dynamics as a mixed memory markov process. Proc. of Intl. Conference on Neural Information and Processing Systems (NIPS), 2004.\n[7] J.A. Cohen. A coefficient of agreement for nominal scales. Educ Psych Meas, 20:37-46, 1960.\n[8] S. L. Ellyson and J. F. Dovidio, editors. Power, Dominance, and Nonverbal Behavior. Springer-Verlag, 1985.\n[9] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42:177-196, 2001.\n[10] A. Howard and T. Jebara. Dynamical systems trees. In Uncertainty in Artificial Intelligence'01.\n[11] K. Kirchhoff, S. Parandekar, and J. Bilmes. Mixed-memory markov models for automatic language identification. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2000.\n[12] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang. Automatic analysis of multimodal group actions in meetings. IEEE Transactions on PAMI, volume 27(3), 2005.\n[13] N. Oliver, B. Rosario, and A. Pentland. Graphical models for recognizing human interactions. Proc. of Intl. Conference on Neural Information and Processing Systems (NIPS), 1998.\n[14] L. K. Saul and M. I. Jordan. Mixed memory markov models: Decomposing complex stochastic processes as mixtures of simpler ones. Machine Learning, 37(1):75-87, 1999.\n", "award": [], "sourceid": 2918, "authors": [{"given_name": "Dong", "family_name": "Zhang", "institution": null}, {"given_name": "Daniel", "family_name": "Gatica-perez", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Deb", "family_name": "Roy", "institution": null}]}