{"title": "Confusions over Time: An Interpretable Bayesian Model to Characterize Trends in Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 3261, "page_last": 3269, "abstract": "We propose Confusions over Time (CoT), a novel generative framework which facilitates a multi-granular analysis of the decision making process. The CoT not only models the confusions or error properties of individual decision makers and their evolution over time, but also allows us to obtain diagnostic insights into the collective decision making process in an interpretable manner. To this end, the CoT models the confusions of the decision makers and their evolution over time via time-dependent confusion matrices. Interpretable insights are obtained by grouping similar decision makers (and items being judged) into clusters and representing each such cluster with an appropriate prototype and identifying the most important features characterizing the cluster via a subspace feature indicator vector. Experimentation with real world data on bail decisions, asthma treatments, and insurance policy approval decisions demonstrates that CoT can accurately model and explain the confusions of decision makers and their evolution over time.", "full_text": "Confusions over Time: An Interpretable Bayesian\nModel to Characterize Trends in Decision Making\n\nHimabindu Lakkaraju\n\nDepartment of Computer Science\n\nStanford University\n\nhimalv@cs.stanford.edu\n\nJure Leskovec\n\nDepartment of Computer Science\n\nStanford University\n\njure@cs.stanford.edu\n\nAbstract\n\nWe propose Confusions over Time (CoT), a novel generative framework which\nfacilitates a multi-granular analysis of the decision making process. The CoT\nnot only models the confusions or error properties of individual decision makers\nand their evolution over time, but also allows us to obtain diagnostic insights into\nthe collective decision making process in an interpretable manner. 
To this end,\nthe CoT models the confusions of the decision makers and their evolution over\ntime via time-dependent confusion matrices. Interpretable insights are obtained\nby grouping similar decision makers (and items being judged) into clusters and\nrepresenting each such cluster with an appropriate prototype and identifying the\nmost important features characterizing the cluster via a subspace feature indicator\nvector. Experimentation with real world data on bail decisions, asthma treatments,\nand insurance policy approval decisions demonstrates that CoT can accurately\nmodel and explain the confusions of decision makers and their evolution over\ntime.\n\n1\n\nIntroduction\n\nSeveral diverse domains such as judiciary, health care, and insurance rely heavily on human decision\nmaking. Since decisions of judges, doctors, and other decision makers impact millions of people, it is\nimportant to reduce errors in judgement. The \ufb01rst step towards reducing such errors and improving\nthe quality of decision making is to diagnose the errors being made by the decision makers. It is\ncrucial to not only identify the errors made by individual decision makers and how they change over\ntime, but also to determine common patterns of errors encountered in the collective decision making\nprocess. This turns out to be quite challenging in practice partly because there is no ground truth\nwhich captures the optimal decision in a given scenario.\nPrior research has mainly focussed on modeling decisions of individual decision makers [2, 12]. For\ninstance, the Dawid-Skene model [2] assumes that each item (eg., a patient) has an underlying true\nlabel (eg., a particular treatment) and a decision maker\u2019s evaluation of the item will be masked by\nher own biases and confusions. 
Confusions of each individual decision maker j are modeled using a\nlatent confusion matrix \u0398j, where an entry in the (p, q) cell denotes the probability that an item with\ntrue label p will be assigned a label q by the decision maker j. The true labels of items and latent\nconfusion matrices of decision makers are jointly inferred as part of the inference process. However,\na major drawback of the Dawid-Skene framework and several of its extensions [16, 17, 12] is that\nthey do not provide any diagnostic insights into the collective decision making process. Furthermore,\nnone of these approaches account for temporal changes in the confusions of decision makers.\nHere, we propose a novel Bayesian framework, Confusions over Time (CoT), which jointly: 1)\nmodels the confusions of individual decision makers 2) captures the temporal dynamics of their\ndecision making 3) provides interpretable insights into the collective decision making process, and 4)\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\finfers true labels of items. While there has been prior research on each of the aforementioned aspects\nindependently, there has not been a single framework which ties all of them together in a principled\nyet simple way.\nThe modeling process of the CoT groups decision makers (and items) into clusters. Each such cluster\nis associated with a subspace feature indicator vector which determines the most important features\nthat characterize the cluster, and a prototype which is the representative data point for that cluster. The\nprototypes and the subspace feature indicator vectors together contribute to obtaining interpretable\ninsights into the decision making process. The decisions made by decision makers on items are\nmodeled as interactions between such clusters. 
More speci\ufb01cally, each pair of (decision maker,\nitem) clusters is associated with a set of latent confusion matrices, one for each discrete time instant.\nThe decisions are then modeled as multinomial variables sampled from such confusion matrices.\nThe inference process involves jointly inferring cluster assignments, the latent confusion matrices,\nprototypes and feature indicator vectors corresponding to each of the clusters, and true labels of items\nusing a collapsed Gibbs sampling procedure.\nWe analyze the performance of CoT on three real-world datasets: (1) judicial bail decisions; (2)\ntreatment recommendations for asthma patients; (3) decisions to approve/deny insurance requests.\nExperimental results demonstrate that the proposed framework is very effective at inferring true\nlabels of items, predicting decisions made by decision makers, and providing diagnostic insights into\nthe collective decision making process.\n\n2 Related Work\n\nHere, we provide an overview of related research on modeling decision making. We also highlight\nthe connections of this work to two other related yet different research directions: stochastic block\nmodels, and interpretable models.\n\nModeling decision making. There has been a renewed interest in analyzing and understanding\nhuman decisions due to the recent surge in applications in crowdsourcing, public policy, and edu-\ncation [6]. Prior research in this area has primarily focussed on the following problems: inferring\ntrue labels of items from human annotations [17, 6, 18, 3], inferring the expertise of decision mak-\ners [5, 19, 21], analyzing confusions or error properties of individual decision makers [2, 12, 10], and\nobtaining diagnostic insights into the collective decision making process [10].\nWhile some of the prior work has addressed each of these problems independently, there have been\nvery few attempts to unify the aforementioned directions. For instance, Whitehill et. al. 
[19] proposed\na model which jointly infers the true labels and estimate of evaluator\u2019s quality by modeling decisions\nas functions of the expertise levels of decision makers and the dif\ufb01culty levels of items. However, this\napproach neither models the error properties of decision makers, nor provides any diagnostic insights\ninto the process of decision making. Approaches proposed by Skene et al. [2] and Liu et al. [12]\nmodel the confusions of individual decision makers and also estimate the true labels of items, but fail\nto provide any diagnostic insights into the patterns of collective decisions. Recently, Lakkaraju et al.\nproposed a framework [10] which also provides diagnostic insights but it requires a post-processing\nstep employing Apriori algorithm to obtain these insights. Furthermore, none of the aforementioned\napproaches model the temporal dynamics of decision making.\n\nStochastic block models. There has been a long line of research on modeling relational data using\nstochastic block models [15, 7, 20]. These modeling techniques typically involved grouping entities\n(eg., nodes in a graph) such that interactions between these entities (eg., edges in a graph) are governed\nby their corresponding clusters. However, these approaches do not model the nuances of decision\nmaking such as confusions of decision makers which is crucial to our work.\n\nInterpretable models. A large body of machine learning literature focused on developing inter-\npretable models for classi\ufb01cation [11, 9, 13, 1] and clustering [8]. To this end, various classes of\nmodels such as decision lists [11], decision sets [9], prototype (case) based models [1], and general-\nized additive models [13] were proposed. 
However, none of these approaches can be readily applied to determine error properties of decision makers.\n\n3 Confusions over Time Model\n\nIn this section, we present CoT, a novel Bayesian framework which facilitates an interpretable, multi-granular analysis of the decision making process. We begin by discussing the problem setting and then dive into the details of modeling and inference.\n\n3.1 Setting\n\nLet J and I denote the sets of decision makers and items respectively. Each item is judged by one or more decision makers. In domains such as judiciary and health care, each defendant or patient is typically assessed by a single decision maker, whereas each item is evaluated by multiple decision makers in settings such as crowdsourcing. Our framework can handle either scenario and does not make any assumptions about the number of decision makers required to evaluate an item. However, we do assume that each item is judged no more than once by any given decision maker. The decision made by a decision maker j about an item i is denoted by ri,j. Each decision ri,j is associated with a discrete time stamp ti,j \u2208 {1, 2, \u00b7\u00b7\u00b7, T} corresponding to the time instant when the item i was evaluated by the decision maker j.\nEach decision maker has M different features or attributes and a(j)m denotes the value of the mth feature of decision maker j. Similarly, each item has N different features or attributes and b(i)n represents the value of the nth feature of item i. Each item i is associated with a true label zi \u2208 {1, 2, \u00b7\u00b7\u00b7, K}. zi is not observed in the data and is modeled as a latent variable. 
This mimics most real-world scenarios where the ground-truth capturing the optimal decision or the true label is often not available.\n\n3.2 Defining Confusions over Time (CoT) model\n\nThe CoT model jointly addresses the problems of modeling confusions of individual decision makers and how these confusions change over time, and also provides interpretable diagnostic insights into the collective decision making process. Each of these aspects is captured by the generative process of the CoT model, which comprises the following components: (1) cluster assignments; (2) prototype selection and subspace feature indicator generation for each of the clusters; (3) true label generation for each of the items; (4) time-dependent confusion matrices. Below, we describe each of these components and highlight the connections between them.\n\nCluster Assignments. The CoT model groups decision makers and items into clusters. The model assumes that there are L1 decision maker clusters and L2 item clusters. The values of L1, L2 are assumed to be available in advance. Each decision maker j is assigned to a cluster cj, which is sampled from a multinomial distribution with a uniform Dirichlet prior \u03b1. Similarly, each item i is associated with a cluster di, sampled from a multinomial distribution with a uniform Dirichlet prior \u03b1\u2032. The features of decision makers and the decisions that they make depend on the clusters they belong to. Analogously, the true labels of items, their features and the decisions involving them are influenced by the clusters to which they are affiliated.\n\nPrototype and Subspace Feature Indicator. The interpretability of the CoT stems from the following two crucial components: associating each decision maker and item cluster with a prototype or an exemplar, and a subspace feature indicator which is a binary vector indicating which features are important in characterizing the cluster. 
The prototype pc of a decision maker cluster c is obtained by sampling uniformly over all decision makers 1, \u00b7\u00b7\u00b7, |J|, i.e., pc \u223c Uniform(1, |J|). The subspace feature indicator, \u03c9c, of the cluster c is a binary vector of length M. An element of this vector, \u03c9c,f , corresponds to the feature f and indicates if that feature is important (\u03c9c,f = 1) in characterizing the cluster c. \u03c9c,f \u2208 {0, 1} is sampled from a Bernoulli distribution. The prototype p\u2032d and subspace feature indicator vector \u03c9\u2032d corresponding to an item cluster d are defined analogously.\nGenerating the features: The prototype and the subspace feature indicator vector together provide a template for generating feature values of the cluster members. More specifically, if the mth feature is designated as an important feature for cluster c, then instances in that cluster are very likely to inherit the corresponding feature value from the prototype datapoint pc.\nWe sample the value of a discrete feature m corresponding to decision maker j, a(j)m, from a multinomial distribution \u03c6cj,m where cj denotes the cluster to which j belongs. \u03c6cj,m is in turn sampled from a Dirichlet distribution parameterized by the vector gpcj,m,\u03c9cj,m,\u03bb, i.e., \u03c6cj,m \u223c Dirichlet(gpcj,m,\u03c9cj,m,\u03bb). gpc,m,\u03c9c,m,\u03bb is a vector defined such that the eth element of the vector corresponds to the prior on the eth possible value of the mth feature. The eth element of this vector is defined as:\n\ngpc,m,\u03c9c,m,\u03bb(e) = \u03bb (1 + \u00b5 \u00b7 1[\u03c9c,m = 1 and pc,m = Vm,e])    (1)\n\nwhere 1 denotes the indicator function, and \u03bb and \u00b5 are the hyperparameters which determine the extent to which the prototype will be copied by the cluster members. Vm,e denotes the eth possible value of the mth feature. 
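To make Equation 1 concrete, the prior vector and the resulting feature sampling can be sketched in Python; the feature, its values, and the hyperparameter settings (\u03bb = 1, \u00b5 = 5) are illustrative placeholders, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_vector(n_values, prototype_value, omega_cm, lam=1.0, mu=5.0):
    # Eq. (1): uniform prior lam on every value, boosted to lam * (1 + mu)
    # on the prototype's value when the feature is important (omega_cm == 1).
    g = np.full(n_values, lam)
    if omega_cm == 1:
        g[prototype_value] += lam * mu
    return g

# Feature "gender" with values {male, female, NA}; the prototype takes value 1 (female).
g_important = prior_vector(3, prototype_value=1, omega_cm=1)   # [1.0, 6.0, 1.0]
g_irrelevant = prior_vector(3, prototype_value=1, omega_cm=0)  # [1.0, 1.0, 1.0]

# phi ~ Dirichlet(g); cluster members then copy the prototype's value more often.
phi = rng.dirichlet(g_important)
feature_value = rng.choice(3, p=phi)
```

With \u03c9c,m = 1 the prior puts expected mass 6/8 = 0.75 on the prototype's value here; with \u03c9c,m = 0 all three values are equally likely a priori.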
For example, let us assume that the mth feature corresponds to gender which can take one of the following values: {male, female, NA}; then Vm,1 represents the value male, Vm,2 denotes the value female, and so on.\nEquation 1 can be explained as follows: if the mth feature is irrelevant to the cluster c (i.e., \u03c9c,m = 0), then \u03c6c,m will be sampled from a Dirichlet distribution with a uniform prior \u03bb. On the other hand, if \u03c9c,m = 1, then \u03c6c,m has a prior of \u03bb(1 + \u00b5) on the feature value which matches the prototype\u2019s feature value, and a prior of just \u03bb on all the other possible feature values. The larger the value of \u00b5, the higher the likelihood that the cluster members assume the same feature value as that of the prototype.\nValues of continuous features are sampled in an analogous fashion. We model continuous features as Gaussian distributions. If a particular continuous feature is designated as an important feature for some cluster c, then the mean of the Gaussian distribution corresponding to this feature is set to be equal to the corresponding feature value of the prototype pc; otherwise the mean is set to 0. The variance of the Gaussian distribution is set to \u03c3 for all continuous features.\nThough the above exposition focused on clusters of decision makers, we can generate feature values of items in a similar manner. Feature values of items belonging to some cluster d are sampled from the corresponding feature distributions \u03c6\u2032d, which are in turn sampled from priors which account for the prototype p\u2032d and subspace feature indicator \u03c9\u2032d.\nTrue Labels of Items. Our model assumes that every item i is associated with a true label zi. 
This true label is sampled from a multinomial distribution \u03c1di where di is the cluster to which i belongs. \u03c1di is sampled from a Dirichlet prior which ensures that the true labels of the members of cluster di conform to the true label of the prototype. The prior is defined using a vector g\u2032p\u2032d and each element of this vector can be computed as:\n\ng\u2032p\u2032d(e) = \u03bb (1 + \u00b5 \u00b7 1[zp\u2032d = e])    (2)\n\nNote that Equation 2 assigns a higher prior to the label which is the same as that of the cluster\u2019s prototype. The larger the value of \u00b5, the higher the likelihood that the true labels of all the cluster members will be the same as that of the prototype.\n\nTime Dependent Confusion Matrices. Each pair of decision maker-item clusters (c, d) is associated with a set of latent confusion matrices \u0398(t)c,d, one for each discrete time instant t. These confusion matrices influence how the decision makers in the cluster c judge the items in d and also allow us to study how decision maker confusions change with time.\nEach confusion matrix is of size K \u00d7 K where K denotes the number of possible values that an item can be labeled with. Each entry (p, q) in a confusion matrix \u0398 determines the probability that an item with true label p will be assigned the label q. Higher probability mass on the diagonal signifies accurate decisions.\nLet us consider the confusion matrix corresponding to decision maker cluster c, item cluster d, and time instant 1 (the first time instant): \u0398(1)c,d. Each row of this matrix, denoted by \u0398(1)c,d,z (z is the row index), is sampled from a Dirichlet distribution with a uniform prior \u2227. The CoT framework also models the dependencies between the confusion matrices at consecutive time instants via a trade-off parameter \u03c0. The magnitude of \u03c0 determines how similar \u0398(t+1)c,d is to \u0398(t)c,d. The row z of the confusion matrix \u0398(t+1)c,d is sampled as follows:\n\n\u0398(t+1)c,d,z \u223c Dirichlet(h\u0398(t)c,d,z,\u2227), where the eth element is h\u0398(t)c,d,z,\u2227(e) = \u2227 (1 + \u03c0 \u00b7 \u0398(t)c,d,z(e))    (3)\n\nFigure 1: Plate notation for the CoT model. Each block is annotated with descriptive text. The hyperparameters \u03bb, \u00b5 are omitted to improve readability.\n\nGenerating the decisions: Our model assumes that the decision rj,i made by a decision maker j about an item i depends on the clusters cj, di that j and i belong to, the time instant ti,j when the decision was made, and the true label zi of item i. More specifically, rj,i \u223c Multinomial(\u0398(ti,j)cj,di,zi).\nComplete Generative Process. Please refer to the Appendix for the complete generative process of CoT. The graphical representation of CoT is shown in Figure 1.\n\n3.3 Inference\n\nWe use a collapsed Gibbs sampling [4] approach to infer the latent variables of the CoT framework. This technique involves integrating out all the intermediate latent variables \u0398, \u03c6, \u03c6\u2032, \u03c1 and sampling only the variables corresponding to prototypes pc, p\u2032d, subspace feature indicator vectors \u03c9c,m, \u03c9\u2032d,n, cluster assignments cj, di and item labels zi. The update equation for pc is given by\u00b9:\n\np(pc = q | z, c, d, \u03c9, \u03c9\u2032, p\u2032, p(\u2212c)) \u221d \u220f(m=1..M) (uc,m,qm)^1(\u03c9c,m = 1)    (4)\n\nwhere uc,m,qm denotes the number of instances belonging to cluster c for which the discrete feature m takes the value qm. qm denotes the value of the feature m corresponding to the decision maker q. The update equation for p\u2032d can be derived analogously. 
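As a concrete illustration of the temporal component, the chained Dirichlet prior of Equation 3 can be sketched as follows; K, T, \u2227 and \u03c0 are set to arbitrary placeholder values here, not the settings used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 3, 3                       # number of labels and of time instants (placeholders)
wedge, pi_tradeoff = 1.0, 10.0    # uniform prior (the wedge symbol) and trade-off pi

# Confusion matrices for one (decision maker cluster, item cluster) pair over time.
theta = np.empty((T, K, K))
for z in range(K):        # t = 1: each row drawn from a uniform Dirichlet
    theta[0, z] = rng.dirichlet(np.full(K, wedge))
for t in range(1, T):     # t > 1: prior tied to the previous time instant, Eq. (3)
    for z in range(K):
        h = wedge * (1.0 + pi_tradeoff * theta[t - 1, z])
        theta[t, z] = rng.dirichlet(h)

# A decision is a draw from the row of the item's true label z at time t.
z_true, t = 1, 2
decision = rng.choice(K, p=theta[t, z_true])
```

The larger pi_tradeoff is, the more each \u0398(t+1) concentrates around \u0398(t), which is exactly the smoothness the trade-off parameter \u03c0 controls.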
The conditional distribution for \u03c9c,m, obtained by integrating out the \u03c6 variables, is:\n\np(\u03c9c,m = s | z, c, d, \u03c9\u2212(c,m), \u03c9\u2032, p\u2032, p, \u03bb) \u221d \u03b2 \u00d7 B(gpc,m,1,\u03bb + \u0169c) / B(gpc,m,1,\u03bb) if s = 1, and (1 \u2212 \u03b2) \u00d7 B(gpc,m,0,\u03bb + \u0169c) / B(gpc,m,0,\u03bb) otherwise    (5)\n\nwhere \u0169c denotes the number of decision makers belonging to cluster c and B denotes the Beta function. The conditional distribution for \u03c9\u2032d,n can be written in an analogous manner. The conditional distributions for cj, di and zi can be derived as described in [10].\n\n4 Experimental Evaluation\n\nIn this section, we present the evaluation of the CoT model on a variety of datasets. Our experiments are designed to evaluate the performance of our model on a variety of tasks such as recovering confusion matrices, predicting item labels, and predicting decisions made by decision makers. 
We also study the interpretability aspects of our model by evaluating the insights obtained.\n\n\u00b9Due to space limitations, we present the update equations assuming that the features of decision makers and items are discrete.\n\nDataset | # of Evaluators | # of Items | # of Decisions | Evaluator Features | Item Features\nBail | 252 | 250,500 | 250,500 | # of felony, misd., minor offense cases | Previous arrests, offenses, pays rent, children, gender\nAsthma | 48 | 60,048 | 62,497 | Gender, age, experience, specialty, # of patients seen | Gender, age, asthma history, BMI, allergies\nInsurance | 74 | 49,876 | 50,943 | # of policy decisions, # of construction, chemical, technology decisions | Domain, previous losses, premium amount quoted\n\nTable 1: Summary statistics of our datasets.\n\nDatasets. We evaluate CoT on the following real-world datasets: (1) Bail dataset comprising information about criminal court judges deciding if defendants should be released without any conditions, asked to pay a bail amount, or be locked up (K = 3); here, decision makers are judges and items are defendants. (2) Asthma dataset which captures the treatment recommendations given by doctors to patients. Patients are recommended one of two possible categories of treatments: mild (mild drugs/inhalers) or strong (nebulizers/immunomodulators) (K = 2). (3) Insurance dataset which contains information about client managers deciding if a client company\u2019s insurance request should be approved or denied (K = 2). 
Each of the datasets spans about three years in time. We do have the ground truth of true labels associated with defendants/patients/insurance clients in the form of expert decisions and observed consequences for each of the datasets. Note, however, that we only use a small fraction (5%) of the available true labels during the learning process. The decision makers and items are associated with a variety of features in each of these datasets (Table 1).\n\nBaselines & Ablations. We benchmark the performance of CoT against the following state-of-the-art baselines: Dawid-Skene Model (DS) [2], Single Confusion model (SC) [12], Hybrid Confusion Model (HC) [12], and Joint Confusion Model (JC) [10]. The DS, SC and HC models focus only on modeling decisions of individual decision makers and do not provide any diagnostic insights into the decision making process. The JC model, on the other hand, also provides diagnostic insights (via post-processing). None of the baselines account for the temporal aspects.\nTo evaluate the importance of the various components of our model, we also consider multiple ablations of CoT. Non-temporal CoT (NT-CoT) is a variant of CoT which does not incorporate the temporal component and hence is applicable to a single time instance. Non-interpretable CoT (NI-CoT) is another ablation which does not involve the prototype or subspace feature indicator vector generation; instead, \u03c6, \u03c6\u2032, and \u03c1 are sampled from symmetric Dirichlet priors.\n\nExperimental Setup. In most real-world settings involving human decision making, the true labels of items are available for very few instances. We mimic this setting in our experiments by employing weak supervision. We let all models (including the baselines) access the true labels of about 5% of the items (selected randomly) in the data during the learning phase. In all of our experiments, we divide each dataset into three discrete time chunks. 
Each time chunk corresponds to a year in the data. While our model can handle the temporal aspects explicitly, the same is not true for the baselines or for the Non-temporal CoT ablation. To work around this, we run each of these models separately on data from each time slice. We run the collapsed Gibbs sampling inference until the approximate convergence of the log-likelihood. All the hyperparameters of our model are initialized to standard values: \u03b2 = \u03b2\u2032 = \u2227 = \u03c0 = \u03b1 = \u03b1\u2032 = \u03bb = \u00b5 = 1, \u03c3 = 0.1. The numbers of decision maker and item clusters L1 and L2 were set using the Bayesian Information Criterion (BIC) metric [14]. The parameters of all the other baselines were chosen similarly.\n\n4.1 Evaluating Estimated Confusion Matrices and Predictive Power\n\nWe evaluate CoT on estimating confusion matrices, predicting item labels, and predicting decisions of decision makers. We first present the details of each task and then discuss the results.\n\nRecovering Confusion Matrices. We experiment with the CoT model to determine how accurately it can recover decision maker confusion matrices. 
To measure this, we use the Mean Absolute Error (MAE) metric to compare the elements of the estimated confusion matrix (\u0398\u2032) and the observed confusion matrix (\u0398). The MAE of two such matrices is the mean of the element-wise absolute differences:\n\nMAE(\u0398, \u0398\u2032) = (1/K\u00b2) \u03a3u=1..K \u03a3v=1..K |\u0398u,v \u2212 \u0398\u2032u,v|\n\nWhile the baseline models SC, DS, HC associate a single confusion matrix with each decision maker, the baseline JC and our model assume that each decision maker can have multiple confusion matrices (one per item cluster). To ensure a fair comparison, we apply the MAE metric every time a decision maker judges an item, choosing the appropriate confusion matrix, and then compute the average MAE.\n\nMethod | Item labels: Bail / Asthma / Insurance | Confusion matrices (MAE): Bail / Asthma / Insurance | Decisions: Bail / Asthma / Insurance\nSC | 0.53 / 0.59 / 0.51 | 0.38 / 0.31 / 0.40 | 0.52 / 0.51 / 0.55\nDS | 0.61 / 0.63 / 0.64 | 0.32 / 0.28 / 0.36 | 0.56 / 0.58 / 0.58\nHC | 0.62 / 0.65 / 0.66 | 0.31 / 0.26 / 0.33 | 0.59 / 0.64 / 0.61\nJC | 0.64 / 0.68 / 0.69 | 0.26 / 0.19 / 0.29 | 0.64 / 0.67 / 0.66\nLR | 0.56 / 0.60 / 0.57 | \u2013 | 0.58 / 0.60 / 0.57\nNT-CoT | 0.65 / 0.68 / 0.69 | 0.24 / 0.19 / 0.28 | 0.66 / 0.68 / 0.66\nNI-CoT | 0.69 / 0.70 / 0.70 | 0.21 / 0.18 / 0.26 | 0.67 / 0.70 / 0.68\nCoT | 0.71 / 0.72 / 0.74 | 0.19 / 0.16 / 0.23 | 0.69 / 0.74 / 0.71\nGain % | 9.86 / 5.56 / 6.76 | 36.84 / 18.75 / 26.09 | 7.25 / 9.46 / 7.04\n\nTable 2: Experimental results: CoT consistently performs best across all tasks and datasets. The bottom row of the table indicates the percentage gain of CoT over the best performing baseline, JC.\n\nPredicting Item Labels. We also evaluate the CoT on the task of predicting item labels. We use the AUC ROC metric to measure the predictive performance. In addition to the previously discussed baselines, we also compare the performance against a Logistic Regression (LR) classifier. 
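Returning to the MAE metric defined above, a minimal sketch of the matrix comparison (the 2 \u00d7 2 matrices are toy values, not numbers from the experiments):

```python
import numpy as np

def mae(theta, theta_hat):
    # Mean absolute difference over all K x K entries of two confusion matrices.
    K = theta.shape[0]
    return np.abs(theta - theta_hat).sum() / K ** 2

observed = np.array([[0.8, 0.2],
                     [0.3, 0.7]])
estimated = np.array([[0.7, 0.3],
                      [0.2, 0.8]])

err = mae(observed, estimated)   # sum of |differences| is 0.4; divided by K^2 = 4 gives 0.1
```

Identical matrices give an MAE of 0, and larger values indicate a worse recovery of the decision maker's confusions.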
The LR model was provided decision maker and item features, time stamps, and decision maker decisions as input features.\n\nPredicting Evaluator Decisions. The CoT model can also be used to predict decision maker decisions. Recall that the decision maker decisions are regarded as observed variables throughout the inference. However, we can leverage the values of all the latent variables learned during inference to predict the decision maker decisions. In order to execute this task, we divide the data into 10 folds and carry out the Gibbs sampling inference procedure on the first 9 folds where the decision maker decisions are observed. We then use the estimated latent variables to sample the decision maker decisions for the remaining fold. We repeat this process over each of the 10 folds and report the average AUC.\n\nResults and Discussion. Results of all the tasks are presented in Table 2. CoT outperforms all the baselines and ablations on all the tasks. The SC model, which assumes that all the decision makers share a single confusion matrix, performs extremely poorly compared to the other baselines, indicating that its assumptions are not valid in real-world datasets. The JC model, which groups similar decision makers and items together, turns out to be one of the best performing baselines. The performance of our ablation models indicates that excluding the temporal aspects of the CoT causes a dip in the performance of the model on all the tasks. Furthermore, leaving out the interpretability components affects the model performance slightly. These results demonstrate the utility of the joint inference of temporal and interpretable aspects alongside decision maker confusions and cluster assignments.\n\n4.2 Evaluating Interpretability\n\nIn this section, we first present an evaluation of the quality of the clusters generated by the CoT model. 
We then discuss some of the qualitative insights obtained using CoT.\n\nModel | Purity (Bail / Asthma / Insurance) | Inverse Purity (Bail / Asthma / Insurance)\nJC | 0.67 / 0.71 / 0.74 | 0.63 / 0.66 / 0.67\nNT-CoT | 0.74 / 0.78 / 0.79 | 0.72 / 0.73 / 0.76\nNI-CoT | 0.72 / 0.76 / 0.78 | 0.67 / 0.71 / 0.72\nCoT | 0.83 / 0.84 / 0.81 | 0.78 / 0.79 / 0.82\n\nTable 3: Average purity and inverse purity computed across all decision maker and item clusters.\n\nFigure 2: Estimated (top) and observed (bottom) confusion matrices for the asthma dataset: These matrices correspond to the group of decision makers who have relatively little experience (# of years practising medicine = 0/1) and the group of patients with allergies but no past asthma attacks.\n\nCluster Quality. The prototype and the subspace feature indicator vector of a cluster allow us to understand the nature of the instances in the cluster. For instance, if the subspace feature indicator vector signifies that gender is the one and only important feature for some decision maker cluster c and if the prototype of that cluster has value gender = female, then we can infer that c consists of female decision makers. Since we are able to associate each cluster with such patterns, we can readily define the notions of purity and inverse purity of a cluster.\nConsider cluster c from the example above again. Since the defining pattern of this cluster is gender = female, we can compute the purity of the cluster by calculating what fraction of the decision makers in the cluster are female. Similarly, we can also compute what fraction of all female decision makers are assigned to cluster c. This is referred to as inverse purity. 
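The purity and inverse purity computations described above can be sketched as follows; the toy population and the gender = female pattern are illustrative, not drawn from our datasets:

```python
def matches(member, pattern):
    # A member matches the cluster's defining pattern (the prototype's value on
    # every feature the subspace indicator marks as important).
    return all(member.get(k) == v for k, v in pattern.items())

def purity(cluster_members, pattern):
    # Fraction of the cluster's members that match its defining pattern (homogeneity).
    return sum(matches(m, pattern) for m in cluster_members) / len(cluster_members)

def inverse_purity(cluster_members, population, pattern):
    # Fraction of ALL matching instances assigned to this cluster (completeness).
    matching = [m for m in population if matches(m, pattern)]
    return sum(m in cluster_members for m in matching) / len(matching)

population = [{"id": 1, "gender": "female"}, {"id": 2, "gender": "female"},
              {"id": 3, "gender": "male"},   {"id": 4, "gender": "female"}]
cluster_c = population[:3]            # cluster c: two females, one male
pattern = {"gender": "female"}        # defining pattern of cluster c

p = purity(cluster_c, pattern)                        # 2 of 3 members match
ip = inverse_purity(cluster_c, population, pattern)   # 2 of 3 females are in c
```

Purity penalizes clusters that mix instances not matching the pattern, while inverse purity penalizes clusters that leave matching instances scattered elsewhere.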
While the purity metric captures the notion of cluster homogeneity, the inverse purity metric ensures cluster completeness.

We compute the average purity and inverse purity metrics for the CoT, its ablations, and the baseline JC across all the decision maker and item clusters; the results are presented in Table 3. Notice that CoT outperforms all of its ablations and the JC baseline. It is interesting to note that the non-interpretable CoT (NI-CoT) has much lower purity and inverse purity than both the non-temporal CoT (NT-CoT) and the full CoT. This is partly because NI-CoT does not model prototypes or feature indicators, which leads to less pure clusters.

Qualitative Inspection of Insights. We now inspect the cluster descriptions and the corresponding confusion matrices generated by our approach. Figure 2 shows one of the insights obtained by our model on the asthma dataset. The confusion matrices presented in Figure 2 correspond to the group of doctors with 0/1 years of experience evaluating patients who have allergies but did not suffer from previous asthma attacks. The three confusion matrices, one for each year (from left to right), shown on the top row of Figure 2 correspond to our estimates, and those on the bottom row are computed from the data.
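The observed (bottom-row) matrices in a figure like Figure 2 can be reproduced from raw records by simple counting and row normalization. A minimal sketch, assuming each record for the group of interest is a (period, true_label, decision) triple with a proxy ground-truth label available, as in the asthma data; the function name and record layout are our own:

```python
import numpy as np

def empirical_confusions(records, n_labels, n_periods):
    """Row-normalized confusion matrix per time period.

    records : iterable of (period, true_label, decision) triples for one
              group of decision makers and items (all values 0-indexed)
    Returns an array of shape (n_periods, n_labels, n_labels) whose entry
    [t, i, j] estimates P(decision = j | true label = i) in period t.
    """
    counts = np.zeros((n_periods, n_labels, n_labels))
    for t, true, dec in records:
        counts[t, true, dec] += 1
    row_sums = counts.sum(axis=2, keepdims=True)
    # Leave rows with no observations at zero instead of dividing by zero.
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)

# Toy example with 2 labels and a single period:
M = empirical_confusions([(0, 0, 0), (0, 0, 1), (0, 1, 1)],
                         n_labels=2, n_periods=1)
# M[0] = [[0.5, 0.5], [0.0, 1.0]]
```

Comparing these empirical matrices against the model's estimated time-dependent confusion matrices, period by period, is the check carried out visually in Figure 2.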
It can be seen that the estimated confusion matrices match the ground truth very closely, demonstrating the effectiveness of the CoT framework.

Interpreting the results in Figure 2, we find that doctors within the first year of their practice (leftmost confusion matrix) were recommending stronger treatments (nebulizers and immunomodulators) to patients who were likely to get better with milder treatments such as low impact drugs and inhalers. As time passed, they were able to better identify patients who could get better with milder options. This is a very interesting insight, and we also found that such a pattern holds for client managers with relatively little experience. This could possibly mean that inexperienced decision makers are more likely to be risk averse, and therefore opt for safer choices.

References

[1] J. Bien and R. Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, pages 2403–2424, 2011.

[2] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, (1):20–28, 1979.

[3] O. Dekel and O. Shamir. Vox populi: Collecting high-quality labels from a crowd. In COLT, 2009.

[4] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.

[5] M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Evaluating the crowd with confidence. In KDD, pages 686–694, 2013.

[6] E. Kamar, S. Hacker, and E. Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In AAMAS, pages 467–474, 2012.

[7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, page 5, 2006.

[8] B. Kim, C. Rudin, and J. A. Shah.
The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In NIPS, pages 1952–1960, 2014.

[9] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In KDD, pages 1675–1684, 2016.

[10] H. Lakkaraju, J. Leskovec, J. Kleinberg, and S. Mullainathan. A Bayesian framework for modeling human evaluations. In SDM, pages 181–189, 2015.

[11] B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.

[12] C. Liu and Y.-M. Wang. Truelabel + confusions: A spectrum of probabilistic models in analyzing multiple ratings. In ICML, pages 225–232, 2012.

[13] Y. Lou, R. Caruana, and J. Gehrke. Intelligible models for classification and regression. In KDD, pages 150–158, 2012.

[14] A. A. Neath and J. E. Cavanaugh. The Bayesian information criterion: Background, derivation, and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 4(2):199–203, 2012.

[15] K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077–1087, 2001.

[16] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: Whom to trust when everyone lies a bit. In ICML, pages 889–896, 2009.

[17] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. JMLR, pages 1297–1322, 2010.

[18] R. Snow, B. O'Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In EMNLP, pages 254–263, 2008.

[19] J. Whitehill, P.
Ruvolo, T. Wu, J. Bergsma, and J. R. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, pages 2035–2043, 2009.

[20] K. S. Xu and A. O. Hero III. Dynamic stochastic blockmodels: Statistical models for time-evolving networks. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pages 201–210. Springer, 2013.

[21] D. Zhou, J. C. Platt, S. Basu, and Y. Mao. Learning from the wisdom of crowds by minimax entropy. In NIPS, pages 2204–2212, 2012.