{"title": "Multi-value Rule Sets for Interpretable Classification with Feature-Efficient Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 10835, "page_last": 10845, "abstract": "We present the Multi-value Rule Set (MRS) for interpretable\nclassification with feature efficient presentations. Compared to\nrule sets built from single-value rules, MRS adopts a more\ngeneralized form of association rules that allows multiple values\nin a condition. Rules of this form are more concise than classical\nsingle-value rules in capturing and describing patterns in data.\nOur formulation also pursues a higher efficiency of feature utilization,\nwhich reduces possible cost in data collection and storage.\nWe propose a Bayesian framework for formulating an MRS model\nand develop an efficient inference method for learning a maximum\na posteriori, incorporating theoretically grounded bounds to iteratively\nreduce the search space and improve the search efficiency.\nExperiments on synthetic and real-world data demonstrate that\nMRS models have significantly smaller complexity and fewer features\nthan baseline models while being competitive in predictive\naccuracy.", "full_text": "Multi-value Rule Sets for Interpretable Classi\ufb01cation\n\nwith Feature-Ef\ufb01cient Representations\n\nTong Wang\n\nTippie School of Business\n\nUniversity of Iowa\nIowa City, IA 52242\n\ntong-wang@uiowa.edu\n\nAbstract\n\nWe present the Multi-value Rule Set (MRS) for interpretable classi\ufb01cation with\nfeature ef\ufb01cient presentations. Compared to rule sets built from single-value rules,\nMRS adopts a more generalized form of association rules that allows multiple\nvalues in a condition. Rules of this form are more concise than classical single-\nvalue rules in capturing and describing patterns in data. Our formulation also\npursues a higher ef\ufb01ciency of feature utilization, which reduces possible cost in\ndata collection and storage. 
We propose a Bayesian framework for formulating an MRS model and develop an efficient inference method for learning a maximum a posteriori model, incorporating theoretically grounded bounds to iteratively reduce the search space and improve the search efficiency. Experiments on synthetic and real-world data demonstrate that MRS models have significantly smaller complexity and fewer features than baseline models while being competitive in predictive accuracy. Human evaluations show that MRS is easier to understand and use compared to other rule-based models.

1 Introduction

In many real-world applications of machine learning, human experts desire the interpretability of a model as much as its predictive accuracy. As opposed to "black box" models, interpretable models are easy for humans to understand and extract insights from, which is imperative in domains such as healthcare, law enforcement, etc. On some occasions, the need for interpretability even outweighs that for accuracy due to legal or ethical concerns. Among different forms of interpretable models, we are particularly interested in rule-based models in this paper. This type of model produces decisions based on a set of rules following simple "if-else" logic: if a rule (or a set of rules) is satisfied, the model outputs the corresponding decision. The set of rules can be either ordered [17, 35, 5] or unordered [15, 30, 19, 24], depending on the specific model structure.
Prior rule-based models in the literature are built from single-value rules [15, 30, 19]. For example, [State = California] AND [Marital status = married], where a condition (e.g., [state = California]) is a pair of a feature (e.g., state) and a single value (e.g., California).
However, while single-value rules can express primitive concepts, they are inadequate in capturing more general trends in the underlying data, especially when working with features of medium to high cardinality. Rules built from such features tend to have too small a support. They are either less likely to be selected in the final output, introducing selection bias into the model [7], or induce a large model if selected, hurting the model's interpretability. For example, to capture a set of married or divorced people who live in California, Texas, Arizona, and Oregon, a model needs to include eight rules, each rule being a combination of a state and a marital status, yielding an overly complicated model. As modern machine learning has in part moved on to pursue a better and more concise way of model presentation as well as predictive accuracy, single-value rules do not suffice for some applications, nor do models built from them.
To mitigate this problem, rules in a more generalized form have been proposed in the literature that allow multiple values [26, 22], also called internal disjunctions of values in a condition [6]. For example: [state = California or Texas or Arizona or Oregon] AND [marital status = married or divorced]. In this case, we only need one rule instead of eight single-value rules, yielding a more concise presentation while preserving the information. We refer to rules of this form as multi-value rules; they will serve as the building blocks of our proposed model in this paper.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
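For concreteness, the eight-rule example above can be checked mechanically. The following sketch (a hypothetical representation, not the paper's code) encodes a rule as a mapping from features to admissible value sets and verifies that one multi-value rule covers exactly what the eight single-value rules cover:

```python
from itertools import product

# Eight single-value rules: one per (state, marital status) combination.
states = ["California", "Texas", "Arizona", "Oregon"]
statuses = ["married", "divorced"]
single_value_rules = [
    {"state": {s}, "marital_status": {m}} for s, m in product(states, statuses)
]

# One multi-value rule covering exactly the same instances.
multi_value_rule = {"state": set(states), "marital_status": set(statuses)}

def satisfies(rule, x):
    """An instance satisfies a rule if, for every condition
    (feature, value set), the instance's feature value lies in the set."""
    return all(x[feat] in values for feat, values in rule.items())

x = {"state": "Texas", "marital_status": "divorced"}
assert satisfies(multi_value_rule, x) == any(satisfies(r, x) for r in single_value_rules)
print(len(single_value_rules), "single-value rules vs. 1 multi-value rule")
```

The single multi-value rule has two conditions and six values, versus sixteen conditions across the eight single-value rules.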
Prior efforts on multi-value rules have mainly focused on finding individual rules using heuristics such as interestingness, confidence, etc., instead of building a principled classification model with a global objective function that considers predictive accuracy and model complexity.
Another important aspect that has been overlooked by previous rule-based models is the need to control the total number of unique features. The number of different entities humans need to comprehend is directly associated with how easy it is to understand the model, as confirmed by the conclusions of Miller [21] on the magical number seven. With fewer features involved, it also becomes easier for domain experts to gain clear insights into the data. In practice, models using fewer features are easier to understand and bring down the overall cost of data collection.
To combine the factors considered above, we propose a novel rule-based classifier, the Multi-value Rule Set (MRS), which is a set of multi-value rules. An instance is classified as positive if it satisfies at least one of the rules. An MRS has great advantages over models built from single-value rules in (i) a more concise presentation of information and (ii) using a smaller number of features in the model. We develop a Bayesian framework for learning an MRS, which provides a unified way to jointly optimize data fitting and model complexity without directly "hard" controlling either. We propose a principled objective combining interpretability and predictive accuracy, where we devise a prior model that promotes a small set of short rules using few features. We propose an efficient inference algorithm for learning a maximum a posteriori model.
We show with experiments on standard data sets that MRS produces predictive accuracy comparable to or better than prior art with lower complexity and fewer overall features.

2 Related Work

There has been a series of research efforts on developing rule-based models for classification [31, 15, 12, 3, 28, 19, 5, 24]. Various structures and formats of models have been proposed, from the earlier work on Classification Based on Associations (CBA) [18] and Repeated Incremental Pruning to Produce Error Reduction (Ripper) [5] to more recent work on rule sets [31, 15, 19] and rule lists [16, 29]. A major development along this line of work is that interpretability has been recognized and emphasized; controlling model complexity for easier interpretation is therefore becoming an important component of the modeling. However, the previously mentioned models rely on single-value rules and are limited in expressive power, leaving redundancy in the model. In addition, learning in previous methods is mostly a two-step procedure [31, 18, 16] that first uses off-the-shelf data mining algorithms to generate a set of rules and then chooses a subset of them to form the final model. In practice this encounters the bottleneck of mining rules of a large maximum length (millions of rules can be generated from a medium-size data set even if the maximum length is set to only 3 [31]). Furthermore, few of the previous works consider limiting the number of features. Our model aims to combine rule learning and feature assignment into the same process.
Our work is broadly related to generalized association rules that consider disjunctive relationships. Among various works along this line, some consider disjunction at the rule level, using the disjunction connector instead of the conjunction connector of the classical rule form, for example, a1 ∨ a2 ∨ ··· → Y, where each ai is a rule. Representative works include [10, 9, 8].
This primitive form of rules was extended to consider disjunctions at the condition/literal level [22], yielding multi-value rules of the form (a1 ∨ a2 ∨ ···) ∧ (b1 ∨ b2 ∨ ···) → Y. Prior efforts have mainly focused on mining individual multi-value rules [25, 10] using heuristics such as interestingness. Some works built classifiers comprised of multi-value rules [1, 2, 20]. However, they still rely on greedy methods such as greedy induction to build a model, and they do not consider model complexity or restrict the number of features. Here, we optimize a global objective that considers predictive accuracy, model size, and the total number of features. By tuning the parameters in the Bayesian framework, our model can strike a balance between the different aspects of the model to suit the domain-specific needs of users.

3 Multi-value Rule Sets

We work with a standard classification data set S that consists of N observations {(xn, yn)}, n = 1, ..., N. Let y represent the set of labels. Each observation has J features, and we denote the jth feature of the nth observation as xnj. Let Vj represent the set of values the jth feature takes. This notation can be adapted to continuous attributes by discretizing the values.

3.1 Multi-value Rules

Now we introduce the basic components of the Multi-value Rule Set model.
Definition 1 An item is a pair of a feature j and a value v, where j ∈ {1, 2, ..., J} and v ∈ Vj.
Definition 2 A condition is a collection of items with the same feature j, denoted as c = (j, V), where j ∈ {1, 2, ..., J} and V ⊆ Vj. V is the union of the values in the items.
Definition 3 A multi-value rule is a conjunction of conditions, denoted as r = {ck}k.
Interchangeable values are grouped into a value set in a condition, such as [state = California or Texas or Arizona or Oregon].
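Definitions 1-3 translate directly into simple data structures. The sketch below (illustrative naming and types, assumed rather than taken from the paper) represents a condition as a (feature, value set) pair and a rule as a conjunction of conditions evaluated against one observation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    # Definition 2: a condition c = (j, V) groups items sharing feature j;
    # V is the union of the item values.
    feature: int
    values: frozenset

@dataclass(frozen=True)
class Rule:
    # Definition 3: a multi-value rule is a conjunction of conditions.
    conditions: tuple

    def __call__(self, x):
        # x is one observation: x[j] is the value of feature j.
        return all(x[c.feature] in c.values for c in self.conditions)

# [state = California or Texas] AND [marital status = married]
STATE, MARITAL = 0, 1
r = Rule((Condition(STATE, frozenset({"California", "Texas"})),
          Condition(MARITAL, frozenset({"married"}))))
print(r({STATE: "Texas", MARITAL: "married"}))   # True
print(r({STATE: "Oregon", MARITAL: "married"}))  # False
```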
Following the definitions, an item is the atom of a multi-value rule. It is also a special case of a condition with a single value, for example, [state = California].
Now we define a classifier built from multi-value rules. By an abuse of notation, we use r(·) to represent a Boolean function that indicates whether an observation satisfies rule r: r(·) : X → {0, 1}. Let R denote a Multi-value Rule Set. We define a classifier R(·):

R(x) = 1 if ∃r ∈ R such that r(x) = 1, and R(x) = 0 otherwise.  (1)

x is classified as positive if it satisfies at least one rule r in R, and we say x is covered by r.

3.2 MRS Formulation

Our proposed framework considers two aspects of a model: 1) interpretability, characterized by a prior model for MRS, which considers the complexity (number of rules and lengths of rules) and feature assignment; 2) predictive accuracy, represented by the conditional likelihood of the data given an MRS model. Both components have tunable parameters to trade off between interpretability and predictive accuracy. Now we formulate the model.
Prior for MRS. The prior model for MRS jointly determines the number of rules M, the lengths of rules {Lm}, and the feature assignments {zm}, m = 1, ..., M, where m is the rule index. We propose a two-step process for constructing a rule set, where the first step determines the size and shape of an MRS model and the second step fills in the empty "boxes" with items.
Creating empty "boxes" - complexity assignment: First, we draw the number of rules M from a Poisson distribution with arrival rate λM, where λM ~ Gamma(αM, βM). Second, we determine the number of items in each rule, denoted as Lm, with Lm ~ Poisson(λL) truncated to only allow positive outcomes. The arrival rate of this Poisson distribution, λL, is governed by a Gamma distribution with parameters αL, βL.
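The complexity-assignment step just described can be sketched as a small generative sampler. This is an illustrative reading of the prior, not the authors' code, and it assumes the Gamma distributions are parameterized by shape α and rate β (consistent with the later remark that increasing βM, βL shrinks the expected model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_shape(alpha_M, beta_M, alpha_L, beta_L):
    """Step 1 of the prior: draw the number of rules M and the number of
    items L_m per rule (a Poisson truncated to strictly positive counts)."""
    lam_M = rng.gamma(alpha_M, 1.0 / beta_M)   # rate beta -> scale 1/beta
    M = rng.poisson(lam_M)
    lam_L = rng.gamma(alpha_L, 1.0 / beta_L)
    L = []
    for _ in range(M):
        Lm = 0
        while Lm == 0:                          # truncate to positive counts
            Lm = rng.poisson(lam_L)
        L.append(int(Lm))
    return L

# alpha < beta encourages a small set of short rules
print(sample_shape(alpha_M=1, beta_M=2, alpha_L=1, beta_L=2))
```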
Since we favor simpler models for interpretability purposes, we set αL < βL and αM < βM to encourage a small set of short rules. These two steps together determine the size and shape of an MRS model; we therefore call the parameters αM, βM, αL, βL shape parameters, and let Hs = {αM, βM, αL, βL, θ}. This step creates empty "boxes" to be filled with items in the following step and assigns an overall complexity to the model.
Filling "boxes" - feature assignment: The m-th rule is a collection of Lm "boxes", each containing an item. Let zmk represent the feature assigned to the kth box in the m-th rule, where zmk ∈ {1, ..., J}, and let zm represent the set of feature assignments in the m-th rule. We sample zm from a multinomial distribution with weights p drawn from a Dirichlet distribution parameterized by hyperparameter θ = {θj}, j = 1, ..., J. Let lmj denote the number of items with feature j in the m-th rule, i.e., lmj = Σk χ(zmk = j), so Σj lmj = Lm. This means lmj items share the same feature j and can therefore be merged into one condition. We truncate the multinomial distribution to only allow lmj ≤ |Vj|. Remarks: we use the Multinomial-Dirichlet distribution for feature assignment because of the clustering property of its outcomes. The prior model will tend to re-use features already in the rule. This is consistent with the interpretability goal of our model: we would like to form an MRS model with fewer features so that multiple items can be merged into one condition. The prior does not consider the values in each item, since they do not affect the size and shape of the model and therefore have no effect on interpretability.
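The feature-assignment step can be sketched the same way (again an illustration; rejection sampling stands in here for the truncated multinomial described above):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def assign_features(Lm, theta, cardinalities, max_tries=1000):
    """Draw the L_m feature slots of one rule from a Dirichlet-multinomial,
    rejecting draws in which a feature j appears more than |V_j| times
    (a simple way to emulate the truncated multinomial)."""
    J = len(theta)
    for _ in range(max_tries):
        p = rng.dirichlet(theta)                 # weights over J features
        z = rng.choice(J, size=Lm, p=p)          # one feature per "box"
        counts = Counter(z)                      # the counts l_mj
        if all(counts[j] <= cardinalities[j] for j in counts):
            return [int(f) for f in z]
    raise RuntimeError("could not satisfy l_mj <= |V_j|")

# 3 items, 4 features, each feature with 5 possible values
z_m = assign_features(Lm=3, theta=np.ones(4), cardinalities=[5, 5, 5, 5])
print(z_m)  # items sharing a feature index merge into one condition
```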
In summary, the prior for the MRS model follows the distribution below, where C is a function of Hs and Γ(·) is the gamma function:

p(R; Hs) ∝ [Γ(M + αM) / Γ(M + 1)] · C^M · ∏_{m=1}^{M} [ Γ(Lm + αL) / (Γ(Lm + 1)(βL + 1)^{Lm}) · ∏_{j=1}^{J} Γ(lmj + θj) / Γ(Lm + Σ_{j=1}^{J} θj) ].  (2)

Conditional Likelihood. Now we consider the predictive accuracy of an MRS by modeling the conditional likelihood of labels y given features x and an MRS model R. Our prediction of the outcomes is based on the coverage of the MRS. According to formula (1), if an observation satisfies R (is covered by R), it is predicted to be positive; otherwise, it is predicted to be negative. We assume label yn is drawn from a Bernoulli distribution with probability ρ+ or ρ− of being consistent with the predicted outcome. Specifically, when R(xn) = 1, i.e., xn satisfies the rule set, yn has probability ρ+ of being positive, and when R(xn) = 0, yn has probability ρ− of being negative. ρ+, ρ− govern the predictive accuracy on the training data. We assume that they are drawn from two Beta distributions with hyperparameters (α+, β+) and (α−, β−), respectively, which control the predictive power of the model. The conditional likelihood is shown below, given parameters Hc = {α+, β+, α−, β−}:

p(y|x, R, Hc) ∝ B(TP + α+, FP + β+) B(TN + α−, FN + β−),  (3)

where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively.
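In log space, the conditional likelihood (3) is a sum of two log-Beta terms, which is how one would evaluate it numerically. A minimal sketch (the confusion counts below are hypothetical):

```python
from math import lgamma

def betaln(a, b):
    # log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_cond_likelihood(tp, fp, tn, fn, a_pos=100, b_pos=1, a_neg=100, b_neg=1):
    """log p(y | x, R, Hc) up to an additive constant, following equation (3);
    the hyperparameter defaults mirror the settings used in Section 5."""
    return betaln(tp + a_pos, fp + b_pos) + betaln(tn + a_neg, fn + b_neg)

# More correct classifications -> higher conditional likelihood:
good = log_cond_likelihood(tp=90, fp=10, tn=85, fn=15)
bad = log_cond_likelihood(tp=60, fp=40, tn=55, fn=45)
print(good > bad)  # True
```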
B(·) is the Beta function, which comes from integrating ρ+, ρ− out of the conditional likelihood function.
We will write p(R; Hs) as p(R) and p(y|x, R, Hc) as p(y|x), ignoring the dependence on parameters when convenient. Regarding the hyperparameters Hs, Hc: there is a natural setting for θ (all entries equal to 1), which expresses no prior preference among features. For the Gamma distributions, we set αM and αL to 1; the strength of the prior for constructing a simple MRS then depends on βM and βL. Increasing βM and βL decreases the expected number of rules and the expected lengths of rules, thus penalizing larger models more. There are four real-valued parameters in the conditional likelihood to set: α+, β+, α−, β−. They jointly control the probability that a prediction of the MRS model is correct; therefore we should always set α+ > β+ and α− > β−. The ratios of α+ to β+ and of α− to β− are associated with the expected predictive accuracy. Setting the values of the parameters can be done through cross-validation, through another layer of hierarchy with more diffuse hyperparameters, or by plain intuition.

3.3 Clustering of Features

We use the Multinomial-Dirichlet prior to take advantage of the "clustering" effect in feature assignment. Our goal is to formulate a model which favors rules with fewer features; here we prove this effect. Let R denote an MRS model and lmj represent the number of items in rule m taking feature j.
Now we make a small change to R: pick two features j1, j2 in rule m with lmj1 ≥ lmj2 and replace an item taking feature j2 with an item taking feature j1; denote the new rule set as R′. Every rule in R′ remains the same as in R except the m-th rule. Let l′mj1 and l′mj2 denote the numbers of items taking features j1 and j2 in the new model, so l′mj1 = lmj1 + 1 and l′mj2 = lmj2 − 1. We claim this flip of features increases the prior probability of the model, i.e.,

Theorem 1 If lmj1 + θj1 ≥ lmj2 + θj2, then p(R′) ≥ p(R).

When we choose a uniform prior where all θj are equal, the theorem reduces to a simpler form: the model always tends to reuse the most prevalent features. For example, given the two rules [state = California or Texas] AND [marital status = married] and [state = California] AND [marital status = married] AND [age ≥ 45], our model will favor rule sets containing the former, all else being equal. (All proofs are in the supplementary material.)

4 Inference Method

Inference for rule-based models is challenging because it involves a search over exponentially many possible sets of rules: since each rule is a conjunction of conditions, the number of rules increases exponentially with the number of features in a data set, and the solution space (all possible rule sets) is the power set of the rule space. To obtain a maximum a posteriori (MAP) model within this solution space, Gibbs sampling takes tens of thousands of iterations or more to converge, even when searching within a reduced space of only a couple of thousand pre-mined and pre-selected rules [16, 29]. Here we propose an efficient inference algorithm that adopts the basic search procedure of simulated annealing.
Given an objective function p(R|S) over the discrete search space of different rule sets and a temperature schedule function over time steps, T[t], a simulated annealing [13] procedure is a discrete-time, discrete-state Markov chain where at step t, given the current state R[t], the next state R[t+1] is chosen by first proposing a neighbor and accepting it with a probability that gradually decreases with time. Within this framework, we also incorporate the following strategies for faster computation: 1) we use theoretical bounds to bound the sampling chain and reduce computation; 2) instead of randomly proposing a neighboring solution, we aim to improve on the current solution by evaluating neighbors and picking the right one to move to.

4.1 Theoretical bounds on MAP models

We exploit the model formulation to guide the search. We start by looking at MRS models with one rule removed. Removing a rule yields a simpler model but may lower the likelihood. However, we can prove that the loss in likelihood is bounded as a function of the support. For a rule set R and index z, we use R\z to represent the set that contains all rules from R except the zth rule, i.e., R\z = {rm | rm ∈ R, m ≠ z}. Define

Υ = β−(N+ + α+ + β+ − 1) / ((N− + α− + β−)(N+ + α+ − 1)),

where N+, N− are the numbers of positive and negative examples, respectively. Notate the support of a rule as supp(r) = Σn r(xn). Then the following holds:

Lemma 1 If α+ > β+ and α− > β−, then p(y|x, R) ≥ Υ^supp(z) p(y|x, R\z).

Υ is meaningful if Υ ≤ 1; otherwise the lemma means adding a rule always increases the conditional likelihood. This condition almost always holds, since α+ > β+, α− > β− and we do not set β+ to a significantly large value.
In practice it is recommended to set β+, β− to 1.
We now introduce some notation that will be used later. Let L* denote the maximum likelihood of data S, which is achieved when all data are classified correctly (this holds when α+ > β+ and α− > β−), i.e., TP = N+, FP = 0, TN = N−, and FN = 0, giving L* := B(N+ + α+, β+) B(N− + α−, β−). Let v[t] denote the best solution found up to iteration t, i.e.,

v[t] = max_{τ ≤ t} p(R[τ]|S).

According to the prior model, containing too many rules penalizes the model due to the large complexity. Therefore, to hold a spot in the model, each rule needs to make enough of a "contribution" to the objective, i.e., capture enough of the positive class, to cancel off the decrease in the prior. We therefore claim that the support of each rule in the MAP model is lower bounded, and that the bound becomes tighter as v[t] increases along the iterations.

Theorem 2 Take a data set S and an MRS model with parameters H = {αM, βM, αL, βL, α+, β+, α−, β−, {θj}, j = 1, ..., J}, where H ∈ (ℝ+)^(J+8). Define R* ∈ arg max_R p(R|S, H) and M* = |R*|. If αM < βM, αL < βL, α+ > β+, α− > β−, and Υ ≤ 1, we have:

∀r ∈ R*, supp(r) ≥ ⌈ log( M[t] αM Ω / (M[t] + αM − 1) ) / log(1/Υ) ⌉, where M[t] = ⌊ (log L* + log p(∅) − v[t]) / log Ω ⌋

and Ω = (βM + 1)(βL + 1)^(αL+1) Σ_{j=1}^{J} θj / (αM βL^{αL} αL max(θ)); p(∅) is the prior of the empty set.
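With the recommended settings αM = αL = 1 and β+ = β− = 1, the constants Υ and Ω, and hence the minimum support a rule must have to qualify for the MAP model, are directly computable. The sketch below follows the formulas for Υ and Ω as reconstructed here, so treat it as illustrative rather than authoritative:

```python
from math import ceil, log

def min_support(n_pos, n_neg, a_pos, b_pos, a_neg, b_neg,
                beta_M, beta_L, alpha_L, theta):
    """Lower bound on rule support from Theorem 2 with alpha_M = 1:
    supp(r) >= ceil(log(Omega) / log(1 / Upsilon)).
    (Upsilon and Omega follow the formulas as reconstructed in the text.)"""
    upsilon = (b_neg * (n_pos + a_pos + b_pos - 1)
               / ((n_neg + a_neg + b_neg) * (n_pos + a_pos - 1)))
    assert upsilon <= 1, "bound is only meaningful when Upsilon <= 1"
    omega = ((beta_M + 1) * (beta_L + 1) ** (alpha_L + 1) * sum(theta)
             / (beta_L ** alpha_L * alpha_L * max(theta)))
    return ceil(log(omega) / log(1 / upsilon))

# e.g., a balanced data set of 10,000 examples with a moderate prior
print(min_support(n_pos=5000, n_neg=5000, a_pos=100, b_pos=1,
                  a_neg=100, b_neg=1, beta_M=100, beta_L=10,
                  alpha_L=1, theta=[1.0] * 20))
```

Rules whose support falls below this value can be skipped during the search, shrinking the space of candidate neighbors.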
L* and p(∅) upper bound the conditional likelihood and the prior, respectively. The difference between log L* + log p(∅) and v[t], the numerator of M[t], represents the room for improvement over the current solution v[t]; the smaller the difference, the smaller M[t]. When we choose αM = 1, the bound on the support reduces to

supp(r) ≥ ⌈ log Ω / log(1/Υ) ⌉.

We can control the bounds by changing the parameters in H to increase or decrease Ω. As Ω increases, the bound M[t] decreases, which indicates a stronger preference for a simpler model with a smaller number of rules. Simultaneously, the lower bound on support increases, which is equivalent to reducing the search space. To increase Ω, one can increase βM/αM, which lowers the expected number of rules under the prior, or increase βL/αL, which lowers the expected number of items in each rule. We incorporate the bound on the support in the search algorithm to check whether a rule qualifies to be included.

4.2 Proposing step

Here we detail the proposing step at each iteration of the search algorithm. We simultaneously define the set of neighbors and the process for choosing a neighbor to propose. A "next state" is proposed by first selecting an action to alter the current MRS and then choosing from the "neighboring" models generated by that action. To improve the search efficiency, we do not perform a random action; instead, we sample from the misclassified examples to choose an action that can improve the current state R[t].
If the misclassified example is positive, it means R[t] fails to "cover" it, and the coverage therefore needs to increase via one of the following actions, chosen at random:

• Add a value: choose a rule rm ∈ R[t], a condition ck ∈ rm, and then a candidate value v ∈ V_{zmk} \ ν(ck); then ck ← (zmk, ν(ck) ∪ v), where ν(ck) indicates the value(s) in condition ck.
• Remove a condition: choose a rule rm ∈ R[t] and a condition ck ∈ rm; then rm ← {ck′ ∈ rm | ck′ ≠ ck}.
• Add a rule: generate a new rule r′ such that supp(r′) satisfies the bound in Theorem 2; then R[t+1] ← R[t] ∪ r′.

On the other hand, if the misclassified example is negative, it means R[t] covers more than it should, and the coverage therefore needs to decrease via one of the following actions, chosen at random:

• Add a condition: choose a rule rm ∈ R[t], then a feature j′ ∈ {1, ..., J} \ zm and a set of values V′ ⊆ Vj′; then update rm ← rm ∪ (j′, V′).
• Remove a rule: choose a rule rm ∈ R[t]; then R[t+1] = {r ∈ R[t] | r ≠ rm}.

The above actions involve choosing a value, a condition, or a rule to act on. Different choices result in different neighboring candidate models. To select one, we evaluate p(·|S) on every candidate model. Then a choice is made between exploration (choosing a random model) and exploitation (choosing the best model).
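One proposing step can be sketched as follows, with `evaluate` standing in for p(·|S) and the two neighbor generators standing in for the coverage-increasing and coverage-decreasing actions listed above (all names are placeholders, not the authors' API):

```python
import random

def propose(R_t, data, evaluate, neighbors_increase_coverage,
            neighbors_reduce_coverage, explore_prob=0.1):
    """One proposing step: sample a misclassified example, generate the
    neighbors produced by the corresponding actions, then either explore
    (random neighbor) or exploit (best-scoring neighbor)."""
    mistakes = [(x, y) for x, y in data if R_t.predict(x) != y]
    if not mistakes:
        return R_t                      # current model classifies everything
    x, y = random.choice(mistakes)
    if y == 1:
        # false negative: R[t] fails to cover x
        # -> add a value / remove a condition / add a rule
        candidates = neighbors_increase_coverage(R_t, x)
    else:
        # false positive: R[t] covers too much
        # -> add a condition / remove a rule
        candidates = neighbors_reduce_coverage(R_t, x)
    if random.random() < explore_prob:
        return random.choice(candidates)        # exploration
    return max(candidates, key=evaluate)        # exploitation: argmax p(.|S)
```

The proposed neighbor is then accepted or rejected according to the annealing temperature schedule, as in standard simulated annealing.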
This randomness helps avoid local minima and helps the Markov chain converge to a global optimum.
See the supplementary material for the complete algorithm.

5 Experimental Evaluation

We perform a detailed experimental evaluation of MRS models on simulated and real-world data sets. The first part of our experiments is designed to study the effect of the hyperparameters on interpretability and predictive accuracy. The second part compares MRS with classic and state-of-the-art baselines.

5.1 Accuracy & Interpretability Trade-off

We generate ten data sets of 100k observations with 50 arbitrary numerical features uniformly drawn from 0 to 1. For each data set, we construct a set of 10 rules by first drawing the number of conditions uniformly from 1 to 10 for each rule and then filling the conditions with randomly selected features. Since the data are numeric, we generate a range for each feature by randomly selecting two values from 0 to 1, one as the lower boundary and the other as the upper boundary. These ten rules are the ground-truth rule set, denoted as R*. We then generate labels y from R*: observations that satisfy R* are positive. Each data set is partitioned into 75% training and 25% testing. To apply the MRS model, we discretize each feature into ten intervals and obtain a binary data set of size 100k by 500, on which we run the proposed model. We set the entries of θ to 1, α+ = α− = 100, and β+ = β− = 1. Of the four shape parameters αM, βM, αL, βL, we fix αM, αL to 1 and only vary βM, βL; larger values of βM, βL indicate a stronger prior preference for simpler models. We let βM, βL take values from {1, 10, 100, 1000, 10000}, giving a total of 25 parameter sets.
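The discretization used here (ten intervals per feature, one-hot encoded, turning 50 numeric features into 500 binary columns) can be sketched as follows. This is an illustrative reimplementation, not the authors' preprocessing code, and it uses 1,000 rows for brevity rather than the full 100k:

```python
import numpy as np

def discretize_binary(X, n_bins=10):
    """Map each column of X (values in [0, 1]) to n_bins interval-membership
    indicators, turning an N x J numeric matrix into an N x (J * n_bins)
    binary matrix."""
    N, J = X.shape
    # interval index 0..n_bins-1; the right edge 1.0 folds into the last bin
    idx = np.minimum((X * n_bins).astype(int), n_bins - 1)
    out = np.zeros((N, J * n_bins), dtype=np.uint8)
    cols = np.arange(J) * n_bins + idx        # one column per (feature, bin)
    out[np.arange(N)[:, None], cols] = 1
    return out

X = np.random.default_rng(0).random((1000, 50))
B = discretize_binary(X)
print(B.shape)  # (1000, 500)
```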
On each training data set, we run the MRS model with the 25 parameter sets and then evaluate the output model on the test set. We repeat the process for all ten data sets. Figure 1 shows the hold-out error, the number of conditions, and the number of features used in the model. Each block corresponds to a parameter set; the values are averaged over the ten data sets.

Figure 1: Effect of shape parameters on predictive accuracy and interpretability. (a) the average error rate; (b) the average number of conditions; (c) the average number of features.

The bottom-left corner represents models with the least constraint on complexity (βM = 1, βL = 1); they achieve the lowest error but at the cost of the highest complexity and the largest feature set. As βM and βL increase, the model becomes less complex, with fewer conditions and fewer features, but at the cost of predictive accuracy. The top-right corner represents models with the strongest preference for simplicity: the smallest model with the largest error. The three panels show a clear pattern of the trade-off between interpretability and predictive accuracy.

5.2 Real-World Data Sets

We then evaluate the performance of MRS on six real-world data sets from law enforcement, healthcare, and demography, where interpretability is most desired. The data sets are publicly available at the UCI Machine Learning Repository or ICPSR.
Among these, the medical data sets are especially suitable for MRS, since many features, such as diagnosis categories, have very high cardinalities.

Table 1: A summary of data sets

data set                    N        d    Y = 1                 Features
Juvenile Delinquency [23]   4,023    69   delinquency           exposure to violence, demographics, etc.
Credit card [34]            30,000   24   credit card default   gender, history of past payment, etc.
Census [14]                 48,842   14   income ≥ 50k          gender, age, occupation, etc.
Recidivism                  11,645   106  recidivism            conviction, employment, demographics, etc.
Hospital Readmission [27]   100,000  55   readmitted            diagnosis history, symptoms, etc.
In-hospital Mortality       200,000  14   death in hospital     diagnoses, medical history, etc.

Baselines. We benchmark the performance of MRS against the following rule-based models for classification: Scalable Bayesian Rule Lists (SBRL) [33], Classification Based on Associations (CBA) [18], Repeated Incremental Pruning to Produce Error Reduction (Ripper) [5], and Bayesian Rule Sets (BRS) [31]. CBA and Ripper were designed to bridge the gap between association rule mining and classification and thus focused mostly on optimizing predictive accuracy; they are among the earliest and most-cited works on rule-based classifiers. On the other hand, BRS and SBRL, two recently proposed frameworks, aim to achieve simpler models as well as high predictive accuracy. All four rule-based models use classical single-value rules. Additionally, we would like to quantify the possible loss (if any) in predictive accuracy incurred for gaining interpretability. Therefore, we also use two black-box models, random forest and XGBoost, to benchmark the performance without accounting for interpretability.
Experimental Setup. We performed 5-fold cross-validation for each method. In each fold, we set aside 20% of the training data for parameter tuning and used a grid search to locate the best set of parameters.
We use R and Python packages for random forest, SBRL, CBA, and Ripper [11], and use the publicly available code for BRS.¹ The MRS model has sets of hyperparameters Hs and Hc. We set entries in θ to 1, α+ = α− = 100, and β+ = β− = 1. αM and βM control the number of rules, and αL and βL control the lengths of rules. We set αM and αL to 1 and vary βM and βL. We report in Table 2 the average test error, the average number of conditions in the output model, and the average number of unique features used in each model, computed from the 5 folds; standard deviations are reported in parentheses.

Table 2: Evaluation of predictive performance and model complexity over 5-fold cross validation (ncond = number of conditions; nfeat = number of unique features)

Juvenile     accuracy  ncond   nfeat
Ripper       .88(.01)  35(13)  23(5)
CBA          .88(.01)  27(22)  18(12)
SBRL         .88(.01)  10(2)   9(2)
BRS          .88(.01)  21(4)   11(3)
MRS          .89(.00)  18(3)   6(2)   [nval: 19(1)]
RF           .90(.00)  –       –
XGBoost      .91(.01)  –       –

Credit card  accuracy  ncond   nfeat
Ripper       .82(.01)  23(8)   12(2)
CBA          .80(.01)  35(3)   6(0)
SBRL         .82(.00)  15(2)   10(2)
BRS          .81(.01)  17(2)   8(2)
MRS          .82(.01)  10(7)   5(3)
RF           .82(.00)  –       –
XGBoost      .83(.01)  –       –

Census       accuracy  ncond   nfeat
Ripper       .84(.01)  67(11)  7(0)
CBA          .79(.01)  13(12)  6(2)
SBRL         .82(.00)  32(2)   10(1)
BRS          .79(.01)  33(11)  11(2)
MRS          .80(.00)  14(8)   5(3)
RF           .86(.00)  –       –
XGBoost      .87(.01)  –       –

Recidivism   accuracy  ncond   nfeat
Ripper       .78(.00)  78(18)  32(4)
CBA          .72(.01)  87(25)  27(5)
SBRL         .75(.00)  10(1)   9(1)
BRS          .73(.01)  16(11)  8(3)
MRS          .74(.02)  6(3)    3(1)
RF           .74(.00)  –       –
XGBoost      .75(.05)  –       –

Readmission  accuracy  ncond   nfeat
Ripper       .58(.01)  35(9)   12(1)
CBA          .61(.01)  39(10)  13(1)
SBRL         .61(.01)  21(1)   7(1)
BRS          .59(.01)  9(11)   5(3)
MRS          .60(.00)  6(2)    3(1)
RF           .61(.00)  –       –
XGBoost      .60(.00)  –       –

Mortality    F1        ncond   nfeat
Ripper       .26(.01)  115(6)  9(1)
CBA          .28(.02)  435(18) 10(2)
MRS          .39(.00)
RF           .41(.01)  –       –
XGBoost      .41(.02)  –       –

Results We evaluate predictive performance and interpretability by measuring three metrics: i) the accuracy on the test set (we report the F1 score for the mortality data set since it is highly unbalanced); ii) the total number of conditions in the output model (for MRS models, we also report the total number of values); and iii) the number of unique features used in the model. MRS achieves consistently competitive predictive accuracy while using significantly fewer conditions and fewer features. On the credit card and mortality data sets, MRS is the best-performing rule-based model: highest accuracy, smallest complexity, and fewest features. On the juvenile data set, MRS achieves the highest accuracy while using the second-smallest number of conditions. On the readmission data set, MRS loses slightly in accuracy to CBA and SBRL but uses only 6 conditions, whereas CBA uses 39 and SBRL uses 21. In summary, MRS models use the fewest conditions on five out of six data sets and the smallest number of features on all six; even on the juvenile data set, where the MRS model has more conditions than the SBRL model, it still wins on the number of features.

We show an MRS model learned from the juvenile data set to inspect whether the grouping of categories is meaningful.
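As background for both the complexity metrics in Table 2 and the learned rules presented next, a multi-value rule set can be represented as a list of rules, each mapping a feature to the set of values allowed in that condition; an instance is classified as positive if it satisfies every condition of at least one rule. A minimal sketch (the rules below are hypothetical examples, not a learned model):

```python
# Each rule maps feature -> allowed value set; one condition per feature.
rule_set = [
    {"state": {"California", "Texas"}, "marital_status": {"married"}},
    {"education": {"Bachelors", "Masters", "Doctorate"}},
]

def predict(rules, instance):
    """Positive iff some rule has all of its conditions satisfied."""
    return any(
        all(instance.get(feature) in values for feature, values in rule.items())
        for rule in rules
    )

def complexity(rules):
    """Complexity metrics in the style of Table 2."""
    ncond = sum(len(rule) for rule in rules)                     # total conditions
    nfeat = len({f for rule in rules for f in rule})             # unique features
    nval = sum(len(v) for rule in rules for v in rule.values())  # total values
    return ncond, nfeat, nval
```

For the two rules above, `complexity(rule_set)` counts 3 conditions over 3 unique features and 6 values; collapsing interchangeable values into a single condition is what keeps ncond and nfeat small relative to single-value rule sets.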
It consists of two rules; if a teenager satisfies either of them, the model predicts that the teenager will commit delinquency in the future. In this data set, features are survey questions and feature values are the answer choices.

1: [Have your friends ever hit or threatened to hit someone without any reason? = "All of them" or "Not sure" or "Refused to Answer"]

2: [Have your friends purposely damaged or destroyed property that did not belong to them? = "All of them" or "Most of them" or "Some of them"] AND [Did any of your family members use hard drugs? = "Yes"] AND [Has any of your family members or friends ever beat you up with their fists so hard that you were hurt pretty bad? = "Yes"]

It is interesting that MRS grouped the three values in the first rule together, as well as the three values in the first condition of the second rule. Grouped values are considered interchangeable by the model, and the grouping has an intuitive common-sense explanation: people avoid answering when they feel alerted or uncomfortable with a question [4, 32]. In this case, the question concerns the privacy of their friends, making respondents more reserved and hesitant to provide a definite answer; they would rather say they are not sure, or refuse to answer, than directly say yes.

¹https://github.com/wangtongada/BOA

5.3 Interpretability Evaluation by Humans

To further evaluate model interpretability, in addition to quantitatively measuring the size of the model, we would like to understand how quickly and how correctly humans can understand a machine learning model. We designed a short survey and sent it to a group of 70 undergraduate students. The survey was administered as an online quiz, with course credit to motivate students to answer as accurately as possible.
The students had been enrolled in a machine learning class for a couple of weeks and had some knowledge of predictive models.

We chose to show models built from the "credit card" data set, since its output models are the smallest compared to the other data sets and therefore the easiest for humans to understand. The students were asked to use the models to make predictions on given instances. Each method has five models, one from each of the five folds, so each student was shown one model for each of the five methods. The survey first taught them how to use a model, with instructions and an example, and then asked them to use the model to make predictions on two instances. Their answers and response times were recorded.

Since all competing methods are rule-based models, it is important that students understand the notion of rules before working with any of the models. Therefore, we designed a screening question on rules, and students could only proceed with the survey if they answered the question correctly. 66 students passed the test.

We report in Figure 2 the accuracy and response time for each method, averaged over the five folds. Note that response time refers to the total time for understanding a model and using it to predict two instances. Accuracy was evaluated against the predictions of the model, not the true labels. MRS and BRS achieve the highest accuracy, and SBRL the lowest. We hypothesize this is because SBRL uses an ordered set of rules connected by "else if", which makes it somewhat more difficult to understand than the unordered rules in the other methods. For response time, MRS requires a significantly smaller amount of time, less than half that of CBA and Ripper, owing to the Bayesian prior favoring small models and to the concise representation that allows multiple values in a condition.
BRS also takes a very short time, slightly longer than MRS, followed by SBRL. MRS, BRS, and SBRL all have a Bayesian component favoring small models, while CBA and Ripper do not and thus take significantly longer to understand and use.

Figure 2: Human evaluation: average accuracy and response time for each method.

6 Conclusions

We proposed the Multi-value Rule Set (MRS), which provides a more concise and feature-efficient model form for classification and explanation. We developed an inference algorithm that incorporates theoretically grounded bounds to reduce computation. Compared with state-of-the-art rule-based models, MRS showed competitive predictive accuracy while achieving a significant reduction in complexity and feature-set size, thus improving interpretability, as demonstrated by human evaluation. A major contribution is demonstrating the possibility of using fewer features without sacrificing much (if any) predictive performance.

Note that we do not claim that multi-value rules are more interpretable than single-value rules, since it is well known that interpretability comes in different forms in different domains. However, our model provides a more flexible solution for interpretable modeling since, after all, a single-value rule is just a special case of a multi-value rule. We believe the potential of the proposed multi-value rules is not limited to MRS; they can be adopted in other rule-based models.

Code: The MRS code is available at https://github.com/wangtongada/MRS.

References

[1] M. R. Berthold. Mixed fuzzy rule formation. International Journal of Approximate Reasoning, 32(2-3):67–84, 2003.

[2] V. Bombardier, C. Mazaud, P. Lhoste, and R. Vogrig. Contribution of fuzzy reasoning method to knowledge integration in a defect recognition system. Computers in Industry, 58(4):355–366, 2007.

[3] Z. Chi, H. Yan, and T. Pham.
Fuzzy algorithms: with applications to image processing and pattern recognition, volume 10. World Scientific, 1996.

[4] P. M. Chisnall. Questionnaire design, interviewing and attitude measurement. Journal of the Market Research Society, 35(4):392–393, 1993.

[5] W. W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.

[6] K. A. DeJong and W. M. Spears. Learning concept classification rules using genetic algorithms. Technical report, George Mason University, Fairfax, VA, 1990.

[7] H. Deng, G. Runger, and E. Tuv. Bias of importance measures for multi-valued attributes and solutions. In Artificial Neural Networks and Machine Learning – ICANN 2011, pages 293–300, 2011.

[8] T. Hamrouni, S. B. Yahia, and E. M. Nguifo. Sweeping the disjunctive search space towards mining new exact concise representations of frequent itemsets. Data & Knowledge Engineering, 68(10):1091–1111, 2009.

[9] T. Hamrouni, S. B. Yahia, and E. M. Nguifo. Generalization of association rules through disjunction. Annals of Mathematics and Artificial Intelligence, 59(2):201–222, 2010.

[10] I. Hilali, T.-Y. Jen, D. Laurent, C. Marinica, and S. B. Yahia. Mining interesting disjunctive association rules from unfrequent items. In International Workshop on Information Search, Integration, and Personalization, pages 84–99. Springer, 2013.

[11] K. Hornik, C. Buchta, and A. Zeileis. Open-source machine learning: R meets Weka. Computational Statistics, 24(2):225–232, 2009.

[12] H. Ishibuchi and T. Nakashima. Effect of rule weights in fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 9(4):506–515, 2001.

[13] S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, et al. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[14] R. Kohavi.
Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202–207, 1996.

[15] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In ACM SIGKDD, pages 1675–1684. ACM, 2016.

[16] B. Letham, C. Rudin, T. H. McCormick, D. Madigan, et al. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics, 9(3):1350–1371, 2015.

[17] W. Li, J. Han, and J. Pei. CMAR: Accurate and efficient classification based on multiple class-association rules. In ICDM, pages 369–376. IEEE, 2001.

[18] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In KDD, 1998.

[19] D. Malioutov and K. Varshney. Exact rule learning via Boolean compressed sensing. In International Conference on Machine Learning, pages 765–773, 2013.

[20] M. Mampaey, S. Nijssen, A. Feelders, R. Konijn, and A. Knobbe. Efficient algorithms for finding optimal binary features in numeric and nominal labeled data. Knowledge and Information Systems, 42(2):465–492, 2015.

[21] G. A. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956.

[22] A. A. Nanavati, K. P. Chitrapura, S. Joshi, and R. Krishnapuram. Mining generalised disjunctive association rules. In Proceedings of the Tenth International Conference on Information and Knowledge Management, pages 482–489. ACM, 2001.

[23] J. D. Osofsky. The effect of exposure to violence on young children. American Psychologist, 50(9):782, 1995.

[24] P. R. Rijnbeek and J. A. Kors. Finding a short and accurate decision rule in disjunctive normal form by exhaustive search. Machine Learning, 80(1), 2010.

[25] R. Srikant and R. Agrawal.
Mining generalized association rules. In VLDB, 1995.

[26] M. Steinbach and V. Kumar. Generalizing the notion of confidence. Knowledge and Information Systems, 12(3):279–299, 2007.

[27] B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore. Impact of HbA1c measurement on hospital readmission rates: Analysis of 70,000 clinical database patient records. BioMed Research International, 2014, 2014.

[28] T. Tran, W. Luo, D. Phung, J. Morris, K. Rickard, and S. Venkatesh. Preterm birth prediction: Stable selection of interpretable rules from high dimensional data. In Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 164–177. PMLR, 2016.

[29] F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics, pages 1013–1022, 2015.

[30] T. Wang, C. Rudin, F. Doshi, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian framework for learning rule sets for interpretable classification. Journal of Machine Learning Research, 2017.

[31] T. Wang, C. Rudin, F. Velez-Doshi, Y. Liu, E. Klampfl, and P. MacNeille. Bayesian rule sets for interpretable classification. In ICDM, 2016.

[32] G. B. Willis. Cognitive interviewing: A tool for improving questionnaire design. Sage Publications, 2004.

[33] H. Yang, C. Rudin, and M. Seltzer. Scalable Bayesian rule lists. In ICML, 2017.

[34] I.-C. Yeh and C.-h. Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2):2473–2480, 2009.

[35] X. Yin and J. Han. CPAR: Classification based on predictive association rules. In SIAM International Conference on Data Mining, pages 331–335.
SIAM, 2003.