{"title": "MonoForest framework for tree ensemble analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 13780, "page_last": 13789, "abstract": "In this work, we introduce a new decision tree ensemble representation framework: instead of using a graph model we transform each tree into a well-known polynomial form. We apply the new representation to three tasks: theoretical analysis, model reduction, and interpretation. The polynomial form of a tree ensemble allows a straightforward interpretation of the original model. In our experiments, it shows comparable results with state-of-the-art interpretation techniques. Another application of the framework is the ensemble-wise pruning: we can drop monomials from the polynomial, based on train data statistics. This way we reduce the model size up to 3 times without loss of its quality. It is possible to show the equivalence of tree shape classes that share the same polynomial. This fact gives us the ability to train a model in one tree's shape and exploit it in another, which is easier for computation or interpretation. We formulate a problem statement for optimal tree ensemble translation from one form to another and build a greedy solution to this problem.", "full_text": "MonoForest framework for tree ensemble analysis\n\nIgor Kuralenok\n\nYandex / JetBrains Research\nsolar@yandex-team.ru\n\nVasily Ershov\n\nYandex\n\nnoxoomo@yandex-team.ru\n\nYandex / Saint Petersburg campus of National Research University Higher School of Economics\n\nIgor Labutin\n\nLabutin.IgorL@gmail.com\n\nAbstract\n\nIn this work, we introduce a new decision tree ensemble representation framework:\ninstead of using a graph model we transform each tree into a well-known polyno-\nmial form. We apply the new representation to three tasks: theoretical analysis,\nmodel reduction, and interpretation. The polynomial form of a tree ensemble\nallows a straightforward interpretation of the original model. 
In our experiments, it shows comparable results with state-of-the-art interpretation techniques. Another application of the framework is ensemble-wise pruning: we can drop monomials from the polynomial based on train data statistics. This way we reduce the model size up to 3 times without loss of quality. It is possible to show the equivalence of tree shape classes that share the same polynomial. This fact gives us the ability to train a model in one tree shape and exploit it in another that is easier for computation or interpretation. We formulate a problem statement for optimal tree ensemble translation from one form to another and build a greedy solution to this problem.

1 Introduction

The combined efforts of industry and science in the field of machine learning give us powerful techniques and tools to solve different kinds of supervised learning tasks. Just a few lines of code can train a model that solves classification, regression, ranking, and other problems. Modern techniques, like the deep neural networks of He et al. (11), learn complex models, but for many practical applications we need to understand why and how the model makes a prediction. This knowledge allows us to improve quality, protect against adversarial attacks, make a model resistant to data corruption, and so on. Recent efforts in deep learning model interpretation allow us to understand the decision for particular examples Ribeiro et al. (26); Shrikumar et al. (27); Štrumbelj, Kononenko (31); Koh, Liang (15); Lundberg, Lee (19), but understanding the logic behind a complex model is still a challenging task.
Unlike neural networks, decision trees are supposed to be easy to understand, which is true in the case of shallow trees, but becomes a complicated task once we start using ensembles or increase the depth of a tree.
Ensemble methods, especially gradient boosted decision trees, show state-of-the-art results on structured and categorical data Prokhorenkova et al. (25). For well-engineered input features, decision tree ensembles significantly outperform deep networks: in competitions held by Kaggle, ensemble models built by GBDT libraries often outperform their rivals.
One way to approach the interpretation problem is to build a clearer representation of a model or an optimization setup. An example of such a change of perspective is (4): the results of this paper allow us to use known facts and intuitions from differential equations for neural network analysis.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our work proposes a new framework for representing tree ensembles learned by techniques like Random Forest (2) or Gradient Boosted Decision Trees Friedman (9). We call it MonoForest, as a sum of trees is converted into a forest of monomials. The proposed framework can be used both for theoretical tree ensemble analysis and for practical enhancements of existing methods. The main contributions of this paper are:

• an algorithm for tree ensemble conversion to polynomial form;
• a proof of uniqueness of the polynomial form for all shapes of trees that produce the same values;
• an algorithm for conversion of the polynomial ensemble representation to a sum of symmetric oblivious trees;
• an ensemble-wise pruning algorithm and its experimental study;
• a method for global feature attribution, based on the polynomial form of a tree ensemble.

The rest of the paper is organized as follows: section 3 introduces the new framework; the next three sections study applications of the proposed framework: theoretical analysis is in section 4, model reduction is in section 5, and interpretation is in section 6. Finally, section 7 summarizes future work and presents our conclusions.
2 Notation

At first, let us introduce the notation that will be used throughout the paper. I{∗} denotes the indicator function, i.e., I{condition} is equal to one if the condition is true and zero otherwise. The vector of input features will be denoted by x ∈ R^n. We assume that the decision tree ensemble is fixed; therefore there is a fixed number of minimal possible split rules C. For the rules we will use the following notation: c(x) ∈ {0, 1} is a right split indicator function, i.e. c(x) = I{x_{i(c)} > b(c)}, where i(c) is the index of the feature to split on, and b(c) is the condition value used; the associated left split indicator, I{x_{i(c)} ≤ b(c)}, is equal to 1 − c(x). We can group conditions by feature: C_k = {c ∈ C : i(c) = k}, and order them by condition value: for c_{ki}, c_{kj} ∈ C_k, b(c_{ki}) < b(c_{kj}) ⇔ i < j, with the maximal condition value denoted b_k for each feature. H is used for the entire ensemble, and h_t for the components of the ensemble, e.g. H(x) = ∑_{t=1}^{T} h_t(x); for a fixed decision tree we use d to denote the depth of the tree. By 2^C we denote the set of all possible products of right split indicator functions; elements of this set we call monomials and denote by M.
Popular boosting libraries have different growing policies for decision tree induction. All these policies use a greedy algorithm to search for a tree, but in slightly different manners. As a result, the libraries generate trees of different shapes:
XGBoost Chen, Guestrin (5), by default, uses a level-wise policy. Trees are built level by level; on each iteration each leaf is independently split by the condition with the best gain. Optionally, the tree is pruned after training. This policy generates an ensemble of balanced trees of fixed depth.
LightGBM Ke et al. (13) uses a loss-guide policy. This policy on each step of tree induction splits only a single leaf, the one with the best gain.
As a result, trees are usually highly imbalanced and deep.
CatBoost (25) uses a level-wise policy, similar to the one used in XGBoost, but with one more restriction: CatBoost searches for a single condition to split all leaves simultaneously. Thus, the result is, effectively, a decision table. Such trees are also called oblivious trees (16).
Throughout this paper we will refer to trees generated by XGBoost and LightGBM as non-symmetric trees, and to trees generated by CatBoost as symmetric.

3 Tree ensemble as polynomial

The core idea of MonoForest is quite simple: we transfer a single tree to an algebraic form and then minimize the entire ensemble, using properties of the tree representation. A decision tree splits the feature space into non-intersecting areas (leaves l), where a single constant or vector is used to specify the leaf property.

Figure 1: Core idea of MonoForest framework

It is possible to express this statement in algebraic form:

h(x) = ∑_{l∈leaves} w_l I{x ∈ l}   (1)

Each leaf indicator function is a product of indicators induced by splits along the path from the root to the terminal node:

I{x ∈ l} = ∏_{c∈right splits} c(x) ∏_{c∈left splits} (1 − c(x))   (2)

Thus, replacing the leaf indicator in equation (1) with equation (2) and expanding brackets, we obtain the tree in polynomial form:

h_t(x) = ∑_{M∈2^C} w_{Mt} ∏_{c∈M} c(x)   (3)

Figure 1 illustrates the idea of the transformation. The atomic parts of the representation (w_M ∏_{c∈M} c(x)) we refer to as monomials throughout the rest of this paper. The proposed representation of the tree has three valuable properties:

1. the values w_{Mt} are defined by the points satisfying all conditions of the monomial;
2. conditions in a single monomial are based on different features;
3.
it is possible to reorder conditions inside monomials.

The first property seems obvious, but it has interesting consequences: the set of points that define the value w_M shrinks as the number of conditions increases, and the value itself becomes noisy; for all trees in the ensemble the values w_{Mt} depend on the exact same set of points and consequently depend on each other. The level of dependency grows with the number of conditions in M.¹ In the extreme case, the conditions from M can split a single point from the others, and all w_{Mt} are defined by this single point.
The second property comes from the fact that stronger conditions devour weaker ones: if two conditions of the same monomial are based on the same feature, their borders are always ordered and we can remove the weaker one from the monomial. For example, assume a leaf is a product of two indicator functions I{x_0 > 0}I{x_0 ≤ 1}. This is equal to I{x_0 > 0}(1 − I{x_0 > 1}) = I{x_0 > 0} − I{x_0 > 0}I{x_0 > 1} = I{x_0 > 0} − I{x_0 > 1}: I{x_0 > 1} is the stronger condition in the second monomial, thus I{x_0 > 0} can be removed from it. This property limits the number of possible monomials to ∑_{i=1}^{d} \binom{n}{i} (\max_k |C_k|)^i, which is significantly smaller than the naive estimate ∑_{i=1}^{d} \binom{|C|}{i}. Please note that, despite this fact, for simplification of notation we still use 2^C as the set of possible monomials.
The third property allows optimizing model application time. For example, if there are two conditions that are always met together in the resulting model and one condition c_1 is much more computationally difficult

¹MonoForest decomposition is similar to n-way ANOVA decomposition, where the dependence on factors x, y, z is decomposed into main effects (x, y, z) and their interactions (xy, yz, xyz). In ANOVA x, y, z are categorical factors, while in MonoForest they are right split indicator functions.

(e.g.
taken from a database) than the other, c_2, we can skip its calculation completely for examples x such that c_1(x) = 0.
Summation over the elements of the ensemble gives us the final form of the decision function:

H(x) = ∑_{M∈2^C} (∑_{t=1}^{T} w_{Mt}) ∏_{c∈M} c(x) = ∑_{M∈2^C} w_M ∏_{c∈M} c(x)   (4)

This transformation alone is able to reduce the number of monomials compared to the leaf count of the original model, due to the grouping of monomials from different trees.
To the best of our knowledge, this form was not studied before in the literature. The polynomial form is easier to analyze and interpret, and it provides a way to work with different tree shapes in a unified manner.

Implementation and conversion complexity Polynomial conversion can be done in a straightforward recursive way: we extract leaves from the tree one by one. Each leaf produces at most 2^d monomials, where d is the depth of this leaf. The conversion can therefore be done in O(|L| 2^d) time, where |L| denotes the number of leaves in the decision tree and d is the maximum depth.

4 Theoretical analysis

Decision tree equivalence, normalization, and minimal representation are known problems in statistics Lavalle, Fishburn (17) and computer science Zantema, Bodlaender (30). The proposed framework gives one more way to view these problems: we made a decomposition of a complex decision function into atomic decision factors with a certain degree of feature interaction. In this section we use this property to set up the task of changing the tree shape in an ensemble. We represent a tree ensemble as a sum of trees of fixed shape and minimize the length of this sum. To be able to declare the existence of such a transfer, we need to define an equivalence class of tree ensembles with the following theorem:
Theorem 1.
Two tree ensembles H and H′, defined on R^n,² have the same value for all possible points, ∀x ∈ R^n : H(x) = H′(x), iff: 1) their sets of conditions C and C′ are equal; 2) they have equal polynomial representations: ∀M ∈ 2^C, w_M = w′_M.

Proof. The reverse part of the proof is obvious. To prove the first proposition of the direct part, suppose, without loss of generality, that there is a condition c ∈ C, c ∉ C′. We can take x′, x′′ ∈ X such that H(x′) ≠ H(x′′), x′_{i(c)} = b(c), x′′_{i(c)} = b(c) + ε and x′_j = x′′_j for j ≠ i(c), because C is a minimal condition set for the ensemble H; otherwise there is no pair of points in X split by c, and c could be excluded from C. We can find ε such that C′ is not able to split x′ from x′′, because it does not contain c; consequently H′(x′) = H′(x′′). This contradicts the initial statement.
The condition sets are equal, and it remains to show that ∀M, w_M = w′_M.
It is clear that w_∅ = w′_∅, and we prove the induction step by the size of M, going from the lowest to the highest condition values for each feature in the set. For each feature combination C_M = {i(c) | c ∈ M} we start from the lowest condition set M_0 (c ∈ M_0 ⇒ b(c) = c_{i(c)0}) and proceed to the highest M̄ (c ∈ M̄ ⇒ b(c) = b_{i(c)}), raising a single condition at a time. For the monomial M_i we choose x′, x′′ ∈ R^n such that:

• for all coordinates k outside C_M both points have the value of the minimal bound for this feature: x′_k = x′′_k = c_{k0};
• for all coordinates from c ∈ M_i: x′_{i(c)} = b(c), x′′_{i(c)} = b(c) + ε.

For these points the difference between the values of the ensembles comes only from monomials over the feature set C_M: 0 = H(x′) − H(x′′) − (H′(x′) − H′(x′′)) = ∑_{j=0}^{i} (w_{M_j} − w′_{M_j}), because all monomials of lesser length are equal in H and H′ by the induction hypothesis, and monomials of the same length with higher conditions (j > i) are zero by the construction of x′ and x′′. At the first step there is only one element left in the right part of the equation: 0 = w_{M_0} − w′_{M_0}, so w_{M_0} = w′_{M_0}; constructing such a pair of points at each step of the induction, it is easy to show that all w_{M_i} = w′_{M_i}.

²To save space, we formulated the theorem in the R^n task space; this limitation can easily be eliminated, but the proof would be longer.

Figure 2: Oblivious tree conversion to polynomial form

The direct implication of the theorem is that ensembles of trees that are able to generate the same set of monomials (e.g. of the same depth) are equivalent. Theoretically, this allows us to claim the equivalence of the decision sets used by libraries with different shapes of atomic trees. The polynomial form determines the tree ensemble, but it is also possible to convert the polynomial form back to a tree ensemble of a different shape. This task can be formalized in the following way:

{h*_t}_{t=1}^{T} = argmin_{T, h_t ∈ H} E_{x∼X}[(H(x) − ∑_{t=1}^{T} h_t(x))²] + λT   (5)

where H is the set of trees of a certain shape and λ is a regularization parameter.
This set optimization is clearly NP-hard: there is a set cover problem hidden inside it.³ On the other hand, the target of the optimization is submodular, so it is possible to find an approximately optimal solution in polynomial time Bach (1). Unfortunately, set optimization algorithms often become impractical because of their computational difficulty, and we decided to start from a simple greedy algorithm, leaving the mathematically sound version for future work. We fixed the tree shape to symmetric oblivious trees. This type of tree is trivial to compute, and it is possible to speed up decision function computation if we can convert tree ensembles of arbitrary shapes to it. From a practical perspective, solving this task allows us to train a model in a form that is easier to train but heavier, and then transfer the resulting solution to some lightweight form that is easier to exploit.
The polynomial form of the ensemble suggests a greedy algorithm for such optimization: at each step, we eliminate the monomial with the greatest number of features in it. Using the fact that oblivious trees have a single monomial of maximum length (see Figure 2), the task becomes much easier. The resulting algorithm is presented in Algorithm 1.
There is no universal strategy for tree induction: the optimal one changes with the dataset at hand. On the other hand, symmetric trees outperform other tree shapes in evaluation speed: they are coded by a decision table, so evaluation of one tree requires several bit-wise operations and one look-up in the table, and takes >10x less time for the same number of trees Dorogush et al. (6).
For our experiment we used a model built by LightGBM for the Higgs dataset⁴, transferred it to an ensemble of symmetric trees by Algorithm 1, and then compared model execution times of the original model and the transformed version.
The time needed to apply the original model with the fast CatBoost calculator for non-symmetric trees was 2.57 seconds; the transformed ensemble was applied in 1.57 seconds⁵. This gives us a 40% speedup free of charge.

³We need to cover all monomials by polynomials of the fixed shape.
⁴LightGBM provides the best quality model for this dataset.
⁵For the experiment we used a dual-socket server with Intel Xeon CPU E5-2650 and 256GB of RAM. The CatBoost version was 0.14.2.

Data: Monomials M_1 . . . M_N, monomial weights W_1, . . . , W_N
Result: Symmetric tree ensemble H(x) = ∑_{t=1}^{T} H[t](x)
def SymmetricTree(M: monomial features, W: weight):
    return oblivious tree, generated by monomial M with weights W;
def IsSubset(M: monomial features, h: tree):
    return True if features from M are a subset of the features of tree h;
def AddMonomialToTree(M: monomial features, W: weight, h: tree):
    add monomial M with weight W to tree h;
H = [];
T = 0;
for i ∈ 1 . . . N do
    if ∃t ∈ 0 . . . T : IsSubset(M_i, H[t]) = True then
        /* Tree has the same split conditions */
        AddMonomialToTree(M_i, W_i, H[t]);
    else
        H[T] = SymmetricTree(M_i, W_i);
        T = T + 1;
end
Return ∑_{t=1}^{T} H[t];

Algorithm 1: Greedy ensemble composition algorithm.

5 Model reduction

Decision tree pruning has been studied for decades: critical values, error complexity and reduced error methods (24), more recent bottom-up methods (14), minimum description length based methods (23), and others. Ensemble-wise pruning is less studied. Several works were done on the pruning of a trained ensemble. (21), (22) trained a random forest ensemble, later using boosting to prune it. On each boosting iteration, the tree search space was restricted to only those trees that were part of the random forest.
Kappa pruning Margineantu, Dietterich (20), and a modification of this technique (29), were proposed to prune AdaBoost ensembles. These heuristic techniques rely on the assumption that gradient boosting builds an ensemble of weak classifiers; greedy strategies are then used to select the most diverse sample of them. To the best of our knowledge, these methods are not used in practice today. In practice, pruning is done on a per-tree basis, and early stopping strategies are used to select the optimal ensemble size Chen, Guestrin (5); Prokhorenkova et al. (25).
In this part we use the first property of the polynomial representation: the monomial coefficients are determined by the same set of points in all trees of the ensemble. We will define a quality measure of the resulting monomials based on point statistics and remove the least valuable ones from the model. To measure monomial quality we use η(M) = E_{x∼X}[w_M² ∏_{c∈M} c(x)]. This approach is closely related to the feature importance proposed by Breiman et al. (3). They estimated the squared risk improvement from region partitioning; our measure is the estimated squared risk improvement over setting the specified monomial weight to zero. This way we get a simple ensemble-wise pruning strategy, while it is certainly possible to use more sophisticated methods like sparse re-weighting of monomials.
The introduced monomial quality measure leads to a straightforward pruning scheme: select some threshold α and, for each monomial M with η(M) < α, set its weight to zero. This threshold could be selected by heuristics like the 'elbow method', based on learn statistics only, or α could be estimated using a validation set. The former approach requires human judgment; thus its quality is hard to estimate.
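The quality measure and the threshold rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; a monomial is assumed to be a (weight, [(feature, border), ...]) pair, and the expectation is estimated as the mean over a data sample:

```python
import numpy as np

def monomial_indicator(X, conditions):
    """Product of right-split indicators I{x_f > b} over the monomial's conditions."""
    ind = np.ones(len(X))
    for feature, border in conditions:
        ind *= (X[:, feature] > border)
    return ind

def eta(X, weight, conditions):
    """Monomial quality eta(M): sample estimate of E_x[w_M^2 * prod_{c in M} c(x)]."""
    return (weight ** 2) * monomial_indicator(X, conditions).mean()

def prune(monomials, X, alpha):
    """Drop every monomial whose quality score falls below the threshold alpha."""
    return [(w, conds) for (w, conds) in monomials if eta(X, w, conds) >= alpha]
```

With this sketch, a rare monomial with a small weight is removed while a frequently active one survives, mirroring the squared-risk intuition in the text.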
The latter one can be applied automatically; thus we can estimate its performance in a fair way.
We use several publicly available binary classification datasets⁶ to evaluate the quality of the automatic approach. The experimental setup is the following: we split the data into train/validate/test groups; the parameters were tuned on the train/validation pair; then these parameters were used on the joined train+validate dataset to get the final model. The resulting model is evaluated on the test part of the data. It is important to note that we tuned the optimal gradient step, the number of trees in the ensemble, the regularization parameters of the single trees, etc. This way we get an ensemble with the minimal number of leaves using state-of-the-art techniques.
The obtained ROC AUC values are presented in the second (original model) and third (pruned model) columns of Table 1. The last column contains the polynomial model reduction ratio. The experimental results allow us to claim that a tree ensemble can be significantly reduced without loss of model quality in a variety of practical tasks. The features used there are of numerical as well as categorical nature.

⁶Information about the datasets is available in the supplementary materials.

DataSet        Trained ensemble (AUC)   Pruned ensemble (AUC)   Model reduction (Ratio)
Adult          92.76%                   92.75%                  2.58
Amazon         82.51%                   82.51%                  2.6
KDD Internet   95.71%                   95.74%                  2.07
KDD Upselling  85.72%                   85.72%                  3.78
Epsilon        95.76%                   95.76%                  1.11

Table 1: Quality and size of original and pruned models.

6 Interpretation

Feature attribution methods are designed to answer why and how each feature influences model predictions. These methods can be global, describing feature influence on average, or local, explaining how the model deals with one sample.
Global feature attribution methods are well developed. Classical approaches are still widely used today. Breiman et al.
(3) gain, or the total reduction of loss contributed by all splits, provides a way to estimate the relative contribution of each input feature to the response. Partial dependency plots Hastie et al. (10) are used to summarize the dependence of the response on an input feature. Resampling strategies are used to design alternatives to the Breiman et al. (3) feature importance measures. Breiman's 'Variable Importance' Breiman (2) for random forests and its model-agnostic version 'Model Reliance' Fisher et al. (8) are the best known examples. Other sampling strategies lead to a big variety of others Díaz-Uriarte, Andrés de (7); Ishwaran (12); Strobl et al. (28).
Local feature attribution methods deal with importance measures for each sample in the dataset. Work by Lundberg, Lee (19) has recently shown that, under certain conditions, there is a single unique solution for additive feature attribution methods with three desirable properties (local accuracy, missingness, and consistency): SHAP values. This work was adapted to decision trees in (18).
All these techniques can be applied in combination with the MonoForest framework. However, our representation also allows us to interpret a tree ensemble model as a linear function. As shown in Section 3, it is possible to present a tree ensemble as a polynomial:

H(x) = ∑_{M∈2^C} w_M ∏_{c∈M} I{x_{i(c)} > b(c)}   (6)

Due to the second property of the representation, for each feature k a monomial M either contains a single condition c such that i(c) = k, or has no dependency on this feature. Let us denote C_{−k} = {c ∈ C : i(c) ≠ k}, and let c_{ki} be the i-th ordered border condition on feature k: i(c) = k, i > j ⇔ b(c_{ki}) > b(c_{kj}).
We can redistribute the monomials in the following way:

H(x) = ∑_{M∈2^{C_{−k}}} w_M ∏_{c∈M} c(x) + ∑_i I{x_k > b_{ki}} (∑_{M∈2^{C_{−k}}} w_{M∪{c_{ki}}} ∏_{c∈M} c(x))   (7)

This formula is linearly dependent on the conditions I{x_k > b_{ki}}, and we can use the expected linear coefficients to evaluate the influence of each condition c_{ki}. To get a single value for an entire feature, we sum over all dependent condition values:

ν(k) = ∑_i E_{x∼X}[I{x_k > b_{ki}} ∑_{M∈2^{C_{−k}}} w_{M∪{c_{ki}}} ∏_{c∈M} c(x)]   (8)

The aggregation into the final feature score is straightforward and has its limitations. For example, if the monomial values are spread around zero throughout the data points, the feature will have a near-zero expectation despite its possibly big influence on particular points. The naive nature of the proposed score does not interfere with its good behavior in real-world examples.
There is no established way to compare feature attribution methods. To demonstrate the quality of the proposed approach, we built a decision tree ensemble binary one-vs-rest classifier for each class of the MNIST dataset using CatBoost and analyzed each classifier using three methods: MonoForest, SHAP, and the permutation-based Model Reliance proposed by Fisher et al. (8). The model accuracy for each class was > 98%. The task of the MNIST dataset is to identify a single handwritten digit; the features are the grey levels of each pixel in the input image. We calculated importance values for all features and put them on a picture of the same size as the input images.

Figure 3: Feature importance visualisations on MNIST data.
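Since I{x_k > b_{ki}} times the remaining conditions of M is just the full monomial M ∪ {c_{ki}}, the score ν(k) reduces to summing the expected value of every monomial that conditions on feature k. A minimal sketch under that observation (not the authors' code; same (weight, [(feature, border), ...]) monomial convention as above is assumed):

```python
import numpy as np

def monomial_value(X, weight, conditions):
    """Sample estimate of E_x[w_M * prod_{c in M} I{x_f > b}]."""
    ind = np.ones(len(X))
    for feature, border in conditions:
        ind *= (X[:, feature] > border)
    return weight * ind.mean()

def feature_scores(monomials, X, n_features):
    """nu(k): sum of expected values of every monomial that conditions on feature k."""
    nu = np.zeros(n_features)
    for weight, conditions in monomials:
        value = monomial_value(X, weight, conditions)
        for feature, _ in conditions:
            nu[feature] += value
    return nu
```

Because monomial values are signed, the resulting per-feature scores keep the sign of the influence, which is the property highlighted in the MNIST comparison.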
SHAP is known as a local attribution method, but its authors suggested a way to use it for global attribution too. The results for the first four classes are presented in Figure 3. In comparison to SHAP global attribution and Model Reliance, the MonoForest results allow splitting the positive and negative influence of features.

7 Conclusions and Future work

In this work, we have introduced a new framework for decision tree ensemble representation. The proposed framework demonstrated good results on two popular tasks: model interpretation and reduction. The new representation has interesting mathematical properties in its components, which allow us to simplify algorithms relevant to decision tree ensembles. We show that, using primitive filtering techniques for linear models, it is possible to significantly simplify the original model without loss of quality, even after other pruning strategies have been applied.
Another important application of MonoForest is model interpretation. We used a straightforward approach to polynomial model analysis and were able to get promising results in comparison with state-of-the-art techniques such as SHAP and VI, though the comparison of interpretation methods is not a well-established field yet.
The proposed representation allows us to introduce an equivalence class on tree ensembles and claim that the decision spaces of the popular GBDT libraries are equivalent. We have set up the problem of optimal ensemble conversion and provided a greedy algorithm to solve it. It is important to note that in this way we can separate the tasks of ensemble training and performance optimization of the resulting decision function. As an example of such a conversion, we used easy-to-train LightGBM trees and then converted them to the more effective form of oblivious trees used by CatBoost.
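The greedy composition step of this conversion (Algorithm 1) can be sketched compactly. This is an illustrative simplification, not the paper's implementation: monomials are processed longest-first, each new oblivious tree is defined by the condition set of its first monomial, and a monomial is merged into a tree when its conditions are a subset of that tree's conditions (the paper's IsSubset test is stated over features; the condition-subset variant used here is an assumption of the sketch):

```python
def compose_symmetric_trees(monomials):
    """Greedily pack monomials into oblivious trees (sketch of Algorithm 1).

    Each tree is a (condition_set, members) pair: the condition set defines the
    depth-|S| decision table, and members are the (weight, conditions) monomials
    whose weights are summed in the leaves they activate.
    """
    trees = []
    # Longest monomials first, so a tree's condition set covers later, shorter ones.
    for weight, conditions in sorted(monomials, key=lambda m: -len(m[1])):
        cset = set(conditions)
        for tree_conditions, members in trees:
            if cset <= tree_conditions:  # monomial fits into an existing tree
                members.append((weight, conditions))
                break
        else:  # no existing tree covers it: start a new oblivious tree
            trees.append((cset, [(weight, conditions)]))
    return trees
```

Every monomial lands in exactly one tree, so the sum of the resulting decision tables reproduces the original polynomial.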
The algorithm can be improved by set optimization techniques to achieve even better results.
The MonoForest framework has shown its applicability to white-box analysis of decision tree ensembles. This work shows that even 'naive' methods provide good results, promising much more when combined with more sophisticated techniques. There are several directions for future exploration:

• expansion of the framework's applications to other tasks;
• improvement of the ensemble tree pruning algorithm: LASSO and other regularization-based techniques look promising;
• greedy performance optimization of model application: if we want to build a list ordered by model value, it is possible to skip the evaluation of some computationally demanding features.

References
[1] Bach Francis R. Learning with Submodular Functions: A Convex Optimization Perspective // CoRR. 2011. abs/1111.6453.
[2] Breiman Leo. Random Forests // Mach. Learn. 2001. 45, 1. 5–32.
[3] Breiman Leo, Friedman J. H., Olshen R. A., Stone C. J. Classification and Regression Trees. 1984.
[4] Chen Tian Qi, Rubanova Yulia, Bettencourt Jesse, Duvenaud David K. Neural Ordinary Differential Equations // NeurIPS. 2018. 6572–6583.
[5] Chen Tianqi, Guestrin Carlos. XGBoost: A scalable tree boosting system // Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. 785–794.
[6] Dorogush Anna Veronika, Ershov Vasily, Gulin Andrey. CatBoost: gradient boosting with categorical features support // CoRR. 2018. abs/1810.11363.
[7] Díaz-Uriarte Ramón, Andrés Sara Alvarez de. Gene selection and classification of microarray data using random forest // BMC Bioinformatics. 2006. 7. 3.
[8] Fisher Aaron, Rudin Cynthia, Dominici Francesca.
All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. 2018.

[9] Friedman Jerome H. Greedy function approximation: a gradient boosting machine // Annals of Statistics. 2001. 1189–1232.

[10] Hastie Trevor, Tibshirani Robert, Friedman Jerome H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. 2009. (Springer Series in Statistics).

[11] He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification // Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). Washington, DC, USA: IEEE Computer Society, 2015. 1026–1034. (ICCV '15).

[12] Ishwaran Hemant. Variable importance in binary regression trees and forests // Electronic Journal of Statistics. Jan 2007. 1. 519–537.

[13] Ke Guolin, Meng Qi, Finley Thomas, Wang Taifeng, Chen Wei, Ma Weidong, Ye Qiwei, Liu Tie-Yan. LightGBM: A Highly Efficient Gradient Boosting Decision Tree // Advances in Neural Information Processing Systems 30. 2017. 3146–3154.

[14] Kearns Michael J., Mansour Yishay. A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization // Proceedings of the Fifteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998. 269–277. (ICML '98).

[15] Koh Pang Wei, Liang Percy. Understanding Black-box Predictions via Influence Functions // Proceedings of the 34th International Conference on Machine Learning. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017. 1885–1894. (Proceedings of Machine Learning Research).

[16] Kohavi Ron, Li Chia-Hsin.
Oblivious Decision Trees Graphs and Top Down Pruning // Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1995. 1071–1077. (IJCAI '95).

[17] Lavalle Irving H., Fishburn Peter C. Equivalent decision trees and their associated strategy sets // Theory and Decision. Jul 1987. 23, 1. 37–63.

[18] Lundberg Scott M., Erion Gabriel G., Lee Su-In. Consistent Individualized Feature Attribution for Tree Ensembles // CoRR. 2018. abs/1802.03888.

[19] Lundberg Scott M., Lee Su-In. A Unified Approach to Interpreting Model Predictions // Advances in Neural Information Processing Systems 30. 2017. 4765–4774.

[20] Margineantu Dragos D., Dietterich Thomas G. Pruning Adaptive Boosting // Proceedings of the Fourteenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997. 211–218. (ICML '97).

[21] Martínez-Muñoz G., Hernández-Lobato D., Suárez A. An Analysis of Ensemble Pruning Techniques Based on Ordered Aggregation // IEEE Transactions on Pattern Analysis and Machine Intelligence. Feb 2009. 31, 2. 245–259.

[22] Martínez-Muñoz Gonzalo, Suárez Alberto. Using boosting to prune bagging ensembles // Pattern Recognition Letters. 2007. 28, 1. 156–165.

[23] Mehta Manish, Rissanen Jorma, Agrawal Rakesh. MDL-based Decision Tree Pruning // Proceedings of the First International Conference on Knowledge Discovery and Data Mining. 1995. 216–221. (KDD '95).

[24] Mingers John. An Empirical Comparison of Pruning Methods for Decision Tree Induction // Machine Learning. Nov 1989. 4, 2. 227–243.

[25] Prokhorenkova Liudmila, Gusev Gleb, Vorobev Aleksandr, Dorogush Anna Veronika, Gulin Andrey.
CatBoost: unbiased boosting with categorical features // Advances in Neural Information Processing Systems 31. 2018. 6638–6648.

[26] Ribeiro Marco Túlio, Singh Sameer, Guestrin Carlos. Anchors: High-Precision Model-Agnostic Explanations // Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. 2018. 1527–1535.

[27] Shrikumar Avanti, Greenside Peyton, Kundaje Anshul. Learning Important Features Through Propagating Activation Differences // CoRR. 2017. abs/1704.02685.

[28] Strobl Carolin, Boulesteix Anne-Laure, Kneib Thomas, Augustin Thomas, Zeileis Achim. Conditional variable importance for random forests // BMC Bioinformatics. 2008. 9.

[29] Tamon Christino, Xiang Jie. On the Boosting Pruning Problem // Machine Learning: ECML 2000. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000. 404–412.

[30] Zantema Hans, Bodlaender Hans. Finding Small Equivalent Decision Trees is Hard // International Journal of Foundations of Computer Science. 1999. 11. 343–354.

[31] Štrumbelj Erik, Kononenko Igor. Explaining Prediction Models and Individual Predictions with Feature Contributions // Knowl. Inf. Syst. Dec 2014. 41, 3. 647–665.