{"title": "Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 883, "page_last": 891, "abstract": "Predicting the execution time of computer programs is an important but challenging problem in the community of computer systems. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few from hundreds of the retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program\u2019s behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% by using a moderate number of training data samples. 
In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.", "full_text": "Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression\n\nLing Huang (Intel Labs Berkeley, ling.huang@intel.com), Jinzhu Jia (UC Berkeley, jzjia@stat.berkeley.edu), Byung-Gon Chun (Intel Labs Berkeley, byung-gon.chun@intel.com), Petros Maniatis (Intel Labs Berkeley, petros.maniatis@intel.com), Bin Yu (UC Berkeley, binyu@stat.berkeley.edu), Mayur Naik (Intel Labs Berkeley, mayur.naik@intel.com)\n\nAbstract\n\nPredicting the execution time of computer programs is an important but challenging problem in the computer systems community. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. 
Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few from hundreds of the retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program's behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% by using a moderate number of training data samples. In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.\n\n1 Introduction\n\nComputing systems today are ubiquitous, and range from the very small (e.g., iPods, cellphones, laptops) to the very large (servers, data centers, computational grids). At the heart of such systems are management components that decide how to schedule the execution of different programs over time (e.g., to ensure high system utilization or efficient energy use [11, 15]), how to allocate to each program resources such as memory, storage and networking (e.g., to ensure a long battery life or fair resource allocation), and how to weather anomalies (e.g., flash crowds or attacks [6, 17, 24]).\n\nThese management components typically must make guesses about how a program will perform under given hypothetical inputs, so as to decide how best to plan for the future. 
For example, consider a simple scenario in a data center with two computers, fast computer A and slow computer B, and a program waiting to run on a large file f stored in computer B. A scheduler is often faced with the decision of whether to run the program at B, potentially taking longer to execute, but avoiding any transmission costs for the file; or moving the file from B to A but potentially executing the program at A much faster. If the scheduler can predict accurately how long the program would take to execute on input f at computer A or B, it can make an optimal decision, returning results faster, possibly minimizing energy use, etc.\n\nDespite all these opportunities and demands, uses of prediction have been at best unsophisticated in modern computer systems. Existing approaches either create analytical models for the programs based on simplistic assumptions [12], or treat the program as a black box and create a mapping function between certain properties of input data (e.g., file size) and output response [13]. The success of such methods is highly dependent on human experts who are able to select important predictors before a statistical modeling step can take place. Unfortunately, in practice experts may be hard to come by, because programs can quickly grow complex beyond the capabilities of a single expert, or because they may be short-lived (e.g., applications from the iPhone app store) and unworthy of the attention of a highly paid expert. 
Even when an expert is available, program performance is often dependent not on externally visible features such as command-line parameters and input files, but on the internal semantics of the program (e.g., what lines of code are executed).\n\nTo address this problem (lack of expert and inherent semantics), we recently developed a new system [7] to automatically extract a large number of features from the intermediate execution steps of a program (e.g., internal variables, loops, and branches) on sample inputs; then prediction models can be built from those features without the need for a human expert.\n\nIn this paper, we propose two Sparse POlynomial REgression (SPORE) algorithms that use the automatically extracted features to predict a computer program's performance. They are variants of each other in the way they build the nonlinear terms into the model: SPORE-LASSO first selects a small number of features and then entertains a full nonlinear polynomial expansion of order less than a given degree, while SPORE-FoBa adaptively chooses a subset of the fully expanded terms and hence allows a possibly higher polynomial order. Our algorithms are in fact new general methods motivated by the computer performance prediction problem. They can learn a relationship between a response (e.g., the execution time of a computer program given an input) and the generated features, and select a few from hundreds of features to construct an explicit polynomial form to predict the response. The compact and explicit polynomial form reveals important insights into the program semantics (e.g., the internal program loop that affects program execution time the most). Our approach is general, flexible and automated, and can adapt the prediction models to specific programs, computer platforms, and even inputs.\n\nWe evaluate our algorithms experimentally on three popular computer programs from web search and image processing. 
We show that our SPORE algorithms can achieve accurate predictions with relative error less than 7% by using a small amount of training data for our application, and that our algorithms outperform existing state-of-the-art sparse regression algorithms in the literature in terms of interpretability and accuracy.\n\nRelated Work. In prior attempts to predict program execution time, Gupta et al. [13] use a variant of decision trees to predict execution time ranges for database queries. Ganapathi et al. [11] use KCCA to predict time and resource consumption for database queries using statistics on query texts and execution plans. To measure the empirical computational complexity of a program, Trendprof [12] constructs linear or power-law models that predict program execution counts. The drawbacks of such approaches include their need for expert knowledge about the program to identify good features, or their requirement for simple input-size to execution time correlations.\n\nSeshia and Rakhlin [22, 23] propose a game-theoretic estimator of quantitative program properties, such as worst-case execution time, for embedded systems. These properties depend heavily on the target hardware environment in which the program is executed. Modeling the environment manually is tedious and error-prone. As a result, they formulate the problem as a game between their algorithm (player) and the program's environment (adversary), where the player seeks to accurately predict the property of interest while the adversary sets environment states and parameters.\n\nSince expert resource is limited and costly, it is desirable to automatically extract features from program code. Then machine learning techniques can be used to select the most important features to build a model. In statistical machine learning, feature selection methods under linear regression models such as LASSO have been widely studied in the past decade. 
Feature selection with non-linear models has been studied much less, but has recently been attracting attention. The most notable are the SpAM work with theoretical and simulation results [20] and additive and generalized forward regression [18]. Empirical studies with data of these non-linear sparse methods are very few [21]. The drawback of applying the SpAM method in our execution time prediction problem is that SpAM outputs an additive model and cannot use the interaction information between features. But it is well-known that features of computer programs interact to determine the execution time [12]. One non-parametric modification of SpAM to replace the additive model has been proposed [18]. However, the resulting non-parametric models are not easy to interpret and hence are not desirable for our execution time prediction problem. Instead, we propose the SPORE methodology and efficient algorithms to train a SPORE model. Our work provides a promising example of interpretable non-linear sparse regression models in solving real data problems.\n\n2 Overview of Our System\n\nOur focus in this paper is on algorithms for feature selection and model building. However, we first review the problem within which we apply these techniques to provide context [7]. Our goal is to predict how a given program will perform (e.g., its execution time) on a particular input (e.g., input files and command-line parameters). 
The system consists of four steps.\n\nFirst, the feature instrumentation step analyzes the source code and automatically instruments it to extract values of program features such as loop counts (how many times a particular loop has executed), branch counts (how many times each branch of a conditional has executed), and variable values (the k first values assigned to a numerical variable, for some small k such as 5).\n\nSecond, the profiling step executes the instrumented program with sample input data to collect values for all created program features and the program's execution times. The time impact of the data collection is minimal.\n\nThird, the slicing step analyzes each automatically identified feature to determine the smallest subset of the actual program that can compute the value of that feature, i.e., the feature slice. This is the cost of obtaining the value of the feature: if the whole program must execute to compute the value, then the feature is expensive and not useful, since we could just measure execution time directly and would have no need for prediction; whereas if only a little of the program must execute, the feature is cheap and therefore possibly valuable in a predictive model.\n\nFinally, the modeling step uses the feature values collected during profiling along with the feature costs computed during slicing to build a predictive model on a small subset of generated features. To obtain a model consisting of low-cost features, we iterate over the modeling and slicing steps, evaluating the cost of selected features and rejecting expensive ones, until only low-cost features are selected to construct the prediction model. 
At runtime, given a new input, the selected features are computed using the corresponding slices, and the model is used to predict execution time from the feature values.\n\nThe above description is minimal by necessity due to space constraints, and omits details on the rationale, such as why we chose the kinds of features we chose or how program slicing works. Though important, those details have no bearing on the results shown in this paper.\n\nAt present our system targets a fixed, overprovisioned computation environment without CPU job contention or network bandwidth fluctuations. We therefore assume that execution times observed during training will be consistent with system behavior on-line. Our approach can adapt to modest changes in the execution environment by retraining on different environments. In our future research, we plan to incorporate candidate features of both the hardware environment (e.g., configurations of CPU, memory, etc.) and the software environment (e.g., OS, cache policy, etc.) for predictive model construction.\n\n3 Sparse Polynomial Regression Model\n\nOur basic premise for predictive program analysis is that a small but relevant set of features may explain the execution time well. In other words, we seek a compact model, an explicit-form function of a small number of features, that accurately estimates the execution time of the program.\n\nTo make the problem tractable, we constrain our models to the multivariate polynomial family, for at least three reasons. First, a "good program" is usually expected to have polynomial execution time in some (combination of) features. Second, a polynomial model up to a certain degree can approximate well many nonlinear models (due to Taylor expansion). 
Finally, a compact polynomial model can provide an easy-to-understand explanation of what determines the execution time of a program, providing program developers with intuitive feedback and a solid basis for analysis.\n\nFor each computer program, our feature instrumentation procedure outputs a data set with n samples as tuples {y_i, x_i}, i = 1, ..., n, where y_i ∈ R denotes the ith observation of execution time, and x_i denotes the ith observation of the vector of p features. We now review some obvious alternative methods for modeling the relationship between Y = [y_i] and X = [x_i], point out their drawbacks, and then proceed to our SPORE methodology.\n\n3.1 Sparse Regression and Alternatives\n\nLeast squares regression is widely used for finding the best-fitting f(x, β) to a given set of responses y_i by minimizing the sum of the squares of the residuals [14]. Regression with subset selection finds for each k ∈ {1, 2, ..., m} the feature subset of size k that gives the smallest residual sum of squares. However, it is a combinatorial optimization and is known to be NP-hard [14]. In recent years a number of efficient alternatives based on model regularization have been proposed. Among them, LASSO [25] finds the selected features with coefficients β̂ given a tuning parameter λ as follows:\n\nβ̂ = argmin_β (1/2)‖Y − Xβ‖₂² + λ Σ_j |β_j|.  (1)\n\nLASSO effectively forces many β_j's to be 0, and selects a small subset of features (indexed by the non-zero β_j's) to build the model, which is usually sparse and has better prediction accuracy than models created by ordinary least squares regression [14] when p is large. 
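As an illustration of Equation (1), a minimal coordinate-descent solver for the LASSO objective can be sketched in numpy. This is only a didactic sketch (the paper relies on fast solvers such as those in [9, 10, 16, 19]); the function names are ours, and columns of X are assumed non-zero.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding: the 1-D minimizer of (1/2)(z - b)^2 + lam*|b|, scaled."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(Y, X, lam, n_iter=200):
    """Coordinate descent for (1/2)||Y - X b||_2^2 + lam * sum_j |b_j|.
    Illustrative sketch only; assumes every column of X is non-zero."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # X_j^T X_j for each column j
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's current contribution removed.
            r = Y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return beta
```

With λ = 0 this reduces to ordinary least squares; as λ grows, coordinates are thresholded to exactly zero, which is the sparsity behavior described above.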
Parameter λ controls the complexity of the model: as λ grows larger, fewer features are selected.\n\nBeing a convex optimization problem is an important advantage of the LASSO method, since several fast algorithms exist to solve the problem efficiently even with large-scale data sets [9, 10, 16, 19]. Furthermore, LASSO has convenient theoretical and empirical properties. Under suitable assumptions, it can recover the true underlying model [8, 25]. Unfortunately, when predictors are highly correlated, LASSO usually cannot select the true underlying model. The adaptive-LASSO [29], defined below in Equation (2), can overcome this problem:\n\nβ̂ = argmin_β (1/2)‖Y − Xβ‖₂² + λ Σ_j |β_j / w_j|,  (2)\n\nwhere w_j can be any consistent estimate of β_j. Here we choose w to be a ridge estimate of β:\n\nw = (XᵀX + 0.001·I)⁻¹ XᵀY,\n\nwhere I is the identity matrix.\n\nTechnically LASSO can be easily extended to create nonlinear models (e.g., using polynomial basis functions up to degree d of all p features). However, this approach gives us (p+d choose d) terms, which is very large when p is large (on the order of thousands) even for small d, making regression computationally expensive. We give two alternatives to fit the sparse polynomial regression model next.\n\n3.2 SPORE Methodology and Two Algorithms\n\nOur methodology captures non-linear effects of features, as well as non-linear interactions among features, by using polynomial basis functions over those features (we use "terms" to denote the polynomial basis functions subsequently). We expand the feature set x = {x₁, x₂, ..., x_k}, k ≤ p, to all the terms in the expansion of the degree-d polynomial (1 + x₁ + ... + x_k)^d, and use the terms to construct a multivariate polynomial function f(x, β) for the regression. 
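A useful observation about Equation (2): with the change of variables θ_j = β_j / w_j, the adaptive-LASSO becomes a plain LASSO in θ over columns rescaled by w_j, so any solver of Equation (1) can be reused. A numpy sketch, assuming the weights are non-zero and with a hypothetical `lasso_solver(Y, X, lam)` standing in for any Equation (1) solver:

```python
import numpy as np

def ridge_weights(X, Y, delta=0.001):
    """w = (X^T X + delta*I)^{-1} X^T Y, the ridge estimate used as
    adaptive-LASSO weights in Equation (2)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(p), X.T @ Y)

def adaptive_lasso(Y, X, lam, lasso_solver):
    """Solve Equation (2) by reduction to Equation (1).
    Sketch only; assumes all w_j != 0."""
    w = ridge_weights(X, Y)
    theta = lasso_solver(Y, X * w, lam)  # broadcasting scales column j by w_j
    return w * theta                     # map back: beta_j = w_j * theta_j
```

The rescaling means features with large (reliable) ridge estimates are penalized less, which is what lets adaptive-LASSO cope with the collinear expanded polynomial terms used in Step 2 of SPORE-LASSO.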
We define expan(X, d) as the mapping from the original data matrix X to a new matrix whose columns are the polynomial expansion terms up to degree d. For example, using a degree-2 polynomial with feature set x = {x₁, x₂}, we expand (1 + x₁ + x₂)² to get the terms 1, x₁, x₂, x₁², x₁x₂, x₂², and use them as basis functions to construct the following function for regression:\n\nexpan([x₁, x₂], 2) = [1, x₁, x₂, x₁², x₁x₂, x₂²],\n\nf(x, β) = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₁x₂ + β₅x₂².\n\nComplete expansion on all p features is not necessary, because many of them contribute little to the execution time. Motivated by this execution time application, we propose a general sparse polynomial regression methodology called SPORE. Next, we develop two algorithms to fit our SPORE methodology.\n\n3.2.1 SPORE-LASSO: A Two-Step Method\n\nFor a sparse polynomial model with only a few features, if we can preselect a small number of features, applying the LASSO on the polynomial expansion of those preselected features will still be efficient, because we do not have too many polynomial terms. Here is the idea:\n\nStep 1: Use the linear LASSO algorithm to select a small number of features and filter out the (often many) features that hardly contribute to the execution time.\n\nStep 2: Use the adaptive-LASSO method on the expanded polynomial terms of the selected features (from Step 1) to construct the sparse polynomial model.\n\nAdaptive-LASSO is used in Step 2 because of the collinearity of the expanded polynomial features. Step 2 can be computed efficiently if we only choose a small number of features in Step 1. 
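The expan(X, d) mapping defined above can be sketched for a single feature vector with the standard library; the function name mirrors the paper's notation but the code is our illustration. Note that the number of terms it produces is exactly (p+d choose d), the count cited earlier.

```python
from itertools import combinations_with_replacement

def expan(x, d):
    """All monomial terms of total degree <= d over the features in x,
    i.e., the terms of (1 + x_1 + ... + x_k)^d, constant term included."""
    terms = [1.0]  # degree-0 term
    for deg in range(1, d + 1):
        for combo in combinations_with_replacement(range(len(x)), deg):
            prod = 1.0
            for idx in combo:
                prod *= x[idx]
            terms.append(prod)
    return terms

# Degree-2 expansion of [x1, x2] -> [1, x1, x2, x1^2, x1*x2, x2^2]
print(expan([2.0, 3.0], 2))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
```

Applying this to every row of the selected sub-matrix X(S) gives the expanded design matrix used in Step 2.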
We present the resulting SPORE-LASSO algorithm in Algorithm 1 below.\n\nAlgorithm 1 SPORE-LASSO\nInput: response Y, feature data X, maximum degree d, λ₁, λ₂\nOutput: feature index set S, term index set Sₜ, weights β̂ for the d-degree polynomial basis\n1: α̂ = argmin_α (1/2)‖Y − Xα‖₂² + λ₁‖α‖₁\n2: S = {j : α̂_j ≠ 0}\n3: X_new = expan(X(S), d)\n4: w = (X_newᵀ X_new + 0.001·I)⁻¹ X_newᵀ Y\n5: β̂ = argmin_β (1/2)‖Y − X_new β‖₂² + λ₂ Σ_j |β_j / w_j|\n6: Sₜ = {j : β̂_j ≠ 0}\n\nX(S) in Step 3 of Algorithm 1 is a sub-matrix of X containing only the columns of X indexed by S. For a new observation with feature vector X = [x₁, x₂, ..., x_p], we first get the selected feature vector X(S), then obtain the polynomial terms X_new = expan(X(S), d), and finally compute the prediction Ŷ = X_new × β̂. Note that the prediction depends on the choice of λ₁, λ₂ and the maximum degree d. In this paper, we fix d = 3. λ₁ and λ₂ are chosen by minimizing the Akaike Information Criterion (AIC) on the LASSO solution paths. The AIC is defined as n·log(‖Y − Ŷ‖₂²) + 2s, where Ŷ is the fitted Y and s is the number of polynomial terms selected in the model. To be precise, for the linear LASSO step (Step 1 of Algorithm 1), a whole solution path over a range of λ₁ values can be obtained using the algorithm in [10]. On the solution path, for each fixed λ₁, we compute a solution path with varied λ₂ for Step 5 of Algorithm 1 to select the polynomial terms. For each λ₂, we calculate the AIC, and choose the (λ₁, λ₂) with the smallest AIC.\n\nOne may wonder whether Step 1 incorrectly discards features required for building a good model in Step 2. We next show theoretically that this is not the case. Let S be a subset of {1, 2, ..., p} and Sᶜ = {1, 2, ..., p} \\ S its complement. Write the feature matrix X as X = [X(S), X(Sᶜ)]. Let the response Y = f(X(S)) + ε, where f(·) is any function and ε is additive noise. Let n be the number of observations and s the size of S. We assume that X is deterministic, p and s are fixed, and the ε_i's are i.i.d. Gaussian with mean 0 and variance σ². Our results also hold for zero-mean sub-Gaussian noise with parameter σ². More general results regarding general scaling of n, p and s can also be obtained.\n\nUnder the following conditions, we show that Step 1 of SPORE-LASSO, the linear LASSO, selects the relevant features even if the response Y depends on the predictors X(S) nonlinearly:\n\n1. The columns X_j, j = 1, ..., p, of X are standardized: (1/n)·X_jᵀX_j = 1 for all j;\n2. Λ_min((1/n)·X(S)ᵀX(S)) ≥ c for a constant c > 0;\n3. min |(X(S)ᵀX(S))⁻¹X(S)ᵀf(X(S))| > α for a constant α > 0;\n4. (1/n)·X_{Sᶜ}ᵀ[I − X_S(X_SᵀX_S)⁻¹X_Sᵀ]f(X_S) < ηαc / (2√(s+1)) for some 0 < η < 1;\n5. ‖X_{Sᶜ}ᵀX_S(X_SᵀX_S)⁻¹‖_∞ ≤ 1 − η;\n\nwhere Λ_min(·) denotes the minimum eigenvalue of a matrix, ‖A‖_∞ is defined as max_i Σ_j |A_ij|, and the inequalities are defined element-wise.\n\nTheorem 3.1. Under the conditions above, with probability → 1 as n → ∞, there exists some λ such that β̂ = (β̂_S, β̂_{Sᶜ}) is the unique solution of the LASSO (Equation (1)), where β̂_j ≠ 0 for all j ∈ S, and β̂_{Sᶜ} = 0.\n\nRemark. 
The first two conditions are trivial: Condition 1 can be obtained by rescaling, while Condition 2 assumes that the design matrix composed of the true predictors in the model is not singular. Condition 3 is a reasonable condition which means that the linear projection of the expected response onto the space spanned by the true predictors is not degenerate. Condition 4 is a little bit tricky; it says that the irrelevant predictors (X_{Sᶜ}) are not very correlated with the "residuals" of E(Y) after its projection onto X_S. Condition 5 is always needed when considering LASSO's model selection consistency [26, 28]. The proof of the theorem is included in the supplementary material.\n\n3.2.2 Adaptive Forward-Backward: SPORE-FoBa\n\nUsing all of the polynomial expansions of a feature subset is not flexible. In this section, we propose the SPORE-FoBa algorithm, a more flexible algorithm using adaptive forward-backward searching over the polynomially expanded data: during search step k with an active set T(k), we examine one new feature X_j, and consider a small candidate set which consists of the candidate feature X_j, its higher order terms, and the (non-linear) interactions between the previously selected features (indexed by S) and the candidate feature X_j with total degree up to d, i.e., terms of the form\n\nX_j^{d₁} · Π_{l∈S} X_l^{d_l}, with d₁ > 0, d_l ≥ 0, and d₁ + Σ_l d_l ≤ d.  (3)\n\nAlgorithm 2 below is a short description of SPORE-FoBa, which uses linear FoBa [27] at steps 5 and 6. The main idea of SPORE-FoBa is that a term from the candidate set is added into the model if and only if adding this term makes the residual sum of squares (RSS) decrease substantially. We scan all of the terms in the candidate set and choose the one which makes the RSS drop most. 
If the drop in the RSS is greater than a pre-specified value ε, we add that term to the active set, which contains the terms currently selected by the SPORE-FoBa algorithm. When considering deleting one term from the active set, we choose the one that makes the residual sum of squares increase the least. If this increment is small enough, we delete that term from our current active set.\n\nAlgorithm 2 SPORE-FoBa\nInput: response Y, feature columns X₁, ..., X_p, the maximum degree d\nOutput: polynomial terms and the weights\n1: Let T = ∅\n2: while true do\n3:   for j = 1, ..., p do\n4:     Let C be the candidate set that contains non-linear and interaction terms from Equation (3)\n5:     Use linear FoBa to select terms from C to form the new active set T\n6:     Use linear FoBa to delete terms from T to form a new active set T\n7:   if no terms can be added or deleted then\n8:     break\n\n[Figure 1: Prediction errors of our algorithms across the three data sets, varying training-set fractions. Panels: (a) Lucene, (b) Find Maxima, (c) Segmentation; each plots prediction error versus percentage of training data for SPORE-LASSO and SPORE-FoBa.]\n\n4 Evaluation Results\n\nWe now experimentally demonstrate that our algorithms are practical, give highly accurate predictors for real problems with small training-set sizes, compare favorably in accuracy to other state-of-the-art sparse-regression algorithms, 
and produce interpretable, intuitive models.\n\nTo evaluate our algorithms, we use as case studies three programs: the Lucene Search Engine [4], and two image processing algorithms, one for finding maxima and one for segmenting an image (both implemented within the ImageJ image processing framework [3]). We chose all three programs according to two criteria. First and most importantly, we sought programs with high variability in the predicted measure (execution time), especially in the face of otherwise similar inputs (e.g., image files of roughly the same size for image processing). Second, we sought programs that implement reasonably complex functionality, for which an inexperienced observer would not be able to trivially identify the important features.\n\nOur collected datasets are as follows. For Lucene, we used a variety of text input queries from two corpora: the works of Shakespeare and the King James Bible. We collected a data set with n = 3840 samples, each of which consists of an execution time and a total of p = 126 automatically generated features. The time values are in the range (0.88, 1.13) with standard deviation 0.19. For the Find Maxima program within the ImageJ framework, we collected n = 3045 samples (from an equal number of distinct, diverse images obtained from three vision corpora [1, 2, 5]), and a total of p = 182 features. The execution time values are in the range (0.09, 2.99) with standard deviation 0.24. Finally, from the Segmentation program within the same ImageJ framework on the same image set, we collected again n = 3045 samples, and a total of p = 816 features for each. The time values are in the range (0.21, 58.05) with standard deviation 3.05. In all the experiments, we fix degree d = 3 for polynomial expansion, and normalize each column of feature data into the range [0, 1].\n\nPrediction Error. We first show that our algorithms predict accurately, even when training on a small number of samples, in both absolute and relative terms. The accuracy measure we use is the relative prediction error, defined as (1/nₜ) Σᵢ |(ŷᵢ − yᵢ)/yᵢ|, where nₜ is the size of the test data set, and the ŷᵢ's and yᵢ's are the predicted and actual responses on the test data, respectively.\n\nWe randomly split every data set into a training set and a test set for a given training-set fraction, train the algorithms, and measure their prediction error on the test data. For each training fraction, we repeat the "splitting, training and testing" procedure 10 times and show the mean and standard deviation of the prediction error in Figure 1. We see that our algorithms have high prediction accuracy, even when training on only 10% or less of the data (roughly 300-400 samples). Specifically, both of our algorithms can achieve less than 7% prediction error on both the Lucene and Find Maxima datasets; on the Segmentation dataset, SPORE-FoBa achieves less than 8% prediction error, and SPORE-LASSO achieves around 10% prediction error on average.\n\nComparisons to State-of-the-Art. We compare our algorithms to several existing sparse regression methods by examining their prediction errors at different sparsity levels (the number of features used in the model), and show that our algorithms clearly outperform LASSO, FoBa, and the recently proposed non-parametric greedy methods [18] (Figure 2). As the non-parametric greedy algorithm, we use Additive Forward Regression (AFR), because it is faster and often achieves better prediction accuracy than the Generalized Forward Regression (GFR) algorithms. We use the Glmnet Matlab implementation of LASSO to obtain the LASSO solution path [10]. 
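For readers unfamiliar with FoBa-style paths, a greedy forward-backward selection over candidate columns can be sketched in numpy as follows. This is our simplified illustration in the spirit of linear FoBa [27], with assumed thresholds (add when the RSS drop exceeds ε, delete when the rise is below ε/2), not the authors' implementation.

```python
import numpy as np

def rss(Y, X, cols):
    """Residual sum of squares of the least-squares fit of Y on X[:, cols]."""
    if not cols:
        return float(Y @ Y)
    A = X[:, cols]
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    r = Y - A @ beta
    return float(r @ r)

def foba(Y, X, eps):
    """Greedy forward-backward selection over columns of X (illustrative)."""
    active = []
    while True:
        changed = False
        # Forward: add the column with the largest RSS drop, if > eps.
        base = rss(Y, X, active)
        cand = [j for j in range(X.shape[1]) if j not in active]
        if cand:
            best_drop, best_j = max((base - rss(Y, X, active + [j]), j)
                                    for j in cand)
            if best_drop > eps:
                active.append(best_j)
                changed = True
        # Backward: drop the column whose removal raises RSS the least,
        # if that rise is small enough (eps/2 here, which guarantees progress).
        if len(active) > 1:
            base = rss(Y, X, active)
            min_rise, worst_j = min((rss(Y, X, [k for k in active if k != j])
                                     - base, j) for j in active)
            if min_rise < 0.5 * eps:
                active.remove(worst_j)
                changed = True
        if not changed:
            return sorted(active)
```

SPORE-FoBa runs this kind of selection not over raw features but over the candidate term sets of Equation (3), recording prediction error at each step to produce the sparsity path compared in Figure 2.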
Since FoBa and SPORE-FoBa naturally produce a path by adding or deleting features (or terms), we record the prediction error at each step. When two steps have the same sparsity level, we report the smaller prediction error. To generate the solution path for SPORE-LASSO, we first use Glmnet to generate a solution path for linear LASSO; then, at each sparsity level k, we perform full polynomial expansion with d = 3 on the selected k features, obtain a solution path on the expanded data, and choose the model with the smallest prediction error among all models computed from all active feature sets of size k. From the figure, we see that our SPORE algorithms have comparable performance, and both clearly achieve better prediction accuracy than LASSO, FoBa, and AFR. None of the existing methods can build models within 10% relative prediction error. We believe this is because the execution time of a computer program often depends on non-linear combinations of different features, which is usually not well handled by either the linear methods or the additive non-parametric methods.
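The full polynomial expansion step used above (all monomials of the k selected features up to total degree d = 3) can be sketched as follows; the function name and representation are ours, for illustration only:

```python
from itertools import combinations_with_replacement

def poly_expand(rows, d=3):
    """Expand each row of k selected features into all monomials of
    total degree 1..d. For two features w, h and d = 3 this yields
    w, h, w^2, wh, h^2, w^3, w^2*h, w*h^2, h^3."""
    expanded = []
    for row in rows:
        feats = []
        for degree in range(1, d + 1):
            # Each multiset of feature indices defines one monomial.
            for combo in combinations_with_replacement(range(len(row)), degree):
                term = 1.0
                for j in combo:
                    term *= row[j]
                feats.append(term)
        expanded.append(feats)
    return expanded
```

For k selected features this produces C(k + d, d) - 1 terms (9 terms for k = 2, d = 3), which stays small because the expansion is applied only to the few features already chosen.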
Instead, both of our algorithms can select 2-3 high-quality features and build models with non-linear combinations of them to predict execution time with high accuracy.

Figure 2: Performance of the algorithms: relative prediction error versus sparsity level. (a) Lucene; (b) Find Maxima; (c) Segmentation. Each panel compares LASSO, FoBa, AFR, SPORE-LASSO, and SPORE-FoBa.

Model Interpretability. To gain a better understanding, we investigate the details of the model constructed by SPORE-FoBa for Find Maxima. Our conclusions are similar for the other case studies, but we omit them due to space. We see that with different training-set fractions and with different sparsity configurations, SPORE-FoBa always selects two high-quality features from the hundreds of automatically generated ones. By consulting with experts on the Find Maxima program, we find that the two selected features correspond to the width (w) and height (h) of the region of interest in the image, which may in practice differ from the actual image width and height. These are indeed the most important factors determining the execution time of the particular algorithm used. For a 10% training-set fraction and $\epsilon = 0.01$, SPORE-FoBa obtained

$f(w, h) = 0.1 + 0.22w + 0.23h + 1.93wh + 0.24wh^2$

which uses non-linear feature terms (e.g., $wh$, $wh^2$) to predict the execution time accurately (around 5.5% prediction error).
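For illustration, the fitted model above can be written directly as code; the coefficients are copied from the reported model, and w and h are the normalized width and height features:

```python
def predict_find_maxima_time(w, h):
    """Fitted sparse polynomial for Find Maxima execution time:
    f(w, h) = 0.1 + 0.22*w + 0.23*h + 1.93*w*h + 0.24*w*h^2,
    where w and h are the normalized (range [0, 1]) width and height
    of the region of interest."""
    return 0.1 + 0.22 * w + 0.23 * h + 1.93 * w * h + 0.24 * w * h ** 2
```

The dominant wh term suggests the cost scales roughly with the area of the region of interest, which matches the intuition of a per-pixel scan.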
Especially when Find Maxima is used as a component of a more complex image processing pipeline, this model would not be the obvious choice even for an expert. In contrast, as observed in our experiments, neither the linear nor the additive sparse methods handle such non-linear terms well, and they result in inferior prediction performance. A more detailed comparison across different methods is the subject of our on-going work.

5 Conclusion

In this paper, we proposed the SPORE (Sparse POlynomial REgression) methodology to model the relationship between the execution time of computer programs and features of the programs. We introduced two algorithms to learn a SPORE model, and showed that both algorithms can predict execution time with more than 93% accuracy for the applications we tested. For the three test cases, these results represent a significant improvement (a 40% or greater reduction in prediction error) over other sparse modeling techniques in the literature when applied to this problem. Hence our work provides a convincing example of using sparse non-linear regression techniques to solve real problems. Moreover, SPORE is a general methodology that can be used to model computer program performance metrics other than execution time, and to solve problems from other areas of science and engineering.

References

[1] Caltech 101 Object Categories. http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html.
[2] Event Dataset. http://vision.stanford.edu/lijiali/event_dataset/.
[3] ImageJ. http://rsbweb.nih.gov/ij/.
[4] Mahout. lucene.apache.org/mahout.
[5] Visual Object Classes Challenge 2008. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/.
[6] S. Chen, K. Joshi, M. A. Hiltunen, W. H. Sanders, and R. D. Schlichting. Link gradients: Predicting the impact of network latency on multitier applications. In INFOCOM, 2009.
[7] B.-G. Chun, L. Huang, S. Lee, P.
Maniatis, and M. Naik. Mantis: Predicting system performance through program analysis and modeling. Technical Report, 2010. arXiv:1010.0019v1 [cs.PF].
[8] D. Donoho. For most large underdetermined systems of equations, the minimal 1-norm solution is the sparsest solution. Communications on Pure and Applied Mathematics, 59:797-829, 2006.
[9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
[10] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010.
[11] A. Ganapathi, H. Kuno, U. Dayal, J. L. Wiener, A. Fox, M. Jordan, and D. Patterson. Predicting multiple metrics for queries: Better decisions enabled by machine learning. In ICDE, 2009.
[12] S. Goldsmith, A. Aiken, and D. Wilkerson. Measuring empirical computational complexity. In FSE, 2007.
[13] C. Gupta, A. Mehta, and U. Dayal. PQR: Predicting query execution times for autonomous workload management. In ICAC, 2008.
[14] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2009.
[15] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of SOSP '09, 2009.
[16] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale l1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606-617, 2007.
[17] Z. Li, M. Zhang, Z. Zhu, Y. Chen, A. Greenberg, and Y.-M. Wang. WebProphet: Automating performance prediction for web services. In NSDI, 2010.
[18] H. Liu and X. Chen. Nonparametric greedy algorithm for the sparse learning problems. In NIPS 22, 2009.
[19] M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual.
Journal of Computational and Graphical Statistics, 9(2):319-337, 2000.
[20] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009-1030, 2009.
[21] P. Ravikumar, V. Vu, B. Yu, T. Naselaris, K. Kay, and J. Gallant. Nonparametric sparse hierarchical models describe V1 fMRI responses to natural images. Advances in Neural Information Processing Systems (NIPS), 21, 2008.
[22] S. A. Seshia and A. Rakhlin. Game-theoretic timing analysis. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 575-582. IEEE Press, Nov. 2008.
[23] S. A. Seshia and A. Rakhlin. Quantitative analysis of systems using game-theoretic learning. ACM Transactions on Embedded Computing Systems (TECS), 2010. To appear.
[24] M. Tariq, A. Zeitoun, V. Valancius, N. Feamster, and M. Ammar. Answering what-if deployment and configuration questions with WISE. In ACM SIGCOMM, 2008.
[25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 1996.
[26] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183-2202, 2009.
[27] T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. Advances in Neural Information Processing Systems, 22, 2008.
[28] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.
[29] H. Zou. The adaptive lasso and its oracle properties.
Journal of the American Statistical Association, 101(476):1418-1429, 2006.