{"title": "Learning with Feature Evolvable Streams", "book": "Advances in Neural Information Processing Systems", "page_first": 1417, "page_last": 1427, "abstract": "Learning with streaming data has attracted much attention during the past few years. Though most studies consider data streams with fixed features, in real practice the features may be evolvable. For example, features of data gathered by limited-lifespan sensors will change when these sensors are substituted by new ones. In this paper, we propose a novel learning paradigm: Feature Evolvable Streaming Learning where old features would vanish and new features would occur. Rather than relying on only the current features, we attempt to recover the vanished features and exploit them to improve performance. Specifically, we learn two models from the recovered features and the current features, respectively. To benefit from the recovered features, we develop two ensemble methods. In the first method, we combine the predictions from two models and theoretically show that with the assistance of old features, the performance on new features can be improved. In the second approach, we dynamically select the best single prediction and establish a better performance guarantee when the best model switches. Experiments on both synthetic and real data validate the effectiveness of our proposal.", "full_text": "Learning with Feature Evolvable Streams

Bo-Jian Hou Lijun Zhang Zhi-Hua Zhou

National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing, 210023, China
{houbj,zhanglj,zhouzh}@lamda.nju.edu.cn

Abstract

Learning with streaming data has attracted much attention during the past few years. Though most studies consider data streams with fixed features, in real practice the features may be evolvable. For example, features of data gathered by limited-lifespan sensors will change when these sensors are substituted by new ones. 
In this paper, we propose a novel learning paradigm: Feature Evolvable Streaming Learning, where old features vanish and new features occur. Rather than relying only on the current features, we attempt to recover the vanished features and exploit them to improve performance. Specifically, we learn two models from the recovered features and the current features, respectively. To benefit from the recovered features, we develop two ensemble methods. In the first method, we combine the predictions from the two models and theoretically show that with the assistance of old features, the performance on new features can be improved. In the second approach, we dynamically select the best single prediction and establish a better performance guarantee when the best model switches. Experiments on both synthetic and real data validate the effectiveness of our proposal.

1 Introduction

In many real tasks, data are accumulated over time, and thus learning with streaming data has attracted much attention during the past few years. Many effective approaches have been developed, such as the Hoeffding tree [7], Bayes tree [27], evolving granular neural network (eGNN) [17], Core Vector Machine (CVM) [29], etc. Though these approaches are effective for certain scenarios, they share a common assumption: the data stream comes with a fixed, stable feature space. In other words, the data samples are always described by the same set of features. Unfortunately, this assumption does not hold in many streaming tasks. For example, for ecosystem protection one can deploy many sensors in a reserve to collect data, where each sensor corresponds to an attribute/feature. Due to their limited lifespan, many sensors wear out after some time, whereas new sensors can be deployed. 
Thus, features corresponding to the old sensors vanish while features corresponding to the new sensors appear, and the learning algorithm needs to work well in such an evolving environment. Note that the ability to adapt to environmental change is one of the fundamental requirements for learnware [37], where an important aspect is the ability to handle evolvable features.

A straightforward approach is to rely on the new features and learn a new model. However, this solution suffers from some deficiencies. First, when new features have just emerged, there are few data samples described by these features, and thus the training samples might be insufficient to train a strong model. Second, the old model of the vanished features is ignored, which wastes our data collection effort. To address these limitations, in this paper we propose a novel learning paradigm: Feature Evolvable Streaming Learning (FESL). We formulate the problem based on a key observation: in general, features do not change in an arbitrary way; instead, there are some overlapping periods in which both old and new features are available. Returning to the ecosystem protection example, since the lifespan of sensors is known to us, e.g., how long their batteries will last is prior knowledge, we usually deploy a set of new sensors before the old ones wear out. Thus, the data stream arrives as shown in Figure 1: in period T1, the original set of features is valid; at the end of T1, period B1 appears, where the original set of features is still accessible but some new features are included; then in T2, the original features vanish and only the new features are valid; at the end of T2, period B2 appears, where newer features come.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
This process will repeat again and again. Note that the T1 and T2 periods are usually long, whereas the B1 and B2 periods are short because, as in the ecosystem protection example, the B1 and B2 periods are only used to switch the sensors, and we do not want to waste too much sensor lifetime on such overlapping periods.

Figure 1: Illustration of how the data stream arrives.

In this paper, we propose to solve the FESL problem by utilizing the overlapping period to discover the relationship between the old and new features, and exploiting the old model even when only the new features are available. Specifically, we try to learn a mapping from new features to old features through the samples in the overlapping period. In this way, we are able to reconstruct old features from new ones, and thus the old model can still be applied. To benefit from the additional features, we develop two ensemble methods, one in a combination manner and the other in a dynamic selection manner. In the first method, we combine the predictions from the two models and theoretically show that with the assistance of old features, the performance on new features can be improved. In the second approach, we dynamically select the best single prediction and establish a better performance guarantee when the best model switches at an arbitrary time. Experiments on synthetic and real datasets validate the effectiveness of our proposal.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the formulation of FESL. Our proposed approaches with corresponding analyses are presented in Section 4. Section 5 reports experimental results. Finally, Section 6 concludes.

2 Related Work

Data stream mining contains several tasks, including classification, clustering, frequency counting, and time series analysis. Our work is most related to the classification task, and our approach can also solve regression problems. 
Existing techniques for data stream classification can be divided into two categories: one considers only a single classifier and the other considers ensembles of classifiers. For the former, several methods originate from approaches such as decision trees [7], Bayesian classification [27], neural networks [17], support vector machines [29], and k-nearest neighbour [1]. For the latter, various ensemble methods have been proposed, including Online Bagging & Boosting [22], Weighted Ensemble Classifiers [30, 20], Adapted One-vs-All Decision Trees (OVA) [12] and Meta-knowledge Ensemble [33]. For more details, please refer to [9, 10, 2, 6, 21]. These traditional streaming data algorithms often assume that the data samples are described by the same set of features, while in many real streaming tasks the features often change. We want to emphasize that though concept drift happens in streaming data, i.e., the underlying data distribution changes over time [2, 10, 4], the set of features under concept drift never changes, which is different from our problem. Most studies related to changing features focus on feature selection and extraction [26, 35], and to the best of our knowledge, none of them consider the evolution of the feature set during the learning process.

Data stream mining is a hot research direction in the area of data mining, while online learning [38, 14] is a related topic from the area of machine learning. Online learning can also tackle the streaming data problem since it assumes that the data come in a streaming way. Online learning has been extensively studied under different settings, such as learning with experts [5] and online convex optimization [13, 28]. There are strong theoretical guarantees for online learning, and it usually uses regret or the number of mistakes to measure the performance of the learning procedure. 
However, most existing online learning algorithms are limited to the case where the feature set is fixed.

Other related topics involving multiple feature sets include multi-view learning [18, 19, 32], transfer learning [23, 24] and incremental attribute learning [11]. Although both our approaches and multi-view learning exploit the relation between different sets of features, there exists a fundamental difference: multi-view learning assumes that every sample is described by multiple feature sets simultaneously, whereas in FESL only a few samples in the feature switching period have two sets of features, and no matter how many periods there are, the switching part involves only two sets of features. Transfer learning usually assumes that data are in batch mode; few transfer methods consider the streaming case where data arrive sequentially and cannot be stored completely. One exception is online transfer learning [34], in which data from both sets of features arrive sequentially. However, it assumes that all the feature spaces appear simultaneously during the whole learning process, an assumption that does not hold in FESL. As for incremental attribute learning, the old sets of features do not vanish, or do not vanish entirely, while in FESL the old features vanish completely when the new sets of features come.

The most related work is [15], which also handles evolving features in streaming data. Different from our setting, where there are overlapping periods, [15] handles situations where there is no overlapping period but there are overlapping features. 
Thus, the technical challenges and solutions are different.

3 Preliminaries

We focus on both classification and regression tasks. On each round of the learning process, the algorithm observes an instance and gives its prediction. After the prediction has been made, the true label is revealed and the algorithm suffers a loss which reflects the discrepancy between the prediction and the ground truth. We define a "feature space" in our paper by a set of features, and say the feature space changes when both the underlying distribution over the feature set and the number of features change. Consider a process with three periods: in the first period, a large amount of streaming data comes from the old feature space; in the second period, named the overlapping period, a small amount of data comes from both the old and the new feature space; and in the third period, the data stream comes only from the new feature space. We call this whole process a cycle. As can be seen from Figure 1, each cycle involves merely two feature spaces. Thus, we only need to focus on one cycle, and it is easy to extend to the case with multiple cycles. Besides, we assume that the old features in one cycle vanish simultaneously, motivated by the ecosystem protection example: all the sensors share the same expected lifespan and thus wear out at the same time. We leave the case where old features do not vanish simultaneously to future work.

Based on the above discussion, we consider only two feature spaces, denoted by $S_1$ and $S_2$, respectively. Suppose that in the overlapping period, there are $B$ rounds of instances from both $S_1$ and $S_2$. As can be seen from Figure 2, the process can be summarized as follows.

• For $t = 1, \dots, T_1 - B$, in each round the learner observes a vector $\mathbf{x}_t^{S_1} \in \mathbb{R}^{d_1}$ sampled from $S_1$, where $d_1$ is the number of features of $S_1$ and $T_1$ is the total number of rounds in $S_1$.

• For $t = T_1 - B + 1, \dots, T_1$, in each round the learner observes two vectors $\mathbf{x}_t^{S_1} \in \mathbb{R}^{d_1}$ and $\mathbf{x}_t^{S_2} \in \mathbb{R}^{d_2}$ from $S_1$ and $S_2$, respectively, where $d_2$ is the number of features of $S_2$.

• For $t = T_1 + 1, \dots, T_1 + T_2$, in each round the learner observes a vector $\mathbf{x}_t^{S_2} \in \mathbb{R}^{d_2}$ sampled from $S_2$, where $T_2$ is the number of rounds in $S_2$. Note that $B$ is small, so we can omit the streaming data from $S_2$ on rounds $T_1 - B + 1, \dots, T_1$ since they have a minor effect on training the model in $S_2$.

Figure 2: Specific illustration with one cycle.

We use $\|\mathbf{x}\|$ to denote the $\ell_2$-norm of a vector $\mathbf{x} \in \mathbb{R}^{d_i}$, $i = 1, 2$. The inner product is denoted by $\langle \cdot, \cdot \rangle$. Let $\Omega_1 \subseteq \mathbb{R}^{d_1}$ and $\Omega_2 \subseteq \mathbb{R}^{d_2}$ be two sets of linear models that we are interested in. We define the projection $\Pi_{\Omega_i}(\mathbf{b}) = \operatorname{argmin}_{\mathbf{a} \in \Omega_i} \|\mathbf{a} - \mathbf{b}\|$, $i = 1, 2$. 
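The paper leaves the model sets $\Omega_i$ abstract. As a concrete illustration (our assumption, not fixed by the paper), if $\Omega_i$ is taken to be an $\ell_2$-ball of radius $R$, the projection $\Pi_{\Omega_i}$ has a simple closed form:

```python
import numpy as np

def project_l2_ball(b: np.ndarray, radius: float) -> np.ndarray:
    """Euclidean projection of b onto the ball {a : ||a|| <= radius}.

    Points inside the ball are returned unchanged; points outside are
    rescaled onto the boundary, which is the closest point in l2-norm.
    """
    norm = np.linalg.norm(b)
    if norm <= radius:
        return b
    return (radius / norm) * b
```

Other choices of $\Omega_i$ (e.g., boxes) admit similarly cheap projections.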
We restrict our prediction function in the $i$-th feature space and $t$-th round to be linear, taking the form $\langle \mathbf{w}_{i,t}, \mathbf{x}_t^{S_i} \rangle$ where $\mathbf{w}_{i,t} \in \mathbb{R}^{d_i}$, $i = 1, 2$. The loss function $\ell(\mathbf{w}^\top \mathbf{x}, y)$ is convex in its first argument. In implementing the algorithms, we use the logistic loss for the classification task, namely $\ell(\mathbf{w}^\top \mathbf{x}, y) = (1/\ln 2)\ln(1 + \exp(-y(\mathbf{w}^\top \mathbf{x})))$, and the square loss for the regression task, namely $\ell(\mathbf{w}^\top \mathbf{x}, y) = (y - \mathbf{w}^\top \mathbf{x})^2$.

Algorithm 1 Initialize
1: Initialize $\mathbf{w}_{1,1} \in \Omega_1$ randomly, $M_1 = 0$, and $M_2 = 0$;
2: for $t = 1, 2, \dots, T_1$ do
3:   Receive $\mathbf{x}_t^{S_1} \in \mathbb{R}^{d_1}$ and predict $f_t = \mathbf{w}_{1,t}^\top \mathbf{x}_t^{S_1} \in \mathbb{R}$; receive the target $y_t \in \mathbb{R}$, and suffer loss $\ell(f_t, y_t)$;
4:   Update $\mathbf{w}_{1,t}$ using (1) where $\tau_t = 1/\sqrt{t}$;
5:   if $t > T_1 - B$ then $M_1 = M_1 + \mathbf{x}_t^{S_2}\mathbf{x}_t^{S_2\top}$ and $M_2 = M_2 + \mathbf{x}_t^{S_2}\mathbf{x}_t^{S_1\top}$;
6: $M^* = M_1^{-1} M_2$.

The most straightforward, baseline algorithm is to apply online gradient descent [38] on rounds $1, \dots, T_1$ with streaming data $\mathbf{x}_t^{S_1}$, and invoke it again on rounds $T_1 + 1, \dots, T_1 + T_2$ with streaming data $\mathbf{x}_t^{S_2}$. 
The models are updated according to (1), where $\tau_t$ is a time-varying step size:

$$\mathbf{w}_{i,t+1} = \Pi_{\Omega_i}\big(\mathbf{w}_{i,t} - \tau_t \nabla \ell(\mathbf{w}_{i,t}^\top \mathbf{x}_t^{S_i}, y_t)\big), \quad i = 1, 2. \qquad (1)$$

4 Our Proposed Approach

In this section, we first introduce the basic idea of the solution to FESL; then two different kinds of approaches with the corresponding analyses are proposed.

The major limitation of the baseline algorithm mentioned above is that the model learned on rounds $1, \dots, T_1$ is ignored on rounds $T_1 + 1, \dots, T_1 + T_2$. The reason is that from rounds $t > T_1$ we cannot observe data from feature space $S_1$, and thus the model $\mathbf{w}_{1,T_1}$, which operates in $S_1$, cannot be used directly. To address this challenge, we assume there is a certain relationship $\psi : \mathbb{R}^{d_2} \to \mathbb{R}^{d_1}$ between the two feature spaces, and we try to discover it in the overlapping period. There are several methods to learn a relationship between two sets of features, including multivariate regression [16], streaming multi-label learning [25], etc. In our setting, since the overlapping period is very short, it is unrealistic to learn a complex relationship between the two spaces. Instead, we use a linear mapping to approximate $\psi$. Assume the coefficient matrix of the linear mapping is $M$; then during rounds $T_1 - B + 1, \dots, T_1$, the estimation of $M$ can be based on least squares:

$$\min_{M \in \mathbb{R}^{d_2 \times d_1}} \sum_{t=T_1-B+1}^{T_1} \big\|\mathbf{x}_t^{S_1} - M^\top \mathbf{x}_t^{S_2}\big\|_2^2.$$

The optimal solution $M^*$ to the above problem is given by

$$M^* = \Big(\sum_{t=T_1-B+1}^{T_1} \mathbf{x}_t^{S_2}\mathbf{x}_t^{S_2\top}\Big)^{-1} \Big(\sum_{t=T_1-B+1}^{T_1} \mathbf{x}_t^{S_2}\mathbf{x}_t^{S_1\top}\Big).$$

Then if we only observe an instance $\mathbf{x}_t^{S_2} \in \mathbb{R}^{d_2}$ from $S_2$, we can recover an instance in $S_1$ by $\psi(\mathbf{x}^{S_2}) = M^{*\top}\mathbf{x}^{S_2} \in \mathbb{R}^{d_1}$, to which $\mathbf{w}_{1,T_1}$ can be applied. 
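As a concrete sketch of these two ingredients (ours; it assumes the square loss and an $\ell_2$-ball $\Omega$, neither of which is fixed by the paper), the projected update (1) and the closed-form least-squares estimate of $M$ can be implemented as follows; a small ridge term keeps the Gram matrix invertible when $B < d_2$:

```python
import numpy as np

def ogd_step(w, x, y, tau, radius=1.0):
    """One projected online-gradient step, cf. update (1), for the
    square loss l(w.x, y) = (y - w.x)^2 with step size tau."""
    grad = -2.0 * (y - w @ x) * x
    w_new = w - tau * grad
    norm = np.linalg.norm(w_new)          # project onto {w : ||w|| <= radius}
    if norm > radius:
        w_new *= radius / norm
    return w_new

def estimate_mapping(X2, X1, reg=1e-8):
    """Least-squares M minimizing sum_t ||x1_t - M^T x2_t||^2.

    X2: (B, d2) rows are overlap-period instances from S2
    X1: (B, d1) rows are the paired instances from S1
    Returns M of shape (d2, d1); a recovered instance is psi(x2) = M^T x2.
    """
    gram = X2.T @ X2 + reg * np.eye(X2.shape[1])   # sum_t x2 x2^T (+ ridge)
    cross = X2.T @ X1                              # sum_t x2 x1^T
    return np.linalg.solve(gram, cross)
```

In an actual stream, `gram` and `cross` would be accumulated rank-one term by rank-one term during the overlapping period, exactly as Algorithm 1 accumulates $M_1$ and $M_2$.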
Based on this idea, we will make two changes to the baseline algorithm:

• During rounds $T_1 - B + 1, \dots, T_1$, we learn a relationship $\psi$ from $(\mathbf{x}_{T_1-B+1}^{S_1}, \mathbf{x}_{T_1-B+1}^{S_2}), \dots, (\mathbf{x}_{T_1}^{S_1}, \mathbf{x}_{T_1}^{S_2})$.

• From rounds $t > T_1$, we keep updating $\mathbf{w}_{1,t}$ using the recovered data $\psi(\mathbf{x}_t^{S_2})$, and predict the target by utilizing the predictions of $\mathbf{w}_{1,t}$ and $\mathbf{w}_{2,t}$.

In round $t > T_1$, the learner can calculate two base predictions based on models $\mathbf{w}_{1,t}$ and $\mathbf{w}_{2,t}$: $f_{1,t} = \mathbf{w}_{1,t}^\top \psi(\mathbf{x}_t^{S_2})$ and $f_{2,t} = \mathbf{w}_{2,t}^\top \mathbf{x}_t^{S_2}$. By utilizing the two base predictions in each round, we propose two methods, both of which are able to follow the better base prediction empirically and theoretically. The process of obtaining the relationship mapping $\psi$ and $\mathbf{w}_{1,T_1}$ during rounds $1, \dots, T_1$ is summarized in Algorithm 1.

Algorithm 2 FESL-c(ombination)
1: Initialize $\psi$ and $\mathbf{w}_{1,T_1}$ during rounds $1, \dots, T_1$ using Algorithm 1;
2: $\alpha_{1,T_1} = \alpha_{2,T_1} = \frac{1}{2}$;
3: Initialize $\mathbf{w}_{2,T_1+1}$ randomly and $\mathbf{w}_{1,T_1+1}$ by $\mathbf{w}_{1,T_1}$;
4: for $t = T_1+1, T_1+2, \dots, T_1+T_2$ do
5:   Receive $\mathbf{x}_t^{S_2} \in \mathbb{R}^{d_2}$ and predict $f_{1,t} = \mathbf{w}_{1,t}^\top \psi(\mathbf{x}_t^{S_2})$ and $f_{2,t} = \mathbf{w}_{2,t}^\top \mathbf{x}_t^{S_2}$;
6:   Predict $\hat{p}_t \in \mathbb{R}$ using (2), then receive the target $y_t \in \mathbb{R}$, and suffer loss $\ell(\hat{p}_t, y_t)$;
7:   Update the weights using (3) where $\eta = \sqrt{8(\ln 2)/T_2}$;
8:   Update $\mathbf{w}_{1,t}$ and $\mathbf{w}_{2,t}$ using (4) and (1) respectively, where $\tau_t = 1/\sqrt{t - T_1}$;

4.1 Weighted Combination

We first propose an ensemble method that combines the two base predictions with weights based on the exponential of the cumulative loss [5]. The prediction at time $t$ is the weighted average of the base predictions:

$$\hat{p}_t = \alpha_{1,t} f_{1,t} + \alpha_{2,t} f_{2,t} \qquad (2)$$

where $\alpha_{i,t}$ is the weight of the $i$-th base prediction. 
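The exponential weighting of losses [5] that drives this combination can be sketched as follows (a minimal illustration of ours; the function names and the two-model specialization are assumptions):

```python
import numpy as np

def combine(f1, f2, a1, a2):
    """Weighted-average prediction, cf. (2): p = a1*f1 + a2*f2 with a1 + a2 = 1."""
    return a1 * f1 + a2 * f2

def update_weights(a1, a2, l1, l2, eta):
    """Multiply each weight by exp(-eta * loss) and renormalize, so a
    model that just suffered a large loss loses weight exponentially fast."""
    v1 = a1 * np.exp(-eta * l1)
    v2 = a2 * np.exp(-eta * l2)
    z = v1 + v2
    return v1 / z, v2 / z
```

Iterating `update_weights` round by round makes each weight proportional to the exponential of that model's negative cumulative loss.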
With the previous loss of each base model, we can update the weights of the two base models as follows:

$$\alpha_{i,t+1} = \frac{\alpha_{i,t}\, e^{-\eta \ell(f_{i,t}, y_t)}}{\sum_{j=1}^{2} \alpha_{j,t}\, e^{-\eta \ell(f_{j,t}, y_t)}}, \quad i = 1, 2, \qquad (3)$$

where $\eta$ is a tuned parameter. The updating rule shows that if the loss of one of the models on the previous round is large, then its weight decreases at an exponential rate in the next round, which is reasonable and yields the theoretical result shown in Theorem 1. Algorithm 2 summarizes our first approach for FESL, named FESL-c(ombination). We first learn a model $\mathbf{w}_{1,T_1}$ using online gradient descent on rounds $1, \dots, T_1$, during which we also learn a relationship $\psi$ for $t = T_1 - B + 1, \dots, T_1$. For $t = T_1 + 1, \dots, T_1 + T_2$, we learn a model $\mathbf{w}_{2,t}$ on each round and keep updating $\mathbf{w}_{1,t}$ on the recovered data $\psi(\mathbf{x}_t^{S_2})$ as shown in (4), where $\tau_t$ is a time-varying step size:

$$\mathbf{w}_{1,t+1} = \Pi_{\Omega_1}\big(\mathbf{w}_{1,t} - \tau_t \nabla \ell(\mathbf{w}_{1,t}^\top \psi(\mathbf{x}_t^{S_2}), y_t)\big). \qquad (4)$$

Then we combine the predictions of the two models with the weights calculated in (3).

Analysis. In this paragraph, we borrow the notion of regret from online learning to measure the performance of FESL-c. Specifically, we give a loss bound which shows that the performance will be improved with the assistance of the old feature space. For the sake of soundness, we put the proofs of our theorems in the supplementary file. We define $L_{S_1}$ and $L_{S_2}$ as the cumulative losses suffered by the base models on rounds $T_1 + 1, \dots, T_1 + T_2$,

$$L_{S_1} = \sum_{t=T_1+1}^{T_1+T_2} \ell(f_{1,t}, y_t), \quad L_{S_2} = \sum_{t=T_1+1}^{T_1+T_2} \ell(f_{2,t}, y_t), \qquad (5)$$

and $L_{S_{12}}$ as the cumulative loss suffered by our method: $L_{S_{12}} = \sum_{t=T_1+1}^{T_1+T_2} \ell(\hat{p}_t, y_t)$. Then we have:

Theorem 1. Assume that the loss function $\ell$ is convex in its first argument and takes values in $[0,1]$. 
For all $T_2 > 1$ and for all $y_t \in Y$ with $t = T_1+1, \dots, T_1+T_2$, $L_{S_{12}}$ with parameter $\eta = \sqrt{8(\ln 2)/T_2}$ satisfies

$$L_{S_{12}} \le \min(L_{S_1}, L_{S_2}) + \sqrt{(T_2/2)\ln 2}. \qquad (6)$$

This theorem implies that the cumulative loss $L_{S_{12}}$ of Algorithm 2 over rounds $T_1+1, \dots, T_1+T_2$ is comparable to the minimum of $L_{S_1}$ and $L_{S_2}$. Furthermore, define $C = \sqrt{(T_2/2)\ln 2}$. If $L_{S_2} - L_{S_1} > C$, it is easy to verify that $L_{S_{12}}$ is smaller than $L_{S_2}$. In summary, on rounds $T_1+1, \dots, T_1+T_2$, when $\mathbf{w}_{1,t}$ is better than $\mathbf{w}_{2,t}$ to a certain degree, the model with assistance from $S_1$ is better than that without assistance.

Algorithm 3 FESL-s(election)
1: Initialize $\psi$ and $\mathbf{w}_{1,T_1}$ during rounds $1, \dots, T_1$ using Algorithm 1;
2: $\alpha_{1,T_1} = \alpha_{2,T_1} = \frac{1}{2}$;
3: Initialize $\mathbf{w}_{2,T_1+1}$ randomly and $\mathbf{w}_{1,T_1+1}$ by $\mathbf{w}_{1,T_1}$;
4: for $t = T_1+1, T_1+2, \dots, T_1+T_2$ do
5:   Receive $\mathbf{x}_t^{S_2} \in \mathbb{R}^{d_2}$ and predict $f_{1,t} = \mathbf{w}_{1,t}^\top \psi(\mathbf{x}_t^{S_2})$ and $f_{2,t} = \mathbf{w}_{2,t}^\top \mathbf{x}_t^{S_2}$;
6:   Draw a model $\mathbf{w}_{i,t}$ according to the distribution (7) and predict $\hat{p}_t = f_{i,t}$ accordingly;
7:   Receive the target $y_t \in \mathbb{R}$, and suffer loss $\ell(\hat{p}_t, y_t)$; update the weights using (8);
8:   Update $\mathbf{w}_{1,t}$ and $\mathbf{w}_{2,t}$ using (4) and (1) respectively, where $\tau_t = 1/\sqrt{t - T_1}$.

4.2 Dynamic Selection

The combination approach in the above subsection combines several base models to improve the overall performance. Generally, a combination of several classifiers performs better than selecting only a single classifier [36]. However, this requires that the performance of the base models not be too bad; for example, in AdaBoost the accuracy of each base classifier should be no less than 0.5 [8]. Nevertheless, in our FESL problem, on rounds $T_1+1, \dots, T_1+T_2$, $\mathbf{w}_{2,t}$ cannot satisfy this requirement in the beginning due to insufficient training data, and $\mathbf{w}_{1,t}$ may become worse as more and more data come, owing to an accumulation of recovery error. Thus, it may not be appropriate to combine the two models all the time, whereas dynamically selecting the best single model may be a better choice. Hence we propose a method based on a new strategy, i.e., dynamic selection, similar to Dynamic Classifier Selection [36], which uses only the best single model rather than combining both of them in each round. Note that, though we only select one of the models, we retain and utilize both of them to update their weights, so it is still an ensemble method. The basic idea of dynamic selection is to select the model with larger weight with higher probability. Algorithm 3 summarizes our second approach for FESL, named FESL-s(election). Specifically, the steps of Algorithm 3 on rounds $1, \dots, T_1$ are the same as those of Algorithm 2. For $t = T_1+1, \dots, T_1+T_2$, we still update the weights of each model. 
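The selection strategy can be sketched as follows: sample a model in proportion to its weight, and update the weights with a mixing step in the spirit of the fixed-share update of [5], so that a model whose weight has collapsed can recover quickly when the best model switches. This is a sketch of ours; the function names and the generic number of models are assumptions.

```python
import numpy as np

def select_model(weights, rng):
    """Draw the index of the model to follow, proportionally to its weight."""
    p = weights / weights.sum()
    return rng.choice(len(weights), p=p)

def fixed_share_update(weights, losses, eta, delta):
    """Exponential loss update followed by delta-mixing: each weight keeps
    a (1 - delta) share of its own mass and receives an equal share of the
    total, so no model's weight can decay to zero permanently."""
    v = weights * np.exp(-eta * losses)
    total = v.sum()
    return delta * total / len(weights) + (1.0 - delta) * v
```

The mixing parameter `delta` trades off tracking speed against the extra loss paid while both models keep a floor of weight.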
However, when making predictions we do not combine both models' predictions; we adopt the prediction of the "best" model according to the distribution over their weights,

$$p_{i,t} = \frac{\alpha_{i,t-1}}{\sum_{j=1}^{2} \alpha_{j,t-1}}, \quad i = 1, 2. \qquad (7)$$

To track the best model, we have a different way of updating the weights, given as follows [5]:

$$v_{i,t} = \alpha_{i,t-1}\, e^{-\eta \ell(f_{i,t}, y_t)}, \qquad \alpha_{i,t} = \delta\, \frac{W_t}{2} + (1-\delta)\, v_{i,t}, \quad i = 1, 2, \qquad (8)$$

where we define $W_t = v_{1,t} + v_{2,t}$, $\delta = 1/(T_2-1)$, $\eta = \sqrt{(8/T_2)\big(2\ln 2 + (T_2-1)H(1/(T_2-1))\big)}$, and $H(x) = -x\ln x - (1-x)\ln(1-x)$ is the binary entropy function defined for $x \in (0,1)$.

Analysis. From rounds $t > T_1$, the first model $\mathbf{w}_{1,t}$ will gradually become worse due to the cumulative recovery error, while the second model will become better thanks to the large amount of incoming data. Since $\mathbf{w}_{1,t}$ is initialized by $\mathbf{w}_{1,T_1}$, which was learned from the old feature space, and $\mathbf{w}_{2,t}$ is initialized randomly, it is reasonable to assume that $\mathbf{w}_{1,t}$ is better than $\mathbf{w}_{2,t}$ in the beginning, but inferior to $\mathbf{w}_{2,t}$ after a sufficiently large number of rounds. Let $s$ be the round after which $\mathbf{w}_{1,t}$ is worse than $\mathbf{w}_{2,t}$. Defining

$$L_s = \sum_{t=T_1+1}^{s} \ell(f_{1,t}, y_t) + \sum_{t=s+1}^{T_1+T_2} \ell(f_{2,t}, y_t),$$

we can verify that

$$\min_{T_1+1 \le s \le T_1+T_2} L_s \le \min_{i=1,2} L_{S_i}. \qquad (9)$$

Then a more ambitious goal is to compare the proposed algorithm against $\mathbf{w}_{1,t}$ from rounds $T_1+1$ to $s$, and against $\mathbf{w}_{2,t}$ from rounds $s$ to $T_1+T_2$, which motivates us to study the performance measure $L_{S_{12}} - L_s$. Because the exact value of $s$ is generally unknown, we need to bound the worst-case $L_{S_{12}} - \min_{T_1+1 \le s \le T_1+T_2} L_s$. An upper bound of $L_{S_{12}}$ is given as follows.

Theorem 2. 
For all $T_2 > 1$, if the model is run with parameters $\delta = 1/(T_2-1)$ and $\eta = \sqrt{(8/T_2)\big(2\ln 2 + (T_2-1)H(1/(T_2-1))\big)}$, then

$$L_{S_{12}} \le \min_{T_1+1 \le s \le T_1+T_2} L_s + \sqrt{\frac{T_2}{2}\Big(2\ln 2 + \frac{H(\delta)}{\delta}\Big)} \qquad (10)$$

where $H(x) = -x\ln x - (1-x)\ln(1-x)$ is the binary entropy function.

According to Theorem 2, we know that $L_{S_{12}}$ is comparable to $\min_{T_1+1 \le s \le T_1+T_2} L_s$. Due to (9), we can conclude that the upper bound of $L_{S_{12}}$ in Algorithm 3 is tighter than that of Algorithm 2.

Table 1: Detailed description of the datasets: let $n$ be the number of examples, and $d_1$ and $d_2$ denote the dimensionality of the first and second feature space, respectively. The first 9 datasets in the left column are synthetic datasets, "r.EN-GR" means the dataset EN-GR comes from Reuter, and "RFID" is the real dataset.

Dataset     n      d1   d2    Dataset   n       d1      d2      Dataset   n       d1      d2
Australian  690    42   29    r.EN-FR   18,758  21,531  24,892  r.GR-IT   29,953  34,279  15,505
Credit-a    653    15   10    r.EN-GR   18,758  21,531  34,215  r.GR-SP   29,953  34,279  11,547
Credit-g    1,000  20   14    r.EN-IT   18,758  21,531  15,506  r.IT-EN   24,039  15,506  21,517
Diabetes    768    8    5     r.EN-SP   18,758  21,531  11,547  r.IT-FR   24,039  15,506  24,892
DNA         940    180  125   r.FR-EN   26,648  24,893  21,531  r.IT-GR   24,039  15,506  34,278
German      1,000  59   41    r.FR-GR   26,648  24,893  34,287  r.IT-SP   24,039  15,506  11,547
Kr-vs-kp    3,196  36   25    r.FR-IT   26,648  24,893  15,503  r.SP-EN   12,342  11,547  21,530
Splice      3,175  60   42    r.FR-SP   26,648  24,893  11,547  r.SP-FR   12,342  11,547  24,892
Svmguide3   1,284  22   15    r.GR-EN   29,953  34,279  21,531  r.SP-GR   12,342  11,547  34,262
RFID        940    78   72    r.GR-FR   29,953  34,279  24,892  r.SP-IT   12,342  11,547  15,500

5 Experiments

In this section, we first introduce the datasets we use. 
We want to emphasize that we collected one real dataset ourselves, since our feature-evolving setting is relatively novel and the required datasets are not yet widely available. Then we introduce the compared methods and settings. Finally, the experimental results are given.

5.1 Datasets

We conduct our experiments on 30 datasets, consisting of 9 synthetic datasets, 20 Reuter datasets and 1 real dataset. To generate synthetic data, we randomly choose some datasets from different domains, including economy, biology, etc.,¹ whose scales vary from 690 to 3,196. They have only one feature space at first. We artificially map the original datasets into another feature space by random Gaussian matrices, so that we have data from both feature space $S_1$ and $S_2$. Since the original data are in batch mode, we manually make them arrive sequentially. In this way, the synthetic data are completely generated. We also conduct our experiments on 20 datasets from Reuter [3]. They are multi-view datasets with large scales varying from 12,342 to 29,953. Each dataset has two views which represent two different languages, respectively. We regard the two views as the two feature spaces. These datasets do have two feature spaces, but the original data are in batch mode, so we artificially make them arrive in a streaming way.

We use the RFID technique to collect the real data, which contain 450 instances from $S_1$ and $S_2$, respectively. The RFID technique is widely used for moving-goods detection [31]. In our case, we want to utilize the RFID technique to predict the coordinates of moving goods attached with RFID tags. Concretely, we arranged several RFID aerials around an indoor area. In each round, each RFID aerial received the tag signals while the goods with the tag moved, and at the same time we recorded the goods' coordinates. Before the aerials expired, we arranged new aerials beside the old ones to avoid a situation with no aerials. 
So in this overlapping period, we have data from both the old and new feature spaces. After the old aerials expired, we continued to use the new ones to receive signals, and then we only have data from feature space $S_2$. The RFID data we collected therefore fully satisfy our assumptions. The details of all the datasets we use are presented in Table 1.

5.2 Compared Approaches and Settings

We compare our FESL-c and FESL-s with three approaches. One is mentioned in Section 3: once the feature space changes, the online gradient descent algorithm is invoked from scratch; we name it NOGD (Naive Online Gradient Descent). The other two approaches utilize the model learned from feature space $S_1$ by online gradient descent to make predictions on the recovered data. The difference between them is that one keeps updating with the recovered data while the other does not. The one which keeps updating is called Updating Recovered Online Gradient Descent (ROGD-u) and the one which stays fixed is called Fixed Recovered Online Gradient Descent (ROGD-f). We evaluate the empirical performance of the proposed approaches on classification and regression tasks on rounds $T_1+1, \dots, T_1+T_2$. To verify that our analysis is reasonable, we present the trend of the average cumulative loss. 

¹Datasets can be found in http://archive.ics.uci.edu/ml/.

Figure 3: The trend of loss with the three baseline methods and the proposed methods: (a) australian, (b) credit-a, (c) credit-g, (d) diabetes, (e) r.EN-SP, (f) r.FR-SP, (g) r.GR-EN, (h) r.IT-FR, (i) RFID. The smaller the cumulative loss, the better. The average cumulative loss of our methods at any time is comparable to the best of the baseline methods, and on 8 of the 9 datasets it is smaller.

Concretely, at each time t(cid:48), the loss \u00af(cid:96)t(cid:48) of every method is the average of\nt=1 (cid:96)t. We also present the classi\ufb01cation\nperformance over all instances on rounds T1 + 1, . . . , T1 + T2 on synthetic and Reuter data. The\nperformances of all approaches are obtained by average results over 10 independent runs on synthetic\ndata. Due to the large scale of Reuter data, we only conduct 3 independent runs on Reuter data and\nreport the average results.\nThe parameters we need to set are the number of instances in overlapping period, i.e., B, the number\nof instances in S1 and S2, i.e., T1 and T2 and the step size, i.e., \u03c4t where t is time. For all baseline\nmethods and our methods, the parameters are the same. In our experiments, we set B 5 or 10 for\n\u221a\nsynthetic data, 50 for Reuter data and 40 for RFID data. We set almost T1 and T2 to be half of the\nt) where c is searched in the range {1, 10, 50, 100, 150}.\nnumber of instances, and \u03c4t to be 1/(c\nThe detailed setting of c in \u03c4t for each dataset is presented in supplementary \ufb01le.\n\n5.3 Results\n\nHere we only present part of the loss trend results, and other results are presented in the supplementary\n\ufb01le. Figure 3 gives the trend of average cumulative loss. (a-d) are the results on synthetic data, (e-h)\nare the results on Reuter data, (i) is the result of the real data. The smaller the average cumulative loss,\nthe better. From the experimental results, we have the following observations. First, all the curves\nwith circle marks representing NOGD decrease rapidly which conforms to the fact that NOGD on\nrounds T1 + 1, . . . , T1 + T2 becomes better and better with more and more data coming. Besides,\nthe curves with star marks representing ROGD-u also decline but not very apparent since on rounds\n1, . . . , T1, ROGD-u already learned well and tend to converge, so updating with more recovered data\ncould not bring too much bene\ufb01ts. 
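For reference, the quantity plotted in Figure 3 and the step-size schedule above can be sketched as follows. This is an illustrative helper, not the authors' code; the loss values and the choice c = 10 are placeholders.

```python
import math

def average_cumulative_loss(losses):
    """Running average of per-round losses: the t'-th entry is (1/t') * sum of the first t' losses."""
    avg, total = [], 0.0
    for t, loss in enumerate(losses, start=1):
        total += loss
        avg.append(total / t)
    return avg

def step_size(t, c=10):
    """OGD step size 1 / (c * sqrt(t)); c is searched in {1, 10, 50, 100, 150}."""
    return 1.0 / (c * math.sqrt(t))

# Example: per-round losses of some method over four rounds.
avg = average_cumulative_loss([1.0, 0.5, 0.25, 0.25])
```

Each curve in Figure 3 is such a running average computed over rounds T1 + 1, . . . , T1 + T2.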
Moreover, the curves with plus marks, representing ROGD-f, do not drop but even rise, which is also reasonable: the model is fixed, so any recovery errors make it perform worse over time. Lastly, our methods are based on NOGD and ROGD-u, so their average cumulative losses also decrease. As can be seen from Figure 3, the average cumulative losses of our methods are comparable to the best of the baseline methods on all datasets and are smaller on 8 of them. In addition, FESL-s exhibits a slightly smaller average cumulative loss than FESL-c. Note that NOGD is always worse than ROGD-u on the synthetic and real data, while on the Reuter data NOGD becomes better than ROGD-u after a few rounds. This is because on the synthetic and real data there are not enough rounds for all methods to converge, whereas on the Reuter data the large number of instances ensures the convergence of every method. Once all methods converge, NOGD is better than the other baseline methods, since it always receives the real instances, while ROGD-u and ROGD-f receive recovered instances that may contain recovery errors. As can be seen from (e-h), in the first few rounds our methods are comparable to ROGD-u; when NOGD becomes better than ROGD-u, our methods are comparable to NOGD, which shows that our methods

Table 2: Accuracy with its variance on synthetic datasets and Reuter datasets. The larger, the better. 
The best ones among all the methods are in bold.

Dataset     NOGD       ROGD-u     ROGD-f     FESL-c     FESL-s
australian  .767±.009  .849±.009  .809±.025  .849±.009  .849±.009
credit-a    .811±.006  .826±.018  .785±.051  .827±.014  .831±.009
credit-g    .659±.010  .733±.006  .716±.011  .733±.006  .733±.006
diabetes    .650±.002  .652±.009  .651±.006  .652±.007  .652±.009
dna         .610±.013  .691±.023  .608±.064  .691±.023  .692±.021
german      .684±.006  .700±.002  .700±.002  .700±.001  .703±.004
kr-vs-kp    .612±.005  .621±.036  .538±.024  .626±.028  .630±.016
splice      .568±.005  .612±.022  .567±.057  .612±.022  .612±.022
svmguide3   .680±.010  .779±.010  .748±.012  .779±.010  .778±.010
r.EN-FR     .902±.004  .849±.003  .769±.069  .903±.003  .902±.005
r.EN-GR     .867±.005  .836±.007  .802±.036  .870±.002  .870±.003
r.EN-IT     .858±.014  .847±.014  .831±.018  .861±.010  .863±.013
r.EN-SP     .900±.002  .848±.002  .825±.001  .901±.001  .899±.002
r.FR-EN     .858±.007  .776±.009  .754±.012  .858±.007  .858±.007
r.FR-GR     .869±.004  .774±.019  .753±.021  .870±.004  .868±.003
r.FR-IT     .874±.005  .780±.022  .744±.040  .874±.005  .873±.005
r.FR-SP     .872±.001  .778±.022  .735±.013  .872±.001  .871±.002
r.GR-EN     .907±.000  .850±.007  .801±.035  .907±.001  .906±.000
r.GR-FR     .898±.001  .827±.009  .802±.023  .898±.001  .898±.000
r.GR-IT     .847±.011  .851±.017  .816±.006  .850±.018  .851±.017
r.GR-SP     .902±.001  .845±.003  .797±.012  .902±.001  .902±.001
r.IT-EN     .854±.003  .760±.006  .730±.024  .856±.002  .854±.003
r.IT-FR     .863±.002  .753±.012  .730±.020  .864±.002  .862±.003
r.IT-GR     .849±.004  .736±.022  .702±.012  .849±.004  .846±.004
r.IT-SP     .839±.006  .753±.014  .726±.005  .839±.007  .839±.006
r.SP-EN     .926±.002  .860±.005  .814±.021  .926±.002  .924±.001
r.SP-FR     .876±.005  .873±.017  .833±.042  .876±.014  .878±.012
r.SP-GR     .871±.013  .827±.025  .810±.026  .873±.013  .873±.013
r.SP-IT     .928±.002  .861±.005  .826±.005  .928±.003  .927±.002

are comparable to the best one all the time. Moreover, FESL-s performs worse than FESL-c at the beginning, while afterwards it becomes slightly better than FESL-c.

Table 2 shows the accuracy results on the synthetic and Reuter datasets. For the synthetic datasets, FESL-s outperforms the other methods on 8 datasets, FESL-c is the best on 5 datasets, and ROGD-u also on 5. NOGD performs worst since it starts from scratch. ROGD-u is better than NOGD and ROGD-f because it exploits the better-trained model from the old feature space and keeps updating with recovered instances. Our two methods are based on NOGD and ROGD-u, and we can see that they follow the best baseline method or even outperform it. For the Reuter datasets, FESL-c outperforms the other methods on 17 datasets, FESL-s is the best on 9 datasets, NOGD on 8, and ROGD-u on 1. In the Reuter datasets, the period on the new feature space is longer than in the synthetic datasets, so NOGD can update itself into a good model, whereas ROGD-u updates itself with recovered data, so its model deteriorates as recovery errors accumulate. ROGD-f does not update itself at all and thus performs worst. 
Our two methods take advantage of NOGD and ROGD-u and perform better than them.

6 Conclusion

In this paper, we focus on a new setting: feature evolvable streaming learning. Our key observation is that in learning with streaming data, old features can vanish and new ones can occur. To make the problem tractable, we assume there is an overlapping period that contains samples from both feature spaces. We then learn a mapping from new features to old features, so that both the new and the old models can be used for prediction. In our first approach, FESL-c, we ensemble the two predictions by learning weights adaptively. Theoretical results show that the assistance of the old feature space can improve the performance of learning with streaming data. Furthermore, we propose FESL-s to dynamically select the best model, with a better performance guarantee.

Acknowledgement This research was supported by NSFC (61333014, 61603177), JiangsuSF (BK20160658), Huawei Fund (YBN2017030027) and Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18:577-589, 2006.

[2] C. C. Aggarwal. Data streams: An overview and scientific applications. In Scientific Data Mining and Knowledge Discovery - Principles and Foundations, pages 377-397. Springer, 2010.

[3] M.-R. Amini, N. Usunier, and C. Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22, pages 28-36, 2009.

[4] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. Journal of Machine Learning Research, 11:1601-1604, 2010.

[5] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. 
Cambridge University Press, 2006.

[6] J. de Andrade Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama. Data stream clustering: A survey. ACM Computing Surveys.

[7] P. M. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71-80, 2000.

[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.

[9] M. M. Gaber, A. B. Zaslavsky, and S. Krishnaswamy. Mining data streams: A review. SIGMOD Record, 34:18-26, 2005.

[10] J. Gama and P. P. Rodrigues. An overview on mining data streams. In Foundations of Computational Intelligence, pages 29-45. Springer, 2009.

[11] S. U. Guan and S. Li. Incremental learning with respect to new incoming input attributes. Neural Processing Letters, 14:241-260, 2001.

[12] S. Hashemi, Y. Yang, Z. Mirzamomen, and M. R. Kangavari. Adapted one-versus-all decision trees for data stream classification. IEEE Transactions on Knowledge and Data Engineering, 21:624-637, 2009.

[13] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169-192, 2007.

[14] S. Hoi, J. Wang, and P. Zhao. LIBOL: A library for online learning algorithms. Journal of Machine Learning Research, 15:495-499, 2014.

[15] C. Hou and Z.-H. Zhou. One-pass learning with incremental and decremental features. ArXiv e-prints, arXiv:1605.09082, 2016.

[16] B. M. Golam Kibria. Bayesian statistics and marketing. Technometrics, 49:230, 2007.

[17] D. Leite, P. Costa Jr., and F. Gomide. Evolving granular classification neural networks. In Proceedings of International Joint Conference on Neural Networks 2009, pages 1736-1743, 2009.

[18] S.-Y. 
Li, Y. Jiang, and Z.-H. Zhou. Partial multi-view clustering. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1968-1974, 2014.

[19] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, pages 435-442, 2002.

[20] H.-L. Nguyen, Y.-K. Woon, W. K. Ng, and L. Wan. Heterogeneous ensemble for feature drifts in data streams. In Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 1-12, 2012.

[21] H.-L. Nguyen, Y.-K. Woon, and W. K. Ng. A survey on data stream clustering and classification. Knowledge and Information Systems, 45:535-569, 2015.

[22] N. C. Oza. Online bagging and boosting. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics 2005, pages 2340-2345, 2005.

[23] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345-1359, 2010.

[24] R. Raina, A. Battle, H. Lee, B. Packer, and A. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759-766, 2007.

[25] J. Read, A. Bifet, G. Holmes, and B. Pfahringer. Streaming multi-label classification. In Proceedings of the 2nd Workshop on Applications of Pattern Analysis, pages 19-25, 2011.

[26] K. Samina, K. Tehmina, and N. Shamila. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of Science and Information Conference 2014, pages 372-378, 2014.

[27] T. Seidl, I. Assent, P. Kranen, R. Krieger, and J. Herrmann. Indexing density models for incremental learning and anytime classification on data streams. 
In Proceedings of the 12th International Conference on Extending Database Technology, pages 311-322, 2009.

[28] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4:107-194, 2012.

[29] I. W. Tsang, A. Kocsor, and J. T. Kwok. Simpler core vector machines with enclosing balls. In Proceedings of the 24th International Conference on Machine Learning, pages 911-918, 2007.

[30] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226-235, 2003.

[31] C. Wang, L. Xie, W. Wang, T. Xue, and S. Lu. Moving tag detection via physical layer analysis for large-scale RFID systems. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications, pages 1-9, 2016.

[32] C. Xu, D. Tao, and C. Xu. A survey on multi-view learning. ArXiv e-prints, arXiv:1304.5634, 2013.

[33] P. Zhang, J. Li, P. Wang, B. J. Gao, X. Zhu, and L. Guo. Enabling fast prediction for ensemble models on data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 177-185, 2011.

[34] P. Zhao, S. Hoi, J. Wang, and B. Li. Online transfer learning. Artificial Intelligence, 216:76-102, 2014.

[35] G. Zhou, K. Sohn, and H. Lee. Online incremental feature learning with denoising autoencoders. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1453-1461, 2012.

[36] Z.-H. Zhou. Ensemble methods: Foundations and algorithms. CRC press, 2012.

[37] Z.-H. Zhou. Learnware: On the future of machine learning. Frontiers of Computer Science, 10:589-590, 2016.

[38] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 
In Proceedings of the 20th International Conference on Machine Learning, pages 928-936, 2003.