{"title": "Adaptive Access Control Applied to Ethernet Data", "book": "Advances in Neural Information Processing Systems", "page_first": 932, "page_last": 938, "abstract": null, "full_text": "Adaptive Access Control Applied to Ethernet Data \n\nTimothy X Brown \n\nDept. of Electrical and Computer Engineering \n\nUniversity of Colorado, Boulder, CO 80309-0530 \n\ntimxb@colorado.edu \n\nAbstract \n\nThis paper presents a method that decides which combinations of traffic \ncan be accepted on a packet data link, so that quality of service (QoS) \nconstraints can be met. The method uses samples of QoS results at dif(cid:173)\nferent load conditions to build a neural network decision function. Pre(cid:173)\nvious similar approaches to the problem have a significant bias. This \nbias is likely to occur in any real system and results in accepting loads \nthat miss QoS targets by orders of magnitude. Preprocessing the data to \neither remove the bias or provide a confidence level, the method was \napplied to sources based on difficult-to-analyze ethernet data traces. \nWith this data, the method produces an accurate access control function \nthat dramatically outperforms analytic alternatives. Interestingly, the \nresults depend on throwing away more than 99% of the data. \n\n1 INTRODUCTION \nIn a communication network in which traffic sources can be dynamically added or \nremoved, an access controller must decide when to accept or reject a new traffic source \nbased on whether, if added, acceptable service would be given to all carried sources. \nUnlike best-effort services such as the internet, we consider the case where traffic sources \nare given quality of service (QoS) guarantees such as maximum delay, delay variation, or \nloss rate. The goal of the controller is to accept the maximal number of users while guar(cid:173)\nanteeing QoS. To accommodate diverse sources such as constant bit rate voice, variable(cid:173)\nrate video, and bursty computer data, packet-based protocols are used. We consider QOS \nin terms of lost packets (Le. packets discarded due to resource overloads). This is broadly \napplicable (e.g. packets which violate delay guarantees can be considered lost) although \nsome QoS measures can not fit this model. \n\nThe access control task requires a classification function-analytically or empirically \nderived-that specifies what conditions will result in QoS not being met. Analytic func(cid:173)\ntions have been successful only on simple traffic models [Gue91], or they are so conserva(cid:173)\ntive that they grossly under utilize the network. This paper describes a neural network \nmethod that adapts an access control function based on historical data on what conditions \npackets have and have not been successfully carried. Neural based solutions have been \npreviously applied to the access control problem [Hir90][Tra92] [Est94], but these \n\n\fAdaptive Access Control Applied to Ethernet Data \n\n933 \n\napproaches have a distinct bias that under real-world conditions leads to accepting combi(cid:173)\nnations of calls that miss QoS targets by orders of magnitude. Incorporating preprocessing \nmethods to eliminate this bias is critical and two methods from earlier work will be \ndescribed. The combined data preprocessing and neural methods are applied to difficult(cid:173)\nto-model ethernet traffic. \n2 THE PROBLEM \nSince the decision to accept a multilink connection can be decomposed into decisions on \nthe individual links, we consider only a single link. 
Load combinations are described by a feature vector, φ, consisting of load types and possibly other information such as time of day. Each feature vector φ has an associated loss rate, p(φ), which cannot be measured directly. Therefore, the goal is to have a classifier function, C(φ), such that C(φ) >, <, = 0 if p(φ) <, >, = p*.

Since analytic C(φ) are not in general available, we look to statistical classification methods. This requires training samples, a desired output for each sample, and a significance or weight for each sample. Loads can be dynamically added or removed. Training samples are generated at load transitions, with the information since the last transition consisting of the number of packet arrivals, T, the number of lost packets, s, and the feature vector, φ.

A sample (φ_i, s_i, T_i) requires a desired classification, d(φ_i, s_i, T_i) ∈ {+1, -1}, and a weight, w(φ_i, s_i, T_i) ∈ (0, ∞). Given a data set {(φ_i, s_i, T_i)}, a classifier, C, is then chosen that minimizes the weighted sum squared error

E = Σ_i w(φ_i, s_i, T_i) (C(φ_i) - d(φ_i, s_i, T_i))².

A classifier with enough degrees of freedom will set C(φ_i) = d(φ_i, s_i, T_i) if all the φ_i are different. With multiple samples at the same φ, the error is minimized when

C(φ) = ( Σ_{i: φ_i = φ} w(φ_i, s_i, T_i) d(φ_i, s_i, T_i) ) / ( Σ_{i: φ_i = φ} w(φ_i, s_i, T_i) ).    (1)

Thus, the optimal C(φ) is the weighted average of the d(φ_i, s_i, T_i) at φ. If the classifier has fewer degrees of freedom (e.g., a low-dimensional linear classifier), C(φ) will be the average of the d(φ_i, s_i, T_i) in the neighborhood of φ, where the neighborhood is, in general, an unspecified function of the classifier.

A more direct form of averaging would be to choose a specific neighborhood around φ and average over the samples in this neighborhood. This suffers from having to store all the samples in the decision mechanism, and it incurs a significant computational burden. More significant is how to decide the size of the neighborhood. If it is fixed, sparse regions may contain no samples at all, while in dense regions near decision boundaries it may average over too wide a range for accurate estimates. Dynamically setting the neighborhood so that it always contains the k nearest neighbors solves this problem, but does not account for the size of the samples. We will return to this in Section 4.
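As a concrete illustration of the neighborhood averaging just described, here is a minimal sketch of a weighted k-nearest-neighbor estimate of C(φ); the function name, the Euclidean metric, and the default k = 50 are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def knn_decision(phi, phis, d, w, k=50):
    """Weighted average of desired outputs d over the k samples nearest phi.

    phis: (N, F) array of sample feature vectors phi_i
    d:    (N,) array of desired classifications in {+1, -1}
    w:    (N,) array of sample weights in (0, inf)
    Returns the neighborhood analogue of Eq. (1); accept the load if > 0.
    """
    dist = np.linalg.norm(phis - phi, axis=1)   # Euclidean distance (illustrative)
    idx = np.argsort(dist)[:k]                  # indices of the k nearest samples
    return float(np.dot(w[idx], d[idx]) / np.sum(w[idx]))
```

Note that this stores every sample and rescans them at each decision, which is exactly the storage and computational burden noted above; a trained neural network amortizes the same averaging into a fixed decision function.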
3 THE SMALL SAMPLE PROBLEM

Neural networks have previously been applied to the access control problem [Hir90][Tra92][Est94]. In [Hir90] and [Tra92], d(φ_i, s_i, T_i) = +1 when s_i/T_i < p*, d(φ_i, s_i, T_i) = -1 otherwise, and the weighting is a uniform w(φ_i, s_i, T_i) = 1 for all i. This desired output and uniform weighting we call the normal method. For a given load combination, φ, assume an idealized system where arriving packets are lost independently with probability p(φ), and let P_B = P{s/T > p*}. Since with the normal method d(φ, s, T) = -1 if s/T > p*, P_B = P{d(φ, s, T) = -1}. From (1), with uniform weighting the decision boundary is where P_B = 0.5. If the samples are small (i.e., T < (ln 2)/p* < 1/p*), then d(φ, s, T) = -1 for all s > 0. In this case P_B = 1 - (1 - p(φ))^T. Solving for p(φ) at P_B = 0.5 using ln(1 - x) ≈ -x, the decision boundary is at p(φ) ≈ (ln 2)/T > p*. So, for small sample sizes, the normal method boundary is biased to greater than p*, and it can be made orders of magnitude larger as T becomes smaller: with T = 1000 and p* = 10^-6, for example, the boundary sits near 7 × 10^-4, almost three orders of magnitude above target. For larger T, e.g. Tp* > 10, this bias will be seen to be negligible.

One obvious solution is to have large samples. This is complicated by three effects. The first is that desired loss rates in data systems are often small, typically in the range 10^-6 to 10^-12. This implies that to be large, samples must contain at least 10^7 to 10^13 packets. For the latter, even at Gbps rates, with short packets and full loading, this translates into samples of several hours of traffic; even for the former at typical rates, it can translate into minutes of traffic. The second, related problem is that in dynamic data networks, while individual connections may last for significant periods, a given combination of loads may not persist in aggregate for the requisite period. The third, more subtle problem is that in any queueing system, even with uncorrelated arrival traffic, the buffering introduces memory into the system. A typical sample with losses may contain 100 losses, but a loss trace would show that all of the losses occurred in a single short overload interval. Thus the number of independent trials can be several orders of magnitude smaller than indicated by the raw sample size, meaning the loads must be stable for hours, days, or even years to get samples that lead to unbiased classification.

An alternative approach used in [Hir95] sets d(φ, s, T) = s/T and models p(φ) directly. The probabilities can vary over orders of magnitude, making accurate estimates difficult. Estimating the less variable log(p(φ)) with d = log(s/T) is complicated by the logarithm being undefined for the many small samples with no losses (s = 0).

4 METHODS FOR TREATING BIAS AND VARIANCE

We present without proof two preprocessing methods derived and analyzed in [Bro96]. The first eliminates the sample bias by choosing an appropriate d and w that directly solve (1) such that C(φ) >, <, = 0 if and only if p(φ) <, >, = p*; i.e., it is an unbiased estimate of whether the loss rate is above or below p*. This is the weighting method shown in Table 1. The relative weighting of samples with loss rates above and below the critical loss rate is plotted in Figure 1. For large T, as expected, it reduces to the normal method.

The second preprocessing method assigns uniform weighting, but classifies d(φ, s, T) = +1 only if a certain confidence level, L, is met that the sample represents a combination where p(φ) < p*. Such a confidence level was derived in [Bro96].

Table 1: Summary of Methods. (Columns: Method; sample class d(φ_i, s_i, T_i); weighting w(φ_i, s_i, T_i).)
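The excerpt ends before the confidence expression from [Bro96] is given, so the following sketch substitutes a standard one-sided binomial (Clopper-Pearson) upper bound, assuming independent packet losses; it illustrates the shape of the confidence method, not the paper's exact formula.

```python
import math

def loss_rate_upper_bound(s, T, L, tol=1e-12):
    """Smallest p_u such that P{Binomial(T, p_u) <= s} <= 1 - L.

    With confidence L, the true loss rate p(phi) is below p_u (a standard
    Clopper-Pearson bound, used here as an illustrative stand-in for the
    confidence level derived in [Bro96]).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:                       # bisect on p_u
        mid = (lo + hi) / 2
        cdf = sum(math.comb(T, j) * mid**j * (1.0 - mid)**(T - j)
                  for j in range(s + 1))       # P{S <= s | p = mid}
        if cdf > 1.0 - L:
            lo = mid                           # not yet confident: raise bound
        else:
            hi = mid
    return hi

def confidence_classify(s, T, p_star, L=0.95):
    """d = +1 only when we are at least L confident that p(phi) < p*."""
    return +1 if loss_rate_upper_bound(s, T, L) < p_star else -1
```

For a loss-free sample (s = 0) the bound is approximately ln(1/(1 - L))/T, so at L = 0.95 a sample shorter than about 3/p* packets can never be classified +1; small samples are effectively discarded, consistent with the paper's observation that the results depend on throwing away more than 99% of the data.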