{"title": "Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements", "book": "Advances in Neural Information Processing Systems", "page_first": 10898, "page_last": 10908, "abstract": "Identifying the top-K frequent items is one of the most common and important operations in large data processing systems. As a result, several solutions have been proposed to solve this problem approximately. In this paper, we identify that in modern distributed settings with both multi-node as well as multi-core parallelism, existing algorithms, although theoretically sound, are suboptimal from the performance perspective. In particular, for identifying top-K frequent items, Count-Min Sketch (CMS) has fantastic update time but lack the important property of reducibility which is needed for exploiting available massive data parallelism. On the other end, popular Frequent algorithm (FA) leads to reducible summaries but the update costs are significant. In this paper, we present Topkapi, a fast and parallel algorithm for finding top-K frequent items, which gives the best of both worlds, i.e., it is reducible as well as efficient update time similar to CMS. 
Topkapi possesses strong theoretical guarantees and leads to significant performance gains due to increased parallelism, relative to past work.", "full_text": "Topkapi: Parallel and Fast Sketches for Finding\n\nTop-K Frequent Elements\n\nAnkush Mandal\n\nSchool of Computer Science\n\nGeorgia Institute of Technology\n\nAtlanta, GA\n\nankush@gatech.edu\n\nAnshumali Shrivastava\n\nDepartment of Computer Science\n\nRice University\nHouston, TX\n\nanshumali@rice.edu\n\nHe Jiang\n\nDepartment of Computer Science\n\nRice University\nHouston, TX\n\ncary.jiang@rice.edu\n\nVivek Sarkar\n\nSchool of Computer Science\n\nGeorgia Institute of Technology\n\nAtlanta, GA\n\nvsarkar@gatech.edu\n\nAbstract\n\nIdentifying the top-K frequent items in a collection or data stream is one of the\nmost common and important operations in large data processing systems. As a re-\nsult, several solutions have been proposed to solve this problem approximately. We\nobserve that the existing algorithms, although theoretically sound, are suboptimal\nfrom the performance perspective because of their limitations in exploiting paral-\nlelism in modern distributed compute settings. In particular, for identifying top-K\nfrequent items, Count-Min Sketch (CMS) has an excellent update time, but lacks the\nimportant property of reducibility which is needed for exploiting available massive\ndata parallelism. On the other end, the popular Frequent algorithm (FA) leads to\nreducible summaries but its update costs are signi\ufb01cant. In this paper, we present\nTopkapi, a fast and parallel algorithm for \ufb01nding top-K frequent items, which\ngives the best of both worlds, i.e., it is reducible and has fast update time similar\nto CMS. 
Topkapi possesses strong theoretical guarantees and leads to signi\ufb01cant\nperformance gains due to increased parallelism, relative to past work.\n\n1\n\nIntroduction\n\nCounting and identifying frequently occurring items, or \u201cheavy hitters\u201d, is one of the most important\nand intuitive metrics to gain insight into large-scale data. The naive way to extract top-K items from a\ndata stream is to count the exact number of occurrences of each distinct item, then sort the histogram\nto obtain the most frequent items. This naive but popular approach suffers from a time complexity of\nO(n log n), in which n is the total number of elements in the dataset, and also a space requirement\nof O(n), assuming sorting is performed in linear space. In a distributed environment, where data\nsharding is common, the problem is quite severe. We would have to keep a local frequency histogram\non each node, which is usually of size n itself. These local histograms will need to be communicated\nacross the nodes, and followed by global merge and sort operations. Thus, each node would need\nto communicate O(n) sized histograms, which can lead to a signi\ufb01cant communication bottleneck.\nConsider the simple task of keeping track of most popular phrases, of up to 4 words, on twitter feeds.\nWith a vocabulary of over a million, the total number of items we need to keep track of becomes\nn = (106)4 = 1024. Similarly, counts of the number of clicks on \u201cAmazon.com\u201d, given speci\ufb01c\nuser\u2019s features and their combinations, in the past hour, are common in clickthrough prediction [12].\nIn general, the O(n) time complexity becomes unacceptably large for \u201cbig data\u201d.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFortunately, approximations often suf\ufb01ce in practice. 
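As a concrete reference point, the naive exact method described above (count every distinct item, then sort by frequency) can be sketched as follows. This is a minimal single-node, in-memory illustration, not the distributed variant discussed in this paper; the function name and types are ours, not the paper's:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Exact top-K: O(n) space for the histogram, and a sort over the distinct
// items; partial_sort keeps the sorting cost proportional to log K per item.
std::vector<std::pair<std::string, uint64_t>>
exact_top_k(const std::vector<std::string>& stream, size_t k) {
    std::unordered_map<std::string, uint64_t> freq;
    for (const auto& w : stream) ++freq[w];  // exact histogram

    std::vector<std::pair<std::string, uint64_t>> items(freq.begin(), freq.end());
    k = std::min(k, items.size());
    std::partial_sort(items.begin(), items.begin() + k, items.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    items.resize(k);
    return items;
}
```

In a sharded deployment, each node would have to ship its entire O(n)-sized histogram before any such sort can run, which is exactly the communication bottleneck described above.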
Frequencies in most real word applications\nfollow the Power Law [7], and therefore even approximately knowing the counts are enough to identify\nfrequent items, also known as heavy hitters, ef\ufb01ciently. This feasibility for approximations allows for\na signi\ufb01cant reduction in computational and memory requirements. As a result, approximate counting\nis a very active and widely studied research area. There has been a remarkable success in obtaining\nalgorithms for \ufb01nding heavy hitters with exponential improvements in memory requirements, and a\nlot is known about the theoretical complexity of these algorithms [3]. Several of these algorithms are\ndeployed in practice. Two notable algorithms include Count-Min Sketch (CMS) [7] which is hashing\nbased and the Frequent algorithm (FA) [10] which is based on maps (or dictionaries).\nHowever, even after 30 years of research on approximate counting over data streams, developing a\npractical algorithm that can fully utilize the massive amounts of available parallelism in the form\nof multi-core and multi-node (or distributed parallelism) is still an active area of research. Prior\nalgorithms, such as [18], only rely on the theoretical reduction in communication, but require\nsynchronized updates, for every increment, making them expensive in practice. In [2], the authors\nidentify mergeable or reducible as a critical property that eliminates the need for synchronization.\nWith the reducibility property, every node can create its summarization of the local data and transmit\nthis exponentially small summary. Each of these little sketches can be merged to obtain the global\nsummary of the data from which global heavy hitters can be identi\ufb01ed.\nIt was argued in [2] that most popular algorithms, including CMS, are not suitable for the distributed\nsetting because they lose the reducibility property, i.e., it is not possible to identify top-K by merging\nlocal top-K and their CMS summaries. 
Our experiments (section 5.2.8) con\ufb01rm the signi\ufb01cantly poor\nprecision for CMS in distributed settings. Fortunately, the same paper [2] showed that FA is reducible\nand thus suitable for distributed computing. However, FA is costly to update; an update operation\nrequires time that is linear in the size of the summary. Slow updates are also one of the main reason\nwhy CMS, despite being theoretically inferior, is preferred [7]. In contrast, CMS has only logarithmic\nupdate cost, which is desirable, but local CMS summaries cannot be combined (since they are not\nreducible). Thus, even if CMS is known to be faster than FA, it is not a suitable option in distributed\nsetting.\nTo summarize, the popular hashing based CMS has logarithmic update cost but do not have the crucial\nreducibility property required for utilizing massive parallelism. On the other hand, non-hashing\nbased FA summaries are reducible, but updates are signi\ufb01cantly costly. In this paper, we show a\ntheoretically sound and superior algorithm which combines both CMS and FA in a novel way that\nachieves the best-of-the-both worlds \u2013 logarithmic (ef\ufb01cient) updates as well as reducibility needed\nfor parallelism. Our experiments show that the new proposal is on average 2.5x faster in practice than\nFA for distributed and multi-threaded execution.\nOur Contributions The problem addressed by this paper is to identify the top-K frequent items in\na given data stream(formal de\ufb01nition in sec 2.1). For this problem, we present Topkapi, a fast and\nparallel approximate algorithm. 1) Topkapi combines CMS and FA in a novel way that makes the\nsummary reducible and at the same time capable of enabling parallelism. 2) We show that Topkapi\nretains the provable probabilistic error guarantees analogous to popular sketching algorithms in the\nliterature. 3) We provide optimized parallel implementations for FA, CMS and our proposed Topkapi\nalgorithm. 
Our implementation is optimized to overlap communication with computation and is capable of exploiting both multi-node and multi-core parallelism effectively. 4) We provide rigorous evaluations, profiling, and comparisons of the two popular algorithms CMS and FA with Topkapi on large-scale word counting benchmarks. Our experiments indicate significant performance gains with Topkapi compared to existing approximate heavy-hitter algorithms. 5) Our work also provides empirical quantifications of the benefits of using approximate algorithms over an exact state-of-the-art distributed implementation in Spark. Our results show disruptive performance gains (sec 6 of Supplementary document) with Topkapi over some of the fastest known exact implementations, at the cost of small approximations.\n2 Background\n2.1 Notations\nWe will refer to the problem of finding the top-K most frequent items in the data stream as the \u201ctop-K problem\u201d. Let\u2019s assume we have D distributed data streams {S1...SD}, for example, D text streams. Let us assume that there are in total M words {w1...wM}. Our goal here is to find the K most frequent words in these streams as an aggregate, i.e., in \u222aD\ni=1 Si, where the union represents concatenation (or aggregation) of the streams. We represent the frequency of a word w by f. Also, let N denote the summation of all the frequencies, i.e., N = \u03a3f. If the K-th most frequent word has frequency fK, then we want to report all the words for which f \u2265 fK.\nSeveral approximate formulations of the heavy hitter problem were proposed to overcome the linear memory barrier. We use the standard formulation given in [2]. For details, refer to sec 1 of the Supplementary document.\nWe will interchangeably use the words sketch and summary; they mean the same thing. 
Approximate\nalgorithms for heavy-hitters produce a summary output which is typically much smaller than the data.\nThis summary can be used to answer the heavy-hitters or other estimation queries.\nSince we will be using approximate (lossy) algorithms over distributed clusters, where we will need\nto merge different summaries from different nodes, we need to de\ufb01ne reducibility of the summaries\n(or sketches). Reducibility will ensure that the algorithm can be parallelized ef\ufb01ciently. Our de\ufb01nition\nof reducibility is inspired from the de\ufb01nition of mergeability in [2]. However, our de\ufb01nition is simpler\nand more generic for better readability.\nReducible Summary: Given the output summary O1 from running algorithm A on data stream S1\nand output summary O2 with running the same A on data S2. We call an algorithm reducible if we\ncan recover some summary \u02c6O directly from the two output summaries O1 and O2, such that, if we\nuse the combined summary \u02c6O to replace O, which is a summary obtained after running A on S1 \u222a S2,\nwe still retain all theoretical guarantees of algorithm A. In addition, we want two more conditions: \u2013\n1) The computation cost of calculating \u02c6O from O1 and O2 should be less than the cost of running A\nover S1 \u222a S2 and 2) The space required by \u02c6O should not be more than that of O.\nNote that sometimes the algorithm A, such as FA (de\ufb01ned later), is sensitive to the order in which it\nsees the input data. In such cases, we cannot guarantee that the combined summary O will be equal\nto \u02c6O, but so long as the \ufb01nal outputs have same accuracy guarantees and computation time, we can\ndistribute it ef\ufb01ciently.\n2.2 Exact Algorithms\nExactly solving the top-K problem requires O(M ) memory and have O(M logM ) runtime complex-\nity. 
One can compute all the frequencies f using standard word count or histogram computation.\nThen sort the words based on the frequencies f as the key and report the top-K words. We can utilize\nhash-maps to store words and update frequencies as we read the data. Finally, we sort the map.\nA unique advantage of this exact method is that it is easy to parallelize. We can perform separate hash\nmap updates with separate data in parallel, and at the end, we perform reduction by key to get the\n\ufb01nal frequencies. Then we sort the words to get the top-K frequent words. Several state-of-the-art\nimplementations, such as Spark based wordcount() + sort() use this method. However, our\nexperiments (sec 6 of Supplementary document) reveal that O(M ) storage and communication, even\nwith the best possible distributed implementation can be orders of magnitude slower compared to\napproximate solutions in a distributed setting.\n2.3 Approximate Algorithms\nAlgorithms for \ufb01nding approximate heavy hitters is a heavily studied topic in database and theory\ncommunity. These algorithms mainly come in two \ufb02avors - 1) counter-based and 2) sketch-based.\nCounter-based Algorithms: Counter-based algorithms maintain a set of counters (maps) associated\nwith a subset of words (or maps with counters) from the data stream it has traversed. This subset\nof words is called the monitored set. There are several variants, such as Frequent [10], Lossy\nCounting [11], and Space Saving [13]. Please see [6] for a good survey on them. Note that, [6]\nexplored only sequential version of these algorithms whereas we are mainly interested in parallelism\nhere. 
In our work, for comparison with counter-based approach in general from the perspective\nof parallelism, we consider one of the most popular variant \u2013 Frequent Items or simply Frequent\nalgorithm (FA) (a brief description of important features is given in sec 2 of Supplementary document).\nThe main advantage of this approach is the summaries are reducible whereas the main disadvantage\nis high update time.\n\nSketch-based Algorithms: Instead of maintaining counters for a monitored set of words, sketch-\nbased algorithms use lossy hashes to create a summary which can be used to estimate the frequency of\n\n3\n\n\fIntuition\n\nany given item. For this study, we consider one of the most popular and ef\ufb01cient among the sketching\nalgorithms \u2013 Count-Min Sketch (CMS), which is widely adopted in practice. Important algorithmic\naspects of CMS are described in sec 3 of Supplementary document. Sketch-based approach provides\nfast update of summary but has signi\ufb01cant disadvantage when it comes to reducibility because heap,\nwhich is not reducible, is needed for recovering identity of counters.\nAlthough there has been a signi\ufb01cant development in past years on approximate heavy hitters [3; 14;\n7; 11; 8], little focus has been given on the parallelism aspects except a very few, such as [17; 4; 5; 15].\nWhen it comes to parallelism, there are several choices. Parallelizing the individual updates is not a\ngood option as the computation is too low to justify parallelism. Exploiting parallelism just for one\nupdate is too \ufb01ne-grained, and the overhead of parallelism would be much higher than the gain from\nparallelism. Data parallelism, i.e., performing computation for different blocks of data in parallel, is\nmore preferred because we have a much better granularity of parallelism. 
Thus, with enough data, it\nis always preferred to have each parallel process work on its own memory and later a one-time merge.\nWe also get a very high degree of parallelism due to the large size of the data. Thus, it is essential for\nthe algorithm to be reducible. However, with data parallelism, the algorithmic update time becomes\na factor with a signi\ufb01cant impact on performance. [17; 4; 5] discuss parallel counter-based Space\nSaving [13] algorithm over CPU, GPU, and distributed environment respectively. However, none\nof them addresses distributed environment with multi-threading. Also, we can see in [4] that the\ncounter-based approach has signi\ufb01cant update time even on massively-parallel architecture such as\nGPU. Interestingly, [15] explored \ufb01ne grain parallelism to speedup Space Saving on modern CPUs\nwith advanced vector instructions. This kind of exploitation of \ufb01ne grain parallelism is complementary\nto coarse grain parallelism which is the main focus of this work.\n3 Our Proposal: Topkapi\n3.1\nConsider the CMS matrix M (sec 3 of Supplementary document) without the overhead of updating\nthe heap for identi\ufb01ability. Note that every row of this matrix is a simple hashed counter, and all rows\nare independent. Thus, without the heaps, CMS are reducible summaries, i.e., different summaries\nwith the same hash functions can be merged by simply adding the sketches. The update time is\nmere log 1\n\u03b4 (\u03b4 is failure probability) which is also the number of independent hash functions needed.\nFollowing [16], in all our experiments, only 4 hash functions suf\ufb01ce in practice. An important\nobservation is that the sketch matrix M is enough to estimate the counts of any given item accurately\nbut cannot identify the frequent items on its own. Thus, without identi\ufb01ability, we need another pass\nover every item, estimate its count, and then report top-K. 
Given the number of unique items is astronomical, this is prohibitive. However, if we can somehow efficiently identify a small enough set of candidates CS which likely contains the most frequent elements, then we just have to check every element in CS instead of all the items.\nIt should be noted that due to simple hashing, every cell of CMS will count the total occurrences of a small set of items (\u03b5N in expectation, where \u03b5 is the approximation parameter). If a heavy hitter item HH with f \u2265 \u03c6 \u00d7 N hashes to this counter, it is very likely to be the most frequent item in the cell. Thus, if we can identify the heaviest element in the subset of the stream in every cell efficiently, then there is hope of getting a good enough candidate set CS.\nFA keeps the identity of the heavy hitters in a map. The update time is equal to the size of the map, which needs to be 1/\u03b5 for reporting all the heavy hitters. However, if we are interested in just the heaviest item, then we do not need maps and the update time will be constant. We just need two cells; one stores the identity of the heaviest element and another a counter to increment/decrement.\nThe above observations form the basis of our proposal. We propose to associate an FA summary of size 1 with each counter of CMS. We later show that it has sound theoretical guarantees analogous to CMS for solving the approximate heavy hitters problem. Furthermore, this modification eliminates all the issues mentioned in section 2.3.\n3.2 Topkapi: Algorithm Descriptions\nTopkapi contains a CMS summary, i.e., a two-dimensional l\u00d7b array M. As a reminder, b represents the number of buckets for a hash function and l represents the number of hash functions. We have l pair-wise independent hash functions h1, h2, ..., hl to map words to the range {1, 2, ..., b}. b is set to 1/\u03b5 and l is set to log(2/\u03b4).\nNow, each cell Mi,j has two additional components: 1) LHHcountij, representing the count of the frequent item associated with Mij (Local Heavy Hitter count), and 2) LHHij, containing the word (identity) whose frequency is stored in LHHcountij. This LHHij will ideally be the most frequent item mapping to Mij. Note that each item is mapped to l cells in M.\nDuring initialization, all the LHHcounts as well as M are set to 0. During processing of the data stream, we do the usual update of M, the CMS. In addition, for each word w, we compare w with the LHH of the cell at hi(w). If it matches, then we increment the corresponding LHHcount of the cell at hi(w). Otherwise, we decrement the LHHcount. If the decrement causes the LHHcount to become 0, then we replace the LHH of hi(w) with w and set the corresponding LHHcount to 1. We do this \u2200i : 1 \u2264 i \u2264 l.\nIn the end, we consider the union of all the unique LHH values as the candidate set CS. We estimate their counts using the CMS and finally report all elements with count higher than some threshold, such as \u03c6\u00d7N for the \u03c6-heavy hitters problem.\n3.3 Topkapi: Properties\nHere, we summarize the main algorithmic properties of Topkapi. For a detailed theoretical analysis of Topkapi, please see sec 4 of the Supplementary document. An important thing to note here is that we do not require any heap for Topkapi.\n\n1. Topkapi with size l = log(2/\u03b4) and b = 1/\u03b5 solves the \u03c6-approximate heavy hitter problem provided \u03b5 < \u03c6.\n\n2. Topkapi data structure is reducible. As a result, Topkapi can exploit parallelism easily.\n3. 
Topkapi data structure has update cost of log 2\n\n\u03b4 which is similar to logarithmic update cost\n\nof CMS.\n\nIt is noteworthy to mention that if we want to get the frequency estimates along with the identities of\ntop-K frequent elements, we can use both CMS count (overestimates) and LHH count (underestimates)\nto take an average and decrease the error constants, else we can always use the estimate from CMS.\nSo, we are strictly better.\n3.4 Practical Considerations\nIn Topkapi, the only use of CMS counters in M is estimation. It turns out that in practice LHHcount\nitself is also a good estimator of the true frequency of LHH. This is because we are using FA\nsummary of size 1 on a tiny stream. Thus, if our goal is only to get the identities of top-K frequent\nelements, we can altogether get rid of CMS counters and reduce the memory overhead signi\ufb01cantly.\nFinally, towards the end, instead of considering all the unique LHHs, we can be little smarter. Note\nthat every item is mapped to every row and all the rows are independent. The idea is to perform a\nlinear scan over only the 1st array (l = 1) of counters and add LHH into CS if the corresponding\nLHH is greater than a threshold in any of the l rows. Then we sort the candidate set CS to identify\ntop-K candidates according to their LHHcounts and report the LHHs associated with highest\nLHHcounts. Pseudocode of this practical version of Topkapi is given in Algorithm 1. We will use\nthis algorithm in experiments.\n4\n\nImplementation\n\nIt is imperative that we use multi-core parallelism along with distributed parallelism to make effective\nuse of current and future computing systems.\n4.1 Multi-core Parallelism\nWhen considering intra-node parallelism using multi-threaded execution, we have several options for\nTopkapi. We can use different threads for different hash functions in {h1, h2, ..., hl}. However, this\nlimits the number of threads to the number of hash functions which is usually quite low. 
Another option is to use different threads to process different chunks of data and use a single sketch shared across different threads. The threads will then have to use locks or atomic variables to perform the shared update of counters in the sketch. The use of locks or atomic variables can create significant contention due to the distribution of word frequencies. As the heavy hitters are the most frequent, it is highly likely that many threads encounter the same heavy hitter word and try to update the same counter in the sketch.\n\nAlgorithm 1: Topkapi\nData: Input text stream S, parameter K\nResult: top-K frequent words in HH\n1 b \u2190 \u23081/\u03b5\u2309\n2 l \u2190 log(2/\u03b4)\n3 C \u2190 l\u00d7b counters\n4 C[i][j].LHHcount \u2190 0 \u2200i \u2208 {1, 2, .., l} and \u2200j \u2208 {1, 2, ..., b}\n5 for w \u2208 stream S do\n6   for i \u2208 1, 2, ..., l do\n7     calculate hi(w)\n8     if C[i][hi(w)].LHH == w then\n9       C[i][hi(w)].LHHcount \u2190 C[i][hi(w)].LHHcount + 1\n10    else\n11      C[i][hi(w)].LHHcount \u2190 C[i][hi(w)].LHHcount \u2212 1\n12      if C[i][hi(w)].LHHcount == 0 then\n13        C[i][hi(w)].LHH \u2190 w\n14        C[i][hi(w)].LHHcount \u2190 1\n15 for j \u2208 1, 2, ..., b do\n16   if C[1][j].LHHcount > Threshold OR C[i][hi(C[1][j].LHH)].LHHcount > Threshold for some i \u2208 {2, .., l} then\n17     CS.insert(C[1][j])\n18 sort(CS) in descending order of LHHcount\n19 report LHH of CS entries with top K highest LHHcount\n\nWe can mitigate the problems mentioned in the previous options by exploiting a high level of data parallelism at the cost of extra local memory. We can create thread-local copies of the sketch and use different threads to process different chunks of data. Then we exploit the reducibility property of the sketch and merge the thread-local sketches at the end of the data traversal to produce a single sketch for a node. 
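To make the per-word update of Algorithm 1 and the thread-local merge step concrete, here is a compact C++ sketch. It follows the practical variant of Section 3.4 that drops the CMS counters and keeps only the LHH cells. The hash function is illustrative, and the cell-merge rule shown is the natural Frequent-algorithm combine (same identity: add counts; different identities: keep the larger count and subtract the smaller), which reducibility permits; it may differ in detail from the authors' implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// One cell of the Topkapi sketch: a size-1 Frequent-algorithm summary.
struct Cell {
    std::string lhh;        // local heavy hitter (identity)
    uint64_t    count = 0;  // LHHcount
};

struct Sketch {
    size_t l, b;              // rows (hash functions) and buckets per row
    std::vector<Cell> cells;  // l x b, row-major
    Sketch(size_t l_, size_t b_) : l(l_), b(b_), cells(l_ * b_) {}

    // Illustrative row-seeded hash standing in for pair-wise independent h_i.
    size_t bucket(size_t row, const std::string& w) const {
        return std::hash<std::string>{}(w + '\0' + std::to_string(row)) % b;
    }

    // Algorithm 1, lines 5-14: update every row for word w.
    void update(const std::string& w) {
        for (size_t i = 0; i < l; ++i) {
            Cell& c = cells[i * b + bucket(i, w)];
            if (c.lhh == w) {
                ++c.count;                   // same LHH: increment
            } else {
                if (c.count > 0) --c.count;  // different word: decrement
                if (c.count == 0) {          // cell exhausted: take it over
                    c.lhh = w;
                    c.count = 1;
                }
            }
        }
    }

    // Reducibility: merge another sketch built with the same hashes, cell by cell.
    void merge(const Sketch& o) {
        for (size_t k = 0; k < cells.size(); ++k) {
            Cell& a = cells[k];
            const Cell& c = o.cells[k];
            if (a.lhh == c.lhh) {
                a.count += c.count;
            } else if (c.count > a.count) {
                a.count = c.count - a.count;
                a.lhh = c.lhh;
            } else {
                a.count -= c.count;
            }
        }
    }
};
```

In the thread-local scheme, each thread owns a `Sketch`, calls `update` once per word of its chunk, and the node-level summary is obtained by `merge`-ing the thread-local copies after the data traversal, so no locks are needed on the hot update path.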
We observe that even for\na large dataset, we only need a small\nsketch. For example, with l = 4 and\nb = 1024, the size of the count array\nis 16KB and the size of the id array is\n64KB. So, the amount of extra mem-\nory required is quite low. As different\nthreads are working on their own local\ncopies of the sketch, we do not need\nlocks to update a counter anymore.\n4.2 Distributed Parallelism\nSince our algorithm is reducible, dis-\ntributed parallelism is quite straight-\nforward. We start with multi-threaded\nexecution of Topkapi on each node\nfollowing the method mentioned in\nsection 4.1. When we have the \ufb01nal\nsummaries ready at each node, we per-\nform a parallel reduction or merging\nof the summaries to get a \ufb01nal sum-\nmary at the root node. Once we have\nthat, we use the \ufb01nal summary at the root node to perform the potential top-K candidate set (CS)\nconstruction, sort CS, and report top-K words steps from the sequential Topkapi pseudocode\nmentioned in Algorithm 1.\nCommunication cost - One important factor considering distributed computation is the communica-\ntion overhead. The communication traf\ufb01c for merging summaries between two nodes is the size of a\nsingle summary. As we use a parallel reduction strategy to merge the summaries at different nodes,\nwe perform logD such merging steps between different pairs of nodes, where D is the total number\nof nodes.\nOverlapping Communication with Computation - In distributed computing, one can hide some\nof the communication overhead by carefully coordinating the communication so that it overlaps\nwith the computation. In our implementations, we also exploit such opportunities. The reduction\nalgorithm merges all the counters of a summary independently, i.e., a merged counter only depends\non the respective two counters from the two summaries being merged. 
Hence, we can overlap the communication for a specific row of b counters with the computation of merging the previous rows of b counters. We use MPI non-blocking communication to achieve this overlapping.\nFor an overview of the distributed and multi-threaded implementation of Topkapi, we present the pseudocode in Algorithm 2, which extends the pseudocode from Algorithm 1.\n4.3 Parallelizing Baselines: Frequent Algorithm and Count-Min Sketch\nFor the purpose of performance comparison, we choose the two most popular algorithms, namely \u201cFA\u201d and \u201cCMS\u201d, as representatives of counter-based algorithms and sketch-based algorithms respectively.\nAs mentioned in section 2.3, CMS requires a heap for finding top-K and is not reducible. For this exact reason, [2] instead used FA for mergeability. Unfortunately, without reducibility, it is hard to exploit massive data parallelism independently, and the implementations are unlikely to be efficient.\nWe made a simplifying assumption that each subsample of the stream is uniformly distributed and hence merging two top-K lists still makes sense.\n\nAlgorithm 2: Topkapi_Parallel(S[][], K, N, T)\n1 for i \u2208 nodes N do\n2   for j \u2208 threads T do\n3     create thread local copies of Topkapi summary;\n4     execute Topkapi for data S[i][j] in parallel using summary j with only the summary update phases;\n5   merge thread local summary j \u2200j \u2208 {1, ..., T} to produce node final summary i;\n6 use parallel reduction strategy to merge node final summary i \u2200i \u2208 {1, ..., N} to produce a final summary at root node;\n7 construct CS using final summary at root node;\n8 sort CS and report top-K words from root node\n\nThere were two main questions behind making this simplifying assumption with CMS. 1) Does Reducibility Matter in Practice? Subsampling streams is one of the most popular ways of reducing computation. 
The assump-\ntion is that the frequent item in the\nwhole stream is also a frequent item\nin any small subsample of the stream.\nIf this holds,\nthen merging top-K\nacross substreams should be possible\nand reducibility may not matter much\nin practice for accuracy. We aimed\nto check this hypothesis. 2) In the\nmost lucky world, is CMS still the\nfastest? CMS, even with heaps, has\nsigni\ufb01cantly faster update time com-\npared to FA (experimental results in Figure 1f). Can Topkapi beat this cheap CMS variant on\nperformance?\nThus, to understand the performance bene\ufb01ts, we ignored the accuracy aspect and merged the heaps.\nTo merge the heaps, we perform naive merge where we take two heaps and sort them to make a \ufb01nal\nheap containing top-K candidates. One can argue that increasing the heap size (e.g., 2K) would\nimprove the accuracy of CMS. So, we give CMS more room to get better accuracy by using a heap\nsize of 4K. It should be noted that only the sketch (counters) in CMS is reducible and the reduction\nis performed similarly as Topkapi.\n5 Evaluations\n5.1 Code and Experimental Setup\nThe implementations of our algorithm1 and competing algorithms are in C++ under a common\nframework to ensure as much of an apples-to-apples comparison as possible when presenting relative\nperformance results. As for data we have used text data from the Project Gutenberg [1] corpus\nand PUMA Datasets [9]. The details on experimental setup and datasets are given in sec 5 of\nSupplementary document. For all the experiments, K is set to 100 unless otherwise stated.\n5.2 Results\n5.2.1 Scalability over Number of Nodes\nWe present strong scaling (\ufb01xed data size) performance results over varying number of nodes for\ntwo different data sizes: a) 16GB (Gutenberg dataset) and b) 128GB (Puma dataset). 
Figure 1a and\nFigure 1b represents the speedup of Topkapi over Frequent(FA) and Count-Min Sketch(CMS) for\n16GB and 128GB data sizes respectively for 1 to 16 nodes with each node running 8 threads. We\nsee that our proposal consistently get roughly 2.5x speedup over FA for both the data types whereas\nwe usually get sightly lower speedup over CMS. It should be noted that we used the dumb merging\nof top-K heap for CMS which loses signi\ufb01cant accuracy (see Section 5.2.8). Despite this cheap\napproximation with CMS, we still observe 2x-2.6x speedup for 16GB data and 1.6x-2x speedup for\n128GB data over CMS.\n5.2.2 Scalability over Number of Threads\nFigure 1c represents the performance improvement of Topkapi over FA and CMS for 1 to 64 threads\non a single node with 32 cores. We used 16GB data for this experiment. The plot shows that we get\naround 2x speedup over CMS for all the data points whereas we get similar performance improvement\nover FA till 8 threads; after that speedup over FA increases steeply and we get 22x speedup with\n64 threads. As an optimized implementation of FA requires two hash-maps with size being in the\norder of number of counters, the memory footprint of FA is quite high. This negatively affects the\nperformance after a threshold when L3 cache can not contain all the data footprint of two or more\nthreads in the same processor chip. This performance degradation becomes more pronounced when\n\n1https://github.com/ankushmandal/topkapi.git\n\n7\n\n\f(a) Performance comparison with FA and CMS for\n16GB data. Number of threads per node is 8. Used\na cluster of Intel R(cid:13)Westmere processors with each\nnode having 12 cores.\n\n(b) Performance comparison with FA and CMS for\n128GB data. Number of threads per node is 8.\nUsed a cluster of Intel R(cid:13)Westmere processors with\neach node having 12 cores.\n\n(c) Performance comparison with FA and CMS for\nvarying number of threads. Data Size=16GB and\nNumber of Nodes=1. 
Used a single node with 32 cores from four IBM POWER7 chips.

(d) Performance comparison with FA and CMS for varying data size on 8 nodes. Number of threads per node is 8. Used a cluster of Intel® Westmere processors with each node having 12 cores.

(e) Performance comparison with FA and CMS for a high number of threads (32 and 64) in a distributed setting. Used a cluster of IBM POWER7 processors where each node has 32 cores from four processor chips.

(f) Execution time breakdown for Topkapi, FA, and CMS for 4 nodes and 1GB data size. Number of threads per node is 8. Used a cluster of Intel® Westmere processors with each node having 12 cores.

Figure 1: Performance Results

more than one hardware thread is executed on the same core. For example, the configuration with 64 threads uses the SMT feature of POWER7 and executes 2 threads on each core.

5.2.3 Scalability over Data Size

To see the effect of data size on performance, we fix the number of nodes to 8 and vary the data size from 16GB to 128GB. The resulting speedups over FA and CMS are given in Figure 1d. The figure shows around 2.5x speedup over FA and 1.5x-2x speedup over CMS. Besides these good performance improvements, the consistency of the speedup indicates that Topkapi performs well across a wide range of data sizes.

5.2.4 Scaling over Number of Nodes with Increasing Data Size

Next, we increase the data size along with the number of nodes and use a high number of threads (32 and 64) on each node to evaluate weak scaling. Figure 1e presents the resulting plot.
As the plot shows, we get a consistent speedup of roughly 2x over CMS. However, we see an interesting pattern for FA. For 32 threads, the speedup over FA decreases significantly as we move from one node to two nodes. On the other hand, the speedup remains high (more than 16x) for 64 threads across all data points. In the case of FA, merging the summaries has lower computational overhead than for CMS and Topkapi. So, when we move to a distributed setting with 2 or more nodes, the outcome depends on which factor has more impact: the performance gain from the low-overhead merging step, or the performance degradation from the high level of multi-threading.

Table 1: Precision Comparison between Approximate Methods (values are precision in %)

Data Size   Topkapi (1024 counters)   CMS (1024 counters)   CMS (2048 counters)   FA (1024 counters)
16GB        96                        64.4                  68.33                 87
128GB       95                        11.6                  49.66                 94

5.2.5 Performance Analysis

Figure 1f shows the breakdown of execution time for Topkapi, FA, and CMS.
The plot supports our analysis that FA, among the three algorithms, has the highest update time for the summary but the lowest cost when merging summaries across nodes. Unsurprisingly, CMS has the lowest update time for the summary, since an update involves only computing the bucket through hashing and incrementing the corresponding counter. However, its performance on the top-K problem is severely hampered by the overhead of maintaining a summary of probable top-K words, so its effective update time becomes quite high. While Topkapi has a slightly higher update time than CMS, its effective update time is much lower because it carries no heap-maintenance overhead. Furthermore, Topkapi has quite low computational cost for merging summaries across nodes, whereas CMS has the highest cost in this regard.

5.2.6 Performance over Varying K

We repeated the experiments of Figure 1a for K=50 and K=200, and present the results in Figure 2. We used 512 and 2048 buckets (counters) for K=50 and K=200 respectively. Compared to K=100, the speedup of Topkapi over FA increases to the range 2.73x-3.01x for K=50, and decreases to 2.21x-2.36x for K=200. However, the speedup over CMS remains almost the same. When K is smaller, FA should slow down since it has fewer counters (1/ε, i.e., O(K)) or tracked elements; it therefore more frequently performs the computation associated with an element not being found, which is costly. For the same reason, FA is faster when K is larger: for each match, it only has to increment the corresponding counter, which is cheap. On the other hand, we do not expect the performance of Topkapi and CMS to change much, apart from a slight slowdown with increasing sketch size.

Figure 2: Performance comparison with FA and CMS for K=50, 200 on 16GB data.
Number of threads per node is 8.

5.2.7 Comparing CMS with a Separate top-K Pass

In a batch processing environment, one may employ a two-pass algorithm: a first pass of pure CMS to obtain frequency estimates, followed by a separate second pass for hash-based top-K identification. In our experiments on 1 to 16 nodes (8 threads on each node) with 16GB of data, we find that the execution time of this two-pass algorithm is on average 0.97x that of the single-pass CMS+heap approach. It is worth noting that the comparison is not fair: in a streaming setting, remembering the items themselves for the second pass has linear cost, which is prohibitive.

5.2.8 Precision for Reported top-K

Since Topkapi is reducible, it is expected to give good precision, and Table 1 confirms this: Topkapi outperforms both CMS and FA in precision on the 16GB and 128GB data. Moreover, the poor precision observed for CMS indicates that the simplifying assumption we made in Section 4.3 to favor better performance for CMS does not hold in practice.

Acknowledgments

This work was supported in part by NSF-1629459, NSF-1652131, NSF-1838177, AFOSR-YIP FA9550-18-1-0152, the BRC grant for Randomized Numerical Linear Algebra, an Amazon Research Award, and the Data Analysis and Visualization Cyberinfrastructure funded by NSF under grant OCI-0959097 and Rice University.

References

[1] Project Gutenberg. https://www.gutenberg.org/, 2017.

[2] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '12, pages 23–34, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1248-6. doi: 10.1145/2213556.2213562.
URL http://doi.acm.org/10.1145/2213556.2213562.

[3] Radu Berinde, Piotr Indyk, Graham Cormode, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst., 35(4):26:1–26:28, 2010. ISSN 0362-5915. doi: 10.1145/1862919.1862923. URL http://doi.acm.org/10.1145/1862919.1862923.

[4] M. Cafaro, I. Epicoco, G. Aloisio, and M. Pulimeno. CUDA based parallel implementations of space-saving on a GPU. In 2017 International Conference on High Performance Computing Simulation (HPCS), pages 707–714, 2017. doi: 10.1109/HPCS.2017.108.

[5] Massimo Cafaro, Marco Pulimeno, and Piergiulio Tempesta. A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution. Information Sciences, 329:1–19, 2016. ISSN 0020-0255. doi: 10.1016/j.ins.2015.09.003. URL http://www.sciencedirect.com/science/article/pii/S002002551500657X. Special issue on Discovery Science.

[6] Graham Cormode and Marios Hadjieleftheriou. Methods for finding frequent items in data streams. The VLDB Journal, 19(1):3–20, 2010.

[7] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

[8] Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms, pages 348–360. Springer, 2002.

[9] Faraz Ahmad. PUMA Datasets. https://engineering.purdue.edu/~puma/datasets.htm.

[10] Richard M. Karp, Scott Shenker, and Christos H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1):51–55, March 2003. ISSN 0362-5915. doi: 10.1145/762471.762473. URL http://doi.acm.org/10.1145/762471.762473.

[11] Gurmeet Singh Manku and Rajeev Motwani.
Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 346–357. VLDB Endowment, 2002. URL http://dl.acm.org/citation.cfm?id=1287369.1287400.

[12] H. Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1222–1230. ACM, 2013.

[13] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 10th International Conference on Database Theory, ICDT '05, pages 398–412. Springer-Verlag, 2005. doi: 10.1007/978-3-540-30570-5_27. URL http://dx.doi.org/10.1007/978-3-540-30570-5_27.

[14] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst., 31(3):1095–1133, 2006. ISSN 0362-5915. doi: 10.1145/1166074.1166084. URL http://doi.acm.org/10.1145/1166074.1166084.

[15] Pratanu Roy, Jens Teubner, and Gustavo Alonso. Efficient frequent item counting in multi-core hardware. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, pages 1451–1459. ACM, 2012. doi: 10.1145/2339530.2339757.

[16] Anshumali Shrivastava, Arnd Christian Konig, and Mikhail Bilenko. Time adaptive sketches (ada-sketches) for summarizing data streams. In Proceedings of the 2016 International Conference on Management of Data, pages 1417–1432. ACM, 2016.

[17] X. Yang, J. Liu, and W. Zhou. A parallel frequent item counting algorithm.
In 2016 8th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), volume 02, pages 225–230, 2016. doi: 10.1109/IHMSC.2016.123.

[18] Ke Yi and Qin Zhang. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica, 65(1):206–223, 2013.