Running Time: O(N), O(N log K), or O(N + K log N), depending on the approach.
Solution:
Finding the top K frequent elements is a classical database and data-streaming problem, with several known solutions ranging from deterministic to probabilistic. I'll illustrate the three obvious ones here.
The first step is to count how many times each number appears in the file. If the file is pre-sorted then we need a single scan over the file.
Function COUNTS:
    A = load_file_in_array
    last = A[1]
    ctr = 1
    for i = 2:N
        if last == A[i]
            ctr += 1
        else
            CountMap{last} = ctr
            last = A[i]
            ctr = 1
        end
    end
    // required for the last element of the array
    CountMap{last} = ctr
end

Note that CountMap is a hashmap that stores the counts of all the elements. The procedure COUNTS is quite efficient if the file is pre-sorted. If the file is not pre-sorted, sorting it first raises the time complexity to O(N log N). In that case we can do better by using the hashmap directly to maintain the current count of each number, without sorting the file, as follows:
Function EFFICIENT_COUNTS:
    A = load_file_in_array
    for i = 1:N
        CountMap{A[i]} += 1
    end
end

The above procedure obtains the count of every element in a single scan of the file, so it runs in O(N) time. We now have all the numbers along with their frequencies (extracted either by enumerating the keys of CountMap or by another scan of the file). For the example in the problem statement, we get the following tuples: {(2,1), (3,3), (4,1), (1,1), (78,2)}.
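Both counting procedures translate directly to Java. Here is a minimal sketch (class, method, and variable names are mine, and an int array stands in for the loaded file):

```java
import java.util.*;

public class Counting {
    // EFFICIENT_COUNTS: one hashmap pass, O(N), works for any input order.
    public static Map<Integer, Integer> counts(int[] a) {
        Map<Integer, Integer> countMap = new HashMap<>();
        for (int x : a) {
            countMap.merge(x, 1, Integer::sum);
        }
        return countMap;
    }

    // COUNTS: run-length pass over a pre-sorted array; each run of equal
    // values is counted and flushed when the value changes.
    public static Map<Integer, Integer> sortedCounts(int[] a) {
        Map<Integer, Integer> countMap = new LinkedHashMap<>();
        if (a.length == 0) return countMap;
        int last = a[0], ctr = 1;
        for (int i = 1; i < a.length; i++) {
            if (a[i] == last) {
                ctr++;
            } else {
                countMap.put(last, ctr);
                last = a[i];
                ctr = 1;
            }
        }
        countMap.put(last, ctr); // flush the final run
        return countMap;
    }
}
```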
The problem that remains is to find the K most frequent numbers in this array of tuples. The naive approach is to sort the tuples by frequency and pick the top K. This takes O(U log U) time, where U is the number of unique elements (U = 5 in our example). Since U = O(N) in the worst case, the sort costs O(N log N). We can do better, as follows:
Approach 1: O(N) time
Use a selection algorithm (e.g., quickselect or median-of-medians) to find the Kth most frequent number, comparing tuples on their second element, in O(U) time. The Kth most frequent element partitions the array into two parts: the first containing the top K most frequent elements, the second containing the remaining U - K less frequent elements. So we get the top K most frequent elements, in no particular order, in O(N) time (assuming U = O(N)). They can be sorted in O(K log K) time if needed. Note that although this approach runs in O(N) time, the constants hidden in the O-notation can be large, so in practice it can be slower than the two approaches described below.
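The selection step can be sketched in Java as follows. This uses quickselect with a random pivot, so the O(U) bound holds in expectation rather than in the worst case; for simplicity it partitions a bare array of frequencies, and all names are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

public class TopKSelect {
    // Rearranges freq so that the k largest values occupy freq[0..k-1],
    // in no particular order.
    public static void selectTopK(int[] freq, int k) {
        int lo = 0, hi = freq.length - 1;
        while (lo < hi) {
            int p = partition(freq, lo, hi);
            if (p == k - 1) return;          // pivot is exactly the kth largest
            if (p < k - 1) lo = p + 1;       // need more elements from the right
            else hi = p - 1;                 // kth largest lies to the left
        }
    }

    // Lomuto partition around a random pivot, larger values first.
    private static int partition(int[] a, int lo, int hi) {
        int r = ThreadLocalRandom.current().nextInt(lo, hi + 1);
        swap(a, r, hi);
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] > pivot) swap(a, i++, j); // descending order
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```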
Approach 2: O(N log K) time
Pick the first K tuples and put them into a min-heap ordered by frequency, where a tuple (x, y) is less than a tuple (a, b) if y < b. Building a min-heap of size K takes O(K) time.
Then process the remaining U - K tuples one by one. If a tuple's frequency is less than the minimum on the heap, discard it; otherwise, remove the minimum element from the heap and insert the new tuple. This keeps exactly K elements on the heap. Each such delete-insert operation costs O(log K).
Once all tuples have been processed, the elements remaining in the min-heap are the top K frequent items, which can be popped in O(K log K) time. The overall cost of this approach is O(K + (U-K) log K + K log K) = O(K + U log K). Since K < U and U = O(N), the time complexity is O(N log K).
public int[] kMostFrequentItems(int[] A, int k) {
    // Count occurrences of each value.
    Map<Integer, Integer> map = new HashMap<>();
    for (int num : A) {
        Integer cnt = map.get(num);
        if (cnt == null) {
            cnt = 0;
        }
        map.put(num, cnt + 1);
    }

    // Min-heap of keys ordered by frequency, capped at k entries.
    Queue<Integer> pq = new PriorityQueue<>(k + 1, new Comparator<Integer>() {
        @Override
        public int compare(Integer o1, Integer o2) {
            return map.get(o1) - map.get(o2);
        }
    });
    for (Integer key : map.keySet()) {
        pq.offer(key);
        if (pq.size() > k) {
            pq.poll(); // evict the least frequent of the k+1
        }
    }

    // Pop in increasing frequency; fill the result back to front.
    int size = Math.min(k, pq.size());
    int[] items = new int[size];
    for (int i = size - 1; i >= 0; i--) {
        items[i] = pq.poll();
    }
    return items;
}
Approach 3: O(N + K log N) time
This approach is similar to approach 2, but here we build a MAX-HEAP of all U tuples. So the first step is to build the max-heap of all the elements in O(U) time. Then remove the maximum element from the heap K times, in O(K log U) time. The K removed elements are the desired most frequent ones. The time complexity of this method is O(U + K log U); setting U = O(N) gives O(N + K log N).
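A Java sketch of this approach follows. One caveat: java.util.PriorityQueue has no Comparator-aware bulk-heapify constructor, so building the heap element by element here costs O(U log U) rather than the textbook O(U); the asymptotic claim above assumes a linear-time heapify. Names are illustrative:

```java
import java.util.*;

public class TopKMaxHeap {
    // Approach 3: max-heap over all (value, count) pairs, then poll k times.
    public static int[] topK(int[] a, int k) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int x : a) counts.merge(x, 1, Integer::sum);

        // Max-heap keyed on frequency (largest count at the head).
        PriorityQueue<Map.Entry<Integer, Integer>> maxHeap =
            new PriorityQueue<>((e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()));
        maxHeap.addAll(counts.entrySet()); // O(U log U) in java.util, see note above

        // The first k polls yield the k most frequent values, in order.
        int size = Math.min(k, maxHeap.size());
        int[] top = new int[size];
        for (int i = 0; i < size; i++) {
            top[i] = maxHeap.poll().getKey();
        }
        return top;
    }
}
```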
Let us pause for a moment and contrast approach 2 with approach 3. For simplicity, let T2 = K + N log K be the cost of approach 2 and T3 = N + K log N the cost of approach 3. Plotting the ratio T2/T3 for N = 100 over a range of K shows that, beyond the very smallest values of K, approach 3 is cheaper under these models (T2/T3 > 1), with the two costs converging again as K approaches U. The actual difference depends on the constants involved, which both models ignore, but the comparison still shows the merit of one approach over the other in different regimes.
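The ratio T2/T3 is easy to reproduce numerically. A quick sketch using natural logarithms (the choice of base and the absence of constant factors make the exact values fictitious; only the trend is meaningful):

```java
public class TimeRatio {
    // Crude cost models from the text: T2 = K + N ln K, T3 = N + K ln N.
    public static double ratio(int n, int k) {
        double t2 = k + n * Math.log(k);
        double t3 = n + k * Math.log(n);
        return t2 / t3;
    }

    public static void main(String[] args) {
        for (int k : new int[]{2, 10, 50, 90}) {
            System.out.printf("K=%d  T2/T3=%.2f%n", k, ratio(100, k));
        }
    }
}
```

For N = 100 the ratio dips below 1 only for tiny K, rises above 1 in the mid-range, and drifts back toward 1 as K approaches N.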