K most frequent words from a file
2016-06-18 15:50
579 views

The Most Efficient Way to Find the Top K Frequent Words in a Big Word Sequence
Question (46 votes, 34 favorites):

Input: a positive integer K and a big text. The text can be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: the K most frequent words in the text.

My thinking is like this:

1. Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase the key is the word and the value is its frequency. This takes O(n) time.
2. Sort the (word, frequency) pairs with frequency as the key. This takes O(n*lg(n)) time with a normal sorting algorithm.
3. After sorting, take the first K words. This takes O(K) time.

To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than n, this is actually O(n*lg(n)).

We can improve this. We only want the top K words; the other words' frequencies are of no concern to us. So we can use partial heap sorting: instead of fully sorting in steps 2) and 3), we change them to:

2') Build a heap of (word, frequency) pairs with frequency as the key. It takes O(n) time to build a heap.
3') Extract the top K words from the heap. Each extraction is O(lg(n)), so the total extraction time is O(K*lg(n)).

To summarize, this solution costs O(n + K*lg(n)) time.

This is just my thought. I haven't found a way to improve step 1). I hope some information-retrieval experts can shed more light on this question.

Tags: algorithm, word-frequency
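The two-phase plan above (hash count, then heap extraction) can be sketched in Python; `Counter` stands in for the hash table, and the function name is mine:

```python
from collections import Counter
import heapq

def top_k_words(words, k):
    # Phase 1: O(n) frequency count with a hash table.
    freq = Counter(words)
    # Phases 2'/3': heapify all (count, word) pairs in O(m) where m is the
    # number of distinct words, then pop k times at O(lg m) per pop.
    # Negating the count turns Python's min heap into a max heap.
    heap = [(-count, word) for word, count in freq.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

print(top_k_words("the cat sat on the mat the cat".split(), 2))  # ['the', 'cat']
```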
16 Answers
Answer (29 votes):

This can be done in O(n) time.

Solution 1:

1. Count the words and hash them, ending up with a structure like:

   var hash = {
     "I": 13, "like": 3, "meow": 3, "geek": 3,
     "burger": 2, "cat": 1, "foo": 100, ...
   }

2. Traverse the hash and find the most frequently used word (in this case "foo", 100), then create an array of that size.
3. Traverse the hash again, using each word's occurrence count as an array index; if there is nothing at that index yet, create a list there, otherwise append to it. We end up with an array like:

   index:  0    1    2         3                   ...  100
          [[ ], [ ], [burger], [like, meow, geek], ..., [foo]]

4. Traverse the array from the end and collect the k words.

Solution 2:

1. Count the words into a hash, as above.
2. Use a min heap, keeping its size at k. For each word in the hash, compare its occurrence count with the heap minimum: if it is greater than the minimum, remove the minimum (when the heap already holds k entries) and insert the new entry; the remaining conditions are simple.
3. After traversing the hash, convert the min heap to an array and return it.
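Solution 1's count-indexed buckets can be sketched as follows (Python; the names are mine, and the bucket array is sized by the maximum count as described):

```python
from collections import Counter

def top_k_bucket(words, k):
    freq = Counter(words)
    max_count = max(freq.values(), default=0)
    # One bucket per possible count; a word's count is its bucket index.
    buckets = [[] for _ in range(max_count + 1)]
    for word, count in freq.items():
        buckets[count].append(word)
    # Walk from the highest count downward, collecting k words.
    result = []
    for count in range(max_count, 0, -1):
        for word in buckets[count]:
            result.append(word)
            if len(result) == k:
                return result
    return result

print(top_k_bucket("I like meow geek burger cat I I like".split(), 2))  # ['I', 'like']
```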
Answer (15 votes): You're not going to get a generally better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms. If your problem set is really big, you can use a distributed solution such as map/reduce: have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers chosen by the hash of the word. The reducers then sum the counts, and a merge sort over the reducers' outputs gives you the most popular words in order of popularity.
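A single-process sketch of that map/reduce scheme, with plain lists standing in for the mappers and reducers (the worker count and function name are my assumptions):

```python
from collections import Counter

def map_reduce_top_k(chunks, k, num_reducers=4):
    # Map: each "worker" counts its own chunk of the text.
    mapped = [Counter(chunk) for chunk in chunks]
    # Shuffle: route every word to one reducer by hash, so that all the
    # partial counts for a given word land on the same reducer.
    reducers = [Counter() for _ in range(num_reducers)]
    for partial in mapped:
        for word, count in partial.items():
            reducers[hash(word) % num_reducers][word] += count
    # Reduce + merge: each reducer's totals are exact, so the merged
    # counter gives the global top k.
    merged = Counter()
    for r in reducers:
        merged.update(r)
    return [w for w, _ in merged.most_common(k)]
```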
Answer (9 votes):

A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal to within a constant factor.

The optimization again comes after we run through the list, inserting into the hash table. We can use the median-of-medians algorithm to select the Kth largest element (by frequency) in the list; this algorithm is provably O(n). After selecting the Kth largest element, we partition the list around it just as in quicksort, which is obviously also O(n). Everything on the "large" side of the pivot is in our group of K elements, so we're done (and we can simply throw everything else away as we go along).

So this strategy is:

1. Go through each word and insert it into a hash table: O(n)
2. Select the Kth largest element by frequency: O(n)
3. Partition around that element: O(n)

If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k*lg(k)) time, yielding a total running time of O(n + k*lg(k)).

The O(n) time bound is optimal to within a constant factor because we must examine each word at least once. The O(n + k*lg(k)) bound is also optimal because there is no comparison-based way to sort k elements in less than k*lg(k) time.
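A sketch of this select-then-partition strategy in Python. Note that the selection step here uses randomized quickselect, which is expected O(n), in place of median-of-medians, which achieves the same bound deterministically; all names are mine:

```python
import random
from collections import Counter

def top_k_unordered(words, k):
    # Step 1: O(n) frequency count.
    pairs = list(Counter(words).items())
    if k >= len(pairs):
        return [w for w, _ in pairs]

    def select(lo, hi, idx):
        # Partially order `pairs` in descending frequency so that
        # pairs[idx] holds the (idx+1)-th largest frequency and every
        # pair to its left is at least as frequent (steps 2 and 3).
        while True:
            pivot = pairs[random.randint(lo, hi)][1]
            i, j = lo, hi
            while i <= j:
                while pairs[i][1] > pivot: i += 1
                while pairs[j][1] < pivot: j -= 1
                if i <= j:
                    pairs[i], pairs[j] = pairs[j], pairs[i]
                    i += 1; j -= 1
            if idx <= j: hi = j
            elif idx >= i: lo = i
            else: return

    select(0, len(pairs) - 1, k - 1)
    return [w for w, _ in pairs[:k]]  # the top k, in no particular order
```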
Answer (8 votes): If your "big word list" is big enough, you can simply sample and get estimates; otherwise, I like hash aggregation. Edit: by "sample" I mean choosing some subset of pages and calculating the most frequent words in those pages. Provided you select the pages in a reasonable way and take a statistically significant sample, your estimates of the most frequent words should be reasonable. This approach is really only worthwhile if you have so much data that processing it all would be silly; if you only have a few megs, you should be able to tear through the data and compute an exact answer without breaking a sweat, rather than bothering to calculate an estimate.
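A minimal sketch of the sampling idea, assuming the words fit in a list and that a uniform sample of individual words (rather than pages) is acceptable:

```python
import random
from collections import Counter

def estimated_top_k(words, k, sample_size=100_000):
    # Estimate the top k from a uniform random sample. The answer is only
    # approximate, but genuinely frequent words dominate any reasonably
    # sized sample, so the estimate is usually accurate for small k.
    sample = random.sample(words, min(sample_size, len(words)))
    return [w for w, _ in Counter(sample).most_common(k)]
```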
Answer (2 votes): You can cut the time down further by partitioning on the first letter of the words, then partitioning the largest multi-word set on the next character, until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves, being very careful not to cause string copies everywhere. This algorithm is O(m), where m is the number of characters. It avoids the dependence on k, which is very nice for large k (by the way, your posted running time is wrong: it should be O(n*lg(k)), and I'm not sure what that is in terms of m). If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.
Answer (2 votes): You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so you should probably just optimize how the hash is built.
Answer (1 vote): Your problem is the same as this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/ — use a Trie and a min heap to solve it efficiently.
Answer (1 vote): If what you're after is the list of the k most frequent words in your text, for any practical k and for any natural language, then the complexity of your algorithm is not relevant. Just sample, say, a few million words from the text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate. As a side note, the complexity of the naive algorithm (1. count all, 2. sort the counts, 3. take the best) is O(n + m*log(m)), where m is the number of distinct words in your text. log(m) is much smaller than n/m, so this remains O(n). In practice, the long step is the counting.
Find the k most frequent words from a file
Given a book of words, and assuming you have enough main memory to accommodate all of them, design a data structure to find the top K maximum occurring words. The data structure should be dynamic, so that new words can be added. A simple solution is to use hashing: hash the words one by one into a hash table, incrementing a word's count if it is already present. Finally, traverse the hash table and return the k words with the maximum counts.
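The simple hashing solution can be sketched directly (Python; the function name is mine):

```python
def k_most_frequent(words, k):
    # Hash the words one by one; increment the count if already present.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    # Finally, sort the words by count and return the k most frequent.
    return sorted(counts, key=counts.get, reverse=True)[:k]

print(k_most_frequent(['a', 'b', 'a', 'c', 'a', 'b'], 2))  # ['a', 'b']
```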
We can use a Trie and a Min Heap to get the k most frequent words efficiently. The idea is to use the Trie for searching existing words and adding new words efficiently; the Trie also stores each word's occurrence count. A Min Heap of size k keeps track of the k most frequent words at any point in time (the Min Heap is used the same way as in the standard technique for finding the k largest elements).
The Trie and Min Heap are linked to each other by storing an additional field 'indexMinHeap' in the Trie and a pointer 'trNode' in the Min Heap. 'indexMinHeap' is kept at -1 for words that are currently not in the Min Heap (i.e., currently not among the top k frequent words); for words that are present in the Min Heap, 'indexMinHeap' holds the word's index in the heap. The pointer 'trNode' in the Min Heap points to the word's leaf node in the Trie.
Following is the complete process to print the k most frequent words from a file.
Read the words one by one. For every word, insert it into the Trie, increasing its counter if it already exists. Now the word must also be inserted into the min heap; three cases arise:
1. The word is already present in the min heap. Just increase the corresponding frequency value in the min heap and call minHeapify() for the index obtained from the 'indexMinHeap' field in the Trie. Whenever min heap nodes are swapped, update the corresponding 'indexMinHeap' in the Trie; remember that each min heap node also holds a pointer to its Trie leaf node.
2. The min heap is not full. Insert the new word into the min heap, update the heap node, store the min heap index in the Trie leaf node, and then call buildMinHeap().
3. The min heap is full. Two sub-cases arise:
....3.1 The frequency of the new word is less than or equal to the frequency of the word at the head of the min heap: do nothing.
....3.2 The frequency of the new word is greater than the frequency of the word at the head of the min heap: replace the head and update its fields. Make sure to set the 'indexMinHeap' of the replaced word in the Trie to -1, as that word is no longer in the min heap.
4. Finally, the min heap holds the k most frequent words of all the words present in the given file, so we just print all the words in the min heap.
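The process above can be sketched in Python. This is a simplified rendering, not the article's code: the trie is a dict of children, `heap_index` mirrors the article's 'indexMinHeap', and each heap entry is the trie leaf itself (playing the role of 'trNode'):

```python
K = 3  # heap capacity: track the 3 most frequent words

class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0        # occurrences; > 0 only at word-ending nodes
        self.heap_index = -1  # mirrors 'indexMinHeap': -1 if not in the heap
        self.word = None

heap = []  # array-based min heap of TrieNodes, keyed by count

def swap(i, j):
    # Keep heap_index in the trie leaves consistent on every swap.
    heap[i], heap[j] = heap[j], heap[i]
    heap[i].heap_index, heap[j].heap_index = i, j

def sift_down(i):
    while True:
        smallest = i
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(heap) and heap[c].count < heap[smallest].count:
                smallest = c
        if smallest == i:
            return
        swap(i, smallest)
        i = smallest

def insert_word(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1
    node.word = word
    if node.heap_index != -1:
        # Case 1: already in the heap; its count grew, so sift it down.
        sift_down(node.heap_index)
    elif len(heap) < K:
        # Case 2: heap not full; append, then sift up to restore the heap.
        heap.append(node)
        i = node.heap_index = len(heap) - 1
        while i > 0 and heap[(i - 1) // 2].count > heap[i].count:
            swap(i, (i - 1) // 2)
            i = (i - 1) // 2
    elif node.count > heap[0].count:
        # Case 3.2: evict the minimum, resetting its heap_index to -1.
        heap[0].heap_index = -1
        heap[0] = node
        node.heap_index = 0
        sift_down(0)
    # Case 3.1: not more frequent than the heap minimum -> do nothing.

root = TrieNode()
for w in "your code your code and and and geeks geeks geeks geeks".split():
    insert_word(root, w)
print(sorted((n.word, n.count) for n in heap))  # [('and', 3), ('code', 2), ('geeks', 4)]
```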
Output:
your : 3 well : 3 and : 4 to : 4 Geeks : 6
The above output is for a file with the following content.