
Shell script: count how often each string occurs in a text file and print the results from most to least frequent (an interview algorithm question)

2014-04-16 20:15

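A text file named website contains strings such as aa, bb, cc, aa, bbb, one per line. Write a command that lists the distinct strings in the file together with the number of times each one occurs, sorted by count from highest to lowest.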

Contents of website:

aa

bb

aa

cc

bb

aa

The generated file strsorted.txt should contain the following (that is, the expected result):

string count

aa 3

bb 2

cc 1

The shell script I wrote:

#!/bin/bash

foo()
{
    if [ $# -ne 1 ]; then
        echo "Usage: $0 filename"
        exit 1
    fi

    # The unquoted * in the grep call is the bug discussed after the script.
    grep * website | awk '{ count[$0]++ } END { printf("%s %s\n","website","count"); for(ind in count) { printf("%s %d\n",ind,count[ind]); } }' | sort -nrk 2 > strsorted.txt
}

foo website

After running it, strsorted.txt contains:

website count

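The strings were not counted; only the header line was written. The cause is that grep * website is written incorrectly. Because the * is not quoted, the shell expands it to the names of the files in the current directory before grep runs, so grep takes the first file name as its pattern and searches the remaining files for it.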
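The expansion can be made visible with echo, which prints the command line after the shell has processed it. The following is a minimal sketch, not part of the original script; the temporary directory and the file name count.sh are made up for the demonstration.

```shell
# Work in a scratch directory so the glob result is predictable.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf 'aa\nbb\naa\n' > website
touch count.sh    # hypothetical second file in the same directory

# echo shows the command after glob expansion instead of running it.
expanded=$(echo grep * website)
echo "$expanded"    # grep count.sh website website

# A quoted pattern reaches grep unchanged; '.*' matches every line.
matched=$(grep -c '.*' website)
echo "$matched"     # 3

# Clean up the scratch directory.
cd / && rm -rf "$demo_dir"
```

Here grep would search for the text count.sh in website, find nothing, and leave awk with no input, which is consistent with the header-only output above.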

PS: I referred to http://blog.csdn.net/guaguastd/article/details/8332757

That post uses shell to find the 10 most frequently occurring URLs. I tested it and it runs correctly.

The part of it I do not understand is:

egrep -o "http://[a-zA-Z0-9.]+\.[a-zA-Z]{2,3}" website

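(1) What does the -o option of egrep stand for? I searched online and could not find -o.

(2) In the regular expression http://[a-zA-Z0-9.]+\.[a-zA-Z]{2,3}, the bracket expression [a-zA-Z0-9.] ends with a dot '.'. Is that there to match the dots in a URL?

As I understand it, \. is an escaped dot, which I take to be the first dot in, for example, http://www.163.com. But what does {2,3} mean? Any explanation would be appreciated, thanks.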
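A small experiment may help with both questions. In grep, -o (long form --only-matching) prints only the part of each line that matched the pattern, one match per output line, and {2,3} is an interval that requires the preceding item, here [a-zA-Z], to repeat two or three times. Inside a bracket expression the dot is a literal dot, and the escaped \. outside the brackets ends up matching the last dot before the two- or three-letter suffix. The sketch below uses grep -E, which is equivalent to egrep; the sample lines are invented for illustration.

```shell
# Invented sample input: three URLs spread over two lines.
sample='visit http://www.163.com/index.html today
see http://blog.csdn.net and http://example.org/page'

# -o prints only the matched text, one match per line.
# [a-zA-Z0-9.]+ consumes letters, digits and dots; \.[a-zA-Z]{2,3}
# then requires a literal dot followed by two or three letters.
urls=$(printf '%s\n' "$sample" | grep -E -o 'http://[a-zA-Z0-9.]+\.[a-zA-Z]{2,3}')
printf '%s\n' "$urls"
# http://www.163.com
# http://blog.csdn.net
# http://example.org
```

Without -o, grep would print the two input lines in full instead of the three URLs.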

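The correct version is:

grep "^[a-zA-Z]*" website | awk '{ count[$0]++ } END { printf("%s %s\n","website","count"); for(ind in count) { printf("%s %d\n",ind,count[ind]); } }' | sort -nrk 2 > strsorted.txt

Here grep "^[a-zA-Z]*" website reads the lines of website and passes them through the pipe as standard input to awk. Because the pattern is quoted (with plain ASCII double quotes), the shell no longer expands the *. Note that * allows zero repetitions, so this pattern matches every line, not only lines that begin with a letter; to require a leading letter, write "^[a-zA-Z]". Also, sort -nrk 2 treats the non-numeric word count as 0, so the header line website count ends up at the bottom of strsorted.txt rather than at the top.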
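For comparison, here is one possible variant that produces the expected layout with the header string count on the first line. It is a sketch rather than the author's method: count_strings is a hypothetical function name, the file name is taken from the first argument instead of being hard-coded, and it assumes each line holds a single string without spaces.

```shell
# count_strings is a hypothetical helper; $1 is the input file.
count_strings() {
    if [ $# -ne 1 ]; then
        echo "Usage: count_strings filename" >&2
        return 1
    fi
    # Print the header first so the numeric sort cannot move it.
    echo "string count"
    # NF skips empty lines; count each distinct line, then sort by
    # count (descending) and by string to break ties.
    awk 'NF { count[$0]++ } END { for (s in count) printf("%s %d\n", s, count[s]) }' "$1" |
        sort -k2,2nr -k1,1
}

# Demonstration on a temporary copy of the sample data from the post.
tmp_file=$(mktemp)
printf 'aa\nbb\naa\ncc\nbb\naa\n' > "$tmp_file"
result=$(count_strings "$tmp_file")
printf '%s\n' "$result"
rm -f "$tmp_file"
```

To write the result to a file as in the original task, the call could be redirected: count_strings website > strsorted.txt.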

This cost me quite a lot of time, so I am recording it here for anyone who runs into the same problem.
