05-Tree10 Huffman Codes (30 points)

2017-02-08 21:46

In 1953, David A. Huffman published his paper "A Method for the Construction of Minimum-Redundancy Codes", and hence printed his name in the history of computer science. As a professor who gives the final exam problem on Huffman codes, I am encountering a big
problem: the Huffman codes are NOT unique. For example, given a string "aaaxuaxz", we can observe that the frequencies of the characters 'a', 'x', 'u' and 'z' are 4, 2, 1 and 1, respectively. We may either encode the symbols as {'a'=0, 'x'=10, 'u'=110, 'z'=111},
or in another way as {'a'=1, 'x'=01, 'u'=001, 'z'=000}, both compress the string into 14 bits. Another set of code can be given as {'a'=0, 'x'=11, 'u'=100, 'z'=101}, but {'a'=0, 'x'=01, 'u'=011, 'z'=001} is NOT correct since "aaaxuaxz" and "aazuaxax" can both
be decoded from the code 00001011001001. The students are submitting all kinds of codes, and I need a computer program to help me determine which ones are correct and which ones are not.

Input Specification:

Each input file contains one test case. For each case, the first line gives an integer N (2 ≤ N ≤ 63), followed by a line that contains all the N distinct characters and their frequencies in the following format:

c[1] f[1] c[2] f[2] ... c[N] f[N]

where c[i] is a character chosen from {'0' - '9', 'a' - 'z', 'A' - 'Z', '_'}, and f[i] is the frequency of c[i] and is an integer no more than 1000. The next line gives a positive integer M (≤ 1000), followed by M student submissions. Each student submission consists of N lines, each in the format:

c[i] code[i]

where c[i] is the i-th character and code[i] is a non-empty string of no more than 63 '0's and '1's.

Output Specification:

For each test case, print in each line either "Yes" if the student's submission is correct, or "No" if not.

Note: The optimal solution is not necessarily generated by Huffman algorithm. Any prefix code with code length being optimal is considered correct.

Sample Input:

7
A 1 B 1 C 1 D 3 E 3 F 6 G 6
4
A 00000
B 00001
C 0001
D 001
E 01
F 10
G 11
A 01010
B 01011
C 0100
D 011
E 10
F 11
G 00
A 000
B 001
C 010
D 011
E 100
F 101
G 110
A 00000
B 00001
C 0001
D 001
E 00
F 10
G 11

Sample Output:

Yes
Yes
No
No

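Approach:

To decide whether a submission is a valid Huffman code, two things must be checked:

1. Whether the weighted path length (WPL) is minimal;

2. Whether the code is a prefix code, so that decoding is unambiguous.

Both points are easy to come up with, but implementing them was hard for me, especially the second. Only after reading other people's code did I understand how to test for a prefix code. The method is to read each code bit by bit, starting from the root for every character: on a 0, move to the left child, creating a new node there if none exists; on a 1, do the same on the right. If the walk passes through a node where another code already ends, or if the code ends on a node that already has children, one code is a prefix of another and the set is not a prefix code.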
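The prefix test described above can be sketched compactly with an index-based trie. This is a minimal illustration, not the code used in this post: the helper name `isPrefixFree` and the `Node` layout are assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns true when no
// code in `codes` is equal to, or a prefix of, another code.
bool isPrefixFree(const std::vector<std::string>& codes) {
    struct Node {
        int child[2] = {-1, -1};  // indices into `trie`, -1 means "absent"
        bool isEnd = false;       // true if some code ends at this node
    };
    std::vector<Node> trie(1);    // trie[0] is the root

    for (const std::string& code : codes) {
        int cur = 0;
        bool createdNode = false;
        for (char bit : code) {
            int b = bit - '0';
            if (trie[cur].child[b] == -1) {
                trie[cur].child[b] = static_cast<int>(trie.size());
                trie.emplace_back();
                createdNode = true;
            }
            cur = trie[cur].child[b];
            // An earlier code ends here: it is a prefix of (or equal to) this one.
            if (trie[cur].isEnd) return false;
        }
        // No new node was needed: this code lies on the path of an earlier code.
        if (!createdNode) return false;
        trie[cur].isEnd = true;
    }
    return true;
}
```

With the sample input, the first submission passes this test, while the fourth fails because E = 00 lies on the path of A = 00000.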

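The steps are:

Read the input and build a min-heap -> pop nodes from the min-heap to build the Huffman tree -> compute the WPL of the Huffman tree -> for each submission, compute its total cost and check whether it is a prefix code -> if the cost equals the WPL and the code is a prefix code, print Yes; otherwise print No.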
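If only the optimal WPL is needed, the tree itself does not have to be built. The sketch below is an alternative to the hand-written heap in this post, assuming `std::priority_queue` is acceptable; the function name `optimalWPL` is hypothetical. It relies on the fact that every merge adds one bit to each symbol under the merged node, so the sum of all merged weights equals the WPL.

```cpp
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper: minimum weighted path length for the given frequencies.
int optimalWPL(const std::vector<int>& freq) {
    // Min-heap of weights: the smallest element is always on top.
    std::priority_queue<int, std::vector<int>, std::greater<int>>
        heap(freq.begin(), freq.end());
    int wpl = 0;
    while (heap.size() > 1) {
        int a = heap.top(); heap.pop();
        int b = heap.top(); heap.pop();
        wpl += a + b;       // every symbol under the merged node gets one bit longer
        heap.push(a + b);
    }
    return wpl;
}
```

For the sample frequencies 1 1 1 3 3 6 6 this gives 53, which matches the cost of the first submission (5 + 5 + 4 + 9 + 6 + 12 + 12). The third submission uses 3 bits for every character, costing 63, so it is rejected.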

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MINDATA -10000   // sentinel weight stored in Data[0] of the heap
#define N 64             // at most 63 characters, plus the sentinel slot
typedef struct TreeNode *HuffmanTree;
typedef struct HeapNode *HeapPtr;

struct TreeNode{
    int Weight=0;
    HuffmanTree Left=NULL,Right=NULL;
};

struct HeapNode{
    TreeNode Data[N];    // 1-based min-heap ordered by Weight
    int Size;
};

HeapPtr BuildHeap(int n)
{
    // new (unlike malloc) runs the default member initializers of TreeNode.
    HeapPtr H=new HeapNode();
    H->Size=0;
    H->Data[0].Weight=MINDATA;   // sentinel: smaller than any real weight
    return H;
}

void Insert(HeapPtr H,HuffmanTree T)
{
    // Percolate up: slide heavier parents down until the slot for *T is found.
    // Data[0] holds the MINDATA sentinel, so the loop always stops at the root.
    int i=++H->Size;
    for(;H->Data[i/2].Weight>T->Weight;i/=2)
    {
        H->Data[i]=H->Data[i/2];
    }
    H->Data[i]=*T;               // the heap stores a copy of the node
}

void InputData(HeapPtr H,int n,int f[])
{
    char s;
    for(int i=0;i<n;i++)
    {
        scanf(" %c",&s);         // the leading space skips whitespace before the character
        scanf("%d",&f[i]);
        TreeNode T;              // Insert copies the node, so a local is enough
        T.Weight=f[i];
        Insert(H,&T);
    }
}

HuffmanTree DeleteMin(HeapPtr H)
{
    // Copy the minimum out of the heap so it survives later heap moves.
    HuffmanTree min=new TreeNode(H->Data[1]);
    TreeNode last=H->Data[H->Size];
    H->Size-=1;

    // Percolate down: find the slot for the former last element.
    int parent=1,child=0;
    for(;parent*2<=H->Size;parent=child)
    {
        child=parent*2;
        // Check the bound before looking at the right child.
        if((child<H->Size)&&(H->Data[child+1].Weight<H->Data[child].Weight))
        {
            child++;
        }
        if(last.Weight<=H->Data[child].Weight)
        {
            break;
        }
        H->Data[parent]=H->Data[child];
    }
    H->Data[parent]=last;
    return min;
}

HuffmanTree Huffman(HeapPtr H)
{
    // Repeatedly merge the two lightest trees until only one remains.
    while(H->Size>1)
    {
        TreeNode T;
        T.Left=DeleteMin(H);
        T.Right=DeleteMin(H);
        T.Weight=T.Left->Weight+T.Right->Weight;
        Insert(H,&T);            // Insert stores a copy of T in the heap
    }
    return DeleteMin(H);
}

int WPL(HuffmanTree T,int depth)
{
if((T->Left==NULL)&&(T->Right==NULL))
{
return (depth*(T->Weight));
}else{
return (WPL(T->Left,depth+1)+WPL(T->Right,depth+1));
}
}

// Inserts code into the trie rooted at testnode.
// Returns true if the code does not conflict with any code inserted earlier.
bool isPrefix(HuffmanTree testnode,char code[])
{
    size_t len=strlen(code);
    for(size_t i=0;i<len;i++)
    {
        // '0' goes to the left child, '1' to the right child.
        HuffmanTree &next=(code[i]=='0')?testnode->Left:testnode->Right;
        if(next==NULL)
        {
            next=new TreeNode();     // no child on this side yet
        }else if(next->Weight==1)
        {
            return false;            // an earlier code is a prefix of this code
        }
        testnode=next;
    }
    testnode->Weight=1;              // Weight 1 marks the node where a code ends
    // If this node already has children, this code is a prefix of an earlier code.
    return (testnode->Left==NULL&&testnode->Right==NULL);
}

int main()
{
    int n=0,num=0;              // n: number of characters; num: number of submissions
    scanf("%d",&n);
    int f[N];                   // frequencies, in input order

    HeapPtr H=BuildHeap(n);
    InputData(H,n,f);           // read the frequencies and build the min-heap
    HuffmanTree T=Huffman(H);

    int Twpl=WPL(T,0);          // optimal weighted path length
    scanf("%d",&num);           // number of student submissions

    char c='\0';
    char code[64];              // up to 63 bits plus the terminating '\0'

    for(int i=0;i<num;i++)
    {
        int testwpl=0;          // total cost of this submission
        bool result=true;       // stays true while no prefix conflict has been found
        TreeNode *testnode=new TreeNode();   // root of this submission's code trie

        for(int j=0;j<n;j++)
        {
            // Assumes each submission lists the characters in the input order.
            scanf(" %c",&c);
            scanf("%63s",code);

            testwpl+=strlen(code)*f[j];
            if(result)          // after a conflict, keep reading but skip the check
            {
                result=isPrefix(testnode,code);
            }
        }

        if(result&&(testwpl==Twpl))
        {
            printf("Yes\n");
        }else{
            printf("No\n");
        }
    }
    return 0;
}

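This problem was a real test of my ability. It was the first time I had written a program with so many functions, and debugging was much harder than in earlier problems. I debugged one function at a time, using printf to verify each step before moving on.
The problems I ran into: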
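1. Reading characters with scanf:
scanf reads from the input buffer: what you type is first placed in the buffer, and scanf then consumes it character by character.

When reading a single character with %c, whitespace (space, tab, newline, carriage return) is read as well. When reading a string with %s, scanf starts at the first non-whitespace character and stops at the next whitespace character, which is not stored.

So if one scanf("%c") is followed by a second scanf("%c"), the second call takes the newline left over from the previous input as its character.
There are several ways to avoid this:
Write the second call as scanf(" %c",&c), with a space before %c. A whitespace character in the format string makes scanf skip any amount of whitespace in the input until the first non-whitespace character, which sidesteps the problem.
Call fflush(stdin) between the two scanf calls. Some implementations discard pending input in this case, but the C standard defines fflush only for output streams, so this is not portable.

Call getchar() between the two scanf calls. Pressing Enter leaves a newline in the buffer that the next read would otherwise pick up; getchar() reads that one character and discards it.

Note that cin differs from scanf("%c") here: cin >> c on a char skips leading whitespace by default, so it behaves like scanf(" %c"). When reading a string, cin >> s starts at the first non-whitespace character and stops at the next whitespace character, which is not stored.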
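The difference between "%c" and " %c" can be seen without typing anything by parsing a fixed string with sscanf, which follows the same format rules as scanf. The helper name `secondChar` is hypothetical and exists only for this demonstration.

```cpp
#include <cstdio>

// Hypothetical demo helper: parses two characters from `input` using `fmt`
// and returns the second one, so the effect of the format string is visible.
char secondChar(const char* input, const char* fmt) {
    char first = '\0', second = '\0';
    std::sscanf(input, fmt, &first, &second);
    return second;
}
```

For the input "A\nB", the format "%c%c" yields the newline as the second character, while "%c %c" skips it and yields 'B'.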

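2. Header files:
With so many functions, I moved them into a header file to make them easier to compare. But compilation kept failing with all sorts of small errors, and I eventually found that I had ignored ordering.
When the compiler meets #include, it copies the entire text of that file into that position. So any function or variable used in the header must already be declared before the point where it is used.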
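A small sketch of the ordering rule, with hypothetical function names: a prototype placed first (as a header normally would) lets later code call a function whose definition only appears further down.

```cpp
// Prototype first: this is what a header would normally contain.
int square(int x);

// sumOfSquares can call square because its declaration has already been seen.
int sumOfSquares(int a, int b) {
    return square(a) + square(b);
}

// The definition may come later in the file, or in another source file.
int square(int x) {
    return x * x;
}
```

Without the prototype on the first line, the call inside sumOfSquares would not compile, which is the kind of error that appears when included code is in the wrong order.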
