
Huffman Encoding/Decoding Algorithm Experiment and Compression Efficiency Analysis

2017-04-29 11:36

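I. Basic Principles

1. Huffman coding principle

Huffman coding is a lossless coding method and a form of variable-length coding (VLC).

Huffman coding is based on a statistical probability model of the source. The basic idea is to assign short codes to source symbols that occur with high probability and long codes to symbols that occur with low probability, which minimizes the average code length.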
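
The quantity being minimized can be written down explicitly. As a sketch, assume symbol i has probability p_i and receives a code of l_i bits (this notation is introduced here for illustration and does not appear in the original code). The average code length, the source entropy, and the standard bound that a symbol-by-symbol Huffman code satisfies are:

```latex
\bar{L} = \sum_{i} p_i \, l_i,
\qquad
H = -\sum_{i} p_i \log_2 p_i,
\qquad
H \le \bar{L} < H + 1
```

For example, probabilities 1/2, 1/4, 1/4 with code lengths 1, 2, 2 give an average of 1.5 bits per symbol, which equals the entropy of that source.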

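2. Data structures

This experiment uses two data structures, the Huffman node and the Huffman code (both described in section III, 3.1).

The program is built around a binary tree, and the code it produces is an instantaneous (prefix) code. A tree is an important non-linear data structure: intuitively, it organizes data elements (called nodes) by branching relationships. In computer science, a binary tree is an ordered tree in which each node has at most two subtrees; it is commonly used for searching and sorting.

The two subtrees of a node are distinguished as left and right, and their order cannot be swapped. Level i of a binary tree holds at most 2^(i-1) nodes, and a binary tree of depth k (its number of levels) holds at most 2^(k)-1 nodes.

A tree can be written down in many ways. The most common is nested parentheses, for example (a(b(c,d),e(d,e(……)))): the root is placed inside a pair of parentheses, followed by its subtrees from left to right, with subtrees at the same level separated by commas.

Traversal is one of the most basic operations on a binary tree: visiting every node turns the tree into a linear sequence.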
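
As an illustration of how a traversal linearizes a tree, the sketch below walks a small tree in pre-order and produces the nested-parenthesis form described above (without the outermost pair of parentheses). The tnode type and the to_paren function are hypothetical names made up for this example; they are not part of the Huffman library.

```c
#include <string.h>

/* Hypothetical node type for illustration only. */
typedef struct tnode {
    char label;
    struct tnode *left, *right;
} tnode;

/* Pre-order traversal: emit the node, then its left and right subtrees.
 * Appends text such as a(b(c,d),e) to buf, which must already be a
 * NUL-terminated string with enough room for the result. */
static void to_paren(const tnode *t, char *buf)
{
    size_t len;

    if (t == NULL)
        return;
    len = strlen(buf);
    buf[len] = t->label;
    buf[len + 1] = '\0';
    if (t->left || t->right) {
        strcat(buf, "(");
        to_paren(t->left, buf);
        if (t->left && t->right)
            strcat(buf, ",");
        to_paren(t->right, buf);
        strcat(buf, ")");
    }
}
```

Calling to_paren on a root a with children b and e, where b has children c and d, fills the buffer with a(b(c,d),e).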

3. Encoding and decoding in-memory data

To be added. (Not covered in this experiment.)

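II. Workflow Analysis

1. Project structure

The experiment consists of two projects: huff_code (a static library) and huff_run (a Win32 console project).

Configuring the static library (method 1): huff_code is a static library and cannot be run directly. In Visual Studio choose Build -> Build Solution, find huff_code.lib in the Debug folder, and copy it together with huffman.h into the huff_run project folder. Before the main function of huff_run add #include "huffman.h" and #pragma comment(lib,"huff_code.lib"). A .lib file is not a header and must not be #included. The pragma records a note in the generated .obj file so that the linker searches huff_code.lib when linking.

Method 2: find huff_code.lib in the Debug folder and copy it together with huffman.h into the huff_run project folder. Add #include "huffman.h" before the main function of huff_run, then open the huff_run project properties: under VC++ Directories add the folder containing huffman.h to Include Directories and the folder containing huff_code.lib to Library Directories, and under Linker -> Input -> Additional Dependencies enter "huff_code.lib".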

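2. Program flow

2.1 Encoding flow

(1) Read the file as a stream of 8-bit bytes and count how often each symbol occurs.

(2) Sort the probabilities of all symbols that occur in the file in ascending order.

(3) Each time, pick the two smallest values as the two children of a new binary-tree node whose value is their sum. The two children no longer take part in the comparison; the new parent node does.

(4) Repeat step (3) until the root node, whose probability sum is 1, is obtained.

(5) Label the left branch of every node 1 and the right branch 0. Concatenating the 0s and 1s met on the path from the root down to a leaf gives the code for that leaf's symbol.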
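
Steps (2) to (4) can be sketched in a few lines. The example below is a simplified illustration, not the library's implementation: it uses plain arrays and a linear search for the two smallest weights instead of repeated sorting, works on counts rather than probabilities, and only computes the code length of each symbol. The names huffman_lengths and MAX_SYM are assumptions made for this sketch.

```c
#include <stddef.h>

#define MAX_SYM 16  /* illustrative limit on the alphabet size */

/* Compute the Huffman code length of n symbols (1 <= n <= MAX_SYM)
 * from their occurrence counts. length[i] receives the depth of
 * symbol i in the tree, i.e. its code length in bits. */
static void huffman_lengths(const unsigned long *count, size_t n,
                            unsigned *length)
{
    unsigned long weight[2 * MAX_SYM];
    int parent[2 * MAX_SYM];
    int alive[2 * MAX_SYM];
    size_t total = n, i, k;

    for (i = 0; i < n; ++i) {
        weight[i] = count[i];
        parent[i] = -1;
        alive[i] = 1;
    }

    /* n symbols need n - 1 merges. */
    for (k = 0; k + 1 < n; ++k) {
        int m1 = -1, m2 = -1;   /* smallest and second smallest */

        for (i = 0; i < total; ++i) {
            if (!alive[i])
                continue;
            if (m1 < 0 || weight[i] < weight[m1]) {
                m2 = m1;
                m1 = (int)i;
            } else if (m2 < 0 || weight[i] < weight[m2]) {
                m2 = (int)i;
            }
        }

        /* The new parent replaces its two children in the comparison. */
        weight[total] = weight[m1] + weight[m2];
        parent[total] = -1;
        alive[total] = 1;
        parent[m1] = parent[m2] = (int)total;
        alive[m1] = alive[m2] = 0;
        ++total;
    }

    /* The code length of a symbol is its distance from the root. */
    for (i = 0; i < n; ++i) {
        unsigned depth = 0;
        int p = parent[i];

        while (p >= 0) {
            ++depth;
            p = parent[p];
        }
        length[i] = depth;
    }
}
```

With the counts 5, 9, 12, 13, 16 and 45, the resulting code lengths are 4, 4, 3, 3, 3 and 1 bits: the most frequent symbol gets the shortest code.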

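2.2 Decoding flow

(1) Read the code table and rebuild the Huffman tree.

(2) Read the Huffman codes and write the decoded output.

III. Key Code Analysis

1. Header file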

int huffman_encode_file(FILE *in, FILE *out,FILE *out_Table);//step1: changed by yzhang for huffman statistics
int huffman_decode_file(FILE *in, FILE *out);
int huffman_encode_memory(const unsigned char *bufin,
unsigned int bufinlen,
unsigned char **pbufout,
unsigned int *pbufoutlen);
int huffman_decode_memory(const unsigned char *bufin,
unsigned int bufinlen,
unsigned char **bufout,
unsigned int *pbufoutlen);

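These four functions encode and decode input files and in-memory data. The extra parameter FILE *out_Table was added to the huffman_encode_file interface to output information about the code.

2. Main function (huffcode.c)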

int main(int argc, char** argv)
{
    char memory = 0;//0: do not operate on in-memory data, 1: operate on in-memory data
    char compress = 1;//1: compress (encode), 0: decompress (decode)
    int opt;
    const char *file_in = NULL, *file_out = NULL;
    //step1:add by yzhang for huffman statistics
    const char *file_out_table = NULL;
    //end by yzhang
    FILE *in = stdin;
    FILE *out = stdout;
    //step1:add by yzhang for huffman statistics
    FILE * outTable = NULL;
    //end by yzhang

    /* Get the command line arguments. */
    //getopt() reads the command-line options. argc and argv give the number of
    //arguments and their values; the option string tells getopt() which options
    //it accepts and which of them take an argument. A letter followed by ':'
    //takes an argument, and the global variable optarg then points to it.
    while((opt = getopt(argc, argv, "i:o:cdhvmt:")) != -1)
    {
        printf("opt:%c ",opt);
        switch(opt)
        {
        case 'i'://input file, takes an argument
            file_in = optarg;
            break;
        case 'o'://output file, takes an argument
            file_out = optarg;
            break;
        case 'c'://encode
            compress = 1;
            break;
        case 'd'://decode
            compress = 0;
            break;
        case 'h'://print help
            usage(stdout);
            return 0;
        case 'v'://print version information
            version(stdout);
            return 0;
        case 'm'://encode or decode in-memory data
            memory = 1;
            break;
        // by yzhang for huffman statistics
        case 't'://output the coding information table, takes an argument
            file_out_table = optarg;
            break;
        //end by yzhang
        default:
            usage(stderr);
            return 1;
        }
    }

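Enter options such as -i, -o, -c and -d under huff_run Property Pages -> Debugging -> Command Arguments. The main function calls getopt to parse the command-line arguments and acts on them. For example, to encode the image girlplay.jpg and output the coding information, pass -i girlplay.jpg -o girlplayhuffcode.txt -c -t outTable.txt. Running the program then produces a Huffman-encoded file and a coding-information file.

3. Encoding process

3.1 Two data structures

Huffman node structure: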

typedef struct huffman_node_tag//Huffman node structure
{
    unsigned char isLeaf;//1 if this node is a leaf, 0 otherwise
    unsigned long count;//occurrence count (for an internal node built during encoding, the sum of its children's counts)
    struct huffman_node_tag *parent;//pointer to the parent node

    //A union lets several members share one memory location, so only one of
    //them holds a valid value at any time. A node is either an internal node
    //(two child pointers) or a leaf (a one-byte symbol), never both.
    union
    {
        struct//internal node: pointers to the zero and one children
        {
            struct huffman_node_tag *zero, *one;
        };
        unsigned char symbol;//leaf: the one-byte symbol is stored here
    };
} huffman_node;

Huffman code structure:

typedef struct huffman_code_tag//Huffman code structure
{
    /* The length of this code in bits. */
    unsigned long numbits;//code length, in bits

    /* The bits that make up this code. The first
    bit is at position 0 in bits[0]. The second
    bit is at position 1 in bits[0]. The eighth
    bit is at position 7 in bits[0]. The ninth
    bit is at position 0 in bits[1]. */
    unsigned char *bits;
    //An unsigned char holds 8 bits: the first 8 bits of the code go into bits[0]
    //from the low bit to the high bit, and later bits go into bits[1] and so on.
} huffman_code;

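The code structure has two members: the code length in bits, and a pointer to the unsigned char bytes that hold the code bits. The layout is somewhat involved because a code is generated by walking from a leaf up to the root, which yields the bits in reverse order, so they must be reversed afterwards. If a code takes two bytes, its first eight bits are stored in bits[0] from the low bit to the high bit, and the remaining bits are stored in bits[1], again from low to high.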
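
The bit layout can be made concrete with a small sketch. The helper pack_code below is hypothetical (it is not in the library); it stores a code given as a string of '0' and '1' characters using the layout described above, and get_bit_demo reads a single bit back in the same way as the library's get_bit.

```c
#include <string.h>

/* Return bit i (counting from 0) of a code stored low bit first. */
static unsigned char get_bit_demo(const unsigned char *bits, unsigned long i)
{
    return (bits[i / 8] >> (i % 8)) & 1;
}

/* Hypothetical helper: pack a string of '0'/'1' characters into bits[].
 * The first character lands at position 0 of bits[0], the ninth at
 * position 0 of bits[1]. bits must hold (strlen(text) + 7) / 8 bytes.
 * Returns the code length in bits. */
static unsigned long pack_code(const char *text, unsigned char *bits)
{
    unsigned long n = (unsigned long)strlen(text), i;

    memset(bits, 0, (n + 7) / 8);
    for (i = 0; i < n; ++i)
        if (text[i] == '1')
            bits[i / 8] |= (unsigned char)(1u << (i % 8));
    return n;
}
```

Packing the 9-bit code 110000001 gives bits[0] = 0x03 and bits[1] = 0x01: the first two bits occupy the two lowest positions of the first byte, and the ninth bit starts the second byte.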

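3.2 Encoding flow

(1) First pass: count how often each source symbol occurs (8-bit symbols, 256 possible values)

(1.1) Create an array of 256 node pointers to hold the statistics of the 256 symbols. The array index is the byte value of the symbol; for example, the node for symbol 255 is (*pSF)[255].

(1.2) The non-NULL elements of the array are the source symbols that occur in the file being encoded.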

static unsigned int//count how often each source symbol occurs in the input file
get_symbol_frequencies(SymbolFrequencies *pSF, FILE *in)//pSF points to an array of 256 node pointers
{
    int c;
    unsigned int total_count = 0;//total number of symbols (bytes) read so far

    /* Set all frequencies to 0. */
    init_frequencies(pSF);//clear all 256 entries of (*pSF) to 0 (NULL)

    /* Count the frequency of each symbol in the input file. */
    while((c = fgetc(in)) != EOF)//read every byte until the end of the file
    {
        unsigned char uc = c;
        //First occurrence of this symbol ((*pSF)[uc] is NULL): create its
        //leaf node, indexed by the symbol's byte value.
        if(!(*pSF)[uc])
            (*pSF)[uc] = new_leaf_node(uc);
        ++(*pSF)[uc]->count;//one more occurrence of this symbol
        ++total_count;//one more symbol in total
    }

    return total_count;//total number of symbols in the file
}

(2) Build the Huffman tree and compute the Huffman code of each symbol

(2.1) Sort by frequency in ascending order and build the Huffman tree

static SymbolEncoder*
calculate_huffman_codes(SymbolFrequencies * pSF)//returns the table of Huffman codes, indexed by symbol value
{

    unsigned int i = 0;
    unsigned int n = 0;
    huffman_node *m1 = NULL, *m2 = NULL;
    SymbolEncoder *pSE = NULL;

#if 0
    printf("BEFORE SORT\n");
    print_freqs(pSF);   //debug: print the symbol and count of every leaf node
#endif
    //sort by ascending count, so less frequent symbols get smaller indices
    //(the counting loop below relies on unused NULL entries ending up last)
    /* Sort the symbol frequency array by ascending frequency. */
    qsort((*pSF), MAX_SYMBOLS, sizeof((*pSF)[0]), SFComp);

#if 0
    printf("AFTER SORT\n");
    print_freqs(pSF);//debug: print the symbol and count of every leaf node after sorting
#endif

    //count the distinct symbols that occur in the input
    /* Get the number of symbols. */
    for(n = 0; n < MAX_SYMBOLS && (*pSF)[n]; ++n);

    /*
    * Construct a Huffman tree. This code is based
    * on the algorithm given in Managing Gigabytes
    * by Ian Witten et al, 2nd edition, page 34.
    * Note that this implementation uses a simple
    * count instead of probability.
    */
    //build the Huffman tree: n distinct symbols need n-1 merges
    //(written as i + 1 < n so that n == 0, an empty input, does not wrap around)
    for(i = 0; i + 1 < n; ++i)
    {
        /* Set m1 and m2 to the two subsets of least probability. */
        //m1 and m2 are the two nodes with the smallest counts
        m1 = (*pSF)[0];
        m2 = (*pSF)[1];

        /* Replace m1 and m2 with a set {m1, m2} whose probability
        * is the sum of that of m1 and m2. */
        //merge m1 and m2 into a new node stored in slot 0: its children are
        //m1 and m2 and its count is the sum of their counts
        (*pSF)[0] = m1->parent = m2->parent =
            new_nonleaf_node(m1->count + m2->count, m1, m2);
        (*pSF)[1] = NULL;
        //sort again
        /* Put newSet into the correct count position in pSF. */
        qsort((*pSF), n, sizeof((*pSF)[0]), SFComp);
    }

    //compute every code from the finished Huffman tree
    /* Build the SymbolEncoder array from the tree. */
    pSE = (SymbolEncoder*)malloc(sizeof(SymbolEncoder));//allocate the table of 256 code pointers
    memset(pSE, 0, sizeof(SymbolEncoder));//initialize it to 0
    build_symbol_encoder((*pSF)[0], pSE);//after the last sort (*pSF)[0] is the root, which is passed in
    return pSE;
}

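qsort sorts the symbols by ascending frequency, and build_symbol_encoder produces the codes. Producing the codes is the more involved part.

(2.2) Traverse the Huffman tree recursively and compute the code of each symbol that occurs

First, the build_symbol_encoder function: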

static void
build_symbol_encoder(huffman_node *subtree, SymbolEncoder *pSF)
{
    if(subtree == NULL)//empty subtree: nothing to visit
        return;

    if(subtree->isLeaf)
        (*pSF)[subtree->symbol] = new_code(subtree);//every leaf yields one code
    else//internal node: recurse into the zero and one children
    {
        build_symbol_encoder(subtree->zero, pSF);
        build_symbol_encoder(subtree->one, pSF);
    }
}

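Starting from the root, the function visits every node. If the node is not a leaf, it calls itself recursively on the node's zero and one children; when it reaches a leaf it creates a new code. New codes are created by new_code, whose source is: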

static huffman_code*
new_code(const huffman_node* leaf)
{
    /* Build the huffman code by walking up to
    * the root node and then reversing the bits,
    * since the Huffman code is calculated by
    * walking down the tree. */
    unsigned long numbits = 0;//code length in bits
    unsigned char* bits = NULL;//pointer to the first byte of the code
    huffman_code *p;//pointer to the code structure to return

    while(leaf && leaf->parent)//loop while the node exists and has a parent, i.e. until the root is reached
    {
        huffman_node *parent = leaf->parent;//parent of the current node
        unsigned char cur_bit = (unsigned char)(numbits % 8);//position of the next bit within the current byte
        unsigned long cur_byte = numbits / 8;//index of the current byte

        /* If we need another byte to hold the code,
        then allocate it. */
        //realloc resizes a block while keeping its contents; a new byte is
        //needed every time the bit position wraps around to 0
        if(cur_bit == 0)
        {
            //size_t is the unsigned integer type returned by sizeof; it is large
            //enough to hold the size of any object, which keeps the code portable
            size_t newSize = cur_byte + 1;
            bits = (unsigned char*)realloc(bits, newSize);//char changed to unsigned char. by zsy 4/25
            bits[newSize - 1] = 0; /* Initialize the new byte. */
        }

        /* If a one must be added then or it in. If a zero
        * must be added then do nothing, since the byte
        * was initialized to zero. */
        //|= has lower precedence than <<, so this line means
        //bits[cur_byte] = bits[cur_byte] | (1 << cur_bit): it sets bit cur_bit
        //of the current byte. Bits are stored from the low bit of each byte upward.
        if(leaf == parent->one)
            bits[cur_byte] |= 1 << cur_bit;
        ++numbits;//the code is one bit longer
        leaf = parent;//move up to the parent
    }

    if(bits)
        reverse_bits(bits, numbits);//reverse the bit order of the code

    p = (huffman_code*)malloc(sizeof(huffman_code));
    p->numbits = numbits;
    p->bits = bits;
    return p;//return the finished code
}

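Starting from a leaf, the function walks the path from that leaf up to the root. At each step it checks whether the current node is the one child of its parent. If so, it sets the bit at position cur_bit of the current byte (cur_bit is the bit's position within that byte, numbits % 8); a zero child needs no action because the byte was initialized to 0. This yields the code in reverse order, and reverse_bits then flips it into the correct order.

reverse_bits manipulates individual bits within a byte, and its algorithm is worth studying. The source is: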

static void
reverse_bits(unsigned char* bits, unsigned long numbits)//arguments: the code and its length in bits
{
    unsigned long numbytes = numbytes_from_numbits(numbits);//number of bytes the code occupies
    unsigned char *tmp =
        (unsigned char*)alloca(numbytes);//stack buffer for the reversed code (no free needed)
    unsigned long curbit;//index of the bit being written
    long curbyte = 0;

    memset(tmp, 0, numbytes);//clear the buffer for the reversed code

    for(curbit = 0; curbit < numbits; ++curbit)
    {
        unsigned int bitpos = curbit % 8;//position of the bit within its byte

        if(curbit > 0 && curbit % 8 == 0)//every 8 bits, move on to the next output byte
            ++curbyte;

        tmp[curbyte] |= (get_bit(bits, numbits - curbit - 1) << bitpos);//take the bits from last to first and store them in tmp[], which reverses the code
    }

    memcpy(bits, tmp, numbytes);//copy the reversed code back into bits
}

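The function reverses the code with the following expression:

get_bit(bits, numbits - curbit - 1) << bitpos

The effect: suppose walking from a leaf back to the root produced the 12-bit sequence 0000 1111 0101 (listed from bit position 0 upward). The correct code, read from the root down, is the same sequence reversed: 1010 1111 0000.

Written as bytes with the high bit on the left, before reverse_bits bits[0]=11110000 and bits[1]=00001010; after reverse_bits bits[0]=11110101 and bits[1]=00000000, because output bit i is input bit numbits - 1 - i.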
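
The 12-bit example can be checked with a stand-alone version of the reversal. This is a sketch that follows the same idea as reverse_bits but is not the library code: it uses a fixed local buffer instead of alloca, and the names bit_at and reverse_bits_demo, as well as the 256-bit limit, are assumptions made for the example.

```c
#include <string.h>

/* Return bit i (counting from 0) of a code stored low bit first. */
static unsigned char bit_at(const unsigned char *bits, unsigned long i)
{
    return (bits[i / 8] >> (i % 8)) & 1;
}

/* Reverse the first numbits bits of bits[] in place.
 * Illustrative limit: numbits must not exceed 256. */
static void reverse_bits_demo(unsigned char *bits, unsigned long numbits)
{
    unsigned char tmp[32];
    unsigned long numbytes = (numbits + 7) / 8, i;

    memset(tmp, 0, sizeof tmp);
    for (i = 0; i < numbits; ++i)   /* output bit i = input bit numbits-1-i */
        tmp[i / 8] |= (unsigned char)(bit_at(bits, numbits - i - 1) << (i % 8));
    memcpy(bits, tmp, numbytes);
}
```

Starting from bits[0] = 0xF0 and bits[1] = 0x0A with numbits = 12, one call yields bits[0] = 0xF5 and bits[1] = 0x00, and a second call restores the original bytes.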

get_bit returns the value of a single bit of the code. Its design is also neat; the source is:

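static unsigned char
get_bit(unsigned char* bits, unsigned long i)
{
    return (bits[i / 8] >> i % 8) & 1;//% binds tighter than >>, so the shift amount is (i % 8)
}
//Returns bit i of the code, counting from 0. For example, for the third bit from the low end, i=2: bits[i/8]=bits[0] is shifted right by i%8=2 so that the wanted bit becomes the lowest bit, and AND with 1 extracts it.
//i/8 selects the byte: for a two-byte code, the 9th bit from the low end (the first bit of the second byte) has i=8, so bits[i/8]=bits[1], the second byte.
//This way of operating on a single bit of a byte is worth studying.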

(3) Write the Huffman code table to the file

static int
write_code_table(FILE* out, SymbolEncoder *se, unsigned int symbol_count)//write the code table
{
    unsigned long i, count = 0;//count is the number of distinct codes

    /* Determine the number of entries in se. */
    for(i = 0; i < MAX_SYMBOLS; ++i)//count the distinct symbols in the input file
    {
        if((*se)[i])
            ++count;
    }
    /* Write the number of entries in network byte order. */
    i = htonl(count);
    //Network transmission uses big-endian order: 0x0A0B0C0D is sent as 0A 0B 0C 0D.
    //Big-endian is therefore the network byte order; the host byte order here is little-endian.
    //An advantage of little-endian is that converting between unsigned char/short/int/long
    //does not move the low-order byte.
    //htonl() converts a 32-bit unsigned integer from host to network byte order; the result is stored in i.
    //Note: sizeof(i) is assumed to be 4 (unsigned long on Windows); read_code_table reads a 4-byte unsigned int.
    if(fwrite(&i, sizeof(i), 1, out) != 1)//write the number of distinct symbols
        return 1;

    /* Write the number of bytes that will be encoded. */
    symbol_count = htonl(symbol_count);
    if(fwrite(&symbol_count, sizeof(symbol_count), 1, out) != 1)//write the total number of symbols
        return 1;

    /* Write the entries. */
    //each entry is: symbol, code length (in bits), code bytes
    for(i = 0; i < MAX_SYMBOLS; ++i)
    {
        huffman_code *p = (*se)[i];
        if(p)
        {
            unsigned int numbytes;
            /* Write the 1 byte symbol. */
            fputc((unsigned char)i, out);
            /* Write the 1 byte code bit length. */
            //only one byte is written, so the code length must be below 256 bits
            fputc(p->numbits, out);
            /* Write the code bytes. */
            numbytes = numbytes_from_numbits(p->numbits);
            if(fwrite(p->bits, 1, numbytes, out) != numbytes)
                return 1;
        }
    }

    return 0;
}

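Note on byte order: htonl() converts a value from host byte order to network byte order, which is big-endian (most significant byte first). The host here is little-endian (least significant byte first), so the two counts are converted with htonl() before being written, and the decoder converts them back with ntohl() after reading.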
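
What "big-endian on disk" means can be shown without any socket header. The sketch below is an alternative illustration, not what the library does: put_be32 and get_be32 are hypothetical helpers that store and load a 32-bit value byte by byte, which gives the same file layout as htonl followed by fwrite regardless of the host's byte order.

```c
#include <stdint.h>

/* Store a 32-bit value in big-endian (network) order:
 * most significant byte first. */
static void put_be32(unsigned char *buf, uint32_t v)
{
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Load a 32-bit value that was stored in big-endian order. */
static uint32_t get_be32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
```

Storing 0x0A0B0C0D with put_be32 produces the bytes 0A 0B 0C 0D in that order, and get_be32 recovers the original value on any host.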

(4) Second pass over the file: encode every source symbol and write it to the output file

static int//second pass over the file: look up the code of every symbol and write it out
do_file_encode(FILE* in, FILE* out, SymbolEncoder *se)
{
    unsigned char curbyte = 0;//output byte currently being assembled
    unsigned char curbit = 0;//number of bits already placed in curbyte
    int c;

    while((c = fgetc(in)) != EOF)//visit every symbol in the file
    {
        unsigned char uc = (unsigned char)c;
        huffman_code *code = (*se)[uc];//table lookup
        unsigned long i;

        for(i = 0; i < code->numbits; ++i)//append the code to the output, bit by bit
        {
            /* Add the current bit to curbyte. */
            curbyte |= get_bit(code->bits, i) << curbit;

            /* If this byte is filled up then write it
            * out and reset the curbit and curbyte. */
            if(++curbit == 8)//a full byte has been collected: write it
            {
                fputc(curbyte, out);
                curbyte = 0;
                curbit = 0;
            }
        }
    }

    /*
    * If there is data in curbyte that has not been
    * output yet, which means that the last encoded
    * character did not fall on a byte boundary,
    * then output it.
    */
    if(curbit > 0)//write the final, partly filled byte
        fputc(curbyte, out);

    return 0;
}

4. Decoding process

4.1 Read the code table and rebuild the Huffman tree

static huffman_node*
read_code_table(FILE* in, unsigned int *pDataBytes)//read the Huffman code table and rebuild the Huffman tree
{
    huffman_node *root = new_nonleaf_node(0, NULL, NULL);//root is a new internal node; the arguments are its count and its zero and one children
    unsigned int count;

    /* Read the number of entries.
    (it is stored in network byte order). */
    if(fread(&count, sizeof(count), 1, in) != 1)//first read the number of distinct symbols
    {
        free_huffman_tree(root);
        return NULL;
    }

    count = ntohl(count);//convert from network byte order back to host byte order

    /* Read the number of data bytes this encoding represents. */
    if(fread(pDataBytes, sizeof(*pDataBytes), 1, in) != 1)//read the total number of symbols
    {
        free_huffman_tree(root);
        return NULL;
    }

    *pDataBytes = ntohl(*pDataBytes);

    /* Read the entries. */
    while(count-- > 0)//one iteration per table entry; each builds one path from the root to a leaf
    {
        int c;
        unsigned int curbit;
        unsigned char symbol;//the symbol
        unsigned char numbits;//code length in bits
        unsigned char numbytes;//code length in bytes
        unsigned char *bytes;
        huffman_node *p = root;//current node, starting at the root

        if((c = fgetc(in)) == EOF)//read the symbol
        {
            free_huffman_tree(root);
            return NULL;
        }
        symbol = (unsigned char)c;

        if((c = fgetc(in)) == EOF)//read the code length
        {
            free_huffman_tree(root);
            return NULL;
        }
        numbits = (unsigned char)c;
        numbytes = (unsigned char)numbytes_from_numbits(numbits);

        bytes = (unsigned char*)malloc(numbytes);//allocate space for the code
        if(fread(bytes, 1, numbytes, in) != numbytes)//read the code; on a short read, clean up and return NULL
        {
            free(bytes);
            free_huffman_tree(root);
            return NULL;
        }

        /*
        * Add the entry to the Huffman tree. The value
        * of the current bit is used switch between
        * zero and one child nodes in the tree. New nodes
        * are added as needed in the tree.
        */
        for(curbit = 0; curbit < numbits; ++curbit)//read the code bit by bit and build the path from the root to the leaf
        {
            if(get_bit(bytes, curbit))//the current bit is 1
            {
                if(p->one == NULL)
                {
                    p->one = curbit == (unsigned char)(numbits - 1)//is this the last bit of the code?
                        ? new_leaf_node(symbol)//yes: create the leaf and end the path
                        : new_nonleaf_node(0, NULL, NULL);//no: create an internal node and continue
                    p->one->parent = p;//the current node becomes the parent
                }
                p = p->one;//move to the one child
            }
            else
            {
                if(p->zero == NULL)//the current bit is 0: same procedure on the zero side
                {
                    p->zero = curbit == (unsigned char)(numbits - 1)
                        ? new_leaf_node(symbol)
                        : new_nonleaf_node(0, NULL, NULL);
                    p->zero->parent = p;
                }
                p = p->zero;
            }
        }

        free(bytes);//one branch of the tree is complete: release the buffer that held this code
    }

    return root;//every code has been added, so the tree is complete; return its root
}

4.2 Read the Huffman codes and write the decoded output

int//read the Huffman codes and decode them with the Huffman tree
huffman_decode_file(FILE *in, FILE *out)
{
    huffman_node *root, *p;//root is the root node, p the current node
    int c;//holds the byte just read from the file
    unsigned int data_count;//number of symbols still to decode

    /* Read the Huffman code table. */
    //read the code table and rebuild the Huffman tree
    root = read_code_table(in, &data_count);
    if(!root)
        return 1;//no root: the tree could not be rebuilt

    /* Decode the file. */
    p = root;//start walking the tree at the root
    //data_count > 0: symbols remain to be decoded; fgetc != EOF: the file still has data
    while(data_count > 0 && (c = fgetc(in)) != EOF)
    {
        unsigned char byte = (unsigned char)c;//one byte of encoded data
        unsigned char mask = 1;//ANDed with the byte to pick out one bit at a time
        while(data_count > 0 && mask)
        {
            p = byte & mask ? p->one : p->zero;//bit is 1: follow the one child; bit is 0: follow the zero child
            mask <<= 1;//shift the mask left; it becomes 0 after the eighth bit

            if(p->isLeaf)//at a leaf: output its symbol and go back to the root
            {
                fputc(p->symbol, out);
                p = root;
                --data_count;
            }
        }
    }

    free_huffman_tree(root);//the whole file is decoded: free the Huffman tree
    return 0;
}
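
The role of the symbol count in the decoding loop is easiest to see on a tiny example. The sketch below decodes from a memory buffer instead of a file; the dnode type and the decode_bits function are hypothetical and only mirror the mask-based loop above. The NULL check is an extra safeguard added for the sketch.

```c
#include <stddef.h>

/* Hypothetical minimal node for illustration; the real code uses huffman_node. */
typedef struct dnode {
    int is_leaf;
    unsigned char symbol;
    struct dnode *zero, *one;
} dnode;

/* Decode up to `count` symbols from `in` (codes packed low bit first)
 * into `out`. Returns the number of symbols actually decoded. */
static size_t decode_bits(const dnode *root, const unsigned char *in,
                          size_t in_len, unsigned char *out, size_t count)
{
    const dnode *p = root;
    size_t produced = 0, i;

    for (i = 0; i < in_len && produced < count; ++i) {
        unsigned char mask;

        /* mask visits bits 0..7 and becomes 0 after the eighth shift */
        for (mask = 1; mask && produced < count; mask <<= 1) {
            p = (in[i] & mask) ? p->one : p->zero;
            if (p == NULL)
                return produced;    /* bit pattern not in the tree */
            if (p->is_leaf) {
                out[produced++] = p->symbol;
                p = root;
            }
        }
    }
    return produced;
}
```

With the codes a = 0, b = 10 and c = 11, the text abca is encoded in the single byte 0x1A, whose two highest bits are padding. Asking for 4 symbols returns abca; asking for 6 also decodes the two padding bits as extra a symbols, which is why the stored symbol count is needed to know where to stop.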
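
5. Add an output table containing the source symbols and each symbol's frequency, code length and code

5.1 Define the output table data structure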

typedef struct huffman_statistics_result//statistics of the result
{
    float freq[256];//relative frequency of each of the 256 byte values
    unsigned long numbits[256];//code length
    unsigned char bits[256][100];//assumes that no code is longer than 100
}huffman_stat;

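The source symbol is given by the array index.

5.2 Add the corresponding functions

int huffST_getSymFrequencies(SymbolFrequencies *SF, huffman_stat *st,int total_count)//compute the frequency of each symbol

int huffST_getcodeword(SymbolEncoder *se, huffman_stat *st)//collect the codes

void output_huffman_statistics(huffman_stat *st,FILE *out_Table)//write the statistics

5.3 Add the corresponding parameters

int huffman_encode_file(FILE *in, FILE *out, FILE *out_Table)

A variable int total_count is added inside the encoding function, and the command-line option -t specifies the output table file.

case 't'://output the coding information table
file_out_table = optarg;//
break;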

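IV. Analysis of Experimental Results

1. Ten files of different formats are encoded

The output tables are shown in the following figure:

[Figure: output tables]

2. Compression efficiency analysis

The source entropy can be computed from the second column of the output table (symbol frequency), and the average code length after encoding from the second and third columns (frequency and code length).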
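
The average code length and a rough compression ratio can be computed from the table with a few lines of code. The sketch below is one possible way to do it and is not part of the experiment's code: the function names are assumptions, raw counts are used instead of relative frequencies, and the size of the stored code table is ignored.

```c
#include <stddef.h>

/* Average code length in bits per symbol, weighted by how often each
 * symbol occurs. count[i] and numbits[i] describe symbol i. */
static double average_code_length(const unsigned long *count,
                                  const unsigned long *numbits, size_t n)
{
    unsigned long total = 0;
    double bits = 0.0;
    size_t i;

    for (i = 0; i < n; ++i) {
        total += count[i];
        bits += (double)count[i] * (double)numbits[i];
    }
    return total ? bits / (double)total : 0.0;
}

/* Compression ratio relative to the 8 bits per symbol of the input,
 * ignoring the size of the stored code table. */
static double compression_ratio(double avg_bits)
{
    return avg_bits > 0.0 ? 8.0 / avg_bits : 0.0;
}
```

For counts 2, 1, 1 with code lengths 1, 2, 2 the average is 6 bits over 4 symbols, that is 1.5 bits per symbol, which corresponds to a ratio of about 5.33 under the assumptions above.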

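(1) The calculations are summarized in the following table:

(2) The probability distributions of the sample files are shown in the following figure:

V. Conclusions

Comparing the table in (1) with the plots in (2), the jtp file achieves the highest compression ratio. Its symbol distribution is highly non-uniform: a few symbols occur with high probability and receive short codes.

The lrc file achieves the lowest compression ratio, because its symbols occur with almost equal probability, so the average code length is large.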
